Searching for a word in a text

It's like looking for a needle in a haystack.

How many times
does the word algorithm appear in this chapter?

Searching for a word in a text is a daily activity for most of us. In a computational context, this problem is known as substring searching, or string matching, and can be formulated as follows:

Find all the occurrences of a given array  a  in a given array  b.

We shall assume that the elements of the arrays are bytes, though the problem makes sense for arrays of any other kind. We shall not treat the arrays as strings, and therefore the presence of a final null byte is irrelevant.

Em many applications (and in the examples below), the elements of a and b represent ASCII characters and therefore each byte belongs to the set 0 . . 127.  In other applications, the elements of a and b represent Unicode characters in UTF-8 encoding (in which case each character may correspond to more than one byte).  In general, however, there are no restrictions on the elements of a and b: each element is a byte with value between 0 and 255.

We say that the array a is a word (even if it does not represent a word in English) and the array b is a text.  The problem consists, then, of finding the occurrences of a word in a text.

Looking for a software virus in a digital file, for example, is, essentially, the problem of searching for a word (the virus) in a text (the file).  In the case of genetic code processing, for example, the elements of the arrays are the letters A, C, G, and T.  Here is an example of an occurrence of the word TACTA in the text GTAG…TAG:

G T A G T A T A T A T A T A T A C T A C T A G T A G
T A C T A

Table of contents:

Terminology and design decisions

We have already made the design decision of dealing only with arrays of bytes. Here is another design decision: the word and the text will be indexed starting from 1 (rather than 0) because this will slightly simplify the expressions that specify segments of the arrays.  Finally, we shall restrict ourselves to the simplified version of the problem that asks for just one number:

Find the number of occurrences of a word a[1 . . m] in a text b[1 . . n].

If m > n, the number of occurrences is zero. To make sure that the number of occurrences is finite, we assume m ≥ 1.

We shall use the following terminology when discussing the problem:

Note that two occurrences of a in b may overlap. For example, the two occurrences of BABA in XBABABAX overlap.

Examples.  In the following examples, the sign ↓ indicates the positions  k  at which the array a (lower line) matches a suffix of the array b[1 . . k] (upper line):

U m   v e t o r   a   o c o r r e   e m   b   s e
o c o r r e   e  
3 1 3 1 4 3 1 4 1 3 1 4 1 5 9 3 1 4 1 5 9 2 6 3 1 4
3 1 4 1 5 9
G T A G T A T A T A T A T A T A C T A C T A G T A G
T A C T A

Scan directions.  Any algorithm that looks for a word in a text must scan the text.  To look for occurrences of a word a in a text b, we could scan b from left to right or from right to left.  The two alternatives are equivalent, but we shall always use the first: compare a to b[1 . . m], then to b[2 . . m+1], and so on.

For a fixed k, the element-by-element comparison of a[1 . . m] with a suffix of b[1 . . k] could be done from left to right or from right to left. The two alternatives are equivalent, but one of the algorithms we shall study further down requires that the direction of comparison be opposite to the direction in which the text is scanned. Therefore, the element-by-element comparison will be always done from right to left: first a[m] with b[k], then a[m−1] with b[k−1], and so on.

Exercises 1

  1. How many times the word  AAA  occurs in the text  AAAAA?
  2. Which byte values represent the characters  A, C, G, and T?
  3. Discuss (vaguely, in general terms) the following statement: any algorithm for the simplified version of the problem can be modified to solve the more general version.

The naive algorithm

The following function solves the problem in the most obvious and direct manner. It patiently tries to match a[1..m] with b[1..m], then with b[2..m+1], and so on:

typedef unsigned char byte;

// Receives arrays a[1..m] and b[1..n], with
// m >= 1 and n >= 0, and returns the number
// of occurrences of a in b.

int 
naive (byte a[], int m, byte b[], int n) 
{
   int occurs = 0;
   for (int k = m; k <= n; ++k) {
      // does a[1..m] match b[k-m+1..k]?
      int i = m, j = k;
      while (i >= 1 && a[i] == b[j]) 
         --i, --j;   
      if (i < 1) ++occurs;
   }
   return occurs;
}

We can imagine that, in each iteration, the word  a  slides from left to right along the text  b,  as in the following example:

X C B A B X C B A A X B C B A B X
B C B A
B C B A
B C B A
B C B A
B C B A
etc.

In the worst case, the naive function compares each element of a[1..m] with each element of b[1..n] and therefore consumes time proportional to

m n .

Is it possible to solve the problem without comparing each element of a to each element of b?

Exercises 2

  1. Does the naive algorithm work correctly when m > n?  What happens if we try to execute the algorithm with argument m equal to 0?
  2. Give an example in which the naive function does the greatest possible number of comparisons between elements of a and b. What is this number exactly?
  3. Solution of the original problem.  Modify the naive function to solve the original version of the search problem: for each occurrence of a in b, print the index j for which a[1..m] matches b[j..m+j-1].
  4. Study the documentation of the function strstr in the string library. Try to find out what is the the algorithm that strstr implements.
  5. Consecutive spaces.  Write a function that will receive an array b[1..n] of bytes and an integer m and return the position of the first occurrence of m consecutive spaces (bytes 32) in b.  (You may assume that the elements of b represent ASCII characters, but this is irrelevant.)  Try to examine the least possible number of elements of b. Write a program to test your function.

First Boyer-Moore algorithm

The alphabet associated with an instance of our problem is any set of byte values that contains all the elements of the arrays a and b.  Given our conventions, 0..255 is an alphabet of any instance of the problem. But some instances can have smaller alphabets, such as 0..127 in the case of ASCII characters, or as  65 67 71 84  in the case of a genetics application.

R.S. Boyer and J.S. Moore had the ingenious ideia of using an auxiliary table, indexed by the alphabet, to accelerate the naive algorithm.  Suppose that we have already compared a[1..m] with a suffix of b[1..k]. Now, we can skip some iterations of the naive algorithm and start to compare a[1..m] with a suffix of

b[1..k+d]

for some positive d.  The value of d is chosen so that the position k+1 in b will be paired up with the last occurrence (counting from left to right) of the byte b[k+1] in a[1..m].  In the example below, the sign | indicates the positions that play the role of k+d.  When there is a match, | is replaced by ↓ :

| | | | |
X C B A B X C B A A X B C B A B X
B C B A
B C B A
B C B A
B C B A
B C B A
B C B A

To implement the ideia, we can preprocess the word  a  and remember, for each letter f of the alphabet, ⚠ the position of the last occurrence of f in a.  This position will be denoted by lst[f].  If the alphabet of the previous example is the set of 128 ASCII characters, we shall have

    f     ... = > ? @ A B C D E F G ...
lst[f]    ... 0 0 0 0 4 3 2 0 0 0 0 ...
1 2 3 4
B C B A

Consider the following implementation of the algorithm.  The first iterative process does the preprocessing of the word and the second counts the occurrences of the word in the text.

typedef unsigned char byte;

// Receives arrays a[1..m] and b[1..n] of
// bytes, with m >= 1 and n >= 0, and
// returns the number of occurrences of
// a in b.

int 
boyermoore1 (byte a[], int m,
             byte b[], int n)
{
   int lst[256]; // the alphabet is 0..255

   // preprocessing the word a
   for (int f = 0; f < 256; ++f) lst[f] = 0;
   for (int i = 1; i <=  m; ++i) lst[a[i]] = i;

   // search word a in text b
   int occurs = 0;
   int k = m;
   while (k <= n) {
      // does a[1..m] match b[k-m+1..k]?
      int i = m, j = k;
      while (i >= 1 && a[i] == b[j]) 
         --i, --j;   
      if (i < 1) ++occurs;
      if (k == n) k += 1;
      else k += m - lst[b[k+1]] + 1;
   }
   return occurs;
}

This is the first Boyer-Moore algorithm.

Exercises 3

  1. Answer fast: does the following code fragment work correctly?
    for (byte f = 0; f < 256; ++f) lst[f] = 0;
    
  2. Check that, after preprocessing the word, we have 1lst[f]f for each f.
  3. ★ What happens on the line k += m - lst[b[k+1]] + 1 if the byte b[k+1] does not occur in a[1..m]?  What happens if b[k+1] is equal to a[m]?
  4. Prove that the preprocessing phase in function boyermoore1 correctly fills the table lst.
  5. Is the following version of the code correct? It is supposed to search for a in b.
    occurs = 0; k = m;
    while (k <= n) {
       i = m, j = k;
       while (i >= 1 && a[i] == b[j]) --i, --j;   
       if (i < 1) ++occurs;
       kk = k+1;                                 
       while (kk <= n && lst[b[kk]] == 0) ++kk;  
       if (kk > n) break;                       
       k += m - lst[b[kk]] + kk - k;
    }
    return occurs;
    
  6. In the function boyermoore1, we could introduce an appropriate sentinel in position b[n+1] and then delete if (k == n) k += 1. Write this version of the code.
  7. ★ Show that the following version of function boyermoore1 is correct:
    int lst[256];
    for (int i = 0; i < 256; ++i) lst[i] = 0;
    for (int i = 1; i <   m; ++i) lst[a[i]] = i;
    int occurs = 0, k = m;
    while (k <= n) {
       int i = m, j = k;
       while (i >= 1 && a[i] == b[j]) 
          --i, --j;   
       if (i < 1) ++occurs;
       k += m - lst[b[k]];
    }
    return occurs;
    

Second Boyer-Moore algorithm

The second Boyer-Moore algorithm, unlike the first, does not need to know the alphabet of word a and text b explicitly in advance. But it has to compare the word with the text from right to left, i.e., first a[m] with b[k], then a[m-1] with b[k-1], and so on.

Imagine that the word a[1..m] is  &CBA*CBA.  Suppose we just discovered that a[1..m] does not match a suffix of b[1..k] and we are getting ready to check whether a[1..m] matches a suffix of b[1..k+1].  Now compare the word a to itself and observe that a[h..m] matches ⚠

neither a[h-1..m-1],  nor a[h-2..m-2],  nor a[h-3..m-3].

In our example, m is 8, h is 6 and  CBA  matches neither  *CB  nor  A*C  nor  BA*:

  1   h   m
  & C B A * C B A
        & C B A * C B A  

It is easy to deduce now, without any additional comparisons, that a[1..m] does not match any suffix

of b[1..k+1],  nor of b[1..k+2],  nor of b[1..k+3]

(draw a figure!).  Therefore, the next step must try matching a[1..m] with a suffix of b[1..k+4].  As part of this attempt, we must compare a[h-4..m-4] with b[k-m+h..k]. But this is the same as checking whether a[h-4..m-4] matches a[h..m], since, by assumption, b[k-m+h..k] is equal to a[h..m].

To complete the illustration, suppose that a[1..m] matches a suffix of b[1..k+4].  If h-4 ≥ 1 then a[h..m] matches a suffix of a[1..m-4].

  k
- - - - - - C B A * C B A - - - - -
 
  1   h   m
  & C B A * C B A
          & C B A * C B A

However, if h-4 < 1 then a[1..m-4] matches a suffix of a[h..m]. To illustrate this alternative, suppose the word is  BA*CBA:

  k
- - - - C B A * C B A - - - - - - -
 
  1   h   m
  B A * C B A
          B A * C B A

This example suggests that the implementation of the ideia must begin with the following preprocessing of the word a:  for each h in 1..m, compute the greatest k in 1..m-1 such that ⚠

We shall say that this largest value of k is the jump of h.  Here are some examples of words with the corresponding jump table:

1 2 3 4 5 6        h   6 5 4 3 2 1
C A A B A A   jump[h]  5 3 0 0 0 0
1 2 3 4 5 6 7 8        h   8 7 6 5 4 3 2 1
B A - B A * B A   jump[h]  5 5 2 2 2 2 2 2
1 2 3 4 5 6 7 8 9 10 11        h   11 10 9 8 7 6 5 4 3 2 1
B A - B A * B A * B A   jump[h]  18 18 8 8 8 2 2 2 2 2 2

After these preparations, we can examine the implementation of the second Boyer-Moore algorithm:

typedef unsigned char byte;

// Receives a word a[1..m] with 1 <= m and
// a text b[1..n]. Returns the number of
// occurrences of a in b.

int 
boyermoore2 (byte a[], int m,
             byte b[], int n) 
{
   int *jump = malloc ((m+1) * sizeof (int));
   // our table will be jump[1..m]

   // preprocessing of word a
   int h = m, k = m-1;
   while (h >= 1 && k >= 1) {
      int i = m, j = k; 
      while (i >= h && j >= 1)
         if (a[i] == a[j]) --i, --j;
         else i = m, j = --k;
      jump[h--] = k;
   }
   while (h >= 1)
      jump[h--] = k;

   // search for word a in text b
   int occurs = 0;
   k = m;
   while (k <= n) {
      int i = m, j = k;
      while (i >= 1 && a[i] == b[j]) 
         --i, --j;   
      if (i < 1) ++occurs;
      if (i == m) k += 1;
      else k += m - jump[i+1];
   }
   return occurs;
}  

And here is a more compact and efficient version of the preprocessing:

// preprocessing of word a
h = m, k = m-1;
i = m, j = k;
while (h >= 1) {
   while (i >= h && j >= 1)
      if (a[i] == a[j]) --i, --j;
      else i = m, j = --k;
   jump[h--] = k;
}

Exercises 4

  1. Show that the preprocessing phase in function boyermoore2 correctly fills the jump table.  Show that this phase consumes m2 units of time in the worst case.
  2. We could delete the if (i == m) k += 1 if we were to introduce an appropriate sentinel in position jump[m+1]. Write this version of boyermoore2.
  3. Show that the following version of the preprocessing is correct:
    i = m;
    j = k = m-1;
    for (h = m; h >= 1; --h) {
       while (i >= h && j >= 1)
          if (a[i] == a[j]) --i, --j;
          else i = m, j = --k;
       jump[h] = k;
    }
    
  4. Show that the following version of the preprocessing is correct:
    k = m-1;
    r = 0;
    for (h = m; h >= 1; --h) {
       while (m-r >= h && k-r >= 1)
          if (a[m-r] == a[k-r]) ++r;
          else r = 0, k--;
       jump[h] = k;
    }
    
  5. Use the function boyermoore2 to count the number of occurrences of the word  A B C B C C A B C  in the text  A B C C B A A B C A B C B C C A B C.

Third Boyer-Moore algorithm

The third Boyer-Moore algorithm is just the fusion of the two previous ones. Each step of the algorithm chooses the larger of the two displacements: the one prescribed by the table lst and the one given by the table jump.  (Actually, this is the Boyer-Moore algorithm proper: the distinction between the first and second algorithms was done above for expository convenience only.)

The preprocessing consumes m2 units of time. Unfortunately, the search phase consumes  m n  units of time in the worst case, just as the naive algorithm.  But the worst case is so rare that in the average case, typical of practical applications, the third Boyer-Moore algorithm consumes time proportional to  n  and independent of m.  In other words, on average, each element of the text is compared with only a few elements of the word, regardless of the length of the word.

The definition of the jump table can be perfected so that even in the worst case the search phase of the third algorithm consumes only 6n units of time.

Exercises 5

  1. Implement the third Boyer-Moore algorithm.
  2. Performace tests.  Compare, experimentally, the performance of the third Boyer-Moore algorithm with that of the naive algorithm.  Invent interesting word/text pairs for the tests.  To do the tests, you can use a file of common English words (see, for example, the file 1-1000.txt copied from https://gist.github.com/deekayen/4148741#file-1-1000-txt). To stress-test your program, run it on a larger file of English words. (See, for example the file of 466K words in English.) You can also use the book The Tale of Two Cities by Charles Dickens, as Sedgewick and Wayne do.
  3. Challenge.  Investigate the changes that must be done to the definition of the jump table so that the third Boyer-Moore algorithm does at most 6n comparisons between bytes of the word and the text in the search phase.
  4. Write a function that receives arrays b[1..n], a[1..m], x[1..p] and replaces by x each occurrence of a in b.
  5. Wildcard searching.  Suppose that the character # has a special meaning inside any word: it represents 0 or more occurrences of any other character (that is, # is a wildcard). Examples:
    • the word A#B#C matches any segment of the text that begins with A, ends with C, and has a B somewhere between A and C;
    • the word  x#[#]#=#;  matches  x[i] = x[i-1] + 1;  as well as  x2[i]=x[i-1]+1; y= z;.

    Write a function to search for a word in a text interpreting the character # as described above.

  6. Grep.  Study the powerful grep utility for searching words in files.