It's like looking for a needle in a haystack.

How many times does the word algorithm appear in this chapter?
Searching for a word in a text is a daily activity for most of us. In a computational context, this problem is known as substring searching, or string matching, and can be formulated as follows:
Find all the occurrences of a given array a in a given array b.
We shall assume that the elements of the arrays are bytes, though the problem makes sense for arrays of any other kind. We shall not treat the arrays as strings, and therefore the presence of a final null byte is irrelevant.
In many applications (and in the examples below), the elements of a and b represent ASCII characters, and therefore each byte belongs to the set 0..127. In other applications, the elements of a and b represent Unicode characters in UTF-8 encoding (in which case each character may correspond to more than one byte). In general, however, there are no restrictions on the elements of a and b: each element is a byte with a value between 0 and 255.
We say that the array a is a word (even if it does not represent a word in English) and the array b is a text. The problem consists, then, of finding the occurrences of a word in a text.
Looking for a software virus in a digital file, for example, is, essentially, the problem of searching for a word (the virus) in a text (the file). In the case of genetic code processing, for example, the elements of the arrays are the letters A, C, G, and T. Here is an example of an occurrence of the word TACTA in the text GTAG…TAG:
```
G T A G T A T A T A T A T A T A C T A C T A G T A G
                            T A C T A
```
We have already made the design decision of dealing only with arrays of bytes. Here is another design decision: the word and the text will be indexed starting from 1 (rather than 0) because this will slightly simplify the expressions that specify segments of the arrays. Finally, we shall restrict ourselves to the simplified version of the problem that asks for just one number:
Find the number of occurrences of a word a[1 . . m] in a text b[1 . . n].
If m > n, the number of occurrences is zero. To make sure that the number of occurrences is finite, we assume m ≥ 1.
We shall use the following terminology when discussing the problem:
"a[1..m] matches a suffix of b[1..k]" will be used as a shorthand for "a[1..m] matches b[k−m+1..k]".
Note that two occurrences of a in b may overlap. For example, the two occurrences of BABA in XBABABAX overlap.
Examples. In the following examples, the sign ↓ indicates the positions k at which the array a (lower line) matches a suffix of the array b[1 . . k] (upper line):
```
                            ↓
U m v e t o r a o c o r r e e m b s e
                o c o r r e e
```

```
                            ↓           ↓
3 1 3 1 4 3 1 4 1 3 1 4 1 5 9 3 1 4 1 5 9 2 6 3 1 4
                  3 1 4 1 5 9
```

```
                                    ↓     ↓
G T A G T A T A T A T A T A T A C T A C T A G T A G
                            T A C T A
```
Scan directions. Any algorithm that looks for a word in a text must scan the text. To look for occurrences of a word a in a text b, we could scan b from left to right or from right to left. The two alternatives are equivalent, but we shall always use the first: compare a to b[1 . . m], then to b[2 . . m+1], and so on.
For a fixed k, the element-by-element comparison of a[1 . . m] with a suffix of b[1 . . k] could be done from left to right or from right to left. The two alternatives are equivalent, but one of the algorithms we shall study further down requires that the direction of comparison be opposite to the direction in which the text is scanned. Therefore, the element-by-element comparison will be always done from right to left: first a[m] with b[k], then a[m−1] with b[k−1], and so on.
The following function solves the problem in the most obvious and direct manner. It patiently tries to match a[1..m] with b[1..m], then with b[2..m+1], and so on:
```c
typedef unsigned char byte;

// Receives arrays a[1..m] and b[1..n], with
// m >= 1 and n >= 0, and returns the number
// of occurrences of a in b.
int naive (byte a[], int m, byte b[], int n)
{
   int occurs = 0;
   for (int k = m; k <= n; ++k) {
      // does a[1..m] match b[k-m+1..k]?
      int i = m, j = k;
      while (i >= 1 && a[i] == b[j])
         --i, --j;
      if (i < 1) ++occurs;
   }
   return occurs;
}
```
We can imagine that, in each iteration, the word a slides from left to right along the text b, as in the following example:
```
X C B A B X C B A A X B C B A B X
B C B A
  B C B A
    B C B A
      B C B A
        B C B A
etc.
```
In the worst case, the naive function compares each element of a[1..m] with each element of b[1..n], and therefore consumes time proportional to mn.
Is it possible to solve the problem without comparing each element of a to each element of b?
The alphabet associated with an instance of our problem is any set of byte values that contains all the elements of the arrays a and b. Given our conventions, 0..255 is an alphabet of any instance of the problem. But some instances can have smaller alphabets, such as 0..127 in the case of ASCII characters, or the set {65, 67, 71, 84} (the ASCII codes of A, C, G, and T) in the case of a genetics application.
R.S. Boyer and J.S. Moore had the ingenious idea of using an auxiliary table, indexed by the alphabet, to accelerate the naive algorithm. Suppose that we have already compared a[1..m] with a suffix of b[1..k]. Now, we can skip some iterations of the naive algorithm and start to compare a[1..m] with a suffix of
b[1..k+d]
for some positive d. The value of d is chosen so that the position k+1 in b will be paired up with the last occurrence (counting from left to right) of the byte b[k+1] in a[1..m]. In the example below, the sign | indicates the positions that play the role of k+d. When there is a match, | is replaced by ↓ :
```
          |     | |         ↓   |
X C B A B X C B A A X B C B A B X
B C B A
    B C B A
          B C B A
            B C B A
                      B C B A
                          B C B A
```
To implement the idea, we can preprocess the word a and remember, for each letter f of the alphabet, the position of the last occurrence of f in a. This position will be denoted by lst[f].
If the alphabet of the previous example is the set of 128 ASCII characters, we shall have

```
f       ... = > ? @ A B C D E F G ...
lst[f]  ... 0 0 0 0 4 3 2 0 0 0 0 ...
```

since the word a is

```
1 2 3 4
B C B A
```
Consider the following implementation of the algorithm. The first iterative process does the preprocessing of the word and the second counts the occurrences of the word in the text.
```c
typedef unsigned char byte;

// Receives arrays a[1..m] and b[1..n] of
// bytes, with m >= 1 and n >= 0, and
// returns the number of occurrences of
// a in b.
int boyermoore1 (byte a[], int m, byte b[], int n)
{
   int lst[256]; // the alphabet is 0..255
   // preprocessing the word a
   for (int f = 0; f < 256; ++f)
      lst[f] = 0;
   for (int i = 1; i <= m; ++i)
      lst[a[i]] = i;
   // search word a in text b
   int occurs = 0;
   int k = m;
   while (k <= n) {
      // does a[1..m] match b[k-m+1..k]?
      int i = m, j = k;
      while (i >= 1 && a[i] == b[j])
         --i, --j;
      if (i < 1) ++occurs;
      if (k == n) k += 1;
      else k += m - lst[b[k+1]] + 1;
   }
   return occurs;
}
```
This is the first Boyer-Moore algorithm.
What would happen if the initialization of the table lst were written as follows?

```c
for (byte f = 0; f < 256; ++f) lst[f] = 0;
```
In boyermoore1, what happens in the line k += m - lst[b[k+1]] + 1 if the byte b[k+1] does not occur in a[1..m]? What happens if b[k+1] is equal to a[m]?
Consider the following variant of the search phase of boyermoore1, which explicitly skips over the bytes of the text that do not occur in the word. Is it correct?

```c
occurs = 0;
k = m;
while (k <= n) {
   i = m, j = k;
   while (i >= 1 && a[i] == b[j])
      --i, --j;
   if (i < 1) ++occurs;
   kk = k+1;
   while (kk <= n && lst[b[kk]] == 0)
      ++kk;
   if (kk > n) break;
   k += m - lst[b[kk]] + kk - k;
}
return occurs;
```
Show how to rewrite boyermoore1 so as to dispense with the special case if (k == n) k += 1. Write this version of the code.
Show that the following version of the function is also correct, even though the loop that builds lst ignores the last element of the word and the displacement depends on b[k] rather than b[k+1]. (Note that the special case k == n disappears. This variant is essentially Horspool's simplification of the Boyer-Moore algorithm.)

```c
int lst[256];
for (int i = 0; i < 256; ++i) lst[i] = 0;
for (int i = 1; i < m; ++i) lst[a[i]] = i;
int occurs = 0, k = m;
while (k <= n) {
   int i = m, j = k;
   while (i >= 1 && a[i] == b[j])
      --i, --j;
   if (i < 1) ++occurs;
   k += m - lst[b[k]];
}
return occurs;
```
The second Boyer-Moore algorithm, unlike the first, does not need to know the alphabet of word a and text b explicitly in advance. But it has to compare the word with the text from right to left, i.e., first a[m] with b[k], then a[m-1] with b[k-1], and so on.
Imagine that the word a[1..m] is &CBA*CBA. Suppose we just discovered that a[1..m] does not match a suffix of b[1..k] and we are getting ready to check whether a[1..m] matches a suffix of b[1..k+1]. Now compare the word a to itself and observe that a[h..m] matches neither a[h-1..m-1], nor a[h-2..m-2], nor a[h-3..m-3].
In our example, m is 8, h is 6 and CBA matches neither *CB nor A*C nor BA*:
```
1         h   m
& C B A * C B A
        & C B A * C B A
```
It is easy to deduce now, without any additional comparisons, that a[1..m] does not match any suffix
of b[1..k+1], nor of b[1..k+2], nor of b[1..k+3]
(draw a figure!). Therefore, the next step must try matching a[1..m] with a suffix of b[1..k+4]. As part of this attempt, we must compare a[h-4..m-4] with b[k-m+h..k]. But this is the same as checking whether a[h-4..m-4] matches a[h..m], since, by assumption, b[k-m+h..k] is equal to a[h..m].
To complete the illustration, suppose that a[1..m] matches a suffix of b[1..k+4]. If h-4 ≥ 1 then a[h..m] matches a suffix of a[1..m-4].
```
                k
- - - - - - C B A * C B A - - - - -
  1         h   m
  & C B A * C B A
          & C B A * C B A
```
However, if h-4 < 1 then a[1..m-4] matches a suffix of a[h..m]. To illustrate this alternative, suppose the word is BA*CBA :
```
            k
- - - - C B A * C B A - - - - - - -
  1     h   m
  B A * C B A
          B A * C B A
```
This example suggests that the implementation of the idea must begin with the following preprocessing of the word a: for each h in 1..m, compute the greatest k in 1..m-1 such that a[h..m] matches a suffix of a[1..k] or a[1..k] matches a suffix of a[h..m]. We shall say that this largest value of k is the jump of h (if no such k exists, take the jump of h to be 0). Here are some examples of words with the corresponding jump table:
```
1 2 3 4 5 6     h        6 5 4 3 2 1
C A A B A A     jump[h]  5 3 0 0 0 0
```

```
1 2 3 4 5 6 7 8     h        8 7 6 5 4 3 2 1
B A - B A * B A     jump[h]  5 5 2 2 2 2 2 2
```

```
1 2 3 4 5 6 7 8 9 10 11     h        11 10 9 8 7 6 5 4 3 2 1
B A - B A * B A *  B  A     jump[h]   8  8 8 8 8 2 2 2 2 2 2
```
After these preparations, we can examine the implementation of the second Boyer-Moore algorithm:
```c
#include <stdlib.h>

typedef unsigned char byte;

// Receives a word a[1..m] with 1 <= m and
// a text b[1..n]. Returns the number of
// occurrences of a in b.
int boyermoore2 (byte a[], int m, byte b[], int n)
{
   int *jump = malloc ((m+1) * sizeof (int));
   // our table will be jump[1..m]
   // preprocessing of word a
   int h = m, k = m-1;
   while (h >= 1 && k >= 1) {
      int i = m, j = k;
      while (i >= h && j >= 1)
         if (a[i] == a[j]) --i, --j;
         else i = m, j = --k;
      jump[h--] = k;
   }
   while (h >= 1)
      jump[h--] = k;
   // search for word a in text b
   int occurs = 0;
   k = m;
   while (k <= n) {
      int i = m, j = k;
      while (i >= 1 && a[i] == b[j])
         --i, --j;
      if (i < 1) ++occurs;
      if (i == m) k += 1;
      else k += m - jump[i+1];
   }
   free (jump); // release the auxiliary table
   return occurs;
}
```
And here is a more compact and efficient version of the preprocessing:

```c
// preprocessing of word a
h = m, k = m-1;
i = m, j = k;
while (h >= 1) {
   while (i >= h && j >= 1)
      if (a[i] == a[j]) --i, --j;
      else i = m, j = --k;
   jump[h--] = k;
}
```
We could dispense with the test if (i == m) k += 1 if we were to introduce an appropriate sentinel in position jump[m+1] of the table. Write this version of boyermoore2.
The preprocessing can also be written with a for loop:

```c
i = m;
j = k = m-1;
for (h = m; h >= 1; --h) {
   while (i >= h && j >= 1)
      if (a[i] == a[j]) --i, --j;
      else i = m, j = --k;
   jump[h] = k;
}
```
And here is a version that, instead of the indices i and j, maintains a counter r of the number of positions already matched:

```c
k = m-1;
r = 0;
for (h = m; h >= 1; --h) {
   while (m-r >= h && k-r >= 1)
      if (a[m-r] == a[k-r]) ++r;
      else r = 0, k--;
   jump[h] = k;
}
```
The third Boyer-Moore algorithm is just the fusion of the two previous ones. Each step of the algorithm chooses the larger of the two displacements: the one prescribed by the table lst and the one given by the table jump. (Actually, this is the Boyer-Moore algorithm proper: the distinction between the first and second algorithms was made above for expository convenience only.)
The preprocessing consumes time proportional to m². Unfortunately, the search phase consumes time proportional to mn in the worst case, just like the naive algorithm. But the worst case is so rare that in the average case, typical of practical applications, the third Boyer-Moore algorithm consumes time proportional to n and independent of m. In other words, on average, each element of the text is compared with only a few elements of the word, regardless of the length of the word.
The definition of the jump table can be perfected so that even in the worst case the search phase of the third algorithm consumes only 6n units of time.
Write a function to search for a word in a text interpreting the character # as described above.