### The Boyer-Moore-Horspool Algorithm

• Introduction

• Boyer-Moore:

• The (complete) Boyer-Moore algorithm uses two heuristics in order to determine the shift distance of the pattern in case of a mismatch:

 • the bad-character heuristic
 • the good-suffix heuristic

• The good-suffix heuristic is rather complicated to implement

• Many different researchers have proposed simpler algorithms that are based only on the bad-character heuristic.

• Examples:

• Performance comparisons between some matching algorithms:

These experiments show that:

 The good suffix heuristic has little impact on the performance.

Note:

 These results are valid for natural languages (which use a relatively large alphabet).
 They may not hold for small alphabets, such as DNA (alphabet used: {A, C, T, G}).

• The idea for the Horspool algorithm

• Horspool (1980) presented a simplification of the Boyer-Moore algorithm, and based on empirical results showed that this simpler version is as good as the original Boyer-Moore algorithm (i.e., including the good suffix heuristic).

• Horspool noted that:

 If there is a mismatch, any one of the characters of the text T that is lined up with the suffix of the pattern can be used to perform the bad-character heuristic (= the shift).
 (The Boyer-Moore algorithm always uses the mismatched character.)

• Example:

• The original Boyer-Moore bad character heuristic:

```
    0123456789012345678901234
        i0=4   i0+j
        |      |
        v      v
T = abaaabbababcabdacbaabababc
P =     abdacabaabd
               ^
               |
               j=7

Find right-most occurrence of 'c' and align:

    0123456789012345678901234
           i0=7      i0+j
           |         |
           v         v
T = abaaabbababcabdacbaabababc
P =        abdacabaabd
               ^     ^
               |     |
               |     j=10 (Restart matching from the end)
               |
               lined up (we do not miss any match)
```

• You can also use the last character to perform the shift:

```
    0123456789012345678901234
        i0=4   i0+j
        |      |  +---- use THIS char to find slide distance
        v      v  v
T = abaaabbababcabdacbaabababc
P =     abdacabaabd
          ^    ^
          |    |
          |    j=7
          |
          "Right-most" occurrence of 'd' (not including the last character)

Align:

    0123456789012345678901234
                i0=12
                |         +---- use THIS char to find slide distance
                v         v
T = abaaabbababcabdacbaabababc
P =             abdacabaabd
                          ^
                          |
                          j=10 (Restart matching from the end)
```

Notice that:

• We did not miss any match, because any matching alignment must line up a `d' of the pattern with this `d' in the text

Proof:

• The pattern contains a `d'

• While sliding from the last mismatch to the current alignment, no `d' of the pattern passed under the `d' in the text: we stopped at the right-most one !!!

Therefore:

 We cannot miss out on a pattern !!!
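The argument above can be checked on the worked example with a minimal sketch. The class and method names (`ShiftSafetyDemo`, `lastOccExcludingLast`, `horspoolShift`, `skipsAMatch`) are hypothetical and used only for illustration. The sketch computes the slide distance from the text character under the last pattern position, then tests by brute force every alignment that the slide skips.

```java
class ShiftSafetyDemo {
    // Right-most index of c in P[0 .. m-2] (the last position is excluded),
    // or -1 if c does not occur there.
    static int lastOccExcludingLast(String P, char c) {
        return P.lastIndexOf(c, P.length() - 2);
    }

    // Slide distance when P is lined up at T[i0]: determined by the text
    // character that is aligned with the last character of the pattern.
    static int horspoolShift(String T, String P, int i0) {
        int m = P.length();
        char c = T.charAt(i0 + m - 1);
        return (m - 1) - lastOccExcludingLast(P, c);
    }

    // Brute-force check: does P occur at any alignment strictly between
    // i0 and i0 + shift (i.e., at an alignment that the slide skips)?
    static boolean skipsAMatch(String T, String P, int i0, int shift) {
        for (int s = i0 + 1; s < i0 + shift; s++) {
            if (T.startsWith(P, s)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String T = "abaaabbababcabdacbaabababc";
        String P = "abdacabaabd";
        int shift = horspoolShift(T, P, 4);   // i0 = 4 as in the example
        System.out.println("shift = " + shift);                   // 8, so i0 becomes 12
        System.out.println("skipped a match: " + skipsAMatch(T, P, 4, shift));
    }
}
```

For the example, the character used is T[14] = 'd', its right-most occurrence in P (excluding the last position) is at index 2, and the slide distance is 10 - 2 = 8.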

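• Based on this, Horspool (1980) improved the Simplified Boyer-Moore algorithm by always shifting on the text character that is lined up with the last character of the pattern: the pattern is slid so that the right-most occurrence of that character in the pattern (not counting the last position) lines up with it.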

• We call this algorithm the Boyer-Moore-Horspool algorithm.

• The Horspool algorithm

• The Horspool algorithm is derived from the Boyer-Moore algorithm by making 2 changes:

1. Omit the last character of the pattern when you compute the lastOcc() function

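(That is because you want the right-most occurrence of the character without counting the last position of the pattern. Otherwise, whenever the text character is equal to the last pattern character, the shift would be 0.)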

Example:

```
     0123456789
 P = abcabcabca

   lastOcc['a'] = 6
   lastOcc['b'] = 7
   lastOcc['c'] = 8

     0123456789
 P = abcabcabcd

   lastOcc['a'] = 6
   lastOcc['b'] = 7
   lastOcc['c'] = 8
   lastOcc['d'] = -1
```

2. When you detect a mismatch, always use the character in the Text string that is aligned with the last character in the pattern to determine the amount of shift

Example 1: mismatched at a character other than the last character in the pattern

```
T = ...............a....
P =       abcabcabca
              ^
              |
              mismatch somewhere

Shift:

T = ...............a....
P =          abcabcabca
```

Example 2: mismatched at the last character in the pattern

```
T = ...............b....
P =       abcabcabca
                   ^
                   |
                   mismatch

Shift:

T = ...............b....
P =        abcabcabca
```
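The two changes can also be folded into a single precomputed table of slide distances, a common way to write Horspool's algorithm. The following is a minimal sketch under that assumption; the names `ShiftTableSketch` and `computeShift` are hypothetical and do not appear in these notes. Each entry equals (m-1) - lastOcc[c], so a character that does not occur in P[0 .. m-2] gets the full slide distance m.

```java
import java.util.Arrays;

class ShiftTableSketch {
    static final int ALPHABET = 128;   // assume ASCII character set, as in the notes

    // shift[c] = (m-1) - (right-most index of c in P[0 .. m-2]),
    // or m if c does not occur in P[0 .. m-2].
    static int[] computeShift(String P) {
        int m = P.length();
        int[] shift = new int[ALPHABET];
        Arrays.fill(shift, m);                 // default: slide past the whole window
        for (int i = 0; i < m - 1; i++) {      // the last character of P is omitted
            shift[P.charAt(i)] = (m - 1) - i;  // later (right-most) occurrences overwrite
        }
        return shift;
    }

    public static void main(String[] args) {
        int[] shift = computeShift("abcabcabca");
        System.out.println("shift['a'] = " + shift['a']);  // 3, as in Example 1
        System.out.println("shift['b'] = " + shift['b']);  // 2, as in Example 2
        System.out.println("shift['c'] = " + shift['c']);  // 1
        System.out.println("shift['d'] = " + shift['d']);  // 10 (= m, 'd' is not in P)
    }
}
```

With such a table, the slide in the main loop becomes a single lookup on the text character aligned with the last pattern position.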

• To make the differences clear, I have copied the Simplified Boyer-Moore algorithm here

Simplified Boyer-Moore Algorithm:

```
   public static int[] computeLastOcc(String P)
   {
      int[] lastOcc = new int[128]; // assume ASCII character set

      for (int i = 0; i < 128; i++)
      {
         lastOcc[i] = -1;           // initialize all elements to -1
      }

      for (int i = 0; i < P.length(); i++)
      {
         lastOcc[P.charAt(i)] = i;  // The LAST value will be stored
      }

      return lastOcc;
   }

   public static int BMG (String T, String P)
   {
      int[] lastOcc;
      int i0, j, m, n;

      n = T.length();
      m = P.length();

      lastOcc = computeLastOcc(P);  // Find last occurrence of all characters in P

      i0 = 0;                       // Line P up at T[0]

      while ( i0 <= (n-m) )         // i0 = n-m is the last possible alignment
      {
         j = m-1;                   // Start at the last char in P

         while ( P.charAt(j) == T.charAt(i0+j) )
         {
            j--;                    // Check "next" (= previous) character

            if ( j < 0 )
               return (i0);         // P found !
         }

         if ( j < lastOcc[T.charAt(i0+j)] )
         {
            /* =======================================
               Bad character caveat detected
               ======================================= */
            i0++;                   // Slide P 1 char further (Goodrich)
         }
         else
         {
            i0 = i0 + j - lastOcc[T.charAt(i0+j)];
                                    // Bad char + Looking glass heuristic
         }
      }

      return -1;                    // no match
   }
```

• Boyer-Moore-Horspool Algorithm: (the changes are marked with comments)

```
   public static int[] computeLastOcc(String P)
   {
      int[] lastOcc = new int[128]; // assume ASCII character set

      for (int i = 0; i < 128; i++)
      {
         lastOcc[i] = -1;           // initialize all elements to -1
      }

      for (int i = 0; i < P.length()-1; i++)   // Don't use the last char
                                               // to compute lastOcc[]
      {
         lastOcc[P.charAt(i)] = i;  // The LAST value will be stored
      }

      return lastOcc;
   }

   public static int BMG (String T, String P)
   {
      int[] lastOcc;
      int i0, j, m, n;

      n = T.length();
      m = P.length();

      lastOcc = computeLastOcc(P);  // Find last occurrence of all characters in P

      i0 = 0;                       // Line P up at T[0]

      while ( i0 <= (n-m) )         // i0 = n-m is the last possible alignment
      {
         j = m-1;                   // Start at the last char in P

         while ( P.charAt(j) == T.charAt(i0+j) )
         {
            j--;                    // Check "next" (= previous) character

            if ( j < 0 )
               return (i0);         // P found !
         }

         /* ==========================================================
            The character in T aligned with P[m-1] is: T[i0+(m-1)]

            Always use character T[i0 + (m-1)] to find the shift
            ========================================================== */
         i0 = i0 + (m-1) - lastOcc[T.charAt(i0+(m-1))];
                                    // Use last character: j = (m-1)
      }

      return -1;                    // no match
   }
```
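The method above stops at the first match. As a small variation, the same slide rule can be applied after a match as well, which reports every occurrence. The sketch below is only one way to do this and is not part of the original notes; the names `HorspoolAll` and `findAll` are hypothetical, and the ASCII assumption is carried over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HorspoolAll {
    // Returns the start positions of all (possibly overlapping) occurrences of P in T.
    static List<Integer> findAll(String T, String P) {
        List<Integer> hits = new ArrayList<>();
        int n = T.length();
        int m = P.length();
        if (m == 0 || m > n) return hits;      // nothing to search for

        int[] lastOcc = new int[128];          // assume ASCII character set
        Arrays.fill(lastOcc, -1);
        for (int i = 0; i < m - 1; i++) {      // omit the last character of P
            lastOcc[P.charAt(i)] = i;
        }

        int i0 = 0;
        while (i0 <= n - m) {
            int j = m - 1;                     // compare from the end of P
            while (j >= 0 && P.charAt(j) == T.charAt(i0 + j)) {
                j--;
            }
            if (j < 0) {
                hits.add(i0);                  // P found at i0
            }
            // Match or mismatch: slide using the text character under P[m-1].
            i0 += (m - 1) - lastOcc[T.charAt(i0 + m - 1)];
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("abcabcabca", "abca"));                  // [0, 3, 6]
        System.out.println(findAll("abaaabbababcabdacbaabababc", "abab"));  // [7, 19, 21]
    }
}
```

Because lastOcc[] never includes position m-1, the slide distance is always at least 1, so the loop makes progress after a match too.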

• Example Program: (Demo above code)

How to run the program:

 Right click on link(s) and save in a scratch directory
 To compile: javac Horspool.java
 To run: java Horspool