### A Simplified Boyer-Moore Algorithm

• Introduction

• The Boyer-Moore algorithm compares the pattern P with the text T from right to left.

 Repeat: from right to left !!            I.e.: backwards !!!!!!!!!

• Boyer-Moore and the algorithms derived from it are among the fastest pattern matching algorithms available !

 The Boyer-Moore algorithm is considered the most efficient string-matching algorithm for natural language. The Boyer-Moore-Horspool (a variant of Boyer-Moore) algorithm achieves the best overall results when used with medical texts. This algorithm usually performs at least twice as fast as the other algorithms tested. Reference: click here

• Overview

• The Boyer-Moore algorithm consists of 2 heuristics:

• The bad character heuristic:

 This heuristic tells you how far you should slide the pattern forward when characters do not match

• The good suffix heuristic:

 Sometimes the bad character heuristic provides no (or only poor) jump information. The good suffix heuristic then uses the characters that did match (a suffix of the pattern, because we match backwards) to tell you how far you should slide the pattern forward.

• Comment:

• The bad character heuristic is:

 Easy to understand (and implement). It can slide the pattern very far down (i.e., make the algorithm run very fast).

• The good suffix heuristic is

 Pretty difficult... So some versions of the Boyer-Moore algorithm replace the good suffix heuristic with a simpler operation.

• Goodrich presents a Simplified Boyer-Moore algorithm:

• Replaces the good suffix heuristic with:

 Slide 1 character down and start matching again

• Intro to the Boyer-Moore algorithm

• I find it easier to explain the Boyer-Moore algorithm without using the variable i:

```
    01234567890
      i0=2
      |   i=6        (i0 = i - j)
      |   |
      v   v
T = abbadabacba
P =   abcad
          ^
          |
          j=4

Since i0 = i - j, we can obtain i from i0 and j:

    i = i0 + j

Result:

    01234567890
      i0=2
      |   i0+j (6)
      |   |
      v   v
T = abbadabacba
P =   abcad
          ^
          |
          j=4
```

• The Boyer-Moore algorithm compares the pattern P with the text T from right to left.

• Example:

```
    01234567890
    i0=0
    |   i0+j        (i = i0 + j)
    |   |
    v   v
T = abbadabacba
P = abcad
        ^
        |
        j=4

if ( P[j] == T[i0+j] )  ==>  compare the previous character

This can be achieved by the statement:

    j--;

Result:

    01234567890
    i0=0
    |  i0+j        (i0 + j also decreases because j is decreased)
    |  |
    v  v
T = abbadabacba
P = abcad
       ^
       |
       j=3         (j was decremented !!!)
```

• Why is comparing characters in reverse order a great idea   ----   (the "looking glass" heuristic)

• Fact: (Goodrich calls this the "looking glass" heuristic)

• If the character T[i0+j] (that is currently used in the comparison) does not occur in the pattern P at all, then:

 The pattern P can be shifted completely past T[i0+j].

(Because you will never find a match as long as the pattern overlaps with that character !!!)

• Example 1:

```
    01234567890123
       i0=3
       |   i0+j        (i = i0 + j)
       |   |
       v   v
T = abaabbaxabacba        x does not occur in P
P =    abcad
           ^
           |
           j=4

Because x does not occur in P, we know that these shifts will not produce a match:

    abbaxabacba
     abcad

    abbaxabacba
      abcad

    abbaxabacba
       abcad

    abbaxabacba
        abcad

So don't waste CPU time and shift the pattern past x:

    abbaxabacba
         abcad
             ^
             |
             j

(And we start over by matching the last character in P again...)
```

How to update the variables i0 and j:

```
Before:

    01234567890123
       i0=3
       |   i0+j        (i = i0 + j)
       |   |
       v   v
T = abaabbaxabacba
P =    abcad
           ^
           |
           j=4

After:

    01234567890123
            i0=8
            |   i0+j        (i = i0 + j)
            |   |
            v   v
T = abaabbaxabacba
P =         abcad
                ^
                |
                j=4

Statements that achieve this:

    i0 = i0 + (j + 1);    // Slide (j+1) characters
    j  = m - 1;           // Restart matching at last character in P
```

• Example 2: x is aligned with a character somewhere in the middle of the pattern

```
    01234567890123456789012
       i0=3
       |       i0+j        (i = i0 + j)
       |       |
       v       v
T = abaabbabaxabacbaabababc
P =    abcaaadab
               ^
               |
               j=8

P[j] == T[i0+j]  ==>  j--   (Compare the previous character)

    01234567890123456789012
       i0=3
       |      i0+j        (i = i0 + j)
       |      |
       v      v
T = abaabbabaxabacbaabababc
P =    abcaaadab
              ^
              |
              j=7

P[j] == T[i0+j]  ==>  j--   (Compare the previous character)

    01234567890123456789012
       i0=3
       |     i0+j        (i = i0 + j)
       |     |
       v     v
T = abaabbabaxabacbaabababc
P =    abcaaadab
             ^
             |
             j=6

x is not in P  ==>  Shift P past x

Result:

    01234567890123456789012
              i0=10
              |       i0+j        (i = i0 + j)
              |       |
              v       v
T = abaabbabaxabacbaabababc
P =           abcaaadab
                      ^
                      |
                      j=8

(And we start over by matching the last character in P again...)
```

How to update the variables i0 and j:

```
Before:

    01234567890123456789012
       i0=3
       |     i0+j        (i = i0 + j)
       |     |
       v     v
T = abaabbabaxabacbaabababc
P =    abcaaadab
             ^
             |
             j=6

After:

    01234567890123456789012
              i0=10
              |       i0+j        (i = i0 + j)
              |       |
              v       v
T = abaabbabaxabacbaabababc
P =           abcaaadab
                      ^
                      |
                      j=8

Statements that achieve this:

    i0 = i0 + (j + 1);    // Slide (j+1) characters
    j  = m - 1;           // Restart matching at last character in P
```
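The two examples above can be replayed with a small sketch. The class name `LookingGlassDemo` and the method `nextAlignment` are hypothetical names made up for this illustration, and the sketch assumes the caller passes an alignment with `i0 + P.length() <= T.length()`. It only implements the "character does not occur in P" case; every other mismatch falls back to a 1-character slide.

```java
// Hypothetical helper class, for illustration only.
class LookingGlassDemo {
    /**
     * Compares P against T at alignment i0 from right to left and returns
     * the next alignment to try. Returns i0 itself when P matches at i0.
     */
    static int nextAlignment(String T, String P, int i0) {
        int j = P.length() - 1;             // start at the last char in P
        while (j >= 0 && P.charAt(j) == T.charAt(i0 + j)) {
            j--;                            // compare the previous character
        }
        if (j < 0) {
            return i0;                      // every character matched
        }
        char bad = T.charAt(i0 + j);        // text character that caused the mismatch
        if (P.indexOf(bad) == -1) {
            return i0 + (j + 1);            // looking glass: slide P past the bad character
        }
        return i0 + 1;                      // assumption: all other cases slide 1 character
    }

    public static void main(String[] args) {
        // Example 1: mismatch on x at j = 4, so i0 goes from 3 to 8
        System.out.println(nextAlignment("abaabbaxabacba", "abcad", 3));
        // Example 2: mismatch on x at j = 6, so i0 goes from 3 to 10
        System.out.println(nextAlignment("abaabbabaxabacbaabababc", "abcaaadab", 3));
    }
}
```

Running `main` should print 8 and then 10, the same values of i0 shown in the "After" diagrams.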

• The bad character heuristic

• The bad character heuristic:

• If a bad character "x" (= the character in the text that causes a mismatch) occurs somewhere else in the pattern, say:

```
T = .............x......
P =      ..x..x....
```

then:

 The pattern P can be shifted so that the right-most occurrence of the character x in the pattern is aligned with this text symbol.

• Example

```
T = .............x......
P =      ..x..x....

Shift P so that the right-most occurrence of the bad character in P aligns with the character in T:

T = .............x......
P =         ..x..x....
```

Why does the heuristic work:

```
T = ................x......
P =       ..x...x.......
                 ^^^
                 These characters WILL cause a mismatch with x !!
```

• Example:

```
    012345678901234567890
    i0=0
    |      i0+j
    |      |
    v      v
T = abbababcabacbaabababc
P = abcacabdab
        ^  ^
        |  |
        |  j=7
        |
        right-most occurrence of c in P

After the shift:

    012345678901234567890
       i0=3
       |        i0+j
       |        |
       v        v
T = abbababcabacbaabababc
P =    abcacabdab
                ^
                |
                j=9    <--- (start matching from the end again !)
```

• Preprocessing for the bad character heuristic: the lastOcc() function

• The last occurrence function:

 lastOcc(c) = position of the right-most occurrence of the character c in the pattern P            (c ∈ character set used, e.g.: ASCII code for English text)

• If a character c does not occur in the pattern P, we will set:

 ``` lastOcc[ c ] = -1 ```

Reason:

 We will see that the value −1 will make the pattern P slide past the character c (and this will implement the looking glass heuristic that we saw above)

• Example: lastOcc() function

```
    012345
P = tomato

The lastOcc() function of P is:

    lastOcc('a') = 3
    lastOcc('m') = 2
    lastOcc('o') = 5
    lastOcc('t') = 4

and for all other characters:

    lastOcc(.) = -1
```

• Algorithm to compute the lastOcc() function:

```
public static int[] buildLastFunction (String P)
{
   int[] lastOcc = new int[128];   // assume ASCII character set

   /* =========================================
      Initialize every element to -1
      ========================================= */
   for (int i = 0; i < 128; i++)
   {
      lastOcc[i] = -1;             // initialize all elements to -1
   }

   /* ===============================================
      Update lastOcc[c] with position of character c
      =============================================== */
   for (int pos = 0; pos < P.length(); pos++)
   {
      char c = P.charAt(pos);      // c must be declared before it is used
      lastOcc[ c ] = pos;          // ONLY the LAST position will be retained !
   }

   return lastOcc;
}
```
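As a quick check, the table for the pattern "tomato" from the example above can be printed with a sketch like the following. The wrapper class `LastOccDemo` and its `main` method are hypothetical additions for this illustration; like the notes, the sketch assumes every character in P is ASCII (code below 128).

```java
// Hypothetical wrapper class, for illustration only.
class LastOccDemo {
    static int[] buildLastFunction(String P) {
        int[] lastOcc = new int[128];           // assumes ASCII characters only
        for (int i = 0; i < 128; i++) {
            lastOcc[i] = -1;                    // -1 means: does not occur in P
        }
        for (int pos = 0; pos < P.length(); pos++) {
            lastOcc[P.charAt(pos)] = pos;       // later positions overwrite earlier ones
        }
        return lastOcc;
    }

    public static void main(String[] args) {
        int[] lastOcc = buildLastFunction("tomato");
        for (char c : new char[] {'a', 'm', 'o', 't', 'x'}) {
            System.out.println("lastOcc('" + c + "') = " + lastOcc[c]);
        }
    }
}
```

This should print 3, 2, 5 and 4 for 'a', 'm', 'o' and 't', and -1 for 'x', which does not occur in "tomato".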

• How to use the lastOcc() information

• How to use the lastOcc() information:

```
    0123456789012345678901234
        i0=4
        |      i0+j
        |      |
        v      v
T = abcaabbababcabacbaabababc        ('c' == T[i0+j])
P =     abcacabdab
            ^  ^
            |  |
            |  j=7
            |
            lastOcc['c'] = 4      or better: lastOcc[ T[i0+j] ] = 4

            <->   j - lastOcc[ T[i0+j] ] = 3   <==== amount to slide pattern P !!

After the shift:

    0123456789012345678901234
           i0=4+3
           |        i0+j
           |        |
           v        v
T = abcaabbababcabacbaabababc
P =        abcacabdab
                    ^
                    |
                    j=9    <--- (start matching from the end again !)

Statements that accomplish this change:

    i0 = i0 + (j - lastOcc[ T[i0+j] ]);    // Slide pattern P
    j  = m - 1;                            // Start matching over from last character
```

• lastOcc[.] = −1   ==>   slide past the character

• Fact:

• The statements used to accomplish the bad character heuristic can be used to accomplish the looking glass heuristic if we use:

 ``` lastOcc[ . ] = −1 ```

Proof:

• Recall how the variables i0 and j are updated in the looking glass heuristic:

```
Before:

    01234567890123456789012
       i0=3
       |     i0+j        (i = i0 + j)
       |     |
       v     v
T = abaabbabaxabacbaabababc
P =    abcaaadab
             ^
             |
             j=6

After:

    01234567890123456789012
              i0=10
              |       i0+j        (i = i0 + j)
              |       |
              v       v
T = abaabbabaxabacbaabababc
P =           abcaaadab
                      ^
                      |
                      j=8

Statements that achieve this:

    i0 = i0 + (j + 1);    // Slide (j+1) characters
    j  = m - 1;           // Restart matching at last character in P
```

• The statements used to update the variables i0 and j in the bad character heuristic:

```
i0 = i0 + (j - lastOcc[ T[i0+j] ]);    // Slide pattern P
j  = m - 1;                            // Start matching over from last character
```

• Therefore:

• If we set:

```
lastOcc[ . ] = -1      // if the character does not occur in P
```

then:

```
i0 = i0 + (j - lastOcc[ T[i0+j] ])
   = i0 + (j - (-1))
   = i0 + (j + 1)
```

This makes the pattern P slide past that non-occurring character.
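To see both cases fall out of the one statement, here is a minimal sketch that applies `i0 + (j - lastOcc[T[i0+j]])` to the two mismatches traced earlier. The class `BadCharShiftDemo` and the method `nextAlignment` are hypothetical names, and the sketch assumes ASCII text and a mismatch position where the shift is not negative.

```java
// Hypothetical helper class, for illustration only.
class BadCharShiftDemo {
    static int[] computeLastOcc(String P) {
        int[] lastOcc = new int[128];           // assumes ASCII characters only
        for (int i = 0; i < 128; i++) {
            lastOcc[i] = -1;                    // -1 means: does not occur in P
        }
        for (int i = 0; i < P.length(); i++) {
            lastOcc[P.charAt(i)] = i;
        }
        return lastOcc;
    }

    /** New alignment after a mismatch between P[j] and T[i0+j]. */
    static int nextAlignment(String T, String P, int i0, int j) {
        int[] lastOcc = computeLastOcc(P);
        return i0 + (j - lastOcc[T.charAt(i0 + j)]);
    }

    public static void main(String[] args) {
        // Bad character 'c' occurs in P (lastOcc = 4): 4 + (7 - 4) = 7
        System.out.println(nextAlignment("abcaabbababcabacbaabababc", "abcacabdab", 4, 7));
        // Bad character 'x' does not occur in P (lastOcc = -1): 3 + (6 + 1) = 10
        System.out.println(nextAlignment("abaabbabaxabacbaabababc", "abcaaadab", 3, 6));
    }
}
```

The first call should print 7 (line up with the last 'c'), the second 10 (slide past 'x'); no separate branch is needed for the character that does not occur in P.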

• Caveat in using the bad character heuristic

• Caveat:

 The bad character heuristic may cause the pattern P to slide backwards

• Specifically, this phenomenon will happen in the following situation:

```
                 mismatch
                 |
                 v
T = .............x..............
P =       ..x........x....
                     ^
                     |
                     right-most occurrence of mismatched character in P
```

when the right-most occurrence of the mismatched character is located further "down" in the pattern, i.e., to the right of the mismatch position j

• Example:

```
    01234567890123456789012345
        i0=4
        |      i0+j
        |      |
        v      v
T = abaaabbababcabcacbaabababc        ('c' == T[i0+j])
P =     abcacabdabc
               ^  ^
               |  |
               j=7|
                  |
                  lastOcc['c'] = 10

               <->   j - lastOcc[ T[i0+j] ] = -3 ?!

If we execute the statements:

    i0 = i0 + (j - lastOcc[ T[i0+j] ]);    // Slide pattern P
    j  = m - 1;

we will get:

    01234567890123456789012345
     i0=4+(-3)
     |         i0+j
     |         |
     v         v
T = abaaabbababcabcacbaabababc
P =  abcacabdabc
               ^
               |
               j=10    <--- (start matching from the end again !)

Result:

    Pattern slid backwards !!!

    (A backward slide costs more time, and it can even send the search
     back to an alignment it has already tried.
     We are trying to CUT running time...)

    BTW, to ADD running time to any algorithm is EASY :)
```

• Solving the bad-character heuristic caveat: the easiest solution

• There are many proposed solutions to the bad character caveat:

 The Boyer-Moore algorithm uses a "good suffix" heuristic that uses the matched suffix to slide; this is similar to the KMP prefix failure function.

 Horspool proposed a much simpler solution.

 Goodrich has a trivial solution in his textbook.

• The simple solution put forth by Goodrich:

• If the bad character heuristic will perform a backward slide, then:

 Slide the pattern P one character forward (This is the smallest possible slide, so it will surely not make the algorithm miss any matches.)

In terms of program statements, this solution is coded as follows:

```
    01234567890123456789012345
        i0=4
        |      i0+j
        |      |
        v      v
T = abaaabbababcabcacbaabababc        ('c' == T[i0+j])
P =     abcacabdabc
               ^  ^
               |  |
               j=7|
                  |
                  lastOcc['c'] = 10

j < lastOcc[ T[i0+j] ] !!!!

if ( j < lastOcc[ T[i0+j] ] )
{
   i0++;                                      // Slide pattern 1 character further
   j = m-1;                                   // Restart matching from the last char in P
}
else
{
   i0 = i0 + j - lastOcc[T.charAt(i0+j)];     // FAST slide
   j = m-1;                                   // Restart matching from the last char in P
}
```
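The effect of the guard on this example can be checked with a short sketch. `BackwardSlideGuardDemo`, `rawShift` and `guardedShift` are hypothetical names introduced only for this illustration; the sketch assumes ASCII text.

```java
// Hypothetical helper class, for illustration only.
class BackwardSlideGuardDemo {
    static int[] computeLastOcc(String P) {
        int[] lastOcc = new int[128];           // assumes ASCII characters only
        for (int i = 0; i < 128; i++) {
            lastOcc[i] = -1;
        }
        for (int i = 0; i < P.length(); i++) {
            lastOcc[P.charAt(i)] = i;
        }
        return lastOcc;
    }

    /** Shift from the bad character heuristic alone; may be negative. */
    static int rawShift(int[] lastOcc, char bad, int j) {
        return j - lastOcc[bad];
    }

    /** Shift with the guard: never less than 1 character forward. */
    static int guardedShift(int[] lastOcc, char bad, int j) {
        if (j < lastOcc[bad]) {
            return 1;                           // backward slide detected: slide 1 character
        }
        return j - lastOcc[bad];                // fast slide
    }

    public static void main(String[] args) {
        String T = "abaaabbababcabcacbaabababc";
        String P = "abcacabdabc";
        int i0 = 4, j = 7;                      // mismatch position from the example
        int[] lastOcc = computeLastOcc(P);
        char bad = T.charAt(i0 + j);            // 'c', with lastOcc['c'] = 10
        System.out.println("raw shift     = " + rawShift(lastOcc, bad, j));
        System.out.println("guarded shift = " + guardedShift(lastOcc, bad, j));
    }
}
```

For this input the raw shift should come out as -3 and the guarded shift as 1, so the pattern moves from i0 = 4 to i0 = 5 instead of back to i0 = 1.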

• This is the Simplified Boyer-Moore algorithm:

```
BoyerMooreSimp(T, P)
{
   n = T.length();
   m = P.length();

   lastOcc = computeLastOcc(P);   // Find last positions of all characters in P

   i0 = 0;                        // Line P up at T[0]

   while ( i0 <= (n-m) )          // "<=": P may also match at the very end of T
   {
      j = m-1;                    // Start at the last char in P

      while ( P[j] == T[i0+j] )
      {
         j--;                     // Check "next" (= previous) character

         if ( j < 0 )
            return (i0);          // P found !
      }

      /* ====================================================
         If program reaches this place, we have a mismatch
         between  P[j] <=> T[i0+j]
         ==================================================== */
      if ( j < lastOcc[ T[i0+j] ] )
      {
         /* ============================
            Handle bad character caveat
            ============================ */
         i0++;                    // Slide P 1 character further (Goodrich)
                                  // "j = m-1" is executed by the start of the loop...
      }
      else
      {
         i0 = i0 + j - lastOcc[ T[i0+j] ];
                                  // "j = m-1" is executed by the start of the loop...
      }
   }

   return -1;                     // P not found in T
}
```

• Java:

```
public static int[] computeLastOcc(String P)
{
   int[] lastOcc = new int[128];   // assume ASCII character set

   for (int i = 0; i < 128; i++)
   {
      lastOcc[i] = -1;             // initialize all elements to -1
   }

   for (int i = 0; i < P.length(); i++)
   {
      lastOcc[P.charAt(i)] = i;    // The LAST value will be stored
   }

   return lastOcc;
}

public static int BMG (String T, String P)
{
   int[] lastOcc;
   int i0, j, m, n;

   n = T.length();
   m = P.length();

   if ( m == 0 )
      return 0;                    // Guard: an empty pattern has no last character

   lastOcc = computeLastOcc(P);    // Find last occurrence of all characters in P

   i0 = 0;                         // Line P up at T[0]

   while ( i0 <= (n-m) )           // "<=": P may also match at the very end of T
   {
      j = m-1;                     // Start at the last char in P

      while ( P.charAt(j) == T.charAt(i0+j) )
      {
         j--;                      // Check "next" (= previous) character

         if ( j < 0 )
            return (i0);           // P found !
      }

      // Note: T must contain ASCII characters only (lastOcc has 128 entries)
      if ( j < lastOcc[T.charAt(i0+j)] )
      {
         /* =======================================
            Bad character caveat detected
            ======================================= */
         i0++;                     // Slide P 1 char further (Goodrich)
      }
      else
      {
         i0 = i0 + j - lastOcc[T.charAt(i0+j)];   // Bad char + Looking glass heuristic
      }
   }

   return -1;                      // no match
}
```
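A compact, self-contained variant of the search is sketched below so it can be tried without the demo program. The class name `SimplifiedBoyerMooreDemo` and the method name `find` are hypothetical. The sketch assumes ASCII input, treats an empty pattern as matching at position 0 (an assumption of this sketch, not something the notes specify), and uses `i0 <= n - m` as the loop bound so that a match ending at the last character of T is also found.

```java
// Hypothetical class, for illustration only.
class SimplifiedBoyerMooreDemo {
    /** Returns the first index where P occurs in T, or -1 if there is none. */
    static int find(String T, String P) {
        int n = T.length();
        int m = P.length();
        if (m == 0) {
            return 0;                           // assumption: empty pattern matches at 0
        }

        int[] lastOcc = new int[128];           // assumes ASCII characters only
        for (int i = 0; i < 128; i++) {
            lastOcc[i] = -1;
        }
        for (int i = 0; i < m; i++) {
            lastOcc[P.charAt(i)] = i;
        }

        int i0 = 0;                             // line P up at T[0]
        while (i0 <= n - m) {
            int j = m - 1;                      // start at the last char in P
            while (j >= 0 && P.charAt(j) == T.charAt(i0 + j)) {
                j--;                            // compare the previous character
            }
            if (j < 0) {
                return i0;                      // P found
            }
            int last = lastOcc[T.charAt(i0 + j)];
            if (j < last) {
                i0++;                           // caveat: slide 1 character (Goodrich)
            } else {
                i0 += j - last;                 // bad character + looking glass slide
            }
        }
        return -1;                              // no match
    }

    public static void main(String[] args) {
        System.out.println(find("abacaxbaccabacbbaabb", "abacbb"));  // text from the demo
        System.out.println(find("xxabc", "abc"));                    // match at the end of T
        System.out.println(find("abcabc", "abd"));                   // no match
    }
}
```

The three calls should print 10, 2 and -1. The second call is the case that depends on the loop bound: with `i0 < n - m` the alignment i0 = 2 would never be tested.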

• Example Program: (Demo above code)

How to run the program:

 Right click on link(s) and save in a scratch directory

 To compile:   javac BMG.java

 To run:          java BMG

Sample output:

```
+++++++++++++++++++++++++++++++++++++
=====================================
Matching: i = 5, j = 5
01234567890123456789
abacaxbaccabacbbaabb
abacbb
012345
     ^
     |
*** slide past "non-occurring" char ******
lastOcc['x'] = -1
+++++++++++++++++++++++++++++++++++++
=====================================
Matching: i = 11, j = 5
01234567890123456789
abacaxbaccabacbbaabb
      abacbb
      012345
           ^
           |
=====================================
Matching: i = 10, j = 4
01234567890123456789
abacaxbaccabacbbaabb
      abacbb
      012345
          ^
          |
*** line up with last occ ******
lastOcc['a'] = 2
+++++++++++++++++++++++++++++++++++++
=====================================
Matching: i = 13, j = 5
01234567890123456789
abacaxbaccabacbbaabb
        abacbb
        012345
             ^
             |
*** line up with last occ ******
lastOcc['c'] = 3
+++++++++++++++++++++++++++++++++++++
=====================================
Matching: i = 15, j = 5
01234567890123456789
abacaxbaccabacbbaabb
          abacbb
          012345
               ^
               |
P found !!!
```

• Postscript

• The good suffix heuristic is similar to the failure function of KMP

• Furthermore, other, simpler modifications of the Boyer-Moore algorithm (like the Horspool algorithm) perform about as well as the Boyer-Moore algorithm with the good suffix heuristic

 Here is a link where you can compare the performance of some text matching algorithms: click here

• So I will not discuss the good suffix heuristic.

Instead, I'll spend time teaching the Boyer-Moore-Horspool algorithm