Exact String Matching: Knuth-Morris-Pratt (KMP)

• Using prefix information to do better than the brute-force algorithm

• Recall the brute-force algorithm:

 ```
 n = T.length();
 m = P.length();

 i0 = 0;         // Line P up with the first character of T
 i  = 0;         // Start matching with first char in T
 j  = 0;         // Start matching with first char in P

 while ( i < n )   // Not all characters used
 {
    if ( T[i] == P[j] )
    {
       /* ===============================================
          T[i] and P[j] match ==> try next pair
          =============================================== */
       i++;                 // Match next pair
       j++;

       if ( j == m )
          return ( i0 );    // Match found at position i0 !!!
    }
    else
    {
       /* ===========================================
          T[i] ≠ P[j]:
             1. Slide P up 1 position
             2. Restart from the beginning of P
          =========================================== */
       i0 = i0 + 1;         // Slide pattern P one character further
       i = i0;              // Restart matching at position i0 in T
       j = 0;               // Restart matching at position 0 in P
    }
 }

 return -1;      // Return not found
 ```

• \$64,000 question:

 Can we do better ???

Yes, because:

• We can use the matched prefix to slide the pattern further

• I want to show you some examples before showing you the technique

In these examples, note that:

 You do not need to know the input text in order to answer the questions !!!

• Example 1:

• Suppose the character in the input text marked with ^ is the first unmatched character:

 ```
 T = ???????????????????
 P =     aaabaaaxyz
             ^
 ```

• We can conclude that:

• The prior characters (= prefix) are equal to those in pattern P:

 ```
 T = ????aaab???????????
 P =     aaabaaaxyz
 ```

• Now, let us use only the prefix information in the text string (because the other characters are not known):

 ```
 T = ????aaab??????????
 P =     aaab???????
 ```

We can know for sure that the pattern P cannot be found starting at these positions:

 ```
 The character b prevents P from matching !!!

            |
            V
 T = ????aaab??????????
 P =      aaab???????

            |
            V
 T = ????aaab??????????
 P =       aaab???????

            |
            V
 T = ????aaab??????????
 P =        aaab???????
 ```

• The first possible way that pattern P can be found in the text is starting here:

 ```
 T = ????aaab??????????
 P =         aaab???????
 ```

• Conclusion:

 In the above example, we can slide the pattern P 4 characters further down without missing a matching pattern !!!

• Example 2:

• Suppose the character in the input text marked with ^ is the first unmatched character:

 ```
 T = ???????????????????????
 P =     aaaaabaaxaaxyz
                 ^
 ```

• We can conclude that:

• The prior characters (= prefix) are equal to those in pattern P:

 ```
 T = ????aaaaabaa???????????
 P =     aaaaabaaxaaxyz
 ```

• Now, let us use only the prefix information in the text string (because the other characters are not known):

 ```
 T = ????aaaaabaa??????????
 P =     aaaaabaa???????
 ```

We can know for sure that the pattern cannot be found starting at these positions:

 ```
 T = ????aaaaabaa??????????
 P =      aaaaabaa???????

 T = ????aaaaabaa??????????
 P =       aaaaabaa???????

 T = ????aaaaabaa??????????
 P =        aaaaabaa???????

 T = ????aaaaabaa??????????
 P =         aaaaabaa???????

 T = ????aaaaabaa??????????
 P =          aaaaabaa???????
 ```

because:

 One or more known characters in the prefix already failed to match !!!

• The first possible way that pattern P can be found in the text is starting here:

 ```
 T = ????aaaaabaa??????????
 P =           aaaaabaa???????
 ```

• Conclusion:

 In the above example, we can slide the pattern P 6 characters further down without missing a matching pattern !!!
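The reasoning in both examples can be checked mechanically. The following is a minimal sketch, not part of the original notes: the class name `SlideDemo` and the method `safeSlide` are made up for illustration. It takes only the matched prefix (the text is never consulted) and tries each slide distance in turn until the shifted pattern agrees with the known characters.

```java
// Hypothetical helper for illustration only (names are not from the notes).
class SlideDemo {
    // Returns the smallest slide s >= 1 for which the pattern, moved s
    // positions to the right, still agrees with the known (matched) characters.
    static int safeSlide(String matched) {
        int len = matched.length();
        for (int s = 1; s < len; s++) {
            // The known text from offset s on is matched.substring(s);
            // the shifted pattern must start with exactly those characters.
            if (matched.startsWith(matched.substring(s))) {
                return s;
            }
        }
        // No overlap at all: slide past the whole matched prefix
        // (or 1 position if nothing was matched).
        return Math.max(len, 1);
    }

    public static void main(String[] args) {
        System.out.println(safeSlide("aaab"));      // Example 1: prints 4
        System.out.println(safeSlide("aaaaabaa"));  // Example 2: prints 6
    }
}
```

Every slide that the loop rejects corresponds to one of the "cannot be found starting here" pictures above.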

OK, now we must consolidate what we have learned....

We will need some new terminology :)

• Terminology

• Prefixes of the pattern P:

• A prefix of the pattern P is a portion of text at the start of the pattern

Example:

 ```
 P = aaaaabaa???????
     ^^^^^^^^
     a prefix of P
 ```

• Proper suffix of a prefix:

• A proper suffix of a prefix (of the pattern P) is a tail portion of a prefix that is not equal to the entire prefix

Example:

 ```
 P = aaaaabaa???????
           ^^
           a proper suffix of a prefix of P
 ```

• This is not a proper suffix:

 ```
 P = aaaaabaa???????
     ^^^^^^^^
     not a proper suffix
 ```

• The maximum overlap of a prefix

• Maximum overlap of a prefix:

 Let pre be a prefix of the pattern P.

 MaxOverlap(pre) = the longest proper suffix of pre that is also a prefix of pre

Examples:

• pre = aaaa

 ```
 MaxOverlap("aaaa") = "aaa"

 because:
      aaaa        longest proper suffix
       aaaa       that is equal to a prefix of "aaaa"
 ```

• pre = aaba

 ```
 MaxOverlap("aaba") = "a"

 because:
      aaba        longest proper suffix
         aaba     that is equal to a prefix of "aaba"
 ```

• pre = aab

 ```
 MaxOverlap("aab") = ""  (empty string !)

 because:
      aab
         aab      There is NO overlap possible
 ```

• Very important note:

• You cannot use the entire prefix to determine the maximum overlap

Example:

 ```
      aaaa
      aaaa

 or:

      aaba
      aaba
 ```

• The Maximum Overlap must be a proper suffix (i.e., it must be strictly shorter than the prefix itself !!!)

Reminder:

 MaxOverlap(pre) is never equal to the entire prefix pre !!!!
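The definition translates directly into code. Here is a naive sketch (the class and method names are chosen for illustration and do not come from the notes): it tries every proper suffix from longest to shortest and returns the first one that is also a prefix.

```java
// Naive illustration of the MaxOverlap definition; not an efficient algorithm.
class MaxOverlapDemo {
    // Returns the longest proper suffix of pre that is also a prefix of pre.
    static String maxOverlap(String pre) {
        // Start at length - 1 so the entire prefix is never a candidate.
        for (int len = pre.length() - 1; len > 0; len--) {
            String suffix = pre.substring(pre.length() - len);
            if (pre.startsWith(suffix)) {
                return suffix;
            }
        }
        return "";   // no overlap possible
    }

    public static void main(String[] args) {
        System.out.println("[" + maxOverlap("aaaa") + "]");  // prints [aaa]
        System.out.println("[" + maxOverlap("aaba") + "]");  // prints [a]
        System.out.println("[" + maxOverlap("aab") + "]");   // prints []
    }
}
```

Starting the loop at `pre.length() - 1` is what enforces the "proper suffix" rule from the reminder above.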

• How to re-use prefix information when there is a mismatch

• Using what we have learned in the above examples:

• If there is a mismatch at T[i] and P[j] (i.e., T[i] ≠ P[j]) and the MaxOverlap of the prefix has length k:

Then we can slide P so that the start of P lines up with the overlapping suffix of the matched characters (a slide of j − k positions) without missing out on a match.

Note:

• If the mismatched location is P[j], then the prefix is:

 P[0 .. (j−1)] !!!

• "Fast slide" algorithm on mismatch --- pseudocode:

 ```
 prefix = P[ 0..(j-1) ];      // Prefix of pattern at the mismatch

 k = MaxOverlap( prefix );    // Compute max overlap (k = its length)

 j = k;
 i0 = (i - j);                // i is unchanged !
 ```

• Example:

 ```
              1         2
    01234567890123456789012    (ruler)
    i0=0   i=7
    |      |
    v      v
 T: abadababaccabacabaabb
 P: abadabacb
           ^
           |
           j=7

 Prefix:   abadaba

 Maximum overlap:

     abadaba
         abadaba

 So:   k = 3

              1         2
    01234567890123456789012    (ruler)
    i0=0   i=7
    |      |
    v      v
 T: abadababaccabacabaabb
 P: abadabacb
           ^
           |
           j=7

 Update:

     j  = 3          (k = 3)
     i0 = (7-3) = 4

 New situation:

              1         2
    01234567890123456789012    (ruler)
        i0=4
        |  i=7
        |  |
        v  v
 T: abadababaccabacabaabb
 P:     abadabacb
           ^
           |
           j=3
 ```
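The update in this example can be reproduced with a few lines of Java. This is only a sketch under assumptions: `FastSlideDemo`, `maxOverlapLength` and `fastSlide` are illustrative names, and the overlap is computed naively.

```java
// Illustrative sketch of the "fast slide" step; names are not from the notes.
class FastSlideDemo {
    // Length of the longest proper suffix of pre that is also a prefix of pre.
    static int maxOverlapLength(String pre) {
        for (int len = pre.length() - 1; len > 0; len--) {
            if (pre.startsWith(pre.substring(pre.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    // Given a mismatch at T[i] and P[j] (with j > 0), returns { new i0, new j }.
    // The text position i itself does not change.
    static int[] fastSlide(String P, int i, int j) {
        String prefix = P.substring(0, j);   // P[0..(j-1)]
        int k = maxOverlapLength(prefix);
        return new int[] { i - k, k };
    }

    public static void main(String[] args) {
        int[] r = fastSlide("abadabacb", 7, 7);
        System.out.println("i0 = " + r[0] + ", j = " + r[1]);  // prints i0 = 4, j = 3
    }
}
```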

• The KMP string matching algorithm

• If we incorporate the "fast slide" algorithm:

 ``` (When T[i] ≠ P[j] ): prefix = P[ 0..(j-1) ]; // Prefix of pattern at the mismatch k = MaxOverlap( prefix ); j = k; i0 = (i - j); ```

in the basic (brute-force) algorithm to speed up the handling of a mismatch, we obtain the Knuth-Morris-Pratt (KMP) algorithm:

 ```
 KMP( T, P )
 {
    int i0, i, j, m, n;

    n = T.length();
    m = P.length();

    i0 = 0;         // Line P up with the first character of T
    i  = 0;         // Start matching with first char in T
    j  = 0;         // Start matching with first char in P

    while ( i < n )   // Not all characters used
    {
       if ( T[i] == P[j] )
       {
          i++;                 // Match next pair
          j++;

          if ( j == m )
             return ( i0 );    // Match found at position i0 !!!
       }
       else
       {
          /* ===========================================
             T[i] ≠ P[j]
             =========================================== */
          if ( j == 0 )
          {
             /* ==============================================
                First character already mismatched
                We have NO prefix info. to work with...
                ============================================== */
             i0++;             // Just slide P 1 character over
             i = i0;
             j = 0;
          }
          else
          {
             prefix = P[ 0..(j-1) ];     // Prefix of pattern at the mismatch

             k = MaxOverlap( prefix );

             j = k;
             i0 = (i - j);               // i is unchanged !
          }
       }
    }

    return -1;      // No match found
 }
 ```

We will work through an example after first discussing the KMP failure function.
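For readers who want to run this version, here is a hedged Java sketch of the pseudocode above. The class name `KmpOnTheFly`, the method names, and the treatment of an empty pattern are assumptions made for this illustration. The overlap is recomputed naively at every mismatch, which is exactly the inefficiency the failure function removes.

```java
// Runnable sketch of the KMP loop with MaxOverlap computed at each mismatch.
class KmpOnTheFly {
    // Length of the longest proper suffix of pre that is also a prefix of pre.
    static int maxOverlapLength(String pre) {
        for (int len = pre.length() - 1; len > 0; len--) {
            if (pre.startsWith(pre.substring(pre.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    // Returns the first position of P in T, or -1 if P does not occur.
    static int search(String T, String P) {
        int n = T.length(), m = P.length();
        if (m == 0) return 0;            // assumption: empty pattern matches at 0
        int i0 = 0, i = 0, j = 0;
        while (i < n) {
            if (T.charAt(i) == P.charAt(j)) {
                i++;
                j++;
                if (j == m) return i0;   // match found at position i0
            } else if (j == 0) {
                i0++;                    // no prefix info: slide 1 character
                i = i0;
            } else {
                int k = maxOverlapLength(P.substring(0, j));
                j = k;                   // resume after the overlapping part
                i0 = i - j;              // i is unchanged
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("abacaabaccabacabaabb", "abacab"));      // prints 10
        System.out.println(search("abadababaccabacabaabb", "abadabacb"));  // prints -1
    }
}
```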

• Prelude to "KMP failure function"

• Consider this part of the KMP algorithm:

 ```
 .....
 else
 {
    prefix = P[ 0..(j-1) ];     // Prefix of pattern at the mismatch

    k = MaxOverlap( prefix );

    j = k;
    i0 = (i - j);
 }
 .....
 ```

• Observation:

• There is a finite number of possible prefixes that you can obtain from:

 ``` prefix = P[ 0..(j-1) ]; // Prefix of pattern at the mismatch ```

Example:

 ```
 P = abadabacb

 Possible prefixes:

    j-1=0    a            <---- this prefix is not useful...
    j-1=1    ab
    j-1=2    aba
    j-1=3    abad
    j-1=4    abada
    j-1=5    abadab
    j-1=6    abadaba
    j-1=7    abadabac
    j-1=8    abadabacb    <---- this prefix is used to find multiple occurrences
 ```

Consequently:

 We are computing MaxOverlap(...) for some prefixes over and over again !!!

• Better strategy:

 We pre-compute the MaxOverlap(...) values once for every possible prefix.

 Store the computed MaxOverlap(...) values !!!

• Note:

 The pre-computed MaxOverlap( prefix ) function values are known as the KMP failure function

• The KMP failure function (= how much of the matched prefix can be re-used when there is a mismatch)

• Failure function of a pattern P

• Let P   =   p0 p1 p2 ... pk ... pm-1

• The failure function f(k) is defined as:

 ```
 f(k) = length of MaxOverlap( "p0 p1 p2 ... pk" )
 ```

which is the length of the longest suffix of "p1 p2 p3 ... pk" that is also a prefix of "p0 p1 p2 ... pk"

Note:

 We must exclude the first character p0 because the maximum overlap must be a proper suffix

• Example: computing the failure function

 ```
 Pattern:

    Position:  012345
    P:         abacab

    Prefix ending at pos k     Max overlap           f(k)
    ----------------------     --------------        ----
    k=1   ab                   (a)b
                                   ab                 0

    k=2   aba                  (a)ba
                                   aba                1

    k=3   abac                 (a)bac
                                     abac             0

    k=4   abaca                (a)baca
                                     abaca            1

    k=5   abacab               (a)bacab
                                     abacab           2

 Failure function:

       i  = | 0 | 1 | 2 | 3 | 4 | 5 |
  ----------+---+---+---+---+---+---+
     f(i) = | 0 | 0 | 1 | 0 | 1 | 2 |
 ```

Note:

 By default, we set: f(0) = 0, which will make the pattern P slide over by 1 character position
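The table above can be reproduced by applying the definition of f(k) directly to every prefix. The sketch below is deliberately naive and is not the clever algorithm that these notes discuss later; `FailureTableDemo` and `failureNaive` are names invented for this illustration.

```java
import java.util.Arrays;

// Naive tabulation of the failure function, straight from the definition.
class FailureTableDemo {
    // f[k] = length of the longest proper suffix of P[0..k]
    //        that is also a prefix of P[0..k].
    static int[] failureNaive(String P) {
        int m = P.length();
        int[] f = new int[m];            // f[0] stays 0 by default
        for (int k = 1; k < m; k++) {
            // Candidate overlap lengths, longest first; len <= k keeps it proper.
            for (int len = k; len > 0; len--) {
                // Compare P[0..len-1] with P[k+1-len..k].
                if (P.regionMatches(0, P, k + 1 - len, len)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        // prints [0, 0, 1, 0, 1, 2]
        System.out.println(Arrays.toString(failureNaive("abacab")));
    }
}
```

Each value is computed once and stored, so the matching loop can later look up f(j-1) instead of recomputing MaxOverlap.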

• The KMP Algorithm

• The following is the KMP algorithm:

 ```
 KMP( T, P )
 {
    int i0, i, j, m, n;

    n = T.length();
    m = P.length();

    compute failure function f(k) (for all prefixes);

    i0 = 0;         // Line P up with the first character of T
    i  = 0;         // Start matching with first char in T
    j  = 0;         // Start matching with first char in P

    while ( i < n )   // Not all characters used
    {
       if ( T[i] == P[j] )
       {
          i++;                 // Match next pair
          j++;

          if ( j == m )
             return ( i0 );    // Match found at position i0 !!!
       }
       else
       {
          /* ===========================================
             T[i] ≠ P[j]
             =========================================== */
          if ( j == 0 )
          {
             i0++;             // Slide 1 character over
             i = i0;
             j = 0;
          }
          else
          {
             // Fast slide using prefix information

             k = f(j-1);       // = MaxOverlap( P[ 0..(j-1) ] )
                               // If j=1, f(j-1) = 0 will make pattern P
                               // slide down 1 character
             j = k;
             i0 = (i - j);
          }
       }
    }

    return -1;      // No match found
 }
 ```

• Using the failure function to speed up pattern matching

• Example:

 ```
 Pattern P = abacab

 Failure function:     (see how it was computed above !!!)

       i  = | 0 | 1 | 2 | 3 | 4 | 5 |
  ----------+---+---+---+---+---+---+
     f(i) = | 0 | 0 | 1 | 0 | 1 | 2 |

 Text T: abacaabaccabacabaabb

 Matching procedure:

 (1)
              1
    01234567890123456789    (ruler)
    i=0
    |
    v
    abacaabaccabacabaabb
    abacab
    ^
    |
    j=0

    T[i] == P[j] ==> advance: i++, j++

 (2)
              1
    01234567890123456789    (ruler)
     i=1
     |
     v
    abacaabaccabacabaabb
    abacab
     ^
     |
     j=1

    T[i] == P[j] ==> advance: i++, j++

 .... and so on...

 (...4)
              1
    01234567890123456789    (ruler)
        i=4
        |
        v
    abacaabaccabacabaabb
    abacab
        ^
        |
        j=4

    T[i] == P[j] ==> advance: i++, j++

 (5)
              1
    01234567890123456789    (ruler)
         i=5
         |
         v
    abacaabaccabacabaabb
    abacab
         ^
         |
         j=5

    T[i=5] != P[j=5] ==> Don't change i
                         Set j = f(4) = 1 !!!
                         (Because matching prefix ended at pos 4 !!!)

    Result:
              1
    01234567890123456789    (ruler)
         i=5
         |
         v
    abacaabaccabacabaabb
        abacab
         ^
         |
         j=1

 (6)
              1
    01234567890123456789    (ruler)
         i=5
         |
         v
    abacaabaccabacabaabb
        abacab
         ^
         |
         j=1

    T[i=5] != P[j=1] ==> Don't change i
                         Set j = f(0) = 0 !!!

 (7)
              1
    01234567890123456789    (ruler)
         i=5
         |
         v
    abacaabaccabacabaabb
         abacab
         ^
         |
         j=0

    T[i=5] == P[j=0] ==> advance: i++, j++

 (8)
              1
    01234567890123456789    (ruler)
          i=6
          |
          v
    abacaabaccabacabaabb
         abacab
          ^
          |
          j=1

    T[i=6] == P[j=1] ==> advance: i++, j++

 .... and so on

 (9)
              1
    01234567890123456789    (ruler)
             i=9
             |
             v
    abacaabaccabacabaabb
         abacab
             ^
             |
             j=4

    T[i=9] != P[j=4] ==> j = f(3) = 0

 (10)
              1
    01234567890123456789    (ruler)
             i=9
             |
             v
    abacaabaccabacabaabb
             abacab
             ^
             |
             j=0

    ***** Attention: ******

    T[i=9] != P[j=0] ==> When j==0 (the first character fails to match),
                         we must try the NEXT character in the text: i++

    Result:
              1
    01234567890123456789    (ruler)
              i=10
              |
              v
    abacaabaccabacabaabb
              abacab
              ^
              |
              j=0
 ```
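The mismatch steps of this trace can be replayed in code. The following sketch is an illustration only: the class `KmpTraceDemo`, the log format, and the naive `failureNaive` helper are assumptions, not part of the notes. It records what happens at each mismatch and then reports where the pattern is found.

```java
import java.util.ArrayList;
import java.util.List;

// Replays the matching procedure and logs every mismatch (illustrative sketch).
class KmpTraceDemo {
    // Naive failure function, straight from the definition of f(k).
    static int[] failureNaive(String P) {
        int[] f = new int[P.length()];
        for (int k = 1; k < P.length(); k++) {
            for (int len = k; len > 0; len--) {
                if (P.regionMatches(0, P, k + 1 - len, len)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    // Returns one log line per mismatch, plus a final line with the outcome.
    static List<String> mismatchLog(String T, String P) {
        int[] f = failureNaive(P);
        List<String> log = new ArrayList<>();
        int i = 0, j = 0, n = T.length(), m = P.length();
        while (i < n && j < m) {
            if (T.charAt(i) == P.charAt(j)) {
                i++;
                j++;
            } else if (j == 0) {
                log.add("i=" + i + " j=0 -> i=" + (i + 1));   // try next text char
                i++;
            } else {
                int k = f[j - 1];
                log.add("i=" + i + " j=" + j + " -> j=" + k); // i is unchanged
                j = k;
            }
        }
        log.add(j == m ? "found at " + (i - j) : "not found");
        return log;
    }

    public static void main(String[] args) {
        for (String line : mismatchLog("abacaabaccabacabaabb", "abacab")) {
            System.out.println(line);
        }
    }
}
```

For the text and pattern of the example, the log lines correspond to steps (5), (6), (9) and (10) of the trace, followed by the position where the pattern is finally found.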

• The Knuth-Morris-Pratt exact string matching algorithm

• The KMP Algorithm:

 ```
 int KMP(String T, String P)
 {
    int[] f = KMP_failure_function(P);   // Discussed later !

    int i0 = 0;     // Position in T where P is currently lined up
    int i  = 0;     // Current position in T
    int j  = 0;     // Current position in P
    int n  = T.length();
    int m  = P.length();

    while ( i < n )
    {
       if ( P.charAt(j) == T.charAt(i) )
       {
          i++;
          j++;

          /* ------------------------
             Check if we found P
             ------------------------ */
          if ( j == m )
          {
             return( i0 );    // Found P at i0 in T !
          }
       }
       else
       {
          /* ---------------------------
             Fail to match at P[j]
             --------------------------- */
          if ( j == 0 )
          {
             /* -------------------------------------------------
                No prefix information ==> slide P up 1 position
                ------------------------------------------------- */
             i0++;            // Slide 1 character over
             i = i0;
             j = 0;           // This statement is not necessary...
          }
          else
          {
             /* -----------------------------------------------------
                Use prefix info to perform "fast slide"
                ----------------------------------------------------- */
             int k = f[j-1];  // Max Overlap (= length of matching prefix)

             j = k;           // Restart matching at character
                              // after matching prefix
             i0 = (i-j);      // P is now lined up at position (i-j) in T
                              // i is unchanged !
          }
       }
    }

    return(-1);     // No match found...
 }
 ```
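Since the failure function algorithm is only discussed later, the method above cannot be run on its own yet. As a stopgap, here is a self-contained sketch that plugs in a naive failure function computed straight from the definition. The class name `KmpSketch`, the helper `failureNaive`, and the empty-pattern check are assumptions for this illustration; this is not the course's KMP.java.

```java
// Self-contained KMP sketch using a naive (slow) failure function.
class KmpSketch {
    // f[k] = length of the longest proper suffix of P[0..k]
    //        that is also a prefix of P[0..k].
    static int[] failureNaive(String P) {
        int[] f = new int[P.length()];
        for (int k = 1; k < P.length(); k++) {
            for (int len = k; len > 0; len--) {
                if (P.regionMatches(0, P, k + 1 - len, len)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    // Returns the first position of P in T, or -1 if P does not occur.
    static int kmp(String T, String P) {
        int n = T.length(), m = P.length();
        if (m == 0) return 0;             // assumption: empty pattern matches at 0
        int[] f = failureNaive(P);        // pre-computed once, before matching
        int i0 = 0, i = 0, j = 0;
        while (i < n) {
            if (P.charAt(j) == T.charAt(i)) {
                i++;
                j++;
                if (j == m) return i0;    // found P at i0 in T
            } else if (j == 0) {
                i0++;                     // no prefix information: slide 1
                i = i0;
            } else {
                j = f[j - 1];             // table lookup instead of recomputation
                i0 = i - j;               // i is unchanged
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(kmp("abacaabaccabacabaabb", "abacab"));  // prints 10
        System.out.println(kmp("abc", "abcd"));                     // prints -1
    }
}
```

The only difference from the earlier on-the-fly version is the table lookup `f[j - 1]`: once a fast way to fill the table is available, the matching loop itself needs no further change.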

• Example Program: (Demo above code)

How to run the program:

 Right click on link(s) and save in a scratch directory

 To compile:   javac KMP.java

 To run:       java KMP

• One problem remains: algorithm to find the failure function

• We will discuss a clever algorithm to compute the failure function next....