### Computing the KMP failure function (f(k))

• Review: definition of f(k)

• f(k):

```
f(k) = MaxOverlap( "p0 p1 ... pk" )

where:

   "p0 p1 ... pk" = the prefix of length k+1 of pattern P
```

• Naive way to find f(k):

```
Given P = "p0 p1 ... pm-1"
Given k = 1, 2, ..., m-1          (k = 0 ==> f(0) = 0)

1. Extract the sub-pattern:  "p0 p1 ... pk"

2. Find the first (= largest) overlap:

   Try:          (p0) p1 p2 ... pk-1 pk
                      p0 p1 ... pk-2 pk-1 pk

   If (no match)

   Try:          (p0) p1 p2 ... pk-1 pk
                         p0 ... pk-3 pk-2 pk-1 pk

   And so on...

   The first overlap is the longest !
```
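The naive procedure above can be sketched in Java as follows. This is only an illustrative sketch: the class name `NaiveFailure` and the method name `compute` are made up for this example and are not part of the course code. It tries the overlap lengths from longest to shortest, so the first hit is the maximum overlap.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration (not part of the original notes).
class NaiveFailure {
    // Returns f[0..m-1], where f[k] is the longest proper overlap
    // of the prefix P[0..k] with itself.
    static int[] compute(String P) {
        int m = P.length();
        int[] f = new int[m];                  // f[0] stays 0
        for (int k = 1; k < m; k++) {
            String prefix = P.substring(0, k + 1);
            // Try overlap lengths k, k-1, ..., 1 (slide by 1, 2, ... positions).
            for (int len = k; len >= 1; len--) {
                String suffix = prefix.substring(k + 1 - len);
                if (prefix.startsWith(suffix)) {
                    f[k] = len;                // first overlap found is the longest
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        // prints [0, 1, 2, 0, 1, 2, 3]
        System.out.println(Arrays.toString(compute("aaabaaa")));
    }
}
```

Each f(k) is computed from scratch here, which is exactly the repeated work that the tricks in the rest of this section avoid.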

• The values f(k) are computed easily using existing prefix overlap information:

    f(0) = 0    (f(0) is always 0)
    f(1) is computed using the (already computed) value f(0)
    f(2) is computed using the (already computed) values f(0), f(1)
    f(3) is computed using the (already computed) values f(0), f(1), f(2)
    And so on.

• Relating f(k) to f(k−1)

• According to the definition, f(k) is the maximum overlap of the prefix "p0 p1 ... pk" with itself.

• Suppose that we know that: f(k−1) = x

In other words: the longest overlapping suffix and prefix in "p0 p1 ... pk-1" has x characters:

```
                f(k-1) = x characters
             <-------------------------->
p1 p2 p3 ... pk-x pk-x+1 pk-x+2 .... pk-1
             ^    ^      ^           ^
             |    |      |  equal    |
             v    v      v           v
             p0   p1     p2     .... px-1 px ... pk-1
```

\$64,000 question:

 Can we use the fact that f(k−1) = x to compute f(k) ?

Yes, because f(k) is computed using a similar prefix as f(k−1):

```
   prefix used to compute f(k-1)
 +----------------------------+
 |                            |
   p0 p1 p2 .... px-1 ... pk-1 pk
 |                                |
 +--------------------------------+
   prefix used to compute f(k)
```

Next, we will learn how to exploit this similarity to compute f(k).

• Fact between f(k) and f(k−1)

• Fact:

 f(k)   ≤   f(k−1) + 1

Proof: by contradiction

• Suppose f(k) > f(k−1) + 1

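In other words, the maximum overlap of "p0 p1 p2 .... pk-1 pk" has y = f(k) > f(k−1) + 1 characters: the first y characters of this prefix are equal to its last y characters, and the last pair of matching characters is p(y−1) and pk.

• In that case, if we remove the character pk from the prefix (together with the character p(y−1) that it was matched against), the remaining y − 1 characters still match. So we would have found an overlap of y − 1 > f(k−1) characters using the prefix "p0 p1 p2 .... pk-1". However: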

 f(k−1) = the maximum overlap using prefix "p0 p1 p2 .... pk-1"

Therefore: the length of this overlap can never be greater than f(k−1) !!!

 Contradiction !!!
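The fact can also be checked empirically. The sketch below is an illustration only (the names `OverlapBound`, `maxOverlap` and `boundHolds` are invented for this example): it computes each f(k) by brute force and confirms that f(k) never exceeds f(k−1) + 1. It is a sanity check on sample patterns, not a replacement for the proof.

```java
// Hypothetical helper class for illustration (not part of the original notes).
class OverlapBound {
    // Longest proper overlap of s with itself, found by brute force.
    static int maxOverlap(String s) {
        for (int len = s.length() - 1; len >= 1; len--) {
            if (s.startsWith(s.substring(s.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    // Returns true if f(k) <= f(k-1) + 1 holds for every k = 1, ..., m-1.
    static boolean boundHolds(String P) {
        int prev = 0;                                   // f(0) = 0
        for (int k = 1; k < P.length(); k++) {
            int cur = maxOverlap(P.substring(0, k + 1)); // f(k)
            if (cur > prev + 1) {
                return false;
            }
            prev = cur;
        }
        return true;
    }
}
```

For instance, `OverlapBound.boundHolds("ababyababa")` returns true: the values of f grow by at most 1 from one position to the next, although they may drop by any amount.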

• Computation trick 1

• Computation trick #1:

• Let us denote: f(k−1) = x

(Note: f(k−1) is equal to some value. The notation above simply gives a more convenient name for this value.)

If px == pk, then:

```
f(k) = x+1

(i.e., the maximum overlap of the prefix p0 p1 p2 .... pk-1 pk has x+1 characters)
```

Proof:

• Because of the given fact that: px == pk, we know that the prefix "p0 p1 p2 .... pk-1 pk" has x+1 matching characters:

```
             These x+1 characters match IF pk == px!
             <------------------------------->
p1 p2 p3 ... pk-x pk-x+1 pk-x+2 .... pk-1   pk
             ^    ^      ^           ^      ^
             |    |      |  equal    |      | equal
             v    v      v           v      v
             p0   p1     p2     .... px-1   px ... pk-1 pk
             |                          |
             +--------------------------+
        These characters match because f(k-1) = x
```

• Example 1:

```
 k       = 0123456
 Pattern = aaabaaa

 f(0) = 0    (f(0) is always equal to 0)       (x = 0)

         k=1
          |
          v
 f(1): (a)a        ===> f(1) = 0+1 = 1         (x = 1)
          aa
          ^
          |
         x=0

          k=2
           |
           v
 f(2): (a)aa       ===> f(2) = 1+1 = 2         (x = 2)
          aaa
           ^
           |
          x=1
```

• Example 2:

```
 k       = 0123456
 Pattern = aaabaaa

 f(3) = 0 because:   (a)aab
                        aaab

            k=4
             |
             v
 f(4): (a)aaba     ===> f(4) = 0+1 = 1         (x = 1)
             aaaba
             ^
             |
            x=0

             k=5
              |
              v
 f(5): (a)aabaa    ===> f(5) = 1+1 = 2
             aaabaa
              ^
              |
             x=1
```
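A single application of trick #1 can be written as a small Java helper. This is a sketch under assumptions: the class `TrickOne` and the method `apply` are hypothetical names, and the method assumes that the values f[0..k-1] have already been filled in. It returns −1 when px ≠ pk, i.e., when trick #1 does not apply.

```java
// Hypothetical helper class for illustration (not part of the original notes).
class TrickOne {
    // Applies trick #1 at position k (1 <= k < P.length()).
    // Assumes f[0..k-1] already hold the failure function values.
    // Returns x+1 if P[x] == P[k], where x = f[k-1]; returns -1 otherwise.
    static int apply(String P, int[] f, int k) {
        int x = f[k - 1];                 // overlap length of the previous prefix
        if (P.charAt(x) == P.charAt(k)) {
            return x + 1;                 // the overlap grows by one character
        }
        return -1;                        // trick #1 does not apply
    }
}
```

With the pattern aaabaaa and the values f(0) = 0, f(1) = 1, f(2) = 2, f(3) = 0, the call for k = 4 compares P[0] with P[4] and returns 1, as in Example 2. The call for k = 3 returns −1, because P[2] = a differs from P[3] = b.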

• Prelude to computation trick 2

• Before we learn the second (and final) computation trick, I want to use an example to illustrate the trick.

(This trick is a bit tricky :))

• Example:

• Consider the prefix ababyabab where f(8) = 4:

```
          012345678
 prefix = ababyabab

 f(8) = 4 because:

          ababyabab
               ababyabab
               <-->  4 characters overlap
```

• We want to compute f(9) using f(8), but now the next character does not match:

```
          0123456789
 prefix = ababyababa

          ababyababa
               ababyababa
                   ^
                   |
                   a (position 9) != y (position 4)

 Conclusion:   *** We CANNOT use f(8) to compute f(9) ***
```

\$64,000 question:

 What should we try next to find the maximum overlap for the prefix "ababyababa"?

Answer:

 To find the maximum overlap, we must slide the prefix down and look for matching letters !!!

• Now, let us use only the matching prefix information:

```
   ababyababa
        ababyababa

 Look only at these characters:

   ?????abab?
        abab??????
```

• We can know for sure that the overlap cannot be found starting at these positions:

```
   ?????abab?
         abab??????      (slid 1 more position: "bab" would have to equal "aba")
```

• The first possible way that overlap can be found is starting here:

```
   ?????abab?
          abab??????     (slid 2 more positions: "ab" equals "ab")
```

• In other words: we can compute f(9) using f(3):

• f(3) = 2:

```
          0123
 prefix = abab

          abab
            abab

 f(3) = 2
```

(Notice that: 3 = 4−1 and f(8) = 4)

• How does this work ???

• It works exactly as when we try to compute f(9) using f(8)

Worked out further:

• Consider the prefix abab where f(3) = 2:

```
          0123
 prefix = abab

 f(3) = 2 because:

          abab
            abab
            --  2 characters overlap
```

• We want to compute f(9) using f(3):

```
          0123456789
 prefix = ababyababa

          ababyababa
                 ababyababa
                   ^
                   |
                   compare the character at position 2   (f(3) = 2)
```

Because the characters are equal, we have found the maximum overlap:

 ``` f(9) = f(3) + 1 = 2 + 1 = 3 !!! ```
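The sequence of comparisons in this example can be traced with a short Java sketch. The names `FallbackTrace` and `positionsTried` are hypothetical and chosen only for this illustration. The method assumes that f[0..k-1] are already known and lists, in order, every position x whose character is compared against P[k].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration (not part of the original notes).
class FallbackTrace {
    // Lists the positions x that are compared against P[k], in order.
    // Assumes f[0..k-1] already hold the failure function values.
    static List<Integer> positionsTried(String P, int[] f, int k) {
        List<Integer> tried = new ArrayList<>();
        int x = f[k - 1];                     // first candidate: x = f(k-1)
        while (true) {
            tried.add(x);
            if (P.charAt(x) == P.charAt(k) || x == 0) {
                break;                        // match found, or nothing left to try
            }
            x = f[x - 1];                     // next candidate: use the prefix p0 ... px-1
        }
        return tried;
    }

    public static void main(String[] args) {
        int[] f = {0, 0, 1, 2, 0, 1, 2, 3, 4};   // f(0) ... f(8) for "ababyababa"
        // prints [4, 2]
        System.out.println(positionsTried("ababyababa", f, 9));
    }
}
```

The output [4, 2] mirrors the example: position 4 (from f(8) = 4) fails, then position 2 (from f(3) = 2) matches, which gives f(9) = 2 + 1 = 3.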

• Computation trick #2

• Computation trick #2:

• Let: f(k−1) = x

(Note: f(k−1) is equal to some value. The notation above simply gives a more convenient name for this value.)

If px ≠ pk, then:

• The next prefix that can be used to compute f(k) is:

 ``` p0 p1 .... px-1 ```

In pseudo code:

```
   i = k-1;          // Try to use f(k-1) to compute f(k)
   x = f(i);         // x = character position to match against pk

   if ( P[k] == P[x] ) then
      f(k) = x + 1          // i.e., f(k) = f(i) + 1
   else
      Use:  p0 p1 .... px-1  to compute f(k)

      What that means in terms of program statements:

         i = x-1;    // Try to use f(x-1) to compute f(k)
         x = f(i);   // x = character position to match against pk
```

• Note:

• We must repeat trick #2 as long as px ≠ pk and i ≥ 0

In other words:

 use a while loop instead of an if statement !

• Algorithm to compute KMP failure function

• Pseudo code:

```
   public static int[] KMP_failure_function( P )
   {
      int k, i, x, m;
      int f[] = new int[P.length()];   // f[] stores the function values

      m = P.length();

      f[0] = 0;                        // f(0) is always 0...

      for ( k = 1; k < m; k++ )
      {
         // Compute f(k) and store in f[k]

         i = k-1;                      // Try to use f(k-1) to compute f(k)
         x = f[i];                     // Character position to match against P[k]

         if ( P[k] == P[x] )           // Note: make sure x is valid
         {
            f[k] = f[i] + 1;
            continue;                  // Compute next f(k) value
         }
         else
         {
            i = x-1;                   // Try next prefix (and next f(i)) to compute f(k)
            x = f[i];                  // Character position to match against P[k]
         }

         if ( P[k] == P[x] )           // Note: make sure x is valid
         {
            f[k] = f[i] + 1;
            continue;                  // Compute next f(k) value
         }
         else
         {
            i = x-1;                   // Try next prefix (and next f(i)) to compute f(k)
            x = f[i];                  // Character position to match against P[k]
         }

         ....   (obviously we will make this into a loop !!!)
      }
   }
```

• Java code:

```
   public static int[] KMP_failure_function(String P)
   {
      int k, i, x, m;
      int f[] = new int[P.length()];

      m = P.length();

      if ( m == 0 )
         return f;                // Empty pattern: nothing to compute

      f[0] = 0;                   // f(0) is always 0

      for ( k = 1; k < m; k++ )
      {
         // Compute f[k]

         i = k-1;                 // First try to use f(k-1) to compute f(k)
         x = f[i];

         while ( P.charAt(x) != P.charAt(k) )
         {
            i = x-1;              // Try the next candidate f(.) to compute f(k)

            if ( i < 0 )          // Make sure i is a valid index
               break;             // STOP the search !!!

            x = f[i];
         }

         if ( i < 0 )
            f[k] = 0;             // No overlap at all: max overlap = 0 characters
         else
            f[k] = f[i] + 1;      // We can compute f(k) using f(i)
      }

      return(f);
   }
```
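For comparison, the same computation is often written in a more compact form that keeps only the current overlap length x and drops the index i. The sketch below is that common variant, not the code from these notes. The class name `FailureCompact` and the method name `failure` are made up for this example.

```java
import java.util.Arrays;

// Hypothetical class for illustration: a compact variant, not the notes' code.
class FailureCompact {
    static int[] failure(String P) {
        int m = P.length();
        int[] f = new int[m];            // f[0] stays 0
        int x = 0;                       // x = f(k-1) at the start of each iteration
        for (int k = 1; k < m; k++) {
            // Trick #2: fall back to shorter prefixes while the characters differ.
            while (x > 0 && P.charAt(x) != P.charAt(k)) {
                x = f[x - 1];
            }
            // Trick #1: extend the overlap by one character on a match.
            if (P.charAt(x) == P.charAt(k)) {
                x++;
            }
            f[k] = x;
        }
        return f;
    }

    public static void main(String[] args) {
        // prints [0, 0, 1, 2, 0, 1, 2, 3, 4, 3]
        System.out.println(Arrays.toString(failure("ababyababa")));
    }
}
```

The fallback step `x = f[x - 1]` plays the role of the two statements `i = x-1; x = f[i];`, and the loop condition `x > 0` replaces the test `i < 0`.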

• Example Program: (Demo of the above code)

How to run the program:

    To compile:   javac ComputeF.java
    To run:       java ComputeF

Example:

```
>>> java ComputeF

P = ababyababa
-----------------------------------------------
Prefix = ab --- Computing f(1):
===================================
Try using: f(0) = 0
=====================================
Matching: i = 1, j = 0
   01
   ab
    ab
    01
    ^
    |
No overlap possible... --> f[1] = 0
-----------------------------------------------
.......
-----------------------------------------------
Prefix = ababyababa --- Computing f(9):
===================================
Try using: f(8) = 4
=====================================
Matching: i = 9, j = 4
   0123456789
   ababyababa
        ababyababa
        0123456789
            ^
            |
===================================
Try using: f(3) = 2
=====================================
Matching: i = 9, j = 2
   0123456789
   ababyababa
          ababyababa
          0123456789
            ^
            |
Overlap found ... --> f[9] = 3
```