### Prelude to String Matching: string traversal technique

• Preliminaries

• Before we start learning fast string matching algorithms, let us be clear about some notation

• Input text and the pattern:

```
T    = input text
P    = pattern (that we need to find in the text T)

T[i] = the i-th character of T
P[j] = the j-th character of P

T[i] == P[j] means: check if the i-th character of T and
                    the j-th character of P are equal.
```
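In Java, a `String` cannot be indexed with square brackets, so the test `T[i] == P[j]` is written with `charAt`. The following is a minimal sketch of that translation; the class name `Notation`, the method name `charsMatch`, and the sample strings are assumptions chosen only for illustration.

```java
class Notation {
    // Returns true when the i-th character of T equals the j-th character of P.
    // Indices are 0-based, matching the T[i] / P[j] notation above.
    static boolean charsMatch(String T, int i, String P, int j) {
        return T.charAt(i) == P.charAt(j);
    }

    public static void main(String[] args) {
        String T = "abcabcabd";   // hypothetical input text
        String P = "abd";         // hypothetical pattern

        System.out.println(charsMatch(T, 8, P, 2));  // compares T[8] with P[2]
        System.out.println(charsMatch(T, 2, P, 2));  // compares T[2] with P[2]
    }
}
```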

• You will often see pictures that look like this:

• Example:

(The top line is a ruler that helps you quickly identify the character at a given index.)

The above picture depicts the following state:

• We are checking on whether:

 ``` T[7] == P[2] ```

Notice that in order to compare T[7] == P[2], we must line up T[5] with P[0]:

```
T[0] T[1] T[2] T[3] T[4] T[5] T[6] T[7] ....
                         P[0] P[1] P[2] ....
```

• Action taken when we find a match

• The action that is taken when the characters match is always:

 If we have matched the last character, then pattern P has been found !!! Otherwise, check if the next pair of characters match.

• Example 1: we matched T[0] == P[0], try next pair of characters: T[1] and P[1]

• Example 2: we matched T[7] == P[2], try next pair of characters: T[8] and P[3]

• Example: a situation where we matched all characters in the pattern P successfully

Observations: (what we learned from the above examples)

• The following test is used to determine whether all characters in pattern P have been matched:

```
j == P.length       (P[j] is the character in P that you try to match)
```

• The following statements are used to move to the next pair of characters in the pattern and the input string:

```
i++;     // use the next character in input string T
j++;     // use the next character in pattern P
```

• I can now sketch a portion of the KMP algorithm:

```
while ( not all characters matched )
{
   if ( T[i] == P[j] )
   {
      /* --------------------------------------
         Try to match NEXT pair of characters
         -------------------------------------- */
      i++;
      j++;

      if ( j == P.length )
      {
         P has been found !
         (You can quit, or keep going to find another match)
      }
   }
   ....
}
```
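The match step on its own can be sketched in Java as a check for one fixed alignment of the pattern. This is only an illustration of the "advance both indices, then test `j == P.length`" idea, not the KMP algorithm itself; the class name `MatchStep`, the method name `matchesAt`, and the parameter `start` are assumptions, and the pattern is assumed to be non-empty.

```java
class MatchStep {
    // Assumption: P is non-empty and 0 <= start <= T.length().
    // Returns true when P occurs in T with P[0] lined up under T[start].
    static boolean matchesAt(String T, String P, int start) {
        int i = start;  // current position in the input text T
        int j = 0;      // current position in the pattern P

        while (i < T.length() && T.charAt(i) == P.charAt(j)) {
            i++;        // use the next character in T
            j++;        // use the next character in P

            if (j == P.length()) {
                return true;   // every character of P has been matched
            }
        }
        return false;   // a pair did not match, or T ran out first
    }
}
```

For example, with the hypothetical text `"abcabcabd"` and pattern `"abd"`, `matchesAt` succeeds for `start = 6` and fails for `start = 0`.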

• String traversal helper variable: i0

• When studying string matching algorithms, introducing the following variable makes the algorithm clearer:

 i0 = the character position in T that is lined up with the first character P[0] in P

• Example:

• In the following state:

we have i0 = 5

• In the following state:

we have i0 = 10

• Text book algorithms

• If you read the string matching algorithm in many textbooks, the update procedure for the i and j variables in the mismatch case is a very complex expression

 The reason is mainly that they try to use only i and j, i.e., they do not use the i0 variable that I introduced above.

• In fact, the i0 variable is not necessary because:

 ``` i0 = i - j ```
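The relation can be checked with the numbers used earlier: when T[7] is compared with P[2], P[0] is lined up with T[5]. The small Java sketch below just evaluates `i - j`; the class name `Alignment` and the method name `alignmentStart` are assumptions made for this illustration.

```java
class Alignment {
    // Position in T that is lined up with P[0] while T[i] is compared with P[j].
    static int alignmentStart(int i, int j) {
        return i - j;
    }

    public static void main(String[] args) {
        // Comparing T[7] with P[2]: P[0] sits under T[5].
        System.out.println(alignmentStart(7, 2));   // prints 5

        // After a match both i and j advance by one, so i - j is unchanged.
        System.out.println(alignmentStart(8, 3));   // prints 5
    }
}
```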

Example:

• However:

 when you use i − j in an expression, it's hard to tell what the purpose of the statement is

Example:

```
i = i - j + 1;
j = 0;
```

\$64,000 question:

 What did the above statements accomplish ??? (Answer is given below)

• What to do when characters do not match...

• Fact:

 If the characters T[i] and P[j] do not match, we can safely (= without making a logic error) "slide" the pattern one position further

Example:

• This can be achieved by the following statements:

```
i0 = i0 + 1;   // Move pattern P one character further
i  = i0;       // Restart matching at position i0 in T
j  = 0;        // Restart matching at position 0 in P
```

• BTW, the answer to the \$64,000 question is:

```
i = i - j + 1;
j = 0;

achieves the same result as:

i0 = i0 + 1;   // Move pattern P one character further
i  = i0;       // Restart matching at position i0 in T
j  = 0;        // Restart matching at position 0 in P

because: i0 is equal to i - j
```
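To make the equivalence concrete, the sketch below applies both forms of the mismatch step to one state and compares the results. It reuses the state from the earlier example (i = 7, j = 2, so i0 = 5); the class name `SlideEquivalence` and both method names are assumptions introduced only for this illustration.

```java
class SlideEquivalence {
    // Mismatch step written with the helper variable i0.
    // Returns the new values as {i0, i, j}.
    static int[] slideWithI0(int i0) {
        i0 = i0 + 1;   // move pattern P one character further
        int i = i0;    // restart matching at position i0 in T
        int j = 0;     // restart matching at position 0 in P
        return new int[] { i0, i, j };
    }

    // The same step written with i and j only.
    // Returns the new values as {i, j}.
    static int[] slideWithoutI0(int i, int j) {
        i = i - j + 1; // (i - j) is the old i0, so this is i0 + 1
        j = 0;
        return new int[] { i, j };
    }

    public static void main(String[] args) {
        int[] a = slideWithI0(5);        // state: i0 = 5
        int[] b = slideWithoutI0(7, 2);  // same state: i = 7, j = 2
        System.out.println(a[1] + " " + a[2]);  // prints 6 0
        System.out.println(b[0] + " " + b[1]);  // prints 6 0
    }
}
```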

• The brute-force text matching algorithm --- revisited

• Here is the brute force text matching algorithm (tries every starting position) written using the text traversal techniques described above:

```
Basic( T, P )
{
   int i0, i, j, m, n;

   n = T.length();
   m = P.length();

   i0 = 0;      // Line P up with the first character of T
   i  = 0;      // Start matching with first char in T
   j  = 0;      // Start matching with first char in P

   while ( i < n )           // Not all characters used
   {
      if ( T[i] == P[j] )
      {
         i++;                // Match next pair
         j++;

         if ( j == m )
            return ( i0 );   // Match found at position i0 !!!
      }
      else
      {
         /* ===========================================
            P does not start at position i0...
            Try another position by moving P further
            =========================================== */
         i0 = i0 + 1;        // Move pattern P one character further
         i  = i0;            // Restart matching at position i0 in T
         j  = 0;             // Restart matching at position 0 in P
      }
   }

   return -1;                // No match found
}
```
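The pseudocode above translates to Java fairly directly. The following is one possible runnable rendering, not necessarily the demo program's code: the class name `BasicSearch` and the method name `find` are assumptions, `charAt` replaces the `T[i]` notation, and the early return for an empty pattern is an added assumption because the pseudocode does not cover that case.

```java
class BasicSearch {
    // Returns the first position in T where P starts, or -1 if P does not occur.
    static int find(String T, String P) {
        int n = T.length();
        int m = P.length();

        if (m == 0) {
            return 0;   // assumption: an empty pattern is reported at position 0
        }

        int i0 = 0;     // position in T lined up with P[0]
        int i = 0;      // current position in T
        int j = 0;      // current position in P

        while (i < n) {
            if (T.charAt(i) == P.charAt(j)) {
                i++;    // match next pair
                j++;
                if (j == m) {
                    return i0;   // match found at position i0
                }
            } else {
                i0 = i0 + 1;     // move pattern P one character further
                i = i0;          // restart matching at position i0 in T
                j = 0;           // restart matching at position 0 in P
            }
        }
        return -1;      // no match found
    }

    public static void main(String[] args) {
        System.out.println(find("abcabcabd", "abd"));  // prints 6
        System.out.println(find("aaa", "b"));          // prints -1
    }
}
```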

• Example Program: (Demo above code)

How to run the program:

 Right click on link(s) and save in a scratch directory

 To compile:   javac Basic.java

 To run:          java xx

• Here is the same program, without using the i0 variable: click here
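As a rough sketch of what such a version can look like (an illustration written for these notes, not the linked program), the variable i0 is replaced everywhere by the expression `i - j`. The class name `BasicSearchNoI0` and the method name `find` are assumptions, and the empty-pattern check is again an added assumption.

```java
class BasicSearchNoI0 {
    // Returns the first position in T where P starts, or -1 if P does not occur.
    static int find(String T, String P) {
        int n = T.length();
        int m = P.length();

        if (m == 0) {
            return 0;   // assumption: an empty pattern is reported at position 0
        }

        int i = 0;      // current position in T
        int j = 0;      // current position in P

        while (i < n) {
            if (T.charAt(i) == P.charAt(j)) {
                i++;    // match next pair
                j++;
                if (j == m) {
                    return i - j;   // i - j is the position lined up with P[0]
                }
            } else {
                i = i - j + 1;      // same as: i0 = i0 + 1; i = i0;
                j = 0;              // restart matching at position 0 in P
            }
        }
        return -1;      // no match found
    }
}
```

The cost of dropping i0 is readability: the reader has to recognize `i - j` as the alignment position each time it appears.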