### Matching "Similar" Strings

• Problem description

• Similarity between 2 strings

• Given 2 strings:

 ```
 S1 = .....
 S2 = ............
 ```

• How similar are the strings S1 and S2?

• Weird example:

 ```
 S1 = fish
 S2 = ghoti
 ```

We can argue that ghoti is pronounced as fish in English:

 ```
 enough: gh is pronounced as f
 women:  o  is pronounced as i
 potion: ti is pronounced as sh
 ```

• Classes of algorithms for matching similar strings

• Classes of algorithms:

• Equivalence Methods

Representatives:

• Word Stemming: reduces closely related words to a basic canonical form or 'stem'.

Example:

 swim and swimming

• Soundex algorithm: attempts to match strings that sound alike.

Example:

 Robert and Rupert (both have the Soundex code R163)

• Similarity ranking methods

Representatives:

• Longest Common Substring (between 2 strings): the longest contiguous chain of characters shared by both strings. The longer the common substring, the better the match between the two strings. (Will be discussed in detail later.)

• Edit Distance: the minimum number of edit operations that it would take to transform one string into the other. (Will be discussed in detail later.)
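Both ranking measures can be sketched with a small dynamic-programming table. The following is a minimal sketch, not a definitive implementation: the function names are illustrative, and the edit distance assumes the Levenshtein set of operations (insert, delete, or substitute one character, each costing 1), which the definition above does not spell out.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Return one longest contiguous run of characters found in both strings."""
    best_len, best_end = 0, 0          # length and end index (in s1) of the best run
    prev = [0] * (len(s2) + 1)         # run lengths ending at the previous char of s1
    for i, a in enumerate(s1, start=1):
        curr = [0] * (len(s2) + 1)
        for j, b in enumerate(s2, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1      # extend the run ending just before
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s1[best_end - best_len:best_end]


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance (assumed operation set: insert, delete, substitute)."""
    prev = list(range(len(s2) + 1))    # distance from the empty prefix of s1
    for i, a in enumerate(s1, start=1):
        curr = [i]                     # delete all i characters of s1 so far
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute, or free match
        prev = curr
    return prev[-1]


print(longest_common_substring("swimming", "swim"))  # swim
print(edit_distance("fish", "ghoti"))                # 5
```

Note that both measures compare spelling only: fish and ghoti share no substring longer than one character and are 5 edits apart, even though they can be argued to sound the same.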

• The Soundex Algorithm

• Soundex code

 Soundex code = letter digit digit digit

 letter = the first letter of the word

 3 digits = encode the consonants

• Soundex encoding algorithm:

• The first letter of the word is the letter of the Soundex code

The first letter is not coded to a number.

• Consonants are coded as follows:

 b, f, p, v ⇒ 1 (labial consonants)

 c, g, j, k, q, s, x, z ⇒ 2

 d, t ⇒ 3

 l ⇒ 4

 m, n ⇒ 5

 r ⇒ 6

 h, w ⇒ not coded

• Two adjacent letters from the same group are coded as a single number.

• Two letters from the same group separated by an h or w are coded as a single number.

• If you run out of consonants, add zeros to the code.

• Examples:

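• "Robert" = R163

 R is kept as the letter, b ⇒ 1, r ⇒ 6, t ⇒ 3 (the vowels o and e are not coded)

• "Rupert" = R163

 R is kept as the letter, p ⇒ 1, r ⇒ 6, t ⇒ 3 (the vowels u and e are not coded)

(So they sound similar.)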
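The encoding rules above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation. The names `soundex` and `CODES` are illustrative, and two details go beyond what is stated above and follow the common American Soundex convention: vowels (a, e, i, o, u, y) are not coded but do separate two letters of the same group, and the first letter's own group suppresses an immediately following letter of the same group.

```python
# Digit for each coded consonant, using the groups listed above.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the Soundex code of `word`: first letter plus three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()           # the first letter is kept, not coded
    prev = CODES.get(letters[0], "")    # assumption: its group still blocks a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w are skipped and do not split a group
        digit = CODES.get(ch, "")       # vowels (assumed uncoded) give ""
        if digit and digit != prev:
            code += digit               # new group: emit its digit
        prev = digit                    # a vowel resets prev, so a repeat is coded again
    return (code + "000")[:4]           # pad with zeros, keep three digits


print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163
```

Because matching is reduced to comparing two short codes, this approach gives a yes/no answer (same code or not) rather than a ranking, which is why it is grouped with the equivalence methods.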