fuzzy string matching

Jared Boehm (jtboehm@kent.edu)
Thu, 27 May 1999 12:12:26 +0200 (MET DST)

Hi, I'm just a lowly undergraduate CS major, and I've yet to find a decent
book on fuzzy logic around here.

I think I've gleaned enough knowledge to at least have some sort of
foundation. I'm currently using fuzzy logic, or what I perceive fuzzy
logic to be, for string matching. Am I way off base? I guess I'm looking
for a little guidance or encouragement or both. Anyways, what follows can
also be found at http://www.personal.kent.edu/~jtboehm/fuzzy.html, which
details my endeavors.

humbly yours,

Jared Boehm

--------------------------------------------------------------------------
introduction

Fuzzy Logic is currently an area of interest to me. Right now I am sort of
stumbling along, which in all frankness is part of the fun. Traditional
boolean logic states that either a fact is true or that it is false. While
this sort of logic is fine for many applications, there are times when one
may wish to know how true or how false is a rule.
My current project involves fuzzy pattern matching of strings. The idea is
that traditional string comparisons (such as c/c++'s strcmp) return only a
boolean value true or fals as to its equivalence. Given two strings, for
example, "cemetery" and "cemetary" -- a common misspelling, strcmp would
return false, while the casual observer can fairly accurately discern the
intended spelling.

The applications of this are fairly simple. I assume most word-processors
use a spell checker based on fuzzy logic, to find and suggestion possible
correct spellings. Compiler parsing and error correction could also
implement a similar fuzzy matching system to handle syntax errors.

evaluation methods
The naive method for comparing strings begins with the first element, and
continues until a.) the end of the string is reached, or b.) a difference
is encountered.
For a given string of length n, one can perform a boolen match on each
element on a one-to-one basis with the second string, returning 1 for a
match, and 0 for a mismatch. Returning to the above example of cemetery
and cemetary:

string 1 c e m e t e r y
string 2 c e m e t a r y

--------------------------------------------------------------------------------

results 1 1 1 1 1 0 1 1

summing up the individual results yields:

1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 7
divided by n (the length of the string) yields a value of .875 -- 87.5% of
the characters match.

This method only works in cases in which there is an error in which only a
correct character is replaced by an incorrect one . It fails miserably
when a character is omitted from a string:

string 1 c e m e t e r y
string 2 e m e t a r y

--------------------------------------------------------------------------------

results 0 0 0 0 0 0 0 0

or when two characters are swapped:

string 1 c e m e t e r y
string 2 c e m e e t r y

--------------------------------------------------------------------------------

results 1 1 1 1 0 0 1 1 = .75

We can improve on this by comparing characters relative to its current
position. In my algorithm, not only do i examine the character at location
n, but also at n-1 and n+1 if the first comparison fails. If there is a
match, we can't just assign a true value, but some degree of truthfulness.
My algorithm gives a 25% penalty on matches made at an offset of +1/-1.
This increases the accuracy of the comparison considerably. The above
example would yield:

string 1 c e m e t e r y
string 2 c e m e e t r y

--------------------------------------------------------------------------------

results 1 1 1 1 .75 .75 1 1 = .9375

and a marked increase when a character is ommitted:

string 1 c e m e t e r y
string 2 e m e t e r y

--------------------------------------------------------------------------------

results 0 .75 75 .75 .75 .75 .75 .75 = .65625

Finally, the last method I have implemented is phonetic matching. In the
comparison of "cemetery" and "cemetary" we can easily see that the
intended spelling was "cemetery" and that a simple error based on similar
sounds in the English language was made. My algorithm tries to estimate
how close a letter sounds to another. For example, a "c" may sound like a
"s", or a "s" may sound like a "z". Ideally these values would be derived
from statistical analysis of spelling errors, but I, especially at my low
level of expertise in this area, lack the desire to do so. In my algorithm
I have assigned a value of .66 for the phonetical match between an A and
an E, yielding:

string 1 c e m e t e r y
string 2 c e m e t a r y

--------------------------------------------------------------------------------

results 1 1 1 1 1 .66 1 1 = .9575

This method can also be extended to consonant sounds generanted w/
multiple consonants: "ff" for "gh", "ph" for "p", "wr" for "r".

about the script:
I am currently working on a Javascript demonstration to demonstrate the
performance of these methods, and the combination of them as well, which
will very shortly be online, as soon as i convert it from C++.

############################################################################
This message was posted through the fuzzy mailing list.
(1) To subscribe to this mailing list, send a message body of
"SUB FUZZY-MAIL myFirstName mySurname" to listproc@dbai.tuwien.ac.at
(2) To unsubscribe from this mailing list, send a message body of
"UNSUB FUZZY-MAIL" or "UNSUB FUZZY-MAIL yoursubscription@email.address.com"
to listproc@dbai.tuwien.ac.at
(3) To reach the human who maintains the list, send mail to
fuzzy-owner@dbai.tuwien.ac.at
(4) WWW access and other information on Fuzzy Sets and Logic see
http://www.dbai.tuwien.ac.at/ftp/mlowner/fuzzy-mail.info
(5) WWW archive: http://www.dbai.tuwien.ac.at/marchives/fuzzy-mail/index.html