# Re: fuzzy string matching

Breezy (weiz@spam-block.sympac.com.au)
Wed, 2 Jun 1999 11:37:36 +0200 (MET DST)

hi lowly

I too am not an expert but would tend to view your problem in terms of how
well an attempt at a word matched a universe of possible words.

I would break down the match to a number of properties like word length,
character placement, runs of same or even similar characters etc.. .....

Take one property at a time [say length] one could talk (in a fuzzy way)
about how well the length of the input word (trial word) matched that to
which it was being compared (a comparer).

c e m e t e r y & c e m e t a r y.

Cemetery has 8 characters and so I would plot a graphs x axis of say 6 to 10
characters (for this word).

Any comparer that had a character length of 6 to 10 is then to a [fuzzy]
degree about the same character length (this is the rush of fuzzy for me-
non math talk "about the same" is technical fuzzy talk). Drawing one of
those triangular graphs with the peak (in the y axis) at 8 means that the
degree to which the trial words- character length property matched the
comparer words- character lengths property depends on how close the trial
words character length got to the comparers word length.

The aim of course is to be building up a set of rules joined with and & or
that an expert submits from experience. So our first rule is

If the character length of the trial word is about the same as the comparer
word AND next rule (next property).

We have with this first rule now reduced all possible word suggestions to
those words that are "about the same" character length. You defined the
meaning of "about the same" for character length. The Russians might have a
propensity to miss by more characters than the Yanks when they misspell a
word.

The rules stay the same and only the definition of "about the same" in terms
of character length changes.

It is a very sexy concept cause it so easily caters for differing views of
measure ( i.e. tall to a pigmy verses tall to a basket ball player) and it
allows you to think of a solution in the way you naturally think- fuzzy.

Jared Boehm wrote in message ...
>Hi, I'm just a lowly undergraduate CS major, and I've yet to find a decent
>book on fuzzy logic around here.
>
>I think I've gleaned enough knowledge to at least have some sort of
>foundation. I'm currently using fuzzy logic, or what I perceive fuzzy
>logic to be, for string matching. Am I way off base? I guess I'm looking
>for a little guidance or encouragement or both. Anyways, what follows can
>also be found at http://www.personal.kent.edu/~jtboehm/fuzzy.html, which
>details my endeavors.
>
>humbly yours,
>
>Jared Boehm
>
>--------------------------------------------------------------------------
>introduction
>
>Fuzzy Logic is currently an area of interest to me. Right now I am sort of
>stumbling along, which in all frankness is part of the fun. Traditional
>boolean logic states that either a fact is true or that it is false. While
>this sort of logic is fine for many applications, there are times when one
>may wish to know how true or how false is a rule.
>My current project involves fuzzy pattern matching of strings. The idea is
>that traditional string comparisons (such as c/c++'s strcmp) return only a
>boolean value true or fals as to its equivalence. Given two strings, for
>example, "cemetery" and "cemetary" -- a common misspelling, strcmp would
>return false, while the casual observer can fairly accurately discern the
>intended spelling.
>
>The applications of this are fairly simple. I assume most word-processors
>use a spell checker based on fuzzy logic, to find and suggestion possible
>correct spellings. Compiler parsing and error correction could also
>implement a similar fuzzy matching system to handle syntax errors.
>
>evaluation methods
>The naive method for comparing strings begins with the first element, and
>continues until a.) the end of the string is reached, or b.) a difference
>is encountered.
>For a given string of length n, one can perform a boolen match on each
>element on a one-to-one basis with the second string, returning 1 for a
>match, and 0 for a mismatch. Returning to the above example of cemetery
>and cemetary:
>
>
>string 1 c e m e t e r y
>string 2 c e m e t a r y
>
>---------------------------------------------------------------------------
-----
>
>results 1 1 1 1 1 0 1 1
>
>summing up the individual results yields:
>
>1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 7
>divided by n (the length of the string) yields a value of .875 -- 87.5% of
>the characters match.
>
>This method only works in cases in which there is an error in which only a
>correct character is replaced by an incorrect one . It fails miserably
>when a character is omitted from a string:
>
>
>string 1 c e m e t e r y
>string 2 e m e t a r y
>
>---------------------------------------------------------------------------
-----
>
>results 0 0 0 0 0 0 0 0
>
>or when two characters are swapped:
>
>
>string 1 c e m e t e r y
>string 2 c e m e e t r y
>
>---------------------------------------------------------------------------
-----
>
>results 1 1 1 1 0 0 1 1 = .75
>
>We can improve on this by comparing characters relative to its current
>position. In my algorithm, not only do i examine the character at location
>n, but also at n-1 and n+1 if the first comparison fails. If there is a
>match, we can't just assign a true value, but some degree of truthfulness.
>My algorithm gives a 25% penalty on matches made at an offset of +1/-1.
>This increases the accuracy of the comparison considerably. The above
>example would yield:
>
>string 1 c e m e t e r y
>string 2 c e m e e t r y
>
>---------------------------------------------------------------------------
-----
>
>results 1 1 1 1 .75 .75 1 1 = .9375
>
>and a marked increase when a character is ommitted:
>
>
>string 1 c e m e t e r y
>string 2 e m e t e r y
>
>---------------------------------------------------------------------------
-----
>
>results 0 .75 75 .75 .75 .75 .75 .75 = .65625
>
>Finally, the last method I have implemented is phonetic matching. In the
>comparison of "cemetery" and "cemetary" we can easily see that the
>intended spelling was "cemetery" and that a simple error based on similar
>sounds in the English language was made. My algorithm tries to estimate
>how close a letter sounds to another. For example, a "c" may sound like a
>"s", or a "s" may sound like a "z". Ideally these values would be derived
>from statistical analysis of spelling errors, but I, especially at my low
>level of expertise in this area, lack the desire to do so. In my algorithm
>I have assigned a value of .66 for the phonetical match between an A and
>an E, yielding:
>
>string 1 c e m e t e r y
>string 2 c e m e t a r y
>
>---------------------------------------------------------------------------
-----
>
>results 1 1 1 1 1 .66 1 1 = .9575
>
>This method can also be extended to consonant sounds generanted w/
>multiple consonants: "ff" for "gh", "ph" for "p", "wr" for "r".
>
>
>about the script:
>I am currently working on a Javascript demonstration to demonstrate the
>performance of these methods, and the combination of them as well, which
>will very shortly be online, as soon as i convert it from C++.
>
>
>

############################################################################
This message was posted through the fuzzy mailing list.
(1) To subscribe to this mailing list, send a message body of
"SUB FUZZY-MAIL myFirstName mySurname" to listproc@dbai.tuwien.ac.at
(2) To unsubscribe from this mailing list, send a message body of
"UNSUB FUZZY-MAIL" or "UNSUB FUZZY-MAIL yoursubscription@email.address.com"
to listproc@dbai.tuwien.ac.at
(3) To reach the human who maintains the list, send mail to
fuzzy-owner@dbai.tuwien.ac.at
(4) WWW access and other information on Fuzzy Sets and Logic see
http://www.dbai.tuwien.ac.at/ftp/mlowner/fuzzy-mail.info
(5) WWW archive: http://www.dbai.tuwien.ac.at/marchives/fuzzy-mail/index.html