Re: fuzzy string matching

Bruno DiStefano (bruno@ecf.utoronto.ca)
Fri, 28 May 1999 11:24:25 +0200 (MET DST)

I feel that you have a starting point. However, you are not
really using the fuzzy logic in what you show. You are
trying to determine the "error" of a reading. Yes, I know,
this is not what you have in mind. However, looking from the
outside, this is what it looks like.

Assuming a "correct reading" (i.e. cemetary), you are trying to
determine "how good" (fuzzy concept) is an "actual reading"
(i.e. cemetery). This is a start.

What you should be doing is to define:
1) your input(s)
2) your output(s)
3) ways in which (1) influence(s) (2)

If you say that (1) is "a string" (in the C/C++ sense),
than you have to classify the string. You could classify
the string depending on "how far" it is from the
intended string, "the true string". The C/C++ method
would say:
1) it is alphabetically antecendent (-1)
2) it is identical (0)
3) it is alphabetically subsequent (1)

You could say:
1) it is very different
2) it is mildy different
3) it is a bit different
4) it is identical

I stop here, because I do not like monologue.

I feel that this is an interesting problem
It has applications well beyond spell checkers.
Think about search engines.

I hope that more people are going to get involved
and try to derive a real fuzzy model.

Bye

Bruno Di Stefano

-- Bruno Di Stefano -- Private: au843@torfree.net IEEE:b.distefano@ieee.org
Courses: stefano@ecf.toronto.edu http://www.ecf.toronto.edu/~stefano
Research: bruno@ecf.toronto.edu http://www.ecf.toronto.edu/~bruno
Consulting: alawnicz@sympatico.ca http://www3.sympatico.ca/alawnicz/nuptek.htm
-----------------------------------------------------------------------------

>From dbai.tuwien.ac.at!fuzzy-mail Thu May 27 13:10:12 1999
>From: Jared Boehm <jtboehm@kent.edu>
>
>Hi, I'm just a lowly undergraduate CS major, and I've yet to find a decent
>book on fuzzy logic around here.
>
>I think I've gleaned enough knowledge to at least have some sort of
>foundation. I'm currently using fuzzy logic, or what I perceive fuzzy
>logic to be, for string matching. Am I way off base? I guess I'm looking
>for a little guidance or encouragement or both. Anyways, what follows can
>also be found at http://www.personal.kent.edu/~jtboehm/fuzzy.html, which
>details my endeavors.
>
>humbly yours,
>
>Jared Boehm
>
>--------------------------------------------------------------------------
>introduction
>
>Fuzzy Logic is currently an area of interest to me. Right now I am sort of
>stumbling along, which in all frankness is part of the fun. Traditional
>boolean logic states that either a fact is true or that it is false. While
>this sort of logic is fine for many applications, there are times when one
>may wish to know how true or how false is a rule.
>My current project involves fuzzy pattern matching of strings. The idea is
>that traditional string comparisons (such as c/c++'s strcmp) return only a
>boolean value true or fals as to its equivalence. Given two strings, for
>example, "cemetery" and "cemetary" -- a common misspelling, strcmp would
>return false, while the casual observer can fairly accurately discern the
>intended spelling.
>
>The applications of this are fairly simple. I assume most word-processors
>use a spell checker based on fuzzy logic, to find and suggestion possible
>correct spellings. Compiler parsing and error correction could also
>implement a similar fuzzy matching system to handle syntax errors.
>
>evaluation methods
>The naive method for comparing strings begins with the first element, and
>continues until a.) the end of the string is reached, or b.) a difference
>is encountered.
>For a given string of length n, one can perform a boolen match on each
>element on a one-to-one basis with the second string, returning 1 for a
>match, and 0 for a mismatch. Returning to the above example of cemetery
>and cemetary:
>
>
>string 1 c e m e t e r y
>string 2 c e m e t a r y
>
>--------------------------------------------------------------------------------
>
>results 1 1 1 1 1 0 1 1
>
>summing up the individual results yields:
>
>1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 7
>divided by n (the length of the string) yields a value of .875 -- 87.5% of
>the characters match.
>
>This method only works in cases in which there is an error in which only a
>correct character is replaced by an incorrect one . It fails miserably
>when a character is omitted from a string:
>
>
>string 1 c e m e t e r y
>string 2 e m e t a r y
>
>--------------------------------------------------------------------------------
>
>results 0 0 0 0 0 0 0 0
>
>or when two characters are swapped:
>
>
>string 1 c e m e t e r y
>string 2 c e m e e t r y
>
>--------------------------------------------------------------------------------
>
>results 1 1 1 1 0 0 1 1 = .75
>
>We can improve on this by comparing characters relative to its current
>position. In my algorithm, not only do i examine the character at location
>n, but also at n-1 and n+1 if the first comparison fails. If there is a
>match, we can't just assign a true value, but some degree of truthfulness.
>My algorithm gives a 25% penalty on matches made at an offset of +1/-1.
>This increases the accuracy of the comparison considerably. The above
>example would yield:
>
>string 1 c e m e t e r y
>string 2 c e m e e t r y
>
>--------------------------------------------------------------------------------
>
>results 1 1 1 1 .75 .75 1 1 = .9375
>
>and a marked increase when a character is ommitted:
>
>
>string 1 c e m e t e r y
>string 2 e m e t e r y
>
>--------------------------------------------------------------------------------
>
>results 0 .75 75 .75 .75 .75 .75 .75 = .65625
>
>Finally, the last method I have implemented is phonetic matching. In the
>comparison of "cemetery" and "cemetary" we can easily see that the
>intended spelling was "cemetery" and that a simple error based on similar
>sounds in the English language was made. My algorithm tries to estimate
>how close a letter sounds to another. For example, a "c" may sound like a
>"s", or a "s" may sound like a "z". Ideally these values would be derived
>from statistical analysis of spelling errors, but I, especially at my low
>level of expertise in this area, lack the desire to do so. In my algorithm
>I have assigned a value of .66 for the phonetical match between an A and
>an E, yielding:
>
>string 1 c e m e t e r y
>string 2 c e m e t a r y
>
>--------------------------------------------------------------------------------
>
>results 1 1 1 1 1 .66 1 1 = .9575
>
>This method can also be extended to consonant sounds generanted w/
>multiple consonants: "ff" for "gh", "ph" for "p", "wr" for "r".
>
>
>about the script:
>I am currently working on a Javascript demonstration to demonstrate the
>performance of these methods, and the combination of them as well, which
>will very shortly be online, as soon as i convert it from C++.
>
>
>
>
>############################################################################
>This message was posted through the fuzzy mailing list.
>(1) To subscribe to this mailing list, send a message body of
>"SUB FUZZY-MAIL myFirstName mySurname" to listproc@dbai.tuwien.ac.at
>(2) To unsubscribe from this mailing list, send a message body of
>"UNSUB FUZZY-MAIL" or "UNSUB FUZZY-MAIL yoursubscription@email.address.com"
>to listproc@dbai.tuwien.ac.at
>(3) To reach the human who maintains the list, send mail to
>fuzzy-owner@dbai.tuwien.ac.at
>(4) WWW access and other information on Fuzzy Sets and Logic see
>http://www.dbai.tuwien.ac.at/ftp/mlowner/fuzzy-mail.info
>(5) WWW archive: http://www.dbai.tuwien.ac.at/marchives/fuzzy-mail/index.html
>
>

############################################################################
This message was posted through the fuzzy mailing list.
(1) To subscribe to this mailing list, send a message body of
"SUB FUZZY-MAIL myFirstName mySurname" to listproc@dbai.tuwien.ac.at
(2) To unsubscribe from this mailing list, send a message body of
"UNSUB FUZZY-MAIL" or "UNSUB FUZZY-MAIL yoursubscription@email.address.com"
to listproc@dbai.tuwien.ac.at
(3) To reach the human who maintains the list, send mail to
fuzzy-owner@dbai.tuwien.ac.at
(4) WWW access and other information on Fuzzy Sets and Logic see
http://www.dbai.tuwien.ac.at/ftp/mlowner/fuzzy-mail.info
(5) WWW archive: http://www.dbai.tuwien.ac.at/marchives/fuzzy-mail/index.html