Re: fuzzy string matching

Scott Dick (dick@morden.csee.usf.edu)
Thu, 3 Jun 1999 20:15:34 +0200 (MET DST)

Breezy wrote:
>
> hi lowly
>
> I too am not an expert but would tend to view your problem in terms of how
> well an attempt at a word matched a universe of possible words.
>
> I would break down the match to a number of properties like word length,
> character placement, runs of same or even similar characters etc.. .....
>
> Take one property at a time [say length] one could talk (in a fuzzy way)
> about how well the length of the input word (trial word) matched that to
> which it was being compared (a comparer).
>
> c e m e t e r y & c e m e t a r y.
>
> Cemetery has 8 characters and so I would plot a graphs x axis of say 6 to 10
> characters (for this word).
>
> Any comparer that had a character length of 6 to 10 is then to a [fuzzy]
> degree about the same character length (this is the rush of fuzzy for me-
> non math talk "about the same" is technical fuzzy talk). Drawing one of
> those triangular graphs with the peak (in the y axis) at 8 means that the
> degree to which the trial words- character length property matched the
> comparer words- character lengths property depends on how close the trial
> words character length got to the comparers word length.
>
> The aim of course is to be building up a set of rules joined with and & or
> that an expert submits from experience. So our first rule is
>
> If the character length of the trial word is about the same as the comparer
> word AND next rule (next property).
>
> We have with this first rule now reduced all possible word suggestions to
> those words that are "about the same" character length. You defined the
> meaning of "about the same" for character length. The Russians might have a
> propensity to miss by more characters than the Yanks when they misspell a
> word.
>
> The rules stay the same and only the definition of "about the same" in terms
> of character length changes.
>
> It is a very sexy concept cause it so easily caters for differing views of
> measure ( i.e. tall to a pigmy verses tall to a basket ball player) and it
> allows you to think of a solution in the way you naturally think- fuzzy.
>
> Jared Boehm wrote in message ...
> >Hi, I'm just a lowly undergraduate CS major, and I've yet to find a decent
> >book on fuzzy logic around here.
> >
> >I think I've gleaned enough knowledge to at least have some sort of
> >foundation. I'm currently using fuzzy logic, or what I perceive fuzzy
> >logic to be, for string matching. Am I way off base? I guess I'm looking
> >for a little guidance or encouragement or both. Anyways, what follows can
> >also be found at http://www.personal.kent.edu/~jtboehm/fuzzy.html, which
> >details my endeavors.
> >
> >humbly yours,
> >
> >Jared Boehm
> >
> >--------------------------------------------------------------------------
> >introduction
> >
> >Fuzzy Logic is currently an area of interest to me. Right now I am sort of
> >stumbling along, which in all frankness is part of the fun. Traditional
> >boolean logic states that either a fact is true or that it is false. While
> >this sort of logic is fine for many applications, there are times when one
> >may wish to know how true or how false is a rule.
> >My current project involves fuzzy pattern matching of strings. The idea is
> >that traditional string comparisons (such as c/c++'s strcmp) return only a
> >boolean value true or fals as to its equivalence. Given two strings, for
> >example, "cemetery" and "cemetary" -- a common misspelling, strcmp would
> >return false, while the casual observer can fairly accurately discern the
> >intended spelling.
> >
> >The applications of this are fairly simple. I assume most word-processors
> >use a spell checker based on fuzzy logic, to find and suggestion possible
> >correct spellings. Compiler parsing and error correction could also
> >implement a similar fuzzy matching system to handle syntax errors.
> >
> >evaluation methods
> >The naive method for comparing strings begins with the first element, and
> >continues until a.) the end of the string is reached, or b.) a difference
> >is encountered.
> >For a given string of length n, one can perform a boolen match on each
> >element on a one-to-one basis with the second string, returning 1 for a
> >match, and 0 for a mismatch. Returning to the above example of cemetery
> >and cemetary:
> >
> >
> >string 1 c e m e t e r y
> >string 2 c e m e t a r y
> >
> >---------------------------------------------------------------------------
> -----
> >
> >results 1 1 1 1 1 0 1 1
> >
> >summing up the individual results yields:
> >
> >1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 7
> >divided by n (the length of the string) yields a value of .875 -- 87.5% of
> >the characters match.
> >
> >This method only works in cases in which there is an error in which only a
> >correct character is replaced by an incorrect one . It fails miserably
> >when a character is omitted from a string:
> >
> >
> >string 1 c e m e t e r y
> >string 2 e m e t a r y
> >
> >---------------------------------------------------------------------------
> -----
> >
> >results 0 0 0 0 0 0 0 0
> >
> >or when two characters are swapped:
> >
> >
> >string 1 c e m e t e r y
> >string 2 c e m e e t r y
> >
> >---------------------------------------------------------------------------
> -----
> >
> >results 1 1 1 1 0 0 1 1 = .75
> >
> >We can improve on this by comparing characters relative to its current
> >position. In my algorithm, not only do i examine the character at location
> >n, but also at n-1 and n+1 if the first comparison fails. If there is a
> >match, we can't just assign a true value, but some degree of truthfulness.
> >My algorithm gives a 25% penalty on matches made at an offset of +1/-1.
> >This increases the accuracy of the comparison considerably. The above
> >example would yield:
> >
> >string 1 c e m e t e r y
> >string 2 c e m e e t r y
> >
> >---------------------------------------------------------------------------
> -----
> >
> >results 1 1 1 1 .75 .75 1 1 = .9375
> >
> >and a marked increase when a character is ommitted:
> >
> >
> >string 1 c e m e t e r y
> >string 2 e m e t e r y
> >
> >---------------------------------------------------------------------------
> -----
> >
> >results 0 .75 75 .75 .75 .75 .75 .75 = .65625
> >
> >Finally, the last method I have implemented is phonetic matching. In the
> >comparison of "cemetery" and "cemetary" we can easily see that the
> >intended spelling was "cemetery" and that a simple error based on similar
> >sounds in the English language was made. My algorithm tries to estimate
> >how close a letter sounds to another. For example, a "c" may sound like a
> >"s", or a "s" may sound like a "z". Ideally these values would be derived
> >from statistical analysis of spelling errors, but I, especially at my low
> >level of expertise in this area, lack the desire to do so. In my algorithm
> >I have assigned a value of .66 for the phonetical match between an A and
> >an E, yielding:
> >
> >string 1 c e m e t e r y
> >string 2 c e m e t a r y
> >
> >---------------------------------------------------------------------------
> -----
> >
> >results 1 1 1 1 1 .66 1 1 = .9575
> >
> >This method can also be extended to consonant sounds generanted w/
> >multiple consonants: "ff" for "gh", "ph" for "p", "wr" for "r".
> >
> >
> >about the script:
> >I am currently working on a Javascript demonstration to demonstrate the
> >performance of these methods, and the combination of them as well, which
> >will very shortly be online, as soon as i convert it from C++.
> >
> >
> >

As several others have pointed out, the methods you're talking about
only deal with the syntax of two strings. Borrowing an example, the
syntax of "cat" and "hat" are very similar, but their meanings are very
different. We would need to incorporate semantics in order to generate a
really powerful string matching algorithm. There is already an algorithm
called the "string edit distance" (sorry, I don't have the reference)
which measures a syntactic difference between two strings by determining
the number and cost of transformations needed to convert one string to
another.

I think fuzzy logic would be a good choice for a string matching
algorithm that incorporates syntax and semantics. The reason is that a
linguistic variable, as defined by Zadeh in "The concept of a linguistic
variable and its application to approximate reasoning", already includes
a relation between syntax and semantics. The values of a linguistic
variable (words) are generated syntactically, by a context free grammar
(and in my own work, a phrase-structured grammar). Each of these words
is then given an unambiguous meaning by the semantic rule of the
variable, which associates a fuzzy set with every word.

There are really two issues I see here. One, how do we define a distance
measure for words? As things currently stand, words are treated as
ordinal sets, which means no distance is defined between them. Two, what
do we do about words that are not tied to some underlying numerical
universe of discourse? Earl Cox's idea of creating a semantic net may be
the best way to attack this problem; one could define a distance metric
based on the minimum number (or cost) of transitions between two strings
of interest. A similar method can be found in "Linguistic Geometry:
Methodology and Techniques" by Boris Stilman, 1993.

-- 
***********************************************************************
*                               *                                     *
*  Scott Dick                   *                                     *
*  Research Assistant           *       Cool & brilliant thought      *		
*  USF Computer Science         *       still under contruction       *
*  dick@morden.csee.usf.edu     *                                     *
*                               *                                     *
***********************************************************************

############################################################################ This message was posted through the fuzzy mailing list. (1) To subscribe to this mailing list, send a message body of "SUB FUZZY-MAIL myFirstName mySurname" to listproc@dbai.tuwien.ac.at (2) To unsubscribe from this mailing list, send a message body of "UNSUB FUZZY-MAIL" or "UNSUB FUZZY-MAIL yoursubscription@email.address.com" to listproc@dbai.tuwien.ac.at (3) To reach the human who maintains the list, send mail to fuzzy-owner@dbai.tuwien.ac.at (4) WWW access and other information on Fuzzy Sets and Logic see http://www.dbai.tuwien.ac.at/ftp/mlowner/fuzzy-mail.info (5) WWW archive: http://www.dbai.tuwien.ac.at/marchives/fuzzy-mail/index.html