Subject: String comparison
From: Paul Emmons (pemmons@voicenet.com)
Date: Mon May 01 2000 - 16:47:20 MET DST


I am interested in writing a function (in C or Perl)
that returns some kind of index of probability that
two texts are discussing the same topic.

For the application that I have in mind, the texts or strings
are relatively brief, typically between 60 and 300 characters.

I would suppose that when the two texts have substrings
(phrases or relatively long words) in common, the
probability increases that they are about the same subject.
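
One rough way to put a number on "substrings in common" is to compare the sets of short character runs (n-grams) that occur in each text. The sketch below is only an illustration, not an established answer to the question: the function names, the choice of n = 3, and the use of the Jaccard ratio (shared n-grams divided by all distinct n-grams) are assumptions made for the example, and it is written in Python for brevity although the same logic ports to C or Perl.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text, lowercased,
    with runs of whitespace collapsed to a single space."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def shared_substring_score(a, b, n=3):
    """Jaccard overlap of the two n-gram sets, a value in [0, 1].

    0.0 means no n-gram in common; 1.0 means identical n-gram sets.
    """
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0  # both texts shorter than n: nothing to compare
    return len(ga & gb) / len(ga | gb)
```

A call such as shared_substring_score("fuzzy string matching", "matching fuzzy strings") gives a value strictly between 0 and 1, since the two texts share many but not all trigrams. Longer shared phrases contribute more shared n-grams, which loosely matches the intuition above.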

An efficient and simple (but also simple-minded) procedure
along these lines is based on the "Hamming distance." The two
texts are hashed into binary "signature codes", and the
index is the percentage of bit positions in which these codes agree
(the Hamming distance itself counts the positions in which they differ).
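
To make the signature idea concrete, here is a minimal sketch under stated assumptions. The description above does not say how the texts are hashed, so the scheme used here (each character trigram sets one bit of a 64-bit signature, chosen with CRC-32) is an assumption for illustration, as are the names signature, hamming_distance, and bit_agreement and the width BITS.

```python
import zlib

BITS = 64  # signature width; an arbitrary choice for this sketch


def signature(text, n=3):
    """Fold the character n-grams of text into a BITS-wide integer.

    Each n-gram switches on one bit, selected by its CRC-32 checksum.
    """
    s = " ".join(text.lower().split())
    sig = 0
    for i in range(len(s) - n + 1):
        sig |= 1 << (zlib.crc32(s[i:i + n].encode("utf-8")) % BITS)
    return sig


def hamming_distance(x, y):
    """Number of bit positions in which the integers x and y differ."""
    return bin(x ^ y).count("1")


def bit_agreement(a, b):
    """Fraction of the BITS signature positions on which a and b agree."""
    return 1.0 - hamming_distance(signature(a), signature(b)) / BITS
```

Identical texts give an agreement of 1.0. For texts near 300 characters a 64-bit signature fills up quickly, so a wider signature would probably be needed before the index says much; that trade-off is part of why the method is simple-minded.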

However, I expect that something more sophisticated would give
better results. Doubtless, work has been done already in this area.

Can anyone suggest algorithms, or any other ideas or places
to turn? Thanks in advance.

Paul Emmons

pemmons@voicenet.com
