David Kinny wrote:
> Perhaps you'd do well to pose the question a little differently:
> "How do various processes and algorithms for smushing compare?",
> as there's clearly a range of approaches possible. Try some and
> evaluate how well they actually work, analyze when and why they
> fail, and refine the algorithm accordingly. Then publish it!
Well, I am probably not the person to do this. But if my postings promote the idea
in the fuzzy logic community that this is a very real problem to be solved, then they
will have served their purpose.
> - You can on the other hand do better by allowing a "tentative" or
> reversible merge, where the nodes are linked but both retained
> somehow (perhaps one becomes "invisible") so the merge can later
> be undone if further evidence justifies it. Equally, you might
> mark an assignment of a URI as "tentative" initially.
Yes, we could even allow agents to tag nodes as candidates for smushing, then have
a person look at the candidates and decide. If they decided "yes", the nodes would
be smushed, and the agents that suggested the merge would be rewarded. This is a big
problem in a big arena (the open-ended Semantic Web), so we may need to apply genetic
programming. I know I wouldn't be able to come up with all the rules that will be
needed for all the diverse types of nodes that will start being created, and I'm
not so sure that anybody else could either.
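The "tentative" or reversible merge idea above could be sketched roughly like this (a minimal toy illustration, not a proposal; the class and all node URIs are hypothetical):

```python
class TentativeSmusher:
    """Toy node store supporting reversible ("tentative") merges.

    Merged nodes are linked but both retained: the absorbed node
    becomes "invisible" behind an alias link, so the merge can be
    undone later if further evidence contradicts it.
    """

    def __init__(self):
        # Maps an absorbed node to the node it was smushed into.
        self.alias = {}

    def resolve(self, node):
        # Follow alias links to the currently visible node.
        while node in self.alias:
            node = self.alias[node]
        return node

    def smush(self, a, b):
        # Tentatively merge b into a; b stays in the store but hidden.
        a, b = self.resolve(a), self.resolve(b)
        if a != b:
            self.alias[b] = a

    def unsmush(self, b):
        # Undo a tentative merge: b becomes visible again.
        self.alias.pop(b, None)


s = TentativeSmusher()
s.smush("mailto:dan@example.org", "http://example.org/people#dan")
print(s.resolve("http://example.org/people#dan"))  # -> mailto:dan@example.org
s.unsmush("http://example.org/people#dan")
print(s.resolve("http://example.org/people#dan"))  # -> visible again as itself
```

A human-review step would then just sit between agents calling smush() and a later permanent commit or unsmush().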
> To do the job properly you really will need some sort of background
> ontology of field types and their value domains that allows some
> sort of smart reasoning about identity, subsumption, and similarity.
Definitely. Without getting into a strong logical commitment, I think at a minimum
that smushing should take place in the context of a schema and an ontology -
otherwise it will be very limited.
> Mathematically the problem is about finding the "distance" between
> points, lines and surfaces in some highly non-linear multidimensional
> space, and ultimately the properties of your distance metric will be
> a prime determinant of how well an approach works. Don't assume
> that a particular metric (e.g. Fuzzy etc) will be exactly right for
> the job. Don't reinvent the wheel, but look outside the square.
Hmm... I'm thinking that techniques like Latent Semantic Indexing (LSI) from the IR
community might be helpful here.
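As a toy illustration of one such distance metric over node descriptions (this is plain cosine similarity over bag-of-words property values, not LSI proper, which would add an SVD dimensionality-reduction step; all the node data below is invented):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Node descriptions as bags of words drawn from their property values.
node_a = {"dan": 1.0, "brickley": 1.0, "w3c": 1.0}
node_b = {"dan": 1.0, "brickley": 1.0, "bristol": 1.0}
node_c = {"david": 1.0, "kinny": 1.0}

print(cosine(node_a, node_b))  # high overlap: candidates for smushing
print(cosine(node_a, node_c))  # no overlap: clearly distinct
```

The point of Kinny's caution applies directly here: whether cosine (or any particular metric) behaves well depends on the value domains involved, so it should be treated as one pluggable choice among many.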
> Finally, ask yourself "when and why do I want to smush?" You may
> find that there are different cases which need to be treated in
> different ways, so that the algorithm needs to be parameterized
> according to the circumstances of its use.
There is one, and only one, reason ever to smush: two nodes represent the same
identical thing from the point of view of the running process. I think that we can
at least be clear on that. But of course the terms "identical", "point of view",
"represent", and "running process" need some more explanation.
> Hope this helps,
Yes, certainly :) The problem is much bigger than I am personally capable of
tackling. I don't think many researchers who hang out in comp.ai.fuzzy also hang
out on the W3C mailing lists where we are pioneering the Semantic Web. But we need
your help over there!
Original definition of the problem by Dan Brickley:
re: (2) 2nd pass node convergence ("data smushing")
This message was posted through the fuzzy mailing list.
WWW archive: http://www.dbai.tuwien.ac.at/marchives/fuzzy-mail/index.html
This archive was generated by hypermail 2b30 : Wed Jan 03 2001 - 02:11:54 MET