Re: data smushing on the Internet for the Semantic Web

From: David Kinny (
Date: Mon Jan 01 2001 - 20:56:24 MET


    Seth Russell <> writes:

    >Tcmits1 wrote:

    >> Isn't one issue that 'equality' itself is fuzzy? For example, two nodes may
    >> not be crisply equivalent, but fuzzily equivalent; that is, one node may be a
    >> direct link to a resource, and the other an indirect link (with multiple
    >> pointers) to the same exact resource. The nodes are 'equal' in certain
    >> aspects. Isn't this what Fuzzy sets could capture (partial subsets)?

    >Yes, RDF specifies where the nodes are "crisply equivalent" when it labels them
    >with the same URI, and URIs have an exact sense of equality. The problem
    >arises where the nodes are anonymous (no URI assigned) or perhaps where the URI
    >has been assigned in error. Let's say we encounter two different nodes in the
    >Semantic Web that don't have URIs:

    >the first node:
    >[ ] --rdf:type--> [person]
    >[ ] --email--> ""
    >[ ] --name--> "Seth Russell"
    >[ ] --homePage--> []

    >the second node:
    >[ ] --rdf:type--> [person]
    >[ ] --email--> ""
    >[ ] --firstName--> "Seth"
    >[ ] --lastName--> "Russell"
    >[ ] --authorOf--> []

    >What is the process and/or the algorithm that lets us [smush] these together

    >[s1] --rdf:type--> [person]
    >[s1] --email--> ""
    >[s1] --name--> "Seth Russell"
    >[s1] --firstName--> "Seth"
    >[s1] --lastName--> "Russell"
    >[s1] --authorOf--> []
    >[s1] --homePage--> []

    >[s1] representing a URI. Defining that kind of a process is what we
    >need .. and we need it badly!!
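
    As a baseline, the merge step itself is little more than a union of
    the two property sets; the hard part is deciding when to apply it.
    A minimal sketch, assuming each node is a mapping from property name
    to a set of values. The function name smush, the "s1" identifier
    and the example.org values are placeholders, not taken from the
    original nodes.

```python
def smush(node_a, node_b, new_id):
    """Union the properties of two nodes under a single identifier.

    Each node maps a property name to a set of values, so a property
    that appears in both nodes keeps every distinct value.
    """
    merged = {}
    for node in (node_a, node_b):
        for prop, values in node.items():
            merged.setdefault(prop, set()).update(values)
    return {new_id: merged}


# Placeholder data: the e-mail and page values are invented for illustration.
first = {
    "rdf:type": {"person"},
    "email": {"seth@example.org"},
    "name": {"Seth Russell"},
    "homePage": {"http://example.org/seth"},
}
second = {
    "rdf:type": {"person"},
    "email": {"seth@example.org"},
    "firstName": {"Seth"},
    "lastName": {"Russell"},
}

result = smush(first, second, "s1")
```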

    Perhaps you'd do well to pose the question a little differently:
    "How do various processes and algorithms for smushing compare?",
    as there's clearly a range of approaches possible. Try some and
    evaluate how well they actually work, analyze when and why they
    fail, and refine the algorithm accordingly. Then publish it!

    Equality or "crisp equivalence" is too rare a special case to be
    very useful. You need something much more general than that.
    Obviously one needs to determine when the "degree of similarity"
    between two nodes is sufficient to justify a merge. Suppose we start
    by defining how to measure the similarity of particular value fields
    (this obviously must be done in different ways according to the type
    of the field), and then define a way of summing these (weighted sum).
    If the overall degree of similarity is "sufficient" then do a merge.
    Some points to take into consideration are:

    - Different types of fields need to be weighted differently in the
      summation: certain ones are good determinants of identity (like
      primary keys in a database), others aren't. Same name is good,
      same age probably isn't. Same name and email is very good. Thus
      weights may need to vary depending on the presence or absence of
      other fields, or be applied to particular combinations of fields.

    - If the fields of one node are a subset of another, or the shared
      fields of a pair of nodes are identical in value, similarity = 1,
      PROVIDED that the common fields contain sufficient "identifying"
      attributes. In other words, the presence of extra information in
      one node probably shouldn't decrease the degree of match.

    - You can make your overall algorithm as conservative or liberal
      as you feel is warranted, according to how likely it is to merge
      two nodes incorrectly or conversely fail to merge two nodes that
      represent the same thing. You probably can't expect to avoid
      both types of errors unless your data is extremely well behaved.

    - You can on the other hand do better by allowing a "tentative" or
      reversible merge, where the nodes are linked but both retained
      somehow (perhaps one becomes "invisible") so the merge can later
      be undone if further evidence justifies it. Equally, you might
      mark an assignment of a URI as "tentative" initially.
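
    The weighted-sum idea and the first three points above might be
    sketched roughly as follows. Everything concrete here is an
    assumption made for illustration: the weights, the MIN_EVIDENCE
    floor, the threshold, and the deliberately crude per-field
    comparison. Only fields present in both nodes are scored, so extra
    information in one node does not lower the match. Weights that
    depend on combinations of fields are left out.

```python
# Hypothetical weights: how strongly agreement on a field suggests identity.
WEIGHTS = {"email": 0.6, "name": 0.3, "age": 0.05}
DEFAULT_WEIGHT = 0.05   # for fields with no explicit weight
MIN_EVIDENCE = 0.5      # shared fields must carry at least this much weight
THRESHOLD = 0.9         # raise to be more conservative, lower to be more liberal


def field_similarity(a, b):
    """Crude placeholder: 1.0 on a case-insensitive exact match, else 0.0."""
    def norm(v):
        return v.strip().casefold() if isinstance(v, str) else v
    return 1.0 if norm(a) == norm(b) else 0.0


def node_similarity(n1, n2):
    """Weighted mean of field similarities over the shared fields only."""
    shared = n1.keys() & n2.keys()
    total = sum(WEIGHTS.get(f, DEFAULT_WEIGHT) for f in shared)
    if total < MIN_EVIDENCE:
        # Not enough "identifying" fields in common to say anything.
        return 0.0
    score = sum(
        WEIGHTS.get(f, DEFAULT_WEIGHT) * field_similarity(n1[f], n2[f])
        for f in shared
    )
    return score / total


def should_merge(n1, n2, threshold=THRESHOLD):
    return node_similarity(n1, n2) >= threshold


# Placeholder nodes; the e-mail addresses are invented.
a = {"type": "person", "email": "seth@example.org", "name": "Seth Russell"}
b = {"type": "person", "email": "seth@example.org", "firstName": "Seth"}
c = {"type": "person", "email": "jane@example.net", "firstName": "Seth"}
```

    Here a and b share an identical e-mail address and would be merged,
    while b and c disagree on it and would not. Two nodes sharing only
    a type and an age fall below the evidence floor and score zero.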

    To do the job properly you really will need some sort of background
    ontology of field types and their value domains that allows some
    sort of smart reasoning about identity, subsumption, and similarity.
    Mathematically the problem is about finding the "distance" between
    points, lines and surfaces in some highly non-linear multidimensional
    space, and ultimately the properties of your distance metric will be
    a prime determinant of how well an approach works. Don't assume
    that a particular metric (e.g. a fuzzy one) will be exactly right
    for the job. Don't reinvent the wheel, but look outside the square.
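
    A very small stand-in for such a background ontology is a table
    that assigns each field a type and each type its own similarity
    measure. The sketch below assumes three invented field types
    ("text", "identifier", "number") and an arbitrary scale for numeric
    distance; a real ontology would be far richer. The text measure
    uses SequenceMatcher from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Ratio of matching characters, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def exact_similarity(a, b):
    """Identifiers either match or they don't."""
    return 1.0 if a == b else 0.0


def numeric_similarity(a, b, scale=10.0):
    """1.0 when equal, falling linearly to 0.0 at a difference of `scale`."""
    return max(0.0, 1.0 - abs(a - b) / scale)


# Hypothetical field typing; unknown fields fall back to exact comparison.
FIELD_TYPES = {"name": "text", "email": "identifier", "age": "number"}
MEASURES = {
    "text": text_similarity,
    "identifier": exact_similarity,
    "number": numeric_similarity,
}


def field_similarity(field, a, b):
    kind = FIELD_TYPES.get(field, "identifier")
    return MEASURES[kind](a, b)
```

    Swapping in a different measure for one field type then changes the
    overall metric without touching the rest of the algorithm.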

    Finally, ask yourself "when and why do I want to smush?" You may
    find that there are different cases which need to be treated in
    different ways, so that the algorithm needs to be parameterized
    according to the circumstances of its use.
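
    One way to parameterize by circumstance, also covering the
    "tentative" merge mentioned earlier, is to give each use case its
    own pair of thresholds. This is only a sketch: the SmushPolicy
    class, the profile names and the numbers are invented, and the
    score is assumed to be an overall similarity in the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmushPolicy:
    merge_at: float      # at or above this score, merge outright
    tentative_at: float  # at or above this score, link the nodes reversibly


# Hypothetical profiles for two different circumstances of use.
POLICIES = {
    "conservative": SmushPolicy(merge_at=0.95, tentative_at=0.80),
    "liberal": SmushPolicy(merge_at=0.75, tentative_at=0.50),
}


def decide(score, policy):
    """Map a similarity score to one of three outcomes."""
    if score >= policy.merge_at:
        return "merge"
    if score >= policy.tentative_at:
        return "tentative"
    return "separate"
```

    With these numbers a score of 0.85 yields a tentative link under
    the conservative profile and a full merge under the liberal one.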

    Hope this helps,


    This archive was generated by hypermail 2b30 : Mon Jan 01 2001 - 21:00:16 MET