Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Dataset#equals strict #23

Open
rubensworks opened this issue Jan 3, 2019 · 15 comments
Open

Make Dataset#equals strict #23

rubensworks opened this issue Jan 3, 2019 · 15 comments

Comments

@rubensworks
Copy link
Member

Originally brought up in #18 (comment).

I suggest to make DatasetCore#equals not test for isomorphism, but instead test strict equality.
DatasetCore or Dataset could then have an additional isomorphic method.

As mentioned by @bergos, equals could be implemented as simple as result.difference(initial).size === 0.

The main reason I have for changing this is to be consistent with Term#equals. Furthermore, being isomorphic is not the same thing as being equal, so we should not confuse these things IMO.

@blake-regalia
Copy link
Contributor

This actually affects all other operations too. If you only had an isomorphic #equals and all other operations are strict, you could end up with a situation like this:

a.equals(b);  // true

a.difference(b).size === 0;  // false
a.contains(b);  // false
b.contains(a);  // false
let u = a.union(b);
u.size === a.size;  // false
u.size === b.size;  // false
let i = a.intersection(b);
i.size === a.size;  // false
i.size === b.size;  // false

If anything, I would think that if you wanted isomorphism to hold under each operation, you would need to first canonicalize each dataset anyway.

@vhf
Copy link
Member

vhf commented Mar 7, 2019

Has this been fixed here #18 (comment) or is it still open?

@awwright
Copy link
Member

I think there's good reason to rename Dataset#equals, but I don't think "strict" equality is a useful concept. It's not meaningful to ask if two bnodes are the "same bnode" between graphs except to say they're conveying the same information, and mathematically this means testing if the graphs are isomorphic.

@ericprud
Copy link

Bnodes are scoped to a dataset rather than a graph, so it's possible for two graphs in a dataset to be tested for equivalence beyond isomorphism. Practically, it's handy to see if a copy of a graph has changed from the original (user changed their profile, librarian updated a PREMIS record, clinical record is an exact duplicate). Theoretically, the means you may be drawing the dataset boundary around your business process.

Regardless of whether rdfjs includes a .equals() to support such use cases, I agree that it should not be confused with either isomorphism or entailment. Users can supply their own .equals(), so long as the Dataset spec promises not to rewrite bnode labels.

@vhf vhf changed the title Make DatasetCore#equals strict Make Dataset#equals strict Apr 18, 2019
@ericprud
Copy link

I created a strawman to describe the diff between equals and isomorphicTo.

(BTW, those anchors have "dom-" in them. I wonder if that's intended to convey that it applies to the DOM spec, or if it's just that the anchors are created in (our) DOM.

@awwright
Copy link
Member

First, IMO, I don't think two graphs can share a bnode even if RDF Semantics says they can, as a matter of mathematics, because that would defeat the point of having bnodes being an existential quantifier: you can't conditions to an existential quantifier after the fact. But that's a quibble for that specification. Maybe I'm misunderstanding something, or maybe it's not a problem for some reason I haven't considered.

More relevant to this is, what case do you need to compare bnodes for "strict" equality because isomorphism checking doesn't suffice?

@ericprud
Copy link

Many use cases for this violate your position on the scope of bnodes as they count on some external reference to a bnode, either from another part of the dataset or from some program state. In order to duck this discussion, I'll use a subgraph use case; the results of a .getQuads() operation:

_:alice foaf:knows _:bob ;
  foaf:mbox <mailto:[email protected]> .
_:bob foaf:mbox <mailto:[email protected]> .
_:claire foaf:mbox <mailto:[email protected]> .
let knows = f.namedNode(foaf:knows)
let alice = d.getQuads(null, knows, f.namedNode(<mailto:alice@example.com>))
let known1 = d.getQuads(alice, knows)
// UI process allows user to update `_:alice`'s `foaf:knows` to be `_:claire`
let known2 = d.getQuads(alice, knows)
console.log(known1.equals(known2)) // false
console.log(known1.isomorphicTo(known2)) // true

Having a .equals() isn't strictly necessary as the user can always implement it; but it is:

  1. handy for testing and change detection
  2. much cheaper than .isomorphicTo().

@awwright
Copy link
Member

@ericprud Can you elaborate the example? As I take your example, nothing changed because the graph conveys the same information before and after:
Before, "there exists a person who knows Alice"
And after, "there exists a person who knows Alice"

@ericprud
Copy link

The part of the graph we're examining started out saying:

There exists someone with the mbox mailto:[email protected] .
There exists someone with the mbox mailto:[email protected] .
There exists someone with the mbox mailto:[email protected] .
The person with mbox mailto:[email protected] knows the person with mbox mailto:[email protected].

The use case says that the user changed the graph to say:

There exists someone with the mbox mailto:[email protected] .
There exists someone with the mbox mailto:[email protected] .
There exists someone with the mbox mailto:[email protected] .
The person with mbox mailto:[email protected] knows the person with mbox mailto:[email protected].

The way the program detected that those particular triples changed (many other things may have changed in the graph as well, but we specifically saved state capturing who alice knew), was by asking again for who alice knows and comparing the results with .equals().

@vhf
Copy link
Member

vhf commented May 6, 2019

I think @ericprud's definition is clear (and I really like the other WebIDL changes).

Here is the example in code:

const assert = require('assert')
const rdf = require('some dataset-spec compliant lib')

const first = (dataset) => dataset.toArray()[0]

const knows = rdf.namedNode('http://xmlns.com/foaf/0.1/knows')
const mbox = rdf.namedNode('http://xmlns.com/foaf/0.1/mbox')
const alice = rdf.blankNode('alice')
const bob = rdf.blankNode('bob')
const claire = rdf.blankNode('claire')
const mailto = (address) => rdf.namedNode(`mailto:${address}`)

const dataset = rdf.dataset([
  rdf.quad(alice, knows, bob),
  rdf.quad(alice, mbox, mailto('[email protected]')),
  rdf.quad(bob, mbox, mailto('[email protected]')),
  rdf.quad(claire, mbox, mailto('[email protected]'))
])

const alice2 = dataset.match(null, mbox, mailto('[email protected]')).toArray()
assert(alice2[0].subject.equals(alice))

const known1 = dataset.match(alice, knows)
assert(first(known1).object.equals(bob))

dataset.delete(rdf.quad(alice, knows, bob))
dataset.add(rdf.quad(alice, knows, claire))

const known2 = dataset.match(alice, knows)
assert(first(known2).object.equals(claire))

assert(!known1.equals(known2))
assert(known1.isomorphicTo(known2))

Here's how I had implemented .equals: https://github.com/rdf-ext/rdf-dataset-indexed/blob/ff76073bb73c96eea77b3c7c64917733a399b15a/Dataset.js#L36-L46 , it works in the above example case although it's a bit different than the suggested implementation.


A suggestion to move forward with this:

  • Merge @ericprud's work without the .isomorphicTo part
  • Create a new issue here to discuss .isomorphicTo (do we want it to be part of the spec? is it a feature that would not be used that often and requires a ~big implementation so we leave it out? etc)

If most of you agree with this: @ericprud do you want to take care of creating this PR without .isomorphicTo, otherwise can I cherry-pick from your repo and open it?

@rubensworks
Copy link
Member Author

Merge @ericprud's work without the .isomorphicTo part

👍

Create a new issue here to discuss .isomorphicTo

Implementation of this is indeed not that trivial (I have one here). From a user-perspective, this would definitely be useful to have. However, this would probably not be an operation that is needed for most cases, so I currently lean towards not including it in the spec, to lower the burden of implementors. Furthermore, there are different ways to implement .isomorphicTo, so it would make sense to give developers control over this, so that an implementation can be chosen that is optimal for the use-case.

@awwright
Copy link
Member

awwright commented May 7, 2019

I still have to revisit this, but having a user select between one bnode or another bnode seems like the wrong way to implement a program. Blank nodes are existential quantifiers, not identifiers. It fundamentally doesn't make sense to me for a user to change a node in a statement from one bnode to another one. If you want this functionality, use an identifier (URI/IRI).

Even using the example for the sake of argument, you can still use isomorphism here. What I want to know is, what does a strict check do that isomorphism cannot?

Performance shouldn't be that much of an issue. Let n be the number of statements; if the two graphs contain the same bnodes, or are found in the same internal sort order, the performance will be O(n), same as strict equality. Only if the bnodes are different addresses in memory, and they're in a different internal sort order, will the performance approach O(n!).

to lower the burden of implementors

On the contrary, I think it's important to standardize the things that are most difficult to implement. The whole point of having a library is to remove burden from developers. It wouldn't be much of a library if all the difficult stuff always got shoveled down the road.

@awwright
Copy link
Member

awwright commented May 7, 2019

Also, here's my simple implementation for reference, which additionally returns the mapping of bnodes if such a mapping is found: https://github.com/awwright/node-rdf/blob/master/lib/GraphEquals.js

I haven't done much work between the differences in the algorithms; mine is a simple subgraph matching algorithm instead of e.g. Jeremy Carroll's algorithm; but as I understand, the pathological case is always O(n!) and the best we can do is going to be for certain subsets of cases, like preserved sort order, and such.

@ericprud
Copy link

So where is consensus at this point? My reading of the above is that most folks want to merge in my strict equals() definition but that @awwright remains unconvinced that it's a practical or proper way to manipulate RDF data.

@awwright
Copy link
Member

Well I wouldn't say unconvinced, but I'm looking for persuading information. If there's an application of this, where it needs to know some Datasets have the same statements pointing to the same bnodes, and mere isomorphism is not allowed, then I can consider that.

With such an example in mind, there'd be some follow-up questions. Is the name appropriate? The name "strict" sounds like the === operator which is usually preferable in ECMAScript; but for us, according to RDF Semantics, it's unpreferable.

Finally, in general, there's just a bunch of different ways to test for equality:

  • Are two datasets the same instance in memory?
  • Do two different instances of Dataset point to the same instances of statements? (this could be especially important if Quad is mutable!)
  • Are two Datasets different if they have different bnodes with the same label?
  • Do the Datasets share the same set of named graphs, and are each pair isomorphic?

The only definition of "equals" that ensures two graphs encode the same data as each other is isomorphism; developers who need something different might be better suited by specifically writing what they're looking for. Isomorphism is the only hard test on that list, anyways.

And again, performance isn't too much of an issue. If two graphs don't use any bnodes, or store statements in the same internal sort order, or point to the same bnode instance, or use the same bnode label, those cases can all be optimized down to O(n).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants