This repository (and the underlying code) is now frozen. The W3C is in the process of publishing a standard for canonicalization (called RDFC-1.0), and I have made a conforming implementation thereof, published in rdfjs-c14n. The version in this repository is, therefore, definitely out of date.

Proof-of-concept implementation of Aidan Hogan’s iso-canonical algorithm for RDF graph canonicalization

Aidan Hogan has published two papers on one of the fundamental building blocks of the RDF infrastructure: calculating a canonical form of an RDF graph. A canonical graph makes it possible, for example, to sign an RDF graph, which is needed for various security, consistency, and provenance considerations in applications like verifiable claims, usage of linked data on blockchains, consistency proofs for ontologies, publication of data, etc. The two papers are:

  1. A. Hogan, “Skolemising Blank Nodes while Preserving Isomorphism,” WWW2015 conference proceedings, 2015, pp. 430–440. See the PDF version.
  2. A. Hogan, “Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes,” ACM Trans. Web, vol. 11, no. 4, pp. 22:1–22:62, Jul. 2017. See the PDF version.

(The second paper supersedes the first insofar as it slightly simplifies the relevant algorithm, although it also contains much more than what is relevant for this repository. Note also that the URL above for the second paper refers to the author’s copy on his Web site, which fixes some minor bugs that surfaced since publication. That version was used for this implementation.)

This repository contains a proof-of-concept implementation in node.js of what is referred to as the “iso-canonical algorithm” in the papers, using the N3.js library for the representation of quads. It is fully based on Aidan’s algorithm, except for one small extension: while Aidan describes the canonicalization of RDF graphs, this implementation deals with RDF datasets (a.k.a. named graphs). The changes are described below.

The reason for doing this implementation was to explore whether it is possible to define a stable specification (e.g., a W3C Recommendation) on the subject starting with Aidan’s algorithm and, if so, what is missing in terms of a specification (see the separate section below for details). The implementation does not aim, or pretend, to be of production quality; it has not been profiled for speed, etc. It shows, however, that the algorithm can be implemented, based solely on the paper, without any further mathematical work. That is, Aidan’s algorithm is indeed one of the viable approaches if such a specification is to be produced.

(Note that Aidan has an implementation of his own algorithm in Java, which is also on GitHub. This JavaScript implementation is completely independent of that one.)

What is meant by “Canonical RDF Graph/Dataset”?

A matter of terminology: in the paper (and in this implementation) a canonical version can(G) of the graph/dataset G is isomorphic to G (i.e., can(G) ≅ G), and if H ≅ G then can(H) = can(G). In other words, it is an isomorphic graph where the blank node renaming bijection is chosen deterministically.

Adaptation of the algorithms to handle datasets

The changes/additions are as follows; a sketch of the modified hashing step is given after the list. (Each of these changes is relevant only if there is a separate graph identifier, i.e., if we have a quad instead of a triple. If this is not the case, the algorithm falls back to the original.)

  1. In Algorithm 1, at steps 13 and 16, the hashTuple function gets a fourth argument, before the '+' and '-' characters, respectively, referring to the hash value of the graph identifier;
  2. In Algorithm 1, after step 17, a similar branch is added for the case of an (s,p,o,b) quad (i.e., when the graph identifier is a blank node); the hashTuple in that branch hashes the full (s,p,o) triple, plus the '.' character.
  3. The lexicographic comparison of quads, used, e.g., when comparing graphs/datasets, takes the possible graph id into account, too.
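
For illustration, here is a minimal sketch of the first of these changes, assuming a hashTuple function along the lines discussed later on this page and N3.js-style quads; the function name hashStep and the exact argument layout are illustrative, not taken from the paper.

```js
// Hypothetical sketch of the modified steps 13/16 of Algorithm 1.
// `termHashes` stands for the hash values that the original,
// triple-based step feeds to hashTuple (e.g., [hash(o), hash(p)] with
// the '+' marker); `hash` maps a term to its current hash value.
function hashStep(termHashes, quad, hash, marker /* '+' or '-' */) {
    if (quad.graph.termType === 'DefaultGraph') {
        // Original, triple-based case
        return hashTuple(...termHashes, marker);
    }
    // Dataset extension: the hash of the graph identifier is inserted
    // as an extra argument before the '+'/'-' character
    return hashTuple(...termHashes, hash(quad.graph), marker);
}
```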

Further details that should be provided for a specification

The article defines and proves the mathematical algorithm; however, for a precise specification, some further details must be properly specified to ensure interoperable implementations. This is not a shortcoming of the paper; it only means that some of the functions, details, etc., that the paper characterizes only by their behavior should be given more precise definitions.

Each entry below describes what has to be specified, as well as how this particular implementation handles it. The choices made in this implementation may be arbitrary, and should not be considered binding by any means.

Additional steps to be defined beyond the canonicalization

Aidan’s algorithm specifies can(G), i.e., the deterministic mapping of the blank nodes in the graph/dataset. A full specification for signing RDF datasets would also have to define some fairly routine engineering steps on top of that, e.g., sorting and serializing the canonical quads (see the note on canonical N-Quads below) and hashing the result to produce the final, public signature. One possible shape of such a pipeline is sketched below.
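
As an illustration only, such a pipeline might look like the following sketch; canonicalize stands for the iso-canonical algorithm itself, and toNQuads is a hypothetical per-quad serializer (see the sketch in the last section). Neither is a real API of this implementation.

```js
const crypto = require('crypto');

// Hypothetical sketch of the routine steps on top of can(G): compute
// the canonical dataset, serialize it into sorted canonical N-Quads,
// then hash the result with a suitably strong function (SHA256 here).
// `canonicalize` and `toNQuads` are assumed helpers, not real APIs.
function datasetHash(quads) {
    const canonical = canonicalize(quads);            // can(G)
    const nquads    = canonical.map(toNQuads).sort(); // one line per quad
    return crypto.createHash('sha256')
                 .update(nquads.join(''))
                 .digest('hex');
}
```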

Which hash function to use?

The paper does not specify which hash function to use, just that it should be a perfect one. It also contains some measurements and calculations on which functions are worth using. Note that we do not need a cryptographically strong hash function here; it is only used to separate blank nodes. Higher-level algorithms, e.g., those producing the final and public signature of the graph, may choose stronger functions if needed.

This implementation uses md5. The only reason is that the resulting values are relatively short, which makes debugging easier. A final specification should probably use SHA256. (Note that the article refers to SHA128 as being probably enough; however, the OpenSSL library, at least as used in node.js, does not seem to offer this function, hence the choice of SHA256 as being probably more widely deployed.) The crypto package in node.js (and, consequently, this implementation) stores hash values as a Buffer instance of unsigned 8-bit integers; the number of such integers depends on the specific hash function used.
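
For instance, this is how the hash of a value looks in node.js terms (md5 as used in this implementation; the input string is arbitrary):

```js
const crypto = require('crypto');

// Hashing a value with md5 yields a 16-byte Buffer of unsigned 8-bit
// integers; sha256 would yield a 32-byte Buffer instead.
const digest = crypto.createHash('md5').update('_:b0').digest();
console.log(digest.length);            // 16
console.log(digest instanceof Buffer); // true
```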

What does a "0" value of a hash mean?

The hash tables are initialized by setting '0' as the hash value for each blank node. It must be defined exactly what this means (depending on how hashes are represented): a number, or a hexadecimal representation thereof.

This implementation uses a Buffer with all-zero entries; essentially, this is the number zero.
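
In node.js terms this is simply the following (16 bytes matching the md5 digest size used here):

```js
// Buffer.alloc zero-fills by default, so this is the all-zero initial
// hash value for a blank node.
const initialHash = Buffer.alloc(16);
```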

What does comparing hash values mean?

There may be several choices: comparing the values as (large) numbers, comparing the values’ hexadecimal representations lexicographically, etc.

This implementation uses the compare function of node.js’s Buffer for the representation of a hash. This corresponds (I believe) to comparing the values as large numbers.
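
That is, something along these lines:

```js
// Buffer.compare performs a byte-by-byte (lexicographic) comparison,
// returning -1, 0, or 1; for equal-length buffers this is the same as
// comparing the values as big-endian unsigned integers.
function compareHashes(h1, h2) {
    return Buffer.compare(h1, h2);
}
```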

What is the precise specification of hashTuple?

The article says:

“hashTuple(·) will compute an order-dependent hash of its inputs”

This implementation uses the OpenSSL facility (i.e., the update method of node.js’s crypto package) to add new values to a hash before generating the digest. I am not sure what this means exactly specification-wise but, probably, a concatenation of the values or Buffer instances before generating the digest.
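
In other words, a minimal sketch of hashTuple, assuming the inputs are Buffers or strings, could be:

```js
const crypto = require('crypto');

// Hypothetical sketch of hashTuple: successive update() calls amount
// to concatenating the inputs, so the resulting digest is
// order-dependent, as the paper requires.
function hashTuple(...inputs) {
    const hash = crypto.createHash('md5'); // md5, as used in this implementation
    for (const input of inputs) hash.update(input);
    return hash.digest(); // a Buffer of unsigned 8-bit integers
}
```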

What is the precise specification of hashBag?

The article says:

“The function hashBag(·) computes hashes in a commutative and associative way over its inputs.”

Note that the quality of this function is crucial: in some cases hash collisions can occur that may create problems. Aidan’s experimentation, as well as the experience with this implementation, shows that such collisions do occur (albeit in rare and artificial cases); this was the case when, for example, the XOR of the Buffer entries was used.

An earlier version of this implementation used a modulo 255 sum of the Buffer entries in the node.js representation of the hashes.

However, the current implementation makes a somewhat more complex change to the algorithm, essentially implementing:

hashBag(t1,...,tn) = hashTuple(sort(t1,...,tn))

Using this approach to hashBag requires a slight modification of Algorithm 1, though, namely:

  1. A new object, hash_bag, is initialized before line 12; this object stores, for each b, an array consisting (initially) of the single value of hash<sub>i</sub>[b].
  2. Lines 14 and 17 are replaced by adding the value of c to the corresponding array in hash_bag for the key b.
  3. Before line 18, hash<sub>i</sub> is updated, for each b, with the value of hashTuple(sort(hash_bag[b])).

Provided the hash function is (acceptably) perfect, this approach ensures a deterministic and collision-free hash value when needed.
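
A minimal sketch of this version of hashBag, built on the hashTuple sketch above:

```js
// Hypothetical sketch of hashBag as hashTuple over sorted inputs:
// sorting makes the result independent of the order in which the
// hash values were collected, i.e., commutative and associative.
function hashBag(hashes) {
    // `hashes` is the hash_bag entry of one blank node: an array of Buffers
    const sorted = [...hashes].sort(Buffer.compare);
    return hashTuple(...sorted);
}
```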

What is the precise comparison method of graphs?

The algorithm is based on the ability to have a total ordering of graphs/datasets; indeed, the most complex step is based on calculating a minimum of the candidate graphs/datasets.

This implementation uses the example definition as provided in the paper:

“G < H if and only if G ⊂ H or there exists a triple t ∈ G \ H such that no triple t' ∈ H \ G exists where t' < t”

where the comparison of quads is made by comparing the lexicographic representations of the tuples, serialized as N-Quads.
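
A sketch of such a quad comparison, where toNQuads is an assumed per-quad serializer (see the sketch in the next section):

```js
// Hypothetical sketch of the quad ordering: quads are compared through
// their (canonical) N-Quads serializations, lexicographically.
function compareQuads(q1, q2) {
    const s1 = toNQuads(q1);
    const s2 = toNQuads(q2);
    return s1 < s2 ? -1 : (s1 > s2 ? 1 : 0);
}
```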

Canonical N-Quads

Though the article does not refer to this, a final specification should probably include the specification of “canonical” N-Quads, if N-Quads are used in the various steps. This “canonical” N-Quads should specify, for example, that no comments are allowed, that no empty lines are allowed, what EOL characters are to be used, that all white space between terms is reduced to a single space, whether there is a white space before the trailing "." character of a quad or not, etc.
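
To make the kinds of choices involved concrete, here is a hypothetical per-quad serializer; every formatting decision in it (single spaces between terms, a space before the trailing '.', '\n' as EOL) is an arbitrary choice that a specification would have to pin down. termToString is an assumed helper producing the N-Quads form of a single term.

```js
// Hypothetical sketch of a "canonical" N-Quads serializer for one quad.
// `termToString` is an assumed helper for serializing a single term.
function toNQuads(quad) {
    const terms = [quad.subject, quad.predicate, quad.object];
    if (quad.graph.termType !== 'DefaultGraph') terms.push(quad.graph);
    return terms.map(termToString).join(' ') + ' .\n';
}
```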