A need to fuse data often arises when you combine multiple datasets. The combined datasets may contain descriptions of the same things that are given different identifiers. If possible, the descriptions of the same thing should be fused into one to simplify querying over the combined data. However, the problem with different co-referent identifiers also appears frequently in a single dataset. If a thing does not have an identifier, then it must be referred to by its description. Likewise, if the dataset's format does not support using identifiers as links, then things must also be referred to by their descriptions. For example, a company referenced from several public contracts as their supplier may have a registered legal entity number, yet its description is duplicated in each awarded contract instead of linking the company by its identifier due to the limitations of the format storing the data, such as CSV.
Fusing descriptions of things is a recurrent task both in intergration of multiple RDF datasets and in transformations of non-RDF data to RDF. Since fusion of RDF data can be complex, there are dedicated data fusion tools, such as Sieve or LD-FusionTool, that can help formulate and execute intricate fusion policies. However, in this post I will deal with basic fusion of RDF data using the humble SPARQL 1.1 Update, which is readily available in most RDF stores and many ETL tools for processing RDF data, such as LinkedPipes-ETL. Moreover, a basic data fusion is widely applicable in many scenarios, which is why I wanted to share several simple ways for approaching it.
Content-based addressing
In the absence of an external identifier, a thing can be identified with a blank node in RDF. Since blank nodes are local identifiers and no two blank nodes are the same, using them can eventually lead to proliferation of aliases for equivalent things. One practice that ameliorates this issue is content-based addressing. Instead of identifying a thing with an arbitrary name, such as a blank node, its name is derived from its description; usually by applying a hash function. This turns the “Web of Names” into the Web of Hashes. Using hash-based IRIs for naming things in RDF completely sidesteps having to fuse aliases with the same description. This is how you can rewrite blank nodes to hash-based IRIs in SPARQL Update and thus merge duplicate data:
In practice, you may want to restrict the renamed resources to those that feature some minimal description that makes them distinguishable. Instead of selecting all blank nodes, you can select those that match a specific graph pattern. This way, you can avoid merging underspecified resources. For example, the following two addresses, for which we only know that they are located in the Czech Republic, are unlikely the same:
@prefix : <http://schema.org/> .
[ a :PostalAddress ;
:addressCountry "CZ" ] .
[ a :PostalAddress ;
:addressCountry "CZ" ] .
More restrictive graph patterns also work to your advantage in case of larger datasets. By lowering the complexity of your SPARQL updates, they reduce the chance of you running into out of memory errors or timeouts.
Hash-based fusion
Hashes can be used as keys not only for blank nodes. Using SPARQL, we can select resources satisfying a given graph pattern and fuse them based on their hashed descriptions. Let's have a tiny sample dataset that features duplicate resources:
If you want to merge equivalent resources instantiating class :C (i.e. :r1, :r2, and :r5), you can do it using the following SPARQL update:
The downside of this method is that the order of bindings in GROUP_CONCAT cannot be set explicitly, nor it is guaranteed to be deterministic. In theory, you may get different concatenations for the same set of bindings. In practice, RDF stores typically concatenate bindings in the same order, which makes this method work.
Fusing subset descriptions
If we fuse resources by hashes of their descriptions, only those with the exact same descriptions are fused. Resources that differ in a value or are described with different properties will not get fused together, because they will have distinct hashes. Nevertheless, we may want to fuse resources with a resource that is described by a superset of their descriptions. For example, we may want to merge the following blank nodes, since the description of the first one is a subset of the second one's description:
@prefix : <http://schema.org/> .
[ a :Organization ;
:name "ACME Inc." ] .
[ a :Organization ;
:name "ACME Inc." ;
:description "The best company in the world."@en ] .
Resources with subset descriptions can be fused in SPARQL Update using double negation:
The above-mentioned caveats apply in this case too, so you can use a more specific graph pattern to avoid merging underspecified resources. The update may execute several rewrites until reaching the largest superset, which makes it inefficient and slow.
Key-based fusion
If you want to fuse resources with unequal descriptions that are not all subsets of one resource's description, a key to identify the resources to fuse must be defined. Keys can be simple, represented by a single inverse functional property, or compound, represented by a combination of properties. For instance, it may be reasonable to fuse the following resources on the basis of shared values for the properties rdf:type and :name:
@prefix : <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[ a :Organization ;
:name "ACME Inc." ;
:foundingDate "1960-01-01"^^xsd:date ;
:email "contact@acme.com" ;
:description "The worst company in the world."@en ] .
[ a :Organization ;
:name "ACME Inc." ;
:foundingDate "1963-01-01"^^xsd:date ;
:description "The best company in the world."@en ] .
To fuse resources by key, we group them by the key properties, select one of them, and rewrite the others to the selected one:
If we fuse the resources in the example above, we can get the following:
@prefix : <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[ a :Organization ;
:name "ACME Inc." ;
:foundingDate "1960-01-01"^^xsd:date, "1963-01-01"^^xsd:date ;
:email "contact@acme.com" ;
:description "The best company in the world."@en, "The worst company in the world."@en ] .
This example illustrates how fusion highlights problems in data, including conflicting values of functional properties, such as the :foundingDate, or contradicting values of other properties, such as the :description. However, resolving these conflicts is a complex task that is beyond the scope of this post.
Conclusion
While I found the presented methods for data fusion to be applicable for a variety of datasets, they may fare worse for complex or large datasets. Besides the concern of correctness, one has to weigh the concern of performance. Based on my experience so far, the hash-based and key-based methods are usually remarkably performant, while the methods featuring double negation are not. Nonetheless, the SPARQL updates from this post can be oftentimes simply copied and pasted and work out of the box (having tweaked the graph patterns selecting the fused resources).
No comments :
Post a Comment