
2017-02-05

Publishing temporal RDF data as create/delete event streams

Today I wondered about publishing temporal data as create/delete event streams. The events can be replayed by a database to produce state valid at a desired time. Use cases for this functionality can be found everywhere. For example, I've been recently working with open data from company registers. When companies relocate, their registered addresses change. It is thus useful to be able to trace back in history to find out what a company's address was at a particular date. When a company register provides only the current state of its data, tracing back to previous addresses is not possible.

Event streams are not a new idea. They correspond to the well-known event sourcing pattern. This pattern has been used with RDF too (for example, in shodan by Michael Hausenblas), although it is definitely not widespread. In 2013, I wrote an article about capturing the temporal dimension of linked data. I think most of it holds true to date. In particular, there is still no established way of capturing temporal data in RDF. Event streams might be an option to consider.

Many answers to questions about temporal data in RDF can be found in RDF 1.1 Concepts, one of the fundamental documents on RDF. For a start, "the RDF data model is atemporal: RDF graphs are static snapshots of information." Nevertheless, "a snapshot of the state can be expressed as an RDF graph," which lends itself as a crude way of representing temporal data through time-indexed snapshots encoded as RDF graphs. There is also the option of reifying everything into observations, which is what statisticians and the Data Cube Vocabulary do. Alternatively, we can reify everything that happens into events.

Events

Events describe actions on data in a dataset. You can also think about them as database transactions, along with the associated properties. RDF triples are immutable, since "an RDF statement cannot be changed – it can only be added and removed" (source). This is why we need only two types of events:

  1. :Create (addition)
  2. :Delete (retraction)

Events can be represented as named graphs. Each event contains its type, which can be either :Create or :Delete, and timestamps, in particular valid time. Valid time tells us when the event's data is valid in the modelled world. The dcterms:valid property seems good enough to specify the valid time. Events may additionally describe other metadata, such as provenance. For example, dcterms:creator may link the person who created the event data. Encapsulating an event's metadata in its named graph makes it self-contained, but it mixes operational data with data about the described domain, so an alternative worth considering is to store the metadata in a separate graph.

The following example event stream describes that Alice was a member of ACME from 2000, left in 2001 to work for Umbrella Corp, and then returned to ACME in 2003. The example is serialized in TriG, which can describe quads with named graphs instead of mere triples. You can use this example to test the queries discussed further on.

@prefix :         <http://example.com/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix org:      <http://www.w3.org/ns/org#> .
@prefix schema:   <http://schema.org/> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

:event-1 {
  :event-1 a :Create ;
    dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME ;
    schema:name "Alice" .
}

:event-2 {
  :event-2 a :Delete ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

:event-3 {
  :event-3 a :Create ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-4 {
  :event-4 a :Delete ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-5 {
  :event-5 a :Create ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

Limitations

Describing event streams in the aforementioned way has some limitations. One of the apparent issues is the volume of data that is needed to encode seemingly simple facts. There are several ways to deal with this. Under the hood, RDF stores may implement structural sharing, as in persistent data structures, to avoid duplicating substructures present across multiple events. We can also make several assumptions that save space. :Create can be made the default event type, so that it doesn't need to be provided explicitly. In some limited cases, we can assume that valid time is the same as the transaction time. For example, in some countries, public contracts become valid only after they are published.

Another limitation of this approach is that it doesn't support blank nodes. You have to know the IRIs of the resources you want to describe.

Since named graphs are claimed for events, they cannot be used to distinguish datasets, as they typically are. Datasets need to be distinguished as RDF datasets. Having multiple datasets may hence mean having multiple SPARQL endpoints. Cross-dataset queries then have to be federated, or alternatively, current snapshots of the queried datasets can be loaded into a single RDF store as named graphs.

Integrity constraints

To illustrate properties of the proposed event representation, we can define integrity constraints that the event data must satisfy.

Union of delete graphs must be a subset of the union of create graphs. You cannot delete non-existent data. The following ASK query must return false:

PREFIX : <http://example.com/>

ASK WHERE {
  GRAPH ?delete {
    ?delete a :Delete .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?create {
      ?create a :Create .
      ?s ?p ?o .
    }
  }
}

Each event graph must contain its type. The following ASK query must return true for each event:

ASK WHERE {
  GRAPH ?g {
    ?g a [] .
  }
}

The event type can be either :Create or :Delete. The following ASK query must return true for each event:

PREFIX : <http://example.com/>

ASK WHERE {
  VALUES ?type {
    :Create
    :Delete
  }
  GRAPH ?g {
    ?g a ?type .
  }
}

Events cannot have multiple types. The following ASK query must return false:

ASK WHERE {
  {
    SELECT ?g
    WHERE {
      GRAPH ?g {
        ?g a ?type .
      }
    }
    GROUP BY ?g
    HAVING (COUNT(?type) > 1)
  }
}

Querying

Querying over event streams is naturally more difficult than querying reconciled dataset snapshots. Nevertheless, the complexity of the queries may be hidden behind a proxy offering a more convenient syntax that extends SPARQL. An easy way to try the following queries is to use Apache Jena's Fuseki with an in-memory dataset loaded from the example event stream above: ./fuseki-server --file data.trig --update /ds.

Queries over the default graph, defined as the union of all graphs, query what has been true at some point in time:

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  # Fuseki uses <urn:x-arq:UnionGraph> to denote the union graph,
  # unless tdb:unionDefaultGraph is set to true.
  # (https://jena.apache.org/documentation/tdb/assembler.html#union-default-graph)
  GRAPH <urn:x-arq:UnionGraph> {
    ?s ?p ?o .
  }
}

Current valid data is a subset of the :Create graphs without the triples in the subsequent :Delete graphs:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
            dcterms:valid ?createValid .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?delete {
      ?delete a :Delete ;
              dcterms:valid ?deleteValid .
      FILTER (?deleteValid > ?createValid)
      ?s ?p ?o .
    }
  }
}

We can also roll back and query data at a particular moment in time. This functionality is what Datomic provides as the asOf filter. For instance, the data valid on January 1, 2001, at 9:00 is the union of the :Create events preceding this instant without the :Delete events that followed them until the chosen time:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
      dcterms:valid ?validCreate .
    FILTER (?validCreate < "2001-01-01T09:00:00Z"^^xsd:dateTime)
    ?s ?p ?o .
  }
  MINUS {
    GRAPH ?delete {
      ?delete a :Delete ;
        dcterms:valid ?validDelete .
      FILTER (?validDelete < "2001-01-01T09:00:00Z"^^xsd:dateTime)
      ?s ?p ?o .
    }
  }
}

Event resolution proxy

Manipulation of event streams following the proposed representation can be simplified by an event resolution proxy. This proxy may be based on the SPARQL 1.1 Graph Store HTTP Protocol, which provides a standard way to work with named graphs. However, the Graph Store Protocol doesn't support quad-based RDF formats, so the proxy needs to partition multi-graph payloads into several transactions.

The proxy can provide several conveniences. It can prune event payloads by removing retractions of non-existent triples or additions of existing triples, or by dropping complete events if found redundant. It can automatically add transaction time; for example, by using BIND (now() AS ?transactionTime) in SPARQL. Simplifying even further, the proxy can automatically mint event identifiers as URNs produced by the uuid() function in SPARQL. No event metadata can be provided explicitly in such a case, although some metadata may be created automatically. The event type can be inferred from the HTTP method the proxy receives. HTTP PUT may correspond to the :Create type, while HTTP DELETE should indicate the :Delete type. Additionally, the proxy can assume that valid time is the same as transaction time.
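
A minimal sketch of such automatic minting in SPARQL Update, assuming the proxy injects the received payload triples into the event's graph (the :transactionTime property is a hypothetical placeholder, not part of the representation above):

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>

INSERT {
  GRAPH ?event {
    ?event a :Create ;
      dcterms:valid ?now ;      # assuming valid time equals transaction time
      :transactionTime ?now .   # hypothetical property for transaction time
    # ... payload triples received by the proxy would be inserted here ...
  }
}
WHERE {
  BIND (uuid() AS ?event)  # mints a fresh urn:uuid:... IRI for the event
  BIND (now() AS ?now)
}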

Publishing

Create/delete event streams can be effectively split into batches by time intervals, suggesting several ways of publishing such data. An event stream should be published as a continuously updated, append-only quad dump. Additionally, there may be quad dumps of events from shorter periods of time, such as a day or a month, to enable more responsive data syndication. The currently valid dataset may be materialized and published as a periodically updated dump. Instead of updating the current dataset in place, it may be published in snapshots. A snapshot from a given date can be used as a basis when replaying events, so that you don't have to replay the whole history of events, but only those that came after the snapshot. Any quad-based RDF serialization, such as TriG or N-Quads, will do for the dumps. Finally, in the absence of structural sharing, the dumps should be compressed to avoid size bloat caused by duplication of shared data structures.

The next challenge is to refine and test this idea. We can also wrap the event streams in a convenient abstraction that reduces the cognitive effort that comes with their manipulation. I think this is something developers of RDF stores can consider including in their products.

2016-10-06

Basic fusion of RDF data in SPARQL

A need to fuse data often arises when you combine multiple datasets. The combined datasets may contain descriptions of the same things that are given different identifiers. If possible, the descriptions of the same thing should be fused into one to simplify querying over the combined data. However, the problem with different co-referent identifiers also appears frequently in a single dataset. If a thing does not have an identifier, then it must be referred to by its description. Likewise, if the dataset's format does not support using identifiers as links, then things must also be referred to by their descriptions. For example, a company referenced from several public contracts as their supplier may have a registered legal entity number, yet its description is duplicated in each awarded contract instead of linking the company by its identifier due to the limitations of the format storing the data, such as CSV.

Fusing descriptions of things is a recurrent task both in integration of multiple RDF datasets and in transformations of non-RDF data to RDF. Since fusion of RDF data can be complex, there are dedicated data fusion tools, such as Sieve or LD-FusionTool, that can help formulate and execute intricate fusion policies. However, in this post I will deal with basic fusion of RDF data using the humble SPARQL 1.1 Update, which is readily available in most RDF stores and many ETL tools for processing RDF data, such as LinkedPipes-ETL. Moreover, basic data fusion is widely applicable in many scenarios, which is why I wanted to share several simple ways of approaching it.

Content-based addressing

In the absence of an external identifier, a thing can be identified with a blank node in RDF. Since blank nodes are local identifiers and no two blank nodes are the same, using them can eventually lead to proliferation of aliases for equivalent things. One practice that ameliorates this issue is content-based addressing. Instead of identifying a thing with an arbitrary name, such as a blank node, its name is derived from its description; usually by applying a hash function. This turns the “Web of Names” into the Web of Hashes. Using hash-based IRIs for naming things in RDF completely sidesteps having to fuse aliases with the same description. This is how you can rewrite blank nodes to hash-based IRIs in SPARQL Update and thus merge duplicate data:
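
A sketch of such an update follows; the http://example.com/resource/ namespace for the minted IRIs is a hypothetical choice, and blank nodes nested as objects would need an analogous rewrite. Note that the GROUP_CONCAT ordering is not guaranteed to be deterministic across stores:

PREFIX : <http://example.com/>

DELETE {
  ?resource ?p ?o .
  ?s ?pIn ?resource .
}
INSERT {
  ?iri ?p ?o .
  ?s ?pIn ?iri .
}
WHERE {
  {
    # Derive an IRI from the hash of the blank node's description
    SELECT ?resource (IRI(CONCAT("http://example.com/resource/",
                                 SHA1(GROUP_CONCAT(?statement; separator="|")))) AS ?iri)
    WHERE {
      ?resource ?p ?o .
      FILTER isBlank(?resource)
      BIND (CONCAT(STR(?p), " ", STR(?o)) AS ?statement)
    }
    GROUP BY ?resource
  }
  ?resource ?p ?o .
  # Also rewrite incoming links, if any
  OPTIONAL { ?s ?pIn ?resource . }
}

Blank nodes with identical descriptions hash to the same IRI, so their duplicated triples collapse into one description.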

In practice, you may want to restrict the renamed resources to those that feature some minimal description that makes them distinguishable. Instead of selecting all blank nodes, you can select those that match a specific graph pattern. This way, you can avoid merging underspecified resources. For example, the following two addresses, for which we only know that they are located in the Czech Republic, are unlikely to be the same:


@prefix : <http://schema.org/> .

[ a :PostalAddress ;
  :addressCountry "CZ" ] .

[ a :PostalAddress ;
  :addressCountry "CZ" ] .

More restrictive graph patterns also work to your advantage in case of larger datasets. By lowering the complexity of your SPARQL updates, they reduce the chance of you running into out of memory errors or timeouts.

Hash-based fusion

Hashes can be used as keys not only for blank nodes. Using SPARQL, we can select resources satisfying a given graph pattern and fuse them based on their hashed descriptions. Let's have a tiny sample dataset that features duplicate resources:
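
The original sample data isn't reproduced here; a minimal stand-in consistent with the text (resources :r1, :r2, and :r5 instantiate class :C with identical descriptions) might look like this:

@prefix : <http://example.com/> .

:r1 a :C ;
  :name "Alias" .

:r2 a :C ;
  :name "Alias" .

:r3 a :C ;
  :name "Something else" .

:r4 a :D ;
  :name "Alias" .

:r5 a :C ;
  :name "Alias" .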

If you want to merge equivalent resources instantiating class :C (i.e. :r1, :r2, and :r5), you can do it using the following SPARQL update:
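
A sketch of such an update, assuming a hypothetical http://example.com/resource/ namespace for the fused IRIs (rewriting of incoming links is omitted for brevity):

PREFIX : <http://example.com/>

DELETE {
  ?resource ?p ?o .
}
INSERT {
  ?fused ?p ?o .
}
WHERE {
  {
    # Resources with identical descriptions hash to the same IRI
    SELECT ?resource (IRI(CONCAT("http://example.com/resource/",
                                 SHA1(GROUP_CONCAT(?statement; separator="|")))) AS ?fused)
    WHERE {
      ?resource a :C ;
        ?p ?o .
      BIND (CONCAT(STR(?p), " ", STR(?o)) AS ?statement)
    }
    GROUP BY ?resource
  }
  ?resource ?p ?o .
}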

The downside of this method is that the order of bindings in GROUP_CONCAT cannot be set explicitly, nor is it guaranteed to be deterministic. In theory, you may get different concatenations for the same set of bindings. In practice, RDF stores typically concatenate bindings in the same order, which makes this method work.

Fusing subset descriptions

If we fuse resources by hashes of their descriptions, only those with the exact same descriptions are fused. Resources that differ in a value or are described with different properties will not get fused together, because they will have distinct hashes. Nevertheless, we may want to fuse resources with a resource that is described by a superset of their descriptions. For example, we may want to merge the following blank nodes, since the description of the first one is a subset of the second one's description:


@prefix : <http://schema.org/> .

[ a :Organization ;
  :name "ACME Inc." ] .

[ a :Organization ;
  :name "ACME Inc." ;
  :description "The best company in the world."@en ] .

Resources with subset descriptions can be fused in SPARQL Update using double negation:
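
A sketch of such an update, scoped to the :Organization example above (the class restriction is an assumption to keep the pattern specific; note that two resources with identical descriptions are each a subset of the other, so a tie-breaking condition or prior hash-based fusion may be needed):

PREFIX : <http://schema.org/>

DELETE {
  ?subset ?p ?o .
}
INSERT {
  ?superset ?p ?o .
}
WHERE {
  ?subset a :Organization .
  ?superset a :Organization .
  FILTER (?subset != ?superset)
  # Double negation: there is no triple about ?subset
  # that is not also asserted about ?superset.
  FILTER NOT EXISTS {
    ?subset ?p2 ?o2 .
    FILTER NOT EXISTS {
      ?superset ?p2 ?o2 .
    }
  }
  ?subset ?p ?o .
}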

The above-mentioned caveats apply in this case too, so you can use a more specific graph pattern to avoid merging underspecified resources. The update may execute several rewrites until reaching the largest superset, which makes it inefficient and slow.

Key-based fusion

If you want to fuse resources with unequal descriptions that are not all subsets of one resource's description, a key to identify the resources to fuse must be defined. Keys can be simple, represented by a single inverse functional property, or compound, represented by a combination of properties. For instance, it may be reasonable to fuse the following resources on the basis of shared values for the properties rdf:type and :name:


@prefix :    <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1960-01-01"^^xsd:date ;
  :email "contact@acme.com" ;
  :description "The worst company in the world."@en ] .

[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1963-01-01"^^xsd:date ;
  :description "The best company in the world."@en ] .

To fuse resources by key, we group them by the key properties, select one of them, and rewrite the others to the selected one:
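
A sketch of such an update for the compound key of rdf:type and :name (rewriting of incoming links is again omitted for brevity):

PREFIX : <http://schema.org/>

DELETE {
  ?resource ?p ?o .
}
INSERT {
  ?sameResource ?p ?o .
}
WHERE {
  {
    # Pick one representative per key
    SELECT ?type ?name (SAMPLE(?resource) AS ?sameResource)
    WHERE {
      ?resource a ?type ;
        :name ?name .
    }
    GROUP BY ?type ?name
  }
  # Rewrite the other resources sharing the key to the representative
  ?resource a ?type ;
    :name ?name ;
    ?p ?o .
  FILTER (?resource != ?sameResource)
}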

If we fuse the resources in the example above, we can get the following:


@prefix :    <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1960-01-01"^^xsd:date, "1963-01-01"^^xsd:date ;
  :email "contact@acme.com" ;
  :description "The best company in the world."@en, "The worst company in the world."@en ] .

This example illustrates how fusion highlights problems in data, including conflicting values of functional properties, such as the :foundingDate, or contradicting values of other properties, such as the :description. However, resolving these conflicts is a complex task that is beyond the scope of this post.

Conclusion

While I found the presented methods for data fusion to be applicable for a variety of datasets, they may fare worse for complex or large datasets. Besides the concern of correctness, one has to weigh the concern of performance. Based on my experience so far, the hash-based and key-based methods are usually remarkably performant, while the methods featuring double negation are not. Nonetheless, the SPARQL updates from this post can be oftentimes simply copied and pasted and work out of the box (having tweaked the graph patterns selecting the fused resources).

2012-08-17

Technologies of linked data: RDF

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Resource Description Framework (RDF) is a standard format for data interchange on the Web. RDF is a generic graph data format that has several isomorphic representations. Any given RDF dataset may be represented as a directed labelled graph that may be broken down into a set of triples, each consisting of subject, predicate, and object.
Triples are the items that RDF data is composed of. The subject of a triple is the referent, the entity that is described by the triple. Predicate-object pairs are the referent’s characteristics.
RDF is a type of entity-attribute-value with classes and relationships (EAV/CR) data model. EAV/CR is a general model that may be grafted onto implementations spanning relational databases or object-oriented data structures, such as JSON. In the case of RDF, entities are represented as subjects, which are instances of classes, attributes are expressed as predicates that qualify relationships in data, and objects account for values.
In terms of the graph representation of RDF, subjects and objects form the graph’s nodes, while predicates constitute the graph’s edges that connect subjects and objects. The graph’s nodes and edges are labelled with URIs, blank nodes (nodes without intrinsic names), or literals (textual values).

Serializations

RDF is an abstract data format that needs to be formalized for exchange. To cater for this purpose RDF offers a number of textual serializations suitable for different host environments. A side effect of RDF notations being text-based is that they are open to inspection as anyone can view their sources and learn from them. Now we will describe several examples of the most common RDF serializations.
N-Triples is a simple, line-based RDF serialization that is easy to parse. It compresses well and so it is convenient for exchanging RDF dumps and executing batch processes. However, the character encoding of N-Triples is limited to 7-bit and covers only ASCII characters, while other characters have to be represented using Unicode escaping.
Turtle is a successor to N-Triples that provides a more compact and readable syntax. For instance, it has a mechanism for shortening URIs to namespaced compact URIs. Unlike N-Triples, Turtle requires UTF-8 to be used as the character encoding, which simplifies entry of non-ASCII characters.
RDF serializations based on several common data formats were developed, such as those building on XML or JSON. XML-based syntax of RDF is a W3C recommendation from 2004. With regard to JSON, there are a number of proposed serializations, such as JSON-LD, an unofficial draft for representing linked data. However, these serializations suffer from the fact that their host data formats are tree-based, whereas RDF is graph-based. This introduces difficulties for the format’s syntax as a result of “packing” graph data into hierarchical structures. For example, the same RDF graph may be serialized differently with no way of determining the “canonical” serialization.
Several RDF serializations were proposed to tie RDF data with documents, using document formats as carriers that embed RDF data. An example of this approach is RDFa, which allows structured data to be interwoven into documents by using attribute-value pairs. It is a framework that can be extended to various host languages, of which XHTML has a specification of RDFa syntax that reached the status of an official W3C recommendation.

Vocabularies and ontologies

While RDF is a common data model for linked data, RDF vocabularies and ontologies offer a common way of describing various domains. Their role is to provide a means of conveying semantics in data. An RDF vocabulary or ontology covers a specific domain of human endeavour and distills the most reusable parts of the domain into “an explicit specification of a conceptualization” [1, p. 1]. Conceptualization is thought of as a way of dividing a domain into discrete concepts.
The distinction between RDF vocabularies and ontologies is somewhat blurry. Ontologies provide not only lexical but also intensional or extensional definitions of concepts that are connected with logical relationships, and thus are thought of as more suitable for tasks based on logic, e.g., reasoning. RDF vocabularies offer basic “interface” data for a particular domain and as such are better suited for more lightweight tasks. Most linked data gets by with simple RDF vocabularies that are in rare cases complemented with ontological constructs.
Having data described with a well-defined and machine-readable RDF vocabulary or an ontology enables inference to be performed on the data. Inference serves for materializing data implied by the rules defined in the RDF vocabularies and ontologies through which the data is expressed. W3C standardized two ontological languages that may be used to create RDF vocabularies and ontologies: RDF Schema (RDFS) and Web Ontology Language (OWL).
There are countless RDF vocabularies and ontologies available on the Web. However, many of them are used only in the dataset for which they were defined, and only a few have reached sufficient popularity to be treated as de facto standards for modelling the domains they cover. An example of a general and widespread RDF vocabulary is Dublin Core Terms, which provides a basic set of means for expressing descriptive metadata. With regard to the public sector, some of the RDF vocabularies and ontologies covering this domain may be found in the Government vocabulary space of the Linked Open Vocabularies project.

References

  1. GRUBER, Thomas R. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993, vol. 5, iss. 2, pp. 199–220. Also available from WWW: http://tomgruber.org/writing/ontolingua-kaj-1993.htm

2012-01-21

Computing label specificity

This post has been long shelved in my head. Sometime around the summer of 2011 I started to think again about the problems that arise when you use labels (strings) as identifiers for information retrieval tasks. The ambiguity of labels used as identifiers without the necessary context is a common problem. Consider for example Wikipedia, which is trying to ameliorate this issue by providing disambiguation links to ambiguous label-based URIs. In this case the disambiguation is done by the user, who is provided with more contextual information describing the disambiguated resources.

Gradually I started to be interested not in label ambiguity, but in an inverse property of textual labels: label specificity. What particularly interested me was the notion of computing label specificity based on external linked data. At first, I thought that having an indicator of a label's specificity may be useful when such a label is to be treated as an identifier of a resource. Interlinking came to mind, in which the probability of a linkage between two resources is often computed based on labels' similarity. Another idea was to use label specificity in ontology design, to label parts of an ontology with unambiguous strings.

The more I delved into the topic, the more it started to look like a useless, academic exercise. I wrote a few scripts, did some tests, and thought the topic was hardly worth continuing. Then I stopped, leaving the work unfinished (as it led nowhere). I still cannot think of a real-world use for the approach I chose to investigate, but I believe I learnt something in the process, something that might stimulate further, more fruitful research.

Label ambiguity and specificity

Let's begin a hundred years ago, when Ferdinand de Saussure was writing about language. "Language works through relations of difference, then, which place signs in opposition to one another," he wrote. And: "a linguistic system is a series of differences of sound combined with a series of differences of ideas." However, these differences are not context-free, as their resolution is context-dependent. Natural language is not as precise as DNS and its correct resolution requires reasoning with context, of which humans are more capable than computers. This raises the question of what would happen if computers were provided with contextual, background knowledge in a sufficiently structured form.

A label is a proxy for a concept and its resolution depends on the label's context. Depending on the context in which a label is used, it can resolve to different concepts. Thus, addressing RDF resources with literal labels is a crude method of information retrieval. In most cases, labels alone are not sufficient for unique identification of concepts. The ambiguity of labels makes them unfit to be used as identifiers for resources. However, in some cases labels do serve the purpose of identification, and this comes with consequences: the consequences of treating label properties as instances of owl:InverseFunctionalProperty.

From the linguistic perspective, ambiguous labels can be homonyms or polysemes. Homonyms are unrelated meanings that share a label with the same orthography, such as bat as an animal and as a piece of baseball equipment. Polysemes, on the other hand, are related meanings grouped under the same label, such as mouth as a body part and as the place where a river enters the sea.

Computing label specificity then largely becomes a task of identifying ambiguous labels. Given the extensive language-related datasets available on the Web, such a task seems feasible. For instance, by using additional data from external sources, one can verify a link based on a rule specifying that the linked resources must match on an unambiguous label. And vice versa, every method of verification may be reversed and used for questioning the certainty of a given link.

Label specificity impacts the quality of interlinking based on matching labels of RDF resources. Harnessing string similarity for literal values of label properties is a common practice for linking data, since it is fast and easy to implement. Also, when the data about an RDF resource is very brief and there is virtually no other information describing the resource apart from its label, matching based on computing label similarity may be the only possible method for discovering links in heterogeneous datasets.

This approach to interlinking may work well if the matched labels are specific enough in the target domain to uniquely identify linked resources. In a well-defined and limited domain, such as medical terminology, it makes sense to treat label properties almost as instances of owl:InverseFunctionalProperty and use their values as strong identifiers of the labelled resources. However, in other cases the quality of results of this approach suffers from lexical ambiguity of resource labels.

In cases where links between datasets are made by a procedure based on similarity matching of resource labels, checking the specificity of the matched labels may be used to find links that were created by matching ambiguous labels and therefore need further confirmation by other methods, such as matching labels in a different language when the RDF resources have multilingual descriptions, or even manual examination to verify the validity of the link.

So I thought label specificity could be used as a method of verification for straightforward but potentially faulty interlinking based on resources' labels. I designed a couple of tests that were supposed to examine the tested label's specificity.

Label specificity tests

The simplest of the tests was a check of the label's character length. This test was based on a naïve intuition that a label's length correlates with its specificity: longer labels tend to be more specific, as more letters add more context, and in some cases labels that exceed a specified threshold length may be treated as unambiguous.

Just a little bit more complex was the test taking into account whether the label is an abbreviation. If the label contains only capital letters, it is highly likely that it is an abbreviation. Abbreviations used as labels are short and less specific because they may refer to several meanings. For instance, the abbreviation RDF may be expanded to Resource Description Framework, Radio direction finder, or Rapid deployment force (define:RDF). Unless abbreviations are used in a limited context defining a topical domain or language, such as the context of semantic web technologies for RDF, it is difficult to correctly guess the meaning they refer to.

Another test based on an intuition assumed that if a single dataset contains more than one resource with the examined label, the label should be treated as ambiguous. It is likely that a single dataset contains no duplicate resources, and thus resources sharing the same label must refer to distinct concepts.

I thought about a test based on a related supposition, that the number of translations of the tested label from one language to another also indicates its specificity. However, at the time I was thinking about this, I had not discovered any sufficient data sources. Google Translate had recently closed its API and other dictionary sources were not equipped with a machine-readable data output, or were not open enough. Not long ago I learnt about the LOD in Translation project, which queries multiple linked data sources and retrieves translations of strings based on multilingual values of the same properties for the same resources. Looking at the number of translations of bat (from English to English) supports the inkling that it is not the most specific label.

Then I tried a number of tests based on data from external sources that explicitly provide data about the various meanings of the tested label. I started with the so-called Thesaurus, which offers a REST API providing a service to retrieve the set of categories to which a given label belongs. Thesaurus supports four languages: Italian, French, and Spanish along with English. However, the best coverage is available for English.

The web service accepts a label and the code of its language, together with the API key and a specification of the requested output format (XML or JSON). For example, for the label bat it returns 10 categories representing different senses in which the label may be used.

Then I turned to the resources that provide data in RDF. Lexvo provides RDF data from various linguistic resources combined together, building on datasets such as Princeton WordNet, Wiktionary, or Library of Congress Subject Headings.

Lexvo offers a linked data interface, and its resources can also be accessed in RDF/XML via content negotiation. The Lexvo URIs follow a simple pattern: http://lexvo.org/id/term/{language code}/{label}. In order to get the data to examine a label's specificity, retrieve the RDF representation of the given label, get the values of the property http://lexvo.org/ontology#means from the Lexvo ontology, and group the results by data provider (cluster them according to their base URI) to eliminate the same meanings coming from different datasets included in Lexvo. For example, the URI http://lexvo.org/id/term/eng/bat returns 10 meanings from WordNet 3.0 and 4 meanings from OpenCyc.
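The grouping step can be sketched as follows. Only the clustering by base URI is being illustrated here; the meaning URIs in the example are made up (under example.org) to resemble WordNet and OpenCyc identifiers, not taken from an actual Lexvo response.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cluster_meanings(meaning_uris):
    """Group meaning URIs by their base URI (scheme + host), so that
    meanings coming from the same data provider are counted together."""
    clusters = defaultdict(set)
    for uri in meaning_uris:
        parsed = urlparse(uri)
        clusters[f"{parsed.scheme}://{parsed.netloc}"].add(uri)
    return dict(clusters)

# Hypothetical values of the lexvo:means property for one label
meanings = [
    "http://wordnet.example.org/synset/bat-noun-1",
    "http://wordnet.example.org/synset/bat-noun-2",
    "http://opencyc.example.org/concept/Bat",
]
clusters = cluster_meanings(meanings)
print(len(clusters))  # 2 providers
```

The number of clusters then approximates the number of independent sources that know the label, and the size of each cluster the number of senses within one source.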

Of course I tried DBpedia, one of the biggest sources of structured data, as well. Unlike Lexvo, DBpedia provides a SPARQL endpoint. In this way, instead of requesting several resources and running multiple queries, the information about the ambiguity of a label based on Wikipedia's disambiguation links can be retrieved by a single SPARQL query, such as this one:

PREFIX dbpont: <http://dbpedia.org/ontology/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?resource ?label
WHERE {
  ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
    rdfs:label "Bat"@en
  ] .
  ?resource rdfs:label ?label .
  FILTER(LANGMATCHES(LANG(?label), "en"))
}

After implementing similar tests of label specificity, I was not able to find any practical use case for such functionality. I left this work, shifting my focus to things that looked more applicable in the real world. I still think there is something to it, and maybe I will return to it some day.

2011-07-03

RDFa in action

RDFa is a way to exchange structured data inside HTML documents. RDFa provides information that is formalized enough for computers (such as googlebot) to process in an automated way. RDFa is a complete serialization of RDF, using attribute-value pairs to embed data into HTML documents in a way that does not affect their visual display. RDFa is a hack built on top of HTML. It repurposes some of the standard HTML attributes (such as href, src, or rel) and adds new ones (such as property, about, or typeof) to enrich HTML with semantic mark-up.

A good way to start with RDFa is to read through some of the documents, such as the RDFa Primer or even the RDFa specification. When you want to annotate an HTML document with RDFa, you might want to go through a series of steps. We used this workflow during an RDFa workshop I helped to organize, and this recipe worked quite well. Here it is.

  1. Find out what you want to describe (e.g., your personal profile).
  2. Find which RDF vocabularies can be used to describe such a thing (e.g., FOAF). There are multiple ways to discover suitable vocabularies, some of which are listed at the W3C website for Ontology Dowsing.
  3. Start editing your HTML: either the static files or dynamically rendered templates.
  4. Start at the first line of your document and set the correct DOCTYPE. If you are using XHTML, use <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd"> (i.e., RDFa 1.0). If you are using HTML5, use <!doctype html> (i.e., RDFa 1.1). This will allow you to validate your document and see if you are using RDFa correctly.
  5. Refer to the used RDF vocabularies. By declaring the vocabularies' namespaces you set up variables that you can use in compact URIs. If you are using XHTML, use the xmlns attribute (e.g., xmlns:dv="http://rdf.data-vocabulary.org/#"). If you are using HTML5, use the prefix, vocab, or profile attributes (e.g., prefix="dv: http://rdf.data-vocabulary.org/#").
  6. Identify the thing you want to describe. Use a URI as a name for the thing so that others can link to it. Use the about attribute (e.g., <body about="http://example.com/recipe">). Everything that is nested inside of the HTML element with the about attribute is the description of the identified thing, unless a new subject of description is introduced via a new about attribute.
  7. Use the typeof attribute to express what kind of thing you are describing (e.g., <body about="http://example.com/recipe" typeof="dv:Recipe">). Pick a suitable class from the RDF vocabularies you have chosen to use and define the thing you describe as an instance of this class. Note that every time the typeof attribute is used, the subject of description changes.
  8. Use the property, rel and rev attributes to name the properties of the thing you are describing (e.g., <h1 property="name">).
  9. Assign values to the properties of the described thing using either the textual content of the annotated HTML element or an attribute such as content, href, resource, or src (e.g., <h1 property="name">RDFa in action</h1> or <span property="v:author" rel="dcterms:creator" resource="http://keg.vse.cz/resource/person/jindrich-mynarz">Jindřich Mynarz</span>).
  10. If you have assigned the textual content of an HTML element as the value of a property of the thing described, you can annotate it further. To define the language of the text, use either the xml:lang (in XHTML) or lang (in HTML5) attribute (e.g., <h1 property="name" lang="en">RDFa in action</h1>). If you want to set the datatype of the value, use the datatype attribute (e.g., <span content="2011-07-03" datatype="xsd:date">July 3, 2011</span>).
  11. Check your RDFa-annotated document using validators and examine the data using RDFa distillers to see if you have got it right.
  12. Publish the annotated HTML documents on the Web. Ping the RDFa consumers such as search engines so that they know about your RDFa-annotated web pages.
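Put together, the steps above might produce an HTML5 document like the following minimal sketch. It reuses the Data-Vocabulary namespace and the recipe URI from the examples above; the dv:published property is my assumption for illustration, so check the vocabulary's documentation for the properties it actually defines.

```html
<!doctype html>
<html prefix="dv: http://rdf.data-vocabulary.org/#
              xsd: http://www.w3.org/2001/XMLSchema#">
  <head>
    <title>RDFa in action</title>
  </head>
  <body about="http://example.com/recipe" typeof="dv:Recipe">
    <h1 property="dv:name" lang="en">RDFa in action</h1>
    Published on
    <span property="dv:published" content="2011-07-03"
          datatype="xsd:date">July 3, 2011</span>.
  </body>
</html>
```

Running a document like this through an RDFa distiller should yield triples with http://example.com/recipe as the subject, which is a quick way to verify that steps 6 to 10 were applied correctly.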

2011-04-10

Data-driven e-commerce with GoodRelations

On April 6th at the University of Economics, Prague, Martin Hepp gave a talk entitled Advertising with Linked Data in Web Content: From Semantic SEO to E-Commerce on the Web. Martin presented his view of the current situation in e-commerce and how it can be improved through structured data, illustrating it with GoodRelations, the ontology he has created.

GoodRelations

GoodRelations is an ontology describing the domain of electronic commerce. For instance, it can be used to express an offering of a product, specify a price, or describe a business, and the like. The author and active maintainer of GoodRelations is Martin Hepp. As he shared in his talk, there are actually quite a few features that set it apart from other ontologies.
  1. It's the one ontology that someone has actually paid for: at Overstock.com, an expert was hired to consult on the use of GoodRelations.
  2. It's not only a research project. It's been accepted by the e-commerce industry and it's used by companies such as BestBuy or O'Reilly Media.
  3. Its design is driven mainly by practice and real use cases, not only by research objectives. For instance, it was amended when Google requested minor changes. Google even stopped recommending the vocabulary it had created for the domain of e-commerce in favour of GoodRelations; it's the piece of the semantic web Google has chosen. Nonetheless, it's still an OWL-compliant ontology.
  4. It comes with a healthy ecosystem around it. The ontology provides thorough documentation with lots of examples and recipes that you can adopt and fine-tune to your specific use case. There are validators available for the ontology, and there are plenty of e-shop extensions and tools built for GoodRelations.
  5. Finally, it's not only a product of necessity. As Martin Hepp said, he actually quite enjoys doing it.

Product Ontology

The other project showcased by Martin Hepp was the Product Ontology. It's a dataset describing products that is derived from Wikipedia's pages. It contains several hundred thousand precise OWL DL class definitions of products. These class definitions are tightly coupled with Wikipedia: edits in Wikipedia are reflected in the Product Ontology. For instance, if the Product Ontology doesn't list the type of product you sell, you can create a page for it in Wikipedia and, given that it's not deleted, the product type will appear in the Product Ontology within 24 hours. This is similar to the way the BBC uses Wikipedia. An added benefit is that it can also serve as a dictionary containing up to a hundred labels in different languages for a product, because it's built on Wikipedia, which bundles pages describing the same thing in different languages.

Semantic SEO

The primary benefit of GoodRelations is in how it improves search. We spend more time searching than we ever used to; Martin Hepp said there has been an order-of-magnitude increase in the time we spend searching. It takes us a long time before we finally find the thing we are interested in, because current web search is a blunt instrument.
The World-Wide Web acts as a giant information shredder. In databases, data are stored in a structured format, but during transmission to web clients that structure is lost. Data aren't sent in the form in which they are stored in the database; instead, they are presented in a web page that a human customer can read but that machines can pretty much treat only as a black box. The structure of the data gets lost on the way to the client, and only the presentation of the content is delivered. This means that an agent accessing the data via the Web often needs to reconstruct and infer the original structure of the data.
Web search operates on a vast amount of data that is for the most part unstructured and as such doesn't provide the affordances to do anything clever. Plain HTML doesn't allow you to articulate your value proposition well: products and services are often reduced to a price tag. Enter semantic SEO.
Semantic SEO can be defined as using data to articulate your value proposition on the Web. It strives to preserve the specificity and richness of your value proposition when you need to send it over the Web. Ontologies such as GoodRelations allow you to describe your products and services with a high degree of precision.

Specificity

We need cleverer and more powerful search engines because of the tremendous growth in specificity. Wealth fosters the differentiation of products, and this in turn leads to increased specificity. This means there is a plethora of various types of goods and services available on the shelves of markets and shops. The size of the type system we use has grown (in RDF-speak, this would be the number of different rdf:types). We're overloaded with the number of different product types we're able to choose from. It's the paradox of choice: faced with a larger number of goods, our ability to choose one of them goes down.
What GoodRelations does is provide a way to annotate products and services on the Web that search engines can use to deliver a better search experience to their users. It allows for deep search: a search that accepts very specific search queries and gives very precise answers. With GoodRelations you can retain the specificity of your offering and harness it in search. This makes it possible to target niche markets and reach customers with highly specific needs in the long tail.
We need better search engines built on the structured data on the Web to alleviate the analysis paralysis that results from being overwhelmed by the number of things to choose from. The growing amount of GoodRelations-annotated data is a step towards a situation where you'll be able to pose a specific question to a search engine and get a list of only the highly relevant results.
E-commerce applications and ontologies such as GoodRelations or the Product Ontology show a pragmatic approach to the use of semantic web technologies. Martin Hepp also mentioned his pragmatic view of linked data: in his opinion, the links that create the most business advantage are the most important. It was interesting to see parts of the semantic web that work. It seems we're headed towards a future of data-driven e-commerce.

2011-01-16

Shopping starts at Google

I don't know where the Web ends. It may have multiple ends, or none. But I know where the Web starts. It starts at Google.

A few years back, it was reported that 6 % of all internet traffic starts at Google. Also, plenty of people have Google set as their homepage. I think many of us would agree that our brain is only a thin layer on top of Google.

One reason for using Google is that people don't remember URIs. On the Web, the address of a thing is a URI. In the human brain, the address of a thing is a set of associations that locate it in a neural network. That's why we need a way to translate these associations to a URI, and Google does this fairly well. You pass it a bunch of keywords related to the thing you are looking for, and it produces a nice, ordered list of URIs that might point to the thing you have in mind.

People don't use URIs to describe the things they are thinking of; machines do. I can't remember URIs, especially those of RDF vocabularies, which tend to be quite long. That's why I use prefix.cc, which lets me find the URI I'm looking for by passing it something I can remember: the vocabulary's prefix. The service remembers the vocabulary's URIs for me.

As it turns out, people don't remember the URIs of the things they want to buy either. So these days, a lot of shopping starts at Google. When you are looking to buy something you often start by describing that something to Google.

In commerce, things are addressed by brand. The problem with that is that people don't search for brands and they don't search for product names; they search for concepts. People don't search for Olympus E-450, they search for a camera. Brands and product names are not in their vocabularies, but concepts described by keywords are. People don't use brand names to describe the things they are thinking of, commerce does.

To bridge this gap, you need to translate the keywords that people use to describe stuff into the brands that commerce uses to describe stuff. Enter search engine optimization (SEO). One of the things that SEO does is create synonym rings. A synonym ring is a set of synonyms: words that people use to describe a thing, such as the words mentioned in this tweet:

Can you all please stop retweeting those SEO jokes, gags, cracks, funnies, LOLs, humour, ROFLs, chuckles, rib-ticklers, one-liners, puns?

This SEO task consists in collecting the keywords people might use when searching for a thing so that they find your thing™ that you have described with these keywords.

It would be better if you could say that your thing™ (e.g., Olympus E-450) is a kind of thing people search for (e.g., a camera). Then, when people search for a thing, they may find that your thing™ is such a thing. This is one of the promises of the semantic web vision. But, just like its Wikipedia article, the semantic web still has a lot of issues.

Nevertheless, the semantic web vision has created some interesting by-products in the last few years. One of them is the Linked Open Data initiative, striving to build a common, open data infrastructure for the semantic web that is coming (for sure). Another by-product of this vision is the so-called semantic SEO.

Both the semantic web and semantic SEO are misnomers; there is nothing exceptionally semantic in them. I would rather call it data SEO, but it seems the current name will stick. Semantic SEO is the practise of adding a little bit of structured data (preferably in RDF) to websites instead of adding a bunch of keywords. For instance, you can use the GoodRelations RDF vocabulary to mark up the web page describing the product you're offering; even Google says you can. In semantic SEO a little bit of semantics is good enough; it can still go a long way.
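For instance, a product offering marked up with GoodRelations in RDFa might look roughly like the following sketch; the product URI and the price are made up for illustration.

```html
<div xmlns:gr="http://purl.org/goodrelations/v1#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     about="http://example.com/offering/e450" typeof="gr:Offering">
  <span property="gr:name">Olympus E-450 digital camera</span>
  <div rel="gr:hasPriceSpecification">
    <div typeof="gr:UnitPriceSpecification">
      <span property="gr:hasCurrency" content="EUR">€</span>
      <span property="gr:hasCurrencyValue"
            datatype="xsd:float">329.00</span>
    </div>
  </div>
</div>
```

A few attributes like these are all it takes; the rest of the page stays exactly as it was.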

Having your thing™ described with structured data makes it machine-readable. A search engine, like Google, is a kind of machine. Therefore, making your data machine-readable makes it readable for search engines. You can try for yourself how Google reads your data.

By adding a bit of data into the mark-up of your web page (preferably via RDFa) you can optimize the way it will be displayed in Google's search results. Instead of a boring, text-only rendering you can get a display that contains useful information, such as an image of your thing™, its rating, reviews and the like. See the example at the GoodRelations website to compare the difference.

People are more likely to click on a search result with a nice image in it, a result that is enriched with all kinds of useful information. This may lead to an increase in your click-through rate. For example, RDFa adoption at BestBuy resulted in a 30 % increase in search traffic. Pursuing the semantic web vision has been a largely academic undertaking, so it's good to see that its by-product, semantic SEO, has some real financial benefits.

The practise of semantic SEO is definitely not an academic endeavour; quite the opposite, a lot of high-profile companies and institutions are adopting it (e.g., BestBuy, O'Reilly, or Tesco). The share of webpages that have structured data in RDFa is growing: in October 2010, RDFa was in 3.5 % of webpages, whereas the year before the share was 0.5 %.

E-commerce is one of the key factors that contributed to the growth of the Web in the 1990s. The same may become true for the Web of Data, a.k.a. linked data: the e-commerce applications of semantic web technologies, such as semantic SEO, may become a crucial driver behind its growth and accelerate the adoption of the linked data principles.

2010-12-04

Design patterns for modelling bibliographic data

I have done several conversions of bibliographic library data from the MARC format, and most of the time I had to deal with some recurring data modelling issues. During these conversions I have also adopted a set of design patterns, described in the following parts of this post, that can make the conversion process easier. This post was inspired by the things discussed at the Semantic Web in Bibliotheken 2010 conference, by several conversations at SemanticOverflow, and by a book on linked data patterns.

  • Do a frequency analysis of MARC tag occurrences before data modelling.

    If you perform a frequency analysis beforehand, you will have a better picture of the structure of your dataset. This way you will know which MARC elements are the most frequent and thus more important to be modelled properly and captured in RDF without a loss of semantics.
  • Use an HTTP URI for every entity that has some assertions attached to it.

    Even though you may have little information about some entity, it is worth minting a new URI, or re-using an existing one, for it. Using explicit (URI) and resolvable (HTTP) identifiers, instead of hiding the entity in a literal or referring to it with a blank node, enables it to be used both in the RDF statements in your dataset and in any external RDF data. And, as Stefano Bertolo wrote, linking identifiers is human, re-using them is divine.
  • Create a URI pattern for every distinct resource type.

    In order to differentiate clusters of resources belonging to a certain resource type (rdf:type), you should define a specific URI pattern for each of them. By doing this you will have the resource identifiers grouped in a meaningful manner, preferably with patterns that enable human users to infer the type of a resource.
  • Use common RDF vocabularies.

    It is better to re-use common RDF vocabularies, such as Dublin Core or Friend of a Friend, instead of making use of complex, library-specific vocabularies, such as those created for the RDA. There is also the option of going with the library-specific vocabularies and then linking them on the schema level to more general and widely-used vocabularies. This enables easier integration of RDF data and combination of multiple RDF datasets. It also helps to make your data more accessible to semantic web developers, because they do not have to learn a new vocabulary just for the sake of using your data. Libraries are not so special, and they do not need everything re-invented with special respect to them.
  • Use separate resources for the document and the record about it.

    This allows making separate assertions about the document and about its description in the record, and attaching them to the correct resource. For some statements, such as copyright, it makes sense to attach them to the record resource instead of the document. For example, this way you can avoid the mistake of claiming authorship of a book when you are only the author of the record about that book.
  • Provide labels in multiple languages.

    It is useful to provide labels for bibliographic resources in multiple languages, even though the strings of the labels might be equivalent. This is the case for terms that either do not have a representation in a language or were adopted from another language, for example "Call for papers"@en and "Call for papers"@cs. This does not create redundant or duplicate data, because it adds new information: the fact that the labels have the same string representation. They may be pronounced the same or differently, but they still provide useful information for purposes like translations and the like.
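Some of the patterns above, namely type-specific URI patterns, separate document and record resources, and language-tagged labels, can be sketched with plain Python tuples standing in for RDF triples. The URIs under example.com and the names are made up; a pair (string, language tag) stands in for a language-tagged literal.

```python
# Commonly used vocabulary terms (real vocabulary URIs)
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"

# Distinct URI patterns for distinct resource types
document = "http://example.com/resource/document/123"
record = "http://example.com/resource/record/123"

graph = {
    # The record is about the document, but is a separate resource ...
    (record, FOAF_PRIMARY_TOPIC, document),
    # ... so authorship can be attached to the correct resource
    (document, DCTERMS_CREATOR, "Jane Author"),
    (record, DCTERMS_CREATOR, "Cataloguing department"),
    # Labels with equivalent strings in two languages are two distinct triples
    (document, RDFS_LABEL, ("Call for papers", "en")),
    (document, RDFS_LABEL, ("Call for papers", "cs")),
}
print(len(graph))  # 5 triples: the two labels do not collapse into one
```

In a real conversion an RDF library would play the role of the set of tuples, but the modelling decisions, which resource gets which statement, are exactly the same.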