2013-07-31

Vocabularies for the web of data and principles of least markup

I want to share a few thoughts about markup vocabularies that I pondered over the past months when developing a Schema.org extension proposal targeting the long tail of the job market. Schema.org is a prime example of a markup vocabulary. In fact, if you search Google for “markup vocabulary”, most results will associate the term with Schema.org. Throughout this post, I’ll use this vocabulary as an example to illustrate the points made.

So, how does a markup vocabulary differ from, say, an ontology? Markup vocabularies serve different purposes than traditional ontologies, although their uses overlap. While the distinction between vocabularies and ontologies is blurry, it can be said that ontologies are based on logic, whereas vocabularies are based on convention. Ontologies are typically used for tasks such as inferring additional data, whereas vocabularies serve rather as structures for easier parsing when exchanging data. Alex Shubin likened Schema.org, as an example of a markup vocabulary, to a set of “sitemaps for content” (source). Whereas sitemaps help machines find pages within a web site, Schema.org helps machines find the bits of content within a web page.

In practice, vocabularies are used for the less orderly data. As Dan Brickley said in one of his talks: “Schema.org is for the rest of the Web; for that big sprawling chaos.” To reach wide adoption, vocabularies need to be generic and application-agnostic so that they can be applied at the largest scale possible. So, as asked at the Schema.org panel discussion, “how does schema design at a planetary scale work, in practice?” I think the answer to this question may be approached from two complementary angles: recommended vocabulary design patterns and markup guidance.

Vocabulary design

While there are many methodologies for developing ontologies (such as the NeOn Methodology or METHONTOLOGY), it seems that similar instructions are lacking for vocabularies. It is unclear what such instructions should be based on. Whereas the design of artefacts is frequently shaped by their anticipated uses in practice, so that their form follows function, successful vocabularies are often those that don’t anticipate any particular uses. Their function is defined in terms of a broad goal of supporting the widest possible use. And this isn’t an easy goal to provide workable recommendations for.

I think one common rule of thumb is that vocabulary designers should strive to lower the cognitive overhead users face when working with vocabularies and focus on improving vocabulary usability. However, how do these nebulous goals translate into practice?

One (slightly less vague) piece of advice is that vocabulary design shouldn’t require users to make difficult conceptual distinctions. In order to achieve that, make the differences between vocabulary terms clear (using clear labels and descriptions) in order to avoid ambiguity. If users regularly mix up two distinct concepts, either drop one of the concepts or provide the concepts with better definitions. As the Zen of Python states on a similar note, “there should be one -- and preferably only one -- obvious way to do it.”

Another (slightly less vague) piece of advice is to avoid object proliferation in the vocabulary you develop. In his talk from May 2013, Richard Cyganiak mentioned that vocabularies are typically built from the bottom up, based on usage evidence, so that unused objects aren’t included. Richard reiterated the claim that successful vocabularies for the web of data are small and simple (such as Dublin Core Terms), which was already presented by Martin Hepp in his account of Possible ontologies.

One practical technique in line with this advice is to avoid intermediate resources, which are typically represented with blank nodes and are often needed for object properties. Schema.org labels such intermediate objects as “embedded items”. If your vocabulary contains an object property that points to an intermediate object further described with other properties, and all these properties have 0…1 cardinality, then you may consider redefining them as direct properties of the object property’s subject. For example, the class schema:JobPosting is used with the properties schema:baseSalary and schema:salaryCurrency. These properties could have been associated with an intermediate schema:Salary object, or even with two intermediate objects, schema:JobPosition and schema:Salary; however, they are instead attached as direct properties of the schema:JobPosting class. Be careful though not to take this as a catch-all rule. Object properties that usually link to URIs, such as schema:hiringOrganization, which links to schema:Organization, don’t need to be treated in this way.
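To make the contrast concrete, here is a small sketch in Turtle. The first variant, with an intermediate schema:Salary object, is hypothetical (Schema.org defines no such class, and the nested properties are invented for illustration), while the second uses the direct properties Schema.org actually defines:

@prefix schema: <http://schema.org/> .

# Variant 1 (hypothetical): an intermediate "embedded item" for the salary
<http://example.com/job/123> a schema:JobPosting ;
  schema:baseSalary [
    a schema:Salary ;             # hypothetical class
    schema:value 50000 ;          # hypothetical use of a value property
    schema:salaryCurrency "EUR"
  ] .

# Variant 2 (actual Schema.org design): direct properties on the job posting
<http://example.com/job/123> a schema:JobPosting ;
  schema:baseSalary 50000 ;
  schema:salaryCurrency "EUR" .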

Markup guidance

Judging from the documentation of markup vocabularies, much of the presented guidance revolves around markup rather than vocabulary design. For instance, the guidance on designing vocabularies for HTML provided by W3C focuses on issues of markup syntax. I think a lot of recommendations concerning markup for the web of data can be considered extensions of the principle of least effort, so calling them the principles of least markup sounds about right.

A practical realization of such principles might advise omitting data that can be computed automatically. This guidance might encompass omitting inferrable types, including class instantiations and literal datatypes, when there’s only one valid option. Note that this approach doesn’t apply in cases where a “type” or “unit” needs to be provided to serve as a value reference; for example, when describing a price and its currency. A more controversial extension of this principle might recommend avoiding forcing users to mint their own URIs unless necessary. For many purposes of data on the Web, anonymous nodes represented with blank nodes are sufficient, given that they may be transformed to URIs and linked deterministically during data ingest and subsequent processing.
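As a rough sketch of what this could look like, the following Turtle snippet (using Schema.org terms; the resource URI and literal values are invented) leaves the literal datatype implicit where only one option makes sense, keeps an explicit currency because it serves as a value reference, and uses a blank node instead of a freshly minted URI for the organization:

@prefix schema: <http://schema.org/> .

<http://example.com/job/123> a schema:JobPosting ;
  # datatype left implicit: the expected value of datePosted is a date
  schema:datePosted "2013-07-31" ;
  # the currency cannot be inferred, so it is stated explicitly
  schema:baseSalary 50000 ;
  schema:salaryCurrency "EUR" ;
  # a blank node is enough; no need to force a URI for the organization
  schema:hiringOrganization [
    a schema:Organization ;
    schema:name "ACME"
  ] .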

Besides decreasing the number of characters that users need to type to add vocabulary markup, there are a few recurrent issues frequently mentioned in markup advice.

A thorny issue is the single namespace policy, which proposes that users should be able to create markup with a single vocabulary only. This recommendation is based on the assumption that having multiple vocabulary namespaces requires users to shift between the contexts of different vocabularies, which is held to be cognitively demanding. For example, Schema.org aims to provide this single all-encompassing namespace, from which every necessary vocabulary term may be drawn. The single namespace policy is also reflected in RDFa’s vocab attribute, which makes it possible to specify a single namespace that is then applied to all unqualified names used in the markup.

When looking for the source of errors in markup, unclear scoping rules are often to blame. Scoping is governed by rules prescribing which subject the properties in markup should be attached to, based on the positioning of attribute-value pairs in the hierarchical structure of HTML and the semantic context set by other markup. The scoping rules are notoriously difficult to grasp, which might have contributed to Microdata having the itemscope attribute that sets the scope explicitly.

An issue related to scoping is directionality, which prescribes whether the current scope should be used as the subject or the object of marked-up properties. To reverse the default directionality, RDFa offers the rev attribute; previously, it used reverse direction for the src attribute as well. Directionality, among other issues, is described by Gregg Kellogg in his list of common pitfalls when marking up HTML with RDFa. Microdata, on the other hand, avoids this issue by being uni-directional.

Tolerance

Markup guidelines for data publishers should have a counterpart on the side of data consumers. That counterpart is the principle of tolerance. Schema.org’s documentation of its data model states: “In the spirit of ‘some data is better than none’, we will accept this markup and do the best we can.” Even though markup may be broken in many different ways, data consumers should try to be fault-tolerant. This attitude is in line with Postel’s principle of robustness, which states: “Be conservative in what you send, be liberal in what you accept.” And so I think that until we know better about vocabulary design, we had better be tolerant and liberal about data on the Web.

2013-07-29

Towards usability metrics for vocabularies

As far as I know, there are no established measures for evaluating vocabulary usability. To clarify, when I use the term “vocabularies”, what I mean is simple schemas and lightweight lexical ontologies that are used primarily for marking up content embedded in web pages, using syntaxes such as Microdata or RDFa. A good example of such a vocabulary is Schema.org, an overarching, yet simple schema of things and relations that four big search engines (Google, Microsoft Bing, Yahoo! and Yandex) deem to be important to their users.

The closest to the topic seems to be the paper Ontology evaluation through usability measures by Núria Casellas. With regard to the syntactical usability of markup, there was a usability study of Microdata done by Ian Hickson, the minimalistic setting of which was the subject of numerous rants, such as the one by Manu Sporny. I presume more thought needs to be spent on discovering how existing usability research relates to vocabularies and which standard usability principles apply. Nevertheless, borrowing from usability testing used for web sites, software or in libraries, three metrics relevant to vocabularies crossed my mind.

The first is the error rate when using a vocabulary. It is based on the assumption that the more usable a vocabulary is, the fewer errors its users should make. Vocabulary validators may be used to automate this technique. Such tools may execute fine-grained rules, which may help to discern the most problematic parts of vocabularies, where users make the most errors. An example of a study testing error rate was conducted by Yandex. Note, however, that it focused more on markup syntaxes than on vocabularies themselves. It reported a 10 % error rate in RDFa (4 % share in the sample), a 10 % error rate in hCard (20 % share) and almost no errors in Facebook’s Open Graph Protocol (1.5 % share), which is also RDFa.

A broader feature that may serve as input to usability testing is data quality. A metric based on data quality should primarily take into account valid data, since invalid data should be caught by error rate testing. Recognizing data quality as a relevant feature is based on the assumption that more usable vocabularies support creating data of better quality. However, the relation between vocabulary usability and data quality should not be considered causation, but rather correlation, which might pinpoint weak parts of a vocabulary where data quality suffers. Transforming data quality into a discrete metric is tricky, but there already are data quality assessment methodologies, some of which are documented in this paper (PDF), from which test procedures for usability of vocabularies may be derived.

The remaining metric I propose is adopted from library and information science, in which the quality of indexing (much like mark-up) can be evaluated in terms of inter-indexer and intra-indexer consistency. Reframed for usability testing of vocabularies, inter-user and intra-user consistency could be more suitable labels. Inter-user consistency is the degree of agreement among users in describing the same content. On the other hand, intra-user consistency is the extent to which a single user is consistent in marking up the same content over time. Consistent use of a vocabulary may be taken as a sign that the vocabulary terms are not ambiguously defined, so that users do not confuse them. It may also show that there is documentation providing clear guidance on the ways in which the vocabulary may be used. These metrics might help test if vocabularies can be “easily understood by the users, so that it can be consistently applied and interpreted” (source).

These metrics have a long history in the field of libraries and are already deployed in practice on the Web. For example, Google Image Labeler (now defunct) was a game that asked pairs of users (mutually unknown to each other) to label the same image and rewarded them if they agreed on a label. A similar service that works on the same principle of rewarding consistency is LibraryThing’s CoverGuess. A naïve approach to implementing these metrics could compute the size of the diff, so that, for example, markups produced by two users given the same web page and instructed to use the same vocabulary can be compared. A more complex implementation might involve distance metrics that measure similarity of patterns in data, such as the metrics offered by Silk. Finally, when applying the consistency metrics, as observed previously, you should keep in mind that high consistency may be achieved at the expense of low overall quality. Therefore, these metrics are best complemented with data quality testing.

I believe adopting usability testing as a part of vocabulary design is a step forward for data modelling as a discipline. To start, we will first need to find out which usability metrics apply to vocabularies or develop new approaches specific to usability testing of vocabularies. So let’s get user-centric, shall we?

2013-07-17

Capturing temporal dimension of linked data

Summary

The article provides an overview of the existing approaches for capturing the temporal dimension of linked data within the limits of the RDF data model. Best-of-breed techniques for representing temporally varying data are compared and both their advantages and disadvantages are summarized with special respect to linked data. Common issues and requirements for the proposed data models are discussed. Modelling patterns based on named graphs and concepts drawn from four-dimensionalism are held to be well suited for the needs of linked data on the Web.

The world is in flux; models and datasets about the world have to co-evolve with it in order to retain their value. The nascent web of data can be thought of as a global model of the world, which, as such, is subject to change. The nature of Web content is dynamic, as both individual resources and the links between them change frequently.

Facts are time-indexed and as such may cease to be valid, being superseded by newly acquired knowledge. Even though some knowledge of encyclopaedic nature might change infrequently, revolutions in scientific paradigms have led us to doubt the existence of eternal truths. Many resources are temporal in nature, for example stocks and news (Gutierrez, Hurtado and Vaisman, 2007), and a growing amount of data on the Web comes directly in timestamped streams from sensors. Moreover, data on the Web often starts as raw material that is refined over time, for example, when patches based on user feedback are incorporated. Datasets “are constantly evolving to reflect an updated community understanding of the domain or phenomena under investigation” (Papavasileiou et al., 2013).

At the same time people are recognizing the value of historical data, access to which proves to be essential for making better decisions about the present and for predicting the future. Consequently, access to both current and historical data is deemed vital. However, many data sources on the Web are not archived and disappear on a regular basis. Auer remarks that datasets often “evolve without any indication, subject to changes in the encoded facts, in their structure or the data collection process itself” (Auer et al., 2012). In many cases we fail to record when we get to know about things described in our data. Lack of provenance metadata and temporal annotations makes it difficult to understand how datasets develop with respect to the real world entities they describe. Rate of change in datasets is thus opaque unless they are provided with explicit metadata.1 Most changes are implicit and detecting them requires “reverse engineering” by comparing snapshots of the observed dataset; a crude way to tell what changed and in which way.

All these remarks highlight the temporal dimension of data on the Web. Being able to observe the nature of change in linked data proves important for many data consumption use cases. Knowing the rate of change may help to determine which queries can be safely run on cached data and which queries need to be executed on live data (Umbrich, Karnstedt and Land, 2010). Not knowing what changes makes data synchronization inefficient, as it requires copying whole datasets. Temporal annotation may vastly benefit applications that combine linked data from the Web. It allows them to cluster things that happened simultaneously and to determine the order of events that might have been spawned by each other. In the data fusion use case, applications may use temporal annotation to favour more recent data. Finally, “in some areas (like houses for sale) it is the new changed information which is of most interest, and in some areas (like currency rates) if you listen to a stream of changes you will in fact accumulate a working knowledge of the area” (Berners-Lee and Connolly, 2009).

Unfortunately, “for many important needs related to changing data, implementation patterns or best practices remain elusive,” (Sanderson and Van de Sompel, 2012) which results in a practice where “accommodating the time-varying nature of the enterprise is largely left to the developers of database applications, leading to ineffective and inefficient ad hoc solutions that must be reinvented each time a new application is developed” (Jensen, 2000). The lack of guidance on this topic is what this article attempts to remedy by contributing an overview that summarizes the main data modelling patterns for temporal linked data. The article begins by going through preliminaries, including concepts and formalizations used in further sections. The main part provides a review of dominant modelling patterns for capturing the temporal dimension of linked data. This overview is followed by a discussion of common issues in modelling temporal linked data. Concluding sections of the article provide pointers to related work and sum up the article’s contributions.

Preliminaries

In order to set the scene for the main body of this article, we will introduce the formalisms for modelling data (RDF) and representing time, along with a description of key concepts in the domain.

Resource Description Framework

RDF (Resource Description Framework) is a generic data format for exchanging structured data on the Web. It expresses data as sets of atomic statements called triples, in which each triple is made of a subject, a predicate and an object. There are three kinds of terms that may be used as the constituent parts of triples. URIs (Uniform Resource Identifiers) identifying resources may be used at any position within a triple. Blank nodes, existentially quantified anonymous resources, may serve both as subjects and as objects. Finally, literals, textual values with an optional datatype or language tag, may be put only in the object position. Any set of RDF triples forms a directed labelled graph, in which the vertices are subjects and objects, and the edges, oriented from subjects to objects, are labelled with predicates.
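As a minimal illustration in the Turtle syntax used throughout this article (the URIs and values here are invented), the following snippet employs all three kinds of terms:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# subject and predicate are URIs, the object is a blank node
<http://example.com/Alice> foaf:knows _:someone .
# a blank node as subject, a plain literal as object
_:someone foaf:name "Bob" .
# a literal with an explicit datatype in the object position
<http://example.com/Alice> foaf:age "30"^^xsd:integer .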

The deficiencies of RDF regarding the expression of temporal scope are mostly attributed to the fact that RDF is limited solely to binary predicates, while relations of higher arity have to be decomposed into binary ones. Since RDF predicates are binary, any relation that involves more than two resources can be expressed only indirectly (RDF 1.1 concepts, 2013). What follows from this restriction is that “temporal properties can only be attached to concepts or instances. Whenever a relation needs to be annotated with temporal validity information, workaround solutions such as relationship reification need to be introduced” (Tappolet and Bernstein, 2009). These indirect ways of introducing temporal dimension as an additional argument to binary relations captured by RDF are accounted for by the modelling patterns described further in the article.

In addition to mixing time into relations, temporal scope can also be treated as an annotation, which may be attached to anything with an identity of its own that is a valid subject of an RDF triple (i.e. a URI or blank node). Although similarly indirect, annotations may be attached at several levels of granularity, including annotations of individual triples, resources or sets of triples grouped in sub-graphs. Since “there is nothing else but the constituents to determine the identity of the triple,” (Kiryakov and Ognyanov, 2002) RDF triples have no identity to refer to, which makes describing their context, such as temporal scope, unfeasible. To provide triples with an identity of their own, reification must be used. Reification is a way to turn statements into resources, so that they are within the reach of what may be described in RDF. Together with other approaches to annotation, reification will be explored in the following data modelling patterns.

As mentioned above, all of the ways of capturing the temporal dimension in RDF rely on indirection or convoluted patterns constrained by RDF limitations. Moving from synchronic to diachronic data representation runs up against the restraints of RDF. The design of the “RDF data model is atemporal” (RDF 1.1 concepts, 2013) and there is no native support for incorporating time. The format’s specification leaves handling temporal data out of its scope and delegates the question of expressing such data to RDF vocabularies and ontologies. Current RDF specifications also evade the issue of semantics of temporally variant resources by recognizing that “to provide an adequate semantics which would be sensitive to temporal changes is a research problem which is beyond the scope of this document” (RDF semantics, 2004).

Indeed, much criticism was voiced over RDF’s atemporality. For instance, Tennison declares that “the biggest deficiency in RDF is how hard it is to associate metadata with statements,” (2009a) Mittelbach specifies that “there is no built-in way to describe the context in which a given fact is valid,” (2008) and Hickey concludes that “without a temporal notion or proper representation of retraction, RDF statements are insufficient for representing historical information” (2013).

Representation of time

Time can be represented as a “point-based, discrete and linearly ordered domain” (Rula et al., 2012). It is typically conceptualized as one-dimensional, so there is no branching time (Tappolet and Bernstein, 2009). The basic types of temporal entities are time points (instants) and time intervals (periods) delimited by a starting and an ending time point. If necessary, temporal primitives might be reduced to intervals since “it is generally safe to think of an instant as an interval with zero length, where the beginning and end are the same” (Time Ontology in OWL, 2006). The temporal scope of data may also span multiple periods, which can be represented as a union of disjoint time intervals (Lopes et al., 2009).

Temporal entities can be translated into RDF either as literals or resources. Literals specifying time may conform to one of the XML Schema datatypes,2 such as xsd:dateTime or xsd:duration, which are based on international standards, such as ISO 8601.3 These datatypes are well supported by the SPARQL RDF query language and other tools working with RDF that implement XPath functions for manipulating such datatype values. Less common literal datatypes4 require custom parsers, so it is advisable to conform to standards and to decompose compound literals into structured values represented as RDF resources.

Complex values, such as intervals delimited with starting and ending timestamps, should be represented as resources. This is the way of modelling adopted by the Time Ontology in OWL (2006), in which temporal entities are structured as instances of classes that are further described with datatype properties. Another application of this approach is in the work of Correndo et al., who offer a formalization of time that they hold to be well suited for annotation of RDF (2010). They present the “concept of Linked Timelines, knowledge bases about general instants and intervals that expose resolvable URIs” and provide “also temporal topological relationships inference to the managed discrete time entities.” The URIs generated for time instants and intervals adhere to the syntax of ISO 8601 literals and are described with the OWL Time Ontology, which the authors extended with better support for XML Schema datatypes.

An alternative option is to represent temporal relations as a hierarchy, for example a year interval may have narrower month intervals. This solution was proposed for the Neo4j graph store,5 although it can also be put into practice in RDF stores using, for example, SKOS6 hierarchical relationships such as skos:narrowerTransitive and skos:broaderTransitive.
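A minimal sketch of such a hierarchy in Turtle might look as follows; the interval URIs are hypothetical, and note that SKOS formally defines these properties on skos:Concept, so this is a loose reuse:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix time: <http://www.w3.org/2006/time#> .

# a year interval with a narrower month interval
<http://example.com/interval/2000> a time:Interval ;
  skos:narrowerTransitive <http://example.com/interval/2000-01> .

<http://example.com/interval/2000-01> a time:Interval ;
  skos:broaderTransitive <http://example.com/interval/2000> .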

In the examples contained in this article we will use the Time ontology in OWL combined with XML Schema datatypes to represent temporal entities.

Key concepts

Change in time is inextricably linked to several concepts, which influence the generic principles that guide data modelling. A fundamental concept that interacts with it is identity. There is no widely accepted definition of identity in the database setting. It can be considered as continuity over a series of perceptions (Halloway, 2011). Another perspective views identity as the characteristics that make an entity recognizable among others. It is the relation an entity has only to itself, exhibiting the characteristics of being reflexive, transitive and symmetric.

Identity on the Web may be contemplated through Leibniz’s ontological principle of the identity of indiscernibles, which may be formulated as follows: “if, for every property F, object x has F if and only if object y has F, then x is identical to y” (Identity of indiscernibles, 2010). This rule does not hold on the Web since “resources are not determined extensionally. That is, A and B can have the same correspondences and coincidences, and still be distinct” (Rees, 2009). On the contrary, the reverse law of the indiscernibility of identicals holds true since owl:sameAs asserts that “identity entails isomorphism, or that if a = b, then all statements of a and b are shared by both” (McCusker and McGuinness, 2010).

When it comes to change in time, a key issue of identity is “the problem of diachronic identity: i.e., how do we logically account for the fact that the ‘same’ entity appears to be ‘different’ at different times?” (Welty and Fikes, 2006). Change is a result of actions over time, each of which produces a new observable state of the changed identity. State is a relationship of identity to value. In Representational State Transfer (REST) “resource R is a temporally varying membership function MR(t), which for time t maps to a set of entities, or values, which are equivalent” (Fielding, 2000). In linked data, which builds on REST, the state is the representation (perceived value) to which a resource URI dereferences, so that “dereferencing [the resource’s] URI at any specific moment yields a response that reflects the resource’s state at that moment” (Van de Sompel, Nelson and Sanderson, 2013). While the resource state may change, the resource URI should be persistent, as recommended by one of the principles of the architecture of World Wide Web:

“Resource state may evolve over time. Requiring a URI owner to publish a new URI for each change in resource state would lead to a significant number of broken references. For robustness, Web architecture promotes independence between an identifier and the state of the identified resource.” (Architecture of the World Wide Web, 2004)

State offers a perceivable value, which is, in contrast to resources, immutable. Values are made of facts, which in temporal database research are regarded as “any statement that can meaningfully be assigned a truth value, i.e. that is either true or false” (Jensen, 2000). RDF regards individual triples as atomic facts, which are held to be true without respect to context by the sole nature of their existence, although in fact, without context such as temporal scope, it might not be feasible to assign a truth value to RDF triples. RDF resources are by default persistent yet mutable, although “literals, by design, are constants and never change their value” (RDF 1.1 concepts, 2013). However, an RDF triple may be considered immutable because “an RDF statement cannot be changed – it can only be added and removed” (Kiryakov and Ognyanov, 2002). On the contrary, “there is no way to add, remove, or update a resource or literal without changing at least one statement, whereas the opposite does not hold” (Auer and Herre, 2007).

The succession of resource states may spread across multiple temporal dimensions. A common conceptualization of time uses two temporal dimensions, valid time and transaction time, and thus it is referred to as the bitemporal data model. It allows one to distinguish between the situation when “the world changes” and when “the data about the world changes” (e.g., as a result of changes in the data collection process). Valid time (also “actual time”, “business time” or “application time”) captures when the data is valid in the modelled world. A value is current during its valid time. This interpretation fits the dcterms:valid property from the Dublin Core Terms vocabulary.7 Transaction time (also “record time” or “system time”) reflects when data enters the database. A value is perceived at transaction time. This dimension’s semantics may be expressed with the dcterms:created property, also from Dublin Core Terms.
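A minimal sketch of a bitemporally annotated record, using a hypothetical :record resource and dates chosen only for illustration, might look as follows:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# valid time: when the described state of affairs holds in the modelled world
<http://example.com/record> dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime ;
# transaction time: when the record entered the database
  dcterms:created "2000-01-05T12:00:00Z"^^xsd:dateTime .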

Modelling patterns

Having set out the preliminaries, we will now review and compare data modelling patterns for capturing the temporal dimension of linked data. To limit our scope we selected only the patterns that can be implemented within RDF without extending it. That said, all the patterns can be expressed in RDF, even though they might not have a defined semantics. And in fact, the differences between some patterns are a matter of syntax, so that they can be transformed into each other. The overview features patterns based on reification, the concept of four-dimensionalism, dated URIs and named graphs.

Throughout this section we will use the following data snippet serialized in RDF Turtle syntax8 to serve as a running example. It states the resource :Alice to be a member of the organization :ACME without any anchoring temporal information, the attachment of which will differ in each reviewed modelling pattern.

:Alice org:memberOf :ACME .

The examples use the Time Ontology (Time Ontology in OWL, 2006) to demarcate the temporal scope of data, except in cases where the modelling pattern provides its own way to represent the scope. The temporal annotation in all examples asserts the data to be valid since 9 AM on January 1, 2000. The following examples use the RDF Turtle syntax (unless stated otherwise) with these namespace prefixes:

@prefix :         <http://example.com/> .
@prefix cs:       <http://purl.org/vocab/changeset/schema#> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix gen:      <http://www.w3.org/2006/gen/ont#> .
@prefix org:      <http://www.w3.org/ns/org#> .
@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix prov:     <http://www.w3.org/ns/prov#> .
@prefix prv:      <http://purl.org/ontology/prv/core#> .
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sit: <http://www.ontologydesignpatterns.org/cp/owl/timeindexedsituation.owl#> .
@prefix ti: <http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#> .
@prefix time:     <http://www.w3.org/2006/time#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

Reification

In the first part of the reviewed data modelling patterns we will focus on those that use reification. In previous research these patterns were grouped in the category of fact-centric perspectives (Rula et al., 2012). Reification is a principle by which previously anonymous parts of the data model are given autonomous identity, so that they can be described within the data model. The patterns we go through in this section include statement reification, axiom annotation, changesets, n-ary relations and property localization.

Statement reification

Statement reification9 adopts a sentence-centric perspective (Rula et al., 2012) and attaches temporal annotation to individual triples (statements). It describes the reified statement with three binary predicates asserting what the subject (rdf:subject), predicate (rdf:predicate) and object (rdf:object) of the original statement are.

This approach suffers from a number of issues, which is why its use is commonly discouraged. First, there is no formal correspondence between a statement and its reified form. As the RDF primer points out: “note that asserting the reification is not the same as asserting the original statement, and neither implies the other” (RDF primer). Even if “there needs to be some means of associating the subject of the reification triples with an individual triple in some document […] RDF provides no way to do this” (ibid.). At the same time, two reified statements with the same subject, predicate and object cannot be automatically inferred to be the same statement. Moreover, triple reification is inefficient in terms of data size, as it requires at least three times more triples than a non-reified statement. If statement reification is used, every temporally annotated statement has to be reified, as no grouping is possible. Given these concerns, triple reification is considered for deprecation; one reason is that the cases it is used for are covered by named graphs, which, unlike reification, make do without data transformation.

:Alice org:memberOf :ACME .

[] a rdf:Statement ;
  rdf:subject :Alice ;
  rdf:predicate org:memberOf ;
  rdf:object :ACME ;
  dcterms:valid [
    a time:Interval ;
    time:hasBeginning [
      time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
    ]    
  ] .

Axiom annotation

OWL 2 offers a way for expressing annotations about axioms.10 OWL annotations are “pieces of extra-logical information describing the ontology or entity” (Grau, 2008) and they “carry no semantics in OWL 2 Direct Semantics” (OWL 2 Web Ontology Language, 2012). They can be considered functionally equivalent to triple reification and thus their issues are closely similar. Gangemi and Presutti add that the downsides include a need for “a lot of reification axioms to introduce a primary binary relation to be used as a pivot for axiom annotations, and that in OWL 2 (DL) reasoning is not supported for axiom annotations” (Gangemi and Presutti, 2013).

:Alice org:memberOf :ACME .

[] a owl:Axiom ;
  owl:annotatedSource :Alice ;
  owl:annotatedProperty org:memberOf ;
  owl:annotatedTarget :ACME ;
  dcterms:valid [
    a time:Interval ;
    time:hasBeginning [
      time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
    ]    
  ] .

Changeset

The Changeset vocabulary11 captures changes as reified statements that are complemented with metadata, such as the type of changeset (addition, removal) or a timestamp. Individual changesets comprise annotated reified statements, which represent atomic changes made to data. Changesets may be bundled together by using the Eventset vocabulary,12 which provides a higher-level, resource-centric view of changes.

[] a cs:ChangeSet ;
  cs:addition [
    a rdf:Statement ;
    rdf:subject :Alice ;
    rdf:predicate org:memberOf ;
    rdf:object :ACME
  ] ;
  cs:createdDate "2000-01-01T09:00:00Z"^^xsd:dateTime .

N-ary relation

A common pattern for providing a relation with an identity is to use a class instance instead of a property. By introducing a specific resource to name the relation, this practice allows one to escape the limitation of default binary predicates in RDF and instead express arbitrary n-ary relations13 decomposed into binary relations expressible in RDF. “The classes created in this way are often called ‘reified relations’” (Defining n-ary relations on the semantic web, 2006) and bear a resemblance to statement reification syntax. However, they are more concise since they subsume the semantics of rdf:predicate in an instantiation of a specific class. Moreover, whereas in statement reification additional triples characterize a “statement”, in n-ary relations they describe the “relation” itself, which is why this modelling pattern is categorized as a relation-centric perspective (Rula et al., 2012).

However, some argue that time is “an additional semantic dimension of data. Therefore, it needs to be regarded as an element of the meta-model instead of being just part of the data model” (Tappolet and Bernstein, 2009). Indeed, this ontological solution mixes the time model into the data model. Expressing the temporal dimension in the data model requires per-property effort, which leads to ontology bloat and indirection, because many properties exhibit potentially dynamic characteristics. Moreover, custom n-ary relations do not conform to any standard, so interpreting them automatically may be difficult.

Despite these disadvantages, n-ary relations are regarded as the most flexible way of incorporating temporal annotations (Gangemi, 2011), which is supported by the evidence of their widespread use in practice. For example, they are used in Freebase14 and Wikidata.15 Since n-ary relations are common, the Organization Ontology, which offers the org:memberOf predicate used in the initial example, also provides a way to put the membership relation into temporal context using the n-ary relationship org:Membership.16

[] a org:Membership ;
  org:member :Alice ;
  org:organization :ACME ;
  org:memberDuring [
    a time:Interval ;
    time:hasBeginning [
      time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
    ]    
  ] .

N-ary relations may be further refined by “decorating” them with compound keys, using the owl:hasKey feature of OWL 2, which helps in merging identical relations (Gangemi, 2011).

org:Membership owl:hasKey (
  org:member
  org:organization
  org:memberDuring 
) .

The pattern of converting properties to n-ary relations is documented in the Property Reification Vocabulary (Procházka et al., 2011), which offers a way of establishing a connection between a property (prv:shortcut) and its n-ary relation (prv:reification_class).

:memberOfReification a prv:PropertyReification ;
  prv:shortcut org:memberOf ;
  prv:reification_class org:Membership ;
  prv:subject_property org:member ;
  prv:object_property org:organization .

Property localization

A relatively convoluted technique for reifying properties is property localization, which is also known as “name nesting”. “To reduce the arity of a given relation instance” it might be replaced by a sub-property with “partial application” of time, thereby decreasing the number of its arguments (Krieger, 2008). However, in order to add more parameters another sub-property needs to be minted, which, coupled with relying on singletons for property domain and range, makes this pattern inflexible.

:AliceMemberOfACMESince2000-01-01 rdfs:subPropertyOf org:memberOf ;
  rdfs:domain [
    owl:oneOf ( :Alice )
  ] ;
  rdfs:range [
    owl:oneOf ( :ACME )
  ] ;
  dcterms:valid [
    a time:Interval ;
    time:hasBeginning [
      time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
    ]    
  ] .

:Alice :AliceMemberOfACMESince2000-01-01 :ACME .

Four-dimensionalism

Four-dimensionalism is a school of thought hailing from a strong philosophical background that regards time as the fourth dimension. Traditional data modelling approaches ascribe a fourth dimension only to processes (“perdurants”), whereas things (“endurants”) are held to extend solely in three dimensions. “Endurants are wholly present at all times during which they exist, while perdurants have temporal parts that exist during the times the entity exists” (Welty and Fikes, 2006). On the contrary, four-dimensionalism does not make this conceptual distinction and treats every entity as having temporal parts extended through time.

The temporal parts of individuals are referred to as time slices, which represent their time-indexed states. The properties used to describe time slices are called fluents. Fluents are “properties and relations which are understood to be time-dependent” (Hayes, 2004) and can be thought of as “instances of relations whose validity is a function of time,” (Hoffart et al., 2013) so that they are functions that map from objects and situations to truth values (Welty and Fikes, 2006).

Even though some authors hold 4D fluents to be the most expressive modelling pattern for temporal annotation with the best support in current reasoners (Batsakis and Petrakis, 2011), this approach begets a number of issues as well. Perhaps the major one is that in the 4D view we either have to treat instances of existing classes as time slices or we have to redefine the domains and ranges of existing properties to instances of time slices. Krieger (2008) presents an attempt to solve this issue by reinterpreting existing properties as 4D fluents. He posits that such an “interpretation has several advantages and requires no rewriting of an ontology that lacks a treatment of time.” Proliferation of identical time slices may be another potential problem of this pattern. Welty and Fikes argue that “there is no way in OWL to express the identity condition that two temporal parts of the same object with the same temporal extent are the same,” (2006) however, features added more recently in OWL 2, such as keys, may be used for this purpose.

Unlike the previously mentioned approaches, modelling resource slices tracks changes at a higher level of granularity. Whereas in the above-mentioned examples changes are recorded at the level of individual relations or facts, this pattern captures changes at the level of resources. Consequently, if we use slices as a resource’s snapshots, such modelling will likely be inefficient, as most of the resource’s data remains unchanged in subsequent resource slices but has to be duplicated. A more economical approach is to treat resource slices as deltas that contain only the data that changed, as is done, for example, in the work of Tappolet and Bernstein (2009).

In the following example, time slices of resources are expressed as instances of gen:TimeSpecificResource using the Ontology for Relating Generic and Specific Information Resources.17 The property org:memberOf is reinterpreted as if it was a fluent property.

:Alice a gen:TimeGenericResource ;
  gen:timeSpecific :Alice2000-01-01 .

:ACME a gen:TimeGenericResource ;
  gen:timeSpecific :ACME2000-01-01 .

:Alice2000-01-01 a gen:TimeSpecificResource ;
  org:memberOf :ACME2000-01-01 ; 
  dcterms:valid :membershipTime .

:ACME2000-01-01 a gen:TimeSpecificResource ;
  dcterms:valid :membershipTime .

:membershipTime a ti:TimeInterval ;
  ti:hasIntervalStartDate "2000-01-01T09:00:00Z"^^xsd:dateTime .

A variation of this modelling pattern was put forward by Welty as context slices (2010). A context slice is “a projection of the relation arguments in each context for which some binary relation holds between them” (ibid.). It constitutes a shared context (an instance of :Context) in which time-indexed “projections” (instances of :ContextualProjection) of both the subject and the object participate.

:AliceMemberOfACMESince2000-01-01 a :Context ;
  dcterms:valid [
    a time:Interval ;
    time:hasBeginning [
      time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
    ]
  ] .

:Alice2000-01-01 a :ContextualProjection ;
  :hasContext :AliceMemberOfACMESince2000-01-01 .

:ACME2000-01-01 a :ContextualProjection ;
  :hasContext :AliceMemberOfACMESince2000-01-01 .

:Alice2000-01-01 org:memberOf :ACME2000-01-01 .

Dated URIs

Dated URIs allow temporal scope to be expressed at the identifier level. A resource identified with a dated URI is time-indexed by a timestamp embedded directly into its URI. A formal version of this modelling pattern was proposed by Masinter (2012), who drafted two URI schemes that employ embedding of the original URI and its temporal scope into a single identifier. The first scheme, duri (“dated URI”), is used to identify a resource as of a specific time. The second, tdb (“thing described by”), serves the purpose of identifying the temporally scoped state of resources that cannot be retrieved via the Web (i.e. “non-information resources”). Using dated URIs, the running example looks like this:

<tdb:2000-01-01T09:00:00Z:http://example.com/Alice>
org:memberOf
<tdb:2000-01-01T09:00:00Z:http://example.com/ACME> .

An alternative way to use this pattern would be to index properties with time instead, since these are what changes.

:Alice <tdb:2000-01-01T09:00:00Z:http://www.w3.org/ns/org#memberOf> :ACME .

The principal downside to URIs in both of these schemes is that resolving them is not supported by existing HTTP clients. A more prevalent, yet not formalized, technique sticks to regular HTTP URIs in which temporal information is contained. For example, Tennison (2009b) builds on the widespread pattern of dated URIs and proposes to mint URIs with temporal annotations (such as http://example.com/Alice/2000-01-01), to which generic URIs (such as http://example.com/Alice) redirect with HTTP 307 Temporary Redirect. However, all approaches that embed time into URIs interfere with the axiom of URI opacity,18 which discourages inferring anything from a URI’s string representation, since that would overload the function of identifiers to carry information belonging to the data model and would require additional custom parsers to extract temporal annotations from URIs. Since this approach operates on the identifier level, it may be used together with the previously covered modelling patterns. For example, resources identified by dated URIs might be interpreted as the time slices presented in the above-mentioned pattern.

Named graphs

In this section we discuss the deployment of named graphs as resource states and as data versions. Named graphs are merely names (URIs) syntactically paired with RDF graphs. “RDF does not place any formal restrictions on what resource the graph name may denote, nor on the relationship between that resource and the graph” (RDF 1.1 concepts, 2013) and the semantics of named graphs is thus left undefined.

A substantial benefit of named graphs is that this approach does not interfere with the actual data model, since it is agnostic about what RDF vocabularies and ontologies are being used. This makes them compatible with existing atemporal data models, which may be used within temporally annotated named graphs. An advantage of named graphs is that they enable storing mutually contradicting data, since storing inconsistent data in temporally-scoped statements inside a single graph violates RDF’s entailment property, which asserts that from an RDF graph every sub-graph can be entailed (Mittelbach, 2008). Finally, although there is no standardized serialization syntax for named graphs,19 they are well supported in tools based on SPARQL 1.1, since most current RDF stores are quad-stores. However, their handling in current reasoners is patchy, since named graphs are not a part of the OWL specification to which most reasoners conform (Batsakis and Petrakis, 2011).

Despite the favourable mentions named graphs have received in the literature about temporal RDF, there are several commonly raised issues when it comes to using this approach for temporal annotation. Even though named graphs have no standardized semantics, they are commonly used as containers of RDF datasets. Named graphs are frequently equated with “documents” in which RDF data is published. Based on empirical evidence collected by Rula et al. (2012), temporal annotations of documents are relatively prevalent in practice. Grouping a set of related triples into a dataset identified with a named graph URI is recommended among the best practices for working with linked data.20 Since named graphs are already used to express the dataset to which resources belong, for temporal annotation “relying on named graphs is problematic because the statements may be embedded in other datasets which are already enclosed in named graphs” (McCusker and McGuinness, 2010). However, the affiliation between RDF triples and datasets may be expressed explicitly in the RDF data model by using properties such as void:inDataset,21 which connects a named graph enclosing RDF data with a URI identifying a dataset.
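For example, in TriG the named graph holding the temporally scoped data may be linked to its dataset from the default graph. The :membershipGraph and :exampleDataset URIs below are invented for illustration; the : and org: prefixes are those declared earlier, and the void: prefix is added here as it is not part of the article’s prefix table:

@prefix void: <http://rdfs.org/ns/void#> .

# default graph: the named graph is affiliated with a dataset
:membershipGraph void:inDataset :exampleDataset .

# named graph with the temporally scoped data
:membershipGraph {
  :Alice org:memberOf :ACME .
}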

A difficult task when using named graphs is to pick the best level of granularity at which to partition data into individual graphs. On this note, Tennison recommends that “to avoid repetition of data within multiple graphs, graphs should be split up at the level that updates are likely to occur within the source of the data” (2010). In the worst case the granularity amounts to a single statement, in which case named graphs may serve as a more elegant syntax for reified statements. As Gangemi and Presutti warn about named graphs, “this solution has the advantage of being able to talk about assertions, but the disadvantages of needing a lot of contexts (i.e. documents) that could make a model very sparse” (2013). In this respect, Rula et al. recommend using named graphs “only when it is possible to group a considerable number of triples into a single graph” (2012). However, when a change impacts multiple facts at the same time, named graphs make it possible to group them and attach a single temporal annotation, which is not possible with the previously scrutinized modelling patterns.

Data modelling patterns for temporal data based on named graphs attribute a special status to the default graph. Two basic ways of using the default graph have emerged. While some approaches employ named graphs to store changing data and use the default graph as a container for metadata about the changes (for example in Rula et al., 2012), another way is to define the default graph as a view of the current state of data (i.e. HEAD in the terminology of versioning systems), which is the case for R&Wbase (Vander Sande et al., 2013). Of the different versions of data, “the most useful […] is the current graph, which is the one that should be exposed as the default graph in the SPARQL endpoint offered by the triplestore” (Tennison, 2010), which helps to simplify querying via SPARQL, as the current state of data is selected by default.
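A rough TriG sketch of the second arrangement keeps the current data in the default graph while historical states remain available in named graphs; the version graph names and the earlier organization :OldCorp are invented for illustration:

# default graph: the current state of the data (HEAD)
:Alice org:memberOf :ACME .

# named graphs: earlier and current versions kept for historical queries
:version1 {
  :Alice org:memberOf :OldCorp .
}

:version2 {
  :Alice org:memberOf :ACME .
}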

Named graphs were used for representing temporal scope in several research projects, though their use for this particular purpose is still not widespread. For example, Tappolet and Bernstein (2009) describe named graphs as instances of classes from the Time Ontology (2006), which delimit the temporal boundaries of data in graphs.

In the following examples we use the TriG syntax,22 which extends the Turtle serialization to represent named graphs. We will demonstrate interpreting named graphs as resource states and as commits.

Resource states

The concept of stateful resources was put forth by Richard Cyganiak.23 It reinterprets every resource as a stateful resource and uses named graphs as individual states of the resource. It aligns well with the concepts of four-dimensionalism, using named graphs for the temporal parts of stateful resources.

:Alice2000-01-01 {
  :Alice org:memberOf :ACME .
  :Alice2000-01-01 dcterms:valid [
    a time:Interval ;
    time:hasBeginning [
      time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
    ]    
  ] .
}

Commits

An approach to using named graphs for capturing evolving RDF data was outlined in a proof of concept entitled R&Wbase (Vander Sande et al., 2013), which builds on concepts of distributed version control systems. Each change is represented as a commit, which captures the delta between the previous and current state of data. Unlike with snapshots, data that persists through subsequent revisions does not need to be duplicated in this way, which contributes to storage space efficiency. Much like transactions in relational databases, commits represent atomic units of data that allow fault recovery and rollback. A commit represents a collection of metadata describing the changes made to data, including their author, timestamp and a link to the parent commit. Commits are described by means of the Provenance Ontology.24 In the case of R&Wbase, individual revisions are exposed as virtual named graphs that make it possible to query the database as of a specific commit. This approach also allows both branching and merging different versions of data. The following is an example :version named graph that was generated by the commit :commit:

:version {
  :commit a prov:InstantaneousEvent ;
    prov:atTime "2000-01-01T09:00:00Z"^^xsd:dateTime ; 
    prov:generated :version .
  
  :version a prov:Entity .

  :Alice org:memberOf :ACME .
}

Main issues

Having outlined the key characteristics of the main modelling patterns for temporal RDF, we summarize their suitability for linked data based on several criteria and requirements. However, before that we discuss a few guiding concerns that serve as a basis for deciding how the reviewed modelling patterns fare on the evaluation criteria.

Guiding concerns

This section draws attention to a few common concerns associated with modelling temporal data in RDF. We highlight the importance of avoiding updating data in place, the cost of identifying data with blank nodes and the reasons given for preferring deltas to snapshots. The selection should by no means be held to be comprehensive, but rather taken as a sample of the more prominent concerns.

No update in place

Update must not affect existing data. Temporal databases eschew update in place and instead prefer immutable data structures, which collect changes instead of applying them directly by rewriting previous data. Hickey states that “since one can’t change the past, this implies that the database accumulates facts, rather than updates places, and that while the past may be forgotten, it is immutable” (2013). In this way, “when capturing the transaction time of data, deletion statements only have a logical effect” (Jensen, 2000), which means that “deleting an entity does not physically remove the entity from the database; rather, the entity remains in the database, but ceases to be part of the database’s current state” (ibid.). Changes or deletes thus do not have the physical effect of modifying or erasing data in place. A delete only marks an existing fact as no longer valid, instead of removing it from the database. Erroneous data may be marked with zero as its end of validity; i.e. as data that was never valid. On this account, Auer et al. remark that “in analogy with accounting practices, we never physically erase anything, but just add a correcting transaction” (2012). Nevertheless, storage space is limited and legal regulations in some countries might require physical deletion of problematic data. Similar to garbage collection in programming languages, a value might be discarded once it ceases to be used. Likewise, in order to make irreversible changes, most temporal databases allow special procedures to be executed, such as “vacuuming” or “excision”.
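Using the Changeset vocabulary introduced earlier, a logical (rather than physical) deletion of the running example’s membership might be recorded as follows; the timestamp is invented for illustration:

[] a cs:ChangeSet ;
  cs:removal [
    a rdf:Statement ;
    rdf:subject :Alice ;
    rdf:predicate org:memberOf ;
    rdf:object :ACME
  ] ;
  cs:createdDate "2005-06-30T17:00:00Z"^^xsd:dateTime .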

Avoid blank nodes

Blank nodes are commonly reported to be a source of problems in versioning RDF. For instance, Tappolet and Bernstein (2009) write that blank nodes are “especially problematic as their validity is restricted to the node’s parent graph.” RDF 1.1 concepts (2013) makes that clear by saying that blank nodes may in fact be shared across multiple graphs in an RDF dataset; however, these graphs must have a parent-child relationship. “Since blank nodes cannot be identified, deleting them is not trivial” (Vander Sande et al., 2013) and they must be addressed within the context of a graph pattern that they are a part of. Given these downsides, many tools working with blank nodes replace them internally with URIs. Moreover, the use of blank nodes is frowned upon in general in the linked data community, so the recommendation is to avoid them if possible.

Prefer deltas to snapshots

Change in data may be represented either as a new snapshot, which contains new data as well as unaltered data, or as a delta, which records an incremental change. A snapshot is a view of data at a particular moment. By using named graphs for snapshots, “different versions of data can be stored in different graphs, but this leads to a duplication of all triples” (Vander Sande et al., 2013). Storing only deltas in place of full snapshots has significant space benefits. “The temporal RDF approach vastly reduces the number of triples by eliminating redundancies resulting in an increased performance for processing and querying” (Tappolet and Bernstein, 2009).

A delta comprises only the difference between two consecutive snapshots of data, which makes it space efficient at the price of a more complex resolution of deltas back to snapshots. The advantages of this approach show in particular when dealing with changes at a coarser level of granularity (i.e., a whole resource or an entire graph). Deltas were introduced to RDF by Berners-Lee and Connolly (2009) and have since been covered by many other research articles that highlighted several major issues in employing deltas in RDF. Auer et al. call for “LOD differences (deltas) and representing them as first class citizens with structural, semantic, temporal and provenance information” (2012). Deltas have to be easily applicable, so that resolving them to a snapshot of data as of a certain time is feasible. Delta resolution depends on preserving their application order, either via parent links to previous deltas or by inferring the order from the succession of deltas’ timestamps. Another important concern is that deltas be self-contained, grouping updates in a way that allows all data pertaining to a single delta to be identified and reverted to a previous state. Moreover, deltas should express changes in data in a way that can be comprehended by humans. In that respect, Papavasileiou and her colleagues propose a “change language which allows the formulation of concise and intuitive deltas” (2013).
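
As a rough sketch of the idea (TriG syntax, hypothetical ex: terms rather than the ontology of Berners-Lee and Connolly), a delta can be stored as a pair of named graphs holding the added and removed triples, linked to its predecessor and timestamped so that deltas can be replayed in order:

    @prefix ex:  <http://example.com/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # Triples added and removed by this delta, kept in separate named graphs.
    ex:delta2-additions { ex:alice ex:worksFor ex:initech . }
    ex:delta2-removals  { ex:alice ex:worksFor ex:acme . }

    # Metadata that makes the delta self-contained and resolvable to a snapshot.
    ex:delta2 ex:adds      ex:delta2-additions ;
              ex:removes   ex:delta2-removals ;
              ex:previous  ex:delta1 ;
              ex:committed "2013-06-01T12:00:00Z"^^xsd:dateTime .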

Comparison

In this section we summarize how the reviewed data modelling patterns compare on criteria that we deem relevant for linked data. The summary below lists the compared approaches and criteria; each criterion is then described in detail, together with comments explaining the assessment of individual patterns.

The compared approaches are statement reification, n-ary relations, property localization, time slices, dated URIs, stateful resources and commits. The criteria are data size, data compatibility, model compatibility, sufficient TBox, technological compatibility, time-specific URIs, bitemporality and extensibility.

Data size, the only quantitative criterion, is given relative to the number n of atemporal triples: statement reification 3n, n-ary relations 3n, property localization 9n, time slices 3n, dated URIs n, stateful resources n, commits 2n. The remaining criteria are assessed qualitatively in the descriptions that follow.

Criteria

Data size

Data size represents the minimal number of triples needed to represent the data, compared with the atemporal triple count; the temporal annotations themselves and data that can be inferred are not included.

Property localization is the least space efficient approach because it requires a lot of boilerplate triples in ungainly OWL constructs. By contrast, using named graphs as stateful resources has the smallest footprint because it only requires wrapping the annotated triples into a named graph. The same result is achieved by dated URIs, which overload the content of resource identifiers, so that no additional triples are created.
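
To illustrate the difference in footprint, compare a single reified statement with the same statement wrapped in a named graph (a TriG sketch with a hypothetical ex:validFrom annotation property; the exact counts depend on which reification triples one includes):

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <http://example.com/> .

    # Statement reification: several boilerplate triples per annotated statement.
    ex:statement1 a rdf:Statement ;
                  rdf:subject   ex:alice ;
                  rdf:predicate ex:worksFor ;
                  rdf:object    ex:acme ;
                  ex:validFrom  "2012-01-01"^^xsd:date .

    # Stateful resource (named graph): the original triple stays as-is,
    # the annotation attaches to the graph.
    ex:g1 { ex:alice ex:worksFor ex:acme . }
    ex:g1 ex:validFrom "2012-01-01"^^xsd:date .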

Data compatibility

This criterion reflects whether adding temporal annotation can be done without a transformation of existing data.

As the binary relations of basic triples are unable to capture temporal scope, all approaches limited to RDF triples require existing atemporal data to be transformed (e.g., reified) in order to attach temporal annotations. Data compatibility is an advantage of named graphs, which make it possible to contextualize RDF without any data pre-processing.

Model compatibility

Model compatibility determines if existing atemporal RDF vocabularies and ontologies may be used to represent temporally annotated data.

The ability to keep an atemporal domain model and only extend it with temporal annotations is an important advantage. N-ary relations do not provide this ability because every time-varying property has to be remodelled. The other modelling patterns make keeping existing models feasible, although the patterns based on named graphs preclude using named graphs for other purposes (e.g., as dataset containers).

Sufficient TBox

Building on the criterion of model compatibility, this criterion indicates whether a modelling pattern makes do without introducing additional TBox axioms25 into the RDF vocabularies or ontologies used in temporally annotated data.

Of the reviewed approaches, both n-ary relations and property localization require their users to establish additional TBox axioms that cannot be reused from available RDF vocabularies or ontologies.

Technological compatibility

This aspect of compatibility tests whether the reviewed approaches are backwards compatible with existing linked data technologies (RDF, HTTP) without requiring custom extensions.

In this comparison, only dated URIs are marked as incompatible because they use custom URI schemes rather than the typical HTTP URIs. All other patterns adhere to the technological standards of linked data.

Time-specific URIs

This criterion shows whether external datasets may link to a resource as of a particular time.

Referring to a resource’s state as of a particular date is common when citing online resources. By enabling time-specific URIs, third parties may add more data describing a resource’s state. This valuable property is typical of the modelling patterns that adopt the resource-centric perspective, i.e. time slices, dated URIs and stateful resources.
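
For instance, in the time slices pattern a third party can link to, and further describe, a specific temporal part of a resource (a sketch with hypothetical ex: names, loosely following the fluents approach):

    @prefix ex: <http://example.com/> .

    # A time slice identifying ex:alice as of June 2013; external datasets
    # can link to ex:alice-2013-06 to refer to that particular state.
    ex:alice-2013-06 ex:timeSliceOf ex:alice ;
                     ex:worksFor    ex:acme .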

Bitemporality

The criterion of bitemporality judges modelling patterns on the basis of their ability to capture both transaction and valid time.

Most of the examined patterns enable the use of a bitemporal data model; however, it is not possible with dated URIs, which facilitate capturing only a single temporal dimension.
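
A compact sketch of a bitemporal annotation on a named graph (hypothetical ex: properties): valid time records when the fact held in the modelled world, transaction time records when it was stored:

    @prefix ex:  <http://example.com/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:g1 { ex:alice ex:worksFor ex:acme . }

    # Valid time: when the fact held in the modelled world.
    ex:g1 ex:validFrom    "2012-01-01"^^xsd:date .
    # Transaction time: when the fact was recorded in the database.
    ex:g1 ex:recordedFrom "2012-02-15"^^xsd:date .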

Extensibility

Extensibility is the criterion that shows if the approach in question permits other types of annotations to be expressed as well (e.g., provenance metadata).

All patterns that are deemed extensible enable bitemporality. However, in the case of property localization, even though it can provide a bitemporal data model, it is not particularly extensible, because describing additional dimensions of data requires adding a new sub-property for every dimension, which results in a plethora of newly minted TBox axioms.

Related work

The views presented in this comparison of data modelling patterns build on a large corpus of existing research literature on the topic. To highlight several notable works from this corpus, this section introduces the core ideas and solutions presented in them. As motivated in the introduction, the issues of temporal data in RDF are widely perceived as pressing. Even though not as long established as in SQL databases, research on modelling patterns for time-varying RDF data has produced a vast array of literature on the topic. An extensive bibliography of research on temporal aspects of the semantic web was compiled by Grandi (2012).

Several comparisons of data modelling patterns for temporal annotation in RDF have been conducted in the past. Similarly to this article, Davis went through the options for modelling temporal data in RDF in a series of blog posts (2009a), in which he compared conditions and time slices (2009b), named graphs (2009c), reified relations (2009d) and n-ary relations (2009e). Rula et al. presented a comparison based on an empirical study of the use of temporal properties in a large corpus crawled from the Web (2012). According to their analysis, the most common ways of representing temporal annotations are n-ary relations and document metadata. Coming more from the perspective of ontological engineering, Gangemi and Presutti present seven RDF and OWL 2 logical patterns for modelling n-ary relations (2013), each of which is discussed in terms of its usability. In his account of the topic, Hayes presents a unification algorithm for automatic translation between the different syntaxes used for representing temporally annotated information (2004). Temporal annotation may “trickle down” through different levels of attachment, so that it indexes either whole statements, relationships, or individuals, which demonstrates the incidental nature of the syntax used for expressing temporal scope. In this way interoperability between datasets employing distinct modelling styles may be achieved.

The approaches to data modelling presented here work in any standards-compliant RDF store. However, additional features might be needed to make such data usable and its retrieval efficient. For example, dedicated indexes may be built to improve query performance, and query rewriting might be employed to reduce the complexity of query formulation. Foundational research on the implementation of temporal RDF was laid out by Gutierrez, Hurtado and Vaisman (2005). Working in this direction, T-SPARQL extends SPARQL to work with temporally annotated data and was inspired by the TSQL2 temporal query language from the domain of relational database management (Grandi, 2010). Several research papers proposed additional index structures and normalization of temporal annotations to speed up retrieval of temporal RDF (Pugliese, Udrea and Subrahmanian, 2008). Grandi (2009) proposed a multi-temporal database model with temporal query and update execution techniques, motivated by a focus on ontologies from the legal domain.

Support for some form of temporal RDF is already built into a few RDF stores and RDF-aware tools. For example, the Parliament RDF store supports a dedicated temporal index.26 R&WBase (Vander Sande et al., 2013), mentioned previously as a database with versioning capabilities, is a work in progress. A versioning module is part of Apache Marmotta,27 an implementation of a linked data platform for publishing RDF data.

Looking at the wider context beyond RDF proper, the temporal dimension of data is supported in many other database solutions. Some approaches extend triples to n-tuples to surmount the limitations of RDF; for example, the query language of Google Freebase allows access to historical versions of data.28 Hoffart et al. report that the data model used in YAGO2 extends RDF triples to quintuples with time and location (2013). The model uses statement reification internally, and statements are de-reified for querying, so that they can be viewed as quintuples. Efficient query performance is supported by the use of PostgreSQL and additional indexes for all permutations of the tuples.

Research on the topic of temporal data is long established in the field of relational databases. TSQL2 is a complete temporal query language designed as an extension of SQL-92. Furthermore, ISO SQL:2011, the seventh revision of the SQL standard, incorporates temporal support. Temporal retrieval is built into several non-RDF databases, including Datomic,29 Google’s Spanner30 and IBM’s DB2 (Saracco, Nicola and Gandhi, 2012).

Conclusions

In this overview we surveyed data modelling patterns for temporal linked data. Each pattern was evaluated on a set of generic criteria, omitting dataset-specific criteria such as the size of changes or the frequency of change. The criteria were designed on the basis of several guiding concerns regarded as relevant for linked data in particular, and their prioritization was motivated by what matters for the technologies and principles linked data is built on.

Backwards compatibility is a crucial virtue for the continual evolution of the Web. A new modelling style for temporal linked data must therefore work with existing atemporal data and with available technologies. The resource-centric architecture of linked data is built on the core principles of REST, whose fundamental concepts map well to four-dimensionalism: resources may be treated as perdurants and their representations as their temporal parts. The data format of linked data, RDF, is the source of the limitations the presented modelling patterns try to circumvent. The version of RDF that was originally standardized proved to be too restrictive for temporal data, and approaches that transcend these limits turned out to be superior to those that hacked and twisted atemporal RDF. Named graphs, a de facto standard on its way into the next version of the RDF specification, proved to be a viable option for modelling temporal data, offering an elegant syntax that bypasses the limits of the binary relations inherent in RDF.

However, even though best practices for modelling temporal RDF are emerging and technologies supporting such data are being developed, the diachronic dimension of linked data is still missing. Given the extent of research conducted on the topic, it is now a question of adoption of modelling patterns for temporal data by a broader audience. The research has yet to be boiled down to concrete recommendations and guidance, based on standards distilled from common consensus.

References

  • Architecture of the World Wide Web, volume 1. [online]. W3C Recommendation. December 15th, 2004 [cit. 2013-06-13]. Available from WWW: http://www.w3.org/TR/webarch/
  • AUER, Sören; HERRE, Heinrich. A versioning and evolution framework for RDF knowledge bases. In Proceedings of the 6th International Andrei Ershov Memorial Conference on Perspectives of Systems Informatics. Berlin; Heidelberg: Springer, 2007, p. 55 — 69. ISBN 978-3-540-70880-3.
  • AUER, Sören [et al.]. Diachronic linked data: towards long-term preservation of structured interrelated information. In Proceedings of the 1st International Workshop on Open Data, Nantes, France, May 25, 2012. New York (NY): ACM, 2012, p. 31 — 39. ISBN 978-1-4503-1404-6.
  • BATSAKIS, Sotiris; PETRAKIS, Euripides G. M. Representing temporal knowledge in the semantic web: the extended 4D fluents approach. In Combinations of Intelligent Methods and Applications: proceedings of the 2nd International Workshop, CIMA 2010, France, October 2010. Berlin; Heidelberg: Springer, 2011, p. 55 — 69. Smart innovation, systems and technologies, vol. 8. DOI 10.1007/978-3-642-19618-8_4.
  • BERNERS-LEE, Tim; CONNOLLY, Dan. Delta: an ontology for the distribution of differences between RDF graphs [online]. 2009-08-27 [cit. 2013-06-15]. Available from WWW: http://www.w3.org/DesignIssues/Diff
  • CORRENDO, Gianluca [et al.]. Linked Timelines: temporal representation and management in linked data. In First International Workshop on Consuming Linked Data (COLD 2010), Shanghai, China [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-15]. CEUR workshop proceedings, vol. 665. Available from WWW: http://ceur-ws.org/Vol-665/CorrendoEtAl_COLD2010.pdf
  • COX, Simon. DCMI Period Encoding Scheme: specification of the limits of a time interval, and methods for encoding this in a text string [online]. 2006-04-10 [cit. 2013-06-13]. Available from WWW: http://dublincore.org/documents/dcmi-period/
  • DAVIS, Ian. Representing time in RDF part 1 [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-1/
  • DAVIS, Ian. Representing time in RDF part 2 [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-2/
  • DAVIS, Ian. Representing time in RDF part 3 [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-3/
  • DAVIS, Ian. Representing time in RDF part 4 [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-4/
  • DAVIS, Ian. Representing time in RDF part 5 [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-5/
  • Defining n-ary relations on the semantic web [online]. NOY, Natasha; RECTOR, Alan (eds.). April 12, 2006 [cit. 2013-06-13]. Available from WWW: http://www.w3.org/TR/swbp-n-aryRelations/
  • FIELDING, Roy Thomas. Architectural styles and the design of network-based software architectures. Irvine (CA), 2000. 162 p. Dissertation (PhD.). University of California, Irvine.
  • GANGEMI, Aldo. Super-duper schema: an owl2+rif dns pattern. In CHAUDRY, V. (ed.). Proceedings of DeepKR Challenge Workshop at KCAP 2011. 2011. Also available from WWW: http://www.ai.sri.com/halo/public/dkrckcap2011/Gangemi.pdf
  • GANGEMI, Aldo; PRESUTTI, Valentina. A multi-dimensional comparison of ontology design patterns for representing n-ary relations. In SOFSEM 2013: Theory and Practice of Computer Science. Berlin; Heidelberg: Springer, 2013, p. 86 — 105. Lecture notes in computer science, vol. 7741. DOI 10.1007/978-3-642-35843-2_8.
  • GRANDI, Fabio. Multi-temporal RDF ontology versioning. In Proceedings of the 3rd International Workshop on Ontology Dynamics, collocated with the 8th International Semantic Web Conference, Washington DC, USA, October 26, 2009 [online]. Aachen: RWTH Aachen University, 2009 [cit. 2013-06-11]. CEUR workshop proceedings, vol. 519. Available from WWW: http://ceur-ws.org/Vol-519/grandi.pdf
  • GRANDI, Fabio. T-SPARQL: A TSQL2-like temporal query language for RDF. In Local Proceedings of the 14th East-European Conference on Advances in Databases and Information Systems, Novi Sad, Serbia, September 20-24, 2010. [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-11]. CEUR workshop proceedings, vol. 639. Available from WWW: http://ceur-ws.org/Vol-639/021-grandi.pdf
  • GRANDI, Fabio. Introducing an annotated bibliography on temporal and evolution aspects in the semantic web. ACM SIGMOD Record. December 2012, vol. 41, iss. 4, p. 18 — 21. DOI 10.1145/2430456.2430460.
  • GRAU, Bernardo Cuenca [et al.]. OWL 2: the next step for OWL. Journal of Web Semantics. November 2008, vol. 6, iss. 4, p. 309 — 322. DOI 10.1016/j.websem.2008.05.001.
  • GUTIERREZ, Claudio; HURTADO, Carlos A.; VAISMAN, Alejandro. Temporal RDF. In The semantic web: research and applications: proceedings of the 2nd European Semantic Web Conference, Heraklion, Crete, Greece. Berlin; Heidelberg: Springer, 2005, p. 93 — 107. Lecture Notes in Computer Science, vol. 3532. DOI 10.1007/11431053_7.
  • GUTIERREZ, Claudio; HURTADO, Carlos A.; VAISMAN, Alejandro. Introducing time into RDF. IEEE Transactions on Knowledge and Data Engineering. February 2007, vol. 19, no. 2, p. 207 — 218. Also available from WWW: http://www.spatial.cs.umn.edu/Courses/Fall11/8715/papers/time-rdf.pdf
  • HALLOWAY, Stuart. Perception and action: an introduction to Clojure’s time model [online]. April 15, 2011 [cit. 2013-06-15]. Available from WWW: http://www.infoq.com/presentations/An-Introduction-to-Clojure-Time-Model
  • HAYES, Pat. Formal unifying standards for the representation of spatiotemporal knowledge [online]. Pensacola (FL): IHMC, 2004 [cit. 2013-06-15]. Available from WWW: http://www.ihmc.us/users/phayes/arlada2004final.pdf
  • HICKEY, Rich. The Datomic information model [online]. February 1, 2013 [cit. 2013-06-11]. Available from WWW: http://www.infoq.com/articles/Datomic-Information-Model
  • HOFFART, Johannes [et al.]. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence. January 2013, vol. 194, p. 28 — 61. DOI 10.1016/j.artint.2012.06.001.
  • The identity of indiscernibles. In Stanford encyclopedia of philosophy [online]. August 15, 2010 [cit. 2013-06-15]. Available from WWW: http://plato.stanford.edu/entries/identity-indiscernible/
  • JENSEN, Christian S. Introduction to temporal database research. In Temporal database management. Aalborg, 2000. Dissertation thesis. Aalborg University. Also available from WWW: http://people.cs.aau.dk/~csj/Thesis/
  • KIRYAKOV, Atanas; OGNYANOV, Damyan. Tracking changes in RDF(S) repositories. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web. London (UK): Springer, 2002, p. 373 — 378. Also available from WWW: http://www.ontotext.com/sites/default/files/publications/TrackingKTSW02.pdf. ISBN 3-540-44268-5.
  • KRIEGER, Hans-Ulrich. Where temporal description logics fail: representing temporally-changing relationships. In Advances in artificial intelligence: proceedings of the 31st Annual German Conference on AI, KI 2008, Kaiserslautern, Germany, September 23 — 26, 2008. Berlin; Heidelberg: Springer, 2008, p. 249 — 257. Lecture notes in computer science, vol. 5243. DOI 10.1007/978-3-540-85845-4_31.
  • LOPES, Nuno [et al.]. RDF needs annotations. In RDF Next Steps: W3C workshop, June 26 — 27, 2010 [online]. 2009 [cit. 2013-06-13]. Available from WWW: http://www.w3.org/2009/12/rdf-ws/papers/ws09
  • MASINTER, Larry. The ‘tdb’ and ‘duri’ URI schemes, based on dated URIs [online]. 2012 [cit. 2013-06-10]. Available from WWW: http://tools.ietf.org/html/draft-masinter-dated-uri-10
  • MCCUSKER, James P.; MCGUINNESS, Deborah L. Towards identity in linked data. In Proceedings of the 7th International Workshop on OWL: Experiences and Directions. San Francisco, California, USA, June 21 — 22, 2010 [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-09]. CEUR workshop proceedings, vol. 614. Available from WWW: http://ceur-ws.org/Vol-614/owled2010_submission_12.pdf
  • MITTELBACH, Arno. RDF and the time dimension, part 1 [online]. 2008-11-28 [cit. 2013-06-16]. Available from WWW: http://oxforderewhon.wordpress.com/2008/11/28/rdf-and-the-time-dimension-part-1/
  • OWL 2 Web Ontology Language: new features and rationale [online]. GOLBREICH, Christine; WALLACE, Evan K. (eds.). 2nd ed. W3C, 2012 [cit. 2013-06-11]. Available from WWW: http://www.w3.org/TR/owl2-new-features
  • PAPAVASILEIOU, Vicky [et al.]. High-level change detection in RDF(S) KBs. ACM Transations on Database Systems. April 2013, vol. 38, iss. 1. DOI 10.1145/2445583.2445584.
  • PROCHÁZKA, Jiří [et al.]. The Property Reification Vocabulary 0.11 [online]. February 19, 2011 [cit. 2013-06-10]. Available from WWW: http://smiy.sourceforge.net/prv/spec/propertyreification.html
  • PUGLIESE, Andrea; UDREA, Octavian; SUBRAHMANIAN, V. S. Scaling RDF with time. In Proceedings of the 17th international conference on World Wide Web. New York (NY): ACM, 2008, p. 605 — 614. Also available from WWW: http://wwwconference.org/www2008/papers/pdf/p605-puglieseA.pdf. DOI 10.1145/1367497.1367579.
  • RDF semantics [online]. HAYES, Patrick (ed.). 2004 [cit. 2013-06-13]. Available from WWW: http://www.w3.org/TR/rdf-mt/
  • RDF 1.1 concepts and abstract syntax: W3C Working Draft [online]. CYGANIAK, Richard; WOOD, David (eds.). January 15, 2013 [cit. 2013-06-12]. Available from WWW: http://www.w3.org/TR/rdf11-concepts/
  • RDF primer [online]. MANOLA, Frank; MILLER, Eric (eds.). February 10, 2004 [cit. 2013-06-15]. Available from WWW: http://www.w3.org/TR/rdf-primer/
  • REES, Jonathan; BOOTH, David; HAUSENBLAS, Michael. Towards formal HTTP semantics: AWWSW report to the TAG [online]. December 4, 2009 [cit. 2013-06-12]. Available from WWW: http://www.w3.org/2001/tag/awwsw/http-semantics-report.html
  • RULA, Anisa [et al.]. On the diversity and availability of temporal information in linked open data. In Proceedings of the 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, part I. Berlin; Heidelberg: Springer, 2012, p. 492 — 507. Lecture notes in computer science, vol. 7649. DOI 10.1007/978-3-642-35176-1_31.
  • SANDERSON, Robert D.; VAN DE SOMPEL, Herbert. Cool URIs and dynamic data. IEEE Internet Computing. 2012, vol. 16, no. 4, p. 76 — 79. Also available from WWW: http://public.lanl.gov/herbertv/papers/Papers/2012/CoolURIsDynamicData.pdf. DOI 10.1109/MIC.2012.78.
  • SARACCO, Cynthia M.; NICOLA, Matthias; GANDHI, Lenisha. A matter of time: temporal data management in DB2 10 [online]. April 3, 2012 [cit. 2013-06-11]. Available from WWW: http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/
  • TAPPOLET, Jonas; BERNSTEIN, Abraham. Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. In Proceedings of the 6th European Semantic Web Conference. Berlin; Heidelberg: Springer, 2009, p. 308 — 322. DOI 10.1007/978-3-642-02121-3_25.
  • TENNISON, Jeni. Temporal scope for RDF triples [online]. 2009-02-15 [cit. 2013-06-13]. Available from WWW: http://www.jenitennison.com/blog/node/101
  • TENNISON, Jeni. Linked open data in a changing world [online]. 2009-07-10 [cit. 2013-06-12]. Available from WWW: http://www.jenitennison.com/blog/node/108
  • TENNISON, Jeni. Versioning (UK government) linked data [online]. 2010-02-27 [cit. 2013-06-15]. Available from WWW: http://www.jenitennison.com/blog/node/141
  • Time Ontology in OWL: W3C Working Draft 27 September 2006. HOBBS, Jerry R.; PAN, Feng (eds.). W3C, 2006 [cit. 2013-06-09]. Available from WWW: http://www.w3.org/TR/owl-time/
  • UMBRICH, Jörgen; KARNSTEDT, Marcel; LAND, Sebastian. Towards understanding the changing web: mining the dynamics of linked-data sources and entities. In Proceedings of the LWO 2010 Workshop, October 4-6, 2010, Kassel, Germany [online]. Kassel: Universität Kassel, 2010 [cit. 2013-06-09]. Available from WWW: http://www.kde.cs.uni-kassel.de/conf/lwa10/papers/kdml22.pdf
  • VANDER SANDE, Miel [et al.]. R&Wbase: Git for triples. In Proceedings of the WWW2013 Workshop on Linked Data on the Web 2013, May 14, 2013, Rio de Janeiro, Brazil [online]. Aachen: RWTH Aachen University, 2013 [cit. 2013-06-09]. CEUR workshop proceedings, vol. 996. Available from WWW: http://events.linkeddata.org/ldow2013/papers/ldow2013-paper-01.pdf. ISSN 1613-0073.
  • VAN DE SOMPEL, Herbert; NELSON, Michael L.; SANDERSON, Robert D. HTTP framework for time-based access to resource states: Memento [online]. March 29, 2013 [cit. 2013-06-11]. Available from WWW: http://tools.ietf.org/html/draft-vandesompel-memento-07
  • WELTY, Christopher A.; FIKES, Richard. A reusable ontology for fluents in OWL. Formal Ontology in Information Systems: Proceedings of the Fourth International Conference (FOIS 2006). Amsterdam: IOS, 2006, p. 226 — 236. Frontiers in artificial intelligence and applications, vol. 150. ISBN 978-1-58603-685-0.
  • WELTY, Christopher A. Context slices: representing contexts in OWL. In Proceedings of the 2nd International Workshop on Ontology Patterns - WOP2010 [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-16]. CEUR workshop proceedings, vol. 671. Available from WWW: http://ceur-ws.org/Vol-671/pat01.pdf. ISSN 1613-0073.

Footnotes

  1. For example, Semantic Sitemap with <changefreq> element.

  2. http://www.w3.org/TR/xmlschema-2/

  3. http://en.wikipedia.org/wiki/ISO_8601

  4. For example, Dublin Core initiative proposed a way for encoding time intervals into typed literals (Cox, 2006).

  5. https://github.com/ccattuto/neo4j-dynagraph/wiki/Representing-time-dependent-graphs-in-Neo4j

  6. http://www.w3.org/TR/skos-reference/

  7. http://dublincore.org/documents/dcmi-terms/

  8. http://www.w3.org/TR/turtle/

  9. http://www.w3.org/TR/rdf-mt/#Reif

  10. http://www.w3.org/TR/owl2-mapping-to-rdf/#a_Annotation

  11. http://docs.api.talis.com/getting-started/changesets

  12. http://www.cibiv.at/~niko/dsnotify/vocab/eventset/v0.1/dsnotify-eventset.html

  13. There is a convention of using the term “n-ary relation” for relations with arity higher than 2, even though unary and binary relations are n-ary relations as well.

  14. http://www.freebase.com/

  15. http://www.wikidata.org/wiki/Wikidata:Main_Page

  16. http://www.w3.org/TR/vocab-org/#membership-n-ary-relationship

  17. http://www.w3.org/DesignIssues/Generic

  18. http://www.w3.org/DesignIssues/Axioms.html#opaque

  19. There are several proposed serialization formats (e.g., TriG or TriX), none of which reached the status of an official recommendation.

  20. http://patterns.dataincubator.org/book/named-graphs.html

  21. A property from Vocabulary of Interlinked Datasets (VoID). http://www.w3.org/TR/void/

  22. http://www.w3.org/2010/01/Turtle/Trig

  23. http://www.w3.org/2011/rdf-wg/wiki/User:Rcygania2/RDF_Datasets_and_Stateful_Resources

  24. http://www.w3.org/TR/prov-o/

  25. TBox constitutes the terminology used in RDF assertions.

  26. http://parliament.semwebcentral.org/

  27. http://marmotta.incubator.apache.org/kiwi/versioning.html

  28. Metaweb Query Language. http://mql.freebaseapps.com/ch03.html#history

  29. http://docs.datomic.com/architecture.html

  30. http://research.google.com/archive/spanner.html