2017-02-05

Publishing temporal RDF data as create/delete event streams

Today I wondered about publishing temporal data as create/delete event streams. The events can be replayed by a database to produce the state valid at a desired time. Use cases for this functionality can be found everywhere. For example, I've recently been working with open data from company registers. When companies relocate, their registered addresses change. It is thus useful to be able to trace back in history to find out what a company's address was at a particular date. When a company register provides only the current state of its data, tracing back to previous addresses is not possible.

Event streams are not a new idea. They correspond to the well-known event sourcing pattern. This pattern has been used with RDF too (for example, in shodan by Michael Hausenblas), although it is definitely not widespread. In 2013, I wrote an article about capturing the temporal dimension of linked data. I think most of it holds true today. In particular, there is still no established way of capturing temporal data in RDF. Event streams might be an option to consider.

Many answers to questions about temporal data in RDF can be found in RDF 1.1 Concepts, one of the fundamental documents on RDF. For a start, "the RDF data model is atemporal: RDF graphs are static snapshots of information." Nevertheless, "a snapshot of the state can be expressed as an RDF graph," which lends itself to a crude way of representing temporal data through time-indexed snapshots encoded as RDF graphs. There is also the option of reifying everything into observations, which is what statisticians and the Data Cube Vocabulary do. Alternatively, we can reify everything that happens into events.

Events

Events describe actions on data in a dataset. You can also think of them as database transactions, along with their associated properties. RDF triples are immutable, since "an RDF statement cannot be changed – it can only be added and removed" (source). This is why we need only two types of events:

  1. :Create (addition)
  2. :Delete (retraction)

Events can be represented as named graphs. Each event carries its type, which can be either :Create or :Delete, and timestamps, in particular valid time. Valid time tells us when the event's data is valid in the modelled world. The dcterms:valid property seems good enough to specify the valid time. Events may additionally describe other metadata, such as provenance. For example, dcterms:creator may link the person who created the event data. Encapsulating an event's metadata in its named graph makes it self-contained, but it mixes operational data with data about the described domain, so an alternative worth considering is to store the metadata in a separate graph.
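
As a minimal sketch of that alternative, the event's operational metadata can live in a dedicated graph, while the event graph carries only domain data (the :metadata graph name and the :Bob creator are illustrative, not part of the proposal):

@prefix :        <http://example.com/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix org:     <http://www.w3.org/ns/org#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Operational metadata about the event, kept apart from the domain data
:metadata {
  :event-1 a :Create ;
    dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime ;
    dcterms:creator :Bob .
}

# The event graph itself describes only the modelled domain
:event-1 {
  :Alice org:memberOf :ACME .
}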

The following example event stream describes that Alice became a member of ACME in 2000, left in 2001 to work for Umbrella Corp, and returned to ACME in 2003. The example is serialized in TriG, which can describe quads with named graphs instead of mere triples. You can use this example to test the queries discussed further on.

@prefix :         <http://example.com/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix org:      <http://www.w3.org/ns/org#> .
@prefix schema:   <http://schema.org/> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

:event-1 {
  :event-1 a :Create ;
    dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME ;
    schema:name "Alice" .
}

:event-2 {
  :event-2 a :Delete ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

:event-3 {
  :event-3 a :Create ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-4 {
  :event-4 a :Delete ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-5 {
  :event-5 a :Create ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

Limitations

Describing event streams in the afore-mentioned way has some limitations. One of the apparent issues is the volume of data needed to encode seemingly simple facts. There are several ways to deal with this. Under the hood, RDF stores may implement structural sharing, as in persistent data structures, to avoid duplicating substructures present across multiple events. We can also make several assumptions that save space. :Create can be made the default event type, so that it doesn't need to be provided explicitly. In some limited cases, we can assume that valid time is the same as the transaction time. For example, in some countries, public contracts become valid only after they are published.
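
As a sketch of the default-type assumption, a query can treat any event graph that is not explicitly typed as :Delete as an implicit :Create (this assumes the event metadata is stored inside the event graphs, as in the example above):

PREFIX : <http://example.com/>

# List the events that count as creations under the default-type
# assumption: every event graph not explicitly typed as :Delete.
SELECT DISTINCT ?event
WHERE {
  GRAPH ?event {
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?event {
      ?event a :Delete .
    }
  }
}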

Another limitation of this approach is that it doesn't support blank nodes. You have to know the IRIs of the resources you want to describe.

Since named graphs are claimed for events, they cannot be used to distinguish datasets, as they typically are. Datasets need to be distinguished as RDF datasets. Having multiple datasets may hence mean having multiple SPARQL endpoints. Cross-dataset queries then have to be federated, or alternatively, current snapshots of the queried datasets can be loaded into a single RDF store as named graphs.
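
A federated cross-dataset query might look as follows (both endpoint URLs are hypothetical):

PREFIX org:    <http://www.w3.org/ns/org#>
PREFIX schema: <http://schema.org/>

# Join memberships from one endpoint with names from another.
SELECT ?member ?org ?name
WHERE {
  SERVICE <http://example.com/members/sparql> {
    ?member org:memberOf ?org .
  }
  SERVICE <http://example.com/names/sparql> {
    ?member schema:name ?name .
  }
}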

Integrity constraints

To illustrate properties of the proposed event representation, we can define integrity constraints that the event data must satisfy.

The union of delete graphs must be a subset of the union of create graphs. You cannot delete non-existent data. The following ASK query must return false:

PREFIX : <http://example.com/>

ASK WHERE {
  GRAPH ?delete {
    ?delete a :Delete .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?create {
      ?create a :Create .
      ?s ?p ?o .
    }
  }
}

Each event graph must contain its type. The following ASK query must return true for each event:

ASK WHERE {
  GRAPH ?g {
    ?g a [] .
  }
}
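
Since this ASK has to be evaluated once per event, a diagnostic variant can list all violating event graphs at once (a sketch; the same pattern can be adapted for the following constraints):

SELECT DISTINCT ?g
WHERE {
  GRAPH ?g {
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?g {
      ?g a [] .
    }
  }
}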

The event type can be either :Create or :Delete. The following ASK query must return true for each event:

PREFIX : <http://example.com/>

ASK WHERE {
  VALUES ?type {
    :Create
    :Delete
  }
  GRAPH ?g {
    ?g a ?type .
  }
}

Events cannot have multiple types. The following ASK query must return false:

ASK WHERE {
  {
    SELECT ?g
    WHERE {
      GRAPH ?g {
        ?g a ?type .
      }
    }
    GROUP BY ?g
    HAVING (COUNT(?type) > 1)
  }
}

Querying

Querying over event streams is naturally more difficult than querying reconciled dataset snapshots. Nevertheless, the complexity of the queries may be hidden behind a proxy offering a more convenient syntax that extends SPARQL. An easy way to try the following queries is to use Apache Jena's Fuseki with an in-memory dataset loaded from the example event stream above: ./fuseki-server --file data.trig --update /ds.
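
With Fuseki running, the queries can be submitted to its SPARQL endpoint, for example with curl (the port 3030 and the /ds dataset name are the defaults implied by the command above):

curl http://localhost:3030/ds/query \
  --data-urlencode "query=ASK { GRAPH ?g { ?s ?p ?o } }"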

Queries over the default graph, defined as the union of all named graphs, return what has been true at some point in time:

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  # Fuseki uses <urn:x-arq:UnionGraph> to denote the union graph,
  # unless tdb:unionDefaultGraph is set to true.
  # (https://jena.apache.org/documentation/tdb/assembler.html#union-default-graph)
  GRAPH <urn:x-arq:UnionGraph> {
    ?s ?p ?o .
  }
}

The currently valid data is the union of the :Create graphs minus the triples in subsequent :Delete graphs:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
            dcterms:valid ?createValid .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?delete {
      ?delete a :Delete ;
              dcterms:valid ?deleteValid .
      FILTER (?deleteValid > ?createValid)
      ?s ?p ?o .
    }
  }
}

We can also roll back and query the data at a particular moment in time. This functionality is what Datomic provides as the asOf filter. For instance, the data valid on January 1, 2001 at 9:00 is the union of the :Create events preceding this instant, minus the :Delete events that also precede it:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
      dcterms:valid ?validCreate .
    FILTER (?validCreate < "2001-01-01T09:00:00Z"^^xsd:dateTime)
    ?s ?p ?o .
  }
  MINUS {
    GRAPH ?delete {
      ?delete a :Delete ;
        dcterms:valid ?validDelete .
      FILTER (?validDelete < "2001-01-01T09:00:00Z"^^xsd:dateTime)
      ?s ?p ?o .
    }
  }
}

Event resolution proxy

Manipulation of event streams following the proposed representation can be simplified by an event resolution proxy. This proxy may be based on the SPARQL 1.1 Graph Store HTTP Protocol, which provides a standard way to work with named graphs. However, the Graph Store Protocol doesn't support quad-based RDF formats, so the proxy needs to partition multi-graph payloads into several transactions.

The proxy can provide several conveniences. It can prune event payloads by removing retractions of non-existent triples or additions of existing triples, or by dropping entire events if found redundant. It can automatically add transaction time, for example by using BIND (now() AS ?transactionTime) in SPARQL. Simplifying even further, the proxy can automatically mint event identifiers as URNs produced by the uuid() function in SPARQL. No event metadata can be provided explicitly in that case, although some metadata may be created automatically. The event type can be inferred from the HTTP method the proxy receives: HTTP PUT may correspond with the :Create type, while HTTP DELETE indicates the :Delete type. Additionally, the proxy can assume that valid time is the same as transaction time.
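
A sketch of the SPARQL Update such a proxy might generate for an incoming HTTP PUT, assuming valid time equals transaction time (the payload triple shown is illustrative):

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX org:     <http://www.w3.org/ns/org#>

# Mint a fresh event IRI, stamp it with the transaction time,
# and store the payload in the event's named graph.
INSERT {
  GRAPH ?event {
    ?event a :Create ;
      dcterms:valid ?transactionTime .
    :Alice org:memberOf :ACME .  # payload received by the proxy
  }
}
WHERE {
  BIND (uuid() AS ?event)
  BIND (now() AS ?transactionTime)
}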

Publishing

Create/delete event streams can be effectively split into batches by time intervals, which suggests several ways of publishing such data. An event stream should be published as a continuously updated, append-only quad dump. Additionally, there may be quad dumps of events from shorter periods of time, such as a day or a month, to enable more responsive data syndication. The currently valid dataset may be materialized and published as a periodically updated dump. Instead of being updated in place, the current dataset may also be published as a series of snapshots. A snapshot from a given date can be used as a basis when replaying events, so that you don't have to replay the whole history of events, but only those that came after the snapshot. Any quad-based RDF serialization, such as TriG or N-Quads, will do for the dumps. Finally, in the absence of structural sharing, the dumps should be compressed to avoid the size bloat caused by duplication of shared data structures.
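
For instance, a monthly batch can be selected by filtering events on their valid time (a sketch; serializing the resulting quads as TriG or N-Quads requires a quad-aware store API, since CONSTRUCT yields only triples):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

# Select the quads of all events whose valid time falls in January 2001.
SELECT ?event ?s ?p ?o
WHERE {
  GRAPH ?event {
    ?event dcterms:valid ?valid .
    ?s ?p ?o .
  }
  FILTER ("2001-01-01T00:00:00Z"^^xsd:dateTime <= ?valid &&
          ?valid < "2001-02-01T00:00:00Z"^^xsd:dateTime)
}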

The next challenge is to refine and test this idea. We can also wrap the event streams in a convenient abstraction that reduces the cognitive effort of manipulating them. I think this is something developers of RDF stores may consider including in their products.
