2015-05-02

Curling SPARQL HTTP Graph Store protocol

The SPARQL HTTP Graph Store protocol provides a way of manipulating RDF graphs via HTTP. Unlike SPARQL Update, it does not allow you to work with RDF on the level of individual assertions (triples). Instead, you handle your data on the higher level of named graphs. A named graph is a pair of a URI and a set of RDF triples. A set of triples can contain a single triple only, so it is technically possible to manipulate individual triples with the Graph Store protocol, but this way of storing data is not common. In line with the principles of REST, the protocol defines its operations using HTTP requests. It covers the familiar CRUD (Create, Read, Update, Delete) operations known from REST APIs. It is a simple and useful, albeit lesser-known, part of the family of SPARQL specifications. I have seen software that would have benefited had its developers known this protocol. This is why I decided to cover it in a post.

Instead of showing the HTTP interactions via the Graph Store protocol in a particular programming language I decided to use cURL as the lingua franca of HTTP. I discuss how the Graph Store protocol works in 2 implementations: Virtuoso (version 7.2) and Apache Jena Fuseki (version 2). By default, you can find a Graph Store endpoint at http://localhost:8890/sparql-graph-crud-auth for Virtuoso and at http://localhost:3030/{dataset}/data for Fuseki ({dataset} is the name of the dataset you configure). Virtuoso also allows you to use http://localhost:8890/sparql-graph-crud for read-only operations that do not require authentication. The differences between these implementations are minor, since both implement the protocol's specification well.

If you want to follow along with the examples below, an easy option is to download the latest version of Fuseki and start it with a disposable in-memory dataset using the shell command fuseki-server --update --mem /ds (ds is the name of our dataset). You can use any RDF file as testing data. For example, you can download DBpedia's description of SPARQL in the Turtle syntax as the file data.ttl:

curl -L -H "Accept:text/turtle" \
     http://dbpedia.org/resource/SPARQL > data.ttl

Finally, if any of the arguments you provide to cURL (such as a graph URI) contains characters with special meaning in your shell (such as &), you need to enclose them in double quotes. The backslash you see in the example commands escapes newlines so that the commands can be split across lines for better readability.

I will now walk through the 4 main operations defined by the Graph Store protocol: creating graphs with the PUT method, reading them using the GET method, adding data to existing graphs using the POST method, and deleting graphs, which can be achieved, quite unsurprisingly, via the DELETE method.

Create: PUT

You can load data into an RDF graph using the PUT HTTP method (see the specification). This is how you load RDF data from file data.ttl to the graph named http://example.com/graph:

Virtuoso
curl -X PUT \
     --digest -u dba:dba \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:8890/sparql-graph-crud-auth \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -X PUT \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

The -T option uploads the given local file, -H specifies an HTTP header indicating the content type of the uploaded file, -G tells cURL to append the request parameters to the Graph Store endpoint's URL as a query string, and --data-urlencode lets you pass in the URI naming the created graph (via the graph query parameter). Since the Graph Store protocol's interface is uniform, most of the other operations use similar arguments.

Virtuoso uses HTTP Digest authentication for write operations (i.e. create, update, and delete). The example above assumes the default Virtuoso user and password (i.e. dba:dba). If you fail to provide valid authentication credentials, you will be slapped on the wrist with the HTTP 401 Unauthorized status code. Fuseki does not require authentication by default, but you can configure it using Apache Shiro.

When using Virtuoso, you can leave the Content-Type header out, because the data format will be detected automatically. Fuseki, on the other hand, requires the header, and if you fail to provide it, you will face an HTTP 400 Bad Request response. Either way, try not to rely on the autodetection being correct and provide the Content-Type header explicitly.
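
Since the header should always be sent explicitly, it can be convenient to derive it from the file name. The following is a minimal sketch, not part of either server: rdf_content_type is a hypothetical helper name, and it assumes that the file extension matches the actual serialization of the file.

```shell
# Hypothetical helper: map a file name to an RDF media type by extension.
# Assumption: the extension reflects the real serialization of the file.
rdf_content_type() {
  case $1 in
    *.ttl)    echo text/turtle ;;
    *.nt)     echo application/n-triples ;;
    *.rdf)    echo application/rdf+xml ;;
    *.jsonld) echo application/ld+json ;;
    *)        echo "Unknown RDF file extension: $1" >&2; return 1 ;;
  esac
}

rdf_content_type data.ttl   # prints text/turtle
```

In the commands in this post you could then write -H "Content-Type:$(rdf_content_type data.ttl)" instead of hard-coding the media type.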

If you want to put data into the default graph, you can use the default query parameter with no value:

Fuseki
curl -X PUT \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:3030/ds/data \
     -d default

If you use Fuseki, you can also omit the graph parameter completely to manipulate the default graph. Nevertheless, this is not standard behaviour, so you should not rely on it.

If you PUT data into an existing non-empty graph, its previous data is replaced.

Read: GET

To download data from a given graph, you just issue a GET request (see the specification). You can use the -G option to perform a GET request via cURL:

Virtuoso
curl -G http://localhost:8890/sparql-graph-crud \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

Alternatively, you can simply use curl "http://localhost:3030/ds/data?graph=http%3A%2F%2Fexample.com%2Fgraph", but -G allows you to provide the graph query parameter separately via --data-urlencode, which also takes care of the proper URL-encoding. You can specify the RDF serialization you want to get the data in via the Accept HTTP header. For example, if you want the data in N-Triples, you provide the Accept header with the MIME type application/n-triples:

Fuseki
curl -H Accept:application/n-triples \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

Unfortunately, while Fuseki supports the application/n-triples MIME type, Virtuoso does not. Instead, you will have to specify the deprecated MIME type text/ntriples (even text/plain will work) to get the data in N-Triples. Since N-Triples serializes each RDF triple on a separate line, you can use it as a naïve way of counting the triples in a graph by piping the data into wc -l (the -s option hides the cURL progress bar):

Virtuoso
curl -s -H Accept:text/ntriples \
     -G http://localhost:8890/sparql-graph-crud \
     --data-urlencode graph=http://example.com/graph | \
     wc -l
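
N-Triples also permits blank lines and comment lines starting with #, which wc -l would count as triples. A slightly more careful count can filter them out. This is only a sketch: count_triples is a hypothetical helper name, and the sample data is made up so that the snippet runs without a server.

```shell
# Hypothetical helper: count lines that are neither blank nor comments.
# Assumes one triple per line, which holds for N-Triples.
count_triples() {
  grep -c -v -E '^[[:space:]]*(#|$)'
}

# Made-up sample: two triples, one blank line, one comment line.
printf '%s\n' \
  '<http://example.com/s> <http://example.com/p> "first" .' \
  '' \
  '# a comment line' \
  '<http://example.com/s> <http://example.com/p> "second" .' |
  count_triples   # prints 2
```

To use it with a real endpoint, pipe the output of the cURL command above into count_triples instead of wc -l.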

If a graph named by the requested URI does not exist, you will get an HTTP 404 Not Found response.
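
To see what the URL-encoding of the graph parameter amounts to, here is a rough sketch in plain POSIX shell that builds the request URL by hand. The function names urlencode and gsp_url are hypothetical, and the sketch assumes ASCII-only graph URIs. It percent-encodes everything outside the RFC 3986 unreserved characters, which should correspond to what --data-urlencode does for you.

```shell
# Hypothetical helper: percent-encode every character outside the
# RFC 3986 "unreserved" set. Assumes ASCII input only.
urlencode() {
  rest=$1
  out=
  while [ -n "$rest" ]; do
    tail=${rest#?}          # everything after the first character
    c=${rest%"$tail"}       # the first character itself
    rest=$tail
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: $1 = Graph Store endpoint, $2 = graph URI.
gsp_url() {
  printf '%s?graph=%s\n' "$1" "$(urlencode "$2")"
}

gsp_url http://localhost:3030/ds/data http://example.com/graph
# prints http://localhost:3030/ds/data?graph=http%3A%2F%2Fexample.com%2Fgraph
```

In practice there is little reason to do this by hand, since -G with --data-urlencode is shorter and handles the encoding for you. The sketch is only meant to make the encoding visible.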

Update: POST

If you want to add data to an existing graph, use the POST method (see the specification). If you POST data to a non-existent graph, it will be created just as if you had used the PUT method. The difference between POST and PUT is that when you send data to an existing graph, POST will merge it with the graph's current data, while PUT will replace it.

Virtuoso
curl -X POST \
     --digest -u dba:dba \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:8890/sparql-graph-crud-auth \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -X POST \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

It is worth knowing how triples are merged during this operation. When you POST data to a non-empty graph, the current set of triples the graph is associated with will be merged with the set of triples from the uploaded data via set union. In most cases, if these two sets share any triples, they will not be duplicated. However, if the shared triples contain blank nodes, they will be duplicated because, due to their local scope, blank nodes from different datasets are always treated as distinct. For example, if you repeatedly POST the same triples containing blank nodes to the same graph, the first time its size will increase by the number of posted triples, but on the second and subsequent POSTs the size of the graph will increase only by the number of triples containing blank nodes. This can be one of the reasons why you may want to avoid using blank nodes.
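
The arithmetic can be illustrated without a server. The following is a toy model, not how Virtuoso or Fuseki are implemented: it assumes that each upload gets fresh blank node labels (here by appending the upload number) and then merges the data into the graph with sort -u as a stand-in for set union. The sample triples and the name simulate_post are made up for illustration.

```shell
# Toy model of POST semantics, NOT a real triple store: every upload gets
# fresh blank node labels, then the graph is merged by set union.
graph=$(mktemp)
data=$(mktemp)
cat > "$data" <<'EOF'
<http://example.com/s> <http://example.com/p> "literal" .
<http://example.com/s> <http://example.com/q> <http://example.com/o> .
<http://example.com/s> <http://example.com/r> _:b0 .
EOF

simulate_post() {
  # $1 = upload number, used to make blank node labels unique per upload
  sed "s/_:\([A-Za-z0-9]*\)/_:\1u$1/g" "$data" |
    sort -u - "$graph" > "$graph.new"
  mv "$graph.new" "$graph"
  wc -l < "$graph" | tr -d ' '   # size of the graph after the upload
}

size1=$(simulate_post 1)
size2=$(simulate_post 2)
size3=$(simulate_post 3)
echo "$size1 $size2 $size3"   # prints 3 4 5
rm -f "$graph" "$data"
```

The first upload adds all 3 triples. Each later upload adds only the 1 triple with a blank node, because the 2 triples without blank nodes are already in the graph.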

Delete: DELETE

Unsurprisingly, deleting graphs is achieved using the DELETE method (see the specification). As you may expect by now, if you attempt to delete a non-existent graph, you will get an HTTP 404 Not Found response.

Virtuoso
curl -X DELETE \
     --digest -u dba:dba \
     -G http://localhost:8890/sparql-graph-crud-auth \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -X DELETE \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

Other methods

As in the HTTP specification, there are other methods defined in the Graph Store protocol. An example of such a method is HEAD, which can be used to test whether a graph exists. cURL allows you to issue a HEAD request using the -I option:

Fuseki
curl -I \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

If the graph exists, you will receive an HTTP 200 OK status code. Otherwise, you will once again see a saddening HTTP 404 Not Found response. In the current version of Virtuoso (7.2), using the HEAD method will trigger 501 Method Not Implemented, so you should use Fuseki if you want to play with this method.
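
If you script these checks, it may help to capture just the status code and translate it. The sketch below collects the status codes mentioned in this post. Both function names are hypothetical. graph_status relies on cURL's -w '%{http_code}' option and is only defined here, not called, because it needs a running Graph Store endpoint.

```shell
# Hypothetical helper: explain the status codes mentioned in this post.
explain_status() {
  case $1 in
    2??) echo "success" ;;
    400) echo "bad request" ;;
    401) echo "unauthorized" ;;
    404) echo "graph not found" ;;
    501) echo "method not implemented" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Hypothetical helper: print only the status code of a HEAD request.
# $1 = Graph Store endpoint, $2 = graph URI. Needs a running server.
graph_status() {
  curl -s -o /dev/null -w '%{http_code}' \
       -I -G "$1" --data-urlencode "graph=$2"
}

explain_status 404   # prints graph not found
```

With Fuseki running, explain_status "$(graph_status http://localhost:3030/ds/data http://example.com/graph)" would report whether the graph exists.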

As the Graph Store protocol's specification shows, you can replace any operation of the protocol by an equivalent SPARQL update or query. The Graph Store protocol thus provides an uncomplicated interface for basic operations manipulating RDF graphs. I think it is a simple tool worth knowing.
