2016-06-28

On generating SPARQL

The question how to generate SPARQL comes so often in my work that I figured I attempt a summary of the different approaches that answer this question, none of which is established enough to be considered a best practice. The lack of a well-established method for generating SPARQL may have arisen from the fact that SPARQL is serialized to strings. While its string-based format may be convenient to write by hand, it is less convenient to generate through code. To make SPARQL more amenable to programmatic manipulation, several approaches have been devised, some of which I will cover in this post.

Opting for strings or structured data to represent a data manipulation language is a fundamental design decision. Dumbing the decision down, there is a usability trade-off to be made: either adopt a string representation to ease manual authoring or go for a representation using data to ease programmatic manipulation. Can we nevertheless have the best of both options?

SPARQL continues in the string-based tradition of SQL; possibly leveraging a superficial familiarity between the syntaxes. In fact, SPARQL was recently assessed as “a string-based query language, as opposed to a composable data API.” This assessment implicitly reveals that there is a demand for languages represented as structured data, such as the Elasticsearch query DSL serialized in JSON or the Datomic query syntax, which is serialized in EDN.

To illustrate the approaches for generating SPARQL I decided to show how they fare on an example task. The chosen example should be simple enough, yet realistic, and such that it demonstrates the common problems in encountered when generating SPARQL.

Example

Let's say you want to know all the people (i.e. instances of foaf:Person) known to DBpedia. There are 1.8 million such persons, which is way too many to fetch in a single SPARQL query. In order to avoid overload, DBpedia's SPARQL endpoint is configured to provide at most 10 thousand results. Since we want to get all the results, we need to use paging via LIMIT and OFFSET to partition the complete results into smaller parts, such that one part can be retrieved within a single query.

Paging requires a stable sort order over the complete collection of results. However, sorting a large collection of RDF resources is an expensive operation. If a collection's size exceeds a pre-configured limit, Virtuoso requires the queries paging over this collection to use scrollable cursors (see the section “Example: Prevent Limits of Sorted LIMIT/OFFSET query”), which basically wrap an ordered query into a subquery in order to better leverage the temporary storage of the sorted collection. Because of the number of persons in DBpedia we need to apply this technique to our query.

Let's say that for each person we want to get values of several required properties and some optional properties. For example, we may want to get names (foaf:name) and birth dates (dbo:birthDate) and, optionally, dates of death (dbo:deathDate). Since persons can have multiple names and we want only one name per person, we need to use SAMPLE with the names to retrieve a single random name associated with each person. We could assume that a person has no more than one birth date and death date, but in fact in DBpedia there are 31 thousand persons with multiple birth dates and 12 thousand persons with multiple death dates, so we also need to use SAMPLE for these properties. Considering all these requirements, our SPARQL query may look like the following.

We need the LIMIT and OFFSET in this query to be generated dynamically, as well as the listing of the required and the optional properties. In the following, I will cover and compare several approaches for generating SPARQL queries using these parameters. All the approaches are illustrated by examples in Clojure to make them better comparable.

String concatenation

The approach that is readily at hand of most developers is string concatenation. While it is simple to start, it intertwines SPARQL with code, which makes it brittle and error prone. Convoluted manual escaping may be needed and the result is particularly messy in programming languages that lack support of multi-line strings, such as JavaScript. Here is an example implementation using string concatenation to generate the above-mentioned query.

Parameterized queries

If queries to be generated differ only in few variables, they can be generated from a parameterized query, which represents a generic template that can be filled with specific parameter values at runtime. Parameterized queries enable to split static and dynamic parts of the generated queries. While the static query template is represented in SPARQL and can be stored in a separate file, the dynamic parts can be represented using a programming language's data types and passed to the template at runtime. The separation provides both the readability of SPARQL and the expressiveness of programming languages. For example, template parameters can be statically typed in order to improve error reporting during development or avoid some SPARQL injection attacks in production.

Although parameterized queries improve on string concatenation and are highlighted among the linked data patterns, they are limited. In particular, I will discuss the limitations of parameterized queries as implemented in Apache Jena by ParameterizedSparqlString. While the main drawbacks of parameterized queries are true for other implementations as well, the details may differ. In Jena's parameterized queries only variables can be dynamic. Moreover, each variable can be bound only to a single value. For example, for the pattern ?s a ?class . we cannot bind ?class to schema:Organization and schema:Place to produce ?s a schema:Organization, schema:Place .. If we provide multiple bindings for a variable, only the last one is used. Queries that cannot restrict their dynamic parts to variables can escape to using string buffers to append arbitrary strings to the queries, but doing so gives you the same problems string concatenation has. Due to these restrictions we cannot generate the example query using this approach. Here is a partial implementation that generates only the limit and offset.

Jena also provides a similar approach closer to prepared statements in SQL. When executing a query (via QueryExecutionFactory) you can provide it with pre-bound variables (via QuerySolutionMap). Similar restrictions to those discussed above apply. Moreover, your template must be a syntactically valid SPARQL query or update. In turn, this prohibits generating LIMIT numbers, because LIMIT ?limit is not a valid SPARQL syntax. The following implementation thus does not work.

Templating

If we need higher expressivity for generating SPARQL, we can use a templating language. Templating is not restricted to the syntax of SPARQL, because it treats SPARQL as arbitrary text, so that any manipulation is allowed. As is the case with parameterized queries, templating enables to separate the static template from the dynamic parts of SPARQL. Unlike parameterized queries, a template is not represented in pure SPARQL, but in a mix of SPARQL and a templating language. This recalls some of the shortcomings of interposing SPARQL with code in string concatenation. Moreover, templates generally do not constrain their input, for example by declaring types, so that any data can be passed into the templates without it being checked in advance. Notable exceptions allowing to declare types of template data are available in Scala; for example Scalate or Twirl.

In order to show how to generate SPARQL via templating I adopted Mustache, which is a minimalistic and widely known templating language implemented in many programming languages. The following is a Mustache template that can generate the example query.

Rendering this template requires little code that provides the template with input data.

Domain-specific languages

As the string concatenation approach shows, SPARQL can be built using programming language construct. Whereas string concatenation operates on the low level of text manipulation, programming languages can be used to create constructs operating on a higher level closer to SPARQL. In this way, programming languages can be built up to domain-specific languages (DSLs) that compile to SPARQL. DSLs retain the expressivity of the programming languages they are defined in, while providing a syntax closer to SPARQL, thus reducing the cognitive overhead when translating SPARQL into code. However, when using or designing DSLs, we need to be careful about the potential clashes between names in the modelled language and the programming language the DSL is implemented in. For example, concat is used both in SPARQL and Clojure. Conversely, if a DSL lacks a part of the modelled language, escape hatches may be needed, regressing back to string concatenation.

Lisps, Clojure included, are uniquely positioned to serve as languages for defining DSLs. Since Lisp code is represented using data structures, it is easier to manipulate than languages represented as strings, such as SPARQL.

Matsu is a Clojure library that provides a DSL for constructing SPARQL via macros. Macros are expanded at compile time, so when they generate SPARQL they generally cannot access data that becomes available only at runtime. To a limited degree it is possible to work around this limitation by invoking the Clojure reader at runtime. Moreover, since Matsu is built using macros, we need to use macros to extend it. An example of such approach is its in-built the defquery macro that allows to pass parameters into a query template. Nevertheless, mixing macros with runtime data quickly becomes convoluted, especially if larger parts of SPARQL need to be generated dynamically.

If we consider using Matsu for generating the example query, we discover several problems that prevent us from accomplishing the desired outcome, apart from the already mentioned generic issues of macros. For instance, Matsu does not support subqueries. Defining subqueries separately and composing them as subqueries via raw input is also not possible, because Matsu queries contain prefix declarations, which are syntactically invalid in subqueries. Ultimately, the farthest I was able to get with Matsu for the example query was merely to the inner-most subquery.

Query DSLs in object-oriented langauges are often called query builders. For example, Jena provides a query builder that allows to build SPARQL by manipulating Java objects. The query builder is deeply vested in the Jena object model, which provides some type checking at the expense of a more verbose syntax. Since Clojure allows to call Java directly, implementing the example query using the query builder is straightforward.

While Matsu represents queries via macros and Jena's query builder does so via code, there is another option: representing queries via data. Using a programming language's native data structures for representing SPARQL provides arguably the best facility for programmatic manipulation. Data is transparent at runtime and as such it can be easily composed and inspected. In fact, a widespread Clojure design rule is to prefer functions over macros and data over functions. An example of using data to represent a SPARQL-like query language in Clojure is the Fabric DSL. While this DSL is not exactly SPARQL, it is “highly inspired by the W3C SPARQL language, albeit expressed in a more Clojuresque way and not limited to RDF semantics” (source).

SPIN RDF

An approach that uses data in RDF for representing SPARQL is SPIN RDF. It offers an RDF syntax for SPARQL and an API for manipulating it. While the translation of SPARQL to RDF is for the most part straightforward, one of its more intricate parts is using RDF collections for maintaining order in triple patterns or projected bindings, because the collections are difficult to manipulate in SPARQL.

Nonetheless, SPIN RDF seems to have a fundamental problem with passing dynamic parameters from code. For what I can tell, the membrane between SPIN RDF and code is impermeable. It would seem natural to manipulate SPIN RDF via SPARQL Update. However, how can you pass data to the SPARQL Update from your code? If you adopt SPIN RDF wholesale, your SPARQL Update operation is represented in RDF, so you have the same problem. Passing data from code to SPIN RDF thus results in a recursive paradox. Although I tried hard, I have not found a solution to this conundrum in the SPIN RDF documentation, nor in the source code of SPIN API.

This is how the example query can be represented using SPIN RDF; albeit using fixed values in place of the dynamic parts due to the limitations discussed above.

Rendering SPIN RDF to SPARQL can be implemented using the following code.

I have found a way to generate dynamic SPARQL queries in SPIN RDF using JSON-LD. JSON-LD can be represented by data structures, such as hash maps or arrays, that are available in most programming languages. This representation can be serialized to JSON that can be interpreted as RDF using the JSON-LD syntax. SPIN RDF can be in turn translated as SPARQL, obtaining our desired outcome. As may be apparent from this workflow, crossing that many syntaxes (Clojure, JSON-LD, RDF, SPIN, and SPARQL) requires large cognitive effort due to the mappings between the syntaxes one has to be aware of when authoring SPARQL in this way. Here is an implementation of this approach for the example query.

SPARQL algebra

As previously mentioned, a problematic part of SPIN RDF is its use of RDF collections for representing order. The documentation of Apache Jena recognizes this, saying that “RDF itself is often the most appropriate way to do this, but sometimes it isn't so convenient. An algebra expression is a tree, and order matters.” (source). The documentation talks about SPARQL algebra, which formalizes the low-level algebraic operators into which SPARQL is compiled. Instead of using RDF, Jena represents SPARQL algebra in s-expressions (SSE), which are commonly used in programming languages based on Lisp, such as Scheme. In fact, the “SSE syntax is almost valid Scheme” (source), but the SSE's documention acknowledges that Lisps “lacks convenient syntax for the RDF terms themselves” (source).

In order to see how our example query looks in SSE we can use Jena's command-line tools and invoke qparse --print=op --file query.rq to convert the query into SSE. The following is the result we get.

If SSEs were valid Clojure data structures, we could manipulate them as data and then serialize them to SPARQL. Nevertheless, there are minor differences between SSE and the syntax of Clojure. For example, while ?name and _:a are valid symbols in Clojure, absolute IRIs enclosed in angle brackets, such as <http://dbpedia.org/ontology/>, are not. Possibly, these differences can be remedied by using tagged literals for RDF terms.

Conclusions

I hope this post gave you a flavour of the various approaches for generating SPARQL. There is an apparent impedance mismatch between the current programming languages and SPARQL. While the programming languages operate with data structures and objects, SPARQL must be eventually produced as a string. This mismatch motivates the development of approaches for generating SPARQL, which is presented with many challenges, some of which I described in the post.

I assessed these approaches on the basis of how they fare on generating an example query using the data from DBpedia. The complete implementations of these approaches are available in this repository. Out of the approaches I reviewed, I found four in which it is feasible to generate the example SPARQL query without undue effort, which include:

  • String concatenation
  • Templating
  • Jena's query builder DSL
  • SPIN RDF using JSON-LD

My personal favourite that I use for generating SPARQL is templating with Mustache, which appears to mesh best with my brain and the tasks I do with SPARQL. Nonetheless, I am aware of the limitations of this approach and I am on a constant lookout for better solutions, possibly involving rendering SPARQL from data.

While I invested a fair amount of effort into this post, it is entirely possible I might have overlooked something or implemented any of the reviewed approaches in a sub-optimal way, so I would be glad to hear any suggestions on how to improve. In the meantime, while we search for the ideal solution for generating SPARQL, I think the membrane between code and SPARQL will remain only semipermeable.