2017-06-17

What I would like to see in SPARQL 1.2

SPARQL 1.1 is now 4 years old. While considered by many as a thing of timeless beauty, developers are a fickle folk, and so they already started coming up with wish lists for SPARQL 1.2, asking for things like variables in SPARQL property paths. Now, while some RDF stores still struggle with implementing SPARQL 1.1, others keep introducing new features. For example, Stardog is extending SPARQL solutions to include collections. In fact, we could invoke the empirical method and look at a corpus of SPARQL queries to extract the most used non-standard features, such as the Virtuoso's bif:datediff(). If these features end up getting used and adopted among the SPARQL engines, we might as well have them standardized. The little things that maintain backwards compatibility may be added via small accretive changes to make up the 1.2 version of SPARQL. On the contrary, breaking changes that would require rewriting SPARQL 1.1 queries should wait for SPARQL 2.0.

In the past few years SPARQL has been the primary language I code in, so I had the time to think about what I would like to see make its way into SPARQL 1.2. Here is a list of these things; ordered perhaps from the more concrete to the more nebulous ones.

  1. The REDUCED modifier is underspecified. Current SPARQL engines treat it "like a 'best effort' DISTINCT" (source). What constitutes the best effort differs between the implementations, which makes queries using REDUCED unportable (see here). Perhaps due to its unclear interpretation and limited portability queries using REDUCED are rare. Given its limited uptake, future versions of SPARQL may consider dropping it. Alternatively, REDUCED may be explicitly specified as eliminating consecutive duplicates.
  2. The GROUP_CONCAT aggregate is nondeterministic. Unlike in SQL, there is no way to prescribe the order of the concatenated group members, which would make the results of GROUP_CONCAT deterministic. While some SPARQL engines concatenate in the same order, some do not, so adding an explicit ordering would help. It would find its use in many queries, such as in hashing for data fusion. The separator argument of GROUP_CONCAT is optional, so that SPARQL 1.2 can simply add another optional argument for order by.
  3. SPARQL 1.2. can add a property path quantifier elt{n, m} to retrieve n to m hops large graph neighbourhood. This syntax, which mirrors the regular expression syntax for quantifiers, is already specified in SPARQL 1.1 Property Paths draft and several SPARQL engines implement it. While you can substitute this syntax by using repeated optional paths, it would provide a more compact notation for expressing the exact quantification, which is fairly common, for example in query expansion.
  4. (Edit: This is a SPARQL 1.1 implementation issue.) The functions uuid(), struuid(), and rand() should be evaluated once per binding, not once per result set. Nowaydays, some SPARQL engines return the same value for these functions in all query results. I think the behaviour of these functions is underspecified in SPARQL 1.1, which only says that “different numbers can be produced every time this function is invoked” (see the spec) and that “each call of UUID() returns a different UUID” (spec). Again, this specification leads to inconsistent behaviour in SPARQL engines, in some to much chagrin. Specifying that these functions must be evaluated per binding would help address many uses, such as minting unique identifiers in SPARQL Update operations.
  5. SPARQL 1.2 Graph Store HTTP Protocol could support quads, in serializations such as TriG or N-Quads. In such case, the graph query parameter could be omitted if the sent payload explicitly provides its named graph. I think that many tasks that do not fit within the bounds of triples would benefit from being able to work with quad-based multi-graph payloads.
  6. SPARQL 1.2 could adopt the date time arithmetic from XPath. For instance, in XPath the difference between dates returns an xsd:duration. Some SPARQL engines, such as Apache Jena, behave the same. However, in others you have to reach for extension functions, such as Virtuoso's bif:datediff(), or conjure up convoluted ways for seemingly simple things like determining the day of week. Native support for date time arithmetic would also make many interesting temporal queries feasible.
  7. (UPDATE: This is wrong, see here.) Perhaps a minor pet peeve of mine is the lack of support for float division. Division of integers in SPARQL 1.1 produces an integer. As a countermeasure, I always cast one of the operands of division to a decimal number; i.e. ?n/xsd:decimal(?m). I suspect there may be a performance benefit in maintaining the type of division's arguments, yet I also consider the usability of this operation. Nevertheless, changing the interpretation of division is not backwards compatible, so it may need to wait for SPARQL 2.0.
  8. The CONSTRUCT query form could support the GRAPH clause to produce quads. As described above, many tasks with RDF today revolve around quads and not only triples. Being able to construct data with named graphs would be helpful for such tasks.
  9. Would you like SPARQL to be able to group a cake and have it too? I would. Unfortunately, aggregates in SPARQL 1.1 produce scalar values, while many use cases call for structured collections. As mentioned above, Stardog is already extending SPARQL solutions to arrays and objects to support such use cases. Grassroots implementations are great, but having this standardized may help avoid the SPARQL engines to reinvent this particular wheel in many slightly incompatible ways.
  10. Finally, while SPARQL is typically written by people, many applications write SPARQL too. For them having an alternative syntax based on structured data instead of text would enable easier programmatic manipulation of SPARQL. Suggestions for improvement and a detailed analysis of the ways developers cope with the lack of data syntax for SPARQL can be found in my post on generating SPARQL.

No comments:

Post a Comment