2018-02-11

The story of my Ph.D.

The story of my Ph.D. is a story of bitter compromises. A story of compromising research for paid work. A story of abandoning promising research to be able to finish. Probably not that different from any other Ph.D.

Since memories are subjective and malleable, it may not be a completely accurate story. All departures from the true events are solely my fault. Moreover, I am not trying to convey any lessons learnt just yet. I just want to tell the story of my Ph.D. I guess it all makes some sense in retrospect, as if the future were a product of the past.

2010

While I could retrace ever so tenuous links way back, the story of my Ph.D. really began in November 2010 in Cologne, Germany. I was there for the Semantic Web in Bibliotheken (SWIB) conference as a co-author of one of the accepted papers. At the conference I met Sören Auer, who was there to promote the PUBLINK programme of the LOD2 project. As we talked over a coffee break, I suggested that the Czech National Library of Technology, where I worked at the time, could use the consulting from linked open data experts offered in the PUBLINK programme. We exchanged business cards and parted ways. It was the last time I had business cards. They were colourful and sloppy, manually cut from thin paper. Still, they helped me forge one of the most impactful connections of my Ph.D.

Weeks later I received an email from Sören, who concluded that we were qualified enough not to need help from the LOD2 project, and instead asked if we would consider joining the project as partners to represent “Eastern Europe”. Well, Czechs like to think they are part of “Central Europe”, but I gladly seized this opportunity.

2011

After some discussion it became clear that the National Library of Technology did not have the workforce required to join the LOD2 project. I turned to my next closest institution: the University of Economics in Prague, where I had already worked on the bibliographic linked data that paid for my November trip to Cologne. There was a small team there, led by Vojtěch Svátek, that had been involved in semantic web research for a number of years. To make this team stronger, we formed a strategic alliance with the group of Martin Nečaský from the Faculty of Mathematics and Physics at the Charles University in Prague, thereby transgressing the traditional organizational boundaries. This union later proved successful and lasted through many research projects we worked on together. It perhaps contributed to our affiliations blending, in the minds of our foreign project partners, into the nebulous concept of the “University of Prague”.

Having secured a team, we needed a challenge it could work on. I started writing down a proposal, which later turned into the part of the LOD2 project applying linked open data to run a distributed marketplace of public sector contracts. I based it on a suspicion that linked open data could serve as a better infrastructure for online markets. In such an infrastructure, I surmised, we could operate matchmakers to link relevant demands and offers.

No idea is truly novel, and this one was no different. Its key inspiration came from Michael Hausenblas, whom I met at the Linked Data Research Centre at DERI (now Insight Centre for Data Analytics) in Galway, Ireland, where I worked as an intern in 2010. Michael had similar thoughts earlier and came up with Call for Anything, a lightweight vocabulary for machine-readable descriptions of demands on the Web, and prototyped an application matching developers to businesses using the vocabulary. There already was a well-known vocabulary for describing offers on the Web: GoodRelations by Martin Hepp. Call for Anything and GoodRelations clicked into place and I exchanged emails with Michael and Martin, thinking through an example application of matchmaking, which informed what later became our contribution to the LOD2 project.

What we lacked was a market in which data on both supply and demand was available. Earlier in 2010, when forming the nascent Czech open data initiative, we had picked public contracts as a high-value dataset to screen-scrape and release as machine-readable linked open data. The public procurement market seemed to be a great setting for our work, since public contracts are demands explicitly represented as data in public procurement notices, thanks to the proactive disclosure mandated by law.

We thought these poorly formed ideas through, enveloped them in profound academese, and eventually submitted them as an extension proposal for the LOD2 project.

The proposal was successful and in September 2011 we joined the LOD2 project, getting us three years of funding. It would have been a perfect time for a Ph.D. if only I had already completed my master’s degree. I faced a dilemma later prominent in my Ph.D.: compromising paid work for educational progress. Moreover, back then I still worked part-time at the National Library of Technology. Nevertheless, I decided to fit everything in my limited waking hours and joined the University of Economics as an external researcher working on the LOD2 project.

2012

By the end of 2011 it was clear to me that splitting my time between work and education led to hardly any progress in either. As my former position was no longer tenable, in February 2012 I quit the National Library of Technology to focus on finishing my master’s thesis.

In the following months I started running into the limits of part-time contracts at the University of Economics. The only reasonable way for me to work more on the LOD2 project was to enroll in the university’s Ph.D. programme in applied computer science. I had no research ambitions at the beginning of my Ph.D. What made me apply was the practical concern of being able to continue work that I found interesting. I applied for the Ph.D. and successfully passed the admission exams in May 2012. When in June 2012 I graduated from the Charles University with a master’s degree in new media studies, I was set to pursue the Ph.D. However, even back then I was asking myself: Is there life after the Ph.D.?

I officially began my Ph.D. on September 20, 2012. I started it believing the widespread myth that a Ph.D. is the only opportunity in life to focus on a single thing and explore it in depth. I quickly realized how far detached from reality this myth is.

Besides researching your Ph.D. topic, many other duties compete for your attention. First and foremost, as a Ph.D. student I was required to teach. A cynical view has it that Ph.D. students are little more than a cheap resource to provision teaching. I was fortunate enough to be assigned courses at least tangentially related to my Ph.D., including labs in an XML course and several lectures and labs in a course on linked data. The less lucky ones ended up teaching things like basic Microsoft Office skills.

While teaching can be satisfying and meaningful at times, it also takes a huge amount of time to do right, especially when you start a new course. The effort spent on teaching has sporadic returns. Rarely do you hear any positive feedback, and given that one of the university’s primary goals is to produce as many graduates as possible, you often experience frustration with disinterested and unmotivated students who expect their graduation to be simply a matter of time. Under this impression, after a year, I decided to forfeit the Ph.D. stipend in order not to be required to teach.

Compared to teaching, other Ph.D. duties were relatively minor and infrequent. Once in a while I had to supervise bachelor’s or master’s theses and oversee admission exams of new students. I enjoyed the apprenticeship of supervising theses more than teaching, although few students invested more than required for a minimum viable thesis. Then there were academic duties that went without explicit acknowledgement, such as peer review, contributing to the bulk of unpaid labour that an obedient member of academia delivers.

The courses I was required to attend were largely irrelevant to the pursuit of my Ph.D. While I endured a course in IT management, I wondered why the courses on statistics or programming were left out of the curriculum. In retrospect, probably the most relevant was the introductory course on basic scientific methods, though it was definitely rudimentary.

Let’s talk money. My Ph.D. stipend amounted to 5400 CZK per month, which was 216 EUR, or 75.8 % of the minimum net wage in the Czech Republic at the time. Back then it was roughly what you would pay for renting a room in a shared apartment in Prague. Since the stipend could not cover the cost of living, I had to find other sources of income, the most notable being research projects, which typically involved uncertain part-time and fixed-term work. I was decidedly a part of the Ph.D. precariat, always compromising my research for paid work.

2013

The habit of following interesting work led me astray from my Ph.D. from time to time. For instance, between January 2013 and May 2014 I followed an opportunity to work with friends from new media studies at the Charles University on a project using semantic web technologies for the long tail of the job market. Also in January 2013 my Ph.D. made a minor impact outside of academia: I was invited as a (charmingly called) “ad-hoc expert” to the European Commission’s Public Sector Information group, where I talked on the dire present and the bright futures of public procurement data.

Contrary to my expectations, my actual contributions to the LOD2 project were rarely related to my Ph.D. More often than not I ended up doing the grunt work of data preparation or was swamped in project admin and the ever-present “dissemination”. The LOD2 project also allowed me to take the inverse of the role I had asked for in 2010 at SWIB: this time it was I who served as a linked open data expert in the PUBLINK programme, consulting for the National Library of Israel. As a result, when the LOD2 project successfully concluded in September 2014, most of my Ph.D. work was still left to be done.

2014

When the LOD2 project ended, my future funding was unclear. By that time our proposal for a follow-up Horizon 2020 project called OpenBudgets.eu had been rejected.

I used the gap in funding to do a Ph.D. internship at Politecnico di Bari, Italy, joining the research group of Tommaso di Noia between October and December 2014. It was an easy choice. When I surveyed the research literature on matchmaking (the topic of my Ph.D. thesis), I found many links pointing to Bari. In a fortunate turn of affairs, I managed to obtain my university’s internal funding just in time for this internship. Working through a tight series of deadlines I completed my last required Ph.D. courses and re-enrolled as a full-time student in order to be eligible for the internship stipend. This internship was in fact the only period when I could be entirely dedicated to my Ph.D. It was essential in building the fundamental parts of what later became my thesis. I can heartily recommend going abroad for a few months to do such an internship.

2015

Immediately after my Bari gig, in January 2015, I followed with a one-month internship at the University of Göttingen, Germany, working with library data on old prints. Here again, I returned to my roots in libraries. Also, I received decent funding that sorted out my financial situation for another month and filled in some gaps from the previous period that the university’s stipend had failed to cover.

Since the research project funding at the University of Economics had dried up, I arranged a part-time job for EEA from February to September 2015, working on the COMSODE project. There, I assumed the role of data janitor, tirelessly ETL-ing many government datasets. It gave me a novel perspective on the well-known setting of EU research projects. Working for a commercial project partner meant two things improved significantly: management and funding.

A peculiar turn of events took place in spring 2015. While it had previously fallen short, the OpenBudgets.eu project was eventually funded and we were expected to start working on it as soon as possible, regardless of any plans we had made in the meantime. I reluctantly accepted a part-time involvement in the project, starting in May 2015. With mixed feelings, I asked for a break from my Ph.D., lasting till September 2015, when my contract with EEA ended. Due to the workload I imposed on myself, I was simply unable to fit the Ph.D. in.

In order to maintain my sanity during my long Ph.D. journey I occasionally worked on things whimsical. One of these “extra-curricular” efforts was DB-quiz, a Wikipedia-based knowledge game imitating a well-known Czech TV show. I found these activities fulfilling, perhaps because they helped my Ph.D. make sense by contrast with clear nonsense. Obviously, I could not settle for anything halfway, so I followed the joke through to the very end and turned DB-quiz into an academic paper, which later won the prize for the best Ph.D. publication at the University of Economics.

2016

At the end of 2015 I found myself with barely any progress on my Ph.D. It started to dawn on me that, if I was to finish at all, I needed to live off my savings for a while instead of constantly hunting piecemeal income. Consequently, in 2016 I started carefully reducing paid work to make room for research. Since then, my savings followed a decidedly declining slope.

I dedicated most of January to preparation for the doctoral state exam required after 3 years of the Ph.D. The next month I passed the exam, albeit with a barely satisfactory performance, and ticked off another Ph.D. duty: submitting a paper to the university’s Ph.D. symposium. With these tasks out of the way there was only one thing left to do: my thesis.

September 2016 marked the start of the final year-long grind on my thesis. In the fall of 2016 I thoroughly redid the entire data preparation, meticulously documenting its every step and improving my crude data processing tools along the way.

In December 2016, while entirely immersed in the ETL of public procurement data, I realized that I had forgotten about the deadline for the preliminary thesis defense. By the end of the fourth year every Ph.D. student at my university is obliged to defend an 80%-ready thesis. I started hastily piecing together my notes and former publications to meet my deadline coming up in February.

2017

My thesis had a meagre 60 pages when I submitted it for the preliminary defense. Yet I managed to conditionally pass the defense thanks to otherwise outstanding results, judged by the modest standards of my university. The stipulated condition was that the thesis would be reviewed once more before the final defense.

Amidst the continued dwindling of my savings, I was running out of options. I was piecing together my income by context-switching between several part-time projects at my university. I needed another financial boost, one final kick before I was done with the Ph.D. The strategic alliance with Martin Nečaský came in handy again. Via this link I became a part-time open data expert at the Ministry of the Interior of the Czech Republic, working on linked open data in statistics. During the summer of 2017 I split my time between this job, my thesis, and OpenBudgets.eu.

The problem with a Ph.D. is that it grows without bounds. There is no way of telling that a Ph.D. is done, except for the (arbitrary and untimely) end you set for yourself. When by the end of June I signed up as a full-time (linked) data engineer in the pharmaceutical industry, starting in October, I knew I had exactly 3 months left to finish my thesis. After that point, I assumed, there would simply be no time to work on the thesis anymore.

With a self-imposed deadline in sight I accepted the need for a bitter compromise. Compared with my original ambitions, I left out many interesting experiments I had wanted to try. By the end of September I was mostly done, given my reduced scope of work. Consequently, I passed the additional thesis pre-defense with no problems, giving me a green light to submit the final version of my thesis.

Unfortunately, despite my careful planning I did not manage to hand in my thesis before starting a full-time job. Due to an illness, the thesis writing spilled over by some weeks into October, with me spending evenings on final editing. I submitted my Ph.D. thesis on October 18. I was done. In relief, I tweeted:

Submitting your Ph.D. thesis feels like a large open wound you’ve been bleeding from for years finally started healing. A nice feeling.

Apart from turning in the thesis itself, there were several other supplements I had to provide, the most puzzling one being a 20-page summary of the thesis. Frankly, who reads a 20-page summary? I thought people read either the abstract or the whole damn thing. Reservations aside, I bit the bullet once again and played by the rules.

THE END

On January 25, 2018, I passed my Ph.D. viva, with all committee votes unequivocally supporting my graduation. Finishing the Ph.D. was, first and foremost, a testament to my stubbornness, not to my research prowess. It took me 5 years, 4 months, and 5 days. It takes a lot of patience and grit to persist that long.

People are ridiculously bad at answering “what if” questions. Nevertheless, I believe I would not have sat idly had I not enrolled in the Ph.D. I believe I would have been doing something just as interesting. Hence, my overall evaluation of the Ph.D. is exactly neutral.

However, I could not disregard the Ph.D.’s negative externalities. The Ph.D. levied a toll on my relationships with others. Oftentimes I grew cold, moody, and unresponsive, as I was churning through the flexible working hours for a precarious income. I definitely was not the cheeriest lad around.

While I explicitly avoid any lessons for others here, there were lessons I learnt. I have grown ever so cynical. I have learnt not to care much for deadlines, adopting the attitude of Douglas Adams: “I love deadlines. I like the whooshing sound they make as they fly by.” Finally, I have learnt to understand that “adversity and existence are one and the same.”

2017-12-26

Copy-pasting the history of public procurement

They say that “those who do not learn history are doomed to repeat it.” However, those who machine-learn from history are also doomed to repeat it. Machine learning is in many ways copy-pasting history. A key challenge is thus to learn from history without copying its biases.

For the first time in history we have open data on past experiences in making public contracts. Contracting authorities and bidders have a chance to learn from these experiences to improve the efficiency of agreeing on future contracts by matching contracts to the most suitable bidders. I wrote my Ph.D. thesis on this subject.

The thing is that we know little when learning from awarded contracts. What we have is basically this: similar contracts are usually awarded to these bidders. This has two principal flaws. The first is tenuous similarity. Similarity between contracts hinges on how informative their descriptions are. However, the features comprising a contract description may be incomplete or uninformative. Civil servants administering contracts may either intentionally or inadvertently withhold key information from the contract description or fill it with information that ultimately amounts to noise.
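The naive signal of “similar contracts are usually awarded to these bidders” can be sketched as a SPARQL query. The vocabulary below (:Contract, :cpv, :awardedTo, :contract-at-hand) is purely hypothetical, invented for illustration; similarity is crudely reduced to sharing a CPV code:

```sparql
PREFIX : <http://example.com/procurement/>

# Rank bidders by how many contracts they won that share
# a CPV code with the contract at hand. All terms in the
# : namespace are hypothetical illustration only.
SELECT ?bidder (COUNT(?contract) AS ?awards)
WHERE {
  :contract-at-hand :cpv ?cpv .
  ?contract a :Contract ;
    :cpv ?cpv ;
    :awardedTo ?bidder .
}
GROUP BY ?bidder
ORDER BY DESC(?awards)
```

The two flaws discussed in this section are both visible here: the ranking is only as good as the CPV codes civil servants filled in, and it rewards bidders merely for having won before.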

The second, perhaps more detrimental, flaw is that the associations between bidders and their awarded contracts need not be positive. If we learn from contracts awarded in the past, we assume that the awarded bidders are the best (or simply good enough) matches for the contracts. This assumption is fundamentally problematic. The winning bidder may be awarded on the basis of adverse selection, which favours not the public but the public decision maker. There are many ways in which the conditions of adverse selection can arise. Relevant information required for informed decision-making can be unevenly distributed between the participants of the public procurement market. Rival bidders can agree to cooperate for mutual benefit and limit open competition, for example by submitting overpriced fake bids to make real bids more appealing in comparison. When a bidder repeatedly wins contracts from the same contracting authority, it need not be a sign of exceptional prowess, but instead a sign of clientelism or institutional inertia. In most public procurement datasets the limitations of award data are further pronounced by missing data on unsuccessful bidders. In this light, the validity of contract awards recommended by machine learning from data with these flaws can be easily contested by unsuccessful bidders who perceive themselves to be subject to discrimination. Consequently, blindly mimicking human decisions in matching bidders to contracts may not be the best course of action.

Some suspect we can solve this by learning from more kinds of data. Besides award data, we can learn from post-award data, including evaluations of bidders’ performance. However, post-award data is hardly ever available. Without it, the record of a contract is incomplete. Moreover, when learning from data for decision support, we combine the conflicting incentives of decision arbiters and data collectors, who in public procurement are often the same people. For instance, a winning bidder may receive a favourable evaluation despite its meagre delivery on the awarded contracts, thanks to its above-standard relationship with the civil servant administering the contracts.

In Logics and practices of transparency and opacity in real-world applications of public sector machine learning Michael Veale suggests that we need to carefully select what data gets disclosed to combine “transparency for better decision-making and opacity to mitigate internal and external gaming.” We need to balance publishing open data to achieve transparency with the use of opacity to avoid gaming the rules of public procurement. Apart from data selection, pre-processing training data for decision support is of key importance. In our data-driven culture, engineering data may displace engineering software, as Andrej Karpathy argues in Software 2.0. Careful data pre-processing may help avoid overfitting to biases that contaminate the public procurement data.

Neither technology nor history is neutral. In fact, history always favours the winners. Winners of public contracts are thus always seen in a favourable light in public procurement data. Moreover, data is not an objective record of the history of public procurement. Therefore, we cannot restrict ourselves to copy-pasting from the history of public procurement by a mechanical application of machine learning oblivious to the limitations of its training data. On this note, in the context of the criminal justice system, the Raw data podcast asks: “Given that algorithms are trained on historical data – data that holds the weight of hundreds of years of racial discrimination – how do we evaluate the fairness and efficacy of these tools?” Nevertheless, how to escape this weight of history in machine learning is not yet well explored.

2017-06-17

What I would like to see in SPARQL 1.2

SPARQL 1.1 is now 4 years old. While considered by many a thing of timeless beauty, developers are a fickle folk, and so they have already started coming up with wish lists for SPARQL 1.2, asking for things like variables in SPARQL property paths. Now, while some RDF stores still struggle with implementing SPARQL 1.1, others keep introducing new features. For example, Stardog is extending SPARQL solutions to include collections. In fact, we could invoke the empirical method and look at a corpus of SPARQL queries to extract the most used non-standard features, such as Virtuoso's bif:datediff(). If these features end up getting used and adopted among SPARQL engines, we might as well have them standardized. Little things that maintain backwards compatibility may be added via small accretive changes to make up the 1.2 version of SPARQL. In contrast, breaking changes that would require rewriting SPARQL 1.1 queries should wait for SPARQL 2.0.

In the past few years SPARQL has been the primary language I code in, so I have had time to think about what I would like to see make its way into SPARQL 1.2. Here is a list of these things, ordered roughly from the more concrete to the more nebulous ones.

  1. The REDUCED modifier is underspecified. Current SPARQL engines treat it "like a 'best effort' DISTINCT" (source). What constitutes the best effort differs between the implementations, which makes queries using REDUCED unportable (see here). Perhaps due to its unclear interpretation and limited portability queries using REDUCED are rare. Given its limited uptake, future versions of SPARQL may consider dropping it. Alternatively, REDUCED may be explicitly specified as eliminating consecutive duplicates.
  2. The GROUP_CONCAT aggregate is nondeterministic. Unlike in SQL, there is no way to prescribe the order of the concatenated group members, which would make the results of GROUP_CONCAT deterministic. While some SPARQL engines concatenate in the same order, others do not, so adding an explicit ordering would help. It would find its use in many queries, such as in hashing for data fusion. The separator argument of GROUP_CONCAT is already optional, so SPARQL 1.2 could simply add another optional argument for ordering.
  3. SPARQL 1.2 can add a property path quantifier elt{n, m} to retrieve an n-to-m-hop graph neighbourhood. This syntax, which mirrors the regular expression syntax for quantifiers, is already specified in the SPARQL 1.1 Property Paths draft and several SPARQL engines implement it. While you can approximate this syntax using repeated optional paths, it would provide a more compact notation for expressing exact quantification, which is fairly common, for example in query expansion.
  4. (Edit: This is a SPARQL 1.1 implementation issue.) The functions uuid(), struuid(), and rand() should be evaluated once per binding, not once per result set. Nowadays, some SPARQL engines return the same value for these functions in all query results. I think the behaviour of these functions is underspecified in SPARQL 1.1, which only says that “different numbers can be produced every time this function is invoked” (see the spec) and that “each call of UUID() returns a different UUID” (spec). Again, this underspecification leads to inconsistent behaviour in SPARQL engines, in some cases to much chagrin. Specifying that these functions must be evaluated per binding would address many uses, such as minting unique identifiers in SPARQL Update operations.
  5. SPARQL 1.2 Graph Store HTTP Protocol could support quads, in serializations such as TriG or N-Quads. In such case, the graph query parameter could be omitted if the sent payload explicitly provides its named graph. I think that many tasks that do not fit within the bounds of triples would benefit from being able to work with quad-based multi-graph payloads.
  6. SPARQL 1.2 could adopt the date time arithmetic from XPath. For instance, in XPath the difference between dates returns an xsd:duration. Some SPARQL engines, such as Apache Jena, behave the same. However, in others you have to reach for extension functions, such as Virtuoso's bif:datediff(), or conjure up convoluted ways for seemingly simple things like determining the day of week. Native support for date time arithmetic would also make many interesting temporal queries feasible.
  7. Perhaps a minor pet peeve of mine is the lack of support for float division. Division of integers in SPARQL 1.1 produces an integer. As a countermeasure, I always cast one of the operands of division to a decimal number; i.e. ?n/xsd:decimal(?m). I suspect there may be a performance benefit in maintaining the type of division's arguments, yet I also consider the usability of this operation. Nevertheless, changing the interpretation of division is not backwards compatible, so it may need to wait for SPARQL 2.0.
  8. The CONSTRUCT query form could support the GRAPH clause to produce quads. As described above, many tasks with RDF today revolve around quads and not only triples. Being able to construct data with named graphs would be helpful for such tasks.
  9. Would you like SPARQL to be able to group a cake and have it too? I would. Unfortunately, aggregates in SPARQL 1.1 produce scalar values, while many use cases call for structured collections. As mentioned above, Stardog is already extending SPARQL solutions to arrays and objects to support such use cases. Grassroots implementations are great, but having this standardized may help SPARQL engines avoid reinventing this particular wheel in many slightly incompatible ways.
  10. Finally, while SPARQL is typically written by people, many applications write SPARQL too. For them having an alternative syntax based on structured data instead of text would enable easier programmatic manipulation of SPARQL. Suggestions for improvement and a detailed analysis of the ways developers cope with the lack of data syntax for SPARQL can be found in my post on generating SPARQL.
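To make a few of these wishes concrete, here are some sketches. The first one uses invented syntax for the hypothetical GROUP_CONCAT ordering argument, so no engine accepts it today; the second is the draft quantifier syntax; the last is the SPARQL 1.1 workaround for float division. All terms like :linksTo and :stats are hypothetical:

```sparql
PREFIX :     <http://example.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# (2) Hypothetical ordering argument making GROUP_CONCAT deterministic;
#     the "order by" argument is invented, not SPARQL 1.1:
SELECT ?s (GROUP_CONCAT(?label; separator = ", "; order by = ASC(?label)) AS ?labels)
WHERE { ?s rdfs:label ?label . }
GROUP BY ?s

# (3) Exact path quantification from the SPARQL 1.1 Property Paths draft,
#     retrieving the 1-to-3-hop neighbourhood of :start:
SELECT ?neighbour
WHERE { :start :linksTo{1,3} ?neighbour . }

# (7) Float division in SPARQL 1.1 today: cast one operand to xsd:decimal
#     so that integer division doesn't truncate the result:
SELECT ((xsd:decimal(?n) / ?m) AS ?ratio)
WHERE { :stats :numerator ?n ; :denominator ?m . }
```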

2017-02-05

Publishing temporal RDF data as create/delete event streams

Today I wondered about publishing temporal data as create/delete event streams. The events can be replayed by a database to produce state valid at a desired time. Use cases for this functionality can be found everywhere. For example, I've been recently working with open data from company registers. When companies relocate, their registered addresses change. It is thus useful to be able to trace back in history to find out what a company's address was at a particular date. When a company register provides only the current state of its data, tracing back to previous addresses is not possible.

Event streams are not a new idea. They correspond to the well-known event sourcing pattern. This pattern has been used in RDF too (for example, in shodan by Michael Hausenblas), although it is definitely not widespread. In 2013, I wrote an article about capturing the temporal dimension of linked data. I think most of it holds true to date. In particular, there is still no established way of capturing temporal data in RDF. Event streams might be an option to consider.

Many answers to questions about temporal data in RDF are present in RDF 1.1 Concepts, one of the fundamental documents on RDF. For a start, "the RDF data model is atemporal: RDF graphs are static snapshots of information." Nevertheless, "a snapshot of the state can be expressed as an RDF graph," which lends itself as a crude way of representing temporal data through time-indexed snapshots encoded as RDF graphs. There is also the option of reifying everything into observations, which is what statisticians and the Data Cube Vocabulary do. Alternatively, we can reify everything that happens into events.

Events

Events describe actions on data in a dataset. You can also think about them as database transactions, along with the associated properties. RDF triples are immutable, since "an RDF statement cannot be changed – it can only be added and removed" (source). This is why we need only two types of events:

  1. :Create (addition)
  2. :Delete (retraction)

Events can be represented as named graphs. Each event contains its type, which can be either :Create or :Delete, and timestamps, in particular the valid time. Valid time tells us when the event's data is valid in the modelled world. The dcterms:valid property seems good enough to specify the valid time. Events may additionally describe other metadata, such as provenance. For example, dcterms:creator may link the person who created the event data. Encapsulating an event's metadata in its named graph makes it self-contained, but it mixes operational data with data about the described domain, so an alternative worth considering is to store the metadata in a separate graph.

The following example event stream describes that Alice was a member of ACME since 2000, that in 2001 she left to work for the Umbrella Corp, and that she returned to ACME in 2003. The example is serialized in TriG, which can describe quads with named graphs instead of mere triples. You can use this example to test the queries discussed further on.

@prefix :         <http://example.com/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix org:      <http://www.w3.org/ns/org#> .
@prefix schema:   <http://schema.org/> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

:event-1 {
  :event-1 a :Create ;
    dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME ;
    schema:name "Alice" .
}

:event-2 {
  :event-2 a :Delete ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

:event-3 {
  :event-3 a :Create ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-4 {
  :event-4 a :Delete ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-5 {
  :event-5 a :Create ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

Limitations

Describing event streams in the aforementioned way has some limitations. One apparent issue is the volume of data needed to encode seemingly simple facts. There are several ways to deal with this. Under the hood, RDF stores may implement structural sharing, as in persistent data structures, to avoid duplicating substructures present across multiple events. We can also make several assumptions that save space. :Create can be made the default event type, so that it doesn't need to be provided explicitly. In some limited cases, we can assume that the valid time is the same as the transaction time. For example, in some countries, public contracts become valid only after they are published.

Another limitation of this approach is that it doesn't support blank nodes. You have to know the IRIs of the resources you want to describe.

Since named graphs are claimed for events, they cannot be used to distinguish datasets, as they typically are. Datasets need to be distinguished as RDF datasets. Having multiple datasets may hence mean having multiple SPARQL endpoints. Cross-dataset queries then have to be federated, or alternatively, current snapshots of the queried datasets can be loaded into a single RDF store as named graphs.

Integrity constraints

To illustrate properties of the proposed event representation, we can define integrity constraints that the event data must satisfy.

The union of the delete graphs must be a subset of the union of the create graphs: you cannot delete non-existent data. The following ASK query must return false:

PREFIX : <http://example.com/>

ASK WHERE {
  GRAPH ?delete {
    ?delete a :Delete .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?create {
      ?create a :Create .
      ?s ?p ?o .
    }
  }
}

Each event graph must contain its type. The following ASK query must return true for each event:

ASK WHERE {
  GRAPH ?g {
    ?g a [] .
  }
}

The event type can be either :Create or :Delete. The following ASK query must return true for each event:

PREFIX : <http://example.com/>

ASK WHERE {
  VALUES ?type {
    :Create
    :Delete
  }
  GRAPH ?g {
    ?g a ?type .
  }
}

Events cannot have multiple types. The following ASK query must return false:

ASK WHERE {
  {
    SELECT ?g
    WHERE {
      GRAPH ?g {
        ?g a ?type .
      }
    }
    GROUP BY ?g
    HAVING (COUNT(?type) > 1)
  }
}

Querying

Querying over event streams is naturally more difficult than querying reconciled dataset snapshots. Nevertheless, the complexity of the queries may be hidden behind a proxy offering a more convenient syntax that extends SPARQL. An easy way to try the following queries is to use Apache Jena's Fuseki with an in-memory dataset loaded from the example event stream above: ./fuseki-server --file data.trig --update /ds.

Queries over the default graph, defined as the union of all graphs, query what has been true at some point in time:

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  # Fuseki uses <urn:x-arq:UnionGraph> to denote the union graph,
  # unless tdb:unionDefaultGraph is set to true.
  # (https://jena.apache.org/documentation/tdb/assembler.html#union-default-graph)
  GRAPH <urn:x-arq:UnionGraph> {
    ?s ?p ?o .
  }
}

The currently valid data consists of the triples in :Create graphs that do not appear in any subsequent :Delete graph:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
            dcterms:valid ?createValid .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?delete {
      ?delete a :Delete ;
              dcterms:valid ?deleteValid .
      FILTER (?deleteValid > ?createValid)
      ?s ?p ?o .
    }
  }
}

We can also roll back and query the data at a particular moment in time. This functionality is what Datomic provides as the asOf filter. For instance, the data valid on January 1, 2001 at 9:00 is the union of the :Create events preceding this instant minus the :Delete events that followed them until the chosen time:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
      dcterms:valid ?validCreate .
    FILTER (?validCreate < "2001-01-01T09:00:00Z"^^xsd:dateTime)
    ?s ?p ?o .
  }
  MINUS {
    GRAPH ?delete {
      ?delete a :Delete ;
        dcterms:valid ?validDelete .
      FILTER (?validDelete < "2001-01-01T09:00:00Z"^^xsd:dateTime)
      ?s ?p ?o .
    }
  }
}

Event resolution proxy

Manipulating event streams that follow the proposed representation can be simplified by an event resolution proxy. This proxy may be based on the SPARQL 1.1 Graph Store HTTP Protocol, which provides a standard way to work with named graphs. However, the Graph Store Protocol doesn't support quad-based RDF formats, so the proxy needs to partition multi-graph payloads into several transactions.

The proxy can provide several conveniences. It can prune event payloads by removing retractions of non-existent triples or additions of existing triples, or by dropping entire events found to be redundant. It can automatically add the transaction time, for example by using BIND (now() AS ?transactionTime) in SPARQL. Simplifying even further, the proxy can automatically mint event identifiers as URNs produced by the uuid() function in SPARQL. No event metadata can be provided explicitly in that case, although some metadata may be created automatically. The event type can be inferred from the HTTP method the proxy receives: HTTP PUT may correspond to the :Create type, while HTTP DELETE indicates the :Delete type. Additionally, the proxy can assume that the valid time is the same as the transaction time.
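Putting these conveniences together, the update such a proxy might issue for an incoming payload could look like the following sketch (the payload triple is taken from the earlier example; using dcterms:created for the transaction time is an assumption):

```sparql
PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX org:     <http://www.w3.org/ns/org#>

INSERT {
  GRAPH ?event {
    ?event a :Create ;
      dcterms:created ?now ;
      # Valid time assumed equal to transaction time.
      dcterms:valid ?now .
    # The payload received by the proxy.
    :Alice org:memberOf :ACME .
  }
}
WHERE {
  # Mint a URN event identifier and record the transaction time.
  BIND (UUID() AS ?event)
  BIND (NOW() AS ?now)
}
```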

Publishing

Create/delete event streams can be split into batches by time intervals, which suggests several ways of publishing such data. An event stream should be published as an append-only quad dump that is updated as new events arrive. Additionally, there may be quad dumps of events from shorter periods, such as a day or a month, to enable more responsive data syndication. The currently valid dataset may be materialized and published as a periodically updated dump. Instead of updating the current dataset in place, it may be published as snapshots. A snapshot from a given date can be used as a basis when replaying events, so that you don't have to replay the whole history of events, only those that came after the snapshot. Any quad-based RDF serialization, such as TriG or N-Quads, will do for the dumps. Finally, in the absence of structural sharing, the dumps should be compressed to avoid the size bloat caused by duplication of shared data structures.

The next challenge is to refine and test this idea. We can also wrap event streams in a convenient abstraction that reduces the cognitive effort of manipulating them. I think this is something developers of RDF stores can consider including in their products.

2016-10-06

Basic fusion of RDF data in SPARQL

A need to fuse data often arises when you combine multiple datasets. The combined datasets may contain descriptions of the same things under different identifiers. If possible, the descriptions of the same thing should be fused into one to simplify querying over the combined data. However, the problem of different co-referent identifiers also appears frequently within a single dataset. If a thing does not have an identifier, it must be referred to by its description. Likewise, if the dataset's format does not support using identifiers as links, things must also be referred to by their descriptions. For example, a company referenced from several public contracts as their supplier may have a registered legal entity number, yet its description is duplicated in each awarded contract instead of the company being linked by its identifier, due to the limitations of the format storing the data, such as CSV.

Fusing descriptions of things is a recurrent task both in the integration of multiple RDF datasets and in transformations of non-RDF data to RDF. Since fusion of RDF data can be complex, there are dedicated data fusion tools, such as Sieve or LD-FusionTool, that can help formulate and execute intricate fusion policies. However, in this post I will deal with basic fusion of RDF data using the humble SPARQL 1.1 Update, which is readily available in most RDF stores and in many ETL tools for processing RDF data, such as LinkedPipes-ETL. Moreover, basic data fusion is widely applicable in many scenarios, which is why I wanted to share several simple ways to approach it.

Content-based addressing

In the absence of an external identifier, a thing can be identified with a blank node in RDF. Since blank nodes are local identifiers and no two blank nodes are the same, using them can eventually lead to proliferation of aliases for equivalent things. One practice that ameliorates this issue is content-based addressing. Instead of identifying a thing with an arbitrary name, such as a blank node, its name is derived from its description; usually by applying a hash function. This turns the “Web of Names” into the Web of Hashes. Using hash-based IRIs for naming things in RDF completely sidesteps having to fuse aliases with the same description. This is how you can rewrite blank nodes to hash-based IRIs in SPARQL Update and thus merge duplicate data:
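The original update is not reproduced here, but a minimal sketch of the idea might look as follows. The target namespace is an assumption, the sketch assumes the blank nodes' descriptions do not themselves contain blank nodes, and the ordering of GROUP_CONCAT is not guaranteed, a caveat discussed later in this post.

```sparql
DELETE {
  ?node ?p ?o .
  ?s ?q ?node .
}
INSERT {
  ?iri ?p ?o .
  ?s ?q ?iri .
}
WHERE {
  {
    # Derive an IRI from the hash of each blank node's concatenated description.
    SELECT ?node (IRI(CONCAT("http://example.com/resource/",
                             SHA256(GROUP_CONCAT(CONCAT(STR(?p1), STR(?o1));
                                                 separator = "|")))) AS ?iri)
    WHERE {
      ?node ?p1 ?o1 .
      FILTER isBlank(?node)
    }
    GROUP BY ?node
  }
  ?node ?p ?o .
  # Rewrite incoming links to the blank node as well, if there are any.
  OPTIONAL {
    ?s ?q ?node .
  }
}
```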

In practice, you may want to restrict the renamed resources to those that feature some minimal description that makes them distinguishable. Instead of selecting all blank nodes, you can select those that match a specific graph pattern. This way, you can avoid merging underspecified resources. For example, the following two addresses, for which we only know that they are located in the Czech Republic, are unlikely to be the same:


@prefix : <http://schema.org/> .

[ a :PostalAddress ;
  :addressCountry "CZ" ] .

[ a :PostalAddress ;
  :addressCountry "CZ" ] .

More restrictive graph patterns also work to your advantage in case of larger datasets. By lowering the complexity of your SPARQL updates, they reduce the chance of you running into out of memory errors or timeouts.

Hash-based fusion

Hashes can be used as keys not only for blank nodes. Using SPARQL, we can select resources satisfying a given graph pattern and fuse them based on their hashed descriptions. Let's have a tiny sample dataset that features duplicate resources:
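The original sample is not reproduced here; a hypothetical dataset consistent with the discussion below (the :name property and all values are made up) could be:

```turtle
@prefix : <http://example.com/> .

# :r1, :r2, and :r5 share the exact same description.
:r1 a :C ;
  :name "Alpha" .

:r2 a :C ;
  :name "Alpha" .

# :r3 instantiates a different class.
:r3 a :D ;
  :name "Alpha" .

# :r4 differs in a value.
:r4 a :C ;
  :name "Beta" .

:r5 a :C ;
  :name "Alpha" .
```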

If you want to merge equivalent resources instantiating class :C (i.e. :r1, :r2, and :r5), you can do it using the following SPARQL update:
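The original update is not shown here; a sketch under the assumption of a hypothetical target namespace could be: hash each matching resource's description into an IRI and rewrite the resource to it, so that resources with identical descriptions collapse into one.

```sparql
PREFIX : <http://example.com/>

DELETE {
  ?resource ?p ?o .
  ?s ?q ?resource .
}
INSERT {
  ?iri ?p ?o .
  ?s ?q ?iri .
}
WHERE {
  {
    # Resources with identical descriptions hash to the same IRI and thus merge.
    SELECT ?resource (IRI(CONCAT("http://example.com/resource/",
                                 SHA256(GROUP_CONCAT(CONCAT(STR(?p1), STR(?o1));
                                                     separator = "|")))) AS ?iri)
    WHERE {
      ?resource a :C ;
        ?p1 ?o1 .
    }
    GROUP BY ?resource
  }
  ?resource ?p ?o .
  OPTIONAL {
    ?s ?q ?resource .
  }
}
```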

The downside of this method is that the order of bindings in GROUP_CONCAT cannot be set explicitly, nor is it guaranteed to be deterministic. In theory, you may get different concatenations for the same set of bindings. In practice, RDF stores typically concatenate bindings in the same order, which makes this method work.

Fusing subset descriptions

If we fuse resources by hashes of their descriptions, only those with the exact same descriptions are fused. Resources that differ in a value or are described with different properties will not get fused together, because they will have distinct hashes. Nevertheless, we may want to fuse resources with a resource that is described by a superset of their descriptions. For example, we may want to merge the following blank nodes, since the description of the first one is a subset of the second one's description:


@prefix : <http://schema.org/> .

[ a :Organization ;
  :name "ACME Inc." ] .

[ a :Organization ;
  :name "ACME Inc." ;
  :description "The best company in the world."@en ] .

Resources with subset descriptions can be fused in SPARQL Update using double negation:
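The original update is not shown here; the double negation can be sketched as follows, restricted (as an illustrative assumption) to instances of :Organization. A resource is rewritten to any other resource whose description contains every triple of its own; resources with identical descriptions satisfy the condition in both directions, so a deterministic tie-break would be needed for them.

```sparql
PREFIX : <http://schema.org/>

DELETE {
  ?subset ?p ?o .
  ?s ?q ?subset .
}
INSERT {
  ?superset ?p ?o .
  ?s ?q ?superset .
}
WHERE {
  ?subset a :Organization .
  ?superset a :Organization .
  FILTER (?subset != ?superset)
  # Double negation: there is no triple describing ?subset
  # that does not also describe ?superset.
  FILTER NOT EXISTS {
    ?subset ?p1 ?o1 .
    FILTER NOT EXISTS {
      ?superset ?p1 ?o1 .
    }
  }
  ?subset ?p ?o .
  OPTIONAL {
    ?s ?q ?subset .
  }
}
```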

The above-mentioned caveats apply in this case too, so you can use a more specific graph pattern to avoid merging underspecified resources. The update may execute several rewrites until reaching the largest superset, which makes it inefficient and slow.

Key-based fusion

If you want to fuse resources with unequal descriptions that are not all subsets of a single resource's description, you must define a key that identifies the resources to fuse. Keys can be simple, represented by a single inverse functional property, or compound, represented by a combination of properties. For instance, it may be reasonable to fuse the following resources on the basis of shared values of the properties rdf:type and :name:


@prefix :    <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1960-01-01"^^xsd:date ;
  :email "contact@acme.com" ;
  :description "The worst company in the world."@en ] .

[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1963-01-01"^^xsd:date ;
  :description "The best company in the world."@en ] .

To fuse resources by key, we group them by the key properties, select one of them, and rewrite the others to the selected one:
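A sketch of such an update, keyed on rdf:type and :name: SAMPLE picks an arbitrary representative per key, and the remaining resources in the group are rewritten to it.

```sparql
PREFIX : <http://schema.org/>

DELETE {
  ?duplicate ?p ?o .
  ?s ?q ?duplicate .
}
INSERT {
  ?resource ?p ?o .
  ?s ?q ?resource .
}
WHERE {
  {
    # Select one representative resource per (type, name) key.
    SELECT ?type ?name (SAMPLE(?r) AS ?resource)
    WHERE {
      ?r a ?type ;
        :name ?name .
    }
    GROUP BY ?type ?name
  }
  ?duplicate a ?type ;
    :name ?name .
  FILTER (?duplicate != ?resource)
  ?duplicate ?p ?o .
  OPTIONAL {
    ?s ?q ?duplicate .
  }
}
```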

If we fuse the resources in the example above, we can get the following:


@prefix :    <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1960-01-01"^^xsd:date, "1963-01-01"^^xsd:date ;
  :email "contact@acme.com" ;
  :description "The best company in the world."@en, "The worst company in the world."@en ] .

This example illustrates how fusion highlights problems in data, including conflicting values of functional properties, such as the :foundingDate, or contradicting values of other properties, such as the :description. However, resolving these conflicts is a complex task that is beyond the scope of this post.

Conclusion

While I have found the presented methods for data fusion applicable to a variety of datasets, they may fare worse for complex or large datasets. Besides the concern of correctness, one has to weigh the concern of performance. Based on my experience so far, the hash-based and key-based methods are usually remarkably performant, while the methods featuring double negation are not. Nonetheless, the SPARQL updates from this post can oftentimes be simply copied, pasted, and used out of the box (after tweaking the graph patterns that select the fused resources).

2016-06-28

On generating SPARQL

The question of how to generate SPARQL comes up so often in my work that I figured I would attempt a summary of the different approaches that answer it, none of which is established enough to be considered a best practice. The lack of a well-established method for generating SPARQL may have arisen from the fact that SPARQL is serialized to strings. While its string-based format may be convenient to write by hand, it is less convenient to generate through code. To make SPARQL more amenable to programmatic manipulation, several approaches have been devised, some of which I will cover in this post.

Opting for strings or structured data to represent a data manipulation language is a fundamental design decision. Dumbing the decision down, there is a usability trade-off to be made: either adopt a string representation to ease manual authoring or go for a representation using data to ease programmatic manipulation. Can we nevertheless have the best of both options?

SPARQL continues in the string-based tradition of SQL; possibly leveraging a superficial familiarity between the syntaxes. In fact, SPARQL was recently assessed as “a string-based query language, as opposed to a composable data API.” This assessment implicitly reveals that there is a demand for languages represented as structured data, such as the Elasticsearch query DSL serialized in JSON or the Datomic query syntax, which is serialized in EDN.

To illustrate the approaches for generating SPARQL, I decided to show how they fare on an example task. The chosen example should be simple enough, yet realistic, and such that it demonstrates the problems commonly encountered when generating SPARQL.

Example

Let's say you want to know all the people (i.e. instances of foaf:Person) known to DBpedia. There are 1.8 million such persons, which is way too many to fetch in a single SPARQL query. In order to avoid overload, DBpedia's SPARQL endpoint is configured to provide at most 10 thousand results. Since we want to get all the results, we need to use paging via LIMIT and OFFSET to partition the complete results into smaller parts, such that one part can be retrieved within a single query.

Paging requires a stable sort order over the complete collection of results. However, sorting a large collection of RDF resources is an expensive operation. If a collection's size exceeds a pre-configured limit, Virtuoso requires the queries paging over this collection to use scrollable cursors (see the section “Example: Prevent Limits of Sorted LIMIT/OFFSET query”), which basically wrap an ordered query into a subquery in order to better leverage the temporary storage of the sorted collection. Because of the number of persons in DBpedia we need to apply this technique to our query.

Let's say that for each person we want to get values of several required properties and some optional properties. For example, we may want to get names (foaf:name) and birth dates (dbo:birthDate) and, optionally, dates of death (dbo:deathDate). Since persons can have multiple names and we want only one name per person, we need to use SAMPLE with the names to retrieve a single random name associated with each person. We could assume that a person has no more than one birth date and death date, but in fact in DBpedia there are 31 thousand persons with multiple birth dates and 12 thousand persons with multiple death dates, so we also need to use SAMPLE for these properties. Considering all these requirements, our SPARQL query may look like the following.
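The original query is not reproduced here; a sketch matching the description above (with the first page hard-coded as LIMIT 10000 OFFSET 0) might look like this:

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name ?birthDate ?deathDate
WHERE {
  {
    SELECT ?person
           (SAMPLE(?name_) AS ?name)
           (SAMPLE(?birthDate_) AS ?birthDate)
           (SAMPLE(?deathDate_) AS ?deathDate)
    WHERE {
      {
        # Scrollable cursor: page over a stably ordered set of persons.
        SELECT DISTINCT ?person
        WHERE {
          ?person a foaf:Person .
        }
        ORDER BY ?person
        LIMIT 10000
        OFFSET 0
      }
      ?person foaf:name ?name_ ;
        dbo:birthDate ?birthDate_ .
      OPTIONAL {
        ?person dbo:deathDate ?deathDate_ .
      }
    }
    GROUP BY ?person
  }
}
```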

We need the LIMIT and OFFSET in this query to be generated dynamically, as well as the listing of the required and optional properties. In the following, I will cover and compare several approaches for generating SPARQL queries using these parameters. All the approaches are illustrated with examples in Clojure to make them easier to compare.

String concatenation

The approach readily at hand for most developers is string concatenation. While it is simple to start with, it intertwines SPARQL with code, which makes it brittle and error-prone. Convoluted manual escaping may be needed, and the result is particularly messy in programming languages that lack support for multi-line strings, such as JavaScript. Here is an example implementation using string concatenation to generate the above-mentioned query.

Parameterized queries

If the queries to be generated differ only in a few variables, they can be generated from a parameterized query, which represents a generic template that can be filled with specific parameter values at runtime. Parameterized queries separate the static and dynamic parts of the generated queries. While the static query template is represented in SPARQL and can be stored in a separate file, the dynamic parts can be represented using a programming language's data types and passed to the template at runtime. The separation provides both the readability of SPARQL and the expressiveness of programming languages. For example, template parameters can be statically typed in order to improve error reporting during development or to avoid some SPARQL injection attacks in production.

Although parameterized queries improve on string concatenation and are highlighted among the linked data patterns, they are limited. In particular, I will discuss the limitations of parameterized queries as implemented in Apache Jena by ParameterizedSparqlString. While the main drawbacks of parameterized queries hold for other implementations as well, the details may differ. In Jena's parameterized queries, only variables can be dynamic. Moreover, each variable can be bound only to a single value. For example, for the pattern ?s a ?class . we cannot bind ?class to both schema:Organization and schema:Place to produce ?s a schema:Organization, schema:Place .. If we provide multiple bindings for a variable, only the last one is used. Queries that cannot restrict their dynamic parts to variables can escape to string buffers to append arbitrary strings, but doing so brings back the problems of string concatenation. Due to these restrictions, we cannot generate the example query using this approach. Here is a partial implementation that generates only the limit and offset.

Jena also provides a similar approach closer to prepared statements in SQL. When executing a query (via QueryExecutionFactory) you can provide it with pre-bound variables (via QuerySolutionMap). Similar restrictions to those discussed above apply. Moreover, your template must be a syntactically valid SPARQL query or update. In turn, this prohibits generating LIMIT numbers, because LIMIT ?limit is not a valid SPARQL syntax. The following implementation thus does not work.

Templating

If we need higher expressivity for generating SPARQL, we can use a templating language. Templating is not restricted to the syntax of SPARQL, because it treats SPARQL as arbitrary text, so any manipulation is allowed. As with parameterized queries, templating separates the static template from the dynamic parts of SPARQL. Unlike parameterized queries, a template is not represented in pure SPARQL, but in a mix of SPARQL and a templating language. This recalls some of the shortcomings of interleaving SPARQL with code in string concatenation. Moreover, templates generally do not constrain their input, for example by declaring types, so any data can be passed into the templates without being checked in advance. Notable exceptions allowing template data to be typed are available in Scala; for example Scalate or Twirl.

In order to show how to generate SPARQL via templating I adopted Mustache, which is a minimalistic and widely known templating language implemented in many programming languages. The following is a Mustache template that can generate the example query.

Rendering this template requires little code that provides the template with input data.

Domain-specific languages

As the string concatenation approach shows, SPARQL can be built using programming language constructs. Whereas string concatenation operates on the low level of text manipulation, programming languages can be used to create constructs operating on a higher level, closer to SPARQL. In this way, programming languages can be built up into domain-specific languages (DSLs) that compile to SPARQL. DSLs retain the expressivity of the programming languages they are defined in, while providing a syntax closer to SPARQL, thus reducing the cognitive overhead of translating SPARQL into code. However, when using or designing DSLs, we need to be careful about potential clashes between names in the modelled language and the programming language the DSL is implemented in. For example, concat is used both in SPARQL and in Clojure. Conversely, if a DSL lacks a part of the modelled language, escape hatches may be needed, regressing back to string concatenation.

Lisps, Clojure included, are uniquely positioned to serve as languages for defining DSLs. Since Lisp code is represented using data structures, it is easier to manipulate than languages represented as strings, such as SPARQL.

Matsu is a Clojure library that provides a DSL for constructing SPARQL via macros. Macros are expanded at compile time, so when they generate SPARQL they generally cannot access data that becomes available only at runtime. To a limited degree it is possible to work around this limitation by invoking the Clojure reader at runtime. Moreover, since Matsu is built using macros, we need to use macros to extend it. An example of such an approach is its built-in defquery macro, which allows parameters to be passed into a query template. Nevertheless, mixing macros with runtime data quickly becomes convoluted, especially if larger parts of SPARQL need to be generated dynamically.

If we consider using Matsu for generating the example query, we discover several problems that prevent us from accomplishing the desired outcome, apart from the already mentioned generic issues of macros. For instance, Matsu does not support subqueries. Defining subqueries separately and composing them via raw input is not possible either, because Matsu queries contain prefix declarations, which are syntactically invalid in subqueries. Ultimately, the farthest I was able to get with Matsu for the example query was merely the innermost subquery.

Query DSLs in object-oriented languages are often called query builders. For example, Jena provides a query builder that allows SPARQL to be built by manipulating Java objects. The query builder is deeply tied to the Jena object model, which provides some type checking at the expense of a more verbose syntax. Since Clojure can call Java directly, implementing the example query using the query builder is straightforward.

While Matsu represents queries via macros and Jena's query builder does so via code, there is another option: representing queries via data. Using a programming language's native data structures for representing SPARQL provides arguably the best facility for programmatic manipulation. Data is transparent at runtime and as such it can be easily composed and inspected. In fact, a widespread Clojure design rule is to prefer functions over macros and data over functions. An example of using data to represent a SPARQL-like query language in Clojure is the Fabric DSL. While this DSL is not exactly SPARQL, it is “highly inspired by the W3C SPARQL language, albeit expressed in a more Clojuresque way and not limited to RDF semantics” (source).

SPIN RDF

An approach that uses data in RDF for representing SPARQL is SPIN RDF. It offers an RDF syntax for SPARQL and an API for manipulating it. While the translation of SPARQL to RDF is for the most part straightforward, one of its more intricate parts is using RDF collections for maintaining order in triple patterns or projected bindings, because the collections are difficult to manipulate in SPARQL.

Nonetheless, SPIN RDF seems to have a fundamental problem with passing dynamic parameters from code. From what I can tell, the membrane between SPIN RDF and code is impermeable. It would seem natural to manipulate SPIN RDF via SPARQL Update. However, how can you pass data to the SPARQL Update from your code? If you adopt SPIN RDF wholesale, your SPARQL Update operation is represented in RDF, so you have the same problem. Passing data from code to SPIN RDF thus results in a recursive paradox. Although I tried hard, I have not found a solution to this conundrum in the SPIN RDF documentation, nor in the source code of the SPIN API.

This is how the example query can be represented using SPIN RDF; albeit using fixed values in place of the dynamic parts due to the limitations discussed above.

Rendering SPIN RDF to SPARQL can be implemented using the following code.

I have found a way to generate dynamic SPARQL queries in SPIN RDF using JSON-LD. JSON-LD can be represented by data structures, such as hash maps or arrays, that are available in most programming languages. This representation can be serialized to JSON that can be interpreted as RDF using the JSON-LD syntax. SPIN RDF can in turn be translated to SPARQL, obtaining our desired outcome. As may be apparent from this workflow, crossing that many syntaxes (Clojure, JSON-LD, RDF, SPIN, and SPARQL) requires considerable cognitive effort, due to the mappings between the syntaxes one has to be aware of when authoring SPARQL in this way. Here is an implementation of this approach for the example query.

SPARQL algebra

As previously mentioned, a problematic part of SPIN RDF is its use of RDF collections for representing order. The documentation of Apache Jena recognizes this, saying that “RDF itself is often the most appropriate way to do this, but sometimes it isn't so convenient. An algebra expression is a tree, and order matters.” (source). The documentation talks about SPARQL algebra, which formalizes the low-level algebraic operators into which SPARQL is compiled. Instead of using RDF, Jena represents SPARQL algebra in s-expressions (SSE), which are commonly used in programming languages based on Lisp, such as Scheme. In fact, the “SSE syntax is almost valid Scheme” (source), but the SSE documentation acknowledges that Lisp “lacks convenient syntax for the RDF terms themselves” (source).

In order to see how our example query looks in SSE we can use Jena's command-line tools and invoke qparse --print=op --file query.rq to convert the query into SSE. The following is the result we get.

If SSEs were valid Clojure data structures, we could manipulate them as data and then serialize them to SPARQL. Nevertheless, there are minor differences between SSE and the syntax of Clojure. For example, while ?name and _:a are valid symbols in Clojure, absolute IRIs enclosed in angle brackets, such as <http://dbpedia.org/ontology/>, are not. Possibly, these differences can be remedied by using tagged literals for RDF terms.

Conclusions

I hope this post gave you a flavour of the various approaches for generating SPARQL. There is an apparent impedance mismatch between current programming languages and SPARQL. While the programming languages operate with data structures and objects, SPARQL must eventually be produced as a string. This mismatch motivates the development of approaches for generating SPARQL, which presents many challenges, some of which I described in this post.

I assessed these approaches on the basis of how they fare on generating an example query using the data from DBpedia. The complete implementations of these approaches are available in this repository. Out of the approaches I reviewed, I found four in which it is feasible to generate the example SPARQL query without undue effort, which include:

  • String concatenation
  • Templating
  • Jena's query builder DSL
  • SPIN RDF using JSON-LD

My personal favourite that I use for generating SPARQL is templating with Mustache, which appears to mesh best with my brain and the tasks I do with SPARQL. Nonetheless, I am aware of the limitations of this approach and I am on a constant lookout for better solutions, possibly involving rendering SPARQL from data.

While I invested a fair amount of effort into this post, it is entirely possible I might have overlooked something or implemented any of the reviewed approaches in a sub-optimal way, so I would be glad to hear any suggestions on how to improve. In the meantime, while we search for the ideal solution for generating SPARQL, I think the membrane between code and SPARQL will remain only semipermeable.

2016-03-22

Academia-driven development

In this post I present an opinionated (and mostly wrong) account of programming in academia. It's based partly on my experience as a developer working in academia and partly on conversations I had with fellow developers working in the private sector.

In academia, you rarely program even if you are in computer science. Instead, you read papers, write papers, write deliverables for ongoing projects, or write project proposals. You aren't paid to program, you are paid to publish. You do only as much programming as is needed to have something to write a paper about. “Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do,” observes Bozhidar Bozhanov. While programming purists may be dissatisfied with that, Jason Baldridge is content with this state of affairs and writes: “For academics, there is basically little to no incentive to produce high quality software, and that is how it should be.”

Albert Einstein allegedly said: “If we knew what it was we were doing, it would not be called research, would it?” While the attribution of this quote is dubious at best, there's a grain of truth in it. In research, it's natural that you often don't know what you're working on. I think this is the reason why test-driven development (TDD) is not truly applicable in research. Programming in research is used to explore new ideas, whereas TDD requires an upfront specification of what you are building. German has the verb ‘basteln’, which stands for DIY fiddling. The word was adopted into spoken Czech with a negative connotation of not knowing what you're doing, which I think nicely captures what often happens in academic programming.

The low quality of academic software hinders its maintenance and extensibility in long-term development. For one-off experiments these concerns aren't an issue, but most experiments need to be reproducible. Academic software must make it possible to reproduce and verify the results reported in its associated publication. Anyone must be able to re-run it. It must be open source, allowing others to scrutinize its inner workings. Unfortunately, it's often the case that academic software isn't released at all or, when it is made available, it's nigh impossible to run without asking its creators for assistance.

What's more, the usability of software is hardly ever a concern in academia, despite the fact that usable software may attract more citations, thereby increasing the academic prestige of its author. An often-mentioned example of this effect in practice is Word2vec, whose paper boasts 1305 citations according to Google Scholar. Indeed, it would be a felicitous turn if we came to see the usability of academic software as a valuable proxy that increases citation counts.

A great benefit that comes with reproducible and usable software is extensibility. Ted Pedersen argues that there's “a very happy side-effect that comes from creating releasable code—you will be more efficient in producing new work of your own since you can easily reproduce and extend your own results.” Nonetheless, even if software is both reproducible and usable, extending a code base without tests may be like building on quicksand. This is usually an opportunity for refactoring. For example, the feature to be extended can first be covered with tests that document its expected behaviour, as Nell Shamrell-Harrington suggests in surgical refactoring. The subsequent extension must not break these tests, unless the expected behaviour is meant to change. I think adopting this approach can do a great deal of good for the continuity of academic development.
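To illustrate that refactoring workflow, here is a minimal sketch in Python: before extending a hypothetical, untested function from an academic code base, we first pin down its current behaviour with a characterization test. All names and behaviours here are invented for the example.

```python
# A hypothetical legacy function from an academic code base,
# written without tests.
def tokenize(text):
    # Current behaviour: lowercase the text and split on whitespace.
    return text.lower().split()

# Step 1: cover the existing behaviour with a characterization test
# that documents what the function does today.
def test_tokenize_existing_behaviour():
    assert tokenize("Hello World") == ["hello", "world"]
    assert tokenize("") == []
    assert tokenize("  spaced   out  ") == ["spaced", "out"]

# Step 2: only now extend the feature (say, punctuation stripping),
# re-running this test to ensure the old behaviour still holds
# unless we deliberately decide to change it.
test_tokenize_existing_behaviour()
```

The test is cheap to write, yet it turns the undocumented status quo into an explicit contract that any later extension must either honour or consciously revise.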

Finally, there's also an economic argument for ‘poor-quality’ academic software. If software developed in academia achieved production quality, it would compete with software produced in the private sector. Since academia is part of the public sector, academic endeavours are financed mostly from public funds, so such competition with commercial software can be considered unfair. Dennis Polhill argues that “unfair competition exists when a government or quasi-government entity takes advantage of its tax exemption and other privileges to supply private goods to the market in competition with private suppliers.” Following this line of thought, the public sector should not subsidize the development of software that is commercially viable and can be built by private companies. Instead of developing production-ready solutions, academia can try and test new prototypes. If released openly, this proof-of-concept work can then be adopted in the private sector and grown into commercial products.

Eventually, when exploring my thoughts on academia-driven development, I realized that I'm torn between settling for the status quo and pushing to emancipate software to the same standing as publications. While I'm stuck figuring this out, there are laudable initiatives, such as Semantic Web Developers, which organizes regular conference workshops that showcase semantic web software and spark conversations about the status of software in academia. Let's see how these conversations pan out.