<h1>The story of my Ph.D.</h1>
<p>The story of my Ph.D. is a story of bitter compromises. A story of compromising research for paid work. A story of abandoning promising research to be able to finish. Probably not that different from any other Ph.D.</p>
<p>Since memories are subjective and malleable, it may not be a completely accurate story. All departures from the true events are solely my fault. Moreover, I am not trying to convey any lessons learnt just yet. I just want to tell the story of <em>my</em> Ph.D. I guess it all makes <em>some</em> sense in retrospect, as if the future were a product of the past.</p>
<h2 id="section">2010</h2>
<p>While I could retrace ever so tenuous links way back, the story of my Ph.D. really began in November 2010 in Cologne, Germany. I was there for the <a href="http://swib.org/swib10">Semantic Web in Bibliotheken</a> (SWIB) conference as a co-author of one of the accepted papers. At the conference I met <a href="https://twitter.com/soerenauer">Sören Auer</a>, there to promote the <a href="http://lod2.eu/Article/Publink.html">PUBLINK programme</a> of the <a href="http://lod2.eu">LOD2 project</a>. While we talked over a coffee break I suggested that the Czech <a href="https://www.techlib.cz/en">National Library of Technology</a>, where I worked at the time, could use the consulting from linked open data experts offered in the PUBLINK programme. We exchanged business cards and parted our ways. It was the last time I had business cards. They were colourful and sloppy, manually cut from a thin paper. Still, they helped me to forge one of the most impactful connections for my Ph.D.</p>
<p>Weeks later I received an email from Sören, who concluded that we were qualified enough not to need help from the LOD2 project, and instead asked if we would consider joining the project as partners to represent the “Eastern Europe”. Well, Czechs like to think they are a part of the “Central Europe”, but I gladly seized this opportunity.</p>
<h2 id="section-1">2011</h2>
<p>After some discussion it became clear that the National Library of Technology did not have the workforce required to join the LOD2 project. I turned to my next closest institution: the <a href="https://www.vse.cz/english">University of Economics in Prague</a>, where I already worked on bibliographic linked data that paid for my November trip to Cologne. There was a small team, led by <a href="http://nb.vse.cz/~svatek/welcom_e.htm">Vojtěch Svátek</a>, already involved in semantic web research for a number of years. To make this team stronger, we formed a strategic alliance with the group of <a href="http://www.ksi.mff.cuni.cz/~necasky">Martin Nečaský</a> from the <a href="https://www.mff.cuni.cz/to.en">Faculty of Mathematics and Physics</a> at the Charles University in Prague, thereby transgressing the traditional organizational boundaries. This union later proved successful and lasted through many research projects we worked on together. It perhaps contributed to our affiliations blending in the minds of our foreign project partners to a nebulous concept of the “University of Prague”.</p>
<p>Having secured a team, we needed a challenge it could work on. I started writing down a proposal that later turned into a part of the LOD2 project applying <a href="http://lod2.eu/WorkPackage/wp9a.html">linked open data for running a distributed marketplace of public sector contracts</a>. I based it on a suspicion that linked open data can serve as a better infrastructure for online markets. In such infrastructure, I surmised, we could operate matchmakers to link relevant demands and offers.</p>
<p>No idea is truly novel, and this one was no different. Its key inspiration came from <a href="https://twitter.com/mhausenblas">Michael Hausenblas</a>, whom I met at the Linked Data Research Centre at DERI (now <a href="https://www.insight-centre.org">Insight Centre for Data Analytics</a>) in Galway, Ireland, where I worked as an intern in 2010. Michael had similar thoughts earlier and came up with <a href="http://lov.okfn.org/dataset/lov/vocabs/c4n">Call for Anything</a>, a lightweight vocabulary for machine-readable descriptions of demands on the Web, and prototyped an application matching developers to businesses using the vocabulary. There already was a well-known vocabulary for describing offers on the Web: <a href="http://www.heppnetz.de/projects/goodrelations">GoodRelations</a> by <a href="https://twitter.com/mfhepp">Martin Hepp</a>. Call for Anything and GoodRelations clicked into place and I exchanged emails with Michael and Martin, thinking through an example application of matchmaking, which informed what later became our contribution to the LOD2 project.</p>
<p>What we lacked was a market in which data on both supply and demand is available. Earlier in 2010, when forming the nascent Czech open data initiative, we picked public contracts as a high-value dataset to screen-scrape and release as machine-readable linked open data. The public procurement market seemed to be a great setting for our work, since public contracts are demands explicitly represented as data in public procurement notices, thanks to the proactive disclosure mandated by law.</p>
<p>We thought these poorly formed ideas through, enveloped them in profound academese, and eventually submitted them as an extension proposal for the LOD2 project.</p>
<p>The proposal was successful and in September 2011 we joined the LOD2 project, which got us three years of funding. It would have been a perfect time for a Ph.D. if only I had already completed my Masters. I faced a dilemma later prominent in my Ph.D.: compromising paid work for educational progress. Moreover, back then I still worked part-time at the National Library of Technology. Nevertheless, I decided to fit everything into my limited waking hours and joined the University of Economics as an external researcher working on the LOD2 project.</p>
<h2 id="section-2">2012</h2>
<p>By the end of 2011 it was clear to me that splitting my time between work and education leads to hardly any progress in either of them. As my former position was no longer tenable, in February 2012 I quit the National Library of Technology to focus on finishing my master’s thesis.</p>
<p>In the following months I started running up against the limits of part-time contracts at the University of Economics. The only reasonable way for me to work more on the LOD2 project was to enroll in the university’s Ph.D. programme in applied computer science. There were no research ambitions at the beginning of my Ph.D. What made me apply for it was the practical concern of being able to continue the work that I found interesting. I applied for the Ph.D. and successfully completed the admission exams in May 2012. When in June 2012 I graduated from the Charles University with a master’s degree in <a href="http://novamedia.ff.cuni.cz">new media studies</a>, I was set to pursue the Ph.D. However, even back then I was asking myself: <a href="https://twitter.com/jindrichmynarz/status/377445415047921665">Is there life after Ph.D.?</a></p>
<p>I officially began my Ph.D. on September 20, 2012. I started it believing the widespread myth that a Ph.D. is the only opportunity in life to focus on a single thing and explore it in depth. I quickly realized how far detached from reality this myth is.</p>
<p>Besides researching your Ph.D. topic, many other duties compete for your attention. First and foremost, as a Ph.D. student I was required to teach. A cynical view has it that Ph.D. students are little more than a cheap resource to provision teaching. I was fortunate enough to be assigned courses at least tangentially related to my Ph.D., including labs in an XML course and several lectures and labs in a course on linked data. The less lucky ones ended up teaching things like basic Microsoft Office skills.</p>
<p>While teaching can be satisfying and meaningful at times, it also takes a huge amount of time to do it right, especially when you start a new course. The effort spent on teaching has sporadic returns. Rarely do you hear any positive feedback, and given that one of the university’s primary goals is to produce as many graduates as possible, you often experience frustration with disinterested and unmotivated students who expect their graduation to be simply a matter of time. Under this impression, after a year, I decided to forfeit the Ph.D. stipend in order not to be required to teach.</p>
<p>Compared to teaching, other Ph.D. duties were relatively minor and infrequent. Once in a while I had to supervise bachelor’s or master’s theses and oversee admission exams of new students. I enjoyed the apprenticeship of supervising theses more than teaching, although few students invested more than required for a minimum viable thesis. Then there were academic duties that went without explicit acknowledgement, such as peer review, contributing to the bulk of unpaid labour that an obedient member of academia delivers.</p>
<p>The courses I was required to attend were largely irrelevant to the pursuit of my Ph.D. While I endured a course in IT management, I wondered why the courses on statistics or programming were left out of the curriculum. In retrospect, probably the most relevant was the introductory course on basic scientific methods, though it was definitely rudimentary.</p>
<p>Let’s talk money. My Ph.D. stipend amounted to 5400 CZK per month, which was 216 EUR, or 75.8 % of the minimum net wage at the time in the Czech Republic. Back then it was roughly what you would pay for renting a room in a shared apartment in Prague. Since the stipend could not cover the cost of living, I had to find other sources of income, the most notable being research projects, typically involving uncertain part-time and fixed-term work. I was decidedly a part of the Ph.D. <a href="https://en.wikipedia.org/wiki/Precariat">precariat</a>, always compromising my research for paid work.</p>
<h2 id="section-3">2013</h2>
<p>The habit of following interesting work led me astray from my Ph.D. from time to time. For instance, between January 2013 and May 2014 I followed an opportunity to work with friends from new media studies at the Charles University on a project using semantic web technologies for the long tail of the job market. Also in January 2013, my Ph.D. achieved a minor impact outside of academia. I was invited as a (charmingly called) “ad-hoc expert” to the European Commission’s Public Sector Information group, where I talked about the dire present and the bright futures of public procurement data.</p>
<p>Contrary to my expectations, my actual contributions to the LOD2 project were rarely related to my Ph.D. More often than not I ended up doing the grunt work of data preparation or was swamped in the project admin and the ever-present “dissemination”. The LOD2 project also allowed me to take on the inverse of the role I had asked for in 2010 at SWIB: this time, I served as a linked open data expert to the National Library of Israel in the PUBLINK programme. As a result, when the LOD2 project successfully concluded in September 2014, most of my Ph.D. work was still left to be done.</p>
<h2 id="section-4">2014</h2>
<p>When the LOD2 project ended, my future funding was unclear. By that time, our proposal for a follow-up Horizon 2020 project called <a href="http://openbudgets.eu">OpenBudgets.eu</a> had been rejected.</p>
<p>I used the gap in funding to do a Ph.D. internship at Politecnico di Bari, Italy, joining the research group of <a href="http://sisinflab.poliba.it/dinoia">Tommaso di Noia</a> between October and December 2014. It was an easy choice. When I surveyed the research literature on matchmaking (the topic of my Ph.D. thesis), I found many links pointing to Bari. In a fortunate turn of affairs, I managed to obtain my university’s internal funding just in time for this internship. Working through a tight series of deadlines I completed my last required Ph.D. courses and re-enrolled as a full-time student in order to be eligible for the internship stipend. This internship was in fact the only period when I could be entirely dedicated to my Ph.D. It was essential in building the fundamental parts of what later became my thesis. I can heartily recommend going abroad for a few months to do such an internship.</p>
<h2 id="section-5">2015</h2>
<p>Immediately after my Bari gig, in January 2015, I followed with a one-month internship at the University of Göttingen, Germany, working with library data on old prints. Here again, I returned to my roots in libraries. Also, I received decent funding that sorted out my financial situation for another month and filled in some gaps from the previous period that the university’s stipend failed to cover.</p>
<p>Since the research project funding at the University of Economics dried up, I arranged a part-time job with <a href="https://www.eea.sk/en">EEA</a> from February to September 2015, working on the <a href="http://www.comsode.eu">COMSODE</a> project. There, I assumed the role of a data janitor, tirelessly ETL-ing many government datasets. It gave me a novel perspective on the well-known setting of EU research projects. Working for a commercial project partner meant two things improved significantly: management and funding.</p>
<p>A peculiar turn of events took place in spring 2015. While it had previously come up short, the <a href="http://openbudgets.eu">OpenBudgets.eu</a> project was eventually funded and we were expected to start working on it as soon as possible, despite any plans we had made in the meantime. I reluctantly accepted a part-time involvement on the project, starting in May 2015. With mixed feelings, I asked for a break from my Ph.D., lasting till September 2015 when my contract with EEA ended. Due to the workload I imposed on myself, I was simply unable to fit the Ph.D. in.</p>
<p>In order to maintain my sanity during my long Ph.D. journey I occasionally worked on things whimsical. One of these “extra-curricular” efforts was <a href="http://mynarz.net/db-quiz">DB-quiz</a>, a Wikipedia-based knowledge game imitating a well-known Czech TV show. I found these activities fulfilling, perhaps because they helped me find some sense in my Ph.D. by contrasting it with clear nonsense. Obviously, I could not settle for anything halfway, so I followed through with the joke to the very end and turned DB-quiz into an <a href="http://dl.acm.org/citation.cfm?id=2993339">academic paper</a>, later winning a prize for the best Ph.D. publication at the University of Economics.</p>
<h2 id="section-6">2016</h2>
<p>At the end of 2015 I found myself with barely any progress in my Ph.D. It started to dawn on me that, if I was to finish at all, I needed to live off my savings for a while instead of always hunting for piecemeal income. Consequently, in 2016 I started carefully reducing paid work to make room for research. Since then, my savings have followed a decidedly declining slope.</p>
<p>I dedicated most of January to preparation for the doctoral state exam required after 3 years of the Ph.D. The next month I passed the exam, albeit with a barely satisfying performance, and ticked off another Ph.D. duty: submitting a paper to my university’s Ph.D. symposium. With these tasks out of the way there was only one thing left to do: my thesis.</p>
<p>September 2016 marked the start of the final year-long grind on my thesis. In the fall of 2016 I thoroughly redid the entire data preparation, meticulously documenting every step of it and improving my crude data processing tools along the way.</p>
<p>In December 2016, while entirely immersed in ETL of public procurement data, I realized that I had forgotten about the deadline for the preliminary thesis defense. By the end of the fourth year every Ph.D. student at my university is obliged to defend an 80% ready thesis. I started hastily piecing together my notes and former publications to meet the deadline coming up in February.</p>
<h2 id="section-7">2017</h2>
<p>My thesis had a meagre 60 pages when I submitted it for the preliminary defense. Yet I managed to conditionally pass the defense thanks to otherwise outstanding results, judged by the modest standards of my university. The stipulated condition was that the thesis would be reviewed once more before the final defense.</p>
<p>Amidst the continued demise of my savings I was running out of options. I was piecing together my income by context-switching between several part-time projects at my university. I needed another financial boost, one final kick before I was done with the Ph.D. The strategic alliance with Martin Nečaský came in handy again. Via this link I became a part-time open data expert at the Ministry of Interior of the Czech Republic, working on linked open data in statistics. During the summer of 2017 I split my time between this job, my thesis, and OpenBudgets.eu.</p>
<p>The problem with a Ph.D. is that it <a href="https://www.youtube.com/watch?v=L4YurZZTI_Y">grows without bounds</a>. There is no way of telling that a Ph.D. is done, except the (arbitrary and untimely) end you set for yourself. When by the end of June I signed up as a full-time (linked) data engineer in the pharmaceutical industry starting in October, I knew I had exactly 3 months left to finish my thesis. After that point I assumed there would simply be no time to work on the thesis anymore.</p>
<p>With a self-imposed deadline in sight I accepted the need for a bitter compromise. Compared with my original ambitions, I left out many interesting experiments. By the end of September I was mostly done, given my reduced work scope. Consequently, I passed the additional thesis pre-defense with no problems, giving me a green light to submit the final version of my thesis.</p>
<p>Unfortunately, despite my careful planning I did not manage to hand in my thesis before starting a full-time job. Due to an illness, the thesis writing spilt over by some weeks into October, with me working evenings on the final editing. I submitted my Ph.D. thesis on October 18. I was done. In relief, I <a href="https://twitter.com/jindrichmynarz/status/920732136197492736">tweeted</a>:</p>
<blockquote>
<p>Submitting your Ph.D. thesis feels like a large open wound you’ve been bleeding from for years finally started healing. A nice feeling.</p>
</blockquote>
<p>Apart from turning in the thesis itself, there were several other supplements I had to provide, the most puzzling one being a 20-page summary of the thesis. Frankly, who reads a 20-page summary? I thought people read either the abstract or the whole damn thing. Reservations aside, I bit the bullet once again and played by the rules.</p>
<h2 id="the-end">THE END</h2>
<p>On January 25, 2018, I passed my Ph.D. viva, with all committee votes unequivocally supporting my graduation. Finishing the Ph.D. was, first and foremost, a testament to my stubbornness, not to my research prowess. It took me 5 years, 4 months, and 5 days. It takes a lot of patience and grit to persist that long.</p>
<p>People are ridiculously bad at answering “what if” questions. Nevertheless, I believe I would not be sitting idle had I not enrolled in the Ph.D. I believe I would be doing something just as interesting. Hence, my overall evaluation of the Ph.D. is <a href="https://stereototal.bandcamp.com/track/exakt-neutral">exactly neutral</a>.</p>
<p>However, I could not disregard the Ph.D.’s negative externalities. The Ph.D. levied a toll on my relationships with others. Oftentimes I grew cold, moody, and unresponsive, as I was churning through the flexible working hours for a precarious income. I definitely was not the cheeriest lad around.</p>
<p>While I explicitly avoid any lessons for others here, there were lessons I learnt. I have grown ever so cynical. I have learnt not to care much for deadlines, adopting the attitude of Douglas Adams: <em>“I love deadlines. I like the whooshing sound they make as they fly by.”</em> Finally, I have learnt to understand that <a href="https://twitter.com/jindrichmynarz/status/869793608177680384"><em>“adversity and existence are one and the same.”</em></a></p>
<h1>Copy-pasting the history of public procurement</h1>
<p>They say that <em>“those who do not learn history are doomed to repeat it.”</em> However, those who machine-learn from history are also doomed to repeat it. Machine learning is in many ways copy-pasting history. A key challenge is thus to learn from history without copying its biases.</p>
<p>For the first time in history, we have open data on past experiences in making public contracts. Contracting authorities and bidders have a chance of learning from these experiences to improve the efficiency of agreeing on future contracts by matching contracts to the most suitable bidders. I wrote my <a href="http://mynarz.net/dissertation">Ph.D. thesis</a> on this subject.</p>
<p>The thing is that we know little when learning from awarded contracts. What we have is basically this: Similar contracts are usually awarded to these bidders. This has two principal flaws: the first is tenuous similarity. Similarity between contracts hinges on how informative their description is. However, the features comprising the contract description may be incomplete or uninformative. Civil servants administering contracts may either intentionally or inadvertently withhold key information from the contract description or fill it with information that ultimately amounts to noise.</p>
<p>What is perhaps more detrimental is that the associations between bidders and their awarded contracts need not be positive. If we learn from contracts awarded in the past, we assume that the awarded bidders are the best (or simply good enough) matches for the contracts. This assumption is fundamentally problematic. The winning bidder may be awarded on the basis of adverse selection, which favours not the public, but the public decision maker. There are many ways in which the conditions of adverse selection can arise. Relevant information required for informed decision-making can be unevenly distributed between the participants of the public procurement market. Rival bidders can agree to cooperate for mutual benefit and limit open competition, such as by submitting overpriced fake bids to make real bids more appealing in comparison. When a bidder repeatedly wins contracts from the same contracting authority, it need not be a sign of its exceptional prowess, but instead a sign of clientelism or institutional inertia. In most public procurement datasets the limitations of award data are further pronounced by missing data on unsuccessful bidders. In this light, the validity of contract awards recommended by machine learning from data with these flaws can be easily contested by unsuccessful bidders perceiving themselves to be subject to discrimination. Consequently, blindly mimicking human decisions in matching bidders to contracts may not be the best course of action.</p>
<p>Some suspect we can solve this by learning from more kinds of data. Besides award data, we can learn from post-award data, including the evaluations of bidders’ performance. However, post-award data is hardly ever available. Without it, the record of a contract is incomplete. Moreover, when learning from data for decision support, we combine the conflicting incentives of decision arbiters and data collectors, who are often the same people in public procurement. For instance, a winning bidder may receive a favourable evaluation despite its meagre delivery on the awarded contracts thanks to its above-standard relationship with the civil servant administering the contracts.</p>
<p>In <a href="https://arxiv.org/abs/1706.09249">Logics and practices of transparency and opacity in real-world applications of public sector machine learning</a> Michael Veale suggests that we need to carefully select what data gets disclosed to combine <em>“transparency for better decision-making and opacity to mitigate internal and external gaming.”</em> We need to balance publishing open data to achieve transparency with the use of opacity to avoid gaming the rules of public procurement. Apart from data selection, pre-processing training data for decision support is of key importance. In our data-driven culture, engineering data may displace engineering software, as Andrej Karpathy argues in <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Software 2.0</a>. Careful data pre-processing may help avoid overfitting to biases that contaminate the public procurement data.</p>
<p>Neither technology nor history is neutral. In fact, <a href="https://haftw.bandcamp.com">history always favours the winners</a>. Winners of public contracts are thus always seen in favourable light in public procurement data. Moreover, <a href="http://blog.mynarz.net/2014/07/epistemology-of-data-in-contemporary.html">data is not an objective record</a> of the public procurement history. Therefore, we cannot restrict ourselves to copy-pasting from history of public procurement by a mechanical application of machine learning oblivious of the limitations of its training data. On this note in the context of the criminal justice system, the <a href="http://worldview.stanford.edu/raw-data/legal-codes-season-finale">Raw data podcast</a> asks: <em>“Given that algorithms are trained on historical data – data that holds the weight of hundreds of years of racial discrimination – how do we evaluate the fairness and efficacy of these tools?”</em> Nevertheless, how to escape this weight of history in machine learning is not yet well explored.</p>
<h1>What I would like to see in SPARQL 1.2</h1>
<p><a href="https://www.w3.org/TR/sparql11-query">SPARQL 1.1</a> is now 4 years old.
While considered by many as a thing of timeless beauty, developers are a fickle folk, and so they already started coming up with <a href="https://lists.w3.org/Archives/Public/public-sparql-dev/2013JulSep/0009.html">wish lists for SPARQL 1.2</a>, asking for things like <a href="https://twitter.com/HolgerKnublauch/status/527242118834565121">variables in SPARQL property paths</a>.
Now, while some RDF stores still struggle with implementing SPARQL 1.1, others keep introducing new features.
For example, Stardog is <a href="https://www.stardog.com/blog/extending-the-solution">extending SPARQL solutions</a> to include collections.
In fact, we could invoke the empirical method and look at a corpus of SPARQL queries to extract the most used non-standard features, such as Virtuoso's <code>bif:datediff()</code>.
If these features end up getting used and adopted among the SPARQL engines, we might as well have them standardized.
The little things that maintain backwards compatibility may be added via small accretive changes to make up the 1.2 version of SPARQL.
On the contrary, breaking changes that would require rewriting SPARQL 1.1 queries should wait for SPARQL 2.0.</p>
<p>In the past few years SPARQL has been the primary language I code in, so I had the time to think about what I would like to see make its way into SPARQL 1.2.
Here is a list of these things; ordered perhaps from the more concrete to the more nebulous ones.</p>
<ol>
<li>The <a href="https://www.w3.org/TR/sparql11-query/#modReduced"><code>REDUCED</code></a> modifier is underspecified. Current SPARQL engines treat it <em>"like a 'best effort' <code>DISTINCT</code>"</em> (<a href="https://stackoverflow.com/a/2990586/385505">source</a>). What constitutes the best effort differs between the implementations, which makes queries using <code>REDUCED</code> unportable (see <a href="https://twitter.com/jindrichmynarz/status/765511845302140929">here</a>). Perhaps due to its unclear interpretation and limited portability queries using <code>REDUCED</code> are rare. Given its limited uptake, future versions of SPARQL may consider dropping it. Alternatively, <code>REDUCED</code> may be explicitly specified as eliminating consecutive duplicates.</li>
<li>The <code>GROUP_CONCAT</code> aggregate is nondeterministic. Unlike <a href="https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_group-concat">in SQL</a>, there is no way to prescribe the order of the concatenated group members, which would make the results of <code>GROUP_CONCAT</code> deterministic. While some SPARQL engines concatenate in the same order, some do not, so adding an explicit ordering would help. It would find its use in many queries, such as in <a href="http://blog.mynarz.net/2016/10/basic-fusion-of-rdf-data-in-sparql.html">hashing for data fusion</a>. The <code>separator</code> argument of <code>GROUP_CONCAT</code> is optional, so that SPARQL 1.2 can simply add another optional argument for order by.</li>
<li>SPARQL 1.2 can add a property path quantifier <code>elt{n, m}</code> to retrieve a graph neighbourhood of <code>n</code> to <code>m</code> hops. This syntax, which mirrors the regular expression syntax for quantifiers, is already specified in the <a href="https://www.w3.org/TR/sparql11-property-paths">SPARQL 1.1 Property Paths</a> draft and several SPARQL engines implement it. While you can substitute this syntax by using repeated optional paths, it would provide a more compact notation for expressing the exact quantification, which is fairly common, for example in query expansion.</li>
<li>(Edit: This is a <a href="https://twitter.com/AndySeaborne/status/876163113028116481">SPARQL 1.1 implementation issue</a>.) The functions <a href="https://www.w3.org/TR/sparql11-query/#func-uuid"><code>uuid()</code></a>, <a href="https://www.w3.org/TR/sparql11-query/#func-struuid"><code>struuid()</code></a>, and <a href="https://www.w3.org/TR/sparql11-query/#idp2130040"><code>rand()</code></a> should be evaluated once per binding, not once per result set. Nowadays, some SPARQL engines return the same value for these functions in all query results. I think the behaviour of these functions is underspecified in SPARQL 1.1, which only says that <em>“different numbers can be produced every time this function is invoked”</em> (see the <a href="https://www.w3.org/TR/sparql11-query/#idp2130040">spec</a>) and that <em>“each call of UUID() returns a different UUID”</em> (<a href="https://www.w3.org/TR/sparql11-query/#func-uuid">spec</a>). Again, this specification leads to inconsistent behaviour in SPARQL engines, in some of them <a href="https://github.com/openlink/virtuoso-opensource/issues/515">to much chagrin</a>. Specifying that these functions must be evaluated per binding would help address many uses, such as minting unique identifiers in SPARQL Update operations.</li>
<li><a href="https://www.w3.org/TR/sparql11-http-rdf-update">SPARQL 1.2 Graph Store HTTP Protocol</a> could support quads, in serializations such as <a href="https://www.w3.org/TR/trig">TriG</a> or <a href="https://www.w3.org/TR/n-quads">N-Quads</a>. In such case, the <code>graph</code> query parameter could be omitted if the sent payload explicitly provides its named graph. I think that many tasks that do not fit within the bounds of triples would benefit from being able to work with quad-based multi-graph payloads.</li>
<li>SPARQL 1.2 could adopt the date time arithmetic from <a href="https://www.w3.org/TR/xpath-functions-30">XPath</a>. For instance, in XPath the difference between dates returns an <code>xsd:duration</code>. Some SPARQL engines, such as <a href="https://jena.apache.org">Apache Jena</a>, behave the same. However, in others you have to reach for extension functions, such as Virtuoso's <code>bif:datediff()</code>, or conjure up convoluted ways for seemingly simple things like <a href="https://stackoverflow.com/a/41780710/385505">determining the day of week</a>. Native support for date time arithmetic would also make many interesting temporal queries feasible.</li>
<li>(UPDATE: This is wrong, see <a href="https://twitter.com/kasei/status/1106236218717593600">here</a>.) Perhaps a minor pet peeve of mine is the lack of support for float division. Division of integers in SPARQL 1.1 produces an integer. As a countermeasure, I always cast one of the operands of division to a decimal number; i.e. <code>?n/xsd:decimal(?m)</code>. I suspect there may be a performance benefit in maintaining the type of division's arguments, yet I also consider the usability of this operation. Nevertheless, changing the interpretation of division is not backwards compatible, so it may need to wait for SPARQL 2.0.</li>
<li>The <code>CONSTRUCT</code> query form could support the <code>GRAPH</code> clause to produce quads. As described above, many tasks with RDF today revolve around quads and not only triples. Being able to construct data with named graphs would be helpful for such tasks.</li>
<li>Would you like SPARQL to be able to <a href="http://kasei.us/archives/2017/03/20/sparql-limit-by-resource">group a cake and have it too</a>? I would. Unfortunately, aggregates in SPARQL 1.1 produce scalar values, while many use cases call for structured collections. As mentioned above, Stardog is already <a href="https://www.stardog.com/blog/extending-the-solution">extending SPARQL solutions</a> to arrays and objects to support such use cases. Grassroots implementations are great, but having this standardized may help avoid SPARQL engines reinventing this particular wheel in many slightly incompatible ways.</li>
<li>Finally, while SPARQL is typically written by people, many applications write SPARQL too. For them having an alternative syntax based on structured data instead of text would enable easier programmatic manipulation of SPARQL. Suggestions for improvement and a detailed analysis of the ways developers cope with the lack of data syntax for SPARQL can be found in my post on <a href="http://blog.mynarz.net/2016/06/on-generating-sparql.html">generating SPARQL</a>.</li>
</ol>
<h1>Publishing temporal RDF data as create/delete event streams</h1>
<p>Today I wondered about publishing temporal data as create/delete event streams. The events can be replayed by a database to produce state valid at a desired time. Use cases for this functionality can be found everywhere. For example, I've been recently working with open data from company registers. When companies relocate, their registered addresses change. It is thus useful to be able to trace back in history to find out what a company's address was at a particular date. When a company register provides only the current state of its data, tracing back to previous addresses is not possible.</p>
<p>Event streams are not a new idea. It corresponds with the well-known <a href="https://msdn.microsoft.com/en-us/library/dn589792.aspx">event sourcing</a> pattern. This pattern was used in RDF too (for example, in <a href="https://github.com/mhausenblas/shodan">shodan</a> by Michael Hausenblas), although it is definitely not widespread. In 2013, I wrote an article about <a href="http://blog.mynarz.net/2013/07/capturing-temporal-dimension-of-linked.html">capturing temporal dimension of linked data</a>. I think most of it holds true to date. In particular, there is still no established way of capturing temporal data in RDF. Event streams might be an option to consider.</p>
<p>Many answers to questions about temporal data in RDF are present in <a href="https://www.w3.org/TR/rdf11-concepts/#h3_change-over-time">RDF 1.1 Concepts</a>, one of the fundamental documents on RDF. For start, <em>"the RDF data model is atemporal: RDF graphs are static snapshots of information."</em> Nevertheless, <em>"a snapshot of the state can be expressed as an RDF graph,"</em> which lends itself as a crude way of representing temporal data through time-indexed snapshots encoded as RDF graphs. There is also the option of reifying everything into observations, which is what statisticians and the <a href="https://www.w3.org/TR/vocab-data-cube">Data Cube Vocabulary</a> do. Alternatively, we can reify everything that happens into events.</p>
<h2 id="events">Events</h2>
<p>Events describe actions on data in a dataset. You can also think about them as database transactions, along with the associated properties. RDF triples are immutable, since <em>"an RDF statement cannot be changed – it can only be added and removed"</em> (<a href="http://www.ontotext.com/documents/publications/2002/TrackingChanges_EKAW02.pdf">source</a>). This is why we need only two types of events:</p>
<ol style="list-style-type: decimal">
<li><code>:Create</code> (addition)</li>
<li><code>:Delete</code> (retraction)</li>
</ol>
<p>Events can be represented as <a href="https://www.w3.org/TR/rdf11-concepts/#dfn-named-graph">named graphs</a>. Each event contains its type, which can be either <code>:Create</code> or <code>:Delete</code>, and timestamps, in particular <em>valid time</em>. Valid time tells us when the event's data is valid in the modelled world. The <code>dcterms:valid</code> property seems good enough to specify the valid time. Events may additionally describe other metadata, such as provenance. For example, <code>dcterms:creator</code> may link the person who created the event data. Encapsulating an event's metadata in its named graph makes it self-contained, but it mixes operational data with data about the described domain, so an alternative worth considering is to store the metadata in a separate graph.</p>
<p>The following example event stream describes that Alice became a member of ACME in 2000, left to work for Umbrella Corp in 2001, and then returned to ACME in 2003. The example is serialized in <a href="https://www.w3.org/TR/trig">TriG</a>, which allows describing quads with named graphs instead of mere triples. You can use this example to test the queries discussed further on.</p>
<pre><code>@prefix : <http://example.com/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:event-1 {
  :event-1 a :Create ;
    dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime .
  :Alice org:memberOf :ACME ;
    schema:name "Alice" .
}
:event-2 {
  :event-2 a :Delete ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .
  :Alice org:memberOf :ACME .
}
:event-3 {
  :event-3 a :Create ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .
  :Alice org:memberOf :UmbrellaCorp .
}
:event-4 {
  :event-4 a :Delete ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .
  :Alice org:memberOf :UmbrellaCorp .
}
:event-5 {
  :event-5 a :Create ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .
  :Alice org:memberOf :ACME .
}</code></pre>
<h2 id="limitations">Limitations</h2>
<p>Describing event streams in the afore-mentioned way has some limitations. One of the apparent issues is the volume of data that is needed to encode seemingly simple facts. There are several ways to deal with this. Under the hood, RDF stores may implement structural sharing as in <a href="https://en.wikipedia.org/wiki/Persistent_data_structure">persistent data structures</a> to avoid duplicating substructures present across multiple events. We can also make several assumptions that save space. <code>:Create</code> can be made the default event type, so that it doesn't need to be provided explicitly. In some limited cases, we can assume that valid time is the same as the transaction time. For example, in some countries, public contracts become valid only after they are published.</p>
<p>Another limitation of this approach is that it doesn't support blank nodes. You have to know the IRIs of the resources you want to describe.</p>
<p>Since named graphs are claimed for events, they cannot be used to distinguish datasets, as they typically are. Datasets need to be distinguished as <a href="https://www.w3.org/TR/sparql11-query/#rdfDataset">RDF datasets</a>. Having multiple datasets may hence mean having multiple SPARQL endpoints. Cross-dataset queries then have to be federated, or alternatively, current snapshots of the queried datasets can be loaded into a single RDF store as named graphs.</p>
<h2 id="integrity-constraints">Integrity constraints</h2>
<p>To illustrate properties of the proposed event representation, we can define integrity constraints that the event data must satisfy.</p>
<p>Union of delete graphs must be a subset of the union of create graphs. You cannot delete non-existent data. The following ASK query must return <code>false</code>:</p>
<pre><code>PREFIX : <http://example.com/>
ASK WHERE {
  GRAPH ?delete {
    ?delete a :Delete .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?create {
      ?create a :Create .
      ?s ?p ?o .
    }
  }
}</code></pre>
<p>Each event graph must contain its type. The following ASK query must return <code>true</code> for each event:</p>
<pre><code>ASK WHERE {
  GRAPH ?g {
    ?g a [] .
  }
}</code></pre>
<p>The event type can be either <code>:Create</code> or <code>:Delete</code>. The following ASK query must return <code>true</code> for each event:</p>
<pre><code>PREFIX : <http://example.com/>
ASK WHERE {
  VALUES ?type {
    :Create
    :Delete
  }
  GRAPH ?g {
    ?g a ?type .
  }
}</code></pre>
<p>Events cannot have multiple types. The following ASK query must return <code>false</code>:</p>
<pre><code>ASK WHERE {
  {
    SELECT ?g
    WHERE {
      GRAPH ?g {
        ?g a ?type .
      }
    }
    GROUP BY ?g
    HAVING (COUNT(?type) > 1)
  }
}</code></pre>
<h2 id="querying">Querying</h2>
<p>Querying over event streams is naturally more difficult than querying reconciled dataset snapshots. Nevertheless, the complexity of the queries may be hidden behind a proxy offering a more convenient syntax that extends SPARQL. An easy way to try the following queries is to use Apache Jena's <a href="https://jena.apache.org/documentation/serving_data">Fuseki</a> with an in-memory dataset loaded from the example event stream above: <code>./fuseki-server --file data.trig --update /ds</code>.</p>
<p>Queries over the default graph, defined as the union of all graphs, query what has been true at some point in time:</p>
<pre><code>CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  # Fuseki uses <urn:x-arq:UnionGraph> to denote the union graph,
  # unless tdb:unionDefaultGraph is set to true.
  # (https://jena.apache.org/documentation/tdb/assembler.html#union-default-graph)
  GRAPH <urn:x-arq:UnionGraph> {
    ?s ?p ?o .
  }
}</code></pre>
<p>Current valid data is a subset of the <code>:Create</code> graphs without the triples in the subsequent <code>:Delete</code> graphs:</p>
<pre><code>PREFIX : <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
      dcterms:valid ?createValid .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?delete {
      ?delete a :Delete ;
        dcterms:valid ?deleteValid .
      FILTER (?deleteValid > ?createValid)
      ?s ?p ?o .
    }
  }
}</code></pre>
<p>We can also roll back and query data at a particular moment in time. This functionality is what <a href="http://www.datomic.com">Datomic</a> provides as the <a href="http://docs.datomic.com/filters.html#as-of"><code>asOf</code></a> filter. For instance, the data valid on January 1, 2001, at 9:00 is the union of the <code>:Create</code> events preceding this instant without the <code>:Delete</code> events that followed them until the chosen time:</p>
<pre><code>PREFIX : <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
      dcterms:valid ?validCreate .
    FILTER (?validCreate < "2001-01-01T09:00:00Z"^^xsd:dateTime)
    ?s ?p ?o .
  }
  MINUS {
    GRAPH ?delete {
      ?delete a :Delete ;
        dcterms:valid ?validDelete .
      FILTER (?validDelete < "2001-01-01T09:00:00Z"^^xsd:dateTime)
      ?s ?p ?o .
    }
  }
}</code></pre>
<h2 id="event-resolution-proxy">Event resolution proxy</h2>
<p>Manipulation of event streams following the proposed representation can be simplified by an event resolution proxy. This proxy may be based on the <a href="https://www.w3.org/TR/sparql11-http-rdf-update">SPARQL 1.1 Graph Store HTTP Protocol</a>, which provides a standard way to work with named graphs. However, the Graph Store Protocol doesn't support quad-based RDF formats, so the proxy needs to partition multi-graph payloads into several transactions.</p>
<p>The proxy can provide several conveniences. It can prune event payloads by removing retractions of non-existent triples or additions of existing triples, or by dropping complete events if found redundant. It can automatically add transaction time; for example by using <code>BIND (now() AS ?transactionTime)</code> in SPARQL. Simplifying even further, the proxy can automatically mint event identifiers as URNs produced by the <a href="https://www.w3.org/TR/sparql11-query/#func-uuid"><code>uuid()</code></a> function in SPARQL. No event metadata can be provided explicitly in such a case, although some metadata may be created automatically. The event type can be inferred from the HTTP method the proxy receives. HTTP PUT may correspond with the <code>:Create</code> type, while HTTP DELETE should indicate the <code>:Delete</code> type. Additionally, the proxy can assume that valid time is the same as transaction time.</p>
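<p>To make this more concrete, below is a minimal sketch of an update such a proxy might issue upon receiving an HTTP PUT, reusing the example data from above. The payload triple and the assumption that valid time equals transaction time are illustrative only.</p>
<pre><code>PREFIX : <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX org: <http://www.w3.org/ns/org#>

INSERT {
  GRAPH ?event {
    # The :Create type is inferred from the HTTP PUT method.
    ?event a :Create ;
      # Valid time is assumed to equal the transaction time.
      dcterms:valid ?transactionTime .
    # The payload received by the proxy.
    :Alice org:memberOf :ACME .
  }
}
WHERE {
  # Mint the event identifier and record the transaction time.
  BIND (uuid() AS ?event)
  BIND (now() AS ?transactionTime)
}</code></pre>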
<h2 id="publishing">Publishing</h2>
<p>Create/delete event streams can be effectively split into batches by time intervals, suggesting several ways of publishing such data. An event stream should be published as an updated append-only quad dump. Additionally, there may be quad dumps of events from shorter periods of time, such as a day or month, to enable more responsive data syndication. The currently valid dataset may be materialized and published as a periodically updated dump. Instead of updating the current dataset in place, it may be published in snapshots. A snapshot from a given date can be used as a basis when replaying events, so that you don't have to replay the whole history of events, but only those that came after the snapshot. Any quad-based RDF serialization, such as <a href="https://www.w3.org/TR/trig">TriG</a> or <a href="https://www.w3.org/TR/n-quads">N-Quads</a>, will do for the dumps. Finally, in the absence of structural sharing, the dumps should be compressed to avoid size bloat caused by duplication of shared data structures.</p>
<p>The next challenge is to refine and test this idea. We can also wrap the event streams in a convenient abstraction that reduces the cognitive effort that comes with their manipulation. I think this is something developers of RDF stores can consider including in their products.</p>
<h1>Basic fusion of RDF data in SPARQL</h1>
<p>A need to fuse data often arises when you combine multiple datasets. The combined datasets may contain descriptions of the same things that are given different identifiers. If possible, the descriptions of the same thing should be fused into one to simplify querying over the combined data. However, the problem with different co-referent identifiers also appears frequently in a single dataset. If a thing does not have an identifier, then it must be referred to by its description. Likewise, if the dataset's format does not support using identifiers as links, then things must also be referred to by their descriptions. For example, a company referenced from several public contracts as their supplier may have a registered legal entity number, yet its description is duplicated in each awarded contract instead of linking the company by its identifier due to the limitations of the format storing the data, such as CSV.</p>
<p>Fusing descriptions of things is a recurrent task both in integration of multiple RDF datasets and in transformations of non-RDF data to RDF. Since fusion of RDF data can be complex, there are dedicated data fusion tools, such as <a href="http://sieve.wbsg.de">Sieve</a> or <a href="http://mifeet.github.io/LD-FusionTool">LD-FusionTool</a>, that can help formulate and execute intricate fusion policies. However, in this post I will deal with basic fusion of RDF data using the humble <a href="https://www.w3.org/TR/sparql11-update">SPARQL 1.1 Update</a>, which is readily available in most RDF stores and many ETL tools for processing RDF data, such as <a href="http://etl.linkedpipes.com">LinkedPipes-ETL</a>. Moreover, basic data fusion is widely applicable in many scenarios, which is why I wanted to share several simple ways of approaching it.</p>
<h2>Content-based addressing</h2>
<p>In the absence of an external identifier, a thing can be identified with a <a href="https://www.w3.org/TR/rdf11-concepts/#section-blank-nodes">blank node</a> in RDF. Since blank nodes are local identifiers and no two blank nodes are the same, using them can eventually lead to proliferation of aliases for equivalent things. One practice that ameliorates this issue is content-based addressing. Instead of identifying a thing with an arbitrary name, such as a blank node, its name is derived from its description; usually by applying a hash function. This turns the “Web of Names” into the <a href="http://joearms.github.io/2015/03/12/The_web_of_names.html">Web of Hashes</a>. Using hash-based IRIs for naming things in RDF completely sidesteps having to fuse aliases with the same description. This is how you can rewrite blank nodes to hash-based IRIs in SPARQL Update and thus merge duplicate data:</p>
<script src="https://gist.github.com/jindrichmynarz/25df5bf262530806bc7b98fa688f70df.js"></script>
<p>In practice, you may want to restrict the renamed resources to those that feature some minimal description that makes them distinguishable. Instead of selecting all blank nodes, you can select those that match a specific graph pattern. This way, you can avoid merging underspecified resources. For example, the following two addresses, for which we only know that they are located in the Czech Republic, are unlikely to be the same:</p>
<pre><code>
@prefix : <http://schema.org/> .
[ a :PostalAddress ;
  :addressCountry "CZ" ] .
[ a :PostalAddress ;
  :addressCountry "CZ" ] .
</code></pre>
<p>More restrictive graph patterns also work to your advantage in the case of larger datasets. By lowering the complexity of your SPARQL updates, they reduce the chance of you running into out-of-memory errors or timeouts.</p>
<h2>Hash-based fusion</h2>
<p>Hashes can be used as keys not only for blank nodes. Using SPARQL, we can select resources satisfying a given graph pattern and fuse them based on their hashed descriptions. Let's have a tiny sample dataset that features duplicate resources:</p>
<script src="https://gist.github.com/jindrichmynarz/e2da7d0b3476fb525c7dfe94457a7f58.js"></script>
<p>If you want to merge equivalent resources instantiating class <tt>:C</tt> (i.e. <tt>:r1</tt>, <tt>:r2</tt>, and <tt>:r5</tt>), you can do it using the following SPARQL update:</p>
<script src="https://gist.github.com/jindrichmynarz/c034e879775fbb3b4f55af35f3f57342.js"></script>
<p>The downside of this method is that the order of bindings in <tt>GROUP_CONCAT</tt> cannot be set explicitly, nor is it guaranteed to be deterministic. In theory, you may get different concatenations for the same set of bindings. In practice, RDF stores typically concatenate bindings in the same order, which makes this method work.</p>
<h2>Fusing subset descriptions</h2>
<p>If we fuse resources by hashes of their descriptions, only those with the exact same descriptions are fused. Resources that differ in a value or are described with different properties will not get fused together, because they will have distinct hashes. Nevertheless, we may want to fuse resources with a resource that is described by a superset of their descriptions. For example, we may want to merge the following blank nodes, since the description of the first one is a subset of the second one's description:</p>
<pre><code>
@prefix : <http://schema.org/> .
[ a :Organization ;
  :name "ACME Inc." ] .
[ a :Organization ;
  :name "ACME Inc." ;
  :description "The best company in the world."@en ] .
</code></pre>
<p>Resources with subset descriptions can be fused in SPARQL Update using double negation:</p>
<script src="https://gist.github.com/jindrichmynarz/8fda968debb1d3a3873bcf3a41ba37c5.js"></script>
<p>The above-mentioned caveats apply in this case too, so you can use a more specific graph pattern to avoid merging underspecified resources. The update may execute several rewrites until reaching the largest superset, which makes it inefficient and slow.</p>
<h2>Key-based fusion</h2>
<p>If you want to fuse resources with unequal descriptions that are not all subsets of one resource's description, a key to identify the resources to fuse must be defined. Keys can be simple, represented by a single <a href="https://www.w3.org/TR/owl-ref/#InverseFunctionalProperty-def">inverse functional property</a>, or compound, represented by a combination of properties. For instance, it may be reasonable to fuse the following resources on the basis of shared values for the properties <tt>rdf:type</tt> and <tt>:name</tt>:</p>
<pre><code>
@prefix : <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1960-01-01"^^xsd:date ;
  :email "contact@acme.com" ;
  :description "The worst company in the world."@en ] .
[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1963-01-01"^^xsd:date ;
  :description "The best company in the world."@en ] .
</code></pre>
<p>To fuse resources by key, we group them by the key properties, select one of them, and rewrite the others to the selected one:</p>
<script src="https://gist.github.com/jindrichmynarz/2ee1cb5ae56a2ca92297aa633443f13b.js"></script>
<p>If we fuse the resources in the example above, we can get the following:</p>
<pre><code>
@prefix : <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[ a :Organization ;
  :name "ACME Inc." ;
  :foundingDate "1960-01-01"^^xsd:date, "1963-01-01"^^xsd:date ;
  :email "contact@acme.com" ;
  :description "The best company in the world."@en, "The worst company in the world."@en ] .
</code></pre>
<p>This example illustrates how fusion highlights problems in data, including conflicting values of <a href="https://www.w3.org/TR/owl-ref/#FunctionalProperty-def">functional properties</a>, such as the <tt>:foundingDate</tt>, or contradicting values of other properties, such as the <tt>:description</tt>. However, resolving these conflicts is a complex task that is beyond the scope of this post.</p>
<h2>Conclusion</h2>
<p>While I found the presented methods for data fusion to be applicable to a variety of datasets, they may fare worse for complex or large datasets. Besides the concern of correctness, one has to weigh the concern of performance. Based on my experience so far, the hash-based and key-based methods are usually remarkably performant, while the methods featuring double negation are not. Nonetheless, the SPARQL updates from this post can oftentimes simply be copied and pasted and work out of the box (after tweaking the graph patterns selecting the fused resources).</p>
<h1>On generating SPARQL</h1>
<p>The question of how to generate SPARQL comes up so often in my work that I figured I would attempt a summary of the different approaches that answer it, none of which is established enough to be considered a best practice. The lack of a well-established method for generating SPARQL may have arisen from the fact that SPARQL is serialized to strings. While its string-based format may be convenient to write by hand, it is less convenient to generate through code. To make SPARQL more amenable to programmatic manipulation, several approaches have been devised, some of which I will cover in this post.</p>
<p>Opting for strings or structured data to represent a data manipulation language is a fundamental design decision. Dumbing the decision down, there is a usability trade-off to be made: either adopt a string representation to ease manual authoring or go for a representation using data to ease programmatic manipulation. Can we nevertheless have the best of both options?</p>
<p>SPARQL continues in the string-based tradition of SQL; possibly leveraging a superficial familiarity between the syntaxes. In fact, SPARQL <a href="https://github.com/arachne-framework/architecture/blob/master/adr-003-config-implementation.md#tradeoffs-for-arachne-with-mitigations">was recently assessed</a> as <em>“a string-based query language, as opposed to a composable data API.”</em> This assessment implicitly reveals that there is a demand for languages represented as structured data, such as the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">Elasticsearch query DSL</a> serialized in JSON or the <a href="http://docs.datomic.com/query.html">Datomic query syntax</a>, which is serialized in <a href="https://github.com/edn-format/edn">EDN</a>.</p>
<p>To illustrate the approaches for generating SPARQL I decided to show how they fare on an example task. The chosen example should be simple enough, yet realistic, and such that it demonstrates the common problems encountered when generating SPARQL.</p>
<h2>Example</h2>
<p>Let's say you want to know all the people (i.e. instances of <tt>foaf:Person</tt>) known to <a href="http://dbpedia.org">DBpedia</a>. There are 1.8 million such persons, which is way too many to fetch in a single SPARQL query. In order to avoid overload, DBpedia's SPARQL endpoint is configured to provide at most 10 thousand results. Since we want to get all the results, we need to use paging via <tt>LIMIT</tt> and <tt>OFFSET</tt> to partition the complete results into smaller parts, such that one part can be retrieved within a single query.</p>
<p>Paging requires a stable sort order over the complete collection of results. However, sorting a large collection of RDF resources is an expensive operation. If a collection's size exceeds a pre-configured limit, Virtuoso requires the queries paging over this collection to use <a href="http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsparqlimplementationextentadvices">scrollable cursors</a> (see the section <em>“Example: Prevent Limits of Sorted LIMIT/OFFSET query”</em>), which basically wrap an ordered query into a subquery in order to better leverage the temporary storage of the sorted collection. Because of the number of persons in DBpedia we need to apply this technique to our query.</p>
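<p>A minimal sketch of this pattern, paging over persons ordered by their IRIs, might look like the following. This is a simplified illustration only; the full example query below additionally retrieves the properties we are interested in.</p>
<pre><code>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person
WHERE {
  # The ordered collection is produced in a subquery...
  {
    SELECT ?person
    WHERE {
      ?person a foaf:Person .
    }
    ORDER BY ?person
  }
}
# ...and paged over by the outer query.
LIMIT 10000
OFFSET 20000
</code></pre>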
<p>Let's say that for each person we want to get values of several required properties and some optional properties. For example, we may want to get names (<tt>foaf:name</tt>) and birth dates (<tt>dbo:birthDate</tt>) and, optionally, dates of death (<tt>dbo:deathDate</tt>). Since persons can have multiple names and we want only one name per person, we need to use <a href="https://www.w3.org/TR/sparql11-query/#defn_aggSample"><tt>SAMPLE</tt></a> with the names to retrieve a single random name associated with each person. We could assume that a person has no more than one birth date and death date, but in fact in DBpedia there are 31 thousand persons with multiple birth dates and 12 thousand persons with multiple death dates, so we also need to use <tt>SAMPLE</tt> for these properties. Considering all these requirements, our SPARQL query may look like the following.</p>
<script src="https://gist.github.com/jindrichmynarz/f68c9db18c19d44e041966821fbbbdec.js"></script>
<p>We need the <tt>LIMIT</tt> and <tt>OFFSET</tt> in this query to be generated dynamically, as well as the listing of the required and the optional properties. In the following, I will cover and compare several approaches for generating SPARQL queries using these parameters. All the approaches are illustrated by examples in <a href="http://clojure.org">Clojure</a> to make them better comparable.</p>
<h2>String concatenation</h2>
<p>The approach that is most readily at hand for developers is string concatenation. While it is simple to start with, it intertwines SPARQL with code, which makes it brittle and error-prone. Convoluted manual escaping may be needed, and the result is particularly messy in programming languages that lack support for multi-line strings, such as JavaScript. Here is an example implementation using string concatenation to generate the above-mentioned query.</p>
<script src="https://gist.github.com/jindrichmynarz/a921a315665f17cc64faa9ee6431c89f.js"></script>
<h2>Parameterized queries</h2>
<p>If queries to be generated differ only in a few variables, they can be generated from a parameterized query, which represents a generic template that can be filled with specific parameter values at runtime. Parameterized queries make it possible to split the static and dynamic parts of the generated queries. While the static query template is represented in SPARQL and can be stored in a separate file, the dynamic parts can be represented using a programming language's data types and passed to the template at runtime. The separation provides both the readability of SPARQL and the expressiveness of programming languages. For example, template parameters can be statically typed in order to improve error reporting during development or avoid some <a href="http://www.morelab.deusto.es/code_injection/">SPARQL injection attacks</a> in production.</p>
<p>Although parameterized queries improve on string concatenation and are highlighted among the <a href="http://patterns.dataincubator.org/book/parameterised-query.html">linked data patterns</a>, they are limited. In particular, I will discuss the limitations of parameterized queries as implemented in Apache Jena by <a href="https://jena.apache.org/documentation/query/parameterized-sparql-strings.html">ParameterizedSparqlString</a>. While the main drawbacks of parameterized queries are true for other implementations as well, the details may differ. In Jena's parameterized queries only variables can be dynamic. Moreover, each variable can be bound only to a single value. For example, for the pattern <tt>?s a ?class .</tt> we cannot bind <tt>?class</tt> to <tt>schema:Organization</tt> and <tt>schema:Place</tt> to produce <tt>?s a schema:Organization, schema:Place .</tt>. If we provide multiple bindings for a variable, only the last one is used. Queries that cannot restrict their dynamic parts to variables can escape to using <a href="https://jena.apache.org/documentation/query/parameterized-sparql-strings.html#buffer-usage">string buffers</a> to append arbitrary strings to the queries, but doing so gives you the same problems string concatenation has. Due to these restrictions we cannot generate the example query using this approach. Here is a partial implementation that generates only the limit and offset.</p>
<script src="https://gist.github.com/jindrichmynarz/6b45f43f10a5e77313bffdc2c4d60301.js"></script>
<p>Jena also provides a similar approach closer to <a href="https://en.wikipedia.org/wiki/Prepared_statement">prepared statements in SQL</a>. When executing a query (via <tt>QueryExecutionFactory</tt>) you can provide it with pre-bound variables (via <tt>QuerySolutionMap</tt>). Similar restrictions to those discussed above apply. Moreover, your template must be a syntactically valid SPARQL query or update. In turn, this prohibits generating <tt>LIMIT</tt> numbers, because <tt>LIMIT ?limit</tt> is not valid SPARQL syntax. The following implementation thus does not work.</p>
<script src="https://gist.github.com/jindrichmynarz/6c7fcc653e1eec8e2e0e6869fadcd50f.js"></script>
<h2>Templating</h2>
<p>If we need higher expressivity for generating SPARQL, we can use a templating language. Templating is not restricted to the syntax of SPARQL, because it treats SPARQL as arbitrary text, so any manipulation is allowed. As is the case with parameterized queries, templating makes it possible to separate the static template from the dynamic parts of SPARQL. Unlike parameterized queries, a template is not represented in pure SPARQL, but in a mix of SPARQL and a templating language. This recalls some of the shortcomings of interleaving SPARQL with code in string concatenation. Moreover, templates generally do not constrain their input, for example by declaring types, so any data can be passed into a template without being checked in advance. Notable exceptions that allow declaring the types of template data are available in Scala; for example <a href="http://scalatra.org/2.4/guides/views/scalate.html#toc_337">Scalate</a> or <a href="https://www.playframework.com/documentation/2.4.x/ScalaTemplates#template-parameters">Twirl</a>.</p>
<p>In order to show how to generate SPARQL via templating I adopted <a href="https://mustache.github.io/">Mustache</a>, which is a minimalistic and widely known templating language implemented in many programming languages. The following is a Mustache template that can generate the example query.</p>
<script src="https://gist.github.com/jindrichmynarz/1a54943db87c94e0ad561b36169d9d51.js"></script>
<p>Rendering this template requires only a little code that provides the template with input data.</p>
<script src="https://gist.github.com/jindrichmynarz/081d4282346d8750efc2b0a53ad4c989.js"></script>
<h2>Domain-specific languages</h2>
<p>As the string concatenation approach shows, SPARQL can be built using programming language constructs.
Whereas string concatenation operates on the low level of text manipulation, programming languages can be used to create constructs operating on a higher level closer to SPARQL. In this way, programming languages can be built up to domain-specific languages (DSLs) that compile to SPARQL. DSLs retain the expressivity of the programming languages they are defined in, while providing a syntax closer to SPARQL, thus reducing the cognitive overhead when translating SPARQL into code. However, when using or designing DSLs, we need to be careful about the potential clashes between names in the modelled language and the programming language the DSL is implemented in. For example, <tt>concat</tt> is used both in SPARQL and Clojure. Conversely, if a DSL lacks a part of the modelled language, <a href="http://c2.com/cgi/wiki?EscapeHatch">escape hatches</a> may be needed, regressing back to string concatenation.</p>
<p>Lisps, Clojure included, are uniquely positioned to serve as languages for defining DSLs. Since Lisp code is represented using data structures, it is easier to manipulate than languages represented as strings, such as SPARQL.</p>
<p><a href="https://github.com/boutros/matsu">Matsu</a> is a Clojure library that provides a DSL for constructing SPARQL via macros. Macros are expanded at compile time, so when they generate SPARQL they generally cannot access data that becomes available only at runtime. To a limited degree it is possible to work around this limitation by invoking the Clojure reader at runtime. Moreover, since Matsu is built using macros, we need to use macros to extend it. An example of such approach is its in-built the <a href="https://github.com/boutros/matsu#queries-with-arguments"><tt>defquery</tt> macro</a> that allows to pass parameters into a query template. Nevertheless, mixing macros with runtime data quickly becomes convoluted, especially if larger parts of SPARQL need to be generated dynamically.</p>
<p>If we consider using Matsu for generating the example query, we discover several problems that prevent us from accomplishing the desired outcome, apart from the already mentioned generic issues of macros. For instance, Matsu does not support <a href="https://github.com/boutros/matsu/blob/master/doc/sparql_spec.md#12-subqueries">subqueries</a>. Defining subqueries separately and composing them as subqueries via <a href="https://github.com/boutros/matsu#interpolating-raw-strings-in-the-query"><tt>raw</tt> input</a> is also not possible, because Matsu queries contain prefix declarations, which are syntactically invalid in subqueries. Ultimately, the farthest I was able to get with Matsu for the example query was merely the innermost subquery.</p>
<script src="https://gist.github.com/jindrichmynarz/c782862283b67071b802dc518d937dec.js"></script>
<p>Query DSLs in object-oriented languages are often called <em>query builders</em>. For example, Jena provides a <a href="https://jena.apache.org/documentation/extras/querybuilder/index.html">query builder</a> that allows building SPARQL by manipulating Java objects. The query builder is deeply rooted in the Jena object model, which provides some type checking at the expense of a more verbose syntax. Since Clojure <a href="http://clojure.org/reference/java_interop">allows calling Java directly</a>, implementing the example query using the query builder is straightforward.</p>
<script src="https://gist.github.com/jindrichmynarz/e0c09d1b214de6be41665784625863c0.js"></script>
<p>While Matsu represents queries via <em>macros</em> and Jena's query builder does so via <em>code</em>, there is another option: representing queries via <em>data</em>. Using a programming language's native data structures for representing SPARQL provides arguably the best facility for programmatic manipulation. Data is transparent at runtime and as such it can be easily composed and inspected. In fact, a widespread Clojure design rule is to <a href="http://www.lispcast.com/data-functions-macros-why">prefer functions over macros and data over functions</a>. An example of using data to represent a SPARQL-like query language in Clojure is the <a href="https://github.com/thi-ng/fabric/blob/master/fabric-facts/src/dsl.org">Fabric DSL</a>. While this DSL is not exactly SPARQL, it is <em>“highly inspired by the W3C SPARQL language, albeit expressed in a more Clojuresque way and not limited to RDF semantics”</em> (<a href="https://github.com/thi-ng/fabric/blob/master/fabric-facts/src/dsl.org#introduction">source</a>).</p>
<h2>SPIN RDF</h2>
<p>An approach that uses data in RDF for representing SPARQL is <a href="http://spinrdf.org">SPIN RDF</a>. It offers an <a href="http://spinrdf.org/sp.html">RDF syntax for SPARQL</a> and an <a href="http://topbraid.org/spin/api/">API</a> for manipulating it. While the translation of SPARQL to RDF is for the most part straightforward, one of its more intricate parts is using <a href="https://www.w3.org/TR/rdf-schema/#ch_collectionvocab">RDF collections</a> for maintaining order in triple patterns or projected bindings, because the collections are difficult to manipulate in SPARQL.</p>
<p>Nonetheless, SPIN RDF seems to have a fundamental problem with passing dynamic parameters from code. From what I can tell, the membrane between SPIN RDF and code is impermeable. It would seem natural to manipulate SPIN RDF via SPARQL Update. However, how can you pass data to the SPARQL Update from your code? If you adopt SPIN RDF wholesale, your SPARQL Update operation is represented in RDF, so you have the same problem. Passing data from code to SPIN RDF thus results in a recursive paradox. Although I tried hard, I have not found a solution to this conundrum in the SPIN RDF documentation, nor in the source code of the SPIN API.</p>
<p>This is how the example query can be represented using SPIN RDF; albeit using fixed values in place of the dynamic parts due to the limitations discussed above.</p>
<script src="https://gist.github.com/jindrichmynarz/552c2dccbfb95f175367bd78b9d11600.js"></script>
<p>Rendering SPIN RDF to SPARQL can be implemented using the following code.</p>
<script src="https://gist.github.com/jindrichmynarz/3b88063468ab6436f5fa177e79d68dbe.js"></script>
<p>I have found a way to generate dynamic SPARQL queries in SPIN RDF using <a href="https://www.w3.org/TR/json-ld/">JSON-LD</a>. JSON-LD can be represented by data structures, such as hash maps or arrays, that are available in most programming languages. This representation can be serialized to JSON that can be interpreted as RDF using the JSON-LD syntax. SPIN RDF can be in turn translated as SPARQL, obtaining our desired outcome. As may be apparent from this workflow, crossing that many syntaxes (Clojure, JSON-LD, RDF, SPIN, and SPARQL) requires large cognitive effort due to the mappings between the syntaxes one has to be aware of when authoring SPARQL in this way. Here is an implementation of this approach for the example query.</p>
<script src="https://gist.github.com/jindrichmynarz/13f8bf258921e924957965270a114863.js"></script>
<h2>SPARQL algebra</h2>
<p>As previously mentioned, a problematic part of SPIN RDF is its use of RDF collections for representing order. The documentation of Apache Jena recognizes this, saying that <em>“RDF itself is often the most appropriate way to do this, but sometimes it isn't so convenient. An algebra expression is a tree, and order matters.”</em> (<a href="https://jena.apache.org/documentation/notes/sse.html#need">source</a>). The documentation talks about <a href="https://www.w3.org/TR/sparql11-query/#sparqlAlgebra">SPARQL algebra</a>, which formalizes the low-level algebraic operators into which SPARQL is compiled. Instead of using RDF, Jena represents SPARQL algebra in <a href="https://jena.apache.org/documentation/notes/sse.html">s-expressions</a> (SSE), which are commonly used in programming languages based on Lisp, such as <a href="https://en.wikipedia.org/wiki/Scheme_(programming_language)">Scheme</a>. In fact, the <em>“SSE syntax is almost valid Scheme”</em> (<a href="https://jena.apache.org/documentation/notes/sse.html#lisp">source</a>), but the SSE documentation acknowledges that Lisp <em>“lacks convenient syntax for the RDF terms themselves”</em> (<a href="https://jena.apache.org/documentation/notes/sse.html#lisp">source</a>).</p>
<p>In order to see how our example query looks in SSE we can use <a href="https://jena.apache.org/documentation/query/cmds.html">Jena's command-line tools</a> and invoke <tt>qparse --print=op --file query.rq</tt> to convert the query into SSE. The following is the result we get.</p>
<script src="https://gist.github.com/jindrichmynarz/d523a1b03df565609adfafb120bbca6d.js"></script>
<p>If SSEs were valid Clojure data structures, we could manipulate them as data and then <a href="https://jena.apache.org/documentation/notes/sse.html#sse-factory">serialize them to SPARQL</a>. Nevertheless, there are minor differences between SSE and the syntax of Clojure. For example, while <tt>?name</tt> and <tt>_:a</tt> are valid symbols in Clojure, absolute IRIs enclosed in angle brackets, such as <tt><http://dbpedia.org/ontology/></tt>, are not. Possibly, these differences can be remedied by using <a href="http://clojure.org/reference/reader#_tagged_literals">tagged literals</a> for RDF terms.</p>
<h2>Conclusions</h2>
<p>I hope this post gave you a flavour of the various approaches for generating SPARQL. There is an apparent impedance mismatch between the current programming languages and SPARQL. While the programming languages operate with data structures and objects, SPARQL must eventually be produced as a string. This mismatch motivates the development of approaches for generating SPARQL, which presents many challenges, some of which I described in this post.</p>
<p>I assessed these approaches on the basis of how they fare on generating an example query using the data from DBpedia. The complete implementations of these approaches are available in <a href="https://github.com/jindrichmynarz/generating-sparql">this repository</a>. Out of the approaches I reviewed, I found four in which it is feasible to generate the example SPARQL query without undue effort:</p>
<ul>
<li>String concatenation</li>
<li>Templating</li>
<li>Jena's query builder DSL</li>
<li>SPIN RDF using JSON-LD</li>
</ul>
<p>My personal favourite that I use for generating SPARQL is templating with Mustache, which appears to mesh best with my brain and the tasks I do with SPARQL. Nonetheless, I am aware of the limitations of this approach and I am on a constant lookout for better solutions, possibly involving rendering SPARQL from data.</p>
<p>While I invested a fair amount of effort into this post, it is entirely possible I might have overlooked something or implemented any of the reviewed approaches in a sub-optimal way, so I would be glad to hear any suggestions on how to improve. In the meantime, while we search for the ideal solution for generating SPARQL, I think the membrane between code and SPARQL will remain only semipermeable.</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com3tag:blogger.com,1999:blog-7413902753408106653.post-49312887653629748522016-03-22T15:46:00.000+01:002016-03-22T15:46:05.299+01:00Academia-driven development<p><small>In this post I present an opinionated (and mostly-wrong) account of programming in academia. It's based in part on my experience as a developer working in academia and in part on conversations I had with fellow developers working in the private sector.</small></p>
<p>In academia, you rarely program even if you are in computer science. Instead, you read papers, write papers, write deliverables for ongoing projects, or write project proposals. You aren't paid to program, you are <a href="http://blog.mynarz.net/2016/03/in-science-form-follows-funding.html">paid to publish</a>. You do only as much programming as is needed to have something to write a paper about. <em>“Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do,”</em> observes <a href="https://www.javacodegeeks.com/2014/05/the-low-quality-of-scientific-code.html">Bozhidar Bozhanov</a>. While programming purists may be dissatisfied with that, <a href="https://bcomposes.wordpress.com/2015/05/07/its-okay-for-academic-software-to-suck/">Jason Baldridge</a> is content with this state of affairs and writes: <em>“For academics, there is basically little to no incentive to produce high quality software, and that is how it should be.”</em></p>
<p>Albert Einstein allegedly said this: <em>“If we knew what it was we were doing, it would not be called research, would it?”</em> While the attribution of this quote is dubious at best, there's a grain of truth in what the quote says. It's natural in research that you often don't know what you work on. I think this is the reason why <a href="https://en.wikipedia.org/wiki/Test-driven_development">test-driven development</a> (TDD) is not truly applicable in research. Programming in research is used to explore new ideas. TDD, on the contrary, requires upfront specification of what you are building. German has the verb ‘basteln’ that stands for DIY fiddling. The word was adopted into the Czech spoken language with a negative connotation of <em>not knowing what you're doing</em>, which I think captures nicely what often happens in academic programming.</p>
<p>The low quality of academic software hinders its maintenance and extensibility in long-term development. For one-off experiments these concerns aren't an issue, but most experiments need to be reproducible. Academic software must make it possible to reproduce and verify the results reported in the publication associated with it. Anyone must be able to re-run the software. It must be open-source, allowing others to scrutinize its inner workings. Unfortunately, it's often the case that academic software isn't released or, when it's made available, it's nigh impossible to run it without asking its creators for assistance.</p>
<p>What's more, the usability of software is hardly ever a concern in academia, in spite of the fact that usable software may attract more citations, thereby increasing the academic prestige of its author. An often-mentioned example of this effect in practice is <a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a>, the <a href="https://scholar.google.cz/citations?view_op=view_citation&citation_for_view=oBu8kMMAAAAJ:UeHWp8X0CEIC">paper of which</a> boasts 1305 citations according to Google Scholar. Indeed, it would be a felicitous turn if we came to regard the usability of academic software as a valuable proxy that increases citation numbers.</p>
<p>A great benefit that comes with reproducible and usable software is extensibility. <a href="http://aclweb.org/anthology/J/J08/J08-3010.pdf">Ted Pedersen</a> argues that there's <em>“a very happy side-effect that comes from creating releasable code—you will be more efficient in producing new work of your own since you can easily reproduce and extend your own results.”</em> Nonetheless, even though software may be both reproducible and usable, extending a code base without tests may be like building on quicksand. This is usually an opportunity for refactoring. For example, the feature to be extended can be first covered with tests that document its expected behaviour, as Nell Shamrell-Harrington suggests in <a href="https://www.youtube.com/watch?v=w6n9YR3o-7Q">surgical refactoring</a>. The subsequent feature extension must not break these tests, unless the expected behaviour should change. I think adopting this approach can do a great good to the continuity of academic development.</p>
<p>Finally, there's also an economic argument to make for ‘poor-quality’ academic software. If software developed in academia achieved production quality, it would constitute competition for software produced in the private sector. Since academia is a part of the public sector, academic endeavours are financed mostly from public funds. Hence such competition with commercial software can be considered unfair. <a href="https://www.i2i.org/articles/12-1993.PDF">Dennis Polhill</a> argues that <em>“unfair competition exists when a government or quasi-government entity takes advantage of its tax exemption and other privileges to supply private goods to the market in competition with private suppliers.”</em> Following this line of thought, the public sector should not subsidize the development of software that is commercially viable and can be built by private companies. Instead of developing working solutions, academia can try and test new prototypes. If released openly, this proof-of-concept work can then be adopted in the private sector and grown into commercial products.</p>
<p>Eventually, when exploring my thoughts on academia-driven development, I realized that I'm torn between settling for the current status quo and pushing for emancipating software with publications. While I'm stuck figuring this out, there are laudable initiatives, such as <a href="http://semdev.org">Semantic Web Developers</a>, which organizes regular conference workshops that showcase semantic web software and incite conversations about the status of software in academia. Let's see how these conversations pan out.</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-71319363980512087372016-03-21T12:33:00.002+01:002016-03-21T12:34:53.200+01:00In science, form follows funding<p><small>Recently, brief Twitter exchanges I had with <a href="https://twitter.com/csarven">@csarven</a> on the subject of <a href="https://twitter.com/search?q=%23LinkedResearch">#LinkedResearch</a> made me want to articulate a longer-form opinion on scientific publishing that no longer fitted a tweet. Be wary though: although this opinion is longer, it still oversimplifies a rather complex matter for the sake of conveying the few key points I have.</small></p>
<p>There's a pervasive belief that <em>“if you can't measure it, you can't manage it”</em>. Not only is this quote often misattributed to <a href="https://en.wikipedia.org/wiki/Peter_Drucker">Peter Drucker</a>, its author <a href="https://en.wikipedia.org/wiki/W._Edwards_Deming">William Edwards Deming</a> actually wrote that <em>“it is wrong to suppose that if you can't measure it, you can't manage it — a costly myth”</em> (The New Economics, 2000, p. 35). Contradicting the misquoted statement, Deming instead suggested that it's possible to manage without measuring. With that being said, he acknowledges that metrics remain an essential input to management.</p>
<p>Since funding is a key instrument of management, metrics influence funding decisions too. Viewed from this perspective, science is difficult to fund because its quality is hard to measure. The difficulty of measuring science is widely recognized, so much so that <a href="https://en.wikipedia.org/wiki/Scientometrics">scientometrics</a> was devised with the purpose of studying how to measure science. Since measuring science directly is difficult, scientometrics found ways to measure scientific publishing, such as citation indices. Though using publishing as a proxy for science comes with an implicit assumption that the quality of scientific publications correlates positively with the quality of science, many are willing to take on this assumption simply for the lack of a better way of evaluating science. The key issue of this approach is that the emphasis on measurability constrains the preferred form of scientific publishing to make measuring it simpler. A large share of scientific publishing is centralized in the hands of a few large publishers who establish a constrained environment that can be measured with less effort. The form of publishing imposes a systemic influence on science. As Marshall McLuhan wrote, <a href="https://en.wikipedia.org/wiki/The_medium_is_the_message">the medium is the message</a>. While in architecture <a href="https://en.wikipedia.org/wiki/Form_follows_function">form follows function</a>, in science, form follows <em>funding</em>.</p>
<p>Measuring distributed publishing on the Web is a harder task, though not an insurmountable one. For instance, Google's PageRank algorithm provides a fair approximation of the influence the documents distributed on the Web have. <a href="http://csarven.ca/linked-research-scholarly-communication">Linked research</a>, which proposes to use the <a href="https://www.w3.org/DesignIssues/LinkedData.html">linked data principles</a> for scientific publishing, may enable to measure science without the cost incurred by centralization of publishing. In fact, I think its proverbial “killer application” may be a measurable index like the <a href="https://en.wikipedia.org/wiki/Science_Citation_Index">Science Citation Index</a>. Indeed, SCI was a great success, and it <em>“did not stem from its primary function as a search engine, but from its use as an instrument for measuring scientific productivity”</em> (<a href="http://www.iec.cat/1jcrc/GarfieldEEvolution.pdf">Eugene Garfield: The evolution of the Science Citation Index, 2007</a>). A question that naturally follows is: how off am I in thinking this?</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-69831186953162834502015-12-08T22:49:00.000+01:002017-02-03T19:21:41.443+01:00Coding dub techno in Ruby using Sonic Pi<p>Dub techno lends itself to code thanks to its formulaic nature. Indeed, <a href="https://thump.vice.com/en_ca/article/a-bullshitters-guide-to-dub-techno">A Bullshitter's Guide to Dub Techno</a> says:</p>
<blockquote><em>“Sadly, a lot of dub techno out there is unbelievably dull — greyscale, unadventurous, utterly and literally generic. It must seem easy to make because after all, all you need is the a submerged kickdrum, a few clanking chords stretched pointlessly out into arching waves of unmoving, unfeeling nothingness, and maybe the odd snatch of tired melodica, snaking around like a cobra that desperately needs to be put out of its misery.”</em></blockquote>
<p>I decided to try to code dub techno in Ruby using <a href="http://sonic-pi.net">Sonic Pi</a>. Sonic Pi is an app for live coding sound. It started as a tool for teaching computer science using <a href="https://www.raspberrypi.org">Raspberry Pi</a> but it works damn well for making a lot of noise. Here's my attempt at coding dub techno in Sonic Pi:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/a1RxpJkvqpY" frameborder="0" allowfullscreen></iframe>
<p>The source code is available <a href="https://gist.github.com/jindrichmynarz/f233f75c719abe6a6c81">here</a>.</p>
<p>My attempt exploits some of the dub techno stereotypes, such as the excessive repetition progressing as I slowly build the sound. As Joanna Demers writes on dub techno and related genres in <a href="https://global.oup.com/academic/product/listening-through-the-noise-9780195387667">Listening through the Noise</a>: <em>“Static music goes nowhere, achieves no goals, does no work, and sounds the same three hours into the work as it did when the work began.”</em> In the case of dub techno, it is an intentionally bare and stripped-down version of techno. It often focuses on the timbre of sound, using modulating synthesizers heavily drenched in reverb and echoes. Demers writes: <em>“Static music is not only music that avoids conventional harmonic or melodic goals but also music that takes specific steps to obscure any sense of the passage of time.”</em> Dub techno keeps melodic or harmonic progressions to a minimum, usually employing single minor chords oscillating through entire tracks. In my code, I use solely the D minor chord, which varies only in chord inversions and octave shifts.</p>
<p>I think that Sonic Pi offers a fluent live coding experience. For example, the nested <tt>with_*</tt> functions (such as <tt>with_fx</tt>) accepting Ruby blocks as arguments provide an intuitive way of representing bottom-up sound processing pipelines. Furthermore, live coding provides a fast feedback loop. Your ears are the tests of your code and you can hear the results of your code immediately.</p>
<p>Overall, I really enjoyed this attempt at dub techno. I would like to thank to <a href="https://github.com/samaaron/sonic-pi/graphs/contributors">Sam Aaron and co.</a> for creating Sonic Pi and I would encourage you to give Sonic Pi a shot.</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com4tag:blogger.com,1999:blog-7413902753408106653.post-20765183220616356652015-05-02T20:52:00.000+02:002015-09-30T12:08:57.181+02:00Curling SPARQL HTTP Graph Store protocol<p><a href="http://www.w3.org/TR/sparql11-http-rdf-update/">SPARQL HTTP Graph Store protocol</a> provides a way of manipulating RDF graphs via HTTP.
Unlike <a href="http://www.w3.org/TR/sparql11-update/">SPARQL Update</a> it does not allow you to work with RDF on the level of individual assertions (triples).
Instead, you handle your data on a higher level of named graphs.
<a href="http://www.w3.org/TR/rdf11-concepts/#dfn-named-graph">Named graph</a> is a pair of a URI and a set of RDF triples.
A set of triples may contain only a single triple, so it is technically possible to manipulate individual triples with the Graph Store protocol, but this way of storing data is not common.
In line with the principles of REST, the protocol defines its operations using HTTP requests.
It covers the familiar CRUD (Create, Read, Update, Delete) operations known from REST APIs.
It is a simple and useful, albeit lesser-known, part of the family of SPARQL specifications.
I have seen software that would have benefited had its developers known this protocol.
This is why I decided to cover it in a post.</p>
<p>Instead of showing the HTTP interactions via the Graph Store protocol in a particular programming language I decided to use <a href="http://curl.haxx.se">cURL</a> as the <em>lingua franca</em> of HTTP.
I discuss how the Graph Store protocol works in 2 implementations: <a href="https://github.com/openlink/virtuoso-opensource">Virtuoso</a> (version 7.2) and <a href="http://jena.apache.org/documentation/fuseki2/">Apache Jena Fuseki</a> (version 2).
By default, you can find a Graph Store endpoint at <tt>http://localhost:8890/sparql-graph-crud-auth</tt> for Virtuoso and at <tt>http://localhost:3030/{dataset}/data</tt> for Fuseki (<tt>{dataset}</tt> is the name of the dataset you configure).
Virtuoso also allows you to use <tt>http://localhost:8890/sparql-graph-crud</tt> for read-only operations that do not require authentication.
The differences between these implementations are minor, since both implement the protocol's specification well.</p>
<p>If you want to follow along with the examples below, an easy option is to <a href="http://jena.apache.org/documentation/fuseki2/#download-fuseki">download</a> the latest version of Fuseki and start it with a disposable in-memory dataset using the shell command <tt>fuseki-server --update --mem /ds</tt> (<tt>ds</tt> is the name of our dataset).
You can use any RDF file as testing data.
For example, you can download DBpedia's description of SPARQL in the Turtle syntax as the file <tt>data.ttl</tt>:</p>
<pre>
curl -L -H "Accept:text/turtle" \
http://dbpedia.org/resource/SPARQL > data.ttl
</pre>
<p>Finally, if any of the arguments you provide to cURL (such as graph URI) contains characters with special meaning in your shell (such as <tt>&</tt>), you need to enclose them in double quotes.
The backslash you see in the example commands is used to escape new lines so that the commands can be split for better readability.</p>
<p>I will now walk through the 4 main operations defined by the Graph Store protocol: creating graphs with the PUT method, reading them using the GET method, adding data to existing graphs using the POST method, and deleting graphs, which can be achieved, quite unsurprisingly, via the DELETE method.</p>
<h2>Create: PUT</h2>
<p>You can load data into an RDF graph using the PUT HTTP method (see the <a href="http://www.w3.org/TR/sparql11-http-rdf-update/#http-put">specification</a>).
This is how you load RDF data from file <tt>data.ttl</tt> to the graph named <tt>http://example.com/graph</tt>:</p>
<table>
<thead>
<tr>
<th>Virtuoso</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X PUT \
--digest -u dba:dba \
-H Content-Type:text/turtle \
-T data.ttl \
-G http://localhost:8890/sparql-graph-crud-auth \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X PUT \
-H Content-Type:text/turtle \
-T data.ttl \
-G http://localhost:3030/ds/data \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<p>The <tt>-T</tt> named argument uploads a given local file, <tt>-H</tt> specifies an HTTP header indicating the content type of the uploaded file, <tt>-G</tt> provides the Graph Store endpoint's URL, and <tt>--data-urlencode</tt> lets you pass in the URI naming the created graph (via the <tt>graph</tt> query parameter).
Since the Graph Store protocol's interface is uniform, most of the other operations use similar arguments.</p>
<p>Virtuoso uses <a href="http://tools.ietf.org/html/rfc2617#section-3">HTTP Digest authentication</a> for write and delete operations (i.e. create, update, and delete).
The example above assumes the default Virtuoso user and password (i.e. dba:dba).
If you fail to provide valid authentication credentials, you will be slapped over your hands with the HTTP 401 Unauthorized status code.
Fuseki does not require authentication by default, but you can configure it using <a href="http://jena.apache.org/documentation/fuseki2/fuseki-security.html">Apache Shiro</a>.</p>
<p>When using Virtuoso, you can leave the <tt>Content-Type</tt> header out, because the data format will be automatically detected, but doing so is not a good idea.
You need to provide it for Fuseki, and if you fail to do so, you will face an HTTP 400 Bad Request response.
Try not to rely on the autodetection being correct and provide the <tt>Content-Type</tt> header explicitly.</p>
<p>If you want to put data into the <a href="http://www.w3.org/TR/rdf11-concepts/#dfn-default-graph">default graph</a>, you can use the <tt>default</tt> query parameter with no value:</p>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X PUT \
-H Content-Type:text/turtle \
-T data.ttl \
-G http://localhost:3030/ds/data \
-d default
</pre>
</td>
</tr>
</tbody>
</table>
<p>If you use Fuseki, you can also omit the <tt>graph</tt> parameter completely to manipulate the default graph.
Nevertheless, this is not standard behaviour, so you should not rely on it.</p>
<p>If you PUT data into an existing non-empty graph, its previous data is replaced.</p>
<h2>Read: GET</h2>
<p>To download data from a given graph, you just issue a GET request (see the <a href="http://www.w3.org/TR/sparql11-http-rdf-update/#http-get">specification</a>).
You can use the option <tt>-G</tt> to perform GET request via cURL:</p>
<table>
<thead>
<tr>
<th>Virtuoso</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -G http://localhost:8890/sparql-graph-crud \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -G http://localhost:3030/ds/data \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<p>Alternatively, you can simply use <tt>curl http://localhost:3030/ds/data?graph=http%3A%2F%2Fexample.com%2Fgraph</tt>, but <tt>-G</tt> allows you to provide the <tt>graph</tt> query parameter separately via <tt>--data-urlencode</tt>, which also takes care of the proper URL-encoding.
You can specify the RDF serialization you want to get the data in via the <tt>Accept</tt> HTTP header.
For example, if you want the data in <a href="http://www.w3.org/TR/n-triples/">N-Triples</a>, you provide the <tt>Accept</tt> header with the MIME type <tt>application/n-triples</tt>:</p>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -H Accept:application/n-triples \
-G http://localhost:3030/ds/data \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<p>Unfortunately, while Fuseki supports the <tt>application/n-triples</tt> MIME type, Virtuoso does not.
Instead, you will have to specify the deprecated MIME type <tt>text/ntriples</tt> (even <tt>text/plain</tt> will work) to get the data in N-Triples.
Since N-Triples serializes each RDF triple on a separate line, you can use it as a naïve way of counting the triples in a graph by piping the data into <tt>wc -l</tt> (<tt>-s</tt> option used to hide the cURL progress bar):</p>
<table>
<thead>
<tr>
<th>Virtuoso</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -s -H Accept:text/ntriples \
-G http://localhost:8890/sparql-graph-crud \
--data-urlencode graph=http://example.com/graph | \
wc -l
</pre>
</td>
</tr>
</tbody>
</table>
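<p>If your triple store also exposes a SPARQL query endpoint, a more robust way to count the triples in a graph is an aggregate query. For example, the following sketch assumes Fuseki's default query endpoint at <tt>http://localhost:3030/ds/query</tt>:</p>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -G http://localhost:3030/ds/query \
  --data-urlencode \
  "query=SELECT (COUNT(*) AS ?count)
         WHERE { GRAPH <http://example.com/graph> { ?s ?p ?o } }"
</pre>
</td>
</tr>
</tbody>
</table>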
<p>If a graph named by the requested URI does not exist, you will get an HTTP 404 Not Found response.</p>
<h2>Update: POST</h2>
<p>If you want to add data to an existing graph, use the POST method (see the <a href="http://www.w3.org/TR/sparql11-http-rdf-update/#http-post">specification</a>).
In case you POST data to a non-existent graph, it will be created just as if you had used the PUT method.
The difference between POST and PUT is that when you send data to an existing graph, POST will merge it with the graph's current data, while PUT will replace it.</p>
<table>
<thead>
<tr>
<th>Virtuoso</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X POST \
--digest -u dba:dba \
-H Content-Type:text/turtle \
-T data.ttl \
-G http://localhost:8890/sparql-graph-crud-auth \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X POST \
-H Content-Type:text/turtle \
-T data.ttl \
-G http://localhost:3030/ds/data \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<p>It is worth knowing how triples are merged during this operation.
When you POST data to a non-empty graph, the current set of triples the graph is associated with will be merged with the set of triples from the uploaded data via set union.
In most cases, if these two sets share any triples, they will not be duplicated.
However, if the shared triples contain <a href="http://www.w3.org/TR/rdf11-concepts/#section-blank-nodes">blank nodes</a>, they will be duplicated because, due to their local scope, blank nodes from different datasets are always treated as distinct.
For example, if you repeatedly POST the same triples containing blank nodes to the same graph, the first time its size will increase by the number of posted triples, but on the second and subsequent POSTs the size of the graph will increase by the number of triples containing blank nodes.
This can be one of the reasons why you may want to avoid using blank nodes.</p>
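<p>To illustrate this behaviour with made-up data, suppose <tt>data.ttl</tt> contains a single triple with a blank-node subject:</p>
<pre>
@prefix : <http://schema.org/> .

[ :name "ACME Inc." ] .
</pre>
<p>POSTing this file to the same graph twice leaves the graph with two such triples, each with a distinct blank node, whereas the same data with an IRI subject would not grow the graph on the second POST.</p>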
<h2>Delete: DELETE</h2>
<p>Unsurprisingly, deleting graphs is achieved using the DELETE method (see the <a href="http://www.w3.org/TR/sparql11-http-rdf-update/#http-delete">specification</a>).
As you may expect by now, if you attempt to delete a non-existent graph, you will get an HTTP 404 Not Found response.</p>
<table>
<thead>
<tr>
<th>Virtuoso</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X DELETE \
--digest -u dba:dba \
-G http://localhost:8890/sparql-graph-crud-auth \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -X DELETE \
-G http://localhost:3030/ds/data \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<h2>Other methods</h2>
<p>As in the HTTP specification, there are other methods defined in the Graph Store protocol.
An example of such a method is HEAD, which can be used to test whether a graph exists.
cURL allows you to issue a HEAD request using the <tt>-I</tt> option:</p>
<table>
<thead>
<tr>
<th>Fuseki</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<pre>
curl -I \
-G http://localhost:3030/ds/data \
--data-urlencode graph=http://example.com/graph
</pre>
</td>
</tr>
</tbody>
</table>
<p>If the graph exists, you will receive an HTTP 200 OK status code.
Otherwise, you will once again see a saddening HTTP 404 Not Found response.
In the current version of Virtuoso (7.2), using the HEAD method will trigger an HTTP 501 Method Not Implemented response, so you should use Fuseki if you want to play with this method.</p>
<p>As the <a href="http://www.w3.org/TR/sparql11-http-rdf-update">Graph Store protocol's specification</a> shows, you can replace any operation of the protocol by an equivalent SPARQL <a href="http://www.w3.org/TR/sparql11-update/">update</a> or <a href="http://www.w3.org/TR/sparql11-query/">query</a>.
The Graph Store protocol thus provides an uncomplicated interface for basic operations on RDF graphs.
I think it is a simple tool worth knowing.</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com5tag:blogger.com,1999:blog-7413902753408106653.post-70017001785635981972014-07-09T18:24:00.001+02:002021-11-28T09:21:34.566+01:00Methods for designing vocabularies for data on the Web<p><small><em>
Over the past year and a half I have been working on a project, in which we were tasked with producing a vocabulary for describing job postings.<sup id="fnref:SchemaExt"><a href="#fn:SchemaExt" class="footnote">1</a></sup>
In doing so, we were expected to write down what worked, so that others can avoid our mistakes.
Apart from our own experience, the write-up I prepared took into account the largest public discussion on designing vocabularies for data on the Web.
Perusing its archive, I have read every email on the <a href="http://lists.w3.org/Archives/Public/public-vocabs/">public-vocabs</a> mailing list from its start in June 2011 until April 2014.
The following text distills some of what I have learnt from the conversations on this mailing list, especially from vocabulary design veterans such as <a href="https://twitter.com/danbri">Dan Brickley</a> and <a href="https://twitter.com/mfhepp">Martin Hepp</a>, coupled with research into <a href="#references">other sources</a> and our own experiments in data modelling for the Web.</em></small></p>
<p><small><em>This work was supported by the project no. CZ.1.04/5.1.01/77.00440 funded by the European Social Fund through the Human Resources and Employment Operational Programme and the state budget of the Czech Republic.</em></small></p>
<hr />
<blockquote>
<p><em>“All models are wrong, but some are useful.”</em><br/>
George E. P. Box</p>
</blockquote>
<p>The presented text offers a set of recommendations for designing ontologies and vocabularies for data on the Web.
The motivation for creating it was to collect relevant advice for data modelling scattered in various sources into a single resource.
It focuses on the intersection of vocabularies defined using the RDF Schema (<a href="#Brickley2014">Brickley, Guha, 2014</a>) and those that are intended to be used in RDFa Lite syntax (<a href="#Sporny2012">Sporny, 2012</a>) in HTML web pages.
It specifically aims to support vocabularies that aspire to large-scale adoption.</p>
<p>The vocabularies in question in this text are domain-specific, unlike upper ontologies that span general aspects of many domains.
Therefore, it is necessary to delimit the domain to be covered by the developed vocabulary to restrict its scope.
The target domain can have a broad definition, which may be further clarified by examples of data falling into the domain and examples of data that is out of the domain’s scope.
Particular details of the vocabulary’s specialization may be made more specific during the initial research or vocabulary’s design.</p>
<h2 id="related-work-research">Related work research</h2>
<blockquote>
<p><em>“Do not reinvent the wheel.”</em><br/>
HTML design principles<br/>
(<a href="#Kesteren2007">Kesteren, Stachowiak, 2007</a>)</p>
</blockquote>
<p>It is appropriate to devote the initial stage of vocabulary development to research and preparation.
One may consider three principal kinds of relevant resources that can be pooled when designing a vocabulary.
These resources comprise existing data models, knowledge of domain experts, and domain-specific texts.</p>
<h3 id="existing-data-models">Existing data models</h3>
<p>Research of existing data models helps to prevent unnecessary work by answering two main questions:</p>
<ol>
<li>Is there an available data model that can be reused as a whole instead of developing a new data model?</li>
<li>What parts of existing data models can be reused in design of a new data model?</li>
</ol>
<p>There are two main types of data models that are relevant for reuse in vocabulary development.
The first type covers ontological resources that consist of available vocabularies and ontologies.
If one finds such a resource that describes the target domain and fits the envisioned use cases, it can be directly reused as a whole, provided that its terms of use permit it.
If there is a suitable vocabulary that addresses only some of the foreseen uses, it can be extended to cover the others as well.
Otherwise, a new vocabulary may be composed of elements that are cherry-picked from the available ontological resources, which forms a basis for the reuse-based development of vocabularies (<a href="#PovedaVillalon2012">Poveda-Villalón, 2012</a>).
One of the best places to look for these resources is <a href="http://lov.okfn.org/">Linked Open Vocabularies</a>, which provides a full-text search engine for the publicly available vocabularies formalized in RDF Schema or OWL (<a href="#Motik2012">Motik, Patel-Schneider, Parsia, 2012</a>).</p>
<p>The second kind of resources to consider encompasses non-ontological resources, such as XML schemas or data models in relational databases.
As these resources cannot be reused directly for building vocabularies, they need to be re-engineered into ontological resources, which is a process that is also referred to as ‘semantic lifting’.
Taking non-ontological resources into account may complement the input from ontological sources well.
Special attention should be paid to industry standards produced by standardization bodies such as <a href="http://www.iso.org/">ISO</a>.
An alternative approach is to analyze what schemas are employed in public datasets from the given domain, for which data catalogues, such as <a href="http://datahub.io/">Datahub</a>, may be used.</p>
<h3 id="knowledge-elicitation-with-domain-experts">Knowledge elicitation with domain experts</h3>
<blockquote>
<p><em>“Role models are important.”</em><br/>
Officer Alex J. Murphy / RoboCop</p>
</blockquote>
<p>Domain experts constitute a source of implicit knowledge that is not yet formalized in conceptualizations documented in data models (<a href="#Schreiber2000">Schreiber et al., 2000</a>).
Knowledge elicited from experts who have internalized a working knowledge of the domain of interest can feed into the conceptual distinctions captured by the developed vocabulary.
The choice of experts to consult depends on the domain in question.
The interviewed experts can range from academic researchers to practitioners from the industry.
Similarly, the selection of knowledge elicitation methods should be motivated by the intended use cases for the developed vocabulary.
Common methods that serve the purpose of knowledge acquisition include discussion of a glossary, manual simulation of tasks to automate, and competency questions.</p>
<p>A glossary is a useful aid that may guide interviews with domain experts.
It can be either manually prepared or constructed automatically from the developed vocabulary.
The glossary can be written down as a table in which each vocabulary term is listed together with its label, working definition, and broadly described type (e.g., class, property, or individual).
It can then serve as a basis for discussion about the established terminology in the domain covered by the developed vocabulary.</p>
<p>Collaboration with domain experts is an opportunity to conduct manual simulation of tasks that are intended to be performed automatically using data described by the developed vocabulary.
Such simulation can provide a practical grounding for the vocabulary design with respect to its planned use cases.
The simulation should reveal what kinds of data are important for carrying out the envisioned tasks successfully.
It can indicate what data can be added to aid in such tasks and what data makes a difference in deciding how to proceed in the chosen tasks.
For example, if the target domain is the job market, a simulation task may set about matching sample CVs of job seekers to actual job offers, which can suggest what properties are important for telling apart a likely successful candidate.</p>
<p>A classical approach to eliciting knowledge from domain experts is to discuss competency questions.
These are the questions that data described with the developed vocabulary should be able to answer.
As such, competency questions can serve as tests that examine whether a vocabulary is capable of supporting its planned use cases.
For example, these questions may specify what views on data must be possible, what are the users’ needs that data must be able to answer in a single query, or what level of data granularity and detail is needed.</p>
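<p>To make this concrete, a competency question can be turned into a test query over sample data described by the vocabulary. For instance, the question <em>“What skills are required by job postings in a given city?”</em> might be sketched as the following SPARQL query; the <tt>ex:</tt> terms are hypothetical placeholders, not terms of an actual vocabulary:</p>
<pre><code>
PREFIX ex: <http://example.com/vocabulary#>

SELECT ?posting ?skill
WHERE {
  ?posting a ex:JobPosting ;
    ex:location/ex:city "Prague" ;
    ex:requiredSkill ?skill .
}
</code></pre>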
<h3 id="analysis-of-domain-specific-corpora">Analysis of domain-specific corpora</h3>
<blockquote>
<p><em>“Pave the cowpaths.”</em><br/>
HTML design principles<br/>
(<a href="#Kesteren2007">Kesteren, Stachowiak, 2007</a>)</p>
</blockquote>
<p>While eliciting knowledge from domain experts concentrates on implicit knowledge, analyses of domain-specific corpora seek common patterns in explicit, yet unstructured, natural-language text.
Textual analysis can be considered a data-driven approach to schema discovery.
Its key purpose is to ensure that the designed vocabulary can express the most common kinds of data that are published in the target domain.
The approaches to processing domain-specific textual corpora can be divided into qualitative, manual analyses and quantitative, automated analyses.</p>
<h4 id="qualitative-analysis">Qualitative analysis</h4>
<p>Manual qualitative analysis can be performed with a smaller domain-specific corpus, which can consist of tens of sample documents.
The corpus should be analysed by a knowledge engineer to spot common patterns and identify the most important types of data in the domain.
Qualitative analysis may result in clusters of similar types of data grouped into a hierarchical tree, in which the most frequently occurring kinds of data are highlighted.
The identified clusters may then serve as precursors for classes in the developed vocabulary.</p>
<h4 id="quantitative-analysis">Quantitative analysis</h4>
<p>A corpus of texts prepared for quantitative analysis can be sampled from sources on the Web that publish semi-structured data describing the domain of the vocabulary.
Producers of these sources can be projected as potential adopters of the developed vocabulary.
The texts need to be written in a single language, so that translation is not necessary.
Contents of the corpus ought to be sampled from a wide array of diverse sources in order to avoid sampling bias.
The corpus needs to be sufficiently large, so that the findings based on analysing it may be taken as indicative of general characteristics of the covered domain.
Establishing such an extensive corpus typically requires automated harvesting of texts via web crawlers or scripts that access data through APIs.</p>
<p>Quantitative analysis of domain-specific corpora can be likened to ‘distant reading’.
Its aim is to read through the corpus and discover patterns of interest to the vocabulary creator.
A typical task of this type of analysis is to extract the most frequent n-grams, indicating common phrases in the established domain terminology, and map their co-occurrences.
Quantitative analyses on textual corpora may be performed using dedicated software, such as <a href="http://voyant-tools.org/">Voyant Tools</a> or <a href="http://databoutique.cz/post/43403048684/prototyp-aplikace-corpus-viewer">CorpusViewer</a>.</p>
<h2 id="abstract-data-model">Abstract data model</h2>
<p>The results of the performed analyses and knowledge elicitation should provide a basis for development of an abstract data model.
At this stage, the data model of the designed vocabulary is abstract because it is not mapped to any concrete vocabulary terms, in order to avoid being tied too closely to a particular implementation.
The abstract data model may initially be formalized as a mind map, a hierarchical tree list, or a table.
Vocabulary creators can base the model on the clusters of the most commonly found terms from domain corpora and sort them into a glossary table.
Such a proto-model should pass through several rounds of iteration based on successive reviews by the vocabulary creators.
Key classes and properties in the data model should be identified and equipped with both preferred and non-preferred labels (i.e. synonyms) and preliminary definitions.
To get an overview of the whole model and the relationships of its constitutive concepts, it may be visualised as a UML class diagram or using a generic graph visualization.</p>
<h2 id="data-models-implementation">Data model’s implementation</h2>
<blockquote>
<p><em>“One language’s syntax can be another’s semantics.”</em><br/>
Brian L. Meek</p>
</blockquote>
<p>When the abstract data model is deemed to be sound from the conceptual standpoint, it can be formalized in a concrete syntax.
The primary languages that should be employed for the formalization of the abstract data model are RDF and RDF Schema. As simplicity should be a key design goal, the use of more complex ontological restrictions expressed via OWL ought to be kept to a minimum.
The implementation should map the elements of the abstract data model to concrete vocabulary terms that may be either reused from the available ontological resources or newly created.<sup id="fnref:Mapping"><a href="#fn:Mapping" class="footnote">2</a></sup> At this stage, the expressive RDF Turtle (<a href="#Prudhommeaux2014">Prud’hommeaux, Carothers, 2014</a>) syntax may be used conveniently to produce a formal specification of the developed vocabulary.</p>
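<p>For instance, a fragment of such a formal specification in Turtle, using RDF Schema terms and a hypothetical <code>ex:</code> namespace (the class and property below are illustrative, not taken from any existing vocabulary), might look like this:</p>
<pre><code>@prefix rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix ex:   &lt;http://example.com/vocabulary#&gt; .   # hypothetical namespace

ex:JobPosting a rdfs:Class ;
    rdfs:label   "Job posting"@en ;
    rdfs:comment "An announcement of an open position offered by an employer."@en .

ex:skillRequirement a rdf:Property ;
    rdfs:label   "skill requirement"@en ;
    rdfs:comment "A skill that an applicant for the job posting is expected to have."@en ;
    rdfs:domain  ex:JobPosting .
</code></pre>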
<p>The implementation process should follow an iterative development workflow, using examples of data in place of software prototypes.
During each iteration, samples of existing data from the vocabulary’s domain may be modelled using the means the vocabulary provides, so that the fit of the proposed data model to its intended uses can be assessed by seeing it applied to real examples.</p>
<h3 id="general-design-principles">General design principles</h3>
<p>Implementation of a vocabulary may be guided by several general principles recommended for vocabularies targeting data written in markup that is embedded in HTML web pages.
The goal of widespread adoption of the vocabulary on the Web puts an emphasis on specific design principles.
Instead of focusing on conceptual clarity and expressivity as in traditional ontologies, the driving principles of design of lightweight web vocabularies accentuate simplicity, ease of adoption, and usability.
This section further discusses some of the key concerns in vocabulary development, including conceptual parsimony, coverage driven by existing data, and the like.</p>
<h4 id="simplicity">Simplicity</h4>
<p>A vocabulary should avoid complex ontological axioms and subtle conceptual distinctions.
Instead, it ought to seek simplicity for the data producer rather than the data consumer.<sup id="fnref:MainDataStructures"><a href="#fn:MainDataStructures" class="footnote">3</a></sup> It is advisable that vocabulary design strikes a balance between expressivity and the cost of implementation complexity.
Following the principle of minimal ontological commitment (<a href="#Gruber1995">Gruber, 1995</a>), vocabularies should limit the number of ontological axioms (and especially restrictions) to improve their reusability.
The developed vocabulary should thus be as simple as possible without sacrificing the leverage its structure gives to data consumers.
Nevertheless, not only should it make simple things simple, it should also make complex things possible.
Practical vocabulary design can reflect this guideline by focusing on solving <a href="http://microformats.org/wiki/simpler">simpler problems first and complex problems later</a>.</p>
<h4 id="ease-of-adoption">Ease of adoption</h4>
<p>Adoption of a vocabulary may be made easier if the vocabulary builds on common idioms and established terminology that is already familiar to data publishers.
Vocabulary design should strive for intuitiveness.
In line with the <a href="http://en.wikipedia.org/wiki/Principle_of_least_astonishment">principle of least astonishment</a>, vocabulary users should largely be exposed to things they can expect.</p>
<h4 id="usability">Usability</h4>
<p>Vocabulary design should focus on documentation rather than specification.
That being said, neither specification nor documentation can ensure correct use of a vocabulary.
Even though vocabulary terms may be precisely defined and documented, their meaning is largely established by their use in practice.
Nonetheless, correct application of vocabulary terms may be supported by providing good examples showing the vocabulary in use.
As Guha (<a href="#Guha2013">2013</a>) emphasizes, the default mode of authoring structured data on the Web is copy, paste, and edit, for which the availability of examples is essential.
Usability of vocabularies can also be improved by following the recommendations of cognitive ergonomics (<a href="#Gavrilova2010">Gavrilova, Gorovoy, Bolotnikova, 2010</a>), such as readable documentation or a vocabulary with narrow width and shallow depth.</p>
<h4 id="conceptual-parsimony">Conceptual parsimony</h4>
<p>Vocabulary design should introduce as few conceptual distinctions as possible, while still producing a useful conceptualization.
Vocabulary does not need to include means of expressing data that can be computed or inferred from data expressed by other means.
For example, it is not necessary to include a <code>:numberOfOffers</code> property because its value may be computed if there already is a <code>:hasOffer</code> property, whose distinct objects can be counted to arrive at the same data.
An exception to this rule is warranted if it is expected that data producers may only have the computed data, but not the primary data from which it was derived.
For example, the number of offers may not be available in disaggregated form as a list of individual offers.
There is also no need to define inverse properties, such as <code>:isOfferOf</code> for the <code>:hasOffer</code> property.
In a similar manner, vocabulary should not require explicit assertion of data that can be recovered from implicit context, such as data types for literal values.
On the other hand, it is important to recognize that this approach shifts the burden from data publishers to data consumers, which need to execute additional computation, such as inference, to materialize the implicit data.</p>
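<p>A minimal Turtle sketch of this principle (all terms hypothetical): the vocabulary defines only <code>:hasOffer</code>, and both the count of offers and the inverse direction remain implicit, left to data consumers to compute.</p>
<pre><code>@prefix ex: &lt;http://example.com/vocabulary#&gt; .   # hypothetical namespace

ex:seller-1 ex:hasOffer ex:offer-1 , ex:offer-2 , ex:offer-3 .

# No ex:numberOfOffers assertion is needed: a consumer can count the
# distinct objects of ex:hasOffer.
# No ex:isOfferOf assertions are needed either: the inverse direction
# can be recovered by traversing ex:hasOffer backwards.
</code></pre>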
<p>In general, additional conceptual distinctions are useful only if vocabulary users are able to apply them consistently.
It is important to realize that valuable conceptual distinctions, justified from experts’ perspective, <a href="http://lists.w3.org/Archives/Public/public-vocabs/2013Oct/0293.html">may not lead to more reliable data</a>.
Vocabulary creators should mainly concentrate on offering means for describing data that <a href="http://lists.w3.org/Archives/Public/public-vocabs/2014Apr/0105.html">can be reliably provided by a large number of parties</a>.
A key reason for adding a conceptual distinction is that it enables more data to be published.</p>
<p>The merits of conceptual distinctions should be judged based on their discriminatory value.
In other words, the value of distinction is in how it differs from the rest of the vocabulary.
The more finely or ambiguously a vocabulary term is defined, the <a href="http://lists.w3.org/Archives/Public/public-vocabs/2013Oct/0327.html">more likely it will be used incorrectly</a>.
Complex designs are subject to misinterpretation.
If vocabulary terms cannot be understood by data producers easily and reliably, they will not be used (resulting in less data) or <a href="http://lists.w3.org/Archives/Public/public-vocabs/2013Nov/0089.html">will be used inconsistently</a> (resulting in lower data quality).
Therefore, a vocabulary should only use conceptual distinctions that matter and are well understood in the target domain.</p>
<h4 id="data-driven-coverage">Data-driven coverage</h4>
<p>Since enabling existing data to be published in a structured form is an essential goal of vocabulary development, the development ought to be driven by the available data.
A data-driven approach implies that vocabularies should not use conceptualizations that do not match well to common database schemas in their target domains.
Otherwise, data producers have no way of providing their data described using the vocabulary unless they alter their database schemas and change the way they collect data.
A vocabulary should be descriptive rather than prescriptive.
Vocabulary design should be driven by existing data rather than prescribing what data should be published.</p>
<h4 id="communication-interface">Communication interface</h4>
<p>Vocabularies should accurately represent the domain they cover only to the degree it improves consistency of vocabulary use.
Shared reality mirrored by a vocabulary may serve as a common referent improving shared understanding.
However, the prime goal of a vocabulary is not to model the world but to enable communication that gets a message across; its prime aim is communication rather than representation.
For example, structured values, such as postal addresses, do not represent reality, but they help formalize communication.</p>
<p>Vocabulary defines a communication interface between data producers and data consumers.
Data producers are typically people, whereas data consumers are typically machines.
Therefore, vocabulary design should balance usability for people with usability for machines.
Vocabularies ought to be designed for people first and machines second (<a href="#Microformats2013">The microformats process, 2013</a>).
Thus vocabulary design should reflect the trade-off between consistent understanding of vocabulary among people and the degree to which it makes data machine readable.</p>
<h4 id="syntax-limitations">Syntax limitations</h4>
<p>A vocabulary should be aligned with the syntax in which it is intended to be used.
The design of a vocabulary is constrained by the expressivity of its intended syntax.
For example, HTML5 Microdata’s lack of a mechanism for expressing inverse properties, such as the <code>rev</code> attribute in RDFa, may warrant adding inverse properties to a vocabulary.
The syntax of data can be considered a medium for the vocabulary.
In the case of vocabularies made for data embedded in web pages, such as Schema.org, the design should allow for simpler markup.
For example, the vocabulary should require less nesting.</p>
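<p>For illustration, if the intended syntax cannot state relations in the reverse direction, the vocabulary may define an explicit inverse property. The sketch below uses hypothetical terms and OWL only for the inverse declaration:</p>
<pre><code>@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix ex:  &lt;http://example.com/vocabulary#&gt; .   # hypothetical namespace

# Declared only because the target syntax lacks a reverse-direction mechanism.
ex:isOfferOf owl:inverseOf ex:hasOffer .
</code></pre>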
<h4 id="tolerant-specification">Tolerant specification</h4>
<p>Vocabulary specification should be tolerant about data that it can express.
It should not impose a fixed schema.
No properties should be required, so that not providing some data is not invalid.
On the other hand, vocabulary should allow additional data to be expressed, so that superfluous data is also not invalid, unless it raises a contradiction.
It is advisable to use cardinality restrictions for properties only sparingly, as it is difficult to make them generally valid in the broad context of the multicultural Web.
A vocabulary should support varying data granularity and levels of detail, so that unstructured text values are allowed in place of structured values if the structure cannot be reconstructed from the source data.
On the other hand, specific consumers of data may add specific requirements that may be negotiated on a case by case basis with particular data producers.
Overall, data consumers should be expected to follow the spirit of <em>“some data is better than none”</em> (<a href="#Schema.org2012">Schema.org: data model, 2012</a>) and accept even broken or partial data.</p>
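<p>A hypothetical Turtle sketch of such tolerance: the same property accepts either an unstructured text value or a structured value, depending on what the data producer has available.</p>
<pre><code>@prefix ex: &lt;http://example.com/vocabulary#&gt; .   # hypothetical namespace

# Producer with only unstructured source data:
ex:offer-1 ex:location "Technická 5, Prague, Czech Republic" .

# Producer with structured source data:
ex:offer-2 ex:location [
    ex:streetAddress "Technická 5" ;
    ex:city          "Prague" ;
    ex:country       "Czech Republic"
] .
</code></pre>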
<h4 id="vocabulary-evolution">Vocabulary evolution</h4>
<p>If a vocabulary aims for mass adoption, backwards incompatible changes need to be avoided.
It is therefore advisable not to remove or deprecate any vocabulary terms, but rather list them as non-preferred with a link to their preferred variant.
Large-scale use of a vocabulary raises the cost of changes, because more vocabulary users (both data producers and consumers) need to react to the changes.
Widespread adoption increases the difficulty of propagating the changes, because updates about vocabulary changes need to reach a larger audience.</p>
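<p>One possible way to express this, sketched below with hypothetical terms, is an annotation pointing from the non-preferred term to its preferred variant, similar to how Schema.org marks superseded terms with its <code>supersededBy</code> property:</p>
<pre><code>@prefix schema: &lt;http://schema.org/&gt; .
@prefix ex:     &lt;http://example.com/vocabulary#&gt; .   # hypothetical namespace

# The old term remains defined and usable; it merely points to the preferred one.
ex:requiredSkill schema:supersededBy ex:skillRequirement .
</code></pre>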
<h2 id="conclusion">Conclusion</h2>
<blockquote>
<p><em>“It’s probably better to allow volcanoes to have fax machines than try to define everything ‘correctly’. Usage will win out in the end.”</em><br />
Martin Hepp</p>
</blockquote>
<p>The methods for designing vocabularies for data on the Web introduced in this text do not form a coherent methodology but instead compile and synthesize recommendations proposed in related work.
The guiding principles manifested in the presented methods should not be considered hard-and-fast rules but rather suggestions based on the experience of seasoned vocabulary designers.
These include both practical advice on researching the state of the art in a vocabulary’s target domain and concerns to keep in mind when implementing a formal conceptualization for a vocabulary.
Moreover, the presented methods do not involve the notion of a vocabulary being “right” but instead aim for developing vocabularies that are useful.
Therefore, it is only by long-term practical use on the Web that these methods and recommendations may themselves be “proved” useful.</p>
<h2 id="references">References</h2>
<ul>
<li><a id="Brickley2014"></a>BRICKLEY, Dan; GUHA, R.V. (eds.). <em>RDF Schema 1.1</em> [online]. W3C Recommendation 25 February 2014. W3C, 2004-2014 [cit. 2014-04-29]. Available from WWW: <a href="http://www.w3.org/TR/rdf-schema/">http://www.w3.org/TR/rdf-schema/</a></li>
<li><a id="Gavrilova2010"></a>GAVRILOVA, T. A.; GOROVOY, V. A.; BOLOTNIKOVA, E. S. Evaluation of the cognitive ergonomics of ontologies on the basis of graph analysis. <em>Scientific and Technical Information Processing</em>. December 2010, vol. 37, iss. 6, p. 398-406. Also available from WWW: <a href="http://link.springer.com/article/10.3103%2FS0147688210060043">http://link.springer.com/article/10.3103%2FS0147688210060043</a>. ISSN 0147-6882. DOI 10.3103/S0147688210060043.</li>
<li><a id="Gruber1995"></a>GRUBER, Thomas R. Toward principles for the design of ontologies used for knowledge sharing? <em>International Journal of Human-Computer Studies</em>. November 1995, vol. 43, iss. 5-6, p. 907-928. Also available from WWW: <a href="http://tomgruber.org/writing/onto-design.pdf">http://tomgruber.org/writing/onto-design.pdf</a></li>
<li><a id="Guha2013"></a>GUHA, R. V. <em>Light at the end of the tunnel</em> [video]. Keynote at 12th International Semantic Web Conference. Sydney, 2013. Also available from WWW: <a href="http://videolectures.net/iswc2013_guha_tunnel">http://videolectures.net/iswc2013_guha_tunnel</a></li>
<li><a id="Kesteren2007"></a>KESTEREN, Anne van; STACHOWIAK, Maciej (eds.). <em>HTML design principles</em> [online]. W3C Working Draft 26 November 2007. W3C, 2007 [cit. 2014-04-29]. Available from WWW: <a href="http://www.w3.org/TR/html-design-principles/">http://www.w3.org/TR/html-design-principles/</a></li>
<li><a id="Motik2012"></a>MOTIK, Boris; PATEL-SCHNEIDER, Peter F.; PARSIA, Bijan (eds.). <em>OWL 2 Web Ontology Language: structural specification and functional-style syntax</em> [online]. W3C Recommendation 11 December 2012. 2<sup>nd</sup> ed. W3C, 2012 [cit. 2014-04-29]. Available from WWW: <a href="http://www.w3.org/TR/owl2-syntax/">http://www.w3.org/TR/owl2-syntax/</a></li>
<li><a id="PovedaVillalon2012"></a>POVEDA-VILLALÓN, María. A reuse-based lightweight method for developing linked data ontologies and vocabularies. In <em>Proceedings of the 9<sup>th</sup> Extended Semantic Web Conference, Heraklion, Crete, Greece, May 27-31, 2012</em>. Berlin; Heidelberg: Springer, 2012, p. 833-837. Lecture notes in computers science, vol. 7295. Also available from WWW: <a href="http://link.springer.com/chapter/10.1007%2F978-3-642-30284-8_66">http://link.springer.com/chapter/10.1007%2F978-3-642-30284-8_66</a>. ISSN 0302-9743. DOI 10.1007/978-3-642-30284-8_66.</li>
<li><a id="Prudhommeaux2014"></a>PRUD’HOMMEAUX, Eric; CAROTHERS, Gavin (eds.). <em>RDF 1.1 Turtle: terse RDF triple language</em> [online]. W3C Recommendation 25 February 2014. W3C, 2008-2014 [cit. 2014-04-30]. Available from WWW: <a href="http://www.w3.org/TR/turtle/">http://www.w3.org/TR/turtle/</a></li>
<li><a id="Schema.org2012"></a><em>Schema.org: data model</em> [online]. June 6<sup>th</sup>, 2012 [cit. 2014-04-29]. Available from WWW: <a href="http://schema.org/docs/datamodel.html">http://schema.org/docs/datamodel.html</a></li>
<li><a id="Schreiber2000"></a>SCHREIBER, Guus [et al.] (eds.). Knowledge elicitation techniques. In <em>Knowledge engineering and management: the CommonKADS methodology</em>. Cambridge (MA): MIT, 2000, p. 187-214. ISBN 0-262-19300-0.</li>
<li><a id="Sporny2012"></a>SPORNY, Manu. <em>RDFa Lite 1.1</em> [online]. W3C Recommendation 07 June 2012. W3C, 2012 [cit. 2014-04-29]. Available from WWW: <a href="http://www.w3.org/TR/rdfa-lite/">http://www.w3.org/TR/rdfa-lite/</a></li>
<li><a id="Microformats2013"></a><em>The microformats process</em> [online]. April 28<sup>th</sup>, 2013 [cit. 2014-04-29]. Available from WWW: <a href="http://microformats.org/wiki/process">http://microformats.org/wiki/process</a></li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:SchemaExt">
<p>The result of this endeavour can be found here: <a href="https://github.com/OPLZZ/data-modelling">https://github.com/OPLZZ/data-modelling</a> <a href="#fnref:SchemaExt" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Mapping">
<p>Those may be in turn mapped to other vocabularies’ terms; e.g., via <code>rdfs:subClassOf</code>. <a href="#fnref:Mapping" class="reversefootnote">↩</a></p>
</li>
<li id="fn:MainDataStructures">
<p>However, it must be possible to reconstruct the main data structures; at least from its context and without out-of-band knowledge. <a href="#fnref:MainDataStructures" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Epistemology of data in contemporary science
<p><em>At the turn of last year I published a paper in <a href="http://e-logos.vse.cz/">E-LOGOS</a>, a Czech philosophy journal (the paper is <a href="http://nb.vse.cz/kfil/elogos/science/mynarz13.pdf">available here</a>). The published text deals with the image of data that is widespread in contemporary science. It offers a critical look at the understanding of data that is a key part of the <em>big data</em> hype. Again, as is the case for most of my works in areas with which I am not deeply familiar, it is a compilation and remix of thoughts drawn from a <a href="#resources">long array of sources</a>. What you see below is a (rough) English translation of the text (with proper hyperlinks). This way, there’s a chance it will be indexed well enough so that the interested audience can find it.</em></p>
<hr/>
<p><small><strong>Abstract:</strong>
Contemporary science is dominated by positivist epistemology of data, which builds on the foundations of metaphysical realism and the ideal of mechanical objectivity.
This approach to data suffers from a number of flaws.
Shortcomings of this approach were identified in many critical responses and led to a new problematization of the established concept of data.
The often criticised aspects of this epistemology concern the embedding of data in the context of its making, the mediation of data, and its openness to manipulation.
In recent years, the function of data has gained unprecedented importance due to the rising appetite of science for data, which has attracted attention to this formerly unproblematic concept.
Several alternative approaches to epistemology of data appeared, of which this text introduces the positions proceeding from constructivism and rhetoric.
The presented paper draws heavily on critical literature in epistemology of data.
Due to its summarising character, it may be understood as a synthesis and reconfiguration of the existing thoughts on the topic.
In this way, the paper offers a contribution to rhetorical argumentation in the discourse of data in contemporary science.
</small></p>
<h2 id="introduction">Introduction</h2>
<p>In spite of <a href="http://en.wikipedia.org/wiki/Etymological_fallacy">the fact</a> that etymology of words often bears no correspondence to their use, the roots of ‘data’ give many hints about the way this word is used.
Data, as Rosenberg describes (<a href="#Gitelman2013">2013, p. 18</a>), comes from the plural of the Latin word ‘datum’; a neuter past participle of the verb ‘dare’, which is translated as ‘to give’.
‘Datum’ can be thus translated as something ‘given’.
Common use of data is in line with this explanation, often treating it as something given, which needs no questioning. </p>
<p>Constructivist epistemology takes a stance opposing this viewpoint and claims that nothing is given, since everything is a product of human construction.
For example, Bachelard writes:</p>
<blockquote>
<p><em>“For a scientific mind, all knowledge is an answer to a question.
If there has been no question, there can be no scientific knowledge.
Nothing is self-evident.
Nothing is given.
Everything is constructed.”</em> (<a href="#Bachelard2002">Bachelard, 2002, p. 25</a>)</p>
</blockquote>
<p>Diverging understanding of data provides a basis for contemporary criticism that undermines the established status of data in science.
On the one hand, data is perceived as a direct reflection of reality; on the other hand, it is seen as an artifact of human creation.
These diverging perspectives are reflected in the problematization of data, which raises many doubts.
For example, the work of Poovey dedicated to history of modern fact formulates many questions, which apply to the concept of data as well:</p>
<blockquote>
<p><em>“What are facts?
Are they incontrovertible data that simply demonstrate what is true?
Or are they bits of evidence marshalled to persuade others of the theory one sets out with?
Do facts somehow exist in the world like pebbles, waiting to be picked up?
Or are they manufactured and thus informed by all social and personal factors that go into every act of human creation?
Are facts beyond interpretation?
Or are they the very stuff of interpretation, its symptomatic incarnation instead of the place where it begins?”</em> (<a href="#Poovey1998">Poovey, 1998, p. 1</a>)</p>
</blockquote>
<p>This text comprises some of the possible answers to these questions.
Proceeding from historical traces of the evolving understanding of data, the following sections introduce criticism of the dominant realist epistemology and offer alternative epistemologies drawing from constructivism or rhetoric.</p>
<h2 id="brief-history-of-data">Brief history of data</h2>
<p>The concept of ‘data’ has been in use for a long time, yet it acquired its current meaning only at the onset of modernity (<a href="#Gitelman2013">Gitelman, 2013, p. 15</a>).
One of the first known uses of the concept appears in Euclid’s book entitled <em>Data</em> (<a href="#Euclid1834">Euclid, 1834</a>).
The book describes methods for solving and analysing problems, in which data serves either as that which is known in relation to a hypothesis, or that which can be demonstrated to be known.
Data offers starting points of inquiry, from which new knowledge may be inferred.</p>
<p>Describing givens as data remained in use at least until the 17<sup>th</sup> century.
In disciplines such as mathematics, philosophy and theology, data signified given foundations, which are not to be disputed (<a href="#Gitelman2013">Gitelman, 2013, p. 19</a>).
For instance, theology employed this term for the things given by God or the Bible.</p>
<p>The concept came near to its modern use in the early 18<sup>th</sup> century.
Instead of standing for unquestionable givens, data appeared in use as results of experiments, experience or collection.
In other words, data <em>“went from being reflexively associated with those things that are outside of any possible process of discovery to being the very paradigm of what one seeks through experiment and observation”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 36</a>).
Making this approach a part of general knowledge can be ascribed chiefly to positivism.
Viewed in terms of epistemology, it can be described as metaphysical realism of data.</p>
<h2 id="metaphysical-realism-of-data">Metaphysical realism of data</h2>
<p>The view of metaphysical realism assumes that data can be collected from objectively perceivable reality.
The realist framework deems data as exact record or faithful representation of reality.
Rhetoric of scientific ‘discoveries’ requires that knowledge to be discovered already exists in reality; science is then tasked with revealing such knowledge.
Photography offers a prototypic example of accurate capture of reality.
Photographs are <em>“raw representations of the natural world,”</em> which stand for a <em>“unique and literal transcription of nature - a ‘scientific record’”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 4, appendix</a>).</p>
<p>Scientists generally regard data as records of structured observation that is guided by protocol designed up front (<a href="#Halavais2013">Halavais, 2013</a>).
Data collection is described as observation without interfering with the observed reality.
The desired objectivity of scientific data presupposes separation of perceiving subject from perceived reality.
Metaphysical realism, according to Von Glasersfeld (<a href="#VonGlasersfeld1984">Von Glasersfeld, 1984, p. 2</a>), asserts that <em>“we may call something ‘true’ only if it corresponds to an independent, ‘objective’ reality.”</em>
The realist framework bestows data with a privileged role, while holding the subjective description to be distant from ‘genuine’ reality.</p>
<h3 id="epistemic-privilege-of-data">Epistemic privilege of data</h3>
<p>The emphasis on data is particularly characteristic of the science in recent decades, when large volumes of data became widely available.
Yet data had a fundamental role in science in former times as well; for example Nelson (<a href="#Nelson2009">2009</a>) mentions Rudolphine Tables by Johannes Kepler as an early example of scientific use of data.
However, data acquired its peculiar function <em>“in the epistemology we associate with modernity”</em> (<a href="#Poovey1998">Poovey, 1998, p. 1</a>). </p>
<p>Modern science started to draw increasingly on quantified data, which contributed <em>“in a major way to the impression of objectivity in scientific prose”</em> (<a href="#Gross2002">Gross, 2002, p. 37</a>).
The average number of data tables used in scientific articles almost doubled between the 19<sup>th</sup> and 20<sup>th</sup> centuries.
Half of a sample of 20<sup>th</sup>-century articles contained a table, with an average of 5 tables per article (<a href="#Gross2002">Gross, 2002, p. 182</a>).
Science in the 20<sup>th</sup> century strengthened its preference for quantitative facts over qualitative ones.
The language of science reflects this shift in distinguishing ‘hard data’, whose quantitative nature lends it an aura of unquestionability, from ‘soft data’, whose qualitative nature enables it to be bent at will.
In some cases, this preference goes to extreme situations, in which quantitative datasets <em>“are given considerable weight even when nobody defends their validity with real conviction”</em> (<a href="#Porter1995">Porter, 1995, p. 8</a>).
A large share of modern scientists fill their papers with mechanical or mathematical explanations of the facts they describe, while their <em>“argumentative strategy for establishing facts and explanations typically revolves around comparisons of data sets”</em> (<a href="#Gross2002">Gross, 2002, p. 188</a>).
Mathematical explanations referring to data are often privileged for their alleged elegance and clarity (<a href="#Halevy2009">Halevy, 2009</a>).
During the 20<sup>th</sup> century there appears a marked inclination to favour <em>“comparison of large data sets; in addition, mathematics is applied, seemingly whenever possible”</em> (<a href="#Gross2002">Gross, 2002, p. 231</a>).
These trends gradually lead to <em>“rapid ‘commodification’ of data”</em>, which causes data to be presented as <em>“complete, interchangeable products in readily exchanged formats”</em> and may encourage <em>“misinterpretation, over reliance on weak or suspect data sources, and ‘data arbitrage’ based more on availability than on quality”</em> (<a href="#Edwards2013">Edwards, 2013, p. 7</a>).</p>
<h3 id="data-driven-science">Data-driven science</h3>
<p>In recent years, the emphasis on using data in science has increased to such an extent that some proclaim it to bring about a new methodological paradigm of <em>data-driven science</em> (<a href="#Leonelli2014">Leonelli, 2014</a>).
This approach is labelled as the <em>fourth paradigm of science</em> that uses data-intensive research, in which computers help finding knowledge in data, to extend the preceding three paradigms; the paradigm of empirical observation, the paradigm of explanatory models and the paradigm of simulation for insight into complex phenomena (<a href="#Nielsen2012">Nielsen, 2012</a>).
Data is considered as a product of quantitative research, which is especially privileged to serve as scientific evidence. </p>
<p>The extreme cases promoting this scientific paradigm earned the label of ‘data fundamentalism’ (<a href="#Crawford2013">Crawford, 2013</a>).
For example, Anderson’s controversial article from 2008 (<a href="#Anderson2008">2008</a>) announces big data as the <em>“end of theory”</em> and claims that numbers speaking for themselves make hypotheses unnecessary.
However, as Keller (<a href="#Keller1985">1985, p. 130</a>) points out, <em>“the problem with this argument is, of course, that data never do speak for themselves.”</em>
Regardless of these criticisms, some authors believe that abandoning the formulation of hypotheses contributes to the ‘purification’ of science:</p>
<blockquote>
<p><em>“In a small-data world, because so little data tended to be available, both causal investigations and correlation analysis began with a hypothesis, which was then tested to be either falsified or verified.
But because both methods required a hypothesis to start with, both were equally susceptible to prejudice and erroneous intuition.”</em> (<a href="#Mayer-Schonberger2013">Mayer-Schonberger, 2013</a>)</p>
</blockquote>
<p>The disregard for formulating hypotheses in this scientific paradigm may be attributed to the ‘unreasonable effectiveness of data’.
Some authors, such as <a href="#Halevy2009">Halevy, 2009</a>, contend that simple models or hypotheses equipped with large enough data inevitably surpass complex models that lack data.
In its extreme form, the data-driven science overturns the usual relation between hypotheses and data, in which hypothesis plays a primary role and data only provides grounds for verification or falsification.
Instead, this paradigm promotes processes for inductive generalisation of data into valid hypotheses.
Research of these methods is a principal concern of the field of data mining, where particulars given in data may be distilled into universals, such as sets of association rules.</p>
<p>Data-driven science considers hypotheses inherently untrustworthy if their reliability is not backed by data.
However, as the critical rationalism of Karl Popper teaches, data supporting a hypothesis does not verify it; the data merely falsifies incompatible hypotheses.
Trustworthiness ascribed to data can be illustrated by the popular saying: <em>“In god we trust, everyone else bring data.”</em>
Data thus functions as evidence testifying to the truthfulness of the presented claims.
For instance, Markham (<a href="#Markham2013">Markham, 2013</a>) mentions the impression of ‘instant credibility’ that accompanies data.</p>
<p>Contrary to the afore-mentioned claims Boyd and Crawford (<a href="#Boyd2013">2012, p. 663</a>) describe this uncritically accepted approach as a <em>“widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.”</em>
Keil, for example, adds that <em>“data-driven science is a failure of imagination”</em> (<a href="#Keil2013">Keil, 2013</a>).
Keil stresses that science cannot ignore hypotheses, which constitute models or theories.
Instead, it is necessary to combine both empirical observation and making of hypotheses.
As Keil suggests, vast volume of data does not help, if it is not confronted with useful theory.
Although a larger volume of data increases the support it lends to prevalent hypotheses, it likewise increases the noise in the data.
This is the reason why a large volume might in fact, despite expectations, multiply problems in the data.
The expectations usually held for data stem from the assumption of its mechanical objectivity.</p>
<h3 id="myth-of-mechanical-objectivity-of-data">Myth of mechanical objectivity of data</h3>
<p>The epistemic privilege of data springs from the ideal of mechanical objectivity of data (<a href="#Gitelman2013">Gitelman, 2013</a>), which ignores contextuality, mediation and manipulation of data.
The presumed absence of human input (e.g., in photography) and the minimisation of unwanted influences are seen as fundamental to achieving the goal of objectivity.
The belief in neutrality, autonomy and objectivity of data is widespread.
For example, Porter states that <em>“when philosophers speak of the objectivity of science, they generally mean its ability to know things as they really are”</em> (<a href="#Porter1995">Porter, 1995, p. 3</a>).
In accordance with this demand metaphysical realism deems data to be a direct representation of reality.
However, data cannot be an exact reflection of complex reality, as it cannot avoid reducing the reality and omitting details that are reckoned unnecessary for the purpose of data.
High level of reduction makes data lose the ability to represent, and thus data can be considered no more than an approximation of reality (<a href="#Markham2013">Markham, 2013</a>).
More causes of data failing to represent reality can be identified, some of which are examined in the further sections of the text.</p>
<p>Nevertheless, it is important to acknowledge that there are other ways of formulating the objectivity of science.
One of them is an influential definition of objectivity as an <em>“ability to reach consensus”</em> (<a href="#Porter1995">Porter, 1995, p. 3</a>), another is equating objectivity to <em>“fairness and impartiality”</em> (<a href="#Porter1995">ibid., p. 4</a>).
However, the ideal of mechanical objectivity is unattainable, because data is always mediated and its creation is embedded in a context that can be neither avoided nor reproduced.</p>
<h4 id="contextuality-of-data">Contextuality of data</h4>
<p>Data is shaped to a large extent by the context in which it is created.
In science, the sense of data is <em>“tightly dependent on a precise understanding of how, where, and when they were created”</em> (<a href="#Edwards2013">Edwards, 2013</a>).
<em>“Knowledge production is never separate from the knowledge producer”</em>, nor can data be obtained without direct or indirect human influence, so human thinking always marks the produced data.
Direct sensory input is thus combined with the mental contents of the perceiver, while indirect perception using instruments is affected by the views of the instruments’ creators.
Bachelard sums this up in writing that <em>“when we contemplate reality, what we think we know very well casts its shadow over what we ought to know”</em> (<a href="#Bachelard2002">Bachelard, 2002, p. 24</a>).
Therefore, there is a need to keep in mind the <em>“situated, material conditions of knowledge production”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 4</a>), which cause the resulting data to be <em>“framed and framing”</em> (<a href="#Gitelman2013">ibid., p. 5</a>).</p>
<p>Apart from the data creators, data is significantly framed by the environment in which it is created.
For instance, Magee draws attention to this influence:</p>
<blockquote>
<p><em>“Knowledge systems are all too frequently characterised in essentialist terms - as though, as the etymology of ‘data’ would suggest, they are merely the housing of neutral empirical givens.
[…] on the contrary, that systems always carry with them the assumptions of cultures that design and use them - cultures that are, in the very broadest sense, responsible for them.”</em> (<a href="#Magee2011">Magee, 2011, p. 15</a>)</p>
</blockquote>
<p>The environment may determine the way data is gathered, such as by standardising different methods, which can furthermore evolve over time.
For example, reclassifications within systems of categories may happen over the years, which significantly worsens the comparability of data from time series (<a href="#Diakopoulos2013">Diakopoulos, 2013</a>).<sup id="fnref:VolatileGroups"><a href="#fn:VolatileGroups" class="footnote">1</a></sup></p>
<p>The influence of context cannot be eliminated, nor can it be reproduced.
Even though data is of discrete nature, so that <em>“each datum is individual, separate and separable, while still alike in kind to others in its set”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 8</a>), and thus data may be partially decontextualised, the efforts to remove contextual influences completely are bound to fail. </p>
<p>Some expect that data may be purged of contextual distortions and subjectivity if a large volume of data containing records of many variations of the same phenomenon is available.
The promise of big data rests on a conjecture that by combination and aggregation data may be neutralised and its individual tint may be dampened, so that it draws nearer to the objective reality.
The fallacy of this approach is the neglect of arbitrariness of the chosen aggregation method and failure to take the incompleteness of data into account.
Choice of aggregation is a subjective act, in which arbitrary conceptualisations (e.g., categories) are selected, so that the resulting aggregated data may end up being further away from the described reality.
No matter how large data is, it always remains an incomplete sample, the selection of which may omit what is important.
The extensiveness of data does not guarantee its representativeness, because data is subject to limitations and prejudices independently of its size.
For the same reason, quantity cannot substitute for the quality and consistency of data, which may diverge due to varying contexts.
In case of data samples, absolute values are never exact and relative values are loaded with the skew of sample selection and aggregation. </p>
<p>Moreover, attempts to decontextualise data may be harmful for the context of its use.
Boyd and Crawford remark that if <em>“taken out of context, data lose meaning and value”</em> (<a href="#Boyd2012">Boyd, 2012, p. 670</a>).
Data is a medium that requires active participation and understanding from its users.
Knowledge is not a passive process (<a href="#VonGlasersfeld1984">Von Glasersfeld, 1984, p. 9</a>).
Even though the assumption of an objectively perceivable reality constitutes a common object of data, which forms the basis of shared understanding, universal comprehension of data remains a fiction, because interpretation of data depends not only on its object but on its context as well (<a href="#Markham2013">Markham, 2013</a>).</p>
<p>Mechanical objectivity tries to reduce contextual influences into a clearly delimited protocol.
Production of data is thus conducted by strict rules.
In this way, mechanical objectivity is defined as an ability to follow rules and fixed protocol (<a href="#Porter1995">Porter, 1995, p. 4</a>).
The function of a protocol is to set up a controlled context and minimise unwanted influences, which may otherwise be reflected in the created data.
Transparent and documented protocol of data preparation, containing detailed information about data provenance, contributes to trustworthiness of data.
Bird adds that <em>“what makes something an item of observational knowledge is the reliability and uncontentious nature of the mechanism which produces it”</em> (<a href="#Bird2010">Bird, 2010, p. 10</a>).
This way, users of data may evaluate the <em>“adequacy of the experimental conditions under which data have been produced”</em> and determine what level of reliability can be expected from the data and what its evidential value is.</p>
<p>In a similar manner, the restrictions of a protocol aim to make data reproduction feasible.
However, Leonelli states that, for the most part, data is <em>“idiosyncratic to particular experimental contexts, and typically cannot occur outside of those contexts”</em> (<a href="#Leonelli2009">Leonelli, 2009</a>).
Data is unavoidably embedded in an unrepeatable context, which makes it impossible to reproduce in full.
At most, one can attempt to reproduce the methods used to create the data, which may lead to other, yet partially compatible data.</p>
<h4 id="a-idsubsecmediationa-mediation-of-data"><a id="subsec:mediation"></a> Mediation of data</h4>
<p>Immediacy ascribed to data comes from a desire for direct knowledge of reality.
‘Raw’ data is attributed with the quality of primariness.
It is thought to be data coming ‘directly’ from its source, which is reality itself.
This alleged quality relates to the seemingly natural process of mechanical production of data.
One may succumb to an impression that the value of data depends on the straightforwardness of its derivation from reality.
For example, data from automated sensors might be perceived as substantially more trustworthy than calculations of impact factor based on indirect inputs that are considerably distant from reality.
Thanks to the implied immediacy of data it is often understood, in accordance with its etymology, as having an axiomatic nature, which makes it beyond dispute (<a href="#Halavais2013">Halavais, 2013</a>).
<em>“At first glance data are apparently before the fact: they are the starting point for what we know, who we are, and how we communicate”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 2</a>).
<em>“Data is beyond argument,”</em> writes Markham (<a href="#Markham2013">2013</a>), because data is understood as that which precedes argument.
In this perspective, data escapes interpretation and analysis, and is thereby held to be free of subjective influence.
Nevertheless, as Boyd and Crawford suggest, <em>“claims to objectivity are necessarily made by subjects and are
based on subjective observations and choices”</em> (<a href="#Boyd2012">Boyd, 2012, p. 667</a>). </p>
<p>The assumption of the pre-analytical nature of data was subjected to criticism that problematised the concept of ‘raw data’ and contested its validity.
Dewey criticised this presumption as early as 1929:</p>
<blockquote>
<p><em>“[…] all of the rivalries and connected problems grow from a single root.
They spring from the assumption that the true and valid object of knowledge is that which has being prior to and independent of the operations of knowing.
They spring from the doctrine that knowledge is a grasp or beholding of reality without anything being done to modify its antecedent state - the doctrine which is the source of the separation of knowledge from practical activity.”</em> (<a href="#Dewey1929">Dewey, 1929, p. 196</a>)</p>
</blockquote>
<p>Such an approach was deemed generally adopted by 1985, when Keller wrote that <em>“it is by now a near truism that there is no such thing as raw data; all data presuppose interpretation”</em> (<a href="#Keller1985">Keller, 1985, p. 130</a>).
Bowker adds a remark that <em>“raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care”</em> (<a href="#Bowker2005">Bowker, 2005, p. 184</a>).</p>
<p>The key point of such criticism is the recognition that interpretation is already present in observation and in data production itself.
Nunberg contends that properties that we ascribe to information, and which can be ascribed to data as well, such as its <em>“metaphysical haeceity or ‘thereness,’ its transferability, its quantised and extended substance, its interpretive transparency or autonomy — are simply the reifications of the various principles of interpretation that we bring to bear in reading these forms”</em> (<a href="#Nunberg1996">Nunberg, 1996</a>).
Data and <em>“numbers are interpretive, for they embody theoretical assumptions about what should be counted, how one should understand material reality, and how quantification contributes to systematic knowledge about the world”</em> (<a href="#Poovey1998">Poovey, 1998, p. xii</a>).
Therefore, data cannot be accepted as <em>“simple observations about particulars, which were immune from interest and theoretical conjectures of any kind”</em> (<a href="#Poovey1998">ibid., p. xxiv</a>).
Due to various reasons, data is always mediated, and so, as Bachelard writes:</p>
<blockquote>
<p><em>“Knowledge of reality is a light that always casts a shadow in some nook or cranny.
It is never immediate, never complete.”</em> (<a href="#Bachelard2002">Bachelard, 2002, p. 24</a>)</p>
</blockquote>
<p>Data is inevitably produced via media.
Examples of such media are instruments, such as the microscope, or models that may synthesise data.
It is because of media that science may make claims about objects and properties that escape direct observation, such as long-extinct galaxies noticeable via telescopes (<a href="#Bogen1988">Bogen, 1988</a>).
A general view then assumes that <em>“scientific theories predict and explain facts about ‘observables’: objects and properties which can be perceived by the senses, sometimes augmented by instruments”</em> (<a href="#Bogen1988">ibid., p. 303</a>).</p>
<p>In the course of its development, data necessarily passes through models of reality.
A model is a medium of cognition.
<em>“Rational views of the universe are idealised models that only approximate reality”</em> (<a href="#Kent2000">Kent, 2000, p. 220</a>), however, <em>“we can share a common enough view of it for most of our working purposes, so that reality does appear to be objective and stable”</em> (<a href="#Kent2000">ibid., p. 228</a>).
Yet there are some who assert that <em>“‘sound science’ must mean ‘incontrovertible proof by observational data,’ whereas models were inherently untrustworthy”</em> (<a href="#Edwards2013">Edwards, 2013, p. xviii</a>).
<em>“Let the data speak for themselves,”</em> (<a href="#Keller1985">Keller, 1985, p. 129</a>) demand those who call for raw, immediate data.
Edwards calls the assumption that immediacy can be achieved by <em>“waiting for (model-independent) data”</em> to be misguided (<a href="#Edwards2010">Edwards, 2010, p. xiii</a>).
As he writes further, <em>“no collection of signals or observations — even from satellites, which can ‘see’ the whole planet — becomes global in time and space without first passing through a series of data models”</em> (<a href="#Edwards2010">ibid., p. xiii</a>).
The dependence of data on models can be seen on the example of weather forecasts and climate change predictions, for which <em>“only about ten percent of the data used by global weather prediction models originate in actual instrument readings. The remaining ninety percent are synthesised by another computer model”</em> (<a href="#Edwards2010">ibid., p. 21</a>).
In the same way as models or theories, data is only an imperfect approximation of reality.
Nevertheless, in a similar way as Box (<a href="#Box1987">1987, p. 424</a>) claims that <em>“all models are wrong, but some are useful”</em>, an analogous approach may be applied to data.</p>
<h4 id="a-idsubsecdatamanipulationa-data-manipulation"><a id="subsec:data_manipulation"></a> Data manipulation</h4>
<p>Mediation of data allows for manipulation and purposeful reconstruction.
Data may be distorted either deliberately or unintentionally.
In some cases, the influence of context can leave barely noticeable marks on data.
For example, Fanelli argues that <em>“scientific results can be distorted in several ways, which can often be very subtle and/or elude researchers’ conscious control”</em> (<a href="#Fanelli2009">Fanelli, 2009</a>).
Nonetheless, even though science is generally associated with <em>“fairness and impartiality”</em> (<a href="#Porter1995">Porter, 1995, p. 4</a>), a significant share of data manipulation is deliberate.
Babbage warned about data manipulation in science as early as 1830:</p>
<blockquote>
<p><em>“Of cooking.
This is an art of various forms, the object of which is to give to ordinary observations the appearance and character of those of the highest degree of accuracy.
One of its numerous processes is to make multitudes of observations, and out of these to select those only which agree, or very nearly agree.
If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty which will do for serving up.”</em> (<a href="#Babbage1830">Babbage, 1830, p. 178</a>)</p>
</blockquote>
<p>Data manipulation in science is relatively prevalent.
An anonymous survey revealed that roughly 2 % of scientists admit to having manipulated data.
About a third of the survey’s participants conceded to being involved in dubious scientific practices.
However, it should be kept in mind that these estimates are likely conservative as this is a sensitive topic (<a href="#Fanelli2009">Fanelli, 2009</a>).
Moreover, besides deliberate manipulation, data can be distorted because of laziness or malpractice.</p>
<p>Intentional manipulation of data includes disregard of unfavourable data, data answering suggestive questions, excessive generalisation, skewed sample (e.g., non-random), misunderstanding of error margins, false causality or finding statistically insignificant correlation in big data (<a href="#MisuseOfStatistics">Misuse of statistics, 2013</a>).
Data quality may be also deteriorated and made unclear by reducing data to aggregations (<a href="#Diakopoulos2013">Diakopoulos, 2013</a>).</p>
<h2 id="alternative-epistemologies-of-data">Alternative epistemologies of data</h2>
<p>Apart from metaphysical realism, epistemology of data can be considered from alternative viewpoints that do not suffer the afore-mentioned shortcomings.
This essentially <em>“positivist picture of the structure of scientific theories is now widely rejected”</em> (<a href="#Bogen1988">Bogen, 1988, p. 304</a>) and its place has been taken by approaches that fall within postmodernism, yet frequently draw on older thinking, which in some cases dates back to the rhetorical origins of philosophy.
The following sections introduce the approaches of constructivist epistemology and rhetoric, which are deemed to be mutually compatible.</p>
<h3 id="constructivist-epistemology-of-data">Constructivist epistemology of data</h3>
<p>Constructivist epistemology is based on the presumption that all knowledge is a construction of man.
The constructivist school of thought departs from metaphysical realism in not requiring a concept of objective reality.
However, treating constructivism as simple rejection of the concept of objective reality would be overly simplistic.
Constructivism reverses the relation between data and reality and instead claims that data constitutes the reality it describes, so that <em>“data are not found, they are made”</em> (<a href="#Halavais2013">Halavais, 2013</a>). </p>
<p>Some of the central theses of constructivist epistemology may be clearly seen already in works of Giambattista Vico from the 18<sup>th</sup> century.
The treatises of this intellectual predecessor of constructivism claim that <em>“science (scientia) is the knowledge (cognitio) of origins, of the ways and the manner how things are made”</em> and therefore <em>“we can only know what we ourselves construct”</em> (<a href="#VonGlasersfeld1984">Von Glasersfeld, 1984</a>).
Such recognition is what distinguishes scientific and pre-scientific mind, because <em>“whereas the pre-scientific mind possesses reality, the scientific mind constructs and reconstructs it, and in doing so is itself constantly reformed”</em> (<a href="#Bachelard2002">Bachelard, 2002, p. 9</a>).</p>
<p>Foundations of constructivist epistemology are likely built on the fallout from the shift of philosophy towards language.
A constructivist reading may be applied to the works of the anthropologist and linguist Edward Sapir, who argues that the ‘world’ is constructed by the language of a community:</p>
<blockquote>
<p><em>“The fact of the matter is that the ‘real world’ is to a large extent unconsciously built up on the language habits of the group.
No two languages are ever sufficiently similar to be considered as representing the same social reality.
The worlds in which different societies live are distinct worlds, not merely the same world with different labels attached.”</em> (<a href="#Sapir1990">Sapir, 1990, p. 221</a>)</p>
</blockquote>
<p>Following Sapir’s reasoning, constructivism has no need for homomorphism between data and reality, in which data correspond to experience of reality.
Instead, data and knowledge is what fits the reality and functions in a consistent way within the reality.
To illustrate this relationship, Von Glasersfeld offers the simile of a key that fits a lock, in the same way as data fits reality (<a href="#VonGlasersfeld1984">Von Glasersfeld, 1984, p. 3</a>).</p>
<p>Constructivist claims are prone to attract simplified reading.
For example, <em>“the claim that science is socially constructed has too often been read as an attack on its validity or truth”</em> (<a href="#Porter1995">Porter, 1995, p. 11</a>).
In this regard, constructivism offers to replace the criterion of truth with the concept of inner consistency and rule of no contradiction within a system of knowledge (<a href="#VonGlasersfeld1984">Von Glasersfeld, 1984, p. 9</a>).
Given such conditions, research can be seen as a generative process that produces data and eliminates non-functional knowledge.
An example of knowledge revealed as non-functional is what results from ‘apophenia’; a phenomenon of <em>“seeing patterns where none actually exist”</em> (<a href="#Boyd2012">Boyd, 2012, p. 668</a>).
Apophenia can affect data analysts who succumb to the impression that they have discovered a causal chain of inference in the data, whereas it is merely an idiosyncratic construction of the observer.</p>
<p>The constructivist approach is supported by the fact that the production of data is always to some degree an act of classification.
The malleability of data allows it to be cast using chosen data structures and conceptualisations.
The moment a classification is established in data, it becomes part of the data and is difficult to distinguish from it.
Arbitrariness of data structures is what Kent spends a lot of thought on: </p>
<blockquote>
<p><em>“Data structures are artificial formalisms.
They differ from information in the same sense that grammars don’t describe the language we really use, and formal logical systems don’t describe the way we think.”</em> (<a href="#Kent2000">Kent, 2000, p. xix</a>)</p>
</blockquote>
<p>Just as the language of a community is used to construct a shared world, data structures form a basis for a shared understanding of data.
Data structures are created with specific purposes in mind.
<em>“Like different kinds of maps, each kind of structure has its strengths and weaknesses, serving different purposes, and appealing to different people in different situations”</em> (<a href="#Kent2000">ibid.</a>).</p>
<p>Although various aspects of constructivism are invoked by critics of metaphysical realism, mentioned mainly in the <a href="#subsec:mediation">section discussing mediation</a>, its implications for epistemology do not dominate many scientific domains.
For example, in the field of psychology Piaget wrote already in 1980 that <em>“fifty years of experience have taught us that knowledge does not result from a mere recording of observations without a structuring activity on the part of the subject”</em> (<a href="#Piaget1980">Piaget, 1980, p. 377</a>), and the principles of constructivist epistemology are widely adopted in the humanities and social sciences, yet in the natural sciences these principles are largely ignored (<a href="#Hennig2002">Hennig, 2002</a>) and substituted with the remains of metaphysical realism. </p>
<h3 id="rhetoric-of-data">Rhetoric of data</h3>
<p>Rhetoric provides an interpretation of data that is complementary to the approach of constructivist epistemology.
This compatibility can be seen most clearly in the ontological approach to the epistemic understanding of rhetoric, which is distinguished by Brummett (<a href="#Brummett1979">Brummett, 1979</a>).
The ontological explanation of rhetorical epistemology purports that <em>“discourse does not merely discover truth or make it effective.
Discourse creates realities rather than truths about realities”</em> (<a href="#Brummett1979">ibid.</a>).
The function of rhetoric is not limited to persuasion and justification, but covers the production of assertions as well.
Therefore, Scott, as one of the first who linked rhetoric to epistemology, writes that <em>“rhetoric may be viewed not as a matter of giving effectiveness to truth but of creating truth”</em> (<a href="#Scott1967">Scott, 1967, p. 13</a>).
The lens of constructivist epistemology seems to be present when Scott remarks that <em>“‘truth,’ of course, can be taken in several senses.
If one takes it as prior and immutable, then one has no use for rhetoric except to address inferiors”</em> (<a href="#Scott1967">ibid., p. 9</a>).</p>
<p>As the use of ‘data’ in Euclid’s treatises (<a href="#Euclid1834">Euclid, 1834</a>) indicates, the concept was already used in a rhetorical sense during Euclid’s time.
According to the etymology of data, it is <em>“‘that which is given prior to argument,’ given in order to provide a rhetorical basis”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 7</a>).
The production of data in science is set in a discourse of rhetorical argumentation.
Data is constructed as one of the products of scientific discourse, primarily as a vehicle of persuasion.
The selection and processing of data can be tailored to support the intended purpose of an argument.
If the validity of claims is attacked, their authors are required to justify them.
<em>“If challenged it is up to us to produce whatever data, facts, or other backing we consider to be relevant and sufficient to make good the initial claim”</em> (<a href="#Toulmin2003">Toulmin, 2003, p. 13</a>).
The production of data may be considered a specific speech act, used in argumentation to justify previous or forthcoming claims.</p>
<p>Rhetorical argument offers an alternative to analytical logic.
Similarly to logic, in rhetoric <em>“given certain data, certain conclusions may be proven or argued to follow”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 18</a>).
Data does not belong to the framework of analytical logic, though, because it cannot be evaluated to a truth value.
<em>“When a fact is proven false, it ceases to be a fact.
False data is data nonetheless.”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 18</a>).
The use of data is thus rhetorical.
Rosenberg summarises the distinguishing features of data by stating that <em>“facts are ontological, evidence is epistemological, data is rhetorical”</em> (<a href="#Gitelman2013">ibid.</a>).</p>
<p>Rhetoric has a bad reputation in science.
The dangers of rhetoric were pointed out by Thomas Sprat in 1667 in his treatise on the history of the Royal Society: <em>“And to accomplish this, they have indeavor’d to separate the knowledge of Nature, from the colours of Rhetorick, the devices of Fancy, or the delightful deceit of Fables”</em> (<a href="#Sprat1667">Sprat, 1667, p. 62</a>).
Historically, rhetoric is associated with deliberate manipulative uses of data.
Some of these uses are described in the previous <a href="#subsec:data_manipulation">section on data manipulation</a>.
Data manipulation, used for example to obtain grant funding, can be regarded as a kind of rhetorical argumentation.
Examples of using data for rhetorical purposes can be found in propaganda infographics,<sup id="fnref:RhetoricalVisualization"><a href="#fn:RhetoricalVisualization" class="footnote">2</a></sup> in debates on the existence of global warming, or in pre-election surveys, the creators of which are frequently accused of intentional manipulation. </p>
<p>In the case of data, it is the ability to aggregate that gives it its <em>“potential power”</em> and <em>“rhetorical weight”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 8</a>).
Aggregation may contribute to an impression of false objectivity.
An example of the rhetorical production of data is the reconceptualisation of the newspaper as a database carried out by Angelina Grimké Weld, her husband Theodore and her sister Sarah (<a href="#Gitelman2013">Gitelman, 2013, p. 90</a>).
In this case data about slavery was compiled from newspapers, for example from ads for runaway slaves.
The collected data was reframed as testimony of slaveholders’ brutality, as it turned their own words against them.</p>
<p>Modern rhetoric has a much broader scope than manipulation or persuasion.
The ontological approach mentioned by Brummett positions rhetoric as a dimension present in all epistemic activities (<a href="#Brummett1979">Brummett, 1979</a>).
A rhetorical dimension is also present in scientific data.
Even though proclamations that <em>“data is apolitical”</em> (<a href="#Peled2013">Peled, 2013</a>) appear, data is never impartial, and it is necessary to take into account that it may carry a hidden rhetorical agenda.
Even though practical data analysis mostly lacks a deliberate rhetorical approach (<a href="#Schron2013">Schron, 2013</a>), the growing amount of research in this area<sup id="fnref:RawDataOxymoron"><a href="#fn:RawDataOxymoron" class="footnote">3</a></sup> suggests that there is interest in disrupting the established understanding of data.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The epistemology of data deserves attention because of the fundamental status data has in contemporary science.
Due to a dramatic decrease in the costs of producing large data of sufficient quality, science has adopted data as its central resource.
If such power is bestowed on data, it is important not to treat data as an unquestionable concept exempt from scrutiny.
Despite this need, the predominant epistemology of data in current science is based on the alleged pre-analytical nature of data.
The popular pyramid of <em>data - information - knowledge - wisdom</em> puts data in the first place, as a basis for the following levels of knowing.
Given such a position, data is preceded solely by reality itself, the direct representation of which data purports to be.</p>
<p>Many critics have contributed to unveiling the weaknesses of this concept, and it is their works on which this text is built.
Many of the publications cited here bring attention to the shortcomings of the positivist heritage.
A growing number of authors cast doubt upon the established role of data in science.
Several researchers are attempting to reformulate the epistemology of data-intensive science, and the first projects focused on this topic are appearing.<sup id="fnref:DataStudies"><a href="#fn:DataStudies" class="footnote">4</a></sup>
This text has also tried to oppose the unproblematic view of data in present-day science.
In accordance with Kent, the text: </p>
<blockquote>
<p><em>“[…] projects a philosophy that life and reality are at bottom amorphous, disordered, contradictory, inconsistent, non-rational, and non-objective.
Science and much of western philosophy have in the past presented us with the illusion that things are otherwise.”</em> (<a href="#Kent2000">Kent, 2000, p. 220</a>)</p>
</blockquote>
<p>Critical reflection on the dominant epistemology of data in western philosophy has found many holes in the uncritical, positivist approach to data.
In the light of these findings, the interpretation of data as an unquestionable representation of an objectively perceivable reality does not stand the test.
The alternative interpretations offered by constructivist epistemology or rhetoric appear to be more productive frames for thinking about data.
Whatever path is chosen, science cannot treat data as an unproblematic input to mathematical tasks; instead, it needs to subject data to questioning. </p>
<h2 id="a-idreferencesa-references"><a id="references"></a> References</h2>
<ul>
<li><a id="Anderson2008"></a>ANDERSON, Chris. The end of theory: the data deluge makes the scientific method obsolete. <em>Wired</em> [online]. 2008-06-23 [cit. 2013-12-23]. Available from WWW: <a href="http://www.wired.com/science/discoveries/magazine/16-07/pb_theory">http://www.wired.com/science/discoveries/magazine/16-07/pb_theory</a></li>
<li><a id="Babbage1830"></a>BABBAGE, Charles. <em>Reflections on the decline of science in England and on some of its causes</em>. London: B. Fellowes; J. Booth, 1830. Available from WWW: <a href="https://archive.org/details/reflectionsonde00mollgoog">https://archive.org/details/reflectionsonde00mollgoog</a></li>
<li><a id="Bachelard2002"></a>BACHELARD, Gaston. <em>The formation of the scientific mind: a contribution to a psychoanalysis of objective knowledge</em>. Translated by Mary MCALLESTER JONES. Manchester: Clinamen Press, 2002. ISBN 1-903083-20-6.</li>
<li><a id="Bird2010"></a>BIRD, Alexander. The epistemology of science: a bird’s-eye view. <em>Synthese</em>. 2010, vol. 175, no. 1 appendix, pp. 5–16. Available from WWW: <a href="http://eis.bris.ac.uk/~plajb/teaching/The_Epistemology_of_Science.pdf">http://eis.bris.ac.uk/~plajb/teaching/The_Epistemology_of_Science.pdf</a>. DOI 10.1007/s11229-010-9740-4.</li>
<li><a id="Boellstorff2013"></a>BOELLSTORFF, Tom. Making big data, in theory. <em>First Monday</em> [online]. 2013 [cit. 2013-12-28], vol. 18, no. 10. Available from WWW: <a href="http://uncommonculture.org/ojs/index.php/fm/article/view/4869/3750">http://uncommonculture.org/ojs/index.php/fm/article/view/4869/3750</a></li>
<li><a id="Bogen1988"></a>BOGEN, James; WOODWARD, James. Saving the phenomena. <em>The Philosophical Review</em>. 1988, vol. 97, no. 3, pp. 303–352. Also available from WWW: <a href="http://www.pitt.edu/~rtjbog/bogen/saving.pdf">http://www.pitt.edu/~rtjbog/bogen/saving.pdf</a></li>
<li><a id="Box1987"></a>BOX, George E. P.; DRAPER, Norman R. <em>Empirical model-building and response surfaces</em>. Hoboken (NJ): Wiley, 1987. Wiley series in probability and statistics, vol. 157. ISBN 0-471-81033-9.</li>
<li><a id="Bowker2005"></a>BOWKER, Geoffrey C. <em>Memory practices in the sciences</em>. Cambridge (MA): MIT Press, 2005, 280 p. Inside technology. ISBN 978-0-262-52489-6.</li>
<li><a id="Boyd2012"></a>BOYD, Danah; CRAWFORD, Kate. Critical questions for big data. <em>Information, Communication & Society</em>. 2012, vol. 15, no. 5, pp. 662–679. Also available from WWW: <a href="http://dx.doi.org/10.1080/1369118X.2012.678878">http://dx.doi.org/10.1080/1369118X.2012.678878</a>. DOI 10.1080/1369118X.2012.678878.</li>
<li><a id="Brummett1979"></a>BRUMMETT, Barry. Three meanings of epistemic rhetoric. <em>Speech Communication Association Convention: Seminar on Discursive Reality</em>. San Antonio (TX): 1979. Also available from WWW: <a href="http://ap2008.wdfiles.com/local--files/selected-research-articles/Brummett1979.doc">http://ap2008.wdfiles.com/local--files/selected-research-articles/Brummett1979.doc</a> </li>
<li><a id="Crawford2013"></a>CRAWFORD, Kate. The hidden biases in big data. <em>Harvard Business Review Blog Network</em> [online]. April 1, 2013 [cit. 2014-01-11]. Available from WWW: <a href="http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/">http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/</a></li>
<li><a id="Dewey1929"></a>DEWEY, John. <em>The quest for certainty: a study of the relation of knowledge and action</em>. New York: Minton, Balch & Company, 1929. Gifford lectures. Also available from WWW: <a href="https://archive.org/details/questforcertaint032529mbp">https://archive.org/details/questforcertaint032529mbp</a></li>
<li><a id="Diakopoulos2013"></a>DIAKOPOULOS, Nick. <em>The rhetoric of data</em> [online]. July 25, 2013 [cit. 2013-12-22]. Available from WWW: <a href="http://www.nickdiakopoulos.com/2013/07/25/the-rhetoric-of-data/">http://www.nickdiakopoulos.com/2013/07/25/the-rhetoric-of-data/</a></li>
<li><a id="Edwards2010"></a>EDWARDS, Paul N. <em>A vast machine: computer models, climate data, and the politics of global warming.</em> Cambridge (MA): MIT Press, 2010, 552 p. ISBN 978-0-262-01392-5.</li>
<li><a id="Edwards2013"></a>EDWARDS, Paul N. [et al.] (eds.). <em>Knowledge infrastructures: intellectual frameworks and research challenges</em> [online]. Report of a workshop sponsored by the National Science Foundation and the Sloan Foundation, University of Michigan School of Information, 25–28 May 2012. May 2013 [cit. 2013-12-22]. Available from WWW: <a href="http://hdl.handle.net/2027.42/97552">http://hdl.handle.net/2027.42/97552</a></li>
<li><a id="Euclid1834"></a>EUCLID. Data. In SIMSON, Robert (ed.). <em>The elements of Euclid</em>. Philadelphia: Desilver, Thomas & co., 1834.</li>
<li><a id="Fanelli2009"></a>FANELLI, Daniele. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. <em>Public Library of Science ONE</em> [online]. May 29, 2009 [cit. 2014-01-03], vol. 4, no. 5. Available from WWW: <a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0005738">http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0005738</a>. DOI 10.1371/journal.pone.0005738.</li>
<li><a id="Gitelman2013"></a>GITELMAN, Lisa (ed.). <em>‘Raw data’ is an oxymoron</em>. Cambridge (MA): MIT Press, 2013. ISBN 978-0-262-51828-4.</li>
<li><a id="Gross2002"></a>GROSS, Alan G.; HARMON, Joseph E.; REIDY, Michael. <em>Communicating science: the scientific article from the 17<sup>th</sup> century to the present</em>. New York (NY): Oxford University Press, 2002. ISBN 0-19-513454-0. </li>
<li><a id="Halavais2013"></a>HALAVAIS, Alexander. Home made big data? Challenges and opportunities for participatory social research. <em>First Monday</em> [online]. 2013 [cit. 2013-12-28], vol. 18, no. 10. Available from WWW: <a href="http://uncommonculture.org/ojs/index.php/fm/article/view/4876/3754">http://uncommonculture.org/ojs/index.php/fm/article/view/4876/3754</a></li>
<li><a id="Halevy2009"></a>HALEVY, Alon; NORVIG, Peter; PERREIRA, Fernando. The unreasonable effectiveness of data. <em>Intelligent Systems</em>. 2009, vol. 24, no. 2, pp. 8–12. Available from WWW: <a href="http://static.googleusercontent.com/media/research.google.com/en/pubs/archive/35179.pdf">http://static.googleusercontent.com/media/research.google.com/en/pubs/archive/35179.pdf</a>. ISSN 1541-1672. DOI 10.1109/MIS.2009.36.</li>
<li><a id="Hennig2002"></a>HENNIG, Christian. Confronting data analysis with constructivist philosophy. In <em>Classification, clustering, and data analysis: recent advances and applications, part II</em>. Berlin; Heidelberg: Springer, 2002, pp. 235–243. ISBN 978-3-642-56181-8. DOI 10.1007/978-3-642-56181-8_26.</li>
<li><a id="Keil2013"></a>KEIL, Petr. <em>Data-driven science is a failure of imagination</em> [online]. January 2, 2013 [cit. 2013-12-22]. Available from WWW: <a href="http://www.petrkeil.com/?p=302">http://www.petrkeil.com/?p=302</a></li>
<li><a id="Keller1985"></a>KELLER, Evelyn Fox. <em>Reflections on gender and science</em>. New Haven (MA): Yale University Press, 1985. ISBN 0-300-06595-7.</li>
<li><a id="Kent2000"></a>KENT, William. <em>Data and reality</em>. Bloomington (IN): 1<sup>st</sup> Books Library, 2000. ISBN 1-58500-970-9.</li>
<li><a id="Leonelli2009"></a>LEONELLI, Sabina. On the locality of data and claims about phenomena. <em>Philosophy of Science</em>. 2009, vol. 76, no. 5, pp. 737–749. Also available from WWW: <a href="https://ore.exeter.ac.uk/repository/handle/10871/9429">https://ore.exeter.ac.uk/repository/handle/10871/9429</a>. ISSN 0031-8248.</li>
<li><a id="Leonelli2014"></a>LEONELLI, Sabina. Data interpretation in the digital age. <em>Perspectives on Science</em> [in print]. 2014. Also available from WWW: <a href="https://ore.exeter.ac.uk/repository/handle/10036/4484">https://ore.exeter.ac.uk/repository/handle/10036/4484</a>. ISSN 1063-6145.</li>
<li><a id="Magee2011"></a>MAGEE, Liam. Frameworks for knowledge representation. In COPE, Bill; KALANTZIS, Mary; MAGEE, Liam (eds.). <em>Towards a semantic web: connecting knowledge in academic research</em>. Oxford: Chandos, 2011. ISBN 978-1-84334-601-2.</li>
<li><a id="Markham2013"></a>MARKHAM, Annette N. Undermining ‘data’: a critical examination of a core term in scientific inquiry. <em>First Monday</em> [online]. 2013 [cit. 2013-12-22], vol. 18, no. 10. Available from WWW: <a href="http://uncommonculture.org/ojs/index.php/fm/article/view/4868/3749">http://uncommonculture.org/ojs/index.php/fm/article/view/4868/3749</a>. DOI 10.5210/fm.v18i10.4868.</li>
<li><a id="Mayer-Schonberger2013"></a>MAYER-SCHÖNBERGER, Viktor; CUKIER, Kenneth. <em>Big data: a revolution that will transform how we live, work, and think</em>. Boston (MA): Houghton Mifflin Harcourt, 2013. ISBN 978-0-544-00269-2.</li>
<li><a id="MisuseOfStatistics"></a>Misuse of statistics. <em>Wikipedia</em> [online]. Last modified December 19, 2013 [cit. 2014-01-12]. Available from WWW: <a href="http://en.wikipedia.org/wiki/Misuse_of_statistics">http://en.wikipedia.org/wiki/Misuse_of_statistics</a></li>
<li><a id="Nelson2009"></a>NELSON, Michael L. Data-driven science: a new paradigm? <em>EDUCAUSE Review</em> [online]. July/August 2009 [cit. 2013-12-22], vol. 44, no. 4, pp. 6–7. Available from WWW: <a href="http://www.educause.edu/ero/article/data-driven-science-new-paradigm">http://www.educause.edu/ero/article/data-driven-science-new-paradigm</a></li>
<li><a id="Nielsen2011"></a>NIELSEN, Michael. <em>Reinventing discovery: the new era of networked science</em>. New Jersey: Princeton University Press, 2011, 273 p. ISBN 978-0-691-14890-8.</li>
<li><a id="Nunberg1996"></a>NUNBERG, Geoffrey. Farewell to the Information age. In NUNBERG, Geoffrey (ed.). <em>The future of the book</em>. Berkeley (CA): University of California Press, 1996. ISBN 0-520-20451-4.</li>
<li><a id="Peled2013"></a>PELED, Alon. <em>The politics of big data: a three-level analysis</em> [online]. 2013 [cit. 2014-01-06]. Available from WWW: <a href="http://ssrn.com/abstract=2315891">http://ssrn.com/abstract=2315891</a></li>
<li><a id="Piaget1980"></a>PIAGET, Jean. The psychogenesis of knowledge and its epistemological significance. In PIATTELLI-PALMARINI, Massimo (ed.). <em>Language and learning: the debate between Jean Piaget and Noam Chomsky</em>. Cambridge (MA): Harvard University Press, 1980. ISBN 0-674-50940-4. </li>
<li><a id="Poovey1998"></a>POOVEY, Mary. <em>A history of the modern fact: problems of knowledge in the sciences of wealth and society</em>. 1<sup>st</sup> ed. Chicago: University of Chicago Press, 1998, 436 p. ISBN 0-226-67526-2.</li>
<li><a id="Porter1995"></a>PORTER, Theodore M. <em>Trust in numbers: the pursuit of objectivity in science and public life</em>. Princeton (NJ): Princeton University Press, 1995. ISBN 0-691-03776-0. </li>
<li><a id="Sapir1990"></a>SAPIR, Edward. <em>The collected works of Edward Sapir. VIII, Takelma texts and grammar</em>. Berlin; New York: Mouton de Gruyter, 1990. Also available from WWW: <a href="https://archive.org/details/collectedworksof01sapi">https://archive.org/details/collectedworksof01sapi</a></li>
<li><a id="Schron2013"></a>SCHRON, Max. <em>Data’s missing ingredient? Rhetoric</em> [online]. April 11, 2013 [cit. 2014-01-09].Available from WWW: <a href="http://strata.oreilly.com/2013/04/datas-missing-ingredient-rhetoric.html">http://strata.oreilly.com/2013/04/datas-missing-ingredient-rhetoric.html</a></li>
<li><a id="Scott1967"></a>SCOTT, Robert L. On viewing rhetoric as epistemic. <em>Central States Speech Journal</em>. 1967, vol. 18, no. 1, pp. 9–17. DOI 10.1080/10510976709362856.</li>
<li><a id="Sprat1667"></a>SPRAT, Thomas. <em>The history of the Royal-Society of London, for the improving of natural knowledge</em>. [T.R.]: London, 1667, 438 p. Also available from WWW: <a href="https://archive.org/details/historyroyalsoc00martgoog">https://archive.org/details/historyroyalsoc00martgoog</a></li>
<li><a id="Toulmin2003"></a>TOULMIN, Stephen E. <em>The uses of argument</em>. Cambridge: Cambridge University Press, 2003. ISBN 978-0-511-07117-1.</li>
<li><a id="VonGlasersfeld1984"></a>VON GLASERSFELD, Ernst. An introduction to radical constructivism. In WATZLAWICK, Paul (ed.). <em>The invented reality</em>. New York: Norton, 1984, pp. 17–40.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:VolatileGroups">
<p>For example, this problem concerns volatile groups, such as the group of the 1 % of the richest people, for which, due to its volatility, one cannot compare different time slices of data describing the group. <a href="#fnref:VolatileGroups" class="reversefootnote">↩</a></p>
</li>
<li id="fn:RhetoricalVisualization">
<p>As Gitelman mentions, <em>“data visualisation amplifies the rhetorical function of data”</em> (<a href="#Gitelman2013">Gitelman, 2013, p. 12</a>). <a href="#fnref:RhetoricalVisualization" class="reversefootnote">↩</a></p>
</li>
<li id="fn:RawDataOxymoron">
<p>Such as the anthology <em>‘Raw data’ is an oxymoron</em> from 2013 (<a href="#Gitelman2013">Gitelman, 2013</a>). <a href="#fnref:RawDataOxymoron" class="reversefootnote">↩</a></p>
</li>
<li id="fn:DataStudies">
<p>For example, <a href="http://www.datastudies.eu/">http://www.datastudies.eu/</a>. <a href="#fnref:DataStudies" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-57071771651489748982013-07-31T14:21:00.001+02:002016-03-27T13:31:45.877+02:00Vocabularies for the web of data and principles of least markup<p>I want to share a few thoughts about markup vocabularies that I have pondered over the past months while developing a <a href="https://github.com/OPLZZ/data-modelling">Schema.org extension proposal targeting the long tail of the job market</a>.
<a href="http://schema.org/">Schema.org</a> is a prime example of markup vocabulary.
In fact, if you search Google for “markup vocabulary”, most results will associate the term with Schema.org.
Throughout this post, I’ll use this vocabulary as an example to illustrate the points made.</p>
<p>So, how does a markup vocabulary differ from, say, an ontology?
Markup vocabularies serve different purposes than traditional ontologies, although their uses overlap.
While the distinction between vocabularies and ontologies is blurry, it can be said that ontologies are based on <em>logic</em>, whereas vocabularies are based on <em>convention</em>.
Ontologies are typically used for tasks such as inferring additional data, whereas vocabularies serve rather as structures for easier parsing when exchanging data.
<a href="https://twitter.com/ajax_als">Alex Shubin</a> likened Schema.org, as an example of markup vocabulary, to a set of <em>“sitemaps for content”</em> (<a href="http://www.slideshare.net/AlexShubin1/schemaorg-iswc2012-15283142">source</a>).
Whereas sitemaps help machines find pages within a web site, Schema.org helps machines find the bits of content within a web page.</p>
<p>In practice, vocabularies are used for the less orderly data.
As <a href="https://twitter.com/danbri">Dan Brickley</a> said in <a href="https://www.youtube.com/watch?v=yp8AjMBG87g">one of his talks</a>: <em>“Schema.org is for the rest of the Web; for that big sprawling chaos.”</em>
To reach wide adoption vocabularies need to be generic and application-agnostic so that they can be applied at the largest scale possible.
So, as asked at the <a href="http://semtechbizsf2013.semanticweb.com/sessionPop.cfm?confid=70&proposalid=5227">Schema.org panel discussion</a>, <em>“how does schema design at a planetary scale work, in practice?”</em>
I think the answer to this question may be approached from two complementary angles: the recommended vocabulary design patterns and markup guidance. </p>
<h2 id="vocabulary-design">Vocabulary design</h2>
<p>While there are a lot of methodologies for developing ontologies (such as <a href="http://mayor2.dia.fi.upm.es/oeg-upm/index.php/en/methodologies/59-neon-methodology">NeOn Methodology</a> or <a href="http://oa.upm.es/5484/1/METHONTOLOGY_.pdf">METHONTOLOGY</a>), it seems that similar instructions are lacking for vocabularies.
It is unclear what such instructions should be based on.
Whereas design of artefacts is frequently shaped by their anticipated uses in practice, so that their <a href="http://en.wikipedia.org/wiki/Form_follows_function">form follows function</a>, successful vocabularies are often those that don’t anticipate any particular uses.
Their function is defined in terms of a broad goal of supporting the widest possible use.
And this isn’t an easy goal to provide workable recommendations for. </p>
<p>I think one common rule of thumb is that vocabulary designers should strive to lower the cognitive overhead users face when working with vocabularies and focus on improving <a href="http://blog.mynarz.net/2013/07/towards-usability-metrics-for.html">vocabulary usability</a>.
However, how do these nebulous goals translate into practice?</p>
<p>One (slightly less vague) piece of advice is that vocabulary design shouldn’t require users to make difficult conceptual distinctions.
To achieve that, make the differences between vocabulary terms clear (using clear labels and descriptions) in order to avoid ambiguity.
If users regularly mix up two distinct concepts, either drop one of the concepts or provide both with better definitions.
As the <a href="http://www.python.org/dev/peps/pep-0020/">Zen of Python</a> states on a similar note, <em>“there should be one— and preferably only one —obvious way to do it.”</em> </p>
<p>Another (slightly less vague) piece of advice is to avoid object proliferation in the vocabulary you develop.
In his <a href="http://keg.vse.cz/_slides/cyganiak.pdf">talk from May 2013</a> <a href="https://twitter.com/cygri">Richard Cyganiak</a> mentioned that vocabularies are typically built from bottom to top, based on usage evidence, so that unused objects aren’t included.
Richard reiterated the claim asserting that successful vocabularies for the web of data are small and simple (such as <a href="http://dublincore.org/documents/dcmi-terms/">Dublin Core Terms</a>), which was already presented by <a href="https://twitter.com/mfhepp">Martin Hepp</a> in his account of <a href="http://www.heppnetz.de/files/IEEE-IC-PossibleOntologies-published.pdf">Possible ontologies</a>.</p>
<p>One practical technique in line with this advice is to avoid intermediate resources, which are typically represented with blank nodes, and are often needed for object properties.
Schema.org labels such intermediate objects as <a href="http://schema.org/docs/gs.html#microdata_embedded">“embedded items”</a>.
If your vocabulary contains an object property that points to an intermediate object further described with other properties and all these properties have 0…1 cardinality, then you may consider redefining them as direct properties of the object property’s subject.
For example, the class <code>schema:JobPosting</code> is used with properties <code>schema:baseSalary</code> and <code>schema:salaryCurrency</code>.
These properties could have been associated with an intermediate <code>schema:Salary</code> object, or even with 2 intermediate objects <code>schema:JobPosition</code> and <code>schema:Salary</code>, however, they are instead attached as direct properties of the <code>schema:JobPosting</code> class.
Be careful though not to take this as a catch-all rule.
Object properties that usually link to URIs, such as <code>schema:hiringOrganization</code>, which links to <code>schema:Organization</code>, don’t need to be treated in this way.</p>
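<p>To make the difference concrete, here is a minimal Turtle sketch of the two designs; the second variant is hypothetical, since <code>schema:Salary</code> and the <code>:salary</code>, <code>:amount</code> and <code>:currency</code> terms do not exist in Schema.org and are shown only to illustrate the intermediate-object style it avoids.</p>
<pre><code>@prefix schema: <http://schema.org/> .
@prefix :       <http://example.com/ns#> .

# Direct properties, as Schema.org actually defines them
:posting1 a schema:JobPosting ;
    schema:baseSalary     30000 ;
    schema:salaryCurrency "EUR" .

# Hypothetical alternative with an intermediate salary object
:posting2 a schema:JobPosting ;
    :salary [ a :Salary ;
              :amount   30000 ;
              :currency "EUR" ] .
</code></pre>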
<h2 id="markup-guidance">Markup guidance</h2>
<p>Judging from the documentation of markup vocabularies, much of the presented guidance revolves around markup rather than vocabulary design.
For instance, the <a href="http://www.w3.org/wiki/HTML_Data_Vocabularies#Designing_Vocabularies">guidance on designing vocabularies for HTML provided by W3C</a> focuses on issues of markup syntax.
I think a lot of recommendations concerning markup for the web of data can be considered extensions of the <a href="http://en.wikipedia.org/wiki/Principle_of_least_effort">principle of least effort</a>, so calling them the <em>principles of least markup</em> sounds about right.</p>
<p>A practical realization of such principles might advise omitting data that can be computed automatically.
This guidance might encompass omitting inferrable types, including class instantiations and literal datatypes, when there’s only one valid option.
Note that this approach doesn’t apply in cases when “type” or “unit” needs to be provided to serve as a value reference; for example when describing price and its currency.
A more controversial extension of this principle might recommend not forcing users to mint their own URIs unless necessary.
For many purposes of data on the Web anonymous nodes represented with blank nodes are sufficient, given that they may be transformed to URIs and linked deterministically during data ingest and subsequent processing.</p>
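<p>As a rough illustration of both pieces of guidance, the following Turtle sketch (with made-up job posting URIs) contrasts markup that mints a URI and states an inferrable type with leaner markup that relies on a blank node and on the type implied by the expected range of <code>schema:hiringOrganization</code>.</p>
<pre><code>@prefix schema: <http://schema.org/> .
@prefix :       <http://example.com/ns#> .

# Verbose: a minted URI and an explicitly stated type
:posting1 schema:hiringOrganization :acme .
:acme a schema:Organization ;
    schema:name "ACME" .

# Leaner: a blank node, with the type left to be inferred
# from the expected range of schema:hiringOrganization
:posting2 schema:hiringOrganization [ schema:name "ACME" ] .
</code></pre>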
<p>Besides decreasing the number of characters that users need to type to add vocabulary markup, there are a few recurrent issues frequently mentioned in markup advice.</p>
<p>A thorny issue is the <em>single namespace policy</em>, which proposes that users should be able to create markup with a single vocabulary only.
This recommendation is based on the assumption that having multiple vocabulary namespaces requires users to shift between multiple contexts of different vocabularies, which is held to be cognitively demanding.
For example, Schema.org aims to provide this single all-encompassing namespace, from which every necessary vocabulary term may be drawn.
The single namespace policy is also reflected in RDFa’s <code>vocab</code> attribute, which enables specifying a single namespace that is then applied to all unqualified names used in the markup.</p>
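<p>The difference the policy tries to avoid can be sketched in Turtle (with illustrative URIs): the first description draws terms from three namespaces, while the second stays within Schema.org alone.</p>
<pre><code>@prefix schema:  <http://schema.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix :        <http://example.com/ns#> .

# Mixed namespaces: the user has to juggle several vocabularies
:posting1 a schema:JobPosting ;
    dcterms:title "Data analyst" ;
    foaf:homepage <http://example.com/jobs/1> .

# Single namespace: every term comes from Schema.org
:posting2 a schema:JobPosting ;
    schema:title "Data analyst" ;
    schema:url   <http://example.com/jobs/1> .
</code></pre>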
<p>When looking for the source of errors in markup, unclear <em>scoping rules</em> are often to blame.
Scoping is governed by rules prescribing which subject the properties in markup should be attached to, based on the positioning of attribute-value pairs in the hierarchical structure of HTML and on the semantic context set by other markup.
The scoping rules are notoriously difficult to grasp, which might have contributed to Microdata having the <code>itemscope</code> attribute that sets the scope explicitly.</p>
<p>A related issue to scoping is <em>directionality</em>, which prescribes whether the current scope should be used as subject or object of marked up properties.
To reverse the default directionality RDFa offers the <code>rev</code> attribute; previously, it also used the reverse direction for the <code>src</code> attribute.
Directionality, among other issues, is described by <a href="https://twitter.com/gkellogg">Gregg Kellogg</a> in his list of <a href="http://greggkellogg.net/2011/07/19/things-people-get-wrong-in-rdfa-markup">common pitfalls when marking up HTML with RDFa</a>.
Microdata, on the other hand, avoids this issue by being uni-directional.</p>
<h2 id="tolerance">Tolerance</h2>
<p>Markup guidelines for data publishers should have a counterpart on the side of data consumers.
That counterpart is the principle of tolerance.
<a href="http://schema.org/docs/datamodel.html">Schema.org documentation of its data model</a> states: <em>“In the spirit of ‘some data is better than none’, we will accept this markup and do the best we can.”</em>
Even though markup may be broken in many different ways, data consumers should try to be fault-tolerant.
This attitude is in line with <a href="http://en.wikipedia.org/wiki/Robustness_principle">Postel’s principle of robustness</a> that states: <em>“Be conservative in what you send, be liberal in what you accept”</em>.
And so I think that until we know better about vocabulary design, we better be tolerant and liberal about data on the Web.</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com4tag:blogger.com,1999:blog-7413902753408106653.post-62205863848534891702013-07-29T20:13:00.001+02:002013-07-30T12:25:24.451+02:00Towards usability metrics for vocabularies<p>As far as I know, there are no established measures for evaluating vocabulary usability.
To clarify, when I use the term “vocabularies”, what I mean is simple schemas and lightweight lexical ontologies that are used primarily for marking up content embedded in web pages, using syntaxes such as Microdata or RDFa.
A good example of such a vocabulary is <a href="http://schema.org/">Schema.org</a>, an overarching, yet simple schema of things and relations that four big search engines (Google, Microsoft Bing, Yahoo! and Yandex) deem to be important to their users. </p>
<p>The closest to the topic seems to be the paper <a href="http://link.springer.com/chapter/10.1007%2F978-3-642-05290-3_73">Ontology evaluation through usability measures</a> by <a href="https://twitter.com/ncasellas">Núria Casellas</a>.
With regard to the syntactical usability of markup, there was a <a href="http://blog.whatwg.org/usability-testing-html5">usability study of Microdata</a> done by <a href="https://twitter.com/hixie">Ian Hickson</a>, the minimalistic setting of which was the subject of numerous rants, such as <a href="http://manu.sporny.org/2011/those-six-guys/">the one</a> by <a href="https://twitter.com/manusporny">Manu Sporny</a>.
I presume more thought needs to be spent on discovering how existing usability research relates to vocabularies and which standard usability principles apply.
Nevertheless, borrowing from usability testing used for web sites, software or libraries, three metrics relevant to vocabularies crossed my mind.</p>
<p>The first is <strong>error rate</strong> when using a vocabulary.
It is based on the assumption that the more usable a vocabulary is, the fewer errors its users should make.
Vocabulary validators may be used to automate this technique.
Such tools may execute fine-grained rules, which may help to discern the most problematic parts of vocabularies, where users make the most errors.
An example of a <a href="http://habrahabr.ru/company/yandex/blog/165727/">study testing error rate</a> was conducted by Yandex.
Note, however, that it focused more on markup syntaxes than on vocabularies themselves.
It reported a 10 % error rate in RDFa (4 % share in the sample), a 10 % error rate in <a href="http://microformats.org/wiki/hcard">hCard</a> (20 % share) and almost no errors in Facebook’s <a href="http://ogp.me/">Open Graph Protocol</a> (1.5 % share), which is also RDFa. </p>
<p>A broader feature that may serve as input to usability testing is <strong>data quality</strong>.
A metric based on data quality should primarily take into account valid data, since invalid data should be caught by error rate testing.
Recognizing data quality as a relevant feature is based on the assumption that more usable vocabularies support creating data of better quality.
However, the relation between vocabulary usability and data quality should not be considered causation, but rather correlation, which might pinpoint weak parts of a vocabulary where data quality suffers.
Transforming data quality into a discrete metric is tricky, but there already are data quality assessment methodologies, some of which are documented in <a href="http://www.semantic-web-journal.net/system/files/swj414.pdf">this paper</a> (PDF), from which test procedures for the usability of vocabularies may be derived.</p>
<p>The remaining metric I propose is adopted from library and information science, in which quality of indexing (much like mark-up) can be evaluated in terms of inter-indexer and intra-indexer consistency.
Reframing that for usability testing of vocabularies, <em>inter-user</em> and <em>intra-user</em> consistency could be more suitable labels.
<strong>Inter-user consistency</strong> is the degree of agreement among users in describing the same content.
On the other hand, <strong>intra-user consistency</strong> is the extent to which a single user marks up the same content consistently over time.
Consistent use of a vocabulary may be taken as a sign that the vocabulary terms are not ambiguously defined, so that users do not confuse them.
It may also show that there is documentation providing clear guidance on the ways in which the vocabulary may be used.
These metrics might help test if vocabularies can be <em>“easily understood by the users, so that it can be consistently applied and interpreted”</em> (<a href="http://windsor.mie.utoronto.ca/enterprise-modelling/papers/fox-eimt97.pdf">source</a>).</p>
<p>These metrics have a long history in the field of libraries and are already deployed in practice on the Web.
For example, <a href="http://en.wikipedia.org/wiki/Google_Image_Labeler">Google Image Labeler</a> (now defunct) was a game that asked pairs of users (mutually unknown to each other) to label the same image and rewarded them if they agreed on a label.
A similar service that works on the same principle rewarding consistency is <a href="http://www.librarything.com/coverguess.php">LibraryThing’s CoverGuess</a>.
A naïve approach to implementing these metrics could compute the size of the <a href="http://en.wikipedia.org/wiki/Diff"><code>diff</code></a>, so that, for example, markups produced by 2 users given the same web page and instruction to use the same vocabulary can be compared.
A more complex implementation might involve distance metrics that measure similarity of patterns in data, such as with the <a href="https://www.assembla.com/spaces/silk/wiki/Comparison">metrics offered by Silk</a>.
Finally, when applying the consistency metrics, as <a href="http://onlinelibrary.wiley.com/doi/10.1002/asi.4630200314/abstract">observed previously</a>, you should keep in mind that high consistency may be achieved at the expense of low overall quality.
Therefore, these metrics are best complemented with data quality testing.</p>
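<p>A toy illustration of the naïve diff-based approach, assuming two users marked up the same job posting and their markup was converted to Turtle; the property names come from Schema.org, but the URIs and values are made up.</p>
<pre><code>@prefix schema: <http://schema.org/> .
@prefix :       <http://example.com/ns#> .

# Markup produced by user 1
:job a schema:JobPosting ;
    schema:title          "Data analyst" ;
    schema:salaryCurrency "EUR" .

# Markup produced by user 2 (in a separate document)
:job a schema:JobPosting ;
    schema:title      "Data Analyst" ;
    schema:baseSalary 30000 .

# A naïve inter-user consistency score: only 1 of the 5 distinct
# triples is shared by both markups, i.e. a consistency of 0.2.
</code></pre>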
<p>I believe adopting usability testing as a part of vocabulary design is a step forward for data modelling as a discipline.
To start we will first need to find out what usability metrics apply to vocabularies or develop new specific approaches to usability testing.
So let’s get user-centric, shall we?</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-72296482434543034342013-07-17T14:55:00.001+02:002013-07-17T14:55:29.566+02:00Capturing temporal dimension of linked data<h2>Summary</h2>
<p><em>The article provides an overview of the existing approaches for capturing the temporal dimension of linked data within the limits of the RDF data model.
Best-of-breed techniques for representing temporally varying data are compared and both their advantages and disadvantages are summarized with special respect to linked data.
Common issues and requirements for the proposed data models are discussed.
Modelling patterns based on named graphs and concepts drawn from four-dimensionalism are held to be well suited for the needs of linked data on the Web. </em></p>
<p>The world is in flux; models and datasets about the world have to co-evolve with it in order to retain their value.
The nascent web of data can be thought of as a global model of the world, which, as such a model, is subject to change.
The nature of the Web content is dynamic as both individual resources and links between them change frequently.</p>
<p>Facts are time-indexed and as such may cease to be valid, being superseded by newly acquired knowledge.
Even though some knowledge of an encyclopaedic nature might change infrequently, revolutions in scientific paradigms have led us to doubt the existence of eternal truths.
Many resources are temporal in nature, for example stock and news (<a href="#Gutierrez2007">Gutierrez, Hurtado and Vaisman, 2007</a>), and a growing amount of data on the Web comes directly in timestamped streams from sensors.
Moreover, data on the Web often starts as raw material that is refined over time, for example, when patches based on user feedback are incorporated.
Datasets <em>“are constantly evolving to reflect an updated community understanding of the domain or phenomena under investigation”</em> (<a href="#Papavasileiou2013">Papavasileiou et al., 2013</a>).</p>
<p>At the same time people are recognizing the value of historical data, access to which proves to be essential for making better decisions about the present and for predicting the future.
Consequently, access to both current and historical data is deemed vital.
However, many data sources on the Web are not archived and disappear on a regular basis.
Auer remarks that datasets often <em>“evolve without any indication, subject to changes in the encoded facts, in their structure or the data collection process itself”</em> (<a href="#Auer2012">Auer et al., 2012</a>).
In many cases we fail to record when we get to know about things described in our data.
Lack of provenance metadata and temporal annotations makes it difficult to understand how datasets develop with respect to the real world entities they describe.
The rate of change in datasets is thus opaque unless they are provided with explicit metadata.<sup id="fnref:changefreq"><a href="#fn:changefreq" class="footnote">1</a></sup>
Most changes are implicit and detecting them requires “reverse engineering” by comparing snapshots of the observed dataset; a crude way to tell what changed and in which way.</p>
<p>All these remarks highlight the temporal dimension of data on the Web.
Being able to observe the nature of change in linked data proves important for many data consumption use cases.
Knowing what the rate of change is may help to determine which queries can be safely run on cached data and which queries need to be executed on live data (<a href="#Umbrich2010">Umbrich, Karnstedt and Land, 2010</a>).
Not knowing what changes makes data synchronization inefficient as it requires copying whole datasets.
Temporal annotation may vastly benefit applications that combine linked data from the Web.
It allows clustering things that happened simultaneously and determining the order of events that might have spawned one another.
In the data fusion use case, applications may use temporal annotation to favour more recent data.
Finally, <em>“in some areas (like houses for sale) it is the new changed information which is of most interest, and in some areas (like currency rates) if you listen to a stream of changes you will in fact accumulate a working knowledge of the area”</em> (<a href="#BernersLee2009">Berners-Lee and Connolly, 2009</a>).</p>
<p>Unfortunately, <em>“for many important needs related to changing data, implementation patterns or best practices remain elusive,”</em> (<a href="#Sanderson2012">Sanderson and Van de Sompel, 2012</a>) which results in the practice in which <em>“accommodating the time-varying nature of the enterprise is largely left to the developers of database applications, leading to ineffective and inefficient ad hoc solutions that must be reinvented each time a new application is developed”</em> (<a href="#Jensen2000">Jensen, 2000</a>).
The lack of guidance on this topic is what this article attempts to remedy by contributing an overview that summarizes the main data modelling patterns for temporal linked data.
The article begins by going through preliminaries, including concepts and formalizations used in further sections.
The main part provides a review of dominant modelling patterns for capturing temporal dimension of linked data.
This overview is followed by a discussion of common issues in modelling temporal linked data.
Concluding sections of the article provide pointers to related work and sum up the article’s contributions.</p>
<h2 id="preliminaries">Preliminaries</h2>
<p>In order to set up the scene for the main body of this article, we will introduce the formalisms for modelling data (RDF) and representing time, along with description of key concepts in the domain.</p>
<h3 id="resource-description-framework">Resource Description Framework</h3>
<p>RDF (Resource Description Framework) is a generic data format for exchanging structured data on the Web.
It expresses data as sets of atomic statements called <em>triples</em>, in which each triple is made of subject, predicate and object.
There are three kinds of terms that may be used as the constituent parts of triples.
<em>URIs</em> (Uniform Resource Identifiers) identifying resources may be used at any position within a triple.
<em>Blank nodes</em>, existentially quantified anonymous resources, may serve both as subjects and as objects.
Finally, <em>literals</em>, textual values with an optional datatype or language tag, may appear only in the object position.
Any set of RDF triples forms a directed labelled graph, in which vertices consist of subjects and objects and edges, oriented from subjects to objects, are labelled with predicates. </p>
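<p>For illustration, a small Turtle snippet (with made-up URIs) showing each kind of term in use: URIs in the subject and predicate positions, and a blank node and literals in object positions.</p>
<pre><code>@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix :     <http://example.com/ns#> .

:Alice foaf:name  "Alice" ;              # literal as object
       foaf:knows [ foaf:name "Bob" ] .  # blank node as object
</code></pre>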
<p>The deficiencies of RDF regarding the expression of temporal scope are mostly attributed to the fact that RDF is limited solely to binary predicates, while relations of higher arity have to be decomposed into binary ones.
Since RDF predicates are binary, any relation that involves more than two resources can be expressed only indirectly (<a href="#RDF11">RDF 1.1 concepts, 2013</a>).
What follows from this restriction is that <em>“temporal properties can only be attached to concepts or instances.
Whenever a relation needs to be annotated with temporal validity information, workaround solutions such as relationship reification need to be introduced”</em> (<a href="#Tappolet2009">Tappolet and Bernstein, 2009</a>).
These indirect ways of introducing the temporal dimension as an additional argument to the binary relations captured by RDF are accounted for in the modelling patterns described further in the article. </p>
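<p>As a quick preview of the reification workaround quoted above, the following Turtle sketch scopes a single made-up statement in time; the <code>:price</code> property, the statement URI and the choice of <code>dcterms:valid</code> for the annotation are all illustrative.</p>
<pre><code>@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix :        <http://example.com/ns#> .

# The triple ":product1 :price 100" turned into a resource,
# so that its temporal validity can be described
:statement1 a rdf:Statement ;
    rdf:subject   :product1 ;
    rdf:predicate :price ;
    rdf:object    100 ;
    dcterms:valid "2013"^^xsd:gYear .
</code></pre>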
<p>In addition to mixing time into relations, temporal scope can also be treated as an annotation, which may be attached to anything with an identity of its own that is a valid subject of an RDF triple (i.e. a URI or a blank node).
Although similarly indirect, annotations may be attached at several levels of granularity, including annotations of individual triples, resources or sets of triples grouped in sub-graphs.
Since <em>“there is nothing else but the constituents to determine the identity of the triple,”</em> (<a href="#Kiryakov2002">Kiryakov and Ogyanov, 2002</a>) RDF triples have no identity to refer to, which makes describing their context, such as temporal scope, unfeasible.
To provide triples with an identity of their own, reification must be used.
Reification is a way to turn statements into resources, so that they come within the reach of what may be described in RDF.
Together with other approaches to annotation, reification will be explored in the following data modelling patterns.</p>
<p>As mentioned above, all of the ways of capturing temporal dimension in RDF rely on indirection or convoluted patterns constrained by RDF limitations.
Moving from synchronic to diachronic data representation runs up against the restraints of RDF.
By design, the <em>“RDF data model is atemporal”</em> (<a href="#RDF11">RDF 1.1 concepts, 2013</a>) and there is no native support for incorporating time.
The format’s specification leaves handling temporal data out of its scope and delegates the question of expressing such data to RDF vocabularies and ontologies.
Current RDF specifications also evade the issue of semantics of temporally variant resources by recognizing that <em>“to provide an adequate semantics which would be sensitive to temporal changes is a research problem which is beyond the scope of this document”</em> (<a href="#RDFSemantics2004">RDF semantics, 2004</a>).</p>
<p>Indeed, much criticism was voiced over RDF’s atemporality.
For instance, Tennison declares that <em>“the biggest deficiency in RDF is how hard it is to associate metadata with statements,”</em> (<a href="#Tennison2009a">2009a</a>) Mittelbach specifies that <em>“there is no built-in way to describe the context in which a given fact is valid,”</em> (<a href="#Mittelbach2008">2008</a>) and Hickey concludes that <em>“without a temporal notion or proper representation of retraction, RDF statements are insufficient for representing historical information”</em> (<a href="#Hickey2013">2013</a>).</p>
<h3 id="representation-of-time">Representation of time</h3>
<p>Time can be represented as a <em>“point-based, discrete and linearly ordered domain”</em> (<a href="#Rula2012">Rula et al., 2012</a>).
It is typically conceptualized as one-dimensional, so there is no branching time (<a href="#Tappolet2009">Tappolet and Bernstein, 2009</a>).
Basic types of temporal entities are <em>time points</em> (instants) and <em>time intervals</em> (periods) delimited by a starting and an ending time point.
If necessary, temporal primitives might be reduced to intervals since <em>“it is generally safe to think of an instant as an interval with zero length, where the beginning and end are the same”</em> (<a href="#OWLTime2006">Time Ontology in OWL, 2006</a>).
Temporal scope of data may also span multiple periods, which can be represented as a union of disjoint time intervals (<a href="#Lopes2009">Lopes et al., 2009</a>).</p>
<p>Temporal entities can be translated into RDF either as literals or resources.
Literals specifying time may conform to one of the XML Schema datatypes,<sup id="fnref:XMLSchema"><a href="#fn:XMLSchema" class="footnote">2</a></sup> such as <code>xsd:dateTime</code> or <code>xsd:duration</code>, which are based on international standards, such as ISO 8601.<sup id="fnref:ISO8601"><a href="#fn:ISO8601" class="footnote">3</a></sup>
These datatypes are well supported by the SPARQL RDF query language and other tools working with RDF that implement XPath functions for manipulating such datatype values.
Less common literal datatypes<sup id="fnref:DCMIPeriod"><a href="#fn:DCMIPeriod" class="footnote">4</a></sup> require custom parsers, so it is advisable to conform to standards and to decompose compound literals into structured values represented as RDF resources.</p>
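<p>For example, temporal values expressed as typed literals might look as follows; the subject and property names are made up, only the datatypes are standard.</p>
<pre><code>@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix :    <http://example.com/ns#> .

:observation1 :observedAt "2013-07-17T14:55:00+02:00"^^xsd:dateTime ;
              :duration   "P2DT6H"^^xsd:duration .   # 2 days and 6 hours
</code></pre>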
<p>Complex values, such as intervals delimited with starting and ending timestamps, should be represented as resources.
This is the way of modelling adopted by Time Ontology in OWL (<a href="#OWLTime2006">2006</a>), in which temporal entities are structured as instances of classes that are further described with datatype properties.
Another application of this approach is in the work of Correndo et al., who offer a formalization of time that they held to be well-suited for annotation of RDF (<a href="#Correndo2010">2010</a>).
They present the <em>“concept of Linked Timelines, knowledge bases about general instants and intervals that expose resolvable URIs”</em> and provide “also temporal topological relationships inference to the managed discrete time entities.”
The URIs generated for time instants and intervals adhere to syntax of ISO 8601 literals and are described with OWL Time Ontology, which the authors extended with better support for XML Schema datatypes. </p>
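<p>For illustration, an interval represented as a resource using the Time Ontology in OWL might be expressed as follows (the interval URI is made up):</p>
<pre><code>@prefix time: <http://www.w3.org/2006/time#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.com/ns#> .

:interval1 a time:Interval ;
    time:hasBeginning [ a time:Instant ;
                        time:inXSDDateTime "2012-01-01T00:00:00Z"^^xsd:dateTime ] ;
    time:hasEnd       [ a time:Instant ;
                        time:inXSDDateTime "2013-06-30T00:00:00Z"^^xsd:dateTime ] .
</code></pre>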
<p>An alternative option is to represent temporal relations as a hierarchy, for example a year interval may have narrower month intervals.
This solution was proposed for the Neo4j graph store,<sup id="fnref:Neo4j-Dynagraph"><a href="#fn:Neo4j-Dynagraph" class="footnote">5</a></sup> although it can be also put into practice in RDF stores using, for example, SKOS<sup id="fnref:SKOS"><a href="#fn:SKOS" class="footnote">6</a></sup> hierarchical relationships such as <code>skos:narrowerTransitive</code> and <code>skos:broaderTransitive</code>.</p>
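<p>A hierarchy of intervals along these lines might be sketched in Turtle as follows; reusing SKOS hierarchical relationships for this purpose is only a suggestion, and the interval URIs are illustrative.</p>
<pre><code>@prefix time: <http://www.w3.org/2006/time#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix :     <http://example.com/ns#> .

:year2013 a time:Interval ;
    skos:narrowerTransitive :july2013 .

:july2013 a time:Interval ;
    skos:broaderTransitive :year2013 .
</code></pre>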
<p>In the examples contained in this article we will use the Time ontology in OWL combined with XML Schema datatypes to represent temporal entities.</p>
<h3 id="key-concepts">Key concepts</h3>
<p>Change in time is inextricably linked to several concepts, which influence the generic principles that guide data modelling.
A fundamental concept that interacts with it is identity.
There is no widely accepted definition of identity in the database setting.
It can be considered as continuity over a series of perceptions (<a href="#Halloway2011">Halloway, 2011</a>).
Another perspective views identity as the characteristics that make an entity recognizable from others.
It is the relation an entity has only to itself, exhibiting the characteristics of being reflexive, transitive and symmetric.</p>
<p>Identity on the Web may be contemplated through Leibniz’s ontological principle of the identity of indiscernibles, which may be formulated as follows: <em>“if, for every property F, object x has F if and only if object y has F, then x is identical to y”</em> (<a href="#IdentityOfIndiscernibles2010">Identity of indiscernibles, 2010</a>).
This rule does not hold on the Web since <em>“resources are not determined extensionally. That is, A and B can have the same correspondences and coincidences, and still be distinct”</em> (<a href="#Rees2009">Rees, 2009</a>).
On the contrary, the reverse law of indiscernibility of identicals holds true since <em>“<code>owl:sameAs</code> asserts that identity entails isomorphism, or that if a = b, then all statements of a and b are shared by both”</em> (<a href="#McCusker2010">McCusker and McGuinness, 2010</a>). </p>
<p>When it comes to change in time, a key issue of identity is <em>“the problem of diachronic identity: i.e., how do we logically account for the fact that the ‘same’ entity appears to be ‘different’ at different times?”</em> (<a href="#Welty2006">Welty and Fikes, 2006</a>).
Change is a result of actions over time, each of which produces a new observable state of the changed identity.
State is a relationship of identity to value.
In Representational State Transfer (REST) <em>“resource R is a temporally varying membership function MR(t), which for time t maps to a set of entities, or values, which are equivalent”</em> (<a href="#Fielding2000">Fielding, 2000</a>).
In linked data, which builds on REST, the state is the representation (perceived value) to which a resource URI dereferences, so that <em>“dereferencing [the resource’s] URI at any specific moment yields a response that reflects the resource’s state at that moment”</em> (<a href="#VanDeSompel2013">Van de Sompel, Nelson and Sanderson, 2013</a>).
While the resource state may change, the resource URI should be persistent, as recommended by one of the principles of the architecture of the World Wide Web:</p>
<blockquote>
<p><em>“Resource state may evolve over time.
Requiring a URI owner to publish a new URI for each change in resource state would lead to a significant number of broken references.
For robustness, Web architecture promotes independence between an identifier and the state of the identified resource.”</em> (<a href="#AWWW2004">Architecture of the World Wide Web, 2004</a>)</p>
</blockquote>
<p>State offers a perceivable value, which is, in contrast to resources, immutable.
Values are made of facts, which in temporal database research are regarded as <em>“any statement that can meaningfully be assigned a truth value, i.e. that is either true or false”</em> (<a href="#Jensen2000">Jensen, 2000</a>).
RDF regards individual triples as atomic facts, which are held to be true, without respect to context, by the sole nature of their existence.
In fact, without context such as a temporal scope, it might not be feasible to assign a truth value to an RDF triple.
RDF resources are by default persistent yet mutable, although <em>“literals, by design, are constants and never change their value”</em> (<a href="#RDF11">RDF 1.1 concepts, 2013</a>).
However, an RDF triple may be considered immutable because <em>“an RDF statement cannot be changed – it can only be added and removed.”</em> (<a href="#Kiryakov2002">Kiryakov and Ognyanov, 2002</a>).
To the contrary <em>“there is no way to add, remove, or update a resource or literal without changing at least one statement, whereas the opposite does not hold”</em> (<a href="#Auer2007">Auer and Herre, 2007</a>).</p>
<p>The succession of resource states may spread across multiple temporal dimensions.
A common conceptualization of time uses two temporal dimensions, valid time and transaction time, and is thus referred to as the bitemporal data model.
It makes it possible to distinguish between the situation when “the world changes” and the situation when “the data about the world changes” (e.g., as a result of changes in the data collection process).
Valid time (also “actual time”, “business time” or “application time”) captures when the data is valid in the modelled world.
Value is current during valid time.
This interpretation fits the <code>dcterms:valid</code> property from the Dublin Core Terms vocabulary.<sup id="fnref:DCTerms"><a href="#fn:DCTerms" class="footnote">7</a></sup>
Transaction time (also “record time” or “system time”) reflects when data enters the database.
Value is perceived at transaction time.
This dimension’s semantics may be expressed with the <code>dcterms:created</code> property, also from Dublin Core Terms.</p>
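<p>As a minimal sketch, assuming a hypothetical resource <code>:membership</code> and the namespace prefixes declared in the following section, both temporal dimensions might be annotated as follows:</p>
<pre><code># Valid time: when the membership holds in the modelled world.
# Transaction time: when the record of the membership entered the database.
:membership dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime ;
  dcterms:created "2000-01-03T12:00:00Z"^^xsd:dateTime .
</code></pre>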
<h2 id="modelling-patterns">Modelling patterns</h2>
<p>Having set out the preliminaries, we will now review and compare data modelling patterns for capturing the temporal dimension of linked data.
To limit our scope we selected only the patterns that can be implemented within RDF without extending it.
All the selected patterns can thus be expressed in RDF, even though they might not have defined semantics.
In fact, the differences between some patterns are a matter of syntax, so that they can be transformed into each other.
The overview features patterns based on reification, the concept of four-dimensionalism, dated URIs and named graphs.
</p>
<p>Throughout this section we will use the following data snippet serialized in RDF Turtle syntax<sup id="fnref:Turtle"><a href="#fn:Turtle" class="footnote">8</a></sup> to serve as a running example.
It states that the resource <code>:Alice</code> is a member of the organization <code>:ACME</code>, without any anchoring temporal information; how such information is attached will differ in each reviewed modelling pattern.</p>
<pre><code>:Alice org:memberOf :ACME .
</code></pre>
<p>The examples use the Time Ontology (<a href="#OWLTime2006">Time Ontology in OWL, 2006</a>) to demarcate the temporal scope of data, except in cases where the modelling pattern provides its own way to represent the scope.
The temporal annotation in all examples asserts the data to be valid since 9 AM on January 1, 2000.
The following examples use the RDF Turtle syntax (unless stated otherwise) with these namespace prefixes:</p>
<pre><code>@prefix : <http://example.com/> .
@prefix cs: <http://purl.org/vocab/changeset/schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix gen: <http://www.w3.org/2006/gen/ont#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix prv: <http://purl.org/ontology/prv/core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sit: <http://www.ontologydesignpatterns.org/cp/owl/timeindexedsituation.owl#> .
@prefix ti: <http://www.ontologydesignpatterns.org/cp/owl/timeinterval.owl#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
</code></pre>
<h3 id="reification">Reification</h3>
<p>In the first part of the review of data modelling patterns we will focus on those that use reification.
In previous research these patterns were grouped in the category of fact-centric perspectives (<a href="#Rula2012">Rula et al., 2012</a>).
Reification is a principle by which previously anonymous parts of the data model are given autonomous identity, so that they can be described within the data model.
The patterns we go through in this section include statement reification, axiom annotation, changesets, n-ary relations and property localization.</p>
<h4 id="statement-reification">Statement reification</h4>
<p>Statement reification<sup id="fnref:Reification"><a href="#fn:Reification" class="footnote">9</a></sup> adopts a sentence-centric perspective (<a href="#Rula2012">Rula et al., 2012</a>) and attaches temporal annotation to individual triples (statements).
It decomposes the reified binary predicate into three binary predicates asserting what the subject (<code>rdf:subject</code>), predicate (<code>rdf:predicate</code>) and object (<code>rdf:object</code>) of the original statement are.</p>
<p>This approach suffers from a number of issues, which is why its use is commonly discouraged.
First, there is no formal correspondence between a statement and its reified form.
As the RDF primer points out: <em>“note that asserting the reification is not the same as asserting the original statement, and neither implies the other”</em> (<a href="#RDFPrimer2004">RDF primer</a>).
Even if <em>“there needs to be some means of associating the subject of the reification triples with an individual triple in some document […] RDF provides no way to do this”</em> (<a href="#RDFPrimer2004">ibid.</a>).
At the same time, two reified statements with the same subject, predicate and object cannot be automatically inferred to be the same statement.
Moreover, triple reification is inefficient in terms of data size, as it requires at least three times more triples than a non-reified statement.
If statement reification is used, every temporally annotated statement has to be reified, as no grouping is possible.
Given these concerns, triple reification has been considered for deprecation; one reason is that its use cases are covered by named graphs, which, unlike reification, make do without data transformation.</p>
<pre><code>:Alice org:memberOf :ACME .
[] a rdf:Statement ;
rdf:subject :Alice ;
rdf:predicate org:memberOf ;
rdf:object :ACME ;
dcterms:valid [
a time:Interval ;
time:hasBeginning [
time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
]
] .
</code></pre>
<h4 id="axiom-annotation">Axiom annotation</h4>
<p>OWL 2 offers a way for expressing annotations about axioms.<sup id="fnref:OWL2Annotations"><a href="#fn:OWL2Annotations" class="footnote">10</a></sup>
OWL annotations are <em>“pieces of extra-logical information describing the ontology or entity”</em> (<a href="#Grau2008">Grau, 2008</a>) and they <em>“carry no semantics in OWL 2 Direct Semantics”</em> (<a href="#OWL2WOL">OWL 2 Web Ontology Language, 2012</a>).
They can be considered functionally equivalent to triple reification and thus share closely similar issues.
Gangemi and Presutti add that the downsides include a need for <em>“a lot of reification axioms to introduce a primary binary relation to be used as a pivot for axiom annotations, and that in OWL 2 (DL) reasoning is not supported for axiom annotations”</em> (<a href="#Gangemi2013">Gangemi and Presutti, 2013</a>).</p>
<pre><code>:Alice org:memberOf :ACME .
[] a owl:Axiom ;
owl:annotatedSource :Alice ;
owl:annotatedProperty org:memberOf ;
owl:annotatedTarget :ACME ;
dcterms:valid [
a time:Interval ;
time:hasBeginning [
time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
]
] .
</code></pre>
<h4 id="changeset">Changeset</h4>
<p>The Changeset vocabulary<sup id="fnref:Changeset"><a href="#fn:Changeset" class="footnote">11</a></sup> captures changes as reified statements that are complemented with metadata, such as the type of changeset (addition, removal) or a timestamp.
Individual changesets comprise annotated reified statements, which represent atomic changes made to data.
Changesets may be bundled together by using the Eventset vocabulary,<sup id="fnref:Eventset"><a href="#fn:Eventset" class="footnote">12</a></sup> which provides a higher-level, resource-centric view of changes.</p>
<pre><code>[] a cs:ChangeSet ;
cs:addition [
a rdf:Statement ;
rdf:subject :Alice ;
rdf:predicate org:memberOf ;
rdf:object :ACME
] ;
cs:createdDate "2000-01-01T09:00:00Z"^^xsd:dateTime .
</code></pre>
<h4 id="n-ary-relation">N-ary relation</h4>
<p>A common pattern for providing a relation with an identity is to use a class instance instead of a property.
By introducing a specific resource to name the relation, this practice escapes the limitation of RDF’s binary predicates and allows arbitrary n-ary relations<sup id="fnref:Nary"><a href="#fn:Nary" class="footnote">13</a></sup> to be expressed, decomposed into binary relations expressible in RDF.
<em>“The classes created in this way are often called ‘reified relations’”</em> (<a href="#N-ary2006">Defining n-ary relations on the semantic web, 2006</a>) and bear a resemblance to statement reification syntax.
However, they are more concise since they subsume the semantics of <code>rdf:predicate</code> in an instantiation of a specific class.
Moreover, whereas in statement reification additional triples characterize a “statement”, in n-ary relations they describe the “relation” itself, which is why this modelling pattern is categorized as a relation-centric perspective (<a href="#Rula2012">Rula et al., 2012</a>).</p>
<p>However, some argue that time is <em>“an additional semantic dimension of data.
Therefore, it needs to be regarded as an element of the meta-model instead of being just part of the data model”</em> (<a href="#Tappolet2009">Tappolet and Bernstein, 2009</a>).
Indeed, this ontological solution mixes the time model into the data model.
Expressing temporal dimension in the data model requires per-property effort, which leads to ontology bloat and indirection because many properties exhibit potentially dynamic characteristics.
Moreover, custom n-ary relations do not conform to any standard, so interpreting them automatically may be difficult.</p>
<p>Despite these disadvantages, n-ary relations are regarded as the most flexible pattern for incorporating temporal annotations (<a href="#Gangemi2011">Gangemi, 2011</a>), which is supported by evidence of their widespread use in practice.
For example, they are used in Freebase<sup id="fnref:Freebase"><a href="#fn:Freebase" class="footnote">14</a></sup> and Wikidata.<sup id="fnref:Wikidata"><a href="#fn:Wikidata" class="footnote">15</a></sup>
Since n-ary relations are common, the Organization Ontology, which offers the <code>org:memberOf</code> predicate used in the initial example, also provides a way to put the membership relation into a temporal context using the n-ary relation class <code>org:Membership</code>.<sup id="fnref:OrgOnt"><a href="#fn:OrgOnt" class="footnote">16</a></sup></p>
<pre><code>[] a org:Membership ;
org:member :Alice ;
org:organization :ACME ;
org:memberDuring [
a time:Interval ;
time:hasBeginning [
time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
]
] .
</code></pre>
<p>N-ary relations may be further refined by “decorating” them with compound keys, using the <code>owl:hasKey</code> feature of OWL 2, which helps in merging identical relations (<a href="#Gangemi2011">Gangemi, 2011</a>).</p>
<pre><code>org:Membership owl:hasKey (
org:member
org:organization
org:memberDuring
) .
</code></pre>
<p>The pattern of converting properties to n-ary relations is documented in the <em>Property Reification Vocabulary</em> (<a href="#Prochazka2011">Procházka et al., 2011</a>), which offers a way of establishing a connection between a property (<code>prv:shortcut</code>) and its n-ary relation class (<code>prv:reification_class</code>).</p>
<pre><code>:memberOfReification a prv:PropertyReification ;
prv:shortcut org:memberOf ;
prv:reification_class org:Membership ;
prv:subject_property org:member ;
prv:object_property org:organization .
</code></pre>
<h4 id="property-localization">Property localization</h4>
<p>A relatively convoluted technique for reifying properties is property localization, which is also known as “name nesting”.
<em>“To reduce the arity of a given relation instance”</em> it might be replaced by a sub-property with “partial application” of time, thereby decreasing the number of its arguments (<a href="#Krieger2008">Krieger, 2008</a>).
However, in order to add more parameters another sub-property needs to be minted, which, coupled with relying on singletons for property domain and range, makes this pattern inflexible. </p>
<pre><code>:AliceMemberOfACMESince2000-01-01 rdfs:subPropertyOf org:memberOf ;
rdfs:domain [
owl:oneOf ( :Alice )
] ;
rdfs:range [
owl:oneOf ( :ACME )
] ;
dcterms:valid [
a time:Interval ;
time:hasBeginning [
time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
]
] .
:Alice :AliceMemberOfACMESince2000-01-01 :ACME .
</code></pre>
<h3 id="four-dimensionalism">Four-dimensionalism</h3>
<p>Four-dimensionalism is a school of thought hailing from a strong philosophical background that regards time as the fourth dimension.
Traditional data modelling approaches ascribe the fourth dimension only to processes (“perdurants”), whereas things (“endurants”) are held to extend solely in three dimensions.
<em>“Endurants are wholly present at all times during which they exist, while perdurants have temporal parts that exist during the times the entity exists”</em> (<a href="#Welty2006">Welty and Fikes, 2006</a>).
On the contrary, four-dimensionalism does not make this conceptual distinction and treats every entity as having temporal parts extended through time.</p>
<p>The temporal parts of individuals are referred to as time slices, which represent their time-indexed states.
The properties used to describe time slices are called fluents.
Fluents are <em>“properties and relations which are understood to be time-dependent”</em> (<a href="#Hayes2004">Hayes, 2004</a>) and can be thought of as <em>“instances of relations whose validity is a function of time,”</em> (<a href="#Hoffart2013">Hoffart et al., 2013</a>) so that they are functions that map from objects and situations to truth values (<a href="#Welty2006">Welty and Fikes, 2006</a>).</p>
<p>Even though some authors hold 4D fluents to be the most expressive modelling pattern for temporal annotation with best support in current reasoners (<a href="#Batsakis2011">Batsakis and Petrakis, 2011</a>), this approach begets a number of issues as well.
Perhaps the major one is that in the 4D view we either have to treat instances of existing classes as time slices, or we have to redefine the domains and ranges of existing properties to time slices.
Krieger (<a href="#Krieger2008">2008</a>) presents an attempt to solve this issue by reinterpreting existing properties as 4D fluents.
He posits that such <em>“interpretation has several advantages and requires no rewriting of an ontology that lacks a treatment of time.”</em>
Proliferation of identical time slices may be another potential problem of this pattern.
Welty and Fikes argue that <em>“there is no way in OWL to express the identity condition that two temporal parts of the same object with the same temporal extent are the same,”</em> (<a href="#Welty2006">2006</a>) however, features added in OWL 2 more recently, such as keys, may be used for this purpose.</p>
<p>Unlike the previously mentioned approaches, modelling resource slices tracks changes on a higher level of granularity.
Whereas in the above-mentioned examples changes are recorded on the level of individual relations or facts, this pattern captures changes on the level of resources.
Consequently, if we use slices as resource snapshots, such modelling will likely be inefficient: most of the resource’s data remains unchanged in subsequent resource slices, yet it has to be duplicated.
A more economical approach is to treat resource slices as deltas that contain only the data that changed; such deltas are used, for example, in the work of Tappolet and Bernstein (<a href="#Tappolet2009">2009</a>).</p>
<p>In the following example, time slices of resources are expressed as instances of <code>gen:TimeSpecificResource</code> using the <em>Ontology for Relating Generic and Specific Information Resources</em>.<sup id="fnref:GenOnt"><a href="#fn:GenOnt" class="footnote">17</a></sup>
The property <code>org:memberOf</code> is reinterpreted as if it was a fluent property.</p>
<pre><code>:Alice a gen:TimeGenericResource ;
gen:timeSpecific :Alice2000-01-01 .
:ACME a gen:TimeGenericResource ;
gen:timeSpecific :ACME2000-01-01 .
:Alice2000-01-01 a gen:TimeSpecificResource ;
org:memberOf :ACME2000-01-01 ;
dcterms:valid :membershipTime .
:ACME2000-01-01 a gen:TimeSpecificResource ;
dcterms:valid :membershipTime .
:membershipTime a ti:TimeInterval ;
ti:hasIntervalStartDate "2000-01-01T09:00:00Z"^^xsd:dateTime .
</code></pre>
<p>A variation of this modelling pattern was put forward by Welty as context slices (<a href="#Welty2010">2010</a>).
A context slice is <em>“a projection of the relation arguments in each context for which some binary relation holds between them”</em> (<a href="#Welty2010">ibid.</a>).
It constitutes a shared context (an instance of <code>:Context</code>) in which time-indexed “projections” (instances of <code>:ContextualProjection</code>) of both the subject and the object participate.</p>
<pre><code>:AliceMemberOfACMESince2000-01-01 a :Context ;
dcterms:valid [
a time:Interval ;
time:hasBeginning [
time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
]
] .
:Alice2000-01-01 a :ContextualProjection ;
:hasContext :AliceMemberOfACMESince2000-01-01 .
:ACME2000-01-01 a :ContextualProjection ;
:hasContext :AliceMemberOfACMESince2000-01-01 .
:Alice2000-01-01 org:memberOf :ACME2000-01-01 .
</code></pre>
<h3 id="dated-uris">Dated URIs</h3>
<p>Dated URIs allow temporal scope to be expressed at the identifier level.
A resource identified with a dated URI is time-indexed by a timestamp embedded directly in its URI.
A formal version of this modelling pattern was proposed by Masinter (<a href="#Masinter2012">2012</a>), who drafted two URI schemes that embed the original URI and its temporal scope into a single identifier.
The first scheme, <code>duri</code> (“dated URI”), identifies a resource as of a specific time.
The second, <code>tdb</code> (“thing described by”), serves to identify the temporally scoped state of resources that cannot be retrieved via the Web (i.e. “non-information resources”).
Using dated URIs, the running example looks like this:</p>
<pre><code><tdb:2000-01-01T09:00:00Z:http://example.com/Alice>
org:memberOf
<tdb:2000-01-01T09:00:00Z:http://example.com/ACME> .
</code></pre>
<p>An alternative way to use this pattern would be to index properties with time instead, since these are what changes.</p>
<pre><code>:Alice <tdb:2000-01-01T09:00:00Z:http://www.w3.org/ns/org#memberOf> :ACME .
</code></pre>
<p>The principal downside to URIs in both of these schemes is that resolving them is not supported by existing HTTP clients.
A more prevalent, yet not formalized, technique sticks to regular HTTP URIs that contain the temporal information.
For example, Tennison (<a href="#Tennison2009b">2009b</a>) builds on the widespread pattern of dated URIs and proposes to mint URIs with temporal annotations (such as <code>http://example.com/Alice/2000-01-01</code>), to which generic URIs (such as <code>http://example.com/Alice</code>) redirect with HTTP 307 Temporary Redirect.
However, all approaches that embed time into URIs interfere with the axiom of URI opacity,<sup id="fnref:URIOpacity"><a href="#fn:URIOpacity" class="footnote">18</a></sup> which discourages inferring anything from a URI’s string representation, since that would overload the function of identifiers with information belonging to the data model and would require additional custom parsers to extract the temporal annotation from URIs.
Since this approach operates on the identifier level, it may be used together with the previously covered modelling patterns.
For example, resources identified by dated URIs might be interpreted as time slices presented in the above-mentioned pattern.</p>
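<p>As a hedged sketch of this combination, assuming dated HTTP URIs in the style of Tennison’s pattern and the <code>gen:</code> ontology used above, the time-indexed resources might be described as follows:</p>
<pre><code># A dated URI reinterpreted as a time slice of the generic resource.
<http://example.com/Alice> gen:timeSpecific <http://example.com/Alice/2000-01-01> .
<http://example.com/Alice/2000-01-01> a gen:TimeSpecificResource ;
  org:memberOf <http://example.com/ACME/2000-01-01> .
</code></pre>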
<h3 id="named-graphs">Named graphs</h3>
<p>In this section we discuss the deployment of named graphs as resource states and as data versions.
A named graph is merely a name (URI) syntactically paired with an RDF graph.
<em>“RDF does not place any formal restrictions on what resource the graph name may denote, nor on the relationship between that resource and the graph”</em> (<a href="#RDF11">RDF 1.1 concepts, 2013</a>), and the semantics of named graphs is thus left undefined.</p>
<p>A substantial benefit of named graphs is that this approach does not interfere with the actual data model, since it is agnostic about what RDF vocabularies and ontologies are being used.
This makes them compatible with existing atemporal data models, which may be used within temporally annotated named graphs.
An advantage of named graphs is that they make it possible to store mutually contradictory data, since storing inconsistent temporally scoped statements inside a single graph violates RDF’s entailment property, which asserts that every sub-graph can be entailed from an RDF graph (<a href="#Mittelbach2008">Mittelbach, 2008</a>).
Finally, although there is no standardized serialization syntax for named graphs,<sup id="fnref:NamedGraphsSyntax"><a href="#fn:NamedGraphsSyntax" class="footnote">19</a></sup> they are well supported in tools based on SPARQL 1.1, since most current RDF stores are quad stores.
However, their handling in current reasoners is patchy, since named graphs are not a part of the OWL specification to which most reasoners conform (<a href="#Batsakis2011">Batsakis and Petrakis, 2011</a>).</p>
<p>Despite the favourable mentions named graphs have received in the literature on temporal RDF, several issues are commonly raised about using this approach for temporal annotation.
Even though named graphs have no standardized semantics, they are commonly used as containers of RDF datasets.
Named graphs are frequently equated to “documents”, in which RDF data is published.
According to empirical evidence collected by Rula et al. (<a href="#Rula2012">2012</a>), temporal annotations of documents are relatively prevalent in practice.
Grouping a set of related triples into a dataset identified with a named graph URI is recommended among the best practices for working with linked data.<sup id="fnref:NamedGraphs"><a href="#fn:NamedGraphs" class="footnote">20</a></sup>
Since named graphs are already used to express a dataset to which resources belong, for temporal annotation <em>“relying on named graphs is problematic because the statements may be embedded in other datasets which are already enclosed in named graphs”</em> (<a href="#McCusker2010">McCusker and McGuinness, 2010</a>).
However, the affiliation between RDF triples and datasets may be expressed explicitly in the RDF data model by using properties such as <code>void:inDataset</code>,<sup id="fnref:VoID"><a href="#fn:VoID" class="footnote">21</a></sup> which connects a named graph enclosing RDF data with a URI identifying a dataset.</p>
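<p>A minimal sketch of this in TriG, assuming the VoID namespace (http://rdfs.org/ns/void#) and a hypothetical dataset URI, might look like this:</p>
<pre><code>@prefix void: <http://rdfs.org/ns/void#> .

:Alice2000-01-01 {
  :Alice org:memberOf :ACME .
}
# The named graph is explicitly affiliated with a dataset in the default graph.
:Alice2000-01-01 void:inDataset :exampleDataset .
:exampleDataset a void:Dataset .
</code></pre>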
<p>A difficult task when using named graphs is to pick the best level of granularity on which to partition data into individual graphs.
On this note, Tennison recommends that <em>“to avoid repetition of data within multiple graphs, graphs should be split up at the level that updates are likely to occur within the source of the data”</em> (<a href="#Tennison2010">2010</a>).
In the worst case the granularity amounts to a single statement, in which case named graphs may serve as a more elegant syntax for reified statements.
As Gangemi and Presutti warn about named graphs, <em>“this solution has the advantage of being able to talk about assertions, but the disadvantages of needing a lot of contexts (i.e. documents) that could make a model very sparse”</em> (<a href="#Gangemi2013">2013</a>).
In this respect, Rula et al. recommend using named graphs <em>“only when it is possible to group a considerable number of triples into a single graph”</em> (<a href="#Rula2012">2012</a>).
However, when a change impacts multiple facts at the same time, named graphs make it possible to group them and attach a single temporal annotation, which is not possible with the previously scrutinized modelling patterns.</p>
<p>Data modelling patterns for temporal data based on named graphs attribute a special status to the default graph.
Two basic ways of using the default graph have emerged.
Some approaches employ named graphs to store the changing data, while the default graph is used as a container for metadata about the changes, for example in <a href="#Rula2012">Rula et al., 2012</a>.
Another way is to define the default graph as a view of the current state of data (i.e. <code>HEAD</code> in the terminology of versioning systems), which is the case for R&Wbase (<a href="#VanderSande2013">Vander Sande et al., 2013</a>).
Of the different versions of data <em>“the most useful […] is the current graph, which is the one that should be exposed as the default graph in the SPARQL endpoint offered by the triplestore”</em> (<a href="#Tennison2010">Tennison, 2010</a>), which helps to simplify querying via SPARQL as the current state of data is selected by default.</p>
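<p>A minimal TriG sketch of the former option, in which the default graph carries temporal metadata about a named graph holding the changing data, might look as follows:</p>
<pre><code># Default graph: metadata about the temporally scoped named graph.
:Alice2000-01-01 dcterms:valid [
  a time:Interval ;
  time:hasBeginning [ time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime ]
] .

# Named graph: the changing data itself.
:Alice2000-01-01 {
  :Alice org:memberOf :ACME .
}
</code></pre>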
<p>Named graphs have been used to represent temporal scope in several research projects, though their use for this particular purpose is still not widespread.
For example, Tappolet and Bernstein (<a href="#Tappolet2009">2009</a>) describe named graphs as instances of classes from the Time Ontology (<a href="#OWLTime2006">2006</a>), which delimit the temporal boundaries of the data in the graphs.</p>
<p>In the following examples we use the TriG syntax,<sup id="fnref:TriG"><a href="#fn:TriG" class="footnote">22</a></sup> which extends the Turtle serialization to represent named graphs.
We will demonstrate interpreting named graphs as resource states and as commits.</p>
<h4 id="resource-states">Resource states</h4>
<p>The concept of stateful resources was put forth by Richard Cyganiak.<sup id="fnref:StatefulResources"><a href="#fn:StatefulResources" class="footnote">23</a></sup>
It reinterprets every resource as a stateful resource and uses named graphs for the individual states of the resource.
It aligns well with the concepts of four-dimensionalism, using named graphs for temporal parts of stateful resources.</p>
<pre><code>:Alice2000-01-01 {
:Alice org:memberOf :ACME .
:Alice2000-01-01 dcterms:valid [
a time:Interval ;
time:hasBeginning [
time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime
]
] .
}
</code></pre>
<h4 id="commits">Commits</h4>
<p>An approach of using named graphs for capturing evolving RDF data was outlined in a proof of concept entitled R&Wbase (<a href="#VanderSande2013">Vander Sande et al., 2013</a>), which builds on concepts of distributed version control systems.
Each change is represented as a commit, which captures delta between the previous and current state of data.
Unlike with snapshots, data that persists through subsequent revisions does not need to be duplicated, which contributes to storage efficiency.
Much like transactions in relational databases, commits represent atomic units of data that allow fault recovery and rollback.
A commit represents a collection of metadata describing the changes made to data, including their author, timestamp and a link to the parent commit.
Commits are described by means of the Provenance Ontology.<sup id="fnref:PROV-O"><a href="#fn:PROV-O" class="footnote">24</a></sup>
In the case of R&Wbase, individual revisions are exposed as virtual named graphs that make it possible to query the database as of a specific commit.
This approach also allows different versions of data to be both branched and merged.
The following is an example <code>:version</code> named graph that was generated by the <code>:commit</code>:</p>
<pre><code>:version {
:commit a prov:InstantaneousEvent ;
prov:atTime "2000-01-01T09:00:00Z"^^xsd:dateTime ;
prov:generated :version .
:version a prov:Entity .
:Alice org:memberOf :ACME .
}
</code></pre>
<h2 id="main-issues">Main issues</h2>
<p>Having outlined the key characteristics of the main modelling patterns for temporal RDF, we summarize their suitability for linked data based on several criteria and requirements.
However, before that we discuss a few guiding concerns that serve as the basis on which we decide how the reviewed modelling patterns fare on the evaluation criteria.</p>
<h3 id="guiding-concerns">Guiding concerns</h3>
<p>This section draws attention to a few common concerns associated with modelling temporal data in RDF.
We highlight the importance of avoiding updates in place, the cost of identifying data with blank nodes and the reasons given for preferring deltas to snapshots.
The selection should by no means be held to be comprehensive; rather, it is a sample of the more prominent concerns.</p>
<h4 id="no-update-in-place">No update in place</h4>
<p>Update must not affect existing data.
Temporal databases eschew update in place and instead prefer immutable data structures, which collect changes instead of applying them directly by rewriting previous data.
Hickey states that <em>“since one can’t change the past, this implies that the database accumulates facts, rather than updates places, and that while the past may be forgotten, it is immutable”</em> (<a href="#Hickey2013">2013</a>).
In this way, <em>“when capturing the transaction time of data, deletion statements only have a logical effect”</em> (<a href="#Jensen2000">Jensen, 2000</a>), which means that <em>“deleting an entity does not physically remove the entity from the database; rather, the entity remains in the database, but ceases to be part of the database’s current state”</em> (<a href="#Jensen2000">ibid.</a>).
Changes or deletes thus do not have the physical effect of modifying or erasing data in place.
A delete only marks an existing fact as no longer valid, instead of removing it from the database.
Erroneous data may be marked with zero as its end of validity, i.e. as data that was never valid.
On this account, Auer et al. remark that <em>“in analogy with accounting practices, we never physically erase anything, but just add a correcting transaction”</em> (<a href="#Auer2012">2012</a>).
Nevertheless, storage space is limited and legal regulations in some countries might require physical deletion of problematic data.
Similarly to garbage collection in programming languages, a value might be discarded once it has ceased to be used.
Likewise, in order to make irreversible changes most temporal databases allow special procedures to be executed, such as “vacuuming” or “excision”.</p>
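<p>As an illustration of such a logical delete, a minimal sketch assuming the n-ary membership relation from the earlier example (with a hypothetical URI <code>:membership</code>) closes the validity interval instead of removing the data:</p>
<pre><code># Instead of physically deleting the membership, its validity interval is closed.
:membership a org:Membership ;
  org:member :Alice ;
  org:organization :ACME ;
  org:memberDuring [
    a time:Interval ;
    time:hasBeginning [ time:inXSDDateTime "2000-01-01T09:00:00Z"^^xsd:dateTime ] ;
    time:hasEnd [ time:inXSDDateTime "2005-06-30T17:00:00Z"^^xsd:dateTime ]
  ] .
</code></pre>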
<h4 id="avoid-blank-nodes">Avoid blank nodes</h4>
<p>Blank nodes are commonly reported to be a source of problems in versioning RDF.
For instance, Tappolet and Bernstein (<a href="#Tappolet2009">2009</a>) write that blank nodes are <em>“especially problematic as their validity is restricted to the node’s parent graph.”</em>
RDF 1.1 concepts (<a href="#RDF11">2013</a>) clarifies that blank nodes may, in fact, be shared across multiple graphs in an RDF dataset; however, these graphs must have a parent-child relationship.
<em>“Since blank nodes cannot be identified, deleting them is not trivial”</em> (<a href="#VanderSande2013">Vander Sande et al., 2013</a>) and they must be addressed within the context of a graph pattern that they are a part of.
Given these downsides many tools working with blank nodes replace them internally with URIs.
Moreover, use of blank nodes is frowned upon in general in the linked data community, so the recommendation is to avoid them if possible. </p>
<h4 id="prefer-deltas-to-snapshots">Prefer deltas to snapshots</h4>
<p>Change in data may be represented either as a new snapshot, which contains the new data as well as the unaltered data, or as a delta, which records an incremental change.
A snapshot is a view of data at a particular moment.
By using named graphs for snapshots, <em>“different versions of data can be stored in different graphs, but this leads to a duplication of all triples”</em> (<a href="#VanderSande2013">Vander Sande et al., 2013</a>).
Storing only deltas in place of full snapshots has significant space benefits.
<em>“The temporal RDF approach vastly reduces the number of triples by eliminating redundancies resulting in an increased performance for processing and querying”</em> (<a href="#Tappolet2009">Tappolet and Bernstein, 2009</a>).</p>
<p>A delta comprises only the difference between two consecutive snapshots of data, which makes it space efficient at the price of a more complex resolution of deltas back to snapshots.
In particular, the advantages of this approach show when dealing with changes at a higher level of granularity (i.e. a resource or an entire graph).
Deltas were introduced to RDF by Berners-Lee and Connolly (<a href="#BernersLee2009">2009</a>) and have since been covered by many other research articles, which highlighted several major issues in employing deltas in RDF.
Auer et al. call for <em>“LOD differences (deltas) and representing them as first class citizens with structural, semantic, temporal and provenance information”</em> (<a href="#Auer2012">2012</a>).
Deltas have to be easily applicable, so that resolving them to a snapshot of data as of a certain time is feasible.
Delta resolution depends on preserving the application order, either via parent links to previous deltas or by inferring the order from the succession of the deltas’ timestamps.
An important concern about deltas is that they should be self-contained, grouping updates in a way that allows all data pertaining to a single delta to be identified and reverted to a previous state.
Moreover, deltas should express changes in data that can be comprehended by humans.
In that respect, Papavasileiou and her colleagues propose a <em>“change language which allows the formulation of concise and intuitive deltas”</em> (<a href="#Papavasileiou2013">2013</a>).</p>
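<p>To make the notion of a self-contained delta concrete, the following hedged TriG sketch (with hypothetical graph names, a hypothetical previous delta <code>:delta0</code>, and a hypothetical previous employer <code>:OldCorp</code>) records one delta as a pair of addition and removal graphs described in the default graph:</p>
<pre><code># Default graph: the delta, its parent link and its timestamp.
:delta1 a prov:Entity ;
  prov:wasDerivedFrom :delta0 ;
  prov:generatedAtTime "2000-01-01T09:00:00Z"^^xsd:dateTime ;
  dcterms:hasPart :delta1-additions, :delta1-removals .

# Triples added by the delta.
:delta1-additions {
  :Alice org:memberOf :ACME .
}
# Triples removed by the delta.
:delta1-removals {
  :Alice org:memberOf :OldCorp .
}
</code></pre>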
<h3 id="comparison">Comparison</h3>
<p>In this section we summarize how the reviewed data modelling patterns compare on criteria that we deem relevant for linked data.
Below the table, each criterion is described in detail together with explanatory comments on the assessment of individual patterns.</p>
<table class="table-bordered table-hover">
<thead>
<tr>
<th>
<p class="text-right" style="margin: 0;">Approach</p>
<p style="margin: 0;">Criterion</p>
</th>
<th>Statement reification</th>
<th>N-ary relations</th>
<th>Property localization</th>
<th>Time slices</th>
<th>Dated URIs</th>
<th>Stateful resources</th>
<th>Commits</th>
</tr>
</thead>
<tbody>
<tr>
<th>Data size</th>
<td>3n</td>
<td>3n</td>
<td>9n</td>
<td>3n</td>
<td>n</td>
<td>n</td>
<td>2n</td>
</tr>
<tr>
<th>Data compatibility</th>
<td>✗</td>
<td>✗</td>
<td>✗</td>
<td>✗</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<th>Model compatibility</th>
<td>✓</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<th>Sufficient TBox</th>
<td>✓</td>
<td>✗</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<th>Technological compatibility</th>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<th>Time-specific URIs</th>
<td>✗</td>
<td>✗</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✗</td>
</tr>
<tr>
<th>Bitemporality</th>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<th>Extensibility</th>
<td>✓</td>
<td>✓</td>
<td>✗</td>
<td>✓</td>
<td>✗</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>
<h4 id="criteria">Criteria</h4>
<h5 id="data-size">Data size</h5>
<p>Data size represents the minimal number of triples needed to represent the data, compared with the atemporal triple count; it excludes the temporal annotation itself and data that can be inferred.</p>
<p>Property localization is the least space efficient approach because it requires a lot of boilerplate triples in ungainly OWL constructs.
On the contrary, using named graphs as stateful resources has the smallest footprint because it only requires wrapping the annotated triples in a named graph.
The same result is achieved by using dated URIs, which overload the content of resource identifiers, so that no additional triples are created.</p>
<h5 id="data-compatibility">Data compatibility</h5>
<p>This criterion reflects whether adding temporal annotation can be done without a transformation of existing data.</p>
<p>As the binary relations of basic triples are unable to capture temporal scope, all approaches limited to RDF triples require existing atemporal data to be transformed (e.g., reified) in order to attach temporal annotations.
Data compatibility is an advantage of named graphs, which make it possible to contextualize RDF without requiring data pre-processing.</p>
<h5 id="model-compatibility">Model compatibility</h5>
<p>Model compatibility determines if existing atemporal RDF vocabularies and ontologies may be used to represent temporally annotated data.</p>
<p>The ability to keep an atemporal domain model and only extend it with temporal annotation is an important advantage.
N-ary relations do not provide this ability because every time-varying property has to be remodelled.
Other modelling patterns make keeping existing models feasible, albeit patterns based on named graphs prevent the use of named graphs for other purposes (e.g., as dataset containers).</p>
<h5 id="sufficient-tbox">Sufficient TBox</h5>
<p>Building on the criterion of model compatibility this criterion indicates if a compared modelling pattern makes do without introducing additional TBox axioms<sup id="fnref:TBox"><a href="#fn:TBox" class="footnote">25</a></sup> into RDF vocabularies or ontologies used in temporally annotated data.</p>
<p>Of the reviewed approaches, both n-ary relations and property localization require their users to establish additional TBox axioms that cannot be reused from available RDF vocabularies or ontologies.</p>
<h5 id="technological-compatibility">Technological compatibility</h5>
<p>This aspect of compatibility tests whether the reviewed approaches are backwards compatible with existing linked data technologies (RDF, HTTP) without requiring custom extensions.</p>
<p>In this comparison, only dated URIs are marked as incompatible because they use custom URI schemes rather than typical HTTP URIs.
All other patterns adhere to technological standards of linked data.</p>
<h5 id="time-specific-uris">Time-specific URIs</h5>
<p>This criterion shows whether external datasets may link to a resource as of particular time.</p>
<p>Referring to a resource’s state as of a particular date is widely used when citing online resources.
By enabling time-specific URIs, third parties may add more data to describe a resource state.
This valuable property is typical of modelling patterns that use a resource-centric perspective, i.e. time slices, dated URIs and stateful resources.</p>
<h5 id="bitemporality">Bitemporality</h5>
<p>The criterion of bitemporality judges modelling patterns on the basis of their ability to capture both transaction and valid time.</p>
<p>Most of the examined patterns enable the use of a bitemporal data model; however, this is not possible with dated URIs, which facilitate capturing only a single temporal dimension.</p>
<h5 id="extensibility">Extensibility</h5>
<p>Extensibility is the criterion that shows if the approach in question permits other types of annotations to be expressed as well (e.g., provenance metadata).</p>
<p>All patterns that are deemed to be extensible enable bitemporality.
However, in the case of property localization, though it can provide a bitemporal data model, it is not particularly extensible, because describing more dimensions of data requires adding a new sub-property for every dimension, which results in a plethora of newly minted TBox axioms.</p>
<h2 id="related-work">Related work</h2>
<p>The views presented in the comparison of data modelling patterns in this paper build on a large corpus of existing research literature on the topic.
To highlight several notable works from this corpus, this section introduces the core ideas and solutions presented in them.
As motivated in the introduction, the issues of temporal data in RDF are widely perceived as pressing.
Even though it is not as long established as in SQL databases, research on modelling patterns for RDF data varying in time has produced a vast array of literature on the topic.
An extensive bibliography of research on the subject of temporal aspects of semantic web was compiled by Grandi (<a href="#Grandi2012">2012</a>).</p>
<p>Several comparisons of data modelling patterns for temporal annotation in RDF were conducted in the past.
Similarly to this article, Davis went through the options for modelling temporal data in RDF in a series of blog posts (<a href="#Davis2009a">2009a</a>), in which he compared conditions and time slices (<a href="#Davis2009b">2009b</a>), named graphs (<a href="#Davis2009c">2009c</a>), reified relations (<a href="#Davis2009d">2009d</a>) and n-ary relations (<a href="#Davis2009e">2009e</a>).
Rula et al. presented a comparison based on empirical study of use of temporal properties in a large corpus crawled from the Web (<a href="#Rula2012">2012</a>).
According to their analysis, the most common ways of representing temporal annotations are n-ary relations and document metadata.
Coming more from the perspective of ontological engineering, Gangemi and Presutti present seven RDF and OWL 2 logical patterns for modelling n-ary relations (<a href="#Gangemi2013">2013</a>), each of which is discussed in terms of its usability.
In his account of the topic, Hayes presents a unification algorithm for automatic translation between the different syntaxes used for representing temporally annotated information (<a href="#Hayes2004">2004</a>).
Temporal annotation may “trickle down” through the different levels of attachment, so that it indexes either whole statements, relationships, or individuals, which demonstrates the incidental nature of the syntax used for expressing temporal scope.
In this way interoperability between datasets employing distinct modelling styles may be achieved.</p>
<p>The approaches to data modelling presented here work in any standards-compliant RDF store.
However, additional features might be needed to make such data usable and its retrieval efficient.
For example, additional dedicated indexes may be built to improve query performance, and query rewriting might be employed to reduce the complexity of query formulation.
Foundational research on the implementation of temporal RDF was laid out by <a href="#Gutierrez2005">Gutierrez, Hurtado and Vaisman (2005)</a>.
Working in this direction, SPARQL was extended to function with temporally annotated data in T-SPARQL, which was inspired by the TSQL2 temporal query language in the domain of relational database management (<a href="#Grandi2010">Grandi, 2010</a>).
Several research papers proposed additional index structures and normalization of temporal annotations to speed up retrieval of temporal RDF (<a href="#Pugliese2008">Pugliese, Udrea and Subrahmanian, 2008</a>).
Grandi (<a href="#Grandi2009">2009</a>) proposed a multi-temporal database model with temporal query and update execution techniques, which were motivated by a focus on ontologies from the legal domain.</p>
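<p>As an illustration of what such retrieval may involve even without dedicated extensions, the following hedged SPARQL 1.1 sketch selects memberships modelled with the n-ary relation pattern that began on or before a given instant:</p>
<pre><code>PREFIX org:  <http://www.w3.org/ns/org#>
PREFIX time: <http://www.w3.org/2006/time#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?member ?organization ?start
WHERE {
  [] a org:Membership ;
     org:member ?member ;
     org:organization ?organization ;
     org:memberDuring/time:hasBeginning/time:inXSDDateTime ?start .
  # Keep only memberships that started on or before the queried instant.
  FILTER (?start <= "2001-01-01T00:00:00Z"^^xsd:dateTime)
}
</code></pre>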
<p>Support for some form of temporal RDF is already built into a few RDF stores and RDF-aware tools.
For example, the Parliament RDF store supports a dedicated temporal index.<sup id="fnref:Parliament"><a href="#fn:Parliament" class="footnote">26</a></sup>
R&WBase (<a href="#VanderSande2013">Vander Sande et al., 2013</a>), which was mentioned previously as a database with versioning capabilities, is a work in progress.
A versioning module is a part of the Apache Marmotta,<sup id="fnref:MarmottaVersioning"><a href="#fn:MarmottaVersioning" class="footnote">27</a></sup> an implementation of a linked data platform for publishing RDF data.</p>
<p>Looking at a wider context beyond RDF proper, the temporal dimension of data is supported in many other database solutions.
Some approaches extend triples to n-tuples to surmount the limitations of RDF; for example, the query language of Google Freebase allows access to historical versions of data.<sup id="fnref:MQL"><a href="#fn:MQL" class="footnote">28</a></sup>
Hoffart et al. report that the data model used in YAGO2 extends RDF triples to quintuples with time and location (<a href="#Hoffart2013">2013</a>).
The model uses statement reification internally and statements get de-reified for querying, so that they can be viewed as quintuples.
Efficient query performance is supported by the use of PostgreSQL and additional indexes for all tuples’ permutations.</p>
<p>Research on the topic of temporal data is long-established in the field of relational databases.
TSQL2 is a complete temporal query language designed as an extension of SQL-92.
Furthermore, ISO SQL:2011, the 7<sup>th</sup> revision of the SQL standard, incorporates temporal support.
Temporal retrieval is built into several non-RDF databases, which include Datomic,<sup id="fnref:Datomic"><a href="#fn:Datomic" class="footnote">29</a></sup> Google’s Spanner<sup id="fnref:Spanner"><a href="#fn:Spanner" class="footnote">30</a></sup> or IBM’s DB2 (<a href="#Sarracco2012">Saracco, Nicola and Gandhi, 2012</a>).</p>
<h2 id="conclusions">Conclusions</h2>
<p>In this overview we surveyed data modelling patterns for temporal linked data.
Each pattern was evaluated on a set of generic criteria, omitting dataset-specific criteria such as the size of changes or change frequency.
The criteria were designed on the basis of several guiding concerns regarded as relevant for linked data in particular.
The priorities given to the chosen criteria were motivated by recognizing what is important for the foundations on which linked data is built.</p>
<p>Backwards compatibility is a crucial virtue for the continual evolution of the Web.
Therefore a new modelling style for temporal linked data must work with existing atemporal data and with available technologies.
The resource-centric architecture of linked data is built on the core principles of REST.
The fundamental concepts of REST map well to four-dimensionalism, in which resources may be treated as perdurants and their representations as their temporal parts.
As for RDF, the data format of linked data, it is the source of the limitations that the presented modelling patterns try to circumvent.
The version of RDF that was originally standardized proved to be too restrictive for temporal data.
Approaches that transcend these limits turned out to be superior to those that hacked and twisted atemporal RDF.
Named graphs, a <em>de facto</em> standard on their way into the next version of the RDF specification, proved to be a viable option for modelling temporal data, offering an elegant syntax that bypasses the limits of the binary relations inherent in RDF.</p>
<p>However, even though best practices for modelling temporal RDF are emerging and technologies supporting such data are being developed, the diachronic dimension of linked data is still missing.
Given the large extent of research conducted on the topic, it is now a question of adoption of modelling patterns for temporal data by a broader audience.
The research has yet to be boiled down to concrete recommendations and guidance based on standards distilled from common consensus.</p>
<h2 id="references">References</h2>
<ul>
<li><a id="AWWW2004"></a><em>Architecture of the World Wide Web, volume 1.</em> [online]. W3C Recommendation. December 15<sup>th</sup>, 2004 [cit. 2013-06-13]. Available from WWW: <a href="http://www.w3.org/TR/webarch/">http://www.w3.org/TR/webarch/</a></li>
<li><a id="Auer2007"></a>AUER, Sören; HERRE, Heinrich. A versioning and evolution framework for RDF knowledge bases. In <em>Proceedings of the 6<sup>th</sup> International Andrei Ershov Memorial Conference on Perspectives of Systems Informatics</em>. Berlin; Heidelberg: Springer, 2007, p. 55 — 69. ISBN 978-3-540-70880-3.</li>
<li><a id="Auer2012"></a>AUER, Sören [et al.]. Diachronic linked data: towards long-term preservation of structured interrelated information. In <em>Proceedings of the 1<sup>st</sup> International Workshop on Open Data, Nantes, France, May 25, 2012</em>. New York (NY): ACM, 2012, p. 31 — 39. ISBN 978-1-4503-1404-6.</li>
<li><a id="Batsakis2011"></a>BATSAKIS, Sotiris; PETRAKIS, Euripides G. M. Representing temporal knowledge in the semantic web: the extended 4D fluents approach. In <em>Combinations of Intelligent Methods and Applications: proceedings of the 2<sup>nd</sup> International Workshop, CIMA 2010, France, October 2010.</em> Berlin; Heidelberg: Springer, 2011, p. 55 — 69. Smart innovation, systems and technologies, vol. 8. DOI 10.1007/978-3-642-19618-8_4.</li>
<li><a id="BernersLee2009"></a>BERNERS-LEE, Tim; CONNOLLY, Dan. <em>Delta: an ontology for the distribution of differences between RDF graphs</em> [online]. 2009-08-27 [cit. 2013-06-15]. Available from WWW: <a href="http://www.w3.org/DesignIssues/Diff">http://www.w3.org/DesignIssues/Diff</a></li>
<li><a id="Correndo2010"></a>CORRENDO, Gianluca [et al.]. Linked Timelines: temporal representation and management in linked data. In <em>First International Workshop on Consuming Linked Data (COLD 2010), Shanghai, China</em> [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-15]. CEUR workshop proceedings, vol. 665. Available from WWW: <a href="http://ceur-ws.org/Vol-665/CorrendoEtAl_COLD2010.pdf">http://ceur-ws.org/Vol-665/CorrendoEtAl_COLD2010.pdf</a></li>
<li><a id="DCMIPeriod2006"></a>COX, Simon. <em>DCMI Period Encoding Scheme: specification of the limits of a time interval, and methods for encoding this in a text string</em> [online]. 2006-04-10 [cit. 2013-06-13]. Available from WWW: <a href="http://dublincore.org/documents/dcmi-period/">http://dublincore.org/documents/dcmi-period/</a></li>
<li><a id="Davis2009a"></a>DAVIS, Ian. <em>Representing time in RDF part 1</em> [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: <a href="http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-1/">http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-1/</a></li>
<li><a id="Davis2009b"></a>DAVIS, Ian. <em>Representing time in RDF part 2</em> [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: <a href="http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-2/">http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-2/</a></li>
<li><a id="Davis2009c"></a>DAVIS, Ian. <em>Representing time in RDF part 3</em> [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: <a href="http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-3/">http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-3/</a></li>
<li><a id="Davis2009d"></a>DAVIS, Ian. <em>Representing time in RDF part 4</em> [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: <a href="http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-4/">http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-4/</a></li>
<li><a id="Davis2009e"></a>DAVIS, Ian. <em>Representing time in RDF part 5</em> [online]. August 10, 2009 [cit. 2013-06-15]. Available from WWW: <a href="http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-5/">http://blog.iandavis.com/2009/08/10/representing-time-in-rdf-part-5/</a></li>
<li><a id="N-ary2006"></a><em>Defining n-ary relations on the semantic web</em> [online]. NOY, Natasha; RECTOR, Alan (eds.). April 12, 2006 [cit. 2013-06-13]. Available from WWW: <a href="http://www.w3.org/TR/swbp-n-aryRelations/">http://www.w3.org/TR/swbp-n-aryRelations/</a></li>
<li><a id="Fielding2000"></a>FIELDING, Roy Thomas. <em>Architectural styles and the design of network-based software architectures.</em> Irvine (CA), 2000. 162 p. Dissertation (PhD.). University of California, Irvine.</li>
<li><a id="Gangemi2011"></a>GANGEMI, Aldo. Super-duper schema: an owl2+rif dns pattern. In CHAUDRY, V. (ed.). <em>Proceedings of DeepKR Challenge Workshop at KCAP 2011.</em> 2011. Also available from WWW: <a href="http://www.ai.sri.com/halo/public/dkrckcap2011/Gangemi.pdf">http://www.ai.sri.com/halo/public/dkrckcap2011/Gangemi.pdf</a></li>
<li><a id="Gangemi2013"></a>GANGEMI, Aldo; PRESUTTI, Valentina. A multi-dimensional comparison of ontology design patterns for representing n-ary relations. In <em>SOFSEM 2013: Theory and Practice of Computer Science</em>. Berlin; Heidelberg: Springer, 2013, p. 86 — 105. Lecture notes in computer science, vol. 7741. DOI 10.1007/978-3-642-35843-2_8.</li>
<li><a id="Grandi2009"></a>GRANDI, Fabio. Multi-temporal RDF ontology versioning. In <em>Proceedings of the 3<sup>rd</sup> International Workshop on Ontology Dynamics, collocated with the 8<sup>th</sup> International Semantic Web Conference, Washington DC, USA, October 26, 2009</em> [online]. Aachen: RWTH Aachen University, 2009 [cit. 2013-06-11]. CEUR workshop proceedings, vol. 519. Available from WWW: <a href="http://ceur-ws.org/Vol-519/grandi.pdf">http://ceur-ws.org/Vol-519/grandi.pdf</a></li>
<li><a id="Grandi2010"></a>GRANDI, Fabio. T-SPARQL: A TSQL2-like temporal query language for RDF. In <em>Local Proceedings of the 14<sup>th</sup> East-European Conference on Advances in Databases and Information Systems, Novi Sad, Serbia, September 20-24, 2010.</em> [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-11]. CEUR workshop proceedings, vol. 639. Available from WWW: <a href="http://ceur-ws.org/Vol-639/021-grandi.pdf">http://ceur-ws.org/Vol-639/021-grandi.pdf</a> </li>
<li><a id="Grandi2012"></a>GRANDI, Fabio. Introducing an annotated bibliography on temporal and evolution aspects in the semantic web. <em>ACM SIGMOD Record.</em> December 2012, vol. 41, iss. 4, p. 18 — 21. DOI 10.1145/2430456.2430460.</li>
<li><a id="Grau2008"></a>GRAU, Bernardo Cuenca [et al.]. OWL 2: the next step for OWL. <em>Journal of Web Semantics.</em> November 2008, vol. 6, iss. 4, p. 309 — 322. DOI 10.1016/j.websem.2008.05.001.</li>
<li><a id="Gutierrez2005"></a>GUTIERREZ, Claudio; HURTADO, Carlos A.; VAISMAN, Alejandro. Temporal RDF. In <em>The semantic web: research and applications: proceedings of the 2<sup>nd</sup> European Semantic Web Conference, Heraklion, Crete, Greece.</em> Berlin; Heidelberg: Springer, 2005, p. 93 — 107. Lecture Notes in Computer Science, vol. 3532. DOI 10.1007/11431053_7.</li>
<li><a id="Gutierrez2007"></a>GUTIERREZ, Claudio; HURTADO, Carlos A.; VAISMAN, Alejandro. Introducing time into RDF. <em>IEEE Transactions on Knowledge and Data Engineering.</em> February 2007, vol. 19, no. 2, p. 207 — 218. Also available from WWW: <a href="http://www.spatial.cs.umn.edu/Courses/Fall11/8715/papers/time-rdf.pdf">http://www.spatial.cs.umn.edu/Courses/Fall11/8715/papers/time-rdf.pdf</a></li>
<li><a id="Halloway2011"></a>HALLOWAY, Stuart. <em>Perception and action: an introduction to Clojure’s time model</em> [online]. April 15, 2011 [cit. 2013-06-15]. Available from WWW: <a href="http://www.infoq.com/presentations/An-Introduction-to-Clojure-Time-Model">http://www.infoq.com/presentations/An-Introduction-to-Clojure-Time-Model</a></li>
<li><a id="Hayes2004"></a>HAYES, Pat. <em>Formal unifying standards for the representation of spatiotemporal
knowledge</em> [online]. Pensacola (FL): IHMC, 2004 [cit. 2013-06-15]. Available from WWW: <a href="http://www.ihmc.us/users/phayes/arlada2004final.pdf">http://www.ihmc.us/users/phayes/arlada2004final.pdf</a></li>
<li><a id="Hickey2013"></a>HICKEY, Rich. <em>The Datomic information model</em> [online]. February 1, 2013 [cit. 2013-06-11]. Available from WWW: <a href="http://www.infoq.com/articles/Datomic-Information-Model">http://www.infoq.com/articles/Datomic-Information-Model</a></li>
<li><a id="Hoffart2013"></a>HOFFART, Johannes [et al.]. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. <em>Artificial Intelligence.</em> January 2013, vol. 194, p. 28 — 61. DOI 10.1016/j.artint.2012.06.001.</li>
<li><a id="IdentityOfIndiscernibles2010"></a>The identity of indiscernibles. In <em>Stanford encyclopedia of philosophy</em> [online]. August 15, 2010 [cit. 2013-06-15]. Available from WWW: <a href="http://plato.stanford.edu/entries/identity-indiscernible/">http://plato.stanford.edu/entries/identity-indiscernible/</a></li>
<li><a id="Jensen2000"></a>JENSEN, Christian S. Introduction to temporal database research. In <em>Temporal database management.</em> Aalborg, 2000. Dissertation thesis. Aalborg University. Also available from WWW: <a href="http://people.cs.aau.dk/~csj/Thesis/">http://people.cs.aau.dk/~csj/Thesis/</a> </li>
<li><a id="Kiryakov2002"></a>KIRYAKOV, Atanas; OGNYANOV, Damyan. Tracking changes in RDF(S) repositories. In <em>Proceedings of the 13<sup>th</sup> International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web</em>. London (UK): Springer, 2002, p. 373 — 378. Also available from WWW: <a href="http://www.ontotext.com/sites/default/files/publications/TrackingKTSW02.pdf">http://www.ontotext.com/sites/default/files/publications/TrackingKTSW02.pdf</a>. ISBN 3-540-44268-5.</li>
<li><a id="Krieger2008"></a>KRIEGER, Hans-Ulrich. Where temporal description logics fail: representing temporally-changing relationships. In <em>Advances in artificial intelligence: proceedings of the 31<sup>st</sup> Annual German Conference on AI, KI 2008, Kaiserslautern, Germany, September 23 — 26, 2008.</em> Berlin; Heidelberg: Springer, 2008, p. 249 — 257. Lecture notes in computer science, vol. 5243. DOI 10.1007/978-3-540-85845-4_31.</li>
<li><a id="Lopes2009"></a>LOPES, Nuno [et al.]. RDF needs annotations. In <em>RDF Next Steps: W3C workshop, June 26 — 27, 2010</em> [online]. 2009 [cit. 2013-06-13]. Available from WWW: <a href="http://www.w3.org/2009/12/rdf-ws/papers/ws09">http://www.w3.org/2009/12/rdf-ws/papers/ws09</a></li>
<li><a id="Masinter2012"></a>MASINTER, Larry. <em>The ‘tdb’ and ‘duri’ URI schemes, based on dated URIs</em> [online]. 2012 [cit. 2013-06-10]. Available from WWW: <a href="http://tools.ietf.org/html/draft-masinter-dated-uri-10">http://tools.ietf.org/html/draft-masinter-dated-uri-10</a></li>
<li><a id="McCusker2010"></a>MCCUSKER, James P.; MCGUINNESS, Deborah L. Towards identity in linked data. In <em>Proceedings of the 7<sup>th</sup> International Workshop on OWL: Experiences and Directions. San Francisco, California, USA, June 21 — 22, 2010</em> [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-09]. CEUR workshop proceedings, vol. 614. Available from WWW: <a href="http://ceur-ws.org/Vol-614/owled2010_submission_12.pdf">http://ceur-ws.org/Vol-614/owled2010_submission_12.pdf</a></li>
<li><a id="Mittelbach2008"></a>MITTELBACH, Arno. <em>RDF and the time dimension, part 1</em> [online]. 2008-11-28 [cit. 2013-06-16]. Available from WWW: <a href="http://oxforderewhon.wordpress.com/2008/11/28/rdf-and-the-time-dimension-part-1/">http://oxforderewhon.wordpress.com/2008/11/28/rdf-and-the-time-dimension-part-1/</a></li>
<li><a id="OWL2WOL"></a><em>OWL 2 Web Ontology Language: new features and rationale</em> [online]. GOLBREICH, Christine; WALLACE, Evan K. (eds.). 2<sup>nd</sup> ed. W3C, 2012 [cit. 2013-06-11]. Available from WWW: <a href="http://www.w3.org/TR/owl2-new-features">http://www.w3.org/TR/owl2-new-features</a></li>
<li><a id="Papavasileiou2013"></a>PAPAVASILEIOU, Vicky [et al.]. High-level change detection in RDF(S) KBs. <em>ACM Transations on Database Systems</em>. April 2013, vol. 38, iss. 1. DOI 10.1145/2445583.2445584.</li>
<li><a id="Prochazka2011"></a>PROCHÁZKA, Jiří [et al.]. <em>The Property Reification Vocabulary 0.11</em> [online]. February 19, 2011 [cit. 2013-06-10]. Available from WWW: <a href="http://smiy.sourceforge.net/prv/spec/propertyreification.html">http://smiy.sourceforge.net/prv/spec/propertyreification.html</a></li>
<li><a id="Pugliese2008"></a>PUGLIESE, Andrea; UDREA, Octavian; SUBRAHMANIAN, V. S. Scaling RDF with time. In <em>Proceedings of the 17<sup>th</sup> international conference on World Wide Web</em>. New York (NY): ACM, 2008, p. 605 — 614. Also available from WWW: <a href="http://wwwconference.org/www2008/papers/pdf/p605-puglieseA.pdf">http://wwwconference.org/www2008/papers/pdf/p605-puglieseA.pdf</a>. DOI 10.1145/1367497.1367579.</li>
<li><a id="RDFSemantics2004"></a><em>RDF semantics</em> [online]. HAYES, Patrick (ed.). 2004 [cit. 2013-06-13]. Available from WWW: <a href="http://www.w3.org/TR/rdf-mt/">http://www.w3.org/TR/rdf-mt/</a></li>
<li><a id="RDF11"></a><em>RDF 1.1 concepts and abstract syntax: W3C Working Draft</em> [online]. CYGANIAK, Richard; WOOD, David (eds.). January 15, 2013 [cit. 2013-06-12]. Available from WWW: <a href="http://www.w3.org/TR/rdf11-concepts/">http://www.w3.org/TR/rdf11-concepts/</a></li>
<li><a id="RDFPrimer2004"></a><em>RDF primer”</em> [online]. MANOLA, Frank; MILLER, Eric (eds.). February 10, 2004 [cit. 2013-06-15]. Available from WWW: <a href="http://www.w3.org/TR/rdf-primer/">http://www.w3.org/TR/rdf-primer/</a></li>
<li><a id="Rees2009"></a>REES, Jonathan; BOOTH, David; HAUSENBLAS, Michael. <em>Towards formal HTTP semantics: AWWSW report to the TAG</em> [online]. December 4, 2009 [cit. 2013-06-12]. Available from WWW: <a href="http://www.w3.org/2001/tag/awwsw/http-semantics-report.html">http://www.w3.org/2001/tag/awwsw/http-semantics-report.html</a></li>
<li><a id="Rula2012"></a>RULA, Anisa [et al.]. On the diversity and availability of temporal information in linked open data. In <em>Proceedings of the 11<sup>th</sup> International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, part I.</em> Berlin; Heidelberg: Springer, 2012, p. 492 — 507. Lecture notes in computer science, vol. 7649. DOI 10.1007/978-3-642-35176-1_31.</li>
<li><a id="Sanderson2012"></a>SANDERSON, Robert D.; VAN DE SOMPEL, Herbert. Cool URIs and dynamic data. <em>IEEE Internet Computing</em>. 2012, vol. 16, no. 4, p. 76 — 79. Also available from WWW: <a href="http://public.lanl.gov/herbertv/papers/Papers/2012/CoolURIsDynamicData.pdf">http://public.lanl.gov/herbertv/papers/Papers/2012/CoolURIsDynamicData.pdf</a>. DOI 10.1109/MIC.2012.78.
<a id="Saracco2012"></a>SARACCO, Cynthia M.; NICOLA, Matthias; GANDHI, Lenisha. <em>A matter of time: Temporal data management in DB2 10</em> [online]. April 3, 2012 [cit. 2013-06-11]. Available from WWW: <a href="http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/">http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/</a></li>
<li><a id="Tappolet2009"></a>TAPPOLET, Jonas; BERNSTEIN, Abraham. Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. In <em>Proceedings of the 6<sup>th</sup> European Semantic Web Conference</em>. Berlin; Heidelberg: Springer, 2009, p. 308 — 322. DOI 10.1007/978-3-642-02121-3_25.</li>
<li><a id="Tennison2009a"></a>TENNISON, Jeni. <em>Temporal scope for RDF triples</em> [online]. 2009-02-15 [cit. 2013-06-13]. Available from WWW: <a href="http://www.jenitennison.com/blog/node/101">http://www.jenitennison.com/blog/node/101</a></li>
<li><a id="Tennison2009b"></a>TENNISON, Jeni. <em>Linked open data in a changing world</em> [online]. 2009-07-10 [cit. 2013-06-12]. Available from WWW: <a href="http://www.jenitennison.com/blog/node/108">http://www.jenitennison.com/blog/node/108</a></li>
<li><a id="Tennison2010"></a>TENNISON, Jeni. <em>Versioning (UK government) linked data</em> [online]. 2010-02-27 [cit. 2013-06-15]. Available from WWW: <a href="http://www.jenitennison.com/blog/node/141">http://www.jenitennison.com/blog/node/141</a></li>
<li><a id="OWLTime2006"></a><em>Time Ontology in OWL: W3C Working Draft 27 September 2006</em>. HOBBS, Jerry R.; PAN, Feng (eds.). W3C, 2006 [cit. 2013-06-09]. Available from WWW: <a href="http://www.w3.org/TR/owl-time/">http://www.w3.org/TR/owl-time/</a></li>
<li><a id="Umbrich2010"></a>UMBRICH, Jörgen; KARNSTEDT, Marcel; LAND, Sebastian. Towards understanding the changing web: mining the dynamics of linked-data sources and entities. In <em>Proceedings of the LWO 2010 Workshop, October 4-6, 2010, Kassel, Germany</em> [online]. Kassel: Universität Kassel, 2010 [cit. 2013-06-09]. Available from WWW: <a href="http://www.kde.cs.uni-kassel.de/conf/lwa10/papers/kdml22.pdf">http://www.kde.cs.uni-kassel.de/conf/lwa10/papers/kdml22.pdf</a></li>
<li><a id="VanderSande2013"></a>VANDER SANDE, Miel [et al.]. R&Wbase: Git for triples. In <em>Proceedings of the WWW2013 Workshop on Linked Data on the Web 2013, May 14, 2013, Rio de Janeiro, Brazil</em> [online]. Aachen: RWTH Aachen University, 2013 [cit. 2013-06-09]. CEUR workshop proceedings, vol. 996. Available from WWW: <a href="http://events.linkeddata.org/ldow2013/papers/ldow2013-paper-01.pdf">http://events.linkeddata.org/ldow2013/papers/ldow2013-paper-01.pdf</a>. ISSN 1613-0073.</li>
<li><a id="VanDeSompel2013"></a>VAN DE SOMPEL, Herbert; NELSON, Michael L.; SANDERSON, Robert D. <em>HTTP framework for time-based access to resource states: Memento</em> [online]. March 29, 2013 [cit. 2013-06-11]. Available from WWW: <a href="http://tools.ietf.org/html/draft-vandesompel-memento-07">http://tools.ietf.org/html/draft-vandesompel-memento-07</a></li>
<li><a id="Welty2006"></a>WELTY, Christopher A.; FIKES, Richard. A reusable ontology for fluents in OWL. <em>Formal Ontology in Information Systems: Proceedings of the Fourth International Conference (FOIS 2006).</em> Amsterdam: IOS, 2006, p. 226 — 236. Frontiers in artificial intelligence and applications, vol. 150. ISBN 978-1-58603-685-0.</li>
<li><a id="Welty2010"></a>WELTY, Christopher A. Context slices: representing contexts in OWL. In <em>Proceedings of the 2<sup>nd</sup> International Workshop on Ontology Patterns - WOP2010</em> [online]. Aachen: RWTH Aachen University, 2010 [cit. 2013-06-16]. CEUR workshop proceedings, vol. 671. Available from WWW: <a href="http://ceur-ws.org/Vol-671/pat01.pdf">http://ceur-ws.org/Vol-671/pat01.pdf</a>. ISSN 1613-0073.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<ol>
<li id="fn:changefreq">
<p>For example, Semantic Sitemap with <code><changefreq></code> element.<a href="#fnref:changefreq" class="reversefootnote">↩</a></p>
</li>
<li id="fn:XMLSchema">
<p><a href="http://www.w3.org/TR/xmlschema-2/">http://www.w3.org/TR/xmlschema-2/</a><a href="#fnref:XMLSchema" class="reversefootnote">↩</a></p>
</li>
<li id="fn:ISO8601">
<p><a href="http://en.wikipedia.org/wiki/ISO_8601">http://en.wikipedia.org/wiki/ISO_8601</a><a href="#fnref:ISO8601" class="reversefootnote">↩</a></p>
</li>
<li id="fn:DCMIPeriod">
<p>For example, the Dublin Core initiative proposed a way of encoding time intervals into typed literals (<a href="#DCMIPeriod2006">Cox, 2006</a>).<a href="#fnref:DCMIPeriod" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Neo4j-Dynagraph">
<p><a href="https://github.com/ccattuto/neo4j-dynagraph/wiki/Representing-time-dependent-graphs-in-Neo4j">https://github.com/ccattuto/neo4j-dynagraph/wiki/Representing-time-dependent-graphs-in-Neo4j</a><a href="#fnref:Neo4j-Dynagraph" class="reversefootnote">↩</a></p>
</li>
<li id="fn:SKOS">
<p><a href="http://www.w3.org/TR/skos-reference/">http://www.w3.org/TR/skos-reference/</a><a href="#fnref:SKOS" class="reversefootnote">↩</a></p>
</li>
<li id="fn:DCTerms">
<p><a href="http://dublincore.org/documents/dcmi-terms/">http://dublincore.org/documents/dcmi-terms/</a><a href="#fnref:DCTerms" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Turtle">
<p><a href="http://www.w3.org/TR/turtle/">http://www.w3.org/TR/turtle/</a><a href="#fnref:Turtle" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Reification">
<p><a href="http://www.w3.org/TR/rdf-mt/#Reif">http://www.w3.org/TR/rdf-mt/#Reif</a><a href="#fnref:Reification" class="reversefootnote">↩</a></p>
</li>
<li id="fn:OWL2Annotations">
<p><a href="http://www.w3.org/TR/owl2-mapping-to-rdf/#a_Annotation">http://www.w3.org/TR/owl2-mapping-to-rdf/#a_Annotation</a><a href="#fnref:OWL2Annotations" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Changeset">
<p><a href="http://docs.api.talis.com/getting-started/changesets">http://docs.api.talis.com/getting-started/changesets</a><a href="#fnref:Changeset" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Eventset">
<p><a href="http://www.cibiv.at/~niko/dsnotify/vocab/eventset/v0.1/dsnotify-eventset.html">http://www.cibiv.at/~niko/dsnotify/vocab/eventset/v0.1/dsnotify-eventset.html</a><a href="#fnref:Eventset" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Nary">
<p>There is a convention of using the term “n-ary relation” for a relation with an arity higher than 2, even though unary and binary relations are n-ary relations as well.<a href="#fnref:Nary" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Freebase">
<p><a href="http://www.freebase.com/">http://www.freebase.com/</a><a href="#fnref:Freebase" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Wikidata">
<p><a href="http://www.wikidata.org/wiki/Wikidata:Main_Page">http://www.wikidata.org/wiki/Wikidata:Main_Page</a><a href="#fnref:Wikidata" class="reversefootnote">↩</a></p>
</li>
<li id="fn:OrgOnt">
<p><a href="http://www.w3.org/TR/vocab-org/#membership-n-ary-relationship">http://www.w3.org/TR/vocab-org/#membership-n-ary-relationship</a><a href="#fnref:OrgOnt" class="reversefootnote">↩</a></p>
</li>
<li id="fn:GenOnt">
<p><a href="http://www.w3.org/DesignIssues/Generic">http://www.w3.org/DesignIssues/Generic</a><a href="#fnref:GenOnt" class="reversefootnote">↩</a></p>
</li>
<li id="fn:URIOpacity">
<p><a href="http://www.w3.org/DesignIssues/Axioms.html#opaque">http://www.w3.org/DesignIssues/Axioms.html#opaque</a><a href="#fnref:URIOpacity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:NamedGraphsSyntax">
<p>There are several proposed serialization formats (e.g., TriG or TriX), none of which reached the status of an official recommendation.<a href="#fnref:NamedGraphsSyntax" class="reversefootnote">↩</a></p>
</li>
<li id="fn:NamedGraphs">
<p><a href="http://patterns.dataincubator.org/book/named-graphs.html">http://patterns.dataincubator.org/book/named-graphs.html</a><a href="#fnref:NamedGraphs" class="reversefootnote">↩</a></p>
</li>
<li id="fn:VoID">
<p>A property from Vocabulary of Interlinked Datasets (VoID). <a href="http://www.w3.org/TR/void/">http://www.w3.org/TR/void/</a><a href="#fnref:VoID" class="reversefootnote">↩</a></p>
</li>
<li id="fn:TriG">
<p><a href="http://www.w3.org/2010/01/Turtle/Trig">http://www.w3.org/2010/01/Turtle/Trig</a><a href="#fnref:TriG" class="reversefootnote">↩</a></p>
</li>
<li id="fn:StatefulResources">
<p><a href="http://www.w3.org/2011/rdf-wg/wiki/User:Rcygania2/RDF_Datasets_and_Stateful_Resources">http://www.w3.org/2011/rdf-wg/wiki/User:Rcygania2/RDF_Datasets_and_Stateful_Resources</a><a href="#fnref:StatefulResources" class="reversefootnote">↩</a></p>
</li>
<li id="fn:PROV-O">
<p><a href="http://www.w3.org/TR/prov-o/">http://www.w3.org/TR/prov-o/</a><a href="#fnref:PROV-O" class="reversefootnote">↩</a></p>
</li>
<li id="fn:TBox">
<p>TBox constitutes the terminology used in RDF assertions.<a href="#fnref:TBox" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Parliament">
<p><a href="http://parliament.semwebcentral.org/">http://parliament.semwebcentral.org/</a><a href="#fnref:Parliament" class="reversefootnote">↩</a></p>
</li>
<li id="fn:MarmottaVersioning">
<p><a href="http://marmotta.incubator.apache.org/kiwi/versioning.html">http://marmotta.incubator.apache.org/kiwi/versioning.html</a><a href="#fnref:MarmottaVersioning" class="reversefootnote">↩</a></p>
</li>
<li id="fn:MQL">
<p>Metaweb Query Language. <a href="http://mql.freebaseapps.com/ch03.html#history">http://mql.freebaseapps.com/ch03.html#history</a><a href="#fnref:MQL" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Datomic">
<p><a href="http://docs.datomic.com/architecture.html">http://docs.datomic.com/architecture.html</a><a href="#fnref:Datomic" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Spanner">
<p><a href="http://research.google.com/archive/spanner.html">http://research.google.com/archive/spanner.html</a><a href="#fnref:Spanner" class="reversefootnote">↩</a></p>
</li>
</ol>
Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-47717061106314850702013-05-12T15:20:00.005+02:002013-05-12T15:20:42.830+02:00Alphabet of your webYour browser knows a lot about you. A feature that demonstrates its vast knowledge of you is URL autocomplete. When you start to type a URL, it suggests possible URLs containing the fragment you typed, based on the web sites you visit most frequently. When you start to use a new browser, pretty soon it'll be completing your URLs.<br />
<br />
The collection of the web sites you visit most often forms your personal subset of the Web. It describes your online habits, which in turn describe you; the things you do on the Web, both those that you must do and the ones you enjoy. I think a nice way to picture your online world is to collect the URLs that your browser suggests for every letter of the alphabet. Here's my personal list of places that I'm <a href="http://www.youtube.com/watch?v=2TkLhtXM0go">always returning</a> to on the Web.<br />
<br />
<b>A</b> ... <a href="http://answers.semanticweb.com/"><b>a</b>nswers.semanticweb.com</a><br />
<b>B</b> ... <a href="http://bit.ly/"><b>b</b>it.ly</a><br />
<b>C</b> ... <a href="http://calendar.google.com/"><b>c</b>alendar.google.com</a><br />
<b>D</b> ... <a href="http://drive.google.com/"><b>d</b>rive.google.com</a><br />
<b>E</b> ... <a href="http://eurostat.linked-statistics.org/sparql"><b>e</b>urostat.linked-statistics.org/sparql</a><br />
<b>F</b> ... <a href="http://facebook.com/"><b>f</b>acebook.com</a><br />
<b>G</b> ... <a href="http://github.com/OPLZZ/data-modelling"><b>g</b>ithub.com/OPLZZ/data-modelling</a><br />
<b>H</b> ... <a href="http://headtoweb.posterous.com/"><b>h</b>eadtoweb.posterous.com</a><br />
<b>I</b> ... <a href="http://isis.vse.cz/"><b>i</b>sis.vse.cz</a><br />
<b>J</b> ... <a href="http://joinup.ec.europa.eu/asset/esco/home"><b>j</b>oinup.ec.europa.eu/asset/esco/home</a><br />
<b>K</b> ... <a href="http://kosek.cz/vyuka/4iz238/"><b>k</b>osek.cz/vyuka/4iz238/</a><br />
<b>L</b> ... <a href="lod2.vse.cz:8890/sparql"><b>l</b>od2.vse.cz:8890/sparql</a><br />
<b>M</b> ... <a href="http://mail.google.com/"><b>m</b>ail.google.com</a><br />
<b>N</b> ... <a href="http://netstorage.vse.cz/"><b>n</b>etstorage.vse.cz</a><br />
<b>O</b> ... <a href="http://or.justice.cz/"><b>o</b>r.justice.cz</a><br />
<b>P</b> ... <a href="http://prefix.cc/"><b>p</b>refix.cc</a><br />
<b>Q</b> ... <a href="http://foursquare.com/">fours<b>q</b>uare.com</a><br />
<b>R</b> ... <a href="http://regiojet.cz/"><b>r</b>egiojet.cz</a><br />
<b>S</b> ... <a href="http://slovnik.seznam.cz/"><b>s</b>lovnik.seznam.cz</a><br />
<b>T</b> ... <a href="http://twitter.com/"><b>t</b>witter.com</a><br />
<b>U</b> ... <a href="http://usaspending.gov/data"><b>u</b>saspending.gov/data</a><br />
<b>V</b> ... <a href="http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtBulkRDFLoader#Bulk loading process"><b>v</b>irtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtBulkRDFLoader#Bulk loading process</a><br />
<b>W</b> ... <a href="http://w3.org/TR/sparql11-query/"><b>w</b>3.org/TR/sparql11-query/</a><br />
<b>X</b> ... <a href="http://xrg15.projekty.ms.mff.cuni.cz/geo-enhancer.html"><b>x</b>rg15.projekty.ms.mff.cuni.cz/geo-enhancer.html</a><br />
<b>Y</b> ... <a href="http://youtube.com/"><b>y</b>outube.com</a><br />
<b>Z</b> ... <a href="http://zvon.org/">zvon.org</a><br />
<br />
When you have such a list, you can look through it and reflect on the places on the Web you have spent hours with. Browsing through these links (all coloured as visited ones), you become aware of the parts of the Web where you do your work and where you waste time avoiding work. You recognize web sites where you seek information and web sites where you seek entertainment. You can also find dead places, such as the URL of my previous blog in my list. Overall, it's likely that such a list can satisfy your craving for stats for a while and can provide you with insights into your quantified online self.<br />
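You do not have to read the suggestions off the address bar by hand, either. A rough Python sketch along the following lines could derive a similar alphabet from a copy of Firefox's <tt>places.sqlite</tt> history database; the file location, the <tt>moz_places</tt> query and the frecency-based ranking are assumptions of the sketch, and other browsers store their history differently.<br />
<pre># A rough sketch that derives an "alphabet of your web" from a copy of
# Firefox's places.sqlite history database. The frecency-based ranking and
# the database layout are assumptions; other browsers differ.
import sqlite3
import string
from urllib.parse import urlparse

connection = sqlite3.connect("places.sqlite")
hosts = []
for url, frecency in connection.execute(
        "SELECT url, frecency FROM moz_places WHERE frecency > 0"):
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[4:]
    if host:
        hosts.append((frecency, host))
connection.close()

# For each letter, print the most frequently and recently visited host.
for letter in string.ascii_lowercase:
    candidates = [entry for entry in hosts if entry[1].startswith(letter)]
    if candidates:
        print(letter.upper(), "...", max(candidates)[1])</pre>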
<br />
Now you know what to do: build a browser extension that creates the alphabet of your web automatically.Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-33464338551228232022013-01-31T19:39:00.000+01:002013-03-16T23:00:16.685+01:00Applying linked open data to public procurement<small>The following post comprises some of my notes for the talk <a href="http://www.slideshare.net/jindrichmynarz/applying-linked-open-data-to-public-procurement-psi-group-meeting-2013">Applying Linked Open Data to Public Procurement</a>, which I gave at the EU <a href="http://ec.europa.eu/information_society/policy/psi/facilitating_reuse/psigroup/index_en.htm">PSI Group</a> meeting on January 24, 2013.</small><br />
<small><br /></small>
Public procurement is quite a specific domain. It is the domain in which the public sector and business interface, forming business deals that trade public funding for goods and services. It is a huge intersection that, as of 2010, accounts for 17.3 % of the gross domestic product of the European Union (<a href="http://ec.europa.eu/internal_market/consultations/docs/2010/e-procurement/siemens-study_en.pdf">source</a> [PDF]). For example, it amounts to almost 20 billion EUR in the Czech Republic. If I were to use a technical term, then “profiling the EU” might show public procurement as its <em>hot-spot</em>. Therefore, I think that applying linked open data to public procurement in the EU is <em>optimizing where it matters</em>.<br />
<br />
The life-cycle of data in this domain is <em>“solitary, poor, nasty, brutish, and short”</em> (<a href="http://en.wikipedia.org/wiki/Leviathan_(book)#cite_note-2">source</a>). The transience is natural to public procurement data. The data comes in streams of public notices, calls for tenders and the like, and loses most of its value once opportunities for businesses are turned into contracts that are awarded and signed. The life-cycle of the data is tightly bound to applications that are used either to produce or consume the data. The data could serve a lot longer than it does, trapped inside applications and restrictive licensing regimes. Even the public procurement data from the past, while losing its perceived value progressively over time, may well serve for analytics.<br />
<br />
In this respect linked data helps by decoupling data from applications. It models data in an application-agnostic way so that it can power all kinds of applications. Moreover, semantic web technologies offer means to decouple data from natural languages by describing data in a structured way, so that language understanding is not required for all types of useful data processing. In this way, linked data prolongs the life-cycle of data and makes it work both for the public sector and the public.<br />
<br />
Information about public procurement is distributed unevenly. Not all parties interested in public procurement have an equal level of access to information describing it. Access to information is vital for the pre-award stage of procurement, which is in fact also called the <em>information stage</em>. However, the distribution of information in this domain is asymmetric, which may be the root cause of the inefficiencies the domain is infested with. Linked data, and linked <em>open</em> data in particular, strives for an equal level of access to information, so that everyone starts with the same initial conditions when participating in public procurement.<br />
<br />
Reforming the way data in public procurement works is needed to ensure optimal functioning in the domain on a number of levels.<br />
<br />
Public procurement is an area well-saturated with money and so it presents numerous opportunities for corruption and affords systemic inefficiencies. Thus, it is crucial to procure public contracts in a <em>transparent</em> manner. In public procurement, contracting should be done in <em>public</em>; it should be documented and available to the members of the public. Documents should clearly show that decisions about public contracts are subordinate to public interest, not self-interest. In this way, transparency may yield accountability, as individual decisions can be attributed to decision-makers.<br />
<br />
Yet most of the volume of public procurement is not transparent. What is governed by the current rules (e.g., in the European Union) is only a subset of all public procurement: the public contracts that exceed the thresholds for mandatory reporting. The contracts not covered by these rules form the <em>long tail</em> of public procurement. For example, it is estimated that 85 % of public contracts in the EU are not announced via the <a href="http://ted.europa.eu/">central system <abbr title="Tenders Electronic Daily">TED</abbr></a> (<a href="http://epsiplatform.eu/sites/default/files/Final%20TR%20procurement%20data%20v1.1.pdf">source</a>). Even though some of this massive “dark matter” of public procurement is available through local portals, it constitutes information that is difficult to reach.<br />
<br />
Transparency and symmetric distribution of information configure public procurement for fairer competition. If all interested parties have equal access to information about public procurement, then their opportunity to participate in procurement is equal, and eventually, more openness leads to higher competition.<br />
<br />
Ensuring equal initial conditions is particularly important for increasing the volume of cross-country procurement. Even though companies and individuals frequently buy from suppliers in other countries, such behaviour is still quite rare in public procurement. Having the data on opportunities in public procurement openly available, in a structured format that does not rely on natural language description, could help boost the number of cross-country public contracts.<br />
<br />
In the end, what optimizing the flow of data in public procurement brings is efficiency. It might help the public sector achieve better resource allocation and exploit the potential for cost savings. In general, the quality of public procurement depends on the quality of procurement data: better data affords better operation.<br />
<h2>
Barriers</h2>
I have tried to outline some of the potential benefits of applying linked open data to public procurement. In many cases such application would lower or remove the barriers that are present in the current public procurement systems.<br />
<ul>
<li><strong>Unclear licensing</strong><br />What does '© European Union' mean? Discovering the correct interpretation of the legal conditions governing the use of public procurement data may often require a significant investment of time. Especially in the case of national portals, the conditions of use are either vague, missing or burdened with special disclaimers. When foreign public contracts are of concern, delving into local legislation may prove to be a time-consuming exercise. In any of these cases, explicit and standardized licences recommended for open data would help.</li>
<li><strong>Machine readability</strong><br />Being hardly accessible to machines, many of the data sources in public procurement seem to have <em>exclusive arrangements with humans</em>. Such datasets are embedded in web pages, out of which the original structured data has to be reconstructed by techniques such as screen-scraping. We have already learnt that screen-scraping incurs a <em>high marginal cost of reproduction</em>. To avoid this unnecessary cost, data sources should provide direct access to raw, structured data.</li>
<li><strong>Entity reconciliation</strong><br />Basically, entity reconciliation is a technical term for merging records that describe the same entity. Unfortunately, in many cases public procurement data sources do not offer enough information to determine whether two entities are the same. Identifiers are helpful, yet in some sources they occur fairly rarely, even though reporting them is mandatory. In practice, deduplication of business entities proves to be difficult without having their rich descriptions available in data (a minimal sketch of name-based matching follows this list).</li>
</ul>
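The reconciliation problem from the last point can be illustrated with a rough Python sketch that merges business entities on a normalised legal name. The record fields and sample values are made up for the illustration, and real matchmaking would also use official numbers and much richer descriptions of the entities.<br />
<pre># A minimal sketch of name-based reconciliation; the record fields and the
# sample values are hypothetical. Real deduplication would also use official
# numbers and richer descriptions of the entities.
import re
import unicodedata
from collections import defaultdict

def normalize_name(name):
    # Strip accents, lower-case, and collapse punctuation and whitespace.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()

def group_by_name(entities):
    # Group records whose normalized legal names coincide.
    groups = defaultdict(list)
    for entity in entities:
        groups[normalize_name(entity["legalName"])].append(entity)
    return groups

sample = [
    {"legalName": "Hlavní město Praha", "officialNumber": "00064581"},
    {"legalName": "HLAVNÍ MĚSTO PRAHA", "officialNumber": ""},
    {"legalName": "Hlavni mesto Praha", "officialNumber": "00064581"},
]

for name, group in group_by_name(sample).items():
    print(name, "->", len(group), "records")</pre>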
I think that linked open data can help to remove these road blocks on the way to a single, global market for public procurement. The potential of new technologies and their impact on public procurement is immense. The European Commission recognized it in an <a href="http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2011:330:0039:0042:EN:PDF">official decision</a>, saying that:<br />
<blockquote>
“The new information and communication technologies have created unprecedented possibilities to aggregate and combine content from different sources.”</blockquote>
Combining and merging data is an area where linked data shines the most. Ultimately, if we want to combine markets we must first combine the data they are built on. Let's see if linked data technologies offer a good “glue” to mash up public procurement markets.Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-1755077315501173462012-11-22T14:43:00.000+01:002013-03-16T23:22:20.430+01:00Sampling CSV headers from the Data HubRecently, I decided to check how useful column headers in typical CSV files are. My hunch was that in many cases columns would be labelled ambiguously or that the header row would be simply missing from many CSVs. In such cases data may be near to useless, since hints on how to use the data are lacking.<br />
<br />
To support my assumptions about the typical CSV file, I needed sample data. Many such files are listed as downloadable resources in the <a href="http://thedatahub.org/">Data Hub</a>, which is one of the most extensive <a href="http://ckan.org/">CKAN</a> instances. Fortunately for me, CKAN exposes a <a href="http://docs.ckan.org/en/latest/api.html">friendly API</a>. However, an even friendlier way for me was to obtain the data by using the <a href="http://semantic.ckan.net/sparql">SPARQL endpoint</a> of the <a href="http://semantic.ckan.net/">Semantic CKAN</a>, which offers access to the Data Hub data in RDF. Simply put:<br />
<blockquote class="twitter-tweet">
SPARQL is the best API.<br />
— Jindřich Mynarz (@jindrichmynarz) <a href="https://twitter.com/jindrichmynarz/status/269089964103434242">November 15, 2012</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
This is the query that I used:<br />
<pre>PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?accessURL
WHERE {
  ?s a dcat:Distribution ;
     dcat:accessURL ?accessURL .
  FILTER (STRENDS(STR(?accessURL), "csv"))
}</pre>
I saved the query in the <tt>query.txt</tt> file and executed it on the endpoint:<br />
<pre>curl -H "Accept:text/csv" --data-urlencode "query@query.txt" http://semantic.ckan.net/sparql > files.csv</pre>
In the command, I took advantage of content negotiation provided by <a href="http://virtuoso.openlinksw.com/">OpenLink's Virtuoso</a> and set the HTTP <tt>Accept</tt> header to the MIME type <tt>text/csv</tt>. I made <tt>curl</tt> load the query from the <tt>query.txt</tt> file and pass it in the <tt>query</tt> parameter by using the argument <tt>"query@query.txt"</tt> (thanks to <a href="https://twitter.com/cygri">@cygri</a> for this <a href="http://answers.semanticweb.com/answer_link/10712/">tip</a>). The query results were stored in the <tt>files.csv</tt> file.<br />
<br />
Having a list of CSV files, I was prepared to download them. I created a directory for the files that I wanted to get and moved into it with <tt>mkdir download; cd download</tt>. To download the CSV files I executed:<br />
<pre>tail -n+2 ../files.csv | xargs -n 1 curl -L --range 0-499 --fail --silent --show-error -O 2> fails.txt</pre>
To skip the header row containing the SPARQL results variable name, I used <tt>-n+2</tt>. I piped the list of CSV files to <tt>curl</tt>. I switched the <tt>-L</tt> argument on in order to follow redirects. To minimize the amount of downloaded data I set <tt>--range</tt> to <tt>0-499</tt> to fetch only a partial response containing the first 500 bytes from servers that support HTTP/1.1. Finally, I muted <tt>curl</tt> with <tt>--silent</tt> and <tt>--fail</tt>, kept error messages thanks to <tt>--show-error</tt>, and redirected them to the <tt>fails.txt</tt> file.<br />
<br />
When the CSV files were retrieved, I concatenated their first lines:<br />
<pre>find * | xargs -n 1 head -1 | sort | perl -p -e "s/^M//g" > 1st_lines.txt</pre>
<tt>head -1</tt> outputted the first line from every file that was passed to it through <tt>xargs</tt>. To polish the output a bit, I <tt>sort</tt>ed it and removed superfluous carriage-return (<tt>^M</tt>) characters with <tt>perl -p -e "s/^M//g"</tt>. Finally, I had a list with samples of CSV column headers.<br />
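A few lines of Python could also tally the header fields collected in <tt>1st_lines.txt</tt> to see which labels dominate. This is only a rough sketch, assuming the file fits in memory and is more or less UTF-8:<br />
<pre># A rough sketch for tallying the header labels collected in 1st_lines.txt,
# to see which ambiguous labels (such as "id" or "amount") are most common.
import csv
from collections import Counter

counter = Counter()
with open("1st_lines.txt", newline="", encoding="utf-8", errors="replace") as lines:
    for header_row in csv.reader(lines):
        counter.update(label.strip().lower() for label in header_row if label.strip())

for label, count in counter.most_common(20):
    print(count, label)</pre>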
<br />
By inspecting the samples, I found that ambiguous column labels are indeed the case, as labels such as “amount” or “id” are fairly widespread. Examples of other labels that caught my attention included “A-in-A”, “Column 42” and the particularly mysterious label “X”. Disambiguating such column names would be difficult without additional contextual information, including examples of data from the columns or supplementary documentation. Such data could be hard to use, especially for automated processing.Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com1tag:blogger.com,1999:blog-7413902753408106653.post-17988004333697718302012-10-15T16:39:00.000+02:002013-03-16T23:28:36.502+01:00How linked data improves recall via data integration<a href="http://blog.mynarz.net/2012/08/what-is-linked-data.html">Linked data</a> is an approach that materializes relationships between resources described in data. It makes implicit relationships explicit, which makes them reusable. When working with linked data, integration is performed on the level of data. It offloads (some of) the integration costs from consumers onto data producers. In this post, I compare integration on the query level with integration done on the level of data, showing the limits of the first approach compared to the second, as demonstrated by the improvement of recall when querying the data.<br />
<br />
All SPARQL queries featured in this post may be executed on this <a href="http://ld.opendata.cz:8900/sparql">SPARQL endpoint</a>.<br />
<br />
For the purposes of this demonstration, I want to investigate public contracts issued by Prague, the capital of the Czech Republic. If I know a URI of the authority, say <tt><<a href="http://ld.opendata.cz/resource/business-entity/00064581">http://ld.opendata.cz/resource/business-entity/00064581</a>></tt>, I can write a simple, naïve SPARQL query and learn that there are <strong>3</strong> public contracts associated with this authority:<br />
<pre>## Number of public contracts issued by Prague (without data integration) #
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  ?contract a pc:Contract ;
    pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
    .
}</pre>
I can get the official number of this contracting authority that was assigned to it by the Czech Statistical Office. This number is “00064581”.<br />
<pre>## The official number of Prague #
PREFIX br: <http://purl.org/business-register#>
SELECT ?officialNumber
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
<http://ld.opendata.cz/resource/business-entity/00064581> br:officialNumber ?officialNumber .
}
}</pre>
Consequently, I can look up all the contracts associated with a contracting authority identified by either the previously used URI or this official number. I get an answer telling me there are <strong>195</strong> public contracts issued by this authority.<br />
<pre>## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
?contract a pc:Contract .
{
?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
} UNION {
?contract pc:contractingAuthority ?authority .
?authority br:officialNumber "00064581" .
}
}
}</pre>
However, in some cases the official number is missing, so I might want to try the authority’s name as its identifier. Yet expanding my search by adding an option to match the contracting authority based on its exact name still gives me <strong>195</strong> public contracts issued by this authority. In effect, in this case the recall is not improved by matching on the authority’s legal name.<br />
<pre>## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
?contract a pc:Contract .
{
?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
} UNION {
?contract pc:contractingAuthority ?authority .
?authority br:officialNumber "00064581" .
} UNION {
?contract pc:contractingAuthority ?authority .
?authority gr:legalName "Hlavní město Praha" .
}
}
}</pre>
Even still, I know there might be typing errors in the name of the contracting authority. Listing distinct legal names of the authority of which I know either its URI or its official number will give me 8 different spelling variants, which might indicate there are more errors in the data.<br />
<pre>## Names that are used for Prague as a contracting authority #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT DISTINCT ?legalName
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
?contract a pc:Contract .
{
?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
OPTIONAL {
<http://ld.opendata.cz/resource/business-entity/00064581> gr:legalName ?legalName .
}
} UNION {
?contract pc:contractingAuthority ?authority .
?authority br:officialNumber "00064581" ;
gr:legalName ?legalName .
}
}
}</pre>
Given the assumption there might be unmatched instances of the same contracting authority labelled with erroneous legal names, I may want to perform an approximate, fuzzy match when searching for the authority’s contracts. Doing so will give me <strong>717</strong> public contracts that might be attributed to the contracting authority with a reasonable degree of certainty.<br />
<pre>## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
?contract a pc:Contract .
{
?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
} UNION {
?contract pc:contractingAuthority ?authority .
?authority br:officialNumber "00064581" .
} UNION {
?contract pc:contractingAuthority ?authority .
?authority gr:legalName ?legalName .
?legalName <bif:contains> '"Hlavní město Praha"' .
}
}
}</pre>
Further integration on the query level would make the query more complex or it would not be possible to express the integration steps within the limits of the query language. This approach is both laborious and computationally inefficient, since the equivalence relationships need to be reinvented and recomputed every time the query is created and run.<br />
<br />
Contrarily, when I use a URI of the contracting authority plus its <tt>owl:sameAs</tt> links, it results in a simpler query. In this case, <strong>232</strong> public contracts are found. In this way the recall is improved, even though it is not as high as with the query that takes into account various spellings of the authority’s name, which may be attributed to a greater precision of the interlinking done on the level of data compared to the integration on the query level.<br />
<br />
The following query harnesses equivalence relationships within the data. The query extends the first query shown in this post. In the <tt>FROM</tt> clause, it adds a new data source to be queried (<tt><<a href="http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs">http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs</a>></tt>), which contains the equivalence links between URIs identifying the same contracting authorities. Other than that, a Virtuoso-specific directive <tt>DEFINE input:same-as "yes"</tt> is turned on, so that <tt>owl:sameAs</tt> links are followed.<br />
<pre>## Number of public contracts of Prague with owl:sameAs links #
DEFINE input:same-as "yes"
PREFIX pc: <http://purl.org/procurement/public-contracts#>
SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>
WHERE {
?contract a pc:Contract ;
pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
.
}</pre>
Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-64651528077674565312012-10-08T14:01:00.000+02:002013-03-16T23:40:06.125+01:00How the Big Clean addresses the challenges of open dataThe <a href="http://bigclean.techlib.cz/en/">Big Clean 2012</a> is a one-day conference dedicated to three principal themes: screen-scraping, data refining and data-driven journalism. These topics address some of the current <a href="http://headtoweb.posterous.com/challenges-of-open-data">challenges of open data</a>, focusing on usability, misinterpretation of data and on the issue of making data-driven journalism work.<br />
<h2>
Usability</h2>
A key challenge of the Big Clean is refining raw data into <em>usable</em> data. People often fall victim to the fallacy of treating screen-scraped data as a resource that can be used directly, fed straight into visualizations or analysed to yield insights. However, the validity of data must not be taken for granted. It needs to be questioned.<br />
Just as some raw ingredients need to be cooked to become edible, raw data needs to be preprocessed to become usable. Patchy data extracted from web pages should be refined into data that can be relied upon. Cleaning data makes it more regular, error-free and ultimately more usable.<br />
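To give a concrete, if toy, impression of what such refining involves, a few typical cleaning steps might look like the following Python sketch; the file names are made up for the illustration and the steps are deliberately simplistic.<br />
<pre># A toy sketch of typical cleaning steps for scraped tabular data: trim
# whitespace, normalise common markers of missing values, drop exact
# duplicate rows. File names are made up for the illustration.
import csv

MISSING = {"", "-", "--", "n/a", "N/A", "null", "NULL"}

def clean_rows(rows):
    seen = set()
    for row in rows:
        cleaned = tuple((value or "").strip() for value in row)
        cleaned = tuple("" if value in MISSING else value for value in cleaned)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            yield cleaned

with open("scraped.csv", newline="", encoding="utf-8") as source, \
     open("cleaned.csv", "w", newline="", encoding="utf-8") as target:
    csv.writer(target).writerows(clean_rows(csv.reader(source)))</pre>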
The Big Clean will take this challenge into account in several talks. <a href="https://twitter.com/jiriskuhrovec">Jiří Skuhrovec</a> will try to strike a fine balance, considering the question of <a href="http://bigclean.techlib.cz/en/801-how-much-do-we-need-to-clean/">how much do we need to clean</a>. <a href="https://twitter.com/Stiivi">Štefan Urbánek</a> will walk the event's participants through a <a href="http://bigclean.techlib.cz/en/796-data-processing-pipeline/">data processing pipeline</a>. Apart from the invited talks, this topic will be a subject to a <a href="http://bigclean.techlib.cz/en/805-workshop/">screen-scraping workshop</a> lead by <a href="https://twitter.com/thomaslevine">Thomas Levine</a>. The workshop will run in parallel with the main track of the conference.<br />
<h2>
Misinterpretation</h2>
Access to raw data allows people to take control of the interpretation of data. Effectively, people are not only taking hold of uninterpreted data, but also of the right to interpret it. This is not the case in the current state of affairs, where there is often no access to raw data, since all data is mediated through user interfaces. In such a case, the interface owners control the ways in which data may be viewed. On the contrary, raw data gives you the freedom to interpret data on your own. It allows you to skip the intermediaries and access data directly, instead of limiting yourself to the views provided by the interface owners.<br />
While the loss of control over presentation of data may be perceived as a loss of control over the meaning of the data, it is actually a call for more explicit semantics in the data. It is a call for an encoding of the meaning in data in a way that does not rely on the presentation of data.<br />
A common excuse for not releasing data held in the public sector is the assumption that the data will be misinterpreted. As reported in <a href="https://twitter.com/DirDigEng">Andrew Stott</a>'s <a href="http://twitter.yfrog.com/z/h3ygcwhj">OKCon 2011 talk</a>, among the civil servants, there is a widespread expectation that <em>“people will draw superficial conclusions from the data without understanding the wider picture.”</em> First, there is not a single correct interpretation of data possessed by the public sector. Instead, there are multiple valid interpretations that may coexist. Second, the fact that data is prone to incorrect interpretation may not attest to the ambiguity of the data, but to the ambiguity of its representation.<br />
Tighter semantics may make the danger of misinterpretation less probable. As examples such as <a href="http://data.gov.uk/">Data.gov.uk</a> in the United Kingdom have shown, one way to encode clearer interpretation rules directly into the data is by using <a href="http://www.w3.org/standards/semanticweb/">semantic web technologies</a>.<br />
<h2>
Data-driven journalism</h2>
Nevertheless, in most cases public sector data is not self-describing. The data is not smart and thus people interpreting it need to be smart. A key group that needs to become smarter, reading the clues conveyed in data, comprises journalists. Journalists should read data, not only press releases. As they become <a href="http://headtoweb.posterous.com/challenges-of-open-data-data-literacy">data literati</a>, the importance of their work increases. They serve as translators, mediating understanding derived from data to the wider public. In this way, data-driven journalism contributes to the goal of making data more usable as stories told with data are more accessible than the data itself.<br />
Raw data opens space for different and potentially competing interpretations. This is the democratic aspect of open data. It invites participation in a shared discourse constructed around the data. A fundamental element of such discourse are the media. Journalists using the data may contribute to this conversation by finding what is new in the data, discovering issues hidden from public oversight or tracing the underlying systemic trends. This is the key contribution of data-driven journalism, providing diagnoses of the present society.<br />
The principal part of data-driven journalism in the open data ecosystem will be reflected in a couple of talks given at the Big Clean. <a href="https://twitter.com/bb_liliana">Liliana Bounegru</a> will explain <a href="http://bigclean.techlib.cz/en/795-why-data-journalism-is-something-you-too-should-care-about/">why data journalism is something you too should care about</a> and <a href="https://twitter.com/caelainnbarr">Caelainn Barr</a> will showcase <a href="http://bigclean.techlib.cz/en/794-using-european-data-in-journalism/">how the EU data can be used in journalism</a>.<br />
<h2>
Practical details</h2>
The Big Clean will be held on November 3<sup>rd</sup>, 2012, at the <a href="http://www.techlib.cz/en/">National Technical Library</a> in Prague, Czech Republic. You can register by following <a href="http://bigclean.techlib.cz/en/792-registration/">this link</a>. The admission to the event is free.<br />
I hope to see many of you there.Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-186957696223188952012-10-04T10:28:00.000+02:002013-02-17T23:14:56.486+01:00What makes open data weak will not get engineered away<p>Open data is still weak but growing strong. I have written a few fairly random points covering the weak points, in which open data may need to grow.</p><ul><li>With the <a href="http://www.opengovpartnership.org/">Open Government Partnership</a>, open data is <a href="http://www.youtube.com/watch?v=6xG4oFny2Pk">losing its edge</a>. Open data is being assimilated into the current bureaucratic structures. It might be about time to reignite the subversive potential of open data.</li><li>There is no long-term commitment to open data. All activity in the domain seems to be fragmented in small projects that do not last long, nor do they share results. We need to find ways to make projects outlive their funding. Open data has an attention deficit disorder.</li><li>What makes open data weak and strange will not get <a href="http://www.tor.com/stories/2008/08/weak-and-strange">engineered away</a>. Better tools will not solve the inherent issues in open data, albeit they might help to grow the open data community in order to be able to solve those. Even though open data might be broken, we should not try to fix it, we should try to grow it to fix it itself.</li><li>People are getting lost on the way to realization of the goals of the open data movement. They fall for the distractions encountered on the way and get enchanted by the technology, a mere tool for addressing the goals of open data. People get stuck teaching others how to open and use data, while themselves not doing what they preach. People stop at community building, grasping for momentum using social media.</li><li>There is a legal uncertainty making people believe that taking legal actions is not possible without having a lawyer holding your hand. People are careful not to breach any of their imagined implications of the law. Civil servants are afraid to release the data their institutions hold, citizens are afraid of using data to effect real-world consequences.</li></ul>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-25055767891331454092012-10-03T22:33:00.000+02:002013-02-17T23:14:56.513+01:00State of open data in the Czech Republic in 2012<p>During the <a href="http://okfestival.org/">Open Knowledge Festival 2012</a> in Helsinki <a href="http://www.flickr.com/photos/ddie/8002495378/in/set-72157631572820127">I presented</a> a lightning-fast two minutes summary of four key things that happened with open data in the Czech Republic. Here is a brief recap of the things I mentioned.</p><p>One of the most tangible results of the open data community in the past year was the launch of a <a href="http://www.nasstat.cz/">national portal called “Náš stát”</a> (which stands for “Our state”). It provides an overview of a network of Czech projects working towards improving Czech public sector with applications and services built on top of its data. 
What turned out to be one of its main benefits is that it started unifying disparate organizations that are often working on the same issues without knowing they might be duplicating work of others, and we will see in the coming years if it becomes the proverbial one ring to bind them all.</p><p>A <a href="http://cz.okfn.org">Czech local chapter</a> of the <a href="http://okfn.org">Open Knowledge Foundation</a> was conceived and started its incubation phase. So far, we have managed to run several meetups and workshops, yet still, we have failed to involve a sufficient number of people contributing their cognitive surplus to the chapter in order to be able to sustain it in the long-term.</p><p>In this year data-driven journalism has appeared in mainstream news media. Inspired by the <a href="http://www.guardian.co.uk/news/datablog">Guardian's Datablog</a> the <a href="http://data.blog.ihned.cz/">data blog</a> was set up at <a href="http://ihned.cz/">iHNed.cz</a>. The blog has become a source of data-driven stories supported by visualizations that regularly make it on the news site's front page.</p><p>Arguably, the main thing related to open data that happened in Czech Republic during the past year was the commitment to the <a href="http://www.opengovpartnership.org/">Open Government Partnership</a>. Czech Republic has committed to an <a href="http://www.opengovpartnership.org/sites/www.opengovpartnership.org/files/country_action_plans/Action%20Plan_Czech%20Republic_final_short.pdf">action plan</a>, in which opening government data plays a key role, encompassing the establishment of an official data catalogue and release of core datasets, such as the company register. On the other hand, there is no money to be spent on the OGP commitments and the <a href="http://www.opengovpartnership.org/countries/czech-republic?quicktabs_country_tabs=1#quicktabs-country_tabs">list of efforts to date</a> is blank. Thus the work on the implementation of commitments in mainly driven on by NGOs, which is very much in line with the spirit of <a href="http://okfestival.org/hacking-the-open-government-partnership-process/">“hacking” the Open Government Partnership</a>.</p><p>To sum up, there have been both <tt>#win</tt>s and <tt>#fail</tt>s. We keep calm and carry on.</p>Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-11409376149360202152012-09-12T10:00:00.000+02:002013-03-17T09:01:57.155+01:00Challenges of open data: summary<small>The following post is an excerpt from my thesis entitled <a href="http://blog.mynarz.net/2012/07/linked-open-data-for-public-sector.html">Linked open data for public sector information</a>.</small><br />
Open data creates opportunities that may end up being missed if the challenges associated with them are left unaddressed. The previous blog posts raised some of the questions the open data “movement” would have to face and resolve in order not to lose these opportunities and restore the faith in the transformative potential of open data.<br />
The open data agenda is biased by its prevailing focus on the supply side of open data and its neglect of the demand side that gets to use the data. A significant part of the challenges associated with open data stems from a narrow-minded view of open data as a technology-triggered change that might be engineered. Although open data brings a change in which technology plays a fundamental role, it is important not to overlook its side effects and the issues that cannot be solved by better engineering.<br />
It is comfortable to abstract away from these issues at hand. So far, the challenges of open data are in most cases temporarily bypassed. While the essential features of open data are described thoroughly, its impact is left mostly unexplored. In fact, open data advocates frequently substitute their expectations for the effects of this relatively new phenomenon. The full implications of open data still need to be worked out. The blog posts about the challenges associated with open data can be thus read as an outline of some of the areas in which further research may be conducted and case studies may be commissioned.Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-74605061805619980372012-09-11T10:00:00.000+02:002013-03-17T09:02:14.414+01:00Challenges of open data: procured data<small>The following post is an excerpt from my thesis entitled <a href="http://blog.mynarz.net/2012/07/linked-open-data-for-public-sector.html">Linked open data for public sector information</a>.</small><br />
The public sector is not only considered to be unable to deliver applications in a cost-efficient way, it may also lack the ability to collect some data. There are several kinds of data, including geospatial surveys, that are difficult to gather using the means available in the public sector. The solution that public bodies adopt for such cases is to outsource data collection to private companies. Using the standard procedures of public procurement, the public bodies contract a provider to produce the requested data.<br />
The challenge appears when commercial data suppliers recognize the value of the procured data and become aware of the possibilities for its reuse that might generate revenue for them. Hence the suppliers offer the data under the terms of licences that prevent public sector bodies from sharing the data with the public, since releasing it as open data would hamper the suppliers’ prospects of reselling it. Should the public sector require a licence that allows the procured data to be opened, the contract price would increase markedly.<br />
Privatising the collection of public sector data might be a way to achieve better efficiency [1], yet without a significant investment it prohibits releasing the data as open data. It leaves open the question whether public sector bodies should buy in expensive data in order to share it with others, or whether the public sector's infrastructure should be enhanced to cater for the acquisition of data that would otherwise be difficult to collect.<br />
<em>Note:</em> The topic of public sector data obtained through public procurement is the subject of a <a href="http://headtoweb.posterous.com/opening-contracted-data-in-the-public-sector">previous blog post</a>.<br />
<h2>
References</h2>
<ol>
<li>YIU, Chris. <em>A right to data: fulfilling the promise of open public data in the UK</em> [online]. Research note. March 6<sup>th</sup>, 2012 [cit. 2012-03-06]. Available from WWW: <a href="http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk">http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk</a></li>
</ol>
Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0tag:blogger.com,1999:blog-7413902753408106653.post-61579747432070896482012-09-10T10:00:00.000+02:002013-03-17T09:10:14.933+01:00Challenges of open data: trust<small>The following post is an excerpt from my thesis entitled <a href="http://blog.mynarz.net/2012/07/linked-open-data-for-public-sector.html">Linked open data for public sector information</a>.</small><br />
Transparency brought about by the adoption of open data affects trust in the public sector. Current governments experience a crisis of legitimacy [1, p. 58] and lack the trust of citizens. Improved visibility of the workings of public sector bodies, established by open access to their proceedings, makes it possible to track their actions in detail and improves the trust citizens put in these bodies. Nevertheless, the release of open data may reveal many failings of public sector bodies, which may produce temporary disillusionment, distrust in government, and loss of interest in politics [2].<br />
The initial assumption of most open data advocates is that the data produced in the public sector can be relied on. However, public sector data cannot be treated as a neutral and uncontested resource. <em>“Unaudited, unverified statistics abound in government data, particularly when outside parties – local government agencies, federal lobbyists, campaign committees – collect the data and turn it over to the government”</em> [1, p. 261]. False data may be fabricated to provide an alibi for corrupt behaviour. For instance, <a href="https://twitter.com/nithyavraman/">Nithya Raman</a> draws attention to an Indian dataset on urban planning that lists non-existent public toilets, so that the spending that supposedly goes to their maintenance may be justified [3]. Another example demonstrating how false data finds its way into public sector data is the exposure of errors in subsidies awarded under the <a href="http://en.wikipedia.org/wiki/Common_Agricultural_Policy">EU Common Agricultural Policy</a>. The data showed that the oldest recipients of these funds, both from Sweden, were 100 years old, even though both were dead [4, p. 85].<br />
In the light of such facts, it is important to acknowledge that <em>“public confidence in the veracity of government-published information is critical to Open Government Data take-off, essential to spurring demand and use of public datasets”</em> [5]. If the data is regarded as manipulated instead of being recognized as trustworthy, the impact of open data will be significantly diminished.<br />
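Simple automated plausibility checks are one way to surface such anomalies. The following is a minimal sketch in Python, assuming a hypothetical CSV of subsidy recipients with <tt>recipient</tt>, <tt>birth_year</tt>, and <tt>amount</tt> columns (the file name and column names are invented for illustration, not the actual CAP data schema); it merely flags recipients whose age exceeds a plausible threshold and leaves the actual verification to a human reviewer.<br />
<pre>
# A minimal plausibility check on a hypothetical CSV of subsidy recipients.
# The file name and the column names ("recipient", "birth_year", "amount")
# are assumptions made for illustration, not an actual CAP data schema.
import csv
from datetime import date

MAX_PLAUSIBLE_AGE = 100  # flag recipients older than a century

def implausible_recipients(path):
    """Yield rows whose recipient age exceeds the plausible threshold."""
    this_year = date.today().year
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                age = this_year - int(row["birth_year"])
            except (KeyError, ValueError):
                continue  # skip rows with missing or malformed birth years
            if age > MAX_PLAUSIBLE_AGE:
                yield row

if __name__ == "__main__":
    for row in implausible_recipients("subsidies.csv"):
        print(row["recipient"], row["amount"])
</pre>
<br />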
<h2>
References</h2>
<ol>
<li>LATHROP, Daniel; RUMA, Laurel (eds.). <em>Open government: collaboration, transparency, and participation in practice</em>. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.</li>
<li>FIORETTI, Marco. <em>Open data, open society: a research project about openness of public data in EU local administration</em> [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: <a href="http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/">http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/</a></li>
<li>RAMAN, Nithya V. Collecting data in Chennai city and the limits of openness. <em>Journal of Community Informatics</em> [online]. 2012 [cit. 2012-04-12], vol. 8, no. 2. Available from WWW: <a href="http://ci-journal.net/index.php/ciej/article/view/877/908">http://ci-journal.net/index.php/ciej/article/view/877/908</a>. ISSN 1712-4441.</li>
<li><em>Beyond access: open government data & the right to (re)use public information</em> [online]. Access Info Europe, Open Knowledge Foundation, January 7<sup>th</sup>, 2011 [cit. 2012-04-15]. Available from WWW: <a href="http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf">http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf</a></li>
<li>GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. <em>Realizing the vision of open government data: opportunities, challenges and pitfalls</em> [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: <a href="http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls">http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls</a></li>
</ol>
Jindřich Mynarzhttp://www.blogger.com/profile/14451530867512051347noreply@blogger.com0