2016-06-28

On generating SPARQL

The question of how to generate SPARQL comes up so often in my work that I figured I would attempt a summary of the different approaches that answer it, none of which is established enough to be considered a best practice. The lack of a well-established method for generating SPARQL may have arisen from the fact that SPARQL is serialized to strings. While its string-based format may be convenient to write by hand, it is less convenient to generate through code. To make SPARQL more amenable to programmatic manipulation, several approaches have been devised, some of which I will cover in this post.

Opting for strings or structured data to represent a data manipulation language is a fundamental design decision. Put crudely, there is a usability trade-off to be made: either adopt a string representation to ease manual authoring, or represent the language as data to ease programmatic manipulation. Can we nevertheless have the best of both options?

SPARQL continues in the string-based tradition of SQL, possibly leveraging a superficial familiarity between the two syntaxes. In fact, SPARQL was recently assessed as “a string-based query language, as opposed to a composable data API.” This assessment implicitly reveals a demand for languages represented as structured data, such as the Elasticsearch query DSL, which is serialized in JSON, or the Datomic query syntax, which is serialized in EDN.

To illustrate the approaches for generating SPARQL I decided to show how they fare on an example task. The chosen example should be simple enough, yet realistic, and such that it demonstrates the common problems encountered when generating SPARQL.

Example

Let's say you want to know all the people (i.e. instances of foaf:Person) known to DBpedia. There are 1.8 million such persons, which is way too many to fetch in a single SPARQL query. In order to avoid overload, DBpedia's SPARQL endpoint is configured to provide at most 10 thousand results. Since we want to get all the results, we need to use paging via LIMIT and OFFSET to partition the complete results into smaller parts, such that one part can be retrieved within a single query.

Paging requires a stable sort order over the complete collection of results. However, sorting a large collection of RDF resources is an expensive operation. If a collection's size exceeds a pre-configured limit, Virtuoso requires the queries paging over this collection to use scrollable cursors (see the section “Example: Prevent Limits of Sorted LIMIT/OFFSET query”), which basically wrap an ordered query into a subquery in order to better leverage the temporary storage of the sorted collection. Because of the number of persons in DBpedia we need to apply this technique to our query.

Let's say that for each person we want to get values of several required properties and some optional properties. For example, we may want to get names (foaf:name) and birth dates (dbo:birthDate) and, optionally, dates of death (dbo:deathDate). Since persons can have multiple names and we want only one name per person, we need to use SAMPLE with the names to retrieve a single random name associated with each person. We could assume that a person has no more than one birth date and death date, but in fact in DBpedia there are 31 thousand persons with multiple birth dates and 12 thousand persons with multiple death dates, so we also need to use SAMPLE for these properties. Considering all these requirements, our SPARQL query may look like the following.
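
Since the post's implementation examples are in Clojure, here is a hedged sketch of what such a query might look like, held as a Clojure string with the LIMIT and OFFSET fixed to concrete values; the prefixes, variable names, and exact subquery shape are illustrative assumptions rather than the original query.

(def example-query
  "PREFIX dbo:  <http://dbpedia.org/ontology/>
   PREFIX foaf: <http://xmlns.com/foaf/0.1/>

   SELECT ?person (SAMPLE(?name) AS ?name_)
                  (SAMPLE(?birthDate) AS ?birthDate_)
                  (SAMPLE(?deathDate) AS ?deathDate_)
   WHERE {
     {
       # Scrollable cursor: the ordered query is wrapped in a subquery.
       SELECT ?person
       WHERE {
         { SELECT ?person WHERE { ?person a foaf:Person . } ORDER BY ?person }
       }
       LIMIT 10000
       OFFSET 0
     }
     ?person foaf:name ?name ;
             dbo:birthDate ?birthDate .
     OPTIONAL { ?person dbo:deathDate ?deathDate . }
   }
   GROUP BY ?person")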

We need the LIMIT and OFFSET in this query to be generated dynamically, as well as the listing of the required and the optional properties. In the following, I will cover and compare several approaches for generating SPARQL queries using these parameters. All the approaches are illustrated by examples in Clojure to make them better comparable.

String concatenation

The approach that is most readily at hand for most developers is string concatenation. While it is simple to start with, it intertwines SPARQL with code, which makes it brittle and error-prone. Convoluted manual escaping may be needed, and the result is particularly messy in programming languages that lack support for multi-line strings, such as JavaScript. Here is an example implementation using string concatenation to generate the above-mentioned query.
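
A minimal sketch of this approach, reusing the query shape from above; the helper's name and the layout of its arguments are made up for illustration.

(defn person-query
  "Glues the query together from strings; the limit, offset, and property
   lists are spliced directly into the SPARQL text."
  [{:keys [required optional limit offset]}]
  (str "PREFIX dbo:  <http://dbpedia.org/ontology/>\n"
       "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
       "SELECT ?person "
       (apply str (map (fn [[_ v]] (format "(SAMPLE(?%s) AS ?%s_) " v v))
                       (concat required optional)))
       "\nWHERE {\n"
       "  { SELECT ?person WHERE {\n"
       "      { SELECT ?person WHERE { ?person a foaf:Person . } ORDER BY ?person }\n"
       "    } LIMIT " limit " OFFSET " offset " }\n"
       (apply str (map (fn [[p v]] (format "  ?person %s ?%s .\n" p v)) required))
       (apply str (map (fn [[p v]] (format "  OPTIONAL { ?person %s ?%s . }\n" p v)) optional))
       "}\nGROUP BY ?person"))

(person-query {:required [["foaf:name" "name"] ["dbo:birthDate" "birthDate"]]
               :optional [["dbo:deathDate" "deathDate"]]
               :limit    10000
               :offset   0})

Even this small helper already tangles quoting and whitespace with the code, which is exactly the brittleness described above.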

Parameterized queries

If the queries to be generated differ only in a few variables, they can be generated from a parameterized query, which represents a generic template that can be filled with specific parameter values at runtime. Parameterized queries make it possible to split the static and dynamic parts of the generated queries. While the static query template is represented in SPARQL and can be stored in a separate file, the dynamic parts can be represented using a programming language's data types and passed to the template at runtime. The separation provides both the readability of SPARQL and the expressiveness of programming languages. For example, template parameters can be statically typed in order to improve error reporting during development or to avoid some SPARQL injection attacks in production.

Although parameterized queries improve on string concatenation and are highlighted among the linked data patterns, they are limited. In particular, I will discuss the limitations of parameterized queries as implemented in Apache Jena by ParameterizedSparqlString. While the main drawbacks of parameterized queries hold for other implementations as well, the details may differ. In Jena's parameterized queries only variables can be dynamic. Moreover, each variable can be bound only to a single value. For example, for the pattern ?s a ?class . we cannot bind ?class to both schema:Organization and schema:Place to produce ?s a schema:Organization, schema:Place . If we provide multiple bindings for a variable, only the last one is used. Queries that cannot restrict their dynamic parts to variables can fall back on string buffers to append arbitrary strings to the queries, but doing so brings back the same problems string concatenation has. Due to these restrictions we cannot generate the example query using this approach. Here is a partial implementation that generates only the limit and offset.
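
A sketch of such a partial implementation, assuming a query template that contains ?limit and ?offset placeholders; the function name, template handling, and the exact formatting Jena gives the substituted numeric literals are assumptions.

(import '(org.apache.jena.query ParameterizedSparqlString))

(defn add-limit-offset
  "Binds the ?limit and ?offset variables in a query template.
   Only variables can be made dynamic this way."
  [template {:keys [limit offset]}]
  (str (doto (ParameterizedSparqlString. template)
         (.setLiteral "limit" limit)
         (.setLiteral "offset" offset))))

;; Example use, with the rest of the template elided:
;; (add-limit-offset "SELECT ... LIMIT ?limit OFFSET ?offset" {:limit 10000 :offset 0})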

Jena also provides a similar approach, closer to prepared statements in SQL. When executing a query (via QueryExecutionFactory) you can provide it with pre-bound variables (via QuerySolutionMap). Restrictions similar to those discussed above apply. Moreover, your template must be a syntactically valid SPARQL query or update. In turn, this prohibits generating LIMIT numbers, because LIMIT ?limit is not valid SPARQL syntax. The following implementation thus does not work.
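
A minimal sketch of where this breaks: the template has to parse as SPARQL before any variables can be pre-bound, and a template containing LIMIT ?limit already fails at that first step (the query text is only illustrative).

(import '(org.apache.jena.query QueryFactory))

;; Throws a QueryParseException: "LIMIT ?limit" is not valid SPARQL,
;; so the template cannot even be parsed, let alone pre-bound.
(QueryFactory/create
  "SELECT ?person WHERE { ?person a <http://xmlns.com/foaf/0.1/Person> } LIMIT ?limit")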

Templating

If we need higher expressivity for generating SPARQL, we can use a templating language. Templating is not restricted to the syntax of SPARQL, because it treats SPARQL as arbitrary text, so any manipulation is allowed. As with parameterized queries, templating makes it possible to separate the static template from the dynamic parts of SPARQL. Unlike parameterized queries, a template is not represented in pure SPARQL, but in a mix of SPARQL and a templating language. This recalls some of the shortcomings of interleaving SPARQL with code in string concatenation. Moreover, templates generally do not constrain their input, for example by declaring types, so any data can be passed into the templates without being checked in advance. Notable exceptions that allow declaring the types of template data exist in Scala; for example, Scalate or Twirl.

In order to show how to generate SPARQL via templating I adopted Mustache, which is a minimalistic and widely known templating language implemented in many programming languages. The following is a Mustache template that can generate the example query.

Rendering this template requires only a little code that provides the template with input data.
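
A sketch of that rendering step, using the stencil Mustache library as one possible choice; the abbreviated template, key names, and data shape are assumptions.

(require '[stencil.core :refer [render-string]])

;; An abbreviated template fragment; the full template would spell out the
;; whole query, with {{#required}}...{{/required}} sections for the properties.
(def template
  "SELECT ?person {{#required}}(SAMPLE(?{{variable}}) AS ?{{variable}}_) {{/required}}
   WHERE {
     # triple patterns elided
   }
   LIMIT {{limit}}
   OFFSET {{offset}}")

(render-string template
               {:required [{:property "foaf:name" :variable "name"}
                           {:property "dbo:birthDate" :variable "birthDate"}]
                :limit    10000
                :offset   0})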

Domain-specific languages

As the string concatenation approach shows, SPARQL can be built using programming language constructs. Whereas string concatenation operates on the low level of text manipulation, programming languages can be used to create constructs operating on a higher level, closer to SPARQL. In this way, programming languages can be built up into domain-specific languages (DSLs) that compile to SPARQL. DSLs retain the expressivity of the programming languages they are defined in, while providing a syntax closer to SPARQL, thus reducing the cognitive overhead of translating SPARQL into code. However, when using or designing DSLs, we need to be careful about potential clashes between names in the modelled language and the programming language the DSL is implemented in. For example, concat is used both in SPARQL and in Clojure. Conversely, if a DSL lacks a part of the modelled language, escape hatches may be needed, regressing back to string concatenation.

Lisps, Clojure included, are uniquely positioned to serve as languages for defining DSLs. Since Lisp code is represented using data structures, it is easier to manipulate than languages represented as strings, such as SPARQL.

Matsu is a Clojure library that provides a DSL for constructing SPARQL via macros. Macros are expanded at compile time, so when they generate SPARQL they generally cannot access data that becomes available only at runtime. To a limited degree it is possible to work around this limitation by invoking the Clojure reader at runtime. Moreover, since Matsu is built using macros, we need to use macros to extend it. An example of this approach is its built-in defquery macro, which allows passing parameters into a query template. Nevertheless, mixing macros with runtime data quickly becomes convoluted, especially if larger parts of SPARQL need to be generated dynamically.

If we consider using Matsu for generating the example query, we discover several problems that prevent us from accomplishing the desired outcome, apart from the already mentioned generic issues of macros. For instance, Matsu does not support subqueries. Defining subqueries separately and composing them as subqueries via raw input is not possible either, because Matsu queries contain prefix declarations, which are syntactically invalid in subqueries. Ultimately, the farthest I was able to get with Matsu for the example query was merely the innermost subquery.

Query DSLs in object-oriented languages are often called query builders. For example, Jena provides a query builder that allows building SPARQL by manipulating Java objects. The query builder is deeply rooted in the Jena object model, which provides some type checking at the expense of a more verbose syntax. Since Clojure can call Java directly, implementing the example query using the query builder is straightforward.
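
A sketch of the interop, cut down to a flat query without the subquery and SAMPLE machinery; the SelectBuilder method names reflect my reading of the query builder's API, so treat the details as assumptions.

(import '(org.apache.jena.arq.querybuilder SelectBuilder)
        '(org.apache.jena.rdf.model ResourceFactory)
        '(org.apache.jena.vocabulary RDF))

(def foaf-person (ResourceFactory/createResource "http://xmlns.com/foaf/0.1/Person"))
(def foaf-name (ResourceFactory/createProperty "http://xmlns.com/foaf/0.1/name"))
(def dbo-death-date (ResourceFactory/createProperty "http://dbpedia.org/ontology/deathDate"))

(defn person-query
  "Builds a simplified version of the example query as a Jena Query object
   and serializes it to a SPARQL string."
  [{:keys [limit offset]}]
  (let [builder (doto (SelectBuilder.)
                  (.addVar "?person")
                  (.addVar "?name")
                  (.addWhere "?person" RDF/type foaf-person)
                  (.addWhere "?person" foaf-name "?name")
                  (.addOptional "?person" dbo-death-date "?deathDate")
                  (.setLimit (int limit))
                  (.setOffset (int offset)))]
    (str (.build builder))))

;; (person-query {:limit 10000 :offset 0})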

While Matsu represents queries via macros and Jena's query builder does so via code, there is another option: representing queries via data. Using a programming language's native data structures for representing SPARQL provides arguably the best facility for programmatic manipulation. Data is transparent at runtime and as such it can be easily composed and inspected. In fact, a widespread Clojure design rule is to prefer functions over macros and data over functions. An example of using data to represent a SPARQL-like query language in Clojure is the Fabric DSL. While this DSL is not exactly SPARQL, it is “highly inspired by the W3C SPARQL language, albeit expressed in a more Clojuresque way and not limited to RDF semantics” (source).

SPIN RDF

An approach that uses data in RDF for representing SPARQL is SPIN RDF. It offers an RDF syntax for SPARQL and an API for manipulating it. While the translation of SPARQL to RDF is for the most part straightforward, one of its more intricate parts is using RDF collections for maintaining order in triple patterns or projected bindings, because the collections are difficult to manipulate in SPARQL.

Nonetheless, SPIN RDF seems to have a fundamental problem with passing dynamic parameters from code. As far as I can tell, the membrane between SPIN RDF and code is impermeable. It would seem natural to manipulate SPIN RDF via SPARQL Update. However, how can you pass data to the SPARQL Update operation from your code? If you adopt SPIN RDF wholesale, your SPARQL Update operation is represented in RDF, so you have the same problem. Passing data from code to SPIN RDF thus results in a recursive paradox. Although I tried hard, I have not found a solution to this conundrum in the SPIN RDF documentation, nor in the source code of the SPIN API.

This is how the example query can be represented in SPIN RDF, albeit with fixed values in place of the dynamic parts due to the limitations discussed above.

Rendering SPIN RDF to SPARQL can be implemented using the following code.

I have found a way to generate dynamic SPARQL queries in SPIN RDF using JSON-LD. JSON-LD can be represented by data structures, such as hash maps or arrays, that are available in most programming languages. This representation can be serialized to JSON that can be interpreted as RDF using the JSON-LD syntax. The SPIN RDF can in turn be translated to SPARQL, giving our desired outcome. As may be apparent from this workflow, crossing that many syntaxes (Clojure, JSON-LD, RDF, SPIN, and SPARQL) requires a large cognitive effort due to the mappings between the syntaxes one has to be aware of when authoring SPARQL in this way. Here is an implementation of this approach for the example query.

SPARQL algebra

As previously mentioned, a problematic part of SPIN RDF is its use of RDF collections for representing order. The documentation of Apache Jena recognizes this, saying that “RDF itself is often the most appropriate way to do this, but sometimes it isn't so convenient. An algebra expression is a tree, and order matters.” (source). The documentation talks about SPARQL algebra, which formalizes the low-level algebraic operators into which SPARQL is compiled. Instead of using RDF, Jena represents SPARQL algebra in s-expressions (SSE), which are commonly used in programming languages based on Lisp, such as Scheme. In fact, the “SSE syntax is almost valid Scheme” (source), but the SSE documentation acknowledges that Lisp “lacks convenient syntax for the RDF terms themselves” (source).

In order to see how our example query looks in SSE we can use Jena's command-line tools and invoke qparse --print=op --file query.rq to convert the query into SSE. The following is the result we get.

If SSEs were valid Clojure data structures, we could manipulate them as data and then serialize them to SPARQL. Nevertheless, there are minor differences between SSE and the syntax of Clojure. For example, while ?name and _:a are valid symbols in Clojure, absolute IRIs enclosed in angle brackets, such as <http://dbpedia.org/ontology/>, are not. Possibly, these differences can be remedied by using tagged literals for RDF terms.
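
As a small sketch of that idea, an SSE-like form could be read as Clojure data by supplying a reader for a made-up #rdf/iri tag; the tag, record, and form below are hypothetical and not part of Jena or Clojure.

(require '[clojure.edn :as edn])

;; A hypothetical wrapper type for IRIs.
(defrecord Iri [iri])

;; Reading an SSE-like basic graph pattern with IRIs marked by the tag:
(edn/read-string
  {:readers {'rdf/iri ->Iri}}
  "(bgp (triple ?person
                #rdf/iri \"http://www.w3.org/1999/02/22-rdf-syntax-ns#type\"
                #rdf/iri \"http://xmlns.com/foaf/0.1/Person\"))")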

Conclusions

I hope this post gave you a flavour of the various approaches for generating SPARQL. There is an apparent impedance mismatch between current programming languages and SPARQL. While the programming languages operate with data structures and objects, SPARQL must eventually be produced as a string. This mismatch motivates the development of approaches for generating SPARQL, which presents many challenges, some of which I described in this post.

I assessed these approaches on the basis of how they fare on generating an example query using the data from DBpedia. The complete implementations of these approaches are available in this repository. Out of the approaches I reviewed, I found four with which it is feasible to generate the example SPARQL query without undue effort:

  • String concatenation
  • Templating
  • Jena's query builder DSL
  • SPIN RDF using JSON-LD

My personal favourite that I use for generating SPARQL is templating with Mustache, which appears to mesh best with my brain and the tasks I do with SPARQL. Nonetheless, I am aware of the limitations of this approach and I am on a constant lookout for better solutions, possibly involving rendering SPARQL from data.

While I invested a fair amount of effort into this post, it is entirely possible I might have overlooked something or implemented any of the reviewed approaches in a sub-optimal way, so I would be glad to hear any suggestions on how to improve. In the meantime, while we search for the ideal solution for generating SPARQL, I think the membrane between code and SPARQL will remain only semipermeable.

2016-03-22

Academia-driven development

In this post I present an opinionated (and mostly wrong) account of programming in academia. It's based in part on my experience as a developer working in academia and in part on conversations I had with fellow developers working in the private sector.

In academia, you rarely program even if you are in computer science. Instead, you read papers, write papers, write deliverables for ongoing projects, or write project proposals. You aren't paid to program, you are paid to publish. You do only as much programming as is needed to have something to write a paper about. “Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do,” observes Bozhidar Bozhanov. While programming purists may be dissatisfied with that, Jason Baldridge is content with this state of affairs and writes: “For academics, there is basically little to no incentive to produce high quality software, and that is how it should be.”

Albert Einstein allegedly said this: “If we knew what it was we were doing, it would not be called research, would it?” While the attribution of this quote is dubious at best, there's a grain of truth in it. It's natural in research that you often don't know what you're working on. I think this is the reason why test-driven development (TDD) is not truly applicable in research. Programming in research is used to explore new ideas. TDD, on the contrary, requires an upfront specification of what you are building. German has the verb ‘basteln’, which stands for DIY fiddling. The word was adopted into spoken Czech with a negative connotation of not knowing what you're doing, which I think captures nicely what often happens in academic programming.

The low quality of academic software hinders its maintenance and extensibility in long-term development. For one-off experiments these concerns aren't an issue, but most experiments need to be reproducible. Academic software must allow others to reproduce and verify the results reported in the publication associated with it. Anyone must be able to re-run the software. It must be open source, allowing others to scrutinize its inner workings. Unfortunately, it's often the case that academic software isn't released or, when it is made available, it's nigh impossible to run without asking its creators for assistance.

What's more, the usability of software is hardly ever a concern in academia, in spite of the fact that usable software may attract more citations, thereby increasing the academic prestige of its author. An often-mentioned example of this effect in practice is Word2vec, whose paper boasts 1305 citations according to Google Scholar. Indeed, it would be a felicitous turn if we came to regard the usability of academic software as a valuable proxy that increases citation numbers.

A great benefit that comes with reproducible and usable software is extensibility. Ted Pedersen argues that there's “a very happy side-effect that comes from creating releasable code—you will be more efficient in producing new work of your own since you can easily reproduce and extend your own results.” Nonetheless, even though software may be both reproducible and usable, extending a code base without tests may be like building on quicksand. This is usually an opportunity for refactoring. For example, the feature to be extended can first be covered with tests that document its expected behaviour, as Nell Shamrell-Harrington suggests in surgical refactoring. The subsequent feature extension must not break these tests, unless the expected behaviour should change. I think adopting this approach can do a great deal of good for the continuity of academic development.

Finally, there's also an economic argument to be made for ‘poor-quality’ academic software. If software developed in academia achieved production quality, it would compete with software produced in the private sector. Since academia is a part of the public sector, academic endeavours are financed mostly from public funds. Hence such competition with commercial software can be considered unfair. Dennis Polhill argues that “unfair competition exists when a government or quasi-government entity takes advantage of its tax exemption and other privileges to supply private goods to the market in competition with private suppliers.” Following this line of thought, the public sector should not subsidize the development of software that is commercially viable and can be built by private companies. Instead of developing working solutions, academia can try and test new prototypes. If released openly, this proof-of-concept work can then be adopted in the private sector and grown into commercial products.

Eventually, when exploring my thoughts on academia-driven development, I realized that I'm torn between settling for the current status quo and pushing for emancipating software with publications. While I'm stuck figuring this out, there are laudable initiatives, such as Semantic Web Developers, which organizes regular conference workshops that showcase semantic web software and incite conversations about the status of software in academia. Let's see how these conversations pan out.

2016-03-21

In science, form follows funding

Recently, brief Twitter exchanges I had with @csarven on the subject of #LinkedResearch made me want to articulate a longer-form opinion on scientific publishing that no longer fit in a tweet. Be warned though: although this opinion is longer, it still oversimplifies a rather complex matter for the sake of conveying the few key points I have.

There's a pervasive belief that “if you can't measure it, you can't manage it”. Not only is this quote often misattributed to Peter Drucker, its author William Edwards Deming actually wrote that “it is wrong to suppose that if you can't measure it, you can't manage it — a costly myth” (The New Economics, 2000, p. 35). Contradicting the misquoted statement, Deming instead suggested that it's possible to manage without measuring. With that being said, he acknowledges that metrics remain an essential input to management.

Since funding is a key instrument of management, metrics influence funding decisions too. Viewed from this perspective, science is difficult to fund because its quality is hard to measure. The difficulty of measuring science is widely recognized, so that scientometrics was devised with the purpose of studying how to measure science. Since measuring science directly is difficult, scientometrics found ways to measure scientific publishing, such as citation indices. Though using publishing as a proxy to science comes with an implicit assumption that the quality of scientific publications correlates positively with the quality of science, many are willing to take on this assumption simply because of the lack of a better way for evaluating science. The key issue of this approach is that the emphasis on measurability constrains the preferred form of scientific publishing to make measuring it simpler. A large share of scientific publishing is centralized in the hands of few large publishers who establish a constrained environment that can be measured with less effort. The form of publishing imposes a systemic influence on science. As Marshall McLuhan wrote, the medium is the message. While in architecture form follows function, in science, form follows funding.

Measuring distributed publishing on the Web is a harder task, though not an insurmountable one. For instance, Google's PageRank algorithm provides a fair approximation of the influence of documents distributed on the Web. Linked research, which proposes to use the linked data principles for scientific publishing, may make it possible to measure science without the cost incurred by centralizing publishing. In fact, I think its proverbial “killer application” may be a measurable index like the Science Citation Index. Indeed, the SCI was a great success, one that “did not stem from its primary function as a search engine, but from its use as an instrument for measuring scientific productivity” (Eugene Garfield: The evolution of the Science Citation Index, 2007). A question that naturally follows is: how off am I in thinking this?

2015-12-08

Coding dub techno in Ruby using Sonic Pi

Dub techno lends itself to code thanks to its formulaic nature. Indeed, A Bullshitter's Guide to Dub Techno says:

“Sadly, a lot of dub techno out there is unbelievably dull — greyscale, unadventurous, utterly and literally generic. It must seem easy to make because after all, all you need is a submerged kickdrum, a few clanking chords stretched pointlessly out into arching waves of unmoving, unfeeling nothingness, and maybe the odd snatch of tired melodica, snaking around like a cobra that desperately needs to be put out of its misery.”

I decided to try to code dub techno in Ruby using Sonic Pi. Sonic Pi is an app for live coding sound. It started as a tool for teaching computer science using Raspberry Pi but it works damn well for making a lot of noise. Here's my attempt at coding dub techno in Sonic Pi:

The source code is available here.

My attempt exploits some of the dub techno stereotypes, such as the excessive repetition that progresses as I slowly build up the sound. As Joanna Demers writes on dub techno and related genres in Listening through the Noise: “Static music goes nowhere, achieves no goals, does no work, and sounds the same three hours into the work as it did when the work began.” Dub techno itself is an intentionally bare and stripped-down version of techno. It often focuses on the timbre of sound, using modulating synthesizers heavily drenched in reverb and echoes. Demers writes: “Static music is not only music that avoids conventional harmonic or melodic goals but also music that takes specific steps to obscure any sense of the passage of time.” Dub techno keeps melodic or harmonic progressions to a minimum, usually employing a single minor chord oscillating through an entire track. In my code, I use solely the D minor chord, which varies only in chord inversions and octave shifts.

I think that Sonic Pi offers a fluent live coding experience. For example, the nested with_* functions (such as with_fx) accepting Ruby blocks as arguments provide an intuitive way of representing bottom-up sound processing pipelines. Furthermore, live coding provides a fast feedback loop. Your ears are the tests of your code and you can hear the results of your code immediately.

Overall, I really enjoyed this attempt at dub techno. I would like to thank Sam Aaron and co. for creating Sonic Pi, and I would encourage you to give it a shot.

2015-05-02

Curling SPARQL HTTP Graph Store protocol

The SPARQL HTTP Graph Store protocol provides a way of manipulating RDF graphs via HTTP. Unlike SPARQL Update, it does not allow you to work with RDF on the level of individual assertions (triples). Instead, you handle your data on the higher level of named graphs. A named graph is a pair of a URI and a set of RDF triples. A set of triples can contain a single triple only, so it is technically possible to manipulate individual triples with the Graph Store protocol, but this way of storing data is not common. In line with the principles of REST, the protocol defines its operations using HTTP requests. It covers the familiar CRUD (Create, Read, Update, Delete) operations known from REST APIs. It is a simple and useful, albeit lesser-known, part of the family of SPARQL specifications. I have seen software that would have benefited had its developers known this protocol, which is why I decided to cover it in a post.

Instead of showing the HTTP interactions via the Graph Store protocol in a particular programming language I decided to use cURL as the lingua franca of HTTP. I discuss how the Graph Store protocol works in 2 implementations: Virtuoso (version 7.2) and Apache Jena Fuseki (version 2). By default, you can find a Graph Store endpoint at http://localhost:8890/sparql-graph-crud-auth for Virtuoso and at http://localhost:3030/{dataset}/data for Fuseki ({dataset} is the name of the dataset you configure). Virtuoso also allows you to use http://localhost:8890/sparql-graph-crud for read-only operations that do not require authentication. The differences between these implementations are minor, since both implement the protocol's specification well.

If you want to follow along with the examples below, an easy option is to download the latest version of Fuseki and start it with a disposable in-memory dataset using the shell command fuseki-server --update --mem /ds (ds is the name of our dataset). You can use any RDF file as testing data. For example, you can download DBpedia's description of SPARQL in the Turtle syntax as the file data.ttl:

curl -L -H "Accept:text/turtle" \
     http://dbpedia.org/resource/SPARQL > data.ttl

Finally, if any of the arguments you provide to cURL (such as graph URI) contains characters with special meaning in your shell (such as &), you need to enclose them in double quotes. The backslash you see in the example commands is used to escape new lines so that the commands can be split for better readability.

I will now walk through the 4 main operations defined by the Graph Store protocol: creating graphs with the PUT method, reading them using the GET method, adding data to existing graphs using the POST method, and deleting graphs, which can be achieved, quite unsurprisingly, via the DELETE method.

Create: PUT

You can load data into an RDF graph using the PUT HTTP method (see the specification). This is how you load RDF data from file data.ttl to the graph named http://example.com/graph:

Virtuoso
curl -X PUT \
     --digest -u dba:dba \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:8890/sparql-graph-crud-auth \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -X PUT \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

The -T named argument uploads a given local file, -H specifies an HTTP header indicating the content type of the uploaded file, -G provides the Graph Store endpoint's URL, and --data-urlencode lets you pass in the URI naming the created graph (via the graph query parameter). Since the Graph Store protocol's interface is uniform, most of the other operations use similar arguments.

Virtuoso uses HTTP Digest authentication for write and delete operations (i.e. create, update, and delete). The example above assumes the default Virtuoso user and password (i.e. dba:dba). If you fail to provide valid authentication credentials, you will be slapped over your hands with the HTTP 401 Unauthorized status code. Fuseki does not require authentication by default, but you can configure it using Apache Shiro.

When using Virtuoso, you can leave the Content-Type header out, because the data format will be detected automatically, but doing so is not a good idea. You need to provide the header for Fuseki, and if you fail to do so, you will face an HTTP 400 Bad Request response. Try not to rely on the autodetection being correct and provide the Content-Type header explicitly.

If you want to put data into the default graph, you can use the default query parameter with no value:

Fuseki
curl -X PUT \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:3030/ds/data \
     -d default

If you use Fuseki, you can also omit the graph parameter completely to manipulate the default graph. Nevertheless, this is not standard behaviour, so you should not rely on it.

If you PUT data into an existing non-empty graph, its previous data is replaced.

Read: GET

To download data from a given graph, you just issue a GET request (see the specification). You can use the option -G to perform a GET request via cURL:

Virtuoso
curl -G http://localhost:8890/sparql-graph-crud \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

Alternatively, you can simply use curl http://localhost:3030/ds/data?graph=http%3A%2F%2Fexample.com%2Fgraph, but -G allows you to provide the graph query parameter separately via --data-urlencode, which also takes care of the proper URL-encoding. You can specify the RDF serialization you want to get the data in via the Accept HTTP header. For example, if you want the data in N-Triples, you provide the Accept header with the MIME type application/n-triples:

Fuseki
curl -H Accept:application/n-triples \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

Unfortunately, while Fuseki supports the application/n-triples MIME type, Virtuoso does not. Instead, you will have to specify the deprecated MIME type text/ntriples (even text/plain will work) to get the data in N-Triples. Since N-Triples serializes each RDF triple on a separate line, you can use it as a naïve way of counting the triples in a graph by piping the data into wc -l (the -s option hides the cURL progress bar):

Virtuoso
curl -s -H Accept:text/ntriples \
     -G http://localhost:8890/sparql-graph-crud \
     --data-urlencode graph=http://example.com/graph | \
     wc -l

If a graph named by the requested URI does not exist, you will get an HTTP 404 Not Found response.

Update: POST

If you want to add data to an existing graph, use the POST method (see the specification). In case you POST data to a non-existent graph, it will be created just as if you had used the PUT method. The difference between POST and PUT is that when you send data to an existing graph, POST merges it with the graph's current data, while PUT replaces it.

Virtuoso
curl -X POST \
     --digest -u dba:dba \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:8890/sparql-graph-crud-auth \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -X POST \
     -H Content-Type:text/turtle \
     -T data.ttl \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

It is worth knowing how triples are merged during this operation. When you POST data to a non-empty graph, the set of triples currently associated with the graph will be merged with the set of triples from the uploaded data via set union. In most cases, if these two sets share any triples, they will not be duplicated. However, if the shared triples contain blank nodes, they will be duplicated because, due to their local scope, blank nodes from different datasets are always treated as distinct. For example, if you repeatedly POST the same triples containing blank nodes to the same graph, the first time the graph's size will increase by the number of posted triples, but on the second and subsequent POSTs it will increase only by the number of triples containing blank nodes. This can be one of the reasons why you may want to avoid using blank nodes.

Delete: DELETE

Unsurprisingly, deleting graphs is achieved using the DELETE method (see the specification). As you may expect by now, if you attempt to delete a non-existent graph, you will get an HTTP 404 Not Found response.

Virtuoso
curl -X DELETE \
     --digest -u dba:dba \
     -G http://localhost:8890/sparql-graph-crud-auth \
     --data-urlencode graph=http://example.com/graph
Fuseki
curl -X DELETE \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

Other methods

As in the HTTP specification, there are other methods defined in the Graph Store protocol. An example of such a method is HEAD, which can be used to test whether a graph exists. cURL allows you to issue a HEAD request using the -I option:

Fuseki
curl -I \
     -G http://localhost:3030/ds/data \
     --data-urlencode graph=http://example.com/graph

If the graph exists, you will receive an HTTP 200 OK status code. Otherwise, you will once again see a saddening HTTP 404 Not Found response. In the current version of Virtuoso (7.2), using the HEAD method triggers a 501 Method Not Implemented response, so you should use Fuseki if you want to play with this method.

As the Graph Store protocol's specification shows, any operation of the protocol can be replaced by an equivalent SPARQL query or update. The Graph Store protocol thus provides an uncomplicated interface for the basic operations manipulating RDF graphs. I think it is a simple tool worth knowing.

2014-07-09

Methods for designing vocabularies for data on the Web

Over the past year and a half I have been working on a project in which we were tasked with producing a vocabulary for describing job postings.1 In doing so, we were expected to write down what worked, so that others can avoid our mistakes. Apart from our own experience, the write-up I prepared took into account the largest public discussion on designing vocabularies for data on the Web: perusing its archive, I read every email on the public-vocabs mailing list from its start in June 2011 until April 2014. The following text distills some of what I have learnt from the conversations on this mailing list, especially from vocabulary design veterans such as Dan Brickley and Martin Hepp, coupled with research of other sources and our own experiments in data modelling for the Web.

The presented text is also available on the project’s web site. In case you were wondering, this project no. CZ.1.04/5.1.01/77.00440 was funded by the European Social Fund through the Human Resources and Employment Operational Programme and the state budget of the Czech Republic.


“All models are wrong, but some are useful.”
George E. P. Box

The presented text offers a set of recommendations for designing ontologies and vocabularies for data on the Web. The motivation for creating it was to collect relevant advice for data modelling scattered in various sources into a single resource. It focuses on the intersection of vocabularies defined using the RDF Schema (Brickley, Guha, 2014) and those that are intended to be used in RDFa Lite syntax (Sporny, 2012) in HTML web pages. It specifically aims to support vocabularies that aspire to large-scale adoption.

The vocabularies in question are domain-specific, unlike upper ontologies that span general aspects of many domains. Therefore, it is necessary to delimit the domain to be covered by the developed vocabulary in order to restrict its scope. The target domain can have a broad definition, which may be further clarified by examples of data falling into the domain and examples of data that are out of the domain’s scope. Particular details of the vocabulary’s specialization may be refined during the initial research or the vocabulary’s design.

“Do not reinvent the wheel.”
HTML design principles
(Kesteren, Stachowiak, 2007)

It is appropriate to devote the initial stage of vocabulary development to research and preparation. One may consider three principal kinds of relevant resources that can be pooled when designing a vocabulary. These resources comprise existing data models, knowledge of domain experts, and domain-specific texts.

Existing data models

Research of existing data models helps to prevent unnecessary work by answering two main questions:

  1. Is there an available data model that can be reused as a whole instead of developing a new data model?
  2. What parts of existing data models can be reused in design of a new data model?

There are two main types of data models that are relevant for reuse in vocabulary development. The first type covers ontological resources, which consist of available vocabularies and ontologies. If one finds such a resource that describes the target domain and fits the envisioned use cases, it can be reused directly as a whole, provided that its terms of use permit it. If there is a suitable vocabulary that addresses only some of the foreseen uses, it can be extended to cover the others as well. Otherwise, a new vocabulary may be composed of elements cherry-picked from the available ontological resources, which forms the basis of the reuse-based development of vocabularies (Poveda-Villalón, 2012). One of the best places to look for these resources is Linked Open Vocabularies, which provides a full-text search engine for the publicly available vocabularies formalized in RDF Schema or OWL (Motik, Patel-Schneider, Parsia, 2012).

The second kind of resources to consider encompasses non-ontological resources, such as XML schemas or data models in relational databases. As these resources cannot be reused directly for building vocabularies, they need to be re-engineered into ontological resources, which is a process that is also referred to as ‘semantic lifting’. Taking non-ontological resources into account may complement the input from ontological sources well. Special attention should be paid to industry standards produced by standardization bodies such as ISO. An alternative approach is to analyze what schemas are employed in public datasets from the given domain, for which data catalogues, such as Datahub, may be used.

Knowledge elicitation with domain experts

“Role models are important.”
Officer Alex J. Murphy / RoboCop

Domain experts constitute a source of implicit knowledge that is not yet formalized in conceptualizations documented in data models (Schreiber et al., 2000). Knowledge elicited from experts who have internalized a working knowledge of the domain of interest can feed into the conceptual distinctions captured by the developed vocabulary. The choice of experts to consult depends on the domain in question. The interviewed experts can range from academic researchers to practitioners from the industry. Similarly, the selection of knowledge elicitation methods should be motivated by the intended use cases for the developed vocabulary. Common methods that serve the purpose of knowledge acquisition include discussion of a glossary, manual simulation of tasks to automate, and competency questions.

A glossary is a useful aid that may guide interviews with domain experts. It can be either manually prepared or constructed automatically from the developed vocabulary. The glossary can be written down as a table in which each vocabulary term is listed together with its label, working definition, and broadly described type (e.g., class, property, or individual). It can then serve as a basis for discussion about the established terminology in the domain covered by the developed vocabulary.

Collaboration with domain experts is an opportunity to conduct manual simulation of tasks that are intended to be performed automatically using data described by the developed vocabulary. Such simulation can provide a practical grounding for the vocabulary design with respect to its planned use cases. The simulation should reveal what kinds of data are important for carrying out the envisioned tasks successfully. It can indicate what data can be added to aid in such tasks and what data makes a difference in deciding how to proceed in the chosen tasks. For example, if the target domain is the job market, a simulation task may set about matching sample CVs of job seekers to actual job offers, which can suggest what properties are important to tell a likely successful candidate.

A classical approach to eliciting knowledge from domain experts is to discuss competency questions. These are the questions that data described with the developed vocabulary should be able to answer. As such, competency questions can serve as tests that examine if a vocabulary is sufficiently capable to support its planned use cases. For example, these questions may specify what views on data must be possible, what are the users’ needs that data must be able to answer in a single query, or what level of data granularity and detail is needed.

Analysis of domain-specific corpora

“Pave the cowpaths.”
HTML design principles
(Kesteren, Stachowiak, 2007)

While eliciting knowledge from domain experts concentrates on implicit knowledge, analyses of domain-specific corpora search for common patterns in explicit, yet unstructured, natural-language text. Textual analysis can be considered a data-driven approach to schema discovery. Its key purpose is to ensure that the designed vocabulary can express the most common kinds of data published in the target domain. The approaches to processing domain-specific textual corpora can be divided into qualitative, manual analyses and quantitative, automated analyses.

Qualitative analysis

Manual qualitative analysis can be performed on a smaller domain-specific corpus, which can consist of tens of sample documents. The corpus should be analysed by a knowledge engineer to spot common patterns and identify the most important types of data in the domain. Qualitative analysis may result in clusters of similar types of data grouped into a hierarchical tree, in which the most frequently occurring kinds of data are highlighted. The identified clusters may then serve as precursors of classes in the developed vocabulary.

Quantitative analysis

A corpus of texts prepared for quantitative analysis can be sampled from sources on the Web that publish semi-structured data describing the domain of the vocabulary. Producers of these sources can be projected as potential adopters of the developed vocabulary. The texts need to be written in a single language, so that translation is not necessary. Contents of the corpus ought to be sampled from a wide array of diverse sources in order to avoid sampling bias. The corpus needs to be sufficiently large, so that findings based on analysing it may be taken as indicative of the general characteristics of the covered domain. The establishment of such an extensive corpus typically requires automated harvesting of texts via web crawlers or scripts that access data through APIs.

Quantitative analysis of domain-specific corpora can be likened to ‘distant reading’. Its aim is to read through the corpus and discover patterns of interest to the vocabulary creator. A typical task of this type of analysis is to extract the most frequent n-grams, indicating common phrases in the established domain terminology, and map their co-occurrences. Quantitative analyses on textual corpora may be performed using dedicated software, such as Voyant Tools or CorpusViewer.

Abstract data model

The results of the performed analyses and knowledge elicitation should provide a basis for the development of an abstract data model. At this stage, the data model of the designed vocabulary is abstract because it is not mapped to any concrete vocabulary terms, in order to avoid being tied too closely to a particular implementation. The abstract data model may initially be formalized as a mind map, a hierarchical tree list, or a table. Vocabulary creators can base the model on the clusters of the most commonly found terms from domain corpora and sort them into a glossary table. Such a proto-model should pass through several rounds of iteration based on successive reviews by the vocabulary creators. Key classes and properties in the data model should be identified and equipped with both preferred and non-preferred labels (i.e. synonyms) and preliminary definitions. To get an overview of the whole model and the relationships of its constitutive concepts, it may be visualised as a UML class diagram or using a generic graph visualization.

Data model’s implementation

“One language’s syntax can be another’s semantics.”
Brian L. Meek

When the abstract data model is deemed sound from the conceptual standpoint, it can be formalized in a concrete syntax. The primary languages that should be employed for formalizing the abstract data model are RDF and RDF Schema. As simplicity should be a key design goal, the use of more complex ontological restrictions expressed via OWL ought to be limited to a minimum. The implementation should map the elements of the abstract data model to concrete vocabulary terms that may be either reused from the available ontological resources or newly created.2 At this stage, the expressive RDF Turtle syntax (Prud’hommeaux, Carothers, 2014) may be conveniently used to produce a formal specification of the developed vocabulary.

The implementation process should follow an iterative development workflow, using examples of data in place of software prototypes. During each iteration, samples of existing data from the vocabulary’s domain may be modelled using the means provided by the vocabulary, so that it can be assessed how well the proposed data model fits its intended uses by seeing it applied to real examples.

General design principles

Implementation of a vocabulary may be guided by several general principles recommended for vocabularies targeting data written in markup embedded in HTML web pages. The goal of widespread adoption of the vocabulary on the Web puts an emphasis on specific design principles. Instead of focusing on conceptual clarity and expressivity, as in traditional ontologies, the driving principles of the design of lightweight web vocabularies accentuate simplicity, ease of adoption, and usability. This section further discusses some of the key concerns in vocabulary development, including conceptual parsimony, data-driven coverage, and the like.

Simplicity

A vocabulary should avoid complex ontological axioms and subtle conceptual distinctions. Instead, it ought to seek simplicity for the data producer, rather than the data consumer.3 It is advisable that vocabulary design tries to strike a fine balance between expressivity and the cost of implementation complexity. Following the principle of minimal ontological commitment (Gruber, 1995), vocabularies should limit the number of ontological axioms (and especially restrictions) to improve their reusability. The developed vocabulary should thus be as simple as possible without sacrificing the leverage its structure gives to data consumers. Nevertheless, not only should it make simple things simple, it should also make complex things possible. Practical vocabulary design can reflect this guideline by focusing on solving simpler problems first and complex problems later.

Ease of adoption

Adoption of a vocabulary may be made easier if the vocabulary builds on common idioms and established terminology that are already familiar to data publishers. Vocabulary design should strive for intuitiveness. In line with the principle of least astonishment, vocabulary users should mostly encounter things they can expect.

Usability

Vocabulary design should focus on documentation rather than specification. That being said, neither specification nor documentation can ensure correct use of a vocabulary. Even though vocabulary terms may be precisely defined and documented, their meaning is largely established by their use in practice. Nonetheless, correct application of vocabulary terms may be supported by providing good examples showing the vocabulary in use. As Guha (2013) emphasizes, the default mode of authoring structured data on the Web is copy, paste, and edit, for which the availability of examples is essential. Usability of vocabularies can also be improved by following the recommendations of cognitive ergonomics (Gavrilova, Gorovoy, Bolotnikova, 2010), such as readable documentation or a vocabulary with narrow width and shallow depth.

Conceptual parsimony

Vocabulary design should introduce as few conceptual distinctions as possible, while still producing a useful conceptualization. A vocabulary does not need to include means of expressing data that can be computed or inferred from data expressed by other means. For example, it is not necessary to include a :numberOfOffers property, because its value may be computed if there already is a :hasOffer property, whose distinct objects may be counted to arrive at the same data. An exception to this rule is warranted if it is expected that data producers may only have the computed data, but not the primary data from which it was derived. For example, the number of offers may be available even though the disaggregated list of individual offers is not. There is also no need to define inverse properties, such as :isOfferOf for the :hasOffer property. In a similar manner, a vocabulary should not require explicit assertion of data that can be recovered from implicit context, such as data types for literal values. On the other hand, it is important to recognize that this approach shifts the burden from data publishers to the clients consuming the data, which need to execute additional computation, such as inference, to materialize implicit data.

In general, additional conceptual distinctions are useful only if vocabulary users are able to apply them consistently. It is important to realize that valuable conceptual distinctions, justified from the experts’ perspective, may not lead to more reliable data. Vocabulary creators should mainly concentrate on offering means for describing data that can be reliably provided by a large number of parties. A key reason for adding a conceptual distinction is that it enables more data to be published.

The merits of conceptual distinctions should be judged based on their discriminatory value. In other words, the value of a distinction is in how it differs from the rest of the vocabulary. The more finely or ambiguously a vocabulary term is defined, the more likely it will be used incorrectly. Complex designs are subject to misinterpretation. If vocabulary terms cannot be understood by data producers easily and reliably, they will not be used (resulting in less data) or will be used inconsistently (resulting in lower data quality). Therefore, a vocabulary should only use conceptual distinctions that matter and are well understood in the target domain.

Data-driven coverage

Since enabling existing data to be published in a structured form is an essential goal of vocabulary development, it ought to be driven by the available data. A data-driven approach implies that vocabularies should not use conceptualizations that do not match well the common database schemas in their target domains. If this is not the case, then data producers do not have a way of providing their data described using the vocabulary unless they alter their databases’ schemas and change the way they collect data. A vocabulary should be descriptive rather than prescriptive. Vocabulary design should be driven by existing data rather than prescribing what data should be published.

Communication interface

Vocabularies should accurately represent the domain they cover only to the degree that it improves the consistency of vocabulary use. The shared reality mirrored by a vocabulary may serve as a common referent improving shared understanding. However, the prime goal of a vocabulary is not to model the world, but to enable communication that gets a message across; its prime aim is communication rather than representation. For example, structured values such as postal addresses do not represent reality, but they help formalize communication.

A vocabulary defines a communication interface between data producers and data consumers. Data producers are typically people, whereas data consumers are typically machines. Therefore, vocabulary design should balance usability for people with usability for machines. Vocabularies ought to be designed for people first and machines second (The microformats process, 2013). Vocabulary design should thus reflect the trade-off between a consistent understanding of the vocabulary among people and the degree to which it makes data machine-readable.

Syntax limitations

A vocabulary should be aligned with the syntax in which it is intended to be used. The design of a vocabulary is constrained by the expressivity of its intended syntax. For example, HTML5 Microdata’s lack of a mechanism for expressing inverse properties, such as the RDFa rev attribute, may warrant adding inverse properties to a vocabulary. The syntax of data can be considered a medium for the vocabulary. In the case of vocabularies made for data embedded in web pages, such as Schema.org, their design should correspond to simpler markup. For example, a vocabulary should require less nesting.

Tolerant specification

A vocabulary specification should be tolerant about the data it can express and should not impose a fixed schema. No properties should be required, so that not providing some data is not invalid. On the other hand, the vocabulary should allow additional data to be expressed, so that superfluous data is not invalid either, unless it raises a contradiction. It is advisable to use cardinality restrictions on properties only sparingly, as it is difficult to make them generally valid in the broad context of the multicultural Web. A vocabulary should support dynamic data granularity and a varying level of detail, so that unstructured text values may be used in place of structured values when the structure cannot be reconstructed from the source data (see the sketch below). Specific consumers of data may add specific requirements, which can be negotiated case by case with particular data producers. Overall, data consumers should be expected to follow the spirit of “some data is better than none” (Schema.org: data model, 2012) and accept even broken or partial data.
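As a concrete illustration of varying granularity, Schema.org lets a property accept both structured and plain-text values. The following Turtle sketch uses Schema.org’s schema:address for this purpose; the ex: resources and the address values are hypothetical and serve only as an example:

    @prefix ex:     <http://example.org/data#> .
    @prefix schema: <http://schema.org/> .

    # The address given as a structured value...
    ex:office1 schema:address [
        a schema:PostalAddress ;
        schema:streetAddress "Náměstí Míru 9" ;
        schema:addressLocality "Praha"
      ] .

    # ...or, when the structure cannot be reconstructed from the source
    # data, the same property given a plain text value.
    ex:office2 schema:address "Náměstí Míru 9, Praha" .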

Vocabulary evolution

If a vocabulary aims for mass adoption, backwards-incompatible changes need to be avoided. It is therefore advisable not to remove or deprecate any vocabulary terms, but rather to list them as non-preferred with a link to their preferred variant. Large-scale use of a vocabulary raises the cost of changes, because more vocabulary users (both data producers and consumers) need to react to them, and widespread adoption makes changes harder to propagate, because updates need to reach a larger audience.
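One possible way to express such a non-preferred term is sketched below in Turtle. The ex: terms are hypothetical, and Schema.org’s schema:supersededBy is used here only as one candidate linking property; a vocabulary-specific property could serve the same purpose:

    @prefix ex:     <http://example.org/vocab#> .
    @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix schema: <http://schema.org/> .

    # The old term stays in the vocabulary, so existing data remains valid;
    # it is merely marked as non-preferred and linked to its replacement.
    ex:employmentType a rdf:Property ;
      rdfs:label "employment type"@en ;
      schema:supersededBy ex:contractType .

    ex:contractType a rdf:Property ;
      rdfs:label "contract type"@en .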

Conclusion

“It’s probably better to allow volcanoes to have fax machines than try to define everything ‘correctly’. Usage will win out in the end.”
Martin Hepp

The methods for designing vocabularies for data on the Web introduced in this text do not form a coherent methodology; instead, they compile and synthesize recommendations proposed in related work. The guiding principles manifested in the presented methods should not be considered hard-and-fast rules, but rather suggestions based on the experience of seasoned vocabulary designers. They include both practical advice on researching the state of the art in a vocabulary’s target domain and concerns to keep in mind when implementing a formal conceptualization for a vocabulary. Moreover, the presented methods do not involve the notion of a vocabulary being “right”; instead, they aim at developing vocabularies that are useful. Therefore, it is only by practical use on the Web in the long term that these methods and recommendations may themselves be “proved” useful.

References

  • BRICKLEY, Dan; GUHA, R.V. (eds.). RDF Schema 1.1 [online]. W3C Recommendation 25 February 2014. W3C, 2004-2014 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/rdf-schema/
  • GAVRILOVA, T. A.; GOROVOY, V. A.; BOLOTNIKOVA, E. S. Evaluation of the cognitive ergonomics of ontologies on the basis of graph analysis. Scientific and Technical Information Processing. December 2010, vol. 37, iss. 6, p. 398-406. Also available from WWW: http://link.springer.com/article/10.3103%2FS0147688210060043. ISSN 0147-6882. DOI 10.3103/S0147688210060043.
  • GRUBER, Thomas R. Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies. November 1995, vol. 43, iss. 5-6, p. 907-928. Also available from WWW: http://tomgruber.org/writing/onto-design.pdf
  • GUHA, R. V. Light at the end of the tunnel [video]. Keynote at 12th International Semantic Web Conference. Sydney, 2013. Also available from WWW: http://videolectures.net/iswc2013_guha_tunnel
  • KESTEREN, Anne van; STACHOWIAK, Maciej (eds.). HTML design principles [online]. W3C Working Draft 26 November 2007. W3C, 2007 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/html-design-principles/
  • MOTIK, Boris; PATEL-SCHNEIDER, Peter F.; PARSIA, Bijan (eds.). OWL 2 Web Ontology Language: structural specification and functional-style syntax [online]. W3C Recommendation 11 December 2012. 2nd ed. W3C, 2012 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/owl2-syntax/
  • POVEDA-VILLALÓN, María. A reuse-based lightweight method for developing linked data ontologies and vocabularies. In Proceedings of the 9th Extended Semantic Web Conference, Heraklion, Crete, Greece, May 27-31, 2012. Berlin; Heidelberg: Springer, 2012, p. 833-837. Lecture notes in computer science, vol. 7295. Also available from WWW: http://link.springer.com/chapter/10.1007%2F978-3-642-30284-8_66. ISSN 0302-9743. DOI 10.1007/978-3-642-30284-8_66.
  • PRUD’HOMMEAUX, Eric; CAROTHERS, Gavin (eds.). RDF 1.1 Turtle: terse RDF triple language [online]. W3C Recommendation 25 February 2014. W3C, 2008-2014 [cit. 2014-04-30]. Available from WWW: http://www.w3.org/TR/turtle/
  • Schema.org: data model [online]. June 6th, 2012 [cit. 2014-04-29]. Available from WWW: http://schema.org/docs/datamodel.html
  • SCHREIBER, Guus [et al.] (eds.). Knowledge elicitation techniques. In Knowledge engineering and management: the CommonKADS methodology. Cambridge (MA): MIT, 2000, p. 187-214. ISBN 0-262-19300-0.
  • SPORNY, Manu. RDFa Lite 1.1 [online]. W3C Recommendation 07 June 2012. W3C, 2012 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/rdfa-lite/
  • The microformats process [online]. April 28th, 2013 [cit. 2014-04-29]. Available from WWW: http://microformats.org/wiki/process

Footnotes

  1. The result of this endeavour can be found here: https://github.com/OPLZZ/data-modelling

  2. Those may be in turn mapped to other vocabularies’ terms; e.g., via rdfs:subClassOf.

  3. However, it must be possible to reconstruct the main data structures, at least from their context and without out-of-band knowledge.

Epistemology of data in contemporary science

At the turn of last year I published a paper in E-LOGOS, a Czech philosophy journal (the paper is available here). The published text deals with the image of data that is widespread in contemporary science. It offers a critical look at the understanding of data that is a key part of the big data hype. Again, as is the case for most of my work in areas with which I am not deeply familiar, it is a compilation and remix of thoughts drawn from a wide array of sources. What you see below is a (rough) English translation of the text (with proper hyperlinks). This way, there’s a chance it will be indexed well enough for the interested audience to find it.


Abstract: Contemporary science is dominated by a positivist epistemology of data, which builds on the foundations of metaphysical realism and the ideal of mechanical objectivity. This approach to data suffers from a number of flaws. Its shortcomings were identified in many critical responses and led to a new problematization of the established concept of data. The often criticised aspects of this epistemology concern data being embedded in the context of its making, and point to the mediation of data and its openness to manipulation. In recent years, the function of data has gained unprecedented importance due to the rising appetite of science for data, which attracted attention to this formerly unproblematic concept. Several alternative approaches to the epistemology of data have appeared, of which this text introduces the positions proceeding from constructivism and rhetoric. The presented paper draws heavily on critical literature in the epistemology of data. Due to its summarising character, it may be understood as a synthesis and reconfiguration of existing thoughts on the topic. In this way, the paper offers a contribution to rhetorical argumentation in the discourse on data in contemporary science.

Introduction

In spite of the fact that the etymology of words often bears no correspondence to their use, the roots of ‘data’ give many hints about the way this word is used. Data, as Rosenberg describes (2013, p. 18), comes from the plural of the Latin word ‘datum’, a neuter past participle of the verb ‘dare’, which is translated as ‘to give’. ‘Datum’ can thus be translated as something ‘given’. Common use of data is in line with this explanation, often treating it as something given, which needs no questioning.

Constructivist epistemology takes a stance opposing this viewpoint and claims that nothing is given, since everything is a product of human construction. For example, Bachelard writes:

“For a scientific mind, all knowledge is an answer to a question. If there has been no question, there can be no scientific knowledge. Nothing is self-evident. Nothing is given. Everything is constructed.” (Bachelard, 2002, p. 25)

Diverging understandings of data provide a basis for contemporary criticism that undermines the established status of data in science. On the one hand, data is perceived as a direct reflection of reality, whereas on the other hand, it is conceived as an artifact of human creation. These diverging perspectives are reflected in the problematization of data, which raises many doubts. For example, the work of Poovey dedicated to the history of the modern fact formulates many questions that apply to the concept of data as well:

“What are facts? Are they incontrovertible data that simply demonstrate what is true? Or are they bits of evidence marshalled to persuade others of the theory one sets out with? Do facts somehow exist in the world like pebbles, waiting to be picked up? Or are they manufactured and thus informed by all social and personal factors that go into every act of human creation? Are facts beyond interpretation? Or are they the very stuff of interpretation, its symptomatic incarnation instead of the place where it begins?” (Poovey, 1998, p. 1)

This text comprises some of the possible answers to these questions. Proceeding from historical traces of the evolving understanding of data, the following sections introduce criticism of the dominant realist epistemology and offer alternative epistemologies drawing on constructivism or rhetoric.

Brief history of data

The concept of ‘data’ has been in use for a long time, yet it acquired its current meaning only at the onset of modernity (Gitelman, 2013, p. 15). One of the first known uses of this concept appears in Euclid’s book entitled Data (Euclid, 1834). The book describes methods for solving and analysing problems, in which data serves either as that which is known in relation to a hypothesis, or that which can be demonstrated to be known. Data offers starting points of inquiry, from which new knowledge may be inferred.

Describing givens as data remained in use at least until the 17th century. In disciplines such as mathematics, philosophy and theology, data signified given foundations that were not to be disputed (Gitelman, 2013, p. 19). For instance, theology employed the term for the things given by God or the Bible.

The concept came close to its modern use in the early 18th century. Instead of standing for unquestionable givens, data came to be used for the results of experiments, experience or collection. In other words, data “went from being reflexively associated with those things that are outside of any possible process of discovery to being the very paradigm of what one seeks through experiment and observation” (Gitelman, 2013, p. 36). Making this approach a part of general knowledge can be ascribed chiefly to positivism. Viewed in terms of epistemology, it can be described as metaphysical realism of data.

Metaphysical realism of data

The view of metaphysical realism assumes that data can be collected from objectively perceivable reality. The realist framework deems data an exact record or a faithful representation of reality. The rhetoric of scientific ‘discoveries’ requires that the knowledge to be discovered already exists in reality; science is then tasked with revealing such knowledge. Photography offers a prototypical example of the accurate capture of reality. Photographs are “raw representations of the natural world,” which stand for a “unique and literal transcription of nature - a ‘scientific record’” (Gitelman, 2013, p. 4, appendix).

Scientists generally regard data as records of structured observation guided by a protocol designed up front (Halavais, 2013). Data collection is described as observation without interference with the observed reality. The desired objectivity of scientific data presupposes a separation of the perceiving subject from the perceived reality. Metaphysical realism, according to Von Glasersfeld (1984, p. 2), asserts that “we may call something ‘true’ only if it corresponds to an independent, ‘objective’ reality.” The realist framework bestows data with a privileged role, while holding subjective description to be distant from ‘genuine’ reality.

Epistemic privilege of data

The emphasis on data is particularly characteristic of science in recent decades, when large volumes of data became widely available. Yet data had a fundamental role in science in former times as well; for example, Nelson (2009) mentions the Rudolphine Tables by Johannes Kepler as an early example of the scientific use of data. However, data acquired its peculiar function “in the epistemology we associate with modernity” (Poovey, 1998, p. 1).

Modern science started to draw increasingly on quantified data, which contributed “in a major way to the impression of objectivity in scientific prose” (Gross, 2002, p. 37). The average number of data tables used in scientific articles almost doubled between the 19th and 20th centuries. Half of a sample of 20th-century articles contained a table, with an average of 5 tables per article (Gross, 2002, p. 182). Science in the 20th century strengthened its distinctive preference for quantitative facts over qualitative facts. The language of science reflects this shift in distinguishing ‘hard data’, the quantitative nature of which lends the data an aura of unquestionability, from ‘soft data’, the qualitative nature of which enables the data to be bent at will. In some cases, this preference goes to extremes, in which quantitative datasets “are given considerable weight even when nobody defends their validity with real conviction” (Porter, 1995, p. 8). A large share of modern scientists fill their papers with mechanical or mathematical explanations of the facts they describe, while their “argumentative strategy for establishing facts and explanations typically revolves around comparisons of data sets” (Gross, 2002, p. 188). Mathematical explanations referring to data are often privileged for their alleged elegance and clarity (Halevy, 2009). During the 20th century there appears a marked inclination to favour “comparison of large data sets; in addition, mathematics is applied, seemingly whenever possible” (Gross, 2002, p. 231). These trends gradually led to a “rapid ‘commodification’ of data”, which causes data to be presented as “complete, interchangeable products in readily exchanged formats” and may encourage “misinterpretation, over reliance on weak or suspect data sources, and ‘data arbitrage’ based more on availability than on quality” (Edwards, 2013, p. 7).

Data-driven science

In recent years, the emphasis on using data in science has increased to such an extent that some proclaim it to bring about a new methodological paradigm of data-driven science (Leonelli, 2014). This approach is labelled the fourth paradigm of science, which uses data-intensive research, in which computers help find knowledge in data, to extend the preceding three paradigms: the paradigm of empirical observation, the paradigm of explanatory models, and the paradigm of simulation for insight into complex phenomena (Nielsen, 2012). Data is considered a product of quantitative research, which is especially privileged to serve as scientific evidence.

The extreme cases promoting this scientific paradigm have earned the label of ‘data fundamentalism’ (Crawford, 2013). For example, Anderson’s controversial article (2008) announces big data as the “end of theory” and claims that numbers speaking for themselves make hypotheses unnecessary. However, as Keller (1985, p. 130) points out, “the problem with this argument is, of course, that data never do speak for themselves.” Regardless of these critics, some authors believe that abandoning the formulation of hypotheses contributes to a ‘purification’ of science:

“In a small-data world, because so little data tended to be available, both causal investigations and correlation analysis began with a hypothesis, which was then tested to be either falsified or verified. But because both methods required a hypothesis to start with, both were equally susceptible to prejudice and erroneous intuition.” (Mayer-Schönberger, 2013)

The disregard for hypotheses in this scientific paradigm may be attributed to the ‘unreasonable effectiveness of data’. Some authors, such as Halevy (2009), contend that simple models or hypotheses equipped with large enough data inevitably surpass complex models that lack data. In its extreme form, data-driven science overturns the usual relation between hypotheses and data, in which the hypothesis plays a primary role and data only provides grounds for verification or falsification. Instead, this paradigm promotes processes for the inductive generalisation of data into valid hypotheses. Research on these methods is a principal concern of the field of data mining, where particulars given in data may be distilled into universals, such as sets of association rules.

Data-driven science considers hypotheses inherently untrustworthy if their reliability is not backed by data. However, as the critical rationalism of Karl Popper teaches, data supporting a hypothesis does not verify it; the data merely falsifies incompatible hypotheses. The trustworthiness ascribed to data can be illustrated by the popular saying: “In God we trust, everyone else bring data.” Data thus functions as evidence testifying to the truthfulness of the presented claims. For instance, Markham (2013) mentions the impression of ‘instant credibility’ that proceeds from data.

Contrary to the afore-mentioned claims, Boyd and Crawford (2012, p. 663) describe this uncritically accepted approach as a “widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.” Keil, for example, adds that “data-driven science is a failure of imagination” (Keil, 2013). Keil stresses that science cannot ignore hypotheses, which constitute models or theories. Instead, it is necessary to combine empirical observation with the making of hypotheses. As Keil suggests, a vast volume of data does not help if it is not confronted with useful theory. Although a larger volume of data increases the support it lends to prevalent hypotheses, it increases the noise in the data in the same way. This is why a large volume might in fact, despite expectations, multiply the problems in data. The expectations usually held for data stem from the assumption of its mechanical objectivity.

Myth of mechanical objectivity of data

The epistemic privilege of data springs from the ideal of the mechanical objectivity of data (Gitelman, 2013), which ignores the contextuality, mediation and manipulation of data. The presumed absence of human input (e.g., in photography) and the minimisation of unwanted influences are seen as fundamental to achieving the goal of objectivity. The belief in the neutrality, autonomy and objectivity of data is widespread. For example, Porter states that “when philosophers speak of the objectivity of science, they generally mean its ability to know things as they really are” (Porter, 1995, p. 3). In accordance with this demand, metaphysical realism deems data to be a direct representation of reality. However, data cannot be an exact reflection of complex reality, as it cannot avoid reducing the reality and omitting details that are reckoned unnecessary for the purpose of the data. A high level of reduction makes data lose the ability to represent, and thus data can be considered no more than an approximation of reality (Markham, 2013). More causes of data failing to represent reality can be identified, some of which are examined in further sections of the text.

Nevertheless, it is important to acknowledge that there are other ways of formulating the objectivity of science. One of them is an influential definition of objectivity as an “ability to reach consensus” (Porter, 1995, p. 3); another equates objectivity with “fairness and impartiality” (ibid., p. 4). However, the ideal of mechanical objectivity is unattainable, because data is always mediated and its creation is embedded in a context that can be neither avoided nor reproduced.

Contextuality of data

Data is shaped to a large extent by the context in which it is created. In science, the sense of data is “tightly dependent on a precise understanding of how, where, and when they were created” (Edwards, 2013). “Knowledge production is never separate from the knowledge producer”, nor can data be obtained without direct or indirect human influence, so human thinking always marks the produced data. Direct sensory input is thus combined with the mental contents of the perceiver, while indirect perception using instruments is affected by the views of the instruments’ creators. Bachelard sums this up in writing that “when we contemplate reality, what we think we know very well casts its shadow over what we ought to know” (Bachelard, 2002, p. 24). Therefore, there is a need to keep in mind the “situated, material conditions of knowledge production” (Gitelman, 2013, p. 4), which cause the resulting data to be “framed and framing” (ibid., p. 5).

Apart from the data creators, data is significantly framed by the environment in which it is created. For instance, Magee draws attention to this influence:

“Knowledge systems are all too frequently characterised in essentialist terms - as though, as the etymology of ‘data’ would suggest, they are merely the housing of neutral empirical givens. […] on the contrary, that systems always carry with them the assumptions of cultures that design and use them - cultures that are, in the very broadest sense, responsible for them.” (Magee, 2011, p. 15)

The environment may determine the way data is gathered, for example by standardising different methods, which can moreover evolve over time. For example, reclassifications within systems of categories may happen over the years, which significantly worsens the comparability of data in time series (Diakopoulos, 2013).1

The influence of context cannot be eliminated, nor can it be reproduced. Even though data is of a discrete nature, so that “each datum is individual, separate and separable, while still alike in kind to others in its set” (Gitelman, 2013, p. 8), and thus data may be partially decontextualised, efforts to remove contextual influences completely are bound to fail.

Some expect that data may be purged of contextual distortions and subjectivity if a large volume of data containing records of many variations of the same phenomenon is available. The promise of big data rests on the conjecture that, by combination and aggregation, data may be neutralised and its individual tint dampened, so that it draws nearer to objective reality. The fallacy of this approach lies in neglecting the arbitrariness of the chosen aggregation method and in failing to take the incompleteness of data into account. The choice of aggregation is a subjective act, in which arbitrary conceptualisations (e.g., categories) are selected, so that the resulting aggregated data may end up further away from the described reality. No matter how large data is, it always remains an incomplete sample, the selection of which may omit what is important. The extensiveness of data does not guarantee its representativeness, because data is subject to limitations and prejudices independently of its size. By the same token, quantity cannot substitute for the quality and consistency of data, which may diverge due to varying contexts. In the case of data samples, absolute values are never exact and relative values are loaded with the skew of sample selection and aggregation.

Moreover, attempts to decontextualise data may be harmful to the context of its use. Boyd and Crawford remark that if “taken out of context, data lose meaning and value” (Boyd, 2012, p. 670). Data is a medium that requires active participation and understanding from its users. The acquisition of knowledge is not a passive process (Von Glasersfeld, 1984, p. 9). Even though the assumption of an objectively perceivable reality constitutes a common object of data, which forms the basis of shared understanding, the universal comprehension of data remains a fiction, because the interpretation of data depends not only on its object, but on its context as well (Markham, 2013).

Mechanical objectivity tries to reduce contextual influences to a clearly delimited protocol. The production of data is thus governed by strict rules. In this way, mechanical objectivity is defined as an ability to follow rules and a fixed protocol (Porter, 1995, p. 4). The function of the protocol is to set up a controlled context and minimise unwanted influences that might be reflected in the created data. A transparent and documented protocol of data preparation, containing detailed information about data provenance, contributes to the trustworthiness of data. Bird adds that “what makes something an item of observational knowledge is the reliability and uncontentious nature of the mechanism which produces it” (Bird, 2010, p. 10). In this way, users of data may evaluate the “adequacy of the experimental conditions under which data have been produced” and determine what level of reliability can be expected from the data and what its evidential value is.

In a similar manner, the restrictions of a protocol aim to make data reproduction feasible. However, Leonelli states that, for the most part, data is “idiosyncratic to particular experimental contexts, and typically cannot occur outside of those contexts” (Leonelli, 2009). Data is unavoidably embedded in an unrepeatable context, which makes it impossible to reproduce in full. At most, one can attempt to reproduce the methods used to create the data, which may lead to other, yet partially compatible, data.

Mediation of data

The immediacy ascribed to data comes from a desire for direct knowledge of reality. ‘Raw’ data is attributed the quality of primariness. It is thought to be data coming ‘directly’ from its source, which is reality itself. This alleged quality relates to the seemingly natural process of the mechanical production of data. One may succumb to the impression that the value of data depends on the straightforwardness of its derivation from reality. For example, data from automated sensors might be perceived as substantially more trustworthy than calculations of impact factor based on indirect inputs that are considerably distant from reality. Thanks to the implied immediacy of data, it is often understood, in accordance with its etymology, as having an axiomatic nature, which puts it beyond dispute (Halavais, 2013). “At first glance data are apparently before the fact: they are the starting point for what we know, who we are, and how we communicate” (Gitelman, 2013, p. 2). “Data is beyond argument,” writes Markham (2013), because data is understood as that which precedes argument. In this perspective, data escapes interpretation and analysis, and is thereby held to be free of subjective influence. Nevertheless, as Boyd and Crawford suggest, “claims to objectivity are necessarily made by subjects and are based on subjective observations and choices” (Boyd, 2012, p. 667).

The assumption of the pre-analytical nature of data has been subjected to criticism that problematised the concept of ‘raw data’ and contested the validity of this assumption. As early as 1929, Dewey criticised this presumption:

“[…] all of the rivalries and connected problems grow from a single root. They spring from the assumption that the true and valid object of knowledge is that which has being prior to and independent of the operations of knowing. They spring from the doctrine that knowledge is a grasp or beholding of reality without anything being done to modify its antecedent state - the doctrine which is the source of the separation of knowledge from practical activity.” (Dewey, 1929, p. 196)

Such an approach was deemed generally adopted by 1985, when Keller wrote that “it is by now a near truism that there is no such thing as raw data; all data presuppose interpretation” (Keller, 1985, p. 130). Bowker adds the remark that “raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care” (Bowker, 2005, p. 184).

The key point of such criticism is the recognition that interpretation is already present in observation and in the production of data itself. Nunberg contends that the properties we ascribe to information, and which can be ascribed to data as well, such as its “metaphysical haeceity or ‘thereness,’ its transferability, its quantised and extended substance, its interpretive transparency or autonomy — are simply the reifications of the various principles of interpretation that we bring to bear in reading these forms” (Nunberg, 1996). Data and “numbers are interpretive, for they embody theoretical assumptions about what should be counted, how one should understand material reality, and how quantification contributes to systematic knowledge about the world” (Poovey, 1998, p. xii). Therefore, data cannot be accepted as “simple observations about particulars, which were immune from interest and theoretical conjectures of any kind” (ibid., p. xxiv). For various reasons, data is always mediated, and so, as Bachelard writes:

“Knowledge of reality is a light that always casts a shadow in some nook or cranny. It is never immediate, never complete.” (Bachelard, 2002, p. 24)

Data is inevitably produced via media. Examples of such media are instruments, such as the microscope, or models that may synthesise data. It is because of media that science can make claims about objects and properties that escape direct observation, such as long-extinct galaxies noticeable via telescopes (Bogen, 1988). A general view then assumes that “scientific theories predict and explain facts about ‘observables’: objects and properties which can be perceived by the senses, sometimes augmented by instruments” (ibid., p. 303).

In the course of its development, data necessarily passes through models of reality. A model is a medium of cognition. “Rational views of the universe are idealised models that only approximate reality” (Kent, 2000, p. 220); however, “we can share a common enough view of it for most of our working purposes, so that reality does appear to be objective and stable” (ibid., p. 228). Yet there are some who assert that “‘sound science’ must mean ‘incontrovertible proof by observational data,’ whereas models were inherently untrustworthy” (Edwards, 2013, p. xviii). “Let the data speak for themselves” (Keller, 1985, p. 129) is demanded by those who call for raw, immediate data. Edwards considers the assumption that immediacy can be achieved by “waiting for (model-independent) data” to be misguided (Edwards, 2010, p. xiii). As he writes further, “no collection of signals or observations — even from satellites, which can ‘see’ the whole planet — becomes global in time and space without first passing through a series of data models” (ibid., p. xiii). The dependence of data on models can be seen in the example of weather forecasts and climate change predictions, for which “only about ten percent of the data used by global weather prediction models originate in actual instrument readings. The remaining ninety percent are synthesised by another computer model” (ibid., p. 21). In the same way as models or theories, data is only an imperfect approximation of reality. Nevertheless, just as Box (1987, p. 424) claims that “all models are wrong, but some are useful”, an analogous approach may be applied to data.

Data manipulation

The mediation of data allows for manipulation and purposeful reconstruction. Data may be distorted either deliberately or unintentionally. In some cases, the influence of context leaves marks on data that are barely noticeable. For example, Fanelli argues that “scientific results can be distorted in several ways, which can often be very subtle and/or elude researchers’ conscious control” (Fanelli, 2009). Nonetheless, even though science is generally associated with “fairness and impartiality” (Porter, 1995, p. 4), a significant share of data manipulation is deliberate. Babbage warned about data manipulation in science as early as 1830:

“Of cooking. This is an art of various forms, the object of which is to give to ordinary observations the appearance and character of those of the highest degree of accuracy. One of its numerous processes is to make multitudes of observations, and out of these to select those only which agree, or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty which will do for serving up.” (Babbage, 1830, p. 178)

Data manipulation in science is relatively prevalent. An anonymous survey revealed that roughly 2 % of scientists admit to having manipulated data, and about a third of the survey’s participants conceded to having been involved in dubious scientific practices. However, it should be kept in mind that these estimates are likely conservative, as this is a sensitive topic (Fanelli, 2009). Moreover, besides purposeful manipulation, data can be distorted through laziness or malpractice.

Intentional manipulation of data includes the disregard of unfavourable data, data answering suggestive questions, excessive generalisation, skewed (e.g., non-random) samples, misunderstanding of error margins, false causality, or finding statistically insignificant correlations in big data (Misuse of statistics, 2013). Data quality may also be degraded and obscured by reducing data to aggregations (Diakopoulos, 2013).

Alternative epistemologies of data

Apart from metaphysical realism, the epistemology of data can be considered from alternative viewpoints that do not suffer from the afore-mentioned shortcomings. This essentially “positivist picture of the structure of scientific theories is now widely rejected” (Bogen, 1988, p. 304) and its place has been taken by approaches that fall within postmodernism, yet frequently draw on older thinking, which in some cases dates back to the rhetorical origins of philosophy. The following sections introduce the approaches of constructivist epistemology and rhetoric, which are deemed to be mutually compatible.

Constructivist epistemology of data

Constructivist epistemology is based on the presumption that all knowledge is a human construction. The constructivist school of thought departs from metaphysical realism in not requiring a concept of objective reality. However, treating constructivism as a simple rejection of the concept of objective reality would be overly simplistic. Constructivism reverses the relation between data and reality and claims instead that data constitutes the reality it describes, so that “data are not found, they are made” (Halavais, 2013).

Some of the central theses of constructivist epistemology may already be clearly seen in the works of Giambattista Vico from the 18th century. The treatises of this intellectual predecessor of constructivism claim that “science (scientia) is the knowledge (cognitio) of origins, of the ways and the manner how things are made” and therefore “we can only know what we ourselves construct” (Von Glasersfeld, 1984). Such recognition is what distinguishes the scientific from the pre-scientific mind, because “whereas the pre-scientific mind possesses reality, the scientific mind constructs and reconstructs it, and in doing so is itself constantly reformed” (Bachelard, 2002, p. 9).

The foundations of constructivist epistemology likely build on the consequences of the shift of philosophy towards language. A constructivist reading may be applied to the works of the anthropologist and linguist Edward Sapir, who argues that the ‘world’ is constructed by the language of a community:

“The fact of the matter is that the ‘real world’ is to a large extent unconsciously built up on the language habits of the group. No two languages are ever sufficiently similar to be considered as representing the same social reality. The worlds in which different societies live are distinct worlds, not merely the same world with different labels attached.” (Sapir, 1990, p. 221)

Following Sapir’s reasoning, constructivism has no need for a homomorphism between data and reality, in which data corresponds to the experience of reality. Instead, data and knowledge are what fit reality and function consistently within it. To illustrate this relationship, Von Glasersfeld offers the simile of a key that fits a lock, in the same way as data matches reality (Von Glasersfeld, 1984, p. 3).

Constructivist claims are prone to attract simplified readings. For example, “the claim that science is socially constructed has too often been read as an attack on its validity or truth” (Porter, 1995, p. 11). In this regard, constructivism offers to replace the criterion of truth with the concept of inner consistency and the rule of no contradiction within a system of knowledge (Von Glasersfeld, 1984, p. 9). Under such conditions, research can be seen as a generative process that produces data and eliminates non-functional knowledge. An example of knowledge revealed as non-functional is what results from ‘apophenia’: the phenomenon of “seeing patterns where none actually exist” (Boyd, 2012, p. 668). Apophenia can affect data analysts, who succumb to the impression that they have discovered a causal chain of inference in data, whereas it is merely an idiosyncratic construction of the observer.

The constructivist approach is supported by the fact that the production of data is always to some degree an act of classification. The malleability of data allows it to be cast using chosen data structures and conceptualisations. The moment a classification is established in data, it becomes part of the data and is difficult to distinguish from it. The arbitrariness of data structures is something Kent devotes a lot of thought to:

“Data structures are artificial formalisms. They differ from information in the same sense that grammars don’t describe the language we really use, and formal logical systems don’t describe the way we think.” (Kent, 2000, p. xix)

In a similar fashion as the language of a community is used to construct a shared world, data structures form a basis for a shared understanding of data. Data structures are created with specific purposes in mind. “Like different kinds of maps, each kind of structure has its strengths and weaknesses, serving different purposes, and appealing to different people in different situations” (ibid.).

Although various aspects of constructivism are referred to by critics of metaphysical realism, mentioned mainly in the section discussing mediation, its implications for epistemology do not dominate many scientific domains. For example, in the field of psychology Piaget wrote already in 1980 that “fifty years of experience have taught us that knowledge does not result from a mere recording of observations without a structuring activity on the part of the subject” (Piaget, 1980, p. 377), and the principles of constructivist epistemology are already widely adopted in the humanities and social sciences; in the natural sciences, however, these principles are rather ignored (Hennig, 2002) and substituted with the remains of metaphysical realism.

Rhetoric of data

Rhetoric provides an interpretation of data that is complementary to the approach of constructivist epistemology. Their compatibility can be seen mostly in the case of the ontological approach to the epistemic understanding of rhetoric, which Brummett (1979) distinguishes. The ontological explanation of rhetorical epistemology purports that “discourse does not merely discover truth or make it effective. Discourse creates realities rather than truths about realities” (ibid.). The function of rhetoric is not limited to persuasion and justification; it covers the production of assertions as well. Therefore, Scott, as one of the first to link rhetoric to epistemology, writes that “rhetoric may be viewed not as a matter of giving effectiveness to truth but of creating truth” (Scott, 1967, p. 13). The lens of constructivist epistemology seems to be present when Scott remarks that “‘truth,’ of course, can be taken in several senses. If one takes it as prior and immutable, then one has no use for rhetoric except to address inferiors” (ibid., p. 9).

As the use of ‘data’ in Euclid’s treatises (Euclid, 1834) indicates, the concept was already used in a rhetorical sense in Euclid’s time. According to the etymology of data, it is “‘that which is given prior to argument,’ given in order to provide a rhetorical basis” (Gitelman, 2013, p. 7). The production of data in science is set in a discourse of rhetorical argumentation. Data is constructed as one of the products of scientific discourse, primarily as a vehicle of persuasion. The selection and processing of data can be tailored to support the intended purpose in an argument. If the validity of claims is attacked, their authors are required to justify those claims. “If challenged it is up to us to produce whatever data, facts, or other backing we consider to be relevant and sufficient to make good the initial claim” (Toulmin, 2003, p. 13). The production of data may be considered a specific speech act, useful in argumentation for justifying previous or forthcoming claims.

Rhetorical argument offers an alternative to analytical logic. Similarly to logic, in rhetoric, “given certain data, certain conclusions may be proven or argued to follow” (Gitelman, 2013, p. 18). Data does not belong in the framework of analytical logic, though, because it cannot be evaluated to a truth value. “When a fact is proven false, it ceases to be a fact. False data is data nonetheless.” (Gitelman, 2013, p. 18). The use of data is thus rhetorical. Rosenberg summarises the distinctive features of data by stating that “facts are ontological, evidence is epistemological, data is rhetorical” (ibid.).

Rhetoric has a bad reputation in science. The dangers of rhetoric were pointed out by Thomas Sprat in 1667, when he published his treatise on the history of the British Royal Society: “And to accomplish this, they have indeavor’d to separate the knowledge of Nature, from the colours of Rhetorick, the devices of Fancy, or the delightful deceit of Fables” (Sprat, 1667, p. 62). Historically, rhetoric is associated with deliberately manipulative uses of data, some of which are described in the previous section on data manipulation. Data manipulation, used for example to obtain grant funding, can be regarded as a kind of rhetorical argumentation. Examples of using data for rhetorical purposes can be found in propaganda infographics,2 in debates on the existence of global warming, or in pre-election surveys, the creators of which are frequently accused of intentional manipulation.

In the case of data, it is its ability to aggregate that gives it its “potential power, their rhetorical weight” (Gitelman, 2013, p. 8). Aggregation may contribute to an impression of false objectivity. An example of the rhetorical production of data is the reconceptualisation of the newspaper as a database carried out by Angelina Grimké Weld, her husband Theodore and her sister Sarah (Gitelman, 2013, p. 90). In this case, data about slavery was compiled from newspapers, for example from ads for runaway slaves. The collected data was reframed as testimony of slaveholders’ brutality, as it turned their own words against them.

Modern rhetoric has a much broader scope than manipulation or persuasion. The ontological approach mentioned by Brummett positions rhetoric as a dimension present in all epistemic activities (Brummett, 1979). A rhetorical dimension is also present in scientific data. Even though proclamations that “data is apolitical” (Peled, 2013) appear, data is never impartial, and it is necessary to take into account that it may carry a hidden rhetorical agenda. Even though practical data analysis mostly lacks a deliberate rhetorical approach (Schron, 2013), the growing amount of research in this area3 suggests that there is interest in disrupting the established understanding of data.

Conclusion

The epistemology of data deserves attention because of the fundamental status data has in contemporary science. Due to a dramatic decrease in the cost of producing large volumes of data of sufficient quality, science has adopted data as its central resource. If such power is bestowed on data, it is important not to treat data as an unquestionable concept exempt from scrutiny. Regardless of this need, the predominant epistemology of data in current science is based on the alleged pre-analytical nature of data. The popular pyramid of data - information - knowledge - wisdom puts data in first place, as the basis for the following levels of knowing. Given such a position, data is preceded solely by reality itself, of which it purports to be a direct representation.

Many critics have contributed to unveiling the weaknesses of this concept, and it is on their works that this text is built. Many of the publications cited here bring attention to the shortcomings of the positivist heritage. A growing number of authors cast doubt upon the established role of data in science. A reformulation of the epistemology of data-intensive science is being attempted by several researchers, while the first projects focused on this topic appear.4 This text has also tried to oppose the unproblematic view of data in present-day science. In accordance with Kent, the text:

“[…] projects a philosophy that life and reality are at bottom amorphous, disordered, contradictory, inconsistent, non-rational, and non-objective. Science and much of western philosophy have in the past presented us with the illusion that things are otherwise.” (Kent, 2000, p. 220)

Critical reflection on the dominant epistemology of data in western philosophy has found many holes in the uncritical, positivist approach to data. In the light of these findings, the interpretation of data as an unquestionable representation of an objectively perceivable reality does not stand the test. The alternative interpretations offered by constructivist epistemology or rhetoric appear to be more productive frames for thinking about data. Whichever path is chosen, science cannot treat data as an unproblematic input to a mathematical task; instead, it needs to subject data to questioning.

References

  • ANDERSON, Chris. The end of theory: the data deluge makes the scientific method obsolete. Wired [online]. 2008-06-23 [cit. 2013-12-23]. Available from WWW: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
  • BABBAGE, Charles. Reflections on the decline of science in England and on some of its causes. London: B. Fellowes; J. Booth, 1830. Available from WWW: https://archive.org/details/reflectionsonde00mollgoog
  • BACHELARD, Gaston. The formation of the scientific mind: a contribution to a psychoanalysis of objective knowledge. Translated by Mary MCALLESTER JONES. Manchester: Clinamen Press, 2002. ISBN 1-903083-20-6.
  • BIRD, Alexander. The epistemology of science: a bird’s-eye view. Synthese. 2010, vol. 175, no. 1 appendix, pp. 5–16. Available from WWW: http://eis.bris.ac.uk/~plajb/teaching/The_Epistemology_of_Science.pdf. DOI 10.1007/s11229-010-9740-4.
  • BOELLSTORFF, Tom. Making big data, in theory. First Monday [online]. 2013 [cit. 2013-12-28], vol. 18, no. 10. Available from WWW: http://uncommonculture.org/ojs/index.php/fm/article/view/4869/3750
  • BOGEN, James; WOODWARD, James. Saving the phenomena. The Philosophical Review. 1988, vol. 97, no. 3, pp. 303–352. Also available from WWW: http://www.pitt.edu/~rtjbog/bogen/saving.pdf
  • BOX, George E. P.; DRAPER, Norman R. Empirical model-building and response surfaces. Hoboken (NJ): Wiley, 1987. Wiley series in probability and statistics, vol. 157. ISBN 0-471-81033-9.
  • BOWKER, Geoffrey C. Memory practices in the sciences. Cambridge (MA): MIT Press, 2005, 280 p. Inside technology. ISBN 978-0-262-52489-6.
  • BOYD, Danah; CRAWFORD, Kate. Critical questions for big data. Information, Communication & Society. 2012, vol. 15, no. 5, pp. 662–679. Also available from WWW: http://dx.doi.org/10.1080/1369118X.2012.678878. DOI 10.1080/1369118X.2012.678878.
  • BRUMMETT, Barry. Three meanings of epistemic rhetoric. Speech Communication Association Convention: Seminar on Discursive Reality. San Antonio (TX): 1979. Also available from WWW: http://ap2008.wdfiles.com/local--files/selected-research-articles/Brummett1979.doc
  • CRAWFORD, Kate. The hidden biases in big data. Harvard Business Review Blog Network [online]. April 1, 2013 [cit. 2014-01-11]. Available from WWW: http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/
  • DEWEY, John. The quest for certainty: a study of the relation of knowledge and action. New York: Minton, Balch & Company, 1929. Gifford lectures. Also available from WWW: https://archive.org/details/questforcertaint032529mbp
  • DIAKOPOULOS, Nick. The rhetoric of data [online]. July 25, 2013 [cit. 2013-12-22]. Available from WWW: http://www.nickdiakopoulos.com/2013/07/25/the-rhetoric-of-data/
  • EDWARDS, Paul N. A vast machine: computer models, climate data, and the politics of global warming. Cambridge (MA): MIT Press, 2010, 552 p. ISBN 978-0-262-01392-5.
  • EDWARDS, Paul N. [et al.] (eds.). Knowledge infrastructures: intellectual frameworks and research challenges [online]. Report of a workshop sponsored by the National Science Foundation and the Sloan Foundation, University of Michigan School of Information, 25–28 May 2012. May 2013 [cit. 2013-12-22]. Available from WWW: http://hdl.handle.net/2027.42/97552
  • EUCLID. Data. In SIMSON, Robert (ed.). The elements of Euclid. Philadelphia: Desilver, Thomas & co., 1834.
  • FANELLI, Daniele. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. Public Library of Science ONE [online]. May 29, 2009 [cit. 2014-01-03], vol. 4, no. 5. Available from WWW: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0005738. DOI 10.1371/journal.pone.0005738.
  • GITELMAN, Lisa (ed.). ‘Raw data’ is an oxymoron. Cambridge (MA): MIT Press, 2013. ISBN 978-0-262-51828-4.
  • GROSS, Alan G.; HARMON, Joseph E.; REIDY, Michael. Communicating science: the scientific article from the 17th century to the present. New York (NY): Oxford University Press, 2002. ISBN 0-19-513454-0.
  • HALAVAIS, Alexander. Home made big data? Challenges and opportunities for participatory social research. First Monday [online]. 2013 [cit. 2013-12-28], vol. 18, no. 10. Available from WWW: http://uncommonculture.org/ojs/index.php/fm/article/view/4876/3754
  • HALEVY, Alon; NORVIG, Peter; PEREIRA, Fernando. The unreasonable effectiveness of data. Intelligent Systems. 2009, vol. 24, no. 2, pp. 8–12. Available from WWW: http://static.googleusercontent.com/media/research.google.com/en/pubs/archive/35179.pdf. ISSN 1541-1672. DOI 10.1109/MIS.2009.36.
  • HENNIG, Christian. Confronting data analysis with constructivist philosophy. In Classification, clustering, and data analysis: recent advances and applications, part II. Berlin; Heidelberg: Springer, 2002, pp. 235–243. ISBN 978-3-642-56181-8. DOI 10.1007/978-3-642-56181-8_26.
  • KEIL, Petr. Data-driven science is a failure of imagination [online]. January 2, 2013 [cit. 2013-12-22]. Available from WWW: http://www.petrkeil.com/?p=302
  • KELLER, Evelyn Fox. Reflections on gender and science. New Haven (MA): Yale University Press, 1985. ISBN 0-300-06595-7.
  • KENT, William. Data and reality. Bloomington (IN): 1st Books Library, 2000. ISBN 1-58500-970-9.
  • LEONELLI, Sabina. On the locality of data and claims about phenomena. Philosophy of Science. 2009, vol. 76, no. 5, pp. 737–749. Also available from WWW: https://ore.exeter.ac.uk/repository/handle/10871/9429. ISSN 0031-8248.
  • LEONELLI, Sabina. Data interpretation in the digital age. Perspectives on Science [in print]. 2014. Also available from WWW: https://ore.exeter.ac.uk/repository/handle/10036/4484. ISSN 1063-6145.
  • MAGEE, Liam. Frameworks for knowledge representation. In COPE, Bill; KALANTZIS, Mary; MAGEE, Liam (eds.). Towards a semantic web: connecting knowledge in academic research. Oxford: Chandos, 2011. ISBN 978-1-84334-601-2.
  • MARKHAM, Annette N. Undermining ‘data’: a critical examination of a core term in scientific inquiry. First Monday [online]. 2013 [cit. 2013-12-22], vol. 18, no. 10. Available from WWW: http://uncommonculture.org/ojs/index.php/fm/article/view/4868/3749. DOI 10.5210/fm.v18i10.4868.
  • MAYER-SCHÖNBERGER, Viktor; CUKIER, Kenneth. Big data: a revolution that will transform how we live, work, and think. Boston (MA): Houghton Mifflin Harcourt, 2013. ISBN 978-0-544-00269-2.
  • Misuse of statistics. Wikipedia [online]. Last modified December 19, 2013 [cit. 2014-01-12]. Available from WWW: http://en.wikipedia.org/wiki/Misuse_of_statistics
  • NELSON, Michael L. Data-driven science: a new paradigm? EDUCAUSE Review [online]. July/August 2009 [cit. 2013-12-22], vol. 44, no. 4, pp. 6–7. Available from WWW: http://www.educause.edu/ero/article/data-driven-science-new-paradigm
  • NIELSEN, Michael. Reinventing discovery: the new era of networked science. New Jersey: Princeton University Press, 2011, 273 p. ISBN 978-0-691-14890-8.
  • NUNBERG, Geoffrey. Farewell to the Information age. In NUNBERG, Geoffrey (ed.). The future of the book. Berkeley (CA): University of California Press, 1996. ISBN 0-520-20451-4.
  • PELED, Alon. The politics of big data: a three-level analysis [online]. 2013 [cit. 2014-01-06]. Available from WWW: http://ssrn.com/abstract=2315891
  • PIAGET, Jean. The psychogenesis of knowledge and its epistemological significance. In PIATTELLI-PALMARINI, Massimo (ed.). Language and learning: the debate between Jean Piaget and Noam Chomsky. Cambridge (MA): Harvard University Press, 1980. ISBN 0-674-50940-4.
  • POOVEY, Mary. A history of the modern fact: problems of knowledge in the sciences of wealth and society. 1st ed. Chicago: University of Chicago Press, 1998, 436 p. ISBN 0-226-67526-2.
  • PORTER, Theodore M. Trust in numbers: the pursuit of objectivity in science and public life. Princeton (NJ): Princeton University Press, 1995. ISBN 0-691-03776-0.
  • SAPIR, Edward. The collected works of Edward Sapir. VIII, Takelma texts and grammar. Berlin; New York: Mouton de Gruyter, 1990. Also available from WWW: https://archive.org/details/collectedworksof01sapi
  • SCHRON, Max. Data’s missing ingredient? Rhetoric [online]. April 11, 2013 [cit. 2014-01-09]. Available from WWW: http://strata.oreilly.com/2013/04/datas-missing-ingredient-rhetoric.html
  • SCOTT, Robert L. On viewing rhetoric as epistemic. Central States Speech Journal. 1967, vol. 18, no. 1, pp. 9–17. DOI 10.1080/10510976709362856.
  • SPRAT, Thomas. The history of the Royal-Society of London, for the improving of natural knowledge. [T.R.]: London, 1667, 438 p. Also available from WWW: https://archive.org/details/historyroyalsoc00martgoog
  • TOULMIN, Stephen E. The uses of argument. Cambridge: Cambridge University Press, 2003. ISBN 978-0-511-07117-1.
  • VON GLASERSFELD, Ernst. An introduction to radical constructivism. In WATZLAWICK, Paul (ed.). The invented reality. New York: Norton, 1984, pp. 17–40.

Footnotes

  1. For example, this problem concerns volatile groups, such as the group of the 1 % of the richest people, for which, due to its volatility, one cannot compare different time slices of data describing the group.

  2. As Gitelman mentions, “data visualisation amplifies the rhetorical function of data” (Gitelman, 2013, p. 12).

  3. Such as the anthology ‘Raw data’ is an oxymoron from 2013 (Gitelman, 2013).

  4. For example, http://www.datastudies.eu/.