How the Big Clean addresses the challenges of open data

The Big Clean 2012 is a one-day conference dedicated to three principal themes: screen-scraping, data refining and data-driven journalism. These topics address some of the current challenges of open data, focusing on usability, misinterpretation of data and on the issue of making data-driven journalism work.


A key challenge addressed by the Big Clean is refining raw data into usable data. People often fall victim to the fallacy of treating screen-scraped data as a resource that can be used directly, fed straight into visualizations or analysed to yield insights. However, the validity of data must not be taken for granted. It needs to be questioned.
Just as some raw ingredients need to be cooked to become edible, raw data needs to be preprocessed to become usable. Patchy data extracted from web pages should be refined into data that can be relied upon. Cleaning data makes it more regular, error-free and ultimately more usable.
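To make the idea of refining concrete, here is a minimal sketch of one cleaning step in Python. It is only an illustration, not a method presented at the conference: the function names `clean_amount` and `clean_rows`, the field names `name` and `amount`, and the assumption that a comma marks the decimal point (as in many Czech sources) are all hypothetical.

```python
import re
from typing import Dict, List, Optional


def clean_amount(raw: str) -> Optional[float]:
    """Turn a scraped amount such as ' 1 234,50 Kč ' into a float.

    Returns None when no usable number remains.
    """
    # Drop everything except digits, commas, dots and minus signs
    # (spaces, non-breaking spaces, currency labels and so on).
    digits = re.sub(r"[^\d,.\-]", "", raw)
    # Assumption: a comma is the decimal separator.
    digits = digits.replace(",", ".")
    try:
        return float(digits)
    except ValueError:
        return None


def clean_rows(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Normalise whitespace in names and keep only rows with a valid amount."""
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace left over from HTML layout.
        name = " ".join(row["name"].split())
        amount = clean_amount(row["amount"])
        if name and amount is not None:
            cleaned.append({"name": name, "amount": amount})
    return cleaned
```

Rows that cannot be repaired are dropped here for brevity. In practice you would probably want to log them, since what gets discarded is itself a finding about the quality of the source.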
The Big Clean will take up this challenge in several talks. Jiří Skuhrovec will try to strike a fine balance, considering the question of how much we need to clean. Štefan Urbánek will walk the event's participants through a data processing pipeline. Apart from the invited talks, this topic will be the subject of a screen-scraping workshop led by Thomas Levine. The workshop will run in parallel with the main track of the conference.


Access to raw data allows people to take control of the interpretation of data. Effectively, people are not only taking hold of uninterpreted data, but also of the right to interpret it. This is not the case in the current state of affairs, where there is often no access to raw data, since all data is mediated through user interfaces. In such cases, the interface owners control the ways in which data may be viewed. In contrast, raw data gives you the freedom to interpret data on your own. It allows you to skip the intermediaries and access data directly, instead of limiting yourself to the views provided by the interface owners.
While the loss of control over the presentation of data may be perceived as a loss of control over the meaning of the data, it is actually a call for more explicit semantics in the data. It is a call for encoding the meaning of data in a way that does not rely on the presentation of data.
A common excuse for not releasing data held in the public sector is the assumption that the data will be misinterpreted. As reported in Andrew Stott's OKCon 2011 talk, there is a widespread expectation among civil servants that “people will draw superficial conclusions from the data without understanding the wider picture.” There are two responses to this concern. First, there is no single correct interpretation of data possessed by the public sector. Instead, there are multiple valid interpretations that may coexist. Second, the fact that data is prone to incorrect interpretation may not attest to the ambiguity of the data, but to the ambiguity of its representation.
Tighter semantics may reduce the danger of misinterpretation. As examples such as Data.gov.uk in the United Kingdom have shown, one way to encode clearer interpretation rules directly into the data is by using semantic web technologies.

Data-driven journalism

Nevertheless, in most cases public sector data is not self-describing. The data is not smart and thus people interpreting it need to be smart. A key group that needs to become smarter at reading the clues conveyed in data consists of journalists. Journalists should read data, not only press releases. As they become data literate, the importance of their work increases. They serve as translators, mediating understanding derived from data to the wider public. In this way, data-driven journalism contributes to the goal of making data more usable, as stories told with data are more accessible than the data itself.
Raw data opens space for different and potentially competing interpretations. This is the democratic aspect of open data. It invites participation in a shared discourse constructed around the data. The media are a fundamental element of such discourse. Journalists using the data may contribute to this conversation by finding what is new in the data, discovering issues hidden from public oversight or tracing the underlying systemic trends. This is the key contribution of data-driven journalism: providing diagnoses of present-day society.
The principal role of data-driven journalism in the open data ecosystem will be reflected in a couple of talks given at the Big Clean. Liliana Bounegru will explain why data journalism is something you too should care about and Caelainn Barr will showcase how EU data can be used in journalism.

Practical details

The Big Clean will be held on November 3rd, 2012, at the National Technical Library in Prague, Czech Republic. You can register by following this link. Admission to the event is free.
I hope to see many of you there.
