2012-11-22

Sampling CSV headers from the Data Hub

Recently, I decided to check how useful the column headers in typical CSV files are. My hunch was that in many cases columns would be labelled ambiguously, or that the header row would be missing altogether. In such cases the data may be nearly useless, since hints on how to use it are lacking.

To support my assumptions about the typical CSV file, I needed sample data. Many such files are listed as downloadable resources in the Data Hub, one of the most extensive CKAN instances. Fortunately, CKAN exposes a friendly API. An even friendlier way for me, however, was to obtain the data through the SPARQL endpoint of the Semantic CKAN, which offers access to the Data Hub data in RDF. This is the query that I used:
PREFIX dcat:    <http://www.w3.org/ns/dcat#>
SELECT ?accessURL
WHERE {
  ?s a dcat:Distribution ;
    dcat:accessURL ?accessURL .
  FILTER (STRENDS(STR(?accessURL), "csv"))
}

I saved the query in a query.txt file and executed it against the endpoint:
curl -H "Accept:text/csv" --data-urlencode "query@query.txt" http://semantic.ckan.net/sparql > files.csv

In the command, I took advantage of the content negotiation provided by OpenLink's Virtuoso and set the HTTP Accept header to the MIME type text/csv. I made curl load the query from the query.txt file and pass it in the query parameter by using the argument "query@query.txt" (thanks to @cygri for this tip). The query results were stored in the files.csv file.
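The FILTER clause in the SPARQL query above keeps only distributions whose access URL ends in "csv". The same selection could equally be done client-side; here is a minimal Python sketch of that filter (the URLs below are invented for illustration):

```python
# Keep only URLs that end in "csv", mirroring the SPARQL FILTER (STRENDS(...)).
def csv_urls(urls):
    return [url for url in urls if url.endswith("csv")]

sample = [
    "http://example.org/data/population.csv",
    "http://example.org/data/population.json",
    "http://example.org/data/budget.csv",
]
print(csv_urls(sample))
```

Doing the filtering in the query saves bandwidth, of course, since only matching URLs leave the endpoint.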

Having a list of CSV files, I was prepared to download them. I created a directory for the files that I wanted to get and moved into it with mkdir download; cd download. To download the CSV files I executed:
tail -n+2 ../files.csv | xargs -n 1 curl -L --range 0-499 --fail --silent --show-error -O 2> fails.txt
To skip the header row containing the SPARQL results variable name, I used tail -n+2. I piped the list of CSV files to curl via xargs. I switched the -L argument on in order to follow redirects. To minimize the amount of downloaded data, I used --range 0-499 to fetch a partial response containing only the first 500 bytes from servers that support HTTP/1.1 range requests. Finally, I quieted curl's progress output with --silent, made it fail on HTTP errors with --fail, re-enabled error messages with --show-error, and redirected them to the fails.txt file.

When the CSV files were retrieved, I concatenated their first lines:
find * | xargs -n 1 head -1 | sort | perl -p -e "s/^M//g" > 1st_lines.txt
head -1 output the first line of every file that xargs passed to it. To polish the output a bit, I sorted it and removed superfluous carriage-return characters with perl -p -e "s/^M//g". Finally, I had a list of sample CSV column headers.

By inspecting the samples, I found that ambiguous column labels are indeed common, as labels such as “amount” or “id” are fairly widespread. Other labels that caught my attention included “A-in-A”, “Column 42” and the particularly mysterious “X”. Disambiguating such column names would be difficult without additional contextual information, such as examples of data from the columns or supplementary documentation. Such data could be hard to use, especially for automated processing.
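To quantify how widespread a given label is, the collected header lines can be tallied. A minimal Python sketch, assuming the headers are comma-separated strings like those in 1st_lines.txt (the sample lines below are made up):

```python
from collections import Counter

# Count how often each column label appears across the sampled header rows.
def label_counts(header_lines):
    counts = Counter()
    for line in header_lines:
        for label in line.split(","):
            counts[label.strip().lower()] += 1
    return counts

sample_headers = [
    "id,amount,date",
    "id,name,amount",
    "region,id,X",
]
print(label_counts(sample_headers).most_common(2))
```

A real tally would also need to skip files whose first line is data rather than a header, which is precisely the problem discussed above.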

2012-10-15

How linked data improves recall via data integration

Linked data is an approach that materializes relationships between the resources described in data. It makes implicit relationships explicit, and thereby reusable. With linked data, integration is performed at the level of data, which offloads (some of) the integration costs from consumers onto data producers. In this post, I compare integration at the query level with integration at the data level, showing the limits of the first approach as contrasted with the second, demonstrated on the improvement of recall when querying the data.

All SPARQL queries featured in this post may be executed on this SPARQL endpoint.

For the purposes of this demonstration, I want to investigate public contracts issued by the capital of Prague. If I know a URI of the authority, say <http://ld.opendata.cz/resource/business-entity/00064581>, I can write a simple, naïve SPARQL query and learn that there are 3 public contracts associated with this authority:
## Number of public contracts issued by Prague (without data integration) #
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  ?contract a pc:Contract ;
    pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
    .
}
I can also get the official number of this contracting authority, assigned to it by the Czech Statistical Office. This number is “00064581”.
## The official number of Prague #
PREFIX br: <http://purl.org/business-register#>

SELECT ?officialNumber
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
      <http://ld.opendata.cz/resource/business-entity/00064581> br:officialNumber ?officialNumber .
  }
}
Consequently, I can look up all the contracts associated with a contracting authority identified either by the previously used URI or by this official number. The answer tells me there are 195 public contracts issued by this authority.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    }
  }
}
However, in some cases the official number is missing, so I might want to try the authority’s name as its identifier. Yet expanding my search with an option to match the contracting authority by its exact name still gives me 195 public contracts. In effect, in this case the recall is not improved by matching on the authority’s legal name.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName "Hlavní město Praha" .
    }
  }
}
Still, I know there might be typing errors in the name of the contracting authority. Listing the distinct legal names of the authority whose URI or official number I know gives me 8 different spelling variants, which might indicate there are further errors in the data.
## Names that are used for Prague as a contracting authority #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT DISTINCT ?legalName
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
      OPTIONAL {
        <http://ld.opendata.cz/resource/business-entity/00064581> gr:legalName ?legalName .
      }
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" ;
        gr:legalName ?legalName .
    }
  }
}
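One way to reduce such spelling variants before matching is to normalize the names, for instance by case-folding, collapsing whitespace, and stripping diacritics. A rough Python sketch of this idea (the variants below are invented for illustration, not taken from the dataset):

```python
import unicodedata

# Normalize a legal name: collapse whitespace, casefold, strip diacritics.
def normalize_name(name):
    name = " ".join(name.split()).casefold()
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

variants = [
    "Hlavní město Praha",
    "HLAVNÍ MĚSTO PRAHA",
    "Hlavni mesto  Praha",
]
print({normalize_name(v) for v in variants})
```

Normalization of this kind catches only trivial variants; genuine typos still call for the fuzzy matching used below.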
Given the assumption that there might be unmatched instances of the same contracting authority labelled with erroneous legal names, I may want to perform an approximate, fuzzy match when searching for the authority’s contracts. Doing so gives me 717 public contracts that might be attributed to the contracting authority with a reasonable degree of certainty.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName ?legalName .
      ?legalName <bif:contains> '"Hlavní město Praha"' .
    }
  }
}
Further integration at the query level would make the query even more complex, or the integration steps could not be expressed within the limits of the query language at all. This approach is both laborious and computationally inefficient, since the equivalence relationships need to be reinvented and recomputed every time the query is written and run.
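Integration at the level of data amounts to computing such equivalences once and materializing them, instead of re-deriving them in every query. As a toy illustration, owl:sameAs pairs can be grouped into equivalence classes with a union-find structure (the URIs below are invented):

```python
# Group owl:sameAs pairs into equivalence classes with a simple union-find.
def same_as_classes(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for node in parent:
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())

links = [
    ("ex:prague-1", "ex:prague-2"),
    ("ex:prague-2", "ex:prague-3"),
    ("ex:brno-1", "ex:brno-2"),
]
print(same_as_classes(links))
```

The resulting classes can then be stored alongside the data, which is in essence what the beSameAs graph used below provides.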

By contrast, when I use the URI of the contracting authority together with its owl:sameAs links, the query becomes simpler. In this case, 232 public contracts are found. Recall is improved; although it is not as high as with the query that takes various spellings of the authority’s name into account, this difference may be attributed to the greater precision of interlinking done at the level of data rather than at the query level.

The following query harnesses the equivalence relationships within the data. It extends the first query shown in this post. In the FROM clause, it adds a new data source to be queried (<http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>), which contains the equivalence links between URIs identifying the same contracting authorities. In addition, the Virtuoso-specific directive DEFINE input:same-as "yes" is enabled, so that owl:sameAs links are followed.
## Number of public contracts of Prague with owl:sameAs links #
DEFINE input:same-as "yes"
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>
WHERE {
  ?contract a pc:Contract ;
   pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
   .
}

2012-10-08

How the Big Clean addresses the challenges of open data

The Big Clean 2012 is a one-day conference dedicated to three principal themes: screen-scraping, data refining and data-driven journalism. These topics address some of the current challenges of open data, focusing on usability, misinterpretation of data and on the issue of making data-driven journalism work.

Usability

A key challenge of the Big Clean is refining raw data into usable data. People often fall victim to the fallacy of treating screen-scraped data as a resource that can be used directly, fed straight into visualizations or analysed to yield insights. However, the validity of data must not be taken for granted; it needs to be questioned.
Just as some raw ingredients need to be cooked to become edible, raw data needs to be preprocessed to become usable. Patchy data extracted from web pages should be refined into data that can be relied upon. Cleaning data makes it more regular, error-free and ultimately more usable.
The Big Clean will take this challenge into account in several talks. Jiří Skuhrovec will try to strike a fine balance, considering the question of how much we need to clean. Štefan Urbánek will walk the event's participants through a data processing pipeline. Apart from the invited talks, the topic will also be the subject of a screen-scraping workshop led by Thomas Levine, which will run in parallel with the main track of the conference.

Misinterpretation

Access to raw data allows people to take control of the interpretation of data. Effectively, people take hold not only of uninterpreted data but also of the right to interpret it. This is not the case in the current state of affairs, where there is often no access to raw data, since all data is mediated through user interfaces. In such a case, the interface owners control the ways in which data may be viewed. Raw data, on the contrary, gives you the freedom to interpret data on your own. It allows you to skip the intermediaries and access data directly, instead of limiting yourself to the views provided by the interface owners.
While the loss of control over the presentation of data may be perceived as a loss of control over its meaning, it is actually a call for more explicit semantics in the data. It is a call for encoding the meaning in data in a way that does not rely on the presentation of data.
A common excuse for not releasing data held in the public sector is the assumption that the data will be misinterpreted. As reported in Andrew Stott's OKCon 2011 talk, there is a widespread expectation among civil servants that “people will draw superficial conclusions from the data without understanding the wider picture.” First, there is no single correct interpretation of data possessed by the public sector. Instead, there are multiple valid interpretations that may coexist. Second, the fact that data is prone to incorrect interpretation may not attest to the ambiguity of the data, but to the ambiguity of its representation.
Tighter semantics may make the danger of misinterpretation less probable. As examples such as Data.gov.uk in the United Kingdom have shown, one way to encode clearer interpretation rules directly into the data is by using semantic web technologies.

Data-driven journalism

Nevertheless, in most cases public sector data is not self-describing. The data is not smart, and thus the people interpreting it need to be smart. A key group that needs to become smarter at reading the clues conveyed in data comprises journalists. Journalists should read data, not only press releases. As they become data literate, the importance of their work increases. They serve as translators, mediating the understanding derived from data to the wider public. In this way, data-driven journalism contributes to the goal of making data more usable, as stories told with data are more accessible than the data itself.
Raw data opens space for different and potentially competing interpretations. This is the democratic aspect of open data. It invites participation in a shared discourse constructed around the data. A fundamental element of such discourse is the media. Journalists using the data may contribute to this conversation by finding what is new in the data, discovering issues hidden from public oversight, or tracing the underlying systemic trends. This is the key contribution of data-driven journalism: providing diagnoses of the present society.
The principal role of data-driven journalism in the open data ecosystem will be reflected in a couple of talks given at the Big Clean. Liliana Bounegru will explain why data journalism is something you too should care about, and Caelainn Barr will showcase how EU data can be used in journalism.

Practical details

The Big Clean will be held on November 3rd, 2012, at the National Technical Library in Prague, Czech Republic. You can register by following this link. The admission to the event is free.
I hope to see many of you there.

2012-10-04

What makes open data weak will not get engineered away

Open data is still weak but growing strong. I have written down a few fairly random points covering the weak spots in which open data may need to grow.

  • With the Open Government Partnership, open data is losing its edge. Open data is being assimilated into the current bureaucratic structures. It might be about time to reignite the subversive potential of open data.
  • There is no long-term commitment to open data. All activity in the domain seems to be fragmented in small projects that do not last long, nor do they share results. We need to find ways to make projects outlive their funding. Open data has an attention deficit disorder.
  • What makes open data weak and strange will not get engineered away. Better tools will not solve the inherent issues in open data, albeit they might help to grow the open data community in order to be able to solve those. Even though open data might be broken, we should not try to fix it, we should try to grow it to fix it itself.
  • People are getting lost on the way to realization of the goals of the open data movement. They fall for the distractions encountered on the way and get enchanted by the technology, a mere tool for addressing the goals of open data. People get stuck teaching others how to open and use data, while themselves not doing what they preach. People stop at community building, grasping for momentum using social media.
  • There is a legal uncertainty making people believe that taking legal action is not possible without a lawyer holding your hand. People are careful not to breach what they imagine the law implies. Civil servants are afraid to release the data their institutions hold; citizens are afraid of using data to effect real-world consequences.

2012-10-03

State of open data in the Czech Republic in 2012

During the Open Knowledge Festival 2012 in Helsinki I presented a lightning-fast two-minute summary of four key things that happened with open data in the Czech Republic. Here is a brief recap of the things I mentioned.

One of the most tangible results of the open data community in the past year was the launch of a national portal called “Náš stát” (which stands for “Our state”). It provides an overview of a network of Czech projects working towards improving the Czech public sector with applications and services built on top of its data. One of its main benefits turned out to be that it started unifying disparate organizations that often work on the same issues without knowing they might be duplicating the work of others; we will see in the coming years whether it becomes the proverbial one ring to bind them all.

A Czech local chapter of the Open Knowledge Foundation was conceived and started its incubation phase. So far, we have managed to run several meetups and workshops, yet still, we have failed to involve a sufficient number of people contributing their cognitive surplus to the chapter in order to be able to sustain it in the long-term.

This year, data-driven journalism appeared in mainstream news media. Inspired by the Guardian's Datablog, a data blog was set up at iHNed.cz. It has become a source of data-driven stories supported by visualizations that regularly make it onto the news site's front page.

Arguably, the main open data development in the Czech Republic during the past year was the commitment to the Open Government Partnership. The Czech Republic has committed to an action plan in which opening government data plays a key role, encompassing the establishment of an official data catalogue and the release of core datasets, such as the company register. On the other hand, there is no money to be spent on the OGP commitments, and the list of efforts to date is blank. Thus the work on implementing the commitments is mainly driven by NGOs, which is very much in line with the spirit of “hacking” the Open Government Partnership.

To sum up, there have been both #wins and #fails. We keep calm and carry on.

2012-09-12

Challenges of open data: summary

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data creates opportunities that may end up being missed if the challenges associated with them are left unaddressed. The previous blog posts raised some of the questions the open data “movement” would have to face and resolve in order not to lose these opportunities and restore the faith in the transformative potential of open data.
The open data agenda is biased by its prevailing focus on the supply side of open data and its neglect of the demand side that gets to use the data. A significant part of the challenges associated with open data stems from a narrow-minded view of open data as a technology-triggered change that can be engineered. Although open data brings a change in which technology plays a fundamental role, it is important to recognize its side effects and the issues that cannot be solved by better engineering.
It is comfortable to abstract away from these issues. So far, the challenges of open data have in most cases been temporarily bypassed. While the essential features of open data are described thoroughly, its impact is left mostly unexplored. In fact, open data advocates frequently substitute their expectations for the effects of this relatively new phenomenon. The full implications of open data still need to be worked out. The blog posts about the challenges associated with open data can thus be read as an outline of some of the areas in which further research may be conducted and case studies may be commissioned.

2012-09-11

Challenges of open data: procured data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The public sector is not only considered to be unable to deliver applications in a cost-efficient way, it may also lack the abilities to collect some data. There are several kinds of data, including geospatial surveys, that are difficult to gather using the means available in the public sector. The solution that public bodies adopt for such cases is to outsource data collection to private companies. Using the standard procedures of public procurement, the public bodies contract a provider to produce the requested data.
The challenge appears when commercial data suppliers recognize the value of the procured data and become aware of the possibilities for reuse that might generate revenue for them. Hence the suppliers offer the data under the terms of licences that prevent public sector bodies from sharing the data with the public, since releasing it as open data would hamper the suppliers’ prospects of reselling it. Should the public sector require a licence that allows the procured data to be opened, it would markedly increase the contract price.
Privatisation of the collection of public sector data might be a way to achieve better efficiency [1], yet without a significant investment it prohibits releasing the data as open data. It leaves open the question of whether public sector bodies should buy expensive data in order to share it with others, or whether the infrastructure of the public sector should be enhanced to cater for the acquisition of data that would otherwise be difficult to collect.
Note: The topic of public sector data obtained through public procurement is the subject of a previous blog post.

References

  1. YIU, Chris. A right to data: fulfilling the promise of open public data in the UK [online]. Research note. March 6th, 2012 [cit. 2012-03-06]. Available from WWW: http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk