2012-10-15

How linked data improves recall via data integration

Linked data is an approach that materializes relationships between the resources described in data. It makes implicit relationships explicit, which makes them reusable. When working with linked data, integration is performed at the level of data, which offloads (some of) the integration costs from consumers onto data producers. In this post, I compare integration at the query level with integration done at the level of data, showing the limits of the first approach as contrasted with the second, demonstrated on the improvement of recall when querying the data.

All SPARQL queries featured in this post may be executed on this SPARQL endpoint.

For the purposes of this demonstration, I want to investigate public contracts issued by the capital of Prague. If I know a URI of the authority, say <http://ld.opendata.cz/resource/business-entity/00064581>, I can write a simple, naïve SPARQL query and I get to know there are 3 public contracts associated with this authority:
## Number of public contracts issued by Prague (without data integration) #
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  ?contract a pc:Contract ;
    pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
    .
}
I can get the official number of this contracting authority that was assigned to it by the Czech Statistical Office. This number is “00064581”.
## The official number of Prague #
PREFIX br: <http://purl.org/business-register#>

SELECT ?officialNumber
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
      <http://ld.opendata.cz/resource/business-entity/00064581> br:officialNumber ?officialNumber .
  }
}
Consequently, I can look up all the contracts associated with a contracting authority identified by either the previously used URI or this official number. The answer tells me there are 195 public contracts issued by this authority.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    }
  }
}
However, in some cases the official number is missing, so I might want to try the authority’s name as its identifier. Yet expanding my search by adding an option to match the contracting authority on its exact name still gives me 195 public contracts issued by this authority. In effect, in this case the recall is not improved by matching on the authority’s legal name.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName "Hlavní město Praha" .
    }
  }
}
Still, I know there might be typing errors in the name of the contracting authority. Listing the distinct legal names of the authority, of which I know either its URI or its official number, gives me 8 different spelling variants, which might indicate there are more errors in the data.
## Names that are used for Prague as a contracting authority #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT DISTINCT ?legalName
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
      OPTIONAL {
        <http://ld.opendata.cz/resource/business-entity/00064581> gr:legalName ?legalName .
      }
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" ;
        gr:legalName ?legalName .
    }
  }
}
Given the assumption that there might be unmatched instances of the same contracting authority labelled with erroneous legal names, I may want to perform an approximate, fuzzy match when searching for the authority’s contracts. Doing so gives me 717 public contracts that might be attributed to the contracting authority with a reasonable degree of certainty.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName ?legalName .
      ?legalName <bif:contains> '"Hlavní město Praha"' .
    }
  }
}
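This kind of approximate matching could instead be performed once, at the data level, when producing equivalence links between authority records, rather than inside every query. A minimal sketch in Python using the standard library’s difflib; the name variants below are illustrative, not drawn from the dataset:

```python
from difflib import SequenceMatcher

CANONICAL = "Hlavní město Praha"

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two legal names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical spelling variants of the authority's legal name.
variants = [
    "Hlavní město Praha",
    "HLAVNÍ MĚSTO PRAHA",
    "Hlavni mesto Praha",       # missing diacritics
    "Magistrát hl. m. Prahy",   # a different body; should not match
]

# Keep only variants similar enough to the canonical name. Accepted
# matches could then be materialized as equivalence links in the data,
# so consumers' queries never need to repeat the fuzzy matching.
matches = [v for v in variants if similarity(v, CANONICAL) > 0.8]
```

The threshold is a tunable trade-off between precision and recall; the point is that it is chosen and paid for once, by the data producer.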
Further integration at the query level would make the query more complex, or it would not be possible to express the integration steps within the limits of the query language at all. This approach is both laborious and computationally inefficient, since the equivalence relationships need to be reinvented and recomputed every time such a query is written and run.

In contrast, when I use a URI of the contracting authority together with its owl:sameAs links, the result is a simpler query. In this case, 232 public contracts are found. In this way the recall is improved; although it is not as high as with the query that takes the various spellings of the authority’s name into account, this difference may be attributed to the greater precision of interlinking done at the level of data compared to integration at the query level.

The following query harnesses equivalence relationships within the data. It extends the first query shown in this post: in the FROM clause, it adds a new data source to be queried (<http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>), which contains the equivalence links between URIs identifying the same contracting authorities. In addition, the Virtuoso-specific directive DEFINE input:same-as "yes" is turned on, so that owl:sameAs links are followed.
## Number of public contracts of Prague with owl:sameAs links #
DEFINE input:same-as "yes"
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>
WHERE {
  ?contract a pc:Contract ;
    pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
    .
}

2012-10-08

How the Big Clean addresses the challenges of open data

The Big Clean 2012 is a one-day conference dedicated to three principal themes: screen-scraping, data refining and data-driven journalism. These topics address some of the current challenges of open data, focusing on usability, misinterpretation of data and on the issue of making data-driven journalism work.

Usability

A key challenge of the Big Clean is refining raw data into usable data. People often fall victim to the fallacy of treating screen-scraped data as a resource that can be used directly, fed straight into visualizations or analysed to yield insights. However, the validity of data must not be taken for granted; it needs to be questioned.
Just as some raw ingredients need to be cooked to become edible, raw data needs to be preprocessed to become usable. Patchy data extracted from web pages should be refined into data that can be relied upon. Cleaning data makes it more regular, error-free and ultimately more usable.
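As a minimal sketch of what such refinement might look like in practice, here are a few typical clean-up steps applied to screen-scraped strings in Python (the example records are made up):

```python
import re
import unicodedata

def clean_name(raw: str) -> str:
    """Normalize a scraped name: drop markup leftovers, unify the
    Unicode form, and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", "", raw)          # strip stray HTML tags
    text = unicodedata.normalize("NFC", text)   # one canonical Unicode form
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

# Hypothetical scraped values exhibiting typical defects.
raw_records = ["  Hlavní   město <b>Praha</b> \n", "Hlavní město Praha"]
cleaned = {clean_name(r) for r in raw_records}
# After cleaning, both records collapse to a single, regular value.
```

Each step removes one class of irregularity, which is exactly what makes downstream analysis and deduplication reliable.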
The Big Clean will take this challenge into account in several talks. Jiří Skuhrovec will try to strike a fine balance, considering the question of how much we need to clean. Štefan Urbánek will walk the event's participants through a data processing pipeline. Apart from the invited talks, this topic will be the subject of a screen-scraping workshop led by Thomas Levine. The workshop will run in parallel with the main track of the conference.

Misinterpretation

Access to raw data allows people to take control of the interpretation of data. Effectively, people are not only taking hold of uninterpreted data, but also of the right to interpret it. This is not the case in the current state of affairs, where there is often no access to raw data, since all data is mediated through user interfaces. In such a case, the interface owners control the ways in which data may be viewed. Raw data, on the contrary, gives you the freedom to interpret data on your own. It allows you to skip the intermediaries and access data directly, instead of limiting yourself to the views provided by the interface owners.
While the loss of control over the presentation of data may be perceived as a loss of control over the meaning of the data, it is actually a call for more explicit semantics in the data. It is a call for encoding the meaning of data in a way that does not rely on its presentation.
A common excuse for not releasing data held in the public sector is the assumption that the data will be misinterpreted. As reported in Andrew Stott's OKCon 2011 talk, among civil servants there is a widespread expectation that “people will draw superficial conclusions from the data without understanding the wider picture.” First, there is no single correct interpretation of data possessed by the public sector. Instead, there are multiple valid interpretations that may coexist. Second, the fact that data is prone to incorrect interpretation may not attest to the ambiguity of the data, but to the ambiguity of its representation.
Tighter semantics may make misinterpretation less probable. As examples such as Data.gov.uk in the United Kingdom have shown, one way to encode clearer interpretation rules directly into the data is by using semantic web technologies.

Data-driven journalism

Nevertheless, in most cases public sector data is not self-describing. The data is not smart, and thus the people interpreting it need to be smart. A key group that needs to become smarter at reading the clues conveyed in data comprises journalists. Journalists should read data, not only press releases. As they become data literate, the importance of their work increases. They serve as translators, mediating understanding derived from data to the wider public. In this way, data-driven journalism contributes to the goal of making data more usable, as stories told with data are more accessible than the data itself.
Raw data opens space for different and potentially competing interpretations. This is the democratic aspect of open data. It invites participation in a shared discourse constructed around the data. A fundamental element of such a discourse is the media. Journalists using the data may contribute to this conversation by finding what is new in the data, discovering issues hidden from public oversight, or tracing the underlying systemic trends. This is the key contribution of data-driven journalism: providing diagnoses of the present society.
The principal role of data-driven journalism in the open data ecosystem will be reflected in a couple of talks given at the Big Clean. Liliana Bounegru will explain why data journalism is something you too should care about, and Caelainn Barr will showcase how EU data can be used in journalism.

Practical details

The Big Clean will be held on November 3rd, 2012, at the National Technical Library in Prague, Czech Republic. You can register by following this link. Admission to the event is free.
I hope to see many of you there.

2012-10-04

What makes open data weak will not get engineered away

Open data is still weak, but growing strong. I have written down a few fairly random points covering the areas in which open data may need to grow.

  • With the Open Government Partnership, open data is losing its edge. Open data is being assimilated into the current bureaucratic structures. It might be about time to reignite the subversive potential of open data.
  • There is no long-term commitment to open data. All activity in the domain seems to be fragmented in small projects that do not last long, nor do they share results. We need to find ways to make projects outlive their funding. Open data has an attention deficit disorder.
  • What makes open data weak and strange will not get engineered away. Better tools will not solve the inherent issues in open data, albeit they might help to grow the open data community in order to be able to solve those. Even though open data might be broken, we should not try to fix it, we should try to grow it to fix it itself.
  • People are getting lost on the way to realization of the goals of the open data movement. They fall for the distractions encountered on the way and get enchanted by the technology, a mere tool for addressing the goals of open data. People get stuck teaching others how to open and use data, while themselves not doing what they preach. People stop at community building, grasping for momentum using social media.
  • There is a legal uncertainty that makes people believe taking legal action is not possible without a lawyer holding your hand. People are careful not to breach any of their imagined implications of the law. Civil servants are afraid to release the data their institutions hold; citizens are afraid of using data to effect real-world consequences.

2012-10-03

State of open data in the Czech Republic in 2012

During the Open Knowledge Festival 2012 in Helsinki I presented a lightning-fast two minutes summary of four key things that happened with open data in the Czech Republic. Here is a brief recap of the things I mentioned.

One of the most tangible results of the open data community in the past year was the launch of a national portal called “Náš stát” (which stands for “Our state”). It provides an overview of a network of Czech projects working towards improving the Czech public sector with applications and services built on top of its data. One of its main benefits turned out to be that it started unifying disparate organizations that often work on the same issues without knowing they might be duplicating the work of others; we will see in the coming years whether it becomes the proverbial one ring to bind them all.

A Czech local chapter of the Open Knowledge Foundation was conceived and started its incubation phase. So far, we have managed to run several meetups and workshops, yet still, we have failed to involve a sufficient number of people contributing their cognitive surplus to the chapter in order to be able to sustain it in the long-term.

This year, data-driven journalism appeared in mainstream news media. Inspired by the Guardian's Datablog, a data blog was set up at iHNed.cz. The blog has become a source of data-driven stories supported by visualizations that regularly make it onto the news site's front page.

Arguably, the main thing related to open data that happened in the Czech Republic during the past year was the commitment to the Open Government Partnership. The Czech Republic has committed to an action plan in which opening government data plays a key role, encompassing the establishment of an official data catalogue and the release of core datasets, such as the company register. On the other hand, there is no money to be spent on the OGP commitments and the list of efforts to date is blank. Thus the work on the implementation of the commitments is mainly driven by NGOs, which is very much in line with the spirit of “hacking” the Open Government Partnership.

To sum up, there have been both #wins and #fails. We keep calm and carry on.