2012-10-15

How linked data improves recall via data integration

Linked data is an approach that materializes relationships between resources described in data. It makes implicit relationships explicit, which makes them reusable. When working with linked data, integration is performed on the level of data, which offloads some of the integration costs from data consumers onto data producers. In this post, I compare integration on the query level with integration done on the level of data, and I show the limits of the first approach relative to the second by the improvement in recall when querying the data.

All SPARQL queries featured in this post may be executed on this SPARQL endpoint.

For the purposes of this demonstration, I want to investigate public contracts issued by Prague, the Czech capital. If I know the URI of the authority, say <http://ld.opendata.cz/resource/business-entity/00064581>, I can write a simple, naïve SPARQL query and learn that there are 3 public contracts associated with this authority:
## Number of public contracts issued by Prague (without data integration) #
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  ?contract a pc:Contract ;
    pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
    .
}
I can get the official number of this contracting authority that was assigned to it by the Czech Statistical Office. This number is “00064581”.
## The official number of Prague #
PREFIX br: <http://purl.org/business-register#>

SELECT ?officialNumber
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
      <http://ld.opendata.cz/resource/business-entity/00064581> br:officialNumber ?officialNumber .
  }
}
Consequently, I can look up all the contracts associated with a contracting authority identified by either the previously used URI or this official number. The answer tells me there are 195 public contracts issued by this authority.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    }
  }
}
In some cases, however, the official number is missing, so I might want to try the authority’s name as its identifier. Yet expanding my search to also match the contracting authority on its exact legal name again gives me 195 public contracts issued by this authority. In this case, matching on the authority’s legal name does not improve recall.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName "Hlavní město Praha" .
    }
  }
}
Even so, there might be typing errors in the name of the contracting authority. Listing the distinct legal names of the authority for which I know either the URI or the official number gives me 8 different spelling variants, which suggests there may be more errors in the data.
## Names that are used for Prague as a contracting authority #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT DISTINCT ?legalName
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
      OPTIONAL {
        <http://ld.opendata.cz/resource/business-entity/00064581> gr:legalName ?legalName .
      }
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" ;
        gr:legalName ?legalName .
    }
  }
}
Assuming there might be unmatched instances of the same contracting authority labelled with erroneous legal names, I may want to perform an approximate, fuzzy match when searching for the authority’s contracts. Doing so gives me 717 public contracts that might be attributed to the contracting authority with a reasonable degree of certainty.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName ?legalName .
      ?legalName <bif:contains> '"Hlavní město Praha"' .
    }
  }
}
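To illustrate why loosening the name match raises the number of hits, and why some of those hits may be wrong, here is a minimal Python sketch. It is not how Virtuoso's bif:contains works; it is a simple stand-in that ignores case, diacritics and extra whitespace and then looks for the phrase inside the name. All names in NAMES are hypothetical toy data, not values taken from the dataset.

```python
import unicodedata

TARGET = "Hlavní město Praha"

# Hypothetical legal names as they might appear in the data (toy data only).
NAMES = [
    "Hlavní město Praha",
    "HLAVNÍ MĚSTO PRAHA",
    "Hlavni mesto Praha",
    "Hlavní  město Praha",
    "Hlavní město Praha, Městská část Praha 1",  # may be a different entity
    "Statutární město Brno",
]


def normalize(name):
    """Lower-case the name, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.lower().split())


def exact_matches(names, target):
    """Names that are character-for-character equal to the target."""
    return [name for name in names if name == target]


def loose_matches(names, phrase):
    """Names whose normalized form contains the normalized phrase."""
    needle = normalize(phrase)
    return [name for name in names if needle in normalize(name)]


print(len(exact_matches(NAMES, TARGET)))
print(len(loose_matches(NAMES, TARGET)))
```

In this toy list the loose match picks up the spelling variants that the exact match misses, but it also picks up the longer name that may denote a different entity. That is the trade-off to keep in mind: a looser match can raise recall while lowering precision.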
Further integration on the query level would either make the query more complex or could not be expressed within the limits of the query language at all. This approach is both laborious and computationally inefficient, since the equivalence relationships need to be reinvented and recomputed every time the query is created and run.

In contrast, using the URI of the contracting authority together with its owl:sameAs links results in a simpler query. In this case, 232 public contracts are found. Recall is thus improved, even though it is not as high as with the query that takes into account various spellings of the authority’s name. The difference may be attributed to the greater precision of interlinking done on the level of data compared to integration on the query level.

The following query harnesses equivalence relationships within the data. It extends the first query shown in this post. In the FROM clause, it adds a new data source to be queried (<http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>), which contains the equivalence links between URIs identifying the same contracting authorities. In addition, the Virtuoso-specific directive DEFINE input:same-as "yes" is turned on, so that owl:sameAs links are followed.
## Number of public contracts of Prague with owl:sameAs links #
DEFINE input:same-as "yes"
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>
WHERE {
  ?contract a pc:Contract ;
   pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
   .
}
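
Conceptually, following owl:sameAs links means that the URI in the query stands for its whole equivalence class. The Python sketch below shows that idea on toy data: it collects every URI reachable through symmetric, transitive sameAs links and counts the contracts issued under any of them. It is an illustration only and says nothing about how Virtuoso implements the directive. Apart from the authority URI used in this post, every URI below (the example.org ones) is hypothetical.

```python
from collections import defaultdict

AUTHORITY = "http://ld.opendata.cz/resource/business-entity/00064581"

# Hypothetical owl:sameAs links between authority URIs (toy data only).
SAME_AS = [
    (AUTHORITY, "http://example.org/authority/a"),
    ("http://example.org/authority/a", "http://example.org/authority/b"),
    ("http://example.org/authority/c", "http://example.org/authority/d"),
]

# Hypothetical contracts, mapped to the URI of their contracting authority.
CONTRACTS = {
    "http://example.org/contract/1": AUTHORITY,
    "http://example.org/contract/2": "http://example.org/authority/a",
    "http://example.org/contract/3": "http://example.org/authority/b",
    "http://example.org/contract/4": "http://example.org/authority/c",
}


def equivalence_class(uri, links):
    """Return all URIs reachable from `uri` via symmetric, transitive links."""
    neighbours = defaultdict(set)
    for left, right in links:
        neighbours[left].add(right)
        neighbours[right].add(left)
    seen = {uri}
    stack = [uri]
    while stack:
        current = stack.pop()
        for other in neighbours[current] - seen:
            seen.add(other)
            stack.append(other)
    return seen


def count_contracts(authority, contracts, links=()):
    """Count contracts issued under the authority URI or any of its aliases."""
    aliases = equivalence_class(authority, links)
    return sum(1 for issuer in contracts.values() if issuer in aliases)


print(count_contracts(AUTHORITY, CONTRACTS))           # 1, URI match only
print(count_contracts(AUTHORITY, CONTRACTS, SAME_AS))  # 3, with sameAs links
```

The point of the sketch is that the links are computed once, by whoever publishes them, and every later query can reuse them without repeating the matching logic.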
