Linked data: discoverability

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
If we define discoverability as the ability to get to a previously unknown URI from a known URI, then this ability depends on the in-bound links from known URIs to unknown URIs. In particular, it depends on the quantity of in-bound links, how likely users are to follow them, and the discoverability of their referring URIs.
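One way to make the three factors above concrete is a toy score. The sketch below is only an illustration, not a model taken from the thesis: it assumes that known URIs have a score of 1, that each in-bound link is followed independently with some probability, and that a URI is discovered if at least one of its in-bound links leads a user to it. The function name, the `in_links` structure, and the probabilities are all hypothetical.

```python
def discoverability(in_links, known, rounds=20):
    """Toy discoverability score between 0 and 1 for every URI.

    in_links maps a target URI to a list of (referring URI, follow probability)
    pairs. Known URIs are fixed at 1.0. For every other URI the score is the
    chance that at least one in-bound link is followed from a discovered
    referrer, assuming the links are independent (an illustrative assumption).
    """
    uris = set(in_links) | set(known)
    for links in in_links.values():
        uris.update(referrer for referrer, _ in links)

    score = {uri: (1.0 if uri in known else 0.0) for uri in uris}
    for _ in range(rounds):
        updated = {}
        for uri in uris:
            if uri in known:
                updated[uri] = 1.0
                continue
            miss = 1.0  # probability that no in-bound link leads here
            for referrer, follow_probability in in_links.get(uri, []):
                miss *= 1.0 - follow_probability * score[referrer]
            updated[uri] = 1.0 - miss
        score = updated
    return score


# Hypothetical example: "a" is known, "b" is linked from "a",
# "c" is linked from both "a" and "b", "d" has no in-bound links.
example_links = {
    "b": [("a", 0.5)],
    "c": [("a", 0.5), ("b", 0.5)],
    "d": [],
}
example_scores = discoverability(example_links, known={"a"})
```

In this toy setting "c" scores higher than "b" because it has more in-bound links, while "d" stays at zero because nothing links to it.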
Linked data fulfils the basic requirement of being linkable by using static and persistent URIs. Moreover, guidelines on URI construction for linked data recommend using human-readable URIs that are easier to communicate [1, p. 4]. To increase the interconnectedness of data, services were developed that take into account out-bound links as well, such as the PSI BackLinking Service for the Web of Data.
Dereferencing URIs serves as a way to discover more data. Self-describing resources of linked data “promote ad hoc discovery of information” [2]. The representations of resources that users obtain by dereferencing their URIs may contain links to other resources. This allows for a “follow your nose” style of exploration, in which links are traversed recursively through the Web. Because dereferencing mechanisms adhere to a standardized protocol, this type of data discovery can be automated, for example with crawlers.
The methods to improve discovery of linked data may be categorized as either passive or active. Passive approaches consist of publishing additional data that makes the published linked data easier to find. To help crawlers traverse the data, publishers may provide Semantic Sitemaps that list all the data access points. Several RDF vocabularies were devised for expressing access metadata that help in data discovery, such as the Vocabulary of Interlinked Datasets (VoID). A common solution for keeping a record of available data is to post a description of the data to a data catalogue, such as the Data Hub. The Data Catalog Vocabulary (DCAT) was created for this purpose.
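The “follow your nose” traversal described above can be sketched as a breadth-first crawl that starts from a few seed URIs, such as the access points a sitemap might list. This is a minimal sketch under stated assumptions: the `dereference` callable is a stand-in that returns the URIs linked from a resource, and the toy web of `example.org` URIs is invented for illustration. A real crawler would instead issue HTTP requests and parse the returned RDF.

```python
from collections import deque


def follow_your_nose(seeds, dereference, max_resources=100):
    """Breadth-first link traversal starting from the seed URIs.

    dereference(uri) is assumed to return an iterable of linked URIs,
    or None when the URI cannot be dereferenced.
    Returns the successfully dereferenced URIs in visiting order.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    queue = deque(unique_seeds)
    seen = set(unique_seeds)
    visited = []
    while queue and len(visited) < max_resources:
        uri = queue.popleft()
        links = dereference(uri)
        if links is None:
            continue  # nothing to learn from this URI
        visited.append(uri)
        for target in links:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical in-memory "web": each URI maps to the URIs it links to.
TOY_WEB = {
    "http://example.org/dataset": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": [],
    "http://example.org/c": ["http://example.org/dataset"],
    "http://example.org/orphan": [],
}


def toy_dereference(uri):
    """Stand-in for an HTTP lookup that returns the out-bound links."""
    return TOY_WEB.get(uri)


crawled = follow_your_nose(["http://example.org/dataset"], toy_dereference)
```

Note that `http://example.org/orphan` is never reached, because no crawled resource links to it. This is the situation that the passive and active techniques aim to avoid.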
Active techniques serve the purpose of notifying linked data consumers about the existence of data. A common way to spread information about data availability is to notify prospective consumers via the ping protocol, such as with web services like Ping the Semantic Web. Submission of data to search engines works in a similar way, such as with the form for notifying Sindice, a search engine for the semantic web.
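A ping notification of this kind often amounts to a single HTTP request that carries the URL of the newly published data. The sketch below only builds such a request URL. The endpoint `https://ping.example.org/rest/` and the `url` parameter name are hypothetical placeholders, not the documented interface of Ping the Semantic Web or Sindice.

```python
from urllib.parse import urlencode

# Hypothetical ping endpoint, used for illustration only.
PING_ENDPOINT = "https://ping.example.org/rest/"


def build_ping_url(data_url, endpoint=PING_ENDPOINT):
    """Return the URL a publisher would request to announce data_url."""
    return endpoint + "?" + urlencode({"url": data_url})


ping_url = build_ping_url("http://example.org/data.rdf")
```

The data URL is percent-encoded so that it can be passed safely as a query parameter. Sending the request is left out so that the sketch does not depend on any particular service being available.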
Linked data also ranks well in regular search engines. For example, Martin Moore reported that in 2010 linked data resources from the BBC’s Wildlife Finder appeared high in Google search results for animal names [3].


  1. Designing URI sets for the UK public sector: a report from the Public Sector Information Domain of the CTO Council’s Cross-Government Enterprise Architecture [online]. 2009 [cit. 2012-02-26]. Available from WWW: http://www.cabinetoffice.gov.uk/sites/default/files/resources/designing-URI-sets-uk-public-sector.pdf
  2. MENDELSOHN, Noah. The self-describing web [online]. W3C TAG Finding. February 7th, 2009 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2001/tag/doc/selfDescribingDocuments
  3. MOORE, Martin. 10 reasons why news organizations should use ‘linked data’. Idea Lab [online]. March 16th, 2010 [cit. 2012-04-24]. Available from WWW: http://www.pbs.org/idealab/2010/03/10-reasons-why-news-organizations-should-use-linked-data073.html
