2012-08-23

Linked data: quality

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is not inherent in technologies; it is a result of the way technologies are used. Apart from the strict limitations of semantic web technologies and the linked data principles enforced by peer pressure, there is a body of knowledge about linked data captured in informal design patterns and best practices, embodied in resources such as Linked data patterns [1] or the Cookbook for open government linked data [2]. Among the other aspects these recommendations deal with, they propose ways in which linked data should be used to achieve the best data quality.

Content

The content facet of open data quality metrics tracks whether the content of data is primary, complete, timely, and delivered intact.

Primariness

A key principle of linked data is to ensure access to raw data. Linked data URIs are required to dereference to raw, machine-readable data as well, such as RDF serialized as RDF/XML. Besides dereferencing, linked data may offer interfaces for accessing raw data, such as SPARQL endpoints.
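As a minimal sketch of dereferencing, the following Python snippet uses the requests library to ask for a machine-readable representation via HTTP content negotiation; the DBpedia URI serves only as an illustrative, publicly dereferenceable resource.

```python
import requests

# Dereference a linked data URI and ask for raw, machine-readable data
# via content negotiation. The DBpedia URI is only an illustrative example
# of a publicly dereferenceable resource.
resource_uri = "http://dbpedia.org/resource/Prague"
response = requests.get(
    resource_uri,
    headers={"Accept": "application/rdf+xml"},  # request RDF/XML instead of HTML
)
response.raise_for_status()

print(response.headers.get("Content-Type"))  # e.g. application/rdf+xml
print(response.text[:300])                   # beginning of the raw RDF document
```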

Completeness

A common way to provide access to complete data is to publish data dumps exported from the database or triple store in the back-end. In this way, users can work with the data as a whole.
RDF offers an inclusive way of representing data of varying degrees of structure and granularity. Depending on the modelling style, RDF can capture both highly structured data and unstructured free text. Linked data improves this inclusiveness by enabling links to non-RDF content.
Linked data offers a means of materializing types of data that are, for the most part, out of the scope of other approaches to data representation. For example, it may include explicit relationships between the described resources. From this perspective, linked data may be seen as a more complete representation of a particular phenomenon.
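For illustration, the following sketch uses rdflib to parse a small, hypothetical Turtle snippet mixing structured values, unstructured free text, an explicit relationship between two resources, and a link to non-RDF content; all names under example.org are made up.

```python
from rdflib import Graph

# Hypothetical Turtle mixing structured values, free text, an explicit
# relationship between resources, and a link to non-RDF content.
turtle = """
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix ex:      <http://example.org/> .

ex:dataset-1
    dcterms:title "Public budget 2012"@en ;                     # structured literal
    dcterms:description "Free-text notes about the data."@en ;  # unstructured text
    dcterms:relation ex:dataset-2 ;                              # explicit relationship
    foaf:page <http://example.org/budget-2012.pdf> .             # link to non-RDF content
"""

graph = Graph()
graph.parse(data=turtle, format="turtle")

for subject, predicate, obj in graph:
    print(subject, predicate, obj)
```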

Timeliness

Even though the timely release of data is rather a matter of policy and human resources, the technologies employed for that task can make it easier. In particular, with highly dynamic data that goes through frequent changes, it is important to have a flexible update mechanism at hand. Updates of linked data may be automated with SPARQL 1.1 Update, which offers a very expressive method for patching data.
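As a sketch of such an automated update, the following snippet submits a SPARQL 1.1 Update request through the SPARQLWrapper library; the endpoint URL and the patched resource are hypothetical placeholders, and a real store would typically also require authentication.

```python
from SPARQLWrapper import SPARQLWrapper, POST

# Patch a single value with SPARQL 1.1 Update. The endpoint and resource
# URIs are hypothetical placeholders.
endpoint = SPARQLWrapper("http://example.org/sparql-update")
endpoint.setMethod(POST)  # updates are submitted via HTTP POST
endpoint.setQuery("""
PREFIX dcterms: <http://purl.org/dc/terms/>

DELETE { <http://example.org/dataset-1> dcterms:modified ?old }
INSERT { <http://example.org/dataset-1> dcterms:modified "2012-08-23"^^<http://www.w3.org/2001/XMLSchema#date> }
WHERE  { OPTIONAL { <http://example.org/dataset-1> dcterms:modified ?old } }
""")
endpoint.query()
```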
Timeliness is crucial in two areas that are gaining prominence: streaming sensor data and user-generated content. Research on the technological solutions for these areas is in its infancy [3]. However, there are already experiments with streaming linked data and real-time extraction from user-generated content, such as DBpedia Live, which captures updates to Wikipedia in near real time.

Integrity

The stack of semantic web technologies, which linked data builds on, includes both digital signatures and encryption as a part of the so-called Semantic Web layer cake. To ensure the content of data is not tampered with during transmission, secure HTTPS connections should be employed. An example of a semantic web technology that builds on digital signatures is WebID, which may be used to authenticate data publishers.
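One simple, complementary way to check that data arrived intact is sketched below: a data dump is downloaded over HTTPS (the requests library verifies the server certificate by default) and its SHA-256 digest is compared with a checksum published by the data provider; both the URL and the expected digest are hypothetical placeholders.

```python
import hashlib
import requests

# Download a dump over HTTPS and check it against a published checksum.
# The URL and the expected digest are hypothetical placeholders.
dump_url = "https://example.org/dumps/dataset.nt.gz"
expected_sha256 = "0123abcd..."  # value the publisher would list next to the dump

response = requests.get(dump_url)  # server certificate is verified by default
response.raise_for_status()

actual_sha256 = hashlib.sha256(response.content).hexdigest()
if actual_sha256 != expected_sha256:
    raise ValueError("Downloaded dump does not match the published checksum.")
```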

Usability

Usability may be perceived as the weakest point of linked data. In most cases, raw, disintermediated linked data is not intended for direct consumption. This is the result of the separation of concerns that linked data employs. For example, consider working with a SPARQL endpoint, which, even though it is a powerful way for applications to interact with data, may be baffling for regular users. Linked data should rather be mediated through the end-user interfaces of web applications, which present the data in a more usable and visually appealing manner. However, there are still aspects in which raw linked data excels when compared to other types of data.
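The sketch below illustrates this kind of mediation: an application queries the public DBpedia SPARQL endpoint (used here only as a readily available example) behind the scenes and exposes to the user nothing more than a readable list of labels.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query a SPARQL endpoint on behalf of the user and show only readable labels.
# The DBpedia endpoint is used purely as a publicly available example.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Prague> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])  # what an end-user page would display
```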

Presentation

Intelligible presentation of linked data should be arranged for by implementing mechanisms for dereferencing URIs, which should be able to serve a human-readable resource representation, such as HTML. However, representations of linked data resources are usually generated from generic templates in an automated fashion, which impedes custom adaptation of representations for different resource types.
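A minimal sketch of such a dereferencing mechanism, written with Flask and entirely hypothetical data, serves a Turtle representation to clients that ask for it and falls back to a generic HTML template for human visitors.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Hypothetical, hard-coded representations of a single resource; a real
# application would generate them from a triple store.
RESOURCE_TURTLE = """@prefix dcterms: <http://purl.org/dc/terms/> .
<http://example.org/dataset-1> dcterms:title "Public budget 2012"@en ."""

HTML_TEMPLATE = "<html><body><h1>{{ title }}</h1></body></html>"

@app.route("/dataset-1")
def dataset():
    accept = request.headers.get("Accept", "")
    if "text/turtle" in accept:
        # Machine clients get the raw RDF representation.
        return RESOURCE_TURTLE, 200, {"Content-Type": "text/turtle"}
    # Human visitors get a (generic) HTML rendering of the same resource.
    return render_template_string(HTML_TEMPLATE, title="Public budget 2012")
```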

Clarity

RDF has a well-defined way of conveying semantics through the use of RDF vocabularies and ontologies, the workings of which are described in the previous blog post about RDF. RDF vocabularies and ontologies make thorough data modelling feasible, which increases the fidelity and clarity with which representations of RDF resources are modelled.
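For illustration, the following rdflib sketch declares a hypothetical vocabulary property and documents it with RDFS annotations, so that consumers of data using the property can look up its intended meaning.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Declare and document a hypothetical vocabulary property with RDFS.
EX = Namespace("http://example.org/vocab#")

graph = Graph()
graph.bind("ex", EX)
graph.add((EX.budgetedAmount, RDF.type, RDF.Property))
graph.add((EX.budgetedAmount, RDFS.label, Literal("budgeted amount", lang="en")))
graph.add((EX.budgetedAmount, RDFS.comment,
           Literal("The amount of money allocated to a budget item.", lang="en")))
graph.add((EX.budgetedAmount, RDFS.domain, EX.BudgetItem))
graph.add((EX.budgetedAmount, RDFS.range, RDFS.Literal))

print(graph.serialize(format="turtle"))
```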

Documentation

Linked data is self-describing data. Since the “consumers of Linked Data do not have the luxury of talking to a database administrator who could help them understand a schema” [2], all the information necessary to interpret the data, including the RDF vocabularies and ontologies it uses, should be stored on the Web and should be retrievable via the mechanism of dereferencing, that is, by issuing HTTP GET requests and recursively following links.
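This “follow your nose” style of retrieval can be sketched with rdflib as follows: a resource description is loaded from the Web, and a few of the vocabulary terms it uses are dereferenced in turn so that their definitions become available as well; the DBpedia URI is only an illustrative example.

```python
from rdflib import Graph

# Load a resource description and dereference some of the vocabulary terms
# (predicates) it uses. The DBpedia URI is an illustrative example only.
data = Graph()
data.parse("http://dbpedia.org/resource/Prague")  # rdflib negotiates an RDF representation

vocabulary = Graph()
for predicate in list(set(data.predicates()))[:5]:  # a handful of terms, for brevity
    try:
        vocabulary.parse(str(predicate))  # dereference the term's own URI
    except Exception:
        pass  # not every term is dereferenceable in practice

print(len(vocabulary), "triples describing the vocabulary terms were retrieved")
```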
While the representations of resources should be self-documenting, there is no such requirement on the linked data URIs. URIs may be opaque since “the Web is designed so that agents communicate resource information state through representations, not identifiers” [4].

References

  1. DODDS, Leigh; DAVIS, Ian. Linked data patterns [online]. Last changed 2011-08-19 [cit. 2011-11-05]. Available from WWW: http://patterns.dataincubator.org
  2. HYLAND, Bernadette; TERRAZAS, Boris Villazón; CAPADISLI, Sarven. Cookbook for open government linked data [online]. Last modified on February 20th, 2012 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
  3. SEQUEDA, Juan F.; CORCHO, Oscar. Linked stream data: a position paper. In TAYLOR, Kerry; AYYAGARI, Arun; DE ROURE, David (eds.). Proceedings of the 2nd International Workshop on Semantic Sensor Networks, collocated with the 8th International Semantic Web Conference, Washington DC, USA, October 26th, 2009. Aachen: RWTH Aachen University, 2009, p. 148-157. CEUR workshop proceedings, vol. 552. Also available from WWW: http://oa.upm.es/5442/1/INVE_MEM_2009_64353.pdf. ISSN 1613-0073.
  4. JACOBS, Ian; WALSH, Norman (eds.). Architecture of the World Wide Web, volume 1 [online]. W3C Recommendation. December 15th, 2004 [cit. 2012-04-20]. Available from WWW: http://www.w3.org/TR/webarch/
