2012-09-04

Challenges of open data: information overload

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
As more and more data is released in the open, there is a growing danger that irrelevant data might flood the data that is important [1]. Only a few of the available datasets contain “actionable” information, and there is no effective filtering mechanism for tracking them down. With open data “we have so many facts at such ready disposal that they lose their ability to nail conclusions down, because there are always other facts supporting other interpretations” [2].
The sheer volume of existing open data makes it difficult to comprehend. At such a scale, there is a need for tools that make large amounts of data intelligible. Edd Dumbill writes that “big data may be big. But if it’s not fast, it’s unintelligible” [3].
While human processing does not scale, machine processing does. The challenge of information overload thus highlights the need for machine-readable data. Big yet sufficiently structured data may be automatically pre-processed and filtered down to “small data” that people can actually work with. For example, linked data may be effectively filtered with precise SPARQL queries that harness its rich structure.
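To illustrate the point, here is a minimal sketch in Python with the rdflib library: a precise SPARQL query cuts a dataset down to the few records worth looking at. The tiny inline dataset, the ex: vocabulary, and the threshold are invented for the example; in practice the same query would run over a much larger dump.

    # A sketch of filtering "big" RDF data down to "small data" with a
    # precise SPARQL query. The tiny inline dataset and the hypothetical
    # ex: vocabulary stand in for a real, much larger dump.
    from rdflib import Graph

    DATA = """
    @prefix ex: <http://example.org/ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:contract1 ex:amount "250000"^^xsd:integer ; ex:supplier ex:acme .
    ex:contract2 ex:amount "1200"^^xsd:integer ;   ex:supplier ex:acme .
    ex:contract3 ex:amount "870000"^^xsd:integer ; ex:supplier ex:globex .
    """

    QUERY = """
    PREFIX ex: <http://example.org/ns#>
    SELECT ?contract ?amount
    WHERE {
      ?contract ex:amount ?amount .
      FILTER (?amount > 100000)        # keep only the "actionable" records
    }
    ORDER BY DESC(?amount)
    """

    graph = Graph()
    graph.parse(data=DATA, format="turtle")

    for contract, amount in graph.query(QUERY):
        print(contract, amount)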
Scaling the processing of large amounts of machine-readable data with a well-defined structure may be considered a solved problem. The current challenge is dealing with the heterogeneity of data coming from different sources.

Heterogeneity

Not only is there a perceived information overload, there is also an overload of different and incompatible ways of representing information. What we have built out of diverse data formats and modelling approaches resembles the proverbial “Tower of Babel”. In this state of affairs, the data available on the Web constitutes a high-dimensional, heterogeneous data space.
Nonetheless, managing heterogeneous data sources is where linked data excels. Linking may be considered a lightweight, pay-as-you-go approach to the integration of disparate datasets [4]. Semantic web technologies also address the intrinsic heterogeneity of data sources by providing the means to model varying levels of formality, quality, and completeness [5, p. 851].
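A minimal sketch of the pay-as-you-go approach, again with rdflib and invented URIs: two graphs describing the same city under different identifiers are merged, and a single owl:sameAs statement is the only integration step needed; no upfront schema agreement is required.

    # A sketch of lightweight, pay-as-you-go integration: two graphs that
    # describe the same city under different URIs are merged, and a single
    # owl:sameAs link ties the descriptions together. All URIs are made up.
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    DATASET_A = """
    @prefix exa: <http://data.example.org/a/> .
    @prefix propa: <http://data.example.org/a/ns#> .
    exa:prague propa:population 1275406 .
    """

    DATASET_B = """
    @prefix exb: <http://stats.example.com/b/> .
    @prefix propb: <http://stats.example.com/b/ns#> .
    exb:CZ0100 propb:areaKm2 496 .
    """

    merged = Graph()
    merged.parse(data=DATASET_A, format="turtle")
    merged.parse(data=DATASET_B, format="turtle")

    # The only integration step: state that the two URIs denote the same thing.
    merged.add((URIRef("http://data.example.org/a/prague"),
                OWL.sameAs,
                URIRef("http://stats.example.com/b/CZ0100")))

    print(merged.serialize(format="turtle"))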

Comparability

A key quality of data that suffers from heterogeneity is comparability. According to the SDMX content-oriented guidelines, comparability is defined as “the extent to which differences between statistics can be attributed to differences between the true values of the statistical characteristics” [6, p. 13]. In other words, it is the degree to which differences in data reflect differences in the measured phenomena rather than in the way the data was produced.
Improving the comparability of data hence means minimizing the unwanted interference that skews it. Influences distorting data may originate from differences in schemata, differing conceptualizations of the domains described in the data, or incompatible data handling procedures. Eliminating such influences maximizes the evidence in the data, so that it reflects the observed phenomena more directly.
The importance of comparability surfaces especially in data analysis, whose insights feed into decision support and policy making. Comparability also supports the transparency of public sector data because it gives a clearer view of public administration, and it makes audits of public sector bodies easier by allowing analysts to abstract away from how the data was collected. Incomparable data, on the other hand, corrupts the monitoring of public sector bodies, and imprecise monitoring leaves ample space for systemic inefficiencies and potential corruption.
The publication model of linked data has built-in comparability features, which come from the requirement to use common, shared standards. RDF provides a commensurate structure through the data model that linked data is required to conform to. The emphasis on reusing shared conceptualizations, such as RDF vocabularies, ontologies, and reference datasets, makes for comparable data content.
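A small sketch of what vocabulary reuse buys (the shared stats: vocabulary, the publisher URIs, and the figures are all invented): because both publishers describe population with the same property, one query returns directly comparable values from both sources without any mapping step.

    # A sketch of comparability through vocabulary reuse: two publishers
    # describe their cities with the same (hypothetical) shared property,
    # so one query yields directly comparable figures from both sources.
    from rdflib import Graph

    CITY_STATS_A = """
    @prefix stats: <http://example.org/shared-stats#> .
    @prefix a: <http://data.example.org/cities/> .
    a:brno stats:population 379466 .
    """

    CITY_STATS_B = """
    @prefix stats: <http://example.org/shared-stats#> .
    @prefix b: <http://open.example.com/places/> .
    b:ostrava stats:population 287968 .
    """

    graph = Graph()
    graph.parse(data=CITY_STATS_A, format="turtle")
    graph.parse(data=CITY_STATS_B, format="turtle")

    QUERY = """
    PREFIX stats: <http://example.org/shared-stats#>
    SELECT ?city ?population
    WHERE { ?city stats:population ?population }
    ORDER BY DESC(?population)
    """

    for city, population in graph.query(QUERY):
        print(city, population)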
In the network of linked data, the “bandwagon” effect increases the probability that a set of core reference datasets gets adopted, which further reinforces the positive feedback loop. Core reference data may be used to link other datasets and thereby enhance their value. Such datasets attract the most in-bound links, which leads to the emergence of “linking hubs”. These de facto reference datasets derive their status from their highly reusable content. An example of this type of dataset is DBpedia, which provides machine-readable data based on Wikipedia. Its prominent position at the center of the Linked Open Data Cloud illustrates the high number of datasets linking to it.
In contrast to these datasets, traditional reference sources are established through the authority of their publishers, which is reflected in policies that prescribe their use. Datasets of this type include knowledge organization systems, such as classifications or code lists, that offer shared conceptualizations of particular domains. A prototypical example of an essential reference dataset is the International System of Units, which serves as a source of shared units of measurement. Unlike the linking hubs of linked data, traditional reference datasets are, for the most part, not available in RDF and therefore not linkable.
The effect of using either kind of reference data is the same: the conceptualizations they provide offer reference concepts that make the data referring to them comparable. A trivial example is the use of the same units of measurement, which allows data to be sorted in the expected order.
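To spell the trivial example out (plain Python, made-up road lengths): values reported in mixed units sort in a misleading order, while values normalized to a single shared unit sort as expected.

    # Made-up road lengths reported by two sources in different units.
    lengths = [
        {"road": "D1", "value": 376, "unit": "km"},
        {"road": "R46", "value": 37800, "unit": "m"},
    ]

    # Sorting the raw numbers ignores the units and puts R46 first,
    # even though it is the shorter road.
    print(sorted(lengths, key=lambda r: r["value"], reverse=True))

    # Normalizing everything to metres (a single shared unit) restores
    # the expected order.
    TO_METRES = {"km": 1000, "m": 1}
    for record in lengths:
        record["metres"] = record["value"] * TO_METRES[record["unit"]]

    print(sorted(lengths, key=lambda r: r["metres"], reverse=True))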
Data might need to be converted before it can be compared with other datasets. In such cases, comparability is needed at the level of the reference data the incomparable datasets refer to. Linked data makes this possible through linking, the same technique it applies to data integration. With techniques such as ontology alignment, mappings between reference datasets may be established to serve as proxies for data comparison. Ultimately, the machine-readable relationships of linked data make it outperform other representations of data when it comes to drawing comparisons.
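A sketch of how such a mapping works as a proxy (all URIs invented): two datasets classify grants against different code lists, and a skos:exactMatch link between the code lists lets a single query compare records across both.

    # A sketch of using a mapping between two reference code lists as a
    # proxy for comparison. Dataset A and B use different codes for the
    # same concept ("education"); a skos:exactMatch link bridges them.
    from rdflib import Graph

    DATA = """
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix dsa: <http://data.example.org/a/> .
    @prefix dsb: <http://data.example.com/b/> .

    # Dataset A refers to code list A, dataset B to code list B.
    dsa:grant1 dsa:sector dsa:EDU .
    dsb:grant2 dsb:sector dsb:sector03 .

    # Alignment between the code lists (e.g. produced by ontology matching).
    dsa:EDU skos:exactMatch dsb:sector03 .
    """

    QUERY = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX dsa: <http://data.example.org/a/>
    PREFIX dsb: <http://data.example.com/b/>

    # Find grants from both datasets that fall into the same (mapped) sector.
    SELECT ?grantA ?grantB
    WHERE {
      ?grantA dsa:sector ?codeA .
      ?grantB dsb:sector ?codeB .
      ?codeA skos:exactMatch ?codeB .
    }
    """

    graph = Graph()
    graph.parse(data=DATA, format="turtle")

    for grant_a, grant_b in graph.query(QUERY):
        print(grant_a, grant_b)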

References

  1. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  2. WEINBERGER, David. Too big to know. New York (NY): Basic Books, 2012. ISBN 978-0-465-02142-0.
  3. DUMBILL, Edd (ed.). Planning for big data: a CIO’s handbook to the changing data landscape [ebook]. Sebastopol: O’Reilly, 2012, 83 p. ISBN 978-1-4493-2963-1.
  4. HEATH, Tom; BIZER, Christian. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.
  5. SHADBOLT, Nigel; O’HARA, Kieron; SALVADORES, Manuel; ALANI, Harith. eGovernment. In DOMINGUE, John; FENSEL, Dieter; HENDLER, James A. (eds.). Handbook of semantic web technologies. Berlin: Springer, 2011, p. 849–910. DOI 10.1007/978-3-540-92913-0_20.
  6. SDMX. SDMX content-oriented guidelines. Annex 1: cross-domain concepts. 2009. Also available from WWW: http://sdmx.org/wp-content/uploads/2009/01/01_sdmx_cog_annex_1_cdc_2009.pdf
