2012-09-12

Challenges of open data: summary

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data creates opportunities that may end up being missed if the associated challenges are left unaddressed. The previous blog posts raised some of the questions the open data “movement” will have to face and resolve in order not to lose these opportunities and to sustain faith in the transformative potential of open data.
The open data agenda is biased by its prevailing focus on the supply side of open data and its neglect of the demand side that actually uses the data. A significant part of the challenges associated with open data stems from a narrow view of open data as a technology-triggered change that can simply be engineered. Although open data brings a change in which technology plays a fundamental role, it is important to recognize its side effects and the issues that cannot be solved by better engineering.
It is comfortable to abstract away from the issues at hand, and so far the challenges of open data have in most cases been temporarily bypassed. While the essential features of open data are described thoroughly, its impact is left mostly unexplored. In fact, open data advocates frequently substitute their expectations for the effects of this relatively new phenomenon. The full implications of open data still need to be worked out. The blog posts about the challenges associated with open data can thus be read as an outline of some of the areas in which further research may be conducted and case studies may be commissioned.

2012-09-11

Challenges of open data: procured data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The public sector is not only considered unable to deliver applications in a cost-efficient way, it may also lack the ability to collect some data. There are several kinds of data, including geospatial surveys, that are difficult to gather using the means available to the public sector. The solution that public bodies adopt in such cases is to outsource data collection to private companies. Using the standard procedures of public procurement, public bodies contract a provider to produce the requested data.
The challenge appears when commercial data suppliers recognize the value of the procured data and become aware of the possibilities for its reuse that might generate revenue for them. The suppliers therefore offer the data under the terms of licences that prevent public sector bodies from sharing the data with the public, since releasing it as open data would hamper the suppliers’ prospects of reselling it. Should the public sector require a licence that allows it to open the procured data, the contract price would increase markedly.
Privatisation of the collection of public sector data might be a way to achieve better efficiency [1], yet without a significant investment it prohibits releasing the data as open data. This leaves open the question of whether public sector bodies should buy in expensive data in order to share it with others, or whether the infrastructure of the public sector should be enhanced to cater for the acquisition of data that would be difficult to collect without such improvements.
Note: The topic of public sector data obtained through public procurement is the subject of a previous blog post.

References

  1. YIU, Chris. A right to data: fulfilling the promise of open public data in the UK [online]. Research note. March 6th, 2012 [cit. 2012-03-06]. Available from WWW: http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk

2012-09-10

Challenges of open data: trust

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Transparency brought about by the adoption of open data affects trust in the public sector. Current governments experience a crisis of legitimacy [1, p. 58] and lack the trust of citizens. The improved visibility of the workings of public sector bodies, established by open access to their proceedings, makes it possible to track their actions in detail and improves the trust citizens place in these bodies. Nevertheless, the release of open data may reveal many failings of public sector bodies, which may produce a temporary disillusionment, distrust in government, and loss of interest in politics [2].
The initial assumption of most open data advocates is that the data produced in the public sector may be relied on. However, public sector data cannot be treated as a neutral and uncontested resource. “Unaudited, unverified statistics abound in government data, particularly when outside parties - local government agencies, federal lobbyists, campaign committees - collect the data and turn it over to the government” [1, p. 261]. False data may be fabricated to provide an alibi for corrupt behaviour. For instance, Nithya Raman draws attention to an Indian dataset on urban planning that lists non-existent public toilets, so that the spending supposedly going to their maintenance may be justified [3]. Another example that demonstrates how false data is contained within public sector data is the exposure of errors in subsidies awarded under the EU Common Agricultural Policy. The data showed that the oldest recipients of these funds, coming from Sweden, were 100 years old, even though both were dead [4, p. 85].
In the light of such facts, it is important to acknowledge that “public confidence in the veracity of government-published information is critical to Open Government Data take-off, essential to spurring demand and use of public datasets” [5]. If the data is regarded as manipulated instead of being recognized as trustworthy, the impact of open data will be significantly diminished.

References

  1. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. RAMAN, Nithya V. Collecting data in Chennai city and the limits of openness. Journal of Community Informatics [online]. 2012 [cit. 2012-04-12], vol. 8, no. 2. Available from WWW: http://ci-journal.net/index.php/ciej/article/view/877/908. ISSN 1712-4441.
  4. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  5. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls

2012-09-09

Challenges of open data: data quality

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is a prerequisite for data that may be depended upon. Yet public sector data may be mired in errors and suffer from unintentional omissions that markedly decrease its usability. For example, Michael Daconta [1] identified 10 common types of mistakes in datasets on the U.S. data portal Data.gov.
  • Omission errors violating data completeness, missing metadata definitions, using codes without providing code lists
  • Formatting errors violating data consistency, syntax errors not fulfilling the requirements of the employed data formats’ specifications
  • Accuracy errors violating correctness, errors breaking range limitations
  • Incorrectly labelled records violating correctness, for example, datasets misnamed as CSV even though they are just dumps from Excel files that do not meet the standards established in the specification of the CSV data format
  • Access errors referring to incorrect metadata descriptions, for example, not linking to the content described by the link’s label
  • Poorly structured data caused by improper selection of data format, using formats that are inappropriate for the expected uses of the data
  • Non-normalized data violating the principle of normalization, which attempts to reduce redundant data by, e.g., removing duplicates
  • Raw database dumps violating relevance, providing raw database dumps that are hard to interpret and use correctly
  • Inflation of counts, a metadata quality issue with an adverse impact on usability, for instance, when datasets pertaining to the same phenomena are not properly grouped and are thus difficult to find
  • Inconsistent data granularity violating the expected quality of metadata, such that datasets use widely varying levels of data granularity without specifying them explicitly
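Some of the error types above lend themselves to automated detection. The following sketch (not Daconta’s method; field names, required fields, and range limits are all illustrative assumptions) checks records for two of them: omission errors and out-of-range accuracy errors.

```python
# A minimal, illustrative validator for two of the error types listed above:
# omission errors (missing required fields) and accuracy errors (values
# outside an allowed range). All field names and limits are invented.

def validate_records(records, required_fields, ranges):
    """Return a list of (row_index, error_message) tuples."""
    errors = []
    for i, record in enumerate(records):
        # Omission check: every required field must be present and non-empty.
        for field in required_fields:
            if not record.get(field):
                errors.append((i, "missing field: %s" % field))
        # Range check: numeric values must fall within declared limits.
        for field, (low, high) in ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                errors.append((i, "out of range: %s=%s" % (field, value)))
    return errors

records = [
    {"name": "Dataset A", "year": 2010},
    {"name": "", "year": 2150},          # empty name, implausible year
]
print(validate_records(records, ["name"], {"year": (1900, 2012)}))
```

Checks like these catch only mechanical defects; errors of content, such as the fabricated records discussed in the previous post, still require human auditing.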
Linked data principles impose a rigour on data that may improve its consistency and quality. At the same time, linked data is more susceptible to corruption caused by “link rot” and the issues that arise when links no longer resolve. For example, in 2006 it was found that 52 % of links in the official parliamentary record of the UK were not functional [2, p. 20]. The reliance on URIs makes it even more important for linked data to adopt URIs that are stable and persistent.
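A link rot figure like the 52 % above can be measured with a simple audit. The sketch below extracts URLs from a text and computes the share that fail to resolve; a real audit would issue HTTP requests (e.g. via urllib.request), so the `resolve` callback here stands in for that check as an assumption to keep the example self-contained.

```python
# A minimal sketch of measuring "link rot" in a document. The `resolve`
# argument stands in for an actual HTTP check; the example URLs are invented.

import re

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def link_rot_rate(text, resolve):
    """Extract URLs from `text` and return the fraction that fail `resolve`."""
    urls = URL_PATTERN.findall(text)
    if not urls:
        return 0.0
    dead = [u for u in urls if not resolve(u)]
    return len(dead) / len(urls)

# Toy check: one of the two links is dead, i.e. a 50 % rot rate.
alive = {"http://example.org/ok"}
text = "See http://example.org/ok and http://example.org/gone for details."
print(link_rot_rate(text, lambda url: url in alive))
```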

References

  1. DACONTA, Michael. 10 flaws with the data on Data.gov. Federal Computer Week [online]. March 11th, 2010 [cit. 2012-04-10]. Available from WWW: http://fcw.com/articles/2010/03/11/reality-check-10-data-gov-shortcomings.aspx
  2. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf

2012-09-08

Challenges of open data: privacy

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
In the pursuit of their public task, public sector bodies collect personal data as well. Such data does not fall under the scope of open data. The principles of open data explicitly exclude personal data from being released and assume it will be kept closed in well-secured databases.
A complaint often heard with regard to privacy is that the public sector collects more personal information than the minimum it needs. An example where public data collection posed a potential privacy breach comes from Finland [1]. A Finnish travel system logged every instance when a travel card was scanned by a reader machine on different public transport lines. Since travel cards can be traced to individual persons, the travel system thereby held location data for a large number of people, which was perceived as a violation of privacy. Ultimately, based on data protection legislation, the travel card data ceased to be collected.
However, in most cases personal data is not collected at an excessive rate and is governed by an access regime strictly limited to authorized users from the public sector to prevent accidental leaks of private data. In line with this observation, Marco Fioretti notes that privacy in open data has almost always been a non-issue [2].
Nonetheless, a new privacy risk is being recognized in the danger of statistical re-identification. This privacy threat arises from the availability of large amounts of machine-readable data that contains indirect personal identifiers, together with the technologies that allow such data to be combined.
So far, privacy has been guaranteed by “practical obscurity” [3, p. 867]. It existed chiefly due to the difficulty of obtaining and combining data. In many cases, personal data was not recorded at all. Under such conditions, the right to privacy was akin to the right to be forgotten [2]. However, this assumption loses ground when confronted with the ever-increasing amount of data that is currently being recorded and stored.
Data anonymization based on the removal of direct identifiers, such as identity card numbers, is insufficient on its own. A subject may be identified and linked to sensitive information through a combination of indirect identifiers [4, p. 8]. An indirect identifier is a data item that narrows down the set of persons who might be described by the data; an example of an indirect identifier that works this way is gender. When enough indirect identifiers are combined, they may narrow down the set of subjects they might identify to a single person.
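The narrowing-down effect can be made concrete with a toy example. In the sketch below (all records and attribute names are invented), each indirect identifier alone leaves several candidates, but their intersection singles out one person.

```python
# A minimal sketch of statistical re-identification: each indirect identifier
# (gender, postcode, birth year) only narrows down a set of candidates, but
# their combination can single out one person. The population is invented.

population = [
    {"name": "Alice", "gender": "F", "postcode": "110 00", "birth_year": 1970},
    {"name": "Bob",   "gender": "M", "postcode": "110 00", "birth_year": 1970},
    {"name": "Carol", "gender": "F", "postcode": "120 00", "birth_year": 1970},
    {"name": "Dana",  "gender": "F", "postcode": "110 00", "birth_year": 1985},
]

def candidates(quasi_identifiers):
    """People matching every indirect identifier in a released record."""
    return [p for p in population
            if all(p[k] == v for k, v in quasi_identifiers.items())]

# Each identifier alone is ambiguous (three women in this population)...
print(len(candidates({"gender": "F"})))
# ...but combined, the identifiers pinpoint a single person.
print([p["name"] for p in candidates(
    {"gender": "F", "postcode": "110 00", "birth_year": 1970})])
```

With a realistically large population the same logic applies; the attack only needs enough indirect identifiers to shrink the candidate set to one.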
There are established techniques for protecting personal privacy in data by limiting the risks of re-identification through statistical methods. Chris Yiu lists several of them, most of which have an adverse impact on data quality and openness [5, p. 26].
  • Access and query control, e.g., filtering and limiting size of query results to samples
  • Anonymisation, or deidentification, such as stripping personal information from data
  • Obfuscation, that may, for example, reduce precision in data by replacing values with ranges
  • Perturbation, introducing random errors into data
  • Pseudonymisation, including replacing persons’ names with identifiers
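Two of the listed techniques can be sketched in a few lines. The snippet below is an illustrative assumption, not a recommended production scheme: field names, the salt, and the 10-year bucket size are all invented for the example.

```python
# A hedged sketch of two techniques from the list above: pseudonymisation
# (replacing names with opaque identifiers) and obfuscation (replacing exact
# ages with ranges). All names and parameters are illustrative.

import hashlib

def pseudonymise(name, salt="secret-salt"):
    # Replace a name with a stable, opaque identifier. A real deployment
    # would keep the salt secret and consider keyed hashing (e.g. HMAC).
    return hashlib.sha256((salt + name).encode()).hexdigest()[:8]

def obfuscate_age(age, bucket=10):
    # Reduce precision: report a range instead of an exact value.
    low = (age // bucket) * bucket
    return "%d-%d" % (low, low + bucket - 1)

record = {"name": "Jane Doe", "age": 34}
released = {"person_id": pseudonymise(record["name"]),
            "age_range": obfuscate_age(record["age"])}
print(released["age_range"])  # → 30-39
```

Note how both transformations trade data quality for privacy: the pseudonym destroys linkability to outside sources, and the age range destroys precision, which is exactly the adverse impact Yiu points out.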
Fortunately, both direct and indirect personal identifiers are rare in public sector data. Most of the data tracked by the public sector consists of non-identifiers. Moreover, the data is usually available in aggregated forms and not as microdata that results directly from data collection. Therefore, in most cases, data quality and openness do not need to be compromised due to the requirements of privacy protection.

References

  1. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010 — 2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. SHADBOLT, Nigel; O’HARA, Kieron; SALVADORES, Manuel; ALANI, Harith. eGovernment. In DOMINGUE, John; FENSEL, Dieter; HENDLER, James A. (eds.). Handbook of semantic web technologies. Berlin: Springer, 2011, p. 849 — 910. DOI 10.1007/978-3-540-92913-0_20.
  4. YAKOWITZ, Jane. Tragedy of the data commons. Harvard Journal of Law & Technology. Fall 2011, vol. 25, no. 1. Also available from WWW: http://ssrn.com/abstract=1789749
  5. YIU, Chris. A right to data: fulfilling the promise of open public data in the UK [online]. Research note. March 6th, 2012 [cit. 2012-03-06]. Available from WWW: http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk

2012-09-07

Challenges of open data: misinterpretation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Another argument pointing to the potential risks in the disclosure of public data was presented by Lawrence Lessig in an article titled Against transparency [1], in which he draws attention to the adverse effects of misinterpretation of public data. He highlights the issues that arise when the monopoly on interpretation is removed and members of the public are provided with raw, uninterpreted data [2, p. 2]. Disintermediation causes a decontextualization of public sector data that may lead to highly divergent interpretations of the same data [3]. Such a change may be perceived as a loss of the control civil servants used to have. Instead of an “official” interpretation of open data, this would potentially lead to a plurality of “competing” and possibly conflicting interpretations, some of which may be driven by malicious interests.
Lessig claims, with respect to the allegedly shortening attention spans of members of the public, that it is easier to come up with an incorrect judgement based on public data than with one based on solid understanding [1]. The ability to interpret data correctly is largely confined to people with sufficient expertise and data literacy skills. Moreover, Archon Fung and David Weil argue that the way open data is disclosed is conducive to a pessimistic view of the public sector. They claim that “the systems of open government that we’re building - structures that facilitate citizens’ social and political judgments - are much more disposed to seeing the glass of government as half or even one-quarter empty, rather than mostly full” [4, p. 107]. Such conditions may also make users of data susceptible to apophenia, the phenomenon of seeing patterns that do not actually exist [5, p. 2]. In fact, Lessig writes, confronted with the vast amounts of available public data, ignorance is a rational investment of attention [1]. Without a significant investment of time and data literacy skills, people will usually come to shallow and premature conclusions based on their examination of public data. Unfounded conclusions may be quickly adopted and spread by the media, which may cause significant harm to the reputation of public sector bodies, civil servants, or politicians until these assertions are re-examined and proven false. For example, unverified oversimplifications may be drawn from public data to support political campaigns. Open data can be misused for skewed interpretations supporting political actions, casting suspicion on the public image of politicians targeted by discreditation campaigns.
Misinterpretations may increase distrust in the public sector. Thus, Lessig makes the case for disclosing only a limited amount of the public data prone to misinterpretation [Ibid.]. Even though he does not completely oppose transparency initiatives, he warns that careful consideration should be given to releasing sensitive information that may be misused for defamation.
Unrestricted access to the communication channels provided by new media gives a strong voice to all competing interpretations, unhindered by the filtering mechanisms of traditional publishing. This state of affairs allows unfounded claims and rumours to amplify and spread with an impact that was previously impossible to achieve, causing harm to personal reputations and the public image of government. Fortunately, the self-repairing properties of communication networks eventually lead to the rebuttal of misinformation. The openness of public data thus brings not only greater control of the public sector, but indirectly also better control of unproven claims.

References

  1. LESSIG, Lawrence. Against transparency: the perils of openness in government. The New Republic [online]. October 9th, 2009 [cit. 2012-03-29]. Available from WWW: http://www.tnr.com/article/books-and-arts/against-transparency
  2. DAVIES, Tim. Open data, democracy and public sector reform: a look at open government data use from data.gov.uk [online]. Based on an MSc Dissertation submitted for examination in Social Science of the Internet, University of Oxford. August 2010 [cit. 2012-03-09]. Available from WWW: http://www.opendataimpacts.net/report/wp-content/uploads/2010/08/How-is-open-government-data-being-used-in-practice.pdf
  3. KAPLAN, Daniel. Open public data: then what? Part 1 [online]. January 28th, 2011 [cit. 2012-04-10]. Available from WWW: http://blog.okfn.org/2011/01/28/open-public-data-then-what-part-1/
  4. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  5. BOYD, Danah; CRAWFORD, Kate. Six provocations for big data. In Proceedings of A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 21 — 24 September 2011, University of Oxford. Oxford (UK): Oxford University, 2011. Also available from WWW: http://ssrn.com/abstract=1926431

2012-09-06

Challenges of open data: data literacy

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Even though open data bridges the data divide between the public sector and members of the public, it might be introducing a new data divide separating those who have the resources to make use of the data from those who do not. Despite the fact that open data virtually eliminates the cost of data acquisition, the cost of use remains “sufficiently high to compromise the political impact of open data” [1, p. 11].
An oft-cited quote attributed to Francis Bacon claims that “knowledge is power”. If data is a source of knowledge, then opening it up creates a shift in access to a source of power. However, equal access to data implies neither equal use nor equal empowerment, as transforming data into power requires more than access. Leaving aside the concerns of unequal access addressed by the agenda of the digital divide, while the principles of open data lead to the removal of barriers to access, they do not remove all barriers to use. In this respect, it is vitally important to distinguish between the “opportunity” and the actual “realization” of use of open data [2]. Even though everyone may have an equal opportunity to access and use open data, only some are able to achieve “effective use” [Ibid.]. In the light of this assertion, open data empowers only the already empowered: those who have access to the technologies and computer skills necessary to make use of the data.
The belief in the transformative potential of open data is based on optimistic assumptions about citizens’ data literacy. The technocratic perspective with which open data principles are drafted takes the high level of skill necessary for working with data for granted. Thus, open data initiatives are in a way exclusive, as they are limited mostly to technically inclined citizens [3, p. 268].
The minimalist role of the public sector, withdrawn into the background to serve as a platform, proceeds from the supposition that members of society have all the ingredients necessary to make effective use of open government data, such as a high level of information-processing capability [4]. Even though ICT penetration and internet connectivity may be sufficient to access open data, they are not enough to make use of it. What is also needed are the abilities to process and interpret the data. However, open data released in a raw form may not be easily digestible without substantial proficiency in data processing. It should therefore not be underestimated that users are required to possess technical expertise to process the data.
The bottom line is that access to data may in fact increase asymmetry in society. If all interest groups have equal access to public sector information, then we can expect the better organized and better equipped groups to make better use of it [5]. The asymmetry may stem from the fact that the interest groups able to take advantage of the newly released information will prosper at the expense of the groups that cannot.
On the other hand, this type of inequality is in a sense natural. Such a state of affairs should not be considered final, but rather a starting point. David Eaves compares the challenge of increasing data literacy to increasing literacy in libraries and reminds us that “we didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate” [6]. In the same way, we do not publish open data expecting everyone to be able to use it. The data is released because access is a necessary prerequisite for use. Direct access to data by the empowered, technically skilled infomediaries may become the basis of indirect access for many more [7]. From this perspective, the most effective uses of open data can be thought of as those that let others make effective use of the data.

References

  1. MCCLEAN, Tom. Not with a bang but with a whimper: the politics of accountability and open data in the UK. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-19]. Also available from WWW: http://ssrn.com/abstract=1899790
  2. GURSTEIN, Michael. Open data: empowering the empowered or effective data use for everyone? First Monday [online]. February 7th, 2011 [cit. 2012-04-01], vol. 16, no. 2. Available from WWW: http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3316/2764
  3. BERTOT, John C.; JAEGER, Paul T.; GRIMES, Justin M. Using ICTs to create a culture of transparency: e-government and social media as openness and anti-corruption tools for societies. Government Information Quarterly. July 2010, vol. 27, iss. 3, p. 264 — 271. DOI 10.1016/j.giq.2010.03.001.
  4. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls
  5. SHIRKY, Clay. Open House thoughts, Open Senate direction. In Open House Project [online]. November 23rd, 2008 [cit. 2012-04-19]. Available from WWW: http://groups.google.com/group/openhouseproject/msg/53867cab80ed4be9
  6. EAVES, David. Learning from libraries: the literacy challenge of open data [online]. June 10th, 2010 [cit. 2012-04-11]. Available from WWW: http://eaves.ca/2010/06/10/learning-from-libraries-the-literacy-challenge-of-open-data/
  7. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/

2012-09-05

Challenges of open data: usability

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
If usability is considered a property of interfaces, raw data provides a difficult one. Data is largely too unwieldy to be used by most people. For example, 50 % of the respondents in Socrata’s open data study said that the data was unusable [1]. Poor usability may also be correlated with the low level of use most open data sources receive.
The requirements on the usability of open data reviewed in a previous blog post prove to be difficult to satisfy. The usability barrier may be especially high when dealing with linked open data, as reported in the previous post about the usability of linked data. Yet it is important not to let the low usability of the underlying technologies compromise the generative potential of open data.
The challenge of usability requires data producers to refocus on a user-centric perspective. The following blog posts highlight the increased need for data literacy, which is necessary for interacting with open data, and warn of the dangers of incorrect interpretations drawn from data.

References

  1. Socrata. 2010 open government data benchmark study [online]. Version 1.4. Last updated January 4th, 2011 [cit. 2012-04-07]. Available from WWW:

2012-09-04

Challenges of open data: information overload

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
As more and more data is released in the open, there is a growing danger that irrelevant data might flood the data that is important [1]. Only a few of the available datasets contain “actionable” information, and there is no effective filtering mechanism to track them down. With open data, “we have so many facts at such ready disposal that they lose their ability to nail conclusions down, because there are always other facts supporting other interpretations” [2].
The sheer volume of the existing open data makes it difficult to comprehend. At such scale there is a need for tools that make the large amounts of data intelligible. Edd Dumbill writes that “big data may be big. But if it’s not fast, it’s unintelligible” [3].
While human processing does not scale, machine processing does. Thus, the challenge of information overload highlights the need for machine-readable data. Big, yet sufficiently structured data may be automatically pre-processed and filtered to “small data” that people can manage to work with. For example, linked data may be effectively filtered with precise SPARQL queries harnessing its rich structure.
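The filtering step can be made concrete without a triplestore. The sketch below renders a basic graph pattern, of the kind a SPARQL query would send to an endpoint, as plain Python over RDF-style triples; all dataset names and properties are invented for the example.

```python
# A minimal sketch of how structure enables filtering "big" data down to
# "small" data: a basic graph pattern matched over RDF-style triples, much
# as a SPARQL query would do against an endpoint. The triples are invented.

triples = [
    ("dataset:budget2011", "dc:publisher", "org:ministry-of-finance"),
    ("dataset:budget2011", "dc:issued", "2011"),
    ("dataset:census",     "dc:publisher", "org:statistical-office"),
    ("dataset:spending",   "dc:publisher", "org:ministry-of-finance"),
]

def match(pattern):
    """Return triples matching the pattern; None acts like a SPARQL variable."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "SELECT ?s WHERE { ?s dc:publisher org:ministry-of-finance }" rendered
# as a pattern with None in the variable position:
subjects = [s for s, _, _ in match((None, "dc:publisher",
                                    "org:ministry-of-finance"))]
print(subjects)
```

Because every triple shares the same subject-predicate-object structure, the same one-line matcher filters any dataset; this uniformity is what lets machine processing scale where human reading cannot.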
Scaling the processing of large amounts of machine-readable data with well-defined structure may be considered solved. However, the current challenge is to deal with the heterogeneity of data from different sources.

Heterogeneity

Not only is there a perceived information overload, there is also an overload of different and incompatible ways of representing information. What we have built out of different data formats and modelling approaches seems to be the proverbial “Tower of Babel”. In this state of affairs, the data available on the Web constitutes a high-dimensional, heterogeneous data space.
Nonetheless, it is in managing heterogeneous data sources that linked data excels. Linking may be considered a lightweight, pay-as-you-go approach to the integration of disparate datasets [4]. Semantic web technologies also address the intrinsic heterogeneity of data sources by providing means to model varying levels of formality, quality, and completeness [5, p. 851].

Comparability

A key quality of data that suffers from heterogeneity is comparability. According to the SDMX content-oriented guidelines, comparability is defined as “the extent to which differences between statistics can be attributed to differences between the true values of the statistical characteristics” [6, p. 13]. In other words, it is the extent to which the differences in data can be attributed to differences in the measured phenomena.
Improving the comparability of data hence means minimizing the unwanted interferences that skew the data. Influences leading to distortion of data may originate from differences in schemata, differing conceptualizations of the domains described in the data, or incompatible data handling procedures. Eliminating such influences maximizes the evidence in the data, which then reflects the observed phenomena more directly.
The importance of comparability surfaces especially in data analysis tasks. Insights yielded by analyses then feed into decision support and policy making. Comparability also supports the transparency of public sector data because it clears the view of public administration. It enables easier audits of public sector bodies thanks to the possibility of abstracting from the ways the data was collected. Incomparable data, on the other hand, corrupts the monitoring of public sector bodies, and imprecise monitoring thus leaves ample space for systemic inefficiencies and potential corruption.
The publication model of linked data has comparability features built in, which come from the requirement to use common, shared standards. RDF provides a commensurate structure through its data model, to which linked data is required to conform. The emphasis on the reuse of shared conceptualizations, such as RDF vocabularies, ontologies, and reference datasets, provides for comparable data content.
In the network of linked data, the “bandwagon” effect increases the probability of adoption of a set of core reference datasets, which further reinforces the positive feedback loop. Core reference data may be used to link other datasets to enhance their value. Such datasets attract the most in-bound links, which leads to the emergence of “linking hubs”. These de facto reference datasets derive their status from their highly reusable content. An example of this type of dataset is DBpedia, which provides machine-readable data based on Wikipedia. Its standing is illustrated by the Linked Open Data Cloud, in the centre of which it is prominently positioned, indicating the high number of datasets linking to it.
In contrast to these datasets, traditional reference sources are established through the authority of their publishers, which is reflected in policies that prescribe the use of such datasets. Datasets of this type include knowledge organization systems, such as classifications or code lists, that offer shared conceptualizations of particular domains. A prototypical example of an essential reference dataset is the International System of Units, a source of shared units of measurement. In contrast with the linking hubs of linked data, traditional reference datasets are, for the most part, not available in RDF and are therefore not linkable.
The effect of using both kinds of reference data is the same. The conceptualizations they construct offer reference concepts that make the data referring to them comparable. A trivial example that illustrates this point is the use of the same units of measurement, which makes it possible to sort data in the expected order.
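To make the units example concrete: once measurements are converted to a shared unit, ordinary sorting yields the expected order. A minimal sketch in Python, with hypothetical records and standard conversion factors:

```python
# Lengths reported in different units are not directly comparable as numbers.
measurements = [("road A", 1.2, "km"), ("road B", 800, "m"), ("road C", 0.5, "mi")]

TO_METRES = {"m": 1.0, "km": 1000.0, "mi": 1609.344}  # shared reference unit: metre

def in_metres(name, value, unit):
    # Normalise a measurement to the shared unit of measurement.
    return name, value * TO_METRES[unit]

# After normalisation the records sort in the expected order of length.
comparable = sorted((in_metres(*m) for m in measurements), key=lambda r: r[1])
print([name for name, _ in comparable])  # shortest to longest
```

Sorting the raw values (1.2, 800, 0.5) would give a meaningless order; sorting the normalised values (1200, 800, 804.672 metres) gives the expected one.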
Data might need to be converted prior to comparison with other datasets. In this case, comparability is needed at the level of the data the incomparable datasets refer to. Linked data makes this possible through linking, the same technique it applies to data integration. With techniques such as ontology alignment, mappings between reference datasets may be established to serve as proxies for the purpose of data comparison. Ultimately, the machine-readable relationships in linked data make it outperform other ways of representing data when it comes to the ability to draw comparisons.
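The role of mappings as proxies for comparison can be sketched as follows; the triples, the two code lists, and the skos:exactMatch-style mapping are all hypothetical:

```python
# Two datasets classify the same concept against different code lists.
dataset_a = [("contract/1", "sector", "codelistA:education")]
dataset_b = [("contract/2", "sector", "codelistB:EDU")]

# An alignment between the code lists, e.g. established via ontology alignment.
mapping = {"codelistB:EDU": "codelistA:education"}

def normalise(triples, mapping):
    # Rewrite object values to the canonical code list where a mapping exists.
    return [(s, p, mapping.get(o, o)) for s, p, o in triples]

# After normalisation, both datasets refer to the same concept and are comparable.
a, b = normalise(dataset_a, mapping), normalise(dataset_b, mapping)
print(a[0][2] == b[0][2])
```

The mapping plays the part of the machine-readable links between reference datasets: it never changes what either dataset says, only makes their references commensurable.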

References

  1. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  2. WEINBERGER, David. Too big to know. New York (NY): Basic Books, 2012. ISBN 978-0-465-02142-0.
  3. DUMBILL, Edd (ed.). Planning for big data: a CIO’s handbook to the changing data landscape [ebook]. Sebastopol: O’Reilly, 2012, 83 p. ISBN 978-1-4493-2963-1.
  4. HEATH, Tom; BIZER, Chris. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.
  5. SHADBOLT, Nigel; O’HARA, Kieron; SALVADORES, Manuel; ALANI, Harith. eGovernment. In DOMINGUE, John; FENSEL, Dieter; HENDLER, James A. (eds.). Handbook of semantic web technologies. Berlin: Springer, 2011, p. 849 — 910. DOI 10.1007/978-3-540-92913-0_20.
  6. SDMX. SDMX content-oriented guidelines. Annex 1: cross-domain concepts. 2009. Also available from WWW: http://sdmx.org/wp-content/uploads/2009/01/01_sdmx_cog_annex_1_cdc_2009.pdf

2012-09-03

Challenges of open data: implementation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data publishers may perceive the adoption of linked open data to have daunting entry barriers. In particular, they are aware of the high demands on expertise for publishing linked data, which is deemed to have a steep learning curve. The linked data publishing model poses requirements that may seem difficult to meet; the Frequently Observed Problems on the Web of Data [1] testify to that.
Therefore, “it is vital to follow a realistic, practical and inexpensive approach” [2]. Fortunately, linked data facilitates incremental, evolutionary information management. Its deployment may follow a step-by-step approach, adopting iterative development for continuous improvement. For example, before switching database technology, linked data publishers could start by caching their existing legacy databases in triple stores. Another way to cushion the demands of linked data adoption is to minimise the ontological commitment by creating small ontologies that may be gradually linked together.
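The incremental first step, exposing rows of a legacy database as triples without changing the underlying technology, might look roughly as follows; the table, the URI namespace, and the vocabulary are hypothetical:

```python
# Rows from a hypothetical legacy relational table.
rows = [
    {"id": 1, "name": "Office for Data", "city": "Prague"},
    {"id": 2, "name": "Civic Registry", "city": "Brno"},
]

BASE = "http://example.org/body/"  # assumed URI namespace for subjects
PREDICATES = {"name": "http://example.org/vocab/name",
              "city": "http://example.org/vocab/city"}

def rows_to_triples(rows):
    # The primary key becomes the subject URI; each column maps to a predicate.
    triples = []
    for row in rows:
        subject = BASE + str(row["id"])
        for column, predicate in PREDICATES.items():
            triples.append((subject, predicate, row[column]))
    return triples

for triple in rows_to_triples(rows):
    print(triple)
```

The resulting triples could be loaded into a triple store that mirrors the legacy database, leaving the source system untouched while linked data access is trialled.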
Two implementation challenges associated with the adoption of linked open data in the public sector will be dealt with in detail: resistance to change in the public sector and the maturity of the linked data technology stack.

Resistance to change

The rhetoric of open data supporters puts an emphasis on bureaucracy as a major barrier to opening data in the public sector. There is a tendency to frame the politics of access to data as a struggle between the public sector, which has an inbred attachment to secrecy, and members of the public, who are depicted as individuals rather than groups [3, p. 7].
While this view seems biased, institutional inertia may indeed pose a challenge to the adoption of open data, which may require a “cultural change in the public sector” [4]. The transition from the status quo may be significantly hindered by the established culture in public administration. “A major impediment is an entrenched closed culture in many government organisations as a result of the fear of disclosing government failures and provoking political escalation and public outcry” [5]. The intangible problem of the closed mindset prevailing in the public sector proves difficult to resolve. And so, in many ways, the adoption of open data “isn’t a hardware retirement issue, it’s an employee retirement one” [6].
Resistance to change is not the only barrier hindering the adoption of open data. A hurdle commonly encountered by open data advocates is that civil servants perceive open data as an additional workload that lacks clear justification [7, p. 70]. Unlike citizens, who are allowed to do everything that is not prohibited, public servants are allowed to do only what laws and policies order them to do. Voluntary adoption of open data at the lower levels of public administration is thus highly unlikely; it requires a policy to push open data through.
However, it might be the existing policies themselves that make the change difficult. In general, the public sector is subject to particular obstacles that impede the adoption of new technologies. For example, the combination of strict data handling procedures and the constricted possibilities of a limited budget may effectively stop any technological change [7]. That is why there must be a strong commitment to open data at the upper levels of the public sector in order to push through the necessary amendments to existing data handling policies.

Technology maturity

Semantic web technologies underlying linked data were for a long time thought of as not ready for adoption in enterprise settings and in the public sector. In 2010, the linked data technology stack was not perceived to be ready for large-scale adoption in the public sector. John Sheridan reported three key things missing [8]:
  • Repeatable design patterns
  • Supportive tools
  • Commoditization of linked data APIs
At that time, the standards were mature enough, but their translation into repeatable design patterns applicable in practice was lacking. This has since changed. Several sources recommend established design patterns (e.g., [9], [10], [11]), supportive tools have been developed and packaged (e.g., the LOD2 Stack), and frameworks for developing custom APIs based on linked data have been created (e.g., the Linked Data API mentioned in a previous blog post). Linked data has matured progressively in recent years, and so it may be argued that it is ready to be implemented in the public sector.

References

  1. HOGAN, Aidan; CYGANIAK, Richard. Frequently observed problems on the web of data [online]. Version 0.3. November 13th, 2009 [cit. 2012-04-23]. Available from WWW: http://pedantic-web.org/fops.html
  2. ALANI, Harith; CHANDLER, Peter; HALL, Wendy; O’HARA, Kieron; SHADBOLT, Nigel; SZOMSZOR, Martin. Building a pragmatic semantic web. IEEE Intelligent Systems. May—June 2008, vol. 23, iss. 3, p. 61 — 68. Also available from WWW: http://eprints.soton.ac.uk/265787/1/alani-IEEEIS08.pdf. ISSN 1541-1672. DOI 10.1109/MIS.2008.42.
  3. MCCLEAN, Tom. Not with a bang but with a whimper: the politics of accountability and open data in the UK. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-19]. Also available from WWW: http://ssrn.com/abstract=1899790
  4. GRAY, Jonathan. The best way to get value from data is to give it away. Guardian Datablog [online]. December 13th, 2011 [cit. 2011-12-14]. Available from WWW: http://www.guardian.co.uk/world/datablog/2011/dec/13/eu-open-government-data
  5. VAN DEN BROEK, Tijs; KOTTERINK, Bas; HUIJBOOM, Noor; HOFMAN, Wout; VAN GRIEKEN, Stefan. Open data need a vision of smart government. In Share-PSI Workshop: Removing the Roadblocks to a Pan-European Market for Public Sector Information Re-use [online]. 2011 [cit. 2012-03-09]. Available from WWW: http://share-psi.eu/submitted-papers/
  6. DUMBILL, Edd (ed.). Planning for big data: a CIO’s handbook to the changing data landscape [ebook]. Sebastopol: O’Reilly, 2012, 83 p. ISBN 978-1-4493-2963-1.
  7. HALONEN, Antti. Being open about data: analysis of the UK open data policies and applicability of open data [online]. Report. London: Finnish Institute, 2012 [cit. 2012-04-05]. Available from WWW: http://www.finnish-institute.org.uk/images/stories/pdf2012/being%20open%20about%20data.pdf
  8. ACAR, Suzanne; ALONSO, José M.; NOVAK, Kevin (eds.). Improving access to government through better use of the Web [online]. W3C Interest Group Note. May 12th, 2009 [cit. 2012-04-06]. Available from WWW: http://www.w3.org/TR/egov-improving/
  9. SHERIDAN, John; TENNISON, Jeni. Linking UK government data. In BIZER, Christian; HEATH, Tom; BERNERS-LEE, Tim; HAUSENBLAS, Michael (eds.). Linked Data on the Web: proceedings of the WWW 2010 Workshop on Linked Data on the Web, April 27th, 2010, Raleigh, USA. Aachen: RWTH Aachen University, 2010. CEUR workshop proceedings, vol. 628. ISSN 1613-0073.
  10. DODDS, Leigh; DAVIS, Ian. Linked data patterns [online]. Last changed 2011-08-19 [cit. 2011-11-05]. Available from WWW: http://patterns.dataincubator.org
  11. HEATH, Tom; BIZER, Chris. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.
  12. HYLAND, Bernardette; TERRAZAS, Boris Villazón; CAPADISLI, Sarven. Cookbook for open government linked data [online]. Last modified on February 20th, 2012 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook

2012-09-02

Challenges of open data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data not only opens new opportunities, it also opens new challenges. These challenges point to the limits of openness and to shortcomings of the approaches used to put linked open data in practice in the public sector.
The top 10 barriers and potential risks for the adoption of open data in the public sector, compiled by Noor Huijboom and Tijs van den Broek [1, p. 7], comprise the following:
  • closed government culture
  • privacy legislation
  • limited quality of data
  • limited user-friendliness/information overload
  • lack of standardisation of open data policy
  • security threats
  • existing charging models
  • uncertain economic impact
  • digital divide
  • network overload
Some of these challenges will be discussed in detail in the following blog posts. In particular, they will cover the difficulties that may be encountered during the implementation of linked open data, information overload and the problems of scalable processing of large, heterogeneous datasets, the usability of raw data, issues in the protection of personal data, deficiencies in data quality, adverse effects of open data on trust in the public sector, and finally the unresolved question of opening data obtained via public procurement.

References

  1. HUIJBOOM, Noor; VAN DEN BROEK, Tijs. Open data: an international comparison of strategies. European Journal of ePractice [online]. March/April 2011 [cit. 2012-04-30], no. 12. Available from WWW: http://www.epractice.eu/files/European%20Journal%20epractice%20Volume%2012_1.pdf. ISSN 1988-625X.

2012-09-01

Impacts of open data: journalism

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The availability of data and data processing tools gives birth to a new paradigm in journalism, commonly referred to as data-driven journalism: the practice of basing journalistic articles on hard data, which allows journalists to back up their claims with well-founded evidence.
Whereas data-driven journalism rests on data, unverified claims abound in traditional journalistic practice. To address this deficiency, data-driven journalism may employ open data sources to cross-verify claims. Data triangulation, combining disparate sources, may establish the validity of the claims under verification.
If data-driven journalists strive to draw closer to objectivity, they need to share their sources to achieve transparency. Sharing the underlying data is an imperative of data-driven journalism, so that others can see what led to the insights conveyed in articles. In the light of such transparency, claims made by journalists may be verified by third parties and trust may be established.
The best-known examples of data-driven journalism include the Guardian’s Datablog and ProPublica.