Challenges of open data: data quality

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is a prerequisite for data that can be depended upon. Yet public sector data may be mired in errors and suffer from unintentional omissions that markedly decrease its usability. For example, Michael Daconta [1] identified 10 common types of mistakes in datasets on the U.S. data portal Data.gov:
  • Omission errors violating data completeness, such as missing metadata definitions or using codes without providing the corresponding code lists
  • Formatting errors violating data consistency, i.e., syntax errors that do not meet the specifications of the employed data formats
  • Accuracy errors violating correctness, such as values breaking range limitations
  • Incorrectly labelled records violating correctness, for example, some datasets labelled as CSV even though they are just dumps from Excel files that do not meet the standards established in the specification of the CSV data format
  • Access errors referring to incorrect metadata descriptions, for example, not linking to the content described by the link’s label
  • Poorly structured data caused by improper selection of data format, using formats that are inappropriate for the expected uses of data
  • Non-normalized data violating the principle of normalization, which attempts to reduce redundancy by, e.g., removing duplicates
  • Raw database dumps violating relevance, publishing unprocessed dumps that are hard to interpret and use correctly
  • Inflation of counts, a metadata quality issue with an adverse impact on usability; for instance, when datasets pertaining to the same phenomenon are not properly grouped, they become difficult to find
  • Inconsistent data granularity violating the expected quality of metadata, such that datasets use widely varying levels of granularity without specifying them explicitly
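Several of these flaws can be detected automatically before a dataset is published. The following Python sketch (the sample data and the `check_csv` helper are my own illustration, not part of Daconta's article) flags three of them in a CSV file: empty fields (omission errors), rows with the wrong number of fields (formatting errors), and duplicate rows (non-normalized data).

```python
import csv
import io

# Hypothetical sample standing in for a published CSV dataset: it
# exhibits a missing value (omission), an inconsistent row length
# (formatting error), and a duplicate row (non-normalized data).
SAMPLE = """id,name,code
1,Alice,A1
2,,A2
3,Bob
1,Alice,A1
"""

def check_csv(text, expected_columns):
    """Return a list of (line_number, problem) findings."""
    findings = []
    seen = set()
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for n, row in enumerate(reader, start=2):
        if len(row) != expected_columns:
            findings.append((n, "formatting: wrong number of fields"))
        elif any(field.strip() == "" for field in row):
            findings.append((n, "omission: empty field"))
        if tuple(row) in seen:
            findings.append((n, "non-normalized: duplicate row"))
        seen.add(tuple(row))
    return findings

for line, problem in check_csv(SAMPLE, expected_columns=3):
    print(line, problem)
```

Checks like these are cheap to run as part of a publication workflow; more thorough validation (code lists, range limits, metadata completeness) requires knowledge of the dataset's schema.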
Linked data principles impose rigour on data that may improve its consistency and quality. At the same time, linked data is more susceptible to corruption caused by “link rot”: the issues that arise when links no longer resolve. For example, in 2006 it was found that 52 % of links from the official parliamentary record of the UK were no longer functional [2, p. 20]. The reliance on URIs makes it all the more important for linked data to adopt URIs that are stable and persistent.
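The extent of link rot can be measured by periodically dereferencing the published URIs. A minimal Python sketch (the function names and the 4xx/5xx threshold are my own choices, not taken from the cited report) that classifies each URI by its HTTP response:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def is_dead(status):
    """Treat 4xx and 5xx responses as broken links."""
    return status >= 400

def check_link(url, timeout=10):
    """Return (url, status); a status of None means the request failed outright."""
    try:
        # A HEAD request is enough to test resolvability without
        # downloading the resource itself.
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return url, resp.status
    except HTTPError as e:
        return url, e.code
    except URLError:
        return url, None

def link_rot_ratio(urls):
    """Fraction of URLs that no longer resolve."""
    results = [check_link(u) for u in urls]
    dead = [u for u, s in results if s is None or is_dead(s)]
    return len(dead) / len(urls)
```

Running `link_rot_ratio` over a corpus of published links yields the kind of figure reported in [2]; note that some servers reject HEAD requests, so a production checker would fall back to GET.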


  1. DACONTA, Michael. 10 flaws with the data on Data.gov. Federal Computer Week [online]. March 11th, 2010 [cit. 2012-04-10]. Available from WWW: http://fcw.com/articles/2010/03/11/reality-check-10-data-gov-shortcomings.aspx
  2. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
