2012-08-09

Qualities of open data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is complementary to data openness: it covers features of data that are not essential for openness but are closely related to it.

Content

A primary facet of data quality is the type of content included in data. The following group of requirements instructs producers of open data on what should be in their datasets.

Primariness

Data has traditionally been made available in finished products, such as compiled reports. However, the call for “raw data now” asks instead for disaggregated and un-interpreted data [1]. Open data should thus be made available at the earliest point when it is useful to businesses and citizens [2]. A similar principle is adopted in the open source community, embodied in the slogan “release early, release often”, which emphasizes the importance of a tight loop of gathering and applying user feedback that steers the released product towards higher quality.
Data should be collected at the source with the highest possible level of granularity to achieve maximum accuracy. It is desirable to strive for high precision, because precision reflects the depth of information encoded in data [3]. Accuracy then represents the likelihood that the information extracted from the data is correct. For example, publishers of open data should provide fine-grained data with high resolution and a high sampling rate, such as high-definition images or video.

Completeness

All public data should be made available, except direct or indirect identifiers of persons, which constitute personally identifiable information, and data that needs to be kept secret for reasons of national security. The goal of open data principles is to make the public sector, not citizens, transparent. Complete datasets should be available for bulk download, since whole datasets may be difficult or impossible to retrieve through an API.

Timeliness

Essentially, all datasets are snapshots of data streams, capturing the current state of an observed phenomenon. Accordingly, the value of data can decrease over time. For example, weather forecasts lose most of their value after the day for which they predict the weather conditions. What holds for all types of data is that their value decreases as the methods used to capture the data become obsolete.
The usefulness of data may quickly drain away as the data ages. A commercial from IBM stresses the importance of real-time data for decision making: it claims that you would not cross a road if all you had was a five-minute-old snapshot of the traffic situation. This is the case with freedom of information requests, whose procedure is too slow to obtain timely data. The long waiting periods for these requests may result in receiving out-of-date data.
With the transient nature of most data in mind, data producers should publish it as soon as possible to preserve its value, for example with live feeds for frequently updated material [4, p. 33]. Preferably, the data should be released to the public at the same time it is released for internal use. In this way, the data can help achieve real-time transparency and can be treated as a news source.

Integrity

To ensure the integrity of open data, digital signatures may be used. Signatures serve to guarantee the authenticity of data, tracing its digital provenance, and also preserve the integrity of data in the course of its transfer to the user. Publishing data over the secure HTTPS protocol may decrease the risk of tampering with data during its transmission.
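As a concrete illustration, a user can check integrity by comparing a downloaded file against a digest the publisher advertises next to it. The following minimal Python sketch assumes a hypothetical publisher that serves both the dataset and its SHA-256 digest over HTTPS; the URLs are invented for the example.

# Minimal integrity check for a downloaded dataset (hypothetical URLs).
import hashlib
import urllib.request

DATA_URL = "https://data.example.org/budget-2012.csv"             # hypothetical
CHECKSUM_URL = "https://data.example.org/budget-2012.csv.sha256"  # hypothetical

def verify_download(data_url, checksum_url):
    """Download a dataset over HTTPS and compare it to its published SHA-256 digest."""
    with urllib.request.urlopen(data_url) as response:
        payload = response.read()
    with urllib.request.urlopen(checksum_url) as response:
        # Checksum files typically contain "<digest>  <filename>"; keep the digest only.
        expected = response.read().decode("ascii").split()[0]
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected:
        raise ValueError("Checksum mismatch: the data may have been altered in transit")
    return payload

dataset = verify_download(DATA_URL, CHECKSUM_URL)
print("Verified", len(dataset), "bytes")

A checksum only detects accidental or in-transit corruption; a detached digital signature (verified, for instance, with gpg --verify dataset.csv.sig dataset.csv) additionally ties the file to the publisher's signing key and thus confirms its provenance.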

Usability

Usability is a quality of data that accounts for how well the data can be used. Open data that is highly usable has a lower cost of use. This section mentions three aspects of open data that contribute to its usability.

Presentation

A human-readable copy of data should be available to compensate for the unequal levels of ability to work with raw data. Given the differing data literacy skills among users, an effort needs to be made to provide the largest number of people with the greatest benefit from the data and to help them make “effective use” of it (as Michael Gurstein puts it in [5]). The primary format for human-readable presentation, which is recommended for open data, is HTML [6].
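For instance, a publisher that already maintains machine-readable records can derive the human-readable HTML view from them automatically. The sketch below is only illustrative and uses made-up field names and figures.

# Render hypothetical machine-readable records as a simple HTML table.
import html

records = [
    {"office": "Ministry of Finance", "budget": 1200000},
    {"office": "Ministry of Culture", "budget": 350000},
]

def to_html_table(rows):
    """Build an HTML table from a list of dictionaries with identical keys."""
    headers = list(rows[0])
    head = "".join("<th>" + html.escape(h) + "</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join("<td>" + html.escape(str(row[h])) + "</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return "<table><tr>" + head + "</tr>" + body + "</table>"

print(to_html_table(records))

Generating the presentation from the raw records, rather than maintaining it by hand, keeps the human-readable and machine-readable copies consistent.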

Clarity

Open data should communicate as clearly as possible, using plain and accurate language. Descriptions in data should be given in a neutral and unambiguous language that does not skew the interpretation of data. They should avoid jargon or technical language, unless the terminology is well-defined and adds to the clarity of data. Data should employ meaningful scales that clearly convey the differences in data. Data should not contain extraneous information and superfluous padding that might distract users from the important parts of data or confuse them.
To widen the reach of data, its descriptive metadata should use a universal language (e.g., English), while the content of the data should be language-independent. This is particularly important to improve the prospects of cross-country reuse.

Documentation

An aspect that greatly contributes to the usability of data is the availability and quality of documentation. Providing documentation is important for users because it helps them understand the data. Tim Davies makes the point that “data is also only effectively open if any code-lists and documentation necessary to interpret it (e.g., details of the units of measurement used etc.) is also made openly available” [7, p. 1]. Documentation should require only general knowledge and should not presuppose knowledge of the internal practices of the agency that produced the dataset. For example, documentation might explain how a dataset is structured and what abbreviations are used in it.
The need for explanatory descriptions of data may be demonstrated with the comma-separated values (CSV) data format. It is precisely the simple structure of CSV, without any schema description, that makes interpretation of data in this format difficult without an accompanying “codebook”, domain knowledge, and manual data inspection [8].
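To make this concrete, the following sketch contrasts a raw CSV export with the codebook a user would need to interpret it; the column names and code values are invented for the example.

# Why a codebook matters: the raw CSV alone does not say what 1, 2, or CZ010 mean.
import csv
import io

raw_csv = """person_id,sex,region,income
1,1,CZ010,32000
2,2,CZ020,28500
"""

# The accompanying codebook documents what the coded values mean.
CODEBOOK = {
    "sex": {"1": "male", "2": "female"},
    "region": {"CZ010": "Prague", "CZ020": "Central Bohemia"},
}

def decode_row(row):
    """Replace coded values with labels from the codebook, where one exists."""
    return {
        column: CODEBOOK.get(column, {}).get(value, value)
        for column, value in row.items()
    }

for row in csv.DictReader(io.StringIO(raw_csv)):
    print(decode_row(row))
# {'person_id': '1', 'sex': 'male', 'region': 'Prague', 'income': '32000'}
# {'person_id': '2', 'sex': 'female', 'region': 'Central Bohemia', 'income': '28500'}

Without the codebook and documentation of units, the coded values in the raw file cannot be interpreted reliably.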

References

  1. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls
  2. GRAVES, Antoinette. The price of everything the value of nothing. In UHLIR, Paul F. (rpt.). The socioeconomic effects of public sector information on digital networks: toward a better understanding of different access and reuse policies: workshop summary. Washington (DC): National Academies Press, 2009. Also available from WWW: http://books.nap.edu/openbook.php?record_id=12687&page=37. ISBN 0-309-13968-6.
  3. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/
  4. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  5. GURSTEIN, Michael. Open data: empowering the empowered or effective data use for everyone? First Monday [online]. February 7th, 2011 [cit. 2012-04-01], vol. 16, no. 2. Available from WWW: http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3316/2764
  6. BENNETT, Daniel; HARVEY, Adam. Publishing open government data [online]. W3C Working Draft. September 8th, 2009 [cit. 2012-04-07]. Available from WWW: http://www.w3.org/TR/gov-data/
  7. DAVIES, Tim. Linked data in international development: practical issues [online]. Draft 0.1. September 2011 [cit. 2011-11-07]. Available from WWW: http://www.timdavies.org.uk/wp-content/uploads/1-Primer-Introducing-linked-open-data.pdf
  8. LEBO, Timothy; WILLIAMS, Gregory Todd. Converting governmental datasets into linked data. In I-Semantics 2010: proceedings of the 6th International Conference on Semantic Systems, September 1 — 3, 2010, Graz, Austria. New York (NY): ACM, 2010. ISBN 978-1-4503-0014-8. DOI 10.1145/1839707.1839755.
