2012-08-08

Principles of open data: use

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The group of principles governing the use of data covers the affordances expected for open data. It highlights the features of data that are deemed to be fundamental in opening data up to a variety of uses. Consequently, it warns of technological choices that may cause unintended usage limitations.

Non-proprietary data formats

Open data should use data formats over which no entity has exclusive control. Specifications of open data formats should be community-owned, free for all to read and implement, and subject to no fees, royalties, or patent rights. Public review should be a part of the decision-making process in the format’s development in order to enable participation from both implementers and users of the format. For example, the World Wide Web Consortium has an open and well-defined process for making standard data formats.
Using a proprietary data format excludes users of any platform or software whose developers are not permitted to implement support for the format. Hence, a proprietary format confines users to acquiring software from a single vendor. Data producers risk being unable to change their software supplier, and so experience vendor or product lock-in. Relying on proprietary data formats for storing data also carries the risk that the formats become obsolete. These are some of the reasons why it is important to adopt a non-proprietary format for open data. For example, unlike the spreadsheet formats of commercial vendors, comma-separated values (CSV) is a non-proprietary data format that is more suitable for open data.

Standards

Adhering to a set of common standards makes reuse easier as the data can be processed by a wide array of standards-compliant tools. Standards create expected behaviour, enable comparisons, and ultimately lead to superior interoperability. Standards, such as controlled vocabularies and common identifiers, provide better opportunities for combining disparate sources of data. Consistent use of standards leads to “informal” standards encoded as best practices.
For example, standards from the World Wide Web Consortium are appropriate for open data.

Machine readability

Machine readability is a property of the data structure. Machines parse (“read”) structures. The more machine-readable the data is, the smaller the unit that can be read. A high level of partitioning in the data structure leads to greater readability.
For instance, when machines are dealing with scanned documents saved as images in PDF files, the smallest unit they can meaningfully distinguish is the whole file, a blob of data that is opaque to them. On the other hand, when machines read HTML files, the smallest unit that can be read may be one HTML element or even one character.
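The difference in granularity can be illustrated with a minimal sketch that uses Python's standard html.parser module. The sample markup and the names ElementCollector and addressable_units are assumptions made for this illustration only; the point is that every element and every piece of text in an HTML file can be addressed separately, whereas a scanned image offers a program nothing but a sequence of bytes.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects every start tag and every non-empty text node it encounters."""

    def __init__(self):
        super().__init__()
        self.units = []

    def handle_starttag(self, tag, attrs):
        self.units.append(("element", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.units.append(("text", text))


def addressable_units(html_text):
    """Return the separately readable units found in an HTML string."""
    parser = ElementCollector()
    parser.feed(html_text)
    parser.close()
    return parser.units


# Hypothetical sample document, not taken from any real dataset.
SAMPLE_HTML = "<table><tr><td>Budget</td><td>1200</td></tr></table>"

units = addressable_units(SAMPLE_HTML)
for kind, value in units:
    print(kind, value)
```

Each printed line is a unit that software can pick out on its own, such as the single table cell holding the number.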
What is most frustrating is when public servants think it is a good thing if they transform data from a machine-readable format, such as XML, into a format that is not machine-readable, such as PDF [1, p. 27]. While users of the data can convert it from XML to PDF, they cannot convert it from PDF to XML. Tim Koelkebeck writes that “storing structured information via structureless scanning is the e-government equivalent of burning the files” [2, p. 278].
The term “machine-readable” is a bit misleading when interpreted strictly. Machines can “read” all digital information. However, some data formats leave few ways in which the data may be used. For example, binary formats, such as images or executables, do not lend themselves to uses other than display or execution, and as such they limit the possibilities of reuse. Therefore, open data should be stored in textual formats (e.g., CSV) with an explicit and standard character encoding (e.g., UTF-8).
Open data should be captured in a structured and formalized data format that enables automated processing by software. Daniel Bennett writes that “structure allows others to successfully make automated use of the data” [3]. Users should be able not only to display the data but also to perform other types of automated processing, such as full-text search, sorting, or analysis.
Open data should be valid, conforming to its format’s specification. Minor errors may be handled by the error recovery process of the user’s software; for example, web browsers are very tolerant of malformed HTML. However, in general, syntax errors increase the cost of using data, because fixing such mistakes usually involves human intervention [4]. Thus, data that contains errors severely violating the specification of its data format cannot be considered machine-readable.
Machines are users of data too, and thus providing data in a machine-readable format avoids discriminating against them. However, “most government websites weren’t designed to share data with other websites” [2, p. 205]. People view data through machines, and machines help them to process it efficiently. For example, search engines are one of the main types of data intermediaries. Therefore, it is important that search engines can access and crawl open data. Another example where machine readability is crucial is big data, since people are not able to process large volumes of data and have to pre-process them with machines first. Machine readability is also important for people who cannot read the data themselves (e.g., people with visual impairments), for whom machines such as screen readers must read it.
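A minimal sketch of what an explicit character encoding means in practice, using only Python's standard library. The sample rows and the function names to_utf8_csv and from_utf8_csv are invented for this illustration. The publisher encodes the CSV text as UTF-8, and a consumer who is told the encoding recovers the same values, while a consumer who guesses a different encoding does not.

```python
import csv
import io

# Hypothetical sample rows containing non-ASCII characters.
ROWS = [["name", "city"], ["Jiří", "Plzeň"]]


def to_utf8_csv(rows):
    """Serialize rows as CSV text and encode the text explicitly as UTF-8."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


def from_utf8_csv(data):
    """Decode UTF-8 bytes and parse them back into a list of rows."""
    text = data.decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


published = to_utf8_csv(ROWS)
round_trip = from_utf8_csv(published)

# Decoding with a wrongly guessed encoding silently garbles the text.
wrong_guess = published.decode("latin-1")

print(round_trip == ROWS)
print("Plzeň" in wrong_guess)
```

The second check shows why the encoding should be stated rather than left to guesswork: the bytes decode without an error under the wrong encoding, but the city name is no longer there.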
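The kinds of automated processing mentioned above can be sketched in a few lines once the data is structured. The column names, the figures, and the helper functions load and search below are assumptions made up for this example, and it uses only Python's standard library.

```python
import csv
import io

# Hypothetical budget data; the departments and amounts are invented.
SAMPLE = "department,amount\nEducation,300\nHealth,500\nTransport,200\n"


def load(text):
    """Parse CSV text into dictionaries and convert amounts to integers."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["amount"] = int(row["amount"])
    return rows


def search(rows, term):
    """Return rows in which any value contains the term, ignoring case."""
    term = term.lower()
    return [
        row for row in rows
        if any(term in str(value).lower() for value in row.values())
    ]


rows = load(SAMPLE)

# Sorting: order departments by amount, largest first.
by_amount = sorted(rows, key=lambda row: row["amount"], reverse=True)

# Analysis: a simple aggregate over one column.
total = sum(row["amount"] for row in rows)

# Full-text search across all values.
matches = search(rows, "health")

print(by_amount[0]["department"], total, len(matches))
```

None of these operations is possible on a scanned image of the same table without first re-entering or recognizing the values.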
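As a rough illustration, a strict parser rejects data with syntax errors outright instead of recovering from them. The sketch below checks only XML well-formedness with Python's standard xml.etree.ElementTree module, which is a narrower test than validity against a schema; the function name is_well_formed and the sample documents are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical documents: the second one closes the wrong element.
VALID = "<records><record id='1'>ok</record></records>"
BROKEN = "<records><record id='1'>ok</records>"

print(is_well_formed(VALID))
print(is_well_formed(BROKEN))
```

A publisher could run a check of this kind before release, so that the cost of finding the error does not fall on every user of the data.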
The connection between the licensed work and the terms of its licence may be made even more explicit by using a machine-readable licence statement. There are several ways to indicate a licence so that it can be recognized automatically. A widespread method is to qualify the type of the link that points to the licence (for example, with Microformats). Having the licence attached to data in a way that is meaningful to machines comes with benefits for the users, such as the ability to search for reusable photos under the terms of a particular licence.
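One common form of such a qualified link is the rel="license" attribute value used by the rel-license microformat. The following sketch, which relies only on Python's standard html.parser module, shows how software might recognize it; the class LicenceLinkFinder, the function find_licences, and the sample page are assumptions for this illustration.

```python
from html.parser import HTMLParser


class LicenceLinkFinder(HTMLParser):
    """Collects the targets of links whose rel attribute includes 'license'."""

    def __init__(self):
        super().__init__()
        self.licences = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_values and href:
            self.licences.append(href)


def find_licences(html_text):
    """Return the licence URLs declared in an HTML string."""
    finder = LicenceLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.licences


# Hypothetical page fragment with one licence link and one ordinary link.
SAMPLE_PAGE = (
    '<p>Data published under '
    '<a rel="license" href="https://creativecommons.org/licenses/by/3.0/">'
    'CC BY 3.0</a>, see also the <a href="/about">about page</a>.</p>'
)

licences = find_licences(SAMPLE_PAGE)
print(licences)
```

Only the link marked as a licence is reported, which is what lets a search service filter content by its terms of reuse.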

Safety

Open data should be published in data formats that cannot contain executable content. Such content may contain malicious code harmful to the users of the data. Textual formats, which are recommended for disclosure of open data, are safe to use. On the other hand, Microsoft Office files are not considered to be safe, since they can contain executable macros.
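As a hedged illustration of this concern, the sketch below applies one simple heuristic: macro-enabled Office Open XML files are ZIP archives that typically carry their macros in a part named vbaProject.bin. The function contains_vba_project and the helper make_archive are assumptions for this example, the check does not cover the older binary Office formats, and it is not a substitute for a real security scan.

```python
import io
import zipfile


def contains_vba_project(data):
    """Heuristic: report whether ZIP-packaged data includes a vbaProject.bin part."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        # Plain text such as CSV is not a ZIP archive at all.
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in archive.namelist()
        )


def make_archive(member_names):
    """Build a small in-memory ZIP archive with placeholder members (test data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, b"placeholder")
    return buffer.getvalue()


# Hypothetical stand-ins for real files; they are not valid Office documents.
with_macros = make_archive(["xl/workbook.xml", "xl/vbaProject.bin"])
without_macros = make_archive(["xl/workbook.xml"])
plain_csv = "id,value\n1,10\n".encode("utf-8")

print(contains_vba_project(with_macros))
print(contains_vba_project(without_macros))
print(contains_vba_project(plain_csv))
```

A textual format needs no such check, because it has no place in which executable content could be embedded and run by the reading software.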

References

  1. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  2. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  3. BENNETT, Daniel; HARVEY, Adam. Publishing open government data [online]. W3C Working Draft. September 8th, 2009 [cit. 2012-04-07]. Available from WWW: http://www.w3.org/TR/gov-data/
  4. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/
