2012-08-07

Principles of open data: accessibility

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The open data principles covered in this blog post answer the question of how one can obtain the data. Access is important because it necessarily precedes reuse. Making data accessible can be thought of as the next step after making it legally open.

Discoverability

To access a dataset, you first need to discover it. Knowing that the data actually exists is a necessary prerequisite to data access [1]. Users of open data should be able to find out where the data is and where its parts are distributed. Essentially, discoverability is the ability to get from a known URI to a previously unknown URI, which may be used to retrieve the data. There are two main approaches to making data discoverable: the URI known to a user may be that of a data catalogue or that of a search engine.
Discoverability is the reason why data should be equipped with a thin layer of commonly agreed descriptive metadata [2, p. 8], such as in a data catalogue or, more broadly, an information asset register [3]. A data catalogue may form a single access interface to data. For instance, PublicData.eu is an unofficial data catalogue of Europe’s public data. The European Commission plans to launch an official pan-European data portal in 2013 [2, p. 10].
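To make the role of descriptive metadata more concrete, here is a minimal sketch of a catalogue lookup in Python. It only illustrates the idea of getting from a known place (the catalogue) to a previously unknown URI. The record fields, the example.org URLs, and the function name find_access_urls are assumptions made for this illustration, not the schema of any real catalogue.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CatalogueRecord:
    """A thin layer of descriptive metadata about one dataset (hypothetical fields)."""
    title: str
    description: str
    keywords: Tuple[str, ...]
    access_url: str  # the URI from which the data itself can be retrieved


# Hypothetical catalogue content; the URLs are placeholders.
CATALOGUE = [
    CatalogueRecord(
        title="Municipal budget 2012",
        description="Planned revenues and expenditures by department.",
        keywords=("budget", "finance"),
        access_url="https://data.example.org/datasets/budget-2012.csv",
    ),
    CatalogueRecord(
        title="Public transport stops",
        description="Locations of bus and tram stops.",
        keywords=("transport", "geodata"),
        access_url="https://data.example.org/datasets/stops.csv",
    ),
]


def find_access_urls(catalogue: List[CatalogueRecord], query: str) -> List[str]:
    """Return access URLs of records whose metadata mentions the query."""
    needle = query.lower()
    return [
        record.access_url
        for record in catalogue
        if needle in record.title.lower()
        or needle in record.description.lower()
        or any(needle == keyword.lower() for keyword in record.keywords)
    ]


budget_urls = find_access_urls(CATALOGUE, "budget")
```

The user only needs to know where the catalogue is; the metadata does the rest of the work of leading to the data.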
Another way of making data discoverable is to make it accessible to machines, such as search engines, that will index the data and enable it to be found. Machines that index data also profit from access tools, such as descriptive metadata. Indexers may use either the full content of the data, if it is machine-readable, or catalogue records representing the data. In addition, there are specific types of metadata that improve discovery by machines, such as site maps that describe how the data is distributed across a site, or robots.txt files that tell crawlers which parts of a site they may access.
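As a rough illustration of how an indexer might consult such metadata, the following sketch uses Python's standard urllib.robotparser module to read a robots.txt file. The file content, the data.example.org host, and the "ExampleIndexer" user agent are made up for this example; the site_maps() method requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of an imaginary data portal.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /datasets/

Sitemap: https://data.example.org/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved indexer checks each URL before fetching it.
can_index_dataset = parser.can_fetch(
    "ExampleIndexer", "https://data.example.org/datasets/budget-2012.csv"
)
can_index_admin = parser.can_fetch(
    "ExampleIndexer", "https://data.example.org/admin/users"
)

# Site maps listed in robots.txt point the indexer to the rest of the data.
sitemaps = parser.site_maps()
```

Note that robots.txt is advisory: it guides cooperative crawlers, but it does not by itself prevent anyone from requesting a URL.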

Accessibility

Carl Malamud claims that “today, public means on-line” [4]. Open data should be available online on the World Wide Web, retrievable via HTTP GET requests. There should be both an access interface for human users, such as a web site or an application, and an access interface for machine users, such as an API or a downloadable data dump.
There are a number of ways to make data accessible online. A common and widely recommended practice is to publish data exports that users may download in bulk. An option that has lately fallen out of favour is the use of File Transfer Protocol (FTP) to distribute the data dumps. It has largely been replaced by exposing the data via Hypertext Transfer Protocol (HTTP), so that one may retrieve it via HTTP GET requests. An efficient alternative to HTTP is peer-to-peer file sharing via the BitTorrent protocol. However, this technology is not nearly as widely used as HTTP, particularly in the public sector.
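Retrieving a bulk export with an HTTP GET request can be sketched in a few lines using only the Python standard library. This is a minimal sketch, not a robust client: the function name download_dump, the chunk size, and the timeout are choices made for this illustration, and any URL passed to it is up to the caller.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dump(url: str, destination: Path,
                  chunk_size: int = 64 * 1024, timeout: float = 30.0) -> int:
    """Stream the resource at `url` into `destination`; return the bytes written.

    For HTTP(S) URLs this issues a plain GET request without any credentials.
    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    request = urllib.request.Request(url, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            destination.open("wb") as out:
        # Copy in chunks so that large dumps are never held in memory at once.
        shutil.copyfileobj(response, out, chunk_size)
    return destination.stat().st_size
```

A call might look like download_dump("https://data.example.org/dumps/budget-2012.csv", Path("budget-2012.csv")), where the URL is a placeholder. The point is that an anonymous GET is all a consumer should need.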
There should be no barriers obstructing access to data, whether they stem from technological restrictions or from policy rules. No party or website should have privileged or exclusive access to public sector data. There should be no financial cost associated with the use of data, although recovering reasonable marginal costs of data reproduction is acceptable in the limited number of cases in which reproducing the data incurs expenses to its producer.
To safeguard users’ privacy and confidentiality, any mechanism that identifies users should be prohibited [3]; instead, anonymous access that requires no login should be provided. Protecting users’ identity through anonymous access is not possible with reactive disclosure, which is based on freedom of information requests. Proactive disclosure, however, permits users to access data without sharing their identity [5, p. 69]. Users should not be required to register, although asking users to apply for an Application Programming Interface (API) key is reasonable, especially when the data producer needs to control the load on the servers hosting the data. There should be no password protection, no strict limit on the number of API calls, and no encryption hindering access to data.

Permanence

Open data should be accessible in the long term. A technical infrastructure needs to be in place to ensure long-term availability of public sector data [2, p. 8]. Overall permanence comprises the permanence of content, access mechanisms, and software.
Data publishers should have back-up strategies. A common approach to maintaining data permanence is to keep data both in exchange formats and in preservation formats. Formats employed for storing data for the purpose of preservation should be sustained by a strong community of users or by a standards body, because obsolescence of a data format may prevent archival access [6]. To ensure future accessibility of data, the access points from which the data can be retrieved should be persistent. Roy Fielding argues that “the quality of an identifier is often proportional to the amount of money spent to retain its validity” [7, p. 90]. Identifying resources with persistent access points has the benefit that a consumer who knows the identifier does not need to re-discover the identified resource on each attempt to access it [9]. The sustainability and reliability of data access methods is especially important for direct reuse of data, such as in applications built on top of data APIs, or in cases when the data cannot be copied or it is not efficient to do so. A solution for this requirement may be to introduce indirection by providing a layer that redirects access requests to the variable locations of the data, such as with persistent URLs.
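The indirection layer described above can be sketched as a small lookup that answers a request for a persistent path with an HTTP redirect to wherever the data currently lives. This is only a sketch under assumptions: the paths, the example.org locations, and the function name resolve are hypothetical, and a real persistent URL service would also need durable storage and governance for the mapping.

```python
from http import HTTPStatus
from typing import Dict, Tuple

# Hypothetical mapping from persistent paths (which never change)
# to the current locations of the data (which may change).
current_locations: Dict[str, str] = {
    "/id/dataset/budget-2012": "https://files.example.org/2012/08/budget.csv",
}


def resolve(persistent_path: str,
            locations: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Return the HTTP status code and headers answering a persistent path."""
    target = locations.get(persistent_path)
    if target is None:
        return int(HTTPStatus.NOT_FOUND), {}
    # 302 Found: the identifier stays valid, only the location is temporary.
    return int(HTTPStatus.FOUND), {"Location": target}


before = resolve("/id/dataset/budget-2012", current_locations)

# The publisher moves the file; only the mapping is updated.
current_locations["/id/dataset/budget-2012"] = (
    "https://archive.example.org/budget/2012.csv"
)
after = resolve("/id/dataset/budget-2012", current_locations)
```

Consumers keep using the same persistent path before and after the move, so they never have to re-discover the resource.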
Software that implements support for the format of the data needs to be preserved as well. Long-term availability of such software is required to preserve the ability to use the data. From this perspective, relying on a single software vendor increases the likelihood of obsolescence and should thus be avoided in favour of data formats that are supported by multiple vendors.

References

  1. European Commission. Digital agenda: Commission’s open data strategy, questions & answers [online]. MEMO/11/891. Brussels, December 12th, 2011 [cit. 2012-04-11]. Available from WWW: http://europa.eu/rapid/pressReleasesAction.do?reference=MEMO/11/891
  2. European Commission. Open data: an engine for innovation, growth and transparent governance [online]. Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions. Brussels, 2011 [cit. 2012-03-15]. Available from WWW: http://ec.europa.eu/information_society/policy/psi/docs/pdfs/opendata2012/open_data_communication/opendata_EN.pdf
  3. American Library Association. Key principles of government information [online]. Chicago, 1997 — 2012 [cit. 2012-04-07]. Available from WWW: http://www.ala.org/advocacy/govinfo/keyprinciples
  4. MALAMUD, Carl. By the people [online]. Government 2.0 Summit. Washington (DC), September 10th, 2009 [cit. 2011-03-23]. Available from WWW: http://public.resource.org/people/
  5. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  6. TAUBERER, Joshua. Open data is civic capital: best practices for “open government data” [online]. Version 1.5. January 29th, 2011 [cit. 2012-03-17]. Available from WWW: http://razor.occams.info/pubdocs/opendataciviccapital.html
  7. FIELDING, Roy Thomas. Architectural styles and the design of network-based software architectures. Irvine (CA), 2000. 162 p. Dissertation (PhD.). University of California, Irvine.
