2012-08-11

Open data for public sector information

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Like data in general, public sector information seems to be predisposed to be opened. The key argument in favour of opening up public sector information is that this information belongs to the public. Joseph Stiglitz, a noted economist, writes: “[...] Who owns the information? Is it the private province of the government official, or does it belong to the public at large? I would argue that information gathered by public officials at public expense is owned by the public – just as the chairs and buildings and other physical assets used by government belong to the public” [1, p. 7]. Collection and maintenance of public sector data is paid for from public funds derived from tax revenues. Therefore, the data should be treated as a public good, granting equal levels of access and use not only to public sector officials, but to every citizen as well. In other words, paraphrasing an Internet meme, “All your data are belong to us” [2, p. 241].
The public owns the public sector data and demands that it be openly available [3]. In 2010, a survey by Socrata showed strong support for open data in the public sector [4]: 92.6 % of civil servants would commit to open data and 67.2 % of citizens agreed with opening up public sector data. The interest of citizens in data from the public sector may also be illustrated by the existence of community alternatives to public sector data [5]. For example, the demand for geo-spatial data may be demonstrated by projects like OpenStreetMap, for which volunteers are “re-engineering” the data that should have been provided by the public sector.
Given the predispositions of public sector information to being opened, the demand for it, and the technologies that make it possible to be opened, one may expect an increase in activity in this domain. Open data in the public sector went from being a niche cause to being pervasive in the whole world. Now, there are over a hundred initiatives opening up data in the public sector world-wide [6], building up to a global, networked data infrastructure.

References

  1. STIGLITZ, Joseph E. On liberty, the right to know, and public discourse: the role of transparency in public life. Oxford Amnesty Lecture. Oxford (UK), 1999. Also available from WWW:
  2. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  3. ARTHUR, Charles; CROSS, Michael. Give us back our crown jewels. Guardian [online]. March 9th, 2006 [cit. 2012-03-09]. Available from WWW: http://www.guardian.co.uk/technology/2006/mar/09/education.epublic
  4. Socrata. 2010 open government data benchmark study [online]. Version 1.4. Last updated January 4th, 2011 [cit. 2012-04-07]. Available from WWW:
  5. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  6. DAVIES, Tim; BAWA, Zainab Ashraf. The promises and perils of open government data (OGD). Journal of Community Informatics [online]. 2012 [cit. 2012-04-12], vol. 8, no. 2. Available from WWW: http://ci-journal.net/index.php/ciej/article/view/929/926. ISSN 1712-4441.

2012-08-10

Open data policies

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Principles describe goals: what should be achieved. These goals need to be linked to ways of accomplishing them. It needs to be clear how to implement the goals and thus translate principles into action.
For this purpose, policies are made. They represent a pragmatic use of principles, prescribing requirements for behaviour and the resulting actions. Principles need to be distilled into policies in order to provide direct guidance and practical steps to be taken by their implementers. Policies should supplement principles with motivations. They should explain their objectives along with their prospective outcomes. The motivations should be underpinned with benefits to be yielded or sanctions to be imposed on those disobeying the policy.
Another motivation to make public data more accessible and usable was presented in the research paper Government data and the invisible hand [1]. The proposal suggested that there should be a policy requiring public bodies to access their data in the same way the public may access them: “The policy route to realizing this principle is to require that federal government Web sites retrieve their published data using the same infrastructure that they have made available to the public” [Ibid., p. 170].
Compliance with policies must be reviewable. Control mechanisms, such as performance indicators or tests, should be designed in order to determine if sanctions should be applied. A contact person must be designated to respond to people trying to use the data and to address complaints about violations of the principles embodied in open data policies. Open data policies were generally made in the last few years; however, the term “open data” appeared in a policy context several years before. Harlan Yu reports the earliest “open data policy” to be from the 1970s [2, p. 8]. It was a US science policy that required NASA partners to have an “open-data policy comparable to that of NASA [...] particularly with respect to the public availability of data”.
Policies may be issued at different levels of the public sector, either at the level of state government or by local administrations. An example of an open data policy is the Open Government Directive from Barack Obama’s administration in the US, which ordered all agencies in the public sector to publish their non-classified datasets on the Web [3].

References

  1. ROBINSON, David G.; YU, Harlan; ZELLER, William P.; FELTEN, Edward W. Government data and the invisible hand. Yale Journal of Law & Technology. 2009, vol. 11, p. 160 — 175.
  2. YU, Harlan; ROBINSON, David G. The new ambiguity of “open government” [online]. Princeton CITP / Yale ISP Working Paper. Draft of February 28th, 2012. Available from WWW: http://ssrn.com/abstract=2012489
  3. ORSZAG, Peter R. Open government directive. M-10-06. Memorandum for the heads of executive departments and agencies. Washington: Executive Office of the President, December 8th, 2009. Also available from WWW: http://www.whitehouse.gov/sites/default/files/omb/assets/memoranda_2010/m10-06.pdf

2012-08-09

Qualities of open data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is complementary to data openness. It is a set of features of data that are not essential for its openness but are closely related to it.

Content

A primary facet of data quality is the type of content that is included in data. The following group of requirements instructs producers of open data on what should be included in their datasets.

Primariness

Data is traditionally made available in finished products, such as compiled reports. However, the call for “raw data now” asks instead for disaggregated and un-interpreted data [1]. Open data should thus be made available at the earliest point at which it is useful to businesses and citizens [2]. A similar principle is adopted in the open source community, embodied in the slogan “release early, release often”, which emphasizes the importance of a tight loop of gathering and applying user feedback, steering the released product towards higher quality.
Data should be collected at the source with the highest possible level of granularity to achieve maximum accuracy. It is desirable to strive for high precision, because precision reflects the depth of information encoded in the data [3]. Accuracy then represents the likelihood that the information extracted from the data is correct. For example, publishers of open data should provide fine-grained data with high resolution and a high sampling rate, such as high-definition images or video.

Completeness

All public data should be made available, except direct or indirect identifiers of persons, which constitute personally identifiable information, and data that needs to be kept secret for reasons of national security. The goal of open data principles is to make the public sector, not citizens, transparent. Complete datasets should be available for bulk download, since whole datasets could be difficult or impossible to retrieve through an API.

Timeliness

Essentially, all datasets are snapshots of data streams, capturing the current state of an observed phenomenon. Accordingly, the value of data can decrease over time. For example, weather forecasts lose most of their value after the day for which they predict the weather conditions. What holds for all types of data is that their value decreases as the methods used to capture them become obsolete.
The usefulness of data may quickly drain away as the data ages. A commercial from IBM stresses the importance of real-time data for decision making: it claims that you would not cross a road if all you had was a five-minute-old snapshot of the traffic situation. This is the case with freedom of information requests, the procedure of which is too slow to obtain timely data. The long waiting periods for these requests may result in receiving out-of-date data.
With the transient nature of most data in mind, data producers should publish it as soon as possible to preserve its value, such as with live feeds for frequently updated material [4, p. 33]. Preferably, the data should be released to the public at the time of its release for internal use. In this way, the data can help achieve real-time transparency and can be treated as a news source.

Integrity

Digital signatures may be used to ensure the integrity of open data. Signatures serve to guarantee the authenticity of data, trace its digital provenance, and preserve its integrity in the course of its transfer to the user. Publishing data over the secure HTTPS protocol may decrease the risk of tampering with data during its transmission.
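A lighter-weight relative of digital signatures is a published checksum: it lets users verify that their copy was not corrupted or tampered with in transit, although unlike a signature it does not prove who published the data. A minimal sketch in Python, using the standard library only; the idea of distributing a SHA-256 digest alongside a dataset is an illustration, not a practice the thesis prescribes:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, published_digest: str) -> bool:
    """Compare a local copy against the digest the publisher announced."""
    return sha256_of(path) == published_digest
```

A publisher would announce the digest next to the download link; a full digital signature additionally binds that digest to the publisher's identity.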

Usability

Usability is a quality of data that accounts for how well the data can be used. Open data that is highly usable has a lower cost of use. This section mentions three aspects of open data that contribute to its usability.

Presentation

A human-readable copy of data should be available to alleviate the unequal levels of ability to work with raw data. Given the differing data literacy skills among users, an effort needs to be made to provide the largest number of people with the greatest benefits from the data and to help them make “effective use” of it (as dubbed by Michael Gurstein in [5]). The primary format for human-readable presentation, which is recommended for open data, is HTML [6].

Clarity

Open data should communicate as clearly as possible, using plain and accurate language. Descriptions in data should be given in a neutral and unambiguous language that does not skew the interpretation of the data. They should avoid jargon or technical language, unless the terminology is well-defined and adds to the clarity of the data. Data should employ meaningful scales that clearly convey the differences in the data. Data should not contain extraneous information or superfluous padding that might distract users from the important parts of the data or confuse them.
To widen the reach of data its descriptive metadata should use a universal language (e.g., English), while the content of the data should be language-independent. This is particularly important to improve the prospects of cross-country reuse.

Documentation

An aspect that greatly contributes to usability of data is availability and quality of documentation. Providing documentation is important for users because it helps them understand the data. Tim Davies makes the point that “data is also only effectively open if any code-lists and documentation necessary to interpret it (e.g., details of the units of measurement used etc.) is also made openly available” [7, p. 1]. Documentation should require only general knowledge and should not presuppose knowledge of internal practices of the agency that produced the dataset. For example, documentation might explain how a dataset is structured and what abbreviations are used in it.
The need for explanatory descriptions of data may be demonstrated with the comma-separated values (CSV) data format. It is precisely the simple structure of CSV, without any schema description, that makes interpretation of data in this format difficult without an accompanying “codebook”, domain knowledge, and manual data inspection [8].
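To make the codebook point concrete, here is a small Python sketch. The column names, codes, and figures are invented for illustration; the point is that the CSV structure alone carries no meaning, while a codebook turns it into something interpretable:

```python
import csv
import io

# A hypothetical CSV export whose column names and coded values are opaque
# without documentation (all names and figures are invented):
raw = "REG,IND_A,PER\nCZ010,42.5,2011Q4\nCZ020,37.1,2011Q4\n"

# The accompanying "codebook" supplies the meaning that the bare CSV
# structure cannot carry:
codebook = {
    "REG": "region code",
    "IND_A": "indicator A, e.g. unemployment rate (%)",
    "PER": "reporting period",
}

# Decoding a row is a simple lookup once the codebook exists; without it,
# a header like "IND_A" could mean almost anything.
rows = list(csv.DictReader(io.StringIO(raw)))
decoded = {codebook[column]: value for column, value in rows[0].items()}
print(decoded)
```

Without the `codebook` mapping, a user is left with manual inspection and guesswork, which is exactly the difficulty the paragraph above describes.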

References

  1. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls
  2. GRAVES, Antoinette. The price of everything the value of nothing. In UHLIR, Paul F. (rpt.). The socioeconomic effects of public sector information on digital networks: toward a better understanding of different access and reuse policies: workshop summary. Washington (DC): National Academies Press, 2009. Also available from WWW: http://books.nap.edu/openbook.php?record_id=12687&page=37. ISBN 0-309-13968-6.
  3. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/
  4. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  5. GURSTEIN, Michael. Open data: empowering the empowered or effective data use for everyone? First Monday [online]. February 7th, 2011 [cit. 2012-04-01], vol. 16, no. 2. Available from WWW: http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3316/2764
  6. BENNETT, Daniel; HARVEY, Adam. Publishing open government data [online]. W3C Working Draft. September 8th, 2009 [cit. 2012-04-07]. Available from WWW: http://www.w3.org/TR/gov-data/
  7. DAVIES, Tim. Linked data in international development: practical issues [online]. Draft 0.1. September 2011 [cit. 2011-11-07]. Available from WWW: http://www.timdavies.org.uk/wp-content/uploads/1-Primer-Introducing-linked-open-data.pdf
  8. LEBO, Timothy; WILLIAMS, Gregory Todd. Converting governmental datasets into linked data. In I-Semantics 2010: proceedings of the 6th International Conference on Semantic Systems, September 1 — 3, 2010, Graz, Austria. New York (NY): ACM, 2010. ISBN 978-1-4503-0014-8. DOI 10.1145/1839707.1839755.

2012-08-08

Principles of open data: use

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The group of principles governing the use of data covers the affordances expected for open data. It highlights the features of data that are deemed to be fundamental in opening data up to a variety of uses. Consequently, it warns of technological choices that may cause unintended usage limitations.

Non-proprietary data formats

Open data should use data formats over which no entity has exclusive control. Specifications of open data formats should be community-owned, free for all to read and implement, and subject to no fees, royalties, or patent rights. Public review should be a part of the decision-making process in the format’s development, in order to enable participation from both implementers and users of the format. For example, the World Wide Web Consortium has an open and well-defined process for making standard data formats.
Using proprietary data formats excludes users of platforms or software whose developers are not permitted to implement support for the format. Hence, by using a proprietary format, users are confined to acquiring software from a single vendor. Data producers risk not being able to change software suppliers, experiencing vendor or product lock-in. Relying on proprietary data formats for storing data also comes with the risk of the formats becoming obsolete. These are some of the reasons why it is important to adopt a non-proprietary format for open data. For example, unlike spreadsheet formats from commercial vendors, comma-separated values (CSV) is a non-proprietary data format that is more suitable for open data.

Standards

Adhering to a set of common standards makes reuse easier as the data can be processed by a wide array of standards-compliant tools. Standards create expected behaviour, enable comparisons, and ultimately lead to superior interoperability. Standards, such as controlled vocabularies and common identifiers, provide better opportunities for combining disparate sources of data. Consistent use of standards leads to “informal” standards encoded as best practices.
For example, standards from the World Wide Web Consortium are appropriate for open data.

Machine readability

Machine readability is a property of the data structure. Machines parse (“read”) structures. The more machine-readable the data is, the smaller the unit that can be read. A higher level of partitioning in the data structure leads to greater readability.
For instance, when machines are dealing with scanned documents saved as images in PDF files, the smallest unit they can meaningfully distinguish is the whole file, a blob of data that is opaque to them. On the other hand, when machines read HTML files, the smallest unit that can be read may be one HTML element or even one character.
It is particularly frustrating when public servants think it a good thing to transform data from a machine-readable format, such as XML, into a format that is not machine-readable, such as PDF [1, p. 27]. While users of the data can convert it from XML to PDF, they cannot convert it from PDF back to XML. Tim Koelkebeck writes that “storing structured information via structureless scanning is the e-government equivalent of burning the files” [2, p. 278].
The term “machine-readable” is a bit misleading when interpreted strictly. Machines can “read” all digital information. However, some data formats leave open few ways in which the data may be used. For example, binary formats, such as images or executables, do not lend themselves to other types of use than display or execution, and as such they limit the possibilities of reuse. Therefore, open data should be stored in textual formats (e.g., CSV) with an explicit and standard character encoding (e.g., UTF-8).
Open data should be captured in a structured and formalized data format that enables automated processing by software. Daniel Bennett writes that “structure allows others to successfully make automated use of the data” [3]. Users should be able not only to display the data, but also to perform other types of automated processing, such as full-text search, sorting, or analysis.
Open data should be valid, conforming to its format’s specification. Minor errors may, however, be handled by the error-recovery process of the user’s software; web browsers, for example, are very tolerant of malformed HTML. In general, though, syntax errors increase the cost of using data, because fixing such mistakes always involves human intervention [4]. Thus, data containing errors that severely violate the specification of its data format cannot be considered machine-readable.
Machines are users of data too, and thus providing data in a machine-readable format avoids discriminating against them. However, “most government websites weren’t designed to share data with other websites” [2, p. 205]. People view data through machines, and machines help them to process it efficiently. For example, one of the main types of data intermediaries are search engines; therefore, it is important that search engines can access and crawl open data. Another example where machine readability is crucial is big data, since people are not able to process large volumes of data and have to pre-process them with machines first. Machine readability is also important for people who cannot read the data themselves (e.g., visually impaired users), for whom machines, such as screen readers, must read it.
The connection between the licensed work and the terms of its licence may be made even more explicit by using a machine-readable licence statement. There are several ways to indicate a licence so that it can be recognized automatically. A widespread method is to qualify the type of the link pointing to the licence (for example, with Microformats). Having the licence attached to data in a way that is meaningful to machines comes with benefits for users, such as the ability to search for photos reusable under the terms of a particular licence.
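As a concrete example of such a qualified link, the rel-license Microformat marks a hyperlink as pointing to the page's licence via a `rel="license"` attribute. The sketch below, using only Python's standard-library HTML parser, shows how a machine can extract that statement; the page content is invented for illustration:

```python
from html.parser import HTMLParser

# A page embedding a machine-readable licence statement: the rel="license"
# attribute qualifies the link so that crawlers can recognize it.
page = """<html><body>
<p>Dataset X is available under
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>.</p>
</body></html>"""

class LicenseFinder(HTMLParser):
    """Collect the targets of all rel="license" links in a page."""
    def __init__(self):
        super().__init__()
        self.licenses = []
    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("rel") == "license":
            self.licenses.append(attributes.get("href"))

finder = LicenseFinder()
finder.feed(page)
print(finder.licenses)
```

A search engine doing the same extraction at scale is what enables the licence-filtered photo search mentioned above.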

Safety

Open data should be published in data formats that cannot contain executable content. Such content may contain malicious code harmful to the users of the data. Textual formats, which are recommended for disclosure of open data, are safe to use. On the other hand, Microsoft Office files are not considered to be safe, since they can contain executable macros.

References

  1. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  2. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  3. BENNETT, Daniel; HARVEY, Adam. Publishing open government data [online]. W3C Working Draft. September 8th, 2009 [cit. 2012-04-07]. Available from WWW: http://www.w3.org/TR/gov-data/
  4. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/

2012-08-07

Principles of open data: accessibility

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The requirements of open data principles covered in this blog post answer the question of how one can obtain the data. Access is important because it necessarily precedes reuse. Making data accessible can be thought of as the next step after making it legally open.

Discoverability

In order to be able to access a dataset, you need to discover it. The information that the data actually exists is a necessary prerequisite to data access [1]. Users of open data should be able to discover where the data is and to locate where its parts are distributed. Essentially, discoverability is the ability to get from a known URI to a previously unknown URI, which may be used to retrieve the data. There are two main approaches to making data discoverable: the URI known to the user may be that of a data catalogue or of a search engine.
Discoverability is the reason why data should be equipped with a thin layer of commonly agreed descriptive metadata [2, p. 8], such as in a data catalogue or, more broadly, an information asset register [3]. A data catalogue may form a single access interface to data. For instance, PublicData.eu is an unofficial data catalogue of Europe’s public data. An official pan-European data portal is planned by the European Commission to launch in 2013 [2, p. 10].
Another way of making data discoverable is to make it accessible to machines, such as search engines, that will index the data and enable it to be found. Machines that index data also profit from access tools, such as descriptive metadata. Indexers may use either the full content of the data, if it is machine-readable, or just the catalogue records representing it. However, there are specific types of metadata that can be used to improve discovery by machines, such as site maps that describe the information architecture of the way the data is distributed, or robots.txt files that govern crawler access to the data.
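The robots.txt convention mentioned above is already machine-interpretable with Python's standard library. The sketch below shows a crawler checking whether it may fetch two paths on a hypothetical data portal; the portal paths and rules are invented for illustration:

```python
import urllib.robotparser

# A hypothetical robots.txt published by a data portal: it invites crawlers
# into the dataset area while keeping an internal path out of indexes.
robots_txt = """\
User-agent: *
Allow: /datasets/
Disallow: /internal/
"""

robot_rules = urllib.robotparser.RobotFileParser()
robot_rules.parse(robots_txt.splitlines())

# A well-behaved indexer consults these rules before retrieving anything:
print(robot_rules.can_fetch("*", "/datasets/budget-2012.csv"))
print(robot_rules.can_fetch("*", "/internal/drafts.csv"))
```

A publisher who accidentally disallows everything in robots.txt makes the data invisible to search engines, which is why this small file matters for discoverability.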

Accessibility

Carl Malamud claims that “today, public means on-line” [4]. Open data should be available online on the World Wide Web, retrievable via HTTP GET requests. There should be both an access interface for human users, such as a web site or an application, and an access interface for machine users, such as an API or a downloadable data dump.
There are a number of ways to make data accessible online. A common and widely recommended practice is to publish data exports that users may download in bulk. An option that has lately fallen out of favour is the use of the File Transfer Protocol (FTP) to distribute data dumps; it has largely been replaced by exposing the data via the Hypertext Transfer Protocol (HTTP), so that one may retrieve it via HTTP GET requests. An efficient alternative to plain HTTP is peer-to-peer file sharing via the BitTorrent protocol. However, this technology has not yet seen nearly as widespread use as HTTP, particularly in the public sector.
There should be no barriers obstructing access to data, whether from technological restrictions or policy rules. No party or website should have privileged or exclusive access to public sector data. There should be no financial cost associated with the use of data, although recovering reasonable marginal costs of data reproduction is acceptable in the limited number of cases in which reproduction of data incurs expenses to its producer.
To safeguard users’ privacy and confidentiality, any mechanism that identifies users should be prohibited [3]; instead, anonymous access without login should be provided. Protecting users’ identity by providing anonymous access is not possible with reactive disclosure based on freedom of information requests. However, proactive disclosure permits users to access data without sharing their identity [5, p. 69]. Users should not be required to register, albeit requesting users to apply for an Application Programming Interface (API) key is reasonable, especially when the data producer needs to control the load on servers hosting the data. There should be no password protection, no strict limit on the number of API calls, and no encryption hindering access to data.

Permanence

Open data should be accessible in the long term. A technical infrastructure needs to be in place to ensure long-term availability of public sector data [2, p. 8]. The overall permanence comprises the permanence of content, access mechanisms, and software.
Data publishers should have back-up strategies. A common approach to maintaining data permanence is to keep data both in exchange formats and in preservation formats. Formats employed for storing data for the purpose of preservation should be sustained by a strong community of users or by a standards body, because obsolescence of a data format may prevent archival access [6]. To ensure future accessibility of data, the access points from which the data can be retrieved should be persistent. Roy Fielding argues that “the quality of an identifier is often proportional to the amount of money spent to retain its validity” [7, p. 90]. Identifying resources with persistent access points has the benefit that a consumer who knows the identifier does not need to re-discover the identified resource on each attempt to access it [9]. The sustainability and reliability of data access methods is important especially due to the direct reuse of data, such as in applications built on top of data APIs, or in cases when the data cannot be copied or it is not efficient to do so. A solution for this requirement may be to introduce indirection by providing a layer that redirects access requests to the variable locations of the data, such as with persistent URLs.
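The indirection behind persistent URLs can be reduced to a lookup table: the published identifier never changes, while the mapping from it to the data's current location can. A minimal Python sketch of that idea; all names and URLs below are illustrative, and a real service such as a PURL resolver would answer with an HTTP redirect instead of a function return:

```python
# The persistent identifier is what gets published and cited; the mapping
# to the data's current location is private to the resolver and mutable.
current_location = {
    "http://purl.example.org/dataset/budget":
        "http://archive.example.org/2012/budget.csv",
}

def resolve(persistent_uri: str) -> str:
    """Return the current location, as an HTTP redirect service would."""
    return current_location[persistent_uri]

# When the dataset moves, only the mapping is updated; every published
# reference to the persistent URI keeps working unchanged.
current_location["http://purl.example.org/dataset/budget"] = \
    "http://data.example.org/budget/2012.csv"
print(resolve("http://purl.example.org/dataset/budget"))
```

Applications built on top of the persistent URI survive the relocation without any change on their side, which is exactly the property the paragraph above asks for.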
Software that implements support for the data’s format needs to be preserved as well. Long-term availability of such software is required to preserve the ability to use the data. From this perspective, relying on a single software vendor increases the likelihood of obsolescence and should thus be avoided in favour of data formats that are supported by multiple vendors.

References

  1. European Commission. Digital agenda: Commission’s open data strategy, questions & answers [online]. MEMO/11/891. Brussels, December 12th, 2011 [cit. 2012-04-11]. Available from WWW: http://europa.eu/rapid/pressReleasesAction.do?reference=MEMO/11/891
  2. European Commission. Open data: an engine for innovation, growth and transparent governance [online]. Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions. Brussels, 2011 [cit. 2012-03-15]. Available from WWW: http://ec.europa.eu/information_society/policy/psi/docs/pdfs/opendata2012/open_data_communication/opendata_EN.pdf
  3. American Library Association. Key principles of government information [online]. Chicago, 1997 — 2012 [cit. 2012-04-07]. Available from WWW: http://www.ala.org/advocacy/govinfo/keyprinciples
  4. MALAMUD, Carl. By the people [online]. Government 2.0 Summit. Washington (DC), September 10th, 2009 [cit. 2011-03-23]. Available from WWW: http://public.resource.org/people/
  5. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  6. TAUBERER, Joshua. Open data is civic capital: best practices for “open government data” [online]. Version 1.5. January 29th, 2011 [cit. 2012-03-17]. Available from WWW: http://razor.occams.info/pubdocs/opendataciviccapital.html
  7. FIELDING, Roy Thomas. Architectural styles and the design of network-based software architectures. Irvine (CA), 2000. 162 p. Dissertation (PhD.). University of California, Irvine.

2012-08-06

Licences for open data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
By default, reuse requires permission. Unless there are legal instruments that enforce openness of data by default, there is a need for an explicit, open licence. A licence serves as a legal tool facilitating reuse [1, p. 6].
The licence should state clearly what users are allowed to do with the data. At the same time, the data should explicitly reference its licence to provide legal certainty. With explicit licences, users of data no longer find themselves in a legal vacuum with no clear guidance on how they can use the data.
However, even though explicit licensing is a fundamental requirement for publishing both open and non-open data, data producers often neglect it. For example, 82.16 % of data sources in the Linked Open Data Cloud, the diagram overviewing linked open data sources, do not provide any licensing information [2]. A similar situation may be observed for Czech public sector data, for which the licence is left unspecified in the majority of cases.
An essential goal of open licences is to achieve equal opportunities for access to and use of the licensed work. An open licence should thus be non-exclusive and non-discriminatory, enabling free reuse and redistribution of the licensed data. It should be agnostic of both users and types of use. Therefore, it should not discriminate against any persons or groups, fields of endeavour, or any prospective types of use for the data. Open licences should permit any type of reuse, allowing modifications and the creation of derivative data, and any type of redistribution that provides access to the data to others.
Access to data must not be restricted by administrative barriers or geography. Limiting access rights only to citizens of a particular country is unacceptable. Likewise, enabling access only to a pre-defined group of people is not sufficient. For example, the Creative Commons Developing Nations License makes licensed content open only to citizens of developing countries and as such is not considered an open licence.
Even though the primary objective of open licences is to remove obstacles to access and use, licences may stipulate some permissible requirements that licensees must comply with. At most, an open licence may require attribution to the original author and redistribution under the same or an analogous licence.
However, the requirement for attribution can cause difficulties when multiple datasets are reused and combined. This problem is known as “attribution stacking”, because the number of parties that have to be attributed grows with the number of datasets from different authors that are involved in the reuse.
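The stacking effect can be illustrated with a small sketch: each combined dataset carries its own attribution requirement, so the merged result must credit the union of all contributing parties. The dataset names and authors below are invented for illustration:

```python
# Hypothetical datasets, each carrying its own attribution requirement.
datasets = {
    "budget": ["Ministry of Finance"],
    "geodata": ["National Mapping Agency", "OpenStreetMap contributors"],
    "census": ["Statistical Office"],
}

def combined_attribution(selected):
    """Merging datasets stacks their attributions: every source must be credited."""
    credits = []
    for name in selected:
        for author in datasets[name]:
            if author not in credits:
                credits.append(author)  # each new dataset may add new parties
    return credits

print(combined_attribution(["budget", "geodata", "census"]))
```

Even this toy mash-up of three datasets already obliges the reuser to credit four parties; real aggregations spanning dozens of sources make the obligation correspondingly harder to honour.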
A problem similar to attribution stacking arises with share-alike licences, which require the same or an analogous licence to be used for redistribution. Share-alike licences are “viral” licences, for which the licensed content is their carrier. They may prove difficult to work with in cases where data available under the terms of different viral licences is combined and redistributed.
Open data should be published under a standard, generic licence. A custom licence makes the use of data more cumbersome, because the user first has to study the unknown licence instead of relying on the terms and conditions of a well-known one. Thus, the use of a custom licence may imply high transaction costs for using the licensed content.
The way users interface with data may be made even more uniform if a single licence is applied. In a controlled setting, such as the public sector, establishing a unified licence is encouraged to simplify the conditions of use, particularly for combining multiple datasets. Nevertheless, data provision under the terms of one licence is unlikely to scale: there are far too many different conditions around data for any single licence to cover.
Open data licences are considered to be those that conform to the Open Definition, a widely established definition of what it means for information to be open: “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike” [3]. The definition focuses on the legal aspects of openness and as such is closely tied to licences that enable open distribution.
Several existing licences conform to the requirements on the legal openness of open data. Some of them are generic licences that may be used regardless of context.
Among the generic licences recommended for open data, the commonly applied ones include Creative Commons Zero (CC0) and the Open Data Commons Public Domain Dedication and License (ODC PDDL). Strictly speaking, CC0 is not a licence but a waiver that places the covered content in the public domain. As discussed in the previous parts, in some jurisdictions legislation does not allow content to enter the public domain by artificial means, such as a waiver. In such cases, ODC PDDL may be applied, because it contains not only a waiver but also a licence agreement, which sets the conditions of use for the licensed content to be the same as for public domain content.
General-purpose licences may be substituted by licences with a specific purpose. An example of this type of licence is the UK Open Government Licence, which was designed specifically for releasing open data in the UK public sector.

References

  1. GRAY, Jonathan; HATCHER, Jordan; HEGGE, Becky; PARRISH, Simon; POLLOCK, Rufus. Unlocking the potential of aid information [online]. Version 0.2. December 2009 [cit. 2012-04-08]. Available from WWW:
  2. BIZER, Chris; JENTZSCH, Anja; CYGANIAK, Richard. State of the LOD Cloud [online]. Version 0.3. September 19th, 2011 [cit. 2012-04-11]. Available from WWW: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
  3. Open definition [online]. Version 1.1. November 2009 [cit. 2012-03-17]. Available from WWW: http://opendefinition.org/okd/

2012-08-05

Legal openness of data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Legal openness addresses the conditions of use. In other words, it covers what users are allowed to do with the data.
The default conditions of use for open data are declared by law. The main areas of legislation that impact open data include intellectual property rights and database rights [1, p. 138].
Whether intellectual property rights apply to data depends on the content of the data. These rights cover only original creative works, a condition that data in most cases does not satisfy. Data usually consists of facts and, according to the law, no one can claim ownership of facts. Moreover, data is typically not a product of creative work [2].
Public sector information, in turn, is a product of the pursuit of a public task. In such a case, public data may be explicitly declared exempt from copyright, as the US Copyright Act of 1976 did for US federal public data. The baseline here is that, in many cases, data should not be treated as private property, but rather as a common good.
Whether intellectual property rights are associated with data is an important distinction, because each option implies a completely different default state for the data. Assessing this relation also narrows down the ways in which the rights holders may modify the conditions of use for the data.
The impact of database rights on data depends on the law of the jurisdiction in which the data is produced. Of course, local legislation influences the intellectual property rights in data as well; however, those rights tend to be more universal, since they are harmonized by a number of international treaties. Sui generis database rights apply especially in the context of the member states of the European Union. In 1996, the EU issued Directive 96/9/EC on the legal protection of databases [3]. The directive grants rights to the creators of databases, protecting their intellectual contribution to the selection and arrangement of the database contents. The directive has since been transposed into the legal systems of many EU member states.
With regard to the described rights, in some cases open data may be subject to both. The content of the data may be eligible for intellectual property protection, while the data as a whole may derive its protection from database rights. In such a situation, dual licensing may be applied, providing the data content and the data structure with different licences, each appropriate for the given type of licensed work. However, it may prove difficult to find a clear boundary between the parts of the data to be licensed separately. Dual licensing also raises the barrier to using the data, since users need to know the requirements of both licences. Due to these complexities, it may be easier to handle the legal variations with a universal waiver.
Possibilities for opening data may also be limited by implied contracts, such as exclusive licence agreements. Data bound by contracts may be difficult to work with, because users may be unaware of the contracts' existence, or may find them difficult to interpret and abide by, especially laypersons. The most usable solution for open data is a single legal document that users need to consult in order to know the conditions of use, as explicit and unified rules simplify the use of data.
The legal recommendations found in open data principles usually advise modifying the default conditions under which data is available with a legal instrument based on contract law, such as a licence or a waiver. Such a recommendation serves a number of purposes. First of all, it provides explicit and comprehensive conditions of use that are valid for the data in question, shielding users from the possibly complex and hard-to-interpret law. Second, and crucially for open data, it is the way in which previously restrictive conditions may be made more open by renouncing some rights.
There are two main types of legal tools used to amend the conditions of use of data: licences and waivers. Licences redefine how data may be used in accordance with the producer’s desires and users’ needs. Licences for open data are discussed in the subsequent section.
Waivers serve to waive the rights associated with data. Their purpose is to reconstruct the conditions of use that apply to works in the public domain. Yet in some countries, such as the Czech Republic, waiving intellectual property rights is not considered a valid legal act. In these countries, works may enter the public domain only naturally, not through a deliberate action. However, licences may be used to emulate the public domain by explicitly setting the same conditions of use.
Through laws, regulations, licences, and waivers, data producers are able to accomplish legal openness. Legal openness is a necessary precondition for achieving technical openness: data that is technically open (e.g., online and in a structured format) but not legally open (e.g., covered by a prohibitive licence) is not open at all. Conversely, most data that is legally open can also be made technically open, for example by screen-scraping, a technique that extracts data from web pages. In fact, increasing the technical openness of data is an example of reuse made possible by open legal conditions of use. By contrast, there is no way for users of data to achieve its legal openness, since only the data producers can do that.
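The screen-scraping mentioned above can be sketched with Python's standard library alone. The HTML fragment below is a made-up stand-in for a public sector web page that publishes a table as markup rather than as structured data:

```python
from html.parser import HTMLParser

# A made-up HTML fragment standing in for a public sector web page.
PAGE = """
<table>
  <tr><td>Prague</td><td>1268796</td></tr>
  <tr><td>Brno</td><td>384277</td></tr>
</table>
"""

class CellScraper(HTMLParser):
    """Collects the text of <td> cells, turning an HTML table into rows of data."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new row of data
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())

scraper = CellScraper()
scraper.feed(PAGE)
print(scraper.rows)  # structured rows recovered from the page
```

Note that this kind of re-extraction is lawful only because the underlying data is legally open; scraping a page with a prohibitive licence would not make the data open in any sense.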

References

  1. VAN DER SLOOT, Bart. On the fabrication of sausages, or of open government and private data. eJournal of eDemocracy and Open Government [online]. 2011 [cit. 2012-03-15], vol. 3, no. 2, p. 136–154. ISSN 2075-9517. Available from WWW: http://www.jedem.org/article/view/68
  2. MILLER, Paul; STYLES, Rob; HEATH, Tom. Open Data Commons, a license for open data. In BIZER, Christian; HEATH, Tom; IDEHEN, Kingsley; BERNERS-LEE, Tim (eds.). Linked Data on the Web (LDOW 2008): proceedings of the WWW2008 Workshop on Linked Data on the Web, Beijing, China, April 22nd, 2008. Aachen: RWTH Aachen University, 2008. CEUR workshop proceedings, vol. 369. ISSN 1613-0073.
  3. EU. Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases. Official Journal of the European Union. 1996, vol. 15, L 77, p. 20–28. ISSN 1725-2555. Also available from WWW: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:1996:077:0020:0028:EN:PDF