2012-03-20

Principled open data

There is a proliferation of principles of open data. Most of them share a similar core, as they seem to have diverged from common predecessors. Principles of open data are usually used to convey the meaning of the concept of “open data”. However, there is no authoritative definition of “open data”; as is common with socially constructed concepts, its meaning is embedded in the open data community. In this post I have decided to try to summarize the key characteristics of this concept.

On the way to this goal I took a series of steps:

  1. Take some of the existing open data principles (sampling)
  2. Think about their relationships (interlinking)
  3. Group them (clustering)
  4. Re-arrange them according to their relationships (classification)
  5. Infer new principles (extrapolation)

Content

What should be in the data?

Primary: The data should be collected at the source with the highest possible level of granularity. Provide fine-grained data with high resolution and a high sampling rate. Provide raw, uninterpreted data instead of aggregated or derived forms. The public sector should not hold a monopoly on the interpretation of public sector data by providing it reduced to reports, which are mere thumbnails of the original data.

Complete: All public data should be made available, except direct and indirect identifiers of persons constituting personally identifiable information and data that need to be kept secret for reasons of national security. Complete datasets should be available for bulk download.

Timely: Release data in a timely fashion. All datasets are essentially snapshots of data streams capturing the current state of an observed phenomenon. Thus, the value of data can decrease over time (e.g., weather data). Also, the value of the methods used to capture the data decreases as the methods become obsolete. It is necessary to publish data as soon as possible to preserve their value. Preferably, the data should be released to the public at the time of their release for internal use. In this way, the data can help achieve real-time transparency and can be treated as a news source.

Conditions of use

What am I allowed to do with the data?

Licence: Unless there are legal instruments that enforce openness of data by default, there is a need for an explicit, open licence. The licence should state clearly what users are allowed to do with the data. An open licence should be non-discriminatory, enabling free reuse and redistribution of the licenced data. The licence should not discriminate against persons or groups, fields of endeavour, or any types of prospective use of the data. It should ignore the differences between users and their intentions. Therefore, the licence should permit any type of reuse, allowing modifications and the creation of derivative data, and any type of redistribution, providing access to the data to others. At most, an open licence may require attribution to the original author and redistribution under the same or an analogous licence. In controlled settings, such as in government, establishing a single licence is encouraged to simplify the conditions of use for combinations of multiple datasets.

Accessibility

How can I obtain the data?

Discoverability: In order to be able to access a dataset, you first need to discover it. That is why data should be equipped with descriptive metadata, such as in a data catalogue. Another way is to make data accessible to machines, such as search engines, that will enable the data to be found.
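As an illustration, here is a minimal sketch of such a descriptive metadata record, written in Python with the rdflib library and the DCAT vocabulary for data catalogues. The dataset URI and all descriptive values are invented for the example.

from rdflib import Graph, Literal, Namespace, URIRef

# DCAT is commonly used for data catalogue metadata;
# the dataset URI and the values below are hypothetical.
DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCTERMS = Namespace("http://purl.org/dc/terms/")

record = Graph()
dataset = URIRef("http://example.org/dataset/budget-2012")
record.add((dataset, DCTERMS["title"], Literal("City budget 2012")))
record.add((dataset, DCTERMS["description"], Literal("Itemized municipal budget for the year 2012")))
record.add((dataset, DCAT["keyword"], Literal("budget")))

print(record.serialize(format="turtle"))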

Accessibility: Data should be available online, retrievable via HTTP GET requests. There should be both an access interface for human users, such as a web site or an application, and an access interface for machine users, such as an API or a downloadable data dump. There should not be any barriers obstructing access to the data. There should be no financial cost associated with the use of data, although recovering reasonable marginal costs of data reproduction is acceptable in a limited number of cases. Users should not be required to register, although requesting users to apply for an API key is acceptable. There should be no password protection, no strict limits on the number of API calls, and no encryption hindering access to the data.
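In the simplest case, machine access amounts to a single HTTP GET request. A minimal Python sketch, assuming a hypothetical dataset URL that requires no registration or API key:

import urllib.request

# Hypothetical URL of an open dataset dump
url = "http://example.org/dataset/budget-2012.csv"

# A plain GET request is all that is needed to retrieve the data
with urllib.request.urlopen(url) as response:
    data = response.read().decode("utf-8")

print(data[:200])  # a peek at the beginning of the dump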

Use

How can I use the data?

Non-proprietary data formats: Open data should use data formats over which no entity has exclusive control. Specifications of open data formats should be free for all to read and implement, subject to no fees or royalties. Using proprietary data formats excludes users of software that has not been allowed to implement support for the data format. Relying on proprietary data formats for storing data comes with the risk of them becoming obsolete, which may prevent archival access.

Standards: Open data should use open, community-owned standards, such as the World Wide Web Consortium's standards. Adhering to a set of common standards makes reuse easier, as the data can be processed by a wide array of standards-compliant tools. Standards create expected behaviour, enable comparisons, and ultimately lead to greater interoperability. Standards such as controlled vocabularies and common identifiers provide better opportunities for combining disparate sources of data. Consistent use of standards leads to “informal” standards encoded in best practices.
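As a tiny illustration of what common identifiers enable, the following Python sketch combines two hypothetical datasets that identify regions by the same codes; all values are made up for the example.

# Two hypothetical datasets sharing a common identifier scheme
population = {"CZ010": 1272000, "CZ020": 1279000}  # region code -> population
budget = {"CZ010": 74000, "CZ020": 28000}          # region code -> budget

# Because both sources use the same identifiers, combining them is trivial
for region, inhabitants in population.items():
    print(region, inhabitants, budget.get(region))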

Machine readability: Open data should be captured in a structured and formalized data format that enables automated processing by software. Users should be able not only to display the data, but also to perform other types of automated processing, such as full-text search, sorting, or analysis. Machines are data users too, and thus providing data in machine-readable formats avoids discriminating against machines. People view data through machines, and machines help them process the data efficiently. Some people, such as people with disabilities, consume data only via machines, such as screen readers for users with visual impairment.

Note: The term “machine-readable” is a bit misleading when interpreted strictly. Machines can “read” all digital information. However, some data formats do not leave open many ways in which the data may be used. For example, binary formats, such as images or executables, do not lend themselves to types of use other than display or execution, and as such limit the possibilities of reuse. Therefore, open data should be stored in textual formats (e.g., CSV) with explicit and standard character encoding (e.g., UTF-8).
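As a brief sketch of what machine readability enables, the following Python snippet reads a textual CSV file with an explicit UTF-8 encoding and performs processing beyond mere display; the file name and the “amount” column are hypothetical.

import csv

# Hypothetical CSV file stored with an explicit, standard encoding
with open("budget-2012.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

# Automated processing beyond display: sorting and a simple analysis
rows.sort(key=lambda row: float(row["amount"]))
total = sum(float(row["amount"]) for row in rows)
print(len(rows), "rows, total amount:", total)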

Safety: Use data formats that cannot carry executable content, which may include malicious code harmful to the data users. Textual formats, which are recommended for the disclosure of open data, are safe to use.

Usability

How well can the data be used?

Presentation: A human-readable copy of the data (e.g., a web page) should be available to address the issue of the data divide, alleviating the unequal levels of ability to work with data. Given the differing data literacy skills among users, an effort needs to be made to provide the largest number of people with the greatest benefits from the data and help them make “effective use” of it, as dubbed by Michael Gurstein.

Clarity: The data should communicate as clearly as possible, using plain and accurate language. The descriptions in data should be given in neutral and unambiguous language that does not skew the interpretation of the data. Data should employ meaningful scales that clearly convey the differences in the data. To widen the reach of the data, use a universal language (e.g., English) and avoid jargon or technical language unless the terminology is well-defined and adds to the clarity of the data. Data should not contain extraneous information and superfluous padding that might distract users from important data or confuse them.

Permanence: Open data should be available in the long term. To ensure the future accessibility of data, the URIs from which the data can be retrieved should be persistent. The sustainability and reliability of data access methods are important due to direct reuse of data, such as in applications built on top of data APIs, when the data cannot be copied or it is not efficient to do so.

Conclusion

When compiling principles of open data, it is difficult to separate data “openness” from data “quality”. The question we can ask is: which non-essential features of openness are actually features of more general good design? I would expect the importance of different attributes of data openness to depend on the use case. Thus, I have not subjected the principles presented above to a coarse narrowing down to those that seemed the most important to me.

The other reason why the principles are presented in a comprehensive way is that they are meant to serve as a tool. Principles of open data describe what should be achieved. This needs to be linked to how it should be achieved. Goals need to be linked to implementations, so that it is straightforward to translate principles into action.

Open data principles should be distilled into policies and recommendations that provide direct guidance and specific steps for implementers. Recommendations should be accompanied by explanations, and policies should be connected with their outcomes and benefits to offer motivating reasons to their users. The process of policy creation should be kept iterative, open, and transparent.

Finally, compliance with the policies based on open data principles should be reviewable. There should be tests and control mechanisms in place to put their implementers under scrutiny, because a policy without a way of enforcing it is just a shadow of a policy.

2012-03-04

Opening contracted data in the public sector

The public sector sucks at making applications. Look at the applications it creates and compare them with the applications created in the private sector, such as e-banking. The difference is huge. A common argument in favour of open government data follows this line of reasoning: the public sector is not able to create useful applications in a cost-efficient way, therefore it should openly publish its data and the applications will flow, produced by members of the public, for free.

See, the problem is that the public sector also sucks at making some data. Some types of data, such as geographical data or extensive surveys, are quite difficult to gather by the means available to the public sector. The solution is to sign a contract with a company that produces the requested data. By outsourcing the acquisition of some types of data, the public sector gets what it needs for its functioning. No problem so far.

The problem starts to appear in cases when the companies (often unlike the public sector) see the possibilities for reuse of the data. The companies producing the data are well aware of the ways in which their product can be reused by businesses to generate revenue. It would be stupid of a company to provide the public sector with exclusive rights to the contracted data when the data can be re-sold to other companies. For example, a company producing geospatial data for the cadaster may sell the same data to businesses producing maps. Of course, the public sector might want to get a licence permitting it to redistribute the data, but a contract containing such a requirement would come at a much higher price from the supplier, because the supplier would be deprived of the additional income from re-selling the data. Opening highly reusable data might be pricey.

It leaves me with a lot of questions, wondering what is the best answer for opening data acquired by the public sector from a commercial supplier that is conscious of the real value of data and reflects it in the price.

At the beginning of the #opendata film, Tom Steinberg from MySociety says:

Open Government Data is any information the Government collects, by and large for their own purposes, that it then makes available for other people to use for their purposes.

The definition of open government data concerns data that are collected by the government. However, it is not clear whether it only covers data produced by the government itself or whether it includes data provisioned to the government by a third party as well. Does the definition of open government data apply even to data that are a result of a public contract? If this is a correct interpretation, is it nevertheless the responsibility of the public sector to contract data in a way that allows the data to be released as open data, even though it might significantly raise the price of the contract? Spending government finances on this would certainly lower the barrier for starting a business based on such data. And, given its financial constraints, can the public sector afford to contract data in this way?

Acknowledgements: Thanks to Jáchym Čepický for bringing this point up at our seminar Open Data and Public Sector: applying Austrian experience in Czech Republic.

Liking is linking

Hello, ladies. Look at your interface for creating linked data:

[Image: a text editor]

Now back to an interface used for creating linked data at Facebook:

[Image: the Facebook Like button]

Now back at your interface. Sadly, it is not like the one from Facebook. Why is that?

The concept of linked data has its own page on Facebook. It is identified by the URI http://graph.facebook.com/103761322995229, which is based on the identifier in the Facebook URL (103761322995229). The numeric identifier may be replaced by a “nice”, human-readable string when the page reaches at least 20 likes. Given the concept URI, one can dereference it to retrieve its RDF representation:

curl -H "Accept: text/turtle" http://graph.facebook.com/103761322995229

Bearing in mind the linked data best practices, it would be even better if there were a redirect set up from the original Facebook URL to this URI. Nevertheless, the important thing is that one can reference Facebook resources from one's own data. Having Facebook resources equipped with dereferenceable URIs makes them linkable.
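The same dereferencing can be scripted. A minimal Python sketch with urllib and rdflib, assuming the endpoint still honours content negotiation for Turtle as in the curl example above:

import urllib.request

from rdflib import Graph

# The concept URI from the example above
uri = "http://graph.facebook.com/103761322995229"

# Ask for a Turtle representation via content negotiation
request = urllib.request.Request(uri, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    turtle = response.read().decode("utf-8")

# Parse the retrieved representation and list its triples
graph = Graph()
graph.parse(data=turtle, format="turtle")
for subject, predicate, obj in graph:
    print(subject, predicate, obj)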

Facebook's Open Graph Protocol features a property likes (URI: http://graph.facebook.com/schema/user#likes) that can be used for relating a resource to an object that the resource likes. The auto-generated, yet human-readable reference for the vocabulary can be found here.

Taking this into account, the act of clicking the Like button for the concept of linked data while being logged in as me (URI: http://graph.facebook.com/jindrich.mynarz) can be treated as equivalent to writing the following triple (in Turtle notation):

@prefix graph:  <http://graph.facebook.com/> .
@prefix fbuser: <http://graph.facebook.com/schema/user#> .

graph:jindrich.mynarz fbuser:likes graph:103761322995229 .

Note: the dot character (“.”) is not allowed in local names (such as jindrich.mynarz) in the original Turtle specification; however, the newer version of the specification permits it.
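For completeness, the same triple can also be produced programmatically. A minimal sketch with Python's rdflib, mirroring the prefix bindings above (the exact serialization may differ between rdflib versions):

from rdflib import Graph, Namespace

GRAPH = Namespace("http://graph.facebook.com/")
FBUSER = Namespace("http://graph.facebook.com/schema/user#")

g = Graph()
g.bind("graph", GRAPH)
g.bind("fbuser", FBUSER)

# The act of liking expressed as a single RDF triple
g.add((GRAPH["jindrich.mynarz"], FBUSER["likes"], GRAPH["103761322995229"]))

print(g.serialize(format="turtle"))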

It is likely that Facebook stores the data differently, however, as can be seen in the case of Facebook pages and users, in some cases Facebook can surface the data as RDF. Such an assumption is supported by the practice of using RDF as an exchange format.

What I wanted to show by this example is that by clicking a Like button, you are in fact creating links. Liking is an example of a speech act in which a subject expresses its relation to an object. The subject of the link is the agent (i.e., you, the person acting) and the object is the web page shown.

Using the Facebook Like button is an example of expressing how users feel about something. Facebook allows users to express various feelings about things. Apart from the best-known one, liking, users can recommend things, or, thanks to the recently introduced Facebook actions, use an extensible mechanism for creating new types of relationships that users may describe. Facebook sees this functionality, and quite rightly so, as the “building blocks of Open Graph” (source).

What this points to is the growing opportunity for crowdsourcing linking. The Facebook Like button serves as an example of an easy-to-use interface for creating linked data that is available to the masses. It shows the potential of adding more complex, machine-readable annotations via simple interfaces. It is a tool for growing the interconnected web of data, describing how the users of the Web relate to its contents. Not to forget that the users of the Web might be machines too. Imagine bots crawling the Web and clicking Like buttons, leaving their traces on the visited places, and you will start to see the possibilities of crawlers connecting the web of data.