2012-03-20

Principled open data

There is a proliferation of principles of open data. Most of them share a similar core and seem to diverge from common predecessors. Principles of open data are usually used to get across the meaning of the concept of “open data”. However, there is no definition of “open data”; as is common with socially constructed concepts, its meaning is embedded in the open data community. In this post I try to summarize the key characteristics of this concept.

On the way to this goal I took a series of steps:

  1. Take some of the existing open data principles (sampling)
  2. Think about their relationships (interlinking)
  3. Group them (clustering)
  4. Re-arrange them according to their relationships (classification)
  5. Infer new principles (extrapolation)

Content

What should be in the data?

Primary: The data should be collected at the source with the highest possible level of granularity. Provide fine-grained data with high resolution and a high sampling rate. Provide raw, uninterpreted data instead of aggregated or derived forms. The public sector should not hold a monopoly on the interpretation of public sector data by providing them only as reports, which are mere thumbnails of the original data.
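To illustrate why raw data matter, here is a minimal sketch in Python. The readings and the function names are made up for this example. It shows that several different aggregates can be computed from the same raw records, while a published aggregate alone would not allow the others to be reconstructed.

```python
from statistics import mean

# Hypothetical raw readings: (station, hour, temperature in degrees Celsius).
RAW_READINGS = [
    ("A", 0, 4.0),
    ("A", 1, 6.0),
    ("B", 0, 10.0),
    ("B", 1, 12.0),
]


def average_by(readings, key_index):
    """Group raw readings by one column and return the mean temperature per group."""
    groups = {}
    for reading in readings:
        groups.setdefault(reading[key_index], []).append(reading[2])
    return {key: mean(temps) for key, temps in groups.items()}


def average_by_station(readings):
    """One possible interpretation of the data: mean temperature per station."""
    return average_by(readings, 0)


def average_by_hour(readings):
    """A different interpretation of the same raw data: mean temperature per hour."""
    return average_by(readings, 1)
```

If only the per-station averages were published, the per-hour view could no longer be derived from them.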

Complete: All public data should be made available, except direct and indirect identifiers of persons constituting personally identifiable information and data that need to be kept secret for reasons of national security. Complete datasets should be available for bulk download.

Timely: Release data in a timely fashion. All datasets are essentially snapshots of data streams capturing the current state of an observed phenomenon. Thus, the value of data can decrease over time (e.g., weather data). Also, the value of the methods used to capture the data decreases as the methods become obsolete. It is necessary to publish data as soon as possible to preserve their value. Preferably, the data should be released to the public at the time of their release for internal use. In this way, the data can help achieve real-time transparency and can be treated as a news source.

Conditions of use

What am I allowed to do with the data?

Licence: Unless there are legal instruments that enforce openness of data by default, there is a need for an explicit, open licence. The licence should state clearly what users are allowed to do with the data. An open licence should be non-discriminatory, enabling free reuse and redistribution of the licensed data. The licence should not discriminate against persons or groups, fields of endeavour, or any types of prospective use for the data. It should ignore the differences between users and their intentions. Therefore, the licence should permit any type of reuse, allowing modification and the creation of derivative data, and any type of redistribution, providing access to data to others. At maximum, an open licence may require attribution to the original author and redistribution under the same or an analogous licence. In controlled settings, such as in government, establishing a single licence is encouraged to simplify conditions of use for combinations of multiple datasets.

Accessibility

How can I obtain the data?

Discoverability: In order to be able to access a dataset, you need to discover it. That is why data should be equipped with descriptive metadata, such as in a data catalogue. Another way is to make data accessible to machines, such as search engines, that will enable the data to be found.
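As a rough sketch of what descriptive metadata buy you, the following Python snippet models a tiny data catalogue and a keyword lookup over it. The record fields, the example datasets, and the example.org URLs are all hypothetical and do not follow any particular metadata standard.

```python
# Hypothetical catalogue records; field names and URLs are illustrative only.
CATALOGUE = [
    {
        "title": "Air quality measurements",
        "description": "Hourly readings from monitoring stations.",
        "keywords": ["environment", "air"],
        "url": "https://data.example.org/air-quality.csv",
    },
    {
        "title": "Public contracts",
        "description": "Contracts awarded by public bodies.",
        "keywords": ["procurement", "spending"],
        "url": "https://data.example.org/contracts.csv",
    },
]


def find_datasets(catalogue, term):
    """Return the records whose title, description or keywords mention the term."""
    needle = term.casefold()
    matches = []
    for record in catalogue:
        searchable = " ".join(
            [record["title"], record["description"], *record["keywords"]]
        ).casefold()
        if needle in searchable:
            matches.append(record)
    return matches
```

A dataset without such a description would be invisible to this kind of lookup, however open it might otherwise be.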

Accessibility: Data should be available online, retrievable via HTTP GET requests. There should be both an access interface for human users, such as a web site or an application, and an access interface for machine users, such as an API or a downloadable data dump. There should not be any barriers obstructing access to data. There should be no financial cost associated with the use of data, although recovering reasonable marginal costs of data reproduction is OK in a limited number of cases. Users should not be required to register, although requesting users to apply for an API key is OK. There should be no password protection, no strict limits on the number of API calls, and no encryption hindering access to data.
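A simple way to probe this principle is to try retrieving a dataset with a plain GET request that carries no credentials, API key, or cookies. The sketch below uses only Python's standard library; the function names are my own, and a real check would also have to consider costs, rate limits, and terms of use, which cannot be observed from a single request.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_open_dataset(url, timeout=10):
    """Retrieve a dataset with a plain GET request and return its bytes.

    No credentials or extra headers are sent. For HTTP URLs, a response
    such as 401 or 403 raises urllib.error.HTTPError.
    """
    request = Request(url, method="GET")
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def is_openly_accessible(url, timeout=10):
    """Return True if the dataset can be fetched without any credentials."""
    try:
        fetch_open_dataset(url, timeout=timeout)
    except URLError:  # HTTPError is a subclass of URLError
        return False
    return True
```

For instance, `is_openly_accessible("https://data.example.org/contracts.csv")` would return False if that (hypothetical) address asked for a password.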

Use

How can I use the data?

Non-proprietary data formats: Open data should use data formats over which no entity has exclusive control. Specifications of open data formats should be free for all to read and implement, subject to no fees or royalties. Using proprietary data formats excludes users of software that has not been allowed to implement support for the data format. Relying on proprietary data formats for storing data comes with the risk of them becoming obsolete, which may prevent archival access.

Standards: Open data should use open, community-owned standards, such as the World Wide Web Consortium's standards. Adhering to a set of common standards makes reuse easier as the data can be processed by a wide array of standards-compliant tools. Standards create expected behaviour, enable comparisons, and ultimately lead to greater interoperability. Standards, such as controlled vocabularies and common identifiers, provide better opportunities for combining disparate sources of data. Consistent use of standards leads to “informal” standards encoded in best practices.

Machine readability: Open data should be captured in a structured and formalized data format that enables automated processing by software. Users should be able not only to display the data but also to perform other types of automated processing, such as full-text search, sorting or analysis. Machines are data users too, and thus providing data in machine-readable formats means not discriminating against machines. People view data through machines and machines help them with efficient processing of the data. Some people, such as people with disabilities, consume data only via machines, such as screen readers for users with visual impairment.

Note: The term “machine-readable” is a bit misleading when interpreted strictly. Machines can “read” all digital information. However, some data formats leave few ways in which the data may be used. For example, binary formats, such as images or executables, do not lend themselves to types of use other than display or execution, and as such limit the possibilities of reuse. Therefore, open data should be stored in textual formats (e.g., CSV) with explicit and standard character encoding (e.g., UTF-8).
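To make this concrete, here is a small Python sketch of what a textual format with an explicit encoding allows: the same CSV content can be decoded, sorted, and searched with a few lines of standard-library code. The sample rows and function names are invented for the example.

```python
import csv
import io

# Hypothetical CSV content, encoded explicitly as UTF-8.
CSV_BYTES = "station,readings\nNord,120\nSüd,75\nVýchod,310\n".encode("utf-8")


def load_rows(raw_bytes, encoding="utf-8"):
    """Decode the bytes with a known encoding and parse them as CSV rows."""
    text = raw_bytes.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


def sort_by_readings(rows):
    """Sort rows by the numeric 'readings' column, largest first."""
    return sorted(rows, key=lambda row: int(row["readings"]), reverse=True)


def search(rows, term):
    """Return rows in which any value contains the term, ignoring case."""
    needle = term.casefold()
    return [
        row for row in rows
        if any(needle in value.casefold() for value in row.values())
    ]
```

None of this would be possible if the same table were published only as a scanned image.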

Safety: Use data formats that cannot contain executable content, which may carry malicious code harmful to the data users. Textual formats, which are recommended for disclosure of open data, are safe to use.

Usability

How well can the data be used?

Presentation: A human-readable copy of data (e.g., a web page) should be available to address the issue of the data divide, alleviating the unequal levels of ability to work with data. Given the differing data literacy skills among users, an effort needs to be made to provide the largest number of people with the greatest benefits from the data and to help them make “effective use” of it, as Michael Gurstein calls it.

Clarity: The data should communicate as clearly as possible, using plain and accurate language. The descriptions in data should be given in a neutral and unambiguous language that does not skew the interpretation of data. Data should employ meaningful scales that clearly convey the differences in data. To widen the reach of data, use a universal language (e.g., English) and avoid using jargon or technical language unless the terminology is well-defined and adds to the clarity of data. Data should not contain extraneous information and superfluous padding that might distract users from important data or confuse them.

Permanence: Open data should be available in the long term. To ensure the future accessibility of data, the URIs from which the data can be retrieved should be persistent. The sustainability and reliability of data access methods are important because of direct reuse of data, such as in applications built on top of data APIs, when the data cannot be copied or it is not efficient to do so.

Conclusion

When compiling principles of open data, it is difficult to separate data “openness” and data “quality”. The question we can ask is which non-essential features of openness are actually features of good design in general. I would expect that the importance of different attributes of data openness depends on the use case. Thus, I have not subjected the principles presented above to a coarse narrowing down to those that seemed the most important to me.

The other reason why the principles are presented in a comprehensive way is that they are meant to serve as a tool. Principles of open data describe what should be achieved. This needs to be linked to how it should be achieved. Goals need to be linked to implementations, so that it is straightforward to translate principles into action.

Open data principles should be distilled into policies and recommendations that provide direct guidance and specific steps for implementers. Recommendations should be accompanied by explanations, and policies should be connected with their outcomes and benefits to give their users motivating reasons. The process of policy creation should be kept iterative, open, and transparent.

Finally, compliance with the policies based on open data principles should be reviewable. There should be tests and control mechanisms in place to put their implementers under scrutiny, because a policy without a way of enforcing it is just a shadow of a policy.
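One way such tests could look is sketched below in Python: a handful of the principles above turned into automated checks over a description of a published dataset. The field names, the list of formats treated as open, and the checks themselves are my own simplifying assumptions, not an established compliance test; most principles would need human review on top.

```python
# Illustrative assumption: formats treated as open and textual in this sketch.
OPEN_FORMATS = {"csv", "json", "xml"}


def review_dataset(description):
    """Return the names of the principles that the dataset description fails.

    'description' is a hypothetical dict with keys such as 'title',
    'description', 'licence', 'url', 'format' and 'requires_registration'.
    """
    url = description.get("url", "")
    checks = {
        "Discoverability": bool(description.get("title"))
        and bool(description.get("description")),
        "Licence": bool(description.get("licence")),
        "Accessibility": url.startswith(("http://", "https://"))
        and not description.get("requires_registration", False),
        "Non-proprietary data formats": description.get("format", "").lower()
        in OPEN_FORMATS,
    }
    return [principle for principle, passed in checks.items() if not passed]
```

An empty list means that none of these particular checks failed; it does not by itself prove that the dataset is open.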

Sources
