2011-09-24

Technical openness of open data

Apart from the legal requirements on open data, there are also aspects of technical openness. While the legal aspects are explicitly defined by the Open Definition, there is less understanding of the technical recommendations for making data open. Some principles of this side of openness are covered by the Three laws of open data by David Eaves, others are proposed in the Linked Open Data star scheme. An excellent resource that touches on both legal and technical requirements for open data is 8 open government data principles.

Data need to be formalized so that we can serialize them to representations that may be exchanged. However, there are different formalizations that may be used for communicating data, different formats that are more or less open. I think open technologies for representing data share a set of family resemblances. So, open data are:

Non-exclusive
Open data are not published exclusively for a particular application. No application has exclusive access to open data. Instead, they are available to be used by any application and thus support a wide range of uses.

Non-proprietary
No entity has exclusive control over non-proprietary data formats. Such formats have an open specification that may be implemented by anyone. Therefore, data in these formats are not tightly coupled to any specific software that is able to read them.

Standards-based
The data are based on open, community-owned standards. This means the standards are developed in an open process that may be joined by anyone from the public (i.e., not Schema.org). Such standards prescribe a set of rules the data have to adhere to. Standardized data have an expected format, which ensures interoperability, and as such can be used by a plethora of standards-compliant tools.

Machine-readable
Open data are formalized enough so that machines are able to use them. Well-formalized data have a structure that enables their automated machine processing. For instance, unlike a scanned document stored as an image, which is one opaque blob, open data have a higher granularity because they are segmented into well-defined data items (e.g., rows, columns, triples).
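To illustrate the difference in granularity, here is a minimal sketch in Python that reads a small CSV fragment into well-defined rows and columns. The sample records and the `read_records` helper are made up for this illustration; they do not come from any real dataset.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = """id,title,year
1,Open data handbook,2011
2,Linked data primer,2010
"""


def read_records(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE_CSV)

# Each data item can be addressed on its own, which is not possible
# with a scanned image of the same table.
titles = [record["title"] for record in records]
```

Because every value sits in a named column of a distinct row, a program can filter, sort, or merge the records without any human interpretation.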

Findable
Open data should be publicly available on the Web. This means having URLs that successfully return a representation of the data. Data should be directly accessible by resolving their URL. Any technical barriers preventing access to the data, such as passwords or required registration, are unacceptable, as are any attempts to hide the data and achieve security through obscurity via anti-SEO techniques. As David Eaves puts it, if Google cannot find it, no one can.
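A rough way to test the "directly accessible" part is to resolve the URL without any credentials and see whether a representation comes back. The following Python sketch assumes a plain HTTP(S) URL; the function name `is_directly_accessible` and the `open_url` parameter are my own inventions for the example, and the check says nothing about whether search engines can find the data.

```python
import urllib.error
import urllib.request


def is_directly_accessible(url, open_url=urllib.request.urlopen):
    """Return True if resolving the URL yields a 2xx response.

    No credentials are sent, so a resource hidden behind a password
    or a registration wall will typically fail this check.
    `open_url` is a hypothetical hook that allows substituting the
    opener, e.g. when no network connection is available.
    """
    try:
        with open_url(url, timeout=10) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 401 or 403 as well as network failures.
        return False
```

Calling `is_directly_accessible("http://example.org/dataset.csv")` with a placeholder URL like this one would perform a real request, so treat the result as a hint rather than a verdict.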

Linkable
Elements of open data should be identified with URIs, so that it is possible to link to them. This approach encourages re-use, data integration, and proper attribution of data used as a source.

Linked
If your open data are linked to other open data, users can follow these links to discover more. Being a part of the Web of data brings the benefits of network effects.
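The last two properties can be sketched with a handful of triples in plain Python. The resource URIs under example.org are hypothetical and the helper functions are invented for this illustration; the property URIs come from the Dublin Core Terms and SKOS vocabularies. A real deployment would more likely use an RDF library and dereference the URIs over HTTP.

```python
# Triples as (subject, predicate, object) tuples.
# The example.org URIs are hypothetical, made up for illustration.
TRIPLES = [
    ("http://example.org/book/1",
     "http://purl.org/dc/terms/title",
     "Open data handbook"),
    ("http://example.org/book/1",
     "http://purl.org/dc/terms/subject",
     "http://example.org/subject/open-data"),
    ("http://example.org/subject/open-data",
     "http://www.w3.org/2004/02/skos/core#prefLabel",
     "Open data"),
]


def is_uri(value):
    """Treat any value starting with an HTTP(S) scheme as a link."""
    return value.startswith(("http://", "https://"))


def linked_resources(triples, subject):
    """Return the URIs that the given subject links to."""
    return [o for s, p, o in triples if s == subject and is_uri(o)]


def describe(triples, subject):
    """Return the properties of a subject as a predicate-to-object dict."""
    return {p: o for s, p, o in triples if s == subject}


# Follow the links from the book to discover more about its subject.
for uri in linked_resources(TRIPLES, "http://example.org/book/1"):
    print(uri, describe(TRIPLES, uri))
```

Because the subject heading has its own URI, the book record can point to it instead of copying its label, and anyone else can point to the same heading from their own data.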

As you might have guessed from the previous points, I think that linked data is a very open technology. And, if you look at the 5 stars of linked data, its author Tim Berners-Lee thinks the same. So if you want to make your data more open, adopting linked data is a step in the right direction.

Open bibliographic data checklist

I have decided to write a few points that might be of interest to those thinking about publishing open bibliographic data. The following is a fragment of an open bibliographic data checklist, or, how to release your library's data into the public without a lawyer holding your hand.

I have been interested in open bibliographic data for a couple of years now, and I try to promote them at the National Technical Library, where we have, so far, released only an authority dataset, the Polythematic Structured Subject Heading System. The following points are based on my experience with this topic. So what should you pay attention to when opening your bibliographic data?

  • Make sure you are the sole owner of the data or make arrangements with other owners. For instance, things may get complicated if the data were created collaboratively via shared cataloguing. If you are not in complete control of the data, then start by consulting the other proprietors that have a stake in the datasets.
  • Check whether the data you are about to release are bound by any contractual obligations. For example, you may publish a dataset under a Creative Commons licence, only to realize soon after that there are unresolved contracts with parties that helped fund the creation of that data years ago. Then you need to discuss the issue with the parties involved to determine whether making the data open is a problem.
  • Read your country's legislation to find out what you are able to do with your data. For instance, in the Czech Republic it is not possible to put data into the public domain intentionally. The only way public domain content is created is by the natural order of things, i.e., the author dies, leaves no heir, and after quite some time the work enters the public domain.
  • See if the data are copyrightable. If the data do not fall within the scope of the copyright law of your country, they are not suitable to be licensed under Creative Commons, since this set of licences draws its legal force from copyright law; it is an extension of copyright and builds on it. Facts are not copyrightable and most bibliographic records are made of facts. However, some contain creative content, for example, subject indexing or an abstract, and as such are appropriate for licensing based on copyright law. Your mileage may vary.
  • Consult the database act. Check if your country has a specific law dealing with the use of databases that might add more requirements that need your attention. For example, in some legal regimes databases are protected on another level, as an aggregation of individual data elements.
  • Different licensing options may be applicable to the content and the structure of a dataset, for instance when there are additional terms required by database law. You can opt for dual licensing and use two different licences, one for the dataset's content that is protected by copyright law (e.g., a Creative Commons licence), and one for the dataset's structure, to which copyright protection may not apply (e.g., Public Domain Dedication and License).
  • Choose a proper licence. A proper open licence is a licence that conforms to the Open Definition (and will not get you sued), so pick one of the OKD-Compliant licences. A good source of solid information about licences for open data is Open Data Commons.
  • BONUS: Tell your friends. Create a record in the Data Hub (formerly CKAN) and add it to the bibliographic data group to let others know that your dataset exists.

Even if it may seem there are lots of things you need to check before releasing open bibliographic data, it is actually easy. It is a performative speech act: you only need to declare your data open to make them open.

<disclaimer>If you are unsure about some of the steps above, consult a lawyer. Note that the usual disclaimers apply to this post, i.e., IANAL.</disclaimer>