Challenges of open data: privacy

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
In pursuit of their public task, public sector bodies also collect personal data. Such data does not fall under the scope of open data. Principles of open data explicitly exclude personal data from release and expect it to remain closed in well-secured databases.
A common privacy complaint is that the public sector collects more personal information than the minimum it needs. An example in which public data collection posed a potential privacy breach comes from Finland [1]. A Finnish travel system logged every instance in which a travel card was scanned by a reader machine on public transport lines. Since travel cards can be traced to individual persons, this arrangement gave the travel system location data for a large number of people, which was perceived as a violation of privacy. Ultimately, on the basis of data protection legislation, the collection of travel card data was discontinued.
However, in most cases personal data is not collected at an excessive rate and is governed by an access regime that is strictly limited to authorized users from the public sector to prevent accidental leaks of private data. In line with this observation, Marco Fioretti notes that privacy issues of open data have almost always been a non-issue [2].
Nonetheless, a new privacy risk is being recognized: the danger of statistical re-identification. This threat arises from the availability of large amounts of machine-readable data that contains indirect personal identifiers, together with technologies that make it possible to combine such data.
Until now, privacy has been guaranteed by “practical obscurity” [3, p. 867], which existed chiefly due to the difficulty of obtaining and combining data. In many cases, personal data was not recorded at all. Under such conditions, the right to privacy was akin to the right to be forgotten [2]. However, this assumption loses ground when confronted with the ever-increasing amount of data that is currently being recorded and stored.
Data anonymization based on the removal of direct identifiers, such as identity card numbers, is insufficient on its own. A subject may be identified and linked to sensitive information through a combination of indirect identifiers [3, p. 8]. An indirect identifier is a data item that narrows down the set of persons who might be described by the data. Gender is an example of such an identifier. When enough indirect identifiers are combined, they may narrow the set of possible subjects down to a single person.
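The narrowing effect described above can be illustrated with a minimal Python sketch. The records, the field names (gender, birth_year, postcode) and the helper functions candidates and smallest_group are all made up for illustration; they do not come from any of the cited sources.

```python
from collections import Counter

# Hypothetical, made-up records; direct identifiers have already been removed.
records = [
    {"gender": "F", "birth_year": 1980, "postcode": "11000"},
    {"gender": "F", "birth_year": 1980, "postcode": "12000"},
    {"gender": "M", "birth_year": 1980, "postcode": "11000"},
    {"gender": "F", "birth_year": 1975, "postcode": "11000"},
    {"gender": "M", "birth_year": 1975, "postcode": "12000"},
    {"gender": "M", "birth_year": 1980, "postcode": "12000"},
]


def candidates(records, **known):
    """Return the records consistent with what an observer already knows."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]


def smallest_group(records, keys):
    """Size of the smallest group of records sharing the same values for keys."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Each additional indirect identifier shrinks the set of possible subjects.
print(len(candidates(records, gender="F")))                                     # 3
print(len(candidates(records, gender="F", birth_year=1980)))                    # 2
print(len(candidates(records, gender="F", birth_year=1980, postcode="11000")))  # 1

# A smallest group of size 1 means at least one person is uniquely described.
print(smallest_group(records, ["gender"]))                # 3
print(smallest_group(records, ["gender", "birth_year"]))  # 1
```

In this toy data set, gender alone leaves three candidates, while gender, birth year and postcode together single out one record, even though no direct identifier is present.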
There are established techniques for protecting personal privacy in data by limiting the risks of re-identification by statistical methods. Chris Yiu lists several of them, most of which have an adverse impact on data quality and openness [5, p. 26].
  • Access and query control, e.g., filtering query results and limiting their size to samples
  • Anonymisation, or deidentification, such as stripping personal information from data
  • Obfuscation, which may, for example, reduce precision in data by replacing values with ranges
  • Perturbation, introducing random errors into data
  • Pseudonymisation, including replacing persons’ names with identifiers
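The techniques listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the record fields, the bin width, the noise scale and the secret key are all hypothetical, and real deployments would need far more care than this sketch suggests.

```python
import hashlib
import hmac
import random


def limit_results(rows, max_rows, rng):
    """Access and query control: return at most max_rows randomly sampled rows."""
    return rng.sample(rows, min(max_rows, len(rows)))


def anonymise(record, direct_identifiers=("name", "id_card")):
    """Anonymisation: strip direct identifiers from a record."""
    return {k: v for k, v in record.items() if k not in direct_identifiers}


def obfuscate_age(age, width=10):
    """Obfuscation: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value, scale, rng):
    """Perturbation: add a random error of at most +/- scale to a number."""
    return value + rng.uniform(-scale, scale)


def pseudonymise(name, secret_key):
    """Pseudonymisation: replace a name with a stable, keyed identifier."""
    digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


rng = random.Random(0)  # seeded only to make the sketch repeatable
key = b"hypothetical-secret-key"  # assumption: kept private by the publisher
person = {"name": "Jane Example", "id_card": "AB123456", "age": 34, "income": 30000.0}

released = anonymise(person)
released["age"] = obfuscate_age(person["age"])  # "30-39"
released["income"] = perturb(person["income"], 500.0, rng)
released["pseudonym"] = pseudonymise(person["name"], key)
print(released)
```

Each step trades away some precision or completeness, which is the adverse impact on data quality mentioned above. Note also that the indirect identifiers that remain in the released record can still contribute to re-identification when combined.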
Fortunately, both direct and indirect personal identifiers are rare in public sector data. Most of the data tracked by the public sector consists of non-identifiers. Moreover, the data is usually available in aggregated forms and not as microdata that results directly from data collection. Therefore, in most cases, data quality and openness do not need to be compromised due to the requirements of privacy protection.


  1. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010–2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. SHADBOLT, Nigel; O’HARA, Kieron; SALVADORES, Manuel; ALANI, Harith. eGovernment. In DOMINGUE, John; FENSEL, Dieter; HENDLER, James A. (eds.). Handbook of semantic web technologies. Berlin: Springer, 2011, p. 849–910. DOI 10.1007/978-3-540-92913-0_20.
  4. YAKOWITZ, Jane. Tragedy of the data commons. Harvard Journal of Law & Technology. Fall 2011, vol. 25, no. 1. Also available from WWW: http://ssrn.com/abstract=1789749
  5. YIU, Chris. A right to data: fulfilling the promise of open public data in the UK [online]. Research note. March 6th, 2012 [cit. 2012-03-06]. Available from WWW: http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk
