2012-11-22

Sampling CSV headers from the Data Hub

Recently, I decided to check how useful column headers in typical CSV files are. My hunch was that in many cases columns would be labelled ambiguously or that the header row would simply be missing. In such cases the data may be nearly useless, since hints about how to interpret it are lacking.

To support my assumptions about the typical CSV file, I needed sample data. Many such files are listed as downloadable resources in the Data Hub, which is one of the most extensive CKAN instances. Fortunately for me, CKAN exposes a friendly API. However, an even friendlier way to obtain the data was to use the SPARQL endpoint of the Semantic CKAN, which offers access to the Data Hub data in RDF. This is the query that I used:
PREFIX dcat:    <http://www.w3.org/ns/dcat#>
SELECT ?accessURL
WHERE {
  ?s a dcat:Distribution ;
    dcat:accessURL ?accessURL .
  FILTER (STRENDS(STR(?accessURL), "csv"))
}

I saved the query in a file named query.txt and executed it on the endpoint:
curl -H "Accept:text/csv" --data-urlencode "query@query.txt" http://semantic.ckan.net/sparql > files.csv

In the command, I took advantage of the content negotiation provided by OpenLink's Virtuoso and set the HTTP Accept header to the MIME type text/csv. I made curl load the query from the query.txt file and pass it in the query parameter by using the argument "query@query.txt" (thanks to @cygri for this tip). The query results were stored in the files.csv file.

Having a list of CSV files, I was prepared to download them. I created a directory for the files that I wanted to get and moved into it with mkdir download; cd download. To download the CSV files I executed:
tail -n+2 ../files.csv | xargs -n 1 curl -L --range 0-499 --fail --silent --show-error -O 2> fails.txt
To skip the header row containing the SPARQL results variable name, I used tail -n+2. I piped the list of CSV files to curl. I switched the -L argument on in order to follow redirects. To minimize the amount of downloaded data, I used --range 0-499 to fetch a partial response containing only the first 500 bytes from servers that support HTTP/1.1. Finally, I muted curl's progress output with --silent, made it fail on server errors without saving error pages with --fail, kept error messages with --show-error, and redirected them to the fails.txt file.

When the CSV files were retrieved, I concatenated their first lines:
find * | xargs -n 1 head -1 | sort | perl -p -e "s/^M//g" > 1st_lines.txt
head -1 output the first line of every file passed to it through xargs. To polish the output a bit, I sorted it and removed superfluous carriage returns with perl -p -e "s/^M//g" (the ^M stands for a literal carriage return, typed as Ctrl-V Ctrl-M; s/\r//g is equivalent). Finally, I had a list with samples of CSV column headers.

By inspecting the samples, I found that ambiguous column labels are indeed common, as labels such as “amount” or “id” are fairly widespread. Examples of other labels that caught my attention included “A-in-A”, “Column 42” and the particularly mysterious label “X”. Disambiguating such column names would be difficult without additional contextual information, such as examples of data from the columns or supplementary documentation. Such data could be hard to use, especially for automated processing.

2012-10-15

How linked data improves recall via data integration

Linked data is an approach that materializes the relationships between resources described in data. It makes implicit relationships explicit, which makes them reusable. When working with linked data, integration is performed at the level of data, which offloads (some of) the integration costs from data consumers onto data producers. In this post, I compare integration at the query level with integration at the level of data, showing the limits of the former as contrasted with the latter, demonstrated by the improvement in recall when querying the data.

All SPARQL queries featured in this post may be executed on this SPARQL endpoint.

For the purposes of this demonstration, I want to investigate public contracts issued by the city of Prague. If I know a URI of the authority, say <http://ld.opendata.cz/resource/business-entity/00064581>, I can write a simple, naïve SPARQL query and learn that there are 3 public contracts associated with this authority:
## Number of public contracts issued by Prague (without data integration) #
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  ?contract a pc:Contract ;
    pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
    .
}
I can also get the official number of this contracting authority, which was assigned to it by the Czech Statistical Office. This number is “00064581”.
## The official number of Prague #
PREFIX br: <http://purl.org/business-register#>

SELECT ?officialNumber
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
      <http://ld.opendata.cz/resource/business-entity/00064581> br:officialNumber ?officialNumber .
  }
}
Consequently, I can look up all the contracts associated with a contracting authority identified by either the previously used URI or this official number. I get an answer telling me there are 195 public contracts issued by this authority.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    }
  }
}
However, in some cases the official number is missing, so I might want to try the authority’s name as its identifier. Yet expanding my search by adding an option to match the contracting authority on its exact name still gives me 195 public contracts issued by this authority. In effect, in this case the recall is not improved by matching on the authority’s legal name.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
       ?contract pc:contractingAuthority ?authority .
       ?authority gr:legalName "Hlavní město Praha" .
    }
  }
}
Still, I know there might be typing errors in the name of the contracting authority. Listing the distinct legal names of the authority of which I know either the URI or the official number gives me 8 different spelling variants, which might indicate there are more errors in the data.
## Names that are used for Prague as a contracting authority #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT DISTINCT ?legalName
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
      OPTIONAL {
        <http://ld.opendata.cz/resource/business-entity/00064581> gr:legalName ?legalName .
      }
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" ;
        gr:legalName ?legalName .
    }
  }
}
Given the assumption that there might be unmatched instances of the same contracting authority labelled with erroneous legal names, I may want to perform an approximate, fuzzy match when searching for the authority’s contracts. Doing so gives me 717 public contracts that might be attributed to the contracting authority with a reasonable degree of certainty.
## Number of public contracts of Prague with manual integration #
PREFIX br: <http://purl.org/business-register#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
WHERE {
  GRAPH <http://ld.opendata.cz/resource/dataset/isvzus.cz/new> {
    ?contract a pc:Contract .
    {
      ?contract pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581> .
    } UNION {
      ?contract pc:contractingAuthority ?authority .
      ?authority br:officialNumber "00064581" .
    } UNION {
       ?contract pc:contractingAuthority ?authority .
      ?authority gr:legalName ?legalName .
      ?legalName <bif:contains> '"Hlavní město Praha"' .
    }
  }
}
Further integration on the query level would make the query even more complex, or the integration steps might not be expressible within the limits of the query language at all. This approach is both laborious and computationally inefficient, since the equivalence relationships need to be reinvented and recomputed every time the query is created and run.

In contrast, when I use the URI of the contracting authority together with its owl:sameAs links, a simpler query suffices. In this case, 232 public contracts are found. Recall is thus improved, even though it is not as high as with the query that takes the various spellings of the authority’s name into account; this difference may be attributed to the greater precision of interlinking done at the level of data, as opposed to integration at the query level.

The following query harnesses equivalence relationships within the data. It extends the first query shown in this post. In the FROM clause, it adds a new data source to be queried (<http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>), which contains the equivalence links between URIs identifying the same contracting authorities. In addition, the Virtuoso-specific directive DEFINE input:same-as "yes" is turned on, so that owl:sameAs links are followed.
## Number of public contracts of Prague with owl:sameAs links #
DEFINE input:same-as "yes"
PREFIX pc: <http://purl.org/procurement/public-contracts#>

SELECT (COUNT(?contract) AS ?numberOfContracts)
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new>
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>
WHERE {
  ?contract a pc:Contract ;
   pc:contractingAuthority <http://ld.opendata.cz/resource/business-entity/00064581>
   .
}
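The equivalence links themselves may be inspected in a similar fashion. The following query is only a sketch, assuming the beSameAs graph stores plain owl:sameAs triples in either direction; it would list the other URIs under which the same contracting authority appears:
## URIs equivalent to Prague in the interlinking dataset (a sketch) #
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?equivalentURI
FROM <http://ld.opendata.cz/resource/dataset/isvzus.cz/new/beSameAs>
WHERE {
  {
    <http://ld.opendata.cz/resource/business-entity/00064581> owl:sameAs ?equivalentURI .
  } UNION {
    ?equivalentURI owl:sameAs <http://ld.opendata.cz/resource/business-entity/00064581> .
  }
}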

2012-10-08

How the Big Clean addresses the challenges of open data

The Big Clean 2012 is a one-day conference dedicated to three principal themes: screen-scraping, data refining and data-driven journalism. These topics address some of the current challenges of open data, focusing on usability, the misinterpretation of data, and the issue of making data-driven journalism work.

Usability

A key challenge addressed by the Big Clean is refining raw data into usable data. People often fall victim to the fallacy of treating screen-scraped data as a resource that can be used directly, fed straight into visualizations or analysed to yield insights. However, the validity of data must not be taken for granted; it needs to be questioned.
Just as some raw ingredients need to be cooked to become edible, raw data needs to be preprocessed to become usable. Patchy data extracted from web pages should be refined into data that can be relied upon. Cleaning data makes it more regular, error-free and ultimately more usable.
The Big Clean will take this challenge into account in several talks. Jiří Skuhrovec will try to strike a fine balance, considering the question of how much we need to clean. Štefan Urbánek will walk the event's participants through a data processing pipeline. Apart from the invited talks, this topic will be the subject of a screen-scraping workshop led by Thomas Levine. The workshop will run in parallel with the main track of the conference.

Misinterpretation

Access to raw data allows people to take control of the interpretation of data. Effectively, people are taking hold not only of uninterpreted data, but also of the right to interpret it. This is not the case in the current state of affairs, where there is often no access to raw data, since all data is mediated through user interfaces. In such a case, the interface owners control the ways in which data may be viewed. On the contrary, raw data gives you the freedom to interpret data on your own. It allows you to skip the intermediaries and access data directly, instead of limiting yourself to the views provided by the interface owners.
While the loss of control over the presentation of data may be perceived as a loss of control over its meaning, it is actually a call for more explicit semantics in the data: a call for encoding the meaning of data in a way that does not rely on its presentation.
A common excuse for not releasing data held in the public sector is the assumption that the data will be misinterpreted. As reported in Andrew Stott's OKCon 2011 talk, there is a widespread expectation among civil servants that “people will draw superficial conclusions from the data without understanding the wider picture.” First, there is no single correct interpretation of data possessed by the public sector. Instead, there are multiple valid interpretations that may coexist. Second, the fact that data is prone to incorrect interpretation may not attest to the ambiguity of the data, but to the ambiguity of its representation.
Tighter semantics may make the danger of misinterpretation less probable. As examples such as Data.gov.uk in the United Kingdom have shown, one way to encode clearer interpretation rules directly into the data is by using semantic web technologies.

Data-driven journalism

Nevertheless, in most cases public sector data is not self-describing. The data is not smart, and thus the people interpreting it need to be smart. A key group that needs to become smarter at reading the clues conveyed in data comprises journalists. Journalists should read data, not only press releases. As they become data literate, the importance of their work increases. They serve as translators, mediating the understanding derived from data to the wider public. In this way, data-driven journalism contributes to the goal of making data more usable, as stories told with data are more accessible than the data itself.
Raw data opens space for different and potentially competing interpretations. This is the democratic aspect of open data. It invites participation in a shared discourse constructed around the data. A fundamental element of such discourse is the media. Journalists using the data may contribute to this conversation by finding what is new in the data, discovering issues hidden from public oversight or tracing the underlying systemic trends. This is the key contribution of data-driven journalism: providing diagnoses of present-day society.
The principal role of data-driven journalism in the open data ecosystem will be reflected in a couple of talks given at the Big Clean. Liliana Bounegru will explain why data journalism is something you too should care about, and Caelainn Barr will showcase how EU data can be used in journalism.

Practical details

The Big Clean will be held on November 3rd, 2012, at the National Technical Library in Prague, Czech Republic. You can register by following this link. Admission to the event is free.
I hope to see many of you there.

2012-10-04

What makes open data weak will not get engineered away

Open data is still weak but growing strong. Below are a few fairly loose points covering the areas in which open data may need to grow.

  • With the Open Government Partnership, open data is losing its edge. Open data is being assimilated into the current bureaucratic structures. It might be about time to reignite the subversive potential of open data.
  • There is no long-term commitment to open data. All activity in the domain seems to be fragmented into small projects that neither last long nor share their results. We need to find ways to make projects outlive their funding. Open data has an attention deficit disorder.
  • What makes open data weak and strange will not get engineered away. Better tools will not solve the inherent issues in open data, although they might help grow the open data community so that it can solve them. Even though open data might be broken, we should not try to fix it ourselves; we should grow it so that it can fix itself.
  • People are getting lost on the way to realization of the goals of the open data movement. They fall for the distractions encountered on the way and get enchanted by the technology, a mere tool for addressing the goals of open data. People get stuck teaching others how to open and use data, while themselves not doing what they preach. People stop at community building, grasping for momentum using social media.
  • There is legal uncertainty that makes people believe taking legal action is impossible without a lawyer holding their hand. People are careful not to breach what they imagine the law implies. Civil servants are afraid to release the data their institutions hold; citizens are afraid of using data to effect real-world consequences.

2012-10-03

State of open data in the Czech Republic in 2012

During the Open Knowledge Festival 2012 in Helsinki I presented a lightning-fast two-minute summary of four key things that happened with open data in the Czech Republic. Here is a brief recap of the things I mentioned.

One of the most tangible results of the open data community in the past year was the launch of a national portal called “Náš stát” (which stands for “Our state”). It provides an overview of a network of Czech projects working towards improving the Czech public sector with applications and services built on top of its data. One of its main benefits turned out to be that it started unifying disparate organizations that often work on the same issues without knowing they might be duplicating the work of others. We will see in the coming years whether it becomes the proverbial one ring to bind them all.

A Czech local chapter of the Open Knowledge Foundation was conceived and started its incubation phase. So far, we have managed to run several meetups and workshops. Still, we have failed to involve a sufficient number of people contributing their cognitive surplus to the chapter to be able to sustain it in the long term.

This year, data-driven journalism appeared in mainstream news media. Inspired by the Guardian's Datablog, a data blog was set up at iHNed.cz. The blog has become a source of data-driven stories supported by visualizations that regularly make it onto the news site's front page.

Arguably, the main thing related to open data that happened in the Czech Republic during the past year was the commitment to the Open Government Partnership. The Czech Republic has committed to an action plan in which opening government data plays a key role, encompassing the establishment of an official data catalogue and the release of core datasets, such as the company register. On the other hand, there is no money to be spent on the OGP commitments and the list of efforts to date is blank. Thus the work on implementing the commitments is mainly driven by NGOs, which is very much in line with the spirit of “hacking” the Open Government Partnership.

To sum up, there have been both #wins and #fails. We keep calm and carry on.

2012-09-12

Challenges of open data: summary

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data creates opportunities that may end up being missed if the challenges associated with them are left unaddressed. The previous blog posts raised some of the questions the open data “movement” would have to face and resolve in order not to lose these opportunities and restore the faith in the transformative potential of open data.
The open data agenda is biased by its prevailing focus on the supply side of open data and its neglect of the demand side that gets to use the data. A significant part of the challenges associated with open data stems from a narrow-minded view of open data as a technology-triggered change that might be engineered. Although open data brings a change in which technology plays a fundamental role, it is important not to overlook its side effects and the issues that cannot be solved by better engineering.
It is comfortable to abstract away from the issues at hand. So far, the challenges of open data have in most cases been temporarily bypassed. While the essential features of open data are described thoroughly, its impact is left mostly unexplored. In fact, open data advocates frequently substitute their expectations for the actual effects of this relatively new phenomenon. The full implications of open data still need to be worked out. The blog posts about the challenges associated with open data can thus be read as an outline of some of the areas in which further research may be conducted and case studies may be commissioned.

2012-09-11

Challenges of open data: procured data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Not only is the public sector considered unable to deliver applications in a cost-efficient way, it may also lack the ability to collect some kinds of data. There are several kinds of data, including geospatial surveys, that are difficult to gather using the means available to the public sector. The solution that public bodies adopt in such cases is to outsource data collection to private companies. Using the standard procedures of public procurement, the public bodies contract a provider to produce the requested data.
The challenge appears when commercial data suppliers recognize the value of the procured data and become aware of the possibilities for reuse that might generate revenue for them. Hence the suppliers offer the data under the terms of licences that prevent public sector bodies from sharing the data with the public, since releasing it as open data would hamper the suppliers’ prospects of reselling it. Should the public sector require a licence that allows it to open the procured data, the contract price would increase markedly.
Privatisation of the collection of public sector data might be a way to achieve better efficiency [1], yet without a significant investment it prohibits releasing the data as open data. It leaves open the question whether public sector bodies should buy in expensive data in order to share it with others, or whether the infrastructure of the public sector should be enhanced to cater for the acquisition of data that would be difficult to collect without such improvements.
Note: The topic of public sector data obtained through public procurement is the subject of a previous blog post.

References

  1. YIU, Chris. A right to data: fulfilling the promise of open public data in the UK [online]. Research note. March 6th, 2012 [cit. 2012-03-06]. Available from WWW: http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk

2012-09-10

Challenges of open data: trust

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Transparency brought about by the adoption of open data affects trust in the public sector. Current governments experience a crisis of legitimacy [1, p. 58] and lack the trust of citizens. Improved visibility of the workings of public sector bodies, established by open access to their proceedings, enables citizens to track their actions in detail and improves the trust citizens put in these bodies. Nevertheless, the release of open data may reveal many fallacies of public sector bodies, which may produce a temporary disillusion, distrust in government, and loss of interest in politics [2].
The initial assumption of most open data advocates is that the data produced in the public sector may be relied on. However, public sector data cannot be treated as a neutral and uncontested resource. “Unaudited, unverified statistics abound in government data, particularly when outside parties (local government agencies, federal lobbyists, campaign committees) collect the data and turn it over to the government” [1, p. 261]. False data may be fabricated to provide an alibi for corrupt behaviour. For instance, Nithya Raman draws attention to an Indian dataset on urban planning in which non-existent public toilets are present, so that the spending that supposedly goes to the toilets’ maintenance may be justified [3]. Another example demonstrating how false data is contained within public sector data is the exposure of errors in subsidies awarded under the EU Common Agricultural Policy. The data showed that the oldest recipients of these funds, from Sweden, were 100 years old, though both were dead [4, p. 85].
In the light of such facts, it is important to acknowledge that “public confidence in the veracity of government-published information is critical to Open Government Data take-off, essential to spurring demand and use of public datasets” [5]. If the data is regarded as manipulated instead of being recognized as trustworthy, the impact of open data will be significantly diminished.

References

  1. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. RAMAN, Nithya V. Collecting data in Chennai city and the limits of openness. Journal of Community Informatics [online]. 2012 [cit. 2012-04-12], vol. 8, no. 2. Available from WWW: http://ci-journal.net/index.php/ciej/article/view/877/908. ISSN 1712-4441.
  4. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  5. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls

2012-09-09

Challenges of open data: data quality

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is a prerequisite for data that may be depended upon. Yet public sector data may be mired in errors and suffer from unintentional omissions that markedly decrease its usability. For example, Michael Daconta [1] identified 10 common types of mistakes in datasets in the U.S. data portal Data.gov:
  • Omission errors violating data completeness, missing metadata definitions, using codes without providing code lists
  • Formatting errors violating data consistency, syntax errors not fulfilling the requirements of the employed data formats’ specifications
  • Accuracy errors violating correctness, errors breaking range limitations
  • Incorrectly labelled records violating correctness, for example, some datasets misnamed as CSV even though they are just dumps from Excel files that do not meet the standards established in the specification of the CSV data format
  • Access errors referring to incorrect metadata descriptions, for example, not linking to the content described by the link’s label
  • Poorly structured data caused by improper selection of data format, using formats that are inappropriate for the expected uses of data
  • Non-normalized data violating the principle of normalization, which attempts to reduce redundant data by, e.g., removing duplicates
  • Raw database dumps violating relevance, providing raw database dumps that are hard to interpret and use correctly
  • Inflation of counts, a metadata quality issue having an adverse impact on usability, for instance, when datasets pertaining to the same phenomena are not properly grouped and are thus difficult to find
  • Inconsistent data granularity violating the expected quality of metadata, such that datasets use widely varying levels of data granularity without explicit specification
Linked data principles impose a rigour on data that may improve its consistency and quality. At the same time, linked data is more susceptible to corruption caused by “link rot” and the issues that arise when links no longer resolve. For example, in 2006 it was found that 52 % of links from the official parliamentary record of the UK were not functional [2, p. 20]. The reliance on URIs makes it even more important for linked data to adopt URIs that are stable and persistent.

References

  1. DACONTA, Michael. 10 flaws with the data on Data.gov. Federal Computer Week [online]. March 11th, 2010 [cit. 2012-04-10]. Available from WWW: http://fcw.com/articles/2010/03/11/reality-check-10-data-gov-shortcomings.aspx
  2. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf

2012-09-08

Challenges of open data: privacy

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
In the pursuit of their public task, public sector bodies collect personal data as well. Such data does not fall under the scope of open data. Principles of open data explicitly exclude personal data from being released and suppose it is left closed in well-secured databases.
A complaint often heard with regard to privacy is that the public sector collects more personal information than the minimum it needs. An example where public data collection posed a potential privacy breach comes from Finland [1]. A Finnish travel system logged every instance when a travel card was scanned by a reader machine on different public transport lines. Since travel cards can be traced to individual persons, the travel system held location data for a large number of people, which was perceived as a violation of privacy. Ultimately, based on data protection legislation, collection of the travel card data was stopped.
However, in most cases personal data is not collected at an excessive rate and is governed by an access regime strictly limited to authorized users from the public sector, to prevent accidental leaks of private data. In line with this observation, Marco Fioretti notes that privacy has almost always been a non-issue for open data [2].
Nonetheless, a new privacy risk is being recognized in the danger of statistical re-identification. This privacy threat arises from the availability of large amounts of machine-readable data that contains indirect personal identifiers, together with the technologies that allow it to be combined.
So far, privacy has been guaranteed by “practical obscurity” [3, p. 867]. It existed chiefly due to the difficulty of obtaining and combining data. In many cases, personal data was not recorded at all. Under such conditions, the right to privacy was akin to the right to be forgotten [2]. However, this assumption loses ground when confronted with the ever-increasing amount of data that is currently being recorded and stored.
Data anonymization based on the removal of direct identifiers, such as identity card numbers, is insufficient on its own. A subject may be identified and linked to sensitive information through a combination of indirect identifiers [4, p. 8]. An indirect identifier is a data item that narrows down the set of persons who might be described by the data. An example of an indirect identifier that works this way is gender. When enough indirect identifiers are combined, they may narrow the set of subjects they might identify down to a single person.
There are established techniques for protecting personal privacy in data by limiting the risks of re-identification by statistical methods. Chris Yiu lists several of them, most of which have an adverse impact on data quality and openness [5, p. 26]; a sketch of one such technique follows the list.
  • Access and query control, e.g., filtering and limiting size of query results to samples
  • Anonymisation, or deidentification, such as stripping personal information from data
  • Obfuscation, which may, for example, reduce the precision of data by replacing values with ranges
  • Perturbation, introducing random errors into data
  • Pseudonymisation, including replacing persons’ names with identifiers
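To make one of these techniques concrete, below is a minimal sketch of obfuscation expressed as a SPARQL query; the dataset it would run against is hypothetical, and foaf:age merely stands in for whatever property holds the sensitive value. The query exposes only the decade an exact age falls into:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ((FLOOR(?age / 10) * 10) AS ?decade)
WHERE {
  # ?age is the exact, sensitive value; only its decade is returned
  ?person foaf:age ?age .
}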
Fortunately, both direct and indirect personal identifiers are rare in public sector data. Most of the data tracked by the public sector consists of non-identifiers. Moreover, the data is usually available in aggregated forms and not as microdata that results directly from data collection. Therefore, in most cases, data quality and openness do not need to be compromised due to the requirements of privacy protection.

References

  1. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010 — 2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. SHADBOLT, Nigel; O’HARA, Kieron; SALVADORES, Manuel; ALANI, Harith. eGovernment. In DOMINGUE, John; FENSEL, Dieter; HENDLER, James A. (eds.). Handbook of semantic web technologies. Berlin: Springer, 2011, p. 849 — 910. DOI 10.1007/978-3-540-92913-0_20.
  4. YAKOWITZ, Jane. Tragedy of the data commons. Harvard Journal of Law & Technology. Fall 2011, vol. 25, no. 1. Also available from WWW: http://ssrn.com/abstract=1789749
  5. YIU, Chris. A right to data: fulfilling the promise of open public data in the UK [online]. Research note. March 6th, 2012 [cit. 2012-03-06]. Available from WWW: http://www.policyexchange.org.uk/publications/category/item/a-right-to-data-fulfilling-the-promise-of-open-public-data-in-the-uk

2012-09-07

Challenges of open data: misinterpretation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Another argument pointing at the potential risks of disclosing public data was presented by Lawrence Lessig in an article titled Against transparency [1], in which he draws attention to the adverse effects of misinterpretation of public data. He highlights the issues that arise when the monopoly on interpretation is removed and members of the public are provided with raw, uninterpreted data [2, p. 2]. Disintermediation causes decontextualization of public sector data that may lead to highly divergent interpretations of the same data [3]. Such change may be perceived as a loss of the control the civil servants used to have. Instead of an “official” interpretation of open data, there would potentially be a plurality of “competing” and possibly conflicting interpretations, some of which may be driven by malicious interests.
Lessig claims, with respect to the allegedly shortening attention spans of members of the public, that it is easier to come up with an incorrect judgement based on public data than with one based on solid understanding [1]. The ability to correctly interpret data is largely limited to people with sufficient expertise and data literacy skills. Moreover, Archon Fung and David Weil argue that the way open data is disclosed is conducive to a pessimistic view of the public sector. They claim that “the systems of open government that we’re building - structures that facilitate citizens’ social and political judgments - are much more disposed to seeing the glass of government as half or even one-quarter empty, rather than mostly full” [4, p. 107]. Such conditions may also make users of data susceptible to apophenia, the phenomenon of seeing patterns that do not actually exist [5, p. 2]. In fact, Lessig writes, confronted with the vast amounts of available public data, ignorance is a rational investment of attention [1]. Without a significant time investment and data literacy skills, people will usually come to shallow and premature conclusions from their examination of public data. Unfounded conclusions may be quickly adopted and spread by the media, which may cause significant harm to the reputation of public sector bodies, civil servants, or politicians before these assertions are re-examined and proven false. For example, unverified oversimplifications may be drawn from public data to support political campaigns. Open data can be misused for skewed interpretations supporting political actions, casting suspicion on the public image of politicians targeted by discreditation campaigns.
Misinterpretations may increase distrust in the public sector. Thus, Lessig makes the case for disclosing only a limited amount of the public data prone to misinterpretation [Ibid.]. Even though he does not completely oppose transparency initiatives, he warns that careful consideration should be given to releasing sensitive information that may be misused for defamation.
Unrestricted access to the communication channels provided by new media gives a strong voice to all competing interpretations, unhindered by the filtering mechanisms of traditional publishing. This state of affairs allows unfounded claims and rumours to be amplified and spread with an impact that was previously impossible to achieve, causing harm to personal reputations and the public image of government. Fortunately, the self-repairing properties of communication networks eventually lead to the rebuttal of misinformation. The openness of public data thus brings not only greater control of the public sector, but indirectly also better control of unproven claims.

References

  1. LESSIG, Lawrence. Against transparency: the perils of openness in government. The New Republic [online]. October 9th, 2009 [cit. 2012-03-29]. Available from WWW: http://www.tnr.com/article/books-and-arts/against-transparency
  2. DAVIES, Tim. Open data, democracy and public sector reform: a look at open government data use from data.gov.uk [online]. Based on an MSc Dissertation submitted for examination in Social Science of the Internet, University of Oxford. August 2010 [cit. 2012-03-09]. Available from WWW: http://www.opendataimpacts.net/report/wp-content/uploads/2010/08/How-is-open-government-data-being-used-in-practice.pdf
  3. KAPLAN, Daniel. Open public data: then what? Part 1 [online]. January 28th, 2011 [cit. 2012-04-10]. Available from WWW: http://blog.okfn.org/2011/01/28/open-public-data-then-what-part-1/
  4. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  5. BOYD, Danah; CRAWFORD, Kate. Six provocations for big data. In Proceedings of A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 21 — 24 September 2011, University of Oxford. Oxford (UK): Oxford University, 2011. Also available from WWW: http://ssrn.com/abstract=1926431

2012-09-06

Challenges of open data: data literacy

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Even though open data bridges the data divide between the public sector and members of the public, it might be introducing a new data divide separating those with the resources to make use of the data from those without them. Despite the fact that open data virtually eliminates the cost of data acquisition, the cost of use remains “sufficiently high to compromise the political impact of open data” [1, p. 11].
An oft-cited quote attributed to Francis Bacon claims that “knowledge is power”. If data is a source of knowledge, then opening it up creates a shift in access to a source of power. However, equal access to data implies neither equal use nor equal empowerment, as transforming data into power requires more than access. Leaving aside the concerns of unequal access addressed by the agenda of the digital divide, while the principles of open data lead to the removal of barriers to access, they do not remove all barriers to use. In this respect, it is vitally important to distinguish between the “opportunity” and the actual “realization” of use of open data [2]. Even though everyone may have an equal opportunity to access and use open data, only some are able to achieve “effective use” [Ibid.]. In the light of this assertion, open data empowers only the already empowered: those who have access to the technologies and computer skills necessary to make use of the data.
The belief in the transformative potential of open data is based on optimistic assumptions about citizens’ data literacy. The technocratic perspective with which open data principles are drafted takes the high level of skills necessary for working with data for granted. Thus, open data initiatives are in a way exclusive, as they are limited mostly to technically inclined citizens [3, p. 268].
The minimalist role of the public sector, withdrawn into the background to serve as a platform, proceeds from the supposition that members of society have all the necessary ingredients to make effective use of open government data, such as a high level of information processing capabilities [4]. Even though ICT penetration and internet connectivity may be sufficient to access open data, they are not enough to make use of it. What is also needed are the abilities to process and interpret the data. However, open data released in a raw form may not be easily digestible without substantial proficiency in data processing. Therefore, the technical expertise users are required to possess to process the data should not be underestimated.
The bottom line is that access to data may in fact increase asymmetry in society. If all interest groups have equal access to public sector information, then we can expect the better organized and better equipped groups to make better use of it [5]. The asymmetry may stem from the fact that the interest groups able to take advantage of the newly released information will prosper at the expense of the groups that cannot.
On the other hand, this type of inequality is in a sense natural. Such a state of affairs should not be considered final, but rather a starting point. David Eaves compares the challenge of increasing data literacy to increasing literacy through libraries and reminds us that “we didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate” [6]. In the same way, we do not publish open data expecting everyone to be able to use it. The data is released because access is a necessary prerequisite of use. Direct access to data by empowered, technically skilled infomediaries may become a basis for indirect access for many more [7]. From this perspective, the most effective uses of open data can be thought of as those that let others make effective use of the data.

References

  1. MCCLEAN, Tom. Not with a bang but with a whimper: the politics of accountability and open data in the UK. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-19]. Also available from WWW: http://ssrn.com/abstract=1899790
  2. GURSTEIN, Michael. Open data: empowering the empowered or effective data use for everyone? First Monday [online]. February 7th, 2011 [cit. 2012-04-01], vol. 16, no. 2. Available from WWW: http://firstmonday.org/htb in/cgiwrap/bin/ojs/index.php/fm/article/view/3316/2764
  3. BERTOT, John C.; JAEGER, Paul T.; GRIMES, Justin M. Using ICTs to create a culture of transparency: e-government and social media as openness and anti-corruption tools for societies. Government Information Quarterly. July 2010, vol. 27, iss. 3, p. 264 — 271. DOI 10.1016/j.giq.2010.03.001.
  4. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls
  5. SHIRKY, Clay. Open House thoughts, Open Senate direction. In Open House Project [online]. November 23rd, 2008 [cit. 2012-04-19]. Available from WWW: http://groups.google.com/group/openhouseproject/msg/53867cab80ed4be9
  6. EAVES, David. Learning from libraries: the literacy challenge of open data [online]. June 10th, 2010 [cit. 2012-04-11]. Available from WWW: http://eaves.ca/2010/06/10/learning-from-libraries-the-literacy-challenge-of-open-data/
  7. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/

2012-09-05

Challenges of open data: usability

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Considering usability as a property of interfaces, raw data provides a difficult one. Data is largely too unwieldy to be used by most people. For example, 50 % of the respondents in Socrata’s open data study said that the data was unusable [1]. Poor usability may also be correlated with the low level of use most open data sources receive.
The requirements on the usability of open data reviewed in a previous blog post prove difficult to satisfy. The usability barrier may be especially high when dealing with linked open data, as reported in the previous post about the usability of linked data. Yet it is important not to sacrifice the generative potential of open data to the low usability of the underlying technologies.
The challenge of usability requires data producers to adopt a user-centric perspective. The following blog posts highlight the increased need for data literacy, which is necessary for interacting with open data, and warn of the dangers of incorrect interpretations drawn from data.

References

  1. Socrata. 2010 open government data benchmark study [online]. Version 1.4. Last updated January 4th, 2011 [cit. 2012-04-07]. Available from WWW:

2012-09-04

Challenges of open data: information overload

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
As more and more data is released into the open, there is a growing danger that irrelevant data might flood the data that is important [1]. Only a few of the available datasets contain “actionable” information, and there is no effective filtering mechanism to track them down. With open data, “we have so many facts at such ready disposal that they lose their ability to nail conclusions down, because there are always other facts supporting other interpretations” [2].
The sheer volume of the existing open data makes it difficult to comprehend. At such scale there is a need for tools that make the large amounts of data intelligible. Edd Dumbill writes that “big data may be big. But if it’s not fast, it’s unintelligible” [3].
While human processing does not scale, machine processing does. Thus, the challenge of information overload highlights the need for machine-readable data. Big, yet sufficiently structured data may be automatically pre-processed and filtered to “small data” that people can manage to work with. For example, linked data may be effectively filtered with precise SPARQL queries harnessing its rich structure.
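For illustration, here is a minimal sketch of such filtering over a hypothetical dataset of public contracts; the dcterms:issued property is an assumption about how the publication date is modelled. The query reduces an arbitrarily large dataset to a small, recent slice that a person can work with:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pc:      <http://purl.org/procurement/public-contracts#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

SELECT ?contract ?issued
WHERE {
  # Keep only contracts published since the start of 2012
  ?contract a pc:Contract ;
    dcterms:issued ?issued .
  FILTER (?issued >= "2012-01-01"^^xsd:date)
}
LIMIT 100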
Scaling the processing of large amounts of machine-readable data with well-defined structure may be considered solved. However, the current challenge is to deal with the heterogeneity of data from different sources.

Heterogeneity

Not only is there a perceived information overload, there is also an overload of different and incompatible ways of representing information. What we have built out of different data formats and modelling approaches resembles the proverbial “Tower of Babel”. In this state of affairs, the data available on the Web constitutes a high-dimensional, heterogeneous data space.
Nonetheless, it is in managing heterogeneous data sources where linked data excels. Linking may be considered a lightweight, pay-as-you-go approach to the integration of disparate datasets [4]. Semantic web technologies also address the intrinsic heterogeneity of data sources by providing means to model varying levels of formality, quality, and completeness [5, p. 851].

Comparability

A key quality of data that suffers from heterogeneity is comparability. According to the SDMX content-oriented guidelines, comparability is defined as “the extent to which differences between statistics can be attributed to differences between the true values of the statistical characteristics” [6, p. 13]. In other words, it is the extent to which differences in data can be attributed to differences in the measured phenomena.
Improving the comparability of data hence means minimizing the unwanted interferences that skew the data. Influences leading to distortion of data may originate from differences in schemata, differing conceptualizations of the domains described in the data, or incompatible data handling procedures. Eliminating such influences maximizes the evidence in data, which then reflects the observed phenomena more directly.
The importance of comparability surfaces especially in data analysis tasks. Insights yielded from analyses then feed into decision support and policy making. Comparability also supports the transparency of public sector data because it clears the view of public administration. It enables easier audits of public sector bodies thanks to the possibility of abstracting from the ways the data was collected. On the other hand, incomparable data corrupts the monitoring of public sector bodies, and imprecise monitoring leaves ample space for systemic inefficiencies and potential corruption.
The publication model of linked data has in-built comparability features, which come from the requirement of using common, shared standards. RDF provides a commensurate structure through its data model, to which linked data is required to conform. The emphasis on reuse of shared conceptualizations, such as RDF vocabularies, ontologies, and reference datasets, provides for comparable data content.
In the network of linked data, the “bandwagon” effect increases the probability of adoption of a set of core reference datasets, which further reinforces the positive feedback loop. Core reference data may be used to link other datasets to enhance their value. Such datasets attract the most in-bound links, which leads to the emergence of “linking hubs”. These de facto reference datasets derive their status from their highly reusable content. An example of this type of dataset is DBpedia, which provides machine-readable data based on Wikipedia. Its prime position may be illustrated by the Linked Open Data Cloud, in the centre of which it is prominently placed, indicating the high number of datasets linking to it.
In contrast to these datasets, traditional reference sources are established through the authority of their publishers, which is reflected in policies that prescribe the use of such datasets. Datasets of this type include knowledge organization systems, such as classifications or code lists, that offer shared conceptualizations of particular domains. A prototypical example of an essential reference dataset is the International System of Units, a source of shared units of measurement. In contrast with the linking hubs of linked data, traditional reference datasets are, for the most part, not available in RDF and therefore not linkable.
The effect of using both kinds of reference data is the same. The conceptualizations they construct offer reference concepts that make the data referring to them comparable. A trivial example illustrating this point is the use of the same units of measurement, which enables data to be sorted in an expected order.
Data might need to be converted prior to comparison with other datasets. In such a case, there is a need for comparability at the level of the data to which the incomparable datasets refer. Linked data makes this possible through linking, the same technique it applies to data integration. With techniques such as ontology alignment, mappings between reference datasets may be established to serve as proxies for the purpose of data comparison. Ultimately, the machine-readable relationships in linked data make it outperform other ways of representing data when it comes to the ability to draw comparisons.
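A sketch of such a comparison follows; the ex: vocabulary is hypothetical, and the skos:exactMatch links from local unit URIs to shared reference units are assumed to exist in the data:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex:   <http://example.org/vocabulary#>

SELECT ?measurement ?value ?referenceUnit
WHERE {
  # Each dataset records values with its own local unit URI
  ?measurement ex:value ?value ;
    ex:unit ?localUnit .
  # The mapping serves as a proxy that makes the values comparable
  ?localUnit skos:exactMatch ?referenceUnit .
}
ORDER BY ?referenceUnit ?value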

References

  1. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  2. WEINBERGER, David. Too big to know. New York (NY): Basic Books, 2012. ISBN 978-0-465-02142-0.
  3. DUMBILL, Edd (ed.). Planning for big data: a CIO’s handbook to the changing data landscape [ebook]. Sebastopol: O’Reilly, 2012, 83 p. ISBN 978-1-4493-2963-1.
  4. HEATH, Tom; BIZER, Chris. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.
  5. SHADBOLT, Nigel; O’HARA, Kieron; SALVADORES, Manuel; ALANI, Harith. eGovernment. In DOMINGUE, John; FENSEL, Dieter; HENDLER, James A. (eds.). Handbook of semantic web technologies. Berlin: Springer, 2011, p. 849 — 910. DOI 10.1007/978-3-540-92913-0_20.
  6. SDMX. SDMX content-oriented guidelines. Annex 1: cross-domain concepts. 2009. Also available from WWW: http://sdmx.org/wp-content/uploads/2009/01/01_sdmx_cog_annex_1_cdc_2009.pdf

2012-09-03

Challenges of open data: implementation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data publishers may perceive the adoption of linked open data to have daunting entry barriers. In particular, they are aware of the high demands on expertise for publishing linked data, which is deemed to have a steep learning curve. The linked data publishing model poses requirements that may seem difficult to meet. The Frequently Observed Problems on the Web of Data [1] testify to that.
Therefore, “it is vital to follow a realistic, practical and inexpensive approach” [2]. Fortunately, linked data facilitates incremental, evolutionary information management. Its deployment may follow a step-by-step approach, adopting iterative development for continuous improvement. For example, before switching database technology, linked data publishers could start by caching their legacy databases in triple stores. Another way to cushion the demands of linked data adoption is to minimise the ontological commitment by creating small ontologies that may be gradually linked together, as sketched below.
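As an illustration of that last point, a mapping between two such small ontologies may be added after the fact as a handful of triples; in this sketch, the ontology URIs are hypothetical and the mapping is expressed as a SPARQL 1.1 update:
PREFIX owl: <http://www.w3.org/2002/07/owl#>

INSERT DATA {
  # Declare that a class from one small ontology is equivalent to a
  # class from another, linking the two ontologies after the fact
  <http://example.org/ontology-a#Supplier>
    owl:equivalentClass <http://example.org/ontology-b#Vendor> .
}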
Two implementation challenges associated with the adoption of linked open data in the public sector will be dealt with in detail: resistance to change in the public sector and the maturity of the linked data technology stack.

Resistance to change

The rhetoric of open data supporters puts an emphasis on bureaucracy as a major barrier to opening data in the public sector. There is a tendency to frame the politics of access to data as a struggle between the public sector, which has an inbred attachment to secrecy, and members of the public, who are depicted as individuals rather than groups [3, p. 7].
While this view seems to be biased, institutional inertia may pose a challenge to the adoption of open data, which may require a “cultural change in the public sector” [4]. The transition from the status quo may be significantly hindered by the established culture in public administration. “A major impediment is an entrenched closed culture in many government organisations as a result of the fear of disclosing government failures and provoking political escalation and public outcry” [5]. The intangible problem of the closed mindset prevailing in the public sector proves difficult to resolve. And so, in many ways, the adoption of open data “isn’t a hardware retirement issue, it’s an employee retirement one” [6].
Resistance to change is not the only barrier hindering the adoption of open data. A hurdle commonly encountered by open data advocates is that civil servants perceive open data as an additional workload that lacks clear justification [7, p. 70]. Unlike citizens, who are allowed to do everything that is not prohibited, public servants are allowed to do only what laws and policies order them to do. Voluntary adoption of open data at the lower levels of public administration is thus highly unlikely. It requires a policy to push open data through.
However, it may be the existing policies themselves that make the change difficult. In general, the public sector is subject to special obstacles that impede the adoption of new technologies. For example, the combination of strict data handling procedures and the constricted possibilities of a limited budget may effectively stop any technological change [7]. That is why there must be a strong commitment to open data at the upper levels of the public sector in order to put through the necessary amendments to existing data handling policies.

Technology maturity

Semantic web technologies underlying linked data were long thought of as not being ready for adoption in enterprise settings and in the public sector. In 2010, the linked data technology stack was not perceived to be ready for large-scale adoption in the public sector. John Sheridan reported three key things missing [8]:
  • Repeatable design patterns
  • Supportive tools
  • Commoditization of linked data APIs
At that time, standards were mature enough, but their translation into repeatable design patterns applicable in practice was lacking. This has since changed. Several sources recommend established design patterns (e.g., [9], [10], [11]), supportive tools have been developed and packaged (e.g., the LOD2 Stack), and frameworks for developing custom APIs based on linked data have been created (e.g., the Linked Data API mentioned in a previous blog post). Linked data has matured progressively in recent years, and so it may be argued that it is ready to be implemented in the public sector.

References

  1. HOGAN, Aidan; CYGANIAK, Richard. Frequently observed problems on the web of data [online]. Version 0.3. November 13th, 2009 [cit. 2012-04-23]. Available from WWW: http://pedantic-web.org/fops.html
  2. ALANI, Harith; CHANDLER, Peter; HALL, Wendy; O’HARA, Kieron; SHADBOLT, Nigel; SZOMSZOR, Martin. Building a pragmatic semantic web. IEEE Intelligent Systems. May—June 2008, vol. 23, iss. 3, p. 61 — 68. Also available from WWW: http://eprints.soton.ac.uk/265787/1/alani-IEEEIS08.pdf. ISSN 1541-1672. DOI 10.1109/MIS.2008.42.
  3. MCCLEAN, Tom. Not with a bang but with a whimper: the politics of accountability and open data in the UK. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-19]. Also available from WWW: http://ssrn.com/abstract=1899790
  4. GRAY, Jonathan. The best way to get value from data is to give it away. Guardian Datablog [online]. December 13th, 2011 [cit. 2011-12-14]. Available from WWW: http://www.guardian.co.uk/world/datablog/2011/dec/13/eu-open-government-data
  5. VAN DEN BROEK, Tijs; KOTTERINK, Bas; HUIJBOOM, Noor; HOFMAN, Wout; VAN GRIEKEN, Stefan. Open data need a vision of smart government. In Share-PSI Workshop: Removing the Roadblocks to a Pan-European Market for Public Sector Information Re-use [online]. 2011 [cit. 2012-03-09]. Available from WWW: http://share-psi.eu/submitted-papers/
  6. DUMBILL, Edd (ed.). Planning for big data: a CIO’s handbook to the changing data landscape [ebook]. Sebastopol: O’Reilly, 2012, 83 p. ISBN 978-1-4493-2963-1.
  7. HALONEN, Antti. Being open about data: analysis of the UK open data policies and applicability of open data [online]. Report. London: Finnish Institute, 2012 [cit. 2012-04-05]. Available from WWW: http://www.finnish-institute.org.uk/images/stories/pdf2012/being%20open%20about%20data.pdf
  8. ACAR, Suzanne; ALONSO, José M.; NOVAK, Kevin (eds.). Improving access to government through better use of the Web [online]. W3C Interest Group Note. May 12th, 2009 [cit. 2012-04-06]. Available from WWW: http://www.w3.org/TR/egov-improving/
  9. SHERIDAN, John; TENNISON, Jeni. Linking UK government data. In BIZER, Christian; HEATH, Tom; BERNERS-LEE, Tim; HAUSENBLAS, Michael (eds.). Linked Data on the Web: proceedings of the WWW 2010 Workshop on Linked Data on the Web, April 27th, 2010, Raleigh, USA. Aachen: RWTH Aachen University, 2010. CEUR workshop proceedings, vol. 628. ISSN 1613-0073.
  10. DODDS, Leigh; DAVIS, Ian. Linked data patterns [online]. Last changed 2011-08-19 [cit. 2011-11-05]. Available from WWW: http://patterns.dataincubator.org
  11. HEATH, Tom; BIZER, Chris. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.
  12. HYLAND, Bernadette; VILLAZÓN-TERRAZAS, Boris; CAPADISLI, Sarven. Cookbook for open government linked data [online]. Last modified on February 20th, 2012 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook

2012-09-02

Challenges of open data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data not only opens new opportunities, it also poses new challenges. These challenges point to the limits of openness and to shortcomings of the approaches used to put linked open data into practice in the public sector.
The top 10 barriers and potential risks for the adoption of open data in the public sector, compiled by Noor Huijboom and Tijs van den Broek [1, p. 7], comprise the following.
  • closed government culture
  • privacy legislation
  • limited quality of data
  • limited user-friendliness/information overload
  • lack of standardisation of open data policy
  • security threats
  • existing charging models
  • uncertain economic impact
  • digital divide
  • network overload
Some of these challenges will be discussed in detail in the following blog posts. In particular, they will cover the difficulties that may be encountered during implementation of linked open data, information overload and the problems of scalable processing of large, heterogeneous datasets, usability of raw data, issues of personal data protection, deficiencies in data quality, adverse effects of open data on trust in the public sector, and finally the unresolved question of opening data obtained via public procurement.

References

  1. HUIJBOOM, Noor; VAN DEN BROEK, Tijs. Open data: an international comparison of strategies. European Journal of ePractice [online]. March/April 2011 [cit. 2012-04-30], no. 12. Available from WWW: http://www.epractice.eu/files/European%20Journal%20epractice%20Volume%2012_1.pdf. ISSN 1988-625X.

2012-09-01

Impacts of open data: journalism

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The availability of data and data processing tools gives birth to a new paradigm in journalism that is commonly referred to as data-driven journalism. It refers to the practice of basing journalistic articles on hard data, which makes it possible to back up claims with well-founded evidence.
Whereas data-driven journalism is grounded in data, unverified claims abound in traditional journalistic practice. To address this deficiency, data-driven journalism may employ open data sources to cross-verify claims. Data triangulation combining disparate sources may establish the validity of the verified claims.
If data-driven journalists strive to draw closer to objectivity, they need to share their sources to achieve transparency. Sharing the underlying data is an imperative of data-driven journalism, so that others can see what led to the insights conveyed in articles. In the light of such transparency, claims made by journalists may be verified by third parties and trust may be established.
The best known examples of data-driven journalism include the Guardian’s Datablog or ProPublica.

2012-08-31

Impacts of open data: business

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
There is no direct return on investment in open data. As a matter of fact, the economic impact of releasing open data is difficult, if not impossible, to anticipate and quantify prior to the publication date. The causal chain connecting open data as a cause with its economic effects is particularly unreliable. However, it seems feasible to recount the effect on business after the moment data is made accessible. For instance, an analyst may consider the number of uses by businesses, comparing how it changed before and after the data was opened [1]. Accordingly, the economic value of open data is better considered indirect.
Given the way open data affects the economy, estimates of the market size for public sector data are based on methodologies that are insufficient to come up with accurate numbers. For example, most of the studies evaluating the economic impact of opening up data in the public sector were based on extrapolations from research conducted on a smaller scale. In his study for the European Commission, Graham Vickery assessed the aggregate volume of the direct and indirect economic impacts of opening public sector information in the EU member countries at EUR 140 billion annually [2, p. 4]. In contrast with this number, estimates of the direct revenue from selling public sector information were much lower; Vickery quantified it at EUR 1.4 billion [Ibid., p. 5].
Open data opens new opportunities for private businesses. It allows new business models to appear, including crowdsourced administration of public property by services such as FixMyStreet. Another example of a business that is based on public sector data is BrightScope, which delivers financial information for investors. An area that may benefit the most from the availability of public sector data is location-based services. The EU Directive on the re-use of public sector information was reported to have the strongest impact on the growth of the market for geospatial data, which is essential for such services to be operated [Ibid., p. 20].
The opportunities offered by open data are particularly important for small and medium enterprises. These businesses are a prime target for the reuse of open data since they usually cannot afford to pay public bodies the charges for data that is not open. The stimulation of economic activities may result in new jobs being created. Availability of public data may give rise to a whole new sector of “independent advisers” who add value to the data by making it more digestible to citizens [3]. More businesses eventually generate more tax revenue, which ultimately promises to return the investment in open data back to the budget from which the public sector is funded.
Open data fosters product and service innovation. It especially affects the areas of forecasting, prediction, and optimization. For example, the European Union makes its official documents available in all languages of the EU member states. This multilingual corpus is used as a training set for machine translation algorithms in Google Translate, leading to an improvement in the quality of its service [4].
At the same time, open data disrupts existing business models that are based on exclusive arrangements for data provision by public sector bodies to companies. This is how businesses that thrive on barriers to access to public data are made obsolete. Open data weeds out companies that hoard public data for their benefit and establishes an environment in which all businesses have an equal opportunity to reuse public sector data for their commercial interests.

References

  1. ORAM, Andy. European Union starts project about economic effects of open government data. O’Reilly Radar [online]. June 11th, 2010 [cit. 2012-04-09]. Available from WWW: http://radar.oreilly.com/2010/06/european-union-starts-project.html
  2. VICKERY, Graham. Review of the recent developments on PSI re-use and related market developments [online]. Final version. Paris, 2011 [cit. 2012-04-19]. Available from WWW: http://ec.europa.eu/information_society/policy/psi/docs/pdfs/report/psi_final_version_formatted.docx
  3. HIRST, Tony. So what's open government data good for? Government and “independent advisers”, maybe? [online]. July 7th, 2011 [cit. 2012-04-07]. Available from WWW: http://blog.ouseful.info/2011/07/07/so-whats-open-government-data-good-for-government-maybe/
  4. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010 — 2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/

2012-08-30

Impacts of open data: participation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data enables better interaction between citizens and governments through the Web [1]. It redresses the information asymmetry between the public sector and citizens [2] by advocating that everyone should have the same conditions for use of public sector data as the public body from which the data originates. Sharing public data facilitates universal participation since no one is excluded from reusing and redistributing open data [3].
Open data opens the possibility of citizen self-service. It makes the public more self-reliant, which reduces the need for government regulation [4]. It makes it possible to tap into the cognitive surplus and improve public services with the crowdsourced work of the public. One of the main benefits of open data lies in third-party-developed citizen services [5, p. 40]. Citizens may thus become more involved in public affairs, which ultimately leads to a more participatory democracy.

References

  1. ACAR, Suzanne; ALONSO, José M.; NOVAK, Kevin (eds.). Improving access to government through better use of the Web [online]. W3C Interest Group Note. May 12th, 2009 [cit. 2012-04-06]. Available from WWW: http://www.w3.org/TR/egov-improving/
  2. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls
  3. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010 — 2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/
  4. TAUBERER, Joshua. Open data is civic capital: best practices for “open government data” [online]. Version 1.5. January 29th, 2011 [cit. 2012-03-17]. Available from WWW: http://razor.occams.info/pubdocs/opendataciviccapital.html
  5. LONGO, Justin. #OpenData: digital-era governance thoroughbred or new public management Trojan horse? Public Policy & Governance Review. Spring 2011, vol. 2, no. 2, p. 38 — 51. Also available from WWW: http://ssrn.com/abstract=1856120

2012-08-29

Impacts of open data: disintermediation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Who draws and controls the maps controls how other people see the world [1]. Who interprets data from the public sector controls how other people see the things described in the data. By releasing raw open data, the public sector also releases its total control over the interfaces in which the data is presented. In this way, the interpretive dominance of the public sector over its data is abolished, and it no longer controls how citizens should see the world described in the data [2]. Civil servants perceive this as a loss of control over the released data, but in fact, it is only a loss of control over the interfaces in which the data is presented.
Providing raw data is an example of disintermediation. It reduces the frictions and inherent cognitive biases that come with interpretations by intermediaries. It allows users to skip the intermediaries that stand between them and access to raw data. For example, both civil servants producing reports based on primary data and journalists transforming data into narratives conveyed in articles serve as intermediaries that affect how the public perceives public sector data.
Depending on the type of use, mediation may be either a barrier or a help. It is a barrier for those who want to access raw data to interpret it themselves. However, common perception has it that too few people are interested in raw data [3, p. 71]. Yet one should not make such generalizations, as there is evidence that suggests otherwise. For example, after the release of data from the Norwegian meteorological institute, the institute registered more data downloads (14.8 million) than page views (4.5 million). These numbers were given by Anton Eliassen, the institute’s director, during the first plenary on the revised public sector information directive at the ePSI Platform Conference 2012. In general, raw data receives relatively few downloads, yet access to raw data is vital for building new applications on top of the data.
Disintermediation creates a demand for reintermediation. Mediation helps users who need user-friendly translations of data in order to reach understanding. Applications mediating data in accessible and compelling ways, such as visualizations, may attract a lot of attention, proving the demand for public sector data. For instance, this happened in the case of the UK crime statistics, the visualization of which crashed under the weight of 18 million requests per hour at the time of its release [4].

References

  1. ERLE, Schuyler; GIBSON, Rich; WALSH, Jo. Mapping hacks: tips & tools for electronic cartography. Sebastopol: O’Reilly, 2005, 568 p. ISBN 978-0-596-00703-4.
  2. BARNICKEL, Nils; HÖFIG, Edzard; KLESSMANN, Jens; SOTO, Juan. Organisational and societal obstacles to implementations of technical systems supporting PSI re-use. In Share-PSI Workshop: Removing the Roadblocks to a Pan-European Market for Public Sector Information Re-use [online]. 2011 [cit. 2012-03-08]. Available from WWW: http://share-psi.eu/submitted-papers/
  3. HALONEN, Antti. Being open about data: analysis of the UK open data policies and applicability of open data [online]. Report. London: Finnish Institute, 2012 [cit. 2012-04-05]. Available from WWW: http://www.finnish-institute.org.uk/images/stories/pdf2012/being%20open%20about%20data.pdf
  4. TRAVIS, Alan; MULHOLLAND, Hélène. Online crime maps crash under weight of 18 million hits an hour. Guardian [online]. February 1st, 2011 [cit. 2012-04-17]. Available from WWW: http://www.guardian.co.uk/uk/2011/feb/01/online-crime-maps-power-hands-people

2012-08-28

Impacts of open data: efficiency

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The public sector itself is the primary user of public sector data. Open access to public data thus impacts the way the public sector operates. While the initial costs of opening up data may turn out to be significant, adopting open data promises to deliver cost savings in the long run, enabling public bodies to operate more efficiently. “There is a body of evidence which suggests that proactive disclosure encourages better information management and hence improves a public authority’s internal information flows” [1, p. 69]. For instance, open data produces cost savings through cheaper information provision and more efficient development of applications providing services to citizens.
In information provision, as in health services, prevention is cheaper than therapy [2]. Prevention via proactive disclosure is presumed to be more cost-efficient than therapy via acting on the demand of freedom of information requests [3, p. 25]. Open data saves the effort spent on responding to freedom of information requests by providing the requested data in advance. In this way, the effort of providing data is expended only once, instead of being repeated for every request for the same data. Although the initial set-up overhead for open data may be higher, it is supposed to lower the per-interaction overhead, as the following back-of-the-envelope comparison illustrates.
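As a rough sketch with all quantities hypothetical: let S be the one-off set-up cost of proactive disclosure, c_o the marginal cost of serving already published data, c_f the average cost of answering a single freedom of information request, and n the number of requests avoided. Proactive disclosure pays off once
\[ S + n \cdot c_o < n \cdot c_f \quad \Longleftrightarrow \quad n > \frac{S}{c_f - c_o} \]
so the higher set-up cost is recovered whenever the data is requested often enough.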
Open data promotes a new way of information management that may streamline data handling procedures and curb unnecessary expenditures. By eliminating the costs associated with access to public sector data, the adoption of open data removes the expense of acquiring data from public sector bodies that sell it. In effect, better interagency coordination is established, which lessens administrative friction. Given the reduced workload, it may lead to the elimination of some clerical jobs [2], which will produce savings on labour costs.
A common argument in favour of open data is based on the observation that the public sector is not capable of creating applications providing services to citizens in a cost-efficient way. Commissioning software for the public sector must pass through the protracted process of public procurement. Such a procedure is slow to respond to users’ demands, and the resulting applications may end up being costly. With openly available public sector data, the public sector is no longer the only producer that can deliver applications based on the data. Third parties may take the data and produce applications on their own, substituting for the applications subsidized by the public sector. This is how a more cost-efficient means of producing applications may be devised.
The way in which open data improves the efficiency of the public sector is not limited to monetary savings. The internal impact of open data also encompasses improvements in data quality achieved by harnessing feedback from citizens. It may also inform the way the public sector is governed through evidence-based policies.
Opening data enables anybody to inspect it. Feedback from users probing the data puts pressure on the public sector to improve data quality. Better quality data enables better quality service delivery, improving the pursuit of the public task on many levels, such as better responsiveness to citizen feedback. Based on user feedback, the collection of less used datasets may be discontinued, leading to a more responsive and user-oriented data disclosure.
The quality of data influences the quality of the policy that is based upon it [4]. Data may become a source for a more efficient, evidence-based policy. Public policies may be improved by considering data as an input and as evidence of the phenomena to be regulated. They should be made with publicly available data [Ibid., p. 384], empirical data that is open to public scrutiny [5, p. 4], in order to keep the policy creators accountable.

References

  1. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. HALONEN, Antti. Being open about data: analysis of the UK open data policies and applicability of open data [online]. Report. London: Finnish Institute, 2012 [cit. 2012-04-05]. Available from WWW: http://www.finnish-institute.org.uk/images/stories/pdf2012/being%20open%20about%20data.pdf
  4. NAPOLI, Philip M.; KARAGANIS, Joe. On making public policy with publicly available data: the case of U.S. communications policymaking. Government Information Quarterly. October 2010, vol. 27, iss. 4, p. 384 — 391. DOI 10.1016/j.giq.2010.06.005.
  5. SHADBOLT, Nigel. Towards a pan EU data portal — data.gov.eu. Version 4.0. December 15th, 2010 [cit. 2012-03-10]. Available from WWW: http://ec.europa.eu/information_society/policy/psi/docs/pdfs/towards_an_eu_psi_portals_v4_final.pdf

2012-08-27

Impacts of open data: accountability

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Transparency feeds into accountability. “In the world of big data correlations surface almost by themselves. Access to data creates a culture of accountability” [1]. Open data makes it possible to hold politicians accountable by comparing their promises with data showing how those promises are put into practice. For example, unfavourable audit results based on open data may cost a politician reelection.
Public scrutiny of governmental data may reveal fraud or abuse of public funds. With public data available for everyone to check, we may see a rise of the so-called “armchair auditing.” In the same way, open data improves the function of “watchdog” institutions, such as non-governmental organizations dedicated to overseeing government transparency. In this way, open data increases civic engagement, leading to a more participatory democracy and better democratic control.
Open data makes it possible to apply crowdsourcing to monitoring institutions and their performance as described in the data. Rufus Pollock illustrated the opportunities of leveraging citizen feedback by saying that “to many eyes all anomalies are noticeable,” paraphrasing the quote “given enough eyeballs, all bugs are shallow”, which Eric S. Raymond coined as Linus’s Law in honour of Linus Torvalds. Accordingly, releasing data to the public allows the data to be verified or inspected for quality for free.

References

  1. Data, data everywhere. Economist. February 25th, 2010. Also available from WWW: http://www.economist.com/node/15557443

2012-08-26

Impacts of open data: transparency

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Transparency of the public sector reflects the ability of the public to see what is going on. David Weinberger declares that transparency is the new objectivity [1], a change that he claims stems from the transformation of the current knowledge ecosystem into one that is inherently network-based. Transparency replaces the long-discredited objectivity as a source of veracity and reliability [2].
Transparency serves as fraud prevention. It puts the public sector under a peer pressure based on the fact that anybody can inspect its public proceedings. The peer supervision makes it more difficult for civil servants to profit from the control they have and to abuse the powers vested in them. By increasing the risk of exposure of venal activities, it lowers systemic corruption [3, p. 9]. In effect, members of the public may hold civil servants accountable for corruption, illegal takeover of subsidies, or plain budgetary waste [4, p. 80].
An illustrative example of the self-regulating effects of transparency was presented in [5, p. 110]. In 1997, restaurants in Los Angeles County were ordered to post highly visible letter grades on their front windows. The grades (A, B, C) were based on the results of County Department of Health Services inspections probing hygiene maintenance in the restaurants. The ready availability of evidence on insanitary practices in food handling made it easier for people to make better choices about restaurants and helped them avoid restaurants that were deemed unsafe to eat at. The introduction of this policy proved to have a significant impact both on the restaurants and their customers. Revenues at C-grade restaurants dropped, while those of A-grade restaurants increased, leading over time to a growth in the number of clean restaurants and a steep decline of the poorly performing ones. The policy also improved the health of the restaurants’ customers, with hospitalizations caused by food-borne illnesses decreasing from 20 % to 13 %.
Transparency has an ambiguous impact on trust in the public sector. While there is a positive impression of stronger control over the public sector, at the same time more failures are identified, which chips away at the trust in public affairs. Furthermore, transparency makes citizens aware of how vulnerable to manipulation public sector data is.
Open data shapes the reality it measures [6, p. 3]. When communicating, the sender conveying information modifies its content based on the perceived context of communication. Evaluation of the way of communication, the expected audience, and other circumstances factored into the communication context impacts what messages are sent. Open data establishes a new context with a wider and less defined range of potential recipients and a different set of expectations about the effect of the communicated data. Such re-contextualization may affect what gets released and in what form. Data may be distorted so that it supports only the interpretations its producers expect [7]. As a result, some data may end up withheld from the public, while other data may turn out to misrepresent the phenomena it bears witness to. At the same time, the change brought about by the obligation to disclose data may have positive consequences by forcing public bodies “to rethink, reorganize and streamline their delivery before going online” [8, p. 448].
As the control is ultimately in the hands of civil servants, data disclosure may be shaped as required by various interest groups, including politicians or lobbyists. This illuminates the fact that there is no direct causation between open data and open government. “A government can be an ‘open government,’ in the sense of being transparent, even if it does not embrace new technology” [9, p. 2]. Only politically important and sensitive disclosures take government further on its way to open government. “A government can provide ‘open data’ on politically neutral topics even as it remains deeply opaque and unaccountable” [Ibid., p. 2]. This reflects what Ellen Miller from the Sunlight Foundation calls the danger of a mere “transparency theater”. This is nothing new in politics. For instance, questions that politicians get asked may be moderated to include only those that are not sensitive and do not require the interviewee to disclose any delicate facts.
It also indicates that there is a limit to transparency, a limit that Joshua Tauberer named the “Wonderlich Transparency Paradox” [10]. It is named after John Wonderlich from the Sunlight Foundation, who once wrote that “How ever far back in the process you require public scrutiny, the real negotiations [...] will continue fervently to exactly that point” [11]. Some parts of the processes in the public sector are exempted from disclosure to provide a “space to think” [4, p. 74]. However, this paradox shows that no matter how thorough and deep the transparency of the public sector is, the real decision-making processes will always have a chance to elude what is recorded and exposed for public scrutiny.
Everything may be abused, and transparency is no different. For example, releasing data about how well civil servants are paid may be used to identify targets for bribery. Disclosing the salaries of politicians helps lobbyists find a low-paid politician who is an easier target for corruption. Whether a terrorist watch list should be made open is another difficult question [5, p. 4].
These examples showcase the unintended consequences of opening data. What these concerns illustrate is that transparency is obviously not a panacea, and it would be naïve to think it is. Open data is not an end in itself, and transparency by itself is an input, not an output [12].

References

  1. WEINBERGER, David. Too big to know. New York (NY): Basic Books, 2012. ISBN 978-0-465-02142-0.
  2. WEINBERGER, David. Transparency is the new objectivity [online]. July 19th, 2009 [cit. 2012-04-25]. Available from WWW: http://www.hyperorg.com/blogger/2009/07/19/transparency-is-the-new-objectivity/
  3. BERLINER, Daniel. The political origins of transparency. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-29]. Also available from WWW: http://ssrn.com/abstract=1899791
  4. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  5. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O’Reilly, 2010. ISBN 978-0-596-80435-0.
  6. BOYD, Danah; CRAWFORD, Kate. Six provocations for big data. In Proceedings of A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 21 — 24 September 2011, University of Oxford. Oxford (UK): Oxford University, 2011. Also available from WWW: http://ssrn.com/abstract=1926431
  7. KAPLAN, Daniel. Open public data: then what? Part 1 [online]. January 28th, 2011 [cit. 2012-04-10]. Available from WWW: http://blog.okfn.org/2011/01/28/open-public-data-then-what-part-1/
  8. HAZLETT, Shirley-Ann; HILL, Frances. E-government: the realities of using IT to transform the public sector. Managing Service Quality. 2003, vol. 13, iss. 6, p. 445 — 452. ISSN 0960-4529. DOI 10.1108/09604520310506504.
  9. YU, Harlan; ROBINSON, David G. The new ambiguity of “open government” [online]. Princeton CITP / Yale ISP Working Paper. Draft of February 28th, 2012. Available from WWW: http://ssrn.com/abstract=2012489
  10. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/
  11. WONDERLICH, John. Pelosi reverses on 72 hour promises? In Open House Project [online]. November 7th, 2009 [cit. 2012-04-19]. Available from WWW: http://groups.google.com/group/openhouseproject/msg/94060a876083d86a
  12. SHIRKY, Clay. Open House thoughts, Open Senate direction. In Open House Project [online]. November 23rd, 2008 [cit. 2012-04-19]. Available from WWW: http://groups.google.com/group/openhouseproject/msg/53867cab80ed4be9