2012-08-31

Impacts of open data: business

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
There is no direct return on investment in open data. In fact, the economic impact of releasing open data is difficult, if not impossible, to anticipate and quantify prior to publication. The causal chain connecting open data as a cause with its economic effects is particularly unreliable. However, it is feasible to assess the effect on business after the data is made accessible. For instance, an analyst may compare the number of uses by businesses before and after the data was opened [1]. Accordingly, the economic value of open data is better regarded as indirect.
Given the way open data affects the economy, estimates of the market size for public sector data rest on methodologies that are insufficient to produce accurate figures. For example, most studies evaluating the economic impact of opening up public sector data were based on extrapolations from research conducted on a smaller scale. In his study for the European Commission, Graham Vickery assessed the aggregate volume of the direct and indirect economic impacts of opening public sector information in the EU member countries at EUR 140 billion annually [2, p. 4]. In contrast, estimates of the direct revenue from selling public sector information were much lower; Vickery put it at EUR 1.4 billion [Ibid., p. 5].
Open data creates new opportunities for private businesses. It allows new business models to appear, including crowdsourced administration of public property by services such as FixMyStreet. Another example of a business based on public sector data is BrightScope, which delivers financial information for investors. The area that may benefit the most from the availability of public sector data is location-based services. The EU Directive on the reuse of public sector information was reported to have the strongest impact on the growth of the market for geospatial data, which is essential for operating such services [Ibid., p. 20].
The opportunities offered by open data are particularly important for small and medium enterprises. These businesses are a prime target for the reuse of open data since they usually cannot afford to pay the charges levied by public bodies for data that is not open. Stimulation of economic activities may result in new jobs being created. The availability of public data may give rise to a whole new sector of “independent advisers” who add value to the data by making it more digestible to citizens [3]. More businesses eventually generate more tax revenue, which ultimately promises to return the investment in open data back to the budget from which the public sector is funded.
Open data fosters product and service innovation. It especially affects the areas of forecasting, prediction, and optimization. For example, the European Union makes its official documents available in all languages of the EU member states. This multilingual corpus is used as a training set for machine translation algorithms in Google Translate, leading to an improvement in the quality of its service [4].
At the same time, open data disrupts existing business models that are based on exclusive arrangements for data provision by public sector bodies to companies. This is how businesses that thrive on barriers to accessing public data are made obsolete. Open data weeds out companies that hoard public data for their benefit and establishes an environment in which all businesses have an equal opportunity to reuse public sector data for their commercial interests.

References

  1. ORAM, Andy. European Union starts project about economic effects of open government data. O’Reilly Radar [online]. June 11th, 2010 [cit. 2012-04-09]. Available from WWW: http://radar.oreilly.com/2010/06/european-union-starts-project.html
  2. VICKERY, Graham. Review of the recent developments on PSI re-use and related market developments [online]. Final version. Paris, 2011 [cit. 2012-04-19]. Available from WWW: http://ec.europa.eu/information_society/policy/psi/docs/pdfs/report/psi_final_version_formatted.docx
  3. HIRST, Tony. So what's open government data good for? Government and “independent advisers”, maybe? [online]. July 7th, 2011 [cit. 2012-04-07]. Available from WWW: http://blog.ouseful.info/2011/07/07/so-whats-open-government-data-good-for-government-maybe/
  4. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010 — 2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/

2012-08-30

Impacts of open data: participation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data enables better interaction between citizens and governments through the Web [1]. It redresses the information asymmetry between the public sector and citizens [2] by advocating that everyone should have the same conditions for use of public sector data as the public body from which the data originates. Sharing public data facilitates universal participation since no one is excluded from reusing and redistributing open data [3].
Open data opens the possibility of citizen self-service. It makes the public more self-reliant, which reduces the need for government regulation [4]. It makes it possible to tap into the cognitive surplus of the public and improve public services with crowdsourced work. One of the main benefits of open data lies in citizen services developed by third parties [5, p. 40]. Citizens may thus become more involved in public affairs, which ultimately leads to a more participatory democracy.

References

  1. ACAR, Suzanne; ALONSO, José M.; NOVAK, Kevin (eds.). Improving access to government through better use of the Web [online]. W3C Interest Group Note. May 12th, 2009 [cit. 2012-04-06]. Available from WWW: http://www.w3.org/TR/egov-improving/
  2. GIGLER, Bjorn-Soren; CUSTER, Samantha; RAHEMTULLA, Hanif. Realizing the vision of open government data: opportunities, challenges and pitfalls [online]. World Bank, 2011 [cit. 2012-04-11]. Available from WWW: http://www.scribd.com/WorldBankPublications/d/75642397-Realizing-the-Vision-of-Open-Government-Data-Long-Version-Opportunities-Challenges-and-Pitfalls
  3. DIETRICH, Daniel; GRAY, Jonathan; MCNAMARA, Tim; POIKOLA, Antti; POLLOCK, Rufus; TAIT, Julian; ZIJLSTRA, Ton. The open data handbook [online]. 2010 — 2012 [cit. 2012-03-09]. Available from WWW: http://opendatahandbook.org/
  4. TAUBERER, Joshua. Open data is civic capital: best practices for “open government data” [online]. Version 1.5. January 29th, 2011 [cit. 2012-03-17]. Available from WWW: http://razor.occams.info/pubdocs/opendataciviccapital.html
  5. LONGO, Justin. #OpenData: digital-era governance thoroughbred or new public management Trojan horse? Public Policy & Governance Review. Spring 2011, vol. 2, no. 2, p. 38 — 51. Also available from WWW: http://ssrn.com/abstract=1856120

2012-08-29

Impacts of open data: disintermediation

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Whoever draws and controls the maps controls how other people see the world [1]. Whoever interprets data from the public sector controls how other people see the things described in the data. By releasing raw open data, the public sector also gives up total control over the interfaces in which the data is presented. In this way, the interpretive dominance of the public sector over its data is abolished, and it no longer controls the way citizens see the world described in the data [2]. Civil servants perceive this as a loss of control over the released data, but in fact it is only a loss of control over the interfaces in which the data is presented.
Providing raw data is an example of disintermediation. It reduces the frictions and inherent cognitive biases that come with interpretations by intermediaries. It allows users to skip the intermediaries that stand between them and access to raw data. For example, both civil servants producing reports based on primary data and journalists transforming data into narratives conveyed in articles serve as intermediaries that affect how the public perceives public sector data.
Depending on the type of use, mediation may be either a barrier or a help. It is a barrier for those who want to access raw data and interpret it themselves. However, common perception has it that too few people are interested in raw data [3, p. 71]. Yet one should not make such generalizations, as there is evidence that suggests otherwise. For example, after the release of data from the Norwegian Meteorological Institute, the institute registered more data downloads (14.8 million) than page views (4.5 million). These numbers were given by Anton Eliassen, the institute’s director, during the first plenary on the revised public sector information directive at the ePSI Platform Conference 2012. In general, raw data receives relatively few downloads, yet access to it is vital for building new applications on top of the data.
Disintermediation creates a demand for reintermediation. Mediation helps users who need user-friendly translations of data in order to reach understanding. Applications mediating data in ways that are accessible and compelling, such as visualizations, may attract a lot of attention, proving the demand for public sector data. For instance, this happened in the case of the UK crime statistics, the visualization of which crashed under the weight of 18 million requests per hour at the time it was released [4].

References

  1. ERLE, Schuyler; GIBSON, Rich; WALSH, Jo. Mapping hacks: tips & tools for electronic cartography. Sebastopol: O’Reilly, 2005, 568 p. ISBN 978-0-596-00703-4.
  2. BARNICKEL, Nils; HÖFIG, Edzard; KLESSMANN, Jens; SOTO, Juan. Organisational and societal obstacles to implementations of technical systems supporting PSI re-use. In Share-PSI Workshop: Removing the Roadblocks to a Pan-European Market for Public Sector Information Re-use [online]. 2011 [cit. 2012-03-08]. Available from WWW: http://share-psi.eu/submitted-papers/
  3. HALONEN, Antti. Being open about data: analysis of the UK open data policies and applicability of open data [online]. Report. London: Finnish Institute, 2012 [cit. 2012-04-05]. Available from WWW: http://www.finnish-institute.org.uk/images/stories/pdf2012/being%20open%20about%20data.pdf
  4. TRAVIS, Alan; MULHOLLAND, Hélène. Online crime maps crash under weight of 18 million hits an hour. Guardian [online]. February 1st, 2011 [cit. 2012-04-17]. Available from WWW: http://www.guardian.co.uk/uk/2011/feb/01/online-crime-maps-power-hands-people

2012-08-28

Impacts of open data: efficiency

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The public sector itself is the primary user of public sector data. Open access to public data thus impacts the way the public sector operates. While the initial costs of opening up data may turn out to be significant, adopting open data promises to deliver cost savings in the long run, enabling public bodies to operate more efficiently. “There is a body of evidence which suggests that proactive disclosure encourages better information management and hence improves a public authority’s internal information flows” [1, p. 69]. For instance, open data produces cost savings through cheaper information provision and more efficient development of applications providing services to citizens.
For information provision, similarly to health services, prevention is cheaper than therapy [2]. Prevention via proactive disclosure is presumed to be more cost-efficient than therapy via responding to freedom of information requests on demand [3, p. 25]. Open data saves the effort spent on responding to freedom of information requests by providing the requested data in advance. In this way, the effort of providing data is expended only once, instead of being repeated for each request for the same data. Although the initial set-up overhead for open data may be higher, it is supposed to lower the per-interaction overhead.
Open data promotes a new way of managing information that may streamline data handling procedures and curb unnecessary expenditure. By eliminating the costs associated with access to public sector data, the adoption of open data removes the expense of acquiring data from public sector bodies that sell it. In effect, better interagency coordination is established, which lessens administrative friction. Given the reduced workload, this may lead to the elimination of some clerical jobs [2], which will produce savings on labour costs.
A common argument in favour of open data is based on the observation that the public sector is not capable of creating applications providing services to citizens in a cost-efficient way. Commissioning software for the public sector must pass through the protracted process of public procurement. Such a procedure is slow to respond to users’ demands, and the resulting applications may end up being costly. With openly available public sector data, the public sector is no longer the only producer that can deliver applications based on the data. Third parties may take the data and produce applications on their own, substituting for the applications subsidized by the public sector. In this way, a more cost-efficient means of producing applications may be devised.
The way in which open data improves the efficiency of the public sector is not limited to monetary savings. The internal impact of open data also encompasses improvements in data quality achieved by harnessing feedback from citizens. It may also inform the way the public sector is governed through evidence-based policies.
Opening data enables anybody to inspect it. Feedback from users probing the data puts pressure on the public sector to improve data quality. Better quality data enables better quality service delivery, improving the pursuit of public tasks on many levels, such as better responsiveness to citizen feedback. Based on user feedback, the collection of little-used datasets may be discontinued, leading to a more responsive and user-oriented data disclosure.
The quality of data influences the quality of the policy that is based upon it [4]. Open data may thus become a source for more efficient, evidence-based policy. Public policies may be improved by considering data as an input, as evidence of the phenomena the policy addresses, and they should be made with publicly available data [Ibid., p. 384], empirical data that is open to public scrutiny [5, p. 4], in order to keep policy makers accountable.

References

  1. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  2. FIORETTI, Marco. Open data, open society: a research project about openness of public data in EU local administration [online]. Pisa, 2010 [cit. 2012-03-10]. Available from WWW: http://stop.zona-m.net/2011/01/the-open-data-open-society-report-2/
  3. HALONEN, Antti. Being open about data: analysis of the UK open data policies and applicability of open data [online]. Report. London: Finnish Institute, 2012 [cit. 2012-04-05]. Available from WWW: http://www.finnish-institute.org.uk/images/stories/pdf2012/being%20open%20about%20data.pdf
  4. NAPOLI, Philip M.; KARAGANIS, Joe. On making public policy with publicly available data: the case of U.S. communications policymaking. Government Information Quarterly. October 2010, vol. 27, iss. 4, p. 384 — 391. DOI 10.1016/j.giq.2010.06.005.
  5. SHADBOLT, Nigel. Towards a pan EU data portal — data.gov.eu. Version 4.0. December 15th, 2010 [cit. 2012-03-10]. Available from WWW: http://ec.europa.eu/information_society/policy/psi/docs/pdfs/towards_an_eu_psi_portals_v4_final.pdf

2012-08-27

Impacts of open data: accountability

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Transparency feeds into accountability. “In the world of big data correlations surface almost by themselves. Access to data creates a culture of accountability” [1]. Open data makes it possible to hold politicians accountable by comparing their promises with data showing how those promises are put into practice. For example, unfavourable audit results based on open data may cost a politician reelection.
Public scrutiny of governmental data may reveal fraud or abuse of public funds. With public data available for everyone to examine, we may see a rise of so-called “armchair auditing.” In the same way, it improves the function of “watchdog” institutions, such as non-governmental organizations dedicated to overseeing government transparency. In this way, open data increases civic engagement, leading to a more participatory democracy and better democratic control.
Open data makes it possible to crowdsource the monitoring of institutions and their performance as described in the data. Rufus Pollock illustrated the opportunities of leveraging citizen feedback by saying that “to many eyes all anomalies are noticeable,” paraphrasing Linus’s Law that “given enough eyeballs, all bugs are shallow.” Accordingly, releasing data to the public allows the data to be verified and its quality inspected for free.

References

  1. Data, data everywhere. Economist. February 25th, 2010. Also available from WWW: http://www.economist.com/node/15557443

2012-08-26

Impacts of open data: transparency

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Transparency of the public sector reflects the ability of the public to see what is going on. David Weinberger declares that transparency is the new objectivity [1], a change that he claims stems from the transformation of the current knowledge ecosystem into one that is inherently network-based. Transparency takes over the role of long-discredited objectivity in that it is used as a source of veracity and reliability [2].
Transparency serves for fraud prevention. It puts the public sector under peer pressure based on the fact that anybody can inspect its public proceedings. This peer supervision makes it more difficult for civil servants to profit from the control they have and to abuse the powers vested in them. By increasing the risk of exposure of venal activities, it lowers systemic corruption [3, p. 9]. In effect, members of the public may hold civil servants accountable for corruption, unlawful acquisition of subsidies, or plain budgetary waste [4, p. 80].
An illustrative example of the self-regulating effects of transparency was presented in [5, p. 110]. In 1997, restaurants in Los Angeles County were ordered to post highly visible letter grades on their front windows. The grades (A, B, C) were based on the results of County Department of Health Services inspections probing hygiene maintenance in the restaurants. The ready availability of evidence on unsanitary food-handling practices made it easier for people to make better choices about restaurants and helped them avoid restaurants deemed unsafe to eat at. The introduction of this policy proved to have a significant impact on both the restaurants and their customers. Revenues at C-grade restaurants dropped, while those of A-grade restaurants increased, leading over time to growth in the number of clean restaurants and a steep decline in the number of poorly performing ones. The policy also improved the health of the restaurants’ customers, with hospitalizations caused by food-borne illnesses decreasing from 20 % to 13 %.
Transparency has an ambiguous impact on trust in the public sector. While there is a positive impression of stronger control over the public sector, at the same time more failures are identified, which chips away at the trust in public affairs. Furthermore, transparency makes citizens aware of how vulnerable to manipulation public sector data is.
Open data shapes the reality it measures [6, p. 3]. When communicating, the sender conveying information modifies its content based on the perceived context of communication. Evaluation of the means of communication, the expected audience, and other circumstances factored into the communication context impacts what messages are sent. Open data establishes a new context with a wider and less defined range of potential recipients and a different set of expectations about the effect of the communicated data. Such re-contextualization may affect what gets released and in what form. Data may be distorted so that it supports only the interpretations data producers expect [7]. As a result, some data may end up withheld from the public, while other data may turn out to misrepresent the phenomena it bears witness to. At the same time, the change brought about by the obligation to disclose data may have positive consequences by forcing public bodies “to rethink, reorganize and streamline their delivery before going online” [8, p. 448].
As control is ultimately in the hands of civil servants, data disclosure may be shaped as required by various interest groups, including politicians or lobbyists. This illuminates the fact that there is no direct causation between open data and open government. “A government can be an ‘open government,’ in the sense of being transparent, even if it does not embrace new technology” [9, p. 2]. Only politically important and sensitive disclosures take a government further on its way to open government. “A government can provide ‘open data’ on politically neutral topics even as it remains deeply opaque and unaccountable” [Ibid., p. 2]. This reflects what Ellen Miller from the Sunlight Foundation calls the danger of a mere “transparency theater”. This is nothing new in politics. For instance, the questions politicians get asked may be moderated to include only those that are not sensitive and do not require the interviewee to disclose any delicate facts.
It also indicates that there is a limit to transparency, a limit that Joshua Tauberer entitled the “Wonderlich Transparency Paradox” [10]. It is named after John Wonderlich from the Sunlight Foundation, who once wrote that “How ever far back in the process you require public scrutiny, the real negotiations [...] will continue fervently to exactly that point” [11]. Some parts of the processes in the public sector are exempted from disclosure to provide a “space to think” [4, p. 74]. However, this paradox shows that no matter how thorough and deep the transparency of the public sector is, the real decision-making processes will always have a chance to elude what is recorded and exposed for public scrutiny.
Everything may be abused, and transparency is no different. For example, releasing data about how well civil servants are paid may be used to identify targets for bribery. Disclosing the salaries of politicians helps lobbyists find a low-paid politician who is an easier target for corruption. Whether a terrorist watch list should be made open is another difficult question [5, p. 4].
These examples showcase the unintended consequences of opening data. What these concerns illustrate is that transparency is not a panacea, and it would be naïve to think it is. Open data is not an end in itself, and transparency by itself is an input, not an output [12].

References

  1. WEINBERGER, David. Too big to know. New York (NY): Basic Books, 2012. ISBN 978-0-465-02142-0.
  2. WEINBERGER, David. Transparency is the new objectivity [online]. July 19th, 2009 [cit. 2012-04-25]. Available from WWW: http://www.hyperorg.com/blogger/2009/07/19/transparency-is-the-new-objectivity/
  3. BERLINER, Daniel. The political origins of transparency. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-29]. Also available from WWW: http://ssrn.com/abstract=1899791
  4. Beyond access: open government data & the right to (re)use public information [online]. Access Info Europe, Open Knowledge Foundation, January 7th, 2011 [cit. 2012-04-15]. Available from WWW: http://www.access-info.org/documents/Access_Docs/Advancing/Beyond_Access_7_January_2011_web.pdf
  5. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O’Reilly, 2010. ISBN 978-0-596-80435-0.
  6. BOYD, Danah; CRAWFORD, Kate. Six provocations for big data. In Proceedings of A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 21 — 24 September 2011, University of Oxford. Oxford (UK): Oxford University, 2011. Also available from WWW: http://ssrn.com/abstract=1926431
  7. KAPLAN, Daniel. Open public data: then what? Part 1 [online]. January 28th, 2011 [cit. 2012-04-10]. Available from WWW: http://blog.okfn.org/2011/01/28/open-public-data-then-what-part-1/
  8. HAZLETT, Shirley-Ann; HILL, Frances. E-government: the realities of using IT to transform the public sector. Managing Quality Service. 2003, vol. 13, iss. 6, p. 445 — 452. ISSN 0960-4529. DOI 10.1108/09604520310506504.
  9. YU, Harlan; ROBINSON, David G. The new ambiguity of “open government” [online]. Princeton CITP / Yale ISP Working Paper. Draft of February 28th, 2012. Available from WWW: http://ssrn.com/abstract=2012489
  10. TAUBERER, Joshua. Open government data: principles for a transparent government and an engaged public [online]. 2012 [cit. 2012-03-09]. Available from WWW: http://opengovdata.io/
  11. WONDERLICH, John. Pelosi reverses on 72 hour promises? In Open House Project [online]. November 7th, 2009 [cit. 2012-04-19]. Available from WWW: http://groups.google.com/group/openhouseproject/msg/94060a876083d86a
  12. SHIRKY, Clay. Open House thoughts, Open Senate direction. In Open House Project [online]. November 23rd, 2008 [cit. 2012-04-19]. Available from WWW: http://groups.google.com/group/openhouseproject/msg/53867cab80ed4be9

2012-08-25

Impact of open data

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Rufus Pollock from the Open Knowledge Foundation argues that “open data is a means to an end, not an end in itself” [1]. Open data alone has no impact, as its impact is triggered by its use. Thus, no impact is guaranteed by the intrinsic properties of open data.
Open data discourse contains a vision that promises a better society in the offing. It is a vision that stems from the belief in the transformative effects of open data principles and the information technologies that are entrusted to deliver it. However, this vision will not be put into practice merely by releasing open data. It is the use of open data that puts the transformation into motion.
The rhetoric of open data advocates emphasizes the positive side of open access to public sector data. Moreover, open data is often presented as an unproblematic and strictly apolitical issue. However, it would be short-sighted to assume it is a neutral, technological change. We need to admit that open data has both positive and negative impacts, bringing both benefits and repercussions.
Distinguishing by the target of open data impacts, a rough categorization can be drawn, classifying impacts either as internal, if they affect data producers, or external, if they influence others.

Internal impact

Internal impact, which affects the producers of public sector data, is based largely on data about the public sector. The data describing the public sector is a record of its activity that may be used and scrutinized to improve the workings of the public sector. An open and better performing public sector is among the key objectives of the open data movement. Ultimately, open data paves the way to an open and more efficient government.
Open data disrupts existing workflows that are established in the public sector. It subjects the public sector to greater transparency, which makes it possible to hold civil servants accountable, and it establishes conditions under which the public sector may function more efficiently.

External impact

The external impact of open data affects the demand side of open data. It results chiefly from the availability of data about the environment governed by the public sector bodies releasing the data.
A recognized issue with the open data movement is that it lacks focus on the demand side of data. It suffers from unrealistic expectations brought about by the pervasive tendency to pay attention solely to the supply side, coupled with a lack of consideration of how the data will be used after its release [2, p. 1]. The public sector should abandon this ill-considered model and instead adopt a user-centric model for data disclosure.
Close attention to the demand side is needed because the power of open data does not lie in the data itself; it resides in the ways it can empower the people who use it. Open data empowers citizens to make better decisions. For example, access to crime data may assist city dwellers in finding the safest route home. Information about wheelchair access to public transportation may help persons with reduced mobility to arrange their city transport better. The effects of open data that impact users of data are covered in the following sections. Among the effects discussed are the phenomenon of disintermediation, which allows users of data to bypass intermediaries, and the ways in which open data enables citizens to participate in public affairs. The influence of open data on two specific domains is also considered: the availability of public sector data holds new potential for the economy, and for journalism open data brings about a change that makes it more data-driven.

References

  1. POLLOCK, Rufus. Open data: a means to an end, not an end in itself [online]. September 15th, 2011 [cit. 2012-04-06]. Available from WWW: http://blog.okfn.org/2011/09/15/open-data-a-means-to-an-end-not-an-end-in-itself/
  2. MCCLEAN, Tom. Not with a bang but with a whimper: the politics of accountability and open data in the UK. In HAGOPIAN, Frances; HONIG, Bonnie (eds.). American Political Science Association Annual Meeting Papers, Seattle, Washington, 1 — 4 September 2011 [online]. Washington (DC): American Political Science Association, 2011 [cit. 2012-04-19]. Also available from WWW: http://ssrn.com/abstract=1899790

2012-08-24

Linked open data in the public sector

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Having reviewed the theoretical foundations for technical openness and data quality of linked data, this section turns to the ways in which linked open data is used in practice in the public sector. Contrary to popular belief, linked open data is no longer confined to research institutes producing pilots and prototypes. It is used in practice, and the public sector is one of the central areas in which linked data is being adopted.
To find out about the role of public sector data in the ever-growing web of data, the Linked Open Data Cloud diagram may be consulted. This diagram depicts the connections between the existing linked data sources that are published under the terms of an open licence. Progressive changes made to this diagram over time illustrate the growth of the web of data, which now contains more than a billion triples. The cloud is partitioned into broad subject categories that include a category for “government”. According to the State of the LOD Cloud [1] survey from September 2011, the datasets in this category represented 42.09 % of the triples in the cloud. However, these datasets accounted for only 3.84 % of outbound links to external datasets.
The Linked Open Data Cloud features datasets from the public sector of a number of countries. The U.S. is represented by its pioneering Data.gov project, started by the Obama administration in May 2009. In the United Kingdom, the adoption of linked open data in the public sector was kick-started by research projects, such as AKTivePSI [2] at the University of Southampton. The research activity quickly developed into an official part of the public sector’s work and gave rise to Data.gov.uk, one of the most comprehensive and progressive government data catalogues to date. Among other countries, initial experiments with linked open data for data produced in the public sector are also being conducted in the Czech Republic by the unofficial initiative OpenData.cz.
The thriving growth of linked open data activities in the public sector pointed to a need for coordination and for the development of standards and best practices. The W3C has taken the lead and established the Government Linked Data Working Group to help guide the adoption of linked open data in the public sector. The group is scheduled to run until 2013, but it has already published several documents, such as the Cookbook for open government linked data [3].

References

  1. BIZER, Chris; JENTZSCH, Anja; CYGANIAK, Richard. State of the LOD Cloud [online]. Version 0.3. September 19th, 2011 [cit. 2012-04-11]. Available from WWW: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
  2. ALANI, Harith; CHANDLER, Peter; HALL, Wendy; O’HARA, Kieron; SHADBOLT, Nigel; SZOMSZOR, Martin. Building a pragmatic semantic web. IEEE Intelligent Systems. May—June 2008, vol. 23, iss. 3, p. 61 — 68. Also available from WWW: http://eprints.soton.ac.uk/265787/1/alani-IEEEIS08.pdf. ISSN 1541-1672. DOI 10.1109/MIS.2008.42.
  3. HYLAND, Bernardette; TERRAZAS, Boris Villazón; CAPADISLI, Sarven. Cookbook for open government linked data [online]. Last modified on February 20th, 2012 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook

2012-08-23

Linked data: quality

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Data quality is not inherent in technologies; it is a result of the way technologies are used. Apart from the strict limitations of semantic web technologies and the linked data principles enforced by peer pressure, there is a body of knowledge about linked data captured in informal design patterns and best practices, embodied in resources like Linked data patterns [1] or the Cookbook for open government linked data [2]. Among the other aspects these recommendations deal with, they propose ways in which linked data should be used to achieve the best data quality.

Content

The content facet of open data quality metrics tracks whether the content of data is primary, complete, timely, and delivered intact.

Primariness

A key principle of linked data is to ensure access to raw data. Linked data URIs are required to dereference also to raw, machine-readable data, such as RDF serialized in XML. Besides dereferencing, linked data may implement additional interfaces for accessing raw data, such as SPARQL endpoints.
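To illustrate, a minimal sketch of requesting raw RDF through content negotiation, written in Python with the requests library, could look as follows; the resource URI is a made-up placeholder, not an existing dataset.

    import requests

    # Hypothetical linked data URI; any dereferenceable resource URI works the same way.
    uri = "http://example.org/resource/prague"

    # Ask for a machine-readable representation via content negotiation.
    response = requests.get(uri, headers={"Accept": "application/rdf+xml"})
    print(response.headers.get("Content-Type"))
    print(response.text[:500])  # the beginning of the raw RDF/XML returned for the resource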

Completeness

A common way to arrange for access to complete data is to provide data dumps exported from a database or a triple store in the back-end. In this way, users can work with the data as a whole.
RDF offers an inclusive way of representing data of a varying degree of structure and granularity. Depending on the modelling style, RDF can capture both highly structured data and unstructured free text. Linked data improves this inclusiveness by making it possible to link to non-RDF content.
Linked data offers a means of materializing types of data that are, for the most part, out of the scope of other approaches to data representation. For example, it may include explicit relationships between the described resources. From this perspective, linked data may be seen as a more complete representation of a particular phenomenon.

Timeliness

Even though timely release of data is rather a matter of policy and human resources, the technologies employed for that task can make it easier. Especially with highly dynamic data that goes through frequent changes, it is important to have a flexible update mechanism at hand. Updates of linked data may be automated with SPARQL 1.1 Update, which offers a very expressive method for patching data.
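A hypothetical SPARQL 1.1 Update operation, sent over the SPARQL Protocol with Python's requests library, might be sketched as follows; the endpoint URL and the sensor URI are placeholders, not an existing service.

    import requests

    # Hypothetical update endpoint; the sensor URI and property are made up for the example.
    endpoint = "http://example.org/dataset/update"

    update = """
    PREFIX ex: <http://example.org/ns#>
    DELETE { ex:sensor1 ex:lastReading ?old }
    INSERT { ex:sensor1 ex:lastReading 21.5 }
    WHERE  { OPTIONAL { ex:sensor1 ex:lastReading ?old } }
    """

    # SPARQL 1.1 Protocol: send the update operation directly in the request body.
    response = requests.post(endpoint,
                             data=update.encode("utf-8"),
                             headers={"Content-Type": "application/sparql-update"})
    print(response.status_code)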
Timeliness is crucial in two areas that are gaining prominence: streaming sensor data and user-generated content. Research on technological solutions for these areas is in its infancy [3]. However, there are already experiments with streaming linked data and real-time extraction from user-generated content, such as DBpedia Live, which captures updates in Wikipedia in near real time.

Integrity

The semantic web technology stack, on which linked data builds, includes both digital signatures and encryption as part of the so-called Semantic Web layer cake. To ensure the content of data is not tampered with during transmission, secure HTTPS connections should be employed. An example of a semantic web technology that builds on digital signatures is WebID, which may be used to authenticate data publishers.

Usability

Usability may be perceived as the weakest point of linked data. In most cases, raw, disintermediated linked data is not intended for direct consumption. This is a result of the separation of concerns that linked data employs. For example, consider working with a SPARQL endpoint: even though it is a powerful way for applications to interact with data, it may be baffling for regular users. Linked data should rather be mediated through the end-user interfaces of web applications, which present the data in a more usable and visually appealing manner. However, there are still aspects in which raw linked data excels when compared to other types of data.

Presentation

Intelligible presentation of linked data should be arranged for by implementing mechanisms for dereferencing URIs, which should be able to serve a human-readable resource representation, such as HTML. However, representations of linked data resources are usually generated into generic templates in an automated fashion, which impedes custom adaptation of representations for different resource types.

Clarity

RDF has a well-defined way of conveying semantics through the use of RDF vocabularies and ontologies, the workings of which are described in the previous blog post about RDF. RDF vocabularies and ontologies make thorough data modelling feasible, which increases the fidelity and clarity of the way RDF resources are modelled.

Documentation

Linked data is self-describing data. Since the “consumers of Linked Data do not have the luxury of talking to a database administrator who could help them understand a schema” [2], all the information necessary to interpret the data, including the RDF vocabularies and ontologies it uses, should be stored on the Web and should be retrievable via the mechanism of dereferencing, by issuing HTTP GET requests and recursively following links.
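A rough sketch of this retrieval, using Python with the rdflib library, is shown below; the resource URI is a placeholder, and not every dereferenced property URI will actually return RDF documentation.

    from rdflib import Graph, RDFS

    # Hypothetical resource URI; the vocabularies it uses are discovered from the data itself.
    resource = "http://example.org/resource/prague"

    data = Graph()
    data.parse(resource)  # dereference the resource URI and parse the returned RDF

    # For every property used in the data, dereference the property URI as well
    # and look up its human-readable documentation (rdfs:label).
    for prop in set(data.predicates()):
        vocab = Graph()
        try:
            vocab.parse(prop)
        except Exception:
            continue  # not every URI dereferences to RDF
        for label in vocab.objects(prop, RDFS.label):
            print(prop, "-", label)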
While the representations of resources should be self-documenting, there is no such requirement on the linked data URIs. URIs may be opaque since “the Web is designed so that agents communicate resource information state through representations, not identifiers” [4].

References

  1. DODDS, Leigh; DAVIS, Ian. Linked data patterns [online]. Last changed 2011-08-19 [cit. 2011-11-05]. Available from WWW: http://patterns.dataincubator.org
  2. HYLAND, Bernardette; TERRAZAS, Boris Villazón; CAPADISLI, Sarven. Cookbook for open government linked data [online]. Last modified on February 20th, 2012 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
  3. SEQUEDA, Juan F.; CORCHO, Oscar. Linked stream data: a position paper. In TAYLOR, Kerri; AYYAGARI, Arun; DE ROURE, David (eds.). Proceedings of the 2nd International Workshop on Semantic Sensor Networks, collocated with the 8th International Semantic Web Conference, Washington DC, USA, October 26th, 2009. Aachen: RWTH Aachen University, 2009, p. 148 — 157. CEUR workshop proceedings, vol. 552. Also available from WWW: http://oa.upm.es/5442/1/INVE_MEM_2009_64353.pdf. ISSN 1613-0073.
  4. JACOBS, Ian; WALSH, Norman (eds.). Architecture of the World Wide Web, volume 1 [online]. W3C Recommendation. December 15th, 2004 [cit. 2012-04-20]. Available from WWW: http://www.w3.org/TR/webarch/

2012-08-22

Linked data: use

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
The flexible, application-agnostic nature of linked data makes it possible to employ it for a broad spectrum of uses. Linked data does not discriminate according to the type of use as “Linked Data principles and publishing guidelines are designed to make structured data more amenable to ad hoc consumption on the Web” [1, p. 13].
Roy Fielding wrote that “the primary mechanisms for inducing reusability within architectural styles is reduction of coupling (knowledge of identity) between components and constraining the generality of component interfaces” [2, p. 35]. Fielding’s REST, covered in the previous blog post about HTTP, is based on uniform interfaces between components and thus abides by this recommendation. However, a trade-off of uniform interfaces is efficiency, because such interfaces are optimized for the general case [Ibid., p. 82]. Since linked data is based on REST, it inherits this trade-off.
Linked data adopts separation of concerns and decouples content from presentation. In this way, it decouples data from upstream (producers) and downstream (consumers) interfaces enabling variability without introducing interoperability costs. Since linked data is not application-specific it may be used to power all kinds of applications.
Modelling of linked data is based on the reuse of existing models provided by RDF vocabularies and ontologies. A common approach to modelling of linked data is to mix various vocabularies and ontologies at will, cherry-picking their components to build a customized model suited for particular data.
The flexibility of the RDF data model makes it possible to query the data and reconfigure it for a particular use. Semantic web technologies open opportunities for reuse by offering “query interfaces for applications to access public information in a non-predefined way” [3]. This is more difficult to achieve with non-RDF data formats. For example, Fadi Maali argues that “providing the data in a fixed table structure, as in CSV files, makes it harder for consumers to re-arrange the data in a way that best fits their needs” [4, p. 86].
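As a minimal sketch of the previous two points, the snippet below uses Python with rdflib to mix the FOAF and Dublin Core vocabularies in a tiny made-up graph and then rearranges it with a SPARQL query into a table shaped for one particular use.

    from rdflib import Graph

    # A small illustrative graph mixing two vocabularies (FOAF and Dublin Core);
    # the data and the example.org URIs are invented for the sketch.
    turtle = """
    @prefix foaf:    <http://xmlns.com/foaf/0.1/> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix ex:      <http://example.org/> .

    ex:report1 dcterms:title   "Budget report" ;
               dcterms:creator ex:alice .
    ex:alice   foaf:name       "Alice" .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")

    # Re-arrange the graph into a table shaped for a particular use:
    # one row per document, with the creator's name joined in.
    query = """
    PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?title ?name
    WHERE {
      ?doc dcterms:title ?title ;
           dcterms:creator ?person .
      ?person foaf:name ?name .
    }
    """
    for row in g.query(query):
        print(row.title, "-", row.name)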
Together, the composition of data models from parts already known to applications and the flexibility to rearrange the data model to fit the application model facilitate generic consumption. This advantage is particularly manifest when applications combine multiple sources of linked data. Applications of this type are referred to as “meshups”, since they are built on data sources that mesh with each other [5, p. 321]. Without linked data, this scenario would require manual integration effort at the application level, whereas linked data is already integrated at the data level.
The following paragraphs describe how linked data meets the concrete criteria for the use of open data.

Non-proprietary data formats

RDF is a non-proprietary data format and its specifications are open and free for anyone to inspect and implement.

Standards

Linked data builds on web standards maintained by the W3C or the Internet Engineering Task Force (IETF). For an overview of standard specifications related to linked data see Linked Data Specifications maintained by Michael Hausenblas.

Machine readability

RDF serializations covered in the previous blog post on RDF are machine-readable. Specifications of RDF serializations have well-defined conformance criteria, which facilitate the development of standard parsers and make it possible for data to be validated for conformance, such as with the W3C RDF Validation Service.
RDF data is well-structured with a high level of granularity. Users of RDF may use it as a graph that may be broken down into individual triples, which allows access to data at a very detailed level.
Linked data makes explicit, machine-readable licensing possible by linking to licences. There are several RDF vocabularies that contain properties for doing so, such as Dublin Core Terms with dcterms:rights. For a structured representation of the licences themselves, the Creative Commons Rights Expression Language may be employed.
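A small illustration of such a licence link, built with Python and rdflib, might look as follows; the dataset URI is a made-up example, while the licence URI points to an existing open licence.

    from rdflib import Graph, URIRef
    from rdflib.namespace import DCTERMS

    g = Graph()

    # Hypothetical dataset URI linked to a licence as an explicit, machine-readable statement.
    dataset = URIRef("http://example.org/dataset/budget-2012")
    licence = URIRef("http://creativecommons.org/licenses/by/3.0/")

    g.add((dataset, DCTERMS.rights, licence))

    print(g.serialize(format="turtle"))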

Safety

RDF cannot include executable content. Serializations of RDF are textual (with the exception of the proposed Binary RDF [6]), which promotes inspection and eases safety checks. However, using RDF in adversarial environments with security problems, such as RDF injection or query sanitization, is an area in which little research has been conducted.

References

  1. HOGAN, Aidan; UMBRICH, Jürgen; HARTH, Andreas; CYGANIAK, Richard; POLLERES, Axel; DECKER, Stefan. An empirical survey of linked data conformance. In Journal of Web Semantics [in print]. 2012. Also available from WWW: http://sw.deri.org/~aidanh/docs/ldstudy12.pdf. ISSN 1570-8268. DOI 10.1016/j.websem.2012.02.001.
  2. FIELDING, Roy Thomas. Architectural styles and the design of network-based software architectures. Irvine (CA), 2000. 162 p. Dissertation (PhD.). University of California, Irvine.
  3. ACAR, Suzanne; ALONSO, José M.; NOVAK, Kevin (eds.). Improving access to government through better use of the Web [online]. W3C Interest Group Note. May 12th, 2009 [cit. 2012-04-06]. Available from WWW: http://www.w3.org/TR/egov-improving/
  4. MAALI, Fadi. Getting to the five-star: from raw data to linked government data. Galway, 2011. Masters thesis (MSc.). National University of Ireland. Digital Enterprise Research Institute.
  5. OMITOLA, Tope; KOUMENIDES, Christos L.; POPOV, Igor O.; YANG, Yang; SALVADORES, Manuel; SZOMSZOR, Martin; BERNERS-LEE, Tim; GIBBINS, Nicholas; HALL, Wendy; SCHRAEFEL, Mc; SHADBOLT, Nigel. Put in your postcode, out come the data: a case study. In AROYO, Lora; ANTONIOU, Grigoris; HYVÖNEN, Eero; TEN TEIJE, Annette; STUCKENSCHMIDT, Heiner; CABRAL, Liliana; TUDORACHE, Tania (eds.). The semantic web: research and applications, 7th Extended Semantic Web Conference, Heraklion, Crete, Greece, May 30 — June 3, 2010, Proceedings, Part I. Heidelberg: Springer, 2010. Lecture notes in computer science, 6088. ISBN 978-3-642-13485-2.
  6. FERNÁNDEZ, Javier D.; MARTÍNEZ-PRIETO, Miguel A.; GUTIERREZ, Claudio; POLLERES, Axel. Binary RDF representation for publication and exchange (HDT) [online]. W3C Member Submission. March 30th, 2011 [cit. 2012-04-24]. Available from WWW: http://www.w3.org/Submission/2011/SUBM-HDT-20110330/

2012-08-21

Linked data: permanence

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Linked data principles enforce a separation of data and applications, which promotes permanence. Modelling linked data is modelling without a context of use [1, p. 11]. When designing a data model for linked data, its creators abstract away from the particular uses the data may get, such as in specific applications. Such a design principle results in an application-agnostic data model that is not tightly coupled with any type of use that might be intended for the data. As a result, the data supports a wide range of unintended and unforeseen uses. Given that the data is decoupled from the applications using it, it need not be changed when the implementation of the interfaces mediating it changes. Moreover, the software used for publishing or consuming linked data is in most cases open source and thus need not be changed if the vendor providing it changes. Even if there were no support for these open source solutions, the data formats used for linked data have open specifications that may be re-implemented by anyone.
Established design patterns for linked data promote persistent URIs providing long-lasting access points [2, p. 5]. Several of the best practices for minting URIs contribute to their persistence. URIs should not be made session-specific, since in that case they cannot be used to re-identify the requested resources after the session expires. URIs should be made implementation-agnostic, because if they depend on an implementation they cannot outlast it. Therefore, URIs should not be cluttered with implementation details, such as file type suffixes (e.g., .php). A technique that further decouples URIs from the way they are dereferenced is to introduce a layer of indirection by using a service such as http://purl.org to redirect URIs to URLs that serve their representations. However, the persistence of URIs is ultimately proportional to the commitment of the institutions maintaining them.
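As a rough illustration of this indirection, the snippet below follows a hypothetical persistent URI to whatever implementation-specific URL it currently redirects to; the purl.org path shown is not a registered entry.

    import requests

    # Hypothetical persistent URI minted through a redirection service;
    # the target URL behind it can change without breaking the published URI.
    persistent_uri = "http://purl.org/example/dataset/budget-2012"

    response = requests.get(persistent_uri, allow_redirects=True)
    print(response.history[0].status_code if response.history else "no redirect")
    print(response.url)  # the implementation-specific URL the persistent URI currently points to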

References

  1. WOOD, David (ed.). Linking government data. Heidelberg: Springer, 2011. ISBN 978-1-4614-1766-8.
  2. Designing URI sets for the UK public sector: a report from the Public Sector Information Domain of the CTO Council’s Cross-Government Enterprise Architecture [online]. 2009 [cit. 2012-02-26]. Available from WWW: http://www.cabinetoffice.gov.uk/sites/default/files/resources/designing-URI-sets-uk-public-sector.pdf

2012-08-20

Linked data: accessibility

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Linked data requires the use of dereferenceable HTTP URIs that serve as open access points to data. Resolution of linked data URIs may be implemented either by serving static files or by generating resource representations on the fly.
Linked data may be published in static files in one of the RDF serializations described in the previous post about RDF. This approach is used mainly for serving RDF vocabularies and ontologies, for transferring datasets for local batch processing, or for files with embedded RDF. Serving static files is easy to implement; however, their content is fixed and difficult to manipulate and update. To take advantage of the flexible nature of linked data, dynamically generated RDF representations may be served on demand instead. One option for this approach is to use wrappers for dynamic data extraction from non-RDF data sources. For example, D2R Server can expose relational databases as RDF through a pre-defined mapping.
However, to reap the full benefits of RDF, a triple store should be used to store the data. A triple store is a database optimized for the storage and retrieval of RDF data. To publish data from a triple store, SPARQL endpoints are used as the interfaces users interact with. The endpoints expose an interface defined by the SPARQL Protocol for RDF, which allows data to be queried or manipulated and serves query results in XML via HTTP. In order to comply with the linked data principles, publishers should use front-end applications that implement dereferencing and content negotiation. A common way to expose RDF as linked data is through lightweight SPARQL wrappers that dereference URIs to concise bounded descriptions [1] of the requested resources, which they retrieve via SPARQL queries. Example implementations of linked data front-ends include Pubby and Graphite.
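A minimal sketch of querying such an endpoint from Python with the SPARQLWrapper library is given below; the endpoint URL is a placeholder for any service implementing the SPARQL Protocol.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical SPARQL endpoint exposed by a triple store.
    sparql = SPARQLWrapper("http://example.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        SELECT ?s ?p ?o
        WHERE { ?s ?p ?o }
        LIMIT 10
    """)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])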
To ease the transition to linked data for web developers, the Linked Data API specification was created. The Linked Data API is a framework for building more user-friendly APIs that interact with linked data in a way that follows the guidelines of REST and uses simple data formats, such as JSON. Example implementations of this framework include Puelia and Elda.
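A hedged sketch of calling such an API from Python follows; the URL is hypothetical, while the _page and _pageSize parameters and the result/items response structure follow the Linked Data API specification.

    import requests

    # Hypothetical Linked Data API deployment; such APIs typically expose
    # list and item endpoints that return simple JSON derived from the underlying RDF.
    url = "http://example.org/api/schools.json"

    response = requests.get(url, params={"_page": 0, "_pageSize": 10})
    for item in response.json().get("result", {}).get("items", []):
        print(item.get("label"))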

References

  1. STICKLER, Patrick. CBD: concise bounded description [online]. W3C Member Submission. June 3rd, 2004 [cit. 2012-04-23]. Available from WWW: http://www.w3.org/Submission/CBD/

2012-08-19

Linked data: discoverability

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
If we define discoverability as the ability to get to a previously unknown URI from a known URI, then this ability depends on the in-bound links from known URIs to unknown URIs. In particular, it depends on the quantity of in-bound links, how likely it is that users will follow them, and the discoverability of the referring URIs.
Linked data fulfils the basic requirement of being linkable by using static and persistent URIs. Moreover, guidelines on URI construction for linked data recommend using human-readable URIs that are easier to communicate [1, p. 4]. To increase the interconnectedness of data, services were developed that take out-bound links into account as well, such as the PSI BackLinking Service for the Web of Data.
Dereferencing URIs serves as a way to discover more data. Self-describing linked data resources “promote ad hoc discovery of information” [2]. The representations of resources that users obtain by dereferencing their URIs may contain links to other resources. This allows for a “follow your nose” link traversal exploration style, recursively navigating through the Web. Since dereferencing mechanisms adhere to a standardized protocol, this type of data discovery can be automated, for example with crawlers.
The methods for improving the discovery of linked data may be categorized as either passive or active. Passive approaches consist in publishing additional data that makes the published linked data easier to find. To improve data traversal for crawlers, Semantic Sitemaps listing all the data access points may be published. Several RDF vocabularies were devised for expressing access metadata that helps in data discovery, such as the Vocabulary of Interlinked Datasets (VoID). A common solution for keeping a record of available data is to post a data description to a data catalogue, such as the Data Hub. To address this purpose, the Data Catalogue Vocabulary (DCAT) was created.
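As an illustration, a minimal VoID description could be produced with Python and rdflib as sketched below; the dataset, endpoint, and dump URIs are placeholders for the example.

    from rdflib import Graph, URIRef, Literal, Namespace
    from rdflib.namespace import RDF, DCTERMS

    VOID = Namespace("http://rdfs.org/ns/void#")

    g = Graph()
    g.bind("void", VOID)
    g.bind("dcterms", DCTERMS)

    # Hypothetical dataset and access points; a VoID description like this can be
    # published alongside the data so that crawlers and catalogues can find it.
    dataset = URIRef("http://example.org/dataset/budget")
    g.add((dataset, RDF.type, VOID.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Example budget dataset")))
    g.add((dataset, VOID.sparqlEndpoint, URIRef("http://example.org/sparql")))
    g.add((dataset, VOID.dataDump, URIRef("http://example.org/dumps/budget.ttl")))

    print(g.serialize(format="turtle"))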
Active techniques serve the purpose of notifying linked data consumers about the existence of data. A common way to spread information about data availability is to notify prospective consumers via the ping protocol, for example with web services like Ping the Semantic Web. Submitting data to search engines works in a similar way, for example via the form for notifying Sindice, a search engine for the semantic web.
Linked data also ranks well in regular search engines. For example, Martin Moore reported that in 2010 linked data resources from the BBC’s Wildlife Finder appeared high in Google search results for animal names [3].

References

  1. Designing URI sets for the UK public sector: a report from the Public Sector Information Domain of the CTO Council’s Cross-Government Enterprise Architecture [online]. 2009 [cit. 2012-02-26]. Available from WWW: http://www.cabinetoffice.gov.uk/sites/default/files/resources/designing-URI-sets-uk-public-sector.pdf
  2. MENDELSOHN, Noah. The self-describing web [online]. W3C TAG Finding. February 7th, 2009 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2001/tag/doc/selfDescribingDocuments
  3. MOORE, Martin. 10 reasons why news organizations should use ‘linked data’. Idea Lab [online]. March 16th, 2010 [cit. 2012-04-24]. Available from WWW: http://www.pbs.org/idealab/2010/03/10-reasons-why-news-organizations-should-use-linked-data073.html

2012-08-18

Linked data principles

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Linked data principles govern the use of the semantic web technologies described in the previous sections. Unlike the technologies, the principles are not backed by any standards body, such as the World Wide Web Consortium. Instead, they are community-driven and their sole enforcement mechanism is peer pressure. Nevertheless, this may turn out not to be the case in the near term future if the principles get incorporated into official policies and regulations, such as the ones that govern public sector institutions.
Linked data principles provide guidance for both data publishers and consumers. For publishers, they offer the best practices that must be complied with in order for their data to be recognized as linked data. From the consumers’ perspective, the principles prescribe behaviour patterns they can expect when working with linked data, such as what happens when linked data URIs are resolved in the course of content negotiation.
Compared with the principles of open data, there are fewer formulations of linked data principles. The original Linked Data Principles drafted by Tim Berners-Lee form a strong core that other, mostly derivative, formulations tend to cite or relate to.

Tim Berners-Lee's Linked Data Principles

Linked Data Principles, written by Tim Berners-Lee in 2006, effectively define what linked data is. The principles serve as a touchstone for determining whether a dataset qualifies as “linked data”, covering the necessary conditions that a dataset needs to fulfil in order to earn that label. These conditions are encapsulated in four succinct principles.
  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.
Berners-Lee, the inventor of the World Wide Web, sees the principles as natural for the Web. He recounts that in writing down the principles he only captured intentions that were already part of his original architecture for the Web.
After their creation, the principles were modified slightly, clarifying certain issues and making some parts more explicit. For example, the original version from 2006 did not explicitly mention what technologies should be used for achieving the prescribed behaviour of the data. This was amended later, making it clear that the intended technologies were RDF and SPARQL.
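To make the four principles concrete, here is a minimal sketch in Python using the rdflib library; all data.example.org URIs are hypothetical, and DBpedia merely serves as an example of another data source to link to.

  # A minimal sketch of the four principles in practice, using the rdflib library;
  # the data.example.org URIs are hypothetical.
  from rdflib import Graph, URIRef, Literal, Namespace
  from rdflib.namespace import RDF, RDFS, OWL

  EX = Namespace("http://data.example.org/resource/")
  g = Graph()

  city = EX["Prague"]                                   # principles 1 and 2: an HTTP URI names the thing
  g.add((city, RDF.type, URIRef("http://dbpedia.org/ontology/City")))
  g.add((city, RDFS.label, Literal("Prague", lang="en")))                  # principle 3: useful information in RDF
  g.add((city, OWL.sameAs, URIRef("http://dbpedia.org/resource/Prague")))  # principle 4: a link to another URI

  # Serving this graph when http://data.example.org/resource/Prague is looked up
  # would satisfy the requirement to return useful information upon lookup.
  print(g.serialize(format="turtle"))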

Five Stars of Linked Open Data

Four years after the inception of the original Linked Data Principles, Tim Berners-Lee proposed a more iterative take on publishing linked data in his Five Stars of Linked Open Data scheme. It contains five commandments for data producers, explaining how to progressively improve the way their data is published.
  • ★ Publish data on the Web under an open licence (e.g., in PDF).
  • ★★ Publish data in a structured format (e.g., in Excel).
  • ★★★ Publish data in a non-proprietary format (e.g., in CSV).
  • ★★★★ Use URLs to identify data, so that it is linkable (e.g., in RDF).
  • ★★★★★ Link your data to other data to provide context.
A major change in this scheme is the recognition of the importance of open access to data, which is required already to earn the first star. The scheme emphasizes that adoption of linked data principles creates a space for continuous improvement. Data producers can start publishing data with a low up-front cost and then continue investing more resources towards the goal of joining the pool of linked open data.
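To make the progression concrete, here is a minimal sketch in Python using the rdflib library that republishes a three-star CSV file as five-star linked data; the CSV content and the data.example.org URIs are hypothetical.

  # A minimal sketch of climbing from three stars (a CSV file) to five stars
  # (RDF with links), using the rdflib library; all values and example.org URIs
  # are hypothetical.
  import csv, io
  from rdflib import Graph, URIRef, Literal, Namespace
  from rdflib.namespace import RDFS

  # Three stars: openly licensed, structured, non-proprietary format.
  three_star_data = io.StringIO("name,population\nBrno,378965\n")

  EX = Namespace("http://data.example.org/city/")
  POPULATION = URIRef("http://dbpedia.org/ontology/populationTotal")

  g = Graph()
  for row in csv.DictReader(three_star_data):
      city = EX[row["name"]]                            # four stars: the thing gets a URI
      g.add((city, RDFS.label, Literal(row["name"])))
      g.add((city, POPULATION, Literal(int(row["population"]))))
      g.add((city, RDFS.seeAlso,                        # five stars: a link to other data
             URIRef("http://dbpedia.org/resource/" + row["name"])))

  print(g.serialize(format="turtle"))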
There are several renditions of the Five Stars of Linked Open Data scheme besides the one by Tim Berners-Lee himself. For example, Ed Summers was among the first to publish the scheme, and Michael Hausenblas illustrated it with examples of the costs and benefits associated with each of its steps.

2012-08-17

Technologies of linked data: RDF

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Resource Description Framework (RDF) is a standard format for data interchange on the Web. RDF is a generic graph data format that has several isomorphic representations. Any given RDF dataset may be represented as a directed labelled graph that may be broken down into a set of triples, each consisting of subject, predicate, and object.
Triples are the items that RDF data is composed of. The subject of a triple is the referent, the entity described by the triple. Predicate-object pairs express the referent’s characteristics.
RDF is a type of entity-attribute-value with classes and relationships (EAV/CR) data model. EAV/CR is a general model that may be grafted onto implementations spanning relational databases or object-oriented data structures, such as JSON. In the case of RDF, entities are represented as subjects, which are instances of classes, attributes are expressed as predicates that qualify relationships in data, and objects account for values.
In terms of the graph representation of RDF, subjects and objects form the graph’s nodes, while predicates constitute the labelled edges that connect them. Nodes may be URIs, blank nodes (nodes without intrinsic names), or literals (textual values); edges are always labelled with URIs.
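To make the triple structure concrete, here is a minimal sketch in Python using the rdflib library; the example.org URI is hypothetical and the FOAF vocabulary merely serves as an example of a predicate.

  # A minimal sketch of a single RDF triple, using the rdflib library;
  # the example.org URI is hypothetical.
  from rdflib import Graph, URIRef, Literal, Namespace

  FOAF = Namespace("http://xmlns.com/foaf/0.1/")

  g = Graph()
  subject = URIRef("http://example.org/person/alice")   # the referent being described
  predicate = FOAF.name                                  # the characteristic (attribute)
  obj = Literal("Alice")                                 # the value
  g.add((subject, predicate, obj))                       # one node-edge-node statement in the graph

  for s, p, o in g:
      print(s, p, o)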

Serializations

RDF is an abstract data format that needs to be formalized for exchange. To cater for this, RDF offers a number of textual serializations suitable for different host environments. A side effect of RDF notations being text-based is that they are open to inspection, as anyone can view their sources and learn from them. Several of the most common RDF serializations are described below.
N-Triples is a simple, line-based RDF serialization that is easy to parse. It compresses well and so it is convenient for exchanging RDF dumps and executing batch processes. However, the character encoding of N-Triples is limited to 7-bit and covers only ASCII characters, while other characters have to be represented using Unicode escaping.
Turtle is a successor to N-Triples that provides a more compact and readable syntax. For instance, it has a mechanism for shortening URIs to namespaced compact URIs. Unlike N-Triples, Turtle requires UTF-8 to be used as the character encoding, which simplifies entry of non-ASCII characters.
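For illustration, the single triple from the sketch above can be written in both notations; the following minimal sketch in Python uses the rdflib library to parse a Turtle document and re-serialize it as N-Triples and Turtle (the example.org URI is hypothetical).

  # A minimal sketch showing the same one-triple graph in two serializations,
  # using the rdflib library; the example.org URI is hypothetical.
  from rdflib import Graph

  turtle_doc = """
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  <http://example.org/person/alice> foaf:name "Alice" .
  """

  g = Graph()
  g.parse(data=turtle_doc, format="turtle")

  # N-Triples: one triple per line, URIs written out in full.
  print(g.serialize(format="nt"))
  # Turtle: compact URIs and a more readable layout.
  print(g.serialize(format="turtle"))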
RDF serializations based on several common data formats were developed, such as those building on XML or JSON. XML-based syntax of RDF is a W3C recommendation from 2004. With regard to JSON, there are a number of proposed serializations, such as JSON-LD, an unofficial draft for representing linked data. However, these serializations suffer from the fact that their host data formats are tree-based, whereas RDF is graph-based. This introduces difficulties for the format’s syntax as a result of “packing” graph data into hierarchical structures. For example, the same RDF graph may be serialized differently with no way of determining the “canonical” serialization.
Several RDF serializations were proposed to tie RDF data to documents, using document formats as carriers that embed RDF data. An example of this approach is RDFa, which allows structured data to be interwoven into documents using attribute-value pairs. It is a framework that can be extended to various host languages, of which XHTML is the one whose RDFa syntax specification has reached the status of an official W3C recommendation.

Vocabularies and ontologies

While RDF is a common data model for linked data, RDF vocabularies and ontologies offer a common way of describing various domains. Their role is to provide a means of conveying semantics in data. An RDF vocabulary or ontology covers a specific domain of human endeavour and distills the most reusable parts of the domain into “an explicit specification of a conceptualization” [1, p. 1]. A conceptualization is thought of as a way of dividing a domain into discrete concepts.
The distinction between RDF vocabularies and ontologies is somewhat blurry. Ontologies provide not only lexical but also intensional or extensional definitions of concepts connected by logical relationships, and thus are thought of as more suitable for logic-based tasks, e.g., reasoning. RDF vocabularies offer a basic “interface” to data for a particular domain and as such are better suited for more lightweight tasks. Most linked data gets by with simple RDF vocabularies, which are only rarely complemented with ontological constructs.
Having data described with a well-defined and machine-readable RDF vocabulary or ontology makes it possible to perform inference on the data. Inference serves to materialize data implied by the rules defined in the RDF vocabularies and ontologies through which the data is expressed. The W3C standardized two ontological languages that may be used to create RDF vocabularies and ontologies: RDF Schema (RDFS) and Web Ontology Language (OWL).
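As an illustration of a single inference rule, here is a minimal hand-rolled sketch in Python using the rdflib library; the example.org URIs are hypothetical, and a real setup would rely on a dedicated reasoner rather than on this one rule.

  # A minimal sketch of inference: materializing class memberships implied by
  # rdfs:subClassOf, using the rdflib library; the example.org URIs are hypothetical.
  from rdflib import Graph, URIRef, Namespace
  from rdflib.namespace import RDF, RDFS

  EX = Namespace("http://example.org/vocab/")
  brno = URIRef("http://example.org/resource/Brno")

  g = Graph()
  g.add((EX.Municipality, RDFS.subClassOf, EX.PublicBody))  # vocabulary definition
  g.add((brno, RDF.type, EX.Municipality))                  # data

  # Rule: if ?c rdfs:subClassOf ?d and ?x rdf:type ?c, then ?x rdf:type ?d.
  inferred = []
  for c, d in g.subject_objects(RDFS.subClassOf):
      for x in g.subjects(RDF.type, c):
          inferred.append((x, RDF.type, d))
  for triple in inferred:
      g.add(triple)

  print((brno, RDF.type, EX.PublicBody) in g)  # True: the implied triple is now materialized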
There are countless RDF vocabularies and ontologies available on the Web. However, a great deal of them are used only in the datasets for which they were defined, and only a few have reached sufficient popularity to be treated as de facto standards for modelling the domains they cover. An example of a general and widespread RDF vocabulary is Dublin Core Terms, which provides a basic set of means for expressing descriptive metadata. With regard to the public sector, some of the RDF vocabularies and ontologies covering this domain may be found in the Government vocabulary space of the Linked Open Vocabularies project.
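For illustration, here is a minimal sketch in Python using the rdflib library that describes a dataset with Dublin Core Terms; the example.org URI and the metadata values are hypothetical.

  # A minimal sketch of descriptive metadata with Dublin Core Terms, using the
  # rdflib library; the example.org URI and the values are hypothetical.
  from rdflib import Graph, URIRef, Literal, Namespace
  from rdflib.namespace import XSD

  DCTERMS = Namespace("http://purl.org/dc/terms/")
  dataset = URIRef("http://example.org/dataset/budget-2012")

  g = Graph()
  g.add((dataset, DCTERMS.title, Literal("Municipal budget 2012", lang="en")))
  g.add((dataset, DCTERMS.creator, Literal("City of Example")))
  g.add((dataset, DCTERMS.issued, Literal("2012-08-17", datatype=XSD.date)))

  print(g.serialize(format="turtle"))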

References

  1. GRUBER, Thomas R. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993, vol. 5, iss. 2, p. 199 — 220. Also available from WWW: http://tomgruber.org/writing/ontolingua-kaj-1993.htm

2012-08-16

Technologies of linked data: HTTP

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Linked data uses URIs with the http scheme that are handled by the Hypertext Transfer Protocol (HTTP), “an application-level protocol for distributed, collaborative, hypermedia information systems” [1, p. 6]. HTTP is the default interaction protocol for linked data and is used for data exchange, querying, updates, and so forth. Linked data uses HTTP in accordance with the constraints of the Representational State Transfer architectural style described in the next section.

Representational State Transfer

The resource-oriented architecture of linked data may be considered a style that builds on Representational State Transfer (REST). REST is an architectural style defining a stateless communication protocol for distributed client-server applications, such as the World Wide Web. Roy Fielding, the author of REST, defines an architectural style as a “coordinated set of architectural constraints that has been given a name for ease of reference” [2, p. xvi]. In his doctoral dissertation Fielding defines four interface constraints for REST:
  • identification of resources with URIs
  • manipulation of resources through their representations
  • self-descriptive messages
  • hypermedia as the engine of application state
Linked data interfaces adopt these constraints and build further constraints on top of them, based on the Linked Data Principles.

Dereferencing

Dereferencing is a basic mechanism built on REST that linked data employs for interaction with URIs. By minting a URI in a namespace, namespace owners “enter into an implicit social contract with users of their data” [3] and should therefore be aware that “there are social expectations for responsible representation management by URI owners” [4]. The expectation of URI users is that dereferencing mechanisms are implemented for the URIs and that they work in a predictable manner.
Dereferencing is an idempotent operation on a URI that exchanges a reference to a resource for the resource itself. An HTTP agent (e.g., a web browser) that dereferences a URI issues an HTTP GET request for the resource’s reference (i.e., its URI), and the HTTP server administering this reference replies with a response containing the resource or its representation. The response should be accompanied by a correct HTTP Content-Type header indicating the data format of the response as a Multipurpose Internet Mail Extensions (MIME) type. Dereferencing can be indirect, as redirects may be employed, which is a common practice especially for persistent URIs and non-information resources.
According to the Architecture of the World Wide Web [4] there are two kinds of resources, information resources and non-information resources, for which different dereferencing mechanisms apply. An information resource is “a resource which has the property that all of its essential characteristics can be conveyed in a message” [Ibid.], and so it may be transferred via HTTP (e.g., HTML or PDF files). For example, http://dbpedia.org/page/Czech_Republic is a URI of an information resource identifying a page about the Czech Republic. Non-information resources are those resources that cannot be transferred via HTTP, such as physical objects or abstract notions. For example, http://dbpedia.org/resource/Czech_Republic is a URI of a non-information resource representing the Czech Republic. Since the owner of a URI of a non-information resource cannot serve the user requesting the URI with the identified resource, a recommended, yet widely disputed, practice suggests replying with the HTTP 303 See Other status code, redirecting users to a URI of a representation of the non-information resource [5].
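For illustration, here is a minimal sketch in Python using the requests library; whether a 303 is returned and where it redirects depends on how DBpedia is configured at the time of the request.

  # A minimal sketch of dereferencing a non-information resource with the
  # requests library; the exact behaviour depends on the server's configuration.
  import requests

  uri = "http://dbpedia.org/resource/Czech_Republic"

  # Ask for an RDF representation but do not follow redirects, so that the
  # 303 See Other response itself can be observed.
  response = requests.get(uri, headers={"Accept": "text/turtle"}, allow_redirects=False)
  print(response.status_code)              # typically 303 for a non-information resource
  print(response.headers.get("Location"))  # a URI of a document describing the resource

  # Following the redirect yields a representation and its MIME type.
  representation = requests.get(uri, headers={"Accept": "text/turtle"})
  print(representation.headers.get("Content-Type"))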

Content negotiation

Content negotiation is a way to decide on an appropriate response format based on the headers of an HTTP request. HTTP clients can send HTTP headers along with the requested URI to provide context, stating what format of representation of the requested resource they prefer.
A common HTTP header used for this purpose in the linked data publication model is the Accept header, which contains an enumeration of the MIME types preferred for the representation of the requested resource. This pattern allows the client to negotiate with a server on a response format appropriate for the actual communication context. In practice, this is how the server may distinguish between human and machine traffic and serve either a human-readable (e.g., HTML) or a machine-readable (e.g., XML) representation of the requested resource.
Principles of content negotiation offer a generic approach to communicating the client’s preferences. A widespread use of content negotiation is demonstrated by the Accept-Language header, which may be used to indicate the preferred language of the response. A novel use of this method is datetime content negotiation, which allows the client to access different time snapshots of data using the Accept-Datetime header, as implemented in the Memento software.
There are multiple ways and levels at which content negotiation may be implemented. A common way is to configure the HTTP server, for example with Apache HTTPD’s mod_rewrite. A recommended way to enable discovery of the supported types of representations is to use the HTML link element with the relation “alternate” and a type attribute giving a MIME type the server is capable of responding with.
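For illustration, here is a minimal server-side sketch in Python using the Flask framework, which stands in here for any HTTP server capable of content negotiation; the URI path and the returned data are hypothetical.

  # A minimal server-side content negotiation sketch using the Flask framework;
  # the path and the documents served are hypothetical.
  from flask import Flask, request, Response

  app = Flask(__name__)

  HTML_DOC = "<html><body><h1>Prague</h1></body></html>"
  TURTLE_DOC = '<http://example.org/city/Prague> <http://www.w3.org/2000/01/rdf-schema#label> "Prague" .'

  @app.route("/city/Prague")
  def prague():
      # Pick the best match between what the client accepts and what we can serve.
      best = request.accept_mimetypes.best_match(["text/html", "text/turtle"])
      if best == "text/turtle":
          return Response(TURTLE_DOC, mimetype="text/turtle")
      return Response(HTML_DOC, mimetype="text/html")

  if __name__ == "__main__":
      app.run()

The same effect can be achieved purely in server configuration; the framework is used only to keep the sketch self-contained.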

References

  1. RFC 2616. Hypertext Transfer Protocol: HTTP/1.1 [online]. FIELDING, Roy Thomas; GETTYS, J.; MOGUL, J.; FRYSTYK, H.; MASINTER, L.; LEACH, P.; BERNERS-LEE, Tim. June 1999 [cit. 2012-04-21], 176 p. Available from WWW: http://tools.ietf.org/html/rfc2616. ISSN 2070-1721.
  2. FIELDING, Roy Thomas. Architectural styles and the design of network-based software architectures. Irvine (CA), 2000. 162 p. Dissertation (PhD.). University of California, Irvine.
  3. HYLAND, Bernardette; TERRAZAS, Boris Villazón; CAPADISLI, Sarven. Cookbook for open government linked data [online]. Last modified on February 20th, 2012 [cit. 2012-04-11]. Available from WWW: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
  4. JACOBS, Ian; WALSH, Norman (eds.). Architecture of the World Wide Web, volume 1 [online]. W3C Recommendation. December 15th, 2004 [cit. 2012-04-20]. Available from WWW: http://www.w3.org/TR/webarch/
  5. HEATH, Tom; BIZER, Chris. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.

2012-08-15

Technologies of linked data: URIs

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Uniform Resource Identifiers (URIs) offer an extensible, federated naming system for universal and global identification [1, p. 6]. Thanks to the universality of URIs, a resource identified with a URI may be anything, including web sites, ideas, and real-world objects.
URIs and Uniform Resource Locators (URLs) are different. A URI need not locate the resource it identifies. The location of a resource is described by a URL, which in addition to identifying the resource provides a way of addressing it. In some cases, a resource’s URI may also serve as its URL. This is true for information resources that may be retrieved via the Web. However, resources that may not be retrieved via the Web, such as physical objects, have a URI but do not have any URL, since they cannot be located in that way.
A resource need not be identified with a single URI, because linked data adopts the non-unique name assumption, allowing equivalent resources to have multiple URIs. This approach lowers the start-up barriers for data modelling since it lets linked data publishers assign their own URIs to resources instead of making the effort to find URIs that already exist for them.

References

  1. RFC 3986. Uniform Resource Identifier (URI): generic syntax [online]. BERNERS-LEE, Tim; FIELDING, Roy Thomas; MASINTER, Larry. January 2005 [cit. 2012-04-23]. 61 p. Available from WWW: http://tools.ietf.org/html/rfc3986. ISSN 2070-1721.

2012-08-14

What is linked data?

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Linked data is a publication model for structured data on the Web. The term “linked data” was coined by Tim Berners-Lee in 2006 in a note in which he described the Linked Data Principles.
An essential feature of linked data is materialization of relationships. Linked data makes implicit relationships between the described things explicit by materializing them as data [1, p. 94]. Reified relationships expressed as links thus become a part of machine-readable data amenable to automated processing. Traditionally, relationships in data are kept implicit as a part of the background knowledge, documentation, or software. In such cases, integration of data is done on the application level with a custom-crafted code or queries and the effort of discovering relationships in disparate datasets is left to application developers and other data consumers. Materialization of relationships in linked data shifts this integration effort to the data level.
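For illustration, here is a minimal sketch in Python using the rdflib library that turns an implicit, foreign-key-like reference into an explicit RDF link; the record, the lookup table, and the example.org URIs are hypothetical.

  # A minimal sketch of materializing an implicit relationship as an explicit RDF
  # link, using the rdflib library; the record and the example.org URIs are hypothetical.
  from rdflib import Graph, URIRef, Literal, Namespace
  from rdflib.namespace import RDFS

  # In a table, the relationship is implicit: a "country" column holding the code
  # "CZ" means something only to an application that knows how to interpret it.
  record = {"name": "Ostrava", "country": "CZ"}

  EX = Namespace("http://example.org/city/")
  COUNTRY = URIRef("http://dbpedia.org/ontology/country")
  COUNTRY_URIS = {"CZ": URIRef("http://dbpedia.org/resource/Czech_Republic")}

  g = Graph()
  city = EX[record["name"]]
  g.add((city, RDFS.label, Literal(record["name"])))
  # The implicit foreign key becomes an explicit, machine-readable link in the data itself.
  g.add((city, COUNTRY, COUNTRY_URIS[record["country"]]))

  print(g.serialize(format="turtle"))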
While the current Web turned out to be mostly a web of documents, linked data leads to the growth of a web of data. This web of data may describe not only documents but also data, abstract ideas, or physical objects, along with their materialized relationships. In this way, linked data offers a seamless integration of the web of documents and the web of things into the web of data. Marko Rodriguez supposes that “the web of data may emerge as the de facto medium for data representation, distribution and ultimately, processing” [2, p. 38].
Linked data is a fundamentally distributed publishing model that locates data in heterogeneous data spaces. Unlike the current data stores that may be likened to silos or terminal nodes, linked data spaces are mutually connected via hyperlinks, through which disparate data sources may be defragmented and integrated into a single, virtual global data space. For linked data, relationships with other data expressed via links are of fundamental value. To illustrate this point, in his note about linked data Tim Berners-Lee claims that “the value of your own information is very much a function of what it links to”.
Linked data may be seen as a pragmatic implementation of the vision of the so-called “semantic web”, that is, a web that communicates meaning in a way machines can operate on. Linked data has a mature and well-understood technology stack [3] comprised of the semantic web technologies. Most of these technologies are developed and standardized at the World Wide Web Consortium (W3C). In the following blog posts the key technologies for linked data will be introduced: Uniform Resource Identifier for identification of data, Hypertext Transfer Protocol for interaction with data, and Resource Description Framework for data representation.

References

  1. AYERS, Danny. Evolving the link. IEEE Internet Computing. January/February 2007, vol. 11, no. 1, p. 94 — 96. ISSN 1089-7801.
  2. RODRIGUEZ, Marko A. A reflection on the structure and process of the web of data. Bulletin of the American Society for Information Science and Technology. August/September 2009, vol. 35, no. 6. ISSN 1550-836.
  3. HEATH, Tom; BIZER, Chris. Linked data: evolving the Web into a global data space. 1st ed. Morgan & Claypool, 2011. Also available from WWW: http://linkeddatabook.com/book. ISBN 978-1-60845-430-3. DOI 10.2200/S00334ED1V01Y201102WBE001.

2012-08-13

Open data as a platform

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Open data infrastructure is the gist of the concept of “government as a platform” formulated by Tim O’Reilly [1, p. 11]. O’Reilly expands on the notion of open data by demanding that governments expose not only raw open data but also open web services. Government as a platform is a provider of services built on open data. The services, accessible to anyone, offer ways of interfacing with the data on which they are based, allowing basic operations to be performed on that data. In this way, these open services form an API for the public sector.
This line of thinking sees the public sector as an enabler rather than an implementer, focusing more on creating an open environment rather than delivering end-user services. In contrast to government that works as a platform, current governments may be described rather as “vending machine governments” [Ibid., p. 13]. In such governments citizens pay taxes and expect services in return. If no services are provided or the obtained services are not satisfactory, citizens protest, which is like shaking the vending machine.
Taking the “government as a platform” concept to a more metaphorical level, as Carl Malamud does, we can see law as the operating system of society [Ibid., p. 45]. Law provides rules that govern society, similar to operating systems governing the allocation of system resources. For an open and democratic society, not only is unfettered access to its underlying infrastructure necessary; it is also crucial to guarantee equal access to law. As Malamud puts it, “if a document is to have the force of law, it must be available for all to read” [Ibid., p. 46]. Law, the operating system of society, has to be made open source.
What is important about government as a platform is that the idea needs generative data. Jonathan Zittrain defines generativity as the “system’s capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences” [2, p. 70]. It is a property that describes the ability of users of a system to produce new content unique to that system without any input from the system’s creators. The generativity of a system is based on its affordances, “the possible actions that exist in a given environment” [Ibid., p. 78].
Platforms balance control with generativity. Open infrastructures favour generativity and loose control mechanisms. The open data model incentivizes peer production of applications based on the data [3, p. 331]. Jonathan Zittrain claims that “generatively-enabled activity by amateurs can lead to results that would not have been produced in a firm-mediated market model” [2, p. 84]. This is the essence of the Many minds principle, which asserts that “the coolest thing to do with your data will be thought of by someone else.”
Bill Schrier writes that “governments should provide services which are difficult or impossible for the public to provide for themselves, or which are hard to purchase from private businesses” [1, p. 305]. The rest of the services should be catered for by the public, by businesses or civic associations. What contributes to this approach is the recognition that “the needs of today’s society are too complex to be met by government alone” [4]. Ultimately, “if the private sector can make downstream products more cheaply or meet consumer demands in other ways, then the public sector body should consider pulling out of the market” [5, p. 38]. The solution is to open up the data infrastructure that the public sector works on and invite third parties to build on it. In this way, exposing public sector data within an open infrastructure makes it possible to complement government-provided services with citizen self-service. Although the government as a platform principle is still in an early stage of realization, there are several places in which the public sector has opened up its infrastructure to others. An example of this principle in action is the Global Positioning System (GPS), which the US government made publicly available for full commercial use a decade ago [1, p. 44]. Built on geospatial data, this system provides geolocation services that anyone can access, free of charge.
Highly successful, yet short-lived, were the occasions on which the public sector opened its data for application challenges. In these competitions public bodies released some of their data and offered prizes for the best applications developed with that data. The challenges proved to have a high return on investment. Not only did they create value in applications that significantly exceeded the original investment in prizes, but they also delivered tangible examples of what data can do. Application challenges, such as the founding Apps for Democracy, which took place in Washington, D.C., in 2009, were a source of inspiration for others to follow their lead. Finally, there already is software being built for creating open data infrastructures. An example of such software is the aptly-named Open Government Platform dataset management system that is jointly developed by the US and India.

References

  1. LATHROP, Daniel; RUMA, Laurel (eds.). Open government: collaboration, transparency, and participation in practice. Sebastopol: O'Reilly, 2010. ISBN 978-0-596-80435-0.
  2. ZITTRAIN, Jonathan. The future of the Internet: and how to stop it. New Haven: Yale University Press, 2008. Also available from WWW: http://futureoftheinternet.org/static/ZittrainTheFutureoftheInternet.pdf. ISBN 978-0-300-15124-4.
  3. HÖCHTL, Johann; REICHSTÄDTER, Peter. Linked open data: a means for public sector information management. In ANDERSEN, Kim Normann; FRANCESCONI, Enrico; GRÖNLUND, Åke; VAN ENGERS, Tom M. (eds.). Electronic Government and the Information Systems Perspective: proceedings of the second international conference, Toulouse, France, August 29 — September 2, 2011. Heidelberg: Springer, 2011, p. 330 — 343. Lecture notes in computer science, vol. 6866. DOI 10.1007/978-3-642-22961-9_26.
  4. Open declaration on European public services [online]. 2009 [cit. 2012-04-07]. Available from WWW: http://eups20.wordpress.com/the-open-declaration/
  5. GRAVES, Antoinette. The price of everything the value of nothing. In UHLIR, Paul F. (rpt.). The socioeconomic effects of public sector information on digital networks: toward a better understanding of different access and reuse policies: workshop summary. Washington (DC): National Academies Press, 2009. Also available from WWW: http://books.nap.edu/openbook.php?record_id=12687&page=37. ISBN 0-309-13968-6.

2012-08-12

Open data infrastructure of the public sector

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Information infrastructure is a necessary prerequisite for all information-demanding services. In his treatment of networks, Yochai Benkler describes the need for a shared infrastructure.
“To flourish, a networked information economy rich in social production practices requires a core common infrastructure, a set of resources necessary for information production and exchange that are open for all to use. This requires physical, logical, and content resources from which to make new statements, encode them for communication, and then render and receive them” [1, p. 470].
Ursula Maier-Rabler ties these insights to the public sector. “The prerequisite for the functioning of networks is a common infrastructure. The role of government is to provide that infrastructure” [2, p. 187].
In the current state of affairs, there are multiple fragmented infrastructures that the performance of public functions depends on. Moreover, it is common that these infrastructures are available to dedicated applications only, while being closed to applications from other parts of the public sector, let alone the ones created by members of the public. These information infrastructures are neither shared nor open.
Open data may serve as a data infrastructure of the public sector. By definition, it constitutes a fundamentally open and shared infrastructure that is in line with Benkler’s vision. Such an infrastructure not only enables public services to run but, because it is open to everyone, also enables private services to run. Building such an infrastructure is the goal of open data initiatives and policies.

References

  1. BENKLER, Yochai. The wealth of networks: how social production transforms markets and freedom. New York: Yale University Press, 2006. ISBN 978-0-300-11056-2.
  2. MAIER-RABLER, Ursula; HUBER, Stefan. “Open”: the changing relation between citizens, public administration, and political authority. eJournal of eDemocracy and Open Government [online]. 2011 [cit. 2012-03-15], vol. 3, no. 2, p. 182 — 191. ISSN 2075-9517. Available from WWW: http://www.jedem.org/article/view/66