2011-09-24

Technical openness of open data

Apart from the legal requirements on open data, there are also aspects of technical openness. While the legal aspects are explicitly defined by the Open Definition, there is less understanding of the technical recommendations for making data open. Some principles of this side of openness are covered by the Three laws of open data by David Eaves; others are proposed in the Linked Open Data star scheme. An excellent resource that touches on both the legal and technical requirements for open data is the 8 open government data principles.

Data need to be formalized so that we can serialize them to representations that may be exchanged. However, there are different formalizations that may be used for communicating data, different formats that are more or less open. I think open technologies for representing data share a set of family resemblances. So, open data are:

Non-exclusive
Open data are not published exclusively for a particular application. No application has exclusive access to open data. Instead, they are available to be used by any application and thus support a wide range of uses.

Non-proprietary
No entity has exclusive control over non-proprietary data formats. Such formats have an open specification that may be implemented by anyone. Therefore, data in these formats are not tightly coupled with a specific software that is able to read them.

Standards-based
The data are based on open, community-owned standards. This means the standards are developed in an open process that may be joined by anyone from the public (i.e., not Schema.org). Such standards prescribe a set of rules the data have to adhere to. Standardized data have an expected format, which ensures interoperability, and as such can be used by a plethora of standards-compliant tools.

Machine-readable
Open data are formalized enough so that machines are able to use them. Well-formalized data have a structure that enables their automated machine processing. For instance, unlike a scanned document stored as an image, which is one opaque blob, open data have a higher granularity because they are segmented into well-defined data items (e.g., rows, columns, triples).

Findable
Open data should be publicly available on the Web. This means having URLs that successfully return a representation of the data. Data should be directly accessible by resolving their URL. Any technical barriers preventing access to the data, such as passwords or required registration, are unacceptable, as are any attempts to hide the data and achieve security through obscurity via anti-SEO techniques. As David Eaves puts it, if Google cannot find it, no one can.

Linkable
Elements of open data should be identified with URIs, so that it is possible to link to them. This approach encourages re-use, data integration, and proper attribution of data used as a source.

Linked
If your open data are linked to other open data, users can follow these links to discover more. Being a part of the Web of data brings the benefits of network effects.

As you might have guessed from the previous points, I think that linked data is a very open technology. And, if you look at the 5 stars of linked data, its author Tim Berners-Lee thinks the same. So if you want to make your data more open, linked data is a step in the right direction.

Open bibliographic data checklist

I have decided to write a few points that might be of interest to those thinking about publishing open bibliographic data. The following is a fragment of an open bibliographic data checklist, or, how to release your library's data to the public without a lawyer holding your hand.

I have been interested in open bibliographic data for a couple of years now, and I try to promote them at the National Technical Library, where we have, so far, released only one authority dataset — the Polythematic Structured Subject Heading System. The following points are based on my experiences with this topic. What should you pay attention to when opening your bibliographic data, then?

  • Make sure you are the sole owner of the data or make arrangements with other owners. For instance, things may get complicated in the case of data created collaboratively via shared cataloguing. If you are not in complete control of the data, then start by consulting the other proprietors that have a stake in the dataset.
  • Check whether the data you are about to release are bound by some contractual obligations. For example, you may publish a dataset under a Creative Commons licence, only to realize that there are unresolved contracts with parties that helped fund the creation of that data years ago. Then you need to discuss this issue with the involved parties to determine whether making the data open is a problem.
  • Read your country's legislation to get to know what you are able to do with your data. For instance, in the Czech Republic it is not possible to intentionally put data into the public domain. The only way public domain content comes into being is by the natural course of events, i.e., the author dies, leaves no heir, and after quite some time the work enters the public domain.
  • See if the data are copyrightable. For instance, if the data do not fall within the scope of the copyright law of your country, they are not suitable for licensing under Creative Commons, since this set of licences draws its legal force from copyright law; it is an extension of copyright and builds on it. Facts are not copyrightable and most bibliographic records are made of facts. However, some contain creative content, for example subject indexing or an abstract, and as such are appropriate for licensing based on copyright law. Your mileage may vary.
  • Consult the database act. Check if your country has a specific law dealing with the use of databases that might add more requirements that need your attention. For example, in some legal regimes databases are protected on another level, as aggregations of individual data elements.
  • Different licensing options may apply to the content and the structure of a dataset, for instance when there are additional terms required by database law. You can opt for dual licensing and use two different licences: one for the dataset's content, which is protected by copyright law (e.g., a Creative Commons licence), and one for the dataset's structure, to which copyright protection may not apply (e.g., the Public Domain Dedication and License).
  • Choose a proper licence. A proper open licence is a licence that conforms with the Open Definition (and will not get you sued), so pick one of the OKD-Compliant licences. A good source of solid information about licences for open data is Open Data Commons.
  • BONUS: Tell your friends. Create a record in the Data Hub (formerly CKAN) and add it to the bibliographic data group to let others know that your dataset exists.

Even if it may seem there are lots of things you need to check before releasing open bibliographic data, it is actually easy. It is a performative speech act: you only need to declare your data open to make them open.

<disclaimer>If you are unsure about some of the steps above, consult a lawyer. Note that the usual disclaimers apply to this post, i.e., IANAL.</disclaimer>

2011-08-06

Turning off feed reader

Today I have decided to stop using my feed reader. My use of it has diminished over a long period of time and I no longer think it's an optimal tool for the way I like to discover information.

In my view, feeds, whether they're from blogs, news sites or of any other origin, contain just too much noise. You need to go through all of the items in your subscribed feeds yourself. It's information filtering on the client side. Feed readers don't allow for the fine-grained filtering I would like to be able to do, and thus they are blunt instruments for information discovery.

Reading feeds also lacks serendipitous discovery. I'm rarely surprised when I read my feeds. On the other hand, on Twitter I get interesting pointers to various resources much more frequently due to the ways information spreads through the network of Twitter users before it finally reaches me (e.g., retweets).

Because of these shortcomings my primary platform for information acquisition is now Twitter. I don't read feeds, newspapers, magazines, watch TV news and the like. I have resigned from trying to achieve even near-complete coverage of the topics I'm interested in, and instead I sample and skim-read my Twitter stream.

Twitter provides me with a manageable stream of highly relevant information resources that I'm usually able to process and digest. It offers me serendipitous discoveries I wouldn't have come across when using feed readers. Also, I like to sample from a wide range of resources on different topics and Twitter caters for that quite well.

I have changed my information consumption habits. In a sense, I have switched to probabilistic information retrieval. I know that I can't get complete coverage of the subject areas I'm interested in. I'm conscious that I miss things, but I'm fine with that. I believe that if a piece of information is important enough, it will come back to me. If I don't catch something, I trust my network on Twitter to make me pay attention to it by mentioning it, re-tweeting it, and re-discovering it for me.

On Twitter my information filter is the network of the people I follow. The key difference is that while reading feeds you're using people as content creators, on Twitter you're using people as content curators. It's filtering on a meta level: instead of filtering information yourself you filter the people that are filtering information for you. Your responsibility is to curate the list of Twitter users you follow. However, if you want to be an active member of the Twitter ecosystem, you curate, share, and forward information for your followers.

On the Web there are many information channels, and trying to follow all of them results in a fragmentation of one's attention. Reading lots of information resources is time-consuming, content is often duplicated and therefore demands strenuous filtering, and context switching between different media is expensive for one's cognitive abilities.

In an attention economy we decide how we spend our resources of attention. While marketing uses targeting to reach relevant audiences, we do reverse targeting when we expose ourselves as targets to media of our choice. Choosing a single, yet heterogeneous, information acquisition channel, such as Twitter, may lead to a defragmentation of our attention, and thus may be a step towards a more efficient allocation of one's attention.

The switch from feed readers may be a general trend. I think that information acquisition via feed readers was in part surpassed by social media and the ubiquitous sharing of content on the Web (tweets, likes, plus ones, recommendations, etc.). One of the questions asked by the media theorist Marshall McLuhan in his tetrad of media effects was: what does the medium make obsolete? If we ask what Twitter makes obsolete, the answer may well be feed readers.

That said, I still think feeds are indispensable when it comes to information acquisition for machines, such as web applications and the like. Feeds are well suited for machines to exchange information. Unlike for humans, attention isn't a scarce resource for machines. Machines can read all items in feeds. But people need more human ways of discovering new information as they have limited resources of attention. I think Twitter delivers on that.

2011-07-19

Spoonfeeding Google with RDF graphs packaged as trees

During a small side project I've found out that the Google Rich Snippets Testing Tool doesn't treat RDFa as RDF (i.e., a graph) but rather as a simple hierarchical structure (i.e., a tree). It doesn't take into account links in RDFa, but only the way HTML elements are nested inside one another. More about the difference between the graph and tree data models can be found in a blog post by Lin Clark.

I've created two documents that give the same RDF when you run the RDFa distiller on them. Both contain GoodRelations product data, but the difference between them is that in the first document the HTML element describing the price specification (gr:UnitPriceSpecification) is not nested inside the HTML element describing the offering (gr:Offering) and is instead linked from the offering via the gr:hasPriceSpecification property. In the second document the HTML element with the price specification is nested inside the element about the offering.
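
To make the difference concrete, here is a minimal sketch of the two markup patterns (not the exact test documents; the product label, fragment URIs, and price are made up for illustration). In the first variant the price specification sits in a sibling element and is linked from the offering; in the second it is nested inside the offering:

    <!-- Variant 1: price specification linked, not nested -->
    <div xmlns:gr="http://purl.org/goodrelations/v1#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
      <div about="#offering" typeof="gr:Offering">
        <span property="rdfs:label">Example product</span>
        <!-- link from the offering to the sibling price specification -->
        <span rel="gr:hasPriceSpecification" resource="#price"></span>
      </div>
      <div about="#price" typeof="gr:UnitPriceSpecification">
        <span property="gr:hasCurrency" content="EUR">EUR</span>
        <span property="gr:hasCurrencyValue" datatype="xsd:float" content="99.00">99.00</span>
      </div>
    </div>

    <!-- Variant 2: price specification nested inside the offering -->
    <div xmlns:gr="http://purl.org/goodrelations/v1#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
      <div about="#offering" typeof="gr:Offering">
        <span property="rdfs:label">Example product</span>
        <div rel="gr:hasPriceSpecification">
          <div about="#price" typeof="gr:UnitPriceSpecification">
            <span property="gr:hasCurrency" content="EUR">EUR</span>
            <span property="gr:hasCurrencyValue" datatype="xsd:float" content="99.00">99.00</span>
          </div>
        </div>
      </div>
    </div>

An RDFa distiller extracts the same triples from both variants; the difference is only in how the HTML elements are nested.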

Even though the documents contain the same data, Google Rich Snippets Testing Tool parses them differently and refuses to show a preview of the search result in the case of the first document, whereas the second document produces a preview. In the first case, the price information is not recognized because it's not nested inside the HTML element describing the offering, and thus a warning is shown:

Warning: In order to generate a preview, either price or review or availability needs to be present.

This leads me to believe that Google Rich Snippets Testing Tool doesn't parse RDFa as RDF, but as a tree (much like a DOM tree), effectively the same way as HTML5 microdata, which is built on the tree model. Google doesn't use RDFa as RDF, but as microdata.

Eric Hellman wrote a blog post about spoonfeeding data to Google. Even though Google still accepts some RDF (e.g., GoodRelations) after the announcement of microdata-based Schema.org, it wants to be spoonfed RDF graphs packaged as microdata trees. Does it mean that if Google is your primary target consumer for your data, you shouldn't bother with packaging your RDF in trees, but rather directly provide your data as a tree in HTML5 microdata?
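
If you do go down that path, the tree-shaped equivalent might look roughly like this in HTML5 microdata (a minimal sketch; the values are made up, and Schema.org's Offer type with its name, price, and priceCurrency properties is assumed as the target vocabulary):

    <!-- the offering expressed as a single nested microdata tree -->
    <div itemscope itemtype="http://schema.org/Offer">
      <span itemprop="name">Example product</span>
      <span itemprop="priceCurrency">EUR</span>
      <span itemprop="price">99.00</span>
    </div>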

2011-07-03

RDFa in action

RDFa is a way to exchange structured data inside HTML documents. RDFa provides information that is formalized enough for computers (such as googlebot) to process it in an automated way. RDFa is a complete serialization of RDF, using attribute-value pairs to embed data into HTML documents in a way that does not affect their visual display. RDFa is a hack built on top of HTML. It repurposes some of the standard HTML attributes (such as href, src or rel) and adds new ones (such as property, about or typeof) to enrich HTML with semantic mark-up.

A good way to start with RDFa is to read through some of the available documents, such as the RDFa Primer or even the RDFa specification. When you want to annotate an HTML document with RDFa you might want to go through a series of steps. We have used this workflow during an RDFa workshop I helped to organize and the recipe worked quite well. Here it is, followed by a minimal example that puts the steps together.

  1. Find out what you want to describe (e.g., your personal profile).
  2. Find which RDF vocabularies can be used to describe such a thing (e.g., FOAF). There are multiple ways to discover suitable vocabularies, some of which are listed at the W3C website for Ontology Dowsing.
  3. Start editing your HTML: either the static files or dynamically rendered templates.
  4. Start at the first line of your document and set the correct DOCTYPE. If you are using XHTML, use <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd"> (i.e., RDFa 1.0). If you are using HTML5, use <!doctype html> (i.e., RDFa 1.1). This will allow you to validate your document and see if you are using RDFa correctly.
  5. Refer to the used RDF vocabularies. By declaring vocabularies' namespaces you can set up variables that you can use in compact URIs. If you are using XHTML, use the xmlns attribute (e.g., xmlns:dv="http://rdf.data-vocabulary.org/#"). If you are using HTML5, use prefix, vocab, or profile attributes (e.g., prefix="dv: http://rdf.data-vocabulary.org/#").
  6. Identify the thing you want to describe. Use a URI as a name for the thing so that others can link to it. Use the about attribute (e.g., <body about="http://example.com/recipe">). Everything that is nested inside of the HTML element with the about attribute is the description of the identified thing, unless a new subject of description is introduced via a new about attribute.
  7. Use the typeof attribute to express what kind of thing you are describing (e.g., <body about="http://example.com/recipe" typeof="dv:Recipe">). Pick a suitable class from the RDF vocabularies you have chosen to use and define the thing you describe as an instance of this class. Note that every time the typeof attribute is used the subject of description changes.
  8. Use the property, rel and rev attributes to name the properties of the thing you are describing (e.g., <h1 property="name">).
  9. Assign values to the properties of the described thing using either the textual content of the annotated HTML element or an attribute such as content, href, resource or src (e.g., <h1 property="name">RDFa in action</h1> or <span property="v:author" rel="dcterms:creator" resource="http://keg.vse.cz/resource/person/jindrich-mynarz">Jindřich Mynarz</span>).
  10. If you have assigned the textual content of an HTML element as the value of a property of the thing described, you can annotate it further. To define the language of the text, use either the xml:lang (in XHTML) or lang (in HTML5) attribute (e.g., <h1 property="name" lang="en">RDFa in action</h1>). If you want to set the datatype of the value, use the datatype attribute (e.g., <span content="2011-07-03" datatype="xsd:date">July 3, 2011</span>).
  11. Check your RDFa-annotated document using validators and examine the data using RDFa distillers to see if you have got it right.
  12. Publish the annotated HTML documents on the Web. Ping the RDFa consumers such as search engines so that they know about your RDFa-annotated web pages.
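
Putting the steps together, a minimal sketch of an RDFa 1.1 annotated HTML5 document might look like this (the recipe URI is the example one from the steps; the dv:name and dv:published property names are assumed from the data-vocabulary.org vocabulary referenced in step 5, and the values are made up):

    <!doctype html>
    <html prefix="dv: http://rdf.data-vocabulary.org/# xsd: http://www.w3.org/2001/XMLSchema#">
      <head>
        <title>RDFa in action</title>
      </head>
      <!-- about names the thing being described, typeof says what kind of thing it is -->
      <body about="http://example.com/recipe" typeof="dv:Recipe">
        <h1 property="dv:name" lang="en">RDFa in action</h1>
        <!-- a literal value with an explicit datatype -->
        <p>Published on <span property="dv:published" content="2011-07-03" datatype="xsd:date">July 3, 2011</span>.</p>
      </body>
    </html>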

Art of emptiness

Marshall McLuhan created a distinction between "hot" and "cool" media. I think it is a productive conceptualization of media because it stimulates thinking, even though it suggests thinking in terms of binary opposites.

The longer I enjoy art, the more I think I tend to prefer "cool art". The following is a comparison of the hot and cool styles of art, with a particular focus on music. I hope this will not result in a death from metaphor, but rather in a productive use of it. First, let's start with what McLuhan called the "hot media".

Hot art

Hot art is an art of sensory overload. It provides rich, overwhelming, super-stimuli that lower our ability to parse our sensory input. Hot art needs a space to inhabit; it is an environment-seeking art. Art is always situated in a host environment, in a wider context; and hot art needs space to live in. For instance, for visual arts it is the space of plain, white walls in an art gallery.

Hot art enforces a single interpretation; it is not open to creative use. It guides a person through a linear, pre-defined experience, without a need for participation. In this way, it achieves a temporary oblivion by means of hypnosis. The source of super-stimulation occupies our brain, blocks any other input, and forces the person to pay attention only to it.

For the most part, hot art is perceived on the conscious level. Hot art is a digitally mastered, manufactured product that is made to achieve the maximum possible effect. The result of such a process feels artificial, perfect, and error-free.

A typical example of hot art is pop music. For example, this manifests itself in the "wall of sound" method, which uses plenty of different layers of sound to provide a compelling listening experience.

Cool art

On the other hand, cool art is an art of sensory deprivation. It uses under-stimulation to create emptiness. Cool art creates space, and thus it is an environment-creating art as it puts the person perceiving it in an environment of its own.

Cool art is open and invites a multiplicity of interpretations. It inspires people to undergo a non-linear experience, while requiring a high level of active participation. Participatory art evokes hallucination, which manifests itself as a furious fill-in or completion of sense when all outer sensation is withdrawn (source). Left with minimal sensory input, the human mind starts to create its own content. This is a mechanical process, a natural reaction to under-stimulation of the sensory apparatus. Left alone, the mind tends to wander, fill in the blanks, and complete the missing parts. Cool art inspires creation by means of hallucination.

The experience of cool art is mostly an unconscious one. In contrast to hot art, it is based on analogue, non-discrete forms, which grow in organic ways. For instance, this can be achieved by the techniques of field recordings or by employing non-deterministic or random processes. Such art is in a way more natural; it embraces error (cf. the aesthetics of glitch in music).

A typical example of cool art is dub techno. Dub techno got rid of the usual elements of music, such as melody, and confined itself to conveying music mostly through subtle, slowly evolving changes of rhythm or timbre. This is minimalism that manifests itself through extensive repetition and a restriction to the expressive power of bare rhythm.

I prefer cool art to hot art. However, this is a matter of taste, which implies it may change. To conclude, let me give you a couple of examples of what I consider to be cool art.

Visual arts: Unloud painting

Cinema: Stalker by Andrei Tarkovsky

Music: cv313 - Subtraktive (Soultek's Stripped Down Dub)

2011-05-07

geoKarlovka.cz or: how we used location-based services at the Charles University

Among the things I do, I study New Media Studies at the Charles University. In the course of one class we formed a team for a project to use location-based services for the Charles University. This is how geoKarlovka.cz started. It's a story of how we replaced user-generated content with superuser-generated content.

Our intention was to use location-based services to improve orientation and navigation in the complex mesh of the university's buildings and institutions scattered across the whole of Prague and other Czech cities. We wanted to employ the game-like elements involved in these services, such as earning badges or winning mayorships, to make students explore the university campus more and visit the university's libraries or campus dining halls.

We have used the existing location-based services to accomplish the task we have set for ourselves. To implement our goals, we have chosen Foursquare, Gowalla, and Google Places. Foursquare was identified as our core priority since it is the most widely used location-based service in the Czech Republic, according to the latest statistics.

In this way, we have used the Web as a content management system (source). Instead of building another place to put the information about the venues of the Charles University, we have used the existing infrastructure for hosting content provided by these services. We have also used Google Docs as our internal CMS. All of the data on the map of the venues of the Charles University comes from a spreadsheet in Google Docs (after it is imported into another spreadsheet and filtered with Yahoo! Query Language).

The content of the various location-based services is primarily user-generated content. As it turns out, it may contain inconsistencies and its quality may be dubious.

We set ourselves the goal of improving the image the Charles University has in the afore-mentioned services. We strived to repair incorrect data, add missing links, and enrich the descriptions with other useful particulars, such as opening hours or the URLs of the venues' websites. We were also trying to unify the content describing the university on the chosen location-based services, align the information coming from different sources, and thus establish a minimal common basis. To maintain a level of uniformity we have devised our own naming scheme in order to name the venues in a consistent manner.

The most common issues with the content present in the selected location-based services included missing data, wrong information, or incorrect names. For example, on Google Places we discovered venues with names such as Charles University, part E, which was in fact a college dormitory.

We, the superusers, have provided the superuser-generated content. As students, we have the local knowledge. We have studied at the Charles University for several years, during which we have acquired a high degree of familiarity with its peculiarities. As new media geeks, we are long-time power users of location-based services, and we knew to use the established best practices, such as venue naming guidelines. In fact, we have been granted limited superuser privileges by Foursquare in order to be able to edit venues and merge duplicate entries.

However, we have stumbled upon some problems. For instance, there was an issue with Google's hyper-correctness. The abbreviation for the Charles University, cuni, is an expletive in Tamil, and as such it is prohibited from being used in Google Places. For this reason we weren't able to provide any URLs for the Charles University's venues since they all share the same root URL: www.cuni.cz.

We have created an official brand page for the Charles University on Foursquare, where we are adding tips to the various university venues. In this way, we have put the Charles University in line with the early adopters, side by side with Harvard or Stanford University. We have managed to set up one special on Foursquare, a badge for new visitors to one of the university libraries. Two trips on Gowalla were created, one for newbie students, the second for those who want to take a tour through the students' favourite pubs. And in April, we presented our project at the Foursquare Day in Prague to get the attention of the wider Foursquare community.

As another component of the project, we have coded a simple application for generating QR codes for easy check-in at any Foursquare venue. Just search for the name of the venue you manage and you'll get a custom QR code enabling check-in without the need to use geolocation or search to find your venue. When you scan such a QR code with your mobile device (preferably iOS or Android-based), it automatically re-directs you to the venue's page in the dedicated Foursquare application.

This project was an opportunity to explore various tools and services. We have built the website with Google App Engine and pulled data from the Foursquare API and the Google Spreadsheets API with a little help from the excellent Yahoo! Query Language. The visualizations are based on Google's services, such as the Google Maps JavaScript API or Google Charts.

Finally, I want to give a shout-out to @josefslerka, the head of New Media Studies, who first brought the idea to the fore, @eliskah for backing us up with her extensive knowledge of location-based services and her sheer geekdom, and all of the team members: @annaceskova, @matez_jindra, and @yanana, for bringing the initial ideas to fruition. I really enjoyed working on this project.

2011-04-25

Library SEO

When people search for information, they are very likely to start at Google. They don't start at a library like they used to. What this means for libraries is that, if they don't want to be bypassed, it's important that there is a path from Google to the library. If you can't make people start searching at your library, you can at least make the path from Google to your library as prominent and as accessible as possible. The paths that Google serves in nice, ordered lists are URIs. On the Web, the path to a library is a link.

This is why it's important to make the things at libraries linkable. Then not only Google can link to them; everyone can. If people start to link to your library's website, Google perceives this as a good thing. The number of inbound links to a web page is known to be a key factor in its importance in the eyes of Google.

In the current situation things in libraries often don't have URIs. Or they have them, but poor ones, unstable, session-based URIs that change with every request. You can't link to such content. It's like a library that prevents you from telling anyone about it. Be aware that Google is a very important user of your library. If it likes your web content, it will tell lots of other people about you. Word of mouth is powerful but word of Google is more powerful.

Steps for library SEO

  1. Provide every piece of your content with a URI.
    If your content doesn't have a unique URI, it cannot be linked and it won't show in Google's search results. On the Web, content without a URI does not exist.
  2. Make that URI a stable URI.
    Not that dynamic session-based nonsense. Make an effort to sustain the URI in the long term. A URI should always resolve. Provide re-directs if you change your URI's structure (e.g., by using PURL).
  3. Make that URI easy to use.
    A short URI that fits in a tweet or can be read from slides is better than a long or unreadable one. If it's easy to use, it'll be used more.
  4. Make that URI a cool one.
    Implement content negotiation. In this way, both humans and machines get what they like (see the sketch below).
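
Content negotiation itself happens at the HTTP level: a client that asks for text/turtle or application/rdf+xml in its Accept header gets a machine-readable representation, while a browser gets HTML. A complementary technique (not content negotiation as such, and only a sketch with made-up URIs) is to advertise the alternative representations directly in the record's markup:

    <!-- a catalogue record page pointing to machine-readable
         representations of the same resource -->
    <head>
      <title>Record 123</title>
      <link rel="alternate" type="application/rdf+xml"
            href="http://library.example.org/record/123.rdf" />
      <link rel="alternate" type="text/turtle"
            href="http://library.example.org/record/123.ttl" />
    </head>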

2011-04-23

Removable Web

First, there was the Readable Web. It started as a one-to-many conversation accessible to anyone who had an internet connection. Masses of internet users were allowed to read what the elite ones put on-line.

But for many people, the permission to read wasn't enough. The result was the Writeable Web. This was a many-to-many conversation and the Web started to fill with user-generated content. Everyone with an internet connection was able to both read and write to the Web.

Increasingly, the Web was not only a web of documents but also a web of applications. Next up was the Executable Web. Web applications exposed standard interfaces — APIs. The new paradigms of software as a service and the Web as a platform started to get attention. The Web was available for anyone with an internet connection to read, write and execute.

Now that we've got all of these permissions to the Web, we are able to do lots of powerful things; we can even do damage with them. However, it seems we still lack one permission — the permission to delete.

On the Web everything is recorded and stored (forever). To forget is human, to remember is Google. Every time we use the Web, we leave digital trails. Our footprint gets stored. For instance, Google stores all queries entered into its search box, even though they get anonymized after 9 months. As another example, Facebook doesn't allow you to delete your account permanently; you only have the permission to de-activate it.

Digital information is not prone to disappear. And we have methods of digital preservation, such as LOCKSS, to fight the processes of natural degradation of digital information, such as bit-rot (or link-rot), which make it even harder for information to get lost. And even though bits vanish and links break, these are natural phenomena of the Web, not something one can control and use on purpose.

The next step in endowing users with more permissions might be the Removable Web, where everyone is able to delete their own content from the Web. Right now, there are some people that enjoy the authority to delete other people's content from the Web. We call it filtering or censorship. There are even some people that can at least temporarily remove the whole Web, as the recent example of the internet being switched off in Egypt shows.

We could benefit from the ability to remove our content. We could get rid of all the embarrassing photos and statuses we have ever posted. Forgetting is an essential human virtue that enables us to learn from our mistakes, get a second chance, and re-establish our reputation. Forgetting also helps us to forgive, wrote Viktor Mayer-Schönberger, the author of the book Delete: the virtue of forgetting in the digital age.

With great power comes great responsibility, said Voltaire (and also Spider-Man). Therefore, we should be more conscious of what we write to the Web knowing that it will stay there. The following is often true on the Web: What was published cannot be unpublished. The Web doesn't forget.

We should have a right to delete our own content. Or, as Mayer-Schönberger suggests, we could add expiration dates to digital information. This shouldn't be that hard. We already have a verb for it. Just as we can HTTP GET something to read and HTTP POST to share something, we should be able to HTTP DELETE what we've published.

2011-04-10

Data-driven e-commerce with GoodRelations

On April 6th at the University of Economics, Prague, Martin Hepp gave a talk entitled Advertising with Linked Data in Web Content: From Semantic SEO to E-Commerce on the Web. Martin presented his view of the current situation in e-commerce and how it can be made better through structured data, illustrating it with GoodRelations, the ontology he has created.

GoodRelations

GoodRelations is an ontology describing the domain of electronic commerce. For instance, it can be used to express an offering of a product, specify a price, or describe a business, and the like. The author and active maintainer of GoodRelations is Martin Hepp. As he shared in his talk, there are actually quite a few features that set it apart from other ontologies.
  1. It's the one ontology that someone has actually paid for: at Overstock.com an expert was hired to consult on the use of GoodRelations.
  2. It's not only a research project. It's been accepted by the e-commerce industry and it's used by companies such as BestBuy or O'Reilly Media.
  3. Its design is driven mainly by practice and real use cases, not only by research objectives. For instance, it was amended when Google requested minor changes. Google even stopped recommending the vocabulary it had created for the domain of e-commerce in favour of GoodRelations. It's the piece of the semantic web Google has chosen. Nonetheless, it's still an OWL-compliant ontology.
  4. It comes with a healthy ecosystem around it. The ontology provides thorough documentation with lots of examples and recipes that you can adopt and fine-tune to your specific use case. There are validators available for the ontology and plenty of e-shop extensions and tools built for GoodRelations.
  5. Finally, it's not only a product of necessity. As Martin Hepp said, he actually quite enjoys doing it.

Product Ontology

The other project showcased by Martin Hepp was the Product Ontology. It's a dataset describing products that is derived from Wikipedia's pages. It contains several hundred thousand precise OWL DL class definitions of products. These class definitions are tightly coupled with Wikipedia: edits in Wikipedia are reflected in the Product Ontology. For instance, if the Product Ontology doesn't list the type of product you sell, you can create a page for it in Wikipedia and, provided it's not deleted, the product type will appear in the Product Ontology within 24 hours. This is similar to the way the BBC uses Wikipedia. An added benefit is that it can also serve as a dictionary containing up to a hundred labels in different languages for a product, because it's built on Wikipedia, which bundles pages describing the same thing in different languages.
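
For illustration, a Product Ontology class can be combined with GoodRelations directly in a page's markup. This is only a sketch with a made-up product; the class URI follows the Product Ontology pattern of reusing Wikipedia page names (here Espresso_machine), and the offering is linked to the product with the gr:includes property:

    <div xmlns:gr="http://purl.org/goodrelations/v1#"
         xmlns:pto="http://www.productontology.org/id/"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         about="#offering" typeof="gr:Offering">
      <!-- the offering includes a product typed with a Product Ontology class -->
      <div rel="gr:includes">
        <div about="#machine" typeof="pto:Espresso_machine">
          <span property="rdfs:label">Example espresso machine</span>
        </div>
      </div>
    </div>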

Semantic SEO

The primary benefit of GoodRelations is in how it improves search. We spend more time searching than we ever used to; Martin Hepp said that there's been an order-of-magnitude increase in the time we spend searching. It takes us a long time to finally find the thing we are interested in because current web search is a blunt instrument.

The World Wide Web acts as a giant information shredder. In databases, data are stored in a structured format. However, during transmission to web clients, the data get lost. They aren't sent as structured data but presented in a web page that a human customer can read but that machines can pretty much treat only as a black box. Instead of being sent in the form in which it's stored in the database, the message is not kept intact on its way through the web infrastructure. The structure of the data gets lost on the way to the client and only the presentation of the content is delivered. This means that an agent accessing the data via the Web often needs to reconstruct and infer the original structure of the data.

Web search operates on a vast amount of data that is for the most part unstructured and as such doesn't provide the affordances to do anything clever. Simple HTML doesn't allow you to articulate your value proposition well. Products and services are often reduced to a price tag. Enter semantic SEO.

Semantic SEO can be defined as using data to articulate your value proposition on the Web. It strives to preserve the specificity and richness of your value proposition when you need to send it over the Web. Ontologies such as GoodRelations allow you to describe your products and services with a high degree of precision.

Specificity

We need cleverer and more powerful search engines because of the tremendous growth in specificity. Wealth fosters the differentiation of products, and this in turn leads to increased specificity. This means there is a plethora of various types of goods and services available on the shelves of markets and shops. The size of the type system we use has grown (in RDF-speak, this would be the number of different rdf:types). We're overloaded with the number of different product types we're able to choose from. It's the paradox of choice: faced with a larger number of goods, our ability to choose one of them goes down.

What GoodRelations does is provide a way to annotate products and services on the Web so that search engines can use the annotations to deliver a better search experience to their users. It allows for deep search — a search that accepts very specific queries and gives very precise answers. With GoodRelations you can retain the specificity of your offering and harness it in search. This makes it possible to target niche markets and reach customers with highly specific needs in the long tail.

We need better search engines built on the structured data on the Web to alleviate the analysis paralysis that results from us being overwhelmed by the number of things to choose from. The growing amount of GoodRelations-annotated data is a step towards a situation where you'll be able to pose a specific question to a search engine and get a list of only the highly relevant results.

E-commerce applications and ontologies such as GoodRelations or the Product Ontology show a pragmatic approach to the use of semantic web technologies. Martin Hepp also mentioned his pragmatic view of linked data: in his opinion, the links that create the most business advantage are the most important. It was interesting to see parts of the semantic web that work. It seems we're headed towards a future of data-driven e-commerce.

2011-02-12

Open data success stories

The following is a short compilation of open data success stories. It's hard to see the indirect benefits of releasing data. Since publishing open data is building an infrastructure, there are no obvious direct benefits and you can't predict the concrete impact it will have. It's like building a road system: you don't know what types of cars will use it; and with open data, you don't know what kind of applications will be built on top of it. So the aim of this post is to provide stories about real, tangible benefits people can understand and relate to. The stories follow this pattern:

  1. Some open data is released.
  2. Something useful happens with the data.

Open data success stories:

  1. Opening data about donations to charities saves $3.2 billion. (http://eaves.ca/2010/04/14/case-study-open-data-and-the-public-purse/)
  2. In Houston (Texas) they found out that thousands of traffic violations were dismissed because the prosecuting officer was a no-show, $25 million were lost. (http://www.texaswatchdog.org/2010/11/houston-police-miss-hundreds-of-traffic-court-dates-tickets/1290042682.story)
  3. When the EU farm subsidy data were published, it was discovered that one of the main recipients of subsidies was Nestlé. (http://farmsubsidy.org/GB/recipient/GB131541/nestle-uk-ltd-804817/)
  4. Linked data for the traffic infrastructure in Amsterdam enabled the fire brigade to get to incidents faster by following the optimal path. (http://www.epsiplatform.eu/news/news/amsterdam_fire_brigade_on_linked_data)
  5. Combining the data about houses connected to the water supply and houses inhabited by black families and plotting them on a map showed discrimination against black residents. (http://www.miller-mccune.com/culture-society/the-revolution-will-be-mapped-7130/)
  6. On-demand improvement of maps for better organisation of relief and recovery at earthquake-stricken Port au Prince in Haiti. (http://haiti.openstreetmap.nl/)
  7. 43 % of public contracts in Slovakia had only one candidate supplier. (http://www.transparency.sk/o-43-velkych-verejnych-obstaravani-v-roku-2010-sutazil-len-jeden-kandidat)
  8. The city of Vancouver published the bin collection schedule, so that applications reminding citizens about the upcoming collection can be built. (http://eaves.ca/2009/06/29/how-open-data-even-makes-garbage-collection-sexier-easier-and-cheaper/)
  9. After the release of UK bus stops' locations people started reporting where the bus stops were missing or misplaced. (http://www.guardian.co.uk/news/datablog/2010/sep/27/uk-transport-national-public-data-repository)
  10. Open data enables people in Denmark to make better decisions: such as where to find the nearest public toilet. (http://beta.findtoilet.dk/)

A good list of open data exemplars can be found at the Open Knowledge Foundation blog. If you need further convincing that there are benefits to open data, I would recommend reading through the excellent article Why Open Data? from the Open Data Manual. And yes, I know that I should have named this post Top 10 open data success stories to get more traffic…

2011-01-16

Shopping starts at Google

I don't know where the Web ends. It may have multiple ends, or none. But I know where the Web starts. It starts at Google.

A few years back, it was reported that 6 % of all internet traffic starts at Google. Also, plenty of people have Google set as their homepage. I think many of us would agree that our brain is only a thin layer on top of Google.

One reason for using Google is that people don't remember URIs. Google remembers them well. On the Web, the address of a thing is a URI. In the human brain, the address of a thing is a set of associations which locate it in a neural network. That's why we need a way to translate these associations into a URI. Google does that fairly well. You pass it a bunch of keywords related to the thing you are looking for and it produces a nice, ordered list of URIs that might point to the thing you have in mind.

People don't use URIs to describe the things they are thinking of, machines do. I can't remember URIs, especially those of RDF vocabularies, which tend to be quite long. That's why I use prefix.cc, which lets me find the URI I'm looking for by passing it something I can remember: the vocabulary's prefix. The service remembers the vocabularies' URIs for me.

As it turns out, people don't remember the URIs of the things they want to buy either. So these days, a lot of shopping starts at Google. When you are looking to buy something you often start by describing that something to Google.

In commerce, things are addressed by brand. The problem with that is that people don't search for brands and they don't search for product names; they search for concepts. People don't search for Olympus E-450, they search for a camera. Brands and product names are not in their vocabularies, but concepts described by keywords are. People don't use brand names to describe the things they are thinking of, commerce does.

To bridge this gap you need to translate the keywords that people use to describe stuff into the brands that commerce uses to describe stuff. Enter search engine optimization (SEO). One of the things that SEO does is create synonym rings. A synonym ring is a set of synonyms, words that people use to describe a thing, such as the words mentioned in this tweet:

Can you all please stop retweeting those SEO jokes, gags, cracks, funnies, LOLs, humour, ROFLs, chuckles, rib-ticklers, one-liners, puns?

This SEO task consists in collecting the keywords people might use when searching for a thing so that they find your thing™ that you have described with these keywords.

It would be better if you could say that your thing™ (e.g., Olympus E-450) is a kind of thing people search for (e.g., a camera). Then, when people search for a thing, they may find that your thing™ is such a thing. This is one of the promises of the semantic web vision. But, just like its Wikipedia article, the semantic web still has a lot of issues.

Nevertheless, the semantic web vision has created some interesting by-products in the last few years. One of them is the Linked Open Data initiative, striving to build a common, open data infrastructure for the semantic web that is coming (for sure). Another by-product of this vision is so-called semantic SEO.

Both the semantic web and semantic SEO are misnomers. There is nothing exceptionally semantic in them. I would rather call it data SEO, but it seems the current name will stick. Semantic SEO is the practice of adding a little bit of structured data (preferably in RDF) to websites instead of adding a bunch of keywords. For instance, you can use the GoodRelations RDF vocabulary to mark up your web page describing the product you're offering; even Google says you can. In semantic SEO a little bit of semantics is good enough; it can still go a long way.

Having your thing™ described with structured data makes it machine-readable. A search engine, like Google, is a kind of machine. Therefore making your data machine-readable makes them readable for search engines. You can try for yourself how Google reads your data.

By adding a bit of data into the mark-up of your web page (preferably via RDFa) you can optimize the way it will be displayed in Google's search results. Instead of a boring, text-only rendering you can get a display that contains useful information, such as an image of your thing™, its rating, reviews and the like. See the example at the GoodRelations website to compare the difference.

People are more likely to click on a search result with nice image in it, a result that is enriched with all kinds of useful information. This may lead to an increase in your click-through rate. For example, RDFa adoption at BestBuy resulted in a 30 % increase in search traffic. Pursuing the semantic web vision has been a largely academic undertaking, so it's good to see that its by-product, semantic SEO, has some real financial benefits.

The practice of semantic SEO is definitely not an academic endeavour; quite the opposite, a lot of high-profile companies and institutions are adopting it (e.g., BestBuy, O'Reilly, or Tesco). The share of webpages that have structured data in RDFa in them is growing. In October 2010, RDFa was in 3.5 % of webpages, whereas the year before the share was 0.5 %.

E-commerce is one of the key factors that contributed to the growth of the Web in the 1990s. The same may become true for the Web of Data, a.k.a. linked data, and the e-commerce applications of the semantic web technologies, such as semantic SEO, may become a crucial driver behind its growth and accelerate the adoption of the linked data principles.