2012-01-21

Computing label specificity

This post has been long shelved in my head. Sometime around the summer of 2011 I started to think again about the problems that arise when you use labels (strings) as identifiers for information retrieval tasks. The ambiguity of labels used as identifiers without the necessary context is a common problem. Consider for example Wikipedia, which tries to ameliorate this issue by providing disambiguation links for ambiguous label-based URIs. In this case the disambiguation is done by the user, who is provided with more contextual information describing the disambiguated resources.

Gradually I started to be interested not in label ambiguity, but in an inverse property of textual labels: label specificity. What particularly interested me was the notion of computing label specificity based on external linked data. At first, I thought that having an indicator of a label's specificity might be useful when such a label is to be treated as an identifier of a resource. Interlinking came to mind, in which the probability of a linkage between two resources is often computed based on the similarity of their labels. Another idea was to use label specificity in ontology design, to label parts of an ontology with unambiguous strings.

The more I delved into the topic, the more it started to look like a useless, academic exercise. I wrote a few scripts, did some tests, and concluded the topic was hardly worth continuing. Then I stopped, leaving the work unfinished (as it led nowhere). I still cannot think of a real-world use for the approach I chose to investigate; however, I believe I have learnt something in the process, something that might stimulate further, more fruitful research.

Label ambiguity and specificity

Let's begin a hundred years ago, when Ferdinand de Saussure was writing about language. "Language works through relations of difference, then, which place signs in opposition to one another," he wrote. And, "a linguistic system is a series of differences of sound combined with a series of differences of ideas." However, these differences are not context-free, as their resolution is context-dependent. Natural language is not as precise as DNS, and its correct resolution requires reasoning with context, of which humans are more capable than computers. This raises the question of what would happen if computers were provided with contextual, background knowledge in a sufficiently structured form.

A label is a proxy for a concept, and its resolution depends on the label's context. Depending on the context in which a label is used, it can resolve to different concepts. Thus, addressing RDF resources with literal labels is a crude method of information retrieval. In most cases, labels alone are not sufficient for unique identification of concepts. The ambiguity of labels makes them unfit to be used as identifiers for resources. However, in some cases labels do serve the purpose of identification, and this comes with consequences: the consequences of treating labels as values of instances of owl:InverseFunctionalProperty.

From the linguistic perspective, ambiguous labels can be homonyms or polysemes. Homonyms are unrelated meanings that share a label with the same orthography, such as bat as an animal and as a baseball bat. Polysemes, on the other hand, are related meanings grouped under the same label, such as mouth as a body part and as a place where a river enters the sea.

Computing label specificity then largely becomes a task of identifying ambiguous labels. Given the extensive language-related datasets available on the Web, such a task seems feasible. For instance, by using additional data from external sources, one can verify a link based on a rule specifying that the linked resources must match on an unambiguous label. And vice versa: every method of verification may be reversed and used to question the certainty of a given link.

Label specificity affects the quality of interlinking based on matching labels of RDF resources. Harnessing string similarity for literal values of label properties is a common practice for linking data that is fast and easy to implement. Also, when the data about an RDF resource are very brief and there is virtually no other information describing the resource apart from its label, matching based on computing similarity of labels may be the only possible method for discovering links in heterogeneous datasets.

This approach to interlinking may work well if the matched labels are specific enough in the target domain to uniquely identify linked resources. In a well-defined and limited domain, such as medical terminology, it makes sense to treat label properties almost as instances of owl:InverseFunctionalProperty and use their values as strong identifiers of the labelled resources. However, in other cases the quality of results of this approach suffers from lexical ambiguity of resource labels.

In cases where the links between datasets are created by a procedure based on similarity matching of resource labels, checking the specificity of the matched labels may be used to find links that were created by matching ambiguous labels and therefore need further confirmation by other methods. Such methods include matching labels in a different language, in cases where the RDF resources have multilingual descriptions, or even manual examination to verify the validity of the link.

So I thought label specificity could be used as a method of verification for straightforward but faulty interlinking based on resources' labels. I designed a couple of tests that were supposed to examine a tested label's specificity.

Label specificity tests

The simplest of the tests was a check of the label's character length. This test was based on a naïve intuition that a label's length correlates with its specificity: longer labels tend to be more specific, as more letters add more context. In some cases, labels that exceed a specified threshold length may even be treated as unambiguous.
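The length heuristic can be sketched in a few lines of Python. The threshold value below is an arbitrary assumption chosen for illustration; the original experiments do not preserve the value that was actually used.

```python
def specificity_by_length(label, threshold=20):
    """Naive heuristic: labels longer than `threshold` characters
    are treated as specific (unambiguous)."""
    return len(label) >= threshold

# Short labels such as "bat" fall below the threshold, while long,
# descriptive labels pass it.
print(specificity_by_length("bat"))  # False
print(specificity_by_length("Polythematic Structured Subject Heading System"))  # True
```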

Just a little bit more complex was the test taking into account whether the label is an abbreviation. If a label contains only capital letters, it is highly likely to be an abbreviation. Abbreviations used as labels are short and less specific because they may refer to several meanings. For instance, the abbreviation RDF may be expanded to Resource Description Framework, Radio direction finder, or Rapid deployment force (define:RDF). Unless abbreviations are used in a limited context defining a topical domain or language, such as the context of semantic web technologies for RDF, it is difficult to guess correctly the meaning they refer to.
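The all-capitals check described above is a one-liner; a minimal sketch:

```python
def is_likely_abbreviation(label):
    """True if the label consists solely of capital letters (e.g. "RDF"),
    which suggests it is an abbreviation and thus a less specific label."""
    return label.isalpha() and label.isupper()

print(is_likely_abbreviation("RDF"))  # True
print(is_likely_abbreviation("bat"))  # False
```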

Another test based on an intuition assumed that if a single dataset contains more than one resource with the examined label, the label should be treated as ambiguous. It is likely that a single dataset contains no duplicate resources, and thus resources sharing the same label must refer to distinct concepts.
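A sketch of this duplicate-label test, with the dataset modelled simply as (resource URI, label) pairs; the example URIs are invented for illustration:

```python
from collections import Counter

def ambiguous_labels(resource_label_pairs):
    """Return the set of labels shared by more than one resource,
    i.e. the labels this test would flag as ambiguous."""
    counts = Counter(label for _, label in resource_label_pairs)
    return {label for label, n in counts.items() if n > 1}

dataset = [
    ("http://example.com/animal/bat", "bat"),
    ("http://example.com/sports/bat", "bat"),
    ("http://example.com/animal/aardvark", "aardvark"),
]
print(ambiguous_labels(dataset))  # {'bat'}
```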

I thought about a test based on a related supposition: that the number of translations of the tested label from one language to another also indicates its specificity. However, at the time I was thinking about this, I had not discovered any sufficient data sources. Google Translate had recently closed its API, and other dictionary sources either lacked machine-readable data output or were not open enough. Not long ago I learnt about the LOD in Translation project, which queries multiple linked data sources and retrieves translations of strings based on multilingual values of the same properties for the same resources. Looking at the number of translations of bat (from English to English), it supports the inkling that it is not the most specific label.

Then I tried a number of tests based on data from external sources, which explicitly provide data about the various meanings of a tested label. I started with the so-called Thesaurus, which offers a REST API for retrieving the set of categories to which a given label belongs. Thesaurus supports four languages: English, Italian, French, and Spanish. However, the best coverage is available for English.

The web service accepts a label and the code of its language, together with the API key and a specification of the requested output format (XML or JSON). For example, for the label bat it returns 10 categories representing different senses in which the label may be used.
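Constructing such a request is straightforward. In the sketch below the endpoint URL and parameter names are placeholders (the original post does not preserve them), so treat them as assumptions to be adapted to the actual API documentation:

```python
from urllib.parse import urlencode

def thesaurus_request_url(label, language="en_US", key="YOUR_API_KEY",
                          output="json",
                          endpoint="http://example.com/thesaurus/v1"):
    """Build a GET request URL for a thesaurus-style REST API.
    Endpoint and parameter names are hypothetical placeholders."""
    params = {"word": label, "language": language,
              "key": key, "output": output}
    return endpoint + "?" + urlencode(params)

print(thesaurus_request_url("bat"))
```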

Then I turned to the resources that provide data in RDF. Lexvo provides RDF data from various linguistic resources combined together, building on datasets such as Princeton WordNet, Wiktionary, or Library of Congress Subject Headings.

Lexvo offers a linked data interface, and it is also possible to access its resources in RDF/XML via content negotiation. The Lexvo URIs follow a simple pattern: http://lexvo.org/id/term/{language code}/{label}. In order to get the data to examine a label's specificity, retrieve the RDF representation of a given label, get the values of the property http://lexvo.org/ontology#means from the Lexvo ontology, and group the results by data provider (cluster them according to their base URI) to eliminate the same meanings coming from different datasets included in Lexvo. For example, the URI http://lexvo.org/id/term/eng/bat returns 10 meanings from WordNet 3.0 and 4 meanings from OpenCyc.
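The grouping step can be sketched as follows. Fetching and parsing the RDF itself is omitted; the function works on a list of meaning URIs, and the example URIs below are illustrative rather than actual Lexvo data:

```python
from urllib.parse import urlparse
from collections import defaultdict

def group_meanings_by_provider(meaning_uris):
    """Cluster meaning URIs by their base URI (scheme + host), so that
    each data provider included in Lexvo is counted once."""
    groups = defaultdict(list)
    for uri in meaning_uris:
        parsed = urlparse(uri)
        groups[parsed.scheme + "://" + parsed.netloc].append(uri)
    return dict(groups)

meanings = [
    "http://wordnet.example.org/wn30/bat-n-1",
    "http://wordnet.example.org/wn30/bat-n-2",
    "http://opencyc.example.org/concept/bat",
]
groups = group_meanings_by_provider(meanings)
print({provider: len(uris) for provider, uris in groups.items()})
```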

Of course I tried DBpedia, one of the biggest sources of structured data, as well. Unlike Lexvo, DBpedia provides a SPARQL endpoint. In this way, instead of requesting several resources and running multiple queries, the information about the ambiguity of a label based on Wikipedia's disambiguation links can be retrieved by a single SPARQL query, such as this one:

PREFIX dbpont: <http://dbpedia.org/ontology/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?resource ?label
WHERE {
  ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
    rdfs:label "Bat"@en
  ] .
  ?resource rdfs:label ?label .
  FILTER(LANGMATCHES(LANG(?label), "en"))
}
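A minimal Python sketch of issuing such a query over HTTP, assuming the public DBpedia endpoint at http://dbpedia.org/sparql. Only the request construction is shown, since actually running the query requires network access:

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

def disambiguation_query(label, language="en"):
    """Build the SPARQL query above, parameterized by label and language."""
    return """
    PREFIX dbpont: <http://dbpedia.org/ontology/>
    PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?resource ?label
    WHERE {
      ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
        rdfs:label "%s"@%s
      ] .
      ?resource rdfs:label ?label .
      FILTER(LANGMATCHES(LANG(?label), "%s"))
    }""" % (label, language, language)

def request_url(label):
    """URL of a GET request asking the endpoint for JSON results."""
    params = {"query": disambiguation_query(label),
              "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)

# To actually execute the query:
# import json, urllib.request
# results = json.load(urllib.request.urlopen(request_url("Bat")))
print(request_url("Bat"))
```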

After implementing similar tests of label specificity, I was not able to find any practical use case for such functionality. I left this work, shifting my focus to things that looked more applicable in the real world. I still think there is something to it, and maybe I will return to it some day.

2011-09-24

Technical openness of open data

Apart from the legal requirements on open data, there are also aspects of technical openness. While the legal aspects are explicitly defined by the Open Definition, there is less understanding of the technical recommendations for making data open. Some principles of this side of openness are covered by the Three laws of open data by David Eaves, others are proposed in the Linked Open Data star scheme. An excellent resource that touches on both legal and technical requirements for open data is 8 open government data principles.

Data need to be formalized so that we can serialize them to representations that may be exchanged. However, there are different formalizations that may be used for communicating data, different formats that are more or less open. I think open technologies for representing data share a set of family resemblances. So, open data are:

Non-exclusive
Open data are not published exclusively for a particular application. No application has exclusive access to open data. Instead, they are available to be used by any application and thus support a wide range of uses.

Non-proprietary
No entity has exclusive control over non-proprietary data formats. Such formats have an open specification that may be implemented by anyone. Therefore, data in these formats are not tightly coupled with a specific software that is able to read them.

Standards-based
The data are based on open, community-owned standards. This means the standards are developed in an open process that may be joined by anyone from the public (i.e., not Schema.org). Such standards prescribe a set of rules the data have to adhere to. Standardized data have an expected format, which ensures interoperability, and as such can be used by a plethora of standards-compliant tools.

Machine-readable
Open data are formalized enough so that machines are able to use them. Well-formalized data have a structure that enables their automated machine processing. For instance, unlike a scanned document stored as an image, which is one opaque blob, open data have a higher granularity because they are segmented into well-defined data items (e.g., rows, columns, triples).

Findable
Open data should be publicly available on the Web. This means having URLs that successfully return representations of the data. Data should be directly accessible by resolving their URL. Any technical barriers preventing access to the data, such as passwords or required registration, are unacceptable, as are any attempts to hide the data and achieve security through obscurity via anti-SEO techniques. As David Eaves puts it, if Google cannot find it, no one can.

Linkable
Elements of open data should be identified with URIs. In this way it is possible to link to them. This approach encourages re-use, data integration, and proper attribution of data used as a source.

Linked
If your open data are linked to other open data, users can follow these links to discover more. Being a part of the Web of data brings the benefits yielded by the network effects.

As you might have guessed from the previous points, I think that linked data is a very open technology. And, if you look at the 5 stars of linked data, its author Tim Berners-Lee thinks the same. So if you want to make your data more open, it is a step in the right direction.

Open bibliographic data checklist

I have decided to write a few points that might be of interest to those thinking about publishing open bibliographic data. The following is a fragment of an open bibliographic data checklist, or, how to release your library's data into the public without a lawyer holding your hand.

I have been interested in open bibliographic data for a couple of years now, and I try to promote them at the National Technical Library, where we have, so far, released only one authority dataset — the Polythematic Structured Subject Heading System. The following points are based on my experience with this topic. What should you pay attention to when opening your bibliographic data, then?

  • Make sure you are the sole owner of the data, or make arrangements with the other owners. For instance, things may get complicated in case the data was created collaboratively via shared cataloguing. If you are not in complete control of the data, start with consulting the other proprietors that have a stake in the datasets.
  • Check that the data you are about to release are not bound by contractual obligations. For example, you may publish a dataset under a Creative Commons licence, only to realize that there are some unresolved contracts with parties that helped fund the creation of that data years ago. Then you need to discuss this issue with the involved parties to resolve whether making the data open is a problem.
  • Read your country's legislation to get to know what you are able to do with your data. For instance, in the Czech Republic it is not possible to put data into the public domain intentionally. The only way public domain content is created is by the natural order of things, i.e., the author dies, leaves no heir, and after quite some time the work enters the public domain.
  • See if the data are copyrightable. For instance, if the data do not fall within the scope of the copyright law of your country, they are not suitable to be licenced under Creative Commons, since this set of licences draws its legal force from copyright law; it is an extension of copyright and builds on it. Facts are not copyrightable, and most bibliographic records are made of facts. However, some contain creative content, for example subject indexing or an abstract, and as such are appropriate for licencing based on copyright law. Your mileage may vary.
  • Consult the database act. Check if your country has a specific law dealing with the use of databases that might add more requirements needing your attention. For example, in some legal regimes databases are protected on another level, as aggregations of individual data elements.
  • Different licencing options may be applicable to the content and the structure of a dataset, for instance when there are additional terms required by database law. You can opt for dual licensing and use two different licences: one for the dataset's content that is protected by copyright law (e.g., a Creative Commons licence), and one for the dataset's structure, for which copyright protection may not apply (e.g., the Public Domain Dedication and License).
  • Choose a proper licence. A proper open licence is one that conforms to the Open Definition (and will not get you sued), so pick one of the OKD-Compliant licences. A good source of solid information about licences for open data is Open Data Commons.
  • BONUS: Tell your friends. Create a record in the Data Hub (formerly CKAN) and add it to the bibliographic data group to let others know that your dataset exists.

Even if it may seem there are lots of things you need to check before releasing open bibliographic data, it is actually easy. It is a performative speech act: you only need to declare your data open to make them open.

<disclaimer>If you are unsure about some of the steps above, see a lawyer to consult it. Note that the usual disclaimers apply for this post, i.e., IANAL.</disclaimer>

2011-08-06

Turning off feed reader

Today I have decided to stop using my feed reader. My use of it has diminished over a long period of time and I no longer think it's an optimal tool for the way I like to discover information.

In my view, feeds, whether they're from blogs, news sites or any other origin, contain just too much noise. You need to go through all of the items in your subscribed feeds yourself. It's information filtering on the client side. Feed readers don't allow for the fine-grained filtering I would like to be able to do, and thus they are blunt instruments for information discovery.

Reading feeds may also lack serendipitous discovery. I'm rarely surprised when I read my feeds. On the other hand, on Twitter I get interesting pointers to various resources much more frequently, due to the ways information spreads through the network of Twitter users before it finally reaches me (e.g., retweets).

Because of these shortcomings, my primary platform for information acquisition is now Twitter. I don't read feeds, newspapers, or magazines, or watch TV news and the like. I have given up trying to achieve even near-complete coverage of the topics I'm interested in, and instead I sample and skim-read my Twitter stream.

Twitter provides me with a manageable stream of highly relevant information resources that I'm usually able to process and digest. It offers me serendipitous discoveries I wouldn't have come across when using feed readers. Also, I like to sample from a wide range of resources on different topics, and Twitter caters for that quite well.

I have changed my information consumption habits. In a sense, I have switched to probabilistic information retrieval. I know that I can't get complete coverage of the subject areas I'm interested in. I'm conscious that I miss something, but I'm fine with that. I believe that if a piece of information is important enough, it will come back to me. If I don't catch something, I trust my network on Twitter to make me pay attention to it by mentioning it, re-tweeting it, and re-discovering it for me.

On Twitter my information filter is the network of the people I follow. The key difference is that while you're reading feeds you're using people as content creators, on Twitter you're using people as content curators. It's a filtering on a meta level: instead of filtering information yourself you filter the people that are filtering information for you. Your responsibility is to curate the list of Twitter users you follow. However, if you want to be an active member of the Twitter ecosystem you curate, share, and forward information for your followers.

On the Web there are many information channels, and trying to follow all of them results in a fragmentation of one's attention. Reading lots of information resources is time-consuming, content is often duplicated and therefore demands strenuous filtering, and context switching between different media is expensive for one's cognitive abilities.

In an attention economy, we decide how we spend our resources of attention. While marketing uses targeting to reach relevant audiences, we do reverse targeting when we expose ourselves as targets to media of our choice. Choosing a single, yet heterogeneous, information acquisition channel, such as Twitter, may lead to a defragmentation of our attention, and thus it may be a step towards a more efficient allocation of one's attention.

The switch from feed readers may be a general trend. I think that information acquisition via feed readers was in part surpassed by the social media and the ubiquitous sharing of content on the Web (tweets, likes, plus ones, recommendations, etc.). One of the questions asked by the media theorist Marshall McLuhan in his tetrad of media effects was What does the medium make obsolete?. If we ask what Twitter makes obsolete, the answer may well be feed readers.

That said, I still think feeds are indispensable when it comes to information acquisition for machines, such as web applications and the like. Feeds are well suited for machines to exchange information. Unlike for humans, attention isn't a scarce resource for machines. Machines can read all items in feeds. But people need more human ways of discovering new information, as they have limited resources of attention. I think Twitter delivers on that.

2011-07-19

Spoonfeeding Google with RDF graphs packaged as trees

During a small side project I've found out that the Google Rich Snippets Testing Tool doesn't treat RDFa as RDF (i.e., a graph) but rather as a simple hierarchical structure (i.e., a tree). It doesn't take into account links in RDFa, only the way HTML elements are nested inside one another. More about the difference between the data models of graph and tree can be found in a blog post by Lin Clark.

I've created two documents that yield the same RDF when you run the RDFa distiller on them. Both contain GoodRelations product data, but the difference between them is that in the first document the HTML element describing the price specification (gr:UnitPriceSpecification) is not nested inside the HTML element describing the offering (gr:Offering); instead, it is linked from the offering via the gr:hasPriceSpecification property. In the second document the HTML element with the price specification is nested in the element about the offering.

Even though the documents contain the same data, the Google Rich Snippets Testing Tool parses them differently and refuses to show a preview of the search result in the case of the first document, whereas the second document produces a preview. In the first case, the price information is not recognized because it's not nested inside the HTML element describing the offering, and thus a warning is shown:

Warning: In order to generate a preview, either price or review or availability needs to be present.

This leads me to believe that the Google Rich Snippets Testing Tool doesn't parse RDFa as RDF, but as a tree (much like a DOM tree), effectively the same way as HTML5 microdata, which is built on the tree model. Google doesn't use RDFa as RDF, but as microdata.

Eric Hellman wrote a blog post about spoonfeeding data to Google. Even though Google still accepts some RDF (e.g., GoodRelations) after the announcement of the microdata-based Schema.org, it wants to be spoonfed RDF graphs packaged as microdata trees. Does this mean that if Google is the primary target consumer for your data, you shouldn't bother with packaging your RDF in trees, but rather directly provide your data as a tree in HTML5 microdata?

2011-07-03

RDFa in action

RDFa is a way to exchange structured data inside HTML documents. RDFa provides information that is formalized enough for computers (such as googlebot) to process it in an automated way. RDFa is a complete serialization of RDF, using attribute-value pairs to embed data into HTML documents in a way that does not affect their visual display. RDFa is a hack built on top of HTML. It repurposes some of the standard HTML attributes (such as href, src or rel) and adds new ones (such as property, about or typeof) to enrich HTML with semantic mark-up.

A good way to start with RDFa is to read through some of the documents, such as the RDFa Primer or even the RDFa specification. When you want to annotate an HTML document with RDFa, you might want to go through a series of steps. We used this workflow during an RDFa workshop I helped to organize, and this recipe worked quite well. Here it is.

  1. Find out what you want to describe (e.g., your personal profile).
  2. Find which RDF vocabularies can be used to express a description of such a thing (e.g., FOAF). There are multiple ways to discover suitable vocabularies, some of which are listed at the W3C website for Ontology Dowsing.
  3. Start editing your HTML: either the static files or dynamically rendered templates.
  4. Start at the first line of your document and set the correct DOCTYPE. If you are using XHTML, use <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd"> (i.e., RDFa 1.0). If you are using HTML5, use <!doctype html> (i.e., RDFa 1.1). This will allow you to validate your document and see if you are using RDFa correctly.
  5. Refer to the used RDF vocabularies. By declaring vocabularies' namespaces you can set up variables that you can use in compact URIs. If you are using XHTML, use the xmlns attribute (e.g., xmlns:dv="http://rdf.data-vocabulary.org/#"). If you are using HTML5, use prefix, vocab, or profile attributes (e.g., prefix="dv: http://rdf.data-vocabulary.org/#").
  6. Identify the thing you want to describe. Use a URI as a name for the thing so that others can link to it. Use the about attribute (e.g., <body about="http://example.com/recipe">). Everything that is nested inside of the HTML element with the about attribute is the description of the identified thing, unless a new subject of description is introduced via a new about attribute.
  7. Use the typeof attribute to express what kind of thing you are describing (e.g., <body about="http://example.com/recipe" typeof="dv:Recipe">). Pick a suitable class from the RDF vocabularies you have chosen to use and define the thing you describe as an instance of this class. Note that every time the typeof attribute is used, the subject of description changes.
  8. Use the property, rel and rev attributes to name the properties of the thing you are describing (e.g., <h1 property="name">).
  9. Assign values to the properties of the described thing using either the textual content of the annotated HTML element or an attribute such as content, href, resource or src (e.g., <h1 property="name">RDFa in action</h1> or <span property="v:author" rel="dcterms:creator" resource="http://keg.vse.cz/resource/person/jindrich-mynarz">Jindřich Mynarz</span>).
  10. If you have assigned the textual content of an HTML element as the value of a property of the thing described, you can annotate it further. To define the language of the text, use either the xml:lang (in XHTML) or lang (in HTML5) attribute (e.g., <h1 property="name" lang="en">RDFa in action</h1>). If you want to set the datatype of the value, use the datatype attribute (e.g., <span content="2011-07-03" datatype="xsd:date">July 3, 2011</span>).
  11. Check your RDFa-annotated document using validators and examine the data using RDFa distillers to see if you have got it right.
  12. Publish the annotated HTML documents on the Web. Ping the RDFa consumers such as search engines so that they know about your RDFa-annotated web pages.

Art of emptiness

Marshall McLuhan created a distinction between "hot" and "cool" media. I think it is a productive conceptualization of media because it stimulates thinking, even though it suggests thinking in terms of binary opposites.

The longer I enjoy art, the more I tend to prefer "cool art". The following is a comparison of hot and cool styles of art, with a particular focus on music. I hope this will not result in a death from metaphor, but rather in a productive use of it. First, let's start with what McLuhan called the "hot media".

Hot art

Hot art is an art of sensory overload. It provides rich, overwhelming, super-stimuli that lower our ability to parse our sensory input. Hot art needs a space to inhabit; it is an environment-seeking art. Art is always situated in a host environment, in a wider context; and hot art needs space to live in. For instance, for visual arts it is the space of plain, white walls in an art gallery.

Hot art enforces a single interpretation, it is not open for a creative use. It guides a person through a linear, pre-defined experience, without a need for participation. In this way, it achieves a temporary oblivion by the means of hypnosis. The source of super-stimulation occupies our brain, blocks any other input, and forces the person to pay attention only to it.

For the most part, hot art is perceived on the conscious level. Hot art is digitally mastered, manufactured product that is made to achieve the maximum effect possible. The result of such process feels artificial, perfect, and error-free.

A typical example of hot art is pop music. For example, this manifests itself in the "wall of sound" method which uses a plenty of different layers of sound to provide a compelling listening experience.

Cool art

On the other hand, cool art is an art of sensory deprivation. It uses under-stimulation to create emptiness. Cool art creates space, and thus it is an environment-creating art as it puts the person perceiving it in an environment of its own.

Cool art is open and invites a multiplicity of interpretations. It inspires people to undergo a non-linear experience, while requiring a high level of active participation. Participatory art evokes hallucination, which manifests itself as a "furious fill-in or completion of sense when all outer sensation is withdrawn" (source). Left with minimal sensory input, the human mind starts to create its own content. This is a mechanical process, a natural reaction to under-stimulation of the sensory apparatus. Left alone, the mind tends to wander, fill in the blanks, and complete the missing parts. Cool art inspires one to create by means of hallucination.

The experience of cool art is mostly an unconscious one. In contrast to hot art, it is based in analogue, non-discrete forms, which grow in organic ways. For instance, this can be achieved by the techniques of field recordings or employing non-deterministic or random processes. Such art is in a way more natural, it embraces error (cf. esthetics of glitch in music).

A typical example of cool art is dub techno. Dub techno got rid of the usual elements of music, such as melody, and confined itself to conveying music mostly through subtle, slowly evolving changes of rhythm or timbre. This is minimalism that manifests itself through extensive repetition and limiting oneself to the expressive power of bare rhythm.

I prefer cool art to hot art. However, this is a matter of taste, which implies it may change. To conclude, let me give you a couple of examples of what I consider to be cool art.

Visual arts: Unloud painting

Cinema: Stalker by Andrei Tarkovsky

Music: cv313 - Subtraktive (Soultek's Stripped Down Dub)