2012-01-21

Computing label specificity

This post has long been shelved in my head. Sometime around the summer of 2011 I started to think again about the problems that arise when you use labels (strings) as identifiers for information retrieval tasks. The ambiguity of labels used as identifiers without the necessary context is a common problem. Consider, for example, Wikipedia, which tries to ameliorate this issue by providing disambiguation links for ambiguous label-based URIs. In this case the disambiguation is done by the user, who is provided with more contextual information describing the disambiguated resources.

Gradually I became interested not in label ambiguity, but in an inverse property of textual labels: label specificity. What particularly interested me was the notion of computing label specificity based on external linked data. At first, I thought that having an indicator of a label's specificity might be useful when such a label is to be treated as an identifier of a resource. Interlinking came to mind, in which the probability of a link between two resources is often computed based on the similarity of their labels. Another idea was to use label specificity in ontology design, to label parts of an ontology with unambiguous strings.

The more I delved into the topic, the more it started to look like a useless academic exercise. I wrote a few scripts, did some tests, and thought the topic was hardly worth pursuing. Then I stopped, leaving the work unfinished (as it led nowhere). I still cannot think of a real-world use for the approach I chose to investigate. However, I believe I have learnt something in the process, something that might stimulate further and more fruitful research.

Label ambiguity and specificity

Let's begin a hundred years ago, when Ferdinand de Saussure was writing about language. "Language works through relations of difference, then, which place signs in opposition to one another," he wrote. And: "a linguistic system is a series of differences of sound combined with a series of differences of ideas." However, these differences are not context-free, as their resolution is context-dependent. Natural language is not as precise as DNS, and its correct resolution requires reasoning with context, of which humans are more capable than computers. This raises the question of what would happen if computers were provided with contextual background knowledge in a sufficiently structured form.

A label is a proxy for a concept, and its resolution depends on the label's context. Depending on the context in which a label is used, it can redirect to different concepts. Thus, addressing RDF resources with literal labels is a crude method of information retrieval. In most cases, labels alone are not sufficient for unique identification of concepts, and their ambiguity makes them unfit to be used as identifiers for resources. However, in some cases labels do serve the purpose of identification, and this comes with consequences: the consequences of treating label properties as instances of owl:InverseFunctionalProperty.

From the linguistic perspective, ambiguous labels can be homonyms or polysemes. Homonyms are unrelated meanings that share a label with the same orthography, such as bat as an animal and as a club used in baseball. Polysemes, on the other hand, are related meanings grouped under the same label, such as mouth as a body part and as the place where a river enters the sea.

Computing label specificity then largely becomes a task of identifying ambiguous labels. Given the extensive language-related datasets available on the Web, such a task seems feasible. For instance, by using additional data from external sources, one can verify a link based on a rule specifying that the linked resources must match on an unambiguous label. And vice versa, every method of verification may be reversed and used for questioning the certainty of a given link.

Label specificity impacts the quality of interlinking based on matching the labels of RDF resources. Harnessing string similarity for literal values of label properties is a common practice for linking data that is fast and easy to implement. Also, when the data about an RDF resource are very brief and there is virtually no other information describing the resource apart from its label, matching based on the similarity of labels may be the only possible method for discovering links in heterogeneous datasets.

This approach to interlinking may work well if the matched labels are specific enough in the target domain to uniquely identify linked resources. In a well-defined and limited domain, such as medical terminology, it makes sense to treat label properties almost as instances of owl:InverseFunctionalProperty and use their values as strong identifiers of the labelled resources. However, in other cases the quality of results of this approach suffers from lexical ambiguity of resource labels.

Where the links between datasets are created by similarity matching of resource labels, checking the specificity of the matched labels may be used to find links that were created by matching ambiguous labels. Such links need further confirmation by other methods, such as matching labels in a different language when the RDF resources have multilingual descriptions, or even manual examination to verify the validity of the link.

So I thought label specificity could be used to verify straightforward but error-prone interlinking based on resources' labels. I designed a couple of tests that were supposed to examine a label's specificity.

Label specificity tests

The simplest of the tests was a check of the label's character length. This test was based on a naïve intuition that a label's length correlates with its specificity: longer labels tend to be more specific, as more letters add more context. In some cases, labels that exceed a specified threshold length may be treated as unambiguous.
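The length check can be sketched in a few lines of Python. This is only an illustration, not the original script: the function name and the threshold of 20 characters are assumptions chosen for the example, and a useful threshold would have to be tuned for the dataset at hand.

```python
def is_long_enough(label: str, threshold: int = 20) -> bool:
    """Return True if the label is longer than the threshold.

    The default threshold of 20 characters is an arbitrary assumption,
    not a value taken from the original experiments.
    """
    # Collapse runs of whitespace so that padding does not inflate the length.
    normalized = " ".join(label.split())
    return len(normalized) > threshold
```

With the assumed threshold, is_long_enough("Resource Description Framework") holds, while is_long_enough("bat") does not.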

Just a little more complex was the test that took into account whether the label is an abbreviation. If the label contains only capital letters, it is highly likely that it is an abbreviation. Abbreviations used as labels are short and less specific because they may refer to several meanings. For instance, the abbreviation RDF may be expanded to Resource Description Framework, Radio direction finder, or Rapid deployment force (define:RDF). Unless abbreviations are used in a limited context defining a topical domain or language, such as the context of semantic web technologies for RDF, it is difficult to guess correctly the meaning they refer to.
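A rough sketch of the abbreviation heuristic in Python follows. The function name is made up for this example, and the handling of digits and punctuation (they are ignored, so "U.S.A." also counts) is an assumption that goes slightly beyond the "only capital letters" rule described above.

```python
def looks_like_abbreviation(label: str) -> bool:
    """Heuristically flag labels written entirely in capital letters.

    str.isupper() is True only when the string has at least one cased
    character and all cased characters are uppercase, so digits and
    punctuation are ignored (an assumption of this sketch).
    """
    return label.strip().isupper()
```

Note that the heuristic also flags labels shouted in capitals that are not abbreviations, so it can only serve as a weak signal.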

Another intuition-based test assumed that if a single dataset contains more than one resource with the examined label, the label should be treated as ambiguous. It is likely that a single dataset contains no duplicate resources, and thus resources sharing the same label must refer to distinct concepts.
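The duplicate-label test can be illustrated as follows. This is a minimal sketch under assumptions: the dataset is represented as an iterable of (resource URI, label) pairs, labels are compared case-insensitively after trimming whitespace, and the function name and the ex: identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def labels_shared_within_dataset(pairs):
    """Return the normalized labels attached to more than one resource.

    `pairs` is an iterable of (resource_uri, label) tuples from a single
    dataset. Case-insensitive comparison is an assumption of this sketch.
    """
    resources_by_label = defaultdict(set)
    for resource, label in pairs:
        # A set keeps a resource from being counted twice for the same label.
        resources_by_label[label.strip().casefold()].add(resource)
    return {
        label
        for label, resources in resources_by_label.items()
        if len(resources) > 1
    }
```

For the pairs ("ex:1", "Bat"), ("ex:2", "bat") and ("ex:3", "Mouth"), only the label bat would be reported as shared.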

I thought about a test based on a related supposition, that the number of translations of the tested label from one language to another also indicates its specificity. However, at the time I was thinking about this, I had not discovered any sufficient data sources. Google Translate had recently closed its API, and other dictionary sources were not equipped with machine-readable output or were not open enough. Not long ago I learnt about the LOD in Translation project, which queries multiple linked data sources and retrieves translations of strings based on multilingual values of the same properties for the same resources. The number of translations it returns for bat (from English to English) supports the inkling that it is not the most specific label.

Then I tried a number of tests based on data from external sources, which explicitly provided data about the various meanings of the tested label. I started with the so-called Thesaurus, which offers a REST API for retrieving the set of categories to which a given label belongs. Thesaurus supports four languages: English, Italian, French and Spanish. However, the best coverage is available for English.

The web service accepts a label and the code of its language, together with an API key and a specification of the requested output format (XML or JSON). For example, for the label bat it returns 10 categories representing different senses in which the label may be used.

Then I turned to the resources that provide data in RDF. Lexvo provides RDF data from various linguistic resources combined together, building on datasets such as Princeton WordNet, Wiktionary, or Library of Congress Subject Headings.

Lexvo offers a linked data interface, and it is also possible to access its resources in RDF/XML via content negotiation. The Lexvo URIs follow a simple pattern: http://lexvo.org/id/term/{language code}/{label}. In order to get the data to examine a label's specificity, retrieve the RDF representation of the given label, get the values of the property http://lexvo.org/ontology#means from the Lexvo ontology, and group the results by data provider (cluster them according to their base URI) to avoid counting the same meanings coming from different datasets included in Lexvo more than once. For example, the URI http://lexvo.org/id/term/eng/bat returns 10 meanings from WordNet 3.0 and 4 meanings from OpenCyc.
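The grouping step might look roughly like this in Python. It is a sketch under several assumptions rather than the original script: the values of the means property are taken to be already extracted into a list of URI strings, the "base URI" is approximated by the host name, the largest per-provider group is used as the sense count, and the function names and the .example URIs in the usage note are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_provider(meaning_uris):
    """Cluster meaning URIs by host name, used here as a stand-in for the
    provider's base URI (an assumption of this sketch)."""
    groups = defaultdict(set)
    for uri in meaning_uris:
        groups[urlsplit(uri).netloc].add(uri)
    return dict(groups)


def sense_count(meaning_uris):
    """Estimate the number of senses as the size of the largest provider
    group, so that overlapping providers are not added together."""
    groups = group_by_provider(meaning_uris)
    return max((len(uris) for uris in groups.values()), default=0)
```

Given two URIs under http://wordnet.example/ and one under http://cyc.example/, group_by_provider yields two clusters and sense_count reports the size of the larger one. Taking the maximum instead of the sum is one possible design choice; another would be to report the counts per provider separately.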

Of course I tried DBpedia, one of the biggest sources of structured data, as well. Unlike Lexvo, DBpedia provides a SPARQL endpoint. In this way, instead of requesting several resources and running multiple queries, the information about the ambiguity of a label based on Wikipedia's disambiguation links can be retrieved by a single SPARQL query, such as this one:

PREFIX dbpont: <http://dbpedia.org/ontology/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?resource ?label
WHERE {
  ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
    rdfs:label "Bat"@en
  ] .
  ?resource rdfs:label ?label .
  FILTER(LANGMATCHES(LANG(?label), "en"))
}
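The query above hard-codes the label Bat. To test arbitrary labels, the same query could be generated from a template, as in the following Python sketch. The function name, the escaping rules and the language tag check are assumptions added for this example. The sketch only builds the query text and leaves sending it to the endpoint to whatever SPARQL client is at hand.

```python
import re

# The query from the post, with the label and language turned into
# placeholders. Literal braces are doubled for str.format().
QUERY_TEMPLATE = """\
PREFIX dbpont: <http://dbpedia.org/ontology/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?resource ?label
WHERE {{
  ?disambiguation dbpont:wikiPageDisambiguates ?resource, [
    rdfs:label "{label}"@{lang}
  ] .
  ?resource rdfs:label ?label .
  FILTER(LANGMATCHES(LANG(?label), "{lang}"))
}}
"""

# Conservative pattern for language tags such as "en" or "en-GB".
_LANG_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def build_disambiguation_query(label: str, lang: str = "en") -> str:
    """Return the disambiguation query for the given label and language."""
    if not _LANG_TAG.fullmatch(lang):
        raise ValueError(f"not a valid language tag: {lang!r}")
    # Escape characters that would break out of the quoted string literal.
    escaped = (
        label.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return QUERY_TEMPLATE.format(label=escaped, lang=lang)
```

Because the query matches the label exactly, the label has to be passed with the same capitalization as in DBpedia (Bat rather than bat).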

After implementing these and similar tests of label specificity, I was not able to find any practical use case for such functionality. I left this work, shifting my focus to things that looked more applicable in the real world. I still think there is something to it, and maybe I will return to it some day.