2012-02-14

In search of the ontology

It is a common problem. When you want to create RDF data describing a domain that you have never described in RDF before, you need to find a suitable RDF vocabulary or ontology that provides sufficient means of expression for the domain in question. If you have enjoyed a triplification exercise, I am pretty sure you have encountered this obstacle. For instance, when you find a dataset about budgets of municipalities, the first thing you need to do is find an ontology that you can use to describe budgets, the topic of the dataset. This ontology retrieval problem is not as straightforward as it may seem. It has earned a reputation of being a rather esoteric practice, and on the W3C's excellent web site it was aptly named ontology dowsing.

Clearly, finding an ontology should be easier than inventing a new one. There are plenty of approaches to solving this problem, some of which work fairly well when complemented with others. The difficulty of finding a proper ontology poses a hurdle to re-use oriented data modelling of linked data. Such a barrier leads to a situation where there are a lot of dataset-specific ontologies, whose authors have taken the path of aligning their data modelling with the structure of legacy datasets instead of aligning it with the available RDF vocabularies and ontologies. In fact, to emphasize the importance of this question: if an ontology cannot be found, it is almost as if the ontology did not exist.

This problem is such a common source of frustration that it has motivated a number of solutions and given rise to lots of questions. Among the approaches taken to solve it, the one that gets mentioned often is the use of semantic search engines, such as FalconS, Watson, or (probably the best known) Sindice. These tools usually search across both instance and schema-level data (ABox and TBox), although FalconS offers a way to search only within ontologies. Semantic search engines rely solely on the ontologies being self-descriptive enough to be found. They take into account only the information that is in the ontologies themselves, which may be a significant drawback when searching for ontologies that contain only a brief description. How can this state of affairs be improved? It may help to introduce more data.

With the increasing volume of the Linked Open Data Cloud, which contains several billion RDF triples, it has become possible to gain relevant insights about ontologies from instance data. Essentially, by performing statistical analyses measuring how ontologies are being used, one can get more information that may help in the search for the right ontology. An example of such an application, based on a survey of ontology adoption done by crawling and analysing existing data, is vocab.cc, which processed the data from the Billion Triples Challenge 2011 dataset. Examination of large amounts of data seems to be a research practice that is growing in popularity. This is supported by the recently launched LODStats project of LOD2, which also shows a list of the most used RDF vocabularies and ontologies. In fact, this kind of data can be used to implement PageRank-like metrics supporting relevance ranking in ontology search, a feature that might be used to distinguish poor-quality or difficult-to-use ontologies from the established and widely deployed ones.
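To make the idea of usage statistics a bit more concrete, here is a minimal sketch in Python of how one might count vocabulary usage in instance data. It is not how vocab.cc or LODStats actually work; the namespace-splitting heuristic, the function names, and the sample triples are my own assumptions made for illustration.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def namespace_of(uri):
    """Heuristic: the namespace is everything up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri

def vocabulary_usage(triples):
    """Count term usages per namespace.

    Every predicate counts as one usage of its vocabulary; the object of
    an rdf:type triple counts as one usage of the class's vocabulary.
    """
    usage = Counter()
    for _subject, predicate, obj in triples:
        usage[namespace_of(predicate)] += 1
        if predicate == RDF_TYPE:
            usage[namespace_of(obj)] += 1
    return usage

# Made-up sample data (hypothetical resources under example.org).
SAMPLE_TRIPLES = [
    ("http://example.org/alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/report", "http://purl.org/dc/terms/title", "Report"),
]

if __name__ == "__main__":
    # Print vocabularies from the most used to the least used.
    for namespace, count in vocabulary_usage(SAMPLE_TRIPLES).most_common():
        print(count, namespace)
```

Counts like these are only a crude popularity signal, but they could serve as one input to a ranking of search results.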

A method that I believe has not been explored yet is to search for ontologies through datasets. Given the richer descriptions of the data in the LOD Cloud, it may be easier to find datasets covering the domain you are interested in and then see what ontologies they use. Ultimately, if this approach proves fruitful, you could store the correspondences between datasets' topics and datasets' ontologies and skip the datasets during the search for ontologies.
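The stored correspondences between topics and ontologies could look roughly like the following Python sketch. The catalogue records, dataset names, topic labels, and the example.org namespace are all hypothetical; a real implementation would have to obtain them from dataset descriptions, which I am simply assuming are available.

```python
from collections import Counter, defaultdict

# Hypothetical catalogue records: each dataset lists its topics and the
# namespaces of the vocabularies it uses.
DATASETS = [
    {"name": "city-budgets",
     "topics": ["budget", "municipality"],
     "vocabularies": ["http://example.org/ns/budget#", "http://purl.org/dc/terms/"]},
    {"name": "state-budget",
     "topics": ["budget"],
     "vocabularies": ["http://example.org/ns/budget#"]},
    {"name": "public-tenders",
     "topics": ["procurement"],
     "vocabularies": ["http://purl.org/dc/terms/"]},
]

def build_topic_index(datasets):
    """Map each topic to a count of how many datasets on that topic use each vocabulary."""
    index = defaultdict(Counter)
    for dataset in datasets:
        for topic in dataset["topics"]:
            index[topic].update(dataset["vocabularies"])
    return index

def suggest_vocabularies(index, topic):
    """Return vocabularies used by datasets on the topic, most frequently used first."""
    return [vocabulary for vocabulary, _count in index.get(topic, Counter()).most_common()]

if __name__ == "__main__":
    index = build_topic_index(DATASETS)
    print(suggest_vocabularies(index, "budget"))
```

Once such an index exists, a search for "budget" goes straight from the topic to candidate ontologies without touching the datasets again.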

Another source of information that powers ontology search is metadata. Several projects, among which Schemapedia may be recognized as a good example, strive to provide better ontology search through additional data organizing the ontologies. These registries record information such as topical classification or provenance metadata that helps users find the ontologies they need. Linked Open Vocabularies, another example of a project of this type, employs a simple classification of ontologies that can be browsed by following the classification's hierarchical structure. Ontologies are organized into vocabulary spaces (the project's Vocabulary of a Friend introduces the voaf:VocabularySpace class), and so, for example, the Public Contracts Ontology is sorted into the Contracts vocabulary space that belongs to the Market space. Linked Open Vocabularies also incorporates data of the previously mentioned type, based on statistical analysis of instance data, to compute vocabulary metrics such as popularity.
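Browsing such a hierarchy of vocabulary spaces can be sketched in a few lines of Python. This is not the data model or API of Linked Open Vocabularies; the two dictionaries and the function name are assumptions of mine, and only the placement of the Public Contracts Ontology in the Contracts space under the Market space comes from the example above.

```python
# Parent of each vocabulary space (None marks a top-level space).
SPACE_PARENT = {
    "Market": None,
    "Contracts": "Market",
}

# Ontologies sorted directly into each vocabulary space.
SPACE_MEMBERS = {
    "Contracts": ["Public Contracts Ontology"],
}

def ontologies_under(space):
    """Collect the ontologies of a space and, recursively, of all its sub-spaces."""
    found = list(SPACE_MEMBERS.get(space, []))
    for child, parent in SPACE_PARENT.items():
        if parent == space:
            found.extend(ontologies_under(child))
    return found

if __name__ == "__main__":
    # Browsing from the top-level space reaches ontologies in nested spaces.
    print(ontologies_under("Market"))
```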

A common problem of centralized registries is their patchy coverage. The registries require manual curation to update them with new ontologies, to mark ontologies that are no longer maintained, or to remove namespace URIs that no longer resolve. One approach to this issue may be to introduce user-generated content and let users add annotations to ontologies. Although Schemapedia leverages such content by allowing users to assign tags to the listed RDF vocabularies and ontologies, a more complete wiki-like approach, such as the one used in the CKAN software for building data catalogues, might be necessary. Nevertheless, a simpler solution, such as extending the namespace lookup service prefix.cc with user-generated tags that can be looked up as well, might turn out to be just as effective.

A potential bottleneck of all of the methods for finding relevant ontologies may lie in the interfaces they expose to users to support efficient query formulation. The user has an information need (to get an adequate ontology for his or her dataset) and is required to express it using the means of the search system. The easiest way is to specify the need in natural language; however, I think a dictionary-like lookup providing direct translations from the user's phrases to an ontology's terms is quite a hard problem to solve. What seems more feasible is browsing that harnesses the added structure built from tags or controlled vocabularies providing subject indexing for ontologies. However, this type of approach raises the question of whether we will get lost in a rabbit hole of creating ontologies for describing ontologies, as we keep adding more and more information to help us find other information. We will see about that.