2010-12-06

Publishing the vocabulary of the types of grey literature as linked data

This blog post is based on the poster presentation delivered at the Grey Literature 12 conference.

The aim of this post is to introduce the typology of grey literature we have started to develop at the National Technical Library. The vocabulary of the types of grey literature is a controlled vocabulary meant to express that a document belongs to a certain document type. Its design is based on an analysis of six existing grey literature typologies, so it can be seen as a formalization of the outcomes of that analysis.

It has a loose structure with hierarchical relationships between the concepts representing the types. Each type has a unique identifier (a URI, in this case) and a preferred label. Some types have labels in multiple languages and links to other types, both within the vocabulary itself and in external datasets. In the vocabulary's documentation, each type will be provided with a definition and a prototypical example of a document to which it applies.

I will briefly mention the technologies that we have used in the vocabulary's development. The vocabulary is expressed in the RDF data format as a SKOS concept scheme. RDF (Resource Description Framework) is a data format for expressing data with a graph structure, and Simple Knowledge Organisation System (SKOS) is an ontological language for representing knowledge organisation systems such as thesauri, code lists, or systematic classifications.
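To make this more concrete, here is a minimal sketch of how a single grey literature type could be expressed as a SKOS concept using Python's rdflib library. The namespace and the concept names are hypothetical placeholders, not the vocabulary's actual identifiers.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    # Hypothetical namespace; the real vocabulary has its own URIs.
    GL = Namespace("http://example.org/greytypes/")

    g = Graph()
    g.bind("skos", SKOS)
    g.bind("gl", GL)

    # The vocabulary as a whole is a SKOS concept scheme.
    g.add((GL.scheme, RDF.type, SKOS.ConceptScheme))

    # One grey literature type expressed as a SKOS concept.
    g.add((GL.report, RDF.type, SKOS.Concept))
    g.add((GL.report, SKOS.inScheme, GL.scheme))
    g.add((GL.report, SKOS.prefLabel, Literal("report", lang="en")))
    g.add((GL.report, SKOS.prefLabel, Literal("zpráva", lang="cs")))

    # A hierarchical relationship to a broader type.
    g.add((GL.report, SKOS.broader, GL.documentType))

    print(g.serialize(format="turtle"))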

The vocabulary will be published as linked data. Linked data is a publication model for exposing structured data on the Web in a way that uses links between datasets to create a network of interlinked data. The vocabulary includes links to other vocabularies and datasets, such as the Bibliographic Ontology, the Dublin Core Metadata Initiative Type Vocabulary, or DBpedia, which represents structured information extracted from Wikipedia.
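The sketch below shows what such links might look like, again with rdflib; the concept URI and the choice of mapping properties are illustrative assumptions, not the vocabulary's published links.

    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import SKOS

    GL = Namespace("http://example.org/greytypes/")        # hypothetical namespace
    BIBO = Namespace("http://purl.org/ontology/bibo/")     # Bibliographic Ontology
    DCMITYPE = Namespace("http://purl.org/dc/dcmitype/")   # DCMI Type Vocabulary

    g = Graph()

    # A close match to a class from the Bibliographic Ontology.
    g.add((GL.thesis, SKOS.closeMatch, BIBO.Thesis))

    # A broader match to the generic DCMI type for textual resources.
    g.add((GL.thesis, SKOS.broadMatch, DCMITYPE.Text))

    # A link to the corresponding DBpedia resource.
    g.add((GL.thesis, SKOS.closeMatch, URIRef("http://dbpedia.org/resource/Thesis")))

    print(g.serialize(format="turtle"))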

The vocabulary is intended to be a product of co-operative development. The grey literature typology project is hosted on the Google Code website, which offers the functionality to support collaborative development. At its core there is a distributed version control system that makes it possible to track the different versions of the vocabulary submitted by the members of the development team. Feedback can be incorporated by commenting on individual changes to the vocabulary and by reporting issues to be fixed in future versions. The Google Code website also includes a wiki that serves as the vocabulary's documentation.

For the purposes of the vocabulary's development, the Working Group for Grey Literature Typology was established. The aim of this informal group is to bring together experts from fields related to grey literature, knowledge organisation systems, and semantic web technologies to work collaboratively on the further evolution of the vocabulary. If you are interested in participating in the vocabulary's development, or in becoming its user, I encourage you to check out the project's website on Google Code.

 

2010-12-04

Design patterns for modelling bibliographic data

I have done several conversions of bibliographic library data from the MARC format, and most of the time I had to deal with recurring data modelling issues. During these conversions I have also adopted a set of design patterns, described below, that can make the conversion process easier. This post was inspired by the discussions at the Semantic Web in Bibliotheken 2010 conference, by several conversations on SemanticOverflow, and by a book on linked data patterns.

  • Do a frequency analysis of MARC tag occurrences before data modelling.

    If you perform a frequency analysis beforehand, you will have a better picture of the structure of your dataset. This way you will know which MARC elements are the most frequent ones and thus the most important to model properly and capture in RDF without loss of semantics. A minimal sketch of such an analysis follows this list.
  • Use an HTTP URI for every entity that has some assertions attached to it.

    Even though you may have little information about an entity, it is worth minting a new URI for it or re-using an existing one. Using explicit (URI) and resolvable (HTTP) identifiers, instead of hiding the entity in a literal or referring to it with a blank node, allows it to be referenced both in the RDF statements in your dataset and in any external RDF data. And, as Stefano Bertolo wrote, "linking identifiers is human, re-using them is divine." A sketch contrasting the two approaches follows this list.
  • Create a URI pattern for every distinct resource type.

    In order to differentiate clusters of resources belonging to a certain resource type (rdf:type), you should define a specific URI pattern for each of them. By doing this you will have the resource identifiers grouped in a meaningful manner, preferably with patterns that enable human users to infer the type of a resource. An example of such patterns is sketched after this list.
  • Use common RDF vocabularies.

    It is better to re-use common RDF vocabularies, such as Dublin Core or Friend of a Friend, instead of making use of complex, library-specific vocabularies, such as those created for RDA. There is also the option of going with the library-specific vocabularies and then linking them on the schema level to more general and widely used vocabularies; a sketch of such schema-level links follows this list. This makes it easier to integrate RDF data and to combine multiple RDF datasets. It also helps to make your data more accessible to semantic web developers, because they do not have to learn a new vocabulary just for the sake of using your data. Libraries are not so special that everything needs to be re-invented specifically for them.
  • Use separate resources for the document and the record about it.

    This makes it possible to assert statements about the document and statements about its description in the record separately, and to attach each to the correct resource. Some statements, such as copyright, make more sense attached to the record resource rather than to the document. For example, this way you can avoid the mistake of claiming authorship of a book when you are only the author of the record about that book. A sketch of this separation follows this list.
  • Provide labels in multiple languages.

    It is useful to provide labels for bibliographic resources in multiple languages even when the label strings are identical. This is the case for terms that either have no equivalent in a given language or were adopted from another language, for example "Call for papers"@en and "Call for papers"@cs. It does not create redundant or duplicate data, because it adds new information: the fact that the labels share the same string representation. The labels may be pronounced the same or differently, but they still provide useful information, for purposes such as translation. A sketch of such labels closes the examples after this list.
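
The first pattern, the frequency analysis of MARC tags, could be done along these lines; this is a minimal sketch assuming the pymarc library and a hypothetical records.mrc input file.

    from collections import Counter
    from pymarc import MARCReader

    # Count how often each MARC tag occurs across the whole dataset.
    tag_counts = Counter()
    with open("records.mrc", "rb") as handle:   # hypothetical input file
        for record in MARCReader(handle):
            for field in record.fields:
                tag_counts[field.tag] += 1

    # The most frequent tags deserve the most careful RDF modelling.
    for tag, count in tag_counts.most_common(20):
        print(tag, count)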
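For the second pattern, here is a sketch of minting an HTTP URI for an entity instead of hiding it in a literal, using rdflib; the base URI, the publisher slug, and the choice of properties are assumptions for illustration.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    DCTERMS = Namespace("http://purl.org/dc/terms/")
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    EX = Namespace("http://example.org/")   # hypothetical base URI

    g = Graph()
    book = EX["document/123"]

    # Weak: a publisher hidden in a literal cannot be linked to or described.
    # g.add((book, DCTERMS.publisher, Literal("National Technical Library")))

    # Better: mint an HTTP URI so the publisher can be described and re-used.
    publisher = EX["organisation/national-technical-library"]
    g.add((book, DCTERMS.publisher, publisher))
    g.add((publisher, RDF.type, FOAF.Organization))
    g.add((publisher, FOAF.name, Literal("National Technical Library", lang="en")))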
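The third pattern, a URI pattern for every resource type, could be captured in a few helper functions like these; the base URI and path segments are hypothetical.

    from rdflib import URIRef

    BASE = "http://example.org/"   # hypothetical base URI

    # One URI pattern per resource type, so the type is apparent from the identifier.
    def document_uri(ident: str) -> URIRef:
        return URIRef(BASE + "document/" + ident)

    def person_uri(ident: str) -> URIRef:
        return URIRef(BASE + "person/" + ident)

    def subject_uri(ident: str) -> URIRef:
        return URIRef(BASE + "subject/" + ident)

    print(document_uri("123"))       # http://example.org/document/123
    print(person_uri("jan-novak"))   # http://example.org/person/jan-novak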
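The fourth pattern, linking library-specific vocabularies to common ones on the schema level, might look as follows; the library-specific properties are made up for the example.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDFS

    DCTERMS = Namespace("http://purl.org/dc/terms/")
    LIB = Namespace("http://example.org/library-schema/")   # hypothetical library-specific schema

    g = Graph()

    # Declaring the specific property as a sub-property of a common one lets
    # generic Dublin Core consumers interpret data expressed with it.
    g.add((LIB.mainAuthor, RDFS.subPropertyOf, DCTERMS.creator))
    g.add((LIB.uniformTitle, RDFS.subPropertyOf, DCTERMS.title))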
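The fifth pattern, separating the document from the record about it, is sketched below; the URIs and the exact properties (foaf:primaryTopic, dcterms:creator, dcterms:rights) are one possible choice, not a prescription.

    from rdflib import Graph, Literal, Namespace

    DCTERMS = Namespace("http://purl.org/dc/terms/")
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    EX = Namespace("http://example.org/")   # hypothetical base URI

    g = Graph()
    document = EX["document/123"]
    record = EX["record/123"]

    # The record describes the document.
    g.add((record, FOAF.primaryTopic, document))

    # Authorship of the book is attached to the document...
    g.add((document, DCTERMS.creator, EX["person/jan-novak"]))

    # ...while authorship and rights of the description are attached to the record.
    g.add((record, DCTERMS.creator, EX["organisation/national-technical-library"]))
    g.add((record, DCTERMS.rights, Literal("Copyright of this record: National Technical Library")))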
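Finally, the sixth pattern, labels in multiple languages with identical strings, re-uses the "Call for papers" example from the post; the concept URI is hypothetical.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import SKOS

    EX = Namespace("http://example.org/")   # hypothetical base URI

    g = Graph()
    concept = EX["doctype/call-for-papers"]

    # The strings are identical, but the language tags add information:
    # the English term is used in Czech contexts without translation.
    g.add((concept, SKOS.prefLabel, Literal("Call for papers", lang="en")))
    g.add((concept, SKOS.prefLabel, Literal("Call for papers", lang="cs")))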