2010-12-04

Design patterns for modelling bibliographic data

I have done several conversions of bibliographic library data from the MARC format, and most of the time I had to deal with the same recurring data modelling issues. During these conversions I adopted a set of design patterns, described in the rest of this post, that can make the conversion process easier. This post was inspired by the discussions at the Semantic Web in Bibliotheken 2010 conference, by several conversations at SemanticOverflow, and by a book on linked data patterns.

  • Do a frequency analysis of MARC tag occurrences before data modelling.

    If you perform a frequency analysis beforehand, you will have a better picture of the structure of your dataset. You will know which MARC elements are the most frequent, and therefore which ones most need to be modelled properly and captured in RDF without loss of semantics.
  • Use an HTTP URI for every entity that has some assertions attached to it.

    Even if you have little information about an entity, it is worth minting a new URI for it or re-using an existing one. Using explicit (URI) and resolvable (HTTP) identifiers, instead of hiding the entity in a literal or referring to it with a blank node, lets the entity be used both in the RDF statements of your dataset and in any external RDF data. And, as Stefano Bertolo wrote, "Linking identifiers is human, re-using them is divine."
  • Create a URI pattern for every distinct resource type.

    To differentiate clusters of resources belonging to a certain resource type (rdf:type), you should define a specific URI pattern for each type. This groups the resource identifiers in a meaningful manner, preferably with patterns that let human users infer the type of a resource from its URI.
  • Use common RDF vocabularies.

    It is better to re-use common RDF vocabularies, such as Dublin Core or Friend of a Friend, than to rely on complex, library-specific vocabularies, such as those created for the RDA. Another option is to use the library-specific vocabularies and then link them at the schema level to more general and widely used vocabularies. Re-using common vocabularies makes it easier to integrate RDF data and to combine multiple RDF datasets. It also makes your data more accessible to semantic web developers, because they do not have to learn a new vocabulary just to use your data. Libraries are not so special that everything needs to be re-invented specifically for them.
  • Use separate resources for the document and the record about it.

    This allows you to make separate assertions about the document and about its description in the record, and to attach each assertion to the correct resource. Some statements, such as copyright, make sense when attached to the record resource instead of the document. For example, this way you can avoid the mistake of claiming authorship of a book when you are only the author of the record about that book.
  • Provide labels in multiple languages.

    It is useful to provide labels for bibliographic resources in multiple languages even when the label strings are identical. This is the case for terms that either have no equivalent in a language or were adopted from another language, for example "Call for papers"@en and "Call for papers"@cs. This does not create redundant or duplicate data, because it adds new information: the fact that the labels have the same string representation in both languages. They may be pronounced the same or differently, but either way they provide useful information for purposes such as translation.
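
Some of the patterns above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the conversions described here: the sample records, the `http://example.org` base, the URI patterns, and the helper names `tag_frequencies`, `mint_uri` and `describe` are all hypothetical. It counts MARC tag occurrences, mints one URI pattern per resource type, and keeps statements about the document apart from statements about its record, using Dublin Core and FOAF properties.

```python
from collections import Counter
from urllib.parse import quote

# Hypothetical sample data: each record is a list of (MARC tag, value) pairs.
# A real conversion would read these from MARC files with a MARC parser.
SAMPLE_RECORDS = [
    [("001", "rec1"), ("100", "Doe, Jane"), ("245", "A title"),
     ("650", "Linked data"), ("650", "Libraries")],
    [("001", "rec2"), ("245", "Another title"), ("650", "Metadata")],
    [("001", "rec3"), ("100", "Roe, Richard"), ("245", "Third title")],
]


def tag_frequencies(records):
    """Return (occurrences, coverage) as two Counters.

    occurrences: how many times each tag appears in the whole dataset.
    coverage: how many records contain each tag at least once.
    """
    occurrences = Counter()
    coverage = Counter()
    for record in records:
        tags = [tag for tag, _value in record]
        occurrences.update(tags)
        coverage.update(set(tags))  # count each tag once per record
    return occurrences, coverage


# Assumed base URI and patterns: one pattern per resource type, so that
# a human reader can infer the type from the URI itself.
BASE = "http://example.org"
URI_PATTERNS = {
    "document": BASE + "/document/{id}",
    "record": BASE + "/record/{id}",
    "person": BASE + "/person/{id}",
}


def mint_uri(resource_type, local_id):
    """Build an HTTP URI for a resource of the given type."""
    return URI_PATTERNS[resource_type].format(id=quote(local_id, safe=""))


DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


def describe(local_id, author_id, cataloguer_id):
    """Return N-Triples lines that keep the document and its record apart."""
    document = mint_uri("document", local_id)
    record = mint_uri("record", local_id)
    return [
        # The record is about the document.
        f"<{record}> <{FOAF}primaryTopic> <{document}> .",
        # The author created the document ...
        f"<{document}> <{DCT}creator> <{mint_uri('person', author_id)}> .",
        # ... while the cataloguer created only the record.
        f"<{record}> <{DCT}creator> <{mint_uri('person', cataloguer_id)}> .",
    ]


if __name__ == "__main__":
    occurrences, coverage = tag_frequencies(SAMPLE_RECORDS)
    for tag, count in occurrences.most_common():
        print(tag, count, coverage[tag])
    for line in describe("rec1", "doe-jane", "cataloguer-1"):
        print(line)
```

Counting both total occurrences and per-record coverage is a design choice: a repeatable tag such as 650 can look very frequent while appearing in only part of the records, and the two numbers together give a clearer idea of what needs careful modelling.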

     
