2010-12-06

Publishing the vocabulary of the types of grey literature as linked data

This blog post is based on the poster presentation delivered at the Grey Literature 12 conference.

The aim of this post is to introduce the typology of grey literature we have started to develop at the National Technical Library. The vocabulary of the types of grey literature is a controlled vocabulary that is meant to be used to express that a document belongs to a certain document type. The design of the vocabulary is based on an analysis of six existing grey literature typologies. Thus, it can be seen as a formalization of the outcomes of the systematic examination done during this analysis.

It has a loose structure with hierarchical relationships between the types' concepts. Each type has a unique identifier (a URI, in this case) and a preferred label. Some types feature labels in multiple languages and links to other types, both from the vocabulary itself and from external datasets. In the vocabulary's documentation each type will be provided with a definition and a prototypical example of a document for which it can be used.

I will briefly mention the technologies that we have used in the vocabulary's development. The vocabulary is expressed in the RDF data format as a SKOS concept scheme. RDF (which stands for Resource Description Framework) is a data format for expressing data with a graph structure, and Simple Knowledge Organisation System (abbreviated as SKOS) is an ontological language for representing knowledge organisation systems, such as thesauri, codelists, or systematic classifications.
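To make the structure more concrete, here is a minimal sketch of how two document types could be written down as RDF statements using SKOS. It is only an illustration: the SKOS and RDF namespaces are the real W3C ones, but the base URI and the two types ("report" and "annual report") are made-up placeholders, not the actual content of the vocabulary. The sketch uses plain Python tuples and prints N-Triples rather than relying on an RDF library.

```python
# The RDF and SKOS namespaces are the real W3C ones; everything under BASE is a
# hypothetical placeholder, not the vocabulary's actual URIs.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/grey-literature/"


def uri(value):
    """Wrap an absolute URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Build a language-tagged literal (assumes the text needs no escaping)."""
    return f'"{text}"@{lang}'


scheme = uri(BASE + "scheme")
report = uri(BASE + "report")                # made-up document type
annual_report = uri(BASE + "annual-report")  # made-up narrower type

# Each statement is a (subject, predicate, object) triple.
triples = [
    (scheme, uri(RDF + "type"), uri(SKOS + "ConceptScheme")),
    (report, uri(RDF + "type"), uri(SKOS + "Concept")),
    (report, uri(SKOS + "inScheme"), scheme),
    (report, uri(SKOS + "prefLabel"), literal("Report", "en")),
    (annual_report, uri(RDF + "type"), uri(SKOS + "Concept")),
    (annual_report, uri(SKOS + "inScheme"), scheme),
    (annual_report, uri(SKOS + "prefLabel"), literal("Annual report", "en")),
    (annual_report, uri(SKOS + "broader"), report),
]


def to_ntriples(triples):
    """Serialise the triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def narrower(concept, triples):
    """Return the concepts that name `concept` as their skos:broader parent."""
    broader = uri(SKOS + "broader")
    return [s for s, p, o in triples if p == broader and o == concept]


print(to_ntriples(triples))
```

The hierarchy lives entirely in the skos:broader statements, so a helper such as `narrower(report, triples)` can walk it without any knowledge specific to this vocabulary.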

The vocabulary will be published as linked data. Linked data is a publication model for exposing structured data on the Web in a way that uses links between datasets to create a network of interlinked data. The vocabulary includes links to other vocabularies and datasets, such as the Bibliographic Ontology, Dublin Core Metadata Initiative Types, or DBpedia, which represents the structured information extracted from Wikipedia.

The vocabulary is meant to be a product of co-operative development. The grey literature typology project is hosted at the Google Code website. The reason for using Google Code is that it has the functionality to support collaborative development. At its core there is a distributed version control system that makes it possible to track the different versions of the vocabulary submitted by the members of the development team. It also makes it possible to incorporate feedback by commenting on individual changes to the vocabulary and by reporting issues that should be fixed in future versions. The Google Code website also includes a wiki that serves as the vocabulary's documentation.

For the purposes of this vocabulary's development the Working Group for Grey Literature Typology was established. The aim of this informal group is to bring together experts from various fields related to grey literature, knowledge organisation systems, or semantic web technologies in order to work collaboratively on the further evolution of the vocabulary. If you are interested in participating in this vocabulary's development or in becoming its user, I encourage you to check out the project's website at Google Code.

 

2010-12-04

Design patterns for modelling bibliographic data

I have done several conversions of bibliographic library data from the MARC format, and most of the time I had to deal with some recurring data modelling issues. During these conversions I have also adopted a set of design patterns, described in the following parts of this post, that can make the conversion process easier. This post was inspired by the things that were discussed at the Semantic Web in Bibliotheken 2010 conference, by several conversations at SemanticOverflow, and by a book on linked data patterns.

  • Do a frequency analysis of MARC tag occurrences before data modelling.

    If you perform a frequency analysis beforehand you will have a better picture of the structure of your dataset. This way you will know which MARC elements are the most frequent ones and thus more important to model properly and capture in RDF without a loss of semantics.
  • Use an HTTP URI for every entity that has some assertions attached to it.

    Even though you may have little information about some entity, it is worth minting a new URI or re-using an existing one for it. Using explicit (URI) and resolvable (HTTP) identifiers instead of hiding the entity in a literal or referring to it with a blank node enables it to be used both in the RDF statements in your dataset and in any external RDF data. And, as Stefano Bertolo wrote, "Linking identifiers is human, re-using them is divine."
  • Create a URI pattern for every distinct resource type.

    In order to differentiate clusters of resources belonging to a certain resource type (rdf:type) you should define a specific URI pattern for each of them. By doing this you will have the resource identifiers grouped in a meaningful manner, preferably with patterns that enable human users to infer the type of a resource.
  • Use common RDF vocabularies.

    It is better to re-use common RDF vocabularies, such as Dublin Core or Friend of a Friend, instead of making use of complex, library-specific vocabularies, such as those created for RDA. There is also the option of going with the library-specific vocabularies and then linking them on the schema level to more general and widely-used vocabularies. This makes it easier to integrate RDF data and to combine multiple RDF datasets. It also helps to make your data more accessible to semantic web developers because they do not have to learn a new vocabulary just for the sake of using your data. Libraries are not so special and they do not need everything re-invented especially for them.
  • Use separate resources for the document and the record about it.

    This allows you to make separate assertions about the document and its description in the record and attach them to the correct resource. For some statements, such as the copyright, it makes sense to attach them to the record resource instead of the document. For example, this way you can avoid the mistake of claiming authorship of a book while you are only the author of the record about that book.
  • Provide labels in multiple languages.

    It is useful to provide labels for the bibliographic resources in multiple languages even though the strings of the labels might be identical. This is the case for terms that either do not have an equivalent in a language or were adopted from another language, for example "Call for papers"@en and "Call for papers"@cs. It does not create redundant or duplicate data because it adds new information, namely the fact that the labels have the same string representation in both languages. They may be pronounced the same or differently but still, they provide useful information for purposes such as translation.
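Two of the patterns above, the frequency analysis of MARC tags and the URI pattern per resource type, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the records are hypothetical lists of (tag, value) pairs rather than parsed MARC files, and the base URI and the pattern names are placeholders I made up for the example.

```python
from collections import Counter

# Hypothetical, heavily simplified records: each one is a list of
# (MARC tag, value) pairs. Real MARC data would need a proper parser.
records = [
    [("100", "Doe, Jane"), ("245", "A report"),
     ("650", "Grey literature"), ("650", "Libraries")],
    [("245", "Another title"), ("260", "Prague"), ("650", "Libraries")],
    [("100", "Roe, Richard"), ("245", "Third title")],
]


def tag_frequencies(records):
    """Count how many times each MARC tag occurs across all records."""
    return Counter(tag for record in records for tag, _ in record)


def tag_coverage(records):
    """Count in how many records each tag appears at least once."""
    return Counter(tag for record in records for tag in {t for t, _ in record})


# One URI pattern per resource type; the base URI is a placeholder.
BASE = "http://example.org/"
URI_PATTERNS = {
    "document": BASE + "document/{id}",
    "record": BASE + "record/{id}",
    "person": BASE + "person/{id}",
}


def mint_uri(resource_type, local_id):
    """Mint a URI whose path reveals the type of the resource."""
    return URI_PATTERNS[resource_type].format(id=local_id)


if __name__ == "__main__":
    for tag, count in tag_frequencies(records).most_common():
        print(tag, count, "occurrences in", tag_coverage(records)[tag], "records")
    print(mint_uri("document", "42"))
    print(mint_uri("record", "42"))
```

Counting both occurrences and coverage is a deliberate choice: a repeatable field can be frequent overall while appearing in only a few records. Minting separate document and record URIs for the same local identifier also keeps the document and the record about it apart, as recommended above.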

     

2010-10-01

Peer to peer library

This blog post is a short write-up about a topic that has lately been on my mind. It deals with the notion of peer to peer networks and their applicability to the field of libraries. The reason why this post is out is that I wanted to share these ideas on the peer to peer network that we call the Web.

Peer to peer is an architecture for distribution of information. Spreading good information and ignoring bullshit is one of the fundamental functions of libraries. Let's see how they fulfil this duty at the present time.

Currently, libraries adopt the centralized distribution model. This is the hub and spoke model in which the library serves as the hub offering information to its users. It is also a pull model where the spoke has to bother the hub to obtain something from it. In the traditional library setting the roles are clearly set: the library is the content provider whereas its patrons are the consumers.

By contrast, peer to peer is a distributed network architecture in which participants make a portion of their resources directly available to other participants. Originally, the peer to peer principle was put into effect in file-sharing systems; however, the sharing doesn't have to be confined only to files. The resources shared can be almost anything, such as free time in the case of Wikipedia. If this statement is true, even knowledge embodied in documents or in people's expertise can be shared via a peer to peer network.

Both centralized and decentralized distribution architectures obviously have advantages and disadvantages, but I will focus on just one side of this distinction for the sake of argument. The centralized model is dependent on the central node. The information is propagated through the network only via the central node, so if it doesn't work properly no information is transmitted. By contrast, peer to peer architecture enhances scalability and service robustness by not relying on a single node and by enabling participants to reach one another without the need for a middleman.
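The dependence on the central node can be illustrated with a toy model. The sketch below is my own simplification, not a description of any real network: the hub and spoke model is a star graph, the peer to peer model is approximated by a ring in which every participant is linked to two neighbours, and the participant names are invented. It only checks who can still be reached when one node fails.

```python
from collections import deque


def star(hub, spokes):
    """Hub and spoke: every spoke is linked to the hub and to nothing else."""
    graph = {hub: set(spokes)}
    for spoke in spokes:
        graph[spoke] = {hub}
    return graph


def ring(nodes):
    """A very simple peer to peer layout: each node is linked to two neighbours."""
    n = len(nodes)
    return {nodes[i]: {nodes[(i - 1) % n], nodes[(i + 1) % n]} for i in range(n)}


def reachable(graph, start, failed=frozenset()):
    """Breadth-first search for all nodes reachable from `start`, skipping failed nodes."""
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in failed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


patrons = ["ann", "bob", "cy", "dee"]  # invented participants
centralized = star("library", patrons)
peer_to_peer = ring(["library"] + patrons)

# With the library down, a patron in the star can reach only herself...
print(sorted(reachable(centralized, "ann", failed={"library"})))
# ...while in the ring the remaining patrons can still reach one another.
print(sorted(reachable(peer_to_peer, "ann", failed={"library"})))
```

A ring is just one of many possible layouts and real peer to peer networks are far more densely connected, but even this minimal case shows the difference: removing the hub isolates every spoke, whereas removing one peer leaves the others connected.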

The communication in a peer to peer system is governed by a network protocol that prescribes how the participants can access the shared resources, with their roles and responsibilities clearly defined. Libraries could be the ones that formulate such a protocol and make it clear how to share knowledge resources.

Libraries have the ideal foundation to become social networks oriented around knowledge and I believe that they still have the authority to declare the standard protocols for exchanging knowledge in a network setting. Libraries have such protocols even now; the library rules define exactly what the limits of the interaction between the user and the library are. The role of libraries in such an environment is to set up the rules and back them up. I wonder if libraries could lay the groundwork for such a network instead of sharding their potential across a number of small systems in which the library is the most important part.

The technology has matured, and with tools like WebID decentralized social networks can become reality. Imagine friending a library on a social network to become its patron and a member of its peer to peer network with the right to access the shared resources. Peer to peer networks can also serve libraries as a platform for building new services, for example indexing and resource discovery.

Today, people can borrow books from a library but not from one another. People can ask the library's reference service but they can't ask and answer questions among themselves. What would it look like if becoming a library's user meant not only taking advantage of the services and resources provided by the library but also benefiting from the community of its users?

According to the Wikipedia article, peer to peer networks lead to egalitarian social networking and a flattening of the established hierarchical differences between the network's participants. Well, can a library and its users become peers?

2010-09-29

Library is a social organism

This is a post about the idea of a library being a social organism. In it I will try to find out what it means to be considered a social organism and whether it is useful to think of libraries in this way. It is clearly just a metaphor, but it may introduce a mode of thinking that will lead to some interesting insights.

The 5th law of library science by Shiyali Ramamrita Ranganathan states that

The library is a growing organism.

If a library is an organism then it obviously can't be treated and managed like a machine. Whereas machines are made, organisms are born; and whereas machines are modified, organisms evolve.

Now, the question is what it means for a library to be a social organism. If libraries are not machines then they cannot simply be built, updated, or repaired; rather, they grow just as other organisms do. Currently, change management in libraries consists of applying patches and updates when adopting new technologies or services, which seems to be based on the assumption that the library is a machine. This approach can work but it does not look like a sustainable one.

The reason why I label the library as a social organism is that the library is made of people. People are the constituent elements, or organs, of the body of a library. It is also the social aspect that makes for the library's cohesiveness.

If we decide to adopt the view of the library as an organism, then we can ask about the necessary preconditions for its evolution to happen. Among these is variability, the ability to produce mutations. I think that we have library mutants: digital libraries, mobile libraries, and the like, or their various combinations. The other minimal requirement for evolution to run is selective pressure. It seems that libraries now feel some kind of pressure coming from either their funders or their patrons. Because we have these basic mechanisms in place, the question is whether we can optimize library evolution to run faster and be more efficient. However, as we've discovered in the past, eugenics does not work that well. The reason may be that it treats organisms as machines.

There are also a number of concepts associated with the notion of an organism. One of them is the concept of health. It seems that every organism has health, so can we think of the health of a library? Can we determine a library's health or evaluate the library on that basis? This may be taking the metaphor of a library as an organism too far, or it may still stimulate some useful thinking. It highlights the fact that metaphors are tools for thinking, and every tool can be used the right or the wrong way.

It will be interesting to witness the evolution of this library organism in the coming years. Or, if not, just to observe how libraries struggle to keep the status quo and preserve their homeostasis.

2010-09-02

An inarticulate account on ontologies

This post was inspired by An Introduction to Ontology: from Aristotle to the Universal Core, a training course of eight lectures delivered by the notable ontologist Barry Smith, which is also high-quality web content.

We share the world and we can also share our descriptions of this world. If we want to share our descriptions of the world with computers, these descriptions might take the form of ontologies. We use natural languages for communicating our descriptions of the world, and to express a formalized conceptualization of the world we use ontological languages. There are different maps of the structure of reality and ontologies are just one kind of them. Ontologies can be seen as windows through which we look at the universals in reality.

When two descriptions of the world share the same language they can be combined and integrated. Different levels of integration are possible depending on the things the descriptions have in common. The first level is sharing the data model, which means that the structure of the descriptions matches. The next level is sharing the conceptualization of the part of the world the descriptions are about. The third level is sharing the concepts from the conceptualization.

The necessary condition for collaboration in science is sharing the way of describing the world. In medieval times of scholasticism, Latin was established as a language of science, as a controlled vocabulary that every researcher used at that time. 

To create a language in which scientists can speak to each other we need ontologists. The first ontologist was probably Aristotle, as he proposed a standard classification of the human knowledge available at that time. Carl Linné was by this standard also an influential ontologist because he made a taxonomy of plants and animals that was used extensively in the following years. On the other hand, Immanuel Kant was a reverse-ontologist because of his claim that the structure of the language is the key to the structure of reality. He actually substituted one particular description of reality (a natural language) for the natural granularity present in reality itself.

The need for a shared ontology may be more obvious in natural sciences such as physics or biomedicine than in social sciences and humanities. The natural sciences deal with the physical world, so the ontologies for this domain must present its conceptualizations.

The physical world has holes in it. Places are essentially holes, which is important because we can be in (occupy) such holes. Also, physical reality has a natural granularity, which allows only certain ways of partitioning it. These partitions are shared social constructs forming multiple transparent layers on top of each other, expressing different levels of granularity.

Discrepancies can arise when you try to partition a continuous category, such as colour. For such categories there is no natural granularity, so their conceptualizations must be seen as purely arbitrary human constructs. However, most categories are not like that and for them some conceptualizations can be proven better than others. So this is the domain of scientific research.

Scientific research can be also seen as a process to obtain finer-grained partitions of reality. When we encounter a fringe instance of a category, we try to find a new conceptualization with a higher granularity that can explain the instance better. For example, an ostrich is a fringe instance of the category of birds because it cannot fly, so we try to find an explanation for this instance belonging to this category on a finer-grained level, say on the level of DNA.

Likewise, we might see the evolution of science as a convergence to a set of shared ontologies. The converging aspect is important because it makes it possible to compare different sources. Once two information sources use the same ontology they become comparable. This also implies that the sources can be integrated. For example, using the same concepts from an ontology of standard units of measure, say kilograms, means that two measures expressed in kilograms can be compared. Having comparable science and comparable research findings is essential for the further progress of the sciences. Barry Smith, the ontologist whose lecture series inspired this post, proposed principles for comparable science:

  1. Scientific theories must be common resources that cannot be bought or sold.
  2. They must be intelligible to a human being.
  3. They must use open publishing venues.
  4. They must constantly evolve to reflect results of scientific experiments, which means they must be evidence-based.
  5. They must be synchronized by a common system of units and common mathematical theories that are built by adults.

Barry Smith expresses his concern with shared ways of describing the world when he says that "Scientists should not be free to take existing terms and give them new meaning." In fact, he stands in opposition to the linked data initiative, which favors building order and shared understanding from the bottom up, when he says:

The attitudes of Tim Berners-Lee, which are in favour of freedom and anarchy, and creativity, and all those nice things, mitigate against the coordination which is necessary to make good scientific ontology work - in a way good science works.

The linked data publishing model depends on the availability of light-weight ontologies. However, Barry Smith advocates a more scientific approach to developing an ontology, one that is based on the best scientific theories available at the time. The benefit of this approach is that domain experts can then help you validate your ontology. Feedback from the community of users is an important requirement in the development of an ontology and, as Barry Smith says, ontology is not something people should do alone, without public supervision.

In my opinion, this is a difficult question to resolve and it is unclear whether we can converge to a shared description of the world from the bottom up or whether we need to get to one description by a centralized effort based on the current state of science. The success of the Web proved that decentralized structures can work very efficiently; however, there's no proof yet that we can decentralize our descriptions of the world (ontologies) the same way we did with our data.

2010-07-24

Library is a habitat for knowledge interaction

Libraries no longer have a monopoly on providing access to information. With the ever increasing volume of information resources freely available on the internet, the need for a library as the middleman negotiating conditions for access to resources entrenched in paid databases has been in decline. Not only due to the open access initiatives but also as a consequence of the general trend of putting everything online, people have the possibility to bypass the library as an information provider and go directly for the resources freely available on the Web.

Likewise, libraries have also lost the monopoly on knowledge organisation. For a long time libraries were the institutions that were entitled and authorized to organize the world's knowledge by means of systematic classifications (such as the Dewey Decimal System), thesauri or subject heading systems (e.g., Library of Congress Subject Headings). Now the situation is much different, as people have the right to organize knowledge themselves using tags in social bookmarking services (such as Delicious) or categories on their blogs (as in WordPress).

Provision of access to information resources and centralized knowledge organisation are two examples of library services that are losing importance in the current circumstances. The internet makes it possible to relocate and decentralize many of the functions that were previously available only in the domain of libraries. It doesn't make sense to duplicate these services in each of the libraries. It's a sub-optimal solution that dissipates the effort in a multitude of attempts to provide the best service possible instead of combining personnel and financial resources to cooperate on one central service. There's also the possibility of giving up providing these services entirely and letting the Web find the right solution for them.

As the relative importance of the library as a provider of these services decreases, its role as a place is increasing. Libraries can be seen as a safe shelter, as an environment saturated with useful affordances. Libraries are designed specifically for the purposes of learning, writing a paper, or some of the many other activities related to information. In this way, the library can be seen as a habitat for knowledge interaction.

What is a habitat for knowledge interaction?

[Habitat] is a bounded chunk of space/time that is designed to accommodate a delimited set of activities. It accommodates the activities by including physical artefacts that can be used in the activities and signs that offer activity-relevant information.

(Source [pdf])

The activities a library habitat can cater for are the interactions with knowledge, regardless of whether it's the recorded knowledge or the knowledge that's still in the heads of the authors. Libraries are intentionally filled with artefacts like chairs or books that enrich the space with a set of affordances related to the activities that the habitat is supposed to host. The habitat also has boundaries, even though they may seem vague. Sure, the end of the library space is clearly demarcated by material (walls), functional (valid user card for entry) or symbolic (sign posts) barriers regulating access to the habitat, but it continues more or less seamlessly into the virtual space represented by the library's online presence.

In this way, the library's habitat is not bound to a single physical space. It can extend itself and penetrate into other environments. Nevertheless, the material part of the habitat is still very important and libraries should focus on providing services and affordances that are bound to a place and designed with the local context in mind; the services that make the library a better place to be in, a better habitat for knowledge interaction.

2010-07-10

Library is a social network oriented around knowledge

What is the library of the future? Let me try to reformulate the concept of the library in a vision.

Just as Last.fm is a social network oriented around music or WeRead.com is about reading, the library is a social network oriented around knowledge. It is a social organism, so it grows and evolves, just as other organisms do. It is not a machine. It is also an open system in which both its input and output are free. And finally, the library is a knowledge technology; it's a way we "know" stuff.

The important collections libraries have are not made of books, but made of people. There are lots of different carriers of knowledge in libraries: books holding knowledge on paper, databases containing knowledge in bits, people having knowledge in their heads. People are the most important carriers of knowledge. In them lies the social and knowledge capital libraries have.

To start social interactions the library enables communication in an open, transparent way. Building on top of that, the library enables conversation, that is, communication within a set of people known to each other. The library is designed to connect users. It helps to connect expert users within a specific domain. The library provides access to knowledge but also provides access to the holders of knowledge. It offers information services but also enables its users to offer information services to each other.

This is similar to the difference between centralized and peer-to-peer services. Centralized services (such as Napster) have one central source (server, repository) and are therefore highly dependent on it. The peer-to-peer model (P2P, such as Skype) uses multiple distributed sources so the overall function of a system is independent of any of its constituent parts. The library is a peer-to-peer system because it not only offers its own sources and services but also provides a social protocol for users to offer each other the resources they control and the services they are able to supply.

The library doesn't organize knowledge but empowers its users to organize it. It enables users to create knowledge in such a way that it can be shared. Public knowledge interactions are recorded and stored for re-use. In this body of recorded knowledge there are explicit links that can be followed. These links make sense because they have meaning. The meaning of a link is defined by how it's being used, and this meaning is encoded in the link in a structured way. The library makes it possible to link everything and in this way attach a description.

The borders of a library are not clearly defined, so no one can say where it begins and where it ends. But the library has a well designed space that it can offer to its users. It provides a habitat for knowledge interaction. It's an environment designed for social interactions and knowledge exchange. In this respect, the library provides lots of affordances; it's an opportunity-rich environment.

The library provides an intriguing interface for its users. One part of this interface is the interface for the library, the other part is the interface for interacting with other users. Yet, the interface to a library is transparent for its users. Users don't come to realize that they're interacting with a library. Because it's barrier-free it makes the institution at the library's core invisible.