2012-08-17

Technologies of linked data: RDF

The following post is an excerpt from my thesis entitled Linked open data for public sector information.
Resource Description Framework (RDF) is a standard format for data interchange on the Web. RDF is a generic graph data format that has several isomorphic representations. Any given RDF dataset may be represented as a directed labelled graph that may be broken down into a set of triples, each consisting of subject, predicate, and object.
Triples are the items that RDF data is composed of. The subject of a triple is a referent, an entity that the triple describes. Predicate-object pairs express the referent’s characteristics.
RDF is an instance of the entity-attribute-value with classes and relationships (EAV/CR) data model. EAV/CR is a general model that may be grafted onto implementations ranging from relational databases to object-oriented data structures, such as JSON. In the case of RDF, entities are represented as subjects, which are instances of classes; attributes are expressed as predicates that qualify relationships in data; and objects account for values.
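To make the mapping from entity-attribute-value records to triples concrete, here is a minimal sketch in Python. The record, its "id" and "class" keys, and the example.org URIs are all made up for illustration; only the rdf:type URI comes from the RDF specification.

```python
from typing import Dict, Iterator, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical entity record: "id" and "class" are conventions of this sketch,
# the remaining keys are attribute URIs invented for the example.
record = {
    "id": "http://example.org/office/1",
    "class": "http://example.org/PublicOffice",
    "http://example.org/name": "Ministry of Finance",
    "http://example.org/locatedIn": "http://example.org/city/prague",
}

def record_to_triples(rec: Dict[str, str]) -> Iterator[Tuple[str, str, str]]:
    """Turn one entity record into (subject, predicate, object) triples."""
    subject = rec["id"]                      # the entity becomes the subject
    yield (subject, RDF_TYPE, rec["class"])  # class membership is one more triple
    for attribute, value in rec.items():
        if attribute in ("id", "class"):
            continue
        yield (subject, attribute, value)    # attribute -> predicate, value -> object

triples = list(record_to_triples(record))
for triple in triples:
    print(triple)
```

Every triple produced this way shares the same subject, which is exactly what makes the predicate-object pairs read as the entity's characteristics.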
In terms of the graph representation of RDF, subjects and objects form the graph’s nodes. Predicates constitute the graph’s edges that connect subjects and objects. The graph’s nodes are URIs, blank nodes (nodes without intrinsic names), or literals (textual values), with literals permitted only in the object position, whereas its edges are labelled with URIs.
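The graph view can be sketched with plain Python data structures. The data is invented, and the way terms are told apart (a leading quote for literals, a leading "_:" for blank nodes, as in N-Triples notation) is a simplification assumed for this example rather than a full term model.

```python
from collections import defaultdict

# Invented example data; each tuple is (subject, predicate, object).
triples = [
    ("http://example.org/alice", "http://example.org/knows", "http://example.org/bob"),
    ("http://example.org/alice", "http://example.org/name", '"Alice"'),
    ("http://example.org/bob", "http://example.org/name", '"Bob"'),
    ("http://example.org/bob", "http://example.org/knows", "_:b0"),
]

def node_kind(term: str) -> str:
    """Classify a node by the notation used in this sketch."""
    if term.startswith('"'):
        return "literal"
    if term.startswith("_:"):
        return "blank node"
    return "URI"

# Nodes are everything that occurs as a subject or as an object.
nodes = {term for s, _, o in triples for term in (s, o)}

# Each triple is one directed edge from subject to object, labelled by the predicate.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))

for node in sorted(nodes):
    print(node_kind(node), node, len(outgoing[node]))
```

Literals never get outgoing edges here, which mirrors the rule that they can only appear as objects.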

Serializations

RDF is an abstract data format that needs to be formalized for exchange. To cater for this purpose RDF offers a number of textual serializations suitable for different host environments. A side effect of RDF notations being text-based is that they are open to inspection as anyone can view their sources and learn from them. Now we will describe several examples of the most common RDF serializations.
N-Triples is a simple, line-based RDF serialization that is easy to parse. It compresses well, which makes it convenient for exchanging RDF dumps and executing batch processes. However, N-Triples is limited to 7-bit ASCII, so any other character has to be represented using a Unicode escape sequence.
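The escaping requirement is easy to illustrate. The following sketch writes a single triple with a literal object as one N-Triples line, replacing every character outside printable ASCII with a \uXXXX or \UXXXXXXXX escape. The function names and the example URI are hypothetical, and the sketch handles plain literals only (no language tags or datatypes).

```python
def escape_literal(text: str) -> str:
    """Escape a string so that it contains only printable ASCII characters."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= code <= 0x7E:
            out.append(ch)                    # printable ASCII passes through
        elif code <= 0xFFFF:
            out.append(f"\\u{code:04X}")      # four hex digits
        else:
            out.append(f"\\U{code:08X}")      # eight hex digits
    return "".join(out)

def to_ntriples_line(subject_uri: str, predicate_uri: str, literal: str) -> str:
    """Serialize one triple with a plain literal object as an N-Triples line."""
    return f'<{subject_uri}> <{predicate_uri}> "{escape_literal(literal)}" .'

line = to_ntriples_line(
    "http://example.org/country/cz",  # made-up subject URI
    "http://www.w3.org/2000/01/rdf-schema#label",
    "Česká republika",
)
print(line)
```

Because each line is a complete statement, a dump in this format can be split, concatenated, or streamed line by line without a full parser.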
Turtle is a successor to N-Triples that provides a more compact and readable syntax. For instance, it has a mechanism for shortening URIs to namespaced compact URIs. Unlike N-Triples, Turtle requires UTF-8 to be used as the character encoding, which simplifies entry of non-ASCII characters.
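The URI shortening mechanism can be approximated in a few lines. This is a sketch of the idea only: the prefix table, the compact function, and the rule for what counts as a safe local name are assumptions of the example, and the real Turtle grammar allows more characters in local names than the regular expression below.

```python
import re

# Example prefix table; the "ex" namespace is made up.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "ex": "http://example.org/",
}

# Deliberately conservative pattern for local names (an assumption of this sketch).
SIMPLE_LOCAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")

def compact(uri: str, prefixes: dict) -> str:
    """Return prefix:local if a namespace matches, otherwise the URI in angle brackets."""
    best = None
    for prefix, namespace in prefixes.items():
        # Prefer the longest matching namespace.
        if uri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is not None:
        local = uri[len(best[1]):]
        if SIMPLE_LOCAL_NAME.fullmatch(local):
            return f"{best[0]}:{local}"
    return f"<{uri}>"

print(compact("http://purl.org/dc/terms/title", PREFIXES))
print(compact("http://example.org/city/prague", PREFIXES))
```

The second call falls back to the full URI because the local part contains a slash, which the conservative pattern rejects.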
RDF serializations based on several common data formats were developed, such as those building on XML or JSON. XML-based syntax of RDF is a W3C recommendation from 2004. With regard to JSON, there are a number of proposed serializations, such as JSON-LD, an unofficial draft for representing linked data. However, these serializations suffer from the fact that their host data formats are tree-based, whereas RDF is graph-based. This introduces difficulties for the format’s syntax as a result of “packing” graph data into hierarchical structures. For example, the same RDF graph may be serialized differently with no way of determining the “canonical” serialization.
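The lack of a canonical tree form can be shown with two nested structures that differ as documents yet flatten to the same set of triples. The nesting convention below (an "@id" key for the node identifier, loosely borrowed from JSON-LD) and the flatten helper are simplifications assumed for this sketch, not an implementation of any JSON serialization of RDF.

```python
import json

# Variant 1: Bob's description is nested inside Alice's.
nested = [
    {"@id": "ex:alice", "ex:knows": {"@id": "ex:bob", "ex:name": "Bob"}},
]

# Variant 2: both resources sit at the top level and Alice only points to Bob.
flat = [
    {"@id": "ex:alice", "ex:knows": {"@id": "ex:bob"}},
    {"@id": "ex:bob", "ex:name": "Bob"},
]

def flatten(node: dict, out: set) -> set:
    """Collect the triples expressed by a nested node description."""
    subject = node["@id"]
    for key, value in node.items():
        if key == "@id":
            continue
        if isinstance(value, dict):
            out.add((subject, key, value["@id"]))  # edge to another resource
            flatten(value, out)                    # descend into the nested description
        else:
            out.add((subject, key, json.dumps(value)))  # quoted literal value
    return out

def to_triples(document: list) -> set:
    triples = set()
    for node in document:
        flatten(node, triples)
    return triples

print(nested == flat)                          # the documents differ
print(to_triples(nested) == to_triples(flat))  # the graphs are the same
```

A consumer that compares documents byte by byte, or even structure by structure, would treat the two variants as different data, so comparison has to happen at the level of triples.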
Several RDF serializations were proposed to tie RDF data to documents, using document formats as carriers that embed RDF data. An example of this approach is RDFa, which allows structured data to be interwoven into documents by using attribute-value pairs. It is a framework that can be extended to various host languages, of which XHTML has a specification of RDFa syntax that reached the status of an official W3C recommendation.
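To give a feel for how attribute-value pairs carry data inside a document, the sketch below pulls triples out of a small markup fragment using only the about and property attributes. It is far from a conforming RDFa processor: the class name is hypothetical, prefixes are not resolved, other RDFa attributes are ignored, and the input is assumed to be well-formed with every element explicitly closed.

```python
from html.parser import HTMLParser

class TinyRDFaExtractor(HTMLParser):
    """Collects (subject, property, text) triples from `about` and `property` attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        # Stack of open elements: (tag, subject in scope, property or None, text parts).
        self._open = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        inherited = self._open[-1][1] if self._open else None
        subject = attributes.get("about", inherited)  # `about` sets a new subject
        self._open.append((tag, subject, attributes.get("property"), []))

    def handle_data(self, data):
        # Text contributes to every open element that carries a property.
        for _, _, prop, parts in self._open:
            if prop is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        if not self._open or self._open[-1][0] != tag:
            return  # unbalanced markup is ignored in this sketch
        _, subject, prop, parts = self._open.pop()
        if prop is not None and subject is not None:
            self.triples.append((subject, prop, "".join(parts).strip()))

# Invented document fragment.
fragment = (
    '<div about="http://example.org/doc">'
    '<h1 property="dcterms:title">Annual report</h1>'
    '<span property="dcterms:creator">Ministry of Finance</span>'
    '</div>'
)

extractor = TinyRDFaExtractor()
extractor.feed(fragment)
extractor.close()
print(extractor.triples)
```

The same markup still renders as an ordinary heading and a line of text for human readers, which is the point of using the document as a carrier.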

Vocabularies and ontologies

While RDF is a common data model for linked data, RDF vocabularies and ontologies offer a common way of describing various domains. Their role is to provide a means of conveying semantics in data. An RDF vocabulary or ontology covers a specific domain of human endeavour and distills the most reusable parts of the domain into “an explicit specification of a conceptualization” [1, p. 1]. Conceptualization is thought of as a way of dividing a domain into discrete concepts.
The distinction between RDF vocabularies and ontologies is somewhat blurry. Ontologies provide not only lexical but also intensional or extensional definitions of concepts that are connected with logical relationships, and thus are thought of as more suitable for tasks based on logic, such as reasoning. RDF vocabularies offer a basic “interface” to data for a particular domain and as such are better suited for more lightweight tasks. Most linked data gets by with simple RDF vocabularies, which are only in rare cases complemented with ontological constructs.
Having data described with a well-defined and machine-readable RDF vocabulary or ontology makes it possible to perform inference on the data. Inference materializes data that is implied by the rules defined in the RDF vocabularies and ontologies in which the data is expressed. W3C standardized two ontological languages that may be used to create RDF vocabularies and ontologies: RDF Schema (RDFS) and Web Ontology Language (OWL).
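Materialization can be sketched as a simple forward-chaining loop. The example below applies just two RDFS-style rules, transitivity of rdfs:subClassOf and propagation of rdf:type along rdfs:subClassOf, and repeats them until nothing new is derived. The class hierarchy and the ex: names are invented, compact names stand in for full URIs, and a real reasoner would cover many more rules.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def materialize(triples: set) -> set:
    """Add the triples implied by subclass transitivity and type propagation."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in inferred:
                # (s subClassOf o) and (o subClassOf o2) imply (s subClassOf o2)
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, SUBCLASS_OF, o2))
                # (s2 type s) and (s subClassOf o) imply (s2 type o)
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= inferred:
            return inferred  # fixed point reached, nothing new was derived
        inferred |= new

# Invented vocabulary and data.
data = {
    ("ex:Ministry", SUBCLASS_OF, "ex:PublicBody"),
    ("ex:PublicBody", SUBCLASS_OF, "ex:Organization"),
    ("ex:mfcr", RDF_TYPE, "ex:Ministry"),
}

closure = materialize(data)
for triple in sorted(closure - data):
    print(triple)
```

The three derived triples were never stated explicitly; they follow from the vocabulary alone, which is what lets a query for all organizations find a resource that was only typed as a ministry.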
There are countless RDF vocabularies and ontologies available on the Web. However, many of them are used only in the dataset for which they were defined, and only a few have become popular enough to be treated as de facto standards for modelling the domains they cover. An example of a general and widespread RDF vocabulary is Dublin Core Terms, which provides a basic set of terms for expressing descriptive metadata. With regard to the public sector, some of the RDF vocabularies and ontologies covering this domain may be found in the Government vocabulary space of the Linked Open Vocabularies project.

References

  1. GRUBER, Thomas R. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993, vol. 5, iss. 2, p. 199-220. Also available from WWW: http://tomgruber.org/writing/ontolingua-kaj-1993.htm
