2013-07-31

Vocabularies for the web of data and principles of least markup

I want to share a few thoughts about markup vocabularies that I have pondered over the past months while developing a Schema.org extension proposal targeting the long tail of the job market. Schema.org is a prime example of a markup vocabulary. In fact, if you search Google for “markup vocabulary”, most results will associate the term with Schema.org. Throughout this post, I’ll use this vocabulary as an example to illustrate the points made.

So, how does a markup vocabulary differ from, say, an ontology? Markup vocabularies serve different purposes than traditional ontologies, although their uses overlap. While the distinction between vocabularies and ontologies is blurry, it can be said that ontologies are based on logic, whereas vocabularies are based on convention. Ontologies are typically used for tasks such as inferring additional data, whereas vocabularies serve rather as structures that make data easier to parse when it is exchanged. Alex Shubin likened Schema.org, as an example of a markup vocabulary, to a set of “sitemaps for content” (source). Whereas sitemaps help machines find pages within a web site, Schema.org helps machines find the bits of content within a web page.

In practice, vocabularies are used for the less orderly data. As Dan Brickley said in one of his talks: “Schema.org is for the rest of the Web; for that big sprawling chaos.” To reach wide adoption, vocabularies need to be generic and application-agnostic so that they can be applied at the largest scale possible. So, as asked at the Schema.org panel discussion, “how does schema design at a planetary scale work, in practice?” I think this question may be approached from two complementary angles: recommended vocabulary design patterns and markup guidance.

Vocabulary design

While there are many methodologies for developing ontologies (such as the NeOn Methodology or METHONTOLOGY), it seems that similar instructions are lacking for vocabularies. It is unclear what such instructions should be based on. Whereas the design of artefacts is frequently shaped by their anticipated uses in practice, so that their form follows function, successful vocabularies are often those that don’t anticipate any particular uses. Their function is defined in terms of a broad goal of supporting the widest possible use. And this isn’t an easy goal to provide workable recommendations for.

I think one common rule of thumb is that vocabulary designers should strive to lower the cognitive overhead users face when working with vocabularies and focus on improving vocabulary usability. However, how do these nebulous goals translate into practice?

One (slightly less vague) piece of advice is that vocabulary design shouldn’t require users to make difficult conceptual distinctions. To achieve that, make the differences between vocabulary terms clear, using unambiguous labels and descriptions. If users regularly mix up two distinct concepts, either drop one of the concepts or give both better definitions. As the Zen of Python states on a similar note, “there should be one-- and preferably only one --obvious way to do it.”

Another (slightly less vague) piece of advice is to avoid object proliferation in the vocabulary you develop. In his talk from May 2013, Richard Cyganiak mentioned that vocabularies are typically built from the bottom up, based on usage evidence, so that unused objects aren’t included. Richard reiterated the claim that successful vocabularies for the web of data are small and simple (such as Dublin Core Terms), a claim made earlier by Martin Hepp in his account of Possible ontologies.

One practical technique in line with this advice is to avoid intermediate resources, which are typically represented with blank nodes and are often needed as values of object properties. Schema.org labels such intermediate objects as “embedded items”. If your vocabulary contains an object property that points to an intermediate object further described with other properties, and all these properties have 0…1 cardinality, then you may consider redefining them as direct properties of the object property’s subject. For example, the class schema:JobPosting is used with the properties schema:baseSalary and schema:salaryCurrency. These properties could have been associated with an intermediate schema:Salary object, or even with two intermediate objects, schema:JobPosition and schema:Salary. Instead, they are attached as direct properties of the schema:JobPosting class. Be careful, though, not to take this as a catch-all rule. Object properties that usually link to URIs, such as schema:hiringOrganization, which links to schema:Organization, don’t need to be treated in this way.
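To make the flattening concrete, here is a minimal Python sketch that lifts the properties of a single-valued intermediate node onto its parent. The dictionary layout, the `salary` linking property and the helper name `flatten_intermediate` are assumptions made up for illustration; Schema.org defines no such intermediate property, which is exactly the point.

```python
from typing import Any, Dict


def flatten_intermediate(item: Dict[str, Any], link: str) -> Dict[str, Any]:
    """Lift the properties of a nested, single-valued node onto its parent."""
    nested = item.get(link)
    if not isinstance(nested, dict):
        return dict(item)  # nothing to flatten, return a copy
    flat = {key: value for key, value in item.items() if key != link}
    for key, value in nested.items():
        if key == "@type":
            continue  # the intermediate node's type disappears with the node
        if key in flat:
            raise ValueError(f"property {key!r} would clash on the parent")
        flat[key] = value
    return flat


nested_posting = {
    "@type": "JobPosting",
    "title": "Data engineer",
    "salary": {  # hypothetical intermediate object
        "@type": "Salary",
        "baseSalary": 50000,
        "salaryCurrency": "CZK",
    },
}

flat_posting = flatten_intermediate(nested_posting, "salary")
print(flat_posting)
```

The sketch only works because every nested property occurs at most once. With repeatable properties, the grouping provided by the intermediate node would be lost.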

Markup guidance

Judging from the documentation of markup vocabularies, much of the presented guidance revolves around markup rather than vocabulary design. For instance, the guidance on designing vocabularies for HTML provided by W3C focuses on issues of markup syntax. I think a lot of recommendations concerning markup for the web of data can be considered extensions of the principle of least effort, so calling them the principles of least markup sounds about right.

A practical realization of such principles might advise omitting data that can be computed automatically. This guidance might encompass omitting inferable types, including class instantiations and literal datatypes, when there’s only one valid option. Note that this approach doesn’t apply in cases where a “type” or “unit” needs to be provided to serve as a value reference, for example when describing a price and its currency. A more controversial extension of this principle might recommend not forcing users to mint their own URIs unless necessary. For many purposes of data on the Web, anonymous resources represented with blank nodes are sufficient, given that they may be transformed to URIs and linked deterministically during data ingest and subsequent processing.
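One way a consumer might turn blank nodes into URIs deterministically is to hash the node's own description. The following Python sketch assumes that the node arrives as a JSON-serializable dictionary. The base namespace and the function name `skolem_uri` are placeholders of mine, not part of any vocabulary.

```python
import hashlib
import json

# Assumed namespace for minted URIs; any namespace the consumer controls works.
BASE = "http://example.com/.well-known/genid/"


def skolem_uri(node: dict, base: str = BASE) -> str:
    """Derive a stable URI from a blank node's own description."""
    # Sorting the keys makes the result independent of property order.
    canonical = json.dumps(
        node, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return base + digest


first = skolem_uri({"@type": "Person", "name": "Alice"})
second = skolem_uri({"name": "Alice", "@type": "Person"})
print(first == second)
```

Note the design choice: two anonymous nodes with identical descriptions collapse into the same URI. That is handy for linking, but wrong whenever the descriptions are too sparse to identify anything.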

Besides decreasing the number of characters that users need to type to add vocabulary markup, there are a few recurrent issues frequently mentioned in markup advice.

A thorny issue is the single namespace policy, which proposes that users should be able to create markup with a single vocabulary only. This recommendation is based on the assumption that having multiple vocabulary namespaces requires users to shift between the contexts of different vocabularies, which is held to be cognitively demanding. For example, Schema.org aims to provide this single all-encompassing namespace, from which every necessary vocabulary term may be drawn. The single namespace policy is also reflected in RDFa’s vocab attribute, which allows specifying a single namespace that is then applied to all unqualified names used in markup.
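The effect of a default vocabulary on unqualified names can be sketched in a few lines of Python. This is a simplified illustration, not the full RDFa term expansion algorithm, and the function name `expand_term` is my own.

```python
from typing import Dict, Optional


def expand_term(term: str, vocab: str,
                prefixes: Optional[Dict[str, str]] = None) -> str:
    """Resolve a name from markup to a full URI (simplified)."""
    prefixes = prefixes or {}
    if ":" in term:
        prefix, _, local = term.partition(":")
        if prefix in prefixes:
            return prefixes[prefix] + local  # qualified name, e.g. dc:title
        return term  # unknown prefix: assume it is already an absolute URI
    return vocab + term  # unqualified name: the default vocabulary applies


vocab = "http://schema.org/"
prefixes = {"dc": "http://purl.org/dc/terms/"}

print(expand_term("JobPosting", vocab, prefixes))
print(expand_term("dc:title", vocab, prefixes))
```

With a single namespace, the `prefixes` mapping stays empty and the author never has to think about the first branch at all.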

When looking for the source of errors in markup, unclear scoping rules are often to blame. Scoping is governed by rules prescribing which subject the properties in markup should be attached to, based on the position of attribute-value pairs in the hierarchical structure of HTML and on the semantic context set by other markup. The scoping rules are notoriously difficult to grasp, which might have contributed to Microdata having the itemscope attribute that sets the scope explicitly.
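To show what explicit scoping amounts to, here is a rough Python sketch that attaches each itemprop to the nearest enclosing itemscope. It is a deliberately reduced reading of Microdata: it assumes well-formed markup, ignores itemref and multi-valued itemprop, and takes values only from text content or a content attribute. The class name `MicrodataScopes` and the sample markup are made up for this post.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never stay on the stack.
VOID_ELEMENTS = {"meta", "link", "img", "br", "hr", "input"}


class MicrodataScopes(HTMLParser):
    """Attach each itemprop to the nearest enclosing itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []           # top-level items
        self._open_elements = []  # (item opened here, pending text property)

    def _current_item(self):
        for item, _ in reversed(self._open_elements):
            if item is not None:
                return item
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        owner = self._current_item()
        name = attrs.get("itemprop")
        new_item = text_prop = None
        if "itemscope" in attrs:  # this element opens a new scope
            new_item = {"type": attrs.get("itemtype"), "properties": {}}
            if name and owner is not None:
                owner["properties"].setdefault(name, []).append(new_item)
            else:
                self.items.append(new_item)
        elif name and owner is not None:
            if "content" in attrs:
                owner["properties"].setdefault(name, []).append(attrs["content"])
            else:
                text_prop = (owner, name, [])  # collect text until the end tag
        if tag not in VOID_ELEMENTS:
            self._open_elements.append((new_item, text_prop))

    def handle_data(self, data):
        for _, text_prop in self._open_elements:
            if text_prop is not None:
                text_prop[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or not self._open_elements:
            return
        _, text_prop = self._open_elements.pop()
        if text_prop is not None:
            owner, name, parts = text_prop
            owner["properties"].setdefault(name, []).append("".join(parts).strip())


markup = """
<div itemscope itemtype="http://schema.org/JobPosting">
  <span itemprop="title">Data engineer</span>
  <div itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">Example Corp</span>
  </div>
</div>
"""

parser = MicrodataScopes()
parser.feed(markup)
parser.close()
posting = parser.items[0]
print(posting["properties"]["title"])
```

The name of the organization ends up on the nested item rather than on the job posting, because the inner itemscope starts a new scope.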

A related issue is directionality, which determines whether the current scope should be used as the subject or the object of marked-up properties. To reverse the default directionality, RDFa offers the rev attribute, and previously it used the reverse direction for the src attribute as well. Directionality, among other issues, is described by Gregg Kellogg in his list of common pitfalls when marking up HTML with RDFa. Microdata, on the other hand, avoids this issue by being uni-directional.
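Directionality boils down to which end of the statement the current scope occupies. A tiny Python sketch, with a made-up helper `make_triple` and example.com URIs as placeholders:

```python
from typing import Tuple


def make_triple(scope: str, prop: str, target: str,
                reverse: bool = False) -> Tuple[str, str, str]:
    """Return (subject, predicate, object) for a marked-up link."""
    if reverse:
        return (target, prop, scope)  # rev-style: the scope becomes the object
    return (scope, prop, target)      # rel-style: the scope is the subject


posting = "http://example.com/jobs/1"
employer = "http://example.com/org"
prop = "http://schema.org/hiringOrganization"

# Markup placed in the scope of the posting, pointing at the employer.
print(make_triple(posting, prop, employer))
# Markup placed in the scope of the employer, pointing back at the posting.
print(make_triple(employer, prop, posting, reverse=True))
```

Both calls yield the same statement, which is why a reversed property is convenient when the page structure puts the "wrong" resource in scope, and also why it is easy to get wrong.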

Tolerance

Markup guidelines for data publishers should have a counterpart on the side of data consumers. That counterpart is the principle of tolerance. Schema.org documentation of its data model states: “In the spirit of ‘some data is better than none’, we will accept this markup and do the best we can.” Even though markup may be broken in many different ways, data consumers should try to be fault-tolerant. This attitude is in line with Postel’s principle of robustness, which states: “Be conservative in what you send, be liberal in what you accept.” And so I think that until we know better about vocabulary design, we had better be tolerant and liberal about data on the Web.
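What tolerance might look like on the consumer side can be hinted at with a small Python sketch that accepts plain text where a structured item is expected. This is my own guess at a reasonable fallback, not a description of how any search engine actually processes Schema.org markup; the helper name `coerce_to_item` is hypothetical.

```python
from typing import Any, Dict, Optional


def coerce_to_item(value: Any, expected_type: str) -> Optional[Dict[str, Any]]:
    """Accept either a structured item or plain text where an item is expected."""
    if isinstance(value, dict):
        return value  # already structured, use as is
    if isinstance(value, str) and value.strip():
        # Some data is better than none: keep the text as the item's name.
        return {"@type": expected_type, "name": value.strip()}
    return None  # nothing usable, skip the value instead of failing


print(coerce_to_item(" Example Corp ", "Organization"))
print(coerce_to_item({"@type": "Organization", "name": "Example Corp"}, "Organization"))
```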

4 comments :

  1. Great post.

    Excellent focus about Vocabularies and Ontologies

    Regards

  2. +1

    One size does not fit all, but I try to lean on using a single namespace if possible, and come up with terms that aren't overly specific. Leave out information unless it is detrimental. My particular reasoning is that once they are out there and used, it is a PITA to work around it when it comes to maintenance.

    BTW, URI to Richard's talk is broken at the moment.

  3. Sorry about the slides being inaccessible. The servers of the University of Economics are down at the moment.
