blog.mynarz.net: rdfa

Showing posts with label rdfa. Show all posts

2013-07-31

Vocabularies for the web of data and principles of least markup

I want to share a few thoughts about markup vocabularies that I pondered upon in the past months when developing Schema.org extension proposal targetting the long tail of job market. Schema.org is a prime example of markup vocabulary. In fact, if you search Google for “markup vocabulary”, most results will associate the term with Schema.org. Throughout this post, I’ll use this vocabulary as an example to illustrate the points made.

So, how does a markup vocabulary differ from, say, an ontology? Markup vocabularies serve different purposes than traditional ontologies, albeit their uses overlap. While the distinction between vocabularies and ontologies is blurry, it can be said that ontologies are based on logic, whereas vocabularies are based on convention. Ontologies are typically used for tasks such as inferring additional data, whereas vocabularies serve rather as structures for easier parsing when exchanging data. Alex Shubin likened Schema.org, as an example of markup vocabulary, to a set of “sitemaps for content” (source). Whereas sitemaps serve to machines to find pages within a web site, Schema.org serves to machines to find the bits of content within a web page.

In practice, vocabularies are used for the less orderly data. As Dan Brickley said in one of his talks: “Schema.org is for the rest of the Web; for that big sprawling chaos.” To reach wide adoption vocabularies need to be generic and application-agnostic so that they can be applied at the largest scale possible. So, as asked at the Schema.org panel discussion, “how does schema design at a planetary scale work, in practice?” I think the answer to this question may be approached from two complementary angles: the recommended vocabulary design patterns and markup guidance.

Vocabulary design

While there is a lot of methodologies for developing ontologies (such as NeOn Methodology or METHONTOLOGY), it seems that similar instructions are lacking for vocabularies. It is unclear what such instructions should be based on. Whereas design of artefacts is frequently shaped by their anticipated uses in practice, so that their form follows function, successful vocabularies are often those that don’t anticipate any particular uses. Their function is defined in terms of a broad goal of supporting the widest possible use. And this isn’t an easy goal to provide workable recommendations for.

I think one common rule of thumb is that vocabulary designers should strive to the lower cognitive overhead users face when working with vocabularies and focus on improving vocabulary usability. However, how do these nebulous goals translate into practice?

One (slightly less vague) advice is that vocabulary design shouldn’t require users to make difficult conceptual distinctions. In order to achieve that, make the differences between vocabulary terms clear (using clear labels and descriptions) in order to avoid ambiguity. If users regularly mix up two distinct concepts, either drop one of the concepts or provide the concepts with better definition. As the Zen of Python states on a similar note, “there should be one— and preferably only one —obvious way to do it.”

Another (slightly less vague) advice is to avoid object proliferation in the vocabulary you develop. In his talk from May 2013 Richard Cyganiak mentioned that vocabularies are typically built from bottom to top, based on usage evidence, so that unused object aren’t included. Richard reiterated the claim asserting that successful vocabularies for the web of data are small and simple (such as Dublin Core Terms), which was already presented by Martin Hepp in his account of Possible ontologies.

One practical technique in line with this advice is to avoid intermediate resources, which are typically represented with blank nodes, and are often needed for object properties. Schema.org labels such intermediate objects as “embedded items”. If your vocabulary contains an object property that points to an intermediate object further described with other properties and all these properties have 0…1 cardinality, then you may consider redefining them as direct properties of the object property’s subject. For example, the class schema:JobPosting is used with properties schema:baseSalary and schema:salaryCurrency. These properties could have been associated with an intermediate schema:Salary object, or even with 2 intermediate objects schema:JobPosition and schema:Salary, however, they are instead attached as direct properties of the schema:JobPosting class. Be careful though not to take this as a catch-all rule. Object properties that usually link to URIs, such as schema:hiringOrganization links to schema:Organization, don’t need to be treated in this way.

Markup guidance

Judging from the documentation of markup vocabularies, much of the presented guidance revolves around markup rather than vocabulary design. For instance, the guidance on designing vocabularies for HTML provided by W3C focuses on issues of markup syntax. I think a lot of recommendations concerning markup for the web of data can be considered extensions of the principle of least effort, so calling them the principles of least markup sounds about right.

A practical realization of such principles might advise to omit data that can be computed automatically. This guidance might encompass omitting inferrable types, including class instantiations and literal datatypes, when there’s only one valid option. Note that this approach doesn’t apply in cases when “type” or “unit” needs to be provided to serve as a value reference; for example when describing price and its currency. A more controversial extension of this principle might recommend to avoid forcing users to mint their own URIs unless necessary. For many purposes of data on the Web anonymous nodes represented with blank nodes are sufficient, given that they may be transformed to URIs and linked deterministically during data ingest and subsequent processing.

Besides decreasing the number of characters that users need to type to add vocabulary markup, there are few recurrent issues frequently mentioned in markup advice.

A thorny issue is the single namespace policy, which proposes that users should be able to create markup with a single vocabulary only. This recommendation is based on the assumption that having multiple vocabulary namespaces requires users to shift between multiple contexts of different vocabularies, which is held to be cognitively demanding. For example, Schema.org aims to provide this single all-encompassing namespace, from which every necessary vocabulary term may be drawn. Single namespace policy is also reflected in RDFa’s vocab attribute that enables to specify a single namespace, which is then applied to all unqualified names used in markup.

When looking for the source of errors in markup, unclear scoping rules are often to blame. Scoping is governed by rules prescribing what subject should the properties in markup be attached to, based on positioning of attribute-value pairs in the hierarchical structure of HTML and semantic context as set by other markup. The scoping rules are notoriously difficult to grasp, which might have contributed to Microdata having the itemscope attribute that sets the scope explicitly.

A related issue to scoping is directionality, which prescribes whether the current scope should be used as subject or object of marked up properties. To reverse the default directionality RDFa offers the rev attribute and previously, it used reverse direction for src attribute as well. Directionality, among other issues, is described by Gregg Kellogg in his list of common pitfalls when marking up HTML with RDFa. Microdata, on the other hand, avoids this issue by being uni-directional.

Tolerance

Markup guidelines for data publishers should have a counterpart on the side of data consumers. That counterpart is the principle of tolerance. Schema.org documentation of its data model states: “In the spirit of ‘some data is better than none’, we will accept this markup and do the best we can.” Even though markup may be broken in many different ways, data consumers should try to be fault-tolerant. This attitude is in line with Postel’s principle of robustness that states: “Be conservative in what you send, be liberal in what you accept” And so I think that until we know better about vocabulary design, we better be tolerant and liberal about data on the Web.

2011-07-19

Spoonfeeding Google with RDF graphs packaged as trees

During a small side project I've found out that Google Rich Snippets Testing Tool doesn't treat RDFa as RDF (i.e., a graph) but rather as a simple hierarchical structure (i.e., a tree). It doesn't take under account links in RDFa, but only the way HTML elements are nested inside one another. More about the difference between data models of graph and tree can be found in a blog post by Lin Clark.

I've created two documents that give the same RDF when you run RDFa distiller on them. Both contain GoodRelations product data, but the difference between them is that in the first document the HTML element describing price specification (gr:UnitPriceSpecification) is a not nested inside the HTML element descibing the offering (gr:Offering) and the offering is linked to via gr:hasPriceSpecification property. In the second document the HTML element with price specification is nested in the element about the offering.

Even though the documents contain same data, Google Rich Snippets Testing Tool parses them differently and refuses to show a preview of search result in the case of the first document, whereas the second document produces a preview. In the first case, the price information is not recognized because it's not nested inside the HTML element describing the offering and thus a warning is shown:

Warning: In order to generate a preview, either price or review or availability needs to be present.

This leads me to believe that Google Rich Snippets Testing Tool doesn't parse RDFa as RDF, but as a tree (much like a DOM tree), effectively the same way as HTML5 microdata, which is built on the tree model. Google doesn't use RDFa as RDF, but as microdata.

Eric Hellman wrote a blog post about spoonfeeding data to Google. Even though Google still accepts some RDF (e.g., GoodRelations) after the announcement of microdata-based Schema.org, it wants to be spoonfed RDF graphs packaged as microdata trees. Does it mean that if Google is your primary target consumer for your data, you shouldn't bother with packaging your RDF in trees, but rather directly provide your data as a tree in HTML5 microdata?

2011-07-03

RDFa in action

RDFa is a way how to exchange structured data inside of HTML documents. RDFa provides information that is formalized enough for computers (such as googlebot) to process it in an automated way. RDFa is a complete serialization of RDF, using the attribute = value pairs to embed data into HTML documents in a way that does not affect their visual display. RDFa is a hack built on top of HTML. It repurposes some of the standard HTML attributes (such as href, src or rel) and adds new ones (such as property, about or typeof) to enrich HTML with semantic mark-up.

A good way to start with RDFa is to read through some of the documents, such as the RDFa Primer or even the RDFa specification. When you want to annotate an HTML document with RDFa you might want to go through a series of steps. We have used this workflow during an RDFa workshop I have helped to organize and this recipe worked quite well. Here it is.

Find out what do you want to describe (e.g., your personal profile).
Find which RDF vocabularies can be used to express description of such a thing (e.g., FOAF). There are multiple ways how to discover suitable vocabularies, some of which are listed at the W3C website for Ontology Dowsing.
Start editting your HTML: either the static files or dynamically rendered templates.
Start at the first line of your document and set the correct DOCTYPE. If you are using XHTML, use <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd"> (i.e., RDFa 1.0). If you are using HTML5, use <!doctype html> (i.e., RDFa 1.1). This will allow you to validate your document and see if you are using RDFa correctly.
Refer to the used RDF vocabularies. By declaring vocabularies' namespaces you can set up variables that you can use in compact URIs. If you are using XHTML, use the xmlns attribute (e.g., xmlns:dv="http://rdf.data-vocabulary.org/#"). If you are using HTML5, use prefix, vocab, or profile attributes (e.g., prefix="dv: http://rdf.data-vocabulary.org/#").
Identify the thing you want to describe. Use a URI as a name for the thing so that others can link to it. Use the about attribute (e.g., <body about="http://example.com/recipe">). Everything that is nested inside of the HTML element with the about attribute is the description of the identified thing, unless a new subject of description is introduced via a new about attribute.
Use the typeof attribute to express what is the thing that you are describing (e.g., <body about="http://example.com/recipe" typeof="dv:Recipe">). Pick a suitable class from the RDF vocabularies you have chosen to use and define the thing you describe as an instance of this class. Note that every time the typeof attribute is used the subject of description changes.
Use the property, rel and rev attributes to name the properties of the thing you are describing (e.g., <h1 property="name">).
Assing values to the properties of the described thing using either the textual content of the annotated HTML element or an attribute such as content, href, resource or src (e.g., <h1 property="name">RDFa in action</h1> or <span property="v:author" rel="dcterms:creator" resource="http://keg.vse.cz/resource/person/jindrich-mynarz">Jind??ich Mynarz</span>).
If you have assigned the textual content of an HTML element as a value of an attribute of the thing described you can annotate it. To define the language of the text, use either xml:lang (in XHTML) or lang (in HTML5) attributes (e.g., <h1 property="name" lang="en">RDFa in action</h1>). If you want to set the datatype of the value, use the datatype attribute (e.g., <span content="2011-07-03" datatype="xsd:date">July 3, 2011</span>)
Check you RDFa-annotated document using validators and examine the data using RDFa distillers to see if you have got it right.
Publish the annotated HTML documents on the Web. Ping the RDFa consumers such as search engines so that they know about your RDFa-annotated web pages.