2013-07-29

Towards usability metrics for vocabularies

As far as I know, there are no established measures for evaluating vocabulary usability. To clarify, by “vocabularies” I mean simple schemas and lightweight lexical ontologies used primarily for marking up content embedded in web pages, using syntaxes such as Microdata or RDFa. A good example of such a vocabulary is Schema.org, an overarching yet simple schema of things and relations that four big search engines (Google, Microsoft Bing, Yahoo! and Yandex) deem important to their users.

The closest work on the topic seems to be the paper Ontology evaluation through usability measures by Núria Casellas. With regard to the syntactical usability of markup, there was a usability study of Microdata done by Ian Hickson, whose minimalistic setting was the subject of numerous rants, such as the one by Manu Sporny. I presume more thought needs to be spent on discovering how existing usability research relates to vocabularies and which standard usability principles apply. Nevertheless, borrowing from usability testing as practised for web sites, software, and libraries, three metrics relevant to vocabularies crossed my mind.

The first is error rate when using a vocabulary. It is based on the assumption that the more usable a vocabulary is, the fewer errors its users should make. Vocabulary validators may be used to automate this technique. Such tools may execute fine-grained rules, which can help discern the most problematic parts of a vocabulary, where users make the most errors. An example of a study testing error rate was conducted by Yandex. Note, however, that it focused more on markup syntaxes than on vocabularies themselves. It reported a 10 % error rate in RDFa (4 % share of the sample), a 10 % error rate in hCard (20 % share), and almost no errors in Facebook’s Open Graph Protocol (1.5 % share), which is also RDFa.
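To make this concrete, here is a minimal sketch of how per-property error rates could be tallied from validator output. The shape of the input (property, is-valid pairs) and the property names are my own assumptions for illustration, not the interface of any real validator:

```python
# Hypothetical sketch: per-property error rates from validator results.
# `results` is assumed to be (property, is_valid) pairs; no real validator
# is implied to emit this exact format.
from collections import Counter

def error_rates(results):
    totals, errors = Counter(), Counter()
    for prop, is_valid in results:
        totals[prop] += 1
        if not is_valid:
            errors[prop] += 1
    return {prop: errors[prop] / totals[prop] for prop in totals}

sample = [
    ("schema:name", True), ("schema:name", True),
    ("schema:datePublished", False), ("schema:datePublished", True),
]
print(error_rates(sample))
```

Breaking the rate down per property is what makes the metric diagnostic: a high overall error rate says a vocabulary is hard to use, but a per-property breakdown points at which terms cause the trouble.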

A broader feature that may serve as input to usability testing is data quality. A metric based on data quality should primarily take into account valid data, since invalid data should be caught by error rate testing. Recognizing data quality as a relevant feature is based on the assumption that more usable vocabularies support creating data of better quality. However, the relation between vocabulary usability and data quality should be treated as correlation rather than causation; even so, it might pinpoint weak parts of a vocabulary where data quality suffers. Transforming data quality into a discrete metric is tricky, but there already are data quality assessment methodologies, some of which are documented in this paper (PDF), from which test procedures for usability of vocabularies may be derived.
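As one illustrative example of turning data quality into a number, completeness (the share of expected properties actually filled in) is among the simplest dimensions to compute. The list of recommended properties below is an invented example, not Schema.org policy:

```python
# Illustrative sketch: completeness as one simple data-quality dimension.
# RECOMMENDED is a made-up list for demonstration purposes only.
RECOMMENDED = {"schema:name", "schema:author", "schema:datePublished"}

def completeness(item_properties):
    """Fraction of recommended properties present in one marked-up item."""
    return len(RECOMMENDED & set(item_properties)) / len(RECOMMENDED)

print(completeness({"schema:name", "schema:author"}))
```

Low completeness clustered around particular properties might hint that those parts of the vocabulary are poorly documented or hard to understand, which is exactly the kind of weak spot the correlation mentioned above could surface.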

The remaining metric I propose is adopted from library and information science, in which the quality of indexing (much like mark-up) can be evaluated in terms of inter-indexer and intra-indexer consistency. Reframed for usability testing of vocabularies, inter-user and intra-user consistency could be more suitable labels. Inter-user consistency is the degree of agreement among users describing the same content. Intra-user consistency, on the other hand, is the extent to which a single user marks up the same content the same way over time. Consistent use of a vocabulary may be taken as a sign that its terms are not ambiguously defined, so that users do not confuse them. It may also show that there is documentation providing clear guidance on the ways in which the vocabulary may be used. These metrics might help test if vocabularies can be “easily understood by the users, so that it can be consistently applied and interpreted” (source).
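One simple way to quantify inter-user consistency is a set-agreement score such as the Dice coefficient over the vocabulary terms each user applied to the same page; similar agreement ratios appear in the indexing-consistency literature. A minimal sketch, with made-up term sets:

```python
# Inter-user consistency as set agreement (Dice coefficient):
# twice the shared terms, divided by the total terms both users applied.
def consistency(terms_a, terms_b):
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty markups trivially agree
    return 2 * len(a & b) / (len(a) + len(b))

user1 = {"schema:Article", "schema:author", "schema:headline"}
user2 = {"schema:Article", "schema:author", "schema:datePublished"}
print(consistency(user1, user2))  # 2 shared terms out of 3 + 3
```

Intra-user consistency can reuse the same function, comparing one user's markup of the same page at two different times instead of two users' markup.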

These metrics have a long history in the field of libraries and are already deployed in practice on the Web. For example, Google Image Labeler (now defunct) was a game that asked pairs of users (mutually unknown to each other) to label the same image and rewarded them if they agreed on a label. A similar service that works on the same principle of rewarding consistency is LibraryThing’s CoverGuess. A naïve approach to implementing these metrics could compute the size of the diff, so that, for example, the markup produced by two users given the same web page and the instruction to use the same vocabulary can be compared. A more complex implementation might involve distance metrics that measure the similarity of patterns in data, such as the metrics offered by Silk. Finally, when applying the consistency metrics, as observed previously, keep in mind that high consistency may be achieved at the expense of low overall quality. Therefore, these metrics are best complemented with data quality testing.
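The naïve diff-size approach can be sketched with the Python standard library alone: serialize both users' markup and measure how similar the two strings are. The snippets below are invented examples of two users marking up the same article, differing in one property choice:

```python
# Naïve diff-based comparison of two users' markup of the same page,
# using difflib's similarity ratio (1.0 = identical, 0.0 = no overlap).
import difflib

markup_a = ('<div itemscope itemtype="http://schema.org/Article">\n'
            '  <span itemprop="headline">Title</span>\n</div>')
markup_b = ('<div itemscope itemtype="http://schema.org/Article">\n'
            '  <span itemprop="name">Title</span>\n</div>')

ratio = difflib.SequenceMatcher(None, markup_a, markup_b).ratio()
print(round(ratio, 2))
```

Comparing raw serializations like this is fragile (whitespace and attribute order affect the score), which is why the more complex pattern-based distance metrics mentioned above are likely a better fit in practice.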

I believe adopting usability testing as a part of vocabulary design is a step forward for data modelling as a discipline. To start, we first need to find out which existing usability metrics apply to vocabularies, or develop new vocabulary-specific approaches to usability testing. So let’s get user-centric, shall we?
