2014-07-09

Methods for designing vocabularies for data on the Web

Over the past year and a half I have been working on a project, in which we were tasked with producing a vocabulary for describing job postings.1 In doing so, we were expected to write down what worked, so that others can avoid our mistakes. Apart from our own experience, the write-up I prepared took into account the largest public discussion on designing vocabularies for data on the Web. Perusing its archive, I have read every email on the public-vocabs mailing list since its start in June 2011 until April 2014. The following text distills some of what I have learnt from the conversations on this mailing list, especially from the vocabulary design veterans including Dan Brickley or Martin Hepp, coupled with research of other sources and our own experiments in data modelling for the Web.

This work was supported by the project no. CZ.1.04/5.1.01/77.00440 funded by the European Social Fund through the Human Resources and Employment Operational Programme and the state budget of the Czech Republic.


“All models are wrong, but some are useful.”
George E. P. Box

The presented text offers a set of recommendations for designing ontologies and vocabularies for data on the Web. The motivation for creating it was to collect relevant advice for data modelling scattered in various sources into a single resource. It focuses on the intersection of vocabularies defined using the RDF Schema (Brickley, Guha, 2014) and those that are intended to be used in RDFa Lite syntax (Sporny, 2012) in HTML web pages. It specifically aims to support vocabularies that aspire to large-scale adoption.

The vocabularies in question in this text are domain-specific, unlike upper ontologies that span general aspects of many domains. Therefore, it is necessary to delimit the domain to be covered by the developed vocabulary to restrict its scope. The target domain can have a broad definition, which may be further clarified by examples of data falling into the domain and examples of data that is out of the domain’s scope. Particular details of the vocabulary’s specialization may be made more specific during the initial research or vocabulary’s design.

“Do not reinvent the wheel.”
HTML design principles
(Kesteren, Stachowiak, 2007)

It is appropriate to devote the initial stage of vocabulary development to research and preparation. One may consider three principal kinds of relevant resources that can be pooled when designing a vocabulary. These resources comprise existing data models, knowledge of domain experts, and domain-specific texts.

Existing data models

Research of existing data models helps to prevent unnecessary work by answering two main questions:

  1. Is there an available data model that can be reused as a whole instead of developing a new data model?
  2. What parts of existing data models can be reused in design of a new data model?

There are two main types of data models that are relevant for reuse in vocabulary development. The first type covers ontological resources that consist of available vocabularies and ontologies. If one finds such resource that describes the target domain and fits the envisioned use cases, it can be directly reused as a whole, provided that its terms of use permit it. If there is a suitable vocabulary that addresses only some of the foreseen uses, it can be extended to cover the others as well. Otherwise, a new vocabulary may be composed of elements that are cherry-picked from the available ontological resources, which forms a basis for the reuse-based development of vocabularies (Poveda-Villalón, 2012). One of the best places to look for these resources is Linked Open Vocabularies, which provides a full-text search engine for the publicly available vocabularies formalized in RDF Schema or OWL (Motik, Patel-Schneider, Parsia, 2012).

The second kind of resources to consider encompasses non-ontological resources, such as XML schemas or data models in relational databases. As these resources cannot be reused directly for building vocabularies, they need to be re-engineered into ontological resources, which is a process that is also referred to as ‘semantic lifting’. Taking non-ontological resources into account may complement the input from ontological sources well. Special attention should be paid to industry standards produced by standardization bodies such as ISO. An alternative approach is to analyze what schemas are employed in public datasets from the given domain, for which data catalogues, such as Datahub, may be used.

Knowledge elicitation with domain experts

“Role models are important.”
Officer Alex J. Murphy / RoboCop

Domain experts constitute a source of implicit knowledge that is not yet formalized in conceptualizations documented in data models (Schreiber et al., 2000). Knowledge elicited from experts who have internalized a working knowledge of the domain of interest can feed in the conceptual distinctions captured by the developed vocabulary. The choice of experts to consult depends on the domain in question. The interviewed experts can range from academic researchers to practitioners from the industry. Similarly, the selection of knowledge elicitation methods should be motivated by the intended use cases for the developed vocabulary. Common methods that serve the purpose of knowledge acquisition include discussion of a glossary, manual simulation of tasks to automate, and competency questions.

Glossary is a useful aid that may guide interviews with domain experts. It can be either manually prepared or constructed automatically from the developed vocabulary. Glossary can be written down as a table in which each vocabulary term is listed together with its label, working definition and broadly described type (e.g., class, property or individuum). It can then serve as a basis for discussion about the established terminology in the domain covered by the developed vocabulary.

Collaboration with domain experts is an opportunity to conduct manual simulation of tasks that are intended to be performed automatically using data described by the developed vocabulary. Such simulation can provide a practical grounding for the vocabulary design with respect to its planned use cases. The simulation should reveal what kinds of data are important for carrying out the envisioned tasks successfully. It can indicate what data can be added to aid in such tasks and what data makes a difference in deciding how to proceed in the chosen tasks. For example, if the target domain is the job market, a simulation task may set about matching sample CVs of job seekers to actual job offers, which can suggest what properties are important to tell a likely successful candidate.

A classical approach to eliciting knowledge from domain experts is to discuss competency questions. These are the questions that data described with the developed vocabulary should be able to answer. As such, competency questions can serve as tests that examine if a vocabulary is sufficiently capable to support its planned use cases. For example, these questions may specify what views on data must be possible, what are the users’ needs that data must be able to answer in a single query, or what level of data granularity and detail is needed.

Analysis of domain-specific corpora

“Pave the cowpaths.”
HTML design principles
(Kesteren, Stachowiak, 2007)

While eliciting knowledge from domain experts concentrates on implicit knowledge, analyses of domain-specific corpora seek for common patterns in explicit, yet unstructured, natural-language text. Textual analysis can be considered a data-driven approach to schema discovery. Its key purpose is to ensure that the designed vocabulary can express the most common kinds of data that are published in the target domain. The approaches to processing domain-specific textual corpora can be divided into qualitative, manual analyses and quantitative, automated analyses.

Qualitative analysis

Manual qualitative analysis can be performed with a smaller domain-specific corpus, which can consist of tens of sample documents. The corpus should be analysed by knowledge engineer to spot common patterns and identify the most important types of data in the domain. Qualitative analysis may result in clusters of similar types of data grouped into a hierarchical tree, in which the most frequently occurring kinds of data are highlighted. The identified clusters may then serve as precursors for classes in the developed vocabulary.

Quantitative analysis

A corpus of texts prepared for quantitative analysis can be sampled from sources on the Web that publish semi-structured data describing the domain of the vocabulary. Producers of these sources can be projected as potential adopters of the developed vocabulary. The texts need to be written in a single language, so that translation is not necessary. Contents of the corpus ought to be sampled from a wide array of diverse sources in order to avoid sampling bias. The corpus needs to be sufficiently large, so that the findings based on analysing it may be taken as indicative of general characteristics of the covered domain. Establishment of such extensive corpus typically requires automated harvesting of texts via web crawlers or scripts that access data through APIs.

Quantitative analysis of domain-specific corpora can be likened to ‘distant reading’. Its aim is to read through the corpus and discover patterns of interest to the vocabulary creator. A typical task of this type of analysis is to extract the most frequent n-grams, indicating common phrases in the established domain terminology, and map their co-occurrences. Quantitative analyses on textual corpora may be performed using dedicated software, such as Voyant Tools or CorpusViewer.

Abstract data model

The results of the performed analyses and knowledge elicitation should provide a basis for development of an abstract data model. At this stage, data model of the designed vocabulary is abstract because it is not mapped to any concrete vocabulary terms in order to avoid being closely tied to particular implementation. Abstract data model may start to be formalized as a mind map, hierarchical tree list or table. Vocabulary creators can base the model on the clusters of the most commonly found terms from domain corpora and sort them into a glossary table. Such proto-model should pass through several rounds of iteration based on successive reviews by vocabulary creators. Key classes and properties in the data model should be identified and equipped with both preferred and non-preferred labels (i.e. synonyms) and preliminary definitions. To get an overview of the whole model and the relationships of its constitutive concepts it may be visualised as UML class diagram or using a generic graph visualization.

Data model’s implementation

“One language’s syntax can be another’s semantics.”
Brian L. Meek

When the abstract data model is deemed to be sound from the conceptual standpoint, it can be formalized in a concrete syntax. The primary languages that should be employed for formalization of the abstract data model are RDF and RDF Schema.As simplicity should be a key design goal, the use of more complex ontological restrictions expressed via OWL ought to be limited to a minimum. The implementation should map the elements of the abstract data model to concrete vocabulary terms that may be either reused from the available ontological resources or newly created.2 At this stage, the expressive RDF Turtle (Prud’hommeaux, Carothers, 2014) syntax may be used conveniently to produce a formal specification of the developed vocabulary.

The implementation process should follow an iterative development workflow, using examples of data in place of software prototypes. During each iteration samples of existing data from the vocabulary’s domain may be modelled using the means provided by the vocabulary, so that it can be assessed how the proposed data model fits its intended uses by seeing it being applied in the context of real examples.

General design principles

Implementation of a vocabulary may be guided by several general principles recommended for vocabularies targeting data written in markup that is embedded in HTML web pages. The goal of widespread adoption of the vocabulary on the Web puts an emphasis on specific design principles. Instead of focusing on conceptual clarity and expressivity as in traditional ontologies, the driving principles of design of lightweight web vocabularies accentuate simplicity, ease of adoption, and usability. This section then further discusses some of the key concerns in vocabulary development, including conceptual parsimony, vocabulary’s coverage that is driven by existing data and the like.

Simplicity

Vocabulary should avoid complex ontological axioms and subtle conceptual distinctions. Instead, it ought to seek simplicity for the data producer, rather than the data consumer.3 It is advisable that vocabulary design tries to strike a fine balance between expressivity and implementation complexity cost. Following the principle of minimal ontological commitment (Gruber, 1995), vocabularies should limit the number of ontological axioms (and especially restrictions) to improve their reusability. The developed vocabulary should thus be as simple as possible without sacrificing the leverage its structure gives to data consumers. Nevertheless, not only it should make simple things simple, it should also make complex things possible. Practical vocabulary design can reflect this guideline by focusing on solving simpler problems first and complex problems later.

Ease of adoption

Adoption of a vocabulary may be made easier if the vocabulary builds on common idioms and established terminology that is already familiar to data publishers. Vocabulary design should strive for intuitiveness. In line with the principle of least astonishment, vocabulary users should rather be exposed largely to things that can be expected.

Usability

Vocabulary design should focus on documentation rather than specification. That being said, neither specification nor documentation can ensure correct use of a vocabulary. Even though vocabulary terms may be precisely defined and documented, their meaning is largely established by their use in practice. Nonetheless, correct application of vocabulary terms may be supported by providing good examples showing the vocabulary in use. As Guha (2013) emphasizes, the default mode of authoring structured data on the Web is copy, paste and edit; for which the availability of examples is essential. Usability of vocabularies can be also improved by following the recommendations of cognitive ergonomics (Gavrilova, Gorovoy, Bolotnikova, 2010), such as readable documentation or vocabulary with narrow width and shallow depth.

Conceptual parsimony

Vocabulary design should introduce as few conceptual distinctions as possible, while still producing a useful conceptualization. Vocabulary does not need to include means of expressing data that can be computed or inferred from data expressed by other means. For example, it is not necessary to include a :numberOfOffers property because its value may be computed if there already is a :hasOffer property, which may have its distinct objects counted to arrive to the same data. An exception to this rule is warranted if it is expected that data producers may only have the computed data, but not the primary data from which it was derived from. For example, the number of offers may not available in disaggregated form as the list of individual offers. There is also no need to define inverse properties, such as :isOfferOf for the :hasOffer property. In a similar manner, vocabulary should not require explicit assertion of data that can be recovered from implicit context, such as data types for literal values. On the other hand, it is important to recognize that this approach shifts the burden from data publishers to clients consuming data that need to execute additional computation, such as inference, to materialize implicit data.

In general, additional conceptual distinctions are useful only if vocabulary users are able to apply them consistently. It is important to realize that valuable conceptual distinctions, justified from experts’ perspective, may not lead to more reliable data. Vocabulary creators should mainly concentrate on offering means for describing data that can be reliably provided by a large number of parties. A key reason for adding conceptual distinct is enabling to publish more data.

The merits of conceptual distinctions should be judged based on their discriminatory value. In other words, the value of distinction is in how it differs from the rest of the vocabulary. The more finely or ambiguously a vocabulary term is defined, the more likely it will be used incorrectly. Complex designs are a subject to misinterpretation. If vocabulary terms cannot be understood by data producers with ease and reliably, they will not be used (resulting in less data) or will be used inconsistently (resulting in lower data quality). Therefore, vocabulary should only use conceptual distinctions that matter and are well understood in the target domain.

Data-driven coverage

Since enabling to publish existing data in a structured form is an essential goal of vocabulary development, it ought to be driven by the available data. Data-driven approach implies that vocabularies should not use conceptualizations that do not match well to common database schemas in their target domains. If this is not the case, then data producers do not have a way of providing their data described using the vocabulary unless they alter their database’s schemas and change the way how they collect data. Vocabulary should be rather descriptive than prescriptive. Vocabulary design should be driven by existing data rather than prescribing what data should be published.

Communication interface

Vocabularies should accurately represent the domain they cover only to the degree it improves consistency of vocabulary use. Shared reality mirrored by a vocabulary may serve as a common referent improving shared understanding. However, the prime goal of a vocabulary is not to model the world, but to enable communication that gets a message across, which means that the prime aim of vocabulary is communication instead of representation. For example, structured values such as postal addresses, do not represent reality but they help formalize communication.

Vocabulary defines a communication interface between data producers and data consumers. Data producers are typically people, whereas data consumers are typically machines. Therefore, vocabulary design should balance usability for people with usability for machines. Vocabularies ought to be designed for people first and machines second (The microformats process, 2013). Thus vocabulary design should reflect the trade-off between consistent understanding of vocabulary among people and the degree to which it makes data machine readable.

Syntax limitations

Vocabulary should be aligned with the syntax, in which it is intended to be used. The design of a vocabulary is constrained by expressivity of its intended syntax. For example, HTML5 Microdata’s lack of mechanism for expressing inverse properties, such as with RDFa rev attribute, may warrant adding inverse properties into a vocabulary. Syntax of data can be considered a medium for the vocabulary. In case of vocabularies made for data embedded in web pages, such as Schema.org, their design should correspond to simpler markup. For example, vocabulary should require less nesting.

Tolerant specification

Vocabulary specification should be tolerant about data that it can express. It should not impose a fixed schema. No properties should be required, so that not providing some data is not invalid. On the other hand, vocabulary should allow additional data to be expressed, so that superfluous data is also not invalid, unless it raises a contradiction. It is advisable to use cardinality restrictions for properties only sparingly, as it is difficult to make them generally valid in the broad context of the multicultural Web. Vocabulary should support dynamic data granularity and varying level of detail, so that unstructured text values are allowed to be used in place of structured values if the structure cannot be reconstructed from the source data. On the other hand, specific consumers of data may add specific requirements that may be negotiated on a case by case basis with particular data producers. Overall, data consumers should be expected to follow the spirit of “some data is better than none” (Schema.org: data model, 2012) and accept even broken or partial data.

Vocabulary evolution

If a vocabulary aims for mass adoption, backwards incompatible changes need to be avoided. It is therefore advisable not to remove or deprecate any vocabulary terms, but rather list them as non-preferred with a link to their preferred variant. Large-scale use of a vocabulary raises the cost of changes, because more vocabulary users (both data producers and consumers) need to react to the changes. Widespread adoption increases the difficulty of propagating the changes, because updates about vocabulary changes need to reach a larger audience.

Conclusion

“It’s probably better to allow volcanoes to have fax machines than try to define everything ‘correctly’. Usage will win out in the end.”
Martin Hepp

The methods for designing vocabularies for data on the Web introduced in this text do not form a coherent methodology but instead compile and synthesize recommendations proposed in related work. Guiding principles manifested in the presented methods shall not be considered as hard-and-fast rules but rather as suggestions based on experience of seasoned vocabulary designers. These include both practical advice on researching state of the art in vocabulary’s target domain and concerns to keep in mind when implementing a formal conceptualization for a vocabulary. Moreover, the presented methods do not involve the notion of vocabulary being “right” but instead aim for developing vocabularies that are useful. Therefore, it is only by practical use on the Web in the long-term that these methods and recommendations may be “proved” of being useful themselves.

References

  • BRICKLEY, Dan; GUHA, R.V. (eds.). RDF Schema 1.1 [online]. W3C Recommendation 25 February 2014. W3C, 2004-2014 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/rdf-schema/
  • GAVRILOVA, T. A.; GOROVOY, V. A.; BOLOTNIKOVA, E. S. Evaluation of the cognitive ergonomics of ontologies on the basis of graph analysis. Scientific and Technical Information Processing. December 2010, vol. 37, iss. 6, p. 398-406. Also available from WWW: http://link.springer.com/article/10.3103%2FS0147688210060043. ISSN 0147-6882. DOI 10.3103/S0147688210060043.
  • GRUBER, Thomas R. Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies. November 1995, vol. 43, iss. 5-6, p. 907-928. Also available from WWW: http://tomgruber.org/writing/onto-design.pdf
  • GUHA, R. V. Light at the end of the tunnel [video]. Keynote at 12th International Semantic Web Conference. Sydney, 2013. Also available from WWW: http://videolectures.net/iswc2013_guha_tunnel
  • KESTEREN, Anne van; STACHOWIAK, Maciej (eds.). HTML design principles [online]. W3C Working Draft 26 November 2007. W3C, 2007 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/html-design-principles/
  • MOTIK, Boris; PATEL-SCHNEIDER, Peter F.; PARSIA, Bijan (eds.). OWL 2 Web Ontology Language: structural specification and functional-style syntax [online]. W3C Recommendation 11 December 2012. 2nd ed. W3C, 2012 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/owl2-syntax/
  • POVEDA-VILLALÓN, María. A reuse-based lightweight method for developing linked data ontologies and vocabularies. In Proceedings of the 9th Extended Semantic Web Conference, Heraklion, Crete, Greece, May 27-31, 2012. Berlin; Heidelberg: Springer, 2012, p. 833-837. Lecture notes in computers science, vol. 7295. Also available from WWW: http://link.springer.com/chapter/10.1007%2F978-3-642-30284-8_66. ISSN 0302-9743. DOI 10.1007/978-3-642-30284-8_66.
  • PRUD’HOMMEAUX, Eric; CAROTHERS, Gavin (eds.). RDF 1.1 Turtle: terse RDF triple language [online]. W3C Recommendation 25 February 2014. W3C, 2008-2014 [cit. 2014-04-30]. Available from WWW: http://www.w3.org/TR/turtle/
  • Schema.org: data model [online]. June 6th, 2012 [cit. 2014-04-29]. Available from WWW: http://schema.org/docs/datamodel.html
  • SCHREIBER, Guus [et al.] (eds.). Knowledge elicitation techniques. In Knowledge engineering and management: the CommonKADS methodology. Cambridge (MA): MIT, 2000, p. 187-214. ISBN 0-262-19300-0.
  • SPORNY, Manu. RDFa Lite 1.1 [online]. W3C Recommendation 07 June 2012. W3C, 2012 [cit. 2014-04-29]. Available from WWW: http://www.w3.org/TR/rdfa-lite/
  • The microformats process [online]. April 28th, 2013 [cit. 2014-04-29]. Available from WWW: http://microformats.org/wiki/process

Footnotes

  1. The result of this endeavour can be found here: https://github.com/OPLZZ/data-modelling

  2. Those may be in turn mapped to other vocabularies’ terms; e.g., via rdfs:subClassOf.

  3. However, it must be possible to reconstruct the main data structures; at least from its context and without out-of-band knowledge.

No comments:

Post a Comment