Proceedings of TDWG, 2007

The Role of Ontologies in the TDWG Architecture

Roger Hyam

Abstract


The TDWG Standards Architecture is based on three pillars:
1) An ontology or ontologies,
2) A series of exchange protocols and associated message formats, and
3) The use of Globally Unique Identifiers for primary data objects.
This talk discusses the nature of the ontology and how it integrates with the exchange protocols and GUIDs. The justification for the overall structure of the architecture was given in an earlier talk, “TDWG Standards Architecture - What and Why”. A specific example of the application of the ontology is given in a later talk, “RDF over TAPIR”.

Prior to the TDWG standards architecture, data exchange was based solely on passing XML documents. This works well for federation networks but is less suitable for sharing different types of data across a generic bus-type architecture, which is emerging as the primary use case. Combining documents is difficult because the meaning of the elements within a document depends on their context. If we instead first model the shared data as an ontology of linked classes of objects, rather than as documents, it becomes possible to construct documents from the perspective of different base classes, each of which maps directly to the ontology. Clients can then combine documents written from different perspectives (and in different formats) because the documents are composed of serializations of objects that are typed in the ontology, and the clients understand the ontology.
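The combination step described above can be sketched as follows. This is a minimal illustration rather than part of the architecture itself: the class URIs, the LSIDs, the property names and the `merge_documents` helper are all hypothetical, and each document is assumed to have already been parsed into a list of objects carrying a GUID and an ontology type.

```python
# Hypothetical ontology class URIs, used for illustration only.
TAXON_NAME = "http://example.org/ontology#TaxonName"
SPECIMEN = "http://example.org/ontology#Specimen"


def merge_documents(*documents):
    """Combine documents by indexing every object on its GUID.

    Objects with the same GUID must carry the same ontology type;
    their properties are then pooled into a single record.
    """
    merged = {}
    for document in documents:
        for obj in document:
            record = merged.setdefault(
                obj["guid"], {"guid": obj["guid"], "type": obj["type"]}
            )
            if record["type"] != obj["type"]:
                raise ValueError(f"conflicting types for {obj['guid']}")
            record.update(obj)
    return merged


# A document written from the perspective of a specimen (hypothetical data).
specimen_doc = [
    {
        "guid": "urn:lsid:example.org:specimens:1",
        "type": SPECIMEN,
        "identifiedAs": "urn:lsid:example.org:names:42",
    },
    {
        "guid": "urn:lsid:example.org:names:42",
        "type": TAXON_NAME,
        "nameComplete": "Bellis perennis",
    },
]

# A document written from the perspective of a taxon name (hypothetical data).
name_doc = [
    {
        "guid": "urn:lsid:example.org:names:42",
        "type": TAXON_NAME,
        "authorship": "L.",
    },
]

combined = merge_documents(specimen_doc, name_doc)
```

Because both documents type the name object against the same class and identify it with the same GUID, the client can pool its properties without knowing which document structure each one came from.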

Applying the principle of separation of concerns, the validity of the documents exchanged can be defined outside the ontology. The ontology can be used to specify the meaning of the namespaces whilst XML Schemas (or some other technology) can be used to specify valid document structures for any particular exchange application. An ontology is therefore central to unifying disparate application schemas.

Last year a team led by Jessie Kennedy, and including representatives from across TDWG interest groups, developed an initial high level ontology of the biodiversity domain. This ontology is available through the TAG Wiki. Creating the ontology was a valuable exercise, but everyone involved recognised that it needed more work before it could be put to production use. At the same time a programme was actively rolling out LSID (Life Science Identifiers) authorities. The meta-data returned by LSIDs are in RDF format and, to be useful, require an RDF vocabulary or ontology that at least defines the object types. The TDWG ontology was not going to be developed to a sufficient level of detail in the allotted time. The decision was therefore taken to develop a series of smaller ontologies that could serve as an application layer within the larger TDWG ontology, and to link them only loosely to the more general or higher classes in that ontology. The smaller ontologies are referred to as the “LSID Vocabularies” and the high level ontology as the “Current TDWG Ontology”.

The LSID Vocabularies are now entering production use and, in due course, there will be a requirement for them to be linked to a higher level ontology so as to permit inference. Here, I propose that this link not be made and that a separation of concerns again be followed. There are multiple ways in which the basic classes of exchanged data could be related, and no one set of these relationships is suitable for all applications. It is therefore important not to impose a top-down interpretation of the data but to allow for the possibility of multiple higher level classifications, of which the Current TDWG Ontology may be only one.
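The idea of several higher level classifications over the same base classes can be sketched as follows. The class names and both mappings are hypothetical and are not taken from the LSID Vocabularies or the Current TDWG Ontology; the point is only that the base classes are left untouched and each application chooses which classification to reason with.

```python
# Two hypothetical higher level classifications of the same base classes.
# Each maps a class to its direct superclass.
CURATION_VIEW = {
    "Specimen": "PhysicalObject",
    "TaxonName": "Label",
    "PhysicalObject": "Thing",
    "Label": "Thing",
}
EVIDENCE_VIEW = {
    "Specimen": "Evidence",
    "TaxonName": "Hypothesis",
}


def superclasses(cls, classification):
    """Return the chain of higher level classes for cls under one classification."""
    chain = []
    seen = {cls}
    while cls in classification:
        cls = classification[cls]
        if cls in seen:
            raise ValueError(f"cycle in classification at {cls}")
        seen.add(cls)
        chain.append(cls)
    return chain


# The same base class is interpreted differently by each application.
curation_chain = superclasses("Specimen", CURATION_VIEW)
evidence_chain = superclasses("Specimen", EVIDENCE_VIEW)
```

Keeping each classification as a separate mapping means that adding or replacing one does not require any change to the base classes or to the data typed against them.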