Proceedings of TDWG, 2007

TDWG Standards Architecture - What and Why

Roger Hyam

Abstract


In 2005, the TDWG Infrastructure Project (TIP) was given the remit of devising an umbrella architecture for TDWG standards. A meeting (TAG1) in April 2006 established the basic principles underlying the standards architecture. The TIP has been promoting adoption of this common architecture over the last 18 months. But why have a standards architecture at all?

There is no need for a standards architecture when exchanging data within a federation of similar applications, such as natural history collections. The federation is a closed system in which a single exchange format can be agreed on. The federation can grow by adding new members whose needs are met by the format. This model has worked well in the past, but it does not meet the primary use case that is emerging. Biodiversity research is typically carried out by combining data of different kinds from multiple sources. The providers of data do not know who will use their data or how it will be combined with data from other sources. The consumer needs some level of commonality across all the data received so that it can be combined for analysis without the need to write computer software for every new combination. This commonality needs to extend seamlessly to new types of data as they are made available. An architecture is required to provide this commonality.

What form should the architecture take? A degree of commonality could be achieved simply by specifying how the data should be serialised. If all suppliers passed data as well-formed XML, for example, it would provide a degree of interoperability. Clients would, however, still not know how the elements within one XML document relate to those in another, or how the items described in those documents relate. At the other extreme, the architecture could provide a detailed data type library that describes how each kind of data should be serialised at a fine level of granularity. In other words, it would specify which XML elements must be present and what they should contain. It is, however, unlikely that a single set of serialisations would meet all needs, any more than a single federation schema would. Even so, some thematic networks require well-defined data types to ensure that the data passed is valid and fit for purpose.
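The limit of serialisation-only commonality can be illustrated with a minimal Python sketch. The two documents and their element names (`sciName`, `taxon` and so on) are hypothetical and are not taken from any TDWG schema. Both documents parse as well-formed XML, yet nothing in them tells a client that the two elements carry the same kind of information.

```python
import xml.etree.ElementTree as ET

# Two hypothetical providers describing the same kind of thing with
# different, unrelated element names (assumed for illustration only).
DOC_A = "<specimen><sciName>Bellis perennis</sciName></specimen>"
DOC_B = "<record><taxon>Bellis perennis</taxon></record>"


def element_names(xml_text):
    """Parse a well-formed XML document and return the set of element names."""
    root = ET.fromstring(xml_text)
    return {element.tag for element in root.iter()}


names_a = element_names(DOC_A)
names_b = element_names(DOC_B)

# Both documents parse, but they share no element names, so a generic
# client has no basis for relating sciName to taxon.
shared = names_a & names_b
print(shared)
```

Agreeing on XML lets the client read both documents, but relating their contents still needs information that the serialisation does not carry.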

The architecture has to meet two needs. It has to allow generic interoperability but also strict validation of data for some networks. It does this using three interlinked components. 1) An ontology is used to express the shared semantics of the data but not to define the validity of those data. Concepts within the ontology are represented as URIs (Uniform Resource Identifiers). 2) Exchange protocols use formats defined in XML Schemas (or other technologies) that exploit the URIs from the ontology concepts. 3) Objects about which data are exchanged are identified using Globally Unique Identifiers. This means that, although exchanges between data producers and clients may make use of different XML formats, the items the data are about and the meaning of the data elements are common across all formats.
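How the three components might work together can be sketched in Python. This is an illustration under stated assumptions, not the TDWG implementation: the concept URIs, the GUID, the element names and the two formats are all hypothetical placeholders. Each format maps its own element names to shared ontology URIs, and each document identifies its subject with a GUID, so a client can combine data from both without format-specific merging code.

```python
import xml.etree.ElementTree as ET

# Hypothetical ontology concept URIs (placeholders, not real TDWG terms).
SCIENTIFIC_NAME = "http://example.org/ontology#scientificName"
COLLECTOR = "http://example.org/ontology#collector"

# Two hypothetical exchange formats. Each maps its own element names
# to concept URIs from the shared ontology.
FORMAT_A = {"sciName": SCIENTIFIC_NAME, "collectedBy": COLLECTOR}
FORMAT_B = {"taxon": SCIENTIFIC_NAME}

# Both documents describe the same object, identified by the same
# (hypothetical) globally unique identifier.
DOC_A = (
    '<specimen guid="urn:example:specimen:1">'
    "<sciName>Bellis perennis</sciName>"
    "<collectedBy>A. Smith</collectedBy>"
    "</specimen>"
)
DOC_B = (
    '<record guid="urn:example:specimen:1">'
    "<taxon>Bellis perennis</taxon>"
    "</record>"
)


def to_statements(xml_text, mapping):
    """Turn a document into (guid, concept URI, value) statements."""
    root = ET.fromstring(xml_text)
    guid = root.get("guid")
    return {
        (guid, mapping[child.tag], child.text)
        for child in root
        if child.tag in mapping
    }


def combine(*statement_sets):
    """Merge statements from any number of sources by GUID and concept URI."""
    merged = {}
    for statements in statement_sets:
        for guid, concept, value in statements:
            merged.setdefault(guid, {}).setdefault(concept, set()).add(value)
    return merged


combined = combine(
    to_statements(DOC_A, FORMAT_A),
    to_statements(DOC_B, FORMAT_B),
)
print(combined)
```

The merging code refers only to GUIDs and concept URIs, never to element names, so under these assumptions a third format could be supported by adding a mapping rather than new merging logic. Validation against a particular XML Schema is left to each network and is not shown.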