Proceedings of TDWG, 2007

Key Enabling Technologies: Transfer Protocols

Donald Hobern

Abstract


The new TDWG data architecture relies on three core abilities:

  1. Constructing data objects representing objects and concepts in biodiversity informatics. This is the purpose of the TDWG data standards.

  2. Referring reliably to data objects. This is why TDWG has adopted Life Science Identifiers (LSIDs) as a globally unique identifier technology.

  3. Discovering and accessing data objects. This is why TDWG develops its own data access protocols and explores other protocol standards.

TDWG's work in this area has led to the family of protocols beginning with DiGIR and BioCASe and leading to TAPIR (the TDWG Access Protocol for Information Retrieval) today.
The DiGIR protocol has been used extensively by a range of major projects to support exchange of specimen and observation data using Darwin Core. DiGIR provides a flexible XML language for making remote search requests against a web-connected database. More importantly, DiGIR provides a tool for organisations to map their databases into a common set of concepts such as Darwin Core.

BioCASe introduced support for records with a significant nested structure such as the ABCD schema. BioCASe simplified the use of the protocol with external data models developed without knowledge of DiGIR or BioCASe.

The TAPIR protocol learns from DiGIR and BioCASe and adds new features of its own. Two implementations of the protocol are currently available, pyWrapper (written in Python) and TapirLink (written in PHP).

To use a protocol such as TAPIR, a data administrator maps a local database to a set of concepts recognised by the community (e.g., ScientificName, Locality and CatalogNumber are Darwin Core concepts recognised by a wide range of projects). TAPIR software then offers the following operations:

  • Metadata – retrieve descriptive information about a dataset;

  • Capabilities – retrieve the technical capabilities of the TAPIR server and the concepts mapped by the data administrator;

  • Ping – check that the TAPIR server is active;

  • Inventory – retrieve a list of distinct values within the dataset for one or more concepts, with counts of matching records; and

  • Search – retrieve records matching a set of filter conditions.
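The mapping step and three of the operations above can be illustrated with a small in-memory sketch. It is not TAPIR software and does not use the TAPIR wire format. The local column names (`sci_name`, `place`, `cat_no`), the sample rows and the function names are all hypothetical; only the three Darwin Core concept names come from the text.

```python
from collections import Counter

# Hypothetical local table: column names are whatever the institution chose.
LOCAL_ROWS = [
    {"sci_name": "Puma concolor", "place": "Arizona", "cat_no": "M-001"},
    {"sci_name": "Puma concolor", "place": "Sonora", "cat_no": "M-002"},
    {"sci_name": "Lynx rufus", "place": "Arizona", "cat_no": "M-003"},
]

# The data administrator's mapping: community concept -> local column.
CONCEPT_MAP = {
    "ScientificName": "sci_name",
    "Locality": "place",
    "CatalogNumber": "cat_no",
}


def to_concepts(row):
    """Re-express one local row in terms of the mapped concepts."""
    return {concept: row[column] for concept, column in CONCEPT_MAP.items()}


def capabilities():
    """List the concepts this (toy) provider has mapped."""
    return sorted(CONCEPT_MAP)


def inventory(rows, concept):
    """Distinct values for one concept, with counts of matching records."""
    counts = Counter(to_concepts(row)[concept] for row in rows)
    return sorted(counts.items())


def search(rows, **filters):
    """Records whose mapped concepts equal every supplied filter value."""
    mapped = (to_concepts(row) for row in rows)
    return [m for m in mapped if all(m[c] == v for c, v in filters.items())]


if __name__ == "__main__":
    print(capabilities())
    print(inventory(LOCAL_ROWS, "ScientificName"))
    print(search(LOCAL_ROWS, Locality="Arizona"))
```

Because clients only ever see the concept names, the local schema can differ from one provider to the next without affecting the requests they send.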

TAPIR can handle requests encoded as XML documents or as a set of parameters supplied within a URL. TAPIR supports common request and response templates to format results for different tools. For example, a TAPIR client can issue requests based on Darwin Core concepts and receive results as a Google Earth KML document or an RSS feed. Installing TAPIR software may therefore be an efficient way to expose data for a range of other client tools.
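As a rough illustration of the URL-parameter style of request, the sketch below assembles query strings with the Python standard library. The endpoint address is hypothetical, and the parameter names used here (`op`, `concept`, `limit`) are assumptions made for the example; the TAPIR specification should be consulted for the actual parameter names and values.

```python
from urllib.parse import urlencode

# Hypothetical provider endpoint, used only for illustration.
BASE_URL = "http://example.org/tapir/provider"


def build_request(operation, **params):
    """Build a request URL for one operation.

    The 'op' parameter name and any extra parameter names are
    assumptions for this sketch, not taken from the specification.
    """
    query = {"op": operation, **params}
    return BASE_URL + "?" + urlencode(query)


# A request that only checks whether the server is active.
ping_url = build_request("ping")

# A request for distinct values of one mapped concept.
inventory_url = build_request("inventory", concept="ScientificName", limit=10)

if __name__ == "__main__":
    print(ping_url)
    print(inventory_url)
```

A request of this kind can be pasted into a browser or embedded in a link, which is part of what makes the parameter form convenient for lightweight client tools.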

TDWG’s re-engineering of its data standards as reusable vocabularies enables the use of the same terms and definitions in different contexts. TDWG could use its own standards with many general purpose data access protocols. Examples include:

  • OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) – standard access to metadata for a wide range of online resources.

  • WFS (Open GIS Web Feature Service) – a standard that could be used to map locations of species occurrences.

  • SPARQL Query Language for RDF – a standard allowing complex queries across different data sets.