Proceedings of TDWG, 2007

Biodiversity Portals: Implications for TDWG

Donald Hobern

Abstract


Many of the TDWG standards were first developed to support federated searches. The intended use case was as follows:
 A user submits a search request (e.g., occurrences of Chiroptera from before 1950);
 A workflow application passes the request to relevant data providers;
 Each provider responds with at least the first page of matching records; and
 The workflow application returns the combined results to the user (with support for retrieving records not included in the initial response).
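The four steps above can be sketched in a few lines. This is only an illustration under stated assumptions: providers are modelled as in-memory functions rather than remote services, queries are exact field matches (so a date range such as "before 1950" is not covered), and the names `Page`, `make_provider` and `federated_search` are hypothetical rather than part of any TDWG standard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class Page:
    records: List[Record]
    has_more: bool  # True if the provider holds further matching records


# Assumption: a provider is a callable taking (query, offset, limit).
Provider = Callable[[Dict[str, str], int, int], Page]


def make_provider(records: List[Record]) -> Provider:
    """Build an in-memory stand-in for a remote data provider."""
    def search(query: Dict[str, str], offset: int, limit: int) -> Page:
        matches = [r for r in records
                   if all(r.get(k) == v for k, v in query.items())]
        return Page(matches[offset:offset + limit],
                    offset + limit < len(matches))
    return search


def federated_search(providers: Dict[str, Provider],
                     query: Dict[str, str],
                     limit: int = 2) -> Tuple[List[Record], List[str]]:
    """Forward the query to every provider and merge the first pages.

    Returns the combined records plus the names of the providers that
    still hold matches not included in this first response.
    """
    combined: List[Record] = []
    pending: List[str] = []
    for name, search in providers.items():
        page = search(query, 0, limit)
        combined.extend(page.records)
        if page.has_more:
            pending.append(name)
    return combined, pending
```

The list of pending providers is what the workflow application would need to keep in order to fetch further pages on request.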

This approach requires the workflow application to maintain the following information:
 Basic technical metadata for each dataset (e.g., endpoint, data standards);
 Session information to support paging through matching datasets;
 Ideally, knowledge of each dataset's content, so that requests can be forwarded only to relevant data providers; and
 Ideally, domain knowledge to enhance requests (e.g., to search on synonyms as well as the accepted name for a species).
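As a rough sketch of how this information might be held, the example below keeps technical metadata and a summary of content for each dataset, expands a requested name with its synonyms, and forwards the request only to datasets known to hold one of those names. The `DatasetInfo` fields, the `SYNONYMS` table, the `route` function and the example.org endpoints are all assumptions made for illustration; session handling is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetInfo:
    endpoint: str   # hypothetical service URL
    standard: str   # data standard served by the provider
    taxa: Set[str] = field(default_factory=set)  # names known to occur


# Assumption: a tiny synonym table keyed by accepted name.
SYNONYMS: Dict[str, List[str]] = {"Puma concolor": ["Felis concolor"]}


def expand_names(name: str) -> List[str]:
    """Return the accepted name followed by any known synonyms."""
    return [name] + SYNONYMS.get(name, [])


def route(registry: Dict[str, DatasetInfo], name: str) -> Dict[str, List[str]]:
    """Map each relevant dataset to the names it should be asked for."""
    names = expand_names(name)
    plan: Dict[str, List[str]] = {}
    for key, info in registry.items():
        hits = [n for n in names if n in info.taxa]
        if hits:  # skip datasets that cannot contain a match
            plan[key] = hits
    return plan


registry = {
    "mammals": DatasetInfo("http://example.org/mammals", "DarwinCore",
                           {"Felis concolor"}),
    "birds": DatasetInfo("http://example.org/birds", "DarwinCore",
                         {"Turdus merula"}),
}
plan = route(registry, "Puma concolor")
```

Here the request for the accepted name reaches only the mammal dataset, and only under the synonym that dataset actually uses.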

Most biodiversity data portals, including GBIF, have used a cached index of key information retrieved from the various data providers to provide quick answers to most search requests. This solves several problems which appear as network sizes increase:
 It is wasteful to forward every request to every potentially relevant data provider;
 Many requests are too general for any datasets to be excluded in advance;
 At any time some providers will be off-line; and
 Some providers cannot handle complex requests, or respond very slowly.
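A minimal sketch of such a cached index follows. It assumes records are flat dictionaries of strings with an `id` field, and that each harvest replaces the cached copy of one dataset; the `CachedIndex` class and its methods are hypothetical and are not a description of how the GBIF portal is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]  # (dataset name, record id)


class CachedIndex:
    def __init__(self) -> None:
        self._records: Dict[Key, Dict[str, str]] = {}
        # Inverted index: (field, value) -> keys of matching records.
        self._by_field: Dict[Tuple[str, str], Set[Key]] = defaultdict(set)

    def harvest(self, dataset: str, records: List[Dict[str, str]]) -> None:
        """Replace the cached copy of one dataset with fresh records."""
        stale = [k for k in self._records if k[0] == dataset]
        for key in stale:
            for f, v in self._records.pop(key).items():
                self._by_field[(f, v)].discard(key)
        for rec in records:
            key = (dataset, rec["id"])
            self._records[key] = rec
            for f, v in rec.items():
                self._by_field[(f, v)].add(key)

    def search(self, **criteria: str) -> List[Dict[str, str]]:
        """Answer a query from the cache without contacting any provider."""
        key_sets = [self._by_field.get((f, v), set())
                    for f, v in criteria.items()]
        if not key_sets:
            return []
        keys = set.intersection(*key_sets)
        return [self._records[k] for k in sorted(keys)]
```

Because `search` reads only the local index, it still answers when a provider is off-line or slow; the cost is that results are only as fresh as the last harvest.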

The decision whether a network should use an index/cache will depend on several factors:
 The size of the network;
 The robustness and availability of the data providers;
 Whether the data providers have both the desire and the capability to maintain a live server;
 Whether search requests require joins between multiple datasets; and
 Whether the portal needs to pre-process data to make queries more reliable.

The use of cached indexes has some implications for future TDWG standards development:
1. TDWG should produce recommendations on the use of TAPIR and/or OAI-PMH (or some similar harvesting protocol) for maintaining central caches of records. The approach should be selected to minimise the burden on data providers.
2. TDWG should continue to revise its standards to minimise alternative representations for the same information and to stabilise key concepts. The LSID vocabularies promise to provide stable properties which could be recognised wherever they are used. This will simplify building index databases.
3. TDWG should adopt or develop metadata standards for on-line datasets. As well as standard Dublin Core properties, these metadata should document the taxonomic, geographic and temporal coverage of the dataset (with references to standard taxonomies and vocabularies) and the methods used to gather the data (atlasing projects, amateur observations, etc.). Even when portals use local indexes, good metadata can simplify selection of datasets and improve the quality of the indexing process.
4. TDWG should ensure that protocols and data standards make it easy to provide attribution for each individual record. The GBIF portal offers interfaces for searching aggregated data but cannot use the same interfaces as the original data providers, because the existing output models were developed for individual data providers (with the same metadata for all records) rather than for composite documents containing records from a number of datasets. This problem can be solved with standard record-level properties for attributing any record to its source dataset.
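To illustrate the last recommendation, the sketch below stamps every harvested record with its source before records from different datasets are merged into one composite result. The property names `sourceDataset` and `sourceProvider` are placeholders chosen for this example, not terms from an existing TDWG vocabulary.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def attribute(dataset_id: str, provider: str,
              records: List[Record]) -> List[Record]:
    """Return copies of the records, each stamped with its source."""
    return [{**rec, "sourceDataset": dataset_id, "sourceProvider": provider}
            for rec in records]


def citations(composite: List[Record]) -> Counter:
    """Count records per source dataset for an attribution summary."""
    return Counter(rec["sourceDataset"] for rec in composite)


# A composite document drawing on two hypothetical datasets.
composite = (attribute("ds-1", "Museum A", [{"id": "1"}])
             + attribute("ds-2", "Museum B", [{"id": "1"}, {"id": "2"}]))
summary = citations(composite)
```

With attribution carried on each record, a portal can return records from many datasets in one response and still credit every provider.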