Proceedings of TDWG, 2006

Prototyping a Generic Slice Generation System for the GBIF Index

Jörg Holetschek, Anton Güntsch, Cristian Oancea, Markus Döring, Walter G. Berendsohn

Abstract


The GBIF index is maintained by the GBIF secretariat in Copenhagen. It contains a list of all specimens and observations registered within the GBIF network together with some data items considered most relevant for searches and output, such as taxon name, gathering/observation date and site geography. These data currently (Aug 30th, 2006) derive from 804 collections around the world and are harvested by the GBIF indexer using the Darwin Core and ABCD data schemas. In the process, the data are decomposed and stored in the highly normalized data model of the index.

The EU-funded SYNTHESYS project and the development of the German GBIF Node have included efforts to set up specialized search portals for biodiversity data. As a first step, a prototype system has been set up in association with one of the mirrors of the GBIF index. This system creates subsets (slices) of the GBIF index that could serve as the basis for the search portals of special interest networks or regional organizations. This gives these groups an opportunity to draw on their own resources to enhance the usability of the data in the GBIF system, for example by adding information from other data sources such as regional or group-specific taxonomic thesauri, local geographic services, or translation mechanisms.

The process of geographic slice generation comprises three stages:

1. Filtering data from the GBIF index using different criteria (taxa, country codes, geographic coordinates, regional place or area names, collection metadata);
2. Transforming the data into a query-optimized data model; and
3. Optionally, processing the data to enhance their quality and/or augmenting them with additional information.
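
The three stages above can be illustrated with a minimal Python sketch. It is not the prototype's implementation: the record fields, the function names, and the idea of representing the query-optimized model as a dictionary keyed by taxon name are all assumptions made for illustration, and the optional augmentation stage is reduced to a lookup in a hypothetical synonym thesaurus.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class IndexRecord:
    # Hypothetical flattened view of one index entry; field names are illustrative.
    record_id: int
    taxon_name: str
    country_code: str
    provider: str


Slice = Dict[str, List[IndexRecord]]


def filter_records(records: Iterable[IndexRecord],
                   rule: Callable[[IndexRecord], bool]) -> List[IndexRecord]:
    # Stage 1: keep only the records that satisfy the slicing rule.
    return [r for r in records if rule(r)]


def transform(records: Iterable[IndexRecord]) -> Slice:
    # Stage 2: regroup the records for fast lookup, here by taxon name
    # (an assumed stand-in for the query-optimized data model).
    by_taxon: Slice = {}
    for r in records:
        by_taxon.setdefault(r.taxon_name, []).append(r)
    return by_taxon


def augment(sliced: Slice, synonyms: Dict[str, str]) -> Slice:
    # Stage 3 (optional): merge records under the accepted name given by a
    # hypothetical regional thesaurus mapping synonym -> accepted name.
    merged: Slice = {}
    for name, recs in sliced.items():
        merged.setdefault(synonyms.get(name, name), []).extend(recs)
    return merged


def build_slice(records: Iterable[IndexRecord],
                rule: Callable[[IndexRecord], bool],
                synonyms: Optional[Dict[str, str]] = None) -> Slice:
    # Run the stages in order; stage 3 is skipped when no thesaurus is given.
    sliced = transform(filter_records(records, rule))
    return augment(sliced, synonyms) if synonyms else sliced
```

A country slice would then be requested as build_slice(records, lambda r: r.country_code == "DE"), with a thesaurus passed as the third argument only when augmentation is wanted.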

At present, slices are regenerated every night at 01:00 GMT.

As an example, SYNTHESYS is implementing the BioCASE search portal for European biodiversity data. This will ultimately be integrated with European taxonomic backbone systems (Fauna Europaea and Euro+Med PlantBase) as well as with the evolving European geographic data infrastructure. In parallel, GBIF-D Botany is prototyping a search portal for botanical data that will be linked to the standard lists of plants available for the German flora. For these two projects, slicing can be performed by applying filters on geography and taxon information, respectively. We suggest that basic slice generation based on geographic criteria (e.g. for countries) could be among the services offered by the new GBIF index system. At the prototype stage, slicing rules must be specified as SQL statements, which allows a rule to draw on any field contained in the index database.
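
To make the idea of an SQL slicing rule concrete, the following sketch applies a rule to a toy database using Python's standard sqlite3 module. The table name occurrence, its columns, and the rule text are hypothetical; the real index uses a highly normalized data model that is not reproduced here, and the prototype's database engine is not assumed to be SQLite.

```python
import sqlite3

# Hypothetical slicing rule: an SQL condition over an assumed, simplified table.
GERMAN_BOTANY_RULE = "country_code = 'DE' AND kingdom = 'Plantae'"


def create_demo_index() -> sqlite3.Connection:
    # Build a tiny in-memory stand-in for the index (illustrative rows only).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrence ("
        "id INTEGER PRIMARY KEY, taxon_name TEXT, kingdom TEXT, "
        "country_code TEXT, provider TEXT)"
    )
    conn.executemany(
        "INSERT INTO occurrence VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Taxon a", "Plantae", "DE", "Provider 1"),
            (2, "Taxon b", "Plantae", "FR", "Provider 2"),
            (3, "Taxon c", "Animalia", "DE", "Provider 1"),
            (4, "Taxon d", "Plantae", "DE", "Provider 3"),
        ],
    )
    conn.commit()
    return conn


def create_slice(conn: sqlite3.Connection, slice_table: str, rule: str) -> int:
    # Rebuild the slice table from scratch and return its row count.
    # The table name and rule are interpolated into the statement, so both
    # must come from a trusted operator, never from end-user input.
    conn.execute(f"DROP TABLE IF EXISTS {slice_table}")
    conn.execute(
        f"CREATE TABLE {slice_table} AS SELECT * FROM occurrence WHERE {rule}"
    )
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {slice_table}").fetchone()[0]
```

Because create_slice drops and rebuilds the table on every call, running it again after the index has changed refreshes the slice, which is one simple way a scheduled update could be realized.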

The system is still at the prototype stage, but a slice system is being tested with the SYNTHESYS user interface currently under construction (http://search.biocase.org).

The major challenges with such a system are not technical, but relate to GBIF’s obligations to the data providers. Before such a system can be deployed more widely, it is essential to ensure that sliced data are kept current as providers make modifications and corrections, that all data providers are fully and appropriately acknowledged for their contributions, and that data providers are kept informed of the uses to which their data are put. We will be working with the GBIF Secretariat to address these issues and also to accommodate changes arising from the current redesign of the GBIF index system.