Proceedings of TDWG, 2009

Using the CDM to build Europe’s largest species database

Marc Geoffroy, Anton Güntsch, Andreas Kohlbecker

Abstract


PESI (a Pan-European Species-directories Infrastructure) defines and coordinates strategies to enhance the quality and reliability of European biodiversity information. It is a joint initiative of two Networks of Excellence, EDIT (European Distributed Institute of Taxonomy) and MarBEF (Marine Biodiversity and Ecosystem Functioning), funded by the European Union under the Framework 7 Capacities Work Programme: Research Infrastructures and led by the University of Amsterdam. The project started in May 2008, will run for three years, and involves 40 partner organisations from 26 countries.

One of the goals of PESI is to taxonomically integrate and secure the main pan-European species checklists, starting with Fauna Europaea (FaEu), the Euro+Med plantbase (E+M), and the European Register of Marine Species (ERMS). Together they cover more than 200,000 species and provide the largest and most comprehensive regional species inventory in the world.

The integration of these three checklists, currently stored in separate databases with different data models, relies on the Common Data Model (CDM), which was developed within EDIT with the goal of ensuring that it could be mapped to most taxonomic databases. The CDM essentially follows the TDWG Ontology (http://wiki.tdwg.org/twiki/bin/view/TAG/TDWGOntology), but its modelling was also influenced by other models and standards, such as the Access to Biological Collections Data (ABCD) schema, the Taxonomic Concept Schema (TCS) and the Structure of Descriptive Data (SDD) schema.

The CDM Java library implements all classes in the CDM and is the primary interface for applications communicating with CDM data stores. Import routines will be created as necessary for each of the PESI source databases; all source data will then be merged into a single PESI CDM store instance. If one or more of these databases is maintained externally, its import routine must be run regularly. Rules for data quality control concerning the syntax of terms and the structural and relational integrity of data will be implemented at the CDM level and applied to the complete data set of the PESI CDM store. Overlapping and inconsistent data stemming from disparities among the source databases will also be detected; for instance, a handful of animal species with brackish water habitat are stored in both ERMS and FaEu. These conflicts and discrepancies can then be resolved by PESI taxonomists using the EDIT Taxonomic Editor, which will play an important role in the maintenance of the participating checklists and therefore complement the quality checker rules.
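The overlap detection described above can be illustrated with a minimal Java sketch. It is not taken from the CDM Java library: the class `OverlapCheck`, the record `SourceName`, the normalisation rule and the placeholder species names are all assumptions made for illustration, and real matching of taxon names would need to consider authorship, rank and synonymy as well.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types belong to the CDM Java library.
class OverlapCheck {

    /** A simplified name record, as an import routine might hand it to the merged store. */
    record SourceName(String source, String scientificName) {}

    /** Assumed normalisation: trim, collapse whitespace, ignore case. */
    static String normalise(String name) {
        return name.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Returns every normalised name found in more than one source, with the sources that hold it. */
    static Map<String, Set<String>> findOverlaps(List<SourceName> merged) {
        Map<String, Set<String>> sourcesByName = new LinkedHashMap<>();
        for (SourceName n : merged) {
            sourcesByName
                .computeIfAbsent(normalise(n.scientificName()), k -> new LinkedHashSet<>())
                .add(n.source());
        }
        // Keep only names that occur in at least two different sources.
        sourcesByName.values().removeIf(sources -> sources.size() < 2);
        return sourcesByName;
    }

    public static void main(String[] args) {
        // Placeholder names, not real checklist content.
        List<SourceName> merged = List.of(
            new SourceName("ERMS", "Examplus aquaticus"),
            new SourceName("FaEu", "examplus  aquaticus"),
            new SourceName("E+M", "Examplus terrestris"));
        System.out.println(findOverlaps(merged));
    }
}
```

In this sketch the result lists candidate conflicts only; deciding which record is authoritative remains a task for a taxonomist.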

Following maintenance and improvements to data quality, PESI data will be regularly exported from the CDM store into a denormalised relational database (the “PESI data warehouse”). Here the CDM’s versioning capability will also provide substantial support. The PESI data warehouse is optimised for queries from the World Wide Web portal and the PESI web services. The new PESI portal will make the content of the major European taxonomic infrastructures available and support the use of the pan-European species data in the e-science domain.
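To make the idea of a denormalised export concrete, the following Java sketch flattens a small parent-linked taxon structure into rows that already contain the parent name, so that a portal query would not need a join. The class `WarehouseExport`, the records `Taxon` and `WarehouseRow`, their fields and the placeholder names are hypothetical; they do not reflect the actual CDM classes or the PESI data warehouse schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the CDM model and not the PESI warehouse schema.
class WarehouseExport {

    /** Normalised form: a taxon refers to its parent by identifier only. */
    record Taxon(int id, String name, Integer parentId, String source) {}

    /** Denormalised form: the parent name is copied into every row. */
    record WarehouseRow(int taxonId, String name, String parentName, String source) {}

    /** Resolves each parent reference once, at export time. */
    static List<WarehouseRow> flatten(Map<Integer, Taxon> taxaById) {
        List<WarehouseRow> rows = new ArrayList<>();
        for (Taxon t : taxaById.values()) {
            Taxon parent = t.parentId() == null ? null : taxaById.get(t.parentId());
            String parentName = parent == null ? null : parent.name();
            rows.add(new WarehouseRow(t.id(), t.name(), parentName, t.source()));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Placeholder data, not real checklist content.
        Map<Integer, Taxon> taxa = new LinkedHashMap<>();
        taxa.put(1, new Taxon(1, "Examplus", null, "FaEu"));
        taxa.put(2, new Taxon(2, "Examplus aquaticus", 1, "FaEu"));
        flatten(taxa).forEach(System.out::println);
    }
}
```

The trade-off in such a design is the usual one: rows are redundant and must be regenerated on each export, in exchange for simpler and faster read queries.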

Using the CDM as the decisive layer for an ambitious project such as PESI represents a major step towards establishing the CDM as a possible standard for taxonomic databases and applications.

PESI (http://www.eu-nomen.eu/pesi)
CDM (http://dev.e-taxonomy.eu/trac/wiki/CommonDataModel)
EDIT (http://www.e-taxonomy.eu/)
ERMS (http://www.marbef.org/data/erms.php)
Euro+Med plantbase (http://www.emplantbase.org/home.html)
FaEu (http://www.faunaeur.org/)
MarBEF (http://www.marbef.org/)