Proceedings of TDWG, 2006

A Generic Data Import Layer for the Berlin Taxonomic Information Model

Anton Güntsch, Walter G. Berendsohn, Andreas Müller

Abstract


The Berlin Taxonomic Information Model (Berlin Model) is a relational information model based on the potential taxon concept (Berendsohn, 1995). It incorporates nomenclatural rules and traditional taxonomic relationships (synonymies, taxonomic inclusions), and it can represent taxonomic concepts as name-reference pairs (Berendsohn & al., 2003). Additional non-traditional, set-theoretical concept relations allow concept graphs to be stored accurately and transparently (Geoffroy & Güntsch, 2003). The model has been implemented as a Microsoft SQL Server database, together with a suite of application programs such as a taxonomic web editor, WWW publication software, and various parser programs. Berlin Model users range from taxonomists writing monographs to international checklist projects.
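
As an illustration of the name-reference idea, the following minimal Java sketch represents a concept as a pair of name and reference and links two concepts by a set-theoretical relation. It is not the Berlin Model schema: the class names, the five relation types, the "sec." notation in the output, and the placeholder name "Aus bus" with its two references are assumptions made for this example only.

```java
import java.util.Objects;

public class ConceptGraphSketch {

    /** Set relations between the circumscriptions of two concepts (illustrative set of types). */
    public enum SetRelation { CONGRUENT, INCLUDES, INCLUDED_IN, OVERLAPS, EXCLUDES }

    /** A taxonomic concept: a name as used in one particular reference. */
    public static final class Concept {
        public final String name;
        public final String reference;

        public Concept(String name, String reference) {
            this.name = Objects.requireNonNull(name);
            this.reference = Objects.requireNonNull(reference);
        }

        // Two concepts are equal only if both name and reference agree.
        @Override public boolean equals(Object o) {
            if (!(o instanceof Concept)) return false;
            Concept c = (Concept) o;
            return name.equals(c.name) && reference.equals(c.reference);
        }

        @Override public int hashCode() { return Objects.hash(name, reference); }

        @Override public String toString() { return name + " sec. " + reference; }
    }

    /** A directed edge of the concept graph. */
    public static final class ConceptRelation {
        public final Concept from;
        public final SetRelation type;
        public final Concept to;

        public ConceptRelation(Concept from, SetRelation type, Concept to) {
            this.from = from;
            this.type = type;
            this.to = to;
        }

        /** The same statement read in the opposite direction. */
        public ConceptRelation inverted() {
            return new ConceptRelation(to, inverse(type), from);
        }

        @Override public String toString() { return from + " " + type + " " + to; }
    }

    /** INCLUDES and INCLUDED_IN swap; the other relations are symmetric. */
    public static SetRelation inverse(SetRelation r) {
        switch (r) {
            case INCLUDES: return SetRelation.INCLUDED_IN;
            case INCLUDED_IN: return SetRelation.INCLUDES;
            default: return r;
        }
    }

    public static void main(String[] args) {
        // Placeholder data: the same name used in two different references.
        Concept older = new Concept("Aus bus", "Author A 1990");
        Concept newer = new Concept("Aus bus", "Author B 2000");
        ConceptRelation rel = new ConceptRelation(newer, SetRelation.INCLUDES, older);
        System.out.println(rel);
        System.out.println(rel.inverted());
    }
}
```

The point of the sketch is that the same name under two references yields two distinct concepts, so a relation between them can be stated explicitly instead of being implied by the shared name.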

Experience from existing Berlin Model application projects suggests that data imports consume a substantial share of project resources. This is mainly due to the heterogeneous structure of the available taxonomic data and the complexity of the target model.

We present a generic data import method that uses two XML schema layers and three transformation phases between a data source and the target Berlin Model database. In the first phase, importers transform the source data into data valid against a “soft schema” that best fits the semantics of the elements in their source. For this step they may choose from a comprehensive Java library of transformation tools. If no appropriate soft schema exists, a new one can be used (e.g. a new version of TCS).

In phase 2, the “soft schema” data are transformed by defined rules (including atomizing and restructuring) into the final “strict schema”, which provides a fixed definition of elements and structures for taxonomic data sets. Like the Berlin Model, this schema can represent concepts and arbitrary relations, but it hides the complexity of the database model from the user. This transformation is semi-automatic: malformed source data are highlighted and may be corrected by the user.
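
The atomizing step of phase 2 can be pictured with a small Java sketch that splits an unparsed name string into genus, epithet, and authorship, and reports strings it cannot parse so that they can be highlighted for manual correction. This is not the parser used in the project: the class and method names are hypothetical, and the single regular expression is a deliberate simplification that ignores infraspecific ranks, hybrids, and non-ASCII letters.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameAtomizerSketch {

    /** Atomized parts of a binomial name. */
    public static final class AtomizedName {
        public final String genus;
        public final String epithet;
        public final String authorship; // empty string if the source gives none

        public AtomizedName(String genus, String epithet, String authorship) {
            this.genus = genus;
            this.epithet = epithet;
            this.authorship = authorship;
        }
    }

    // Simplified rule: capitalized genus, lower-case epithet, optional trailing authorship.
    private static final Pattern BINOMIAL =
            Pattern.compile("^([A-Z][a-z]+)\\s+([a-z][a-z-]+)(?:\\s+(.+))?$");

    /** Returns the atomized name, or an empty Optional if the string is malformed. */
    public static Optional<AtomizedName> atomize(String raw) {
        Matcher m = BINOMIAL.matcher(raw.trim());
        if (!m.matches()) {
            return Optional.empty(); // caller flags the record for manual correction
        }
        String authorship = m.group(3) == null ? "" : m.group(3).trim();
        return Optional.of(new AtomizedName(m.group(1), m.group(2), authorship));
    }

    public static void main(String[] args) {
        // Placeholder inputs; the second one lacks a capitalized genus.
        for (String raw : new String[] { "Aus bus L.", "aus bus" }) {
            Optional<AtomizedName> parsed = atomize(raw);
            if (parsed.isPresent()) {
                AtomizedName n = parsed.get();
                System.out.println(n.genus + " | " + n.epithet + " | " + n.authorship);
            } else {
                System.out.println("MALFORMED: " + raw);
            }
        }
    }
}
```

Returning an empty result instead of throwing keeps the transformation running over the whole data set, which suits a semi-automatic workflow in which the user reviews all flagged records together.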

Phase 3 is fully automated and consists of duplicate detection and an object-relational transformation of the “strict schema” data into the target database.
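
A much simplified picture of phase 3 is sketched below in Java: incoming values are normalized, values that normalize to the same key are treated as duplicates, and every distinct value receives one surrogate key of the kind a relational table would use. The normalization rule (trimming, collapsing whitespace, ignoring case) and all identifiers are assumptions for this example; the abstract does not specify how the actual duplicate detection works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DuplicateDetectionSketch {

    /** Assumed matching rule: ignore case and differences in whitespace. */
    public static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /**
     * Maps each input value to a surrogate key. The map stands in for a
     * database table keyed by the normalized value; it is filled as new
     * values are encountered, and duplicates reuse the existing key.
     */
    public static List<Integer> assignKeys(List<String> values, Map<String, Integer> table) {
        List<Integer> keys = new ArrayList<>();
        for (String value : values) {
            String normalized = normalize(value);
            Integer id = table.get(normalized);
            if (id == null) {
                id = table.size() + 1; // next free key in this toy table
                table.put(normalized, id);
            }
            keys.add(id);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Placeholder data: the first two strings differ only in case and spacing.
        List<String> incoming = Arrays.asList("Aus bus L.", "aus  bus l.", "Aus cus L.");
        Map<String, Integer> table = new LinkedHashMap<>();
        System.out.println(assignKeys(incoming, table));
        System.out.println(table);
    }
}
```

Records that refer to the same name then share one key, which is what allows them to be written as foreign-key references to a single row instead of as repeated strings.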

The method has been used successfully in the Med-Checklist project, in which Vols. I, III, and IV were imported into a Berlin Model database from heterogeneous sources (http://ww2.bgbm.org/mcl/home.asp). Further import tasks for the EU project EDIT, the IOPI Species Plantarum initiative, and the Euro+Med project will be used to refine the method.