Proceedings of TDWG, 2007

Data Integration Issues in Biodiversity Research

Jessie Kennedy, Shawn Bowers, Matthew Jones, Josh Madin, Robert Peet, Deana Pennington, Mark Schildhauer, Aimee Stewart

Abstract


The Scientific Environment for Ecological Knowledge (SEEK) project is developing an IT framework and infrastructure for deriving biodiversity and ecological knowledge by facilitating the discovery, integration, interpretation, and analysis of ecological information. SEEK is based on a three-layered architecture: the EarthGrid (the lowest layer) provides uniform access to biodiversity and other types of data sets; Kepler, a workflow tool (the highest layer), allows scientists to visually define, document, and execute their analyses; and the Semantic Mediation System (the middle layer) uses domain knowledge represented in ontologies and databases to inform the discovery, integration, and analysis of ecological data. The SEEK project has been motivated and directed by ecological analyses such as niche modelling and biodiversity studies. Example case studies have been used to explore the issues facing the researchers who undertake these analyses. This presentation will outline these issues and give an overview of the approaches used by SEEK.

Much modern research in ecology is based on the integration (and re-use) of multiple data sets. These data sets may be distributed globally, are stored in a variety of formats, and most likely have differing semantics, reflecting any of the many measurements of spatial and temporal environmental factors and of organismal characteristics and interactions that contribute to a given ecosystem. In a typical scenario, a scientist is interested in analyzing the spread of invasive species in a certain region. S/he has distribution records in a personal database, but requires access to other potentially relevant data sets on-line. The researcher needs to be able to discover candidate data sets and then merge their relevant and compatible information. To do so, the researcher must determine which data sets contain information about the species of interest or are relevant to the timescale and locality of the research. Simplistically, data sets might be retrieved and integrated on the basis of country and species name; however, even simple data files can be extremely time-consuming to integrate manually, and complicated, if at all possible, to integrate automatically, as a simple example will show.
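The difficulty of integrating on country and species name alone can be sketched in a few lines. The following is a minimal illustration, not part of SEEK: the records, field names, and counts are invented, and the join is a plain exact-string match.

```python
# Hypothetical occurrence records; every name and value here is invented
# for illustration and does not come from any real data set.
personal = [
    {"country": "USA", "species": "Lythrum salicaria", "count": 12},
    {"country": "Canada", "species": "Lythrum salicaria", "count": 3},
]
online = [
    {"country": "United States", "species": "lythrum salicaria L.", "count": 7},
    {"country": "Canada", "species": "Lythrum salicaria", "count": 5},
]


def naive_join(left, right):
    """Pair records whose country and species strings match exactly."""
    index = {}
    for rec in right:
        index.setdefault((rec["country"], rec["species"]), []).append(rec)
    return [
        (l_rec, r_rec)
        for l_rec in left
        for r_rec in index.get((l_rec["country"], l_rec["species"]), [])
    ]


matches = naive_join(personal, online)
# Only the Canada pair matches; the USA record is silently dropped because
# the country spelling, letter case, and author suffix all differ.
print(len(matches), "of", len(personal), "records matched")
```

Even when the strings do match, nothing in such a join establishes that the two sources mean the same thing by the name or measured the count in the same way.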

In order to find and integrate suitable data, metadata describing the content of the data sets is important; SEEK therefore requires data sets stored in the EarthGrid to be marked up with Ecological Metadata Language (EML). EML includes descriptions of the temporal, geographical, and taxonomic coverage of the data sets. Much of the terminology used in EML, such as table name or column label, is generically applicable to scientific data structures, while more domain-relevant terms, such as biomass or wing span, are defined in ontologies being developed by the SEEK team in conjunction with disciplinary specialists.
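How coverage metadata might be used to shortlist data sets can be sketched as follows. This is an illustrative assumption, not the EML schema or a SEEK API: the `Coverage` class, its field names, and `is_candidate` are hypothetical, and the bounding-box test ignores regions that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Hypothetical, simplified stand-in for a data set's coverage metadata."""
    west: float
    east: float
    south: float
    north: float
    start_year: int
    end_year: int
    taxa: frozenset


def is_candidate(dataset: Coverage, query: Coverage) -> bool:
    """True if the data set overlaps the query in space, time, and taxa."""
    boxes_overlap = (
        dataset.west <= query.east
        and query.west <= dataset.east
        and dataset.south <= query.north
        and query.south <= dataset.north
    )
    years_overlap = (
        dataset.start_year <= query.end_year
        and query.start_year <= dataset.end_year
    )
    shares_taxon = bool(dataset.taxa & query.taxa)
    return boxes_overlap and years_overlap and shares_taxon


# Invented example values.
query = Coverage(-80.0, -70.0, 35.0, 45.0, 1990, 2005, frozenset({"Aus bus"}))
nearby = Coverage(-75.0, -65.0, 40.0, 50.0, 2000, 2010, frozenset({"Aus bus"}))
too_old = Coverage(-75.0, -65.0, 40.0, 50.0, 1950, 1960, frozenset({"Aus bus"}))

print(is_candidate(nearby, query), is_candidate(too_old, query))
```

A filter of this kind only narrows the search; whether the taxon names in two coverage descriptions denote the same organisms is a separate question.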

The Semantic Mediation System (SMS) layer in SEEK uses ontologies to expand search terms for data discovery in the EarthGrid and to support the scientist in semi-automatically transforming data for input to appropriate analytical components in Kepler. This is accomplished using a generalized ontology for modeling “observational data”, called OBOE. OBOE provides a framework in which the meaning and inter-relationships of observations within a scientific data set can be clarified. For example, one can use OBOE to indicate that various data sets contain both weights and wing spans of bird specimens, which greatly facilitates effective data discovery and the potential integration of such data sets. The SEEK Taxon group, whose work also sits in the semantic mediation layer of SEEK, has been researching the more specialized issues associated with clarifying the semantics necessary to inter-relate the taxonomic coverage of ecological data sets.
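Ontology-based expansion of search terms can be illustrated with a toy hierarchy. The sketch below is an assumption for illustration only: the terms, the `NARROWER` mapping, and the annotated data sets are invented and are not taken from OBOE or the SEEK ontologies.

```python
# Hypothetical "broader term -> narrower terms" hierarchy.
NARROWER = {
    "morphological measurement": ["weight", "wing span"],
    "weight": ["body mass", "biomass"],
}

# Hypothetical data sets, each annotated with the terms its columns measure.
DATASETS = {
    "bird_survey_a": {"body mass", "wing span"},
    "plot_harvest_b": {"biomass"},
    "weather_c": {"air temperature"},
}


def expand(term, narrower=NARROWER):
    """Return the term together with all of its transitive narrower terms."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(narrower.get(current, []))
    return found


def discover(term, datasets=DATASETS):
    """Names of data sets annotated with the term or any narrower term."""
    wanted = expand(term)
    return sorted(name for name, terms in datasets.items() if terms & wanted)


# A literal search for "weight" would find nothing; the expanded search
# finds the data sets annotated with "body mass" and "biomass".
print(discover("weight"))
```

The same annotations that drive discovery could, in principle, tell a workflow which columns are compatible inputs for an analytical component, although that step is not shown here.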

Ecological data sets of relevance to biodiversity modeling tend to have been collected either over long periods of time or over a wide geographic range, and typically use unqualified biological names for recording taxon occurrences or counts (often codes are used, with the biological names specified in the metadata). However, because taxonomists continue to classify and name the known organisms, the meaning associated with these names changes over time. Representing the taxonomic coverage of ecological data simply by referencing species names therefore results in ambiguity, which may be significantly detrimental to the results of any subsequent ecological analysis. To address this problem, the SEEK Taxon group is adopting a taxonomic concept approach, as defined in collaboration with TDWG in the TCS standard. A necessary component will be formal modification of EML to support the identification of organisms to concept. We are currently developing tools to aid the ecologist in selecting appropriate taxon concepts, which will improve the accuracy of matching data for integration. The tools include a Taxon Object Server (whose model is closely based on TCS) to support the resolution of taxon names and concepts, and visual tools to enable users to compare concepts and clarify the relationships among them.
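The difference between matching on names and matching on concepts can be sketched as below. This is a simplified illustration under stated assumptions, not the TCS schema or the Taxon Object Server model: a concept is reduced to a name plus the source that defines it, the taxa and sources are placeholders, and the relationship labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    """A name as circumscribed by a particular source (simplified)."""
    name: str
    according_to: str


# Placeholder taxa: a later author splits the earlier, broader "Aus bus".
BROAD = TaxonConcept("Aus bus", "Author A 1950")
NARROW = TaxonConcept("Aus bus", "Author B 1990")
SPLIT_OFF = TaxonConcept("Aus cus", "Author B 1990")
ALL_CONCEPTS = [BROAD, NARROW, SPLIT_OFF]

# Hypothetical asserted relationships between pairs of concepts.
RELATIONS = {
    (BROAD, NARROW): "includes",
    (BROAD, SPLIT_OFF): "includes",
}


def concepts_for_name(name, concepts=ALL_CONCEPTS):
    """All concepts that share a bare name; more than one means ambiguity."""
    return [c for c in concepts if c.name == name]


def safe_to_merge(a, b, relations=RELATIONS):
    """Merge records only for identical or explicitly congruent concepts."""
    if a == b:
        return True
    label = relations.get((a, b)) or relations.get((b, a))
    return label == "is congruent to"


# The bare name resolves to two concepts, so name matching alone is ambiguous.
print(len(concepts_for_name("Aus bus")))
# Same name, different circumscription: not safe to merge automatically.
print(safe_to_merge(BROAD, NARROW))
```

In this sketch, records identified to the broad concept may include organisms that the later source assigns to the split-off taxon, which is why a shared name is not treated as sufficient grounds for merging.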