Proceedings of TDWG, 2007

Species Profile Model: Data integration lessons from GBIF

Donald Hobern

Abstract


The Species Profile Model (SPM) is being developed as a standard to simplify integration of species information from multiple sources and to maximise its usefulness and reusability.

The Global Biodiversity Information Facility (GBIF) has spent the last five years integrating biodiversity data from a wide range of resources and has learned several lessons that are likely to be relevant to those developing the Species Profile Model and to those planning to build species information networks.
1. Any data exchange model should be optimised for its key use cases. If the intended applications for the data are well understood, it is relatively easy to determine which elements in the model should be mandatory and which will require tightly controlled vocabularies. This allows attention to be focused on the most important areas, and less critical aspects can perhaps be deferred to future versions of the model.
2. Despite the difficulties in gaining consensus, some elements in a data model require strictly controlled vocabularies to ensure that applications can discover which data records are genuinely of interest for a given purpose. Tools are required to help data providers to map their existing data to such vocabularies.
3. Metadata require as much planning and design as the fields regarded as data. Associating appropriate metadata with a data resource makes it much easier for applications to select requested data records and to interpret them correctly.
4. Data models may atomise the same fundamental information to different degrees. The key driver for permitting this has usually been to ensure that all data providers are able to map their existing data into the model. For many purposes, the existence of alternative ways to represent the same information will not matter. However, it is important to consider the effect of this variation on applications that seek to consume and analyse the data. In the long run, modifying the source data may involve less effort than building client applications that can handle the extra complexity.
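
As an illustration of lessons 2 and 4, the following is a minimal Python sketch of a mapping tool, not part of the SPM or of any GBIF software. It maps free-text provider values to a controlled vocabulary and reports the values it cannot map, and it shows the extra branch a consumer needs when the same name may arrive either atomised or as a single string. The vocabulary terms, the synonym table, the `habitat` field and the function names are all hypothetical; the name fields `scientificName`, `genus` and `specificEpithet` are used here only as plausible examples.

```python
# Hypothetical controlled vocabulary; not taken from the SPM itself.
CONTROLLED_HABITATS = {"forest", "grassland", "wetland", "marine"}

# Hypothetical provider-specific synonyms mapped to controlled terms.
PROVIDER_SYNONYMS = {
    "woodland": "forest",
    "rainforest": "forest",
    "prairie": "grassland",
    "marsh": "wetland",
    "ocean": "marine",
}


def map_to_vocabulary(raw_term):
    """Return the controlled term for a provider value, or None if unmapped."""
    key = raw_term.strip().lower()
    if key in CONTROLLED_HABITATS:
        return key
    return PROVIDER_SYNONYMS.get(key)


def scientific_name(record):
    """Build a name string from either a combined or an atomised record.

    This branch is the extra client-side complexity that arises when a
    model allows the same information at different degrees of atomisation.
    """
    if "scientificName" in record:
        # Combined form: collapse any repeated whitespace.
        return " ".join(record["scientificName"].split())
    # Atomised form: join the non-empty parts.
    parts = [record.get("genus", ""), record.get("specificEpithet", "")]
    return " ".join(p.strip() for p in parts if p.strip())


def map_records(records):
    """Split records into mapped ones and (name, raw value) pairs to review."""
    mapped, unmapped = [], []
    for record in records:
        raw = record.get("habitat", "")
        term = map_to_vocabulary(raw)
        name = scientific_name(record)
        if term is None:
            unmapped.append((name, raw))
        else:
            mapped.append({"name": name, "habitat": term})
    return mapped, unmapped
```

Returning the unmapped values instead of discarding them lets a data provider see which terms still need a mapping, which is the kind of support lesson 2 calls for.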