Using Distributed Annotations for Continuous Quality Control of Biodiversity Data
Paul J Morris, James Macklin, Maureen Kelly, Robert A Morris, Zhimin Wang
Abstract
Meaningful federation of scientific data is not attainable without the assessment of the quality and validity of the aggregated data in the context of particular research problems, i.e., its fitness for use.
The Filtered Push platform (http://etaxonomy.org/FilteredPush) implements a network that circulates annotations signaling the location and consequence of potential errors in data, and optionally allows corrections to be pushed back to the original data curator. In addition to the network cyberinfrastructure, we have prototyped a domain-independent XML Schema for annotations that has proved suitable for some of the needs we have identified for data quality control. Among these are:
• simple accuracy problems such as errors made during the capture of the data (e.g. spelling, numeric reversal),
• errors arising from representations and interpretation of the data (e.g. inconsistencies in local to global concept mapping, unit conversions), and
• timeliness issues arising from the currency of taxonomic identifications.
Our Java and Web Service APIs support collaborations with other platforms. For the GBIF Integrated Publishing Toolkit (IPT), we have demonstrated the injection into the network of annotations from an IPT client. Similarly, we have prototyped interfaces to the Specify6 collection management software for both the injection and acceptance of annotations about data in the local specimen database.
We refer to "Continuous Quality Control" because science, data, or data corrections that emerge after a scientific analysis based on a data set may change the conclusion of the analysis. This changing knowledge at any time, in any place, is a variant of the Open World assumption and brings two consequences: (1) any annotation schema or ontology must be able to transport any present or future domain concepts, and (2) a notification mechanism such as a publication/subscription overlay is necessary to ensure that network participants know when existing annotations (or un-annotated data) are the subject of new knowledge or have become inconsistent with new data.
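The two consequences above can be illustrated with a minimal sketch in Python. It is not the Filtered Push API or its XML Schema: the names `Annotation`, `AnnotationNetwork`, `subscribe`, and `publish`, the record identifier, and the payload keys are all hypothetical. The open-ended `assertions` mapping stands in for a schema able to carry domain concepts not known in advance, and the in-memory subscriber table stands in for a publication/subscription overlay.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """Hypothetical domain-independent annotation about one data record."""
    record_id: str   # identifier of the annotated record (assumed format)
    issue: str       # e.g. "accuracy", "representation", "timeliness"
    # Open-ended payload: any present or future domain concept can be
    # carried as a key/value pair without changing this class.
    assertions: Dict[str, str] = field(default_factory=dict)


class AnnotationNetwork:
    """Toy in-memory stand-in for a publish/subscribe overlay (assumption)."""

    def __init__(self) -> None:
        # Maps a record identifier to the callbacks interested in it.
        self._subscribers: Dict[str, List[Callable[[Annotation], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, record_id: str,
                  callback: Callable[[Annotation], None]) -> None:
        """Register interest in future annotations about one record."""
        self._subscribers[record_id].append(callback)

    def publish(self, annotation: Annotation) -> int:
        """Deliver an annotation to every subscriber of its record.

        Returns the number of subscribers notified.
        """
        callbacks = self._subscribers.get(annotation.record_id, [])
        for callback in callbacks:
            callback(annotation)
        return len(callbacks)


if __name__ == "__main__":
    network = AnnotationNetwork()
    curator_inbox: List[Annotation] = []

    # A curator subscribes to annotations about a specimen record they hold.
    network.subscribe("specimen-001", curator_inbox.append)

    # Later, someone publishes a new determination for that specimen.
    notified = network.publish(
        Annotation(
            record_id="specimen-001",
            issue="timeliness",
            assertions={"proposedScientificName": "Quercus alba L."},
        )
    )
    print(notified, curator_inbox)
```

In this sketch a subscriber learns of new knowledge only for records it has registered interest in. A real overlay would also need persistence, filtering on annotation content, and a way for curators to accept or reject proposed corrections, none of which is modeled here.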