Literature & interoperability: a working example using Ants
Donat Agosti, Terry Catapano, Guido Sautter
Abstract
Print is still the main medium for communicating taxonomic results. Traditionally, printed taxonomic publications include all the information (data, analysis, conclusions) needed to understand new research results. This system has been very successful, surviving for almost a quarter of a millennium. Even today, with widespread technologies for electronic distribution, the basic means of taxonomic communication has not changed and does not yet take full advantage of these technologies.
An understanding of the successful print model of systematics should orient efforts in the shift to a new digital knowledge infrastructure. In essence, a taxonomic treatment is the amalgamation, in a single record, of the information we consider relevant to describe our taxa, often including not just the inferred hypotheses but also the underlying data. If sufficiently detailed, the latter can be identified, extracted, and used to populate dedicated databases on specimens, nomenclature, or bibliographic citations.
Our digital library project, funded by the German DFG and the US NSF, is built upon this premise. To represent the significant components of systematics literature digitally, the XML schema TaxonX (http://taxonx.org) has been developed. The prospect of encoding tens of millions of printed pages inspired the development of dedicated mark-up software (GoldenGATE), which enables the semi-automatic mark-up of suitably clean OCRed texts.
But even this process is still time-consuming and dependent on the involvement of experts. For this reason, a dedicated server, plazi.org (http://plazi.org), will be launched at the TDWG meeting. It will allow the community not only to retrieve the respective documents but also to participate actively in the mark-up process and to retrieve digital versions of individual treatments (descriptions of taxa). Openly available services like iSpecies or EDIT’s scratchpads will be able to access the treatments and incorporate them in their mash-ups or use them as seeds for scratchpads.
For the legacy publications to become truly interoperable, TaxonX allows the inclusion of references to identifiers in the increasing number of dedicated databases (e.g. GBIF; bibliographic references). To bridge the gap between idea and implementation, unique identifiers for ant names will be retrieved from the Hymenoptera Name Server (which holds >200K names, including all ant names) and expressed as LSIDs. For literature, handles are retrieved via bioguid.org from plazi.org’s handle server, an integral part of DSpace, the repository used to administer all the digitized legacy ant publications.
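As an illustration of what "expressed as LSIDs" involves, the following minimal sketch builds and parses identifiers using the general LSID form urn:lsid:authority:namespace:object[:revision]. The authority (example.org), the namespace (ant_names) and the object number are hypothetical placeholders, not the values actually issued by the Hymenoptera Name Server, and the helper names are our own.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The parts of a Life Science Identifier (revision is optional)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        # Serialise as urn:lsid:authority:namespace:object[:revision]
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, rejecting malformed input."""
    parts = text.strip().split(":")
    # The "urn:lsid" prefix is matched case-insensitively.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(not p for p in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


# Hypothetical identifier for a taxon name record (all values are placeholders).
name_id = LSID("example.org", "ant_names", "12345")
```

A name annotated in a marked-up treatment could then carry str(name_id) as its external reference, and a consuming service could call parse_lsid to decide which authority to query.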
Although plazi.org currently concentrates on ants and legacy publications, it can in principle provide its services for any taxon. All of this comes at a high cost. What is needed in the future are dedicated databases (specimens, characters, names, bibliographies, etc.), unique identifiers, a program like LUCID to machine-generate both a human-readable text and the underlying XML mark-up, and publishers who integrate taxonomy-specific annotations alongside the human-readable text seen in taxonomic publications.