Proceedings of TDWG, 2009

Plazi: Building Communities and Software for Increasing the Utility of Digitized Biodiversity Publications

Guido Sautter, Donat Agosti, Terry Catapano, Robert A. Morris

Abstract


Taxonomic literature includes a body of several hundred million printed, and thus hardly accessible, pages of highly structured, data-rich descriptions. Digitization and semantic markup enable data mining and extraction, as demonstrated by Plazi.

Plazi’s document collection comprises over 500 taxonomic publications, including all literature on ants in Madagascar, all publications on ants worldwide published since 2007, and all Zootaxa papers on ants, fish, and platygasteroid wasps; in all, over 12,000 treatments on over 10,000 different taxa.

Plazi’s main markup tool, the GoldenGATE Document Editor, is now in version 3. The new version improves usability, performance, and adaptability. Built on top of it is the GoldenGATE Markup Wizard, an easy-to-use, highly automated tool that guides users through the document markup process. It is most efficient when adapted to a specific type of document, e.g. a specific journal. It allows users to create comprehensive markup in less than a minute per document page.

A GoldenGATE Server hosts Plazi’s document collection and treatments. Its Tomcat-based web front-end provides multiple interfaces for accessing the treatments and their details. Through these and remote interfaces, Plazi collaborates with many other institutions and initiatives, both as a donor and as a consumer of data. Document markup includes adding LSIDs (Life Science Identifiers) from HNS (Hymenoptera Name Server) and/or Zoobank to the taxonomic names; previously unknown taxa are uploaded to both providers in the process. A generic XML interface providing raw treatments is the basis for most of the other services. An HTML-based search portal allows human users to browse the treatment collection, links to specimen images on Antweb and Morphbank, and visualizes georeferenced occurrence records in Google Maps. GBIF (Global Biodiversity Information Facility) harvests occurrence records from a TAPIR (TDWG Access Protocol for Information Retrieval) provider. EOL (Encyclopedia of Life) harvests treatments from an eXist-based SPM (Species Profile Model) interface.
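
As an illustration of the identifiers attached to taxonomic names during markup, the following minimal sketch splits an LSID into its components, assuming the general URN layout urn:lsid:authority:namespace:object[:revision]. The function name, the record type, and the example identifier are hypothetical and are not taken from HNS, Zoobank, or GoldenGATE.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Components of a Life Science Identifier (hypothetical helper type)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision].

    Raises ValueError if the string does not follow that layout.
    """
    parts = text.strip().split(":")
    # Five parts without a revision, six parts with one.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn' and 'lsid' prefixes are matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # An absent or empty revision is reported as None.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


if __name__ == "__main__":
    # Made-up identifier, used only to show the shape of the result.
    print(parse_lsid("urn:lsid:example.org:names:12345"))
```

A parser of this kind only checks the syntactic shape of an identifier; whether a name is actually registered would still have to be resolved against the issuing provider.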

Upcoming collaborations will further enhance Plazi’s document collection. FishBase will join the line of LSID providers and have new taxa uploaded to its database. Bibliographic metadata will be synchronized with Zoobank, GNUB (Global Name Usage Bank), and BHL’s (Biodiversity Heritage Library) CiteBank. Original description treatments will be exported to Wikipedia.

In an upcoming project, Plazi will widen its scope to ecological publications. This includes the tools for marking up such publications as well as the facilities to host them and make them available on the web. As a side effect, occurrence records from taxonomy and ecology will become available as one larger dataset.

Furthermore, plans exist to mark up portions of BHL’s vast and steadily growing data set by assembling individual pages into documents, marking up the documents, and exposing the contained treatments through Plazi’s interfaces. To handle this huge amount of data, the markup will be handed over to a community of volunteer users. The web front-end of GoldenGATE Server will be extended with community functions, and the interactive document markup will be handled in small web-based dialogs. This eliminates the need for client-side software and lets community members contribute in small time slices whenever it suits them, since each dialog takes at most a minute to answer. A voting mechanism will ensure data quality.
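
The details of the voting mechanism are not specified here. One possible reading is that several volunteers answer the same dialog and an answer is accepted only once enough of them agree. The sketch below illustrates that idea. The function name, the minimum number of votes, and the agreement threshold are assumptions made for illustration, not Plazi's design.

```python
from collections import Counter
from typing import Optional, Sequence


def accepted_answer(
    votes: Sequence[str],
    min_votes: int = 3,      # assumed: fewest answers needed before deciding
    min_share: float = 0.6,  # assumed: fraction of votes the winner must reach
) -> Optional[str]:
    """Return the winning answer for one markup dialog, or None if undecided."""
    if len(votes) < min_votes:
        return None  # too few volunteers have answered so far
    # most_common(1) yields the answer with the highest count.
    answer, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_share:
        return answer
    return None  # no sufficient agreement; the dialog stays open


if __name__ == "__main__":
    # Three hypothetical volunteers label the same paragraph.
    print(accepted_answer(["treatment", "treatment", "caption"]))
```

Under these assumptions, a dialog for which no answer is accepted would simply be shown to further volunteers until agreement is reached.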