Progress in making literature easily accessible: schemas and marking up
Terry Catapano, Anna Weitzman
Abstract
An important component of making biodiversity content available is the vast quantity of taxonomic information in printed form. Even 300+ year-old works remain relevant to taxonomy. Taxonomists have traditionally accessed this information by reading and taking notes, which are later incorporated into subsequent treatments. Similar, though more widespread, access exists for images of pages on the Web (i.e., the user still needs to know what to look for and where to look). Another step forward is to reproduce the printed information as machine-readable text. Even this still leaves the task of distinguishing relevant information in the potentially vast quantities of data. In order to make data in literature fully accessible, it must be encoded, have proper metadata added, and be made available for searching, linking and processing. Two projects, taxonX/GoldenGate (GG) and taXMLit/INOTAXA, are attempting to tackle this task.
The aim of the taxonX schema is to provide a minimally sufficient XML tagset to identify and delineate taxonomic treatments and their significant components, particularly scientific names, geographic names, bibliographic citations, and descriptions. Once encoded in taxonX, the treatment and its associated data can be more readily extracted and incorporated into other databases as well as accessed and integrated into external resources. Owing to the diverse, heterogeneous forms of taxonomic treatments, the schema design is loose and flexible. Similarly, the content of the data itself requires normalization in order to be useful within existing and future digital infrastructures.
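As a rough illustration of why delineated treatments are easier to extract, the following Python sketch pulls names, localities and citations out of a small marked-up sample. The element names (`treatment`, `name`, `locality`, `citation`), the sample content and the function `extract_treatments` are assumptions made for this example only; they are not taken from the taxonX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element names and this content are
# illustrative placeholders, not the actual taxonX tagset.
SAMPLE = """<document>
  <treatment>
    <name>Aus bus</name>
    <citation>Smith, 1900: 12</citation>
    <description>Body length 3 mm. Found near <locality>Lima</locality>.</description>
  </treatment>
  <treatment>
    <name>Aus cus</name>
    <description>Body length 4 mm.</description>
  </treatment>
</document>"""


def extract_treatments(xml_text):
    """Return one record per <treatment>, listing its tagged components."""
    root = ET.fromstring(xml_text)
    records = []
    for treatment in root.iter("treatment"):
        records.append({
            # iter() descends into nested elements, so a locality tagged
            # inside a description is still found.
            "names": [el.text for el in treatment.iter("name")],
            "localities": [el.text for el in treatment.iter("locality")],
            "citations": [el.text for el in treatment.iter("citation")],
        })
    return records


if __name__ == "__main__":
    for record in extract_treatments(SAMPLE):
        print(record)
```

Because the lookup relies only on the presence of tagged components and not on their order or nesting depth, a sketch like this tolerates the loose structure that heterogeneous treatments require.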
Developed independently, but alongside taxonX, GG contains tools for the semi-automatic markup of scientific names and treatment boundaries, and work proceeds on similar tools for bibliographic citations and geographic names. Tools to assist in the identification and normalization of descriptive data are possible, but more difficult. GG can take as input a cleaned OCR (optical character recognition) file in XML, HTML, or text format and export a taxonX instance.
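The idea of semi-automatic name markup can be sketched in a few lines of Python. This is not how GG works internally; it is a simplified stand-in that assumes a hypothetical lexicon of known genus names (`known_genera`) and a hypothetical `<name>` tag, and it would leave uncertain candidates for a human to review.

```python
import re
from xml.sax.saxutils import escape

# A candidate binomial: a capitalized word followed by a lower-case word.
# This pattern over-generates (e.g. "The ant"), hence the lexicon check.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def mark_up_names(text, known_genera):
    """Wrap candidate binomials whose genus is in known_genera in <name> tags.

    known_genera is a hypothetical lexicon supplied by the caller; all
    other text is XML-escaped and passed through unchanged.
    """
    pieces = []
    pos = 0
    for match in CANDIDATE.finditer(text):
        if match.group(1) not in known_genera:
            continue  # not a known genus: leave the text untagged
        pieces.append(escape(text[pos:match.start()]))
        pieces.append("<name>" + escape(match.group(0)) + "</name>")
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(mark_up_names("The ant Atta cephalotes & others.", {"Atta"}))
```

Restricting tags to a lexicon trades recall for precision: names in unlisted genera are missed, which is one reason a manual review step remains part of a semi-automatic workflow.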
taXMLit is another schema for tagging taxonomic literature. Unlike taxonX, it is deliberately a fairly complete representation of data within the literature and thus is a complex schema. Taxonomic literature has a limited number of ‘kinds’ of information. These may be recognized in several ways, including using GG. Starting from XML text in which these kinds are designated (e.g., a taxonX instance), another set of tools is being developed to further parse and normalize data from the kinds of paragraphs most likely to be needed by taxonomists (e.g., taxon heading, synonymy, specimen citations). As different formats of these kinds of paragraphs are identified, a library of tools will be built. Artificial intelligence should be able to select which tool is needed for each paragraph.
We believe experiences gained in the development of taxonX and taXMLit can inform future efforts to establish TDWG standard(s) for taxonomic literature. Two approaches to this task might be considered. First, the development should be of a Standard, not necessarily a Schema. A core Vocabulary could be developed, with a number of different expressions, each ontologically harmonic, but in forms optimal for particular processes and uses. Secondly, the NLM/NCBI Journal Archiving DTD (a Document Type Definition defines the allowed building blocks of an XML document) should be investigated as one of the forms for expression of a TDWG Literature Standard. The NLM DTD enjoys strong and committed maintenance and has been adopted widely. It is designed to be modular, with domain-specific elements added to the base generic markup elements.