Plazi: Implementing SPM
Terry Catapano
Abstract
Plazi is a Swiss-based non-profit organization dedicated to the digitization of legacy scientific literature. Plazi is currently engaged in a project with GBIF and the Encyclopedia of Life (EOL) on an experimental implementation of the Species Profile Model (SPM). Our aim is to use SPM RDF to describe data on ca. 5000 ant species drawn from taxonomic treatments encoded in the TaxonX XML Schema and served at http://plazi.org. The SPM data will in turn be accessed and processed by EOL agents for incorporation into their resources. Plazi TaxonX documents often record or refer to valid nomenclatural acts, and as such may offer rich information to SPM about the Taxon Concepts they define or cite.
SPM documents have two major components: the Taxon Concept which the model documents, and a series of Information Items ("InfoItems") further elucidating attributes of the described taxon. In general, the InfoItems fall into one of several named classes covering different aspects of scientific interest, and are specified using the TDWG OWL/RDF Ontology mechanisms. These range from descriptions of the biology of the taxon, to its ecological impacts and relations, to management and social impacts. Our focus is mainly on representing taxonomic descriptions and the support for them (e.g. the specimens documenting them). InfoItem attribute values can be described either with controlled vocabularies representing, in the case of descriptions, characters and states, or with text-based descriptive phrases extracted from the TaxonX document. Refining the latter into the former is a machine learning research problem actively pursued by Plazi and others in collaboration with Dr. Hong Cui (University of Arizona), as well as by other groups. In this presentation we discuss only how we generate the textual descriptions. Because SPM is a developing specification with many issues still to be encountered and investigated, we will also discuss our thoughts on some of these, such as:
• Syntax of SPM RDF: We will be producing RDF serialized in XML. The XML itself can express the relationships among the data in a variety of ways. What is the optimal expression for the purposes at hand?
• Adequacy and completeness of data: What data should be included? In what form? To what extent should data be explicit in an SPM instance as opposed to being obtained through dereferencing of URIs?
• Provenance data: How can information about the source of both the data (i.e., the publication containing the treatment) and the SPM instance itself be best expressed?
• Validation: What mechanisms can/should be available for validation of the SPM instance? What degree of validity is necessary?
• Profiling: How can all of the above be formally or semi-formally communicated to enable efficient interoperability?
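To make the syntax question concrete, the following is a minimal sketch of one possible RDF/XML shape: a Taxon Concept reference plus InfoItems carrying text-based descriptive phrases. It is an illustration under stated assumptions, not Plazi's implementation. The `spm` namespace URI, the element names (`SpeciesProfileModel`, `aboutTaxon`, `hasInformation`, `InfoItem`, `hasContent`), the example taxon URI, and the function `build_spm_rdf` are all hypothetical placeholders; only the `rdf` namespace is the standard W3C one.

```python
import xml.etree.ElementTree as ET

# Standard W3C RDF namespace.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ASSUMPTION: placeholder namespace, not the actual TDWG SPM ontology URI.
SPM_NS = "http://example.org/spm#"


def build_spm_rdf(taxon_uri, descriptions):
    """Return an RDF/XML string with one profile and one InfoItem per phrase.

    The Taxon Concept is referenced by URI (to be dereferenced by the
    consumer), while each descriptive phrase is embedded explicitly.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("spm", SPM_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    profile = ET.SubElement(root, f"{{{SPM_NS}}}SpeciesProfileModel")

    # Taxon Concept: included by reference only (hypothetical property name).
    about = ET.SubElement(profile, f"{{{SPM_NS}}}aboutTaxon")
    about.set(f"{{{RDF_NS}}}resource", taxon_uri)

    # InfoItems: nested inline, each holding one extracted text phrase.
    for phrase in descriptions:
        has_info = ET.SubElement(profile, f"{{{SPM_NS}}}hasInformation")
        item = ET.SubElement(has_info, f"{{{SPM_NS}}}InfoItem")
        content = ET.SubElement(item, f"{{{SPM_NS}}}hasContent")
        content.text = phrase

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical taxon URI and phrases, for illustration only.
    print(build_spm_rdf(
        "http://example.org/taxon/example-ant",
        ["Head longer than broad.", "Mandibles with five teeth."],
    ))
```

The sketch mixes the two options raised under adequacy and completeness: the taxon is left to be obtained by dereferencing a URI, while the descriptions are explicit in the instance. The same triples could equally be serialized with each InfoItem as a top-level element linked by `rdf:resource`, which is the kind of variation the syntax question concerns.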