1. Client's Perspectives: User Needs
1.1. EDIT needs Biodiversity Information Standards
Walter G. Berendsohn, Markus Döring, Malte C. Ebach
Botanic Garden & Botanical Museum Berlin-Dahlem
The European Distributed Institute of Taxonomy (EDIT) is a network of 26 leading natural history institutions and organisations in the European Union, the United States and Russia. EDIT is a “Network of Excellence” project financed by the European Commission, which implies that its main aim is the durable integration of institutional resources to jointly meet the challenges taxonomy faces today. The EDIT project started on 1 March 2006 and will last until 2011.
An integral part of EDIT is the creation of an “Internet Platform for Cybertaxonomy”. This is a distributed computing platform that will allow taxonomists to do taxonomic revisions (including processing the results of field work) more efficiently, expediently and via the web. It consists of interoperable but independent platform components, which can take the form of software applications (desktop or web-based) for human users or (web) services. The envisioned platform will not have a single user interface or website; instead, it will be a collection of interacting components which may be combined and assembled according to the task at hand. A central endeavour of EDIT is to establish a Common Data Model that platform components adhere to. In the near future, this will include more or less loose coupling with existing software solutions. More information is available at http://wp5.e-taxonomy.eu/EDIT-Architecture.html.
Present and future EDIT member institutions will join agreements for the maintenance and use of specific parts of the platform (components, standards, data provision or access) once they are considered mature enough for practical use.
For the development of the Platform, EDIT will closely collaborate with projects, organisations and initiatives with overlapping aims, prominently among them TDWG. Members of EDIT staff have been active in TDWG groups and meetings. TDWG offers an indispensable forum for contacts and networking among those active in biodiversity informatics. For EDIT software development, TDWG standards and discussions will be considered and incorporated. One obstacle is the lack of integration among TDWG Standards (e.g., TCS, SDD, ABCD), which, on a structural level, are largely incompatible with each other. However, the achievements of TDWG groups on data definition at the atomic level (i.e., definition of the semantics of data elements) are recognised and indispensable for EDIT’s planned Common Data Model. This will need further development and well-defined content standards (ranging from controlled vocabularies to data services), which requires close involvement of the biological community in TDWG working groups.
Another area EDIT is engaged in is certification of biodiversity informatics software. The new TDWG standards process is a possible model for such an endeavour. One of the criteria for “EDIT certified software” will certainly be compatibility with applicable TDWG standards as well as more widely used standards recommended by TDWG.
Support is acknowledged from: European Commission Framework Programme 6
1.2. Biodiversity Heritage Library: Progress & Potential
Chris Freeland
Missouri Botanical Garden
The Biodiversity Heritage Library (BHL) is an international consortium of 10 natural history libraries with a goal to digitize a significant collection of materials across the member libraries. A working prototype for BHL is online at http://www.biodiversitylibrary.org.
Developments in the next two years include enhancing this interface and providing globally unique identifiers and robust services guided by TDWG standards and recommendations. Developments will allow remixing and incorporation of material into complementary applications.
Support is acknowledged from: Alfred P. Sloan Foundation, John D. and Catherine T. MacArthur Foundation
1.3. One million species in the Catalogue of Life – a triumph for Species 2000 and ITIS, or for TDWG standards?
Frank A. Bisby
Species 2000 Secretariat, School of Biological Sciences, University of Reading
On 29 March 2007 Species 2000 and ITIS held their ‘One Million Species Day’ celebrating reaching one million species in their Catalogue of Life. This was achieved by federating species checklists from 47 taxonomic databases from around the world. Not only was the Species 2000 programme initiated by TDWG, but from the start in 1996 the programme depended on standards for interoperability within its architecture for federating many taxonomic databases. Then as now, TDWG was considered the community’s forum and authority for standards. So how has TDWG served this client community over the eleven years, and how has this client responded? First – the will of TDWG to establish and promote practical standards, as distinct from acting as a forum for innovation in biodiversity informatics, has fluctuated over the years. Second – the early cohort of standards were largely content standards, but these can nonetheless prove valuable to a programme such as ours. Third – the gradual shift to schemas and protocols at the informatics level has done much to widen the generality of solutions and to open opportunities for multiple uses. Fourth – we need to be realistic about the time-lags between design, adoption, implementation and effective adoption in the community, and where possible to manage this life-cycle rather severely. The response from our species checklist database community has been decidedly mixed. Huge variations in the sense of purpose and in perceptions of how it should be done have led to some exciting innovations, but also to much needless diversity in how simple tasks are done. Some of the disappointing elements in this response relate to the weak uptake of generic software in our community, and the shortage of success stories in this area. Lastly, with participation of more than 50 databases in the Species 2000 programme, we can bring to TDWG incipient standards that have already proved effective within this community.
One is the SPICE Protocol for federating species checklists, another is the Species 2000 Data Content Standard for species checklists, and we have started a ‘best practice’ document that addresses content and management. On behalf of both the Species 2000 and the ITIS programmes it is important to re-iterate both the fundamental importance of interoperability standards, and the work that TDWG is doing. Nowhere are standards more important than in biodiversity. Our ability to describe, model and manage global biodiversity depends entirely on our ability to synthesise high level knowledge from the myriad individual observations and syntheses made independently around the world: distributed systems and interoperability are central to this task.
1.4. User Needs - The alpha and omega of system design
Charles J.T. Copp
Charles Copp Environmental Information Management
This presentation will include user needs, the role of interfaces and web services in building systems to serve different types of users and the use of thesauri for providing appropriate user-targeted terms.
Developers jokingly complain that the problem with software is the users: ‘users never read manuals and can be bloody-minded or even downright stupid’. Most potential users do not really understand their data requirements or have a clear idea of what can be delivered. This is especially true in large scale information projects, of which the database software forms only a part, for instance, a local or regional biodiversity network. Is there any consensus on what the potential users want out of a biodiversity network?
The key issues are: who are the users, what are their real needs, what problems can the proposed system solve, how can different levels of user get what they need, will their requirements change over time, and who will pay for it? In the UK at least, a failure to solve these issues contributes to the confusion and demoralisation in library, museum and school services. All too often the debate is about what they should get, not what they need. Is there a danger of this with biodiversity networks?
Establishing user needs is a difficult and under-estimated task. The now outmoded Structured Systems Analysis and Design Method (SSADM) was particularly good for describing existing systems and establishing user requirements. Data flow diagrams (DFDs) remain one of the most powerful tools for charting the limits of the system and defining what parts affect what users, but have little to say about user interfaces. Times move on, and the rise of prototyping, extreme programming, object-oriented methodologies and web-related technologies has given us new paradigms for system development, but the user definition problems remain much the same. Much of the effort still goes into data capture, data storage and linking or querying distributed databases, but not enough effort goes into data re-purposing or repackaging for different types of users. Even less effort goes into establishing what sort of data were needed in the first place. The result is increasingly large, interconnected data systems that solve few real-world problems.
The work to create data models and set terminology, validation and verification standards on an international scale continues to be spectacularly successful and TDWG and related projects can be justifiably proud of their achievements. This is not true for usability of data access applications, which is probably the greatest limiting factor in extending the value of these systems. For instance, it is quite clear that the choice of language and depth of information used in answering questions from children, members of the public or keen local naturalists are very different. Likewise in building applications “one size never fits all”.
We are still at the rudimentary stage of interface design: witness the blank text box labelled “Enter a species name”, and hierarchical taxonomic trees that are of little use to non-specialists. Real progress will only come with interfaces that are designed for the level of knowledge of the user. It is especially important to give users the means to explore what is held within a system according to their level of experience and interest. Users must not be forced to follow a rigid access routine.
1.5. Exploring the Brave New World of eTaxonomy
Chuck Miller
Missouri Botanical Garden
The Missouri Botanical Garden has multiple initiatives in progress that are opening the door to a new world of taxonomic research methods. We are developing new online pathways to taxonomic data, digitizing reference literature, and engaging with other institutions to better integrate plant data world-wide. But to truly fulfill the vision of this new world requires development of standard ways to share and integrate multiple dimensions of data. I will explore the peaks and valleys of this unexplored territory and suggest some priorities for moving forward from the point of view of a taxonomic research center.
2. Client's Perspectives: Examples of TDWG Standards in Use
2.1. TDWG Standards in use within the Global Biodiversity Information Facility (GBIF) Data Portal
Tim Robertson
GBIF
This presentation will include a very high level overview of the Biodiversity Data Portal (http://data.gbif.org) offered by the Global Biodiversity Information Facility (GBIF, http://www.gbif.org). The process of harvesting, parsing, and efficiently serving data for graphical user interface (GUI) tools and reporting services will be covered, illustrating the heavy dependency on TDWG standards. The mechanism employed to normalise the incoming data from various formats will be explained. This will highlight a use for a Universal Biodiversity Data Bus, which is a common set of standards for publishing, discovering and accessing data across the Internet.
From this overview, non-technical participants will receive an insight into the data flow involved, some of the limitations faced, and how important TDWG formats are when processing data. It is expected that this will form a good basis for subsequent technical discussions relating to the Universal Biodiversity Data Bus.
The data within the GBIF network is collated using Distributed Generic Information Retrieval (DiGIR), the Biological Collection Access Service for Europe (BioCASE), and the TDWG Access Protocol for Information Retrieval (TAPIR). These are all protocols encapsulating various versions of DwC (Darwin Core 2) and Access to Biological Collections Data (ABCD). The data is served to the public through the new GBIF Data Portal in many forms, including DwC and the Taxonomic Concept Schema (TCS), and employing Life Science IDentifiers (LSIDs).
Support is acknowledged from: The Global Biodiversity Information Facility
2.2. Assessing the Threat of Invasive Species in South America: an ensemble modeling approach in support of data standards, integration, and dissemination
Miguel Fernandez1, Wendy Tejeda2, Guillermo Duran3, Adriana Rico4, Christian Arias2, Maria Laura Quintanilla2, Alberto Pareja2, Juan Carlos Chive5, Monica Rivera2, Healy Hamilton6
1 University of California, Merced; California Academy of Sciences, San Francisco, 2 Centro de Analisis Espacial, Universidad Mayor de San Andres, La Paz, Bolivia, 3 California Academy of Sciences, San Francisco; San Francisco State University, 4 Centro de Analisis Espacial, Universidad Mayor de San Andres, La Paz, Bolivia, 5 Museo Noel Kempff Mercado, Bolivia, 6 Center for Biodiversity Research and Information, California Academy of Sciences, San Francisco
Today’s global economy moves unprecedented quantities of people and products around the planet, increasing the probability that alien species will be introduced and successfully established beyond their native ranges. Invasive alien species (IAS) are the second most important cause of biodiversity loss, and pose additional threats to agriculture and human health. Together, IAS, habitat alteration and climate change are dramatically re-shaping biogeographic patterns across the globe. We need accessible data and analysis tools to assess the threats of IAS at multiple stages: to identify at-risk habitats before invasion occurs, to identify potential arrival sites, and to understand potential routes and rates of dispersal. Beyond threat assessment, data and tools are needed to create conservation strategies that mitigate these threats. In Latin America, economic losses from IAS amount to billions of dollars annually, but strategies to minimize the damage of IAS are generally underdeveloped. We describe an international collaboration using novel techniques to predict the potential distributions of IAS in South America.
Researchers from the California Academy of Sciences, The Nature Conservancy (TNC), and the Centro de Analisis Espacial of the Universidad Mayor de San Andres in Bolivia, are using ensemble distribution modeling to generate composite potential distribution maps for 300 of the most threatening IAS in South America. We are using species occurrence data, derived from both museum specimens and observations obtained from the TNC Invasive Species Initiative, the IABIN Invasive Species Information Network (I3N), and the Global Biodiversity Information Facility (GBIF). Global environmental data layers and higher resolution regional layers are being used to predict distributions of IAS. Seven distribution modeling algorithms are being run for each IAS: Bioclim, Minimum distance, Climate space model, Distance to average, Environmental distance, Garp and MaxEnt. The outputs are combined using a consensus method to produce an ensemble model. Composite maps reveal ‘hotspots’ of IAS susceptibility, depicting which regions of South America are most at risk from the threats of IAS. We are compiling a database of all the biological and spatial data input, as well as all the output models, which will be made publicly accessible.
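The abstract does not say which consensus method is used to combine the seven model outputs. As a minimal sketch only, assuming each algorithm's output has already been thresholded to a binary presence/absence grid of the same shape, a simple agreement-fraction consensus and a per-cell species count for the composite 'hotspot' map could look like this. The function names and the 0.5 majority threshold are illustrative assumptions, not the project's actual procedure.

```python
from typing import Dict, List

Grid = List[List[int]]  # 1 = cell predicted suitable, 0 = not suitable


def consensus_map(model_outputs: Dict[str, Grid]) -> List[List[float]]:
    """Return, per cell, the fraction of models that predict presence."""
    grids = list(model_outputs.values())
    if not grids:
        raise ValueError("at least one model output is required")
    rows, cols = len(grids[0]), len(grids[0][0])
    for grid in grids:
        if len(grid) != rows or any(len(row) != cols for row in grid):
            raise ValueError("all model grids must share the same shape")
    n = len(grids)
    return [[sum(g[r][c] for g in grids) / n for c in range(cols)]
            for r in range(rows)]


def majority_vote(agreement: List[List[float]], threshold: float = 0.5) -> Grid:
    """Mark a cell suitable when more than `threshold` of the models agree."""
    return [[1 if value > threshold else 0 for value in row]
            for row in agreement]


def hotspot_map(species_maps: List[Grid]) -> Grid:
    """Count, per cell, how many species have an ensemble prediction of 1."""
    if not species_maps:
        raise ValueError("at least one species map is required")
    rows, cols = len(species_maps[0]), len(species_maps[0][0])
    return [[sum(m[r][c] for m in species_maps) for c in range(cols)]
            for r in range(rows)]
```

In this sketch, `consensus_map` would be run once per species over its model outputs, `majority_vote` would turn the agreement fractions into one ensemble grid, and `hotspot_map` would stack the ensemble grids of all species into a composite map.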
Our future goals include: 1) creating web access to all project inputs and outputs, including the high-resolution regional environmental data layers we created specifically for this IAS modeling research; 2) building a website to support the collection and distribution of invasive species occurrence data in Bolivia, the only South American country not currently contributing to the I3N effort; and 3) incorporating estimates of future land use and climate change in predicting IAS distributions for South America.
Support is acknowledged from: California Academy of Sciences, The Nature Conservancy, Centro de Analisis Espacial
2.3. Results of a Needs Assessment Survey of the Global Invasive Species Information Network (GISIN)
Annie Simpson1, Jim Graham, Michael Browne2, Hannu Saarenmaa3, Elizabeth Sellers4
1 US National Biological Information Infrastructure, 2 IUCN Invasive Species Specialist Group, 3 Finnish Museum of Natural History, 4 US Geological Survey
The Global Invasive Species Information Network (GISIN) is developing a system for the exchange of invasive species information over the Internet utilizing TDWG standards. A critical step in the process of creating this system is to determine requirements of its eventual users. The system's users can be divided into four types:
1) data providers: organizations and persons that will provide data;
2) data consumers: intermediary organizations and persons that will use the system's primary data for modeling and other analyses, and then make these value-added products available back through the system;
3) stakeholders: those who support the system without necessarily providing or consuming data; and
4) end users: those who use the system’s data and/or analyses, but do not provide products back through the system.
The results of a needs assessment survey to obtain user requirements, which ran from 15 December 2006 through 15 February 2007, had both surprising and expected elements.
There were 137 respondents from 41 countries, 80% of whom identified themselves as providers and consumers of invasive species data. As expected, most (77%) offer invasive species spatial/temporal information, profiles/species pages (65%), and checklist information (59%). Although most are data providers, their technical knowledge is surprisingly low: 80% said they do not know what existing protocols are appropriate for invasive species information management; 45% do not know the level of web services their organization provides and/or uses; and 75% do not know what schemas/grammars would be acceptable to copy or extend for the GISIN data exchange system. A complete report of survey results is available at http://www.gisinetwork.org/Survey/SurveyResultsFinal.htm.
From the results of this survey, it was determined that standards for the GISIN system will need to be both simple to implement and easy to understand, if the system is to be a success. Because only 23% of respondents said Python is an acceptable programming language for a toolkit, a Py-wrapper application is not being considered at this time. Likewise, SOAP (Simple Object Access Protocol) is not being considered, because it is more complex than is needed and would significantly slow data exchange within the system.
Because the results of the needs assessment survey indicated that a complex solution would not be met with wide acceptance and would be too expensive for current funding levels, the GISIN system operates as a simple HTTP Request/Response protocol. This method is used to serve web pages on the Internet and ensures the best access through firewalls without security problems. This approach also provides the required flexibility with high performance.
The GISIN protocol is a subset of the functionality defined by TAPIR (TDWG Access Protocol for Information Retrieval). Only simple Key-Value Pair (KVP) requests are supported because complex filters encoded as XML (Extensible Markup Language) were not required.
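To illustrate what a Key-Value Pair request over plain HTTP amounts to, the following sketch encodes parameters into the query string of a GET URL and decodes them again on the provider side. The endpoint URL and the parameter names (`op`, `scientificname`, `limit`) are hypothetical and are not taken from the GISIN or TAPIR specifications; only the Python standard library is used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical provider endpoint, for illustration only.
BASE_URL = "http://provider.example.org/gisin"


def build_kvp_request(base_url: str, params: dict) -> str:
    """Encode parameters as key-value pairs in the query string of a GET URL."""
    # Sorting keeps the generated URL stable regardless of dict ordering.
    return base_url + "?" + urlencode(sorted(params.items()))


def parse_kvp_request(url: str) -> dict:
    """Recover the key-value pairs from a request URL (provider side)."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}
```

A client would call, for example, `build_kvp_request(BASE_URL, {"op": "search", "scientificname": "Rattus rattus", "limit": 10})` and issue an ordinary HTTP GET for the resulting URL. Because the whole request is a URL, no XML filter document has to be built or parsed.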
Respondents to the needs assessment survey listed ASP, JSP, and PHP (in that order) as acceptable internet frameworks for a toolkit. Therefore a GISIN data providers’ workshop is being planned for 13-16 November with programmers of these three frameworks as instructors. Although 80% of the respondents preferred receiving a software toolkit to install and configure on their server to become a GISIN data provider, at the November meeting programmers and database managers will create their own code to map each of their unique database systems to the GISIN protocol.
Special thanks to Jeremy Kranowitz, who donated his time to configuring, running, and analyzing the survey, and to his organization, The Keystone Center.
Support is acknowledged from: US National Biological Information Infrastructure; GBIF; IUCN-Invasive Species Specialist Group; US National Institute of Invasive Species Science; The Keystone Center
1) data providers: organizations and persons that will provide data;
2) data consumers: intermediary organizations and persons that will use the system's primary data for modeling and other analyses, and then make these value-added products available back through the system;
3) stakeholders: those who support the system without necessarily providing or consuming data; and
4) end users: those who use the system’s data and/or analyses, but do not provide products back through the system.
The results of a needs assessment survey to obtain user requirements, which ran from 15 December 2006 through 15 February 2007, had both surprising and expected elements.
With 137 respondents from 41 countries, 80% identify themselves as providers and consumers of invasive species data. As expected, most (77%) offer invasive species spatial/temporal information, profiles/species pages (65%), and checklist information (59%). Although most are data providers, their technical knowledge is surprisingly low: 80% said they do not know what existing protocols are appropriate for invasive species information management; 45% do not know the level of web services their organization provides and/or uses; 75% did not know what schemas/grammars would be acceptable to copy or extend for the GISIN data exchange system. A complete report of survey results is available at http://www.gisinetwork.org/Survey/SurveyResultsFinal.htm.
From the results of this survey, it was determined that standards for the GISIN system will need to be both simple to implement and easy to understand, if the system is to be a success. Because only 23% of respondents said Python is an acceptable programming language for a toolkit, a Py-wrapper application is not being considered at this time. Likewise, SOAP (Service Oriented Architecture Protocol) is not being considered, because it is more complex than is needed and would significantly slow data exchange within the system.
Because the results of the needs assessment survey indicated that a complex solution would not be met with wide acceptance and would be too expensive for current funding levels, the GISIN system operates as a simple HTTP Request/Response protocol. This method is used to serve web pages on the Internet and ensures the best access through firewalls without security problems. This approach also provides the required flexibility with high performance.
The GISIN protocol is a subset of the functionality defined by TAPIR (TDWG Access Protocol for Information Retrieval). Only simple Key-Value Pair (KVP) requests are supported because complex filters encoded as XML (Extensible Markup Language) were not required.
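As a rough illustration of what a Key-Value Pair request over plain HTTP looks like, the following Python sketch builds and parses a GET URL. The endpoint, the parameter names (`op`, `ScientificName`) and the example species are hypothetical placeholders chosen for illustration; they are not taken from the GISIN specification.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint; a real provider would publish its own URL.
BASE_URL = "http://example.org/gisin/provider"


def build_kvp_request(base_url, params):
    """Return a GET URL with the parameters encoded as key-value pairs."""
    return f"{base_url}?{urlencode(params)}"


def parse_kvp_request(url):
    """Recover the key-value pairs from a request URL (one value per key)."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


# Parameter names below are assumptions, not part of any published protocol.
url = build_kvp_request(BASE_URL, {"op": "search", "ScientificName": "Lantana camara"})
print(url)
print(parse_kvp_request(url))
```

Because the whole request fits in a URL, it can be issued from a browser or any HTTP client, which is part of the appeal of the KVP approach over XML-encoded filters.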
Respondents to the needs assessment survey listed ASP, JSP, and PHP (in that order) as acceptable internet frameworks for a toolkit. Therefore a GISIN data providers’ workshop is being planned for 13-16 November with programmers of these three frameworks as instructors. Although 80% of the respondents preferred receiving a software toolkit to install and configure on their server to become a GISIN data provider, at the November meeting programmers and database managers will create their own code to map each of their unique database systems to the GISIN protocol.
Special thanks to Jeremy Kranowitz, who donated his time to configuring, running, and analyzing the survey, and to his organization, The Keystone Center.
Support is acknowledged from: US National Biological Information Infrastructure; GBIF; IUCN-Invasive Species Specialist Group; US National Institute of Invasive Species Science; The Keystone Center
2.4. When Taxonomies Meet Observations: An Examination of Taxonomic Concepts used by the Observation Systems eBird and the Avian Knowledge Network
Paul Edward Allen
Cornell Lab of Ornithology
Ideally, observations of organisms are identified by the observer with a taxonomic concept, consisting of the taxonomic name and the reference defining that name. However, systems that manage observational data must be able to accommodate imprecision or uncertainty in concepts since observers are not always able to classify an organism as a single, well-established species (or subspecies) taxonomic concept. There are several instances in which indefinite concepts are required. First, an observer may identify an organism as a hybrid of two species. Second, imperfect observation conditions (e.g., limited visibility), limited experience, or other factors might limit an observer to classifying an organism only as a member of some subset of concepts, where the subset has meaning to field observers, but may not be circumscribed by an academically established taxonomic concept. Finally, similar factors might lead an observer to identify an organism only to a genus or higher taxonomic rank. The first two cases may lead managers of observation systems to informally become taxonomists, since they must create concepts to accommodate the observations they hold and which do not fall into a well-established taxonomic concept. This presentation shows how uncertainty and imprecision in taxonomic identity are handled by the Bird Monitoring Data Exchange standard used by the Avian Knowledge Network (AKN, www.avianknowledge.net) and the TDWG Taxonomic Concept Transfer Schema standard.
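The three kinds of indefinite concept described above can be sketched as a small data model. This is only an illustrative Python sketch: the class and field names are invented here and do not reflect the Bird Monitoring Data Exchange standard or the Taxonomic Concept Transfer Schema, and the example birds are chosen for illustration rather than drawn from eBird or AKN data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ConceptKind(Enum):
    SPECIES = "species"          # a single well-established concept
    HYBRID = "hybrid"            # a cross between two species
    SUBSET = "subset"            # one of several candidate concepts
    HIGHER_RANK = "higher_rank"  # identified only to genus or above


@dataclass(frozen=True)
class Identification:
    kind: ConceptKind
    names: Tuple[str, ...]  # the name(s) the observer could commit to

    def is_indefinite(self) -> bool:
        return self.kind is not ConceptKind.SPECIES


def indefinite_share(observations):
    """Fraction of observations labelled with an indefinite concept."""
    if not observations:
        return 0.0
    return sum(o.is_indefinite() for o in observations) / len(observations)


# Illustrative examples only.
sample = [
    Identification(ConceptKind.SPECIES, ("Anas platyrhynchos",)),
    Identification(ConceptKind.HYBRID, ("Anas platyrhynchos", "Anas rubripes")),
    Identification(ConceptKind.SUBSET, ("Aythya marila", "Aythya affinis")),
    Identification(ConceptKind.HIGHER_RANK, ("Buteo",)),
]
print(indefinite_share(sample))
```

Keeping the kind of uncertainty explicit, rather than encoding it in a free-text name, is one way a system could report separately on concepts and on the observations that use them.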
Analysis of 29 million avian observation records from eBird (www.ebird.org) and the Avian Knowledge Network shows that uncertain and imprecise taxonomic concepts represent 5% (eBird) and 18% (AKN) of the concepts in these systems. However, observations labeled with uncertain or imprecise concepts represent only 0.05% (eBird) and 1% (AKN) of the observations held in those systems.
2.5. Taxonomists at work: relationships of process and data
Anna Weitzman1, Christopher Lyal2
1 Smithsonian Institution, 2 Natural History Museum, London
Taxonomy has developed in practice over hundreds (or thousands) of years. Humans have always been interested in the world around them and using names to communicate what they know about the organisms that they see. From simple beginnings lost in the origins of human culture, this process has developed into taxonomy as we know it. Though it has become formalized, it is still mainly about learning about the organisms that we share the planet with and using names to communicate about them.
In order to do this, we have developed systems of nomenclature for applying names to organisms; collections of preserved organisms which serve to help us understand, document, and apply names to what we observe to be taxa (taxon concepts); and ways to document the information in publications. After 300+ years of generating these systems and collections, there is a vast body of existing knowledge that is used routinely in current taxonomic work. Additional sources of data have been added recently and been incorporated into workflow.
Understanding the information flow between different data and information sources as employed by taxonomists and others is important to model how interoperable data systems should connect. The results of an analysis of the data flow and working practices can be depicted in the following diagram. Standards and schemas employed for the different elements are identified. The diagram also indicates where interoperability between particular schemas must be developed.
We will present and explain the diagram, especially as it relates to the user needs presented at the opening of TDWG 2007. At the close of TDWG 2007, we will present it again in the context of the entire meeting’s discussions and presentations, with any amendments that have been shown to be needed.
Support is acknowledged from: Atherton Seidell Fund of the Smithsonian Institution
3. Needed Technologies: Introductions and Demos
3.1. TDWG Standards Architecture - What and Why
Roger Hyam
TDWG Infrastructure Project
In 2005, the TDWG Infrastructure Project (TIP) was given the remit of devising an umbrella architecture for TDWG standards. A meeting (TAG1) in April 2006 led to the establishment of the basic principles underlying the standards architecture. The TIP has been promoting adoption of this common architecture over the last 18 months. But why have a standards architecture at all?
There is no need for a standards architecture when exchanging data within a federation of similar applications such as natural history collections. The federation is a closed system where a single exchange format can be agreed on. The federation can grow by adding new members whose needs are met by the format. This model has worked well in the past but it does not meet the primary use case that is emerging. Biodiversity research is typically carried out by combining data of different kinds from multiple sources. The providers of data do not know who will use their data or how it will be combined with data from other sources. The consumer needs some level of commonality across all the data received so that it can be combined for analysis without the need to write computer software for every new combination. This commonality needs to extend seamlessly to new types of data as they are made available. An architecture is required to provide this commonality.
What form should the architecture take? A degree of commonality could be achieved simply by specifying how the data should be serialised. If all suppliers passed data as well-formed XML, for example, it would provide a degree of interoperability. Clients would, however, still not know how the elements within one XML document relate to those in another, or how the items described in those documents relate. At the other extreme, the architecture could provide a detailed data type library which describes the way in which each kind of data should be serialised at a fine level of granularity. In other words, it would specify which XML elements must be present and what they should contain. It is, however, unlikely that a single set of serialisations would meet all needs any more than a single federation schema would. Some thematic networks require that they have well defined data types to ensure that the data passed is valid and fit for purpose.
The architecture has to meet two needs. It has to allow generic interoperability but also restricted validation of data for some networks. It does this using three interlinked components. 1) An ontology is used to express the shared semantics of the data but not to define the validity of those data. Concepts within the ontology are represented as URIs (Uniform Resource Identifiers). 2) Exchange protocols use formats defined in XML Schemas (or other technologies) that exploit the URIs from the ontology concepts. 3) Objects about which data are exchanged are identified using Globally Unique Identifiers. This means that, although exchanges between data producers and clients may make use of different XML formats, the items the data are about and the meaning of the data elements are common across all formats.
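The interplay of the three components can be sketched in a few lines of Python. Everything named here is a hypothetical stand-in: the concept URIs, the two exchange formats, their element names and the identifier are invented for illustration and are not part of the TDWG ontology or any TDWG schema.

```python
# Hypothetical concept URIs standing in for terms from a shared ontology.
SCIENTIFIC_NAME = "http://example.org/ontology#scientificName"
COLLECTOR = "http://example.org/ontology#collector"

# Two hypothetical exchange formats; each maps its own element names
# onto the shared concept URIs.
FORMAT_A = {"SciName": SCIENTIFIC_NAME, "Coll": COLLECTOR}
FORMAT_B = {"taxon": SCIENTIFIC_NAME}


def normalise(record, mapping, guid_field):
    """Re-key a record by concept URI and return it with its GUID."""
    guid = record[guid_field]
    data = {mapping[key]: value for key, value in record.items() if key in mapping}
    return guid, data


def merge(*normalised):
    """Combine normalised records that describe the same GUID."""
    merged = {}
    for guid, data in normalised:
        merged.setdefault(guid, {}).update(data)
    return merged


GUID = "urn:lsid:example.org:specimens:1"  # hypothetical identifier
a = normalise({"id": GUID, "SciName": "Puma concolor", "Coll": "A. Collector"}, FORMAT_A, "id")
b = normalise({"guid": GUID, "taxon": "Puma concolor", "note": "ignored"}, FORMAT_B, "guid")
print(merge(a, b))
```

The client code never needs to know the element names of either format; it relies only on the concept URIs and the shared identifier, which is the commonality the architecture aims to provide.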
Support is acknowledged from: The Gordon and Betty Moore Foundation
3.2. Life Sciences Identifiers (LSID) and the Biodiversity Information Standards (TDWG)
Ricardo Scachetti Pereira
TDWG Infrastructure Project
Over the last few decades, the biodiversity information community has made primary data available for environmental analyses and decision making. Information on a million scientific names is now available through data providers such as the Integrated Taxonomic Information System (ITIS), Species 2000 and the Catalogue of Life (CoL). Almost one hundred million specimen records are provided by herbaria and natural history museums around the world.
To use these data more effectively, clients need mechanisms to: a) refer to authoritative information resources, b) facilitate data integration and c) detect duplicates of the same resource. To achieve these goals, a system of globally unique identifiers (GUIDs) is needed.
The TDWG Infrastructure Project (TIP) established a TDWG Globally Unique Identifiers Task Group (TDWG-GUID) to provide recommendations for use of GUIDs in our domain. The GUID members concluded that the Life Sciences Identifiers (LSIDs) were the most appropriate technology to address current problems.
LSIDs are unique, persistent, location-independent, resource identifiers for biologically significant resources such as species names, concepts, occurrences, genes or proteins. LSIDs identify and locate biological objects via the web and overcome limitations of current naming schemes.
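An LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>`, optionally followed by `:<revision>`. The following Python sketch parses that form and uses the parsed value to detect duplicate references to the same resource. The identifiers are hypothetical examples, and treating the `urn:lsid:` prefix and the authority as case-insensitive is an assumption of this sketch that should be checked against the LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # Assumption: the authority is compared case-insensitively.
    return LSID(parts[2].lower(), parts[3], parts[4], revision)


def distinct_resources(identifiers):
    """Return the set of distinct resources named by a list of LSID strings."""
    return {parse_lsid(identifier) for identifier in identifiers}


# Hypothetical identifiers, for illustration only.
ids = [
    "urn:lsid:example.org:names:12345",
    "URN:LSID:Example.org:names:12345",
    "urn:lsid:example.org:names:67890",
]
print(len(distinct_resources(ids)))
```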
I will provide an overview of Life Science Identifiers and how they solve current problems. I will report on the work performed by the GUID group over the last two years and provide recommendations and a plan on the use of LSIDs in the biodiversity information domain.
Support is acknowledged from: The Gordon and Betty Moore Foundation
3.3. Nala: A Semantic Data Capture Extension for Mozilla Firefox
Ben Szekely1, Ricardo Scachetti Pereira2
1 Cambridge Semantics Inc., 2 TDWG Infrastructure Project
Collecting and integrating biodiversity informatics data from diverse websites and transforming these data into the formats accepted by the analysis tools takes considerable resources.
Semantic Web tools such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) make it easier for computers to interpret the meaning of data items. Life Sciences Identifiers (LSIDs) are another Semantic Web product that allows information resources to be uniquely named and easily located.
Nala is a Semantic Web data capture tool that we have developed to demonstrate how Semantic Web technologies, in particular, RDF, OWL and LSIDs, may be used to improve the process of data capture and integration.
Nala is a Mozilla Firefox web browser extension, similar to Piggy Bank, which allows users to capture and integrate data while browsing the Web. Nala looks for data that may be acquired and transformed into RDF from web pages that are browsed. When such data are detected, the user is given the option to acquire them, transform them into RDF, and store them in a repository called an RDF triple store. Data in the repository may then be integrated using OWL vocabularies such as Dublin Core or the TDWG Ontology and LSID Vocabularies and exported in CSV and MS Excel formats.
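To make the capture-and-export step concrete, here is a minimal Python sketch of the idea, not of Nala itself: captured data are held as subject-predicate-object triples and flattened into CSV with one row per subject. The subjects and values are hypothetical; only the Dublin Core element URIs are real. An in-memory list stands in for a triple store.

```python
import csv
import io
from collections import defaultdict

# Dublin Core element URIs used as predicates.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

# Hypothetical captured data: (subject, predicate, object) triples.
TRIPLES = [
    ("urn:lsid:example.org:pages:1", DC_TITLE, "Species page A"),
    ("urn:lsid:example.org:pages:1", DC_CREATOR, "A. Author"),
    ("urn:lsid:example.org:pages:2", DC_TITLE, "Species page B"),
]


def triples_to_csv(triples, predicates):
    """Flatten triples into CSV text: one row per subject, one column per predicate."""
    rows = defaultdict(dict)
    for subject, predicate, obj in triples:
        rows[subject][predicate] = obj
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject"] + list(predicates))
    for subject in sorted(rows):
        # Missing values are written as empty cells.
        writer.writerow([subject] + [rows[subject].get(p, "") for p in predicates])
    return buffer.getvalue()


print(triples_to_csv(TRIPLES, [DC_TITLE, DC_CREATOR]))
```

Because data from different pages share the same predicate URIs, they line up in the same columns without any per-site mapping code at export time.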
Support is acknowledged from: TDWG Infrastructure Project
3.4. Key Enabling Technologies: Transfer Protocols
Donald Hobern
Global Biodiversity Information Facility
The new TDWG data architecture relies on three core abilities:
- Constructing data objects representing objects and concepts in biodiversity informatics. This is the purpose of the TDWG data standards.
- Referring reliably to data objects. This is why TDWG has adopted Life Science Identifiers (LSIDs) as a globally unique identifier technology.
- Discovering and accessing data objects. This is why TDWG develops its own data access protocols and explores other protocol standards.
TDWG's work in this area has led to the family of protocols beginning with DiGIR and BioCASe and leading to TAPIR (the TDWG Access Protocol for Information Retrieval) today.
The DiGIR protocol has been used extensively by a range of major projects to support exchange of specimen and observation data using Darwin Core. DiGIR provides a flexible XML language for making remote search requests against a web-connected database. More importantly DiGIR provides a tool for organisations to map their databases into a common set of concepts such as Darwin Core.
BioCASe introduced support for records with a significant nested structure such as the ABCD schema. BioCASe simplified the use of the protocol with external data models developed without knowledge of DiGIR or BioCASe.
The TAPIR protocol learns from DiGIR and BioCASe and adds new features of its own. Two implementations of the protocol are currently available, pyWrapper (written in Python) and TapirLink (written in PHP).
To use a protocol such as TAPIR, a data administrator maps a local database to a set of concepts recognised by the community (e.g., ScientificName, Locality and CatalogNumber are Darwin Core concepts recognised by a wide range of projects). TAPIR software then offers the following operations:
- Metadata – retrieve descriptive information about a dataset;
- Capabilities – retrieve the technical capabilities of the TAPIR server and the concepts mapped by the data administrator;
- Ping – check that the TAPIR server is active;
- Inventory – retrieve a list of distinct values within the dataset for one or more concepts, with counts of matching records; and
- Search – retrieve records matching a set of filter conditions.
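The five operations listed above can be illustrated with a toy in-memory provider. This Python sketch shows only the kind of result each operation yields; it is not a TAPIR implementation, and the class, method names, return shapes and sample records are all invented for illustration.

```python
from collections import Counter

# Hypothetical records already mapped to Darwin Core concept names.
RECORDS = [
    {"ScientificName": "Puma concolor", "Locality": "Site A", "CatalogNumber": "1"},
    {"ScientificName": "Lynx rufus", "Locality": "Site A", "CatalogNumber": "2"},
    {"ScientificName": "Puma concolor", "Locality": "Site B", "CatalogNumber": "3"},
]


class ToyProvider:
    """In-memory stand-in for a provider offering TAPIR-like operations."""

    def __init__(self, records, title):
        self.records = records
        self.title = title

    def ping(self):
        # Check that the service is active.
        return True

    def metadata(self):
        # Descriptive information about the dataset.
        return {"title": self.title, "record_count": len(self.records)}

    def capabilities(self):
        # Supported operations and the concepts that have been mapped.
        concepts = sorted({key for record in self.records for key in record})
        operations = ["metadata", "capabilities", "ping", "inventory", "search"]
        return {"operations": operations, "concepts": concepts}

    def inventory(self, concept):
        # Distinct values for a concept, with counts of matching records.
        return dict(Counter(r[concept] for r in self.records if concept in r))

    def search(self, **filters):
        # Records matching every filter condition (equality only).
        return [r for r in self.records
                if all(r.get(key) == value for key, value in filters.items())]


provider = ToyProvider(RECORDS, "Demo dataset")
print(provider.inventory("ScientificName"))
print(provider.search(ScientificName="Puma concolor"))
```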
TAPIR can handle requests encoded as XML documents or as a set of parameters supplied within a URL. TAPIR supports common request and response templates to format results for different tools. For example, TAPIR can issue requests based on Darwin Core concepts and receive results as a Google Earth KML document or an RSS feed. Installing TAPIR software may therefore be an efficient way to expose data for a range of other client tools.
TDWG’s re-engineering of its data standards as reusable vocabularies enables the use of the same terms and definitions in different contexts. TDWG could use its own standards with many general purpose data access protocols. Examples include:
- OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) – standard access to metadata for a wide range of online resources.
- WFS (Open GIS Web Feature Service) – a standard that could be used to map locations of species occurrences.
- SPARQL Query Language for RDF – a standard allowing complex queries across different data sets.
4. LSIDs: Gluing it together to meet users' needs
4.1. LSIDs for Taxon Names: The ZooBank Experience
Richard Pyle
Bishop Museum
The International Commission on Zoological Nomenclature (ICZN) has, for the past 112 years, set the rules by which scientific names for animals are established, as described in the ICZN Code of Nomenclature. In 2005, the ICZN Secretariat and Commissioners announced “ZooBank”, a proposed registry of zoological names and nomenclatural acts. The intention of ZooBank is to serve as a mechanism for making information about new and historical scientific animal names more available and accessible than by traditional means of information dissemination through paper-based publications. The complete implementation details of the ZooBank registry are currently being discussed, developed, and tested. The first step of the implementation process involves the creation of a prototype web site that will eventually mature into the full-blown ZooBank registration service.
The Bishop Museum in Honolulu has agreed to host the initial implementation of the ZooBank prototype. With financial support from TDWG/GBIF through ICZN, in partnership with Landcare Research (New Zealand), Bishop was able to establish a functioning resolver for Life Science Identifiers (LSIDs) and a content provider following the TDWG Access Protocol for Information Retrieval (TAPIR). LSIDs were assigned to a sample dataset of verified taxon names and literature citations from the Catalog of Fishes database. An LSID resolver service was set up on a Windows/IIS server using VB.NET code developed by Kevin Richards of Landcare Research. A TAPIR provider service was also implemented to return metadata associated with these LSIDs. LSIDs assigned to taxon names return metadata in accordance with the TDWG Taxon Name LSID Ontology, and LSIDs assigned to publication citations return metadata in accordance with the TDWG Publication Citation LSID Ontology.
A discussion of the implementation of these services, including alternate strategies for defining “taxon name objects” and associated implications, and the role of nomenclators for providing taxonomic services, will be provided.
Support is acknowledged from: TDWG Infrastructure Project; Pacific Basin Information Node (PBIN) of the U.S. National Biological Information Infrastructure (NBII); Landcare Research (New Zealand); Bishop Museum, Honolulu (BPBM)
4.2. LSID and TCS deployment in the Catalogue of Life
Richard John White, Andrew C Jones, Ewen R Orme
Cardiff University
This paper describes a project to add support for Life Sciences Identifiers (LSIDs) and the Taxonomic Concept Transfer Schema (TCS) to the Annual and Dynamic Checklists assembled and delivered by the Catalogue of Life (CoL) partners, Species 2000 and ITIS. We plan to improve the compatibility of the protocols and public software interfaces used by Species 2000 with TDWG standards. We wish to increase the usefulness of the CoL to users, including GBIF, by improving the CoL’s compatibility with other biodiversity tools, by supplying its information to clients expressed as taxon concepts, and by enhancing interoperability between data providers and consumers by means of LSIDs referring to these concepts. It is hoped this will increase the use of TDWG standards, accelerate LSID deployment and the uptake of TCS, assist providers and users to ascribe data unambiguously to specified taxon concepts, and speed the growth of shared biodiversity data resources.
At Cardiff University we are investigating approaches for adding LSID and TCS support to the CoL and implementing them in evaluation versions of its systems. We have implemented a new prototype of the Annual Checklist which issues LSIDs for taxon concepts and established a resolution service to support the use of these LSIDs by giving provisional RDF/TCS responses generated from the Annual Checklist.
We are developing modified Spice protocols and a new Spice software prototype to provide LSIDs and TCS data in response to Web Service requests and to receive any name or taxon concept LSIDs from data providers. A new version of one of the data providers is being implemented for this purpose. We will develop a validation tool to check that the data and responses are valid, correctly structured and internally consistent. We plan to complete the project by the end of December 2007.
The Species 2000 Secretariat in Reading is assisting in this project. Its responsibilities are to survey the needs, capabilities and preferences of data providers and users in the light of these demonstration systems; to deploy the enhanced Spice software in the CoL global and European regional hubs; to use the validation tool and other means to perform testing and quality assurance of the data served; and to assist the CoL partners to agree a plan for the introduction of LSIDs.
The updated Spice protocol, documentation and enhanced Spice software will be available for use by other projects to build species information systems for their own purposes and to create regional hubs which can be linked to the CoL, both to enhance its usefulness in those regions and to help set up new global data providers.
Planning and carrying out this project has raised a number of interesting questions, some to be resolved during the project, others for wider consideration and future research. They include the choice of which kinds of entity will be identified by LSIDs (including names as well as taxon concepts), how users (human or software) will obtain LSIDs for entities of interest, how any GUIDs (not necessarily LSIDs) that data providers supply will be propagated through the CoL, users’ expectations concerning tasks that LSIDs might assist, including navigating the taxonomic hierarchy and linking data to taxa, and the role of CoL LSIDs in building the biodiversity information systems of the future.
Further information about this project and its progress, updated periodically, is at http://spice.cs.cf.ac.uk/lsid/
Support is acknowledged from: TDWG Infrastructure Project
At Cardiff University we are investigating approaches for adding LSID and TCS support to the CoL and implementing them in evaluation versions of its systems. We have implemented a new prototype of the Annual Checklist which issues LSIDs for taxon concepts and established a resolution service to support the use of these LSIDs by giving provisional RDF/TCS responses generated from the Annual Checklist.
We are developing modified Spice protocols and a new Spice software prototype to provide LSIDs and TCS data in response to Web Service requests and to receive any name or taxon concept LSIDs from data providers. A new version of one of the data providers is being implemented for this purpose. We will develop a validation tool to check that the data and responses are valid, correctly structured and internally consistent. We plan to complete the project by the end of December 2007.
The Species 2000 Secretariat in Reading is assisting in this project. Its responsibilities are to survey the needs, capabilities and preferences of data providers and users in the light of these demonstration systems; to deploy the enhanced Spice software in the CoL global and European regional hubs; to use the validation tool and other means to perform testing and quality assurance of the data served; and to assist the CoL partners to agree a plan for the introduction of LSIDs.
4.3. An LSID authority for specimens and an LSID browsing client
Kevin James Richards
Landcare Research
The requirements and use-cases for globally unique identifiers (GUIDs) have been developed by the TDWG community over the last 18 months. Use-cases include:
• unique and persistent identification of taxon name and specimen data;
• linking specific specimen records to accepted taxonomic names; and
• detection of duplicate records.
After careful investigation of several identification schemes, the TDWG Globally Unique Identifiers group (TDWG GUID) endorsed the use of Life Science IDentifiers (LSIDs) for use in biodiversity information applications.
LSID resolvers are Internet services that return the data and metadata associated with an LSID to a requester. Resolvers have been set up mainly to process taxonomic name data, while other important data types such as specimens have been neglected. It is therefore important to examine and test the use of LSIDs and related technologies within the context of specimen data.
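As a minimal illustration of what a resolver receives, the Python sketch below splits an LSID into its parts, following the general form urn:lsid:authority:namespace:object with an optional trailing revision. The function name, the example.org authority and the specimen identifier are hypothetical and are not taken from the Herb IMI resolver.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    # The 'urn' and 'lsid' prefixes are compared case-insensitively.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # An absent or empty sixth component means no revision was given.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical specimen identifier, for illustration only.
example = parse_lsid("urn:lsid:example.org:specimens:12345")
```

A resolver would typically use the authority to locate the responsible service and the namespace and object parts to look up the record.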
‘Herb IMI’ is a collection of over 300,000 fungal specimens from the International Mycological Institute (IMI). The Herb IMI database contains fungus/host identification data, and these records also have corresponding LSIDs for the fungal names in Index Fungorum and the plant names in the International Plant Names Index (IPNI). Both of these global nomenclators have taxonomic name LSID resolvers in place. This combination made Herb IMI a candidate for testing LSIDs and related technologies.
We developed an LSID resolver for the Herb IMI collection and a tool for demonstrating LSID-related technologies such as the Resource Description Framework (RDF). RDF is a language in which entities are modelled with subject-predicate-object statements known as “triples”. Associated protocols enable a user to query sets of these triples and to infer relationships between entities. The tool that we have developed works with the specimen LSID resolver and TDWG’s LSID vocabularies to display, browse, store and query the RDF associated with LSIDs.
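The idea of querying a set of triples can be sketched in a few lines of Python. This is an illustration only, not the browsing tool described here: the LSIDs and the ex: predicate names are hypothetical, and a real client would use an RDF library and the TDWG vocabularies.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical triples describing one specimen record.
TRIPLES = [
    ("urn:lsid:example.org:specimens:12345", "ex:identifiedAs", "urn:lsid:example.org:fungi:678"),
    ("urn:lsid:example.org:specimens:12345", "ex:hostName", "urn:lsid:example.org:plants:91"),
    ("urn:lsid:example.org:specimens:12345", "ex:catalogNumber", "IMI 12345"),
]


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that agree with every component that is not None."""
    for triple in triples:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


# Which name was the specimen identified as?
names = [obj for _, _, obj in match(TRIPLES, p="ex:identifiedAs")]
```

Leaving a component as None acts as a wildcard, which is the basic pattern behind RDF query languages.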
This talk presents the processes, problems and outcomes of implementing LSIDs and RDF with the Herb IMI specimen database. The browsing tool developed for this project will also be demonstrated.
Support is acknowledged from: TDWG Infrastructure Project, Landcare Research
4.4. LSID policy and implementation in Australia
Greg Whitbread, Alex R. Chapman, Ben Richardson
Australian National Botanic Gardens
In April 2007 a two-day workshop of representatives of Australian museums and herbaria was held in Canberra, with TDWG assistance, to develop recommendations for a policy on the adoption of Life Science Identifiers (LSIDs) within the Australasian biodiversity federation. This meeting established the business case for LSIDs, together with guidelines and a roadmap for LSID implementation by and for local data providers and biodiversity informatics networks (http://www.tdwg.org/fileadmin/subgroups/guid/LSID_policy_workshop_Report_Canberra.pdf).
The workshop generated recommendations for the delegation of responsibility for allocation, persistence and resolution of LSIDs within the Australian biodiversity federation and drafted a work plan for our implementation of LSID technology.
Progress against these recommendations, however, has not been good. Resources are limited and the integration of LSID technology into an existing biodiversity information network is not without issues. Several elements of our LSID implementation plan require more careful consideration: ambiguity within the classes of information identified for LSID assignment; the role of LSIDs in version control and the discovery of duplication; resolution; and metadata standards and access to data in appropriate formats. A more detailed specification, beyond best practice, for the form and function of LSIDs within the biodiversity informatics context is still required.
Support is acknowledged from: TDWG Infrastructure Project, Australian National Botanic Gardens, Council of Heads of Australian Herbaria (CHAH)
4.5. LSID Mashup
Daniel Miranker
University of Texas at Austin
Morphster* is a productivity tool for annotating specimen images and organizing the features into character state matrices suitable for phylogenetic reconstruction. Central to the architecture is distributed data integration, in which the data are tagged with globally unique identifiers (GUIDs), usually Life Science Identifiers (LSIDs). Source data for Morphster include certain image databases, records from the uBio Taxonomic Name Server and Nomina Anatomica in the form of OBO ontologies. Each of these data sources associates a GUID with each record.
Persistent data records created by Morphster are tagged with LSIDs and made available per the protocol. For example, character definitions, character state definitions and the assignment of states to specimens are all separate records that may need to be archived and/or reused and are made uniquely identifiable. These records themselves reference the source images, the taxon and, usually, a field from the Nomina Anatomica. Thus, when resolving a Morphster LSID, the data returned will include a number of additional LSIDs. It is anticipated that Treebase II will store LSIDs in addition to encoded character states. The result will be a distributed data structure enabling on-line access to the complete provenance of a morphological phylogenetic study.
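The way such linked records yield a provenance trail can be sketched as a graph traversal. The Python example below is a hedged illustration and not Morphster code: the RECORDS table stands in for real LSID resolution, and every identifier in it is hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical records: each LSID maps to the LSIDs that its metadata references.
RECORDS: Dict[str, List[str]] = {
    "urn:lsid:example.org:state-assignment:1": [
        "urn:lsid:example.org:character-state:7",
        "urn:lsid:example.org:image:42",
    ],
    "urn:lsid:example.org:character-state:7": ["urn:lsid:example.org:character:3"],
    "urn:lsid:example.org:character:3": ["urn:lsid:example.org:anatomy:humerus"],
    "urn:lsid:example.org:image:42": [],
    "urn:lsid:example.org:anatomy:humerus": [],
}


def resolve_links(lsid: str) -> List[str]:
    """Stand-in for resolving an LSID and extracting the LSIDs in its metadata."""
    return RECORDS.get(lsid, [])


def provenance(start: str, resolve: Callable[[str], List[str]] = resolve_links) -> List[str]:
    """Breadth-first walk over referenced LSIDs, visiting each one once."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for linked in resolve(current):
            if linked not in seen:
                seen.add(linked)
                order.append(linked)
                queue.append(linked)
    return order


trail = provenance("urn:lsid:example.org:state-assignment:1")
```

Tracking visited identifiers keeps the walk finite even if records reference one another in a cycle.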
*The project (see http://www.morphster.org) is a collaboration with Julian Humphries and Timothy Rowe, Jackson School of Geology, University of Texas at Austin.
Support is acknowledged from: NSF, IIS:0531767
5. Enabling Technologies: Protocols
5.1. TapirLink: Facilitating the transition to TAPIR
Renato De Giovanni
TapirLink is a free and open source data provider tool which implements the TAPIR protocol. It is based on the earlier DiGIR PHP provider, which is used by many institutions around the world to serve a total of more than 100 million specimen records. TapirLink has been designed to be as simple to use as the DiGIR PHP provider and to enable rapid and seamless migration of existing DiGIR providers to the TAPIR protocol.
TapirLink is a general-purpose tool and can be used to serve other classes of data as well as biological collections data. It supports most of the advanced features of the TAPIR protocol, including all TAPIR operations, searches using complex filters and flexible output models (for example KML, RSS2, DarwinCore 1.4 application schema, ABCD 2.06 and TDWG Ontology RDF).
Additional features include the ability to import configuration files from the DiGIR PHP provider, a user interface for UDDI registration, a configurable LSID resolver and the option to associate XSLT stylesheets with the XML responses to present the data in a human-readable form in Web browsers.
TapirLink allows data providers to participate in TAPIR networks or simply to offer a Web Service interface to their data. This presentation will describe the TapirLink software, showing the installation requirements, configuration details and main features of the tool.
Support is acknowledged from: TDWG Infrastructure Project
5.2. RDF over TAPIR
Roger Hyam
TDWG Infrastructure Project
The TDWG standards architecture relies on the melding together of two technologies that are often thought to be antagonistic: the Resource Description Framework (RDF) and XML Schema.
RDF is based on a modelling language that describes everything in terms of subject-predicate-object statements (known as triples). This may be familiar from formal logic. RDF can be serialised in many ways. One of those ways is as XML.
XML Schema is a language for defining XML document structures. It is possible to define an XML document structure using XML Schema so that the resulting documents are valid serialisations of RDF.
TAPIR is a data exchange protocol designed to pass XML messages. The output from a TAPIR data provider is described using XML Schema. TAPIR knows nothing about RDF but by using XML Schemas that define RDF instance documents it is possible for a TAPIR data provider to behave as an RDF data source. This is demonstrated using the TapirLink provider software.
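To make the idea of RDF serialised as XML concrete, the Python sketch below writes a few triples about one subject as an RDF/XML document using only the standard library. It is an illustration under stated assumptions: the ex: namespace, the property names and the LSIDs are hypothetical and are not the TDWG ontology or actual TapirLink output.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/terms#"  # hypothetical vocabulary


def triples_to_rdfxml(subject: str, properties: Iterable[Tuple[str, str, bool]]) -> str:
    """Serialise (predicate, value, is_resource) tuples about one subject as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("ex", EX_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject}
    )
    for predicate, value, is_resource in properties:
        element = ET.SubElement(description, f"{{{EX_NS}}}{predicate}")
        if is_resource:
            # A link to another resource: attribute only, no element content.
            element.set(f"{{{RDF_NS}}}resource", value)
        else:
            # A literal value: element content only.
            element.text = value
    return ET.tostring(root, encoding="unicode")


document = triples_to_rdfxml(
    "urn:lsid:example.org:specimens:12345",
    [
        ("catalogNumber", "IMI 12345", False),
        ("identifiedAs", "urn:lsid:example.org:names:678", True),
    ],
)
```

Because the resulting document has a fixed element structure, an XML Schema can describe it, which is what allows a provider that knows only XML to emit valid RDF.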
One of the strengths of the TAPIR protocol is that it allows the definition of custom response types (output models). This can act as a mapping point between conceptual schemas. It should therefore be possible to map other TAPIR concepts into RDF that uses the TDWG ontology. This is demonstrated using data sources mapped to DarwinCore. It should also be possible to map any TAPIR data source to generic RDF.
These approaches have a number of limitations. Defining RDF instance data using XML Schema is not ideal because XML Schema cannot control the use of an element's attributes according to whether the element has content. It therefore cannot prevent the simultaneous occurrence of an rdf:resource attribute and embedded content, which would be illegal. This is largely overcome in the demonstrations because it is known a priori whether a value is a literal or a resource link. XML Schema is also awkward to use when there are many namespaces in the instance document. Current examples use around ten separate XML Schema documents, which could become a performance issue in the future and imposes an implementation burden on TAPIR wrapper software. The current examples make use of the TapirLink provider software, which does not implement complex internal data structures, only 'flat' tables. The PyWrapper TAPIR provider has been shown to support RDF in initial tests but has not been tested with the current examples.
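Since the schema cannot express the constraint on rdf:resource, a provider or client could check it after the fact. The following Python sketch, which is not part of TapirLink or PyWrapper, lists the elements that carry both an rdf:resource attribute and content. The sample documents and the ex: namespace are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"

HEADER = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:ex="http://example.org/terms#">'
    '<rdf:Description rdf:about="urn:lsid:example.org:specimens:12345">'
)
FOOTER = "</rdf:Description></rdf:RDF>"

# Hypothetical instance documents: one well-formed, one with the illegal combination.
GOOD = HEADER + '<ex:identifiedAs rdf:resource="urn:lsid:example.org:names:678"/>' + FOOTER
BAD = HEADER + '<ex:identifiedAs rdf:resource="urn:lsid:example.org:names:678">Agaricus</ex:identifiedAs>' + FOOTER


def conflicting_properties(rdf_xml: str) -> List[str]:
    """Return the tags of elements that have both rdf:resource and content."""
    conflicts = []
    for element in ET.fromstring(rdf_xml).iter():
        has_text = bool((element.text or "").strip())
        has_children = len(element) > 0
        if RDF_RESOURCE in element.attrib and (has_text or has_children):
            conflicts.append(element.tag)
    return conflicts
```

Calling conflicting_properties on each response before it is served would catch the case that the schema alone lets through.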
Support is acknowledged from: The Gordon and Betty Moore Foundation
5.3. TAPIR networks in Australia’s Virtual Herbarium and the Atlas of Living Australia
Greg Whitbread1, Shunde Zhang2, Paul Coddington2
1 Australian National Botanic Gardens, 2 University of Adelaide
The first, and currently the major, iteration of Australia's Virtual Herbarium (AVH) uses a very simple protocol designed for a single task, to assemble partial HISPID documents from a number of providers and display species occurrence on a map. It is web-based, easy to implement, fully distributed, and praised and lamented by the community it serves.
AVH2.0 will accommodate full data interchange between Australian Herbaria and enable development of products to meet increased local expectations and support provider participation in global markets for biodiversity information. The story is: a network based on TDWG standards.
Development of the AVH2.0 portal has been completed, using Java. The full AVH1.0 functionality has been enhanced to interrogate and deliver HISPID, ABCD and Darwin Core, and to offer full indexing of distributed BioCASE and AVH1.0 providers, interfaces supporting pluggable services, and instance replication. However, schema support for AVH functionality and full provider compliance is yet to be achieved. In practice, deployment has proven difficult with technical, financial and social issues all presenting barriers to successful implementation. TAPIR integration using TAPIRUS and PyWrapper may be a solution to these problems.
TAPIRUS, the TAPIR Unit Seeker, is a Java library providing a programming interface to the underlying XML protocols for entering queries and for parsing, aggregating and post-processing result sets. TAPIRUS replaces the BioCASE UnitLoader. It supports both the BioCASE and TAPIR protocols, simplifies indexing, supports protocol extension, and provides better performance and improved memory management.
The AVH is also a component of the Atlas of Living Australia (ALA). The ALA is a significant new initiative modelled on the AVH and related biodiversity informatics projects. It is designed to provide a national information infrastructure supporting biodiversity science and to establish a sustainable architecture for biodiversity informatics in Australia. Without the foundation of the pioneering work and standards of TDWG, the ALA would not be possible. The first breaths of the ALA will undoubtedly arise from a network of TAPIR providers and an open evolution will contribute to the future for TDWG process, protocols and standards.
Support is acknowledged from: Australian National Botanic Gardens, Centre for Plant Biodiversity Research, Council of Heads of Australian Herbaria (CHAH), TDWG
5.4. Checklist Provider Tool: a GBIF Application for Sharing Taxonomic Checklists Using TAPIR and TCS
Wouter Addink, Jorrit van Hertum
ETI BioInformatics
There is currently no simple way to connect nomenclatural and taxonomic resources to the Global Biodiversity Information Facility (GBIF) network. The Taxon Concept Schema (TCS) standard was developed within Biodiversity Information Standards (TDWG) to make the exchange of taxonomic data possible. Providers need easy-to-use tools to connect such data to the GBIF network and to other networks such as the Catalogue of Life using TCS.
GBIF commissioned the development of a Checklist Provider Tool for sharing datasets held in tab-delimited files or Excel spreadsheets. This tool is scheduled to be made available at the GBIF GB14 meeting in October 2007 in Amsterdam.
The Checklist Provider Tool uses a MySQL relational database to store data in a TCS-compliant format. Data can be imported into the database through web forms written in PHP. The tool includes a pre-configured TAPIR-compliant access point which connects directly to the database and will facilitate connection to the GBIF network and to other networks. The access point will be based on the TapirLink PHP implementation of the TAPIR protocol (the same implementation is in use for sharing occurrence data using Darwin Core). This TAPIR access point will expose the taxonomic checklist data in TCS format.
The Checklist Provider Tool will be available for download as open source software at the GBIF website. In addition GBIF is considering using the tool to host small data sets directly at GBIF without data providers needing to install the software locally.
Support is acknowledged from: GBIF
5.5. Shibboleth, a potential security framework for the TDWG architecture
Lutz Suhrbier1, Andreas Kohlbecker2
1 Institut für Informatik - AG Netzbasierte Informationssysteme, Freie Universität Berlin, 2 Biodiversity Informatics, Botanic Garden & Botanical Museum Berlin-Dahlem
Shibboleth is a project of the Internet2 Middleware Initiative (http://middleware.internet2.edu/). It provides an architecture and an open-source implementation for a federated, identity-based authentication and authorization infrastructure. Groups of organisations or projects may develop a federation by agreeing on common security policies and practices. They can use SAML/Shibboleth protocols to manage single sign-on across domains. This removes the need for content providers to maintain usernames and passwords.
Authorisation is instead based on trusted user attributes supplied by trusted Identity providers (IdPs) and consumed by service providers (SPs) which then gate access to secure content.
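The attribute-based decision made by a service provider can be sketched as follows. This Python example illustrates the principle only and does not use the Shibboleth software or the SAML protocol: the policy structure is an assumption, and the attribute values are hypothetical, although the attribute names follow the widely used eduPerson schema.

```python
from typing import Dict, Set


def is_authorised(attributes: Dict[str, Set[str]], policy: Dict[str, Set[str]]) -> bool:
    """Grant access only if every attribute named in the policy has an accepted value.

    Note that an empty policy places no requirements and therefore grants access.
    """
    return all(
        attributes.get(name, set()) & accepted
        for name, accepted in policy.items()
    )


# Hypothetical policy for a protected resource at a service provider.
POLICY = {
    "eduPersonScopedAffiliation": {"staff@example.org", "member@example.org"},
    "eduPersonEntitlement": {"urn:example:entitlement:taxonomic-editor"},
}

# Hypothetical attributes asserted by a trusted identity provider after sign-on.
USER_ATTRIBUTES = {
    "eduPersonScopedAffiliation": {"staff@example.org"},
    "eduPersonEntitlement": {"urn:example:entitlement:taxonomic-editor"},
}

granted = is_authorised(USER_ATTRIBUTES, POLICY)
```

The service provider never sees a password: it evaluates only the attributes that the identity provider vouches for.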
We will introduce the main concepts of the Shibboleth architecture. In addition, we outline potential benefits for the entire TDWG architecture and present the current approach to federation within the EDIT project.
6. Models for Integrating TDWG: Literature Model
6.1. Linking Bibliographic Data to Library Content
Julius Welby
Natural History Museum
The European Distributed Institute of Taxonomy (EDIT) Work Package 5.3 will provide bibliographic and literature discovery tools to help reduce the literature-related bottlenecks that can hinder the progress of day-to-day taxonomic research.
In response to discussions and a broader requirements gathering exercise we are planning a website supporting federated searching of taxonomically relevant data sources accessible via standard protocols (e.g., Z39.50). Data sources include library catalogues of EDIT partners and others, the Biodiversity Heritage Library (BHL), and electronic journals. Users will be able to browse through the results of their searches to see links to useful resources and metadata, and there will be a link to the original content where this is available, via an OpenURL service.
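As a rough illustration of the linking step, the Python sketch below builds an OpenURL 1.0 query in key/encoded-value form from a journal citation. It is not the project's implementation: the resolver address and the citation are hypothetical, and only a few of the available journal metadata keys are shown.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver_base: str, citation: Dict[str, str]) -> str:
    """Build an OpenURL 1.0 (Z39.88-2004) link for a journal article citation."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    # Citation keys such as jtitle, volume, spage and date become rft.* parameters.
    params += [(f"rft.{key}", value) for key, value in citation.items()]
    return f"{resolver_base}?{urlencode(params)}"


# Hypothetical resolver address and citation, for illustration only.
link = build_openurl(
    "https://resolver.example.org/openurl",
    {"jtitle": "Journal of Example Botany", "volume": "12", "spage": "34", "date": "1901"},
)
```

Because the link carries only citation metadata, the same search result can be routed to whichever resolver knows where the full text is held.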
Where full text content is not available, registered users will be able to nominate non-copyright texts which they would like to use, and these nominations will be made available to the various digitisation projects currently in progress at institutions around the world.
The Virtual Taxonomic Library (ViTaL) will also provide a place for taxonomists to view and search aggregated bibliographic references harvested from a number of reference management services and web sites. References will benefit from the same linking technology used for the search results.
The team will share their overall vision for the site, outline some of the practical and technical challenges, and give a brief overview of the technology used to provide linking from search results.
Support is acknowledged from: EDIT, Natural History Museum
6.2. Use cases from taxonomists, conservationists, and others
Cynthia Sims Parr1, Christopher Lyal2
1 Information International Associates, 2 The Natural History Museum, London
Digital versions of taxonomic literature are increasingly available, but significant work remains to make this literature fully accessible to users and to deliver it in the manner that best fits their needs. As part of the INOTAXA project, we conducted interviews with a wide range of potential users and developed a collection of use cases that may guide the development of systems to provide access to the taxonomic literature. The use cases range from those closest to the originators of taxonomic literature (systematists gathering material for taxonomic revision, often including phylogenetic analysis) to those that will extend the impact of the literature (e.g., ecologists harvesting species associations for modelling of interactions). They fall into eight broad categories: 1. general exploration of source material; 2. taxonomy (e.g., preparing revisions and checklists, conducting phylogenetic analyses); 3. specialized taxonomy-related (e.g., preparing author catalogues, itineraries or histories of expeditions); 4. identification (e.g., identifying specimens for taxonomic, pest-control, survey or other purposes); 5. extra-taxonomy (e.g., harvesting data for ecological, morphological or character evolution studies); 6. policy decision-making; 7. data maintenance (e.g., correcting information in databases); and 8. web services.
While there are many common functions required by these use cases, for example, searching and browsing by taxonomic name or geographic location, sequences of tasks and desired results often differ. Some use cases, particularly those beyond more traditional taxonomy, involve users who are interested only in specific parts of the literature. As they are likely to be less familiar with taxonomic literature and how to search for their exact needs, they may require more support for browsing or in assessing data completeness and fitness for use. It will be a challenge to design schemas and interfaces which support multiple use cases well without overwhelming users. The practices of many users are currently constrained by print formats. Future systems can be freed from these constraints and support database-oriented rather than document-oriented uses of literature. Such a perspective will foster closer integration of the literature with other kinds of biodiversity information such as specimen and nomenclatural databases. This will not only allow published biodiversity data to be used much more extensively and in novel ways, it will open the door to more flexible ‘publication’ and delivery of taxonomic information and data in the future.
Support is acknowledged from: Atherton Seidell Fund of the Smithsonian Institution
6.3. Progress in making literature easily accessible: schemas and marking up
Terry Catapano1, Anna Weitzman2
1 Columbia University, 2 Smithsonian Institution
An important component of making biodiversity content available is the vast quantity of taxonomic information in printed form. Even 300+ year-old works remain relevant to taxonomy. Taxonomists have traditionally accessed this information by reading and taking notes, which are later incorporated into subsequent treatments. Similar, though more widespread, access exists for images of pages on the Web (i.e., the user still needs to know what to look for and where). Another step forward is to reproduce the printed information as machine-readable text. Even this still leaves the task of distinguishing relevant information in the potentially vast quantities of data. In order to make data in literature fully accessible, it must be encoded, have proper metadata added, and be made available for searching, linking and processing. Two projects, taxonX/GoldenGate (GG) and taXMLit/INOTAXA, are attempting to tackle this task.
The aim of the taxonX schema is to provide a minimally sufficient XML tagset to identify and delineate taxonomic treatments and their significant components, particularly scientific names, geographic names, bibliographic citations, and descriptions. Once encoded in taxonX, the treatment and its associated data can be more readily extracted and incorporated into other databases as well as accessed and integrated into external resources. Owing to the diverse, heterogeneous forms of taxonomic treatments, the schema design is loose and flexible. Similarly, the content of the data itself requires normalization in order to be useful within existing and future digital infrastructures.
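As a rough illustration of why delineated treatments are easier to extract, the following Python sketch pulls names, citations, localities and descriptions out of a small marked-up document. The element names (document, treatment, name, citation, locality, description) and the sample content are hypothetical placeholders chosen for this example; they are not taken from the taxonX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names and content are illustrative only
# and are NOT taken from the taxonX schema.
SAMPLE = """
<document>
  <treatment>
    <name>Aus bus</name>
    <citation>Author 1900: 12</citation>
    <locality>Locality A</locality>
    <description>Placeholder description of the first taxon.</description>
  </treatment>
  <treatment>
    <name>Aus cus</name>
    <locality>Locality B</locality>
    <description>Placeholder description of the second taxon.</description>
  </treatment>
</document>
"""


def extract_treatments(xml_text):
    """Return one dict per treatment, keeping only the tagged components."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for treatment in root.iter("treatment"):
        records.append({
            "name": treatment.findtext("name"),
            "citations": [c.text for c in treatment.findall("citation")],
            "localities": [loc.text for loc in treatment.findall("locality")],
            "description": treatment.findtext("description"),
        })
    return records


if __name__ == "__main__":
    for record in extract_treatments(SAMPLE):
        print(record["name"], record["localities"])
```

Each resulting record could then be loaded into a names, specimen or bibliographic database; a real pipeline would also need the normalization step mentioned above.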
Developed independently, but alongside taxonX, GG contains tools for the semi-automatic markup of scientific names and treatment boundaries, and work proceeds on similar tools for bibliographic citations and geographic names. Tools to assist in the identification and normalization of descriptive data are possible, but more difficult. GG can take a cleaned OCR (optical character recognition) file in XML, HTML, or text format as input and export a taxonX instance.
taXMLit is another schema for tagging taxonomic literature. Unlike taxonX, it is deliberately a fairly complete representation of data within the literature and thus is a complex schema. Taxonomic literature has a limited number of ‘kinds’ of information. These may be recognized in several ways, including using GG. Using xml text with those designated, e.g., a taxonX instance, another set of tools is underway to further parse and normalize data from kinds of paragraphs most likely to be needed by taxonomists (e.g., taxon heading, synonymy, specimen citations). As different formats of these kinds of paragraphs are identified, a library of tools will be built. Artificial intelligence should be able to select which tool is needed for each paragraph.
We believe experiences gained in the development of taxonX and taXMLit can inform future efforts to establish TDWG standard(s) for taxonomic literature. Two approaches to this task might be considered. First, the development should be of a Standard, not necessarily a Schema. A core Vocabulary could be developed, with a number of different expressions, each ontologically harmonic, but in forms optimal for particular processes and uses. Secondly, the NLM/NCBI Journal Archiving DTD (a Document Type Definition defines the allowed building blocks of an XML document) should be investigated as one of the forms for expression of a TDWG Literature Standard. The NLM DTD enjoys strong and committed maintenance and has been adopted widely. It is designed to be modular, with domain-specific elements added to the base generic markup elements.
Support is acknowledged from: US National Science Foundation; Atherton Seidell Fund of the Smithsonian Institution
6.4. Literature & interoperability: a working example using Ants
Donat Agosti1, Terry Catapano, Guido Sautter2
1 plazi.org /American Museum of Natural History, 2 University of Karlsruhe
Print is still the main medium to communicate taxonomic results. Traditionally printed taxonomic publications may include all the information (data, analysis, conclusions) needed to understand new research results. This system has been very successful, surviving almost a quarter of a millennium. Even today with widespread technologies for electronic distribution, the basic means of taxonomic communication has not altered, not yet taking full advantage of these technologies.
An understanding of the successful print model of systematics should orient efforts in the shift to a new digital knowledge infrastructure. In essence, a taxonomic treatment is the amalgamation in a single record of information we consider relevant to describe our taxa, often including not just the inferred hypotheses but also the underlying data. If sufficiently detailed, the latter can be identified, extracted, and used to populate dedicated databases on specimens, nomenclature or bibliographic citations.
Our German DFG / US NSF funded digital library project has been built upon this premise. In order to digitally represent the significant components of systematics literature, the XML schema TaxonX (http://taxonx.org) has been developed. The prospect of encoding the tens of millions of printed pages inspired the development of dedicated mark-up software (GoldenGATE) enabling the semi-automatic mark-up of suitably clean OCRed texts.
But even this process is still time-consuming and dependent on the involvement of experts. As a result, a dedicated server, plazi.org (http://plazi.org), will be launched at the TDWG meeting. It will allow the community not only to retrieve the respective documents, but also to participate actively in the mark-up process and to retrieve digital versions of individual treatments (descriptions of taxa). Openly available services like iSpecies or EDIT’s scratchpads will be able to access the treatments and incorporate them in their mash-ups or as seeds for scratchpads.
For the legacy publications to become truly interoperable, TaxonX allows the inclusion of references to identifiers in the increasing number of dedicated databases (e.g., GBIF; bibliographic references). To bridge the gap between idea and implementation, unique identifiers for ant names will be retrieved from the Hymenoptera Name Server (which holds >200K names, including all ant names) and expressed as LSIDs. For literature, handles are retrieved via bioguid.org from plazi.org’s handle server, an integral part of DSpace, the repository used to hold and administer all the digitized legacy ant publications.
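To make the identifier step concrete, here is a minimal Python sketch that splits an LSID into its parts, following the general LSID form urn:lsid:authority:namespace:object, optionally followed by a revision. The authority, namespace and object values in the usage example are hypothetical and do not refer to any real name record.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.strip().split(":")
    # The 'urn' and 'lsid' prefixes are matched case-insensitively.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a valid LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(parse_lsid("urn:lsid:example.org:names:12345"))
```

A markup tool could run such a check before embedding an identifier in a treatment, so that malformed references are caught at encoding time rather than at resolution time.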
Although plazi.org currently concentrates on ants and legacy publications, it can in principle provide its services for any taxon. This all comes at a high cost. What is needed in future are dedicated databases (specimens, characters, names, bibliographies, etc.), unique identifiers, a program like LUCID to machine-generate both a human-readable text and the underlying XML mark-up, and publishers willing to integrate taxonomy-specific annotations alongside the human-readable text seen in taxonomic publications.
Support is acknowledged from: NSF, DFG
6.5. Taxonomic Literature: What Next?
Anna Weitzman1, Christopher Lyal2
1 Smithsonian Institution, 2 Natural History Museum, London
Calls continue for agreed standards for taxonomic literature. Earlier work identified three key levels: microcitations, metadata and content. Broad agreement has been reached on the first of these, although the standard needs finalisation, including a decision on LSIDs. A standard for metadata is still to be agreed, and must accommodate both librarian and taxonomist needs; this is becoming urgent with the development of the Biodiversity Heritage Library (BHL) and the need to access its content. The most complex standard is for complete taxonomic literature content. Taxonomic literature accommodates many different data and information types, including those subject to existing TDWG standards. This raises the possibility of it serving as a test bed for full interoperability of taxonomic data and linking of TDWG standards. However, expression of such data and information in literature sources often differs from the source material examined by other TDWG groups. In developing a standard, the requirements of several goals interact: a) interoperability across data and information types; b) maximising cost-effective access to and display of information and data in a manner suited to the user; c) cost-effective mark-up in agreed formats. Alternative routes to interoperability include a) making different schemas congruent while reflecting properties of the data sources, and b) embedding different modular schemas within a larger container. The degree of atomisation of the content will affect both the breadth of user needs that can be met cost-effectively and the effort of mark-up. Once interoperability is achieved, user-friendly navigation tools for the information universe thus created, and delivery of the output and choices expected by different users, become issues that must be addressed.
Support is acknowledged from: Atherton Seidell Fund of the Smithsonian Institution
7. Models for Integrating TDWG: Spatial Model
7.1. Species distribution modelling and phylogenetics
Stephen Andrew Smith
Yale University
Recent advances in niche modeling methods allow us to more accurately predict the ranges of species. These models have been used extensively in building low resolution maps from georeferenced museum specimens as well as predicting future movements of invasive species; however, their full potential as a tool for evolutionary biology has not been adequately explored. With the additional development of high quality world climate layers it is possible to reconstruct not only the ancestral climate envelopes for species, but also the rates of evolution in these climatic variables. I will focus on a clade of 19 plant species endemic to western North America (Oenothera sect. Anogra and Kleinia, Onagraceae). I have used high resolution (1 km²) climate data, a maximum entropy method to model the climatic tolerances of species, and phylogenies inferred from DNA sequence data. I will show a reconstruction of the evolution of climatic tolerances using standard continuous models and focus on the rate at which climatic tolerances evolve, including tests for rate heterogeneity. This approach allows me to investigate how climatic niches evolve over time scales relevant to macroevolutionary biologists. These results provide examples of both climatic niche conservation and evolution.
Support is acknowledged from: NSF
7.2. Lifemapper: Using and Creating Geospatial Data and Open Source Tools for the Biological Community
Aimee Stewart, C.J. Grady, James Beach
University of Kansas
The open source project Lifemapper 2 creates an archive of predicted species habitat maps and other spatial distribution information. Lifemapper 2 uses museum specimen data archived by the Global Biodiversity Information Facility (GBIF) and accessed through their web services, current and future Intergovernmental Panel on Climate Change (IPCC) climate scenarios, the openModeller niche modeling library, and a 64-node compute cluster. Applied studies using the resulting data can predict the impacts of climate change, loss of biodiversity, spread of invasive species, and emerging diseases.
Lifemapper 2 uses various open source libraries and applications to provide information to the biological community and general public. The Geospatial Data Abstraction Library (GDAL), PostgreSQL, PostGIS, and Mapserver are used in the pipeline of the system, while the cluster uses Sun Grid Engine and openModeller to create the niche models. The website provides archive browsing and data query and download, while web services provide programmatic access.
The services provided by this project will provide inputs for more user-friendly software tools for biological collection data integration and analysis. This will provide a foundation for a new pluggable, extensible architecture that will tie together the services, functions and methods of different applications. The end result will improve quality of data collected and provide support to researchers in studying species’ actual and potential distributions.
Support is acknowledged from: National Science Foundation
7.3. A pilot project for biodiversity and climate change interoperability in the GEOSS framework
Stefano Nativi1, Paolo Mazzetti1, Lorenzo Bigagli1, Valerio Angelini1, Enrico Boldrini1, Éamonn Ó Tuama2, Hannu Saarenmaa3, Jeremy Kerr4, Siri Jodha Singh Khalsa5
1 Italian National Research Council – IMAA and Univ. of Florence, 2 GBIF Secretariat, 3 University of Helsinki, 4 University of Ottawa, 5 IEEE and Univ. of Colorado
The Global Biodiversity Information Facility (GBIF) Interoperability Process Pilot Project (IP3) addresses two Societal Benefit Areas, Biodiversity and Climate, and is developed within the framework of the Global Earth Observation System of Systems (GEOSS). GEOSS is an international initiative to combine new and existing hardware and software for the purposes of supplying earth observation data and information at no cost.
The aim of IP3 is the implementation of a GEOSS Architecture through the development of relevant scenarios that draw on data and information exchange from a series of interconnected systems.
The focus of GBIF IP3 is modeling the impact of climate change on species distribution. To achieve this, heterogeneous data resources (e.g., biodiversity, climatological and environmental resources) and processing services are required to interoperate by using a Service Oriented Architecture (SOA) approach.
For the pilot, some available service-based components were selected, some artifacts were developed, and special arrangements to facilitate interoperability were demonstrated, all for specific use scenarios.
These main components are described below.
Biodiversity occurrences are discovered and accessed through web services published by the GBIF Data Portal (http://data.gbif.org) and according to the TDWG Darwin Core standard format.
Climatological data are obtained from the NCAR GIS portal which provides web access to free global datasets of climate change scenarios. These data (spanning 50 years from 2000 to 2050) have been generated for the 4th Assessment Report of the Intergovernmental Panel on Climate Change (IPCC) by the Community Climate System Model (CCSM). The datasets are processed to generate grid coverage and served through an OGC WCS 1.0 server.
GI-cat is a federated catalog providing a unique and consistent interface that enables the interrogation of biodiversity and climatological data resources. GI-cat exposes an OGC CS-W/ebRIM interface and is able to federate heterogeneous catalogs and access servers that implement international geospatial standards (e.g., OGC OWS). In addition, GI-cat implements a mediation server, making it possible to federate non-standard servers (e.g., THREDDS/OPeNDAP servers) by specifying “special interoperability arrangements”. A special interoperability arrangement was introduced for the GBIF portal services, consisting of a formal mapping of the GBIF data model to the ISO 19115 core metadata profile and an adaptation between the GI-cat and GBIF service protocols.
The component used for processing collected data and generating future projections is openModeller, an open source Ecological Niche Modelling (ENM) framework. It is accessed through a Web Services interface based on the SOAP protocol.
An AJAX client was developed to implement a user friendly interface to OpenModeller functionalities, making them accessible by any web browser. With this tool, the user is guided through the process of discovering data (by submitting queries to GI-cat), accessing selected data (through GBIF and WCS/NCAR data servers) and running ENM projections. Finally, the results are shown.
A first demonstration dealt with a Canadian butterfly species (Amblyscirtes vialis) and its response to climate change. This demonstration was presented at the most recent GEOSS workshops.
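The climate data access step described above (gridded coverages served through an OGC WCS 1.0 server) can be sketched as a key-value-pair GetCoverage request. The parameter names below are those of WCS 1.0.0; the endpoint URL, the coverage name "tas", the bounding box and the output format are hypothetical values chosen for this example and do not describe the actual NCAR service.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_getcoverage_url(endpoint, coverage, bbox, width, height,
                          crs="EPSG:4326", fmt="GeoTIFF"):
    """Build a WCS 1.0.0 GetCoverage request URL (key-value-pair encoding).

    bbox is (min_x, min_y, max_x, max_y) in the units of the given CRS;
    width and height give the size of the returned grid in cells.
    """
    params = {
        "SERVICE": "WCS",
        "VERSION": "1.0.0",
        "REQUEST": "GetCoverage",
        "COVERAGE": coverage,
        "CRS": crs,
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint, coverage name and extent, for illustration only.
    url = build_getcoverage_url(
        "http://example.org/wcs", "tas", (-141.0, 41.7, -52.6, 83.1), 200, 100
    )
    print(url)
    # Decode the query string again to show the parameters that were sent.
    print(parse_qs(urlparse(url).query)["BBOX"])
```

A client would normally issue GetCapabilities and DescribeCoverage requests first to learn which coverage names, coordinate reference systems and formats a given server supports.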
7.4. Advances at the OGC, and Opportunities for Harmonization with TDWG Standards and Models
Phillip C. Dibner
OGC Interoperability Institute (OGCii)
Several recent developments in technology and organization at the Open Geospatial Consortium (OGC) are highly relevant to the mission and objectives of the Biodiversity Information Standards (TDWG) community.
Most fundamentally, a new institute has been created: the OGC Interoperability Institute (OGCii). Whereas the OGC’s mission is to create standards that support spatiotemporal processing, the OGCii was constituted to help make these standards accessible to the scientific research community through education, technical engagement, and collaboration in research programs. The OGCii also maintains a strong relationship with the OGC’s Standards and Interoperability Programs, and enjoys continued access to OGC resources.
As of this writing, the OGCii has collaborated with several academic and research institutions in scientific proposals that span a variety of domains, including biodiversity informatics. These essentially independent efforts share a common theme: enabling the integration of datasets from different domains of knowledge by harmonizing the information models employed by their respective information communities.
Several OGC specifications in the final stages of the approval process are highly relevant to these and future efforts. Prominent among them is the Observations and Measurements (O&M) standard that has been a topic in several presentations to the TDWG membership (e.g., P. Dibner, 2006, Proceedings of TDWG, “An integrative, standards-compliant framework for TDWG schemata and services”), and a key component in the TDWG/OGC domain modeling and harmonization workshop conducted in Edinburgh in June, 2006. Related standards, also near release, include the Sensor Observation Service (SOS), which serves Observation objects, and Sensor Modeling Language (SensorML), which describes sensing devices and data collection processes.
There has been a recent convergence of interest throughout the scientific community in standards for describing and sharing observations. The same formats and schemata need not be adopted by every discipline; it is sufficient if the information models they use are consistent, so that real-time conversion between them is feasible. In this capacity, the O&M and related standards offer the prospect of enabling seamless integration of BIS (TDWG) data into investigations and analyses that use the growing network of OGC service implementations.
Other developments at the OGC involve products and tools that are associated with the mass market. A particularly prominent example is the project to harmonize the KML language for describing geographic information with OGC specifications, and ultimately to release it as an OGC standard in its own right. This will enable the distribution of scientific data via powerful, popular, and freely available tools such as the Google Earth browser.
Perhaps more significant than these technical capabilities and projects is the growing web of relationships and common interests between the two organizations. The MoU executed between TDWG and the OGC in October 2006 has continued to spawn a variety of exchanges, including speakers at meetings, collaboration in modeling exercises and proposals, and incorporation of biodiversity data in research exercises and implementation pilots. The relationship is alive, well, and continuing to grow.
Most fundamentally, a new institute has been created: the OGC Interoperability Institute (OGCii). Whereas the OGC’s mission is to create standards that support spatiotemporal processing, the OGCii was constituted to help make these standards accessible to the scientific research community through education, technical engagement, and collaboration in research programs. The OGCii also maintains a strong relationship with the OGC’s Standards and Interoperability Programs, and enjoys continued access to OGC resources.
As of this writing, the OGCii has collaborated with several academic and research institutions in scientific proposals that span a variety of domains, including biodiversity informatics. These essentially independent efforts share a common theme: enabling the integration of datasets from different domains of knowledge by harmonizing the information models employed by their respective information communities.
7.5. The BiogeoSDI workshop: Demonstrating the use of TDWG and OGC standards together
Javier de la Torre1, Tim Sutton2, Bart Meganck3, Dave Vieglais4, Aimee Stewart4, Peter Brewer5, Renato de Giovanni2
1 Imaste-IPS, 2 CRIA, 3 Africamuseum, 4 University of Kansas, 5 University of Reading
A week-long workshop was held in Campinas, Brazil, during the first week of April 2007. The focus of the workshop was to develop a test-bed web application to demonstrate the interoperability of digital data and services using open standards, with particular emphasis on geospatial, taxonomic and occurrence biodiversity data.
Two versions of a prototype web application were developed, using PHP and Flex. The wizard-style application leads the user through a defined sequence of steps to acquire sufficient data to create a niche model. The process includes taxonomic validation using the Catalogue of Life; search and retrieval of occurrence data using services such as the GBIF portal or WFS; selection of raster layers representing environmental data; and modeling of these data using the openModeller Web Service to create a probability surface that represents areas where a species is likely to occur.
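As an illustration of the occurrence-retrieval step, the following minimal Python sketch builds a WFS 1.1.0 GetFeature request using key-value-pair encoding. The endpoint URL, the feature type name and the bounding box are hypothetical placeholders, not the services used at the workshop, and the coordinate order expected in the bounding box depends on the server and its coordinate reference system.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_wfs_getfeature_url(base_url, type_name, bbox, max_features=100):
    """Build a WFS 1.1.0 GetFeature URL restricted to a bounding box."""
    min_x, min_y, max_x, max_y = bbox
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,
        "bbox": f"{min_x},{min_y},{max_x},{max_y}",
        "maxFeatures": str(max_features),
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, for illustration only.
url = build_wfs_getfeature_url(
    "http://example.org/wfs",
    "occurrences",
    bbox=(-60, -30, -40, -10),
)

# Decode the query string again to inspect what would be sent.
query = parse_qs(urlparse(url).query)
print(query["request"], query["bbox"])
```

The response to such a request would typically be a GML feature collection, which a client application then has to parse before passing the points on to a modelling service.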
The workshop highlighted how easy it is to rapidly create a feature-rich application using open access to data, free software and open standards. It also highlighted areas where further work is needed to blend these services effectively into a cohesive computing platform. Finally, suggestions for improving OGC and TDWG standards were made in a report that is available at http://wiki.tdwg.org/twiki/bin/view/Geospatial/InteroperabilityWorkshop1. The prototype will be demonstrated and the issues arising will be discussed.
Support is acknowledged from: TDWG Infrastructure Project, CRIA, University of Kansas, Africamuseum, IMASTE-IPS
8. Models for Integrating TDWG: Descriptive Model
8.1. From Xper to Xper²: comments on twenty years of taxonomic applications with descriptive and identification tools
Régine Vignes Lebbe, Guillaume Dubus
Université Pierre et Marie Curie, Paris 6
Xper and associated programs have now existed for twenty years. They are dedicated to managing taxonomic descriptions, providing interactive free-access identification, constructing keys and diagnoses and analysing, comparing and measuring similarities between descriptions.
During these twenty years, each new taxonomic application has suggested improvements in knowledge representation, management functionality, user interface and taxonomic tools. For example, a tool to write descriptions automatically as readable text was developed to publish (in two languages) the descriptions of phlebotomine sandflies of French Guiana. An HTML export was then added to Xper² for quick on-line distribution of a knowledge base. As another example, a tool to compute similarities between descriptions was developed and later used to complete the taxonomic forms constructed automatically from a knowledge base, suggesting the most similar taxa and the risk of misidentification. Most recently, import and export in spreadsheet format proved important for practical use during an application on the Flora of France.
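The idea of computing similarities between descriptions in order to suggest the most similar taxa can be sketched as follows. This is only an illustrative Python example: the abstract does not state which similarity measure Xper² uses, so the measure below (mean Jaccard overlap of character states over shared characters), the function names and the sample taxa are all assumptions.

```python
def description_similarity(desc_a, desc_b):
    """Mean Jaccard overlap of states over the characters scored in both descriptions."""
    shared = set(desc_a) & set(desc_b)
    if not shared:
        return 0.0
    total = 0.0
    for character in shared:
        states_a, states_b = desc_a[character], desc_b[character]
        union = states_a | states_b
        # Two empty state sets are treated as identical.
        total += len(states_a & states_b) / len(union) if union else 1.0
    return total / len(shared)


def most_similar(target, candidates, top_n=3):
    """Rank candidate taxa by decreasing similarity to the target description."""
    scored = [(name, description_similarity(target, desc))
              for name, desc in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical coded descriptions: character -> set of observed states.
taxon_a = {"leaf shape": {"ovate"}, "flower colour": {"white", "pink"}}
taxon_b = {"leaf shape": {"ovate"}, "flower colour": {"white"}}
taxon_c = {"leaf shape": {"linear"}, "flower colour": {"yellow"}}

ranking = most_similar(taxon_a, {"Taxon B": taxon_b, "Taxon C": taxon_c})
print(ranking)
```

A high score for a neighbouring taxon flags a risk of misidentification, which is the kind of information the text describes adding to the automatically constructed taxonomic forms.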
We will discuss the positive and negative points of such developments and the gap between taxonomists’ needs and computer scientists’ objectives.
http://lis.snv.jussieu.fr/newlis/?q=en/resources/softwares/cai/xper2
http://lully.snv.jussieu.fr/xperbotanica/
We thank all the Xper² users for their profitable comments on the software.
Support is acknowledged from: Xper² development is funded by the BioInfo 2002 grant of the CNRS and by the French Ministère for Research and New Technology (n°04L370 – Project Xper Botanica, 2005-2007).
8.2. GrassBase – integrating structured descriptions, taxonomy and content management
Kehan Harman
Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AB, U.K.
Kew's World Grass species descriptions have recently been made available over the internet, but management of this descriptive resource is still difficult. The resource consists of 11,000 species descriptions and 700 generic descriptions in DELTA format, using a suite of 1090 mainly morphological characters. While the CSIRO DELTA software has supported this dataset well for many years, the lack of continued support and development for this software has led to the need for a more sustainable tool. Furthermore, the associated nomenclature database is only available as a downloadable MS Access database application. The World Grass Species project has been set up to help integrate these two resources and to make them available through a web portal. Combining an industry-leading Content Management System (Drupal – http://drupal.org) with existing TDWG standards to integrate and present these data facilitates the development process and simplifies extension of the tool using community-developed modules.
Support is acknowledged from: Royal Botanic Gardens Kew, Cranfield University
8.3. Mechanisms for coordination and delivery of descriptive data and taxon profiles in the Australasian Biodiversity Federation
Alex R. Chapman
Western Australian Herbarium, DEC
This presentation summarises the development and delivery of descriptive data within keys and taxon profile pages in Australia, with particular focus on ‘FloraBase - the Western Australian Flora’ as a case study for information assembly and coordination.
Structured taxonomic descriptive data standards enable the rigorous capture of taxon data at an atomic character level, for delivery in interactive identification, information retrieval and natural language descriptions.
Coordinating the capture of coded descriptive data for one taxonomic group becomes more complex with additional contributors. Integrating descriptive data across taxonomic groups requires an agreed vocabulary and definition of characters and states even when the morphology is non-homologous; once again coordination becomes more complex with increased variety of taxonomic groups.
While there are some notable large-scale collaborative descriptive projects scoring atomic character data across large geographic and taxonomic extents, there is more commonly a disjunction when existing data from various authoritative sources is assembled. In these cases a more generalised schema defining higher-level aggregated descriptive components can be used.
In Western Australia, with 3% of the world's vascular flora, a mixed strategy is maintained. Detailed coded data for the 1300 families and genera allows the publication of a comprehensive set of interactive keys and scientific descriptions from a single source. At the species level, some 13,000 short coded descriptions are maintained, as well as a growing set of free-text descriptions from online journals, archived floras or stand-alone projects.
Nationally, the Flora of Australia volumes have been marked up using an XML schema that identifies larger blocks of descriptive text. This schema may provide a common reference standard for marking up species-level descriptions from the many potential sources. With the upcoming 'Atlas of Living Australia' and global 'Encyclopedia of Life' projects, the identification of standard components on taxon profile pages and a mechanism for reliable tagging will be required.
Life Science Identifiers (LSIDs) can flag available elements for interrogation and retrieval, with the potential for automated sharing, re-use and re-assembly of authoritative content. This has the benefit of allowing the dwindling pool of specialists to focus on the development, publication and maintenance of content.
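To show what such an identifier carries, here is a minimal Python sketch that splits an LSID into its components, following the general form urn:lsid:authority:namespace:object with an optional trailing revision. The example identifier, its authority and its namespace are hypothetical, and the sketch checks only the overall shape of the identifier, not the full character rules of the LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into authority, namespace, object and optional revision."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn:lsid' prefix is matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


# Hypothetical identifier for a block of descriptive text on a taxon profile page.
lsid = parse_lsid("urn:lsid:example.org:taxon-profiles:12345:2")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

A harvesting service could use the authority and namespace components to decide where to resolve an element and what kind of content to expect.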
Support is acknowledged from: WA Department of Environment and Conservation; Global Biodiversity Information Facility
8.4. Using Automatically Extracted Information in Species Page Retrieval
Xiaoya Tang1, P. Bryan Heidorn2
1 Emporia State University, 2 University of Illinois
Users searching botanical texts online in currently available full-text indexes such as Google must accurately guess the vocabulary of the original author(s) to find the desired results. A large number of botanical volumes are available electronically, and many more are being made available through projects such as the Encyclopedia of Life and Biodiversity Heritage Library. However, current retrieval systems available for these collections are not able to interpret the specific information requests correctly and match them with appropriate documents. Author vocabulary often varies greatly from the user’s search vocabulary. We will present a study which integrates text mining techniques into the full-text search process and automatically identifies selected plant morphological information from text to assist keyword-based retrieval. The technique could be expanded to other collections of documents.
An experiment involving users was conducted to evaluate this approach on the full text of the Flora of North America (FNA). Thirty upper-level undergraduates and graduate students from two Illinois universities who had completed a course in botany were asked to identify ten herbarium specimens of trees of Illinois. The subjects used a full-text search engine with an index of several volumes of FNA. The user search logs were used to identify the plant characteristics most frequently used by the students, independent of the usefulness of these terms for retrieving taxonomic treatments using full-text search. These characters were targeted for text extraction. A set of treatments was marked up by hand to serve as training examples, a machine learning method was used to learn extraction patterns, and the commonly used characters were then mined from the 1637 treatments in the FNA. Depending on the information type, the accuracy of the extraction was between 60% and 100%, except for leaf shape and leaf arrangement information, for which it was around 50% and 30%, respectively. In a new experiment, one group of 12 subjects used a traditional full-text search system while another group of 12 used full-text search plus pull-down menus and web forms that allowed them to search based on the machine-extracted information. The experimental results indicate that the latter approach significantly improves keyword-based retrieval performance, allowing users to complete more identification tasks successfully than when they had to generate their own search terms. It also increases users’ satisfaction with the retrieval system.
8.5. Capturing structured data to facilitate web revisions
Dave Roberts1, Julius Welby1, Markus Döring2
1 The Natural History Museum, 2 Botanischer Garten und Botanisches Museum Berlin-Dahlem
In order to write a taxonomic revision it is necessary for an author to assemble and consider the range of existing descriptions, bring them into a common framework (i.e., standardisation) and consider how well they form delineated groups. In general the existing descriptions are in free-text blocks with associated nomenclatural and relationship information usually laid out in a structured (formatted) manner. The EU project EDIT has devised a general information-flow structure to guide the development of tools to assist taxonomists in their work and to bring the products of taxonomic effort more efficiently to the broader user community.
From a sociological perspective we consider it essential to design ways of working that mesh seamlessly with the way taxonomists work now. To that end we have investigated the natural language application GoldenGATE as a means to add structure to both the free-text descriptions and the formatted nomenclatural elements of both published and new work. The primary intention is to capture content from manuscripts (word processor documents) rather than from published sources per se.
We will describe the information model that is guiding EDIT development and the advantage that structured data can offer in terms of increasing the efficiency of taxonomic workflow. Better tools to process taxonomic information are of significantly greater value if there is information to be processed. In other words, we need to establish a bank of structured content and demonstrate the benefits of working with structured data if we are to engage new users with this improved way of working. The goal is to motivate users to invest the effort required to understand and use structured data tools.
Support is acknowledged from: EU's Sixth Framework Programme: European Distributed Institute of Taxonomy (EDIT)
9. Integrating Biodiversity Data
9.1. Removing Taxonomic Impediments: How the Encyclopedia of Life and Biodiversity Heritage Library projects can help
Graham Higley
The Natural History Museum, London
The Encyclopedia of Life (EOL) is a collaborative scientific effort led by the Field Museum, Harvard University, Marine Biological Laboratory (Woods Hole), Missouri Botanical Garden, Smithsonian Institution, and Biodiversity Heritage Library (BHL), a consortium of natural history libraries(1). Ultimately, the Encyclopedia of Life will provide an online database for all 1.8 million species known to live on Earth. When completed, www.eol.org will serve as a global biodiversity tool, providing scientists, policy-makers, students, and citizens with the information they need to discover and protect the planet and to encourage learning and conservation. An Advisory Board of 12 distinguished individuals from 5 countries will help guide the Encyclopedia.
The BHL has developed a strategy and operational plan to digitize the published literature of biodiversity held in the collections of its partner libraries. This literature will be available at www.biodiversitylibrary.org. The partner libraries collectively hold a substantial part of the world’s published knowledge on biological diversity. This body of biodiversity knowledge, in its current form, is largely unavailable to a broad range of applications, including research, education, taxonomic study, biodiversity conservation, protected area management, disease control, and maintenance of diverse ecosystem services.
From a scholarly perspective, these collections are of exceptional value because the domain of systematic biology depends, more than any other science, upon historic literature. The so-called “decay-rate” of this literature is much slower than in other fields such as biotechnology. Ongoing mass digitization projects lack the discipline-specific focus of the Biodiversity Heritage Library Project. These other projects will fail to capture significant elements of legacy taxonomic literature. The Biodiversity Heritage Library Project will actively seek to incorporate data and content from other digitization projects.
The Biodiversity Heritage Library Project will immediately provide content for multiple bioinformatics initiatives and research, including EOL. For the first time in history, the core of natural history library collections will be available to a global audience. Web-based access to these collections will provide a substantial benefit to all researchers, especially those living and working in the developing world.
Up-to-date information can be found on the two websites www.eol.org/ and www.biodiversitylibrary.org/. An EOL Newsletter will be produced shortly. Anyone who wishes may register to receive regular email updates at www.eol.org/registration.php.
1 Including the Smithsonian Institution, Missouri Botanical Garden, American Museum of Natural History (New York), Natural History Museum (London), New York Botanical Garden, Royal Botanic Gardens (Kew), Marine Biological Laboratory and others.
9.2. Data Integration Issues in Biodiversity Research
Jessie Kennedy1, Shawn Bowers2, Matthew Jones3, Josh Madin3, Robert Peet4, Deana Pennington5, Mark Schildhauer3, Aimee Stewart6
1 Napier University, 2 UC Davis Genome Center, 3 National Center for Ecological Analysis and Synthesis, 4 The University of North Carolina at Chapel Hill, 5 The University of New Mexico, 6 The University of Kansas
The Scientific Environment for Ecological Knowledge (SEEK) project is developing an IT framework and infrastructure that will be used to derive biodiversity and ecological knowledge by facilitating the discovery, integration, interpretation, and analysis of ecological information. SEEK is based on a three-layered architecture: the EarthGrid (the lowest layer) provides uniform access to biodiversity and other types of data sets; Kepler, a workflow tool (the highest layer), allows scientists to visually define, document, and execute their analyses; and the Semantic Mediation System (the middle layer) uses domain knowledge represented in ontologies and databases to inform the discovery, integration and analysis of ecological data. The SEEK project has been motivated and directed by ecological analyses such as niche modelling and biodiversity studies. Example case studies have been used to explore the issues facing the researchers undertaking these analyses. This presentation will outline these issues and give an overview of the approaches used by SEEK.
Much modern research in ecology is based on the integration (and re-use) of multiple datasets. These datasets may be distributed globally and stored in a variety of formats, and the data will most likely have differing semantics, reflecting any of the many measurements of spatial and temporal environmental factors and organismal characteristics and interactions that contribute to a given ecosystem. In a typical scenario, a scientist is interested in analyzing the spread of invasive species in a certain region. S/he has distribution records in a personal database, but requires access to other potentially relevant datasets on-line. The researcher needs to be able to discover candidate datasets and then merge their relevant and compatible information. To do so, the researcher needs to determine which datasets contain information about the species of interest or are relevant to the timescale and locality of the research. Simplistically, datasets might be retrieved and integrated on the basis of country and species name; however, even simple data files can be extremely time-consuming to integrate manually and complicated, if at all possible, to integrate automatically, as a simple example will show.
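The difficulty of integrating on country and species name alone can be sketched in a few lines of Python. The records, field names and function names below are hypothetical and are not taken from SEEK; the sketch only illustrates why exact matching on these two fields is fragile.

```python
def normalise(value):
    """Collapse whitespace and ignore case so trivial formatting differences match."""
    return " ".join(value.split()).casefold()


def merge_on_country_and_species(local, remote, clean=False):
    """Pair local and remote records whose (country, species) keys are equal.

    Returns (matched_pairs, unmatched_local_records).
    """
    def key(record):
        country, species = record["country"], record["species"]
        if clean:
            return (normalise(country), normalise(species))
        return (country, species)

    index = {}
    for record in remote:
        index.setdefault(key(record), []).append(record)

    matched, unmatched = [], []
    for record in local:
        partners = index.get(key(record), [])
        if partners:
            matched.extend((record, partner) for partner in partners)
        else:
            unmatched.append(record)
    return matched, unmatched


# Hypothetical records: the same species recorded in three different ways.
local_records = [{"country": "Brazil", "species": "Puma concolor", "count": 3}]
remote_records = [
    {"country": "brazil ", "species": "puma  concolor", "year": 2001},  # formatting differs
    {"country": "Brazil", "species": "Felis concolor", "year": 1995},   # older synonym
]

exact_matches, _ = merge_on_country_and_species(local_records, remote_records)
clean_matches, _ = merge_on_country_and_species(local_records, remote_records, clean=True)
print(len(exact_matches), len(clean_matches))
```

Exact matching pairs nothing here. Cleaning the strings recovers the record that differs only in formatting, but the record filed under a synonym still goes unmatched, which is a semantic problem that string handling alone cannot solve.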
Meta-data describing the content of the data sets are important for finding and integrating suitable data. SEEK therefore requires data sets stored in the EarthGrid to be marked up with Ecological Metadata Language (EML). EML includes descriptions of the temporal, geographical and taxonomic coverage of the data sets. Much of the terminology used in EML is generically applicable to scientific data structures (such as table name or column label), while more domain-relevant terms (such as biomass or wing span) are defined in ontologies being developed by the SEEK team in conjunction with disciplinary specialists.
The Semantic Mediation System (SMS) layer in SEEK uses ontologies to expand terms for searching EarthGrid for data discovery and for supporting the scientist in semi-automatically transforming data for input to appropriate analytical components in Kepler. This is accomplished using a generalized ontology for modeling “observational data”, called OBOE. OBOE provides a framework in which the meaning and inter-relationships of observations within a scientific data set can be clarified. For example, one can use OBOE to indicate that various data sets contain both weights and wing spans of bird specimens, thus greatly facilitating effective data discovery and potential integration of those types of data sets. The SEEK Taxon group, whose work also sits in the semantic mediation layer of SEEK, has been researching the more specialized issues associated with clarifying the semantics necessary to inter-relate the taxonomic coverage of ecological data sets.
Ecological data sets of relevance to biodiversity modeling tend to have been collected either over long periods of time or over a wide geographic range, and typically use unqualified biological names for recording taxon occurrences or counts (often codes are used, with the biological names specified in the meta-data). However, because taxonomists continue to classify and name the known organisms, the meaning associated with these names changes over time. Representing the taxonomic coverage of ecological data by simply referencing species names therefore results in ambiguity, which may be significantly detrimental to the results of any subsequent ecological analysis. To address this problem, the SEEK Taxon group is adopting a taxonomic concept approach, as defined in collaboration with TDWG in the TCS standard. A necessary component will be formal modification of the Ecological Metadata Language (EML) to support identification of organisms to concept. We are currently developing tools to aid the ecologist in selecting appropriate taxon concepts, which will improve the accuracy of matching data for integration. The tools include a Taxon Object Server (whose model is closely based on TCS) to support the resolution of taxon names and concepts, and visual tools to enable users to compare concepts and clarify relationships among them.
Support is acknowledged from: NSF
9.3. Data Integration: Using TAPIR as an asynchronous caching protocol
Aaron Steele
University of California at Berkeley
There are over 100 million DarwinCore specimen records available on distributed networks worldwide. However, the search space for application-specific information is becoming vast and unreliable. For applications that know a priori what data are needed, asynchronous caching provides a reliable subset of data specific to a particular analysis. For example, an application generating species distribution models from Madagascar would benefit from accessing locally cached data where HigherGeography = Madagascar, instead of dynamically querying the network at run-time, which is expensive.
While TAPIR provides a straightforward caching protocol for retrieving specific DarwinCore concepts from a set of resources and integrating the results into a single database, a key concern is keeping these cached data synchronized with the resources. For example, when records are inserted, updated, or deleted in a resource, the cached data must reflect these changes. Since TAPIR does not explicitly support syndicating these change events, they must be inferred by storing the GlobalUniqueIdentifier (GUID) and DateLastModified (dlm) concepts of all resource records in a level-2 cache, and then periodically comparing this cache against the resource.
As a concrete example, suppose at time 't1' we create a level-2 cache 'C' for resource 'R'. The next day at time 't2' we create a second level-2 cache 'C2' of 'R'. Then, using 'C' and 'C2', the change events in 'R' during time period 't2'-'t1' can be defined as follows:
1) If 'C2.GUID' is not in 'C', then 'C2.GUID' was inserted.
2) If a GUID is in both 'C' and 'C2' and 'C2.dlm' is different from 'C.dlm', then 'C2.GUID' was updated.
3) If 'C.GUID' is not in 'C2', then 'C.GUID' was deleted.
In this way, after comparing records in the level-2 cache against current resource inventories, all change events are detected and associated with specific GUIDs. The level-1 cache then uses these GUIDs to synchronize changes by submitting new TAPIR inventory requests (for new or updated records) and deleting cached records that have been deleted.
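The three comparison rules can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the networks named in this abstract: it assumes each level-2 cache is available in memory as a mapping from GUID to DateLastModified, and the function name `detect_changes` and the sample GUIDs and dates are hypothetical.

```python
def detect_changes(old_cache, new_cache):
    """Compare two level-2 snapshots, each mapping GUID -> DateLastModified.

    old_cache plays the role of 'C' (taken at t1) and new_cache the role
    of 'C2' (taken at t2). Both are assumed to be plain dictionaries.
    """
    # Rule 1: GUIDs present only in the newer snapshot were inserted.
    inserted = sorted(g for g in new_cache if g not in old_cache)
    # Rule 2: GUIDs present in both snapshots with a different dlm were updated.
    updated = sorted(
        g for g in new_cache
        if g in old_cache and new_cache[g] != old_cache[g]
    )
    # Rule 3: GUIDs present only in the older snapshot were deleted.
    deleted = sorted(g for g in old_cache if g not in new_cache)
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Hypothetical snapshots of one resource taken on consecutive days.
c = {"guid:1": "2007-08-01", "guid:2": "2007-08-01", "guid:3": "2007-08-01"}
c2 = {"guid:1": "2007-08-01", "guid:2": "2007-08-02", "guid:4": "2007-08-02"}

print(detect_changes(c, c2))
# {'inserted': ['guid:4'], 'updated': ['guid:2'], 'deleted': ['guid:3']}
```

The resulting GUID lists are what a level-1 cache would need in order to decide which records to fetch again and which to delete.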
In this presentation I will discuss these key caching algorithms in more detail, including the process of syndicating resource changes in the level-2 cache using RSS feeds, the implementation of data harvesting, initial results of these methods in the MaNIS, ORNIS and HerpNET networks, and proposed additions to TAPIR. I will also address social and political concerns associated with caching, and provide information about free open source storage solutions including MySQL and the Google Base API.
Support is acknowledged from: TDWG Infrastructure Project, NSF, University of California at Berkeley
9.4. How to handle duplication in large datasets and import scenarios
Andreas Müller, Markus Döring, Walter G. Berendsohn
Botanic Garden & Botanical Museum Berlin-Dahlem
When integrating, processing or querying biodiversity data, one must sooner or later address various problems raised by the existence of physical or digital duplicates. Both creating such duplicates and failing to find them may lower the quality of the information in terms of the completeness, readability or consistency of the dataset.
For the EU-funded SYNTHESYS project (A Synthesis of Systematics Resources) we developed a duplicate detection tool for the GBIF index of specimen and observation data as well as tools for importing taxonomic data into Berlin Model databases. In this context we developed different algorithms to handle such duplicates.
The current GBIF index contains about 100 million specimen and observation records. Querying such a database for duplicates online requires sophisticated techniques, because naive approaches such as comparing each individual record with every other record are too costly in terms of processing time. Hence an algorithm has been developed that adapts known record linkage techniques, using pre-computed standardization and blocking followed by online comparison and classification.
GBIF data are already widely standardized, so little investment has been made in standardization. For blocking, a multi-channel sorted neighbourhood mechanism has been used. Records are inserted into sorted indices that have a high probability of storing duplicates close to one another. When queried, this filtering component passes only those records that are in close proximity to the original record in at least one of the indices. The remaining candidates are compared by probability-based functions that work at both the attribute level and the record level. Finally, classification depends on the type of duplicates searched for: physical or digital. The result set is fuzzy, i.e., not only exact duplicates are returned. This takes into account that data may undergo changes depending on the pathway from collecting to importing them into the GBIF index.
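The blocking and comparison steps can be illustrated with a small Python sketch. It is not the tool described in this abstract: the class and function names, the field names, the sorting keys, the window size and the threshold are all hypothetical, and a plain string-similarity ratio from the standard library stands in for the probability-based comparison functions.

```python
import bisect
from difflib import SequenceMatcher


class SortedNeighbourhoodIndex:
    """One sorted index ("channel") per key function, built in advance."""

    def __init__(self, records, key_funcs, window=2):
        self.records = records
        self.key_funcs = key_funcs
        self.window = window
        # Each channel is a sorted list of (key, record position) pairs.
        self.channels = [
            sorted((key(rec), pos) for pos, rec in enumerate(records))
            for key in key_funcs
        ]

    def candidates(self, query):
        """Positions of records close to the query in at least one channel."""
        found = set()
        for key, channel in zip(self.key_funcs, self.channels):
            # (k,) sorts before every (k, pos), so this finds the first entry
            # whose key is greater than or equal to the query key.
            at = bisect.bisect_left(channel, (key(query),))
            lo = max(0, at - self.window)
            hi = min(len(channel), at + self.window)
            found.update(pos for _, pos in channel[lo:hi])
        return found


def similarity(a, b, fields):
    """Record-level score: mean of attribute-level string similarities."""
    scores = [
        SequenceMatcher(None, str(a.get(f, "")), str(b.get(f, ""))).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_duplicates(index, query, fields, threshold=0.85):
    """Fuzzy matches: candidates whose record-level score passes the threshold."""
    return sorted(
        pos for pos in index.candidates(query)
        if similarity(index.records[pos], query, fields) >= threshold
    )


# Hypothetical specimen records.
records = [
    {"collector": "Smith, J.", "number": "1024", "taxon": "Quercus robur"},
    {"collector": "Smith J", "number": "1024", "taxon": "Quercus robur L."},
    {"collector": "Müller, A.", "number": "77", "taxon": "Fagus sylvatica"},
    {"collector": "Jones, P.", "number": "310", "taxon": "Abies alba"},
]
key_funcs = [
    lambda r: r["taxon"].lower(),
    lambda r: r["collector"].lower()[:4] + r["number"],
]
index = SortedNeighbourhoodIndex(records, key_funcs, window=2)
query = {"collector": "Smith, J", "number": "1024", "taxon": "Quercus robur"}
print(find_duplicates(index, query, ["collector", "number", "taxon"]))
```

Only the records that fall inside the window of at least one channel are compared in detail, which is what keeps the online step cheap; the threshold would differ depending on whether physical or digital duplicates are sought.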
Avoiding duplicates during the automatic import of data into a taxonomic Berlin Model database requires more conservative comparison functions, as false positives should be avoided here. Still, records should be detected as duplicates if they differ only in the completeness of some less important attributes. To handle this problem, a rule-based two-step algorithm for an object-oriented Berlin Model persistence layer has been developed, which detects duplicate candidates and merges them once they are verified as duplicates. For this purpose, a set of rules has been proposed to handle different types of attributes and attribute groups. The rules are easy to adapt to the differing needs of different users.
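A two-step detect-and-merge procedure of this kind might look roughly as follows in Python. The sketch makes several assumptions that are not taken from the abstract: records are plain dictionaries rather than Berlin Model objects, and there are only two hypothetical rule types, `STRICT` attributes that must match exactly and `OPTIONAL` attributes that may be empty on one side but must not conflict.

```python
# Hypothetical rule sets; a real import would configure these per user.
STRICT = ("genus", "epithet", "rank")   # must be identical in both records
OPTIONAL = ("author", "year")           # may be missing on one side


def is_candidate(a, b):
    """Step 1: cheap detection, all strict attributes are equal."""
    return all(a.get(f) == b.get(f) for f in STRICT)


def is_verified(a, b):
    """Step 2: conservative check, optional attributes never conflict."""
    return all(
        not a.get(f) or not b.get(f) or a[f] == b[f]
        for f in OPTIONAL
    )


def merge(a, b):
    """Combine two verified duplicates, filling gaps in a from b."""
    merged = dict(a)
    for f in OPTIONAL:
        if not merged.get(f):
            merged[f] = b.get(f)
    return merged


def import_records(incoming):
    """Import records one by one, merging each into a verified duplicate."""
    stored = []
    for rec in incoming:
        for pos, existing in enumerate(stored):
            if is_candidate(existing, rec) and is_verified(existing, rec):
                stored[pos] = merge(existing, rec)
                break
        else:
            # No verified duplicate found: keep the record as a new entry.
            stored.append(dict(rec))
    return stored


incoming = [
    {"genus": "Quercus", "epithet": "robur", "rank": "species",
     "author": "L.", "year": None},
    {"genus": "Quercus", "epithet": "robur", "rank": "species",
     "author": None, "year": "1753"},
    {"genus": "Quercus", "epithet": "robur", "rank": "species",
     "author": "Mill.", "year": None},
]
for rec in import_records(incoming):
    print(rec)
```

In this example the first two records differ only in completeness and are merged, while the third has a conflicting author and is kept separate, which reflects the preference for avoiding false positives.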
The software developed is available on the BioCASE website (www.biocase.org).
Support is acknowledged from: the European Commission, Framework Programme 6, contract no RII-CT-2003-506117 (SYNTHESYS)
9.5. ALIS's Adventures in Wonderland
Samy Gaiji, Sonia Dias
Bioversity International
We will summarize the recent achievements of Bioversity International and the CGIAR System-wide Genetic Resources Programme (SGRP) in linking and integrating genebank information at a global scale. This was accomplished through the adoption of TDWG standards and Global Biodiversity Information Facility (GBIF) tools within the genebank community as a model. It is estimated that more than six million plant accessions are stored in ex situ collections worldwide, and digitized information on their essential characteristics has been gathered and stored in the databases of various institutions. Existing genebank information systems and portals, such as the CGIAR System-wide Information Network for Genetic Resources (SINGER) and the European Plant Genetic Resources Search Catalogue (EURISCO), are already major central entry points to such information. Their recent upgrade and adoption of TDWG/GBIF standards and protocols are making them more easily and readily accessible to the global community. Currently, information on over one th