Moving Targets: Integrating semistructured data
Pepe Ciardelli, Marc Geoffroy
Abstract
We will present our experience importing nomenclatural, taxonomic, bibliographical, and distribution data from text files into a relational database, as part of the Euro+Med Plantbase project. The aim of the project is to present a taxonomic inventory of vascular plants in Europe and the Mediterranean countries on the web.
There is a cultural gulf in taxonomic computing: few taxonomists fully appreciate that, in order to develop adequate computer software, every conceivable case must be dealt with in advance. Even when users have agreed beforehand on a detailed data format, the consequences of not adhering rigorously to this format are seldom appreciated.
Taxonomists expect software to follow their normal working processes, while at the same time expecting to access the results of their work in all possible formats. In practice, this may mean importing a document file (usually Microsoft Word), in atomized form, into a database. Although these data appear to be in a structured format, e.g., tables with pre-agreed spacing, special signifying characters, etc., they remain text, meant for humans, not computers, to read. There is normally no mechanism to check the validity of input at the instant it is typed; errors are recognized only when the document as a whole is parsed and loaded into the database. In our experience, importing a number of such files required us to adapt the import software continually.
The first complication arises from the fact that it is in practice impossible to iron all typographical errors out of a 400-page document. Programmers are expected to build error-tolerant software, and in fact, all such errors can, in our experience, be corrected programmatically after a few iterations. The larger problems lie in: variation between taxonomists’ standards; the peculiarities specific to certain taxonomic groups; lack of agreement on extra-taxonomic notations; and lack of communication between programmers and taxonomists.
For example, there may be only one group in a series of imports where species are allowed to be included in another species. If a programmer without any taxonomic background is given this file to import without warning, the entire algorithm for taxonomic inclusion may be thrown into chaos. Notoriously complex and exception-rich groups like Pilosella and Hieracium require a great deal of extra notation, and consequently communication between taxonomist and programmer is of the utmost importance.
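One way to keep such a group-specific exception from destabilizing the general inclusion logic is to state the default rule once and list exceptions per group. The following is a minimal sketch under stated assumptions, not the import software used in the project: the rank names, the `GROUP_EXCEPTIONS` table, the group name `ExampleGenus`, and the function `inclusion_allowed` are all hypothetical.

```python
# Hypothetical rank order, assumed for illustration: lower number = higher rank.
RANK_ORDER = {"genus": 0, "species": 1, "subspecies": 2}

# Hypothetical per-group exceptions: (parent rank, child rank) pairs that are
# permitted in addition to the default rule. "ExampleGenus" is a placeholder.
GROUP_EXCEPTIONS = {
    "ExampleGenus": {("species", "species")},
}


def inclusion_allowed(group, parent_rank, child_rank):
    """Return True if a taxon of child_rank may be included in parent_rank."""
    # Default rule: the parent must be of strictly higher rank than the child.
    if RANK_ORDER[parent_rank] < RANK_ORDER[child_rank]:
        return True
    # Otherwise the pair is accepted only if it is listed for this group.
    return (parent_rank, child_rank) in GROUP_EXCEPTIONS.get(group, set())
```

With this arrangement, a newly discovered exception becomes one more entry in the table, agreed with the taxonomist, rather than a change to the algorithm itself.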
There is no way to program for 100% of exceptions, and the programmer hours that would need to be invested are better spent putting unresolved cases in catch-all fields in the database, then allowing experts to resolve the data manually later. In the end, we must recognize that the taxonomist is always right, and learn to adapt to their work methods. However, the learning process must take place on both sides. Many taxonomists are not ready to invest the time in checking the success of the import with, for example, a web interface. In the best case, the programmer and the taxonomist would develop such a tool according to the taxonomist’s preferences. While the younger generation of taxonomists has a substantially greater level of comfort with computers and appreciation of their capabilities, the problem of moving targets will likely continue to exist for the foreseeable future.
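The catch-all approach mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual importer: the line format (`Genus epithet Author | distribution`), the pattern `NAME_LINE`, the record type `ImportedRecord`, and the function `import_lines` are all hypothetical. Lines that match the agreed format are atomized into fields; anything else is stored verbatim, with its line number, for an expert to resolve later.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical agreed format, assumed for illustration only:
#   "Genus epithet Author [| distribution]"
NAME_LINE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<epithet>[a-z\-]+)\s+"
    r"(?P<author>[^|]+?)\s*"
    r"(?:\|\s*(?P<distribution>.*))?$"
)


@dataclass
class ImportedRecord:
    line_no: int                        # position in the source document
    genus: Optional[str] = None
    epithet: Optional[str] = None
    author: Optional[str] = None
    distribution: Optional[str] = None
    unparsed: Optional[str] = None      # catch-all field for unresolved lines


def import_lines(lines):
    """Atomize lines that fit the format; keep the rest verbatim."""
    records = []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        match = NAME_LINE.match(text)
        if match:
            records.append(ImportedRecord(line_no, **match.groupdict()))
        else:
            # Do not guess: store the raw text for manual review.
            records.append(ImportedRecord(line_no, unparsed=text))
    return records
```

Keeping the line number alongside the raw text lets a reviewer locate the entry in the original document, and a review tool of the kind described above could simply list all records whose `unparsed` field is set.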