Proceedings of TDWG, 2007

Machine Learning to Produce Structured Records from Herbarium Label OCR

P. Bryan Heidorn, Qin Yin Wei

Abstract


In this session, we will demonstrate the HERBIS learning process, the XML (Extensible Markup Language) schemas used in learning and markup, and the principles for using the web interface and the web services interface, and we will discuss future developments. In the current version of HERBIS, all machine learning is run by the project programmer: end users submit raw OCR (optical character recognition) output to the classifier, and the system returns an XML document. In the new version, users will be able to give the system accuracy feedback, allowing performance to improve as the system gains experience.
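The last step of that workflow, returning an XML document, can be illustrated with a minimal sketch. It assumes the classifier has already assigned a field name to each text segment. The function name, the root element "label", the child element names, and the sample values are hypothetical and are not taken from the HERBIS schemas.

```python
import xml.etree.ElementTree as ET


def build_label_record(fields):
    """Wrap (element_name, text) pairs in a single XML record.

    The root element name "label" is an assumption made for this sketch.
    """
    record = ET.Element("label")
    for name, text in fields:
        child = ET.SubElement(record, name)
        child.text = text
    return record


# Hypothetical classifier output for one herbarium label.
classified_segments = [
    ("family", "Asteraceae"),
    ("genus", "Solidago"),
    ("species", "canadensis"),
    ("collection_date", "12 Aug 1934"),
]

record = build_label_record(classified_segments)
# encoding="unicode" returns a str rather than bytes.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A flat record is used here only for brevity; a real schema may nest elements or carry attributes.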

As presented elsewhere (1), supervised machine learning (SML) techniques and learning by example can be used to transform herbarium specimen label data to digital format. In the HERBIS project, the objective of SML is to build a computer system that recognizes patterns in the OCR output of scanned herbarium labels and converts them into 36 XML components, such as family, genus, species, author, variety, location, collection date, and annotations, for convenient ingestion into museum databases. To accomplish this, the human trainer gives the computer properly classified examples to learn from. The computer generalizes from these examples to extract information correctly from previously unseen examples. A computer never forgets an example it has seen but, like a savant child, it cannot recognize something it has never seen before. For example, the determiner on a label might be indicated by “Determiner:”, “DET”, or “Det.”, all of which are different from the computer's point of view. It is therefore the job of the human trainer to provide carefully selected examples that are representative of the tasks the computer will later be asked to perform, including typical OCR errors. The trainer must tell the computer how to classify strings like “DFT:”, where the OCR misread a faded “E” as an “F”, as well as many other systematic errors. Using a combination of Rote Patterns Learning, Naïve Bayes classification, Hidden Markov Models, and other techniques, HERBIS reaches high accuracy on some elements but not all. Performance is being improved through better algorithms and better training examples. With a little practice, botanists can learn to provide training examples that allow the HERBIS SML system to convert herbarium label data efficiently to database format.
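The idea of learning from classified examples, including OCR-damaged ones, can be sketched with a small Naïve Bayes classifier. This is an illustration only, not the HERBIS implementation: the use of character bigrams as features, the class names, the collector examples, and the training list are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(token):
    """Lowercase the token and split it into overlapping character bigrams."""
    padded = "^" + token.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class NaiveBayesTokenClassifier:
    """Multinomial Naive Bayes over character bigrams with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for token, label in examples:
            self.class_counts[label] += 1
            for gram in char_bigrams(token):
                self.feature_counts[label][gram] += 1
                self.vocabulary.add(gram)

    def classify(self, token):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for gram in char_bigrams(token):
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples; the trainer includes typical OCR misreadings.
TRAINING_EXAMPLES = [
    ("Determiner:", "determiner_marker"),
    ("DET", "determiner_marker"),
    ("Det.", "determiner_marker"),
    ("DFT:", "determiner_marker"),   # faded "E" misread as "F"
    ("Collector:", "collector_marker"),
    ("COLL", "collector_marker"),
    ("Coll.", "collector_marker"),
    ("Col1.", "collector_marker"),   # "l" misread as "1"
]

classifier = NaiveBayesTokenClassifier()
classifier.train(TRAINING_EXAMPLES)

# Variants that do not appear in the training list.
for unseen in ["Det:", "DFT.", "Coll:"]:
    print(unseen, "->", classifier.classify(unseen))
```

Because the features are character fragments rather than whole strings, the sketch can label a variant it has not seen, but only when that variant shares fragments with the training examples. This is why representative examples matter.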

(1) Heidorn, P. Bryan, Wei Yin Qin, Beaman, Reed and Cellinese, Nico (2007). Learning by Example: Machine Learning and Herbarium Label Digitization. Joint Plant Science and Conference Botany 2007, Chicago, Illinois, July 7-11, 2007.