Using Automatically Extracted Information in Species Page Retrieval
Xiaoya Tang, P. Bryan Heidorn
Abstract
Users searching botanical texts online through currently available full-text indexes such as Google must accurately guess the vocabulary of the original author(s) to find the desired results, and author vocabulary often differs greatly from the user's search vocabulary. A large number of botanical volumes are available electronically, and many more are being made available through projects such as the Encyclopedia of Life and the Biodiversity Heritage Library. However, the retrieval systems currently available for these collections cannot interpret specific information requests correctly and match them with appropriate documents. We present a study that integrates text mining techniques into the full-text search process and automatically identifies selected plant morphological information in the text to assist keyword-based retrieval. The technique could be extended to other document collections.
An experiment involving users was conducted to evaluate this approach on the full text of the Flora of North America (FNA). Thirty upper-level undergraduates and graduate students from two Illinois universities who had completed a course in botany were asked to identify ten herbarium specimens of trees of Illinois. The subjects used a full-text search engine with an index of several volumes of FNA. The user search logs were used to identify the plant characteristics most frequently used by the students, independent of the usefulness of these terms for retrieving taxonomic treatments using full-text search. These characters were targeted for text extraction. A set of treatments was marked by hand to serve as training examples, and a machine learning method was used to learn extraction patterns. These commonly used characters were then mined from the 1637 treatments in the FNA. Extraction accuracy varied by information type: it was between 60% and 100%, except for leaf shape and leaf arrangement, which were around 50% and 30%, respectively. In a second experiment, one group of 12 subjects used a traditional full-text search system while another group of 12 used full-text search plus pull-down menus and web forms that allowed them to search on the machine-extracted information. The experimental results indicate that the latter approach significantly improves keyword-based retrieval performance: users completed more identification tasks successfully than when they had to generate their own search terms. It also increases users' satisfaction with the retrieval system.
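The idea of combining keyword search with machine-extracted characters can be illustrated with a minimal Python sketch. It is not the system used in the study: the `Treatment` record, the `search` function, the field names such as `leaf_shape`, and the three toy records are all hypothetical and are not drawn from FNA. The sketch only shows how a menu selection can retrieve a treatment whose author used different wording from the user.

```python
from dataclasses import dataclass, field


@dataclass
class Treatment:
    """One taxonomic treatment: full text plus extracted characters (hypothetical schema)."""
    taxon: str
    text: str
    characters: dict = field(default_factory=dict)


def search(treatments, keywords=(), facets=None):
    """Return taxa whose text contains every keyword and whose
    extracted characters match every selected facet value."""
    facets = facets or {}
    results = []
    for t in treatments:
        text = t.text.lower()
        # Keyword part: plain substring match against the full text.
        if not all(k.lower() in text for k in keywords):
            continue
        # Facet part: exact match against machine-extracted values.
        if all(t.characters.get(name) == value for name, value in facets.items()):
            results.append(t.taxon)
    return results


# Toy records for illustration only; placeholder taxa, not FNA content.
TOY = [
    Treatment("Taxon A", "Leaves opposite, blades palmately lobed.",
              {"leaf_arrangement": "opposite", "leaf_shape": "palmately lobed"}),
    Treatment("Taxon B", "Leaves alternate, blades ovate, margins serrate.",
              {"leaf_arrangement": "alternate", "leaf_shape": "ovate"}),
    Treatment("Taxon C", "Leaf blades egg-shaped; leaves borne singly at each node.",
              {"leaf_arrangement": "alternate", "leaf_shape": "ovate"}),
]

keyword_hits = search(TOY, keywords=["ovate"])
facet_hits = search(TOY, facets={"leaf_shape": "ovate"})
```

In this toy data the keyword query misses Taxon C because its text says "egg-shaped" rather than "ovate", while the facet query finds it through the extracted `leaf_shape` value. In a real system the facet values would come from the extraction step and would carry its errors, so facet matches would be only as reliable as the extraction accuracy for that character.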