An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library
Qin Wei, Chris Freeland, P. Bryan Heidorn
Abstract
Taxonomic Name Recognition (TNR), which decides whether a text string is a taxonomic name and where that name begins and ends, largely determines whether users and researchers of the Biodiversity Heritage Library (BHL) digitization project can efficiently find the materials they want. The BHL has incorporated TaxonFinder, a taxonomic name finding algorithm and service provided by uBio.org, into its portal to identify and verify taxonomic name strings found within the digitized BHL corpus. We performed an eight-week evaluation to determine the factors affecting the accuracy of the results returned. Our findings are valuable not only for the BHL but also for other digital projects that want to mine the text of their collections. In this evaluation project, we explored and analyzed three factors that influence performance: 1) Optical Character Recognition (OCR), which transforms page images into text; 2) TNR matching algorithms, which identify taxonomic names in text; and 3) the completeness of NameBank, which is used as the authority file for name verification.
We randomly selected 392 pages from the BHL database, which contained 4,843,619 pages at the beginning of our project. This sample included 3,003 valid names (2,610 unique names), which were identified manually by a group of biologists. For this sample, the OCR error rate for name strings was 35.16%, meaning that the OCR software packages correctly output 64.84% of the 3,003 valid names.
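The sampling step and the scale of the OCR problem can be illustrated with a minimal sketch. It assumes, purely for illustration, that pages carry consecutive integer identifiers starting at 1; the function name `draw_page_sample`, the constants, and the fixed seed are hypothetical and do not describe how the sample was actually drawn.

```python
import random

# Figures taken from the text above.
TOTAL_PAGES = 4_843_619   # pages in the BHL database at project start
SAMPLE_SIZE = 392         # randomly selected pages
VALID_NAMES = 3_003       # valid names identified manually in the sample
OCR_ERROR_RATE = 0.3516   # share of name strings with OCR errors


def draw_page_sample(total_pages: int, sample_size: int, seed: int = 0) -> list[int]:
    """Draw a reproducible simple random sample of page IDs without replacement.

    Assumption: page IDs are the integers 1..total_pages (hypothetical).
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pages + 1), sample_size))


sample = draw_page_sample(TOTAL_PAGES, SAMPLE_SIZE)

# Fraction of the corpus covered by the sample.
sampling_fraction = SAMPLE_SIZE / TOTAL_PAGES

# Approximate number of valid names affected by OCR errors.
names_with_ocr_errors = round(OCR_ERROR_RATE * VALID_NAMES)

print(f"sampled {len(sample)} pages ({sampling_fraction:.4%} of the corpus)")
print(f"about {names_with_ocr_errors} of {VALID_NAMES} names contain OCR errors")
```

Under these figures, roughly a thousand of the manually identified names reach the TNR algorithms in a damaged form, which is why tolerance for OCR errors matters in the evaluation that follows.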
In digitization projects such as the BHL, TNR must be able to find names even when they contain OCR errors. Our evaluation standard therefore included taxonomic name strings that humans could identify as names despite OCR errors. We assessed two TNR matching algorithms that are widely used within the biodiversity community: TaxonFinder and FAT (Find All Taxonomic Names). Performance was evaluated with two measures: Precision (P) and Recall (R). Precision is the proportion of algorithm-identified strings that are valid names; in our case, it reflects the capability of the algorithm to identify valid names while excluding non-valid ones. Recall is the proportion of valid names in the sample that the algorithm recognizes; it reflects the capability of finding all valid names in the collection. TaxonFinder found 1,540 names, 674 of which were valid. FAT found 1,603 names, 517 of which were valid. The precision of TaxonFinder and FAT is 43.77% (=674/1540) and 32.25% (=517/1603), respectively. The recall of TaxonFinder is 25.82% (=674/2610) and that of FAT is 17.21% (=517/3003).
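The two measures can be reproduced from the reported counts with a short sketch. The names `precision`, `recall`, `summarize`, and `REPORTED_COUNTS` are illustrative only, and the recall denominators are taken exactly as printed above (2,610 for TaxonFinder and 3,003 for FAT) rather than harmonized.

```python
def precision(correct: int, found: int) -> float:
    """Proportion of algorithm-identified strings that are valid names."""
    return correct / found


def recall(correct: int, valid: int) -> float:
    """Proportion of valid names in the sample that the algorithm recognized."""
    return correct / valid


# Counts as reported in the text; "valid" is the recall denominator as printed.
REPORTED_COUNTS = {
    "TaxonFinder": {"found": 1540, "correct": 674, "valid": 2610},
    "FAT": {"found": 1603, "correct": 517, "valid": 3003},
}


def summarize(counts: dict[str, dict[str, int]]) -> dict[str, tuple[float, float]]:
    """Return {algorithm: (precision, recall)} for each algorithm."""
    return {
        name: (precision(c["correct"], c["found"]), recall(c["correct"], c["valid"]))
        for name, c in counts.items()
    }


summary = summarize(REPORTED_COUNTS)
for name, (p, r) in summary.items():
    print(f"{name}: precision = {p:.2%}, recall = {r:.2%}")
```

Keeping the counts in one table makes it easy to recompute both measures against a different denominator, for example all valid names versus unique valid names.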
For TaxonFinder, the NameBank omission rate is 5.4%; that is, 5.4% of the real names found by TaxonFinder were not in NameBank. This indicates that names missing from the NameBank authority file are not the major source of information loss when converting page images into a structured, searchable database.
Our results indicate that improving the performance of TNR algorithms is the main challenge in producing an index to taxonomic names within digital library projects like the BHL. Future work should determine which names the algorithms fail to find and why they are missed.