TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases
Tony Rees
Abstract
Misspelled or variantly spelled taxon names are a common problem in taxonomic data systems, and some systems have accordingly recognized a role for “near” or “fuzzy” matching techniques that detect similar, but not identically spelled, taxon names. This approach can be of value in a number of use cases, including:
• user queries against a taxonomic database or similar resource – where either the input term, or the available target term may be misspelled;
• deduplication of existing content, for example in disparate sources prior to merging, or within a single resource after such merging;
• handling distributed queries, where the same name may be present in multiple forms (variants or misspellings) in different resources to be searched, which ideally should all be returned to satisfy a relevant user query; and
• spelling error detection and suggested corrections – if a suitably complete and authoritative reference database is available.
The present author’s activities in this area have included a series of phonetic “near match” algorithms developed over the period 2001-2007, which have been deployed in a range of taxonomic information systems in Australia, Europe, and the U.S.A., and more recently (2007-2008) a more comprehensive algorithmic solution that has been termed TAXAMATCH. This algorithm is capable of detecting non-phonetic as well as phonetic spelling errors or mismatches, with an execution time at present of 1-2 seconds or less against a database of over 1.4 million species names on the author’s reference system (the “IRMNG” database of genus names for plants, animals and bacteria, plus many associated species, maintained at CSIRO Marine and Atmospheric Research, Australia). For specific queries using real world test data, it provides good performance for both recall (return of most or all “relevant” near matches) and precision (the proportion of returned matches that are “relevant” rather than “non relevant”).
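As an illustration only, and not the TAXAMATCH algorithm itself, a generic non-phonetic near match can be sketched in Python using an edit distance that counts insertions, deletions, substitutions, and adjacent transpositions. The function names, the `max_distance` threshold, and the sample names are assumptions introduced for this sketch; recall and precision are computed in their standard information-retrieval sense.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ie" <-> "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def near_matches(query, names, max_distance=2):
    """Return (distance, name) pairs within max_distance, closest first.
    The threshold of 2 is an arbitrary choice for this sketch."""
    q = query.lower()
    scored = ((edit_distance(q, n.lower()), n) for n in names)
    return sorted((dist, n) for dist, n in scored if dist <= max_distance)


def precision_recall(returned, relevant):
    """Precision: share of returned names that are relevant.
    Recall: share of relevant names that were returned."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample data, not drawn from IRMNG.
names = ["Homo sapiens", "Homo habilis", "Pan troglodytes"]
print(near_matches("Homo sapeins", names))  # [(1, 'Homo sapiens')]
```

A brute-force scan of this kind compares the query with every candidate name, so it would not by itself reach the response times quoted above on a list of over a million names; some form of pre-filtering or indexing would be needed in practice.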
With TAXAMATCH development essentially complete, attention can be turned to (1) deployment across a range of suitable systems as desired, and (2) its application to situations such as those described above, including (for example) user searching of available resources e.g. in the Atlas of Living Australia project, provision of web services using IRMNG or other content, and the potential for general use as a “taxonomic spell checker”. This last application would depend on availability of a relevant reference list or lists that are authoritative, correct, and complete; while these aspects are to an extent both subjectively defined and also a continuously moving target, the present and emerging availability of extensively scrutinized lists such as Catalogue of Life, WoRMS, ZooBank, Index Fungorum, etc. does present some possibilities in this regard, which will be discussed further during this presentation.
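The “taxonomic spell checker” idea can be sketched, under stated assumptions, as a lookup against a reference list followed by a near match step when no exact match is found. The sketch below uses the similarity ratio in Python's standard `difflib` module as a stand-in for TAXAMATCH; the tiny `REFERENCE_NAMES` list, the `check_name` function, and the `cutoff` value are all hypothetical, where a real deployment would draw on an authoritative list such as those named above.

```python
import difflib

# Hypothetical reference list; a real one would hold many thousands of names.
REFERENCE_NAMES = [
    "Homo sapiens",
    "Panthera leo",
    "Panthera tigris",
    "Quercus robur",
]


def check_name(name, reference=REFERENCE_NAMES, cutoff=0.8):
    """Classify a name as an exact match, a probable misspelling with
    suggested corrections, or unknown to the reference list."""
    if name in reference:
        return ("exact", [name])
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and keeps at most n of those scoring at least cutoff.
    suggestions = difflib.get_close_matches(name, reference, n=3, cutoff=cutoff)
    if suggestions:
        return ("suggest", suggestions)
    return ("unknown", [])


print(check_name("Panthera tigriss"))  # ('suggest', ['Panthera tigris'])
```

The “unknown” outcome shows why completeness of the reference list matters: a correctly spelled name that is simply absent from the list cannot be distinguished from a badly misspelled one.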
Further information on TAXAMATCH can be obtained via a previous conference presentation http://www.marinebiodiversity.ca/OBI07/sessions/species-names-management-and-tools/oral-rees/, plus an upcoming paper in the biodiversity informatics literature, while the author’s present reference implementation is available for online user exploration via the IRMNG (Interim Register of Marine and Nonmarine Genera) search interface at http://www.cmar.csiro.au/datacentre/irmng/.