Proceedings of TDWG, 2009

Using Citizen Science to Process Digital Herbarium Labels

Michael Giddens

Abstract


At SilverBiology, we are developing a software engine called “SilverArchive” to process typed and handwritten data from digitized herbarium specimen labels. Collections provide large digitized specimen images for processing. These images are loaded into a queue for parallel processing through a website, http://www.helpingscience.org (in closed beta testing at the time of writing). Citizen scientists sign in to the website to perform three different tasks. The first task is to identify the locations of all labels and determinations on a given specimen sheet, the second is to identify the Darwin Core (DwC) fields within each label, and the third is to type in the text shown in each field image.

Once a user has identified the labels on a specimen sheet by outlining their borders with a mouse, an image of each label is created and sent to Evernote (http://www.evernote.com) for optical character recognition (OCR). Evernote returns the position of every word, all the permutations of each word, and whether the label is handwritten or typed. We use this information to make educated guesses that expedite the field tagging process. We rely primarily on human input for accuracy and use the OCR information only as a secondary source.

Each part of the specimen label, such as the scientific name, date, or country, is assigned to its associated DwC field. A human assigns these tags using a simple click-and-drag interface. Once tagging is complete for a label, each marked field is cropped into its own image so that the fields can be processed in parallel.
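The tagging step can be illustrated with a minimal sketch of how a click-and-drag gesture might be turned into a crop region for one DwC field. This is an assumption-laden illustration, not the SilverArchive implementation: the `FieldTag` class, the `tag_from_drag` function, and the pixel coordinate convention are all hypothetical. Only the term names (`scientificName`, `eventDate`, `country`) come from Darwin Core itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldTag:
    """A hypothetical record tying a DwC term to a rectangle on a label image."""
    dwc_term: str   # e.g. "scientificName", "eventDate", "country"
    left: int
    top: int
    right: int
    bottom: int


def _clamp(value: int, upper: int) -> int:
    """Keep a coordinate inside the range 0..upper."""
    return max(0, min(value, upper))


def tag_from_drag(dwc_term: str,
                  start: Tuple[int, int],
                  end: Tuple[int, int],
                  width: int,
                  height: int) -> FieldTag:
    """Build a crop rectangle from the two corners of a mouse drag.

    The drag may run in any direction and may overshoot the image,
    so the corners are sorted and clamped to the image bounds.
    """
    x_low, x_high = sorted((start[0], end[0]))
    y_low, y_high = sorted((start[1], end[1]))
    return FieldTag(
        dwc_term,
        _clamp(x_low, width),
        _clamp(y_low, height),
        _clamp(x_high, width),
        _clamp(y_high, height),
    )


# A drag from bottom-right to top-left that overshoots a 100x50 label image.
tag = tag_from_drag("scientificName", (120, 40), (20, 10), 100, 50)
```

Each resulting rectangle could then be cropped into its own field image and queued independently, which is what makes the per-field parallel processing possible.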

Each tagged field is examined by three or more distinct users (citizen scientists), each of whom types in what they think the field image says. Typing the words is the most time-consuming task, so we are trying different game-style interfaces to see which type of game gives the best response. No user sees the same field image twice. Each field image is circulated until enough people type in the same value, which gives a measure of accuracy. When the predetermined level of accuracy has been reached, the value for the field is accepted. Once all the field values are verified for a given label, a Darwin Core record is created.
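The agreement rule described above can be sketched as a simple vote over the transcriptions collected for one field image. This is only an illustration under stated assumptions: the function names, the thresholds (`min_users=3`, `min_agreement=2`), and the normalization of case and whitespace are hypothetical, and the actual acceptance criterion used by SilverArchive is not specified here.

```python
from collections import Counter
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivially different entries match."""
    return " ".join(text.split()).casefold()


def accepted_value(entries: Dict[str, str],
                   min_users: int = 3,
                   min_agreement: int = 2) -> Optional[str]:
    """Return the agreed transcription, or None if the image must keep circulating.

    `entries` maps a user id to that user's transcription, so each user
    contributes at most one entry per field image.
    Both thresholds are assumed values chosen for illustration.
    """
    if len(entries) < min_users:
        return None  # not enough distinct users have seen the image yet

    counts = Counter(normalize(text) for text in entries.values())
    top_value, votes = counts.most_common(1)[0]
    if votes < min_agreement:
        return None  # no value has enough matching entries

    # Return the first matching entry as typed, with whitespace tidied.
    for text in entries.values():
        if normalize(text) == top_value:
            return " ".join(text.split())
    return None


result = accepted_value({
    "user_a": "Quercus alba",
    "user_b": "quercus  alba",
    "user_c": "Quercus rubra",
})
```

In this sketch a field image that returns `None` would simply be shown to another user who has not yet seen it, and the check would be repeated once the new entry arrives.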

All processed data are run through a series of taxonomic and geographic validations. Any issues are reported to the collection manager for review. All data will be available in a variety of formats, including DwC.