Data Quality Tools, Services and Workflows - Task Group Charter

Convener

Lee Belbin - leebelbin(at)gmail.com

Core Members

Members self-identified in response to call for action on the GBIF Community website: http://community.gbif.org/pg/forum/topic/45759/call-for-action-creation-of-task-groups/ 

Daniel Amariles <damariles(at)humboldt.org.co>
Anne-Sophie Archambeau <archambeau(at)gbif.fr>
Arturo H. Ariño Plana <artarip(at)unav.es>
Vijay Barve <vijay.barve(at)gmail.com>
Nelyda Beltran <nbeltran(at)gmail.com>
David Bloom <dbloom(at)vertnet.org>
Dimitri Brosens <dimitri.brosens(at)inbo.be>
Dairo Escobar <dairoescobar(at)gmail.com>
David Fichtmueller <d.fichtmueller(at)bgbm.org>
Rui Figueira <rui.figueira(at)iict.pt>
Luiz Gadelha <lgadelha(at)lncc.br>
Falko Gloeckler <Falko.gloeckler(at)mfn-berlin.de>
Elspeth Haston <e.haston(at)rbge.org.uk>
Hanna Koivula <Hanna.koivula(at)helsinki.fi>
Marie-Elise Lecoq <melecoq(at)gbif.fr>
Bertram Ludaescher <Ludaesch(at)gmail.com>
James Macklin <James.macklin(at)gmail.com>
Nicolas Noé <n.noe(at)biodiversity.be>
Matthias Obst <Matthias.obst(at)bioenv.gu.se>
Javier Otegui <Javier.otegui(at)gmail.com>
Sophie Pamerlon <pamerlon(at)gbif.fr>
Debbie Paul <dpaul(at)fsu.edu>
Dmitry Schigel [GBIF] <dschigel(at)gbif.org>
Allan Koch Veiga <Allan.kv(at)gmail.com>
Daniel Lins <daniel.lins(at)gmail.com>

Motivation

Other than data availability, ‘Data Quality’ is probably the most significant issue for users of biodiversity data, and this is especially so for the research community.

This Task Group is reviewing practical aspects of ‘data quality’ with the goal of providing Best Current Practice.

If a list of practical data quality resources (tools, services and workflows) can be provided to users of biodiversity records, then greater and more appropriate use could be made of biodiversity data. Data providers, and particularly aggregators such as GBIF and its nodes, would have increased credibility with the user communities and be able to provide more effective information for judging fitness for use.

The other Data Quality Task Groups will focus on an overview or framework (TG1) and Case Studies (TG3). This Task Group will certainly relate to the Darwin Core Standard and possibly other TDWG standards, but its focus is the practical aspects of data quality.

I (Lee Belbin) raised the need for a practical set of tools related to Data Quality at the TDWG 2010 Conference at Woods Hole. What I was asking for was at least the public display of the rules that were being used by GBIF to flag issues in their records. This didn’t happen, so we are trying again and we will include any agency that provides biodiversity records to the public.

Goals, Outputs and Outcomes

A set of rules, tests and resulting data assertions that are in use by agencies to flag record issues (March 2016). The extent of the report will depend on which agencies have responded.

A set of software tools that can be used to assist with data quality (March 2016). These will be based on the GBIF Data Quality software resource.

Optionally, if resources permit, a list of workflows that are in use that assist with data quality (March 2016).

Strategy

The tests and rules generating assertions at the record-level are more fundamental than the tools or workflows that will be based on them. The priority will therefore be to create a comprehensive list of these tests, rules and assertions, and where and how they are used. For example, GBIF's set can be found at https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/OccurrenceIssue.java while the Atlas of Living Australia has a more comprehensive site at biocache.ala.org.au/ws/assertions/codes. These will form the base.

Contact other agencies such as BISON, EoL, eBird, CRIA, DataONE etc. to find out what, if any, rules, assertions or tests they use that are provided to their users along with the data records.
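As an illustration of what a record-level test and its resulting assertion might look like, here is a minimal sketch in Python. It assumes a record is a simple mapping of Darwin Core terms (decimalLatitude, decimalLongitude) to string values. The function names and assertion codes are hypothetical, chosen for illustration only; they are not taken from the GBIF or Atlas of Living Australia lists.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
# A test inspects one record and returns an assertion code, or None if it passes.
Test = Callable[[Record], Optional[str]]


def _to_float(value: Optional[str]) -> Optional[float]:
    """Parse a coordinate string; return None if absent or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def coordinates_missing(record: Record) -> Optional[str]:
    # Hypothetical assertion code: flags records lacking either coordinate.
    for term in ("decimalLatitude", "decimalLongitude"):
        if record.get(term) in (None, ""):
            return "COORDINATES_MISSING"
    return None


def coordinates_out_of_range(record: Record) -> Optional[str]:
    lat = _to_float(record.get("decimalLatitude"))
    lon = _to_float(record.get("decimalLongitude"))
    if lat is None or lon is None:
        return None  # nothing numeric to check; left to other tests
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "COORDINATES_OUT_OF_RANGE"
    return None


def coordinates_zero(record: Record) -> Optional[str]:
    lat = _to_float(record.get("decimalLatitude"))
    lon = _to_float(record.get("decimalLongitude"))
    if lat == 0 and lon == 0:
        return "COORDINATES_ZERO"
    return None


TESTS: List[Test] = [coordinates_missing, coordinates_out_of_range, coordinates_zero]


def run_tests(record: Record, tests: List[Test] = TESTS) -> List[str]:
    """Apply every test to one record and collect the assertions raised."""
    return [code for code in (test(record) for test in tests) if code is not None]
```

Keeping each test as a small independent function that returns a named assertion makes the rule set easy to list, publish and compare across agencies, which is the kind of inventory this Task Group intends to compile.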

Becoming Involved

This Task Group would welcome anyone who has a practical interest in data quality and/or has experience with the tests, rules, assertions, tools or workflows.

Contact the Convener

Summary

The Task Group will provide a report of the practical tests, rules, assertions, software and workflows associated with data quality of biodiversity-related records. Along with the work of the other Data Quality Task Groups, this should provide a basis for a standard approach to data quality that should be used by all agencies providing biodiversity-related data.

Resources

Belbin, L., Daly, J., Hirsch, T., Hobern, D. and LaSalle, J. (2013). A specialist’s audit of aggregated occurrence records: An ‘aggregators’ response. ZooKeys 305: 67–76. doi: 10.3897/zookeys.305.5438.

Chapman, AD (2005a). Principles and Methods of Data Cleaning – Primary Species and Species Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 75p.

Chapman, AD (2005b). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 61p.

Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V (2012). Quality assurance and intellectual property rights in advancing biodiversity data publications version 1.0, Copenhagen: Global Biodiversity Information Facility, 40p, ISBN: 87‐92020‐49‐6.

Mesibov R (2013) A specialist’s audit of aggregated occurrence records. ZooKeys 293: 1–18. doi: 10.3897/zookeys.293.5111

Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1): e55144. doi:10.1371/journal.pone.0055144

https://github.com/tdwg/infrastructure/issues/48

  Last Modified: 11 November 2015