How to handle duplication in large datasets and import scenarios
Andreas Müller, Markus Döring, Walter G. Berendsohn
Abstract
When integrating, processing or querying biodiversity data, one sooner or later must address the problems raised by the existence of physical or digital duplicates. Both the creation of such duplicates and the failure to find them may lower the quality of the information in terms of completeness, readability or consistency of the dataset.
For the EU-funded SYNTHESYS project (A Synthesis of Systematics Resources) we developed a duplicate detection tool for the GBIF index of specimen and observation data as well as tools for importing taxonomic data into Berlin Model databases. In this context we developed different algorithms to handle such duplicates.
The current GBIF index contains about 100 million specimen and observation records. Querying such a database for duplicates online requires sophisticated techniques, because comparing the query record with each individual record is too costly in terms of processing time. Hence an algorithm has been developed that adapts known record linkage techniques, using pre-computed standardization and blocking followed by online comparison and classification.
GBIF data are already largely standardized, so little effort was invested in the standardization step. For blocking, a multi-channel sorted neighbourhood mechanism is used: records are inserted into several sorted indices, each designed so that duplicates are stored close to one another with high probability. When queried, this filtering component passes only those records that lie in close proximity to the original record in at least one of the indices. The remaining candidates are compared by probability-based functions that work at both the attribute level and the record level. Finally, classification depends on the type of duplicate searched for, physical or digital. The result set is fuzzy, i.e., not only exact duplicates are returned. This takes into account that data may change along the pathway from collection to import into the GBIF index.
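The blocking, comparison and classification steps can be illustrated with a minimal Python sketch. It is not the SYNTHESYS implementation: the record fields, the two sorting channels, the window size and the threshold are all assumptions chosen for illustration, and a plain string similarity from the standard library stands in for the probability-based comparison functions.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

# Hypothetical records; the field names are illustrative, not the GBIF schema.
RECORDS = [
    {"id": 1, "name": "Bellis perennis", "collector": "Smith", "date": "1999-05-01"},
    {"id": 2, "name": "Bellis perenis", "collector": "Smith", "date": "1999-05-01"},
    {"id": 3, "name": "Quercus robur", "collector": "Jones", "date": "2001-07-12"},
    {"id": 4, "name": "Abies alba", "collector": "Smith, J.", "date": "1999-05-01"},
    {"id": 5, "name": "Zea mays", "collector": "Brown", "date": "1980-01-01"},
]

# Each channel is one sorting key (assumed here, not taken from the source).
CHANNELS = {
    "name": lambda r: r["name"].lower(),
    "collector_date": lambda r: r["collector"].lower() + "|" + r["date"],
}
COMPARED_FIELDS = ("name", "collector", "date")


def build_indices(records):
    """Blocking, offline part: one sorted list of (key, id) pairs per channel."""
    return {
        channel: sorted((key(r), r["id"]) for r in records)
        for channel, key in CHANNELS.items()
    }


def candidate_ids(query, indices, window=1):
    """Blocking, online part: ids near the query in at least one index."""
    found = set()
    for channel, key in CHANNELS.items():
        index = indices[channel]
        # Position where the query key would be inserted in this index.
        pos = bisect_left(index, (key(query),))
        lo, hi = max(0, pos - window), pos + window + 1
        found.update(rec_id for _, rec_id in index[lo:hi])
    found.discard(query["id"])  # the query record is not its own duplicate
    return found


def similarity(a, b):
    """Record-level score: mean of the attribute-level string similarities."""
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in COMPARED_FIELDS]
    return sum(scores) / len(scores)


def find_duplicates(query, records, window=1, threshold=0.9):
    """Compare only the blocked candidates and classify them by threshold."""
    indices = build_indices(records)  # would be pre-computed in practice
    by_id = {r["id"]: r for r in records}
    return sorted(
        rec_id
        for rec_id in candidate_ids(query, indices, window)
        if similarity(query, by_id[rec_id]) >= threshold
    )
```

Calling `find_duplicates(RECORDS[0], RECORDS)` compares the query with only the few records that are its neighbours in one of the indices instead of with the whole dataset. Using a different `threshold` per duplicate type (physical or digital) would be one way to make the classification depend on what is searched for.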
Avoiding duplicates during the automatic import of data into a taxonomic Berlin Model database requires more conservative comparison functions, because false positives should be avoided here. Still, records should be detected as duplicates if they differ only in the completeness of some less important attributes. To handle this problem, a rule-based two-step algorithm for an object-oriented Berlin Model persistence layer has been developed that detects duplicate candidates and merges them if they are verified as duplicates. To this end, a set of rules has been proposed to handle different types of attributes and attribute groups. The rules are easy to adapt to the differing needs of users.
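A rule-based two-step check of this kind might look like the following Python sketch. The attribute names and the two rule types ("identical" and "compatible") are hypothetical and do not reproduce the actual Berlin Model rule set; the sketch only shows how candidates can be detected cheaply and then merged when they differ in completeness alone.

```python
# Hypothetical rules: attribute name -> rule type (assumed for illustration).
RULES = {
    "full_name": "identical",   # must be present and equal in both records
    "author": "identical",
    "year": "compatible",       # may be missing on one side, must not conflict
    "remarks": "compatible",
}


def is_candidate(existing, incoming):
    """Step 1: cheap test on the attributes that must be identical."""
    return all(
        existing.get(attr) is not None and existing.get(attr) == incoming.get(attr)
        for attr, rule in RULES.items()
        if rule == "identical"
    )


def merge_if_duplicate(existing, incoming):
    """Step 2: verify the remaining rules; return the merged record or None."""
    if not is_candidate(existing, incoming):
        return None
    merged = dict(existing)
    for attr, rule in RULES.items():
        if rule != "compatible":
            continue
        old, new = existing.get(attr), incoming.get(attr)
        if old and new and old != new:
            # Conflicting values: stay conservative and keep both records.
            return None
        # Empty or missing values are filled from the other record.
        merged[attr] = old or new
    return merged


# Usage example with made-up records.
stored = {"full_name": "Bellis perennis", "author": "L.", "year": "1753", "remarks": None}
imported = {"full_name": "Bellis perennis", "author": "L.", "year": None, "remarks": "type species"}
conflicting = {"full_name": "Bellis perennis", "author": "L.", "year": "1754", "remarks": None}

merged = merge_if_duplicate(stored, imported)
rejected = merge_if_duplicate(stored, conflicting)
```

Here `merged` combines the year from the stored record with the remark from the imported one, while `rejected` is `None` because the two years conflict. Adapting the behaviour to other users would then mean editing the `RULES` table rather than the algorithm.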
The software developed is available on the BioCASE website (www.biocase.org).