A Hadoop-based Prototype for the Filtered-Push project
Zhimin Wang, Hui Dong, Maureen Kelly, James A. Macklin, Paul J. Morris, Robert A. Morris
Abstract
The Filtered-Push project aims to establish a cross-institutional infrastructure to help taxonomists share and improve digitized collection data via the exchange and management of record annotations. The project addresses three major challenges: the identification and annotation of specimen records in multiple collections that arose from a single collection event; the quality control of new annotations; and, more generally, the dissemination of annotations of specimen records, whether or not they represent duplicate specimens.
To address these challenges, we first decompose the system into five modules: the client API, the network communication module, the schema translation module, the network node adapter, and the storage module. These modules are joined through well-defined interfaces, which also give us the flexibility to change the underlying technology of each module. In a prototype, we are adopting the Apache Hadoop map-reduce framework (http://hadoop.apache.org) for the communication and storage modules for the following reasons. First, the network discovery of duplicates fits the map-reduce model well, in that one can understand the process as combining (reduce) results from local searches (map). Second, Hadoop makes it easy to distribute programs and computation tasks across the network. Third, the Hadoop HBase distributed database provides high availability through its robust, transparent file replication architecture. Finally, the column-oriented database structure is particularly suited to a global annotation store, from which local participants can accept or reject annotations based on local policies. This global repository also helps to retain knowledge that is independent of local nodes, such as pending annotations.
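The map-reduce view of duplicate discovery can be illustrated with a minimal sketch in plain Python, without Hadoop itself. It is not the project's implementation. The function names are hypothetical, and the choice of the Darwin Core terms recordedBy, recordNumber, and eventDate as the collection-event key is an assumption made only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]
EventKey = Tuple[str, str, str]
Hit = Tuple[str, str]  # (node name, catalog number)


def event_key(record: Record) -> EventKey:
    """Hypothetical collection-event key: collector, collector number, date."""
    return (
        record["recordedBy"].strip().lower(),
        record["recordNumber"].strip(),
        record["eventDate"].strip(),
    )


def map_local_search(node: str, records: Iterable[Record]) -> Iterator[Tuple[EventKey, Hit]]:
    """Map step: each node emits (event key, local record reference) pairs."""
    for record in records:
        yield event_key(record), (node, record["catalogNumber"])


def reduce_duplicates(pairs: Iterable[Tuple[EventKey, Hit]]) -> Dict[EventKey, List[Hit]]:
    """Reduce step: group by event key and keep keys seen at more than one node."""
    grouped: Dict[EventKey, List[Hit]] = defaultdict(list)
    for key, hit in pairs:
        grouped[key].append(hit)
    return {
        key: sorted(hits)
        for key, hits in grouped.items()
        if len({node for node, _ in hits}) > 1
    }


def find_duplicates(nodes: Dict[str, List[Record]]) -> Dict[EventKey, List[Hit]]:
    """Run the map step per node, then combine all results in one reduce step."""
    pairs: List[Tuple[EventKey, Hit]] = []
    for node, records in nodes.items():
        pairs.extend(map_local_search(node, records))
    return reduce_duplicates(pairs)
```

In a real deployment the map step would run at each participating node and the framework would handle the grouping; here both steps run in one process only to show the data flow.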
The current underlying data model is a duplicates-oriented global view of the participants' specimen data. Through this view, local changes can be made available globally and new global annotations can be applied to local copies. To facilitate data exchange and sharing, we use Darwin Core as the common vocabulary, extended to address some application-specific problems, such as managing Globally Unique IDs of duplicates. The architecture imposes a strict separation between the message-passing network and the computation models required to filter and respond to messages about annotations.
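As a rough illustration of how a local node might apply global annotations to its own copies under a local policy, consider the following Python sketch. The class names, the fields, and the use of a plain list in place of the global store are all assumptions made for this example; they do not describe the prototype's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    duplicate_set_id: str  # hypothetical GUID shared by all duplicates of one event
    term: str              # Darwin Core term being annotated, e.g. "scientificName"
    value: str             # proposed new value
    source: str            # node that proposed the annotation


@dataclass
class LocalNode:
    name: str
    policy: Callable[[Annotation], bool]  # local accept/reject rule
    # Local copies, keyed by duplicate set ID (an assumption for this sketch).
    records: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def pull(self, global_store: List[Annotation]) -> List[Annotation]:
        """Apply accepted annotations to local copies and return the rejected ones.

        The global store is never modified, so annotations rejected here
        remain available to other nodes.
        """
        rejected = []
        for ann in global_store:
            if ann.duplicate_set_id not in self.records:
                continue  # this node holds no copy of that duplicate set
            if self.policy(ann):
                self.records[ann.duplicate_set_id][ann.term] = ann.value
            else:
                rejected.append(ann)
        return rejected
```

Keeping the policy as a function held by the node, rather than by the store, mirrors the separation described above between the shared annotations and the local decision to accept or reject them.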
We have begun building a Hadoop-based prototype annotation-sharing network. It holds promise as a platform for more complex research-oriented computations related to collection data, such as clustering of potential duplicates and identification of outliers for quality control. A separate poster by P.J. Morris et al. describes the principal use cases and the messages required to describe them in the network.