Proceedings of TDWG, 2007

Data Integration: Using TAPIR as an asynchronous caching protocol

Aaron Steele

Abstract


There are over 100 million DarwinCore specimen records available on distributed networks worldwide. However, the search space for application-specific information is becoming vast and unreliable. For applications that know a priori what data are needed, asynchronous caching provides a reliable subset of data specific to a particular analysis. For example, an application generating species distribution models from Madagascar would benefit from accessing locally cached data where HigherGeography = Madagascar, instead of dynamically querying the network at run-time, which is expensive.

While TAPIR provides a straightforward caching protocol for retrieving specific DarwinCore concepts from a set of resources and integrating the results into a single database, a key concern is keeping the cached data synchronized with those resources. For example, when records are inserted, updated, or deleted in a resource, the cached data must reflect these changes. Since TAPIR does not explicitly support syndicating these change events, they must be inferred by storing the GlobalUniqueIdentifier (GUID) and DateLastModified (dlm) concepts of every record in a resource in a level-2 cache, and then periodically comparing that cache against the resource.

As a concrete example, suppose at time 't1' we create a level-2 cache 'C' for resource 'R'. The next day at time 't2' we create a second level-2 cache 'C2' of 'R'. Then, using 'C' and 'C2', the change events in 'R' between 't1' and 't2' can be defined as follows:

1) If 'C2.GUID' is not in 'C', then 'C2.GUID' was inserted.
2) If 'C2.GUID' is in 'C' and 'C2.dlm' differs from 'C.dlm', then 'C2.GUID' was updated.
3) If 'C.GUID' is not in 'C2', then 'C.GUID' was deleted.
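
The three rules above can be sketched as a comparison of two snapshots. This is only an illustration under stated assumptions: each level-2 cache is modeled as a Python dictionary mapping GUID to dlm, and the function name `detect_changes` and the sample identifiers and dates are hypothetical, not part of TAPIR.

```python
def detect_changes(old_cache, new_cache):
    """Compare two level-2 caches, each a dict mapping GUID -> dlm.

    Returns three sets of GUIDs: (inserted, updated, deleted).
    The dict-based cache layout is an assumption made for this sketch.
    """
    old_guids = set(old_cache)
    new_guids = set(new_cache)

    # Rule 1: a GUID present only in the newer cache was inserted.
    inserted = new_guids - old_guids
    # Rule 2: a GUID present in both caches with a different dlm was updated.
    updated = {g for g in old_guids & new_guids if old_cache[g] != new_cache[g]}
    # Rule 3: a GUID present only in the older cache was deleted.
    deleted = old_guids - new_guids

    return inserted, updated, deleted


# Hypothetical snapshots 'C' (time t1) and 'C2' (time t2) of resource 'R'.
c = {"urn:a": "2007-09-01", "urn:b": "2007-09-01", "urn:c": "2007-09-01"}
c2 = {"urn:b": "2007-09-02", "urn:c": "2007-09-01", "urn:d": "2007-09-02"}

inserted, updated, deleted = detect_changes(c, c2)
```

Comparing dlm values only for GUIDs found in both caches keeps an inserted record from also being reported as updated.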

In this way, after comparing records in the level-2 cache against current resource inventories, all change events are detected and associated with specific GUIDs. The level-1 cache then uses these GUIDs to synchronize changes by submitting new TAPIR inventory requests (for new or updated records) and removing cached records whose GUIDs have been deleted from the resource.
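
The synchronization step might look roughly like the sketch below. It assumes the level-1 cache is a dictionary keyed by GUID and that the change events are already available as sets of GUIDs. The names `synchronize` and `fetch_record` are hypothetical; `fetch_record` stands in for whatever TAPIR request retrieves a record, and no real TAPIR client call is shown.

```python
def synchronize(level1, inserted, updated, deleted, fetch_record):
    """Apply detected change events to a level-1 cache (dict: GUID -> record).

    `fetch_record` is a caller-supplied function (hypothetical here) that
    returns the current record for a GUID, e.g. by querying the resource.
    """
    # Remove records that no longer exist in the resource.
    for guid in deleted:
        level1.pop(guid, None)
    # Fetch new and updated records again and overwrite the cached copies.
    for guid in inserted | updated:
        level1[guid] = fetch_record(guid)
    return level1


# Hypothetical data: an in-memory stand-in for the remote resource.
resource = {
    "urn:b": {"HigherGeography": "Madagascar", "dlm": "2007-09-02"},
    "urn:d": {"HigherGeography": "Madagascar", "dlm": "2007-09-02"},
}
level1 = {
    "urn:a": {"HigherGeography": "Madagascar", "dlm": "2007-09-01"},
    "urn:b": {"HigherGeography": "Madagascar", "dlm": "2007-09-01"},
}

synchronize(level1, {"urn:d"}, {"urn:b"}, {"urn:a"}, resource.__getitem__)
```

Only the GUIDs named in the change sets are touched, so unchanged records are never requested again.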

In this presentation I will discuss these key caching algorithms in more detail, including the process of syndicating resource changes in the level-2 cache using RSS feeds, the implementation of data harvesting, initial results of these methods in the MaNIS, ORNIS, and HerpNET networks, and proposed additions to TAPIR. I will also address social and political concerns associated with caching, and provide information about free, open source storage solutions including MySQL and the Google Base API.