Proceedings of TDWG, 2008

SpeciesIndex: A Practical Alternative to Fantasy Mashups?

Roger Hyam

Abstract


Generating consensus descriptions of species by mashing together data gleaned from different sources has been proposed by various projects but will never be practical. Different accounts not only come in different formats; they are also authored for different audiences and from different perspectives. An afternoon in the library with a photocopier and a pair of scissors will demonstrate that, even for a human, it is impossible to create non-trivial descriptions by cutting and pasting from original sources.
The wild idea is that we shouldn't bother doing mashups but should instead enhance indexing and federated search services such as iSpecies by answering the simple question “Where are all the species pages?”
The proposed way to do this is to exploit the well-established SiteMaps format, allowing authors of taxonomic accounts to provide an index file of their description pages.
SiteMaps is a protocol that allows webmasters to inform search engines of the URLs they would like to have indexed. In practice the protocol consists of an XML or plain text file that contains a list of URLs, and in the XML form some metadata, for a particular website. It is a very simple protocol, easy to implement and supported by a wide range of search engines including Google, MSN, Yahoo! and Ask.com. There are only three metadata elements in the protocol, all of them optional: the last modification date, the change frequency and the priority of each URL. Importantly, SiteMaps is extensible by the addition of XML elements in other namespaces.
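To give a feel for how little is involved, the following is a minimal sketch of generating such a file with Python's standard library. The function name build_sitemap, the page dictionary layout and the example.org URL are assumptions made for illustration; only the namespace URI and the element names (urlset, url, loc, lastmod, changefreq, priority) come from the protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return a Sitemap XML string for an iterable of page dicts.

    Each dict needs a "loc" key and may carry the three optional
    metadata keys "lastmod", "changefreq" and "priority".
    (This dict layout is an illustrative assumption.)
    """
    ET.register_namespace("", SITEMAP_NS)  # serialise without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical species page on a placeholder domain.
example = build_sitemap([
    {
        "loc": "http://example.org/species/Quercus_robur",
        "lastmod": "2008-06-01",
        "changefreq": "monthly",
        "priority": 0.8,
    },
])
print(example)
```

A data supplier would run something like this over the list of species pages it already holds and publish the result as a static file.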
As a first level of implementation, data suppliers could be asked to generate a SiteMap file, as per the protocol, that includes only their species pages. There are tutorials and validation tools already available from Google to support this process. They should then submit the URL of the file to a Species Index Registry (SIR) along with a simple description of the taxonomic and geographic scope of their data. SIR consists of a human with an email account and a manually edited web page. It should be relatively simple to write an indexer that uses the SiteMaps listed in SIR to index just the species pages of interest to a particular project. The cost and risk of implementing this strategy are very small, yet it would enable a great deal of innovation.
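The indexer described above might look something like the following sketch. Everything specific here is assumed for illustration: the registry is modelled as a list of dictionaries with supplier, scope and sitemap keys, the scope filter is a naive substring match, and the sitemap documents are held in memory rather than fetched from the hypothetical example.org URLs.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def page_urls(sitemap_xml):
    """Extract every <loc> URL from a Sitemap XML document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


def build_index(registry, fetch, scope=None):
    """Map each registered supplier to its list of species page URLs.

    registry: list of dicts with "supplier", "scope" and "sitemap" keys
              (an assumed layout; the abstract does not define one).
    fetch:    callable that takes a sitemap URL and returns its XML text.
    scope:    optional case-insensitive substring to match against the
              supplier's free-text scope description.
    """
    index = {}
    for entry in registry:
        if scope is not None and scope.lower() not in entry["scope"].lower():
            continue
        index[entry["supplier"]] = page_urls(fetch(entry["sitemap"]))
    return index


# In-memory stand-ins for two suppliers' sitemap files (hypothetical data).
DOCS = {
    "http://flora.example.org/sitemap.xml": """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://flora.example.org/species/Quercus_robur</loc></url>
  <url><loc>http://flora.example.org/species/Betula_pendula</loc></url>
</urlset>""",
    "http://birds.example.org/sitemap.xml": """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://birds.example.org/species/Erithacus_rubecula</loc></url>
</urlset>""",
}

REGISTRY = [
    {"supplier": "Example Flora", "scope": "Vascular plants, Scotland",
     "sitemap": "http://flora.example.org/sitemap.xml"},
    {"supplier": "Example Birds", "scope": "Birds, Europe",
     "sitemap": "http://birds.example.org/sitemap.xml"},
]

index = build_index(REGISTRY, DOCS.__getitem__, scope="plants")
print(index)
```

In a real deployment the fetch argument could wrap urllib.request.urlopen; it is passed in as a parameter here so that the sketch runs without network access.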
A second level of implementation would involve extending the SiteMaps file format to include metadata about what each species page includes. This could involve reusing the TDWG ontology namespaces. Including more metadata in the SiteMaps file would enable the generation of a more complex registry and more intelligent indexers. It is highly likely, however, that discussions about metadata extensions will bog down the first level of implementation and so prevent anything practical happening at all. Perhaps that is why this is a wild idea?
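Purely as an illustration of what such an extension could look like, the sketch below adds elements from a second namespace to each url entry and shows an indexer filtering on them. The namespace http://example.org/speciesindex# and the element names scientificName and includes are invented placeholders, not terms from the TDWG ontology or any agreed vocabulary.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    # Hypothetical extension namespace, used only for this sketch.
    "si": "http://example.org/speciesindex#",
}

# A sitemap carrying per-page metadata in the extension namespace.
EXTENDED_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:si="http://example.org/speciesindex#">
  <url>
    <loc>http://flora.example.org/species/Quercus_robur</loc>
    <lastmod>2008-06-01</lastmod>
    <si:scientificName>Quercus robur</si:scientificName>
    <si:includes>description</si:includes>
    <si:includes>distribution</si:includes>
  </url>
  <url>
    <loc>http://flora.example.org/species/Betula_pendula</loc>
    <si:scientificName>Betula pendula</si:scientificName>
    <si:includes>description</si:includes>
  </url>
</urlset>"""


def pages_including(sitemap_xml, content_type):
    """Return (scientific name, URL) pairs for pages that declare content_type."""
    root = ET.fromstring(sitemap_xml)
    hits = []
    for url in root.findall("sm:url", NS):
        included = {e.text for e in url.findall("si:includes", NS)}
        if content_type in included:
            name = url.findtext("si:scientificName", namespaces=NS)
            loc = url.findtext("sm:loc", namespaces=NS)
            hits.append((name, loc))
    return hits


with_distribution = pages_including(EXTENDED_SITEMAP, "distribution")
print(with_distribution)
```

Consumers that do not recognise the extra namespace can skip those elements and still read the standard loc and lastmod values, so a first-level indexer would continue to work on second-level files.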