Review Details

Peer Review By Lee Belbin

Reviewer: Lee Belbin
Review Date: 2008-04-21 05:31:59
Recommendation: Accept Submission
Comments:

A darn excellent job Roger! I read through the lot and it still looks great.

I've done a thorough search and tweaked the license layout a tad. I could not use Track Changes as it totally stuffed up the formatting.

I originally thought that we would want an Attribution 3.0 United States license, as TDWG is registered in the USA and copyright would be enforced through the USA. But a) TDWG registration may change (as it has done previously) and b) USA copyright laws are archaic.

So I went back to an unported license.


Peer Review By Rod Page

Reviewer: Rod Page
Review Date: 2008-04-18 13:52:58
Recommendation: Revisions Required
Comments:

In general these seem sensible, well thought out recommendations. I have reservations about the wisdom of LSIDs, but it's a done deal so the goal is to do them as well as possible. Some of my comments may have already been raised/dealt with on the TDWG Wikis, but I've not been following that discussion much, hence my apologies if I've raised familiar concerns.

The discussion of RDF reveals the problem LSIDs pose in a HTTP URI world. I think the owl:sameAs statement is a hack, and one which assumes TDWG is the best/only LSID resolver. I suggest doing this differently. I feel recommendation 39 is a mistake that will hamper integration across domains.

It might be useful to have some statement about what LSIDs identify. In the HTTP URI world, there has been much discussion of the HTTP range problem -- if I use identifier http://sws.geonames.org/2950159/ for the city of Berlin, and somebody wants to make a statement about Berlin, e.g.

http://sws.geonames.org/2950159/ size 1000

are they making a statement about the city Berlin, or the web page http://sws.geonames.org/2950159/? In part this problem arises because I can put http://sws.geonames.org/2950159/ into a web browser and get a web page, which implies that http://sws.geonames.org/2950159/ identifies that page (see http://geonames.wordpress.com/2006/10/21/semantic-web-concept-vs-document/).

One solution in the HTTP URI world is that identifiers of concepts return a 303 response, indicating that they are not information resources. One could argue that the resolution mechanism of LSIDs is another way around this problem (although recommendation 39 destroys this).

I've numbered comments following the spec.

1.
2.
3.

4. Providers should not assign LSIDs to objects that already have more widely accepted identifiers such as publications with DOIs.

I'm a big fan of identifier reuse, indeed it's the only way this stuff will succeed. However, I can imagine reasons for assigning LSIDs to publications with DOIs. For example, many publications have multiple GUIDs, such as DOIs and Handles, or DOIs and PubMed ids. Imagine a document has DOI A and Handle B. If metadata for LSID x refers to DOI A, and metadata for LSID y refers to Handle B, how do we know that both x and y refer to the same publication? A database might serve LSIDs for a set of statements that aggregate the other identifiers (in the same way that Connotea has HTTP URIs that resolve to metadata listing multiple external GUIDs).

I'm actually agreeing with this recommendation, I'm just pointing out a case where it might be legitimately overruled.

Lastly, if providers do use external GUIDs then it would be worthwhile TDWG thinking about how to refer to them. For example, how would I refer to a DOI in RDF? There are various ways DOIs can be written, including the INFO URI scheme (http://info-uri.info/), and using the current DOI proxy (http://dx.doi.org/). I raise this because it relates to recommendation 39, which I regard as a mistake.
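
To make the question concrete, here is a minimal Python sketch of the three spellings mentioned above for one DOI (the DOI is the one used in the Connotea example later in this review). The function names are hypothetical, and the sketch assumes plain string prefixes with no percent-encoding of special characters; it is an illustration of the ambiguity, not a recommended canonical form.

```python
# Known prefixes for the three DOI spellings discussed in the review.
INFO_PREFIX = "info:doi/"             # INFO URI scheme
PROXY_PREFIX = "http://dx.doi.org/"   # DOI proxy named in the review
SCHEME_PREFIX = "doi:"                # informal "doi:" spelling


def bare_doi(identifier):
    """Strip any of the known prefixes and return the bare DOI string."""
    for prefix in (PROXY_PREFIX, INFO_PREFIX, SCHEME_PREFIX):
        if identifier.lower().startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def doi_forms(doi):
    """Return the three spellings of one bare DOI."""
    return {
        "info_uri": INFO_PREFIX + doi,
        "proxy_url": PROXY_PREFIX + doi,
        "doi_scheme": SCHEME_PREFIX + doi,
    }


forms = doi_forms("10.1098/rsbl.2006.0473")
# All three spellings reduce to the same bare DOI.
round_trip = {bare_doi(value) for value in forms.values()}
```

Without an agreed spelling, two providers describing the same publication could emit different strings, and a consumer would need a normalisation step like `bare_doi` before the references could be matched.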


5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.

18. Providers should use well-established locally unique and immutable object identifiers as LSID object identifiers.

CoL is a bad example, because their taxon ids change with each release. This year they are UUIDs, which I assume they intend to make stable.

This raises the issue of UUIDs as object identifiers. Both CoL and Zoobank have adopted these, and UUIDs are globally, not locally unique. It would be worthwhile making a recommendation about UUIDs.

In the context of CoL, it's interesting that the UUIDs contain a lot of redundancy, for example:

Agama agama urn:lsid:catalogueoflife.org:taxon:ea76a4b6-29c1-102b-9a4a-00304854f820:ac2008

Pinnotheres pisum
urn:lsid:catalogueoflife.org:taxon:ef0ae064-29c1-102b-9a4a-00304854f820:ac2008

The UUIDs both contain "-29c1-102b-9a4a-00304854f820", hence only the first 8 characters are actually relevant (this seems to be a characteristic of MySQL's UUID generator). Hence, UUIDs seem to be overkill in this context.
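
The overlap can be checked mechanically. The following is a minimal Python sketch using only the standard library; the helper names are hypothetical, and it assumes the object identifier is the fifth colon-separated field of the LSID (urn:lsid:authority:namespace:object:revision).

```python
import os
import uuid

AGAMA = "urn:lsid:catalogueoflife.org:taxon:ea76a4b6-29c1-102b-9a4a-00304854f820:ac2008"
PINNOTHERES = "urn:lsid:catalogueoflife.org:taxon:ef0ae064-29c1-102b-9a4a-00304854f820:ac2008"


def object_id(lsid):
    """Return the object identifier field of an LSID (assumed to be field 5)."""
    return lsid.split(":")[4]


def shared_suffix(a, b):
    """Longest common trailing substring of two strings."""
    # commonprefix compares character by character, so reverse both strings,
    # take the common prefix, and reverse the result.
    return os.path.commonprefix([a[::-1], b[::-1]])[::-1]


a, b = object_id(AGAMA), object_id(PINNOTHERES)
suffix = shared_suffix(a, b)
distinct_a = a[: len(a) - len(suffix)]
distinct_b = b[: len(b) - len(suffix)]
version = uuid.UUID(a).version  # UUID version field, parsed by the standard library
```

Both identifiers parse as version 1 (time-based) UUIDs. In that layout the trailing groups carry the high-order timestamp bits, the clock sequence and the generating node, so identifiers minted on one host close together in time would be expected to differ mainly in the first group, which is consistent with the observation above.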

There are occasions when they do make sense, namely:

1. The provider is unsure if LSIDs will be the technology of choice in the future, but wants an easy way to ensure the identifiers remain unique if moved to another GUID mechanism (e.g., HTTP URIs, Handles, or DOIs).

2. The provider is aggregating from other data sources, and wants to ensure that the object identifiers provided by the source are unique. I thought this might have been the motivation for CoL, except there's no evidence in their UUIDs that this is the case (the two examples above come from different source databases).

I personally find UUIDs very ugly, and I wonder how journal editors will feel having them littered through a paper.


19. LSID Authorities should not use the primary key of relational database tables as object identifications. Providers should create an extra column in the table (or a separate table) to manage the LSID independently of the primary key.

This should perhaps be cautionary, rather than a recommendation. The obvious identifier for ITIS, for example, is the primary key. I suggest that you advise people of the problem that arises if they reload a database from sources other than, say, a SQL dump. But by making it a recommendation you are asking people to alter their database schema -- not a good strategy for encouraging adoption.


20.
21.
22.
23.
24.

25. Providers should not encode data in formats such as XML that may change the exact sequence of bytes.

I guess this raises issues of what "same" means, and whether a TIFF file saved in big endian or little endian byte order (same image) is truly different.
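
The byte-order point can be illustrated without a real TIFF file. The sketch below is an assumption-laden simplification: it packs four arbitrary 16-bit sample values in both byte orders with Python's struct module and compares the results, standing in for the pixel data of an image.

```python
import struct

# Arbitrary 16-bit sample values standing in for pixel data (not from a real file).
pixels = [0, 258, 4660, 65535]

big_endian = struct.pack(">4H", *pixels)     # most significant byte first
little_endian = struct.pack("<4H", *pixels)  # least significant byte first

# The serialised byte sequences differ...
same_bytes = big_endian == little_endian

# ...but decoding each with its own byte order recovers identical values.
same_image = struct.unpack(">4H", big_endian) == struct.unpack("<4H", little_endian)
```

Here `same_bytes` is False while `same_image` is True, which is exactly the gap between "same sequence of bytes" and "same object" that the recommendation has to settle.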

26.
27.
28.
29.

30.

31. Objects in the biodiversity information domain that are identified by an LSID must be typed using the TDWG ontology or other well-known vocabularies in accordance with the TDWG common architecture.

"Must be"? You should encourage this, but to state that anybody in this area should use TDWG ontologies is hubris (I for one regard them as rather over-blown). Encourage reuse of existing vocabularies (TDWG itself could do better in this regard) by all means, but to require it implies TDWG has control over what people can serve. The biggest provider of LSIDs in this area is uBio, which has managed very nicely without TDWG vocabularies.

32.

33. In an HTML document, an LSID appearing within the description of the object it identifies should be presented in plain text (i.e. not hyperlinked) and in its original form as in:...

Not a huge fan of the icon. If you are going to recommend this, I suggest providing a HTML template and a stable URL for the icon, so users can easily cut and paste code to generate the desired effect.

Not at all clear why the LSID for the object being described is NOT hyperlinked -- won't this just confuse people, especially if clickable and non-clickable links are flagged with the same icon?

34.
35.
36.
37.

38. The description of all objects identified by an LSID must contain an owl:sameAs statement expressing the equivalence between the object identifier in its standard form and its proxy version as in: ...

Why embed an obvious point of failure in the metadata? Why embed details of mechanism in the metadata? I'm also doubtful about the use of owl:sameAs, because http://lsid.tdwg.org/urn:lsid:... is not the actual identifier of the object, urn:lsid:... is.

An alternative might be to embed information about the resolver in such a way that providers could provide information on their own resolver, in much the same way as Connotea does:




10.1098/rsbl.2006.0473
doi:10.1098/rsbl.2006.0473



I'm not advocating this particular syntax, but the idea is to separate the resolution from the identifier.
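
One way to read that separation, sketched in Python rather than RDF: the record keeps the LSID as the identifier and names a resolver as a separate, replaceable field, and a resolution URL is only assembled when needed. The field names, function names and the example.org resolver are hypothetical; only the lsid.tdwg.org proxy and the uBio LSID come from this review.

```python
DEFAULT_RESOLVER = "http://lsid.tdwg.org/"  # proxy named in the review


def describe(lsid, resolver=DEFAULT_RESOLVER):
    """Build a record that stores the identifier and the resolver separately."""
    return {"identifier": lsid, "resolver": resolver}


def resolution_url(record):
    """Assemble a resolvable URL on demand; the stored identifier is untouched."""
    return record["resolver"] + record["identifier"]


lsid = "urn:lsid:ubio.org:namebank:11815"
via_tdwg = describe(lsid)
via_other = describe(lsid, resolver="http://resolver.example.org/")  # hypothetical resolver

urls = (resolution_url(via_tdwg), resolution_url(via_other))
```

Swapping the resolver changes the URL but not the identifier, so a resolver that disappears does not invalidate the metadata.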


39. All references to objects identified by LSIDs using the rdf:resource attribute must use a proxy version of the LSID as in: ...

I think this is unfortunate because:

1. It embeds the assumption that tdwg.org is going to survive long term.

2. It assumes that the TDWG resolver is the best/only resolver. People may choose others, or client software may support their own.

3. It commits a logical error. The statement

urn:lsid:ubio.org:namebank:11815 gla:vernacularName urn:lsid:ubio.org:namebank:954940

tells us that the vernacular name of the resource identified by urn:lsid:ubio.org:namebank:11815 is the resource identified by urn:lsid:ubio.org:namebank:954940. If I substitute

http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:954940

then I'm saying the vernacular name is identified by a URL, when this URL isn't the identifier, the LSID is. If you are going to have URLs as identifiers, then you hit the HTTP range problem mentioned above.

4. It imposes an extra step on any RDF query, because it would have to link two objects together via an owl:sameAs statement like

x gla:objectiveSynonym y,
z owl:sameAs y,
z dc:title title

This imposes a needless burden on people doing queries, especially as we chain multiple objects together (each link requires two statements). This has not been designed with query efficiency, robustness, or interoperability in mind. Imagine somebody outside biodiversity trying to query a mashup of their data and ours: why force them to deal with this clumsy step? How would they know that they need an extra step (owl:sameAs) for TDWG objects? I strongly recommend that rdf:resource link to the LSID itself, not the proxy.
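
The extra hop can be shown with a toy in-memory triple list. This is a minimal Python sketch, not a real RDF store or SPARQL engine; the example.org LSIDs, the placeholder title and the helper names are all hypothetical, while the predicates and the proxy prefix follow the statements above.

```python
X = "urn:lsid:example.org:names:1"   # hypothetical LSID
Z = "urn:lsid:example.org:names:2"   # hypothetical LSID
Y = "http://lsid.tdwg.org/" + Z      # proxy version of Z

# Linking straight to the LSID: two statements.
direct = [
    (X, "gla:objectiveSynonym", Z),
    (Z, "dc:title", "placeholder title"),
]

# Linking to the proxy: an owl:sameAs statement is needed to get back to Z.
proxied = [
    (X, "gla:objectiveSynonym", Y),
    (Z, "owl:sameAs", Y),
    (Z, "dc:title", "placeholder title"),
]


def objects(triples, subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All subjects of triples matching (?, predicate, obj)."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def titles_direct(triples, start):
    """Follow objectiveSynonym, then read dc:title of the target."""
    return [title
            for target in objects(triples, start, "gla:objectiveSynonym")
            for title in objects(triples, target, "dc:title")]


def titles_via_same_as(triples, start):
    """Same question, with the owl:sameAs hop from proxy back to LSID."""
    return [title
            for proxy in objects(triples, start, "gla:objectiveSynonym")
            for lsid in subjects(triples, "owl:sameAs", proxy)
            for title in objects(triples, lsid, "dc:title")]
```

Running `titles_direct` against the proxied data returns nothing, so a client unaware of the convention silently gets an empty answer; only `titles_via_same_as` recovers the title.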

40.

Roderic Page


Peer Review By Richard Pyle

Reviewer: Richard Pyle
Review Date: 2008-04-18 22:04:45
Recommendation: Accept Submission
Comments:

I have made suggested changes to the MS Word version of the document, which I have uploaded separately. My changes and comments are indicated via the "Track Changes" feature of MS Word. All of my suggestions are optional, and except for corrections of typos and formatting inconsistencies, can be ignored if desired. No need for anonymity.


Peer Review By Ben Szekely

Reviewer: Ben Szekely
Review Date: 2008-06-03 23:14:04
Recommendation: Accept Submission
Comments:

This document clearly demonstrates a sophisticated and complete understanding of the LSID specification and the intentions of the original authors. The recommendations contained within provide a sensible approach to implementing LSID within the TDWG community, and could certainly be applied elsewhere.

I hope that the TDWG community will share what they have done with the rest of the Semantic Web and Life Sciences community because this work is widely applicable, solving many problems that the greater community has been unable to successfully tackle thus far.

I have only a few small comments, indicated by page.

Pg 5
----

- (22) relationship -> relationships


Pg 14
-----

"A provider can use LSID namespaces to split identifiers across different categories, such as object
type, scientific or taxonomic discipline, departments, collections and projects. Namespaces help
distinguish objects of different types that have the same identifier"

This statement perhaps breaks the opacity of LSIDs. The namespace does not really distinguish objects; rather,
it helps separate them out. Only the resolved metadata can actually distinguish them.


Pg 18
-----

It may be worth emphasizing that the "hub" objects should not contain any data, because if the data changes for that LSID, it is a violation.

Pg 20
-----

getDataByRange was a call that many of us did not want in the spec, and is considered optional. In particular,
such chunking should be left to the underlying protocol such as HTTP, and not made visible at the LSID web service
level. You may want to consider recommending that implementors within TDWG need not provide this as it may be very
difficult and inefficient in many circumstances.
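
As a sketch of what leaving chunking to HTTP could look like on the client side, the Python below builds (but does not send) a GET request carrying a standard Range header. The URL and the function name are hypothetical, and the sketch assumes the data is served by an HTTP endpoint that honours range requests; a server that does not would simply return the whole body.

```python
import urllib.request


def ranged_request(url, start, end):
    """Build a GET request for bytes start..end (inclusive) without sending it."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical data URL; nothing is fetched here.
request = ranged_request("http://example.org/lsid-data/object-1", 0, 1023)
range_header = request.get_header("Range")
```

With this approach the LSID web service interface needs no range-specific call; partial retrieval is negotiated entirely in the transport layer.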


Editor's Decision

Decision(s): No Decision Yet
Comments

  Last Modified: 23 November 2007