[OTDev] Life Sciences Identifiers ( On unique IDs )

Fri Oct 2 17:06:44 CEST 2009

Dear All,

We could consider the Life Sciences Identifiers (LSID) approach:
http://lsids.sourceforge.net/

An excerpt from [1] :

    Standard identifiers. The problem with URL's is that they always
    point to a particular Web server (which may not always be in
    service) and worse, that the contents referred to by a URL may
    change. For researchers, the requirement to be able to exactly
    reproduce any observations and experiments based on a data object
    means that it is essential that data be uniquely named and available
    from many cached sources. The Life Science IDentifier or LSID
    (http://lsid.sourceforge.net) is designed to fulfill this
    requirement. An LSID names and refers to one unchanging data object
    (version numbers can be attached to handle updates). Every LSID
    consists of up to five parts: the Network Identifier; the root DNS
    name of the issuing authority; the namespace chosen by the issuing
    authority; the object id unique in that namespace; and finally an
    optional revision id for storing versioning information. Each part
    is separated by a colon to make LSIDs easy to parse. For example,
    "urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2" is an LSID with
    "urn:lsid" being the NID, "ncbi.nlm.nih.gov" the issuing authority's
    DNS name, "GenBank" the database namespace, "T48601" the object id,
    and "2" the revision id. Unlike URLs, LSIDs are location
    independent. This means that a program or a user can be certain that
    what they are dealing with is exactly the same data if the LSID of
    any object is the same as the LSID of another copy of the object
    obtained elsewhere. As an example of LSID usage, the Entrez LSID Web
    service (http://lsid.biopathways.org/entrez/) uses NCBI's Entrez
    search interface to locate LSIDs within the biological databases
    hosted by the NCBI. The LSID system is in essence similar to the role
    of the Domain Name Service (DNS) for converting named Internet
    locations to IP numbers.

[1] Baker, C., Semantic Web: Revolutionizing Knowledge Discovery in the
Life Sciences  
http://www.amazon.com/Semantic-Web-Revolutionizing-Knowledge-Discovery/dp/0387484361/ref=sr_1_1?ie=UTF8&s=books&qid=1254495736&sr=1-1-spell
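
To make the five-part structure described in the excerpt concrete, here
is a minimal Python sketch that splits an LSID into its parts. It
assumes the standard "urn:lsid:" prefix and a plain colon-separated
layout. The names LSID and parse_lsid are made up for illustration and
are not part of any LSID library, and nothing is resolved over the
network.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # root DNS name of the issuing authority
    namespace: str           # namespace chosen by the issuing authority
    object_id: str           # object id, unique within that namespace
    revision: Optional[str]  # optional revision id, None if absent


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its colon-separated parts."""
    parts = text.split(":")
    # "urn" and "lsid" together form the network identifier (NID).
    has_nid = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if len(parts) not in (5, 6) or not has_nid:
        raise ValueError(f"not a valid LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty part: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


lsid = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2")
print(lsid)
```

Two copies of an object obtained from different servers can then be
checked for identity simply by comparing their LSID strings.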

Best regards,
Nina

Christoph Helma wrote:
> Excerpts from chung's message of Wed Sep 30 16:04:57 +0200 2009:
>   
>> Dear Tobias, Nina, All,
>>
>> According to the OpenTox API 1.0 about algorithms, a training algorithm
>> service accepts as POSTed parameters the dataset_uri and some algorithm
>> specific parameters. The dataset_uri itself contains a list of compounds
>> and a corresponding list of feature definitions in the sense that if
>> "/compound/100" and "/feature_definition/2535" belong to a dataset with
>> uri "/dataset/37" then the value of the feature is available at
>> "/feature/compound/100/feature_definition/2535". For example, see
>> http://lxkramer13.informatik.tu-muenchen.de:8180/OpenTox/feature/compound/aldosterone/feature_definition/CDK_LipinskiFailures
>>
>> So, the dataset contains feature definitions that correspond either to
>> molecular descriptors or toxicological endpoints without being able to
>> tell which is which. However, this piece of information is very important
>> for a learning algorithm. So I propose that the target feature
>> definition should be an extra POSTed parameter. This modification of
>> the API is of high importance for algorithm related web services and
>> should be taken into account in API 2.0 (or 1.1 maybe! :-).
>>     
>
> I make the distinction at the dataset level, which works very well for
> my purposes. Consider e.g. the ToxCast data: You can create several datasets, e.g. :
>
> - in vivo Data 
> - in vitro Data
> - phys/chem Properties
> - ... (e.g. structural Fragments)
>
> Now you can make a lot of interesting (and meaningful) experiments:
>
> - Use phys/chem Properties to predict in vivo Effects
> - Use phys/chem Properties to predict in vitro Effects
> - Use in vitro Data to predict in vivo Effects
> - Combine phys/chem and in vitro Data to predict in vivo Effects
>
> An exemplary workflow (roughly based on my API proposal) could be:
>
> Create datasets for experimental Data:
>
> 	in_vivo_dataset_uri = POST /dataset data=in_vivo_data 	# we still have to decide about our internal data exchange format!!
> 	in_vitro_dataset_uri = POST /dataset data=in_vitro_data
>
> Calculate features:
>
> 	phys_chem_dataset_uri = POST /algorithm/phys_chem_properties dataset_uri=in_vivo_dataset_uri
>
> Create a model to predict in vivo effects from phys/chem Properties:
>
> 	model_uri = POST /algorithm/your_favorite_learning_algorithm training_dataset_uri=in_vivo_dataset_uri, feature_dataset_uri=phys_chem_dataset_uri 
>
> Create a model to predict in vivo effects from in vitro Properties:
>
> 	model_uri = POST /algorithm/your_favorite_learning_algorithm training_dataset_uri=in_vivo_dataset_uri, feature_dataset_uri=in_vitro_dataset_uri 
>
> Create a model to predict in vitro effects from phys/chem Properties:
>
> 	model_uri = POST /algorithm/your_favorite_learning_algorithm training_dataset_uri=in_vitro_dataset_uri, feature_dataset_uri=phys_chem_dataset_uri 
>
> and so on.
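
The model-building calls in the quoted workflow can be sketched in
Python as follows. This is only an illustration under assumptions: the
host example.org and the dataset URIs are placeholders, the helper
names post_request and train are invented here, and the requests are
only described as (URL, form body) pairs, not actually sent.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlencode

BASE = "http://example.org"  # hypothetical service root, not a real host
LEARNER = "your_favorite_learning_algorithm"  # placeholder from the workflow


def post_request(path: str, **params: str) -> Tuple[str, str]:
    """Describe a POST as (url, form-encoded body); nothing is sent."""
    return BASE + path, urlencode(params)


def train(training_dataset_uri: str, feature_dataset_uri: str) -> Tuple[str, str]:
    """Dependent variables come from the training dataset,
    independent variables from the feature dataset."""
    return post_request(
        "/algorithm/" + LEARNER,
        training_dataset_uri=training_dataset_uri,
        feature_dataset_uri=feature_dataset_uri,
    )


# Placeholder URIs; in a real run the dataset services would return them.
in_vivo = BASE + "/dataset/1"
in_vitro = BASE + "/dataset/2"
phys_chem = BASE + "/dataset/3"

experiments = {
    "phys/chem -> in vivo": train(in_vivo, phys_chem),
    "in vitro -> in vivo": train(in_vivo, in_vitro),
    "phys/chem -> in vitro": train(in_vitro, phys_chem),
}
for name, (url, body) in experiments.items():
    print(name, "POST", url, parse_qs(body))
```

The same datasets are reused in different roles simply by swapping the
two parameters, without marking any individual feature as a target.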
>
> So the only place where we need the distinction between dependent and
> independent variables is during model construction - this can be easily
> achieved by naming the input variables (i.e. training and feature
> dataset URIs) appropriately. I would strongly object to putting this
> information into individual features, because doing so
>
> 	- is highly redundant
> 	- removes the flexibility that comes with generic features
> 	- clutters the feature API even more
>
>   
>> Furthermore, regarding Tobias's previous post about the IDs of features
>> and feature definitions, I think that the following problem arises (let
>> me give an example). Suppose that we (NTUA service) are given a data set
>> URI: http://www.server1.com/dataset/1 (i) and we choose the compound
>> http://www.server2.com/compound/100 (ii) and the feature definition
>> http://www.server1.com/feature_definition/1234 (iii). Where can we find
>> the value of the feature defined by the URI (iii) for the compound (ii)?
>> Note that these two are in the same dataset (i) and the URI
>> http://www.server1.com/feature/compound/100/feature_definition/1234
>> might return a status code 404 (not found). The same holds for 
>> http://www.server2.com/feature/compound/100/feature_definition/1234 . 
>> The problem arises when compounds and feature definitions from different
>> servers meet in the same dataset. ...that's all greek to me :-(
>>     
>
> I think we should look up features through the dataset service (stores
> the relation between compounds and features), not through the feature
> service (provides information about individual features). The feature
> service should know nothing about compounds (and the compound service
> should know nothing about features).
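
A minimal Python sketch of the dataset-level lookup suggested here,
using the URIs from the example above. The Dataset class, its method
names and the value 0.42 are invented for illustration. The point is
only that the dataset stores the relation between full compound and
feature URIs, so it does not matter which server each one lives on.

```python
from typing import Dict, Tuple


class Dataset:
    """Holds the relation (compound URI, feature URI) -> value."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self._values: Dict[Tuple[str, str], float] = {}

    def add(self, compound_uri: str, feature_uri: str, value: float) -> None:
        self._values[(compound_uri, feature_uri)] = value

    def value(self, compound_uri: str, feature_uri: str) -> float:
        # Raises KeyError if this dataset does not relate the two URIs.
        return self._values[(compound_uri, feature_uri)]


dataset = Dataset("http://www.server1.com/dataset/1")
compound = "http://www.server2.com/compound/100"
feature = "http://www.server1.com/feature_definition/1234"
dataset.add(compound, feature, 0.42)  # 0.42 is an invented value
print(dataset.value(compound, feature))
```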
>
> The other problem (that has been mentioned several times by now, so I
> guess it is an important one) is how to make IDs unique across
> webservices. I do not have a definitive conclusion, just a few
> ideas:
>
> - Avoid nested URIs in the dataset component that contain feature IDs
> 	I have done that in my present implementation, where it works well. It is less
> 	restrictive than you might initially think and forces you to think in
> 	terms of collections (which also has an overall performance benefit).
> 	But I cannot guarantee that we can avoid this in all cases.
>
> - Include the full URI (maybe without the http:// header) instead of an
>   ID
> 	+ readable
> 	+ unambiguous
> 	- long URIs
> 	- possibly URI parsing problems (works with my framework)
>
> - Create a unique ID from the complete URI (e.g. by base64 encoding)
> 	+ can be URI safe (e.g. with URI safe base64)
> 	+ shorter URIs
> 	- de/encoding requires additional computation
> 	- not human readable
>
> - Pass URIs as GET parameters
> 	- not in the REST spirit
>
> - Pass URIs as POST parameters
> 	- not in the REST spirit
> 	- breaks conventions (POST is usually destructive)
>
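The base64 idea from the list above can be sketched in Python with the
standard library's URL-safe alphabet. The function names uri_to_id and
id_to_uri are invented here, and stripping the "=" padding is an
assumption made to keep the ID free of characters that need escaping.

```python
import base64


def uri_to_id(uri: str) -> str:
    """Encode a full URI as a URL-safe ID (alphabet A-Z, a-z, 0-9, '-', '_')."""
    encoded = base64.urlsafe_b64encode(uri.encode("utf-8"))
    return encoded.decode("ascii").rstrip("=")


def id_to_uri(identifier: str) -> str:
    """Restore the original URI; re-adds the padding removed by uri_to_id."""
    padding = "=" * (-len(identifier) % 4)
    return base64.urlsafe_b64decode(identifier + padding).decode("utf-8")


compound_uri = "http://www.server2.com/compound/100"
compound_id = uri_to_id(compound_uri)
print(compound_id)
print(id_to_uri(compound_id))
```

The resulting ID can be used as a single path segment. Note that base64
output is about a third longer than its input, so the ID is compact
compared with a percent-escaped nested URI rather than with the URI
itself.
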
> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>