[OTDev] IDs for features, feature_definitions ...

Thu Oct 1 21:08:20 CEST 2009

Excerpts from chung's message of Wed Sep 30 16:04:57 +0200 2009:
> Dear Tobias, Nina, All,
> 
> According to the OpenTox API 1.0 about algorithms, a training algorithm
> service accepts as POSTed parameters the dataset_uri and some algorithm
> specific parameters. The dataset_uri itself contains a list of compounds
> and a corresponding list of feature definitions in a sense that if
> "/compound/100" and "/feature_definition/2535" belong to a dataset with
> uri "/dataset/37" then the value of the feature is available at
> "/feature/compound/100/feature_definition/2535". For example see
> http://lxkramer13.informatik.tu-muenchen.de:8180/OpenTox/feature/compound/aldosterone/feature_definition/CDK_LipinskiFailures .
> 
> So, the dataset contains feature definitions that correspond either to
> molecular descriptors or toxicological endpoints without being able to
> tell which is which. However this piece of information is very important
> for a learning algorithm. So I propose that the target feature
> definition should be an extra POSTed parameter. This modification of
> the API is of high importance for algorithm related web services and
> should be taken into account in API 2.0 (or 1.1 maybe! :-).

I make the distinction at the dataset level, which works very well for
my purposes. Consider the ToxCast data, for example: you can create
several datasets, e.g.:

- in vivo Data 
- in vitro Data
- phys/chem Properties
- ... (e.g. structural Fragments)

Now you can make a lot of interesting (and meaningful) experiments:

- Use phys/chem Properties to predict in vivo Effects
- Use phys/chem Properties to predict in vitro Effects
- Use in vitro Data to predict in vivo Effects
- Combine phys/chem and in vitro Data to predict in vivo Effects

An exemplary workflow (roughly based on my API proposal) could be:

Create datasets for experimental Data:

	in_vivo_dataset_uri = POST /dataset data=in_vivo_data 	# we still have to decide about our internal data exchange format!!
	in_vitro_dataset_uri = POST /dataset data=in_vitro_data

Calculate features:

	phys_chem_dataset_uri = POST /algorithm/phys_chem_properties dataset_uri=in_vivo_dataset_uri

Create a model to predict in vivo effects from phys/chem Properties:

	model_uri = POST /algorithm/your_favorite_learning_algorithm training_dataset_uri=in_vivo_dataset_uri, feature_dataset_uri=phys_chem_dataset_uri 

Create a model to predict in vivo effects from in vitro Properties:

	model_uri = POST /algorithm/your_favorite_learning_algorithm training_dataset_uri=in_vivo_dataset_uri, feature_dataset_uri=in_vitro_dataset_uri 

Create a model to predict in vitro effects from phys/chem Properties:

	model_uri = POST /algorithm/your_favorite_learning_algorithm training_dataset_uri=in_vitro_dataset_uri, feature_dataset_uri=phys_chem_dataset_uri 

and so on.
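
As a rough sketch, the three model-building calls above could be written
down in Python as follows. The code only builds the POST requests and does
not send them. The host name, the dataset URIs and the helper names
(build_post, model_request) are assumptions for illustration; only the
parameter names training_dataset_uri and feature_dataset_uri come from the
proposal above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://example.org"  # hypothetical service host


def build_post(path, **params):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(params).encode("ascii")
    return Request(BASE + path, data=body, method="POST")


def model_request(training_dataset_uri, feature_dataset_uri):
    """Dependent variables come from the training dataset,
    independent variables from the feature dataset."""
    return build_post(
        "/algorithm/your_favorite_learning_algorithm",
        training_dataset_uri=training_dataset_uri,
        feature_dataset_uri=feature_dataset_uri,
    )


# Made-up dataset URIs, standing in for the results of earlier POSTs.
in_vivo = BASE + "/dataset/1"
in_vitro = BASE + "/dataset/2"
phys_chem = BASE + "/dataset/3"

model_requests = [
    model_request(in_vivo, phys_chem),   # phys/chem -> in vivo
    model_request(in_vivo, in_vitro),    # in vitro  -> in vivo
    model_request(in_vitro, phys_chem),  # phys/chem -> in vitro
]
for req in model_requests:
    print(req.get_method(), req.full_url, req.data.decode("ascii"))
```

The role of each dataset is carried entirely by the parameter name, so the
same dataset URI can serve as training data in one call and as feature data
in another.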

So the only place where we need the distinction between dependent and
independent variables is during model construction - this can be easily
achieved by naming the input variables (i.e. training and feature
dataset URIs) appropriately. I would strongly object to putting this
information into individual features, because doing so

	- is highly redundant
	- removes the flexibility that comes with generic features
	- clutters the feature API even more

> Furthermore about the previous post of Tobias about the Ids of features
> and feature definition, I think that the following problem arises (Let
> me give an example). Suppose that we (NTUA service) are given a data set
> URI: http://www.server1.com/dataset/1 (i) and we choose the compound
> http://www.server2.com/compound/100 (ii) and the feature definition
> http://www.server1.com/feature_definition/1234 (iii). Where can we find
> the value of the feature defined by the URI (iii) for the compound (ii)?
> Note that these two are in the same dataset (i) and the URI
> http://www.server1.com/feature/compound/100/feature_definition/1234
> might return a status code 404 (not found). The same holds for 
> http://www.server2.com/feature/compound/100/feature_definition/1234 . 
> The problem arises when compounds and feature definitions from different
> servers meet in the same dataset. ...that's all greek to me :-(

I think we should look up feature values through the dataset service
(which stores the relation between compounds and features), not through
the feature service (which provides information about individual
features). The feature service should know nothing about compounds (and
the compound service should know nothing about features).
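
To illustrate the idea, here is a minimal in-memory sketch of a dataset
that owns the compound/feature relation. The class Dataset, its methods and
the stored value are hypothetical and not part of any OpenTox API; the URIs
are the ones from the example quoted above.

```python
class Dataset:
    """Toy dataset: maps (compound URI, feature definition URI) to a value."""

    def __init__(self, uri):
        self.uri = uri
        self._values = {}

    def add(self, compound_uri, feature_definition_uri, value):
        self._values[(compound_uri, feature_definition_uri)] = value

    def value(self, compound_uri, feature_definition_uri):
        """Return the stored value, or None if the pair is not in this dataset."""
        return self._values.get((compound_uri, feature_definition_uri))

    def compounds(self):
        """Return the sorted URIs of all compounds that have a value here."""
        return sorted({compound for compound, _ in self._values})


dataset = Dataset("http://www.server1.com/dataset/1")
dataset.add(
    "http://www.server2.com/compound/100",
    "http://www.server1.com/feature_definition/1234",
    0.42,  # made-up value
)
print(dataset.value(
    "http://www.server2.com/compound/100",
    "http://www.server1.com/feature_definition/1234",
))
```

Because the lookup key is the pair of full URIs, it does not matter that the
compound and the feature definition live on different servers.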

The other problem (that has been mentioned several times by now, so I
guess it is an important one) is how to make IDs unique across
webservices. I do not have a definitive conclusion, just a few
ideas:

- Avoid nested URIs in the dataset component that contain feature IDs
	I have done that in my present implementation, where it works well. It is
	less restrictive than you might initially think and forces you to think in
	terms of collections (which also has an overall performance benefit).
	But I cannot guarantee that we can avoid this in all cases.

- Include the full URI (maybe without the http:// prefix) instead of an
  ID
	+ readable
	+ unambiguous
	- long URIs
	- possibly URI parsing problems (works with my framework)

- Create a unique ID from the complete URI (e.g. by base64 encoding)
	+ can be URI safe (e.g. with URI safe base64)
	+ shorter URIs
	- de/encoding requires additional computation
	- not human readable

- Pass URIs as GET parameters
	- not in the REST spirit

- Pass URIs as POST parameters
	- not in the REST spirit
	- breaks conventions (POST usually changes server state)
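
The two encoding ideas from the list above (full URI inside the path, and an
ID derived from the URI by URL-safe base64) can be sketched with the Python
standard library. The function names and the "/dataset/1/feature/" path
layout are assumptions for illustration only.

```python
import base64
from urllib.parse import quote, unquote


def uri_to_id(uri):
    """Encode a full URI as a URL-safe base64 ID (padding stripped)."""
    encoded = base64.urlsafe_b64encode(uri.encode("utf-8")).decode("ascii")
    return encoded.rstrip("=")


def id_to_uri(identifier):
    """Invert uri_to_id, restoring the stripped padding first."""
    padding = "=" * (-len(identifier) % 4)
    return base64.urlsafe_b64decode(identifier + padding).decode("utf-8")


def uri_to_path_segment(uri):
    """Percent-encode a full URI so that it fits into one path segment."""
    return quote(uri, safe="")


feature_uri = "http://www.server1.com/feature_definition/1234"
encoded = uri_to_id(feature_uri)
segment = uri_to_path_segment(feature_uri)

# Hypothetical nested URIs built from each variant.
print("/dataset/1/feature/" + encoded)
print("/dataset/1/feature/" + segment)

# Both variants are reversible.
print(id_to_uri(encoded) == feature_uri)
print(unquote(segment) == feature_uri)
```

The percent-encoded variant stays partly readable, while the base64 variant
is opaque; in this sketch neither is shorter than the original URI, so the
"shorter URIs" benefit would need a more compact scheme.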

Best regards,
Christoph