[OTDev] Descriptor Calculation Services

Thu Jan 14 14:04:31 CET 2010

Christoph Helma wrote:
> Excerpts from Nina Jeliazkova's message of Wed Jan 13 09:06:37 +0100 2010:
>   
>>> returns the URI of the generated cleaned-up dataset. This is the way
>>> feature selection services should work too.
>>>
>>> First of all, I think we need a cleanup service that removes all string
>>> features from the dataset
>>>       
>> You can do this right now with the following steps:
>>
>> 1) Get features for a dataset via /dataset/{id}/feature or any other
>> means (e.g. looking through the entire dataset)
>> 2) Select string features (numeric ones are denoted as ot:NumericFeature
>> in the latest OpenTox ontology)
>> 3) Form the URL for the reduced dataset as
>> /dataset/{id}?feature_uris[]=/mynumericfeature1&feature_uris[]=/mynumericfeature2&feature_uris[]=/mynumericfeature3&feature_uris[]=etc
>>
>> String feature dropping service will be just a convenient wrapper for
>> the steps above.
>>     
I would like to take this chance to remind all developers that the
dataset API 1.1 allows specifying feature URIs and compound URIs when
querying a dataset (this is an improvement over API 1.0, thanks to
Christoph).

http://opentox.org/dev/apis/api-1.1/dataset

Query a dataset    GET    /dataset/{id}
    *compound_uris[]* and/or *feature_uris[]* to select compounds and
    features

These are very flexible means of getting slices of a dataset (features =
columns, compounds = rows) or of merging data across different datasets,
without the need to download/upload dataset content.
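
A minimal Python sketch of steps 1 to 3 quoted above, for illustration only. It assumes the feature list has already been retrieved (e.g. from /dataset/{id}/feature) and parsed into a mapping from feature URI to a flag saying whether the feature is an ot:NumericFeature. The function name `reduced_dataset_url` and that mapping are hypothetical; only the `feature_uris[]` parameter comes from the API.

```python
from urllib.parse import urlencode


def reduced_dataset_url(dataset_uri, feature_is_numeric):
    """Build the URL of a dataset slice that keeps only numeric features.

    dataset_uri        -- e.g. "/dataset/7" (illustrative value)
    feature_is_numeric -- hypothetical mapping: feature URI -> True if the
                          feature is typed as ot:NumericFeature
    """
    # Step 2: keep the numeric features, i.e. drop the string ones.
    numeric = [uri for uri, is_num in feature_is_numeric.items() if is_num]
    if not numeric:
        return dataset_uri
    # Step 3: repeat feature_uris[] once per selected feature.
    # "/", ":", "[" and "]" are left unescaped so the URL stays readable.
    query = urlencode([("feature_uris[]", uri) for uri in numeric],
                      safe="/:[]")
    return f"{dataset_uri}?{query}"
```

For example, with features /feature/1 and /feature/3 numeric and /feature/2 a string feature, the result for /dataset/7 would be /dataset/7?feature_uris[]=/feature/1&feature_uris[]=/feature/3. A string-feature dropping service could do little more than this and return the resulting URI.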

The above functionality is especially relevant for feature selection
algorithms and data cleanup algorithms. Would it make sense for these
kinds of algorithms to specify their output as a set of feature URIs,
instead of a dataset?

e.g. FeatureSelection algorithm: input parameter dataset_uri; output
parameter feature_uri[]
>>>  and a second service that handles the missing
>>> values of the dataset, substituting them with the average or the median
>>> of all other values for the same feature in the dataset.
>>>       
>> This functionality is indeed missing.
>>     
> What is the purpose of this, why do we need/want this functionality?
>   

This refers to one of the methods for handling missing values in
machine learning: each missing value is replaced by a statistic computed
from the known values of the same feature (there are also more complex
solutions than taking an average).
We might check whether we have included such methods in the list of
planned ones, and with what priority.
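
To make the idea concrete, here is a small Python sketch of mean/median substitution. It is only an illustration under stated assumptions, not an existing OpenTox service: the dataset is assumed to be already loaded as a list of rows (one dict per compound, feature name -> number or None for a missing value), and the name `impute_missing` is hypothetical.

```python
from statistics import mean, median


def impute_missing(rows, strategy="mean"):
    """Replace missing (None) values by the mean or median of the known
    values of the same feature. Returns new rows; input is not modified.

    rows -- hypothetical in-memory form of a dataset: a list of dicts,
            one per compound, mapping feature name -> number or None
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    fill = mean if strategy == "mean" else median

    # Collect every feature that appears in at least one row.
    features = {name for row in rows for name in row}

    # Compute one substitute value per feature from its known values.
    substitutes = {}
    for name in features:
        known = [row[name] for row in rows if row.get(name) is not None]
        if known:  # a feature with no known values stays missing
            substitutes[name] = fill(known)

    return [
        {name: row[name] if row.get(name) is not None
         else substitutes.get(name)
         for name in features}
        for row in rows
    ]
```

The median is less sensitive to outliers than the mean, which is one reason to offer both. A feature with no known values at all is left as missing rather than guessed.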

Best regards,
Nina