[OTDev] Descriptor Calculation Services

Thu Jan 14 12:56:41 CET 2010

Excerpts from Nina Jeliazkova's message of Wed Jan 13 09:06:37 +0100 2010:
> > returns the URI of the generated cleaned-up dataset. This is the way
> > feature selection services should work too.
> >
> > First of all, I think we need a cleanup service that removes all string
> > features from the dataset
> 
> You can do this right now with the following steps:
> 
> 1) Get the features for a dataset via /dataset/{id}/feature or any other
> means (e.g. looking through the entire dataset)
> 2) Select string features (in the latest opentox ontology, numerics are
> denoted as ot:NumericFeature)
> 3) form the URL for the reduced dataset as
> /dataset/{id}?feature_uris[]=/mynumericfeature1&feature_uris[]=/mynumericfeature2&feature_uris[]=/mynumericfeature3&feature_uris[]=etc
> 
> String feature dropping service will be just a convenient wrapper for
> the steps above.
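
The three quoted steps could be sketched in Python roughly as follows. This is only an illustration: the function name and the shape of the input (a mapping from feature URI to its type label, gathered however step 1 is done) are assumptions, and only the feature_uris[] parameter and ot:NumericFeature come from the message above.

```python
from urllib.parse import urlencode

# Type label for numeric features, as written in the quoted message.
NUMERIC = "ot:NumericFeature"


def reduced_dataset_url(dataset_uri, feature_types):
    """Build the URL of a dataset restricted to its numeric features.

    feature_types is a hypothetical mapping {feature URI: type label},
    e.g. collected from /dataset/{id}/feature.
    """
    numeric = [uri for uri, ftype in feature_types.items() if ftype == NUMERIC]
    if not numeric:
        raise ValueError("dataset has no numeric features")
    # Keep brackets, slashes and colons readable instead of percent-encoding them.
    query = urlencode([("feature_uris[]", uri) for uri in numeric], safe="[]/:")
    return f"{dataset_uri}?{query}"
```

A string-feature-dropping service would then only need to return this URL (or the dataset behind it).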
> >  and a second service that handles the missing
> > values of the dataset substituting them with the average or the median
> > of all other values for the same feature in the dataset.
> This functionality is indeed missing.
What is the purpose of this, why do we need/want this functionality?
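
For concreteness, the substitution proposed in the quoted text amounts to roughly the following. It is a minimal Python sketch: the function name and the use of None to mark a missing value are assumptions for illustration, not part of any existing service.

```python
from statistics import mean, median


def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to base a substitute on; return the column unchanged.
        return list(values)
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]
```

Each feature column of the dataset would be passed through this separately.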
> >>> I would be extremely careful with the addition of missing features for
> >>> several reasons:
> >>>
> >>> - Sometimes there are good physical/chemical/biological/algorithmic reasons why
> >>>   features are missing - calculating these features might give
> >>>   you a number but it is very likely that it is meaningless. 
> >>>   
> >>>       
> >> Agree.
> >>     
> >
> > Yes, sometimes indeed. What about all other times?
> It might be an interesting topic to think how do we distinguish the two
> cases :)
A descriptor calculation service can write this info to OWL-DL (see
Nina's proposal for calculation errors), but I am not very optimistic
about getting the same info reliably for measured values (e.g. if a
compound has poor solubility, high volatility, ... most experimenters
would rather enter nothing instead of stating their difficulties with a
compound).
> > For instance how
> > useful is a dataset which contains a set of compounds and values for one
> > and only feature (the target) without a service that calculates the
> > values for the other features? 
> 
> Information about the descriptor calculation services used to generate
> existing values should be available for each Feature via the ot:hasSource
> property.  It is then straightforward to use the URL of the service to
> launch a remote or local calculation.
Agreed.
> > I believe that there are lots of reasons
> > to have a service which searches for missing values in the dataset and
> > tries to calculate them; after all that service will not be bundled with
> > the model training and its use would be optional.
> >
> >   
> Why not just use descriptor calculation services, as they currently
> exist?  It is an implementation detail whether the service will prefer to
> calculate existing values once again or only perform calculations where
> these are not available (I would actually prefer the latter as the default
> implementation, purely for performance reasons).
Agreed. I do this e.g. for Tox predictions, if the service finds a
measured value it returns the measured value, otherwise a prediction is
calculated.
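
A minimal sketch of that fallback in Python, assuming a lookup table of measured values and a prediction callable (both names are hypothetical and stand in for whatever the service actually uses):

```python
def value_for(compound, measured, predict):
    """Return (value, origin): the measured value if present, else a prediction."""
    value = measured.get(compound)
    if value is not None:
        return value, "measured"
    # Only call the (possibly expensive) calculation when nothing is stored.
    return predict(compound), "predicted"
```

The same pattern would apply to a descriptor calculation service that only computes values which are not yet available.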

Thanks Nina for your detailed explanations.

Best regards,
Christoph