[OTDev] Descriptor Calculation Services

Nina Jeliazkova nina at acad.bg
Wed Jan 13 09:06:37 CET 2010


Hi All,

chung wrote:
> Hi all,
>  There has been a discussion on data preprocessing web services these
> days so I'd like to summarize a few things here and then decide how
> should we proceed with the implementation of such services. First of
> all, there is no need for any changes in the current version of the API
> to implement API1.1-compliant services. Preprocessing services are
> algorithms, an RDF representation should be provided at:
>
> /algorithm/datacleanup
>
> and the method 
>
> POST /algorithm/datacleanup
> Content-type: application/rdf+xml
>   
To align this with the current API and ontologies, I would suggest:
1) Use *dataset_uri* as an input parameter (as with other algorithms),
rather than POSTing the content itself.
2) The URI follows the generic algorithm naming scheme /algorithm/{id}.  A
data cleanup algorithm is a subclass of
*http://www.opentox.org/algorithms.owl#DataCleanup* from the AlgorithmTypes
ontology, and the RDF representation should contain the proper rdf:type
statement.
3) Agree on how the resulting URL is returned.  There are several options
(not mutually exclusive):
- if no content is returned, the URL is in the Location HTTP header
(this is mandatory for redirect responses, e.g. when returning task IDs);
- if content is returned, it can be text/uri-list (obviously) or an RDF
representation, containing perhaps only a simple RDF node with the URL
as the node identifier.
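As a rough sketch of the client side of option 3 (the /algorithm/datacleanup id, URLs, and response shapes here are assumptions for illustration, not settled API):

```python
from urllib.parse import urlencode

def build_cleanup_request(dataset_uri):
    # Form-encoded body for POST /algorithm/datacleanup, passing the
    # dataset by reference instead of posting its content.
    return urlencode({"dataset_uri": dataset_uri})

def extract_result_uri(headers, content_type, body):
    # Pick up the resulting URL under the options listed above.
    if "Location" in headers:
        # No-content / redirect case (e.g. a task URI): URL is in the header.
        return headers["Location"]
    if content_type == "text/uri-list":
        # The first non-comment line of a uri-list is the result URL.
        for line in body.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                return line
    return None
```

Only the request/response plumbing is shown; a real client would of course issue the HTTP request itself and handle the RDF variant as well.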

http://opentox.org/dev/apis/api-1.1/Algorithm  entry is updated
accordingly.

(Please note that, for compatibility reasons, services should use
"dataset_uri" to specify datasets and "prediction_feature" to denote the
target variable.)

> returns the URI of the generated cleaned-up dataset. This is the way
> feature selection services should work too.
>
> First of all, I think we need a cleanup service that removes all string
> features from the dataset

You can do this right now with the following steps:

1) Get the features for a dataset via /dataset/{id}/feature or any other
means (e.g. by looking through the entire dataset).
2) Select the string features (numeric features are denoted in the latest
OpenTox ontology as ot:NumericFeature).
3) Form the URL for the reduced dataset as
/dataset/{id}?feature_uris[]=/mynumericfeature1&feature_uris[]=/mynumericfeature2&feature_uris[]=/mynumericfeature3&feature_uris[]=etc

A string feature dropping service would then be just a convenient wrapper
for the steps above.
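The steps above might be sketched like this (the feature-metadata shape and all URIs are hypothetical):

```python
from urllib.parse import quote

def numeric_subset_url(dataset_uri, feature_types):
    # Steps 1-3: keep only features typed ot:NumericFeature and form the
    # feature_uris[] query string for the reduced dataset view.
    # `feature_types` maps feature URI -> set of rdf:type values.
    numeric = [uri for uri, types in sorted(feature_types.items())
               if "ot:NumericFeature" in types]
    query = "&".join("feature_uris[]=" + quote(uri, safe="")
                     for uri in numeric)
    return dataset_uri + "?" + query
```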
>  and a second service that handles the missing
> values of the dataset substituting them with the average of the median
> of all other values for the same feature in the dataset.
This functionality is indeed missing.
>  I have a
> proposal about a third preprocessing service where the missing values
> are calculated by some descriptor calculation service *if possible*.
>
>
>
> On Tue, 2010-01-12 at 19:23 +0200, Nina Jeliazkova wrote:
>   
>> Christoph Helma wrote:
>>     
>>> Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
>>>   
>>>       
>>>> Hi Pantelis, All,
>>>>
>>>> On Thu, 2010-01-07 at 18:49 +0200, chung wrote: 
>>>>     
>>>>         
>>>>> Hi Tobias, All,
>>>>>  While trying to train a model, the service is possible to "find" some
>>>>> missing values for a specific feature. 
>>>>>       
>>>>>           
>>>> To obviate misunderstandings: You want to train a model with a data set
>>>> that contains missing values for a specific feature and the service
>>>> detects the missing features before training, right?
>>>>
>>>>     
>>>>         
>>>>> Is there a way to use your
>>>>> services to obtain the missing value? 
>>>>>       
>>>>>           
>>>> If the feature with the missing values was produced from our descriptor
>>>> calculation service, yes. But you would have to build a dataset with all
>>>> the compounds where the value is missing and submit it to the descriptor
>>>> calculation service.
>>>> The question is, if a model training service should automatically
>>>> provide the functionality of "filling up" missing values. I think this
>>>> is something that should be done in the preprocessing phase - in a
>>>> preprocessing/data cleaning service.
>>>>     
>>>>         
>>> I would be extremely careful with the addition of missing features for
>>> several reasons:
>>>
>>> - Sometimes there are good physical/chemical/biological/algorithmic reasons why
>>>   features are missing - calculating these features might give
>>>   you a number but it is very likely that it is meaningless. 
>>>   
>>>       
>> Agree.
>>     
>
> Yes, sometimes indeed. What about all the other times?
It might be an interesting topic to think about how we distinguish the two
cases :)
> For instance how
> useful is a dataset which contains a set of compounds and values for one
> and only feature (the target) without a service that calculates the
> values for the other features? 

Information about the descriptor calculation services used to generate the
existing values should be available for each Feature via the ot:hasSource
property.  It is then straightforward to use the URL of the service to
launch a remote or local calculation.
> I believe that there are lots of reasons
> to have a service which searches for missing values in the dataset and
> tries to calculate them; after all that service will not be bundled with
> the model training and its use would be optional.
>
>   
Why not just use descriptor calculation services as they currently
exist?  It is an implementation detail whether the service prefers to
recalculate existing values or only performs calculations where values
are not yet available (I would actually prefer the latter as the default
implementation, purely for performance reasons).
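A sketch of that default (the dataset and feature-metadata shapes are hypothetical, and ot:hasSource is read here as a plain property):

```python
def plan_calculation(dataset, feature_uri, feature_meta):
    # The default preferred above: recalculate nothing that already has a
    # value.  Returns the service URL found via ot:hasSource and the
    # compounds that still need to be submitted to it.
    service = feature_meta.get("ot:hasSource")
    missing = [compound for compound, values in sorted(dataset.items())
               if values.get(feature_uri) is None]
    return service, missing
```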
>>> - A sameAs relationship does not guarantee, that (calculated and
>>>   measured) feature values are comparable (very frequently they are
>>> 	not).
>>>   
>>>       
>> Right, this is the reason of having ot:hasSource for features , allowing
>> to identify exactly the descriptor calculation service used. 
>>     
Replying to myself: see above for ot:hasSource usage.
>>> - Even if you find a measured value for the same feature, there is a
>>>   good chance, that it has been obtained by a different protocol and
>>> 	that it is not comparable with the other feature values.
>>>   
>>>       
>> Agree.
>>     
>>> I would suggest to add features only
>>>
>>> - if you have a clear understanding, why a feature is missing
>>>       
>
> Why should you understand this? 
Because a lack of understanding will be reflected in the quality of the
resulting model.  For example, it is possible to calculate a logP value of
15 or even 100, but such a value is generally considered not meaningful,
for various reasons.  It would be better to remove compounds with such
values when building a model.
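For instance, such a cleanup step might look like the following sketch (the bounds are arbitrary illustration only, not recommended logP cutoffs):

```python
def drop_implausible(dataset, feature_uri, lo=-10.0, hi=10.0):
    # Keep only compounds whose value for `feature_uri` is present and
    # falls inside a plausible range; everything else is excluded from
    # the training set.
    return {compound: values for compound, values in dataset.items()
            if values.get(feature_uri) is not None
            and lo <= values[feature_uri] <= hi}
```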

> It is missing because no service
> calculated it or for any other reason 
It might be that a service had already attempted to calculate the
values but failed for some reason.  It might be an exotic atom type,
which prevents the calculations from being done properly, or a lack of 3D
structure, etc.  We need a way to denote such cases.

I would propose an extension of opentox.owl with a subclass of
ot:FeatureValue (e.g. ErrorValue or something alike) to denote the case
of failure and the reason for it.  A FeatureValue is composed of a feature
and a value, where the feature will point to the calculation that has been
attempted (including the algorithm), and the value itself can be a string
with a human-readable description of the reason for the failure, or a URL.

http://opentox.org/dev/apis/api-1.1/Feature
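A rough sketch of how such a node might be serialized (the ErrorValue class name and property layout are only a proposal, not part of opentox.owl):

```python
from xml.sax.saxutils import escape

def error_value_rdf(feature_uri, reason):
    # Serialize a failed calculation as the proposed ot:ErrorValue node:
    # the feature points at the attempted calculation, the value carries
    # a human-readable reason (or a URL).
    return ('<ot:ErrorValue>\n'
            '  <ot:feature rdf:resource="%s"/>\n'
            '  <ot:value>%s</ot:value>\n'
            '</ot:ErrorValue>' % (escape(feature_uri, {'"': '&quot;'}),
                                  escape(reason)))
```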

A descriptor calculation service requiring 3D structure will
continuously fail to calculate values for structures lacking 3D
coordinates.  The BlueObelisk descriptor ontology provides a means for
specifying descriptor calculation requirements (2D/3D coordinates) via the
*http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#requires*
property.
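A client could consult that property before dispatching work; a minimal sketch, assuming the metadata has already been fetched (the dict shapes and the "has3D" flag are hypothetical):

```python
def can_calculate(descriptor_meta, compound):
    # Check the BlueObelisk `requires` annotation up front: a descriptor
    # requiring 3D coordinates cannot be run on a compound record that
    # lacks them, so skip it instead of letting the service fail.
    requires = descriptor_meta.get("bo:requires")
    if requires == "3D coordinates":
        return compound.get("has3D", False)
    return True
```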

If you use descriptor calculation algorithms described in BO, please
refer to the relevant algorithm; this will ensure we can use all the
available information in the ontology service.

For other descriptor calculation algorithms, it would be good if they
could be described in terms of BO.

> I can't think of right now, but a
> client needs to calculate those values (if possible) and of course using
> a proper descriptor calculation service to avoid calculating something
> else by mistake.
>
>   
So it should be sufficient to locate the proper calculation service and
run the calculations.  I don't really see the reason for introducing
another kind of calculation service.  It is more likely that we need a
kind of filtering/querying data service, allowing one to query a dataset
for entries without values for specific features, and this should be easy
to do via the current algorithm API.
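Client-side, such a query could look like the following sketch (the dataset shape is hypothetical; a real service would accept the feature URIs as request parameters):

```python
def entries_without_values(dataset, feature_uris):
    # Return the compounds lacking a value for at least one of the
    # requested features -- the candidates for descriptor calculation.
    return [compound for compound, values in sorted(dataset.items())
            if any(values.get(f) is None for f in feature_uris)]
```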
>>> - if you can prove that the feature calculation algorithm creates values
>>>   that are comparable with the original measurements (or calculation
>>> 	algorithm)
>>>       
>
> No computational tool can reproduce measured values
Not a very optimistic comment for any modeling exercise/framework :)

Best regards,
Nina
>  - I'm talking about
> descriptors which can be calculated given the structure of the chemical
> compound.
>
> Best regards,
> Pantelis
>
>   
>>> - if you clearly document how and why the original dataset has been
>>>   modified
>>>   
>>>       
>> A user interface supporting the above (e.g. allowing the user to
>> document why something is modified) would be relevant for both Fastox
>> and Toxmodel.
>>
>> Best regards,
>> Nina
>>     
>>> Best regards,
>>> Christoph
>>> _______________________________________________
>>> Development mailing list
>>> Development at opentox.org
>>> http://www.opentox.org/mailman/listinfo/development
>>>   
>>>       
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>
>>     
>
>   



