[OTDev] Descriptor Calculation Services

chung chvng at mail.ntua.gr
Thu Jan 14 23:32:32 CET 2010


On Wed, 2010-01-13 at 10:06 +0200, Nina Jeliazkova wrote:
> Hi All,
> 
> chung wrote: 
> > Hi all,
> >  There has been a discussion about data preprocessing web services these
> > days, so I'd like to summarize a few things here and then decide how we
> > should proceed with the implementation of such services. First of
> > all, there is no need for any changes in the current version of the API
> > to implement API 1.1-compliant services. Preprocessing services are
> > algorithms; an RDF representation should be provided at:
> > 
> > /algorithm/datacleanup
> > 
> > and the method 
> > 
> > POST /algorithm/datacleanup
> > Content-type: application/rdf+xml
> >   
> To have this aligned with the current API and ontologies, I would suggest:
> 1) Use dataset_uri as the input parameter (as with other algorithms),
> rather than posting the content itself.
Agreed
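
For illustration, a call could look roughly like this from a client's point
of view (just a sketch; the dataset URI below is a placeholder and the
cleanup service is the one I mention further down):

import requests

# Rough sketch of calling a cleanup algorithm with a dataset_uri parameter.
# Both URIs are placeholders/examples, not guaranteed endpoints.
algorithm_uri = "http://opentox.ntua.gr:3000/algorithm/cleanup"
dataset_uri = "http://example.org/dataset/1"

response = requests.post(
    algorithm_uri,
    data={"dataset_uri": dataset_uri},
    headers={"Accept": "text/uri-list"},
)
response.raise_for_status()
print(response.text.strip())   # URI of the cleaned-up dataset (or of a task)
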
> 2) The URI follows the generic algorithm naming scheme /algorithm/{id}.  A
> data cleanup algorithm is a subclass of
> http://www.opentox.org/algorithms.owl#DataCleanup  from the AlgorithmTypes
> ontology, and the RDF representation should contain the proper rdf:type
> statement.

There's already an example at
http://opentox.ntua.gr:3000/algorithm/cleanup 
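
For completeness, here is a sketch of how a client could check the rdf:type
of that algorithm against the AlgorithmTypes ontology (assuming the service
serves application/rdf+xml, as the API requires):

import rdflib

# Sketch: verify that the algorithm is typed as a DataCleanup algorithm.
algorithm_uri = "http://opentox.ntua.gr:3000/algorithm/cleanup"
datacleanup = rdflib.URIRef("http://www.opentox.org/algorithms.owl#DataCleanup")

g = rdflib.Graph()
g.parse(algorithm_uri, format="xml")   # application/rdf+xml representation

print((rdflib.URIRef(algorithm_uri), rdflib.RDF.type, datacleanup) in g)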

> 3) Agree on how the resulting URL is returned.  There are several
> options (not mutually exclusive):
> - if no content is returned, the URL is given in the Location HTTP
> header (this is mandatory for redirect responses, like returning task
> IDs);
> - if content is returned, it can be text/uri-list (obviously) or an RDF
> representation, containing perhaps only a simple RDF node with the
> URL as the node identifier.
> 
> http://opentox.org/dev/apis/api-1.1/Algorithm  entry is updated
> accordingly. 
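
From the client side, handling these options could be a small wrapper like
the sketch below (the service URI is an example and the exact behaviour of a
given implementation is an assumption):

import requests

# Sketch: cover both ways a service may return the resulting URI.
response = requests.post(
    "http://opentox.ntua.gr:3000/algorithm/cleanup",        # example service
    data={"dataset_uri": "http://example.org/dataset/1"},   # placeholder
    headers={"Accept": "text/uri-list"},
    allow_redirects=False,
)

result_uri = response.headers.get("Location")
if result_uri is None:
    # Content returned as text/uri-list: one URI per line.
    result_uri = response.text.strip().splitlines()[0]

print(result_uri)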
> 
> (Please note that for compatibility reasons, services should use
> "dataset_uri" to specify datasets and "prediction_feature" to denote the
> target variable.)
> 
> > returns the URI of the generated cleaned-up dataset. This is the way
> > feature selection services should work too.
> > 
> > First of all, I think we need a cleanup service that removes all string
> > features from the dataset
> 
> You can do this right now with the following steps:
> 
> 1) Get the features for a dataset via /dataset/{id}/feature  or any other
> means (e.g. looking through the entire dataset)
> 2) Select the string features (numeric features are denoted in the latest
> OpenTox ontology as ot:NumericFeature)
> 3) Form the URL for the reduced dataset as
> /dataset/{id}?feature_uris[]=/mynumericfeature1&feature_uris[]=/mynumericfeature2&feature_uris[]=/mynumericfeature3&feature_uris[]=etc
> 
> A string-feature-dropping service will just be a convenient wrapper for
> the steps above.
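
Here is a rough sketch of those three steps from a client's perspective (the
dataset URI is a placeholder, and the exact ot: namespace used for
ot:NumericFeature is an assumption on my side):

import rdflib
import requests

OT = rdflib.Namespace("http://www.opentox.org/api/1.1#")   # assumed ot: namespace
dataset_uri = "http://example.org/dataset/1"                # placeholder

# 1) Get the features of the dataset.
g = rdflib.Graph()
g.parse(dataset_uri + "/feature", format="xml")

# 2) Keep only features typed as ot:NumericFeature (i.e. drop string features).
numeric_features = [str(f) for f in g.subjects(rdflib.RDF.type, OT.NumericFeature)]

# 3) Form the URL of the reduced dataset (feature_uris[] repeated per feature).
reduced = requests.get(
    dataset_uri,
    params={"feature_uris[]": numeric_features},
    headers={"Accept": "application/rdf+xml"},
)
print(reduced.url)
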
> > and a second service that handles the missing
> > values of the dataset, substituting them with the average or the median
> > of all other values for the same feature in the dataset.
> This functionality is indeed missing. 
In the next few days I will deploy a service which handles missing
values this way at http://opentox.ntua.gr:3000/algorithm/mvh1 
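
Internally it will do something along these lines (a pure sketch of the
mean/median substitution, not the actual mvh1 code):

import statistics

# Sketch: fill missing values (None) of each feature with the mean or median
# of the known values for that feature.
def impute(rows, strategy="mean"):
    aggregate = statistics.mean if strategy == "mean" else statistics.median
    features = {f for values in rows.values() for f in values}
    for feature in features:
        known = [v[feature] for v in rows.values() if v.get(feature) is not None]
        if not known:
            continue   # no known values to impute from
        fill = aggregate(known)
        for values in rows.values():
            if values.get(feature) is None:
                values[feature] = fill
    return rows

# Example: the missing logP of compound2 becomes the mean of the known values.
data = {"compound1": {"logP": 1.2}, "compound2": {"logP": None}, "compound3": {"logP": 2.0}}
print(impute(data))
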
> > I have a
> > proposal about a third preprocessing service where the missing values
> > are calculated by some descriptor calculation service *if possible*.
> > 
> > 
> > 
> > On Tue, 2010-01-12 at 19:23 +0200, Nina Jeliazkova wrote:
> >   
> > > Christoph Helma wrote:
> > >     
> > > > Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
> > > >   
> > > >       
> > > > > Hi Pantelis, All,
> > > > > 
> > > > > On Thu, 2010-01-07 at 18:49 +0200, chung wrote: 
> > > > >     
> > > > >         
> > > > > > Hi Tobias, All,
> > > > > >  While trying to train a model, it is possible for the service to
> > > > > > "find" some missing values for a specific feature. 
> > > > > >       
> > > > > >           
> > > > > To avoid misunderstandings: you want to train a model with a data set
> > > > > that contains missing values for a specific feature, and the service
> > > > > detects the missing values before training, right?
> > > > > 
> > > > >     
> > > > >         
> > > > > > Is there a way to use your
> > > > > > services to obtain the missing value? 
> > > > > >       
> > > > > >           
> > > > > If the feature with the missing values was produced by our descriptor
> > > > > calculation service, yes. But you would have to build a dataset with all
> > > > > the compounds where the value is missing and submit it to the descriptor
> > > > > calculation service.
> > > > > The question is whether a model training service should automatically
> > > > > provide the functionality of "filling up" missing values. I think this
> > > > > is something that should be done in the preprocessing phase - in a
> > > > > preprocessing/data cleaning service.
> > > > >     
> > > > >         
> > > > I would be extremely careful with the addition of missing features for
> > > > several reasons:
> > > > 
> > > > - Sometimes there are good physical/chemical/biological/algorithmic reasons why
> > > >   features are missing - calculating these features might give
> > > >   you a number but it is very likely that it is meaningless. 
> > > >   
> > > >       
> > > Agree.
> > >     
> > 
> > Yes, sometimes indeed. But what about all the other times? 
> It might be an interesting topic to think about how we distinguish the
> two cases :)
> > For instance, how
> > useful is a dataset which contains a set of compounds and values for one
> > and only one feature (the target), without a service that calculates the
> > values for the other features? 
> 
> Information about the descriptor calculation service used to generate
> existing values should be available for each Feature via the ot:hasSource
> property.  It is then straightforward to use the URL of the service to
> launch a remote or local calculation.
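
(A quick sketch of that lookup from a client; the ot: namespace and the
feature/dataset URIs below are placeholders, and I assume the ot:hasSource
target accepts dataset_uri like any other algorithm:)

import rdflib
import requests

OT = rdflib.Namespace("http://www.opentox.org/api/1.1#")   # assumed ot: namespace
feature_uri = "http://example.org/feature/42"               # placeholder

# Follow ot:hasSource from the feature to the service that generated its values.
g = rdflib.Graph()
g.parse(feature_uri, format="xml")
source = g.value(rdflib.URIRef(feature_uri), OT.hasSource)

if source is not None:
    # Re-launch the calculation for a dataset containing the missing entries.
    r = requests.post(str(source), data={"dataset_uri": "http://example.org/dataset/1"})
    print(r.status_code, r.text.strip())
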
> > I believe that there are lots of reasons
> > to have a service which searches for missing values in the dataset and
> > tries to calculate them; after all that service will not be bundled with
> > the model training and its use would be optional.
> > 
> >   
> Why not just use descriptor calculation services, as they currently
> exist?  It is an implementation detail whether the service recalculates
> existing values or only performs calculations
> where these are not available (I would actually prefer the latter as the
> default implementation, purely for performance reasons).
> > > > - A sameAs relationship does not guarantee that (calculated and
> > > >   measured) feature values are comparable (very frequently they are
> > > >   not).
> > > >   
> > > >       
> > > Right, this is the reason for having ot:hasSource for features, allowing
> > > one to identify exactly the descriptor calculation service used. 
> > >     
> Replying to myself, see above for ot:hasSource  usage
> > > > - Even if you find a measured value for the same feature, there is a
> > > >   good chance that it has been obtained by a different protocol and
> > > >   that it is not comparable with the other feature values.
> > > >   
> > > >       
> > > Agree.
> > >     
> > > > I would suggest to add features only
> > > > 
> > > > - if you have a clear understanding, why a feature is missing
> > > >       
> > 
> > Why should you understand this? 
> Because a lack of understanding will reflect on the resulting model
> quality.  For example, it is possible to calculate a logP value of 15 or
> even 100, but it is generally considered not meaningful for various
> reasons. It would be better to remove compounds with such values when
> building a model.
> 

I don't really think one should look for some reasoning on whether a
value is acceptable or not. logP is a logarithmic quantity, so 100 is
indeed "huge", but it is really hard to establish a borderline,
especially in high-dimensional spaces. That's why we need the domain
of applicability: it's a way to tell that a prediction of logP=100000 is
not acceptable in a given context (it might be in some other, however).

> > It is missing because no service
> > calculated it or for any other reason 
> It might be that a service had already attempted to calculate the
> values but failed for some reason.  It might be an exotic atom type,
> which prevents the calculations from being done properly, or a lack of 3D
> structure, etc.   We need a way to denote such cases.   
> 
> I would propose an extension of opentox.owl with a subclass of
> ot:FeatureValue (e.g. ErrorValue or something alike) to denote the
> case of failure and the reason for failure.  A FeatureValue is composed of
> a feature and a value, where the feature will point to what calculation
> has been attempted (including the algorithm), and the value itself can
> be a string with a human-readable description of the reason for failure, or
> a URL.
> 
> http://opentox.org/dev/apis/api-1.1/Feature
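
(A sketch of what such an entry could look like, built with rdflib; note
that ErrorValue is only the proposed extension here, not something already
in opentox.owl, and the ot: namespace and property names are my assumptions:)

import rdflib

OT = rdflib.Namespace("http://www.opentox.org/api/1.1#")   # assumed ot: namespace
g = rdflib.Graph()
g.bind("ot", OT)

entry = rdflib.BNode()
feature = rdflib.URIRef("http://example.org/feature/logP-3D")   # placeholder
g.add((entry, rdflib.RDF.type, OT.ErrorValue))   # proposed subclass of ot:FeatureValue
g.add((entry, OT.feature, feature))              # what calculation was attempted
g.add((entry, OT.value, rdflib.Literal("no 3D coordinates; descriptor not calculated")))

print(g.serialize(format="turtle"))
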
> 
> A descriptor calculation service requiring a 3D structure will
> consistently fail to calculate values for structures lacking 3D
> coordinates.  The Blue Obelisk descriptor ontology provides a means for
> specifying descriptor calculation requirements (2D/3D coordinates) via the
> http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#requires  property.  
> 
> For anybody using descriptor calculation algorithms described in
> BO, please refer to the relevant algorithm; this will ensure we can
> use all the available information in the ontology service.
> 
> For other descriptor calculation algorithms, it would be good if they
> could be described in terms of BO.
> 
> > I can't think of right now, but a
> > client needs to calculate those values (if possible), of course using
> > a proper descriptor calculation service to avoid calculating something
> > else by mistake.
> > 
> >   
> So it should be sufficient to locate the proper calculation service
> and run the calculations.  I don't really see the reason for
> introducing another kind of calculation service.  It is more likely we
> might need a kind of filtering/querying data service, allowing one to query
> a dataset for entries without values for specific features, and this
> should be easy to do via the current algorithm API. 
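
(Such a filter could indeed be thin; here is a sketch with rdflib, assuming
the dataset RDF uses ot:dataEntry/ot:values/ot:feature as in the current
dataset representation, with placeholder URIs:)

import rdflib

OT = rdflib.Namespace("http://www.opentox.org/api/1.1#")            # assumed ot: namespace
dataset_uri = "http://example.org/dataset/1"                         # placeholder
target = rdflib.URIRef("http://example.org/feature/logP")            # placeholder

g = rdflib.Graph()
g.parse(dataset_uri, format="xml")

# Compounds in the dataset that have no value for the target feature.
missing = []
for entry in g.objects(None, OT.dataEntry):
    has_value = any(g.value(fv, OT.feature) == target
                    for fv in g.objects(entry, OT.values))
    if not has_value:
        missing.append(g.value(entry, OT.compound))

print(missing)
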
> > > > - if you can prove that the feature calculation algorithm creates values
> > > >   that are comparable with the original measurements (or calculation
> > > >   algorithm)
> > > >       
> > 
> > No computational tool can reproduce measured values
> Not a very optimistic comment for any modeling exercise/framework :)

Why not? I am just saying that a model might be able to converge to the
average of the measured values in a certain way, but it will never be
equal to them; otherwise it wouldn't be a model, it would be a theory.

> 
> Best regards,
> Nina
> > - I'm talking about
> > descriptors which can be calculated given the structure of the chemical
> > compound.
> > 
> > Best regards,
> > Pantelis
> > 
> >   
> > > > - if you clearly document how and why the original dataset has been
> > > >   modified
> > > >   
> > > >       
> > > A user interface supporting the above (e.g. allowing the user to
> > > document why something is modified) would be relevant for both Fastox
> > > and Toxmodel.
> > > 
> > > Best regards,
> > > Nina
> > >     
> > > > Best regards,
> > > > Christoph
> > > > _______________________________________________
> > > > Development mailing list
> > > > Development at opentox.org
> > > > http://www.opentox.org/mailman/listinfo/development
> > > >   
> > > >       
> > > _______________________________________________
> > > Development mailing list
> > > Development at opentox.org
> > > http://www.opentox.org/mailman/listinfo/development
> > > 
> > >     
> > 
> >   
> 




