[OTDev] Descriptor Calculation Services

Mon Jan 11 19:55:50 CET 2010

Hi all,
 There has been some discussion of data preprocessing web services
recently, so I'd like to summarize a few things here and then decide how
we should proceed with the implementation of such services. First of
all, no changes to the current version of the API are needed
to implement API1.1-compliant services. Preprocessing services are
algorithms, so an RDF representation should be provided at:

/algorithm/datacleanup

and the method 

POST /algorithm/datacleanup
Content-type: application/rdf+xml

returns the URI of the generated cleaned-up dataset. This is the way
feature selection services should work too.
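
As a rough illustration of the call above, here is a minimal Python
sketch that only builds the request and does not send it. The host
http://example.org, the helper name build_cleanup_request and the
placeholder RDF body are assumptions made for the example, not part of
the API.

```python
import urllib.request

# Placeholder RDF body; a real client would send the dataset's RDF here.
EXAMPLE_RDF = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'


def build_cleanup_request(base_url, dataset_rdf):
    """Build (but do not send) a POST to the data cleanup algorithm.

    base_url is a hypothetical service root such as http://example.org.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/algorithm/datacleanup",
        data=dataset_rdf.encode("utf-8"),
        headers={"Content-type": "application/rdf+xml"},
        method="POST",
    )


req = build_cleanup_request("http://example.org", EXAMPLE_RDF)
print(req.get_method(), req.full_url)
```

Sending the request with urllib.request.urlopen(req) would then be
expected to return the URI of the cleaned-up dataset in the response.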

I think we need a cleanup service that removes all string
features from the dataset and a second service that handles the missing
values of the dataset by substituting them with the average or the median
of all other values for the same feature in the dataset. I also have a
proposal for a third preprocessing service where the missing values
are calculated by some descriptor calculation service *if possible*.
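
To make the first two services concrete, here is a minimal Python
sketch. It assumes a dataset held in memory as a dict mapping feature
names to lists of values, with None marking a missing value; that
representation and the function names are assumptions made for the
example, not the OpenTox dataset format.

```python
from statistics import mean, median


def is_number(value):
    """True for ints and floats, but not for booleans or strings."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def drop_string_features(dataset):
    """Keep only features whose non-missing values are all numeric."""
    return {
        name: values
        for name, values in dataset.items()
        if all(is_number(v) for v in values if v is not None)
    }


def fill_missing(dataset, strategy=mean):
    """Replace None with the mean (or median) of the feature's known values."""
    filled = {}
    for name, values in dataset.items():
        known = [v for v in values if v is not None]
        if not known:
            # Nothing to estimate from, so leave the feature untouched.
            filled[name] = list(values)
            continue
        substitute = strategy(known)
        filled[name] = [substitute if v is None else v for v in values]
    return filled


data = {
    "logP": [1.0, None, 3.0],
    "name": ["benzene", "phenol", "toluene"],
    "MW": [78.0, 94.0, None],
}
numeric_only = drop_string_features(data)
print(fill_missing(numeric_only))
print(fill_missing(numeric_only, strategy=median))
```

Passing the strategy as a function keeps the choice between average and
median up to the client.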

On Tue, 2010-01-12 at 19:23 +0200, Nina Jeliazkova wrote:
> Christoph Helma wrote:
> > Excerpts from Tobias Girschick's message of Mon Jan 11 10:05:23 +0100 2010:
> >   
> >> Hi Pantelis, All,
> >>
> >> On Thu, 2010-01-07 at 18:49 +0200, chung wrote: 
> >>     
> >>> Hi Tobias, All,
> >>>  While trying to train a model, the service may "find" some
> >>> missing values for a specific feature. 
> >>>       
> >> To obviate misunderstandings: You want to train a model with a data set
> >> that contains missing values for a specific feature and the service
> >> detects the missing features before training, right?
> >>
> >>     
> >>> Is there a way to use your
> >>> services to obtain the missing value? 
> >>>       
> >> If the feature with the missing values was produced from our descriptor
> >> calculation service, yes. But you would have to build a dataset with all
> >> the compounds where the value is missing and submit it to the descriptor
> >> calculation service.
> >> The question is whether a model training service should automatically
> >> provide the functionality of "filling up" missing values. I think this
> >> is something that should be done in the preprocessing phase - in a
> >> preprocessing/data cleaning service.
> >>     
> >
> > I would be extremely careful with the addition of missing features for
> > several reasons:
> >
> > - Sometimes there are good physical/chemical/biological/algorithmic reasons why
> >   features are missing - calculating these features might give
> >   you a number but it is very likely that it is meaningless. 
> >   
> Agree.

Yes, sometimes indeed. But what about all the other times? For instance, how
useful is a dataset which contains a set of compounds and values for one
and only one feature (the target) without a service that calculates the
values for the other features? I believe that there are lots of reasons
to have a service which searches for missing values in the dataset and
tries to calculate them; after all, that service would not be bundled with
the model training and its use would be optional.
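
The "search" half of such a service could look roughly like the Python
sketch below. It assumes a dataset held as a dict mapping compound
identifiers to dicts of feature values, with None or an absent key
meaning "missing"; the identifiers and the function name are made up
for the example. The result groups compounds by the feature they lack,
so that each group could be submitted as a dataset to the descriptor
calculation service responsible for that feature.

```python
def compounds_missing_feature(dataset, features):
    """Map each feature to the compounds that have no value for it."""
    missing = {}
    for compound, values in dataset.items():
        for feature in features:
            if values.get(feature) is None:
                missing.setdefault(feature, []).append(compound)
    return missing


data = {
    "c1": {"logP": 1.2, "MW": None},
    "c2": {"logP": None},
    "c3": {"logP": 0.4, "MW": 92.1},
}
todo = compounds_missing_feature(data, ["logP", "MW"])
print(todo)
```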

> > - A sameAs relationship does not guarantee that (calculated and
> >   measured) feature values are comparable (very frequently they are
> >   not).
> >   
> Right, this is the reason for having ot:hasSource for features, which allows
> us to identify exactly the descriptor calculation service used. 
> > - Even if you find a measured value for the same feature, there is a
> >   good chance that it has been obtained by a different protocol and
> >   that it is not comparable with the other feature values.
> >   
> Agree.
> > I would suggest to add features only
> >
> > - if you have a clear understanding, why a feature is missing

Why should you need to understand this? It is missing because no service
calculated it, or for some other reason I can't think of right now. Either
way, a client needs to calculate those values (if possible), of course using
a proper descriptor calculation service to avoid calculating something
else by mistake.

> > - if you can prove that the feature calculation algorithm creates values
> >   that are comparable with the original measurements (or calculation
> >   algorithm)

No computational tool can reproduce measured values - I'm talking about
descriptors which can be calculated given the structure of the chemical
compound.

Best regards,
Pantelis

> > - if you clearly document how and why the original dataset has been
> >   modified
> >   
> A user interface supporting the above (e.g. allowing the user to
> document why something is modified) would be relevant for both Fastox
> and Toxmodel.
> 
> Best regards,
> Nina
> > Best regards,
> > Christoph
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
> >   