[OTDev] Missing values [was Re: DataSet]

Christoph Helma helma at in-silico.de
Tue Oct 6 16:23:51 CEST 2009


Excerpts from Nina Jeliazkova's message of Mon Oct 05 08:40:08 +0200 2009:
> Dear Pantelis,
> 
> chung wrote:
> > Hi Nina,
> >
> > On Fri, 2009-10-02 at 17:43 +0300, Nina Jeliazkova wrote: 
> >   
> >> Hi Pantelis,
> >>
> >> chung wrote:
> >>     
> >>> Hi Nina,
> >>>  Once we define the RESTful operation in the new version of the API, we
> >>> will have to start developing. Yet from the API 1.0, models are trained
> >>> provided a dataset URI, so we need such a dataset to do some experiments
> >>> (build an Instances object, train a model, perform some predictions
> >>> using the trained model). Is it possible for you to provide us a dataset
> >>> URI? 
> >>>       
> >> I am not sure what the question is - can you please clarify?
> >>     
> >
> > I mean that we need a dataset for which all RESTful operations specified
> > in API 1.0 or API 1.1 are implemented and for every operation a status
> > code 200 is normally expected. We need a dataset, say:
> >
> > http://someserver.com/dataset/123 (i)
> >
> > such that, for any compound in it, e.g.
> >
> > http://someserver.com/compound/55 (ii)
> >
> > and every feature definition in it:
> >
> > http://someserver.com/feature_definition/10 (iii)
> >
> > the following URI returns the value of the feature definition (iii) for
> > the compound (ii):
> >
> > http://someserver.com/feature/compound/55/feature_definition/10 
> >
> > and will not return "NULL" or an error code (e.g. 404). 
> > We need that dataset to develop model training web services. The input
> > parameters to our services will be the dataset uri and probably a URI
> > for the target feature. Will it be possible for you to provide us a
> > complete dataset object with all RESTful operations implemented? I mean,
> > we don't need a huge one, 20 compounds and some feature definitions will
> > be ok, but we need every compound/feature_definition pair to correspond
> > to a feature value!  
> >   
> I understand your reasoning, but please note that in a generic setup some
> feature values might be missing and it is not the dataset provider's job
> to fix that.  Handling missing values is usually done by the modeller;
> we still need to think about how to cast this process into the REST scheme.
> 
> For example, in the ToxCast dataset there are plenty of entries with
> missing values; one might address the issue by creating a "derived"
> dataset that ignores the entries without values, but one could also
> replace missing values with e.g. averages, or use more complicated
> methods.  I am copying this discussion to the development list as well,
> because it is a generic question - should the OpenTox framework provide
> an API to handle missing values, where is the best place for this
> (preprocessing algorithms?), what API do we need?

My first impression is that we do not need a separate API (or a
convention) for missing values - it should be the developer's task to deal
with "missing values".  With a clear separation between features and
feature annotations, we also would not run into the problem that values
for feature definitions are missing: a dataset representation would
contain only the features that are available, not feature definitions
with possibly empty values.
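
To illustrate, here is a rough sketch (not part of any OpenTox API) of
a dataset that lists only the available features, together with the two
strategies Nina mentions: dropping incomplete entries and replacing
missing values with averages.  The dictionary layout, the function names,
compound/56 and feature_definition/11 are assumptions made up for this
example; only the URI pattern comes from the discussion above.

```python
from statistics import mean

# Hypothetical sparse dataset: compound URI -> {feature definition URI: value}.
# Only values that actually exist are listed; nothing is stored as "NULL".
dataset = {
    "http://someserver.com/compound/55": {
        "http://someserver.com/feature_definition/10": 1.5,
        "http://someserver.com/feature_definition/11": 0.2,
    },
    "http://someserver.com/compound/56": {
        "http://someserver.com/feature_definition/10": 2.5,
    },
}


def feature_uris(data):
    """Return every feature definition URI that occurs in the dataset, sorted."""
    return sorted({uri for features in data.values() for uri in features})


def drop_incomplete(data):
    """Derived dataset: keep only compounds that have a value for every feature."""
    wanted = set(feature_uris(data))
    return {c: dict(f) for c, f in data.items() if set(f) == wanted}


def impute_mean(data):
    """Derived dataset: fill each missing value with the mean of that feature."""
    uris = feature_uris(data)
    means = {u: mean(f[u] for f in data.values() if u in f) for u in uris}
    return {c: {u: f.get(u, means[u]) for u in uris} for c, f in data.items()}
```

A modelling service could call drop_incomplete(dataset) or
impute_mean(dataset) before training, so the choice of strategy stays
with the modeller and the dataset provider only serves the values it has.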

Best regards,
Christoph


