[OTDev] Missing values [was Re: DataSet]

chung chvng at mail.ntua.gr
Tue Oct 6 19:41:24 CEST 2009


Dear Nina, Christoph, All,

Datasets with missing values are valid; however, we have to bear in mind
some density/sparsity criteria, at least for the time being. It's absolutely
impossible to train a model (even a "bad" one) using the following
"diagonal" dataset:

@attribute feat_1 double
@attribute feat_2 double
@attribute feat_3 double
...
@attribute feat_n double
1,?,?,?,...,?
?,25,?,?,...,?
?,?,-2.123,?,...,?
...
?,?,?,...,0.647,?
?,?,?,...,?,100

Another example is:

@attribute feat_1 double
@attribute feat_2 double
@attribute feat_3 double
...
@attribute feat_n double
?,1.1,2.23,100,200,...,1
?,1.41,5.32,9.0,843,92,...,10
...
?,3,4,5,...,300

In the second case, the feature feat_1 should be removed completely.

In that case, the training service would return an error message.
For the time being, we need a dense dataset for experimental reasons before
coping with those special cases, which of course are the majority. Let's
take one step at a time. So I'd like to have a dataset - even a
hypothetical one - to do some tests.
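
The density check described above could be sketched roughly as follows. This
is only an illustration, not an agreed OpenTox implementation: the function
names, the SparseDatasetError class and the 0.5 density threshold are all
assumptions made for the example. It drops features that have no value at all
(like feat_1 in the second example) and rejects datasets that are still too
sparse (like the "diagonal" one).

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]


class SparseDatasetError(ValueError):
    """Hypothetical error a training service might raise for sparse input."""


def parse_row(line: str) -> Row:
    """Parse one ARFF-style data line, mapping '?' to None."""
    return [None if tok.strip() == "?" else float(tok) for tok in line.split(",")]


def drop_empty_features(names: List[str], rows: List[Row]) -> Tuple[List[str], List[Row]]:
    """Remove every feature that has no value for any row."""
    keep = [j for j in range(len(names)) if any(r[j] is not None for r in rows)]
    return [names[j] for j in keep], [[r[j] for j in keep] for r in rows]


def density(rows: List[Row]) -> float:
    """Fraction of cells that actually hold a value."""
    total = sum(len(r) for r in rows)
    present = sum(1 for r in rows for v in r if v is not None)
    return present / total if total else 0.0


def check_trainable(names: List[str], rows: List[Row],
                    min_density: float = 0.5) -> Tuple[List[str], List[Row]]:
    """Drop empty features, then refuse datasets below min_density.

    The default threshold of 0.5 is an arbitrary assumption for this sketch.
    """
    names, rows = drop_empty_features(names, rows)
    d = density(rows)
    if d < min_density:
        raise SparseDatasetError(
            f"dataset density {d:.2f} is below the required {min_density:.2f}")
    return names, rows
```

With three features, the rows "1,?,?", "?,25,?" and "?,?,-2.123" have a
density of 1/3 and are rejected, while the rows "?,1.1,2.23" and
"?,1.41,5.32" pass once feat_1 has been dropped.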


On Tue, 2009-10-06 at 17:39 +0300, Nina Jeliazkova wrote: 
> Christoph Helma wrote:
> > Excerpts from Nina Jeliazkova's message of Mon Oct 05 08:40:08 +0200 2009:
> >   
> >> Dear Pantelis,
> >>
> >> chung wrote:
> >>     
> >>> Hi Nina,
> >>>
> >>> On Fri, 2009-10-02 at 17:43 +0300, Nina Jeliazkova wrote: 
> >>>   
> >>>       
> >>>> Hi Pantelis,
> >>>>
> >>>> chung wrote:
> >>>>     
> >>>>         
> >>>>> Hi Nina,
> >>>>>  Once we define the RESTful operation in the new version of the API, we
> >>>>> will have to start developing. Yet from the API 1.0, models are trained
> >>>>> provided a dataset URI, so we need such a dataset to do some experiments
> >>>>> (build an Instances object, train a model, perform some predictions
> >>>>> using the trained model). Is it possible for you to provide us a dataset
> >>>>> URI? 
> >>>>>       
> >>>>>           
> >>>> I am not sure what the question is - can you please clarify?
> >>>>     
> >>>>         
> >>> I mean that we need a dataset for which all RESTful operations specified
> >>> in API 1.0 or API 1.1 are implemented and for every operation a status
> >>> code 200 is normally expected. We need a dataset, say:
> >>>
> >>> http://someserver.com/dataset/123 (i)
> >>>
> >>> such that, for any compound in that, e.g.
> >>>
> >>> http://someserver.com/compound/55 (ii)
> >>>
> >>> and every feature definition in it:
> >>>
> >>> http://someserver.com/feature_definition/10 (iii)
> >>>
> >>> the following URI returns the value of the feature definition (iii) for
> >>> the compound (ii):
> >>>
> >>> http://someserver.com/feature/compound/55/feature_definition/10 
> >>>
> >>> and will not return "NULL" or an error code (e.g. 404). 
> >>> We need that dataset to develop model training web services. The input
> >>> parameters to our services will be the dataset uri and probably a URI
> >>> for the target feature. Will it be possible for you to provide us a
> >>> complete dataset object with all RESTful operations implemented? I mean,
> >>> we don't need a huge one, 20 compounds and some feature definitions will
> >>> be ok, but we need every compound/feature_definition pair to correspond
> >>> to a feature value!  
> >>>   
> >>>       
> >> I understand your reasoning, but please note that in a generic setup some
> >> feature values might be missing and it is not the dataset provider's job
> >> to fix that.  Handling missing values is usually done by the modeller;
> >> we still need to think about how to cast this process into the REST scheme.
> >>
> >> For example, in the Toxcast dataset there are plenty of entries with
> >> missing values; one might address the issue by creating a "derived"
> >> dataset that ignores the entries without values, but one could also
> >> replace missing values with e.g. averages, or use more complicated
> >> methods.  I am copying this discussion to the development list as well,
> >> because it is a generic question - should the OpenTox framework provide
> >> an API to handle missing values, where is the best place for this
> >> (preprocessing algorithms?), what API do we need?
> >>     
> >
> > My first impression is that we do not need a separate API (or a
> > convention) for missing values - it should be the developer's task to deal
> > with "missing values".  With a clear separation between features and
> > feature annotations, we also would not run into the problem, that values
> > for feature definitions are missing: A dataset representation would
> > contain only the features, that are available, not feature definitions
> > with possibly empty values.
> >
> >   
> IMO, a dataset with missing values is a valid one. There exist algorithms
> in machine learning that can deal with missing values, usually
> preprocessing ones (there are Weka implementations as well).
> In any case, one might need to create a dataset without missing values
> from a dataset with missing values (by ignoring empty entries or
> applying something else).  I am not sure if there should be an API on
> the dataset level, the algorithm level, or the model level.  As I understand,
> Christoph is in favour of a dataset level API - am I right? 
> 
> BTW, Toxcast is a very nice example of a dataset with plenty of missing
> values.


  Maybe we shouldn't regard this as normal in the case of topological
descriptors, because they can be calculated in silico. So a topological
descriptor shouldn't be missing from a dataset or, if it is,
it would be better to calculate it for the corresponding compound rather
than replace the "?" with a substitute value. Therefore, we can have a dense
dataset with no missing values and, as I said before, such a dataset has
to be created, at least as an example and for testing purposes.
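
To make the options in this thread concrete, here is a small sketch of the
three ways of getting from a dataset with missing values to a dense one:
ignoring incomplete entries, replacing missing values with averages, and
recalculating a descriptor for the compound. None of this is an existing
OpenTox or Weka API; all function names are made up for the example, and the
"calculators" argument is a hypothetical mapping from a column index to
whatever routine computes that descriptor.

```python
from typing import Callable, Dict, List, Optional

Row = List[Optional[float]]


def drop_incomplete_rows(rows: List[Row]) -> List[Row]:
    """Keep only the entries that have a value for every feature."""
    return [r for r in rows if all(v is not None for v in r)]


def impute_column_means(rows: List[Row]) -> List[Row]:
    """Replace each missing value with the mean of its column.

    A column with no values at all stays missing (None).
    """
    n_cols = len(rows[0]) if rows else 0
    means: List[Optional[float]] = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present) if present else None)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def recompute_missing(compounds: List[str], rows: List[Row],
                      calculators: Dict[int, Callable[[str], float]]) -> List[Row]:
    """Fill a missing cell by calculating the descriptor for that compound.

    Only columns listed in `calculators` (a hypothetical column -> function
    mapping) can be recomputed; other missing cells are left as None.
    """
    filled = []
    for compound, row in zip(compounds, rows):
        filled.append([
            calculators[j](compound) if v is None and j in calculators else v
            for j, v in enumerate(row)
        ])
    return filled
```

For calculable descriptors the third option keeps real values in the dataset,
whereas the first loses entries and the second introduces estimated values.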


Best Regards,
Pantelis

> 
> Best regards,
> Nina
> > Best regards,
> > Christoph
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
> >   