[OTDev] Missing values [was Re: DataSet]

Christoph Helma helma at in-silico.de
Wed Oct 7 09:40:50 CEST 2009


> > IMO, a dataset with missing values is a valid one. There exist algorithms
> > in machine learning that can deal with missing values, usually
> > preprocessing ones (there are Weka implementations as well).
> > In any case, one might need to create a dataset without missing values
> > from a dataset with missing values (by ignoring empty entries or
> > applying something else).  I am not sure if there should be an API on
> > the dataset level, on the algorithm level, or on the model level.  As I
> > understand, Christoph is in favour of a dataset level API - am I right?

I do not think that we need API modifications to deal with missing
values. The whole problem can be solved

- by choosing an appropriate dataset representation (see below)
- by algorithm/model developers: they have to find a way to deal with
  missing values (calculate, ignore, ...)
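A minimal sketch of the "ignore" and "calculate" options, assuming each
compound maps feature URIs to numeric values. The data, the function names
and the choice of mean imputation are illustrative assumptions, not part of
any OpenTox API:

```python
from statistics import mean

# Hypothetical data: each compound maps feature URIs to numeric values.
# compound2_uri simply has no entry for feature2_uri.
compounds = {
    "compound1_uri": {"feature1_uri": 1.0, "feature2_uri": 4.0},
    "compound2_uri": {"feature1_uri": 3.0},
}

def ignore_incomplete(compounds, required):
    """'Ignore': keep only compounds that have every required feature."""
    return {
        uri: features
        for uri, features in compounds.items()
        if all(name in features for name in required)
    }

def impute_mean(compounds, feature):
    """'Calculate': fill an absent feature with the mean of the known values."""
    known = [f[feature] for f in compounds.values() if feature in f]
    if not known:
        # Nothing to compute the mean from; return unchanged copies.
        return {uri: dict(f) for uri, f in compounds.items()}
    fill = mean(known)
    return {
        uri: {**f, feature: f.get(feature, fill)}
        for uri, f in compounds.items()
    }

complete = ignore_incomplete(compounds, ["feature1_uri", "feature2_uri"])
imputed = impute_mean(compounds, "feature2_uri")
```

Both functions return new mappings and leave the original dataset untouched,
so the choice of strategy stays with the algorithm or model that needs it.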

> I like the idea of having missing values represented within the dataset.
> 
> One thing that would be useful would be to have a consistent notation to
> indicate a missing value, something like 'NA'.

I disagree.  My impression is that the whole concept of missing values
originates from the fact that we (and a lot of software) are trained to
think in terms of tables. Having a fixed number of columns of course
requires a method to indicate missing values. As soon as we represent a
dataset differently, e.g. like

compound1_uri:
  - feature1_uri
  - feature2_uri
  ...
compound2_uri:
  - feature1_uri
  - feature3_uri
  ...
...

or in XML

<dataset> 
 
    <compound> 
        <link ref="uri"/> 
        <feature> 
            <link ref="uri"/>   
        </feature> 
        <feature> 
            <link ref="uri"/>   
        </feature> 
    </compound> 
    <compound> 
        <link ref="uri"/> 
        <feature> 
            <link ref="uri"/>   
        </feature> 
    </compound> 
 
</dataset>

we do not have to indicate missing features - they are just not there
(and it is up to the model developer to deal with this situation).
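To illustrate the point, here is a rough sketch in Python that stores the
dataset as in the listings above and derives a fixed-column view only when a
table-based algorithm asks for one. The URIs are the placeholders from the
listings; the function names and the use of None as a marker are assumptions
made for this example only:

```python
# Hypothetical URIs, mirroring the placeholders in the listings above.
# A compound lists only the features it actually has.
dataset = {
    "compound1_uri": ["feature1_uri", "feature2_uri"],
    "compound2_uri": ["feature1_uri", "feature3_uri"],
}

def feature_columns(dataset):
    """Return every feature URI that occurs in the dataset, sorted."""
    return sorted({f for features in dataset.values() for f in features})

def to_table(dataset, missing=None):
    """Build a fixed-column view; the 'missing' marker exists only here."""
    columns = feature_columns(dataset)
    rows = {
        compound: [f if f in features else missing for f in columns]
        for compound, features in dataset.items()
    }
    return columns, rows

columns, rows = to_table(dataset)
```

The missing-value marker appears only in the derived table, not in the
dataset itself, so each model developer can pick whatever marker (or
strategy) suits the algorithm.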

Best regards,
Christoph


