[OTDev] Missing values [was Re: DataSet]

Wed Oct 7 10:03:28 CEST 2009

I joined only a few days ago, so pardon me if this has been discussed before. Would it be better to adopt the standards used in PMML for representing datasets? For example, the specification for the SupportVectorMachineModel stores its support vectors in a VectorDictionary, and we could use the same idea to store a dataset. We would just have to couple it with the DataDictionary so that information about the features is available.

Best regards,

YAP Chun Wei (Dr) :: Assistant Professor, Department of Pharmacy :: Faculty of Science :: National University of Singapore :: padel.nus.edu.sg (W) 

-----Original Message-----
From: development-bounces at opentox.org [mailto:development-bounces at opentox.org] On Behalf Of Christoph Helma
Sent: Wednesday, 7 October, 2009 3:41 PM
To: development
Subject: Re: [OTDev] Missing values [was Re: DataSet]

> > IMO, a dataset with missing values is a valid one. There exist algorithms
> > in machine learning that can deal with missing values, usually
> > preprocessing ones (there are Weka implementations as well).
> > In any case, one might need to create a dataset without missing values
> > from a dataset with missing values (by ignoring empty entries or
> > applying something else).  I am not sure if there should be an API on
> > the dataset level, the algorithm level, or the model level.  As I understand,
> > Christoph is in favour of a dataset-level API - am I right?

I do not think that we need API modifications to deal with missing
values. The whole problem can be solved

- by choosing an appropriate dataset representation (see below)
- by algorithm/model developers: they have to find a way to deal with
  missing values (calculate, ignore, ...)
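
The two developer-side options named above ("calculate" and "ignore") could
look roughly like the following sketch. It is not OpenTox code: the dict
layout, the numeric values and the function names ignore_incomplete and
impute_mean are hypothetical and only illustrate the idea that a model
developer, not the dataset API, decides what to do about absent features.

```python
from statistics import mean

# Hypothetical sparse dataset: compound URI -> {feature URI: value}.
# A feature that was never measured is simply not listed.
records = {
    "compound1_uri": {"feature1_uri": 1.0, "feature2_uri": 4.0},
    "compound2_uri": {"feature1_uri": 3.0},
}


def ignore_incomplete(records, features):
    """'Ignore' strategy: keep only compounds that list every requested feature."""
    return {
        compound: values
        for compound, values in records.items()
        if all(feature in values for feature in features)
    }


def impute_mean(records, features):
    """'Calculate' strategy: fill an absent feature with the mean of the
    values observed for that feature on other compounds.

    Raises statistics.StatisticsError if a feature is observed nowhere.
    """
    means = {
        feature: mean(v[feature] for v in records.values() if feature in v)
        for feature in features
    }
    return {
        compound: {f: values.get(f, means[f]) for f in features}
        for compound, values in records.items()
    }


wanted = ["feature1_uri", "feature2_uri"]
complete_only = ignore_incomplete(records, wanted)
filled = impute_mean(records, wanted)
```

Either function returns a new mapping and leaves the stored dataset
untouched, so different models can apply different strategies to the same
data.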

> I like the idea of having missing values represented within the dataset.
> 
> One thing that would be useful would be to have a consistent notation to
> indicate a missing value. Something like 'NA', etc.

I disagree.  My impression is that the whole concept of missing values
originates from the fact that we (and a lot of software) are trained to
think in terms of tables. Having a fixed number of columns of course
requires a method to indicate missing values. As soon as we represent a
dataset differently, e.g. like

compound1_uri:
  - feature1_uri
  - feature2_uri
  ...
compound2_uri:
  - feature1_uri
  - feature3_uri
  ...
...

or in XML

<dataset> 

    <compound> 
        <link ref="uri"/> 
        <feature> 
            <link ref="uri"/>   
        </feature> 
        <feature> 
            <link ref="uri"/>   
        </feature> 
    </compound> 
    <compound> 
        <link ref="uri"/> 
        <feature> 
            <link ref="uri"/>   
        </feature> 
    </compound> 

</dataset>

we do not have to indicate missing features - they are just not there
(and it is up to the model developer to deal with this situation).
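
As a rough sketch of how a model developer might consume the XML layout
above, the following reads it with Python's standard
xml.etree.ElementTree and works out which features are absent for each
compound. The concrete URIs and the helper names parse_dataset and
absent_features are hypothetical; the XML above only uses the placeholder
"uri".

```python
import xml.etree.ElementTree as ET

# Hypothetical URIs filled into the XML layout shown above.
XML = """
<dataset>
    <compound>
        <link ref="compound1_uri"/>
        <feature><link ref="feature1_uri"/></feature>
        <feature><link ref="feature2_uri"/></feature>
    </compound>
    <compound>
        <link ref="compound2_uri"/>
        <feature><link ref="feature1_uri"/></feature>
        <feature><link ref="feature3_uri"/></feature>
    </compound>
</dataset>
"""


def parse_dataset(xml_text):
    """Map each compound URI to the set of feature URIs listed for it."""
    root = ET.fromstring(xml_text.strip())
    dataset = {}
    for compound in root.findall("compound"):
        # find("link") matches only the direct child, i.e. the compound's own link.
        compound_uri = compound.find("link").get("ref")
        dataset[compound_uri] = {
            feature.find("link").get("ref")
            for feature in compound.findall("feature")
        }
    return dataset


def absent_features(dataset):
    """For each compound, list features seen elsewhere in the dataset but not on it."""
    all_features = set().union(*dataset.values())
    return {
        uri: sorted(all_features - features)
        for uri, features in dataset.items()
    }


dataset = parse_dataset(XML)
missing = absent_features(dataset)
```

Nothing in the file marks a feature as missing. The absences are derived
only when a consumer asks for them.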

Best regards,
Christoph
_______________________________________________
Development mailing list
Development at opentox.org
http://www.opentox.org/mailman/listinfo/development