[OTDev] Missing values [was Re: DataSet]

Wed Oct 7 10:48:00 CEST 2009

Christoph Helma wrote:
>>> IMO, a dataset with missing values is a valid one. There exist algorithms
>>> in machine learning that can deal with missing values, usually
>>> preprocessing ones (there are Weka implementations as well).
>>> In any case, one might need to create a dataset without missing values
>>> from a dataset with missing values (by ignoring empty entries or
>>> applying something else).  I am not sure if there should be an API at the
>>> dataset level, the algorithm level, or the model level.  As I understand,
>>> Christoph is in favour of a dataset-level API - am I right?
>>>       
>
> I do not think that we need API modifications to deal with missing
> values. The whole problem can be solved
>
> - by choosing an appropriate dataset representation (see below)
> - by algorithm/model developers: they have to find a way to deal with
>   missing values (calculate, ignore, ...)
>
>   
>> I like the idea of having missing values represented within the dataset.
>>
>> One thing that would be useful would be to have a consistent notation to
>> indicate a missing value, something like 'NA'.
>>     
>
> I disagree.  My impression is that the whole concept of missing values
> originates from the fact that we (and a lot of software) are trained to
> think in terms of tables. Having a fixed number of columns of course
> requires a method to indicate missing values. As soon as we represent a
> dataset differently, e.g. like
>
> compound1_uri:
>   - feature1_uri
>   - feature2_uri
>   ...
> compound2_uri:
>   - feature1_uri
>   - feature3_uri
>   ...
> ...
>
> or in XML
>
> <dataset> 
>  
>     <compound> 
>         <link ref="uri"/> 
>         <feature> 
>             <link ref="uri"/>   
>         </feature> 
>         <feature> 
>             <link ref="uri"/>   
>         </feature> 
>     </compound> 
>     <compound> 
>         <link ref="uri"/> 
>         <feature> 
>             <link ref="uri"/>   
>         </feature> 
>     </compound> 
>  
> </dataset>
>
> we do not have to indicate missing features - they are just not there
>   
Very good point. This representation essentially uses an implicit
mechanism for denoting missing values; their interpretation is left to
the model service. Tables remain unavoidable, though, even for such
common models as MLR (multiple linear regression).
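
To make the last point concrete, here is a minimal Python sketch of what a
model service might do before fitting something like MLR: expand the sparse
per-compound listing back into a table. Everything in it is an assumption
for illustration only (the numeric values, the to_table name, the choice of
None as the marker); it also assumes the feature values have already been
fetched from their URIs. It is not part of any OpenTox API.

```python
# Hypothetical sparse dataset: each compound lists only the features it has.
# URIs follow the example above; the numeric values are made up.
sparse = {
    "compound1_uri": {"feature1_uri": 1.2, "feature2_uri": 0.7},
    "compound2_uri": {"feature1_uri": 3.4, "feature3_uri": 5.0},
}


def to_table(dataset, missing=None):
    """Expand a sparse compound -> {feature: value} mapping into a table."""
    # Column order: sorted union of all feature URIs seen in any compound.
    columns = sorted({f for feats in dataset.values() for f in feats})
    # A feature that is simply absent for a compound becomes `missing`.
    rows = {
        compound: [feats.get(col, missing) for col in columns]
        for compound, feats in dataset.items()
    }
    return columns, rows


columns, rows = to_table(sparse)
```

The missing-value marker only appears at this point, inside the model
service, and each service can pick whatever suits its algorithm (drop the
row, impute, and so on).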

The format Christoph proposes here (and on the wiki as well) is a nice
solution to this issue (missing values) and also to the case where the
feature values reside on different servers.
One of my concerns is how verbose this is going to be - we are
effectively replacing a single value with a much larger construct <feature>
<link ref="uri"/> </feature> (in XML; YAML is better, but still more
verbose than the value itself).

Regarding the dataset, I would keep the header part (feature
definitions) and refer to it in the <feature> tags. Otherwise it is
hard to find out which feature of Compound 1 corresponds to which
feature of Compound 2. Relying on URI parsing will not work unless
everything is supposed to live on the same server. For example, would
a feature with uri="yourservice/3" be considered the same as the one with
"myservice/3"?

<dataset> 

    <compound> 
        <link ref="myservice/c1"/> 
        <feature> 
            <link ref="myservice/2"/>   
        </feature> 
        <feature> 
            <link ref="myservice/3"/>   
        </feature> 
    </compound> 
    <compound> 
        <link ref="myservice/c2"/> 

        <feature> 
            <link ref="myservice/2"/>   
        </feature> 
        <feature> 
            <link ref="yourservice/3"/>   
        </feature> 
    </compound> 

</dataset>
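
A rough Python sketch of how such a header could settle the question, under
stated assumptions: the header is modelled as a plain mapping from each
feature URI to the feature definition it belongs to, and the
"definition/..." identifiers and function names are hypothetical. Whether
"yourservice/3" and "myservice/3" really are the same feature is exactly
what the header would have to declare; here it is assumed, for illustration
only, that they are.

```python
# Hypothetical header: feature URI -> feature definition it instantiates.
# The mapping below is an assumption, not something the URIs themselves tell us.
header = {
    "myservice/2": "definition/A",
    "myservice/3": "definition/B",
    "yourservice/3": "definition/B",
}

# The dataset from the XML example: compound URI -> list of feature URIs.
dataset = {
    "myservice/c1": ["myservice/2", "myservice/3"],
    "myservice/c2": ["myservice/2", "yourservice/3"],
}


def same_feature(uri_a, uri_b, header):
    """Two feature URIs match only if the header maps both to one definition."""
    def_a = header.get(uri_a)
    return def_a is not None and def_a == header.get(uri_b)


def features_by_definition(dataset, header):
    """Key each compound's feature URIs by the definition named in the header."""
    return {
        compound: {header[uri]: uri for uri in uris}
        for compound, uris in dataset.items()
    }


aligned = features_by_definition(dataset, header)
```

With the header in place no URI parsing is needed: the two compounds line
up on "definition/B" even though the feature URIs point to different
servers.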

Best regards,
Nina
> (and it is up to the model developer to deal with this situation).
>   

> Best regards,
> Christoph