[OTDev] Missing values [was Re: DataSet]

Christoph Helma helma at in-silico.de
Wed Oct 7 13:18:35 CEST 2009


A question for the RDF gurus: Can you give us an example how to
represent features of a compound within RDF (maybe in different
notations)?

Christoph

Excerpts from Nina Jeliazkova's message of Wed Oct 07 10:48:00 +0200 2009:
> Christoph Helma wrote:
> >>> IMO, a dataset with missing values is a valid one. There exist algorithms
> >>> in machine learning that can deal with missing values, usually
> >>> preprocessing ones (there are Weka implementations as well).
> >>> In any case, one might need to create a dataset without missing values
> >>> from a dataset with missing values (by ignoring empty entries or
> >>> applying something else).  I am not sure if there should be an API on
> >>> the dataset level, algorithm level, or model level.  As I understand,
> >>> Christoph is in favour of a dataset-level API - am I right?
> >>>       
> >
> > I do not think that we need API modifications to deal with missing
> > values. The whole problem can be solved
> >
> > - by choosing an appropriate dataset representation (see below)
> > - by algorithm/model developers: they have to find a way to deal with
> >   missing values (calculate, ignore, ...)
> >
> >   
> >> I like the idea of having missing values represented within the dataset.
> >>
> >> One thing that would be useful would be to have a consistent notation to
> >> indicate a missing value, something like 'NA'.
> >>     
> >
> > I disagree.  My impression is that the whole concept of missing values
> > originates from the fact that we (and a lot of software) are trained to
> > think in terms of tables. Having a fixed number of columns of course
> > requires a method to indicate missing values. As soon as we represent a
> > dataset differently, e.g. like
> >
> > compound1_uri:
> >   - feature1_uri
> >   - feature2_uri
> >   ...
> > compound2_uri:
> >   - feature1_uri
> >   - feature3_uri
> >   ...
> > ...
> >
> > or in XML
> >
> > <dataset> 
> >  
> >     <compound> 
> >         <link ref="uri"/> 
> >         <feature> 
> >             <link ref="uri"/>   
> >         </feature> 
> >         <feature> 
> >             <link ref="uri"/>   
> >         </feature> 
> >     </compound> 
> >     <compound> 
> >         <link ref="uri"/> 
> >         <feature> 
> >             <link ref="uri"/>   
> >         </feature> 
> >     </compound> 
> >  
> > </dataset>
> >
> > we do not have to indicate missing features - they are just not there
> >   
> Very good point. This representation essentially uses an implicit
> mechanism for denoting missing values; their interpretation is left to
> the model service. Tables are unavoidable even for such common models as
> MLR.
> 
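A minimal sketch of the point about tables, assuming each feature entry
resolves to a single numeric value: a model service that needs a table
(e.g. for MLR) can expand the sparse compound/feature listing itself and
pick its own marker for absent features. The URIs follow the YAML example
quoted above; the values and the function name to_table are made up for
illustration.

```python
# Hypothetical sparse dataset: each compound lists only the features it has.
# The numeric values are invented for this example.
sparse = {
    "compound1_uri": {"feature1_uri": 1.2, "feature2_uri": 0.4},
    "compound2_uri": {"feature1_uri": 0.9, "feature3_uri": 7.0},
}


def to_table(dataset, missing=None):
    """Expand a sparse compound -> features mapping into fixed columns.

    Features a compound does not list are filled with `missing`; the
    caller (the model developer) decides what that marker should be.
    """
    # Union of all feature URIs, sorted to get a stable column order.
    columns = sorted({uri for feats in dataset.values() for uri in feats})
    rows = {
        compound: [feats.get(col, missing) for col in columns]
        for compound, feats in dataset.items()
    }
    return columns, rows


columns, rows = to_table(sparse)
```

With this layout the missing-value marker only appears once a table is
requested; the dataset itself never has to carry one.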
> The formats Christoph proposes here (and on the wiki as well) are a nice
> solution to this issue (missing values) and also to the case where the
> feature values might reside on different servers.
> One of my concerns is how verbose this is going to be - we are
> effectively replacing a single value by a much larger construct <feature>
> <link ref="uri"/> </feature> (in XML; YAML is better, but still more
> verbose than the value itself).
> 
> Regarding the dataset, I would keep the header part (feature
> definitions) and refer to them in the <feature> tags. Otherwise it is
> hard to find out which feature for Compound 1 corresponds to which
> feature for Compound 2.   Relying on URI parsing will not work unless
> everything is supposed to live on the same server. For example, would
> a feature with uri="yourservice/3" be considered the same as the one with
> "myservice/3"?
> 
> <dataset> 
>      
>     <compound> 
>         <link ref="myservice/c1"/> 
>         <feature> 
>             <link ref="myservice/2"/>   
>         </feature> 
>         <feature> 
>             <link ref="myservice/3"/>   
>         </feature> 
>     </compound> 
>     <compound> 
>         <link ref="myservice/c2"/> 
> 
>         <feature> 
>             <link ref="myservice/2"/>   
>         </feature> 
>         <feature> 
>             <link ref="yourservice/3"/>   
>         </feature> 
>     </compound> 
>  
> </dataset>
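A rough sketch of the header idea, under the assumption that the header
maps every feature-value URI to the feature definition it belongs to.
The definition names and the function same_feature are hypothetical, and
declaring "myservice/3" and "yourservice/3" to be the same feature is
only an illustration, not an answer to the question above.

```python
# Hypothetical header: feature-value URI -> feature definition.
# The "definition/..." identifiers are invented for this example.
header = {
    "myservice/2": "definition/A",
    "myservice/3": "definition/B",
    "yourservice/3": "definition/B",
}


def same_feature(uri_a, uri_b, header):
    """Return True only if the header maps both URIs to one definition.

    No URI parsing is involved: URIs missing from the header are never
    treated as equal to anything.
    """
    def_a = header.get(uri_a)
    def_b = header.get(uri_b)
    return def_a is not None and def_a == def_b


across_servers = same_feature("myservice/3", "yourservice/3", header)
within_server = same_feature("myservice/2", "myservice/3", header)
```

The comparison relies only on what the header declares, so it works the
same way whether the values live on one server or on several.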
> 
> 
> Best regards,
> Nina
> > (and it is up to the model developer to deal with this situation).
> >   
> 
> > Best regards,
> > Christoph
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
> >   
> 


