[OTDev] Missing values [was Re: DataSet]
Christoph Helma helma at in-silico.de
Wed Oct 7 13:18:35 CEST 2009
A question for the RDF gurus: Can you give us an example of how to
represent features of a compound within RDF (maybe in different
notations)?

Christoph

Excerpts from Nina Jeliazkova's message of Wed Oct 07 10:48:00 +0200 2009:
> Christoph Helma wrote:
> >>> IMO, dataset with missing values is a valid one. There exist algorithms
> >>> in machine learning that can deal with missing values, usually
> >>> preprocessing ones (there are Weka implementations as well).
> >>> In any case, one might need to create a dataset without missing values
> >>> from a dataset with missing values (by ignoring empty entries or
> >>> applying something else). I am not sure if there should be API on
> >>> dataset level, on algorithms level, or model level. As I understand,
> >>> Christoph is in favour of dataset level API - am I right?
> >>>
> >
> > I do not think that we need API modifications to deal with missing
> > levels. The whole problem can be solved
> >
> > - by choosing an appropriate dataset representation (see below)
> > - by algorithm/model developers: they have to find a way to deal with
> >   missing values (calculate, ignore, ...)
> >
> >
> >> I like the idea of having missing values represented within the dataset.
> >>
> >> One thing that would be useful, would be to have consistent notation to
> >> indicate a missing value. Something like 'NA' etc
> >>
> >
> > I disagree. My impression is that the whole concept of missing values
> > originates from the fact that we (and a lot of software) are trained to
> > think in terms of tables. Having a fixed number of columns requires of
> > course a method to indicate missing values. As soon as we represent a
> > dataset differently e.g. like
> >
> > compound1_uri:
> >   - feature1_uri
> >   - feature2_uri
> >   ...
> > compound2_uri:
> >   - feature1_uri
> >   - feature3_uri
> >   ...
> > ...
> >
> > or in XML
> >
> > <dataset>
> >
> >   <compound>
> >     <link ref="uri"/>
> >     <feature>
> >       <link ref="uri"/>
> >     </feature>
> >     <feature>
> >       <link ref="uri"/>
> >     </feature>
> >   </compound>
> >   <compound>
> >     <link ref="uri"/>
> >     <feature>
> >       <link ref="uri"/>
> >     </feature>
> >   </compound>
> >
> > </dataset>
> >
> > we do not have to indicate missing features - they are just not there
> >
> Very good point. This representation is essentially using an implicit
> mechanism for denoting missing values; their interpretation is left to
> the model service. Tables are unavoidable even for such common models as
> MLR.
>
> The formats Christoph proposes here (and at the wiki as well) are a nice
> solution to this issue (missing values) and also to the case when the
> feature values might reside on different servers.
> One of my concerns is how verbose this is going to be - we are
> effectively replacing a single value by a much larger construct <feature>
> <link ref="uri"/> </feature> (in XML; YAML is better, but still more
> verbose than the value itself).
>
> Regarding the dataset, I would keep the header part (feature
> definitions) and refer to them in the <feature> tags. Otherwise it is
> hard to find out which feature for Compound 1 corresponds to which
> feature for Compound 2. Relying on URI parsing will not work, unless
> everything is supposed to live on the same server. For example, would
> a feature with uri="yourservice/3" be considered the same as the one with
> "myservice/3"?
>
> <dataset>
>
>   <compound>
>     <link ref="myservice/c1"/>
>     <feature>
>       <link ref="myservice/2"/>
>     </feature>
>     <feature>
>       <link ref="myservice/3"/>
>     </feature>
>   </compound>
>   <compound>
>     <link ref="myservice/c2"/>
>
>     <feature>
>       <link ref="myservice/2"/>
>     </feature>
>     <feature>
>       <link ref="yourservice/3"/>
>     </feature>
>   </compound>
>
> </dataset>
>
>
> Best regards,
> Nina
> > (and it is up to the model developer to deal with this situation).
>
> >
> > Best regards,
> > Christoph
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
> >
>
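
To make the trade-off discussed in this thread concrete, here is a minimal Python sketch of the sparse "compound -> features" representation and of how a model service that needs a table (for MLR, say) might derive one. It is an illustration only, not OpenTox code. The thread's examples list feature links rather than values, so the numeric values, the `Dataset` type, and the `to_table` and `drop_incomplete` helpers are all assumptions; the URIs are borrowed from the examples above.

```python
from typing import Dict, List, Optional, Tuple

# Sparse dataset: compound URI -> {feature URI -> value}.
# Features a compound lacks are simply absent; no 'NA' marker is stored.
# The URIs come from the thread's examples; the values are made up.
Dataset = Dict[str, Dict[str, float]]

dataset: Dataset = {
    "myservice/c1": {"myservice/2": 1.5, "myservice/3": 0.2},
    "myservice/c2": {"myservice/2": 0.7},
}


def to_table(ds: Dataset) -> Tuple[List[str], List[List[Optional[float]]]]:
    """Expand the sparse mapping into a fixed-column table.

    Returns the sorted feature URIs (the header) and one row per compound,
    in the dataset's insertion order, with None where a feature is absent.
    """
    header = sorted({f for feats in ds.values() for f in feats})
    rows = [[feats.get(f) for f in header] for feats in ds.values()]
    return header, rows


def drop_incomplete(ds: Dataset) -> Dataset:
    """One possible policy: keep only compounds that have every feature."""
    all_features = {f for feats in ds.values() for f in feats}
    return {c: feats for c, feats in ds.items() if set(feats) == all_features}


header, rows = to_table(dataset)
print(header)
print(rows)
print(list(drop_incomplete(dataset)))
```

In this sketch the missing-value marker (`None`) only appears once a consumer asks for a table, which matches the point that interpretation is left to the algorithm or model developer. Note that features are matched by exact URI string, so "myservice/3" and "yourservice/3" would end up in different columns; a shared header of feature definitions, as Nina suggests, would be needed to treat them as the same feature.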