[OTDev] Missing values [was Re: DataSet]
Yap Chun Wei phayapc at nus.edu.sg
Wed Oct 7 10:03:28 CEST 2009
- Previous message: [OTDev] Missing values [was Re: DataSet]
- Next message: [OTDev] Missing values [was Re: DataSet]
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I joined only a few days ago, so pardon me if I am saying something that has been discussed before.

Would it be better to adopt the standards used in PMML for representing datasets? For example, the specification for the SupportVectorMachineModel stores the support vectors in a VectorDictionary, and we could use the same idea to store a dataset. It would just have to be coupled with the DataDictionary so that information about the features is available.

Best regards,
YAP Chun Wei (Dr) :: Assistant Professor, Department of Pharmacy :: Faculty of Science :: National University of Singapore :: padel.nus.edu.sg (W)

-----Original Message-----
From: development-bounces at opentox.org [mailto:development-bounces at opentox.org] On Behalf Of Christoph Helma
Sent: Wednesday, 7 October, 2009 3:41 PM
To: development
Subject: Re: [OTDev] Missing values [was Re: DataSet]

> > IMO, dataset with missing values is a valid one. There exists algorithms
> > in machine learning that can deal with missing values, usually
> > preprocessing ones (there are Weka implementations as well).
> > In any case, one might need to create a dataset without missing values
> > from a dataset with missing values (by ignoring empty entries or
> > applying something else). I am not sure if there should be API on
> > dataset level, on algorithms level, or model level. As understand,
> > Christoph is in favour of dataset level API - am I right?

I do not think that we need API modifications to deal with missing values. The whole problem can be solved
- by choosing an appropriate dataset representation (see below)
- by algorithm/model developers: they have to find a way to deal with missing values (calculate, ignore, ...)

> I like the idea of having missing values represented within the dataset.
>
> One thing that would be useful, would be to have consistent notation to
> indicate a missing value. Something like 'NA' etc

I disagree.
My impression is that the whole concept of missing values originates from the fact that we (and a lot of software) are trained to think in terms of tables. Having a fixed number of columns of course requires a method to indicate missing values. As soon as we represent a dataset differently, e.g. like

compound1_uri:
  - feature1_uri
  - feature2_uri
  ...
compound2_uri:
  - feature1_uri
  - feature3_uri
  ...
...

or in XML

<dataset>
  <compound>
    <link ref="uri"/>
    <feature>
      <link ref="uri"/>
    </feature>
    <feature>
      <link ref="uri"/>
    </feature>
  </compound>
  <compound>
    <link ref="uri"/>
    <feature>
      <link ref="uri"/>
    </feature>
  </compound>
</dataset>

we do not have to indicate missing features - they are just not there (and it is up to the model developer to deal with this situation).

Best regards,
Christoph

_______________________________________________
Development mailing list
Development at opentox.org
http://www.opentox.org/mailman/listinfo/development
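
As an illustration of the per-compound representation described in the message, here is a minimal Python sketch. It is not OpenTox code: the URIs are the placeholders from the message, and the helper names has_feature and to_table are hypothetical. It assumes features are recorded only by presence, as in the outline above, and shows that a missing entry appears only when a consumer chooses to build a fixed-column table.

```python
# Hypothetical placeholder URIs; each compound lists only the features it has.
dataset = {
    "compound1_uri": {"feature1_uri", "feature2_uri"},
    "compound2_uri": {"feature1_uri", "feature3_uri"},
}


def has_feature(dataset, compound, feature):
    """A feature that is not listed is simply absent; no 'NA' marker is stored."""
    return feature in dataset.get(compound, set())


def to_table(dataset):
    """Build a fixed-column table on the model developer's side.

    Only here, where the columns are fixed, does the notion of a missing
    entry appear; this sketch encodes absence as False.
    """
    columns = sorted(set().union(*dataset.values())) if dataset else []
    rows = {
        compound: [feature in features for feature in columns]
        for compound, features in dataset.items()
    }
    return columns, rows


columns, rows = to_table(dataset)
```

How absence is encoded in the table (False here, but it could equally be dropped or imputed) is a choice left to the algorithm or model developer, which is the division of responsibility the message argues for.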