[OTDev] Missing values [was Re: DataSet]
chung chvng at mail.ntua.gr
Tue Oct 6 19:41:24 CEST 2009
Dear Nina, Christoph, All,

Datasets with missing values are valid; however, we have to bear in mind
some density/sparsity criteria, at least for the time being. It's
absolutely impossible to train a model (even a "bad" one) using the
following "diagonal" dataset:

@attribute feat_1 double
@attribute feat_2 double
@attribute feat_3 double
...
@attribute feat_n double

1,?,?,?,...,?
?,25,?,?,...,?
?,?,-2.123,?,...,?
...
?,?,?,...,0.647,?
?,?,?,...,?,100

Another example is:

@attribute feat_1 double
@attribute feat_2 double
@attribute feat_3 double
...
@attribute feat_n double

?,1.1,2.23,100,200,...,1
?,1.41,5.32,9.0,843,92,...,10
...
?,3,4,5,...,300

In the second case the feature feat_1 should be completely removed. In
that case, the training service would return an error message. For the
time being, we need a dense dataset for experimental reasons before coping
with those special cases, which of course are the majority. Let's take one
step at a time. So I'd like to have a dataset - even a hypothetical one -
to do some tests.

On Tue, 2009-10-06 at 17:39 +0300, Nina Jeliazkova wrote:
> Christoph Helma wrote:
> > Excerpts from Nina Jeliazkova's message of Mon Oct 05 08:40:08 +0200 2009:
> >
> >> Dear Pantelis,
> >>
> >> chung wrote:
> >>
> >>> Hi Nina,
> >>>
> >>> On Fri, 2009-10-02 at 17:43 +0300, Nina Jeliazkova wrote:
> >>>
> >>>> Hi Pantelis,
> >>>>
> >>>> chung wrote:
> >>>>
> >>>>> Hi Nina,
> >>>>> Once we define the RESTful operation in the new version of the API, we
> >>>>> will have to start developing. Yet from the API 1.0, models are trained
> >>>>> provided a dataset URI, so we need such a dataset to do some experiments
> >>>>> (build an Instances object, train a model, perform some predictions
> >>>>> using the trained model). Is it possible for you to provide us a dataset
> >>>>> URI?
> >>>>>
> >>>> I am not sure what is the question - can you please clarify?
> >>>>
> >>> I mean that we need a dataset for which all RESTful operations specified
> >>> in API 1.0 or API 1.1 are implemented and for every operation a status
> >>> code 200 is normally expected. We need a dataset, say:
> >>>
> >>> http://someserver.com/dataset/123 (i)
> >>>
> >>> such that, for any compound in it, e.g.
> >>>
> >>> http://someserver.com/compound/55 (ii)
> >>>
> >>> and every feature definition in it:
> >>>
> >>> http://someserver.com/feature_definition/10 (iii)
> >>>
> >>> the following URI returns the value of the feature definition (iii) for
> >>> the compound (ii):
> >>>
> >>> http://someserver.com/feature/compound/55/feature_definition/10
> >>>
> >>> and will not return "NULL" or an error code (e.g. 404).
> >>> We need that dataset to develop model training web services. The input
> >>> parameters to our services will be the dataset URI and probably a URI
> >>> for the target feature. Will it be possible for you to provide us a
> >>> complete dataset object with all RESTful operations implemented? I mean,
> >>> we don't need a huge one, 20 compounds and some feature definitions will
> >>> be ok, but we need every compound/feature_definition pair to correspond
> >>> to a feature value!
> >>>
> >> I understand your reasoning, but please note in a generic setup some
> >> feature values might be missing and it is not the dataset provider job
> >> to fix that. Handling missing values is usually done by the modeller,
> >> we still need to think how to cast this process into the REST scheme.
> >>
> >> For example in the Toxcast dataset there are plenty of entries with
> >> missing values; one might address the issue with creating "derived"
> >> dataset by ignoring the entries without values, but one could also
> >> replace missing values with e.g. averages or using more complicated
> >> methods.
> >> I am copying this discussion to the development list as well,
> >> because it is a generic question - should the OpenTox framework provide
> >> API to handle missing values, where is the best place for this
> >> (preprocessing algorithms?), what API do we need?
> >>
> >
> > My first impression is that we do not need a separate API (or a
> > convention) for missing values - it should be the developer's task to deal
> > with "missing values". With a clear separation between features and
> > feature annotations, we also would not run into the problem, that values
> > for feature definitions are missing: A dataset representation would
> > contain only the features, that are available, not feature definitions
> > with possibly empty values.
> >
>
> IMO, dataset with missing values is a valid one. There exist algorithms
> in machine learning that can deal with missing values, usually
> preprocessing ones (there are Weka implementations as well).
> In any case, one might need to create a dataset without missing values
> from a dataset with missing values (by ignoring empty entries or
> applying something else). I am not sure if there should be API on
> dataset level, on algorithms level, or model level. As I understand,
> Christoph is in favour of dataset level API - am I right?
>
> BTW, Toxcast is a very nice example of a dataset with plenty of missing
> values.

Maybe we shouldn't regard this as normal in the case of topological
descriptors, because they can be calculated in silico. So, a topological
descriptor shouldn't be missing from a dataset or, if that is the case, it
would be better to calculate it for the corresponding compound rather than
replace the "?" with a value. Therefore, we can have a dense dataset with
no missing values and, as I said before, such a dataset has to be created
at least as an example and for testing reasons.
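
To make the density criteria and the options mentioned in this thread (removing a feature that is missing everywhere, replacing "?" with averages) concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions and not part of the OpenTox API or of Weka: the function names, the MIN_DENSITY threshold of 0.5 and the small 3x3 example data are all hypothetical, and only the data rows (not the @attribute header) are parsed.

```python
from statistics import mean

MISSING = "?"        # marker used for a missing value in the data rows above
MIN_DENSITY = 0.5    # hypothetical threshold, chosen only for this example


def parse_rows(text):
    """Parse comma-separated data rows; '?' becomes None, other cells floats."""
    return [
        [None if cell.strip() == MISSING else float(cell) for cell in line.split(",")]
        for line in text.strip().splitlines()
    ]


def density(rows):
    """Return the fraction of cells that actually hold a value."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is not None for cell in cells) / len(cells)


def drop_empty_columns(rows):
    """Remove features that are missing for every compound (the feat_1 case).

    Returns the reduced rows and the indices of the columns that were kept.
    """
    kept = [j for j in range(len(rows[0])) if any(row[j] is not None for row in rows)]
    return [[row[j] for j in kept] for row in rows], kept


def impute_mean(rows):
    """Replace each missing value with the mean of the observed values in its column.

    Assumes every column has at least one observed value, so call
    drop_empty_columns first.
    """
    means = [
        mean(row[j] for row in rows if row[j] is not None)
        for j in range(len(rows[0]))
    ]
    return [
        [means[j] if cell is None else cell for j, cell in enumerate(row)]
        for row in rows
    ]


def check_trainable(rows, min_density=MIN_DENSITY):
    """Raise ValueError if the dataset is too sparse, as a training service might."""
    d = density(rows)
    if d < min_density:
        raise ValueError(f"dataset too sparse: density {d:.2f} < {min_density}")


# A 3x3 version of the "diagonal" dataset: only one value per compound.
diagonal = parse_rows("1,?,?\n?,25,?\n?,?,-2.123")

# A small version of the second example: the first feature is always missing.
first_missing = parse_rows("?,1.1,2.23\n?,1.41,5.32\n?,3,4")
reduced, kept = drop_empty_columns(first_missing)
```

With these definitions, check_trainable(diagonal) raises an error because only a third of the cells hold a value, whereas reduced is fully dense once the empty first column is dropped. Whether such checks belong at dataset, algorithm or model level is exactly the open question of this thread; the sketch takes no position on that.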

Best Regards,
Pantelis

>
> Best regards,
> Nina
> > Best regards,
> > Christoph
> > _______________________________________________
> > Development mailing list
> > Development at opentox.org
> > http://www.opentox.org/mailman/listinfo/development
> >
>