[OTDev] Missing values [was Re: DataSet]

Nina Jeliazkova nina at acad.bg
Tue Oct 6 16:39:09 CEST 2009


Christoph Helma wrote:
> Excerpts from Nina Jeliazkova's message of Mon Oct 05 08:40:08 +0200 2009:
>   
>> Dear Pantelis,
>>
>> chung wrote:
>>     
>>> Hi Nina,
>>>
>>> On Fri, 2009-10-02 at 17:43 +0300, Nina Jeliazkova wrote: 
>>>   
>>>       
>>>> Hi Pantelis,
>>>>
>>>> chung wrote:
>>>>     
>>>>         
>>>>> Hi Nina,
>>>>>  Once we define the RESTful operation in the new version of the API, we
>>>>> will have to start developing. Yet as of API 1.0, models are trained
>>>>> given a dataset URI, so we need such a dataset to do some experiments
>>>>> (build an Instances object, train a model, perform some predictions
>>>>> using the trained model). Is it possible for you to provide us a dataset
>>>>> URI? 
>>>>>       
>>>>>           
>>>> I am not sure what the question is - can you please clarify?
>>>>     
>>>>         
>>> I mean that we need a dataset for which all RESTful operations specified
>>> in API 1.0 or API 1.1 are implemented and for every operation a status
>>> code 200 is normally expected. We need a dataset, say:
>>>
>>> http://someserver.com/dataset/123 (i)
>>>
>>> such that, for any compound in it, e.g.
>>>
>>> http://someserver.com/compound/55 (ii)
>>>
>>> and every feature definition in it:
>>>
>>> http://someserver.com/feature_definition/10 (iii)
>>>
>>> the following URI returns the value of the feature definition (iii) for
>>> the compound (ii):
>>>
>>> http://someserver.com/feature/compound/55/feature_definition/10 
>>>
>>> and will not return "NULL" or an error code (e.g. 404). 
>>> We need that dataset to develop model training web services. The input
>>> parameters to our services will be the dataset uri and probably a URI
>>> for the target feature. Will it be possible for you to provide us a
>>> complete dataset object with all RESTful operations implemented? I mean,
>>> we don't need a huge one, 20 compounds and some feature definitions will
>>> be ok, but we need every compound/feature_definition pair to correspond
>>> to a feature value!  
>>>   
>>>       
>> I understand your reasoning, but please note that in a generic setup some
>> feature values might be missing and it is not the dataset provider's job
>> to fix that.  Handling missing values is usually done by the modeller;
>> we still need to think about how to cast this process into the REST scheme.
>>
>> For example, in the Toxcast dataset there are plenty of entries with
>> missing values; one might address the issue by creating a "derived"
>> dataset that ignores the entries without values, but one could also
>> replace missing values with e.g. averages, or use more complicated
>> methods.  I am copying this discussion to the development list as well,
>> because it is a generic question - should the OpenTox framework provide
>> an API to handle missing values, where is the best place for this
>> (preprocessing algorithms?), and what API do we need?
>>     
>
> My first impression is that we do not need a separate API (or a
> convention) for missing values - it should be the developer's task to deal
> with "missing values".  With a clear separation between features and
> feature annotations, we also would not run into the problem that values
> for feature definitions are missing: A dataset representation would
> contain only the features that are available, not feature definitions
> with possibly empty values.
>
>   
IMO, a dataset with missing values is a valid one. There exist algorithms
in machine learning that can deal with missing values, usually
preprocessing ones (there are Weka implementations as well).
In any case, one might need to create a dataset without missing values
from a dataset with missing values (by ignoring empty entries or
applying something else).  I am not sure if there should be an API at the
dataset level, the algorithm level, or the model level.  As I understand,
Christoph is in favour of a dataset level API - am I right?
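
To make the two options concrete, here is a minimal Python sketch of
deriving a dataset without missing values, either by ignoring incomplete
entries or by replacing missing values with averages. It is only an
illustration under assumptions: the in-memory list-of-dicts layout, the
feature keys "f10" and "f11", the extra compound URIs and the function
names are all hypothetical and are not part of any OpenTox API.

```python
from statistics import mean

# Hypothetical in-memory dataset: one dict per compound; None marks a
# missing feature value. Keys and URIs are illustrative assumptions.
dataset = [
    {"compound": "http://someserver.com/compound/55", "f10": 1.0, "f11": 4.0},
    {"compound": "http://someserver.com/compound/56", "f10": None, "f11": 6.0},
    {"compound": "http://someserver.com/compound/57", "f10": 3.0, "f11": None},
]
FEATURES = ["f10", "f11"]


def drop_incomplete(rows, features):
    """Derived dataset keeping only compounds with a value for every feature."""
    return [r for r in rows if all(r.get(f) is not None for f in features)]


def impute_mean(rows, features):
    """Derived dataset where each missing value is replaced by the feature mean."""
    means = {}
    for f in features:
        known = [r[f] for r in rows if r.get(f) is not None]
        # A feature with no known values at all stays missing (None).
        means[f] = mean(known) if known else None
    return [
        {**r, **{f: (r.get(f) if r.get(f) is not None else means[f])
                 for f in features}}
        for r in rows
    ]


complete = drop_incomplete(dataset, FEATURES)
imputed = impute_mean(dataset, FEATURES)
```

Both functions return a new list and leave the original dataset
untouched, which matches the idea of a "derived" dataset; whether such an
operation would live at the dataset, algorithm or model level is exactly
the open question.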

BTW, Toxcast is a very nice example of a dataset with plenty of missing
values.

Best regards,
Nina
> Best regards,
> Christoph