[OTDev] Datatype of features

Wed Dec 30 00:10:13 CET 2009

Hi Nina, All,
 In general I agree with your point of view on features. Real life is
always more complex than what we can describe with mathematics, but in
every case we have to agree on some assumptions. Indeed, some features
accept both numeric and non-numeric values, and there are various ways
of handling them. On the other hand, we need to take one step at a time
and identify the next steps to proceed with, so IMHO we now need the
following:

*  We need some proposals at the next meeting (1/5/2010) for data
    cleaning services (handling of missing values, removal of string
    features, consistency checking and so on). The Priority A
    algorithms do not include any data-cleaning ones except for an
    attribute selection algorithm (infoGain).
*  We need a dataset containing only numeric values: features declared
    as numeric, with exclusively numeric entries. This one,
    http://opentox.ntua.gr/ds.rdf, is a very simple dataset with random
    entries, but only GET is supported.
*  We need a way to define the datatype of every feature without having
    to check all values one by one, and a way to declare nominal
    features as well as mixed numeric/non-numeric ones.
*  Many algorithms cannot handle nominal values as-is (for example
    MLR), so we have to consider data cleanup services, as well as
    services that check the compatibility between dataset and algorithm
    and between dataset and model, as you pointed out too.
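
To make the third and fourth points concrete, here is a minimal sketch of
the value-by-value check a client has to do today: it collects the declared
rdf:datatype of every FeatureValue per feature and reports features that
appear with more than one type. It is not part of the API. The function
names are hypothetical, the sample RDF is a trimmed version of the two
entries quoted below, and the namespace bound to the ot: prefix is an
assumption inferred from the FeatureValue type URI.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OT = "http://www.opentox.org/api/1.1#"  # assumed namespace for the ot: prefix
XSD = "http://www.w3.org/2001/XMLSchema#"
FEATURE = "http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"

# Trimmed sample modelled on the two contradictory entries from dataset/6.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:ot="{OT}">
  <rdf:Description rdf:nodeID="A400">
    <ot:value rdf:datatype="{XSD}double">6.2426</ot:value>
    <ot:feature rdf:resource="{FEATURE}"/>
    <rdf:type rdf:resource="{OT}FeatureValue"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A396">
    <ot:value rdf:datatype="{XSD}string"></ot:value>
    <ot:feature rdf:resource="{FEATURE}"/>
    <rdf:type rdf:resource="{OT}FeatureValue"/>
  </rdf:Description>
</rdf:RDF>"""


def datatypes_per_feature(rdf_xml):
    """Map each feature URI to the set of XSD type names declared for it."""
    root = ET.fromstring(rdf_xml)
    seen = defaultdict(set)
    for node in root.iter(f"{{{RDF}}}Description"):
        value = node.find(f"{{{OT}}}value")
        feature = node.find(f"{{{OT}}}feature")
        if value is None or feature is None:
            continue  # not a FeatureValue-style node
        uri = feature.get(f"{{{RDF}}}resource")
        dtype = value.get(f"{{{RDF}}}datatype", "")
        seen[uri].add(dtype.rsplit("#", 1)[-1])  # keep only e.g. "double"
    return dict(seen)


def mixed_features(rdf_xml):
    """Return the feature URIs that occur with more than one datatype."""
    types = datatypes_per_feature(rdf_xml)
    return sorted(uri for uri, names in types.items() if len(names) > 1)


if __name__ == "__main__":
    for uri in mixed_features(SAMPLE):
        print("mixed datatypes:", uri)
```

A compatibility service for a numeric-only algorithm such as MLR could
reject a dataset whenever mixed_features() is non-empty, but an explicit
datatype declaration on the feature itself would make this scan
unnecessary.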

So I think we should discuss these issues at the next meeting.

Queries within the dataset URI don't offer much flexibility but would
be helpful; I think separate services are a better approach.
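
As a rough illustration of what a separate cleaning step could do, the
sketch below keeps only the features whose present values all parse as
numbers. Everything in it is hypothetical: the function names, the
row-as-dict layout and the sample values are made up for illustration
(only the "NC" marker is taken from the LLNA EC3% example quoted below),
and a real service would work on the RDF representation instead.

```python
def is_number(text):
    """True if the value can be read as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def numeric_features(rows):
    """Return the names of features whose present values are all numeric."""
    names = {name for row in rows for name in row}
    keep = set()
    for name in names:
        values = [row[name] for row in rows if name in row]
        if values and all(is_number(v) for v in values):
            keep.add(name)
    return keep


def drop_non_numeric(rows):
    """Remove every feature that has at least one non-numeric value."""
    keep = numeric_features(rows)
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


if __name__ == "__main__":
    # Illustrative rows: one dict per compound, feature name -> raw value.
    rows = [
        {"EC3": "6.2426", "MW": "180.2"},
        {"EC3": "NC", "MW": "94.1"},  # "NC" makes EC3 a mixed feature
        {"MW": "122.5"},              # EC3 is simply missing here
    ]
    print(drop_non_numeric(rows))
```

Here a missing value is an absent key and does not disqualify a feature,
while an empty string or a marker such as "NC" does, which matches the
distinction Nina draws between missing values and string values.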

Have a happy New Year 2010,
Pantelis

On Wed, 2009-12-30 at 21:21 +0200, Nina Jeliazkova wrote:

> Hi Pantelis,
> 
> chung wrote: 
> 
> > Hi Nina, All,
> >  I tried to use the dataset
> > http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/6 to build some models
> > and I managed to build some MLR and SVM ones but I encountered some
> > problems handling the datatypes of features. For example, this dataset
> > includes the following contradictory entries:
> > 
> > 
> > ## line: 1406
> >   </rdf:Description>
> >   <rdf:Description rdf:nodeID="A400">
> >     <ot:value
> > rdf:datatype="http://www.w3.org/2001/XMLSchema#double">6.2426</ot:value>
> >     <ot:feature
> > rdf:resource="http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"/>
> >     <rdf:type
> > rdf:resource="http://www.opentox.org/api/1.1#FeatureValue"/>
> >   </rdf:Description>
> > 
> > and
> > 
> > ## line: 1386
> >   <rdf:Description rdf:nodeID="A396">
> >     <ot:value rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
> > </ot:value>
> >     <ot:feature
> > rdf:resource="http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"/>
> >     <rdf:type
> > rdf:resource="http://www.opentox.org/api/1.1#FeatureValue"/>
> >   </rdf:Description>
> > 
> > This means that the same feature appears as string and double in the
> > same dataset. My understanding is that the second one is an empty
> > string, i.e. a missing value, but I think it would be better if missing
> > values were just missing. This would lead to a smaller RDF
> > representation. What do you think?
> > 
> >   
> 
> Missing values are not included in the RDF representation. You might
> check with other datasets/features, e.g.
> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/9?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12142
> 
>  If there are empty strings in the RDF representation, then this means
> there are empty strings in the database and they should be considered
> string values.  Why there are string values for exactly this feature
> is another issue, which I'll be checking.
> 
> > - Is there some query for the dataset with which we could retrieve only
> > the non-string features?
> >   
> 
> Please note: in real datasets there are, and will always be,
> features with values of mixed types.  I have tried several times to
> present examples of why this is so, and suggested that we need some
> kind of transformation service for data cleaning.  Data cleaning is an
> essential part of any data mining workflow and we should not be hiding
> this inside a model generation service.
> 
> Once again, some examples:
> 1) Skin sensitisation dataset: LLNA EC3% values are both numbers and
> strings (NC for nonsensitizer)
> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/6?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11951
> 
> 2) EINECS list, Molecular weight, as generated and distributed by JRC
> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/1?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/8
> 
> 3) DSSTox CPDBAS database, TD50_Rat_mg: you might notice there are
> numbers, missing values and also strings like "blank"
> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/9?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12142
> 
> Of course we might introduce queries to filter for specific data
> types, but so far this does not exist in the API.  I would even
> propose we introduce in the API a check for compatibility between
> dataset and algorithm, and between dataset and model, not just a
> check for data types.
> 
> Best regards,
> Nina
> 
> > I have the feeling we're moving towards some kind of integration and
> > that's quite encouraging. I deployed the new version today to let you do
> > some tests. Working components are:
> > 
> > * MLR and SVM model creation provided that the target attribute is
> > declared as numeric.
> > * All GET methods on /model
> > and /model/{id}, /model/{id}/predicted, /model/{id}/dependent, /model/{id}/independent
> > * POST on /model/{id} is supported only for MLR models and what is
> > returned is just an ARFF representation of the predicted data.
> > 
> > 
> > I attach some draft documentation reports for the services...
> > 
> > 
> > Best regards,
> > Pantelis
> 
>