[OTDev] Datatype of features

Nina Jeliazkova nina at acad.bg
Wed Dec 30 20:21:46 CET 2009


Hi Pantelis,

chung wrote:
> Hi Nina, All,
>  I tried to use the dataset
> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/6 to build some models
> and I managed to build some MLR and SVM ones but I encountered some
> problems handling the datatypes of features. For example, this dataset
> includes the following contradictory entries:
>
>
> ## line: 1406
>   </rdf:Description>
>   <rdf:Description rdf:nodeID="A400">
>     <ot:value
> rdf:datatype="http://www.w3.org/2001/XMLSchema#double">6.2426</ot:value>
>     <ot:feature
> rdf:resource="http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"/>
>     <rdf:type
> rdf:resource="http://www.opentox.org/api/1.1#FeatureValue"/>
>   </rdf:Description>
>
> and
>
> ## line: 1386
>   <rdf:Description rdf:nodeID="A396">
>     <ot:value rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
> </ot:value>
>     <ot:feature
> rdf:resource="http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"/>
>     <rdf:type
> rdf:resource="http://www.opentox.org/api/1.1#FeatureValue"/>
>   </rdf:Description>
>
> This means that the same feature appears as string and double in the
> same dataset. My understanding is that the second one is an empty
> string, i.e. a missing value, but I think it would be better if missing
> values were just missing. This would lead to a smaller RDF
> representation. What do you think?
>
>   
Missing values are not included in the RDF representation. You can
check this with other datasets/features, e.g.
http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/9?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12142

If there are empty strings in the RDF representation, then there are
empty strings in the database, and they should be treated as string
values.  Why this particular feature has string values is a separate
issue, which I will check.
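
A minimal sketch of how a client could detect such mixed-type features
itself, using only the Python standard library. It assumes the "ot"
prefix is bound to http://www.opentox.org/api/1.1# (inferred from the
rdf:type URI quoted above) and that feature values are serialized as
rdf:Description nodes as in the excerpt; the function name and the
sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OT = "http://www.opentox.org/api/1.1#"  # assumption: namespace bound to "ot"


def datatypes_per_feature(rdf_xml):
    """Map each feature URI to the set of XSD datatype names of its values."""
    seen = defaultdict(set)
    root = ET.fromstring(rdf_xml)
    for node in root.iter(f"{{{RDF}}}Description"):
        value = node.find(f"{{{OT}}}value")
        feature = node.find(f"{{{OT}}}feature")
        if value is None or feature is None:
            continue  # not a feature-value node
        uri = feature.get(f"{{{RDF}}}resource")
        datatype = value.get(f"{{{RDF}}}datatype", "")
        seen[uri].add(datatype.rsplit("#", 1)[-1])  # keep e.g. "double"
    return dict(seen)


# Illustrative document built from the two entries quoted above.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ot="http://www.opentox.org/api/1.1#">
  <rdf:Description rdf:nodeID="A400">
    <ot:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">6.2426</ot:value>
    <ot:feature rdf:resource="http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"/>
    <rdf:type rdf:resource="http://www.opentox.org/api/1.1#FeatureValue"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A396">
    <ot:value rdf:datatype="http://www.w3.org/2001/XMLSchema#string"></ot:value>
    <ot:feature rdf:resource="http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11954"/>
    <rdf:type rdf:resource="http://www.opentox.org/api/1.1#FeatureValue"/>
  </rdf:Description>
</rdf:RDF>"""

# Features whose values carry more than one datatype.
mixed = {
    uri: types
    for uri, types in datatypes_per_feature(SAMPLE).items()
    if len(types) > 1
}
```

Any feature that ends up in "mixed" would need cleaning before it is
passed to a training algorithm that expects numeric input.
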
> - Is there some query for the dataset with which we could retrieve only
> the non-string features?
>   
Please note that in real datasets _there are, and always will be,
features with values of mixed types_.  I have tried several times to
give examples of why this is so, and have suggested that we need some
kind of transformation service for data cleaning.  Data cleaning is an
essential part of any data mining workflow, and we should not hide it
inside a model generation service.
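
As a rough illustration of what such a cleaning step could do (a sketch
only, not an existing service), the snippet below converts raw feature
values to numbers and counts what it had to discard. The token list and
the function names are hypothetical, and treating "NC" as missing is a
policy choice that discards information; a real transformation service
would make that choice explicit and configurable.

```python
import math

# Hypothetical tokens treated as "no numeric value"; adjust per dataset.
NON_NUMERIC_TOKENS = {"", "nc", "blank"}


def to_number(raw):
    """Return a float for numeric text, or None for missing/non-numeric text."""
    text = raw.strip()
    if text.lower() in NON_NUMERIC_TOKENS:
        return None
    try:
        number = float(text)
    except ValueError:
        return None
    # Reject "nan" and "inf", which float() accepts but models usually cannot use.
    return number if math.isfinite(number) else None


def clean_column(values):
    """Clean one feature column; return (cleaned values, number discarded)."""
    cleaned = [to_number(v) for v in values]
    discarded = sum(1 for v in cleaned if v is None)
    return cleaned, discarded


cleaned, discarded = clean_column(["6.2426", "NC", "", "blank", " 12 "])
```

Reporting the number of discarded values lets the caller decide whether
the cleaned column is still usable for training.
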

Once again, some examples:
1) Skin sensitisation dataset: LLNA EC3% values are both numeric and
strings (NC for nonsensitizer)
http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/6?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/11951

2) EINECS list: molecular weight, as generated and distributed by JRC
http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/1?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/8

3) DSSTox CPDBAS database, TD50_Rat_mg
<http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12142>: you might
notice there are numbers, missing values, and also strings like
"blank"
http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/9?feature_uris[]=http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12142

Of course we might introduce queries to filter for specific data types,
but so far this does not exist in the API.  I would even propose that we
introduce in the API a check for compatibility between a dataset and an
algorithm, and between a dataset and a model, not just a check for data
types.
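
To make the idea concrete, here is one possible shape for such a
compatibility check. Nothing like this exists in the API yet; the
Requirement class, the function name, the set of "numeric" type names
and the short feature identifiers are all assumptions made for
illustration.

```python
from dataclasses import dataclass

# Assumed set of XSD type names regarded as numeric.
NUMERIC_TYPES = {"double", "float", "decimal", "integer", "int", "long"}


@dataclass(frozen=True)
class Requirement:
    """Hypothetical description of what an algorithm or model expects."""
    features: frozenset
    numeric_only: bool = True


def check_compatibility(dataset_types, requirement):
    """Compare a dataset against a requirement.

    dataset_types maps a feature URI to the set of datatype names found
    for it in the dataset. Returns a list of problems; an empty list
    means the dataset is compatible.
    """
    problems = []
    for feature in sorted(requirement.features):
        types = dataset_types.get(feature)
        if types is None:
            problems.append(f"missing feature: {feature}")
        elif requirement.numeric_only and not types <= NUMERIC_TYPES:
            found = ", ".join(sorted(types))
            problems.append(f"non-numeric values for {feature}: {found}")
    return problems


dataset_types = {"feature/1": {"double"}, "feature/2": {"double", "string"}}
requirement = Requirement(frozenset({"feature/1", "feature/2", "feature/3"}))
problems = check_compatibility(dataset_types, requirement)
```

Returning a list of problems rather than a single yes/no answer would
let a client report exactly which features need cleaning first.
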

Best regards,
Nina
> I have the feeling we're moving towards some kind of integration and
> that's quite encouraging. I deployed the new version today to let you do
> some tests. Working components are:
>
> * MLR and SVM model creation provided that the target attribute is
> declared as numeric.
> * All GET methods on /model
> and /model/{id}, /model/{id}/predicted, /model/{id}/dependent, /model/{id}/independent
> * POST on /model/{id} is supported only for MLR models and what is
> returned is just an ARFF representation of the predicted data.
>
>
> I attach some draft documentation reports for the services...
>
>
> Best regards,
> Pantelis



