[OTDev] ToxCreate integration of Ambit classification datasets

Tue Mar 22 19:55:45 CET 2011

Dear Christoph,

On 22 March 2011 20:43, Christoph Helma <helma at in-silico.ch> wrote:

> Dear Nina, Vedrin, All,
>
> I had a look at feature
> http://apps.ideaconsult.net:8080/ambit2/feature/21573 from
> http://apps.ideaconsult.net:8080/ambit2/dataset/9, which raises some
> interesting questions: IMHO "Canc" is clearly a nominal feature, but its
> representation tells me that it is both a nominal and a numeric feature
> (maybe due to the fact that classes are represented as "1.0", "2.0" and
> "3.0").

Yes.

> In order to call the correct (classification or regression)
> algorithms I need however to know unambiguously:
>
>  1. the feature type (Numeric or Nominal)
>  2. "true" and "false" classes for binary classifications
>
> I assume that 1. can be easily solved by making NumericFeature and
> NominalFeature disjoint.
>

Currently, a feature can be both numeric and nominal, which is useful in this
case, and I don't think it is contradictory.
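
To illustrate how a client could still pick an algorithm when a feature
carries both types, here is a minimal Python sketch. The precedence rule
(nominal wins over numeric) and the name choose_algorithm are assumptions
made for illustration only; they are not part of the OpenTox API.

```python
NOMINAL = "ot:NominalFeature"
NUMERIC = "ot:NumericFeature"


def choose_algorithm(feature_types):
    """Return 'classification' or 'regression' for a set of rdf:type names.

    Assumption: a feature typed as nominal is treated as a class label
    even if it is also typed as numeric.
    """
    types = set(feature_types)
    if NOMINAL in types:
        return "classification"
    if NUMERIC in types:
        return "regression"
    raise ValueError("feature is neither nominal nor numeric")


# Types of feature/21573 as given in its RDF representation.
canc_types = ["ot:Feature", "ot:NumericFeature", "ot:NominalFeature"]
print(choose_algorithm(canc_types))
```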

>
> Guessing "true" and "false" classes is harder, because there are many
> possibilities to indicate them in real world datasets. In our services
> we are currently checking with regular expressions for common cases
> (e.g. active/inactive, 1/0, toxic/nontoxic, ...), but this will not work
> for all possible feature values.
>

If you look at the RDF representation of /dataset/9, you will find
ot:acceptValue, which lists the possible values for the feature. This was
agreed for API 1.1, is in opentox.owl, and is used by the TUM/NTUA
services as far as I know.

<http://apps.ideaconsult.net:8080/ambit2/feature/21573>
      a       ot:Feature , ot:NumericFeature , ot:NominalFeature ;
      dc:creator "http://www.epa.gov/NCCT/dsstox/sdf_isscan_external.html" ;
      dc:title "Canc" ;
      ot:acceptValue "3.0" , "1.0" ;
      ot:hasSource <http://apps.ideaconsult.net:8080/ambit2/dataset/ISSCAN_v3a_1153_19Sept08.1222179139.sdf> ;
      ot:units "" ;
      =       otee:Carcinogenicity .

I would suggest modifying your implementation to use ot:acceptValue instead
of regular expressions.
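
As a rough sketch of what that could look like on the client side, the
Python below reads the class labels from the declared accept values rather
than guessing them from the data. The dict layout and the helper names
class_labels and is_binary are hypothetical; the feature description is
hand-copied from the RDF above, whereas a real client would obtain it with
an RDF parser.

```python
# Feature description, hand-copied from the RDF above into a plain dict.
feature = {
    "uri": "http://apps.ideaconsult.net:8080/ambit2/feature/21573",
    "title": "Canc",
    "types": {"ot:Feature", "ot:NumericFeature", "ot:NominalFeature"},
    "acceptValue": ["3.0", "1.0"],
}


def class_labels(feature):
    """Return the declared class labels, sorted for a stable order."""
    return sorted(feature.get("acceptValue", []))


def is_binary(feature):
    """True when exactly two distinct class labels are declared."""
    return len(set(class_labels(feature))) == 2


labels = class_labels(feature)
print(labels, is_binary(feature))
```

Note that this only yields the set of labels; it does not by itself say
which label should be treated as the "true" class.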

> I have no definitive solution for problem 2, a few thoughts:
>
> a) Present a list of classes and let the user assign true and false
>   classes
>   + can be used for all datasets/features (also for the discretization
>   of NumericFeatures)
>   - needs human intervention (not suited for automated model creation)
>   - same step has to be repeated every time a dataset is used
>   - might be error prone, might lead to suboptimal results from
>     inexperienced users
>
> b) Standardize allowed values for NominalFeatures
>   + unambiguous, automated processing possible
>   - needs human curation of imported datasets
>

>
> I tend to favor b) as a long-term solution; what's your opinion?
>

This was the reason for introducing ot:acceptValue. It lets you specify
the allowed values. Marking the feature as nominal does indeed need manual
intervention.

>
> Another question:
>
> If I expand our regexp hack and implement a) as a fallback, I would need
> to write new feature values into a dataset. Would you prefer to
>
>  - overwrite the old values in the original dataset (original
>    information is lost)
>  - add a new feature (with modified values) to the original dataset
>    (original information untouched, but might destroy the dataset if
>    handled improperly)
>  - create a new consolidated dataset (IMHO safest)
>

There is no problem with creating new datasets, but the preferred option is
to use ot:acceptValue, as regexps will not work for other datasets with
different values.
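
If a new consolidated dataset is created, a sketch along the following
lines would leave the original untouched. The row layout, the consolidate
helper and the label mapping are all hypothetical; in particular, nothing
here settles which value actually means "active".

```python
def consolidate(rows, feature, mapping):
    """Return new rows with `feature` values replaced via `mapping`.

    The input rows are not modified. Values missing from `mapping`
    raise KeyError so that unexpected classes are noticed early.
    """
    return [{**row, feature: mapping[row[feature]]} for row in rows]


original = [
    {"compound": "c1", "Canc": "3.0"},
    {"compound": "c2", "Canc": "1.0"},
]
# Hypothetical assignment of labels, chosen only for the example.
mapping = {"3.0": "true", "1.0": "false"}

new_dataset = consolidate(original, "Canc", mapping)
print(new_dataset)
print(original)
```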

Best regards,
Nina

>
> Best regards,
> Christoph