[OTDev] ToxCreate integration of Ambit classification datasets

Christoph Helma helma at in-silico.ch
Tue Mar 22 19:43:19 CET 2011


Dear Nina, Vedrin, All,

I had a look at feature
http://apps.ideaconsult.net:8080/ambit2/feature/21573 from
http://apps.ideaconsult.net:8080/ambit2/dataset/9, which raises some
interesting questions: IMHO "Canc" is clearly a nominal feature, but its
representation tells me that it is both a nominal and a numeric feature
(maybe due to the fact that classes are represented as "1.0", "2.0" and
"3.0"). To call the correct (classification or regression) algorithms,
however, I need to know unambiguously:

  1. the feature type (Numeric or Nominal)
  2. "true" and "false" classes for binary classifications

I assume that 1. can be solved easily by making NumericFeature and
NominalFeature disjoint.
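
To illustrate what a client could do once the two types are disjoint,
here is a minimal sketch. It assumes the declared types of a feature
have already been parsed into a collection of plain type names; the
function name resolve_feature_type and the return values are
hypothetical, not part of any existing service.

```python
def resolve_feature_type(declared_types):
    """Return "Nominal" or "Numeric"; raise if the declaration is ambiguous."""
    # Only these two type names decide which algorithm family applies.
    relevant = set(declared_types) & {"NominalFeature", "NumericFeature"}
    if relevant == {"NominalFeature"}:
        return "Nominal"  # use classification algorithms
    if relevant == {"NumericFeature"}:
        return "Numeric"  # use regression algorithms
    if not relevant:
        raise ValueError("feature is neither NominalFeature nor NumericFeature")
    # Both types declared: this is the case observed for "Canc".
    raise ValueError("feature is declared as both NominalFeature and NumericFeature")
```

With disjoint types the last branch can never be reached for a valid
feature, so reaching it signals a broken representation rather than a
case to guess about.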

Guessing "true" and "false" classes is harder, because there are many
ways to indicate them in real-world datasets. In our services
we are currently checking with regular expressions for common cases
(e.g. active/inactive, 1/0, toxic/nontoxic, ...), but this will not work
for all possible feature values.
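
A rough sketch of such a regular expression check follows. The
patterns cover only the common cases named above plus true/false, and
both the patterns and the function name guess_binary_classes are
illustrative assumptions, not the code used in our services.

```python
import re

# Illustrative patterns only; real datasets will need more cases.
TRUE_PATTERN = re.compile(r"1(\.0+)?|true|active|toxic", re.IGNORECASE)
FALSE_PATTERN = re.compile(r"0(\.0+)?|false|inactive|non-?toxic", re.IGNORECASE)

def guess_binary_classes(values):
    """Return (true_class, false_class), or None if no safe guess is possible."""
    classes = sorted({str(v).strip() for v in values})
    if len(classes) != 2:
        return None  # not a binary classification
    # fullmatch ensures "inactive" is not mistaken for "active".
    true_classes = [c for c in classes if TRUE_PATTERN.fullmatch(c)]
    false_classes = [c for c in classes if FALSE_PATTERN.fullmatch(c)]
    if len(true_classes) == 1 and len(false_classes) == 1:
        return true_classes[0], false_classes[0]
    return None  # unknown class names: fall back to asking the user
```

A feature with the classes "1.0", "2.0" and "3.0" yields None here,
which is exactly the case where the heuristic breaks down.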

I have no definitive solution for problem 2, only a few thoughts:

a) Present a list of classes and let the user assign true and false
   classes
   + can be used for all datasets/features (also for the discretization
   of NumericFeatures)
   - needs human intervention (not suited for automated model creation)
   - same step has to be repeated every time a dataset is used
   - might be error-prone and lead to suboptimal results for
   inexperienced users

b) Standardize allowed values for NominalFeatures
   + unambiguous, automated processing possible
   - needs human curation of imported datasets

I tend to favor b) as a long-term solution. What's your opinion?

Another question: 

If I expand our regexp hack and implement a) as a fallback, I would need
to write new feature values into a dataset. Would you prefer to

  - overwrite the old values in the original dataset (original
    information is lost)
  - add a new feature (with modified values) to the original dataset
    (original information untouched, but might destroy the dataset if
    handled improperly)
  - create a new consolidated dataset (IMHO safest)
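
As an illustration of the third option, here is a minimal sketch that
builds a new dataset and leaves the original untouched. It assumes a
dataset is held in memory as a dict that maps each compound to a dict
of feature values; that structure, the function name consolidate and
the name of the new feature are hypothetical. Keeping the original
values next to the new ones is also just one possible choice.

```python
def consolidate(dataset, feature, mapping, new_feature):
    """Return a new dataset with `new_feature` holding mapped values of `feature`."""
    consolidated = {}
    for compound, values in dataset.items():
        row = dict(values)  # copy the row so the original is never modified
        if feature in values:
            # A class missing from `mapping` raises KeyError instead of
            # being silently dropped.
            row[new_feature] = mapping[values[feature]]
        consolidated[compound] = row
    return consolidated

# Made-up example data; the class assignment is not taken from the real dataset.
original = {"compound/1": {"Canc": "1.0"}, "compound/2": {"Canc": "2.0"}}
result = consolidate(original, "Canc", {"1.0": "true", "2.0": "false"}, "Canc_binary")
```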

Best regards,
Christoph


