[OTDev] acceptValues (again)

Tue May 10 18:39:08 CEST 2011

Dear All,

I have to bring up the topic of acceptValues for classification datasets
once again. I will use the "stringified" version of the ISSCAN dataset
(http://apps.ideaconsult.net:8080/ambit2/dataset/429390) as an example.
This dataset has two nominal features:

"Canc" (http://apps.ideaconsult.net:8080/ambit2/feature/530584) with
acceptValues: "carcinogen", "noncarcinogen"

"SAL" (http://apps.ideaconsult.net:8080/ambit2/feature/530585) with
acceptValues: "ND", "equivocal", "mutagen", "nonmutagen"

The second example in particular makes it clear that acceptValues are
presently a mixed bag. Applying classification algorithms without regard
for the semantics of acceptValues would also produce "ND" and
"equivocal" predictions, which is of course nonsense.

Generally speaking, we would need mechanisms to

- indicate classes that should not be used for modelling (e.g. "ND",
  "equivocal", "inconclusive", ...)

- distinguish between ordered (e.g. weak, medium, strong) and unordered
  classes (e.g. toxic mechanisms like narcotic, alkylating, ...)

- indicate ranks in ordered classes (or "positives" vs "negatives" in
  binary classifications)
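
As a rough illustration of the three mechanisms listed above, the
following Python sketch annotates the "SAL" feature. All names
(NominalFeature, excluded, ordered, ranks, modelling_classes) are
hypothetical and not part of any existing OpenTox API; treating the
highest rank as the "positive" class in the binary case is likewise
only an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NominalFeature:
    """Hypothetical annotation of a nominal feature (illustration only)."""
    name: str
    accept_values: List[str]                               # original values, kept untouched
    excluded: Set[str] = field(default_factory=set)        # classes not to be used for modelling
    ordered: bool = False                                  # True if the classes have a natural order
    ranks: Dict[str, int] = field(default_factory=dict)    # rank per class (assumed: higher = "more positive")

    def modelling_classes(self) -> List[str]:
        """Return the classes usable for modelling, sorted by rank if ordered."""
        usable = [v for v in self.accept_values if v not in self.excluded]
        if self.ordered:
            usable.sort(key=lambda v: self.ranks[v])
        return usable


# Values taken from the "SAL" feature; the annotations are assumed.
sal = NominalFeature(
    name="SAL",
    accept_values=["ND", "equivocal", "mutagen", "nonmutagen"],
    excluded={"ND", "equivocal"},
    ordered=True,
    ranks={"nonmutagen": 0, "mutagen": 1},
)

print(sal.modelling_classes())
```

Note that accept_values itself is not modified, so "ND" and "equivocal"
remain available for exploration and comparison.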

This information is necessary not only for the graphical depiction of
prediction results (coloring "toxic" classes in green would not be very
intuitive), but also for selecting algorithms (regression can make sense
for ordered classes, but not for unordered ones), for generating
reports, and for validation (how can we determine sensitivity and
specificity if we do not know the positive/negative classes?).
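
To make the validation point concrete, here is a minimal Python sketch
of a sensitivity/specificity calculation. The function name, its
parameters, and the example predictions are invented for illustration;
only the class labels come from the "Canc" feature. The point is that
the function cannot be written without being told which class is
positive and which is negative.

```python
def sensitivity_specificity(actual, predicted, positive, negative):
    """Compute (sensitivity, specificity) for a binary classification.

    Pairs in which either value is neither `positive` nor `negative`
    (e.g. "ND" or "equivocal") are skipped. Returns None for a measure
    whose denominator is zero.
    """
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive and p == negative:
            fn += 1
        elif a == negative and p == negative:
            tn += 1
        elif a == negative and p == positive:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Made-up example data (not taken from the ISSCAN dataset).
actual = ["carcinogen", "carcinogen", "noncarcinogen", "noncarcinogen", "carcinogen"]
predicted = ["carcinogen", "noncarcinogen", "noncarcinogen", "carcinogen", "carcinogen"]

sens, spec = sensitivity_specificity(
    actual, predicted, positive="carcinogen", negative="noncarcinogen"
)
print(sens, spec)
```

Swapping the positive and negative arguments swaps the two measures,
which is exactly the ambiguity that arises when this information is
missing from the dataset.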

I am aware that adding such information will require (documented) human
intervention (WP3?), but I think it is worth the additional effort. I
also think that such information should be added to the source (i.e.
datasets) and not through guesswork/hacks at the GUI/report/validation
level. I would also like to retain the original information (e.g.
equivocal classifications) in the dataset, because it can be useful for
exploration and comparison purposes.

If we can agree on these requirements, we can proceed to discuss their
implementation in the dataset representation.

Best regards,
Christoph