[OTDev] acceptValues (again)

Tue May 10 19:41:27 CEST 2011

Dear Christoph, All,

I agree that the ot:acceptValues property is not the proper way to represent
nominal features. It is actually a good illustration of how one can run into
problems by taking shortcuts (that is, the acceptValues property) instead of
getting the data model right.  Nominals in RDF should have been represented
via enumerations or equivalent classes instead (are there other
approaches?).

Here is what I would like to propose, after last month's brainstorming with
Ivelina. Sorry for the delay!

Nominal values have to be represented as subclasses of ot:FeatureValue, and
additionally a restriction on the allowed values is established via
owl:equivalentClass:

<!-- carcinogen -->
<owl:Class rdf:ID="FeatureValueCarcinogen">
    <!-- It is a subclass of ot:FeatureValue -->
    <rdfs:subClassOf rdf:resource="#FeatureValue"/>
    <owl:equivalentClass>
        <owl:Class>
            <owl:intersectionOf rdf:parseType="Collection">
                <owl:Restriction>
                    <!-- Restriction on ot:feature: it can only take the ID
                         of one single feature, representing the Canc
                         column -->
                    <owl:onProperty rdf:resource="#feature"/>
                    <!-- The URI of the Canc column should go here -->
                    <owl:hasValue rdf:resource="#Canc"/>
                </owl:Restriction>
                <owl:Restriction>
                    <!-- Restriction on the value: it can only be 3 here -->
                    <owl:onProperty rdf:resource="#value"/>
                    <owl:hasValue rdf:datatype="&xsd;int">3</owl:hasValue>
                </owl:Restriction>
            </owl:intersectionOf>
        </owl:Class>
    </owl:equivalentClass>
</owl:Class>

<!-- non-carcinogen -->
<owl:Class rdf:ID="FeatureValueNonCarcinogen">
    <rdfs:subClassOf rdf:resource="#FeatureValue"/>
    <owl:equivalentClass>
        <owl:Class>
            <owl:intersectionOf rdf:parseType="Collection">
                <owl:Restriction>
                    <!-- Same restriction on ot:feature as above -->
                    <owl:onProperty rdf:resource="#feature"/>
                    <owl:hasValue rdf:resource="#Canc"/>
                </owl:Restriction>
                <owl:Restriction>
                    <!-- Restriction on the value: it can only be 1 here -->
                    <owl:onProperty rdf:resource="#value"/>
                    <owl:hasValue rdf:datatype="&xsd;int">1</owl:hasValue>
                </owl:Restriction>
            </owl:intersectionOf>
        </owl:Class>
    </owl:equivalentClass>
</owl:Class>

<!-- Example instances of the classes above; in this example only two
instances are needed -->

<FeatureValueCarcinogen rdf:ID="FeatureValueCarcinogen_instance"/>

<FeatureValueNonCarcinogen rdf:ID="FeatureValueNonCarcinogen_instance"/>

<!-- And data entries as usual -->

    <DataEntry rdf:ID="DataEntry_1">
        <compound rdf:resource="#Compound_1"/>
        <values rdf:resource="#FeatureValueCarcinogen_instance"/>
    </DataEntry>

    <DataEntry rdf:ID="DataEntry_2">
        <compound rdf:resource="#Compound_2"/>
        <values rdf:resource="#FeatureValueCarcinogen_instance"/>
    </DataEntry>

    <DataEntry rdf:ID="DataEntry_3">
        <compound rdf:resource="#Compound_3"/>
        <values rdf:resource="#FeatureValueNonCarcinogen_instance"/>
    </DataEntry>

    <DataEntry rdf:ID="DataEntry_4">
        <compound rdf:resource="#Compound_4"/>
        <values rdf:resource="#FeatureValueNonCarcinogen_instance"/>
    </DataEntry>

You may notice that, since the FeatureValues are no longer anonymous
instances, they are reused across the data entries, which makes the RDF
shorter and more readable.  One can also introduce additional properties on
the FeatureValue classes, for example to assign a positive/negative meaning
or an ordering.
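
As a rough sketch of that last point, such properties could be attached as
annotations on the value classes. The property names isPositive and rank
below are purely hypothetical and are not part of the OpenTox ontology. The
fragment also assumes the same namespace and entity declarations as the
classes above, and that the default namespace matches the document base, as
the unprefixed DataEntry elements already imply:

```xml
<!-- Hypothetical annotation properties, not defined in the OpenTox ontology -->
<owl:AnnotationProperty rdf:ID="isPositive"/>
<owl:AnnotationProperty rdf:ID="rank"/>

<!-- Mark the carcinogen class as the "positive" outcome and rank it above
     the non-carcinogen class -->
<owl:Class rdf:about="#FeatureValueCarcinogen">
    <isPositive rdf:datatype="&xsd;boolean">true</isPositive>
    <rank rdf:datatype="&xsd;int">2</rank>
</owl:Class>

<owl:Class rdf:about="#FeatureValueNonCarcinogen">
    <isPositive rdf:datatype="&xsd;boolean">false</isPositive>
    <rank rdf:datatype="&xsd;int">1</rank>
</owl:Class>
```

With something along these lines, a validation service could look up the
positive class when computing sensitivity/specificity, and a client could
order classes by rank, instead of guessing from the labels.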

I think we could continue using both ot:acceptValues and the new approach in
the same RDF, for compatibility reasons, so that clients can choose to parse
one or the other representation.  What the proposal adds is a new subclass
for each nominal value; the relevant ot:FeatureValue instances are then
replaced with instances of the new classes.  Dataset parsing should not
change at all.
The new classes for the nominal features should be generated by the dataset
service.
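
To illustrate what the service might emit, here is a sketch for the "Canc"
feature of the stringified ISSCAN dataset quoted below. The feature URI and
the "carcinogen" value are taken from that message; the class name and the
use of an xsd:string literal are assumptions, and the #feature and #value
properties are the same placeholders as in the classes above:

```xml
<!-- Sketch of a generated class; the class name is hypothetical -->
<owl:Class rdf:ID="FeatureValue_530584_carcinogen">
    <rdfs:subClassOf rdf:resource="#FeatureValue"/>
    <owl:equivalentClass>
        <owl:Class>
            <owl:intersectionOf rdf:parseType="Collection">
                <owl:Restriction>
                    <!-- Real feature URI instead of the #Canc placeholder -->
                    <owl:onProperty rdf:resource="#feature"/>
                    <owl:hasValue rdf:resource="http://apps.ideaconsult.net:8080/ambit2/feature/530584"/>
                </owl:Restriction>
                <owl:Restriction>
                    <!-- One of the feature's accepted values -->
                    <owl:onProperty rdf:resource="#value"/>
                    <owl:hasValue rdf:datatype="&xsd;string">carcinogen</owl:hasValue>
                </owl:Restriction>
            </owl:intersectionOf>
        </owl:Class>
    </owl:equivalentClass>
</owl:Class>
```

The service would emit one such class per accepted value of each nominal
feature, so "SAL" would get four classes.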

The triples above were generated with Protege 3.4.  I hope they are correct,
but improvements are welcome, as well as alternative solutions.

Best regards,
Nina

On 10 May 2011 19:39, Christoph Helma <helma at in-silico.ch> wrote:

> Dear All,
>
> I have to bring up the topic of acceptValues for classification datasets
> once again. I will use the "stringified" version of the ISSCAN dataset
> (http://apps.ideaconsult.net:8080/ambit2/dataset/429390) as an example.
> This dataset has two nominal features:
>
> "Canc" (http://apps.ideaconsult.net:8080/ambit2/feature/530584) with
> acceptValues: "carcinogen", "noncarcinogen"
>
> "SAL" (http://apps.ideaconsult.net:8080/ambit2/feature/530585) with
> acceptValues: "ND", "equivocal", "mutagen", "nonmutagen"
>
> Especially the second example makes it clear that acceptValues are
> presently a mixed bag. Applying classification algorithms without caring
> for the semantics of acceptValues would also create "ND" and "equivocal"
> predictions, which is of course nonsense.
>
> Generally speaking we would need mechanisms to
>
> - indicate classes that should not be used for modelling (e.g. "ND",
>  "equivocal", "inconclusive", ...)
>
> - distinguish between ordered (e.g. weak, medium, strong) and unordered
>  classes (e.g. toxic mechanisms like narcotic, alkylating, ...)
>
> - indicate ranks in ordered classes (or "positives" vs "negatives" in
>  binary classifications)
>
> This information is not only necessary for the graphical depiction of
> prediction results (coloring "toxic" classes in green would not be very
> intuitive), but also for selecting algorithms (regression can make sense
> for ordered classes, but not for unordered), the generation of reports
> and for validation (how can we determine sensitivity/specificity if we
> do not know positive/negative classes).
>
> I am aware that adding such information will require (documented) human
> intervention (WP3?), but I think it is worth the additional efforts. I
> also think that such information should be added to the source (i.e.
> datasets) and not through guesswork/hacks at the GUI/report/validation
> level. I would also like to retain the original information (e.g.
> equivocal classifications) in the dataset, because it can be useful for
> exploration and comparison purposes.
>
> If we can agree on these requirements we can proceed to discuss their
> implementation in the dataset representation.
>
> Best regards,
> Christoph