[OTDev] API 1.1. extensions

Nina Jeliazkova nina at acad.bg
Tue Jan 5 10:57:07 CET 2010


Hello All,

and Happy New Year 2010!

Please find below proposals and discussion points for API 1.1 extension:
>
> ToDo:
> Nina (and other contributors) - Any required changes to API as soon as
> possible and by Jan 5 latest
>

1) Feature data types:
Proposal (based on Pantelis's suggestions and the Protege guide) at
http://opentox.org/data/documents/development/RDF%20files/Datatypes .
Updated opentox.owl at
http://opentox.org/data/documents/development/RDF%20files/OpenToxOntology/view

2) Ontology service.  Proposal at
http://opentox.org/dev/apis/api-1.1/Ontology%20service , with example
queries for retrieving models, given an endpoint.

In addition, for quick searching when an Ontology service is not
available, introduce a
?query=URI-of-the-owl:sameAs-entry
parameter for the algorithm, model and feature services, which will
return all algorithms (models, features) whose owl:sameAs matches the
query.  Most services already have similar functionality, but the
parameter value is an arbitrary string and therefore unknown to generic
clients.
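
For illustration, a minimal client-side sketch in Python (the service
root and the query URI are placeholders; text/uri-list is the usual
OpenTox representation for listings):

    import requests

    # Placeholder service root; ?query carries the owl:sameAs URI
    # to match (URL-encoding is handled by the requests library).
    ALGORITHM_SERVICE = "http://example.org/algorithm"
    QUERY_URI = "http://www.opentox.org/echaEndpoints.owl#Carcinogenicity"

    # Expected result: URIs of all algorithms whose owl:sameAs
    # annotation equals the query URI, one per line.
    r = requests.get(ALGORITHM_SERVICE,
                     params={"query": QUERY_URI},
                     headers={"Accept": "text/uri-list"})
    r.raise_for_status()
    print(r.text.splitlines())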

3) How to identify features generated by an algorithm and a specific
set of parameters

According to the current opentox.owl, a Feature can be assigned an
Algorithm, Model or Dataset as its origin (via the property
ot:hasSource).  There is no support for Algorithm + Parameters, unless
the specific case of a Model is regarded as an Algorithm + Parameters
instance.

One possible solution could be:
- define a superclass A, determined by Algorithm + Parameters
- make Model a subclass of A
- define the range of ot:hasSource (the allowed origins of a Feature)
as A and Dataset
- find a nice name for the superclass A :)

This will be searchable via the ontology service.
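
A rough sketch of these axioms with rdflib, using the placeholder name
ProcessingResource for the superclass A (both the name and the exact
axioms are of course open for discussion):

    from rdflib import Graph, Namespace, BNode, RDF, RDFS, OWL
    from rdflib.collection import Collection

    OT = Namespace("http://www.opentox.org/api/1.1#")
    g = Graph()
    g.bind("ot", OT)

    # Placeholder name for the superclass A.
    A = OT.ProcessingResource
    g.add((A, RDF.type, OWL.Class))

    # Per the proposal: a Model is a special case of
    # Algorithm + Parameters.
    g.add((OT.Model, RDFS.subClassOf, A))

    # ot:hasSource points from a Feature to its origin: A or Dataset.
    union, members = BNode(), BNode()
    Collection(g, members, [A, OT.Dataset])
    g.add((union, RDF.type, OWL.Class))
    g.add((union, OWL.unionOf, members))
    g.add((OT.hasSource, RDFS.domain, OT.Feature))
    g.add((OT.hasSource, RDFS.range, union))

    print(g.serialize(format="turtle"))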

4) Compatibility between algorithms, datasets and models.
Algorithms and models have different kinds of requirements.

Input:
- All Algorithms accept a Dataset as input (it might be an empty one).
- All Models accept a Dataset as input.

Output:
- (At least) two kinds of Algorithms, according to the output
generated: a Dataset for data processing algorithms (e.g. descriptor
calculation), or a Model for model building algorithms.
- The output of all Models is a Dataset.

Input and output requirements are not yet reflected in the
AlgorithmTypes ontology.  There is an attempt to introduce ot:hasInput
and ot:hasOutput in opentox.owl, but it needs to be refined.
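
One possible shape, sketched with rdflib (only ot:hasInput and
ot:hasOutput come from opentox.owl; the algorithm types namespace and
class names are assumptions):

    from rdflib import Graph, Namespace

    OT = Namespace("http://www.opentox.org/api/1.1#")
    OTA = Namespace("http://www.opentox.org/algorithmTypes.owl#")  # assumed
    g = Graph()

    # Data processing algorithms: Dataset in, Dataset out.
    g.add((OTA.DescriptorCalculation, OT.hasInput, OT.Dataset))
    g.add((OTA.DescriptorCalculation, OT.hasOutput, OT.Dataset))

    # Model building algorithms: Dataset in, Model out.
    g.add((OTA.Learning, OT.hasInput, OT.Dataset))
    g.add((OTA.Learning, OT.hasOutput, OT.Model))

Attaching the classes directly as property values is a simplification
for readability; proper OWL restrictions would be needed if we want a
reasoner to use this.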

Independent variables
- Algorithms requiring only a chemical structure as input (e.g.
descriptor calculation, rule-based predictions like Toxtree)
- Algorithms requiring input variables obtained elsewhere (e.g. from
data services or descriptor calculation)
- Algorithms like MLR and SVM need a dataset whose variables are
declared numeric and contain numeric entries exclusively.  This could
now be handled by requiring ot:NumericDataset for this kind of
algorithm (see the sketch after this list).
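
For illustration, a client-side check along these lines (a sketch: it
assumes the dataset RDF declares the proposed ot:NumericDataset type
for the dataset itself; a real implementation should fetch only the
feature definitions, see point 5):

    from rdflib import Graph, Namespace, RDF, URIRef

    OT = Namespace("http://www.opentox.org/api/1.1#")

    def is_numeric_dataset(dataset_uri):
        # Fetch the dataset RDF (OpenTox services serve RDF/XML)
        # and test for the proposed ot:NumericDataset type.
        g = Graph()
        g.parse(dataset_uri, format="xml")
        return (URIRef(dataset_uri), RDF.type, OT.NumericDataset) in g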

Dependent variables
- Not required for some algorithms (e.g. descriptor calculation,
rule-based predictions like Toxtree)
- Not required for unsupervised algorithms (clustering)
- Classification algorithms require a nominal target variable.
- Algorithms like MLR and SVM require a numeric target variable.

Prediction results have to be stored under separate features, distinct
from the dependent variables, in order not to overwrite or mix observed
and predicted values.  Such features might be set via parameters, or
created automatically by the algorithm/model services.
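
For example (a sketch: the parameter name predicted_feature is
hypothetical, and all URIs are placeholders):

    import requests

    MODEL_URI = "http://example.org/model/42"
    DATASET_URI = "http://example.org/dataset/7"

    # Apply the model to a dataset.  predicted_feature names the
    # feature under which predictions are stored; if omitted, the
    # service would create such a feature automatically.
    r = requests.post(MODEL_URI,
                      data={"dataset_uri": DATASET_URI,
                            "predicted_feature":
                                "http://example.org/feature/LC50_predicted"},
                      headers={"Accept": "text/uri-list"})
    r.raise_for_status()
    print(r.text)  # URI of the result dataset (or of a task)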

5) Cleaning services (handling of missing values, removal of string
features, consistency checking and so on).

- Nominal variables.  Given a dataset with string variables, there is
no automatic way to recognise whether these should be nominal or plain
strings.  One rather crude heuristic is to consider a variable nominal
if the number of distinct values in the dataset is below a threshold
(see the sketch below).  A better approach is to annotate features and
link them to ontologies, where enumerated classes can represent nominal
variables.  The first approach can be adopted as a quick workaround
until the end of February.
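
The crude heuristic in a few lines (the threshold of 20 is an
arbitrary example value):

    def looks_nominal(values, threshold=20):
        # Treat a string variable as nominal if it has fewer
        # distinct non-missing values than the threshold.
        distinct = {v for v in values if v is not None}
        return len(distinct) < threshold

    # A column of ~1000 entries drawn from 3 labels is likely nominal.
    print(looks_nominal(["active", "inactive", "unknown"] * 333))  # True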

- Consistency checking: for this purpose one needs to be able to define
the data format required by an algorithm or model service and compare
it to a dataset.  This could be done at the level of opentox.owl /
algorithmtypes.owl, or at the API level.  If at the API level, we
might introduce
GET /algorithm/compatibility?dataset_uri=...

It would be better not to pass the entire dataset to the compatibility
check, but only the RDF of its features and some metadata.

A similar approach can be adopted for the model/dataset compatibility
check.
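
A client-side sketch, under the assumption that the compatibility
resource lives below the individual algorithm URI and answers 200 for
a compatible dataset (both are open design points):

    import requests

    ALGORITHM_URI = "http://example.org/algorithm/mlr"  # placeholder
    DATASET_URI = "http://example.org/dataset/7"        # placeholder

    r = requests.get(ALGORITHM_URI + "/compatibility",
                     params={"dataset_uri": DATASET_URI})
    # Assumed convention: 200 = compatible; the body could list
    # the reasons for incompatibility otherwise.
    print(r.status_code, r.text)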

- Handling of missing values and other transformation services.  I can
think of two approaches:
a) a separate service, accepting a raw dataset as input and generating
a transformed dataset according to some rules (better for flexibility;
see the sketch after this list);
b) embedded in the model generation and configurable via some options
(better for performance, e.g. the available Weka filters can be used).
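
A sketch of approach a) as a REST call (the service URI and the
strategy parameter are hypothetical):

    import requests

    CLEANING_SERVICE = "http://example.org/algorithm/missing_values"
    RAW_DATASET = "http://example.org/dataset/7"

    # POST the raw dataset URI; the service answers with the URI of
    # the transformed dataset (or of a task, if it runs asynchronously).
    r = requests.post(CLEANING_SERVICE,
                      data={"dataset_uri": RAW_DATASET,
                            "strategy": "mean"},  # e.g. impute column means
                      headers={"Accept": "text/uri-list"})
    r.raise_for_status()
    print(r.text.strip())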

We might borrow some ideas from, or completely reuse, existing
ontologies for data mining.  Some of the partners might be especially
interested in (or already aware of) KDDOnto
(http://boole.diiga.univpm.it/paper/ida09.pdf).

Data Mining Ontology (DMO): http://www.e-lico.eu/?q=dmo
OntoDM: http://kt.ijs.si/panovp/OntoDM
KDDONTO: An Ontology for Discovery and Composition of KDD Algorithms
<http://boole.diiga.univpm.it/paper/sokd09.pdf>
Ontology-driven KDD Process Composition
<http://boole.diiga.univpm.it/paper/ida09.pdf>


Best regards,
Nina



