[OTDev] descriptor recalculation

Christoph Helma helma at in-silico.ch
Thu Apr 29 16:18:23 CEST 2010


Dear All,

I am not sure if this helps, but here is my conceptualisation of the
situation:

From a user's point of view the situation is quite clear: he wants to
submit a dataset and obtain predictions for new compounds:

Input: training-data, compound(s)
Output: prediction(s)

Most real-world users don't care how predictions are obtained (as long
as they are correct), so at the end-user level (ToxPredict, ToxCreate)
we should basically expose the functionality

	predict(training-dataset,compound).

Although it might be possible to make predictions without explicit
models (e.g. k-nn with graph similarities), we have decided (with good
reason) to break the procedure into two steps in our implementation:

Model-learning:
	Input: training-data
	Output: model

Prediction:
	Input: model, compound(s)
	Output: prediction(s)

As a ToxCreate developer (and I think Martin will agree for the
validation service) I would like to work at this level with basically
two operations: 

	model = training-algorithm(training-dataset,[parameters])
	prediction = model(compound)

At this level, I do not want to be bothered with implementation details
- this is the job of the webservice developers.
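
To illustrate, here is a minimal Python-style sketch of this
two-operation interface (our services actually speak REST, and the
deliberately trivial majority-class "model" and all names below are
illustrative only, not part of the API):

	class Model:
	    """Wraps a trained model; calling it returns a prediction."""
	    def __init__(self, predict_fn):
	        self._predict_fn = predict_fn
	    def __call__(self, compound):
	        return self._predict_fn(compound)

	def training_algorithm(training_dataset, parameters=None):
	    """Dummy training algorithm: always predicts the majority class."""
	    labels = [label for _, label in training_dataset]
	    majority = max(set(labels), key=labels.count)
	    return Model(lambda compound: majority)

	model = training_algorithm([("CCO", "inactive"),
	                            ("c1ccccc1", "active"),
	                            ("CC=O", "active")])
	prediction = model("CCCC")  # -> "active"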

They might decide, for example, to implement the training-algorithm as

	descriptors = cdk-descriptors(training-dataset)
	pca-descriptors = pca(descriptors)
	mlr-model = mlr(training-dataset,pca-descriptors)
	model.model = mlr-model
	model.descriptors = pca-descriptors

and the prediction as

	compound-descriptors = cdk-descriptors(compound,model.descriptors)
	prediction = model.model(compound-descriptors)

or as

	model = svm-with-graph-kernel(training-dataset)

	prediction = model(compound)
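
To make the first variant a bit more concrete, here is a rough Python
sketch (using scikit-learn for PCA and MLR; cdk_descriptors is an
illustrative stand-in for a real CDK descriptor calculation, and the
data is made up). The essential point is that the stored model carries
the fitted descriptor transformation along with the regression model,
so that prediction can reapply it:

	import numpy as np
	from sklearn.decomposition import PCA
	from sklearn.linear_model import LinearRegression
	from sklearn.pipeline import make_pipeline

	def cdk_descriptors(compounds):
	    # Illustrative stand-in for a CDK descriptor calculation
	    # service; a real implementation would call the CDK itself
	    # (or its webservice).
	    return np.random.rand(len(compounds), 10)

	compounds = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
	activities = np.array([0.2, 1.3, 0.7, 0.5])

	# Training: PCA feature selection, then multiple linear regression.
	# The pipeline plays the role of "model": it holds both
	# model.descriptors (the fitted PCA) and model.model (the MLR).
	model = make_pipeline(PCA(n_components=2), LinearRegression())
	model.fit(cdk_descriptors(compounds), activities)

	# Prediction: descriptors for the new compound are recalculated
	# and pushed through the same stored PCA before the MLR is applied.
	prediction = model.predict(cdk_descriptors(["CCCC"]))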

The model learning task and the prediction task may utilize one or more
algorithms (or models - the separation blurs once again), but at the
high level I want to use only the "super" algorithms/models.

As a GUI developer I still want to have access to the underlying
algorithms, but they can be provided as parameters (our existing API is
quite flexible in this respect). An algorithm webservice could, for
example, provide a high-level regression algorithm that lets clients
choose descriptor calculation, feature selection and modelling methods
by setting parameters (and it should document, and check internally,
which algorithms play together). Future lazar versions, for example,
will be able to freely switch descriptor calculation services or to use
datasets with biological measurements. Maybe we should add the facility
to represent sub-algorithms in OWL-DL for "super" algorithms.
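
A client call to such a "super" algorithm service might then look
roughly like this (a sketch only: all URIs and parameter names below
are invented for the example, they are not part of the current API):

	import requests

	# Train a model via a hypothetical high-level regression service;
	# descriptor calculation, feature selection and modelling method
	# are chosen via parameters pointing at sub-algorithm services.
	response = requests.post(
	    "http://ot.example.org/algorithm/super-regression",
	    data={"dataset_uri": "http://ot.example.org/dataset/42",
	          "descriptor_algorithm_uri":
	              "http://ot.example.org/algorithm/cdk-descriptors",
	          "feature_selection_uri":
	              "http://ot.example.org/algorithm/pca",
	          "model_algorithm_uri":
	              "http://ot.example.org/algorithm/mlr"})
	# OpenTox-style services return the URI of the created resource
	model_uri = response.text.strip()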

According to our API the model knows about ot.Algorithm and
ot.IndependentVariables, but it would also need to know the service for
calculating independent variables. This could be inferred from the
ot.Algorithm's sub-algorithms or stated explicitly. More importantly,
the model service would have to be able to call the necessary
descriptor calculation services (of course this has to be implemented
if you are using stock ML/DM tools - but OpenTox should be more than
just wrapping existing programs in a REST interface). It would be a
great waste of effort if every developer had to implement descriptor
calculation separately in their webservice clients.
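
Sketched from the model service's side, the prediction flow I have in
mind would be something like the following (again, the paths and
parameter names are invented; in reality the descriptor service would
be inferred from ot.IndependentVariables or from the sub-algorithms):

	import requests

	def predict(model_uri, compound_uri):
	    # Look up the descriptor calculation algorithm recorded with
	    # the model (assumed here to be exposed as a plain URI; it
	    # could equally be read from the model's RDF representation).
	    descriptor_algorithm_uri = requests.get(
	        model_uri + "/descriptor_algorithm",
	        headers={"Accept": "text/uri-list"}).text.strip()

	    # Have the descriptor service calculate the model's
	    # independent variables for the new compound ...
	    dataset_uri = requests.post(
	        descriptor_algorithm_uri,
	        data={"compound_uri": compound_uri}).text.strip()

	    # ... and apply the stored model to the resulting dataset.
	    return requests.post(
	        model_uri, data={"dataset_uri": dataset_uri}).text.strip()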

To sum up my personal opinion:

For ToxCreate I would like to handle two high-level objects/services:
training-algorithm (for creating models) and model (for predictions). I
do not want to have to care about implementation details for model
training and predictions, but would like to have access to the
underlying algorithms through parameters. We might need minor API
changes for representing "super" algorithm services (i.e. algorithm
services that call other algorithm services) and for informing the
model service about the right descriptor calculation service.

Best regards,
Christoph


