[OTDev] descriptor recalculation

Nina Jeliazkova nina at acad.bg
Thu Apr 29 20:04:17 CEST 2010


Dear Christoph,

A few comments from my point of view. I (partially) agree; the difference,
perhaps, is that I would like to keep both the "low level" services and
the "super" services aboard, not hide everything inside "super" services.

Christoph Helma wrote:
> Dear All,
>
> I am not sure, if this helps, but here is my conceptualisation of the
> situation:
>
> From a user's point of view the situation is quite clear: he wants to
> submit a dataset and obtain predictions for new compounds:
>
> Input: training-data, compound(s)
> Output: prediction(s)
>
> Most real world users don't care how predictions are obtained (as long
> as they are correct), so at the end-user level (ToxPredict, ToxCreate)
> we should basically expose the functionality
>
> 	predict(training-dataset,compound).
>
> Although it might be possible to make predictions without explicit
> models (e.g. k-nn with graph similarities), we have decided (with good
> reason) to break the procedure into two steps in our implementation:
>
> Model-learning:
> 	Input: training-data
> 	Output: model
>
> Prediction:
> 	Input: model, compound(s)
> 	Output: prediction(s)
>
> As a ToxCreate developer (and I think Martin will agree for the
> validation service) I would like to work at this level with basically
> two operations: 
>
> 	model = training-algorithm(training-dataset,[parameters])
> 	prediction = model(compound)
>
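In OpenTox REST terms these two operations map onto plain POSTs. A
minimal Python sketch (the service and dataset URIs are invented, and the
asynchronous task handling that real services use is left out):

    import requests

    # Hypothetical URIs - stand-ins for real OpenTox services.
    ALGORITHM_URI = "http://example.org/algorithm/mlr"
    TRAINING_DATASET = "http://example.org/dataset/42"
    COMPOUND_DATASET = "http://example.org/dataset/43"

    # model = training-algorithm(training-dataset, [parameters]):
    # POST the dataset URI (plus optional parameters) to the algorithm
    # service; the response body is the URI of the new model.
    model_uri = requests.post(
        ALGORITHM_URI, data={"dataset_uri": TRAINING_DATASET}
    ).text.strip()

    # prediction = model(compound):
    # POST the compound (dataset) URI to the model; the response body
    # is the URI of a dataset holding the predictions.
    prediction_uri = requests.post(
        model_uri, data={"dataset_uri": COMPOUND_DATASET}
    ).text.strip()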
> At this level, I do not want to be bothered with implementation details
> - this is the job of the webservice developers.
>
> They might decide e.g. to implement the training-algorithm as
>
> 	descriptors = cdk-descriptors(training-dataset)
> 	pca-descriptors = pca(descriptors)
> 	mlr-model = mlr(training-dataset,pca-descriptors)
> 	model.model = mlr-model
> 	model.descriptors = pca-descriptors
>
> and the prediction as
>
> 	compound-descriptors = cdk-descriptors(compound,model.descriptors)
> 	prediction = model.model(compound-descriptors)
>
> or as
>
> 	model = svm-with-graph-kernel(training-dataset)
>
> 	prediction = model(compound)
>   
The examples are of course valid; however, these are actually not single
services but workflows of services, albeit rather simple ones.

What we have right now are actually building blocks of services, and
compositions of these are being built case by case. For example, the
mlr case might be

        training-dataset = data_preprocessing(raw-dataset)
        descriptors_dataset1 = cdk-descriptors(training-dataset)
        descriptors_dataset2 = other-descriptors(training-dataset)
        descriptors_dataset = merge(descriptors_dataset1,descriptors_dataset2)
        pca-descriptors = pca(descriptors_dataset)
        mlr-model = mlr(pca-descriptors)
        model.model = mlr-model
        model.descriptors = pca-descriptors

(note that the outcome of each calculation is again a dataset), so in
short we have

        training-dataset = data_preprocessing(raw-dataset)
        model = mlr(merge(cdk-descriptors(training-dataset),other-descriptors(training-dataset)))
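As a sketch of that chaining in Python (the service URIs and the
`dataset_uri_2` merge parameter are invented), note how each step both
consumes and produces a dataset URI, which is what makes the composition
possible:

    import requests

    PREPROCESS = "http://example.org/algorithm/preprocessing"
    CDK_DESCRIPTORS = "http://example.org/algorithm/cdk-descriptors"
    OTHER_DESCRIPTORS = "http://example.org/algorithm/other-descriptors"
    MERGE = "http://example.org/algorithm/merge"
    PCA = "http://example.org/algorithm/pca"
    MLR = "http://example.org/algorithm/mlr"

    def apply_service(service_uri, dataset_uri, **params):
        """POST a dataset URI (plus parameters) to a service and return
        the URI of the resulting dataset (or model)."""
        data = {"dataset_uri": dataset_uri, **params}
        return requests.post(service_uri, data=data).text.strip()

    raw_dataset = "http://example.org/dataset/raw"
    training = apply_service(PREPROCESS, raw_dataset)
    d1 = apply_service(CDK_DESCRIPTORS, training)
    d2 = apply_service(OTHER_DESCRIPTORS, training)
    merged = apply_service(MERGE, d1, dataset_uri_2=d2)  # invented parameter
    model = apply_service(MLR, apply_service(PCA, merged))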


There could be multiple services in the chain, and it might not be
sufficient just to submit services as parameters; there should be a way
to specify the order of processing.

However, it should be fine to wrap such a workflow in a "super" service.
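Such a "super" service could, for instance, accept the ordered list of
sub-service URIs as a parameter and fold the dataset through them, which
also answers the ordering question above (a hedged Python sketch; the
`pipeline` parameter is invented):

    import requests

    def super_algorithm(dataset_uri, pipeline):
        """Run a dataset through an ordered chain of service URIs.
        Each step POSTs the current dataset URI and yields a new one,
        so the order of processing is simply the order of the list."""
        for service_uri in pipeline:
            dataset_uri = requests.post(
                service_uri, data={"dataset_uri": dataset_uri}
            ).text.strip()
        return dataset_uri  # e.g. the final model URI

    # model_uri = super_algorithm(raw_dataset,
    #     [PREPROCESS, CDK_DESCRIPTORS, PCA, MLR])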
> The model learning task and the prediction task may utilize one or more
> algorithms (or models - the separation blurs once again), but at the
> high level I want to use only the "super" algorithms/models.
>   
For a clear separation, I would like to propose the following
interpretation:

- "algorithm" means a generic algorithm, i.e. a series of steps;
- "model" is the algorithm applied to the data, given relevant
parameters.

Thus, in the kNN example, the "algorithm" is the theoretical procedure
of finding the closest k neighbours and generating a prediction based on
their values, while the "model" is the kNN algorithm applied to a
specific data set, distance metric and equation/procedure (e.g.
averaging values).

For eager learning algorithms there should be no confusion.
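To make the proposed distinction concrete (a plain-Python illustration,
not part of the API; the class names are mine): the algorithm is a
stateless procedure, the model is that procedure bound to a data set and
parameters:

    class KNNAlgorithm:
        """The generic "algorithm": a series of steps, holding no data."""

        def train(self, dataset, k, distance):
            # Applying the algorithm to data plus parameters yields a model.
            return KNNModel(dataset, k, distance)

    class KNNModel:
        """The "model": the algorithm bound to a specific data set,
        distance metric and aggregation procedure (here: averaging)."""

        def __init__(self, dataset, k, distance):
            # dataset: list of (compound, measured_value) pairs
            self.dataset, self.k, self.distance = dataset, k, distance

        def predict(self, compound):
            # Find the k closest neighbours and average their values.
            neighbours = sorted(
                self.dataset, key=lambda pair: self.distance(pair[0], compound)
            )[: self.k]
            return sum(value for _, value in neighbours) / self.k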
> As a GUI developer I still want to have access to the underlying
> algorithms, but they can be provided as parameters (our existing API is
> quite flexible in this respect). An algorithm webservice could, e.g.,
> provide a high-level regression algorithm that allows one to choose
> descriptor calculation, feature selection and modelling algorithms by
> setting parameters (and it should document and check internally which
> algorithms play together). Future lazar versions, e.g., will have the
> facility to freely switch descriptor calculation services or to use
> datasets with biological measurements. Maybe we should add the facility
> to represent sub-algorithms in OWL-DL for "super" algorithms.
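Such a representation could look roughly like this (a sketch only:
ot:subAlgorithm is NOT in the current OpenTox ontology, it merely
illustrates what that facility might add; rdflib is used as an example
client library):

    from rdflib import Graph, Namespace, URIRef

    OT = Namespace("http://www.opentox.org/api/1.1#")

    # Hypothetical RDF for a "super" algorithm declaring its sub-algorithms.
    turtle = """
    @prefix ot: <http://www.opentox.org/api/1.1#> .
    <http://example.org/algorithm/super-mlr> a ot:Algorithm ;
        ot:subAlgorithm <http://example.org/algorithm/cdk-descriptors> ,
                        <http://example.org/algorithm/pca> ,
                        <http://example.org/algorithm/mlr> .
    """
    g = Graph()
    g.parse(data=turtle, format="turtle")
    for sub in g.objects(URIRef("http://example.org/algorithm/super-mlr"),
                         OT.subAlgorithm):
        print(sub)  # the sub-algorithm URIs a GUI could offer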
>
> According to our API the model knows about ot.Algorithm and
> ot.IndependentVariables, but it would need to know the service to
> calculate independent variables.
It does, actually - every feature (variable) has ot:hasSource, which
points to the service it has been generated by (e.g. the descriptor
calculation service) - and this is what we use in ToxPredict.
>  This could be inferred from the
> ot.Algorithm's sub-algorithms or stated explicitly. More importantly,
> the service would have to be able to call the necessary services 
Yes, it already does - get the independent variable's ot:hasSource
property, run a POST, and get the result.
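Roughly, in Python (the model and compound URIs are invented; task
handling and error checking are omitted, and rdflib/requests stand in
for whatever client library is at hand):

    import requests
    from rdflib import Graph, Namespace, URIRef

    OT = Namespace("http://www.opentox.org/api/1.1#")
    model_uri = "http://example.org/model/7"        # invented
    compound_uri = "http://example.org/dataset/99"  # invented

    # Fetch the model's RDF and collect the ot:hasSource of every
    # independent variable - these are the descriptor services to call.
    g = Graph()
    g.parse(model_uri)  # retrieves the representation over HTTP

    model = URIRef(model_uri)
    sources = {
        str(src)
        for feature in g.objects(model, OT.independentVariables)
        for src in g.objects(feature, OT.hasSource)
    }

    # POST the compound (dataset) to each source service; each call
    # returns the URI of a dataset with the calculated descriptors.
    descriptor_datasets = [
        requests.post(src, data={"dataset_uri": compound_uri}).text.strip()
        for src in sources
    ]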
> (of
> course this has to be implemented if you are using stock ML/DM tools -
> but OpenTox should be more than just wrapping existing programs in a
> REST interface). It would be a large waste of effort if every
> developer had to implement descriptor calculation separately in
> their webservice clients. 
>   
Agree.

But it would also be a waste if, e.g., descriptor calculation services
were hidden within model services and not accessible for reuse by other
models.
For example, the descriptors currently used in lazar don't seem to be
accessible for use in other services (I might just not be aware, though),
while the descriptors exposed as separate services by TUM and IDEA are.
> To sum up my personal opinion:
>
> For ToxCreate I would like to handle two high-level objects/services:
> training-algorithm (for creating models) and model (for predictions). I
> do not want to have to care about implementation details for model
> training and predictions, but would like to have access to the
> underlying algorithms through parameters. 
Or via links from the RDF representations of the related objects.
> We might need minor API
> changes for representing "super" algorithm services (i.e. algorithm
> services that call other algorithm services) and for informing the
> model service about the right descriptor calculation service.
>   
I am also comfortable with the idea of having "super" (proxy, composite)
services to encapsulate workflows, as in your examples.

BTW, "super" service sounds better than the "proxy" service I was
suggesting the other day.

Best regards,
Nina
> Best regards,
> Christoph



