[OTDev] descriptor recalculation

Christoph Helma helma at in-silico.ch
Thu Apr 29 23:33:21 CEST 2010


Excerpts from Nina Jeliazkova's message of Thu Apr 29 20:04:17 +0200 2010:
> Dear Christoph,
> 
> A few comments from my point of view. I (partially) agree; the difference
> perhaps is that I would like to keep both the "low level" services and
> the "super" services aboard, not hiding everything inside "super" services.

I agree that the "low level" services should still be available to
everyone - that would make it much easier to create innovative "super"
services, and it is also necessary for tasks other than creating
models and predictions.

> The examples are of course valid; however, these are actually not single
> services but workflows of services - rather simple ones.
> 
> What we have right now are actually building blocks of services, and
> compositions of these are being built case by case. For example, the
> mlr case might be
> 
>         training-dataset = data_preprocessing(raw-dataset)
>         descriptors_dataset1 = cdk-descriptors(training-dataset)
>         descriptors_dataset2 = other-descriptors(training-dataset)
>         descriptors_dataset = merge(descriptors_dataset1, descriptors_dataset2)
>         pca-descriptors = pca(descriptors_dataset)
>         mlr-model = mlr(pca-descriptors)
>         model.model = mlr-model
>         model.descriptors = pca-descriptors
> 
> (Note that the outcomes of calculations are datasets again.) So in short
> we have:
> 
>         training-dataset = data_preprocessing(raw-dataset)
>         model = mlr(pca(merge(cdk-descriptors(training-dataset),
>                               other-descriptors(training-dataset))))
> 
> 
> There could be multiple services in the chain, and it might not be
> sufficient just to submit services as parameters; there should be a way
> to specify the order of the processing.

Agreed. Any ideas on how to represent sequences?
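
One naive option: let the "super" service accept an ordered list of
algorithm URIs and execute them in turn. A rough client-side sketch in
Python (dataset_uri follows the usual algorithm API; the step_N
parameters and all URIs are made up, nothing like this is in the API yet):

    import urllib.parse
    import urllib.request

    def run_chain(super_service_uri, dataset_uri, step_uris):
        """POST a dataset plus an ordered list of algorithm URIs."""
        params = [("dataset_uri", dataset_uri)]
        # encode the processing order in the parameter names
        for i, uri in enumerate(step_uris, 1):
            params.append(("step_%d" % i, uri))
        data = urllib.parse.urlencode(params).encode("ascii")
        with urllib.request.urlopen(super_service_uri, data) as response:
            return response.read().decode()  # e.g. a task or dataset URI

    run_chain("http://example.org/algorithm/super-mlr",        # placeholder
              "http://example.org/dataset/42",                 # placeholder
              ["http://example.org/algorithm/cdk-descriptors",
               "http://example.org/algorithm/pca",
               "http://example.org/algorithm/mlr"])
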
> 
> However, it should be fine to wrap such a workflow in a "super" service.
> > The model learning task and the prediction task may utilize one or more
> > algorithms (or models - the separation blurs once again), but at the
> > high level I want to use only the "super" algorithms/models.
> >   
> For a clear separation, I would like to propose the following
> interpretation:
> 
> - "algorithm" means a generic "algorithm", or series of steps
> - "model" is the "algorithm" applied to the data, given relevant
> parameters.
> 
> Thus, in the kNN example, the "algorithm" is the theoretical procedure of
> finding the k closest neighbors and generating a prediction based on their
> values, while the "model" is the kNN algorithm applied to a specific data
> set, distance metric and equation/procedure (e.g. averaging values).
> 
> For eager learning algorithms there should not be confusion.

Thanks for the clarification - I hope I memorize it this time ;-)
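
To restate the distinction in code for my own memory (an illustration
only, not a proposed API):

    def knn_algorithm(training_data, k, distance, combine):
        """The generic kNN "algorithm": a recipe with open parameters."""
        def knn_model(query):
            """The "model": the algorithm bound to data and parameters."""
            neighbours = sorted(training_data,
                                key=lambda pair: distance(pair[0], query))[:k]
            return combine([activity for _, activity in neighbours])
        return knn_model

    # Binding the algorithm to a data set, a distance metric and an
    # averaging procedure yields the "model":
    data = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 4.0)]
    euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    mean = lambda ys: sum(ys) / len(ys)
    model = knn_algorithm(data, k=2, distance=euclid, combine=mean)
    print(model((0.5,)))  # prediction from the two nearest neighbours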

> > As a GUI developer I still want to have access to the underlying
> > algorithms, but they can be provided as parameters (our existing API is
> > quite flexible in this respect). An algorithm webservice could, e.g.,
> > provide a high-level regression algorithm that allows one to choose
> > descriptor calculation, feature selection and modelling parameters by
> > setting parameters (and it should document and check internally which
> > algorithms play together). Future lazar versions will, e.g., have the
> > facility to freely switch descriptor calculation services or use
> > datasets with biological measurements. Maybe we should add the facility
> > to represent sub-algorithms in OWL-DL for "super" algorithms.
> >
> > According to our API the model knows about ot:Algorithm and
> > ot:IndependentVariables, but it would need to know the service to
> > calculate independent variables.
> It does actually - every feature (variable) has ot:hasSource, which
> points to the service it was generated by (e.g. the descriptor
> calculation one) - and this is what we use in ToxPredict.

True, but that makes sense only for "simple" descriptor calculation
algorithms (i.e. descriptors that are independent of the training
activities, like phys-chem properties or substructures). If we use e.g.
supervised graph mining techniques, we need:

(i) an algorithm (a model, because it is an algorithm applied to data?)
that mines features in the training dataset and creates a feature
dataset (e.g. fminer)

(ii) a simple substructure matching algorithm that determines whether
the mined features are present in the compound to be predicted (e.g.
the OpenBabel SMARTS matcher)

My interpretation was that ot:hasSource should point to the graph
mining algorithm, but the model would need the substructure matcher for
predictions. How should we handle this?
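
One possibility: let the model point to both services explicitly -
ot:hasSource for the mined features, plus an additional (yet to be
agreed on) link to the matcher. A rough sketch of what a prediction
service would then do (the "matching_algorithm" link and all URIs and
dictionary keys are invented):

    import urllib.parse
    import urllib.request

    def post(uri, **params):
        """Minimal OpenTox-style POST helper."""
        data = urllib.parse.urlencode(params).encode("ascii")
        with urllib.request.urlopen(uri, data) as response:
            return response.read().decode()

    def predict(model, compound_uri):
        # (i) features were mined from the training data (e.g. fminer);
        # ot:hasSource on each feature points back to that algorithm.
        # (ii) for a new compound only substructure matching is needed,
        # so the model also has to know the matcher service:
        matched = post(model["matching_algorithm"],  # e.g. SMARTS matcher
                       compound_uri=compound_uri,
                       feature_dataset_uri=model["feature_dataset"])
        return post(model["uri"], dataset_uri=matched)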

> > This could be inferred from the
> > ot:Algorithm's sub-algorithms or stated explicitly. More importantly,
> > the service would have to be able to call the necessary services 
> Yes, it already does - get the independent variable's ot:hasSource
> property, run a POST and get the result.
> > (of
> > course this has to be implemented if you are using stock ML/DM tools -
> > but OpenTox should be more than just wrapping existing programs in a
> > REST interface). It would be a large waste of effort if every
> > developer had to implement descriptor calculation separately in
> > their webservice clients.
> >   
> Agree.
> 
> But it would also be a waste if, e.g., descriptor calculation services
> were hidden within model services and not accessible for reuse by
> other models.

I do not want to hide them completely - on the contrary, everyone should
be able to mash up descriptor calculation/selection and model learning
services. I was just arguing in favor of better encapsulation that
makes programming/experimenting easier (e.g. models that accept
structures instead of descriptor sets as input for predictions).
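
E.g. a "super" model could resolve ot:hasSource itself, so that clients
only submit a structure. A sketch (parameter names and dictionary keys
are placeholders, not the current API):

    import urllib.parse
    import urllib.request

    def post(uri, **params):
        """Minimal OpenTox-style POST helper."""
        data = urllib.parse.urlencode(params).encode("ascii")
        with urllib.request.urlopen(uri, data) as response:
            return response.read().decode()

    def predict_structure(model, compound_uri):
        """Predict from a bare structure; descriptors handled internally."""
        # 1. where do the independent variables come from? (ot:hasSource)
        descriptor_service = model["independent_variables_source"]
        # 2. calculate those descriptors for the query compound
        descriptor_dataset = post(descriptor_service, dataset_uri=compound_uri)
        # 3. feed the resulting dataset into the actual model service
        return post(model["uri"], dataset_uri=descriptor_dataset)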

> For example, the descriptors currently used in lazar don't seem to be
> accessible for use in other services (though I might just not be aware
> of it), while the descriptors exposed as separate services by TUM and
> IDEA are.

Descriptor calculation is currently performed by fminer, which is
available (and used by lazar) as an independent standalone service:

http://webservices.in-silico.ch/algorithm/fminer

It can easily be exchanged for other descriptor services. On the other
hand, Andreas uses e.g. fminer/last features to create SVM models.
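
Calling it directly is a plain algorithm API POST, e.g. (sketch;
dataset_uri follows the generic algorithm API - please check the service
documentation for the exact parameters; the dataset URI is a placeholder):

    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode(
        {"dataset_uri": "http://example.org/dataset/42"}).encode("ascii")
    with urllib.request.urlopen(
            "http://webservices.in-silico.ch/algorithm/fminer",
            params) as response:
        print(response.read().decode())  # URI of the mined feature dataset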

> > To sum up my personal opinion:
> >
> > For ToxCreate I would like to handle two high-level objects/services:
> > training-algorithm (for creating models) and model (for predictions). I
> > do not want to have to care about implementation details for model
> > training and predictions, but would like to have access to the
> > underlying algorithms through parameters. 
> Or via links from the RDF representation of related objects.

That might also be a possibility for representing sequences of steps.
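
Something like an ordered RDF collection attached to the "super"
algorithm, perhaps (rdflib sketch; the ot:sequence property is invented
and the algorithm URIs are placeholders):

    from rdflib import Graph, Namespace, URIRef, BNode
    from rdflib.collection import Collection

    OT = Namespace("http://www.opentox.org/api/1.1#")
    g = Graph()
    super_alg = URIRef("http://example.org/algorithm/super-mlr")
    steps = BNode()
    # rdf:List preserves the processing order of the sub-algorithms
    Collection(g, steps, [
        URIRef("http://example.org/algorithm/cdk-descriptors"),
        URIRef("http://example.org/algorithm/pca"),
        URIRef("http://example.org/algorithm/mlr"),
    ])
    g.add((super_alg, OT["sequence"], steps))  # invented property
    print(g.serialize(format="turtle"))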

> > We might need minor API
> > changes for representing "super" algorithm services (i.e. algorithm
> > services that call other algorithm services) and for informing the
> > model service about the right descriptor calculation service.
> >   
> I am also comfortable with the idea of having "super" (proxy, composite)
> services to encapsulate workflows, like in your examples.
> 
> BTW, "super" service sounds better than the "proxy" service I was
> suggesting the other day.

Thanks for the compliment - it was the first thing that came to mind.

Best regards,
Christoph


