[OTDev] Wrapper/Super Services and Descriptor Calculation (please ignore previous post)

Christoph Helma helma at in-silico.ch
Thu Feb 3 15:31:46 CET 2011


Hi all,

Sorry for the incomplete post before - I sent it by accident.
> > As a client I do not care if the "Supermodel" is a one trick pony (with
> > hardcoded sub-algorithms) or a generic workflow system as in your
> > proposal, as long as it creates a prediction model from a training
> > dataset. For this reason there will be no generic "Superalgorithm"
> > interface; model parameters and usage will have to be documented by the
> > service developers.
> >
> 
> OK, this actually leaves the door open for different implementations -
> either a "black box" superalgorithm with no internal details exposed, or a
> more generic superalgorithm with configurable algorithms.

Yes, that was the intention.
> 
> 
> >
> > Models:
> >
> > For the ToxCreate and model validation use cases we need models that
> >  - take chemical structure(s) (without additional information) as input and
> >  - create a prediction dataset as output
> >  - are *immutable*, i.e. there should be no possibility to modify models
> > once they are created (everything else would invalidate validation results
> > and would open possibilities for cheating)
> >
> OK
> 
> 
> > A model can use a variety of algorithms (internal or through
> > webservices); it might use other models (e.g. consensus models) or
> > datasets (instance based predictions).  But as a client I do not want to
> > be bothered with these details (we store references to algorithms and
> > datasets in the model representation, but YMMV).
> 
> 
> Most QSAR developers/users would like to know details though.

Which can be delivered in the model representation.

> > All I need is a straightforward
> > interface with compound(s) as input and a dataset as output. Can we
> > agree on this interface for API 1.2?
> >
> 
> This is of course useful and we can agree this is the minimal requirement
> for a supermodel for API 1.2.
> 
> In addition I would prefer that the algorithm is transparent about what it is
> doing (well, to a certain extent), keeping track of which algorithms from
> the OpenTox API are used internally and making intermediate calculations
> addressable via URIs (e.g. descriptors). This will definitely help at
> least in generating QMRF reports.

I think this information should be part of the model's metadata; we
would still have to decide how to represent this information.
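
Independent of how we represent that metadata, the minimal client-facing
interface agreed above could look roughly like this. A hedged sketch in
Python (all URIs are placeholders; the dataset_uri form parameter and the
text/uri-list response follow the usual OpenTox POST-to-model pattern,
but an actual service may differ):

import requests

# Placeholders - substitute real service URIs
model_uri = "http://example.org/model/123"
dataset_uri = "http://example.org/dataset/456"   # compounds to be predicted

# Minimal "supermodel" interface: compounds in, prediction dataset out.
# Whatever algorithms/sub-models run behind the model URI stay hidden.
response = requests.post(model_uri,
                         data={"dataset_uri": dataset_uri},
                         headers={"Accept": "text/uri-list"})
response.raise_for_status()
prediction_dataset_uri = response.text.strip()
print(prediction_dataset_uri)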

> 
> Lazar may not be the most generic example for such a workflow, as AFAIK only
> one feature generation algorithm is used per Lazar model.

This is no longer true, the most recent lazar version has a fixed
(read-across like) prediction workflow:

search for similar compounds -> create local model with similar compounds -> predict query compound with local model

But algorithms for similarity calculation (which includes feature
calculation) and (local) model creation are freely configurable during
model creation. So you can use e.g. Euclidean similarity with phys/chem
properties for similarity calculations and ANNs for local regression
models.
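
A rough Python sketch of that fixed workflow, with the similarity and
local model algorithms passed in as configurable functions (all names
below are made up for illustration, not actual lazar code):

# Fixed read-across workflow with pluggable algorithms.
# similarity() and build_local_model() stand for whatever was chosen at
# model creation time (e.g. Euclidean similarity on phys/chem properties,
# an ANN for local regression).
def predict(query_compound, training_data, similarity, build_local_model,
            threshold=0.3):
    # 1. search for similar compounds in the training data
    neighbors = [(compound, activity) for compound, activity in training_data
                 if similarity(query_compound, compound) >= threshold]
    if not neighbors:
        return None  # no neighbors, no prediction
    # 2. create a local model from the similar compounds
    local_model = build_local_model(neighbors)
    # 3. predict the query compound with the local model
    return local_model(query_compound)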

> In contrast, in descriptor-based QSARs there might be several descriptor
> calculation algorithms involved, as well as preprocessing algorithms, and it
> is important to keep track of these.

> 
> > Pantelis: Your proposal seems to be focused on a generic (linear)
> > workflow implementation.
> 
> 
> 
> > While it would be worthwhile to have such an
> > implementation, I do not think we have to specify workflow systems at the
> > API level.
> > (BTW: Parallel workflows (e.g. for creating consensus models) and
> > generic DAG workflows
> 
> 
> As the proposal actually describes the "materialized" run that resulted in a
> model, not the workflow description, it covers DAG workflows as well, as a
> single path within a directed acyclic graph is a linear one.

But it would not work e.g. for consensus models, where several submodels
are built in parallel.

> (for experimental/data analysis that
> > involves merging, splitting) could also be interesting).
> >
> >
> Well, to be generic, workflows may not only be unidirectional, but contain
> loops, forks, joins, etc., but this will lead us into the land of workflow
> languages, which is better left to specific client implementations.

But it would still be a directed graph, as no one starts with the end
results (well, some people try to start with desired results and work
back to the experimental setup, but we should not support that ;-)).
I am not sure if cycles make sense in our context.

> 
> To summarize, my preference would be, regardless of the superalgorithm
> implementation, to keep track of which algorithms have been used (e.g. to
> calculate features or transform data in any way - via ot:hasSource, or via
> new properties, if necessary).

Agreed. Do you have a proposal? Having just a list of algorithms/models
would be straightforward, but it does not represent the complete
workflow. If we want to represent the complete workflow, I think we will
need a directed-graph-like representation (e.g. algorithms/models linked
by input/output datasets) to cater for parallel and serial workflows.
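
As a strawman for discussion, a hedged Python sketch of such a
representation (all URIs invented): every algorithm/model application
becomes a node that records its input and output dataset URIs, so
parallel branches (e.g. consensus submodels) and serial steps both fall
out of the same structure.

# Each entry records the algorithm/model applied, its input dataset(s)
# and its output dataset. A step whose input is another step's output
# ran after it; steps sharing the same input ran in parallel.
workflow = [
    {"algorithm": "http://example.org/algorithm/descriptor-calculation",
     "inputs":  ["http://example.org/dataset/training"],
     "output":  "http://example.org/dataset/descriptors"},
    {"algorithm": "http://example.org/algorithm/mlr",        # submodel 1
     "inputs":  ["http://example.org/dataset/descriptors"],
     "output":  "http://example.org/dataset/mlr-predictions"},
    {"algorithm": "http://example.org/algorithm/ann",        # submodel 2, parallel to 1
     "inputs":  ["http://example.org/dataset/descriptors"],
     "output":  "http://example.org/dataset/ann-predictions"},
    {"algorithm": "http://example.org/algorithm/consensus",  # merges the branches
     "inputs":  ["http://example.org/dataset/mlr-predictions",
                 "http://example.org/dataset/ann-predictions"],
     "output":  "http://example.org/dataset/consensus-predictions"},
]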

Best regards,
Christoph


