[OTDev] NTUA WebServices

Nina Jeliazkova jeliazkova.nina at gmail.com
Mon Aug 23 15:17:00 CEST 2010


On Mon, Aug 23, 2010 at 3:50 PM, Martin Guetlein <
martin.guetlein at googlemail.com> wrote:

> On Mon, Aug 23, 2010 at 1:27 PM, Nina Jeliazkova
> <jeliazkova.nina at gmail.com>wrote:
>
> > Christoph,
> >
> > On Mon, Aug 23, 2010 at 12:49 PM, Christoph Helma <helma at in-silico.ch
> > >wrote:
> >
> > > Excerpts from Nina Jeliazkova's message of Fri Aug 20 23:07:22 +0200 2010:
> > >
> > > > My fault for not being clear - the superservice will not build a
> > > > model, it can only apply a model. To build a model, just POST the
> > > > dataset and prediction feature to the algorithm URI directly.
> > >
> > > OK, let's see if I understand correctly:
> > >
> > > To create a prediction model from scratch I would have to
> > >
> > > - create a dataset with structures and activities
> > > - calculate (and optionally select) descriptors using one of the
> > >   feature calculation (selection) algorithms
> > > - apply one of the modelling algorithms to create a prediction model
> > >
> > > To make predictions I would use the superservice:
> > >
> > > - create a dataset with structures to be predicted
> > > - submit the prediction dataset and the model to the superservice to
> > >  obtain a dataset with the predictions
> > >
> > > Is this correct?
> > >
> > >
> > Yes.
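> >
> > For illustration, the full sequence could look roughly like this (a
> > minimal sketch: the URIs are placeholders, and the model_uri parameter
> > name for the superservice is an assumption, not a documented name):
> >
> >   # 1) create a dataset with structures and activities
> >   curl -X POST -F "file=@training.sdf" {dataset_service_uri}
> >   # 2) calculate descriptors for the dataset
> >   curl -X POST -d dataset_uri={dataset_uri} {descriptor_algorithm_uri}
> >   # 3) build the model; prediction_feature is the dependent variable
> >   curl -X POST -d dataset_uri={dataset_uri} \
> >        -d prediction_feature={feature_uri} {model_algorithm_uri}
> >   # 4) apply the model to a query dataset via the superservice
> >   curl -X POST -d dataset_uri={query_dataset_uri} \
> >        -d model_uri={model_uri} {superservice_uri}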
> >
> >
> > > To simplify this procedure, our services provide the following
> > > convenience methods:
> > >
> > > Model creation:
> > >
> > >  curl -X POST -d dataset_uri={dataset_uri} -d feature_uri={feature_uri} \
> > >       -d feature_generation_uri={feature_generation_uri} {model_algorithm_uri}
> > >  returns a task URI for the prediction model; feature_uri specifies the
> > > dependent variable
> > >  - calls feature_generation_algorithm for the dataset
> > >  - creates a prediction model from the calculated descriptors and the
> > >    training activities (in the dataset)
> > >
> > >
> >
> > This looks like a "superservice" for model creation.
> >
> > 1) The -d dataset_uri parameter is fine.
> > 2) The -d feature_uri parameter is not documented and is not used by any
> > of the IDEA, TUM or NTUA partners, nor (AFAIK) does it appear in the API
> > documentation. Instead, the features inherent to the specified dataset
> > are used, which allows a dataset to have thousands of features.
> > 3) According to the API, the dependent variable should be passed via the
> > prediction_feature={feature_uris} parameter, not feature_uri (see the
> > wiki page for models).
> > 4) feature_generation_uri is not specified anywhere in the API. @ALL,
> > please share your opinions.
> >
> > Such a parameter essentially makes every model a "superservice" that
> > also has to take care of descriptor calculation. From the point of view
> > of modularity and task encapsulation I am not sure this is a good idea.
> > However, it could be very useful to have a "superservice" for model
> > creation which takes such parameters.
> >
>
> Hello Nina, Christoph, All,
>
> I think we had that discussion a while ago (see e.g.
> http://www.opentox.org/pipermail/development/2010/000653.html ).
>


Indeed.


> I like the idea of models and algorithms being able to handle datasets
> without features (-> Christoph's proposal).


There are several disadvantages to calculating features on the fly:

- It is not practical for any but the simplest features. For example, the
TUM implementation of the CDK descriptors can run for hours on a moderately
sized dataset (at least when we tested it before the Berlin meeting). The
only reasonable way to overcome this is to store the calculated results and
reuse them when requested. This is what we do now.

- One of the most important advantages of having a linked RDF representation
is being able to provide links between data "columns" and the procedure that
was used to generate those data. There is much talk about this currently at
the ACS RDF session in Boston (see http://egonw.github.com/acsrdf2010/).
OpenTox already has working support for this via the feature's ot:hasSource
predicate (this is how the TUM, NTUA and IDEA calculations work, and
ToxPredict makes use of it). If one does not use dereferenceable features
for descriptors/fragments and calculates everything on the fly, this
information is essentially lost.
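As a rough illustration (the URIs are placeholders; the exact RDF a service
returns will differ), dereferencing a feature shows where its values came
from:

  curl -H "Accept: application/rdf+xml" {feature_uri}
  # the returned RDF would contain, in Turtle notation, something like:
  #   <{feature_uri}> a ot:Feature ;
  #       dc:title "XLogP" ;
  #       ot:hasSource <{descriptor_algorithm_uri}> .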

Therefore I would ask the IST/ALU descriptor calculation and model services
to use a feature service (their own or an existing one). This would also
solve the problem Andreas Maunz mentioned in Oxford, the need to regenerate
fragments on each cross-validation run. It is easily solved if you create
one feature per fragment - this effectively allows any substructure to be
cached - and is how the TUM fminer service works.
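A minimal sketch of that pattern (prediction_feature is the documented
parameter name; the other names are placeholders):

  # mine fragments; each fragment in the result dataset is its own feature
  curl -X POST -d dataset_uri={dataset_uri} \
       -d prediction_feature={activity_feature_uri} {fminer_algorithm_uri}
  # a fragment feature is dereferenceable, so a later cross-validation run
  # can reuse it instead of recomputing the fragment
  curl -H "Accept: application/rdf+xml" {fragment_feature_uri}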



> But as far as I remember we
> decided to use supermodels.
>

Yes.


> Therefore, I would vote for using supermodels (and extending the
> supermodel functionality to build models).
>
>
That would amount to two different superservices - one for creating models
and one for prediction (the one that currently exists). Any other thoughts?
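A rough sketch of how the two calls might look (the model-creation
superservice does not exist yet, so its URI and the algorithm_uri parameter
name are assumptions on my part):

  # existing: prediction superservice
  curl -X POST -d dataset_uri={dataset_uri} \
       -d model_uri={model_uri} {prediction_superservice_uri}
  # possible: model-creation superservice
  curl -X POST -d dataset_uri={dataset_uri} \
       -d prediction_feature={feature_uri} \
       -d algorithm_uri={model_algorithm_uri} {model_superservice_uri}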

IMHO, the superservices should live as close as possible to the dataset
services, to avoid unnecessary data transfer.

Best regards,
Nina


> Best regards,
> Martin
>
>
>
> >
> >
> > > I think this schema is rather generic, as it allows combining
> > > arbitrary modelling algorithms with any supervised or unsupervised
> > > feature generation algorithm. Additional parameters for the
> > > modelling/feature generation algorithms will be forwarded to these
> > > services.
> > >
> > >
> > 5) There are also additional parameters that are _documented_ and
> > implemented by IDEA, TUM and NTUA, namely "dataset_service", which sets
> > the dataset service where the results should be stored (for both
> > prediction and descriptor calculation).
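> >
> > For example (an illustrative call; dataset_service is the documented
> > parameter name, the URIs are placeholders):
> >
> >   curl -X POST -d dataset_uri={dataset_uri} \
> >        -d dataset_service={dataset_service_uri} {model_uri}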
> >
> >
> > > Predictions:
> > >
> > > Predict a dataset (seems to be similar to the superservice, but is
> > > included in the model service)
> > >
> > >  curl -X POST -d dataset_uri={dataset_uri} {model_uri}
> > >  returns task URI for prediction dataset
> > >  - calls feature_generation_algorithm for dataset
> > >  - uses model to create a prediction dataset
> > >
> > > Predict a compound (convenience method without storing a dataset)
> > >
> > >  curl -X POST -d compound_uri={compound_uri} {model_uri}
> > >  returns prediction as rdf/xml or yaml
> > >  - calls feature_generation_algorithm for compound
> > >  - uses model to create a prediction for compound
> > >
> > > Do you think we should unify? I would like to keep our methods, because
> > > I find them intuitive and handy, but I can of course provide a
> > > superservice-like interface.
> > >
> >
> > I would like to keep things simple and not introduce descriptor
> > calculation facilities into models that are not aware of them.
> >
> > We do have a documented API to comply with ... of course it could be
> > modified.
> >
> > @ALL - please let us know your opinions.
> >
> > Best regards,
> > Nina
> >
> > >
> > > Best regards,
> > > Christoph
> > >
> >
> >
> >
> > --
> >
> > Dr. Nina Jeliazkova
> > Technical Manager
> > 4 A.Kanchev str.
> > IdeaConsult Ltd.
> > 1000 Sofia, Bulgaria
> > Phone: +359 886 802011
> >
>
>
>
> --
> Dipl-Inf. Martin Gütlein
> Phone:
> +49 (0)761 203 8442 (office)
> +49 (0)177 623 9499 (mobile)
> Email:
> guetlein at informatik.uni-freiburg.de
>



-- 

Dr. Nina Jeliazkova
Technical Manager
4 A.Kanchev str.
IdeaConsult Ltd.
1000 Sofia, Bulgaria
Phone: +359 886 802011


