[OTDev] NTUA WebServices

Nina Jeliazkova jeliazkova.nina at gmail.com
Mon Aug 23 18:01:25 CEST 2010


On Mon, Aug 23, 2010 at 6:51 PM, Christoph Helma <helma at in-silico.ch> wrote:

> Excerpts from Nina Jeliazkova's message of Mon Aug 23 15:17:00 +0200 2010:
> > > > This looks like a "superservice" for model creation.
> > > >
> > > > 1) The -d dataset_uri parameter is fine.
> > > > 2) The -d feature_uri parameter is not documented and is not used by
> > > > any of the IDEA, TUM or NTUA partners, nor (AFAIK) in the API
> > > > documentation. Instead, what is used are the features inherent to
> > > > the specified dataset. This allows a dataset to have thousands of
> > > > features.
> > > > 3) The dependent variable, according to the API, should go under the
> > > > prediction_feature={featureuris} parameter, not feature_uri (see the
> > > > wiki page for models).
>
> Sorry, this was a cut+paste error from an outdated README. It is indeed
> prediction_feature in our services.
>
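For concreteness, a model-building call following the API would then look
like the sketch below (host, algorithm and dataset/feature URIs are
placeholders, not actual services):

    import requests

    # Minimal sketch of a model-building POST per the OpenTox API.
    # All URIs below are illustrative placeholders.
    response = requests.post(
        "http://example.org/algorithm/someLearner",
        data={
            "dataset_uri": "http://example.org/dataset/1",
            # the dependent variable goes in prediction_feature, not feature_uri
            "prediction_feature": "http://example.org/feature/endpoint",
        },
    )
    print(response.text.strip())  # typically the created model (or task) URI
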
> > > I like the idea of models and algorithms being able to handle
> > > datasets without features (-> Christoph's proposal).
> >
> >
> > There are several disadvantages to calculating features on the fly:
> >
> > - It is not practical for any but the simplest features. For example,
> > the TUM implementation of CDK descriptors can run for hours on a
> > moderately sized dataset (at least when we tested it before the Berlin
> > meeting). The only reasonable way to overcome this is to store the
> > calculated results and reuse them when requested. This is what we do
> > now.
> >
> > - One of the most important advantages of having a linked RDF
> > representation is being able to provide links between data "columns"
> > and the procedure that was used to generate that data. There is much
> > talk about this currently at the ACS RDF session in Boston (see
> > http://egonw.github.com/acsrdf2010/). OpenTox already has working
> > support for this via the ot:hasSource predicate on features (this is
> > how the TUM, NTUA and IDEA calculations work, and ToxPredict makes use
> > of it). If one is not using dereferenceable features for
> > descriptors/fragments and calculates everything on the fly, this
> > information is essentially lost.
> >
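As an aside, a minimal rdflib sketch of what such a link looks like; the
feature and algorithm URIs below are made up, only the ot:hasSource
predicate comes from the OpenTox ontology:

    from rdflib import Graph, Namespace, URIRef

    # Link a feature to the procedure that generated it
    # (URIs are illustrative placeholders, not real services).
    OT = Namespace("http://www.opentox.org/api/1.1#")
    g = Graph()
    feature = URIRef("http://example.org/feature/42")
    algorithm = URIRef("http://example.org/algorithm/cdk-descriptor")
    g.add((feature, OT.hasSource, algorithm))
    print(g.serialize(format="turtle"))
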
> > Therefore I would ask the IST/ALU descriptor calculation and model
> > services to use a feature service (their own or an existing one). This
> > will also solve the problem Andreas Maunz mentioned in Oxford about the
> > need to generate fragments on each cross-validation run. It is easily
> > solved if you create one feature per fragment - this effectively allows
> > any substructure to be cached - and is how the TUM fminer works.
>
> I strongly disagree for the general case, which may include _supervised_
> feature mining and feature selection (which is the case for BBRC and
> LAST features). Storing features from supervised algorithms and reusing
> them for cross-validation will lead to wrong (i.e. too optimistic)
> results, because information from test compounds has already been used
> to create the feature dataset.
>
>
But I am not advocating reusing the same features from the selection phase
in cross-validation - if they are different fragments, they will be stored
under different feature URIs - so there is nothing controversial here.
Think of storing simply as caching descriptor results!

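To illustrate "storing as caching" - a toy sketch with a made-up URI
scheme (a real service would POST to a feature service instead):

    # Each distinct fragment gets its own feature URI; identical fragments
    # are computed and stored only once. The URI scheme is illustrative.
    feature_cache = {}

    def feature_uri_for(fragment_smarts):
        if fragment_smarts not in feature_cache:
            feature_cache[fragment_smarts] = (
                "http://example.org/feature/%d" % (len(feature_cache) + 1)
            )
        return feature_cache[fragment_smarts]

    print(feature_uri_for("CCCCCN"))    # new feature URI, stored once
    print(feature_uri_for("CCCCCN"))    # same URI, served from the cache
    print(feature_uri_for("c1ccccc1"))  # different fragment, different URI
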


> We should make such mistakes impossible in our framework (this was also
> - in my understanding - a main motivation for a separate validation
> service). For this reason descriptors (at least from supervised
> algorithms) _have_ to be calculated on the fly for each validation fold.
>
> Caching results of _unsupervised_ algorithms is of course ok, but this
> is IMHO an implementation/optimisation detail of descriptor calculation
> services. Service developers have to decide if caching is allowed or not
> (assuming that they know to distinguish between supervised and
> unsupervised algorithms ;-)). This decision should not be delegated to
> client services (e.g. validation, ToxCreate) who do not know the
> algorithmic details.
>
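(Aside: the leakage effect itself is easy to demonstrate. A toy numpy
sketch, illustration only and not any partner's code - the labels are pure
noise, yet selecting fragments on the full dataset before cross-validation
produces above-chance "accuracy":

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 1000))  # random fragment presence/absence
    y = rng.integers(0, 2, size=100)          # random activities: no signal

    def cv_accuracy(select_on_full):
        idx = np.arange(len(y))
        accs = []
        for fold in np.array_split(idx, 5):
            train = np.setdiff1d(idx, fold)
            sel = idx if select_on_full else train  # where selection may look
            corr = np.abs(np.corrcoef(X[sel].T, y[sel])[:-1, -1])
            best = np.argsort(corr)[-10:]  # keep the 10 "best" fragments
            # 1-nearest-neighbour prediction on the selected fragments
            preds = [y[train][np.argmin(
                         np.abs(X[train][:, best] - X[i, best]).sum(axis=1))]
                     for i in fold]
            accs.append(np.mean(np.array(preds) == y[fold]))
        return np.mean(accs)

    print(cv_accuracy(select_on_full=True))   # typically well above 0.5
    print(cv_accuracy(select_on_full=False))  # honest estimate, near 0.5

)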


I don't really understand - calculating descriptors, fragments or whatever
entities from the molecules themselves is not related in any way to the
learning algorithms used in a later phase of modelling. If there is a
fragment, e.g. CCCCCN, it is the same fragment regardless of whether it is
used for clustering or regression. What I am saying is that the presence
or absence of the fragment CCCCCN is stored under a (dereferenceable)
feature URI, which can then be reused by ANY algorithm that needs to check
for that particular fragment - the result may be cached once, so it is not
necessary to run the calculation ten times to get the same result.

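A small RDKit sketch of the point (RDKit is used purely for illustration):
whether CCCCCN occurs in a molecule is a property of the molecule alone,
so the result can be computed once and reused by any algorithm:

    from rdkit import Chem

    fragment = Chem.MolFromSmarts("CCCCCN")
    presence_cache = {}  # SMILES -> cached presence/absence

    def has_fragment(smiles):
        if smiles not in presence_cache:
            mol = Chem.MolFromSmiles(smiles)
            presence_cache[smiles] = mol.HasSubstructMatch(fragment)
        return presence_cache[smiles]

    print(has_fragment("CCCCCNC(=O)C"))  # True - cached for any later caller
    print(has_fragment("c1ccccc1"))      # False
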
Best regards,

Nina


>
> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>



-- 

Dr. Nina Jeliazkova
Technical Manager
IdeaConsult Ltd.
4 A.Kanchev str.
1000 Sofia, Bulgaria
Phone: +359 886 802011


