[OTDev] NTUA WebServices

Christoph Helma helma at in-silico.ch
Mon Aug 23 17:51:47 CEST 2010


Excerpts from Nina Jeliazkova's message of Mon Aug 23 15:17:00 +0200 2010:
> > > This looks like "superservice" for model creation.
> > >
> > > 1) -d dataset_uri parameter is fine
> > > 2) -d feature_uri parameter is not documented and not used by any of the
> > > IDEA, TUM or NTUA partners, nor (AFAIK) in the API documentation.
> > > Instead, what is used are the features, which are inherent to the
> > > specified dataset. This allows for thousands of features.
> > > 3) The dependent variable, according to the API, should be passed as the
> > > prediction_feature={featureuris} parameter, not feature_uri (see the wiki
> > > page for models).

Sorry, this was a cut+paste error from an outdated README. It is indeed
prediction_feature in our services.
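
For our services, a model creation call looks roughly like the sketch below
(the URIs and the use of the requests library are purely illustrative, not a
prescription for anyone's implementation):

  # Hypothetical example (invented URIs): POST the training dataset and the
  # dependent variable to an algorithm service; the response body contains
  # the URI of the new model (or of a task that will produce it).
  import requests

  response = requests.post(
      "http://example.org/algorithm/lazar",
      data={
          "dataset_uri": "http://example.org/dataset/123",
          "prediction_feature": "http://example.org/dataset/123/feature/endpoint",
      },
  )
  print(response.text)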

> > I like the idea of models and algorithms being able to handle datasets
> > without features (-> Christoph's proposal).
> 
> 
> There are several disadvantages to calculating features on the fly:
> 
> - This is not practical for any but the simplest features. For example, the
> TUM implementation of CDK descriptors can run for hours on moderately sized
> datasets (at least when we tested it before the Berlin meeting). The only
> reasonable way to overcome this is to store the calculated results and reuse
> them when requested. This is what we do now.
> 
> - One of the most important advantages of having a linked RDF representation
> is being able to provide links between data "columns" and the procedure
> that was used to generate that data. There is much talk about this currently
> at the ACS RDF session in Boston (see http://egonw.github.com/acsrdf2010/).
> OpenTox already has working support for this via the features' ot:hasSource
> predicate (this is how the TUM, NTUA and IDEA calculations work, and
> ToxPredict makes use of it). If one is not using dereferenceable features
> for descriptors/fragments and calculates everything on the fly, this
> information is essentially lost.
> 
> Therefore I would ask the IST/ALU descriptor calculation and model services
> to use a feature service (their own or an existing one). This will also
> solve the problem Andreas Maunz mentioned in Oxford about the need to
> regenerate fragments on each cross-validation run. This is easily solved if
> you create one feature per fragment - which effectively allows any
> substructure to be cached - and is how the TUM fminer works.
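
Before I reply: for anyone on the list who has not worked with it,
ot:hasSource is simply a predicate pointing from a feature back to the
algorithm or model that generated it. A minimal rdflib sketch with invented
URIs - not the actual output of any of our services - would look something
like this:

  # Hypothetical illustration of a feature linked to its generating
  # algorithm via ot:hasSource; all URIs are invented for the example.
  from rdflib import Graph, Namespace, URIRef
  from rdflib.namespace import RDF

  OT = Namespace("http://www.opentox.org/api/1.1#")
  g = Graph()
  feature = URIRef("http://example.org/feature/42")
  algorithm = URIRef("http://example.org/algorithm/cdk-xlogp")
  g.add((feature, RDF.type, OT.Feature))
  g.add((feature, OT.hasSource, algorithm))  # data "column" -> its origin
  print(g.serialize(format="turtle"))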

I strongly disagree for the general case, which may include _supervised_
feature mining and feature selection (which is the case for BBRC and
LAST features). Storing features from supervised algorithms and reusing
them for cross-validation will lead to wrong (i.e. too optimistic)
results, because information from the test compounds has already been used
to create the feature dataset.

We should make such mistakes impossible in our framework (this was also
- in my understanding - a main motivation for a separate validation service).
For this reason, descriptors (at least those from supervised algorithms)
_have_ to be calculated on the fly for each validation fold.
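
The effect is easy to reproduce with any supervised feature selection that
is done once, outside the cross-validation loop. The toy script below uses
plain scikit-learn as a stand-in for BBRC/LAST (the numbers and parameters
are made up and it has nothing to do with our actual services), but it shows
why reusing a "cached" supervised feature set across folds is wrong:

  # Toy demonstration: supervised feature selection performed once on the
  # full dataset leaks test-set information into the reused feature set and
  # makes cross-validation far too optimistic.
  import numpy as np
  from sklearn.feature_selection import SelectKBest, f_classif
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline

  rng = np.random.RandomState(0)
  X = rng.normal(size=(100, 5000))   # pure noise "descriptors"
  y = rng.randint(0, 2, 100)         # random class labels, nothing to learn

  # Wrong: select features on the whole dataset, then cross-validate
  X_cached = SelectKBest(f_classif, k=20).fit_transform(X, y)
  leaky = cross_val_score(LogisticRegression(), X_cached, y, cv=5).mean()

  # Right: feature selection is refitted inside every fold ("on the fly")
  pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
  honest = cross_val_score(pipe, X, y, cv=5).mean()

  print("leaky CV accuracy: ", round(leaky, 2))   # well above 0.5 on pure noise
  print("honest CV accuracy:", round(honest, 2))  # close to 0.5, as it should be

Exactly the same thing happens if fragments mined on the complete dataset
are reused for every cross-validation fold.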

Caching the results of _unsupervised_ algorithms is of course OK, but
this is IMHO an implementation/optimisation detail of the descriptor
calculation services. Service developers have to decide whether caching is
allowed or not (assuming that they know how to distinguish between
supervised and unsupervised algorithms ;-)). This decision should not be
delegated to client services (e.g. validation, ToxCreate), which do not
know the algorithmic details.
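
If it helps, this decision can live entirely inside the descriptor service,
along the following lines (a rough sketch with invented names, not a
proposal for the API):

  # Rough sketch (invented names): the descriptor service itself decides
  # whether a cached feature dataset may be reused; clients never have to.
  _cache = {}  # (dataset_uri, algorithm_uri) -> feature_dataset_uri

  def calculated_features(dataset_uri, algorithm_uri, supervised, compute):
      """Return a feature dataset URI, reusing cached results only for
      unsupervised algorithms; supervised ones are recomputed every time."""
      key = (dataset_uri, algorithm_uri)
      if not supervised and key in _cache:
          return _cache[key]                  # safe: no class labels involved
      feature_dataset_uri = compute(dataset_uri)
      if not supervised:
          _cache[key] = feature_dataset_uri   # never cache supervised results
      return feature_dataset_uri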

Best regards,
Christoph


