[OTDev] NTUA WebServices

Nina Jeliazkova jeliazkova.nina at gmail.com
Mon Aug 23 19:36:04 CEST 2010


Christoph,

Thinking again, it seems to me there is some confusion here between feature
generation and feature selection.  The first is independent of the learning
model, and its results can be cached safely, while the second is indeed not,
being model- and dataset-specific.  The confusion arises because BBRC
apparently has no clear split between the feature generation and feature
selection phases.
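
To illustrate the split I mean, here is a minimal Python sketch (all names
and the substring test are hypothetical stand-ins, not any partner's actual
implementation): generation asks a purely structural question and can be
cached globally, while selection consults the training labels and must be
rerun for every fold.

```python
# Hypothetical sketch of the generation/selection split argued for above.
# Fragment *generation* (does a molecule contain a substructure?) is
# independent of any learning task, so its results can be cached safely.
# Fragment *selection* (which fragments relate to the endpoint?) is
# supervised and must be redone on each training fold.

from functools import lru_cache

@lru_cache(maxsize=None)
def has_fragment(smiles, fragment):
    """Feature generation: purely structural, safe to cache globally.
    (A naive substring test stands in for a real substructure match.)"""
    return fragment in smiles

def select_features(train_set, fragments, min_support=2):
    """Feature selection: uses the training labels, so it must be run
    separately on each cross-validation fold - never cached across folds."""
    selected = []
    for frag in fragments:
        # Count active training compounds containing the fragment.
        active_hits = sum(1 for smiles, label in train_set
                          if label and has_fragment(smiles, frag))
        if active_hits >= min_support:
            selected.append(frag)
    return selected
```

Note that `has_fragment` results can be reused by any later run, while
`select_features` output is valid only for the training set it saw.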

What I was trying to say in previous emails is that generated features
should be referred to by dereferenceable feature URIs, the content of these
URIs having an ot:Feature representation and pointing back to the
generating algorithm via the ot:hasSource property.  The features are then
picked by a feature selection procedure, and this information goes into the
model (via the ot:independent parameter) and is later used by clients to
figure out which algorithms to use for calculating descriptors.
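
In client pseudocode the chain looks roughly like this (a hedged sketch:
the example.org URIs are invented for illustration, and plain dicts stand
in for the dereferenced RDF representations):

```python
# Sketch of how a client follows the ot:independent -> ot:hasSource chain
# to discover which descriptor algorithms a model needs.  The URIs below
# are illustrative only, not real OpenTox services.

FEATURES = {
    "http://example.org/feature/42": {
        "type": "ot:Feature",
        "ot:hasSource": "http://example.org/algorithm/cdk-xlogp",
    },
}

MODELS = {
    "http://example.org/model/7": {
        "ot:independent": ["http://example.org/feature/42"],
    },
}

def algorithms_for_model(model_uri):
    """Dereference the model's independent feature URIs, then follow each
    feature's ot:hasSource to the algorithm that generates it."""
    model = MODELS[model_uri]
    return sorted({FEATURES[f]["ot:hasSource"]
                   for f in model["ot:independent"]})
```

A client can then invoke exactly those algorithms on a new dataset before
applying the model.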

This allows model services to be completely independent of the descriptor
calculation services, while remaining linked and transparent - and we do
have such services.

Nina

On Mon, Aug 23, 2010 at 7:01 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com>wrote:

>
>
> On Mon, Aug 23, 2010 at 6:51 PM, Christoph Helma <helma at in-silico.ch>wrote:
>
>> Excerpts from Nina Jeliazkova's message of Mon Aug 23 15:17:00 +0200 2010:
>> > > > This looks like "superservice" for model creation.
>> > > >
>> > > > 1) -d dataset_uri parameter is fine
>> > > > 2) -d feature_uri parameter is not documented and not used by any of
>> > > IDEA,
>> > > > TUM or NTUA partners, nor (AFAIK) in the API documentation
>> > > > Instead, what is used are the features inherent to the specified
>> > > > dataset. This allows having thousands of features.
>> > > > 3) The dependent variable, according to the API, should be passed
>> > > > via the prediction_feature={featureuris} parameter, not feature_uri
>> > > > (see the wiki page for models).
>>
>> Sorry, this was a cut+paste error from an outdated README. It is indeed
>> prediction_feature in our services.
>>
>> > > I like the idea of models and algorithms being able to handle
>> > > datasets without features (-> christoph's proposal).
>> >
>> >
>> > There are several disadvantages to calculating features on the fly:
>> >
>> > - It is not practical for any but the simplest features.  For example,
>> > the TUM implementation of CDK descriptors can run for hours on a
>> > moderately sized dataset (at least when we tested before the Berlin
>> > meeting).  The only reasonable way to overcome this is to store the
>> > calculated results and reuse them when requested. This is what we do
>> > now.
>> >
>> > - One of the most important advantages of a linked RDF representation
>> > is the ability to link data "columns" to the procedure that was used
>> > to generate that data. There is much talk about this currently at the
>> > ACS RDF session in Boston (see http://egonw.github.com/acsrdf2010/).
>> > OpenTox already has working support for this via the feature
>> > ot:hasSource predicate (this is how the TUM, NTUA and IDEA
>> > calculations work, and ToxPredict makes use of it).  If one does not
>> > use dereferenceable features for descriptors/fragments and calculates
>> > everything on the fly, this information is essentially lost.
>> >
>> > Therefore I would ask the IST/ALU descriptor calculation and model
>> > services to use a feature service (their own or an existing one).
>> > This will also solve the problem Andreas Maunz mentioned in Oxford,
>> > the need to generate fragments on each cross-validation run.  It is
>> > easily solved by creating one feature per fragment - this effectively
>> > allows caching any substructure - and is how TUM's fminer works.
>>
>> I strongly disagree for the general case, which may include _supervised_
>> feature mining and feature selection (which is the case for BBRC and
>> LAST features). Storing features from supervised algorithms and reusing
>> them for cross-validation will lead to wrong (i.e. too optimistic)
>> results, because information from the test compounds has already been
>> used to create the feature dataset.
>>
>>
> But I am not advocating reusing the same features from the selection phase
> in cross-validation - if they are different fragments, they will be stored
> under different feature URIs - so there is nothing controversial here.
> Think of storage simply as caching of descriptor results!
>
>
>
>> We should make such mistakes impossible in our framework (this was also
>> - in my understanding - a main motivation for a separate validation
>> service).
>> For this reason descriptors (at least from supervised algorithms) _have_
>> to be calculated on the fly for each validation fold.
>>
>> Caching results of _unsupervised_ algorithms is of course ok, but this
>> is IMHO an implementation/optimisation detail of the descriptor
>> calculation services. Service developers have to decide whether caching
>> is allowed (assuming that they know how to distinguish between
>> supervised and unsupervised algorithms ;-)). This decision should not
>> be delegated to client services (e.g. validation, ToxCreate), which do
>> not know the algorithmic details.
>>
>
>
> I don't really understand - calculating descriptors, fragments, or other
> entities from the molecules themselves is not related in any way to the
> learning algorithms used in the later modeling phase.  A fragment, e.g.
> CCCCCN, is the same fragment regardless of whether it is used for
> clustering or regression.  What I am saying is that the presence or
> absence of fragment CCCCCN is stored under a (dereferenceable) feature
> URI, which can then be reused by ANY algorithm that needs to verify that
> particular fragment's presence - the result can be cached once, and it is
> not necessary to run the calculation ten times to get the same result.
>
> Best regards,
>
> Nina
>
>
>>
>> Best regards,
>> Christoph
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>
>
>
>
> --
>
> Dr. Nina Jeliazkova
> Technical Manager
> 4 A.Kanchev str.
> IdeaConsult Ltd.
> 1000 Sofia, Bulgaria
> Phone: +359 886 802011
>
>
>


-- 

Dr. Nina Jeliazkova
Technical Manager
4 A.Kanchev str.
IdeaConsult Ltd.
1000 Sofia, Bulgaria
Phone: +359 886 802011


