[OTDev] NTUA WebServices
Nina Jeliazkova jeliazkova.nina at gmail.com
Mon Aug 23 18:01:25 CEST 2010
- Previous message: [OTDev] NTUA WebServices
- Next message: [OTDev] NTUA WebServices
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, Aug 23, 2010 at 6:51 PM, Christoph Helma <helma at in-silico.ch> wrote:
> Excerpts from Nina Jeliazkova's message of Mon Aug 23 15:17:00 +0200 2010:
> > > > This looks like a "superservice" for model creation.
> > > >
> > > > 1) The -d dataset_uri parameter is fine.
> > > > 2) The -d feature_uri parameter is not documented and is not used by
> > > > any of the IDEA, TUM or NTUA partners, nor (AFAIK) in the API
> > > > documentation. Instead, what is used are the features, which are
> > > > inherent to the specified dataset. This allows having thousands of
> > > > features.
> > > > 3) The dependent variable, according to the API, should go under the
> > > > prediction_feature={featureuris} parameter, not feature_uri (see the
> > > > wiki page for models).
>
> Sorry, this was a cut+paste error from an outdated README. It is indeed
> prediction_feature in our services.
>
> > > I like the idea of models and algorithms being able to handle datasets
> > > without features (-> Christoph's proposal).
> >
> > There are several disadvantages to calculating features on the fly:
> >
> > - It is not practical for any but the simplest features. For example, the
> >   TUM implementation of CDK descriptors can run for hours on a moderately
> >   sized dataset (at least when we tested before the Berlin meeting). The
> >   only reasonable way to overcome this is to store the calculated results
> >   and reuse them when requested. This is what we do now.
> >
> > - One of the most important advantages of having a linked RDF
> >   representation is being able to provide links between data "columns"
> >   and the procedure that was used to generate that data. There is much
> >   talk about this currently at the ACS RDF session in Boston (see
> >   http://egonw.github.com/acsrdf2010/). OpenTox already has working
> >   support for this via the feature ot:hasSource predicate (this is how
> >   the TUM, NTUA and IDEA calculations work, and ToxPredict makes use of
> >   it).
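To make point 3) concrete, the training call to an algorithm service would carry the two parameters roughly as below. This is a minimal sketch only; the example.org host and the dataset/feature URIs are hypothetical placeholders, not real OpenTox endpoints.

```python
from urllib.parse import urlencode

# Hypothetical URIs -- placeholders for a real algorithm service, a real
# dataset service and a real feature service.
ALGORITHM_URI = "http://opentox.example.org/algorithm/lazar"
DATASET_URI = "http://opentox.example.org/dataset/42"
PREDICTION_FEATURE = "http://opentox.example.org/feature/LC50"

def build_training_request(dataset_uri, prediction_feature):
    """Build the form-encoded body of a POST to an algorithm service.

    Per the discussion above, the training data goes in dataset_uri and
    the dependent variable in prediction_feature (not feature_uri).
    """
    return urlencode({
        "dataset_uri": dataset_uri,
        "prediction_feature": prediction_feature,
    })

body = build_training_request(DATASET_URI, PREDICTION_FEATURE)
# POSTing `body` to ALGORITHM_URI as application/x-www-form-urlencoded
# would return a model URI (or a task URI for long-running calculations).
print(body)
```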
> > If one is not using dereferenceable features for descriptors/fragments,
> > and calculates everything on the fly, this information is essentially
> > lost.
> >
> > Therefore I would ask the IST/ALU descriptor calculation and model
> > services to use a feature service (their own or an existing one). This
> > will also solve the problem Andreas Maunz mentioned in Oxford, the need
> > to regenerate fragments on each cross-validation run. This is easily
> > solved if you create one feature per fragment - effectively allowing any
> > substructure to be cached - and is how TUM fminer works.
>
> I strongly disagree for the general case, which may include _supervised_
> feature mining and feature selection (which is the case for BBRC and LAST
> features). Storing features from supervised algorithms and reusing them
> for crossvalidation will lead to wrong (i.e. too optimistic) results,
> because information from test compounds has already been used to create
> the feature dataset.

But I am not advocating reusing the same features from the selection phase
in crossvalidation - if they are different fragments they will be stored
under different feature URIs - so nothing controversial here. Think of
storing simply as caching descriptor results!

> We should make such mistakes impossible in our framework (this was also -
> in my understanding - a main motivation for a separate validation
> service). For this reason descriptors (at least from supervised
> algorithms) _have_ to be calculated on the fly for each validation fold.
>
> Caching results of _unsupervised_ algorithms is of course ok, but this is
> IMHO an implementation/optimisation detail of descriptor calculation
> services. Service developers have to decide if caching is allowed or not
> (assuming that they know to distinguish between supervised and
> unsupervised algorithms ;-)). This decision should not be delegated to
> client services (e.g. validation, ToxCreate) who do not know the
> algorithmic details.
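The ot:hasSource provenance link discussed above can be sketched as a couple of RDF triples. The snippet below is illustrative only: the feature and algorithm URIs are hypothetical placeholders, and only the namespace string for the OpenTox ontology is taken from the API.

```python
# Minimal sketch of the ot:hasSource provenance pattern, emitted as
# N-Triples. The feature and algorithm URIs are hypothetical placeholders.
OT = "http://www.opentox.org/api/1.1#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def has_source_triples(feature_uri, algorithm_uri):
    """Link a feature to the algorithm (or model) that generated it,
    so that clients can trace how a data "column" was computed."""
    return [
        f"<{feature_uri}> <{OT}hasSource> <{algorithm_uri}> .",
        f"<{feature_uri}> <{RDF_TYPE}> <{OT}Feature> .",
    ]

triples = has_source_triples(
    "http://opentox.example.org/feature/XLogP",
    "http://opentox.example.org/algorithm/cdk-descriptors",
)
print("\n".join(triples))
```

Dereferencing the feature URI would then tell any client which service produced the values, which is exactly the information lost when everything is computed on the fly.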
I don't really understand - calculating descriptors, fragments, or whatever
entities from the molecules themselves is not related in any way to the
learning algorithm used in the later phase of modelling. If there is a
fragment, e.g. CCCCCN, it is the same fragment regardless of whether it is
used for clustering or regression. What I am saying is that the presence or
absence of the fragment CCCCCN is stored under a (dereferenceable) feature
URI, which can then be reused by ANY algorithm that needs to verify the
presence of that particular fragment - the result may be cached once, and
the calculation need not be run ten times to get the same result.

Best regards,
Nina

> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development

--
Dr. Nina Jeliazkova
Technical Manager
IdeaConsult Ltd.
4 A.Kanchev str.
1000 Sofia, Bulgaria
Phone: +359 886 802011