[OTDev] Validation: Efficiency

Fri Feb 25 17:59:31 CET 2011

> On 25 February 2011 18:16, Christoph Helma <helma at in-silico.ch> wrote:
> 
> >
> > > Nina Jeliazkova wrote on 02/25/2011 01:32 PM:
> > > >
> > > >
> > > > On 25 February 2011 14:26, Martin Guetlein
> > > > <martin.guetlein at googlemail.com> wrote:
> > > >
> > > >     * Still 10 models would be created (and 10 validations, but I could
> > > >     try to solve this internally), so we would not end up with 1 policy
> > > >     for a crossvalidation.
> > > >
> > > > Unless predictions are stored in the same dataset.
> > >
> > > It sounds feasible to me. What do you think, Christoph?
> >
> > For efficiency reasons (and implementation simplicity) I prefer to keep
> > datasets in small and manageable chunks.  I am quite convinced that
> > aggregating everything in a single dataset will not scale well. Let's
> > assume a larger dataset with several 1000 compounds and several
> > 1000-10000 class-sensitive descriptors. Adding features for each
> > validation fold would increase the dataset 11 times, and with such a size
> > I assume that all search/subset operations will be extremely slow.  I do
> > not even dare to think about serialising such a monster to rdfxml.
> >
> 
> Not impossible, try our monsters :) . Searching on an indexed field is usually
> not a problem.

I will have to try again, but I remember that in the past downloading e.g. the
complete CPDB was quite time-consuming. If this has improved, let me know
what you have done!

> 
> Both approaches have their own pros and cons.
> 
Yes, and both serve different purposes, so it would be nice to have a
choice.

> >
> > @Martin: Would it help with AA to have "sets of datasets" accessible
> > through URIs like /dataset/{set_id}/{dataset_id}?
> >
> 
> Yes, let's try a construct like this.  I would prefer some other top-level
> term, instead of dataset, as it will be impossible to distinguish set ids
> from dataset ids, e.g.
> 
> /set/{id}/dataset/{id}
> 
Sorry, /dataset should only indicate that we are talking about the
dataset service, so my proposal was {dataset_service_uri}/{set_id}/{dataset_id}
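
To make the pattern concrete, here is a minimal sketch of how such URIs
could be built and split again. Everything in it is an assumption for
illustration: the base URI http://example.org/dataset, the ids "cv42" and
"fold3", and the function names are hypothetical and not part of any
existing service.

```python
from typing import Tuple


def dataset_uri(service_uri: str, set_id: str, dataset_id: str) -> str:
    """Build {dataset_service_uri}/{set_id}/{dataset_id} (hypothetical helper)."""
    return f"{service_uri.rstrip('/')}/{set_id}/{dataset_id}"


def split_dataset_uri(service_uri: str, uri: str) -> Tuple[str, str]:
    """Return (set_id, dataset_id) for a URI below the given service URI."""
    base = service_uri.rstrip("/") + "/"
    if not uri.startswith(base):
        raise ValueError(f"{uri!r} is not below {service_uri!r}")
    # Whatever follows the service URI must be exactly two path segments.
    parts = uri[len(base):].strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected <set_id>/<dataset_id>, got {uri!r}")
    return parts[0], parts[1]


if __name__ == "__main__":
    service = "http://example.org/dataset"  # assumed base URI
    uri = dataset_uri(service, "cv42", "fold3")
    print(uri)
    print(split_dataset_uri(service, uri))
```

Because the set id always comes directly after the service URI, a policy
for a whole crossvalidation could be attached to the
{dataset_service_uri}/{set_id} prefix, if the AA service supports
prefix-based policies (I have not checked this).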

Best regards,
Christoph