[OTDev] Validation: Efficiency

Christoph Helma helma at in-silico.ch
Fri Feb 25 17:16:05 CET 2011


> Nina Jeliazkova wrote on 02/25/2011 01:32 PM:
> >
> >
> > On 25 February 2011 14:26, Martin Guetlein
> > <martin.guetlein at googlemail.com> wrote:
> >
> >     * Still 10 models would be created (and 10 validations, but I could
> >     try to solve this internally), so we would not end up with 1 policy
> >     for a crossvalidation.
> >
> > Unless predictions are stored in the same dataset.
> 
> It sounds feasible to me. What do you think, Christoph?

For efficiency reasons (and implementation simplicity) I prefer to keep
datasets in small and manageable chunks. I am quite convinced that
aggregating everything in a single dataset will not scale well. Let's
assume a larger dataset with several thousand compounds and several
thousand to ten thousand class-sensitive descriptors. Adding features
for each validation fold would increase the dataset size 11-fold (the
original features plus one feature set per fold), and at that size I
assume that all search/subset operations will be extremely slow. I do
not even dare think about serialising such a monster to RDF/XML.
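
A rough back-of-envelope calculation (the numbers are purely
illustrative: 5000 compounds, 10000 descriptors, one stored value per
compound/feature pair) shows the order of magnitude:

    compounds   = 5000        # illustrative, not a measured dataset
    descriptors = 10_000      # class-sensitive features per compound
    folds       = 10

    original   = compounds * descriptors      # values in the source dataset
    aggregated = original + folds * original  # plus one feature set per fold
    print(f"{original:,} -> {aggregated:,} values ({aggregated // original}x)")
    # prints: 50,000,000 -> 550,000,000 values (11x)

Half a billion feature values in a single dataset is not something I
want to query, let alone serialise.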

@Martin: Would it help with AA (authentication/authorization) to have
"sets of datasets" accessible through URIs like
/dataset/{set_id}/{dataset_id}?
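
To make the idea concrete, here is a minimal sketch (Flask is used
purely for illustration; the route layout, the in-memory store, the
policy_allows check and the "subjectid" header are all assumptions, not
existing OpenTox code):

    from flask import Flask, abort, request

    app = Flask(__name__)

    # Hypothetical in-memory store standing in for the dataset backend;
    # each crossvalidation fold stays a separate, small dataset.
    DATASETS = {("cv42", "fold-1"): {"compounds": [], "features": []}}

    def policy_allows(token, set_id):
        # Hypothetical AA check: a single policy attached to the set
        # prefix covers all member datasets, so a 10-fold
        # crossvalidation needs 1 policy instead of 10.
        return token == "token-for-" + set_id

    @app.route("/dataset/<set_id>/<dataset_id>")
    def get_dataset(set_id, dataset_id):
        if not policy_allows(request.headers.get("subjectid", ""), set_id):
            abort(401)
        data = DATASETS.get((set_id, dataset_id))
        if data is None:
            abort(404)
        return data

The folds would stay in separate, manageable datasets while sharing a
single protected prefix.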

Best regards,
Christoph


