[OTDev] Validation: Efficiency

Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Feb 25 17:43:26 CET 2011


On 25 February 2011 18:16, Christoph Helma <helma at in-silico.ch> wrote:

>
> > Nina Jeliazkova wrote on 02/25/2011 01:32 PM:
> > >
> > >
> > > On 25 February 2011 14:26, Martin Guetlein
> > > <martin.guetlein at googlemail.com> wrote:
> > >
> > >     * Still 10 models would be created (and 10 validations, but I could
> > >     try to solve this internally), so we would not end up with 1 policy
> > >     for a cross-validation.
> > >
> > > Unless predictions are stored in the same dataset.
> >
> > It sounds feasible to me. What do you think, Christoph?
>
> For efficiency reasons (and implementation simplicity) I prefer to keep
> datasets in small, manageable chunks.  I am quite convinced that
> aggregating everything into a single dataset will not scale well. Let's
> assume a larger dataset with several thousand compounds and several
> thousand to 10,000 class-sensitive descriptors. Adding features for each
> validation fold would increase the dataset size 11-fold, and at that size
> I assume that all search/subset operations would be extremely slow.  I do
> not even dare think about serialising such a monster to RDF/XML.
>
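
A rough back-of-the-envelope sketch of the scale in question (Python; the
compound/descriptor counts and the 11x factor are only the assumptions from
the quoted paragraph, not measurements from any service):

compounds = 5000       # "several thousand compounds"
descriptors = 10000    # "several thousand to 10,000 class-sensitive descriptors"
folds = 10             # 10-fold cross-validation

original_cells = compounds * descriptors
# storing fold-specific features in the same dataset multiplies the
# feature count by roughly (1 + folds) = 11
aggregated_cells = original_cells * (1 + folds)

print(f"original:   {original_cells:,} feature values")    # 50,000,000
print(f"aggregated: {aggregated_cells:,} feature values")  # 550,000,000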

Not impossible, try our monsters :). Searching on an indexed field is usually
not a problem.
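
A minimal sketch (Python/sqlite3) of an indexed lookup of this kind; the
table and column names are invented for illustration and are not taken from
any OpenTox service schema:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE feature_value (
    dataset_id  INTEGER,
    compound_id INTEGER,
    feature_id  INTEGER,
    value       REAL)""")
# index on the columns used to subset the (potentially huge) table
con.execute("CREATE INDEX idx_ds_feat ON feature_value (dataset_id, feature_id)")

# ... bulk-insert many rows here ...

# with the index in place, subsetting by dataset/feature is an index range
# scan instead of a full table scan
rows = con.execute(
    "SELECT compound_id, value FROM feature_value"
    " WHERE dataset_id = ? AND feature_id = ?",
    (1, 42)).fetchall()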

Both approaches have their own pros and cons.


>
> @Martin: Would it help with AA to have "sets of datasets" accessible
> through URIs like /dataset/{set_id}/{dataset_id}?
>

Yes, let's try a construct like this. I would prefer some other top-level
term instead of "dataset", as otherwise it would be impossible to distinguish
set ids from dataset ids, e.g.

/set/{id}/dataset/{id}
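
Purely illustrative (Python/Flask, not part of any OpenTox implementation):
the point is only that a distinct top-level term makes set ids and dataset
ids syntactically unambiguous.

from flask import Flask

app = Flask(__name__)

# proposed: /dataset/{set_id}/{dataset_id}
# a bare id after /dataset/ could refer to either a set or a dataset
@app.route("/dataset/<int:set_id>/<int:dataset_id>")
def dataset_in_set(set_id, dataset_id):
    return f"dataset {dataset_id} in set {set_id}"

# alternative: /set/{set_id}/dataset/{dataset_id}
# the top-level term tells which kind of id follows
@app.route("/set/<int:set_id>/dataset/<int:dataset_id>")
def dataset_in_set_distinct(set_id, dataset_id):
    return f"dataset {dataset_id} in set {set_id}"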

Best regards,
Nina


>
> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>


