[OTDev] Validation: Efficiency

Fri Feb 25 18:59:23 CET 2011

On 25 February 2011 18:59, Christoph Helma <helma at in-silico.ch> wrote:

>
> > On 25 February 2011 18:16, Christoph Helma <helma at in-silico.ch> wrote:
> >
> > >
> > > > Nina Jeliazkova wrote on 02/25/2011 01:32 PM:
> > > > >
> > > > >
> > > > > On 25 February 2011 14:26, Martin Guetlein
> > > > > <martin.guetlein at googlemail.com> wrote:
> > > > >
> > > > >     * Still 10 models would be created (and 10 validations, but I could
> > > > >     try to solve this internally), so we would not end up with 1 policy
> > > > >     for a crossvalidation.
> > > > >
> > > > > Unless predictions are stored in the same dataset.
> > > >
> > > > It sounds feasible to me. What do you think, Christoph?
> > >
> > > For efficiency reasons (and implementation simplicity) I prefer to keep
> > > datasets in small and manageable chunks. I am quite convinced that
> > > aggregating everything in a single dataset will not scale well. Let's
> > > assume a larger dataset with several 1000 compounds and several
> > > 1000-10000 class-sensitive descriptors. Adding features for each
> > > validation fold would increase the dataset size 11 times, and at such a
> > > size I assume that all search/subset operations will be extremely slow.
> > > I do not even dare think about serialising such a monster to RDF/XML.
> > >
> >
> > Not impossible, try our monsters :). Searching on an indexed field is
> > usually not a problem.
>
> I will have to try again, but I remember that in the past downloading e.g.
> the complete CPDB was quite time consuming. If this has improved, let me
> know what you have done!
>

Tell me if this is better than before (there are still things left to
optimize). This was run from a remote machine.

$ time curl -H "Accept:application/rdf+xml" http://apps.ideaconsult.net:8080/ambit2/dataset/10?max=2000 1> cpdbas.rdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.5M    0 12.5M    0     0  2043k      0 --:--:--  0:00:06 --:--:-- 2210k

real    0m6.295s
user    0m0.036s
sys     0m0.136s

A subset should take less time (as should a different MIME type).
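
To repeat this measurement for other subset sizes or MIME types, something like the following could be used. It is only a rough sketch: the helper name timed_download and its opener parameter are made up for illustration and are not part of any AMBIT client, and the timing covers the full transfer as seen from the client.

```python
import time
import urllib.request


def timed_download(url, accept="application/rdf+xml",
                   opener=urllib.request.urlopen):
    """Fetch url with the given Accept header.

    Returns (size_in_bytes, elapsed_seconds). The opener argument is a
    hypothetical hook so the function can be exercised without a network.
    """
    request = urllib.request.Request(url, headers={"Accept": accept})
    start = time.perf_counter()
    with opener(request) as response:
        body = response.read()
    elapsed = time.perf_counter() - start
    return len(body), elapsed


# Possible usage (not run here; the URL is the one from the curl call above,
# and text/csv is only an example of an alternative MIME type):
#   url = "http://apps.ideaconsult.net:8080/ambit2/dataset/10?max=2000"
#   for mime in ("application/rdf+xml", "text/csv"):
#       size, seconds = timed_download(url, accept=mime)
#       print(mime, size, round(seconds, 3))
```

Varying the max parameter in the same loop would show how the time depends on the subset size.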

> >
> > Both approaches have their own pros and cons.
> >
> Yes and both serve different purposes, it would be nice to have a
> choice.
>
> > >
> > > @Martin: Would it help with AA to have "sets of datasets" accessible
> > > through URIs like /dataset/{set_id}/{dataset_id}?
> > >
> >
> > Yes, let's try a construct like this. I would prefer some other top-level
> > term instead of dataset, as it will be impossible to distinguish set ids
> > from dataset ids, e.g.
> >
> > /set/{id}/dataset/{id}
> >
> Sorry, /dataset should only indicate that we are talking about the
> dataset service, so my proposal was
> {dataset_service_uri}/{set_id}/{dataset_id}
>
>
Mapped to our services, there is a need for a top-level "noun":

http://host:port/ambit2/{set_id}/{dataset_id}

http://host:port/ambit2/dataset/{set_id}/{dataset_id}

http://host:port/ambit2/set/{set_id}/dataset/{dataset_id}
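
To illustrate why a separate top-level noun helps, here is a small sketch. The route tables are hypothetical, reduced to numeric ids and written only for this example; they are not the actual AMBIT routing configuration.

```python
import re

# Hypothetical routes where sets and datasets share the "dataset" prefix.
SHARED_PREFIX_ROUTES = {
    "dataset": re.compile(r"^/dataset/(?P<dataset_id>\d+)$"),
    "set": re.compile(r"^/dataset/(?P<set_id>\d+)$"),
    "dataset_in_set": re.compile(
        r"^/dataset/(?P<set_id>\d+)/(?P<dataset_id>\d+)$"),
}

# Hypothetical routes where sets get their own top-level noun.
EXPLICIT_NOUN_ROUTES = {
    "dataset": re.compile(r"^/dataset/(?P<dataset_id>\d+)$"),
    "set": re.compile(r"^/set/(?P<set_id>\d+)$"),
    "dataset_in_set": re.compile(
        r"^/set/(?P<set_id>\d+)/dataset/(?P<dataset_id>\d+)$"),
}


def matching_routes(routes, path):
    """Return the sorted names of all routes whose pattern matches path."""
    return sorted(name for name, pattern in routes.items()
                  if pattern.match(path))


# /dataset/10 could be either a set or a dataset under the shared prefix.
print(matching_routes(SHARED_PREFIX_ROUTES, "/dataset/10"))
# With an explicit noun, each path resolves to exactly one route.
print(matching_routes(EXPLICIT_NOUN_ROUTES, "/dataset/10"))
print(matching_routes(EXPLICIT_NOUN_ROUTES, "/set/3/dataset/10"))
```

With the shared prefix, the single-id path matches two routes, which is the ambiguity mentioned above; with /set/... each path matches one.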

Best regards,
Nina

> Best regards,
> Christoph
>