[OTDev] Validation: Efficiency
Christoph Helma helma at in-silico.ch
Fri Feb 25 17:59:31 CET 2011
> On 25 February 2011 18:16, Christoph Helma <helma at in-silico.ch> wrote:
>
> > > Nina Jeliazkova wrote on 02/25/2011 01:32 PM:
> > > >
> > > > On 25 February 2011 14:26, Martin Guetlein
> > > > <martin.guetlein at googlemail.com> wrote:
> > > >
> > > >     * Still 10 models would be created (and 10 validations, but I could
> > > >     try to solve this internally), so we would not end up with 1 policy
> > > >     for a crossvalidation.
> > > >
> > > > Unless predictions are stored in the same dataset.
> > >
> > > It sounds feasible to me. What do you think, Christoph?
> >
> > For efficiency reasons (and implementation simplicity) I prefer to keep
> > datasets in small and manageable chunks. I am quite convinced that
> > aggregating everything in a single dataset will not scale well. Lets
> > assume a larger dataset with several 1000 compounds and several
> > 1000-10000 class sensitive descriptors. Adding features for each
> > validation fold would increase the dataset 11 times and with such a size
> > I assume that all search/subset operations will be extremely slow. I do
> > not even dare thinking about serialising such a monster to rdfxml.
> >
>
> Not impossible, try our monsters :) . Searching on indexed field is usually
> not a problem.

I will have to try again, but I remember that in the past downloading
e.g. the complete CPDB was quite time consuming. If this has improved
let me know what you have done!

>
> Both approaches have their own pros and cons.
>

Yes and both serve different purposes, it would be nice to have a choice.

> >
> > @Martin: Would it help with AA to have "sets of datasets" accessible
> > through URIs like /dataset/{set_id}/{dataset_id}.
> >
>
> Yes, let's try a construct like this. I would prefer some other top level
> term, instead of dataset, as it will be impossible to distinguish set ids
> from dataset ids, e.g.
>
> /set/{id}/dataset/{id}
>

Sorry, /dataset should only indicate that we are talking about the
dataset service, so my proposal was

{dataset_service_uri}/{set_id}/{dataset_id}

Best regards,
Christoph
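
To make the two URI layouts discussed above concrete, here is a minimal Python sketch. It is not taken from any OpenTox service: the function names, the `http://example.org` host, and the ids are all hypothetical, chosen only to show how the two templates expand and why a single path segment below the dataset service could be read as either a set id or a dataset id.

```python
from urllib.parse import urlsplit


def set_member_uri(dataset_service_uri: str, set_id: str, dataset_id: str) -> str:
    """Expand {dataset_service_uri}/{set_id}/{dataset_id} (hypothetical helper)."""
    return f"{dataset_service_uri.rstrip('/')}/{set_id}/{dataset_id}"


def explicit_set_member_uri(base_uri: str, set_id: str, dataset_id: str) -> str:
    """Expand /set/{id}/dataset/{id} below an assumed base URI (hypothetical helper)."""
    return f"{base_uri.rstrip('/')}/set/{set_id}/dataset/{dataset_id}"


def segments_below(uri: str, service_uri: str) -> list[str]:
    """Return the path segments of `uri` that follow the path of `service_uri`."""
    path = urlsplit(uri).path
    prefix = urlsplit(service_uri).path.rstrip("/")
    if not path.startswith(prefix + "/"):
        raise ValueError(f"{uri} is not below {service_uri}")
    return [segment for segment in path[len(prefix):].split("/") if segment]


def classify(uri: str, service_uri: str) -> str:
    """Guess the resource kind from the number of path segments alone.

    Assumption for illustration: ids carry no type information, so one
    segment cannot be told apart as a set id or a dataset id.
    """
    count = len(segments_below(uri, service_uri))
    if count == 2:
        return "dataset in set"
    if count == 1:
        return "dataset or set (ambiguous)"
    return "unknown"


if __name__ == "__main__":
    service = "http://example.org/dataset"  # hypothetical dataset service URI
    print(set_member_uri(service, "3", "17"))
    print(explicit_set_member_uri("http://example.org", "3", "17"))
    print(classify("http://example.org/dataset/3", service))
```

With explicit `/set/` and `/dataset/` markers the resource kind can be read directly from the path, at the cost of longer URIs; with the shorter layout a service would need some other convention (for example, distinct id ranges) to tell sets from datasets.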