[OTDev] Validation: Efficiency

Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Feb 25 13:21:46 CET 2011


Martin,

What do you think about introducing a resource for "groups of datasets"? It
could be used as a placeholder for the URIs of several datasets, and
wildcards on the policy server could then ensure that only one policy is
needed for the whole group of datasets.
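As a rough illustration of the idea (the actual policy-server wildcard syntax may differ, and the host and group name below are made up), a single wildcard resource pattern could cover every dataset URI in a group:

```python
from fnmatch import fnmatch

# Hypothetical wildcard resource for a "group of datasets" policy;
# the real policy server's pattern syntax may be different.
GROUP_POLICY_RESOURCE = "http://host.example/dataset/group-42/*"

def covered_by_group_policy(dataset_uri: str) -> bool:
    """Check whether a dataset URI falls under the single group policy."""
    return fnmatch(dataset_uri, GROUP_POLICY_RESOURCE)

uris = [
    "http://host.example/dataset/group-42/fold-1",
    "http://host.example/dataset/group-42/fold-2",
    "http://host.example/other/dataset/99",
]
print([covered_by_group_policy(u) for u in uris])  # [True, True, False]
```

One policy with such a resource pattern would then stand in for the dozens of per-dataset policies a validation currently creates.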

Regards,
Nina

On 25 February 2011 14:00, Martin Guetlein
<martin.guetlein at googlemail.com> wrote:

> > On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de> wrote:
> >
> >> Dear all,
> >>
> >> since a single validation of a model on a dataset creates multiple
> >> resources (currently > 50), and since everything in OpenTox is
> >> decentralized (i.e. linked via URIs) and referenceable, we are facing
> >> the problem that a prohibitively high load is currently placed on the
> >> AA services, because a policy must be created and requested multiple
> >> times (and eventually deleted) for each of these resources.
> >>
> >> For example, the spike at the very right of http://tinyurl.com/6amuo8x
> >> is produced by a single validation. Moreover, the validation service is
> >> very slow; the AA-related part alone takes at least several minutes.
> >> All this is caused by the number of individual policies that have to be
> >> created.
> >>
> >> Martin argues that there currently seems to be no API-compliant way of
> >> improving performance: one way could be to collect all URIs and create
> >> a policy covering all of them at the end of the validation. However,
> >> there is no way of notifying the services involved in a validation not
> >> to create policies in the first place. Also, without policies, the
> >> validation would have no way to access the resources, since the default
> >> (no associated policy) is "deny".
> >>
> >> We consider this issue high priority; it should be dealt with before
> >> everyone starts using validation in production. Perhaps we need an API
> >> extension that allows the collection strategy discussed above, or are
> >> there other suggestions?
> >>
> >> Best regards
> >> Andreas
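The collection strategy Andreas describes could look roughly like the sketch below. Note that `create_group_policy` is hypothetical: the current OpenTox AA API has no such call, which is exactly the gap being discussed; resource URIs and field names are made up.

```python
# Sketch of "collect all URIs, create one policy at the end" instead of
# creating an individual policy per resource during the validation run.

created_uris = []

def register(uri):
    """Record a resource created during validation rather than
    immediately creating an individual policy for it."""
    created_uris.append(uri)
    return uri

def create_group_policy(uris):
    # Placeholder for a single AA request covering all collected
    # resources; no such endpoint exists in the current API.
    return {"resources": list(uris), "action": "GET", "subject": "owner"}

for i in range(3):
    register(f"http://host.example/model/{i}")

policy = create_group_policy(created_uris)
print(len(policy["resources"]))  # one policy, many resources
```

The catch raised above remains: with no policies in place during the run, the default "deny" would block the validation service's own access to the intermediate resources.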
>
> On Fri, Feb 25, 2011 at 12:06 PM, Nina Jeliazkova
> <jeliazkova.nina at gmail.com> wrote:
> > Andreas,
> >
> > I have not thought about it in detail, but bearing in mind the
> > differences between your dataset implementation at Freiburg and ours, I
> > think part of the problem is that (AFAIK) your implementation makes a
> > full copy of the dataset on each run, regardless of whether the same
> > URIs are used (e.g. the same records in the database).
> >
> > So maybe this is just implementation-specific?
> >
> > Nina
>
> Hi Nina, all
>
> I will try to explain things from the validation point of view:
>
> Andreas was talking about a 10-fold crossvalidation. This is where the
> 50+ resources (and therefore policies) come from when performing a
> crossvalidation: the dataset is split into 10 training and 10 test
> datasets. 10 models are built. The predictions of the models are stored
> in 10 prediction datasets. The results of each prediction on the folds
> are stored in 10 single validations. The actual crossvalidation itself
> is 1 resource. (Creating reports adds new resources, 1 per report.)
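Tallying the breakdown above makes the 50+ figure concrete (before any reports are added):

```python
# Resources created by one 10-fold crossvalidation, per the breakdown above.
folds = 10
resources = {
    "training datasets": folds,
    "test datasets": folds,
    "models": folds,
    "prediction datasets": folds,
    "single validations": folds,
    "crossvalidation": 1,
}
total = sum(resources.values())
print(total)  # 51, plus 1 more per report
```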
>
> I have implemented it so that no redundant training / test dataset folds
> are created when a crossvalidation with equal parameters (dataset,
> num-of-folds, random-seed, ...) was performed before. The problem is
> that ToxCreate uploads a new dataset for each validation, so nothing can
> be reused here.
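The reuse logic can be sketched as a cache keyed on exactly the parameters named above (the function names and URIs here are illustrative, not the actual validation-service code):

```python
# Sketch: reuse folds when a crossvalidation with identical parameters
# was performed before. Key fields follow the ones named in the mail.

fold_cache = {}

def get_folds(dataset_uri, num_folds, random_seed, split_fn):
    """Return cached folds for identical parameters, else create them."""
    key = (dataset_uri, num_folds, random_seed)
    if key not in fold_cache:
        fold_cache[key] = split_fn(dataset_uri, num_folds, random_seed)
    return fold_cache[key]

calls = []
def fake_split(uri, n, seed):
    calls.append(uri)  # count how often folds are actually created
    return [f"{uri}/fold/{i}" for i in range(n)]

a = get_folds("http://host.example/dataset/1", 10, 42, fake_split)
b = get_folds("http://host.example/dataset/1", 10, 42, fake_split)
print(a is b, len(calls))  # True 1 -- second call reuses the folds
```

This also shows why ToxCreate defeats the cache: uploading a fresh dataset changes `dataset_uri`, so every key is new and nothing is ever reused.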
>
> What Nina is doing with her dataset services is to provide views on
> datasets / subsets of datasets by specifying a feature_uris[] or
> compound_uris[] parameter. I decided not to use this, to prevent
> algorithms from cheating. Using it would save the 20 training and test
> folds out of the roughly 50 resources.
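Such a view is just a query over the existing dataset rather than a new resource. A sketch of building one (host, dataset ID, and compound URIs are made up; the parameter names follow the mail):

```python
from urllib.parse import urlencode

# Build a "view" URI on an existing dataset using the compound_uris[]
# parameter mentioned above; no new dataset resource is created.
dataset = "http://host.example/dataset/7"
compounds = [
    "http://host.example/compound/1",
    "http://host.example/compound/2",
]
query = urlencode([("compound_uris[]", c) for c in compounds])
view_uri = f"{dataset}?{query}"
print(view_uri)
```

Because the view references the same stored records, it would need no dataset copy and no additional policy, which is the saving Martin quantifies as about 20 of the roughly 50 resources.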
>
> The problem is that the deletion of policies is very slow. So A&A does
> not (considerably) slow down the actual validation; only deleting old
> validations takes long. This is why we are trying to think of ways to
> store all resources under a single policy.
>
> Martin
>
>
> >> _______________________________________________
> >> Development mailing list
> >> Development at opentox.org
> >> http://www.opentox.org/mailman/listinfo/development
> >>
>
>
>
> --
> Dipl-Inf. Martin Gütlein
> Phone:
> +49 (0)761 203 8442 (office)
> +49 (0)177 623 9499 (mobile)
> Email:
> guetlein at informatik.uni-freiburg.de
>
