[OTDev] Validation: Efficiency

Martin Guetlein martin.guetlein at googlemail.com
Fri Feb 25 13:33:05 CET 2011


On Fri, Feb 25, 2011 at 1:21 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com> wrote:
> Martin,
>
> What do you think about introducing a resource for "groups of datasets" ? It
> could be used as a placeholder for URIs of several datasets, and  use some
> wildcards on the policy server to ensure only one policy for the group of
> dataset is needed.
>
> Regards,
> Nina

Would this not just shift the A&A work to adding and deleting those
wildcards (because we would need one wildcard for every dataset)?
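
For illustration, the group idea would amount to one policy with a
wildcard resource pattern, something like the following sketch. The
endpoint, the "subjectid" header, and the XML layout are assumptions
here, not the actual A&A API:

import requests

POLICY_SERVICE = "https://policy.example.org/pol"   # hypothetical endpoint
HEADERS = {"subjectid": "TOKEN",                    # token from the SSO login
           "Content-Type": "application/xml"}

# one wildcard rule instead of one policy per dataset
policy_xml = """<Policy name="dataset-group-42">
  <Rule>
    <ResourceName name="https://dataset.example.org/group/42/*"/>
    <Permission>GET</Permission>
  </Rule>
</Policy>"""

r = requests.post(POLICY_SERVICE, data=policy_xml, headers=HEADERS)
r.raise_for_status()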

Not sure if this would be faster, Andreas?

Regards,
Martin

>
> On 25 February 2011 14:00, Martin Guetlein <martin.guetlein at googlemail.com>
> wrote:
>>
>> > On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de> wrote:
>> >
>> >> Dear all,
>> >>
>> >> Since a single validation of a model on a dataset creates multiple
>> >> resources (currently > 50), and since everything in OpenTox is
>> >> decentralized (i.e. linked via URIs) and referenceable, we are
>> >> facing the problem that a prohibitively high load is currently
>> >> placed on the A&A services: a policy must be created and requested
>> >> multiple times (and eventually deleted) for each of these
>> >> resources.
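>> >>
>> >> To illustrate the pattern (endpoint, header name, and helper are
>> >> assumptions; only the one-policy-per-resource shape is the point):
>> >>
>> >> import requests
>> >>
>> >> POL = "https://policy.example.org/pol"     # hypothetical A&A endpoint
>> >> HEADERS = {"subjectid": "TOKEN"}           # token from the SSO login
>> >>
>> >> def policy_xml(uri):
>> >>     # minimal single-resource policy granting GET
>> >>     return ('<Policy><Rule><ResourceName name="%s"/>'
>> >>             '<Permission>GET</Permission></Rule></Policy>' % uri)
>> >>
>> >> uris = ["https://example.org/resource/%d" % i for i in range(50)]
>> >>
>> >> # one POST per resource now, one DELETE per resource later:
>> >> # 100+ A&A round trips for a single validation
>> >> for uri in uris:
>> >>     requests.post(POL, data=policy_xml(uri), headers=HEADERS)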
>> >>
>> >> For example, the spike at the very right of
>> >> http://tinyurl.com/6amuo8x is produced by a single validation.
>> >> Moreover, the validation service is very slow; the A&A-related part
>> >> alone takes at least several minutes. All of this is caused by the
>> >> number of single policies that have to be created.
>> >>
>> >> Martin argues that there is currently no API-compliant way of
>> >> improving performance: one way could be to collect all URIs and
>> >> create a single policy covering all of them at the end of the
>> >> validation. However, there is no way of notifying the services
>> >> involved in a validation not to create policies in the first place.
>> >> Also, without policies, the validation service would have no way to
>> >> access the resources, since the default (no associated policy) is
>> >> "deny".
>> >>
>> >> We consider this a high-priority issue, which should be dealt with
>> >> before everyone starts using validation in production. Perhaps we
>> >> need an API extension that allows the collection strategy discussed
>> >> above, or are there other suggestions?
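>> >>
>> >> One possible shape for such an extension (purely hypothetical, not
>> >> part of the current API): services accept a hint to skip policy
>> >> creation, and one policy covering all created URIs is posted at the
>> >> end:
>> >>
>> >> import requests
>> >>
>> >> POL = "https://policy.example.org/pol"      # hypothetical endpoint
>> >> HEADERS = {"subjectid": "TOKEN"}
>> >>
>> >> # hypothetical flag telling a service not to create per-resource
>> >> # policies for the resources it produces
>> >> requests.post("https://algorithm.example.org/lazar",
>> >>               data={"dataset_uri": "https://example.org/dataset/1",
>> >>                     "no_policy": "true"},
>> >>               headers=HEADERS)
>> >>
>> >> # afterwards: one batch policy for everything the validation created
>> >> uris = ["https://example.org/model/1", "https://example.org/dataset/2"]
>> >> rules = "".join('<ResourceName name="%s"/>' % u for u in uris)
>> >> requests.post(POL, data="<Policy>%s</Policy>" % rules, headers=HEADERS)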
>> >>
>> >> Best regards
>> >> Andreas
>>
>> On Fri, Feb 25, 2011 at 12:06 PM, Nina Jeliazkova
>> <jeliazkova.nina at gmail.com> wrote:
>> > Andreas,
>> >
>> > I have not thought about it in detail, but bearing in mind the
>> > differences between the dataset implementation at Freiburg and ours,
>> > I think part of the problem is that (AFAIK) your implementation
>> > makes a full copy of the dataset on each run, instead of reusing the
>> > same URIs (e.g. the same records in the database).
>> >
>> > So maybe this is just implementation-specific?
>> >
>> > Nina
>>
>> Hi Nina, all
>>
>> Let me try to explain things from the validation point of view:
>>
>> When Andreas says "a validation", he is talking about a 10-fold
>> crossvalidation. This is where the 50+ resources (and therefore
>> policies) come from:
>> The dataset is split into 10 training and 10 test datasets, and 10
>> models are built. The predictions of the models are stored in 10
>> prediction datasets, and the results of each prediction on the folds
>> are stored in 10 single validations. The crossvalidation itself is 1
>> resource. (Creating reports adds new resources, 1 per report.)
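>>
>> As a quick count for k = 10:
>>
>> k = 10
>> resources = (k      # training datasets
>>              + k    # test datasets
>>              + k    # models
>>              + k    # prediction datasets
>>              + k    # single validations
>>              + 1)   # the crossvalidation itself
>> print(resources)    # 51, plus 1 per report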
>>
>> I have implemented fold reuse: no redundant training/test dataset
>> folds are created when a crossvalidation with equal params (dataset,
>> num-of-folds, random-seed, ...) has been performed before. The problem
>> is that ToxCreate uploads a new dataset for each validation, so
>> nothing can be reused there.
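>>
>> A sketch of that reuse check (create_folds is a stand-in for the
>> actual splitting code; the cache key is the parameter tuple):
>>
>> _fold_cache = {}
>>
>> def create_folds(dataset_uri, num_folds, random_seed):
>>     # placeholder: returns hypothetical fold dataset URIs
>>     return ["%s/fold/%d?seed=%d" % (dataset_uri, i, random_seed)
>>             for i in range(num_folds)]
>>
>> def folds_for(dataset_uri, num_folds, random_seed):
>>     key = (dataset_uri, num_folds, random_seed)
>>     if key not in _fold_cache:
>>         _fold_cache[key] = create_folds(dataset_uri, num_folds,
>>                                         random_seed)
>>     return _fold_cache[key]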
>>
>> What Nina's dataset services do is provide views on datasets/subsets
>> of datasets, specified via a feature_uris[] or compound_uris[]
>> parameter. I decided not to use this, to prevent algorithms from
>> cheating. Using it would save the 20 training and test folds out of
>> the roughly 50 resources.
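>>
>> For reference, such a view is an ordinary GET with query parameters
>> (the dataset URI below is made up; the parameter names are the ones
>> mentioned above):
>>
>> import requests
>>
>> # a "view" restricted to selected compounds; no data is copied
>> r = requests.get("https://dataset.example.org/dataset/42",
>>                  params={"compound_uris[]": [
>>                          "https://dataset.example.org/compound/1",
>>                          "https://dataset.example.org/compound/2"]},
>>                  headers={"Accept": "application/rdf+xml"})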
>>
>> The problem is that the deletion of policies is very slow. So A&A does
>> not (considerably) slow down the actual validation; only deleting old
>> validations takes long. This is why we are trying to think of ways to
>> store all resources under one policy.
>>
>> Martin



-- 
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetlein at informatik.uni-freiburg.de


