[OTDev] Validation: Efficiency

Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Feb 25 12:53:27 CET 2011


Andreas,

On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:

> Nina,
>
> you are right (I think it is still the case that datasets are redundant).
> However, with different model parameters, which will probably be used a
> lot in validation, new datasets will be created.
> I think it would definitely be necessary not to store data redundantly
> (as you indicated), but that might only be part of the solution.
> So it may still be necessary to reduce the number of policies needed.
>
>
Well, thinking further:

1) I would implement validation splits (at least at our services) as
logical splits of the same dataset, assigning tags similar to those in the
mutagenicity benchmark dataset (see the column "Set",
http://apps.ideaconsult.net:8080/ambit2/feature/28956 )

http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100

and introduce searching similar to the queries below (restricted to the
property in question):

Training set
http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN

Cross-validation sets
http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
...


Thus everything stays in the original dataset (or a single copy of it on
another dataset service), and no additional policies are needed.
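
For illustration, here is a minimal client-side sketch in Python (using the
requests library). The dataset URI and the TRAIN/CV1..CV3 tags are taken
from the examples above; the text/csv Accept header and the fetch_subset
helper are assumptions of mine, not part of the current API:

# sketch: retrieve logical validation splits of one dataset via ?search=
import requests

DATASET = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"

def fetch_subset(tag, media_type="text/csv"):
    """Return the subset of DATASET tagged with `tag` (e.g. 'TRAIN', 'CV1')."""
    r = requests.get(DATASET, params={"search": tag},
                     headers={"Accept": media_type})
    r.raise_for_status()
    return r.text

training = fetch_subset("TRAIN")
folds = [fetch_subset("CV%d" % i) for i in range(1, 4)]  # CV1..CV3

Since only the original dataset URI is ever accessed, a single policy
covers all splits.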


Different features, calculated during a validation run, would be specified
via the feature_uris[] parameter on the same dataset URI:

http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3&feature_uris[]=...
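
Continuing the sketch above (same assumptions), the feature_uris[]
parameter can be combined with the search parameter in a single request;
note that the second query parameter is appended with "&", and the feature
URIs below are placeholders only:

# sketch: one CV fold restricted to selected (placeholder) feature URIs
import requests

r = requests.get(
    "http://apps.ideaconsult.net:8080/ambit2/dataset/2344",
    params={"search": "CV3",
            "feature_uris[]": ["http://host/feature/X",   # placeholder
                               "http://host/feature/Y"]}, # placeholder
    headers={"Accept": "text/csv"})
r.raise_for_status()
cv3_with_features = r.text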

2) Without changing your current approach, perhaps it makes sense to
introduce into the API a resource for "groups of datasets", which could be
used as a placeholder for the URIs of several datasets, and to use
wildcards on the policy server so that only one policy per group of
datasets is needed.

I guess groups of datasets could be useful in other cases as well.
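
Below is a purely hypothetical sketch of what 2) could look like; the
dataset group endpoint, its URI scheme and the wildcard rule do not exist
in the current API, and only illustrate the idea of one policy per group
instead of one policy per dataset:

# hypothetical: register all datasets of a validation run as one group
import requests

GROUP_SERVICE = "http://example.org/datasetgroup"  # hypothetical endpoint
dataset_uris = [
    "http://apps.ideaconsult.net:8080/ambit2/dataset/2344",
    # ... further dataset URIs created during the validation run
]

resp = requests.post(GROUP_SERVICE, data={"dataset_uris[]": dataset_uris})
resp.raise_for_status()
group_uri = resp.text.strip()  # e.g. http://example.org/datasetgroup/42

# a single policy with a wildcard rule such as <group_uri>/* would then
# replace the many per-dataset policies on the AA server
print("one policy needed, covering:", group_uri + "/*")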

Nina



> Andreas
>
> Nina Jeliazkova wrote on 02/25/2011 12:06 PM:
>
>> Andreas,
>>
>> I have not thought about it in detail, but bearing in mind the
>> differences between the dataset implementations at Freiburg and ours, I
>> think part of the problem is that (AFAIK) your implementation makes a
>> full copy of the dataset on each run, even when the same URIs (i.e. the
>> same records in the database) are used.
>>
>> So maybe this is just implementation-specific?
>>
>> Nina
>>
>> On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de> wrote:
>>
>>    Dear all,
>>
>>    since a single validation of a model on a dataset creates multiple
>>    resources (currently > 50), and since everything in OpenTox is
>>    decentralized (i.e. linked via URIs) and referenceable, we are
>>    facing the problem that a prohibitively high load is currently
>>    placed on the AA services, because a policy must be created and
>>    requested multiple times (and eventually deleted) for each of these
>>    resources.
>>
>>    For example, the spike at the very right of
>>    http://tinyurl.com/6amuo8x is produced by a single validation.
>>    Moreover, the validation service is very slow; the AA-related part
>>    alone takes at least several minutes. All of this is caused by the
>>    number of individual policies that have to be created.
>>
>>    Martin argues that there currently seems to be no API-compliant way
>>    of improving performance: one option would be to collect all URIs
>>    and create a single policy covering all of them at the end of the
>>    validation. However, there is no way of notifying the services
>>    involved in the validation not to create policies in the first
>>    place. Also, without policies, the validation would have no way to
>>    access the resources, since the default (no associated policy) is
>>    "deny".
>>
>>    We consider this a high-priority issue that should be dealt with
>>    before everyone starts using validation in production. Perhaps we
>>    need an API extension that allows the collection strategy discussed
>>    above, or are there other suggestions?
>>
>>    Best regards
>>    Andreas
>>    _______________________________________________
>>    Development mailing list
>>    Development at opentox.org
>>
>>    http://www.opentox.org/mailman/listinfo/development
>>
>>
>>
> --
> http://www.maunz.de
>
>            According to my calculations the problem doesn't exist.
>


