[OTDev] Validation: Efficiency

Martin Guetlein martin.guetlein at googlemail.com
Fri Feb 25 13:26:10 CET 2011


On Fri, Feb 25, 2011 at 12:53 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com> wrote:
> Andreas,
>
> On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:
>
>> Nina,
>>
>> you are right (I think it is still the case that datasets are stored
>> redundantly). However, with different model parameters, which will
>> probably be used a lot in validation, new datasets will still be created.
>> Avoiding redundant data storage (as you indicated) is definitely
>> necessary, but it may only be part of the solution.
>> So it may still be necessary to reduce the number of policies needed.
>>
>>
> Well, thinking further
>
> 1) I would implement validation splits (at least at our services) as
> logical splits of the same dataset, assigning tags, similar to what is
> done in the mutagenicity benchmark dataset (look for the column "Set",
> http://apps.ideaconsult.net:8080/ambit2/feature/28956 )
>
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100
>
> and introduce searching similar to the queries below (restricted to the
> property in question)
>
> Training set
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN
>
> Crossvalidation sets
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
> ...
>
>
> Thus, everything stays in the original dataset (or a single copy of it on
> another dataset service), and no additional policies are needed.
>
>
> Different features, calculated during a validation run, would be
> specified via the feature_uris[] parameter on the same dataset URI:
>
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3&feature_uris[]=..
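
A minimal client-side sketch of the proposed logical splits (Python with
the requests library; the Accept header and the number of folds below are
assumptions for illustration, not part of the proposal):

    # Retrieve the training set and CV folds as logical subsets of one
    # dataset, per the proposal above.
    import requests

    DATASET = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"
    HEADERS = {"Accept": "chemical/x-mdl-sdfile"}  # format is an assumption

    train = requests.get(DATASET, params={"search": "TRAIN"},
                         headers=HEADERS)
    folds = [requests.get(DATASET, params={"search": "CV%d" % i},
                          headers=HEADERS)
             for i in range(1, 11)]  # CV1..CV10; fold count assumed

    # Feature subsetting combines with the split tag on the same URI;
    # note the second parameter is joined with '&', not a second '?'.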

My concerns regarding using one dataset for everything:

* This would allow an algorithm to strip the search string and request
data that is supposed to be unseen (see the sketch after this list). What
do people think about that?

* Still, 10 models would be created (and 10 validations, though I could
try to solve the latter internally), so we would not end up with a single
policy per crossvalidation.

* Our services (ALU and IST) support neither storing results in existing
datasets nor providing subsets of a dataset via search strings (or
feature_uris[], compound_uris[]). Implementing this would require
considerable effort.
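
To make the first concern above concrete (sketch, reusing the dataset URI
from Nina's example):

    # A service handed only the CV1 subset URI can strip the query
    # string and retrieve the full dataset, held-out folds included.
    subset_uri = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1"
    full_uri = subset_uri.split("?", 1)[0]  # -> .../dataset/2344 (all data)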

Regards,
Martin


> ..
>
> 2) Without changing your current approach, perhaps it makes sense to
> introduce into the API a resource for "groups of datasets", which could
> serve as a placeholder for the URIs of several datasets, and to use
> wildcards on the policy server so that only one policy is needed for the
> whole group.
>
> I guess groups of datasets could be useful in other cases as well.
>
> Nina
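
A sketch of what such a grouping could look like; the /dataset_group
resource, the policy service URL, and the payload formats below are
assumptions, not existing API:

    # Hypothetical API extension: register several dataset URIs as one
    # group, then protect the whole group with a single wildcard policy.
    import requests

    TOKEN = {"subjectid": "MY_SSO_TOKEN"}  # placeholder AA token header
    dataset_uris = ["http://host/dataset/101",  # placeholders for the
                    "http://host/dataset/102"]  # URIs of one validation

    resp = requests.post("http://host/dataset_group",  # hypothetical
                         data={"dataset_uris[]": dataset_uris},
                         headers=TOKEN)
    group_uri = resp.text.strip()  # e.g. http://host/dataset_group/17

    # One wildcard policy for every member of the group:
    requests.post("http://aa-server/pol",  # placeholder policy service
                  data={"uri": group_uri + "/*", "action": "GET"},
                  headers=TOKEN)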
>
>
>
>> Andreas
>>
>> Nina Jeliazkova wrote on 02/25/2011 12:06 PM:
>>
>>> Andreas,
>>>
>>> I have not thought about it in detail, but bearing in mind the
>>> differences between the dataset implementations at Freiburg and ours, I
>>> think part of the problem is that (AFAIK) your implementation makes a
>>> full copy of the dataset on each run, even when the same URIs are used
>>> (e.g. the same records in the database).
>>>
>>> So maybe this is just implementation-specific?
>>>
>>> Nina
>>>
>>> On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de> wrote:
>>>
>>>    Dear all,
>>>
>>>    since a single validation of a model on a dataset creates multiple
>>>    resources (currently > 50), and because everything in OpenTox is
>>>    decentralized (i.e. linked via URIs) and referenceable, we are
>>>    currently facing prohibitively high load on the AA services: a
>>>    policy must be created and queried multiple times (and eventually
>>>    deleted) for each of these resources.
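
Roughly, the per-resource AA traffic looks like this (sketch; the policy
service URL and payload are placeholders for the actual AA calls):

    # One policy is created per resource, checked on every access,
    # and eventually deleted again for temporary resources.
    import requests

    POLICY_SERVICE = "http://aa-server/pol"  # placeholder URL
    TOKEN = {"subjectid": "MY_SSO_TOKEN"}    # placeholder token header

    created = []  # in practice: the > 50 URIs of one validation run
    for uri in created:
        requests.post(POLICY_SERVICE, data={"uri": uri}, headers=TOKEN)
        # ... later: a policy check per access, plus a DELETE on the
        # policy for every temporary resource.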
>>>
>>>    For example, the spike at the very right of
>>>    http://tinyurl.com/6amuo8x is produced by a single validation.
>>>    Moreover, the validation service is very slow; the AA-related part
>>>    alone takes at least several minutes. All of this is caused by the
>>>    number of individual policies that have to be created.
>>>
>>>    Martin argues that there is currently no API-compliant way of
>>>    improving performance. One option would be to collect all URIs and
>>>    create a single policy covering all of them at the end of the
>>>    validation. However, there is no way of notifying the services
>>>    involved in a validation not to create policies in the first place.
>>>    Also, without policies, validation would have no way to access the
>>>    resources, since the default (no associated policy) is "deny".
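
A sketch of how the collection strategy might look as an API extension;
the no_policy flag and the bulk policy endpoint below are hypothetical:

    # Involved services are told (hypothetically) to skip per-resource
    # policy creation; the validation service registers one covering
    # policy at the end of the run.
    import requests

    TOKEN = {"subjectid": "MY_SSO_TOKEN"}  # placeholder token header
    collected_uris = []                    # filled during the validation

    # Hypothetical flag; note the service would still need some interim
    # access rule, since the default without a policy is "deny".
    requests.post("http://model-service/algorithm/someAlgorithm",
                  data={"dataset_uri": "http://host/dataset/1",
                        "no_policy": "true"},
                  headers=TOKEN)

    # One bulk policy for everything collected during the run:
    requests.post("http://aa-server/pol/bulk",  # hypothetical endpoint
                  data={"uris[]": collected_uris}, headers=TOKEN)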
>>>
>>>    We consider this issue high priority; it should be dealt with
>>>    before everyone starts using validation in production. Perhaps we
>>>    need an API extension that allows the collection strategy outlined
>>>    above, or are there other suggestions?
>>>
>>>    Best regards
>>>    Andreas
>>>    _______________________________________________
>>>    Development mailing list
>>>    Development at opentox.org
>>>
>>>    http://www.opentox.org/mailman/listinfo/development
>>>
>>>
>>>
>> --
>> http://www.maunz.de
>>
>>            According to my calculations the problem doesn't exist.
>>
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>



-- 
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetlein at informatik.uni-freiburg.de


