[OTDev] Validation: Efficiency

Nina Jeliazkova jeliazkova.nina at gmail.com
Fri Feb 25 13:32:51 CET 2011


On 25 February 2011 14:26, Martin Guetlein
<martin.guetlein at googlemail.com> wrote:

> On Fri, Feb 25, 2011 at 12:53 PM, Nina Jeliazkova
> <jeliazkova.nina at gmail.com> wrote:
> > Andreas,
> >
> > On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:
> >
> >> Nina,
> >>
> >> you are right (I think it still is the case that datasets are redundant).
> >> However, with different model parameters, which will probably be used a lot
> >> in validation, new datasets will be created.
> >> I think it would definitely be necessary to not store data redundantly (as
> >> you indicated), but that might be only part of the solution.
> >> So it may still be necessary to reduce the number of policies needed.
> >>
> >>
> > Well, thinking further
> >
> > 1)  I would implement validation splits (at least at our services) as
> > logical splits of the same dataset, assigning some tags, similar to what is
> > in the mutagenicity Benchmark dataset (look for column "Set"
> > http://apps.ideaconsult.net:8080/ambit2/feature/28956 )
> >
> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100
> >
> > and introduce searching similar to the queries below (restricted to the
> > property in question)
> >
> > Training set
> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN
> >
> > Crossvalidation sets
> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
> > ...
> >
> >
> > Thus, everything is in the original dataset (or a single copy of it on
> > another dataset service) and there is no need for additional policies.
> >
> >
> > Different features, calculated during a validation run, would be specified via
> > the feature_uris[] parameter on the same dataset URI.
> >
> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3&feature_uris[]=..
>
> My concerns regarding using one dataset for everything:
>
> * This would allow the algorithm to remove the search string and
> request data that is supposed to be unseen. What do people think about
> that?
>
>
I am not sure you really prevent this with the current approach - what
prevents the algorithm from retrieving any other data from any dataset
service?  Removing the search string changes the URI, but then the algorithm
service could just as well change the entire URI and receive any other dataset
it has access to.
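
To make the point concrete, here is a minimal sketch of the URIs involved. It
only builds and compares URI strings for the example dataset from my earlier
message; the helper names split_uri and strip_query are made up for
illustration and are not part of the OpenTox API. How a policy service would
treat the two URIs is not modelled here.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Example dataset URI taken from the earlier message in this thread.
BASE = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"


def split_uri(base, tag):
    """Hypothetical helper: URI of a logical split, i.e. the same dataset
    filtered by a tag such as TRAIN, CV1, CV2, ..."""
    return base + "?" + urlencode([("search", tag)])


def strip_query(uri):
    """Hypothetical helper: what a service would do to drop the search string."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


train = split_uri(BASE, "TRAIN")
folds = [split_uri(BASE, f"CV{i}") for i in range(1, 4)]

# Dropping the search string yields a different URI (the whole dataset),
# which is just one of many other URIs a service could request instead.
whole = strip_query(train)
```

Here whole is simply BASE again, i.e. a different resource URI than the
training split, the same as if the service had asked for any other dataset.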


> * Still 10 models would be created (and 10 validations, but I could
> try to solve this internally), so we would not end up with 1 policy
> for a crossvalidation.
>
> Unless predictions are stored in the same dataset.


> * Our services (ALU and IST) do not support storing results in
> existing datasets nor providing subsets of a dataset with
> search-strings (or feature_uri[], compound_uri[]). To implement this
> would require quite some effort.
>

Yes, agree.

What about the grouping datasets option?
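
As a rough sketch of what I mean (option 2 in my quoted message below): one
wildcard policy covers every dataset in a group, with "deny" as the default
when no policy matches. Everything here is an assumption for illustration:
the example.org URIs, the idea that member datasets live under the group URI,
the glob-style pattern syntax, and the is_allowed function. None of it is the
actual policy server behaviour.

```python
from fnmatch import fnmatchcase

# Hypothetical "group of datasets" resource and its ten member datasets,
# e.g. the datasets created during a 10-fold crossvalidation.
group_uri = "http://example.org/datasetgroup/1"
members = [f"{group_uri}/dataset/{i}" for i in range(1, 11)]

# One hypothetical wildcard policy for the whole group,
# instead of one policy per dataset.
policies = {f"{group_uri}/*": {"GET"}}


def is_allowed(uri, method, policies):
    """Default is deny: allow only if some policy pattern matches the URI
    and that policy lists the requested method."""
    return any(
        fnmatchcase(uri, pattern) and method in methods
        for pattern, methods in policies.items()
    )


allowed = [is_allowed(uri, "GET", policies) for uri in members]
outside = is_allowed("http://example.org/dataset/99", "GET", policies)
```

With this, all ten member datasets are readable through a single policy,
while a dataset outside the group stays denied.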

Nina

>
> Regards,
> Martin
>
>
> > ..
> >
> > 2) Not changing your current approach, perhaps it makes sense to introduce
> > in the API a resource for "groups of datasets", that could be used as a
> > placeholder for URIs of several datasets, and use some wildcards on the
> > policy server to ensure only one policy for the group of datasets is needed.
> >
> > I guess groups of datasets could be useful in other cases as well.
> >
> > Nina
> >
> >
> >
> >> Andreas
> >>
> >> Nina Jeliazkova wrote on 02/25/2011 12:06 PM:
> >>
> >>> Andreas,
> >>>
> >>> I have not thought about it in detail, but having in mind differences in
> >>> dataset implementation at Freiburg and ours, I think part of the problem
> >>> is (AFAIK) your implementation makes a full copy of the dataset on each
> >>> run, regardless of using the same URIs (e.g. as same records in the database)
> >>>
> >>> So maybe this is just implementation-specific?
> >>>
> >>> Nina
> >>>
> >>> On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de
> >>> <mailto:andreas at maunz.de>> wrote:
> >>>
> >>>    Dear all,
> >>>
> >>>    since a single validation of a model on a dataset creates multiple
> >>>    resources (currently > 50), and by the fact that everything is
> >>>    decentralized (i.e. linked via URIs) and referenceable in OpenTox,
> >>>    we are facing the problem that currently prohibitively high load is
> >>>    placed on the AA services, because a policy must be created and
> >>>    requested multiple times (and eventually deleted) for each of the
> >>>    resources.
> >>>
> >>>    For example the spike in http://tinyurl.com/6amuo8x to the very
> >>>    right is produced by a single validation. Moreover, the validation
> >>>    service is very slow, the AA related part alone takes at least
> >>>    several minutes. All this is induced by the amount of single
> >>>    policies that have to be created.
> >>>
> >>>    Martin argues that currently there seems to be no API-compliant way of
> >>>    improving performance: One way could be to collect all URIs and
> >>>    create a policy covering all of them at the end of the validation.
> >>>    However, there is no way of notifying validation-involved services
> >>>    to not create policies in the first place. Also, without policies,
> >>>    there would be no way for validation to access the resource, since
> >>>    default (without associated policy) is "deny".
> >>>
> >>>    We consider this issue high priority, which should be dealt with
> >>>    before everyone starts using validation in production. Perhaps we
> >>>    would need an API extension that allows the collection strategy
> >>>    discussed before, or are there other suggestions?
> >>>
> >>>    Best regards
> >>>    Andreas
> >>>    _______________________________________________
> >>>    Development mailing list
> >>>    Development at opentox.org <mailto:Development at opentox.org>
> >>>
> >>>    http://www.opentox.org/mailman/listinfo/development
> >>>
> >>>
> >>>
> >> --
> >> http://www.maunz.de
> >>
> >>            According to my calculations the problem doesn't exist.
> >>
> >
>
>
>
> --
> Dipl-Inf. Martin Gütlein
> Phone:
> +49 (0)761 203 8442 (office)
> +49 (0)177 623 9499 (mobile)
> Email:
> guetlein at informatik.uni-freiburg.de
>


