[OTDev] Validation: Efficiency

Andreas Maunz andreas at maunz.de
Fri Feb 25 14:12:20 CET 2011


Nina Jeliazkova wrote on 02/25/2011 12:53 PM:
> Andreas,
>
> On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:
>
>     Nina,
>
>     you are right (I think it still is the case that datasets are
>     redundant).
>     However, with different model parameters, which will probably be
>     used a lot in validation, new datasets will be created.
>     I think it would be definitely necessary to not store data
>     redundantly (as you indicated), but that might be only part of the
>     solution.
>     So it may still be necessary to compress the amount of policies needed.
>
>
> Well, thinking further
>
> 1)  I would implement validation splits (at least at our services) as
> logical splits of the same dataset, assigning some tags, similar to
> what is in the mutagenicity benchmark dataset (look for the column "Set",
> http://apps.ideaconsult.net:8080/ambit2/feature/28956 )
>
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100
>
> and introduce searching similar to the queries below (restricted to the
> property in question)
>
> Training set
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN
>
> Crossvalidation sets
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
> ...
>
>
> Thus, everything is in the original dataset (or a single copy of it on
> another dataset service) and there is no need for additional policies.
>
>
> Different features, calculated during a validation run, would be specified
> via the feature_uris[] parameter on the same dataset URI.
>
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3&feature_uris[]=....

This approach is optimal in terms of avoiding redundancy: it imposes
structure while adding only the minimum information required.
I vote that we (try to) go a similar way.
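
To make the idea concrete, here is a rough sketch of how a client might
address the splits as queries on one dataset URI instead of as separate
datasets. Only the dataset URI and the search / feature_uris[] parameters
come from the examples quoted above; the helper names (split_uri,
validation_uris) and the example.org feature URI are made up for
illustration and are not part of any existing service or API.

```python
from urllib.parse import urlencode

# Dataset URI as quoted in the thread; everything else below is illustrative.
DATASET_URI = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"


def split_uri(dataset_uri, tag, feature_uris=()):
    """Build a query URI selecting the compounds tagged `tag` (hypothetical helper)."""
    params = [("search", tag)]
    # One feature_uris[] entry per feature that should be included.
    params += [("feature_uris[]", uri) for uri in feature_uris]
    # Leave the brackets of feature_uris[] unescaped so the key stays readable.
    return dataset_uri + "?" + urlencode(params, safe="[]")


def validation_uris(dataset_uri, folds, feature_uris=()):
    """Map each split tag (TRAIN, CV1..CVn) to its query URI (hypothetical helper)."""
    tags = ["TRAIN"] + [f"CV{i}" for i in range(1, folds + 1)]
    return {tag: split_uri(dataset_uri, tag, feature_uris) for tag in tags}


if __name__ == "__main__":
    for tag, uri in validation_uris(DATASET_URI, folds=3).items():
        print(tag, uri)
```

All of these URIs point at the same stored dataset, so a single policy on
that dataset would cover every split.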

Andreas


