[OTDev] Validation: Efficiency

Martin Guetlein martin.guetlein at googlemail.com
Fri Feb 25 14:21:56 CET 2011


On Fri, Feb 25, 2011 at 2:19 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com> wrote:
>
>
> On 25 February 2011 15:11, Martin Guetlein <martin.guetlein at googlemail.com>
> wrote:
>>
>> On Fri, Feb 25, 2011 at 1:32 PM, Nina Jeliazkova
>> <jeliazkova.nina at gmail.com> wrote:
>> >
>> >
>> > On 25 February 2011 14:26, Martin Guetlein
>> > <martin.guetlein at googlemail.com>
>> > wrote:
>> >>
>> >> On Fri, Feb 25, 2011 at 12:53 PM, Nina Jeliazkova
>> >> <jeliazkova.nina at gmail.com> wrote:
>> >> > Andreas,
>> >> >
>> >> > On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:
>> >> >
>> >> >> Nina,
>> >> >>
>> >> >> you are right (I think it is still the case that datasets are
>> >> >> redundant). However, with different model parameters, which will
>> >> >> probably be used a lot in validation, new datasets will be created.
>> >> >> It would definitely be necessary not to store data redundantly (as
>> >> >> you indicated), but that might be only part of the solution. So it
>> >> >> may still be necessary to reduce the number of policies needed.
>> >> >>
>> >> >>
>> >> > Well, thinking further:
>> >> >
>> >> > 1) I would implement validation splits (at least at our services) as
>> >> > logical splits of the same dataset, assigning some tags, similar to
>> >> > what is in the mutagenicity benchmark dataset (look for the column
>> >> > "Set" at http://apps.ideaconsult.net:8080/ambit2/feature/28956 )
>> >> >
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100
>> >> >
>> >> > and introduce searching similar to the queries below (restricted to
>> >> > the property in question):
>> >> >
>> >> > Training set
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN
>> >> >
>> >> > Cross-validation sets
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
>> >> > ...
>> >> >
>> >> >
>> >> > Thus, everything is in the original dataset (or a single copy of it
>> >> > on another dataset service), and no additional policies are needed.
>> >> >
>> >> >
>> >> > Different features, calculated during the validation run, would be
>> >> > specified via the feature_uris[] parameter on the same dataset URI:
>> >> >
>> >> > http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3&feature_uris[]=..
>> >>
>> >> My concerns regarding using one dataset for everything:
>> >>
>> >> * This would allow the algorithm to remove the search string and
>> >> request data that is supposed to be unseen. What do people think about
>> >> that?
>> >>
>> >
>> > I am not sure you really prevent this with the current approach - what
>> > prevents the algorithm from retrieving any other data from any dataset
>> > service? Removing the search string changes the URI, but then the
>> > algorithm service could just change the entire URI and retrieve any
>> > other dataset it has access to.
>> >
>> >>
>> >> * Still 10 models would be created (and 10 validations, but I could
>> >> try to solve this internally), so we would not end up with 1 policy
>> >> for a cross-validation.
>> >>
>> > Unless predictions are stored in the same dataset.
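[Editor's note: the tag-based split scheme discussed above can be sketched without a running service. The following is a minimal Python sketch, not OpenTox code; the "Set" tag values (TRAIN, CV1..CVn) and the ?search= query mirror the URLs in the thread, while the compounds and fold assignment are invented for illustration.]

```python
# Sketch of the proposal above: one dataset, with validation splits
# expressed as logical tags (a "Set" column) rather than physical copies.
# Tag values (CV1..CVn) mirror the ?search= queries quoted in the thread;
# the compounds and fold assignment below are made-up illustration data.

BASE = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"

def assign_folds(compounds, n_folds):
    """Tag each compound with a cross-validation fold label (CV1..CVn)."""
    return {c: "CV%d" % (i % n_folds + 1) for i, c in enumerate(compounds)}

def search(tags, query):
    """Emulate GET <dataset>?search=<query>: filter by tag, no new dataset."""
    return sorted(c for c, tag in tags.items() if tag == query)

def fold_uris(n_folds):
    """URIs for each logical split -- all refer to the *same* dataset."""
    return [BASE + "?search=CV%d" % i for i in range(1, n_folds + 1)]

compounds = ["c%d" % i for i in range(1, 10)]  # 9 hypothetical compounds
tags = assign_folds(compounds, 3)

# Each fold is just a filtered view; together the folds cover the dataset.
folds = [search(tags, "CV%d" % i) for i in range(1, 4)]
assert sorted(sum(folds, [])) == sorted(compounds)
```

The point of the scheme is visible here: a single dataset (and hence a single access policy) backs every split URI, because each ?search=CVn view is a filter over the same resource, not a copy.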


I pressed the wrong reply button; here is what I sent to Nina, and her reply:


>> I counted the 10 models as resources with policies. If each of them
>> produces a new prediction dataset (rather than storing predictions in
>> the same dataset), it's 20.
>>
>> One more thought / concern:
>> What if features have to be created (and selected) on each training
>> fold and stored in the same single dataset? This could lead to
>> problems: the super-service / feature-creation model has to make sure
>> not to mix things up (e.g., not to reuse features created on a
>> previous fold).
>
> If each model generates a new feature URI (as it should), this is not an
> issue. A new model also generates a new prediction (a new column in a
> dataset), which naturally means a different object and a different URI.
> Otherwise the meaning of a resource, in RDF at least, is blurred.
>
> A feature generated by a different model should have a different
> ot:hasSource, pointing to that model. If one feature is shared between
> models, that introduces an inconsistency, as there will be pointers to
> two models and the superservice will not know which one to use for
> calculation. I hope we defined ot:hasSource as single-valued in the
> ontology; if not, we should correct it.
>
> Nina
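[Editor's note: Nina's single-valued ot:hasSource rule can be sketched as a small consistency check. This is a hypothetical Python sketch, not the ontology or superservice code; all model and feature URIs below are invented for illustration.]

```python
# Sketch of the ot:hasSource rule discussed above: every feature points
# to exactly one generating model, so a superservice always knows which
# model to invoke for calculation. All URIs here are invented examples.

def make_feature(feature_uri, model_uri):
    """A feature whose ot:hasSource is single-valued (one model URI)."""
    return {"uri": feature_uri, "ot:hasSource": model_uri}

def check_sources(features):
    """Reject features whose ot:hasSource is not a single model URI."""
    for f in features:
        src = f["ot:hasSource"]
        if not isinstance(src, str):
            raise ValueError("ot:hasSource must be single-valued: %r" % f["uri"])
    return True

# Two CV folds, each model generating its *own* feature URI (no sharing):
features = [
    make_feature("/feature/f1_cv1", "/model/m_cv1"),
    make_feature("/feature/f1_cv2", "/model/m_cv2"),
]
assert check_sources(features)

# A feature "shared" between two models would need two sources -- the
# inconsistency Nina describes -- and is rejected:
bad = {"uri": "/feature/shared",
       "ot:hasSource": ["/model/m_cv1", "/model/m_cv2"]}
try:
    check_sources([bad])
except ValueError:
    pass  # correctly rejected
```

This also addresses the earlier fold-mixing concern: if every fold's model mints fresh feature URIs, a feature from a previous fold can never be silently reused, because its ot:hasSource would point at the wrong model.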


