[OTDev] Validation: Efficiency
Martin Guetlein martin.guetlein at googlemail.com
Fri Feb 25 13:26:10 CET 2011
- Previous message: [OTDev] Validation: Efficiency
- Next message: [OTDev] Validation: Efficiency
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, Feb 25, 2011 at 12:53 PM, Nina Jeliazkova
<jeliazkova.nina at gmail.com> wrote:
> Andreas,
>
> On 25 February 2011 13:28, Andreas Maunz <andreas at maunz.de> wrote:
>
>> Nina,
>>
>> you are right (I think it still is the case that datasets are redundant).
>> However, with different model parameters, which will probably be used a lot
>> in validation, new datasets will be created.
>> I think it would be definitely necessary to not store data redundantly (as
>> you indicated), but that might be only part of the solution.
>> So it may still be necessary to compress the amount of policies needed.
>>
>>
> Well, thinking further
>
> 1) I would implement validation splits (at least at our services) as
> logical splits of the same dataset , assigning some tags, similar to what is
> in the mutagenicity Benchmark dataset (look for column "Set"
> http://apps.ideaconsult.net:8080/ambit2/feature/28956 )
>
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?max=100
>
> and introduce searching similar to the queries below (restricted to the
> property in question)
>
> Training set
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=TRAIN
>
> Crossvalidation sets
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV1
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV2
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3
> ...
>
>
> Thus, everything is in the original dataset (or a single copy of it on
> another dataset service) and no need of additional policies.
>
>
> Different features , calculated during validation run would be specified via
> feature_uris[] parameter on the same dataset URI.
>
> http://apps.ideaconsult.net:8080/ambit2/dataset/2344?search=CV3?feature_uris[]=..

My concerns regarding using one dataset for everything:

* This would allow the algorithm to remove the search string and request
  data that is supposed to be unseen. What do people think about that?
* Still 10 models would be created (and 10 validations, but I could try to
  solve this internally), so we would not end up with 1 policy for a
  crossvalidation.
* Our services (ALU and IST) do not support storing results in existing
  datasets or providing subsets of a dataset with search-strings (or
  feature_uri[], compound_uri[]). Implementing this would require quite
  some effort.

Regards,
Martin

> ..
>
> 2) Not changing your current approach, perhaps it makes sense to introduce
> in the API a resource for "groups of datasets" , that could be used as a
> placeholder for URIs of several datasets, and use some wildcards on the
> policy server to ensure only one policy for the group of dataset is needed.
>
> I guess groups of datasets could be useful in other cases as well.
>
> Nina
>
>
>
>> Andreas
>>
>> Nina Jeliazkova wrote on 02/25/2011 12:06 PM:
>>
>>> Andreas,
>>>
>>> I have not thought about it in detail, but having in mind differences in
>>> dataset implementation at Freiburg and ours, I think part of the problem
>>> is (AFAIK) your implementation makes full copy of the dataset on each
>>> run, regardless of using same URIs (e.g. as same records in the database)
>>>
>>> So may be this is just an implementation specific?
>>>
>>> Nina
>>>
>>> On 25 February 2011 13:02, Andreas Maunz <andreas at maunz.de
>>> <mailto:andreas at maunz.de>> wrote:
>>>
>>> Dear all,
>>>
>>> since a single validation of a model on a dataset creates multiple
>>> ressources (currently > 50), and by the fact that everything is
>>> decentralized (i.e. linked via URIs) and referenceable in OpenTox,
>>> we are facing the problem that currently prohibitively high load is
>>> placed on the AA services, because a policy must be created and
>>> requested multiple times (and eventually deleted) for each of the
>>> resources.
>>>
>>> For example the spike in http://tinyurl.com/6amuo8x to the very
>>> right is produced by a single validation.
>>> Moreover, the validation service is very slow, the AA related part
>>> alone takes at least several minutes. All this is induced by the
>>> amount of single policies that have to be created.
>>>
>>> Martin argues that currently there seems no API compliant way of
>>> improving performance: One way could be to collect all URIs and
>>> create a policy covering all of them at the end of the validation.
>>> However, there is no way of notifying validation-involved services
>>> to not create policies in the first place. Also, without policies,
>>> there would be no way for validation to access the resource, since
>>> default (without associated policy) is "deny".
>>>
>>> We consider this issue high priority, which should be dealt with
>>> before everyone starts using validation in production. Perhaps we
>>> would need an API extension that allows the collection strategy
>>> discussed before, or are there other suggestions?
>>>
>>> Best regards
>>> Andreas
>>> _______________________________________________
>>> Development mailing list
>>> Development at opentox.org <mailto:Development at opentox.org>
>>>
>>> http://www.opentox.org/mailman/listinfo/development
>>>
>>>
>>>
>> --
>> http://www.maunz.de
>>
>> According to my calculations the problem doesn't exist.
>>
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>

--
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetlein at informatik.uni-freiburg.de
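
For readers following the thread, the "logical split" proposal quoted above can be sketched as plain URI construction. This is only an illustration under stated assumptions: the helper names `split_uri` and `crossvalidation_uris` are hypothetical, the tags TRAIN and CV1..CVn are taken from the example queries in the quoted message, and the code assumes `feature_uris[]` is appended as a second query parameter with `&` (the quoted example URL uses a second `?`, which a standard query-string parser would not treat as a separator). No request is sent to any service.

```python
from urllib.parse import urlencode


def split_uri(dataset_uri, tag, feature_uris=()):
    """Build a URI selecting the rows of one dataset that carry `tag`.

    Hypothetical helper: `search` and `feature_uris[]` are the parameter
    names used in the quoted message; whether a given dataset service
    honours them is an assumption.
    """
    params = [("search", tag)]
    params += [("feature_uris[]", uri) for uri in feature_uris]
    # safe="[]" keeps the brackets of "feature_uris[]" readable in the output.
    return dataset_uri + "?" + urlencode(params, safe="[]")


def crossvalidation_uris(dataset_uri, folds, feature_uris=()):
    """Return one split URI per fold, tagged CV1..CV<folds>."""
    return [
        split_uri(dataset_uri, "CV%d" % i, feature_uris)
        for i in range(1, folds + 1)
    ]


DATASET = "http://apps.ideaconsult.net:8080/ambit2/dataset/2344"
FEATURE = "http://apps.ideaconsult.net:8080/ambit2/feature/28956"

train_uri = split_uri(DATASET, "TRAIN")
fold_uris = crossvalidation_uris(DATASET, 3)
fold_with_feature = split_uri(DATASET, "CV3", [FEATURE])

if __name__ == "__main__":
    print(train_uri)
    for uri in fold_uris:
        print(uri)
    print(fold_with_feature)
```

Every split URI shares the same dataset URI as its prefix, which is why a single policy on the dataset would cover all of them. It also makes the first concern above concrete: a client that holds any split URI can drop the `search` parameter and reach the full dataset, so the split by itself does not hide unseen data.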