[OTDev] descriptor recalculation

Mon Apr 26 17:04:13 CEST 2010

On Tue, Apr 20, 2010 at 7:52 PM, Nina Jeliazkova <nina at acad.bg> wrote:

> Hi Tobias, All,
>
> I am trying to think of the API extension/change necessary to include the
> intermediate descriptor calculation service (preferably without making
> it mandatory).
>
> My suggestions:
> 1) For models and algorithms, there might be an (optional) parameter,
> pointing to an URI of the new calculation service.  If the parameter is
> missing and descriptor calculation is necessary, the model either
> initiates the calculations itself, or returns an error.
>
> 2) The API of the new calculation service (is 'recalculation' the right
> name for it? Perhaps 'proxy calculation service' has a closer meaning.)
>
> GET: RDF representation of ot:Algorithm object; with algorithm type as
> in  3)
>
> POST
> parameters:
> dataset_uri - as usual
> model_uri - the uri of the model
> or (alternatively) algorithm_uris[] - list of descriptor calculation
> algorithms
>
> POST result -  returns URI of the dataset with calculated descriptors
>
> 3) Eventually include a new type in the Algorithms type ontology for the
> proxy calculation service, and use it in the RDF representation of the
> proxy service; thus one could find out whether such a service exists within
> the list of algorithms.
>
>
> Will this be sufficient? I am not sure whether I am missing something
> important, discussion welcome!
>
> Best regards,
> Nina
>
>
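
To make the POST interface proposed above concrete, here is a minimal sketch of how a client might assemble the form parameters. Only the parameter names (dataset_uri, model_uri, algorithm_uris[]) come from the proposal; the helper name, the example.org URIs, and the rule that exactly one of model_uri or algorithm_uris[] is given are assumptions made for illustration. The sketch only builds the request body and does not contact any service.

```python
from urllib.parse import urlencode, parse_qs

def build_recalculation_request(dataset_uri, model_uri=None, algorithm_uris=None):
    """Build the form-encoded POST body (hypothetical helper, not part of any API)."""
    # Assumption: model_uri and algorithm_uris[] are mutually exclusive alternatives.
    if (model_uri is None) == (algorithm_uris is None):
        raise ValueError("pass exactly one of model_uri or algorithm_uris")
    params = {"dataset_uri": dataset_uri}
    if model_uri is not None:
        params["model_uri"] = model_uri
    else:
        # doseq=True below repeats the key once per list entry.
        params["algorithm_uris[]"] = list(algorithm_uris)
    return urlencode(params, doseq=True)

# Placeholder URIs, purely illustrative.
body = build_recalculation_request(
    "http://example.org/dataset/1",
    algorithm_uris=[
        "http://example.org/algorithm/a",
        "http://example.org/algorithm/b",
    ],
)
decoded = parse_qs(body)
```

According to the proposal, the service would answer such a POST with the URI of the dataset that contains the calculated descriptors.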

Hello all,

I just discussed the 'PCA problem' that we discovered during the GoToMeeting
with Andreas M. Our proposal requires further API changes, but it should
work.

------------------------------------

Problem description:
(We use Variant 2 from Tobias' proposal.)

Step 1 - Client builds model with params:
* training-dataset (only compounds, no features)
* feature-mining-algorithm
* feature-selection-algorithm (PCA)

The model service passes the params to the feature (re)calculation service.
The feature-recalculation-service first applies the
feature-mining-algorithm, then applies the feature-selection-algorithm. The
feature-recalculation-service passes a new feature dataset back to the model
service. The model service builds the model based on the feature dataset.
The model service passes the model-uri to the client.

Step 2 - Client predicts a compound with the model
* uri: model-uri
* param: test compound (plain compounds, no features)

The model service passes the feature mining params (stored in the model???)
to the (re)calculation service to calculate the features for the test
compound.
PROBLEM: The feature-selection-algorithm PCA needs information from the
original training dataset to apply the same mechanism to the test features!
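
To illustrate why the training data matters here, a minimal sketch in plain Python (one principal component only, found by power iteration). The function names and the toy numbers are assumptions for illustration and are not part of any OpenTox service. The point is that projecting a test compound needs the mean and the direction that were learned from the training features.

```python
import math

def fit_pca_1d(rows, iterations=200):
    """Learn (mean, direction) of the first principal component from training rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the training features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the leading eigenvector of cov.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(row, mean, direction):
    """Apply the stored training statistics to a single (test) feature vector."""
    return sum((x - m) * d for x, m, d in zip(row, mean, direction))

# Toy training features (two descriptors per compound).
training = [[2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 7.0]]
mean, direction = fit_pca_1d(training)

# The test compound can only be projected with mean and direction from training.
test_compound = [7.0, 6.0]
score = project(test_compound, mean, direction)
```

Without the stored mean and direction, the feature-recalculation-service has no way to reproduce the same transformation for the test compound, which is exactly the problem above.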

------------------------------------

Proposal:
* Distinguish between feature selection Algorithms and Models.
* Distinguish between a 'mine-feature-step' (on training data) and a
'check-feature-step' (test-compound).
* The 'mine-feature-step' is available from the algorithm and builds the
model (as well as the features for the dataset).
* The 'check-feature-step' is available for models only.
* Feature selection algorithms, where 'checking' is independent of 'mining',
are by definition models (like chi-square-filter).

In the use case described above, the feature-recalculation-service would
return a list of feature-selection-model-uris, which are stored in the
model, and can be used for recalculating the features for the test compound.
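
A rough sketch of how this split could look, under stated assumptions: the class names, the in-memory registry standing in for URI lookup, and the URI scheme are all hypothetical, and a simple "drop constant columns" filter replaces PCA to keep the example short. The 'mine' step runs on the training data and returns a feature selection model; the prediction model stores only that model's URI and later calls 'check' for the test compound.

```python
from dataclasses import dataclass

@dataclass
class FeatureSelectionModel:
    """State learned in the mine step (here: which feature columns to keep)."""
    uri: str
    kept_columns: list

    def check(self, features):
        """Check step: apply the stored selection to one feature vector."""
        return [features[i] for i in self.kept_columns]

class ConstantColumnFilterAlgorithm:
    """Stand-in feature selection algorithm: drops columns that never vary."""
    def __init__(self):
        self._count = 0

    def mine(self, rows):
        """Mine step: learn from training rows, return (selected rows, model)."""
        kept = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
        self._count += 1
        model = FeatureSelectionModel(
            uri=f"/feature-selection-model/{self._count}", kept_columns=kept)
        return [model.check(r) for r in rows], model

@dataclass
class PredictionModel:
    """The prediction model only remembers the URIs of the selection models."""
    feature_selection_model_uris: list

registry = {}  # hypothetical stand-in for resolving a model URI

# Step 1: training
algorithm = ConstantColumnFilterAlgorithm()
training_features = [[1.0, 0.0, 3.0], [2.0, 0.0, 5.0], [4.0, 0.0, 1.0]]
selected, fs_model = algorithm.mine(training_features)
registry[fs_model.uri] = fs_model
prediction_model = PredictionModel([fs_model.uri])

# Step 2: prediction, reusing the stored feature selection model
test_features = [9.0, 7.0, 2.0]
for uri in prediction_model.feature_selection_model_uris:
    test_features = registry[uri].check(test_features)
```

For PCA the stored state would be the training mean and components instead of a column list, but the flow of URIs between the services would be the same.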

Alternatively, we could, to begin with, not support feature
selection/preprocessing algorithms like PCA.

What do you think?

Best regards,
Martin

>
>
> Tobias Girschick wrote:
> > Hi Nina,
> >
> > the green and the black lines are two possibilities to go through the
> > workflow. In the pdf the workflow has to be read from bottom to top
> > (more or less). Everything starts with some prediction application (e.g.
> > ToxPredict or a ValidationService,...) that needs descriptors to be
> > recalculated for prediction. I added the third variant in red arrows and
> > split the one slide into three to make it easier to read.
> >
> > In version 1 (black) no descriptor recalculation service is needed and
> > every model service has to delegate the descriptor recalculation to all
> > descriptor calculation services.
> > In version 2 (green) the descriptor recalculation service is called by
> > the model service. The recalc service delegates the necessary descriptor
> > calculations. In both cases the model service gets a dataset that does
> > not have all the descriptors needed to use the model for predicting the
> > dataset.
> > In version 3 (red) the descriptor recalculation service is called
> > directly by the application, delegates the descriptor calculations and
> > updates the dataset. This updated dataset is then submitted by the
> > application itself to the model service.
> >
> > I hope this clarifies my rough sketch from last week.
> >
> > regards,
> > Tobias
> >
> > On Tue, 2010-04-20 at 14:42 +0300, Nina Jeliazkova wrote:
> >
> >> Hi Tobias,
> >>
> >> Could you tell what's the difference between black and green lines in
> >> your schema?
> >>
> >> I would suggest starting a new wiki page under API to discuss descriptor
> >> calculator and its API.
> >>
> >> Best regards,
> >> Nina
> >>
> >> Tobias Girschick wrote:
> >>
> >>> Hi All,
> >>>
> >>> I attached one slide which illustrates the problem from my point of
> >>> view. The green and the black lines are the two possibilities. Note that
> >>> the "descriptor recalculator" has to be implemented only once (if it is
> >>> generic). Otherwise, every new algorithm that learns models has to
> >>> provide the whole functionality of calling all the different descriptor
> >>> calculation services.
> >>>
> >>> I think that wrapping the distribution to the different descriptor
> >>> calculation services makes things a lot easier.
> >>>
> >>> Just to kick off the discussion again.
> >>> regards,
> >>> Tobias
> >>>
> >>>
> >>>
> >
> >
> >
>
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>

-- 
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
guetlein at informatik.uni-freiburg.de