[OTDev] descriptor recalculation

Nina Jeliazkova nina at acad.bg
Mon Apr 26 17:32:21 CEST 2010


Hi,

Martin Guetlein wrote:
> On Tue, Apr 20, 2010 at 7:52 PM, Nina Jeliazkova <nina at acad.bg> wrote:
>
>   
>> Hi Tobias, All,
>>
>> I am trying to think of the API extension/change necessary to include the
>> intermediate descriptor calculation service (preferably without making
>> it mandatory).
>>
>> My suggestions:
>> 1) For models and algorithms, there might be an (optional) parameter
>> pointing to a URI of the new calculation service.  If the parameter is
>> missing and descriptor calculation is necessary, the model either
>> initiates the calculations itself, or returns an error.
>>
>> 2) The API of the new calculation service (is 'recalculation' the right
>> name for it - perhaps 'proxy calculation service' has a closer meaning?)
>>
>> GET: RDF representation of ot:Algorithm object; with algorithm type as
>> in  3)
>>
>> POST
>> parameters:
>> dataset_uri - as usual
>> model_uri - the uri of the model
>> or (alternatively) algorithm_uris[] - list of descriptor calculation
>> algorithms
>>
>> POST result -  returns URI of the dataset with calculated descriptors
>>
>> 3) eventually include a new type in the Algorithm types ontology for the
>> proxy calculation service, and use it in the RDF representation of the
>> proxy service; thus one could find out whether such a service exists
>> within the list of algorithms.
>>
>>
>> Will this be sufficient?  I am not sure whether I am missing something
>> important - discussion welcome!
>>
>> Best regards,
>> Nina
>>
>>
>>     
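To make the POST in 2) above a bit more concrete, a rough client-side
sketch of calling the proposed proxy calculation service could look like
this (the service, dataset and model URIs are placeholders, and the Python
'requests' library is used only for illustration):

import requests

# Placeholder URI of the proxy (re)calculation service
CALC_SERVICE = "http://example.org/algorithm/descriptor_proxy"

# POST as proposed in 2): dataset_uri plus either model_uri
# or a list of algorithm_uris[]
response = requests.post(
    CALC_SERVICE,
    data={
        "dataset_uri": "http://example.org/dataset/1",
        "model_uri": "http://example.org/model/1",
    },
    headers={"Accept": "text/uri-list"},
)
response.raise_for_status()

# The result is the URI of the dataset with the calculated descriptors
new_dataset_uri = response.text.strip()
print(new_dataset_uri)
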
>
> Hello all,
>
>
> I just discussed the 'PCA-problem' that we discovered during the gotomeeting
> with Andreas M. Our proposal requires further API changes, but it should
> work.
>
> ------------------------------------
>
> Problem description:
> (We use Variant 2 from Tobias' proposal.)
>
> Step 1 - Client builds model with params:
> * training-dataset (only compounds, no features)
> * feature-mining-algorithm
>   
I guess you mean descriptor calculation here.
> * feature-selection-algorithm (PCA)
>
> The model service passes the params to the feature (re)calculation service.
> The feature-recalculation-service first applies the
> feature-mining-algorithm, then applies the feature-selection-algorithm. The
> feature-recalculation-service passes a new feature dataset back to the model
> service. The model service builds the model based on the feature dataset.
> The model service passes the model-uri to the client.
>
>   
This will not quite fit algorithms which require no structures, but take
descriptors directly as input.
> Step 2 - Client predicts a compound with the model
> * uri: model-uri
> * param: test compound (plain compounds, no features)
>
> The model service passes the feature mining params (stored in the model???)
>   
The independent variables are already stored in the model representation
under the ot:independentVariables property (see the small sketch below for
reading them).
> to the (re)calculation service to calculate the features for the test
> compound.
> PROBLEM: The feature-selection-algorithm PCA needs information from the
> original training dataset to apply the same mechanism to the test features!
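Regarding "stored in the model???" above - the model RDF already lists the
descriptors it was built from, so the (re)calculation service could read
them along these lines (just a sketch; the model URI is a placeholder and
the OpenTox 1.1 namespace is assumed):

from rdflib import Graph, Namespace

OT = Namespace("http://www.opentox.org/api/1.1#")  # assumed OT namespace

g = Graph()
# Placeholder model URI; GET its RDF/XML representation
g.parse("http://example.org/model/1", format="xml")

# Each independent variable is a feature URI whose values
# must be (re)calculated for the test dataset
for feature in g.objects(None, OT.independentVariables):
    print(feature)
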
>
> ------------------------------------
>
> Proposal:
> * Distinguish between feature selection Algorithms and Models.
>   
OK
> * Distinguish between a 'mine-feature-step' (on training data) and a
> 'check-feature-step' (test-compound).
>   
Why? In both cases the result is generated descriptors, be it from some
mining procedure or from reading pre-calculated descriptors from
elsewhere.
"Mining" is not always the right word, especially for the common class of
models using calculated whole-molecule descriptors.
> * The 'mine-feature-step' is available from the algorithm and builds the
> model (as well as the features for the dataset)
>   
This removes the separation of learning algorithms and descriptor
calculations, which is not a good approach IMHO.
> * The 'check-feature-step' is available for models only.
> * Feature selection algorithms, where 'checking' is independent of 'mining',
> are by definition models (like a chi-square filter).
>
>   
Yes indeed, feature selection algorithms like PCA might be considered
models, because the result from such a model will be a dataset.
However, other feature selection algorithms simply return a list of
feature URIs.
Then we need no API extension - PCA will just be an algorithm, building
a model with specific parameters and a dataset, and storing the results in
a dataset and related features.  The prediction model itself will use
the result dataset.
We might need a kind of super-model, encapsulating PCA + the prediction
model in this case (see the sketch below).
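Such a super-model would essentially just chain two POSTs, something like
the following (all URIs below are placeholders, and 'requests' is used
only to illustrate the calls):

import requests

def post_uri(service_uri, **params):
    """POST form parameters, return the resulting URI (Accept: text/uri-list)."""
    r = requests.post(service_uri, data=params,
                      headers={"Accept": "text/uri-list"})
    r.raise_for_status()
    return r.text.strip()

# Placeholder URIs, for illustration only
test_dataset = "http://example.org/dataset/test-compounds"
pca_model = "http://example.org/model/pca-1"         # built by the PCA algorithm
prediction_model = "http://example.org/model/mlr-1"  # built on the PCA-transformed data

# Prediction with the super-model = apply the PCA model, then the prediction model
transformed = post_uri(pca_model, dataset_uri=test_dataset)
result_dataset = post_uri(prediction_model, dataset_uri=transformed)
print(result_dataset)
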

In fact, I was advocating some months earlier that we need to
distinguish between descriptor calculation algorithms and descriptor
calculation models - there are several examples of descriptor
calculation algorithms which use a set of parameters, and even datasets
as parameters.  If we switch to having all features generated by models,
not algorithms, then the API becomes a lot more consistent (see the short
example below).
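With the same post_uri() helper as in the sketch above, an ordinary
descriptor calculation would then follow exactly the same pattern
(hypothetical XLogP algorithm URI, placeholder dataset URIs):

# The descriptor calculation algorithm builds a descriptor calculation model ...
logp_model = post_uri("http://example.org/algorithm/xlogp",
                      dataset_uri="http://example.org/dataset/training")

# ... and the model, not the algorithm, generates the features for any dataset
dataset_with_logp = post_uri(logp_model,
                             dataset_uri="http://example.org/dataset/test")
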
> In the use case described above, the feature-recalculation-service would
> return a list of feature-selection-model-uris, which are stored in the
> model, and can be used for recalculating the features for the test compound.
>   
It seems to me the real case might be even more complex than having a
"descriptor_recalculation" service.  The more generic solution would be
to have a kind of composite service (a workflow?), where one could specify
a series of steps and dependencies necessary to build a model or get
predictions (a rough sketch is below).  For example, data cleaning and
sampling are also algorithms that could be used in the model building
process, and later when using the models for prediction.
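In the simplest case such a workflow could just be an ordered list of
processing steps, each consuming and producing a dataset URI - a very
rough sketch (placeholder URIs again):

import requests

def apply_step(service_uri, dataset_uri):
    """POST a dataset to one processing step, return the resulting dataset URI."""
    r = requests.post(service_uri, data={"dataset_uri": dataset_uri},
                      headers={"Accept": "text/uri-list"})
    r.raise_for_status()
    return r.text.strip()

# Hypothetical pipeline: cleaning -> sampling -> descriptors -> prediction
steps = [
    "http://example.org/algorithm/data_cleaning",
    "http://example.org/algorithm/sampling",
    "http://example.org/model/descriptors-1",
    "http://example.org/model/mlr-1",
]

dataset_uri = "http://example.org/dataset/raw"
for step in steps:
    dataset_uri = apply_step(step, dataset_uri)

print(dataset_uri)  # URI of the final dataset with predictions
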
> Alternatively, we could, for the time being, not support feature
> selection/preprocessing algorithms like PCA.
>
>   
This might be a reasonable option for the demo in Berlin.

Best regards,
Nina
> What do you think?
>
>
> Best regards,
> Martin
>
>
>
>
>
>
>   
>> Tobias Girschick wrote:
>>     
>>> Hi Nina,
>>>
>>> the green and the black lines are two possibilities to go through the
>>> workflow. In the PDF the workflow has to be read from bottom to top
>>> (more or less). Everything starts with some prediction application (e.g.
>>> ToxPredict or a ValidationService, ...) that needs descriptors to be
>>> recalculated for prediction. I added the third variant in red arrows and
>>> made three slides out of the one to make it easier to read.
>>>
>>> In version 1 (black) no descriptor recalculation service is needed and
>>> every model service has to delegate the descriptor recalculation to all
>>> descriptor calculation services.
>>> In version 2 (green) the descriptor recalculation service is called by
>>> the model service. The recalc service delegates the necessary descriptor
>>> calculations. In both cases the model service gets a dataset that does
>>> not have all the descriptors needed to use the model for predicting the
>>> dataset.
>>> In version 3 (red) the descriptor recalculation service is called
>>> directly by the application, delegates the descriptor calculations and
>>> updates the dataset. This updated dataset is then submitted by the
>>> application itself to the model service.
>>>
>>> I hope this clarifies my rough sketch from last week.
>>>
>>> regards,
>>> Tobias
>>>
>>> On Tue, 2010-04-20 at 14:42 +0300, Nina Jeliazkova wrote:
>>>
>>>       
>>>> Hi Tobias,
>>>>
>>>> Could you tell me what the difference is between the black and green
>>>> lines in your schema?
>>>>
>>>> I would suggest starting a new wiki page under API to discuss descriptor
>>>> calculator and its API.
>>>>
>>>> Best regards,
>>>> Nina
>>>>
>>>> Tobias Girschick wrote:
>>>>
>>>>         
>>>>> Hi All,
>>>>>
>>>>> I attached one slide which illustrates the problem from my point of
>>>>> view. The green and the black lines are the two possibilities. Note
>>>>>           
>> that
>>     
>>>>> the "descriptor recalculator" has to be implemented only once (if it is
>>>>> generic). Otherwise, every new algorithm that learns models has to
>>>>> provide the whole functionality of calling all the different descriptor
>>>>> calculation services.
>>>>>
>>>>> I think that wrapping the distribution to the different descriptor
>>>>> calculation services makes things a lot easier.
>>>>>
>>>>> Just to again kick-off the discussion.
>>>>> regards,
>>>>> Tobias
>>>>>
>>>>>
>>>>>
>>>>>           
>>>       
>> _______________________________________________
>> Development mailing list
>> Development at opentox.org
>> http://www.opentox.org/mailman/listinfo/development
>>
>>     
>
>
>
>   



