[OTDev] descriptor recalculation

Nina Jeliazkova nina at acad.bg
Wed Apr 28 11:07:07 CEST 2010


Hello All,

There is an updated REST tutorial
http://www.inf.usi.ch/faculty/pautasso/lectures/REST-Tutorial-WWW2010.pdf
from the WS-REST 2010 workshop (http://www.ws-rest.org/), which took place
a couple of days ago.

It seems to me that Part 4, Composite resources (from slide 88), is quite
relevant to the discussion in this thread.


Best regards,
Nina

Nina Jeliazkova wrote:
> Hi,
>
> Martin Guetlein wrote:
>   
>> On Tue, Apr 20, 2010 at 7:52 PM, Nina Jeliazkova <nina at acad.bg> wrote:
>>
>>> Hi Tobias, All,
>>>
>>> I am trying to think of the API extension/change necessary to include the
>>> intermediate descriptor calculation service (preferably without making
>>> it mandatory).
>>>
>>> My suggestions:
>>> 1) For models and algorithms, there might be an (optional) parameter,
>>> pointing to the URI of the new calculation service.  If the parameter is
>>> missing and descriptor calculation is necessary, the model either
>>> initiates the calculations itself or returns an error.
>>>
>>> 2) The API of the new calculation service (is 'recalculation' the right
>>> name for it? Perhaps 'proxy calculation service' has a closer meaning?)
>>>
>>> GET: RDF representation of the ot:Algorithm object, with the algorithm
>>> type as in 3)
>>>
>>> POST
>>> parameters:
>>> dataset_uri - as usual
>>> model_uri - the uri of the model
>>> or (alternatively) algorithm_uris[] - list of descriptor calculation
>>> algorithms
>>>
>>> POST result - returns the URI of the dataset with calculated descriptors
>>> (a client sketch follows after this list)
>>>
>>> 3) eventually, include a new type in the Algorithm type ontology for the
>>> proxy calculation service, and use it in the RDF representation of the
>>> proxy service; thus one could discover whether such a service exists
>>> within the list of algorithms.
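>>>
>>> A minimal client sketch of the POST interface in 2) (the service URI and
>>> the exact parameter spellings below are only illustrative; nothing here
>>> is a settled part of the API):
>>>
>>>     import requests
>>>
>>>     # Hypothetical proxy calculation service (illustrative URI)
>>>     PROXY_URI = "http://example.org/algorithm/proxycalc"
>>>
>>>     # Ask the proxy to calculate whatever descriptors the model needs;
>>>     # the response body is the URI of the enriched dataset.
>>>     response = requests.post(PROXY_URI, data={
>>>         "dataset_uri": "http://example.org/dataset/42",
>>>         "model_uri": "http://example.org/model/7",
>>>     })
>>>     response.raise_for_status()
>>>     print(response.text.strip())  # dataset URI with calculated descriptors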
>>>
>>>
>>> Will this be sufficient?  I am not sure whether I am missing something
>>> important; discussion welcome!
>>>
>>> Best regards,
>>> Nina
>>>
>>>
>> Hello all,
>>
>>
>> I just discussed the 'PCA problem' that we discovered during the GoToMeeting
>> with Andreas M. Our proposal requires further API changes, but it should
>> work.
>>
>> ------------------------------------
>>
>> Problem description:
>> (We use Variant 2 from Tobias' proposal.)
>>
>> Step 1 - Client builds model with params:
>> * training-dataset (only compounds, no features)
>> * feature-mining-algorithm
> I guess you mean descriptor calculation here.
>   
>> * feature-selection-algorithm (PCA)
>>
>> The model service passes the params to the feature (re)calculation service.
>> The feature-recalculation-service first applies the
>> feature-mining-algorithm, then applies the feature-selection-algorithm. The
>> feature-recalculation-service passes a new feature dataset back to the model
>> service. The model service builds the model based on the feature dataset.
>> The model service passes the model-uri to the client.
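>>
>> As a rough sketch, the whole of Step 1 could look like this from the
>> client side (all URIs and the two algorithm parameter names below are
>> placeholders, not agreed API):
>>
>>     import requests
>>
>>     # Hypothetical learning algorithm service; parameter names invented.
>>     MODEL_SERVICE = "http://example.org/algorithm/lazar"
>>
>>     # The client posts plain compounds plus the two algorithm URIs; the
>>     # model service forwards them to the (re)calculation service, which
>>     # runs descriptor calculation first and feature selection second.
>>     response = requests.post(MODEL_SERVICE, data={
>>         "dataset_uri": "http://example.org/dataset/train",
>>         "feature_calculation_algorithm": "http://example.org/algorithm/cdk",
>>         "feature_selection_algorithm": "http://example.org/algorithm/pca",
>>     })
>>     response.raise_for_status()
>>     model_uri = response.text.strip()  # the model service returns the model URI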
>>
> This will not quite fit algorithms that require no structures, but take
> descriptors directly.
>   
>> Step 2 - Client predicts a compound with the model
>> * uri: model-uri
>> * param: test compound (plain compounds, no features)
>>
>> The model service passes the feature mining params (stored in the model???)
> independent variables are already stored in the model representation
> under the ot:independentVariables property.
>   
>> to the (re)calculation service to calculate the features for the test
>> compound.
>> PROBLEM: The feature-selection-algorithm PCA needs information from the
>> original training dataset to apply the same mechanism to the test features!
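>>
>> To see why, here is the PCA mechanics in a few lines of numpy (the
>> numbers are purely illustrative): the mean and the projection axes are
>> fitted on the training data, and exactly this state is what a stateless
>> recalculation service would be missing at prediction time.
>>
>>     import numpy as np
>>
>>     rng = np.random.default_rng(0)
>>     train = rng.random((50, 10))   # 50 training compounds, 10 raw features
>>     test = rng.random((1, 10))     # one test compound
>>
>>     # Fitting: mean and principal axes come from the *training* data.
>>     mean = train.mean(axis=0)
>>     _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
>>     components = vt[:3]            # keep the first 3 principal axes
>>
>>     # Prediction: the test compound must be centred with the training
>>     # mean and projected on the training axes.
>>     test_features = (test - mean) @ components.T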
>>
>> ------------------------------------
>>
>> Proposal:
>> * Distinguish between feature selection Algorithms and Models.
> OK
>   
>> * Distinguish between a 'mine-feature-step' (on training data) and a
>> 'check-feature-step' (test-compound).
> Why? In both cases the result is generating the descriptors, be it
> from some mining procedure or by reading pre-calculated descriptors from
> elsewhere.
> "Mining" is not always the right word, especially for the common class of
> models using calculated whole-molecule descriptors.
>   
>> * The 'mine-feature-step' is available from the algorithm and builds the
>> model (as well as the features for the dataset)
> This removes the separation of learning algorithms and descriptor
> calculations, which is not a good approach IMHO.
>   
>> * The 'check-feature-step' is available for models only.
>> * Feature selection algorithms, where 'checking' is independent of 'mining',
>> are by definition models (like a chi-square filter).
>>
> Yes indeed, feature selection algorithms like PCA might be considered
> models, because the result of such a model will be a dataset.
> However, other feature selection algorithms simply return a list of feature
> URIs.
> Then we need no API extension - PCA will just be an algorithm, building
> a model with specific parameters and a dataset, and storing the results in
> a dataset and related features.  The prediction model itself will use
> the result dataset.
> We might need a kind of super-model, encapsulating PCA + prediction model
> in this case.
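>
> In that reading, the PCA flow would look roughly like this (all URIs and
> parameter names are invented for illustration):
>
>     import requests
>
>     # Train: POST the training dataset to the PCA algorithm; the result
>     # is a PCA *model* URI, which keeps the fitted state server-side.
>     r = requests.post("http://example.org/algorithm/pca",
>                       data={"dataset_uri": "http://example.org/dataset/train",
>                             "components": "3"})
>     pca_model_uri = r.text.strip()
>
>     # Apply: POST any dataset (e.g. test compounds) to the PCA model; the
>     # result is a new dataset URI holding the transformed features.
>     r = requests.post(pca_model_uri,
>                       data={"dataset_uri": "http://example.org/dataset/test"})
>     transformed_dataset_uri = r.text.strip()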
>
> In fact, I was advocating some months ago that we need to
> distinguish between descriptor calculation algorithms and descriptor
> calculation models - there are several examples of descriptor
> calculation algorithms which use a set of parameters and even datasets as
> parameters.   If we switch to having all features generated by models, not
> algorithms, then the API becomes a lot more consistent.
>   
>> In the use case described above, the feature-recalculation-service would
>> return a list of feature-selection-model URIs, which are stored in the
>> model and can be used for recalculating the features for the test compound.
> It seems to me the real case might be even more complex than having a
> "descriptor_recalculation" service.   The more generic solution would be
> to have a kind of composite service (workflow?), where one could specify a
> series of steps and dependencies necessary to build a model or get
> predictions.  For example, data cleaning and sampling are also
> algorithms that could be used in the model building process, and in
> using models for prediction later.
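>
> Such a composite service might, for example, accept an ordered list of
> algorithm steps (a sketch only - no such OpenTox endpoint exists yet, and
> every URI and parameter name below is invented):
>
>     import requests
>
>     # Hypothetical workflow service: steps run in order, each feeding
>     # its output dataset into the next one.
>     r = requests.post("http://example.org/workflow", data={
>         "dataset_uri": "http://example.org/dataset/raw",
>         "step_uris[]": [
>             "http://example.org/algorithm/datacleaning",
>             "http://example.org/algorithm/sampling",
>             "http://example.org/algorithm/cdk",    # descriptor calculation
>             "http://example.org/algorithm/pca",    # feature selection
>             "http://example.org/algorithm/lazar",  # model building
>         ],
>     })
>     result_uri = r.text.strip()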
>   
>> Alternatively, we could, for the time being, not support feature
>> selection/preprocessing algorithms like PCA.
>>
> This might be a reasonable option for the demo in Berlin.
>
> Best regards,
> Nina
>   
>> What do you think?
>>
>>
>> Best regards,
>> Martin
>>
>>> Tobias Girschick wrote:
>>>> Hi Nina,
>>>>
>>>> the green and the black lines are two possible paths through the
>>>> workflow. In the PDF the workflow has to be read from bottom to top
>>>> (more or less). Everything starts with some prediction application (e.g.
>>>> ToxPredict or a validation service, ...) that needs descriptors to be
>>>> recalculated for prediction. I added the third variant in red arrows and
>>>> split the one slide into three to make it more readable.
>>>>
>>>> In version 1 (black) no descriptor recalculation service is needed, and
>>>> every model service has to delegate the descriptor recalculation to all
>>>> descriptor calculation services.
>>>> In version 2 (green) the descriptor recalculation service is called by
>>>> the model service. The recalc service delegates the necessary descriptor
>>>> calculations. In both cases the model service gets a dataset that now has
>>>> all the descriptors needed to use the model for predicting the dataset.
>>>> In version 3 (red) the descriptor recalculation service is called
>>>> directly by the application, delegates the descriptor calculations and
>>>> updates the dataset. This updated dataset is then submitted by the
>>>> application itself to the model service.
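>>>>
>>>> In client code, version 3 would look roughly like this (the service
>>>> URIs are invented for illustration):
>>>>
>>>>     import requests
>>>>
>>>>     dataset_uri = "http://example.org/dataset/test"  # plain compounds
>>>>     model_uri = "http://example.org/model/7"
>>>>
>>>>     # The application calls the recalculation service itself ...
>>>>     r = requests.post("http://example.org/algorithm/recalc",
>>>>                       data={"dataset_uri": dataset_uri,
>>>>                             "model_uri": model_uri})
>>>>     updated_dataset_uri = r.text.strip()
>>>>
>>>>     # ... and then submits the updated dataset to the model directly.
>>>>     r = requests.post(model_uri,
>>>>                       data={"dataset_uri": updated_dataset_uri})
>>>>     prediction_uri = r.text.strip()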
>>>>
>>>> I hope this clarifies my rough sketch from last week.
>>>>
>>>> regards,
>>>> Tobias
>>>>
>>>> On Tue, 2010-04-20 at 14:42 +0300, Nina Jeliazkova wrote:
>>>>
>>>>> Hi Tobias,
>>>>>
>>>>> Could you tell me what the difference is between the black and green
>>>>> lines in your schema?
>>>>>
>>>>> I would suggest starting a new wiki page under API to discuss the
>>>>> descriptor calculator and its API.
>>>>>
>>>>> Best regards,
>>>>> Nina
>>>>>
>>>>> Tobias Girschick wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I attached one slide which illustrates the problem from my point of
>>>>>> view. The green and the black lines are the two possibilities. Note that
>>>>>> the "descriptor recalculator" has to be implemented only once (if it is
>>>>>> generic). Otherwise, every new algorithm that learns models has to
>>>>>> provide the whole functionality of calling all the different descriptor
>>>>>> calculation services.
>>>>>>
>>>>>> I think that wrapping the dispatching to the different descriptor
>>>>>> calculation services makes things a lot easier.
>>>>>>
>>>>>> Just to kick off the discussion again.
>>>>>> regards,
>>>>>> Tobias