[OTDev] descriptor recalculation

Nina Jeliazkova nina at acad.bg
Fri Apr 30 10:43:48 CEST 2010


Dear Christoph,  All,

Christoph Helma wrote:
> Excerpts from Nina Jeliazkova's message of Thu Apr 29 20:04:17 +0200 2010:
>   
>> Dear Christoph,
>>
>> Few comments from my point of view.  I (partially) agree, the difference
>> perhaps is I would like to keep both the "low level" services and
>> "super" services aboard, not hiding everything into "super" services.
>>     
>
> I agree, that the "low level" services should be still available for
> everyone - that would make it much easier to create innovative "super"
> services and it is also necessary for other tasks apart from creating
> models and predictions.
>
>   
Great - so we have an agreement ;)
>> The examples are of course valid, however these are actually not single
>> services, but workflows of services, rather simple ones.  
>>
>> What we have right now are actually building blocks of services, and
>> compositions of these are built case by case.  For example the
>> mlr case might be
>>
>>         training-dataset = data_preprocessing(raw-dataset)
>>         descriptors_dataset1 = cdk-descriptors(training-dataset)
>>         descriptors_dataset2 = other-descriptors(training-dataset)
>>         descriptors_dataset = merge(descriptors_dataset1,descriptors_dataset2)
>>         pca-descriptors = pca(descriptors_dataset)
>>         mlr-model = mlr(pca-descriptors)
>>         model.model = mlr-model
>>         model.descriptors = pca-descriptors
>>
>> (note that the outcome of each calculation is again a dataset), so in short we have
>>
>>         training-dataset = data_preprocessing(raw-dataset)
>>         model = mlr(merge(cdk-descriptors(training-dataset), other-descriptors(training-dataset)))
>>
>>
>> There could be multiple services in the chain, and it might not be
>> sufficient just to submit services as parameters, there should be a way
>> to specify the order of the processing.
>>     
>
> Agreed. Any ideas how to represent sequences?
>   
Not clear yet, see below.
>> However, it should be fine to wrap such a workflow in a "super" service.
>>     
>>> The model learning task and the prediction task may utilize one or more
>>> algorithms (or models - the separation blurs once again), but at the
>>> high level I want to use only the "super" algorithms/models.
>>>   
>>>       
>> For clear separation, I would like to propose the following
>> interpretation - 
>>
>> - "algorithm" means a generic "algorithm", or a series of steps
>> - "model" is the "algorithm" applied to the data, given relevant
>> parameters.
>>
>> Thus, in the kNN example, the "algorithm" is the theoretical procedure of
>> finding the closest k neighbors and generating a prediction based on their
>> values;
>> while the "model" is the kNN algorithm applied to a specific data set,
>> distance metric and equation/procedure (e.g. averaging values).
>>
>> For eager learning algorithms there should not be confusion.
>>     
>
> Thanks for clarification - hope I memorize it this time ;-)
>
>   
>>> As a GUI developer I still want to have access to the underlying
>>> algorithms, but they can be provided as parameters (our existing API is
>>> quite flexible in this respect). An algorithm webservice could provide
>>> e.g. a high level regression algorithm that allows to choose descriptor
>>> calculation, feature selection and modelling parameters by setting
>>> parameters (and it should document and check internally which algorithms
>>> play together). Future lazar version e.g. will have the facility to
>>> freely switch descriptor calculation services or use datasets with
>>> biological measurements. Maybe we should add the facility to represent
>>> sub-algorithms in OWL-DL for "super" algorithms.
>>>
>>> According to our API the model knows about ot.Algorithm and
>>> ot.IndependentVariables, but it would need to know the service to
>>> calculate independent variables.
>>>       
>> It does actually - every feature (variable) has ot:hasSource, which
>> points to the service it has been generated from (e.g. descriptor
>> calculation one) - and this is what we use in ToxPredict.
>>     
>
> True, but that makes sense only for "simple" descriptor calculation
> algorithms (i.e. descriptors that are independent of the training
> activities, like phys-chem properties, substructures).
It only seems so - what we need in the case of "complex" calculations is a
unique URI, which could encapsulate all the specific parameters you listed.

We already did some brainstorming with TUM some time ago, and they have an
FMiner implementation that offers URIs for descriptors.

The idea is rather simple (not to say it's the best or the only one) -
given an algorithm URI and parameters, create another unique URI, which
encapsulates the algorithm + parameters. Parameters can be anything
specific to the algorithm, including a dataset.  My original idea was that
the new URI points to a Model (as per my definition above :),  but there
were some objections, so finally we agreed the new URI points to a new
ot:Algorithm object.
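As a very rough sketch of this idea (the function, parameter names and URIs below are illustrative, not part of the OpenTox API), the unique URI could be derived deterministically from the algorithm URI plus a canonicalized parameter set, so the same algorithm + parameters always map to the same URI:

```python
# Hypothetical sketch: derive a unique, reproducible URI from an
# algorithm URI and its parameters (names/URIs are illustrative only).
import hashlib
import json

def derived_algorithm_uri(algorithm_uri, parameters):
    """Encapsulate algorithm + parameters in a new unique URI."""
    # Canonicalize the parameters so that key ordering does not matter
    canonical = json.dumps(parameters, sort_keys=True)
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]
    return "%s/%s" % (algorithm_uri, digest)

uri_a = derived_algorithm_uri(
    "http://algorithm.service.org/algorithm/subgraph",
    {"min_frequency": 0.7, "dataset_uri": "http://dataset.service.eu/dataset/6"})
uri_b = derived_algorithm_uri(
    "http://algorithm.service.org/algorithm/subgraph",
    {"dataset_uri": "http://dataset.service.eu/dataset/6", "min_frequency": 0.7})
assert uri_a == uri_b  # parameter order is irrelevant
```

Any scheme would do, as long as the service can map the derived URI back to the encapsulated algorithm + parameters.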

This has been discussed by email and described on the wiki page
http://opentox.org/dev/apis/api-1.1/Algorithm for a while. I'm copying
it here for convenience:

------


        Descriptor calculation algorithms - subclass of
        http://www.opentox.org/algorithmTypes.owl#DescriptorCalculation

    * input parameters: *dataset_uri*, *parameter*
    * output parameters: *dataset_uri*

An Algorithm service shall provide separate URLs for algorithms with
default (or without) parameters and for algorithms with specific
parameter values. The second type of algorithm URLs are created on the
fly, when an algorithm is invoked with specific parameters or a dataset.
For example, when calculating descriptors, depending on the dataset
http://dataset.service.eu/dataset/6 and the set of parameters, the
calculation service creates the following feature:

<Feature rdf:about="http://dataset.service.eu/feature/1">
  <dc:title rdf:datatype="&xsd;string">CCCC</dc:title>
  <hasSource rdf:resource="http://algorithm.service.org/algorithm/subgraph1"/>
</Feature>

and internally the algorithm service creates a new algorithm entry:

http://algorithm.service.org/algorithm/subgraph1

with a representation like below: 

<Algorithm rdf:about="http://algorithm.service.org/algorithm/subgraph1">
  <hasInput rdf:resource="http://dataset.service.eu/dataset/6"/>
  <parameters rdf:resource="#Parameter_4"/>
  <parameters rdf:resource="#Parameter_3"/>
  <owl:sameAs rdf:resource="http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#subgraph"/>
</Algorithm>

<Parameter rdf:ID="Parameter_3">
  <paramValue rdf:datatype="&xsd;string"></paramValue>
</Parameter>
<Parameter rdf:ID="Parameter_4">
  <paramValue rdf:datatype="&xsd;double">0.7</paramValue>
</Parameter>

    * If a client would like to use exactly the same algorithm settings to
      calculate http://ambit.bg/feature/1, it will use
      http://algorithm.service.org/algorithm/subgraph1, available via
      /ot:hasSource/, in a uniform way for all kinds of algorithms,
      regardless of the existence of parameters;

    * All the complexity is hidden within the algorithm service;

    * If a calculation with the generic
      http://algorithm.service.org/algorithm/subgraph algorithm is
      initiated with a specific set of parameters, the service might
      look up internally whether such a set already exists and reuse
      it; otherwise a new algorithm URL is created along with the
      calculations;

    * A calculation with the
      http://algorithm.service.org/algorithm/subgraph1 algorithm can be
      initiated without parameters - these are already known;
    * If a calculation with the
      http://algorithm.service.org/algorithm/subgraph1 algorithm is
      initiated with a specific set of parameters, the service shall
      return error 400 "Bad request".
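The service-side behaviour described in the bullets above can be mimicked with a tiny in-memory sketch (the class, method and URI names are hypothetical; a real service would persist state and speak HTTP/RDF):

```python
# Minimal in-memory sketch of the derived-algorithm behaviour
# (hypothetical names; illustrative only, not the OpenTox API).
class AlgorithmService:
    def __init__(self, base_uri):
        self.base_uri = base_uri
        self.instances = {}   # derived URI -> frozen parameter set
        self.counter = 0

    def invoke(self, algorithm, parameters=None):
        """Simulate a POST to a generic or derived algorithm URI."""
        if algorithm in self.instances:
            # Derived URI: parameters are already encapsulated
            if parameters:
                raise ValueError('400 Bad request: parameters not allowed')
            return algorithm
        # Generic URI: look up an existing instance with the same
        # parameter set and reuse it, otherwise create a new derived URI
        frozen = frozenset((parameters or {}).items())
        for uri, params in self.instances.items():
            if params == frozen:
                return uri
        self.counter += 1
        uri = "%s/%s%d" % (self.base_uri, algorithm, self.counter)
        self.instances[uri] = frozen
        return uri

service = AlgorithmService("http://algorithm.service.org/algorithm")
first = service.invoke("subgraph", {"min_frequency": 0.7})
again = service.invoke("subgraph", {"min_frequency": 0.7})
assert first == again                  # existing parameter set is reused
assert service.invoke(first) == first  # derived URI needs no parameters
```

All the complexity stays inside the service; the client only ever sees opaque algorithm URIs.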


> If we use e.g.
> supervised graph mining techniques we need
>
> (i) an algorithm (model because it is algorithm applied to data?) that
> mines features in the training dataset and creates a feature dataset
> (e.g. fminer)
>
> (ii) a simple substructure matching algorithm that determines if the
> mined features are present in the compound to be predicted (e.g.
> OpenBabel Smarts matcher)
>   

> My interpretation was, that ot:hasSource should point to the graph
> mining algorithm, but the model would need the substructure matcher for
> predictions. How should we handle this?
>   
The idea, explained above, is that the complexity of (i) & (ii) is hidden
within the implementation, which only exposes a GET and POST interface to
retrieve its representation and run calculations, respectively.
I hope the explanation on the wiki is clear enough, let's discuss if not.
The TUM group was comfortable that it covers their fminer case, and the
same approach can be reused by the fragment calculation algorithms from SL
& IBMC.  From my point of view, it also covers the case of launching e.g.
MOPAC with different commands to calculate electronic parameters, and
provides unique URIs identifying whether e.g. eHomo is calculated with PM3
or AM1.

To make it short, one POSTs a dataset + parameters to
/algorithm/fminer and receives the result dataset, where for each
feature ot:hasSource points to /algorithm/fminer/my_new_fminer_instance.
The latter encapsulates the dataset + parameters and allows new
datasets/compounds to be POSTed to
/algorithm/fminer/my_new_fminer_instance, resulting in a dataset with
the calculated values.
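On the client side the pattern is then uniform for every kind of algorithm. A sketch (the `post` callable stands in for an HTTP POST and is hypothetical): collect the distinct ot:hasSource URIs of the training features and POST the query dataset to each of them once:

```python
# Client-side sketch (as a client like ToxPredict might do it);
# `post` is a hypothetical stand-in for an HTTP POST to an algorithm URI.
def recalculate(features, query_dataset_uri, post):
    """features: list of {"uri": ..., "hasSource": ...} dicts from the
    training dataset; post(algorithm_uri, dataset_uri) returns the URI
    of a dataset with the calculated values."""
    sources = []
    for feature in features:
        if feature["hasSource"] not in sources:
            sources.append(feature["hasSource"])  # one POST per distinct source
    return [post(source, query_dataset_uri) for source in sources]
```

Because the derived URIs already encapsulate their parameters, the client never needs to know whether a source is a simple phys-chem calculator or a parameterized fminer instance.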
>   
>>>  This could be inferred from the
>>> ot.Algorithms's sub-algorithms or stated explicitly. More importantly
>>> the service would have to be able to call the necessary services 
>>>       
>> Yes it already does - get the independent variable ot:hasSource
>> property, run POST and get the result
>>     
>>> (of
>>> course this has to be implemented, if you are using stock ML/DM tools -
>>> but OpenTox should be more than just wrapping existing programs into a
>>> REST interface). It would be a large waste of efforts, if every
>>> developer would have to implement descriptor calculation separately in
>>> their webservice clients. 
>>>   
>>>       
>> Agree.
>>
>> But it will be also a waste, if e.g. descriptor calculations services
>> are hidden within model services , and not accessible for reuse for
>> other models.  
>>     
>
> I do not want to hide them completely - on the contrary, everyone should
> be able to mash up descriptor calculation/selection and model learning
> services. I was just arguing in favor of better encapsulation, that
> makes programming/experimenting easier (e.g. models that accept
> structures instead of descriptor sets as input for predictions).
>   
OK, so we have to find a way to represent / dynamically generate "super"
services, given a set of "lower level" services.
>   
>> For example descriptors, currently used in lazar don't seem to be
>> accessible for use in other services (I might just not be aware though),
>> while descriptors, exposed as separate services by TUM and IDEA are.
>>     
>
> Descriptor calculation is currently performed by fminer, which is
> available (and used by lazar) as an independent standalone service:
>
> http://webservices.in-silico.ch/algorithm/fminer
>
> It can be easily exchanged for other descriptor services. On the other
> hand Andreas uses e.g. fminer/last features to create SVM models.
>   
Thanks for the clarification - will it be possible to extend your fminer
implementation, following the description of the complex algorithms API
above?
> To sum up my personal opinion:
>   
>>> For ToxCreate I would like to handle to high-level objects/services:
>>> training-algorithm (for creating models) and model (for predictions). I
>>> do not want to have to care about implementation details for model
>>> training and predictions, but would like to have access to the
>>> underlying algorithms through parameters. 
>>>       
>> Or links from RDF representation of related objects.
>>     
>
> Might be also a possibility to represent sequences of steps.
>
>   
Yes, this is a possibility.  I am a bit reluctant to suggest a solution
right now, as one might already exist (I have recently seen BPEL for
REST, and there is a machine learning ontology in OWL that does
something similar).

My previous API proposal was not generic enough to allow composing
workflows, but it might serve as a specific solution to encapsulate
descriptor calculations - what do you think of the proposed API?

>> 1) For models and algorithms, there might be an (optional) parameter,
>> pointing to an URI of the new calculation service.  If the parameter is
>> missing and descriptor calculation is necessary, the model either
>> initiates the calculations itself, or returns an error.
>>
>> 2) The API of the new calculation service (is 'recalculation' the right
>> name for it - perhaps 'proxy calculation service' has a closer meaning?)
>>
>> GET: RDF representation of ot:Algorithm object; with algorithm type as
>> in  3)
>>
>> POST
>> parameters:
>> dataset_uri - as usual
>> model_uri - the uri of the model
>> or (alternatively) algorithm_uris[] - list of descriptor calculation
>> algorithms
>>
>> POST result -  returns URI of the dataset with calculated descriptors
>>
>> 3) eventually include a new type in the Algorithm types ontology for the
>> proxy calculation service, and use it in the RDF representation of the
>> proxy service; thus one could find out whether such a service exists
>> within the list of algorithms.


Best regards,
Nina
>>> We might need minor API
>>> changes for representing "super" algorithm services (i.e. algorithm
>>> services that call other algorithm services) and for informing the
>>> model service about the right descriptor calculation service.
>>>   
>>>       
>> I am also comfortable with the idea of having "super" (proxy, composite)
>> services, to encapsulate workflows, like in your examples.
>>
>> BTW "super" service sounds better than "proxy" service I was suggesting
>> the other day.
>>     
>
> Thanks for the compliment, it was the first thing that came into my
> mind.
>
> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
>   



