[OTDev] API 1.1. extensions - Numeric and Nominal data type implemented

Thu Feb 4 11:50:23 CET 2010

Hi Nina, Tobias,

On Thu, Feb 4, 2010 at 1:07 PM, Nina Jeliazkova <nina at acad.bg> wrote:

> Hi Surajit,
>
> Few more questions, to help me understand MaxTox:
>
> Tobias Girschick wrote:
> > Hi Surajit,
> >
> > On Wed, 2010-02-03 at 23:02 +0530, surajit ray wrote:
> >
> >> First of all - Thanks Ivelina for your inputs !
> >>
> >> You are right about the RatTD50 being an individual - I was a little
> >> confused there. I have made the correction. The rest of the ontology
> >> follows from our work flow which is :
> >>
> >> a) We generate a dictionary of overlaps found from a dataset of
> >> molecules.
> >> (Basically we run a pairwise graph comparison of all the molecules in a
> >> dataset and collect all the overlaps found in all the comparisons.)
> >>
> Is the dictionary a set of fragments, obtained from a particular dataset
> , and not a dataset of molecules?
>
> e.g. if you compare "CCC"  with "CC" , the overlap will be "CC" ?
>

Yes, you are correct with the example. Let's say we have 50 molecules in a
dataset for a particular toxicity endpoint. We compare two at a time (after
removing hydrogens) with the subgraph searching technique. Each comparison
generates some number of overlaps. We keep the overlaps that meet the cutoff
number of atoms (say the cutoff is 3; then all overlap fragments with
>= 3 atoms are kept).

We collect all such overlaps and remove the duplicates. While removing
duplicates we have two strategies :
a) we give a weight to the fragment - based on number of duplicates found.
b) no weight is assigned - meaning all the weights are uniform and constant

The molecules of the dataset are then compared to the generated dictionary
fragments and a fingerprint is generated for each molecule in the dataset.
Again there are four strategies:
a) we give a number for each entry denoting the number of times that
fragment occurred in the molecule
b) we just use binary 1 or 0 (meaning the fragment is present or not)
c) we give a number for each entry denoting the weight of the fragment in
the dictionary (see previous step)
d) we give a weight which is some function of both the weights (dictionary
weight of fragment as well as number of times the fragment occurred in the
molecule).
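
The four fingerprint strategies can be sketched in Python as below. This is
only an illustration under stated assumptions: the dictionary is a plain
dict mapping fragment to weight, molecules are plain strings, a
non-overlapping substring count stands in for counting subgraph matches,
and the product used for strategy (d) is just one possible choice of
function. The function name fingerprint and the strategy labels are
hypothetical.

```python
def fingerprint(molecule, dictionary, strategy="binary"):
    """Return one fingerprint entry per dictionary fragment (sorted order)."""
    entries = []
    for fragment in sorted(dictionary):
        # Toy assumption: substring count replaces subgraph match counting.
        occurrences = molecule.count(fragment)
        weight = dictionary[fragment]
        if strategy == "count":        # (a) number of occurrences
            entries.append(occurrences)
        elif strategy == "binary":     # (b) fragment present or not
            entries.append(1 if occurrences else 0)
        elif strategy == "weight":     # (c) dictionary weight, if present
            entries.append(weight if occurrences else 0)
        elif strategy == "combined":   # (d) assumed function: product
            entries.append(weight * occurrences)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return entries
```

Keeping the strategy as a parameter makes it easy to generate all four
variants for the same dataset and compare which one predicts best.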

At this stage I must say we are experimenting with the weights - figuring
out which gives the best predictions.

After the above step we feed the data to an RF model in R. Afterwards we
fingerprint a new molecule against this dictionary (which is pre-calculated
for a set of molecules and for a particular endpoint). We then use this
fingerprint in the pre-generated RF model to determine toxicity.

> >> b) This dictionary is then used to generate a fingerprint of a test
> >> molecule (for the endpoint of the particular dataset which generated the
> >> dictionary).
> >>
> >> c) In the back end we have modeled the fingerprints of the molecules (of
> >> the dataset) against the dictionary generated from the dataset. For our
> >> internal testing and validation - we then use it in a RandomForest model
> >> to predict toxicity of future molecules.
> >>
> >> The question I am thinking about is how this ontology fits in with the
> >> API. The API has a dataset class - but what about the dictionary we
> >> generate - would it be a dataset also - or a dataset derived object of
> >> some sort ?
> >>
> >
> > I am not completely sure, but I have the impression that your dictionary
> > is a set of features. Not simple features like a logP but features that
> > have parameters. We had some discussion on this topic going on on the
> > mailing list. The most important parameter of your dictionary features
> > would be the dataset from which it is created.
> It seems to me  MaxTox approach can indeed be handled the same way we
> have discussed recently in the context of TUM algorithms.
>
> 1) There is a descriptor calculation algorithm, with a dataset as input
> parameter (and possibly other parameters)
> 2) The descriptor calculation algorithm generates descriptor values
> (e.g. YES/NO/count for each fragment from the fragment set)
> 3) The descriptor values are used by the random forest algorithm for
> building prediction models.
>
>
There is one aspect missing from the above work flow - and that is the
generation of the dictionary itself.

And yes, the dictionary is a set of features (the fragments - their
occurrence/absence being the data point for a fingerprint for a molecule
when compared to the dictionary), the most important parameter being
the source set of molecules that generated the fragments. Also each feature
(fragment) should/can have an associated weight reflecting the number of
times a duplicate was found. (We suspect it could be important in making the
final model more accurate.)

We are in two minds about the dictionary making - should we offer that as a
service as well OR just use pre-generated and optimized (for better
modelling characteristics) set of fragments as dictionaries (for particular
endpoints). The latter will lock the overall usability of the prediction
algo to the dictionaries that are made available by us, while the former
leaves the possibility of making new dictionaries through the API itself.

> > You can have a look here
> > http://www.opentox.org/dev/apis/api-1.1/Algorithm in the section where
> > the descriptor calculation algorithms are explained in more detail.
> >
> To summarize:
> -There is a generic MaxTox fingerprint calculation algorithm
>  /algorithm/maxtoxfingerprints
> -Calculation of fingerprints for particular dataset is done via POSTing
> a dataset to the algorithm
> curl -X POST -d /dataset/ABCD  /algorithm/maxtoxfingerprints
>
> This operation returns a URL of a new dataset with descriptor values,
> but also creates a new URL for the algorithm with specific parameters, e.g.
>
>    /algorithm/maxtoxfingerprints1
>
>
I could not understand this part - could you clarify why we need a new
URL for the algorithm ?

> The new dataset contains features like
>
>    <Feature rdf:about="http://maxtox.in/feature/1">
>             <dc:title rdf:datatype="&xsd;string">CCCC</dc:title>
>            <hasSource rdf:resource="http://maxtox.in/algorithm/maxtoxfingerprints1"/>
>    </Feature>
>
>
One more component of each feature would be the weight (as described above).

> The new algorithm entry  /algorithm/maxtoxfingerprints1 will provide RDF
> representation as follows:
>
>     <Algorithm rdf:about="http://maxtox.in/algorithm/maxtoxfingerprints1">
>           <hasInput rdf:resource="http://maxtox.in/dataset/ABCD"/>
>
>           <parameters rdf:resource="#Parameter_4"/>
>
>            <parameters rdf:resource="#Parameter_3"/>
>             <owl:sameAs rdf:resource="http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#maxtox"/>
>        </Algorithm>
>
>         <Parameter rdf:ID="Parameter_3">
>             <paramValue rdf:datatype="&xsd;string"></paramValue>
>         </Parameter>
>         <Parameter rdf:ID="Parameter_4">
>             <paramValue rdf:datatype="&xsd;double">0.7</paramValue>
>         </Parameter>
>
>
Could you please clarify/ explain the use of the parameters in our context -
and what they should signify ?

> This approach will require only extending the Blue Obelisk ontology with
> the MaxTox descriptor calculation algorithm, and no further changes,
> neither in the opentox ontology, nor in the API.
>
> @Tobias and Fabian - if the examples sound familiar to you, it is
> intentional :)
>
> Finally, to align MaxTox with the API, there should be a clear split
> between descriptor calculation algorithm (fingerprints) and modeling
> (RandomForest).  For the Random Forest algorithm the representation as a
> machine learning algorithm should be used (see AlgorithmTypes ontology).
>
>
Yeah, I figured that much earlier - I had already sent a mail asking
whether we should provide the RF algo as a separate REST service. Even now
there is a lack of consensus internally - the main issue being whether we
should provide the full model as one unit - or the components as individual
REST services. (A sub-topic of the internal discussion being the lack of man
power we have in making all the components work individually through the API -
since I am the only one coding the whole project from scratch).

I hope there will be some middle ground regarding the deliverable from our
side ....

I am still a little unclear about how we should handle the dictionary making
part. I hope this mail will help you give me a clear direction.

Thanks in advance for all the help. It's really lonely coding here !

Cheers
Surajit

> Does this make sense for MaxTox?
>
> Best regards,
> Nina
>
> > I hope this helps just a little bit
> > Regards,
> > Tobias
> >
> >
> >> Thanks
> >> Surajit
> >>
> >> On Wed, Feb 3, 2010 at 7:36 PM, Ivelina Nikolova <iva at lml.bas.bg>
> wrote:
> >>
> >>
> >>> Dear Surajit,
> >>>
> >>> I'm looking at the MaxTox.owl you've attached earlier this week. The
> >>> reasoner classifies it well, so technically it is correct, but I'm
> >>> lacking some additional knowledge about the problem you wish to solve
> >>> with this ontology. Could you give me some more explanation, so that
> >>> I can get your point while creating it?
> >>>
> >>> It is surprising for me to see that you have chosen to create a class
> >>> called RatTD50 as a Dataset subclass. Normally the concrete datasets or
> >>> features are individuals and they are not part of the ontology, but of
> >>> some external resource and they have their URL. What is your reason to
> >>> design it this way?
> >>>
> >>> Best,
> >>> ivelina
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> surajit ray wrote:
> >>>
> >>>> Hi Nina,
> >>>>
> >>>> I have created a basic ontology for the Maxtox model. Could you please
> >>>> go through it briefly when you have the time - and suggest improvements
> >>>> as well as how we can fit this into the existing Ontology of Opentox.
> >>>> Just indications will do (I understand you are already under a deadline
> >>>> pressure...)
> >>>>
> >>>> I have just started learning Protege so a few concepts may have gone
> >>>> awry. I have tried my best to get the gist of our prediction system
> >>>> into the attached ontology.
> >>>>
> >>>> Thanks in advance
> >>>>
> >>>> Cheers
> >>>> Surajit
> >>>>
> >>>> On Tue, Feb 2, 2010 at 11:45 AM, Nina Jeliazkova <nina at acad.bg>
> wrote:
> >>>>
> >>>>
> >>>>
> >>>>>  Hi Surajit,
> >>>>>
> >>>>>
> >>>>> surajit ray wrote:
> >>>>>
> >>>>> Hi Nina,
> >>>>>
> >>>>>  Are we officially supposed to use the restlet2 m7 ?
> >>>>>
> >>>>>  The library and particular release are choice of the developer, so
> the
> >>>>> answer is - it is up to you.
> >>>>>
> >>>>>  At this point I have few questions about this interaction
> >>>>> a) Once a time consuming calculation is over - should the server
> notify
> >>>>>
> >>> an
> >>>
> >>>>> attached client about the result OR just sit with the data at a
> >>>>>
> >>> specified
> >>>
> >>>>> URL till it is fetched by the client ?
> >>>>>
> >>>>> The client sends GET requests and verifies if the result is ready (200
> >>>>> OK).
> >>>>> Have a look at the API at
> >>>>> http://opentox.org/dev/apis/api-1.1/AsyncTask
> >>>>>
> >>>>> When returning 303 (redirect) for an incomplete task and making use of
> >>>>> the Refresh field, it is very easy to implement a browser-like client to
> >>>>> periodically check the task status. Browsers will automatically try to
> >>>>> fetch the content after the time interval specified in Refresh:
> >>>>>
> >>>>>  (In the REST interface is it allowable for the server to contact a
> >>>>> connected client ?)
> >>>>>
> >>>>> No.  REST uses the HTTP protocol for communication and there is no
> >>>>> notion of an "attached" client.  The client sends a GET/PUT/POST/DELETE
> >>>>> request and receives an answer, and there is no permanent connection
> >>>>> between them after the response is sent.
> >>>>> Unless you implement your client to behave as a server as well, there is
> >>>>> no way for the server to tell the client anything, besides answering a
> >>>>> client request.
> >>>
> >>>>>  b) Should the server identify the client requesting the computation
> and
> >>>>> authenticate before delivering the data OR give it to any client
> >>>>>
> >>> requesting
> >>>
> >>>>> the URI of the predicted/computed values ?
> >>>>>
> >>>>> We've decided to postpone everything, concerning authentication and
> >>>>> authorization after the end of February, and currently all the
> services
> >>>>>
> >>> are
> >>>
> >>>>> open to everybody (I know it sounds scary :)
> >>>>>
> >>>>>  c) How long should the computed values be retained on the server OR
> >>>>> should there be REST interface for destroying (and hence freeing up
> >>>>> resources) a set of computed values ?
> >>>>>
> >>>>> It depends on your implementation.
> >>>>>
> >>>>> For example ambit services store everything in a database and keep
> the
> >>>>> results forever, unless a delete operation is performed on specific
> >>>>> resource.  REST way of deleting a resource is sending DELETE request
> >>>>> (instead of POST or PUT, which are generaly for create/update).  For
> >>>>>
> >>> most of
> >>>
> >>>>> OpenTox resources DELETE operation is defined in the API (see wiki),
> but
> >>>>>
> >>> not
> >>>
> >>>>> everybody has implemented the full set of the API.
> >>>>>
> >>>>> d) Should there be an "account" like system for every client on the
> >>>>> server ?
> >>>>>
> >>>>> See answer in (b).
> >>>>>
> >>>>> If yes - should the data generated by a client be attached to that
> >>>>> "account" or available to all ?
> >>>>>
> >>>>> There might be public and private data, but at this moment we
> consider
> >>>>> everything is public and will decide on details after finalizing
> >>>>> deliverables at the end of this month.
> >>>>>
> >>>>>  Should the computed data persist indefinitely in each of these
> >>>>>
> >>> "accounts"
> >>>
> >>>>> ?
> >>>>>
> >>>>>  Again, it depends on your implementation.
> >>>>>
> >>>>> Best regards,
> >>>>> Nina
> >>>>>
> >>>>>  Cheers
> >>>>> Surajit
> >>>>>
> >>>>> On Tue, Feb 2, 2010 at 3:31 AM, Nina Jeliazkova <nina at acad.bg>
> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Hi Pantelis,
> >>>>>>
> >>>>>> There is a standard Java class java.util.concurrent.ExecutorService;
> >>>>>> it could be configured to work as a pool of a fixed or variable number
> >>>>>> of threads.
> >>>>>>
> >>>>>> There is a Restlet TaskService, which wraps the ExecutorService.
> >>>>>> I found it behaved weirdly and switched to the standard Java class.
> >>>>>>
> >>>>>> You might look at ambit code at
> >>>>>>
> >>>>>> https://ambit.svn.sourceforge.net/svnroot/ambit/trunk/ambit2-all/ambit2-www/src/main/java/ambit2/rest/AmbitApplication.java
> >>>>>> https://ambit.svn.sourceforge.net/svnroot/ambit/trunk/ambit2-all/ambit2-www/src/main/java/ambit2/rest/task
> >>>>>>
> >>>
> >>>>>> For each asynchronous task, it creates a Callable class, which returns
> >>>>>> a Reference. Each task has a unique identifier (UUID) and the set of
> >>>>>> tasks is stored in a ConcurrentMap. There is a timer, which removes
> >>>>>> completed tasks a few hours after completion.
> >>>>>>
> >>>>>> Hope this helps,
> >>>>>> Nina
> >>>>>>
> >>>>>>
> >>>>>> chung wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Hi Nina,
> >>>>>>>  I'm trying to make some improvements on the services so except for
> >>>>>>>
> >>> the
> >>>
> >>>>>>> migration from restlet2 m3 to m7 I was thinking of introducing some
> >>>>>>> execution pool (e.g. an ExecutorService or -why not- something
> >>>>>>> 'homemade') and establish a queue for the incoming requests
> >>>>>>>
> >>> (especially
> >>>
> >>>>>>> those characterized as time-consuming and memory-consuming ones).
> This
> >>>>>>> way I will be able to manage all running tasks on the server and
> make
> >>>>>>> some performance improvements I hope. Is there some standard way of
> >>>>>>> doing this? Could you suggest some executor or some utility to
> manage
> >>>>>>> the running threads and do you know if there is some way to specify
> >>>>>>>
> >>> the
> >>>
> >>>>>>> maximum number of running threads for Restlet?
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Pantelis
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, 2010-01-26 at 14:21 +0200, Nina Jeliazkova wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hello All,
> >>>>>>>>
> >>>>>>>> Following the data type discussions and proposal earlier this month,
> >>>>>>>> support for NumericFeature and NominalFeature is now implemented in
> >>>>>>>> IDEA services.
> >>>>>>>>
> >>>>>>>> Please note all features are explicitly declared to be subclass of
> >>>>>>>> ot:Feature as well. While this is redundant and can be derived
> from
> >>>>>>>>
> >>> the
> >>>
> >>>>>>>> ontology with a help of a reasoner, it does make the client
> >>>>>>>> implementation somewhat easier.
> >>>>>>>>
> >>>>>>>> Examples from CPDBAS dataset at
> >>>>>>>> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/9
> >>>>>>>>
> >>>>>>>> <http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12122>
> >>>>>>>>       a       ot:Feature , ot:NominalFeature ;
> >>>>>>>>       dc:identifier "http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12122"^^xsd:anyURI ;
> >>>>>>>>       dc:title "ActivityOutcome_CPDBAS_SingleCellCall" ;
> >>>>>>>>       ot:acceptValue "inactive" , "active" ;
> >>>>>>>>       ot:hasSource <http://ambit.uni-plovdiv.bg:8080/ambit2/reference/11847> ;
> >>>>>>>>       ot:units "" ;
> >>>>>>>>       =       ot:ActivityOutcome_CPDBAS_SingleCellCall .
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> <http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12124>
> >>>>>>>>       a       ot:Feature , ot:NumericFeature ;
> >>>>>>>>       dc:identifier "http://ambit.uni-plovdiv.bg:8080/ambit2/feature/12124"^^xsd:anyURI ;
> >>>>>>>>       dc:title "STRUCTURE_MolecularWeight" ;
> >>>>>>>>       ot:hasSource <http://ambit.uni-plovdiv.bg:8080/ambit2/reference/11847> ;
> >>>>>>>>       ot:units "" ;
> >>>>>>>>       =       ot:STRUCTURE_MolecularWeight .
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Bug reports are of course welcome at the usual place
> >>>>>>>>
> >>>>>>>> http://sourceforge.net/tracker/?group_id=191756
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Nina
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> 1) Feature data types:
> >>>>>>>>> Proposal (based on Pantelis' suggestions and Protege guide) at
> >>>>>>>>> http://opentox.org/data/documents/development/RDF%20files/Datatypes .
> >>>>>>>>> Updated opentox.owl at
> >>>>>>>>> http://opentox.org/data/documents/development/RDF%20files/OpenToxOntology/view
> >>>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Development mailing list
> >>>>>>>> Development at opentox.org
> >>>>>>>> http://www.opentox.org/mailman/listinfo/development
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> --
> >>>>> Surajit Ray
> >>>>> Partner
> >>>>> www.rareindianart.com
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
>
>

-- 
Surajit Ray
Partner
www.rareindianart.com