[OTDev] [Fwd: RE: help IBMC on writing features/datasets]

Thu Sep 9 10:48:23 CEST 2010

Hi Surajit, Developers,

below is a discussion we were having with Dmitry from IBMC to help him
build his OT MNA web service. It might be of interest.

regards,
Tobias

-------- Forwarded Message --------
> From: Tobias Girschick <tobias.girschick at in.tum.de>
> Reply-to: tobias.girschick at in.tum.de
> To: Druzhilovsky <dmitry.druzhilovsky at ibmc.msk.ru>
> Cc: 'Nina Jeliazkova' <jeliazkova.nina at gmail.com>, 'Pantelis
> Sopasakis' <chvng at mail.ntua.gr>, 'Buchwald, Fabian'
> <fabian.buchwald at in.tum.de>
> Subject: RE: help IBMC on writing features/datasets
> Date: Thu, 09 Sep 2010 10:14:14 +0200
> 
> Dear Dmitry,
> 
> that indeed sounds extremely similar to the gSpan and FTM services we
> offer as OT webservices. The only difference seems to be that our algorithms
> have additional parameters. In our case the dictionary is not only
> dependent on the dataset but also on the parameters. What we do is to
> save the parameters (including the dataset uri) and the resulting
> dataset uri so that we don't have to recalculate descriptors. Each
> parameter setting (in your case only the dataset) has its own "version"
> of the FTM algorithm.
> 
> opentox.informatik.tu-muenchen.de:8080/OpenTox-dev/algorithm/FTM1
> opentox.informatik.tu-muenchen.de:8080/OpenTox-dev/algorithm/FTM2
> opentox.informatik.tu-muenchen.de:8080/OpenTox-dev/algorithm/FTM3 ...
> 
> Those URIs are referenced in the hasSource field of the feature.
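
The bookkeeping described above could be sketched roughly as follows. This is only an illustration, not the code of the TUM services: the class name, the example.org URIs and the "minfreq" parameter are hypothetical. Each new parameter setting (including the dataset URI) gets its own numbered algorithm URI, and a repeated setting returns the stored result dataset URI instead of triggering a recalculation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical base URI, used for illustration only.
ALGORITHM_BASE = "http://example.org/algorithm/FTM"

Key = Tuple[Tuple[str, str], ...]


class DescriptorCache:
    """Remembers which parameter setting produced which result dataset."""

    def __init__(self) -> None:
        # key: sorted (name, value) pairs, dataset URI included
        # value: (algorithm "version" URI, result dataset URI)
        self._entries: Dict[Key, Tuple[str, str]] = {}

    @staticmethod
    def _key(params: Dict[str, str]) -> Key:
        # Sorting makes the key independent of parameter order.
        return tuple(sorted(params.items()))

    def lookup(self, params: Dict[str, str]) -> Optional[Tuple[str, str]]:
        """Return the stored entry for this setting, or None if it is new."""
        return self._entries.get(self._key(params))

    def store(self, params: Dict[str, str], result_dataset_uri: str) -> Tuple[str, str]:
        """Register a finished calculation under a new algorithm version URI."""
        key = self._key(params)
        if key not in self._entries:
            algorithm_uri = f"{ALGORITHM_BASE}{len(self._entries) + 1}"
            self._entries[key] = (algorithm_uri, result_dataset_uri)
        return self._entries[key]
```

A service would call lookup() before starting a calculation and store() once the result dataset has been written; the algorithm URI returned by store() is what would go into the hasSource field of the generated features.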
> 
> 
> For further guidance you might also have a look at the FTM (or gSpan)
> implementations, although they are in Java, not PHP. The
> CallableFTM.java is the asynchronously running part and contains the
> important stuff. 
> 
> http://lxkramer13.informatik.tu-muenchen.de/trac/TUMOpenTox/browser/trunk/src/opentox/algorithm/descriptorcalculation/FTMResource.java
> http://lxkramer13.informatik.tu-muenchen.de/trac/TUMOpenTox/browser/trunk/src/opentox/algorithm/descriptorcalculation/CallableFTM.java
> 
> Tobias
> 
> 
> On Wed, 2010-09-08 at 23:40 +0400, Druzhilovsky wrote: 
> > Dear Tobias,
> 
> > Does the dictionary of descriptors (point 3 below) you are talking about contain all the MNA 1-level and 2-level "words" for all compounds in the dataset? Can your descriptors be represented like that (descriptors vs.
> > compounds):
> > 
> > Yes, of course.
> > But for any particular dataset the dictionary of descriptors is specific and cannot be predetermined. A "general" dictionary is absolutely unrealistic: 
> > for our dataset of 270,000 structures the dictionary of descriptors has a size of about 67,000.
> > Another problem is that the generation of descriptors is performed separately for different structures. So the only realistic approach is appropriate use of the MNA descriptors by the user.
> > 
> > Dmitry
> > 
> > -----Original Message-----
> > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
> > Sent: Wednesday, September 08, 2010 7:29 PM
> > To: Druzhilovsky
> > Cc: 'Nina Jeliazkova'; 'Pantelis Sopasakis'; 'Buchwald, Fabian'
> > Subject: RE: help IBMC on writing features/datasets
> > 
> > Hi Dmitry,
> > 
> > from your clarifications I conclude that the representation of your MNA
> > descriptors can be chosen quite similar to our substructural descriptor
> > representations. 
> > 
> > What you have to change is that each "word" (SMILES string) is one
> > feature. The corresponding feature value for a substance (e.g.
> > aripiprazole-1.0) then is the 0 or 1 of the bit-string.
> > 
> > Does the dictionary of descriptors (point 3 below) you are talking about
> > contain all the MNA 1-level and 2-level "words" for all compounds in the
> > dataset? Can your descriptors be represented like that (descriptors vs.
> > compounds):
> > 
> > C CC O N C-C=C ...
> > cmp1 1 1 0 1 1 ...
> > cmp2 1 1 1 0 0 ...
> > cmp3 1 1 0 1 1 ...
> > ...
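
As a rough sketch of the "one word = one feature" layout in the table above: the feature and algorithm URIs are hypothetical placeholders, and the dictionaries only mimic the idea of a feature with a title and a hasSource reference, not the actual OpenTox RDF.

```python
from typing import Dict, List

# Hypothetical URIs, for illustration only.
ALGORITHM_URI = "http://example.org/algorithm/MNA1"
FEATURE_BASE = "http://example.org/feature/"


def make_features(words: List[str]) -> List[Dict[str, str]]:
    """Create one feature record per descriptor word."""
    return [
        {"uri": f"{FEATURE_BASE}{i}", "title": word, "hasSource": ALGORITHM_URI}
        for i, word in enumerate(words)
    ]


def make_data_entries(
    compounds: Dict[str, List[str]], features: List[Dict[str, str]]
) -> Dict[str, Dict[str, int]]:
    """For each compound, map every feature URI to 1 (word present) or 0."""
    entries = {}
    for name, words in compounds.items():
        present = set(words)
        entries[name] = {f["uri"]: int(f["title"] in present) for f in features}
    return entries


# The first two rows of the table above.
features = make_features(["C", "CC", "O", "N", "C-C=C"])
compounds = {
    "cmp1": ["C", "CC", "N", "C-C=C"],
    "cmp2": ["C", "CC", "O"],
}
entries = make_data_entries(compounds, features)
for name, values in entries.items():
    print(name, *values.values())
```

Every compound gets a value for every feature, so the rows have the same length and can be written out as one dataset.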
> > 
> > 
> > regards,
> > Tobias
> > 
> > On Wed, 2010-09-08 at 19:18 +0400, Druzhilovsky wrote: 
> > > Hi Tobias,
> > > 
> > > > Just to clarify some things. Is this one descriptor or are these all 
> > > > substructural descriptors for molecule aripiprazole-1.0?
> > > 
> > > This is the set of all substructural MNA 1 level and MNA 2 level descriptors for molecule aripiprazole-1.0.
> > > Each of the MNA descriptors is "word" like C(C(CCC)C(CC-H-H)-H(C)-H(C)).
> > > Descriptors are separated by blank space.
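
Since the words are separated by blank space, splitting the string is enough to recover the individual descriptors. The sketch below also separates the two levels, on the assumption (taken only from the examples in this thread, not from an MNA specification) that second-level words contain parentheses and first-level words do not. The function name is hypothetical.

```python
from typing import List, Tuple


def split_mna(text: str) -> Tuple[List[str], List[str]]:
    """Split a blank-separated MNA string into (level 1, level 2) words."""
    words = text.split()  # any run of whitespace counts as one separator
    # Assumption: a word with parentheses is a second-level descriptor.
    level1 = [w for w in words if "(" not in w]
    level2 = [w for w in words if "(" in w]
    return level1, level2


# A short excerpt of the aripiprazole-1.0 string from this thread.
sample = "HC  HN    CHHCC C(C(CCC)C(CC-H-H)-H(C)-H(C)) -Cl(C(CC-Cl))"
level1, level2 = split_mna(sample)
print(level1)
print(level2)
```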
> > > 
> > > 
> > > > Does that mean all those substructures/alerts are present in the 
> > > > structure?
> > > Yes, that means all those substructures/alerts are present in the 
> > > structure.
> > > You must take into account that MNA descriptors are atom-centered 
> > > descriptors.
> > > The MNA descriptors are overlapping. For example, the second level 
> > > descriptor C(C(CCC)C(CC-H-H)-H(C)-H(C)) includes the following first level 
> > > descriptors: CCCC, CCCHH, HC.
> > > 
> > > > How exactly are they reused?
> > > One possible method is:
> > > 1) generate MNA descriptors for a set of structures
> > > 2) prepare a dictionary of these generated descriptors
> > > 3) represent each structure as a bit string with 1 at the positions of 
> > > its MNA descriptors according to this dictionary
> > > 4) use these bit strings as vectors
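
Steps 2 to 4 above might look like this in a minimal sketch. Step 1, the MNA generation itself, is assumed to have happened already and is not shown; the function names and the tiny word lists are made up for illustration.

```python
from typing import Dict, List


def build_dictionary(descriptor_sets: Dict[str, List[str]]) -> List[str]:
    """Step 2: collect every distinct word of the dataset in a fixed order."""
    words = set()
    for compound_words in descriptor_sets.values():
        words.update(compound_words)
    return sorted(words)


def to_bit_string(compound_words: List[str], dictionary: List[str]) -> List[int]:
    """Step 3: 1 at each dictionary position whose word occurs in the compound."""
    present = set(compound_words)
    return [1 if word in present else 0 for word in dictionary]


# Output of step 1 for two made-up structures.
descriptor_sets = {
    "cmp1": ["HC", "CCCC"],
    "cmp2": ["HC", "OC"],
}
dictionary = build_dictionary(descriptor_sets)
# Step 4: the bit strings are fixed-length vectors over the same dictionary.
vectors = {name: to_bit_string(ws, dictionary) for name, ws in descriptor_sets.items()}
print(dictionary)
print(vectors)
```

Because the dictionary is built from the dataset itself, the bit positions are only comparable between structures encoded with the same dictionary.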
> > > 
> > > 
> > > The MNA and QNA descriptors do not represent the stereochemical 
> > > peculiarities of a molecule; structures that only differ by stereochemistry 
> > > are formally considered equivalent.
> > > So, for one compound and for all its conformers the same sets of MNA or QNA 
> > > descriptors are generated.
> > > 
> > > Dmitry
> > > 
> > > -----Original Message-----
> > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
> > > Sent: Wednesday, September 08, 2010 6:15 PM
> > > To: Druzhilovsky
> > > Cc: Nina Jeliazkova; Pantelis Sopasakis; Buchwald, Fabian
> > > Subject: RE: help IBMC on writing features/datasets
> > > 
> > > Hi Dmitry,
> > > 
> > > On Wed, 2010-09-08 at 18:09 +0400, Druzhilovsky wrote: 
> > > > Dear Tobias,
> > > > 
> > > > Yes, for one compound and for all its conformers one descriptor (MNA or QNA) is generated.
> > > 
> > > Ok, and the descriptor looks like that(?):
> > > HC  HN    CHHCC CHHCN CHHCO CHCC  CCCC  CCCN  CCCO  CCCCl CCNO  NHCC
> > > NCCC  OC    OCC   ClC   C(C(CCC)C(CC-H-H)-H(C)-H(C))
> > > C(C(CCC)C(CC-H)N(CC-H))      C(C(CCC)C(CC-H)-H(C))
> > > C(C(CCN)C(CC-H)C(CC-H-H))    C(C(CCN)C(CC-H)-H(C))
> > > C(C(CCN)C(CC-O)-H(C))   C(C(CCN)C(CC-Cl)-Cl(C))
> > > C(C(CC-H-H)C(CN-O)-H(C)-H(C)) C(C(CC-H-H)N(CC-H)-O(C))
> > > C(C(CC-H)C(CC-H)-H(C))  C(C(CC-H)C(CC-H)-O(C-C))
> > > C(C(CC-H)C(CC-O)-H(C))      C(C(CC-H)C(CC-Cl)N(CCC))
> > > C(C(CC-H)C(CC-Cl)-H(C)) C(C(CC-H)C(CC-Cl)-Cl(C))
> > > C(C(CN-H-H)N(CCC)-H(C)-H(C)) C(C(CN-H-H)N(CC-C)-H(C)-H(C))
> > > N(C(CCN)C(CN-H-H)C(CN-H-H))      N(C(CCN)C(CN-O)-H(N))
> > > N(C(CN-H-H)C(CN-H-H)-C(N-H-H-C))   -H(C(CC-H)) -H(C(CC-H-H))
> > > -H(C(CN-H-H))     -H(N(CC-H)) -H(-C(N-H-H-C))   -H(-C(-H-H-C-C))
> > > -H(-C(-H-H-C-O))  -C(N(CC-C)-H(-C)-H(-C)-C(-H-H-C-C))
> > > -C(-H(-C)-H(-C)-C(N-H-H-C)-C(-H-H-C-C))
> > > -C(-H(-C)-H(-C)-C(-H-H-C-C)-C(-H-H-C-O))
> > > -C(-H(-C)-H(-C)-C(-H-H-C-C)-O(C-C)) -O(C(CC-O)-C(-H-H-C-O)) -O(C(CN-O))
> > > -Cl(C(CC-Cl)) 
> > > 
> > > Does that mean all those substructures/alerts are present in the
> > > structure? How exactly are they reused? Could you clarify the process?
> > > 
> > > Tobias
> > > 
> > > > 
> > > > Dmitry 
> > > > 
> > > > -----Original Message-----
> > > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
> > > > Sent: Wednesday, September 08, 2010 4:47 PM
> > > > To: Druzhilovsky
> > > > Cc: Buchwald, Fabian; Nina Jeliazkova; Pantelis Sopasakis
> > > > Subject: RE: help IBMC on writing features/datasets
> > > > 
> > > > Dear Dmitry,
> > > > 
> > > > > 
> > > > > <rdf:Description rdf:about="http://195.178.207.160/api/1.1/MNA_DESCRIPTORS/0">
> > > > >    <ns1:aripiprazole-1.0>HC  HN    CHHCC CHHCN CHHCO CHCC  CCCC  CCCN  CCCO  CCCCl CCNO  NHCC      NCCC  OC    OCC   ClC   C(C(CCC)C(CC-H-H)-H(C)-H(C)) C(C(CCC)C(CC-H)N(CC-H))      C(C(CCC)C(CC-H)-H(C))   C(C(CCN)C(CC-H)C(CC-H-H))    C(C(CCN)C(CC-H)-H(C))      C(C(CCN)C(CC-O)-H(C))   C(C(CCN)C(CC-Cl)-Cl(C)) C(C(CC-H-H)C(CN-O)-H(C)-H(C)) C(C(CC-H-H)N(CC-H)-O(C))   C(C(CC-H)C(CC-H)-H(C))  C(C(CC-H)C(CC-H)-O(C-C))     C(C(CC-H)C(CC-O)-H(C))      C(C(CC-H)C(CC-Cl)N(CCC))     C(C(CC-H)C(CC-Cl)-H(C)) C(C(CC-H)C(CC-Cl)-Cl(C))      C(C(CN-H-H)N(CCC)-H(C)-H(C)) C(C(CN-H-H)N(CC-C)-H(C)-H(C)) N(C(CCN)C(CN-H-H)C(CN-H-H))      N(C(CCN)C(CN-O)-H(N))   N(C(CN-H-H)C(CN-H-H)-C(N-H-H-C))   -H(C(CC-H)) -H(C(CC-H-H))      -H(C(CN-H-H))     -H(N(CC-H)) -H(-C(N-H-H-C))   -H(-C(-H-H-C-C))  -H(-C(-H-H-C-O))  -C(N(CC-C)-H(-C)-H(-C)-C(-H-H-C-C)) -C(-H(-C)-H(-C)-C(N-H-H-C)-C(-H-H-C-C))  -C(-H(-C)-H(-C)-C(-H-H-C-C)-C(-H-H-C-O))     -C(-H(-C)-H(-C)-C(-H-H-C-C)-O(C-C)) -O(C(CC-O)-C(-H-H-C-O)) -O(C(CN-O))  -Cl(C(CC-Cl))     </ns1:aripiprazole-1.0>
> > > > > </rdf:Description>
> > > > 
> > > > Just to clarify some things. Is this one descriptor or are these all
> > > > substructural descriptors for molecule aripiprazole-1.0? 
> > > > 
> > > > Tobias
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
> > > > > Sent: Wednesday, September 08, 2010 3:09 PM
> > > > > To: Nina Jeliazkova
> > > > > Cc: chung; Buchwald, Fabian; Druzhilovsky
> > > > > Subject: Re: help IBMC on writing features/datasets
> > > > > 
> > > > > Sure,
> > > > > 
> > > > > @Dmitry: could you explain what works at the moment, what needs to be
> > > > > done and where you think you need help?
> > > > > 
> > > > > regards,
> > > > > Tobias
> > > > > 
> > > > > On Wed, 2010-09-08 at 14:06 +0300, Nina Jeliazkova wrote:
> > > > > > Hi Pantelis, Tobias, Fabian,
> > > > > > 
> > > > > > I know everybody is busy, but could you try helping Dmitry with how
> > > > > > to write RDF describing IBMC fingerprint-like descriptors and related
> > > > > > predictions?
> > > > > > 
> > > > > > Dmitry's feedback might be useful for the techie table exercise  as
> > > > > > well.
> > > > > > 
> > > > > > Thank you,
> > > > > > Nina
> > > > > > 
> > > > > > -- 
> > > > > > Dr. Nina Jeliazkova
> > > > > > Technical Manager
> > > > > > 
> > > > > > 4 A.Kanchev str.
> > > > > > IdeaConsult Ltd.
> > > > > > 
> > > > > > 1000 Sofia, Bulgaria
> > > > > > Phone: +359 886 802011
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

-- 
Dipl.-Bioinf. Tobias Girschick

Technische Universität München
Institut für Informatik
Lehrstuhl I12 - Bioinformatik
Boltzmannstr. 3
85748 Garching b. München, Germany

Room: MI 01.09.042
Phone: +49 (89) 289-18002
Email: tobias.girschick at in.tum.de
Web: http://wwwkramer.in.tum.de/girschick