[OTDev] [Fwd: RE: help IBMC on writing features/datasets]
Tobias Girschick tobias.girschick at in.tum.de
Thu Sep 9 10:48:23 CEST 2010
Hi Surajit, Developers,

below is a discussion we were having with Dmitry from IBMC to help him
build his OT MNA web service. It might be of interest.

regards,
Tobias

-------- Forwarded Message --------
> From: Tobias Girschick <tobias.girschick at in.tum.de>
> Reply-to: tobias.girschick at in.tum.de
> To: Druzhilovsky <dmitry.druzhilovsky at ibmc.msk.ru>
> Cc: 'Nina Jeliazkova' <jeliazkova.nina at gmail.com>, 'Pantelis
> Sopasakis' <chvng at mail.ntua.gr>, 'Buchwald, Fabian'
> <fabian.buchwald at in.tum.de>
> Subject: RE: help IBMC on writing features/datasets
> Date: Thu, 09 Sep 2010 10:14:14 +0200
>
> Dear Dmitry,
>
> that sounds indeed extremely similar to the gSpan and FTM services we
> offer as OT webservices. The only difference seems to be that our
> algorithms have additional parameters. In our case the dictionary is not
> only dependent on the dataset but also on the parameters. What we do is
> save the parameters (including the dataset uri) and the resulting
> dataset uri so that we don't have to recalculate descriptors. Each
> parameter setting (in your case only the dataset) has its own "version"
> of the FTM algorithm.
>
> opentox.informatik.tu-muenchen.de:8080/OpenTox-dev/algorithm/FTM1
> opentox.informatik.tu-muenchen.de:8080/OpenTox-dev/algorithm/FTM2
> opentox.informatik.tu-muenchen.de:8080/OpenTox-dev/algorithm/FTM3 ...
>
> Those URIs are referenced in the hasSource field of the feature.
>
>
> For further guidance you might also have a look at the FTM (or gSpan)
> implementations, although they are in Java, not PHP. The
> CallableFTM.java is the asynchronously running part and contains the
> important stuff.
>
> http://lxkramer13.informatik.tu-muenchen.de/trac/TUMOpenTox/browser/trunk/src/opentox/algorithm/descriptorcalculation/FTMResource.java
> http://lxkramer13.informatik.tu-muenchen.de/trac/TUMOpenTox/browser/trunk/src/opentox/algorithm/descriptorcalculation/CallableFTM.java
>
> Tobias
>
>
> On Wed, 2010-09-08 at 23:40 +0400, Druzhilovsky wrote:
> > Dear Tobias,
> >
> > > Does the dictionary of descriptors (point 3 below) you are talking
> > > about contain all the MNA 1-level and 2-level "words" for all
> > > compounds in the dataset? Can your descriptors be represented like
> > > that (descriptors vs. compounds):
> >
> > Yes, of course.
> > But for any particular dataset the dictionary of descriptors is
> > specific and cannot be predetermined. A "general" dictionary is
> > absolutely unrealistic: for our dataset of 270,000 structures the
> > dictionary of descriptors has a volume of about 67,000.
> > Another problem is that the generation of descriptors is provided
> > separately for different structures. So, any realistic way is the
> > appropriate use of the MNA descriptors by the user.
> >
> > Dmitry
> >
> > -----Original Message-----
> > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de]
> > Sent: Wednesday, September 08, 2010 7:29 PM
> > To: Druzhilovsky
> > Cc: 'Nina Jeliazkova'; 'Pantelis Sopasakis'; 'Buchwald, Fabian'
> > Subject: RE: help IBMC on writing features/datasets
> >
> > Hi Dmitry,
> >
> > from your clarifications I conclude that the representation of your MNA
> > descriptors can be chosen quite similar to our substructural descriptor
> > representations.
> >
> > What you have to change is that each "word" (SMILES string) is one
> > feature. The corresponding feature value for a substance (e.g.
> > aripiprazole-1.0) then is the 0 or 1 of the bit-string.
> >
> > Does the dictionary of descriptors (point 3 below) you are talking about
> > contain all the MNA 1-level and 2-level "words" for all compounds in the
> > dataset?
> > Can your descriptors be represented like that (descriptors vs.
> > compounds):
> >
> >         C   CC  O   N   C-C=C  ...
> > cmp1    1   1   0   1   1      ...
> > cmp2    1   1   1   0   0      ...
> > cmp3    1   1   0   1   1      ...
> > ...
> >
> >
> > regards,
> > Tobias
> >
> > On Wed, 2010-09-08 at 19:18 +0400, Druzhilovsky wrote:
> > > Hi Tobias,
> > >
> > > > Just to clarify some things. Is this one descriptor or are these all
> > > > substructural descriptors for molecule aripiprazole-1.0?
> > >
> > > This is the set of all substructural MNA 1 level and MNA 2 level
> > > descriptors for molecule aripiprazole-1.0.
> > > Each of the MNA descriptors is a "word" like C(C(CCC)C(CC-H-H)-H(C)-H(C)).
> > > Descriptors are separated by blank space.
> > >
> > >
> > > > Does that mean all those substructures/alerts are present in the
> > > > structure?
> > > Yes, that means all those substructures/alerts are present in the
> > > structure.
> > > You must take into account that MNA descriptors are atom centered
> > > descriptors.
> > > The MNA descriptors are overlapping. For example, the second level
> > > descriptor C(C(CCC)C(CC-H-H)-H(C)-H(C)) includes the following first level
> > > descriptors: CCCC, CCCHH, HC.
> > >
> > > > How exactly are they reused?
> > > One of the methods may be:
> > > 1) generate MNA descriptors for a set of structures
> > > 2) prepare a dictionary of these generated descriptors
> > > 3) represent each of the structures by a bit string with 1 in the
> > >    positions of its MNA descriptors according to this dictionary
> > > 4) use these bit strings as vectors
> > >
> > >
> > > The MNA and QNA descriptors do not represent the stereochemical
> > > peculiarities of a molecule; structures that only differ by stereochemistry
> > > are formally considered equivalent.
> > > So, for one compound and for all its conformers the same sets of MNA or QNA
> > > descriptors are generated.
> > >
> > > Dmitry
> > >
> > > -----Original Message-----
> > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de]
> > > Sent: Wednesday, September 08, 2010 6:15 PM
> > > To: Druzhilovsky
> > > Cc: Nina Jeliazkova; Pantelis Sopasakis; Buchwald, Fabian
> > > Subject: RE: help IBMC on writing features/datasets
> > >
> > > Hi Dmitry,
> > >
> > > On Wed, 2010-09-08 at 18:09 +0400, Druzhilovsky wrote:
> > > > Dear Tobias,
> > > >
> > > > Yes, for one compound and for all its conformers one descriptor (MNA or QNA) is generated.
> > >
> > > Ok, and the descriptor looks like this(?):
> > > HC HN CHHCC CHHCN CHHCO CHCC CCCC CCCN CCCO CCCCl CCNO NHCC
> > > NCCC OC OCC ClC C(C(CCC)C(CC-H-H)-H(C)-H(C))
> > > C(C(CCC)C(CC-H)N(CC-H)) C(C(CCC)C(CC-H)-H(C))
> > > C(C(CCN)C(CC-H)C(CC-H-H)) C(C(CCN)C(CC-H)-H(C))
> > > C(C(CCN)C(CC-O)-H(C)) C(C(CCN)C(CC-Cl)-Cl(C))
> > > C(C(CC-H-H)C(CN-O)-H(C)-H(C)) C(C(CC-H-H)N(CC-H)-O(C))
> > > C(C(CC-H)C(CC-H)-H(C)) C(C(CC-H)C(CC-H)-O(C-C))
> > > C(C(CC-H)C(CC-O)-H(C)) C(C(CC-H)C(CC-Cl)N(CCC))
> > > C(C(CC-H)C(CC-Cl)-H(C)) C(C(CC-H)C(CC-Cl)-Cl(C))
> > > C(C(CN-H-H)N(CCC)-H(C)-H(C)) C(C(CN-H-H)N(CC-C)-H(C)-H(C))
> > > N(C(CCN)C(CN-H-H)C(CN-H-H)) N(C(CCN)C(CN-O)-H(N))
> > > N(C(CN-H-H)C(CN-H-H)-C(N-H-H-C)) -H(C(CC-H)) -H(C(CC-H-H))
> > > -H(C(CN-H-H)) -H(N(CC-H)) -H(-C(N-H-H-C)) -H(-C(-H-H-C-C))
> > > -H(-C(-H-H-C-O)) -C(N(CC-C)-H(-C)-H(-C)-C(-H-H-C-C))
> > > -C(-H(-C)-H(-C)-C(N-H-H-C)-C(-H-H-C-C))
> > > -C(-H(-C)-H(-C)-C(-H-H-C-C)-C(-H-H-C-O))
> > > -C(-H(-C)-H(-C)-C(-H-H-C-C)-O(C-C)) -O(C(CC-O)-C(-H-H-C-O)) -O(C(CN-O))
> > > -Cl(C(CC-Cl))
> > >
> > > Does that mean all those substructures/alerts are present in the
> > > structure? How exactly are they reused? Could you clarify the process?
> > >
> > > Tobias
> > >
> > >
> > > >
> > > > Dmitry
> > > >
> > > > -----Original Message-----
> > > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de]
> > > > Sent: Wednesday, September 08, 2010 4:47 PM
> > > > To: Druzhilovsky
> > > > Cc: Buchwald, Fabian; Nina Jeliazkova; Pantelis Sopasakis
> > > > Subject: RE: help IBMC on writing features/datasets
> > > >
> > > > Dear Dmitry,
> > > >
> > > > >
> > > > > <rdf:Description rdf:about="http://195.178.207.160/api/1.1/MNA_DESCRIPTORS/0">
> > > > > <ns1:aripiprazole-1.0>HC HN CHHCC CHHCN CHHCO CHCC CCCC CCCN CCCO CCCCl CCNO NHCC NCCC OC OCC ClC C(C(CCC)C(CC-H-H)-H(C)-H(C)) C(C(CCC)C(CC-H)N(CC-H)) C(C(CCC)C(CC-H)-H(C)) C(C(CCN)C(CC-H)C(CC-H-H)) C(C(CCN)C(CC-H)-H(C)) C(C(CCN)C(CC-O)-H(C)) C(C(CCN)C(CC-Cl)-Cl(C)) C(C(CC-H-H)C(CN-O)-H(C)-H(C)) C(C(CC-H-H)N(CC-H)-O(C)) C(C(CC-H)C(CC-H)-H(C)) C(C(CC-H)C(CC-H)-O(C-C)) C(C(CC-H)C(CC-O)-H(C)) C(C(CC-H)C(CC-Cl)N(CCC)) C(C(CC-H)C(CC-Cl)-H(C)) C(C(CC-H)C(CC-Cl)-Cl(C)) C(C(CN-H-H)N(CCC)-H(C)-H(C)) C(C(CN-H-H)N(CC-C)-H(C)-H(C)) N(C(CCN)C(CN-H-H)C(CN-H-H)) N(C(CCN)C(CN-O)-H(N)) N(C(CN-H-H)C(CN-H-H)-C(N-H-H-C)) -H(C(CC-H)) -H(C(CC-H-H)) -H(C(CN-H-H)) -H(N(CC-H)) -H(-C(N-H-H-C)) -H(-C(-H-H-C-C)) -H(-C(-H-H-C-O)) -C(N(CC-C)-H(-C)-H(-C)-C(-H-H-C-C)) -C(-H(-C)-H(-C)-C(N-H-H-C)-C(-H-H-C-C)) -C(-H(-C)-H(-C)-C(-H-H-C-C)-C(-H-H-C-O)) -C(-H(-C)-H(-C)-C(-H-H-C-C)-O(C-C)) -O(C(CC-O)-C(-H-H-C-O)) -O(C(CN-O)) -Cl(C(CC-Cl)) </ns1:aripiprazole-1.0>
> > > > > </rdf:Description>
> > > >
> > > > Just to clarify some things. Is this one descriptor or are these all
> > > > substructural descriptors for molecule aripiprazole-1.0?
> > > >
> > > > Tobias
> > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de]
> > > > > Sent: Wednesday, September 08, 2010 3:09 PM
> > > > > To: Nina Jeliazkova
> > > > > Cc: chung; Buchwald, Fabian; Druzhilovsky
> > > > > Subject: Re: help IBMC on writing features/datasets
> > > > >
> > > > > Sure,
> > > > >
> > > > > @Dmitry: could you explain what works at the moment, what needs to be
> > > > > done and where you think you need help?
> > > > >
> > > > > regards,
> > > > > Tobias
> > > > >
> > > > > On Wed, 2010-09-08 at 14:06 +0300, Nina Jeliazkova wrote:
> > > > > > Hi Pantelis, Tobias, Fabian,
> > > > > >
> > > > > > I know everybody is busy, but could you try helping Dmitry with how
> > > > > > to write RDF describing IBMC fingerprint-like descriptors and
> > > > > > related predictions?
> > > > > >
> > > > > > Dmitry's feedback might be useful for the techie table exercise as
> > > > > > well.
> > > > > >
> > > > > > Thank you,
> > > > > > Nina
> > > > > >
> > > > > > --
> > > > > > Dr. Nina Jeliazkova
> > > > > > Technical Manager
> > > > > >
> > > > > > 4 A.Kanchev str.
> > > > > > IdeaConsult Ltd.
> > > > > >
> > > > > > 1000 Sofia, Bulgaria
> > > > > > Phone: +359 886 802011
> > > > >
> > > >
> > >
> >
>
--
Dipl.-Bioinf. Tobias Girschick

Technische Universität München
Institut für Informatik
Lehrstuhl I12 - Bioinformatik
Bolzmannstr. 3
85748 Garching b. München, Germany

Room: MI 01.09.042
Phone: +49 (89) 289-18002
Email: tobias.girschick at in.tum.de
Web: http://wwwkramer.in.tum.de/girschick
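
Editorial sketch (not part of the original thread): the four-step procedure Dmitry outlines (generate MNA words, build a dictionary, encode each structure as a bit string, use the bit strings as vectors) could look roughly like the Python below. The function names, the toy compounds cmp1 to cmp3 and their word assignments are hypothetical; only the individual MNA "words" are taken from the thread. Sorting the dictionary is an assumption made here to get a reproducible column order. Following Tobias's suggestion, each dictionary entry would correspond to one feature and each bit to that feature's value for a compound.

```python
from typing import Dict, List


def build_dictionary(descriptor_strings: Dict[str, str]) -> List[str]:
    """Collect every distinct MNA word across all compounds (steps 1-2)."""
    words = set()
    for text in descriptor_strings.values():
        # Descriptors are separated by blank space, as stated in the thread.
        words.update(text.split())
    # Sorting is an assumption: it only fixes a reproducible column order.
    return sorted(words)


def to_bit_vectors(
    descriptor_strings: Dict[str, str], dictionary: List[str]
) -> Dict[str, List[int]]:
    """Encode each compound as a 0/1 vector over the dictionary (steps 3-4)."""
    position = {word: i for i, word in enumerate(dictionary)}
    vectors = {}
    for name, text in descriptor_strings.items():
        bits = [0] * len(dictionary)
        for word in text.split():
            bits[position[word]] = 1
        vectors[name] = bits
    return vectors


# Hypothetical toy input: the words appear in the thread, the compounds do not.
compounds = {
    "cmp1": "HC CCCC C(C(CCC)C(CC-H-H)-H(C)-H(C))",
    "cmp2": "HC HN",
    "cmp3": "HC CCCC OC",
}

dictionary = build_dictionary(compounds)
vectors = to_bit_vectors(compounds, dictionary)

for name, bits in vectors.items():
    print(name, bits)
```

Because the dictionary is derived from the dataset itself, it has to be rebuilt (or versioned, as with the FTM1, FTM2, ... URIs mentioned above) whenever the dataset changes.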