[OTDev] help IBMC on writing features/datasets

Mon Sep 13 17:03:10 CEST 2010

Hi Dmitry,

On Fri, 2010-09-10 at 19:37 +0400, Druzhilovsky wrote: 
> Dear Nina, All,

> For a data set of about 100 structures the dictionary of MNA/1 and
> MNA/2 descriptors has a volume of about 1000 descriptors.
> 
> For the ordinary, usual, widespread QSAR methods the number of independent
> variables must be less than the number of compounds.

I would say that depends a lot on the methodology. Support vector
regression, for example, should be able to handle more variables than
compounds, since the final model is defined by the support vectors
rather than by all the variables. And, as Pantelis mentioned, feature
selection algorithms can be applied before learning the model.
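
As a rough illustration of such a feature selection step, here is a minimal Python sketch. It assumes each compound is given as a set of descriptor strings and keeps only the descriptors that occur in at least `min_count` compounds but not in all of them. The function name `select_features`, the threshold and the toy data are hypothetical and not part of any OpenTox service.

```python
from collections import Counter

def select_features(compounds, min_count=2):
    """Return the sorted descriptors that occur in at least min_count
    compounds but not in every compound (those cannot discriminate)."""
    counts = Counter()
    for words in compounds.values():
        counts.update(set(words))  # count each descriptor once per compound
    n = len(compounds)
    return sorted(w for w, c in counts.items() if min_count <= c < n)

# Toy data (hypothetical): compound name -> set of first-level MNA words
toy = {
    "cmp1": {"HC", "CCCC", "CCCN"},
    "cmp2": {"HC", "CCCC", "OC"},
    "cmp3": {"HC", "CCCN", "ClC"},
}
selected = select_features(toy)
print(selected)
```

A simple frequency filter like this is only one of many possible selection schemes; it is shown here because it needs nothing beyond the descriptor sets themselves.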

Regardless, it still makes sense to offer the MNA descriptor
calculation functionality, since local models with only a few compounds
might also be built by the user or by specific algorithms. This is
another point where your descriptors might be very useful.

regards,
Tobias
> 
> For one compound MakeMNA generates about 30 descriptors. So, in the
> corresponding bit string of 1000 bits only about 30 bits are set to 1
> and about 970 are set to 0. For bigger data sets the number of zero
> bits becomes greater and greater.
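
The dictionary and bit-string representation discussed in this thread can be sketched as follows. This is only an illustration in Python, assuming the descriptors of each compound arrive as one whitespace-separated string. `build_dictionary` and `to_bits` are hypothetical helper names, and the toy strings are much shorter than real MakeMNA output.

```python
def build_dictionary(descriptor_strings):
    """Map every distinct MNA word in the data set to a bit position."""
    words = sorted({w for s in descriptor_strings.values() for w in s.split()})
    return {w: i for i, w in enumerate(words)}

def to_bits(descriptor_string, dictionary):
    """Return a list of 0/1 values, one per dictionary entry."""
    bits = [0] * len(dictionary)
    for w in descriptor_string.split():
        bits[dictionary[w]] = 1
    return bits

# Toy data (hypothetical): compound name -> whitespace-separated MNA words
data = {
    "cmp1": "HC CCCC CCCN",
    "cmp2": "HC  CCCC   OC",
    "cmp3": "HC CCCN ClC",
}
dictionary = build_dictionary(data)
bit_strings = {name: to_bits(s, dictionary) for name, s in data.items()}
# Fraction of bits set to 1 for each compound
density = {name: sum(b) / len(b) for name, b in bit_strings.items()}
print(bit_strings, density)
```

With about 30 descriptors per compound and a dictionary of about 1000 entries, as in the figures quoted here, the density would be around 3%, so storing only the positions of the set bits may be more compact than the full bit string.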
> 
> Such a sparse representation is also not useful for QSAR methods like MLR.
> 
>  
> 
> Maybe kernel SVR is appropriate for MNA descriptors.
> 
>  
> 
> A general description of methods for the use of MNA/QNA can be seen in
> the attached files.
> 
>  
> 
> Kind regards
> 
> Dmitry
> 
>  
> 
>  
> 
> From: Nina Jeliazkova [mailto:jeliazkova.nina at gmail.com] 
> Sent: Friday, September 10, 2010 1:32 PM
> To: Druzhilovsky
> Cc: chung; tobias.girschick at in.tum.de; Buchwald, Fabian; surajit ray
> Subject: Re: help IBMC on writing features/datasets
> 
> 
>  
> 
> Dmitry,
> 
> 
> 
> On Fri, Sep 10, 2010 at 11:29 AM, Druzhilovsky
> <dmitry.druzhilovsky at ibmc.msk.ru> wrote:
> 
> Hi All,
> 
>  
> 
> In general, the main problem with MNA/QNA descriptors is that the
> ordinary, usual, widespread QSAR methods are not suitable for use with
> MNA/QNA at all.
> 
> 
> 
> Could you tell why exactly and which methods are not suitable? 
> 
> Nina 
> 
> P.S. I second Tobias's opinion that this discussion should be on the
> development list, not restricted to current participants only.
> 
> 
>          
>         
>         For QNA we have developed a special approach, published in:
>         
>          
>         
>         Filimonov D.A., Zakharov A.V., Lagunin A.A., Poroikov V.V.
>         (2009). QNA based ‘Star Track’ QSAR approach. SAR and QSAR
>         Environ. Res., 20 (7-8), 679-709
>         
>          
>         
>         We use MNA in the PASS algorithm, published in:
>         
>          
>         
>         Filimonov D.A., Poroikov V.V. (2008). Probabilistic approach
>         in activity prediction. In: Chemoinformatics Approaches to
>         Virtual Screening. Eds. Alexandre Varnek and Alexander
>         Tropsha. Cambridge (UK): RSC Publishing, p. 182-216
>         
>          
>         
>         Poroikov V.V., Filimonov D.A., Borodina Yu. V., Lagunin A.A.,
>         Kos A. (2000). Robustness of biological activity spectra
>         predicting by computer program PASS for non-congeneric sets
>         of chemical compounds. J. Chem. Inform. Comput. Sci., 40 (6),
>         1349-1355
>         
>          
>         
>         and many-many others.
>         
>          
>         
>          
>         
>         Another method that may be useful for QSAR based on MNA
>         descriptors is:
>         
>          
>         
>         1) we use PASS to estimate, for each of the about 67,000 MNA
>         descriptors in PASS, its general relevance for predicting
>         biological activity
>         
>          
>         
>         2) we sort the 67,000 MNA descriptors in PASS according to
>         their relevance
>         
>          
>         
>         3) in addition to MNA itself, we generate bit strings according
>         to the first 100-1000 entries of the list of "the best" MNA
>         descriptors built in steps 1-2.
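
Steps 1-3 could look roughly like this in Python. The relevance scores are placeholders: how PASS actually estimates relevance is not described in this thread, so `relevance` is simply assumed to be a mapping from descriptor to a number where higher means more relevant. All function and variable names are hypothetical.

```python
def top_descriptors(relevance, n):
    """Return the n descriptors with the highest relevance score.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(relevance, key=lambda w: (-relevance[w], w))
    return ranked[:n]

def bit_string(words, selected):
    """Return one 0/1 value per selected descriptor, in ranking order."""
    present = set(words)
    return [1 if w in present else 0 for w in selected]

# Placeholder scores (assumed, not produced by PASS)
relevance = {"HC": 0.1, "CCCC": 0.7, "CCCN": 0.9, "OC": 0.4, "ClC": 0.7}
best = top_descriptors(relevance, 3)
bits = bit_string(["HC", "CCCN", "ClC"], best)
print(best, bits)
```

Because the list of "the best" descriptors is fixed in advance, the bit strings have the same length and meaning for every data set, unlike a dictionary built per data set.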
>         
>          
>         
>         Development of this is possible, but it is a special task for
>         us.
>         
>          
>         
>         Kind regards
>         
>         Dmitry
>         
>          
>         
>          
>         
>         From: chung [mailto:chvng at mail.ntua.gr] 
>         Sent: Thursday, September 09, 2010 5:28 AM
>         To: Druzhilovsky
>         Cc: tobias.girschick at in.tum.de; 'Nina Jeliazkova'; 'Buchwald,
>         Fabian'
>         
>         
>         Subject: RE: help IBMC on writing features/datasets
>         
>         
>          
>         
>         Hi Dmitry, Tobias,
>           I think a set of some hundreds of thousands of descriptors
>         ("features" in OT terminology) is not that unrealistic and it
>         goes without question that such an algorithm that searches for
>         any substructure needs lots of features...  What I would
>         suggest is that every time the MNA descriptor calculation
>         algorithm is applied to a compound and a new substructure is
>         found (i.e. one that was not found in any other compound so
>         far), a new feature is generated, stored on the server and
>         assigned a URI. So the MNA/QNA descriptor calculation
>         algorithm will generate a dataset and, when necessary, a set
>         of new features.
>         Internally you need a way to tell whether an MNA feature is
>         registered in your database (There are lots of ways to do
>         that). It might be also useful to use the SMILES string of the
>         substructure in the feature URI, i.e. something like
>         http://yourserver.com/feature/c1cccc1c (URL encoded!). This
>         way it will be easier to tell if a substructure is registered
>         in your database. 
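
The feature-per-substructure idea can be sketched in Python as below. This is only an illustration: the base URL is the placeholder host from the message, `feature_uri` and `register` are hypothetical names, and the in-memory dictionary merely stands in for whatever database the service actually uses.

```python
from urllib.parse import quote, unquote

# Placeholder host taken from the example in this thread, not a real service.
BASE = "http://yourserver.com/feature/"

def feature_uri(word, base=BASE):
    """Build a feature URI with the descriptor string percent-encoded."""
    return base + quote(word, safe="")

# Hypothetical stand-in for the server-side feature database.
registry = {}

def register(word):
    """Return (uri, is_new); a feature is only created the first time."""
    uri = feature_uri(word)
    is_new = uri not in registry
    if is_new:
        registry[uri] = word
    return uri, is_new

word = "C(C(CCC)C(CC-H-H)-H(C)-H(C))"
uri, created = register(word)
uri_again, created_again = register(word)
decoded = unquote(uri[len(BASE):])  # recovers the original descriptor
print(uri, created, created_again, decoded)
```

Since the encoded descriptor is part of the URI, checking whether a substructure is already registered reduces to a single key lookup.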
>            You have to put some effort into that, but it will pay
>         off, as it will enable the use of your datasets in model
>         training and other QSAR work. I am in favor of the solution
>         that Tobias proposed with the boolean descriptors. Finally, I
>         would say that String is the datatype one should use when no
>         other type fits the purpose (i.e. almost only for metadata).
>         
>         Best Regards,
>         Pantelis
>         
>         On Wed, 2010-09-08 at 23:40 +0400, Druzhilovsky wrote: 
>         
>          
>         Dear Tobias,
>          
>          
>         Does the dictionary of descriptors (point 3 below) you are talking about contain all the MNA 1-level and 2-level "words" for all compounds in the dataset? Can your descriptors be represented like that (descriptors vs.
>         compounds):
>          
>         Yes, of course.
>         But for any particular dataset the dictionary of descriptors is specific and cannot be predetermined. A "general" dictionary is absolutely unrealistic: 
>         for our dataset of 270,000 structures the dictionary of descriptors has a volume of about 67,000.
>         Another problem is that the descriptors are generated separately for different structures. So the only realistic way is appropriate use of the MNA descriptors by the user.
>          
>         Dmitry
>          
>         -----Original Message-----
>         From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
>         Sent: Wednesday, September 08, 2010 7:29 PM
>         To: Druzhilovsky
>         Cc: 'Nina Jeliazkova'; 'Pantelis Sopasakis'; 'Buchwald, Fabian'
>         Subject: RE: help IBMC on writing features/datasets
>          
>         Hi Dmitry,
>          
>          
>         from your clarifications I conclude that the representation of your MNA
>         descriptors can be chosen quite similar to our substructural descriptor
>         representations. 
>          
>         What you have to change is that each "word" (SMILES string) becomes
>         one feature. The corresponding feature value for a substance (e.g.
>         aripiprazole-1.0) is then the 0 or 1 from the bit string.
>          
>         Does the dictionary of descriptors (point 3 below) you are talking about
>         contain all the MNA 1-level and 2-level "words" for all compounds in the
>         dataset? Can your descriptors be represented like that (descriptors vs.
>         compounds):
>          
>              C   CC  O   N   C-C=C  ...
>         cmp1 1   1   0   1   1      ...
>         cmp2 1   1   1   0   0      ...
>         cmp3 1   1   0   1   1      ...
>         ...
>          
>          
>         regards,
>         Tobias
>          
>         On Wed, 2010-09-08 at 19:18 +0400, Druzhilovsky wrote: 
>         > Hi Tobias,
>         > 
>         > > Just to clarify some things. Is this one descriptor or are these all 
>         > > substructural descriptors for molecule aripiprazole-1.0?
>         > 
>         > This is the set of all substructural MNA 1 level and MNA 2 level descriptors for molecule aripiprazole-1.0.
>         > Each of the MNA descriptors is a "word" like C(C(CCC)C(CC-H-H)-H(C)-H(C)).
>         > Descriptors are separated by blank space.
>         > 
>         > 
>         > > Does that mean all those substructures/alerts are present in the 
>         > > structure?
>         > Yes, that means all those substructures/alerts are present in the 
>         > structure.
>         > You must take into account that MNA descriptors are atom centered 
>         > descriptors.
>         > The MNA descriptors are overlapping. For example, the second level 
>         > descriptor C(C(CCC)C(CC-H-H)-H(C)-H(C)) includes the following first level 
>         > descriptors: CCCC, CCCHH, HC.
>         > 
>         > > How exactly are they reused?
>         > One possible method is:
>         > 1) generate MNA descriptors for a set of structures
>         > 2) prepare a dictionary of these generated descriptors
>         > 3) represent each of the structures by a bit string with 1 at the 
>         > positions of its MNA descriptors in this dictionary
>         > 4) use these bit strings as vectors
>         > 
>         > 
>         > The MNA and QNA descriptors do not represent the stereochemical 
>         > peculiarities of a molecule; structures that differ only by stereochemistry 
>         > are formally considered equivalent.
>         > So, for one compound and all its conformers the same sets of MNA or QNA 
>         > descriptors are generated.
>         > 
>         > Dmitry
>         > 
>         > -----Original Message-----
>         > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
>         > Sent: Wednesday, September 08, 2010 6:15 PM
>         > To: Druzhilovsky
>         > Cc: Nina Jeliazkova; Pantelis Sopasakis; Buchwald, Fabian
>         > Subject: RE: help IBMC on writing features/datasets
>         > 
>         > Hi Dmitry,
>         > 
>         > On Wed, 2010-09-08 at 18:09 +0400, Druzhilovsky wrote: 
>         > > Dear Tobias,
>         > > 
>         > > Yes, for one compound and all its conformers one descriptor (MNA or QNA) is generated.
>         > 
>         > Ok, and the descriptor looks like that(?):
>         > HC  HN    CHHCC CHHCN CHHCO CHCC  CCCC  CCCN  CCCO  CCCCl CCNO  NHCC
>         > NCCC  OC    OCC   ClC   C(C(CCC)C(CC-H-H)-H(C)-H(C))
>         > C(C(CCC)C(CC-H)N(CC-H))      C(C(CCC)C(CC-H)-H(C))
>         > C(C(CCN)C(CC-H)C(CC-H-H))    C(C(CCN)C(CC-H)-H(C))
>         > C(C(CCN)C(CC-O)-H(C))   C(C(CCN)C(CC-Cl)-Cl(C))
>         > C(C(CC-H-H)C(CN-O)-H(C)-H(C)) C(C(CC-H-H)N(CC-H)-O(C))
>         > C(C(CC-H)C(CC-H)-H(C))  C(C(CC-H)C(CC-H)-O(C-C))
>         > C(C(CC-H)C(CC-O)-H(C))      C(C(CC-H)C(CC-Cl)N(CCC))
>         > C(C(CC-H)C(CC-Cl)-H(C)) C(C(CC-H)C(CC-Cl)-Cl(C))
>         > C(C(CN-H-H)N(CCC)-H(C)-H(C)) C(C(CN-H-H)N(CC-C)-H(C)-H(C))
>         > N(C(CCN)C(CN-H-H)C(CN-H-H))      N(C(CCN)C(CN-O)-H(N))
>         > N(C(CN-H-H)C(CN-H-H)-C(N-H-H-C))   -H(C(CC-H)) -H(C(CC-H-H))
>         > -H(C(CN-H-H))     -H(N(CC-H)) -H(-C(N-H-H-C))   -H(-C(-H-H-C-C))
>         > -H(-C(-H-H-C-O))  -C(N(CC-C)-H(-C)-H(-C)-C(-H-H-C-C))
>         > -C(-H(-C)-H(-C)-C(N-H-H-C)-C(-H-H-C-C))
>         > -C(-H(-C)-H(-C)-C(-H-H-C-C)-C(-H-H-C-O))
>         > -C(-H(-C)-H(-C)-C(-H-H-C-C)-O(C-C)) -O(C(CC-O)-C(-H-H-C-O)) -O(C(CN-O))
>         > -Cl(C(CC-Cl)) 
>         > 
>         > Does that mean all those substructures/alerts are present in the
>         > structure? How exactly are they reused? Could you clarify the process?
>         > 
>         > Tobias
>         > 
>         > > 
>         > > Dmitry 
>         > > 
>         > > -----Original Message-----
>         > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
>         > > Sent: Wednesday, September 08, 2010 4:47 PM
>         > > To: Druzhilovsky
>         > > Cc: Buchwald, Fabian; Nina Jeliazkova; Pantelis Sopasakis
>         > > Subject: RE: help IBMC on writing features/datasets
>         > > 
>         > > Dear Dmitry,
>         > > 
>         > > > 
>         > > > <rdf:Description rdf:about="http://195.178.207.160/api/1.1/MNA_DESCRIPTORS/0">
>         > > >    <ns1:aripiprazole-1.0>HC  HN    CHHCC CHHCN CHHCO CHCC  CCCC  CCCN  CCCO  CCCCl CCNO  NHCC      NCCC  OC    OCC   ClC   C(C(CCC)C(CC-H-H)-H(C)-H(C)) C(C(CCC)C(CC-H)N(CC-H))      C(C(CCC)C(CC-H)-H(C))   C(C(CCN)C(CC-H)C(CC-H-H))    C(C(CCN)C(CC-H)-H(C))      C(C(CCN)C(CC-O)-H(C))   C(C(CCN)C(CC-Cl)-Cl(C)) C(C(CC-H-H)C(CN-O)-H(C)-H(C)) C(C(CC-H-H)N(CC-H)-O(C))   C(C(CC-H)C(CC-H)-H(C))  C(C(CC-H)C(CC-H)-O(C-C))     C(C(CC-H)C(CC-O)-H(C))      C(C(CC-H)C(CC-Cl)N(CCC))     C(C(CC-H)C(CC-Cl)-H(C)) C(C(CC-H)C(CC-Cl)-Cl(C))      C(C(CN-H-H)N(CCC)-H(C)-H(C)) C(C(CN-H-H)N(CC-C)-H(C)-H(C)) N(C(CCN)C(CN-H-H)C(CN-H-H))      N(C(CCN)C(CN-O)-H(N))   N(C(CN-H-H)C(CN-H-H)-C(N-H-H-C))   -H(C(CC-H)) -H(C(CC-H-H))      -H(C(CN-H-H))     -H(N(CC-H)) -H(-C(N-H-H-C))   -H(-C(-H-H-C-C))  -H(-C(-H-H-C-O))  -C(N(CC-C)-H(-C)-H(-C)-C(-H-H-C-C)) -C(-H(-C)-H(-C)-C(N-H-H-C)-C(-H-H-C-C))  -C(-H(-C)-H(-C)-C(-H-H-C-C)-C(-H-H-C-O))     -C(-H(-C)-H(-C)-C(-H-H-C-C)-O(C-C)) -O(C(CC-O)-C(-H-H-C-O)) -O(C(CN-O))  -Cl(C(CC-Cl))     </ns1:aripiprazole-1.0>
>         > > > </rdf:Description>
>         > > 
>         > > Just to clarify some things. Is this one descriptor or are these all
>         > > substructural descriptors for molecule aripiprazole-1.0? 
>         > > 
>         > > Tobias
>         > > 
>         > > 
>         > > > 
>         > > > 
>         > > > -----Original Message-----
>         > > > From: Tobias Girschick [mailto:tobias.girschick at in.tum.de] 
>         > > > Sent: Wednesday, September 08, 2010 3:09 PM
>         > > > To: Nina Jeliazkova
>         > > > Cc: chung; Buchwald, Fabian; Druzhilovsky
>         > > > Subject: Re: help IBMC on writing features/datasets
>         > > > 
>         > > > Sure,
>         > > > 
>         > > > @Dmitry: could you explain what works at the moment, what needs to be
>         > > > done and where you think you need help?
>         > > > 
>         > > > regards,
>         > > > Tobias
>         > > > 
>         > > > On Wed, 2010-09-08 at 14:06 +0300, Nina Jeliazkova wrote:
>         > > > > Hi Pantelis, Tobias, Fabian,
>         > > > > 
>         > > > > I know everybody is busy, but could you try helping Dmitry with
>         > > > > writing RDF that describes IBMC fingerprint-like descriptors and
>         > > > > related predictions?
>         > > > > 
>         > > > > Dmitry's feedback might be useful for the techie table exercise  as
>         > > > > well.
>         > > > > 
>         > > > > Thank you,
>         > > > > Nina
>         > > > > 
>         > > > > -- 
>         > > > > Dr. Nina Jeliazkova
>         > > > > Technical Manager
>         > > > > 
>         > > > > 4 A.Kanchev str.
>         > > > > IdeaConsult Ltd.
>         > > > > 
>         > > > > 1000 Sofia, Bulgaria
>         > > > > Phone: +359 886 802011
>         > > > > 
>         > > > > 
>         > > > 
>         > > 
>         > 
>          
>         
>          
>         
>         
> 

-- 
Dipl.-Bioinf. Tobias Girschick

Technische Universität München
Institut für Informatik
Lehrstuhl I12 - Bioinformatik
Bolzmannstr. 3
85748 Garching b. München, Germany

Room: MI 01.09.042
Phone: +49 (89) 289-18002
Email: tobias.girschick at in.tum.de
Web: http://wwwkramer.in.tum.de/girschick