Tautomer generator

The ability of atoms within a molecule to interconvert between hydrogen-donor and hydrogen-acceptor roles due to tautomerism plays an important role in drug discovery. This interconversion not only affects the molecule's interaction with its biological surroundings but also creates a number of challenges for cheminformatics in terms of (i) chemical registration, (ii) structure/similarity searching, and (iii) QSAR modeling. Our focus in this post is on (i); we hope to have the opportunity to address (ii) and (iii) in future posts. The difficulties associated with tautomerism in chemical registration can be boiled down to the following:
  • Given a molecule, can we identify all of its tautomeric forms?
  • Given a list of tautomeric forms, how do we identify the most "preferable" form?
Both questions have been well studied in the literature (e.g., see the recent perspective by Roger Sayle and references therein). Here we describe our take on the problems and make available our implementation as a self-contained tool for experimentation. Note that we only consider prototropic tautomerism in the sequel, i.e., 1,n-shift tautomerism where n in {3, 5, 7, ...}.

Approaches to tautomer enumeration can be broadly classified as either local or global. (Technically speaking, the term "enumeration" is not correct in this context; "generation" is more appropriate.) In the local approach, a set of well-known structural patterns is applied to transform the input molecule into its different tautomeric forms. The patterns are typically limited to 1,3-shifts.

Though not many details are available for the global approach, the basic outline is as follows:
  1. Identify all hydrogen-donor and hydrogen-acceptor (hetero-) atoms.
  2. Generate a tautomeric form for each unique path of alternating double and single bonds between a hydrogen donor-acceptor atom pair, taking the parity of the path into account. Take care to handle overlapping paths.
  3. Repeat step (2) for all combinations of unique paths.
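The outline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind our tool: the molecule is a toy graph (a list of element symbols, a list of hydrogen counts, and a dict of bond orders keyed by atom-index pairs), the heteroatom set `HETERO` is an assumed choice, and the names `find_shift_paths`, `apply_shift`, and `generate_tautomers` are hypothetical. Step (3) is approximated by repeatedly applying single shifts until no new form appears, and no symmetry-aware canonicalization is attempted, so symmetry-equivalent forms would be counted separately.

```python
from collections import deque

HETERO = {"N", "O", "S"}  # assumption: elements treated as donors/acceptors


def find_shift_paths(elements, h_counts, bonds):
    """Return atom paths donor -> acceptor that alternate single, double, ...

    A valid path starts at a heteroatom carrying a hydrogen, leaves through a
    single bond, and ends on a heteroatom reached through a double bond, so it
    always has an even number of bonds (a 1,n-shift with odd n).
    """
    adjacency = {}
    for (a, b), order in bonds.items():
        adjacency.setdefault(a, []).append((b, order))
        adjacency.setdefault(b, []).append((a, order))

    paths = []

    def walk(path, wanted_order):
        for neighbor, order in adjacency.get(path[-1], []):
            if order != wanted_order or neighbor in path:
                continue
            extended = path + [neighbor]
            if wanted_order == 2 and elements[neighbor] in HETERO:
                paths.append(extended)
            walk(extended, 3 - wanted_order)  # toggle between 1 and 2

    for atom, element in enumerate(elements):
        if element in HETERO and h_counts[atom] > 0:
            walk([atom], 1)
    return paths


def apply_shift(h_counts, bonds, path):
    """Move one hydrogen from path[0] to path[-1] and flip bond orders."""
    new_h = list(h_counts)
    new_bonds = dict(bonds)
    new_h[path[0]] -= 1
    new_h[path[-1]] += 1
    for a, b in zip(path, path[1:]):
        key = (min(a, b), max(a, b))
        new_bonds[key] = 3 - bonds[key]  # single <-> double
    return new_h, new_bonds


def generate_tautomers(elements, h_counts, bonds, limit=1000):
    """Breadth-first closure over single shifts, capped at `limit` forms."""
    def state_key(h, b):
        return (tuple(h), tuple(sorted(b.items())))

    start = (list(h_counts), dict(bonds))
    seen = {state_key(*start)}
    queue = deque([start])
    forms = []
    while queue and len(forms) < limit:
        h, b = queue.popleft()
        forms.append((h, b))
        for path in find_shift_paths(elements, h, b):
            candidate = apply_shift(h, b, path)
            key = state_key(*candidate)
            if key not in seen:
                seen.add(key)
                queue.append(candidate)
    return forms


# Toy skeleton H2N-CH=CH-CH=O, which admits a single 1,5-shift.
elements = ["N", "C", "C", "C", "O"]
h_counts = [2, 1, 1, 1, 0]
bonds = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 2}
for h, b in generate_tautomers(elements, h_counts, bonds):
    print(h, b)
```

Because each shift raises one bond order and lowers the next along the path, interior atoms keep their valence, and only the two end atoms exchange a hydrogen for a double bond.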
Since this approach does not follow any specific patterns, there is no limit to the distance of the shift. For example, consider the following two molecules (taken from Fig. 15 of this paper):

[Figure: Example of 1,11-tautomeric form]

Since what we have is a 1,11-shift, it's unlikely that a typical local approach will be able to resolve 1 and 2 as two different tautomeric forms of the same molecule. The global approach, on the other hand, is able to handle this "hidden" tautomerism without a problem.

To address the most "preferable" tautomeric form, we extend the tautomer scoring scheme as described here. This scoring scheme, in turn, is used to rank each generated tautomer. The one with the highest score is selected as the canonical tautomer.

Below is a quick look at the tautomer generation tool. Currently, the maximum number of tautomers generated is limited to 1000. In the snapshot, the canonical tautomer is highlighted in pink. As always, we appreciate any feedback that would help us improve on our implementation.
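For readers who want to experiment with the ranking step, here is a minimal sketch of picking a canonical tautomer by score. The weights are purely illustrative assumptions and are not the scoring scheme referred to in this post; the names `score_tautomer` and `select_canonical` are hypothetical, and each tautomer is assumed to be a pair of hydrogen counts and a bond-order dict keyed by atom-index pairs.

```python
# Assumed, illustrative weights per double bond type; NOT the real scheme.
DOUBLE_BOND_WEIGHTS = {("C", "O"): 2, ("C", "N"): 1}


def score_tautomer(elements, bonds):
    """Sum the toy weights over all double bonds in one tautomeric form."""
    total = 0
    for (a, b), order in bonds.items():
        if order == 2:
            pair = tuple(sorted((elements[a], elements[b])))
            total += DOUBLE_BOND_WEIGHTS.get(pair, 0)
    return total


def select_canonical(elements, tautomers):
    """Return the highest-scoring form, breaking ties deterministically."""
    def rank_key(form):
        h_counts, bonds = form
        return (score_tautomer(elements, bonds),
                tuple(h_counts),
                tuple(sorted(bonds.items())))
    return max(tautomers, key=rank_key)


# Amide vs. imidic acid forms of a toy N-C-O skeleton.
elements = ["N", "C", "O"]
amide = ([2, 1, 0], {(0, 1): 1, (1, 2): 2})
imidic_acid = ([1, 1, 1], {(0, 1): 2, (1, 2): 1})
print(select_canonical(elements, [imidic_acid, amide]))
```

The tie-break on hydrogen counts and bond orders only makes the choice reproducible when scores are equal; it carries no chemical meaning.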
