Post-doctoral Position in Combination Screening Informatics

The NCATS Informatics group is looking for a post-doctoral scientist to work on methodology development for the analysis and visualization of drug combination screening data. Currently, the NCATS Matrix Screening Platform has screened more than 65,000 combinations (in checkerboard fashion) across 200 cell lines, looking for synergy and antagonism. See the following papers for an overview of the platform and recent applications.

I’m looking for a candidate who’s interested in addressing multiple aspects of combination screening, such as:

  • How do we use these data, along with public data sources, to prospectively predict drug combination responses?
  • How can we effectively summarize, visualize and prioritize 5,000 combination responses?
  • What do combinations tell us about biological networks and their responses to perturbations?

However, these are just starting points. It’s expected that the candidate will dig into the data and take the lead on formulating questions and identifying opportunities to develop and apply new methods of analysis & visualization. There will be opportunities to work with experimental scientists so that novel methods can be tested on real data.

About You – You should have experience working with chemical structure-activity data and/or genomic data and a deep interest in developing methods to analyze such data. You’re strong on algorithms and/or statistics and on the lookout for new ways to apply them to chemical and biological data – or to develop brand new methods when existing ones don’t do what you want. You also want to share your work – by implementing your research in publicly distributed software, by writing papers and by presenting your work at meetings. At the end of it all, you want your work to impact real-life problems and not just remain a theoretical curiosity.

About Us – The NCATS Informatics group plays a key role in the operation of multiple screening platforms (small molecules, RNAi, ADME, drug combinations) within the Division of Preclinical Innovation at NCATS. Informatics scientists are committed to research and development in all areas of molecular informatics. The group provides an intellectually stimulating environment and is interested in a diverse range of informatics topics that vary in scale from small molecules right up to phenotypes and clinical data. Since we collaborate closely with experimental colleagues at NCATS, you’ll be exposed to research on a wide variety of biological systems. Importantly, we’re a production center, which means we generate large quantities of screening data on a regular basis (30,000 plates over 1280 screens in 2014 alone). As a result, our research is regularly applied to live data and projects.


Requirements:

  • A Ph.D. in a computational field (e.g., computational chemistry, cheminformatics, bioinformatics) or a Ph.D. in Computer Science/Statistics with a specialization in biological and/or chemical topics.
  • 3 or more publications during your Ph.D., with at least 1 first-author paper (or else a paper where you were a significant contributor).
  • Must have strong programming skills – ideally in R. If not, you should have experience with other data analysis environments (Python, Matlab, etc.) and be willing to pick up R. In general you’ll likely need to munge data, internal and external, so this should not be a bottleneck.
  • Strong communication skills (in written and spoken English).


Nice to have:

  • Background in biology, chemistry or biochemistry
  • Experience with gene expression (microarray or RNASeq) analysis
  • Experience with Java development

Please send a cover letter that specifically addresses why you’re interested in this position, a CV and a bibliography (all as PDFs) to Rajarshi Guha (

In memoriam Eric Dawson

It’s with great sadness that we note the passing of our collaborator Eric Dawson. Through the BARD project we got to know Eric as a friendly, engaging and effective collaborator; his enthusiasm and laid-back personality often brought proper perspective to any issue at hand; and without a doubt he was simply an all-around good guy. It was an honor for us to have had the privilege of working with Eric.

Eric was one of our key early adopters. It is a selfless task, working with something that is half-broken to help us see the value in it and make it better. It is the crucible through which all good software is forged. Being an early adopter requires technical acumen, which Eric had in spades, but he was invaluable in his willingness to endorse a nascent project and convince others to take a look. He built consensus across disciplines and affiliations to generate real momentum.

Science is a messy business; we stumble through a problem by gut instinct and publish a solution when the problem is only half-solved. It would be a tale told by an idiot, strutting and fretting his hour upon the stage, but for our collaborators who are willing to pitch in and lend a hand, and who give us the courage to move forward.

NCATSFind: Resolving Chemical Content in Browser

We have recently developed NCATSFind: a Firefox add-on to help users resolve web content to real chemical structures.

The goal is to make discovering, displaying and downloading chemical structures as simple as possible:

  1. Select some content from a page:
    • name
    • code
    • image
  2. Resolve it to a structure.
  3. Display the structure and its origin.
  4. Allow export.

This is accomplished in NCATSFind by using some simple client-side JavaScript to embed some UI elements, along with functions to AJAX our resolving web services. A few format-sniffing functions automatically highlight certain predictable systematic codes (UNII, CAS Number, InChI).
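As an illustration of the kind of format sniffing involved, here is a small, hypothetical Python sketch (the actual add-on is client-side JavaScript; these patterns and the CAS check-digit rule are our assumptions based on the public definitions of each identifier, not NCATSFind’s code):

```python
import re

def looks_like_inchi(s):
    # Standard InChI strings always begin with the "InChI=" prefix.
    return s.startswith("InChI=")

def looks_like_cas(s):
    # CAS numbers: 2-7 digits, 2 digits, then a single check digit.
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", s)
    if not m:
        return False
    digits = (m.group(1) + m.group(2))[::-1]  # digits, right to left
    checksum = sum((i + 1) * int(d) for i, d in enumerate(digits)) % 10
    return checksum == int(m.group(3))

def looks_like_unii(s):
    # FDA UNIIs are 10-character uppercase alphanumeric codes.
    return re.fullmatch(r"[A-Z0-9]{10}", s) is not None

def sniff(s):
    # Return the first identifier type a selection plausibly matches.
    if looks_like_inchi(s):
        return "InChI"
    if looks_like_cas(s):
        return "CAS"
    if looks_like_unii(s):
        return "UNII"
    return None
```

For example, sniff("7732-18-5") recognizes the CAS number for water, while a string with a bad check digit is rejected outright rather than sent to the resolver.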

Names and codes are resolved using our resolver web service. This service uses dictionaries from several public sources (PubChem, FDA-SRS, NCI, etc.), as well as OPSIN for systematic name parsing. Images are resolved using our image resolver web service, which uses OSRA to convert an image to a chemical format.

For a simple demo, check out the video:


To install the add-on into your Firefox browser, simply click on the following link:

Logging Note:

While our web services don’t keep systematic logs of searches, any use of the resolver will transmit data to and from our web servers. For that reason, we recommend against attempting to resolve any sensitive data.

Do structurally similar molecules have similar hash codes?

Molecular hash codes are fixed-length alphanumeric encodings of molecular graphs. They play a key role within chemical data management systems, facilitating (among other things) structural identity and uniqueness validation. While an important component of any hash code generation approach is the structure standardization procedure (something we briefly discussed previously), the focus of this post is on the encoding step, i.e., generating the hash code given a standardized structure. In the remainder of this post, we first give a brief overview of the spectral hash code, our proposed multi-resolution hash code based on InChI. We then highlight the utility of the hash code through a number of examples. We conclude with a description of a web resource for converting between PubChem CID, InChIKey, and spectral hash code. The complete source code is readily available from our bitbucket repository. As always, we welcome comments and feedback!

A spectral hash code is a 30-character (150-bit) alphanumeric hash string that uniquely encodes an InChI. The hash code has three logical blocks—denoted as h1, h2, and h3—that progressively encode the InChI graph at finer resolution, where each block is a prefix of the next (h1 ⊂ h2 ⊂ h3) and |h1| = 9 (45 bits), |h2| = 19 (95 bits), and |h3| = 30 (150 bits). Consider, for example, the following hash code for Gleevec: A9QNVQSAZAMG4SH428AUW4Y21WVPXD; here we have h1 = A9QNVQSAZ, h2 = A9QNVQSAZAMG4SH428A, and h3 = A9QNVQSAZAMG4SH428AUW4Y21WVPXD. A key feature of the hash code is that it’s lexicographically meaningful at the block level; this means that, when sorted, the hash codes tend to group together structurally similar molecules.
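The nested-block structure can be demonstrated directly in Python using the Gleevec hash code above (a minimal sketch for illustration, not part of the spectral hash code implementation):

```python
# The three blocks of a spectral hash code are nested prefixes, so the
# 30-character string carries all three resolutions at once. Using the
# Gleevec hash code quoted above:
h3 = "A9QNVQSAZAMG4SH428AUW4Y21WVPXD"
h1, h2 = h3[:9], h3[:19]

def same_skeleton(code_a, code_b):
    # Two codes agree at skeleton (h1) resolution exactly when their
    # first nine characters match; plain lexicographic sorting of full
    # codes therefore clusters such molecules together.
    return code_a[:9] == code_b[:9]
```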

A brief description of how each block is generated follows.

  1. Block h1 encodes the topology (or molecular skeleton) of the InChI graph. As the name implies, the underlying principle behind block h1 is based on spectral graph theory. Here a normalized Laplacian matrix is constructed from the InChI’s connection layer (i.e., /c) and its spectrum (sorted eigenvalues) is hashed. While other graph representations (e.g., adjacency, signless Laplacian, etc.) work equally well, the normalized Laplacian gives us numerical stability, which is an important requirement for portability. We should note that, while it’s still an open question which graphs are determined by their spectrum, there exist a number of non-isomorphic graphs that are cospectral. This is particularly the case for trees (graphs without cycles); as such, there will be a few cases of systematic hash “collisions” at block h1.
  2. Block h2 refines the graph topology by imposing a canonical ordering over the vertices. The resulting hash value is a combined digest of h1 and InChI’s /c layer. An important implication of this chaining is that two hash codes of the same h2 must have the same h1 blocks as well.
  3. Block h3 encodes the InChI graph at full resolution. Its hash value is a combined digest of block h2 and the entire InChI string.
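To make the chaining in steps 1–3 concrete, here is an illustrative Python sketch of the block structure. The digest function (SHA-1), alphabet, and truncation lengths are our own assumptions chosen only to reproduce the 9/19/30-character layout; the actual implementation in our bitbucket repository differs.

```python
import hashlib

# Illustrative only: the real spectral hash code uses its own digest and
# encoding. This sketch reproduces the chaining described above.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # assumed alphabet

def _digest_chars(data, n):
    # Map a SHA-1 digest of `data` onto n characters of the alphabet.
    d = hashlib.sha1(data.encode("utf-8")).digest()
    return "".join(ALPHABET[b % len(ALPHABET)] for b in d[:n])

def toy_spectral_hash(spectrum_key, c_layer, inchi):
    h1 = _digest_chars(spectrum_key, 9)        # hashed graph spectrum
    h2 = h1 + _digest_chars(h1 + c_layer, 10)  # digest of h1 + /c layer
    h3 = h2 + _digest_chars(h2 + inchi, 11)    # digest of h2 + full InChI
    return h3
```

Note how two InChIs sharing a skeleton and /c layer agree on their first 19 characters, mirroring the property that equal h2 blocks force equal h1 blocks.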

Below are a few examples of spectral hash codes.

We have also developed a web resource that can interconvert spectral hash codes to/from PubChem CID and InChIKey. The syntax is as follows: {id}/{oper}[?max=N]

where {}’s are required arguments and []’s are optional. {id} can be either a PubChem CID, an InChIKey, or a spectral hash code (h1, h2, or h3), and {oper} can be one of {hk,h1,h2,smiles,inchi}.

Hopping for success

The molecular scaffold is a fundamental concept in medicinal chemistry. Perhaps nowhere is this more apparent than during early-stage discovery, where scaffolds are the primary means by which medicinal chemists communicate. Extending this concept further, we recently introduced Scaffold Hopper, a tool that utilizes scaffolds to facilitate context “hopping.” Starting with either (i) a set of structures (e.g., hits from an HTS screening campaign), (ii) a publication, or (iii) search terms, Scaffold Hopper allows the user to quickly “hop” between related contexts (i.e., compounds, documents, targets, MeSH terms) with just a single click. (We’re currently working to add additional contexts such as assays, clinical trials, and patents.) Below is a quick preview of the tool.

Click on the Launch button below to take it out for a spin:

Note for Mac users: Due to recent Java security updates, Java webstart is no longer launched when clicking on the jnlp file. You can still launch it manually by selecting “Open With” and then “Java Web Start” on the downloaded hopper.jnlp file. Please drop us a note if you have any problems.

A quick note on the implementation: we’ve made efforts to ensure that Scaffold Hopper can be run within the Java sandbox. While this provides some measure of security, it does mean that features such as cut & paste (except within the tool itself) are not available. As always, we’d love to hear comments and feedback on how to improve the tool.

PubChem fingerprint revisited

A few years ago we made available our implementation of the PubChem substructure keys fingerprint. As it was our first attempt, there were a number of shortcomings with the implementation in terms of performance and accuracy. Recently, we had the opportunity to revisit the code and address some of these shortcomings. In this post, we briefly describe some of the improvements. (For the impatient, the latest code is available here.) We should note that since our original implementation, there have been at least two other implementations that we’re aware of. It certainly would be interesting to compare the different implementations; if anything, the comparison should reveal fundamental differences in aromaticity and ring perception across various cheminformatics toolkits.

As with most substructure keys-based fingerprints, perhaps the biggest performance bottleneck has to do with repeated subgraph isomorphism searches of a fixed set of substructure keys (encoded as either SMILES or SMARTS) against the target molecule. Here, a naïve approach that sequentially searches each key is unlikely to perform well for all but small key sets (e.g., 100–200). Given that there are roughly 618 (out of 881 total) substructure keys defined for the PubChem fingerprint, this approach is unlikely to scale, as demonstrated by our original implementation. Efficiently addressing this bottleneck requires an indexing data structure, much like the suffix tree or suffix array in stringology (though currently the approach described here is clearly the most effective). In this spirit, we’ve built upon our recent work on large-scale structure searching as the indexing engine for the substructure keys, thereby reducing the problem of multiple substructure searches to that of a single superstructure search.

Another shortcoming of our original implementation centered around ring perception, a topic that has its share of ambiguities within the cheminformatics literature. Our use of smallest set of smallest rings (SSSR) in place of extended set of smallest rings (ESSR) was clearly not adequate. As such, we’ve modified an algorithm for finding the shortest cycle basis of a graph to effectively handle ESSR.

While we’ve strived to be diligent with the implementation, it’s just not possible to have an exact replica of PubChem’s for obvious reasons. Nonetheless, depending on the test set, the current implementation averages less than 1 bit of error per molecule. The complete code and example usages are available here. As always, we’d love to receive feedback on how to further improve it.

An online version of the Lilly reactivity & promiscuity filters

Recently Bruns & Watson published a paper describing a set of rules to filter out (potentially) reactive and promiscuous small molecules. There are 275 rules addressing a variety of possible reasons to reject a molecule: general reactivity, assay interference (such as fluorescence), instability and so on. Their filtering protocol also includes a set of explicit compounds specifically marked as interfering (even though they may have passed the structural rules).

The authors have also made their code available on Github – it’s C++ code and compiles easily on Unix systems. While it’s quite straightforward to use, we’ve put up a simple web page where you can paste a collection of SMILES (which should include molecule titles) and retrieve a summary report (in HTML, plain text or Excel formats). You can also interact with the page via POST requests; an example of doing this in Python is shown below.

import urllib, urllib2

# Python 2 example of POSTing a batch of SMILES (with titles) to the
# filtering service and printing the plain-text report.
format = 'text'  # can be 'excel' or 'html'
url = ''
values = {'format': format, 'queryids': '''[NH2+](C(C)c1ccccc1)CCC(c1ccccc1)c1ccccc1       85273758
O(C(=O)C(C1CCCCC1)c1ccccc1)CC[NH+](CC)CC        85273739
O(C(c1ccccc1)c1ccccc1)CC[NH+](C)C       85273734
n1c(N)c(N=Nc2ccccc2)ccc1N       85273730
S(=O)(=O)(N1CC[NH+](CC1)CC(=O)Nc1c(n[nH]c1C)C)c1cc(OC)c(OC)cc1  85273684
S1(=O)(=O)CC([NH+](CC(=O)Nc2cc(cc(c2)C(OC)=O)C(OC)=O)C)CC1      85273683
FC(F)(F)c1ccc(NC(=O)C([NH+](CC(=O)N2CC[NH+](CC2)Cc2ccccc2)C)C)cc1       85273682'''}
data = urllib.urlencode(values)   # form-encode the parameters
req = urllib2.Request(url, data)  # supplying data makes this a POST
response = urllib2.urlopen(req)
page =        # read the report body
print page

Charting chemical space

Charting, in a technical sense, is the process of mapping objects in a high-dimensional space to a lower-dimensional subspace, something we have previously discussed using the FastMap algorithm. In this post, we consider a more literal interpretation of “charting,” namely, that of simply visualizing molecular data on an X-Y plot. While there are numerous plotting packages available, only a few actually understand chemical structures. Such a tool can be quite useful in elucidating structure–activity/property relationships.

MolPlot is a tool that we recently developed for “charting” chemical structures. Its only requirement is a specification of where to extract the coordinates from, which in turn can be any numerical data (e.g., principal axes from principal component analysis, biological readouts, etc.). The tool comes bundled with a kinase panel dataset, so feel free to take it out for a spin. We’d love to get feedback on how to further improve it.

Library synthesizer

Update March 2, 2013: The complete code to this tool is available here.

Library enumeration and profiling are an integral part of a medicinal chemist’s repertoire during early-stage discovery. Whether it’s synthesis planning for lead optimization or library design for lead identification, having an effective tool to enumerate and profile virtual compounds is ultimately critical to the success of a therapeutic campaign. Library synthesizer is a tool that we recently developed, working together with a medicinal chemist, to address such needs. While the tool is still a work in progress, it’s functional enough that we thought others might find it useful. Clearly, the potential of this tool is limited only by the imagination (and of course the implementation), so we’d love to get feedback.

(A note on the implementation: due to technical difficulties, we weren’t able to deploy library synthesizer within the Java web start sandbox like the other tools available on this site. This means that, upon launching the tool, you’ll be asked to allow an untrusted application access to your system. Here, please take our word that it’s safe to say yes!)

Large scale substructure searching

Motivated by the recent renewed interest in substructure searching in the literature, we have developed a proof-of-concept, self-contained substructure searching engine that can scale to large databases with modest hardware requirements. The current prototype is able to handle the PubChem database (>30M structures) with reasonable performance on any modest server with sufficient (>12Gb) RAM. This work is an extension of our recent work on improving fingerprint screening. While it’s tempting to throw out qualitative (and/or unverifiable) performance numbers, we’ll let you be the judge. The prototype hosting the entire PubChem (snapshot taken in September 2011) is available here. Please bear in mind this is a prototype, so it might not be able to handle DoS-type queries (e.g., c1ccccc1) gracefully. The binary and source code for the entire prototype are also available. We’d love to help you deploy it in-house, so feel free to contact us.

MLBD browser

To date, the NIH Molecular Libraries Program (MLP)—through its comprehensive and specialized centers—has produced an unprecedented amount of experimental data characterizing various levels of interactions of small molecules across a broad range of biological targets. While there has been much enthusiasm for the prospect of new discoveries, the fact remains that there is currently no adequate tool available to effectively distill the content-rich information afforded by the MLP data so as to facilitate new discoveries, serendipitously or otherwise. Herein, we present the Molecular Libraries Biological Database (MLBD) browser, our vision of how the MLP data can be organized and distilled in such a way that directly captures the essential information of a therapeutic project:

  • What’s the current status of the project?
  • How much effort, in terms of chemistry and follow-up screens, has gone into the project?
  • How many different series are there in the project?
  • How much optimization has been performed for a particular series?
  • etc.

By simply (i) using data from PubChem as-is, (ii) applying a few simple heuristics, and (iii) performing rudimentary analysis based on tools available elsewhere on this site, we have thus transformed analysis into browsing.

On faster substructure searching

With the rapid growth of chemical databases such as vendor catalogs and in-house screening collections, the constant desire to improve (sub-) structure searching performance has never been more justified. Recent developments in this area have focused primarily on advances in hardware architecture, i.e., taking advantage of the many cores available on general-purpose graphics processing units (GPUs). While the reported performance can be quite impressive (i.e., an order of magnitude over CPU), such approaches often require specific hardware configurations. Herein we describe a practical improvement that can achieve, depending on the query structure, as much as twice the speed of traditional substructure searching.

Before describing the proposed improvement, let’s recall that substructure searching consists of two basic steps: (i) a fast screening step followed by (ii) a relatively slow (sub-) graph isomorphism matching. In (i), molecular graphs are typically encoded as bit strings (or fingerprints) whose length is a multiple of the hardware word size (e.g., 512- or 1024-bit). Such bit strings (where each bit encodes a specific feature of the molecular graph) are formally known as Bloom filters. With some probability of false positives, Bloom filters are extremely efficient data structures for set membership testing. Let p and q denote the Bloom filters of the target and query (molecular) graphs, respectively. Then determining whether q can be a subgraph of p amounts to the following binary logic:

(p & q) == q

Only when the above expression is true do p and q move on to a more rigorous subgraph isomorphism matching.
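In Python, where fingerprints can be held as arbitrary-precision integers, this screening test is a one-liner (a minimal sketch with made-up 8-bit fingerprints):

```python
# Fingerprints as Python integers: bit i set <=> feature i present in
# the molecular graph. The screen passes when every feature bit of the
# query q is also set in the target p (false positives are possible,
# false negatives are not).
def passes_screen(p, q):
    return (p & q) == q

target = 0b1011_0110  # made-up 8-bit fingerprint of a target molecule
query = 0b0010_0100   # features required by a query substructure
```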

Although Bloom filters are quite powerful, screening through a large database in the millions requires a linear scan on the order of the database size:

for p_i in db:
    if (p_i & q) == q:
        # candidate hit: run the full (sub-) graph isomorphism match
        ...

When the Bloom filters for the entire database can fit into main memory, the linear scan above should still be manageable on modern server hardware. In fact, with the rapidly falling cost of memory hardware, we posit that in-memory searching is the only effective solution that scales.

Instead of working with bit strings, let’s convert them into unsigned integers, which are more natural to work with. Let |p| and |q| be the unsigned integer values corresponding to p and q, respectively. Since every bit set in q must also be set in p for the screen to pass, the logical expression above can be relaxed to the necessary condition:

|p| >= |q|

This conversion effectively endows bit strings with an ordering that is consistent with the basic properties of the Bloom filters. It forms the basis of our proposed improvement, which we summarize as follows:

  • Let P = {p1, …, pN} denote the entire database of N bit strings.
  • A bit string pi is defined as a K-dimensional vector of 32-bit (signed or unsigned) integers; e.g., K = 16 and 32 for 512- and 1024-bit strings, respectively.
  • In a preprocessing step, we construct a k-d tree data structure for P. A depth-first traversal of the tree guarantees that, for any two given leaf nodes, the leftmost node is never contained within the rightmost node.
  • Now the screening step is reduced to (i) locating the appropriate leaf node in the tree and (ii) marching to the right in depth-first order, reporting all matches.
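The essence of the pruning can be sketched in one dimension (K = 1): sort the integer-valued fingerprints, locate the query’s position with a binary search, and march to the right. Since any superset of q has an integer value ≥ |q|, nothing to the left of that position can pass the screen. (This is an illustrative sketch only; the actual prototype uses a k-d tree over K-dimensional vectors.)

```python
import bisect
import random

def linear_screen(db, q):
    # Reference: scan every fingerprint in the database.
    return [p for p in db if (p & q) == q]

def pruned_screen(sorted_db, q):
    # Anything numerically below q cannot contain all of q's bits, so
    # binary-search for q's position and scan only to the right.
    start = bisect.bisect_left(sorted_db, q)
    return [p for p in sorted_db[start:] if (p & q) == q]

random.seed(0)
db = sorted(random.getrandbits(32) for _ in range(10000))
q = 0b1111 << 20  # query fingerprint with four specific bits set
hits = pruned_screen(db, q)
```

Both screens return identical hit lists; the pruned version simply never visits the portion of the sorted database that cannot match.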

A self-contained reference implementation in Java is available here. It certainly has lots of room for improvement, so we’d love to get feedback. Based on our not-so-exhaustive analysis, we have observed the following key properties:

  • The more specific the query (i.e., the larger the query graph), the faster the search. Because the query bit string tends to contain a large number of on bits, its starting leaf node is typically toward the right, thereby pruning a significant portion of the search space.
  • Bit strings with “structure” perform better than those without. By “structure” we simply mean that the distribution of the bits is not uniform. We can, for example, reorganize the split values during the tree construction based on the bit distribution. This also implies that structural keys perform better than hashed fingerprints.
  • The speed-up factor over linear scanning improves as the dimension K increases.


Siphonify

After a somewhat long hiatus (mostly spent tying up loose ends for the NPC browser), we’re finally back! Motivated by the feedback we’ve received on the NPC browser, for the past few weeks we’ve been working on a tool—Siphonify—to help make managing chemical structures a bit easier. Now one can organize, search, and browse chemical structures in much the same way as music files and documents. Please check out these slides for a quick tour of its features. The tool is still very much in alpha stage, so we appreciate feedback and suggestions.

Below are various screenshots of the current version.

NPC paper

The paper describing the NCGC Pharmaceutical Collection (NPC) resource has just been published in the current issue of Science Translational Medicine. Among other things, the paper definitively resolves the seemingly simple question: How many drugs are there? For the impatient, to the right is a screen capture of the NPC browser that contains various numbers from the paper.

We also would like to take this opportunity to give a plug for the NPC browser. In its current form, the browser can be used to answer other similar “simple” questions such as

  • How many kinase inhibitors are in the FDA orange book?
  • Can I perform structure searching of interventions in …? Likewise, what about drug product labels from DailyMed?
  • Which drugs in the drugs@FDA list are not approved by the UK NHS, and vice versa?
  • And many more…

So if any one of these “simple” questions rouses your interest, please feel free to take the NPC browser out for a spin. As with any of the tools available here, we’d love to receive suggestions on how to improve the software.

The kinome navigator

In a recent paper, Metz et al. describe a statistical framework for constructing the kinome polypharmacology network. Using a large kinase panel (i.e., more than 150,000 activity values across 172 kinases for 3,858 compounds), they propose a set of statistical parameters (based on some measures of reliability and relevance) that can be used to establish relationships between kinases. Although such networks are useful in analyzing polypharmacology trends, the authors subsequently provide the following insight:

In order to impact medicinal chemistry design decisions, what is required is an understanding of specific chemotypes or functional groups that either create or destroy a specific connection, as defined by the therapeutic endpoint.

To that end, we’ve recently put together a tool, the kinome navigator, that provides chemotype-centric views of kinase data. It’s based on an improved version of our automated R-group analysis tool. The input requirement is the same as that of the kinome viewer, namely, kinase activity values expressed in nM. Here is properly formatted input for the above dataset (which is also bundled with the tool). Below are some sample screenshots.

Click on the button below to launch it. Please note that the tool is memory intensive, so it’s best to run it on a machine with at least 1Gb of memory.  As always, we welcome comments and/or suggestions.

The human kinome Java component

Update Feb. 2, 2013: This Java package is obsolete; please check out the latest code here.

Recently, we received a request for programmatic access to the human kinome dendrogram. We thought this might also be useful for others with similar needs. Here is the kinome dendrogram as a self-contained Java package, kinome.jar. The only requirement is Java 1.5 or above. This sample program provides the basic usage of the component. To compile it, simply type:

javac -cp kinome.jar:.

And to run it:

java -cp kinome.jar:. KinomeTest

Here is a sample image generated by the program.

As always, we welcome comments and feedback.

How many drugs are there?

Update May 29, 2011: Please see this post for the answer to the above question.

The NCGC Pharmaceutical Collection (NPC), as we noted earlier, is the result of our ongoing efforts to construct a definitive informatics and screening resource for drug repurposing. Much of our recent effort has been dedicated to making this resource easily and readily accessible. To that end, we’re now releasing an updated version of the NPC browser that contains the latest version of the NPC dataset in its entirety. To the best of our knowledge, this dataset constitutes the most up-to-date record of all compounds that have been approved for human and veterinary use. The dataset is by no means complete and bug-free, and we welcome comments and feedback to help improve the data quality and coverage. The NPC resource is available here.

The human kinome (part 1)

Update Nov 27, 2012: The entire source code for this tool is available here.

Kinases are an important class of enzymes that are therapeutically relevant for a wide range of indications, such as cancer and neurodegenerative diseases. Given that there are roughly 500 different kinase proteins, selectivity—and in some cases polypharmacology—is perhaps one of the most important parameters considered during lead optimization. Here the lead compound is subjected to a panel of kinase assays and the results are often projected onto the kinome dendrogram. Such visualizations provide an effective means to evaluate the selectivity profile of lead compounds. Recently, we developed a self-contained kinome prototype tool for this very purpose. At the moment the tool assumes that kinase activities are expressed in μM. Here is an example of the input format for the tool. As always, we welcome suggestions and/or feedback.

Although being able to robustly project data onto the kinome dendrogram is an important step in evaluating selectivity, a more useful scenario is to consider the underlying kinase profile in the broader context of known/published results. In part 2 of this post, we discuss the integration of our kinome viewer with the kinase subset of the ChEMBL database. Please stay tuned.

Fragment activity profiler

A challenging task in early stage lead discovery is the triaging of chemical series from high throughput screening (HTS) assays.  Here triaging is a multiplex problem that seeks to find the balance between a number of often competing goals such as potency, tractability, selectivity, novelty, etc.  Depending on the type of assay used (e.g., biochemical, cell-based), the number of identified chemical series can easily be in the hundreds.  Sifting through that much data to identify only a handful of promising series for follow-up can be quite overwhelming.

One effective strategy toward triaging is to utilize well-curated databases to provide the necessary context around each chemical series. In particular, for a given chemical series we’d often like to address the following two basic questions:

  • What is known, if any, about this chemical series with respect to the intended and/or related target(s)?
  • What compound classes are known to modulate the intended and/or related target(s) and how similar are they to the underlying chemical series?

Recently, together with our colleague Rajarshi Guha, we developed a profiling tool to help with the triaging task. The tool is built around our molecular framework and uses a subset (namely, the IC50 activity type) of the ChEMBL database (release 5) as the backend. To facilitate activity comparisons across assays, activity values are normalized based on a robust variant of the Z-score (i.e., median and MAD are used in place of mean and standard deviation, respectively). We hope the tool also serves as an effective way to browse ChEMBL. Please feel free to let us know how we can make the tool more useful. If you’d like to have the tool available in-house with your own version of ChEMBL, please drop us a note. We’ll be happy to help you set it up.
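The robust Z-score normalization described above can be sketched as follows (the 1.4826 consistency factor, which makes the MAD comparable to the standard deviation for normally distributed data, is our assumption; the post does not specify the exact scaling):

```python
import statistics

def robust_z(values):
    # Center by the median and scale by the median absolute deviation
    # (MAD), so a few extreme activity values cannot distort the
    # normalization the way a mean/standard-deviation Z-score would.
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 1.4826 * MAD estimates the standard deviation for normal data
    # (an assumed scale factor; any positive constant preserves ranks).
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [(v - med) / scale for v in values]
```

For instance, a single extreme IC50 in a list of otherwise similar values barely shifts the scores of the other entries, whereas it would dominate a conventional Z-score.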

A fragment-based approach to (multiple) MCS

Maximum common subgraph (MCS) is considered one of the most challenging problems in combinatorial optimization.  As a generalization of (sub)graph isomorphism, MCS has deep roots within cheminformatics.  From reaction mapping to enabling molecular graphs as first-class objects in kernel-based QSAR methods, MCS has become a basic necessity in any cheminformatics toolbox.  However, the utility of MCS (or lack thereof) is currently limited by our inability to solve it efficiently.  Pending a significant development in computational complexity theory, this fact will likely remain true for the foreseeable future.

Despite the inherent “hardness” of MCS, a number of algorithms exist that solve it within reasonable time on modern hardware.  For a good survey of MCS algorithms, please see this paper.  The purpose of this post, then, is to briefly describe our approach to solving MCS and, more generally, to the problem of solving MCS for a set of molecular graphs.  Multiple MCS (as the latter problem is often known) has numerous applications in cheminformatics, such as Markush structure extraction (e.g., see our related R-group decomposition tool), generalizing bioisostere replacement beyond pairs, (2D) pharmacophore elucidation, and so forth.

Clique-based approaches are arguably the current state of the art for solving exact MCS.  Here the MCS problem is transformed into one of finding the maximum clique of the corresponding association graph.  This approach has a number of desirable properties, including (but not limited to) the following:

  • Algorithmic elegance.  The clique-finding problem has been well studied in computer science, so many well-known algorithms are available (e.g., see the second DIMACS implementation challenge).  Most published maximum clique algorithms are quite easy to understand.  In fact, the most problematic aspects of the clique-based approach, in terms of code, are likely to be the construction of the association graph and the bounding function. The actual clique-finding routine can be expressed in a few lines of code in any modern programming language.
  • Optimized implementation.  Since most efficient clique solvers are based on branch and bound, they can easily be parallelized on modern multi-core CPUs.  Moreover, with some care, a clique solver can be implemented entirely with bit twiddling.
  • Built-in support for disconnected MCS.  Due to the nature of the association graph, the resulting maximum clique often yields disconnected MCSs.  This can be quite useful, for example, in the identification and visualization of bioisostere replacements.

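To make the “few lines of code” and bit-twiddling points concrete, here is a toy sketch (our own, not the implementation behind the tool) of a branch-and-bound maximum clique solver that uses Python integers as bitsets. It uses the trivial size bound; a production solver would substitute a coloring bound.

```python
def max_clique(adj_bits, n):
    """Branch-and-bound maximum clique size over bitset adjacency.

    adj_bits[i] is a Python int whose set bits mark the neighbors of
    node i.  The bound is the trivial |clique| + |candidates|; real
    solvers use a graph-coloring bound instead.
    """
    best = [0]

    def expand(size, cand):
        if cand == 0:
            best[0] = max(best[0], size)  # leaf: record clique size
            return
        while cand:
            if size + bin(cand).count("1") <= best[0]:
                return  # bound: cannot beat the incumbent clique
            v = (cand & -cand).bit_length() - 1  # lowest candidate node
            cand &= cand - 1                     # drop v from candidates
            expand(size + 1, cand & adj_bits[v])

    expand(0, (1 << n) - 1)
    return best[0]

# Triangle 0-1-2 plus an extra edge 2-3
adj_bits = [0b0110, 0b0101, 0b1011, 0b0100]
print(max_clique(adj_bits, 4))  # -> 3
```

Because each branch of `expand` is independent, distributing subtrees across cores is straightforward, which is the parallelization point made above.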
Notwithstanding these properties, clique-based approaches have a few drawbacks.  Given two graphs G1 and G2 with m and n nodes, respectively, the number of nodes in the corresponding association graph is on the order of mn.  Even with modestly sized molecules of 20 atoms each, this can result in an association graph of 400 nodes, on which a fast clique solver will likely spend a few CPU seconds finding the maximum clique.
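For readers unfamiliar with the construction, the following sketch builds the association graph (modular product) of two tiny labeled graphs; the adjacency-set representation and names are ours, chosen for illustration only.

```python
from itertools import combinations

def association_graph(g1, labels1, g2, labels2):
    """Modular product of two labeled graphs.

    Nodes are label-compatible atom pairs (u, v); two pairs are joined
    when their adjacency agrees in both graphs, so cliques correspond
    to common subgraph isomorphisms.  g1, g2: {node: set(neighbors)}.
    """
    nodes = [(u, v) for u in g1 for v in g2 if labels1[u] == labels2[v]]
    edges = set()
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue  # a mapping must be one-to-one
        # both graphs must agree: either both edges present or both absent
        if (u2 in g1[u1]) == (v2 in g2[v1]):
            edges.add(frozenset([(u1, v1), (u2, v2)]))
    return nodes, edges

# Tiny example: a C-C-O chain against a C-O pair
g1, labels1 = {0: {1}, 1: {0, 2}, 2: {1}}, {0: "C", 1: "C", 2: "O"}
g2, labels2 = {0: {1}, 1: {0}}, {0: "C", 1: "O"}
nodes, edges = association_graph(g1, labels1, g2, labels2)
print(len(nodes), len(edges))  # -> 3 1
```

Even here the node count is the number of label-compatible pairs, which approaches mn as labels repeat; for unlabeled graphs it is exactly mn.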

Thus far we’ve considered MCS as an optimization problem, the premise being that the best solution is also the right one.  This is certainly true in most general settings, but in cheminformatics it’s not quite so obvious.  Consider, for example, the following two molecules:

There are at least 26 different maximum subgraph isomorphisms here.  Let’s look at one specific MCS below.

The numbering denotes the isomorphism between the two molecules.  (This example also highlights the fact that we only consider maximum common edge subgraph isomorphism.)  Depending on the context, the MCS above might be desirable in that it captures the carboxyl functional group.  In other contexts, the less optimal subgraph isomorphism corresponding to the indole scaffold might be preferable.

To remedy this shortcoming, a different approach is to treat MCS as a clique enumeration problem.  That is, instead of finding the maximum clique, we enumerate all maximal cliques in the association graph.  Each maximal clique, in turn, corresponds to a subgraph isomorphism, which may or may not be the actual MCS.  The distinction between this approach and maximum clique is that, effectively, no bounds are defined, so the entire search space is evaluated.  Given that a graph of n nodes can contain as many as 3^(n/3) maximal cliques, such exhaustive enumeration can be prohibitively expensive.
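The classic algorithm for this enumeration is Bron-Kerbosch; a minimal sketch (without pivoting, so nothing is pruned, matching the “no bounds” description above) looks like this:

```python
def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting).

    adj: {node: set(neighbors)}.  Yields each maximal clique as a
    frozenset; r is the growing clique, p the candidates, x the
    already-explored nodes.
    """
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r  # r cannot be extended: it is a maximal clique
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}  # cliques through v are fully explored
        x = x | {v}

# Triangle 0-1-2 with a pendant node 3 attached to 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = sorted(sorted(c) for c in bron_kerbosch(adj))
print(cliques)  # -> [[0, 1, 2], [2, 3]]
```

Run on an association graph, each yielded clique is one common subgraph isomorphism, which is what lets us surface chemically preferable but non-maximum matchings such as the indole mapping above.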

An effective strategy to speed up the enumeration is to reduce the size of the association graph.  This is precisely the approach behind the C-clique algorithm (and improvements thereof).  By considering only connected subgraph isomorphisms, the C-clique algorithm significantly reduces the size of the search space, making it a very effective algorithm for MCS.  This is the algorithm that motivates our fragment-based (multiple) MCS approach, which we briefly discuss below.  (For the impatient, please click here to take the multiple MCS tool for a spin.)

The main idea behind our fragment-based approach, as hinted at previously, is simply to use fragments as seeds for the C-clique algorithm.  Effectively, this seeding serves to (i) further reduce the search space, (ii) focus the search in a specific direction, and (iii) move a good portion of the burden of dealing with the internal symmetry of the MCS out of C-clique.  This last point is significant in that it allows us to use other algorithms that are much more effective for automorphism detection.  (Clique-based MCS algorithms are notoriously bad at dealing with symmetry.)

Multiple MCS is simply an extension of fragment-based MCS.  An advantage of our fragment-based approach here is that, given a set of molecules, we know the fragments that span the entire set (if any exist). This allows us to perform pairwise MCS with appropriately seeded fragments and eventually arrive at the multiple MCS.  Clearly, pairwise MCS without seeded fragments will likely yield a non-optimal solution (if any at all).
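At its simplest, finding the fragments that span a set reduces to a set intersection over per-molecule fragment inventories; a toy sketch (the fragment names are hypothetical, and real fragments would of course be subgraphs, not strings):

```python
from functools import reduce

def spanning_fragments(frag_sets):
    """Fragments present in every molecule of the set: the candidate
    seeds for the pairwise MCS runs."""
    return reduce(set.intersection, frag_sets)

# Hypothetical fragment inventories for three molecules
frags = [{"benzene", "carboxyl", "pyridine"},
         {"benzene", "carboxyl"},
         {"benzene", "carboxyl", "indole"}]
print(sorted(spanning_fragments(frags)))  # -> ['benzene', 'carboxyl']
```

Each surviving fragment then seeds a chain of pairwise, seeded MCS computations across the set.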

The tool to try out multiple MCS is available here.  We believe the algorithm is fundamentally correct; any unexpected result is likely due to bugs in our implementation, in which case we’d love to hear about it.