# On faster substructure searching

With the rapid growth of chemical databases such as vendor catalogs and in-house screening collections, the desire to improve (sub)structure searching performance has never been more justified. Recent developments in this area have focused primarily on advances in hardware architecture, i.e., taking advantage of the many cores available on general-purpose graphics processing units (GPUs). While the reported performance can be quite impressive (e.g., an order of magnitude over a CPU), such approaches often require specific hardware configurations. Herein we describe a practical improvement that can achieve, depending on the query structure, as much as a twofold speed-up over traditional substructure searching.

Before describing the proposed improvement, let’s recall that substructure searching consists of two basic steps: (i) a fast screening step followed by (ii) a relatively slow (sub)graph isomorphism matching. In (i), molecular graphs are typically encoded as bit strings (or fingerprints) whose length is a multiple of the hardware word size (e.g., 512 or 1024 bits). Such bit strings (where each bit encodes a specific feature of the molecular graph) are formally known as Bloom filters. At the cost of some probability of false positives, Bloom filters are extremely efficient data structures for set membership testing. Let *p* and *q* denote the Bloom filters of the target and query (molecular) graphs, respectively. Determining whether *q* is a subgraph of *p* then amounts to the following binary logic:

(p & q) == q

Only when the above expression is true do *p* and *q* move on to the more rigorous subgraph isomorphism matching.
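In practice, each fingerprint is packed into an array of machine words and the test is applied word by word. A minimal sketch in Java (class and method names here are illustrative, not taken from any reference implementation):

```java
// Screening test: returns true iff every bit set in the query fingerprint q
// is also set in the target fingerprint p, i.e., (p & q) == q.
// Both fingerprints are assumed to be packed into equal-length arrays
// of 64-bit words.
final class Screen {
    static boolean contains(long[] p, long[] q) {
        for (int i = 0; i < q.length; i++) {
            // evaluate (p & q) == q one word at a time; bail out early
            if ((p[i] & q[i]) != q[i])
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long[] target = {0b1011L, 0b1100L};
        long[] query  = {0b0011L, 0b0100L};
        System.out.println(Screen.contains(target, query));  // true: query bits are a subset
        long[] query2 = {0b0111L, 0b0100L};
        System.out.println(Screen.contains(target, query2)); // false: bit 2 of word 0 is missing
    }
}
```

The early exit on the first failing word is what makes this screen cheap in the common case where the target clearly cannot contain the query.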

Although Bloom filters are quite powerful, screening a large database (in the millions of compounds) requires a linear scan whose cost grows with the database size:

```
foreach p_i in db
    if (p_i & q) == q then
        do graph isomorphism
```

When the Bloom filters for the entire database fit into main memory, the linear scan above should still be manageable on modern server hardware. In fact, with the rapidly falling cost of memory, we posit that in-memory searching is the only effective solution that scales.
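To put “fits into main memory” in perspective, a quick back-of-the-envelope calculation (the database size and fingerprint length below are illustrative assumptions):

```java
// Rough memory footprint of an in-memory fingerprint screen.
final class Footprint {
    public static void main(String[] args) {
        long n = 35_000_000L;        // assumed database size (PubChem-scale)
        long bits = 1024;            // assumed fingerprint length
        long bytes = n * (bits / 8); // 128 bytes per fingerprint
        System.out.println(bytes / (1024 * 1024 * 1024) + " GiB"); // prints "4 GiB"
    }
}
```

A few gigabytes for tens of millions of structures is well within the reach of a commodity server.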

Instead of working with bit strings directly, let’s convert them into a decimal system that is more natural to work with. Let |*a*| and |*b*| be the unsigned decimal values corresponding to *p* and *q*, respectively. Since setting additional bits can only increase the unsigned value, any *p* that contains *q* must satisfy the following necessary (though not sufficient) condition:

|a| >= |b|
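For multi-word fingerprints, this “decimal value” comparison amounts to an unsigned lexicographic comparison of the word arrays, most significant word first. A sketch (assuming equal-length arrays stored most-significant word first; the class name is our own):

```java
// Treats a fingerprint, stored most-significant word first, as one large
// unsigned integer and compares two such integers.
// If q's bits are a subset of p's, then every word of p is >= the
// corresponding word of q, so compareUnsigned(p, q) >= 0 always holds.
final class FpOrder {
    static int compareUnsigned(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) {
            int c = Long.compareUnsigned(a[i], b[i]);
            if (c != 0) return c; // first differing word decides the order
        }
        return 0; // identical fingerprints
    }
}
```

The converse does not hold: a numerically larger fingerprint need not contain the smaller one, which is why this ordering is a screen, not a replacement for the bitwise test.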

*The conversion effectively endows bit strings with an ordering that is consistent with the basic properties of the Bloom filters*. This forms the basis of our proposed improvement, which we summarize as follows:

- Let *P* = {*p*_{1} … *p*_{N}} denote the entire database of *N* bit strings.
- A bit string *p*_{i} is defined as a *K*-dimensional vector of 32-bit (signed or unsigned) integers; e.g., *K* = 16 and 32 for 512- and 1024-bit strings, respectively.
- In a preprocessing step, we construct a k-d tree for *P*. A depth-first traversal of the tree guarantees that, for any two given leaf nodes, the leftmost node is never contained within the rightmost node.
- The screening step is now reduced to (i) locating the appropriate leaf node in the tree and (ii) marching to the right in depth-first order, reporting all matches.
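The core idea can be illustrated with a simplified one-dimensional analogue: a sorted array in place of the k-d tree, with single-word fingerprints (so *K* = 1). This is a hypothetical stand-in for the actual data structure, but the two search phases are the same: locate a starting position, then march right applying the subset test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// 1-D analogue of the proposed screen: sort the database by unsigned
// fingerprint value, binary-search for the query's position, then march
// right, applying the word-wise subset test to each candidate.
final class SortedScreen {
    private final long[] db; // fingerprints sorted ascending by unsigned value

    SortedScreen(long[] fingerprints) {
        Long[] boxed = Arrays.stream(fingerprints).boxed().toArray(Long[]::new);
        Arrays.sort(boxed, Long::compareUnsigned);
        db = new long[boxed.length];
        for (int i = 0; i < db.length; i++) db[i] = boxed[i];
    }

    // All database fingerprints that contain every bit of the query.
    List<Long> screen(long q) {
        // (i) locate the first entry whose unsigned value is >= q; entries
        // to its left cannot contain q, since a superset of q's bits is
        // never numerically smaller than q.
        int lo = 0, hi = db.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (Long.compareUnsigned(db[mid], q) < 0) lo = mid + 1;
            else hi = mid;
        }
        // (ii) march right, keeping the entries that pass the subset test
        List<Long> hits = new ArrayList<>();
        for (int i = lo; i < db.length; i++)
            if ((db[i] & q) == q) hits.add(db[i]);
        return hits;
    }

    public static void main(String[] args) {
        SortedScreen s = new SortedScreen(new long[]{0b1111, 0b0001, 0b0110, 0b0101});
        System.out.println(s.screen(0b0101)); // prints [5, 15]
    }
}
```

The k-d tree generalizes the pruning to all *K* dimensions at once, which is where the larger speed-ups come from.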

A self-contained reference implementation in Java is available here. It certainly has plenty of room for improvement, so we’d love to get feedback. Based on our not-so-exhaustive analysis, we have observed the following key properties:

- The more specific the query (i.e., the larger the query graph), the faster the search. Because the query bit string tends to contain a large number of on bits, its starting leaf node is typically toward the right, thereby pruning a significant portion of the search space.
- Bit strings with “structure” perform better than those without. By “structure” we simply mean that the distribution of the bits is not uniform. We can, for example, reorganize the split values during tree construction based on the bit distribution. This also implies that structural keys perform better than hashed fingerprints.
- The speed-up factor over linear scanning improves as the dimension *K* increases.

This is similar to something we do for PubChem 2-D neighboring (create the index and search it) and for speeding up 3-D RMSD clustering of conformers (just create the index to remove redundancy). The log(N) algorithmic aspect makes it super fast to create the index and search it. We don’t use this approach to speed up 2-D structure/similarity searches, though, as we simply search parts of PubChem simultaneously (an “embarrassingly parallel” divide-and-conquer approach). (As a result, a linear fingerprint screen takes about 1–2 seconds for 35M structures when using “warmed up” memory-mapped files.)

If you are interested, let me know and I will send you the poster I presented at an OpenEye CUP meeting. (For some reason, the PDF of my poster on the OpenEye CUP VII site is not available; the link points to my talk rather than my poster.)

Best,

Evan