*p*and

*q*denote the Bloom filters of the target and query (molecular) graphs, respectively. Then to determine whether

*q*is a subgraph of

*p*amounts to the following binary logic:

```
(p & q) == q
```

Only when the above expression is true that *p*and

*q*move on to a more rigorous subgraph isomorphism matching. Although Bloom filters are quite powerful, screening through a large database in the millions requires a linear scan on the order of the database size:

```
foreach p
```_{i} in db
if (p_{i} & q) == q then
do graph isomorphism

When the Bloom filters for the entire database can fit into main memory, the linear scan above should still be manageable on modern server hardware. In fact, with the rapid lowering cost of memory hardware, we posit that in-memory searching is the only effectively solution that scales.
Instead of working with bit strings, let's convert them into a decimal system so that it's more natural to work with. Let |*a*| and |

*b*| be the unsigned decimal values correspond to

*p*and

*q*, respectively. Now the logical expression above can be approximated as follows:

```
|a| >= |b|
```

*The conversion effectively endows bit strings with an ordering that is consistent with the basic properties of the Bloom filters*. This forms the basis of our proposed improvement, which we summarize as follows:

- Let
*P*= {*p*_{1}...*p*_{N}} denote the entire database of*N*bit strings - A bit string
*p*_{i}is defined as an*K*-dimensional vector of 32-bit (signed or unsigned) integers; e.g.,*K*= 16 and 32 for 512- and 1024-bit strings, respectively. - In a preprocessing step, we construct an k-d tree data structure for
*P*. A depth first traversal of the tree guarantees that, for any two given leaf nodes, the left most node is never contained within the right most node. - Now the screening step is reduced to that of (i) locating the appropriate leaf node in the tree and (ii) marching to the right in depth-first order and report all matches.

- The more specific the query is (i.e., the larger the query graph), the faster the search. Because the query bit string tends to contain large number of on bits, its starting leaf node is typically toward the right, thereby pruning a significant portion of the search space.
- Bit strings with "structure" perform better than those without. Here, what we mean by "structure" is simply that the distribution of the bits are not uniform. We can, for example, reorganize the split values during the tree construction based on the bit distribution. This also implies that structural keys perform better than hashed fingerprints.
- The speed-up factor over linear scanning improves as the dimension
*K*increases.

This is similar to something we do for PubChem 2-D neighboring (create index and search it) and for speeding 3-D RMSD clustering of conformers (just create the index to remove redundancy). The log(N) algorithmic aspect to it makes it super fast to create the index and search it. We don’t use this approach to speed 2-D structure/similarity searches, though, as we simply simultaneously search parts of PubChem (“embarrassing parallel” divide and conqur approach). (A linear fingerprint screen takes about 1-2 secs for 35M structures, as a result, when using “warmed up” memory-mapped files.)

If you are interested, let me know, I will send you the poster I presented at an OpenEye CUP meeting. (For some reason, the PDF of my poster on the OpenEye CUP VII site is not available… being linked to my talk and not to my poster.)

Best,

Evan