Do structurally similar molecules have similar hash codes?

Molecular hash codes are fixed-length alphanumeric encoding of molecular graphs. They play a key role within chemical data management systems in facilitating (among other things) structural identity and uniquess validation. While an important component of any hash code generation approach is the structure standardization procedure (something which we briefly discussed previously), the focus of this post is on the encoding step, i.e., that of generating the hash code given a standardized structure. In the remainder of this post, we first give a brief overview of spectral hash code, our proposed multi-resolution hash code based on InChI. We then highlight the utility of the hash code through a number of examples. We conclude with the description of a web resource for converting between PubChem CID, InChIKey, and spectral hash code. The complete source code is readily available from our bitbucket repository. As always, we welcome comments and feedback!

A spectral hash code is a 30-character (150-bit) alphanumeric hash string that uniquely encodes an InChI. The hash code has three logical blocks—denoted as h1, h2, and h3—that progressively encode the InChI graph at finer resolution where h1h2h3 and |h1| = 9 (45-bit), |h2| = 19 (95-bit), and |h3| = 30 (150-bit). Consider, for example, the following hash code for Gleevec A9QNVQSAZAMG4SH428AUW4Y21WVPXD; here we have h1 = A9QNVQSAZ, h2 = A9QNVQSAZAMG4SH428A, and h3 = A9QNVQSAZAMG4SH428AUW4Y21WVPXD. A key feature of the hash code is that it’s lexicographically meaningful at the block level; this means that when sorted, the hash codes tend to group together structurally similar molecules.

A brief description of how each block is generated follows.

  1. Block h1 encodes the topology (or molecular skeleton) of the InChI graph. As the name implies, the underlying principle behind block h1 is based on spectral graph theory. Here a normalized Laplacian graph is constructed from the InChI’s connection layer (i.e., /c) and its spectrum (sorted eigenvalues) is hashed. While other graph representations (e.g., adjacency, signless Laplacian, etc.) work equally well, the normalized Laplacian gives us numerical stability, which is an important requirement for portability. We should note that while it’s still an open question as to which graphs are determined by their spectrum, there exist a number of non-isomorphic graphs that are cospectral. This is particularly the case for trees (graphs without cycles); as such, there will be a few cases of systematic hash “collisions” at block h1.
  2. Block h2 refines the graph topology by imposing a canonical ordering over the vertices. The resulting hash value is a combined digest of h1 and InChI’s /c layer. An important implication of this chaining is that two hash codes of the same h2 must have the same h1 blocks as well.
  3. Block h3 encodes the InChI graph at full resolution. Its hash value is a combined digest of block h2 and the entire InChI string.

Below are a few examples of spectral hash codes.


DMLC7QBAS17P7JKN43WHM2J1VA67GL

DMLC7QBAS17P7JKN43WJRKFUWQXKZ8

DMLC7QBAS186NS8RCUT48TKUL7TUW9

DMLC7QBAS186NS8RCUT6QQYS51CMKT

DMLC7QBASNCCV3MVHAJNZVRVTUM4FD

DMLC7QBASNCCV3MVHAJQ8CWYUW54YB

 

We have also developed a web resource that  can interconvert spectral hash code to/from PubChem CID and InChIKey. The syntax is as follows:

http://tripod.nih.gov/spectral/v1/{id}/{oper}[?max=N]

where {}‘s are required arguments and [] is optional. {id} can be either PubChem CID, InChIKey, or spectral hash code (h1, h2, or h3). {oper} can be one of {hk,h1,h2,smiles,inchi} where

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enlighten us with your wisdom... *