Do structurally similar molecules have similar hash codes?

Molecular hash codes are fixed-length alphanumeric encoding of molecular graphs. They play a key role within chemical data management systems in facilitating (among other things) structural identity and uniquess validation. While an important component of any hash code generation approach is the structure standardization procedure (something which we briefly discussed previously), the focus of this post is on the encoding step, i.e., that of generating the hash code given a standardized structure. In the remainder of this post, we first give a brief overview of spectral hash code, our proposed multi-resolution hash code based on InChI. We then highlight the utility of the hash code through a number of examples. We conclude with the description of a web resource for converting between PubChem CID, InChIKey, and spectral hash code. The complete source code is readily available from our bitbucket repository. As always, we welcome comments and feedback! A spectral hash code is a 30-character (150-bit) alphanumeric hash string that uniquely encodes an InChI. The hash code has three logical blocks—denoted as h₁, h₂, and h₃—that progressively encode the InChI graph at finer resolution where h₁ ⊂ h₂ ⊂ h₃ and |h₁| = 9 (45-bit), |h₂| = 19 (95-bit), and |h₃| = 30 (150-bit). Consider, for example, the following hash code for Gleevec A9QNVQSAZAMG4SH428AUW4Y21WVPXD; here we have h₁ = A9QNVQSAZ, h₂ = A9QNVQSAZAMG4SH428A, and h₃ = A9QNVQSAZAMG4SH428AUW4Y21WVPXD. A key feature of the hash code is that it's lexicographically meaningful at the block level; this means that when sorted, the hash codes tend to group together structurally similar molecules. A brief description of how each block is generated follows.

Block h₁ encodes the topology (or molecular skeleton) of the InChI graph. As the name implies, the underlying principle behind block h₁ is based on spectral graph theory. Here a normalized Laplacian graph is constructed from the InChI's connection layer (i.e., /c) and its spectrum (sorted eigenvalues) is hashed. While other graph representations (e.g., adjacency, signless Laplacian, etc.) work equally well, the normalized Laplacian gives us numerical stability, which is an important requirement for portability. We should note that while it's still an open question as to which graphs are determined by their spectrum, there exist a number of non-isomorphic graphs that are cospectral. This is particularly the case for trees (graphs without cycles); as such, there will be a few cases of systematic hash "collisions" at block h₁.
Block h₂ refines the graph topology by imposing a canonical ordering over the vertices. The resulting hash value is a combined digest of h₁ and InChI's /c layer. An important implication of this chaining is that two hash codes of the same h₂ must have the same h₁ blocks as well.
Block h₃ encodes the InChI graph at full resolution. Its hash value is a combined digest of block h₂ and the entire InChI string.

Below are a few examples of spectral hash codes.

`DMLC7QBAS17P7JKN43WHM2J1VA67GL`	`DMLC7QBAS17P7JKN43WJRKFUWQXKZ8`	`DMLC7QBAS186NS8RCUT48TKUL7TUW9`
`DMLC7QBAS186NS8RCUT6QQYS51CMKT`	`DMLC7QBASNCCV3MVHAJNZVRVTUM4FD`	`DMLC7QBASNCCV3MVHAJQ8CWYUW54YB`

We have also developed a web resource that can interconvert spectral hash code to/from PubChem CID and InChIKey. The syntax is as follows:

http://tripod.nih.gov/spectral/v1/{id}/{oper}[?max=N]

where {}'s are required arguments and [] is optional. {id} can be either PubChem CID, InChIKey, or spectral hash code (h₁, h₂, or h₃). {oper} can be one of {hk,h1,h2,smiles,inchi} where

hk maps the input {id} into one or more spectral hash codes. If {id} is PubChem CID, then only one spectral hash code is returned. For InChIKey, h₁, or h₂, there are likely to be multiple hash codes returned. Here are some examples:

http://tripod.nih.gov/spectral/v1/5291/hk

http://tripod.nih.gov/spectral/v1/KTUFNOKKBVMGRW-UHFFFAOYSA/hk

http://tripod.nih.gov/spectral/v1/A9QNVQSAZAMG4SH428A/hk
h1 and h2 accept {id} as either PubChem CID or InChIKey and return all compounds with the same h₁ and h₂, respectively.

http://tripod.nih.gov/spectral/v1/5291/h2

http://tripod.nih.gov/spectral/v1/KTUFNOKKBVMGRW/h1
smiles and inchi accept {id} as either h₁, h₂, or h₃ and return PubChem structures as SMILES and InChI, respectively.

http://tripod.nih.gov/spectral/v1/A9QNVQSAZAMG4SH428A/inchi

http://tripod.nih.gov/spectral/v1/DMLC7QBAS/smiles?max=10

Tripod Development

Cheminformatics proving ground

Do structurally similar molecules have similar hash codes?

Leave a Reply Cancel reply

	`http://tripod.nih.gov/spectral/v1/5291/hk`
	`http://tripod.nih.gov/spectral/v1/KTUFNOKKBVMGRW-UHFFFAOYSA/hk`
	`http://tripod.nih.gov/spectral/v1/A9QNVQSAZAMG4SH428A/hk`

	`http://tripod.nih.gov/spectral/v1/A9QNVQSAZAMG4SH428A/inchi`
	`http://tripod.nih.gov/spectral/v1/DMLC7QBAS/smiles?max=10`