Molecular hash codes are fixed-length alphanumeric encoding of molecular graphs. They play a key role within chemical data management systems in facilitating (among other things) structural identity and uniquess validation. While an important component of any hash code generation approach is the structure standardization procedure (something which we briefly discussed previously), the focus of this post is on the encoding step, i.e., that of generating the hash code given a standardized structure. In the remainder of this post, we first give a brief overview of **spectral hash code**, our proposed multi-resolution hash code based on InChI. We then highlight the utility of the hash code through a number of examples. We conclude with the description of a web resource for converting between PubChem CID, InChIKey, and **spectral hash code**. The complete source code is readily available from our bitbucket repository. As always, we welcome comments and feedback!

A **spectral hash code** is a 30-character (150-bit) alphanumeric hash string that uniquely encodes an InChI. The hash code has three logical blocks—denoted as *h*_{1}, *h*_{2}, and *h*_{3}—that progressively encode the InChI graph at finer resolution where *h*_{1} ⊂ *h*_{2} ⊂ *h*_{3} and |*h*_{1}| = 9 (45-bit), |*h*_{2}| = 19 (95-bit), and |*h*_{3}| = 30 (150-bit). Consider, for example, the following hash code for Gleevec `A9QNVQSAZAMG4SH428AUW4Y21WVPXD`

; here we have *h*_{1} = `A9QNVQSAZ`

, *h*_{2} = `A9QNVQSAZAMG4SH428A`

, and *h*_{3} = `A9QNVQSAZAMG4SH428AUW4Y21WVPXD`

. A key feature of the hash code is that it’s lexicographically meaningful at the block level; this means that when sorted, the hash codes tend to group together structurally similar molecules.

A brief description of how each block is generated follows.

- Block
*h*_{1}encodes the topology (or molecular skeleton) of the InChI graph. As the name implies, the underlying principle behind block*h*_{1}is based on*spectral graph theory*. Here a normalized Laplacian graph is constructed from the InChI’s connection layer (i.e.,`/c`

) and its spectrum (sorted eigenvalues) is hashed. While other graph representations (e.g., adjacency, signless Laplacian, etc.) work equally well, the normalized Laplacian gives us numerical stability, which is an important requirement for portability. We should note that while it’s still an open question as to which graphs are determined by their spectrum, there exist a number of non-isomorphic graphs that are cospectral. This is particularly the case for trees (graphs without cycles); as such, there will be a few cases of systematic hash “collisions” at block*h*_{1}. - Block
*h*_{2}refines the graph topology by imposing a canonical ordering over the vertices. The resulting hash value is a*combined*digest of*h*_{1}and InChI’s`/c`

layer. An important implication of this chaining is that two hash codes of the same*h*_{2}must have the same*h*_{1}blocks as well. - Block
*h*_{3}encodes the InChI graph at full resolution. Its hash value is a combined digest of block*h*_{2}and the entire InChI string.

Below are a few examples of spectral hash codes.

`DMLC7QBAS17P7JKN43WHM2J1VA67GL` |
`DMLC7QBAS17P7JKN43WJRKFUWQXKZ8` |
`DMLC7QBAS186NS8RCUT48TKUL7TUW9` |

`DMLC7QBAS186NS8RCUT6QQYS51CMKT` |
`DMLC7QBASNCCV3MVHAJNZVRVTUM4FD` |
`DMLC7QBASNCCV3MVHAJQ8CWYUW54YB` |

We have also developed a web resource that can interconvert spectral hash code to/from PubChem CID and InChIKey. The syntax is as follows:

`http://tripod.nih.gov/spectral/v1/{id}/{oper}[?max=N]`

where `{}`

‘s are required arguments and `[]`

is optional. `{id}`

can be either PubChem CID, InChIKey, or spectral hash code (*h*_{1}, *h*_{2}, or *h*_{3}). `{oper}`

can be one of {`hk,h1,h2,smiles,inchi`

} where

`hk`

maps the input`{id}`

into one or more spectral hash codes. If`{id}`

is PubChem CID, then only one spectral hash code is returned. For InChIKey,*h*_{1}, or*h*_{2}, there are likely to be multiple hash codes returned. Here are some examples:

`http://tripod.nih.gov/spectral/v1/5291/hk`

`http://tripod.nih.gov/spectral/v1/KTUFNOKKBVMGRW-UHFFFAOYSA/hk`

`http://tripod.nih.gov/spectral/v1/A9QNVQSAZAMG4SH428A/hk`

`h1`

and`h2`

accept`{id}`

as either PubChem CID or InChIKey and return all compounds with the same*h*_{1}and*h*_{2}, respectively.

`http://tripod.nih.gov/spectral/v1/5291/h2`

`http://tripod.nih.gov/spectral/v1/KTUFNOKKBVMGRW/h1`

`smiles`

and`inchi`

accept`{id}`

as either*h*_{1},*h*_{2}, or*h*_{3}and return PubChem structures as SMILES and InChI, respectively.

`http://tripod.nih.gov/spectral/v1/A9QNVQSAZAMG4SH428A/inchi`

`http://tripod.nih.gov/spectral/v1/DMLC7QBAS/smiles?max=10`