Molecular hash codes are fixed-length alphanumeric encoding of molecular graphs. They play a key role within chemical data management systems in facilitating (among other things) structural identity and uniquess validation. While an important component of any hash code generation approach is the structure standardization procedure (something which we briefly
discussed previously), the focus of this post is on the encoding step, i.e., that of generating the hash code given a standardized structure. In the remainder of this post, we first give a brief overview of
spectral hash code, our proposed multi-resolution hash code based on InChI. We then highlight the utility of the hash code through a number of examples. We conclude with the description of a web resource for converting between PubChem CID, InChIKey, and
spectral hash code. The complete source code is readily available from our
bitbucket repository. As always, we welcome comments and feedback!
A
spectral hash code is a 30-character (150-bit) alphanumeric hash string that uniquely encodes an
InChI. The hash code has three logical blocks—denoted as
h1,
h2, and
h3—that progressively encode the InChI graph at finer resolution where
h1 ⊂
h2 ⊂
h3 and |
h1| = 9 (45-bit), |
h2| = 19 (95-bit), and |
h3| = 30 (150-bit). Consider, for example, the following hash code for
Gleevec A9QNVQSAZAMG4SH428AUW4Y21WVPXD
; here we have
h1 =
A9QNVQSAZ
,
h2 =
A9QNVQSAZAMG4SH428A
, and
h3 =
A9QNVQSAZAMG4SH428AUW4Y21WVPXD
. A key feature of the hash code is that it's lexicographically meaningful at the block level; this means that when sorted, the hash codes tend to group together structurally similar molecules.
A brief description of how each block is generated follows.
- Block h1 encodes the topology (or molecular skeleton) of the InChI graph. As the name implies, the underlying principle behind block h1 is based on spectral graph theory. Here a normalized Laplacian graph is constructed from the InChI's connection layer (i.e.,
/c
) and its spectrum (sorted eigenvalues) is hashed. While other graph representations (e.g., adjacency, signless Laplacian, etc.) work equally well, the normalized Laplacian gives us numerical stability, which is an important requirement for portability. We should note that while it's still an open question as to which graphs are determined by their spectrum, there exist a number of non-isomorphic graphs that are cospectral. This is particularly the case for trees (graphs without cycles); as such, there will be a few cases of systematic hash "collisions" at block h1.
- Block h2 refines the graph topology by imposing a canonical ordering over the vertices. The resulting hash value is a combined digest of h1 and InChI's
/c
layer. An important implication of this chaining is that two hash codes of the same h2 must have the same h1 blocks as well.
- Block h3 encodes the InChI graph at full resolution. Its hash value is a combined digest of block h2 and the entire InChI string.
Below are a few examples of spectral hash codes.
We have also developed a web resource that can interconvert spectral hash code to/from PubChem CID and InChIKey. The syntax is as follows:
http://tripod.nih.gov/spectral/v1/{id}/{oper}[?max=N]
where
{}
's are required arguments and
[]
is optional.
{id}
can be either PubChem CID, InChIKey, or spectral hash code (
h1,
h2, or
h3).
{oper}
can be one of {
hk,h1,h2,smiles,inchi
} where
hk
maps the input {id}
into one or more spectral hash codes. If {id}
is PubChem CID, then only one spectral hash code is returned. For InChIKey, h1, or h2, there are likely to be multiple hash codes returned. Here are some examples:
h1
and h2
accept {id}
as either PubChem CID or InChIKey and return all compounds with the same h1 and h2, respectively.
smiles
and inchi
accept {id}
as either h1, h2, or h3 and return PubChem structures as SMILES and InChI, respectively.