A few years ago we made available our implementation of the PubChem substructure keys fingerprint. Being it was our first attempt, there were a number of shortcomings with the implementation in terms of performance and accuracy. Recently, we have the opportunity to revisit the code and address some of these shortcomings. In this post, we briefly describe some of the improvements. (For the impatient, the latest code is available here.) We should note that since our original implementation, there have been at least two other implementations that we’re aware of. It certainly would be interesting to compare the different implementations; if anything, the comparison should reveal fundamental differences in aromaticity and ring perceptions across various cheminformatics toolkits.
As with most substructure keys-based fingerprints, perhaps the biggest performance bottleneck has to do with repeated subgraph isomorphism searches of a fixed set of substructure keys (encoded as either SMILES or SMARTS) against the target molecule. Here, a naïve approach that sequentially searches each key is unlikely to perform well for all but small key sets (e.g., 100–200). Given that there are roughly 618 (out of 881 total size) substructure keys defined for the PubChem fingerprint, this approach is unlikely to scale, as demonstrated by our original implementation. Efficiently addressing this bottleneck requires an indexing data structure much like suffix-tree or –array in stringology (though currently the approach described here is clearly the most effective). In this spirit, we’ve built upon our recent work on large scale structure searching as the indexing engine for the substructure keys, thereby reducing the problem of multiple substructure searches to that of a single superstructure search.
Another shortcoming of our original implementation centered around ring perception, a topic that has its share of ambiguities within the cheminformatics literature. Our use of smallest set of smallest rings (SSSR) in place of extended set of smallest rings (ESSR) was clearly not adequate. As such, we’ve modified an algorithm for finding the shortest cycle basis of a graph to effectively handle ESSR.
While we’ve strived to be diligent with the implementation, it’s just not possible for have an exact replica of PubChem’s for obvious reasons. Nonetheless, depending on the test set, the current implementation averages less than 1 bit of error per molecule. The complete code and example usages are available here. As always, we’d love to receive feedback on how to further improve it.