Large-scale substructure searching

Motivated by the renewed interest in substructure searching in the literature, we recently developed a proof-of-concept, self-contained substructure search engine that scales to large databases with modest hardware requirements. The current prototype handles the PubChem database (>30M structures) with reasonable performance on any modest server with sufficient (>12 GB) RAM. This work extends our recent work on improving fingerprint screening. While it’s tempting to throw out qualitative (and/or unverifiable) performance numbers, we’ll let you be the judge. The prototype hosting the entire PubChem database (snapshot taken in September 2011) is available here. Please bear in mind that this is a prototype, so it might not handle DoS-type queries (e.g., c1ccccc1) gracefully. The binary and source code for the entire prototype are also available. We’d love to help you deploy it in-house, so feel free to contact us.
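For readers unfamiliar with fingerprint screening, here is a minimal Python sketch of the general idea. It is not the prototype's code. The 8-bit fingerprints, the molecule names, and the helper functions `passes_screen` and `screen` are all hypothetical, chosen only for illustration. A target can contain the query as a substructure only if every bit set in the query fingerprint is also set in the target fingerprint, so a cheap bitwise test can discard most of the database before the expensive atom-by-atom matching.

```python
from typing import Dict, List


def passes_screen(query_fp: int, target_fp: int) -> bool:
    """Return True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def screen(query_fp: int, database: Dict[str, int]) -> List[str]:
    """Return the names of database entries that survive the bit-subset test.

    Survivors are only candidates; a real engine would still verify each one
    with full substructure matching.
    """
    return [name for name, fp in database.items() if passes_screen(query_fp, fp)]


# Toy 8-bit fingerprints, assigned by hand (hypothetical data).
database = {
    "mol_a": 0b10110110,
    "mol_b": 0b00100100,
    "mol_c": 0b11111111,
    "mol_d": 0b00000110,
}

# A specific query sets several bits, so few targets survive the screen.
specific_query = 0b10110000
# A generic query sets very few bits, so nearly everything survives.
generic_query = 0b00000100

specific_hits = screen(specific_query, database)
generic_hits = screen(generic_query, database)

print("specific query candidates:", specific_hits)
print("generic query candidates:", generic_hits)
```

The generic query suggests why something like c1ccccc1 can be so costly: a very common substructure sets few distinguishing bits, so the screen removes little and most of the database may still need full verification.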

3 Comments

[...] NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient [...]

[...] should point out that NCATS has already released code to allow fast similarity search using an in-memory fingerprint index, that supports millisecond [...]

