Large scale compound clustering in 3D - Abstract

Large scale compound clustering in 3D was presented by Dr Paolo Tosco, Cresset, at the Cresset European User Group Meeting 2015.


Clustering a collection of n compounds requires the computation of a triangular similarity matrix filled with pairwise similarity values. Since the cost of computing such a matrix is O(n²), fast similarity metrics based on 2D fingerprints are most often used for this purpose. However, a 2D metric has significant inherent limitations in capturing the biological similarity across conformationally flexible molecules. Large variations in functional group decoration may only marginally affect 3D steric/electrostatic properties; conversely, moving from an extended to a folded conformation may have a dramatic influence on recognition by a macromolecular target, an effect that is completely ignored by 2D methods.
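To make the quadratic cost concrete, the following is a minimal sketch of a triangular Tanimoto similarity matrix. It is not the method used in the work presented: fingerprints are assumed to be plain sets of on-bit indices, and the names `tanimoto`, `triangular_similarity` and `pair_count` are hypothetical helpers introduced only for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints share nothing; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def triangular_similarity(fingerprints):
    """Return {(i, j): similarity} for every pair i < j (upper triangle only)."""
    return {
        (i, j): tanimoto(fingerprints[i], fingerprints[j])
        for i, j in combinations(range(len(fingerprints)), 2)
    }


def pair_count(n):
    """Number of entries in the triangular matrix for n compounds: n(n-1)/2."""
    return n * (n - 1) // 2


# Toy fingerprints (illustrative bit indices, not real molecules).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({3, 4, 5, 6}),
    frozenset({1, 2, 3, 4, 7}),
]
sims = triangular_similarity(fps)
```

Because only pairs with i < j are stored, the matrix holds n(n-1)/2 values. For a library of, say, 800,000 compounds, `pair_count(800_000)` gives 319,999,600,000 pairwise comparisons, which is why the cost of each individual similarity evaluation matters so much at this scale.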

We recently collaborated with BioBlocks on clustering their Comprehensive Fragment Library (CFL), a non-random selection of about 800K variably decorated heterocyclic core structures generated from first principles. When Tanimoto distances based on 2D ECFP4 fingerprints were applied to a subset of this collection, the results were disappointing: the distribution of 2D similarity values was rather flat across the set, and there was no correlation with 3D similarity metrics. The largely unprecedented chemical nature of the structures and the 3D-oriented design of the library called for high-quality three-dimensional conformer generation, molecular alignment and similarity metrics.

The methodological and technical solutions adopted to enable 3D clustering of such a large collection were presented. The higher quality and information content of 3D versus 2D clusters were illustrated through selected examples.
