2D fingerprint-based Tanimoto distances are widely used for clustering due to their good overall balance between speed and effectiveness. However, there are significant limitations in the ability of a 2D fingerprint-based method to capture the biological similarity between molecules, especially when conformationally flexible structures are involved. Structures that appear to differ substantially in functional group decoration may give rise to quite similar steric/electrostatic properties, which are what actually determine their recognition by biological macromolecules.
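As a reminder of what the 2D metric measures, the following is a minimal Python sketch of the Tanimoto similarity and distance between two fingerprints, each represented as a set of on-bit indices. The bit sets and function names are made up for illustration; in practice the ECFP4 fingerprints would be generated by a cheminformatics toolkit, and nothing here is taken from the CFL implementation.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention assumed for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


def tanimoto_distance(bits_a, bits_b):
    """Distance used for clustering: 1 minus the Tanimoto similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Hypothetical fingerprints: 3 shared bits out of 6 distinct bits overall.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 10, 12, 20}

print(tanimoto_similarity(fp_a, fp_b))  # 3 / 6
print(tanimoto_distance(fp_a, fp_b))
```

Because the score depends only on which substructure bits are shared, two molecules with different functional groups but similar shape and electrostatics can still score as dissimilar.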
In BioBlocks’ Comprehensive Fragment Library (CFL) program, we were confronted with clustering a very large collection of scaffolds generated from first principles. Due to the largely unprecedented structures in the set and our design aim to populate the 3D ‘world’, using the best 3D metrics was critical. The structural diversity of the starting collection of about 800K heterocyclic scaffolds with variable functional group decoration was not adequately captured by 2D ECFP4 fingerprint Tanimoto distances, as shown by the rather flat distribution of 2D similarity values across the set, and by their lack of correlation with the 3D similarity metrics.
The initial step of any clustering procedure is the computation of an upper triangular matrix holding similarity values between all pairs of compounds. This step becomes computationally demanding when using 3D methods, since an optimum alignment between the molecules needs to be found, taking multiple conformers into account.
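The structure of this step can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used in the CFL program: each molecule is reduced to a list of conformers, each conformer to a single made-up number, and `toy_score` is a hypothetical stand-in for a real 3D alignment and scoring routine. The sketch only shows why the cost grows with both the number of pairs and the number of conformer combinations per pair.

```python
from itertools import combinations


def n_pairs(n):
    """Number of entries in the upper triangle (diagonal excluded)."""
    return n * (n - 1) // 2


def best_conformer_score(confs_a, confs_b, score_pair):
    """Best score over all conformer pairs of two molecules."""
    return max(score_pair(ca, cb) for ca in confs_a for cb in confs_b)


def upper_triangle(molecules, score_pair):
    """Map (i, j) with i < j to the best conformer-pair similarity."""
    return {
        (i, j): best_conformer_score(molecules[i], molecules[j], score_pair)
        for i, j in combinations(range(len(molecules)), 2)
    }


def toy_score(conf_a, conf_b):
    # Hypothetical placeholder: a real 3D method would align the two
    # conformers and score their steric/electrostatic overlap.
    return 1.0 / (1.0 + abs(conf_a - conf_b))


# Three toy molecules with 2, 2 and 1 conformers respectively.
molecules = [[0.0, 1.0], [1.0, 3.0], [5.0]]
matrix = upper_triangle(molecules, toy_score)

for pair, score in sorted(matrix.items()):
    print(pair, round(score, 3))

# Pair count for a collection of 800,000 scaffolds.
print(n_pairs(800_000))
```

For a set of about 800K scaffolds the upper triangle holds roughly 3.2 × 10^11 pairs, each of which requires several conformer-to-conformer comparisons, which is why a naive all-against-all 3D calculation is impractical.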
The presentation covers the methodological and technical solutions adopted to enable 3D clustering of such a large set of compounds. Selected examples will be presented to compare the quality and the informative content of 3D vs 2D clusters.
See the presentation ‘Examining the diversity of large collections of building blocks in 3D’, given at the Sheffield Chemoinformatics Conference 2016.