Large-scale compound clustering in 3D

At the Cresset European User Group Meeting 2015 I presented recent work carried out as a collaboration with BioBlocks, a US based company specializing in medicinal chemistry and drug discovery. BioBlocks had created a large fragment library. The goal was to identify ~2,000 representative compounds to synthesize from a virtual set of 800,000.

The challenges were from the size of the set and the nature of the compound structures, which all featured a methyl handle plus multiple diastereomers and conformers.

The plan was to use clustering based on the shapes and fields of the compounds in order to gain the greatest representative coverage from the chosen 2,000 compounds.

A pilot study to choose a 2D or 3D approach

Cresset and BioBlocks determined whether it was indeed necessary to use a 3D approach, or whether a 2D approach would be equally effective. A 2D approach would ignore the stereochemistry and would dispense with the need for a conformation search. Each pairwise comparison would take only 45 µs. By contrast, a 3D similarity approach would include a conformational hunt and would include the stereochemistry. As a result, each pairwise comparison would take 40 ms.

We carried out a pilot study to determine whether it was computationally feasible and valuable to carry out a 3D assessment.

The process proposed was to take the 2D structures as supplied by BioBlocks, protonating them based on our set of rules, generating racemic diastereomers for each structure, then building conformations and computing field points and finally clustering the compounds based on their 3D similarity.

Diastereomer enumeration should in principle be straightforward: 3 chiral centres lead to 8 diastereomers. However, when attempting to build 3D coordinates, not all diastereomers can give rise to sensible 3D structures. There were also other pitfalls with the diastereomer enumeration, connected with non-invertible nitrogens, meso forms and pseudoasymmetric centres.

Instead, we used a more complex workflow. We took into account the special cases not handled by the cheminformatics toolkit and discarded the diastereomers with bad geometries.

Then we computed pairwise similarities, comparing each diastereomer with all others, keeping into account multiple conformers. First we aligned pairs by their methyl handle, then rotated them around the methyl handle in 30° increments. Once we had done this across the 12 positions, we found the global minima by sampling the two lowest local minima in 5° increments.

The pilot set was assembled by picking roughly 20,000 compounds from the 800,000 pool; pairwise similarities were computed and collected in an upper triangular matrix. This calculation takes 96 CPU days and, once distributed on 100 cores on our in-house cluster, was completed in 2 and a half days. Scaling was difficult due to I/O contention, since all of the nodes were concurrently accessing the same NFS share. The same matrix was computed in 3 h using ECFP4 fingerprints.

There is very little correlation between the 2D and 3D similarity values.

This 20,000-compound pilot was clustered into 4,000 clusters. In order to assess the quality of clusters we used a silhouette metric. This assesses how similar the clusters are internally and how dissimilar to other clusters.

Clustering examples for 2D and 3D structures were presented. There is a nice alignment and a good separation of properties for the 3D. For the 2D, the properties are scattered and the alignments are unclear. Even for a cluster with a zero silhouette score, the 3D cluster was preferred.

It was quite apparent that the 3D clustering is much more successful at picking 3D-similar compounds than the 2D algorithm.

Clustering a 150,000 subset

Clustering on the full set of 800,000 was not computationally feasible. It would have required 380 CPU years. Instead, a representative pick of the full set (~150,000 compounds) was carried out in order to cover property space. Clustering was done on this set and the remaining 650,000 compounds were then assigned to clusters thus identified.

The methods used for dividing up the job into discrete nodes were presented. We used compressed files for the transfers in order to reduce the time required for reading and writing the data.

Jobs were distributed on the Amazon Elastic Compute Cloud using the StarCluster toolkit, which provides a nice interface to the underlying EC2 API, to manage the cluster and submit jobs through Sun Grid Engine.

An on-demand instance was used as the master node, while all worker nodes were spot instances. The latter are amenable to abrupt shutdown, but their 5 to 8-fold lower cost balanced the slightly higher management overhead. We shuffled around both the cluster and the storage between different available zones so as to get the best price.

The similarity matrix calculation was accomplished in 4 days. This time, scaling was excellent thanks to the high bandwidth and the use of compressed files. The 150,000 set was partitioned into 25,000 clusters.

Global clustering

Once the cluster centroids had been established, the remaining compounds were each assigned to the closest centroid.

Thanks to the way the 800,000 set was designed, given any compound in the set it is now possible to retrieve analogs of the core structure, including different substitution patterns, superstructures of the latter and finally, thanks to the 3D clustering, compounds having similar 3D structures. This will allow BioBlocks to estimate the coverage of the 800,000 original compounds from any picked subset.


  • 3D clustering outperforms 2D with reference to intra-cluster distribution of field/shape properties
  • The reduced clustering method is a viable way to cluster a large dataset
  • Scalable performance on cloud computing resources allows clustering 6-digit datasets in reasonable time for less than $10,000 CPU costs.