We have been investigating the diversity and clustering of compounds using our 3D similarity metrics of Fields and shape for a couple of years. In our new Activity Miner module for Forge and Torch we are able to cluster a dataset using fields in the specific context of a particular protein. This approach has proved interesting in our hands and we hope that you will agree once the module is released in September (or join the Beta test program now).
However, we have also been considering the more general question of clustering compound collections using 3D similarity rather than 2D. Back in 2012 I presented our initial work in this area (see the presentation here), which was focussed on a specific goal: preparing a screening library for a client. This work has been put back in the spotlight by last month’s release of BlazeGPU. With the speed of the similarity calculations massively increased, we believe that we can significantly increase the number of compounds that we can look at. Additionally, we have been improving the GPU code to enable comparison of multiple conformations of both the probe and database molecules. The eventual purpose of this is to enable GPU acceleration of the FieldTemplater calculations, but we can use it here to create a matrix of similarities across a compound collection.
What we did:
- We created and booted an Amazon GPU instance
- We conformationally explored each member of an 8,000-compound fragment library. Since these were small fragments we kept a maximum of 5 conformations per molecule
- We randomly selected 100 compounds from the 8,000
- We compared all 100 compounds to each of the 8,000 members of the collection to give a 100 column by 8,000 row matrix containing the highest similarity value found for each pair of molecules
- We clustered the molecules into 50 clusters using the similarity matrix
- We visually inspected the clusters.
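The steps above can be sketched in a few lines of Python. This is only an illustration under several assumptions, not the code we ran: `toy_similarity` and `toy_library` are hypothetical stand-ins for the field and shape similarity and for real conformers, "highest similarity" is assumed to mean the best score over all conformer pairs, and plain k-means on the matrix rows is just one possible clustering choice (the method we used is not described here). All function names are invented for the example.

```python
import numpy as np

def toy_library(n_mols, rng, max_confs=5):
    """Hypothetical stand-in: each molecule is a list of 1-5 'conformers' (3-vectors)."""
    library = []
    for _ in range(n_mols):
        base = rng.normal(size=3)
        n_confs = int(rng.integers(1, max_confs + 1))
        library.append([base + 0.1 * rng.normal(size=3) for _ in range(n_confs)])
    return library

def toy_similarity(conf_a, conf_b):
    """Hypothetical similarity in (0, 1]; NOT the field/shape score."""
    return 1.0 / (1.0 + float(np.linalg.norm(conf_a - conf_b)))

def select_references(n_mols, n_refs, rng):
    """Pick n_refs distinct molecule indices uniformly at random."""
    return rng.choice(n_mols, size=n_refs, replace=False)

def best_conformer_similarity(confs_a, confs_b, similarity):
    """Assumption: keep the highest score over all conformer pairs."""
    return max(similarity(a, b) for a in confs_a for b in confs_b)

def similarity_matrix(library, ref_idx, similarity):
    """Rows are library molecules, columns are the reference molecules."""
    mat = np.empty((len(library), len(ref_idx)))
    for i, mol in enumerate(library):
        for j, r in enumerate(ref_idx):
            mat[i, j] = best_conformer_similarity(mol, library[r], similarity)
    return mat

def kmeans(rows, k, rng, n_iter=50):
    """Plain Lloyd's k-means on the matrix rows; returns one label per row."""
    centres = rows[rng.choice(len(rows), size=k, replace=False)].copy()
    labels = np.zeros(len(rows), dtype=int)
    for it in range(n_iter):
        # Squared Euclidean distance of every row to every centre.
        d2 = ((rows ** 2).sum(axis=1)[:, None]
              - 2.0 * rows @ centres.T
              + (centres ** 2).sum(axis=1)[None, :])
        new_labels = d2.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = rows[labels == c]
            if len(members):  # leave the centre of an empty cluster unchanged
                centres[c] = members.mean(axis=0)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    library = toy_library(200, rng)                 # 8,000 in the experiment
    refs = select_references(len(library), 20, rng)  # 100 in the experiment
    matrix = similarity_matrix(library, refs, toy_similarity)
    labels = kmeans(matrix, 10, rng)                # 50 clusters in the experiment
    print(matrix.shape, np.bincount(labels, minlength=10))
```

Each molecule is described by its row of similarities to the reference compounds, so two molecules land in the same cluster when they look alike from the point of view of the references. The sizes in the demo are deliberately small; the nested Python loops stand in for the part that the GPU makes fast.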
Looking at a set of randomly selected compounds from a couple of clusters gives an indication that this could be a useful method for looking at compound collections (see 3D and 2D pictures below). The method clearly has some correlation with 2D similarity but also transcends the limitations of 2D, especially for compounds of this size.
For me there is clearly enough signal in this simple experiment to warrant further investigation of the best ways to create the matrix and to cluster compounds within it. In this case I have used 100 conformationally explored compounds to tease out the 3D differences within a dataset of 8,000. Is this enough compounds to define the space? Is random selection the best way to choose these compounds? Is there a smarter way to analyze and cluster the data? These questions outline the scope of our research project.
If you are interested in the work done here then contact us. We would welcome an open discussion on the best ways to solve these problems.
Dr Tim Cheeseright, Director of Products