Tackling selectivity with Activity Atlas


Activity Atlas1 is a new component available in Forge2, Cresset’s powerful workbench for ligand design and SAR analysis. Activity Atlas models summarize the SAR for a series into a visual 3D model that informs design decisions and helps prioritize molecules for synthesis. This new method is particularly useful for project teams where there is not enough SAR for a traditional 3D-QSAR approach. In this case study, Activity Atlas was used to analyze the SAR of a series of adenosine A1, adenosine A2a and adenosine A3 antagonists, with the objective of investigating and understanding the electrostatic, hydrophobic and shape features underlying receptor selectivity.


Activity Atlas is a probabilistic method of analyzing the Structure-Activity Relationships of a set of aligned compounds as a function of their electrostatic, hydrophobic and shape properties. The method uses a Bayesian approach to take a global view of the data in a qualitative manner. Results are displayed using Forge visualization capabilities to gain a better understanding of the features which underlie the SAR of your set of compounds.

Activity Atlas calculates and displays as 3D visualizations the:

  • ‘Activity cliff summary’: what do the activity cliffs tell us about the SAR?
  • ‘Average of actives’: what do active molecules have in common?
  • ‘Regions explored’: where have I been? For a new molecule, would making it increase our understanding? This analysis also calculates a novelty score for each molecule.

In this case study, the activity cliff summary method in Activity Atlas was used to analyze the SAR of a series of published3 adenosine A1, adenosine A2a and adenosine A3 antagonists, with the objective of understanding the electrostatic, hydrophobic and shape features underlying A2a over A1 and A2a over A3 selectivity.

The data set

The data set of 342 compounds originally published by Dimova and Bajorath3 was downloaded from the supplementary material, together with the compounds' potency values at the adenosine A1, A2a and A3 receptors. A subset of 102 tricyclic compounds (see Figure 1) was selected for the Activity Atlas analysis.

Figure 1. Reference compounds used to align the data set of 102 adenosine receptor antagonists. Left to right: Cmpd321, Cmpd296 and Cmpd249.

The Column Script Editor in Forge was used to calculate selectivity. The Editor uses a JavaScript syntax to operate in a simple, programmatic way on key properties of the molecules (like the atoms) and on the column data of the project (see Figure 2). Selectivity was calculated as follows:

  • A2a over A1 selectivity = pA2a potency – pA1 potency
  • A2a over A3 selectivity = pA2a potency – pA3 potency.
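The selectivity calculation above is a simple difference of p-scale potencies. The sketch below shows the arithmetic in Python rather than in the Column Script Editor's JavaScript syntax; the function names are hypothetical, and it assumes the potencies are on a negative log10 scale (as the "p" prefix suggests), so that a difference of one unit corresponds to a 10-fold difference in potency.

```python
def selectivity(p_target: float, p_off_target: float) -> float:
    """Selectivity as the difference of two p-scale potencies.

    Assumes both values are -log10 potencies (e.g. pKi), so a
    positive result means the compound is more potent at the target.
    """
    return p_target - p_off_target


def fold_selectivity(delta_p: float) -> float:
    """Convert a p-scale difference into a fold difference in potency."""
    return 10.0 ** delta_p


# Illustrative values only, not taken from the data set.
a2a_over_a1 = selectivity(p_target=8.0, p_off_target=6.0)
print(a2a_over_a1, fold_selectivity(a2a_over_a1))
```

Working on the log scale keeps the selectivity values additive and comparable across the series, which is convenient when they are used as the activity column for a model.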

Figure 2. The Columns Script Editor is a simple, programmatic way of creating or modifying values in the Molecules Table.

Conformation hunt and alignment of compounds

Cmpd321, Cmpd296 and Cmpd249 (see Figure 1) were chosen as the reference structures to drive the alignment of the full training set of 102 compounds.

A conformation hunt was carried out for Cmpd321 within Forge: an extended low energy conformation was chosen as the initial reference structure to which Cmpd296 and Cmpd249 were aligned by Maximum Common Substructure.

The 102 compounds in the training set were then aligned to the three reference compounds in Figure 1 by Maximum Common Substructure using a ‘very accurate but slow’ set-up for the conformation hunt:

  • Max number of conformations: 1000
  • RMS cut-off for duplicate conformers: 0.5
  • Gradient cut-off for conformer minimization: 0.1 kcal/mol
  • Energy window: 3 kcal/mol.

The use of a 3D similarity metric in Activity Atlas requires (as with 3D-QSAR) the generation of alignments for all compounds and is sensitive to misalignment and alignment noise. For this reason, visual inspection of alignments is always recommended, to ensure that there are no anomalies present. Where the calculated alignment is sub-optimal, manual intervention can be used to improve it. In this case study, the alignment of a few compounds was manually adjusted by flipping the phenyl ring on the phenyl-urea side chain (see Figure 1), to align the ortho and meta substituents in a consistent manner across the whole dataset.

Activity Atlas models are calculated following a probabilistic approach which takes into account the probability that a molecule is correctly aligned, as shown in Figure 3 below, rather than assuming that the top scoring or the selected preferred alignment is the correct alignment.

Figure 3. Analysis of both absolute and relative alignment scores to assess correctness of alignment.

This is done by associating a weight with each alignment based on its similarity score. Alignments with similarity higher than a certain threshold (which can either be automatically calculated by Forge, or manually defined by the user) are fully trusted. Alignments with similarity lower than the low similarity threshold are not trusted and discarded. Linear scaling is applied to associate a proper weight to alignments which have an intermediate similarity score.
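The weighting scheme described above can be sketched as a clamped linear ramp. This is an illustration of the idea only, not Forge's implementation: the function name and the threshold values used in the example are hypothetical.

```python
def alignment_weight(similarity: float, low: float, high: float) -> float:
    """Weight an alignment by its similarity score.

    - at or above `high`: fully trusted (weight 1.0)
    - at or below `low`:  not trusted, discarded (weight 0.0)
    - in between:         weight scaled linearly from 0 to 1
    """
    if high <= low:
        raise ValueError("high threshold must exceed low threshold")
    if similarity >= high:
        return 1.0
    if similarity <= low:
        return 0.0
    return (similarity - low) / (high - low)


# Hypothetical thresholds, chosen only for illustration.
for score in (0.45, 0.60, 0.75):
    print(score, alignment_weight(score, low=0.5, high=0.7))
```

A ramp of this kind avoids a hard cut-off, so that an alignment scoring just below the upper threshold still contributes to the model, only with less influence.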

Likewise, a weight is also associated with each molecule based on its activity. Molecules whose activity is higher than a certain threshold (which again can either be automatically calculated by Forge, or manually defined by the user) are considered fully active. Molecules whose activity is lower than the low activity threshold are considered inactive.

Read More…

July 2015 newsletter

Large-scale compound clustering in 3D

At our European User Group Meeting last month, Paolo Tosco presented a collaborative project with BioBlocks, a specialist medicinal chemistry and drug discovery company. How Cresset helped identify 2,000 representative compounds to synthesize, from a virtual set of 800,000, is detailed in the blog Large-scale compound clustering in 3D.

Tackling selectivity with Activity Atlas

Activity Atlas is a new component of Forge, our ligand design and SAR analysis workbench. The case study Tackling selectivity with Activity Atlas demonstrates the use of Activity Atlas in analyzing the SAR of a series of antagonists (adenosine A1, A2a and A3) with the objective of investigating and understanding the electrostatic, hydrophobic and shape features which underlie receptor selectivity.

Announcing the Spark CSD Fragment Database

Users of Spark now have access to the Spark CSD Fragment Database, a new database of fragments from the world’s most comprehensive curated collection of 3D small molecule organic and metal-organic crystal structures. Cresset and the Cambridge Crystallographic Data Centre (CCDC) have collaborated to create this new database from CCDC’s Cambridge Structural Database (CSD). The Spark CSD Fragment Database complements the current Spark databases, which include databases based on screening compounds, literature reports and theoretical rings.

Recent news

Meet us at the 250th ACS National Meeting and Exposition

Technical division presentations and posters:

  • Examining the diversity of large collections of building blocks in 3D #COMP 67
  • Rapid technique for new scaffold generation II: What is the best source of inspiration? #MEDI 90
  • Is this compound worth making? #MEDI 92
  • Is it worth making? Assessing the information content of new structures #COMP 431

Presentations from booth #516 on August 17 and 18:

  • Map your project’s SAR with Activity Atlas
  • Rapid technique for new scaffold generation
  • Using shape and electrostatics to gain new perspectives in medicinal chemistry design

Skolnik Award Symposium and Reception:

  • We are sponsoring the ACS CINF Herman Skolnik Award symposium and reception on August 18

See times and locations of presentations and posters.

Read More…

Large-scale compound clustering in 3D

At the Cresset European User Group Meeting 2015 I presented recent work carried out as a collaboration with BioBlocks, a US based company specializing in medicinal chemistry and drug discovery. BioBlocks had created a large fragment library. The goal was to identify ~2,000 representative compounds to synthesize from a virtual set of 800,000.

The challenges arose from the size of the set and the nature of the compound structures, which all featured a methyl handle plus multiple diastereomers and conformers.

The plan was to use clustering based on the shapes and fields of the compounds in order to gain the greatest representative coverage from the chosen 2,000 compounds.

A pilot study to choose a 2D or 3D approach

Cresset and BioBlocks first had to determine whether a 3D approach was really necessary, or whether a 2D approach would be equally effective. A 2D approach would ignore the stereochemistry and would dispense with the need for a conformation search. Each pairwise comparison would take only 45 µs. By contrast, a 3D similarity approach would include a conformation hunt and would take the stereochemistry into account. As a result, each pairwise comparison would take 40 ms.

We carried out a pilot study to determine whether it was computationally feasible and valuable to carry out a 3D assessment.

The proposed process was to take the 2D structures as supplied by BioBlocks, protonate them based on our set of rules, generate racemic diastereomers for each structure, build conformations, compute field points and finally cluster the compounds based on their 3D similarity.

Diastereomer enumeration should in principle be straightforward: 3 chiral centres lead to 8 diastereomers. However, when attempting to build 3D coordinates, not all diastereomers can give rise to sensible 3D structures. There were also other pitfalls with the diastereomer enumeration, connected with non-invertible nitrogens, meso forms and pseudoasymmetric centres.

Instead, we used a more complex workflow. We took into account the special cases not handled by the cheminformatics toolkit and discarded the diastereomers with bad geometries.

Then we computed pairwise similarities, comparing each diastereomer with all others and taking multiple conformers into account. First we aligned pairs by their methyl handle, then rotated them around the methyl handle in 30° increments. Once we had done this across the 12 positions, we found the global minimum by sampling around the two lowest local minima in 5° increments.
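The rotational search described above is a coarse-to-fine scan, which can be sketched as follows. This is a minimal illustration and not the production code: the function name, the `cost` callback (any quantity to be minimized as a function of the rotation angle) and the width of the fine-sampling window around each coarse minimum are assumptions.

```python
import math
from typing import Callable, Tuple


def coarse_to_fine_scan(
    cost: Callable[[float], float],
    coarse_step: int = 30,
    fine_step: int = 5,
    n_refine: int = 2,
) -> Tuple[float, float]:
    """Scan a full rotation coarsely, then refine the best local minima."""
    coarse_angles = list(range(0, 360, coarse_step))  # 12 positions at 30 degrees
    coarse_costs = [cost(a) for a in coarse_angles]
    n = len(coarse_angles)

    # Local minima on the circular grid: no higher than either neighbour.
    minima = [
        i for i in range(n)
        if coarse_costs[i] <= coarse_costs[(i - 1) % n]
        and coarse_costs[i] <= coarse_costs[(i + 1) % n]
    ]
    minima.sort(key=lambda i: coarse_costs[i])

    best_angle, best_cost = math.nan, math.inf
    for i in minima[:n_refine]:
        centre = coarse_angles[i]
        # Assumed window: fine sampling strictly between the two neighbours.
        for offset in range(-coarse_step + fine_step, coarse_step, fine_step):
            angle = (centre + offset) % 360
            value = cost(angle)
            if value < best_cost:
                best_angle, best_cost = angle, value
    return best_angle, best_cost


# Toy cost with a single minimum at 47 degrees (purely illustrative).
toy_cost = lambda angle: -math.cos(math.radians(angle - 47))
print(coarse_to_fine_scan(toy_cost))
```

Refining only the two lowest coarse minima keeps the number of evaluations per pair small, which matters when the scan is repeated for every pair of conformers.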

The pilot set was assembled by picking roughly 20,000 compounds from the 800,000 pool; pairwise similarities were computed and collected in an upper triangular matrix. This calculation takes 96 CPU days and, once distributed across 100 cores on our in-house cluster, was completed in two and a half days. Scaling was difficult due to I/O contention, since all of the nodes were concurrently accessing the same NFS share. The same matrix was computed in 3 h using ECFP4 fingerprints.
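Because similarity is symmetric, only the upper triangle of the matrix needs to be stored. One common way to do this, shown here as a sketch rather than as the storage scheme we actually used, is to flatten the upper triangle into a one-dimensional array and map each pair of compound indices to a position in it. The function names are hypothetical.

```python
def condensed_length(n: int) -> int:
    """Number of entries in the strict upper triangle of an n x n matrix."""
    return n * (n - 1) // 2


def condensed_index(i: int, j: int, n: int) -> int:
    """Position of pair (i, j) in a row-major flattened upper triangle."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # similarity is symmetric, so order does not matter
    return i * n - i * (i + 1) // 2 + (j - i - 1)


# For a 20,000-compound pilot set this gives just under 200 million entries.
print(condensed_length(20_000))
print(condensed_index(0, 1, 20_000), condensed_index(19_998, 19_999, 20_000))
```

Storing only the upper triangle halves the memory and disk footprint compared with the full square matrix.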

There is very little correlation between the 2D and 3D similarity values.

This 20,000-compound pilot was clustered into 4,000 clusters. To assess the quality of the clusters we used a silhouette metric, which measures how similar the members of each cluster are to one another and how dissimilar they are from the members of other clusters.
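For readers unfamiliar with the silhouette metric, the sketch below computes the standard silhouette coefficient from a pairwise distance matrix. It is a plain-Python illustration on toy data, not the code used in the project, and it assumes that similarities have already been converted to distances (for example as 1 minus similarity).

```python
from typing import Dict, List, Sequence


def silhouette_scores(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Per-item silhouette: (b - a) / max(a, b).

    a = mean distance to the other members of the item's own cluster
    b = lowest mean distance to the members of any other cluster
    """
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example: four points on a line forming two tight, well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
scores = silhouette_scores(dist, labels=[0, 0, 1, 1])
print(sum(scores) / len(scores))
```

Scores close to 1 indicate compact, well-separated clusters, scores near 0 indicate items sitting between clusters, and negative scores suggest an item would fit better elsewhere.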

Clustering examples for 2D and 3D structures were presented. There is a nice alignment and a good separation of properties for the 3D. For the 2D, the properties are scattered and the alignments are unclear. Even for a cluster with a zero silhouette score, the 3D cluster was preferred.

It was quite apparent that the 3D clustering is much more successful at picking 3D-similar compounds than the 2D algorithm.

Clustering a 150,000 subset

Clustering on the full set of 800,000 was not computationally feasible. It would have required 380 CPU years. Instead, a representative pick of the full set (~150,000 compounds) was carried out in order to cover property space. Clustering was done on this set and the remaining 650,000 compounds were then assigned to clusters thus identified.
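The CPU times quoted in this post can be roughly reproduced from the per-pair timings and the number of unique pairs, n(n-1)/2. The sketch below is a back-of-the-envelope check only; it assumes a single conformer-aware comparison per pair at the quoted 40 ms (3D) or 45 µs (2D) and a 365-day year.

```python
SECONDS_PER_DAY = 86_400


def cpu_seconds(n_compounds: int, seconds_per_pair: float) -> float:
    """Total CPU time for all unique pairwise comparisons."""
    unique_pairs = n_compounds * (n_compounds - 1) // 2
    return unique_pairs * seconds_per_pair


pilot_3d_days = cpu_seconds(20_000, 40e-3) / SECONDS_PER_DAY
pilot_2d_hours = cpu_seconds(20_000, 45e-6) / 3_600
full_3d_years = cpu_seconds(800_000, 40e-3) / (SECONDS_PER_DAY * 365)

print(f"pilot, 3D: {pilot_3d_days:.0f} CPU days")
print(f"pilot, 2D: {pilot_2d_hours:.1f} CPU hours")
print(f"full set, 3D: {full_3d_years:.0f} CPU years")
```

These estimates come out at roughly 93 CPU days and 2.5 hours for the pilot and about 400 CPU years for the full set, in line with the quoted 96 CPU days, 3 h and 380 CPU years given that the per-pair timings are rounded. The quadratic growth in the number of pairs is what makes the full 800,000 set impractical.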

The methods used for dividing up the job into discrete nodes were presented. We used compressed files for the transfers in order to reduce the time required for reading and writing the data.

Jobs were distributed on the Amazon Elastic Compute Cloud using the StarCluster toolkit, which provides a nice interface to the underlying EC2 API, to manage the cluster and submit jobs through Sun Grid Engine.

An on-demand instance was used as the master node, while all worker nodes were spot instances. The latter are subject to abrupt shutdown, but their 5 to 8-fold lower cost outweighed the slightly higher management overhead. We moved both the cluster and the storage between availability zones to get the best price.

The similarity matrix calculation was accomplished in 4 days. This time, scaling was excellent thanks to the high bandwidth and the use of compressed files. The 150,000 set was partitioned into 25,000 clusters.

Global clustering

Once the cluster centroids had been established, the remaining compounds were each assigned to the closest centroid.
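The assignment step can be sketched as a nearest-centroid lookup. The code below is an illustration under assumptions: the function name is hypothetical, and the `similarity` callback stands in for whatever 3D similarity score is used, demonstrated here with a toy one-dimensional similarity.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def assign_to_centroids(
    items: Sequence[T],
    centroids: Sequence[T],
    similarity: Callable[[T, T], float],
) -> List[int]:
    """Return, for each item, the index of its most similar centroid."""
    if not centroids:
        raise ValueError("at least one centroid is required")
    return [
        max(range(len(centroids)), key=lambda k: similarity(item, centroids[k]))
        for item in items
    ]


# Toy similarity on numbers, standing in for a 3D similarity score.
toy_similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
print(assign_to_centroids([1.0, 9.0, 4.9], centroids=[0.0, 10.0], similarity=toy_similarity))
```

Each remaining compound needs only one comparison per centroid, so the cost grows linearly with the number of compounds rather than quadratically as in the full pairwise matrix.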

Thanks to the way the 800,000 set was designed, given any compound in the set it is now possible to retrieve analogs of the core structure, including different substitution patterns, superstructures of the latter and finally, thanks to the 3D clustering,

Read More…

Fragments and conformations from the CCDC’s Cambridge Structural Database accessible through Cresset’s Spark

Cambridge, UK – 28th July 2015 – Cresset, innovative provider of software and services for small molecule discovery and design, announces the release of the new Spark CSD Fragment Database derived from the Cambridge Crystallographic Data Centre’s (CCDC) Cambridge Structural Database (CSD).

“Spark replaces fragments of molecules with biologically equivalent alternatives,” explains Dr Robert Scoffin at Cresset. “It is ideal for scaffold hopping, moving to clear IP, replacing R-groups and growing or linking fragments. This new database of fragments from the CSD means that Spark’s results contain chemical replacements that have experimentally validated chemistry and known conformations. This gives them a higher chance of a smooth synthetic route and better likelihood of being a valid bioactive conformation.”

“The CSD is the world’s most comprehensive database of expert-curated 3D small-molecule organic and metal-organic crystal structures, containing the results of over ¾ million X-ray and neutron analyses,” commented Colin Groom, Executive Director of the CCDC. “The new Spark CSD Fragment Database further extends the ways in which researchers can use this wealth of crystallographic data to address a wide range of scientific problems.”

Read More…

Activity Atlas for complex SAR interpretation – New in Forge 10.4 from Cresset

Cambridge, UK – 23rd July 2015 – Cresset, innovative provider of software and services for small molecule discovery and design, is pleased to announce the release of Forge V10.4, the latest version of the leading computational chemistry workbench for ligand-based design.

Are you struggling with SAR tables? Finding it hard to keep all the SAR in your mind? With Activity Atlas, a new component in Forge, you can easily summarize complex SAR as a 3D visualization. Activity Atlas enables you to:

  • understand complex selectivity data at a glance
  • condense a large table of SAR data into a single picture
  • analyze the SAR of your data series from different viewpoints.

“Activity Atlas 3D visualizations (Figure 1) tell you what active molecules have in common, which regions you have explored, and help you pinpoint critical regions of SAR,” explains Dr Giovanna Tedesco, Forge Product Manager. “This information is very valuable for new molecule design and we believe it will particularly benefit project teams where there is not enough SAR for a traditional 3D-QSAR approach.”

Figure 1: Activity Atlas visualization of the critical SAR regions for three receptors: Adenosine A1, Adenosine A2a and Adenosine A3. [Published activity data taken from J. Bajorath, J. Chem. Inf. Model. 51, 258-266, 2011] This information could direct the next molecule design iteration towards (for example) maintaining potency at the A1 receptors while building in selectivity towards A2 and A3. Color coding: Red: more positive increases activity; Blue: more negative increases activity; Green: steric bulk in this position is favorable; Magenta: steric bulk in this position is detrimental.

Evaluate Forge.

Read More…