Rapid interpretation of patent SAR using Forge

Biological data is now a regular feature of new patent applications, and it is readily available for download from BindingDB, which has data on over 2,500 patents encompassing more than 300,000 binding measurements. Generating meaningful insights from this data is perceived as less straightforward. In this post I will use Forge™ V10.6 to demonstrate that it is possible to get an overview of the SAR from a single patent entry with minimal human intervention and time.

Application to PIM-1

Selection and processing of 288 compounds from US9321756, ‘Azole compounds as PIM inhibitors’ (detailed in Appendix I) gave the Activity Atlas™ model shown in Figure 1. The total time to generate and interpret this model was around 30 minutes. It would be relatively straightforward to automate the process.

Figure 1: Activity Atlas model generated in this case study. From data download to model took 30 minutes. The ‘Activity Cliff Summary of Electrostatics’ and ‘Activity Cliff Summary of Shape’ views are shown. These detail regions of acute SAR: Red / Blue = positive / negative electrostatics preferred for greater activity; Green / Pink = activity favors / disfavors atoms in this region.

SAR interpretation

Firstly, the oxadiazole is clearly required, as demonstrated in Figure 2 by the region of negative electrostatics (blue) next to both nitrogen atoms, which represents the interaction of this group with the side chain of Lys67. Perhaps this is not surprising given the title of the patent application. The model also shows that the amino group next to the oxadiazole is constrained (area of pink surface).

Figure 2: Activity Atlas model close to the oxadiazole group. Red = positive electrostatics preferred; Blue = negative electrostatics preferred; Green = atoms in this region favored; Pink = atoms in this region disfavored.

On initial inspection there appears to be space in the protein to accommodate a substituent on the nitrogen. However, viewing the aligned ligands in the context of the protein and showing contacts in Forge (Figure 3) makes it clear that all N-substituted ligands clash with Asp186 and that the adjacent space is not accessible from this position in the ligand.

Figure 3: Clash of a ligand bearing a morpholino substituent with Asp186 (orange lines).

The model (Figure 4) shows that there is a clear preference for molecules that extend into the gap between the two arms of the ligand (green surface at the bottom of the model above). Whilst we would want to check the underlying data, the suggestion is that substitution on either R-group is tolerated. Indeed, the most active compound crosses this gap completely, which raises the possibility of using a cyclized ligand.

Figure 4: A high active from the patent displayed in CPK. The N-trifluoroethyl group touches the cyclopropyl substituent on the opposite side of the molecule.

Surrounding the green (favorable volume) region between the two arms is a large area of red surface. This suggests that positive electrostatics (edges of aromatics, H-bond donors, etc.) are preferred in this region.

This summary is reinforced by looking at the individual compounds that make up the data. Thankfully, this is easy to do with the Activity Miner™ module of Forge. Activity Miner’s top pairs table (Figure 5) shows many pairs of molecules where introduction of a positive charge in the region below the ligand (as shown in the pictures) generates a more active molecule. Generally the charged species is around 1 log unit more active.

Figure 5: The top pairs table in the Activity Miner module of Forge showing a specific pair of molecules and the electrostatic difference map between them. Red regions indicate where that ligand is more positive than the comparator; Blue where that ligand is more negative. In this case the ligand on the left is over a log unit more active and contains a positive charge in the region at the bottom of the picture.

Looking at the protein structure does not reveal a specific interaction or reason for this gain in potency. However, by using the protein field surface in Flare™, we can see that the protein is generating a negative potential in this region which would account for the gain in activity when introducing a positive charge.

Figure 6: The protein interaction potential contoured at 2 kcal/mol. Red = positive; Blue = negative. The potential indicates the nature of the atoms to use in a region: positive atoms fit well in negative regions, and so on.

Lastly, in the region of the pyrimidine group the model has a large area of blue. This indicates that there is a clear preference for molecules with nitrogen atoms in the ring at these points (e.g., pyrazine). This area points towards solvent and hence this is quite surprising. From the crystal structure alone it would be expected that introduction of heteroatoms would have little effect on activity. Examination of the data using Activity Miner confirms that, for example, pyrazine is more active than pyridine. In this case the protein fields do not reveal anything significant in the underlying potential of the protein and we are left to speculate at the reason for the SAR.

Figure 7: PDB 4TY1 showing the region around the pyrimidine group of the ligand. There are few interactions between the protein and the edge of the ligand in this region.

Speculating that protein movement was at the root of the observed SAR, I downloaded all the PIM-1 structures from the PDB into Flare, sequence aligned them and superposed them based on the sequence alignment. Looking at this region across the 150+ structures shows no clear case for protein flexibility, although a number of structures do have a water molecule in this region that would bridge the ligand to the side chain of Arg122.

Figure 8: Over 150 PIM-1 crystal structures superposed in Flare. The backbone is shown in tube, residues close to the depicted ligand of structure 4TY1 are shown in thin sticks. Only two structures have any variation in loop conformation in this region.

The reason for the observed SAR remains elusive and could be a function of protein-protein interaction, water mediated interaction or something else.


Rapid interpretation of BindingDB patent data can be achieved using Forge. In this case the SAR of 288 ligands was condensed to a single Activity Atlas model in less than 30 minutes. Interpretation of the model over the next 30 minutes generated clear SAR insights that could be employed on competing projects. Inspecting the protein electrostatics using Flare provided further insights into the observed SAR.

Try Forge on your project

Request a free evaluation of Forge to try this on your data or condense a patent into a simple summary of the published SAR.

See all licensing options for Forge.

Appendix I

Background computational details

The raw data was downloaded in tab-separated format from BindingDB and pre-processed in Excel. The raw data contains data for two biological targets, ‘PIM’ and ‘PIM-1’. Compounds with ‘PIM-1’ data were selected and checked for duplicate values. One compound was excluded because of a large variation in the reported IC50 value and four molecules were excluded due to missing activity values. All other duplicate IC50 values were averaged and converted to a pIC50 value, resulting in a dataset of 288 molecules in a csv file.
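
The pre-processing was done in Excel, but the same steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout `(compound_id, target_name, ic50_nm)`, the function names and the 10-fold spread threshold used to flag a "large variation" are all hypothetical and are not taken from BindingDB or from the original workflow.

```python
import math
from collections import defaultdict
from statistics import mean


def to_pic50(ic50_nm):
    """Convert an IC50 in nM to pIC50 (-log10 of the molar concentration)."""
    return 9.0 - math.log10(ic50_nm)


def prepare_dataset(rows, target="PIM-1", max_fold_spread=10.0):
    """Build {compound_id: pIC50} from (compound_id, target_name, ic50_nm) rows.

    Assumptions (hypothetical, for illustration only):
    - a missing activity is represented as None and is dropped;
    - a compound whose repeated IC50 values differ by more than
      max_fold_spread is excluded rather than averaged.
    """
    grouped = defaultdict(list)
    for compound_id, target_name, ic50_nm in rows:
        if target_name != target or ic50_nm is None:
            continue  # wrong target or missing activity value
        grouped[compound_id].append(ic50_nm)

    dataset = {}
    for compound_id, values in grouped.items():
        if max(values) / min(values) > max_fold_spread:
            continue  # replicate measurements disagree too much
        # Average the duplicate IC50 values, then convert to pIC50.
        dataset[compound_id] = to_pic50(mean(values))
    return dataset
```

Averaging is done on the IC50 values before conversion, as described above; averaging the pIC50 values instead would give a geometric rather than arithmetic mean of the IC50s.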

The original dataset included the ligands of PDB codes 4TY1 and 4WT6. The protein-ligand complexes were downloaded into Flare, sequence aligned and superposed. Looking at the binding site, either ligand would work well as a reference for initial alignment of the dataset. The ligand from 4WT6 was chosen for further experiments and both the ligand and the corresponding protein were transferred to Forge (Copy-Paste). The csv file was loaded into Forge (Training Set), the molecules were processed using Accurate but Slow conformation hunting and Substructure alignment, and an Activity Atlas model was built.

The Forge processing window showing the options used in this case study.

Using the Cresset Engine Broker, the calculation took 15 minutes to complete. Examining the results shows excellent alignment through the common substructure but some variation beyond that.

288 aligned ligands from US9321756 that were used to prepare the Activity Atlas model.


About Activity Atlas

Activity Atlas models are created by comparing all pairs of molecules in terms of positive and negative electrostatics plus the hydrophobics and shape properties and then combining these together, weighted by the change in activity for the pair. The result is a simple, qualitative picture of the critical points in the SAR landscape.
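
The idea of combining pairwise differences weighted by the change in activity can be illustrated with a toy calculation. The sketch below is not the Activity Atlas algorithm as implemented in Forge; it assumes, purely for illustration, that each aligned molecule has already been reduced to a list of field values sampled at the same grid points, and the function name is hypothetical.

```python
from itertools import combinations


def activity_weighted_difference_map(fields, activities):
    """Toy pairwise summary of field differences weighted by activity change.

    fields     -- one list of field values per molecule, all sampled at the
                  same grid points (assumes the molecules are already aligned)
    activities -- one activity value (e.g. pIC50) per molecule

    A positive value at a grid point means that a more positive field there
    tends to accompany higher activity; a negative value means the opposite.
    """
    n_points = len(fields[0])
    summary = [0.0] * n_points
    for i, j in combinations(range(len(fields)), 2):
        delta_activity = activities[i] - activities[j]
        for p in range(n_points):
            # The field difference and activity difference use the same
            # i-minus-j ordering, so the sign of the product is meaningful.
            summary[p] += delta_activity * (fields[i][p] - fields[j][p])
    return summary
```

In this toy version, pairs with a large activity difference dominate the sum, which is the sense in which the summary highlights the critical points in the SAR landscape.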

The resulting Activity Atlas model was automatically displayed. I always start with the ‘Activity Cliff Summary of Electrostatics’ and ‘Activity Cliff Summary of Shape’ views to understand the data. As this was a quick experiment and the alignments were noisier than in a fully curated experiment, the Activity Atlas model is also noisier than ideal. However, increasing the Confidence Level to 3.0 concentrates the display on the clear signals in the data.

The display options used for the Activity Atlas models shown in this study.

Model validation

Activity Atlas is a qualitative technique and hence difficult to validate except through manual inspection. However, Forge is capable of building quantitative models that can be used to validate the alignment of the molecules (we believe that consistent alignment is the single biggest factor in generating reliable 3D QSAR models). Using the Automatic regression model building methods of Forge with a 20% activity-stratified test set generated an SVM model with a q2 of 0.64 (LOO) and an r2 of 0.62 on the independent test set. Given the noisy nature of the input data, I believe this represents a good model and that the alignments are valid.
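
For readers unfamiliar with these terms, the sketch below shows one way an activity-stratified test set and an r2 value could be computed. It is an assumption-laden illustration, not Forge's implementation: the function names, the binning scheme (one compound held out from each group of consecutive activity-ranked compounds) and the random seed are all hypothetical.

```python
import random


def activity_stratified_split(activities, test_fraction=0.2, seed=0):
    """Split compound indices into train and test sets, stratified by activity.

    Compounds are ranked by activity and cut into consecutive bins of
    round(1 / test_fraction) compounds; one compound per bin is held out,
    so the test set spans the same activity range as the training set.
    A final partial bin, if any, also contributes one test compound.
    """
    rng = random.Random(seed)
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    bin_size = round(1 / test_fraction)
    train, test = [], []
    for start in range(0, len(order), bin_size):
        bin_indices = order[start:start + bin_size]
        held_out = rng.choice(bin_indices)
        for idx in bin_indices:
            (test if idx == held_out else train).append(idx)
    return train, test


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

The q2 (LOO) statistic uses the same formula as `r_squared`, except that each prediction comes from a model built with that compound left out of the training set.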
