Thirty years, and counting

Lessons and Successes in the Use of Molecular Fields, an invited chapter for the ‘In Silico Drug Discovery Tools’ section of Comprehensive Medicinal Chemistry III, has recently been published by Elsevier. The chapter reviews more than thirty years of research in the area of molecular interaction fields, starting from the ground-breaking work done by Cramer, Goodford and Vinter, until the most recent applications to virtual screening, metabolism prediction, protein similarity across phylogenetically unrelated families and library design.

We believe that the reason for the longevity of molecular fields as computational chemistry workhorses is their capacity to provide a feature-rich description of molecular properties which is independent of the chemical structure. Encoding interaction properties through molecular fields inherently enables scaffold hopping, 3D similarity searches within large databases, ligand and protein structural alignment, cross-chemotype SAR elucidation.

All of the above tasks are the bread and butter of medicinal and computational chemists. Therefore, it is not surprising that over the years the creativity of researchers has built on top of the molecular field technology a number of blockbuster tools for computer-aided drug design, such as CoMFA, FieldScreen (now known as Blaze) or MetaSite, to name but a few. Thanks to their ‘core’ nature, as trends evolved and new needs emerged, fields have brilliantly managed to ride the ever changing waves of drug discovery, and are still at the forefront of the in silico armoury.

We take pride of having been among the lead actors on the molecular field stage, and we trust that readers will enjoy embarking on a fascinating journey through decades of field-based science as much as we did as authors.

Boosting RDKit molecular simulations through OpenMM

I am a big fan of the RDKit. In case you have never heard about it, it is an open-source, BSD-licensed C++ toolkit which allows one to accomplish a wide range of cheminformatics tasks. You can build a molecule from SMILES, create 2D depictions, generate 3D conformations, do substructure searches, run chemical reactions, and much more. It comes with C++, Python, Java and C# bindings, so you can access its functionality from your favourite programing language. It features tight integration with the Jupyter notebook, so you can display your molecules in 2D and 3D interactively while you develop your Python workflow. In case you are not much of a programer, a lot of the RDKit functionality is exposed through RDKit KNIME nodes. And, last but not least, the RDKit comes with a PostgreSQL cartridge which enables dealing with molecules in PostgreSQL databases.

Now you know why I love the RDKit, and I hope I managed to convince you to give it a go, if you haven’t already. There are a number of tutorials to get yourself started, and an amazing community of users which can help you out when you are stuck.

Cresset software incorporates the RDKit, which is mostly used to parse SMARTS queries: in Forge, Torch and Spark you may apply SMARTS filters to your molecule tables. In Spark you may also request certain SMARTS patterns to be, or not to be, included in the results; furthermore, the Torsion Library which analyses the torsional angles of the fragments retrieved by a Spark search is based on a database of SMARTS strings.

We also use the RDKit in our internal research projects, in Cresset Discovery Services, and occasionally to integrate or customize the functionality already available in Cresset desktop applications, command-line tools, and KNIME nodes.

Besides being RDKit users, we are also RDKit contributors. In 2015 we contributed a resonance structure enumerator, while at the 2016 RDKit User Group Meeting, which was hosted at the Novartis Campus in Basel, we presented some preliminary work on boosting RDKit molecular simulations through OpenMM.

OpenMM is an open-source toolkit for high-performance molecular simulations running on CPUs and GPUs. Originally developed in the Pande Lab at Stanford, it is currently supported also by other groups and individuals. OpenMM natively implements AMBER, CHARMM and AMOEBA force fields, which are focused on biological macromolecules, and provides support for implementing custom force fields. The RDKit natively implements MMFF94 and UFF force-fields. MMFF94 is a general-purpose, accurate force-field, while UFF is geared towards small molecules, and trades accuracy for wide chemistry coverage and speed. We thought that it would be interesting to:

  • implement MMFF94 in OpenMM
  • build an OpenMM interface into the RDKit, and
  • compare the performance of the native RDKit implementation of MMFF94 (CPU-only, single-threaded) with the OpenMM implementation (CPU and GPU, multi-threaded).

Even though OpenMM features outstanding support for custom force fields (it has a lexical parser for energy equations and can even compute their analytical derivatives), MMFF94 has rather complicated combination and scaling rules for non-bonded parameters, which required some tweaking on the OpenMM library to be implemented efficiently. I managed to implement under CPU and GPU platforms the seven energy terms of MMFF94 using a combination of AMOEBA and custom forces:

Below (and on GitHub) you will find a Jupyter notebook showing a geometry optimization benchmark on a small protein, villin.

As you may appreciate going through the notebook, the increase in performance provided by this preliminary proof-of-concept implementation is impressive: OpenMM single and double precision are respectively 150 and 11 times faster than the RDKit implementation on a GeForce GTX 960 graphics card.

Our goal is now to code a native implementation of the MMFF94 and UFF force fields within OpenMM, and then provide the RDKit with an interface to the latter, in order to benefit from the speed-up. Possible applications include the automated minimization of protein-ligand complexes after docking, or the molecular dynamics simulation of small molecules in explicit solvent under periodic boundary conditions. The availability of the latter will be announced on the Cresset website and on the RDKit mailing list.

Here follows the Jupyter Notebook (see it on GitHub):

In [1]:
import sys
import math
import timeit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import py3Dmol
from simtk.openmm import openmm

This is the villin headpiece as downloaded from the PDB:

In [2]:
villin = open('/home/paolo/pdb/2F4K.pdb', 'r').read()
p = py3Dmol.view(width = 400,height = 400)
p.addModel(villin, 'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})
In [3]:
molOrig = Chem.MolFromPDBBlock(villin)
In [4]:
mol = Chem.AddHs(molOrig, addCoords = True)
In [5]:
pyMP = AllChem.MMFFGetMoleculeProperties(mol)
In [6]:

Let’s create 4 forcefields, the “traditional” one and those spiced up with OpenMM, respectively using single and double precision CUDA kernels, and the multi-threaded single-precision CPU implementation.

In [7]:
for i in range(3):
    mol.AddConformer(Chem.Conformer(mol.GetConformer(0)), assignId = True)
In [8]:
platformNames = ['RDKit', 'OpenMM_CUDA_s', 'OpenMM_CUDA_d', 'OpenMM_CPU']
pyFF = {}
pyFF[platformNames[0]] = AllChem.MMFFGetMoleculeForceField(mol, pyMP, confId = 0)
for i in range(1, 4):
    pyFF[platformNames[i]] = AllChem.MMFFGetMoleculeOpenMMForceField(mol, pyMP, confId = i)

Now we instruct our RDKit interface to OpenMM to use the appropriate hardware platform:

In [9]:
    'CUDA', {'Precision': 'single', 'DeviceName': 'GeForce GTX 960'})
In [10]:
    'CUDA', {'Precision': 'double', 'DeviceName': 'GeForce GTX 960'})
In [11]:

These are the energies of the protein before minimization computed with the 4 methods; differences are negligible, as they should ideally be:

In [12]:
for name in platformNames:
    sys.stdout.write('{0:20s}{1:8.4f} kcal/mol\n'.format(name, pyFF[name].CalcEnergy()))
RDKit               826.8740 kcal/mol
OpenMM_CUDA_s       826.8734 kcal/mol
OpenMM_CUDA_d       826.8727 kcal/mol
OpenMM_CPU          826.8728 kcal/mol

Now we will carry out a geometry optimization with all methods, and take some timings.

The OpenMM minimizations in single precision bails out of the OpenMM L-BFGS minimizer with a LBFGSERR_MAXIMUMLINESEARCH error (-998) before the RMS gradient criterion kicks in. This is probably due to insufficient precision for the minimizer to behave correctly during the line search. Nonetheless, the energy values are not dramatically different from those computed by OpenMM using the GPU in double precision mode.

In [13]:
t = []
for i, name in enumerate(platformNames):
    ff = pyFF[name]
    t.append(timeit.timeit('ff.Minimize(maxIts = 100000, forceTol = 0.01)',
                      'from __main__ import ff', number = 1))
    sys.stdout.write('{0:20s}{1:8.4f} s ({2:.1f}x)\n'.format(name, t[i], t[0] / t[i]))
RDKit                82.7275 s (1.0x)
OpenMM_CUDA_s         0.5488 s (150.7x)
OpenMM_CUDA_d         7.3300 s (11.3x)
OpenMM_CPU           25.0867 s (3.3x)

The timings are impressive: OpenMM single and double precision are respectively 150 and 11 times faster than the RDKit implementation on a hyperthreading quad-core 3.40GHz Intel Core i7-3770 CPU equipped with a GeForce GTX 960 graphics card.

Also the multi-threaded OpenMM CPU implementation (single precision) scales very well, as it runs >3 times faster than the single-threaded RDKit implementation on the 8 virtual cores (4 physical) of our Core i7.

Energy values at the end of the minimization are comparable; the slight differences between are most likely amenable to the different implementations of the L-BFGS minimizer between RDKit and OpenMM:

In [14]:
for name in platformNames:
    sys.stdout.write('{0:20s}{1:8.4f} kcal/mol\n'.format(name, pyFF[name].CalcEnergy()))
RDKit               -53.4757 kcal/mol
OpenMM_CUDA_s       -52.6213 kcal/mol
OpenMM_CUDA_d       -57.5980 kcal/mol
OpenMM_CPU          -52.6949 kcal/mol

If we look at the heavy-atom-RMSD matrix across the 4 minimizations, we see that the smallest deviation occurs, as might be expected, between the RDKit and the OpenMM double precision implementations. However, the RMSD for the single precision calculations is < 0.7 Å.

In [15]:
molNoH = Chem.RemoveHs(mol)
In [16]:
confsNoH = [molNoH.GetConformer(i) for i in range(4)]
In [17]:
for y in range(len(confsNoH)):
    if (y == 0):
        for name in [''] + platformNames:
    for x in range(len(confsNoH)):
        if (x == 0):
        if (x < y):
                AllChem.AlignMol(molNoH, molNoH, prbCid = x, refCid = y)))
                           RDKit   OpenMM_CUDA_s   OpenMM_CUDA_d      OpenMM_CPU
           RDKit          0.0000          0.6815          0.2669          0.6701
   OpenMM_CUDA_s                          0.0000          0.5457          0.0463
   OpenMM_CUDA_d                                          0.0000          0.5315
      OpenMM_CPU                                                          0.0000

This is the visual difference between RDKit and OpenMM single precision (largest deviation)

In [18]:
p = py3Dmol.view(width = 400,height = 400)
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 0), 'pdb')
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 1), 'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})

And this is how RDKit and OpenMM double precision compare (smallest deviation)

In [19]:
p = py3Dmol.view(width = 400,height = 400)
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 0), 'pdb')
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 2), 'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})

Examining the diversity of large collections of building blocks in 3D as presented at Sheffield Chemoinformatics Conference 2016


2D fingerprint-based Tanimoto distances are widely used for clustering due the overall good balance between speed and effectiveness. However, there are significant limitations in the ability of a 2D fingerprint-based method to capture the biological similarity between molecules, especially when conformationally flexible structures are involved. Structures which appear to largely differ in functional group decoration may give rise to quite similar
steric/electrostatic properties, which are what actually determine their recognition by biological macromolecules.

In BioBlocks’ Comprehensive Fragment Library (CFL) program, we were confronted with clustering a very large collection of scaffolds generated from first principles. Due to the largely unprecedented structures in the set and our design aim to populate the 3D ‘world’, using the best 3D metrics was critical. The structural diversity of the starting collection of about 800K heterocyclic scaffolds with variable functional group decoration was not adequately captured by 2D ECFP4 fingerprint Tanimoto distances, as shown by the rather flat distribution of 2D similarity values across the set, and by their lack of correlation with the 3D similarity metrics.

The initial step of any clustering procedure is the computation of an upper triangular matrix holding similarity values between all pairs of compounds. This step becomes computationally demanding when using 3D methods, since an optimum alignment between the molecules needs be found taking into account multiple conformers.

The presentation covers the methodological and technical solutions adopted to enable 3D clustering of such a large set of compounds. Selected examples will be presented to compare the quality and the informative content of 3D vs 2D clusters.


See presentation ‘Examining the diversity of large collections of building blocks in 3D‘ as presented at Sheffield Chemoinformatics Conference 2016.

Is this compound worth making?


In deciding if a compound is worth making medicinal chemists need to consider a number of factors including:

  • does it fit with the known structure activity relationship (SAR)?
  • does it explore new interactions or regions of space?
  • does it have good physico-chemical properties?

Summarizing the SAR content of large compound series in a single informative picture is challenging. Traditional 3D-QSAR techniques have been used for this purpose, but are known to perform poorly where the structure-activity landscape is not smooth. 3D Activity cliff analysis has the potential to explore such rugged SAR landscapes: pairs of compounds with a high similarity and a large difference in activity carry important information relating to the factors required for activity. However, looking at pairs of compounds in isolation makes it hard to pinpoint the changes which consistently lead to increase potency across a series.

Similarly, summarizing the regions and interactions that have been explored in a single picture is not trivial, especially if the electrostatic character of compounds is to be considered (e.g., replacement of electron rich with electron poor rings).

Here we present Activity Atlas, a new technique to analyse a series of compounds and derive a global view of the structure-activity relationship data.



The Regions explored summary gives a comprehensive 3D picture of the regions explored in electrostatic and shape space. A novelty score is assigned to each compound, enabling to predict whether newly designed candidates are likely to contribute additional SAR knowledge, and are thus worth making.

Average of actives summarizes electrostatic and shape properties which consistently lead to potent ligands across the series.

The Activity cliff summary gives a global view at the activity cliff landscape highlighting the regions where either more positive/negative electrostatic potential, or larger steric/hydrophobic potentials increase activity.

A traditional 3D-QSAR model2 was built on the same data set (q2 = 0.7). While 3D-QSAR seems better at extracting information where SAR is continuous, Activity Atlas gives more definition in regions where SAR requirements are critical.

Activity Atlas gives more definition in regions where SAR requirements are critical

Accounting for uncertainties in the alignment

Activity Atlas takes into account multiple alignments, weighted by their probability to be correct:

Accounting for uncertainties in the alignment1
Based on the distance from the top result

Accounting for uncertainties in the alignment2
Based on the absolute similarity value (CCDC-AZ dataset)

Analyzing multiple alignments per molecule enables the technique to account for uncertainties in the orientation of flexible side chains not represented in the reference(s).

Application to selectivity

Application to selectivity

Application to a set of adenosine receptor antagonists with activities against A1, A2a and A3 receptors3 demonstrates the utility of the technique in locating critical regions for selectivity. Examination of the steric and electrostatic maps for the three subtypes clearly shows which regions should be targeted in order to enhance subtype selectivity.

In the example above, the right hand side of the molecules can be used to discriminate between A3 and the other two subtypes, while A1 and A2a can be separated by increasing steric bulk and positive charge around the top of the molecules.


The Activity Atlas technique is a powerful way of summarizing SAR data in 3D. By combining information across multiple pairs, it enables a global view of the critical points in the activity landscape. The Average of actives summary captures in one picture the 3D requirements for potency, while the Regions explored summary enables prioritizing compounds which add crucial SAR information over trivial analogues.


1. J. Chem. Inf. Model. 2006, 46, 665-676
3. J. Chem. Inf. Model. 2011, 51, 258-266

Large-scale compound clustering in 3D

At the Cresset European User Group Meeting 2015 I presented recent work carried out as a collaboration with BioBlocks, a US based company specializing in medicinal chemistry and drug discovery. BioBlocks had created a large fragment library. The goal was to identify ~2,000 representative compounds to synthesize from a virtual set of 800,000.

The challenges were from the size of the set and the nature of the compound structures, which all featured a methyl handle plus multiple diastereomers and conformers.

The plan was to use clustering based on the shapes and fields of the compounds in order to gain the greatest representative coverage from the chosen 2,000 compounds.

A pilot study to choose a 2D or 3D approach

Cresset and BioBlocks determined whether it was indeed necessary to use a 3D approach, or whether a 2D approach would be equally effective. A 2D approach would ignore the stereochemistry and would dispense with the need for a conformation search. Each pairwise comparison would take only 45 µs. By contrast, a 3D similarity approach would include a conformational hunt and would include the stereochemistry. As a result, each pairwise comparison would take 40 ms.

We carried out a pilot study to determine whether it was computationally feasible and valuable to carry out a 3D assessment.

The process proposed was to take the 2D structures as supplied by BioBlocks, protonating them based on our set of rules, generating racemic diastereomers for each structure, then building conformations and computing field points and finally clustering the compounds based on their 3D similarity.

Diastereomer enumeration should in principle be straightforward: 3 chiral centres lead to 8 diastereomers. However, when attempting to build 3D coordinates, not all diastereomers can give rise to sensible 3D structures. There were also other pitfalls with the diastereomer enumeration, connected with non-invertible nitrogens, meso forms and pseudoasymmetric centres.

Instead, we used a more complex workflow. We took into account the special cases not handled by the cheminformatics toolkit and discarded the diastereomers with bad geometries.

Then we computed pairwise similarities, comparing each diastereomer with all others, keeping into account multiple conformers. First we aligned pairs by their methyl handle, then rotated them around the methyl handle in 30° increments. Once we had done this across the 12 positions, we found the global minima by sampling the two lowest local minima in 5° increments.

The pilot set was assembled by picking roughly 20,000 compounds from the 800,000 pool; pairwise similarities were computed and collected in an upper triangular matrix. This calculation takes 96 CPU days and, once distributed on 100 cores on our in-house cluster, was completed in 2 and a half days. Scaling was difficult due to I/O contention, since all of the nodes were concurrently accessing the same NFS share. The same matrix was computed in 3 h using ECFP4 fingerprints.

There is very little correlation between the 2D and 3D similarity values.

This 20,000-compound pilot was clustered into 4,000 clusters. In order to assess the quality of clusters we used a silhouette metric. This assesses how similar the clusters are internally and how dissimilar to other clusters.

Clustering examples for 2D and 3D structures were presented. There is a nice alignment and a good separation of properties for the 3D. For the 2D, the properties are scattered and the alignments are unclear. Even for a cluster with a zero silhouette score, the 3D cluster was preferred.

It was quite apparent that the 3D clustering is much more successful at picking 3D-similar compounds than the 2D algorithm.

Clustering a 150,000 subset

Clustering on the full set of 800,000 was not computationally feasible. It would have required 380 CPU years. Instead, a representative pick of the full set (~150,000 compounds) was carried out in order to cover property space. Clustering was done on this set and the remaining 650,000 compounds were then assigned to clusters thus identified.

The methods used for dividing up the job into discrete nodes were presented. We used compressed files for the transfers in order to reduce the time required for reading and writing the data.

Jobs were distributed on the Amazon Elastic Compute Cloud using the StarCluster toolkit, which provides a nice interface to the underlying EC2 API, to manage the cluster and submit jobs through Sun Grid Engine.

An on-demand instance was used as the master node, while all worker nodes were spot instances. The latter are amenable to abrupt shutdown, but their 5 to 8-fold lower cost balanced the slightly higher management overhead. We shuffled around both the cluster and the storage between different available zones so as to get the best price.

The similarity matrix calculation was accomplished in 4 days. This time, scaling was excellent thanks to the high bandwidth and the use of compressed files. The 150,000 set was partitioned into 25,000 clusters.

Global clustering

Once the cluster centroids had been established, the remaining compounds were each assigned to the closest centroid.

Thanks to the way the 800,000 set was designed, given any compound in the set it is now possible to retrieve analogs of the core structure, including different substitution patterns, superstructures of the latter and finally, thanks to the 3D clustering, compounds having similar 3D structures. This will allow BioBlocks to estimate the coverage of the 800,000 original compounds from any picked subset.


  • 3D clustering outperforms 2D with reference to intra-cluster distribution of field/shape properties
  • The reduced clustering method is a viable way to cluster a large dataset
  • Scalable performance on cloud computing resources allows clustering 6-digit datasets in reasonable time for less than $10,000 CPU costs.