Presentations from Symposia on Streamlining Drug Discovery 2018

Cresset, and symposia organizing partners Elixir and Optibrium, thank speakers and delegates for contributing to the successful series of Symposia on Streamlining Drug Discovery, which took place in the US last month.

Presentations we have permission to publish are:

RDKit User Group Meeting review

In September I attended the RDKit User Group Meeting, which this year was (very conveniently for Cresset) held in Cambridge, UK, kindly hosted by Andreas Bender at the Department of Chemistry.

As most people will already know, the RDKit is an open-source cheminformatics toolkit developed by Greg Landrum, with regular and ongoing contributions from the community through its GitHub page.

As an RDKit enthusiast, the RDKit UGM is one of my favorite scientific events in the year. It is characterized by a relaxed and informal atmosphere: people attend to show what they achieved through the RDKit, or what they contributed to it, and to learn from others.

We have been building the RDKit into Cresset tools for a few years now, and have also contributed code to the project.

This year I gave a lightning talk about the integration of the RDKit in Flare, our structure-based design tool, and in particular on its usage from Python through the built-in Jupyter Notebook interface. Last month I wrote the blog post Integrating Jupyter Notebook into Flare on this topic, which you may find of interest. Actually, you’d better not get me started on the Flare Python Notebook or I might go on ranting for a long time, as I am terribly excited by the possibilities it opens up, and by how easy it makes it to implement one’s ideas and workflows!

But for now I’ll make an effort to stick to the topic of this post, which is a review of highlights of mine from the last RDKit UGM. As usual, the list is far from exhaustive – there were far more interesting talks than can fit in a blog post!

What’s new in the RDKit?

Greg Landrum (KNIME) started the meeting with the usual overview of the new features introduced since the last UGM. The C++ codebase has been largely updated to C++14, with a 20-30% increment in performance, which is quite impressive. The ETKDG conformer generator developed in 2015 by Greg and Sereina Riniker (ETH Zurich) has now become the default, and has been proved to be on par with commercial alternatives by two independent literature reviews [1, 2]. Further new features of the 2018.09 release include a new JSON-like chemical interchange format, CoordGen-based 2D depictions, and the possibility to instantiate RDKit molecules from SVG depictions endowed with chemical metadata. Finally, a nifty tool written by Nadine Schneider (Novartis Institutes for BioMedical Research) for CheTo to visualize fingerprint bits has been incorporated in the RDKit (Figure 1).


Figure 1. Fingerprint bit visualization.

GSoC RDKit-MolVS integration

Susan Leung (PhD student, University of Oxford) presented the 3-month Google Summer of Code project that she has been working on, which consisted in porting Matt Swain’s (Vernalis) MolVS, which had originally been written in Python, to the main C++ codebase, while expanding its current capabilities. MolVS is a great tool for molecule validation and standardization, i.e., all those automated operations to carry out when importing a set of chemical structures into a cheminformatics toolkit to get them in a consistent and manageable state (Figure 2).

Susan did a great job, even if the tautomer enumeration/canonicalization still has to be completed. This work is going to be very useful to the community and will ensure wider adoption of consistent molecular standardization criteria.


Figure 2. Some of the operations that MolVS can carry out on a chemical dataset.

Some (hopefully) useful open source programs built on the RDKit

Pat Walters (Relay Therapeutics) presented three programs built on top of the RDKit and available through Pat’s GitHub page.

Identify candidates to prioritize for synthesis

The first program addresses the case, quite common in pharmaceutical companies, where hundreds of compounds have been synthesized as part of a project, but still these compounds do not cover all possibilities: there are certainly holes and potentially interesting molecules that were missed. In such cases, a Free-Wilson analysis allows to evaluate the contributions of different R-groups and identify candidates to prioritize for synthesis. Molecules are decomposed into R-groups, a matrix which encodes presence or absence of each R group in each molecule is generated, then R-groups vector are regressed against pIC50 values. Linear regression does not usually work well for this purpose, as many characteristics are collinear, so Ridge Regression is a better choice to avoid hitting the collinearity problem.

Filter chemical libraries

The second program is aimed at filtering chemical libraries via the functional group filters from the ChEMBL database, and some property filters from the RDKit, which act as structural alerts to identify potentially problematic molecules at an early stage.

Predict aqueous solubility from molecular structure

The last program is an implementation of Delaney’s ESOL method to predict aqueous solubility from molecular structure. When discussing the results, Pat showed a chart comparing the r2 obtained by his ESOL implementation to those obtained through RF, DNN and Graph Convolution methods.

While apparently ESOL gets a higher r2, the difference is not statistically significant, because of the confidence intervals associated to the number of data points (Figure 3). This is a point that Pat has already stressed in his blog post Predicting Aqueous Solubility – It’s Harder Than It Looks, and certainly worth keeping in mind.

Figure 3. The light blue area shows the 95% confidence interval for r2 depending on the number of data points.

RDKit in the modern biotech

Ben Tehan and Rob Smith (Heptares) presented an overview of how the RDKit is used at Heptares. In an effort to harmonize and simplify their cheminformatics infrastructure, Heptares adopted the RDKit as their standard platform in 2013. I was pleased to learn that in their hands the performance of my Open3DALIGN alignment algorithm (which indeed was my first contribution to the RDKit, back in 2013) coupled with shape-based scoring was not too far from ROCS (EF1% ~10 using CrippenO3A descriptors compared to ~15 enrichment with ROCS). So I gave myself a pat on the back and carried on.

Deceptively simple: How some cheminformatics problems can be more complicated than they appear

For those who have never been to an RDKit UGM, I’ll do a small preamble. Traditionally, Roger Sayle (NextMove) gives a talk where he picks some RDKit algorithm, pinpoints how poorly coded it is in its current implementation, and how it could be written much better. This is usually quite enjoyable (Greg might have a different opinion), particularly because the talk is always accompanied by a GitHub pull request by Roger with the improved code, to the benefit of the community.

This year the focus was a bit different: rather than tearing some RDKit bit into pieces, Roger decided to show the potential performance pitfalls behind apparently simple common (chem)informatics tasks, such as counting text lines in a file or computing molecular weight, to finish by spotting out a claim in a scientific paper based on a wrong percentage calculation. While very interesting as usual, I thought that the arguments in favor of low-level programming to speed up certain tasks were not entirely convincing, as they were not supported by timings showing that the extent of the gain actually balances the losses in terms of effort and portability.


Once again, this was a thoroughly enjoyable meeting. If you’re interested in finding out more, see the materials from the RDKit UGM 2018.

Review of the 11th International Conference on Chemical Structures

We recently attended ICCS in Noordwijkerhout, where we presented two posters:

This was my second time at this conference, the first as a Cresset scientist. I remember very well my previous attendance in 2011, as it coincided with a turning point in my career: in fact, by that time I had realized that my main research interest had become the development and implementation of new methodologies in computational chemistry, and I started concentrating my efforts in that direction, ultimately joining Cresset a few years later.

Based on my two participations, I do rate ICCS very high in my personal ranking of scientific events, on par of the German and Sheffield conferences of cheminformatics: excellent scientific program, relaxed and informal atmosphere, extended poster sessions to merge and exchange ideas with colleagues, plus a sun-blessed boat trip on the Ijsselmeer.

In the following I will try to capture a few highlights of the talks that I found most inspiring, even though the list could indeed be much longer.

How do you build and validate 1500 models and what can you learn from them? An automated and reproducible system for building predictive models for bioassay data

Greg Landrum (KNIME, T5 Informatics) presented a KNIME-based workflow for automated building of predictive QSAR models (Figure 1).

Figure 1. KNIME-based workflow for automated building of predictive QSAR models, presented by Greg Landrum, KNIME.

ChEMBL 23 was used as the data source for model training, after ‘seeding’ it with compounds roughly similar to the active compounds (i.e., Morgan2 fingerprint Tanimoto between 0.35 and 0.60) which are assumed to be inactive. In fact, the ratio of actives to inactives in scientific publications (from which ChEMBL data are extracted) tends to be unnaturally high, because researchers prefer to focus on active compounds in their reports. Five different kinds of chemical fingerprints were generated for each compound, and ten different activity-stratified training/test set splits were carried out for each activity type. Four different models were built using various machine learning (ML) methods, namely Random Forest, Naïve Bayes, Logistic Regression and Gradient Boosting; the best model was selected as the one giving the highest enrichment factor at 5%.

Overall, over 310K models were built and their performance evaluated in a fully automated fashion; calculations were distributed on 65-70 load-balanced AWS nodes and took only a couple of days. While results seemed initially deceptively good, more in-depth critical analysis revealed that most models were overfit on the training data and not very generalizable. This conclusion was very interesting in itself, adding up to the usefulness of the comparative analysis of the predictive performance of the various fingerprint/ML method combinations.

Machine learning of partial charges from QM calculations and the application in fixed-charge force fields and cheminformatics

I enjoyed very much the talk given by Sereina Riniker (ETH) on deriving atom-centred partial charges for molecular mechanics calculations from ML models. Such models were trained on DFT electron densities computed with the COSMO-RS implicit solvent model (or COSMO for compounds containing elements not supported by COSMO-RS). Sereina had already anticipated that she was working on this subject during a lighting talk given at the 2017 RDKit UGM, and I had been looking forward to seeing the final results since then. The Random Forest model was trained on a set of 130K diverse lead-like compounds from ZINC/ChEMBL, picked as to include as many substructures as possible, with the goal to cover most of the known chemical space of drug-like molecules. In addition to being a lot faster than DFT, the methods yields 3D conformation-independent partial charges, which is a highly desirable feature (Figure 2).

Figure 2. Procedure to derive 3D conformation-independent partial charges from ML models trained on QM data, presented by Sereina Riniker, ETH.

Prediction accuracy is >90%, and can be further improved by adding compounds to the training set in case some feature turns out to be poorly predicted due to being under-represented.

Artificial intelligence for predicting molecular electrostatic potentials (ESPs): A step towards developing ESP-guided knowledge-based scoring functions

On a similar topic, Prakash Chandra Rathi (Astex) presented a method based on graph convolutional deep neural networks to predict the molecular electrostatic potential (ESP) around atoms in a molecule, with the goal of optimizing ligand-protein electrostatic complementarity, thus removing the need for expensive QM calculations. The neural network was trained on 48K diverse molecules for which ESP surfaces were computed using DFT calculations. The optimized model delivered excellent predictive performance on a validation set of 12K molecules (r2 0.95). In addition to visualization of field extrema, as in Cresset’s field point technology, ESPs computed in this fashion hold promise for developing knowledge-based scoring functions to estimate the propensity of a certain atom in a ligand to interact with another atom in a protein.

Multivariate regression with left-censored data – efficient use of incompletely measured bioactivity data for predictive modelling

Knut Baumann (TU Braunschweig) described a method developed in collaboration with Bayer to deal with a very common problem in QSAR modeling, i.e., the fact that for weakly active compounds an exact pIC50 or pKi cannot usually be assessed, as it lies below the assay detection threshold. This kind of data is called left-censored, and poses a serious problem as regression algorithms able to handle hundredths or thousands of left-censored predictors are not available. Treating such data ‘as is’, i.e., interpreting the ‘lesser than’ symbol as if it were an ‘equals’ symbol or omitting this data altogether leads to underestimating the slope and/or overestimating the intercept. To address this issue, Baumann and co-workers adapted the Buckley-James algorithm to principal component regression (PCR) and partial least-squares regression (PLS). In a nutshell, a self-convergent procedure is adopted: the model is initially fitted against all data, then corrections to censored data are estimated, and after applying them the model is re-fitted and the whole procedure is iteratively repeated until convergence is reached. The method was validated using PLS and PCR on both simulated data and on a real dataset from industry, yielding a significant performance improvement compared to the naïve cases where left-censored data are handled as if they were uncensored and to the case where all left-censored data are simply neglected.

How significant are unusual intermolecular interactions?

Bernd Kuhn (Roche) presented a thorough analysis of unusual intermolecular interactions (i.e., halogen bonding, dipole-dipole interactions of aromatic F, aromatic sulfur interactions, S σ-hole to carbonyl, etc.) as observed in crystallographic complexes from the Protein Data Bank and the Cambridge Structural Database. The goal of this study was assessing how important such interactions, on which a wide literature exists, actually are in real life. The analysis was done using the line-of-sight (LoS) paradigm (an interaction between two atoms can only take place if no other atoms are in the way), coupled with exposed surface areas to derive interaction statistics from crystallographic databases. 16 protein atom types were defined describing element, hybridization, polarity and charge of each atom, and their propensity to interact with other atom types was determined: for example, chlorine likes being an environment with polarized CH, C and NH π-systems, and non-polar groups, whereas cyano groups prefer hydrogen bond donors, positive charge, NH π-systems and waters. Additionally, the geometric preferences of each kind of intermolecular contact were determined.

Presentations from the Cresset User Group Meeting 2018

Thank you to all attendees who contributed to the success of the Cresset User Group Meeting 2018. We are grateful to the speakers and exhibitors for permission to share the presentations below.

We hope to see you at the Cresset User Group Meeting 2019 on 19-20 June in Cambridge, UK.

Cresset User Group Meeting 2018 invited speakers and meeting chairman, left to right: Rob Scoffin, Cresset; Jun Xu, Sun Yat-sen University; Stevan W Djuric, AbbVie, Antonia Mey, University of Edinburgh; Richard Lewis, Novartis; Graeme Stevenson, Cancer Therapeutics CTX; Kazuyoshi Ikeda, Keio University; Ashraf Zarkan, University of Cambridge.


Cresset User Group Meeting 2018 scientific program delegates.


3 of the 12 workshops at the Cresset User Group Meeting 2018.

“One of the best meetings I’ve ever been to. Good science. Collaborative and engaging discussions in a wonderful atmosphere.”

Pharmacophore constraints – when should they be used?

Cresset’s alignment and scoring technology is based around the idea of electrostatic matching: the ‘best’ alignment of a pair of conformations is generally the one that provides the most similar molecular interaction potentials. In most cases this is true. However, we recognize that often some portions of the electrostatic potential, for example those involved in a strong hydrogen bond, can be more important than others. This isn’t information intrinsic to the molecules themselves, but to the context in which they are being used. As a result, we have for many years offered the possibility of adding ‘field constraints’, which force the alignment algorithm to prioritize matching the electrostatic potential in particular places around a ligand.

It’s worth noting that the field constraint algorithm works by applying a score penalty to alignments which violate the constraint, not by forcing the constraint to be matched (Figure 1). In general, soft constraints delivered via score penalties are preferred to hard constraints where we simply disallow alignments that violate the constraint. There are two reasons behind this. Firstly, using a score penalty allows the strength of the constraint to be adjusted, so that it can be anything from a mild hint to a strict requirement. Second, in most cases we feel that it is better to get a best-effort alignment that violates the constraint than nothing at all, especially when using the algorithms in lead optimization and trying to align a set of known actives.

Figure 1: Field constraint requires the field at the position of a particular field point to have a certain minimum value.

While field constraints work extremely well in the majority of cases, there are a few exceptions. In some cases there isn’t a field point at the place you want to add a constraint (although this problem is alleviated by the new field editor in Forge and Spark). More commonly, the constraint that you want to add is either stronger or more specific than can be expressed through fields: sometimes generic positive potential isn’t enough and you want to explicitly require a hydrogen-bond donor, for example, and sometimes you want to ensure that the metal-chelating group in a set of ligands is aligned, and ‘good metal chelator’ is similarly difficult to pin down in terms of pure electrostatics. In addition, the donor-acceptor motif that is often required in the hinge binding region of kinases can be difficult to describe through fields alone, as the donor and acceptor electrostatics cancel out to some extent making this region contribute less to the overall alignment score than it should.

For these reasons we introduced the option of pharmacophoric constraints to our core algorithms. These operate in a similar fashion to field constraints, in that they are constraints not requirements. Traditional pharmacophore algorithms treat pharmacophores as binary switches: you either have a H-bond donor in a region of space, and hence a full match, or you don’t and therefore don’t have a match. Our pharmacophore constraints instead operate on a sliding scale. You specify how strong the constraint is, and hence how much alignment score should be lost if the constraint is not satisfied. We spent a while looking at what form the penalty function should take, and the best form overall was a simple linear penalty with a cap at 2Å distance (Figure 2).

Figure 2: Pharmacophore constraints require an atom of the specified type to be within 2Å of the constrained atom, or a penalty is applied to the score.

We’ve talked elsewhere about the validation of our pharmacophore constraints on virtual screening performance within Blaze, where for some targets they can make a significant difference. However, the new constraints are useful for more than just virtual screening. What effect does adding constraints have on the probability of a successful ligand alignment?

We investigated this by using the standard AZ/CCDC dataset from Giangreco et al. (J. Chem. Inf. Model. 2013, 53, 852−866), which contains a large number of targets, and for each target a collection of aligned co-crystallized ligands. The data set can be used to validate ligand alignment algorithms, by measuring what proportion of the ligands within a target class can be successfully aligned. However, running a validation over the entire data set would be challenging, as it would pose the question for each target as to what pharmacophore constraints should be applied. We decided to address this by only testing (in the first instance) 4 kinase targets, as the donor-acceptor-donor motif for kinases is well understood and fairly uncontroversial. For each ligand in these 4 target sets, we manually applied between 1 and 3 pharmacophore constraints to the hinge binding region based on how many H-bonds were actually made by that ligand. For the purposes of this experiment we only included ‘true’ H-bonds from hydrogens on O or N, not the weaker C-H form. Note that all of the alignments were performed in ligand space only, with no information from the proteins applied.

The distribution of results is shown in Figure 3. We used a cutoff of 2Å to distinguish between correct and incorrect alignments. As can be seen, a significant number (around 13%) of alignments that were incorrect when unconstrained were rescued by adding the constraints – that’s a significant improvement. However, at the same time, around 3% of the alignments that we had been getting correct were made worse by adding the constraints, and it’s also apparent that there are a number of incorrect alignments which are left unchanged by the imposition of constraints.  Overall the pharmacophores are a net win, especially for these kinase data sets, but it’s not a magic bullet.

Figure 3: Effect of adding pharmacophore constraints to 4 kinase alignment data sets. The alignments in the ‘Win’ section are improved by the addition of constraints, while those in the ‘Loss’ section are made incorrect.

Let’s look at a couple of examples. Figure 4 shows a case where adding hinge binder constraints converts an incorrect alignment to a correct one. The unconstrained result on the left matches the dimethoxyphenyl ring on the left and the phenol on the right very well, but at the expense of the quinazoline nitrogen which makes the hinge interaction. Adding a single acceptor constraint to that nitrogen switches the best solution to one where the indazole nitrogen satisfies the constraint, which happens to be the experimentally-observed alignment.

Figure 4: CDK2, aligning EZV to DTQ.

Similarly, in Figure 5, the best alignment in terms of raw shape and electrostatics of MTW and LZ8 is shown on the left. The 2-aminopyrimidine in MTW which makes the hinge interactions is partially matched by LZ8’s pyrazole and carbonyl, but the alignment in that part of the molecule isn’t that good. However, the alignment matches the phenyl ring in both molecules beautifully. Applying a donor/acceptor pharmacophore constraint pair forces the alignment algorithm to try and match the 2-aminopyrimidine much more closely, which leads to the correct alignment on the right. However, note that the phenyl rings are much more poorly aligned.

Figure 5: CDK2, aligning LZ8 to MTW.

Finally, Figure 6 shows a case where adding field constraints converts a correct alignment to an incorrect one. When aligning to DTQ as the reference, as in Figure 4, the field/shape alignment gets the correct alignment for C94. However, C94 does not make the hydrogen bond to the hinge that we have constrained in the reference. The constraint coerces the alignment to be flipped so that one of the sulfonamide oxygens matches the H-bond acceptor requirement, thus forcing the wrong answer to be obtained.

Figure 6: CDK2, aligning C94 to DTQ.

Pharmacophore constraints can, as you can see, be very useful. If you know that particular interactions are required by the protein, you can provide our tools with that information to help them find the correct alignment. However, it must be kept in mind that by adding these constraints you are adding bias to the experiment: you are nudging it to provide the answers that you expect. Sometimes, the unexpected answers turn out to be correct! We recommend, therefore, that pharmacophore constraints are used sparingly. I personally will always run an unconstrained alignment experiment as a control, just to see if Forge will find an alignment that I hadn’t expected but which might illuminate a previously-puzzling piece of SAR.

So, when should you use a field constraint and when a pharmacophore constraint? The answer depends both on your requirements and on the chemistry inside the active site. If there’s a pharmacophoric interaction that you are certain must be preserved, then using a pharmacophore constraint is probably justified. However, if the interaction is not always present in the same form, a field constraint may be more appropriate. Figure 7 shows a BTK example. The ligand on the left makes two hydrogen bonds to the backbone from the indazole. We might be tempted to add pharmacophore constraints to both the donor and the acceptor, except that there are known highly-active ligands which make a C-H hydrogen bond instead (see the ligand on the right). As a result, it is probably more appropriate, if constraints are required, to add a pharmacophore constraint to the acceptor nitrogen, but use a field constraint to indicate a donor preference in the NH/CH location.

Figure 7: BTK ligands – which constraint type should I use?

Try yourself

Pharmacophore constraints are now available in all Cresset ligand-based computational chemistry applications:

  • Forge: Powerful ligand-focused suite for understanding SAR and design
  • Torch: Molecular design and SAR for medicinal and synthetic chemists
  • Spark: Discover new directions for your project using bioisosteres
  • Blaze: Effective virtual screening optimized to return diverse new structures

Request an evaluation and try them yourself.

Review of the Collaborative Computational Project for Biomolecular Simulation training week

CCPBioSim (Collaborative Computational Project for Biomolecular Simulation) is an inclusive and wide ranging project, bringing together chemists, physicists and chemical engineers as well as researchers from all branches of ‘molecule-oriented’ biochemistry and biology.

Earlier this month I attended the CCPBioSim training week, hosted by the University of Bristol. The workshops were aimed at training biomolecular researchers in current methodologies and tools in biomolecular simulation. Training material was mainly presented in the form of interactive Python Jupyter notebooks, neatly combining illustrated instructions, code examples and exercises, and data visualization (Figure 1).

Figure 1: Writing python code and visualizing molecules in an interactive Jupyter notebook.

The workshops covered:

  • Introduction to Python for biomodelers, by Dr Christopher Woods (University of Bristol) who taught the basic skills and knowledge to write python code in the context of biomolecular simulations
  • Good practice and pitfalls in setting up, running, and analysing molecular dynamics simulations, by Professor Charles Laughton (University of Nottingham). This session featured a practical example of selecting suitable NMR structures using the PROCHECK webserver and an introduction to MD trajectory analysis methods using the python modules MDtraj, matlabplotlib, and pyPcazip
  • Introduction to setting up alchemical free energy perturbation calculations with FEsetup1 and Sire SOMD, by Dr Antonia Mey (University of Edinburgh). Hydration free energies of ethane and methanol and relative ligand binding free energies to lysoszyme were set up and calculated
  • Overview of different methods and best practices for analysing alchemical free energy calculations and computing protein-ligand binding free energies in an interactive python environment using ‘analyse_freenrg’ function of Sire
  • ISAMBARD2, an open-source Python package for structural analysis and rational design of biomolecules, was used to analyze structural elements of proteins and de novo-model coiled-coil structures
  • BioSimSpace, a large collaborative open-source project comprising multiple academic groups and industrial partners, including Cresset. BioSimSpace aims to remove barriers of different input formats and data preparation between different molecular simulation packages, enabling workflows to seamlessly plug together. Furthermore, it is usable in diverse interactive environments such as Jupyter notebooks and KNIME and provides a community hub and repository for developing software and biomolecular workflows together.
  • Examples of running free energy calculations with WaterSwap3,4 (integrated in Cresset’s Flare5), LigandSwap, and ProteinSwap (all part of Sire) for predicting how changes in a ligand or mutations of a protein affect protein-ligand binding. These methods also provide an in-built residue-based decomposition of the binding free energy, which makes it possible to visualise and rationalise predicted changes in binding affinity according to changes in specific protein-ligand interactions (Figure 2).
  • Introduction to QM/MM modeling of enzyme reactions, by Dr Marc van der Kamp (University of Bristol). During this workshop theoretical background, best practices for set up and analysis, and choice of QM level were discussed. As practical example, the AmberTools package was used to calculate the potential and free energy surface of the chorismate mutase catalysed reaction of chorismate to prephenate using the semi-empirical AM1 QM method.
  • Introduction to the ProtoMS software package and running biomolecular Monte Carlo simulations, by Dr Christopher Cave-Ayland and Hannah Bruce-Macdonald (University of Southampton). Subsequently, a very interesting application of ProtoMS was presented and run: The calculation of water affinities in protein binding sites with Grand Canonical Monte Carlo simulations.6,7 These simulations run at a µVT ensemble (constant chemical potential) and include insertion and deletion moves of water molecules, allowing to perform a titration of the chemical potential to determine free energies of each water molecule in the respective binding site.

All workshop attendees were given the opportunity to participate in an exciting virtual reality docking experiment, allowing to actually move ‘inside’ a biomolecular simulation while docking a small molecule ligand into a protein pocket with VR hand controllers.

Overall, the well-organized CCPBioSim training workshop laid a great foundation for good practices with set up, running, and analysis of biomolecular simulations and showcased exciting developments and applications. The whole week had a very collaborative and supportive atmosphere. See  for the Twitter discussion.

Thank you to the organizers, speakers, and attendees that have made this workshop so enjoyable and a great success.

Figure 2: Visualization of residue-based decomposition of the protein-ligand binding free energy of WaterSwap calculations in Flare5 (HSP90 example, pdb code 2XDK). Green areas highlight residues that prefer interactions with the ligand (left panel), whereas red areas highlight residues that prefer interactions with the water cluster (right panel).


  1. Loeffler, H. H.; Michel, J.; Woods, C. FESetup: Automating Setup for Alchemical Free Energy Simulations. J. Chem. Inf. Model. 2015, 55 (12), 2485–2490
  2. Wood, C. W.; Heal, J. W.; Thomson, A. R.; Woolfson, D. N. ISAMBARD: An Open-Source Computational Environment for Biomolecular Analysis, Modelling and Design. Bioinformatics 2017, 33 (19), Pages 3043–3050
  3. Woods, C. J.; Malaisree, M.; Hannongbua, S.; Mulholland, A. J. A Water-Swap Reaction Coordinate for the Calculation of Absolute Protein–ligand Binding Free Energies. J. Chem. Phys. 2011, 134 (5), 054114
  4. Woods, C. J.; Malaisree, M.; Michel, J.; Long, B.; McIntosh-Smith, S.; Mulholland, A. J. Rapid Decomposition and Visualisation of Protein–ligand Binding Free Energies by Residue and by Water. Faraday Discuss. 2014, 169, 477–499
  5. Flare.
  6. Cave-Ayland, C.; Skylaris, C.-K.; Essex, J. W. A Monte Carlo Resampling Approach for the Calculation of Hybrid Classical and Quantum Free Energies. J. Chem. Theory Comput. 2017, 13 (2), 415–424
  7. Ross, G. A.; Bruce Macdonald, H. E.; Cave-Ayland, C.; Essex, J. W. Replica-Exchange and Standard State Binding Free Energies with Grand Canonical Monte Carlo. J Chem Theory Comput 2017, 13 (12), 6373–6381.

Thirty years, and counting

Lessons and Successes in the Use of Molecular Fields, an invited chapter for the ‘In Silico Drug Discovery Tools’ section of Comprehensive Medicinal Chemistry III, has recently been published by Elsevier. The chapter reviews more than thirty years of research in the area of molecular interaction fields, starting from the ground-breaking work done by Cramer, Goodford and Vinter, until the most recent applications to virtual screening, metabolism prediction, protein similarity across phylogenetically unrelated families and library design.

We believe that the reason for the longevity of molecular fields as computational chemistry workhorses is their capacity to provide a feature-rich description of molecular properties which is independent of the chemical structure. Encoding interaction properties through molecular fields inherently enables scaffold hopping, 3D similarity searches within large databases, ligand and protein structural alignment, cross-chemotype SAR elucidation.

All of the above tasks are the bread and butter of medicinal and computational chemists. Therefore, it is not surprising that over the years the creativity of researchers has built on top of the molecular field technology a number of blockbuster tools for computer-aided drug design, such as CoMFA, FieldScreen (now known as Blaze) or MetaSite, to name but a few. Thanks to their ‘core’ nature, as trends evolved and new needs emerged, fields have brilliantly managed to ride the ever changing waves of drug discovery, and are still at the forefront of the in silico armoury.

We take pride of having been among the lead actors on the molecular field stage, and we trust that readers will enjoy embarking on a fascinating journey through decades of field-based science as much as we did as authors.

Presentations and posters from 253rd American Chemical Society National Meeting

Download our presentation and posters from the 253rd ACS National Meeting. Also available is the presentation, and poster, by the University of Bristol which features Flare, our new structure-based design application.

Improving new molecule design using electrostatics

MEDI poster 135

Tim Cheeseright, Director of Products, Cresset

Electrostatics are critical to ligand binding and yet largely overlooked in new molecule design due to the difficulty in calculation and visualization of meaningful potentials. We have previously shown how electrostatics can be used effectively for scaffold hopping, virtual screening, ligand alignment and SAR interpretation. In this poster we will focus on ligand and protein electrostatics. We will show how considering the changes in ligand electrostatics improves the outcome for new molecule design. Going beyond traditional H-bonding based pharmacophore descriptors enables designers to map the effect of molecular changes on the full electrostatic potential of their molecules. Full exploitation of aromatic dipole moments, C-H hydrogen bonding, halogen bonds, and pi-pi interactions is only possible by understanding the electrostatic basis of these effects. Consideration of the protein electrostatics can inform ligand design to generate complementary patterns of electron rich and electron poor regions.

Putting electrostatics and water at the center of structure-based drug design

COMP poster 292

Tim Cheeseright, Director of Products, Cresset

The electrostatics of protein active sites can inform ligand design and SAR. However, no calculation of electrostatic potentials in the active site can be complete without also considering whether water molecules are tightly bound and contributing to the potentials and ligand binding. In this poster we will explore the effect of including water molecules shown to be energetically favorable on the electrostatic potential of individual proteins. We will present a new application to enable rapid and accurate calculation of water stability and protein interaction potentials. Combining these analyses with the popular Waterswap technique results in an in-depth knowledge of protein targets and ligand protein binding.

Combining protein interaction potentials with water analysis in structure-based design

COMP oral 497

Tim Cheeseright, Director of Products, Cresset

The XED molecular mechanics force field is unique in providing off-atom centered charges that enable studying complex charge distributions in ligands and proteins alike. We have previously described the use of the molecular interaction potentials derived from the XED force field for ligand similarity calculations which have, in turn, enabled virtual screening, scaffold hopping and SAR analysis to be performed using these effective and meaningful descriptors. In this talk we will describe in detail our application of the XED force field to proteins. We will focus on the calculation of protein interaction potentials showing how protein electrostatics provide deep insights into ligand binding that would otherwise be missed. In parallel to the calculation of the protein interaction potentials we describe using XED to minimize protein-ligand complexes and demonstrate the synergy between these methods.

Understanding protein-ligand binding at the molecular level: Using swap-based methods to visualise binding free energy components

COMP oral 103

Christopher Woods, University of Bristol

Binding free energy methods allow computational chemists to predict whether or not changes to a ligand increase or decrease its binding affinity to a medicinally important protein. While these methods can reveal whether or not a change to the ligand increases its binding affinity, they provide little information as to how this change affects individual molecular interactions. Such information would be extremely useful, as it could provide feedback to a drug designer that could inspire the next series of ligand modifications. For example, it would be valuable for the free energy calculation to reveal that addition of a hydroxyl group to a ligand strengthens its interaction with an active site aspartate residue, but simultaneously destabilises a water molecule that bridges between it and the protein. This feedback could inspire a drug designer to investigate ligand modifications that combine addition of the hydroxyl group with the addition of moeities that displace the bridging water molecule. We have developed a range of binding free energy methods, based on WaterSwap. These swap-based methods allow calculation of absolute and relative protein-ligand binding free energies using a single simulation over a single λ-coordinate. Use of a single coordinate allows components of the binding free energy (i.e. specific interactions between the ligand and individual active-site residues, or between the ligand and neighbouring water molecules) to be integrated across λ during the simulation. The resulting components are used to colour-code 3D views of the proteinligand system. This allows drug designers to easily visualise the affect of ligand modifications on the interactions between the ligand and individual residues in the protein, and individual water molecules in the binding site. These components are not true free energies. However, what they reveal is chemically intuitive, and provide a level of insight that allows drug designers to suggest modifications that lead to improved binding affinity. The components show cooperative affects, and also reveal how increasing the strength of interaction

Visualising the molecular drivers behind drug resistance

COMP Sci-Mix poster 347

Christopher Woods, University of Bristol

The huge expense of developing a new drug can be wasted if natural mutations of amino acid residues in the targetted protein lead to a loss of drug binding affinity. Rational drug design is a continual struggle, with evolution driving mutations that develop emerging drug resistance into widespread drug inefficacy. We have developed a new free energy method, based on WaterSwap, that can provide drug designers with the insight needed to understand how protein mutations affect drug binding. This method allows the change in binding free energy of the drug associated with the mutation of the protein to be predicted. In addition, as for other Swap-Based methods, this free energy change can be decomposed into components that can be used for visualisation. These components allow the change in binding free energy to be understood in terms of changes in specific ligand-protein interactions, or changes in the solvating water network in the active site. This method allows drug designers to pro-actively screen a new drug computationally against a range of likely protein mutants, thereby enabling the drug designer to get one step ahead of nature. Alternatively, it allows drug designers to investigate the molecular basis for reduced binding affinity in known drug-resistant mutants of a protein. This information would be useful for the development of the subsequent generations of the drug.

3D-RISM: Better water positions through improved electrostatics?

Professor Dr Paolo Tosco presented 3D-RISM: Better water positions through improved electrostatics? at the German Chemoinformatics Conference, Fulda, Germany. on 7th November.


The Reference Interaction Site Model (RISM) is a modern approach to solvation based on the integral equation theory of liquids[1]. It has seen increasing use as a method to analyse the structure of water in and around protein active sites. The position and energetics of water molecules in and around the active site is known to be of crucial importance in understanding ligand binding, and in particular knowledge of which water molecules are tightly bound and which are energetically unfavourable can give valuable insights into structure-activity relationships.

The accuracy of 3D-RISM depends on the closure used and the functions used to compute the intermolecular potential between the protein and water molecules. In practise, the potential function consists of van der Waals interactions and electrostatic interactions. Increasing the accuracy of electrostatic calculations should therefore give more accurate 3D-RISM water placements and energies.

To date, 3D-RISM has been applied using traditional force fields such as AMBER[2] which utilise simple unpolarised atom-centred charges to represent electrostatic potentials. We have investigated whether a more accurate electrostatic model, XED[3], which provides both polarizability and electronic anisotropy, improves the effectiveness of the 3D-RISM method. As water energetics around a protein active site are not in general experimentally observable, we have in the first instance concentrated on computing water positions and energetics around a range of small molecule model systems, and validated these using DFT calculations with ONETEP[4] and interaction statistics from the CSD. Polarization and electronic anisotropy are found to be particularly important when assessing water structure around carbonyl groups, and the results from AMBER are found to be noticeably unrealistic in many cases. This has important implications for the accuracy of 3D-RISM calculations on protein active sites.

Read about progress in structure-based design at Cresset.

[1] Skyner, R.; McDonagh, J. L., Groom, C. R., van Mourik, T., Mitchell, J. B. O.; Groom, C. R.; Van Mourik, T.; Mitchell, J. B. O. A Review of Methods for the Calculation of Solution Free Energies and the Modelling of Systems in Solution. Phys. Chem. Chem. Phys 2015, 17 (9): 6174.

[2] J.A. Maier; C. Martinez; K. Kasavajhala; L. Wickstrom; K.E. Hauser; C. Simmerling. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J. Chem. Theor. Comput. 2015, 11:3696-3713

[3] J.G. Vinter Extended Electron Distributions Applied to the Molecular Mechanics of some Intermolecular Interactions. J. Comput.-Aided Mol. Des. 1994, 8: 653-668

[4] C.-K. Skylaris; P. D. Haynes; A. A. Mostofi; M. C. Payne Introducing ONETEP: Linear-scaling density functional simulations on parallel computers. J. Chem. Phys. 2005, 122:084119

Professor Dr Paolo Tosco, Cresset, Presenting at the German Chemoinformatics Conference
(Photo courtesy of Wendy Warr)

At the heart of medicinal chemistry design: The synergistic relationship between Forge and Spark


We present an exercise in in silico medicinal chemistry: using Activity Atlas1 to assess and understand the features that drive activity for a collection of CDK2 inhibitors; using Spark2 to suggest new ideas to explore; then visually assessing those new ideas in Forge3 in the context of the qualitative 3D-SAR models generated by Activity Atlas.


The efforts of medicinal chemists in their quest to design new, active ligands can be summarized by answering three deceptively simple questions:

  1. What to make next?
  2. Why should I make it?
  3. Has this chemical space been explored previously?

These questions are frequently addressed separately. However, there is much value in considering the questions simultaneously with the help of Forge and Spark.

In this case study, Cresset ligand-based design tools are used to address the questions above with respect to the well-studied target for oncology Cyclin-dependent kinase 2 (CDK2).


Compound collection and alignment

A collection of 38 imidazopyridine containing CDK2 inhibitors was gathered from CHEMBL4 and curated to create a molecular dataset spanning a pIC50 range of 4-9.

Prior to assessing any relationship between activity and 3D structure, it is necessary to align the collection of molecules. In this case we chose to align the molecules to the ligand derived from the crystal structure 1OIT (CHEMBL73303: pIC50 =9.0). As many of the compounds in the dataset contained a solvent-exposed group that was not present in this reference an additional active ligand was identified (CHEMBL70808: pIC50 = 8.52) that maintained the core, but expanded beyond the terminal sulfonamide of the 1OIT ligand. The CHEMBL70808 ligand was aligned to the 1OIT ligand using manual and automated methods to ensure good correspondence of the imidazo[1,2-a]pyridine groups that are deeply embedded in the binding site. Further manual alignment was employed to ensure comparable orientation of the sulfonamide, despite being solvent-exposed. The pair of ligands (Figure 1) was then used as a combined, equally weighted, reference for the subsequent alignment of the 38 compounds gathered.

Figure 1_ Alignment of PDB1OIT ligand
Figure 1. Alignment of PDB:1OIT ligand (pink) and CHEMBL70808 (grey).

Compound collection and alignment

A collection of 38 imidazopyridine containing CDK2 inhibitors was gathered from CHEMBL4 and curated to create a molecular dataset spanning a pIC50 range of 4-9.

Prior to assessing any relationship between activity and 3D structure, it is necessary to align the collection of molecules. In this case we chose to align the molecules to the ligand derived from the crystal structure 1OIT (CHEMBL73303: pIC50 =9.0). As many of the compounds in the dataset contained a solvent-exposed group that was not present in this reference from an additional active ligand was identified (CHEMBL70808: pIC50 = 8.52) that maintained the core, but expanded beyond the terminal sulfonamide of the 1OIT ligand. The CHEMBL70808 ligand was aligned to the 1OIT ligand and subjected to additional manual alignment in order to better align the imidazo[1,2-

a]pyridine groups that are deeply embedded in the binding site. Further manual alignment was employed to ensure comparable orientation of the sulfonamide, despite being solvent-exposed. The pair of ligands (Figure 1) was then used as a combined, equally weighted, reference for the subsequent alignment of the 38 compounds gathered.

Within Forge, the collection of CDK2 inhibitors was aligned to the two equally weighted references, using 50% fields and 50% shape and a soft protein excluded volume to generate an alignment (Figure 2) and similarity score (range 0.59 through 0.85). As alignment is at the heart of every 3D-QSAR approach, the alignments were visually inspected and manually tweaked to ensure consistent orientation of all side chains.

Figure 2_ Alignment of 38 CDK2 inhibitors to PDB1OIT and CHEMBL70808
Figure 2. Alignment of 38 CDK2 inhibitors to PDB:1OIT and CHEMBL70808.

Modeling 3D SAR with Activity Atlas

To generate a qualitative assessment of the features driving activity, an Activity Atlas model calculation was performed.

Activity Atlas is available in Forge, and generates visually striking maps based on the aligned molecules. The models depict: the average electrostatics and shape for active compounds; the summary of activity cliffs in electrostatics, shape and hydrophobicity; and the regions explored. The models generated by Activity Atlas are intended to help answer the questions asked earlier and are shown in Figures 3-5.

In Figure 3, the average active molecule map is depicted. It indicates the common electrostatics and hydrophobic features of active molecules in the dataset – essentially molecules that do not have these features are unlikely to have any significant activity. The central H-bond donor (amine)/H-bond acceptor (pyrimidine) motif mapping the interaction with the hinge region of CDK2 is present in all actives. Negative regions near the H-bond acceptors of sulfonamide and imidazopyridine are also present on all actives.

Figure 3_ Activity Atlas map for average electrostatics and hydrophobics for active molecules
Figure 3. Activity Atlas map for average electrostatics and hydrophobics for active molecules. Blue = negative electrostatics; Red = positive electrostatics; Gold = hydrophobics.

Figure 4 shows the summary of activity cliffs in electrostatics and steric space – the fine detail on the average of actives map. Areas that are blue suggest that making this area more negative (or less positive) could enhance activity; areas that are red suggest that making the area more positive (or less negative) can drive activity; and finally, steric bulk is favorable in the green regions, whereas it is not tolerated in the pink regions.

Figure 4 (right) displays the activity cliff analysis in an orientation better suited to examining the imidazopyridine. Tilted on its side, the Activity Atlas model suggests that a more positive/less negative π-cloud should enhance activity, as well as noting that a stronger negative near the H-bond accepting ring nitrogen should also favor activity.

Figure 4_ Two views of the Activity cliff summary for electrostatics and sterics
Figure 4. Two views of the Activity cliff summary for electrostatics and sterics. Left: Electrostatics and sterics. Right: Electrostatics only in a view rotated around the x axis. Areas in blue suggest that negative electrostatics are favored; areas in red favor positive electrostatics. Areas in green are favorable sterically, while areas in pink are unfavorable for sterically bulky groups.

The assessment of the electrostatic regions explored suggests that at least 10 of 38 molecules have contributions as shown in Figure 5. This means that regions that are not explored may present opportunities for modification and further optimization.

The region explored analysis also performs a calculation of novelty for all the data set compounds as well and novel designs. The calculated novelty can be used as an indication for how much information would be gained from the molecule should it be made.

Figure 5_ Aligned ligands with a display of the regions of electrostatic space explored
Figure 5: Aligned ligands with a display of the regions of electrostatic space explored.

Generating ideas in Spark

With a SAR analysis in hand, Spark was used to find bioisosteric replacements for the highlighted portion of

the 1OIT ligand as shown in Figure 6. The fragments were sourced from the Spark reagent database of boronic acids derived from eMolecules.

Figure 6_ The Spark eMolecules boronic acids reagent database
Figure 6. The Spark eMolecules boronic acids reagent database was designated to identify bioisosteric replacements for the moiety shown in pink in the PDB:1OIT ligand.

Results and analysis

Four results were selected from the Spark experiment were selected for further analysis and are shown in Figure 7. Each of the selected Spark hits have been chosen to either enhance activity, and/or to challenge the model since it was built on a relatively small number of compounds.

Each of the proposed bioisosteric replacements are derived from commercially available reagents that are shipped from the supplier within 2 days to 4 weeks.

In (a), the pyrazolopyridine replacement would test the importance of the H-bond acceptor strength on the original imidazopyridine. Moving the heteroatom across the ring results in a weaker H-bond acceptor while maintaining a reasonable density above the ring. Note that this replacement results in a richer central pyrimidine ring.

The dihdropyrroloimidazole (b) has similar electrostatics to that starting ligand but tests the requirement for an electron deficient aromatic by saturating this ring, removing all pi-electrons from this region.

In (c), the benzofuran might be suitable for testing the model as it removes the strong H-bond acceptor altogether, replacing it with a weak feature. Whilst this might not be productive in activity the molecule would add knowledge to the model and the reagent is available immediately.

Finally, (d) suggests replacing the imidazopyridine with a quinoline. The change from 5,6 to 6,6 ring introduces a different geometry for the H-bond acceptor but maintains many of the electrostatic features of the starting ligand in a reagent that is available for immediate dispatch. If this substitution was tolerated then the spark results contain other, more exotic suggestions that include a H-bond acceptor in a 6 membered ring in this position.

Figure 7. The starting ligand from 1OIT and selected Spark results (a-d). Top: in 2D. Center: in 3D showing positive and negative interaction potentials. Bottom: presented in the context of the Activity Atlas activity cliff summary for electrostatics.


This article presents our approach to answering the three questions at the heart of medicinal chemistry design:

1. What to make next?

Every chemist has a collection of tricks and substitutions that have been built up from their laboratory experience. Spark enhances the generation of ideas in an unbiased way, and in combination with a chemist’s intuition, can provide ideas about what compounds can be made that are in potentially open IP space, but still retain the characteristics of active molecules. The Spark experiment using eMolecules reagent databases also provides tier and ordering information for applicable reagents exploiting specific chemical reactions to improve laboratory efficiency.

2. Why should it be made?

Taking the Spark and chemist-designed ideas and examining them in the context of the Activity Atlas models is a way to ensure that new compounds meet the characteristics of active molecules, or conversely, are designed to test specific aspects of the models. The combination of average active molecule and activity cliff summary maps can provide the rationale needed to give confidence that taking a new design to the laboratory is worth the effort.

3. Has this chemical space been explored previously?

The region explored analysis combined with the novelty calculation within Activity Atlas can be used to assess whether the Spark and chemist-design ideas lie within the already explored chemical space, or whether they map regions of this space so far unexplored.

References and links

4 P. Bento, A. Gaulton, A. Hersey, L.J. Bellis, J. Chambers, M. Davies, F.A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos and J.P. Overington (2014) ‘The ChEMBL bioactivity database: an update.’ Nucleic Acids Res., 42 1083-1090.