Run Cresset calculations in the cloud with Cresset Engine Broker™ and ElastiCluster

In this post I will describe how to instantiate a cluster on Google Cloud (or any other cloud platform supported by ElastiCluster) and use it to run calculations with any Cresset desktop application, i.e., Forge, Spark or Flare, on any platform – Windows, Linux or macOS.

Configuring ElastiCluster is simple, and it is a one-off operation. Once this is done, you can spin up a Linux cluster in the cloud with a single command:

elasticluster start cebroker

Within a few minutes, the cluster will be up and running, and ready to run Cresset calculations via the Cresset Engine Broker.

Once your calculations are finished and you do not need the cluster anymore, you can switch it off with another single command:

elasticluster stop cebroker

How to

As I was new to Elasticluster and to Google Cloud, I followed the excellent Creating a Grid Engine Cluster with Elasticluster tutorial to understand how to enable the Google Cloud SDK, and generate credentials to grant ElastiCluster access to my Google Cloud account.

The rest of this blog post concentrates on the steps required to get this to work in a Linux bash shell. The same workflow runs smoothly in a macOS Terminal and in a Windows bash shell.

The process that I will follow is:

  1. Install ElastiCluster and its dependencies in a virtual environment
  2. Create an ElastiCluster configuration file
  3. Spin up the cluster and use it with Cresset desktop apps

Not all the steps are necessary if you have a VPC already set-up and running on Google Cloud, but this details how to start from scratch in the minimum time.

Install ElastiCluster and its dependencies in a virtual environment

Firstly, I suggest to follow the advice in the Creating a Grid Engine Cluster with Elasticluster tutorial  and install virtualenv, in order to create a separate Python environment for ElastiCluster:

paolo@cresset77 ~/blog$ pip install virtualenv
Collecting virtualenv
Downloading (2.0MB)
100% |################################| 2.0MB 7.7MB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-16.4.3

Then, create and activate a new elasticluster virtual environment:

paolo@cresset77 ~/blog$ virtualenv elasticluster
No LICENSE.txt / LICENSE found in source
New python executable in /home/paolo/blog/elasticluster/bin/python2
Also creating executable in /home/paolo/blog/elasticluster/bin/python
Installing setuptools, pip, wheel...
paolo@cresset77 ~/blog$ cd elasticluster/
paolo@cresset77 ~/blog/elasticluster$ . bin/activate
(elasticluster) paolo@cresset77 ~/blog/elasticluster$

Next, clone from GitLab the Cresset fork of the elasticluster repository which contains a few Cresset-specific Ansible playbooks required to automatically set up the Cresset Engine Broker and FieldEngine machinery on the cloud-hosted cluster:

(elasticluster) paolo@cresset77 ~/blog/elasticluster$ git clone src
Cloning into 'src'...
remote: Enumerating objects: 13997, done.
remote: Counting objects: 100% (13997/13997), done.
remote: Compressing objects: 100% (4820/4820), done.
remote: Total 13997 (delta 8383), reused 13960 (delta 8375)
Receiving objects: 100% (13997/13997), 5.23 MiB | 1.21 MiB/s, done.
Resolving deltas: 100% (8383/8383), done.

Finally, install elasticluster dependencies:

(elasticluster) paolo@cresset77 ~/blog/elasticluster$ cd src
(elasticluster) paolo@cresset77 ~/blog/elasticluster/src$ pip install -e .
Obtaining file:///home/paolo/blog/elasticluster/src
Collecting future (from elasticluster==1.3.dev9)
Downloading (829kB)
100% |################################| 829kB 6.6MB/s
Requirement already satisfied: pip>=9.0.0 in /home/paolo/blog/elasticluster/lib/python2.7/site-packages (from elasticluster==1.3.dev9) (19.0.3)


Running develop for elasticluster
Successfully installed Babel-2.6.0 MarkupSafe-1.1.1 PrettyTable-0.7.2 PyCLI-2.0.3 PyJWT-1.7.1 PyYAML-5.1 adal-1.2.1 ansible-2.7.10 apache-libcloud-2.4.0 appdirs-1.4.3 asn1crypto-0.24.0 azure-common-1.1.18 azure-mgmt-compute-4.5.1 azure-mgmt-network-2.6.0 azure-mgmt-nspkg-3.0.2 azure-mgmt-resource-2.1.0 azure-nspkg-3.0.2 bcrypt-3.1.6 boto-2.49.0 cachetools-3.1.0 certifi-2019.3.9 cffi-1.12.2 chardet-3.0.4 click-7.0 cliff-2.14.1 cmd2-0.8.9 coloredlogs-10.0 contextlib2-0.5.5 cryptography-2.6.1 debtcollector-1.21.0 decorator-4.4.0 dogpile.cache-0.7.1 elasticluster enum34-1.1.6 funcsigs-1.0.2 functools32-3.2.3.post2 future-0.17.1 futures-3.2.0 google-api-python-client-1.7.8 google-auth-1.6.3 google-auth-httplib2-0.0.3 google-compute-engine-2.8.13 httplib2-0.12.1 humanfriendly-4.18 idna-2.8 ipaddress-1.0.22 iso8601-0.1.12 isodate-0.6.0 jinja2-2.10.1 jmespath-0.9.4 jsonpatch-1.23 jsonpointer-2.0 jsonschema-2.6.0 keystoneauth1-3.13.1 monotonic-1.5 msgpack-0.6.1 msrest-0.6.6 msrestazure-0.6.0 munch-2.3.2 netaddr-0.7.19 netifaces-0.10.9 oauth2client-4.1.3 oauthlib-3.0.1 openstacksdk-0.27.0 os-client-config-1.32.0 os-service-types-1.6.0 osc-lib-1.12.1 oslo.config-6.8.1 oslo.context-2.22.1 oslo.i18n-3.23.1 oslo.log-3.42.3 oslo.serialization-2.28.2 oslo.utils-3.40.3 paramiko-2.4.2 pathlib2-2.3.3 pbr-5.1.3 pyOpenSSL-19.0.0 pyasn1-0.4.5 pyasn1-modules-0.2.4 pycparser-2.19 pycrypto-2.6.1 pyinotify-0.9.6 pynacl-1.3.0 pyparsing-2.4.0 pyperclip-1.7.0 python-cinderclient-4.1.0 python-dateutil-2.8.0 python-gflags-3.1.2 python-glanceclient-2.16.0 python-keystoneclient-3.19.0 python-neutronclient-6.12.0 python-novaclient-9.1.2 pytz-2018.9 requests-2.21.0 requests-oauthlib-1.2.0 requestsexceptions-1.4.0 rfc3986-1.2.0 rsa-4.0 scandir-1.10.0 schema-0.7.0 secretstorage-2.3.1 simplejson-3.16.0 six-1.12.0 stevedore-1.30.1 subprocess32-3.5.3 typing-3.6.6 unicodecsv-0.14.1 uritemplate-3.0.0 urllib3-1.24.1 warlock-1.3.0 wcwidth-0.1.7 wrapt-1.11.1

Create an ElastiCluster configuration file

Open your favourite text editor and create a ~/.elasticluster/config file with the following content:

# slurm software to be configured by Ansible
# ***EDIT ME*** add cresset_flare only if you plan to run Flare calculations
# ***EDIT ME*** set the path to your Cresset license

# Create a cloud provider (call it 'google-cloud')
# ***EDIT ME*** enter your Google project ID here
gce_project_id= xxxx-yyyyyy-123456
# ***EDIT ME*** enter your Google client ID here
# ***EDIT ME*** enter your Google client secret here

# Create a login (call it 'google-login')

# Bring all of the elements together to define a cluster called 'cebroker'
# ***EDIT ME*** set compute_nodes to the number of worker nodes that you
# wish to start


# ***EDIT ME*** change 8 into 2, 4, 8, 16, 32, 64 depending on how many cores
# you wish on each worker node

The only bits that you will need to edit are those highlighted in red, i.e.:

  • The path to the Cresset license on your computer
  • Add cresset_flare if you are planning to run Flare calculations on the cluster
  • Your Google Cloud credentials
  • The number/type of nodes that you wish to use in your cluster

The example above uses a dual-core node to act as the head node (the Cresset Engine Broker does not need huge resources) and starts two 8-core nodes as worker nodes; you may well want to use more and beefier nodes, but this is meant as a small, inexpensive example.

At this stage, you are done with the configuration part. Please note that you only need to do this once; the only future action that you may want to carry out is changing the configuration to modify the type/number of your worker nodes.

Spin up the cluster and use it with Cresset desktop apps

This is the single command that you need to run, as advertised in the blog post headline:

(elasticluster) paolo@cresset77 ~/blog/elasticluster$ elasticluster start cebroker
Starting cluster `cebroker` with:
* 1 frontend nodes.
* 2 compute nodes.
(This may take a while...)

If you feel like having a coffee, this is a good time to brew one – bringing up the cluster will take a few minutes.

If everything works well, at the end of the process you should see:

Your cluster `cebroker` is ready!

Cluster name:     cebroker
Cluster template: cebroker
Default ssh to node: frontend001
- compute nodes: 2
- frontend nodes: 1

To login on the frontend node, run the command:

elasticluster ssh cebroker

To upload or download files to the cluster, use the command:

elasticluster sftp cebroker

Now ssh into your frontend node, forwarding port 9500 to localhost:

(elasticluster) paolo@cresset77 ~/blog/elasticluster$ elasticluster ssh cebroker -- -L9500:localhost:9500
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1029-gcp x86_64)
* Documentation:
* Management:
* Support:

Get cloud support with Ubuntu Advantage Cloud Guest:
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

0 packages can be updated.
0 updates are security updates.

Last login: Mon Apr  8 21:28:16 2019 from


Tunnelling port 9500 through ssh creates a secure, encrypted SSL connection to the Cresset Engine Broker running in the cloud. You probably won’t need this if you already have configured your VPC in Google Cloud.

You may easily verify that the Cresset Engine Broker is already running on the frontend node, waiting for connections from Cresset clients:

cebroker@frontend001:~$ systemctl status cresset_cebroker2
* cresset_cebroker2.service - server that allows Cresset products to start/connect to remote FieldEngines
Loaded: loaded (/etc/systemd/system/cresset_cebroker2.service; static; vendor preset: enabled)
Active: active (running) since Mon 2019-04-08 21:41:50 UTC; 8min ago
Process: 20917 ExecStop=/bin/bash /opt/cresset/CEBroker2/documentation/init.d/ stop /var/run/ (code=exited, status=0/SUCCESS)
Process: 20930 ExecStart=/bin/bash /opt/cresset/CEBroker2/documentation/init.d/ start /var/run/ (code=exited, status=0/SUCCESS)
Main PID: 20956 (CEBroker2.exe)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/cresset_cebroker2.service
> 20956 /opt/cresset/CEBroker2/lib/CEBroker2.exe --pid /tmp/tmp.Y7Wvk135fM -p 9500 -P 9501 -s /opt/cresset/CEBroker2/documentation/examples/ -m 16 -M 16 -v

Now, start a Forge client running on your local machine.

In the Processing Preferences, configure the Cresset Engine Broker to Host localhost, Port 9500, and press the ‘Test Connection’ button; you should see ‘Connection OK’:

Now, align a dataset against a reference. In addition to your local processes, you should see 16 remote processes running on your cloud-hosted cluster:

Using remote computing resources requires a Cresset Engine Broker license and a Remote FieldEngine license; please contact us if you do not currently have such licenses and wish to give this a try. And do not hesitate to contact our support team if you need help to get the above working for you.

Python extension enabling Jupyter Notebook integration in Flare released

In a recent post I wrote about Integrating Jupyter Notebook into Flare as a new Python extension dedicated to Python developers and enthusiasts. The Python extension that makes this possible is now released (Figure 1).

Figure 1. The button which enables the Python Notebook extension.
While using it to carry out my daily Python coding tasks, I have identified a number of features that the protoype extension was missing and were worth implementing. So, there are a few more highlights that I’d like to share with you.

As discussed in my previous post, the feature that personally I enjoy most is the fact that the Flare Python Notebook has direct access to the Flare main_window() object, and hence allows you to work on the project currently loaded in the main viewport, i.e., interact with ligands and proteins, visualize molecular and field surfaces, etc. As this involves running the Python code in the main GUI thread, only a single Python Notebook may have access to the GUI at any given time.

However, I thought it would be useful to be able to run other concurrent, separate pyflare processes within the same Python Notebook while the main GUI process is busy doing a computation, e.g., preparing a protein (Figure 2):

Figure 2. Download a PDB complex in the GUI, then run Protein Preparation.
The Python Notebook remains responsive while the Protein Preparation task is run by a FieldEngine process in the background. This means I can open a second Python Notebook tab and, for example, visualize the 2D ligand structure. Since the new notebook tab runs as a separate pyflare process, it does not have access to the Flare main_window() object, as shown by the absence of the Flare icon and by the tooltip (Figure 3):

Figure 3. Open another tab and carry out some other task in a separate process.
Once the calculation has finished, you can switch back to the main tab and keep on working there.

To provide better integration with the Flare GUI, I have moved the familiar ‘Kernel’ notebook menu controls to the bottom of the window (Figure 4):

Figure 4. Restart/Stop commands can be accessed from bottom left buttons.
Also, the Load/Save commands were moved from the File menu to buttons, in order to provide more control on the location the notebooks can be saved to or retrieved from (Figure 5):

Figure 5. Load/Save notebooks through a standard file dialog.
The Python Notebook extension is now ready for download from Developers extension on our GitLab page. I’d be really keen on hearing thoughts and ideas from other Python enthusiasts out there, so please do not hesitate to get in touch if you would like more information, have feedback or have suggestions for new features in the next version of the Python Notebook.

RDKit User Group Meeting review

In September I attended the RDKit User Group Meeting, which this year was (very conveniently for Cresset) held in Cambridge, UK, kindly hosted by Andreas Bender at the Department of Chemistry.

As most people will already know, the RDKit is an open-source cheminformatics toolkit developed by Greg Landrum, with regular and ongoing contributions from the community through its GitHub page.

As an RDKit enthusiast, the RDKit UGM is one of my favorite scientific events in the year. It is characterized by a relaxed and informal atmosphere: people attend to show what they achieved through the RDKit, or what they contributed to it, and to learn from others.

We have been building the RDKit into Cresset tools for a few years now, and have also contributed code to the project.

This year I gave a lightning talk about the integration of the RDKit in Flare, our structure-based design tool, and in particular on its usage from Python through the built-in Jupyter Notebook interface. Last month I wrote the blog post Integrating Jupyter Notebook into Flare on this topic, which you may find of interest. Actually, you’d better not get me started on the Flare Python Notebook or I might go on ranting for a long time, as I am terribly excited by the possibilities it opens up, and by how easy it makes it to implement one’s ideas and workflows!

But for now I’ll make an effort to stick to the topic of this post, which is a review of highlights of mine from the last RDKit UGM. As usual, the list is far from exhaustive – there were far more interesting talks than can fit in a blog post!

What’s new in the RDKit?

Greg Landrum (KNIME) started the meeting with the usual overview of the new features introduced since the last UGM. The C++ codebase has been largely updated to C++14, with a 20-30% increment in performance, which is quite impressive. The ETKDG conformer generator developed in 2015 by Greg and Sereina Riniker (ETH Zurich) has now become the default, and has been proved to be on par with commercial alternatives by two independent literature reviews [1, 2]. Further new features of the 2018.09 release include a new JSON-like chemical interchange format, CoordGen-based 2D depictions, and the possibility to instantiate RDKit molecules from SVG depictions endowed with chemical metadata. Finally, a nifty tool written by Nadine Schneider (Novartis Institutes for BioMedical Research) for CheTo to visualize fingerprint bits has been incorporated in the RDKit (Figure 1).


Figure 1. Fingerprint bit visualization.

GSoC RDKit-MolVS integration

Susan Leung (PhD student, University of Oxford) presented the 3-month Google Summer of Code project that she has been working on, which consisted in porting Matt Swain’s (Vernalis) MolVS, which had originally been written in Python, to the main C++ codebase, while expanding its current capabilities. MolVS is a great tool for molecule validation and standardization, i.e., all those automated operations to carry out when importing a set of chemical structures into a cheminformatics toolkit to get them in a consistent and manageable state (Figure 2).

Susan did a great job, even if the tautomer enumeration/canonicalization still has to be completed. This work is going to be very useful to the community and will ensure wider adoption of consistent molecular standardization criteria.


Figure 2. Some of the operations that MolVS can carry out on a chemical dataset.

Some (hopefully) useful open source programs built on the RDKit

Pat Walters (Relay Therapeutics) presented three programs built on top of the RDKit and available through Pat’s GitHub page.

Identify candidates to prioritize for synthesis

The first program addresses the case, quite common in pharmaceutical companies, where hundreds of compounds have been synthesized as part of a project, but still these compounds do not cover all possibilities: there are certainly holes and potentially interesting molecules that were missed. In such cases, a Free-Wilson analysis allows to evaluate the contributions of different R-groups and identify candidates to prioritize for synthesis. Molecules are decomposed into R-groups, a matrix which encodes presence or absence of each R group in each molecule is generated, then R-groups vector are regressed against pIC50 values. Linear regression does not usually work well for this purpose, as many characteristics are collinear, so Ridge Regression is a better choice to avoid hitting the collinearity problem.

Filter chemical libraries

The second program is aimed at filtering chemical libraries via the functional group filters from the ChEMBL database, and some property filters from the RDKit, which act as structural alerts to identify potentially problematic molecules at an early stage.

Predict aqueous solubility from molecular structure

The last program is an implementation of Delaney’s ESOL method to predict aqueous solubility from molecular structure. When discussing the results, Pat showed a chart comparing the r2 obtained by his ESOL implementation to those obtained through RF, DNN and Graph Convolution methods.

While apparently ESOL gets a higher r2, the difference is not statistically significant, because of the confidence intervals associated to the number of data points (Figure 3). This is a point that Pat has already stressed in his blog post Predicting Aqueous Solubility – It’s Harder Than It Looks, and certainly worth keeping in mind.

Figure 3. The light blue area shows the 95% confidence interval for r2 depending on the number of data points.

RDKit in the modern biotech

Ben Tehan and Rob Smith (Heptares) presented an overview of how the RDKit is used at Heptares. In an effort to harmonize and simplify their cheminformatics infrastructure, Heptares adopted the RDKit as their standard platform in 2013. I was pleased to learn that in their hands the performance of my Open3DALIGN alignment algorithm (which indeed was my first contribution to the RDKit, back in 2013) coupled with shape-based scoring was not too far from ROCS (EF1% ~10 using CrippenO3A descriptors compared to ~15 enrichment with ROCS). So I gave myself a pat on the back and carried on.

Deceptively simple: How some cheminformatics problems can be more complicated than they appear

For those who have never been to an RDKit UGM, I’ll do a small preamble. Traditionally, Roger Sayle (NextMove) gives a talk where he picks some RDKit algorithm, pinpoints how poorly coded it is in its current implementation, and how it could be written much better. This is usually quite enjoyable (Greg might have a different opinion), particularly because the talk is always accompanied by a GitHub pull request by Roger with the improved code, to the benefit of the community.

This year the focus was a bit different: rather than tearing some RDKit bit into pieces, Roger decided to show the potential performance pitfalls behind apparently simple common (chem)informatics tasks, such as counting text lines in a file or computing molecular weight, to finish by spotting out a claim in a scientific paper based on a wrong percentage calculation. While very interesting as usual, I thought that the arguments in favor of low-level programming to speed up certain tasks were not entirely convincing, as they were not supported by timings showing that the extent of the gain actually balances the losses in terms of effort and portability.


Once again, this was a thoroughly enjoyable meeting. If you’re interested in finding out more, see the materials from the RDKit UGM 2018.

Integrating Jupyter Notebook into Flare

In a recent blog post I have shown the integration of the Jupyter QtConsole in Flare.

The Jupyter QtConsole nicely fits in with the rest of the Flare GUI and provides a comfortable Python shell environment with most of the nifty Jupyter features such as history, TAB completion, syntax highlighting, embedding of images, etc.

Since I published that post, I have started thinking that it would have been great to embed a Jupyter Notebook, as it offers an even richer interface: most importantly, it enables editing and running individual code cells, thus constituting the ideal environment for Python enthusiasts.
There were a few technical hoops that I had to jump through to get this to work, but I finally managed.

So here I am proudly presenting the Flare Python Notebook, i.e. an instance of the Jupyter Notebook embedded into the Flare GUI which has direct access to the Flare GUI objects and methods just as the Python QtConsole (Figure 1).

Figure 1. A screenshot showing the new Python Notebook embedded in Flare.
To demonstrate the new fuctionality, last week at the 7th RDKit UGM in Cambridge I gave a lightning talk showcasing a sample Python Notebook which downloads a set of AChE inhibitors from ChEMBL, loads them into Flare, generates 3D coordinates and field points, and finally docks them to a crystallographic AChE protein downloaded from the Protein Data Bank.
The RDKit is used to compute 2D similarities and maximum common substructure across ligands and to generate 2D molecule layouts. RDKit molecules are fully interoperable with Cresset molecules within Flare, so Cresset 3D technologies and RDKit methods can be synergycally combined in one Python workflow.
Matplotlib, NumPy and SciPy are used to generate a scatterplot with a regression line and compute some statistics.

Now, on to the Jupyter Notebook:

The Flare Python Notebook unleashes the full potential of embedding a highly interactive Python environment within the Flare GUI.
RDKit cheminformatics methods and Cresset 3D technologies can be used side-by-side and their results visualized in real time while writing Python code, making the development cycle much more efficient and expedite.
The Jupyter Notebook has made scripting a simple everyday task for cheminformaticians, bioinformaticians and data scientists. I am confident that the Flare Python Notebook will do the same for CADD scientists and computational chemists.

And if you are not a Python guru, I can help you out; actually, I can’t wait to write my next script!

We will be releasing the Jupyter Notebook integration to our GitLab repository of Python extensions as soon as it is finalised. To find out more about Flare or to talk about how the Python integration can help you in your research or to request a Python script to achieve a particular task within Flare, then please contact us.

Review of the 11th International Conference on Chemical Structures

We recently attended ICCS in Noordwijkerhout, where we presented two posters:

This was my second time at this conference, the first as a Cresset scientist. I remember very well my previous attendance in 2011, as it coincided with a turning point in my career: in fact, by that time I had realized that my main research interest had become the development and implementation of new methodologies in computational chemistry, and I started concentrating my efforts in that direction, ultimately joining Cresset a few years later.

Based on my two participations, I do rate ICCS very high in my personal ranking of scientific events, on par of the German and Sheffield conferences of cheminformatics: excellent scientific program, relaxed and informal atmosphere, extended poster sessions to merge and exchange ideas with colleagues, plus a sun-blessed boat trip on the Ijsselmeer.

In the following I will try to capture a few highlights of the talks that I found most inspiring, even though the list could indeed be much longer.

How do you build and validate 1500 models and what can you learn from them? An automated and reproducible system for building predictive models for bioassay data

Greg Landrum (KNIME, T5 Informatics) presented a KNIME-based workflow for automated building of predictive QSAR models (Figure 1).

Figure 1. KNIME-based workflow for automated building of predictive QSAR models, presented by Greg Landrum, KNIME.

ChEMBL 23 was used as the data source for model training, after ‘seeding’ it with compounds roughly similar to the active compounds (i.e., Morgan2 fingerprint Tanimoto between 0.35 and 0.60) which are assumed to be inactive. In fact, the ratio of actives to inactives in scientific publications (from which ChEMBL data are extracted) tends to be unnaturally high, because researchers prefer to focus on active compounds in their reports. Five different kinds of chemical fingerprints were generated for each compound, and ten different activity-stratified training/test set splits were carried out for each activity type. Four different models were built using various machine learning (ML) methods, namely Random Forest, Naïve Bayes, Logistic Regression and Gradient Boosting; the best model was selected as the one giving the highest enrichment factor at 5%.

Overall, over 310K models were built and their performance evaluated in a fully automated fashion; calculations were distributed on 65-70 load-balanced AWS nodes and took only a couple of days. While results seemed initially deceptively good, more in-depth critical analysis revealed that most models were overfit on the training data and not very generalizable. This conclusion was very interesting in itself, adding up to the usefulness of the comparative analysis of the predictive performance of the various fingerprint/ML method combinations.

Machine learning of partial charges from QM calculations and the application in fixed-charge force fields and cheminformatics

I enjoyed very much the talk given by Sereina Riniker (ETH) on deriving atom-centred partial charges for molecular mechanics calculations from ML models. Such models were trained on DFT electron densities computed with the COSMO-RS implicit solvent model (or COSMO for compounds containing elements not supported by COSMO-RS). Sereina had already anticipated that she was working on this subject during a lighting talk given at the 2017 RDKit UGM, and I had been looking forward to seeing the final results since then. The Random Forest model was trained on a set of 130K diverse lead-like compounds from ZINC/ChEMBL, picked as to include as many substructures as possible, with the goal to cover most of the known chemical space of drug-like molecules. In addition to being a lot faster than DFT, the methods yields 3D conformation-independent partial charges, which is a highly desirable feature (Figure 2).

Figure 2. Procedure to derive 3D conformation-independent partial charges from ML models trained on QM data, presented by Sereina Riniker, ETH.

Prediction accuracy is >90%, and can be further improved by adding compounds to the training set in case some feature turns out to be poorly predicted due to being under-represented.

Artificial intelligence for predicting molecular electrostatic potentials (ESPs): A step towards developing ESP-guided knowledge-based scoring functions

On a similar topic, Prakash Chandra Rathi (Astex) presented a method based on graph convolutional deep neural networks to predict the molecular electrostatic potential (ESP) around atoms in a molecule, with the goal of optimizing ligand-protein electrostatic complementarity, thus removing the need for expensive QM calculations. The neural network was trained on 48K diverse molecules for which ESP surfaces were computed using DFT calculations. The optimized model delivered excellent predictive performance on a validation set of 12K molecules (r2 0.95). In addition to visualization of field extrema, as in Cresset’s field point technology, ESPs computed in this fashion hold promise for developing knowledge-based scoring functions to estimate the propensity of a certain atom in a ligand to interact with another atom in a protein.

Multivariate regression with left-censored data – efficient use of incompletely measured bioactivity data for predictive modelling

Knut Baumann (TU Braunschweig) described a method developed in collaboration with Bayer to deal with a very common problem in QSAR modeling, i.e., the fact that for weakly active compounds an exact pIC50 or pKi cannot usually be assessed, as it lies below the assay detection threshold. This kind of data is called left-censored, and poses a serious problem as regression algorithms able to handle hundredths or thousands of left-censored predictors are not available. Treating such data ‘as is’, i.e., interpreting the ‘lesser than’ symbol as if it were an ‘equals’ symbol or omitting this data altogether leads to underestimating the slope and/or overestimating the intercept. To address this issue, Baumann and co-workers adapted the Buckley-James algorithm to principal component regression (PCR) and partial least-squares regression (PLS). In a nutshell, a self-convergent procedure is adopted: the model is initially fitted against all data, then corrections to censored data are estimated, and after applying them the model is re-fitted and the whole procedure is iteratively repeated until convergence is reached. The method was validated using PLS and PCR on both simulated data and on a real dataset from industry, yielding a significant performance improvement compared to the naïve cases where left-censored data are handled as if they were uncensored and to the case where all left-censored data are simply neglected.

How significant are unusual intermolecular interactions?

Bernd Kuhn (Roche) presented a thorough analysis of unusual intermolecular interactions (i.e., halogen bonding, dipole-dipole interactions of aromatic F, aromatic sulfur interactions, S σ-hole to carbonyl, etc.) as observed in crystallographic complexes from the Protein Data Bank and the Cambridge Structural Database. The goal of this study was assessing how important such interactions, on which a wide literature exists, actually are in real life. The analysis was done using the line-of-sight (LoS) paradigm (an interaction between two atoms can only take place if no other atoms are in the way), coupled with exposed surface areas to derive interaction statistics from crystallographic databases. 16 protein atom types were defined describing element, hybridization, polarity and charge of each atom, and their propensity to interact with other atom types was determined: for example, chlorine likes being an environment with polarized CH, C and NH π-systems, and non-polar groups, whereas cyano groups prefer hydrogen bond donors, positive charge, NH π-systems and waters. Additionally, the geometric preferences of each kind of intermolecular contact were determined.

Flare API: A new playground for computational chemists and developers

Flare 2.0 comes with a full-featured Python API which enables both high-level access to scientific functionality and low-level access to the graphical user interface and internal processes. In addition to Flare methods, you will have access to RDKit, NumPy, SciPy, and to virtually any other Python module that can be pip-installed. RDKit molecules are treated just as native Flare molecules, and can be loaded interchangeably to Flare’s ligand table. This means that you can generate a virtual 2D library using RDKit reaction SMARTS and then turn it into a 3D library of field-enabled Cresset molecules without ever leaving the Flare environment.

Additionally, if you realize that you often use a certain Python script as part of your Flare workflow, you may decide to associate it to a custom control in the Flare ribbon user interface, be it a button, a menu, or a complex dialog with signals and callback functions (Figure 1).

Figure 1. An example of the custom controls that can be added to the Flare ribbon.
We have done our best to write a clear and easy-to-browse HTML documentation, but even more importantly we have come up with a number of sample scripts which cover a variety of use cases, in order to get you started as quickly as possible.

Python commands can be loaded and executed in Flare in a number of different ways, ranging from highest automation to highest interactivity.

For example, when you need to carry out a completely automated task, such as overnight preparation of a panel of proteins followed by docking of several ligand series, the most convenient option is to write a Python script that runs outside the Flare GUI, such that it can be distributed on a cluster via a queueing system for maximum performance. This use case is covered by the pyflare binary, which is effectively a Python interpreter just as the familiar python binary, slightly modified to play well with the rest of the Flare ecosystem (Figure 2).

Figure 2. Running a Flare Python script outside Flare using the pyflare interpreter.
If your goal is to run a simple Python one-liner to, for example, list all cysteine residues in a protein chain or to print the distance between a ligand atom and a couple of key residues in the active site, the embedded Python Console is probably the simplest and leanest option – run the command and examine its output in the text port (Figure 3).

Figure 3. Entering simple one-liners in the Python Console widget.
If you believe you will need to go through some iterations of a code-run-test-debug workflow on a somewhat more complex script, you will most likely choose the embedded Python Interpreter widget, which allows you to load a script, interactively edit it inside Flare and then save your modifications (Figure 4). Both the Python Console and the Interpreter come with a multi-tabbed interface that makes it possible to work on multiple Python snippets at the same time.

Figure 4. Loading and editing a script in the Python Interpreter widget.
If you are used to the highly interactive environment of Jupyter notebooks, then you are going to love the Python QtConsole embedded in Flare. This widget provides all the nifty Jupyter features, i.e., TAB completion, auto-indentation, syntax highlighting, context help, inline graphics, and more (Figure 5). Type Python commands, examine molecules, draw plots – all in the same window.

Figure 5. Visualizing text output and inline graphics in the Python QtConsole widget.
If you would rather prefer to use Jupyter Notebooks in the traditional way, i.e., multiple tabbed sessions inside your favorite browser, that’s entirely possible, too. You will even be able to look at ligands and proteins in 3D inside your notebook, including Cresset surfaces and field points (Figure 6).

Figure 6. Running a Flare workflow inside a Jupyter Notebook outside the Flare GUI application.
In summary, programmatic access to Flare methods is very appealing to computational chemists looking for ways to automate tasks or run complex workflows on multiple datasets. On the other hand, the possibility to extend the core functionality as well as the graphical user interface with own methods makes Flare the ideal playground for Python developers.

For academic researchers, Flare is an opportunity to make their science more broadly accessible through integration into a user-friendly environment, while corporate users will appreciate the possibility to augment Flare capabilities with in-house REST services and tools. The possibility to code in Python taking advantage of the highly interactive experience provided by the Jupyter QtConsole and Notebook is a further incentive to make Flare the environment of choice for computational chemistry workflows.

Request a free evaluation of Flare.

Thirty years, and counting

Lessons and Successes in the Use of Molecular Fields, an invited chapter for the ‘In Silico Drug Discovery Tools’ section of Comprehensive Medicinal Chemistry III, has recently been published by Elsevier. The chapter reviews more than thirty years of research in the area of molecular interaction fields, starting from the ground-breaking work done by Cramer, Goodford and Vinter, until the most recent applications to virtual screening, metabolism prediction, protein similarity across phylogenetically unrelated families and library design.

We believe that the reason for the longevity of molecular fields as computational chemistry workhorses is their capacity to provide a feature-rich description of molecular properties which is independent of the chemical structure. Encoding interaction properties through molecular fields inherently enables scaffold hopping, 3D similarity searches within large databases, ligand and protein structural alignment, cross-chemotype SAR elucidation.

All of the above tasks are the bread and butter of medicinal and computational chemists. Therefore, it is not surprising that over the years the creativity of researchers has built on top of the molecular field technology a number of blockbuster tools for computer-aided drug design, such as CoMFA, FieldScreen (now known as Blaze) or MetaSite, to name but a few. Thanks to their ‘core’ nature, as trends evolved and new needs emerged, fields have brilliantly managed to ride the ever changing waves of drug discovery, and are still at the forefront of the in silico armoury.

We take pride of having been among the lead actors on the molecular field stage, and we trust that readers will enjoy embarking on a fascinating journey through decades of field-based science as much as we did as authors.

Boosting RDKit molecular simulations through OpenMM

I am a big fan of the RDKit. In case you have never heard about it, it is an open-source, BSD-licensed C++ toolkit which allows one to accomplish a wide range of cheminformatics tasks. You can build a molecule from SMILES, create 2D depictions, generate 3D conformations, do substructure searches, run chemical reactions, and much more. It comes with C++, Python, Java and C# bindings, so you can access its functionality from your favourite programing language. It features tight integration with the Jupyter notebook, so you can display your molecules in 2D and 3D interactively while you develop your Python workflow. In case you are not much of a programer, a lot of the RDKit functionality is exposed through RDKit KNIME nodes. And, last but not least, the RDKit comes with a PostgreSQL cartridge which enables dealing with molecules in PostgreSQL databases.

Now you know why I love the RDKit, and I hope I managed to convince you to give it a go, if you haven’t already. There are a number of tutorials to get yourself started, and an amazing community of users which can help you out when you are stuck.

Cresset software incorporates the RDKit, which is mostly used to parse SMARTS queries: in Forge, Torch and Spark you may apply SMARTS filters to your molecule tables. In Spark you may also request certain SMARTS patterns to be, or not to be, included in the results; furthermore, the Torsion Library which analyses the torsional angles of the fragments retrieved by a Spark search is based on a database of SMARTS strings.

We also use the RDKit in our internal research projects, in Cresset Discovery Services, and occasionally to integrate or customize the functionality already available in Cresset desktop applications, command-line tools, and KNIME nodes.

Besides being RDKit users, we are also RDKit contributors. In 2015 we contributed a resonance structure enumerator, while at the 2016 RDKit User Group Meeting, which was hosted at the Novartis Campus in Basel, we presented some preliminary work on boosting RDKit molecular simulations through OpenMM.

OpenMM is an open-source toolkit for high-performance molecular simulations running on CPUs and GPUs. Originally developed in the Pande Lab at Stanford, it is currently supported also by other groups and individuals. OpenMM natively implements AMBER, CHARMM and AMOEBA force fields, which are focused on biological macromolecules, and provides support for implementing custom force fields. The RDKit natively implements MMFF94 and UFF force-fields. MMFF94 is a general-purpose, accurate force-field, while UFF is geared towards small molecules, and trades accuracy for wide chemistry coverage and speed. We thought that it would be interesting to:

  • implement MMFF94 in OpenMM
  • build an OpenMM interface into the RDKit, and
  • compare the performance of the native RDKit implementation of MMFF94 (CPU-only, single-threaded) with the OpenMM implementation (CPU and GPU, multi-threaded).

Even though OpenMM features outstanding support for custom force fields (it has a lexical parser for energy equations and can even compute their analytical derivatives), MMFF94 has rather complicated combination and scaling rules for non-bonded parameters, which required some tweaking on the OpenMM library to be implemented efficiently. I managed to implement under CPU and GPU platforms the seven energy terms of MMFF94 using a combination of AMOEBA and custom forces:

Below (and on GitHub) you will find a Jupyter notebook showing a geometry optimization benchmark on a small protein, villin.

As you may appreciate going through the notebook, the increase in performance provided by this preliminary proof-of-concept implementation is impressive: OpenMM single and double precision are respectively 150 and 11 times faster than the RDKit implementation on a GeForce GTX 960 graphics card.

Our goal is now to code a native implementation of the MMFF94 and UFF force fields within OpenMM, and then provide the RDKit with an interface to the latter, in order to benefit from the speed-up. Possible applications include the automated minimization of protein-ligand complexes after docking, or the molecular dynamics simulation of small molecules in explicit solvent under periodic boundary conditions. The availability of the latter will be announced on the Cresset website and on the RDKit mailing list.

Here follows the Jupyter Notebook (see it on GitHub):

In [1]:
import sys
import math
import timeit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import py3Dmol
from simtk.openmm import openmm

This is the villin headpiece as downloaded from the PDB:

In [2]:
villin = open('/home/paolo/pdb/2F4K.pdb', 'r').read()
p = py3Dmol.view(width = 400,height = 400)
p.addModel(villin, 'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})
In [3]:
molOrig = Chem.MolFromPDBBlock(villin)
In [4]:
mol = Chem.AddHs(molOrig, addCoords = True)
In [5]:
pyMP = AllChem.MMFFGetMoleculeProperties(mol)
In [6]:

Let’s create 4 forcefields, the “traditional” one and those spiced up with OpenMM, respectively using single and double precision CUDA kernels, and the multi-threaded single-precision CPU implementation.

In [7]:
for i in range(3):
    mol.AddConformer(Chem.Conformer(mol.GetConformer(0)), assignId = True)
In [8]:
platformNames = ['RDKit', 'OpenMM_CUDA_s', 'OpenMM_CUDA_d', 'OpenMM_CPU']
pyFF = {}
pyFF[platformNames[0]] = AllChem.MMFFGetMoleculeForceField(mol, pyMP, confId = 0)
for i in range(1, 4):
    pyFF[platformNames[i]] = AllChem.MMFFGetMoleculeOpenMMForceField(mol, pyMP, confId = i)

Now we instruct our RDKit interface to OpenMM to use the appropriate hardware platform:

In [9]:
    'CUDA', {'Precision': 'single', 'DeviceName': 'GeForce GTX 960'})
In [10]:
    'CUDA', {'Precision': 'double', 'DeviceName': 'GeForce GTX 960'})
In [11]:

These are the energies of the protein before minimization computed with the 4 methods; differences are negligible, as they should ideally be:

In [12]:
for name in platformNames:
    sys.stdout.write('{0:20s}{1:8.4f} kcal/mol\n'.format(name, pyFF[name].CalcEnergy()))
RDKit               826.8740 kcal/mol
OpenMM_CUDA_s       826.8734 kcal/mol
OpenMM_CUDA_d       826.8727 kcal/mol
OpenMM_CPU          826.8728 kcal/mol

Now we will carry out a geometry optimization with all methods, and take some timings.

The OpenMM minimizations in single precision bails out of the OpenMM L-BFGS minimizer with a LBFGSERR_MAXIMUMLINESEARCH error (-998) before the RMS gradient criterion kicks in. This is probably due to insufficient precision for the minimizer to behave correctly during the line search. Nonetheless, the energy values are not dramatically different from those computed by OpenMM using the GPU in double precision mode.

In [13]:
t = []
for i, name in enumerate(platformNames):
    ff = pyFF[name]
    t.append(timeit.timeit('ff.Minimize(maxIts = 100000, forceTol = 0.01)',
                      'from __main__ import ff', number = 1))
    sys.stdout.write('{0:20s}{1:8.4f} s ({2:.1f}x)\n'.format(name, t[i], t[0] / t[i]))
RDKit                82.7275 s (1.0x)
OpenMM_CUDA_s         0.5488 s (150.7x)
OpenMM_CUDA_d         7.3300 s (11.3x)
OpenMM_CPU           25.0867 s (3.3x)

The timings are impressive: OpenMM single and double precision are respectively 150 and 11 times faster than the RDKit implementation on a hyperthreading quad-core 3.40GHz Intel Core i7-3770 CPU equipped with a GeForce GTX 960 graphics card.

Also the multi-threaded OpenMM CPU implementation (single precision) scales very well, as it runs >3 times faster than the single-threaded RDKit implementation on the 8 virtual cores (4 physical) of our Core i7.

Energy values at the end of the minimization are comparable; the slight differences between are most likely amenable to the different implementations of the L-BFGS minimizer between RDKit and OpenMM:

In [14]:
for name in platformNames:
    sys.stdout.write('{0:20s}{1:8.4f} kcal/mol\n'.format(name, pyFF[name].CalcEnergy()))
RDKit               -53.4757 kcal/mol
OpenMM_CUDA_s       -52.6213 kcal/mol
OpenMM_CUDA_d       -57.5980 kcal/mol
OpenMM_CPU          -52.6949 kcal/mol

If we look at the heavy-atom-RMSD matrix across the 4 minimizations, we see that the smallest deviation occurs, as might be expected, between the RDKit and the OpenMM double precision implementations. However, the RMSD for the single precision calculations is < 0.7 Å.

In [15]:
molNoH = Chem.RemoveHs(mol)
In [16]:
confsNoH = [molNoH.GetConformer(i) for i in range(4)]
In [17]:
for y in range(len(confsNoH)):
    if (y == 0):
        for name in [''] + platformNames:
    for x in range(len(confsNoH)):
        if (x == 0):
        if (x < y):
                AllChem.AlignMol(molNoH, molNoH, prbCid = x, refCid = y)))
                           RDKit   OpenMM_CUDA_s   OpenMM_CUDA_d      OpenMM_CPU
           RDKit          0.0000          0.6815          0.2669          0.6701
   OpenMM_CUDA_s                          0.0000          0.5457          0.0463
   OpenMM_CUDA_d                                          0.0000          0.5315
      OpenMM_CPU                                                          0.0000

This is the visual difference between RDKit and OpenMM single precision (largest deviation)

In [18]:
p = py3Dmol.view(width = 400,height = 400)
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 0), 'pdb')
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 1), 'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})

And this is how RDKit and OpenMM double precision compare (smallest deviation)

In [19]:
p = py3Dmol.view(width = 400,height = 400)
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 0), 'pdb')
p.addModel(Chem.MolToPDBBlock(molNoH, confId = 2), 'pdb')
p.setStyle({'cartoon': {'color':'spectrum'}})

Examining the diversity of large collections of building blocks in 3D as presented at Sheffield Chemoinformatics Conference 2016


2D fingerprint-based Tanimoto distances are widely used for clustering due the overall good balance between speed and effectiveness. However, there are significant limitations in the ability of a 2D fingerprint-based method to capture the biological similarity between molecules, especially when conformationally flexible structures are involved. Structures which appear to largely differ in functional group decoration may give rise to quite similar
steric/electrostatic properties, which are what actually determine their recognition by biological macromolecules.

In BioBlocks’ Comprehensive Fragment Library (CFL) program, we were confronted with clustering a very large collection of scaffolds generated from first principles. Due to the largely unprecedented structures in the set and our design aim to populate the 3D ‘world’, using the best 3D metrics was critical. The structural diversity of the starting collection of about 800K heterocyclic scaffolds with variable functional group decoration was not adequately captured by 2D ECFP4 fingerprint Tanimoto distances, as shown by the rather flat distribution of 2D similarity values across the set, and by their lack of correlation with the 3D similarity metrics.

The initial step of any clustering procedure is the computation of an upper triangular matrix holding similarity values between all pairs of compounds. This step becomes computationally demanding when using 3D methods, since an optimum alignment between the molecules needs be found taking into account multiple conformers.

The presentation covers the methodological and technical solutions adopted to enable 3D clustering of such a large set of compounds. Selected examples will be presented to compare the quality and the informative content of 3D vs 2D clusters.


See presentation ‘Examining the diversity of large collections of building blocks in 3D‘ as presented at Sheffield Chemoinformatics Conference 2016.

Is this compound worth making?


In deciding if a compound is worth making medicinal chemists need to consider a number of factors including:

  • does it fit with the known structure activity relationship (SAR)?
  • does it explore new interactions or regions of space?
  • does it have good physico-chemical properties?

Summarizing the SAR content of large compound series in a single informative picture is challenging. Traditional 3D-QSAR techniques have been used for this purpose, but are known to perform poorly where the structure-activity landscape is not smooth. 3D Activity cliff analysis has the potential to explore such rugged SAR landscapes: pairs of compounds with a high similarity and a large difference in activity carry important information relating to the factors required for activity. However, looking at pairs of compounds in isolation makes it hard to pinpoint the changes which consistently lead to increase potency across a series.

Similarly, summarizing the regions and interactions that have been explored in a single picture is not trivial, especially if the electrostatic character of compounds is to be considered (e.g., replacement of electron rich with electron poor rings).

Here we present Activity Atlas, a new technique to analyse a series of compounds and derive a global view of the structure-activity relationship data.



The Regions explored summary gives a comprehensive 3D picture of the regions explored in electrostatic and shape space. A novelty score is assigned to each compound, enabling to predict whether newly designed candidates are likely to contribute additional SAR knowledge, and are thus worth making.

Average of actives summarizes electrostatic and shape properties which consistently lead to potent ligands across the series.

The Activity cliff summary gives a global view at the activity cliff landscape highlighting the regions where either more positive/negative electrostatic potential, or larger steric/hydrophobic potentials increase activity.

A traditional 3D-QSAR model2 was built on the same data set (q2 = 0.7). While 3D-QSAR seems better at extracting information where SAR is continuous, Activity Atlas gives more definition in regions where SAR requirements are critical.

Activity Atlas gives more definition in regions where SAR requirements are critical

Accounting for uncertainties in the alignment

Activity Atlas takes into account multiple alignments, weighted by their probability to be correct:

Accounting for uncertainties in the alignment1
Based on the distance from the top result

Accounting for uncertainties in the alignment2
Based on the absolute similarity value (CCDC-AZ dataset)

Analyzing multiple alignments per molecule enables the technique to account for uncertainties in the orientation of flexible side chains not represented in the reference(s).

Application to selectivity

Application to selectivity

Application to a set of adenosine receptor antagonists with activities against A1, A2a and A3 receptors3 demonstrates the utility of the technique in locating critical regions for selectivity. Examination of the steric and electrostatic maps for the three subtypes clearly shows which regions should be targeted in order to enhance subtype selectivity.

In the example above, the right hand side of the molecules can be used to discriminate between A3 and the other two subtypes, while A1 and A2a can be separated by increasing steric bulk and positive charge around the top of the molecules.


The Activity Atlas technique is a powerful way of summarizing SAR data in 3D. By combining information across multiple pairs, it enables a global view of the critical points in the activity landscape. The Average of actives summary captures in one picture the 3D requirements for potency, while the Regions explored summary enables prioritizing compounds which add crucial SAR information over trivial analogues.


1. J. Chem. Inf. Model. 2006, 46, 665-676
3. J. Chem. Inf. Model. 2011, 51, 258-266