Electrostatic Complementarity™ scores: How can I use them?

Flare™ V2 introduced a new analysis method called Electrostatic Complementarity (EC). The basic idea is quite simple: the maximum electrostatic affinity between the ligand and the receptor is achieved when the electrostatic potentials of the ligand and the receptor match (that is, have the same magnitude and opposite sign). At first glance that seems obvious. At second glance it seems a bit more surprising – why wouldn’t it be a good idea to have an even larger potential on the ligand to make the interaction energy even better? The reason is that the improved electrostatic interaction energy between the ligand and the protein will be cancelled out by the increase in desolvation penalty for the ligand.

So, all we need to do is to compute the electrostatic potential of the ligand and the protein over a suitable contact surface, and then compute some sort of correlation metric to measure how similar they are. In a vacuum, this calculation would be quite straightforward. Unfortunately, water (as usual) makes everything much more complicated. Short of running long dynamics simulations, we're going to have to approximate the solvent effects somehow. I'm not a great believer in continuum solvent approximations for this purpose, as water in and around a protein active site is very far from being a continuous dielectric. However, we must do something to account for the water. Our answer is a mix of a complex dielectric function and special treatment of formal charges, which we've already shown works well for visualizing the electrostatic potentials inside a protein active site (see our earlier blog post on protein interaction potentials).
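The heart of the calculation can be sketched in a few lines of Python: sample the two potentials at matched points on the contact surface, negate the protein's, and correlate. This is a toy illustration only; Flare's actual EC metric, surface generation and solvent model are considerably more sophisticated.

```python
from math import sqrt

def ec_score(v_ligand, v_protein):
    """Toy electrostatic complementarity score: the Pearson correlation
    between the ligand potential and the *negated* protein potential at
    matched surface points. +1 means perfect complementarity (same
    magnitude, opposite sign); -1 means a perfect electrostatic clash."""
    vl = [float(v) for v in v_ligand]
    vp = [-float(v) for v in v_protein]  # negate the protein potential
    n = len(vl)
    ml, mp = sum(vl) / n, sum(vp) / n
    cov = sum((a - ml) * (b - mp) for a, b in zip(vl, vp))
    sl = sqrt(sum((a - ml) ** 2 for a in vl))
    sp = sqrt(sum((b - mp) ** 2 for b in vp))
    return cov / (sl * sp)

# Potentials of equal magnitude and opposite sign are perfectly complementary:
print(ec_score([1.0, -0.5, 0.2, -0.8], [-1.0, 0.5, -0.2, 0.8]))  # ≈ 1.0
```

Note that scaling the ligand potential up does not improve this correlation metric, which mirrors the physical argument above: only the *pattern* match counts, not raw magnitude.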


Figure 1: Electrostatic potentials and surface complementarity for the biotin-streptavidin complex.

So, we can compute the potentials (J. Med. Chem., 2019, 62 (6), pp 3036–3050), we can visualize them by coloring the surface by the complementarity (Figure 1), and we can compute an overall EC score. The question now is ‘Does it actually do anything useful?’.

Well, we’re computing an overall EC score, so the obvious thing to check is whether the score correlates with activity. We have done this for lots of data sets (see Figure 2 and the J Med Chem paper referred to earlier), and you get anywhere from a modest correlation (r2 = 0.33 for RPA70N) to a very good one (r2 = 0.79 for PERK). Problem solved, then: just dock your ligand designs into your protein, compute EC scores, and pick the one with the highest EC score to make!


Figure 2: Correlation of EC scores with activity for a range of data sets.

Unfortunately, it’s not actually that simple. While we do show that EC score correlates with activity for a wide variety of data sets on different targets, these data sets are very carefully curated. The reason is that the binding of a ligand to a protein depends on many different physical effects. Electrostatics is one of these, and a very important one, but it’s not the only one, so the EC score is only going to predict activity differences where the other effects do not change.

The data sets used to get the correlations in Figure 2 are very conservative: the ligands within each set are all very closely related, they are all very close to the same size, they have much the same number of rotatable bonds, they have consistent binding modes, and so on. In addition, we find that to get a strong correlation you need to minimize alignment noise (much as you do when generating a good 3D QSAR), so we align all the ligands on a common substructure rather than relying on a free dock.

All other things being equal, then, a higher EC score should give you a higher affinity. Unfortunately, in the real world, all other things are rarely equal, and so unless you are looking at quite conservative changes (for example, asking where on your ligand you could substitute a fluorine to improve affinity) the EC scores are likely to be a poor guide. Back to the original question, then: ‘Does it actually do anything useful?’.

Luckily, although the single numeric EC score is very sensitive to placement of the molecule in the active site, the distribution of EC values over the surface is much more robust. The primary use of the EC method isn’t the calculation of scores: it’s the visualization of where your molecule is matching the protein electrostatics well, and where it isn’t matching as well (Figure 3). This gives you hints as to where you might want to make changes to your molecule, and what changes you might want to make: add a halogen? Move a nitrogen in a heterocycle? Small electrostatic interactions to halogen atoms or to the edges of aromatic rings are hard to visualize any other way.


Figure 3: The mGLU5 inhibitor on the left has a minor electrostatic clash on the pyridine ring, as seen in the EC surface coloring. Placing a fluorine in this position removes the clash and improves affinity.

The primary use of the EC method, then, is analyzing your ligands and pointing out where improvements can be made. You can be confident that these suggestions are sensible, as we have shown in multiple data sets that where the difference between ligands is primarily electrostatic the EC score correlates with affinity. However, the EC scores themselves aren’t a general predictor of affinity, as there are many factors not included in the score that can make a molecule a better or worse binder.

If you’d like to visualize the electrostatics of your molecules in their active site and get guidance on how to improve them, request a free evaluation of Flare.

Comparing Forge’s command line utility to Blaze – which one should you use?

Here at Cresset we’re very interested in ligand-based virtual screening – it’s been a focus of the company ever since we started more than seventeen years ago. In that time there have been many advances and refinements of the techniques for both ligand-based virtual screening and structure-based methods. We have stuck by our fundamental principle that ligand similarity based on both electrostatics and shape is an excellent way to sort the wheat from the chaff. The results obtained by our services division, who have run more than 200 virtual screening campaigns with a better than 80% success rate, are testament to that.

Difference between falign and Blaze

One of the things our customers ask from time to time is which application they should be using to do virtual screening. The simple answer is that there are two, Forge (and its command-line utility ‘falign’) and Blaze, but the differences between them are not immediately apparent.

In falign, you can generate conformations for a large set of molecules, align them to one or more references, and rank them by the similarity score. You also have the option to bias the alignments and scores by adding field constraints, pharmacophore constraints, and protein excluded volumes.

By way of contrast, in Blaze, you can generate conformations for a large set of molecules, align them to one or more references, rank them by the similarity score, and… ok, point taken. So, given that falign and Blaze apparently do the same thing…

Why falign and Blaze?

The answer is scale. As anyone who’s ever played with large data sets knows, doing calculations on a few hundred compounds is fundamentally different to doing them on tens of millions of compounds. Once you are working at large scale, seemingly trivial operations such as filtering data sets become much more difficult if you want to be efficient. Blaze was designed from the ground up to work with large data sets of 10^7 molecules and more, with an emphasis on maximizing throughput on a computational cluster. Forge/falign, on the other hand, are much more aimed at small-scale work, enabling simple screening or analysis of relatively small sets of compounds where the big iron of Blaze is overkill.

Data preparation

As an example, let’s look at the preparation of the data set in the two software suites. In falign, this is relatively simple: you provide the compounds to falign in 2D or 3D form, it assigns protonation states as necessary, and computes conformations on-the-fly if required before aligning to the query:

Falign has a secondary mode for use when aligning structurally-related compounds, which ensures that the common substructure within the dataset is perfectly matched:

Blaze, on the other hand, is much more sophisticated in its conformer handling. The average user of Blaze has multiple data sets that they want to screen (in-house compounds, vendor screening compounds, virtual libraries, custom collections), and these often have significant overlap. In addition, these data sets are usually reused multiple times for multiple virtual screens. As a result, Blaze has a sophisticated deduplication and precomputation pipeline that maximizes computational efficiency. The Blaze workflow looks more like this:

Any given chemical structure is only present once within Blaze: it may have multiple different names, and be present in multiple collections, but we’ll only precompute its conformations once and we’ll only align it once in any given screen. The conformer computation pipeline is heavily optimized for performance: we’ve done extensive studies on our conformer generation algorithm XedeX to find the optimal trade-off between conformation space coverage, rejection of higher-energy conformations, calculation time and number of conformations required. In addition, we’ve developed a special-purpose file format that is highly compressed (less than 13 bytes per atom on average, including coordinates, atom types, element, charge and formal charge) while being unbelievably fast to parse.
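To make the 13-bytes-per-atom figure concrete, here is a hypothetical packed record showing how quantized fixed-point fields keep a binary format that small. The field layout and precisions below are my assumptions for illustration; Cresset's actual file format is not public.

```python
import struct

# Hypothetical packed atom record (11 bytes) showing how a binary format can
# stay under 13 bytes per atom. These field choices are assumptions, not the
# real Cresset format.
ATOM = struct.Struct("<3hBBhb")  # x, y, z | element | atom type | charge | formal

def pack_atom(x, y, z, element, atom_type, charge, formal_charge):
    """Coordinates quantized to 0.01 A (int16), partial charge to 0.001 e."""
    return ATOM.pack(round(x * 100), round(y * 100), round(z * 100),
                     element, atom_type, round(charge * 1000), formal_charge)

record = pack_atom(1.234, -0.567, 8.90, 6, 3, -0.415, 0)
print(len(record))  # → 11
```

At 0.01 Å quantization the coordinate error is negligible for similarity scoring, which is why fixed-point fields are such an easy win over text formats like SDF.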

Blaze has a multiple-step pipeline to filter the data set, so that the full 3D electrostatic shape and alignment algorithm is only applied to molecules that are likely to have a high score. For extremely large data sets there’s an initial filter by FieldPrint, an alignment-free fingerprint method that gives a crude measure of electrostatic similarity. The molecules that pass the filter then go into an ultrafast version of our 3D alignment and similarity algorithm, and the full similarity algorithm is applied only to the best 10% or so of these. As a result, Blaze can chew through millions of molecules very quickly on even a modest cluster. The processing capability of Blaze is further enhanced by the fact that there’s a GPU version which is even faster.
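The staged pipeline described above can be sketched as a simple cascade. The function names, cutoffs and 10% retention fraction here are illustrative assumptions, not Blaze's actual interface:

```python
def cascade_screen(molecules, fp_sim, fast_sim, full_sim,
                   fp_cutoff=0.5, keep_fraction=0.1):
    """Staged screen: cheap scores discard the bulk of the library so the
    expensive score only runs on a small remainder. Returns (molecule,
    full_score) pairs, best first."""
    # Stage 1: alignment-free FieldPrint-style filter.
    survivors = [m for m in molecules if fp_sim(m) >= fp_cutoff]
    # Stage 2: ultrafast 3D score; keep only the best ~10%.
    survivors.sort(key=fast_sim, reverse=True)
    top = survivors[:max(1, int(len(survivors) * keep_fraction))]
    # Stage 3: full similarity score on what remains.
    return sorted(((m, full_sim(m)) for m in top),
                  key=lambda t: t[1], reverse=True)

# Toy run over 100 "molecules" scored by trivial similarity functions:
hits = cascade_screen(range(100), lambda m: m / 100, lambda m: m, lambda m: m)
print(hits[:2])  # → [(99, 99), (98, 98)]
```

The design point is that each stage only has to be accurate enough not to discard true hits; the final ranking always comes from the full similarity function.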

Small versus large data sets

So, falign is designed for the simple use case on small sets of molecules, while Blaze is aimed at maximum computational and I/O efficiency on very large data sets. There is another important difference between the two. As anyone who’s been in charge of maintaining a virtual screening system knows, keeping it up to date is often a painful and thankless task. It’s bad enough keeping up with the weekly additions to the internal compound collection, but keeping track of updates to external vendors’ collections is difficult: not only are new compounds being added, but old ones are being retired. Blaze makes handling this situation easy. You simply provide Blaze with the new set of compounds that you want to be in the collection, and Blaze will automatically handle the update.

Any new compounds will be added, no-longer-available compounds will be marked and removed from the screening process, and any unchanged compounds will be left alone. This is far more computationally efficient than fully rebuilding the conformations for everything. Blaze can even be directly connected to your internal compound database, so that the Blaze collection holding your in-house compounds is always right up to date.
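Conceptually, this update reduces to set arithmetic on canonical structure identifiers. A minimal sketch (the real pipeline also handles aliasing and conformer reuse, which this ignores):

```python
def plan_update(current, incoming):
    """Given canonical structure IDs for the existing collection and the
    newly supplied compound set, work out what to add, what to retire,
    and what to leave untouched (so no conformations are rebuilt)."""
    current, incoming = set(current), set(incoming)
    return {"add": incoming - current,
            "retire": current - incoming,
            "keep": current & incoming}

plan = plan_update({"A", "B", "C"}, {"B", "C", "D"})
print({k: sorted(v) for k, v in plan.items()})
# → {'add': ['D'], 'retire': ['A'], 'keep': ['B', 'C']}
```

Only the "add" set costs any conformer-generation time, which is why incremental updates are so much cheaper than a full rebuild.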

Given how great Blaze is at handling virtual screening, why would you ever want to use falign?

Blaze is optimized for throughput and computational efficiency, but the downside of this is latency. If you have a set of compounds you want to align and score in Blaze, you have to upload them, wait for Blaze to process them and build the conformations, wait for Blaze to build its indices, initiate a search, and wait for it to be submitted to your cluster queueing system. There are five or ten minutes of latency in all of this, which is negligible for a million molecules but hard to justify if you have only one hundred. Falign, by contrast, will start work straight away on your local machine with no waiting at all.

The answer to the falign vs Blaze question, then, is largely a question of scale. Got a data set of a million molecules that you want to run repeated virtual screens against? Blaze is just the ticket. Got a small set of compounds that you want to align and score as a one-off? Forge and falign are just what you need. For our in-house work we tend to find that the tipping point occurs around a few thousand molecules. Falign can easily chew through this many in an hour or so (especially if plugged into your computing cluster using the Cresset Engine Broker). However, if there are more compounds than this, or we’re going to want to run multiple queries, then Blaze it is. Since Blaze is accessible through the Forge front end, and both are accessible through KNIME and Pipeline Pilot, it’s as easy as pie to pick the right tool for the job.

Try on your project

Request a free evaluation to try out Forge or Blaze for your small or large-scale virtual screening needs. Don’t have a cluster? Blaze Cloud and Blaze AWS provide simple ways to access cloud resources to do the number crunching for you.


Pharmacophore constraints – when should they be used?

Cresset’s alignment and scoring technology is based around the idea of electrostatic matching: the ‘best’ alignment of a pair of conformations is generally the one that provides the most similar molecular interaction potentials. In most cases this is true. However, we recognize that often some portions of the electrostatic potential, for example those involved in a strong hydrogen bond, can be more important than others. This isn’t information intrinsic to the molecules themselves, but to the context in which they are being used. As a result, we have for many years offered the possibility of adding ‘field constraints’, which force the alignment algorithm to prioritize matching the electrostatic potential in particular places around a ligand.

It’s worth noting that the field constraint algorithm works by applying a score penalty to alignments which violate the constraint, not by forcing the constraint to be matched (Figure 1). In general, soft constraints delivered via score penalties are preferred to hard constraints where we simply disallow alignments that violate the constraint. There are two reasons behind this. Firstly, using a score penalty allows the strength of the constraint to be adjusted, so that it can be anything from a mild hint to a strict requirement. Second, in most cases we feel that it is better to get a best-effort alignment that violates the constraint than nothing at all, especially when using the algorithms in lead optimization and trying to align a set of known actives.

Figure 1: Field constraint requires the field at the position of a particular field point to have a certain minimum value.

While field constraints work extremely well in the majority of cases, there are a few exceptions. In some cases there isn’t a field point at the place you want to add a constraint (although this problem is alleviated by the new field editor in Forge and Spark). More commonly, the constraint that you want to add is either stronger or more specific than can be expressed through fields: sometimes generic positive potential isn’t enough and you want to explicitly require a hydrogen-bond donor, and sometimes you want to ensure that the metal-chelating group in a set of ligands is aligned, where ‘good metal chelator’ is similarly difficult to pin down in terms of pure electrostatics. In addition, the donor-acceptor motif that is often required in the hinge binding region of kinases can be difficult to describe through fields alone, as the donor and acceptor electrostatics partially cancel, making this region contribute less to the overall alignment score than it should.

For these reasons we introduced the option of pharmacophoric constraints to our core algorithms. These operate in a similar fashion to field constraints, in that they are constraints, not requirements. Traditional pharmacophore algorithms treat pharmacophores as binary switches: either you have an H-bond donor in a region of space, and hence a full match, or you don’t, and therefore have no match. Our pharmacophore constraints instead operate on a sliding scale. You specify how strong the constraint is, and hence how much alignment score should be lost if the constraint is not satisfied. We spent a while looking at what form the penalty function should take, and the best form overall was a simple linear penalty with a cap at 2Å distance (Figure 2).

Figure 2: Pharmacophore constraints require an atom of the specified type to be within 2Å of the constrained atom, or a penalty is applied to the score.
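A capped linear penalty of this kind is easy to write down. The exact functional form used in Forge and Blaze may differ; this sketch assumes zero penalty when the constraint is exactly satisfied, rising linearly to the full constraint strength at the 2 Å cap:

```python
def pharmacophore_penalty(distance, strength):
    """Score penalty for an unsatisfied pharmacophore constraint: zero when
    the matching atom sits exactly on the constrained position, growing
    linearly with distance and capped at the full constraint strength
    from 2 A outwards. (Assumed form, for illustration.)"""
    return strength * min(distance, 2.0) / 2.0

print(pharmacophore_penalty(0.0, 1.0))  # → 0.0
print(pharmacophore_penalty(1.0, 1.0))  # → 0.5
print(pharmacophore_penalty(5.0, 1.0))  # → 1.0 (capped)
```

Because the penalty is bounded, a badly violated constraint costs no more than a moderately violated one, which keeps best-effort alignments available rather than discarding them outright.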

We’ve talked elsewhere about the validation of our pharmacophore constraints on virtual screening performance within Blaze, where for some targets they can make a significant difference. However, the new constraints are useful for more than just virtual screening. What effect does adding constraints have on the probability of a successful ligand alignment?

We investigated this by using the standard AZ/CCDC dataset from Giangreco et al. (J. Chem. Inf. Model. 2013, 53, 852−866), which contains a large number of targets, and for each target a collection of aligned co-crystallized ligands. The data set can be used to validate ligand alignment algorithms, by measuring what proportion of the ligands within a target class can be successfully aligned. However, running a validation over the entire data set would be challenging, as it would pose the question for each target as to what pharmacophore constraints should be applied. We decided to address this by only testing (in the first instance) 4 kinase targets, as the donor-acceptor-donor motif for kinases is well understood and fairly uncontroversial. For each ligand in these 4 target sets, we manually applied between 1 and 3 pharmacophore constraints to the hinge binding region based on how many H-bonds were actually made by that ligand. For the purposes of this experiment we only included ‘true’ H-bonds from hydrogens on O or N, not the weaker C-H form. Note that all of the alignments were performed in ligand space only, with no information from the proteins applied.

The distribution of results is shown in Figure 3. We used a cutoff of 2Å to distinguish between correct and incorrect alignments. As can be seen, a significant number (around 13%) of alignments that were incorrect when unconstrained were rescued by adding the constraints – that’s a significant improvement. However, at the same time, around 3% of the alignments that we had been getting correct were made worse by adding the constraints, and it’s also apparent that a number of incorrect alignments are left unchanged by the imposition of constraints. Overall the pharmacophore constraints are a net win, especially for these kinase data sets, but they’re not a magic bullet.

Figure 3: Effect of adding pharmacophore constraints to 4 kinase alignment data sets. The alignments in the ‘Win’ section are improved by the addition of constraints, while those in the ‘Loss’ section are made incorrect.

Let’s look at a couple of examples. Figure 4 shows a case where adding hinge binder constraints converts an incorrect alignment to a correct one. The unconstrained result on the left matches the dimethoxyphenyl ring on the left and the phenol on the right very well, but at the expense of the quinazoline nitrogen which makes the hinge interaction. Adding a single acceptor constraint to that nitrogen switches the best solution to one where the indazole nitrogen satisfies the constraint, which happens to be the experimentally-observed alignment.

Figure 4: CDK2, aligning EZV to DTQ.

Similarly, in Figure 5, the best alignment in terms of raw shape and electrostatics of MTW and LZ8 is shown on the left. The 2-aminopyrimidine in MTW which makes the hinge interactions is partially matched by LZ8’s pyrazole and carbonyl, but the alignment in that part of the molecule isn’t that good. However, the alignment matches the phenyl ring in both molecules beautifully. Applying a donor/acceptor pharmacophore constraint pair forces the alignment algorithm to try and match the 2-aminopyrimidine much more closely, which leads to the correct alignment on the right. However, note that the phenyl rings are much more poorly aligned.

Figure 5: CDK2, aligning LZ8 to MTW.

Finally, Figure 6 shows a case where adding field constraints converts a correct alignment to an incorrect one. When aligning to DTQ as the reference, as in Figure 4, the field/shape alignment gets the correct alignment for C94. However, C94 does not make the hydrogen bond to the hinge that we have constrained in the reference. The constraint coerces the alignment to be flipped so that one of the sulfonamide oxygens matches the H-bond acceptor requirement, thus forcing the wrong answer to be obtained.

Figure 6: CDK2, aligning C94 to DTQ.

Pharmacophore constraints can, as you can see, be very useful. If you know that particular interactions are required by the protein, you can provide our tools with that information to help them find the correct alignment. However, it must be kept in mind that by adding these constraints you are adding bias to the experiment: you are nudging it to provide the answers that you expect. Sometimes, the unexpected answers turn out to be correct! We recommend, therefore, that pharmacophore constraints are used sparingly. I personally will always run an unconstrained alignment experiment as a control, just to see if Forge will find an alignment that I hadn’t expected but which might illuminate a previously-puzzling piece of SAR.

So, when should you use a field constraint and when a pharmacophore constraint? The answer depends both on your requirements and on the chemistry inside the active site. If there’s a pharmacophoric interaction that you are certain must be preserved, then using a pharmacophore constraint is probably justified. However, if the interaction is not always present in the same form, a field constraint may be more appropriate. Figure 7 shows a BTK example. The ligand on the left makes two hydrogen bonds to the backbone from the indazole. We might be tempted to add pharmacophore constraints to both the donor and the acceptor, except that there are known highly-active ligands which make a C-H hydrogen bond instead (see the ligand on the right). As a result, it is probably more appropriate, if constraints are required, to add a pharmacophore constraint to the acceptor nitrogen, but use a field constraint to indicate a donor preference in the NH/CH location.

Figure 7: BTK ligands – which constraint type should I use?

Try yourself

Pharmacophore constraints are now available in all Cresset ligand-based computational chemistry applications:

  • Forge: Powerful ligand-focused suite for understanding SAR and design
  • Torch: Molecular design and SAR for medicinal and synthetic chemists
  • Spark: Discover new directions for your project using bioisosteres
  • Blaze: Effective virtual screening optimized to return diverse new structures

Request an evaluation and try them yourself.

What’s great about Lead Finder?

We recently announced our collaboration with BioMolTech, a small modeling software company best known for their docking software, Lead Finder. Cresset has been traditionally focused on ligand-based design, but as we expand our capabilities into more structure-based methods we realized that we would have to supply a robust and accurate docking method to our customers. So, why did we choose Lead Finder?


A graphical interface to Lead Finder will be included in our new structure-based design application.

The requirements for a great docking engine are simple to state: it needs to be fast and it needs to be accurate. The latter is by far the more important: nobody cares how quickly you got the answer if it is wrong! Our first question when evaluating docking methods was therefore to ask how good each one was. This is actually a difficult question to answer, as there are several different definitions of ‘good’ depending on what you want: virtual screening enrichment? Good pose prediction? Accurate ranking of active molecules?

The first of these, virtual screening, is what most people think of when they think of docking success. Lead Finder has been validated on a wide variety of target classes and shows excellent enrichment rates (the median ROC AUC across 34 protein targets was 0.94), even on targets traditionally seen as very hard, such as PPAR-γ. The performance on kinases was uniformly excellent, with ROC AUC values ranging from 0.86 for fibroblast growth factor receptor kinase (FGFR) to 0.96 for tyrosine kinase c-Src.


A series of SYK ligands docked to PDB 4yjq with crystal ligand shown in purple.

Pose prediction is of more interest to those working in the lead optimization phase, where assessing the likely bound conformation of a newly-proposed structure can be very helpful in guiding design. Here, too, Lead Finder performs well. On the widely-used Astex Diverse Set, a standard benchmark for docking performance, Lead Finder produces the correct pose as the top-scoring result 82% of the time, which is comparable to other state-of-the-art methods (Gold, for example, gets 81% on the same measure). On a number of literature data sets testing self-docking performance, Lead Finder finds the correct pose between 81% and 96% of the time, which is excellent.


Lead Finder includes dedicated modes for extra-precision and virtual screening experiments.

One of the most intriguing things about Lead Finder is the makeup of its scoring functions. In contrast to many other scoring functions which use heuristic or knowledge-based potentials, the Lead Finder scoring functions comprise a set of physics-based potentials describing electrostatics, hydrogen bonding, desolvation energy, entropic losses on binding and so on. Different scoring functions can be obtained by weighting these contributions differently: BioMolTech have found that the optimal weights for pose prediction differ slightly from those for energy prediction, for example. A separate scoring function has been developed which aims to compute a ΔG of binding given a correct pose. This is a difficult task, and the success of the Lead Finder function was demonstrated in the 2010 CSAR blind challenge, where the binding energy of 343 protein-ligand complexes had to be predicted ab initio. Lead Finder was the best-performing docking method in that challenge. BioMolTech are actively building on this excellent result with the aim of making robust and reliable activity predictions a standard outcome of a Lead Finder experiment.

Cresset are proud to be the worldwide distributors for Lead Finder. It is available today as a command-line application and will be built into Cresset’s upcoming structure-based drug design workbench.

Request an evaluation of Lead Finder.

Understanding torsions in Spark 10.4

Spark is the most advanced bioisostere finding tool available today. While it excels at finding bioisosteric replacements that match the steric, pharmacophoric and electronic nature of the original molecule, we realize that the output of a Spark search is only the beginning. Spark suggests new molecules, so someone needs to make them. Given the time and effort that goes into even a simple synthesis, we need to make sure that Spark’s suggestions are as robust as they can be, to maximize the chances of success for each Spark result.

Spark’s internal algorithms do a lot of complicated work to ensure that the result molecules are chemically sensible. When you are stitching a fragment into a new molecule there are many subtle effects that you have to take account of:

  • Does the hybridization state of any of the atoms change (especially important for nitrogens, which may shift from pyramidal to planar)?
  • Does the pKa of any ionizable groups change, and if so do we need to re-assign protonation states?
  • How do the electrostatic properties of the fragment change once it is stitched into the rest of the molecule (and vice versa)?
  • If any newly-created bonds are free to rotate, what is the rotamer that maximizes the similarity of the new product molecule to the starting point?
  • Is this conformation energetically sensible?

On this last point, Spark carries out a rotation around the newly-formed bond in order to estimate the amount of strain energy carried by that bond in the assigned torsion angle. For speed, this rotational scan is performed holding the parts of the molecule on each side of the bond rigid, and as a result the computed strain energy is only an estimate. However, even this estimated strain energy can be very useful to flag up cases where the new molecule is in an energetically unfavorable conformation and hence would need further investigation before it could be recommended for synthesis.
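The rigid scan can be sketched as follows, with the strain estimated as the energy of the assigned angle above the scan minimum. `energy_fn` stands in for the real rigid-rotation energy evaluation and is an assumption of this sketch:

```python
import math

def torsion_strain(energy_fn, assigned_angle, step=10.0):
    """Rigid-scan strain estimate: rotate the new bond through 360 degrees,
    find the lowest energy seen, and report how far the assigned torsion
    angle sits above that minimum. energy_fn(angle_deg) is a user-supplied
    single-torsion energy model."""
    scan = [energy_fn(i * step) for i in range(int(360 / step))]
    return energy_fn(assigned_angle) - min(scan)

# Toy single-cosine torsion potential with its minimum at 0 degrees:
potential = lambda a: 1.0 - math.cos(math.radians(a))
print(torsion_strain(potential, 180.0))  # → 2.0 (fully eclipsed)
```

Because both sides of the bond are held rigid, the scan minimum is only an upper bound on the true relaxed minimum, which is why the text describes the strain as an estimate.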

While this purely computational procedure works well, it would be nice to bring in some more empirical knowledge gleaned from the vast numbers of small molecule crystal structures that are available in the Cambridge Structural Database (CSD). Luckily, this analysis has already been done by Prof. Rarey’s group at the University of Hamburg, resulting in a pair of papers detailing a hierarchical set of rules to determine preferred values for torsion angles in small drug-like molecules.1, 2 This rule set, called the Torsion Library, has been incorporated into the most recent release of Spark (version 10.4).

Whenever a new product molecule is formed, Spark applies the Torsion Library rules to the newly-formed bonds, and highlights cases where the torsion angle is not one that is frequently observed in the CSD. This doesn’t automatically exclude that result from consideration (there may be a reason such as steric clashes for the uncommon torsion), but it does flag that result molecule as needing careful inspection. The Torsion Library results are automatically displayed for all results and can be used in Spark’s filters like any other result column.
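At its simplest, a Torsion-Library-style check classifies an observed dihedral against preferred angle windows. The real Torsion Library matches hierarchical SMARTS rules to pick the windows; this toy version assumes the windows are already known:

```python
def torsion_frequency(angle, preferred_windows):
    """Classify an observed dihedral (degrees) against preferred angle
    windows, each given as (low, high, frequency_label) on the folded
    0-180 degree range; anything outside every window is 'Low'.
    (Toy stand-in for the hierarchical rule matching in the Torsion
    Library; window values below are invented for illustration.)"""
    a = abs(angle) % 360.0
    if a > 180.0:
        a = 360.0 - a  # fold onto 0-180, since torsions are sign-symmetric
    for low, high, label in preferred_windows:
        if low <= a <= high:
            return label
    return "Low"

windows = [(170.0, 180.0, "High"), (150.0, 170.0, "Medium")]
print(torsion_frequency(-175.0, windows))  # → High
print(torsion_frequency(95.0, windows))   # → Low
```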

As a quick example of the usefulness of the Torsion Library, we have re-examined the 2011 case study FieldStere V3.0 Example 2: Fragment Growing. In this study we were performing a fragment growing experiment in p38, using an existing ligand to guide the growth of a fragment towards the hinge binding region (Figure 1). If we had particular chemistries in mind, then instead of searching the general reagent databases we could search specific ones to find which commercially-available reagents we could use.

Fragment growing Spark experiment
Figure 1: Fragment growing Spark experiment.
In this case, we assume that we could grow the fragment either through addition of a thiol (leading to a sulfur linker) or via addition of an amine (leading to a nitrogen linker). Both searches lead to a variety of interesting-looking results which look potentially active. However, analysis of the Torsion Library result reveals significant differences. Looking at the amine linker results (Figure 2), 493 of the top 500 results have only a ‘Medium’ Torsion Library score, indicating that to get the amino substituents to reach to the hinge binding region you have to twist the amine into a moderately unfavorable conformation.

Amine linkers
Figure 2: Amine linkers.
However, when we look at the thioether result set (Figure 3), the majority (327/500) of the results are in the ‘High’ frequency category.

Thioether linkers
Figure 3: Thioether linkers.
Of course, it’s possible that good, highly active compounds could be obtained from either linking chemistry. However, the Torsion Library results clearly indicate that a thioether linkage is preferred here, purely because it orients the added fragments towards the hinge better. This is valuable knowledge if we were planning a small combinatorial library on this fragment expansion.

Adding the Torsion Library to Spark makes the results even more robust and useful, allowing you to see at a glance what the known experimental conformational preferences of small molecules say about the conformer quality in your Spark results. The new feature is available now as part of the Spark 10.4 release – try a free evaluation today.

1 Schärfer et al., J. Med. Chem. 2013, 56(5), 2016

2 Guba et al., J. Chem. Inf. Model. 2016, 56, 1

What rings do medicinal chemists use, and why?

The vast majority of small molecule drugs contain at least one ring. The rigidity, synthetic accessibility and geometric preferences of rings mean that medicinal chemistry series are usually defined in terms of which ring or rings they have at their core. However, ring systems are more than just scaffolds waiting to be elaborated: the electrostatic and pharmacophoric properties of ring systems are usually crucial to the biological activity of the molecules that contain them.

We have conducted an investigation into the most common ring systems and substitution patterns in the recent medicinal chemistry literature, as derived from the ChEMBL database. For each of these rings, the electrostatic potential has been calculated, allowing the chemist to see at a glance the electronic properties of each system. In addition, applying the Spark bioisostere evaluation metric to the rings database reveals the best bioisosteric replacements for each ring system.

In the poster, What rings do medicinal chemists use, and why?, selected entries from the rings database are shown and discussed. The full data set is an invaluable aid to the medicinal chemist looking to understand the properties of their lead molecule and the opportunities for variation of its core.

April 2016 release of new Spark databases

The new release of Spark comes with new and updated fragment and reagent databases. These are designed to give you the widest sources of inspiration for your projects, whilst also enabling a close link between Spark’s suggestions and the chemistry that is available to you.

Fragment Databases

The latest Spark databases include over 3.5 million fragments that are used to find novel bioisosteres for your project. These come from two distinct sources – ZINC and ChEMBL. In each case the molecules from the entire source collection are fragmented and the frequency with which any fragment has appeared is noted. We then sort the fragments according to frequency and label them according to the number of bonds that were broken to disconnect the fragment from its original molecule.
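The frequency bookkeeping described above can be sketched as follows. This is a minimal illustration only: a toy fragmentation step (splitting a SMILES-like string on '.') stands in for real bond-breaking fragmentation, and the fragment strings are invented.

```python
from collections import Counter

def count_fragments(molecules):
    """Count how often each fragment appears across the whole collection."""
    counts = Counter()
    for mol in molecules:
        # Stand-in for real fragmentation by breaking acyclic bonds.
        for frag in mol.split("."):
            counts[frag] += 1
    return counts

collection = ["c1ccccc1.CO", "c1ccccc1.CN", "CCO.CO"]
counts = count_fragments(collection)
# Fragments sorted most-common first, as the Spark databases are.
for frag, n in counts.most_common():
    print(frag, n)
```

In the real pipeline each fragment would additionally be labelled with the number of connection points left by the broken bonds, as shown in Figure 1.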

Figure 1: Count of fragments in Spark databases from ZINC and ChEMBL split by the number of connection points of each fragment.

Analysis of the numbers of fragments in common between the ZINC and ChEMBL databases shows surprising complementarity.

Number of fragments (to nearest 1,000); rows are ChEMBL21 frequency classes, columns are ZINC frequency classes:

                     Very Common   Common   Less Common   Rare     Very Rare   Singleton
ChEMBL21 common        41,000      41,000     43,000      26,000     18,000      15,000
ChEMBL21 rare           7,000      24,000     44,000      46,000     45,000      34,000
ChEMBL21 very rare      3,000      11,000     26,000      34,000     42,000      70,000

Unsurprisingly, there is significant overlap in the most common fragments from each database. However, once you get to the rarer fragments it is apparent that ZINC and ChEMBL occupy quite distinct parts of chemical space, with the majority of “rare” fragments being unique to each database.

Reagent databases

In this release we have completely replaced the source of our reagent fragments. We are delighted to be working with eMolecules to provide you with over 500,000 reagents that are easy to order with known availability. The new eMolecules-based reagent databases use an enhanced set of rules to more closely relate the Spark results to the chemistry that you want to use on your molecules.

Figure 2: Analysis of Spark reagent databases split by molecular weight.

Each fragment in the eMolecules database is linked back to both the eMolecules ID for the source reagent and its availability. Running a Spark search on these databases thus allows you to very simply move from the Spark experiment to ordering the reagents you require to turn your Spark results into reality.

Figure 3: Spark reagent results include availability information from eMolecules.


This release of databases for Spark increases the number of fragments and improves the availability of reagents. When combined with the existing VEHICLe-derived database, the CSD-derived database and databases from your corporate collections generated with Spark’s database generator, we believe that Spark will find an even better range of bioisosteres for your project.

To update to the latest databases or to take a look at how Spark can impact your project please contact us.

Pros and cons of alignment-independent descriptors

Working with molecules in 3D is computationally expensive compared to most 2D methods. Most modern cheminformatics toolkits can do hundreds of thousands to millions of 2D fingerprint comparisons per second, with 3D similarity techniques being multiple orders of magnitude slower.

The computationally-expensive part of the 3D calculations usually involves aligning conformations to each other. The natural tendency, therefore, is to see if we can skip this step and compute a set of properties that can tell us if two molecules are similar in 3D without actually having to align them. If this works we can get the best of both worlds: the speed of 2D comparisons combined with the accuracy and structural independence of 3D similarity functions.

Pharmacophoric descriptors

The earliest version of this idea is the simple pharmacophore. All you have to do is assign a few pharmacophoric points to each molecule (usually based on some sort of functional group pattern recognition), then generate a set of descriptors based on sets of these (usually 2 or 3). If two molecules share one or more pharmacophore descriptor, then they match.
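The triplet version of this idea can be sketched in a few lines. Everything below is invented for illustration: a real implementation would assign pharmacophore features by substructure matching and take coordinates from 3D conformers, and the distance bins here are arbitrary.

```python
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def bin_distance(d, edges=(2.0, 4.0, 6.0, 9.0)):
    """Map a distance in angstroms to a coarse bin index."""
    for i, edge in enumerate(edges):
        if d <= edge:
            return i
    return len(edges)

def triplet_keys(features):
    """features: list of (type, (x, y, z)). Returns the set of
    (sorted feature types, sorted binned side lengths) keys for all triplets."""
    keys = set()
    for (t1, p1), (t2, p2), (t3, p3) in combinations(features, 3):
        sides = tuple(sorted(bin_distance(d)
                             for d in (dist(p1, p2), dist(p2, p3), dist(p1, p3))))
        keys.add((tuple(sorted((t1, t2, t3))), sides))
    return keys

mol_a = [("donor", (0.0, 0.0, 0.0)),
         ("acceptor", (3.0, 0.0, 0.0)),
         ("aromatic", (0.0, 5.0, 0.0))]
mol_b = [("donor", (1.0, 1.0, 0.0)),
         ("acceptor", (4.1, 1.0, 0.0)),
         ("aromatic", (1.0, 5.8, 0.0))]
# Two molecules "match" if they share at least one triplet key.
print(bool(triplet_keys(mol_a) & triplet_keys(mol_b)))
```

Note how coarse this is: any geometry change that stays within the same distance bins produces an identical key, which is exactly the loss of shape information discussed below.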

Pharmacophore searches succeed on some counts: they are indeed very fast, and they do encode some 3D information. However, they involve a very crude binning of the wide array of possible intermolecular interactions into a few pharmacophore types, and they describe shape poorly, giving them very limited predictive power.

If pharmacophores can’t describe shape well, are there other techniques that can? A number of different methods have been presented, such as those based on multipole moments or spherical harmonic coefficients (e.g. ParaSurf/ParaFit), as well as methods based on statistical moments such as Ultrafast Shape Recognition (USR). None of these has achieved widespread use: harmonic coefficients are not rotation-invariant, while the USR technique correlates poorly to more accurate measures of shape similarity. 1

Field descriptor distances

It would be nice if there was a way of providing alignment-independent descriptors which described both electrostatics and shape/pharmacophoric properties with a reasonable degree of accuracy. This is actually one of the first things we did when we were looking into starting Cresset – we developed a method called FieldPrint that encodes the distance matrix of field descriptors down into a fingerprint that can be used for alignment-independent similarity calculations. The concept is similar to that of GRIND 2 which was published around the same time, although the algorithmic details are somewhat different.
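The general flavour of this kind of encoding can be sketched as a distance-binned correlogram over field extrema. To be clear, this is a hedged illustration of the GRIND-style idea, not the actual FieldPrint algorithm: the bin widths, the pairwise product scoring, and the max-per-bin rule are all assumptions made for the example.

```python
def field_fingerprint(points, bin_width=2.0, n_bins=8):
    """points: list of (value, (x, y, z)) field extrema.
    Returns a fixed-length vector with one slot per distance bin,
    holding the largest pairwise interaction term seen in that bin."""
    fp = [0.0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (v1, p1), (v2, p2) = points[i], points[j]
            d = sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5
            b = min(int(d / bin_width), n_bins - 1)
            fp[b] = max(fp[b], abs(v1 * v2))
    return fp

# Invented field extrema: (field value, position).
extrema = [(-3.0, (0.0, 0.0, 0.0)),
           (2.0, (3.0, 0.0, 0.0)),
           (-1.5, (0.0, 7.0, 0.0))]
print(field_fingerprint(extrema))
```

Because only pairwise distances survive the encoding, the vector is alignment-independent by construction – but, as discussed next, any two arrangements with the same binned distance matrix become indistinguishable.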

We put a lot of work into these techniques, but were never able to get a method that we were completely satisfied with. The problem we found is that encoding the distances in pairs/triplets of field descriptors ends up losing too much 3D information, and as a result you either end up with a slower mimic of standard 2D fingerprints, or you end up with a large false positive count. The FieldPrints have a tendency to find molecules with a similar overall pattern of positive/negative field, but can compute a very high similarity for molecules that are in reality quite dissimilar in terms of the 3D spatial arrangement of those fields. My belief now is that this is an inherent flaw of alignment-independent descriptors: they either have to be sufficiently complex that you are in effect computing an alignment, or you lose too much information and are not significantly better off than just using old-fashioned structural/pharmacophoric fingerprints.

Figure 1: As you move from full 3D interaction potentials to 2D correlograms to 1D fingerprints, comparisons get faster but you lose information

Handling conformation

A further consideration is how you handle conformation. The original GRIND papers just use a single conformer per molecule, and their validation was confined to series of rigid molecules or sets of molecules where single conformations were generated and manually adjusted to be similar. In the general case neither of these shortcuts will work. Any method that purports to be 3D but starts with a single conformation per molecule is inherently flawed: the whole point of 3D is that molecules are flexible.

There is a disturbing number of papers out there that do some sort of notionally-3D analysis on a set of single CORINA-derived conformations. You can get very good enrichment factors on retrospective virtual screens doing this, but in practice the enrichments are largely bogus. CORINA is deterministic, and as a result molecules with similar structures will tend to be put into similar conformations. Combine this with the fact that many standard retrospective VS data sets have very low structural diversity, and the problem becomes apparent. The query molecule and its dozens of congeners in the “actives” data set are all placed in the same single conformation, and so application of a 3D or pseudo-3D technique can easily produce excellent-looking enrichment statistics. However, the enrichment all comes from a hidden 2D similarity.

So, single-conformation methods are a dead end and we need to consider flexibility. Once you are doing so, you need to factor in both the conformer generation time as part of the build time for the descriptor, and also factor in that your comparison speeds will now be two to four orders of magnitude slower than 2D fingerprints (assuming 100 conformers per molecule, and depending on whether you know a single bioactive conformation for one of the two molecules being compared or whether you need to compare conformer populations). 2D methods thus have an unassailable speed advantage, which is part of the reason they remain so popular.
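The orders-of-magnitude figures above come from simple pair counting, sketched here under the assumption stated in the text of 100 conformers per molecule. This counts conformer-pair comparisons only; each 3D comparison is itself far more expensive than a fingerprint Tanimoto.

```python
CONFS_PER_MOLECULE = 100  # the working assumption from the text

def conformer_pairs(known_bioactive_query):
    """Conformer pairs to score for one molecule-molecule comparison."""
    if known_bioactive_query:
        # One fixed query conformer against the target's population.
        return 1 * CONFS_PER_MOLECULE
    # Full population against full population.
    return CONFS_PER_MOLECULE * CONFS_PER_MOLECULE

print(conformer_pairs(True))   # two orders of magnitude over a single 2D comparison
print(conformer_pairs(False))  # four orders of magnitude
```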

Using FieldPrint as a filter

Our original vision for Blaze (or FieldScreen, as it was then) was that it would rely on the FieldPrints to give extremely rapid searching. You can get quite good enrichment factors from the FieldPrints in retrospective virtual screens, but when we investigated further this is largely because they act as a proxy for overall molecular size and charge. Once you control for that by more careful selection of decoys the FieldPrint performance is much less good. Analysing a molecular similarity technique through retrospective virtual screening performance is very, very hard to do well, and as a result I am intrinsically wary of methods that present a set of DUD enrichments as their sole validation: FieldPrints perform quite well on DUD, but we know that they are not particularly effective in real prospective applications.

We still use the FieldPrint technology: it’s the first search stage in every Blaze run. It’s generally good enough to filter out 25-50% of decoy molecules that have no similarity to the query, but certainly not good enough to use the FieldPrint ranks directly. This is why we just use them as a pre-filter: molecules that pass that filter have much more accurate similarities computed using our alignment-based clique/simplex algorithms.
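The two-stage pattern described above can be sketched schematically. Both scoring functions below are toy stand-ins (set overlap on invented "feature" strings), not Cresset's algorithms; the point is only the control flow of a cheap pre-filter feeding an expensive scorer.

```python
def two_stage_search(query, database, cheap_sim, accurate_sim, threshold=0.3):
    """Prune with the cheap similarity, then rank survivors accurately."""
    survivors = [m for m in database if cheap_sim(query, m) >= threshold]
    return sorted(((accurate_sim(query, m), m) for m in survivors),
                  reverse=True)

def cheap(a, b):
    """Fast, crude: Tanimoto on feature sets (the pre-filter)."""
    return len(a & b) / len(a | b)

def accurate(a, b):
    """Stand-in for an expensive alignment-based score."""
    return len(a & b) / max(len(a), len(b))

query = frozenset({"ring", "donor", "acceptor"})
db = [frozenset({"ring", "donor"}),
      frozenset({"lipophile"}),                          # pruned by the pre-filter
      frozenset({"ring", "donor", "acceptor", "anion"})]
results = two_stage_search(query, db, cheap, accurate)
print([round(score, 3) for score, _ in results])
```

Only the survivors of the pre-filter pay the cost of the accurate score, which is what makes the cheap stage worthwhile even though its ranks are not trustworthy on their own.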

In the end, there’s no real short cut. All attempts to date to make 3D comparisons faster by simplifying descriptors and skipping the expensive alignment step just seem to leave out too much information – such techniques can be useful for cutting down the search space but if you’re going to spend CPU cycles working in 3D you might as well do it properly!

1. T. Zhou et al., J. Mol. Graph. Model. 2010, 29, 443–449.
2. Pastor, M.; Cruciani, G.; McLay, I.; Pickett, S.; Clementi, S. J. Med. Chem. 2000, 43 (17), 3233–3243.

Download an evaluation

Try our software for yourself – download a free evaluation.

Affordable virtual screening with Blaze: Benchmarks


We released BlazeGPU a couple of years ago, allowing the full power of the Blaze virtual screening system to be used on a few consumer graphics cards rather than a full-scale Linux cluster. Since then, graphics cards and CPUs have only got faster, so we decided that it was time to update our benchmarks and see how well all of the new hardware performs.

For these benchmarks we took a random subset of 4,000 molecules from our in-house Blaze data set and searched with a medium-sized query molecule. The molecules in the data set average 80 conformers each. We’ve run with three different search conditions: the full slow-but-accurate simplex algorithm, the standard clique algorithm and the new fastclique algorithm. All of these were run with 50% fields and 50% shape.

CPU performance

Firstly, the CPU benchmarks. All of these are single-core performance, but with all cores loaded so that we’re not benefitting from Intel Turbo Boost. In most cases Blaze will be saturating all cores, so this is representative of real-world performance. Note that the vertical axis is on a log scale.

CPU benchmarks

As can be seen, there’s a significant performance difference between the older CPUs at Cresset (such as the Q6600) and the newer Ivy Bridge i7-3770K chips, but not nearly as much as you would expect given that the Q6600s are around 7-8 years old at this point. The significant speed improvements of the fastclique algorithm are clearly visible with the throughput being more than 4x greater than the original clique algorithm. The last set of columns on the graph are from an Amazon c4.xlarge instance and show that the performance of each core on those systems is roughly the same as the Sandy Bridge i3-2120.

GPU Performance

Moving on to the GPUs, we’ve tested the throughput on a variety of different systems. Firstly, we’ve tested a variety of GTX580s on different motherboards and processors. As you would expect, for the most part the performance is governed by the GPU, but the exception is the fifth test system which is noticeably slower than the others. That card is sitting in a much older chassis with an older motherboard and hence is probably suffering from lack of backplane bandwidth to the GPU.

GPU benchmarks

The newer GTX960s perform extremely well on the Blaze calculations. We weren’t sure if they would, after the disappointment of the GTX680 which was noticeably slower than the 580 (data not shown). The difference is noticeable in the clique stages, but really stands out in the simplex calculations where a GTX960 is 50% faster than the GTX580s. By contrast, the high-end Tesla hardware is not a great performer on the Blaze OpenCL kernels. By all accounts the Tesla hardware is significantly faster than the consumer hardware on double precision workloads, but the Blaze code is all single precision and in that realm the cheap consumer hardware has an unbeatable price/performance advantage.
Finally, the GRID K520 is the hardware found on the Amazon g2.2xlarge and g2.8xlarge instances. As can be seen, it’s not a brilliant performer on the Blaze workload, being around the same speed as the Tesla on the fastclique algorithm but noticeably slower than all of the other cards tested on the simplex workload. However, it provides a nice test of GPU scaling: when running on a 4 times larger data set on all 4 GPUs of a g2.8xlarge instance, we observed substantially the same throughput as running the original data set on a single K520 GPU, showing that we can parallelise across multi-GPU systems with no loss of performance.

Cost efficiency on Amazon

Converting the throughput shown above, we can look at the cost of screening on the Amazon cluster with Blaze. The raw cost to screen a million molecules is shown in the table. Note that the actual costs will be somewhat higher, due to job overheads and data transfer costs.


The Amazon GPU solutions are noticeably cheaper for fastclique jobs, roughly cost-competitive for the clique runs, but the poor performance of the K520 on the simplex task means that it is significantly more expensive there. As a result, at the moment there’s no real impetus to use the Amazon GPU resources unless you can get them significantly more discounted than the CPU instances on the spot market.


New hardware is significantly faster at running Blaze than old stock, as would be expected. However, the speed increases are much lower than they have been in the past, with CPUs that are well past their best still performing adequately. On the GPU side, Blaze performs particularly well on commodity graphics cards, leaving few reasons for us to invest in dedicated GPU co-processing cards.

The cost of running a million molecule virtual screen on the Amazon cloud has never been cheaper. If tiered processing is used as is the default for Blaze then these screens can be performed for a very low cost indeed – less than $15 per million molecules for the processing costs.
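The cost arithmetic behind figures like this is straightforward to sketch. All the throughput and price numbers below are illustrative placeholders, not measured Blaze benchmarks or current Amazon prices.

```python
def cost_per_million(mols_per_hour, instance_price_per_hour, n_instances=1):
    """Raw compute cost to screen one million molecules, ignoring job
    overheads and data transfer (which, as noted above, add to this)."""
    total_throughput = mols_per_hour * n_instances
    hours = 1_000_000 / total_throughput
    return hours * instance_price_per_hour * n_instances

# Hypothetical: 20,000 molecules/hour/instance at a $0.25/hour spot price.
print(round(cost_per_million(20_000, 0.25), 2))
```

Note that adding instances shortens the wall-clock time but leaves the raw cost unchanged, which is why spot-price and per-core throughput, not instance count, drive the economics.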

Contact us for a free evaluation to try Blaze on your own cluster, or Blaze Cloud.

Virtual screening – how many conformations is enough?

One of the things that distinguishes 3D virtual screening methods (such as Cresset’s Blaze) from 2D methods such as fingerprints is that you have to start worrying about conformations. A molecule only has one ECFP4 fingerprint, but if it is flexible then it has lots of shapes and pharmacophores. How to deal with this is one of the fundamental questions in ligand-based virtual screening (LBVS).

A few methods such as some docking tools explicitly search the conformation space as part of the scoring algorithm, but most ligand-based methods rely on pre-enumeration of a set of conformations of the molecules to search. A few notionally 3D methods avoid this by using only a single conformation of each compound, usually that chosen by some 3D coordinate generation program such as Corina. I would argue that these methods are actually 2D methods in disguise, as the ‘3D’ properties they calculate are determined by the combination of the topology and the rules in the Corina database.

Assuming that we are properly considering conformations, the big question that requires answering is: how many? The naïve answer is ‘all of them’, but that turns out not to be that useful for three reasons. Firstly, if you define ‘conformer space’ as the set of structures sitting in potential energy local minima, then the space depends quite strongly on the force field and solvation model. Secondly, although this definition works for small and simple molecules, once you move to larger and more flexible molecules it becomes less useful: rotating a central bond in the molecule by a few degrees may have virtually no impact on the conformational energy but may move the ends by enough to give a significantly different pharmacophore. The conformation space you need to consider thus has to be wider than just the potential energy minima. Thirdly, there are just too many: the number of conformations goes up exponentially with the number of rotatable bonds.
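The third point is easy to make concrete. Assuming, purely for illustration, three staggered minima per rotatable bond (the real count depends on the bond types involved), the naïve conformer count explodes quickly:

```python
def naive_conformer_count(n_rotatable_bonds, minima_per_bond=3):
    """Upper-bound conformer count if every rotatable bond independently
    adopts `minima_per_bond` minima."""
    return minima_per_bond ** n_rotatable_bonds

for n in (2, 5, 10):
    print(n, naive_conformer_count(n))
```

Ten rotatable bonds already gives tens of thousands of combinations, which is why every practical system samples rather than enumerates.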

As a result, most LBVS systems perform a limited sampling of conformation space, generally by capping the number of conformations that are considered for a molecule. As the computational cost increases with the cap, we now need to decide what that should be. A recent Schrödinger paper by Cappel et al. found that this limit is actually very low: although a full exploration of conformation space is necessary to get good results in 3D-QSAR, you only need a fast and limited conformational search to get good LBVS performance.

Interesting – but is it true?

In the past we have done several analyses of Blaze performance vs. the number of conformations, as a result of which we raised the recommended maximum number of conformations from 50 to 100 quite a few years back. The Schrödinger paper seems to contradict this recommendation, and so far we’ve never investigated Blaze performance with really small numbers of conformations, so it was time to have another look.

To do this, we grabbed a few of the DUD data sets that we examined in the original FieldScreen publication, and re-ran them with differing numbers of conformations for the actives and decoys: 5, 10, 20, 50, 100 and 200. The results are shown in Figure 1, firstly as the ROC AUC value, and then as the cluster-corrected ROC AUC value, where the results are assessed in terms of the number of chemotypes retrieved rather than the number of actives.

Performance of Blaze on DUD data set - ROC AUC
Performance of Blaze on DUD data set - ROC AUC-CA
Figure 1 – Blaze performance on a DUD subset
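For reference, the ROC AUC metric plotted above can be computed from a ranked hit list via the rank-sum (Mann-Whitney) identity: the AUC is the probability that a randomly chosen active outscores a randomly chosen decoy. The scores and labels below are toy data, not the DUD results.

```python
def roc_auc(scores, labels):
    """scores: similarity scores; labels: 1 for active, 0 for decoy.
    No tie handling, for brevity."""
    pairs = sorted(zip(scores, labels))        # ascending by score
    n_act = sum(labels)
    n_dec = len(labels) - n_act
    # Sum of 1-based ranks of the actives.
    rank_sum = sum(rank for rank, (_, lab) in enumerate(pairs, 1) if lab == 1)
    return (rank_sum - n_act * (n_act + 1) / 2) / (n_act * n_dec)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   1,   0,   1,   0,   0]
print(roc_auc(scores, labels))
```

The cluster-corrected variant replaces "number of actives retrieved" with "number of chemotype clusters retrieved", so a dozen near-identical congeners count once rather than twelve times.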

Well, this is pretty conclusive. The VS performance doesn’t show any consistent trends with numbers of conformations: a few searches get better, but a few get worse. Overall the average performance is much the same (or possibly slightly worse) with 200 confs as it is with 5 confs. Looks like Cappel et al. were right, after all.

Or were they? There are several big issues with using retrospective data sets such as DUD to do this sort of analysis. The first is that the actives and inactives come from different chemical spaces. The actives are chosen from J. Med. Chem. papers and tend to be highly-optimised compounds with a regrettable lack of structural diversity, and the inactives are chosen from some chemical space (ZINC, in the case of DUD) with some effort put in to match the properties of the inactives to the actives (number of heavy atoms, etc.). However, it is very difficult to ensure that the decoys actually match the actives (and harder still to define properly what ‘matching’ should actually mean).

The PDE5 data set illustrates the problem. From the graphs, it can be seen that the ROC AUC declines markedly as we increase the number of conformations. The reason becomes apparent when we examine the distribution of rotatable bonds in the data set (Figure 2). The orange bars are the decoys, while the blue bars are the actives. As can be seen, the actives are significantly more rigid than the decoys. As a result, when you increase the number of conformations, the scores for the decoys are more likely to increase (as you find conformers that better match the query) than the scores for the more rigid actives.

Figure 2 – Rotatable bond distribution for PDE5

The other data set with decreasing performance with increasing conformer count is HIV RT. Here the distribution of rotatable bonds is much more similar when comparing actives and decoys. However, let’s examine more closely the structures of the first 12 hits in the 5-conformation data set (Figure 3). The first hit is the query molecule itself, and 8 of the top 12 hits are very structurally similar to it. Although some of these have a high rotatable bond count (e.g. the third hit), matching just the core pyrimidinedione and the pendant benzyl is enough to give a high similarity score; the remaining flexible portions of the molecule are not very large and hence only have a modest contribution to the score. As a result, a high score is obtained for the third hit even though the flexible glycol chain is not in the optimal conformation.

Figure 3 – HIVRT top-ranked hits

The remaining 4 of the top 12 hits all have very little flexibility (1-2 rotatable bonds), and hence are just as easy to find in the 5-conformer data set as the 200-conformer one. The apparent successful performance of the HIV RT search with only 5 conformations is thus misleading: we find the molecules from the same series as the query and the rigid actives earlier than we do in the 200-conformer data set, and these results mask the better performance for the more ‘interesting’ actives in the latter.

The other data sets all tell a similar story. For example, consider the EGFR data set, on which superb performance is obtained even with only 5 conformations. Looking at the actives, they all seem to consist of a phenylamino-substituted bi- or tricyclic heterocycle (Figure 4). In fact, of the 396 actives, 329 match the SMARTS pattern ‘c-[NH]-c’. Even a minimal conformational sampling is likely to produce a conformation where the two aromatic rings are in roughly the correct orientation, and as a result superb screening results can be obtained even with the 5-conf data set. However, this is really an artefact of the lack of chemotype diversity rather than a true reflection of the performance.

Figure 4 – EGFR top-ranked hits

So, what does this experiment (and the analogous ones carried out by Cappel et al.) tell us? Unfortunately, they tell us more about how difficult it is to set up a retrospective virtual screening test set than anything about the optimal parameters to use. The lack of diversity in actives gleaned from the literature, together with the difficulty in property-matching decoys to the actives, mean that the results from this sort of analysis should be taken with a very large pinch of salt indeed. It should be pointed out that Cappel et al. used all 39 of the DUD data sets in their analysis: in our FieldScreen paper we showed that only 15 of these have enough chemotype diversity to be useful for testing LBVS systems, and as we have seen here even those 15 aren’t diverse enough to make robust conclusions.

In their paper Cappel et al. state that they obtained almost identical enrichment values whether they used the correct bioactive conformation or the lowest energy conformation of each query molecule (as determined by their force field). If the enrichment performance does not depend on the query conformation, then by definition either your method isn’t a 3D method and is doing a 2D search in disguise, or the validation data you are using is biased. The low diversity of actives in the majority of the DUD data sets argues for the latter.

I believe that the correct way to assess the relative and absolute performance of VS methods is to test them on actual screening data, not on artificial test sets. Big pharma companies have decades of HTS data waiting to be analysed: the hits are from the same collection as the decoys, they are weak binders rather than optimised compounds, and there are lots of them. Unfortunately, this data is not available to independent researchers. Researchers in big pharma, who do have access to this data, tend for some reason to publish VS performance analyses on literature data sets instead. If any big pharma researchers reading this post fancy working with us on examining the issue of conformation space sampling for virtual screening on their HTS data, then please do let us know.