Comparing Forge’s command line utility to Blaze – which one should you use?

Here at Cresset we’re very interested in ligand-based virtual screening – it’s been a focus of the company ever since we started more than seventeen years ago. In that time there have been many advances and refinements of the techniques for both ligand-based virtual screening and structure-based methods. We have stuck by our fundamental principle that ligand similarity based on both electrostatics and shape is an excellent way to sort the wheat from the chaff. The results obtained by our services division, who have run more than 200 virtual screening campaigns with a better than 80% success rate, are testament to that.

Difference between falign and Blaze

One of the things our customers ask from time to time is which application should they be using to do virtual screening. The simple answer is that there are two, Forge (and its command-line utility ‘falign’) and Blaze, and the differences are readily apparent.

In falign, you can generate conformations for a large set of molecules, align them to one or more references, and rank them by the similarity score. You also have the option to bias the alignments and scores by adding field constraints, pharmacophore constraints, and protein excluded volumes.

By way of contrast, in Blaze, you can generate conformations for a large set of molecules, align them to one or more references, rank them by the similarity score, and… ok, point taken. So, given that falign and Blaze apparently do the same thing…

Why falign and Blaze?

The answer is scale. As anyone who’s ever played with large data sets knows, doing calculations on a few hundred compounds is fundamentally different to doing them on tens of millions of compounds. Once you are working at large scale, seemingly trivial operations such as filtering data sets become much more difficult if you want to be efficient. Blaze was designed from the ground up to work with large data sets of 10^7 molecules and more, with an emphasis on maximizing throughput on a computational cluster. Forge/falign, on the other hand, are aimed much more at small-scale work, enabling simple screening or analysis of relatively small sets of compounds where the big iron of Blaze is overkill.

Data preparation

As an example, let’s look at the preparation of the data set in the two software suites. In falign, this is relatively simple: you provide the compounds to falign in 2D or 3D form, it assigns protonation states as necessary, and computes conformations on-the-fly if required before aligning to the query:

Falign has a secondary mode for use when aligning structurally-related compounds, which ensures that the common substructure within the dataset is perfectly matched:

Blaze, on the other hand, is much more sophisticated in its conformer handling. The average user of Blaze has multiple data sets that they want to screen (in-house compounds, vendor screening compounds, virtual libraries, custom collections), and these often have significant overlap. In addition, these data sets are usually reused multiple times for multiple virtual screens. As a result, Blaze has a sophisticated deduplication and precomputation pipeline that maximizes computational efficiency. The Blaze workflow looks more like this:

Any given chemical structure is only present once within Blaze: it may have multiple different names, and be present in multiple collections, but we’ll only precompute its conformations once and we’ll only align it once in any given screen. The conformer computation pipeline is heavily optimized for performance: we’ve done extensive studies on our conformer generation algorithm XedeX to find the optimal trade-off between conformation space coverage, rejection of higher-energy conformations, calculation time and number of conformations required. In addition, we’ve developed a special-purpose file format that is highly compressed (less than 13 bytes per atom on average, including coordinates, atom types, element, charge and formal charge) while being unbelievably fast to parse.
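The "stored once, known by many names" idea can be illustrated with a small Python sketch. This is not Blaze's implementation: the `StructureRegistry` class and its methods are hypothetical, and the sketch assumes every compound arrives with a canonical structure key (for example a canonical SMILES or an InChIKey from a cheminformatics toolkit) that is identical for identical structures.

```python
from dataclasses import dataclass, field


@dataclass
class StructureRecord:
    """One unique chemical structure and everything that refers to it."""
    key: str                                        # assumed canonical structure key
    names: set = field(default_factory=set)         # all names seen for this structure
    collections: set = field(default_factory=set)   # all collections containing it
    conformers_built: bool = False


class StructureRegistry:
    """Hypothetical registry in which each structure is stored only once."""

    def __init__(self):
        self._records = {}

    def add(self, key, name, collection):
        """Register a compound; return True only if the structure is new."""
        is_new = key not in self._records
        record = self._records.setdefault(key, StructureRecord(key))
        record.names.add(name)
        record.collections.add(collection)
        return is_new

    def pending_conformer_jobs(self):
        """Keys of structures whose conformations have not been built yet."""
        return [k for k, r in self._records.items() if not r.conformers_built]

    def mark_built(self, key):
        """Record that conformations now exist for this structure."""
        self._records[key].conformers_built = True

    def __len__(self):
        return len(self._records)


registry = StructureRegistry()
first = registry.add("CCO", "ethanol", "in-house")
second = registry.add("CCO", "VENDOR-0001", "vendor")   # same structure, new name
third = registry.add("c1ccccc1", "benzene", "vendor")
```

In this toy run three compounds are registered but only two structures are stored, so only two conformer jobs would be queued regardless of how many collections list each compound.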

Blaze has a multiple-step pipeline to filter the data set, so that the full 3D electrostatic shape and alignment algorithm is only applied to molecules that are likely to have a high score. For extremely large data sets there’s an initial filter by FieldPrint, an alignment-free fingerprint method that gives a crude measure of electrostatic similarity. The molecules that pass the filter then go into an ultrafast version of our 3D alignment and similarity algorithm, and the full similarity algorithm is applied only to the best 10% or so of these. As a result, Blaze can chew through millions of molecules very quickly on even a modest cluster. The processing capability of Blaze is further enhanced by the fact that there’s a GPU version which is even faster.
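The shape of such a staged filter can be sketched in a few lines of Python. This is only an illustration of the cascade idea under stated assumptions: the three scoring callables are hypothetical stand-ins for the fingerprint, fast and full methods, the fingerprint cut-off is arbitrary, and the 10% default simply mirrors the "best 10% or so" mentioned above.

```python
import math


def staged_screen(molecules, fingerprint_score, fast_score, full_score,
                  fingerprint_cutoff, keep_fraction=0.10):
    """Apply progressively more expensive scores to progressively fewer molecules."""
    # Stage 1: cheap, alignment-free filter over the whole data set.
    stage1 = [m for m in molecules if fingerprint_score(m) >= fingerprint_cutoff]

    # Stage 2: fast approximate 3D score; keep only the best fraction.
    stage1.sort(key=fast_score, reverse=True)
    n_keep = math.ceil(len(stage1) * keep_fraction) if stage1 else 0
    stage2 = stage1[:n_keep]

    # Stage 3: expensive full score, computed for the survivors only.
    scored = [(full_score(m), m) for m in stage2]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy usage: "molecules" are integers and every score is derived from the value.
full_calls = {"count": 0}


def toy_full_score(m):
    full_calls["count"] += 1      # count how often the expensive step runs
    return m / 100


hits = staged_screen(range(100), lambda m: m, lambda m: m, toy_full_score,
                     fingerprint_cutoff=50)
```

In the toy run the expensive function is called for 5 of the 100 inputs, which is the point of the design: most of the data set never reaches the costly step.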

Small versus large data sets

So, falign is designed for the simple use case on small sets of molecules, while Blaze is aimed at maximum computational and I/O efficiency on very large data sets. There is another important difference between the two. As anyone who’s been in charge of maintaining a virtual screening system knows, keeping it up to date is often a painful and thankless task. It’s bad enough keeping up with the weekly additions to the internal compound collection, but keeping track of updates to external vendors’ collections is difficult: not only are new compounds being added but old ones are being retired. Blaze makes handling this situation easy. You simply provide Blaze with the new set of compounds that you want to be in the collection, and Blaze will automatically handle the update.

Any new compounds will be added, no-longer-available compounds will be marked and removed from the screening process, and any unchanged compounds will be left alone. This is far more computationally efficient than fully rebuilding the conformations for everything. Blaze can even be directly connected to your internal compound database, so that the Blaze collection holding your in-house compounds is always right up to date.
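The bookkeeping behind an incremental update of this kind is essentially a set comparison. The sketch below is a minimal illustration, not Blaze's code: `plan_collection_update` is a hypothetical helper, and it assumes each compound is identified by a canonical structure key.

```python
def plan_collection_update(current, incoming):
    """Compare the keys already in a collection with a newly delivered set."""
    current, incoming = set(current), set(incoming)
    return {
        "add": incoming - current,         # new compounds: conformations needed
        "retire": current - incoming,      # no longer available: drop from screens
        "unchanged": current & incoming,   # already processed: nothing to redo
    }


plan = plan_collection_update(
    current={"cmpd-A", "cmpd-B", "cmpd-C"},
    incoming={"cmpd-B", "cmpd-C", "cmpd-D"},
)
```

Only the compounds in `plan["add"]` would need new conformations, which is why an update of this kind is much cheaper than rebuilding the whole collection.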

Given how great Blaze is at handling virtual screening, why would you ever want to use falign?

Blaze is optimized for throughput and computational efficiency, but the downside of this is latency. If you have a set of compounds you want to align and score in Blaze, you have to upload them, wait for Blaze to process them and build the conformations, wait for Blaze to build its indices, initiate a search, and wait for it to be submitted to your cluster queueing system. There are five or ten minutes of latency in all of this, which is fine for a million molecules but is overkill if you have only one hundred. Falign, by contrast, will start work straight away on your local machine with no waiting at all.

The answer to the falign vs Blaze question, then, is largely a question of scale. Got a dataset of a million molecules that you want to run repeated virtual screens against? Blaze is just the ticket. Got a small set of compounds that you want to align and score as a one-off? Forge and falign are just what you need. For our in-house work we tend to find that the tipping point occurs around a few thousand molecules. Falign can easily chew through this many in an hour or so (especially if plugged into your computing cluster using the Cresset Engine Broker). However, if there are more compounds than this or we’re going to want to run multiple queries, then Blaze it is. Since Blaze is accessible through the Forge front end, and both are accessible through KNIME and Pipeline Pilot, it’s as easy as pie to pick the right tool for the job.

Try on your project

Request a free evaluation to try out Forge or Blaze for your small or large-scale virtual screening needs. Don’t have a cluster? Blaze Cloud and Blaze AWS provide simple ways to access cloud resources to do the number crunching for you.


Tversky similarity in field-based virtual screening

In the releases of Blaze V10.3 and Forge V10.5 we introduced new similarity metrics alongside the new capabilities to manually weight the similarity function using pharmacophore constraints. With the introduction of Tanimoto and particularly Tversky measures of similarity, a new range of experiments is available to help you tailor the results you get. In this post I will use the Tversky similarity to perform substructure and superstructure type searches using Blaze. These new options are also available in Forge.

Figure 1: Blaze results can be tailored to generate the type of results that interest you, from substructure-like to pure chemotype switching or superstructure-like.

Similarity in Blaze

Blaze uses the field point patterns of molecules combined with their shape to align and score a ‘database’ of molecules against a ‘reference’ or ‘query’ that is usually a known active. In this context the default Dice similarity has worked well. It returns active molecules that are similar in size to the query but is not too size-dependent, allowing Blaze to find hits that are smaller than the reference. In most cases this is exactly what you want – a ligand the same size or smaller than the reference that maintains most of the potential sites of interaction. Previously, the scoring algorithm could be altered to generate more substructure-like or more superstructure-like results, but this was complex to set up and sub-optimal in performance. In Blaze V10.3 the new Tversky similarity makes these searches more accessible. A look at the average MW of the first 100 compounds returned using the standard Dice and the new Tversky options highlights the difference:

Table: average MW of the first 100 compounds returned using different similarity metrics. Database of 35,283 positively charged ChEMBL compounds with 5-30 heavy atoms on the Blaze demo server. Query MW: 319. Database average MW: 318.

Similarity metric     Average MW of first 100 compounds
Dice                  314
Tanimoto              313
Tversky, α 0.05       192
Tversky, α 0.95       363


Substructure searches with Blaze

The Tversky metric has two parameters, α and β. Using the Tversky similarity option in Blaze, and setting α to 0.05 and β 0.95, results in a substructure-like search. In fact, we don’t deal with structures so this actually equates to a ‘sub-field’ search. It returns molecules that contain a field pattern that is contained within the query – i.e. field fragments of the query. This is useful where you have a large known active but want to screen or design a fragment library of smaller molecules that match parts of the query.

Figure 2: Search query and 3 selected results (ranks 3, 5, 11) from a sub-field search using the A2C active from the Fragment hopping with Blaze case study. Each result includes some features of the search query but also omits at least one functional group.

Superstructure searches with Blaze

Setting a Tversky similarity with α at 0.95 and β at 0.05 generates a ‘super-field’ search. That is, molecules that contain a field pattern similar to the query are scored highly whether or not they have additional field points. This is useful for growing hits from a fragment screen or in other situations where you do not want to penalize results for having additional functionality to the query. As hits could contain the query at any position and any orientation, this option works particularly well when combined with field, pharmacophore or excluded volume constraints. For example, using an excluded volume will direct the results towards the available space around the query. Equally, using field constraints or the new pharmacophore constraints will ensure that results contain the interactions that you know to be important.
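To make the asymmetry of the two settings concrete, here is a toy Python sketch of the textbook set-based Tversky index. It is an illustration only: Blaze scores continuous field and shape overlaps rather than sets of labels, the feature names are invented, and the convention that α weights features found only in the query while β weights features found only in the database molecule is an assumption chosen to match the behaviour described above.

```python
def tversky(query, candidate, alpha, beta):
    """Set-based Tversky index between a query and a candidate feature set."""
    common = len(query & candidate)
    only_query = len(query - candidate)        # query features the candidate lacks
    only_candidate = len(candidate - query)    # extra features on the candidate
    denominator = common + alpha * only_query + beta * only_candidate
    return common / denominator if denominator else 0.0


# Invented feature labels, for illustration only.
query = {"donor", "acceptor", "cation", "hydrophobe_1", "hydrophobe_2"}
fragment = {"donor", "cation"}                                # part of the query
grown = query | {"extra_acceptor", "extra_hydrophobe"}        # query plus extras

# 'Sub-field' setting: missing query features are cheap, extras are expensive.
sub_fragment = tversky(query, fragment, alpha=0.05, beta=0.95)   # about 0.93
sub_grown = tversky(query, grown, alpha=0.05, beta=0.95)         # about 0.72

# 'Super-field' setting: extras are cheap, missing query features are expensive.
super_fragment = tversky(query, fragment, alpha=0.95, beta=0.05)  # about 0.41
super_grown = tversky(query, grown, alpha=0.95, beta=0.05)        # about 0.98
```

With α = β = 0.5 this formula reduces to the Dice coefficient, and with α = β = 1 to the Tanimoto coefficient, which is why the three metrics sit naturally side by side.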

Figure 3: Search query and 3 selected results (ranks 2, 4, 6) from a super-field search using A2C active from the Fragment hopping with Blaze case study and an expanded database to include larger fragments. Each result contains a similar field pattern to the query plus additional features or functional groups.

Tanimoto similarity in place of Dice

In addition to Tversky, the new versions of Blaze and Forge offer the opportunity to change from the default Dice similarity to Tanimoto. This will make a difference to how the individual elements of the score are combined, resulting in a small change in the order that molecules are returned in a virtual screening experiment, but the two experiments are highly correlated. The effect is somewhat complicated to describe and hence will be explored in a future post.

Figure 4: Plot of rank returned using Tanimoto similarity vs Dice similarity for ~10,600 compounds. The results are highly correlated, with r² = 0.96.


The new similarity metrics increase the range of experiments that can be easily performed within Blaze. Using the new metrics in Forge enables refinement or enhancement of Blaze results using the same metrics. Sub-field and super-field searches in particular should prove useful for fragment-based discovery.

If you would like to try the Blaze interface, or study the effects of the new similarity metrics, then sign up for a Blaze demo server account.

To try Blaze on your datasets or your projects, request a full evaluation.

Blaze used in discovery of allosteric modulators of the high affinity choline transporter

A variety of neurological conditions can potentially be treated through the stimulation of cholinergic neurotransmission. The choline uptake into certain neurons is mediated by the choline transporter (CHT), which is well-characterized but otherwise unexplored as a potential drug target.

A team consisting of scientists from Pfizer, Neusentis, Nanion Technologies, and Kissei Pharmaceutical Company used two compound sets: (1) a specially created set of 887 molecules derived from the full Pfizer compound screening collection using Cresset’s virtual screening tool Blaze; (2) 2,753 molecules from the Pfizer Chemogenomic Library. From these sets they were able to identify nine active small molecules that modulate CHT.

This work will enable them to test the hypothesis that positive modulation of CHT will enhance activity-dependent cholinergic signaling. Read the full paper Discovery of Compounds that Positively Modulate the High Affinity Choline Transporter.

Using Blaze to develop a screening set from a corporate compound library

The team had identified two CHT modulators from the literature: one CHT positive allosteric modulator and one CHT negative allosteric modulator. Each of these was used within Blaze to search the full Pfizer compound screening collection for compounds with similar electrostatic and shape properties and therefore potentially similar biological activity.

The computational team kept the top 500 compounds from each virtual screen, ranked by the Blaze scoring function, to form a set of 1,000 compounds. This set was filtered based on compound availability and the removal of chemically unattractive groups, resulting in a test set of 887 compounds. This library was screened in assays, as detailed in the paper.
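The selection logic described here (take the top N of each ranked list, merge, then filter) can be sketched as follows. This is a hedged illustration rather than the team's actual workflow: `build_screening_set` and `is_acceptable` are hypothetical names, the compound identifiers are invented, and the de-duplication step is an added safeguard for the general case where two hit lists overlap.

```python
def build_screening_set(ranked_screens, top_n, is_acceptable):
    """Merge the top_n compounds of each ranked hit list, then apply a filter."""
    selected = []
    seen = set()
    for ranked in ranked_screens:
        for compound in ranked[:top_n]:
            if compound not in seen:        # keep each compound only once
                seen.add(compound)
                selected.append(compound)
    # Placeholder filter, e.g. availability and unattractive-group checks.
    return [c for c in selected if is_acceptable(c)]


# Toy usage with invented identifiers, best-scoring compound first in each list.
pam_screen = ["A", "B", "C", "D"]
nam_screen = ["E", "B", "F", "G"]
unavailable = {"C"}

screening_set = build_screening_set(
    [pam_screen, nam_screen],
    top_n=3,
    is_acceptable=lambda c: c not in unavailable,
)
```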

Identification of previously unknown active and structurally distinct molecules

Five compounds of interest were identified from the 887-compound test set created using Cresset’s Blaze. Three of these were confirmed as positive allosteric CHT modulators and two as negative allosteric modulators of CHT function. A further four compounds of interest were identified from the 2,753 molecules from the Pfizer Chemogenomic Library. The compounds of interest are shown in Table 2 ‘Tool compound data’ which forms part of the paper.

This paper demonstrates the high value of virtual screening in focusing a screening campaign. The team successfully identified previously unknown active and structurally distinct molecules that could be used as tools to further explore CHT biology or as a starting point for further medicinal chemistry.

Selected images from Blaze results with purported CHT modulator seed molecules (PAM MKC-351 and NAM ML-352) (green) shown on the left and output molecules 1-5 shown on the right (grey). Fields are shown with positive (red), negative (cyan), van der Waals (yellow), and hydrophobic (orange) regions.

Conduct ligand-protein docking

A long-standing customer of Cresset Discovery Services asked us to identify new compounds that could be active at their protein target. We conducted ligand-protein docking to narrow down their 50k compound library to the best 1.5k compounds. The cost of the consulting project plus the chemistry for 1.5k compounds was about 20% of what it would have cost to buy and screen the entire 50k library.

Ligand-protein docking can be an excellent way to build up knowledge about the binding pocket. It can also form the basis for a virtual screen to identify new active compounds.

Cresset Discovery Services had been working with this customer on a particular ligand for some time, but there was very little information available about the protein target. There were homologues in the literature, but they were distantly related and nothing very similar had been crystallized.

Detailed preparatory work to model the protein active site

It was necessary to do a lot of modeling work to build up the relationship between the human target and the distantly related proteins available from the literature. We built sequence alignments and compared them, enabling us to build up 3D models of the target and its interaction with the ligand.

Some mutagenesis data was available on the known ligands, so we were able to use this to refine the 3D models and check that the correct residues were in the right places on the active site. This enabled us to define the active site for the ligands. We went on to calculate the energies for the protein-ligand interactions to make sure we had identified poses that made sense.

This was a complex system that required a great deal of protein preparation. This preparatory work was essential for successful docking and required expert knowledge, experience and skill.

Docking and virtual screening using different scenarios

At the end of this process we had a good model of the protein-ligand system. The next step was to remove the ligand and carry out docking.

Docking was first tested on the molecules that were known to bind to the target. This resulted in excellent retrieval rates, giving us confidence that the model would also be able to retrieve new compounds.

There were a number of different binding sites on the protein so we decided to carry out the virtual screening using different scenarios for the protein. We:

  • Kept the ligand intact in the binding site
  • Removed the ligand completely
  • Looked at partly bound situations and un-bound situations for each of the binding sites.

The customer provided us with a set of 50k ligands and we docked each of these against the binding pockets. A docking scoring system was used to rank the top 2k compounds from each of the screens.

Analyzing the results and compiling a purchasing list

The top 2k compounds from the four screens were analysed in detail. We visualized every one of the top 2k compounds and looked at each of the docking poses. The docking gave us good geometries for the ligands and we used Cresset software to check that the electrostatics made sense. Any compounds that were unlikely to bind well were rejected.

A final, ranked list was provided to the customer with a very high degree of confidence that it included compounds that were active at the protein target. They were able to procure about 75% of the compounds from the hit list, giving them a final set of 1.5k compounds to test.

An incredible saving in time and money

Carrying out virtual screening to focus the library in this way represented an incredible saving in time and money for our customer. The alternative approach would have been to buy and test the whole 50k compound set. Not only would the customer have needed to purchase all of the compounds, they would also have had to ship, store, plate and screen them, and would then still have had to analyse the results.

The estimated cost of doing this for all 50k compounds would have been about five times the cost of the combined tasks of the Cresset Discovery Services project plus buying and testing 1.5k compounds.

Cost of {buying and testing 50k compounds} = 5 × Cost of {Cresset Discovery Services project + buying and testing 1.5k hit list}

Contact us to find out how we can add value to your project.






Dr Martin Slater, Director of Consulting Services

What’s great about Lead Finder?

We recently announced our collaboration with BioMolTech, a small modeling software company best known for their docking software, Lead Finder. Cresset has been traditionally focused on ligand-based design, but as we expand our capabilities into more structure-based methods we realized that we would have to supply a robust and accurate docking method to our customers. So, why did we choose Lead Finder?


A graphical interface to Lead Finder will be included on our new structure-based design application.

The requirements for a great docking engine are simple to state: it needs to be fast and it needs to be accurate. The latter is by far the most important: nobody cares how quickly you got the answer if it is wrong! Our first question when evaluating docking methods was therefore to ask how good it was. This is actually a difficult question to answer, as there are several different definitions of ‘good’ depending on what you want: virtual screening enrichment? Good pose prediction? Accurate ranking of active molecules?

The first of these, virtual screening, is what most people think of when they think of docking success. Lead Finder has been validated on a wide variety of target classes and shows excellent enrichment rates (median ROC value across 34 protein targets was 0.94), even on targets traditionally seen as very hard such as PPAR-γ. The performance on kinases was uniformly excellent, with ROC values ranging from 0.86 for fibroblast growth factor receptor kinase (FGFR) to 0.96 for tyrosine kinase c-Src.
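For readers less familiar with the metric, a ROC value of this kind is usually the area under the ROC curve: the probability that a randomly chosen active is ranked above a randomly chosen decoy, so 0.5 is random and 1.0 is perfect. The sketch below computes it directly from that definition. It is a generic illustration with made-up scores, not the protocol used in the Lead Finder validation, and it assumes that higher scores are better (negate the scores first if lower docking energies are better).

```python
def roc_auc(active_scores, decoy_scores):
    """Area under the ROC curve via pairwise comparison of actives and decoys."""
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0       # active ranked above decoy
            elif a == d:
                wins += 0.5       # ties count as half
    return wins / (len(active_scores) * len(decoy_scores))


# Made-up scores for illustration only (higher = better).
actives = [9.1, 8.4, 7.7, 5.0]
decoys = [8.0, 6.2, 4.9, 4.1, 3.3]
auc = roc_auc(actives, decoys)    # 17 of the 20 active/decoy pairs are ordered correctly
```

The pairwise loop is quadratic in the number of compounds, which is fine for a demonstration; for large screens a rank-based formulation or a library routine would be the more practical choice.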


A series of SYK ligands docked to PDB 4yjq with crystal ligand shown in purple.

Pose prediction is of more interest to those working in the lead optimization phase, where assessing the likely bound conformation of a newly-proposed structure can be very helpful in guiding design. Here, too, Lead Finder performs well. On the widely-used Astex Diverse Set, used to test docking performance, Lead Finder produces the correct pose as the top-scoring result 82% of the time, which is comparable to other state-of-the-art methods (Gold, for example, gets 81% on the same measure). On a number of literature data sets testing self-docking performance Lead Finder finds the correct pose between 81 and 96% of the time, which is excellent.


Lead Finder includes dedicated modes for extra-precision and virtual screening experiments.

One of the most intriguing things about Lead Finder is the makeup of the scoring functions. In contrast to many other scoring functions which use heuristic or knowledge-based potentials, the Lead Finder scoring functions comprise a set of physics-based potentials describing electrostatics, hydrogen bonding, desolvation energy, entropic losses on binding and so on. Different scoring functions can be obtained by weighting these contributions differently: BioMolTech have found that the optimal weights for pose prediction differ slightly from those for energy prediction, for example. A separate scoring function has been developed which aims to compute a ΔG of binding given a correct pose. This is a difficult task, and the success of the Lead Finder function was demonstrated in the 2010 CSAR blind challenge, where the binding energy of 343 protein-ligand complexes had to be predicted ab initio. Lead Finder was the best-performing docking method in that challenge. BioMolTech are actively building on this excellent result with the aim of making robust and reliable activity predictions a standard outcome of a Lead Finder experiment.
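The idea of deriving several scoring functions from one set of energy terms by re-weighting them can be shown with a short sketch. Everything numeric here is hypothetical: the term names follow the list above, but the per-pose values and both weight sets are invented for illustration and are not Lead Finder's actual terms or parameters.

```python
def weighted_score(terms, weights):
    """Linear combination of per-pose energy terms using the given weights."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"pose is missing energy terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical energy terms for one docked pose (arbitrary units).
pose_terms = {
    "electrostatics": -3.0,
    "hydrogen_bonding": -2.0,
    "desolvation": 1.5,
    "entropy_loss": 1.0,
}

# Two hypothetical weight sets: same physics, different emphasis.
POSE_RANKING_WEIGHTS = {
    "electrostatics": 1.0,
    "hydrogen_bonding": 1.2,
    "desolvation": 0.8,
    "entropy_loss": 0.5,
}
ENERGY_WEIGHTS = {
    "electrostatics": 1.0,
    "hydrogen_bonding": 1.0,
    "desolvation": 1.0,
    "entropy_loss": 1.0,
}

pose_ranking_score = weighted_score(pose_terms, POSE_RANKING_WEIGHTS)
energy_score = weighted_score(pose_terms, ENERGY_WEIGHTS)
```

Keeping the terms fixed and changing only the weights means the same pose evaluation can serve different purposes, such as ranking poses or estimating binding energy.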

Cresset are proud to be the worldwide distributors for Lead Finder. It is available today as a command-line application and will be built into Cresset’s upcoming structure-based drug design workbench.

Request an evaluation of Lead Finder.

What’s in the CDS virtual screening toolbox?

Cresset is very well known for providing fast and accurate ligand-based virtual screening through Blaze. We have now added the Lead Finder docking engine to our virtual screening toolbox, giving Cresset Discovery Services (CDS) the most comprehensive virtual screening capabilities available anywhere in the industry.

Based on an informal survey of our contacts and customers, I estimate that something like 50% of all current pharma SME projects are ‘structure enabled’. Lead discovery and lead optimization are driven through the use of in-house structures, public structures (typically from the PDB) and homology models. These structures inform lead optimization programs by explaining observed SAR and providing feedback and a detailed context for the design of further analogues.

CDS routinely uses the Cresset software Blaze for ligand-based virtual screening. Although we had access to structure-based methods, we are pleased to have brought Lead Finder in-house, giving us full capability in conducting ligand-protein docking.

Ligand-based virtual screening with Blaze

Virtual screening with Blaze remains one of the most consistently requested projects for CDS. What makes Blaze extremely useful for our customers is:

  • Virtual screening is probably the only way to really sample adequate chemical diversity
  • Virtual screens are far more cost effective than wet HTS
  • Excellent enrichments can be achieved
  • The chemotype diversity in the output is second to none.

Blaze also relies on two very simple premises:

  1. A bioactive conformation encodes, in its shape and electrostatic field, the properties, recognition features and solvation pattern optimised for interaction with its protein target site.
  2. A molecule conformation with increasing ‘shape and field’ similarity to that bioactive conformation has an increasing probability of also being active.

So, the key determinant of real activity among the compounds in a hit list (other than whether the query was truly the ‘bioactive conformation’) is often just how relevant the matching hit conformation is and how well it is represented in that molecule’s conformer population. This is fundamentally why our ligand-centric screening invariably works extremely well. The fact that a molecule can adopt a similar shape, and project the same electrostatic patterns, from a completely different chemical architecture leads to a very diverse output.

Structure-based virtual screening with Lead Finder

The Lead Finder software has been developed to provide cutting-edge docking for an array of typical tasks, from high-throughput virtual screening to best-in-class prediction of bioactive conformations to accurate prediction of binding energies. In combination with the companion Build Model protein preparation tool, Lead Finder has been shown to match or outperform the historically leading docking solutions.

When preparing ligands for virtual screening in Blaze, CDS scientists use modeling to help define the best ‘hand-crafted’ estimate of a bioactive conformation, based on the widest data for any given system. We apply the same care to exploring and preparing protein targets prior to structure-based virtual screens. We take advantage of three main approaches. Firstly, Lead Finder includes the excellent Build Model protein preparation tool. Secondly, we are privileged to be able to model proteins and ligands using the same proprietary XED force field used to give the accurate electrostatics that all Cresset software is based on. Finally, at CDS we have access to the latest Cresset software that is still under development. This gives us capability to provide protein electrostatic field maps and water analysis, providing a very reliable starting position for structure-based virtual screening.


Lead Finder uses a stochastic ligand sampling workflow, with conformations generated on-the-fly, and a genetic algorithm for processing these into pools of the best docking poses. Multiple interaction grids are generated from the protein target and combined to define a scoring system for poses. More importantly, the scoring method has been shown to outperform some of the more conventional docking engines currently available commercially.

Structure-based or ligand-based?

What are the advantages of having structure-based and ligand-based virtual screening?  And how do we choose which is the best approach for a project?

Ligand-based virtual screening is less computationally intensive, making it a preferred option when there is a known ligand available. An average protein of 400 amino acids has over 20,000 heavy atoms and 9,600 bonds and in excess of 50 charges, making it a more challenging system to model.

However, even when there is a known ligand there are some situations when a ligand-based virtual screening is not viable, such as when the known ligand does not exploit all the interactions available in an active site or when a protein has an unattractive orthosteric site and attractive allosteric sites with no known ligands. In these cases, we prefer to use a structure-based method.

In the case of protein-protein interaction sites and protein-DNA/RNA sites, Blaze can take DNA and protein fragments as a template in place of a ligand. However, it is useful to have a structure-based approach available for comparison.

In fact, we often find it useful to combine different virtual screening techniques. In lead discovery, one of the key requirements for virtual screening is to maximise the diversity of hits returned.  All virtual screening techniques, be they ligand-based or structure-based, are probabilistic techniques in that they may be used to increase the likelihood of getting hits from a wet screen. No technique guarantees to give absolute binding energies (at least not in the context of virtual screening on any realistic size of screening library), but they do give good rank ordering of compounds and can, therefore, be used as a means of selection and prioritisation.

Ligand-based techniques, whether 2D or 3D, are algorithmically distinct from structure-based techniques such as docking and, therefore, give different rankings to compounds. Different approaches return different hits and the results can be combined into an enriched final list.

Combining the results of structure-based and ligand-based techniques provides further diversity, leading to better hit rates and more interesting hits.

A one-stop shop for virtual screening

By combining the strengths of Blaze in the ligand-based world with Lead Finder for docking, CDS now has the most comprehensive virtual screening capabilities available anywhere in the industry. Both Blaze and Lead Finder are available to purchase as software or as a service through CDS. CDS is now truly a one-stop shop for virtual screening and indeed very much more.

Download a free evaluation of Lead Finder or access the Blaze demo server.

Affordable virtual screening with Blaze: Benchmarks


We released BlazeGPU a couple of years ago, allowing the full power of the Blaze virtual screening system to be used on a few consumer graphics cards rather than a full-scale Linux cluster. Since then, graphics cards and CPUs have only got faster, so we decided that it was time to update our benchmarks and see how well all of the new hardware performs.

For these benchmarks we took a random subset of 4,000 molecules from our in-house Blaze data set and searched with a medium-sized query molecule. The molecules in the data set average 80 conformers each. We’ve run with three different search conditions: the full slow-but-accurate simplex algorithm, the standard clique algorithm and the new fastclique algorithm. All of these were run with 50% fields and 50% shape.

CPU performance

Firstly, the CPU benchmarks. All of these are single-core performance, but with all cores loaded so that we’re not benefitting from Intel Turbo Boost. In most cases Blaze will be saturating all cores, so this is representative of real-world performance. Note that the vertical axis is on a log scale.

CPU benchmarks

As can be seen, there’s a significant performance difference between the older CPUs at Cresset (such as the Q6600) and the newer Ivy Bridge i7-3770K chips, but not nearly as much as you would expect given that the Q6600s are around 7-8 years old at this point. The significant speed improvements of the fastclique algorithm are clearly visible with the throughput being more than 4x greater than the original clique algorithm. The last set of columns on the graph are from an Amazon c4.xlarge instance and show that the performance of each core on those systems is roughly the same as the Sandy Bridge i3-2120.

GPU Performance

Moving on to the GPUs, we’ve tested the throughput on a variety of different systems. Firstly, we’ve tested a variety of GTX580s on different motherboards and processors. As you would expect, for the most part the performance is governed by the GPU, but the exception is the fifth test system which is noticeably slower than the others. That card is sitting in a much older chassis with an older motherboard and hence is probably suffering from lack of backplane bandwidth to the GPU.

GPU benchmarks

The newer GTX960s perform extremely well on the Blaze calculations. We weren’t sure if they would, after the disappointment of the GTX680, which was noticeably slower than the 580 (data not shown). The difference is noticeable in the clique stages, but really stands out in the simplex calculations, where a GTX960 is 50% faster than the GTX580s. By contrast, the high-end Tesla hardware is not a great performer on the Blaze OpenCL kernels. By all accounts the Tesla hardware is significantly faster than the consumer hardware on double-precision workloads, but the Blaze code is all single precision, and in that realm the cheap consumer hardware has an unbeatable price/performance advantage.

Finally, the GRID K520 is the hardware found on the Amazon g2.2xlarge and g2.8xlarge instances. As can be seen, it’s not a brilliant performer on the Blaze workload, being around the same speed as the Tesla on the fastclique algorithm but noticeably slower than all of the other cards tested on the simplex workload. However, it provides a nice test of GPU scaling: when running a data set four times larger across all four GPUs of a g2.8xlarge instance, we observed substantially the same per-GPU throughput as when running the original data set on a single K520 GPU, showing that we can parallelise across multi-GPU systems with no loss of performance.
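
For readers who want to repeat this kind of scaling check on their own hardware, the comparison above can be expressed as a weak-scaling efficiency. The sketch below is illustrative only: the function name and the timings are hypothetical placeholders, not measurements from our benchmarks.

```python
def scaling_efficiency(single_gpu_seconds, multi_gpu_seconds, n_gpus, data_scale):
    """Return weak-scaling efficiency; 1.0 means per-GPU throughput is unchanged.

    single_gpu_seconds: wall time for the original data set on one GPU
    multi_gpu_seconds:  wall time for the scaled data set on n_gpus GPUs
    data_scale:         how many times larger the scaled data set is
    """
    single_throughput = 1.0 / single_gpu_seconds       # data sets per second, one GPU
    multi_throughput = data_scale / multi_gpu_seconds   # data sets per second, all GPUs
    return multi_throughput / (n_gpus * single_throughput)


# Hypothetical timings: 600 s on one GPU, 610 s for 4x the data on 4 GPUs.
print(f"{scaling_efficiency(600.0, 610.0, n_gpus=4, data_scale=4):.3f}")
```

A value close to 1.0 corresponds to the "no loss of performance" behaviour described above.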

Cost efficiency on Amazon

Converting the throughput shown above, we can look at the cost of screening on the Amazon cluster with Blaze. The raw cost to screen a million molecules is shown in the table. Note that the actual costs will be somewhat higher, due to job overheads and data transfer costs.

Cost efficiency on Amazon

The Amazon GPU solutions are noticeably cheaper for fastclique jobs, roughly cost-competitive for the clique runs, but the poor performance of the K520 on the simplex task means that it is significantly more expensive there. As a result, at the moment there’s no real impetus to use the Amazon GPU resources unless you can get them significantly more discounted than the CPU instances on the spot market.
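
As a rough sketch of the conversion from throughput to cost described above, the arithmetic looks like the following. The function and parameter names are ours, and the throughput and hourly price in the example are made-up placeholders rather than figures from the benchmarks or from Amazon's price list.

```python
def cost_per_million(molecules_per_second, price_per_hour, units_per_instance=1):
    """Raw cost (in the currency of price_per_hour) to screen one million molecules.

    molecules_per_second: throughput of a single core or GPU
    price_per_hour:       hourly price of the whole instance
    units_per_instance:   number of cores or GPUs in the instance
    """
    instance_rate = molecules_per_second * units_per_instance
    if instance_rate <= 0:
        raise ValueError("throughput must be positive")
    hours_needed = 1_000_000 / instance_rate / 3600.0
    return hours_needed * price_per_hour


# Placeholder numbers: 2.5 molecules/s per core, 4 cores, 0.20 per instance-hour.
print(f"{cost_per_million(2.5, 0.20, units_per_instance=4):.2f}")
```

As noted above, this gives only the raw processing cost; job overheads and data transfer are not included.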


New hardware is significantly faster at running Blaze than old stock as would be expected. However, the speed increases are much lower than they have been in the past, with CPUs that are well past their best still performing adequately. On the GPU side, Blaze performs particularly well on commodity graphics cards leaving few reasons for us to invest in dedicated GPU co-processing cards.

The cost of running a million-molecule virtual screen on the Amazon cloud has never been lower. If tiered processing is used, as is the default for Blaze, then these screens can be performed for a very low cost indeed – less than $15 per million molecules for the processing costs.

Contact us for a free evaluation to try Blaze on your own cluster, or Blaze Cloud.

Spatial overlap of peptide hotspots and canonical drug pockets in a model enzyme

Spatial overlap of peptide hotspots and canonical drug pockets in a model enzyme was presented by Dr Walraj Gosal, Senior Scientist, Isogenica at the Cresset European User Group Meeting 2015.

Walraj’s talk described the process of moving from peptides from molecular display to small molecule inhibitors, with the help of Cresset technology. In collaboration with Cresset and Biolauncher, the team found that Cresset’s field patterns based on peptides can be used to find new inhibitors. The work was funded by the TSB (Technology Strategy Board).

Molecular (CIS) display [1] is an Isogenica technology that allows you to find novel peptides and protein scaffolds that bind a given target.

Walraj described the basic problem: we don’t know how to move from the primary sequence to the precise 3D fold of the protein. He described it as the ultimate needle-in-a-haystack problem, whereby a 100 amino acid protein corresponds to on the order of 10^130 possible sequences (20^100). The team are trying to work out how, when someone comes to them with a target, they can find a sequence that will bind to it.
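
The size of that sequence space is easy to check. The snippet below is just illustrative arithmetic, assuming the 20 standard amino acids are allowed at each of 100 positions; the function name is ours.

```python
import math


def sequence_space(length, alphabet_size=20):
    """Return (exact count, base-10 exponent) of all sequences of a given length."""
    count = alphabet_size ** length                 # exact integer arithmetic
    exponent = length * math.log10(alphabet_size)   # log10 of the count
    return count, exponent


count, exponent = sequence_space(100)
print(len(str(count)))     # number of decimal digits: 131
print(f"{exponent:.1f}")   # 130.1, i.e. roughly 1.3 x 10^130 sequences
```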

Ultimately, the ideal would be a complete algorithmic solution, and he briefly highlighted recent computational approaches (e.g. Rosetta) that are showing ever more promise towards this goal [2,3]. However, at the moment the only viable approach is molecular display. For example, Humira – the biggest selling drug worldwide – was one of the first drugs partially discovered using display technology to reach the market.

He went on to describe the basic premise of molecular display, which is to have a library of peptides or proteins that maintain their link to RNA or DNA (a ‘genotype to phenotype’ link). The process is then to enrich the library by presenting it with the target over many rounds of selection.

Moving on to the problem the team were trying to solve: can CIS display peptides inform small molecule discovery? Their target choice was thrombin, which Walraj described as ideal for a number of reasons. Firstly, there is already a mountain of medicinal chemistry data available in the public domain due to the industry race for a direct thrombin inhibitor. In general, the compounds that have made it on to the market are all based on the substrate, but are very basic, because they mimic a key arginine-aspartate salt bridge in the so-called S1 pocket. This led to lead molecules with low bioavailability, and clever pro-drug strategies were needed before drugs eventually reached the market (e.g. Dabigatran).

Secondly, and more importantly, the team were inspired by the fact that Nature has found alternative solutions to the problem of inhibiting thrombin, and Walraj highlighted three: from a tropical bont tick, the mosquito and a medicinal leech [4].

So here is the key question: are molecular display peptides going to open up more avenues for drug design, or are they consistent with previous efforts? The answer turns out to be a bit of both.

They found that many of the peptides bound to the active site, but that some also bound to an allosteric site – the latter already suggesting that drug design efforts could be focused on other sites largely ignored by industry. Nevertheless, looking at the active site binders, whilst many of those peptides contained a motif that mimics the natural mosquito inhibitor, most of them appeared unrelated to each other, suggesting multiple solutions. A lot of biochemistry was carried out by the team, and eventually the two best peptides were crystallised with thrombin, which confirmed the binding at the active site.

These structures showed orthogonal solutions. One is very much based on the mosquito solution, which is to insert a key arginine in S1 in the opposite direction to the substrate, making it incompatible with catalysis. The other solution appeared substrate-like in its path and direction to the S1 site, but with a key difference. Here the peptide delivered an extremely novel ‘warhead’ in the S1 pocket that violated the paradigm that the arginine-aspartate salt bridge was required for high affinity. The latter was especially important as the peptide bound with single-digit nanomolar affinity.

They carried out a computational study (using Rosetta) and alanine-scanning mutagenesis experiments (seeing excellent correlation between the two) to determine the key interactions or ‘hotspots’. They then asked whether there was a spatial overlap between the hotspots of these peptides and the canonical drug pockets and interactions that have been exploited over the last 40 years (using data from the PDB). They saw that the hotspots overlap remarkably well with drugs from the PDB. However, some of the high-energy interactions seen in the peptides have never been exploited for drug design. For example, for one of the peptides, a loop movement creates a whole new pocket close to the active site.

Furthermore, the paradigm-violating lipophilic and neutral solution to S1 occupation appears to have been already discovered through HTS and fragment-based design. This highlights the power of molecular display – orthogonal solutions can be discovered remarkably quickly.

The next crucial, and by no means trivial, step is how to move from these peptide solutions to discovering small molecules. Here, Cresset’s virtual screening technology, Blaze, proved invaluable. The team at Cresset took the crystal structures and produced field patterns based on linear stretches of the peptides incorporating one or more hotspots. For one of the peptides, fewer than 160 of the top compounds suggested by Blaze were experimentally screened, and the team found two very small competitive inhibitors of thrombin that were previously unknown.


Walraj concluded that Cresset’s peptide field maps arising from molecular display are sufficient to discover small molecule inhibitors, and that combining the power of molecular display and virtual screening would open up a powerful new avenue to drug discovery.

  1. Odegrip, R. et al. PNAS 101, 2806-2810 (2004).
  2. Kuhlman, B. et al. Science 302, 1364-1368 (2003).
  3. Fleishman, S. J. et al. Science 332, 816-821 (2011).
  4. Huntington, J. A. Thromb. Haemost. 111, 583-589 (2014).

ChEMBL leadlike compounds freely searchable on Blaze demo server

The Blaze virtual screening demo server has proved popular since its launch last year. However, we wanted to extend the range of compounds that are available for users to search. We have now achieved this through the introduction of three new collections of ChEMBL compounds. These collections provide leadlike compounds for drug, agrochemical, flavor and fragrance discovery that are suitable for the evaluation of Blaze in these areas. The collections are open to all registered Blaze demo users, whether accessed through a web page, KNIME or using Forge, Torch, or TorchLite.

Creating collections of molecules for searching in Blaze

Blaze is a full virtual screening system that is integrated with queuing systems such as SGE for database population and searching, so creating a new collection to be searched is easy. Blaze takes care of the difficult part: splitting uploads into different sizes, identifying and linking duplicates, exploding unspecified chirality and populating the conformations of new molecules. This creates a new collection for searching. All that is required is to tell Blaze about the collection and then to upload an SDF file to the server. Choosing what to upload is more difficult. On our main Blaze server, which we use for our consulting projects, we have 10,000,000 molecules arranged in collections from compound suppliers. On the demo server it is not possible to use such large numbers of compounds, and until now it has held only a few thousand. Here we expand that to over 400,000 compounds, derived largely from ChEMBL.
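
Blaze does the splitting for you, so none of the following is needed in practice. Purely as an illustration of what splitting an upload involves, here is a minimal sketch that breaks SDF text into records (each SDF record ends with a line containing only `$$$$`) and groups them into batches. The function names and batch size are hypothetical and this is not Blaze's own implementation.

```python
def split_sdf_records(sdf_text):
    """Split SDF text into records, each terminated by a '$$$$' line.

    Any trailing lines that are not followed by '$$$$' are ignored.
    """
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def make_batches(records, batch_size):
    """Group records into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


# Tiny stand-in for a real SDF file: three (empty) records.
example = "mol1\nM  END\n$$$$\nmol2\nM  END\n$$$$\nmol3\nM  END\n$$$$\n"
batches = make_batches(split_sdf_records(example), batch_size=2)
print([len(b) for b in batches])  # [2, 1]
```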

Creating the ChEMBL collections

To generate collections with appropriate properties ChEMBL was filtered in KNIME using physico-chemical properties as shown in the table below.

Property | Chembl20_filtered leadlike collection for drug discovery | Leadlike collection for agrochemical discovery | ChEMBL filtered for fragrance-like molecules
MW | 200 – 400 | 200 – 430 | 30 – 300
TPSA | 40 – 80 | N/A | < 60
RotBonds | 0 – 5 | < 5 | 0 – 4
Aryl rings | 0 – 3 | N/A | N/A
HBD | 0 – 3 | 2 – 3 | 0 – 1
HBA | 0 – 6 | 2 – 12 | 0 – 3
SlogP | -1 – 4 | 0.75 – 4.5 | > 1
Elements | C,N,S,O,F,Cl,Br,I | C,N,S,O,F,Cl | C,H,N,O
Total molecules available for searching | 202,895 | 136,457 | 45,383

Additionally for the drug discovery library we removed compounds that we considered to be toxic or undruglike (acyl halides, sulfate esters etc.) and compounds that contain specific functional groups that have regularly appeared as false positives with Blaze (thioethers, hydrazones and imines).

The filtering was performed in KNIME workflows (represented for the drug discovery collection below).

The upload is traditionally done using Blaze’s web interface but on this occasion we chose to extend our KNIME protocol to upload the compounds to Blaze using the REST interface. This feature was introduced in Blaze 10.2 and has proved a popular and easy way to keep Blaze in sync with corporate databases. While we are using KNIME here, the protocol would work equally well with Pipeline Pilot. The upload workflow is shown below with the filtering steps reduced to metanodes.


Using the new collections

The new collections are available to search using the standard Blaze web interface or through the REST interface, enabling searching from KNIME and Pipeline Pilot as well as Cresset’s desktop applications Forge, Torch, or TorchLite. The applications require configuring (in the preferences) with the address of the Blaze server together with a username and password for access. Once this is done, the Run menu → Send to Blaze and right-click menu ‘Send to Blaze’ options will open a dialog box for configuration of the Blaze search.

The advantage of submitting a Blaze search from within the desktop applications is that your current field constraints and the protein excluded volume will get transferred to Blaze and used without extensive interaction or file uploading.

Note that results can also be downloaded from within the desktop applications. Selecting the File menu → Download Blaze Search Results brings up a dialog containing a tree view of Blaze searches. One tip here: make sure that you select the best results, which are those from the simplex refinement, not the initial search.

To try the new Blaze collections for yourself please register for a username and password. If you think that there are other sets that we could usefully include or that we could improve the filters that we have used here then please contact us to discuss your suggestion.

Rapid and simple Blaze database population and searching using KNIME and Forge


Blaze [1] is Cresset’s ligand-based virtual screening platform. It uses the shape and electrostatic character of known ligands to rapidly search large chemical collections for molecules with similar properties. In this case study, a Blaze database of approximately 200,000 compounds from ChEMBL [2] was prepared in a seamless manner using a KNIME [3] workflow and standard Blaze database creation routines. The new collection, named ‘Chembl20_filtered’, is available from the Blaze Demo Server [4]. Blaze searches were launched within Forge [5] and by means of a KNIME workflow to test the ease of use of both workflows. The output of the searches was finally downloaded into Forge and visually inspected.


Blaze, Cresset’s ligand-based virtual screening platform, uses the shape and electrostatic character of known ligands (as encoded by Cresset’s field technology [6]) to rapidly search large chemical collections for molecules with similar properties. It is excellent for finding novel leadlike hits from known actives, replacing peptides with non-peptides or steroids with non-steroids.

Using Blaze you can increase the diversity of your project’s lead compounds and jump into new areas of chemical space giving substantial improvements in the properties of your hits. Cresset have run hundreds of projects through Blaze with an excellent track record: hit rates as high as 30% are reported by our customers.


Blaze is a full virtual screening system containing the infrastructure to manage compound collections and the associated conformation populations. It automatically records additions and removals from any collection and handles duplication across collections. New compounds are automatically submitted to a queuing system (typically SGE or Platform LSF) for conformer generation on a Linux cluster.

Database searching is configured through a single webpage, a REST call or the command line. Compounds are automatically triaged through a cascade of increasingly accurate search methods. Blaze automatically manages database searches with differing priorities, submitting them to the queuing system of either a GPU or a CPU cluster.
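
The idea of a triage cascade can be sketched generically: score everything with a cheap method, keep the best fraction, and rescore only the survivors with a more expensive method. The code below is a toy illustration under that assumption. The scoring functions, keep fractions and molecule records are all hypothetical, and it does not reproduce Blaze's actual algorithms or settings.

```python
def cascade_screen(molecules, stages):
    """Run molecules through successive (score_fn, keep_fraction) stages.

    stages should be ordered from cheapest to most accurate; at each stage the
    survivors are ranked by score (higher is better) and only the top fraction
    is passed on to the next stage.
    """
    survivors = list(molecules)
    for score_fn, keep_fraction in stages:
        ranked = sorted(survivors, key=score_fn, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]
    return survivors


# Toy data: precomputed scores stand in for a fast and a slow similarity method.
molecules = [{"id": i, "fast": i, "accurate": -i} for i in range(10)]
stages = [
    (lambda m: m["fast"], 0.5),      # cheap first pass keeps the top half
    (lambda m: m["accurate"], 0.5),  # slower refinement keeps the top half again
]
print([m["id"] for m in cascade_screen(molecules, stages)])  # [5, 6]
```

The benefit is that the expensive method only ever sees a small fraction of the collection, which is what keeps large screens affordable.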

Lastly, Blaze contains a full user and project based permissions system to control the visibility of individual and groups of search results.

Blaze V10.2

This most recent version of Blaze includes:

  • A new search algorithm that enables full 3D assessment of molecules at four times the previous speed, enabling the processing of databases of over 10 million compounds.
  • A new RESTful web service providing easy integration with Forge, KNIME, Pipeline Pilot [7] and custom software solutions.
  • Simplified security features that are easier to unify with corporate authentication servers, in response to customer requests. This makes user management significantly simpler for large installations.
  • A free demo server, enabling you to test the performance and functionality of Blaze on a small collection of commercially available compounds.

In this case study a Blaze database of approximately 200,000 compounds from ChEMBL (database of bioactive data for drug discovery) was rapidly prepared and uploaded (added) to the Blaze demo server using the new REST API interface.



The full ChEMBL 20 data set (containing approximately 1.5 million compounds) was downloaded as an SDF file.
The set was filtered using a KNIME workflow (Figure 1) applying the following physico-chemical cut-offs to select potential leadlike structures to be used as starting points for medicinal chemistry optimization:

  • MW 200-400
  • TPSA 40-80
  • RotBonds 0-5
  • Aryl rings 0-3
  • HBD 0-3
  • HBA 0-6
  • SlogP -1 to 4.

Figure 1. KNIME workflow used to filter the original ChEMBL data set (1.5M compounds).

The data set was further cleaned with the removal of compounds carrying reactive functional groups (e.g. alkyl halides), potentially toxic groups (e.g. azides) or other unwanted chemical moieties (e.g. heavy metals). After filtering, approximately 202,000 compounds remained for uploading to Blaze.
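
We did the filtering in KNIME, but the physico-chemical cut-offs listed above amount to a simple range check per property. A minimal sketch in Python follows, assuming the property values have already been calculated elsewhere (for example by a cheminformatics toolkit) and that both ends of each range are inclusive. The dictionary keys and function name are ours, and the example molecules are invented.

```python
# Leadlike ranges for the drug discovery collection, as listed above.
LEADLIKE_RANGES = {
    "MW": (200, 400),
    "TPSA": (40, 80),
    "RotBonds": (0, 5),
    "ArylRings": (0, 3),
    "HBD": (0, 3),
    "HBA": (0, 6),
    "SlogP": (-1, 4),
}


def is_leadlike(props, ranges=LEADLIKE_RANGES):
    """Return True if every property lies within its (inclusive) range."""
    return all(low <= props[name] <= high for name, (low, high) in ranges.items())


# Invented property sets for two hypothetical molecules.
candidates = {
    "mol_a": {"MW": 312.4, "TPSA": 58.2, "RotBonds": 3, "ArylRings": 2,
              "HBD": 1, "HBA": 4, "SlogP": 2.7},
    "mol_b": {"MW": 451.9, "TPSA": 95.0, "RotBonds": 8, "ArylRings": 3,
              "HBD": 2, "HBA": 7, "SlogP": 4.6},
}
kept = [name for name, props in candidates.items() if is_leadlike(props)]
print(kept)  # ['mol_a']
```

Substructure-based clean-up, such as removing reactive or toxic groups, needs a cheminformatics toolkit and is not covered by this sketch.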

Upload to Blaze

The upload of the new collection could be achieved using the command line or the web interface. However, as all the compounds exist within KNIME we chose to directly upload to the Blaze free demo server using the Blaze REST API (Figure 2).

The creation of the Blaze Chembl20_filtered collection took a few hours on 150 cores using Cresset’s internal Linux cluster.

Figure 2. Blaze compound upload protocol.

Using Blaze from Forge/Torch

The introduction of the REST interface has enabled Blaze searching directly from many platforms and scripts including Cresset’s desktop applications Forge and Torch. To work with Blaze the applications require the address of the Blaze server and your username and password in the relevant preference setting (Edit menu -> Preferences -> Blaze panel, Figure 3).

Figure 3. Set up of Forge/Torch connection to Blaze.

The interface enables sending the current molecule, including any field constraints and the current protein excluded volume, to Blaze, configuration of the search options and download of results directly into the application.
To test the new ChEMBL collection and further demonstrate the usefulness of the Blaze REST interface, a search was performed using Nevirapine [8], one of the first generation of HIV non-nucleoside reverse transcriptase inhibitors (NNRTIs). The search was submitted using Cresset’s Forge and also using a KNIME protocol.

Searching Blaze from Forge

The crystal structure of the Y181C mutant HIV-1 reverse transcriptase in complex with the inhibitor Nevirapine (PDB code 1jlb) was downloaded in Forge (an identical procedure is applied when working with Torch). The workflow is summarized in Figure 4.

Nevirapine was selected as the reference structure and imported into Forge together with the HIV-1 reverse transcriptase protein. Cresset’s rules were used to define the protonation state of Nevirapine and the protein. After visual inspection the reference structure was minimized to improve the bond angles.

To initiate the Blaze search, the reference molecule was selected in the main ‘Molecules’ table and sent to Blaze using ‘Send to Blaze’ in the right-click menu. The resulting Blaze search configuration dialog was used to name the search ‘1jlb’, select the ‘Chembl20_filtered’ collection and accept the default search parameters (Figure 4).

Once complete, the search results were imported into Forge (Torch would work identically) for visual inspection and further analysis.

Figure 4. PDB download, selection of reference structure and start of Blaze search in Forge.

Figure 5. Blaze search protocol.

Blaze Searching from KNIME

A KNIME Blaze search workflow (see Figure 5) was also tested for user friendliness.
The protocol requires the manual setting of a small number of workflow variables (Blaze URL, username and password) and the configuration of three input nodes to:

  • define the name and conditions of the search (Table creator node),
  • load the reference structure as an SDF (using SDF reader node),
  • define the name of the Blaze collection to search (Chembl20_filtered, Table creator node).

Download of results to Forge/Torch

The results of the Blaze search on Chembl20_filtered using Nevirapine as the query were downloaded into Forge (Figure 6).

Figure 6. Download of Blaze results into Forge.

While a thorough evaluation of the results of the Blaze search is beyond the scope of this case study, a qualitative analysis of the 200 top scoring results shows that Blaze was able to identify some chemically diverse potential hit compounds. As expected, a large fraction of the top scoring compounds belong to the same (widely explored) chemical class as Nevirapine; however, a few top scoring molecules (see examples in Table 1, Figure 7) are structurally different and are reported in ChEMBL to have been tested for HIV-1 reverse transcriptase activity.

Table 1. Interesting hits retrieved by the Blaze search on Nevirapine.

Figure 7. CHEMBL314103 overlaid (grey) with Nevirapine (green).


A Blaze database of approximately 200,000 compounds from ChEMBL was prepared in a seamless manner using a KNIME workflow. Using the Blaze REST interface the dataset could be uploaded to Blaze from within KNIME and was available for searching within a few hours.

To test the ease of use of the search workflows available in Forge (Torch) and KNIME, the same search was run on each platform. While both protocols are relatively straightforward, the Forge guided interface is definitely simpler for the end user to set up. The KNIME workflow offers greater flexibility, however, and allows the integration of Blaze searches into more customized protocols with complex post-processing of results. Using the Torch or Forge viewers within KNIME enables viewing of the 3D alignment of the returned compounds within that platform.

The new Chembl20_filtered collection is available for searching by all users of the Blaze demo server – register for free access by visiting

References and Links

4. Blaze free demo server: Register for your username and password at the Blaze demo signup page
7. Pipeline Pilot:
8. US5366972 (A) – 5,11-dihydro-6H-dipyrido(3,2-B:2′,3′-E)(1,4)diazepines and their use in the prevention or treatment of HIV infection