Spark™ performance validated with a new data set for evaluating bioisostere methods

The recent paper by Dave Evans and Matthew Baumgartner, "Side chain virtual screening of matched molecular pairs: a PDB-wide and ChEMBL-wide analysis", J Comput Aided Mol Des (2020), is a fascinating new take on evaluating methods for finding bioisosteres. Previous papers on bioisostere methods have tended to look at a relatively small number of data sets, often with subjective and/or manual definitions of what constitutes a valid bioisostere. In contrast, Baumgartner and Evans have taken many of the lessons learned in evaluating virtual screening methods and applied them in a new context.

The authors have generated an exhaustive data set of 'R-group replacements' by automated mining of the ChEMBL database, finding sets of compounds where at least one compound has a crystal structure and the other compounds are both active and have a single significant structural change (i.e., are matched molecular pairs against the crystallized ligand). They then generated a set of 'decoy R-groups' which were property-matched to the actives, in a similar way to the construction of the well-known DUD-E data set for virtual screening.
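
To make the idea of property matching concrete, here is a minimal sketch in Python. It is not the authors' protocol: the property names, tolerances and R-group values are all hypothetical, and it leaves out any check that decoys are structurally dissimilar to the actives. It simply keeps candidate R-groups whose properties fall within a tolerance of the active R-group and returns the closest ones.

```python
def property_matched_decoys(active, candidates, tolerances, n_decoys):
    """Pick up to n_decoys candidates whose properties are close to the active's.

    active / candidates: dicts with a "name" plus numeric property values.
    tolerances: {property_name: maximum allowed absolute difference (> 0)}.
    All property names here are illustrative assumptions.
    """
    matched = []
    for cand in candidates:
        if cand["name"] == active["name"]:
            continue  # never use the active itself as a decoy
        if all(abs(cand[p] - active[p]) <= tol for p, tol in tolerances.items()):
            matched.append(cand)
    # Prefer the candidates closest to the active, scaling each property by its tolerance.
    matched.sort(
        key=lambda c: sum(abs(c[p] - active[p]) / tol for p, tol in tolerances.items())
    )
    return matched[:n_decoys]


# Hypothetical R-groups with made-up property values, for illustration only.
active = {"name": "active_rgroup", "heavy_atoms": 2, "logp": 0.0}
candidates = [
    {"name": "rgroup_1", "heavy_atoms": 2, "logp": 1.0},
    {"name": "rgroup_2", "heavy_atoms": 2, "logp": -0.3},
    {"name": "rgroup_3", "heavy_atoms": 6, "logp": 1.9},
    {"name": "rgroup_4", "heavy_atoms": 3, "logp": 0.1},
]
tolerances = {"heavy_atoms": 1, "logp": 0.5}

decoys = property_matched_decoys(active, candidates, tolerances, n_decoys=2)
print([d["name"] for d in decoys])
```

Matching decoys on simple properties is meant to stop a method from scoring well just by separating, say, large R-groups from small ones.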

This large evaluation set (404 separate sets of actives!) can then be used for large-scale comparisons of different bioisostere evaluation methods. Comparing across many data sets makes the results much more statistically robust and less likely to be influenced by chance correlations. The authors carried out exhaustive testing of multiple methods for assessing whether a new side chain was a good match for the one in the crystallographic ligand:

  • 2D similarity
  • RDKit alignment with Smina scoring
  • Full docking with Smina
  • Full docking with Glide SP
  • Spark in default mode, ligand-only
  • Spark with the protein structure used as an excluded volume
  • Spark, using a random low-energy conformation of the active rather than the crystallographic conformation

The results clearly show that Spark is the best all-round solution to this common problem in medicinal chemistry, as shown in the following table.

Method                           Mean AUC
2D similarity                    0.699 ± 0.212
RDKit alignment + Smina          0.639 ± 0.174
Smina docking                    0.373 ± 0.223
Glide docking                    0.636 ± 0.178
Spark                            0.783 ± 0.166
Spark with protein               0.768 ± 0.169
Spark with random conformation   0.790 ± 0.177
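
For readers less familiar with the metric, the sketch below shows how a mean AUC of this kind could be computed. It assumes each method gives every active and decoy R-group a score where higher means a better match, and that the ± values are standard deviations across data sets, which the text above does not state. The scores are hypothetical and the code is not taken from the paper.

```python
from statistics import mean, stdev


def roc_auc(active_scores, decoy_scores):
    """ROC AUC as the chance that a random active outscores a random decoy.

    Ties count as half a win. Higher scores are assumed to mean a better match.
    """
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Hypothetical data sets: (scores of actives, scores of decoys).
data_sets = {
    "set_A": ([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]),
    "set_B": ([0.6, 0.5], [0.7, 0.4, 0.3]),
}

# One AUC per data set, then the mean and sample standard deviation across sets.
aucs = [roc_auc(actives, decoys) for actives, decoys in data_sets.values()]
mean_auc = mean(aucs)
sd_auc = stdev(aucs)
print(f"{mean_auc:.3f} ± {sd_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking of actives against decoys, and 1.0 to every active being ranked above every decoy.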

As the authors comment: "In general, the Spark methods worked well across a range of data sets." Spark outperformed all of the other methods tested, by quite a large margin. Even when the trivial data sets (those where the 2D method performed extremely well) were removed from the analysis, Spark’s performance was essentially unchanged (mean AUC 0.777), and likewise when the data sets were filtered to only count R-group replacements with different Murcko scaffolds to the target (mean AUC 0.773). It is worth noting that Spark virtually never performed worse than random, which was not the case for the other methods studied.
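
The two checks described here, dropping the trivial data sets and counting those that fall below random, can be sketched as follows. The per-data-set AUC values and the cutoff for "trivial" are hypothetical; the paper's actual threshold is not given in this post.

```python
from statistics import mean

# Hypothetical per-data-set AUCs for two methods.
results = [
    {"set": "A", "auc_2d": 0.98, "auc_spark": 0.95},
    {"set": "B", "auc_2d": 0.60, "auc_spark": 0.80},
    {"set": "C", "auc_2d": 0.45, "auc_spark": 0.70},
    {"set": "D", "auc_2d": 0.72, "auc_spark": 0.48},
]

TRIVIAL_2D_AUC = 0.9  # assumed cutoff for "the 2D method performed extremely well"
RANDOM_AUC = 0.5      # AUC expected from random ranking


def mean_auc(rows, key):
    """Mean of one method's AUC over the given data sets."""
    return mean(r[key] for r in rows)


# Keep only data sets that 2D similarity does not already solve.
non_trivial = [r for r in results if r["auc_2d"] < TRIVIAL_2D_AUC]

# Data sets on which the method ranks actives worse than random.
worse_than_random = [r["set"] for r in results if r["auc_spark"] < RANDOM_AUC]

print(mean_auc(results, "auc_spark"), mean_auc(non_trivial, "auc_spark"))
print(worse_than_random)
```

If a method's mean AUC barely moves after the trivial sets are removed, its overall score is not being carried by the easy cases.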

The results clearly show the superior performance of ligand-based methods (particularly Spark) over structure-based ones. The authors hypothesize that docking methods struggle “due to limitations in their force fields which appear to lack the ability to discriminate between the relatively small differences within the congeneric series”.

It is interesting to contrast these results with the advice that we have traditionally given our customers on the use of Spark. The addition of protein excluded volumes has no (or possibly a slight negative) effect on mean AUC. My guess is that the targets for which the addition of a shape constraint from the protein improves performance are balanced by those for which protein flexibility leads to incorrectly scoring down active molecules. Our advice, that protein excluded volumes should be used only when you are certain that the active site is likely to remain in the same conformation between different ligands, seems sensible given these results.

More unexpected, perhaps, is that the experiments where Spark was run on a random low-energy conformation of the target active were just as good on average as those run with the bioactive conformation. We know from experience that Spark can work well in this case, but I would have expected the results with the 'right' conformation to be better. This is probably because the type of experiment performed in this paper, where an R-group connected to the scaffold by a single bond is replaced, is the least sensitive to the initial conformation. Only in cases where the conformation of the scaffold restricts the conformation of the flexible R-group, or when the R-group to be replaced adopts a conformation that more rigid but inactive analogues can mimic, would one expect the conformation of the initial active molecule to make a major difference. The results from this data set indicate that this is an infrequent occurrence. We would still advise Spark users to use the bioactive conformation where possible (particularly when doing core replacements as opposed to the R-group replacements analysed in this study), but the results from Baumgartner and Evans indicate that in the lead optimization phase, when using Spark to search for R-group replacements, having a crystal structure of the lead series is not as important as we had assumed.

These results confirm our belief that Spark is the best bioisostere tool on the market. However, as Baumgartner and Evans note, “it is clear that there is still plenty of room for improvement in ranking matched molecular series.” We are committed to continually improving the scientific methods in our software, and the data sets presented in this paper give us an excellent way to measure our progress!
