A new release of workflow components for the KNIME™ environment is now available. This includes nodes for the Machine Learning methods in Forge™, nodes for accessing Flare™ functionality through the Flare Python API, and a number of enhancements to existing components.
To illustrate these enhancements, I created an integrated workflow that automatically performs qualitative and quantitative Structure-Activity Relationship (SAR) analysis on patent data from Bindingdb.
An integrated workflow for SAR analysis
After reading the blog post ‘Rapid interpretation of patent SAR using Forge’, I thought it would be useful to have a workflow that analyzes Bindingdb data automatically, with minimal human intervention. Cresset workflow solutions are ideal for this, so to test the feasibility of the idea I put together the KNIME workflow shown in Figure 1.
Figure 1. An integrated KNIME workflow for SAR analysis of Bindingdb patent data.
This block of nodes prepares the raw data downloaded from Bindingdb (in this case, from the US9321756 patent: ‘Azole compounds as PIM inhibitors’) for the SAR analysis.
Nodes which require manual input are labelled in Figure 1. I need to specify the name and location of the CSV file I want to use, choose the biological target I am interested in (US9321756 reports activity for two biological targets, ‘PIM’ and ‘PIM-1’; I used PIM-1), and make sure I am working with the correct activity column (IC50, for PIM-1).
There are also nodes to filter out missing IC50 values and those with ‘higher than’ (>) or ‘lower than’ (<) modifiers, transform the activity values into pIC50, calculate mean pIC50 values for compounds which were tested multiple times on PIM-1, and remove those compounds where the mean pIC50 is associated with a high standard deviation (>0.7). Finally, the compounds are sorted in order of descending activity to enable an activity-stratified partitioning into training and test sets.
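As a rough illustration of what these preparation nodes do, here is a minimal Python sketch of the same logic. It is not the KNIME implementation: the function names are hypothetical, the IC50 values are assumed to be reported in nM (so pIC50 = 9 - log10(IC50)), the sample standard deviation is assumed, and sending every fourth compound of the activity-sorted list to the test set is only one possible way to do an activity-stratified split.

```python
import math
import statistics
from collections import defaultdict

def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nM (assumed unit) to pIC50, i.e. -log10 of the molar IC50."""
    return 9.0 - math.log10(ic50_nm)

def prepare_activities(records, max_sd=0.7):
    """records: iterable of (compound_id, raw_ic50_string) pairs.

    Returns [(compound_id, mean_pIC50), ...] sorted by descending activity.
    """
    by_compound = defaultdict(list)
    for compound_id, raw in records:
        raw = (raw or "").strip()
        # Drop missing values and qualified values such as ">10000" or "<1".
        if not raw or raw[0] in "<>":
            continue
        by_compound[compound_id].append(ic50_nm_to_pic50(float(raw)))

    kept = []
    for compound_id, values in by_compound.items():
        # The sample standard deviation needs at least two measurements.
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        if sd <= max_sd:
            kept.append((compound_id, statistics.mean(values)))
    return sorted(kept, key=lambda item: item[1], reverse=True)

def stratified_split(sorted_items, test_every=4):
    """Send every `test_every`-th item of the activity-sorted list to the test set."""
    test = sorted_items[test_every - 1::test_every]
    train = [item for i, item in enumerate(sorted_items) if (i + 1) % test_every != 0]
    return train, test
```

Because the list is sorted by activity before it is split, both sets sample the whole activity range rather than one end of it.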
The ‘Histogram’ node (at the bottom in Figure 1) can be used to check that the distribution and range of the activity values meet the conditions for building robust qualitative and/or quantitative SAR models in Forge. In this case (Figure 2), the activity range covers almost 4 log units and the distribution is reasonably even, so I can confidently go ahead with the model building.
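The same check can be done numerically. The sketch below is a hypothetical helper, not part of the KNIME nodes; it reports the activity range in log units and the number of compounds per pIC50 bin, with the 0.5 log unit bin width chosen arbitrarily for illustration.

```python
import math

def activity_histogram(pic50_values, bin_width=0.5):
    """Return (activity_range, {bin_start: count}) for a list of pIC50 values."""
    low, high = min(pic50_values), max(pic50_values)
    counts = {}
    for value in pic50_values:
        # Each value falls in the bin whose lower edge is the nearest multiple below it.
        bin_start = math.floor(value / bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return high - low, dict(sorted(counts.items()))
```

A narrow range, or most compounds piled into one or two bins, would be a warning sign before any model building.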
Figure 2. Range and distribution of the PIM-1 pIC50 values from the US9321756 patent.
This part of the workflow downloads the protein-ligand complex with PDB code 4TY1 and sends it to the new ‘pyflare’ node to prepare the protein, extract the reference ligand, and remove the crystallographic water molecules. The pyflare node allows the Flare Python API to be used from within KNIME, enabling access to all the Flare functionality.
The ‘Flare Viewer’ node, also new in this release, can be used to launch Flare and visualize the results, as shown in Figure 3.
Figure 3. The prepared 4TY1 protein and the 38W reference ligand. Crystallographic water molecules were removed using the ‘pyflare’ KNIME node.
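To make the extraction and water removal steps concrete, the sketch below shows what they amount to at the level of PDB records. It deliberately does not use the Flare Python API (no Flare calls are shown or implied) and does no protein preparation at all; it is a plain-text illustration with a hypothetical function name, assuming standard PDB column positions and that waters are named HOH or WAT.

```python
def split_pdb(pdb_text, ligand_resname="38W", water_resnames=("HOH", "WAT")):
    """Split PDB-format text into protein, ligand and water record lists.

    Residue names are read from columns 18-20 of ATOM/HETATM records,
    as laid out in the PDB file format.
    """
    protein, ligand, waters = [], [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip headers, connectivity records, etc.
        resname = line[17:20].strip()
        if resname in water_resnames:
            waters.append(line)
        elif record == "HETATM" and resname == ligand_resname:
            ligand.append(line)
        elif record == "ATOM":
            protein.append(line)
        # Any other HETATM record (ions, cofactors) is ignored in this sketch.
    return protein, ligand, waters
```

Discarding the `waters` list and keeping `protein` and `ligand` separately gives the record-level equivalent of the receptor and the 38W reference ligand shown in Figure 3.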
Here I used the ‘Forge Align’ node to align the molecules in the training and test sets to the 38W reference ligand from PDB: 4TY1, using the protein as an excluded volume. I configured the node to use the ‘Exhaustive’ setting (which runs a more accurate conformation hunt), to assign formal charges to the molecules according to the Cresset rules, and to align the molecules by Maximum Common Substructure (MCS), as shown in Figure 4. To speed up the alignment process, I configured KNIME to use the Cresset Engine Broker.
Figure 4. Configuration of the ‘Forge Align’ node used in this case study.
SAR analysis and visualization
In this final part of the workflow, qualitative and quantitative SAR models are calculated using the ‘Forge Build Field QSAR’, ‘Forge Build Activity Atlas’ and the new ‘Forge Build Machine Learning’ nodes. The visualization is mainly done using the ‘Forge Project Viewer’ node, but as an alternative I could use the ‘Forge Project Writer’ node to save the results into separate project files to view at a later stage.
Field QSAR model
The Field QSAR method uses Forge 3D electrostatic (based on Cresset’s XED force field) and volume descriptors to create an equation that describes activity, using Partial Least Squares (PLS) analysis.
For this case study, I configured the ‘Forge Build Field QSAR’ node to use the ‘Weight molecules by similarity’ option, as shown in Figure 5 – left, which weights each molecule according to its similarity to the reference. Using this setting will down-weight the importance of training set molecules not optimally aligned to the reference (and accordingly associated with a lower similarity), and may generate better models in those cases where the alignment is not carefully curated.
The Field QSAR model shows a Q2 (training set CV, LOO) = 0.52 and an R2 (test set) = 0.65. Visual inspection of the training and test sets and inspection of the PCA plot (Figure 5 – right) reveal that there is a group of compounds with an incorrect protonation state on the pyridine ring. Recalculating the model after removing these compounds from the training and test sets gives a model with similar statistics (Q2 training set CV, LOO = 0.57 and R2 test set = 0.58).
Figure 5. Left: Configuration of the ‘Forge Build Field QSAR’ node used in this case study. Right: the PCA plot highlights a group of compounds in the training set with incorrect protonation state of the pyridine ring.
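For readers less familiar with the statistics quoted above, the sketch below shows one common way to compute R2 and a leave-one-out (LOO) cross-validated Q2. It is an illustration only: a one-descriptor least-squares line stands in for the PLS model, the function names are hypothetical, and the exact formulas Forge uses may differ from the definitions assumed here (Q2 = 1 - PRESS / SS_tot).

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept for one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def q2_loo(xs, ys):
    """Leave-one-out Q2: refit without each compound, then predict it.

    xs and ys must be lists, because they are sliced and concatenated.
    """
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return r_squared(ys, predictions)
```

Q2 is computed on compounds the model has not seen during fitting, so it is normally lower than the R2 of the fit on the same training set.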
Machine Learning model
The new ‘Forge Build Machine Learning’ node can be used to generate Machine Learning (ML) regression or classification models in KNIME, using Forge 3D electrostatic and volume descriptors. You can decide which model type will be generated (choosing from k-Nearest Neighbors, Random Forest, Relevance Vector Machine or Support Vector Machine), but for this case study I kept the default ‘Auto’ option, which automatically runs all the ML models and picks the best one for the output. To calculate the predicted pIC50 values for the molecules in the test set, I used the new ‘Forge Score Machine Learning’ node.
The best performance is obtained with a Support Vector Machine model showing a Q2 (training set CV) = 0.62 and an R2 (test set) = 0.71. Also in this case, recalculating the model excluding the molecules with an incorrect protonation in the pyridine ring from the training and test sets gives a model with similar statistics (Q2 training set CV = 0.57 and R2 test set = 0.69).
For this data set, the SVM model is marginally more predictive than the Field QSAR model.
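The idea behind an ‘Auto’ style selection can be sketched generically: score each candidate model by cross-validation and keep the best. The code below is a toy version under stated assumptions, not how Forge implements it. It uses a single numeric descriptor, leave-one-out Q2 as the selection criterion, and two k-Nearest Neighbors variants plus a mean baseline as hypothetical candidates.

```python
def knn_predict(train, query, k):
    """Predict activity as the mean of the k nearest training compounds (1D descriptor)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def q2_loo(data, predict):
    """Leave-one-out Q2 for a predict(train, query) callable.

    data is a list of (descriptor, activity) pairs.
    """
    ys = [y for _, y in data]
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        press += (y - predict(train, x)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

def pick_best(data, candidates):
    """Score every candidate by LOO Q2 and return (best_name, all_scores)."""
    scores = {name: q2_loo(data, predict) for name, predict in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical candidate models, each a predict(train, query) callable.
candidates = {
    "kNN (k=1)": lambda train, x: knn_predict(train, x, 1),
    "kNN (k=3)": lambda train, x: knn_predict(train, x, 3),
    "mean baseline": lambda train, x: sum(y for _, y in train) / len(train),
}
```

Calling `pick_best(data, candidates)` on a list of (descriptor, pIC50) pairs returns the name of the best-scoring candidate together with the scores of all candidates, so the runners-up can still be inspected.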
Activity Atlas models
Activity Atlas™ models generate a simple, qualitative picture of the critical points in the SAR landscape. In particular, the ‘Activity Cliff Summary’ views highlight regions of acute SAR and are a useful starting point to understand the data.
The new default ‘Weighted Sum’ Activity Cliffs Summary algorithm in the ‘Forge Build Activity Atlas’ node generates more detailed SAR maps by reducing the reliance on individual compounds, and is especially useful for small and medium sized data sets.
As this data set is relatively large, however, I also built an alternative Activity Atlas model using the original ‘Sum’ algorithm, which instead focuses on the prevalent SAR signals, and I compared the maps obtained with the two methods. Default conditions were used for all the other options. The two models are shown side by side in Figure 6, and for this case study give very similar results, comparable to those of the original blog post.
Figure 6. Activity Atlas model showing the ‘Activity Cliff Summary of Electrostatics’ and ‘Activity Cliff Summary of Shape’ views. Left: Activity Atlas maps calculated with the new ‘Weighted Sum’ algorithm. Right: Activity Atlas maps calculated with the original ‘Sum’ algorithm. Red / Blue = positive / negative electrostatics preferred for greater activity; Green / Pink = steric bulk is favorable / unfavorable in this region.
Also new and improved in this release
V2.5 Cresset KNIME nodes also include additional new features and improvements:
- New example workflows to illustrate the usage of Cresset KNIME nodes
- New Surface Writer node to write molecule surfaces to a directory
- Spark™ Database Search node: new options to set field and pharmacophore constraints, use multiple reference molecules to guide the search, specify a database to search in addition to automatically detected databases, additional similarity metrics (Tanimoto and Tversky)
- Forge Align node: additional similarity metrics (Tanimoto and Tversky), new option to remove boat conformations before performing the alignment
- Forge Build Activity Atlas node: new options to write the Activity Atlas surfaces to a directory and to specify a column in the input table to use for the similarity data
- Forge Build Machine Learning: new Weighting Scheme option in support of kNN models
- Forge Project Roles Extractor node: new option to output additional data for each role
- Improved import of ligand and protein files, with most nodes now accepting ligands in SMILES format and proteins in PDB format.
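The Tanimoto and Tversky metrics mentioned in the list above are easiest to see in their textbook, set-based form. The sketch below uses plain Python sets of features and hypothetical function names; the Cresset nodes apply these metrics to field and shape similarity, so their implementation will differ in detail.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index between two feature sets.

    alpha = beta = 1 gives the Tanimoto (Jaccard) coefficient;
    alpha = beta = 0.5 gives the Dice coefficient.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return common / denominator if denominator else 0.0

def tanimoto(a, b):
    """Tanimoto coefficient: shared features over the union of features."""
    return tversky(a, b, alpha=1.0, beta=1.0)
```

Unequal `alpha` and `beta` make the metric asymmetric: with `beta=0`, features present only in the second set are not penalized, which is useful when asking whether one molecule is contained in another rather than whether the two are alike overall.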
The KNIME workflow built for this post can quickly run a preliminary qualitative and quantitative SAR analysis of any interesting patent data in Bindingdb in an automated manner requiring minimal human intervention. For the US9321756 patent, running this workflow took approximately 30 minutes and resulted in an SVM quantitative SAR model with reasonable predictive ability and clear SAR maps for Activity Atlas.