Support Vector Machine (SVM) is one of the most widely used machine learning approaches for molecular property prediction in drug discovery. In recent years, deep learning (DL) methods have become increasingly popular in drug discovery for both property prediction and chemical space exploration, raising the question of whether SVM will be replaced by DL. However, as Prof. Bajorath pointed out recently, ‘traditional’ machine learning models such as the SVM still play an irreplaceable role in modern QSAR tasks in drug discovery1.
Introduction to Support Vector Machine
SVM is a supervised machine learning algorithm that can be used for both compound classification and numeric molecular property prediction. Development of an SVM model relies on locating the support vectors (SVs) in the training set. For compound classification tasks, the SVs define the hyperplane that separates the compound classes with maximal margin (Figure 1, left). For regression tasks, the SVs define the hyperplane used for property prediction within a given error tolerance (Figure 1, right).
Figure 1. Illustration of SVM algorithm for classification (left) and regression (right). Adapted from Reference (1).
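The classification and regression variants described above can be sketched with scikit-learn (the package Flare builds on, as noted below); the synthetic data here is a hypothetical stand-in for real molecular descriptors:

```python
# Minimal sketch of SVM classification and regression with scikit-learn.
# The random descriptors below are illustrative, not real chemical data.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Classification: two descriptor dimensions, binary activity labels.
X_cls = rng.normal(size=(40, 2))
y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)
clf = SVC(kernel="linear").fit(X_cls, y_cls)
# The fitted model is defined entirely by its support vectors.
print(clf.support_vectors_.shape)

# Regression: epsilon sets the error tolerance around the hyperplane;
# only points outside the epsilon tube become support vectors.
X_reg = rng.normal(size=(40, 1))
y_reg = 2.0 * X_reg.ravel() + rng.normal(scale=0.1, size=40)
reg = SVR(kernel="linear", epsilon=0.1).fit(X_reg, y_reg)
print(len(reg.support_))
```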
The ‘kernel trick’ is important in SVM modeling. Data points for cheminformatics tasks are usually not linearly separable in the given input space, and the kernel trick maps the training data into a higher-dimensional space where they may become linearly separable. As shown in Figure 2, the data in input space 𝓧 are projected into a higher dimension with the nonlinear mapping function Φ, and become separable in the projected space 𝓗; the kernel function computes the scalar product in 𝓗 without ever constructing the mapping explicitly.
Figure 2. Using kernel Φ to project the input data into higher, potentially linearly separable dimensions. Adapted from Reference 1.
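A classic illustration of this effect is XOR-labelled data: no line in the input space separates the two classes, but an RBF kernel (an implicit mapping into a higher-dimensional space) does. The gamma value below is an arbitrary illustrative choice:

```python
# XOR-style points are not linearly separable in the input space,
# but an RBF kernel separates them via an implicit higher-dim mapping.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print(linear.score(X, y))  # a linear boundary cannot fit XOR perfectly
print(rbf.score(X, y))     # the RBF kernel classifies all four points
```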
For both SVM classification and regression models, a compromise must be achieved between model accuracy (making correct predictions) and generalizability (ensuring a larger margin between SVs). The regularization parameter C needs to be optimized: a larger C gives a more accurate fit to the training data, with a higher likelihood of overfitting, while a smaller C results in a simpler decision boundary that may generalize better to unseen data.
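This trade-off can be seen by fitting the same model at different C values and comparing training accuracy against cross-validated accuracy; the noisy synthetic data below is a hypothetical substitute for real compound descriptors:

```python
# Sketch of the accuracy/generalizability trade-off controlled by C,
# on noisy synthetic data standing in for real compound descriptors.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
# Labels with deliberate noise, so a very tight fit chases that noise.
y = ((X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=200)) > 0).astype(int)

results = {}
for C in (0.1, 1.0, 100.0):
    train_acc = SVC(kernel="rbf", C=C).fit(X, y).score(X, y)
    cv_acc = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5).mean()
    results[C] = (train_acc, cv_acc)
    # Larger C fits the training data more tightly; cross-validation
    # shows whether that tighter fit actually generalizes.
    print(f"C={C:>6}: train={train_acc:.2f}  cv={cv_acc:.2f}")
```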
Application of SVM in drug discovery
SVM is an important machine learning method in cheminformatics because of its high accuracy in both compound classification and non-linear QSAR for virtual screening. SVM can also be adapted to specialized applications such as multi-target activity prediction, new target identification and activity cliff prediction1.
With the advent of deep learning (DL) models in recent years, much attention has been paid to DL models for cheminformatics tasks. However, ‘traditional’ models such as SVM are unlikely to be replaced by DL models, at least in the near future. Current DL models are strong for tasks where large volumes of unstructured data are available and representation learning is important. However, for many predictive tasks in drug discovery, only limited data is available, and such data is often better defined than the unstructured image or text data used to train DL models. As a result, DL models exhibit little, if any, advantage over more traditional machine learning models like SVM in predictive QSAR tasks for drug discovery. In the future, the two types of models may well be combined, with predictive tasks performed using an SVM model while the molecular representation is learned from data by a DL model.
SVM in Flare™
Generating 2D and 3D SVM models for both classification and regression tasks is readily done in Flare, based on the SVM implementation in scikit-learn2.
Electrostatic and volume fields can be calculated and used as descriptors for the compounds. A range of kernels is available for the SVM model, including the sigmoid, RBF, linear, polynomial and precomputed kernels. Tuning of parameters such as C (regularization), gamma (RBF kernel) and epsilon (error tolerance in regression models) is performed using a k-fold grid search on the training set; the parameter combination with the best predictive power is selected automatically, with no need for user intervention.
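In scikit-learn terms, this kind of k-fold grid search over C, gamma and epsilon can be sketched as follows; the parameter ranges and the random stand-in for 3D field descriptors are illustrative, not Flare's actual defaults:

```python
# Illustrative k-fold grid search over SVR hyperparameters with
# scikit-learn; parameter ranges and data are hypothetical.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))  # stand-in for field-based descriptors
y = X @ rng.normal(size=5) + rng.normal(scale=0.2, size=100)

param_grid = {
    "C": [0.1, 1, 10, 100],         # regularization strength
    "gamma": ["scale", 0.01, 0.1],  # RBF kernel width
    "epsilon": [0.05, 0.1, 0.2],    # regression error tolerance
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # best combination found automatically
```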
A range of model performance metrics is provided in Flare. The performance of classification models can be measured from the confusion matrix and associated statistics such as Precision, Recall and Informedness. The performance of regression models can be measured by statistics such as r2/q2, RMSE and Kendall's tau, as well as by inspection of the descriptor landscape via PCA plots. Model performance for the training, cross-validation and test sets is available for visualization in the Flare GUI (Figure 3). As with most other functions in Flare, all aspects can be controlled via the Python API, making automation trivial.
In summary, for most of the QSAR problems encountered on a daily basis in drug discovery, there is little evidence that deep learning methods outperform more traditional machine learning tools such as support vector machines. The new QSAR framework in Flare makes it easier than ever to generate and use 3D-QSAR descriptors to build models with good predictivity and generalizability, and the tools provided give simple guidance on how best to model your data, flag up any issues with data quantity or quality, and apply the predictions to the design of new molecules. If you haven’t already tried developing QSAR models in Flare, ask us for an evaluation!
Figure 3. Visualization of the performance of an SVM model in Flare, demonstrated using a regression task.
References
- Rodríguez-Pérez, R.; Bajorath, J. Evolution of Support Vector Machine and Regression Modeling in Chemoinformatics and Drug Discovery. J. Comput. Aided Mol. Des. 2022, 1–8. Open access under the Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.