Predicting Kinase Inhibitor Resistance: Physics-Based and Data-Driven Approaches

Resistance to small molecule drugs often emerges in cancer cells, viruses, and bacteria as a result of the evolutionary pressure exerted by the therapy. Protein mutations that directly impair drug binding are frequently involved in resistance, and the ability to anticipate these mutations would be beneficial in drug development and clinical practice. Here, we evaluate the ability of three distinct computational methods to predict ligand binding affinity changes upon protein mutation for the cancer target Abl kinase. These structure-based approaches rely on first-principle statistical mechanics, mixed physics- and knowledge-based potentials, and machine learning, and were able to estimate binding affinity changes and identify resistant mutations with remarkable accuracy. We expect that these complementary approaches will enable the routine prediction of resistance-causing mutations in a variety of other target proteins.


■ INTRODUCTION
Drug resistance is one of the chief challenges to be overcome in the development of robust anticancer and antimicrobial therapies. While resistance can emerge via multiple mechanisms, such as increased drug efflux or activation of alternative signaling pathways, it is often caused by protein mutations directly impacting drug binding. 1,2 Anticipating these resistance-causing mutations is of interest for personalized medicine, as it would inform treatment decisions based on the patient's genotype 3,4 and aid the development of combination therapies. It would also benefit drug development by allowing the parallel exploration of inhibitors with different resistance profiles. While large-scale experimentation is feasible, 5 it is neither cheap nor convenient, and accurate computer predictions mitigate the need for systematic experimentation.
Protein kinases are among the most exploited drug targets, with 48 inhibitors approved to date in the United States. 6 The majority of these inhibitors target tyrosine kinases, 6 which play a critical role in the modulation of growth factor signaling. 7,8 Hence, tyrosine kinase inhibitors (TKIs) are employed against a number of malignancies, like chronic myelogenous leukemia (CML) and nonsmall-cell lung cancer. 7,8 In particular, TKIs targeting the human kinase Abl are the first-line therapy for the treatment of CML. 9 However, susceptibility to resistance requires continued development of new-generation inhibitors. 8 For instance, in nonsmall-cell lung cancer, acquired resistance inevitably occurs within 1−2 years of starting the therapy. 10 In CML, more than 25% of patients switch TKI due to intolerance or resistance, 11 the latter being most often caused by mutations in Abl. 8 Because kinases display a long tail of rare and uncharacterized mutations, 12 the sensitivity of many clinically identified kinase mutants to TKI treatment is often unknown. Thus, rapid and accurate computational approaches could impact clinical decision-making by providing oncologists with a first indication of whether the observed mutation is likely to cause resistance to certain inhibitors.
Here, we show how both physics-based and data-driven computational approaches can be used to accurately estimate the change in affinity of TKIs for the human kinase Abl caused by single-point mutations. To test the different methodologies, we used a data set of 144 Abl:TKI affinity changes (ΔΔG) across 8 TKIs caused by 31 Abl mutations (Figure 1), which was compiled by Hauser et al. 12 using publicly available cell viability IC 50 data. The computational approaches tested were based on (i) molecular dynamics (MD) simulations 13 with a nonequilibrium free energy calculation protocol; 14−16 (ii) Rosetta, a modeling program that uses mixed physics-and knowledge-based potentials; 17,18 and (iii) machine learning (ML). 19 For completeness, we discuss our results also in light of those obtained by Hauser et al. 12 with the OPLS3 force field 20 and a MD-based approach similar to ours (Table 1).

■ RESULTS AND DISCUSSION
The performance of all calculations was assessed by the rootmean-square error (RMSE) between calculated and experimentally measured ΔΔG values, the Pearson correlation coefficient (r), and the area under the precision recall curve (AUPRC). 21 The latter measures the ability to classify mutations as resistant or susceptible. In particular, precision measures which fraction of mutations classified as resistant are actually resistant, whereas recall measures which fraction of resistant mutations are correctly classified as resistant. Consistent with previous work, 12 resistant mutations were defined as those causing more than a 10-fold drop in affinity (ΔΔG exp > 1.36 kcal/mol). On this data set, given the fraction of resistant mutations (Figure 1c), a random classifier would return an AUPRC of 0.13.
First, we estimated changes in TKI affinity with an approach based on the first-principles of statistical mechanics. MD simulations were used to sample the isothermal−isobaric ensemble while a specific residue was "alchemically" mutated 14 into another one. From the nonequilibrium work required to transform the residue, it is possible to recover the equilibrium free energy difference 22,23 and, using a suitable thermodynamic cycle ( Figure S1), a rigorous estimate of ΔΔG. We tested multiple force fields, and, on this data set, Amber force fields performed better than Charmm ( Figure S2, Table S1); halogen bond modeling did not improve the Charmm results ( Figure  S3). For simplicity, we discuss only the results obtained with the best-performing Amber (A99) 24−27 and Charmm (C22) 28−31 force fields ( Table 1). The performance of A99 was comparable to that of OP3 (Figure 2), 12 which was obtained by Hauser et al. 12 with the FEP+ program by Schrodinger Inc. and the recently optimized proprietary OPLS3 force field. 20 C22 performed significantly worse than A99 and OP3 in terms of correlation and classification ability. Surprisingly, combining force fields in a consensus approach (considering half of the simulations for each parent force field) did not generally improve the results ( Figure S4). It is conceivable that, on this data set, the benefits of the additional sampling obtained with one force field outweighs those expected by combining multiple force fields. 16 We also did not observe a dependence of the accuracy of the calculation on the net charge change of the system upon mutation ( Figure  S5).
While our free energy protocol used a total of 216 ns of MD per ΔΔG estimate, 16 OP3 used 720 ns for charge-changing mutations, and 360 ns for charge-conserving ones. 12 We thus also tested a more expensive protocol A ( 99 ), which matched exactly the simulation time invested by OP3. A99 improved upon the uncertainty ( Figure S6), but not upon the accuracy (Figure 2), of the A99 estimates. Lower uncertainty is expected when more calculation repeats are used as in this case. On the other hand, accuracy is not necessarily proportional to the length of the simulations, as longer simulations allow for better sampling but also larger deviation from the experimentally determined structure in a potentially artificial fashion. Overall, this rigorous physics-based approach returned accurate ΔΔG estimates (e.g., for A99, RMSE = 0.91 kcal/mol, r = 0.44, and AUPRC = 0.56) and was able to discriminate between resistant and susceptible mutants well. However, a strong dependence of the results on the force field was observed. In addition, compared to the approaches discussed later, the MD based calculations required considerable computational resources (∼2.5 days on 10 CPU cores and one GPU per ΔΔG estimate).
As an approach that is neither rigorously physics-based nor purely statistical, we tested a recently proposed Rosetta protocol 18 that was found to perform well also on protein− ligand binding. 16 This method samples multiple conformations of wild-type and mutant proteins with a Monte Carlo algorithm and estimates ΔΔG with the all-atom Rosetta energy function, 17 a mixed knowledge-and physics-based potential, by averaging over the generated wild-type and mutant ensembles. We found this approach (R15) to be highly accurate on this data set ( Figure 2)with low absolute errors (RMSE = 0.72 kcal/mol), strong correlation (r = 0.67), and good classification performance (AUPRC = 0.53)while requiring only moderate computational resources (∼32 h on a single CPU core per ΔΔG estimate). In addition to the standard REF15 scoring function, we tested the βNOV16 function (R16), which returned slightly worse results than R15 for this system ( Figure 2). As had been noticed before, 16 here too, complementing Rosetta with the MD results via simple averaging generally resulted in enhanced performance ( Figure   S7). For instance, combining A99 and R15, we obtained a particularly strong consensus estimate ( Figure S8) with RMSE = 0.62 kcal/mol, r = 0.71, and AUPRC = 0.61. This can be attributed to the complementary performance of the two approaches: while the MD based calculations showed better precision performance and a lower number of false positives, Rosetta displayed better recall and a lower number of false negatives (Table S2, Figure 2).
As a purely data-driven approach, we employed extremely randomized regression trees, 19 a ML technique that uses ensembles of decision trees similarly to random forest. As input features, we used terms derived from the crystallographic protein−ligand structures (e.g., hydrogen bonds, nonpolar contacts, residue-ligand distance), ligand and residue physicochemical properties, and fast empirical scoring functions. 32−34 However, the way information from these sources is transformed to produce a ΔΔG estimate does not rely on any specific physical model. The technical details of the ML model construction are provided in the SI Methods. An advantage of this statistical approach is speed: once the necessary input features have been computed, a ΔΔG estimate can be obtained in seconds.
The ML model was trained and validated on a subset of the Platinum database, 35 and then tested on the TKI data set. The training/validation set contained 484 point-mutations with associated ΔΔG values across 84 proteins and 143 ligands. We computed a total of 128 features, and then used a greedy algorithm to select any number of features, up to 40, which minimized the mean-squared error of 10-fold cross-validation. The folds were built such that each of them would contain a unique set of proteins not present in the other folds. This procedure identified an optimal set of 29 features (Data S3). After training, the model was tested on the TKI data set (ML1 in Figure 2). Given no tyrosine kinase was present in the training set, we effectively assessed whether the model could extrapolate to this protein target. ML1 performed well in terms of RMSE (0.87 kcal/mol) but was not significantly better than random in terms of correlation and classification ability (r = 0.12 -0.04 0.29 , AUPRC = 0.20 0.10 0.39 ), and was overall comparable to the MD results obtained with the Charmm force fields ( Figure  2, Table S1). These results are similar to those obtained with

ACS Central Science
Research Article  12 and an equivalent cutoff (1.36 kcal/mol) for the calculated ΔΔG values. Each ΔΔG estimate is color-coded according to its absolute error with respect to the experimental ΔΔG value; at 300 K, an error of 1.4 kcal/mol corresponds to a ∼10-fold error in K d change, and an error of 2.8 kcal/mol to a ∼100-fold error in K d change. (b) Summary of the performance of the ΔΔG estimates across approaches in terms of RMSE, Pearson correlation, and AUPRC; point estimates from the original samples and 95% bootstrapped confidence intervals are shown (SI Methods). Differences at three levels of significance are reported using labels within the chart: e.g., a "C22**" label above the RMSE mark of OP3 indicates that the RMSE of OP3 is significantly lower (i.e., better agreement with experiment) than that of C22 at α = 0.05.
another ML model 36 based on Gaussian process regression ( Figure S9).
Given that ML is expected to perform best when trained on the most relevant data, 37 we asked whether the model performance would improve when trained on tyrosine kinase mutations only. In this case, we effectively assessed the interpolative ability of the model, as it was trained and tested on the TKI data set (ML2). To do this, we used 8-fold nested cross-validation. At each iteration, the model selected a subset of the 128 features via 7-fold cross validation with 7 inhibitors and was then tested on the eighth. This was done 8 times to obtain predictions for all TKIs and thus all entries in the data set. ML2 accurately estimated ΔΔG values and could discriminate well between resistant and susceptible mutations (RMSE = 0.68 kcal/mol, r = 0.57, AUPRC = 0.47).
If the objective is only to identify resistant mutations 38 rather than assessing their impact on ligand binding quantitatively, then AUPRC is the most relevant performance measure to examine. In this regard, MD based calculations with the OPLS3 or Amber99 force fields (OP3, A99, A99 ), and Rosetta with the REF15 scoring function (R15) provided the best performance ( Figure 3, Figure S10). Furthermore, in this scenario, one is likely willing to accept a larger fraction of false positives if this means recovering more resistant mutations. Thus, a suitable consensus strategy is to select the most positive ΔΔG estimate among those available to maximize the recall (at the expense of precision) at a given threshold ( Figure S11). Using this approach with A99 and R15 returns an AUPRC of 0.62 and, at the conventional threshold of ΔΔG calc > 1.36 kcal/mol, a recall of 0.79 and a precision of 0.48 ( Figure 3; Table S2). This means that about half of the mutations classified as resistant are in fact resistant, and 79% of the resistant mutations in the data set have been identified.
From the results presented, it emerges that the latest computational approaches, from physics-based to data-driven ones, are able to predict TKI affinity changes upon Abl mutation and identify resistance-causing mutations. However, the different methods have different strengths and weaknesses. The MD-based calculations achieved a remarkable performance considering that force field parameters are based on fundamental physical properties of organic molecules (e.g., quantum mechanical energies and electrostatic potentials, liquid densities, and enthalpies of vaporization), rather than binding affinities directly. This, together with the rigorous statistical mechanical treatment, might lie at the basis of the demonstrated generalizability 16,39 of the approach across protein−ligand systems. The downside is the comparably higher computational cost required to obtain precise ΔΔG estimates. On the other hand, ML predictions proved accurate only when the binding affinities for Abl:TKI were used directly for training. With this information, however, a ΔΔG estimate could be generated in a matter of seconds to minutes on a single CPU core, with most of the time spent collecting the necessary structure-based features. Thus, the optimal approach choice in a prospective scenario will depend on the amount of prior information available for the system of interest, the nature of the study (a large-scale scan versus an in-depth study of a few mutations), as well as the computational resources at hand. That being said, we found that Rosetta struck a good balance between accuracy, generalizability, and computational cost. Similar to MD, this approach considers an ensemble of protein−ligand conformations, 18 and its scoring function is based on simple physical and structural properties of organic molecules and biomolecules, 17 so that it does not require binding affinities to the system of interest for training. In addition, approximations such as the use of implicit solvation and a more limited conformational sampling result in a moderate computational cost.

■ CONCLUSION
In summary, we have shown how fundamentally different structure-based approaches are able to predict TKI resistance to similar or better accuracy than reported thus far. Taking advantage of multiple techniques in a consensus fashion resulted in remarkable accuracy and tuning of the classification performance. These results suggest that a diverse set of computational approaches, each with its own strengths and weaknesses, is now available for the reliable estimate of resistance-causing protein mutations. It is our hope that the availability of multiple complementary approaches will enable the routine prediction of resistance-causing mutations across many other protein targets.

* S Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscentsci.9b00590.
Methods, including supporting figures and data tables (PDF) Figure 3. Precision recall curves for selected approaches. A99 and R15 have been combined to give two consensus results: in "avg(A99,R15)", the results of A99 and R15 have been averaged; in "max(A99,R15)", for each mutation, the most positive ΔΔG estimate among A99 and R15 was selected. The curve for a random estimator is shown as a dashed black line (baseline with AUPRC of 0.13). The precision and recall corresponding to the conventional threshold of ΔΔG calc > 1.36 kcal/mol is reported and marked by a circle on the curves.

Research Article
Data S1: detailed information on the data set and numerical results for all calculations (XLSX) Data S2: input files pertaining to the moleculardynamics-based free energy calculations and Rosetta calculations (ZIP) Data S3: input files and Jupyter notebooks used for the training and testing of the machine learning model.