Nucleophilicity Prediction via Multivariate Linear Regression Analysis

The concept of nucleophilicity is at the basis of most transformations in chemistry. Understanding and predicting the relative reactivity of different nucleophiles is therefore of paramount importance. Mayr’s nucleophilicity scale likely represents the most complete collection of reactivity data, which currently includes over 1200 nucleophiles. Several attempts have been made to theoretically predict Mayr’s nucleophilicity parameters N based on calculation of molecular properties, but a general model accounting for different classes of nucleophiles could not be obtained so far. We herein show that multivariate linear regression analysis is a suitable tool for obtaining a simple model predicting N for virtually any class of nucleophiles in different solvents for a set of 341 data points. The key descriptors of the model were found to account for the proton affinity, solvation energies, and sterics.


■ INTRODUCTION
The term nucleophile (and electrophile) was first introduced by Ingold in 1933. 1,2 Since then, the field of physical organic chemistry has grown shaping the thinking of organic chemists, and the concept of nucleophilicity established roots becoming one of the basis of organic chemistry. In the attempt to establish general reactivity rules, several groups developed different nucleophilicity scales. The Swain−Scott 3 and the Ritchie 4,5 equations are examples of early contributions in the field. However, a more general scale was more recently found to rely on the Mayr−Patz 6,7 equation (eq 1) where k is the reaction rate constant, N is the nucleophilicity parameter, E is the electrophilicity parameter, and s N is a nucleophile-specific sensitivity parameter. Equation 1 was found to provide linear correlations for more than 1200 nucleophiles and 330 electrophiles over a reactivity range greater than 35 orders of magnitude. 7c The reasons for such long linear correlations and the physical basis of these parameters are still not fully understood. 7d,8 However, the advantages provided by the Mayr−Patz reactivity scale from an experimental perspective are unquestionable. 9 Given N, s N , and E for two reaction partners, it is possible to predict semiquantitatively their reaction rate constant k at 20°C. Moreover, comparison of N for different nucleophiles (or E for different electrophiles) gives direct evaluation of their relative reactivity. The determination of N and s N parameters from experimentally measured rate data for a nucleophile requires the kinetic analysis of the reaction with a series of suitable reference electrophiles of known electrophilicity E. 6a,7d However, this experimental procedure is time-consuming in the context of predicting/screening the reactivity of compounds outside the database. Given the increasing precision of quantum chemical calculations, the kinetics and thermodynamics of most organic reactions can now be evaluated in reasonable time. Therefore, several groups employed computational methods to evaluate Mayr's reactivity parameters for different species. The groups of Houk and Mayr reported that the electrophilicities E of benzhydrylium ions correlate with their affinities for CH 3 − , OH − , and H − . 8a Contreras, Peŕez, Gaźquez, and Fujiyama explored interesting trends between E and N with hard and soft acids and bases theory (HSAB) indices. 10 Fu and Liu evaluated N, s N , and E for a set of nucleophiles and electrophiles by reproducing the experimental procedure in silico via calculation of reaction energy barriers. 8d These studies showed that benzhydrylium electrophilicities E correlate with simple descriptors such as the lowest unoccupied molecular orbital (LUMO) energy or the Mulliken atomic charge at the electrophilic C atom. However, Michael acceptors did not follow the same trend and were shown to correlate with their methyl anion affinities instead. 8a,11 Similarly, nucleophiles provided fair correlations with the highest occupied molecular orbital (HOMO) energy only when divided in structurally similar subsets over a relatively small range of N values. A general correlation accounting for different types of nucleophiles could not be found so far. This is likely due to the structural diversity of nucleophiles in the database with respect to the electrophiles. In fact, these include π-, n-, and σ-nucleophiles of different functional groups, charges (neutral or anionic nucleophiles), and steric properties. Thus, in order to access a general and simple approach for the prediction of nucleophilic reactivity, we reasoned that such a variety of features affecting reactivity would be better described by a multidimensional correlation. Ideally, each electronic, steric, or solvation factor affecting nucleophilicity could be described by an appropriate parameter and embodied into a polynomial equation via multivariate regression analysis as described by Sigman and co-workers. 12 Analysis of the model equation obtained could also provide insights regarding the significance of each descriptor for a given class of nucleophiles, thus deciphering the origin of its reactivity ( Figure 1). 12 According to eq 1, relative nucleophilic reactivities are controlled not only by the nucleophilicity parameter N but also by the susceptibility s N , with the consequence that a nucleophile A which reacts faster than B with one electrophile may react slower than B with another electrophile. This is graphically translated in line crossings in log k versus E correlations. 7a,b However, by using a floating scale of reference electrophiles for determining N according to eq 1, far-reaching extrapolations are avoided with the result that the nucleophilicity parameter N represents a good approximation for average nucleophilic reactivities. Being aware of this limitation, we will focus on the interpretation of N in this work and refer to the Supporting Information for evaluation of N·s N as a possible alternative.

■ RESULTS AND DISCUSSION
The diversity of these nucleophiles poses a challenge for their parameterization as this is most often performed on sets of similar compounds in order to avoid biases given by major perturbations to the structure. 8d,10 How can steric descriptors be calculated consistently for different classes of compounds? For instance, which bond should be chosen for obtaining steric parameters in olefins to obtain meaningful description of sterics when compared to NHCs or amines? To solve this problem, we reasoned that consistent parameters could be calculated from the "electrophile perspective". Therefore, the structure of the protonated nucleophiles (Int, Figure 1) was optimized at the M06-2X/def2TZVP level of theory alongside with the plain nucleophile structures (Nu, Figure 1). In the general structure Int, the proton H represents a probe electrophile which is common for every nucleophile (N is independent from the electrophile) and allows for obtaining consistent parameters through the whole nucleophile data set. The computed parameters include the following ( Figure 1): % V, buried volume of a sphere centered on the electrophile H in Int; 13 B1 and B5, sterimol minimum and maximum width of the nucleophile, respectively, measured from the H−Nu bond in Int considering atomic van der Waals radii (i.e., least and most distant points from the Nu−H bond trajectory of the van der Waals isosurface's projection on the plane orthogonal to Nu−H); L, 14 sterimol length of the nucleophile measured from the H−Nu bond in Int; 14 NBO and MK, natural bond orbital 15 and Merz−Kollman 16 charges of the nucleophilic atom in Nu, respectively; e HOMO and e LUMO , HOMO and LUMO energies of Nu, respectively; e + HOMO and e + LUMO , HOMO and LUMO energies of Int, respectively; 17 μ and η, chemical potential and hardness of Nu according to HSAB, respectively; 18 E PA , protonation energy of Nu (calculated as the difference between the electronic energy of Int and Nu), that is, the negative of the proton affinity; S Nu , S Int , and S H , solvation energy of Nu, Int, and H + , respectively, calculated by single-point energy calculation at the M06-2X/def2TZVP [SMD = solvent] level on the M06-2X/def2TZVP structure optimized in the gas phase. 19 Additional convenient descriptors were added to the parameter matrix in order to provide possible corrections to the models. These are q, the nucleophile charge accounting for rough electrostatic effects, and ε, the solvent dielectric constant.

Chart 1. Nucleophiles Included in This Study
The Journal of Organic Chemistry pubs.acs.org/joc Article full list of the nucleophiles included in this study is given in Chart 1. Preliminary analysis of the computed descriptors via singleparameter correlation against N showed that E PA provides a good basis for further multidimensional modeling. The plot reported in Figure 2a shows a fair correlation (R 2 = 0.86) when removing anionic and N(sp 3 )-nucleophiles from the data set. Interestingly, this latter subset gave a parallel trend with a similar slope. Anionic nucleophiles follow a separate trend and do not correlate. As recently shown by Van Vranken in the context of electrophilicity E, 11b this is due to stronger solvation effects at place in the case of charged species. Upon addition of solvation descriptors S Nu and S Int to E PA , the raw model in Figure 2b is obtained (eq 2). Thus, inclusion of solvent effects increased the quality of the fit (R 2 = 0.87) and allowed maintaining all the nucleophiles into the same correlation. Aiming at deeper understanding of the factors implicated in the correlation between N and E PA , we performed a multivariate linear regression analysis using E PA as the observable and the other acquired descriptors as the parameters. Since E PA is the difference between the energies of Int and Nu by definition, it is not surprising that the regression rendered a linear combination of e HOMO and e + LUMO with similar coefficients (R 2 = 0.98, Figure 3). 17 This suggests that E PA (and therefore N) depends not only on the activation of the reagent Nu (high e HOMO ) but also on the electronic stabilization of the product Int (high e + LUMO ). The highlighted data points in the plot (orange crosses) identify those cases in which the LUMO of Int does not correspond to the reacting MO. These can be corrected to the shown values by changing such e + LUMO values with the energy of the MO corresponding to the formed σ* orbital (see the Supporting Information). Interestingly, even though e HOMO and e + LUMO are related (e + LUMO is obtained by mixing e HOMO into the electrophile's LUMO), these two parameters do not correlate among them for the whole data set (see the Supporting Information). Equation 2 in Figure 2b allows establishing a simple linear free-energy relationship. According to the thermodynamic cycle for a generic protonation reaction in the solvent phase (s), eq 3 holds, where thermal and entropic factors are neglected in E PA . 30

Chart 1. continued
The Journal of Organic Chemistry pubs.acs.org/joc Article Substitution with N according to eq 2 gives eq 4, which upon differentiation with respect to the nucleophile structural variation gives eq 5 Equation 5 is reminiscent of the linear free-energy relationship αδΔG = δΔG ⧧ , where N, being proportional to log k, is also related to ΔG ⧧ (s). Thus, eq 2 correlates N with a set of parameters reproducing the nucleophile computed relative pK a 's. This could explain why S-nucleophiles are moderate outliers in the plot in Figure 2b as these are known to follow a reversed H-versus C-basicity trend with respect to O-nucleophiles, for instance. 31a This is likely due to solvation factors as Arnett showed that in the gas phase, Me 2 S and MeSH have higher proton affinity than Me 2 O and MeOH, respectively. 31b For this reason, the methyl cation affinity was computed for a subset of 96 nucleophile/solvent combinations including all the S-nucleophiles. However, these were found to follow a similar trend to the proton affinities without providing any improvement (see the Supporting Information).
The correlation in Figure 2b is in line with previous work by Bordwell and co-workers, who found a correlation between the nucleophilicity toward butyl chloride and the pK a of a set of fluorene-derived anionic nucleophiles. 32 Interestingly, they also found that different sets of nucleophiles follow different correlations depending on steric effects. 32 Mayr and coworkers found several correlations between N and experimental pK a values. However, they showed in different instances that such correlations are dependent on several experimental factors including the steric and solvent. 33 This prompted us to include additional parameters into our correlation via multidimensional correlation analysis in order to access an increased predictive power.
We divided randomly our data set into a training set (222 data points) and an external validation set (119 data points). Submitting the N array and the parameter matrix of the training set to multivariate linear regression resulted in the model depicted in Figure 4 (eq 6). In addition to E PA , S Nu , and S Int , the parameters q, B1, % V, e HOMO , ε, and S H were found to be relevant descriptors for N (see the Supporting The Journal of Organic Chemistry pubs.acs.org/joc Article Information). B1 and % V describe the steric hindrance of the nucleophile and suggest this to impact its reactivity. The magnitude of the coefficients (0.18 and −0.17 for B1 and % V, respectively) suggests that steric effects are much less important than electronics in this instance. It is worth mentioning that the combination of steric parameters rendered by the multivariate regression procedure is not a general representation of the nucleophile steric hindrance. This is related to the steric effects arising when the nucleophile is combined with benzhydrylium cations (or structurally related quinone methides). Different steric effects may be expected to impact the reaction rate when used in combination with a different set of electrophiles! 34 Such effects also include noncovalent interactions that can be specific for each nucleophile−electrophile pair. 35 Additional considerations can be made concerning the inclusion of the other parameters. The correlation in Figure 2b, on which the multidimensional model is based, is purely thermodynamic in nature. That is, the kinetic behavior of different nucleophiles is correlated with a thermodynamic property: the pK a . This is a limitation as it is known that in some cases, Marcus' intrinsic barriers ΔG 0 ⧧ are not colinear and not negligible with respect to the reaction free energy ΔG 0 . This plays a crucial role in ambident reactivity. 29d, 36 Mayr's N values also bring information regarding the solvation energy of the reference electrophile, the E value of which is defined to be independent from the solvent. In addition, N, s N , and E are derived via fitting of experimental k-values, and the parameters calculated hereby are dependent on the used computational method. These procedures do not come without errors. These considerations provide a rationalization for the requirement for additional parameters that are needed to provide fine tuning of the model (q, e HOMO , ε, and S H ).
Our model shows an excellent quality of fit (R 2 = 0.935, Figure 4a) and robustness as suggested by cross validation analyses (leave-one-out, Q 2 = 0.927 and average-2-fold, Q 2 = 0.922). Moreover, the root-mean-square error (RMSE) associated to the fitting is as low as 1.41, suggesting good predictive quality. The much lower RMSE given by eq 6 compared to the one provided by eq 2 (2.56) supports the requirement for the additional parameters for fine tuning of the nucleophilicity description.
Prediction of the nucleophilicities of the validation set with the model eq 6 resulted in a fit with similar properties to the training set (R 2 = 0.946 and RMSE = 1.40). The plot of the residuals for both training (circles) and validation sets (crosses) is reported in Figure 4b and divided in subsets according with the color legend in Figure 2b. The residual plot shows that the error roughly increases with the nucleophilicity, thus making prediction of anionic nucleophiles more challenging. Modeling of the product N·s N showed that this could arise from neglecting s N for strong nucleophiles as the sensitivity parameter becomes more important as the nucleophilicity N increases (see the Supporting Information). The RMSE for each class of nucleophiles is also reported in Figure 4. Overall, the nucleophilicity of 295 nucleophiles (87% of the set explored) could be predicted with an error <2.5 units.

■ CONCLUSIONS
In summary, we have shown that parameterization of a variety of nucleophiles featuring a wide structural diversity is possible. This adds to recent advances into the development of holistic multidimensional models. 37 Multivariate linear regression analysis of Mayr's nucleophilicity N with the acquired parameters allowed establishing a multidimensional model accounting for the reactivity of >300 nucleophiles over a range of 35 orders of magnitude. The model is statistically sound and provides high predicting power (RMSE < 1.5). Moreover, the descriptors appearing in the model are informative regarding the key factors affecting nucleophilicity. The nucleophile proton affinity (−E PA ) and the solvation energy of both the nucleophile (S Nu ) and the addition product (S Int ) appear as the main parameters. As these were shown to coincide with the pK a , this suggests that thermodynamics is a key factor contributing to the reaction rate constant k in agreement with the Bell−Evans−Polanyi principle.
However, other minor parameters were rendered by multidimensional correlation analysis, which account for several effects such as solvation, sterics, and electrostatic attraction that might play a role at the level of intrinsic activation energies (e.g., deviations from the Bell−Evans− Polanyi principle according with the Marcus theory). We also showed that E PA can be decomposed into two major  Due to the high impact that Mayr's reactivity scale has been having since its development, we foresee that this work will support its implementation in the chemists' everyday effort toward exploring and rationalizing new chemical reactivity.

■ EXPERIMENTAL SECTION
All computations were performed using the Gaussian 16 38 suite of programs. The nucleophile conformational space was explored manually as most nucleophiles included in this study are small, simple molecules with only few rotatable bonds. Geometry optimizations, vibrational frequencies, and intensities were calculated at the M06-2X/def2TZVP level of theory. 39 All the structures computed in this study are minima as confirmed by vibrational analysis showing no imaginary frequencies. NBO charges were computed using NBO 3.1 implemented in Gaussian 16. 40 Buried volumes were computed using SambVca 2.1. 41 Solvation energies were computed by single-point energy calculation on the M06-2X/ def2TZVP geometry at the same theory level with the addition of the SMD 42 solvation model implemented in Gaussian 16. Multidimensional regression analyses were performed using Matlab. 43