Predictive Multivariate Linear Regression Analysis Guides Successful Catalytic Enantioselective Minisci Reactions of Diazines

The Minisci reaction is one of the most direct and versatile methods for forging new carbon–carbon bonds onto basic heteroarenes: a broad subset of compounds ubiquitous in medicinal chemistry. While many Minisci-type reactions result in new stereocenters, control of the absolute stereochemistry has proved challenging. An asymmetric variant was recently realized using chiral phosphoric acid catalysis, although in that study the substrates were limited to quinolines and pyridines. Mechanistic uncertainties and nonobvious enantioselectivity trends made the task of extending the reaction to important new substrate classes challenging and time-intensive. Herein, we describe an approach to address this problem through rigorous analysis of the reaction landscape guided by a carefully designed reaction data set and facilitated through multivariate linear regression (MLR) analysis. These techniques permitted the development of mechanistically informative correlations providing the basis to transfer enantioselectivity outcomes to new reaction components, ultimately predicting pyrimidines to be particularly amenable to the protocol. The predictions of enantioselectivity outcomes for these valuable, pharmaceutically relevant motifs were remarkably accurate in most cases and resulted in a comprehensive exploration of scope, significantly expanding the utility and versatility of this methodology. This successful outcome is a powerful demonstration of the benefits of utilizing MLR analysis as a predictive platform for effective and efficient reaction scope exploration across substrate classes.

The crude product was purified via flash column chromatography (eluting with a gradient of EtOAc to 1.5% MeOH/EtOAc) to give the title product as a white solid (29.0 mg, 0.120 mmol, 60% yield, 78% ee).
The absolute configuration was determined to be (S). The structure was deposited in the Cambridge Crystallographic Data Centre (deposition no.: CCDC 1924476). The absolute stereochemistry of all other diazine products in the scope has been assigned by analogy.

Computational Methods
Model catalyst structures were optimized with constraints in the gas phase with the M06-2X density functional [21] and the triple-ζ valence quality def2-TZVP basis set of Weigend and Ahlrichs, [22] as implemented in Gaussian 09 (revision D.01). [23] The torsion angle defined by atoms 2, 1, 1' and 2', where 2 and 2' are the oxygen-bearing carbons, was constrained to 60° as this reproduces well the geometric effect of the BINOL backbone. [24] All optimized geometries were verified by frequency computations as minima (zero imaginary frequencies). Parameters were acquired from these ground-state structures.
NBO charges were calculated using NBO6 as implemented in Gaussian 09 [25] at the same level of theory. Sterimol values were calculated using a modified version of Paton's Python script. [26] Multidimensional regression analyses were performed using MATLAB®. [27] The same procedure was applied to substrates A-G and any out-of-sample prediction platform.
Conformational searches were performed with MacroModel version 11.7. [28]

Parameters Collected
The parameters calculated and considered for the systems are reported in Tables S1-S26.

Catalyst parameter tables
Table S1. Torsion parameters.
Table S2. Sterimol values using Bondi radii collected from the C1-C3 positions of the aromatic ring.

Quinoline and pyridine products (matrix data set)
Quinoline and pyridine products (prediction set)
Table S33. Sterimol parameters collected from the heterocycle.

Model development
Measured ΔΔG‡ values were calculated using the formula ΔΔG‡ = −RT ln(er), where R is the gas constant, T is the temperature (298.15 K), and er is the enantiomeric ratio. Linear regression models were developed using an in-house script implemented in MATLAB® (version R2018b) to obtain the predicted ΔΔG‡. [33] A good linear correlation (R² close to 1.0 and intercept close to 0.0) between the predicted and measured ΔΔG‡ indicates that the obtained model adequately approximates the system under study. Because the model search process can produce a large pool of model candidates, we truncated the pool based on the recorded statistics, the number of included parameters, and the presence of cross-terms, as this allows for a mechanistically informative interrogation. Model development is an iterative process in which the "best" model is assessed in various ways at each stage, as described in Figure S2. Another method to test model robustness is k-fold cross-validation. In this validation method, the data set is divided randomly into k subsets of equal or similar size; each subset is then predicted by a model fitted to the other k−1 subsets, and the goodness of fit is evaluated from these predictions.
However, it is important to note that k-fold statistics depend on how the data set is partitioned, so the process must be executed multiple times and the results averaged.
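The repeated k-fold procedure described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' in-house MATLAB script: the function names, the synthetic two-descriptor data, and the use of a cross-validated R² (1 − PRESS/SS_tot) as the reported score are all assumptions made for the example.

```python
import random
import statistics


def fit_mlr(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(coef, x):
    return coef[0] + sum(c * xi for c, xi in zip(coef[1:], x))


def kfold_score(X, y, k, rng):
    """One random k-fold partition; returns 1 - PRESS / SS_tot."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k subsets of equal or similar size
    preds = [0.0] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        coef = fit_mlr([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            preds[i] = predict(coef, X[i])
    mean_y = statistics.fmean(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


def repeated_kfold(X, y, k=4, repeats=20, seed=0):
    """Average the k-fold score over several random partitions."""
    rng = random.Random(seed)
    scores = [kfold_score(X, y, k, rng) for _ in range(repeats)]
    return statistics.fmean(scores), scores


# Synthetic data (hypothetical): two descriptors, a linear response, small noise.
X = [(float(i), float((i * i) % 7)) for i in range(12)]
y = [0.5 + 1.2 * a - 0.7 * b + 0.05 * (-1) ** i for i, (a, b) in enumerate(X)]

mean_score, scores = repeated_kfold(X, y)
print(f"mean k-fold score = {mean_score:.4f}, "
      f"range = {min(scores):.4f} to {max(scores):.4f}")
```

The spread between the minimum and maximum scores reflects the partition dependence noted above; averaging over repeats gives a more stable estimate.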
The training reactions were split 70:30 into training set (TS) and validation set (VS) for external validation. The split was partitioned based on the response values using the MATLAB "equidistant" function. This process can be described as pseudorandom as it is generated using a deterministic algorithm. Since we are searching for general models that are effective in predicting out-of-sample, the lead models from this process are then subjected to additional rounds of external validation. For this purpose, we performed additional tests with two out-of-sample prediction platforms (catalysts not included; some catalyst and substrate components not included; see page S155). The top model (model 1 on page S151) with the lead statistical scores (R², LOO, k-fold, out-of-sample) from this process was taken forward for virtual screening with the diazine set (out-of-sample prediction set 3, see page S156), and we found that the predictions were remarkably accurate.

Figure S2. Flow-chart describing the model development process.
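A response-based 70:30 split of the kind described above might look like the sketch below. The implementation of the MATLAB "equidistant" function is not given in this section, so the logic here (rank the reactions by response and take validation members at evenly spaced ranks) is only one plausible reading, and the function name `equidistant_split` and the example responses are hypothetical.

```python
def equidistant_split(responses, validation_fraction=0.3):
    """Deterministic split: validation members are taken at evenly spaced
    ranks of the sorted response values (an assumed interpretation)."""
    n = len(responses)
    n_val = round(n * validation_fraction)
    if not 0 < n_val < n:
        raise ValueError("validation_fraction leaves an empty set")
    # Indices of the reactions ordered from lowest to highest response
    order = sorted(range(n), key=lambda i: responses[i])
    step = n / n_val
    # Centre of each of n_val equal-width rank intervals
    val_ranks = {int(step * (j + 0.5)) for j in range(n_val)}
    validation = [order[r] for r in sorted(val_ranks)]
    training = [order[r] for r in range(n) if r not in val_ranks]
    return training, validation


# Hypothetical response values (e.g. measured selectivities in kcal/mol)
responses = [1.2, 0.3, 0.9, 1.8, 0.1, 1.5, 0.6, 2.1, 1.0, 0.4]
training, validation = equidistant_split(responses)
print("training indices:  ", training)
print("validation indices:", validation)
```

Because no random numbers are involved, repeating the call returns the same partition, and the validation set spans the full response range rather than clustering at one end.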
The discarded models at step 6, which were only slightly less statistically significant, were also found to predict the diazine data set quite accurately with different parameters. This suggests that more than one set of parameters can account for the observations. However, using a correlation matrix we can determine that the parameters measure the same general properties, meaning that basing the mechanistic interpretation on a single model does not affect the overall analysis. The top 10 models as described by their statistical scores and the correlation matrix are shown on pages S157 and S158. Ultimately, this suggests that alternative models can capture the same mechanistic features and have similar prediction capabilities. The use of a single model, however, is much more straightforward for analysis and as a prediction platform. Furthermore, this demonstrates that the workflow described in Figure S2 is effective in producing models with sufficient generality.

Training Set Design
Catalyst/Substrate Matrix

Parameter Identification and Acquisition
Steric and Electronic molecular descriptors

Out-of-sample predictions
We evaluated the ability to transfer the mechanistic principles leading to enantioselective catalysis captured by the statistical model to genuinely different structural motifs not contained in the training dataset. The workflow for ee prediction is straightforward and is initiated by locating the ground state of the targeted reaction variable by DFT computation, collecting the requisite parameters and submitting them to the equation as pictorially described below.
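The final step of this workflow, submitting the collected parameters to the regression equation and converting the predicted ΔΔG‡ into an ee, can be sketched as below. The descriptor names, coefficients, and intercept are placeholders invented for illustration and are not the values of model 1; the sign convention follows the formula ΔΔG‡ = −RT ln(er) with er taken as major/minor, which is an assumption about how er is defined.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal mol^-1 K^-1
T_KELVIN = 298.15     # temperature used for the measured values


def predict_ddg(intercept, coefficients, descriptors):
    """Evaluate a linear model: intercept + sum(coefficient * descriptor)."""
    return intercept + sum(coefficients[name] * descriptors[name]
                           for name in coefficients)


def ddg_to_ee(ddg, temperature=T_KELVIN):
    """Invert ddG = -RT ln(er) and convert er (major/minor) to % ee."""
    er = math.exp(-ddg / (R_KCAL * temperature))
    return 100.0 * (er - 1.0) / (er + 1.0)


# Placeholder model and DFT-derived descriptor values (hypothetical numbers)
coefficients = {"catalyst_steric": -0.60, "substrate_steric": -0.30}
descriptors = {"catalyst_steric": 1.5, "substrate_steric": 1.0}

ddg = predict_ddg(-0.30, coefficients, descriptors)
ee = ddg_to_ee(ddg)
print(f"predicted ddG = {ddg:.2f} kcal/mol, predicted ee = {ee:.0f}%")
```

Under this convention a more negative predicted ΔΔG‡ corresponds to a higher predicted ee, and a value of zero corresponds to a racemic product.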

Alternative Models and Correlation Maps
Catalyst terms highlighted in blue and product terms in red.

The catalyst correlation map indicates that each model (1-10) emphasises the importance of large substituents at the 2 and 6 positions for high levels of enantioselectivity through a variety of descriptors. B1whole, Lwhole and iPOas are highly correlated, suggesting they capture comparable structural effects. Similar relationships between parameters can be observed with the torsion angle,  and NBOC2. The substrate parameters are more conserved in models 1-10 than those of the catalyst; however, variance is detected in the description of the proximal N-heterocycle steric profile. The correlation map shows that this molecular feature can be described through B5wholeNHet, LNHetC6 and B1NHetC6. Thus, by combining the interpretation of models 1-10, the incorporated parameters suggest that large catalyst and N-heterocycle substituents, in addition to the collective structural effects described by the NBORAEC1 term, are important in determining the enantioselectivity.
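A correlation map of this kind can be built from pairwise Pearson coefficients between descriptor columns. The following sketch is an illustration only: the descriptor names and values are invented placeholders, and the 0.8 cutoff for calling two descriptors "highly correlated" is an assumed threshold, not one stated in this section.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named descriptor columns."""
    names = list(columns)
    return {p: {q: pearson(columns[p], columns[q]) for q in names}
            for p in names}


def correlated_pairs(matrix, threshold=0.8):
    """Descriptor pairs whose |r| meets the (assumed) threshold."""
    names = list(matrix)
    return [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
            if abs(matrix[p][q]) >= threshold]


# Placeholder descriptor values for five catalysts (hypothetical numbers)
columns = {
    "desc_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "desc_3": [5.0, 1.0, 4.0, 2.0, 3.0],
}

matrix = correlation_matrix(columns)
pairs = correlated_pairs(matrix)
print("highly correlated pairs:", pairs)
```

Descriptors that appear together in such a list can be read as alternative measures of the same structural effect, which is the basis for treating models with different parameters as mechanistically equivalent.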

Regioselectivity data
A correlation of the C2:C4 isomeric ratio (rr) with the enantioselectivity of the product reveals a linear relationship in which the rr generally increases as the ee increases (see below).

Result with non-aromatic derived chiral phosphoric acid catalysts
In our analyses we collect a diverse array of molecular descriptor values from DFT-optimized geometries to describe the structural features of the substrate and catalyst. Unfortunately, the lack of structural commonality for particular molecular subsets creates a challenge in identifying readily comprehensible and extensive parameter sets for each of these components. For example, when comparing catalysts with aromatic substituents at the 3 and 3' positions, it is apparent that they have overlapping and distinctive features that are probably required for determining selectivity patterns.
By contrast, the inclusion of non-aromatic derived catalysts reduces the commonality in substructure and consequently results in a decrease in the important feature space. Despite this, model 2 on page S157 contains features that are common to both aromatic and non-aromatic derived catalysts and therefore may be able to gauge the effect of the latter catalyst class. To test this, the reaction yielding product A catalysed by TIPSY (3,3' = SiPh3) was performed, providing A in 75% ee. This result was used to validate model 2 as shown below. The difference between the predicted (86% ee) and experimental (75% ee) values corresponds to a ΔΔG‡ error of 0.38 kcal/mol. This suggests that the secondary model, which contains features common to both catalyst classes, could potentially predict the effects of these catalysts.
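The 0.38 kcal/mol figure can be reproduced from the two ee values with the formula given under Model development, ΔΔG‡ = −RT ln(er) at 298.15 K. The short check below is a sketch: the function names are illustrative, er is assumed to be the major/minor ratio, and the error is taken as the absolute difference between the two ΔΔG‡ values.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal mol^-1 K^-1
T_KELVIN = 298.15     # temperature stated in the text


def ee_to_er(ee_percent):
    """Convert % ee to an enantiomeric ratio (major/minor)."""
    major = (100.0 + ee_percent) / 2.0
    minor = (100.0 - ee_percent) / 2.0
    return major / minor


def ee_to_ddg(ee_percent, temperature=T_KELVIN):
    """ddG = -RT ln(er), in kcal/mol."""
    return -R_KCAL * temperature * math.log(ee_to_er(ee_percent))


predicted = ee_to_ddg(86.0)  # model 2 prediction
measured = ee_to_ddg(75.0)   # experimental result with TIPSY
error = abs(predicted - measured)
print(f"ddG error = {error:.2f} kcal/mol")  # prints "ddG error = 0.38 kcal/mol"
```

Here 86% ee corresponds to an er of 93:7 and 75% ee to an er of 87.5:12.5, so an 11-point gap in ee at this selectivity level amounts to a little under 0.4 kcal/mol.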