Prediction of Aqueous pKa Values for Guanidine-Containing Compounds Using Ab Initio Gas-Phase Equilibrium Bond Lengths

In this work, we demonstrate the existence of linear relationships between gas-phase equilibrium bond lengths of the guanidine skeleton of 2-(arylamino)imidazolines and their aqueous pKa value. For a training set of 22 compounds, in the most stable conformation of their lowest energy tautomeric form, three bonds were found to exhibit r2 and q2 values >0.95 and root-mean-squared-error of estimation values ≤0.25 when regressed individually against pKa. The equations describing these one-bond-length linear relationships, in addition to a multiple linear regression model using all three bond lengths, were then used to predict the experimental pKa values of an external test set of further 27 derivatives. The optimal protocol we derive here shows an overall mean absolute error (MAE) of 0.20 and standard deviation of errors of 0.18 for the test set. Predictions for a second test set of diphenyl-based bis(2-iminoimidazolidines) yielded an MAE of 0.27 and a standard deviation of 0.10. The predictive power of the optimal model is further demonstrated by its ability to correct erroneously reported experimental values. Finally, a previously established guanidine model is recalibrated at a new level of theory, and predictions are made for novel phenylguanidine derivatives, showing an MAE of just 0.29. The protocols established and tested here pass both of Roy’s modern and stringent MAE-based criteria for a “good” quantitative structure–activity relationship/quantitative structure–property relationship model predictivity. Notably, the ab initio bond length high correlation subset protocol developed in this work demonstrates lower MAE values than the Marvin program by ChemAxon for all test sets.


INTRODUCTION
Having a measure for the propensity of a molecule to lose or gain a proton, under given conditions, (e.g., pH, solvent, and temperature) has applications in numerous fields. For synthetic chemists, knowledge of pK a values provides means to manipulate mechanistic routes. In a physiological context, biochemical reactions may be rationalized with insight into the protonation state of the reacting species. The absorption, diffusion, metabolism, excretion, and toxicity profiles of a drug candidate strongly depends on its ionization state under physiological conditions. Furthermore, the Trapp model, used in the agrochemical industry to predict the ease with which a compound can travel through the phloem of a plant to its target site, uses pK a in addition to the log P value of a lead compound. 1 The measurement of pK a values of small, weakly acidic molecules (pK a range 2−15) may be achieved accurately via a variety of methods, including ultraviolet−visible spectrometry, potentiometry, or nuclear magnetic resonance spectroscopy, to name a few. An accurate method for the determination of pK a without the need for synthesis offers great advantage in the context of early-stage drug and agrochemical discovery, when thousands of potential compounds must be assessed for target specificity and selectivity. The pursuit of a fast, accurate, and allencompassing pK a prediction method is still a key challenge in chemistry, and in particular, the prediction accuracy of bases is of much interest. 2−4 Empirical pK a predictors, such as ACD/pK a DB 5 of ACD/ Labs and Epik, 6 utilize extensive databases of experimental values to formulate a set of rules to predict ionization feasibility, in line with known substituent effects. To do so, they often invoke Hammett−Taft equations, with additional moleculespecific details such as mesomer standardization and charge cancellation. Marvin 7 is the pK a prediction package from ChemAxon, which uses quantitative structure property relationships (QSPRs) based on various chemical descriptors, including partial charge and polarizability. First principles-based approaches have also been widely explored but are yet to be widely implemented in popular chemical software packages, with the exception of Jaguar pK a 2,8 by Schrodinger. These protocols calculate the change in Gibbs energy for dissociation, accessing pK a values via the relation of this calculated quantity to the temperature, gas constant, and common logarithm of the dissociation constant. This approach to pK a prediction prescribes implementation of a thermodynamic cycle of some sort, whereby both gas-phase and solution-phase energies are calculated for deprotonated and protonated species. Composite methods such as G3SX, 9 CBS-QB3, and CBS-APNO 10 have been shown to work favorably in the calculation of relevant thermodynamic quantities, but the high computational cost accrued by such high levels of theory remains a barrier to their utility. Recently, Ho 11 suggested that the use of thermodynamic cycles can be avoided completely via high-level electronic structure calculations [MP2/GTMP2Large and G3(MP2)-RAD(+)], computed directly in the solution phase with the SMD solvation model. 12 Other recent work 4 details the use of an isodesmic reaction scheme, whereby relevant energetic quantities are computed using semiempirical methods. The authors implement AM1/SMD and PM3/SMD for a structurally diverse set of amines, yielding root-mean-squarederror values of 1.0 and 1.1 pK a units, respectively. An error of 5.69 kJ mol −1 in the Gibbs free energy will alter the predicted pK a value by one unit. Hence, accuracy in the determination of the intermediate free energy values for first principles calculations is imperative.
In the pursuit of prediction accuracy in the first principlesbased methods, the solvation scheme must also be considered, in addition to the choice of the thermodynamic cycle variant and the level of theory. Implicit−explicit models, whereby at least one water molecule is included to represent hydrogen bonding interactions, have been shown to offer significant improvement in the solvation energies over the use of an implicit model alone. 13−15 However, modeling the water molecules quantum mechanically further add to computational expense. The Jaguar pK a program 2,8 by Schrodinger works by applying empirical corrections to "raw" pK a values obtained using density functional theory (DFT) and a direct thermodynamic cycle. Prediction accuracy and a low dependence on the initial input conformation are achieved by thorough consideration of conformational effects on pK a , and excellent mean absolute error (MAE) values are obtained (0.2 pK a units for a set of cyclic amidines). Because of the treatment of compounds as three-dimensional (3D) structures, their protocol is proven capable of discerning different pK a values for two stereoisomers as well as being sensitive to steric effects on pK a , which their purely empirical counterpart, Epik, is not. 2 QSPR-based protocols often implement a lower level of theory than first principles methods and are generally reliant on a number of chemical descriptors. These descriptors are regressed against experimental values to assemble predictive models using multiple linear regression (MLR) or principal component regression. 16−18 Univariate regression models that rely on only one descriptor may be labelled linear free energy relationships (LFERs). The method we propose here falls into this latter category. Other successful attempts to predict pK a values from one calculated descriptor include the use of the quantum theory of atoms in molecules (QTAIM) atomic energy of the dissociating proton. 19 Partial atomic charges of the atom bonded to the acidic proton have also been shown to have a linear relationship with pK a . 20 Here, we reintroduce our protocol, which reveals the linear relationships that exist between our quantum mechanically derived molecular descriptor (bond length) and an experimental property (pK a ). The ab initio bond length high correlation subset (AIBLHiCoS) method works on the basis that chemical space may be partitioned into a series of highly correlated local linear relationships between pK a in the solvent and a relevant bond length in the gas phase. Membership of a compound to a subset is governed by structural and chemical commonality, defined by structural motifs such as tautomeric form and conformation, or the presence or absence of substituents at a defined proximity to the acidic proton. The existence of these relationships and simple implementation of the linear equations describing them to predict pK a values have been demonstrated for phenols, 21 naphthols, 22 guanidines, 23 benzoic acids and anilines, 24 aryl guanidines and 2-(arylimino)imidazolidines, 25 benzohydroxamic, phenylacetic, and phenoxyacetic acids, 26 bicyclo[2.2.2]octane and cubane carboxylic acids, 27 and pyridines, thiols, and barbituric acids. We use a descriptor from a 3D structure (in the gas phase), and so consideration of steric, intramolecular hydrogen bonding (IHB) and π−π interactions is inherent to the protocol within the constraints of our gas-phase model. These structural features are known to affect the measured pK a value, and their treatment is not inherent to "two-dimensional methods," such as the empirical protocol employed by Marvin, unless explicitly included in the parameterization.
The 2014 review of nine pK a prediction packages by Settimo et al. 3 found that prediction accuracy was harder to achieve for bases than for other weakly acidic compounds. This prompted our lab to work on the notoriously difficult guanidine functional group. 23 The work on guanidines is revisited here later at a new theory level to predict the pK a of 12 α 2 -adrenoceptor antagonist phenylguanidine derivatives. A recent study 25 detailed the synthesis and reported the pK a values for this set of aryl guanidines in addition to 23 2-(arylimino)imidazolidines, using both ultraviolet (UV)-metric and pH-metric techniques. AIBLHiCoS predictions were also reported using a singlebond-length model, but the implementation of the MLR model was not included. The pK a values for further five small 2-(arylimino)imidazolidines, two of which are the marketed drugs tizanidine and apraclonidine, were also obtained from the literature to further test the model. We also compared our predictions to the literature values for both sites of ionization present within a set of 10 diphenyl-based bis(2-iminoimidazolidines), further revealing the efficacy of the method in discerning the relative magnitude of the two microscopic dissociation constants. The 2-(arylimino)imidazolidine derivatives studied here also contain the Y-shaped guanidine functional group and thus may also adopt three tautomeric forms in their neutral state. The approach to high correlation subset (HiCoS) formation for these compounds is therefore similar to the approach described for guanidines, 23 in that the tautomeric forms are used as a primary classifier of structural commonality.

ACS Omega
Article 2. METHODOLOGY 2.1. HiCoS Formation. The first stage in the AIBLHiCoS protocol is to deduce where a so-called "active bond" lies and in which tautomeric form and conformation it is located. The active bond is identified by it returning superior r 2 , q 2 , and rootmean-squared-error of estimation (RMSEE) values when regressed against pK a , compared to those of other bonds. As a descriptor, the active bond length must reflect the implications on geometry that a diverse array of substituent groups may impose.
Aromatic compounds have been shown to provide a simplistic and chemically intuitive example of how chemical space may be partitioned using the bond length as a descriptor. Several of our studies have shown that the relationship between the bond length and pK a for meta-and para-substituted compounds is such that they may form their own subset. In the case of naphthols, phenols, benzoic acids, and anilines, the relationships established between pK a and the bond length for simple meta/para-substituted molecules provide linear equations capable of predictions within 0.5 pK a units for more complex biomolecules. Compounds containing ortho-substituents often fall into their own subset, where membership to a set may be defined by the specific chemical characteristic of the substituent group. The partitioning of ortho-substituted species according to the chemical nature of the substituent is justified by the propensity for IHBs between ortho-substituents and the atoms of the ionizable group. In the presence of bulky orthosubstituents, steric effects can impose constraints on bonding and dihedral angles adopted by the ionizable group at the ipso carbon. This has a knock-on effect on bond lengths, causing data points to appear as outliers in plots of pK a versus bond length. IHBs cause rearrangement of electron density in the region of interest, meaning that bond lengths adjacent to where the proton is lost often deviate from the expected trend, relative to a set of compounds absent in the IHB interaction. This effect leads to the phenomenon of self-selection, an expression that describes how compounds exhibiting these respective geometric features fall into their own separate HiCoSs, at differing ranges of bond lengths or occasionally with differing active bonds.
The identification of an optimally predictive model is guided by a number of internal and external validation metrics. The mathematical formalism of the statistics used, that is, the squared correlation coefficient r 2 , RMSEE, and the external validation statistic denoted q 2 (in the form of leave-oneseventh-out), are illustrated in a previous work. 21 A high r 2 value (0.9 and above) describes a strong relationship between our descriptor and the experimental observable, but this may be misleading. Indeed, a line of best fit connecting clusters of data points may return a high r 2 value generating a model of poor predictivity. The latter property is best captured by the crossvalidated correlation coefficient q 2 and the MAE. A high q 2 value is indicative of a highly predictive model, and the leaveone-seventh-out protocol used here typically overcomes the known pitfalls of the widely used leave-one-out q 2 . The need to overcome such deficiencies has been demonstrated by Tropsha et al. 15 Furthermore, Roy recommends 28 employing an external dataset of at least 10 compounds as a true test of a predictive model and advocates MAE evaluation, taking into consideration the standard deviation of the absolute errors (AE) as well as the range of the training set. For a model to be classed as "good" according to Roy, it must pass the following criteria: (i) MAE ≤ 10% of the training set range (ii) MAE + 3 × σ ≤ 20% of the training set range (where σ is the standard deviation of the AE). The above validation criteria will be used to assess the predictive ability of the most promising models we have developed.
It must be noted that a caveat of the current methodology (and of all methods that use experimental values in their calibration protocol) is its dependence on the availability of experimental data. On this occasion, the experimental values used to calibrate the imidazolidines model were measured at 20°C rather than at a standard state of 25°C. Because of lack of other available data in aqueous solution at 25°C, the values from this paper 29 by Timmermans and van Zwieten were used, with the possible detriment to prediction accuracy in mind. The general trend of pK a variance with temperature for the ionizable group in question is that there is a small decrease (0.1−0.2 pK a units) for every 5°C increment in the temperature. This effect is shown in Table S1 in the Supporting Information, which details the change in the measured pK a for seven compounds for the temperature range of 20−40°C. For the diphenyl-based bis(2-iminoimidazolidine) test compounds, the pK a data are taken from one and the same publication 40 and were recorded at 30°C. As a temperature increase of 10°C is expected to have a non-negligible effect on pK a , a correction was applied to the training set pK a values. The linear equation describing the

ACS Omega
Article pK a change between 20 and 30°C is used to correct the values of the original training set. To enhance the model, the test set of 2-(arylimino)imidazolidines first introduced in section 4.1.3 is also added to the initial training set, after a correction for the temperature change from 25 to 30°C.
The experimental values used in this work correspond to the pK a values for dissociation from the guanidinium-type ion. However, as in the protocol described by Griffiths et al. for guanidine derivatives, 23 the bond lengths we use as descriptors in this work correspond to the neutral, conjugate base form.
2.2. Tautomers and Bond Lengths. Individual HiCoSs were constructed 23 by Griffiths et al. for five bond lengths within the guanidine fragment for a training set of compounds with chemically diverse R groups. The CN bond lengths in the tautomeric form A, shown in Figure 1a, returned an impressive correlation of r 2 = 0.97 with pK a . The equation describing this relationship was tested for several drug molecules, including guanethidine, guanoxan, and two amiloride derivatives, and gave prediction errors of less than 0.5 pK a units. Our approach to high correlation subset formation is guided by the parallels that may be drawn between guanidines and 2-iminoimidazolidines in terms of their resonance structures.
At this point, it is fitting to clarify the nomenclature used for the compounds studied in the current work. Compounds represented as tautomers T1 and T2 in Figure 1b contain an imidazoline ring (containing a CN double bond) bonded to an amino group at the 2-position and thus are both referred to as 2-aminoimidazolines. When the T3 tautomer is assumed, the five-membered ring is an imidazolidine group, and the CN double bond instead occurs between the carbon at the 2position of imidazolidine and the exocyclic amino group. Such compounds are referred to as 2-iminoimidazolidines.
The four bonds chosen for consideration for each of the tautomers T1, T2, and T3 are shown as a, b, c, and d in Figure  1b. For each tautomer, bonds labelled e, f, and g were also analyzed, but initial correlations obtained were found to be inferior to the bonds of the guanidine fragment and are not included in the discussion. Learning from past studies on bases (i.e., guanidines and anilines 30 ), N−H bonds are excluded from the analysis on the basis that correlation coefficients obtained previously were consistently poor.

METHODS AND MATERIALS
3.1. Conformational Analysis and Electronic Structure Calculations. For all compounds of the imidazolidine and guanidine training and test sets, all geometry optimization and frequency calculations were performed in the gas phase. Previous work has shown that, according to the assessment of HiCoS r 2 values, there is little to be gained from the use of an implicit solvent model (CPCM), when weighed against the additional computational cost. 22 Initially, the B3LYP functional with a 6-31G(d,p) basis set was employed here, on the basis of a previous study 31 from our lab. This recent study revealed that bond lengths calculated at this level of theory produced a superior r 2 value when regressed against pK a values for a HiCoS of substituted naphthols. Many of our previous studies used geometries optimized at the HF/6-31G(d) level to obtain equilibrium bond lengths, showing that LFERs emerge even at low levels of theory. However, here we chose to use the splitvalence triple-zeta basis set 6-311G(d,p). Justification of this choice can be found in section S1 of the Supporting Information.
To identify the dominant conformations of each compound in the gas phase for each of the three tautomers, we used the "conformers plug-in" within the Marvin Sketch 7 program to generate a number of starting geometries. For each type of mono-ortho-substituted species of the imidazolidine training set, 15 conformations were taken as input structures for optimization and vibrational analysis using the Gaussian 09 program. The most stable equilibrium geometries within the ensembles of each compound, in each tautomeric form, were then identified by a comparison of total ab initio energies. As the ortho-groups are identical for each of the 2,6-di-orthocompounds of the training set, degenerate conformations are expected because of the inherent symmetry of the system. These compounds also exhibit a lower degree of conformational flexibility than their mono-ortho-substituted analogue because of restricted rotation around C 1 −N 1 and N 1 −C 7 (illustrated for T3 in Figure 2). Noting these inherent geometric features, all 2,6-di-ortho compounds were optimized from an input geometry suspected to be close to their most stable form, whereby the two rings are orthogonal with respect to each other. For each mono-ortho-substituted species, optimization from an input geometry with orthogonal orientation of the two rings was also carried out. This separate optimization was performed to assess the effect of conformational commonality on the LFERs produced from our training set, which, as will be discussed, has limited variation with respect to substitution patterns. HF/6-31G(d) optimized structures of guanidine training set compounds from the previous study were reoptimized at the B3LYP/6-311G(d,p) level of theory. This reoptimization was carried out for each compound in each of the five conformer/ tautomer forms defined in the previous paper 23 and which are shown in Figure 1a. The pK a data for further six small guanidine derivatives were obtained and optimized in each tautomer/ conformer combination. Frequency calculations were carried out using the same level of theory for optimization, and inspection of the Hessian confirmed the absence of imaginary frequencies and the location of energy minima.
For compounds substituted at one ortho-position, the D1 angle is measured with C 2 denoting the carbon bonded to that substituent.

ACS Omega
Article 3.2. Construction of Models. Relevant bond lengths were extracted (seven bonds a−g for imidazolidines and five labelled i−v for guanidines) from all optimized structures of the tautomerically distinct subsets. The values of r 2 , RMSEE, and leave-one-seventh-out q 2 were obtained using the program 32 SIMCA-P 10.0 following linear regression of each type of bond length versus aqueous pK a values. The coefficients of the threebond-length MLR model were also obtained using the SIMCA-P program. For all test set compounds, the most stable geometries were ascertained for each of the tautomeric forms using the Marvin conformer search protocol described above, where for the diphenyl-based bis(2-iminoimidazolidines), the number of conformers generated was increased to 25. The appropriate bond length was then extracted and inserted into the HiCoS equation [pK a = m × (bond length) + c] or MLR equation for the models deemed to be optimally predictive according to internal and external validation statistics.
3.3. Interatomic Exchange−Correlation Energy Calculations. Bond lengths by themselves do not reveal their degree of covalency by their mere value. Covalency is essentially a measure of electronic delocalization, which can be calculated within the context of a topological energy partitioning method called interacting quantum atoms (IQA). 33 Originating from the QTAIM, 34,35 IQA is an increasingly popular tool to analyze a large variety of chemical phenomena. By partitioning the total energy of a system into intra-and interatomic terms, we derive a quantitative measure of covalent-like interactions between atoms in the molecules. This measure comes in the form of the exchange−correlation potential energy V xc AB , which is the sum of the exchange energy V x AB and the correlation energy V c AB . The former term usually dominates V xc AB and denotes the Fock− Dirac exchange, which describes the ever-reducing probability of finding two electrons of the same spin close to one another (i.e., the Fermi hole). The latter term, V c AB , is associated with the Coulomb hole and the electrostatic repulsion between the electrons. The absolute value of V xc AB evaluated between two atoms can be taken as the extent of delocalization of electrons between them, a factor that also determines the bond distance. The values for the exchange−correlation energy between C and N (denoted V xc CN ) were obtained by the AIMAll program 36 (version 14.11.23), using the first-ever 37 DFT-compatible IQA partitioning and using default parameters on wave functions obtained at the B3LYP/6-311G(d,p) level. Note that the V xc CN values are always negative, and the more negative they are, the more covalent is the interaction between C and N.
3.4. Experimental Data. Compounds of the training set are labelled 1−23 and can be found in Table S2 of the Supporting Information. The structural diversity of these compounds in terms of substitution pattern is fairly low. There are 14 compounds of the set that possess two orthosubstituents, 9 of which have two identical o-halogen groups (both are either F, Cl, or Br), whereas the other 5 have two identical o-alkyl groups (either Me or Et). Eight compounds of the training set have only one ortho-substituent, three of which have an o-methyl group and five have an o-chloro group. Metaand para-substituents are also present on the phenyl group of 15 o-substituted compounds of the set. The only compound of the set lacking any ortho-substituents is the unsubstituted form (shown in Figure 2 as tautomer T3), which will be referred to as compound 23.
The test compounds discussed in sections 4.1.3 and 4.1.4 were synthesized previously, as reported by Rozas and coworkers (ph1−ph10 and py1−py7) 38,39 and Dardonville and co-workers (ph11−ph15, 1, 2a,b−5a,b, and 6h). 25,40 The substitution patterns of the test set include meta-and parasubstituted compounds. The pK a values were measured by both the potentiometric (pH-metric) and spectrophotometric (UVmetric) methods of pK a determination. It is pH-metric values that are used for comparison to theoretical predictions for ph1−ph15 and py1−py7 in section 4.1.3, whereas both UVand pH-metric values are used for comparison in section 4.1.4. Further details on the methods used for pK a determination can be found in section S2 of the Supporting Information.

RESULTS AND DISCUSSION
4.1. Imidazolidines. 4.1.1. Conformation and Commonality. In a previous work, we have demonstrated the existence of conformation-specific linear relationships between the bond length and pK a . For instance, we illustrated that two subsets are formed using the same interatomic bonding distance for a series of ortho-substituted naphthols, when the O−H group is either syn or anti with respect to the ortho-substituent. 22 Consequently, bond lengths later used to predict unknown pK a values must be extracted from the compound in the specific tautomer/conformer combination used to form the predictive equation. Therefore, insertion of an O−H bond length from a syn conformer into the anti HiCoS equation, or vice versa, would likely return a poor prediction. Figure 2 illustrates the two torsional angles that define the orientation of the imidazolidine group relative to the phenyl ring: Structures labelled T1 and T2 in Figure 1b may be considered as two rotamers of the same tautomer. For these rotamers, interconversion is achieved through 180°rotation of the imidazolidine group around the N 1 −C 7 bond (i.e., by torsion angle D2 over bond c).
In each tautomeric form (i.e., T1, T2, and T3), the most stable conformation identified for compounds with a single ortho-alkyl group is one in which the imidazolidine ring is tilted from orthogonality with the phenyl ring, away from the o-alkyl group (i.e., if the substituent is at C 6 , D1 > 90°). In the T2 form, the two rings are found to be nearly completely coplanar, with average D1 and D2 angles of 177°and −1°. The enhanced stability afforded to this arrangement is presumably a consequence of minimized steric interactions with the alkyl group. Compounds of the training set with two o-alkyl substituents are found to adopt a geometry, whereby the two rings are more or less orthogonal in tautomers T1, T2, and T3.
For 2,6-dihalogens in the T3 form, the most stable orientation of the two rings is one whereby an N−H group of imidazolidine is orientated toward one of the orthosubstituents on the phenyl (i.e., D1 < 90°). This feature is most pronounced for the 2,6-di-fluoro derivative, compound 15, for which D1 is ∼50°. This small D1 angle is indicative of the presence of an IHB interaction between an N−H of imidazoline/imidazolidine and the fluorine atom. Further analyses were performed using the IQA topological energy partitioning scheme, which revealed a (3, −1) bond critical point between (N)−H and F for compound 15 in tautomer T3. A full explanation of the IQA analysis and further evidence for the presence of the NH···F hydrogen bond for compound 15 in both tautomers T1 and T3 can be found in section S3 of the Supporting Information.
The next class of training set compounds to consider are species that have a chlorine atom occupying one ortho-position while the remaining ortho-position is unsubstituted. When the amino-type tautomer T1 is assumed, the most stable geometry

ACS Omega
Article is one in which the imidazoline group is orientated away from the ortho-chloro group. If the same class of compound is instead represented as T2, a preference for almost absolute coplanarity is observed, illustrated by averaged D1 and D2 angles of around 178°and −1°, respectively. In contrast to the observations made for T1, the most stable arrangement for T3 is one in which the N−H of the imidazolidine group is oriented toward the chlorine atom. An IQA analysis of compounds 5 and 7 (the 2,4-chloro and 2,5-chloro derivatives) in their most stable conformation of tautomer T3 reveals no BCP for N−H··· Cl. Therefore, the propensity to adopt this conformation cannot be ascribed to the presence of an IHB.
The commonly visited geometries of compound 23 [2-(phenylimino)imidazolidine] will be quite similar to those identified for any other compounds substituted at the metaand/or para-positions for small substituent groups. This may be asserted because of the infeasibility of significant through-space interactions between imidazoline and these groups at the more distant meta-and para-positions. Consequently, inclusion of this compound within the training set of a model will ensure that the applicability domain encompasses compounds without ortho-substituents. For T2, one dominant conformation is identified, within which the two rings lie almost exactly coplanar, akin to all mono-ortho-substituted derivatives in this form. An IQA analysis was performed to assess the reasons for the enhanced stability of the coplanar conformation for T2 and can be found in section S4 of the Supporting Information. For the T1 tautomer, there is also one dominant conformation that corresponds to a structure with a D1 angle of 160°and a D2 dihedral of −12°. For T3, there are two conformations that have similar D1 and D2 angles but differ by the orientations of the hydrogen atoms of the N−H groups, with a difference in stability of only ∼1 kJ mol −1 . In the analysis below, only the bond lengths of the most stable T3 conformers are considered.
So far, we have identified preferred conformations for the compounds of our training set in each tautomeric form, which has revealed a degree of commonality for subsets of compounds possessing similar substituent patterns. We also observed an anomalous geometry due to an IHB for the 2,6-F 2 derivative (compound 15), which is likely to be a common feature among other 2-F-substituted compounds. We are now in a position to construct plots from the constituent bond lengths of each compound and assess how LFERs emerge for each bond. In the following analysis, we include bond lengths from the 2substituted species in their most stable conformations of T1 and T2 and also from a geometry that more closely resembles the orthogonal orientation of the 2,6-substituted compounds. This is performed to assess whether conformational congruence is indeed the guiding factor to HiCoS formation or simply whether the most stable, and therefore, the most commonly visited geometries provide bond lengths that have the strongest linear relationships with pK a values.
4.1.2. Identification of HiCoSs. In the search for a predictive model for 2-(phenylimino)imidazolidines, our initial motivation was to formulate a model using bond lengths from the most commonly visited forms of each compound of our training set. According to chemical intuition, within the constraints of our gas phase and static model, an active bond should be located within the most commonly visited state, that is, the lowest energy conformation of the most stable tautomeric form. Furthermore, in accordance with the results for other aromatic species, we may then expect to find that subsets form, where membership to a given subset is defined by the location of a substituent at the ortho-or meta-and para-positions on the phenyl group.
In the most stable conformation of each of their tautomeric forms, the "imino" tautomer (T3), where the CN double bond is present between the imidazolidine and the phenyl rings, is consistently more stable than either of the T1 and T2 rotamers (internal energy rankings can be found in Table S2). This energy ranking is consistent for a comparison of Gibbs energies. Plotting bond lengths a−d versus pK a , with the 23 training set compounds in their most stable conformation of the T3 tautomer, reveals strong, individual linear relationships for all bonds (r 2 = 0.946, 0.944, 0.830, and 0.960 for a−d, respectively). However, for bonds a and b (the endocyclic C−N  Table S2 represented as tautomer T3. The blue diamonds denote compounds that have ortho-substituents at both positions, green diamonds denote compounds with only one orthosubstituent, red crosses represent the unsubstituted form, and the circle and triangle denote the 2,6-F 2 -phenyl and 2-Cl,6-F-phenyl derivatives.

ACS Omega
Article bonds of the guanidine fragment), one obvious outlier is observed in each case. These anomalous bond lengths are identified as belonging to compound 15, 2-(2,6difluorophenylimino)imidazolidine (blue circles in Figure 3). We may explain the occurrence of this outlier by the presence of the IHB we described above, which perturbs the electron density distribution between atoms and causes a more extreme distortion of the geometry than that observed for the 2-chlorosubstituted anologues. We may justify the removal of this compound from the set because compound 15 lacks conformational commonality with the rest of the set. If enough data for other ortho-fluoro-compounds were available, then we surmise based on a previous work, that a separate HiCoS would be identified for bonds a and b, consisting of compounds that have this intramolecular interaction in common. There is only one other compound with pK a data available: the 2-F,6-Cl-phenyl analogue. As expected, this second 2-F derivative exhibits a conformation very similar to compound 15 (D1 = 51°) and also appears below the trend line for bonds a and b, next to compound 15 (orange triangles in Figure 3). However, deriving a new HiCoS equation from two data points is unlikely to provide a meaningful relationship, thus rendering an allencompassing, single-bond-length model using a or b as unfeasible. Removal of compound 15 reduces the training set to 22 congeners and reveals r 2 values of 0.973 for bond a, 0.969 for b, 0.836 for c, and 0.959 for d, as shown in Table 1, thus making bond a the active bond for all compounds contained in the training set, except for 2-F derivatives.
Inspection of the plots for bonds a and d in Figure 3 reveals that the unsubstituted compound 23 falls in the correct regions to be conducive of a linear relationship between the bond length and pK a in both cases. However, 23 may be considered an outlier in the case of plot b, as removal of this compound from the set causes an increase in r 2 from 0.969 to 0.981. Bond c of 23 seems to fall in a bond length range mostly occupied by mono-ortho-substituted compounds, but it is apparent that the overall r 2 and q 2 values for the LFER model derived from the CN bond are lower and the RMSEE value is higher when compared to all other models. The commonly visited geometries of compound 23 (2-(phenylimino)imidazolidine) will be quite similar to those identified for any other compounds substituted at the meta-and/or para-positions for small substituent groups. This may be asserted because of the infeasibility of significant through-space interactions between imidazoline and these groups at the more distant meta-and para-positions. The fact that compound 23 does not appear as an outlier for the linear fit of bonds a and d versus pK a suggests that they may have the largest applicability radius out of the four single-bond-length models.
Referring to the validation statistics listed in the lower part of Table 1 where compound 15 has been excluded, the a model demonstrates marginally superior r 2 , q 2 , and RMSEE values over the other bonds. However, we may conclude that for the most stable conformer and tautomeric form of the phenylsubstituted 2-iminoimidazolidines of the training set, a linear relationship exists between each of the four bond lengths a−d and pK a . Shorter C−N bonds a, b, and d and a longer CN bond c are all molecular features indicative of a lower aqueous pK a value and thus increased acidity. In other words, in the case of 2-(arylimino)imidazolidines, there is a concerted change in four gas-phase bonding distances in the region of the guanidine skeleton because of substituent effects on the phenyl group, which is linked to the thermodynamic feasibility of the guanidine group losing a proton in the aqueous phase. In terms of the dissociation equilibrium reaction in the static T3 state, the nitrogen of the CN c bond is the site of proton loss or acceptance. Considering this point, it is then unclear as to why we observe a stronger or more sensitive reflection of proton loss/acceptance feasibility in the changing C−N bonding distances of a, b, and d compared to the relationship observed for the CN bond c. These results for 2-(arylimino)imidazolidines differ from our findings for the previous study on simple guanidines and from the results described in section 4.2.1, where we find that only the CN bond lengths of a specific conformation of a higher energy tautomer correlate highly with aqueous pK a values (shown in Figure 1a).
In light of the existence of more than one "active bond," MLR analysis was performed for bonds a, b, and d versus pK a on the grounds that their individual r 2 and q 2 values are in excess of 0.95. The c bonds were excluded, as a detrimental effect on MLR validation metrics can be expected in accordance with the inferior (r 2 and q 2 < 0.9) single-bond-length values shown in Table 1. This new model, where pK a is now defined as a function of three bonds, exhibits superior validation statistics when compared to any single-bond-length model ( Table 1). The removal of the anomalously long bonds of compound 15 from the training set revealed a further increase in r 2 and notably an increase in q 2 of 0.12 in addition to a 0.05 reduction in RMSEE.
For the higher energy tautomeric forms, T1 and T2, we observe no other superior set of r 2 , q 2 , and RMSEE values for the training set in their most stable conformations, compared to those observed for T3 a (Table 2). In the case of T2, the bond length versus pK a plots reveal that for a highly correlated model to be constructed using all compounds, each mono-orthosubstituted compound must occupy a higher energy conformation, whereby the two rings are arranged orthogonally. Separate HiCoSs are formed in the case of each bond, when the bond lengths of the more stable, more coplanar conformers of mono-ortho-compounds are plotted against their pK a values. Therefore, for T2, inducing conformational commonality is the guiding factor for a strong LFER; indeed the r 2 , q 2 , and RMSEE values for bonds a, c, and d in the same orthogonal conformation are indicative of an almost perfect linear relationship. The new subsets constructed from the bond lengths of the coplanar conformers are found to run almost parallel to the full set containing the orthogonally arranged

ACS Omega
Article analogues ( Figure S1) and also have high r 2 values. The formation of new HiCoSs according to conformational congruence is also observed in the case of bond lengths c and d for T1 ( Figure S2). Although inducing a conformational agreement between doubly and singly ortho-substituted compounds appears to allow for a linear relationship to reveal itself for the T1/T2 forms, it also introduces a layer of unnecessary complexity to a prediction protocol. This complexity arises because successful implementation of the equations describing these relationships to predict pK a values is dependent on the compounds' propensity to take up a prescribed conformation. The problematic nature of this condition is exposed for compound 23 as T2, for which the only equilibrium geometry located exhibits an almost exactly coplanar orientation of the two rings (D1 = 175°and D2 = 0.3°). Correspondingly, for plots a−d for T2, compound 23 is found to belong to the new HiCoSs containing only the more stable, coplanar conformers, for which bonds a and c are found to give the highest r 2 values of 0.97 and 0.96 ( Figure S1). This suggests that a route may be taken whereby doubly ortho-compounds are predicted with one model, and singly ortho-, meta-, and para-substituted species are predicted with T2 a or c. However, the smaller pK a range of 2.06 for the T2 a HiCoS is a detrimental factor to its implementation for compounds with a wide range of substituents, when compared to the larger range of the T3 models.
In the following sections, we proceed to make predictions for two external test sets, first by application of the single-bondlength models a, b, and d for T3 and by application of the three-bond-length model for T3, where the bonds of compound 15 are excluded. By comparison of the error statistics, we arrive at a protocol capable of a highly accurate pK a prediction for the guanidine group of 2-(arylimino)imidazolidine derivatives, which is proven to be sensitive to a diverse array of substituent effects. 4.1.3. External Test Set: 2-(Phenylimino) and 2-(Pyridylimino)imidazolidines. The a bond lengths of the most stable conformers identified for 27 compounds represented as T3 were inserted into the equation (i.e., pK a = 272.76 × r(a) − 368.08), which corresponds to the optimally predictive single-bond-length model identified in section 4.1.2. The equations for the models derived using bonds b and d (pK a = 310.49 × r(b) − 421.74 and pK a = 99.03 × r(d) − 128.88, respectively) were also implemented for the same test set. All single-bond-length predictions and error statistics referred to in this discussion can be found in Table S7 of the Supporting Information.
The MAE value of the a model was found to be the lowest at 0.26, with a standard deviation of AE (σ) of 0.23. Whereas a reasonable performance is observed using the d model equation (MAE = 0.57, σ = 0.53), the b model performed very badly (MAE = 1.54, σ = 2.47). The poor performance of the b and d models can be mainly attributed to large prediction errors for compounds py4−py7, which correspond to four 2-pyridyl derivatives (structures are shown in Table 3). For example, the errors for bond b for these four compounds are −7.49, −7.24, −7.28, and −7.31. These very low predictions can be attributed to a significant shortening of the b bond, which equates to a ∼−0.03 Å difference relative to those of 3-pyridyl analogues. The d bond lengths also experience this shortening effect but to a lesser degree, as illustrated by the lower prediction errors of Although the averaged error for the test set predictions using the a model is impressively low, there are some compounds of the set that show significant errors, namely, ph11 (−0.70), ph13 (−0.82), and to a lesser extent, ph15 (−0.64). These errors correspond to the compounds 2-(4-nitrophenylimino)-, 2-(3-chloro,4-nitrophenylimino)-, and 2-(3-fluoro,4nitrophenylimino)imidazolidine, and therefore, all three con- a "Most stable" refers to the models formed with 2,6-di-orthocompounds plus the more stable coplanar conformers of the monoortho-compounds. "Same conf" refers to models where mono-orthocompounds are optimized from an orthogonal orientation of the two rings, so as to match that of the 2,6-di-ortho-substituted compounds (i.e., all compounds optimized from the same conformation). Compound 23 falls within the "most stable" set as it did not optimize to a local minimum with an orthogonal geometry. Corresponding plots are shown in Figures S1 and S2, whereas Tables S5 and S6 show bond lengths and angles D1 and D2.

ACS Omega
Article tain a nitro group at the para position. The strong, resonanceinduced electron-withdrawing effect of p-nitro groups is evidenced by their measured pK a values, which indicate a reduction in the basicity of ∼2 pK a units for all nitro compounds (ph11−ph15), relative to the unsubstituted form (23). The prediction errors for the three problematic nitro compounds are somewhat reduced by use of the b model (+0.58, +0.40, and +0.70), but significant improvement is found by use of the d model, for which the predicted values are found to be a mere −0.23, −0.05, and +0.10 pK a units outside of the experiment. It is interesting to note that compounds ph12 and ph14 also contain a p-nitro group, but show consistently low errors for all three single-bond-length models (Table S7). The reason for the lower errors for ph12 and ph14 may lie in the fact that these compounds also have o-halogen substituents in addition to the p-nitro group, and thus share more structural commonality with the compounds used to calibrate the models.
With the introduction of our test set, we notice that certain phenyl substituents, like the ortho-fluoro groups of compound 15, can perturb bond lengths a, b, and d to varying degrees, meaning that outliers are found to be inconsistent between single-bond-length models. This is not a surprising observation; but subsequently, to derive a pK a prediction protocol that is both highly accurate and consistent in its accuracy, we must go from one single-bond-model to another to minimize overall prediction errors.
The resultant predictive equation for the MLR of bonds a, b, and d against pK a is expressed as follows: pK a = 92.511 × r(a) + 105.33 × r(b) + 33.596 × r(d) − 311.79. Implementation of this equation to predict for all 27 test set compounds revealed it to be inferior in terms of overall accuracy relative to the singlebond-length model a (MAE = 0.63, σ = 1.08). However, this high mean error can again be solely attributed to poor predictions for the 2-pyridyl compounds, which show errors of −3.24, −3.01, −3.09, and −3.21 for py4−py7. Excluding the errors of the 2-pyridyl derivatives reveals an improvement in the prediction accuracy over the use of the a model for the remaining compounds, as the MAE falls to 0.20 and notably, the standard deviation of errors is reduced to just 0.18. In noting the sensitivity of the b and d bonds to the presence of the IHB of 2-pyridyl compounds, it appears that the optimal protocol for the pK a prediction of this class of derivatives should be via implementation of the a model equation. For compounds that lack the heteroatom at the 2-position of the aryl group, implementation of the MLR three-bond-length model will likely provide the most consistently accurate predictions. Because of a lack of pK a data at this time, it is not possible to say how this protocol will fare when the phenyl a Predictions and errors are also shown for further five literature compounds: tizanidine (Tiz), 41 apraclonidine (Apra), 42 and AG3−AG5. 43 Predictions for compounds py4−py7 were made by implementation of the a model, that is, pK a = 272.76 × r(a) − 368.08, and are denoted by (a). Predictions and AE obtained using ChemAxon using both T1 and T3 as input structures are also shown for comparison, where asterisks denote predictions where the Marvin program recognized the T3 form as the dominant tautomer.

ACS Omega
Article group is replaced by other aryl groups containing heteroatoms. However, the respectable errors obtained for compounds without ortho-substituents, for 3-pyridyls and for the drug tizanidine (Tiz) (which has a piazthiole derivative instead of Ph), hint at an applicability radius much wider than just the simple o-substituted phenyl derivatives that make up the training set.
The predictions shown in Table 3 were made using the MLR model, with the exception of the 2-pyridyl compounds, which, for reasons explained above, were made using the a bond model. The MAE for the full AIBLHiCoS protocol pK a predictions for the 27 compounds shown in Table 3 is 0.20, with a standard deviation of AE of 0.18.
A high degree of prediction accuracy in modern terms may be considered as within 1 pK a unit per prediction, with a MAE for a test set of less than 0.5 units. 2 According to these conditions, the error statistics (MAE = 0.20, σ = 0.18) indicate that AIBLHiCoS performs very well and does so consistently. For predictions made using the pK a predictor plug-in within the Marvin Suite by ChemAxon, 7 with the imino tautomer (T3) as the input structure, eight prediction errors exceed 1.0 pK a units (ph8, ph9, ph11 to ph15, and py1), and the MAE value is in excess of 0.5 units (0.72). This value deteriorates further when T1 is used as an input structure, for which 11 errors exceed 1 pK a unit. The MAE of 0.97 for T1 shown in Table 3 is inclusive of the error values for ph11−ph15, for which the program implicitly recognizes the dominant tautomeric form as T3. Therefore, on the basis of our energy rankings (Table S3), optimal prediction accuracy is dependent on the user having prior knowledge about the dominant tautomeric form. The AIBLHiCoS protocol we have developed for these compounds finds its place here as a useful tool for predicting subtle substituent effects on pK a with a high accuracy. Conversely, the wider applicability radius of the Marvin program means it demonstrates lower prediction accuracy in the case of these compounds.
According to Roy's stringent assessment of a QSPR/ quantitative structure−activity relationship model, the predictions made using the MLR model may be classed as "good" in his own parlance, that is, the MAE of 0.20 is smaller than 10% of the training set range (0.39 units) and the MAE + 3 × σ (0.20 + 3 × 0.18) is 0.74 and therefore lower than 20% of the training set range (0.78).
We now discuss the exchange−correlation measure V xc CN introduced in the Computational Methods section 3. Figure  4a−d shows the relationship between the V xc CN values corresponding to the CN interactions for each of the bond lengths a−d and the pK a values. Referring to the r 2 values on plots (a−d), which are for the training set compounds only (blue diamonds), remarkably strong linear relationships are observed in the case of all four CN bonds. Taking the V xc CN value as a more physically meaningful metric, the greater the exchange−correlation energy between the constituent atoms of the CN bond c, along with a lesser interaction energy between the atoms of the three C−N bonds a, b, and d adjacent to it, the more basic the guanidine group. Furthermore, plots of how the V xc CN values of the training set correlate with their corresponding bond lengths reveal r 2 values of 0.990, 0.871, 0.869, and 0.989 for a−d. Thus, although the bond length is not necessarily a measure of covalency, it is a measure in the case of the 23 training set compounds and most strongly for bonds a and d. This observation therefore heightens the significance of our bond length versus pK a LFERs, that is, the most active bond lengths a and d (which exhibit the superior validation statistics r 2 , q 2 , and RMSEE) most strongly reflect the extent of delocalization between the two bonded atoms.
As the V xc CN values of a and d are very accurately reflected in the corresponding calculated bond distances, compounds ph11, ph13, and ph15 (purple triangles in Figure 4) once again appear as outliers for a but not for d. For the 2-pyridyl compounds py4−py7 (red circles), the presence of the IHB is seen to cause divergence of the V xc CN values of b, c, and d from the linear trends that reflect the most substituent effects with impressive accuracy, leading to the formation of new HiCoSs. Accurate pK a prediction would be therefore be possible by the use of the linear equations describing the relationships shown in Figure 4a Table S8.

ACS Omega
Article of bond lengths. The latter alternative prediction method could follow a procedure similar to the one described above for bond lengths. Specifically, the MLR model could be used to predict for substituted phenyl or 3-pyridyl compounds, but the a V xc CN model must again be used for 2-pyridyl compounds, as they belong to separate HiCoSs.
As the internal and external validation metrics for the bond length models are already impressive, it is likely that there is little to be gained from a protocol that uses V xc CN as molecular descriptors, when weighed against the added computational cost involved in calculating them. However, it is interesting to note that the highest r 2 value for V xc CN values versus pK a is now found for bond c, which is the CN double bond. Figure 4 shows that the V xc CN values for both CN bonds adjacent to the site of ionization (c and d) are both highly correlated with the ionization feasibility. As the extent of delocalization increases between atoms of bond c and decreases for the atoms of bond d, the ability of the conjugate guanidinium-type cation to stabilize the positive charge increases. This finding is perhaps more in line with chemical intuition than what we observe for the bond lengths, where the CN bond gives the worst r 2 , q 2 , and RMSEE values when regressed against pK a .
Notwithstanding the dramatic increase in r 2 from 0.83 to 0.98 for bond c by use of the corresponding V xc CN values, Figure  4 also shows no substantial deterioration for other bonds. In fact, all V xc CN versus pK a plots show r 2 values above 0.9. The interatomic property, V xc AB , as a measure of delocalization of electrons between two atoms, is related to the delocalization index, a QTAIM-derived metric that describes the number of shared or exchanged electrons between two QTAIM atoms. Work 44−46 by Matta and co-workers has implemented delocalization indices, along with localization indices (a measure of electrons localized within a QTAIM atom), to define a series of compounds as what they call a ζ-matrix. The authors showed 44 that the Frobenius distances between the elements of the matrix representations of a set of p-benzoic acids containing only delocalization indices and that of a reference species with the lowest pK a value show a high r 2 value when regressed against their pK a values. Significantly, the distances between the submatrices representing just the [COOH] molecular fragment were found to give a higher Table 4. pK a(1) and pK a(2) Experimental Values for Compounds 1 to 6h Using the UV-Metric Method (Upper) Taken from the Literature 40

ACS Omega
Article correlation with pK a than the [OH] submatrices. This is rationalized by the fact that the pK a value reflects the ability of the COOH fragment to accommodate ionization. Comparisons may therefore be drawn between this study and the current work, where using V xc CN , we see that each bonding interaction of the guanidine fragment correlates with pK a . This observation may be rationalized by the fact that the pK a values for our 2iminoimidazolidines reflect the thermodynamic feasibility of the whole guanidine fragment to accommodate the positive charge.
4.1.4. External Test Set: Bis(2-iminoimidazolidines). To test the robustness of the AIBLHiCoS protocol described above for larger, more conformationally flexible species, the pK a values for a series of diphenyl-based bis(2-iminoimidazolidine) compounds were taken as a new test set (Table 4). These compounds contain two 2-(phenylimino)imidazolidine moieties linked by an amide bond at the para-positions, for which the two guanidine-type sites of ionization exhibit distinct pK a values. As the values obtained by Rı́os Martı́nez et al. 40 were measured at a nonstandard temperature of 30°C, the values of the training set were adjusted to account for the expected deviation in pK a . Initial training set values (1−23) were adjusted using a correction from 20 to 30°C, and for consistency, the former test set (ph1 through py7) values were adjusted using the appropriate equation for an increase from 25 to 30°C. The dataset of guanidine-containing compounds used to approximate the pK a change is shown in Table S1 of the Supporting Information, and the new values for the enhanced training set are shown in Table S9. Both the a model and the three-bond-length MLR model formed using a, b, and d bond lengths were implemented to assess their relative accuracy for this new test set. The bond lengths of the 2-pyridyl compounds (py4−py7) and those of the three other largest outliers identified in the previous section (ph11, ph13, and ph15) were removed from the new training set on the basis that they belong to a separate HiCoS for these specific bonds.
All results shown in Table 4 correspond to predictions made using the MLRs for both temperatures, as they were both found to provide lower overall errors and standard deviation of errors when compared to the a models (Table S10). The upper part of Table 4 (corresponding to the UV-metric values of Rı́os Martı́nez et al. 40 ) demonstrates a poor performance for the AIBLHiCoS model, which shows an MAE of 0.41 and a standard deviation of 0.59 for the 20 sites of ionization. Notably, the prediction errors for compound 6h are huge (pK a(1) = +0.88 and pK a(2) = +2.56). These large errors cannot be attributed to the lack of training for species containing pyridines, as there are three 3-pyridyl species (py1−py3) present within the enhanced training set. Furthermore, as the predicted values for the majority of the congeneric species in Table 4 show errors below the 0.5 log unit threshold, we cannot rationalize the highly inaccurate predictions for both pK a(1) and pK a(2) of 6h and 5a pK a(2) by the approximate nature of the temperature corrections applied to the training set data.
On the basis of the success of the MLR model for the test compounds described in the previous section, the reliability of the UV-metric values was questioned. Subsequently, new pHmetric measurements were obtained for all compounds for which adequate sample quantity was available. These new pHmetric values were recorded at the standard 25°C. Unfortunately, new measurements for 4b and 5b were not possible because of insufficient sample quantities.
The new predictions and pH-metric values are shown in the lower part of Table 4, and a comparison of error evaluation statistics is summarized in Table 5. Comparing the new predictions at 25°C to the pH-metric values, the unsigned errors for 6h are now only 0.37 and 0.33 pK a units away from the experiment, proving the previously reported experimental values to be erroneous. The prediction error for 5a pK a(2) is now also below 0.5 units and closer in magnitude to that of the chemically similar pK a(1) site of 2a. All prediction errors are now found to be below 0.5.
Implementation of Roy's criteria to assess prediction errors against the pH-metric values indicates that the model may be classified as "good," that is, the MAE of 0.27 is smaller than 10% of the training set range (0.37) and the MAE + 3 × σ (0.27 + 3 × 0.10) is 0.56 and therefore lower than 20% of the training set range (0.74). The MAE values produced utilizing Marvin exceed 1 log unit for both the UV-metric values at 30°C and pH-metric values at 25°C. Notably, 19 out of 20 values exhibit errors of more than 0.5 units for the UV-metric values, and 14 out of 18 for the pH-metric values. Further, the order of microdissociation steps for sites (1) and (2), described by the relative magnitude of the two measured pK a values, is correct on every occasion for AIBLHiCoS but is incorrect for 5a and 6h for the predictions made by Marvin's pK a predictor (of ChemAxon).
4.2. Guanidines. 4.2.1. HiCoSs from Tautomers. It has already been established 23 that for guanidines, distinct HiCoSs may be formed by representing compounds in specific conformations of each tautomeric form. We have also proven that this approach to HiCoS formation is successful for 2iminoimidazolidines. In the case of the imidazolidines discussed in the previous section, the most highly predictive model was formed using a bond distance within the most stable tautomer. For guanidines where substitution only occurs at one amino/ imino group, previous work details the formation of a model using the CN bond lengths of equilibrium geometries of compounds in a higher energy tautomeric state. The purpose of the current section is to prove that the model already established in the previous work still functions as a predictive model for a new test set, as we shift the level of theory from our previous CPU-triggered HF/6-31G(d) level to the current B3LYP/6-311G(d,p) level. Because this section of the work on guanidines has already been discussed in the paper by Griffiths et al., further details can be found in section 2.3 of that paper. 23 Several changes have been made to the previous training set to form the model used in this work. First, eight new guanidine derivatives were added during recalibration to further corroborate the location of the active bond. Second, some pK a values used here differ from those used previously. For  49 and reported to have been measured at 20°C. Five candidates for the active bond were chosen as N−H, CN, C−N, C−N, and N−H, which correspond to those marked respectively as i−v in Figure 1a. Two remaining N−H bonds were excluded from the analysis, the first of which is one of the primary amino N−H bonds because including both primary N−H bonds was deemed unnecessary because of the fact they are very similar in magnitude. The N−H bond distance corresponding to the nitrogen atom, which is also bonded to the R group, was also omitted from the analysis as it is not present in any of the training or test set structures. The N−R bond was also excluded because the large variety of R groups causes too substantial a variation in the bond length. Using a total of 17 compounds from 10 different sources, bond length versus pK a plot using the CN bond of tautomer A is found to return the highest r 2 value of 0.925 (see Table 6). The removal of five outliers, that is, methylguanidine, acetylguanidine, metformin, amiloride, and biguanide, reveals an r 2 value of 0.97. These outliers are not wildly anomalous, and their inclusion still allows for a correlation of >0.90. We may therefore only rationalize their removal from the training set by the possibility of small deviations in experimental values, which are inherent to the varied experimental techniques and the conditions in which they were measured. The final set of 13 compounds used to form the predictive model implemented in the next section are shown in Table 7 and in the Supporting Information Tables S11−S15, respectively, for tautomers A to E.
4.2.2. External Test Set. The pK a prediction software by ChemAxon is again implemented to assess the quality of our predictions relative to a popular choice of a predictor. Our chosen model, as discussed in the previous section, is formed from tautomer A. For a fair comparison, the same tautomer used for the AIBLHiCoS model is also used as input for ChemAxon, in addition to the most stable form, C. As the r 2 value and other model validation statistics for the bonds of tautomer C are inferior to tautomer A, it can be confidently assumed that they would give inferior predictions to our model using A and are therefore not included.
The ChemAxon prediction error for the C tautomer is very respectable (0.54, Table 8), although the MAE does fall short of the <0.5 units threshold by a small margin. AIBLHiCoS offers only a minor advantage, with a slightly lower MAE (0.29) and 3 out of 12 errors in excess of 0.5 units. However, because of the large range of our models used to predict for the guanidines (7.2), both of Roy's criteria are satisfied: (1) the MAE of 0.29 is smaller than 10% of the training set range or 0.71 and (2) MAE + (3 × σ) = 1.23 is smaller than 20% of the training set range, which is 1.42. ChemAxon shows 5 out of 12 errors above the 0.5 threshold using the C tautomer, but this rises to 9 out of 12 when tautomer A is used as the input structure. Once again, the prediction accuracy is highly dependent on user input with respect to the tautomeric form.
For the guanidine test set, the only AIBLHiCoS prediction above 1 unit corresponds to PG3, that is, 1-(5-chloro-2pyridinyl)guanidine. Our model gives an error of +1.05 units, whereas ChemAxon gives an error of −0.89. Inspection of the optimized structure reveals the presence of an IHB between an amino group on guanidine and the nitrogen of the pyridine ring. As observed for the imidazolidines, the presence of IHBs can cause significant deviations in certain bond lengths, leading to further partitioning of the set into subsets of higher structural commonality. Evidence for this effect lies in the lower (im) is an imino-type nitrogen; (sec) denotes a secondary amino nitrogen; (pr) denotes a primary amino nitrogen. C lacks a single-bonded secondary nitrogen and has an additional primary amino-type, denoted (iv′). The q 2 and RMSEE values are also listed in Tables S11 to S15 of the Supporting Information. a The equation for the linear fit used for predictions is pK a = 448.9 × r(CN) − 561. 15 and was obtained using the bond length data in Table S11.

ACS Omega
Article prediction errors observed for two 3-pyridinyls, PG1 and PG2 (+0.55 and −0.24), for which the pyridine's nitrogen is too far away for H-bonding to be feasible. There are pK a data available for 1-(2-pyridinyl)guanidine and 1-(6-methyl-2-pyridinyl)guanidine, and when the bond lengths of their optimized structures are included in the CN versus pK a plot using tautomer A, they also appear below the line of best fit, near to the data point for PG3, giving errors of +1.53 and +1.22, respectively. A new HiCoS cannot be constructed using only two data points, and so a predictive equation that better describes the LFER for these compounds will be constructed when adequate data can be procured.

CONCLUSIONS
We have demonstrated the existence of novel, highly correlated linear relationships between gas-phase equilibrium bond lengths within the guanidine fragment of 2-(phenylimino)imidazolidine derivatives and their aqueous pK a values. It has been established that the relationship between bond lengths and pK a is strengthened for the amino tautomers, T1 and T2, when conformational congruence is ensured. However, three single-bond-length models with excellent validation statistics (r 2 > 0.95) emerged for compounds in their most stable conformations of the imino tautomer, T3. For this tautomer, shorter calculated a, b, and d bond lengths and a longer CN c bond length than that observed for the unsubstituted species can be collectively considered, indicative of a compound with higher acidity. Predictions made using the three a, b, and d CN single-bondlength models were then compared to those obtained via implementation of an MLR model built from all three bond lengths a, b, and d versus pK a . The MLR model was found to exhibit superior validation statistics over any single-bond-length model. Prediction errors using the MLR model were found to be less than 0.5 for most compounds, but four anomalously large errors were observed for four 2-pyridyl compounds. These larger errors can be attributed to anomalously short b and d bonds, which are explained in terms of the presence of an IHB between the N−H of imidazolidine and the heteroatom of 2pyridine. A robust protocol is then established, whereby the single-bond-length model a should be used to predict for compounds with a heteroatom at the 2-position, but the MLR model should be implemented for all other classes of compounds shown in the training and test sets. This protocol allowed for an MAE and standard deviation for all 27 compounds of 0.20 and 0.18, respectively.
It must be noted that the power of this relationship especially revealed itself on a few occasions during this part of the work. First, as mentioned in a recent study, 25 the initial experimental value for compound ph12 (8.13) meant that our a model prediction gave an error of 0.85 log units, which was puzzling when compared to the high accuracy of the prediction for the fluoro analogue, ph14. However, remeasurements revealed the initial value to be erroneous, and the real value (7.29) fitted our initial prediction to within the 0.5 unit accuracy threshold. Furthermore, implementation of the T3 a model using an enhanced training set has revealed that some of the 20 UVmetric values reported for diphenyl-based bis(2-iminoimidazolidines) by Rı́os Martı́nez et al. 40 are inaccurate, most dramatically in the case of compound 6h, for which the prediction errors were reduced from +0.88 and −2.56 to +0.37 and +0.33, upon remeasurement using the pH-metric technique. The MAE for the 16 remeasured pH-metric values was found to be 0.27 with a standard deviation of 0.10. The MAE, standard deviation of AE of both external test sets, and training set ranges mean that both models may be considered "good" in terms of their predictive ability according to Roy's criteria. The AIBLHiCoS method provides average prediction errors that are superior to ChemAxon's pK a predictor in all cases tested here. Further, the order of microdissociation steps  25 and AIBLHiCoS Predictions Using the Linear Equation of Model A, Expressing pK a as a Function of the CN Bond Length (pK a = 448.9 × r(CN) − 561.15) for Phenylguanidines (G) and Pyridylguanidines (PG) a a Predictions using ChemAxon software are also shown for both tautomer A and tautomer C. Compounds are shown in their most stable tautomer, C.

ACS Omega
Article predicted by Marvin was incorrect in the case of three compounds, whereas AIBLHiCoS is found to be correct in every case.
Revisiting the previously established model for guanidine derivatives provides corroboration of our previous work at a new level of theory, which suggests once again that, counterintuitively, it is not the most stable gas-phase tautomer that provides the active bond(s). The MAE for the test set of phenylguanidine analogues employed here was 0.29, which is both lower than that obtained by the ChemAxon program and in accordance with Roy's stringent MAE evaluation criteria and is evidence of a "good" model.
We also present our findings from the first IQA-based study in the context of these LFERs, where in our efforts to rationalize outliers, it was found that the V xc AB values for bonding interactions were also found to correlate highly with pK a . Furthermore, the active equilibrium bond lengths (those which correlate most highly with pK a ) also correlate highly to the extent of delocalization between the two bonded atoms. Further work will explore the meaning of this interesting observation in the context of our protocol.

* S Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acsomega.8b00142.
Detailed discussion related to the choice of level of theory, experimental details of pK a measurements, IQA Analysis to confirm the presence of an IHB in compound 15, and IQA Analysis to analyze the enhanced stability of the coplanar conformer of compound 23 (PDF)