Predicting Partition Coefficients of Short-Chain Chlorinated Paraffin Congeners by COSMO-RS-Trained Fragment Contribution Models

Abstract

Chlorinated paraffins (CPs) are highly complex mixtures of polychlorinated n-alkanes with differing chain lengths and chlorination patterns. Knowledge of the physicochemical properties of individual congeners is limited but needed to understand their environmental fate and potential risks. This work uses a sophisticated but time-demanding quantum chemically based method, COSMO-RS, and a fast-running fragment contribution approach to enable prediction of partition coefficients for a large number of short-chain chlorinated paraffin (SCCP) congeners. Fragment contribution models (FCMs) were developed using molecular fragments with a length of up to C4 in CP molecules as explanatory variables and COSMO-RS-calculated partition coefficients as training data. The resulting FCMs can quickly provide COSMO-RS predictions for octanol–water (Kow), air–water (Kaw), and octanol–air (Koa) partition coefficients of SCCP congeners with root mean squared errors of 0.1–0.3 log units. The FCM predictions for Kow agree with experimental values for individual constitutional isomers within 1 log unit. The distribution of partition coefficients for each SCCP congener group was computed, which successfully reproduced experimental log Kow ranges of industrial CP mixtures. As an application of the developed FCMs, the predicted Kaw and Koa were plotted to evaluate the bioaccumulation potential of each SCCP congener group.

Table S1. Types of fragments used in the model development.

Additional description to Figure S4. The fact that the Level 4 model performs the best suggests that the actual contribution of each C type (e.g., -CH2-, -CHCl-) to log K's depends on its neighboring structure. Nevertheless, lower level models may also be useful to illustrate the average contributions of the C types to log K's. For instance, the Level 1 model (with only C1 fragments) shows that the fragment contributions of -CH2-, -CHCl-, and -CCl2- to log Kow are 0.35, 0.58, and 0.93, respectively, with a systematic increase with Cl (Figure S4). The contributions to log Koa are also systematic (0.58, 1.05, and 1.46, respectively). In contrast, the fragment contributions to log Kaw are irregular (−0.23, −0.46, and −0.49, respectively). Thus, substituting one H in -CH2- with Cl to form -CHCl- decreases log Kaw, but further substitution to -CCl2- would not change log Kaw.

Figure S4. Fragment model predictions for the validation set.

Figure S10. Distributions of log Kow for CP mixtures (II). The same plot as Figure S9, except that the FCM-predicted distributions are based on the "one Cl per C" rule (i.e., double/triple Cl NOT allowed). For congeners with the number of Cl > that of C, however, double/triple Cl is allowed.

Satoshi Endo,* Jort Hammer

Computational methods may be the only possibility to provide congener-specific information. Among such prediction models available, empirical fit models may not be useful, as congener-specific experimental data are not sufficiently available to calibrate such models.
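As a rough plausibility check on the Level 1 fragment contributions quoted above, one can test whether they close the thermodynamic cycle log Koa ≈ log Kow − log Kaw, and how much each additional Cl changes a contribution. The Python sketch below is not part of the original work (the authors report using R); the dictionary and function names are illustrative assumptions. The cycle is only expected to hold approximately, because Kow was calculated with wet octanol and Koa with dry octanol.

```python
# Level 1 fragment contributions (log units) as quoted in the text above.
LEVEL1 = {
    "-CH2-": {"log_kow": 0.35, "log_kaw": -0.23, "log_koa": 0.58},
    "-CHCl-": {"log_kow": 0.58, "log_kaw": -0.46, "log_koa": 1.05},
    "-CCl2-": {"log_kow": 0.93, "log_kaw": -0.49, "log_koa": 1.46},
}


def cycle_residual(contrib):
    """log Koa - (log Kow - log Kaw); zero if the cycle closes exactly."""
    return contrib["log_koa"] - (contrib["log_kow"] - contrib["log_kaw"])


def substitution_step(before, after, prop):
    """Change in a fragment contribution when `before` is replaced by `after`."""
    return LEVEL1[after][prop] - LEVEL1[before][prop]


if __name__ == "__main__":
    for name, contrib in LEVEL1.items():
        print(f"{name:7s} cycle residual = {cycle_residual(contrib):+.2f}")
    for prop in ("log_kow", "log_kaw", "log_koa"):
        first = substitution_step("-CH2-", "-CHCl-", prop)
        second = substitution_step("-CHCl-", "-CCl2-", prop)
        print(f"{prop}: first Cl {first:+.2f}, second Cl {second:+.2f}")
```

With the quoted values, the cycle residuals are 0.00, 0.01, and 0.04 log units for -CH2-, -CHCl-, and -CCl2-, respectively. The log Kaw step shrinks from −0.23 for the first Cl to −0.03 for the second, consistent with the observation that further substitution to -CCl2- hardly changes log Kaw.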

This study applies the quantum chemically based COSMO-RS theory5 to predict partition coefficients of CP congeners. COSMO-RS can predict partition coefficients from the molecular structure alone without any additional empirical parameter. This approach could address partition coefficients of CP congeners with differing structures, even including stereoisomers. Previous studies show that COSMO-RS can predict partition coefficients for chemicals of diverse structures (but no CPs) to an accuracy of < 1 log unit root mean squared error (RMSE) as compared to experimental data.6,7 Relative values across chemicals are expected to be even more accurate because systematic errors are canceled.8

The problem of using COSMO-RS for predicting a large number of chemicals is the [...]

The CPs considered in this work are polychlorinated n-alkanes (i.e., no branching, no multiple bonds). In this article, we refer to individual CP structures with different chain lengths and Cl-substitution patterns as "congeners". A "congener group" collectively denotes the congeners with the same number of C and Cl atoms (i.e., isomers). Isomers of CPs include stereoisomers that have the same two-dimensional molecular structure but are not superimposable in three-dimensional space because of the presence of chiral centers.

[...] properties including partition coefficients.5 For a given stereochemically specific congener, the molecular structure in the SDF format was entered into the COSMOconfX 4.3 software (COSMOlogic), which selected optimal conformers and generated their COSMO files using quantum chemistry [...] (COSMOlogic, parameterization: BP_TZVPD_FINE_19) to calculate Kow, Kaw, and Koa at 25 °C. Here, we calculated Kow with wet octanol and Koa with dry octanol. Note that the version of COSMOconfX used in this work sometimes returned structures that are stereochemically inconsistent with the original structure in the SDF (i.e., incorrect R or S configuration). This problem did not occur when we used the Windows version of COSMOconfX, switched off RDKit, and used only Balloon to generate initial candidate conformers.

In the course of this work, we noticed that the calculated partition coefficient sometimes depends slightly on the conformation of the initial input structure entered in COSMOconfX. We examined the extent of this "random error" using 10 starting conformations each for three arbitrarily chosen C10 congeners. The standard deviations for log Kow, log Kaw, and log Koa were on average 0.02, 0.14, and 0.12, respectively. These differences may represent the current precision of COSMOtherm predictions for CPs.

Generation of training and validation sets. The training set consisted of 815 congeners: all 315 distinct isomers of C5 CPs and 100 randomly generated isomers for each of C6 to C10 CPs. We used "very" short to short-chain CPs as training chemicals because the computational time of COSMOconfX increases with the size of the molecule. The validation set, in contrast, should comprise congeners that are relevant. We used 120 SCCP congeners (30 for each of C10 to C13 CPs) that were also randomly generated. Calibrating and/or testing models for MCCPs and LCCPs would also be interesting but would need much more time for calculations and was thus left for future work.

In random generation, 0 to (2m + 2) H atoms of Cm-n-alkanes were randomly substituted with Cl atoms without any restriction. Here, all H atoms were considered distinct to also generate stereoisomers. Equivalent structures (i.e., superimposable by rotation) and enantiomers (i.e., mirror images) were removed because they show the identical partition coefficient value in reality, and COSMO-RS should give the same value in theory. Diastereomers, in contrast, can have different partitioning properties and thus are considered distinct congeners. Codes were written in the R language17 to create SMILES strings for all these congeners. SMILES was then converted to SDF format using OpenBabel,18 which was then fed to COSMOconfX as described above.

[...] the selected C1 fragments were used as the initial variable set of the variable selection procedure for the Level 2 model, and so forth. To avoid a possible over-fitting problem, partial least squares regression (PLSR) was also performed using the selected Level 4 model fragments. The randomization test method was used to decide on the number of PLS components. All these statistical analyses were performed with R using functions such as lm(), step(), plsr(), and selectNcomp().
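The generation and descriptor steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code, and it rests on simplifying assumptions: a congener is represented only by the number of Cl atoms on each carbon, so stereochemistry is ignored (the original work treats diastereomers as distinct congeners), and the fragment definitions are a plausible reading of "fragments of up to C4", not a reproduction of Table S1. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

# Carbon types indexed by the number of Cl atoms on that carbon.
TERMINAL = ("CH3", "CH2Cl", "CHCl2", "CCl3")
INTERNAL = ("CH2", "CHCl", "CCl2")


def hydrogen_sites(n_c):
    """One entry per H atom of an n-alkane (n_c >= 2): the index of its carbon."""
    return [i for i in range(n_c) for _ in range(3 if i in (0, n_c - 1) else 2)]


def pattern_from_sites(n_c, carbons):
    """Cl count per carbon, given the carbon index of each substituted H."""
    counts = Counter(carbons)
    return tuple(counts[i] for i in range(n_c))


def random_pattern(n_c, n_cl, rng):
    """Substitute n_cl H atoms chosen uniformly at random, without restriction."""
    return pattern_from_sites(n_c, rng.sample(hydrogen_sites(n_c), n_cl))


def canonical(pattern):
    """A chain read from either end is the same constitutional isomer."""
    return min(pattern, pattern[::-1])


def constitutional_isomers(n_c, n_cl):
    """Exhaustively enumerate distinct Cl patterns (stereochemistry ignored)."""
    sites = hydrogen_sites(n_c)
    return {canonical(pattern_from_sites(n_c, combo))
            for combo in combinations(sites, n_cl)}


def fragment_counts(pattern, max_len=4):
    """Count chain fragments of 1..max_len carbons, independent of direction."""
    last = len(pattern) - 1
    types = [TERMINAL[n] if i in (0, last) else INTERNAL[n]
             for i, n in enumerate(pattern)]
    counts = Counter()
    for k in range(1, max_len + 1):
        for start in range(len(types) - k + 1):
            window = tuple(types[start:start + k])
            counts["-".join(min(window, window[::-1]))] += 1
    return counts


def linear_predictor(counts, coefficients, intercept=0.0):
    """Fragment contribution model: intercept + sum(count * coefficient).

    Fragments without a coefficient contribute zero, as for fragments
    dropped during variable selection.
    """
    return intercept + sum(n * coefficients.get(f, 0.0) for f, n in counts.items())


if __name__ == "__main__":
    rng = random.Random(1)
    pattern = canonical(random_pattern(10, 5, rng))
    print(pattern, dict(fragment_counts(pattern, max_len=2)))
    print(len(constitutional_isomers(10, 2)))
```

In this representation, exhaustive enumeration for C10Cl2 yields 30 constitutional isomers. As a usage example under the Level 1 model, supplying only the three internal log Kow coefficients quoted in the SI text (0.35, 0.58, and 0.93 for CH2, CHCl, and CCl2) lets the CH3 terms and the intercept cancel when two isomers with identical chain ends are compared: CH3-CCl2-CH2-CH2-CH3 then comes out 0.12 log units above CH3-CHCl-CH2-CHCl-CH3. In an actual calibration, the fragment counts would form the rows of the design matrix passed to the regression.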

Fragment contribution models (FCMs)
Predictions of partition coefficients for congener groups. Using the FCMs calibrated with PLSR, log Kow, log Kaw, and log Koa were predicted for 1000 randomly generated isomers for each SCCP congener group (C10-C13, Cl2-Cl14). Two methods were adopted to generate random isomers. In the first method, all H atoms were considered available for Cl substitution with the same likelihood. In the second, all H atoms were available, but each C atom could carry a maximum of only one Cl atom. In other words, the first method allows double or triple Cl substitution, while the second does not. As for the random generation of training and validation sets explained above, all substitution positions along the carbon chain were considered distinct to account for stereoisomers. Duplications were allowed in the random generation of 1000 isomers; this matters the most for the C10Cl2 group, which has only 30 constitutional isomers with 46 distinct structural isomers (i.e., 16 constitutional isomers have diastereomers). Duplication occurs increasingly rarely as the number of Cl approaches that of C. For example, 1000 random isomers of C10Cl10 had only 10 duplications and 14 enantiomer pairs.

We are aware that existing studies have shown that Cl substitution patterns are not random in commercial CP mixtures. A recent study suggested that the first, second, and third carbons from an end of the chain and central carbons all have differing likelihood of chlorination.21 Also, it has been known that chlorination is less likely to occur at the neighbors of a carbon that is already chlorinated due to a steric effect,22,23 which is also inferred from GC retention measurements for CP mixtures.4,24 Nevertheless, in highly chlorinated CP mixtures, dichloro-substituted carbons and trichloromethyl groups have also been identified.21,25 Since general rules for positions of Cl for CPs of different lengths and chlorination degrees are still under investigation, we opted for the "fully random" and "one Cl per C" rules to generate congener sets for this work.

[...] data, as indicated by R², root mean squared errors (RMSE), and AIC (Figures 1, S1, S2, Table S2). Hence, the Level 4 model resulted in the best fit on the training data set. It is interesting that C4 fragments do have statistically significant contributions to the partition coefficients, suggesting that the molecular interaction properties of CPs cannot fully be reduced to the shorter fragments, and that the actual contribution of each C type (e.g., -CH2-, -CHCl-) to log K's depends on its neighboring structure (see Figure S3 and additional discussion on fragment contributions in the SI). In the variable selection procedure, about half (49-61%) of the total fragments were removed for Level 2 to 4 models. This is [...] file. We note that some fragments that describe diastereomeric structures were also significant.

External validation with 120 SCCP congeners leads to the same conclusions as those with the training data presented above. Thus, the Level 4 model showed the best statistics, and the statistics were better in order of log Kow, log Koa, and log Kaw (Figures 1 and S4, Table S2).

[...] Figures S1, S2, S4) and a table with statistics for all models (Table S2) [...]

[...] to the corresponding peaks with double/triple Cl. Still, the presence of double/triple Cl influences log K's only by < 1 log unit and thus is not highly important for calculation of partition coefficients when the congener groups are considered as a whole. An important difference between the two cases is that congeners with the number of Cl > the number of C cannot be generated under the one Cl per C rule.
Also, if Cl = C, then there is only one constitutional isomer that fulfills the one Cl per C rule (but with many stereoisomers), which makes the distribution peak comparatively sharp.

The medians of log Kow, Kaw, and Koa show different dependence on the numbers of C and Cl. All three log K's are linearly dependent on the number of C, although the slopes differ depending on the partitioning phases and partially on the number of Cl (Figure S7). In contrast, dependence on the number of Cl is nonlinear (Figure 3; more clearly in Figure S8). Log Kow is fairly constant from Cl2 to ~Cl5, above which it increases by ca. 0.35 log units/Cl. Log Kaw has the opposite trend; it decreases from Cl2 to ~Cl10 by 2.5-3.5 log units and thereafter stays nearly the same. Log Koa increases monotonically but in a concave downward shape. The increase is ca. 0.8 log units/Cl from Cl2 to Cl3 but only 0.4 log units/Cl from Cl13 to Cl14.

[...] congeners in this work. That said, there are notable differences between the two studies. First, some irregular patterns exist in the predictions of the cited work9 (see, e.g., log Kow of C13Cl9; log Kaw of C13Cl5), which are absent in the predicted distributions of this work. Second, Glüge et al.9 generated congeners with one Cl per C at maximum, and as such, they considered only congeners with the number of Cl < the number of C. Third and most importantly, the cited work9 only provides a range, not a distribution. This is crucial, because the distributions often appear to be highly skewed, and the mean of the maximum and minimum does not capture the most frequently occurring values. In this regard, the median of the distribution presented in this work may be considered a more representative value for a congener group (see Tables S3, S4).

[...] COSMOtherm predictions from Glüge et al.9 Plots with more congener groups are shown in Figure S6.

[...] 4; more plots in Figures S9, S10).
Here, the predicted log Kow distributions for congener groups were weighted by their relative abundance (i.e., mole fractions) in the mixture and were then summed.

[...] information should improve our understanding of the environmental fate of CPs. As an example, SCCP congeners were plotted in the chemical space that indicates the Arctic bioaccumulation potential using predicted log Kaw and log Koa, following the approach by Czub et al.30 and Brown and Wania31 (Figure 5; see Figure S11 for individual SCCP congener groups). Figure 5 shows that relatively low chlorinated (Cl2-Cl6) SCCPs fall into the chemical space where high Arctic bioaccumulation is expected, assuming perfect persistence. In contrast, SCCPs with relatively high molecular weight (C + Cl ≥ 20) do not fall in this zone. Previously, Gawor and Wania32 presented various chemical space plots for CPs using log Kaw and log Koa predicted by ACD/ADME Suite prediction tools. The plots appear in part similar but not identical to those in this work. For example, ACD/ADME appears to predict log Koa < 6 for many low-chlorinated SCCP congeners, but such data points are absent in Figure 5. It would be interesting to repeat their analysis with the predicted partition coefficients from this work, which is, however, beyond the scope of this article.

In future work, the presented approach may be further improved in at least three aspects: