Chlorines Are Not Evenly Substituted in Chlorinated Paraffins: A Predicted NMR Pattern Matching Framework for Isomeric Discrimination in Complex Contaminant Mixtures

Chlorinated paraffins (CPs) can be mixtures of nearly half a million possible isomers. Despite the extensive use of CPs, their isomer composition and effects on the environment remain poorly understood. Here, we reveal the isomeric distributions of nine CP mixtures with single-chain lengths (C14/15) and varying degrees of chlorination. The molar distribution of CnH2n+2–mClm in each mixture was determined using high-resolution mass spectrometry (MS). Next, the mixtures were analyzed by both one-dimensional (1H, 13C) and two-dimensional nuclear magnetic resonance (NMR) spectroscopy. Due to substantially overlapping signals in the experimental NMR spectra, direct assignment of individual isomers was not possible. As such, a new NMR spectral matching approach was developed that uses massive NMR databases predicted by a neural network algorithm to provide the top 100 most likely structural matches. The top 100 isomers appear to be an adequate representation of the overall mixture. Their modeled physicochemical and toxicity parameters agree with previous experimental results. Chlorines are not evenly distributed in any of the CP mixtures and show a general preference for the third carbon. The approach described here can play a key role in understanding complex isomeric mixtures such as CPs that cannot be resolved by MS alone.


1D 1H NMR
The 1D 1H NMR spectra were collected using a single 90° pulse, calibrated on a per-sample basis (~8.5 µs). Data were collected with 32k time domain points and a 15 ppm spectral width. The frequency offset was set to 4.7 ppm. For each 1H NMR spectrum, 16 transients were collected with a recycle delay of 30 s. Spectra were processed with an exponential multiplication equivalent to a 0.3 Hz line broadening and a zero-filling factor of 2 before Fourier transformation.
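The processing step above (exponential multiplication followed by zero-filling) can be sketched as follows. This is a minimal illustration, not the processing software used in the study: it assumes the common exponential window exp(-π·LB·t), and the function name, the dwell time and the toy FID are hypothetical.

```python
import math

def apodize_and_zero_fill(fid, dwell_s, lb_hz=0.3, zf_factor=2):
    """Apply exponential line broadening, then zero-fill a time-domain signal.

    fid       : list of complex time-domain points
    dwell_s   : time between points in seconds (hypothetical input)
    lb_hz     : line broadening in Hz (0.3 Hz for 1H in the text)
    zf_factor : final length is zf_factor * len(fid)
    """
    # Assumed window: each point at time t = k * dwell is scaled by exp(-pi * LB * t).
    weighted = [
        point * math.exp(-math.pi * lb_hz * k * dwell_s)
        for k, point in enumerate(fid)
    ]
    # Zero-filling appends zeros so the transformed spectrum has more points.
    padding = [0j] * (len(fid) * (zf_factor - 1))
    return weighted + padding

# Toy FID: 8 points of constant amplitude, 1 ms dwell time (illustrative only).
toy = apodize_and_zero_fill([1 + 0j] * 8, dwell_s=1e-3)
```

The Fourier transformation itself is left out; the output of this function would be passed to an FFT routine.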

1D 13C NMR
The 1D 13C NMR spectra were collected using a single 90° pulse of 12 µs with 1H decoupling during acquisition. A Waltz-64 1H decoupling scheme was used with a 1H B1 (radio frequency) field strength of 6.25 kHz. Each 13C NMR spectrum was collected using 130k time domain points and a 200 ppm spectral width. The 13C frequency offset was set to 100 ppm and the 1H frequency offset was set to 4.7 ppm.
As the 13C data were used for relative quantification of the functional groups present, the 13C spin-lattice relaxation times (T1) were determined using the standard T1 inversion recovery experiment with the Waltz-64 1H decoupling scheme. The general T1 of the sample was calculated from the recovery delay closest to the null point for the 13C sample resonances with the longest T1 values. The T1 of the sample was approximated using the general formula
$T_{\mathrm{null}} = \ln(2)\,T_1$
For all 1D 13C NMR spectra, 5120 transients were collected with a recycle delay of 5 × T1 (17.1 s between pulses) to allow for full relaxation. Spectra were processed with an exponential multiplication equivalent to a 1 Hz line broadening in the transformed spectra and a zero-filling factor of 2.
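The null-point relation and the recycle delay can be checked with a short calculation. This is a sketch only: the function names are hypothetical, and the T1 of 3.42 s is not stated in the text but follows from the reported 17.1 s delay at 5 × T1.

```python
import math

def t1_from_null(t_null_s):
    """Estimate T1 from the inversion-recovery null point: T_null = ln(2) * T1."""
    return t_null_s / math.log(2)

def recycle_delay(t1_s, multiple=5):
    """Recycle delay as a multiple of T1 (5 * T1 in the text)."""
    return multiple * t1_s

# Derived from the text: a 17.1 s delay at 5 * T1 implies T1 = 3.42 s,
# which corresponds to a null point near 2.37 s.
t1 = 17.1 / 5
t_null = math.log(2) * t1
```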

2D 13C-1H Heteronuclear Single Quantum Coherence (HSQC)
The 2D HSQC (1H-13C) spectrum was collected using the Bruker standard pulse sequence (hsqcetgpsp.2) composed of two 13C adiabatic chirp pulses for inversion (500 µs) and refocusing (2 ms) with GARP4 13C decoupling. Typical acquisition parameters for HSQC were: 1) a 1H B1 field strength of 29 kHz, 2) a 13C B1 field strength of 20 kHz, 3) a 1H spectral width of 8.5 ppm with the 1H frequency offset equal to 4.7 ppm, 4) a 13C spectral width of 160 ppm with the 13C frequency offset equal to 70 ppm, 5) an acquisition length of 4096 time domain points in the direct 1H (F2) dimension, 6) 512 t1 increments collected to construct a phase-sensitive indirect 13C (F1) dimension via the echo/anti-echo acquisition scheme, 7) 32 transients for each t1 increment, 8) a recycle delay of 1.5 s, and 9) a B1 field strength of 3.5 kHz for 13C decoupling during acquisition.
The HSQC spectrum was processed via 2D Fourier transformation using an exponential multiplication equivalent to 8 Hz line broadening in the direct (F2) dimension and a sine squared function phase shifted by π/2 in the indirect (F1) dimension, with a zero-filling factor of 2.

2D 1H-13C Heteronuclear Multiple Bond Coherence (HMBC)
The 2D HMBC (1H-13C) spectrum was collected using a Bruker standard pulse sequence (hmbcetgpl3nd) 3 composed of a 2 ms 13C adiabatic chirp pulse for refocusing. Typical acquisition parameters for HMBC were: 1) a long-range 1H-13C J-coupling of 8 Hz, 2) a 1H B1 field strength of 29 kHz, 3) a 13C B1 field strength of 20 kHz, 4) a 1H spectral width of 10 ppm with the 1H frequency offset equal to 4.7 ppm, 5) a 13C spectral width of 250 ppm with the 13C frequency offset equal to 100 ppm, 6) an acquisition length of 4096 time domain points in the direct 1H (F2) dimension, 7) 196 t1 increments collected to construct a phase-sensitive indirect 13C (F1) dimension via the echo/anti-echo acquisition scheme, 8) 32 transients for each t1 increment, and 9) a recycle delay of 2 s.
The HMBC spectrum was processed via 2D Fourier transformation using a sine squared function phase shifted by π/2 for both the direct (F2) and indirect (F1) dimensions, with a zero-filling factor of 2. The spectrum was processed in magnitude mode along the F2 dimension.

13C NMR databases
The SMILES codes for chlorinated isomeric structures were generated for molecular formulae identified as major components (by mass spectrometry) using an in-house Python script. For the C14 series (C14HyClz) this included all possible structures from C14H29Cl1 through C14H20Cl10, while for the C15 series (C15HyClz) structures from C15H31Cl1 through C15H25Cl7 were considered. After removing identical structures (i.e. mirror images that were not stereoisomers), over 410,000 unique chemical structures remained. Note that MS data showed that C14Cl>10 contributed less than 5% to the C14 60.14% sample and much less (<1%) to the other, less chlorinated samples. However, the number of possible structural isomers increases drastically with increasing degree of chlorination (Figure S6). For example, we estimate >250,000 isomers for C14Cl11 and around 400,000 isomers for C14Cl12. Had these structures been included, the number of compounds would have increased to well over a million, making the computational time too onerous. As such, they were not included in the study.
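The in-house script is not reproduced here, but the enumeration it performs can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the chain is linear, terminal carbons hold at most three chlorines and interior carbons at most two, a substitution pattern and its reverse count as the same structure, and stereochemistry is ignored. The function names are hypothetical, and the sketch is not expected to reproduce the isomer counts quoted above.

```python
def chlorine_patterns(n_carbons, n_cl):
    """Return unique per-carbon chlorine counts for a linear chain (n_carbons >= 2).

    Assumed rules (not from the source): up to 3 Cl on each terminal carbon,
    up to 2 Cl on each interior carbon, and a pattern equals its reverse.
    """
    caps = [3] + [2] * (n_carbons - 2) + [3]
    unique = set()

    def extend(prefix, remaining):
        position = len(prefix)
        if position == n_carbons:
            if remaining == 0:
                # Keep one canonical orientation of each chain.
                unique.add(min(prefix, prefix[::-1]))
            return
        for k in range(min(caps[position], remaining) + 1):
            extend(prefix + (k,), remaining - k)

    extend((), n_cl)
    return sorted(unique)

def to_smiles(pattern):
    """Write a chlorine pattern as a SMILES string for a linear chloroalkane."""
    parts = []
    last = len(pattern) - 1
    for i, k in enumerate(pattern):
        if i == last and k > 0:
            # Close the chain with a plain Cl rather than a trailing branch.
            parts.append("C" + "(Cl)" * (k - 1) + "Cl")
        else:
            parts.append("C" + "(Cl)" * k)
    return "".join(parts)

# Monochloropropanes: 1-chloropropane and 2-chloropropane.
propyl = [to_smiles(p) for p in chlorine_patterns(3, 1)]
```

Under these assumptions, `chlorine_patterns(14, 1)` yields the seven positional isomers of monochlorotetradecane.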
ACD/Labs (v.2018.2.5) was then used to predict the 13C chemical shifts for the generated chemical structures. A line width of 0.015 ppm at half height was chosen to match the line width in the experimental data. Spectra were calculated using the neural network algorithm and 16,384 time domain points. The process was performed using an in-house batch predictor by ACD/Labs. Approximately 5 weeks of processing time were required to complete all the structures. The results were arranged into two databases, containing the C14 and C15 structures, respectively.
The neural network algorithm integrated into ACD/Labs is based on artificial neural networks (ANNs) as described by Blinov et al. 4 The approach uses an internal training database of over 2,000,000 chemical shifts and produces an error of less than 1.5 ppm (carbon) when tested against 11,000 newly synthesized organic compounds (over 150,000 chemical shifts) published over the 2005-2006 period. To further confirm the ability of ACD/Labs to accurately predict chemical shifts for polychlorinated alkane species, experimental and predicted data were compared. Because it is difficult to obtain large quantities (1 mL) of pure isomers of the species in this study, three multiply chlorinated molecules were instead randomly selected from the Aldrich NMR spectral database, and ACD/Labs was used to predict their 1H and 13C chemical shifts (Figure S3). In each case the observed error between the experimental and predicted data (0.1 ppm 1H and 0.57 ppm 13C) is less than that reported by Blinov et al., 4 showing the excellent ability of ACD's artificial neural network to predict the polychlorinated alkane species studied here.

Spectral Matching of the 13C NMR to generate the top 1000
The 13C NMR spectra of the chlorinated mixtures were compared against the predicted databases (Figure S2a) using ACD/Labs' similarity search algorithm. The solvent was selected as a dark region (i.e. not considered), the range was ordered by Euclidean distance and the results were ranked by HQI (Hit Quality Index), retaining the top 1000 matches. These top 1000 matches represent the wider range of compounds prevalent in the mixtures. The similarity search takes into account both spectral intensity and peak locations and can be summarized as follows. Each spectrum was indexed based on its chemical shifts. For 1D 13C NMR spectra, the selected interval of the spectral width was divided into regions of 1 ppm and each region was indexed on a 127-point scale. The integral of each indexed region was determined and compared between the experimental spectrum and each spectrum in the database. The HQI is then calculated according to a formula in which L denotes the compared regions, N is the total number of indices used and p_i is the index value (from the query spectrum and the database spectrum, respectively). In this case, a better match is indicated by a smaller HQI value. More details are provided in the ACD/Spectrus Processor Help (version 2018.2.5).
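The indexing-and-comparison idea summarized above can be sketched in a few lines. This is not the ACD/Labs algorithm and not its HQI formula, which is not given here. It assumes that each 1 ppm region is indexed by scaling its integral to the largest region in the spectrum, and it uses a root-mean-square Euclidean difference as a stand-in score. The function names and peak lists are hypothetical.

```python
import math

def index_regions(peaks, start_ppm, stop_ppm, width_ppm=1.0, scale=127):
    """Bin (shift_ppm, intensity) pairs into regions and index each on 0..scale."""
    n_regions = int(round((stop_ppm - start_ppm) / width_ppm))
    integrals = [0.0] * n_regions
    for shift, intensity in peaks:
        region = int((shift - start_ppm) // width_ppm)
        if 0 <= region < n_regions:
            integrals[region] += intensity
    largest = max(integrals, default=0.0)
    if largest == 0:
        return [0] * n_regions
    # Assumed indexing: scale every region to the largest one in the spectrum.
    return [round(scale * value / largest) for value in integrals]

def euclidean_score(query_index, database_index):
    """Root-mean-square difference between two indexed spectra (smaller is better)."""
    n = len(query_index)
    squared = sum((q - d) ** 2 for q, d in zip(query_index, database_index))
    return math.sqrt(squared / n)

# Illustrative peak lists only: the "hit" falls in the same 1 ppm regions as the query.
query = index_regions([(62.3, 1.0), (35.1, 0.5)], 0, 200)
hit = index_regions([(62.6, 1.0), (35.4, 0.5)], 0, 200)
miss = index_regions([(14.1, 1.0), (22.7, 1.0)], 0, 200)
```

As in the text, a smaller score indicates a better match, so `hit` ranks ahead of `miss` for this query.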

2D 1H-13C NMR Databases
1H-13C 2D NMR databases provide much higher spectral dispersion than 13C NMR alone. For example, Hertkorn reports the resolving power of 1D 13C NMR to be ~30,000 whereas 2D 1H-13C HSQC approaches 2,000,000. 5 As such, HSQC is ideal for database matching of components in complex mixtures. 6,7 Unfortunately, as the 2D database prediction and generation had to be performed manually and was considerably more computationally expensive than 1D NMR prediction, it was not possible to generate 2D NMR databases for all 410,000 compounds. Instead, 13C was used to pre-screen the compounds to obtain the top 1000 most likely compounds for each mixture. Each list of 1000 predicted compounds (one per mixture) took ~1 week to complete (~9 weeks in total). Spectra were generated using the neural network algorithm, 1024 points in each dimension and line widths of 0.02 ppm (1H) and 0.2 ppm (13C). For 1H-13C HSQC only one-bond correlations were included, while for 1H-13C HMBC one-bond correlations were suppressed and 2-3 bond 1H-13C correlations were included. It should be noted that both HSQC and HMBC data were used for assignment (see later) but only HSQC was used for compound matching. This is because the number and type of correlations in HMBC (commonly across 2-5 bonds) can vary depending on a range of experimental factors. Thus the exact number of correlations obtained can differ, making HMBC less suitable for spectral pattern matching. Conversely, HSQC detects only one-bond 1H-13C correlations, which do not change with experimental conditions. This makes HSQC ideal for spectral pattern matching, as has been previously demonstrated for complex environmental samples. 6

Matching top 100 compounds from 1H-13C 2D NMR
Each 1H-13C HSQC NMR spectrum of the chlorinated mixtures was compared against the predicted databases using a mixture search (Figure S2b). All peaks in the mixture were picked by automatic picking followed by additional manual picking to ensure all discernible peaks were selected. Mixture searches were performed with a tolerance of 3 ppm in 13C and 0.1 ppm in 1H, to maintain consistency with the prediction accuracies reported for ACD NMR 2D predictors. 8 The results were ranked by HQI, retaining the top 100 matches. The top 100 matches represent the most likely abundant compounds in the mixtures based on NMR but may not reflect the full diversity of compounds in the mixtures. The mixture search can be summarized as follows. The HQI_mixture_search is based on the number of peaks found in the database compared to a selected region of a query spectrum. It is calculated using a formula in which L is the total number of peak regions in the query spectrum, N_k is the number of peaks in the k-th peak region of the query spectrum and M_k is the number of peaks in the k-th peak region of the true spectrum.
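The peak-counting idea behind the mixture search can be illustrated with a simple stand-in. This sketch does not implement the ACD/Labs HQI_mixture_search formula, which is not given here. It assumes a candidate is scored by the fraction of its predicted HSQC peaks that have a mixture peak within the stated tolerances (3 ppm in 13C, 0.1 ppm in 1H). The function names, compound labels and peak values are hypothetical.

```python
def matched_fraction(candidate_peaks, mixture_peaks, tol_c=3.0, tol_h=0.1):
    """Fraction of a candidate's HSQC peaks found in the mixture peak list.

    Peaks are (h_ppm, c_ppm) pairs. A candidate peak counts as found when
    any mixture peak lies within tol_h in 1H and tol_c in 13C.
    """
    if not candidate_peaks:
        return 0.0
    found = 0
    for h_ppm, c_ppm in candidate_peaks:
        if any(abs(h_ppm - mh) <= tol_h and abs(c_ppm - mc) <= tol_c
               for mh, mc in mixture_peaks):
            found += 1
    return found / len(candidate_peaks)

def rank_candidates(database, mixture_peaks, keep=100):
    """Return the names of the `keep` best-matching candidates, best first."""
    scored = sorted(
        database.items(),
        key=lambda item: matched_fraction(item[1], mixture_peaks),
        reverse=True,
    )
    return [name for name, _ in scored[:keep]]

# Illustrative values only.
mixture = [(4.10, 62.0), (1.80, 38.0), (0.90, 14.0)]
database = {
    "A": [(4.15, 63.0), (1.75, 37.0)],   # both peaks within tolerance
    "B": [(5.90, 75.0), (0.92, 14.5)],   # only the second peak matches
}
ranking = rank_candidates(database, mixture)
```

Note that this stand-in score is larger for better matches, so candidates are sorted in descending order.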

NMR Assignment
The assignments of the experimental NMR spectra were performed based on peak location and intensity matches in the ACD/Labs database (1D 13 C, HSQC and HMBC) and then cross referenced for consistency using the short range (HSQC) and long range (HMBC) J-couplings. Importantly, the assignments were performed based on the structural motifs they represent. For each structure, the chemical shift values were collected into a spreadsheet according to their fragments and examined for their commonality. The chemical shifts of these common fragments were then mapped onto the spectra for display purposes using Mathematica 11.3. This approach to assignment provides a gauge of spectral overlap and exemplifies how complex the data are.

Molecular Heatmaps
Heatmaps were generated by summing the number of times a chlorine occurred at a specific position across the database results for each mixture. The results are expressed as simple percentages: for example, if 10% is displayed for a specific position, then in 10% of the molecules this position contained a chlorine and in 90% it contained a hydrogen.
a The chlorination degree of a CP mixture was calculated by eq. S1, 9 where %Cl_i is the chlorination degree of molecular formula i (CnH2n+2-mClm) and percentage_i is the weight-based percentage of molecular formula i measured using MS.
b Absolute deviation = calculated %Cl - manufacturer-reported %Cl. The correlations between the two %Cl values are shown below (x: manufacturer-reported %Cl).
c R2 was calculated by eq. S2, 10 where x_i and y_i are the percentage compositions of individual molecular formulae i in the two compared formula distribution patterns.
* Some regions were combined due to overlap; the combined values are nonetheless close to those quantified directly from the 1D 13C NMR data.
Each parameter of a CP mixture was calculated by:

(eq. S3) $\mathrm{Parameter}(\text{CP mixture}) = \sum_{i=1}^{100} \mathrm{mole}(\text{CP isomer}_i) \times \mathrm{parameter}(\text{CP isomer}_i)$
where mole(CP isomer i) is the molar amount of CP isomer i in the CP mixture, which equals one for all the isomers, and parameter(CP isomer i) is the parameter of CP isomer i, given in Worksheet (SI_WS 1-9).
a In the pH region where the molecule is predominantly unionized, log D = log KOW. 11
b Lethal Dose 50 is the amount of a CP mixture, given all at once, that causes the death of 50% of a group of rats or mice.
Figure S1. Data processing of the C14-15 single-chain-length standards for response factors (RFs) of individual CnClm using APCI-Orbitrap-MS. The molar distributions of CnClm calculated with the RFs are shown in Figure 1e and Figure S4.
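Returning to the tally described under Molecular Heatmaps, the per-position percentages can be sketched as follows. This is an illustration under stated assumptions: each structure is represented by a hypothetical tuple of per-carbon chlorine counts, all structures share one chain length and numbering direction, and a position counts as chlorinated when it carries at least one chlorine.

```python
def chlorine_heatmap(patterns):
    """Percentage of structures bearing at least one Cl at each chain position.

    patterns: equal-length tuples of per-carbon chlorine counts.
    """
    n_structures = len(patterns)
    n_positions = len(patterns[0])
    return [
        100.0 * sum(1 for p in patterns if p[pos] > 0) / n_structures
        for pos in range(n_positions)
    ]

# Four illustrative four-carbon patterns (not taken from the study).
example = chlorine_heatmap([
    (0, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 1, 0, 1),
])
```

Here the second carbon is chlorinated in three of the four structures, so its heatmap value is 75%.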