Detection and Exclusion of False-Positive Molecular Formula Assignments via Mass Error Distributions in UHR Mass Spectra of Natural Organic Matter

Ultrahigh resolution mass spectrometry (UHRMS) routinely detects and identifies thousands of mass peaks in complex mixtures, such as natural organic matter (NOM) and petroleum. The assignment of several chemically plausible molecular formulas (MFs) to a single accurate mass still poses a major problem for the reliable interpretation of NOM composition in a biogeochemical context. Applying sensible chemical rules for MF validation is often insufficient to eliminate multiple assignments (MultiAs), especially for mass peaks with low abundance or if ample heteroatoms or isotopes are included, and requires manual inspection or expert judgment. Here, we present a new approach based on mass error distributions for the identification of true and false assignments among MultiAs. To this end, we used the mass error in millidalton (mDa), which was superior to the commonly used relative mass error in ppm. We developed an automatic workflow to group MultiAs based on their shared formula units and Kendrick mass defect values and to evaluate the mass error distribution. In this way, the number of valid assignments of chlorinated disinfection byproducts was increased 8-fold as compared to only applying 37Cl/35Cl isotope ratio filters. Likewise, phosphorus-containing MFs can be differentiated from chlorine-containing MFs with high confidence. Further, false assignments of highly aromatic sulfur-containing MFs (“black sulfur”) to sodium adducts in negative ionization mode can be excluded by applying our approach. Overall, MFs for mass peaks that are close to the detection limit or where naturally occurring isotopes are rare (e.g., 15N) or absent (e.g., P and F) can now be validated, substantially increasing the reliability of MF assignments and broadening the applicability of UHRMS analysis to even more complex samples and processes.


Table of Contents
Mass error (Merr) in mDa and its distribution
Figure S1. Schema for mass error (mDa) distribution in multiple assignments caused by a specific replacement pair with varying mass difference.

Sample description
Table S1. Description of samples and data acquisition.

Performance of internal calibrations and robustness of median/mean value of Merr distribution.
Table S2. Performance of internal calibrations.

SRFA dataset
Table S4. Total number of formula assignments in SRFA dataset before automatic filtration.
Table S5. Main replacement pairs that cause multiple assignments (MultiAs) in SRFA dataset.
Table S6. Total number of formula assignments in SRFA dataset after automatic filtration.

EfOM_Oz_18O dataset
Table S7. Total number of formula assignments in EfOM_Oz_18O dataset before automatic filtration.
Table S8. Main replacement pairs that cause multiple assignments (MultiAs) in EfOM_Oz_18O.
Table S9. Total number of formula assignments in EfOM_Oz_18O dataset after workflow filtration.

DW_Cl2 dataset
Table S12. Total number of formula assignments in DW_Cl2 dataset before automatic filtration.
Table S13. Dominant replacement pairs that cause multiple assignments (MultiAs) in DW_Cl2 dataset.
Table S14. Total number of formula assignments in DW_Cl2 dataset after automatic filtration.
Table S15. Total number of chlorine formula assignments in DW_Cl2 dataset before and after filtration.
Table S19. Total number of formula assignments in SRFA_CBZ_2H dataset before and after filtration.

Mass error (Merr) in mDa and its distribution
The principle of FT-ICR mass spectrometry can be written as a mapping function F from the mass-to-charge ratio (m/z) and other relevant physical quantities p_i, 1 ≤ i ≤ m (ion abundance etc.), to the corresponding ion motion frequency f for a given mass analyzer:

$$f = F(m/z,\, p_1, \ldots, p_m) \qquad (1)$$

Eq 1 can then be solved for the observed (m/z)obs:

$$(m/z)_{\mathrm{obs}} = F^{-1}(f,\, p_1, \ldots, p_m) \qquad (2)$$

However, due to various effects, e.g., ion abundance and uneven electric fields, eq 2 is not applicable until sufficient mass accuracy is provided by a mass calibration function.
To reduce the systematic errors in the measurements, the mass calibration function Mcal(f, p1, …, pm) can be fitted to f using the corresponding theoretical m/z of internal calibrants ((m/z)int):

$$(m/z)_{\mathrm{int}} = M_{\mathrm{cal}}(f,\, p_1, \ldots, p_m) \qquad (3)$$

The (m/z)corr with sufficient mass accuracy can then be obtained by applying Mcal to all other (m/z)obs.
For linear calibration, the parameters to be determined in eq 3 can be fitted by a multivariate linear least-squares (LS) regression. LS here aims to minimize the root-mean-squared mass error

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} r_i^{2}}$$

where $r_i = y_i - \hat{y}_i$ is the residual for the ith data point and $\hat{y}_i$ is the fitted response value. In this case, $y_i$ is the measured m/z and the fitted response value is the theoretical m/z. In practice, the residual is also expressed as the mass error (Merr) in mDa, which is the difference between the measured mass and the theoretical formula mass, obtained by adding or removing a proton to or from m/z, assuming an ion charge state of ±1.
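The Merr and RMSE definitions above can be illustrated with a minimal Python sketch. The function names, the `mode` argument, and the example inputs are assumptions made for illustration and are not part of the workflow described in this document; singly charged [M−H]− or [M+H]+ ions are assumed.

```python
import math

# Proton mass in Da (hydrogen atom mass minus one electron mass).
PROTON_MASS = 1.007276467

def neutral_mass(mz, mode="negative"):
    """Convert the m/z of a singly charged ion to a neutral mass.

    Assumes [M-H]- in negative mode (add a proton) and
    [M+H]+ in positive mode (remove a proton).
    """
    if mode == "negative":
        return mz + PROTON_MASS
    return mz - PROTON_MASS

def merr_mda(mz, theoretical_mass, mode="negative"):
    """Mass error in mDa: measured neutral mass minus theoretical mass."""
    return (neutral_mass(mz, mode) - theoretical_mass) * 1000.0

def rmse(residuals):
    """Root-mean-squared error of a list of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Illustrative input: m/z 313.05645 compared with the monoisotopic mass of
# C13H14O9 (314.063782 Da, computed for this example, not quoted from the data).
example_merr = merr_mda(313.05645, 314.063782)
```

Because Merr is an absolute difference in mass, it does not depend on where in the mass range the peak lies, which is what makes it convenient for the distribution analysis that follows.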

Relative mass error (RME) and its distribution
It should be noted that the relative mass error in ppm (eq 6) does not follow a normal distribution, unlike the mass error in mDa.
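One way to see why the relative error behaves differently is that the same absolute error maps to different ppm values across the mass range. The short sketch below is an illustration only: the function name and the chosen masses are assumptions, and the sign convention of the relative error is left out on purpose.

```python
def rme_ppm(merr_mda, mass_da):
    """Relative mass error in ppm for an absolute error given in mDa."""
    return (merr_mda / 1000.0) / mass_da * 1e6

# The same 0.1 mDa error expressed in ppm at three hypothetical masses.
for mass in (200.0, 400.0, 800.0):
    print(f"{mass:.0f} Da: {rme_ppm(0.1, mass):.3f} ppm")
```

An error of 0.1 mDa corresponds to 0.5 ppm at 200 Da but only 0.125 ppm at 800 Da, so a constant offset in mDa is smeared out once it is divided by a variable mass.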

Sample description
Samples in this study come from different sources and treatment processes, and the FT-ICR spectra were obtained with different measurement modes, either direct infusion (DI) or hyphenated with LC. Every sample was diluted to 10 mg/L DOC with ultrapure water for FT-ICR-MS analysis.

Performance of internal calibrations and robustness of median/mean value of Merr distribution
Internal calibration was performed for every spectrum and all segments in DataAnalysis software, with known CHO series, yielding a root-mean-squared mass error (RMSE) of less than 0.2 ppm. For LC-FT-ICR MS measurements, the whole spectrum was segmented by minutes, and each segment was treated as an individual spectrum and internally calibrated; the calibration results of the segment with the highest total ion intensity were exported as calibration performance. For experimental replicates, the calibration results with the highest RMSE are used below for comparison.
Calibration performances of the 5 datasets were examined to evaluate the robustness and reproducibility of the applied internal calibration. Except for blanks, all measurements/datasets were calibrated with abundant calibrants across the mass range and yielded an overall RMSE of less than 0.2 ppm, most of them even below 0.1 ppm (Table S2).
For the SRFA_CBZ_2H dataset, which has the most multiple assignments (MultiAs), the relative mass errors (RMEs) of calibrants remained well within the pre-set tolerance threshold (±0.200 ppm) after internal calibration, as shown in Figure S3A. The averages of the median and mean values of RMEs in this dataset were 0.004 (±0.012) ppm and 0.006 (±0.005) ppm, respectively, indicating excellent overall calibration performance and negligible systematic error.
The robustness of the mean/median as a proxy for distinguishing true and false assignments was also examined.
The Merr distributions of all internal calibrations from the SRFA_CBZ_2H dataset were also checked, as shown in Figure S3B. The Merr of all internal calibrants was as small as 0.100 mDa, and the median and mean values were within 0.010 mDa (0.001 ± 0.003 mDa and 0.006 ± 0.004 mDa, respectively), demonstrating that systematic errors were largely eliminated. Since these values are smaller than the smallest mass difference of replacement pairs observed so far, medians/means are robust enough as references for recognizing false assignments in groups caused by replacement pairs. Given that medians have a lower standard deviation (STD) than mean values, mainly because outliers have less leverage on them, the medians of Merr were used for evaluation and comparison in this study.
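The preference for the median can be illustrated with a small Python sketch. The Merr values, the helper name, and the "nearer to zero than to the pair difference" rule are hypothetical and chosen only to show the effect of a single outlier; they are not the decision rule of the workflow itself. The 0.026 mDa value is the smallest replacement-pair mass difference quoted in the Figure S3 caption.

```python
from statistics import mean, median

def nearer_zero_than_pair(center_mda, pair_difference_mda):
    """Hypothetical check: is the group center closer to 0 than to the
    mass difference of the replacement pair?"""
    return abs(center_mda) < abs(pair_difference_mda) / 2.0

# Hypothetical Merr values (mDa) of one group, including one outlier.
merr_group = [0.004, -0.010, 0.012, 0.001, -0.003, 0.150]

group_median = median(merr_group)
group_mean = mean(merr_group)

# Smallest replacement-pair mass difference quoted for the SRFA datasets.
pair_difference = 0.026

median_says_true = nearer_zero_than_pair(group_median, pair_difference)
mean_says_true = nearer_zero_than_pair(group_mean, pair_difference)
```

In this made-up group the single outlier pulls the mean past half of the pair difference while the median stays close to zero, which mirrors the lower leverage of outliers on the median noted above.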

EfOM_Oz_18O dataset
Wastewater treatment plant effluent (EfOM) samples were oxidized with heavy ozone (18O3, 50% purity), after which organic matter was isolated by solid phase extraction and measured by DI-FT-ICR-MS.
Spectra were processed using absorption mode processing to improve mass accuracy after acquisition.
The EfOM_Oz_18O dataset consists of 2 samples, ozonated (EfOM_Oz) and unozonated EfOM, and was used here for the analysis of MultiAs. Molecular formulas were restricted with a strict RME threshold of ±0.2 ppm, after which the 34S and 13C isotopologue peak abundances were validated.

where the standardized difference is Δ = (µ0 − µ1)/σ. The distribution is compared with a known population value (µ0 = 0), and n is calculated for the one-sample case (α = 0.05).
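A minimal sketch of how n could be computed from the standardized difference Δ is given below, assuming the common normal-approximation formula n = ((z(1−α/2) + z(power))/Δ)² for a two-sided one-sample test. The two-sided form, the power of 0.8, and the function name are assumptions made here; only α = 0.05 and µ0 = 0 come from the text.

```python
import math
from statistics import NormalDist

def one_sample_n(mu1, sigma, mu0=0.0, alpha=0.05, power=0.8):
    """Approximate sample size for a two-sided one-sample test.

    delta is the standardized difference (mu0 - mu1) / sigma.
    power=0.8 is an assumed default, not a value from the source.
    """
    delta = abs(mu0 - mu1) / sigma
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / delta) ** 2)

# Hypothetical inputs: a shift equal to one SD, and a shift of half an SD.
n_one_sd = one_sample_n(mu1=1.0, sigma=1.0)
n_half_sd = one_sample_n(mu1=0.5, sigma=1.0)
```

Under these assumptions, halving the standardized difference roughly quadruples the number of peaks needed, which is why replacement pairs with small mass differences require larger groups.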

Figure S5. Example multiple assignment and its replacement pair in SRFA dataset.

SRFA_Na Dataset

Figure S8. Mass error distribution of MultiAs caused by Na+ adducts in SRFA_Na dataset.

Figure S11. (A) Sample size needed for proper estimation of the Merr distribution of different replacement pairs and number of KMD series in the SRFA dataset; (B) sample size estimation with different SD (according to instrumental mass accuracy).

Figure S12. S/N distributions of 35Cl formulas before data filtering (plotted with a bin size of 1).

Figure S1. Schema for mass error (mDa) distribution in a multiple assignment caused by a specific replacement pair with varying mass difference: (A) the mass difference of the replacement pair is larger than twice the standard deviation of the Merr distribution of true assignments, resulting in a bimodal distribution with two local maxima; (B) the mass difference of the replacement pair is less than twice the standard deviation of the Merr distribution of true assignments, resulting in a unimodal distribution with one non-zero center. Recognition of false assignments is possible if the underlying distributions can be recognized.
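The two cases in Figure S1 can be expressed as a simple check. This is a sketch only: the function name and the SD of 0.05 mDa are hypothetical, and the threshold of twice the standard deviation is taken directly from the caption. The two mass differences used as inputs (0.176 mDa and 0.026 mDa) are values quoted elsewhere in this document.

```python
def expected_shape(pair_difference_mda, sd_mda):
    """Expected shape of the stacked Merr distribution, following the
    'twice the standard deviation' criterion of Figure S1."""
    if abs(pair_difference_mda) > 2.0 * sd_mda:
        return "bimodal"
    return "unimodal"

# Hypothetical SD of the true-assignment Merr distribution (mDa).
assumed_sd = 0.05

shape_large_pair = expected_shape(0.176, assumed_sd)  # O1P1 / C1 35Cl1
shape_small_pair = expected_shape(0.026, assumed_sd)  # smallest observed pair
```

With the assumed SD, the larger pair falls into case (A) and the smaller pair into case (B); a different instrumental mass accuracy would move the boundary accordingly.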
$$\text{Relative mass error} = \frac{(m/z)_{\mathrm{theo.}} - (m/z)_{\mathrm{meas.}}}{(m/z)_{\mathrm{theo.}}} \times 10^{6} \quad \text{in ppm (parts per million)} \qquad (6)$$

Assume that the mass also follows a normal distribution, i.e., $m \mid X \sim N(\mu, \sigma^2)$. The ratio of these two normally distributed variables, $F_z = (\varepsilon \mid X)/(m \mid X)$, is then not normally distributed. $F_z$ has no finite moments and is heavy tailed; its shape can be bimodal, asymmetric, symmetric, or even close to a normal distribution, depending largely on the coefficient of variation of $m \mid X$ (1). $F_z$ in this case is centered at $\beta = E(\varepsilon)/E(m)$. If $E(\varepsilon)$ is 0, then $\beta$ equals 0; otherwise $\beta$ is a non-zero value that varies with $E(m)$. Also, when $F_z$ is approximated by a Cauchy distribution, the mean may not be obtainable and the median should be used. Considering the availability of the mean value and the large leverage of outliers, especially in small data groups, the median appears to be more robust for practical usage.

Figure S2. Schema for distribution of relative mass errors: (A) bimodal distribution with 2 peaks; (B) unimodal with one nonzero center; and (C) unimodal with one zero center stacked from complete overlap of true and false assignments.

Additional Reference:
(1) Díaz-Francés, E.; Rubio, F. J. On the Existence of a Normal Approximation to the Distribution of the

Figure S3. Performance of internal calibrations from the SRFA_CBZ_2H dataset: (A) relative mass error in ppm; (B) mass error in mDa. 0.026 mDa refers to the smallest mass difference of replacement pairs observed in SRFA datasets.

Figure S4. Frequency of replacement pairs in MultiAs observed in SRFA dataset with 2 different CFC: (A) MultiAs caused by replacement pairs from CFC-N5S3; (B) MultiAs caused by replacement pairs from CFC-N3S1.

Figure S7. Expanded section from a full scan mass spectrum showing Na+ adducts in SRFA_Na (blue; m/z 335.03846: [C13H12Na1O9]− and m/z 335.07487: [C14H16Na1O8]−) measured in ESI negative mode. These peaks are not present in SRFA (i.e., without NaCl added, red). Inset shows mass peaks at m/z 313.05645: [C13H13O9]− and m/z 313.09294: [C14H17O8]−, corresponding to the deprotonated form of the Na+ adducts. The peak magnitude of the deprotonated species decreases upon addition of Na. The Na+ adducts also have a multiple assignment in the form of a highly unsaturated and oxygen-poor S-containing molecular formula ([C19H11O4S1]− and [C20H15O3S1]−), which may be the only assignment if Na is not considered.

Figure S9. van Krevelen plot of MultiAs caused by the replacement pair C6S (CHOS molecular formula (MF)) vs HO5Na (CHO_Na MF) in SRFA_Na dataset. Arrows indicate changes in O/C and H/C between false-assigned CHOS formulas (sometimes referred to as "black sulfur") and true-assigned CHO sodium adducts.

Figure S10. Mass error distribution of molecular formulas (MFs) related to O1P1 / C1 35Cl1 ("CHOP" MF vs. "CHOCl" MF, 0.176 mDa difference in mass) in the DW_Cl2 dataset: (A) overlapped mass error distribution of multiple assignments in ppm; (B) overlapped mass error distribution of multiple assignments in mDa; and (C) data filtered by Merr inspection in homologous groups (here CHO refers to the CHOCl MF class). Note that 2 P-containing MFs were retained after filtration.
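The mass difference of a replacement pair follows directly from monoisotopic masses. The sketch below recomputes the 0.176 mDa difference quoted for O1P1 vs C1 35Cl1 and, in the same way, the difference for the C6S vs HO5Na pair of Figure S9. The dictionary, the function names, and the use of standard monoisotopic mass values are assumptions of this illustration and are not taken from the authors' script.

```python
# Monoisotopic masses in Da (standard reference values, assumed here).
MONOISOTOPIC = {
    "H": 1.00782503223,
    "C": 12.0,
    "O": 15.99491461957,
    "Na": 22.9897692820,
    "P": 30.97376199842,
    "S": 31.9720711744,
    "35Cl": 34.968852682,
}

def unit_mass(units):
    """Monoisotopic mass of a formula unit given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in units.items())

def pair_difference_mda(unit_a, unit_b):
    """Mass difference (unit_b minus unit_a) of a replacement pair in mDa."""
    return (unit_mass(unit_b) - unit_mass(unit_a)) * 1000.0

# O1P1 replaced by C1 35Cl1 ("CHOP" vs "CHOCl").
diff_p_cl = pair_difference_mda({"O": 1, "P": 1}, {"C": 1, "35Cl": 1})

# C6S replaced by HO5Na (CHOS vs CHO sodium adduct).
diff_s_na = pair_difference_mda({"C": 6, "S": 1}, {"H": 1, "O": 5, "Na": 1})
```

Under these mass values the first pair reproduces the 0.176 mDa stated in the caption, and the C6S vs HO5Na pair comes out at roughly 0.096 mDa, both well below typical ppm-based tolerances at these masses.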

Figure S15. Mass error distribution of 2H formulas: (A) 2H formulas in multiple assignments; (B) multiple assignments filtered by Merr inspection.

Table S10. Performance of automatic data filtering algorithm for SRFA dataset.

Table S11. Gaussian distribution fitting of Merr in SRFA dataset.

Table S20. Duration reported when running the R script snippet for different data inputs. Results were tested on a laptop with an Intel-i7 CPU and a 512 GB SSD, R version 4.2.1.

Table S1. Description of samples and data acquisition.

Table S2. Performance of internal calibrations.

Table S3. Chemical formula configuration used for different datasets.

Table S4. Total number of formula assignments in SRFA dataset before automatic filtration.

Table S5. Main replacement pairs that cause multiple assignments (MultiAs) in SRFA dataset.

Table S6. Total number of formula assignments in SRFA dataset after automatic filtration.
Figure S6. Merr distribution of SRFA dataset: (A) all formulas before filtration; (B) formulas with MultiAs filtered by Merr inspection subset from KMD-CH2 and formula classes.

Table S7. Total number of formula assignments in EfOM_Oz_18O dataset before automatic filtration.

Table S8. Main replacement pairs that cause multiple assignments (MultiAs) in EfOM_Oz_18O.

Table S9. Total number of formula assignments in EfOM_Oz_18O dataset after workflow filtration. MF = Molecular formula.

Table S12. Total number of formula assignments in DW_Cl2 dataset before automatic filtration.

Table S14. Total number of formula assignments in DW_Cl2 dataset after automatic filtration.

Table S15. Total number of chlorine formula assignments in DW_Cl2 dataset before and after filtration.

Table S16. Performance of automatic data filtering algorithm for DW_Cl2 dataset. MF = Molecular formula.

Table S19. Total number of formula assignments in SRFA_CBZ_2H dataset before and after filtration. MF = Molecular formula.
