A Statistically Rigorous Test for the Identification of Parent−Fragment Pairs in LC-MS Datasets

Abstract
Untargeted global metabolic profiling by liquid chromatography−mass spectrometry generates numerous signals that are due to unknown compounds and whose identification poses an important challenge. The analysis of metabolite fragmentation patterns, following collision-induced dissociation, provides a valuable tool for identification, but can be severely impeded by close chromatographic coelution of distinct metabolites. We propose a new algorithm for identifying related parent−fragment pairs and for distinguishing these from signals due to unrelated compounds. Unlike existing methods, our approach addresses the problem by means of a hypothesis test that is based on the distribution of the recorded ion counts, and thereby provides a statistically rigorous measure of the uncertainty involved in the classification problem. Because of technological constraints, the test is of primary use at low and intermediate ion counts, above which detector saturation causes substantial bias in the recorded ion count. The validity of the test is demonstrated through its application to pairs of coeluting isotopologues and to known parent−fragment pairs, which results in test statistics consistent with the null distribution. The performance of the test is compared with a commonly used Pearson correlation approach and found to be considerably better (e.g., a false positive rate of 6.25%, compared with 50% for the correlation, for perfectly coeluting ions). Because the algorithm may be used for the analysis of high-mass compounds in addition to metabolic data, we expect it to facilitate the analysis of fragmentation patterns for a wide range of analytical problems.
Theory
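For LC-MS data, the rate of ion arrivals of a particular molecular species will be a function of its elution time, t, so that we may write
$$k(t) \sim \mathrm{Pois}\big(\lambda(t)\big),$$
where k(t) denotes the number of ions of that species counted in the scan at time t.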
For a Poisson distribution, the rate parameter (λ) is equal to the mean. The centroided rate function, λ(t), therefore describes the mean number of ion arrivals of a particular molecular species within one scan, as a function of retention time. The rate function may be regarded as the product of the concentration and the ionization propensity of the compound, and consequently may be written
$$\lambda(t) = \pi\, Q(t),$$
where π is a measure of the compound’s ionization propensity, and Q(t) is a measure of its concentration in the retention time dimension. Supposing a metabolite were to fragment into multiple ions after eluting from the chromatographic column, these would all share the same Q(t) as the original metabolite. This provides the basis for constructing a test of hypothesis for exact coelution.
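Consider two ions with counts $k_0$ and $k_1$ in a given scan and with rate functions $\lambda_0$ and $\lambda_1$. If the counts are independent, their joint probability is
$$P(k_0, k_1) = \frac{\lambda_0^{k_0} e^{-\lambda_0}}{k_0!}\,\frac{\lambda_1^{k_1} e^{-\lambda_1}}{k_1!},$$
where, for the sake of conciseness, the dependence on t has been omitted. However, if
$$n = k_0 + k_1, \qquad \mu = \lambda_0 + \lambda_1, \qquad \rho = \frac{\lambda_0}{\lambda_0 + \lambda_1},$$
we may, following Przyborowski and Wilenski,(25) rewrite the joint probability as
$$P(k_0, k_1) = \frac{\mu^{n} e^{-\mu}}{n!}\,\binom{n}{k_0}\,\rho^{k_0}(1-\rho)^{n-k_0},$$
which is the joint probability of a Poisson distribution with mean μ (which determines the sum of the ion counts, n) and a binomial distribution of n trials with probability ρ (which determines what portion of the sum is due to $k_0$ in particular). If the two ions under investigation exhibit exact coelution and, hence, share the same Q(t), then this term cancels out from the expression for the binomial probability, which, when reinstating the dependence on t, can be written as
$$\rho(t) = \frac{\pi_0 Q(t)}{\pi_0 Q(t) + \pi_1 Q(t)} = \frac{\pi_0}{\pi_0 + \pi_1},$$
which will therefore be constant across retention time. Under the null hypothesis of constant binomial probabilities, Pearson’s χ² goodness-of-fit statistic,
$$x^2 = \frac{(k_0 - n\rho)^2}{n\rho} + \frac{\big(k_1 - n(1-\rho)\big)^2}{n(1-\rho)} = \frac{(k_0 - n\rho)^2}{n\rho(1-\rho)},$$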
approximates a χ² distribution with one degree of freedom (see, for example, Wackerly et al.,(26) p 682). We can evaluate this statistic for all N data points across the chromatographic peak and sum them to obtain a pooled statistic, X² = Σx², which approximates a χ² distribution with N − 1 degrees of freedom, since ρ must be estimated from the data. The approximation to the χ² distribution works best when n is large and ρ is moderate. A standard test of validity is to require that nρ ≥ 5 and n(1 − ρ) ≥ 5; data points for which this is not the case should be left out or pooled together.
If this estimator is used, then the overall computational requirements of the test will be very low, and generally comparable to those of the Pearson correlation. Computational efficiency is an important property, considering the size of typical LC-MS datasets.
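The pooled test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `coelution_gof_test` and the `min_expected` parameter are hypothetical, and ρ is assumed here to be estimated by the pooled proportion of counts belonging to the first ion. Scans that fail the validity rule are simply dropped rather than pooled together.

```python
import numpy as np
from scipy.stats import chi2


def coelution_gof_test(k0, k1, min_expected=5.0):
    """Pooled Pearson chi-square test for a constant binomial probability.

    k0, k1: ion counts of the two ions over the same N scans.
    Returns (X2, degrees_of_freedom, p_value).
    """
    k0 = np.asarray(k0, dtype=float)
    k1 = np.asarray(k1, dtype=float)
    n = k0 + k1

    # ASSUMPTION: rho is estimated by the pooled proportion over all scans.
    rho = k0.sum() / n.sum()

    # Validity rule from the text: keep scans with n*rho >= 5 and n*(1 - rho) >= 5.
    keep = (n * rho >= min_expected) & (n * (1.0 - rho) >= min_expected)
    if keep.sum() < 2:
        raise ValueError("need at least two scans that satisfy the validity rule")
    k0, n = k0[keep], n[keep]

    # Per-scan statistic: x^2 = (k0 - n*rho)^2 / (n * rho * (1 - rho)).
    x2 = (k0 - n * rho) ** 2 / (n * rho * (1.0 - rho))

    X2 = x2.sum()
    dof = x2.size - 1  # one degree of freedom is lost to estimating rho
    p_value = chi2.sf(X2, dof)
    return X2, dof, p_value
```

A small p-value is evidence against a constant ρ, and hence against exact coelution. The work is a handful of vectorized passes over the N scans, in line with the low computational cost noted above.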
where I is the mean number of ion arrivals over the entire chromatographic peak. For given values of μ, σ, and I, we may then simulate Poisson-distributed random variables according to this model, for each of the N scans over which the chromatographic peak is to be investigated. If two simulated chromatographic peaks share the same μ and σ, then the result of applying the GOF test to their counts will be a p-value that is approximately uniformly distributed. A discrepancy in the μ values, for instance, would tend to inflate the X² statistic and result in a correspondingly low p-value.
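The simulation described above can be sketched as follows. The peak model used here is a Gaussian-shaped rate scaled so that the expected counts over all scans sum to a chosen total. This parameterization, the function names, and every numerical setting (number of scans, peak width, totals, number of repetitions) are assumptions made for illustration and need not match the model behind Figure 1. ρ is again assumed to be estimated by the pooled proportion.

```python
import numpy as np
from scipy.stats import chi2


def gaussian_rate(scans, mu, sigma, total):
    """Hypothetical peak model: Gaussian-shaped rate whose values sum to `total`."""
    shape = np.exp(-0.5 * ((scans - mu) / sigma) ** 2)
    return total * shape / shape.sum()


def pooled_gof_pvalue(k0, k1, min_expected=5.0):
    """p-value of the pooled X^2 statistic (rho estimated by the pooled proportion)."""
    n = k0 + k1
    rho = k0.sum() / n.sum()
    keep = (n * rho >= min_expected) & (n * (1 - rho) >= min_expected)
    x2 = (k0[keep] - n[keep] * rho) ** 2 / (n[keep] * rho * (1 - rho))
    return chi2.sf(x2.sum(), x2.size - 1)


def simulate_pvalue(shift, rng, scans, sigma):
    """Simulate one pair of peaks whose means differ by `shift` scans, then test it."""
    lam0 = gaussian_rate(scans, 30.0, sigma, 20000.0)
    lam1 = gaussian_rate(scans, 30.0 + shift, sigma, 10000.0)
    return pooled_gof_pvalue(rng.poisson(lam0), rng.poisson(lam1))


rng = np.random.default_rng(0)
scans = np.arange(60, dtype=float)
sigma = 6.0

# Exact coelution versus a shift of 10% of the peak standard deviation.
p_exact = np.array([simulate_pvalue(0.0, rng, scans, sigma) for _ in range(500)])
p_shift = np.array([simulate_pvalue(0.1 * sigma, rng, scans, sigma) for _ in range(500)])

print("rejection rate at 5%, exact coelution:", np.mean(p_exact < 0.05))
print("rejection rate at 5%, shifted peaks:  ", np.mean(p_shift < 0.05))
```

Under exact coelution the rejection rate should sit near the nominal 5%, reflecting approximately uniform p-values. With the shifted means the X² statistic is inflated and most repetitions are rejected at these ion counts.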
Figure 1. (Top) Two simulated chromatographic peaks exhibiting exact coelution (a) and two simulated chromatographic peaks exhibiting very close but partial coelution (c), as indicated by the shifted means (10% of the standard deviation of the peaks). (Bottom) The corresponding scatterplots with the p-values of the x²-statistics of each data point indicated by color code. Low counts for which the distribution of the x²-statistics may deviate substantially from the χ₁²-distribution are excluded, and these data points are indicated in black. While the correlations are approximately the same in either scenario, the p-value of the pooled X²-statistic is highly significant under partial coelution (p = 0.0079), but quite moderate under exact coelution (p = 0.1489).

Figure 2. Similar to Figure 1, except, in this case, real LC-MS data derived from synthetic urine are used. The left-hand side shows the chromatographic peaks (a) and scatterplot (b) of a pair of isotopologues, which, like related fragments, may be expected to exhibit exact coelution; the right-hand side shows the chromatographic peaks (c) and scatterplot (d) of two presumably unrelated compounds. The difference in the estimated means is 6.34 times the estimated standard deviation.
Experimental Section
Results

Figure 3. Continuum plots of a pair of isotopologues. The x-axis indicates the chromatographic scan number, while the y-axis indicates each of the individual “ticks” of the clock that measures the time-of-flight of the ions, along with the corresponding m/z values. The number of ions counted at each tick is indicated by the color code. In these two cases, there are no apparent signs of interference from other compounds of similar masses.
| cluster | compound | scan number/retention time (min) (a) | m/z | isotopologue (b) |
|---|---|---|---|---|
| 1 | N-acetyl-L-glutamic acid | 857/1.652 | 188.0426 | [M−H]⁻ |
| | | 868/1.672 | 189.0572 | [M+1−H]⁻ |
| | | 877/1.690 | 190.0615 | [M+2−H]⁻ |
| 2 | uridine | 1096/2.114 | 243.0537 | [M−H]⁻ |
| | | 1104/2.129 | 244.0672 | [M+1−H]⁻ |
| 3 | 4-aminohippuric acid | 1677/3.234 | 193.0507 | [M−H]⁻ |
| | | 1681/3.241 | 194.0653 | [M+1−H]⁻ |
| | | 1681/3.241 | 195.0645 | [M+2−H]⁻ |
| 4 | glutaric acid | 1724/3.323 | 131.0251 | [M−H]⁻ |
| | | 1755/3.382 | 132.0381 | [M+1−H]⁻ |
| 5 | methylsuccinic acid | 2471/4.763 | 132.0384 | [M+1−H]⁻ |
| | | 2414/4.654 | 133.0376 | [M+2−H]⁻ |
| 6 | 3-nitro tyrosine | 2877/5.546 | 225.0399 | [M−H]⁻ |
| | | 2871/5.535 | 226.0579 | [M+1−H]⁻ |
| 7 | adipic acid | 2952/5.689 | 145.0464 | [M−H]⁻ |
| | | 2951/5.687 | 146.0554 | [M+1−H]⁻ |
| 8 | indoxyl sulfate | 2971/5.725 | 211.9924 | [M−H]⁻ |
| | | 2965/5.714 | 213.0030 | [M+1−H]⁻ |
| | | 2975/5.733 | 213.9970 | [M+2−H]⁻ |
| | | 2972/5.727 | 214.9968 | [M+3−H]⁻ |
| 9 | suberic acid | 3635/7.007 | 173.0707 | [M−H]⁻ |
| | | 3617/6.973 | 174.0842 | [M+1−H]⁻ |
| | | 3623/6.985 | 175.0832 | [M+2−H]⁻ |
| 10 | salicylic acid | 4096/7.895 | 138.0239 | [M+1−H]⁻ |
| | | 4100/7.903 | 139.0298 | [M+2−H]⁻ |
| 11 | sebacic acid | 4615/8.893 | 202.1097 | [M+1−H]⁻ |
| | | 4610/8.884 | 203.1155 | [M+2−H]⁻ |

(a) This may not correspond to the global maximum of the peak, because, for many clusters, parts of the chromatographic peaks were left out, to avoid “contamination” from distinct compounds of similar masses.
(b) Here, “[M−H]⁻” denotes the negatively ionized lowest-mass isotopologue of the metabolite in question.

Figure 4. Scatterplots for the three datasets derived from 4-aminohippuric acid: (a) the full dataset, (b) the dataset with low and moderate counts, and (c) the dataset with only low counts. The approximate p-values of the x²-statistics are indicated by color code, and the p-values of the pooled X²-statistics are listed.

Figure 5. (Top) Histograms of the p-values corresponding to the x²-statistics derived from the three datasets and (bottom) quantile−quantile plots of the x²-statistics themselves, compared to the theoretical χ₁²-distribution. Only the dataset of low counts seems to closely approximate the χ₁²-distribution.

| | full dataset | ion count: 0−600 | ion count: 0−300 |
|---|---|---|---|
| percentage of x²-statistics in 5% critical region | 31.13% (1896/6090) | 6.45% (260/4029) | 4.99% (149/2986) |
| percentage of x²-statistics in 1% critical region | 21.51% (1310/6090) | 1.51% (61/4029) | 0.87% (23/2986) |
| GOF p-value for X²-statistics | <10⁻⁷ | <10⁻⁷ | 0.5642 |
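
The critical-region percentages reported in the table above are the share of per-scan x²-statistics that exceed the corresponding quantile of the χ² distribution with one degree of freedom. A minimal sketch of that calculation follows. The function name `critical_region_fraction` is hypothetical, and the counts are simulated binomial draws with arbitrary n and ρ, not the data behind the table.

```python
import numpy as np
from scipy.stats import chi2


def critical_region_fraction(x2, alpha):
    """Fraction of per-scan x^2 statistics above the chi-square(1) critical value."""
    x2 = np.asarray(x2, dtype=float)
    critical_value = chi2.ppf(1.0 - alpha, df=1)
    return float(np.mean(x2 > critical_value))


# Illustrative null data: binomial counts with a known, constant rho.
rng = np.random.default_rng(1)
n, rho = 200, 0.4
k0 = rng.binomial(n, rho, size=5000)
x2 = (k0 - n * rho) ** 2 / (n * rho * (1 - rho))

frac_5 = critical_region_fraction(x2, 0.05)
frac_1 = critical_region_fraction(x2, 0.01)
print(f"in 5% critical region: {100 * frac_5:.2f}%")
print(f"in 1% critical region: {100 * frac_1:.2f}%")
```

When the statistics follow the null distribution, the two fractions should be close to 5% and 1%, as in the low-count column of the table. Markedly larger fractions indicate a departure from the null distribution.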

Figure 6. (Top) Histograms of the p-values corresponding to the x²-statistics derived from the three datasets after they had been corrected for detector saturation and (bottom) quantile−quantile plots of the x²-statistics themselves, compared to the theoretical χ₁²-distribution. Only for the dataset of low and moderate counts does the correction seem to cause the distribution of the x²-statistics to be substantially closer to the χ₁²-distribution than it was for the raw data, although some deviations remain.

| | full dataset | ion count: 0−600 | ion count: 0−300 |
|---|---|---|---|
| percentage of x²-statistics in 5% critical region | 20.89% (1183/5662) | 5.70% (209/3664) | 5.24% (138/2633) |
| percentage of x²-statistics in 1% critical region | 13.49% (764/5662) | 1.06% (39/3664) | 0.72% (19/2633) |
| GOF p-value for X²-statistics | <10⁻⁷ | 0.0074 | 0.3190 |

Figure 7. Ion counts of 4-aminohippuric acid (blue), the fragment formed by the loss of carbon dioxide (black), and a partially coeluting compound (red) used in the ionization suppression test. If the rate functions of 4-aminohippuric acid and its fragment were reduced by significantly differing factors by the partially coeluting compound, we would expect their ratio to start shifting near scan number 1680, but no such effect is observed.
In all cases, a distinct metabolite was coeluting with the above pairs. The scan numbers of the apices of the chromatographic peaks used in the evaluation of the GOF test are listed.

Figure 8. Histogram of the p-values returned by the GOF test when applied to the low ion counts of the six parent−fragment pairs (left), and quantile−quantile plot of the corresponding x²-statistics (right). The results are consistent with those obtained for the isotopologues, and there is no evidence that the coelution of distinct compounds affects the validity of the GOF test.

Figure 9. Plots of the percentage of the isotopologue pairs that are classified as exhibiting partial coelution by the GOF test (blue) and the correlation (red), as a function of “increasingly partial” coelution. Only the leftmost point corresponds to exactly coeluting peaks and thereby indicates the false-positive rate. False-negative rates correspond to 100 minus the ordinate for nonzero retention time shifts. Plot (a) standardizes the two tests by matching their false positive rates, while plot (b) matches their false negative rates. Clearly, the performance of the GOF test is considerably better than that of the correlation.
Discussion and Conclusion
Acknowledgment
Thanks are due to Natalja Strelkowa for valuable discussions. The authors acknowledge Laura Egnash and Michael Reilly (formerly of the Department of Discovery Biomarkers, Pfizer Global R&D, Ann Arbor, MI), for providing the synthetic urine. This work was supported by the Wellcome Trust, through Grant No. 080714/Z/06/Z. E.J.W. would like to acknowledge Waters Corporation for funding.
References
This article references 30 other publications.
- 1.
- 2.
- 3.
- 4. Want, E. J., Nordstrom, A., Morita, H. and Siuzdak, G. J. Proteome Res. 2007, 6, 459–468
- 5. Griffiths, W., Jonsson, A. P., Liu, S., Rai, D. K. and Wang, Y. J. Biochem. 2001, 355, 545–561
- 6.
- 7. O’Connor, P., Little, D. P. and McLafferty, F. W. Anal. Chem. 1996, 68, 542–545
- 8.
- 9. Wishart, D. S., Tzur, D., Knox, C., Eisner, R., Guo, A. C., Young, N., Cheng, D., Jewell, K., Arndt, D., Sawhney, S., Fung, C., Nikolai, L., Lewis, M., Coutouly, M. A., Forsythe, I., Tang, P., Shrivastava, S., Jeroncic, K., Stothard, P., Amegbey, G., Block, D., Hau, D. D., Wagner, J., Miniaci, J., Clements, M., Gebremedhin, M., Guo, N., Zhang, Y., Duggan, G. E., Macinnis, G. D., Weljie, A. M., Dowlatabadi, R., Bamforth, F., Clive, D., Greiner, R., Li, L., Marrie, T., Sykes, B. D., Vogel, H. J. and Querengesser, L. Nucleic Acids Res. 2007, 35, D521–D526
- 10.
- 11. Wilson, I. D., Nicholson, J. K., Castro-Perez, J., Granger, J. H., Johnson, K. A., Smith, B. W. and Plumb, R. S. J. Proteome Res. 2005, 4, 591–598
- 12.
- 13. Tautenhahn, R., Bottcher, C. and Neumann, S. Lecture Notes in Computer Science: Bioinformatics Research and Development; Springer: Heidelberg, 2007; pp 371−380.
- 14.
- 15. Geromanos, S. J., Silva, J. C., Li, G.-Z. and Gorenstein, M. V. U.S. Patent Application US 2008/0272292, 2008.
- 16.
- 17.
- 18.
- 19. Hoffmann, E. and Stroobant, V. Mass Spectrometry: Principles and Applications, Third Edition; Wiley: New York, 2007.
- 20. Bateman, R. H., Brown, J. M., Green, M. and Wildgoose, J. L. International Patent WO 2006/129094, 2006.
- 21. Green, M., Wildgoose, J. L. and Gorenstein, M. V. International Patent WO 2006/090138, 2006.
- 22. Hoyes, J. and Cottrel, J. International Patent WO 99/38192, 1999.
- 23.
- 24. Bateman, R. H., Green, M. and Jackson, M. U.S. Patent 7,038,197, 2006.
- 25. Przyborowski, J. and Wilenski, H. Biometrika 1940, 31 (3−4), 313–323
- 26. Wackerly, D. D., Mendenhall, W. and Scheaffer, R. L. Mathematical Statistics with Applications, Sixth Edition; Duxbury: Pacific Grove, CA, 2002.
- 27.
- 28. Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R. and Siuzdak, G. Anal. Chem. 2006, 78, 779–787
- 29.
- 30.
History
- Published in issue: March 01, 2010
- Article ASAP: February 09, 2010
- Received: October 18, 2009
- Accepted: January 4, 2010