Estimating the Confidence of Peptide Identifications without Decoy Databases
Abstract
Using decoy databases to compute the confidence of peptide identifications has become the standard procedure for mass spectrometry-driven proteomics. While decoy databases have numerous advantages, they double the run time and cannot be applied to every peptide identification problem, for example error-tolerant or de novo searches or the large-scale identification of cross-linked peptides. Instead, we propose a fast, simple, and robust mixture modeling approach that estimates the confidence of peptide identifications without decoy database searches and automatically checks whether its underlying assumptions are fulfilled. We evaluate this approach on 41 LC/MS data sets of varying complexity and origin. The results are very similar to those of the decoy database strategy, at a negligible computational cost. Our approach is applicable not only to standard protein identification workflows, but also to proteomics problems for which meaningful decoy databases cannot be constructed.
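To make the general idea of decoy-free confidence estimation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that identification scores follow a mixture of exactly two normal distributions (a low-scoring "incorrect" and a high-scoring "correct" component), fits that mixture with a plain expectation-maximization loop, and estimates the false discovery rate above a score threshold as the mean posterior probability of the "incorrect" component. The function names, the simulated scores, and the fixed choice of two components are all hypothetical; the sketch also omits the component-number criterion and the χ2-test listed in the Methods outline.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_incorrect(x, w, mu, sd):
    """Posterior probability that score x belongs to component 0 (assumed 'incorrect')."""
    p_bad = w[0] * normal_pdf(x, mu[0], sd[0])
    p_good = w[1] * normal_pdf(x, mu[1], sd[1])
    total = p_bad + p_good
    return p_bad / total if total > 0.0 else 0.5


def fit_two_gaussians(scores, n_iter=200):
    """EM for a two-component 1D Gaussian mixture (assumption: two normal components)."""
    n = len(scores)
    ordered = sorted(scores)
    mean_all = sum(scores) / n
    sd_all = max(math.sqrt(sum((x - mean_all) ** 2 for x in scores) / n), 1e-6)
    # Initialise the means at the lower and upper quartiles of the scores.
    w = [0.5, 0.5]
    mu = [ordered[n // 4], ordered[(3 * n) // 4]]
    sd = [sd_all, sd_all]
    for _ in range(n_iter):
        # E-step: responsibility of the low-scoring component for every score.
        r_bad = [posterior_incorrect(x, w, mu, sd) for x in scores]
        # M-step: re-estimate weight, mean, and standard deviation per component.
        for k in (0, 1):
            r = r_bad if k == 0 else [1.0 - v for v in r_bad]
            nk = max(sum(r), 1e-12)
            w[k] = nk / n
            mu[k] = sum(ri * x for ri, x in zip(r, scores)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, scores)) / nk
            sd[k] = max(math.sqrt(var), 1e-6)
    # Keep component 0 as the one with the lower mean.
    if mu[0] > mu[1]:
        w.reverse()
        mu.reverse()
        sd.reverse()
    return w, mu, sd


def estimated_fdr(scores, threshold, w, mu, sd):
    """Mean posterior probability of being incorrect among scores >= threshold."""
    accepted = [x for x in scores if x >= threshold]
    if not accepted:
        return 0.0
    return sum(posterior_incorrect(x, w, mu, sd) for x in accepted) / len(accepted)


# Simulated scores (hypothetical data, not taken from the article).
random.seed(0)
scores = [random.gauss(20.0, 5.0) for _ in range(600)]
scores += [random.gauss(50.0, 8.0) for _ in range(400)]

w, mu, sd = fit_two_gaussians(scores)
fdr_loose = estimated_fdr(scores, 25.0, w, mu, sd)
fdr_strict = estimated_fdr(scores, 40.0, w, mu, sd)
print("weights:", w, "means:", mu, "sds:", sd)
print("estimated FDR at threshold 25:", fdr_loose)
print("estimated FDR at threshold 40:", fdr_strict)
```

In this sketch no decoy search is run at any point: the error estimate comes entirely from the shape of the fitted score distribution, which is why a check of the distributional assumptions matters before the estimate is trusted.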
This publication is licensed for personal use by The American Chemical Society.
Methods
Overview
Mixture Model
Number of Component Decision Criterion
False Discovery Proportion
χ2-Test
Implementation
Experiments
Results
Comparison with PeptideProphet without Decoy Information
Comparison with Original Decoy FDR
Influence of Sample Size
Influence of the Ratio of Good to Bad Spectra
Discussion
Determination of the Number of Components
Assumption of Normal Distributions
Conclusions
Supporting Information
Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.
Acknowledgment
The authors would like to thank Michael Hanselmann and Anna Kreshuk (Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Germany) for comments, suggestions, and fruitful discussions. We gratefully acknowledge financial support by the DFG under Grant No. HA4364/2-1 (B.Y.R., F.A.H.), the Alexander von Humboldt-Foundation (Grant 3.1-DEU/1134241 to M.K.), Robert Bosch GmbH (F.A.H.), as well as the Helmholtz Initiative for Systems Biology (F.A.H.).
References
This article references 17 other publications.
- 1Bradshaw, R. A., Burlingame, A. L., Carr, S., and Aebersold, R. Mol. Cell. Proteomics 2006, 5, 787– 7881https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BD28XkslClsr4%253D&md5=a0b5774e683a220d733ee4b07eb830e4Reporting protein identification data the next generation of guidelinesBradshaw, Ralph A.; Burlingame, Alma L.; Carr, Steven; Aebersold, RuediMolecular and Cellular Proteomics (2006), 5 (5), 787CODEN: MCPOBS; ISSN:1535-9476. (American Society for Biochemistry and Molecular Biology)There is no expanded citation for this reference.
- 2Choi, H., Ghosh, D., and Nesvizhskii, A. I. J. Proteome Res. 2008, 7, 286– 2922https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BD2sXhsVejtbbO&md5=c3f146e9befac3cbdd963098077a516aStatistical Validation of Peptide Identifications in Large-Scale Proteomics Using the Target-Decoy Database Search Strategy and Flexible Mixture ModelingChoi, Hyungwon; Ghosh, Debashis; Nesvizhskii, Alexey I.Journal of Proteome Research (2008), 7 (1), 286-292CODEN: JPROBS; ISSN:1535-3893. (American Chemical Society)Reliable statistical validation of peptide and protein identifications is a top priority in large-scale mass spectrometry based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained using database search programs such as SEQUEST, MASCOT, or X! TANDEM. The authors present two flexible methods, the variable component mixt. model and the semiparametric mixt. model, that remove the restrictive parametric assumptions in the mixt. modeling approach of PeptideProphet. Using a control protein mixt. data set generated on an linear ion trap Fourier transform (LTQ-FT) mass spectrometer, the authors demonstrate that both methods improve parametric models in terms of the accuracy of probability ests. and the power to detect correct identifications controlling the false discovery rate to the same degree. The statistical approaches presented here require that the data set contain a sufficient no. of decoy (known to be incorrect) peptide identifications, which can be obtained using the target-decoy database search strategy.
- 3Choi, H. and Nesvizhskii, A. I. J. Proteome Res. 2008, 7, 254– 2653https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BD2sXhsVOntbfM&md5=bb0ea1f340c4ba7d099f655c203f49edSemisupervised Model-Based Validation of Peptide Identifications in Mass Spectrometry-Based ProteomicsChoi, Hyungwon; Nesvizhskii, Alexey I.Journal of Proteome Research (2008), 7 (1), 254-265CODEN: JPROBS; ISSN:1535-3893. (American Chemical Society)Development of robust statistical methods for validation of peptide assignments to tandem mass (MS/MS) spectra obtained using database searching remains an important problem. PeptideProphet is one of the commonly used computational tools available for that purpose. An alternative simple approach for validation of peptide assignments is based on addn. of decoy (reversed, randomized, or shuffled) sequences to the searched protein sequence database. The probabilistic modeling approach of PeptideProphet and the decoy strategy can be combined within a single semisupervised framework, leading to improved robustness and higher accuracy of computed probabilities even in the case of most challenging data sets. The authors present a semisupervised expectation-maximization (EM) algorithm for constructing a Bayes classifier for peptide identification using the probability mixt. model, extending PeptideProphet to incorporate decoy peptide matches. Using several data sets of varying complexity, from control protein mixts. to a human plasma sample, and using three commonly used database search programs, SEQUEST, MASCOT, and TANDEM/k-score, the authors illustrate that more accurate mixt. estn. leads to an improved control of the false discovery rate in the classification of peptide assignments.
- 4. Elias, J. E. and Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
- 5. Frank, A. and Pevzner, P. Anal. Chem. 2005, 77, 964–973.
- 6. Goloborodko, A. A., Mayerhofer, C., Zubarev, A. R., Tarasova, I. A., Gorshkov, A. V., Zubarev, R. A., and Gorshkov, M. V. Empirical approach to false discovery rate estimation in shotgun proteomics. Rapid Commun. Mass Spectrom. 2010, 24, 454–462.
- 7. Higgs, R. E., Knierman, M. D., Freeman, A. B., Gelbert, L. M., Patil, S. T., and Hale, J. E. Estimating the Statistical Significance of Peptide Identifications from Shotgun Proteomics Experiments. J. Proteome Res. 2007, 6, 1758–1767.
- 8. Jiang, X., Jiang, X., Han, G., Ye, M., and Zou, H. Optimization of filtering criterion for SEQUEST database searching to improve proteome coverage in shotgun proteomics. BMC Bioinf. 2007, 8, 323.
- 9. Keller, A., Nesvizhskii, A. I., Kolker, E., and Aebersold, R. Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal. Chem. 2002, 74, 5383–5392.
- 10. Kim, S., Gupta, N., and Pevzner, P. A. Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases. J. Proteome Res. 2008, 7, 3354–3363.
- 11. Käll, L., Canterbury, J. D., Weston, J., Noble, W. S., and MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923–925.
- 12. Korn, E. L., Troendle, J. F., McShane, L. M., and Simon, R. J. Stat. Plann. Inference 2004, 124 (2), 379–398.
- 13. Maiolica, A., Cittaro, D., Borsotti, D., Sennels, L., Ciferri, C., Tarricone, C., Musacchio, A., and Rappsilber, J. Structural analysis of multiprotein complexes by cross-linking, mass spectrometry, and database searching. Mol. Cell. Proteomics 2007, 6, 2200–2211.
- 14. Pawitan, Y., Calza, S., and Ploner, A. Bioinformatics 2006, 22, 3025–3031.
- 15. Renard, B. Y., Kirchner, M., Monigatti, F., Ivanov, A. R., Rappsilber, J., Winter, D., Steen, J. A. J., Hamprecht, F. A., and Steen, H. Proteomics 2009, 9, 4979–4984.
- 16. Weatherly, D. B., Atwood, J. A., Minning, T. A., Cavola, C., Tarleton, R. L., and Orlando, R. A heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics 2005, 4, 762–772.
- 17. Young, D., Benaglia, T., Chauveau, D., Elmore, R., Hettmansperger, T., Hunter, D., Thomas, H., and Xuan, F. mixtools: Tools for analyzing finite mixture models, R package, version 0.3.2; 2008.
Supporting Information
Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.