A Comprehensive Evaluation of Consensus Spectrum Generation Methods in Proteomics
- Xiyang Luo, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, 400065 Chongqing, China
- Wout Bittremieux, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
- Johannes Griss, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, U.K.; Department of Dermatology, Medical University of Vienna, 1090 Vienna, Austria
- Eric W. Deutsch, Institute for Systems Biology (ISB), Seattle, Washington 98109, United States
- Timo Sachsenberg, Applied Bioinformatics, Department for Computer Science, University of Tuebingen, Sand 14, 72076 Tuebingen, Germany
- Lev I. Levitsky, V.L. Talrose Institute for Energy Problems of Chemical Physics, N.N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, Moscow 142432, Russia
- Mark V. Ivanov, V.L. Talrose Institute for Energy Problems of Chemical Physics, N.N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, Moscow 142432, Russia
- Julia A. Bubis, V.L. Talrose Institute for Energy Problems of Chemical Physics, N.N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, Moscow 142432, Russia
- Ralf Gabriels, VIB-UGent Center for Medical Biotechnology, B-9052 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, B-9000 Ghent, Belgium
- Henry Webel, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen DK-2200, Denmark
- Aniel Sanchez, Section for Clinical Chemistry, Department of Translational Medicine, Lund University, Skåne University Hospital Malmö, 20502 Malmö, Sweden
- Mingze Bai, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, 400065 Chongqing, China
- Lukas Käll* (Email: [email protected]), Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology − KTH, Box 1031, 17121 Solna, Sweden
- Yasset Perez-Riverol* (Email: [email protected]), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, U.K.
Abstract
Spectrum clustering is a powerful strategy to minimize redundant mass spectra by grouping them based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several algorithms for spectrum clustering have been adequately benchmarked and tested, the influence of the consensus spectrum generation step is rarely evaluated. Here, we present an implementation and benchmark of common consensus spectrum algorithms, including spectrum averaging, spectrum binning, the most similar spectrum, and the best-identified spectrum. We have analyzed diverse public data sets using two different clustering algorithms (spectra-cluster and MaRaCluster) to evaluate how the consensus spectrum generation procedure influences downstream peptide identification. The BEST and BIN methods were found to be the most reliable methods for consensus spectrum generation, including for data sets with post-translational modifications (PTMs) such as phosphorylation. All source code and data of the present study are freely available on GitHub at https://github.com/statisticalbiotechnology/representative-spectra-benchmark.
This publication is licensed under
License Summary*
You are free to share (copy and redistribute) this article in any medium or format and to adapt (remix, transform, and build upon) the material for any purpose, even commercially, within the parameters below:
Creative Commons (CC): This is a Creative Commons license.
Attribution (BY): Credit must be given to the creator.
*Disclaimer
This summary highlights only some of the key features and terms of the actual license. It is not a license and has no legal value. Carefully review the actual license before using these materials.
Introduction
Methods
Consensus Spectrum Generation Methods and Evaluation
Spectrum averaging (AVERAGE): The representative spectrum is an average of all the spectra in the cluster. (8,11,12) In this algorithm, peaks with close m/z values are merged into a single peak, and their m/z values and intensities are averaged. m/z values are averaged using the corresponding peak intensities as weights.
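The averaging step can be illustrated with a minimal Python sketch. It is not the benchmark's implementation: the function name `average_spectrum`, the gap-based grouping of neighboring peaks, and the `mz_tol` value of 0.01 are assumptions made for illustration, as is the choice to take the plain mean of the merged intensities.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def average_spectrum(spectra: List[List[Peak]], mz_tol: float = 0.01) -> List[Peak]:
    """Merge peaks with close m/z values across a cluster (illustrative sketch)."""
    # Pool the peaks of all cluster members and sort them by m/z.
    pooled = sorted(peak for spectrum in spectra for peak in spectrum)

    # Group neighboring peaks: a new group starts whenever the gap to the
    # previous peak exceeds mz_tol (hypothetical tolerance, not from the paper).
    groups: List[List[Peak]] = []
    for mz, intensity in pooled:
        if groups and mz - groups[-1][-1][0] <= mz_tol:
            groups[-1].append((mz, intensity))
        else:
            groups.append([(mz, intensity)])

    consensus: List[Peak] = []
    for group in groups:
        total_intensity = sum(intensity for _, intensity in group)
        if total_intensity <= 0:
            continue  # skip groups without signal to avoid division by zero
        # m/z is averaged using the peak intensities as weights.
        mean_mz = sum(mz * intensity for mz, intensity in group) / total_intensity
        # Assumption: intensity is the plain mean over the merged peaks.
        mean_intensity = total_intensity / len(group)
        consensus.append((mean_mz, mean_intensity))
    return consensus
```

With two spectra containing peaks at m/z 100.000 (intensity 10) and 100.004 (intensity 30), the merged peak lands closer to the more intense one, which is the effect of the intensity weighting.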
Spectrum binning (BIN): In this method, for each cluster, a consensus spectrum vector with a bin width of 0.02 m/z was first constructed. (12) For all spectra in the cluster, peak m/z and intensity values were assigned to the corresponding bin in the consensus spectrum vector. Bins that contained values from fewer than 25% of the cluster members were discarded. Next, the vector was converted to a consensus spectrum by averaging all peak m/z and intensity values per bin. (2)
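A minimal Python sketch of the binning procedure follows. The bin width of 0.02 m/z and the 25% occupancy threshold come from the description above; the function name `bin_consensus`, the use of dictionaries instead of a dense vector, and the handling of bin boundaries are assumptions for illustration only.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def bin_consensus(
    spectra: List[List[Peak]],
    bin_width: float = 0.02,
    min_fraction: float = 0.25,
) -> List[Peak]:
    """Build a binned consensus spectrum for one cluster (illustrative sketch)."""
    peaks_per_bin: Dict[int, List[Peak]] = defaultdict(list)
    members_per_bin: Dict[int, Set[int]] = defaultdict(set)

    # Assign every peak of every cluster member to its m/z bin and remember
    # which members contributed to each bin.
    for member, spectrum in enumerate(spectra):
        for mz, intensity in spectrum:
            bin_index = math.floor(mz / bin_width)
            peaks_per_bin[bin_index].append((mz, intensity))
            members_per_bin[bin_index].add(member)

    consensus: List[Peak] = []
    for bin_index in sorted(peaks_per_bin):
        # Discard bins supported by fewer than min_fraction of the members.
        if len(members_per_bin[bin_index]) < min_fraction * len(spectra):
            continue
        peaks = peaks_per_bin[bin_index]
        mean_mz = sum(mz for mz, _ in peaks) / len(peaks)
        mean_intensity = sum(intensity for _, intensity in peaks) / len(peaks)
        consensus.append((mean_mz, mean_intensity))
    return consensus
```

In a five-member cluster, a peak seen in only one spectrum occupies 20% of the members and is dropped, whereas a peak seen in four spectra is kept and averaged.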
Most similar spectrum (MOST): For each cluster, the spectrum that is on average most similar to all other cluster members was selected as the representative spectrum. (13) To find it, the dot product similarity was first calculated for every pair of spectra in the cluster. Next, the spectrum with the maximal summed dot product to all other spectra was selected as the representative for that cluster.
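The selection can be sketched in Python as follows. The text does not specify how spectra are vectorized before the dot product is taken, so the 0.02 m/z binning and the normalization to unit length used here are assumptions, as are the helper names `to_unit_vector`, `dot`, and `most_similar_index`.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def to_unit_vector(spectrum: List[Peak], bin_width: float = 0.02) -> Dict[int, float]:
    """Bin a spectrum into a sparse vector scaled to unit length (assumed preprocessing)."""
    vector: Dict[int, float] = {}
    for mz, intensity in spectrum:
        bin_index = math.floor(mz / bin_width)
        vector[bin_index] = vector.get(bin_index, 0.0) + intensity
    norm = math.sqrt(sum(value * value for value in vector.values()))
    return {b: value / norm for b, value in vector.items()} if norm > 0 else {}


def dot(u: Dict[int, float], v: Dict[int, float]) -> float:
    """Dot product of two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(value * v.get(b, 0.0) for b, value in u.items())


def most_similar_index(spectra: List[List[Peak]], bin_width: float = 0.02) -> int:
    """Return the index of the spectrum with the largest summed dot product."""
    if not spectra:
        raise ValueError("cluster is empty")
    vectors = [to_unit_vector(spectrum, bin_width) for spectrum in spectra]
    n = len(vectors)
    summed = [
        sum(dot(vectors[i], vectors[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    # Ties are resolved in favor of the first spectrum with the maximal sum.
    return max(range(n), key=lambda i: summed[i])
```

Computing all pairwise similarities is quadratic in the cluster size, so for very large clusters a real implementation would likely need a cheaper approximation.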
Best identified spectrum (BEST): For each cluster that contained at least one identified spectrum, the spectrum with the maximal peptide-spectrum match score was chosen as the representative for that cluster. Note that this approach is not valid if all spectra in the cluster are unmatched.
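As a minimal sketch, the BEST selection reduces to an arg-max over the identified cluster members. The function name `best_spectrum_index` and the use of `None` to mark unidentified spectra are assumptions for illustration. The sketch also assumes that a higher score means a better match, so a search engine that reports scores where lower is better (for example, E-values) would need the comparison reversed.

```python
from typing import List, Optional


def best_spectrum_index(scores: List[Optional[float]]) -> Optional[int]:
    """Return the index of the best-scoring identified spectrum in a cluster.

    scores[i] is the peptide-spectrum match score of spectrum i, or None if
    the spectrum was not identified. Returns None when no spectrum in the
    cluster is identified, because BEST is undefined for such clusters.
    """
    best_index: Optional[int] = None
    for index, score in enumerate(scores):
        if score is None:
            continue  # unidentified spectrum, cannot be the representative
        if best_index is None or score > scores[best_index]:
            best_index = index
    return best_index
```

Returning `None` for fully unidentified clusters makes the limitation explicit, so a caller can fall back to another method such as BIN for those clusters.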
Benchmark Datasets
project accession | instrument | no. MS/MS |
---|---|---|
PXD008355 (16) | Q Exactive | 1 477 567 |
PXD023047 (17) | Q Exactive HF | 109 333 |
PXD021518 (18) | Q Exactive HF-X | 286 410 |
PXD023361 (19) | Q Exactive | 38 286 |
The number of peptide identifications and peptide-spectrum matches can be found in the Supplementary Notes. In addition, a description of each data set can be found in its original publication and in the PRIDE Archive. (20)
Code Availability
Results
Impact of Consensus Clustering in Database Search Algorithms
Impact of Cluster Size and Quality of Peptide Identifications
methods | high-quality representative spectrum ratio | low-quality representative spectrum ratio |
---|---|---|
MaRaCluster BEST | 0.871 | 0.129 |
MaRaCluster BIN | 0.853 | 0.135 |
MaRaCluster AVERAGE | 0.831 | 0.143 |
MaRaCluster MOST | 0.842 | 0.158 |
spectra-cluster BEST | 0.776 | 0.224 |
spectra-cluster BIN | 0.779 | 0.212 |
spectra-cluster AVERAGE | 0.770 | 0.215 |
spectra-cluster MOST | 0.772 | 0.228 |
Consensus Spectrum Generation Methods for Spectra Library Search
Posttranslational Modification Site Localization of Consensus Spectra
cluster method | method | phospho PSMs | phosphosites | corroborative PSMs | divergent PSMs |
---|---|---|---|---|---|
MaRaCluster | BEST | 66 914 | 81 238 | 63 165 | 2683 |
MaRaCluster | BIN | 68 429 | 83 091 | | |
spectra-cluster | BEST | 91 195 | 109 877 | 89 161 | 1494 |
spectra-cluster | BIN | 92 202 | 111 230 | | |
We quantified the total number of phosphorylated PSMs and phosphorylation sites for each combination of clustering method and consensus generation method. In addition, for each clustering method we report the number of PSMs that were the same (corroborative PSMs) or different (divergent PSMs) per cluster when comparing the identifications of the BEST and BIN consensus spectra.
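One way such a comparison could be tallied is sketched below in Python. This is an assumption about the bookkeeping, not the procedure used in the study: identifications are represented as hypothetical strings (for example, a peptide sequence with its localized modification), keyed by a hypothetical cluster identifier, and only clusters identified by both methods are compared.

```python
from typing import Dict, Tuple


def compare_psms(best: Dict[str, str], binned: Dict[str, str]) -> Tuple[int, int]:
    """Count corroborative and divergent PSMs between two consensus methods.

    best and binned map a cluster identifier to the identification obtained
    from the BEST and BIN consensus spectrum of that cluster, respectively.
    """
    # Only clusters identified with both consensus methods can be compared.
    shared_clusters = best.keys() & binned.keys()
    corroborative = sum(1 for c in shared_clusters if best[c] == binned[c])
    divergent = len(shared_clusters) - corroborative
    return corroborative, divergent
```

Under this convention, a cluster counts as divergent whenever the two strings differ, which includes cases where the peptide sequence is the same but the phosphorylation site is localized differently.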
Conclusions
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00069.
Supplementary Note S1: Identification score as a function of cluster size; Supplementary Note S2: The data sets used in the benchmark; Supplementary Note S3: Analysis of the phosphoproteomics data set PXD008355 (PDF)
Acknowledgments
The authors would like to acknowledge the EuBIC-MS community that organized the EuBIC-MS Developer Meeting in January 2020, (29) triggering the original discussions and implementations of this work. L.K. was supported by a grant from the Swedish Research Council (Grant 2017-04030).
References
This article references 29 other publications.
- 1. Perez-Riverol, Y.; Vizcaino, J. A.; Griss, J. Future Prospects of Spectral Clustering Approaches in Proteomics. Proteomics 2018, 18 (14), e1700454, DOI: 10.1002/pmic.201700454
- 2. Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; Del-Toro, N.; Rurik, M.; Walzer, M. W.; Kohlbacher, O.; Hermjakob, H. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 2016, 13 (8), 651–656, DOI: 10.1038/nmeth.3902
- 3. Frank, A. M.; Monroe, M. E.; Shah, A. R.; Carver, J. J.; Bandeira, N.; Moore, R. J.; Anderson, G. A.; Smith, R. D.; Pevzner, P. A. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat. Methods 2011, 8 (7), 587–591, DOI: 10.1038/nmeth.1609
- 4. The, M.; Kall, L. Focus on the spectra that matter by clustering of quantification data in shotgun proteomics. Nat. Commun. 2020, 11 (1), 3234, DOI: 10.1038/s41467-020-17037-3. Griss, J.; Stanek, F.; Hudecz, O.; Durnberger, G.; Perez-Riverol, Y.; Vizcaino, J. A.; Mechtler, K. Spectral Clustering Improves Label-Free Quantification of Low-Abundant Proteins. J. Proteome Res. 2019, 18 (4), 1477–1485, DOI: 10.1021/acs.jproteome.8b00377
- 5. The, M.; Kall, L. MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. J. Proteome Res. 2016, 15 (3), 713–720, DOI: 10.1021/acs.jproteome.5b00749
- 6. Wang, L.; Li, S.; Tang, H. msCRUSH: Fast Tandem Mass Spectral Clustering Using Locality Sensitive Hashing. J. Proteome Res. 2018, 18 (1), 147–158, DOI: 10.1021/acs.jproteome.8b00448
- 7. Bittremieux, W.; Laukens, K.; Noble, W. S.; Dorrestein, P. C. Large-scale tandem mass spectrum clustering using fast nearest neighbor searching. Rapid Commun. Mass Spectrom. 2021, e9153, DOI: 10.1002/rcm.9153
- 8. Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7 (5), 655–667, DOI: 10.1002/pmic.200600625
- 9. Griss, J.; Perez-Riverol, Y.; The, M.; Kall, L.; Vizcaino, J. A. Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra". J. Proteome Res. 2018, 17 (5), 1993–1996, DOI: 10.1021/acs.jproteome.7b00824
- 10. Wang, M.; Wang, J.; Carver, J.; Pullman, B. S.; Cha, S. W.; Bandeira, N. Assembling the Community-Scale Discoverable Human Proteome. Cell Syst 2018, 7 (4), 412–421, DOI: 10.1016/j.cels.2018.08.004
- 11. Tabb, D. L.; Thompson, M. R.; Khalsa-Moyers, G.; VerBerkmoes, N. C.; McDonald, W. H. MS2Grouper: group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. J. Am. Soc. Mass Spectrom. 2005, 16 (8), 1250–1261, DOI: 10.1016/j.jasms.2005.04.010
- 12. Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building consensus spectral libraries for peptide identification in proteomics. Nat. Methods 2008, 5 (10), 873–875, DOI: 10.1038/nmeth.1254
- 13. Tabb, D. L.; MacCoss, M. J.; Wu, C. C.; Anderson, S. D.; Yates, J. R., 3rd Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility. Anal. Chem. 2003, 75 (10), 2470–2477, DOI: 10.1021/ac026424o. Frewen, B. E.; Merrihew, G. E.; Wu, C. C.; Noble, W. S.; MacCoss, M. J. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Anal. Chem. 2006, 78 (16), 5678–5684, DOI: 10.1021/ac060279n
- 14 Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277, DOI: 10.1038/ncomms6277
- 15 Hulstaert, N.; Shofstahl, J.; Sachsenberg, T.; Walzer, M.; Barsnes, H.; Martens, L.; Perez-Riverol, Y. ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. J. Proteome Res. 2020, 19 (1), 537–542, DOI: 10.1021/acs.jproteome.9b00328
- 16 Van Leene, J.; Han, C.; Gadeyne, A.; Eeckhout, D.; Matthijs, C.; Cannoot, B.; De Winne, N.; Persiau, G.; Van De Slijke, E.; Van de Cotte, B. Capturing the phosphorylation and protein interaction landscape of the plant TOR kinase. Nat. Plants 2019, 5 (3), 316–327, DOI: 10.1038/s41477-019-0378-z
- 17 Doner, N. M.; Seay, D.; Mehling, M.; Sun, S.; Gidda, S. K.; Schmitt, K.; Braus, G. H.; Ischebeck, T.; Chapman, K. D.; Dyer, J. M. Arabidopsis thaliana EARLY RESPONSIVE TO DEHYDRATION 7 Localizes to Lipid Droplets via Its Senescence Domain. Front Plant Sci. 2021, 12, 658961, DOI: 10.3389/fpls.2021.658961
- 18 Pipitone, R.; Eicke, S.; Pfister, B.; Glauser, G.; Falconet, D.; Uwizeye, C.; Pralon, T.; Zeeman, S. C.; Kessler, F.; Demarsy, E. A multifaceted analysis reveals two distinct phases of chloroplast biogenesis during de-etiolation in Arabidopsis. eLife 2021, DOI: 10.7554/eLife.62709
- 19 Osman, S.; Mohammad, E.; Lidschreiber, M.; Stuetzer, A.; Bazso, F. L.; Maier, K. C.; Urlaub, H.; Cramer, P. The Cdk8 kinase module regulates interaction of the mediator complex with RNA polymerase II. J. Biol. Chem. 2021, 296, 100734, DOI: 10.1016/j.jbc.2021.100734
- 20 Perez-Riverol, Y.; Bai, J.; Bandla, C.; Garcia-Seisdedos, D.; Hewapathirana, S.; Kamatchinathan, S.; Kundu, D. J.; Prakash, A.; Frericks-Zipper, A.; Eisenacher, M. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022, 50 (D1), D543–D552, DOI: 10.1093/nar/gkab1038
- 21 Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9 (3), 90–95, DOI: 10.1109/MCSE.2007.55
- 22 Lam, S. K.; Pitrou, A.; Seibert, S. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC; Austin, TX, 2015.
- 23 Harris, C. R.; Millman, K. J.; van der Walt, S. J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N. J. Array programming with NumPy. Nature 2020, 585 (7825), 357–362, DOI: 10.1038/s41586-020-2649-2
- 24 Rost, H. L.; Schmitt, U.; Aebersold, R.; Malmstrom, L. pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 2014, 14 (1), 74–77, DOI: 10.1002/pmic.201300246
- 25 Levitsky, L. I.; Klein, J. A.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J. Proteome Res. 2019, 18 (2), 709–714, DOI: 10.1021/acs.jproteome.8b00717
- 26 Bittremieux, W. spectrum_utils: A Python Package for Mass Spectrometry Data Processing and Visualization. Anal. Chem. 2020, 92 (1), 659–661, DOI: 10.1021/acs.analchem.9b04884
- 27 Deutsch, E. W.; Perez-Riverol, Y.; Carver, J.; Kawano, S.; Mendoza, L.; Van Den Bossche, T.; Gabriels, R.; Binz, P. A.; Pullman, B.; Sun, Z. Universal Spectrum Identifier for mass spectra. Nat. Methods 2021, 18 (7), 768–770, DOI: 10.1038/s41592-021-01184-6
- 28 Choi, M.; Carver, J.; Chiva, C.; Tzouros, M.; Huang, T.; Tsai, T. H.; Pullman, B.; Bernhardt, O. M.; Huttenhain, R.; Teo, G. C. MassIVE.quant: a community resource of quantitative mass spectrometry-based proteomics datasets. Nat. Methods 2020, 17 (10), 981–984, DOI: 10.1038/s41592-020-0955-0
- 29 Ashwood, C.; Bittremieux, W.; Deutsch, E. W.; Doncheva, N. T.; Dorfer, V.; Gabriels, R.; Gorshkov, V.; Gupta, S.; Jones, A. R.; Käll, L. Proceedings of the EuBIC-MS 2020 Developers’ Meeting. EuPA Open Proteomics 2020, 24, 1–6, DOI: 10.1016/j.euprot.2020.11.001
References
This article references 29 other publications.
- 1Perez-Riverol, Y.; Vizcaino, J. A.; Griss, J. Future Prospects of Spectral Clustering Approaches in Proteomics. Proteomics 2018, 18 (14), e1700454 DOI: 10.1002/pmic.201700454There is no corresponding record for this reference.
- 2Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; Del-Toro, N.; Rurik, M.; Walzer, M. W.; Kohlbacher, O.; Hermjakob, H. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 2016, 13 (8), 651– 656, DOI: 10.1038/nmeth.39022https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BC28XhtVKntbnO&md5=ebece175e00762a1c26ae12ce1afde75Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasetsGriss, Johannes; Perez-Riverol, Yasset; Lewis, Steve; Tabb, David L.; Dianes, Jose A.; del-Toro, Noemi; Rurik, Marc; Walzer, Mathias; Kohlbacher, Oliver; Hermjakob, Henning; Wang, Rui; Vizcaino, Juan AntonioNature Methods (2016), 13 (8), 651-656CODEN: NMAEA3; ISSN:1548-7091. (Nature Publishing Group)Mass spectrometry (MS) is the main technol. used in proteomics approaches. However, on av., 75% of spectra analyzed in an MS expt. remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The Proteomics Identifications (PRIDE) Database Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in the PRIDE Archive, coming from hundreds of data sets, we were able to consistently characterize spectra into three distinct groups: (1) incorrectly identified, (2) correctly identified but below the set scoring threshold, and (3) truly unidentified. Using multiple complementary anal. approaches, we were able to identify ~ 20% of the consistently unidentified spectra. The complete spectrum-clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster).</a></a></a></a></a></a></a></a></a></a></a></a></a></a></a></a>. 
This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
- 3Frank, A. M.; Monroe, M. E.; Shah, A. R.; Carver, J. J.; Bandeira, N.; Moore, R. J.; Anderson, G. A.; Smith, R. D.; Pevzner, P. A. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat. Methods 2011, 8 (7), 587– 591, DOI: 10.1038/nmeth.16093https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BC3MXmtV2hs78%253D&md5=7ff219a2a958756a0b77b64f0421b5d2Spectral archives: extending spectral libraries to analyze both identified and unidentified spectraFrank, Ari M.; Monroe, Matthew E.; Shah, Anuj R.; Carver, Jeremy J.; Bandeira, Nuno; Moore, Ronald J.; Anderson, Gordon A.; Smith, Richard D.; Pevzner, Pavel A.Nature Methods (2011), 8 (7), 587-591CODEN: NMAEA3; ISSN:1548-7091. (Nature Publishing Group)Tandem mass spectrometry (MS/MS) expts. yield multiple, nearly identical spectra of the same peptide in various labs., but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode spectra they generate. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing both identified and unidentified spectra in the same way and maintaining information about peptide spectra that are common across species and conditions. Thus archives offer both traditional library spectrum similarity-based search capabilities along with new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from ∼1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data should be organized into spectral archives rather than be analyzed as disparate datasets, as is mostly the case today.
- 4. The, M.; Kall, L. Focus on the spectra that matter by clustering of quantification data in shotgun proteomics. Nat. Commun. 2020, 11 (1), 3234, DOI: 10.1038/s41467-020-17037-3
  Griss, J.; Stanek, F.; Hudecz, O.; Durnberger, G.; Perez-Riverol, Y.; Vizcaino, J. A.; Mechtler, K. Spectral Clustering Improves Label-Free Quantification of Low-Abundant Proteins. J. Proteome Res. 2019, 18 (4), 1477–1485, DOI: 10.1021/acs.jproteome.8b00377
- 5. The, M.; Kall, L. MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. J. Proteome Res. 2016, 15 (3), 713–720, DOI: 10.1021/acs.jproteome.5b00749
- 6. Wang, L.; Li, S.; Tang, H. msCRUSH: Fast Tandem Mass Spectral Clustering Using Locality Sensitive Hashing. J. Proteome Res. 2018, 18 (1), 147–158, DOI: 10.1021/acs.jproteome.8b00448
- 7. Bittremieux, W.; Laukens, K.; Noble, W. S.; Dorrestein, P. C. Large-scale tandem mass spectrum clustering using fast nearest neighbor searching. Rapid Commun. Mass Spectrom. 2021, e9153, DOI: 10.1002/rcm.9153
- 8. Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7 (5), 655–667, DOI: 10.1002/pmic.200600625
- 9. Griss, J.; Perez-Riverol, Y.; The, M.; Kall, L.; Vizcaino, J. A. Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra". J. Proteome Res. 2018, 17 (5), 1993–1996, DOI: 10.1021/acs.jproteome.7b00824
- 10. Wang, M.; Wang, J.; Carver, J.; Pullman, B. S.; Cha, S. W.; Bandeira, N. Assembling the Community-Scale Discoverable Human Proteome. Cell Syst 2018, 7 (4), 412–421, DOI: 10.1016/j.cels.2018.08.004
- 11. Tabb, D. L.; Thompson, M. R.; Khalsa-Moyers, G.; VerBerkmoes, N. C.; McDonald, W. H. MS2Grouper: group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. J. Am. Soc. Mass Spectrom. 2005, 16 (8), 1250–1261, DOI: 10.1016/j.jasms.2005.04.010
- 12. Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building consensus spectral libraries for peptide identification in proteomics. Nat. Methods 2008, 5 (10), 873–875, DOI: 10.1038/nmeth.1254
- 13. Tabb, D. L.; MacCoss, M. J.; Wu, C. C.; Anderson, S. D.; Yates, J. R., 3rd Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility. Anal. Chem. 2003, 75 (10), 2470–2477, DOI: 10.1021/ac026424o
  Frewen, B. E.; Merrihew, G. E.; Wu, C. C.; Noble, W. S.; MacCoss, M. J. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Anal. Chem. 2006, 78 (16), 5678–5684, DOI: 10.1021/ac060279n
- 14. Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277, DOI: 10.1038/ncomms6277
- 15. Hulstaert, N.; Shofstahl, J.; Sachsenberg, T.; Walzer, M.; Barsnes, H.; Martens, L.; Perez-Riverol, Y. ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. J. Proteome Res. 2020, 19 (1), 537–542, DOI: 10.1021/acs.jproteome.9b00328
- 16. Van Leene, J.; Han, C.; Gadeyne, A.; Eeckhout, D.; Matthijs, C.; Cannoot, B.; De Winne, N.; Persiau, G.; Van De Slijke, E.; Van de Cotte, B. Capturing the phosphorylation and protein interaction landscape of the plant TOR kinase. Nat. Plants 2019, 5 (3), 316–327, DOI: 10.1038/s41477-019-0378-z
- 17. Doner, N. M.; Seay, D.; Mehling, M.; Sun, S.; Gidda, S. K.; Schmitt, K.; Braus, G. H.; Ischebeck, T.; Chapman, K. D.; Dyer, J. M. Arabidopsis thaliana EARLY RESPONSIVE TO DEHYDRATION 7 Localizes to Lipid Droplets via Its Senescence Domain. Front Plant Sci. 2021, 12, 658961, DOI: 10.3389/fpls.2021.658961
- 18. Pipitone, R.; Eicke, S.; Pfister, B.; Glauser, G.; Falconet, D.; Uwizeye, C.; Pralon, T.; Zeeman, S. C.; Kessler, F.; Demarsy, E. A multifaceted analysis reveals two distinct phases of chloroplast biogenesis during de-etiolation in Arabidopsis. eLife 2021, DOI: 10.7554/eLife.62709
- 19. Osman, S.; Mohammad, E.; Lidschreiber, M.; Stuetzer, A.; Bazso, F. L.; Maier, K. C.; Urlaub, H.; Cramer, P. The Cdk8 kinase module regulates interaction of the mediator complex with RNA polymerase II. J. Biol. Chem. 2021, 296, 100734, DOI: 10.1016/j.jbc.2021.100734
- 20. Perez-Riverol, Y.; Bai, J.; Bandla, C.; Garcia-Seisdedos, D.; Hewapathirana, S.; Kamatchinathan, S.; Kundu, D. J.; Prakash, A.; Frericks-Zipper, A.; Eisenacher, M. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022, 50 (D1), D543–D552, DOI: 10.1093/nar/gkab1038
- 21. Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9 (3), 90–95, DOI: 10.1109/MCSE.2007.55
- 22. Lam, S. K.; Pitrou, A.; Seibert, S. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC; Austin, TX, 2015.
- 23. Harris, C. R.; Millman, K. J.; van der Walt, S. J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N. J. Array programming with NumPy. Nature 2020, 585 (7825), 357–362, DOI: 10.1038/s41586-020-2649-2
- 24. Rost, H. L.; Schmitt, U.; Aebersold, R.; Malmstrom, L. pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library. Proteomics 2014, 14 (1), 74–77, DOI: 10.1002/pmic.201300246
- 25. Levitsky, L. I.; Klein, J. A.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J. Proteome Res. 2019, 18 (2), 709–714, DOI: 10.1021/acs.jproteome.8b00717
- 26. Bittremieux, W. spectrum_utils: A Python Package for Mass Spectrometry Data Processing and Visualization. Anal. Chem. 2020, 92 (1), 659–661, DOI: 10.1021/acs.analchem.9b04884
- 27. Deutsch, E. W.; Perez-Riverol, Y.; Carver, J.; Kawano, S.; Mendoza, L.; Van Den Bossche, T.; Gabriels, R.; Binz, P. A.; Pullman, B.; Sun, Z.; et al. Universal Spectrum Identifier for mass spectra. Nat. Methods 2021, 18 (7), 768–770, DOI: 10.1038/s41592-021-01184-6
- 28. Choi, M.; Carver, J.; Chiva, C.; Tzouros, M.; Huang, T.; Tsai, T. H.; Pullman, B.; Bernhardt, O. M.; Huttenhain, R.; Teo, G. C.; et al. MassIVE.quant: a community resource of quantitative mass spectrometry-based proteomics datasets. Nat. Methods 2020, 17 (10), 981–984, DOI: 10.1038/s41592-020-0955-0
- 29. Ashwood, C.; Bittremieux, W.; Deutsch, E. W.; Doncheva, N. T.; Dorfer, V.; Gabriels, R.; Gorshkov, V.; Gupta, S.; Jones, A. R.; Käll, L. Proceedings of the EuBIC-MS 2020 Developers’ Meeting. EuPA Open Proteomics 2020, 24, 1–6, DOI: 10.1016/j.euprot.2020.11.001
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00069.
Supplementary Note S1: Identification score as a function of cluster size; Supplementary Note S2: The data sets used in the benchmark; Supplementary Note S3: Analysis of the phosphoproteomics data set PXD008355 (PDF)