DIMA: Data-Driven Selection of an Imputation Algorithm
- Janine Egert* (Email: [email protected]): Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
- Eva Brombacher: Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; Spemann Graduate School of Biology and Medicine (SGBM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
- Bettina Warscheid: Biochemistry and Functional Proteomics, Institute of Biology II, Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
- Clemens Kreutz* (Email: [email protected]): Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Center for Data Analysis and Modeling (FDM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
Abstract

Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of an imputation algorithm (DIMA). The performance and broad applicability of DIMA are demonstrated on 142 quantitative proteomics data sets from the PRoteomics IDEntifications (PRIDE) database and on simulated data consisting of 5–50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases. The DIMA implementation is available in MATLAB at github.com/kreutz-lab/OmicsData and in R at github.com/kreutz-lab/DIMAR.
This publication is licensed under
License Summary*
You are free to share (copy and redistribute) this article in any medium or format within the parameters below:
Creative Commons (CC): This is a Creative Commons license.
Attribution (BY): Credit must be given to the creator.
Non-Commercial (NC): Only non-commercial uses of the work are permitted.
No Derivatives (ND): Derivative works may be created for non-commercial purposes, but sharing is prohibited.
*Disclaimer
This summary highlights only some of the key features and terms of the actual license. It is not a license and has no legal value. Carefully review the actual license before using these materials.
1. Introduction
2. Methods
2.1. Illustration Data
Figure 1. DIMA analysis pipeline illustrated on an LC-MS/MS data set. (24) The data are sorted from top to bottom according to the frequency of MVs and the mean intensity of the proteins. Likewise, the reference data R is sorted according to the mean protein intensities after considering pattern PR of step 3. (1) The pattern PO of MVs is learned by logistic regression using the protein and sample as factorial predictors and the mean protein intensity as a continuous predictor. (2) A reference data set R with few MVs is defined. (3) Various patterns PR of MVs are generated by the logistic regression model, and the respective coefficients of step 1 are incorporated into the reference data R. (4) Boxplots of the absolute imputation errors for multiple imputation algorithms. The circle indicates the median imputation deviation. The algorithms are ranked by their overall root mean square error (RMSE, red diamond). The algorithms can be divided into well-performing algorithms with an RMSE < 0.5 (green), medium-performing algorithms with 0.5 < RMSE < 3 (yellow), and poorly performing algorithms with RMSE > 3 (red). (5) The best-performing imputation algorithm on R (in this example impSeqRob) is recommended for the original data O, and imputation of O is conducted.
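The five steps in the caption can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' MATLAB or R implementation: the logistic coefficients `b0` and `b1` are invented rather than fitted to an original data set O, the two imputation functions are simple stand-ins for the algorithms DIMA actually compares, the mask always keeps one observed value per protein, and the ranking averages the RMSE over patterns instead of computing one overall RMSE.

```python
import math
import random
import statistics

def missing_prob(mean_intensity, b0=4.0, b1=-1.0):
    # Logistic curve: low mean intensity -> high probability of a missing value.
    # b0 and b1 are invented here; DIMA estimates such coefficients from O (step 1).
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * mean_intensity)))

def draw_pattern(ref, rng):
    # Step 3: draw one missing-value mask (True = missing) for the reference data R.
    mask = []
    for row in ref:
        p = missing_prob(statistics.fmean(row))
        row_mask = [rng.random() < p for _ in row]
        if all(row_mask):  # simplification: keep one observed value per protein
            row_mask[rng.randrange(len(row))] = False
        mask.append(row_mask)
    return mask

def apply_mask(ref, mask):
    return [[None if m else v for v, m in zip(r, mr)] for r, mr in zip(ref, mask)]

def impute_row_mean(data):
    # Toy algorithm: replace each MV by the mean of the protein's observed values.
    out = []
    for row in data:
        fill = statistics.fmean([v for v in row if v is not None])
        out.append([fill if v is None else v for v in row])
    return out

def impute_global_min(data):
    # Toy algorithm: replace each MV by the smallest observed value overall.
    lowest = min(v for row in data for v in row if v is not None)
    return [[lowest if v is None else v for v in row] for row in data]

def rmse(ref, imputed, mask):
    # Error is measured only on the entries that were masked.
    sq = [(r - i) ** 2
          for rr, ir, mr in zip(ref, imputed, mask)
          for r, i, m in zip(rr, ir, mr) if m]
    return math.sqrt(statistics.fmean(sq)) if sq else 0.0

def rank_algorithms(ref, algorithms, n_patterns=5, seed=0):
    # Step 4: average RMSE over several patterns, sorted from best to worst.
    rng = random.Random(seed)
    totals = dict.fromkeys(algorithms, 0.0)
    for _ in range(n_patterns):
        mask = draw_pattern(ref, rng)
        masked = apply_mask(ref, mask)
        for name, impute in algorithms.items():
            totals[name] += rmse(ref, impute(masked), mask)
    return sorted((total / n_patterns, name) for name, total in totals.items())

# Step 2 stand-in: a small complete reference matrix (proteins x samples).
reference = [
    [2.0, 2.2, 1.9, 2.1],
    [4.1, 3.9, 4.0, 4.2],
    [6.0, 6.1, 5.9, 6.2],
    [8.0, 7.9, 8.1, 8.2],
]
algorithms = {"row_mean": impute_row_mean, "global_min": impute_global_min}
ranking = rank_algorithms(reference, algorithms)
best_rmse, best_name = ranking[0]
print(f"recommended: {best_name} (mean RMSE {best_rmse:.3f})")  # step 5
```

The point of the sketch is that the true values of R are known, so the error of every candidate algorithm can be measured directly on the artificially removed entries.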
2.2. PRIDE Data
2.3. Simulation Study


2.4. DIMA

2.4.1. Learn Pattern of Missing Values

2.4.2. Generation of the Reference Data
2.4.3. Imputation Algorithms
2.4.4. Ranking of Imputation Algorithms


3. Implementation
3.1. MATLAB Implementation
3.2. R Implementation
4. Results
4.1. Illustration Data

The eight best-performing and the two least-performing imputation algorithms on the illustration data set are shown. In green, the best-performing algorithms for the respective criterion are highlighted with decreasing transparency. The algorithm selection by DIMA depending on the ranking criterion is highlighted in bold.
4.2. PRIDE Data
Figure 2. DIMA is applied and evaluated on 142 PRIDE data sets. (A) Nine algorithms compete for being recommended as the best-performing algorithm. The R package rrcovNA with its algorithms impSeqRob (47%) and impSeq (25%) is selected most frequently, followed by missForest (13%) and imputePCA (10%). For 5% of the PRIDE data sets, another algorithm is suggested. (B) The rank of the imputation algorithms obtained in the 142 PRIDE data sets is shown as a box plot. The seven algorithms with the lowest median rank are also the seven most frequently selected algorithms by DIMA (A). The algorithms with a median rank lower than 5% are highlighted in green, and algorithms with a median rank greater than 20 are highlighted in red.
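The two summaries in this figure, how often an algorithm is selected (A) and its median rank across data sets (B), can be computed as in the following minimal sketch. The algorithm names `algo_a`, `algo_b`, `algo_c` and all rank values are invented for illustration and are not results from the 142 PRIDE data sets.

```python
from collections import Counter
from statistics import median

def summarize_ranks(ranks_per_dataset):
    # ranks_per_dataset: list of dicts mapping algorithm name -> rank (1 = best).
    n = len(ranks_per_dataset)
    # Panel A analogue: share of data sets in which each algorithm ranks first.
    winners = Counter(min(ranks, key=ranks.get) for ranks in ranks_per_dataset)
    selection_share = {name: count / n for name, count in winners.items()}
    # Panel B analogue: median rank of each algorithm over all data sets.
    names = ranks_per_dataset[0].keys()
    median_rank = {name: median(r[name] for r in ranks_per_dataset) for name in names}
    return selection_share, median_rank

# Invented ranks for three hypothetical algorithms on four hypothetical data sets.
toy_ranks = [
    {"algo_a": 1, "algo_b": 2, "algo_c": 3},
    {"algo_a": 1, "algo_b": 3, "algo_c": 2},
    {"algo_a": 2, "algo_b": 1, "algo_c": 3},
    {"algo_a": 1, "algo_b": 2, "algo_c": 3},
]
selection_share, median_rank = summarize_ranks(toy_ranks)
print(selection_share)
print(median_rank)
```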
4.3. Simulation Study

Figure 3. Performance of DIMA is evaluated on simulated data S with the incorporation of various proportions of MVs and MNAR/MCAR ratios. The RMSE (color-coded) and rank (first entry) obtained by the best-performing imputation algorithm recommended by DIMA (second entry) are compared to direct imputation assessment (third entry) over 500 data simulations. The algorithm recommended by DIMA is within the top three out of 27 approaches in all cases. For MV < 20% (A), the additive regression aregImpute with type regression (reg) performs best; between 20 and 30% MVs (B), several algorithms compete against each other; and for MV > 30% (C), the random forest algorithm missForest performs best.
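To make the two axes of this simulation concrete, the following sketch removes a given fraction of values from a complete matrix and splits the removals between MNAR and MCAR. It is an assumption-laden illustration, not the authors' simulation procedure: MNAR is modeled here as strict left-censoring of the lowest intensities, MCAR as uniform random removal, and the function name and parameters are hypothetical.

```python
import random

def add_missing_values(data, mv_fraction, mnar_share, seed=0):
    # data: complete matrix (list of rows); returns a copy with None for MVs.
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    n_total = round(mv_fraction * len(cells))
    n_mnar = round(mnar_share * n_total)
    n_mcar = n_total - n_mnar
    # MNAR (assumed as left-censoring): drop the lowest-intensity cells.
    by_intensity = sorted(cells, key=lambda c: data[c[0]][c[1]])
    missing = set(by_intensity[:n_mnar])
    # MCAR: drop further cells uniformly at random among those still observed.
    remaining = [c for c in cells if c not in missing]
    missing.update(rng.sample(remaining, n_mcar))
    return [[None if (i, j) in missing else v for j, v in enumerate(row)]
            for i, row in enumerate(data)]

# Toy complete matrix: 10 x 10 with distinct, increasing intensities.
complete = [[10 * i + j + 0.5 for j in range(10)] for i in range(10)]
simulated = add_missing_values(complete, mv_fraction=0.2, mnar_share=0.5)
n_missing = sum(v is None for row in simulated for v in row)
print(f"{n_missing} of 100 values removed")
```

With `mv_fraction=0.2` and `mnar_share=0.5` on 100 cells, ten values are censored from the low end and ten more are removed at random.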
5. Discussion
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00119.
Sigmoidal decrease of missing values for higher protein intensities (Figure S1); missing value distribution per sample and per protein (Figure S2); distribution of the estimated logistic regression coefficients (Figure S3); DIMA analysis at the peptide level (Figure S4); density plot of the imputed compared to the original data values (Figure S5); principal component analysis before and after imputation (Figure S6); DIMA Implementation (Figure S7); characteristics of the 30 applied imputation algorithms (Table S1); and characteristics of the PRIDE data sets assessed with DIMA (Table S2) (PDF)
Acknowledgments
This work was supported by the Federal Ministry of Education and Research of Germany [EA:Sys,FKZ031L0080 to J.E. and C.K.]; the Deutsche Forschungsgemeinschaft (German Research Foundation) [CIBSS-EXC-2189-2100249960-390939984 to E.B., B.W., and C.K., Project-ID 403222702278002225/SFB 1381 to B.W., FOR 2743 to B.W., TRR 130 to B.W.], and the European Research Council H2020 [648235 to B.W., Marie Sklodowska Curie grant 812968 to B.W.]. The authors acknowledge support from the state of Baden-Württemberg through bwHPC and the Deutsche Forschungsgemeinschaft through grant INST 35/1134-1 FUGG. The authors gratefully thank Lena Reimann, Wignand Mühlhäuser, and Friedel Drepper for fruitful discussions on the topic.
References
This article references 38 other publications.
- 1. McGurk, K. A.; et al. The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination. Bioinformatics 2020, 36, 2217–2223, DOI: 10.1093/bioinformatics/btz898
- 2. Poulos, R. C.; et al. Strategies to enable large-scale proteomics for reproducible research. Nat. Commun. 2020, 11, 3793, DOI: 10.1038/s41467-020-17641-3
- 3. Brenes, A.; Hukelmann, J.; Bensaddek, D.; Lamond, A. I. Multibatch TMT Reveals False Positives, Batch Effects and Missing Values. Mol. Cell. Proteomics 2019, 18, 1967–1980, DOI: 10.1074/mcp.RA119.001472
- 4. Wei, R.; Wang, J.; Su, M.; Jia, E.; Chen, S.; Chen, T.; Ni, Y. Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data. Sci. Rep. 2018, 8, 663, DOI: 10.1038/s41598-017-19120-0
- 5. Webb-Robertson, B.-J. M.; Wiberg, H. K.; Matzke, M. M.; Brown, J. N.; Wang, J.; McDermott, J. E.; Smith, R. D.; Rodland, K. D.; Metz, T. O.; Pounds, J. G.; Waters, K. M. Review, Evaluation, and Discussion of the Challenges of Missing Value Imputation for Mass Spectrometry-Based Label-Free Global Proteomics. J. Proteome Res. 2015, 14, 1993–2001, DOI: 10.1021/pr501138h
- 6. Lazar, C.; Gatto, L.; Ferro, M.; Bruley, C.; Burger, T. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. J. Proteome Res. 2016, 15, 1116–1125, DOI: 10.1021/acs.jproteome.5b00981
- 7. Rubin, D. B. Inference and missing data. Biometrika 1976, 63, 581–592, DOI: 10.1093/biomet/63.3.581
- 8. Karpievitch, Y. V.; Dabney, A. R.; Smith, R. D. Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinf. 2012, 13, S5, DOI: 10.1186/1471-2105-13-S16-S5
- 9. Välikangas, T.; Suomi, T.; Elo, L. L. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation. Briefings Bioinf. 2017, 19, 1344–1355, DOI: 10.1093/bib/bbx054
- 10. Wang, J.; Li, L.; Chen, T.; Ma, J.; Zhu, Y.; Zhuang, J.; Chang, C. In-depth method assessments of differentially expressed protein detection for shotgun proteomics data with missing values. Sci. Rep. 2017, 7, 3367, DOI: 10.1038/s41598-017-03650-8
- 11. Janssen, K. J.; Donders, A. R. T.; Harrell, F. E.; Vergouwe, Y.; Chen, Q.; Grobbee, D. E.; Moons, K. G. Missing covariate data in medical research: To impute is better than to ignore. J. Clin. Epidemiol. 2010, 63, 721–727, DOI: 10.1016/j.jclinepi.2009.12.008
- 12. Brock, G. N.; Shaffer, J. R.; Blakesley, R. E.; Lotz, M. J.; Tseng, G. C. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinf. 2008, 9, 12, DOI: 10.1186/1471-2105-9-12
- 13 To, K. T.; Fry, R. C.; Reif, D. M. Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPi. BioData Min. 2018, 11, 10, DOI: 10.1186/s13040-018-0169-5
- 14 Poyatos, R.; Sus, O.; Badiella, L.; Mencuccini, M.; Martinez-Vilalta, J. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information. Biogeosciences 2018, 15, 2601–2617, DOI: 10.5194/bg-15-2601-2018
- 15 Lenz, M.; Schulz, A.; Koeck, T.; Rapp, S.; Nagler, M.; Sauer, M.; Eggebrecht, L.; Cate, V. T.; Panova-Noeva, M.; Prochaska, J. H.; Lackner, K. J.; Münzel, T.; Leineweber, K.; Wild, P. S.; Andrade-Navarro, M. A. Missing value imputation in proximity extension assay-based targeted proteomics data. PLoS One 2020, 15, e0243487, DOI: 10.1371/journal.pone.0243487
- 16 Bramer, L. M.; Irvahn, J.; Piehowski, P. D.; Rodland, K. D.; Webb-Robertson, B.-J. M. A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun Proteomics. J. Proteome Res. 2021, 20, 1–13, DOI: 10.1021/acs.jproteome.0c00123
- 17 de Souto, M. C. P.; Jaskowiak, P. A.; Costa, I. G. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinf. 2015, 16, 64, DOI: 10.1186/s12859-015-0494-3
- 18 Liu, M.; Dongre, A. Proper imputation of missing values in proteomics datasets for differential expression analysis. Briefings Bioinf. 2020, 0, bbaa112, DOI: 10.1093/bib/bbaa112
- 19 Rodwell, L.; Lee, K. J.; Romaniuk, H.; Carlin, J. B. Comparison of methods for imputing limited-range variables: a simulation study. BMC Med. Res. Methodol. 2014, 14, 57, DOI: 10.1186/1471-2288-14-57
- 20 Kruttika, D.; Simion, K.; R, J. M.; J, P. S. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Datasets. bioRxiv 2020, 1–39, DOI: 10.1101/2019.12.11.123456
- 21 Jin, L.; Bi, Y.; Hu, C.; Qu, J.; Shen, S.; Wang, X.; Tian, Y. A comparative study of evaluating missing value imputation methods in label-free proteomics. Sci. Rep. 2021, 11, 1760, DOI: 10.1038/s41598-021-81279-4
- 22 Audoux, J.; Salson, M.; Grosset, C. F.; Beaumeunier, S.; Holder, J.-M.; Commes, T.; Philippe, N. SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines. BMC Bioinf. 2017, 18, 428, DOI: 10.1186/s12859-017-1831-5
- 23 Wang, S.; Li, W.; Hu, L.; Cheng, J.; Yang, H.; Liu, Y. NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses. Nucleic Acids Res. 2020, 48, e83, DOI: 10.1093/nar/gkaa498
- 24 Rose, M.; Duhamel, M.; Aboulouard, S.; Kobeissy, F.; Rhun, E. L.; Desmons, A.; Tierny, D.; Fournier, I.; Rodet, F.; Salzet, M. The Role of a Proprotein Convertase Inhibitor in Reactivation of Tumor-Associated Macrophages and Inhibition of Glioma Growth. Mol. Ther.--Oncolytics 2020, 17, 31–46, DOI: 10.1016/j.omto.2020.03.005
- 25 Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367–1372, DOI: 10.1038/nbt.1511
- 26 O’Brien, J. J.; Gunawardena, H. P.; Paulo, J. A.; Chen, X.; Ibrahim, J. G.; Gygi, S. P.; Qaqish, B. F. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann. Appl. Stat. 2018, 12, 2075–2095, DOI: 10.1214/18-AOAS1144
- 27 Marquardt, D. W. Comment - You should standardize the predictor variables in your regression models. J. Am. Stat. Assoc. 1980, 75, 87–91, DOI: 10.1080/01621459.1980.10477430
- 28 Menard, S. Standards for standardized logistic regression coefficients. Soc. Forces 2011, 89, 1409–1428, DOI: 10.1093/sf/89.4.1409
- 29 Kreutz, C. New Concepts for Evaluating the Performance of Computational Methods. IFAC-PapersOnLine 2016, 49, 63–70, DOI: 10.1016/j.ifacol.2016.12.104
- 30 MATLAB. 9.8.0.1538580 (R2020a); The MathWorks Inc.: Natick, Massachusetts, 2020.
- 31 R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2021.
- 32 Stekhoven, D. J.; Bühlmann, P. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118, DOI: 10.1093/bioinformatics/btr597
- 33 Josse, J.; Husson, F. Handling missing values in exploratory multivariate data analysis methods. J. Soc. Fr. Stat. 2012, 153, 79–99
- 34 Khoonsari, P. E.; Häggmark, A.; Lönnberg, M.; Mikus, M.; Kilander, L.; Lannfelt, L.; Bergquist, J.; Ingelsson, M.; Nilsson, P.; Kultima, K.; Shevchenko, G. Analysis of the Cerebrospinal Fluid Proteome in Alzheimer’s Disease. PLoS One 2016, 11, e0150672, DOI: 10.1371/journal.pone.0150672
- 35. Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinf. 2008, 9, 163. DOI: 10.1186/1471-2105-9-163
- 36. Pursiheimo, A.; Vehmas, A. P.; Afzal, S.; Suomi, T.; Chand, T.; Strauss, L.; Poutanen, M.; Rokka, A.; Corthals, G. L.; Elo, L. L. Optimization of Statistical Methods Impact on Quantitative Proteomics Data. J. Proteome Res. 2015, 14, 4118–4126. DOI: 10.1021/acs.jproteome.5b00183
- 37. Govaert, E.; Van Steendam, K.; Scheerlinck, E.; Vossaert, L.; Meert, P.; Stella, M.; Willems, S.; De Clerck, L.; Dhaenens, M.; Deforce, D. Extracting histones for the specific purpose of label-free MS. Proteomics 2016, 16, 2937–2944. DOI: 10.1002/pmic.201600341
- 38. Calf, O. W.; van Dam, N. M.; Weinhold, A.; Huber, H.; Peters, J. L. MTBLS738: Glycoalkaloid composition explains variation in slug resistance in Solanum dulcamara. Oecologia 2018, 187, 495–506. DOI: 10.1007/s00442-018-4064-z
Cited By
This article is cited by 3 publications.
- Yannis Schumann, Julia E. Neumann, Philipp Neumann. Robust classification using average correlations as features (ACF). BMC Bioinformatics 2023, 24 (1). https://doi.org/10.1186/s12859-023-05224-0
- Janine Egert, Clemens Kreutz. Rcall: An R interface for MATLAB. SoftwareX 2023, 21, 101276. https://doi.org/10.1016/j.softx.2022.101276
- Christophe Vanderaa, Laurent Gatto. Replication of single-cell proteomics data reveals important computational challenges. Expert Review of Proteomics 2021, 18 (10), 835-843. https://doi.org/10.1080/14789450.2021.1988571
Figure 1. DIMA analysis pipeline illustrated on an LC-MS/MS data set. (24) The data is sorted from top to bottom according to the frequency of MVs and the mean intensity of the proteins. Likewise, the reference data set R is sorted according to the mean protein intensities after considering pattern PR of step 3. (1) The pattern PO of MVs is learned by logistic regression using the protein and sample as factorial predictors and the mean protein intensity as a continuous predictor. (2) A reference data set R with few MVs is defined. (3) Various patterns PR of MVs are generated from the logistic regression model with the coefficients of step 1 and incorporated into the reference data set R. (4) Boxplots of the absolute imputation errors for multiple imputation algorithms. The circle indicates the median imputation deviation. The algorithms are ranked by their overall root mean square error (RMSE, red diamond). The algorithms can be divided into well-performing algorithms with an RMSE < 0.5 (green), medium performance with 0.5 < RMSE < 3 (yellow), and bad performance with RMSE > 3 (red). (5) The best-performing imputation algorithm on R (in this example impSeqRob) is recommended for the original data O and imputation of O is conducted.
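Steps 3 to 5 of the pipeline in this caption can be illustrated with a short Python sketch. It is not the DIMA implementation (which is distributed in MATLAB and R) and makes several assumptions: the logistic coefficients `intercept` and `slope` are hypothetical stand-ins for values that step 1 would estimate, the three toy imputers replace the real algorithm collection, and the reference matrix is random data.

```python
import numpy as np


def draw_pattern(ref, intercept, slope, rng):
    """Draw a boolean MV mask for ref (proteins x samples).

    intercept and slope are hypothetical stand-ins for coefficients that a
    logistic regression on the original data would supply (step 1).
    """
    row_mean = ref.mean(axis=1, keepdims=True)
    z = (row_mean - ref.mean()) / ref.std()       # standardized mean intensity
    p_missing = 1.0 / (1.0 + np.exp(-(intercept + slope * z)))
    mask = rng.random(ref.shape) < p_missing      # broadcasts over samples
    mask[mask.all(axis=1), 0] = False             # keep at least one value per protein
    return mask


def impute_row_mean(x):
    # Replace each MV by the mean of the observed values of the same protein.
    return np.where(np.isnan(x), np.nanmean(x, axis=1, keepdims=True), x)


def impute_col_median(x):
    # Replace each MV by the median of the observed values of the same sample.
    return np.where(np.isnan(x), np.nanmedian(x, axis=0, keepdims=True), x)


def impute_global_min(x):
    # Replace each MV by the smallest observed intensity.
    return np.where(np.isnan(x), np.nanmin(x), x)


# Toy stand-ins for the algorithms compared in step 4.
IMPUTERS = {
    "row_mean": impute_row_mean,
    "col_median": impute_col_median,
    "global_min": impute_global_min,
}


def rank_imputers(ref, mask, imputers):
    """Mask ref with the pattern, impute, and sort algorithms by RMSE."""
    masked = ref.copy()
    masked[mask] = np.nan
    scores = {}
    for name, impute in imputers.items():
        imputed = impute(masked)
        scores[name] = float(np.sqrt(np.mean((imputed[mask] - ref[mask]) ** 2)))
    return sorted(scores.items(), key=lambda item: item[1])


rng = np.random.default_rng(0)
protein_level = rng.normal(25.0, 2.0, size=(200, 1))
ref = protein_level + rng.normal(0.0, 0.3, size=(200, 8))   # toy reference data R
mask = draw_pattern(ref, intercept=-1.5, slope=-1.0, rng=rng)
for name, rmse in rank_imputers(ref, mask, IMPUTERS):
    print(f"{name}: RMSE = {rmse:.3f}")
```

The first entry of the sorted list plays the role of the algorithm that would be recommended in step 5. The negative slope encodes the assumption that low-intensity proteins are more likely to be missing.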
Figure 2. DIMA is applied and evaluated on 142 PRIDE data sets. (A) Nine algorithms compete for being recommended as the best-performing algorithm. The R package rrcovNA with its algorithms impSeqRob (47%) and impSeq (25%) is selected most frequently, followed by missForest in 13% and imputePCA in 10%. For 5% of the PRIDE data sets, another algorithm is suggested. (B) The rank of the imputation algorithms obtained in the 142 PRIDE data sets is shown as a box plot. The seven algorithms with the lowest median rank are also the seven most frequently selected algorithms by DIMA (A). The algorithms with a median rank lower than 5 are highlighted in green, and algorithms with a median rank greater than 20 are highlighted in red.
Figure 3. Performance of DIMA is evaluated on simulated data S with the incorporation of various proportions of MVs and MNAR/MCAR ratios. The RMSE (color-coded) and rank (first entry) obtained by the best-performing imputation algorithm recommended by DIMA (second entry) compared to direct imputation assessment (third entry) over 500 data simulations are calculated. The algorithm recommended by DIMA is within the top three out of 27 approaches in all cases. (A) For MV < 20%, the additive regression aregImpute with type regression (reg) outperforms the other algorithms; (B) between 20 and 30% MVs, several algorithms compete against each other; and (C) for MV > 30%, the random forest algorithm missForest performs best.
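The simulation design in this caption, a total MV proportion split into MNAR and MCAR parts, can be illustrated as follows. This is a minimal sketch and not the authors' simulation code: the function name `simulate_missing`, the hard left-censoring used as the MNAR mechanism, and the random toy matrix are all assumptions made for illustration.

```python
import numpy as np


def simulate_missing(data, mv_frac, mnar_share, rng):
    """Return a float copy of data with NaNs inserted.

    mv_frac    : fraction of all cells set to missing
    mnar_share : fraction of those missing cells that are MNAR
    Hypothetical helper; MNAR is modelled here as hard left-censoring.
    """
    n_total = int(round(mv_frac * data.size))
    n_mnar = int(round(mnar_share * n_total))
    n_mcar = n_total - n_mnar

    flat = np.asarray(data, dtype=float).flatten()   # flatten() always copies
    order = np.argsort(flat)
    mnar_idx = order[:n_mnar]                        # lowest intensities are lost
    remaining = order[n_mnar:]                       # cells still observed
    mcar_idx = rng.choice(remaining, size=n_mcar, replace=False)

    flat[mnar_idx] = np.nan
    flat[mcar_idx] = np.nan
    return flat.reshape(np.shape(data))


rng = np.random.default_rng(0)
complete = rng.normal(25.0, 2.0, size=(100, 10))     # toy complete data
for mv_frac in (0.05, 0.2, 0.5):
    for mnar_share in (0.0, 0.5, 1.0):
        sim = simulate_missing(complete, mv_frac, mnar_share, rng)
        print(mv_frac, mnar_share, np.isnan(sim).mean())
```

Drawing the MCAR cells only from the cells that survived censoring keeps the total number of MVs exact, which makes the simulated MV proportion directly comparable across MNAR/MCAR ratios.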
References
This article references 38 other publications.
- 1. McGurk, K. A.; et al. The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination. Bioinformatics 2020, 36, 2217–2223. DOI: 10.1093/bioinformatics/btz898
- 2. Poulos, R. C.; et al. Strategies to enable large-scale proteomics for reproducible research. Nat. Commun. 2020, 11, 3793. DOI: 10.1038/s41467-020-17641-3
- 3. Brenes, A.; Hukelmann, J.; Bensaddek, D.; Lamond, A. I. Multibatch TMT Reveals False Positives, Batch Effects and Missing Values. Mol. Cell. Proteomics 2019, 18, 1967–1980. DOI: 10.1074/mcp.RA119.001472
- 4. Wei, R.; Wang, J.; Su, M.; Jia, E.; Chen, S.; Chen, T.; Ni, Y. Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data. Sci. Rep. 2018, 8, 663. DOI: 10.1038/s41598-017-19120-0
- 5. Webb-Robertson, B.-J. M.; Wiberg, H. K.; Matzke, M. M.; Brown, J. N.; Wang, J.; McDermott, J. E.; Smith, R. D.; Rodland, K. D.; Metz, T. O.; Pounds, J. G.; Waters, K. M. Review, Evaluation, and Discussion of the Challenges of Missing Value Imputation for Mass Spectrometry-Based Label-Free Global Proteomics. J. Proteome Res. 2015, 14, 1993–2001. DOI: 10.1021/pr501138h
- 6. Lazar, C.; Gatto, L.; Ferro, M.; Bruley, C.; Burger, T. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. J. Proteome Res. 2016, 15, 1116–1125. DOI: 10.1021/acs.jproteome.5b00981
- 7. Rubin, D. B. Inference and missing data. Biometrika 1976, 63, 581–592. DOI: 10.1093/biomet/63.3.581
- 8. Karpievitch, Y. V.; Dabney, A. R.; Smith, R. D. Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinf. 2012, 13, S5. DOI: 10.1186/1471-2105-13-S16-S5
- 9. Välikangas, T.; Suomi, T.; Elo, L. L. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation. Briefings Bioinf. 2017, 19, 1344–1355. DOI: 10.1093/bib/bbx054
- 10. Wang, J.; Li, L.; Chen, T.; Ma, J.; Zhu, Y.; Zhuang, J.; Chang, C. In-depth method assessments of differentially expressed protein detection for shotgun proteomics data with missing values. Sci. Rep. 2017, 7, 3367. DOI: 10.1038/s41598-017-03650-8
- 11. Janssen, K. J.; Donders, A. R. T.; Harrell, F. E.; Vergouwe, Y.; Chen, Q.; Grobbee, D. E.; Moons, K. G. Missing covariate data in medical research: To impute is better than to ignore. J. Clin. Epidemiol. 2010, 63, 721–727. DOI: 10.1016/j.jclinepi.2009.12.008
- 12. Brock, G. N.; Shaffer, J. R.; Blakesley, R. E.; Lotz, M. J.; Tseng, G. C. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinf. 2008, 9, 12. DOI: 10.1186/1471-2105-9-12
- 13To, K. T.; Fry, R. C.; Reif, D. M. Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPi. BioData Min. 2018, 11, 10 DOI: 10.1186/s13040-018-0169-5[Crossref], [PubMed], [CAS], Google Scholar13https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A280%3ADC%252BB3c%252FhslOksQ%253D%253D&md5=5ea3228f441ab27190b2ada91f5059e6Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPiTo Kimberly T; Reif David M; Fry Rebecca C; Reif David M; Reif David MBioData mining (2018), 11 (), 10 ISSN:1756-0381.BACKGROUND: The Toxicological Priority Index (ToxPi) is a method for prioritization and profiling of chemicals that integrates data from diverse sources. However, individual data sources ("assays"), such as in vitro bioassays or in vivo study endpoints, often feature sections of missing data, wherein subsets of chemicals have not been tested in all assays. In order to investigate the effects of missing data and recommend solutions, we designed simulation studies around high-throughput screening data generated by the ToxCast and Tox21 programs on chemicals highlighted by the Agency for Toxic Substances and Disease Registry's (ATSDR) Substance Priority List (SPL), which helps prioritize environmental research and remediation resources. RESULTS: Our simulations explored a wide range of scenarios concerning data (0-80% assay data missing per chemical), modeling (ToxPi models containing from 160-700 different assays), and imputation method (k-Nearest-Neighbor, Max, Mean, Min, Binomial, Local Least Squares, and Singular Value Decomposition). We find that most imputation methods result in significant changes to ToxPi score, except for datasets with a small number of assays. 
If we consider rank change conditional on these significant changes to ToxPi score, we find that ranks of chemicals in the minimum value imputation, SVD imputation, and kNN imputation sets are more sensitive to the score changes. CONCLUSIONS: We found that the choice of imputation strategy exerted significant influence over both scores and associated ranks, and the most sensitive scenarios were those involving fewer assays plus higher proportions of missing data. By characterizing the effects of missing data and the relative benefit of imputation approaches across real-world data scenarios, we can augment confidence in the robustness of decisions regarding the health and ecological effects of environmental chemicals.
- 14Poyatos, R.; Sus, O.; Badiella, L.; Mencuccini, M.; Martinez-Vilalta, J. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information. Biogeosciences 2018, 15, 2601– 2617, DOI: 10.5194/bg-15-2601-2018
- 15Lenz, M.; Schulz, A.; Koeck, T.; Rapp, S.; Nagler, M.; Sauer, M.; Eggebrecht, L.; Cate, V. T.; Panova-Noeva, M.; Prochaska, J. H.; Lackner, K. J.; Münzel, T.; Leineweber, K.; Wild, P. S.; Andrade-Navarro, M. A. Missing value imputation in proximity extension assay-based targeted proteomics data. PLoS One 2020, 15, e0243487 DOI: 10.1371/journal.pone.0243487[Crossref], [PubMed], [CAS], Google Scholar15https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BB3cXis1GgsrrI&md5=a13ae3b2e3fc861f525e13f71405736dMissing value imputation in proximity extension assay-based targeted proteomics dataLenz, Michael; Schulz, Andreas; Koeck, Thomas; Rapp, Steffen; Nagler, Markus; Sauer, Madeleine; Eggebrecht, Lisa; Ten Cate, Vincent; Panova-Noeva, Marina; Prochaska, Juergen H.; Lackner, Karl J.; Muenzel, Thomas; Leineweber, Kirsten; Wild, Philipp S.; Andrade-Navarro, Miguel A.PLoS One (2020), 15 (12), e0243487CODEN: POLNCL; ISSN:1932-6203. (Public Library of Science)Targeted proteomics utilizing antibody-based proximity extension assays provides sensitive and highly specific quantifications of plasma protein levels. Multivariate anal. of this data is hampered by frequent missing values (random or left censored), calling for imputation approaches. While appropriate missing-value imputation methods exist, benchmarks of their performance in targeted proteomics data are lacking. Here, we assessed the performance of two methods for imputation of values missing completely at random, the previously top-benchmarked 'missForest' and the recently published 'GSimp' method. Evaluation was accomplished by comparing imputed with remeasured relative concns. of 91 inflammation related circulating proteins in 86 samples from a cohort of 645 patients with venous thromboembolism. The median Pearson correlation between imputed and remeasured protein expression values was 69.0% for missForest and 71.6% for GSimp (p = 5.8e-4). 
Imputation with missForest resulted in stronger redn. of variance compared to GSimp (median relative variance of 25.3% vs. 68.6%, p = 2.4e-16) and undesired larger bias in downstream anal. Irresp. of the imputation method used, the 91 imputed proteins revealed large variations in imputation accuracy, driven by differences in signal to noise ratio and information overlap between proteins. In summary, GSimp outperformed missForest, while both methods show good overall imputation accuracy with large variations between proteins.
- 16Bramer, L. M.; Irvahn, J.; Piehowski, P. D.; Rodland, K. D.; Webb-Robertson, B.-J. M. A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun Proteomics. J. Proteome Res. 2021, 20, 1– 13, DOI: 10.1021/acs.jproteome.0c00123[ACS Full Text
], [CAS], Google Scholar
16https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BB3cXhvVerur3N&md5=6de86677050965321a2b7a05314cb1f7A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun ProteomicsBramer, Lisa M.; Irvahn, Jan; Piehowski, Paul D.; Rodland, Karin D.; Webb-Robertson, Bobbie-Jo M.Journal of Proteome Research (2021), 20 (1), 1-13CODEN: JPROBS; ISSN:1535-3893. (American Chemical Society)A review. The throughput efficiency and increased depth of coverage provided by isobaric-labeled proteomics measurements have led to increased usage of these techniques. However, the structure of missing data is different than unlabeled studies, which prompts the need for this review to compare the efficacy of nine imputation methods on large isobaric-labeled proteomics data sets to guide researchers on the appropriateness of various imputation methods. Imputation methods were evaluated by accuracy, statistical hypothesis test inference, and run time. In general, expectation maximization and random forest imputation methods yielded the best performance, and const.-based methods consistently performed poorly across all data set sizes and percentages of missing values. For data sets with small sample sizes and higher percentages of missing data, results indicate that statistical inference with no imputation may be preferable. On the basis of the findings in this review, there are core imputation methods that perform better for isobaric-labeled proteomics data, but great care and consideration as to whether imputation is the optimal strategy should be given for data sets comprised of a small no. of samples. - 17de Souto, M. C. P.; Jaskowiak, P. A.; Costa, I. G. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinf. 
2015, 16, 64 DOI: 10.1186/s12859-015-0494-3[Crossref], [PubMed], [CAS], Google Scholar17https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A280%3ADC%252BC2Mjlslemsg%253D%253D&md5=210bc346aff2d28f775b2c1f2a4093b7Impact of missing data imputation methods on gene expression clustering and classificationde Souto Marcilio C P; Jaskowiak Pablo A; Costa Ivan G; Costa Ivan GBMC bioinformatics (2015), 16 (), 64 ISSN:.BACKGROUND: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. RESULTS AND CONCLUSIONS: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
- 18Liu, M.; Dongre, A. Proper imputation of missing values in proteomics datasets for differential expression analysis. Briefings Bioinf. 2020, 0, bbaa112 DOI: 10.1093/bib/bbaa112
- 19Rodwell, L.; Lee, K. J.; Romaniuk, H.; Carlin, J. B. Comparison of methods for imputing limited-range variables: a simulation study. BMC Med. Res. Methodol. 2014, 14, 57 DOI: 10.1186/1471-2288-14-57[Crossref], [PubMed], [CAS], Google Scholar19https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A280%3ADC%252BC2cnmvVKlsA%253D%253D&md5=67b590a65a7366914e38279d9ed1cb6aComparison of methods for imputing limited-range variables: a simulation studyRodwell Laura; Lee Katherine J; Romaniuk Helena; Carlin John BBMC medical research methodology (2014), 14 (), 57 ISSN:.BACKGROUND: Multiple imputation (MI) was developed as a method to enable valid inferences to be obtained in the presence of missing data rather than to re-create the missing values. Within the applied setting, it remains unclear how important it is that imputed values should be plausible for individual observations. One variable type for which MI may lead to implausible values is a limited-range variable, where imputed values may fall outside the observable range. The aim of this work was to compare methods for imputing limited-range variables, with a focus on those that restrict the range of the imputed values. METHODS: Using data from a study of adolescent health, we consider three variables based on responses to the General Health Questionnaire (GHQ), a tool for detecting minor psychiatric illness. These variables, based on different scoring methods for the GHQ, resulted in three continuous distributions with mild, moderate and severe positive skewness. 
In an otherwise complete dataset, we set 33% of the GHQ observations to missing completely at random or missing at random; repeating this process to create 1000 datasets with incomplete data for each scenario.For each dataset, we imputed values on the raw scale and following a zero-skewness log transformation using: univariate regression with no rounding; post-imputation rounding; truncated normal regression; and predictive mean matching. We estimated the marginal mean of the GHQ and the association between the GHQ and a fully observed binary outcome, comparing the results with complete data statistics. RESULTS: Imputation with no rounding performed well when applied to data on the raw scale. Post-imputation rounding and imputation using truncated normal regression produced higher marginal means than the complete data estimate when data had a moderate or severe skew, and this was associated with under-coverage of the complete data estimate. Predictive mean matching also produced under-coverage of the complete data estimate. For the estimate of association, all methods produced similar estimates to the complete data. CONCLUSIONS: For data with a limited range, multiple imputation using techniques that restrict the range of imputed values can result in biased estimates for the marginal mean when data are highly skewed.
- 20Kruttika, D.; Simion, K.; R, J. M.; J, P. S. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Datasets. bioRxiv 2020, 1– 39, DOI: 10.1101/2019.12.11.123456
- 21Jin, L.; Bi, Y.; Hu, C.; Qu, J.; Shen, S.; Wang, X.; Tian, Y. A comparative study of evaluating missing value imputation methodsin label-free proteomics. Sci. Rep. 2021, 11, 1760 DOI: 10.1038/s41598-021-81279-4[Crossref], [PubMed], [CAS], Google Scholar21https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BB3MXhvFaqtrk%253D&md5=34e7040ccb9fb0dfe9b7e371dac1bcb1A comparative study of evaluating missing value imputation methods in label-free proteomicsJin, Liang; Bi, Yingtao; Hu, Chenqi; Qu, Jun; Shen, Shichen; Wang, Xue; Tian, YuScientific Reports (2021), 11 (1), 1760CODEN: SRCEC3; ISSN:2045-2322. (Nature Research)The presence of missing values (MVs) in label-free quant. proteomics greatly reduces the completeness of data. Imputation has been widely utilized to handle MVs, and selection of the proper method is crit. for the accuracy and reliability of imputation. Here we present a comparative study that evaluates the performance of seven popular imputation methods with a large-scale benchmark dataset and an immune cell dataset. Simulated MVs were incorporated into the complete part of each dataset with different combinations of MV rates and missing not at random (MNAR) rates. Normalized root mean square error (NRMSE) was applied to evaluate the accuracy of protein abundances and intergroup protein ratios after imputation. Detection of true positives (TPs) and false altered-protein discovery rate (FADR) between groups were also compared using the benchmark dataset. Furthermore, the accuracy of handling real MVs was assessed by comparing enriched pathways and signature genes of cell activation after imputing the immune cell dataset. We obsd. that the accuracy of imputation is primarily affected by the MNAR rate rather than the MV rate, and downstream anal. can be largely impacted by the selection of imputation methods. 
A random forest-based imputation method consistently outperformed other popular methods by achieving the lowest NRMSE, high amt. of TPs with the av. FADR < 5%, and the best detection of relevant pathways and signature genes, highlighting it as the most suitable method for label-free proteomics.
- 22Audoux, J.; Salson, M.; Grosset, C. F.; Beaumeunier, S.; Holder, J.-M.; Commes, T.; Philippe, N. SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines. BMC Bioinf. 2017, 18, 428 DOI: 10.1186/s12859-017-1831-5[Crossref], [PubMed], [CAS], Google Scholar22https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BC1cXitVOhtbrM&md5=024e4f119feb752a3458f3a643e71527SimBA: a methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelinesAudoux, Jerome; Salson, Mikael; Grosset, Christophe F.; Beaumeunier, Sacha; Holder, Jean-Marc; Commes, Therese; Philippe, NicolasBMC Bioinformatics (2017), 18 (), 428/1-428/14CODEN: BBMIC4; ISSN:1471-2105. (BioMed Central Ltd.)The evolution of next-generation sequencing (NGS) technologies has led to increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq anal., each with unique performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which bioinformatic tools are best for their specific needs and how they should be configured. In order to provide some answers to these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq anal. and propose a methodol. for systematic evaluation and comparison of performance to help users make well informed choices. To evaluate RNA-Seq pipelines, we developed a suite of two benchmarking tools. SimCT generates simulated datasets that get as close as possible to specific real biol. conditions accompanied by the list of genomic incidents and mutations that have been inserted. BenchCT then compares the output of any bioinformatics pipeline that has been run against a SimCT dataset with the simulated genomic and transcriptional variations it contains to give an accurate performance evaluation in addressing specific biol. question. 
We used these tools to simulate a real-world genomic medicine questions involving the comparison of healthy and cancerous cells. Results revealed that performance in addressing a particular biol. context varied significantly depending on the choice of tools and settings used. We also found that by combining the output of certain pipelines, substantial performance improvements could be achieved. Our research emphasizes the importance of selecting and configuring bioinformatic tools for the specific biol. question being investigated to obtain optimal results. Pipeline designers, developers and users should include benchmarking in the context of their biol. question as part of their design and quality control process. Our SimBA suite of benchmarking tools provides a reliable basis for comparing the performance of RNA-Seq bioinformatics pipelines in addressing a specific biol. question. We would like to see the creation of a ref. corpus of data-sets that would allow accurate comparison between benchmarks performed by different groups and the publication of more benchmarks based on this public corpus.
- 23Wang, S.; Li, W.; Hu, L.; Cheng, J.; Yang, H.; Liu, Y. NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses. Nucleic Acids Res. 2020, 48, e83, DOI: 10.1093/nar/gkaa498[Crossref], [PubMed], [CAS], Google Scholar23https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BB3cXis1entrrE&md5=ff8b871b38806a95b7fb0220d5782dafNAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analysesWang, Shisheng; Li, Wenxue; Hu, Liqiang; Cheng, Jingqiu; Yang, Hao; Liu, YanshengNucleic Acids Research (2020), 48 (14), e83CODEN: NARHAD; ISSN:1362-4962. (Oxford University Press)Mass spectrometry (MS)-based quant. proteomics expts. frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools that are tailored for proteomics community. Herein, we developed a user-friendly and powerful stand-alone software, NAguideR, to enable implementation and evaluation of different missing value methods offered by 23 widely used missing-value imputation algorithms. NAguideR further evaluates data imputation results through classic computational criteria and, unprecedentedly, proteomic empirical criteria, such as quant. consistency between different charge-states of the same peptide, different peptides belonging to the same proteins, and individual proteins participating protein complexes and functional interactions. We applied NAguideR into three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables resp., all generated by data independent acquisition mass spectrometry (DIA-MS) with substantial biol. replicates. 
The results indicate that NAguideR is able to discriminate the optimal imputation methods that are facilitating DIA-MS expts. over those sub-optimal and low-performance algorithms. NAguideR further provides downloadable tables and figures supporting flexible data anal. and interpretation.
- 24Rose, M.; Duhamel, M.; Aboulouard, S.; Kobeissy, F.; Rhun, E. L.; Desmons, A.; Tierny, D.; Fournier, I.; Rodet, F.; Salzet, M. The Role of a Proprotein Convertase Inhibitor in Reactivation of Tumor-Associated Macrophages and Inhibition of Glioma Growth. Mol. Ther.--Oncolytics 2020, 17, 31– 46, DOI: 10.1016/j.omto.2020.03.005[Crossref], [PubMed], [CAS], Google Scholar24https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BB3cXht1GntbnL&md5=8198a0c313c461e11bafb3372d34b112The Role of a Proprotein Convertase Inhibitor in Reactivation of Tumor-Associated Macrophages and Inhibition of Glioma GrowthRose, Melanie; Duhamel, Marie; Aboulouard, Soulaimane; Kobeissy, Firas; Le Rhun, Emilie; Desmons, Annie; Tierny, Dominique; Fournier, Isabelle; Rodet, Franck; Salzet, MichelMolecular Therapy--Oncolytics (2020), 17 (), 31-46CODEN: MTOHDL; ISSN:2372-7705. (Elsevier Inc.)Tumors are characterized by the presence of malignant and non-malignant cells, such as immune cells including macrophages, which are preponderant. Macrophages impact the efficacy of chemotherapy and may lead to drug resistance. In this context and based on our previous work, we investigated the ability to reactivate macrophages by using a proprotein convertases inhibitor. Proprotein convertases process immature proteins into functional proteins, with several of them having a role in immune cell activation and tumorigenesis. Macrophages were treated with a peptidomimetic inhibitor targeting furin, PC1/3, PC4, PACE4, and PC5/6. Their anti-glioma activity was analyzed by mass spectrometry-based proteomics and viability assays in 2D and 3D in vitro cultures. Comparison with temozolomide, the drug used for glioma therapy, established that the inhibitor was more efficient for the redn. of cancer cell d. The inhibitor was also able to reactivate macrophages through the secretion of several immune factors with antitumor properties. 
Moreover, two proteins considered as good glioma patient survival indicators were also identified in 3D cultures treated with the inhibitor. Finally, we established that the proprotein convertases inhibitor has a dual role as an anti-glioma drug and anti-tumoral macrophage reactivation drug. This strategy could be used together with chemotherapy to increase therapy efficacy in glioma.
- 25Cox, J.; Mann, M. MaxQuant enables high peptide identification rates and individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367– 1372, DOI: 10.1038/nbt.1511[Crossref], [PubMed], [CAS], Google Scholar25https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BD1cXhsVWjtLzJ&md5=675d31ca84e3a7e4fb9bdd601d8075eaMaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantificationCox, Juergen; Mann, MatthiasNature Biotechnology (2008), 26 (12), 1367-1372CODEN: NABIF9; ISSN:1087-0156. (Nature Publishing Group)Efficient anal. of very large amts. of raw data for peptide identification and protein quantification is a principal challenge in mass spectrometry (MS)-based proteomics. Here we describe MaxQuant, an integrated suite of algorithms specifically developed for high-resoln., quant. MS data. Using correlation anal. and graph theory, MaxQuant detects peaks, isotope clusters and stable amino acid isotope-labeled (SILAC) peptide pairs as three-dimensional objects in m/z, elution time and signal intensity space. By integrating multiple mass measurements and correcting for linear and nonlinear mass offsets, we achieve mass accuracy in the p.p.b. range, a sixfold increase over std. techniques. We increase the proportion of identified fragmentation spectra to 73% for SILAC peptide pairs via unambiguous assignment of isotope and missed-cleavage state and individual mass precision. MaxQuant automatically quantifies several hundred thousand peptides per SILAC-proteome expt. and allows statistically robust identification and quantification of >4000 proteins in mammalian cell lysates.
- 26O’Brien, J. J.; Gunawardena, H. P.; Paulo, J. A.; Chen, X.; Ibrahim, J. G.; Gygi, S. P.; Qaqish, B. F. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann. Appl. Stat. 2018, 12, 2075– 2095, DOI: 10.1214/18-AOAS1144[Crossref], [PubMed], [CAS], Google Scholar26https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A280%3ADC%252BB3crksFeitA%253D%253D&md5=e7433543cf49d814974145824b08c735The effects of nonignorable missing data on label-free mass spectrometry proteomics experimentsO'Brien Jonathon J; Gunawardena Harsha P; Paulo Joao A; Chen Xian; Ibrahim Joseph G; Gygi Steven P; Qaqish Bahjat FThe annals of applied statistics (2018), 12 (4), 2075-2095 ISSN:1932-6157.An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides requiring an inferential step to obtain protein level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. 
Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
- 27Marquardt, D. W. Comment - You should standardize the predictor variables in your regression models. J. Am. Stat. Assoc. 1980, 75, 87– 91, 10.1080/01621459.1980.10477430Google ScholarThere is no corresponding record for this reference.
- 28Menard, S. Standards for standardized logistic regression coefficients. Soc. Forces 2011, 89, 1409– 1428, DOI: 10.1093/sf/89.4.1409
- 29Kreutz, C. New Concepts for Evaluating the Performance of Computational Methods. IFAC-PapersOnLine 2016, 49, 63– 70, DOI: 10.1016/j.ifacol.2016.12.104
- 30MATLAB. 9.8.0.1538580 (R2020a); The MathWorks Inc.: Natickand Massachusetts, 2020.Google ScholarThere is no corresponding record for this reference.
- 31R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2021.Google ScholarThere is no corresponding record for this reference.
- 32Stekhoven, D. J.; Buhlmann, P. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112– 118, DOI: 10.1093/bioinformatics/btr597[Crossref], [PubMed], [CAS], Google Scholar32https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BC3MXhs1yms7fO&md5=44a690989dad7424d50a1662f5de2625MissForest-non-parametric missing value imputation for mixed-type dataStekhoven, Daniel J.; Buehlmann, PeterBioinformatics (2012), 28 (1), 112-118CODEN: BOINFP; ISSN:1367-4803. (Oxford University Press)Modern data acquisition based on high-throughput technol. is often facing the problem of missing data. Algorithms commonly used in the anal. of such large-scale data often depend on a complete set. Missing value imputation offers a soln. to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled sep. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously. Results: We compare several state of the art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error ests. of random forest, we are able to est. the imputation error without the need of a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biol. fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. 
In our comparative study, missForest outperforms other methods of imputation esp. in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error ests. of missForest prove to be adequate in all settings. Addnl., missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
- 33Josse, J.; Husson, F. Handling missing values in exploratory multivariate data analysis methods. J. Soc. Fr. Stat. 2012, 153, 79– 99Google ScholarThere is no corresponding record for this reference.
- 34Khoonsari, P. E.; Häggmark, A.; Lönnberg, M.; Mikus, M.; Kilander, L.; Lannfelt, L.; Bergquist, J.; Ingelsson, M.; Nilsson, P.; Kultima, K.; Shevchenko, G. Analysis of the Cerebrospinal Fluid Proteome in Alzheimer’s Disease. PLoS One 2016, 11, e0150672 DOI: 10.1371/journal.pone.0150672[Crossref], [PubMed], [CAS], Google Scholar34https://chemport.cas.org/services/resolver?origin=ACS&resolution=options&coi=1%3ACAS%3A528%3ADC%252BC28XhtVart7vO&md5=ef30cb763212be43089453d2a5117f7eAnalysis of the cerebrospinal fluid proteome in Alzheimer's diseaseKhoonsari, Payam Emami; Haggmark, Anna; Lonnberg, Maria; Mikus, Maria; Kilander, Lena; Lannfelt, Lars; Bergquist, Jonas; Ingelsson, Martin; Nilsson, Peter; Kultima, Kim; Shevchenko, GannaPLoS One (2016), 11 (3), e0150672/1-e0150672/25CODEN: POLNCL; ISSN:1932-6203. (Public Library of Science)Alzheimer's disease is a neurodegenerative disorder accounting for more than 50% of cases of dementia. Diagnosis of Alzheimer's disease relies on cognitive tests and anal. of amyloid beta, protein tau, and hyperphosphorylated tau in cerebrospinal fluid. Although these markers provide relatively high sensitivity and specificity for early disease detection, they are not suitable for monitor of disease progression. In the present study, we used label-free shotgun mass spectrometry to analyze the cerebrospinal fluid proteome of Alzheimer's disease patients and non-demented controls to identify potential biomarkers for Alzheimer's disease. We processed the data using five programs (DecyderMS, Maxquant, OpenMS, PEAKS, and Sieve) and compared their results by means of reproducibility and peptide identification, including three different normalization methods. After depletion of high abundant proteins we found that Alzheimer's disease patients had lower fraction of low-abundance proteins in cerebrospinal fluid compared to healthy controls (p<0.05). 
Consequently, global normalization was found to be less accurate compared to using spiked-in chicken ovalbumin for normalization. In addn., we detd. that Sieve and OpenMS resulted in the highest reproducibility and PEAKS was the programs with the highest identification performance. Finally, we successfully verified significantly lower levels (p<0.05) of eight proteins (A2GL, APOM, C1QB, C1QC, C1S, FBLN3, PTPRZ, and SEZ6) in Alzheimer's disease compared to controls using an antibody-based detection method. These proteins are involved in different biol. roles spanning from cell adhesion and migration, to regulation of the synapse and the immune system.
- 35. Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinf. 2008, 9, 163, DOI: 10.1186/1471-2105-9-163
- 36. Pursiheimo, A.; Vehmas, A. P.; Afzal, S.; Suomi, T.; Chand, T.; Strauss, L.; Poutanen, M.; Rokka, A.; Corthals, G. L.; Elo, L. L. Optimization of Statistical Methods Impact on Quantitative Proteomics Data. J. Proteome Res. 2015, 14, 4118–4126, DOI: 10.1021/acs.jproteome.5b00183
- 37. Govaert, E.; Van Steendam, K.; Scheerlinck, E.; Vossaert, L.; Meert, P.; Stella, M.; Willems, S.; De Clerck, L.; Dhaenens, M.; Deforce, D. Extracting histones for the specific purpose of label-free MS. Proteomics 2016, 16, 2937–2944, DOI: 10.1002/pmic.201600341
- 38. Calf, O. W.; van Dam, N. M.; Weinhold, A.; Huber, H.; Peters, J. L. MTBLS738: Glycoalkaloid composition explains variation in slug resistance in Solanum dulcamara. Oecologia 2018, 187, 495–506, DOI: 10.1007/s00442-018-4064-z
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00119.
Sigmoidal decrease of missing values for higher protein intensities (Figure S1); missing value distribution per sample and per protein (Figure S2); distribution of the estimated logistic regression coefficients (Figure S3); DIMA analysis at the peptide level (Figure S4); density plot of the imputed compared to the original data values (Figure S5); principal component analysis before and after imputation (Figure S6); DIMA Implementation (Figure S7); characteristics of the 30 applied imputation algorithms (Table S1); and characteristics of the PRIDE data sets assessed with DIMA (Table S2) (PDF)