DIMA: Data-Driven Selection of an Imputation Algorithm

  • Janine Egert*
    Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany
    Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
    *Email: [email protected]
  • Eva Brombacher
    Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany
    Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
    Spemann Graduate School of Biology and Medicine (SGBM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
    Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
  • Bettina Warscheid
    Biochemistry and Functional Proteomics, Institute of Biology II, Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
    Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
  • Clemens Kreutz*
    Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany
    Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
    Center for Data Analysis and Modeling (FDM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
    *Email: [email protected]
Cite this: J. Proteome Res. 2021, 20, 7, 3489–3496
Publication Date (Web):June 1, 2021
https://doi.org/10.1021/acs.jproteome.1c00119

Copyright © 2022 The Authors. Published by American Chemical Society. This publication is licensed under CC-BY-NC-ND 4.0.
  • Open Access

Abstract

Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of an imputation algorithm (DIMA). The performance and broad applicability of DIMA are demonstrated on 142 quantitative proteomics data sets from the PRoteomics IDEntifications (PRIDE) database and on simulated data consisting of 5–50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases. The DIMA implementation is available in MATLAB at github.com/kreutz-lab/OmicsData and in R at github.com/kreutz-lab/DIMAR.

1. Introduction

Mass spectrometry (MS) has become a popular strategy for identifying and quantifying thousands of proteins in proteomics research. Despite the ongoing technical improvements in modern MS technologies, the handling of missing values (MVs) remains a persistent issue (1−6) because high MV rates of up to 30–50% result in missing identifications and missing quantifications.
The occurrence of missing data results from different biological and/or technical reasons: a peptide is either not detected or falsely identified (missing completely at random (MCAR)), below the detection limit (missing not at random (MNAR)) or simply not present in the sample. (6−8)
A further consequence of MVs is that they bias estimations and may negatively impact statistical power and downstream analyses such as statistical tests and clustering analyses. (9−11) Many of these even require complete data.
A popular strategy to deal with incomplete data is to apply imputation. (5,6,8,9) However, a poor-performing imputation method may bias subsequent analysis steps, making missing value analysis an ongoing cause for debate within the bioinformatics community. To date, there is no “one-size-fits-all” imputation strategy available. Many imputation methods and method comparisons specifically tailored to a certain data acquisition technique have been published in the last few years. (12−19) Furthermore, serial dilution data were used to find a generally superior imputation strategy; (5,20,21) however, no such strategy was identified. (5,12,22)
Recently, the tool NAguideR (23) was developed to guide the decision for an optimal data-dependent imputation algorithm. It compares various imputation strategies on the complete portion of the data, and based on this comparison an optimal imputation method for the incomplete portion is suggested. However, in general, only 10–30% of the quantified proteins have no MV, so the complete portion may not reflect the whole data set. In addition, the random assignment of MVs may not be a realistic representation of the occurrence of MVs.
With data-driven selection of an imputation algorithm (DIMA) we extend this approach to a more realistic setting. DIMA provides a valid recommendation of a high-performing imputation algorithm specifically tailored to the user-supplied data set and its MV distribution. Within DIMA, a reference data set is generated that captures the intensity and MV distribution from the original data and on which various imputation strategies can be evaluated. The reference data comprises proteins with the least amount of MVs scaled to the protein means and standard deviations of the whole input data set. To reproduce the pattern of MVs, a logistic regression model is trained on the original data taking into account the protein and sample as well as the mean protein intensity. The logistic regression coefficients are then applied to the reference data to imitate the MV distribution.

2. Methods

2.1. Illustration Data

To illustrate the concept of DIMA, it is applied to an exemplary liquid chromatography (LC)–mass spectrometry (MS)/MS data set. Here, the secretion of antitumoral factors by rat alveolar wild-type macrophages is triggered by four different concentrations of proprotein convertase (PC) inhibitors ({0, 50, 100, 150} μM) leading to a decreased viability of C6 glioma cells (PXD014679/proteinGroups-WTsecretome). (24) This is measured in three technical replicates. The data comprises 884 identified proteins with 39.3% MVs and is depicted in Figure 1.

Figure 1. DIMA analysis pipeline illustrated on an LC-MS/MS data set. (24) The data is sorted from top to bottom according to the frequency of MVs and the mean intensity of the proteins. Likewise, the reference data R is sorted according to the mean protein intensities after considering pattern PR of step 3. (1) The pattern PO of MVs is learned by logistic regression using the protein and sample as factorial predictors and the mean protein intensity as a continuous predictor. (2) A reference data R with few MVs is defined. (3) Various patterns PR of MVs are generated by the logistic regression model, and the respective coefficients of step 1 are incorporated into the reference data R. (4) Boxplots of the absolute imputation errors for multiple imputation algorithms. The circle indicates the median imputation deviation. The algorithms are ranked by their overall root mean square error (RMSE, red diamond). The algorithms can be divided into well-performing algorithms with an RMSE < 0.5 (green), medium performance with 0.5 < RMSE < 3 (yellow), and bad performance with RMSE > 3 (red). (5) The best-performing imputation algorithm on R (in this example impSeqRob) is recommended for the original data O and imputation of O is conducted.

2.2. PRIDE Data

To show its broad applicability and performance, DIMA is applied to 142 publicly available data sets from the PRoteomics IDEntifications (PRIDE) database. Data sets that contain the search pattern “MaxQuant” (25) or “proteinGroups”, have the file extension .txt, .xlsx, .csv, .tar, .gz, .zip, or .rar, and were uploaded to the PRIDE database between May and July 2020 are acquired from the FTP server “ftp.PRIDE.ebi.ac.uk”. The resulting 142 PRIDE data sets comprise [200–13 430] proteins, [2–112] samples and [2.6–94.3]% MVs, with an average amount of (2990 ± 200) proteins, (17.8 ± 1.3) samples, and (40.5 ± 2.1)% MVs. Thirty-five percent of the data sets contain more than 50% MVs. The characteristics for the individual PRIDE data sets are shown in the Supporting Information Table S2.

2.3. Simulation Study

In addition to experimental data sets, a simulation study is performed to analyze the influence of MCAR and MNAR ratios as well as the frequency of MVs on the imputation performance. The simulation process is adapted from O’Brien et al. (26) The protein intensities
$$Y_{ijk} = \theta_{ij} + \epsilon_{ijk} \qquad (1)$$
of condition k are calculated with the baseline average $\theta_{ij} \sim N(18.5, \mathrm{IG}(1, 1))$ of peptide j matched to protein i and with a noise term $\epsilon_{ijk} \sim N(0, \mathrm{IG}(2, 1))$. To simulate differential expression, a fold change $FC_{ik} \sim N(0, \mathrm{IG}(1.5, 1))$ is added to the protein intensity for condition k = 1. The variances are drawn from the inverse γ distribution IG, which represents the marginal posterior distribution for unknown variances of a normal distribution.
To evaluate the impact of the MV distribution, various combinations of MVs and MCAR/MNAR ratios are incorporated into the simulated data Y. The pattern $P_{ik} \in \{0, 1\}$ of MVs is defined as $P_{ik} = 1$ if the observation is available and $P_{ik} = 0$ if the observation is missing. The MCAR values are set randomly. The indicator variable
$$P_{ik}^{\mathrm{MNAR}} \sim \Phi(a + b\,Y_{ik}) \qquad (2)$$
of MNAR values is drawn from the intensity-dependent probit model $\Phi(a + b\,Y_{ik})$ with the cumulative distribution function Φ(·) of a normally distributed random variable N(0, 1), (26) the rate b of MNAR values, and the b-quantile $a = Q_b(Y)$ of the protein intensities Y. In our notation, MVs are incorporated in the simulated data S if one of the patterns suggests an MV, that is, if $P_{ik}^{\mathrm{MNAR}} P_{ik}^{\mathrm{MCAR}} = 0$; otherwise, S consists of the simulated intensities Y. For each combination of MV percentage and MCAR/MNAR ratio, 500 data sets with 500 proteins and 20 samples that consist of two conditions k = {0, 1} are simulated.
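The simulation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the peptide index j is dropped, one noise variance is drawn per protein, and the probit parameterization (the hypothetical `slope` argument and the sign convention that low intensities are more likely to be missing) is chosen for readability because the exact form used in the paper cannot be recovered from the text. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def inverse_gamma(shape, scale, rng):
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ IG(shape, scale).
    # random.gammavariate takes a scale parameter, hence 1/scale.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


def simulate_intensities(n_proteins, n_samples, rng):
    """Log2 intensities; the first half of the samples is condition k = 0."""
    half = n_samples // 2
    data = []
    for _ in range(n_proteins):
        theta = rng.gauss(18.5, inverse_gamma(1.0, 1.0, rng) ** 0.5)       # baseline
        fold_change = rng.gauss(0.0, inverse_gamma(1.5, 1.0, rng) ** 0.5)  # k = 1 only
        # Assumption: a single noise variance per protein.
        noise_sd = inverse_gamma(2.0, 1.0, rng) ** 0.5
        data.append([
            theta + (fold_change if s >= half else 0.0) + rng.gauss(0.0, noise_sd)
            for s in range(n_samples)
        ])
    return data


def simulate_pattern(data, mcar_rate, mnar_quantile, rng, slope=1.0):
    """Pattern P (1 = observed, 0 = missing) as the product of MCAR and MNAR patterns."""
    flat = sorted(v for row in data for v in row)
    a = flat[min(int(mnar_quantile * len(flat)), len(flat) - 1)]  # intensity quantile
    phi = NormalDist().cdf
    pattern = []
    for row in data:
        p_row = []
        for y in row:
            p_mcar = 0 if rng.random() < mcar_rate else 1
            # Assumed probit form: intensities far below the quantile a
            # are almost surely missing, intensities far above are kept.
            p_mnar = 0 if rng.random() < phi(slope * (a - y)) else 1
            p_row.append(p_mcar * p_mnar)
        pattern.append(p_row)
    return pattern
```

With `rng = random.Random(1)`, a call such as `simulate_intensities(500, 20, rng)` followed by `simulate_pattern(...)` would mirror the data dimensions used in the paper (500 proteins, 20 samples, two conditions).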

2.4. DIMA

DIMA assesses and suggests imputation algorithms for a user-defined data set. The method consists of five main steps, which are depicted in Figure 1.
(1) The pattern PO of MVs in the original input data O is learned by logistic regression (Section 2.4.1).
(2) Reference data R with the subset of proteins containing the least amount of MVs are generated from the original data O to evaluate imputation performance (Section 2.4.2).
(3) To generate a pattern of missing data with a similar distribution as in the original data, the logistic regression model of step 1 is applied to the reference data R. Bernoulli trials are performed to simulate nP patterns PR of MVs (Section 2.4.2).
(4) Multiple imputation algorithms (Section 2.4.3) are applied to R with patterns PR of MVs and ranked by their root mean square error (RMSE). The best-performing algorithm is given by the lowest average rank over all pattern simulations (Section 2.4.4).
(5) The best-performing imputation algorithm of step 4 is recommended as an imputation algorithm for the original data O and imputation of O is performed.
In general, as the protein abundances are log-normally distributed, the authors recommend applying DIMA on the logarithmic scale. In this paper, intensities are transformed into a logarithmic scale with base 2.

2.4.1. Learn Pattern of Missing Values

The probability pMV = Prob(P = 0) of a specific data point missing (P = 0) is described by a logistic regression model as follows
$$p_{MV} = \frac{1}{1 + \exp(-X\beta)} \qquad (3)$$
The columns of the design matrix X correspond to predictor variables describing certain data properties: the mean protein intensity that is central for MNAR data and the protein and sample identities as factorial predictors. The sample predictor captures different experimental conditions and replicates. In addition, DIMA offers optional predictors such as protein ratios, molecular weight, or quality scores if included in the input file.
The predictors are standardized (27,28) and a weak regularization of the regression coefficients β for the factorial predictors is applied to decrease the variance of the estimated parameters and to prevent non-identifiability. (29) For large data sets (>1000 proteins), multiple random subsamples of the proteins are analyzed to decrease the computational cost. Here, each protein is drawn once and the regression coefficients β are set to the sampling means.
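To make the idea of learning the MV pattern concrete, the following Python sketch fits a strongly simplified version of such a model. It is an assumption-laden illustration rather than the DIMA implementation: only an intercept and the standardized mean protein intensity are used as predictors (the factorial protein and sample predictors are omitted), the fit uses plain gradient descent, and the names `fit_missingness_model`, `l2`, `lr`, and `n_iter` are hypothetical.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_missingness_model(data, l2=1e-3, lr=0.5, n_iter=2000):
    """Fit p_MV = sigmoid(b0 + b1 * z), z = standardized mean protein intensity.

    data: list of rows (proteins); None marks a missing value.
    Assumes every protein has at least one observed value and that the
    protein means are not all identical.
    """
    means = []
    for row in data:
        observed = [v for v in row if v is not None]
        means.append(sum(observed) / len(observed))
    mu = sum(means) / len(means)
    sd = (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5
    z = [(m - mu) / sd for m in means]

    # One training example per matrix entry: y = 1 if the entry is missing.
    xs, ys = [], []
    for zi, row in zip(z, data):
        for v in row:
            xs.append(zi)
            ys.append(1.0 if v is None else 0.0)

    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * (g1 / n + l2 * b1)  # weak L2 penalty on the slope
    return b0, b1, z
```

A negative slope `b1` indicates that proteins with a low mean intensity are more likely to have MVs, which is the behavior expected for MNAR-dominated data.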

2.4.2. Generation of the Reference Data

Because the imputation algorithms are evaluated and ranked on the reference data R, the reference data plays a key role within DIMA. The aim is to simulate abundances, variability, and MV distribution as realistically as possible compared to the original data O. This is done by assigning the proteins with the least number of MVs (at least 20% of the proteins) to the reference data multiple times to get the same number of proteins as in the original data. The reference data is then scaled to the mean and standard deviation of the original proteins. To simulate the occurrence patterns PR of MVs, the logistic model from eq 3 with the regression coefficients β of step 1 is applied to the reference data R, and Bernoulli trials are performed. Depending on the size of the input data, 5–20 patterns are generated.
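A minimal Python sketch of the two central operations in this step, selecting and repeating the proteins with the fewest MVs and drawing MV patterns by Bernoulli trials, could look as follows. The scaling to the protein means and standard deviations of the original data is deliberately left out, and the function names as well as the `fraction` argument are hypothetical; the missingness probabilities are assumed to come from a previously fitted logistic model.

```python
import math
import random


def build_reference(data, fraction=0.2):
    """Repeat the proteins with the fewest MVs until the original size is reached.

    data: list of rows (proteins); None marks a missing value.
    """
    n = len(data)
    n_ref = max(1, math.ceil(fraction * n))
    # Stable sort: proteins with equally few MVs keep their original order.
    ranked = sorted(range(n), key=lambda i: sum(v is None for v in data[i]))
    chosen = ranked[:n_ref]
    return [list(data[chosen[i % n_ref]]) for i in range(n)]


def draw_patterns(prob_missing, n_patterns, rng):
    """Bernoulli trials per entry: 0 = missing, 1 = observed."""
    return [
        [[0 if rng.random() < p else 1 for p in row] for row in prob_missing]
        for _ in range(n_patterns)
    ]
```

Each drawn pattern is then used to mask entries of the reference data before an imputation algorithm is applied, so that the masked values serve as the known truth.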

2.4.3. Imputation Algorithms

By applying DIMA, 30 imputation algorithms from 12 R packages are evaluated. They are based on a variety of imputation strategies such as single-value approaches (mean, minimum), local similarity approaches (KNN, random forest, regression models), and global structure approaches (sequential, PCA, SVD). The algorithms are depicted in the Supporting Information, Table S1, with their package name, reference, and a short explanation.

2.4.4. Ranking of Imputation Algorithms

To compare the performance of the different imputation algorithms, the root mean square error
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (R_i - I_i)^2} \qquad (4)$$
between the protein intensities $R_i$ of the reference data and the imputed protein intensities $I_i$ is calculated. Here, n denotes the total number of imputed entries.
The imputation algorithms are ranked by their RMSE for each generated pattern. The approach with the lowest mean rank over all pattern simulations is recommended as the best-performing algorithm for the original data O. If for any reason an imputation algorithm fails, the highest rank is assigned for this respective imputation. Thus, the algorithm may still be recommended by DIMA, even though this is unlikely. An imputed data set that still contains MVs is by definition treated as an imputation failure.
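The ranking rule can be illustrated with a short Python sketch. It assumes that "highest rank" for a failed imputation means the worst possible rank (the number of competing algorithms) and it does not treat ties specially; the function names and the dictionary layout are hypothetical.

```python
import math


def rmse(reference, imputed, pattern):
    """RMSE over the entries that were masked (pattern == 0) and then imputed."""
    squared = [
        (r - v) ** 2
        for r_row, i_row, p_row in zip(reference, imputed, pattern)
        for r, v, p in zip(r_row, i_row, p_row)
        if p == 0
    ]
    return math.sqrt(sum(squared) / len(squared))


def rank_algorithms(rmse_per_pattern):
    """Sort algorithms by their mean rank over all patterns (rank 1 = lowest RMSE).

    rmse_per_pattern: {algorithm: [RMSE for each pattern]};
    None marks a failed imputation.
    """
    names = list(rmse_per_pattern)
    n_patterns = len(next(iter(rmse_per_pattern.values())))
    rank_sums = dict.fromkeys(names, 0.0)
    for k in range(n_patterns):
        succeeded = sorted(
            (n for n in names if rmse_per_pattern[n][k] is not None),
            key=lambda n: rmse_per_pattern[n][k],
        )
        for position, n in enumerate(succeeded, start=1):
            rank_sums[n] += position
        for n in names:
            if rmse_per_pattern[n][k] is None:
                rank_sums[n] += len(names)  # failure: worst possible rank
    mean_rank = {n: rank_sums[n] / n_patterns for n in names}
    return sorted(mean_rank.items(), key=lambda item: item[1])
```

The first element of the returned list corresponds to the algorithm that would be recommended under this rule.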
Most commonly, the accuracy of the imputed point estimates to the underlying truth is desired for downstream analyses including estimation of the protein fold changes or cluster analysis. For statistical tests, however, the variability of a data set is also crucial and it might not be reproduced by the imputed point estimates. In such a setting, alternative criteria have to be applied to rank the imputation algorithms. For statistical interpretation based on the t-test, the RMSEt ≔ RMSE(tR, tI) serves as rank criterion, where t is the t-test statistic calculated from the observed data R and the imputed data I. To define the null hypothesis H0, the group assignments of the samples have to be specified by the user.
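As an illustration of the t-statistic-based criterion, the sketch below computes one t-statistic per protein for the reference and for the imputed data and then takes the RMSE between the two vectors. The text does not state which variant of the t-test is used, so Welch's unequal-variance statistic is an assumption here, as are the function names; both matrices are assumed to be complete and every group needs at least two samples with non-zero variance.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    # Welch's t-statistic for two independent samples.
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


def rmse_t(reference, imputed, groups):
    """RMSE between the per-protein t-statistics of two complete matrices.

    groups: one 0/1 label per sample (column) defining the two conditions.
    """
    def t_stats(matrix):
        stats = []
        for row in matrix:
            g0 = [v for v, g in zip(row, groups) if g == 0]
            g1 = [v for v, g in zip(row, groups) if g == 1]
            stats.append(welch_t(g0, g1))
        return stats

    t_ref, t_imp = t_stats(reference), t_stats(imputed)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_ref, t_imp)) / len(t_ref))
```

An imputation that reproduces the reference data exactly yields a value of zero; larger values indicate that the test statistics, and hence the downstream test results, are distorted by the imputation.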
To verify that the selected imputation approach preserves the variability of the data, the differences in variances of the observed data R and the imputed data I are examined with a two-sample F-test. Here, the p-value
(5)
under $H_0$ of equal variances $\sigma_R^2 = \sigma_I^2$ is the criterion of choice.

3. Implementation

DIMA is implemented in MATLAB (30) and R (31). The MATLAB implementation is part of the OmicsData repository at github.com/kreutz-lab/OmicsData and the R implementation can be found at github.com/kreutz-lab/DIMAR. All analyses in the following sections were performed in MATLAB on an Intel Xeon E5-2640v3 CPU with 16 cores on the BwForCluster MLS&WISO Development.

3.1. MATLAB Implementation

A proteomics data object is created by
O = OmicsData(file);
Accepted input formats are .xls, .xlsx, .csv, .tsv, .txt, and .mat files as well as numeric data matrices. A pre-processing step may be applied by
O = OmicsFilter(O,nacut,logflag,scaleflag);
This removes features with a higher MV proportion than the nacut threshold. The flags logflag and scaleflag can be logical or string inputs. They define if a log 2 or log 10 transformation and a median or mean normalization should be performed. DIMA is executed via
O = DIMA(O,[algorithms],[bio]);
By default, the imputation algorithms of Table S1 are applied. Alternatively, they can be specified by the user. A fast version, which only runs the nine most frequently recommended algorithms based on the 142 PRIDE data sets, is also implemented and available by setting the argument algorithms to “fast”. The optional third input argument is a flag if biological information such as protein ratios, molecular weight, or scores should be taken into account. After applying DIMA the suggested algorithm and the respective imputation are stored in the proteomics data object and can be accessed via
algorithm = get(O,'DIMA');
data = get(O,'data');

3.2. R Implementation

The DIMA implementation in R is located in the DIMAR package:
devtools::install_github("kreutz-lab/DIMAR")
library(DIMAR)
The algorithm is applied to the input data mtx in a single function call:
Imp <- DIMAR::dimar(mtx, pattern=NULL, methods='fast')
mtx should be given in a matrix format. Alternatively, the name of a MaxQuant output file, such as the proteinGroups text file, can be passed to the function as a string. DIMA is applied to those columns of the data whose column name matches the argument pattern. The imputation algorithms can be defined in methods. By default, the nine most frequently selected algorithms on the 142 PRIDE data sets are applied.

4. Results

DIMA is first demonstrated on one of the PRIDE data sets. Then, to show its broad range of applicability and to evaluate its performance, DIMA is applied to 142 experimental data sets from the PRIDE database (Section 2.2) and to the simulated MNAR and MCAR data sets (Section 2.3).

4.1. Illustration Data

Figure 1 illustrates the DIMA workflow on the LC-MS/MS illustration data set. The protein abundances and MV distribution in the reference data R and the input data O are of great similarity (see the Supporting Information, Figures S1 and S2, for more details). To evaluate various imputation algorithms, the absolute deviations of the imputed data from the original values are calculated and the algorithms are sorted according to their RMSE (Figure 1.4). The performance measures RMSE, RMSEt of the t-statistics, and the p-value (pF) of the F-statistics with their respective ranking are depicted in Table 1 for the eight best-performing and the two least-performing imputation algorithms.
Table 1. To compare the imputed data values to the reference data, the RMSE, the RMSE of the t-statistics, and the p-value of the F-statistics as well as their respective rankings are calculated. The eight best-performing and the two least-performing imputation algorithms on the illustration data set are shown. In green, the best-performing algorithms for the respective criterion are highlighted with decreasing transparency. The algorithm selection by DIMA depending on the ranking criterion is highlighted in bold.

The best-performing algorithm for this exemplary data set is the sequential algorithm impSeqRob from the R package rrcovNA (32) with an average rank of 1.5 over all pattern simulations, followed by the algorithm impSeq from the same R package. The four best-performing algorithms impSeqRob and impSeq as well as the PCA-based algorithms imputePCA and MIPCA from the R package missMDA (33) perform relatively well for all three ranking criteria, which can further be generalized for the remaining investigated PRIDE data sets (Section 2.2). The other imputation algorithms result in higher RMSEs.
The computation time of the individual imputation algorithms ranges from 3 to 120 s. DIMA applied to the illustration data set takes 18 min in total. All imputation methods show a positive correlation between the original and the imputed data values with a Pearson correlation coefficient between 0.59 and 0.99. Most algorithms overestimate the intensities, whereas the algorithms of the R package imputeLCMD, which assume left-censored data, underestimate them (Supporting Information Figure S5).
To determine if the protein intensities with and without the treatment of cells with 150 μM proprotein convertase (PC) inhibitor differ significantly from each other, the t-test is applied. The RMSE of the t-statistics over all identified proteins is the smallest for the imputation algorithm imputePCA. Hence, imputePCA is suggested by DIMA for statistical testing. The null hypothesis of equal variances of the original and the imputed data is rejected (pF < 0.01) for 18 and not rejected for 5 out of 23 imputation algorithms. In our example, those 5 algorithms also perform best with respect to the RMSE and RMSEt.
Proteins with complete missingness are also reflected in the patterns PR of MVs in the reference data and thus play a role in the evaluation of the imputation algorithms. However, the R packages pcaMethods and impute are not able to deal with complete missingness. If such proteins are present in the input data, the user can consider removing them from the data beforehand.
The illustration data set comprises the proteins of the WT secretome experiment. (24) The reference data and simulated patterns of MVs for the peptide data of the same experiment are shown in the Supporting Information Figure S4. In principle, DIMA can be applied to any data in a matrix format.

4.2. PRIDE Data

To demonstrate its general applicability and to evaluate the various imputation algorithms on multiple data sets, DIMA is applied to 142 data sets from the PRIDE database (see Section 2.2). For 47% of the data sets, DIMA proposes the robust sequential algorithm impSeqRob and for 25% the sequential algorithm impSeq (Figure 2A). Both approaches are included in the rrcovNA package and are based on sequential imputation of each MV by minimizing the determinant of the covariance matrix. Furthermore, frequently selected imputation algorithms are the random forest algorithm missForest (13%) and the PCA algorithms imputePCA (10%), ppca (1.5%), and bpca (1.5%).

Figure 2. DIMA is applied and evaluated on 142 PRIDE data sets. (A) Nine algorithms compete for being recommended as the best-performing algorithm. The R package rrcovNA with its algorithms impSeqRob (47%) and impSeq (25%) is selected most frequently, followed by missForest in 13% and imputePCA in 10%. For 5% of the PRIDE data sets, another algorithm is suggested. (B) The rank of the imputation algorithms obtained in the 142 PRIDE data sets is shown as a box plot. The seven algorithms with the lowest median rank are also the seven most frequently selected algorithms by DIMA (A). The algorithms with a median rank lower than 5 are highlighted in green, and algorithms with a median rank greater than 20 are highlighted in red.

The algorithms most frequently selected by DIMA also show generally good performance on the 142 PRIDE data sets. The eight proposed algorithms have the lowest median rank, except for the algorithm softImpute, which has a median rank of 28 and is the best-performing algorithm for just one data set (Figure 2B). The algorithms impSeqRob and impSeq from the R package rrcovNA show the best performance with a median rank of 2, followed by the R package missMDA (imputePCA and MIPCA) and missForest. On average, DIMA takes 6.44 ± 0.06 min CPU time per data set.
To show the broad applicability of DIMA, in addition to the PRIDE data sets processed using MaxQuant, DIMA is applied to the quantitative output of other proteomics and metabolomics software. Khoonsari et al. (34) identified the peptides of the cerebrospinal fluid proteome and analyzed the same raw data with five different proteomics software: DecyderMS (GE healthcare), MaxQuant, (25) OpenMS, (35) PEAKS (Bioinformatics Solutions Inc.), and Sieve (Thermo). DIMA is applied to all five data sets normalized to the spiked-in chicken ovalbumin. In this example, irrespective of the applied proteomics software, the same imputation algorithm, the sequential algorithm impSeq, is selected as the best-performing imputation algorithm. Thus, in this case, the kind of proteomics software does not seem to be decisive for the selection of the imputation algorithm. For PXD002099 (36) and PXD002885 (37) processed with Progenesis QI (Nonlinear Dynamics, Waters, U.K.) DIMA selected the robust sequential algorithm impSeqRob and the Bayesian principal component analysis (bpca) algorithm, respectively. For the metabolomics MTBLS738 (38) data set DIMA selected the random forest algorithm missForest as the best-performing imputation method.

4.3. Simulation Study

To investigate whether DIMA is capable of reliably identifying a high-performing imputation approach, a data simulation study is conducted where the ground truth for the intensities of MVs is known. The data is simulated as described in Section 2.3 with MV ∈ {5, 10, 15, 20, 25, 30, 35, 40, 45, 50} % and MNAR ∈ {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}%.
First, the simulated data sets S with incorporated MVs are imputed and compared to the simulated ground truth Y. This assessment and the respective ranking of the methods are hereafter referred to as direct imputation assessment. In such a setting, the aim of DIMA is to predict a high-performing method of this direct imputation assessment only based on S. Therefore, the direct imputation assessment serves as the assumed true algorithm ranking against which the recommendation of DIMA is compared.
The RMSE difference of the algorithm recommended by DIMA compared to direct imputation assessment is averaged over 500 data simulations and is displayed in Figure 3. This is shown for different combinations of MV percentages and MNAR/MCAR ratios. The mean RMSE difference is below 5% in 60% of the MV and MNAR/MCAR combinations.

Figure 3. Performance of DIMA is evaluated on simulated data S with the incorporation of various proportions of MV and MNAR/MCAR ratios. The RMSE (color-coded) and rank (first entry) obtained by the best-performing imputation algorithm recommended by DIMA (second entry) compared to direct imputation assessment (third entry) over 500 data simulations are calculated. The algorithm recommended by DIMA is within the top three out of 27 approaches in all cases. (A) For MV < 20%, the additive regression algorithm aregImpute with type regression (reg) outperforms the other algorithms. (B) Between 20 and 30% MVs, several algorithms compete against each other. (C) For MV > 30%, the random forest algorithm missForest performs best.

In each image section of Figure 3, the rank (first entry) of the algorithm recommended by DIMA (second entry) compared to the best-performing algorithm with direct imputation assessment (third entry) is averaged over all simulated data sets S. The average rank is between 1.2 and 2.2 indicating that DIMA recommends one of the best three out of 27 algorithms independent of the number of MVs or the MNAR/MCAR ratio. The most frequent algorithm recommendations by DIMA and by direct imputation assessment are as follows: For MV < 20% (Figure 3A) the additive regression algorithm aregImpute with type regression from the R package Hmisc. Between an MV percentage of 15 and 35% (Figure 3B) the algorithms aregImpute, impSeq, and missForest compete as the best-performing imputation algorithm depending on the simulated data set. For MV > 30% (Figure 3C) the random forest algorithm missForest is most frequently recommended.

5. Discussion

Despite enormous progress, modern MS-based proteomics may still suffer from the presence of MVs (1−3) and no general guidelines on how to deal with MVs exist. (5,12,22) Poor-performing MV imputation methods as well as performing no imputation at all may lead to an estimation bias and negatively affect the peptide/protein quantification and in turn the downstream analyses. (9−11) In contrast, performing imputation benefits statistical analyses as well as clustering methods as it increases data completeness.
Many review articles and benchmark studies investigate the performance of different imputation strategies applied to specific data types and acquisition techniques. (5,16,18,20,21) However, many authors claim that the results cannot be adapted to different data settings and a general recommendation is urgently needed. (5,12,22) This prompted us to develop DIMA, a general decision guide applicable to any proteomics data set, which reproduces the individual occurrence pattern of MVs and thereby suggests a suitable imputation method.
Although the mechanisms underlying MVs are diverse, the logistic regression model and the algorithm selection are able to capture this distribution without a priori knowledge. This is demonstrated by a simulation study where data with different MV percentages and MNAR/MCAR ratios are simulated and DIMA suggests one of the top three ranked algorithms compared to direct imputation assessment in all cases (Figure 3). Here, several algorithms show comparably good performance which indicates that, generally, selecting a well-performing imputation algorithm is sufficient. However, it is crucial to avoid a bad-performing imputation which may over- or underestimate the true data values (Supporting Information Figure S5) and therefore lead to analysis bias.
The selection and ranking of the imputation algorithms applied to 142 PRIDE data sets (Figure 2) reveal a similar algorithm selection as with NAguideR: (23) Examples of well-performing algorithms are ImpSeq, ImpSeqRob, and bpca; medium-performing methods are kNN, rf, cart, and norm; and SVD, mean/median, and MinDet/MinProb/QRILC seem to perform poorly. The best-performing imputation algorithms are the sequential algorithms of the R package rrcovNA, which can thus serve as the best a priori choice for imputation. However, we recommend using a decision tool like DIMA or NAguideR to aid the decision process for a suitable imputation algorithm.
Selecting an imputation algorithm highly depends on the downstream analysis steps. In most cases, evaluation based on the average distance to the truth is appropriate. For statistical testing, however, not only the average distance but also the variance should be maintained. Thus, besides the RMSE being an indicator for the accuracy of point estimates, further criteria such as RMSEt for the proximity of the t-statistics or the p-value pF of the F-statistics reflecting the variations in variances should be used to select an appropriate imputation algorithm.
For large data sets and/or time-saving purposes, one could consider omitting the calculation of imputation algorithms that are not expected to perform well based on prior knowledge, e.g., algorithms that were not suggested by DIMA for the PRIDE data sets. DIMA offers the possibility to specify the evaluated imputation approaches and provides a fast version including the nine algorithms that are most frequently recommended for the PRIDE data sets.
Methodically, DIMA is applicable to any data set in a matrix format independent of its acquisition. Specifically, data-dependent, data-independent, bottom-up, targeted, label-free, isotope labeling and more proteomics data sets can highly benefit from applying DIMA as it increases the quantification of peptides and proteins. In addition, DIMA could also be beneficial for data of other high-throughput techniques such as bulk or single-cell RNAseq data.
In summary, DIMA provides a realistic and data-dependent performance assessment and thereby suggests a suitable imputation algorithm. Its performance and effectiveness are demonstrated on simulated data sets as well as on 142 quantitative MS data sets.

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00119.

  • Sigmoidal decrease of missing values for higher protein intensities (Figure S1); missing value distribution per sample and per protein (Figure S2); distribution of the estimated logistic regression coefficients (Figure S3); DIMA analysis at the peptide level (Figure S4); density plot of the imputed compared to the original data values (Figure S5); principal component analysis before and after imputation (Figure S6); DIMA Implementation (Figure S7); characteristics of the 30 applied imputation algorithms (Table S1); and characteristics of the PRIDE data sets assessed with DIMA (Table S2) (PDF)

Author Information

  • Corresponding Authors
    • Janine Egert - Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; ORCID: https://orcid.org/0000-0002-2032-7081; Email: [email protected]
    • Clemens Kreutz - Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Center for Data Analysis and Modeling (FDM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Email: [email protected]
  • Authors
    • Eva Brombacher - Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany; Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; Spemann Graduate School of Biology and Medicine (SGBM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany; Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
    • Bettina Warscheid - Biochemistry and Functional Proteomics, Institute of Biology II, Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany; ORCID: https://orcid.org/0000-0001-5096-1975
  • Notes
    The authors declare no competing financial interest.

Acknowledgments

This work was supported by the Federal Ministry of Education and Research of Germany [EA:Sys,FKZ031L0080 to J.E. and C.K.]; the Deutsche Forschungsgemeinschaft (German Research Foundation) [CIBSS-EXC-2189-2100249960-390939984 to E.B., B.W., and C.K., Project-ID 403222702278002225/SFB 1381 to B.W., FOR 2743 to B.W., TRR 130 to B.W.], and the European Research Council H2020 [648235 to B.W., Marie Sklodowska Curie grant 812968 to B.W.]. The authors acknowledge support from the state of Baden-Württemberg through bwHPC and the Deutsche Forschungsgemeinschaft through grant INST 35/1134-1 FUGG. The authors gratefully thank Lena Reimann, Wignand Mühlhäuser, and Friedel Drepper for fruitful discussions on the topic.

References

This article references 38 other publications.

  1. McGurk, K. A. The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination. Bioinformatics 2020, 36, 2217–2223, DOI: 10.1093/bioinformatics/btz898
  2. Poulos, R. C. Strategies to enable large-scale proteomics for reproducible research. Nat. Commun. 2020, 11, 3793, DOI: 10.1038/s41467-020-17641-3
  3. Brenes, A.; Hukelmann, J.; Bensaddek, D.; Lamond, A. I. Multibatch TMT Reveals False Positives, Batch Effects and Missing Values. Mol. Cell. Proteomics 2019, 18, 1967–1980, DOI: 10.1074/mcp.RA119.001472
  4. Wei, R.; Wang, J.; Su, M.; Jia, E.; Chen, S.; Chen, T.; Ni, Y. Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data. Sci. Rep. 2018, 8, 663, DOI: 10.1038/s41598-017-19120-0
  5. Webb-Robertson, B.-J. M.; Wiberg, H. K.; Matzke, M. M.; Brown, J. N.; Wang, J.; McDermott, J. E.; Smith, R. D.; Rodland, K. D.; Metz, T. O.; Pounds, J. G.; Waters, K. M. Review, Evaluation, and Discussion of the Challenges of Missing Value Imputation for Mass Spectrometry-Based Label-Free Global Proteomics. J. Proteome Res. 2015, 14, 1993–2001, DOI: 10.1021/pr501138h
  6. Lazar, C.; Laurent, G.; Myriam, F.; Christophe, B.; Thomas, B. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. J. Proteome Res. 2016, 15, 1116–1125, DOI: 10.1021/acs.jproteome.5b00981
  7. Rubin, D. B. Inference and missing data. Biometrika 1976, 63, 581–592, DOI: 10.1093/biomet/63.3.581
  8. Karpievitch, Y. V.; Dabney, A. R.; Smith, R. D. Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinf. 2012, 13, S5, DOI: 10.1186/1471-2105-13-S16-S5
  9. Välikangas, T.; Suomi, T.; Elo, L. L. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation. Briefings Bioinf. 2017, 19, 1344–1355, DOI: 10.1093/bib/bbx054
  10. Wang, J.; Li, L.; Chen, T.; Ma, J.; Zhu, Y.; Zhuang, J.; Chang, C. In-depth method assessments of differentially expressed protein detection for shotgun proteomics data with missing values. Sci. Rep. 2017, 7, 3367, DOI: 10.1038/s41598-017-03650-8
  11. Janssen, K. J.; Donders, A. R. T.; Harrell, F. E.; Vergouwe, Y.; Chen, Q.; Grobbee, D. E.; Moons, K. G. Missing covariate data in medical research: To impute is better than to ignore. J. Clin. Epidemiol. 2010, 63, 721–727, DOI: 10.1016/j.jclinepi.2009.12.008
  12. Brock, G. N.; Shaffer, J. R.; Blakesley, R. E.; Lotz, M. J.; Tseng, G. C. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinf. 2008, 9, 12, DOI: 10.1186/1471-2105-9-12
  13. To, K. T.; Fry, R. C.; Reif, D. M. Characterizing the effects of missing data and evaluating imputation methods for chemical prioritization applications using ToxPi. BioData Min. 2018, 11, 10, DOI: 10.1186/s13040-018-0169-5
  14. Poyatos, R.; Sus, O.; Badiella, L.; Mencuccini, M.; Martinez-Vilalta, J. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information. Biogeosciences 2018, 15, 2601–2617, DOI: 10.5194/bg-15-2601-2018
  15. Lenz, M.; Schulz, A.; Koeck, T.; Rapp, S.; Nagler, M.; Sauer, M.; Eggebrecht, L.; Cate, V. T.; Panova-Noeva, M.; Prochaska, J. H.; Lackner, K. J.; Münzel, T.; Leineweber, K.; Wild, P. S.; Andrade-Navarro, M. A. Missing value imputation in proximity extension assay-based targeted proteomics data. PLoS One 2020, 15, e0243487, DOI: 10.1371/journal.pone.0243487
  16. Bramer, L. M.; Irvahn, J.; Piehowski, P. D.; Rodland, K. D.; Webb-Robertson, B.-J. M. A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun Proteomics. J. Proteome Res. 2021, 20, 1–13, DOI: 10.1021/acs.jproteome.0c00123
  17. de Souto, M. C. P.; Jaskowiak, P. A.; Costa, I. G. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinf. 2015, 16, 64, DOI: 10.1186/s12859-015-0494-3
  18. Liu, M.; Dongre, A. Proper imputation of missing values in proteomics datasets for differential expression analysis. Briefings Bioinf. 2020, 0, bbaa112, DOI: 10.1093/bib/bbaa112
  19. Rodwell, L.; Lee, K. J.; Romaniuk, H.; Carlin, J. B. Comparison of methods for imputing limited-range variables: a simulation study. BMC Med. Res. Methodol. 2014, 14, 57, DOI: 10.1186/1471-2288-14-57
  20. Kruttika, D.; Simion, K.; R, J. M.; J, P. S. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Datasets. bioRxiv 2020, 139, DOI: 10.1101/2019.12.11.123456
  21. Jin, L.; Bi, Y.; Hu, C.; Qu, J.; Shen, S.; Wang, X.; Tian, Y. A comparative study of evaluating missing value imputation methods in label-free proteomics. Sci. Rep. 2021, 11, 1760, DOI: 10.1038/s41598-021-81279-4
  22. Audoux, J.; Salson, M.; Grosset, C. F.; Beaumeunier, S.; Holder, J.-M.; Commes, T.; Philippe, N. SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines. BMC Bioinf. 2017, 18, 428, DOI: 10.1186/s12859-017-1831-5
  23. Wang, S.; Li, W.; Hu, L.; Cheng, J.; Yang, H.; Liu, Y. NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses. Nucleic Acids Res. 2020, 48, e83, DOI: 10.1093/nar/gkaa498
  24. Rose, M.; Duhamel, M.; Aboulouard, S.; Kobeissy, F.; Rhun, E. L.; Desmons, A.; Tierny, D.; Fournier, I.; Rodet, F.; Salzet, M. The Role of a Proprotein Convertase Inhibitor in Reactivation of Tumor-Associated Macrophages and Inhibition of Glioma Growth. Mol. Ther.--Oncolytics 2020, 17, 31–46, DOI: 10.1016/j.omto.2020.03.005
  25. Cox, J.; Mann, M. MaxQuant enables high peptide identification rates and individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367–1372, DOI: 10.1038/nbt.1511
  26. O’Brien, J. J.; Gunawardena, H. P.; Paulo, J. A.; Chen, X.; Ibrahim, J. G.; Gygi, S. P.; Qaqish, B. F. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann. Appl. Stat. 2018, 12, 2075–2095, DOI: 10.1214/18-AOAS1144
  27. Marquardt, D. W. Comment - You should standardize the predictor variables in your regression models. J. Am. Stat. Assoc. 1980, 75, 87–91, DOI: 10.1080/01621459.1980.10477430
  28. Menard, S. Standards for standardized logistic regression coefficients. Soc. Forces 2011, 89, 1409–1428, DOI: 10.1093/sf/89.4.1409
  29. Kreutz, C. New Concepts for Evaluating the Performance of Computational Methods. IFAC-PapersOnLine 2016, 49, 63–70, DOI: 10.1016/j.ifacol.2016.12.104
  30. MATLAB. 9.8.0.1538580 (R2020a); The MathWorks Inc.: Natick, Massachusetts, 2020.
  31. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2021.
  32. Stekhoven, D. J.; Buhlmann, P. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118, DOI: 10.1093/bioinformatics/btr597
  33. Josse, J.; Husson, F. Handling missing values in exploratory multivariate data analysis methods. J. Soc. Fr. Stat. 2012, 153, 79–99
  34. Khoonsari, P. E.; Häggmark, A.; Lönnberg, M.; Mikus, M.; Kilander, L.; Lannfelt, L.; Bergquist, J.; Ingelsson, M.; Nilsson, P.; Kultima, K.; Shevchenko, G. Analysis of the Cerebrospinal Fluid Proteome in Alzheimer’s Disease. PLoS One 2016, 11, e0150672, DOI: 10.1371/journal.pone.0150672
  35. Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinf. 2008, 9, 163, DOI: 10.1186/1471-2105-9-163
  36. Pursiheimo, A.; Vehmas, A. P.; Afzal, S.; Suomi, T.; Chand, T.; Strauss, L.; Poutanen, M.; Rokka, A.; Corthals, G. L.; Elo, L. L. Optimization of Statistical Methods Impact on Quantitative Proteomics Data. J. Proteome Res. 2015, 14, 4118–4126, DOI: 10.1021/acs.jproteome.5b00183
  37. Govaert, E.; Van Steendam, K.; Scheerlinck, E.; Vossaert, L.; Meert, P.; Stella, M.; Willems, S.; De Clerck, L.; Dhaenens, M.; Deforce, D. Extracting histones for the specific purpose of label-free MS. Proteomics 2016, 16, 2937–2944, DOI: 10.1002/pmic.201600341
  38. Calf, O. W.; van Dam, N. M.; Weinhold, A.; Huber, H.; Peters, J. L. MTBLS738: Glycoalkaloid composition explains variation in slug resistance in Solanum dulcamara. Oecologia 2018, 187, 495–506, DOI: 10.1007/s00442-018-4064-z

Cited By

This article is cited by 3 publications.

  1. Yannis Schumann, Julia E. Neumann, Philipp Neumann. Robust classification using average correlations as features (ACF). BMC Bioinformatics 2023, 24 (1) https://doi.org/10.1186/s12859-023-05224-0
  2. Janine Egert, Clemens Kreutz. Rcall: An R interface for MATLAB. SoftwareX 2023, 21, 101276. https://doi.org/10.1016/j.softx.2022.101276
  3. Christophe Vanderaa, Laurent Gatto. Replication of single-cell proteomics data reveals important computational challenges. Expert Review of Proteomics 2021, 18 (10), 835–843. https://doi.org/10.1080/14789450.2021.1988571

Figure 1. DIMA analysis pipeline illustrated on an LC-MS/MS data set. (24) The data are sorted from top to bottom according to the frequency of MVs and the mean intensity of the proteins. Likewise, the reference data R are sorted according to the mean protein intensities after considering pattern PR of step 3. (1) The pattern PO of MVs is learned by logistic regression using the protein and sample as factorial predictors and the mean protein intensity as a continuous predictor. (2) A reference data set R with few MVs is defined. (3) Various patterns PR of MVs are generated from the logistic regression model with the respective coefficients of step 1 and are incorporated into the reference data R. (4) Boxplots of the absolute imputation errors for multiple imputation algorithms. The circle indicates the median imputation deviation. The algorithms are ranked by their overall root mean square error (RMSE, red diamond). The algorithms can be divided into well-performing algorithms with an RMSE < 0.5 (green), medium-performing algorithms with 0.5 < RMSE < 3 (yellow), and poorly performing algorithms with RMSE > 3 (red). (5) The best-performing imputation algorithm on R (in this example impSeqRob) is recommended for the original data O, and imputation of O is conducted.
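The steps in the Figure 1 caption can be sketched in a few lines. This is a simplified illustration under loud assumptions, not the DIMA code: the reference data are simulated, the logistic coefficients `INTERCEPT` and `SLOPE` are assumed rather than fitted to an original data set O, only the mean protein intensity is used as a predictor (the protein and sample factors are omitted), and the two candidate imputers `row_mean` and `row_min` are hypothetical stand-ins for the algorithms that DIMA evaluates.

```python
import math
import random


def missing_probability(mean_intensity, intercept, slope):
    """Logistic model: probability that a value is missing given the protein mean."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * mean_intensity)))


def make_pattern(reference, intercept, slope, rng):
    """Draw a boolean mask (True = missing) for every entry of the reference."""
    mask = []
    for row in reference:
        p = missing_probability(sum(row) / len(row), intercept, slope)
        draws = [rng.random() < p for _ in row]
        if all(draws):
            draws[0] = False  # keep one observed value so the imputers have input
        mask.append(draws)
    return mask


def impute(reference, mask, fill):
    """Replace masked entries row by row with fill(observed values of that row)."""
    result = []
    for row, row_mask in zip(reference, mask):
        observed = [v for v, m in zip(row, row_mask) if not m]
        result.append([fill(observed) if m else v for v, m in zip(row, row_mask)])
    return result


def rmse_on_mask(reference, imputed, mask):
    """RMSE between reference and imputed values at the masked positions."""
    errors = [
        (r - i) ** 2
        for ref_row, imp_row, m_row in zip(reference, imputed, mask)
        for r, i, m in zip(ref_row, imp_row, m_row)
        if m
    ]
    if not errors:
        raise ValueError("pattern contains no missing values")
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(0)

# Step 2 stand-in: a simulated complete reference matrix R (proteins x samples).
reference = []
for _ in range(40):
    protein_mean = rng.uniform(20.0, 30.0)
    reference.append([rng.gauss(protein_mean, 0.5) for _ in range(6)])

# Step 1 stand-in: coefficients are assumed here instead of fitted to data O.
INTERCEPT, SLOPE = 11.0, -0.5

# Hypothetical candidate imputers; real candidates would be full algorithms.
candidates = {
    "row_mean": lambda obs: sum(obs) / len(obs),
    "row_min": min,
}

# Steps 3-4: several patterns, RMSE per algorithm averaged over the patterns.
patterns = [make_pattern(reference, INTERCEPT, SLOPE, rng) for _ in range(5)]
scores = {
    name: sum(rmse_on_mask(reference, impute(reference, m, fill), m) for m in patterns)
    / len(patterns)
    for name, fill in candidates.items()
}

# Step 5: rank by RMSE; the first entry would be the recommended algorithm.
ranking = sorted(scores, key=scores.get)
print(ranking, scores)
```

The negative slope makes low-intensity proteins more likely to lose values, which mimics the intensity-dependent missingness that the logistic regression of step 1 is meant to capture.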

Figure 2. DIMA is applied and evaluated on 142 PRIDE data sets. (A) Nine algorithms compete for being recommended as the best-performing algorithm. The R package rrcovNA with its algorithms impSeqRob (47%) and impSeq (25%) is selected most frequently, followed by missForest (13%) and imputePCA (10%). For 5% of the PRIDE data sets, another algorithm is suggested. (B) The rank of the imputation algorithms obtained in the 142 PRIDE data sets is shown as a box plot. The seven algorithms with the lowest median rank are also the seven algorithms most frequently selected by DIMA (A). The algorithms with a median rank lower than 5 are highlighted in green, and algorithms with a median rank greater than 20 are highlighted in red.
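The two summaries in Figure 2 (how often an algorithm is selected, and its median rank across data sets) can be computed from per-data-set ranks as in the following minimal sketch. The algorithm names `alg_a`, `alg_b`, `alg_c` and all ranks are invented toy values for illustration only; they are not results from the PRIDE analysis.

```python
import statistics
from collections import Counter

# Toy ranks (1 = best) of three hypothetical algorithms on four data sets.
ranks_per_dataset = [
    {"alg_a": 1, "alg_b": 2, "alg_c": 3},
    {"alg_a": 1, "alg_b": 3, "alg_c": 2},
    {"alg_a": 2, "alg_b": 1, "alg_c": 3},
    {"alg_a": 1, "alg_b": 2, "alg_c": 3},
]

# Panel A analogue: share of data sets in which each algorithm is ranked first.
winners = Counter(min(ranks, key=ranks.get) for ranks in ranks_per_dataset)
selection_share = {
    name: count / len(ranks_per_dataset) for name, count in winners.items()
}

# Panel B analogue: median rank of each algorithm across all data sets.
median_rank = {
    name: statistics.median(ranks[name] for ranks in ranks_per_dataset)
    for name in ranks_per_dataset[0]
}

print(selection_share)
print(median_rank)
```

The two summaries answer different questions: the selection share counts only first places, whereas the median rank also rewards algorithms that are consistently near the top without winning.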

Figure 3. Performance of DIMA is evaluated on simulated data S with the incorporation of various proportions of MVs and MNAR/MCAR ratios. The RMSE (color-coded) and rank (first entry) obtained by the best-performing imputation algorithm recommended by DIMA (second entry) compared to direct imputation assessment (third entry) over 500 data simulations are calculated. The algorithm recommended by DIMA is within the top three out of 27 approaches in all cases. (A) For MV < 20%, the additive regression aregImpute with type regression (reg) performs best; (B) between 20 and 30% MVs, several algorithms compete against each other; and (C) for MV > 30%, the random forest algorithm missForest performs best.
