LFQ-Based Peptide and Protein Intensity Differential Expression Analysis

Testing for significant differences in quantities at the protein level is a common goal of many LFQ-based mass spectrometry proteomics experiments. Starting from a table of protein and/or peptide quantities from a given proteomics quantification software, many tools and R packages exist to perform the final tasks of imputation, summarization, normalization, and statistical testing. To evaluate the effects of packages and settings in their substeps on the final list of significant proteins, we studied several packages on three public data sets with known expected protein fold changes. We found that the results between packages and even across different parameters of the same package can vary significantly. In addition to usability aspects and feature/compatibility lists of different packages, this paper highlights sensitivity and specificity trade-offs that come with specific packages and settings.


INTRODUCTION
Proteomics has become a key technology to understand and characterize protein expression, 1,2 interactions, and sequence modifications 3 in state-of-the-art biology research. 2 Quantitative bottom-up proteomics has been dominated by three different approaches: in vivo metabolic labeling, 4 in vitro labeling, 5 and label-free methods. 6 In quantitative label-free, no isotopes or labels are added to the sample and the samples are not multiplexed in the same runs. Label-free approaches typically require fewer sample experimental steps, and differential expression analysis can simultaneously be performed across many samples. 7 From the bioinformatics data analysis perspective, label-free methods and labeled experiments share multiple steps including mass spectra preprocessing, peptide identification, and protein inference. 8 For peptide/protein identification tasks, multiple bioinformatics tools are available such as MaxQuant, 9 MS-GF+, 10 and PeptideShaker 11 as well as cloudbased workflows like quantms 12 or Galaxy proteomics. 13, 14 However, the quantification step is significantly different, including multiple substeps such as feature retention time alignment and feature detection. 8,14,15 One of the main challenges in label-free based experiments is the high number of missing values across samples and replicates, which makes other substeps like protein expression normalization difficult. 15,16 The high number of missing values presented in LFQbased experiments has triggered the development of multiple R-packages including different algorithms for intensity normalization and imputation.
In this work, we studied multiple R-packages that enable the normalization, imputation, and differential expression analysis on LFQ-based intensity proteomics experiments. Previous works have mainly focused on evaluating the software that performs peptide identification, protein inference, 17 and the generation of protein intensity tables. 8,18,19 We first briefly describe the main packages and tools that enable the statistical analysis of LFQ data sets from peptide or protein intensity data. While multiple packages and tools are available for statistical analysis of these data, we selected some of the most relevant ones and novel implementations including MSstats, 20 Perseus, 21 Proteus, 22 prolfqua, 23 ProVision, 24 LFQ-Analyst, 25 Eatomics, 26 ProStaR, 27 and msqrob2. 28−30 Finally, we used three different data sets�UPS spiked data set, 31 large-scale mix data set 32 and toxicology data set 33 �to evaluate the performance of each tool and discuss some of the advantages and disadvantages of their use with different types of data sets.

R-PACKAGES FOR STATISTICAL ANALYSIS OF QUANTITATIVE LFQ-BASED DATA
Statistical validation and assessment of protein expression data have been dominated by the R language. R (https://www.rproject.org/) is a popular framework for statistics and machine learning analysis. Many R packages are developed for bioinformatics analysis and visualization, especially due to the rapid increase of libraries provided by Bioconductor. 34 In 2014, Gatto et al. 35 showed that R and Bioconductor are the perfect environments for statistical analysis of proteomics data and indispensable for computational proteomics research. In addition to R-packages, R-Shiny applications are commonly used in proteomics not only to perform the DE analysis but also to interactively explore the results. 36 R-Shiny is an R framework to build interactive web apps including dashboards and interactive plots (https://shiny.rstudio.com/). Table 1 presents a group of recently published or commonly used R-packages and R-Shiny tools for statistical analysis of quantitative LFQ-based data. More details on imputation and normalization methods can be found in Supplementary Table  1, including the description of each method, and the default methods for each tool. We selected the packages based on the number of uses in PubMed, the novelty of the algorithms and methods employed for the statistical analysis, the maintainability, and the user interface. We will not study independent scripts or adaptions of other omics packages to the field of proteomics. We focused the study on the imputation and normalization algorithms of each tool, other parts of the algorithms such as multiple testing correction are not explored because most of these tools use Benjamini-Hochberg by default. 37 Supplementary Tables 2 and 3 show the required parameters to run each tool and the quality metrics computed by each tool, respectively.

MSstats
MSstats 20 (https://github.com/Vitek-Lab/MSstats) is an open-source R-package for peptide and protein quantification in mass spectrometry-based proteomics. MSstats supports multiple data acquisition types: data-dependent acquisition (DDA), both LFQ and label-based workflows, data-independent acquisition (DIA), and targeted approaches. It uses a set of flexible linear mixed models to summarize the protein or peptide abundance in a single biological replicate or condition and perform the relative quantification of proteins across conditions. A simple tabular file containing the intensity of the identified peptides and proteins is needed as input. MSstats provides multiple adapters to transform the output of proteomics analysis tools like MaxQuant, Skyline 38 or OpenMS into MSstats input format. Remarkably, MSstats is one of the most documented and actively maintained tools for differential expression analysis.

Proteus and limma
Proteus 22 (https://github.com/bartongroup/Proteus) is used for DE analysis of MaxQuant output data, and differential expression analysis based on the popular algorithm/package limma. 39 Proteus supports two normalization methods: equalize median and quantile, and it uses a mean-variance relationship to estimate variance (limma) where data are missing. The Proteus Shiny application allows users to perform the analysis with one click if the data is provided in MaxQuant format. Shiny-based volcano plots and fold-change intensity plots enable users to interact with differentially expressed proteins of interest and to view the significant results in detail. One of the other advantages of Proteus is that it has a lot of default data processing parameters that enable nonexpert users to perform a robust analysis of their data without the need to try multiple combinations of parameters. However, no quality control (QC) report is provided by the package making it difficult to assess the quality of the data and reproducibility of the experiment.

prolfqua
prolfqua 23 (https://github.com/fgcz/prolfqua) integrates the basic steps of a differential expression analysis workflow: quality control, data normalization, protein aggregation, statistical modeling, hypothesis testing, and sample size estimation. The modular design of prolfqua enables users to select the optimal differential expression analysis algorithm. prolfqua supports four normalization methods: quantile, variance stabilizing normalization (vsn), 40 log2 transform, and z-scale, and it enables imputation by a group-mean model. It provides a set of reports for users such as peptide intensity variance across samples, a scatterplot matrix of intensity correlation across samples and replicates, and a heatmap of missing values clustered by samples (example report https:// fgcz.github.io/prolfqua/articles/QCandSampleSize.html). prolfqua supports the output of multiple tools including MaxQuant, 9 Skyline, 38 DIA-NN, 41 and MSstats. 20

ProVision
ProVision 24 (https://github.com/JamesGallant/ProVision) is an R-shiny web application to facilitate the analysis of LFQ and TMT proteomics experiments. ProVision is designed for endusers (e.g., biologists), with a set of graphical interfaces to guide the users through data processing, parameter selection, and result presentation. In addition, it provides several parameters for users to interact with important filtering and statistical processes. ProVision only supports the MaxQuant output format as input and requires users to manually annotate the experimental design. In addition, the tool lacks QC reports that enable understanding and assessing the quality of the data.

LFQ-Analyst
L F Q -A n a l y s t 2 5 ( h t t p s : / / g i t h u b . c o m / MonashBioinformaticsPlatform/LFQ-Analyst) is an interactive, R-Shiny-based platform for quickly and easily analyzing and visualizing unlabeled proteomics data preprocessed with MaxQuant. LFQ-Analyst can process LFQ intensity, and its quality control report contains multiple visualization plots (volcano plots, heatmaps, and box plots) of differentially expressed. However, LFQ-Analyst has fewer statistical parameters for users to choose from, supports only MaxQuant format files and needs users to manually annotate experimental design files before the analysis.

Eatomics
Eatomics 26 (https://github.com/Millchmaedchen/Eatomics) is also an R-shiny application for the interactive exploration of quantitative proteomics data from MaxQuant, integrating quality control, differential abundance analysis, and enrichment analysis. It has a variety of interactive exploration possibilities and a unique experimental design setup module that interactively transforms a given research hypothesis into a differential abundance and enrichment analysis formula. One advantage of Eatomics is that it has built-in detailed user tutorials to help users get started and can be run with one click after uploading the input file.

DAPAR and ProStaR
DAPAR and ProStaR 27 (http://www.prostar-proteomics.org/) are two tools dedicated to the discovery of differential analysis of quantitative data generated by proteomic experiments. DAPAR is an R-package which provides five processing steps (filtering, normalization, imputation, aggregation, and difference analysis), based on those functions, ProStaR provides an R-Shiny web platform for interactive exploring. The advantage of ProStaR lies in its ability to do online analysis and embed detailed user tutorials to help users get started. There are various types of inputs, such as proteinGroups.txt from MaxQuant, Proline 42 and MSnset 43 files. Users need to manually annotate the condition and select multiple parameters, and they do not provide QC reports on the analysis results or the original intensity data.

msqrob2
msqrob2 28−30 (https://github.com/statOmics/msqrob2/) is a free and open-source R package that can handle virtually any experimental proteomics design. msqrob supports multiple types of inputs, including MaxQuant, 9 moFF, 44 and mzTab. 45,46 msqrob2 can use both R script and Shiny application (https://github.com/statOmics/msqrob2gui) for analysis and has detailed instruction manuals and videos to help new users get started. However, no matter which method is used, users need to manually annotate the condition and cannot obtain the QC report.

Perseus
In addition to the open-source R packages and R-Shiny applications, we explore the performance of the popular tool Perseus for users familiar with MaxQuant. Perseus 21 (https:// maxquant.net/perseus/) is designed for DE analysis of quantitative results in the MaxQuant ecosystem. Perseus is a desktop application which offers a wide variety of algorithms for MaxQuant data normalization, imputation, batch correction and differential expression analysis. Users need to manually annotate the condition during data processing and can choose different types of intensity: raw intensities, LFQ intensity or IBAQ values. It provides quality control reports but requires users to generate them manually. The tool provides an extensive number of supporting materials and o n l i n e t u t o r i a l s ( h t t p s : / / w w w . y o u t u b e . c o m / @ MaxQuantChannel).

Benchmark Data Sets
Three data sets were used to evaluate each package. The LFQbased data set PXD000279 ("UPS spiked dataset"), 31 consists of two E. coli digested samples (with 4 replicates for each sample); each half of the samples are enriched with one of two "Universal Protein Standards" (UPS1 and UPS2). Both samples contained the same 48 recombinant human proteins, which were either mixed in equal amounts (UPS1) or spanned multiple orders of magnitude at a determined ratio (UPS2). Based on the experimental design, 40 of the 48 UPS proteins and none of the E. coli proteins should be detected as differentially expressed. This data set has been extensively used to evaluate LFQ-based differential expression tools 14,25,31 and algorithms. 15 In addition, two other more complex data sets were used PXD007145 32 ("large-scale mix dataset") and PXD020248 33 ("toxicology dataset"). The large-scale mix data set contains a multiple species mixture, in which Yeast proteome was diluted into fixed ratios of 1:4:10 and added to a background of 1:1:1 human proteome to simulate the real experimental data. Six technical replicates were used for each sample to measure the coefficient of variations. The toxicology data set is a cell line  33 benchmarked TMT and LFQ analytical methods using the same sample. In the present study, we used the LFQ part to benchmark the peptide intensity-based tools MSstats, Proteus, and msqrob2.

Peptide and Protein Quantitation Tools
MaxQuant. To evaluate each tool's parameter combinations, algorithms for data processing and protein quantification analysis, we analyzed the data sets with MaxQuant. 9 Raw data were processed with MaxQuant (version v1.6.10.43) before the DE analysis with each tool. We used default parameters except that "the min ratio of LFQ" was set as 1 and "matching between runs" was enabled. The results from MaxQuant analysis and the parameters file used can be downloaded from the following repository (https://github.com/ypriverol/ quantms-research/tree/main/r-research).
quantms. quantms (and its predecessor proteomicsLFQ 12 ) is a cloud-based workflow that uses OpenMS 47 tools and DIA-NN 41 to enable quantitative analysis of LFQ data-dependent (LFQ-DDA) and independent acquisition (LFQ-DIA) and TMT data (https://quantms.readthedocs.io/en/latest/). In the present work, we used the LFQ-DDA subworkflow of the pipeline on the three data sets to do the peptide quantification benchmarking. The subworkflow performs peptide identifications using Comet 48 and MSGF+ 10 and feature detection using proteomicsLFQ in OpenMS. 47 All the scripts and data used to generate the figures and tables of the paper can be found in the following repository (https://github.com/ypriverol/quantms-research/tree/main/ r-research). The parameter selections of Shiny tools are shown in Supplementary Note 1. Table 2 shows the performance of each evaluated R-package with the UPS spiked and large-scale mix data sets. Both data sets were analyzed with MaxQuant, and protein intensity tables were used as the starting point of the DE analysis. For each protein in the UPS spiked data set, if both the true and estimated protein abundance ratios are greater than 1 in the same direction and the adjusted p-value is less than 0.05, we define it as a "true positive". If the true protein abundance ratio is greater than 1, but the estimated ratio is not greater than 1 or the adjusted p-value is greater than 0.05, this protein is a "false negative". All the adjusted p-values we used were corrected by the Benjamini-Hochberg method for each tool, which is the common correction method for all tools evaluated. In the large-scale mix data set, if the estimated protein meets the previous TP definition and the species belongs to Yeast, we define it as a "true positive". If the protein does not satisfy the numerical condition but the species condition, it is a "false negative". While the normalization and imputation methods employed by each tool are different, with this benchmark we aim to find what combination of parameters on each tool provides higher positive predicted values (PPV) and Negative predicted values (NPV). Since the toxicology data set is not a gold-standard data set, PPV and NPV cannot be calculated.

Protein-Based Analysis
In the UPS spiked data set, Perseus, Eatomics, LFQ-Analyst, and Provision detected most of the true positive proteins (38 UPS proteins), but Proteus achieved the highest PPV but with the lowest TP. 31 Note that when filtering for an FDR of 5%, we must expect PPV ≥ 0.95, then all the combinations PPV ≥ 0.95 can be considered as good combinations (e.g., all Perseus combinations for UPS spiked data set). ProStaR achieved the lowest 84% PPV even if it had 100% NPV. Remarkably, most of the tools and combinations of parameters achieved Negative predicted values (NPV) higher than 0.95. Supplementary Table 4 shows large differences among combinations of parameters for the same tool (e.g., Proteus) resulting in different protein lists. For example, if a normalization method is applied with Proteus, more true positive proteins are DE quantified (33 versus 31 UPS proteins), while the number of false positives increases.
In addition, we compared the adjusted p-values obtained by the six different tools with the best combination of parameters (Best PPV and NPV results, Table 2) by Bland-Altman plot (Supplementary Figure 1). We did not try ProStaR because its default output does not display the list of adjusted p-values. For all these tests, the input protein expression tables were generated by MaxQuant to perform the differential expression analysis. Only the best combination for each tool is presented. Supplementary Table 4 contains all combinations' results. The definition of the best combination is that the higher the PPV with the same amount of TP proteins. If there is a large difference in the amount of TP proteins, the greater the amount of TP proteins, the better the combination.

Journal of Proteome Research pubs.acs.org/jpr
Technical Note The Bland-Altman plot 49 is a robust method to assess the agreement between two quantitative methods; allowing one to measure the agreement between methods by studying the mean difference and constructing limits of the agreement. 49 A lower difference is observed among the three tools: Perseus, Eatomics, and LFQ-Analyst, which can imply the results are alternative between them. However, Proteus and prolfqua adjusted p-values are different compared with other tools. When we compared the estimated log-fold change (log2FC) and the expected log-fold change for each tool, all Pearson correlations were higher than 0.85 (Supplementary Figure 2), which means that the log2FC is consistent with the expected value within a tool. It is important to note for Perseus users that the direct alternatives to the tool (ProVision, Eatomics, and LFQ-Analyst) will produce values between 0.7 and 0.8 Pearson correlation compared to Perseus. We draw the boxplot of estimated log2-fold changes produced by each tool for background proteins in the UPS spiked data set (Supplementary For the large-scale mix data set, prolfqua detected most of the true positive proteins (544 and 703) in both the 4:1 fold data set and 10:1 fold data set, but Perseus achieved better PPV in the former and LFQ-Analyst achieved better PPV in the latter. In the former data set (4:1 fold), ProStaR achieved the lowest 55% PPV but Eatomics returned the lowest 43 TP proteins. In the latter data set (10:1 fold), Eatomics achieved the lowest 33% PPV even though it had a 99.8% NPV, and it returned the lowest 49 TP proteins. Similarly, different combinations of parameters for the same tool produce different results. For example, when enabling the KNN imputation, LFQ-Analyst achieved the highest PPV (64%) while it recognized true positive proteins (335) are the lowest (Supplementary Table 4).

Peptide-Based Analysis
In addition to exploring the performance of these packages in analyzing MaxQuant protein results, we also explored Rpackage alternatives to analyze peptide-intensity data. Five packages (MSstats, Proteus, prolfqua, ProStaR, msqrob2) support the data analysis from peptide-level intensity tables (Table 1). We explored MSstats, Proteus, and msqrob2 with peptide-intensity results from quantms and MaxQuant. Previous benchmarks have been performed only using MaxQuant, making it difficult to carry the same conclusions to other quantification tools. quantms is a new cloud-based, open-source workflow that enables large-scale proteomics data analysis and generates peptide-level quantification and could benefit from this evaluation. The protein quantities have been generated by each R-package or tool starting from the peptide intensity table from quantms, and the uploaded tables by each tool can be viewed in Supplementary Table 2.
MSstats summarizes the peptide intensities into protein intensities, and linear modeling or Tukey's median polish is then performed at the protein level. 20 Similarly, the Proteus approach computes the protein-level intensities by calculating the mean of the three most intense peptides. Then, limma, which offers robust treatment of missing data, is used to perform the differential expression analysis. Furthermore, msqrob2 aggregates peptide intensities to protein expression values by the robust summarization method in the QFeatures package. 50 We did not evaluate prolfqua and ProStaR because they lack the support of the mzTab or MSstats file format. Table 3 shows the benchmark of all combinations of parameters, quantification tool (MaxQuant, quantms), and DE package.
MSstats outperform the other packages msqrob2 and Proteus in the large-scale and UPS spiked data sets when using both quantification tools MaxQuant and quantms ( Table  3). The only exception is the combination of UPS spiked data set and quantms, where Proteus outperforms MSstats. quantms and Proteus combinations returned more true positive proteins and higher PPV values than MaxQuant combinations.
We calculated the coefficient of variation (CV) distributions in Figure 1, which shows the CV distributions in at least 50% replicates for different tools and normalization methods. The CV is calculated from the standard deviation of protein intensities divided by the mean within a sample after normalization and aggregation by MSstats, Proteus, and msqrob2. Note that the intensity of the protein after aggregation generated by msqrob2 have negative values, and we deleted them when plotting. There are six technical replicates per sample in the large-scale mix data set, so we only calculate the CV of proteins quantified in at least 3 runs. From Figure 1, the MSstats package achieved a lower average CV of 15.0% across conditions compared with Proteus (average of 19.3%), and msqrob2 achieved the lowest average CV of 13.7%. Supplementary Figure 4  For a large-scale mix data set, the true positive rates are straightforward, and we can plot ROC curves based on the adjusted p-values for each case to assess the performance of MSstats, msqrob2, and Proteus ( Figure 2). We see that MSstats have higher accuracy, with area under the curve (AUC) values of 0.82 at a 4-fold change based on the median normalization method. However, msqrob2 performs better in other cases. Overall, MSstats and msqrob2 have similar performances for low fold change and they obtain the same Figure 1. Coefficient of variation of three normalization methods based on the six replicates of the large-scale mix data set for MSstats, Proteus, and msqrob2. The MSstats and Proteus used default imputation methods NaN and MVL, respectively, and the msqrob2 did not use the imputation method. The underlying protein quantities have been generated by the tools starting from the intensity peptide table output of quantms.  In addition to the previous two gold standard data sets, we analyzed the toxicology data set where no ratio is known between samples and no spike proteins have been used. Interestingly, in the original paper 33 the authors did not use any package for normalization, missing value imputation, or DE analysis. Supplementary Figure 6 shows the CVs for all replicates of the toxicology data set. For all normalization methods, MSstats shows lower CV values than Proteus; and the values for both tools are quite high compared with the previous analysis large-scale mix data set (Proteus: 19.3%, MSstats: 15.0% in the large-scale mix data set; Proteus: 28.0%, MSstats: 16.7%, msqrob2: 13.7% in the toxicology data set). For msqrob2, the performance of this tool in the toxicology data set (average of 17.6%) is not as good as that in the largescale mix data set (average of 13.7%). With a deeper look into the data sets, the large-scale mix data set, and the toxicology data set result from quantms we observed a significant variation in the number of missed cleavages and missing values ( Figure 3). The quantms quality control heatmap reports of both data sets (Supplementary Figure 7A,B) show this variation at the level of the MS run file. This may be the reason why we observed major differences when analyzing the number of DE proteins at different FDR thresholds for different normalization methods (Supplementary Figure 8) and a little agreement among tools for the final DE proteins (Supplementary Figure 9). However, MSstats and msqrob2 returned the same number of differentially expressed proteins (147 DEP) at a 0.05 p-value threshold. Intriguingly, Proteus shared less than 20% in all normalization cases with the original author's results while MSstats and the paper results shared around ∼40% of the reported proteins. While MSstats and msqrob2 perform relatively better for the toxicology data set, but the former performs better in the UPS spiked data set). We included these contradictory results to show that data quality may introduce more differences and variability in the results than the tool used to perform the DE analysis.

CONCLUSIONS
This study explores multiple R-packages and tools for differential expression data analysis of LFQ experiments based on protein and peptide intensity results. We explored eight different R or Shiny applications including MSstats, Proteus, prolfqua, ProVision, LFQ-Analyst, Eatomics, ProStaR, msqrob2, and the popular C# tool Perseus. Most of these packages (LFQ-Analyst, ProVision, Eatomics, prolfqua, msqrob2) have been created as an alternative to Perseus to analyze the output of MaxQuant in the R. MSstats, prolfqua, Proteus, msqrob2, and ProStaR can process not only protein intensity results but peptide-based results, making them more flexible and compatible with tools such as DIA-NN, quantms, Skyline, etc. MSstats is the most well-documented package and the one with more support. However, other packages like prolfqua have higher quality control capabilities, which are crucial in differential expression data analysis to detect variances and errors in specific samples or analytical steps.
The results highlighted that major differences for some tools and data sets are observed when protein level intensity from MaxQuant is used because the performance of some tools can fluctuate significantly depending on the analysis parameters and even the characteristics of the input data set. Most of the tools can correctly quantify true positives at low false positive rates, however, Perseus, prolfqua, and LFQ-Analyst were the tools that performed better among the benchmarked packages in the two data sets. We used Proteus, msqrob2, and MSstats to evaluate the performance of protein quantification based on peptide-level intensities data from the quantms workflow and MaxQuant. Overall, for both data sets analyzed (the large-scale mix data set and the toxicology data set) msqrob and MSstats provided lower CVs across replicates than Proteus and more accurate quantification of differential expression ratios in the large-scale mix data set. However, we observed that when the data sets presented higher CV values independently of the tools used for the analysis, the results were not consistent and difficult to compare across tools. This fact should trigger a more rigorous study including the possible development of guidelines to make the quantitative results reproducible and accurate, even if they pass the author's specific statistical tests.
At present, there are many algorithms and tools for identifying and quantifying proteomic data and R-packages to perform differential expression analysis. However, there may be some tools and algorithms that have not been evaluated in this article for retrieval methods or other reasons. Each package and tool provide different methods for normalization, imputation, visualization, and quality control of the DE protein results. Due to the diversity of statistical methods, algorithms, Figure 3. Distribution of metrics including contaminants, peptide intensity, charge, missed cleavages, identification rate over retention time, MS2 oversampling, and peptide missing values for the largescale mix data set and toxicology data set analyzed with quantms. The normalized metric value (y-axis) is the normalized value for each specific metric between 0 and 1.