Compositional Proteomics: Effects of Spatial Constraints on Protein Quantification Utilizing Isobaric Tags

Mass spectrometry (MS) has become an accessible tool for whole proteome quantitation with the ability to characterize protein expression across thousands of proteins within a single experiment. A subset of MS quantification methods (e.g., SILAC and label-free) monitor the relative intensity of intact peptides, where thousands of measurements can be made from a single mass spectrum. An alternative approach, isobaric labeling, enables precise quantification of multiple samples simultaneously through unique and sample specific mass reporter ions. Consequently, in a single scan, the quantitative signal comes from a limited number of spectral features (≤11). The signal observed for these features is constrained by automatic gain control, forcing codependence of concurrent signals. The study of constrained outcomes primarily belongs to the field of compositional data analysis. We show experimentally that isobaric tag proteomics data are inherently compositional and highlight the implications for data analysis and interpretation. We present a new statistical model and accompanying software that improves estimation accuracy and the ability to detect changes in protein abundance. Finally, we demonstrate a unique compositional effect on proteins with infinite changes. We conclude that many infinite changes will appear small and that the magnitude of these estimates is highly dependent on experimental design.


Compositional Data Analysis
Ignoring the compositional effects of ion sampling can lead to nonsensical comparisons as discussed in the main text. However, more subtle errors also exist. In the Figure 1C example, when estimating the ratio between channels 126 and 128, averaging the intensities prior to estimating the ratio, while not advised, will still provide reasonable results. The ratio after averaging would be ௫ା௬ ହ௫ାହ௬ = ଵ ହ , which is on target regardless of how greatly ‫ݔ‬ and ‫ݕ‬ diverge. Accordingly, the most important aspect of an analysis is correctly defining the relative quantities of interest. However, this sort of analysis is still suboptimal as lower intensities will contribute less to the average, even if the ratios are of identical quality.
Similarly, it should be noted that log linear models when used to estimate contrasts between conditions, will also agree with a basic log-ratio compositional analysis. This is because the models often have the same expected values, e.g. ‫(ܧ‬log ଶ ‫ݕ‬ ଵ ) − ‫(ܧ‬log ଶ ‫ݕ‬ ଶ ) = ‫(ܧ‬log ଶ ௬ భ ௬ మ ). However, this equivalence disappears when considering error estimation or models that include covariates, such as the MS2 model in this paper or a longitudinal model that seeks to estimate a time effect. Importantly, the improvement to accuracy seen with our MS2 model could not have been achieved by simply adding SSN as a covariate in regular log linear model. This is because the parameters of interest are contrasts and peptide parameters create a blocking structure. Consequently, the contrast is taken within each peptide block, and since each peptide has the same SSN across conditions, the effect on the contrast estimate will be non-existent.
The best way to avoid the mistakes discussed above is to use tools from compositional data analysis. The relevant theory is based on the concept of transforming from a constrained geometric space to unconstrained real space, performing a usual analysis, and transforming back as needed. Appropriate transformations include the additive log-ratio (ALR) 1 and the isometric S-4 log-ratio transformation (ILR) 2 . The ILR has the advantage of being isometric but at the cost of interpretability (transforming back to the proportions becomes essential). The ALR, while not isometric, provides valid inferences when using likelihood-based methods 1 and enables direct estimation of log-ratios between conditions, which are often of interest in a proteomics experiment.
From the simplex, we transform our data into real space with the ALR. Then we fit a linear model to the log-ratios of all peptides from a single protein and included a regression on isolation specificity (IS), where IS is defined as the proportion of signal in the MS1 isolation window belonging to the targeted mass (or its isotopes). At this point each outcome is defined as a logratio to an arbitrary reference channel. Notice that the ALR transformation reduces the dimension of the problem to two. This is because a composition with three parts only had two degrees of freedom (because of the constraint, knowing two parts tells us everything about the composition). With results obtained from two regressions in real space, the inverse of the ALR can then be used (if necessary) to create a single regression line in the simplex. Doing so here is informative.
Points in the ternary diagram seem to be pulled towards the center of the triangle and fitting a regression line on IS appears to explain much of the trend towards unity (grey dashed line). This suggests that predicting where the protein would have been if IS equaled one might mitigate some of the compression problems in isobaric tag proteomics data 3 . In Figure 3, the projected estimate is about half-way between the true value and the average intensity. It should be noted that adding isolation specificity to standard linear models for proteomics 4 would not achieve the same effect as estimating a slope that directly affects the log-ratios.

S-5
In a full dataset, making use of peptide-level covariates becomes substantially more difficult because of the unbalanced structure of the data. While certain proteins will have many peptides, others will have very few observations, making the estimation of reliable regression lines extremely difficult. Furthermore, as shown in Figure 2a, the relationship between our compression surrogate and the peptide ratios is not only a function of IS (or summed signal-tonoise), it also depends on the true protein fold-change. Proteins that do not change have zero compression, while proteins with large changes can be dramatically pulled towards unity.

Sharing information with Bayesian modelling
In proteomics it is common to analyze data for each protein independently. However, this can be problematic for proteins that only had a few peptide observations. In these cases regression lines will be fairly unreliable and using them to project relative protein abundance, for example when IS equals one, proves to often be detrimental. However, Figure 2D and E suggest that there may be a consistent relationship between our peptide level covariates and ratio compression that can be modelled across the whole dataset. We can see that the slope of the relationship should always be positive, it should be zero when the true change is zero, and should increase proportionally to the magnitude of the true change. These assumptions motivate the use of a single data-wide parameter that defines the relationship between a peptide level covariate and ratio compression. We incorporate this information into a non-linear Bayesian model for MS2 data defined below.
The unbalanced structure of the data also exacerbates the importance of error estimation. For proteins with only a single observed peptide, standard errors cannot be computed. With a small S-6 number of peptides, standard errors can be computed but are not robust to outliers. Both of these problems can be addressed by sharing information across proteins.
One solution to the error estimation problem would be to use a pooled variance estimate. In this case we would estimate one variance component that represented experimental error for every protein. This would immediately solve the problem of unreliable error estimates that will occur when no pooling is used. However, such an approach simply trades one problem for another.
While error estimates for proteins with little information may benefit from a pooled variance component, this approach would overestimate the error for proteins with lots of data and little variation.
For this reason we make use of partially pooled variance components 5 so that proteins with few peptides rely almost entirely on the average experimental error, but as more peptides are observed, the variation converges to the within protein variance. This can be accomplished by creating a hierarchical model for the variance components.
The hierarchical model for variance components provides substantially improved variance estimation which was responsible for the improved signal detection shown throughout the paper.
Beyond signal detection, an improved assessment of variability can protect against some wellknown sources of danger in proteomics data. Post translational modifications and protein isoforms can result in peptides that are assigned to the same protein in the database, but which have very different ratios. There is nothing in our modeling that accounts for this situation.
However, since we are improving the estimation of variance it may be possible to detect these scenarios by looking for proteins with many peptides but unusually high variance.

S-7
The Bayesian compositional models The compositional MS2 model that utilizes partially pooled variance estimation and a single compression parameter is defined as follows.
‫ݕ‬ is the additive log2 ratio of the ݇th peptide observed within protein ݅ in condition ݆. ߚ represents the average log ratio when summed-signal-to-noise equals zero. The prior was selected with a mean of zero to reflect the experimental assumption that most proteins will not be changed across conditions. The standard deviation of 10 was selected to be a sufficiently large distribution (these are on the log2 scale!) to refrain from effecting our estimation while still being informative enough to avoid sampling problems caused by sampling from unrealistic parameter values. This is often referred to as a weakly informative prior.

S-8
ܵܵܰ is the peptide level covariate describing the observed summed signal-to-noise across all channels.
The analysis in our paper contains predictions of the outcome when ܵܵܰ is at the 99 th percentile of observed summed signal-to-noise values. In theory, we would be interested in the value at the maximum, but these datasets often contain extreme outliers in summed signal-to-noise values which motivated a projection to the 99 th percentile. ߜ is a parameter used across the whole dataset to define the relationship between summed signal-to-noise and true fold change. Estimating a single ߜ parameter allows us to model compression even for proteins that have a small number of observed peptides. The uniform prior was selected since we did not know a priori what a reasonable range for this parameter might be, and after fitting the model the non-informative approach did not cause any problems with convergence.
Regarding the SSN adjustment in the MS2 model, it is worth noting that this approach is substantially different than previous efforts to reverse MS2 compression effects. Others have sought to reverse compression by adjusting the data with learned factors [6][7][8] . In general, these approaches eliminate compression while sacrificing precision by multiplying the data with a learned quantity. Our approach, while indirectly related to compression, does neither. Using a peptide-level covariate is a fundamentally different approach. It is more general, as peptide-level covariates do not necessarily relate to compression (any observed peptide level quantity that we expect to affect ratios could be added to our model). Even when the covariate used clearly relates to compression, as was the case with SSN, adjusting for the observed relationship only slightly mitigated the compression phenomenon.
But perhaps the most important difference is that using a covariate does not require altering the underlying data. Instead we redefine our target parameter based on the observed surrogate S-9 measure and estimate it accordingly. Consequently, a covariate adjustment does not cause the dramatic losses to precision that occur when multiplying data by a compression factor. The compositional MS3 model is similar only we do not make an adjustment based on summed-signalto-noise since MS3 experiments were designed to remove interference experimentally and we do not expect the same sort of relationships to hold. The use of SSN provided a noticeable improvement to MS2 estimation accuracy and finding similar quantities that might improve MS3 estimation is certainly a valuable goal of future research. However, appropriate statistics have not yet been studied and the model definition used for MS3 throughout this paper is as follows: Where ߚ now represents the expected log ratio for the ݅'th protein between the ݆'th condition and the reference condition.
In both models partially pooled variance estimation is accomplished by sampling variance components, ߪ ଶ , from a distribution of variances where the shape of the distribution is determined by the hierarchical parameter ߬. Once again the hyperparameters were selected to be weakly informative. However, the choice of the half-normal and inverse gamma distributions was made for convergence considerations which are detailed below. It should be noted that this Bayesian models provide great flexibility in how we predict signal detection as we can directly compute the posterior probability that a parameter lies in a specified interval. For example, a researcher could compute the probability that a protein fold-change was greater than 2. However, in our dilution experiments we do not wish to give our methodology an unfair advantage by using the true ratios as part of our decision rule. Consequently, we instead ranked the proteins by the posterior mean divided by the posterior coefficient of variation. This statistic ranks proteins by the magnitude of the posterior mean but diminishes or strengthens the evidence based on the CV. This was used as the predictive statistic in ROC plots throughout the paper.
Credible intervals are a feature of Bayesian modelling. They can be thought of as analogous to confidence intervals, however they differ in a few important aspects. In a Bayesian model we have the ability to directly asses the probability, conditional on the observed data, that a model parameter (protein fold-change) lies in a particular interval. Confidence intervals lack this clear probabilistic interpretation. This is advantageous as we can ask questions with a Bayesian model that are simply not possible in a frequentist framework. For example, if we wanted to know the probability that a protein fold-change was very small (between -.1 and .1) we could do so directly. This is profoundly different than looking for large p-values which in no way suggests evidence in favor of a null hypothesis.
The compositional transformation, peptide level covariate adjustments, and partially pooled variance estimation are all implemented in our publicly available R package which contains precompiled Stan models to make use of efficient Bayesian simulation algorithms 9 .

Assessing Model Convergence
Bayesian modeling typically depends on simulations to characterize posterior distributions. The Stan programming language makes use of Hamiltonian Monte Carlo techniques to efficiently explore posterior distributions. However, whenever estimation is performed with nondeterministic methods concerns about model convergence need to be considered. This concern is especially relevant for models as large and complex as the ones proposed in this paper.
Distributional assumptions and parametrizations all play a role in obtaining convergence. In particular the hierarchical variance components require the non-centered reparameterization described in the Stan case studies (http://mc-stan.org/users/documentation/casestudies/divergences_and_bias.html). Furthermore, we found that many distributional assumptions resulted in computational difficulties. Consequently, the distributions described in this paper, and implemented in our package were often selected out of computational necessity.
Stan offers a number of tools to assess model convergence. By default, in Stan and our compMS package, every sampling chain is independently run four times from varying sets of initial conditions. This means that we can observe trace plots of each parameter to visually inspect S-12 whether the parameters converged, and whether not each different starting points still converged to the same place. As an example, a trace plot for ߜ from the MS2 model is shown in Figure S1, A.
Our proteomics models contain thousands of parameters and examining every trace plot is not realistic. Fortunately, there are some options for systematically assessing convergence properties. One way is to compare the variance between and within each of the independent chains. This is done with the ܴ statistic 10 which should be close to 1 if the model has converged.
A histogram of all the ܴ statistics can be generated by calling the stan_rhat() function on any stan model fit object ( Figure S1, B). For reference, Stan best practices suggest that Rhat should be less than 1.1 (https://github.com/stan-dev/stan/wiki/Stan-Best-Practices).
Another way to assess overall performance is to inspect the distributions of the parameters of Only 95% credible intervals that do not contain zero are shown here.

Boundary Case Experiment
Mouse whole brain tissue lysate and yeast whole cell lysate was prepared as described previously 11,12 .
Protein digestion and TMT labeling was performed as described previously 13 . After labeling, we with one composed of all protein sequences in the reversed order. Searches were performed using a 50 ppm precursor ion tolerance for total protein level analysis. The product ion tolerance was set to 0.9 Da. These wide mass tolerance windows were chosen to maximize sensitivity in conjunction with Sequest searches and linear discriminant analysis 16,17 . TMT tags on lysine residues and peptide N termini (+229.163 Da) and carbamidomethylation of cysteine residues (+57.021 Da) were set as static modifications, while oxidation of methionine residues (+15.995 Da) was set as a variable modification.

S-16
Peptide-spectrum matches (PSMs) were adjusted to a 1% false discovery rate (FDR) 18,19 . PSM filtering was performed using a linear discriminant analysis, as described previously 16 , while considering the following parameters: XCorr, ∆Cn, missed cleavages, peptide length, charge state, and precursor mass accuracy. For TMT-based reporter ion quantitation, we extracted the signal-to-noise (S:N) ratio for each TMT channel and found the closest matching centroid to the expected mass of the TMT reporter ion. For protein-level comparisons, PSMs were identified, quantified, and collapsed to a 1% peptide false discovery rate (FDR) and then collapsed further to a final protein-level FDR of 1%. Moreover, protein assembly was guided by principles of parsimony to produce the smallest set of proteins necessary to account for all observed peptides.
Channels were randomly permuted within each protein to expand the full range of possible outcomes then peptide level two-way ANOVA's similar to those used by Oberg et. al. 20 were used to estimate log2 protein level fold-changes for both the MS2 and MS3 data. Note that the more common approach of discarding peptide level variation and performing a t-test on protein level estimates is not an option here as the experiment had no replicates. For each protein, the following model was fit in R with the lm function.

‫ݕ‬ = ߤ + ߚ + ߙ + ߳
Where ‫ݕ‬ is the log2 signal-to-noise ratio where ݆ = 1, … , ݊ indexes the number of conditions. ݇ = 1, … , ‫ܭ‬ indexes the ‫ܭ‬ peptides observed within the protein. Reference cell coding is used so that ߚ ଵ = 0, ߙ = 0. ߤ represents the expected value of peptide 1 in condition 1. ߚ , for ݆ ≠ 1, represents the expected log2 contrast for the protein between condition ݆ and condition 1. ߙ , for ݇ ≠ 1, represents the average effect difference between peptide k and peptide 1.

S-17
For proteins with only a single observed peptide, the model is reduced to a one-way anova without a peptide effect.
P-values for the hypothesis that each fold change is zero, were taken directly from the model objects. The purpose of our hypothesis testing was to use the p-values to generate ROC plots. Since these plots depend only on the rank order of the p-values, and multiple hypothesis test corrections do not change the rank ordering, we did not perform any corrections.
ROC plots were generated with the ROCR package in R 21 . True positives are defined as yeast proteins that are known to change by given amount, and false positives are the yeast proteins known to not change.
Both of the compositional models were coded in the Stan programming language and are provided in our R package. To simplify the computational complexity each dataset analyzed was split up into randomly selected sets of 1000 proteins. This provides a substantial reduction in processing time.
Note that the point estimates from this model are very similar to the difference in the average log2 intensities between the conditions. This is true for any model that shares the mean structure of the described ANOVA, including the mixed model, and the GLM discussed in a recent review paper 22  Raw files have been uploaded to ProteomeXchange (Accession Number -PXD008259).

Common Case Experiment
Yeast cells were diluted to ratios of 1:2:3, with human cell lines added to compensate for lost material.
Cell lysis and protein digestion. protease at a 100:1 protein-to-protease ratio. Trypsin was then added at a 100:1 protein-toprotease ratio and the reaction was incubated 6 hrs at 37°C. Peptide concentrations in the digests were measured using the Quantitative Colorometric Peptide assay kit (Pierce). Peptides from a Lyc/trypsin digest from human or yeast lysates were mixed to create the ratios shown in Figure 1 in at least triplicate (n=3,4,4) TMT pipeline. The eleven TMT-labeled samples were mixed into a single sample and separated by basic pH RP HPLC We used an Agilent 1100 pump equipped with a degasser and a photodiode array (PDA) detector (set at 220 and 280 nm wavelength) from Thermo Fisher Scientific (Waltham, MA). Peptides were subjected to a 50 min linear gradient from 5% to 35% acetonitrile in 10mM ammonium bicarbonate pH 8 at a flow rate of 0.6 mL/min over an Agilent 300Extend C18 column (3.5 µm particles, 4.6 mm ID and 220 mm in length). The peptide mixture was fractionated into a total of 96 fractions which were consolidated into 24 fractions.
Samples were subsequently acidified with 1% formic acid and vacuum centrifuged to near dryness. Each eluted fraction was desalted via StageTip , dried via vacuum centrifugation, and reconstituted in 5% acetonitrile, 5% formic acid for LC-MS/MS processing.
For MS3 analysis, eleven pooled fractions were analyzed using 3-hr gradient separations using the instrument parameters described above for the boundary case experiment, injecting roughly 1 ug per fraction for 11 pooled fractions on a Samples were analyzed on an Orbitrap Fusion Lumos mass spectrometer (Thermo Fisher Scientific, San Jose, CA) coupled to a Proxeon EASY-nLC 1200 liquid chromatography (LC) pump (Thermo Fisher Scientific).

S-20
For the MS2 analysis, the same 11 samples were analyzed on the same Orbitrap Fusion Lumos mass spectrometer (Thermo Fisher Scientific, San Jose, CA) coupled to a Proxeon EASY-nLC 1200 liquid chromatography (LC) pump (Thermo Fisher Scientific). Instrument parameters were preserved for MS1 analysis, but for MS2 precursors were fragmented by HCD and analyzed using the Orbitrap (NCE 55, AGC 2.5E5, maximum injection time 150 ms, resolution was 50,000 at 400 Th). For MS2 analysis, we set the isolation window to 1.4 Th.
The data generated in this experiment is offered available with the manuscript titled "Proteome- Results were filtered to a 1% FDR at the peptide and protein levels using linear discriminant analysis and the target-decoy strategy. MS3 spectra were processed as signal-tonoise ratios for each reporter ion based on noise levels within a 25 Th window. Proteins were quantified by summing reporter ion intensities across all matching PSMs using in-house software, as described previously. PSMs with low isolation specificity (<0.7), MS3 spectra with more than eight TMT reporter ion channels missing, MS3 spectra with TMT reporter summed signal to noise ratio that is less than 200, or no MS3 spectra were excluded from quantitation.
Equal human protein starting amounts was enforced by normalizing to the sum of all human peptides for each of the 11 channels. The normalization is slightly different for the compositional modelling. Instead using a multiplicative factor to equalize protein level S-21 intensities, additive factors are used to ensure that the average PSM log ratio to the reference channel are equivalent for all channels.
In this experiment with multiple replicates, t-tests were possible so we chose to analyze the data in accordance with the most commonly used method. Protein estimates were obtained from the standard in house software and t-tests were performed for hypothesis testing.
The compositional modelling was identical to the procedures described for the Boundary Case Experiment.
Once again, ROC plots were generated with ROCR 21 . True positives are defined as the yeast proteins that are known to change by a certain amount (2-or 3-fold). False positives are defined by the unchanging human proteins.

Dual multiplexed viral infection experiment with infinite changes
Samples analyzed for the Human Cytomegalovirus Time Course Experiment were prepared as described previously 23 . Two TMT experiments were designed as follows: 1) a TMT10-plex in the following order   The plots on the left do not include proteins for which only a single peptide was quantified, since the ANOVA methodology used cannot provide p-values in these cases. On the right we include all proteins quantified by assigning a value of 1 to all of the missing p-values. These plots show that in this range of changes, MS3 signals are easier to detect than MS2. We also see that methodology plays a very large role in our ability to pick out these small signals, with the compositional modelling providing substantial gains in all cases. Finally, it is worth noting that including proteins with only a single measurement hurts the performance of MS3 methods more than MS2. This is because the MS3 data contains, as a percentage, far more proteins with only one observation.

Figure S3. Methodological advantages diminish as the true fold-changes increase
ROC plots from the Boundary Case Experiment for fold-changes of 4 (A), 8 (B) and 16 (C). The plots on the left do not include proteins for which only a single peptide was quantified, since the ANOVA methodology used cannot provide p-values in these cases. On the right we include all proteins quantified by assigning a value of 1 to all of the missing p-values. These plots show that in this range of changes, the gains of compositional modelling are still substantial. However, the improved signal detection of MS3 over MS2 begins to disappear. In fact, when including proteins with only a single observation, the MS2 methods begin to outperform MS3 for fold-changes of 8 and 16. MS2 methods provide more data for each protein but with less accuracy. Presumably, the shift in performance occurs because large enough changes do not require great accuracy to reject a hypothesis test that the change was zero. S-28

Figure S4. Methodological advantages diminish for large, but not infinite, fold changes
A-B) ROC plots from the Boundary Case Experiment for fold-changes of 32 (A) and 100 (B). The plots on the left do not include proteins for which only a single peptide was quantified, since the ANOVA methodology used cannot provide p-values in these cases. On the right we include all proteins quantified by assigning a value of 1 to all of the missing p-values. These plots show that in this range of changes, there are still some gains due to compositional modelling however and MS2 continues to outperform MS3 however, all of the advantages continue to shrink as all methods do a good job of detecting large changes. That said, many of the single observation proteins in this category are real changes and the ANOVA methodology cannot detect them, providing a substantial hit, especially for MS3.
C) ROC plot for 2-fold changes from the Boundary Case Experiment. This is the plot corresponding to one presented in Figure 2. The difference is that here we include all quantified proteins, even those with only one observation. The consequence is that the methodological gains from compositional modelling grow more substantial.
D) ROC plot for Infinite fold-changes from the Boundary Case Experiment. This is the plot corresponding to one presented in Figure 3A. The difference is that here we include all quantified proteins, even those with only one observation. The gains from compositional modelling are substantial in this category. The reason for this is that many of the infinite changes appear relatively small. Consequently, the detection performance parallels the patterns seen with smaller changes. Modelling provides enormous gains, and the improved accuracy for MS3 once again takes on more importance than the added number of observations from MS2 (of course we are not here considering the total number of observations, only the sensitivity and specificity of detecting a single signal).

Figure S5. Further analyses confirm performance patterns
A) ROC plot from the Common Case Experiment for 3-fold changes. This plot reinforces the rank order of performance seen for 2-fold changes. The lesson regarding absolute numbers of true positives also remains similar. At a one-percent false positive rate MS2 and MS3 technologies with t-tests gave us 1412 and 1155 true 3-fold changes. Using compositional modelling these numbers increased to 1779 and 1306 respectively. B) ROC plot showing the ability to detect infinite viral protein fold-changes from a background of mostly unchanged human proteins. This is the plot corresponding to one presented in Figure 3E. The difference is that here we include all quantified proteins, even those with only one observation. The consequence is that the methodological gains from compositional modelling grow more substantial. As expected, the ROC curves for compositional modeling in both the 2-and 10-plex have dropped (categorizing single observations is a more difficult problem). However, many of these prove to be true positives resulting in a greater divergence between methodologies when considering all observed proteins.