CharmeRT: Boosting Peptide Identifications by Chimeric Spectra Identification and Retention Time Prediction

Coeluting peptides remain a major challenge for the identification and validation of MS/MS spectra, but they also carry great potential. To address this, we developed the CharmeRT workflow presented here. It combines a chimeric spectra identification strategy, implemented as part of the MS Amanda algorithm, with the validation system Elutator, which incorporates a highly accurate retention time prediction algorithm. For high-resolution data sets this workflow identifies 38–64% chimeric spectra, which results in up to 63% more unique peptides compared to a conventional single search strategy.


Table S1: All features used in Elutator to validate PSMs and peptides
Retention time prediction model: Description of the model used for retention time prediction
Neighboring amino acids: Description of the impact of neighboring amino acids considered in the RT model
Figure S1: Correlation of theoretically calculated hydrophobicity index to the measured retention time for high confident matches (FDR=0.001) of in-house HeLa and TiO2 enriched data sets
Figure S2: Correlation of theoretically calculated hydrophobicity index to the measured retention time for high confident matches (FDR=0.001) of in-house and external HeLa data set
Figure S3: Histogram of mass deviations for highly reliable identifications before and after recalibration
Figure S4: Longest consecutive series A+B+Y
Table S2: Shared ions between first and second peptides
Figure S5: Results for data of O'Connell et al.
Figure S6: Protein evidence origin
Figure S7: Presence of chimeric spectra in data sets with different isolation widths and gradient times
Figure S8: Score distributions of MS Amanda scores
Figure S9: RNA abundance of HeLa proteins
Figure S10: Proportion of second search PSMs for spike-in data
Table S3: Identified PSMs and unique peptides at 1% FDR
Table S4: Mapping grouped proteins identified in first and second searches to RNA HeLa protein expression data
Figure S11: Chimeric spectrum example
Figure S12: Chimeric spectrum example
Figure S13: Chimeric spectrum example
Figure S14: Chimeric spectrum example
Figure S15: Chimeric spectrum example
Figure S16: Chimeric spectrum example
Figure S17: Chimeric spectrum example
Figure S18: Chimeric spectrum example

Elutator features

Feature: Description
MS Amanda Score: The PSM score assigned by the MS Amanda algorithm.
Delta Score: Difference of the scores between 1st and 2nd rank matches. Nonzero for 1st rank matches only.
Delta Cn: Normalized score difference relative to the first best scoring PSM of the spectrum. Zero for 1st rank matches and non-zero for rank 2 and above.
[feature name not recoverable from the source]: Absolute value of the delta retention time.
Combined Score: Combined score of the MS Amanda score and retention time deviation.
% Isolation Interference: Fraction of ion current in the isolation width not attributed to the identified precursor.
MH+ [Da]: Singly charged mass of the peptide.
m/z: Measured m/z value.
Delta m/z [Th]: Absolute calibrated deviation of the measured m/z from the theoretical value of the peptide.
Calibrated Delta Mass [ppm]: Calibrated deviation of the measured mass from the theoretical mass of the peptide in ppm.
Log Peptides Matched: Logarithm of the number of candidates (search space) in the precursor mass window.
Log Total Intensity: Logarithm of the total ion current of the fragment spectrum.
Fraction Matched Intensity [%]: Fraction of the total ion current of the fragment spectrum that is matched by fragments of the PSM.
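
To make some of the score-based and mass-based features concrete, the following minimal sketch computes Delta Score, Delta Cn, and a mass deviation in ppm for a ranked list of candidate matches. It is an illustration under stated assumptions, not the Elutator implementation: the Psm container, the function names, and the example values are hypothetical, and the normalization of Delta Cn by the best score is an assumed, commonly used form that the feature description does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Psm:
    """Hypothetical container for one peptide-spectrum match."""
    peptide: str
    score: float             # search engine score, higher is better
    measured_mass: float     # calibrated measured mass in Da
    theoretical_mass: float  # theoretical peptide mass in Da


def delta_score(ranked: List[Psm], rank: int) -> float:
    """Score gap between rank 1 and rank 2; zero for every other rank."""
    if rank != 1 or len(ranked) < 2:
        return 0.0
    return ranked[0].score - ranked[1].score


def delta_cn(ranked: List[Psm], rank: int) -> float:
    """Score difference to the best PSM divided by the best score (assumed form)."""
    best = ranked[0].score
    if best == 0:
        return 0.0
    return (best - ranked[rank - 1].score) / best


def delta_mass_ppm(psm: Psm) -> float:
    """Deviation of the measured from the theoretical mass in parts per million."""
    return (psm.measured_mass - psm.theoretical_mass) / psm.theoretical_mass * 1e6


# Illustrative values only; peptides, scores, and masses are made up.
candidates = [
    Psm("PEPTIDEK", 150.0, 927.4551, 927.4549),
    Psm("PEPTLDEK", 200.0, 927.4551, 927.4550),
]
ranked = sorted(candidates, key=lambda p: p.score, reverse=True)
for rank, psm in enumerate(ranked, start=1):
    print(rank, psm.peptide, delta_score(ranked, rank), delta_cn(ranked, rank),
          round(delta_mass_ppm(psm), 3))
```

With this convention, Delta Score is nonzero only for the rank 1 match and Delta Cn is zero for the rank 1 match, which mirrors the behavior stated in the feature descriptions.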

Retention time prediction model
Our retention time prediction model can be fully described by the following non-linear, sequence-dependent function, which has been described by Krokhin, 2006 (ref 31):

[equation for H not recoverable from the source]

where H is the hydrophobicity, newIso is a function modeling the isoelectric charge, seq is the peptide sequence, helices1 and helices2 are adjustments for short and long helices, and F is defined by:

[equation for F not recoverable from the source]

with sumScale being a polynomial function of its argument, lengthScale a polynomial factor that depends on the length of the peptide, length the length of the peptide sequence, and R defined as:

[equation for R not recoverable from the source]

with smallness being a correction factor that depends on the length of the peptide, undigested a function to handle the special positively charged amino acids (R/H/K), clusterness a function for handling clusters of hydrophobic amino acids, decreasing the hydrophobicity, proline a function to handle sequences with two or more prolines in the peptide sequence, and G defined as:

[equation for G not recoverable from the source]

with baseSumOfRetentionCoefficients being the sum of the retention time coefficients of all amino acid residues of the peptide sequence and C modeling the impact of neighboring amino acids (see below).

Interactions between neighboring amino acids
We describe the cumulative contribution of interactions between neighboring residues of peptide sequence s to the hydrophobicity index, the term C of the retention time model, as a sum over residue pairs:

[equation not recoverable from the source]

The summation runs over all pairs of amino acid residues in the sequence s whose positions differ by at most 9. For each amino acid pair, we consider two coefficients, one for the residue at the first position in sequence s and γ for the residue at the second position in sequence s, such that the interaction between the two residues is described by the product of these two coefficients. Distance coefficients, which are a function of the difference between the two positions, account for the contribution of residue pairs with a distance δ between them. All coefficients, including those with indices 0 to 4, all lambda values, and the values of the two lookup tables, are optimized during training of the RT model.
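
The pairwise structure described above can be sketched in a few lines of Python. This is only an illustration of the summation pattern (pairs within 9 positions, a product of two residue coefficients, and a distance-dependent weight), not the trained model: the function name, the two lookup tables, the distance weight, and all numeric values are hypothetical, and the choice to sum over ordered pairs and to skip the pairing of a residue with itself is an assumption.

```python
from typing import Callable, Dict

MAX_DISTANCE = 9  # from the text: positions differ by at most 9


def neighbor_contribution(
    seq: str,
    first_coeff: Dict[str, float],
    second_coeff: Dict[str, float],
    distance_weight: Callable[[int], float],
) -> float:
    """Sum coefficient products over all ordered residue pairs within the window."""
    total = 0.0
    for i, res_i in enumerate(seq):
        lo = max(0, i - MAX_DISTANCE)
        hi = min(len(seq), i + MAX_DISTANCE + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # assumption: a residue does not interact with itself
            total += first_coeff[res_i] * second_coeff[seq[j]] * distance_weight(i - j)
    return total


# Hypothetical lookup tables and distance weight, for illustration only.
first_table = {"A": 1.0, "K": 2.0}
second_table = {"A": 0.5, "K": 0.25}


def inverse_distance(delta: int) -> float:
    """Made-up weight that decays with the absolute position difference."""
    return 1.0 / abs(delta)


print(neighbor_contribution("AK", first_table, second_table, inverse_distance))
```

In the model itself, the lookup tables and the distance coefficients are fitted during training, so the fixed values used here stand in for learned parameters.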

Figure S1 Correlation of theoretically calculated hydrophobicity index to the measured retention time for high-confidence matches (FDR=0.001) of in-house HeLa and TiO2 enriched data sets. 70% of all matches in the TiO2 enriched data set contain one or more phosphorylated sites. Outliers were not removed.

Figure S2 Correlation of theoretically calculated hydrophobicity index to the measured retention time for high-confidence matches (FDR=0.001) of in-house and external HeLa data sets. Outliers were not removed.

Figure S7 Presence of chimeric spectra in data sets with different isolation widths and gradient times. All spectra having two or more reliably identified precursors are chimeric spectra. As expected, the presence of chimeric spectra rises with increasing isolation width.

Figure S10 Proportion of second search PSMs for spike-in data [...] 40. For low spike-in amounts, the proportion of UPS peptides is higher in the second search, as these originate from rare proteins and are therefore more likely to be coeluting peptides.
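
The definition used for Figure S7 (a spectrum is chimeric when two or more precursors are reliably identified from it) can be expressed as a short counting routine. The sketch below is a hedged illustration: the function name, the input format of (spectrum identifier, precursor identifier) pairs, and the choice of identified spectra as the denominator are assumptions, and the example identifiers are made up.

```python
from collections import Counter
from typing import Iterable, Tuple


def chimeric_fraction(identifications: Iterable[Tuple[str, str]]) -> float:
    """Fraction of identified spectra with at least two distinct precursors.

    identifications holds (spectrum_id, precursor_id) pairs that already
    passed the reliability (FDR) filter.
    """
    distinct = set(identifications)  # ignore duplicate reports of the same pair
    per_spectrum = Counter(spectrum_id for spectrum_id, _ in distinct)
    if not per_spectrum:
        return 0.0
    chimeric = sum(1 for count in per_spectrum.values() if count >= 2)
    return chimeric / len(per_spectrum)


# Made-up identifications: only spectrum "s1" has two distinct precursors.
example = [("s1", "P1"), ("s1", "P2"), ("s2", "P3"), ("s3", "P4"), ("s3", "P4")]
print(chimeric_fraction(example))
```

Applying such a count separately to data sets acquired with different isolation widths would give the kind of comparison the figure summarizes.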