Quirks of Error Estimation in Cross-Linking/Mass Spectrometry

Cross-linking/mass spectrometry is an increasingly popular approach to obtain structural information on proteins and their complexes in solution. However, methods for error assessment are still under development. We note that false-discovery rates can be estimated at different points during data analysis, and are most relevant for residue or protein pairs. Missing this point led in our example analysis to an actual 8.4% error when 5% error was targeted. In addition, prefiltering of peptide-spectrum matches and of identified peptide pairs substantially improved results. In our example, this prefiltering increased the number of residue pairs (5% FDR) by 33% (n = 108 to n = 144). This increase did not come at the expense of reduced accuracy, as the added data agreed with an available crystal structure. We provide an open-source tool, xiFDR (https://github.com/rappsilberlab/xiFDR), that implements our observations for routine application. Data are available via ProteomeXchange with identifier PXD004749.

Cross-linking/mass spectrometry (CLMS) is emerging as a valuable tool to investigate protein structures, protein complexes, and protein−protein interactions. 1−4 As any method relying on measurement as well as interpretation, CLMS has some level of error. One popular method in proteomics to assess the expected error among reported results is the false discovery rate (FDR) by the target-decoy approach. 5 A decoy database is generated, typically by inverting all target sequences. This decoy database should not contain any peptide sequences that are in the analyzed sample. Any match to this database is therefore a false positive. Under the assumption that random identifications fall with equal probability into the target and decoy section of the database, the distribution of decoy hits reveals the distribution of random target hits and allows the reporting of results with a defined FDR.
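As a minimal illustration of the target-decoy idea for linear peptides, the following Python sketch estimates the FDR among target matches above a score cutoff as the ratio of decoy to target hits. The function name, the `(score, is_decoy)` representation, and the scores are hypothetical and serve illustration only.

```python
def target_decoy_fdr(matches, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    `matches` is a list of (score, is_decoy) tuples (hypothetical layout).
    Each decoy hit is taken to stand for one random target hit.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing reported, so no error to estimate
    return decoys / targets


# Made-up scores: True marks a match to the decoy database.
matches = [(9.1, False), (8.7, False), (8.2, False), (7.9, True),
           (7.5, False), (6.8, False), (6.1, True), (5.4, False)]

print(target_decoy_fdr(matches, 7.0))  # 4 targets, 1 decoy above the cutoff
```

Lowering the cutoff admits more targets but also more decoys, which is the trade-off an FDR threshold controls.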
For CLMS, FDR estimation is complicated by the fact that every match is a composite of two peptides, each with its own probability of being false. Previously, FDR estimation of cross-links was addressed either by inverting all possible cross-linked peptide pairs, 6 which does not model cases that have one correctly identified peptide and one incorrectly identified peptide, or by using a decoy (i.e., wrong mass) cross-linker. 6 While the decoy cross-linker allows for one peptide being right and one being wrong as well as for both peptides being wrong, it does not provide an easy way to model both cases separately. To model this, FDR calculations have to take into account a set of two interdependent problems. For the false identification of a single peptide, only a linear random space needs to be considered; for two peptides, this needs to be extended to a quadratic random space, as each peptide could come from either the target or the decoy database. MS2-cleavable cross-linkers 7−11 may allow circumvention of a cross-linking-specific FDR, at least in part. The cross-link is cleaved in MS2, separating the two peptides that can then be identified individually in MS3. As linear peptides are being identified, standard proteomic peptide FDR estimation has been applied, 12 possibly falling short in considering errors from joining up peptides. Nevertheless, such data can also be assessed jointly as cross-links within a spectrum. 13,14 A formalism for FDR estimation of cross-links has recently been proposed. 15 However, some questions remain open, such as how to handle directionality of the cross-linker or what levels to consider: peptide-spectrum matches (PSMs), peptide pairs, or residue pairs.
Here we share our considerations regarding FDR estimation in CLMS, based on the target-decoy approach. The FDR approach was tested using a data set of RNA Polymerase II (Pol II) cross-linked with Bis[sulfosuccinimidyl] suberate (BS3). 16 Our data was compared against an available crystal structure of Pol II, 17 which served as a mass spectrometry-independent evaluation of our FDR approach. We highlight the importance of considering the different information levels, PSMs, peptide pairs, and residue pairs, and how their relationship can be exploited productively.

■ EXPERIMENTAL SECTION
Dataset. The data set has been described previously 16 and was reprocessed here. In short, purified RNA polymerase II (Pol II) from Saccharomyces cerevisiae was cross-linked with BS3. Cross-linked complexes were then digested with trypsin and analyzed by LC-MS/MS. Mass spectrometric data was acquired using a "high−high" strategy, meaning both MS1 and MS2 spectra were acquired with high resolution (R = 100000 and R = 7500, respectively).
Data Processing. Mass spectrometric raw files were processed into peak lists using MaxQuant version 1.2.2.5 18 using default parameters except the setting for "Top MS/MS peaks per 100 Da" being set to 100. Peak lists were searched against a target-decoy database of all Pol II proteins (Rpb1 to Rpb12, 4565 residues) and their decoy equivalents obtained by sequence inversion 18 using Xi 19 (http://github.com/Rappsilber-Laboratory/XiSearch) for identification of cross-linked peptides. Search parameters were MS accuracy, 6 ppm; MS/MS accuracy, 20 ppm; enzyme, trypsin; specificity, fully tryptic; allowed number of missed cleavages, four; cross-linker, BS3; fixed modifications, carbamidomethylation on cysteine; variable modifications, oxidation on methionine, hydrolyzed, amidated, and loop-linked versions of BS3. The linkage specificity for BS3 was assumed to be at lysine, serine, threonine, tyrosine, and protein N-termini. The data have been deposited to the ProteomeXchange 20 Consortium via the PRIDE 21 partner repository with the data set identifier PXD004749.
Comparison to Crystal Structure. As a mass spectrometry-independent assessment of identification success, the residue distance of identified cross-linked residue pairs was measured in an available crystal structure of Pol II (PDB|1WCM). 17 CLMS and X-ray crystallography do not necessarily return identical results as CLMS investigates proteins in solution where conformational flexibility is likely much higher than in crystallized form. However, for our data set, a good agreement of the two methods has been reported. 16 To compare decoy matches with the crystal structure, the linked residue in the decoy was assigned the position of the same residue in the forward sequence.
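The mapping of decoy positions back to the forward sequence, and the distance check against the structure, can be sketched in Python as follows. The sketch assumes plain full-sequence reversal with 1-based positions, made-up coordinates, and a 27 Å cutoff (the approximate lysine−lysine limit for BS3); all names and values are hypothetical.

```python
import math


def decoy_to_forward_position(decoy_position, protein_length):
    """Map a 1-based position in a reversed (decoy) sequence to the
    position of the same residue in the forward sequence."""
    if not 1 <= decoy_position <= protein_length:
        raise ValueError("position outside of protein")
    return protein_length - decoy_position + 1


def is_feasible(coord_a, coord_b, max_distance=27.0):
    """Return True if two residues lie within cross-linking distance.

    Coordinates are (x, y, z) tuples in angstrom; which atom they refer
    to (for example C-alpha) is an assumption left to the caller.
    """
    return math.dist(coord_a, coord_b) <= max_distance


forward = "MKTAYIAK"    # toy sequence, not a Pol II subunit
decoy = forward[::-1]   # "KAIYATKM"

# The lysine at decoy position 7 is the lysine at forward position 2.
pos = decoy_to_forward_position(7, len(forward))
print(pos, forward[pos - 1], decoy[7 - 1])

print(is_feasible((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))    # 5 angstrom apart
print(is_feasible((0.0, 0.0, 0.0), (30.0, 40.0, 0.0)))  # 50 angstrom apart
```

Decoy generators that keep some residues in place during reversal would need a different mapping.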
xiFDR Software. All FDR calculations were done with xiFDR. We provide xiFDR, an open-source program (https://github.com/lutzfischer/xiFDR), for researchers to analyze the results of their preferred cross-link search engine. The input of xiFDR is either an mzIdentML file or a table of PSMs (Table S1). The output is either an amended mzIdentML file or a set of tables containing PSMs, peptide pairs, residue pairs, and protein pairs that pass the requested FDR thresholds. It supports two modes of operation for cross-links: directional and nondirectional. Directional here refers to matches where the spectra of A being cross-linked to B would be significantly different than those of B being cross-linked to A, and nondirectional refers to cross-linking methods where there is practically no distinction between A cross-linked to B and B cross-linked to A. The two modes use different formulas, which involve TD_DB, the number of all possible unique target−decoy and decoy−target entries. The difference lies in how decoy−decoy matches model the false peptide−false peptide matches among the target−target matches. A detailed derivation of both formulas and their impact is described in the Supporting Information (text and Figure S1). Both formulas converge quickly (at 200 linkable entities the deviation is <1%, Supporting Information, Figure S2 and supplemental discussion). Both formulas are applicable at PSM, peptide pair, residue pair, and protein pair level. Even so, how directionality would look for residue and protein pairs is currently unclear.
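A minimal Python sketch of the two modes follows. The formulas in it are assumptions for illustration, not quotations from this text or from xiFDR: the directional estimate is taken as (TD − DD)/TT, and the nondirectional estimate as (TD + DD·(1 − TD_DB/DD_DB))/TT. Here TT, TD, and DD are the observed target−target, target−decoy (either order), and decoy−decoy counts, DD_DB is the number of possible decoy−decoy entries, and the search space is assumed to hold n linkable target entities with one decoy each. The Supporting Information remains the authoritative derivation.

```python
def fdr_directional(tt, td, dd):
    """Assumed directional estimate: (TD - DD) / TT."""
    return (td - dd) / tt if tt else 0.0


def dd_weight_nondirectional(n_entities):
    """Assumed weight of DD in the nondirectional estimate.

    With n linkable entities and one decoy per target (assumption):
      TD_DB = n * n            unique target-decoy pairs
      DD_DB = n * (n + 1) / 2  unique unordered decoy-decoy pairs
    """
    td_db = n_entities * n_entities
    dd_db = n_entities * (n_entities + 1) / 2
    return 1 - td_db / dd_db


def fdr_nondirectional(tt, td, dd, n_entities):
    """Assumed nondirectional estimate: (TD + DD * weight) / TT."""
    if not tt:
        return 0.0
    return (td + dd * dd_weight_nondirectional(n_entities)) / tt


# The DD weight approaches -1, i.e. the directional form, as n grows.
for n in (10, 50, 200, 1000):
    print(n, dd_weight_nondirectional(n))

# Hypothetical counts: both estimates nearly agree for 200 entities.
print(fdr_directional(100, 10, 2), fdr_nondirectional(100, 10, 2, 200))
```

Under these assumptions the weight of DD differs from −1 by just under 1% at 200 linkable entities, in line with the convergence stated above.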
The calculated FDRs are reported with an attached resolution. The resolution is defined as the difference between the next higher computable FDR and the next lower computable FDR. This is exemplified in Supporting Information, Figure S3. While the resolution does not provide an actual accuracy, it gives an indication of the range into which the actual FDR might fall. xiFDR is described in more detail as part of the Supporting Information.

■ RESULTS AND DISCUSSION
Database searches of mass spectrometry data in proteomics return peptide-spectrum matches (PSMs). Consequently, one may want to assess the error made in this process, and FDR calculations for PSMs have been validated extensively for linear peptides based on a number of tests. 22−24 However, for protein cross-linking, there are three additional information levels. PSMs aggregate to peptide pairs, these then aggregate to linked residue pairs, which in turn aggregate to protein pairs. To assess if FDR estimation at the different information levels is actually valid, we used a crystal structure as "ground truth". We compared our search results for data of an RNA polymerase II analysis 16 filtered to 50% FDR at different levels (PSMs, peptide pairs, residue pairs) to the crystal structure of Pol II (PDB|1WCM), 17 measuring the distance of residue pairs that were identified as being cross-linked. If the distance of a cross-linked residue pair is feasible, the identification is possibly right. If not, it is likely wrong. When looking at the distance histogram of target and decoy matches, the distributions of target and decoy matches should be distinct for the cross-linkable distance, with more targets than decoys (Figure 1a). This indicates that there are actually true identifications among the target matches. On the other hand, for long, structurally unfeasible distances the curves should overlay. Most of the identifications of residue pairs that are long distance are structurally unfeasible and, hence, likely false positives, which decoys are supposed to model. Indeed, we found that the decoy distributions match the long-distance part of the target distribution for each observed level of information: PSMs (Figure 1b), peptide pairs (Figure 1c), and residue pairs (Figure 1d). Decoys (always false) and long-distance links (mostly false) agree for PSMs, peptide pairs, and residue pairs.
Consequently, FDRs of PSMs, peptide pairs, and residue pairs can be obtained by target-decoy searches.
In a cross-linking experiment, the information of interest lies with the cross-linked residue pairs and the cross-linked protein pairs. Restricting FDR analysis to PSMs or peptide pairs leads to a problem: A defined FDR for PSMs or peptide pairs gives an unpredictable and typically larger FDR at the level of residue pairs or protein pairs (Figure 2). For our RNA polymerase II analysis, 5% FDR at the level of PSMs leads to 5.8% FDR at the level of peptide pairs and 8.4% at the level of residue pairs. While we can also look at protein pairs, and the trend seems to persist, the actual number of possible pairs in Pol II does not permit any statistically meaningful results. At no FDR is the PSM FDR a good guide for the accuracy of information at the level of residue pairs. Likewise, peptide-pair FDR is not a good guide for the situation at residue-pair level. Consequently, the error should be estimated for the information that is of actual interest, that is, linked residue and protein pairs. Similar arguments have been made for protein identification: 25 correct matches tend to aggregate when combining PSMs to peptides and peptides to proteins. In contrast, false matches tend to stay alone. False matches are random and have a low probability to fall by chance into the same protein. Therefore, the proportion of false results increases when combining results. That residue-pair FDRs should and can be calculated leaves the question of how to treat PSMs and peptide pairs. One could ignore their error and leave error estimation to the level of residue pairs entirely. Instead, we restrict the number of false PSMs and peptide pairs by applying an FDR threshold at their respective level as a prefilter. Importantly, the way one handles PSMs and peptide pairs actually influences the number of residue pairs passing a given FDR threshold.
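The growth of the error proportion upon aggregation can be seen in a toy Python calculation. The residue-pair labels and counts below are invented, and the true/false labels are known only because the example is synthetic; in a real analysis they are unknown and are estimated via decoys.

```python
# Hypothetical support: correct PSMs cluster on a few residue pairs,
# while a false PSM lands on a residue pair of its own.
true_support = {"R1": 4, "R2": 4, "R3": 4, "R4": 4, "R5": 3}
false_support = {"X1": 1}

# One (residue_pair, is_correct) entry per PSM.
psms = [(pair, True) for pair, n in true_support.items() for _ in range(n)]
psms += [(pair, False) for pair, n in false_support.items() for _ in range(n)]

# Error proportion at PSM level.
psm_error = sum(1 for _, correct in psms if not correct) / len(psms)

# Collapse PSMs to unique residue pairs; in this toy data each residue
# pair is supported either only by correct or only by false PSMs.
residue_pairs = {}
for pair, correct in psms:
    residue_pairs[pair] = residue_pairs.get(pair, False) or correct

pair_error = (sum(1 for correct in residue_pairs.values() if not correct)
              / len(residue_pairs))

print(f"PSM level:          {psm_error:.1%} false")
print(f"Residue-pair level: {pair_error:.1%} false")
```

One false PSM among 20 becomes one false residue pair among 6, because the 19 correct PSMs collapse onto only five residue pairs.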
For example, aiming for 5% FDR on residue pairs in our data, we observe 108 hits if only applying the cutoff at residue level, compared to 144 hits if we apply a 6% FDR cutoff at PSM level and a 10.5% FDR cutoff at peptide-pair level (Figure 3). Prefiltering in PSMs and peptide pairs added 36 (33%) additional residue pairs without affecting their FDR. To test if our FDR is still reflecting the likely accuracy of cross-links reported in our analysis, we compared the initial as well as the number-improved set of cross-links with an available crystal structure of Pol II (PDB|1WCM). Of the additional 36 residue pairs, 33 showed a distance in the crystal structure that matched the possible cross-link length (∼27 Å for lysine−lysine links with BS3 16 ). In addition, two of the three remaining residue pairs involve the very flexible N-terminal loop-region of Rpb1, offering an explanation for seeing these cross-links despite residues being distant in the crystal structure. In conclusion, prefiltering added 35 plausible residue pairs (33%) at the expense of adding one implausible one. Prefiltering therefore appears to be a valid way of improving search sensitivity without compromising search accuracy.

Figure 2. [...] 16 The protein-pair FDR is plotted as a trend only, due to data sparseness. (B) Exemplification of the error propagation, in the form of wrong identifications, from PSMs to peptide pairs and residue pairs. Correctly identified PSMs (true positives = green) tend to cluster; for example, several correctly identified PSMs support the same unique peptide pair, and correctly identified peptide pairs in turn support one residue pair. Incorrectly identified PSMs (false positives = red) are random and do not cluster to the same extent.

Analytical Chemistry
Technical Note

The success of prefiltering by applying FDR thresholds at lower levels in improving search sensitivity depends on combining multiple PSMs to support a peptide pair and multiple peptide pairs to support a residue pair. We are not aware of a way to predict the best filter settings, or in fact whether different filter settings at lower information levels would always be beneficial. We therefore suggest exploring this numerically by software. We supply such software here, xiFDR (see Experimental Section). Note that this tool uses a CSV file or mzIdentML 26 version 1.2 (submitted) as input and is therefore independent of the search software. xiFDR reports the FDR interval (Supporting Information, Figure S3).
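Such a numerical exploration can be sketched in Python as a grid search over prefilter thresholds. This is not the xiFDR implementation: the record layout, the (TD − DD)/TT estimate, and the combination of supporting scores as the square root of the sum of squares are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import product


def fdr_cut(items, max_fdr):
    """Keep the longest score-ranked prefix whose estimated FDR,
    taken here as (TD - DD) / TT, does not exceed `max_fdr`."""
    ranked = sorted(items, key=lambda m: m["score"], reverse=True)
    counts = {"TT": 0, "TD": 0, "DD": 0}
    keep = 0
    for i, match in enumerate(ranked, start=1):
        counts[match["kind"]] += 1
        tt = counts["TT"]
        if tt and (counts["TD"] - counts["DD"]) / tt <= max_fdr:
            keep = i
    return ranked[:keep]


def aggregate(items, key):
    """Collapse items sharing `key` into one entry with a combined score
    (square root of the sum of squared scores; an assumption)."""
    groups = defaultdict(list)
    for match in items:
        groups[match[key]].append(match)
    merged = []
    for members in groups.values():
        entry = dict(members[0])
        entry["score"] = sum(m["score"] ** 2 for m in members) ** 0.5
        merged.append(entry)
    return merged


def residue_pairs_at(psms, psm_fdr, pep_fdr, res_fdr=0.05):
    """Count target residue pairs passing `res_fdr` after prefiltering."""
    kept = fdr_cut(psms, psm_fdr)
    peptide_pairs = fdr_cut(aggregate(kept, "peptide_pair"), pep_fdr)
    residue_pairs = fdr_cut(aggregate(peptide_pairs, "residue_pair"), res_fdr)
    return sum(1 for m in residue_pairs if m["kind"] == "TT")


def best_prefilter(psms, grid):
    """Try every (PSM FDR, peptide-pair FDR) combination in `grid` and
    return (count, psm_fdr, pep_fdr) for the highest count."""
    return max((residue_pairs_at(psms, a, b), a, b)
               for a, b in product(grid, repeat=2))


# Hypothetical PSM records; float("inf") in the grid means "no prefilter".
psms = [
    {"score": 10.0, "kind": "TT", "peptide_pair": "p1", "residue_pair": "r1"},
    {"score": 9.0, "kind": "TT", "peptide_pair": "p2", "residue_pair": "r2"},
    {"score": 3.0, "kind": "TD", "peptide_pair": "p3", "residue_pair": "r3"},
]
print(best_prefilter(psms, [0.05, 0.10, float("inf")]))
```

Because the grid contains the no-prefilter setting, the best result can never be worse than filtering at residue-pair level alone; whether prefiltering helps on a given data set is what the search is meant to reveal.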

■ CONCLUSION
Current FDR approaches in cross-linking/mass spectrometry stop at the PSM or peptide-pair level, often without specifying which one was actually used. Consequently, the information of interest, links between sites (residue pairs) or proteins (protein pairs), is reported with an unknown and typically higher (potentially much higher) error. Our data indicate that our FDR approach can be extended to assess the error on residue-pair level and presumably also protein-pair level. As contributions to finding the most sensitive but also fair report of identified links, we propose to prefilter on PSMs and peptide pairs, and to report FDR together with the interval of uncertainty resulting from limited data. FDR estimation played an important role in consolidating proteomics, and it has a similar role to play for cross-linking/mass spectrometry.

Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.6b03745.
Derivation of formulas for directional and nondirectional cross-linker and the impact of using one vs the other. Description of xiFDR software. Description and example for the resolution of an FDR calculation (PDF).

Figure 3. [...] Optimal FDR thresholds on PSMs and peptide pairs (left) return more cross-links (at 5% FDR) than not applying prefilters (right). (C) Distance distribution of the residue pairs (5% residue-pair FDR). The prefiltering does increase the number of cross-links but does not lead to a notable increase in long distance links (see text for a more detailed discussion).
