Preprocessing Tandem Mass Spectra Using Genetic Programming for Peptide Identification
- Samaneh Azari (School of Engineering and Computer Science, Victoria University of Wellington, 6012, Wellington, Kelburn, New Zealand; School of Engineering and Computer Science, Victoria University of Wellington, 6140, Wellington, New Zealand)
- Bing Xue (School of Engineering and Computer Science, Victoria University of Wellington, 6012, Wellington, Kelburn, New Zealand)
- Mengjie Zhang (School of Engineering and Computer Science, Victoria University of Wellington, 6012, Wellington, Kelburn, New Zealand)
- Lifeng Peng (Centre for Biodiscovery and School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand)
Abstract

One of the major challenges in proteomics is peptide identification from mass spectra with a high noise ratio and a small number of signal (b-/y-ion) peaks. The accuracy and reliability of peptide identification in such highly imbalanced MS/MS data can be improved by applying a preprocessing step, prior to identification, that discriminates b-/y-ions from noise peaks in the spectra. In this study, we report a genetic programming (GP)-based preprocessing method for de-noising highly imbalanced and noisy CID MS/MS spectra. GP has become a popular machine learning method based on automatic programming. The method preprocesses the highly noisy MS/MS spectra by classifying each peak as either a noise peak or a signal peak, that is, as a binary classification task. A set of spectral fragment features based on the MS/MS fragmentation rules is extracted from the dataset, and their discriminating abilities are investigated by GP. An MS/MS spectral dataset containing thousands of spectra is used to train the GP model. Because the GP tree-based representation performs implicit feature selection during the evolutionary process, the evolved GP model with the selected features is compared with the best threshold-based method. The results show that the GP method improved the reliability of peptide identification and increased the identification rate of a de novo sequencing tool, PEAKS, to 99.4%, from the 80.1% achieved by the best threshold-based method. Moreover, peptide identification by a database search tool, SEQUEST, on data preprocessed by the GP method showed a statistically significant improvement over the other methods.
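The idea of classifying peaks with a tree-based program, as opposed to a fixed intensity cutoff, can be illustrated with a minimal Python sketch. It is not the authors' model: the feature names (`rel_intensity`, `has_complement`), the hand-written tree, the 0.05 cutoff, and the rule "output greater than zero means signal" are all assumptions made for illustration, and no evolutionary search is performed here.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Union

# A peak is a dictionary of named features. The feature names used below
# are hypothetical; the abstract does not list the actual features.
Peak = Dict[str, float]


@dataclass
class Node:
    """Internal node of an expression tree: a binary operator and two children."""
    op: Callable[[float, float], float]
    left: "Tree"
    right: "Tree"


# A leaf is either a feature name (str) or a numeric constant (float).
Tree = Union[Node, str, float]


def protected_div(a: float, b: float) -> float:
    """Division that returns 1.0 for a zero denominator, a common GP convention."""
    return a / b if b != 0 else 1.0


def evaluate(tree: Tree, peak: Peak) -> float:
    """Recursively evaluate the expression tree on one peak's features."""
    if isinstance(tree, Node):
        return tree.op(evaluate(tree.left, peak), evaluate(tree.right, peak))
    if isinstance(tree, str):
        return peak[tree]
    return float(tree)


def classify(tree: Tree, peak: Peak) -> str:
    """Assumed decision rule: positive program output means a signal peak."""
    return "signal" if evaluate(tree, peak) > 0 else "noise"


def threshold_classify(peak: Peak, cutoff: float) -> str:
    """Baseline: label a peak by relative intensity alone."""
    return "signal" if peak["rel_intensity"] >= cutoff else "noise"


# Hand-written stand-in for an evolved program:
# (rel_intensity - 0.05) + has_complement
tree = Node(operator.add, Node(operator.sub, "rel_intensity", 0.05), "has_complement")

peaks = [
    {"rel_intensity": 0.02, "has_complement": 1.0},  # weak peak with a complementary ion
    {"rel_intensity": 0.02, "has_complement": 0.0},  # weak peak, no supporting evidence
    {"rel_intensity": 0.50, "has_complement": 0.0},  # intense peak
]

labels = [classify(tree, p) for p in peaks]
baseline = [threshold_classify(p, 0.05) for p in peaks]
print(labels)
print(baseline)
```

In this toy example the tree keeps the weak first peak because a second feature supports it, while the intensity-only baseline discards it. In a GP system, trees of this kind would be evolved against labelled training peaks, and features that never appear in the best tree are in effect deselected.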
Cited By
This article is cited by 1 publication.
- Samaneh Azari, Bing Xue, Mengjie Zhang, Lifeng Peng. A Decomposition Based Multi-objective Genetic Programming Algorithm for Classification of Highly Imbalanced Tandem Mass Spectrometry. 2020, 449-463. https://doi.org/10.1007/978-3-030-41299-9_35