THESIS
2017
xviii, 96 pages : illustrations (some color) ; 30 cm
Abstract
Liquid chromatography–mass spectrometry (LC-MS) based proteomics has achieved
great success in recent years, and is becoming one of the major approaches to studying
biological problems. In the dry lab workflow, peptide identification is the first step. Its
outcome is the source of many downstream analyses, including post-translational modification
(PTM) analysis, protein expression analysis, biological signaling pathway analysis,
and protein-protein interaction analysis. Researchers have proposed various methods to
identify peptides. Depending on the target, there are two types of peptide identification
tasks: cross-linked peptide identification and linear peptide identification. Cross-linked
peptide identification is a new topic that appeared a few years ago. Its targets are pairs
of peptides linked by certain chemical compounds. Thus, its search space is quadratic
with respect to the number of peptides in a database. Identifying cross-linked peptides by
searching all peptide-peptide pairs is still an open question. Linear peptide identification
has been well studied and widely used in biological research. However, most proposed
methods only support a limited number of PTMs due to the large computational complexity.
Identifying peptides without limiting PTMs is also an open question. In this
thesis, we try to solve these two open questions by proposing computational methods.
First, we address the cross-linked peptide identification problem by proposing two methods.
The first method, called ECL, can exhaustively search all peptide-peptide pairs from
a database. To our knowledge, no existing tool can search all peptide-peptide pairs,
due to the large computational complexity. Existing methods for cross-linked peptide
identification use heuristic filtering procedures to reduce the search space. However,
a non-exhaustive search misses a considerable number of identifications. Experiments show
that ECL identifies more nonredundant cross-linked peptides than non-exhaustive search
methods, including xQuest, pLink, and ProteinProspector. The running speed comparison
shows that ECL is much faster than xQuest, pLink, and ProteinProspector even
though it searches many more peptide-peptide pairs than these tools. We show that ECL
has a quadratic time complexity, which results in a long running time when the database
is large. Thus, we propose another method, called ECL 2.0, to achieve a linear time
complexity. This method takes advantage of the score function’s additive property to
express the score of a peptide pair as the sum of two chain scores. It couples this conversion with
a digitization-based approach to achieve linear time complexity. Experiments show
that ECL 2.0 has the highest sensitivity among state-of-the-art tools, including pLink,
StavroX, ProteinProspector, Kojak, and ECL. It is also much faster than pLink, StavroX,
ProteinProspector, and ECL.
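
The idea of combining an additive score with digitization can be illustrated with a small sketch. It is not the ECL 2.0 implementation; it only assumes that each chain already has a mass and a precomputed score, and that a pair's score is the sum of the two chain scores. The function names, the bin width, and all masses and scores below are hypothetical. Chain masses are digitized into bins, the best-scoring chain is kept per bin, and each chain then looks up its complementary bin, so the work grows linearly with the number of chains instead of quadratically.

```python
def best_pair_quadratic(chains, precursor_mass, linker_mass, tol=0.01):
    """Reference search: try every chain pair (quadratic in len(chains)).

    chains is a list of (name, mass, score) tuples (hypothetical format).
    """
    best = None
    for i, (name_a, mass_a, score_a) in enumerate(chains):
        for name_b, mass_b, score_b in chains[i:]:
            if abs(mass_a + mass_b + linker_mass - precursor_mass) <= tol:
                total = score_a + score_b  # additive pair score (assumption)
                if best is None or total > best[0]:
                    best = (total, name_a, name_b)
    return best


def best_pair_linear(chains, precursor_mass, linker_mass, bin_width=0.01):
    """Digitized search: two passes over the chains (linear in len(chains))."""
    target = precursor_mass - linker_mass

    def to_bin(mass):
        # Digitize a mass into an integer bin index.
        return int(round(mass / bin_width))

    # Pass 1: keep only the best-scoring chain in each mass bin.
    best_in_bin = {}
    for name, mass, score in chains:
        b = to_bin(mass)
        if b not in best_in_bin or score > best_in_bin[b][0]:
            best_in_bin[b] = (score, name)

    # Pass 2: for each chain, look up the bin its partner must fall in.
    # Neighbouring bins are checked too, which gives a coarse mass tolerance.
    best = None
    for name, mass, score in chains:
        b = to_bin(target - mass)
        for nb in (b - 1, b, b + 1):
            partner = best_in_bin.get(nb)
            if partner is None:
                continue
            total = score + partner[0]
            if best is None or total > best[0]:
                best = (total, name, partner[1])
    return best


# Hypothetical chains: (name, mass, precomputed chain score).
chains = [("A", 1000.0, 3.0), ("B", 1500.0, 2.0),
          ("C", 1500.0, 5.0), ("D", 800.0, 9.0)]
print(best_pair_quadratic(chains, 2638.0, 138.0))
print(best_pair_linear(chains, 2638.0, 138.0))
```

In this sketch the per-bin maximum is what makes the lookup constant time: because the pair score is a plain sum, the best partner for a chain is always the best-scoring chain in the complementary mass bin. The binned tolerance only approximates the exact tolerance used in the quadratic reference.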
Second, we propose a method, called PIPI, that can identify peptides with an unlimited
number of PTMs. This method codes peptide sequences and tandem mass spectra into
vectors. The coding approach ensures that the coded vectors are invariant to PTMs. Then,
it searches the coded spectra against the coded peptide sequences. Since the coded spectra
and peptide sequences are invariant to PTMs, the search procedure can find peptide-spectrum
matches (PSMs) with unspecified PTMs. Finally, it infers PTMs, calculates a
fine score, and estimates the false discovery rate for each PSM. Experiments show that
PIPI has a higher sensitivity than Mascot, Comet, MS-GF+, MS-Alignment, and MODa.
It is also much faster than most of these tools.