Proteomics refers to the large-scale, high-throughput, system-wide study of proteins in a
biological system. Among the many workflows used in proteomics, shotgun proteomics, which
involves digesting proteins into shorter peptides, separating them by chromatography, and then
resolving and fragmenting them in the mass spectrometer, has been the most popular
and effective. The fragmentation patterns recorded in tandem mass (MS/MS) spectra can be
used to deduce the peptide sequences, usually by a computational method called sequence
database searching. This computationally intensive method relies on the genome sequence of
the organism studied to enumerate all possible peptide candidates, and then scores them one
by one to find the best match.
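
The candidate-enumeration step of sequence database searching can be illustrated with a minimal Python sketch. It assumes the genome has already been translated into protein sequences, that the protease is trypsin (cleaving after K or R unless the next residue is P), and that candidates are selected by precursor mass alone; the function names and the 0.02 Da tolerance are hypothetical choices made for illustration, and the scoring of each candidate against the MS/MS spectrum is omitted.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residues


def cleavage_fragments(protein):
    """Split a protein after K or R, except when the next residue is P."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def candidate_peptides(protein, missed_cleavages=1):
    """Enumerate peptides, allowing up to `missed_cleavages` skipped sites."""
    frags = cleavage_fragments(protein)
    peptides = set()
    for i in range(len(frags)):
        last = min(i + missed_cleavages, len(frags) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(frags[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates_for_precursor(proteins, precursor_mass, tol_da=0.02,
                             missed_cleavages=1):
    """Return the peptides whose mass lies within tol_da of the precursor."""
    hits = set()
    for protein in proteins:
        for pep in candidate_peptides(protein, missed_cleavages):
            if abs(peptide_mass(pep) - precursor_mass) <= tol_da:
                hits.add(pep)
    return sorted(hits)
```

In a real search engine each returned candidate would then be scored against the observed fragmentation pattern, which is the computationally intensive part of the method.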
Traditionally, discovering all proteins and their isoforms has been the primary goal of
proteomics. More recently, proteomics has seen a gradual transition to hypothesis-driven
approaches, akin to DNA microarrays, for which the goal is to measure the same protein
signals reproducibly and accurately. For this purpose, spectral reference libraries, which are
compilations of previously observed MS/MS spectra, play an important role as an
information hub, enabling researchers to store, merge, retrieve and share data. In this thesis,
the main objective is to develop the necessary computational toolkit that extends the use of
traditional spectral reference libraries to unidentified spectra, breaking free of the assumption
that every spectrum must first be assigned to a peptide to be useful.
In the first part of this thesis, a novel method for denoising tandem mass spectra based on
Bayesian inference is developed for spectral library building. This mainly aims to help
improve the quality of spectral libraries, especially for singleton spectra, where the traditional
way of merging multiple replicates of the same peptide ion into a consensus spectrum cannot
be applied. As a result, spectra denoised by this method retain more signal peaks and
perform better in searching than those filtered by intensity alone.
In the second part of this thesis, a clustering algorithm for constructing a tandem mass spectral
library from both identified and unidentified MS/MS spectra is developed. The resulting
library can function as a complete record of experimental data, allowing better data analysis
and integration. Even in the absence of peptide identification, a properly compiled library of
tandem mass spectra can function as a "fingerprint" for a biological sample.
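
The idea of grouping spectra without peptide identifications can be illustrated with a generic sketch. This is not the clustering algorithm developed in the thesis, which the abstract does not describe; it is a simple greedy scheme that assumes spectra are binned at 1 m/z, square-root scaled, and compared by normalized dot product, with a hypothetical similarity threshold of 0.7. Precursor mass, which a practical method would likely also use, is ignored here.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Convert (mz, intensity) pairs into a unit-length sparse vector."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        # Square-root scaling damps the influence of dominant peaks.
        vec[int(mz / bin_width)] += math.sqrt(intensity)
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return {}
    return {b: v / norm for b, v in vec.items()}


def dot(u, v):
    """Dot product of two sparse vectors; 1.0 means identical binned spectra."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(b, 0.0) for b, val in u.items())


def greedy_cluster(spectra, threshold=0.7):
    """Assign each spectrum to the first cluster whose representative it
    matches at or above the threshold; otherwise start a new cluster."""
    representatives, clusters = [], []
    for idx, peaks in enumerate(spectra):
        vec = bin_spectrum(peaks)
        for c, rep in enumerate(representatives):
            if dot(vec, rep) >= threshold:
                clusters[c].append(idx)
                break
        else:
            representatives.append(vec)
            clusters.append([idx])
    return clusters
```

Each resulting cluster, identified or not, could then contribute one entry to the library, so that the set of entries acts as the sample's fingerprint.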
In the third part of the thesis, the scoring function used in spectral library searching is
redesigned to ameliorate some of its well-known shortcomings. The similarity score is
transformed into a tail probability, which allows one to assign a statistical significance to
every spectrum-spectrum match. This also enables one to forgo the use of target-decoy
approaches -- which are not applicable without peptide identifications -- and instead rely on
parametric mixture-model fitting to estimate the posterior error probability and thereby the
false discovery rate.
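
The last step, going from posterior error probabilities to a false discovery rate, can be sketched as follows. The sketch assumes the mixture-model fit has already produced one posterior error probability (PEP) per spectrum-spectrum match; the fitting itself is not shown, and the function names are hypothetical. It relies on the standard relation that the expected FDR of an accepted set equals the mean PEP of its members.

```python
def fdr_from_pep(peps):
    """Sort matches from most to least confident and return, for each
    cutoff position, the estimated FDR (the running mean of the PEPs)."""
    ordered = sorted(peps)
    fdr, total = [], 0.0
    for k, pep in enumerate(ordered, start=1):
        total += pep
        fdr.append(total / k)
    return ordered, fdr


def count_accepted(peps, max_fdr=0.01):
    """Largest number of top-ranked matches whose estimated FDR stays
    at or below max_fdr."""
    _, fdr = fdr_from_pep(peps)
    accepted = 0
    for k, value in enumerate(fdr, start=1):
        if value <= max_fdr:
            accepted = k
    return accepted
```

Because the PEPs are sorted in ascending order, the running mean never decreases, so the accepted set is always a prefix of the ranked list.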
Finally, the methods developed are applied to a practical application that is intractable by
traditional genome-obligated proteomics approaches. A method is developed to identify the
source of the blood meals of hematophagous arthropods, an enabling tool for the study of the
ecology of infectious diseases in nature. This method, based on comparing the blood
proteomes as recorded in unidentified spectral libraries, is sensitive, fast, cost-effective,
evolutionarily accurate, and compares favorably to existing genome-based and single
protein-based methods. In conclusion, unidentified spectral libraries can function as
fingerprints for biological samples at the proteome level, and can be effectively utilized in
applications such as species classification and microbial source tracking.