Statistical analysis of patient-derived sequences discovers biologically significant insights in highly mutable viruses
by Ahmed Abdul Quadeer
THESIS
2016
Ph.D. Electronic and Computer Engineering
xxi, 147 pages : illustrations ; 30 cm
Abstract
Advances in fast DNA sequencing technologies over the last decade have opened up unprecedented
opportunities to explore a diverse set of questions in biomedical research. This thesis
utilizes statistics and statistical signal processing tools for analyzing sequences of viral proteins to
uncover novel insights into biological function and structure. Robust correlation matrix estimation for
high-dimensional data plays a major role in addressing such complicated biological problems. The
important biological information revealed using the analysis presented in this work can be useful in
multiple fields: in structural biology to identify parts of the viral protein important for structural stability;
in biochemistry to study the role of particular parts in performing different functions associated
with the viral protein; and in immunology to predict potentially vulnerable parts of the virus that
can be targeted when designing potent vaccines.
The first part of this work presents a novel vaccine design for an extremely dangerous pathogen,
Hepatitis C Virus (HCV). Chronic HCV infection, which affects around 3% of the world’s population,
is one of the leading causes of liver cancer. Current treatments for HCV are expensive and there is no
working vaccine. The vexing problem in designing an HCV vaccine is the virus’s extreme variability,
which helps it evade immune surveillance. A random matrix theory (RMT) based “noise cleaning”
correlation matrix estimator is used to reveal a group of “multi-dimensionally conserved sites” in an
HCV protein that may be most susceptible to immune pressure, despite the high mutability of the
virus. This statistical approach demonstrates for the first time that such vulnerable parts exist
in HCV; targeting them can lead to the design of an efficacious vaccine against this scourge.
These results are backed up by linking with clinical evidence available in the literature. Two vaccine
designs leveraging such information are also proposed.
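
The abstract does not spell out the estimator, so the following is only a minimal sketch of one common form of RMT “noise cleaning”, under stated assumptions: the sequences are already encoded as a numeric matrix of samples by sites, no site is constant, and eigenvalues of the sample correlation matrix are kept only if they exceed the Marchenko-Pastur upper edge. The function name `rmt_clean_correlation` and this particular thresholding rule are illustrative assumptions, not the procedure used in the thesis.

```python
import numpy as np

def rmt_clean_correlation(X):
    """Sketch of an RMT-style cleaned correlation matrix.

    X is an (n_samples, n_sites) numeric array; every column is assumed
    to have nonzero variance (hypothetical preprocessing step).
    """
    n, p = X.shape
    # Standardize each site so that Z.T @ Z / n is the sample correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    C = (Z.T @ Z) / n

    # Eigendecomposition of the symmetric correlation matrix (ascending order).
    eigvals, eigvecs = np.linalg.eigh(C)

    # Marchenko-Pastur upper edge for uncorrelated, unit-variance data.
    lambda_max = (1.0 + np.sqrt(p / n)) ** 2

    # Keep only the eigenmodes that stand out above the noise edge.
    keep = eigvals > lambda_max
    V = eigvecs[:, keep]
    C_clean = (V * eigvals[keep]) @ V.T

    # Restore the unit diagonal expected of a correlation matrix.
    np.fill_diagonal(C_clean, 1.0)
    return C_clean, int(keep.sum())
```

In this sketch, sites that load heavily on the retained eigenvectors would be the candidates for a collectively constrained group; how the thesis actually selects its “multi-dimensionally conserved sites” is not stated in the abstract.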
Beyond identifying sites of immunological significance, the second part of this work shows the
power of this approach in predicting sites with biochemical (structural or functional) significance
using only the viral sequence data. This work serves as the first exhibition of a statistical
approach capable of addressing this fundamental problem in biology for viruses. However, this analysis
also reveals that the proposed method cannot identify distinct groups of biochemically important
sites. To tackle this problem, a robust method is proposed which, in addition to using the RMT
concepts, exploits the sparsity embedded in the problem using sparse principal component analysis
techniques. This approach identifies multiple distinct groups of sites, each associated with
a specific structural or functional property, making it the first statistical
procedure to reveal the modular structure of the viral proteins.
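
To illustrate how sparsity can separate sites into distinct groups, here is a minimal sketch that uses a truncated power iteration as a stand-in for the sparse principal component analysis techniques mentioned above. The functions `truncated_power_pc` and `sparse_groups`, the fixed group size `k`, and the deflation step that zeroes out already selected sites are all assumptions made for illustration; the abstract does not specify the algorithm used in the thesis.

```python
import numpy as np

def truncated_power_pc(C, k, n_iter=200):
    """Approximate leading eigenvector of C with at most k nonzero entries."""
    # Warm start from the dense leading eigenvector.
    v = np.linalg.eigh(C)[1][:, -1]
    for _ in range(n_iter):
        w = C @ v
        # Zero every entry except the k largest in absolute value.
        smallest = np.argsort(np.abs(w))[:-k]
        w[smallest] = 0.0
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break
        v = w / norm
    return v

def sparse_groups(C, n_groups, k):
    """Extract n_groups disjoint groups of k sites from a correlation matrix."""
    C = np.array(C, dtype=float)  # work on a copy
    groups = []
    for _ in range(n_groups):
        v = truncated_power_pc(C, k)
        support = np.flatnonzero(v)
        groups.append(support)
        # Deflate by removing the selected sites from further consideration.
        C[support, :] = 0.0
        C[:, support] = 0.0
    return groups
```

A dense principal component generally has nonzero weight on every site, which makes group boundaries hard to read off; forcing each component onto a small support is what lets each one be interpreted as a separate group of sites in this sketch.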
A simulation model is also presented that provides a statistical ground truth for interpreting
the results obtained using the developed methods.