THESIS
2021
1 online resource (xv, 104 pages) : illustrations (chiefly color)
Abstract
Molecular evolution is shaped by a combination of evolutionary forces such as
mutation, drift, selection, and recombination. The inference of these parameters
from time-series genomic data has remained an important subject of research with applications
in immunology, agriculture, pathology and vaccine design, to name a few. These
efforts have received a boost from recent technological advances in short-read sequencing,
which have enabled researchers to collect high-resolution time-series genetic
data. Allele frequency trajectories observed from short-read time-series data have already
been used to infer a variety of evolutionary parameters. While allele frequency trajectories
for the entire genome are readily observable in short-read data, the allele-pair frequencies
are only observable when both loci occur within a read. Allele-pair frequency information
is critical when working with linkage-aware inference methods, which not only take into
account the trajectories of allele frequencies, but also how those trajectories are affected
by the evolution of other alleles. To our knowledge, no scalable method exists that
can resolve the latent allele-pair frequencies from short-read time-series data and perform
inference.
Marginal path likelihood (MPL) is a recently proposed inference method that infers
selection coefficients from time-series data. It is a computationally efficient and scalable
method based on a closed-form expression of the selection coefficients in terms of the allele
and allele-pair frequencies. As MPL requires knowledge of the allele-pair frequencies,
which are not completely observable in short-read data, it cannot be used directly on
short-read time-series data.
This work developed an extension of MPL based on population reconstruction, a
first-of-its-kind framework that can resolve unobservable allele-pair frequencies from large-scale
short-read time-series data and infer selection. The main idea was to reconstruct the latent
allele-pair frequencies by performing population reconstruction efficiently, exploiting the
concept of bootstrap aggregation. The framework was validated on short-read time-series
data of the group-specific antigen (gag) protein of the human immunodeficiency virus
(HIV). A preliminary study on a maximum-entropy-based framework for the
inference of unobservable covariance matrix entries was also conducted.