THESIS
2016
xxvi, 156 pages : illustrations ; 30 cm
Abstract
Genome-wide association studies (GWASs) are widely used to discover single nucleotide
polymorphisms (SNPs) associated with diseases. Commonly, we use a multi-stage setting
to discover associations and to validate identified findings. Under such a setting,
we discover associations in primary studies and validate findings in replication studies.
Only the associations showing statistical significance in both studies are regarded as true
findings. In this dissertation, we study three statistical issues in multi-stage GWASs.
Another related statistical issue is how to improve power with multiple GWAS data sets.
This dissertation also proposes a novel joint analysis method using summary statistics
from multiple GWASs.
First, we study how to estimate the power of replication studies in multi-stage GWASs.
The traditional approach estimates the power by plugging observed effect sizes into the
power calculation. However, because we consider only primary associations (i.e., statistically
significant associations in the primary study), their observed effect sizes are inflated by
the "winner's curse", and a replication study designed with this approach is
underpowered. In this dissertation, we propose an Empirical Bayes (EB)-based method to estimate
the power of a replication study for each association. Simulation experiments
show that our method outperforms plug-in-based estimators in overcoming
the winner's curse and in estimation accuracy. Experiments on data of six
diseases from the Wellcome Trust Case Control Consortium (WTCCC) show that sample
sizes determined by power using our method are higher than those with the traditional
approach.
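The winner's curse described above can be illustrated with a small simulation. The following is only a sketch under assumed settings, not the EB method proposed in the dissertation: the true primary-study noncentrality `mu_true`, the significance levels, the sample-size ratio `n_ratio`, and all function names are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_power(mu, n_ratio, alpha):
    """Two-sided power of a replication z-test.

    mu is the noncentrality in the primary study; the replication study is
    assumed to have n_ratio times the primary sample size.
    """
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = mu * n_ratio ** 0.5
    return STD_NORMAL.cdf(shift - crit) + STD_NORMAL.cdf(-shift - crit)

def simulate_winners_curse(mu_true=3.0, alpha_primary=5e-8, alpha_rep=0.05,
                           n_ratio=1.0, n_sim=200_000, seed=0):
    """Compare plug-in power with true power among primary associations."""
    rng = random.Random(seed)
    crit = STD_NORMAL.inv_cdf(1 - alpha_primary / 2)
    # Keep only z-scores that pass the primary significance threshold.
    selected = [z for z in (rng.gauss(mu_true, 1.0) for _ in range(n_sim))
                if abs(z) > crit]
    # Plug-in estimate: treat each observed z-score as the true effect.
    plug_in = sum(replication_power(z, n_ratio, alpha_rep)
                  for z in selected) / len(selected)
    true_power = replication_power(mu_true, n_ratio, alpha_rep)
    return len(selected), plug_in, true_power

if __name__ == "__main__":
    n_selected, plug_in, true_power = simulate_winners_curse()
    print(n_selected, round(plug_in, 3), round(true_power, 3))
```

In this toy setting the average plug-in power among selected associations exceeds the true power, which is why a sample size chosen from the plug-in estimate tends to be too small.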
Second, we study the probability of a primary association being validated in the replication
study. This dissertation proposes a Bayesian probabilistic measure, named the
replication rate (RR), to quantify this probability. We further provide an estimation method for
RR that uses the summary statistics from the primary study. We can use the
estimated RR to determine the sample size of the replication study and to check the
consistency between the results of the primary study and those of the replication study.
Simulation and real-data experiments show that the estimated RR has good prediction
and calibration performance. We also use these experiments to demonstrate the usefulness
of RR.
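To make the idea of a Bayesian replication probability concrete, the sketch below computes one under an assumed two-group normal model. This is an illustrative stand-in and may differ from the RR definition and estimator in the dissertation: the null proportion `pi0`, the effect-size scale `tau`, the sample-size ratio `n_ratio`, and the function name are all hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_rate(z, pi0=0.99, tau=2.0, n_ratio=1.0, alpha_rep=0.05):
    """Probability that the replication z-test is significant, given the
    primary z-score, under an assumed two-group model:
      null SNPs (probability pi0):  mu = 0
      non-null SNPs:                mu ~ N(0, tau^2)
    with primary z | mu ~ N(mu, 1) and replication z2 | mu ~ N(sqrt(n_ratio) * mu, 1).
    """
    # Posterior probability that the SNP is non-null given z.
    f0 = STD_NORMAL.pdf(z)
    f1 = NormalDist(0, (1 + tau ** 2) ** 0.5).pdf(z)
    p_nonnull = (1 - pi0) * f1 / (pi0 * f0 + (1 - pi0) * f1)

    # Posterior of mu given z for a non-null SNP (shrunk toward zero).
    shrink = tau ** 2 / (1 + tau ** 2)
    post_mean, post_var = shrink * z, shrink

    # Predictive distribution of the replication z-score.
    pred = NormalDist(n_ratio ** 0.5 * post_mean, (1 + n_ratio * post_var) ** 0.5)
    crit = STD_NORMAL.inv_cdf(1 - alpha_rep / 2)
    p_sig_nonnull = 1 - pred.cdf(crit) + pred.cdf(-crit)

    # Null SNPs replicate only by chance, at rate alpha_rep.
    return (1 - p_nonnull) * alpha_rep + p_nonnull * p_sig_nonnull

if __name__ == "__main__":
    for z in (4.0, 5.0, 6.0):
        print(z, round(replication_rate(z), 3))
```

Because the posterior mean shrinks the observed z-score toward zero, this probability is lower than a plug-in power computed from the same z-score, and it rises as the primary evidence gets stronger.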
Third, we study how to determine significance levels in multi-stage settings. In traditional
methods, the significance levels of the primary and replication studies are determined
separately. We argue that the separate-determination strategy reduces the power
in the overall multi-stage study. Therefore, we propose a novel method to determine significance
levels jointly. Our method is a reanalysis method that needs summary statistics
from both studies. We find the most powerful significance levels when controlling the false
discovery rate (Fdr) in the multi-stage study. To enjoy the power improvement from the
joint-determination method, we suggest selecting SNPs for replication at a less stringent
significance level. Simulation experiments show that our method can provide more power
than traditional methods and that the Fdr is well controlled. Empirical experiments
on data sets of five diseases/traits demonstrate that our method can help identify more
associations.
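The benefit of choosing the two significance levels jointly can be seen in a toy calculation. The sketch below is not the reanalysis method of the dissertation: it assumes a simple two-group model in which a fraction `pi1` of SNPs share fixed noncentralities `mu1` and `mu2` in the primary and replication studies, and it searches a grid of levels for the pair with the highest power subject to an approximate Fdr constraint. All names and parameter values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(mu, alpha):
    """Two-sided power of a z-test with noncentrality mu at level alpha."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(mu - crit) + STD_NORMAL.cdf(-mu - crit)

def fdr_and_power(a1, a2, mu1, mu2, pi1):
    """Approximate Fdr and power when a SNP must pass level a1 in the
    primary study and level a2 in the replication study."""
    joint_power = power(mu1, a1) * power(mu2, a2)
    true_pos = pi1 * joint_power
    false_pos = (1 - pi1) * a1 * a2
    return false_pos / (false_pos + true_pos), joint_power

def best_levels(mu1=4.0, mu2=4.0, pi1=1e-4, q=0.05, fixed_a1=None):
    """Grid search for the most powerful (a1, a2) with Fdr <= q.

    If fixed_a1 is given, only a2 is searched, mimicking a design where
    the primary level is set separately.
    """
    grid = [10 ** (-k / 2) for k in range(2, 17)]  # 1e-1 down to 1e-8
    a1_grid = [fixed_a1] if fixed_a1 is not None else grid
    best = None
    for a1 in a1_grid:
        for a2 in grid:
            fdr, pw = fdr_and_power(a1, a2, mu1, mu2, pi1)
            if fdr <= q and (best is None or pw > best[2]):
                best = (a1, a2, pw)
    return best

if __name__ == "__main__":
    print("separate:", best_levels(fixed_a1=5e-8))
    print("joint:   ", best_levels())
```

Under these assumptions the joint search settles on a primary level far less stringent than 5e-8 and reaches noticeably higher power at the same Fdr bound.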
Finally, we study joint analysis methods using summary statistics from multiple GWASs.
Traditionally, meta-analysis methods are used for this task. We propose a novel
summary-statistics-based joint analysis method based on controlling the joint local false
discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based
joint analysis method when controlling the Fdr at a given level. In particular, the
Jlfdr-based method achieves higher power than commonly used meta-analysis methods
when analyzing heterogeneous data sets from multiple GWASs. Simulation experiments
demonstrate the superior power of our method over meta-analysis methods. Also, our
method discovers more associations than meta-analysis methods from empirical data sets
of four phenotypes.
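A minimal sketch of a joint local false discovery rate for two studies follows. It assumes a two-group model with independent z-scores and normal effect-size distributions whose scales `tau1` and `tau2` may differ between studies; these modelling choices, the parameter values, and the function names are assumptions for illustration and may not match the estimation procedure in the dissertation. The selection rule rejects the SNPs with the smallest Jlfdr values for as long as their running mean stays at or below the target Fdr level `q`.

```python
from statistics import NormalDist

def jlfdr(z1, z2, pi0=0.99, tau1=3.0, tau2=1.5):
    """Posterior probability that a SNP is null given z-scores from two
    studies, under an assumed two-group model with independent studies."""
    null = NormalDist(0, 1)
    alt1 = NormalDist(0, (1 + tau1 ** 2) ** 0.5)
    alt2 = NormalDist(0, (1 + tau2 ** 2) ** 0.5)
    f0 = null.pdf(z1) * null.pdf(z2)  # joint density under the null
    f1 = alt1.pdf(z1) * alt2.pdf(z2)  # joint density under the alternative
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)

def select_by_jlfdr(values, q):
    """Return indices of SNPs rejected when the mean Jlfdr of the rejected
    set is kept at or below q."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    selected, running_sum = [], 0.0
    for rank, i in enumerate(order, start=1):
        running_sum += values[i]
        # The running mean of sorted values never decreases, so stop here.
        if running_sum / rank > q:
            break
        selected.append(i)
    return selected

if __name__ == "__main__":
    z_pairs = [(0.3, -0.5), (4.5, 1.0), (3.0, 3.0), (5.0, 5.0), (1.0, 2.0)]
    values = [jlfdr(z1, z2) for z1, z2 in z_pairs]
    print(select_by_jlfdr(values, q=0.05))
```

Unlike a fixed-effect meta-analysis statistic, the rejection region here depends on the whole pair of z-scores, so a strong signal in one study is not diluted by a weak signal in the other when effects are heterogeneous.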