Statistical issues in genome-wide association studies

HKUST Electronic Theses

by Wei Jiang

THESIS 2016

Ph.D. Electronic and Computer Engineering

xxvi, 156 pages : illustrations ; 30 cm

Abstract

Genome-wide association studies (GWASs) are widely used to discover single nucleotide polymorphisms (SNPs) associated with diseases. Commonly, we use a multi-stage setting to discover associations and to validate identified findings. Under such a setting, we discover associations in primary studies and validate findings in replication studies. Only the associations showing statistical significance in both studies are regarded as true findings. In this dissertation, we study three statistical issues in multi-stage GWASs. Another related statistical issue is how to improve power with multiple GWAS data sets. This dissertation also proposes a novel joint analysis method using summary statistics from multiple GWASs.

First, we study how to estimate the power of replication studies in multi-stage GWASs. The traditional approach estimates the power by plugging observed effect sizes into the power calculation. However, this approach leaves the designed replication study underpowered: only primary associations (i.e., statistically significant associations in the primary study) are carried forward, so their observed effect sizes are inflated by the "winner's curse". In this dissertation, we propose an Empirical Bayes (EB)-based method to estimate the power of a replication study for each association. Simulation experiments show that our method outperforms plug-in-based estimators in overcoming the winner's curse and in estimation accuracy. Experiments on data of six diseases from the Wellcome Trust Case Control Consortium (WTCCC) show that the sample sizes determined from power estimated by our method are larger than those obtained with the traditional approach.
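The winner's curse described above can be illustrated with a small simulation. The sketch below is not the EB method of the dissertation; it only shows why the plug-in estimate is optimistic. It assumes, for illustration, that every SNP has the same true z-score mean `mu`, that the replication study has the same sample size as the primary study, and that the thresholds are 5e-8 (primary) and 0.05 (replication). All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_power(mu, alpha):
    """Power of a two-sided z-test at level alpha when the z-score has mean mu."""
    c = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(mu - c) + STD_NORMAL.cdf(-mu - c)


def simulate_winners_curse(mu=3.0, alpha_primary=5e-8, alpha_replication=0.05,
                           n_snps=200_000, seed=0):
    """Compare true replication power with the plug-in estimate.

    Assumptions (illustrative only): all SNPs share the true mean `mu`,
    and the replication study has the same sample size as the primary one.
    """
    rng = random.Random(seed)
    c1 = STD_NORMAL.inv_cdf(1 - alpha_primary / 2)
    # Primary-study z-scores; keep only the significant ("primary") associations.
    selected = [z for z in (rng.gauss(mu, 1.0) for _ in range(n_snps))
                if abs(z) > c1]
    # True power uses the true mean; plug-in power uses the observed z-score.
    true_power = two_sided_power(mu, alpha_replication)
    plug_in_power = sum(two_sided_power(z, alpha_replication)
                        for z in selected) / len(selected)
    return true_power, plug_in_power, len(selected)


if __name__ == "__main__":
    true_power, plug_in_power, n_selected = simulate_winners_curse()
    print(f"selected SNPs: {n_selected}")
    print(f"true replication power:     {true_power:.3f}")
    print(f"mean plug-in power estimate: {plug_in_power:.3f}")
```

Because selection keeps only z-scores beyond the primary threshold, the plug-in estimate is computed from inflated effect sizes and exceeds the true power, so a sample size chosen from it would be too small.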

Second, we study the probability of a primary association being validated in the replication study. This dissertation proposes a Bayesian probabilistic measure, named the replication rate (RR), to quantify this probability. We further provide an estimation method for RR that uses only the summary statistics from the primary study. The estimated RR can be used to determine the sample size of the replication study and to check the consistency between the results of the primary study and those of the replication study. Simulation and real-data experiments show that the estimated RR has good prediction and calibration performance. We also use these experiments to demonstrate the usefulness of RR.
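To make the idea of a Bayesian replication probability concrete, the following sketch computes the predictive probability that a replication z-score is significant in the same direction as the primary one. It is a minimal illustration under assumptions that are not taken from the dissertation: a two-group prior in which the true z-score mean is 0 with probability `pi0` and otherwise drawn from N(0, `tau2`), a one-directional replication criterion, and known hyperparameters (in practice they would have to be estimated from the primary summary statistics). The function name and all parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def predicted_replication_prob(z1, pi0, tau2, alpha2=0.05, n_ratio=1.0):
    """Predictive probability that the replication z-score is significant
    in the direction of the primary z-score z1.

    Assumed model (illustrative, not the dissertation's estimator):
      mean = 0 with prob. pi0, else mean ~ N(0, tau2);
      z1 | mean ~ N(mean, 1);  z2 | mean ~ N(sqrt(n_ratio) * mean, 1),
    where n_ratio is replication sample size / primary sample size.
    """
    z = abs(z1)                      # work in the direction of the primary effect
    k = sqrt(n_ratio)
    c2 = NormalDist().inv_cdf(1 - alpha2 / 2)

    # Posterior probability that the SNP is non-null given z1.
    null_density = pi0 * NormalDist(0, 1).pdf(z)
    alt_density = (1 - pi0) * NormalDist(0, sqrt(1 + tau2)).pdf(z)
    p_nonnull = alt_density / (null_density + alt_density)

    # Non-null case: the posterior mean is shrunk towards zero, and the
    # predictive distribution of z2 accounts for posterior uncertainty.
    shrink = tau2 / (1 + tau2)
    predictive = NormalDist(k * shrink * z, sqrt(1 + k * k * shrink))
    p_rep_nonnull = 1 - predictive.cdf(c2)

    # Null case: z2 ~ N(0, 1), so replication happens only by chance.
    p_rep_null = 1 - NormalDist().cdf(c2)

    return (1 - p_nonnull) * p_rep_null + p_nonnull * p_rep_nonnull


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 2.0):
        prob = predicted_replication_prob(6.0, pi0=0.99, tau2=4.0, n_ratio=ratio)
        print(f"n_ratio={ratio}: predicted replication probability = {prob:.3f}")
```

Scanning `n_ratio` in this way shows how such a probability could guide the choice of replication sample size: the shrinkage built into the posterior makes the prediction more conservative than plugging in the observed z-score.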

Third, we study how to determine significance levels in multi-stage settings. In traditional methods, the significance levels of the primary and replication studies are determined separately. We argue that this separate-determination strategy reduces the power of the overall multi-stage study. Therefore, we propose a novel method to determine the significance levels jointly. Our method is a reanalysis method that needs summary statistics from both studies. We find the most powerful significance levels subject to controlling the false discovery rate (Fdr) in the multi-stage study. To benefit from the power improvement of the joint-determination method, we suggest selecting SNPs for replication at a less stringent significance level. Simulation experiments show that our method provides more power than traditional methods while keeping the Fdr well controlled. Empirical experiments on data sets of five diseases/traits demonstrate that our method can help identify more associations.

Finally, we study joint analysis methods using summary statistics from multiple GWASs. Traditionally, meta-analysis methods are used to complete this task. We propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the Fdr at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous data sets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical data sets of four phenotypes.
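The idea of ranking SNPs by a joint local false discovery rate can be sketched as follows. This is a hedged illustration, not the dissertation's implementation: it assumes two studies, a two-group model in which null z-scores are independent N(0, 1) and non-null z-scores are independent N(0, 1 + tau2) with a study-specific `tau2` (a simple way to allow heterogeneity), and known hyperparameters. The rejection rule uses the standard two-group relationship that the Fdr of a rejection set is approximately the average local fdr of its members. All names and parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def joint_local_fdr(z1, z2, pi0, tau2_1, tau2_2):
    """Posterior probability of the null given the z-scores from two studies.

    Assumed model (illustrative only): null z-scores are independent N(0, 1);
    non-null z-scores are independent N(0, 1 + tau2_k) in study k.
    """
    f0 = NormalDist(0, 1).pdf(z1) * NormalDist(0, 1).pdf(z2)
    f1 = (NormalDist(0, sqrt(1 + tau2_1)).pdf(z1)
          * NormalDist(0, sqrt(1 + tau2_2)).pdf(z2))
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)


def reject_by_local_fdr(lfdr_values, q):
    """Return indices of the largest set of smallest-lfdr items whose
    average lfdr does not exceed q (an estimate of the set's Fdr)."""
    order = sorted(range(len(lfdr_values)), key=lambda i: lfdr_values[i])
    rejected, running_sum = [], 0.0
    for rank, i in enumerate(order, start=1):
        running_sum += lfdr_values[i]
        # The running mean of ascending values never decreases,
        # so the first violation ends the search.
        if running_sum / rank > q:
            break
        rejected.append(i)
    return rejected


if __name__ == "__main__":
    z_pairs = [(5.0, 4.5), (0.3, -0.2), (4.0, 0.5), (1.0, 1.2), (6.0, 1.0)]
    lfdr = [joint_local_fdr(a, b, pi0=0.9, tau2_1=4.0, tau2_2=9.0)
            for a, b in z_pairs]
    print("rejected indices:", reject_by_local_fdr(lfdr, q=0.05))
```

Unlike a fixed-effect meta-analysis, which collapses the two z-scores into a single combined statistic, this rule ranks SNPs by the full pair of z-scores, which is what allows it to respond to studies with different effect-size distributions.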


Copyrighted to the author. Reproduction is prohibited without the author's prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: Ph.D.
Department: Electronic and Computer Engineering
Supervisors: Yu, Weichuan
Authors: Jiang, Wei
Subjects: Genomes Statistical methods; DNA replication; Variation (Biology)
Language: English
Call number: Thesis ECED 2016 Jiang
DOI: 10.14711/thesis-b1750028
