SNP data analysis in genome-wide association studies
by Can Yang
Ph.D. Electronic and Computer Engineering
xviii, 155 p. : ill. ; 30 cm
The genome-wide association study (GWAS) is a popular strategy for studying complex diseases. A GWAS genotypes 10^5 to 10^6 single-nucleotide polymorphisms (SNPs) from thousands of individuals. Compared to traditional “candidate gene” studies, in which a limited number of genetic variants are assayed, GWAS provides much wider coverage of the human genome and enables us to identify more genetic variants associated with diseases. By October 2010, about 700 publications had reported around 3000 SNPs associated with 150 diseases or complex traits.
Despite the success of GWAS, most of these findings explain only a small portion of the genetic contribution to complex diseases. For example, the 18 SNPs identified in type 2 diabetes (T2D) together account for only about 6% of the inherited risk. A large portion of disease/trait heritability remains unexplained; recent GWAS publications refer to it as the missing heritability. Possible reasons for the missing heritability have been discussed recently, e.g., efforts to detect gene-gene interactions and SNPs with relatively small effects have so far not been fruitful. We address the issue of finding the missing heritability in this thesis as follows:
First, we consider the issue of detecting gene-gene interactions, which are believed to be ubiquitous components of the biological pathways that underlie complex diseases. Gene-gene interactions occur when the joint effect of two genes deviates from the simple sum of their individual effects. Prior to our work, gene-gene interactions were detected from SNP data with a stepwise approach, which is computationally feasible. It first analyzes one SNP at a time to evaluate the single-SNP effect (i.e., the marginal effect of a single SNP) and then selects the SNPs with large marginal effects to check their interaction effects. Because a single SNP with a weak effect on disease risk may show a large joint effect with other SNPs through interaction, this stepwise approach misses much genetic information. To guarantee that no strong interactions are missed, all SNP pairs across the whole genome must be examined, which is computationally intensive. For example, we need to evaluate 1.25 × 10^11 SNP pairs for 500,000 SNPs.
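The pair count quoted above follows from the number of unordered pairs among n SNPs, n(n-1)/2. A minimal sketch of that arithmetic (the function name `snp_pair_count` is hypothetical, introduced only for this illustration):

```python
from math import comb


def snp_pair_count(n_snps: int) -> int:
    """Number of unordered SNP pairs among n_snps SNPs, i.e. n(n-1)/2."""
    return comb(n_snps, 2)


pairs = snp_pair_count(500_000)
print(pairs)           # 124999750000
print(f"{pairs:.2e}")  # 1.25e+11
```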
We have developed a fast computational method named “BOolean Operation based Screening and Testing” (BOOST) to address this issue. BOOST can finish all pairwise tests within 60 hours on a single desktop computer. It is about 60 times faster than the state-of-the-art method PLINK. It has enabled us to efficiently identify interaction patterns in GWAS. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). The interaction patterns identified from the type 1 diabetes data set differ significantly from those identified from the rheumatoid arthritis data set, although the two data sets share a very similar hit region in the WTCCC report.
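The abstract names Boolean operations as the basis of the method but does not describe the data layout. As an illustration only, and not BOOST's actual implementation, the sketch below shows one way bitwise operations can build the 3 × 3 genotype contingency table for a SNP pair: each SNP is stored as three bit vectors (one per genotype value 0, 1, 2), and each joint count is the number of set bits in a bitwise AND. All function names and the toy genotypes are assumptions made for this example.

```python
def pack_genotypes(genotypes):
    """Pack genotypes (values 0, 1, 2) into three Python ints used as bit vectors.

    Bit i of masks[g] is set when individual i carries genotype g.
    """
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def popcount(x):
    """Count the set bits of a non-negative integer."""
    return bin(x).count("1")


def joint_table(masks_a, masks_b):
    """3 x 3 table: entry [a][b] counts individuals with genotype a at SNP A and b at SNP B."""
    return [[popcount(ma & mb) for mb in masks_b] for ma in masks_a]


# Toy data for six individuals (hypothetical genotypes).
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [0, 1, 1, 2, 0, 2]
table = joint_table(pack_genotypes(snp_a), pack_genotypes(snp_b))
print(table)  # [[2, 0, 0], [0, 1, 1], [0, 1, 1]]
```

In a case-control setting, one such table would be built for cases and one for controls. The appeal of this layout is that a single AND plus a bit count processes many individuals at once.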
Second, we extend BOOST to detect “two-locus associations allowing for interactions” in GWAS. This type of association involves SNPs that influence diseases through a combination of marginal effects and interaction effects. We demonstrate the applicability of our method on nine real GWAS data sets: the age-related macular degeneration data set, the Parkinson’s disease data set, and seven data sets from WTCCC. Our method has discovered some associations that were not identified before, which may contribute to finding the missing heritability.
Third, we observe that some SNP pairs are significantly associated with disease while showing neither significant marginal effects nor significant interaction effects. The reason is that the marginal effects of correlated variables do not faithfully express their significant joint effect, because the effects of correlated variables can cancel each other out. This phenomenon is referred to as “unfaithfulness”. We have developed a computational method to detect such associations. Our results show that associations masked by unfaithfulness commonly exist in GWAS, but they are not identified by either marginal analysis or interaction analysis.
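The cancellation behind unfaithfulness can be seen in a simplified linear setting. This is an assumption made purely for illustration; the thesis works with case-control genotype data, not this model. Suppose two standardized predictors $x_1, x_2$ have correlation $\rho$, and the trait is $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ with $\varepsilon$ independent of both predictors.

```latex
\begin{align*}
\operatorname{Cov}(y, x_1) &= \beta_1 + \rho\,\beta_2, &
\operatorname{Cov}(y, x_2) &= \beta_2 + \rho\,\beta_1, \\
\operatorname{Var}(\beta_1 x_1 + \beta_2 x_2) &= \beta_1^2 + \beta_2^2 + 2\rho\,\beta_1\beta_2 .
\end{align*}
% Example: \beta_1 = 1, \beta_2 = -1, \rho = 0.9
\begin{align*}
\operatorname{Cov}(y, x_1) &= 1 - 0.9 = 0.1, &
\operatorname{Cov}(y, x_2) &= -1 + 0.9 = -0.1, \\
\operatorname{Var}(x_1 - x_2) &= 1 + 1 - 1.8 = 0.2 .
\end{align*}
```

In this example each predictor alone explains a variance of only $0.1^2 = 0.01$, whereas the pair jointly explains $0.2$, twenty times as much, even though the model contains no interaction term.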
Finally, we propose a method to detect disease-associated SNPs with small effects, which may be an important reason for the missing heritability. The chance that a causal SNP is directly genotyped is less than 8%. Because of imperfect linkage disequilibrium, the observed effects of its neighboring SNPs decay substantially, so that the causal effect remains undiscovered. Even when a causal SNP has been directly genotyped, it is still challenging to detect it if its effect is small. Since the disease-associated SNPs account for only a small fraction of the entire SNP set, we formulate this problem as Contiguous Outlier DEtection (CODE). In our formulation, we cast the disease-associated SNPs as outliers and further impose a spatial continuity constraint on outlier detection. It turns out that our formulation can be solved exactly by graph cuts. Using two independent Crohn’s disease data sets, we have shown that this method is more powerful than existing methods in detecting signals with small or moderate effect sizes.