THESIS
2023
1 online resource (xviii, 131 pages) : color illustrations
Abstract
Data integration and causal inference are two important tasks for effective utilization
of big data. Data integration is especially critical in biological studies.
For example, in single-cell studies, there is a pressing need for data integration
methods to assemble numerous datasets originating from multiple sources into
a single comprehensive cell atlas. However, the task of single-cell data integration
poses challenges due to the heterogeneity of diverse data sources and the
extremely large scale of datasets to be integrated. It is difficult for integration
approaches to be not only computationally scalable, but also capable of preserving
a wide range of fine-grained cell populations during integration. Causal
inference, which aims to infer the causal relationship between a risk factor and
an outcome of interest, is also essential in biomedical research. In recent years,
Mendelian randomization (MR) has gained increasing attention, as it can use widely
available data from genome-wide association studies (GWAS) to perform causal
inference. However, existing MR analyses often rely on strong assumptions that are
frequently violated in practice, resulting in many false positive findings. In this
thesis, we propose two computational and statistical methods to address the above
challenges in single-cell data integration and MR-based causal inference,
respectively.
For comprehensive integration of atlas-level single-cell datasets, we propose a
deep learning-based method named Portal. Viewing datasets from different studies
as distinct domains with domain-specific effects, Portal achieves data integration
through a unified framework of uniquely designed domain translation
networks. Owing to its model and algorithm design, Portal achieves superior
performance in preserving biological variation during integration, while
integrating millions of cells in minutes with low memory consumption. By
conducting a benchmarking study using heterogeneous collections of atlas-level
scRNA-seq datasets, we showed Portal’s superior accuracy and computational
efficiency compared to other state-of-the-art single-cell integration algorithms.
We then showed that Portal can accurately integrate datasets across data types,
including scRNA-seq, snRNA-seq and scATAC-seq data. Finally, we demonstrated
the power of Portal by applying it to the integration of cross-species
datasets with limited shared information among them, elucidating biological insights
into the similarities and divergences in the spermatogenesis process among
mouse, macaque and human.
For inferring causal relationships among traits, we propose a statistical method
named MR-APSS. MR-APSS leverages GWAS summary statistics to perform
causal inference in the framework of MR. MR-APSS relaxes strong MR assumptions
by accounting for two major confounding factors, pleiotropy and sample
structure, simultaneously with genome-wide information. It also allows for the
inclusion of more genetic instruments with moderate effects to improve statistical
power without inflating type I errors. We first evaluated MR-APSS using
comprehensive simulations and negative controls, and then applied MR-APSS to
study the causal relationships among a collection of diverse complex traits. The
results suggest that MR-APSS can better identify plausible causal relationships
with high reliability. In particular, MR-APSS performs well for highly polygenic
traits, such as psychiatric disorders and social traits, where the strengths
of instrumental variables (IVs) tend to be relatively weak and most existing MR
methods for causal inference are vulnerable to confounding effects.
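
To make "causal inference from GWAS summary statistics" concrete, the sketch below computes the standard inverse-variance weighted (IVW) MR estimate. This is a textbook baseline, not MR-APSS: it does not correct for pleiotropy or sample structure, which is the gap the thesis addresses. The function name `ivw_estimate` and the toy effect sizes are illustrative assumptions, not values from the thesis.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Standard fixed-effect IVW estimate of a causal effect.

    Inputs are per-variant GWAS summary statistics:
      beta_exposure: variant effects on the exposure (risk factor)
      beta_outcome:  variant effects on the outcome
      se_outcome:    standard errors of the outcome effects
    Assumes every variant is a valid instrument (an assumption that
    is often violated in practice).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")

    # Each variant is weighted by the inverse variance of its outcome effect.
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))

    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Toy data (hypothetical): outcome effects are exactly half the exposure effects.
beta_x = [0.10, 0.20, 0.15]
beta_y = [0.05, 0.10, 0.075]
se_y = [0.01, 0.02, 0.015]

estimate, std_error = ivw_estimate(beta_x, beta_y, se_y)
print(f"IVW causal estimate: {estimate:.3f} (SE {std_error:.3f})")
```

Because the IVW estimate attributes every variant-outcome association to the exposure, any pleiotropic or confounded variant biases it directly, which is why methods that model these effects are needed for weak-instrument, highly polygenic traits.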