THESIS
2019
vii, 120 pages : illustrations ; 30 cm
Abstract
De novo assemblers construct genome sequences from small fragments without using
any reference genome. Specifically, they represent the fragments in a de Bruijn graph and
traverse the graph to generate the sequence. As constructing and traversing a large de Bruijn
graph is both time- and memory-intensive, we develop UNIPAR, a parallel software
package that runs this process on a cluster of GPU-equipped computers. In particular, it
utilizes all processor cores in each CPU and GPU, all CPUs and GPUs in a computer
node, and all computer nodes of the cluster. Furthermore, we analyze the characteristics
of genome data to design a concurrent hashing algorithm for the graph construction, and
to reduce the communication overhead in the graph traversal. We further improve the
overall performance by partitioning and storing the data in a compact format, pipelining
data transfer and computation, and overlapping computation and communication. Our
experiments show that on real-world datasets, UNIPAR is an order of magnitude faster
than the state-of-the-art shared-memory assemblers, and more than five times faster
than the current distributed assemblers.
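
To make the two phases named in the abstract concrete, the following is a minimal single-threaded sketch of de Bruijn graph construction and traversal. It is an illustration only, not UNIPAR's implementation: the function names `build_de_bruijn` and `assemble_contigs` are hypothetical, a plain Python dictionary stands in for the concurrent hash table, and the sketch ignores reverse complements, sequencing errors, and any CPU/GPU or multi-node parallelism.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    (Hypothetical helper; a dict replaces the concurrent hash table.)
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_contigs(graph):
    """Walk maximal non-branching paths and spell out one contig per path.

    Isolated cycles (no branching node to start from) are skipped
    in this simplified sketch.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(n):
        # Exactly one way in and one way out: safe to extend through.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for succ in sorted(graph.get(node, ())):
            contig = node + succ[-1]
            cur = succ
            while is_simple(cur):
                (cur,) = graph[cur]
                contig += cur[-1]
            contigs.append(contig)
    return contigs


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
print(assemble_contigs(build_de_bruijn(reads, k=4)))
```

With these three overlapping toy reads and k = 4, the graph is a single unbranched path, so the traversal yields one contig covering all reads. In a distributed setting the dictionary would be partitioned across nodes, which is where the hashing and communication costs discussed in the abstract arise.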