THESIS
2014
xi, 101 pages : illustrations ; 30 cm
Abstract
As cloud and big data computation grow increasingly important as paradigms, providing a general abstraction for datacenter-scale programming has become a pressing research agenda. Researchers have proposed, designed, and implemented various computation models and systems at different abstraction levels, such as MapReduce, X10, Dryad, Storm, and Spark. However, many abstractions expose the distributed details of the platform to the application layer, leading to increased programming complexity, decreased performance, and, sometimes, loss of generality. At the data substrate layer, traditional cloud computing technologies, such as MapReduce, use disk-based file systems as the system-wide substrate for data storage and sharing. A distributed file system provides a global name space and stores data persistently, but it also introduces significant overhead. Several recent systems instead store data in DRAM and thereby improve the performance of cloud computing systems tremendously. However, both our own experience and related work indicate that simply substituting distributed DRAM for the file system does not provide a solid and viable foundation for data processing and storage in the datacenter environment; moreover, the capacity of such systems is limited by the amount of physical memory in the cluster.
To support general, efficient, flexible, and concurrent application workloads with sophisticated data processing, we present programmers with the illusion of a big virtual machine built on top of one, multiple, or many compute nodes, unifying the physical memory and disks of those nodes into a globally addressable data substrate. We design a new instruction set architecture, i0, that unifies myriad compute nodes into a big virtual machine called MAZE, where thousands of tasks run concurrently in VOLUME, a large, unified, and snapshotted distributed virtual memory. i0, MAZE, and VOLUME form the foundation of the Layer Zero systems, which provide a general substrate for cloud computing. i0 provides a simple yet general and scalable programming model. VOLUME mitigates the scalability bottleneck of traditional distributed shared memory systems and unifies the physical memory and disks of many compute nodes into a distributed transactional virtual memory. VOLUME provides a general memory-based abstraction, exploits DRAM in the system to accelerate computation, and, transparently to programmers, scales the system to process and store large datasets by swapping data to disks and remote servers. Together with MAZE's efficient execution engine, this allows the capacity of a MAZE to scale up to support large datasets and large clusters.
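
To make the snapshotted, transactional semantics described above concrete, the following is a minimal, self-contained sketch of snapshot-based transactional memory in Python. It is an in-process toy under our own assumptions, not the thesis's implementation: the names Volume, Txn, begin, read, write, and commit are hypothetical, and the real VOLUME spans DRAM, disks, and remote servers rather than a single in-memory dictionary.

    # Toy sketch of snapshot-based transactional memory.
    # All names here are hypothetical illustrations; the actual
    # i0/VOLUME interface is not given in this abstract.

    class Txn:
        def __init__(self, volume, snapshot):
            self._volume = volume
            self._snapshot = snapshot   # immutable view taken at begin time
            self._writes = {}           # updates buffered until commit

        def read(self, addr):
            # Reads see this transaction's own writes, else the snapshot.
            if addr in self._writes:
                return self._writes[addr]
            return self._snapshot.get(addr)

        def write(self, addr, value):
            self._writes[addr] = value

        def commit(self):
            # Try to publish buffered writes atomically as a new snapshot.
            return self._volume._commit(self._snapshot, self._writes)

    class Volume:
        def __init__(self):
            self._current = {}          # latest committed snapshot

        def begin(self):
            return Txn(self, self._current)

        def _commit(self, base, writes):
            # First-committer-wins: abort if another transaction has
            # committed since this transaction's snapshot was taken.
            if base is not self._current:
                return False
            self._current = {**self._current, **writes}
            return True

    vol = Volume()
    t = vol.begin()
    t.write("x", 42)
    assert t.commit()                   # publishes a new snapshot
    assert vol.begin().read("x") == 42  # later transactions see it

The first-committer-wins check in _commit stands in for whatever conflict-detection policy the real system uses; the point of the sketch is only that concurrent tasks each see a consistent snapshot and publish updates atomically.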
We have implemented the Layer Zero systems on several platforms, and have designed and implemented various benchmarks, graph processing and machine learning programs, and application frameworks. Our evaluation shows that Layer Zero delivers excellent performance and scalability. On one physical host, its overhead is comparable to that of traditional VMMs. On 16 physical hosts, Layer Zero runs 10 times faster than Hadoop and X10. On 160 physical compute servers, Layer Zero scales linearly on a typical iterative workload.