THESIS
2013
xii, 68 pages : illustrations ; 30 cm
Abstract
MapReduce is a processing paradigm invented by Google that has been widely used for
both high performance computing and big data processing. Though the programming
model is simple, it is very challenging to tune the performance of a MapReduce job.
The main reason is that a job usually has many configuration parameters that must be tuned
carefully according to both the data distribution and the execution environment (e.g., cluster
size and failure rate). It is therefore desirable to estimate the execution time of a job
efficiently, given the input data and the settings, without actually executing it. For this purpose, we can use
either a theoretical model or a simulator. However, due to the complex internal execution
flow and the various types of failure that are hard to model, it is extremely difficult to build
theoretical models that analyze the performance of MapReduce jobs. Moreover, to the best of
our knowledge, none of the existing MapReduce simulators can accurately simulate system
behavior in the presence of unpredictable failures and skewed input data.
In this thesis, we propose a new MapReduce performance tuner called MRTune. Compared
with other approaches to runtime estimation, MRTune explicitly accounts for system failures
and data skew, which makes it very useful for tuning the parameters of real
MapReduce jobs. It can also be used to study the influence of skewed data on MapReduce
applications. The model of MRTune simulates the basic components of a MapReduce
application using discrete event simulation in order to guarantee efficiency, and the input specification of MRTune imports the information describing the data distribution.
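To make the simulation approach concrete, the sketch below shows a minimal discrete event loop for a map phase followed by a reduce phase, with tasks running in waves over a fixed number of slots. All names (run_wave, simulate_job) and the folding of shuffle cost into the reduce times are illustrative assumptions, not MRTune's actual design.

```python
import heapq

def run_wave(task_times, slots, start_time):
    """Run tasks over a fixed number of slots with a tiny event queue and
    return the time at which the last task finishes."""
    pending = list(task_times)
    events = []                        # min-heap of task completion times
    clock = start_time
    for _ in range(min(slots, len(pending))):
        heapq.heappush(events, clock + pending.pop())
    while events:
        clock = heapq.heappop(events)  # advance to the next completion event
        if pending:                    # a slot has freed up: start another task
            heapq.heappush(events, clock + pending.pop())
    return clock

def simulate_job(map_times, reduce_times, map_slots, reduce_slots):
    # Maps run first; reduces start only after the whole map phase
    # (shuffle cost is folded into reduce_times here for brevity).
    maps_done = run_wave(map_times, map_slots, 0.0)
    return run_wave(reduce_times, reduce_slots, maps_done)

# Example: 8 maps of 10 s on 4 map slots, then 3 reduces of 20 s on 2 slots.
print(simulate_job([10.0] * 8, [20.0] * 3, 4, 2))   # -> 60.0
```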
We validated MRTune with inputs under both even and Zipfian distributions, implementing
Word Count and Inverted Indexing to test its accuracy on a simple benchmark and a more
complicated job. We also used MRTune for case studies of data skew and failures.
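As an illustration of such skewed test input, a Zipfian key stream could be generated along the following lines; the vocabulary size, skew exponent, and record count are placeholders rather than the settings used in the experiments.

```python
import random

def zipf_keys(num_keys, skew, n_records, seed=0):
    """Draw n_records keys from a Zipf-like distribution: key of rank r gets
    probability proportional to 1 / r**skew, so a few keys dominate."""
    rng = random.Random(seed)
    weights = [1.0 / rank ** skew for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=n_records)

# Placeholder sizes: 10,000 records drawn over 1,000 distinct keys, skew 1.0.
records = zipf_keys(num_keys=1_000, skew=1.0, n_records=10_000)
```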
In previous research on skew handling, most formulations of the partitioning skew problem
assume that the number of keys and the number of reducers are close, or at least of the same
order of magnitude. We observed that partitioning skew is greatly mitigated when the
number of keys is much larger than the number of reducers, which is often the case in real applications.
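This observation can be reproduced with a toy partitioning experiment: draw Zipfian keys, partition them modulo the number of reducers (standing in for Hadoop's default hash partitioner), and measure how far the busiest reducer exceeds the average load. The key counts and skew value below are illustrative only.

```python
import random
from collections import Counter

def max_to_avg_load(num_keys, num_reducers, skew=1.0, n_records=100_000, seed=0):
    """Partition a Zipfian key stream across reducers (key % num_reducers,
    standing in for a hash partitioner) and return max load / average load."""
    rng = random.Random(seed)
    weights = [1.0 / rank ** skew for rank in range(1, num_keys + 1)]
    keys = rng.choices(range(num_keys), weights=weights, k=n_records)
    loads = Counter(k % num_reducers for k in keys)
    return max(loads.values()) / (n_records / num_reducers)

# Imbalance shrinks as the number of distinct keys grows far past 16 reducers.
for num_keys in (20, 200, 2_000, 200_000):
    print(num_keys, round(max_to_avg_load(num_keys, num_reducers=16), 2))
```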
Moreover, in most cases users will set the number of reduce tasks equal to the
number of reduce slots in the cluster. However, if the number of reduce slots is too small, the
influence of a failure is huge. We observed that in this case the influence of failures
decreases as the number of reduce tasks is raised.
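A back-of-the-envelope version of this effect, under the idealised assumption that equal-sized reduce tasks run in waves and a single failed task is simply rerun at the end, might look like the following sketch; the slot count and work amounts are hypothetical.

```python
import math

def reduce_phase_time(total_work, num_tasks, num_slots, failed_tasks=0):
    """Idealised reduce phase: equal-sized tasks run in waves over the slots,
    and each failed task is rerun once, adding to the number of task waves."""
    task_time = total_work / num_tasks
    waves = math.ceil((num_tasks + failed_tasks) / num_slots)
    return waves * task_time

# Relative slowdown caused by one failed reduce task on a 16-slot cluster,
# for the same total work split into more and more reduce tasks.
for num_tasks in (16, 32, 64, 128):
    base = reduce_phase_time(1000.0, num_tasks, 16)
    hit = reduce_phase_time(1000.0, num_tasks, 16, failed_tasks=1)
    print(num_tasks, f"{(hit - base) / base:.0%}")
```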