THESIS
2015
Abstract
Recent advances in data warehousing technologies are enabling the storage and processing of
extremely large data sets. In view of this opportunity, the leading cross-bank settlement institute
in China is seeking more business intelligence from the large volume of historical transaction
data it has accumulated over more than 10 years. Although Hive, a mature open-source data
warehousing solution built on top of Hadoop, has been adopted in production,
the efficiency of data storage and processing is suboptimal because the system lacks advanced
customization and optimization. Specifically, overlapping fractions of the original data set
are materialized into different tables, introducing inter-table redundancy. Columns within a table
have also been added and changed inconsistently over the more than 10-year history of the
data sets, resulting in intra-table redundancy. Multiple user groups with differing needs use
different processing engines to access the data sets, causing cross-platform redundancy. Based on
these observations, we propose an optimization design that is transparent to all users of the system.
It minimizes the inter-table, intra-table and cross-platform redundancy. It employs a row columnar
storage format to improve the data storage and processing efficiency and make the data accessible
to multiple processing engines. It also applies workload-based partitioning and indexing strategies
to further improve the data processing efficiency. We implement our optimization strategies on
the institute's system and conduct extensive experiments on one day of historical transaction data.
Experimental results show a 40x improvement in data storage efficiency and a 5x speedup on typical
query workloads.
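
As a rough illustration of why partitioning and column-oriented storage can reduce the data a query has to read, the sketch below groups a few made-up transaction rows by a partition key and stores each partition column by column. It is a toy model under stated assumptions, not the implementation described in the thesis: the field names (`txn_id`, `bank`, `amount`), the choice of `bank` as the partition key, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction rows; field names and values are illustrative only.
rows = [
    {"txn_id": 1, "bank": "A", "amount": 100},
    {"txn_id": 2, "bank": "B", "amount": 250},
    {"txn_id": 3, "bank": "A", "amount": 75},
    {"txn_id": 4, "bank": "C", "amount": 40},
]


def to_columnar_partitions(rows, key):
    """Group rows by `key` and store each partition as per-column lists."""
    partitions = defaultdict(lambda: defaultdict(list))
    for row in rows:
        part = partitions[row[key]]
        for name, value in row.items():
            if name != key:  # the key value is implied by the partition itself
                part[name].append(value)
    return partitions


def total_amount(partitions, bank):
    """Sum the `amount` column of a single partition, reading nothing else."""
    part = partitions.get(bank)
    return sum(part["amount"]) if part else 0


partitions = to_columnar_partitions(rows, "bank")
print(total_amount(partitions, "A"))  # 175
```

A query that filters on the partition key and aggregates one column touches only one partition and one list here, which is the intuition behind choosing partition keys from the observed workload.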