THESIS
2017
xiii, 87 pages : illustrations ; 30 cm
Abstract
Data from heterogeneous data sources are frequently integrated into various data
warehouse systems for analytical purposes. As a result, data columns in such systems
often exhibit redundancy, which we term Cross-Column Redundancy (CCR).
CCR indicates high similarity or correlation between columns and therefore can be
exploited for data management and business intelligence. However, the combinatorial
nature of CCR makes it computationally challenging to detect automatically.
In this thesis, we define three types of CCR, develop efficient algorithms for
CCR detection, and leverage CCR to compress data. In particular, we focus on a
type of CCR, called Soft Concatenation Mapping (SCM), where one column can be
derived from another or several other columns by transformation and concatenation.
We prove that SCM detection is NP-hard and propose approximate algorithms.
Furthermore, we leverage CCR for database compression and develop Cuttle, a
column storage system that integrates our cross-column compression schemes into
existing database systems transparently. Our experiments on real-world datasets
show that Cuttle reduces the data storage by half and improves the query processing
performance by 20%. In addition, we present the design and implementation of
UStore, a customized version of Cuttle tailored for UnionPay. We use UnionPay's
inter-bank transaction settlement platform (ITSP) as a running example to illustrate
the core components of UStore. To date, UStore has been deployed at UnionPay to
process more than 15 years of bankcard transaction data (over 3 PB in plain-text format).
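
To make the idea of a concatenation mapping more concrete, the following is a minimal Python sketch. It is not the thesis's algorithm: it simply tries every ordered combination of up to `max_sources` columns and reports those whose plain concatenation reproduces the target column in at least a `threshold` fraction of rows. The per-row tolerance is only our illustrative reading of "soft", no column transformations are applied, and the function names, parameters, column names, and sample rows are all hypothetical.

```python
from itertools import permutations


def match_ratio(rows, target, sources):
    """Fraction of rows whose target value equals the source values concatenated in order."""
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if "".join(r[c] for c in sources) == r[target])
    return hits / len(rows)


def find_concat_mappings(rows, target, max_sources=2, threshold=0.9):
    """Brute-force search over ordered column combinations (illustrative only)."""
    if not rows:
        return []
    columns = [c for c in rows[0] if c != target]
    found = []
    for k in range(1, max_sources + 1):
        # Order matters for concatenation, so permutations are enumerated.
        for sources in permutations(columns, k):
            ratio = match_ratio(rows, target, sources)
            if ratio >= threshold:
                found.append((sources, ratio))
    return found


# Hypothetical sample data: "account" is "bank" + "branch" in three of four rows.
rows = [
    {"bank": "0102", "branch": "3301", "account": "01023301"},
    {"bank": "0103", "branch": "4402", "account": "01034402"},
    {"bank": "0105", "branch": "5503", "account": "01055503"},
    {"bank": "0308", "branch": "6604", "account": "0308XXXX"},
]

mappings = find_concat_mappings(rows, "account", threshold=0.75)
print(mappings)
```

The number of ordered column combinations grows rapidly with the number of columns and with `max_sources`, which gives an informal sense of why exhaustive search does not scale and why the thesis turns to approximate algorithms.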