THESIS
2022
1 online resource (xii, 124 pages) : illustrations (some color)
Abstract
Over the past decades, machine learning techniques have achieved a breakthrough and make people re-examine the traditional data management challenges such as data indexing, cardinality estimation, query optimization, etc. In the seminal paper of Kraska et al. [66], the notion of “learned index” was first introduced to refer to the new data structure design paradigm by using a combination of machine learning models (to provide fast prediction) and traditional data structures (to handle errors caused by ML models).
Inspired by the impressive performance, optimizations and extensions to the original learned indices have caught the attention from both academia and industry. In this thesis, we introduce three of our recent works on designing learned data structures to efficiently process fun...[
Read more ]
Over the past decades, machine learning techniques have achieved a breakthrough and make people re-examine the traditional data management challenges such as data indexing, cardinality estimation, query optimization, etc. In the seminal paper of Kraska et al. [66], the notion of “learned index” was first introduced to refer to the new data structure design paradigm by using a combination of machine learning models (to provide fast prediction) and traditional data structures (to handle errors caused by ML models).
Inspired by the impressive performance, optimizations and extensions to the original learned indices have caught the attention from both academia and industry. In this thesis, we introduce three of our recent works on designing learned data structures to efficiently process fundamental queries in big data management and analytics, including the streaming membership query, approximate multi-dimensional range count query, and Hamming space similarity query.
First, for membership queries over data streams, we devise a novel Bloom filter variant called Stable Learned Bloom filter which addresses the performance decay issue of Bloom filters on intensive insertion workloads by combining classifiers with updatable backup filters. Second, we study the application of learned index as spatial data synopsis, which is widely adopted to speed-up query processing over large spatial databases. We propose a learned data synopsis technique named L̲earned Multi-dimensional H̲i̲s̲t̲ogram (LHist). Compared with the traditional data synopsis techniques, LHist is fully data-driven, easy-to-implement, and has the potential to achieve better storage-accuracy trade-off. Third, we investigate the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity query, which is fundamental in image/video search applications. We propose HAP, an efficient ML-enhanced Hamming space index framework to support both Hamming range queries and k-NN queries.
Extensive experimental studies on large-scale real-world datasets and synthetic benchmarks reveal that our learned data structures can outperform the corresponding state-of-the-art baselines in terms of the storage cost and query processing efficiency.
Post a Comment