On effective and efficient learned data structure design

HKUST Electronic Theses

On effective and efficient learned data structure design

by Qiyu Liu

THESIS 2022

Ph.D. Computer Science and Engineering

1 online resource (xii, 124 pages) : illustrations (some color)

Abstract

Over the past decades, machine learning techniques have achieved a breakthrough and make people re-examine the traditional data management challenges such as data indexing, cardinality estimation, query optimization, etc. In the seminal paper of Kraska et al. [66], the notion of “learned index” was first introduced to refer to the new data structure design paradigm by using a combination of machine learning models (to provide fast prediction) and traditional data structures (to handle errors caused by ML models).

Inspired by the impressive performance, optimizations and extensions to the original learned indices have caught the attention from both academia and industry. In this thesis, we introduce three of our recent works on designing learned data structures to efficiently process fundamental queries in big data management and analytics, including the streaming membership query, approximate multi-dimensional range count query, and Hamming space similarity query.

First, for membership queries over data streams, we devise a novel Bloom filter variant called Stable Learned Bloom filter which addresses the performance decay issue of Bloom filters on intensive insertion workloads by combining classifiers with updatable backup filters. Second, we study the application of learned index as spatial data synopsis, which is widely adopted to speed-up query processing over large spatial databases. We propose a learned data synopsis technique named L̲earned Multi-dimensional H̲i̲s̲t̲ogram (LHist). Compared with the traditional data synopsis techniques, LHist is fully data-driven, easy-to-implement, and has the potential to achieve better storage-accuracy trade-off. Third, we investigate the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity query, which is fundamental in image/video search applications. We propose HAP, an efficient ML-enhanced Hamming space index framework to support both Hamming range queries and k-NN queries.

Extensive experimental studies on large-scale real-world datasets and synthetic benchmarks reveal that our learned data structures can outperform the corresponding state-of-the-art baselines in terms of the storage cost and query processing efficiency.

[ Hide abstract ]

View Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection HKUST Electronic Theses Degree Ph.D. Department Computer Science and Engineering Supervisors Chen, Lei Authors Liu, Qiyu Language English Call number Thesis CSE 2022 LiuQ DOI 10.14711/thesis-991013049928503412

Full record

On effective and efficient learned data structure design

by Qiyu Liu

Post a Comment Cancel reply