Efficient algorithms for frequent path finding and similarity join in big multidimensional data

HKUST Electronic Theses

Efficient algorithms for frequent path finding and similarity join in big multidimensional data

by Wuman Luo

THESIS 2012

Ph.D. Computer Science and Engineering

xii, 109 p. : ill. ; 30 cm

Abstract

The arrival of cyber-physical system era is changing data analysis in many ways. Driven by the advances in Internet and sensor techniques, the amount of multidimensional contents, such as images, trajectories, video clips, etc., has grown to an unprecedented level. Supporting multidimensional objects in large scale requires signiﬁcant extensions from traditional databases. One critical issue is indexing and query processing. In this thesis, we discuss two important queries in massive multidimensional datasets: frequent path ﬁnding and high-dimensional similarity join. In the context of big data, these two queries are challenging due to two reasons: 1) they both contain expensive comparison operations, and 2) the complex structures of their interested data complicate the index design. To address these issues, we conduct theoretical analysis, advanced algorithm design, as well as extensive experiments on large-scale datasets.

First, we address the problem of frequent path ﬁnding by proposing a new query named the time period-based most frequent path (TPMFP). Speciﬁcally, given a time period T, a source s, and a destination d, TPMFP query searches the most frequent path (MFP) from s to d during T. Though there exists several proposals on deﬁning MFP, they only consider a ﬁxed time period. Most importantly, we ﬁnd that none of them can well reﬂect people’s common sense notion which can be described by three key properties, namely suﬃx-optimal (i.e., any suﬃx of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). The TPMFP with the above properties will reveal not only common routing preferences of the past travelers, but also take the time eﬀectiveness into consideration. Therefore, our ﬁrst task is to give an TPMFP deﬁnition which satisﬁes the above three properties. Then, given the comprehensive TPMFP deﬁnition, our next task is to ﬁnd TPMFP over huge amount of trajectory data eﬃciently. Particularly, we propose eﬃcient search algorithms together with novel indexes to speed up the processing of TPMFP.

Second, we study how to perform parallel high-dimensional similarity joins (HDSJs) in MapReduce eﬃciently in the MapReduce paradigm. Particularly, we propose a cost model to demonstrate that it is important to take both communication cost and computation cost into account as dimensionality and data volume increases. To this end, we propose dimension aggregation approximation (DAA), an eﬃcient compression approach that can help signiﬁcantly reduce both these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and very high dimensionality.

Finally, we conduct comprehensive experiments using real large-scale datasets. The number of trajectories for path ﬁnding is up to 10 million and the average trajectory length (in terms of the number of vertices) is 21. The number of image vectors for join is up to 1 million and the dimensionality of the vectors is up to ten thousand. The results show that our proposed algorithms have desirable performance in big multidimensional data.

[ Hide abstract ]

View Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection HKUST Electronic Theses Degree Ph.D. Department Computer Science and Engineering Supervisors Ni, Lionel M. Authors Luo, Wuman Subjects Electronic data processing Distributed processing Computer algorithms Multidimensional databases Language English Call number Thesis CSED 2012 Luo DOI 10.14711/thesis-b1213376

Full record

Efficient algorithms for frequent path finding and similarity join in big multidimensional data

by Wuman Luo

Post a Comment Cancel reply