THESIS
2012
xii, 109 p. : ill. ; 30 cm
Abstract
The arrival of cyber-physical system era is changing data analysis in many ways. Driven by the advances in Internet and sensor techniques, the amount of multidimensional contents, such as images, trajectories, video clips, etc., has grown to an unprecedented level. Supporting multidimensional objects in large scale requires significant extensions from traditional databases. One critical issue is indexing and query processing. In this thesis, we discuss two important queries in massive multidimensional datasets: frequent path finding and high-dimensional similarity join. In the context of big data, these two queries are challenging due to two reasons: 1) they both contain expensive comparison operations, and 2) the complex structures of their interested data complicate the index design. To...[
Read more ]
The arrival of cyber-physical system era is changing data analysis in many ways. Driven by the advances in Internet and sensor techniques, the amount of multidimensional contents, such as images, trajectories, video clips, etc., has grown to an unprecedented level. Supporting multidimensional objects in large scale requires significant extensions from traditional databases. One critical issue is indexing and query processing. In this thesis, we discuss two important queries in massive multidimensional datasets: frequent path finding and high-dimensional similarity join. In the context of big data, these two queries are challenging due to two reasons: 1) they both contain expensive comparison operations, and 2) the complex structures of their interested data complicate the index design. To address these issues, we conduct theoretical analysis, advanced algorithm design, as well as extensive experiments on large-scale datasets.
First, we address the problem of frequent path finding by proposing a new query named the time period-based most frequent path (TPMFP). Specifically, given a time period T, a source s, and a destination d, TPMFP query searches the most frequent path (MFP) from s to d during T. Though there exists several proposals on defining MFP, they only consider a fixed time period. Most importantly, we find that none of them can well reflect people’s common sense notion which can be described by three key properties, namely suffix-optimal (i.e., any suffix of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). The TPMFP with the above properties will reveal not only common routing preferences of the past travelers, but also take the time effectiveness into consideration. Therefore, our first task is to give an TPMFP definition which satisfies the above three properties. Then, given the comprehensive TPMFP definition, our next task is to find TPMFP over huge amount of trajectory data efficiently. Particularly, we propose efficient search algorithms together with novel indexes to speed up the processing of TPMFP.
Second, we study how to perform parallel high-dimensional similarity joins (HDSJs) in MapReduce efficiently in the MapReduce paradigm. Particularly, we propose a cost model to demonstrate that it is important to take both communication cost and computation cost into account as dimensionality and data volume increases. To this end, we propose dimension aggregation approximation (DAA), an efficient compression approach that can help significantly reduce both these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and very high dimensionality.
Finally, we conduct comprehensive experiments using real large-scale datasets. The number of trajectories for path finding is up to 10 million and the average trajectory length (in terms of the number of vertices) is 21. The number of image vectors for join is up to 1 million and the dimensionality of the vectors is up to ten thousand. The results show that our proposed algorithms have desirable performance in big multidimensional data.
Post a Comment