Theory and practice : similarity measurements on large-scale graphs

HKUST Electronic Theses

Theory and practice : similarity measurements on large-scale graphs

by Dandan Lin

THESIS 2021

Ph.D. Computer Science and Engineering

1 online resource (xiv, 128 pages) : illustrations (some color)

Abstract

Measuring node-to-node similarity in a graph is a fundamental tool for various graph-based mining applications such as image/information retrieval, friend/item recommendation, link prediction, text mining and community detection. Thus, studying similarity measures has been recognized as an important research problem in the data mining community. In this thesis, we focus on three representative similarity measurements in the literature: Random Walk with Restart (RWR), Manifold Ranking (MR) and First Hitting Probability (FHP), all of which provide good metrics for measuring the proximity of two nodes in the graph by considering the local structure and the global structure in the graph. Specifically, given a graph G and a pair of nodes s, t in G, RWR (resp. MR/FHP) computes a “similarity score” of t with respect to (w.r.t) s, which is called the RWR (resp. MR/FHP) score. The larger the RWR (resp. MR/FHP) score, the more similar to s node t is. However, for these measurements, existing best-known algorithms for computing RWR/MR/FHP scores have one of the following issues: (1) some of them require to build a bulky index, (2) some of them do not have any theoretical bound on the output, and (3) some of them are very time-consuming, making all of them difficult to be used in real world applications where the graphs are always in a large scale with millions of nodes and billions of edges.

In this thesis, we firstly propose an index-free algorithm called R̲e̲s̲idue-A̲c̲c̲umulated approach (ResAcc) which returns the RWR scores with a theoretical guarantee efficiently. Our experimental evaluations on large-scale real graphs show that ResAcc is up to 4 times faster than the best-known previous algorithm while guaranteeing the same accuracy. Under typical settings, the best-known algorithm ran around 1000 seconds on a large dataset containing 41.7 million nodes, which is too time-consuming, while ResAcc finished in 275 seconds with the same accuracy. Moreover, ResAcc is up to 6 orders of magnitude more accurate than the best-known algorithm in practice with the same execution time, which is considered as a substantial improvement.

Secondly, we study the top-k image retrieval (which returns the k images from the image datasets which are the most similar to a given query image) with manifold ranking (MR), which could give more accurate results than existing deep learning-based approaches. More importantly, compared with RWR, MR achieves higher precision on the top-k results since MR treats each edge in the graph with different weights while RWR treats equally. Besides, we propose 2 algorithms, namely Monte Carlo-based MR (MCMR) and MCMR+, for top-k image retrieval. We are the first one to propose an index-free manifold ranking image retrieval with the output theoretical bound. More importantly, our algorithms give the first best-known time complexity result of O(n log n) where n is the total number of images in the database compared with the existing best-known result of O(n²) in the literature of computing the exact top-k results with quality guarantee. Lastly, our experimental results show that MCMR+ outperforms existing algorithms by up to 4 orders of magnitude in terms of query time.

Thirdly, we focus on the group FHP value where the target is a set of nodes, instead of a single node, which is applicable to more real-world scenarios. Specifically, given a source node s and a target set T, the group FHP value f(s, T) is defined as the probability that a random walk from s hits one of the nodes in T for the first time. Based on this general FHP version, we present the efficient index-free algorithms on two types of FHP queries: the pairwise query which returns the FHP value of a target set T w.r.t a source node s, and the top-k query which returns the k target sets with the largest FHP values w.r.t a source node s. To answer the pairwise query, we first present a non-trivial and rigorous derivation of a group local push algorithm tailored for FHP (which cannot be found in the literature), and then speed up the computations by combining this group local push algorithm with the Monte Carlo approach. For top-k queries, we present a novel pruning strategy which prunes most non-top-k nodes without explicitly computing the lower/upper bounds. By considering the properties of FHP, we further present optimizations for the top-k queries to improve the practical query efficiency. Extensive experiments demonstrate that our proposed solutions run faster than the state-of-the-arts by up to 4 orders of magnitude in terms of query time.

[ Hide abstract ]

View Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection HKUST Electronic Theses Degree Ph.D. Department Computer Science and Engineering Supervisors Wong, Raymond Chi-Wing Authors Lin, Dandan Subjects Data mining Graphic methods Language English Call number Thesis CSE 2021 Lin DOI 10.14711/thesis-991012936265203412

Full record

Theory and practice : similarity measurements on large-scale graphs

by Dandan Lin

Post a Comment Cancel reply