Measuring node-to-node similarity in a graph is a fundamental tool for various graph-based mining
applications such as image/information retrieval, friend/item recommendation, link prediction, text
mining and community detection. Thus, studying similarity measures has been recognized as an
important research problem in the data mining community. In this thesis, we focus on three representative
similarity measurements in the literature: Random Walk with Restart (RWR), Manifold
Ranking (MR) and First Hitting Probability (FHP), all of which provide good metrics for measuring
the proximity of two nodes in the graph by considering the local structure and the global
structure in the graph. Specifically, given a graph G and a pair of nodes s, t in G, RWR (resp.
MR/FHP) computes a “similarity s...[
Read more ]
Measuring node-to-node similarity in a graph is a fundamental tool for various graph-based mining
applications such as image/information retrieval, friend/item recommendation, link prediction, text
mining and community detection. Thus, studying similarity measures has been recognized as an
important research problem in the data mining community. In this thesis, we focus on three representative
similarity measurements in the literature: Random Walk with Restart (RWR), Manifold
Ranking (MR) and First Hitting Probability (FHP), all of which provide good metrics for measuring
the proximity of two nodes in the graph by considering the local structure and the global
structure in the graph. Specifically, given a graph G and a pair of nodes s, t in G, RWR (resp.
MR/FHP) computes a “similarity score” of t with respect to (w.r.t) s, which is called the RWR
(resp. MR/FHP) score. The larger the RWR (resp. MR/FHP) score, the more similar to s node t is.
However, for these measurements, existing best-known algorithms for computing RWR/MR/FHP
scores have one of the following issues: (1) some of them require to build a bulky index, (2) some
of them do not have any theoretical bound on the output, and (3) some of them are very time-consuming,
making all of them difficult to be used in real world applications where the graphs are
always in a large scale with millions of nodes and billions of edges.
In this thesis, we firstly propose an index-free algorithm called R̲e̲s̲idue-A̲c̲c̲umulated approach
(ResAcc) which returns the RWR scores with a theoretical guarantee efficiently. Our experimental
evaluations on large-scale real graphs show that ResAcc is up to 4 times faster than the best-known
previous algorithm while guaranteeing the same accuracy. Under typical settings, the best-known
algorithm ran around 1000 seconds on a large dataset containing 41.7 million nodes, which is too
time-consuming, while ResAcc finished in 275 seconds with the same accuracy. Moreover, ResAcc
is up to 6 orders of magnitude more accurate than the best-known algorithm in practice with the
same execution time, which is considered as a substantial improvement.
Secondly, we study the top-k image retrieval (which returns the k images from the image
datasets which are the most similar to a given query image) with manifold ranking (MR), which
could give more accurate results than existing deep learning-based approaches. More importantly,
compared with RWR, MR achieves higher precision on the top-k results since MR treats each edge
in the graph with different weights while RWR treats equally. Besides, we propose 2 algorithms,
namely Monte Carlo-based MR (MCMR) and MCMR+, for top-k image retrieval. We are the first
one to propose an index-free manifold ranking image retrieval with the output theoretical bound.
More importantly, our algorithms give the first best-known time complexity result of O(n log n)
where n is the total number of images in the database compared with the existing best-known result
of O(n
2) in the literature of computing the exact top-k results with quality guarantee. Lastly,
our experimental results show that MCMR+ outperforms existing algorithms by up to 4 orders of
magnitude in terms of query time.
Thirdly, we focus on the group FHP value where the target is a set of nodes, instead of a single
node, which is applicable to more real-world scenarios. Specifically, given a source node s and a
target set T, the group FHP value f(s, T) is defined as the probability that a random walk from s
hits one of the nodes in T for the first time. Based on this general FHP version, we present the
efficient index-free algorithms on two types of FHP queries: the pairwise query which returns the
FHP value of a target set T w.r.t a source node s, and the top-k query which returns the k target sets
with the largest FHP values w.r.t a source node s. To answer the pairwise query, we first present a
non-trivial and rigorous derivation of a group local push algorithm tailored for FHP (which cannot
be found in the literature), and then speed up the computations by combining this group local push
algorithm with the Monte Carlo approach. For top-k queries, we present a novel pruning strategy
which prunes most non-top-k nodes without explicitly computing the lower/upper bounds. By
considering the properties of FHP, we further present optimizations for the top-k queries to improve
the practical query efficiency. Extensive experiments demonstrate that our proposed solutions run
faster than the state-of-the-arts by up to 4 orders of magnitude in terms of query time.
Post a Comment