THESIS
2025
1 online resource (xviii, 133 pages) : illustrations (some color)
Abstract
Neural Graph Databases (NGDBs) augment classic graph databases by embedding node features, edge features, and graph topology in fixed-dimensional vector spaces; the derived vectors are called embeddings. This embedding procedure, also termed neuralization, appears widely in modern machine learning applications. The integration introduces new opportunities as well as complex challenges in handling logical queries. On the opportunity side, the answers to logical queries can be retrieved in vector spaces, a setting widely studied in the literature on vector databases; Approximate Nearest Neighbor (ANN) search then provides sub-linear time data complexity. Moreover, the neural embeddings are obtained by a learning process and can therefore potentially generalize to unseen data. On the challenge side, the logical calculus required to answer queries is not readily available in vector spaces, which makes querying NGDBs an intricate and non-trivial task.
This thesis studies the fundamental problems of querying neural graph databases, where only the neuralized graph topology is considered. The central example is a knowledge graph with relation and entity embeddings (vectors) trained to recover the graph's topology. We ask how to answer different queries using only these embeddings, without access to the symbolic graph topology. Throughout, the answer to a query is a set of entities, and retrieval compares a "query embedding" against each of the "entity embeddings" under some notion of similarity or distance. The first research problem is determining the conditions under which the answers can be faithfully retrieved by nearest neighbor search, in particular the minimal embeddable dimension of the required vector space. Theoretical analysis shows that the minimal embeddable dimension grows at most logarithmically in the number of entities in the graph when the selectivity is small, which is the practical database setting.
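To illustrate the retrieval setting described above, the following sketch performs exact (brute-force) nearest-neighbor retrieval of answer entities; in practice an ANN index would replace the exhaustive scan to reach sub-linear time. All names here (retrieve_answers, entity_embeddings, query_embedding) are hypothetical placeholders, not artifacts of the thesis.

    import numpy as np

    def retrieve_answers(query_embedding, entity_embeddings, k=10):
        """Return indices of the k entities closest to the query embedding.

        Brute-force L2 scan, shown for clarity; an ANN index would
        replace it to achieve sub-linear time in the number of entities.
        """
        # Squared Euclidean distance from the query to every entity vector.
        diffs = entity_embeddings - query_embedding          # (n, d)
        dists = np.einsum("nd,nd->n", diffs, diffs)          # (n,)
        # Indices of the k smallest distances, sorted by distance.
        top = np.argpartition(dists, k)[:k]
        return top[np.argsort(dists[top])]

    # Toy usage: 1,000 entities embedded in a 32-dimensional space.
    rng = np.random.default_rng(0)
    entity_embeddings = rng.normal(size=(1000, 32))
    query_embedding = rng.normal(size=32)
    print(retrieve_answers(query_embedding, entity_embeddings, k=5))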
The study of the minimal embeddable dimension suggests that all computation can be conducted in vector spaces of "small" dimension. The cost of computing query embeddings, particularly via neural networks in such spaces, is then also sublinear in the size of the graph. Combined with ANN search, whose complexity is likewise sub-linear, the overall data complexity remains sublinear in the size of the graph. With this theoretical foundation of NGDBs established, the thesis turns to the practical construction of NGDB engines, considering two distinct ways of posing logical queries: imperative and declarative.
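A back-of-the-envelope calculation makes the logarithmic growth concrete; the constant factor c below is an arbitrary illustrative assumption, not a value derived in the thesis.

    import math

    # Illustrative only: assume the dimension grows as d = c * log2(n).
    c = 8
    for n in (10**4, 10**6, 10**8):
        d = c * math.ceil(math.log2(n))
        # Query embedding cost depends on d, not n; ANN search adds ~O(log n).
        print(f"n = {n:>11,d} entities  ->  embedding dimension d = {d}")

Even as the graph grows by four orders of magnitude, the required dimension roughly doubles, which is what keeps all vector-space computation cheap relative to the data size.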
Imperative queries describe the step-by-step operations that produce the answer; in our setting these are set operators, which correspond fundamentally to the logical connectives. In the NGDB approach, the set operators are further parameterized by neural networks over embedding spaces of a fixed dimension. We introduce a large-scale NGDB query-answering framework and demonstrate the difficulty of generalizing such approaches over the combinatorial space of logical queries, a difficulty rooted in the conflict between rigorous logical reasoning and geometric embeddings. To understand certain model-agnostic generalization issues we observed, we probabilistically model the widely adopted data sampling process and characterize the distribution shifts across query types, which explains the otherwise puzzling generalization behavior. To solve imperative queries in general, we develop a new metric space inspired by unbalanced optimal transport, enabling more combinatorially generalizable query answering, with distances computable via convolution operations. This space allows us to combine fuzzy logic operations with neural operations, which ultimately achieves better generalization results.
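The interplay of fuzzy logic and set operators can be sketched as follows: each entity set is represented by a fuzzy membership vector in [0, 1]^n, and the logical connectives become element-wise operations. The sketch uses the product t-norm and its dual t-conorm as one standard choice; the operators actually used in the thesis may differ, and the membership values are hypothetical.

    import numpy as np

    # Fuzzy set operators over membership vectors in [0, 1]^n.
    def conj(a, b):           # intersection / logical AND (product t-norm)
        return a * b

    def disj(a, b):           # union / logical OR (probabilistic sum)
        return a + b - a * b

    def neg(a):               # complement / logical NOT
        return 1.0 - a

    # Membership scores could come from a neural link predictor's
    # per-entity probabilities (made-up values for illustration).
    friends_of_x = np.array([0.9, 0.1, 0.7])
    lives_in_y   = np.array([0.8, 0.6, 0.2])
    print(conj(friends_of_x, neg(lives_in_y)))   # "friends of x not living in y"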
Declarative queries state the constraints that an answer must satisfy. Given an abstract interface of link predictors, answering a declarative query can be cast as an optimization process, but one that requires at least linear data complexity. We therefore propose a learning-to-optimize framework for such queries: instead of direct global optimization, we introduce four types of one-hop inference over a single predicate and combine their results with a graph neural network to obtain a global solution. Notably, we derive closed-form formulations of the one-hop inferences for six widely used knowledge graph embeddings, which further accelerates the inference process.
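As one concrete instance of a closed-form one-hop inference, consider a TransE-style embedding, where a triple (h, r, t) scores highly when h + r ≈ t; the most plausible tail for a given head and relation then has the closed form t* = h + r, and candidate answers can be ranked by distance to t*. This sketch assumes TransE as the scorer for illustration; whether it is among the six embeddings treated in the thesis is not stated here.

    import numpy as np

    def one_hop_tail(head, relation, entity_embeddings):
        """TransE-style tail inference: the closed-form optimum is h + r.

        Returns entity indices ranked by distance to the inferred tail,
        i.e., candidate answers to the one-hop query r(h, ?t).
        """
        target = head + relation                       # closed-form t*
        dists = np.linalg.norm(entity_embeddings - target, axis=1)
        return np.argsort(dists)

    rng = np.random.default_rng(0)
    entities = rng.normal(size=(100, 16))              # hypothetical KG entities
    h, r = entities[0], rng.normal(size=16)
    print(one_hop_tail(h, r, entities)[:5])            # top-5 candidate tails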