A Survey of Graph-based Approximate Nearest Neighbor (ANN)


A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search

https://arxiv.org/abs/2101.12631

Background

This paper compares graph-based ANN algorithms in a uniform test environment on eight real-world datasets and 12 synthetic datasets with varying sizes and characteristics.

Nearest Neighbor Search (NNS) is a fundamental building block in various application domains. With the explosive growth of dataset scale, exact NNS can no longer meet practical requirements for efficiency and cost, so much of the literature has focused on approximate NNS (ANNS).

The existing ANNS algorithms can be divided into four major types: (1) hashing-based; (2) tree-based; (3) quantization-based; and (4) graph-based algorithms. Recently, graph-based algorithms have emerged as the mainstream for research and practice in academia and industry: they need to evaluate fewer points of the dataset to obtain more accurate results.

Graph-based ANNS' general procedure: initially, for a query $q$, a seed vertex is randomly sampled as the result vertex $r$, and ANNS proceeds from this seed vertex. Specifically, if $\delta(n, q) < \delta(r, q)$, where $n$ is one of the neighbors of $r$, then $r$ is replaced by $n$. We repeat this process until the termination condition (e.g., $\forall n,\ \delta(n, q) \geq \delta(r, q)$) is met, and the final $r$ is $q$'s nearest neighbor.
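This greedy routing loop can be sketched in a few lines of Python. This is only a minimal illustration of the procedure described above, not the paper's implementation; the toy path graph, the positions, and the function names are all made up for the example:

```python
import random

def greedy_search(neighbors, dist, q, seed=None):
    """Greedy routing: repeatedly move to a closer neighbor of the
    current result vertex r, until no neighbor n satisfies
    delta(n, q) < delta(r, q) (the termination condition)."""
    r = seed if seed is not None else random.choice(list(neighbors))
    while True:
        # the neighbor of r closest to q
        best = min(neighbors[r], key=lambda n: dist(n, q), default=None)
        if best is None or dist(best, q) >= dist(r, q):
            return r  # local minimum with respect to delta
        r = best

# toy example: a path graph 0-1-2-3 with 1-D positions
pos = {0: 0.0, 1: 1.0, 2: 2.0, 3: 5.0}
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist_fn = lambda v, q: abs(pos[v] - q)
result = greedy_search(nbrs, dist_fn, 4.6, seed=0)
```

On this toy graph, the search walks 0 → 1 → 2 → 3 and stops at vertex 3, the closest vertex to the query 4.6.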

Mathematical Preparation

Notation

  • $\delta(\cdot)$: the Euclidean distance between points. The most commonly used distance function is the Euclidean distance ($l_2$ norm): $\delta(x, y)=\sqrt{\sum_{i=0}^{d-1}(x_i-y_i)^2}$. The larger $\delta(x, y)$, the more dissimilar $x$ and $y$ are; the closer it is to zero, the more similar they are.

  • $G(V, E)$: a graph index $G$ whose vertex and edge sets are $V$ and $E$, respectively.

  • $N(v)$: the neighbors of vertex $v$ in a graph.
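As a small sanity check of the $l_2$ formula above, a direct Python translation (function name is illustrative):

```python
import math

def delta(x, y):
    """Euclidean (l2) distance between two d-dimensional points:
    sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d = delta((0.0, 0.0), (3.0, 4.0))  # classic 3-4-5 right triangle
```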

Problem Definition

Nearest Neighbor Search (NNS): Given a finite dataset $S$ in Euclidean space $E^d$ and a query $q$, NNS obtains the $k$ nearest neighbors $\mathcal{R}$ of $q$ by evaluating $\delta(x, q)$ for $x \in S$. $\mathcal{R}$ is described as follows: $\mathcal{R}=\arg\min_{\mathcal{R} \subset S,\,|\mathcal{R}|=k} \sum_{x \in \mathcal{R}} \delta(x, q)$.
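The exact NNS definition amounts to evaluating $\delta(x, q)$ for every $x \in S$ and keeping the $k$ smallest. A brute-force sketch (names are illustrative; this is the baseline ANNS tries to avoid):

```python
import math

def exact_knn(S, q, k):
    """Brute-force NNS: compute delta(x, q) for every point in S and
    return the indices of the k closest, realizing the argmin above."""
    return sorted(range(len(S)), key=lambda i: math.dist(S[i], q))[:k]

ids = exact_knn([(0, 0), (1, 1), (5, 5), (2, 2)], (0, 0), 2)
```

This scans all of $S$, which is exactly what becomes too expensive at scale.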

Approximate Nearest Neighbor Search (ANNS): Given a finite dataset $S$ in Euclidean space $E^d$ and a query $q$, ANNS builds an index $I$ on $S$. It then obtains a candidate subset $C$ of $S$ via $I$, and evaluates $\delta(x, q)$ for $x \in C$ to obtain the approximate $k$ nearest neighbors $\tilde{\mathcal{R}}$ of $q$. (Note: there are two steps, (1) getting a candidate subset $C \subset S$, and (2) finding the $k$ nearest neighbors within $C$.)

Generally, we use the recall rate Recall@$k=\frac{|\mathcal{R} \cap \tilde{\mathcal{R}}|}{k}$ to evaluate the accuracy of search results. ANNS algorithms aim to maximize Recall@$k$ while making $C$ as small as possible (e.g., $|C|$ is only a few thousand when $|S|$ is in the millions on the SIFT1M dataset). As mentioned earlier, graph-based ANNS algorithms have risen in prominence because of their advantages in the accuracy-versus-efficiency trade-off. We define graph-based ANNS as follows.
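The recall formula is a one-liner; a sketch with made-up id lists for illustration:

```python
def recall_at_k(R, R_tilde, k):
    """Recall@k = |R ∩ R~| / k, where R holds the exact k-NN ids and
    R_tilde the approximate result ids."""
    return len(set(R) & set(R_tilde)) / k

# 3 of the 5 true neighbors were found -> recall 0.6
r = recall_at_k([1, 2, 3, 4, 5], [1, 2, 3, 9, 7], 5)
```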

Graph-based ANNS. Given a finite dataset $S$ in Euclidean space $E^d$, $G(V, E)$ denotes a graph (the index $I$ in the ANNS definition) constructed on $S$, where each $v \in V$ uniquely corresponds to a point $x$ in $S$, and each $(u, v) \in E$ represents a neighbor relationship between $u, v \in V$. Given a query $q$, seeds $\widehat{S}$, a routing strategy, and a termination condition, graph-based ANNS initializes the approximate $k$ nearest neighbors $\tilde{\mathcal{R}}$ of $q$ with $\widehat{S}$, then conducts a search from $\widehat{S}$ and updates $\tilde{\mathcal{R}}$ via the routing strategy. Finally, it returns the query result $\tilde{\mathcal{R}}$ once the termination condition is met.
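A common concrete instance of this definition is best-first search with a bounded result set (the routing strategy used, in various forms, by algorithms like HNSW). The sketch below is a generic version under that assumption, not any specific paper's code; the graph, the parameter `ef` (result-set bound), and all names are illustrative:

```python
import heapq

def graph_search(neighbors, dist, q, seeds, ef):
    """Best-first search: a min-heap of candidates and a max-heap of the
    ef best results so far. Terminates when the closest unexpanded
    candidate is farther than the worst current result."""
    visited = set(seeds)
    cand = [(dist(s, q), s) for s in seeds]    # min-heap of candidates
    heapq.heapify(cand)
    results = [(-d, s) for d, s in cand]       # max-heap via negation
    heapq.heapify(results)
    while cand:
        d_c, c = heapq.heappop(cand)
        if d_c > -results[0][0] and len(results) >= ef:
            break  # termination condition met
        for n in neighbors[c]:
            if n in visited:
                continue
            visited.add(n)
            d_n = dist(n, q)
            if len(results) < ef or d_n < -results[0][0]:
                heapq.heappush(cand, (d_n, n))
                heapq.heappush(results, (-d_n, n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the worst result
    return sorted((-d, v) for d, v in results)  # ascending by distance

# toy path graph 0-1-2-3-4 with 1-D positions
pos = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 10.0}
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
dist_fn = lambda v, q: abs(pos[v] - q)
top = graph_search(nbrs, dist_fn, 2.2, seeds=[0], ef=2)
```

Here the search routes from the seed 0 toward the query 2.2 and returns vertices 2 and 3 as the two approximate nearest neighbors.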

Dataset: the base data and query data comprise high-dimensional feature vectors extracted by deep learning techniques (such as VGG [87] for images).

Overview of Graph-based ANNS

Base Graph for ANNS

We first dissect four classic base graphs: the Delaunay Graph, the Relative Neighborhood Graph, the K-Nearest Neighbor Graph, and the Minimum Spanning Tree. After that, we review 13 representative graph-based ANNS algorithms that apply different optimizations to these base graphs.

Delaunay Graph (DG): In Euclidean space $E^d$, the DG $G(V, E)$ constructed on dataset $S$ satisfies the following condition: for each $e \in E$ (e.g., the yellow line in Figure 2(a)) with corresponding vertices $x$ and $y$, there exists a circle (the red circle in Figure 2(a)) passing through $x$ and $y$ with no other vertex inside it, and at most three vertices (i.e., $x, y, z$) lie on the circle at the same time. DG ensures that ANNS always returns precise results [67], but the disadvantage is that DG is almost fully connected when the dimension $d$ is extremely high, which leads to a large search space [38, 43].

[Figure 2: the four base graphs: (a) DG, (b) RNG, (c) KNNG, (d) MST]

Relative Neighborhood Graph (RNG): In Euclidean space $E^d$, the RNG $G(V, E)$ built on dataset $S$ has the following property: for $x, y \in V$, if $x$ and $y$ are connected by edge $e \in E$, then $\forall z \in V$, $\delta(x, y) < \delta(x, z)$ or $\delta(x, y) < \delta(z, y)$. In other words, $z$ is not in the red lune in Figure 2(b) (for RNG's standard definition, refer to [92]). Compared with DG, RNG cuts off some redundant neighbors (close to each other) that violate this property and makes the remaining neighbors distribute omnidirectionally, thereby reducing ANNS' distance calculations [67]. However, the time complexity of constructing an RNG on $S$ is $O(|S|^3)$ [49].

Since the definition above may not be entirely clear, we provide supplementary material and another example for the RNG here:

Consider again a set $P$ of $n$ distinct points on the plane: $P=\{p_1, p_2, \ldots, p_n\}$. There are many possible ways of defining whether or not two points $p_i$ and $p_j$ are neighbours of each other; several definitions are considered [18, 13, 8, 10]. Lankford [8] defines two points $p_i$ and $p_j$ as being "relatively close" if $d(p_i, p_j) \leq \max[d(p_i, p_k), d(p_j, p_k)]$ for all $k=1, \ldots, n$, $k \neq i, j$. Actually, Lankford uses '$<$' rather than '$\leq$' in his definition. The difference is essentially that with this minor modification, in a degenerate situation such as three points lying equidistant from each other, all three points are considered relative neighbours of each other, whereas with only '$<$' in the definition none of the three points are relative neighbours of each other. Intuitively, the definition states that two points are relative neighbours if they are at least as close to each other as they are to any other point. The relative neighbourhood graph is obtained by connecting an edge between points $p_i$ and $p_j$ for all $i, j=1, \ldots, n$, $i \neq j$, if and only if $p_i$ and $p_j$ are relative neighbours. Figure 2 illustrates a set of points and its RNG.

[Figure: a set of points and its RNG]
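The "empty lune" test above translates directly into a brute-force construction. This is a sketch of the naive $O(|S|^3)$ procedure (matching the complexity cited earlier), not an efficient implementation; the function name and the sample points are illustrative:

```python
import math
from itertools import combinations

def build_rng(points):
    """Brute-force RNG: keep edge (i, j) iff no third point k is
    strictly closer to both endpoints than they are to each other,
    i.e. the lune between i and j is empty."""
    def d(a, b):
        return math.dist(points[a], points[b])
    edges = set()
    n = len(points)
    for i, j in combinations(range(n), 2):
        dij = d(i, j)
        if not any(max(d(i, k), d(j, k)) < dij
                   for k in range(n) if k not in (i, j)):
            edges.add((i, j))
    return edges

# three collinear points: the long edge (0, 2) is pruned because
# point 1 lies inside its lune
rng_edges = build_rng([(0, 0), (1, 0), (2, 0)])
```

Note the `<` in the pruning test realizes Lankford's '$\leq$'-style definition: equidistant (degenerate) triples keep all their edges.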

K-Nearest Neighbor Graph (KNNG): Each point in dataset $S$ is connected to its $K$ nearest points to form a KNNG $G(V, E)$ in Euclidean space $E^d$. As Figure 2(c) ($K=2$) shows, for $x, y \in V$, $x \in N(y)=\{x, u\}$ but $y \notin N(x)=\{z, v\}$, where $N(y)$ and $N(x)$ are the neighbor sets of $y$ and $x$, respectively. Therefore, the edge between $y$ and $x$ is a directed edge, so KNNG is a directed graph. KNNG limits the number of neighbors of each vertex to at most $K$, thus avoiding a surge of neighbors, which works well in scenarios with limited memory and high demand for efficiency. As can be seen in Figure 2(c), however, KNNG does not guarantee global connectivity, which is unfavorable for ANNS.
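A brute-force KNNG construction makes the directedness concrete. This naive $O(n^2 \log n)$ sketch is for illustration only (real systems use approximate constructions such as NN-Descent); names and sample points are made up:

```python
import math

def build_knng(points, K):
    """Brute-force directed KNNG: each vertex i gets out-edges to its
    K nearest other points."""
    n = len(points)
    graph = {}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        graph[i] = others[:K]
    return graph

# with K=1, vertex 2 points to 1, but 1 points back to 0,
# so the edge 2 -> 1 has no reverse edge: the graph is directed
g = build_knng([(0, 0), (1, 0), (3, 0)], 1)
```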

Minimum Spanning Tree (MST): In Euclidean space $E^d$, the MST is the $G(V, E)$ on dataset $S$ with the smallest $\sum_{i=1}^{|E|} w(e_i)$, where the two vertices associated with $e_i \in E$ are $x$ and $y$, and $w(e_i)=\delta(x, y)$. If $\exists e_i, e_j \in E$ with $w(e_i)=w(e_j)$, then the MST is not unique [68]. Although MST has not been adopted by most current graph-based ANNS algorithms, HCNNG [72] confirms MST's effectiveness as a neighbor-selection strategy for ANNS. The main advantage of using MST as a base graph is that it uses the fewest edges to ensure the graph's global connectivity, keeping vertex degrees low while making any two vertices reachable. However, because of the lack of shortcuts, searches on an MST may detour [38, 65]. For example, in Figure 2(d), a search going from $s$ to $r$ must detour via $s \rightarrow x \rightarrow y \rightarrow u \rightarrow r$; this could be avoided if there were an edge between $s$ and $r$.
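A Euclidean MST can be built with Prim's algorithm on the complete distance graph; a minimal sketch (illustrative names and points, not HCNNG's construction):

```python
import heapq
import math

def build_mst(points):
    """Prim's algorithm on the complete Euclidean graph: grows the tree
    edge by edge, always taking the cheapest edge leaving the tree.
    Returns |V|-1 edges forming a connected, minimum-weight tree."""
    n = len(points)
    in_tree = {0}
    # min-heap of (weight, u, v) edges leaving the current tree
    heap = [(math.dist(points[0], points[v]), 0, v) for v in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while len(in_tree) < n:
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale edge: endpoint already absorbed
        in_tree.add(v)
        edges.append((u, v))
        for x in range(n):
            if x not in in_tree:
                heapq.heappush(heap, (math.dist(points[v], points[x]), v, x))
    return edges

# four collinear points: the MST is the chain 0-1-2-3, with no
# shortcut between the endpoints (hence the detour problem)
mst_edges = build_mst([(0, 0), (1, 0), (2, 0), (10, 0)])
```

The chain structure illustrates the drawback discussed above: reaching the far endpoint from the near one requires traversing every intermediate vertex.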

Graph-Based ANNS Algorithms
