Analysis of the Pseudocode of the HNSW Approximate Nearest Neighbor Algorithm

那时那月那人

Article link: https://blog.csdn.net/xiaoxu1025/article/details/110276676

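I have recently been reading the HNSW paper on nearest neighbor search, and this post records my analysis of the pseudocode of its algorithms. If you plan to read the source code, understanding the pseudocode first makes the source much easier to follow.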

Algorithm 1: Insert

INSERT(hnsw, q, M, Mmax, efConstruction, mL)
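/***
    hnsw: the multilayer graph that q is inserted into (the updated graph is the output)
    q: the element to insert
    M: number of connections established for each new element, set by the user
    Mmax: maximum number of connections allowed per element per layer
    efConstruction: size of the dynamic candidate list, set larger than M
    mL: normalization factor used when choosing the element's layer
**/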
Input: multilayer graph hnsw, new element q, number of established 
connections M, maximum number of connections for each element 
per layer Mmax, size of the dynamic candidate list efConstruction, normalization factor for level generation mL
// Output: the updated hnsw graph structure
Output: update hnsw inserting element q
// Set holding the nearest elements found so far
1 W ← ∅ // list for the currently found nearest elements
2 ep ← get enter-point for hnsw
// Get the top layer L
3 L ← level of ep // top layer for hnsw
// Compute the layer l of the new element; unif(0..1) is a uniform random number between 0 and 1
4 l ← ⌊-ln(unif(0..1))∙mL⌋ // new element’s level
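// From the top layer down to layer l+1, find the single nearest element (ef=1) and use it as the enter point of the next layer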
5 for lc ← L … l+1
6   W ← SEARCH-LAYER(q, ep, ef=1, lc)
7   ep ← get the nearest element from W to q
8 for lc ← min(L, l) … 0
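// From layer min(L, l) down to layer 0, collect up to efConstruction nearest elements into W
9   W ← SEARCH-LAYER(q, ep, efConstruction, lc)
// Select M neighbors from W
10  neighbors ← SELECT-NEIGHBORS(q, W, M, lc) // alg. 3 or alg. 4
// Connect q and every element of neighbors in both directions
11  add bidirectional connections from neighbors to q at layer lc
// Visit every element of neighbors and check whether its connection list has grown too large
12  for each e ∈ neighbors // shrink connections if needed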
13      eConn ← neighbourhood(e) at layer lc
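// If e now has more than Mmax connections, reselect Mmax connections for e
14      if │eConn│ > Mmax // shrink connections of e
// if lc = 0 then Mmax = Mmax0
15          eNewConn ← SELECT-NEIGHBORS(e, eConn, Mmax, lc)
// alg. 3 or alg. 4
16          set neighbourhood(e) at layer lc to eNewConn
// The elements found at this layer become the enter points for the next layer down
17  ep ← W
// If the new element's layer is above the previous top layer, q becomes the new enter point
18 if l > L
19     set enter-point for hnsw to q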
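
Line 4 of Algorithm 1 (the layer assignment) can be tried out in isolation. The following is a minimal Python sketch, not the paper's or any library's code. The helper name random_level is hypothetical, and the setting mL = 1/ln(M) is the value suggested in the HNSW paper rather than something stated in this post. With that setting, about 1 - 1/M of all elements should end up on layer 0 only.

```python
import math
import random


def random_level(m_l: float, rng: random.Random) -> int:
    """Draw a layer as floor(-ln(unif(0..1)) * mL)."""
    # rng.random() is in [0, 1); using 1 - random() gives (0, 1] and avoids log(0).
    u = 1.0 - rng.random()
    # The value is non-negative, so int() truncation equals floor.
    return int(-math.log(u) * m_l)


rng = random.Random(0)          # fixed seed so the run is repeatable
M = 16                          # assumed number of connections per element
m_l = 1.0 / math.log(M)         # assumption: the choice suggested in the paper
levels = [random_level(m_l, rng) for _ in range(100_000)]

# Fraction of elements that live only on layer 0.
share_layer0 = sum(1 for level in levels if level == 0) / len(levels)
print(share_layer0, 1 - 1 / M)
```

Because the layer follows an exponentially decaying distribution, each higher layer holds roughly 1/M of the elements of the layer below it, which is what keeps the upper layers sparse.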

Algorithm 2: Search for the nearest neighbors within the current layer

SEARCH_LAYER(q, ep, ef, lc)
/**
 q: the query element
 ep: the enter point of the current layer
 ef: size of the returned set
 lc: the layer number
 */
Input: 
query element q, 
enter point ep, 
number of nearest to q elements to return ef, 
layer number lc
/**
 * Output: the ef nearest neighbors of q
 */
Output: ef closest neighbors to q
// Build three sets: the visited set v, the candidate set C, and the result set W; the enter point is added to all three
v ← ep
C ← ep
W ← ep
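// Loop while the candidate set is not empty
while │C│ > 0
    // Extract the candidate nearest to q; on the first iteration this is the enter point
    c ← extract nearest element from C to q
    // Get the element of the result set furthest from q
    f ← get furthest element from W to q
    // Compare distance(c, q) with distance(f, q): if even the nearest candidate is further than the furthest result, stop
    if distance(c, q) > distance(f, q) 
        break
    // Visit the elements connected to c at this layer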
    for each e ∈ neighbourhood(c) at layer lc
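        // If e has not been visited yet, add it to v
        if e ∉ v
            v ← v ⋃ e
            // Get the element of the result set furthest from q
            f ← get furthest element from W to q
            // Compare distance(e, q) with distance(f, q): continue if e is closer, or if W still holds fewer than ef elements
            if distance(e, q) < distance(f, q) or │W│ < ef
                // Add e to both the candidate set and the result set
                C ← C ⋃ e
                W ← W ⋃ e
                // If W now holds more than ef elements, remove the one furthest from q
                if │W│ > ef
                    remove furthest element from W to q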
return W
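
To make the roles of C and W concrete, here is a minimal Python sketch of Algorithm 2 using two heaps. It is an illustration under assumptions, not the reference implementation: the layer is assumed to be a plain dict mapping a node id to its neighbor list, dist is a caller-supplied function, and the toy data (ten nodes on a line, each linked to its left and right neighbor) is made up for the example.

```python
import heapq


def search_layer(q, ep, ef, neighbors, dist):
    """Return up to ef nodes closest to q within one layer, nearest first.

    neighbors: dict node -> list of connected nodes (assumed layout)
    dist: callable(node, q) -> float
    """
    visited = {ep}
    d_ep = dist(ep, q)
    candidates = [(d_ep, ep)]   # C: min-heap, nearest candidate on top
    results = [(-d_ep, ep)]     # W: max-heap via negated distance, furthest on top
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -results[0][0]:
            break               # nearest candidate is worse than the furthest result
        for e in neighbors[c]:
            if e in visited:
                continue
            visited.add(e)
            d_e = dist(e, q)
            if d_e < -results[0][0] or len(results) < ef:
                heapq.heappush(candidates, (d_e, e))
                heapq.heappush(results, (-d_e, e))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the furthest element
    return [node for _, node in sorted((-neg_d, node) for neg_d, node in results)]


# Toy layer: node i sits at coordinate i and links to i-1 and i+1.
coords = {i: float(i) for i in range(10)}
layer = {i: [j for j in (i - 1, i + 1) if j in coords] for i in coords}


def line_dist(node, q):
    return abs(coords[node] - q)


found = search_layer(6.2, 0, 3, layer, line_dist)
print(found)   # the three nodes closest to 6.2, nearest first
```

Starting from node 0, the search walks along the chain toward 6.2 and stops once the best remaining candidate is further away than the worst element kept in W.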

Algorithm 3: Select the M nearest neighbors from the candidate set

SELECT_NEIGHBORS_SIMPLE(q, C, M)
/**
 q: the query element
 C: the candidate set
 M: size of the returned set
 */
Input: 
base element q, 
candidate elements C, 
number of neighbors to return M
// Returns the M nearest neighbors of q
Output: M nearest elements to q

return M nearest elements from C to q
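
Algorithm 3 is a plain top-M selection, so in Python it can be sketched in one line with heapq.nsmallest. The function name and the 2D sample points below are made up for illustration, and Euclidean distance (math.dist, Python 3.8+) is an assumption; the pseudocode does not fix a metric.

```python
import heapq
import math


def select_neighbors_simple(q, candidates, m, dist=math.dist):
    """Return the m candidates nearest to q, nearest first."""
    return heapq.nsmallest(m, candidates, key=lambda c: dist(q, c))


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (0.5, 0.0), (5.0, 5.0)]
nearest = select_neighbors_simple((0.0, 0.0), points, 2)
print(nearest)
```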

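Algorithm 4: Heuristic neighbor selection. Like Algorithm 3, it can be the SELECT-NEIGHBORS call at lines 10 and 15 of the Algorithm 1 pseudocode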

SELECT_NEIGHBORS_HEURISTIC(q, C, M, lc, extendCandidates, keepPrunedConnections)
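/**
 * q: the query element
 * C: the candidate set
 * M: size of the returned set
 * lc: the current layer number
 * extendCandidates: flag, whether to enlarge the candidate set
 * keepPrunedConnections: flag, whether to add discarded elements back
 */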
Input: 
base element q, 
candidate elements C, 
number of neighbors to return M, 
layer number lc, 
flag indicating whether or not to extend candidate list extendCandidates, 
flag indicating whether or not to add discarded elements keepPrunedConnections
/**
 * Returns the set of M elements selected by the heuristic
 */
Output: M elements selected by the heuristic
// R: the result set to return; W: the working queue of candidates
R ← ∅
W ← C 
// If the candidate set should be extended, add the neighbors of every candidate
if extendCandidates
    for each e ∈ C
        for each e_adj ∈ neighbourhood(e) at layer lc
            if e_adj ∉ W
                W ← W ⋃ e_adj
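// Queue of discarded elements
Wd ← ∅ 
/**
 * Why this is called a heuristic: it does not simply take the M candidates nearest to q.
 * extendCandidates only enlarges the candidate set with the neighbors of each candidate;
 * without the acceptance test below, the result would just be the plain M nearest neighbors.
 * Loop while the working queue W is not empty and R holds fewer than M elements:
 *  extract from W the element e nearest to q
 *  if e is closer to q than to every element already in R
 *      add e to R
 *  else
 *      add e to the discard queue
 * As the HNSW paper describes it, a candidate that lies closer to an already selected
 * neighbor than to q is skipped, which spreads the selected neighbors over different directions.
 */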
while │W│ > 0 and │R│ < M
    e ← extract nearest element from W to q
    if e is closer to q compared to any element from R
        R ← R ⋃ e
    else
        Wd ← Wd ⋃ e
/**
 * If the result still holds fewer than M elements, fill it up from the discard queue
 */
if keepPrunedConnections
    while │Wd│ > 0 and │R│ < M
        R ← R ⋃ extract nearest element from Wd to q
return R
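
The difference from Algorithm 3 is easiest to see on a small example. The sketch below is a hedged reading of the heuristic, not the paper's code: the test "e is closer to q compared to any element from R" is interpreted as dist(e, q) < dist(e, r) for every r already in R, the extendCandidates step is left out for brevity, and the function name, the Euclidean metric (math.dist), and the sample points are assumptions.

```python
import math


def select_neighbors_heuristic(q, candidates, m, keep_pruned=False, dist=math.dist):
    """Pick up to m neighbors of q, skipping candidates that sit closer
    to an already selected neighbor than to q (extendCandidates omitted)."""
    work = sorted(candidates, key=lambda c: dist(c, q))   # W, nearest first
    selected, discarded = [], []                           # R and Wd
    for e in work:
        if len(selected) >= m:
            break
        if all(dist(e, q) < dist(e, r) for r in selected):
            selected.append(e)
        else:
            discarded.append(e)
    if keep_pruned:
        # discarded is already ordered nearest first, so this mirrors
        # "extract nearest element from Wd to q".
        for e in discarded:
            if len(selected) >= m:
                break
            selected.append(e)
    return selected


q = (0.0, 0.0)
cands = [(1.0, 0.0), (1.2, 0.1), (0.0, 2.0), (-3.0, 0.0)]
diverse = select_neighbors_heuristic(q, cands, 3)
filled = select_neighbors_heuristic(q, cands, 4, keep_pruned=True)
print(diverse)
print(filled)
```

Here (1.2, 0.1) is the second nearest point to q, but it lies much closer to the already selected (1.0, 0.0) than to q, so the heuristic skips it and picks points in other directions instead. With keep_pruned=True it is added back at the end to fill the list up to m.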

The final algorithm: K-NN search

K-NN-SEARCH(hnsw, q, K, ef)
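/**
 * hnsw: the constructed hnsw graph structure
 * q: the query element
 * K: number of nearest neighbors to return
 * ef: size of the dynamic candidate list
 */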
Input: 
multilayer graph hnsw, query element q, 
number of nearest neighbors to return K, 
size of the dynamic candidate list ef
// Output: the K nearest neighbors
Output: K nearest elements to q
// Candidate set W
W ← ∅ 
ep ← get enter point for hnsw
L ← level of ep
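/**
 * Search from the top down.
 * Go from layer L to layer 1, the second layer from the bottom; the enter point of each layer is the element nearest to q found in the layer above
 */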
for lc ← L … 1
    W ← SEARCH_LAYER(q, ep, ef=1, lc)
    ep ← get nearest element from W to q
// Use the element nearest to q found in layer 1 as the enter point of the bottom layer (layer 0) and find ef elements there
W ← SEARCH_LAYER(q, ep, ef, lc=0)
// Select the K nearest elements from W
return K nearest elements from W to q
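
The whole query path (greedy descent with ef=1, then a wider search on layer 0) can be sketched end to end in Python. This is an illustration under assumptions rather than a real index: the graph is a hand-built list of per-layer adjacency dicts over 16 points on a line, the function names are hypothetical, and search_layer is a compact heap-based version of Algorithm 2 included so that the example runs on its own.

```python
import heapq


def search_layer(q, ep, ef, layer, dist):
    """Compact Algorithm 2: up to ef nodes nearest to q in one layer, nearest first."""
    visited = {ep}
    candidates = [(dist(ep, q), ep)]    # min-heap of candidates
    results = [(-dist(ep, q), ep)]      # max-heap (negated distances) of results
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -results[0][0]:
            break
        for e in layer[c]:
            if e not in visited:
                visited.add(e)
                d_e = dist(e, q)
                if d_e < -results[0][0] or len(results) < ef:
                    heapq.heappush(candidates, (d_e, e))
                    heapq.heappush(results, (-d_e, e))
                    if len(results) > ef:
                        heapq.heappop(results)
    return [node for _, node in sorted((-neg_d, node) for neg_d, node in results)]


def knn_search(layers, entry_point, q, k, ef, dist):
    """layers[0] is the bottom layer; entry_point lives on the top layer."""
    ep = entry_point
    for lc in range(len(layers) - 1, 0, -1):        # layers L … 1 with ef=1
        ep = search_layer(q, ep, 1, layers[lc], dist)[0]
    return search_layer(q, ep, ef, layers[0], dist)[:k]   # layer 0, then keep K


# Toy graph: node i sits at coordinate i; upper layers keep every 4th and 8th node.
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 16] for i in range(16)}
layer1 = {0: [4], 4: [0, 8], 8: [4, 12], 12: [8]}
layer2 = {0: [8], 8: [0]}


def line_dist(node, q):
    return abs(float(node) - q)


top2 = knn_search([layer0, layer1, layer2], 0, 9.3, k=2, ef=4, dist=line_dist)
print(top2)
```

In this toy graph the upper layers jump from node 0 to node 8 in a single hop, and only the final layer-0 search inspects the close neighbors around 9.3. Note that ef should be at least K, since the K results are taken from the ef elements found on layer 0.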