NSW search:

K-NNSearch(object q, integer: m, k)
    TreeSet[object] tempRes, candidates, visitedSet, result
    for (i = 0; i < m; i++) do:
        put a randomly chosen entry point into candidates
        tempRes = null
        repeat:
            pick the point c in candidates closest to q
            remove c from candidates
            if c is farther from q than every element in result:
                break repeat
            for e in c.friends:
                if e is not in visitedSet:
                    add e to visitedSet, candidates, tempRes
        end repeat
        add the contents of tempRes to result
    end for
    return the top k results from result
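To make the pseudocode concrete, here is a minimal Java sketch of the greedy NSW search. Everything in it (the Node class, the distance function, the list-based graph) is a hypothetical stand-in for illustration, not Lucene's or the paper's actual code:

import java.util.*;

// Minimal sketch of the NSW greedy search above; all names are hypothetical.
class NswSketch {

  static class Node {
    final float[] vector;
    final List<Node> friends = new ArrayList<>();
    Node(float[] vector) { this.vector = vector; }
  }

  static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum; // squared Euclidean distance is enough for ranking
  }

  // m restarts from random entry points; returns the best k nodes found.
  static List<Node> kNnSearch(List<Node> graph, float[] q, int m, int k) {
    Random random = new Random();
    // order nodes by distance to q; tie-break so TreeSet keeps equal distances
    Comparator<Node> byDist =
        Comparator.<Node>comparingDouble(n -> distance(n.vector, q))
            .thenComparingInt(System::identityHashCode);
    Set<Node> visitedSet = new HashSet<>();
    TreeSet<Node> result = new TreeSet<>(byDist);
    for (int i = 0; i < m; i++) {
      TreeSet<Node> candidates = new TreeSet<>(byDist);
      TreeSet<Node> tempRes = new TreeSet<>(byDist);
      Node entry = graph.get(random.nextInt(graph.size()));
      candidates.add(entry);
      visitedSet.add(entry);
      while (!candidates.isEmpty()) {
        Node c = candidates.pollFirst(); // closest remaining candidate
        // stop condition: c is farther from q than everything kept so far
        if (!result.isEmpty()
            && distance(c.vector, q) > distance(result.last().vector, q)) {
          break;
        }
        tempRes.add(c); // also keep c itself (the paper keeps visited points as results)
        for (Node e : c.friends) {
          if (visitedSet.add(e)) { // e was not in visitedSet yet
            candidates.add(e);
            tempRes.add(e);
          }
        }
      }
      result.addAll(tempRes);
    }
    // return the top k results from result
    List<Node> topK = new ArrayList<>();
    for (Node n : result) {
      if (topK.size() == k) break;
      topK.add(n);
    }
    return topK;
  }
}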
NSW insert:

Nearest_Neighbor_Insert(object: new_object, integer: f, w)
    SET[object]: neighbors = k-NNSearch(new_object, w, f)
    for (i = 0; i < f; i++) do:
        neighbors[i].connect(new_object)
        new_object.connect(neighbors[i])
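Insertion simply reuses the search: find the new object's f nearest neighbors using w restarts, then link it to each of them bidirectionally. A sketch, continuing the hypothetical NswSketch class from above:

  // Insert by searching for f nearest neighbors (with w restarts),
  // then connecting both ways; illustration only, not Lucene's code.
  static void nearestNeighborInsert(List<Node> graph, Node newObject, int f, int w) {
    List<Node> neighbors = kNnSearch(graph, newObject.vector, w, f);
    for (Node neighbor : neighbors) {
      neighbor.friends.add(newObject);  // neighbors[i].connect(new_object)
      newObject.friends.add(neighbor);  // new_object.connect(neighbors[i])
    }
    graph.add(newObject); // the new object now also serves as a possible entry point
  }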
The HNSW index layout in Lucene is as follows:
Meta data and index part:
+--------------------------------------------------+
| meta data                                        |
+--------+-----------------------------------------+
| doc id | offset to first friend list for the doc |
+--------+-----------------------------------------+
| doc id | offset to first friend list for the doc |
+--------+-----------------------------------------+
| ...... |                                         |
+--------+-----------------------------------------+

Graph data part:
+-------------------------+---------------------------+---------+-------------------------+
| friends list at layer N | friends list at layer N-1 | ......  | friends list at level 0 | <- friends lists for doc 0
+-------------------------+---------------------------+---------+-------------------------+
| friends list at layer N | friends list at layer N-1 | ......  | friends list at level 0 | <- friends lists for doc 1
+-------------------------+---------------------------+---------+-------------------------+
| ......                                                                                  | <- and so on
+-----------------------------------------------------------------------------------------+

Vector data part:
+----------------------+
| encoded vector value | <- vector value for doc 0
+----------------------+
| encoded vector value | <- vector value for doc 1
+----------------------+
| ......               | <- and so on
+----------------------+
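To illustrate how the meta/index part is meant to be used, here is a hedged sketch (not Lucene's actual reader code) of resolving a doc id to the byte offset of its first friends list. The parallel docIds/offsets arrays are a hypothetical in-memory model of the two columns in the table above:

  // Hypothetical lookup over the meta/index part: one row per doc,
  // modeled as two parallel arrays. Illustration of the layout only.
  static long firstFriendsListOffset(int[] docIds, long[] offsets, int docId) {
    // doc ids are written in increasing order, so binary search finds the row
    int row = java.util.Arrays.binarySearch(docIds, docId);
    if (row < 0) {
      return -1; // this doc has no vector, hence no friends lists
    }
    return offsets[row]; // byte offset of the doc's layer-N friends list in the graph data part
  }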
The index-writing logic lives in
org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeField
and proceeds in three steps:
1. Write the vectors
2. Write the graph
3. Write the meta data
@Override
public void writeField(FieldInfo fieldInfo, VectorValues vectors) throws IOException {
  long pos = vectorData.getFilePointer();
  // write floats aligned at 4 bytes. This will not survive CFS, but it shows a small benefit when
  // CFS is not used, e.g. for larger indexes
  long padding = (4 - (pos & 0x3)) & 0x3;
  long vectorDataOffset = pos + padding;
  for (int i = 0; i < padding; i++) {
    vectorData.writeByte((byte) 0);
  }
  // TODO - use a better data structure; a bitset? DocsWithFieldSet is p.p. in o.a.l.index
  int[] docIds = new int[vectors.size()];
  int count = 0;
  for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc(), count++) {
    // write vector
    writeVectorValue(vectors);
    docIds[count] = docV;
  }
  // count may be < vectors.size(), e.g. if some documents were deleted
  long[] offsets = new long[count];
  long vectorDataLength = vectorData.getFilePointer() - vectorDataOffset;
  long vectorIndexOffset = vectorIndex.getFilePointer();
  if (vectors instanceof RandomAccessVectorValuesProducer) {
    writeGraph(
        vectorIndex,
        (RandomAccessVectorValuesProducer) vectors,
        fieldInfo.getVectorSimilarityFunction(),
        vectorIndexOffset,
        offsets,
        count,
        maxConn,
        beamWidth);
  } else {
    throw new IllegalArgumentException(
        "Indexing an HNSW graph requires a random access vector values, got " + vectors);
  }
  long vectorIndexLength = vectorIndex.getFilePointer() - vectorIndexOffset;
  writeMeta(
      fieldInfo,
      vectorDataOffset,
      vectorDataLength,
      vectorIndexOffset,
      vectorIndexLength,
      count,
      docIds);
  writeGraphOffsets(meta, offsets);
}
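A small aside on the alignment trick at the top of writeField: (4 - (pos & 0x3)) & 0x3 is the number of zero bytes needed to round pos up to the next multiple of 4, so every float vector starts 4-byte aligned. A standalone check of the arithmetic (a demo class of my own, not part of Lucene):

public class PaddingDemo {
  public static void main(String[] args) {
    for (long pos = 0; pos < 8; pos++) {
      long padding = (4 - (pos & 0x3)) & 0x3; // bytes to reach the next multiple of 4
      System.out.println("pos=" + pos + " padding=" + padding + " aligned=" + (pos + padding));
    }
    // padding cycles 0,3,2,1 and pos + padding is always a multiple of 4
  }
}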
References:
Introduction to HNSW – d0evi1's blog: http://d0evi1.com/hnsw/
Delaunay triangulation – luoru's blog (cnblogs): https://www.cnblogs.com/zfluo/p/5131851.html
HNSW study notes – Zhihu: https://zhuanlan.zhihu.com/p/80552211
Discussion thread on adding HNSW:
https://issues.apache.org/jira/browse/LUCENE-9004
Discussion thread on implementing the hierarchical HNSW
On the connectivity problem of HNSW: the thread contains a definition of "small world", and the connectivity problem is discussed further down in it.