An introduction to HNSW in Lucene 9, with some reference articles

A complete walkthrough of the theory behind the HNSW algorithm (CSDN blog): HNSW, Hierarchical Navigable Small World graphs, first author Y. Malkov. The motivating problem is efficient similarity search over huge datasets, e.g. after I read an article in an app, the recommender should find the articles whose vectors are closest to this article's vector. https://blog.csdn.net/u011233351/article/details/85116719

NSW search:

K-NNSearch(object q, integer: m, k)
    define: TreeSet[object] tempRes, candidates, visitedSet, result
    for (i = 0; i < m; i++) do:
        put a randomly chosen entry point into candidates
        tempRes = empty
        repeat:
            take the point c in candidates that is closest to q
            remove c from candidates
            if c is farther from q than every element in result:
                break out of the repeat loop
            for e in c.friends:
                if e is not in visitedSet:
                    add e to visitedSet, candidates and tempRes
        end repeat
        merge tempRes into result
    end for
    return the top k results from result
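
A minimal runnable Java sketch of the search loop above. The Node class, its friends list, and the distance function are illustrative assumptions, not Lucene classes; the sketch also folds tempRes directly into result, which does not change the outcome:

import java.util.*;

// Illustrative NSW node; "friends" mirrors c.friends in the pseudocode above.
class Node {
  final float[] vector;
  final List<Node> friends = new ArrayList<>();

  Node(float[] vector) {
    this.vector = vector;
  }

  // Squared Euclidean distance; sufficient for ranking by closeness.
  float distanceTo(float[] q) {
    float sum = 0f;
    for (int i = 0; i < q.length; i++) {
      float d = vector[i] - q[i];
      sum += d * d;
    }
    return sum;
  }
}

class NswSearch {
  // m random restarts; expand the closest candidate until it cannot improve
  // on the current top-k, then return the k nearest nodes found.
  static List<Node> knnSearch(List<Node> graph, float[] q, int m, int k) {
    Random random = new Random();
    Set<Node> visited = new HashSet<>();
    // max-heap of the best results so far (worst of the top-k on top)
    PriorityQueue<Node> result =
        new PriorityQueue<>(Comparator.comparingDouble((Node n) -> n.distanceTo(q)).reversed());
    for (int i = 0; i < m; i++) {
      // min-heap of unexpanded candidates, closest to q on top
      PriorityQueue<Node> candidates =
          new PriorityQueue<>(Comparator.comparingDouble(n -> n.distanceTo(q)));
      Node entry = graph.get(random.nextInt(graph.size()));
      if (visited.add(entry)) {
        candidates.add(entry);
      }
      while (!candidates.isEmpty()) {
        Node c = candidates.poll();
        // stop condition from the pseudocode: c is farther than everything in result
        if (result.size() >= k && c.distanceTo(q) > result.peek().distanceTo(q)) {
          break;
        }
        result.add(c);
        if (result.size() > k) {
          result.poll(); // keep only the k best
        }
        for (Node e : c.friends) {
          if (visited.add(e)) {
            candidates.add(e);
          }
        }
      }
    }
    List<Node> topK = new ArrayList<>(result);
    topK.sort(Comparator.comparingDouble(n -> n.distanceTo(q)));
    return topK;
  }
}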
        
        

NSW insertion:

Nearest_Neighbor_Insert(object: new_object, integer: f, w)
    SET[object]: neighbors = k-NNSearch(new_object, w, f)
    for (i = 0; i < f; i++) do:
        neighbors[i].connect(new_object)
        new_object.connect(neighbors[i])
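
Insertion reuses the search: find the f nearest neighbors with w restarts, then link the new node to each of them bidirectionally. A sketch, assuming the illustrative NswSearch class above:

  // Illustrative NSW insertion built on the knnSearch sketch above.
  static void insert(List<Node> graph, Node newNode, int f, int w) {
    if (!graph.isEmpty()) {
      List<Node> neighbors = NswSearch.knnSearch(graph, newNode.vector, w, f);
      for (Node neighbor : neighbors) {
        // undirected edge: both endpoints record each other as friends
        neighbor.friends.add(newNode);
        newNode.friends.add(neighbor);
      }
    }
    graph.add(newNode);
  }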

The HNSW index structure in Lucene looks like this:

 Meta data and index part:
 +--------------------------------------------------+
 | meta data                                        |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 |              ......                              |
 +--------+-----------------------------------------+

 Graph data part:
 +-------------------------+---------------------------+---------+-------------------------+
 | friends list at layer N | friends list at layer N-1 |  ...... | friends list at layer 0 | <- friends lists for doc 0
 +-------------------------+---------------------------+---------+-------------------------+
 | friends list at layer N | friends list at layer N-1 |  ...... | friends list at layer 0 | <- friends lists for doc 1
 +-------------------------+---------------------------+---------+-------------------------+
 |                            ......                                                       | <- and so on
 +-----------------------------------------------------------------------------------------+

 Vector data part:
 +----------------------+
 | encoded vector value | <- vector value for doc 0
 +----------------------+
 | encoded vector value | <- vector value for doc 1
 +----------------------+
 |   ......             | <- and so on
 +----------------------+
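
A hedged sketch of how a reader consumes this layout: look up the doc's offset in the meta/index part, seek into the graph data, and decode that doc's friends list. The encoding details below are assumptions; the real logic lives in Lucene90HnswVectorsReader (which, for example, delta-encodes neighbor ids):

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Illustrative sketch, not the actual Lucene90HnswVectorsReader code.
class FriendsListReader {
  // offset comes from the per-doc offset table in the meta/index part
  int[] readFriends(IndexInput graphData, long graphDataStart, long offset) throws IOException {
    graphData.seek(graphDataStart + offset);
    int size = graphData.readInt();      // assumed: number of neighbors in this list
    int[] friends = new int[size];
    for (int i = 0; i < size; i++) {
      friends[i] = graphData.readVInt(); // assumed: neighbor ids (the real format writes deltas)
    }
    return friends;
  }
}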

Index-writing logic

org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeField

1. Write the vector values

2. Write the graph

3. Write the metadata

The three steps are marked (1), (2) and (3) in the code below.

  @Override
  public void writeField(FieldInfo fieldInfo, VectorValues vectors) throws IOException {
    long pos = vectorData.getFilePointer();
    // write floats aligned at 4 bytes. This will not survive CFS, but it shows a small benefit when
    // CFS is not used, eg for larger indexes
    long padding = (4 - (pos & 0x3)) & 0x3;
    long vectorDataOffset = pos + padding;
    for (int i = 0; i < padding; i++) {
      vectorData.writeByte((byte) 0);
    }
    // TODO - use a better data structure; a bitset? DocsWithFieldSet is p.p. in o.a.l.index
    int[] docIds = new int[vectors.size()];
    int count = 0;
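    // (1) write each document's vector value, remembering which doc ids have vectors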
    for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc(), count++) {
      // write vector
      writeVectorValue(vectors);
      docIds[count] = docV;
    }
    // count may be < vectors.size(), e.g. if some documents were deleted
    long[] offsets = new long[count];
    long vectorDataLength = vectorData.getFilePointer() - vectorDataOffset;
    long vectorIndexOffset = vectorIndex.getFilePointer();
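    // (2) build and write the HNSW graph; per-node offsets within the graph data are collected into the offsets array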
    if (vectors instanceof RandomAccessVectorValuesProducer) {
      writeGraph(
          vectorIndex,
          (RandomAccessVectorValuesProducer) vectors,
          fieldInfo.getVectorSimilarityFunction(),
          vectorIndexOffset,
          offsets,
          count,
          maxConn,
          beamWidth);
    } else {
      throw new IllegalArgumentException(
          "Indexing an HNSW graph requires a random access vector values, got " + vectors);
    }
    long vectorIndexLength = vectorIndex.getFilePointer() - vectorIndexOffset;
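    // (3) write the per-field metadata, then the per-node graph offsets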
    writeMeta(
        fieldInfo,
        vectorDataOffset,
        vectorDataLength,
        vectorIndexOffset,
        vectorIndexLength,
        count,
        docIds);
    writeGraphOffsets(meta, offsets);
  }
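
For context, a minimal sketch of the user-facing path that exercises this writer: index a KnnVectorField per document and search with KnnVectorQuery. The field name, vectors, and similarity function below are arbitrary example choices:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class HnswDemo {
  public static void main(String[] args) throws IOException {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        for (float[] v : new float[][] {{1f, 0f}, {0f, 1f}, {0.7f, 0.7f}}) {
          Document doc = new Document();
          // each KnnVectorField value flows through Lucene90HnswVectorsWriter#writeField
          doc.add(new KnnVectorField("vec", v, VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // approximate top-2 nearest neighbors of the query vector via the HNSW graph
        TopDocs topDocs =
            searcher.search(new KnnVectorQuery("vec", new float[] {0.9f, 0.1f}, 2), 2);
        for (ScoreDoc sd : topDocs.scoreDocs) {
          System.out.println("doc=" + sd.doc + " score=" + sd.score);
        }
      }
    }
  }
}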

An introduction to HNSW (d0evi1's blog): http://d0evi1.com/hnsw/

Delaunay triangulation (luoru, cnblogs): https://www.cnblogs.com/zfluo/p/5131851.html
HNSW study notes (Zhihu): on very large datasets, exact NN search is usually relaxed to ANN for efficiency; common algorithms include KD-trees, LSH, IVFPQ, and the graph-based HNSW. https://zhuanlan.zhihu.com/p/80552211

The discussion thread for adding HNSW:

https://issues.apache.org/jira/browse/LUCENE-9004

The discussion thread for implementing the hierarchical HNSW:

[LUCENE-10054] Handle hierarchy in HNSW graph - ASF JIRA: https://issues.apache.org/jira/browse/LUCENE-10054

The HNSW connectivity problem:

[LUCENE-10069] HNSW can miss results with very large k - ASF JIRA: https://issues.apache.org/jira/browse/LUCENE-10069

The basic principles and usage of HNSW (CSDN blog): starts from small worlds vs. random graphs to explain why NSW can do nearest-neighbor lookups. https://blog.csdn.net/redhatforyou/article/details/109012660

This one gives a definition of small-world networks:

What is the small-world network model (集智百科, Zhihu): https://zhuanlan.zhihu.com/p/141042297

The following discusses the connectivity problem:

A plain-language explanation of the ANN algorithm HNSW (CSDN blog): https://blog.csdn.net/weixin_39687667/article/details/111361853
