hnswlib: a vector ANN search library


About hnswlib

Header-only C++/python library for fast approximate nearest neighbors

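hnswlib is a C++ library for fast approximate nearest neighbor (ANN) search over high-dimensional vectors, built on the Hierarchical Navigable Small World (HNSW) algorithm.
HNSW is a graph-based ANN method designed to efficiently find the vectors most similar to a given query vector in a high-dimensional space.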
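
To make "graph-based search" concrete, here is a minimal sketch of greedy search over a single-layer neighbor graph. It is a simplification for illustration only: real HNSW uses several layers and a candidate list, and the helper names (`build_knn_graph`, `greedy_search`) are hypothetical, not part of hnswlib.

```python
import random

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_knn_graph(points, num_neighbors):
    """Link every point to its num_neighbors closest points (brute force)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: l2_squared(p, points[j]))
        graph[i] = others[:num_neighbors]
    return graph

def greedy_search(points, graph, query, entry_point=0):
    """Walk the graph, always moving to the neighbor closest to the query.

    The walk stops at a local minimum, so the result is approximate.
    """
    current = entry_point
    current_dist = l2_squared(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = l2_squared(points[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist

random.seed(0)
points = [[random.random() for _ in range(8)] for _ in range(200)]
graph = build_knn_graph(points, num_neighbors=8)
query = [random.random() for _ in range(8)]

found, found_dist = greedy_search(points, graph, query)
exact = min(range(len(points)), key=lambda j: l2_squared(points[j], query))
exact_dist = l2_squared(points[exact], query)
print("greedy result:", found, "exact result:", exact)
```

The greedy walk only inspects the neighbors of the points it visits, which is where the speed-up over a full scan comes from. The price is that it may return a point that is close to the query but not the closest one.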


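The main features of hnswlib are:

  1. High-dimensional vectors: hnswlib targets high-dimensional vector data and suits applications such as image search, text retrieval, and recommender systems.
  2. HNSW algorithm: the core of hnswlib. It builds an index from a graph structure with local connectivity, which makes nearest neighbor search in high-dimensional spaces efficient.
  3. Multithreading: batch insertion and batch search can run in parallel across multiple cores.
  4. Memory efficiency: hnswlib is designed to store and manage large collections of high-dimensional vectors with modest memory overhead.
  5. Configurable parameters: index construction and search parameters can be tuned to the needs of each application.
  6. Easy integration: hnswlib offers a small C++ interface that is straightforward to embed in an application.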
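
As a rough illustration of how the parameters affect memory, the sketch below estimates index size from the element count, the dimension, and `M`. The formula is an assumption made for this example (float32 vectors, up to 2 * M bottom-layer links per element, 4 bytes per link); it ignores upper layers, labels, and bookkeeping, so treat the result as a lower bound and not as hnswlib's actual accounting. The function name `estimate_index_bytes` is hypothetical.

```python
def estimate_index_bytes(num_elements, dim, M, bytes_per_float=4, bytes_per_link=4):
    """Rough lower bound on index size in bytes.

    Assumptions (not taken from the article): float32 vectors, up to 2 * M
    links per element on the bottom layer, 4 bytes per link. Upper layers,
    labels, and bookkeeping are ignored.
    """
    vector_bytes = dim * bytes_per_float
    link_bytes = 2 * M * bytes_per_link
    return num_elements * (vector_bytes + link_bytes)

# Compare a few values of M for 10,000 vectors of dimension 128.
for M in (8, 16, 32):
    size = estimate_index_bytes(num_elements=10_000, dim=128, M=M)
    print(f"M={M}: about {size / 1e6:.1f} MB")
```

Under these assumptions the vectors themselves dominate at dimension 128, and the link storage grows linearly with `M`.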

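hnswlib's main strengths are fast nearest neighbor search and good support for high-dimensional vectors, which make it a strong choice for large-scale high-dimensional data.
It is widely used in machine learning, data mining, natural language processing, and other fields.
hnswlib also provides a Python interface.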




Installation

pip install hnswlib

Install from source

apt-get install -y python3-setuptools python3-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .

Usage examples

C++ examples: https://github.com/nmslib/hnswlib/blob/master/examples/cpp/EXAMPLES.md


1. Create an index, insert elements, search, and serialize with pickle

import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be >= k

# Query dataset, k - number of the closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip

### Index parameters are exposed as class properties:
print(f"Parameters passed to constructor:  space={p_copy.space}, dim={p_copy.dim}") 
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
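
The `space` argument above selects the distance function. The following pure-Python sketch shows what the three options compute according to the hnswlib README (worth checking against your installed version): `l2` is the squared Euclidean distance, `ip` is one minus the inner product, and `cosine` is one minus the cosine similarity. The function names are hypothetical and exist only for this illustration.

```python
import math

def l2_distance(a, b):
    # Squared Euclidean distance (no square root is taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ip_distance(a, b):
    # One minus the inner product; smaller means "more similar".
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # One minus the cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm

a = [1.0, 0.0]
b = [0.0, 1.0]
print(l2_distance(a, b), ip_distance(a, b), cosine_distance(a, b))
```

One practical consequence is that the `distances` returned for `space='l2'` are squared values, so take a square root before comparing them with Euclidean distances computed elsewhere.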

2. Updating an index after serialization/deserialization

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls the index quality/build time tradeoff (higher is more accurate but slower to build)
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path = 'first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index(index_path)
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from '%s'\n" % index_path)

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index(index_path, max_elements=num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
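
The recall check above queries stored vectors for themselves, which only works for k = 1 and for queries that are already in the index. For arbitrary queries, recall is usually measured against ground truth from an exact search. A minimal sketch of such a measure follows; the helper name `recall_at_k` and the sample labels are made up for illustration.

```python
def recall_at_k(predicted, expected):
    """Fraction of expected neighbors that appear in the predicted lists.

    predicted and expected are lists of label lists, one per query.
    """
    hits = 0
    total = 0
    for pred_row, true_row in zip(predicted, expected):
        hits += len(set(pred_row) & set(true_row))
        total += len(true_row)
    return hits / total if total else 0.0

# Hypothetical labels for three queries with k = 2.
predicted = [[0, 7], [1, 3], [2, 9]]   # e.g. labels returned by an ANN index
expected = [[0, 4], [1, 3], [5, 2]]    # e.g. labels from an exact search
print("recall@2:", recall_at_k(predicted, expected))
```

The `labels` array returned by `knn_query` can be passed as `predicted` after converting it with `labels.tolist()`.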

伊织, 2024-03-06 (Wed)
