[Faiss] Advanced index operations (Part 8)

mjiansun

Original source: https://github.com/liqima/faiss_note/blob/master/4.Faiss%20indexes%20%E8%BF%9B%E9%98%B6%E6%93%8D%E4%BD%9C.ipynb

Advanced index operations

The methods described below are supported by only some index types.

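Recovering the original data from an index

Given an id, the reconstruct method (one vector) or the reconstruct_n method (a contiguous range of ids) recovers the stored vector from the index. For an index that compresses its vectors, such as IndexIVFPQ, the result is the decoded approximation rather than the exact input.
Supported by IndexFlat, IndexIVFFlat (in combination with make_direct_map), IndexIVFPQ, and IndexPreTransform.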

# Import faiss
import sys
import numpy as np
# Only needed when faiss was built from source and is not already on the Python path
sys.path.append('/home/maliqi/faiss/python/')
import faiss

# Generate data
d = 16          # vector dimension
n_data = 500    # number of database vectors
data = np.random.rand(n_data, d).astype('float32')

index = faiss.IndexFlatL2(d)
index.add(data)
re_data = index.reconstruct(0)  # id of the vector to recover; one vector per call
print(re_data)
re_data_n = index.reconstruct_n(0, 10)  # starting at id 0, recover 10 consecutive vectors
print(re_data_n.shape)

[0.58085376 0.5048806  0.99052334 0.5899147  0.5211166  0.35997516
 0.7275415  0.1242122  0.08336558 0.48458952 0.3289773  0.905333
 0.6513156  0.33422878 0.04078896 0.6842935 ]
(10, 16)

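Removing vectors from an index

The remove_ids method removes a subset of the vectors in an index. It takes an IDSelector object (or an IDSelectorBatch for a list of ids) that is consulted for every vector to decide whether it should be removed. Because this makes a pass over every vector in the database, it is only recommended when a large share of the vectors has to be removed.
Supported by IndexFlat, IndexIVFFlat, IndexIVFPQ, and IDMap.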

index = faiss.IndexFlatL2(d)
index.add(data)
print(index.ntotal)
index.remove_ids(np.arange(5, dtype='int64'))  # ids of the vectors to remove (Faiss ids are 64-bit integers)
print(index.ntotal)  # 5 vectors were removed, 495 remain

500
495

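Searching for vectors within a distance range

Range search returns every database vector whose distance to the query falls within a given radius, for example all vectors at a distance below 0.3 from the query. Note that IndexFlatL2 reports squared Euclidean distances, so the radius is compared against squared distances.
Supported by IndexFlat and IndexIVFFlat, on CPU only.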

index = faiss.IndexFlatL2(d)
index.add(data)
# Define a radius/threshold; IndexFlatL2 compares it against squared L2 distances
dist = float(np.linalg.norm(data[3] - data[0])) * 0.99
res_index = index.range_search(data[[49], :], dist)  # query with the 50th vector
# The result is a triple (lims, distances, ids); the hits for query i sit at positions lims[i]:lims[i+1]
print(res_index)
res_index = index.range_search(data[[9], :], dist)  # query with the 10th vector
print(res_index)

(array([0, 8], dtype=uint64), array([0.        , 1.165087  , 0.92170537, 0.9101888 , 1.2231735 ,
       1.2296542 , 1.2302384 , 1.1056653 ], dtype=float32), array([ 49, 135, 150, 225, 266, 323, 484, 491]))
(array([ 0, 26], dtype=uint64), array([1.2187614 , 0.        , 1.2426732 , 0.82170576, 1.1128769 ,
       0.8076687 , 1.2431146 , 0.9778436 , 1.2443304 , 1.1967008 ,
       1.1036559 , 1.1283486 , 1.1076214 , 1.2520782 , 1.2406417 ,
       1.2235129 , 1.0338147 , 1.1743065 , 0.9288659 , 1.1673778 ,
       1.1726046 , 1.1790745 , 1.1337838 , 1.1365123 , 1.2428    ,
       1.0492276 ], dtype=float32), array([  6,   9,  11,  15,  41,  47,  50,  58,  75, 104, 108, 112, 122,
       135, 162, 169, 213, 236, 271, 290, 342, 434, 463, 467, 477, 479]))

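Splitting/merging indexes

Several indexes can be merged into one. Note that the data in these indexes should follow the same distribution, and the indexes should be trained on data from that distribution. If the distributions differ, merging does not raise an error, but in theory it lowers the accuracy of the index; in that case the index should be retrained on a training set drawn from the same distribution as the merged dataset.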

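nlist = 10  # number of inverted lists (clusters)
quantizer = faiss.IndexFlatL2(d)
index1 = faiss.IndexIVFFlat(quantizer, d, nlist)
index1.train(data)  # trains the shared coarse quantizer
index1.add(data[:250])
index2 = faiss.IndexIVFFlat(quantizer, d, nlist)  # reuses the quantizer trained above
index2.add(data[250:])
index1.merge_from(index2, 250)  # add 250 to the ids of index2 so they do not collide with those of index1
print(index1.ntotal)  # the merged index should contain 500 vectors
dis, ind = index1.search(data[:5], 10)
print(ind)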

500
[[  0  28 382 194 286 114 308 480 254 279]
 [  1 416 272 250 296 138 366 281  93 169]
 [  2  44 491 231 178 285 117 273  83 187]
 [  3 194  28 143 270 430 264 382 197 279]
 [  4 464 317  89 325 498  83 101 285  51]]