[Faiss] Index pre- and post-processing (Part 5)

mjiansun

Category: Algorithms and Data Structures

Original source: https://github.com/liqima/faiss_note/blob/master/4.Faiss%20indexes%20%E5%89%8D(%E5%90%8E)%E5%A4%84%E7%90%86.ipynb

Pre and post processing

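In some situations an Index needs pre-processing or post-processing.

ID mapping

By default, Faiss assigns each added vector a sequential id; you can also assign arbitrary ids of your own.
Some index types provide an add_with_ids method that associates a 64-bit id with each vector, and search returns those ids.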

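# Import faiss
import sys
sys.path.append('/home/maliqi/faiss/python/')  # only needed for a local source build; adjust or remove
import faiss
import numpy as np

# Build the data and the ids
d = 512
n_data = 2000
data = np.random.rand(n_data, d).astype('float32')
ids = np.arange(100000, 102000).astype('int64')  # six-digit integer ids, explicitly int64

nlist = 10
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer, same metric as the IVF index
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(data)
index.add_with_ids(data, ids)
dis, ind = index.search(data[:5], 5)  # separate names so that d (the dimension) is not overwritten
print(ind)  # the returned ids should be the ones we assigned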

[[100000 100383 101007 101444 100729]
 [100001 100880 101693 100004 100964]
 [100002 101113 101998 101017 101768]
 [100003 100694 101701 101608 100831]
 [100004 100111 100564 100541 100513]]

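Some index types, however, do not support add_with_ids. They have to be wrapped in another index, IndexIDMap, which maps the default ids to the ids you specify.
The ids must be integers; strings are not allowed.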

index = faiss.IndexFlatL2(data.shape[1])
index.add_with_ids(data, ids)  # raises RuntimeError

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-4de928a09ab9> in <module>()
      1 index = faiss.IndexFlatL2(data.shape[1])
----> 2 index.add_with_ids(data, ids)

/home/maliqi/faiss/python/faiss/__init__.py in replacement_add_with_ids(self, x, ids)
    104         assert d == self.d
    105         assert ids.shape == (n, ), 'not same nb of vectors as ids'
--> 106         self.add_with_ids_c(n, swig_ptr(x), swig_ptr(ids))
    107 
    108     def replacement_assign(self, x, k):

/home/maliqi/faiss/python/faiss/swigfaiss.py in add_with_ids(self, n, x, xids)
   1316 
   1317     def add_with_ids(self, n, x, xids):
-> 1318         return _swigfaiss.Index_add_with_ids(self, n, x, xids)
   1319 
   1320     def search(self, n, x, k, distances, labels):

RuntimeError: Error in virtual void faiss::Index::add_with_ids(faiss::Index::idx_t, const float*, const long int*) at Index.cpp:46: add_with_ids not implemented for this type of index

index2 = faiss.IndexIDMap(index)
index2.add_with_ids(data, ids)  # index2 keeps a table mapping index's sequential ids to the ids given here
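
Conceptually, an ID-mapping wrapper keeps a table from the inner index's sequential positions to the caller-supplied ids and translates the labels on the way out. The plain-Python sketch below illustrates that idea only: TinyFlatIndex and TinyIDMap are hypothetical names, not Faiss classes, and the brute-force search stands in for a real index.

```python
class TinyFlatIndex:
    """Hypothetical stand-in for a flat index: stores vectors, returns positions."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        self.vectors.extend(vectors)

    def search(self, query, k):
        # Brute-force squared L2 distance to every stored vector.
        scored = [
            (sum((a - b) ** 2 for a, b in zip(query, v)), pos)
            for pos, v in enumerate(self.vectors)
        ]
        scored.sort()
        return scored[:k]  # list of (distance, sequential position)


class TinyIDMap:
    """Hypothetical wrapper: translates sequential positions to custom ids."""

    def __init__(self, index):
        self.index = index
        self.id_map = []  # id_map[position] -> custom id

    def add_with_ids(self, vectors, ids):
        if len(vectors) != len(ids):
            raise ValueError("not the same number of vectors as ids")
        if not all(isinstance(i, int) for i in ids):
            raise TypeError("ids must be integers")
        self.index.add(vectors)
        self.id_map.extend(ids)

    def search(self, query, k):
        # Search the wrapped index, then swap positions for custom ids.
        return [(dist, self.id_map[pos]) for dist, pos in self.index.search(query, k)]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
custom_ids = [100000, 100001, 100002]

mapped = TinyIDMap(TinyFlatIndex())
mapped.add_with_ids(vectors, custom_ids)
nearest = mapped.search([0.9, 0.1], 2)
print([label for _, label in nearest])
```

The wrapped index never sees the custom ids, which is why the mapping table has to be kept for the lifetime of the wrapper.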

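Data transformations

Sometimes the data has to be transformed before it is indexed. Transformation classes inherit from VectorTransform and turn input vectors into output vectors.
1) Random rotation, class RandomRotationMatrix: balances the components of the vectors, typically applied before IndexPQ or IndexLSH;
2) PCA, class PCAMatrix: dimensionality reduction;
3) Dimension remapping, class RemapDimensionsTransform: can raise or lower the dimension of the vectors.

Example: PCA dimensionality reduction (via IndexPreTransform)

The input vectors are 2048-dimensional and need to be reduced to 16 bytes each.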

data = np.random.rand(n_data, 2048).astype('float32')
# the IndexIVFPQ will be in 256D not 2048
coarse_quantizer = faiss.IndexFlatL2 (256) 
sub_index = faiss.IndexIVFPQ (coarse_quantizer, 256, 16, 16, 8)
# PCA 2048->256
# random rotation after the reduction (fourth argument)
pca_matrix = faiss.PCAMatrix (2048, 256, 0, True) 

#- the wrapping index
index = faiss.IndexPreTransform (pca_matrix, sub_index)

# will also train the PCA
index.train(data)  # the data must be 2048-dimensional
# PCA will be applied prior to addition
index.add(data)
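
The 16-byte target follows from the product-quantizer settings passed to IndexIVFPQ: 16 sub-quantizers at 8 bits each. A quick arithmetic check, assuming float32 input (4 bytes per component) and ignoring the stored id and the inverted-list overhead:

```python
d_in = 2048    # input dimension
d_pca = 256    # dimension after PCA
m = 16         # number of PQ sub-quantizers
nbits = 8      # bits per sub-quantizer code

raw_bytes = d_in * 4               # one float32 vector before any processing
code_bytes = m * nbits // 8        # PQ code stored per vector
dims_per_subvector = d_pca // m    # dimensions encoded by each sub-quantizer
compression = raw_bytes // code_bytes

print(raw_bytes, code_bytes, dims_per_subvector, compression)
```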

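Example: increasing the dimension

Sometimes it is useful to pad vectors with zeros to raise their dimension, typically because we need 1) d to be a multiple of 4, which helps distance computation; 2) d to be a multiple of M.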

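d = 512
M = 8   # M is the number of sub-spaces the dimensions are split into
d2 = (d + M - 1) // M * M  # round d up to the next multiple of M
remapper = faiss.RemapDimensionsTransform (d, d2, True)
index_pq = faiss.IndexPQ(d2, M, 8)
index = faiss.IndexPreTransform (remapper, index_pq)  # the index can then be trained, filled and searched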
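
With d = 512 and M = 8 the rounding is a no-op, since 512 is already a multiple of 8. The sketch below uses a hypothetical d = 500 to show the rounding, and why zero padding is harmless for L2 search: the padded components add nothing to the distance. It appends the zeros at the end, which is the simplest layout and an assumption of this sketch, not a statement about how RemapDimensionsTransform arranges them.

```python
def round_up_to_multiple(d, m):
    """Smallest multiple of m that is >= d."""
    return (d + m - 1) // m * m


def pad_with_zeros(vector, d_out):
    """Append zeros until the vector has d_out components."""
    return list(vector) + [0.0] * (d_out - len(vector))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


d, M = 500, 8                      # hypothetical dimension that M does not divide
d2 = round_up_to_multiple(d, M)

x = [float(i % 7) for i in range(d)]
y = [float(i % 5) for i in range(d)]
x2, y2 = pad_with_zeros(x, d2), pad_with_zeros(y, d2)

# The distance between the padded vectors equals the original distance.
print(d2, len(x2), squared_l2(x, y) == squared_l2(x2, y2))
```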

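Re-ranking search results

When querying, the results can be re-ranked using true distances.
In the example below, the search stage first selects 4*10 candidates, computes the true distance for each of them, and returns the best 10. IndexRefineFlat stores all the full vectors, so its memory cost is considerable.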

data = np.random.rand(n_data, d).astype('float32')
nbits_per_index = 4
q = faiss.IndexPQ (d, M, nbits_per_index)
rq = faiss.IndexRefineFlat (q)
rq.train (data)
rq.add (data)
rq.k_factor = 4
dis, ind = rq.search (data[:5], 10)
print(ind)

[[   0  434 1647 1501  867  658  822 1164  490 1430]
 [   1 1035  369  392  866 1645 1961 1469 1946  175]
 [   2  466 1183  403  427  505  394  759  633  746]
 [   3 1668 1798 1293  965 1484  755  315 1633 1453]
 [   4  309  715 1204  996  239 1381   48  707 1311]]
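
The two-stage search can be sketched in plain Python. A crude rounding quantizer stands in for the PQ codes here (an assumption for illustration, much simpler than product quantization), and refine_search is a hypothetical helper, not the Faiss implementation. It shortlists k * k_factor candidates by approximate distance, then re-ranks them with exact distances computed from the stored full vectors.

```python
import random


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, step=0.5):
    """Crude lossy code: snap every component to a grid (stand-in for PQ)."""
    return [round(x / step) * step for x in vector]


def refine_search(query, vectors, codes, k, k_factor):
    # Stage 1: shortlist k * k_factor candidates using the lossy codes only.
    approx = sorted(range(len(codes)), key=lambda i: squared_l2(query, codes[i]))
    shortlist = approx[: k * k_factor]
    # Stage 2: recompute exact distances on the shortlist and keep the best k.
    exact = sorted((squared_l2(query, vectors[i]), i) for i in shortlist)
    return exact[:k]


random.seed(0)
dim, n = 8, 200
vectors = [[random.random() for _ in range(dim)] for _ in range(n)]
codes = [quantize(v) for v in vectors]  # what the approximate index would store

results = refine_search(vectors[3], vectors, codes, k=10, k_factor=4)
print([i for _, i in results])
```

Both the codes and the full vectors are held in memory, which mirrors the memory cost noted above for IndexRefineFlat.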

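Combining results from multiple indexes

When the dataset is spread over several indexes, the search must be run on each of them and the results merged, which IndexShards does. This also applies when the indexes live on different GPUs.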
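
Merging per-shard results amounts to taking the k smallest distances across all shards while translating each shard's local ids into global ones. Below is a minimal sketch with hypothetical helpers (search_shard, search_all_shards), assuming each shard numbers its vectors from 0 and that global ids are assigned by stacking the shards in order. It illustrates the idea rather than IndexShards itself.

```python
import heapq


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(query, shard, k):
    """Brute-force top-k inside one shard; returns (distance, local id) pairs."""
    return heapq.nsmallest(k, ((squared_l2(query, v), i) for i, v in enumerate(shard)))


def search_all_shards(query, shards, k):
    merged = []
    offset = 0
    for shard in shards:
        # Shift local ids so that they are unique across shards.
        merged.extend((dist, offset + i) for dist, i in search_shard(query, shard, k))
        offset += len(shard)
    # Keep the k best results over all shards.
    return heapq.nsmallest(k, merged)


shards = [
    [[0.0, 0.0], [5.0, 5.0]],              # global ids 0, 1
    [[1.0, 0.0], [0.0, 1.5], [9.0, 9.0]],  # global ids 2, 3, 4
]
top = search_all_shards([0.0, 0.0], shards, k=3)
print(top)
```

Each shard only needs to return its own top k, because no result outside a shard's top k can appear in the global top k.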
