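The Faiss Search Library
Faiss (Facebook AI Similarity Search) is an open-source library from the Facebook AI team for similarity search and clustering of dense vectors. It supports search over billions of vectors and is one of the more mature approximate nearest-neighbor search libraries available.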
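To make "similarity search over dense vectors" concrete, here is a minimal NumPy sketch of what an exact L2 nearest-neighbor search computes. It is an illustration, not Faiss's implementation: the function name `exact_l2_search` and the random data are assumptions made up for this example. Like IndexFlatL2, it returns squared L2 distances.

```python
import numpy as np

def exact_l2_search(database, queries, k):
    """Return (squared L2 distances, indices) of the k nearest database rows per query."""
    database = database.astype(np.float64)
    queries = queries.astype(np.float64)
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, evaluated for all pairs at once
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)
    x_norms = (database ** 2).sum(axis=1)
    dists = q_norms - 2.0 * queries @ database.T + x_norms
    dists = np.maximum(dists, 0.0)  # clamp tiny negatives caused by rounding
    idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx

# Hypothetical data: 1000 random 32-dimensional vectors
rng = np.random.default_rng(0)
database = rng.random((1000, 32), dtype=np.float32)
queries = database[:5] + 0.001  # slightly perturbed copies of the first 5 rows

D, I = exact_l2_search(database, queries, k=4)
print(I[:, 0])  # nearest neighbor of each query
```

Every query is compared against every database vector, so the cost grows linearly with the database size. That is the cost the approximate indexes try to avoid.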
Reference introductions
[Usage 1], [Recommended], [Usage 3]
In the CosPlace project, the relevant code in test.py is as follows:
import faiss
import logging
import time
# Compute R@1, R@5, R@10, R@20
RECALL_VALUES = [1, 5, 10, 20]
# ...
queries_descriptors = all_descriptors[eval_ds.database_num:]
database_descriptors = all_descriptors[:eval_ds.database_num]
# IndexFlatL2 index: use an exact kNN search to find predictions
tic = time.time()
faiss_index = faiss.IndexFlatL2(args.fc_output_dim)
faiss_index.add(database_descriptors)
print('Index built in {} sec'.format(time.time() - tic))
del all_descriptors  # keep database_descriptors: it is reused to train and fill the indexes below
logging.debug("Calculating recalls")
_, predictions = faiss_index.search(queries_descriptors, max(RECALL_VALUES))
print('Searched in {} sec'.format(time.time() - tic))  # tic is not reset, so this includes the index build time
print(predictions.shape)
print(predictions[:5])
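# IndexIVFFlat index
nlist = 100  # number of cells (clusters) the vectors are partitioned into
tic = time.time()
quantizer = faiss.IndexFlatL2(args.fc_output_dim)  # coarse quantizer that assigns vectors to cells; the argument is the vector dimension d
index = faiss.IndexIVFFlat(quantizer, args.fc_output_dim, nlist, faiss.METRIC_L2)
# METRIC_L2 is passed explicitly so that the search uses Euclidean (L2) distance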
# assert not index.is_trained
index.train(database_descriptors)
# assert index.is_trained
index.add(database_descriptors) # add may be a bit slower as well
print('Index built in {} sec'.format(time.time() - tic))
index.nprobe = 10  # number of cells (out of nlist) visited per query; the default is 1
D, I = index.search(queries_descriptors, max(RECALL_VALUES))  # actual search
print('Searched in {} sec'.format(time.time() - tic))  # tic is not reset, so this includes the index build time
# print("D.shape: ",D.shape)
# print("D[:5]", D[:5])
print("I.shape: ", I.shape)
print("I[:5]", I[:5])  # neighbors of the first 5 queries
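# IndexIVFPQ index
nlist = 100
m = 64  # number of sub-vectors per vector; the vector dimension must be divisible by m
tic = time.time()
quantizer = faiss.IndexFlatL2(args.fc_output_dim)  # this remains the same
# To scale to very large datasets, Faiss provides index variants that compress the stored vectors
# with a lossy scheme based on product quantization. This trades some accuracy for memory: because
# the compression is lossy, even the distance from a vector to itself is no longer exactly 0.
index = faiss.IndexIVFPQ(quantizer, args.fc_output_dim, nlist, m, 8)
# 8 specifies that each sub-vector is encoded as 8 bits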
index.train(database_descriptors)
index.add(database_descriptors)
print('Index built in {} sec'.format(time.time() - tic))
# D, I = index.search(xb[:5], k) # sanity check
# print(I)
# print(D)
index.nprobe = 10 # make comparable with experiment above
_, I = index.search(queries_descriptors, max(RECALL_VALUES)) # search
print('Searched in {} sec'.format(time.time() - tic))  # tic is not reset, so this includes the index build time
# print(I[:5])
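The code above stops at the predicted neighbor indices. To turn them into the R@1, R@5, R@10 and R@20 values named in the first comment, each query's predictions have to be compared against its ground-truth matches. The following is a minimal sketch of that step, not the project's exact code: `compute_recalls`, `positives_per_query` and the toy data are assumptions introduced for illustration.

```python
import numpy as np

RECALL_VALUES = [1, 5, 10, 20]

def compute_recalls(predictions, positives_per_query, recall_values=RECALL_VALUES):
    """Percentage of queries with at least one correct match in the top N, for each N."""
    hits = np.zeros(len(recall_values))
    for preds, positives in zip(predictions, positives_per_query):
        for i, n in enumerate(recall_values):
            if np.isin(preds[:n], positives).any():
                hits[i:] += 1  # a hit within the top n also counts for every larger N
                break
    return hits / len(predictions) * 100

# Toy data (hypothetical): 3 queries, each with the same top-20 predictions 0..19
predictions = np.tile(np.arange(20), (3, 1))
positives_per_query = [np.array([0]), np.array([7]), np.array([99])]

recalls = compute_recalls(predictions, positives_per_query)
for n, r in zip(RECALL_VALUES, recalls):
    print("R@{}: {:.1f}".format(n, r))
```

In the toy data the first query is matched at rank 1, the second at rank 8 and the third never, so only the first counts toward R@1 and R@5, while the first two count toward R@10 and R@20.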
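The code above implements three index types: IndexFlatL2, IndexIVFFlat, and IndexIVFPQ. Tested on a dataset with 1,700 database images and 10,000 query images, retrieving the top 20 results per query, the timings were as follows. Because tic is not reset between building and searching, each "searched" figure is the cumulative time measured from the start of the index build:
- IndexFlatL2: index built (0.0231 sec), searched (0.1628 sec)
- IndexIVFFlat: index built (0.2696 sec), searched (0.7498 sec)
- IndexIVFPQ: index built (6.7583 sec), searched (6.8314 sec). Take care when choosing the parameter m, since the vector dimension must be divisible by it; for the resulting error, see [Question 9].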
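The constraint on m can be checked before an index is built. The sketch below lists the values of m that divide a given dimension and shows how many bytes one compressed vector occupies. The helper names `valid_pq_m` and `pq_code_bytes` are made up for this example, and `d = 512` is a hypothetical stand-in for `args.fc_output_dim`.

```python
def valid_pq_m(d, max_m=None):
    """Candidate values of m: the divisors of the vector dimension d."""
    max_m = d if max_m is None else max_m
    return [m for m in range(1, max_m + 1) if d % m == 0]

def pq_code_bytes(m, nbits=8):
    """Bytes needed for one PQ-encoded vector: m sub-vector codes of nbits each."""
    return (m * nbits + 7) // 8

d = 512  # hypothetical descriptor dimension
candidates = valid_pq_m(d, max_m=64)
print("valid m values up to 64:", candidates)
print("PQ code size with m=64:", pq_code_bytes(64), "bytes")
print("raw float32 vector size:", d * 4, "bytes")
```

Under these assumptions, m = 64 with 8 bits per sub-vector stores each vector in 64 bytes instead of 2048, which is where the memory saving of product quantization comes from.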
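Summary: in theory IndexIVFPQ should be the most efficient, but on a small database the brute-force IndexFlatL2 search is faster end to end. IndexFlatL2 only computes Euclidean distances directly, whereas IndexIVFFlat and IndexIVFPQ both require a training step first.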
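The role of the training step and of nprobe can be illustrated with a toy inverted-file index in NumPy. This is a sketch of the general IVF idea under simplifying assumptions (plain k-means, no compression), not Faiss's implementation. The class `TinyIVF`, the helper `sq_dists` and the random data are all made up for this example.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared L2 distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

class TinyIVF:
    """Toy inverted-file index: k-means cells plus exhaustive search inside probed cells."""

    def __init__(self, nlist, nprobe=1):
        self.nlist, self.nprobe = nlist, nprobe

    def train(self, x, n_iter=10, seed=0):
        # Plain Lloyd's k-means: the extra training step a flat index does not need.
        rng = np.random.default_rng(seed)
        self.centroids = x[rng.choice(len(x), self.nlist, replace=False)].copy()
        for _ in range(n_iter):
            assign = sq_dists(x, self.centroids).argmin(axis=1)
            for c in range(self.nlist):
                members = x[assign == c]
                if len(members):  # keep the old centroid if a cell went empty
                    self.centroids[c] = members.mean(axis=0)

    def add(self, x):
        # Store each vector's id in the cell of its nearest centroid.
        self.x = x
        assign = sq_dists(x, self.centroids).argmin(axis=1)
        self.cells = [np.flatnonzero(assign == c) for c in range(self.nlist)]

    def search(self, queries, k):
        # For each query, scan only the nprobe cells whose centroids are closest.
        probe = np.argsort(sq_dists(queries, self.centroids), axis=1)[:, :self.nprobe]
        out = np.full((len(queries), k), -1, dtype=np.int64)  # -1 marks "no result"
        for qi, q in enumerate(queries):
            cand = np.concatenate([self.cells[c] for c in probe[qi]])
            d = ((self.x[cand] - q) ** 2).sum(axis=1)
            top = cand[np.argsort(d, kind="stable")[:k]]
            out[qi, :len(top)] = top
        return out

rng = np.random.default_rng(0)
xb = rng.random((2000, 16))  # hypothetical database vectors
xq = rng.random((50, 16))    # hypothetical query vectors
exact = np.argsort(sq_dists(xq, xb), axis=1)[:, :5]  # brute-force reference

index = TinyIVF(nlist=20)
index.train(xb)
index.add(xb)
overlaps = {}
for nprobe in (1, 5, 20):
    index.nprobe = nprobe
    found = index.search(xq, 5)
    overlaps[nprobe] = float(np.mean([len(set(f) & set(e)) / 5 for f, e in zip(found, exact)]))
    print("nprobe={}: overlap with exact top-5 = {:.3f}".format(nprobe, overlaps[nprobe]))
```

With nprobe equal to nlist every cell is scanned, so the result matches brute force while still paying for training and cell assignment. That fixed overhead is why a small database gains little from an inverted-file index.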