Getting to Know the Faiss Library

V-SLAM

Last modified: 2022-11-18 15:34:26

Category: Visual Localization. Tags: faiss, database, python

First published: 2022-11-18 15:31:53

Article link: https://blog.csdn.net/weixin_43166819/article/details/127923269

The Faiss Search Library

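Faiss (Facebook AI Similarity Search) is an open-source library from the Facebook AI team for similarity search and clustering. It provides efficient similarity search and clustering for dense vectors, supports search over billion-scale vector collections, and is one of the more mature approximate nearest neighbor search libraries currently available.
Reference introductions:
[Usage 1], [Recommended], [Usage 3]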
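
To make "similarity search over dense vectors" concrete, here is a minimal pure-Python sketch of exact k-nearest-neighbor search under squared L2 distance, which is the exhaustive computation a flat L2 index performs. It does not use Faiss; the function names `squared_l2` and `brute_force_knn` and the toy vectors are made up for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(database, queries, k):
    """Return (distances, indices), one row per query, nearest first."""
    all_dists, all_ids = [], []
    for q in queries:
        # Compare the query against every database vector, then sort.
        scored = sorted((squared_l2(q, v), i) for i, v in enumerate(database))
        top = scored[:k]
        all_dists.append([d for d, _ in top])
        all_ids.append([i for _, i in top])
    return all_dists, all_ids


# Toy data (hypothetical): four 2-D database vectors, two queries.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1], [0.0, 1.8]]
D, I = brute_force_knn(database, queries, k=2)
print(I)
print(D)
```

The cost grows with the number of database vectors times the number of queries, which is why approximate index structures become attractive as the database gets large.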

In the Cosplace project, the relevant code in test.py is as follows:

import logging
import time

import faiss

# Compute R@1, R@5, R@10, R@20
RECALL_VALUES = [1, 5, 10, 20]
# ...
# all_descriptors, eval_ds and args come from the elided part of test.py
queries_descriptors = all_descriptors[eval_ds.database_num:]
database_descriptors = all_descriptors[:eval_ds.database_num]

# IndexFlatL2: use an exact (brute-force) kNN to find predictions
tic = time.time()
faiss_index = faiss.IndexFlatL2(args.fc_output_dim)
faiss_index.add(database_descriptors)
print('Index built in {} sec'.format(time.time() - tic))
# `del database_descriptors, all_descriptors` is dropped here because
# database_descriptors is still needed to train the indexes below.

logging.debug("Calculating recalls")
_, predictions = faiss_index.search(queries_descriptors, max(RECALL_VALUES))
# tic is not reset, so this prints the cumulative build + search time
print('Searched in {} sec'.format(time.time() - tic))
print(predictions.shape)
print(predictions[:5])

# IndexIVFFlat: inverted-file index
nlist = 100  # number of cells (clusters)
tic = time.time()
quantizer = faiss.IndexFlatL2(args.fc_output_dim)  # coarse quantizer; the argument is the vector dimension d
# METRIC_L2 is passed explicitly so that the index searches by L2 distance
index = faiss.IndexIVFFlat(quantizer, args.fc_output_dim, nlist, faiss.METRIC_L2)
# assert not index.is_trained
index.train(database_descriptors)
# assert index.is_trained
index.add(database_descriptors)  # add may be a bit slower as well
print('Index built in {} sec'.format(time.time() - tic))
index.nprobe = 10  # number of cells (out of nlist) visited per search; default nprobe is 1, try a few more
D, I = index.search(queries_descriptors, max(RECALL_VALUES))  # actual search
print('Searched in {} sec'.format(time.time() - tic))  # cumulative since tic
# print("D.shape: ", D.shape)
# print("D[:5]", D[:5])
print("I.shape: ", I.shape)
print("I[:5]", I[:5])  # neighbors of the first 5 queries

# IndexIVFPQ: inverted-file index with product quantization
nlist = 100
m = 64  # number of sub-quantizers; the vector dimension must be divisible by m
tic = time.time()
quantizer = faiss.IndexFlatL2(args.fc_output_dim)  # this remains the same
# To scale to very large datasets, Faiss provides index variants that compress
# the stored vectors with lossy product quantization. Some accuracy is traded
# away: because of the lossy compression, even the distance from a vector to
# itself is no longer 0.
index = faiss.IndexIVFPQ(quantizer, args.fc_output_dim, nlist, m, 8)
# 8 specifies that each sub-vector is encoded as 8 bits
index.train(database_descriptors)
index.add(database_descriptors)
print('Index built in {} sec'.format(time.time() - tic))
# D, I = index.search(xb[:5], k) # sanity check
# print(I)
# print(D)
index.nprobe = 10  # make comparable with experiment above
_, I = index.search(queries_descriptors, max(RECALL_VALUES))  # search
print('Searched in {} sec'.format(time.time() - tic))  # cumulative since tic
# print(I[:5])
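
The excerpt stops once the predicted database indices are available. As a rough sketch of how the R@N values named in its first comment could then be computed, the snippet below assumes a list `positives_per_query` holding the correct database indices for each query. That name, the function `compute_recalls`, and the toy data are assumptions for illustration and are not taken from the excerpt.

```python
RECALL_VALUES = [1, 5, 10, 20]


def compute_recalls(predictions, positives_per_query, recall_values=RECALL_VALUES):
    """Percentage of queries with at least one correct match in the top N."""
    hits = [0] * len(recall_values)
    for preds, positives in zip(predictions, positives_per_query):
        positives = set(positives)
        for i, n in enumerate(recall_values):
            if any(p in positives for p in preds[:n]):
                # A hit within the top n is also a hit for every larger n.
                for j in range(i, len(recall_values)):
                    hits[j] += 1
                break
    return [100.0 * h / len(predictions) for h in hits]


# Toy data (hypothetical): 2 queries with their top-5 predictions.
predictions = [[4, 7, 1, 0, 9], [3, 2, 8, 6, 5]]
positives_per_query = [[4], [6, 11]]
recalls = compute_recalls(predictions, positives_per_query, recall_values=[1, 5])
print(recalls)
```

Running the same function on the output of each index type would show how much retrieval accuracy, if any, the approximate indexes give up relative to the exact search.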

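The code above implements three index types: IndexFlatL2, IndexIVFFlat, and IndexIVFPQ. It was tested on a dataset with 1,700 database images and 10,000 query images, retrieving the top 20 matches per query. Because the timer is not reset between building and searching, each "searched" time below also includes the build time. The results were: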

IndexFlatL2: index built in 0.0231 sec, searched in 0.1628 sec
IndexIVFFlat: index built in 0.2696 sec, searched in 0.7498 sec
IndexIVFPQ: index built in 6.7583 sec, searched in 6.8314 sec. Take care when setting the parameter m; for the error it can trigger, see [Question 9].
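
In the code, `tic` is set once per index and not reset before the search, so the "searched" figures appear to be cumulative. Assuming the reported numbers were produced exactly that way, the search-only time can be estimated by subtraction, as in this small sketch (the dictionary and function names are made up for illustration):

```python
# Reported (index built, searched) times in seconds, copied from the results above.
reported = {
    "IndexFlatL2": (0.0231, 0.1628),
    "IndexIVFFlat": (0.2696, 0.7498),
    "IndexIVFPQ": (6.7583, 6.8314),
}


def search_only(built, searched_cumulative):
    """Estimate the search time, assuming 'searched' includes the build time."""
    return round(searched_cumulative - built, 4)


search_times = {name: search_only(*times) for name, times in reported.items()}
for name, seconds in search_times.items():
    print(f"{name}: search alone took about {seconds} sec")
```

Under that assumption, nearly all of the IndexIVFPQ time is spent on training and adding vectors rather than on the search itself.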

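Summary: in theory IndexIVFPQ should be more efficient, but on a small database the brute-force IndexFlatL2 search is actually faster end to end. It simply computes Euclidean distances with no setup cost, whereas IndexIVFFlat and IndexIVFPQ both require a training step first.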
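
The payoff of product quantization is memory rather than speed on small data. The following back-of-the-envelope sketch compares the per-vector storage of uncompressed float32 vectors with that of PQ codes. The dimension `d = 512` is an assumption, since the article does not state the value of `args.fc_output_dim`, and the function names are hypothetical. The estimate ignores vector IDs, codebooks, and inverted-list overhead.

```python
def flat_bytes_per_vector(d):
    """Uncompressed storage: d float32 components of 4 bytes each."""
    return 4 * d


def pq_bytes_per_vector(d, m, nbits=8):
    """PQ code size: m sub-quantizers with nbits each, rounded up to bytes."""
    if d % m != 0:
        # Product quantization splits the vector into m equal sub-vectors.
        raise ValueError(f"dimension d={d} must be divisible by m={m}")
    return (m * nbits + 7) // 8


d = 512  # assumed descriptor dimension, not given in the article
m = 64   # value used in the code above

flat = flat_bytes_per_vector(d)
pq = pq_bytes_per_vector(d, m)
print(f"flat: {flat} bytes/vector, PQ: {pq} bytes/vector, ratio: {flat // pq}x")

for n in (1_700, 1_000_000_000):
    print(f"n={n}: flat {n * flat / 1e6:.1f} MB, PQ codes {n * pq / 1e6:.1f} MB")
```

With 1,700 vectors the uncompressed data is only a few megabytes, so compression buys little, while at a billion vectors the difference decides whether the index fits in memory at all.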