A summary of the IndexFlatL2, IndexIVFFlat, and IndexIVFPQ indexes in Faiss, and how to choose among them

蛐蛐蛐

Article link: https://blog.csdn.net/qysh123/article/details/118565275

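Using a worked example, this post compares three Faiss indexes, IndexFlatL2, IndexIVFFlat, and IndexIVFPQ, in terms of search results and space usage when storing and retrieving a large number of embeddings. IndexIVFPQ uses lossy compression to store vectors compactly: accuracy drops somewhat, but the space savings are substantial. Adjusting the parameter m lets you trade accuracy against space.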

For a project and for my research, I need to store and search a large number of embeddings. In an earlier post I tried one approach: https://blog.csdn.net/qysh123/article/details/113754991

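That approach felt too naive, though, so I took some advice and tried Faiss. I have to say that although Faiss seems quite powerful, its documentation and explanations are still pretty bad. For example, this page is supposed to tell us how to choose an index: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index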

Yet some users complain that even after following those guidelines they still ran out of memory: https://github.com/facebookresearch/faiss/issues/1239

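So the biggest problem with many projects is that they are very unfriendly to beginners and to people who just want to get going quickly. Ask programmers to write documentation and it feels like their heads turn to mush.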

I only really understood it after reading some blog posts: https://www.cnblogs.com/yhzhou/p/10569311.html

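Building on the post above, I will try to analyze three indexes: IndexFlatL2, IndexIVFFlat, and IndexIVFPQ. I modified the example from that post slightly. First, we generate 1,000,000 rows of fake data: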

import numpy as np

# Build the data
d = 50                           # dimension
nb = 1000000                     # database size
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
# Shift dimension 0 by the row number so rows with nearby indices are closer together
xb[:, 0] += np.arange(nb) / 1000.

print(xb[:1])

# Write the matrix to a text file
np.savetxt('data.txt', xb)

With this data in place, let's see how much space each of the three indexes takes (dump each one with pickle, then check the file size):

import numpy as np
import faiss
import pickle

# Load the text file into a float32 NumPy matrix
# (the same data is used for training, indexing, and querying)
dataArray = np.loadtxt('data.txt', dtype='float32')
print(dataArray[0])

print(dataArray.shape)
# Get the vector dimension
d = dataArray.shape[1]

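# IndexFlatL2
# Build an IndexFlatL2 over the vectors. It is the simplest index type
# and only performs brute-force L2 distance search.
index = faiss.IndexFlatL2(d)
index.add(dataArray)

# we want to see 11 nearest neighbors
k = 11
D, I = index.search(dataArray[:5], k)
# neighbors of the 5 first queries
print(I[:5])

# Dump the index with pickle so its file size can be checked
# (requires a Faiss build whose Python index objects can be pickled)
with open('IndexFlatL2.pkl', 'wb') as f_Index:
    pickle.dump(index, f_Index, protocol=4)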

# IndexIVFFlat
nlist = 100  # number of cells (clusters)
k = 11
quantizer = faiss.IndexFlatL2(d)  # the other index; d is the vector dimension
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# METRIC_L2 is passed explicitly so the search uses L2 distance

# An IVF index must be trained (clustered) before vectors can be added
assert not index.is_trained
index.train(dataArray)
assert index.is_trained
index.add(dataArray)

index.nprobe = 10  # number of cells visited per search (out of nlist)
D, I = index.search(dataArray[:5], k)
# neighbors of the 5 first queries
print(I[:5])

with open('IndexIVFFlat.pkl', 'wb') as f_Index:
    pickle.dump(index, f_Index, protocol=4)

# IndexIVFPQ
nlist = 100
m = 10  # number of sub-vectors; m must evenly divide the dimension d
k = 11
quantizer = faiss.IndexFlatL2(d)  # this remains the same
# To scale to very large datasets, Faiss provides variants that compress the
# stored vectors with lossy compression based on product quantization (PQ).
# The price is some loss of precision: even the distance from a vector to
# itself is no longer exactly 0.
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
# 8 specifies that each sub-vector is encoded as 8 bits
index.train(dataArray)
index.add(dataArray)

with open('IndexIVFPQ.pkl', 'wb') as f_Index:
    pickle.dump(index, f_Index, protocol=4)

index.nprobe = 10              # make comparable with experiment above
D, I = index.search(dataArray[:5], k)     # search
print(I[:5])

I changed the original author's code slightly. Running this code prints the following:

[[   0 220 201 120 116   24   77 974 147 303 346]
[   1 322 844    5 107 327 346 524 230   99 390]
[   2 831   61 857 438   75 1059 320 466 923 611]
[   3 747 1144 408 1354 1136 831 1162 494 942 968]
[   4 386 461 281 419 782 466 531 1083 1128 285]]
[[   0 220 201 120 116   24   77 974 147 303 346]
[   1 322 844    5 107 327 346 524 230   99 390]
[   2 831   61 857 438   75 1059 320 466 923 611]
[   3 747 1144 408 1354 1136 831 1162 494 942 968]
[   4 386 461 281 419 782 466 531 1083 1128 285]]
[[   0 974 120   77   24 584   95 201 147 220   98]
[   1 844 322 202 1143   98 531 234 689 309 629]
[   2 320 278 342 293 466 121 344 857 348 831]
[   3 408 1337 747 1240 921 1354 968 1562 288 493]
[   4 281 151 793 219 341 461 212 285 531 386]]

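A brief explanation: each row lists the ids of the vectors most similar to the query vector, and naturally the first result in every row is the query itself. Comparing the three groups, IndexIVFFlat (second group) returns exactly the same results as IndexFlatL2 (first group) in this run, while being much faster. IndexIVFPQ (third group) uses lossy compression, so its results differ from the first two groups, but its top results still share many ids with theirs.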
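To put a number on "share many ids", here is a minimal sketch that counts the overlap between the IndexFlatL2 rows and the IndexIVFPQ rows printed above. The rows are copied from that output; `overlap_per_query` is a hypothetical helper name, and rank order is ignored.

```python
import numpy as np

# Rows copied from the printed output above (k = 11, first 5 queries).
exact = np.array([
    [0, 220, 201, 120, 116, 24, 77, 974, 147, 303, 346],
    [1, 322, 844, 5, 107, 327, 346, 524, 230, 99, 390],
    [2, 831, 61, 857, 438, 75, 1059, 320, 466, 923, 611],
    [3, 747, 1144, 408, 1354, 1136, 831, 1162, 494, 942, 968],
    [4, 386, 461, 281, 419, 782, 466, 531, 1083, 1128, 285],
])
ivfpq = np.array([
    [0, 974, 120, 77, 24, 584, 95, 201, 147, 220, 98],
    [1, 844, 322, 202, 1143, 98, 531, 234, 689, 309, 629],
    [2, 320, 278, 342, 293, 466, 121, 344, 857, 348, 831],
    [3, 408, 1337, 747, 1240, 921, 1354, 968, 1562, 288, 493],
    [4, 281, 151, 793, 219, 341, 461, 212, 285, 531, 386],
])

def overlap_per_query(a, b):
    """Count the ids shared by each pair of rows, ignoring rank."""
    return [len(set(ra) & set(rb)) for ra, rb in zip(a.tolist(), b.tolist())]

shared = overlap_per_query(exact, ivfpq)
print(shared)                     # [8, 3, 5, 5, 6]
print(sum(shared), exact.size)    # 27 55
```

So with m = 10, about half of the returned ids (27 of 55, counting each query itself) agree with the exact search on these five queries.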

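In terms of size, storing 1,000,000 embeddings of dimension 50 takes 200 MB and 208 MB for the first two indexes and 18.1 MB for the last one, so IndexIVFPQ compresses very aggressively. As a rough estimate, storing 5,000,000 embeddings of dimension 512 would take about 10 GB uncompressed (5,000,000 × 512 × 4 bytes), while IndexIVFPQ would need roughly 900 MB if the same compression ratio held. The actual figure depends on the chosen m, because each vector is stored as m one-byte codes plus its id.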
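These sizes can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 4 bytes per float32 component, an 8-byte id per vector in the IVF indexes, and `nbits / 8` bytes per PQ code. It ignores the small fixed cost of the centroids and PQ codebooks, so treat it as an approximation rather than Faiss's exact accounting. The function names are made up for illustration.

```python
def flat_bytes(n, d):
    # IndexFlatL2: raw float32 vectors, 4 bytes per component.
    return n * d * 4

def ivfflat_bytes(n, d):
    # IndexIVFFlat: raw float32 vectors plus an assumed 8-byte id per vector.
    return n * (d * 4 + 8)

def ivfpq_bytes(n, m, nbits=8):
    # IndexIVFPQ: m codes of nbits each plus an assumed 8-byte id per vector.
    return n * (m * nbits // 8 + 8)

MB = 10**6

# The experiment in this post: 1,000,000 vectors, d = 50, m = 10.
print(flat_bytes(1_000_000, 50) // MB)     # 200
print(ivfflat_bytes(1_000_000, 50) // MB)  # 208
print(ivfpq_bytes(1_000_000, 10) // MB)    # 18

# The extrapolation: 5,000,000 vectors, d = 512, for a few values of m
# that divide 512 (these m values are illustrative choices).
print(flat_bytes(5_000_000, 512) // MB)    # 10240
for m in (32, 64, 128):
    print(m, ivfpq_bytes(5_000_000, m) // MB)
```

Under these assumptions the 5,000,000 × 512 case comes to 200 MB, 360 MB, and 680 MB for m = 32, 64, and 128, which is why the estimate for IndexIVFPQ depends so strongly on m.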

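This also makes sense from an information-compression point of view. The setting m = 10 (m must evenly divide the original dimension d) determines the precision of the final results. If m is changed to 25, the results look like this: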

[[   0 220 201 120 116   24   77 974 147 303 346]
[   1 322 844    5 107 327 346 524 230   99 390]
[   2 831   61 857 438   75 1059 320 466 923 611]
[   3 747 1144 408 1354 1136 831 1162 494 942 968]
[   4 386 461 281 419 782 466 531 1083 1128 285]]
[[   0 220 201 120 116   24   77 974 147 303 346]
[   1 322 844    5 107 327 346 524 230   99 390]
[   2 831   61 857 438   75 1059 320 466 923 611]
[   3 747 1144 408 1354 1136 831 1162 494 942 968]
[   4 386 461 281 419 782 466 531 1083 1128 285]]
[[   0 201 220 120   24 147   77 303 116 467 346]
[   1 322 327    5 346 524 510 390   99 107 333]
[   2 831   61   75 466 320 1059 857   23 611 1144]
[   3 747 1144 408 1136 1325 1354 831 1162 494   19]
[   4 281 386 461 419 1083 466 531 1128 889 782]]