Installing and Using Faiss, Facebook's Vector Search and Clustering Library, with Examples

Latest recommended article published 2024-06-26 14:40:16

fanzitao

Category: Machine Learning. Tags: faiss

Article link: https://blog.csdn.net/fanzitao/article/details/83308811

Installation

First download the relevant installation files (the blas and lapack RPM packages used in the steps below).

1. Install Fortran
yum install libgfortran
yum install gcc-gfortran

2. Install BLAS
rpm -ivh blas-3.2.1-5.el6.x86_64.rpm
rpm -ivh blas-devel-3.2.1-5.el6.x86_64.rpm

3. Install LAPACK
rpm -ivh lapack-3.2.1-5.el6.x86_64.rpm
rpm -ivh lapack-devel-3.2.1-5.el6.x86_64.rpm

5. Clone the code
git clone git@github.com:facebookresearch/faiss.git

6. Build and install
   ./configure
   make
   make install

7. Test
   make test

If the output ends like the following, the build succeeded:

test_IndexIVFPQ (test_index.TestSearchAndReconstruct) ... WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
Reconstruction error = 0.455
ok
test_IndexTransform (test_index.TestSearchAndReconstruct) ... WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
Reconstruction error = 3.241
ok
test_MultiIndex (test_index.TestSearchAndReconstruct) ... WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
Reconstruction error = 0.437
ok

----------------------------------------------------------------------
Ran 74 tests in 118.620s

OK

8. Install the Python wrapper

make py

Note: once that finishes, change into the faiss/python directory and run:

python -c "import faiss"

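If the import succeeds, remember to copy the faiss directory from the current directory into /usr/lib/python2.7/site-packages. The right target depends on which Python you use; with Anaconda, for example, copy it into that installation's own site-packages directory instead.

Then check that the installation worked: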

[root@aws ~]# python
Python 2.7.5 (default, Jul 13 2018, 13:06:57)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import faiss

Usage examples

IndexFlatL2

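IndexFlatL2 performs exact search: it applies no compression, measures distance with the Euclidean (L2) metric, and needs no training. It does not support adding vectors with custom IDs (add_with_ids), and because vectors are identified only by their insertion position, removing vectors with remove_ids shifts the positions of the vectors that remain.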

import numpy as np
import faiss

np.random.seed(100)
train_v = np.random.rand(1000, 128).astype('float32')
q_v = np.random.rand(10,128).astype('float32')

index = faiss.IndexFlatL2(128)  ## create the index
index.add(train_v)  ## add the data
D, I = index.search(q_v, 10)  ## search for the 10 nearest neighbors of each query
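
To make the exact search concrete, here is a minimal NumPy sketch of the same computation: squared L2 distances between every query and every database vector, followed by a top-k selection. It is only an illustration and does not use Faiss; the function name brute_force_l2 and the variable names are made up for this example.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Return (D, I): squared L2 distances and row indices of the k nearest rows of xb."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    d2 = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(axis=1)
    d2 = np.maximum(d2, 0.0)                    # clamp tiny negatives caused by rounding
    I = np.argsort(d2, axis=1)[:, :k]           # positions of the k smallest distances
    D = np.take_along_axis(d2, I, axis=1)       # the matching distances, ascending
    return D, I

rng = np.random.default_rng(100)
xb = rng.random((1000, 128), dtype=np.float32)  # database vectors
xq = xb[:5].copy()                              # queries known to be in the database
D, I = brute_force_l2(xb, xq, 10)
print(I[:, 0])  # each query finds itself first
```

Every query is compared with every stored vector, so the cost grows linearly with the size of the database.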

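The returned D holds squared L2 distances (smaller means more similar), and I holds the indices of the matching vectors, where an index is the position in which the vector was added. If you need your own IDs, for example image IDs, either maintain a mapping between your IDs and the index positions yourself at insertion time, or wrap the index with faiss.IndexIDMap.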

index = faiss.IndexFlatL2(128)
index = faiss.IndexIDMap(index)
index.add_with_ids(train_v, np.arange(1000, dtype='int64'))  ## IDs must be 64-bit integers
D, I = index.search(q_v, 10)

The returned I now contains the IDs that were supplied when the vectors were added.
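
The other option, keeping the mapping yourself, can be sketched in a few lines of NumPy. This is a hypothetical illustration: image_ids, positions_to_ids, and the hand-written I matrix are invented for the example, with I standing in for what a plain index returns (insertion positions, and -1 where fewer than k results were found).

```python
import numpy as np

# Hypothetical external IDs (e.g. image IDs), one per vector, in insertion order.
image_ids = np.array([9001, 9002, 9003, 9004, 9005], dtype=np.int64)

# Stand-in for the I matrix returned by index.search on a plain index.
I = np.array([[2, 0, 4],
              [1, 3, -1]], dtype=np.int64)

def positions_to_ids(I, ids):
    """Translate insertion positions to external IDs, keeping -1 for missing results."""
    out = np.full(I.shape, -1, dtype=ids.dtype)
    found = I >= 0
    out[found] = ids[I[found]]
    return out

result = positions_to_ids(I, image_ids)
print(result)
```

The mapping array must be extended in the same order every time vectors are added to the index.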

IndexIVFFlat

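To speed up search you can use IndexIVFFlat. Creating it requires a second index, the quantizer, and optionally a distance metric; if none is given, L2 distance is the default. You also specify nlist, the number of partitions (cells) the index is divided into.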

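dims = 128  ## must match the dimensionality of train_v and q_v
quantizer = faiss.IndexFlatL2(dims)
index = faiss.IndexIVFFlat(quantizer, dims, 16, faiss.METRIC_L2)  ## nlist = 16
index.train(train_v)
index_v = np.random.rand(5000, 128).astype('float32')
index.add_with_ids(index_v, np.arange(5000))
D, I = index.search(q_v, 10)
## Remove a vector ID that appeared in the previous result (3821 here), then search again
index.remove_ids(np.array([3821]))
D, I = index.search(q_v, 10)
## The removed ID no longer appears in the result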

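IndexIVFFlat also supports setting nprobe, the number of cells visited per query, which trades speed against accuracy. Its default value is 1. The larger nprobe is, the more accurate and the slower the search. IndexIVF has two basic components:

quantizer index: given a vector, the quantizer index returns the group (cell) that the vector belongs to
InvertedLists: maps a group id (one of the nlist groups) to a sequence of (code, id) pairs, where code is the stored, possibly compressed, encoding of a vector and id is that vector's ID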
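
The two components and the role of nprobe can be sketched in plain NumPy. This is an illustrative toy, not how Faiss implements it: the sizes, the short k-means loop, and the helper names sq_dists and ivf_search are assumptions made for the example. Because the sketch does no compression, each inverted list stores only ids and reads the raw vectors from xb.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nlist, k = 16, 8, 5
xb = rng.random((2000, d))   # database vectors
xq = rng.random((20, d))     # query vectors

def sq_dists(a, b):
    """Squared L2 distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

# "Training": a few k-means iterations give the nlist centroids of the quantizer.
centroids = xb[rng.choice(len(xb), nlist, replace=False)].copy()
for _ in range(10):
    assign = sq_dists(xb, centroids).argmin(axis=1)
    for c in range(nlist):
        members = xb[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# "Adding": the quantizer picks a cell, and the vector's id goes into that cell's list.
assign = sq_dists(xb, centroids).argmin(axis=1)
inverted_lists = [np.flatnonzero(assign == c) for c in range(nlist)]

def ivf_search(q, nprobe):
    """Scan only the nprobe cells whose centroids are closest to q."""
    cells = sq_dists(q[None, :], centroids)[0].argsort()[:nprobe]
    cand = np.concatenate([inverted_lists[c] for c in cells])
    dist = sq_dists(q[None, :], xb[cand])[0]
    return cand[dist.argsort()[:k]]

# Compare against exact search to see how nprobe affects recall.
exact = sq_dists(xq, xb).argsort(axis=1)[:, :k]
recalls = {}
for nprobe in (1, 2, nlist):
    hits = sum(len(np.intersect1d(ivf_search(q, nprobe), e)) for q, e in zip(xq, exact))
    recalls[nprobe] = hits / exact.size
    print(f"nprobe={nprobe}: recall@{k} = {recalls[nprobe]:.2f}")
```

With nprobe equal to nlist every cell is scanned, so the result matches exact search; smaller values scan fewer vectors and may miss some true neighbors.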

IndexIVFFlat does not compress the data. If memory usage matters to you (especially GPU memory), consider using PQ.

How PQ works

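PQ stands for product quantization. It is essentially an approximate search algorithm that achieves efficient vector retrieval through divide-and-conquer and data compression; when efficiency matters in a large search space, exact search is usually not used. First, the notion of quantization: vector quantization defines a quantizer (a mapping function q) that maps a D-dimensional vector to one of k centroids (k is usually a power of 2), so the vector can be stored as a centroid index of log2(k) bits instead of D floating-point values. Product quantization adds a divide-and-conquer step in front of the quantization. For example, if the original vectors have D=128 dimensions and we split them into m=4 groups, each sub-vector has 128/4=32 dimensions, and within each 32-dimensional group a mapping function q is learned with the k-means algorithm. Splitting first and compressing second is what speeds up search.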
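
The split-then-quantize idea can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the Faiss implementation: it uses D=128 and m=4 from the text, but only 16 centroids per sub-space (4 bits) to keep it fast, and the kmeans, pq_encode, and pq_decode helpers are written just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, nbits = 128, 4, 4            # 4 sub-vectors of 32 dims, 2**4 = 16 centroids each
k, dsub = 2 ** nbits, D // m
x = rng.random((2000, D)).astype(np.float32)

def kmeans(data, k, iters=10):
    """Plain Lloyd's k-means; returns the k centroids."""
    cent = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((data[:, None, :] - cent[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        for c in range(k):
            if np.any(lab == c):
                cent[c] = data[lab == c].mean(axis=0)
    return cent

# One codebook per sub-space: shape (m, k, dsub).
codebooks = np.stack([kmeans(x[:, j * dsub:(j + 1) * dsub], k) for j in range(m)])

def pq_encode(v):
    """Replace each sub-vector by the index of its nearest centroid."""
    codes = np.empty((len(v), m), dtype=np.uint8)
    for j in range(m):
        sub = v[:, j * dsub:(j + 1) * dsub]
        d2 = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(axis=2)
        codes[:, j] = d2.argmin(axis=1)
    return codes

def pq_decode(codes):
    """Rebuild an approximation by concatenating the selected centroids."""
    return np.hstack([codebooks[j][codes[:, j]] for j in range(m)])

codes = pq_encode(x)
approx = pq_decode(codes)
err = float(((x - approx) ** 2).sum(axis=1).mean())
print(codes.shape, codes.dtype)
print("mean squared reconstruction error:", err)
```

Each vector is stored as m small integers instead of D floats, and the reconstruction error is the price paid for that compression.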

IndexIVFPQ

IndexIVFPQ compresses the original data, so it provides approximate rather than exact search. Uniformly distributed data is hard to compress.

dims = 128
quantizer = faiss.IndexFlatL2(dims)
index = faiss.IndexIVFPQ(quantizer, dims, 16, 8, 8)  ## nlist=16, 8 segments, 8 bits per code
index.train(train_v)

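The first, second, and third parameters are the same as described earlier. The first 8 is the number of segments each vector is split into, that is, the m from the PQ section above. The second 8 is the number of bits used for each segment's code, so 8 means 256 cluster centroids per segment. Training on the 1000-vector train_v prints the warning below once per segment, because the k-means step in Faiss expects at least 39 training points per centroid, and 256 × 39 = 9984: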

WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
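
The numbers behind the two 8s in the IndexIVFPQ call and behind the 9984 in the warnings come from simple arithmetic, worked through below. The value 39 is the default minimum number of training points per centroid in Faiss's k-means; the variable names are chosen only for this illustration.

```python
dims, m, nbits = 128, 8, 8                      # values from the IndexIVFPQ call above

centroids_per_segment = 2 ** nbits              # 8 bits address 256 centroids
raw_bytes = dims * 4                            # one float32 vector takes 512 bytes
code_bytes = m * nbits // 8                     # one PQ code takes 8 bytes
min_points_per_centroid = 39                    # Faiss k-means default
min_training_points = centroids_per_segment * min_points_per_centroid

print(centroids_per_segment, raw_bytes, code_bytes, min_training_points)
# → 256 512 8 9984
```

The PQ code is 64 times smaller than the raw vector, and a training set of at least 9984 vectors avoids the warning.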

train_v = np.random.rand(10000, 128).astype('float32')  ## 10000 >= 9984, so the warning goes away
index.train(train_v)
index.add_with_ids(index_v, np.arange(5000))
## index.nprobe = 10  ## nprobe can be set here as well
D, I = index.search(q_v, 10)

Factory method

index = faiss.index_factory(16, "Flat", faiss.METRIC_L2)
index = faiss.index_factory(16, "Flat", faiss.METRIC_INNER_PRODUCT)
index = faiss.index_factory(16, "IVF100,Flat")
index = faiss.index_factory(128, "IVF100,PQ8")

Using the GPU

res = faiss.StandardGpuResources()  ## allocate GPU resources
dims = 1024
quantizer = faiss.IndexFlatL2(dims)
# index = faiss.IndexIVFFlat(quantizer, dims, 10)
index = faiss.IndexIVFPQ(quantizer, dims, 128, 8, 8)
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)  ## move the index to the GPU, using GPU 0
