Using Faiss

Latest recommended article published on 2025-04-23 17:48:22

阿杜依诺

Category: ANNS. Tags: faiss

Article link: https://blog.csdn.net/qq_40250862/article/details/97023376

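Faiss index serialization

For Faiss index serialization, see the reference issue and code. For the discussion about writing Faiss indexes to Redis, see "Using faiss in distributed mode". Faiss also provides official index serialization code.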

Tips

See the note on the relationship between the vector inner product and Euclidean distance.
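The relationship can be checked numerically. The sketch below is not taken from the linked note; it only verifies the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y> on random data, and shows that for unit-length vectors the nearest neighbour by L2 distance is the same as the one with the largest inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 16))     # database vectors (float64 for a clean check)
xq = rng.standard_normal((5, 16))        # query vectors

# squared L2 distance through the inner-product expansion
ip = xq @ xb.T
l2_expanded = (xq ** 2).sum(1)[:, None] + (xb ** 2).sum(1)[None, :] - 2 * ip

# squared L2 distance computed directly from the differences
l2_direct = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1)
print(np.allclose(l2_expanded, l2_direct))

# after normalisation ||x - y||^2 = 2 - 2<x, y>, so both rankings agree
xb_n = xb / np.linalg.norm(xb, axis=1, keepdims=True)
xq_n = xq / np.linalg.norm(xq, axis=1, keepdims=True)
nn_by_ip = (xq_n @ xb_n.T).argmax(axis=1)
nn_by_l2 = ((xq_n[:, None, :] - xb_n[None, :, :]) ** 2).sum(-1).argmin(axis=1)
print(np.array_equal(nn_by_ip, nn_by_l2))
```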

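Exact Search for L2 # exact search based on L2 distance
Exact Search for Inner Product # exact search based on inner product
Hierarchical Navigable Small World graph exploration # hierarchical graph index
Inverted file with exact post-verification # inverted file index
Locality-Sensitive Hashing (binary flat index) # locality-sensitive hashing
Scalar quantizer (SQ) in flat mode # scalar quantization index
Product quantizer (PQ) in flat mode # product quantization index
IVF and scalar quantizer # inverted file + scalar quantization index
IVFADC (coarse quantizer+PQ on residuals) # inverted file + product quantization index
IVFADC+R (same as IVFADC with re-ranking based on codes) # inverted file + product quantization index, with re-ranking based on codes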

Faiss tests on the GIST dataset:

brute-force L2 distance search
Index built with IndexFlatL2
Query time: 5.769343376159668 s
recall: 0.99988


IndexIVFFlat:
(nprobe=1)
Build time: 3.3367772102355957 s
Query time: 1.0274395942687988 s
recall: 0.36356

(nprobe=10)
Build time: 3.139906167984009 s
Query time: 10.31830358505249 s
recall: 0.92417

(nlist=300, nprobe=10)
Build time: 4.376635789871216 s
Query time: 3.0249266624450684 s
recall: 0.7844199999999999

(nlist=400, nprobe=10)
Build time: 6.2013819217681885 s
Query time: 2.9145090579986572 s
recall: 0.74737

(nlist=400, nprobe=25)
Build time: 5.599908351898193 s
Query time: 8.83949089050293 s
recall: 0.91526

(nlist=400, nprobe=19)
Build time: 5.635021448135376 s
Query time: 6.352936029434204 s
recall: 0.8747499999999999
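The post reports recall values without showing how they were computed. One plausible way, sketched below, is recall@k: the fraction of the true k nearest neighbours that appear in the returned top k, averaged over all queries. The function name recall_at_k is hypothetical, and the assumption that the ground truth comes from a `neighbors` dataset in the HDF5 file is mine, not the author's.

```python
import numpy as np

def recall_at_k(found_ids, true_ids, k):
    """Average fraction of the true top-k ids that appear in the returned top-k ids."""
    hits = 0
    for found_row, true_row in zip(found_ids, true_ids):
        # -1 entries (empty result slots) never match a true id
        hits += len(set(found_row[:k].tolist()) & set(true_row[:k].tolist()))
    return hits / (len(true_ids) * k)

# tiny hand-made example: 2 queries, k = 3
I_true = np.array([[0, 1, 2], [3, 4, 5]])
I_found = np.array([[0, 2, 9], [5, -1, -1]])
print(recall_at_k(I_found, I_true, 3))   # 3 hits out of 6 -> 0.5
```

With the search code in this post, the call would look like recall_at_k(I, f['neighbors'][:], 100), assuming the file provides such a dataset.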

IndexFlatL2

IndexFlatL2 does not support custom ids; the order in which vectors are added is used as the id.

import time
import h5py
import faiss

f=h5py.File('gist-960-euclidean.hdf5','r')
x=f['train'][:]   # base vectors
q=f['test'][:]    # query vectors

index = faiss.IndexFlatL2(x.shape[1])   # build the index
print(index.is_trained)
index.add(x)                  # add vectors to the index
print(index.ntotal)

k = 100                       # number of nearest neighbours to return
start = time.time()
D, I = index.search(q, k)     # actual search
print('Query time: {0} s'.format((time.time() - start)))

Incremental add test: adding 100 vectors at a time (100,000 vectors in total), the query recall is 0.8615225. The code is as follows:

print('add 100 every time on gpu')
res = faiss.StandardGpuResources()

start = time.time()
index = faiss.IndexFlatL2(x.shape[1])
index_gpu=faiss.index_cpu_to_gpu(res,7,index)   # second argument is the GPU device id
print('Build time: {0} s'.format((time.time() - start)))

# start = time.time()
# index_gpu.train(x[:70000])
# print('Train time: {0} s'.format((time.time() - start)))

start = time.time()
for i in range(0,1000):
    one = time.time()
    index_gpu.add(x[100*i:(i+1)*100])   # add the next batch of 100 vectors
    print('One add time: {0} s'.format((time.time() - one)))
print('Total add time: {0} s'.format((time.time() - start)))
print(index_gpu.ntotal)

start = time.time()
cpu_index=faiss.index_gpu_to_cpu(index_gpu)   # a GPU index must be copied to the CPU before saving
print('gpu2cpu time: {0} s'.format((time.time() - start)))
faiss.write_index(cpu_index,'faiss-log/flat/flat-gpu2cpu.index')


--------------------------------------------------------------------------------

print('test gpu')
res = faiss.StandardGpuResources()
start = time.time()
cpu_index=faiss.read_index('faiss-log/flat/flat-gpu2cpu.index')
gpu_index=faiss.index_cpu_to_gpu(res, 7, cpu_index)
print('Load and to-GPU time: {0} s'.format((time.time() - start)))
#gpu_index.nprobe=100
k = 100
start = time.time()
D, I = gpu_index.search(q, k)  # actual search
print('Query time: {0} s'.format((time.time() - start)))

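IndexFlatL2 does not support custom ids, but they can be added by wrapping the index in IndexIDMap or IndexIDMap2. Only IndexIDMap2 can read vectors back out of the index. Once the index has been converted to a GPU index, vectors can no longer be read back from it. (A second approach, CPU index -> GPU index -> IDMap2, does read vectors back correctly, but the index then has to be saved as a GPU index, since saving it as a CPU index raises an error, and the problem appears when the saved index is read back.) The code is as follows: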

import numpy as np

print('add with id test')

start = time.time()
index=faiss.IndexFlatL2(x.shape[1])
index_id=faiss.IndexIDMap2(index)
# one id per vector: ids 200001..300000 for the first 100,000 vectors
index_id.add_with_ids(x[:100000],np.arange(1,100001)+200000)
print('Build time: {0} s'.format((time.time() - start)))
faiss.write_index(index_id,'faiss-log/add/id.index')

start = time.time()
index_id=faiss.read_index('faiss-log/add/id.index')
print('Load time: {0} s'.format((time.time() - start)))

# reconstruct takes a custom id; 200001 is the smallest id added above
print(index_id.reconstruct(200001))

Brute-force search

# Written against the heap-based knn_L2sqr / knn_inner_product signatures of the
# Faiss version used here; recent releases also provide faiss.knn(xq, xb, k).
def search_knn(xq, xb, k, distance_type=faiss.METRIC_L2):
    """ wrapper around the faiss knn functions without index """
    nq, d = xq.shape
    nb, d2 = xb.shape
    assert d == d2

    I = np.empty((nq, k), dtype='int64')    # neighbour ids, one row per query
    D = np.empty((nq, k), dtype='float32')  # matching distances or inner products

    if distance_type == faiss.METRIC_L2:
        # the smallest distances are kept, so a max-heap holds the current top k
        heaps = faiss.float_maxheap_array_t()
        heaps.k = k
        heaps.nh = nq
        heaps.val = faiss.swig_ptr(D)
        heaps.ids = faiss.swig_ptr(I)
        faiss.knn_L2sqr(
            faiss.swig_ptr(xq), faiss.swig_ptr(xb),
            d, nq, nb, heaps
        )
    elif distance_type == faiss.METRIC_INNER_PRODUCT:
        # the largest inner products are kept, so a min-heap holds the current top k
        heaps = faiss.float_minheap_array_t()
        heaps.k = k
        heaps.nh = nq
        heaps.val = faiss.swig_ptr(D)
        heaps.ids = faiss.swig_ptr(I)
        faiss.knn_inner_product(
            faiss.swig_ptr(xq), faiss.swig_ptr(xb),
            d, nq, nb, heaps
        )
    return D, I

start = time.time()
D, I = search_knn(q, x, 100)
print('BF time: {0} s'.format((time.time() - start)))

BF time: 8.494398593902588 s
recall: 0.99988

IndexIVFFlat test

f=h5py.File('gist-960-euclidean.hdf5','r')
x=f['train'][:]
q=f['test'][:]

nlist = 400   # number of inverted lists (clusters)

quantizer = faiss.IndexFlatL2(x.shape[1])  # coarse quantizer that assigns vectors to inverted lists
index = faiss.IndexIVFFlat(quantizer, x.shape[1], nlist)
assert not index.is_trained
start = time.time()
index.train(x)
print('Build time: {0} s'.format((time.time() - start)))
assert index.is_trained
index.add(x)
index.nprobe=19   # number of inverted lists visited per query

k = 100
start = time.time()
D, I = index.search(q, k)     # actual search
print('Query time: {0} s'.format((time.time() - start)))

To test IVF on the GPU and keep the index, first convert gpu_index to cpu_index and save it with write_index; then load it with read_index, convert cpu_index back to gpu_index, and run the test. For parameter settings see the reference. The code is as follows:

print('ivf on gpu')
res = faiss.StandardGpuResources()
nlist = 400
start = time.time()

quantizer = faiss.IndexFlatL2(960)
index_ivf = faiss.IndexIVFFlat(quantizer, 960, nlist, faiss.METRIC_L2)

#index_ivf=faiss.read_index('faiss-log/ivf.index')
gpu_index_ivf = faiss.index_cpu_to_gpu(res, 7, index_ivf)   # second argument is the GPU device id
print('GPU build time: {0} s'.format((time.time() - start)))
assert not gpu_index_ivf.is_trained
gpu_index_ivf.train(x)
assert gpu_index_ivf.is_trained

gpu_index_ivf.add(x)
# plain attribute assignment (gpu_index_ivf.nprobe=25) had no effect for the author,
# as noted further below, so nprobe is set through GpuParameterSpace
faiss.GpuParameterSpace().set_index_parameter(gpu_index_ivf, "nprobe", 25)
k = 100
start = time.time()
D, I = gpu_index_ivf.search(q, k)  # actual search
print('Query time: {0} s'.format((time.time() - start)))
cpu_index=faiss.index_gpu_to_cpu(gpu_index_ivf)
faiss.write_index(cpu_index,'faiss-log/ivf-gpu2cpu.index')

#-----------------------------------------------------------------
print('test gpu')
res = faiss.StandardGpuResources()
start = time.time()
cpu_index=faiss.read_index('faiss-log/ivf-gpu2cpu.index')
gpu_index=faiss.index_cpu_to_gpu(res, 7, cpu_index)
faiss.GpuParameterSpace().set_index_parameter(gpu_index, "nprobe", 25)
k = 100
start = time.time()
D, I = gpu_index.search(q, k)  # actual search
print('Query time: {0} s'.format((time.time() - start)))

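Vectors can be added directly to a GPU index; adding 100 vectors takes about 0.1011 s. Vectors can only be removed on a CPU index: the gpu2cpu conversion took 4.900115013122559 s and removing 10 vectors took 0.003612041473388672 s. When adding 100 vectors at a time to the IVF index, up to 100,000 vectors, the initial training set must contain at least 400 vectors (nlist = 400) before train can run; otherwise an error is raised.

add 100 every time on gpu
Build time: 13.18855881690979 s
WARNING clustering 400 points to 400 centroids: please provide at least 15600 training points
Train time: 0.006032705307006836 s

Total add time: 5.3617494106292725 s
100000
gpu2cpu time: 0.6505541801452637 s

-----------------------------------------------
Load and to-GPU time: 15.247209072113037 s
Query time: 0.3533155918121338 s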

When the returned results contain -1, the official advice is to increase nprobe, ~~but that did not help, possibly because I trained on too few vectors. With nlist=100, at least 12,000 training vectors were needed to avoid -1; with nlist=200, -1 still appeared even with 70,000 training vectors (I gave up)~~. (In fact nprobe had not been set correctly; the correct way to set it is shown in the code.) Code for the incremental adds:

print('add 100 every time on gpu')
res = faiss.StandardGpuResources()
nlist = 200

start = time.time()
quantizer = faiss.IndexFlatL2(x.shape[1])
index_ivf = faiss.IndexIVFFlat(quantizer, 960, nlist, faiss.METRIC_L2)
index_gpu=faiss.index_cpu_to_gpu(res,7,index_ivf)
print('Build time: {0} s'.format((time.time() - start)))

start = time.time()
index_gpu.train(x[:70000])
print('Train time: {0} s'.format((time.time() - start)))

start = time.time()
for i in range(0,1000):
    one = time.time()
    index_gpu.add(x[100*i:(i+1)*100])
    print('One add time: {0} s'.format((time.time() - one)))
print('Total add time: {0} s'.format((time.time() - start)))
print(index_gpu.ntotal)

start = time.time()
cpu_index=faiss.index_gpu_to_cpu(index_gpu)
print('gpu2cpu time: {0} s'.format((time.time() - start)))
faiss.write_index(cpu_index,'faiss-log/ivf2/ivf-gpu2cpu.index')


print('test gpu')
res = faiss.StandardGpuResources()
start = time.time()
cpu_index=faiss.read_index('faiss-log/ivf2/ivf-gpu2cpu.index')
gpu_index=faiss.index_cpu_to_gpu(res, 7, cpu_index)
print('Load and to-GPU time: {0} s'.format((time.time() - start)))
####gpu_index.nprobe=100  ### setting it this way had no effect
faiss.GpuParameterSpace().set_index_parameter(gpu_index, "nprobe", 25) ## the correct way to set it
k = 100
start = time.time()
D, I = gpu_index.search(q, k)  # actual search
print('Query time: {0} s'.format((time.time() - start)))


for out_num in range(q.shape[0]):
    for col_num in range(100):
        if col_num == 99:
            print('{},'.format(I[out_num][col_num]))
        else:
            print('{},'.format(I[out_num][col_num]),end='')

(nlist=400, nprobe=25, x[:15600])
Load and to-GPU time: 15.79077959060669 s
Query time: 0.07286548614501953 s
recall: 0.91232
---------------------------------------
(nlist=100, nprobe=7, x[:3900])
Load and to-GPU time: 15.267820835113525 s
Query time: 0.07983207702636719 s
recall: 0.86322
---------------------------------------
(nlist=100, nprobe=9, x[:3900])
Load and to-GPU time: 15.779078483581543 s
Query time: 0.09999513626098633 s
recall: 0.89282
---------------------------------------
(nlist=100, nprobe=10, x[:3900])
Load and to-GPU time: 15.974253416061401 s
Query time: 0.10689020156860352 s
recall: 0.90302

--------------------------------------
Train on 15,600 points first, then add 100,000 points:
Query time: 0.06573295593261719 s
recall: 0.9095

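Usage tip: train in IVF is really a clustering step. I used to always run train and add together, but they can be separated: train on a small subset of the data, save the index, then load it and add the data. The result is the same as before. The GPU IVF index does not seem to support reconstruct.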

IVFSQ

(nlist = 400, faiss.ScalarQuantizer.QT_8bit, nprobe=25)
ivfsq-8bit------50.38MB
Train time: 0.30237245559692383 s
Total add time: 0.9383833408355713 s
100000
Query time: 0.43352580070495605 s
recall: 0.9550000000000001
----------------------------------------------
(nlist = 400, faiss.ScalarQuantizer.QT_4bit, nprobe=25)
ivfsq-4bit------25.97MB
Train time: 0.26999449729919434 s
Total add time: 1.1254122257232666 s
100000
Query time: 0.48221397399902344 s
recall: 0.9399
---------------------------------------------
(nlist = 400, faiss.ScalarQuantizer.QT_fp16, nprobe=25)
ivfsq-fp16-----99.20MB
Train time: 0.649604320526123 s
Total add time: 6.78596043586731 s
100000
Query time: 0.39142894744873047 s
recall: 0.9570000000000001

PQ baseline test:

train
PQ training on 1000000 points, remains 0 points: training polysemous on centroids
Build time: 30.735077142715454 s
add vectors to index
PQ baseline PQ baseline query time: 3.8650856018066406 s
recall: 0.20003


train
PQ training on 1000000 points, remains 0 points: training polysemous on centroids
Build time: 60.833696126937866 s
add vectors to index
PQ baseline PQ baseline query time: 13.431186199188232 s
recall: 0.42081

Polysemous test:

f=h5py.File('gist-960-euclidean.hdf5','r')
x=f['train'][:]
q=f['test'][:]

index = faiss.IndexPQ(960, 64, 8)     # 64 sub-quantizers, 8 bits each
index.do_polysemous_training = True   # also optimise the codes for Hamming filtering
index.verbose = True

print("train")
start = time.time()
index.train(x)
print('Build time: {0} s'.format((time.time() - start)))
print("add vectors to index")

index.add(x)


faiss.omp_set_num_threads(1)

k=100
print("Polysemous", end=' ')
index.search_type = faiss.IndexPQ.ST_polysemous
index.polysemous_ht = 54              # Hamming distance threshold for the filter
start = time.time()
D, I = index.search(q, k)
print('Polysemous query time: {0} s'.format((time.time() - start)))

HNSW test:

f=h5py.File('gist-960-euclidean.hdf5','r')
x=f['train'][:]
q=f['test'][:]

print("Testing HNSW Flat")
index = faiss.IndexHNSWFlat(960, 32)
index.hnsw.efConstruction = 650   # candidate list size while building the graph
print("add")

index.verbose = True
index.add(x)
print("search")
index.hnsw.search_bounded_queue =True
index.hnsw.efSearch = 256         # candidate list size at query time

k = 100
start = time.time()
D, I = index.search(q, k)
print('Query time: {0} s'.format((time.time() - start)))

(efConstruction = 40)
Done in 444616.346 ms
search
Query time: 0.14303946495056152 s
recall: 0.5691299999999999

(efConstruction = 85)
Done in 170936.471 ms
search
Query time: 0.09780073165893555 s
recall: 0.48455

(efConstruction = 100, 64)
Done in 200906.928 ms
search
Query time: 0.1662428379058838 s
recall: 0.6487999999999999

(efConstruction = 170, 64)
Done in 348107.059 ms
search
Query time: 0.1822066307067871 s
recall: 0.69226

(efConstruction = 300, 64)
Done in 527014.629 ms
search
Query time: 0.1815934181213379 s
recall: 0.73158

(efConstruction = 550, 128)
Done in 925379.767 ms
search
Query time: 0.37515878677368164 s
recall: 0.87788

Done in 1091919.391 ms
search
Query time: 0.7411410808563232 s
recall: 0.9532200000000001

HNSW SQ test:

f=h5py.File('gist-960-euclidean.hdf5','r')
x=f['train'][:]
q=f['test'][:]
print("Testing HNSW with a scalar quantizer")
# vectors are stored with an 8-bit scalar quantizer; M = 16 is the graph connectivity parameter
index = faiss.IndexHNSWSQ(960, faiss.ScalarQuantizer.QT_8bit, 16)

print("training")
start = time.time()
index.train(x)
print('Train time: {0} s'.format((time.time() - start)))

index.hnsw.efConstruction = 650

print("add")
# to see progress
index.verbose = True
index.add(x)
index.hnsw.efSearch = 256
faiss.write_index(index,'gist-960/index650-256.index')
#index=faiss.read_index('gist-960/index650-256.index')

k = 100
start = time.time()
D, I = index.search(q, k)
print('Query time: {0} s'.format((time.time() - start)))

(efConstruction = 600, efSearch = 256)
Done in 716309.536 ms
Query time: 0.3983769416809082 s
recall: 0.8989199999999999

(efConstruction = 650, efSearch = 256)
training
Train time: 0.5952577590942383 s
Done in 767744.459 ms
Query time: 0.4028780460357666 s
recall: 0.90024


(efConstruction = 650, efSearch = 256) (add time differs considerably from the run above)
Done in 548369.012 ms
Add time: 549.464108467102 s
Query time: 0.40467095375061035 s
recall: 0.90011

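HNSW is static and does not support deletion. The timing gap comes from the server being slower than my own computer. Adding to HNSW is expensive because the graph is built during add: each new vector is linked into the existing graph rather than simply appended.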

IVFPQ:

Add time: 7.222640037536621 s
Query time: 0.09519600868225098 s
recall: 0.152


Train time: 7.732855558395386 s
Add time: 8.108854532241821 s
Query time: 0.08902311325073242 s
recall: 0.16216999999999998


Train time: 7.591001033782959 s
Add time: 7.676396369934082 s
Query time: 0.13837480545043945 s
recall: 0.16231


Train time: 7.725762605667114 s
Add time: 8.161769390106201 s
Query time: 0.28371191024780273 s
recall: 0.16231

Train time: 7.453045845031738 s
Add time: 8.917322874069214 s
Query time: 0.3870260715484619 s
recall: 0.16231

Train time: 6.147617816925049 s
Add time: 7.441307067871094 s
Query time: 0.5570433139801025 s
recall: 0.15222


Train time: 18.975913286209106 s
Add time: 11.127276182174683 s
Query time: 8.516773223876953 s
recall: 0.59497


Train time: 21.573126077651978 s
Add time: 10.64961552619934 s
Query time: 5.987632989883423 s
recall: 0.6007

Train time: 21.019615173339844 s
Add time: 12.503971338272095 s
Query time: 7.924386262893677 s
recall: 0.6007

Train time: 22.81511902809143 s
Add time: 12.821179151535034 s
Query time: 6.22978949546814 s
recall: 0.6038

Train time: 22.93665909767151 s
Add time: 12.383829593658447 s
Query time: 7.6594767570495605 s
recall: 0.6038

n_bits must be 8, 12 or 16, and d must be a multiple of m. m is a power of two between 4 and 64. IVFPQ is suited to large-scale search.