Using cosine similarity for Faiss similarity search
flyfish
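Faiss provides faiss.METRIC_INNER_PRODUCT and faiss.METRIC_L2.
To get cosine similarity, we only need to add normalize_L2 to our code, because the inner product of L2-normalized vectors is their cosine similarity.
When choosing the parameters for IndexIVFFlat, use faiss.METRIC_INNER_PRODUCT.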
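The idea can be checked without Faiss at all. The sketch below is a NumPy-only emulation, not Faiss code: the helper `l2_normalize_rows` is a hypothetical stand-in for what `normalize_L2` does (scaling each row to unit length; the Faiss function does this in place), and the plain matrix product plays the role of the inner-product metric. The array sizes are arbitrary.

```python
import numpy as np

def l2_normalize_rows(m):
    """Return a copy of m with every row scaled to unit L2 length.

    Hypothetical stand-in for faiss.normalize_L2, which works in place.
    """
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / norms

rng = np.random.default_rng(1234)
xb = rng.random((5, 8)).astype('float32')   # 5 "database" vectors
xq = rng.random((3, 8)).astype('float32')   # 3 "query" vectors

# Cosine similarity from the definition: dot product over product of norms.
cos_direct = (xq @ xb.T) / (
    np.linalg.norm(xq, axis=1, keepdims=True)
    * np.linalg.norm(xb, axis=1, keepdims=True).T
)

# Inner product of the normalized vectors.
cos_via_ip = l2_normalize_rows(xq) @ l2_normalize_rows(xb).T

# Both routes should agree up to float32 rounding.
print(np.allclose(cos_direct, cos_via_ip, atol=1e-6))
```

Each entry `[i, j]` of either matrix is the similarity between query `i` and database vector `j`, which is the quantity an inner-product index ranks by once the vectors are normalized.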
To verify correctness, we first compute the cosine similarity with other methods.
1 Implementation with numpy
def cosine_similarity_custom1(x, y):
    x_y = np.dot(x, y.transpose())
    x_norm = np.sqrt(np.multiply(x, x).sum(axis=1))
    x_norm = x_norm[:, np.newaxis]
    y_norm = np.sqrt(np.multiply(y, y).sum(axis=1))
    y_norm = y_norm[:, np.newaxis]
    result = np.divide(x_y, np.dot(x_norm, y_norm.transpose()))
    return result
2 Implementation with numpy's built-in norm function
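def cosine_similarity_custom2(x, y):
    num = x.dot(y.T)
    # per-row L2 norms; their outer product gives the pairwise denominators
    denom = np.linalg.norm(x, axis=1, keepdims=True) * np.linalg.norm(y, axis=1, keepdims=True).T
    return num / denom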
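One detail worth checking when the inputs have more than one row: called without `axis`, `np.linalg.norm` on a 2-D array returns a single Frobenius norm for the whole matrix, whereas cosine similarity needs one norm per row. A small sketch of the difference follows; the helper name `cosine_similarity_rows` is made up for illustration.

```python
import numpy as np

x = np.array([[3.0, 4.0],
              [5.0, 12.0]])

# Without an axis, norm() of a 2-D array is the Frobenius norm of the whole matrix.
frobenius = np.linalg.norm(x)                          # sqrt(9 + 16 + 25 + 144)

# With axis=1 we get one L2 norm per row, which is what cosine similarity needs.
row_norms = np.linalg.norm(x, axis=1, keepdims=True)   # shape (2, 1)

def cosine_similarity_rows(x, y):
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    y_norm = np.linalg.norm(y, axis=1, keepdims=True)
    return x.dot(y.T) / (x_norm * y_norm.T)

sim = cosine_similarity_rows(x, x)
print(frobenius)
print(row_norms.ravel())
print(np.round(sim, 3))
```

With a single row per input, as in the test script in this post, the Frobenius norm and the row norm coincide, so the distinction only shows up for larger batches.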
3 Using sklearn's built-in cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
The complete code is as follows
import json
import numpy as np
import datetime
import time
import mkl
from sklearn.metrics.pairwise import cosine_similarity
from faiss import normalize_L2
import faiss
def cosine_similarity_custom1(x, y):
    x_y = np.dot(x, y.transpose())
    x_norm = np.sqrt(np.multiply(x, x).sum(axis=1))
    x_norm = x_norm[:, np.newaxis]
    y_norm = np.sqrt(np.multiply(y, y).sum(axis=1))
    y_norm = y_norm[:, np.newaxis]
    result = np.divide(x_y, np.dot(x_norm, y_norm.transpose()))
    return result
def cosine_similarity_custom2(x, y):
    num = x.dot(y.T)
    # per-row L2 norms; their outer product gives the pairwise denominators
    denom = np.linalg.norm(x, axis=1, keepdims=True) * np.linalg.norm(y, axis=1, keepdims=True).T
    return num / denom
t0 = time.time()
# generate the dataset
d = 128  # dimension
nb = 1  # database size (also used as the number of queries)
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nb, d)).astype('float32')
# print(xb)
# print(xq)
result_cosine0 = cosine_similarity(xb,xq)
result_cosine1 = cosine_similarity_custom1(xb,xq)
result_cosine2 = cosine_similarity_custom2(xb,xq)
print('result_cosine0:\n',result_cosine0)
print('result_cosine1:\n',result_cosine1)
print('result_cosine2:\n',result_cosine2)
normalize_L2(xb)
normalize_L2(xq)
nlist = 1  # number of inverted-list cells
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer that assigns vectors to the cells
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
print(index.is_trained)
index.train(xb)
print(index.is_trained)
index.add(xb) # add vectors to the index
print(index.ntotal)
k = 1
D, I = index.search(xq, k) # actual search
print(I)
print(D)
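In the output, the three cosine similarity results and the inner product D returned by Faiss agree in the first 6 decimal places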
[[0.74745756]]
[[0.7474575]]
[[0.74745756]]
[[0.74745744]]
Here is another example
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
a=np.array([[3,4]])
b=np.array([[5,12]])
print(cosine_similarity(a,b))
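How the cosine similarity is computed
The norm (modulus) of a vector is simply its length
a=[3,4]  # the norm of a is 5
b=[5,12]  # the norm of b is 13
Method 1
inner product of the vectors / product of the vectors' norms = cosine similarity
(3 × 5 + 4 × 12) / (5 × 13) = 0.969
Method 2
normalize_L2 = each component of the vector / the vector's norm
a=[3/5,4/5]
b=[5/13,12/13]
normalize_L2 -> inner product of the vectors -> cosine similarity
(3/5) × (5/13) + (4/5) × (12/13) = 0.969
Looking at the formulas, Method 2 is just Method 1 with every term put over the common denominator 5 × 13, so the two expressions are identical.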
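Written out in general form for two non-zero vectors a and b, and then for the numbers used in this example, the equivalence looks like this (standard notation, with θ the angle between the vectors):

```latex
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{a}{\lVert a \rVert} \cdot \frac{b}{\lVert b \rVert}

\frac{3 \cdot 5 + 4 \cdot 12}{5 \cdot 13}
  = \frac{3}{5} \cdot \frac{5}{13} + \frac{4}{5} \cdot \frac{12}{13}
  = \frac{63}{65}
  \approx 0.969
```

The middle step only moves the scalar factors 1/‖a‖ and 1/‖b‖ inside the dot product, which is why normalizing first and then taking the inner product gives the same number.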