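Cosine similarity comparison with FAISS in langchain
By default, FAISS compares vectors by Euclidean (L2) distance, whereas most embedding models are meant to be compared by cosine similarity. So langchain needs a small change.
To make FAISS compare by cosine similarity, change the following:
- First, switch the faiss index to inner product. The file to edit is: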
~/miniconda3/envs/test/lib/python3.8/site-packages/langchain/vectorstores/faiss.py
Change
index = faiss.IndexFlatL2(len(embeddings[0]))
to
index = faiss.IndexFlatIP(len(embeddings[0]))
- Second, normalize the embeddings, so that the inner product of two vectors equals their cosine similarity:
For example:
from langchain.schema import Document
from langchain.vectorstores import Chroma,FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import numpy as np
embeddings = HuggingFaceEmbeddings(
    model_name="models/text2vec-large-chinese",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True},
)
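Why the two steps go together can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names (l2_normalize, inner_product, cosine_similarity) and the two sample vectors are made up for illustration and are not part of langchain or faiss. It shows that the raw inner product differs from the cosine similarity, while the inner product of L2-normalized vectors matches it.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper, for illustration only)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product, the quantity an inner-product index scores by."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Two made-up vectors with lengths 5 and 3.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

raw_ip = inner_product(a, b)                                      # depends on vector length
cos = cosine_similarity(a, b)                                     # length-independent
normalized_ip = inner_product(l2_normalize(a), l2_normalize(b))   # matches cos

print(raw_ip, cos, normalized_ip)
```

Without normalization, an inner-product index would favor longer vectors, which is why normalize_embeddings is set to True above.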
- Test code:
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import numpy as np
embeddings = HuggingFaceEmbeddings(
    model_name="models/e5-base-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True},
)
#xx = embeddings.embed_query("早安 打工人")
docs = [
Document(page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose", metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"}),
Document(page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...", metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2}),
Document(page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea", metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6}),
]
vector_store = FAISS.from_documents(docs, embeddings)
docs_score = vector_store.similarity_search_with_score("A bunch of scientists bring back dinosaurs?", k=3)
print(docs_score)
References:
https://github.com/hwchase17/langchain/issues/4232
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
Chroma notes:
I also tried Chroma, but its cosine similarity and Euclidean distance results came out identical, so I did not use it.
db = Chroma.from_documents(texts, embeddings)
docs_score = db.similarity_search_with_score(query=query, distance_metric="cos", k=6)
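One possible reason for the identical results (an assumption, not checked against Chroma's internals) is that the embeddings were already normalized. For unit vectors the squared L2 distance equals 2 - 2 * cos, so ranking by ascending distance and ranking by descending cosine similarity produce the same order. The sketch below uses made-up vectors and hypothetical helper names in plain Python, not the Chroma API:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper, for illustration only)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up query and candidate vectors, all normalized to unit length.
query = l2_normalize([1.0, 0.5, 0.0])
candidates = {
    "a": l2_normalize([1.0, 0.0, 0.0]),
    "b": l2_normalize([0.0, 1.0, 0.0]),
    "c": l2_normalize([1.0, 1.0, 1.0]),
}

# Higher cosine is better; lower distance is better.
rank_by_cosine = sorted(candidates, key=lambda name: -dot(query, candidates[name]))
rank_by_l2 = sorted(candidates, key=lambda name: squared_l2(query, candidates[name]))

print(rank_by_cosine, rank_by_l2)
```

The scores themselves still differ (a distance versus a similarity), only the ordering of the results is the same.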
For local persistence with Chroma, see:
https://github.com/hwchase17/chroma-langchain/blob/master/persistent-qa.ipynb