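Deploying bge:
cd into the model directory:
Run: git clone https://huggingface.co/BAAI/bge-large-zh-1.5
Then delete pytorch_model.bin from bge-large-zh (if that file is only a few tens of MB, what was downloaded is probably a placeholder rather than the real weights) and run the following command: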
wget https://huggingface.co/BAAI/bge-large-zh/resolve/main/pytorch_model.bin
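Before loading the model, it can help to confirm that pytorch_model.bin holds real weights and not a placeholder. The sketch below assumes the placeholder is a Git LFS pointer, which is a small text file whose first line starts with `version https://git-lfs.github.com/spec/v1`. The helper name `looks_like_lfs_pointer` is hypothetical, and the path is only an assumed location taken from the clone step above.

```python
import os

# First bytes of a Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def looks_like_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER

if __name__ == "__main__":
    # Assumed location; adjust to wherever the model was cloned.
    weights = "/home/ubuntu/model/bge-large-zh-1.5/pytorch_model.bin"
    if os.path.exists(weights):
        size_mb = os.path.getsize(weights) / (1024 * 1024)
        pointer = looks_like_lfs_pointer(weights)
        print(f"{weights}: {size_mb:.1f} MB, LFS pointer: {pointer}")
    else:
        print(f"{weights} not found")
```

If the check reports a pointer file, delete it and fetch the weights again with the wget command.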
Put the following in a .py or .ipynb file:
from sentence_transformers import SentenceTransformer
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"
model = SentenceTransformer('/home/ubuntu/model/bge-large-zh-1.5')
q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)
If running it prints a score matrix like the one in the image below, the model was downloaded successfully.
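The expression `q_embeddings @ p_embeddings.T` works as a similarity score because `normalize_embeddings=True` returns unit-length vectors, and the dot product of two unit vectors is their cosine similarity. The following is a minimal sketch of that relationship in plain NumPy. The vectors are made up for illustration and are not real bge embeddings, and the helper names `l2_normalize` and `cosine_matrix` are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row of x to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return l2_normalize(a) @ l2_normalize(b).T

# Made-up vectors standing in for query and passage embeddings.
q = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
p = np.array([[2.0, 4.0, 6.0], [1.0, 0.0, 0.0]])

scores = cosine_matrix(q, p)
print(scores.shape)  # one row per query, one column per passage
print(scores)
```

Each row of the result corresponds to one query and each column to one passage, which is the same layout as the `scores` matrix printed above.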
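How to use:
We want to use embeddings to retrieve the 10 sentences most relevant to our question. We use the following dataset (it was compiled by the company, so it will not be made public for now).
The data is stored in the file emb11_0918_3694.txt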
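import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

with open('/home/ubuntu/model/bge-large-zh/emb11_0918_3694.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Split the text into lines, skipping blank ones
sentence_2_cn = [line for line in text.split('\n') if line.strip()]
model_cn = SentenceTransformer('/home/ubuntu/model/bge-large-zh')
embeddings_2_cn = model_cn.encode(sentence_2_cn, normalize_embeddings=True)
query = "如何更换雨刮片"  # "How do I replace the wiper blades?"
sentence_1 = []
# Retrieval instruction, prepended to queries only (not to passages)
instruction = "为这个句子生成表示以用于检索相关文章:"
sentence_1.append(query)
# Encode the query with the same model that encoded the corpus
embeddings_1 = model_cn.encode([instruction+q for q in sentence_1], normalize_embeddings=True)
similarities = cosine_similarity(embeddings_1, embeddings_2_cn)[0]
# Indices of the 10 most similar sentences, best match first
top_indices = np.argsort(similarities)[::-1][:10]
# Prompt text: 'Please tell me the answer to "<query>"'
request = f"""请告诉我\\"{query}\\"的答案"""
top_10_sentences = [sentence_2_cn[index] for index in top_indices]
new_string = '\n'.join(top_10_sentences) +'\n'+ request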
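The selection and prompt-assembly steps can be separated from the model so they are easy to check on their own. The sketch below uses made-up similarity scores and placeholder sentences instead of real embeddings. The helper names `top_k_sentences` and `build_prompt` are hypothetical, and the prompt wording is simplified (no backslash escaping around the quotes).

```python
import numpy as np

def top_k_sentences(similarities, sentences, k=10):
    """Return the k sentences with the highest similarity, best first."""
    k = min(k, len(sentences))
    order = np.argsort(similarities)[::-1][:k]  # sort descending, keep k
    return [sentences[i] for i in order]

def build_prompt(context_sentences, query):
    """Join the retrieved sentences and append the question at the end."""
    # Prompt text: 'Please tell me the answer to "<query>"'
    request = f'请告诉我"{query}"的答案'
    return "\n".join(context_sentences) + "\n" + request

# Made-up scores and sentences, for illustration only.
sims = np.array([0.2, 0.9, 0.5, 0.7])
docs = ["sentence a", "sentence b", "sentence c", "sentence d"]

top = top_k_sentences(sims, docs, k=2)
print(top)
print(build_prompt(top, "如何更换雨刮片"))
```

Slicing after the descending sort is what limits the context to the best k matches. Without the slice, every sentence in the file would end up in the prompt.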