Using ElasticSearch as a Knowledge Base: Storing Vectors and Similarity Search

Article link: https://blog.csdn.net/2401_84181626/article/details/137504916

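This post shows how to combine ElasticSearch with a BERT model: text is converted to vectors, stored in ElasticSearch, and then retrieved by similarity search. It demonstrates similarity queries based on cosine similarity and dot product, using the question of whether patients with hypertension can take Codonopsis (党参) and ginseng (人参) as the running example.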


        "search_analyzer": "ik_smart"
      },
      "answer": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

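Here, dims is the length (dimensionality) of the vector. It must match the size of the embeddings produced by the model.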
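To make the role of dims concrete, here is a minimal sketch of such a mapping built as a Python dictionary. The `build_mapping` helper is hypothetical, and the `dense_vector` type for `ask_vector`, the analyzers on `ask`, and the default of 768 (the hidden size of BERT-base models) are assumptions rather than the article's exact mapping.

```python
# Hypothetical sketch: the "ask_vector" field type, the "ask" analyzers and
# the default dims are assumptions, not the article's exact mapping.
def build_mapping(dims=768):
    """Return an index mapping whose vector field holds `dims` floats."""
    text_field = {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
    }
    return {
        "mappings": {
            "properties": {
                # dims must equal the length of every vector stored here
                "ask_vector": {"type": "dense_vector", "dims": dims},
                "ask": dict(text_field),
                "answer": dict(text_field),
            }
        }
    }


mapping = build_mapping(dims=768)
```

If the model's embedding size differs from dims, ElasticSearch will reject the documents at index time, so it is worth deriving dims from the model rather than hard-coding it.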


View the created index:

GET http://127.0.0.1:9200/medical_index
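The same check can be done from Python. The sketch below assumes a client object exposing `indices.get(index=...)`, as the official `elasticsearch` client does; the helper name `get_index_properties` is hypothetical.

```python
def get_index_properties(es, index_name="medical_index"):
    """Fetch the index definition and return its field mappings."""
    # The response is keyed by index name, mirroring the GET request above
    info = es.indices.get(index=index_name)
    return info[index_name]["mappings"]["properties"]


# Usage (requires a running ElasticSearch node):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://127.0.0.1:9200")
# print(get_index_properties(es))
```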


Storing Data in ElasticSearch

Install the ElasticSearch client library:

pip install elasticsearch -i https://pypi.tuna.tsinghua.edu.cn/simple

from elasticsearch import Elasticsearch
from transformers import BertTokenizer, BertModel
import torch
import pandas as pd

def embeddings_doc(doc, tokenizer, model, max_length=300):
    # Tokenize the text, padding or truncating it to max_length tokens
    encoded_dict = tokenizer.encode_plus(
        doc,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    input_id = encoded_dict['input_ids']
    attention_mask = encoded_dict['attention_mask']

    # Forward pass
    with torch.no_grad():
        outputs = model(input_id, attention_mask=attention_mask)

    # Use the last layer's [CLS] vector as the text representation
    last_hidden_state = outputs.last_hidden_state
    cls_embeddings = last_hidden_state[:, 0, :]
    return cls_embeddings[0]
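For intuition about how these [CLS] vectors are compared once stored, here is a small pure-Python sketch of the two measures the post relies on, dot product and cosine similarity. The helper names and the toy three-dimensional vectors are illustrative assumptions; real embeddings have as many components as the index's dims.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector magnitude."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
stored = [2.0, 0.0, 2.0]   # same direction as query, twice the length
print(dot(query, stored))
print(cosine_similarity(query, stored))
```

The dot product grows with vector length, while cosine similarity depends only on direction, which is why the two measures can rank the same documents differently unless the vectors are normalized.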

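def add_doc(index_name, id, embedding_ask, ask, answer, es):
    # Convert the tensor to a plain list so it can be serialized as JSON
    body = {
        "ask_vector": embedding_ask.tolist(),
        "ask": ask,
        "answer": answer
    }
    # doc_type is accepted by 7.x clients; 8.x clients no longer take it
    result = es.create(index=index_name, id=id, doc_type="_doc", body=body)
    return result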

def main():