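Background:
I have some data in the form <knowledge point: explanation of the knowledge point>, and I need to find the matching knowledge point for a given sentence.
Code:
The embedding model is still bge-large-zh-v1.5.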
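import os
from transformers import AutoTokenizer
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Debugging aid: makes CUDA kernel launches synchronous; set it before the model is loaded
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
Settings.tokenizer = AutoTokenizer.from_pretrained("<your path>/bge-large-zh-v1.5")
Settings.embed_model = HuggingFaceEmbedding(model_name="<your path>/bge-large-zh-v1.5")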
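I want both the knowledge point and its explanation to be part of the knowledge base, so the dictionary is first turned into "knowledge point: explanation" strings.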
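from tqdm import tqdm
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# datas: dict mapping each knowledge point to its explanation
nodes = []
for k, v in tqdm(datas.items()):
    # Node text is "knowledge point: explanation"; the id is the knowledge point itself,
    # taken from the key so that a ":" inside the key cannot truncate it
    nodes.append(TextNode(text=f"{k}: {v}", id_=str(k)))
index = VectorStoreIndex(nodes)  # embed all nodes and build the index
index.storage_context.persist("<your path>")  # persist the index to disk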
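A note on the node ids: recovering the knowledge point by splitting the joined string on ":" only works as long as no knowledge point contains a colon itself. Here is a minimal sketch of the difference, using made-up entries (the contents of `datas` below are hypothetical, not the real data):

```python
# Hypothetical sample data: knowledge point -> explanation
datas = {
    "HTTP": "A protocol for transferring hypertext.",
    "Ratio 1:2": "One part to two parts.",
}

# Same string format as the knowledge base entries
all_data = [f"{k}: {v}" for k, v in datas.items()]

# Splitting the joined string truncates any key that contains ":"
ids_from_split = [s.split(":")[0] for s in all_data]

# Taking the key directly is always exact
ids_from_keys = list(datas.keys())

print(ids_from_split)
print(ids_from_keys)
```

If the keys are known to be colon-free, both approaches give the same ids.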
Loading the index later:
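from llama_index.core import StorageContext, load_index_from_storage

def load_retriever(persist_dir):
    # Settings.embed_model must be set to the same model that was used to build the index
    context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(context)
    return index.as_retriever(similarity_top_k=3)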
Set similarity_top_k to however many results you need.
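To make it concrete what similarity_top_k controls, here is a minimal sketch of top-k retrieval on toy vectors, assuming cosine similarity as the metric. It is not LlamaIndex's implementation; the helper names `cosine` and `top_k` and the 2-d vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-d "embeddings"; real bge-large-zh-v1.5 vectors have 1024 dimensions
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.1], doc_vecs, k=2)
print(results)
```

A larger k returns more candidates at the cost of including weaker matches near the end of the list.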
Usage after loading:
retriever = load_retriever("<your path>")
knows_infos = retriever.retrieve(question)  # question is the sentence to match
knows = [know_info.node.id_ for know_info in knows_infos]  # the knowledge points only
know_and_infos = [know_info.node.text for know_info in knows_infos]  # the "knowledge point: explanation" strings
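Each retrieved item also carries a similarity score (know_info.score), and a plain top-k retriever returns up to k items even when none of them is a good match. Below is a minimal sketch of dropping weak matches. The helper `filter_by_score` and the 0.5 threshold are hypothetical and would need tuning on real data; the stand-in objects only mimic the `.score` and `.node.id_` attributes so the example runs without an index.

```python
from types import SimpleNamespace

def filter_by_score(results, min_score):
    """Keep the ids of results whose similarity score is at least min_score."""
    return [
        r.node.id_
        for r in results
        if r.score is not None and r.score >= min_score
    ]

# Stand-ins that mimic only the attributes used above (not real retriever output)
fake_results = [
    SimpleNamespace(score=0.82, node=SimpleNamespace(id_="HTTP")),
    SimpleNamespace(score=0.41, node=SimpleNamespace(id_="TCP")),
    SimpleNamespace(score=None, node=SimpleNamespace(id_="UDP")),
]
kept = filter_by_score(fake_results, min_score=0.5)
print(kept)
```

With real results the call would be filter_by_score(knows_infos, min_score=...), with the threshold chosen by inspecting scores on a few known questions.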