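Why build advanced retrieval?
The same user query often returns different results under different retrieval methods. Sometimes keyword retrieval matches what the user expects better; at other times, for example for cross-lingual or synonym matching, vector retrieval is needed.
Can we combine the two methods so that most queries get the best results?
Yes, and that is exactly the problem advanced retrieval solves.
How advanced retrieval works:
Merge the keyword-retrieval results with the vector-retrieval results, then rerank the combined list to get the best results.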
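To make the idea concrete, here is a minimal sketch of one common way to rerank merged results: reciprocal rank fusion, the same scoring that the fusion code later in this post computes. The document IDs and the function name `reciprocal_rank_fusion` are made up for illustration; only the formula 1 / (rank + k) with k = 60 comes from the code in this post.

```python
def reciprocal_rank_fusion(ranked_lists, k=60.0):
    """Fuse several ranked lists; items ranked high in many lists score highest."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            # Each appearance adds 1 / (rank + k); ranks start at 0.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from a keyword retriever and a vector retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents that both retrievers return (doc_a and doc_b here) end up ahead of documents that only one retriever found.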
Starting from scratch
Install dependencies
pip install llama-index
pip install llama-index-readers-file pymupdf
pip install llama-index-llms-openai
pip install llama-index-retrievers-bm25
If you run the code in Jupyter, add these two lines so that nested async code works:
import nest_asyncio
nest_asyncio.apply()
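As background, the sketch below shows why this is needed, using only the standard library. Jupyter already runs an event loop, and plain asyncio refuses to start a second one from inside it. The function names `inner` and `outer` are made up for illustration; `outer` stands in for code running inside a notebook cell.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a Jupyter cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

Run as a plain script, this prints the RuntimeError message instead of "done". nest_asyncio patches asyncio so that such nested calls are allowed.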
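Load the document
Put the document in the project's data directory. This example uses the official Llama 2 paper.
Download: https://arxiv.org/pdf/2307.09288.pdf
After downloading, rename it to llama2.pdf and place it in the data directory.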
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")
Set the API key
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
Set up the LLM and embedding model
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(
model="text-embedding-3-small", embed_batch_size=256
)
Index the documents
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=300, chunk_overlap=50)  # the default chunk_size is 1024 tokens, which is long and costs more; use a shorter value here
index = VectorStoreIndex.from_documents(
documents, transformations=[splitter], embed_model=embed_model
)
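To illustrate what chunk_size and chunk_overlap mean, here is a deliberately simplified sliding-window sketch over a list of tokens. It is not how SentenceSplitter works internally (the real splitter also respects sentence boundaries); the function `sliding_chunks` and the toy tokens are assumptions made for illustration.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so text near a chunk boundary stays visible from both sides.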
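Define advanced retrieval
With the preparation done, the next step is to define the advanced retrieval tool.
It comes down to three steps:
1. Generate/rewrite queries: generate multiple queries from the user's original query
2. Run retrieval for each query
3. Rerank/merge: merge the retrieved results and rerank them to get the best results
Generate/rewrite queries
The first step is to generate several queries with a similar meaning from the user's original query, in order to improve retrieval accuracy. For example, we can rewrite the query into several smaller queries.
Here is how to do this with a prompt: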
from llama_index.core import PromptTemplate
query_str = "How do the models developed in this work compare to open-source chat models based on the benchmarks tested?"
query_gen_prompt_str = (
"You are a helpful assistant that generates multiple search queries based on a "
"single input query. Generate {num_queries} search queries, one on each line, "
"related to the following input query:\n"
"Query: {query}\n"
"Queries:\n"
)
query_gen_prompt = PromptTemplate(query_gen_prompt_str)
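def generate_queries(llm, query_str: str, num_queries: int = 4):
    # Ask the LLM for num_queries - 1 rewrites of the original query.
    fmt_prompt = query_gen_prompt.format(
        num_queries=num_queries - 1, query=query_str
    )
    response = llm.complete(fmt_prompt)
    # One query per line; drop blank lines the LLM may emit.
    queries = [q.strip() for q in response.text.split("\n") if q.strip()]
    return queries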
queries = generate_queries(llm, query_str, num_queries=4)
print(queries)
Starting from the user's original input:
How do the models developed in this work compare to open-source chat models based on the benchmarks tested
we generated the following queries:
['1. Comparison of models developed in this work to open-source chat models in benchmark testing', '2. Performance evaluation of models developed in this work versus open-source chat models on tested benchmarks', '3. Analysis of differences between models developed in this work and open-source chat models in benchmark assessments']
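As you can see, the meaning is the same; only the wording changes.
An analogy: "Zhang San's hat" ---> "This is Zhang San's hat", "This hat is Zhang San's", "This hat belongs to Zhang San", ...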
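Note that the generated queries above still carry list numbering such as "1. ", which is then sent to the retrievers as part of the query text. A minimal sketch of one way to strip it is shown below; the helper `clean_queries` is hypothetical and not part of the original example or of LlamaIndex.

```python
import re

# Leading "1. ", "2) ", "- ", "* " or a bullet character, followed by whitespace.
_PREFIX = re.compile(r"^\s*(?:\d+[.)]\s+|[-*•]\s+)")


def clean_queries(raw_queries):
    """Strip list numbering or bullets and drop empty lines."""
    cleaned = []
    for query in raw_queries:
        query = _PREFIX.sub("", query).strip()
        if query:
            cleaned.append(query)
    return cleaned


raw = [
    "1. Comparison of models developed in this work to open-source chat models in benchmark testing",
    "",
    "2. Performance evaluation of models developed in this work versus open-source chat models on tested benchmarks",
]
cleaned = clean_queries(raw)
print(cleaned)
```

Whether this helps depends on the retriever: BM25 treats the stray "1" as a term, while an embedding model is usually less sensitive to it.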
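Run the retrieval queries
We generated multiple queries above; now we run retrieval for each of them.
Note: there can be multiple queries and also multiple retrievers. With N queries and M retrievers, retrieval runs N*M times, so the result is N*M lists.
Next, define a function that runs the retrievals and collects the results: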
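The N*M bookkeeping can be sketched with stand-in retrievers and plain asyncio. `FakeRetriever` and `run_all` are hypothetical names used only for illustration; the stand-in merely mimics an async `aretrieve` method and does no real retrieval.

```python
import asyncio


class FakeRetriever:
    """Hypothetical stand-in that only mimics an async `aretrieve` method."""

    def __init__(self, name):
        self.name = name

    async def aretrieve(self, query):
        await asyncio.sleep(0)  # pretend to do I/O
        return [f"{self.name} hit for {query!r}"]


async def run_all(queries, retrievers):
    # One key per (query, retriever index) pair: N * M keys in total.
    keys = [(q, i) for q in queries for i in range(len(retrievers))]
    tasks = [retrievers[i].aretrieve(q) for q, i in keys]
    results = await asyncio.gather(*tasks)
    return dict(zip(keys, results))


queries = ["q1", "q2", "q3"]
retrievers = [FakeRetriever("vector"), FakeRetriever("bm25")]
results = asyncio.run(run_all(queries, retrievers))
print(len(results))
```

With 3 queries and 2 retrievers the dictionary holds 6 result lists, one per (query, retriever index) key.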
from tqdm.asyncio import tqdm
async def run_queries(queries, retrievers):
    """Run queries against retrievers."""
    tasks = []
    keys = []
    for query in queries:
        for i, retriever in enumerate(retrievers):
            tasks.append(retriever.aretrieve(query))
            # Record which (query, retriever index) each task belongs to.
            keys.append((query, i))

    task_results = await tqdm.gather(*tasks)

    # Keep all N*M result lists; zipping with `queries` alone would drop most of them.
    results_dict = {}
    for key, query_result in zip(keys, task_results):
        results_dict[key] = query_result

    return results_dict
Define the retrievers
# get retrievers
from llama_index.retrievers.bm25 import BM25Retriever
## vector retrieval
vector_retriever = index.as_retriever(similarity_top_k=2)
## BM25 keyword retrieval
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=2
)
3 queries and 2 retrievers, so retrieval runs 6 times:
results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])
0%| | 0/6 [00:00<?, ?it/s] 100%|██████████| 6/6 [00:00<00:00, 11.14it/s]
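Fuse the results
Merge the multiple retrieval results, then rerank them.
Note that the retrieved results will very likely contain duplicates, so they need to be deduplicated before reranking.
Here is how the official code does it: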
from typing import List
from llama_index.core.schema import NodeWithScore
def fuse_results(results_dict, similarity_top_k: int = 2):
    """Fuse results."""
    k = 60.0  # `k` is a parameter used to control the impact of outlier rankings.
    fused_scores = {}
    text_to_node = {}

    # compute reciprocal rank scores
    for nodes_with_scores in results_dict.values():
        for rank, node_with_score in enumerate(
            sorted(
                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True
            )
        ):
            text = node_with_score.node.get_content()
            text_to_node[text] = node_with_score  # deduplicate: one entry per distinct text
            if text not in fused_scores:
                fused_scores[text] = 0.0
            fused_scores[text] += 1.0 / (rank + k)  # rank scores accumulate across result lists

    # sort results
    reranked_results = dict(
        sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    )

    # adjust node scores
    reranked_nodes: List[NodeWithScore] = []
    for text, score in reranked_results.items():
        reranked_nodes.append(text_to_node[text])
        reranked_nodes[-1].score = score

    return reranked_nodes[:similarity_top_k]
Print the results:
final_results = fuse_results(results_dict)
for n in final_results:
print(n.score, "\n", n.text, "\n********\n")
Result:
0.03333333333333333 Figure 12: Human evaluation results for Llama 2-Chat models compared to open- and closed-source models across ~4,000 helpfulness prompts with three raters per prompt. The largest Llama 2-Chat model is competitive with ChatGPT. Llama 2-Chat 70B model has a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. Llama 2-Chat 70B model outperforms PaLM-bison chat model by a large percentage on our prompt set. More results and analysis is available in Section A.3.7. Inter-Rater Reliability (IRR). In our human evaluations, three different annotators provided independent assessments for each model generation comparison. High IRR scores (closer to 1.0) are typically seen as better from a data quality perspective, however, context is important. Highly subjective tasks like evaluating the overall helpfulness of LLM generations will usually have lower IRR scores than more objective labelling tasks. There are relatively few public benchmarks for these contexts, so we feel sharing our analysis here will benefit the research community. We used Gwet’s AC1/2 statistic (Gwet, 2008, 2014) to measure inter-rater reliability (IRR), as we found it to be the most stable metric across different measurement scenarios. On the 7-point Likert scale helpfulness task that is used in our analysis, Gwet’s AC2 score varies between 0.37 and 0.55 depending on the specific model comparison. We see scores on the lower end of that range for ratings from model comparisons with similar win rates to each other (like the Llama 2-Chat-70B-chat vs. ChatGPT comparison). We see scores on the higher end of that range for ratings from model comparisons with a more clear winner (like the Llama 2-Chat-34b-chat vs. Falcon-40b-instruct). Limitations of human evaluations. While our results indicate that Llama 2-Chat is on par with ChatGPT on human evaluations, it is important to note that human evaluations have several limitations. 
• By academic and research standards, we have a large prompt set of 4k prompts. However, it does not cover real-world usage of these models, which will likely cover a significantly larger number of use cases. • Diversity of the prompts could be another factor in our results. For example, our prompt set does not include any coding- or reasoning-related prompts. • We only evaluate the final generation of a multi-turn conversation. A more interesting evaluation could be to ask the models to complete a task and rate the overall experience with the model over multiple turns. • Human evaluation for generative models is inherently subjective and noisy. As a result, evaluation on a different set of prompts or with different instructions could result in different results. 19 ******** 0.03306010928961749 Llama 2: Open Foundation and Fine-Tuned Chat Models Hugo Touvron∗ Louis Martin† Kevin Stone† Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov Thomas Scialom∗ GenAI, Meta Abstract In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 
billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed- source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. ∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com †Second author Contributions for all the authors can be found in Section A.1. arXiv:2307.09288v2 [cs.CL] 19 Jul 2023 ********
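The two scores printed above can be checked by hand against the formula 1 / (rank + k) with k = 60 and ranks starting at 0. The interpretation in the comments is an inference from the arithmetic, not something the output itself states.

```python
k = 60.0

# A node ranked first (rank 0) in two result lists.
top_twice = 1.0 / (0 + k) + 1.0 / (0 + k)

# A node ranked first in one result list and second (rank 1) in another.
top_and_second = 1.0 / (0 + k) + 1.0 / (1 + k)

print(top_twice)
print(top_and_second)
```

These values agree with the printed scores 0.0333... and 0.03306..., which is consistent with each of the two returned chunks having appeared in two of the result lists.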
Plug into RetrieverQueryEngine
Next we define a retriever and pass it to RetrieverQueryEngine, which runs the retrieval and fusion.
from typing import List
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
import asyncio
class FusionRetriever(BaseRetriever):
    """Ensemble retriever with fusion."""

    def __init__(
        self,
        llm,
        retrievers: List[BaseRetriever],
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._retrievers = retrievers
        self._similarity_top_k = similarity_top_k
        self._llm = llm
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        queries = generate_queries(
            self._llm, query_bundle.query_str, num_queries=4
        )
        results = asyncio.run(run_queries(queries, self._retrievers))
        final_results = fuse_results(
            results, similarity_top_k=self._similarity_top_k
        )
        return final_results
from llama_index.core.query_engine import RetrieverQueryEngine
fusion_retriever = FusionRetriever(
llm, [vector_retriever, bm25_retriever], similarity_top_k=2
)
query_engine = RetrieverQueryEngine(fusion_retriever)
response = query_engine.query(query_str)
print(str(response))
Output:
The models developed in this work, specifically the Llama 2-Chat models, outperform open-source chat models on most benchmarks that were tested.
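Summary:
In the example above we saw how to generate several similar queries, run them through multiple retrievers, and fuse and rerank the results to get the best ones. We also defined a custom FusionRetriever class that overrides the _retrieve method, and finally used RetrieverQueryEngine to run the retrieval and fusion.
The example above is of course the most basic version. LlamaIndex already provides a QueryFusionRetriever that works on the same principle but offers more features.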