LlamaIndex from Beginner to Advanced: Building Advanced Retrieval

Latest recommended article published on 2024-07-15 23:34:15

IT民工老包

Category: LlamaIndex. Tags: gpt, llama, artificial intelligence

Article link: https://blog.csdn.net/baoj2010/article/details/140047030

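Why build advanced retrieval?

The same user query often returns different results under different retrieval methods. Sometimes keyword retrieval matches the user's expectations better; at other times, such as cross-language queries or queries that rely on synonyms, vector retrieval is the better fit.

Can the two methods be combined so that most queries get the best result?

Yes. Advanced retrieval is designed to solve exactly this problem.

How advanced retrieval works:

The results of keyword retrieval and the results of vector retrieval are merged and reranked to produce the best overall result.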

Starting from scratch

Install the dependencies

pip install llama-index

pip install llama-index-readers-file pymupdf

pip install llama-index-llms-openai

pip install llama-index-retrievers-bm25

If you run the code in Jupyter, add these two lines so that the async code can run inside the notebook's event loop:

import nest_asyncio

nest_asyncio.apply()

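Load the document

Documents go in the project's data directory. This example uses the official Llama 2 paper.

Download: https://arxiv.org/pdf/2307.09288.pdf

After downloading, rename the file to llama2.pdf and put it in the data directory.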

from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

Set the API key


import os

os.environ["OPENAI_API_KEY"] = "sk-..."

Set up the LLM and the embedding model


from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=256
)

Index the document


from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# The default chunk_size is 1024 tokens; smaller chunks are used here to keep costs down.
splitter = SentenceSplitter(chunk_size=300, chunk_overlap=50)
index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], embed_model=embed_model
)
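
To get a feel for what chunk_size=300 and chunk_overlap=50 mean, here is a minimal sketch of a plain sliding window over a token list. It is only an illustration: SentenceSplitter also tries to respect sentence boundaries, which this sketch ignores, and the function name sliding_chunks is hypothetical.

```python
def sliding_chunks(tokens, chunk_size=300, chunk_overlap=50):
    """Split a token list into fixed-size windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Integers stand in for real tokens (an assumption made for the demo).
demo_tokens = list(range(1000))
demo_chunks = sliding_chunks(demo_tokens)
print([len(c) for c in demo_chunks])  # [300, 300, 300, 250]
```

Each window repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.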

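Define advanced retrieval

With the preparation done, the next step is to define the advanced retrieval tool.

Advanced retrieval consists of three steps:

1. Generate/rewrite queries: generate multiple queries from the user's original query

2. Run retrieval for each query

3. Rerank/merge: merge the retrieved results and rerank them to get the best result

Generate/rewrite queries

The first step is to generate several queries with a similar meaning from the user's original query, in order to improve retrieval quality. For example, we can rewrite the query into several smaller queries.

Here is how to do this with a prompt: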


from llama_index.core import PromptTemplate

query_str = "How do the models developed in this work compare to open-source chat models based on the benchmarks tested?"

query_gen_prompt_str = (
    "You are a helpful assistant that generates multiple search queries based on a "
    "single input query. Generate {num_queries} search queries, one on each line, "
    "related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"
)
query_gen_prompt = PromptTemplate(query_gen_prompt_str)


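def generate_queries(llm, query_str: str, num_queries: int = 4):
    # Ask the LLM for num_queries - 1 rewritten queries.
    fmt_prompt = query_gen_prompt.format(
        num_queries=num_queries - 1, query=query_str
    )
    response = llm.complete(fmt_prompt)
    # One query per line; drop blank lines so no empty query is sent to the retrievers.
    queries = [q.strip() for q in response.text.split("\n") if q.strip()]
    return queries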


queries = generate_queries(llm, query_str, num_queries=4)

print(queries)

From the user's original input:

How do the models developed in this work compare to open-source chat models based on the benchmarks tested

we generated the following queries:

['1. Comparison of models developed in this work to open-source chat models in benchmark testing', '2. Performance evaluation of models developed in this work versus open-source chat models on tested benchmarks', '3. Analysis of differences between models developed in this work and open-source chat models in benchmark assessments']

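As you can see, the meaning is the same; only the wording changes.

To borrow an analogy: "Zhang San's hat" ---> "This is Zhang San's hat", "This hat is Zhang San's", "This hat belongs to Zhang San", and so on.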
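
The generated queries shown above still carry their list numbering ("1. ", "2. ", ...), which then becomes part of the text sent to the retrievers. As an optional refinement that the original code does not include, a small post-processing helper could strip it. The sketch below assumes the LLM returns one query per line; the name clean_queries is hypothetical.

```python
import re


def clean_queries(raw_text: str) -> list[str]:
    """Split LLM output into queries, dropping blank lines and list markers."""
    cleaned = []
    for line in raw_text.split("\n"):
        # Remove a leading "1. ", "2) ", "- ", "* " or bullet marker, if present.
        line = re.sub(r"^\s*(?:\d+[.)]\s+|[-*•]\s+)", "", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


raw = "1. Comparison of models\n\n2. Performance evaluation\n- Analysis of differences"
print(clean_queries(raw))
# ['Comparison of models', 'Performance evaluation', 'Analysis of differences']
```

Whether the numbering hurts retrieval in practice depends on the retriever, so treat this as something to try rather than a required step.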

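Run the retrieval queries

We generated multiple queries above; now we run retrieval for each of them.

Note: there can be multiple queries and also multiple retrievers. With N queries and M retrievers, retrieval runs N*M times, so the outcome is N*M result lists.

Next, define a function that runs the retrievals and collects the results: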

from tqdm.asyncio import tqdm


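async def run_queries(queries, retrievers):
    """Run every query against every retriever and collect the results."""
    tasks = []
    keys = []
    for query in queries:
        for i, retriever in enumerate(retrievers):
            tasks.append(retriever.aretrieve(query))
            # Remember which (query, retriever index) pair each task belongs to.
            keys.append((query, i))

    task_results = await tqdm.gather(*tasks)

    # There are len(queries) * len(retrievers) results, so pair them with keys.
    # Zipping with queries alone would silently drop most of the results.
    results_dict = dict(zip(keys, task_results))

    return results_dict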
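
To see the N*M bookkeeping in isolation, here is a sketch that replaces the real retrievers with stubs, so it runs without an API key or an index. StubRetriever and run_all are hypothetical names used only for this illustration; plain asyncio.gather stands in for tqdm.gather.

```python
import asyncio


class StubRetriever:
    """Hypothetical stand-in for a real retriever; returns a canned string."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def aretrieve(self, query: str) -> list[str]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return [f"{self.name} hit for {query!r}"]


async def run_all(queries, retrievers):
    # One key per (query, retriever index) pair, in the same order as the tasks.
    keys = [(q, i) for q in queries for i in range(len(retrievers))]
    tasks = [retrievers[i].aretrieve(q) for q, i in keys]
    results = await asyncio.gather(*tasks)  # results come back in task order
    return dict(zip(keys, results))


demo_queries = ["q1", "q2", "q3"]
demo_retrievers = [StubRetriever("vector"), StubRetriever("bm25")]
demo_results = asyncio.run(run_all(demo_queries, demo_retrievers))
print(len(demo_results))  # 3 queries * 2 retrievers = 6
```

Keying the dictionary by (query, retriever index) keeps every result list addressable, which matters because the fusion step iterates over all of them.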

Define the retrievers

# get retrievers
from llama_index.retrievers.bm25 import BM25Retriever


## vector retrieval
vector_retriever = index.as_retriever(similarity_top_k=2)

## BM25 keyword retrieval
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=2
)

With 3 queries and 2 retrievers, 6 retrievals are run:

results_dict = await run_queries(queries, [vector_retriever, bm25_retriever])

  0%|          | 0/6 [00:00<?, ?it/s]
100%|██████████| 6/6 [00:00<00:00, 11.14it/s]

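Fuse the results

Merge the multiple retrieval results, then rerank them.

Note that the retrieved results are likely to contain duplicates, so they need to be deduplicated before reranking.

Here is how the official code does it: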

from typing import List
from llama_index.core.schema import NodeWithScore


def fuse_results(results_dict, similarity_top_k: int = 2):
    """Fuse results."""
    k = 60.0  # `k` is a parameter used to control the impact of outlier rankings.
    fused_scores = {}
    text_to_node = {}

    # compute reciprocal rank scores
    for nodes_with_scores in results_dict.values():
        for rank, node_with_score in enumerate(
            sorted(
                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True
            )
        ):
            text = node_with_score.node.get_content()
            text_to_node[text] = node_with_score  # keyed by text, so duplicates collapse into one entry
            if text not in fused_scores:
                fused_scores[text] = 0.0
            fused_scores[text] += 1.0 / (rank + k)  # reciprocal rank scores accumulate across result lists

    # sort results
    reranked_results = dict(
        sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    )

    # adjust node scores
    reranked_nodes: List[NodeWithScore] = []
    for text, score in reranked_results.items():
        reranked_nodes.append(text_to_node[text])
        reranked_nodes[-1].score = score

    return reranked_nodes[:similarity_top_k]
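
The scoring above is reciprocal rank fusion: every result list contributes 1 / (rank + k) to each item it contains, and the contributions add up. Here is a minimal sketch of the same arithmetic on plain strings, assuming the 0-based ranks and k = 60 used in fuse_results. The document IDs and the function name reciprocal_rank_fusion are made up for the illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60.0):
    """Fuse several ranked lists of IDs into one list of (id, score) pairs."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):  # rank is 0-based
            scores[doc_id] += 1.0 / (rank + k)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b comes first because it ranks high in both lists (1/61 + 1/60), ahead of doc_a (1/60 + 1/62). Items found by only one retriever get a single term and fall to the bottom, which is how the method rewards agreement between retrievers without having to compare their raw scores.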

Print the results:

final_results = fuse_results(results_dict)

for n in final_results:
    print(n.score, "\n", n.text, "\n********\n")

Result:

0.03333333333333333
Figure 12: Human evaluation results for Llama 2-Chat models compared to open- and closed-source models
across ~4,000 helpfulness prompts with three raters per prompt.
The largest Llama 2-Chat model is competitive with ChatGPT. Llama 2-Chat 70B model has a win rate of
36% and a tie rate of 31.5% relative to ChatGPT. Llama 2-Chat 70B model outperforms PaLM-bison chat
model by a large percentage on our prompt set. More results and analysis is available in Section A.3.7.
Inter-Rater Reliability (IRR).
In our human evaluations, three different annotators provided independent
assessments for each model generation comparison. High IRR scores (closer to 1.0) are typically seen as
better from a data quality perspective, however, context is important. Highly subjective tasks like evaluating
the overall helpfulness of LLM generations will usually have lower IRR scores than more objective labelling
tasks. There are relatively few public benchmarks for these contexts, so we feel sharing our analysis here will
benefit the research community.
We used Gwet’s AC1/2 statistic (Gwet, 2008, 2014) to measure inter-rater reliability (IRR), as we found it to
be the most stable metric across different measurement scenarios. On the 7-point Likert scale helpfulness
task that is used in our analysis, Gwet’s AC2 score varies between 0.37 and 0.55 depending on the specific
model comparison. We see scores on the lower end of that range for ratings from model comparisons with
similar win rates to each other (like the Llama 2-Chat-70B-chat vs. ChatGPT comparison). We see scores on
the higher end of that range for ratings from model comparisons with a more clear winner (like the Llama
2-Chat-34b-chat vs. Falcon-40b-instruct).
Limitations of human evaluations.
While our results indicate that Llama 2-Chat is on par with ChatGPT
on human evaluations, it is important to note that human evaluations have several limitations.
• By academic and research standards, we have a large prompt set of 4k prompts. However, it does not cover
real-world usage of these models, which will likely cover a significantly larger number of use cases.
• Diversity of the prompts could be another factor in our results. For example, our prompt set does not
include any coding- or reasoning-related prompts.
• We only evaluate the final generation of a multi-turn conversation. A more interesting evaluation could be
to ask the models to complete a task and rate the overall experience with the model over multiple turns.
• Human evaluation for generative models is inherently subjective and noisy. As a result, evaluation on a
different set of prompts or with different instructions could result in different results.
19
********

0.03306010928961749
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov
Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Contributions for all the authors can be found in Section A.1.
arXiv:2307.09288v2 [cs.CL] 19 Jul 2023
********

Plug into RetrieverQueryEngine

Next we define a custom retriever and pass it to RetrieverQueryEngine, which runs retrieval and fusion.

from typing import List

from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
import asyncio


class FusionRetriever(BaseRetriever):
    """Ensemble retriever with fusion."""

    def __init__(
        self,
        llm,
        retrievers: List[BaseRetriever],
        similarity_top_k: int = 2,
    ) -> None:
        """Init params."""
        self._retrievers = retrievers
        self._similarity_top_k = similarity_top_k
        self._llm = llm
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        queries = generate_queries(
            self._llm, query_bundle.query_str, num_queries=4
        )
        results = asyncio.run(run_queries(queries, self._retrievers))
        final_results = fuse_results(
            results, similarity_top_k=self._similarity_top_k
        )

        return final_results

from llama_index.core.query_engine import RetrieverQueryEngine

fusion_retriever = FusionRetriever(
    llm, [vector_retriever, bm25_retriever], similarity_top_k=2
)

query_engine = RetrieverQueryEngine(fusion_retriever)

response = query_engine.query(query_str)

print(str(response))

Output:

The models developed in this work, specifically the Llama 2-Chat models, outperform open-source chat models on most benchmarks that were tested.

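Summary:

In the example above we saw how to generate several similar queries, run them through multiple retrievers, and fuse and rerank the results to get the best result. We also defined a custom FusionRetriever class that overrides the _retrieve method, and finally used RetrieverQueryEngine to drive retrieval and fusion.

The example above is the most basic version. The official library already provides QueryFusionRetriever, which works on the same principle but offers more features.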
