[LangChain Series] Hands-On Case 4: RAG Q&A Revisited, Extracting Online Web Page Data and Returning the Sources of the Generated Answer

AI-入门

Article link: https://blog.csdn.net/2401_86518761/article/details/142378927

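0. Background

Today we combine the skills covered so far to build a web data + RAG question-answering example, and learn how to include the sources (the original documents) alongside the returned result.

Attaching reference sources to a result is an important part of RAG question answering. First, it helps us understand how the answer was generated and which content it drew on, so we can catch cases where the wrong documents were referenced. Second, it shows users that the answer is grounded in references rather than made up, which builds trust. For example, the retrieval tool shown below looks more professional and more credible once its sources are displayed: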

[Image: a retrieval tool displaying sources alongside its results]
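As a rough illustration of what "an answer plus its sources" might look like as data, here is a minimal, dependency-free sketch. `SourceDoc`, `AnswerWithSources`, and `attach_sources` are hypothetical names made up for this example, not LangChain APIs, and the answer text and chunk contents are placeholders. The only thing borrowed from LangChain is the assumption that each retrieved document carries a `source` entry in its metadata.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDoc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class AnswerWithSources:
    """Hypothetical result type: the generated answer plus where it came from."""
    answer: str
    sources: list


def attach_sources(answer, docs):
    """Pair an answer with the de-duplicated sources of the docs behind it."""
    seen = []
    for doc in docs:
        src = doc.metadata.get("source", "unknown")
        if src not in seen:  # keep first-seen order, drop repeats
            seen.append(src)
    return AnswerWithSources(answer=answer, sources=seen)


url = "https://lilianweng.github.io/posts/2023-06-23-agent/"
docs = [
    SourceDoc("Chunk about planning.", {"source": url}),
    SourceDoc("Chunk about memory.", {"source": url}),
]
result = attach_sources("A placeholder answer.", docs)
print(result.sources)
```

Two chunks from the same page collapse into a single source entry, which is usually what a user-facing "sources" list should show.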

1. Code implementation

References:

python.langchain.com/docs/use_ca…

python.langchain.com/docs/use_ca…

1.1 Loading web page data

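import bs4
# Import path for recent releases; older releases used langchain.document_loaders.
from langchain_community.document_loaders import WebBaseLoader

# Load one page, keeping only elements whose CSS class matches the filter.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()  # returns a list of Document objects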

The code loads the page at https://lilianweng.github.io/posts/2023-06-23-agent/ as an example.
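To make the effect of the `parse_only` class filter in the snippet above more concrete, here is a small sketch of the same idea: keep text only from elements that carry one of the wanted CSS classes. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup's `SoupStrainer` so it runs without extra dependencies, so treat it as an approximation of the behavior, not as what LangChain does internally. `ClassFilterParser` and the inline HTML are hypothetical, invented for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect depth tracking.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassFilterParser(HTMLParser):
    """Hypothetical helper: collect text only from elements with a wanted CSS class."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a matching element; keep counting for its children.
        if self.depth or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


html = """
<html><body>
  <nav class="menu">Home | About</nav>
  <h1 class="post-title">LLM Powered Autonomous Agents</h1>
  <div class="post-content"><p>Agents combine <b>planning</b> and memory.</p></div>
  <footer class="site-footer">Copyright</footer>
</body></html>
"""
parser = ClassFilterParser(("post-content", "post-title", "post-header"))
parser.feed(html)
text = " ".join(parser.chunks)
print(text)
```

The navigation and footer text are dropped because their classes are not in the wanted set, which is the same reason the loader above ends up with only the post title, header, and body.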

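The data is loaded with WebBaseLoader, the class LangChain provides for loading web page data. According to its docstring, it loads the HTML page with urllib and then parses it with BeautifulSoup, keeping only the content that matches the given filter. In the code above, class_=("post-content", "post-title", "post-header") means that only elements carrying one of these CSS classes are extracted (these are class names, not tag names). The class definition and its initialization parameters are as follows: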

class WebBaseLoader(BaseLoader):
    """Load HTML pages using `urllib` and parse them with `BeautifulSoup'."""

    def __init__(
        self,
        web_path: Union[str, Sequence[str]] = "",
        header_template: Optional[dict] = None,
        verify_ssl: bool = True,
        proxies: Optional[dict] =