7. RAG & LLM Learning Notes from Scratch --- Indexing

Indexing is an indispensable part of a RAG system: it not only improves retrieval efficiency and accuracy, but also makes it practical to handle large-scale data and complex queries.

This note introduces two indexing methods:

  • Multi-representation
  • ColBERT

Multi-representation Indexing

Flow (diagram omitted): documents → LLM summaries → vector store; full documents → docstore, linked by doc_id.

First, load the documents from the specified URLs:

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

Next, summarize these documents with a language model and a prompt:

import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

The next block wires everything together:

  1. vectorstore: a Chroma vector store used to index the summaries (the child chunks).
  2. store: an in-memory byte store used to hold the parent documents.
  3. retriever: a retriever that combines the vector store and the byte store for multi-vector retrieval.
  4. doc_ids: a unique ID generated for each document.

from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Query test:

query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Result: Document(page_content='This document discusses the concept of building autonomous agents powered by LLM (large language model). It covers the key components of planning, memory, and tool use in LLM powered agents, along with case studies and proof-of-concept examples. The challenges of using natural language as an interface and the limitations of LLMs are also highlighted. The document includes various references to related research and projects in the field.', metadata={'doc_id': '755c3111-6b29-4564-9abf-038712f22ef7'})

Whereas similarity_search on the vectorstore returns the summary (the child chunk), the retriever maps the hit back to the full parent document:

retrieved_docs = retriever.get_relevant_documents(query, n_results=1)
retrieved_docs[0].page_content[0:500]

结果:"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n LLM Powered Autonomous Agents\n \nDate: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n”

ColBERT

RAGatouille makes ColBERT very easy to use. ColBERT produces a contextually influenced vector for each token in a passage, and likewise produces a vector for each token in the query. Each document's score is then the sum, over all query embeddings, of the maximum similarity between that query embedding and any of the document's embeddings:
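
That is, score(q, d) = Σ_i max_j (E_qi · E_dj), summing over query tokens i and maximizing over document tokens j. A toy NumPy sketch of this MaxSim scoring (illustrative only, assuming unit-normalized token embeddings; this is not RAGatouille's internal code):

import numpy as np

# Toy MaxSim scoring (illustration, not RAGatouille's internals).
# q_emb: contextual embeddings of the query tokens, shape (n_query_tokens, dim)
# d_emb: contextual embeddings of the document tokens, shape (n_doc_tokens, dim)
def maxsim_score(q_emb, d_emb):
    # Normalize rows so dot products are cosine similarities
    q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d_emb = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sims = q_emb @ d_emb.T  # (n_query_tokens, n_doc_tokens) token-pair similarities
    # For each query token keep its best-matching document token, then sum
    return float(sims.max(axis=1).sum())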

Load the pretrained model:

from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

Load the dataset:

import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Hayao_Miyazaki")

Split the document into chunks and build the embedding index:

RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

Then you can run retrieval directly:

results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results
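
For reference, RAG.search returns a list of dicts; a quick way to inspect them (the content / score / rank keys follow RAGatouille's documented output format, taken here as an assumption):

for r in results:
    print(r["rank"], round(r["score"], 2), r["content"][:80])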

Convert it to a LangChain retriever:

retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")
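
From here the retriever drops into an ordinary LangChain RAG chain. A minimal sketch (the prompt wording and model choice are illustrative assumptions, not part of the original walkthrough):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Illustrative RAG chain built around the ColBERT retriever defined above
prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n\n"
    "{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-3.5-turbo")  # model choice is an assumption
    | StrOutputParser()
)

chain.invoke("What animation studio did Miyazaki found?")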
