7. RAG & LLM Learning Notes from Scratch --- Indexing

Indexing is an indispensable part of a RAG system: it not only improves retrieval efficiency and accuracy, but also makes it practical to handle large-scale data and complex queries.

This note introduces two indexing methods:

  • Multi-representation
  • ColBERT

Multi-representation Indexing

Flow (diagram omitted): documents → LLM summaries → vector store; full documents → docstore, linked by doc_id.

First, load the documents from the specified URLs:

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

Next, summarize these documents with a language model and a prompt:

import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

The next block wires everything together:

  1. vectorstore: a Chroma vector store used to index the summaries (the child chunks).
  2. store: an in-memory byte store used to hold the parent documents.
  3. retriever: a retriever that combines the vector store and the byte store for multi-vector retrieval.
  4. doc_ids: a unique ID generated for each document.

from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Query test:

query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Result: Document(page_content='This document discusses the concept of building autonomous agents powered by LLM (large language model). It covers the key components of planning, memory, and tool use in LLM powered agents, along with case studies and proof-of-concept examples. The challenges of using natural language as an interface and the limitations of LLMs are also highlighted. The document includes various references to related research and projects in the field.', metadata={'doc_id': '755c3111-6b29-4564-9abf-038712f22ef7'})

Whereas similarity_search on the vectorstore returns the summary (the child chunk), the retriever maps the hit back to the full parent document:

retrieved_docs = retriever.get_relevant_documents(query, n_results=1)
retrieved_docs[0].page_content[0:500]

结果:"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n LLM Powered Autonomous Agents\n \nDate: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n”

ColBERT

RAGatouille makes ColBERT very easy to use. ColBERT produces a contextually influenced vector for each token in a passage, and likewise produces a vector for each token in the query. Each document's score is then the sum, over all query embeddings, of the maximum similarity between that query embedding and any of the document's embeddings:
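
That is, score(q, d) = Σ_i max_j (E_qi · E_dj), summing over query tokens i and maximizing over document tokens j. A toy NumPy sketch of this MaxSim scoring (illustrative only, assuming unit-normalized token embeddings; this is not RAGatouille's internal code):

import numpy as np

# Toy MaxSim scoring (illustration, not RAGatouille's internals).
# q_emb: contextual embeddings of the query tokens, shape (n_query_tokens, dim)
# d_emb: contextual embeddings of the document tokens, shape (n_doc_tokens, dim)
def maxsim_score(q_emb, d_emb):
    # Normalize rows so dot products are cosine similarities
    q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d_emb = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sims = q_emb @ d_emb.T  # (n_query_tokens, n_doc_tokens) token-pair similarities
    # For each query token keep its best-matching document token, then sum
    return float(sims.max(axis=1).sum())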

Load the pretrained model:

from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

Load the dataset:

import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Hayao_Miyazaki")

Split the document into chunks and build the embedding index:

RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

Then you can run retrieval directly:

results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results
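
For reference, RAG.search returns a list of dicts; a quick way to inspect them (the content / score / rank keys follow RAGatouille's documented output format, taken here as an assumption):

for r in results:
    print(r["rank"], round(r["score"], 2), r["content"][:80])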

Convert it to a LangChain retriever:

retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")
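
From here the retriever drops into an ordinary LangChain RAG chain. A minimal sketch (the prompt wording and model choice are illustrative assumptions, not part of the original walkthrough):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Illustrative RAG chain built around the ColBERT retriever defined above
prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n\n"
    "{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-3.5-turbo")  # model choice is an assumption
    | StrOutputParser()
)

chain.invoke("What animation studio did Miyazaki found?")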
