Datawhale AI 夏令营大模型应用开发 Task3-CSDN博客

本文链接：https://blog.csdn.net/Ghhhhhggg/article/details/141404335

RAG

简述RAG

RAG全称是Retrieval-Augmented Generation,也被称为检索增强生成。RAG通过检索外部知识库获得额外语料，使用上下文学习ICL来改进LLM生成效果。RAG能够提高大模型回答的准确性，减少大模型幻觉的发生，使模型在新领域或者专门领域有更深的见解。

RAG运行过程

当用户发起一个生成请求时，首先基于用户的 prompt 从数据库中来检索（Retrieval）相关信息，然后这些信息会被组合到 prompt 中，为生成过程提供额外的上下文，最后将组合后的 prompt 输入 LLM 生成结果。

RAG 的一个主要优势是它无需针对特定任务再次进行训练。用户可以轻松附加外部知识库，从而丰富输入，优化模型的输出效果。由于其高可用性和低门槛，RAG 成为了 LLM 系统中非常受欢迎的方案之一，许多 LLM 应用程序都是基于 RAG 构建的。

数据库

数据库构建

ps：数据库构建流程中向量模型构建越好，语义相同的块在向量空间中距离越近。

数据库构建过程计算量大，需要离线计算

块转化为向量也被称为索引Indexing

数据库运行（检索）

提高检索准确性与效率的方法：

召回（Recall),数据库 中快速获得大量大概率相关的 Chunk，然后只有这些 Chunk 会参与计算向量相似度
重排 (Rerank)，即利用 重排模型（Reranker），使得越相似的结果排名更靠前。这样就能实现准确率稳定增长，即数据越多，效果越好

综上所述，在整个 检索 过程中，计算量的顺序是 召回 > 精排 > 重排，而检索效果的顺序则是 召回 < 精排 < 重排 。

当这一复杂的 检索 过程完成后，我们就会得到排好序的一系列 检索文档（Retrieval Documents）。

然后我们会从其中挑选最相似的 k 个结果，将它们和用户查询拼接成prompt的形式，输入到大模型。

RAG实战

案例1.源2.0-2B RAG实战

Step0 开通阿里云PAI-DSW试用

链接：阿里云免费试用 - 阿里云

需要在魔搭社区进行授权

链接：https://www.modelscope.cn/my/mynotebook/authorization

Step1 魔搭社区创建PAI实例

Step3 环境准备

打开创建的实例并进入终端，下载文件

git lfs install
git clone https://www.modelscope.cn/datasets/Datawhale/AICamp_yuan_baseline.git
cp AICamp_yuan_baseline/Task\ 3：源大模型RAG实战/* .

双击打开Task 3：源大模型RAG实战.ipynb，然后运行所有单元格。

在终端中安装streamlit

# 安装 streamlit
pip install streamlit==1.24.0

Step4 模型下载

# 向量模型下载
from modelscope import snapshot_download
model_dir = snapshot_download("AI-ModelScope/bge-small-zh-v1.5", cache_dir='.')

选用基于BERT架构的向量模型 bge-small-zh-v1.5，它是一个4层的BERT模型，最大输入长度512，输出的向量维度也为512。

# 源大模型下载
from modelscope import snapshot_download
model_dir = snapshot_download('IEITYuan/Yuan2-2B-Mars-hf', cache_dir='.')

还需要下载源大模型 IEITYuan/Yuan2-2B-Mars-hf

模型下载（step3已经执行）只需运行Task 3：源大模型RAG实战.ipynb

Step5 RAG实战

索引/向量化

# 定义向量模型类
class EmbeddingModel:
    """
    class for EmbeddingModel
    """

    def __init__(self, path: str) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained(path)

        self.model = AutoModel.from_pretrained(path).cuda()
        print(f'Loading EmbeddingModel from {path}.')

    def get_embeddings(self, texts: List) -> List[float]:
        """
        calculate embedding for text list
        """
        encoded_input = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        encoded_input = {k: v.cuda() for k, v in encoded_input.items()}
        with torch.no_grad():
            model_output = self.model(**encoded_input)
            sentence_embeddings = model_output[0][:, 0]
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
        return sentence_embeddings.tolist()

总体来说，这段代码定义了一个文本嵌入模型，该模型可以通过指定的路径加载预训练模型和分词器，并对于给定的文本列表计算相应的嵌入向量。使用了 GPU 以加速计算，并通过 L2 归一化处理输出的嵌入。

print("> Create embedding model...")
embed_model_path = './AI-ModelScope/bge-small-zh-v1___5'
embed_model = EmbeddingModel(embed_model_path)

检索

# 定义向量库索引类
class VectorStoreIndex:
    """
    class for VectorStoreIndex
    """

    def __init__(self, doecment_path: str, embed_model: EmbeddingModel) -> None:
        self.documents = []
        for line in open(doecment_path, 'r', encoding='utf-8'):
            line = line.strip()
            self.documents.append(line)

        self.embed_model = embed_model
        self.vectors = self.embed_model.get_embeddings(self.documents)

        print(f'Loading {len(self.documents)} documents for {doecment_path}.')

    def get_similarity(self, vector1: List[float], vector2: List[float]) -> float:
        """
        calculate cosine similarity between two vectors
        """
        dot_product = np.dot(vector1, vector2)
        magnitude = np.linalg.norm(vector1) * np.linalg.norm(vector2)
        if not magnitude:
            return 0
        return dot_product / magnitude

    def query(self, question: str, k: int = 1) -> List[str]:
        question_vector = self.embed_model.get_embeddings([question])[0]
        result = np.array([self.get_similarity(question_vector, vector) for vector in self.vectors])
        return np.array(self.documents)[result.argsort()[-k:][::-1]].tolist()

整体而言，这段代码展示了如何构建一个基于文本嵌入的向量索引，并提供查询功能。通过计算文档的嵌入向量，去除了底层的文本表示，使得可以通过余弦相似度进行有效的相似度搜索。该类能够高效地处理与查询相关的文档，适合于信息检索和问答系统等应用。

print("> Create index...")
doecment_path = './knowledge.txt'
index = VectorStoreIndex(doecment_path, embed_model)

传入问题

question = '介绍一下广州大学'
print('> Question:', question)

context = index.query(question)
print('> Context:', context)

结果准确传出知识库中第一条知识

生成

# 定义大语言模型类
class LLM:
    """
    class for Yuan2.0 LLM
    """

    def __init__(self, model_path: str) -> None:
        print("Creat tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
        self.tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

        print("Creat model...")
        self.model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

        print(f'Loading Yuan2.0 model from {model_path}.')

    def generate(self, question: str, context: List):
        if context:
            prompt = f'背景：{context}\n问题：{question}\n请基于背景，回答问题。'
        else:
            prompt = question

        prompt += "<sep>"
        inputs = self.tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
        outputs = self.model.generate(inputs, do_sample=False, max_length=1024)
        output = self.tokenizer.decode(outputs[0])

        print(output.split("<sep>")[-1])

print("> Create Yuan2.0 LLM...")
model_path = './IEITYuan/Yuan2-2B-Mars-hf'
llm = LLM(model_path)

print('> Without RAG:')
llm.generate(question, [])

print('> With RAG:')
llm.generate(question, context)

分词器Tokenizer

在上述代码中多次出现分词器，它的主要功能是将输入的文本数据转换成计算机可以处理的格式。

其核心功能分词，即将一段连续的文本分成独立的单词或词组（tokens）。比如，将句子 “今天天气很好” 分为 “今天”、“天气”、“很好”。

最终文本数据可能被转化为张量（pytorch框架中）

案例2. AI科研助手

产品功能

论文概括：AI科研助手能够快速阅读并理解一篇论文的主要内容，并生成简洁明了的摘要。
内容问答：用户可以通过自然语言提出与论文相关的问题，系统能够给出准确的答案。
关键词提取：从论文中提取关键概念和术语，帮助用户快速了解研究重点。
论文对比：比较两篇或者多篇论文的内容，总结他们之间的异同点。
相关工作推荐：根据用户的兴趣领域推荐相关的最新研究成果

核心代码

pip install pypdf faiss-gpu langchain langchain_community langchain_huggingface streamlit==1.24.0

streamlit run Task\ 3\ 案例：AI科研助手.py --server.address 127.0.0.1 --server.port 6006

# 导入所需的库
import torch
import streamlit as st
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms.base import LLM
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.text_splitter import RecursiveCharacterTextSplitter

from typing import Any, List, Optional

# 向量模型下载
from modelscope import snapshot_download
model_dir = snapshot_download('AI-ModelScope/bge-small-en-v1.5', cache_dir='./')

# 源大模型下载
from modelscope import snapshot_download
model_dir = snapshot_download('IEITYuan/Yuan2-2B-Mars-hf', cache_dir='./')

# 定义模型路径
model_path = './IEITYuan/Yuan2-2B-Mars-hf'

# 定义向量模型路径
embedding_model_path = './AI-ModelScope/bge-small-en-v1___5'

# 定义模型数据类型
torch_dtype = torch.bfloat16 # A10
# torch_dtype = torch.float16 # P100

# 定义源大模型类
class Yuan2_LLM(LLM):
    """
    class for Yuan2_LLM
    """
    tokenizer: AutoTokenizer = None
    model: AutoModelForCausalLM = None

    def __init__(self, mode_path :str):
        super().__init__()

        # 加载预训练的分词器和模型
        print("Creat tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(mode_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
        self.tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

        print("Creat model...")
        self.model = AutoModelForCausalLM.from_pretrained(mode_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        prompt = prompt.strip()
        prompt += "<sep>"
        inputs = self.tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
        outputs = self.model.generate(inputs,do_sample=False,max_length=4096)
        output = self.tokenizer.decode(outputs[0])
        response = output.split("<sep>")[-1].split("<eod>")[0]

        return response

    @property
    def _llm_type(self) -> str:
        return "Yuan2_LLM"

# 定义一个函数，用于获取llm和embeddings
@st.cache_resource
def get_models():
    llm = Yuan2_LLM(model_path)

    model_kwargs = {'device': 'cuda'}
    encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
    embeddings = HuggingFaceEmbeddings(
        model_name=embedding_model_path,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )
    return llm, embeddings

summarizer_template = """
假设你是一个AI科研助手，请用一段话概括下面文章的主要内容，200字左右。

{text}
"""

# 定义Summarizer类
class Summarizer:
    """
    class for Summarizer.
    """

    def __init__(self, llm):
        self.llm = llm
        self.prompt = PromptTemplate(
            input_variables=["text"],
            template=summarizer_template
        )
        self.chain = LLMChain(llm=self.llm, prompt=self.prompt)

    def summarize(self, docs):
        # 从第一页中获取摘要
        content = docs[0].page_content.split('ABSTRACT')[1].split('KEY WORDS')[0]

        summary = self.chain.run(content)
        return summary

chatbot_template  = '''
假设你是一个AI科研助手，请基于背景，简要回答问题。

背景：
{context}

问题：
{question}
'''.strip()

# 定义ChatBot类
class ChatBot:
    """
    class for ChatBot.
    """

    def __init__(self, llm, embeddings):
        self.prompt = PromptTemplate(
            input_variables=["text"],
            template=chatbot_template
        )
        self.chain = load_qa_chain(llm=llm, chain_type="stuff", prompt=self.prompt)
        self.embeddings = embeddings

        # 加载 text_splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=450,
            chunk_overlap=10,
            length_function=len
        )

    def run(self, docs, query):
        # 读取所有内容
        text = ''.join([doc.page_content for doc in docs])

        # 切分成chunks
        all_chunks = self.text_splitter.split_text(text=text)

        # 转成向量并存储
        VectorStore = FAISS.from_texts(all_chunks, embedding=self.embeddings)

        # 检索相似的chunks
        chunks = VectorStore.similarity_search(query=query, k=1)

        # 生成回复
        response = self.chain.run(input_documents=chunks, question=query)

        return chunks, response

def main():
    # 创建一个标题
    st.title('💬 Yuan2.0 AI科研助手')

    # 获取llm和embeddings
    llm, embeddings = get_models()

    # 初始化summarizer
    summarizer = Summarizer(llm)

    # 初始化ChatBot
    chatbot = ChatBot(llm, embeddings)

    # 上传pdf
    uploaded_file = st.file_uploader("Upload your PDF", type='pdf')

    if uploaded_file:
        # 加载上传PDF的内容
        file_content = uploaded_file.read()

        # 写入临时文件
        temp_file_path = "temp.pdf"
        with open(temp_file_path, "wb") as temp_file:
            temp_file.write(file_content)

        # 加载临时文件中的内容
        loader = PyPDFLoader(temp_file_path)
        docs = loader.load()

        st.chat_message("assistant").write(f"正在生成论文概括，请稍候...")

        # 生成概括
        summary = summarizer.summarize(docs)
        
        # 在聊天界面上显示模型的输出
        st.chat_message("assistant").write(summary)

        # 接收用户问题
        if query := st.text_input("Ask questions about your PDF file"):

            # 检索 + 生成回复
            chunks, response = chatbot.run(docs, query)

            # 在聊天界面上显示模型的输出
            st.chat_message("assistant").write(f"正在检索相关信息，请稍候...")
            st.chat_message("assistant").write(chunks)

            st.chat_message("assistant").write(f"正在生成回复，请稍候...")
            st.chat_message("assistant").write(response)

if __name__ == '__main__':
    main()

本学习笔记的参考原文链接如下：

DatawhaleDatawhale (linklearner.com)

声明：

本学习笔记为此笔记发布者使用，以便完成Datawhale AI 夏令营第四期大模型应用开发Task03打卡任务，同时记录自己的学习过程。该笔记在原文的基础上进行删改，对个别部分黑体着重强调以便及时记忆理解。本笔记并不做任何以盈利为目的的用途，仅供个人学习使用，若有侵权，请第一时间告知本人，侵删。