AI大模型全栈工程师课程笔记 - RAG 检索增强生成

Michael阿明

已于 2023-12-08 21:45:57 修改

阅读量534

点赞数 1

分类专栏： AI应用文章标签：人工智能 RAG 向量数据库 LLM emdedding

于 2023-12-08 21:43:29 首次发布

本文链接：https://blog.csdn.net/qq_21201267/article/details/134682271

版权

AI应用专栏收录该内容

12 篇文章 6 订阅

订阅专栏

课程学习自知乎知学堂 https://www.zhihu.com/education/learning

如果侵权，请联系删除，感谢！

文章目录

1. RAG

RAG（Retrieval Augmented Generation），通过检索获取一些信息，传给大模型，提高回复的准确性。

一般流程：

离线步骤：文档加载切片 -> 向量化 -> 存入向量数据库
在线步骤：用户提问 -> 向量化 ->检索 -> 组装提示词 -> LLM -> 输出回复

2. 构建流程

2.1 文档加载与切分

import pathlib
def extract_text_from_pdf(filename, page_numbers=None, min_line_length=1):
    '''从 PDF 文件中（按指定页码）提取文字'''
    paragraphs = []
    buffer = ''
    full_text = ''
    # 提取全部文本
    for i, page_layout in enumerate(extract_pages(filename)):
        # 如果指定了页码范围，跳过范围外的页
        if page_numbers is not None and i not in page_numbers:
            continue
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                full_text += element.get_text() + '\n'
    # 按空行分隔，将文本重新组织成段落
    lines = full_text.split('\n')
    for text in lines:
        if len(text) >= min_line_length:
            buffer += (' ' + text) if not text.endswith('-') else text.strip('-')
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs

paragraphs = extract_text_from_pdf(pathlib.Path(__file__).parent.absolute() / "llama2.pdf", min_line_length=10)

2.2 传统检索引擎

安装 ElasticSearch

pip install elasticsearch8
pip install nltk

from elasticsearch8 import Elasticsearch, helpers
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import re

import warnings
warnings.simplefilter("ignore") #屏蔽 ES 的一些Warnings

nltk.download('punkt') # 英文切词、词根、切句等方法
nltk.download('stopwords') # 英文停用词库

def to_keywords(input_string):
    '''（英文）文本只保留关键字'''
    # 使用正则表达式替换所有非字母数字的字符为空格
    no_symbols = re.sub(r'[^a-zA-Z0-9\s]', ' ', input_string)
    word_tokens = word_tokenize(no_symbols)
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()
    # 去停用词，取词根
    filtered_sentence = [ps.stem(w) for w in word_tokens if not w.lower() in stop_words]
    return ' '.join(filtered_sentence)

切分文档存入 Es

# 1. 创建Elasticsearch连接
es = Elasticsearch(
    hosts=['http://localhost:9200'],  # 服务地址与端口
    # http_auth=("elastic", "*****"),  # 用户名，密码
)

# 2. 定义索引名称
index_name = "string_index"

# 3. 如果索引已存在，删除它（仅供演示，实际应用时不需要这步）
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

# 4. 创建索引
es.indices.create(index=index_name)

# 5. 灌库数据
actions = [
    {
        "_index": index_name,
        "_source": {
            "keywords": to_keywords(para),
            "text": para
        }
    }
    for para in paragraphs
]

# 6. 批量存储Es
helpers.bulk(es, actions)

关键字检索

def search(es, index_name, query_string, top_n=3):
    # ES 的查询语言
    search_query = {
        "match": {
            "keywords": to_keywords(query_string)
        }
    }
    res = es.search(index=index_name, query=search_query, size=top_n)
    return [hit["_source"]["text"] for hit in res["hits"]["hits"]]

results = search(es, "string_index", "how many parameters does llama 2 have?", 2)
for r in results:
    print(r + "\n")

搜索llama2有多少参数，找到了相关的文档，输出：

Llama 2 comes in a range of parameter sizes—7B, 13B, 
and 70B—as well as pretrained and fine-tuned variations.

1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. 
We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). 
We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.§

2.3 LLM接口封装

from openai import OpenAI
import os
# 加载环境变量
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv('../utils/.env'))  # 读取本地 .env 文件，里面定义了 OPENAI_API_KEY

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_API_BASE")
)

def get_completion(prompt, model="gpt-3.5-turbo"):
    '''封装 openai 接口'''
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # 模型输出的随机性，0 表示随机性最小
    )
	return response.choices[0].message.content

2.4 构建prompt

def build_prompt(prompt_template, **kwargs):
    '''将 Prompt 模板赋值'''
    prompt = prompt_template
    for k, v in kwargs.items():
        if isinstance(v,str):
            val = v
        elif isinstance(v, list) and all(isinstance(elem, str) for elem in v):
            val = '\n'.join(v)
        else:
            val = str(v)
        prompt = prompt.replace(f"__{k.upper()}__",val)
    return prompt

prompt_template = """
    你是一个问答机器人。
    你的任务是根据下述给定的已知信息回答用户问题。
    确保你的回复完全依据下述已知信息。不要编造答案。
    如果下述已知信息不足以回答用户的问题，请直接回复"我无法回答您的问题"。

    已知信息:
    __INFO__

    用户问：
    __QUERY__

    请用中文回答用户问题。
    """

user_query = "how many parameters does llama 2 have?"

# 1. 检索
search_results = search(es, "string_index", user_query, 2)

# 2. 构建 Prompt
prompt = build_prompt(prompt_template, info=search_results, query=user_query)
print("===Prompt===")
print(prompt)

# 3. 调用 LLM
response = get_completion(prompt)
print("===回复===")
print(response)

提示词如下：
在这里插入图片描述
GPT输出：

Llama 2有7B、13B和70B三种参数大小的变体。

传统的关键字检索，对同一个语义的不同描述，可能检索不到结果

3. 向量检索

把一个词句映射到 n 维空间的一个向量
构建句对（相似和不相似），训练双塔式模型 https://www.sbert.net
向量相似度：
余弦距离dot(a, b)/(norm(a)*norm(b))
欧式距离 norm(np.asarray(a)-np.asarray(b))

向量化

def get_embeddings(texts, model="text-embedding-ada-002"):
    '''封装 OpenAI 的 Embedding 模型接口'''
    data = client.embeddings.create(input = texts, model=model).data
    return [x.embedding for x in data]
test_query = ["测试文本"]
vec = get_embeddings(test_query)
print(vec[0][:10])  # 1536 维向量  [-0.0072620222344994545, -0.006227712146937847, -0.010517913848161697, 0.001511403825134039, -0.010678159072995186, 0.029252037405967712, -0.019783001393079758, 0.0053937085904181, -0.017029697075486183, -0.01215678546577692]

4. 向量数据库

# pip install chromadb

import chromadb
from chromadb.config import Settings


class MyVectorDBConnector:
    def __init__(self, collection_name, embedding_fn):
        chroma_client = chromadb.Client(Settings(allow_reset=True))

        # 为了演示，实际不需要每次 reset()
        chroma_client.reset()

        # 创建一个 collection
        self.collection = chroma_client.get_or_create_collection(name="demo")
        self.embedding_fn = embedding_fn

    def add_documents(self, documents, metadata={}):
        '''向 collection 中添加文档与向量'''
        self.collection.add(
            embeddings=self.embedding_fn(documents),  # 每个文档的向量
            documents=documents,  # 文档的原文
            ids=[f"id{i}" for i in range(len(documents))]  # 每个文档的 id
        )

    def search(self, query, top_n):
        '''检索向量数据库'''
        results = self.collection.query(
            query_embeddings=self.embedding_fn([query]),
            n_results=top_n
        )
        return results

paragraphs = extract_text_from_pdf(pathlib.Path(__file__).parent.absolute() / "llama2.pdf", page_numbers=[2, 3], min_line_length=10)

# 创建一个向量数据库对象
vector_db = MyVectorDBConnector("demo", get_embeddings)
# 向向量数据库中添加文档
vector_db.add_documents(paragraphs)

user_query = "Llama 2有多少参数"
results = vector_db.search(user_query, 2)  # 查询最相似的2个

for para in results['documents'][0]:
    print(para+"\n")

主流向量数据库
在这里插入图片描述

FAISS: Meta开源的向量检索引擎 https://github.com/facebookresearch/faiss
Pinecone: 商用向量数据库，只有云服务 https://www.pinecone.io/
Milvus: 开源向量数据库，同时有云服务 https://milvus.io/ （选项全都是Y）
Weaviate: 开源向量数据库，同时有云服务 https://weaviate.io/
Qdrant: 开源向量数据库，同时有云服务 https://qdrant.tech/
PGVector: Postgres的开源向量检索引擎 https://github.com/pgvector/pgvector
RediSearch: Redis的开源向量检索引擎 https://github.com/RediSearch/RediSearch
ElasticSearch 也支持向量检索 https://www.elastic.co/enterprise-search/vector-search

5. 基于向量检索的RAG

class RAG_Bot:
    def __init__(self, vector_db, llm_api, n_results=2):
        self.vector_db = vector_db
        self.llm_api = llm_api
        self.n_results = n_results
        
    def chat(self, user_query):
        # 1. 检索
        search_results = self.vector_db.search(user_query,self.n_results)
        
        # 2. 构建 Prompt
        prompt = build_prompt(prompt_template, info=search_results['documents'][0], query=user_query)
        
        # 3. 调用 LLM
        response = self.llm_api(prompt)
        return response

# 创建一个RAG机器人
bot = RAG_Bot(
    vector_db,
    llm_api=get_completion
)

user_query="llama 2有多少参数？"
response = bot.chat(user_query)
print(response)  # Llama 2有7B、13B和70B参数的变体。

可以替换其他的 embedding、LLM

# pip install sentence_transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
query_vec = model.encode(query, normalize_embeddings=True)

不是每个 Embedding 模型都对 余弦距离 和 欧氏距离 同时有效
哪种相似度计算有效要阅读模型的说明（通常都支持余弦距离计算）

6. 进阶知识

6.1 文本分割粒度

太大，检索不精准，太小，信息不全
问题的答案跨越两个片段

改进方法：

按一定粒度，部分重叠式的切割文本，使上下文更完整

6.2 检索后再排序

最合适的答案，有时候不一定排在检索结果的最前面

检索的时候，多召回一些文本
再用排序模型对 query 和召回的 doc 进行打分排序

user_query="how safe is llama 2"
search_results = semantic_search(user_query,5) # 召回文档

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)
scores = model.predict([(user_query, doc) for doc in search_results['documents'][0]])
# 按得分排序
sorted_list = sorted(zip(scores,search_results['documents'][0]), key=lambda x: x[0], reverse=True)
for score, doc in sorted_list:
    print(f"{score}\t{doc}\n")