Evaluating RAG Embedding Retrieval Quality

I. Contents

1 Evaluating with the official evaluator
2 Shared OpenAI keys
3 Generating an embedding training set with GPT
4 Fine-tuning the sentence_transformers model
5 Evaluating the sentence_transformers model

II. Implementation

Official notebook: https://github.com/run-llama/finetune-embedding/blob/main/evaluate.ipynb
1. Evaluating with the official evaluator
Dataset format (each field is a dict keyed by UUID):

datasets = {
    "corpus": {uuid1: doc1, uuid2: doc2, uuid3: doc3},                  # text id -> text
    "queries": {uuid1: question1, uuid2: question2, ...},               # question id -> question
    "relevant_docs": {uuid1: [answer_uuid], uuid2: [answer_uuid], ...}  # question id -> relevant text ids
}
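As a concrete sketch of this layout (made-up ids and texts; note that `InformationRetrievalEvaluator` expects dicts keyed by id):

```python
import uuid

# Toy dataset in the expected layout: corpus and queries map ids to strings;
# relevant_docs maps each query id to the list of corpus ids that answer it.
doc_id = str(uuid.uuid4())
query_id = str(uuid.uuid4())

dataset = {
    "corpus": {doc_id: "Substations can be monitored remotely with visualization tools."},
    "queries": {query_id: "How can substations be monitored remotely?"},
    "relevant_docs": {query_id: [doc_id]},
}

# every relevant id must exist in the corpus
assert all(d in dataset["corpus"]
           for ids in dataset["relevant_docs"].values() for d in ids)
```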
import pandas as pd
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']
    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)     # evaluator
    model = SentenceTransformer(model_id)       # load the model
    return evaluator(model, output_path='results/')
    
bge = "C:/Users/86188/Downloads/bge-small-en"
evaluate_st(val_dataset, bge, name='bge')

# the evaluator writes its metrics to results/Information-Retrieval_evaluation_<name>_results.csv
dfs = pd.read_csv("./results/Information-Retrieval_evaluation_bge_results.csv")
print(dfs)

Note: see section 5 below for details.
2. Shared OpenAI keys
Reference: https://blog.csdn.net/qq_32265503/article/details/130471485

1 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
2 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
3 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
4 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
5 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
6 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
7 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
8 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
9 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
10 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
11 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
12 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
13 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
14 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
15 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
16 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
17 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
18 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
19 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
3. Generating an embedding training set with GPT (I did not purchase an OpenAI key myself)
import json
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import MetadataMode

# load PDF files and parse them into nodes
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f'Loaded {len(docs)} docs')

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f'Parsed {len(nodes)} nodes')

    corpus = {node.node_id: node.get_content(metadata_mode=MetadataMode.NONE) for node in nodes}
    return corpus

TRAIN_FILES = ['C:/Users/86188/Downloads/智能变电站远程可视化运维与无人值守技术研究与实现_栾士岩.pdf']
VAL_FILES = ['C:/Users/86188/Downloads/基于人工智能的核电文档知识管理探索与实践_詹超铭.pdf']
train_corpus = load_corpus(TRAIN_FILES, verbose=True)      # load the articles
val_corpus = load_corpus(VAL_FILES, verbose=True)

# save the corpora as JSON
TRAIN_CORPUS_FPATH = './data/train_corpus.json'
VAL_CORPUS_FPATH = './data/val_corpus.json'
with open(TRAIN_CORPUS_FPATH, 'w+') as f:
    json.dump(train_corpus, f)

with open(VAL_CORPUS_FPATH, 'w+') as f:
    json.dump(val_corpus, f)



import re
import uuid
from llama_index.llms import OpenAI
from llama_index.schema import MetadataMode
from tqdm.notebook import tqdm


# load the saved corpora
with open(TRAIN_CORPUS_FPATH, 'r+') as f:
    train_corpus = json.load(f)

with open(VAL_CORPUS_FPATH, 'r+') as f:
    val_corpus = json.load(f)

import os
os.environ["OPENAI_API_KEY"] ="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"
# use gpt-3.5-turbo to generate the training questions

def generate_queries(
        corpus,
        num_questions_per_chunk=2,
        prompt_template=None,
        verbose=False,
):
    """
    Automatically generate hypothetical questions that could be answered with
    doc in the corpus.
    """
    llm = OpenAI(model='gpt-3.5-turbo')

    prompt_template = prompt_template or """\
    Context information is below.

    ---------------------
    {context_str}
    ---------------------

    Given the context information and not prior knowledge,
    generate only questions based on the below query.

    You are a Teacher/ Professor. Your task is to setup \
    {num_questions_per_chunk} questions for an upcoming \
    quiz/examination. The questions should be diverse in nature \
    across the document. Restrict the questions to the \
    context information provided.
    """

    queries = {}
    relevant_docs = {}
    for node_id, text in tqdm(corpus.items()):

        query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
        print(query)
        response = llm.complete(query)       # generate questions for this chunk

        result = str(response).strip().split("\n")
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]
        questions = [question for question in questions if len(question) > 0]

        for question in questions:
            question_id = str(uuid.uuid4())
            queries[question_id] = question
            relevant_docs[question_id] = [node_id]
    return queries, relevant_docs
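The post-processing above strips leading enumeration (`1.`, `2)`, …) from each line of the model's reply and drops empty lines; a self-contained check with a made-up response string:

```python
import re

# A typical gpt-3.5-turbo reply: questions prefixed with "1." / "2)" etc.
response = "1. What is remote O&M?\n2) Why use visualization?\n\n"

result = response.strip().split("\n")
# remove the leading "1." / "2)" enumeration, then drop empty lines
questions = [re.sub(r"^\d+[\).\s]", "", q).strip() for q in result]
questions = [q for q in questions if len(q) > 0]

print(questions)  # ['What is remote O&M?', 'Why use visualization?']
```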


'''Automatically generate the training/validation query sets'''

train_queries, train_relevant_docs = generate_queries(train_corpus)
val_queries, val_relevant_docs = generate_queries(val_corpus)


TRAIN_QUERIES_FPATH = './data/train_queries.json'
TRAIN_RELEVANT_DOCS_FPATH = './data/train_relevant_docs.json'
VAL_QUERIES_FPATH = './data/val_queries.json'
VAL_RELEVANT_DOCS_FPATH = './data/val_relevant_docs.json'

with open(TRAIN_QUERIES_FPATH, 'w+') as f:
    json.dump(train_queries, f)

with open(TRAIN_RELEVANT_DOCS_FPATH, 'w+') as f:
    json.dump(train_relevant_docs, f)

with open(VAL_QUERIES_FPATH, 'w+') as f:
    json.dump(val_queries, f)

with open(VAL_RELEVANT_DOCS_FPATH, 'w+') as f:
    json.dump(val_relevant_docs, f)

'''Convert the generated data into the standard dataset format'''
TRAIN_DATASET_FPATH = './data/train_dataset.json'
VAL_DATASET_FPATH = './data/val_dataset.json'

train_dataset = {
    'queries': train_queries,
    'corpus': train_corpus,
    'relevant_docs': train_relevant_docs,
}

val_dataset = {
    'queries': val_queries,
    'corpus': val_corpus,
    'relevant_docs': val_relevant_docs,
}


with open(TRAIN_DATASET_FPATH, 'w+') as f:
    json.dump(train_dataset, f)

with open(VAL_DATASET_FPATH, 'w+') as f:
    json.dump(val_dataset, f)
4. Fine-tuning the sentence_transformers model
# load the base model
from sentence_transformers import SentenceTransformer
model_id = "BAAI/bge-small-en"
model = SentenceTransformer(model_id)

import json
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

TRAIN_DATASET_FPATH = './data/train_dataset.json'
VAL_DATASET_FPATH = './data/val_dataset.json'
BATCH_SIZE = 10

with open(TRAIN_DATASET_FPATH, 'r+') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r+') as f:
    val_dataset = json.load(f)

dataset = train_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

examples = []
for query_id, query in queries.items():
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

from sentence_transformers import losses
loss = losses.MultipleNegativesRankingLoss(model)
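MultipleNegativesRankingLoss uses in-batch negatives: within a batch, query i's paired text is the positive and every other text in the batch is a negative, scored by scaled cosine similarity and trained with cross-entropy. A rough pure-Python sketch of the objective (not the library's implementation; toy 2-D embeddings):

```python
import math

def mnr_loss(query_embs, doc_embs, scale=20.0):
    """In-batch negatives: pair i is the positive for query i; every other
    document in the batch acts as a negative. Cross-entropy over scaled
    cosine similarities, with the diagonal as the target class."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    total = 0.0
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, dj)) for dj in d]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        total += -(scores[i] - log_denom)   # cross-entropy, target = i
    return total / len(q)

queries   = [[1.0, 0.0], [0.0, 1.0]]
good_docs = [[1.0, 0.1], [0.1, 1.0]]   # doc i matches query i -> low loss
bad_docs  = [[0.1, 1.0], [1.0, 0.1]]   # pairs swapped -> high loss
assert mnr_loss(queries, good_docs) < mnr_loss(queries, bad_docs)
```

This is why the training examples above only need (query, positive text) pairs: the negatives come for free from the rest of the batch.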

from sentence_transformers.evaluation import InformationRetrievalEvaluator

dataset = val_dataset
corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

# number of training epochs
EPOCHS = 2
warmup_steps = int(len(loader) * EPOCHS * 0.1)
model.fit(
    train_objectives=[(loader, loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='exp_finetune',       # output directory for the fine-tuned model
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)
5. Evaluating the sentence_transformers model
import json
from tqdm.notebook import tqdm
import pandas as pd
pd.set_option("display.max_columns",None)
import logging
logging.basicConfig(level=logging.INFO,format="%(asctime)s-%(filename)s-%(message)s")
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.schema import TextNode
from llama_index.embeddings import OpenAIEmbedding

import os
os.environ["OPENAI_API_KEY"] ="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"

'''Load the datasets'''
TRAIN_DATASET_FPATH = './data/train_dataset.json'
VAL_DATASET_FPATH = './data/val_dataset.json'
with open(TRAIN_DATASET_FPATH, 'r+') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r+') as f:
    val_dataset = json.load(f)

'''Retrieval-based evaluation'''
def evaluate(
        dataset,
        embed_model,
        top_k=5,
        verbose=False,
):
    corpus = dataset['corpus']                      # node texts, keyed by node id
    queries = dataset['queries']                     # questions
    relevant_docs = dataset['relevant_docs']         # node ids containing the answer

    # embedding model
    service_context = ServiceContext.from_defaults(embed_model=embed_model)
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]    # wrap corpus texts as nodes
    index = VectorStoreIndex(
        nodes,
        service_context=service_context,
        show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):

        retrieved_nodes = retriever.retrieve(query)                        # retrieve top-k nodes

        retrieved_ids = [node.node.node_id for node in retrieved_nodes]    # retrieved node ids
        expected_id = relevant_docs[query_id][0]                           # ground-truth id
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_ids,
            'expected': expected_id,
            'query': query_id,
        }
        eval_results.append(eval_result)
    return eval_results
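The hit rate computed below from these results is simply the fraction of queries whose gold id appears in the top-k list; with toy results in the same shape `evaluate()` returns:

```python
# Hit rate = fraction of queries whose expected doc id appears in the
# top-k retrieved ids. Toy eval_results in the shape evaluate() returns:
eval_results = [
    {"is_hit": True,  "retrieved": ["a", "b"], "expected": "a", "query": "q1"},
    {"is_hit": False, "retrieved": ["c", "d"], "expected": "a", "query": "q2"},
    {"is_hit": True,  "retrieved": ["a", "e"], "expected": "a", "query": "q3"},
]
hit_rate = sum(r["is_hit"] for r in eval_results) / len(eval_results)
print(hit_rate)  # 0.6666666666666666
```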


'''Embedding retrieval: the official evaluator'''
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']
    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)     # evaluator
    model = SentenceTransformer(model_id)       # load the model
    return evaluator(model, output_path='results/')



# hit rate of OpenAI's own embedding model (requires a valid key)
# ada = OpenAIEmbedding()
# ada_val_results = evaluate(val_dataset, ada)
# df_ada = pd.DataFrame(ada_val_results)
# hit_rate_ada = df_ada['is_hit'].mean()


'''Hit rate of bge-small-en retrieval'''
bge = "local:/home/jiayafei_linux/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)
# evaluate the retrieval results: hit rate
df_bge = pd.DataFrame(bge_val_results)
hit_rate_bge = df_bge['is_hit'].mean()
print(hit_rate_bge)
bge = "C:/Users/86188/Downloads/bge-small-en"
evaluate_st(val_dataset, bge, name='bge')

# the evaluator writes its metrics to results/Information-Retrieval_evaluation_<name>_results.csv
dfs = pd.read_csv("./results/Information-Retrieval_evaluation_bge_results.csv")
print(dfs)
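The results CSV contains ranking metrics such as accuracy@k, MRR@k, and NDCG@k (exact column names come from sentence_transformers). As a minimal sketch of one of them, MRR@k per query is the reciprocal rank of the first relevant hit within the top k:

```python
def mrr_at_k(ranked_ids, expected_id, k=10):
    """Reciprocal rank of the first relevant hit in the top-k, else 0."""
    for rank, node_id in enumerate(ranked_ids[:k], start=1):
        if node_id == expected_id:
            return 1.0 / rank
    return 0.0

print(mrr_at_k(["b", "a", "c"], "a"))  # 0.5 (first hit at rank 2)
print(mrr_at_k(["x", "y"], "a"))       # 0.0 (no hit in top-k)
```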
