[RAG Explained, 2024] How Can RAG Optimize the Performance of Large Language Models (LLMs)?

 🧠💡 What is RAG? 🤔

RAG (Retrieval-Augmented Generation) is a technique for making large language models smarter:

  • It augments the model's understanding by supplying context 📚
  • It helps prevent the model from hallucinating 🛑
  • It significantly improves prediction accuracy ⬆️

Key Steps in RAG 🔑

  1. Chunk and ingest documents 📄
  2. Extract the key context 🔍
  3. Prompt the model with that context 💬 (sketched below)
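
To make these steps concrete, here is a minimal, self-contained sketch of the pipeline. The chunking and retrieval logic below is a toy illustration (keyword overlap instead of embeddings); a production system would use an embedding model and a vector store.

def chunk(text, size=200):
    # Step 1: split the document into fixed-size character chunks
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(chunks, question, k=1):
    # Step 2 (toy retriever): rank chunks by word overlap with the question
    q_words = set(question.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)[:k]

document = "RAG combines retrieval with generation. It grounds the model's answers in source documents."
question = "What does RAG combine?"

# Step 3: build a prompt around the retrieved context
context = "\n".join(retrieve(chunk(document), question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to the LLM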

Why Clean Your Document Data? 🧼

Cleaning document data matters because it:

  • Ensures accuracy ✅
  • Improves quality 🏆
  • Makes analysis easier 📊

Clean data can:

  • Improve generation quality
  • Reduce the likelihood of hallucinations
  • Boost speed and performance

4 Practical NLP Cleaning Techniques 💪

1. Data Cleaning and Noise Reduction 🧹

Let's look at a concrete example:

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Sample text
text = "I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let's clean some text 🧹"

# Tokenize
tokens = word_tokenize(text)

# Remove noise (strip non-word characters, then drop empty tokens)
cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]
cleaned_tokens = [token for token in cleaned_tokens if token]

# Normalize (lowercase)
cleaned_tokens = [token.lower() for token in cleaned_tokens]

# Remove stop words
stop_words = set(stopwords.words('english'))
cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]

# Lemmatize
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

print(cleaned_tokens)
# Output: ['love', 'coding', 'pythonprogramming', 'fun', 'let', 'clean', 'text']

This code shows how to clean text by stripping emojis, hashtags, and other extraneous characters. Here is a detailed breakdown:

  • Import the required libraries:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

This imports the regular-expressions library (re) and the relevant modules from the Natural Language Toolkit (nltk).

  • Sample text:
text = "I love coding! 😊 #PythonProgramming is fun! 🐍✨ Let's clean some text 🧹"

This is the raw text we want to process; it contains emojis, punctuation, and a hashtag.

  • Tokenization:
tokens = word_tokenize(text)

NLTK's word_tokenize function splits the text into a list of words.

  • Noise removal:
cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]
cleaned_tokens = [token for token in cleaned_tokens if token]

The regular expression strips every character that is not a letter, digit, or whitespace (punctuation, emojis, and so on); the second line drops any tokens that become empty as a result.

  • Normalization (lowercasing):
cleaned_tokens = [token.lower() for token in cleaned_tokens]

All words are converted to lowercase so the format is uniform.

  • Stop-word removal:
stop_words = set(stopwords.words('english'))
cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]

Common words that carry little meaning (such as "the", "is", and "at") are removed, since they usually contribute no important information.

  • Lemmatization:
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

Each word is reduced to its base form (e.g., "running" becomes "run"), which reduces the complexity introduced by inflection.

  • Print the result:
print(cleaned_tokens)

This outputs the processed list of words.

Final output: ['love', 'coding', 'pythonprogramming', 'fun', 'let', 'clean', 'text']

This process effectively cleans the raw text: it removes irrelevant symbols and words and standardizes what remains, leaving text that is better suited to downstream natural language processing tasks.

Note that this process can discard information (for example, the sentiment carried by emojis), so in practice you may need to tune how aggressively you clean, as the sketch below illustrates.
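
If emoji sentiment matters for your use case, one option is to convert emojis into text aliases rather than deleting them. A minimal sketch, assuming the third-party emoji package (pip install emoji):

import emoji  # third-party package: pip install emoji

text = "I love coding! 😊 #PythonProgramming is fun! 🐍✨"

# demojize() replaces each emoji with a readable alias such as :snake:,
# so the sentiment cue survives the later cleaning steps
print(emoji.demojize(text))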


2. Text Standardization 📏

Let's see how to do spelling correction:

import re

# Sample text (with spelling errors)
text_with_errors = """But it's not everythin about more language refinment.
Other important aspect is ensuring accurte retrievel by correcing product name spellings.
Additionally, refning descriptions enhnces the cohrence of the contnt."""

# Spelling-correction function
def correct_spelling_errors(text):
    spelling_corrections = {
        "everythin": "everything",
        "refinment": "refinement",
        "accurte": "accurate",
        "retrievel": "retrieval",
        "correcing": "correcting",
        "refning": "refining",
        "enhnces": "enhances",
        "cohrence": "coherence",
        "contnt": "content",
    }
    for mistake, correction in spelling_corrections.items():
        text = re.sub(mistake, correction, text)
    return text

# Correct the spelling errors
cleaned_text = correct_spelling_errors(text_with_errors)
print(cleaned_text)

This code shows how to correct spelling errors in text to improve its quality.

  • Import the required library:
import re

This imports Python's regular-expressions library, which is used for the text substitutions.

  • Sample text:
text_with_errors = """But it's not everythin about more language refinment.
Other important aspect is ensuring accurte retrievel by correcing product name spellings.
Additionally, refning descriptions enhnces the cohrence of the contnt."""

This is a passage containing several spelling errors.

  • Spelling-correction function:
def correct_spelling_errors(text):
    spelling_corrections = {
        "everythin": "everything",
        "refinment": "refinement",
        "accurte": "accurate",
        "retrievel": "retrieval",
        "correcing": "correcting",
        "refning": "refining",
        "enhnces": "enhances",
        "cohrence": "coherence",
        "contnt": "content",
    }
    for mistake, correction in spelling_corrections.items():
        text = re.sub(mistake, correction, text)
    return text

The function defines a correction dictionary, spelling_corrections, whose keys are misspellings and whose values are the correct spellings.
It then iterates over the dictionary, using re.sub() to replace each misspelling in the text with its correction.

  • Apply the corrections:
cleaned_text = correct_spelling_errors(text_with_errors)
print(cleaned_text)

This calls the correction function and prints the corrected text.

Key characteristics and caveats:

  1. Simple and direct: the method encodes known error-to-correction pairs explicitly, which works well for specific, anticipated mistakes.

  2. Pattern-based: re.sub() treats each key as a regular expression, so multi-word phrases (or patterns that include surrounding spaces) can also be corrected, allowing a degree of context sensitivity.

  3. Limitation: it can only fix predefined errors; misspellings not listed in the dictionary go uncorrected (a general-purpose spell checker, sketched after this list, can handle those).

  4. Potential pitfall: if the same misspelling has different correct forms in different contexts, a single dictionary entry will apply the wrong correction in some of them; substring matches can also corrupt longer words.

  5. Efficiency: for large texts, this one-substitution-at-a-time approach may be slow.

  6. Extensibility: new correction pairs are easy to add to the dictionary.
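
For misspellings that are not in a hand-built dictionary, a general-purpose spell checker is one option. A minimal sketch, assuming the third-party pyspellchecker package (pip install pyspellchecker):

from spellchecker import SpellChecker  # pip install pyspellchecker

spell = SpellChecker()

# correction() returns the most likely fix for each word,
# based on edit distance and word frequency
for word in ["accurte", "retrievel", "contnt", "coding"]:
    print(word, "->", spell.correction(word))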


3. Metadata Handling 🏷️

Extracting key entities with spaCy:

import spacy
import json

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = """In a blog post titled 'The Top 10 Tech Trends of 2024,' 
John Doe discusses the rise of artificial intelligence and machine learning 
in various industries. The article mentions companies like Google and Microsoft 
as pioneers in AI research. Additionally, it highlights emerging technologies 
such as natural language processing and computer vision."""

# Process the text with spaCy
doc = nlp(text)

# Extract named entities and their labels
meta_data = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]

# Convert to JSON format
meta_data_json = json.dumps(meta_data)

print(meta_data_json)

This code shows how to use spaCy to extract the key entities from a text, which helps in understanding its structure. The resulting metadata can be attached to chunks at ingestion time, as sketched below.
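
Here is a sketch of how that entity metadata might be attached to a chunk record so a retriever can filter on it. The record layout is illustrative, not any specific library's schema; the meta_data literal stands in for the list produced by the spaCy step above.

# Illustrative chunk record with entity metadata attached
meta_data = [
    {"text": "John Doe", "label": "PERSON"},
    {"text": "Google", "label": "ORG"},
    {"text": "Microsoft", "label": "ORG"},
]

chunk_record = {
    "text": "...chunk text...",
    "entities": meta_data,
    "orgs": [e["text"] for e in meta_data if e["label"] == "ORG"],
}

# A retriever could now restrict search to chunks that mention Google
print("Google" in chunk_record["orgs"])  # True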


4. Contextual Information Handling 🌐

Language translation

Using the googletrans library (an unofficial Google Translate client):

from googletrans import Translator

# Original text
text = "Hello, how are you?"

# Translate the text
translator = Translator()
translated_text = translator.translate(text, src='en', dest='es').text

print("原文:", text)
print("译文:", translated_text)

Topic modeling

Topic modeling with LDA:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing involves analyzing and understanding human languages.",
    "Deep learning algorithms mimic the structure and function of the human brain.",
    "Sentiment analysis aims to determine the emotional tone of a text."
]

# Convert the text into numeric feature vectors
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Apply LDA for topic modeling
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Display the top words for each topic
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    print(" ".join([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-5 - 1:-1]]))

This code shows how to use LDA for topic modeling, which helps in understanding the themes in a corpus.
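
Continuing from the snippet above, the fitted model can also assign each document to its dominant topic, which is useful for tagging chunks with topic metadata:

# Per-document topic mixtures (each row sums to 1)
doc_topics = lda.transform(X)

for i, dist in enumerate(doc_topics):
    print("Document %d: topic %d (weight %.2f)" % (i, dist.argmax() + 1, dist.max()))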

Hands-On Demo 🔬

Let's put these techniques together in one complete example:

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

synthetic_text = """
Sarah (S): Technology Enthusiast
Mark (M): AI Expert
S: Hey Mark! How's it going? Heard about the latest advancements in Generative AI (GA)?
M: Hey Sarah! Yes, I've been diving deep into the realm of GA lately. It's fascinating how it's shaping the future of technology!
S: Absolutely! I mean, GA has been making waves across various industries. What do you think is driving its significance?
M: Well, GA, especially Retrieval Augmented Generative (RAG), is revolutionizing content generation. It's not just about regurgitating information anymore; it's about creating contextually relevant and engaging content.
S: Right! And with Machine Learning (ML) becoming more sophisticated, the possibilities seem endless.
M: Exactly! With advancements in ML algorithms like GPT (Generative Pre-trained Transformer), we're seeing unprecedented levels of creativity in AI-generated content.
S: But what about concerns regarding bias and ethics in GA?
M: Ah, the age-old question! While it's true that GA can inadvertently perpetuate biases present in the training data, there are techniques like Adversarial Training (AT) that aim to mitigate such issues.
S: Interesting! So, where do you see GA headed in the next few years?
M: Well, I believe we'll witness a surge in applications leveraging GA for personalized experiences. From virtual assistants to content creation tools, GA will become ubiquitous in our daily lives.
S: That's exciting! Imagine AI-powered virtual companions tailored to our preferences.
M: Indeed! And with advancements in Natural Language Processing (NLP) and computer vision, these virtual companions will be more intuitive and lifelike than ever before.
S: I can't wait to see what the future holds!
M: Agreed! It's an exciting time to be in the field of AI.
S: Absolutely! Thanks for sharing your insights, Mark.
M: Anytime, Sarah. Let's keep pushing the boundaries of Generative AI together!
S: Definitely! Catch you later, Mark!
M: Take care, Sarah!
"""

# Tokenize
tokens = word_tokenize(synthetic_text)

# Remove noise and drop empty tokens
cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]
cleaned_tokens = [token for token in cleaned_tokens if token]

# Normalize (lowercase)
cleaned_tokens = [token.lower() for token in cleaned_tokens]

# Remove stop words
stop_words = set(stopwords.words('english'))
cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]

# Lemmatize
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

print(cleaned_tokens)

# Prepare the system prompt
MESSAGE_SYSTEM_CONTENT = """You are a customer service agent that helps 
a customer with answering questions. Please answer the question based on the
provided context below. 
Make sure not to make any changes to the context if possible
when preparing answers, so as to provide accurate responses. If the answer
cannot be found in context, just politely say that you do not know, 
do not try to make up an answer."""

# Prepare the interaction function (assumes the openai package is installed
# and OPENAI_API_KEY is set in the environment)
from openai import OpenAI

client = OpenAI()

def response_test(question: str, context: str, model: str = "gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": MESSAGE_SYSTEM_CONTENT,
            },
            {"role": "user", "content": question},
            {"role": "assistant", "content": context},
        ],
    )
    
    return response.choices[0].message.content

# Prepare a question
question1 = """What are some specific techniques in Adversarial Training (AT) 
that can help mitigate biases in Generative AI models?"""

# Get the answer
response = response_test(question1, synthetic_text)
print(response)

This complete example shows how to clean text data and then use the cleaned data in a conversation with the GPT-4 model.

Summary 🎯

RAG + data cleaning = more reliable, more coherent AI-generated results!

These cleaning techniques:

  • Fix a wide range of problems in raw text
  • Raise the quality of the input data
  • Help RAG models produce more accurate, meaningful results