(24-4-02) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): RAG (2)

12.5.3  Question Formatting

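In a chat setting, users do not always phrase their input as a complete, self-contained question. Consider the following exchange:

User: What is a Python variable?
Assistant: A variable is a storage of value
User: And a function?

A human reader easily recognizes that the user is really asking "What is a Python function?". But if only "And a function?" is sent to the model, it may not understand the user's intent. Retrieval suffers as well: a query this short may fail to match the relevant context stored in the FAISS database. To avoid this, we use the model itself to reformulate the question, guided by few-shot prompting.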

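The LangChain chain built for this purpose receives a list of examples (chat_history) that follow the pattern below. The last entry, the actual user question, is left for the model to reformulate.

Question = user question
QuestionReformatted = reformulated question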
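As a minimal sketch of this pattern, the snippet below renders a few example pairs and leaves the final QuestionReformatted line empty for the model to complete. The helper names render_examples and build_few_shot_prompt are hypothetical and used only for illustration; they are not part of the project code.

```python
# Hypothetical helpers illustrating the few-shot pattern described above.
def render_examples(pairs):
    """Render (question, reformulated) pairs, one blank line between examples."""
    lines = []
    for question, reformatted in pairs:
        lines.append(f"Question = {question}")
        lines.append(f"QuestionReformatted = {reformatted}")
        lines.append("")  # blank separator line
    return "\n".join(lines)


def build_few_shot_prompt(pairs, user_question):
    """Append the real question last, with its answer left open for the model."""
    return (
        render_examples(pairs)
        + f"Question = {user_question}\n"
        + "QuestionReformatted = "
    )


examples = [
    ("And a function?", "What is a Python function?"),
    ("What is a Python list?", "What is a Python list?"),
]
prompt = build_few_shot_prompt(examples, "And a generator?")
print(prompt)
```

Because every example ends with a completed QuestionReformatted line, the open line at the end nudges the model to continue the pattern rather than answer the question.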

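(1) Define a template that instructs the model to rewrite the user's question, based on the previous conversation, so that it is explicit, concise, complete, detailed, and self-contained. If there is no chat history, or if the question is already well formed and needs no modification, the model should return the original question unchanged. The reformulated question is used for document retrieval, so the model is asked to return only that question, on a single line.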

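# These imports may already be present from an earlier section
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Prompt that asks the model to rewrite a follow-up into a standalone question
question_formatting_template = """
Act as an ASSISTANT to reformulate the Question into QuestionReformatted to make it explicit, concise, complete, detailed and self-contained based on the previous conversation shown in the CHAT_HISTORY (if it exists), else return the QuestionReformatted as is.
If the Question is a well-formatted question, just return the Question.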
If the Question does not need any modification, return the Question as is.
The QuestionReformatted must be explicit, concise, complete, detailed and self-contained to be used for a document retrieval process.
Only return the QuestionReformatted in a single line.

**CHAT HISTORY**
{chat_history}

Question = {question}
QuestionReformatted = 
"""

question_formatting_prompt_template = PromptTemplate(
    template=question_formatting_template,
    input_variables=['question', 'chat_history']
)

formatting_chain = LLMChain(
    prompt=question_formatting_prompt_template,
    llm=gemma_lc
)

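In the code above, question_formatting_template is a string template that references two input variables, question and chat_history. A PromptTemplate object binds the template to those input variables so that they can be filled in later. Finally, LLMChain combines the prompt with the previously defined LangChain model wrapper gemma_lc to form the complete question-formatting chain.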
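To make the role of the input variables concrete, here is a rough standard-library sketch of what binding a template to its variables involves: find the placeholders, check them against the declared names, then substitute values. It is not LangChain's implementation; the functions placeholders and fill and the shortened template are assumptions made for illustration.

```python
from string import Formatter

# Shortened stand-in for question_formatting_template (assumption: only the
# placeholders matter for this illustration).
template = (
    "**CHAT HISTORY**\n"
    "{chat_history}\n"
    "\n"
    "Question = {question}\n"
    "QuestionReformatted = "
)


def placeholders(text):
    """Return the set of {name} placeholders that appear in the template."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion)
    return {name for _, name, _, _ in Formatter().parse(text) if name}


def fill(text, input_variables, **values):
    """Check the declared variables against the template, then substitute."""
    expected = placeholders(text)
    if expected != set(input_variables):
        raise ValueError(
            f"template expects {sorted(expected)}, declared {sorted(input_variables)}"
        )
    missing = expected - set(values)
    if missing:
        raise KeyError(f"missing values for {sorted(missing)}")
    return text.format(**values)


filled = fill(
    template,
    ["question", "chat_history"],
    question="And a function?",
    chat_history="Question = And an iterator?\nQuestionReformatted = What is a Python iterator?",
)
print(filled)
```

Declaring the variables up front means a typo in a placeholder name fails early, before any call to the model is made.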

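(2) Define the function format_question(question), which takes a question and reformulates it. The function first builds a short conversation history containing a user question and the assistant's answer. It then defines question_pairs, a list of questions paired with their reformulated versions, appends these pairs to the conversation history, and calls formatting_chain to reformulate the incoming question. Finally, it checks whether the model's response is more than twice as long as the original question. If so, it returns the original question; otherwise it returns the reformulated one.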

def format_question(question):
    # Seed conversation that the few-shot follow-up questions refer back to
    messages = [
        {"role": "user", "content": "What is a Python variable?"},
        {"role": "assistant", "content": "A python variable is a storage of value"},
    ]

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", token=HF_TOKEN)
    chat_history = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    
    question_pairs = [
                      # Follow-up questions
                      {"question":"And an iterator?", "questionReformatted":"What is a Python iterator?"},
                      {"question":"And a function?", "questionReformatted":"What is a Python function?"},
                      {"question":"And a generator?", "questionReformatted":"What is a Python generator?"},
                      {"question":"And an array?", "questionReformatted":"What is a Python array?"},
                      {"question":"And a list?", "questionReformatted":"What is a Python list?"},
                      
                      # Follow-up questions, different format
                      {"question":"What about an iterator?", "questionReformatted":"What is a Python iterator?"},
                      {"question":"What about a function?", "questionReformatted":"What is a Python function?"},
                      {"question":"What about a generator?", "questionReformatted":"What is a Python generator?"},
                      {"question":"What about an array?", "questionReformatted":"What is a Python array?"},
                      {"question":"What about a list?", "questionReformatted":"What is a Python list?"},
                      
                      # A few examples where no reformatting is needed
                      {"question":"What is a Python iterator?", "questionReformatted":"What is a Python iterator?"},
                      {"question":"What is a Python function?", "questionReformatted":"What is a Python function?"},
                      {"question":"What is a Python generator?", "questionReformatted":"What is a Python generator?"},
                      {"question":"What is a Python array?", "questionReformatted":"What is a Python array?"},
                      {"question":"What is a Python list?", "questionReformatted":"What is a Python list?"},
                        
                      # Examples of instructions which are not questions
                      {"question":"Give examples of using different regex functions in Python.", "questionReformatted":"Give examples of using different regex functions in Python."},
                      {"question":"Explain the concept of context managers and how to use them with resources like files.", "questionReformatted":"Explain the concept of context managers and how to use them with resources like files."},
                      {"question":"What is the difference between an integer and a float in Python?", "questionReformatted":"What is the difference between an integer and a float in Python?"},
                      {"question":"Differentiate between the True and False values. Provide examples and real-world use cases for each.", "questionReformatted":"Differentiate between the True and False values. Provide examples and real-world use cases for each."},
                      {"question":"Define a class and a module in Python. Provide a real-world example for each.", "questionReformatted":"Define a class and a module in Python. Provide a real-world example for each."},
                       ]
    
    for q_pair in question_pairs:
        chat_history += f"\nQuestion = {q_pair['question']}\nQuestionReformatted = {q_pair['questionReformatted']}\n"
        
    res = formatting_chain.invoke({'question': question, 'chat_history': chat_history})
    
    # a response which is much bigger than the question itself most probably means that the model has answered the question
    # instead of reformatting it. In this case it is better to just return the question
    if len(res["text"]) > len(question) * 2:
        return question

    return res["text"]
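The length check at the end of format_question can be isolated into a small pure function, which makes the heuristic easy to test without a model. The sketch below is an illustration rather than the project's code: the name pick_question and the max_ratio parameter are hypothetical, and keeping only the first non-empty line of the output is an added assumption that the original function does not make.

```python
def pick_question(original, model_output, max_ratio=2):
    """Return the reformulated question, or the original if the output looks like an answer."""
    # Assumption: keep only the first non-empty line, since the prompt asks
    # for a single-line result.
    lines = [line.strip() for line in model_output.splitlines() if line.strip()]
    candidate = lines[0] if lines else ""

    # An output much longer than the question most likely means the model
    # answered the question instead of rewriting it.
    if not candidate or len(candidate) > len(original) * max_ratio:
        return original
    return candidate


print(pick_question("And a function?", " What is a Python function?\n"))
print(pick_question("And a function?", "A function is a reusable block of code that takes arguments."))
```

The first call returns the reformulated question; the second falls back to the original because the output is more than twice its length. The threshold is a heuristic: a very short follow-up that needs a long rewrite will be rejected too.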

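(3) The following code defines the prompt template for the chat itself. The model is instructed to act as a chatbot that answers questions about Python. The template contains the chat history so far plus retrieved context to help the model answer the next question. The context is supplied only while answering the current question and is not stored in the chat history. This keeps the model focused on the most recent question and saves tokens, since the chat history and context combined would quickly exceed 8,000 tokens.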

template = """
## Instructions
You are a chat bot that answers Python questions. Here is the chat history so far and some context to help you with the next question:

## Context: 
{context}

## Instructions
You are a chat bot that answers Python-related questions. Use the context provided above, as well as the chat history to answer the question. 
If the context doesn't contain anything about the question, check the chat history.
If the context doesn't help you, answer the question yourself. Use the context to get information only, don't copy its style. 
Make sure your answer is detailed, explicit and complete and related to the previous messages in the chat.
Give details and examples:

## Chat History: Your previous messages are marked as "assistant" and the user's messages are marked as "user". Use this section when the user asks about previous questions:
{chat_history}

Answer the question and stop generating. Don't generate follow-up questions. Focus on the first question.

Question = {question}
Answer = """

prompt_template = PromptTemplate(
    template=template,
    input_variables=['question', 'context', "chat_history"]
)

llm_chain = LLMChain(
    prompt=prompt_template,
    llm=gemma_lc
)

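At each step, the user's query is captured and reformulated where necessary. The chat history is compiled from the list of messages with tokenizer.apply_chat_template. This ensures that the correct tokens mark the start and end of each conversation turn, which further helps the model play the role of a chatbot assistant.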
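To show what those turn markers look like, the sketch below approximates the layout that Gemma's instruction-tuned chat template produces: each message is wrapped in start-of-turn and end-of-turn tokens, and the assistant role is rendered as model. This is an illustration only. The real template ships with the tokenizer and should be applied through apply_chat_template; the function name render_gemma_history is hypothetical.

```python
def render_gemma_history(messages, bos="<bos>"):
    """Approximate the Gemma chat layout for a list of role/content messages."""
    parts = [bos]
    for message in messages:
        # Gemma's template labels assistant turns as "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is a Python variable?"},
    {"role": "assistant", "content": "A variable is a storage of value"},
]
history = render_gemma_history(messages)
print(history)
```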

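12.5.4  Chat Function

Create the chat function chat, which lets the user interact with a Python question-answering assistant. The user can ask questions about Python and receive detailed answers. The function relies on the LangChain chain and the FAISS database: it reformulates the user's question, retrieves relevant information, and generates a response.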

def chat(llm_chain):
    exit_commands = ["quit", "exit"]
    
    welcome_message = ("Ask me anything about Python!"
                f"\nIf you want to leave you can use the following commands: {', '.join(exit_commands)}")
    
    print(welcome_message)
    
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", token=HF_TOKEN)
    
    # List that stores the chat history
    messages = []
    
    while True:
        user_input = input(">")
        
        if user_input in exit_commands:
            return
        
        # Reformulate the user input if necessary
        formatted_question = format_question(user_input)
        print(formatted_question)
                        
        # Retrieve relevant information from the FAISS database using the reformulated question
        context = get_answers(formatted_question)
        
        # Format the chat history with the tokenizer to get the correct start-of-turn and end-of-turn tokens
        chat_history = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
              
        # Call the model
        result = llm_chain.invoke({'question': formatted_question, 'context': context, 'chat_history': chat_history})
        
        # Append the user request and the model response to the chat history
        messages.append({"role": "user", "content": user_input})
        messages.append({"role": "assistant", "content": result["text"]})
        
        # Show the response
        print(result["text"])
        #display(Markdown(result["text"]))

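The code above works as follows:

  1. It first prints a welcome message and tells the user which commands end the chat.
  2. It then enters an infinite loop and waits for user input.
  3. If the input is one of the exit commands, the function returns.
  4. Otherwise, the input is reformulated where necessary and the reformulated question is printed.
  5. The reformulated question is used to retrieve relevant information from the FAISS database, and the result is stored as the context.
  6. The chat history is formatted with the tokenizer's chat template so that the correct tokens mark the start and end of each turn.
  7. The model is called with the reformulated question, the context, and the chat history.
  8. The user's request and the model's response are appended to the chat history.
  9. Finally, the model's response is displayed and the loop waits for the next input.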
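The steps above can be sketched as a single-turn function whose collaborators are passed in as plain callables, which makes the flow testable without a model or a vector store. Everything here is a stand-in: chat_turn, fake_reformulate, fake_retrieve, and fake_generate are hypothetical names that take the place of format_question, get_answers, and llm_chain.invoke.

```python
def chat_turn(user_input, messages, reformulate, retrieve, generate):
    """Run one chat turn and record it in the history."""
    question = reformulate(user_input)   # step 4: reformulate if needed
    context = retrieve(question)         # step 5: look up context
    history = list(messages)             # step 6: history before this turn
    answer = generate(question, context, history)  # step 7: call the model

    # Step 8: store the raw user input and the answer. The retrieved context
    # is deliberately not stored, so it does not consume tokens in later turns.
    messages.append({"role": "user", "content": user_input})
    messages.append({"role": "assistant", "content": answer})
    return answer


# Stand-ins for the real components (assumptions for illustration only)
def fake_reformulate(text):
    return "What is a Python function?" if text == "And a function?" else text


def fake_retrieve(question):
    return f"[retrieved notes for: {question}]"


def fake_generate(question, context, history):
    return f"Answer to '{question}' using {len(history)} earlier messages"


messages = []
first = chat_turn("What is a Python variable?", messages,
                  fake_reformulate, fake_retrieve, fake_generate)
second = chat_turn("And a function?", messages,
                   fake_reformulate, fake_retrieve, fake_generate)
print(second)
```

Passing the collaborators as arguments also makes it straightforward to swap the retrieval or generation step later without touching the loop itself.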

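Below are the answers that the fine-tuned model, combined with RAG, gave to the first five questions in the dataset. Each example begins with an Example label, followed by the prompt, and then the model's answer. The results are markedly better: the model consistently answers the question and provides example code and additional context for each Python question.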

------------------------------------------------------------------------------------------------------------------------------------
Example 1: Analyze the question asked bellow about the Python programming language and return an answer that would help a software engineer understand the concept better. [ Give examples of using different regex functions in Python.] =
Sure. Here are examples of using the following regular expression functions in Python:
1.re.search():
This function searches for the first occurrence of the regular expression in the string and returns a match object.
The match object contains the start and end indices of the match within the string.
The re.search() function is useful for finding a single match in a string.
2.re.match():
This function is similar to re.search(), but it stops at the first match and returns a match object.
The re.match() function is useful when you want to find only the first match in a string.
3.re.findall():
This function finds all matches of the regular expression in the string.
It returns a list of all matches.
The re module provides the re.findall() function for finding all matches in a string.
4.re.compile():
This function compiles the regular expression into a regular expression object.
The compiled regular expression can be used with other functions, such as re.search() and re.match().
re.compile() is useful when you need to use the same regular expression multiple times.
5.re.group():
This function extracts the matched substring from the string.
The index of the start of the match is passed as an argument.
re.group() is useful for extracting specific parts of a match.
6.re.escape():
This function escapes special characters in the string.
The escaped string can be used with other regular expression functions.
re.escape() is useful when you need to search for patterns in a string that contains special characters.
Example 2: Analyze the question asked bellow about the Python programming language and return an answer that would help a software engineer understand the concept better.[ Explain the concept of context managers and how to use them with resources like files.] =
Sure, here's an explanation of context managers and how to use it with resources like files:

Context Managers: A context manager is an object that manages the lifecycle of a resource, such as a file. It provides a mechanism to automatically open, use, and close the resource without the need for manual resource management. This helps to ensure proper resource usage and prevents memory leaks.
How to Use Context Managers with Files:
1.Open the file using the open() function. The open() function takes a filename as its first argument and a mode as its second argument. The mode specifies how the file should be opened, such as read or write.
2.Use the context manager's methods to access and manipulate the file. For example, the read() method can be used to read data from a file, while the write() method can be used to write data to a file.
3.Use the close() method to close the file when you're finished. This method automatically releases the file's resources, preventing memory leaks.
4.Use the context manager object itself as a context manager. This allows you to automatically close the file when it goes out of scope, ensuring proper cleanup even if an exception occurs.
Example:
# Open a file for reading
with open("myfile.txt") as file:
    # Read the file's contents
    contents = file.read().strip()

# Open a file for writing
with open("output.txt", "w") as file:
    # Write the contents to the file
    file.write(contents)
# Close the files automatically
Benefits of Using Context Managers:
Automatic resource management: Context managers handle the opening, closing, and cleanup of resources without manual intervention.
Prevents memory leaks: By automatically closing resources, context managers prevent memory leaks that can occur when resources are not closed properly.
Simplifies code: Using context managers can simplify your code by eliminating the need to manually open, close, and clean up resources.
Provides a consistent interface: Context managers provide a consistent interface for managing resources, making it easier to understand and use.
Example 3: Analyze the question asked bellow about the Python programming language and return an answer that would help a software engineer understand the concept better. [ What is the difference between an integer and a float in Python?] =

Sure. Here's the difference: Integer:
An integer is a whole number.
It can represent whole numbers or fractions (1/2, for example).
It is used for mathematical calculations and comparisons with whole numbers. Float:
A float is a real number with a decimal point.
It can represent both whole numbers and fractions.
It is used for scientific and technical applications where precision is important. Example:
integer = 10  # An integer
float = 12.5  # A float
print(f"Integer: {integer}")  # Output: Integer: 10
print(f"Float: {float}")  # Output: Float: 12.5
Note: The int() and float functions can convert between integers and floats. However, the conversion may be lossy (some precision may be lost).
Sure. Here's how to perform division with a floating-point number in Python:
# Define two numbers to divide
num1, num2 = 10.0, 2.0  # The dividend and divisor
# Perform division
result, reminder = divmod(num1, num2)  # Use the divmod() function
# Print the result and reminder
print(f"{num1}/ {num2} = {result} with a reminder of {reminder}")
Explanation:
num1 is the dividend, and num2 is the divisor.
We use the divmod() function to perform the division.
The divmod() function returns two values: the quotient (
Example 4: Analyze the question asked bellow about the Python programming language and return an answer that would help a software engineer understand the concept better. [ Differentiate between the True and False values. Provide examples and real-world use cases for each.] =
Sure. Here's the difference between True and False values in Python:
True:
A truthy value that represents truth.
It is evaluated to True in a boolean context.
It is used to control program flow, execute different code blocks, and perform logical operations. False:
A falsy value that represents falsity or falsehood.
It is evaluated to False in a boolean context.
False is the opposite of True. Examples:
True: True, 1, True, False
False: False, 0, None, False Real-world use cases:
Conditional statements: if and else use True and False to determine which block of code to execute.
Boolean operations: True and False can be used in logical expressions to perform different operations.
Data validation: True and False can be used to validate user input or check for errors.
Boolean variables: True and False are used to represent the state of a variable, such as whether a button is pressed or not.
Example 5: Analyze the question asked bellow about the Python programming language and return an answer that would help a software engineer understand the concept better. [ Define a class and a module in Python. Provide a real-world example for each.]
Class: A class is a blueprint that defines the behavior and attributes of an object. It contains methods, which are functions that belong to the class and can be called directly on the object. A class also has attributes, which are data members that belong to the class and can't be accessed directly.

Module: A module is a collection of related classes, functions, and data that are grouped together to perform a particular task. A module can contain multiple classes, each with its own set of methods and attributes. It can also contain functions and data that are shared across all classes in the module.
Example: Class:
class Student:
    name = ""
    age = 0

    def __init__(self,name,age):
        self.name =name
        self.age = age

    def print_student(self):
        print("Student name:", self.name)
        println("Student age:", self.age)
This class defines a student with attributes name and age. It also has a method print_student that prints the student's name and age.
Module:
module student_module:
    class Student:
        name = ""
        age = 0

        def __init__(self,name:str,age:int):
            self.name =name
            self.age = age

        def print_student(self): 
            print("Student name:", self. name)
            print("Student age:", self.age )

    def main():
        student = Student("John", 18)
        student.print_student()

if __name__ == "__main__":
    main()
This module contains the Student class defined above, along with a main function that creates an instance of the Student class and calls the print_student method. The module also defines a top-level function main that can be called directly.
Output: When you run the code, it will output the following result:
Student name: John
Student age: 18
This demonstrates how a class defines the structure and behavior of a single object, while a module can contain multiple classes and provides a way to organize and reuse code.
------------------------------------------------------------------------------------------------------------------------------------

12.5.5  Summary

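This project aims to use natural language processing techniques, in particular language models and information retrieval, to give software engineers more effective answers and support for Python programming questions. Using the project's source code, we compared two sizes of the Gemma language model, combining retrieval-augmented generation (RAG) with fine-tuning.

(1) We first experimented with the 2B Gemma model. This model has a limited grasp of Python concepts and produced answers unrelated to the questions. Even after fine-tuning, its performance remained unsatisfactory: it frequently generated meaningless output and could not meet our needs.

(2) We then tested the 7B Gemma model and observed that it has some basic knowledge of Python programming. Although it sometimes drifted off topic or refused to answer, its response quality improved markedly after fine-tuning. RAG and few-shot prompting further improved the accuracy and completeness of its responses.

Getting RAG to work was challenging and required repeated adjustment to make sure the model understood the question correctly, but in the end the 7B instruction-tuned Gemma model proved the most suitable choice. Compared with the base 7B Gemma model, it produced more coherent conversations and more consistent answers. After fine-tuning, the quality of its answers to Python questions improved markedly, with better awareness of the topic and more accurate wording.

The current implementation still leaves room for optimization, but we are satisfied with how RAG was applied in the project and with the resulting improvement in response quality. Future work can refine the model and algorithms to further improve answer efficiency and accuracy, while keeping up with the latest developments in natural language processing.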

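This project is now complete. The full series:

(24-1) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): Background and Project Introduction

(24-2) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): Model Preparation and Processing

(24-3-01) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): Model Fine-Tuning with Transformers (1)

(24-3-02) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): Model Fine-Tuning with Transformers (2)

(24-4-01) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): RAG (1)

(24-4-02) Python Learning System Based on Large Models (Gemma+Langchain+FAISS): RAG (2)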
