Fine-Tuning a Large Model on the Spark (星火) LLM: Datawhale AI Summer Camp, Session 4, Task02 Notes

柒小毓

Published 2024-08-14 12:52:40

Tags: artificial intelligence, notes, machine learning, python, pandas

Article link: https://blog.csdn.net/2301_79638883/article/details/141095896

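This is a hands-on, from-scratch introduction to LLM fine-tuning, built on the iFLYTEK large-model custom training platform and the spark-13b fine-tuning model. It suits learners who want to understand how fine-tuning works, how to prepare fine-tuning data, and how to use a fine-tuned model to generate exam-style question-answer (QA) pairs. These are my own study notes, shared for reference.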

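Dataset analysis

The dataset consists of reading-comprehension material from recent Gaokao (college entrance exam) papers, covering Chinese modern-text reading and English reading. Each record contains the source passage, the questions, and the answers, and is used for training and testing.

The dataset is small, and participants are encouraged to extend and adjust it themselves.

The competition dataset is treated as iFLYTEK confidential information.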

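Key points

Understanding the data used for model fine-tuning

Understanding the education use case

Data processing methods

Code walkthrough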

1. Data processing

1.1 Building the Chinese QA data

!pip install pandas openpyxl

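This installs two libraries: pandas and openpyxl.

1. pandas: a Python data analysis library that provides data structures and tools for processing, analyzing, and storing data. It reads and processes formats such as CSV, Excel, and SQL.

2. openpyxl: a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

pandas relies on openpyxl to read `.xlsx` files, which is why both are installed.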

import pandas as pd
import re

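# Read the Excel file
df = pd.read_excel('训练集-语文.xlsx')
# Normalize the full-width period and full-width opening parenthesis to ASCII
df = df.replace('．', '.', regex=True)
df = df.replace('（', '(', regex=True)


# Read the '选项' (options) column of the row labeled 2, i.e. the third data row
second_row_option_content = df.loc[2, '选项']

# Show that cell's content
print(second_row_option_content)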

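Read the Excel file (训练集-语文.xlsx) and normalize its contents.

1. Import pandas for data handling.
2. Import `re` for regular expressions.
3. Read the Excel file (训练集-语文.xlsx) with pandas' `read_excel`.
4. Use `replace` to turn the full-width period "．" into "." and the full-width opening parenthesis "（" into "(". With `regex=True` the replacement applies inside each cell's text rather than only to cells that match exactly.
5. Read the "选项" (options) column of the row labeled 2 (the third data row) into `second_row_option_content`.
6. Print that content.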
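The same normalization can be tried on a single string without pandas. The sketch below is only an illustration with the standard library; the helper name `normalize_punctuation` and the sample string are assumptions, not part of the original notebook.

```python
# Hypothetical helper: mirrors the two replacements applied to the DataFrame,
# but on one plain string, using str.translate.
FULLWIDTH_TO_ASCII = str.maketrans({'．': '.', '（': '('})


def normalize_punctuation(text: str) -> str:
    """Replace the full-width period and opening parenthesis with ASCII ones."""
    return text.translate(FULLWIDTH_TO_ASCII)


sample = '1．下列说法正确的一项是（3分）'
print(normalize_punctuation(sample))
```

Note that the closing parenthesis "）" is left untouched, just as in the notebook, which only needs the opening "(" as a split point later.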

def chinese_multiple_choice_questions(questions_with_answers):
    # Raw text of the '选项' cell: all questions with their options
    text = questions_with_answers

    # Regex patterns
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

    # Find all questions
    questions = question_pattern.findall(text)

    # Lists for multiple-choice and short-answer questions
    multiple_choice_questions = []
    short_answer_questions = []

    # Process each question
    for idx, question in enumerate(questions):
        # A question that contains any of A-D is treated as multiple choice
        if re.search(r'[A-D]', question):
            choices = choice_pattern.findall(question)
            # Question stem: first line of the text before the first '('
            question_text = re.split(r'\n', question.split('(')[0])[0]

            # Drop the original number and renumber from 1
            pattern_question = re.compile(r'(\d+)\.(.*)')
            matches_question = str(idx + 1) + '.' + pattern_question.findall(question_text)[0][1]

            multiple_choice_questions.append({
                'question': matches_question,
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())
    # Only the multiple-choice questions are returned
    return multiple_choice_questions

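This defines a function named `chinese_multiple_choice_questions` that takes one argument, `questions_with_answers`. Its purpose is to parse the raw text of a "选项" cell into a list of multiple-choice questions, each with its question stem and options.

1. The variable `text` is set to `questions_with_answers` for use in the rest of the function.

2. `question_pattern` matches one question. It matches one or more digits followed by a period (`\d+\.`), then as few characters as possible (`.*?`), stopping just before the next number-plus-period or the end of the string (`(?=\d+\.|$)`). The `re.DOTALL` flag lets `.` match newlines as well.

3. `choice_pattern` matches one option. It matches a single letter from A to D (`[A-D]`), zero or more whitespace characters (`\s*`), and then as few characters as possible (`.*?`), stopping just before the next letter A to D, the end of the string, or a newline (`(?=[A-D]|$|\n)`). Because any capital A to D ends the match, option text that itself contains one of these letters is cut short.

4. `question_pattern.findall(text)` finds every question in the text.
5. The matches are stored in the list `questions`.

6. Multiple-choice questions go into `multiple_choice_questions`, and short-answer questions into `short_answer_questions`.

7. `enumerate` iterates over the questions, giving each question's index and content.
8. `re.search` checks whether the question contains an option letter (A, B, C, or D).
9. If it does, `choice_pattern.findall` extracts all the options.
10. The question stem is taken as the first line of the text before the first `(`; `re.compile` and `findall` then separate the original number from the stem so that the question can be renumbered from 1.
11. The renumbered question and its options are stored in a dictionary, which is appended to `multiple_choice_questions`.
12. If the question is not multiple choice, it is appended to `short_answer_questions`.
13. The function returns only `multiple_choice_questions`; the short-answer list is collected but not returned.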
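To see how `question_pattern` splits a block of text, here is a small sketch that runs the same pattern on made-up input. The sample strings are assumptions for illustration and are not taken from the competition data.

```python
import re

# Same pattern as in the function above: a number, a period, then everything
# up to (but not including) the next "number + period" or the end of the text.
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)

# Made-up sample in the layout the pattern expects.
sample = (
    "1.first question (3 points)\n"
    "A.option one\nB.option two\nC.option three\nD.option four\n"
    "2.second question (3 points)\n"
    "A.option one\nB.option two\nC.option three\nD.option four"
)
questions = question_pattern.findall(sample)
print(len(questions))

# Caveat: any "digit + period" inside the text also starts a new match,
# so a decimal such as 3.5 splits one question into two pieces.
pieces = question_pattern.findall("1.the ratio is 3.5 exactly")
print(pieces)
```

The second call shows why the parsed question count should be checked against the answer count before a record is used, which the processing function later in these notes does.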

questions_list = []
for data_id in range(len(df[:3])):
    second_row_option_content = df.loc[data_id, '选项']
    questions_list.append(chinese_multiple_choice_questions(second_row_option_content))

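1. First, create an empty list `questions_list` to hold the parsed questions.
2. Use a `for` loop over the indices `data_id` of `df[:3]` (the first three rows of the DataFrame).
3. Inside the loop, read the "选项" column of the current row and pass it to `chinese_multiple_choice_questions`, which parses it into a list of multiple-choice questions.
4. Append the parsed questions to `questions_list`.

Note that this code assumes `df` is a DataFrame that already contains the data and that `chinese_multiple_choice_questions` has been defined. The slice `[:3]` only samples a few rows for a quick check and can be adjusted as needed.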

def chinese_multiple_choice_answers(questions_with_answers):
    # Remove spaces and newlines so that "1. B" becomes "1.B"
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")

    # Regex patterns for choice answers (capital letters) and short answers
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

    # Find all matching answers
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)

    # Convert the matches to dictionaries keyed by question number
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}

    # Sort by question number
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())

    answers = []

    # Renumber the choice answers from 1; short answers are not returned
    for i in range(len(sorted_choice_answers)):
        answers.append(f"{i+1}. {sorted_choice_answers[i][1]}")
    return answers

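The function `chinese_multiple_choice_answers` processes the answers to the Chinese multiple-choice questions. It takes the raw answer string as input and returns a list of the multiple-choice answers.

1. It first removes spaces and newlines from the input so that the regular expressions are easier to match.

2. It uses two regular expressions, `choice_pattern` and `short_pattern`, to match multiple-choice and short-answer answers respectively. Because whitespace has already been removed, `choice_pattern` matches answers of the form `1.A` or `1.B`, and `short_pattern` matches a number and a period followed by text that contains no capital letters.

3. It uses `findall` to collect all matches and converts them to dictionaries whose keys are the answer numbers and whose values are the answers. The dictionaries are then sorted by number. Only the multiple-choice answers are renumbered from 1 and added to `answers`, which the function returns; the short answers are matched but not used.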
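The renumbering step is easy to miss, so here is a condensed sketch of it. `renumber_choice_answers` is a hypothetical name, and the input string is made up; the regular expression is the same `choice_pattern` as above.

```python
import re


def renumber_choice_answers(raw: str) -> list:
    """Hypothetical condensed version of the answer parsing described above."""
    compact = raw.replace(" ", "").replace("\n", "")
    # A number, a period, then one or more capital letters (e.g. "4.B" or "6.AC").
    matches = re.findall(r'(\d+)\.([A-Z]+)', compact)
    by_number = {int(number): letters for number, letters in matches}
    # Sort by the original number, then renumber from 1.
    ordered = sorted(by_number.items())
    return [f"{i}. {letters}" for i, (_, letters) in enumerate(ordered, start=1)]


print(renumber_choice_answers("4. B\n5. D\n6. AC"))
```

Answers numbered 4, 5, and 6 in the source come back as 1, 2, and 3, which matches the renumbered questions produced by `chinese_multiple_choice_questions`.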

# Read the '答案' (answers) column of the row labeled 60
second_row_option_content = df.loc[60, '答案']

# Show that cell's content
print(second_row_option_content)


chinese_multiple_choice_answers(second_row_option_content)

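1. First, the code reads the "答案" (answers) column of the row labeled 60 (the 61st data row under the default index) and stores it in `second_row_option_content`.

2. It then prints that content with `print()`.

3. Finally, it calls `chinese_multiple_choice_answers()` on that content, which returns the list of renumbered multiple-choice answers.

A DataFrame is a two-dimensional, labeled data structure for storing and manipulating data. Its row and column labels are often integers or strings, and `read_excel` assigns integer row labels starting at 0 by default.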

df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)

This passes each element of the '答案' column to `chinese_multiple_choice_answers` and stores the returned values in a new column, '答案_processed'.

def get_prompt_cn(text):
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：
    
    ### 回答要求
    (1)理解文中重要概念的含义
    (2)理解文中重要句子的含意
    (3)分析论点、论据和论证方法
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

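This function builds the Chinese prompt used for generating Chinese multiple-choice questions. It takes one argument, `text` (the reading passage), and embeds it in an instruction that asks the model to act as a Gaokao question writer and produce four single-answer multiple-choice questions, each with four options and its answer, without repeating the passage.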

def process_cn(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        # print(data_options)
        # print(data_answers)
        
        if(len(data_answers)==len(data_options)):
            res = ''
            for id_,question in enumerate(data_options):
                res += f'''
{question['question']}?
                '''+'\n'
                for choice in question['choices']:
                    res = res+ choice[0] + choice[1]+ '\n'
                res = res + '答案:' + str(data_answers[id_].split('.')[-1])  + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
        # break
    return res_input,res_output

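The function `process_cn` processes the Chinese multiple-choice data. It takes a DataFrame `df` containing the options (which include the question text), the answers, and the reading passages. Its main purpose is to convert them into a format suitable for training and to store the results in the lists `res_input` and `res_output`.

1. It first creates two empty lists, `res_input` and `res_output`, for the processed inputs and outputs.

2. It iterates over every row of the DataFrame, reads the options, answers, and reading passage, parses the options and answers with the functions above, and wraps the passage in the prompt.

3. Only when the number of parsed answers equals the number of parsed questions does it assemble the questions, options, and answers into one output string and append the output and the prompt to `res_output` and `res_input`. Rows that fail this check are skipped.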
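The assembly of one training target can be sketched on its own. This is a simplified variant, not the notebook's exact formatting: `format_target` is a hypothetical name, the whitespace differs from `process_cn`, and the sample data is made up.

```python
def format_target(questions, answers):
    """Hypothetical, simplified formatter: one block per question, blank line between."""
    blocks = []
    for question, answer in zip(questions, answers):
        lines = [question['question']]
        # Each choice is a (letter, text) tuple as returned by re.findall.
        lines += [label + text for label, text in question['choices']]
        # "1. B" -> "B": keep only what follows the last period.
        lines.append('答案:' + answer.split('.')[-1].strip())
        blocks.append('\n'.join(lines))
    return '\n\n'.join(blocks)


questions = [{'question': '1.sample question', 'choices': [('A', '.one'), ('B', '.two')]}]
answers = ['1. B']
print(format_target(questions, answers))
```

Whatever layout is chosen, it helps to keep it identical across all samples so that the fine-tuned model learns a single output format.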

cn_input,cn_output = process_cn(df)

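This line runs `process_cn` on the Chinese-subject DataFrame and returns the prompts (`cn_input`) and the matching target outputs (`cn_output`). Here "cn" refers to the Chinese-language exam data, not to a region.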

len(cn_input)

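`len()` returns the length of an object such as a string, list, or tuple. Here it shows how many Chinese samples passed the consistency check.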

1.2 Building the English QA data

import pandas as pd

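# Read the Excel file
df = pd.read_excel('训练集-英语.xlsx')
# Normalize the full-width period, then map look-alike option letters (apparently Cyrillic А, В, С) to Latin A, B, C.
# With regex=True an unescaped '.' matches any character, so the period is escaped here.
# The original also replaced 'D.' with 'D.'; with the period escaped that is a no-op, so it is dropped.
df = df.replace('．', '.', regex=True).replace(r'А\.', 'A.', regex=True).replace(r'В\.', 'B.', regex=True).replace(r'С\.', 'C.', regex=True)
# df = df.replace('（', '(', regex=True)

# Read the '选项' (options) column of the row labeled 0, i.e. the first data row
second_row_option_content = df.loc[0, '选项']

# Show that cell's content
print(second_row_option_content)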

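Read the Excel file (训练集-英语.xlsx) and normalize its contents.

1. Import pandas for data handling (`re` is imported later, where the regular expressions are used).
2. Read the Excel file (训练集-英语.xlsx) with pandas' `read_excel`.
3. Use `replace` to turn the full-width period "．" into "." and to map look-alike option letters followed by a period to their Latin forms (A., B., C.). The replacement of "（" is commented out here.
4. Read the "选项" column of the row labeled 0 (the first data row) into `second_row_option_content`.
5. Print that content.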

def remove_whitespace_and_newlines(input_string):
    # Remove spaces, newlines, and periods with str.replace()
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

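This function is a text preprocessing helper. Despite its name, it removes periods as well as spaces and newlines, so that an answer such as "32. B." collapses to "32B" and can be matched by a simple pattern.
1. `str.replace()` replaces spaces with the empty string, then does the same for newlines.
2. A further `str.replace()` call removes periods.
3. The processed string is returned.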

import re

# Sample text
text = """
32. B. The underlying logic of the effect.                                                   33.D. estimates were not fully independent.
34.C. The discussion process.            35.D. Approving.
"""
def get_answers(text):
    text = remove_whitespace_and_newlines(text)
    # Regex: one digit followed by an option letter A-D
    pattern = re.compile(r'(\d)\s*([A-D])')

    # Find all matches
    matches = pattern.findall(text)
    res = []
    # Collect the answer letter of each match
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

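This code extracts the answer letter of each question from the answer text and returns the letters as a list.

1. It imports the `re` module for regular expressions.

2. It defines a sample string `text` containing numbered answers such as "32. B. ...".

3. It defines a function `get_answers` that takes a string argument `text`.

4. The function calls the helper `remove_whitespace_and_newlines`, which removes spaces, newlines, and periods.

5. It defines the pattern `pattern`. `(\d)` matches a single digit, `\s*` matches zero or more whitespace characters (redundant here, because whitespace has already been removed), and `([A-D])` matches one uppercase letter from A to D.

6. `pattern.findall(text)` finds all matches.

7. An empty list `res` is created for the results.

8. A `for` loop iterates over the matches.

9. Each match is unpacked into its digit and its letter. No case conversion takes place.

10. The letter is appended to `res`.

11. `res` is returned as the list of answers.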

# Sample input
input_string = "28. A. It is simple and plain.              29. D. Influential.                                30. D.33%.                                             31. B. Male chefs on TV programmes."
res = get_answers(input_string)
print(res)

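This runs `get_answers` on a sample answer string.

1. `input_string` holds the answers to questions 28 to 31 in the raw format of the spreadsheet.
2. `get_answers` strips spaces, newlines, and periods from it and matches each digit that is directly followed by a letter from A to D.
3. The result is a list of answer letters, not a dictionary. For this input it prints `['A', 'D', 'D', 'B']`.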
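One weakness of this approach is that any digit directly followed by A, B, C, or D in the option text also counts as an answer. The sketch below contrasts the approach from these notes with a stricter pattern. `get_answers_strict` is a hypothetical alternative of my own, not the baseline's method, and the "3D printer" string is made up.

```python
import re


def get_answers_loose(text: str) -> list:
    """The approach from the notes: strip spaces, newlines and periods, then match."""
    stripped = text.replace(" ", "").replace("\n", "").replace(".", "")
    return [letter for _, letter in re.findall(r'(\d)\s*([A-D])', stripped)]


def get_answers_strict(text: str) -> list:
    """Hypothetical alternative: require the full "number. letter." shape."""
    return [letter for _, letter in re.findall(r'(\d+)\s*\.\s*([A-D])\s*\.', text)]


tricky = "32. C. A 3D printer."
print(get_answers_loose(tricky))   # also picks up the "3D" inside the option text
print(get_answers_strict(tricky))
```

The strict version has its own limitation: it misses answers written without a period after the letter (for example "28.A"), so which one fits depends on how consistent the spreadsheet is.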

import re

# Sample text: the '选项' cell read earlier
text = second_row_option_content

def get_questions(text):
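    text = text.replace('\n', '  ')+'  '
    # print(text)
    # Regex: question number and stem, then options A-D, each ending in two whitespace characters
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)

    # Find all matches
    matches = pattern.findall(text)

    # List of result dictionaries
    questions_dict_list = []

    # Build one dictionary per match
    for match in matches: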
        question, option1, option2, option3, option4 = match
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        
        # Map each option's leading letter to its text
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    return questions_dict_list

# Call the function and print the result
questions = get_questions(text)
for q in questions:
    print(q)

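This code extracts the questions and their options from the given text and stores them in a list of dictionaries.

1. Import the `re` module for regular expressions.
2. Set the sample text `text` to the "选项" cell read from the spreadsheet earlier.
3. Define a function `get_questions` that takes a text argument `text`.
4. Inside the function, replace each newline `\n` with two spaces and append two more spaces at the end, so that every option ends in at least two whitespace characters.
5. Define the pattern `pattern`. It has five groups: the question number and stem (`\d+\..*?`), option A (`A\..*?\s{2}`), two middle options that start with B, C, or D, and a final option D. Each option runs lazily up to the first pair of whitespace characters.
6. Compile the pattern with `re.compile()` and the `re.DOTALL` flag.
7. Use `findall()` to collect all matches as a list of five-element tuples.
8. Initialize an empty list `questions_dict_list` for the extracted questions and options.
9. Loop over the matches, strip the number from the question stem, convert each match to a dictionary, and append it to `questions_dict_list`.
10. Return `questions_dict_list`.
11. After the function definition, call `get_questions()` and store the result in `questions`.
12. Loop over `questions` and print each question dictionary.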
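Because every option must end in two whitespace characters, the pattern depends heavily on the layout of the cell. The sketch below runs the same pattern on two made-up layouts; the `prepare` helper is a hypothetical name that mirrors the first line of `get_questions`.

```python
import re

# Same pattern as in get_questions above.
pattern = re.compile(
    r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})',
    re.DOTALL,
)


def prepare(text: str) -> str:
    """Mirror the preprocessing step: newline -> two spaces, plus a trailing pair."""
    return text.replace('\n', '  ') + '  '


one_option_per_line = "1.What colour is it?\nA.red\nB.green\nC.blue\nD.black"
single_spaces = "1.What colour is it? A.red B.green C.blue D.black"

per_line_matches = pattern.findall(prepare(one_option_per_line))
single_space_matches = pattern.findall(prepare(single_spaces))
print(per_line_matches)
print(single_space_matches)
```

The first layout yields one match with all four options, while the second yields none, because options separated by a single space never reach the two-whitespace terminator. Rows like that are silently dropped later when the question and answer counts are compared.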

def get_prompt_en(text):
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:
    
    ### 回答要求
    (1)Understanding the main idea of the text.
    (2)Understand the specific information in the text.
    (3)Inferring the meaning of words and phrases from the context
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

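This function builds the prompt used for generating English multiple-choice questions. It takes one argument, `text` (the reading passage). The role instruction is still written in Chinese, while the requirements ask for questions and answers in English.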

def process_en(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        # print(data_options)
        # print(data_answers)

        if(len(data_answers)==len(data_options)):
            res = ''
            # idx avoids reusing the outer loop variable `id`
            for idx,question in enumerate(data_options):
                res += f'''
                {idx+1}.{question['question']}
                {question['options']['A']}
                {question['options']['B']}
                {question['options']['C']}
                {question['options']['D']}
                answer:{data_answers[idx]}
                '''+'\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input,res_output

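The main purpose of `process_en` is to convert the English questions into a format suitable for fine-tuning.

The function first creates two empty lists, `res_input` and `res_output`, for the processed inputs and outputs. It then iterates over every row of the DataFrame and does the following:

1. Read the options, answers, and reading passage of the row.
2. Parse the options into question dictionaries with `get_questions`.
3. Parse the answers into a list of letters with `get_answers`.
4. Wrap the reading passage in the English prompt with `get_prompt_en`.

If the number of parsed questions equals the number of parsed answers, the questions, options, and answers are combined into one string and appended to `res_output`, and the prompt is appended to `res_input`. Rows where the counts differ are skipped. Finally, the function returns `res_input` and `res_output`.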

en_input,en_output = process_en(df)

This line processes the English data. It passes the DataFrame `df` to `process_en` and receives the prompts in `en_input` and the matching target outputs in `en_output`.


# Combine the lists into a DataFrame, repeating the first 30 Chinese and the first 20 English samples
df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

df_new

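1. This code creates a new DataFrame with the columns 'input' and 'output'.

2. The 'input' column concatenates `cn_input`, the first 30 items of `cn_input` again, `en_input`, and the first 20 items of `en_input` again. The 'output' column is built the same way from `cn_output` and `en_output`, so each input stays aligned with its output, and the repeated samples simply appear twice in the training set.

3. The last line displays the DataFrame.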
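The list arithmetic behind this duplication can be checked on toy data. The `oversample` helper and the sample lists below are assumptions for illustration only.

```python
def oversample(samples, extra):
    """Return the list followed by a repeat of its first `extra` items."""
    return samples + samples[:extra]


cn = [f'cn-{i}' for i in range(5)]
en = [f'en-{i}' for i in range(4)]
combined = oversample(cn, 2) + oversample(en, 1)
print(len(combined))
print(combined)
```

Slicing never raises when `extra` exceeds the list length, so if fewer than 30 Chinese samples survive parsing, `cn_input[:30]` just repeats all of them.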

import json


# Open a file for writing JSONL, with UTF-8 encoding
with open('output.jsonl', 'w', encoding='utf-8') as f:
    # Convert each row to JSON
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False,)
        # Write the JSON string followed by a newline
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file generated")

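This code converts the DataFrame `df_new` to JSONL format and writes the result to a file named `output.jsonl`.

1. Import the `json` module for handling JSON data.
2. Open a file for writing with `with open()`, using UTF-8 encoding.
3. Iterate over each row of `df_new`, convert it to a dictionary with `to_dict()`, and serialize it to a JSON string with `json.dumps()`.
4. Pass `ensure_ascii=False` so that non-ASCII characters such as Chinese text are written as-is rather than escaped.
5. Write each JSON string to the file followed by a newline (`\n`).
6. Print a confirmation message.

JSONL (JSON Lines) is a format for a sequence of JSON objects in which each object occupies exactly one line.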
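Before uploading, it can be worth checking that every line of the file parses on its own. The sketch below does a round trip through an in-memory buffer that stands in for `output.jsonl`; the two sample rows are made up.

```python
import io
import json

rows = [
    {'input': '阅读文本', 'output': '1.question\nA.one\n答案:A'},
    {'input': 'reading text', 'output': '1.question\nA.one\nanswer:A'},
]

# Write JSONL into an in-memory buffer (a stand-in for output.jsonl).
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row, ensure_ascii=False) + '\n')

# Read it back: every non-empty line must parse as one JSON object.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(len(loaded), loaded == rows)
```

Newlines inside the `output` strings are escaped by `json.dumps`, so each record still occupies a single line even though the training target itself spans several lines.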

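2. Model fine-tuning

Link: 大模型训练平台 (the large-model training platform)

3. Model submission

Submission site: 星火大模型驱动阅读理解题库构建挑战赛 (the Spark LLM-driven reading-comprehension question bank challenge)

Then have our prompts ready:

Chinese part

(1)理解文中重要概念的含义 (understand the meaning of key concepts in the text)
(2)理解文中重要句子的含意 (understand the meaning of key sentences in the text)
(3)分析论点、论据和论证方法 (analyze the claims, the evidence, and the methods of argument)

English part

(1)Understanding the main idea of the text.

(2)Understand the specific information in the text.

(3)Inferring the meaning of words and phrases from the context

Then wait for the score to come out.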