DataWhale AI Summer Camp: Data Flow in Zero-Code LLM Fine-Tuning, Explained

是织梦者啊

Last modified: 2024-08-11 21:47:51


First published: 2024-08-11 18:35:19

Article link: https://blog.csdn.net/dream__me/article/details/141102866


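Overview

About this article: these are study notes based on the data from the Spark LLM-Driven Reading Comprehension Question Bank Construction Challenge (《星火大模型驱动阅读理解题库构建挑战赛》). They walk through the fine-tuning workflow and how data flows through it.
Competition link: https://challenge.xfyun.cn/h5/detail?type=question-bank-construction&option=phb&ch=dw24_E7x9gl
Tutorial: https://linklearner.com/activity/14/12/26
Note: "zero-code" here means that the fine-tuning step itself requires no code. Uploading the training data to the platform is enough to start fine-tuning.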

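Overall Workflow

1. Data processing: generating the fine-tuning data

The process consists of reading the Excel files, cleaning the data, parsing the text, extracting the questions and options, and building the prompts and answer outputs.

1.1 Competition data

1.1.1 Raw data format

[Image omitted]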

import pandas as pd
import re

# Read the Excel file
df = pd.read_excel('训练集-语文.xlsx')

df.head(2)

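[Image omitted: output of df.head(2)]
训练集-语文.xlsx contains three columns (阅读文本, 选项, 答案). The details are shown below:

[Image omitted]

[Image omitted]

训练集-语文.xlsx contains both multiple-choice and short-answer questions, while 训练集-英语.xlsx contains only multiple-choice questions. Only the multiple-choice questions are used for fine-tuning.

[Image omitted]

1.1.2 Target format of the fine-tuning data

The prompt in the fine-tuning data has the following format:

[Image omitted]

In the question-and-answer data, the options from the raw data are paired one-to-one with their answers.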

1.2 Data cleaning

df = df.replace('．', '.', regex=True)
df = df.replace('（', '(', regex=True)

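The snippet above uses the replace() method of a pandas DataFrame to substitute characters throughout the DataFrame. It performs two replacements, each with regex=True, so that the pattern is matched inside each cell rather than against the whole cell value:

First replacement

df = df.replace('．', '.', regex=True)

Purpose: replace every full-width full stop (．) in the DataFrame with a half-width period (.).

Second replacement

df = df.replace('（', '(', regex=True)

Purpose: replace every full-width left parenthesis (（) in the DataFrame with a half-width left parenthesis (().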
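
To make the effect of this cleaning step easy to check on a single string, here is a minimal sketch that applies the same two substitutions with a translation table instead of pandas. The helper name normalize_punctuation and the sample string are hypothetical, and the extra mapping for the right parenthesis is an assumption that the post itself does not make.

```python
# Hypothetical helper (not from the original post): normalize full-width
# punctuation on a plain string with a translation table.
FULLWIDTH_TO_ASCII = str.maketrans({
    '．': '.',  # full-width full stop, as in the post
    '（': '(',  # full-width left parenthesis, as in the post
    '）': ')',  # full-width right parenthesis (assumption: not handled in the post)
})


def normalize_punctuation(text: str) -> str:
    """Return text with the full-width characters above replaced."""
    return text.translate(FULLWIDTH_TO_ASCII)


sample = '1．下列说法正确的一项是（ ）'  # made-up sample
print(normalize_punctuation(sample))  # 1.下列说法正确的一项是( )
```

If this variant were preferred, it could be applied to one column at a time with Series.map. The later parsing code only splits on '(' and on '.', so normalizing the right parenthesis as well should not change its behavior.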

1.3 Generating the fine-tuning data

def chinese_multiple_choice_questions(questions_with_answers):
    # Input text containing the questions
    text = questions_with_answers

    # Regular-expression patterns
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

    # Find all questions
    questions = question_pattern.findall(text)

    # Lists for multiple-choice and short-answer questions
    multiple_choice_questions = []
    short_answer_questions = []

    # Process each question
    for id,question in enumerate(questions):
        # Check whether it is a multiple-choice question
        if re.search(r'[A-D]', question):
            
            choices = choice_pattern.findall(question)
            question_text = re.split(r'\n', question.split('(')[0])[0]
            
            
            pattern_question = re.compile(r'(\d+)\.(.*)')
            matches_question = str(id+1)+'.'+ pattern_question.findall(question_text)[0][1] # Extract the question text and renumber it
            # print(str(id+1)+'.'+matches_question)
            
            multiple_choice_questions.append({
                'question': matches_question,
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())
    return multiple_choice_questions

def chinese_multiple_choice_answers(questions_with_answers):
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")
    
    # print(questions_with_answers)
    # Match the answers with regular expressions
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

    # Find all matching answers
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)

    # Convert the matches to dictionaries keyed by question number
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}

    # Sort by question number
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())
    
    answers = []

    # Build the result (multiple-choice answers only, renumbered from 1)
    
    # print("Multiple-choice answers:")
    for id in range(len(sorted_choice_answers)):
        answers.append(f"{id+1}. {sorted_choice_answers[id][1]}")
    return answers

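The two functions chinese_multiple_choice_questions and chinese_multiple_choice_answers are explained in detail below.

chinese_multiple_choice_questions

Purpose: extract the multiple-choice questions and their options from the input text (the 选项 column, despite the parameter name questions_with_answers) and organize them as a list of dictionaries.

Steps:

Initialize variables: assign the input questions_with_answers text to text.
Compile the regular expressions:
- question_pattern: matches one question, from a number followed by a period up to the next number-period pair or the end of the text.
- choice_pattern: matches one option, from an option letter (A-D) up to the next option letter, a newline, or the end of the text.
Find all questions: use question_pattern to find every question in text and store the matches in the questions list.
Initialize lists: create two empty lists, multiple_choice_questions and short_answer_questions, for the two question types.
Iterate over the questions: for each entry in questions, enumerate provides its index (id) and content (question).
Check the question type:
- If the question contains an option letter (A-D), it is treated as a multiple-choice question.
- choice_pattern finds the options in the question, which are stored in the choices list.
- The question stem is isolated by taking the text before the first '(' (split) and then only its first line (re.split).
- pattern_question (re.compile plus findall) separates the original number from the stem, and the stem is renumbered as id+1.
- The renumbered question and its options are appended to multiple_choice_questions as a dictionary.
Handle short-answer questions: a question without any option letter is treated as a short-answer question and appended to short_answer_questions.
Return value: the function returns only multiple_choice_questions and discards short_answer_questions, because the fine-tuning data needs only multiple-choice questions.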
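
To see what the two patterns actually capture, the sketch below runs them on a small made-up sample in the cleaned format (half-width '.' and '('). The sample text is invented for illustration and is not taken from the competition data.

```python
import re

# The same two patterns used in chinese_multiple_choice_questions.
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

# Made-up sample: two multiple-choice questions with four options each.
sample = (
    "1.下列理解正确的一项是( )\n"
    "A.甲\nB.乙\nC.丙\nD.丁\n"
    "2.下列分析不正确的一项是( )\n"
    "A.子\nB.丑\nC.寅\nD.卯"
)

questions = question_pattern.findall(sample)
first_choices = choice_pattern.findall(questions[0])
# Stem: text before the first '(' and only its first line.
stem = questions[0].split('(')[0].split('\n')[0]

print(len(questions))   # 2
print(stem)             # 1.下列理解正确的一项是
print(first_choices)    # [('A', '.甲'), ('B', '.乙'), ('C', '.丙'), ('D', '.丁')]

# An option whose text itself contains a letter A-D is split into fragments.
fragile = choice_pattern.findall("A.DNA是遗传物质")
print(fragile)          # [('A', '.'), ('D', 'N'), ('A', '是遗传物质')]
```

The last call shows a limitation worth keeping in mind: because the lookahead stops at any letter A-D, an option that contains such a letter in its own text is broken into several pieces. For Chinese options this is probably rare, but it is one reason a row can end up with an unexpected number of choices.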

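chinese_multiple_choice_answers

Purpose: extract the answers from the answer text and return the multiple-choice answers as a list. Short-answer answers are matched but not returned.

Steps:

Preprocess the text: remove all spaces and newlines from the input.
Compile the regular expressions:
- choice_pattern: matches a multiple-choice answer, that is, a number and a period followed by one or more uppercase letters.
- short_pattern: matches a short-answer answer, that is, a number and a period followed by text that contains no uppercase letters.
Find the answers: apply both patterns and store the results in choice_matches and short_matches.
Convert the matches to dictionaries: the key is the question number (converted to an integer) and the value is the answer.
Sort: sort the dictionary items by number to obtain sorted_choice_answers and sorted_short_answers.
Build the answer list: iterate over sorted_choice_answers, format each answer as a string renumbered from 1, and append it to answers. The short-answer answers are not included.
Return value: the answers list containing the multiple-choice answers.

Note: the function returns only the multiple-choice answers and ignores the short-answer ones. In addition, stripping spaces and newlines can produce unexpected matches, especially when the answer text originally contained those characters. A real application may need more careful text handling to guarantee accuracy.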
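
The answer parsing can be checked in isolation with a few lines. The function below is an illustrative restatement of the multiple-choice part of the logic described above, not the author's code; its name and the sample strings are made up.

```python
import re


def extract_choice_answers(raw: str) -> list:
    """Illustrative restatement of the multiple-choice answer parsing."""
    # Strip spaces and newlines, as the original function does.
    compact = raw.replace(" ", "").replace("\n", "")
    # A number, a period, then one or more uppercase letters.
    pairs = re.findall(r'(\d+)\.([A-Z]+)', compact)
    by_number = {int(num): letters for num, letters in pairs}
    ordered = sorted(by_number.items())
    # Answers are renumbered from 1 by position, not by their original number.
    return [f"{pos + 1}. {letters}" for pos, (_, letters) in enumerate(ordered)]


print(extract_choice_answers("1.A 2.B\n3.C 4.D"))  # ['1. A', '2. B', '3. C', '4. D']
print(extract_choice_answers("2.B 4.D"))            # ['1. B', '2. D']
```

The second call shows that the original question numbers are discarded: answers are renumbered by position. That matches how the questions are renumbered, but it also means a missed match shifts every later answer, which is why the length check between questions and answers in the next step matters.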

df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)

df gains a new column, 答案_processed, as shown below:
[Image omitted]

1.3.1 Generating prompts and responses (Chinese)

def get_prompt_cn(text):
    prompt = f'''
    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不⽤给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：
    
    ### 回答要求
    (1)理解文中重要概念的含义
    (2)理解文中重要句子的含意
    (3)分析论点、论据和论证方法
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

def process_cn(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        # print(data_options)
        # print(data_answers)
        
        if(len(data_answers)==len(data_options)):
            res = ''
            for id_,question in enumerate(data_options):
                res += f'''
{question['question']}?
                '''+'\n'
                for choice in question['choices']:
                    res = res+ choice[0] + choice[1]+ '\n'
                res = res + '答案:' + str(data_answers[id_].split('.')[-1])  + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
        # break
    return res_input,res_output

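This code defines process_cn, which takes a DataFrame (df) that is expected to contain three columns: '选项', '答案', and '阅读文本'. It parses these columns and produces two lists, res_input and res_output. res_input holds the prompts built from the reading text, and res_output holds the formatted questions, options, and answers.

The main parts of the code are explained below:

Initialize the result lists: the function starts with two empty lists, res_input and res_output, for the prompts and the formatted questions and answers.
Iterate over the DataFrame: for id in range(len(df)) loops over the rows by position, which relies on the default 0..n-1 index. A more common approach is for index, row in df.iterrows():, which gives direct access to the row data.
Extract the data: for each row, read the values of the '选项', '答案', and '阅读文本' columns.
Process the data:
- data_options = chinese_multiple_choice_questions(data_options): parses the '选项' column into questions and choices.
- data_answers = chinese_multiple_choice_answers(data_answers): parses the '答案' column into a list of answers.
- data_prompt = get_prompt_cn(data_prompt): wraps the '阅读文本' column in the prompt template.
Condition check: the row is used only if the number of parsed answers equals the number of parsed questions, so that every question has a matching answer. Rows that fail the check are skipped.
Format the output:
- Nested loops walk through each question and its choices, format them as the question followed by its options, and append the corresponding answer after each question.
- The formatted string is appended to res_output.
- The corresponding prompt is appended to res_input.
Return value: the function returns the two lists, res_input and res_output.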
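
The formatting step can be sketched on its own, using already parsed structures as input. This is a simplified illustration rather than the author's exact code: the helper name format_target is hypothetical, the whitespace is tidied, and the answer letter is stripped of its leading space, which the original does not do.

```python
def format_target(questions, answers):
    """Build one training target string from parsed questions and answers.

    Simplified sketch: mirrors the question / options / answer layout,
    but with cleaner whitespace than the original f-string produces.
    """
    lines = []
    for item, answer in zip(questions, answers):
        lines.append(f"{item['question']}?")
        # Each choice is a (letter, text) tuple, e.g. ('A', '.甲').
        lines.extend(letter + text for letter, text in item['choices'])
        # '1. A' -> 'A'
        lines.append('答案:' + answer.split('.')[-1].strip())
    return '\n'.join(lines)


# Made-up parsed data in the shapes returned by the two parsing functions.
demo_questions = [{
    'question': '1.下列理解正确的一项是',
    'choices': [('A', '.甲'), ('B', '.乙'), ('C', '.丙'), ('D', '.丁')],
}]
demo_answers = ['1. A']

print(format_target(demo_questions, demo_answers))
```

Using zip keeps each question next to its answer without indexing by position. The length check from process_cn would still be needed before calling it, since zip silently stops at the shorter list.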

cn_input,cn_output = process_cn(df)

The prompts and responses generated from 训练集-语文.xlsx look like this:
[Image omitted]

1.3.2 Generating prompts and responses (English)

import pandas as pd
import re
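# Read the Excel file
df = pd.read_excel('训练集-英语.xlsx')
# Normalize the full-width full stop, then map the look-alike letters А, В, С (Cyrillic) to Latin A, B, C.
# Dots are escaped because with regex=True an unescaped '.' matches any character.
df = df.replace('．', '.', regex=True).replace(r'А\.', 'A.', regex=True).replace(r'В\.', 'B.', regex=True).replace(r'С\.', 'C.', regex=True)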

def remove_whitespace_and_newlines(input_string):
    # Remove spaces, newlines, and periods with str.replace()
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

def get_answers(text):
    text = remove_whitespace_and_newlines(text)
    # Pattern: a digit followed by an answer letter
    pattern = re.compile(r'(\d)\s*([A-D])')

    # Find all matches
    matches = pattern.findall(text)
    res = []
    # Collect the answer letters
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

def get_questions(text):
    text = text.replace('\n', '  ')+'  '
    # print(text)
    # Pattern: a numbered question followed by four options, each ending in two spaces
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)

    # Find all matches
    matches = pattern.findall(text)

    # List of dictionaries holding the results
    questions_dict_list = []

    # Build one dictionary per matched question
    for match in matches:
        question, option1, option2, option3, option4 = match
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        
        # Map each option letter to its full text
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    return questions_dict_list

def get_prompt_en(text):
    prompt = f'''
    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不⽤给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:
    
    ### 回答要求
    (1)Understanding the main idea of the main idea.
    (2)Understand the specific information in the text.
    (3)infering the meaning of words and phrases from the context
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt   

def process_en(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        
        if(len(data_answers)==len(data_options)):
            res = ''
            for q_idx,question in enumerate(data_options):
                res += f'''
                {q_idx+1}.{question['question']}
                {question['options']['A']}
                {question['options']['B']}
                {question['options']['C']}
                {question['options']['D']}
                answer:{data_answers[q_idx]}
                '''+'\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input,res_output

en_input,en_output = process_en(df)

The prompts and responses generated from 训练集-英语.xlsx look like this:
[Image omitted]

[Image omitted]
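
The answer parsing used for the English set can also be tried on its own. The sketch below restates get_answers compactly under a hypothetical name, with made-up inputs, to show both the normal case and one way it can over-match.

```python
import re


def parse_answer_letters(raw: str) -> list:
    """Compact restatement of get_answers: drop spaces, newlines, and periods,
    then collect every A-D letter that directly follows a digit."""
    compact = raw.replace(" ", "").replace("\n", "").replace(".", "")
    return [letter for _, letter in re.findall(r'(\d)\s*([A-D])', compact)]


# Normal case: four numbered answers.
print(parse_answer_letters("1. A  2. C\n3. B  4. D"))          # ['A', 'C', 'B', 'D']

# Over-match: once spaces are removed, '3 Because' becomes '3Because'.
print(parse_answer_letters("1.A 2.B see paragraph 3 Because"))  # ['A', 'B', 'B']
```

In the second case the extra letter makes the answer count differ from the question count, so process_en would skip that row through its length check instead of pairing questions with the wrong answers.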

2. Model fine-tuning

import json

df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

with open('output.jsonl', 'w', encoding='utf-8') as f:
    # Convert each row to JSON
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        # Write the JSON string followed by a newline (one record per line)
        f.write(row_json + '\n')

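Save the Chinese and English data generated in section 1.3 as output.jsonl and upload it for fine-tuning at:
https://training.xfyun.cn/dataset/datasetIndex
For the fine-tuning steps, see the tutorial: https://linklearner.com/activity/14/12/26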
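
Before uploading, it can help to read the file back and confirm that every line is a valid JSON object with the two expected fields. The checker below is a minimal sketch: validate_jsonl is a hypothetical helper, and the rules it enforces (exactly the keys input and output, both non-empty strings) are assumptions based on the file written above, not the platform's documented requirements.

```python
import json
import os
import tempfile


def validate_jsonl(path):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # JSONDecodeError is a ValueError
            if set(record) != {'input', 'output'}:
                raise ValueError(f'line {line_no}: unexpected keys {sorted(record)}')
            if not all(isinstance(v, str) and v.strip() for v in record.values()):
                raise ValueError(f'line {line_no}: empty or non-string field')
            count += 1
    return count


# Demo on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(json.dumps({'input': '提示词', 'output': '1.题目?'}, ensure_ascii=False) + '\n')
    print(validate_jsonl(demo_path))  # 1
```

For the real file the call would be validate_jsonl('output.jsonl'), and the returned count can be compared with len(df_new).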

[Image omitted]

This run used LoRA-based fine-tuning with the parameters shown below. Training took 43 min 58 s:
[Image omitted]

3. Model inference

!pip install --upgrade spark_ai_python

prompt = {"input": "XXX", "output": "XXX"}
prompt = prompt['input']

from sparkai.llm.llm import ChatSparkLLM, ChunkPrintHandler
from sparkai.core.messages import ChatMessage

SPARKAI_URL = 'wss://xingchen-api.cn-huabei-1.xf-yun.com/v1.1/chat'

SPARKAI_APP_ID = ''
SPARKAI_API_SECRET = ''
SPARKAI_API_KEY = ''
serviceId = ''  
resourceId = ''

if __name__ == '__main__':
    spark = ChatSparkLLM(
        spark_api_url=SPARKAI_URL,
        spark_app_id=SPARKAI_APP_ID,
        spark_api_key=SPARKAI_API_KEY,
        spark_api_secret=SPARKAI_API_SECRET,
        spark_llm_domain=serviceId,
        model_kwargs={"patch_id": resourceId},
        streaming=False,
    )
    messages = [ChatMessage(
        role="user",
        content=prompt
    )]
    handler = ChunkPrintHandler()
    a = spark.generate([messages], callbacks=[handler])
    print(a.generations[0][0].text)

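4. Model evaluation

There is no custom evaluation code. The competition tests the submitted resourceId by having the model generate a set number of Q&A items, and the final score combines automated and human evaluation.

Summary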

df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

I do not understand why the lists are sliced and then concatenated again here.
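
What the expression does, as opposed to why, can be shown with small stand-in lists. The lists and slice sizes below are made up; only the shape of the expression follows the line above.

```python
from collections import Counter

cn_demo = [f'cn_{i}' for i in range(5)]  # stand-in for cn_input
en_demo = [f'en_{i}' for i in range(4)]  # stand-in for en_input

# Same shape as the original expression, with smaller slices.
combined = cn_demo + cn_demo[:3] + en_demo + en_demo[:2]

counts = Counter(combined)
duplicated = sorted(k for k, v in counts.items() if v == 2)

print(len(combined))  # 14
print(duplicated)     # ['cn_0', 'cn_1', 'cn_2', 'en_0', 'en_1']
```

So in the real data the first 30 Chinese samples and the first 20 English samples each appear twice in the training file, which enlarges the dataset by 50 rows. Whether this was meant as simple oversampling or as a way to reach a minimum dataset size on the platform is not stated in the source, so the intent remains an open question.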
