Datawhale AI Summer Camp [LLM Fine-Tuning from Scratch] Task 1 Notes

Q1yAn

Last modified 2024-08-12 00:30:51

First published 2024-08-11 22:07:36

Article link: https://blog.csdn.net/m0_73943012/article/details/141112960

TASK01: Spark (星火) LLM-Driven Reading Comprehension Question Bank Construction Challenge, Baseline01

Step1: Run the baseline: data processing

https://aistudio.baidu.com/projectdetail/8225663
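This project holds the main code for data processing and for testing the fine-tuned model. The code is commented in detail, so beginners can follow it easily.
The baseline can be run unmodified in the environment provided by Baidu PaddlePaddle (飞桨) AI Studio. It produces the output.jsonl file used for fine-tuning later.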

The main operations in this part are as follows:

1. Environment setup and dependency installation

!pip install pandas openpyxl

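2. Data loading and processing: building question-answer data for Chinese and English

2.1 Read the Excel file with pandas, then use replace to turn the full-width period '．' and full-width left parenthesis '（' into their ASCII forms.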

import pandas as pd
import re

# Read the Excel file, then normalize full-width punctuation to ASCII
df = pd.read_excel('训练集-语文.xlsx')
df = df.replace('．', '.', regex=True)
df = df.replace('（', '(', regex=True)
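
The normalization matters because the regular expressions in the later steps match on the ASCII characters '.' and '('. As a minimal sketch, the same substitution can be tried on a plain string without pandas. The names FULLWIDTH_TO_ASCII and normalize and the sample string are illustrative assumptions, not part of the baseline.

```python
# Mapping for the two full-width characters that the baseline replaces.
FULLWIDTH_TO_ASCII = str.maketrans({'．': '.', '（': '('})

def normalize(text):
    """Return text with '．' and '（' converted to ASCII '.' and '('."""
    return text.translate(FULLWIDTH_TO_ASCII)

# Hypothetical sample, not taken from the competition data.
sample = '1．下列说法正确的是（3分）'
normalized = normalize(sample)
print(normalized)
```

Note that the full-width right parenthesis '）' is left untouched, which matches the two replace calls in the baseline.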

2.2 Read and print the '选项' (options) column of the row with index 2, that is, the third data row

second_row_option_content = df.loc[2, '选项']
print(second_row_option_content)

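2.3 Define the functions that process the data

Define chinese_multiple_choice_questions, which uses regular expressions to parse the multiple-choice questions and their options out of the text.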

def chinese_multiple_choice_questions(questions_with_answers):
    # Input text containing the questions
    text = questions_with_answers

    # Regex patterns: one numbered question, and one lettered choice
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

    # Find all questions
    questions = question_pattern.findall(text)

    # Initialize the multiple-choice and short-answer lists
    multiple_choice_questions = []
    short_answer_questions = []

    # Process each question
    for id, question in enumerate(questions):
        # Treat the question as multiple choice if it contains a letter A-D
        if re.search(r'[A-D]', question):
            choices = choice_pattern.findall(question)
            question_text = re.split(r'\n', question.split('(')[0])[0]

            pattern_question = re.compile(r'(\d+)\.(.*)')
            # Extract the question stem and renumber it
            matches_question = str(id+1) + '.' + pattern_question.findall(question_text)[0][1]

            multiple_choice_questions.append({
                'question': matches_question,
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())
    # Only the multiple-choice questions are returned
    return multiple_choice_questions

Iterate over the first three rows of the dataset, take each row's '选项' content, and pass it through chinese_multiple_choice_questions.

questions_list = []
for data_id in range(len(df[:3])):
    second_row_option_content = df.loc[data_id, '选项']
    questions_list.append(chinese_multiple_choice_questions(second_row_option_content))
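
To see what the two patterns in chinese_multiple_choice_questions actually capture, they can be run on a small string. The sketch below copies the patterns as written in the function; the sample text is made up for illustration and is not taken from the competition data.

```python
import re

# Same patterns as in chinese_multiple_choice_questions.
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

# Hypothetical sample with two questions and four options each.
sample = (
    "1.which colour is warm?(3 pts)\nA.red\nB.blue\nC.green\nD.grey\n"
    "2.which number is even?(3 pts)\nA.one\nB.two\nC.three\nD.five"
)

# Each match runs from "<number>." up to the next "<number>." or the end.
questions = question_pattern.findall(sample)
# Each choice is a (label, text) tuple; the text keeps its leading '.'.
first_choices = choice_pattern.findall(questions[0])

print(len(questions))
print(first_choices)

# Caveat: a capital A-D inside an option ends the match early,
# so an option such as "DNA" is split into fragments.
broken = choice_pattern.findall("A.DNA\n")
print(broken)
```

The last call shows a limitation worth keeping in mind: the choice pattern treats every capital A to D as the start of a new option, so options containing those letters are split. This probably matters less for Chinese passages than for the English data.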

2.4 Parse the answers

Define chinese_multiple_choice_answers, which parses the answer text and returns the correct answer for each multiple-choice question.

def chinese_multiple_choice_answers(questions_with_answers):
    # Remove spaces and newlines so every answer has the form "<number>.<answer>"
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")

    # Match multiple-choice answers (capital letters) and short answers (anything else)
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

    # Find all matching answers
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)

    # Convert the matches to dictionaries keyed by question number
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}

    # Sort by question number
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())

    answers = []

    # Renumber the multiple-choice answers from 1; short answers are not returned
    for id in range(len(sorted_choice_answers)):
        answers.append(f"{id+1}. {sorted_choice_answers[id][1]}")
    return answers

# Read the '答案' (answer) column of the row with index 60
sample_answer_content = df.loc[60, '答案']

# Print that row's answer text
print(sample_answer_content)

chinese_multiple_choice_answers(sample_answer_content)

df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)
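
The answer parsing can be checked in isolation. Below is a condensed sketch of the multiple-choice branch of chinese_multiple_choice_answers; the helper name parse_choice_answers and both sample strings are assumptions made for illustration.

```python
import re

def parse_choice_answers(raw):
    """Condensed sketch of the multiple-choice branch (hypothetical helper)."""
    # Drop spaces and newlines, as the baseline function does.
    compact = raw.replace(" ", "").replace("\n", "")
    # "<number>.<capital letters>" marks a multiple-choice answer.
    matches = re.findall(r'(\d+)\.([A-Z]+)', compact)
    by_number = {int(number): letters for number, letters in matches}
    # Sort by the original number, then renumber from 1.
    ordered = sorted(by_number.items())
    return [f"{pos}. {letters}" for pos, (_, letters) in enumerate(ordered, start=1)]

# Answers given out of order are sorted by their original number.
unordered = parse_choice_answers("2.C 1.B\n4.D 3.A")
# Non-letter answers are skipped, and numbering restarts at 1.
renumbered = parse_choice_answers("5.A 6.B 7.略")
print(unordered)
print(renumbered)
```

The second call shows that answers are renumbered by position, so the original question numbers are not preserved in the output.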

2.5 Define get_prompt_cn, which builds the prompt for Chinese passages, then batch-process the data to generate the inputs and outputs

def get_prompt_cn(text):
    # The prompt text stays in Chinese: it asks for Chinese question stems and answers.
    prompt = f'''
你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：

### 回答要求
(1)理解文中重要概念的含义
(2)理解文中重要句子的含意
(3)分析论点、论据和论证方法


### 阅读文本
{text}
'''

    return prompt

def process_cn(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        # print(data_options)
        # print(data_answers)
        
        if(len(data_answers)==len(data_options)):
            res = ''
            for id_,question in enumerate(data_options):
                res += f'''
{question['question']}?
                '''+'\n'
                for choice in question['choices']:
                    res = res + choice[0] + choice[1] + '\n'
                res = res + '答案:' + str(data_answers[id_].split('.')[-1])  + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
        # break
    return res_input,res_output

cn_input,cn_output = process_cn(df)

len(cn_input)
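
process_cn keeps a row only when the number of parsed questions equals the number of parsed answers, then joins each question, its options, and its answer into one output string. The sketch below shows that assembly step on its own. format_sample is a hypothetical helper with a tidier layout, so its result is not byte-identical to what the baseline writes (the baseline emits extra whitespace and a space after '答案:').

```python
def format_sample(questions, answers):
    """Join parsed questions and answers into one training target string.

    Hypothetical helper: a tidier variant of the loop in process_cn.
    """
    lines = []
    for question, answer in zip(questions, answers):
        lines.append(question['question'] + '?')
        # Each choice is a (label, text) tuple such as ('A', '.red').
        lines.extend(label + text for label, text in question['choices'])
        # Keep only the letter: '1. A' becomes 'A'.
        lines.append('答案:' + answer.split('.')[-1].strip())
    return '\n'.join(lines) + '\n'

# Made-up input in the shape produced by the two parsing functions.
demo_questions = [{'question': '1.which colour is warm',
                   'choices': [('A', '.red'), ('B', '.blue')]}]
demo_answers = ['1. A']
demo_text = format_sample(demo_questions, demo_answers)
print(demo_text)
```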

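2.6 Build the English question-answer data in the same way (this produces the en_input and en_output lists used below)

3. Convert the processed data into a DataFrame and export it as a JSONL file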

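# Combine the input and output lists into a DataFrame; the first 30 Chinese and first 20 English samples are included twice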
df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

df_new

import json


# Open a file for writing JSONL, encoded as UTF-8
with open('output.jsonl', 'w', encoding='utf-8') as f:
    # Convert each row to a JSON string
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        # Write the JSON string followed by a newline
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file written")
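
Before uploading, it can be worth reading the file back to confirm that every line is a JSON object with non-empty 'input' and 'output' strings. The following is a minimal sketch using only the standard library. count_valid_records is a hypothetical helper, the demo rows are made up, and the fine-tuning platform may impose further requirements that are not checked here.

```python
import json
import os
import tempfile

def count_valid_records(path):
    """Count JSONL lines that hold non-empty 'input' and 'output' strings.

    Raises ValueError on the first line that does not qualify.
    """
    count = 0
    with open(path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: not a JSON object")
            for key in ('input', 'output'):
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    raise ValueError(f"line {line_no}: missing or empty {key!r}")
            count += 1
    return count

# Demo on a throwaway file so the sketch runs without output.jsonl.
demo_rows = [
    {'input': '阅读文本 ...', 'output': '1.题目?\nA.甲\n答案:A'},
    {'input': 'Reading text ...', 'output': '1.Question?\nA.one\nAnswer:A'},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        for row in demo_rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')
    demo_count = count_valid_records(demo_path)
print(demo_count)
```

In the notebook, calling count_valid_records('output.jsonl') should return the same number as len(df_new).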

Step2: Model fine-tuning

https://training.xfyun.cn/dataset/datasetIndex
Use the reading-question file prepared above (output.jsonl) to fine-tune the model.

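Main steps

Data upload: create a dataset, upload the prepared output.jsonl file, and convert it.
Model training: once the conversion finishes, create a model on the model management page using the new dataset and start training.
Model release: when training completes, publish the model as a service.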

Step3: Model testing

Return to the baseline environment in AI Studio and copy the following values from the service published from the trained model:

SPARKAI_APP_ID = ''
SPARKAI_API_SECRET = ''
SPARKAI_API_KEY = ''
serviceId = ''
resourceId = ''

Once these are filled in, run the notebook.

Step4: Submit the newly generated jsonl file on the competition page and wait for the score
