#AI Summer Camp #Datawhale #Fine-Tuning Large Models from Scratch - Task 1

xiabing86

Tags: Artificial Intelligence

Article link: https://blog.csdn.net/xiabing86/article/details/141073441

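The task is as follows:

Fine-tune a large model so that, given a reading passage, it generates mock Gaokao (college entrance exam) Chinese and English reading-comprehension questions with answers, in support of education.

Task 1 covers the following:

Tools/technology: Baidu PaddlePaddle (notebook) + iFLYTEK Open Platform (simple model training/publishing)

Main code walkthrough:

Install the required libraries: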

!pip install pandas openpyxl

Read the training Excel file, replace special characters, and print the third row of the "选项" (options) column as a test:

import pandas as pd
import re

# Read the Excel training file
df = pd.read_excel('训练集-语文.xlsx')
# Replace the fullwidth period and fullwidth opening parenthesis with ASCII equivalents
df = df.replace('．', '.', regex=True)
df = df.replace('（', '(', regex=True)

# Read the "选项" (options) column of the third row (index 2)
third_row_options = df.loc[2, '选项']

# Print it as a quick check
print(third_row_options)
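
To see what the character replacement does without loading the Excel file, here is a minimal sketch that applies the same two substitutions to a plain string. It uses `str.translate` from the standard library rather than `DataFrame.replace`, and the name `normalize_punctuation` is made up for this illustration.

```python
# Translation table covering only the two characters replaced in the post:
# fullwidth period -> '.', fullwidth opening parenthesis -> '('
FULLWIDTH_TO_ASCII = str.maketrans({'．': '.', '（': '('})


def normalize_punctuation(text: str) -> str:
    """Return text with the two fullwidth characters replaced by ASCII ones."""
    return text.translate(FULLWIDTH_TO_ASCII)


# The fullwidth closing parenthesis is left untouched, as in the post
print(normalize_punctuation('1．A（B）'))  # prints: 1.A(B）
```

Only the period and the opening parenthesis are normalized here because those are the only two replacements the post performs.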

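Next, use regular expressions to process the Chinese and English data separately, extracting the passage, the questions, and the answers: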

def chinese_multiple_choice_questions(questions_with_answers):
    ...  # body not shown in the original post

def chinese_multiple_choice_answers(questions_with_answers):
    ...  # body not shown in the original post
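
The bodies of these two functions are not shown, so the following is only a rough sketch of how such parsing could work, not the author's implementation. It assumes questions are numbered like `1.` at the start of a line, options are labelled `A.` to `D.` on their own lines, and answers look like `1.A`. The names `parse_questions_sketch` and `parse_answers_sketch` are hypothetical; the return shapes (a list of dicts with `question` and `choices`, and a list of `'1.A'`-style strings) are inferred from how `process_cn` consumes them.

```python
import re

# A question starts with "<number>." at the beginning of a line and runs
# until the next numbered line or the end of the text.
QUESTION_RE = re.compile(r'^\s*(\d+)\.\s*(.*?)(?=^\s*\d+\.|\Z)', re.M | re.S)
# An option is a line starting with A-D; the captured text keeps its leading
# dot so that label + text reproduces "A.xxx".
CHOICE_RE = re.compile(r'^\s*([A-D])\s*(\..*?)\s*$', re.M)
# An answer is "<number>.<letter>", e.g. "1.A".
ANSWER_RE = re.compile(r'(\d+)\s*\.\s*([A-D])')


def parse_questions_sketch(raw):
    """Split raw text into [{'question': str, 'choices': [(label, text), ...]}]."""
    questions = []
    for match in QUESTION_RE.finditer(raw):
        body = match.group(2)
        stem = body.split('\n', 1)[0].strip()  # first line is the question stem
        questions.append({'question': stem, 'choices': CHOICE_RE.findall(body)})
    return questions


def parse_answers_sketch(raw):
    """Return answers as strings such as '1.A'."""
    return [f'{num}.{letter}' for num, letter in ANSWER_RE.findall(raw)]


sample = (
    '1.Which statement is correct ( )\nA.alpha\nB.beta\nC.gamma\nD.delta\n'
    '2.Which statement is wrong ( )\nA.one\nB.two\nC.three\nD.four'
)
parsed = parse_questions_sketch(sample)
print(len(parsed), parsed[0]['question'], parsed[0]['choices'][0])
print(parse_answers_sketch('1.A 2.C'))
```

Real exam data is messier than this sample (options on one line, numbers inside the stem, and so on), so the patterns would need adjusting against the actual spreadsheet.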

Prompt structure:

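def get_prompt_cn(text):
    # The prompt stays in Chinese because the passages and expected questions are Chinese.
    # In English: "You are an expert writer of Gaokao multiple-choice questions. Based on the
    # reading passage, write 4 single-answer questions with options and answers. Do not repeat
    # the passage. Each question has 1 stem and 4 options, with exactly 1 correct answer."
    # The three listed requirements: understand key concepts, understand key sentences,
    # and analyze the claims, evidence, and methods of argument.
    prompt = f'''
    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：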
    
    ### 回答要求
    (1)理解文中重要概念的含义
    (2)理解文中重要句子的含意
    (3)分析论点、论据和论证方法
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

Combine the data with the prompt to build training pairs from the passage, the questions, and the answers:

def process_cn(df):
    res_input = []
    res_output = []
    for idx in range(len(df)):
        # Columns: '选项' = questions with options, '答案' = answers, '阅读文本' = reading passage
        data_options = df.loc[idx, '选项']
        data_answers = df.loc[idx, '答案']
        data_prompt = df.loc[idx, '阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)

        # Keep the row only if the number of parsed answers matches the number of questions
        if len(data_answers) == len(data_options):
            res = ''
            for q_idx, question in enumerate(data_options):
                res += f'''
{question['question']}?
                ''' + '\n'
                for choice in question['choices']:
                    res = res + choice[0] + choice[1] + '\n'
                # Keep only the part of the answer after the last '.'
                res = res + '答案:' + str(data_answers[q_idx].split('.')[-1]) + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output
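
To make the target format easier to picture, here is a simplified sketch of the string that is built for one row. It is a variant, not the function above: it drops the extra blank line and padding whitespace that the f-string adds, and `format_questions_sketch` plus the sample data are invented for illustration.

```python
def format_questions_sketch(questions, answers):
    """Render parsed questions and answers as one training target string."""
    lines = []
    for question, answer in zip(questions, answers):
        lines.append(question['question'] + '?')
        for label, text in question['choices']:
            lines.append(label + text)
        # '1.B' -> 'B': keep only the part after the last dot
        lines.append('答案:' + answer.split('.')[-1])
    return '\n'.join(lines) + '\n'


questions = [{
    'question': 'Which word fits',
    'choices': [('A', '.red'), ('B', '.blue'), ('C', '.green'), ('D', '.black')],
}]
answers = ['1.B']
target = format_questions_sketch(questions, answers)
print(target)
```

Each question therefore becomes a stem line, four option lines, and a final `答案:` (answer) line.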

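The English questions are handled the same way as the Chinese ones, so they are not covered here.

Finally, convert the data to input/output format and write it to a JSON Lines file for LoRA fine-tuning on the iFLYTEK platform: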

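# cn_input/cn_output are presumably the lists returned by process_cn(df);
# en_input/en_output are their English counterparts (that code is not shown).
# The first 30 Chinese samples and the first 20 English samples appear twice.
df_new = pd.DataFrame({
    'input': cn_input + cn_input[:30] + en_input + en_input[:20],
    'output': cn_output + cn_output[:30] + en_output + en_output[:20],
})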

df_new


import json

# Open a file for writing JSONL, encoded as UTF-8
with open('output.jsonl', 'w', encoding='utf-8') as f:
    # Convert each row to JSON
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        # Write the JSON string followed by a newline
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file generated")
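
Before uploading, it can help to read the file back and confirm that every line parses and has the expected keys. The sketch below does this round trip on two made-up records in a temporary directory; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the only format assumed is the `input`/`output` pair described above.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_jsonl(path):
    """Parse every non-empty line of a JSONL file into a dict."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {'input': 'prompt with passage 1', 'output': 'questions and answers 1'},
    {'input': '提示词 2', 'output': '题目与答案 2'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.jsonl')
    write_jsonl(records, path)
    loaded = read_jsonl(path)

# Every line must parse and carry exactly the two expected keys
assert loaded == records
assert all(set(record) == {'input', 'output'} for record in loaded)
print(f'{len(loaded)} records verified')
```

The same `read_jsonl` check could be pointed at `output.jsonl` once the real file has been generated.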

Finally, here is the score:
