Datawhale AI Summer Camp: Building a Reading-Comprehension Question Bank with the Spark LLM

Article link: https://blog.csdn.net/qq_51309289/article/details/141221315

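1. Competition Background

With the rapid development of artificial intelligence, large models are becoming a key force behind personalized learning and intelligent education. In language teaching in particular, using a large model for Question Answer Generation (QAG) can effectively support the process of writing QA-type exam questions. Through this competition, we hope to spark developers' creativity and to improve both the quality and the efficiency of education in China.

The competition is built on large-model fine-tuning: participants fine-tune a model for QAG on Gaokao (college entrance exam) Chinese modern-text reading and English reading, so that it takes an article as input and outputs questions with answers. The organizers provide participating teams with a free model fine-tuning platform.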

Competition link: https://challenge.xfyun.cn/h5/detail?type=question-bank-construction&option=phb&ch=dw24_E7x9gl

Tutorial: https://linklearner.com/activity/14/12/36

Official Baseline: https://aistudio.baidu.com/projectdetail/8225663

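2. Dataset

The dataset consists of a training set and a test set, covering both English and Chinese reading comprehension. The first column holds the reading text, the second the questions with their options, and the third the answers. The Chinese training set (训练集-语文) contains multiple-choice and short-answer questions, while the English training set (训练集-英语) contains only multiple-choice questions. Only multiple-choice questions are fed into fine-tuning.

3. Competition Baseline

3.1 Data Preprocessing

Read the Chinese training set Excel file (训练集-语文.xlsx) and replace the full-width punctuation marks '（' and '．' throughout the data with their half-width counterparts '(' and '.'. Then read the option data.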

import pandas as pd
import re

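# Read the Excel file
df = pd.read_excel('训练集-语文.xlsx')
# Replace full-width punctuation with its half-width equivalent
df = df.replace('．', '.', regex=True)
df = df.replace('（', '(', regex=True)


# Read the '选项' (options) column of the row at index 2 (the third data row)
second_row_option_content = df.loc[2, '选项']

# Print that row's options
print(second_row_option_content)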
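As a rough illustration of the normalization step, the sketch below applies the same idea to a single string with str.translate rather than DataFrame.replace. The helper name normalize_punctuation and the sample text are made up for this example, and the extra mapping for the closing parenthesis '）' is an assumption that the original code does not include.

```python
# Map each full-width character to its half-width counterpart.
# The '）' entry is an addition for illustration; the baseline only maps '．' and '（'.
FULLWIDTH_TO_HALFWIDTH = str.maketrans({'．': '.', '（': '(', '）': ')'})


def normalize_punctuation(text):
    """Return text with the mapped full-width punctuation replaced."""
    return text.translate(FULLWIDTH_TO_HALFWIDTH)


sample = '1．下列理解正确的一项是（3分）'
normalized = normalize_punctuation(sample)
print(normalized)
```

Normalizing before parsing matters because the regular expressions in the following sections look for the half-width '.' and '(' characters.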

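3.2 Question Extraction

Parse the text that contains the Chinese questions and extract the multiple-choice and short-answer questions from it.

Regular expression definitions: re.compile defines two patterns; question_pattern matches the full text of one question, and choice_pattern matches the options within a question.
Question extraction: question_pattern.findall finds every individual question in the text.
Question classification: two lists are initialized, one for multiple-choice questions and one for short-answer questions.
Question traversal and analysis: each question is visited with enumerate, which yields both its index and its text. A question is treated as multiple-choice if its text contains any of the letters A-D (the option labels).
Multiple-choice handling: choice_pattern.findall extracts all options; re.split and re.compile are then used to extract and format the question text, and a dictionary holding the renumbered question text and its options is built.
Short-answer handling: for any other question (i.e., short-answer), leading and trailing whitespace is stripped and the text is appended to the short-answer list.

Return value: the function returns a list with the details of every multiple-choice question, each a dictionary with the question and its options. Short-answer questions are collected as auxiliary information but not returned; they can be processed further if needed.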

def chinese_multiple_choice_questions(questions_with_answers):
    # Input text containing the questions
    text = questions_with_answers

    # Regular expression patterns
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

    # Find all questions
    questions = question_pattern.findall(text)

    # Initialize the multiple-choice and short-answer lists
    multiple_choice_questions = []
    short_answer_questions = []

    # Process each question
    for idx, question in enumerate(questions):
        # Treat the question as multiple-choice if it contains an option label
        if re.search(r'[A-D]', question):

            choices = choice_pattern.findall(question)
            # Keep the first line of the stem, cutting at the first '('
            question_text = re.split(r'\n', question.split('(')[0])[0]

            pattern_question = re.compile(r'(\d+)\.(.*)')
            # Extract the question text and renumber it starting from 1
            matches_question = str(idx+1)+'.'+ pattern_question.findall(question_text)[0][1]
            # print(str(idx+1)+'.'+matches_question)

            multiple_choice_questions.append({
                'question': matches_question,
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())
    return multiple_choice_questions
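To see how question_pattern splits a cell into questions and how the A-D check classifies them, here is a minimal sketch on an invented sample. The split_questions helper and the sample text are hypothetical; only the two regular expressions are taken from the function above.

```python
import re

# Same pattern as in the function above: a number, a period, then everything
# up to the next "number + period" or the end of the text.
QUESTION_PATTERN = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)


def split_questions(text):
    """Split a raw cell into individual question strings."""
    return [q.strip() for q in QUESTION_PATTERN.findall(text)]


# Invented sample: one multiple-choice question and one short-answer question.
sample = '1.下列说法正确的是(3分)\nA.甲\nB.乙\nC.丙\nD.丁\n2.简述作者观点。'
parts = split_questions(sample)

# A question counts as multiple-choice if any of the letters A-D appears in it.
is_choice = [bool(re.search(r'[A-D]', q)) for q in parts]
print(parts)
print(is_choice)
```

Two limits of this heuristic are worth keeping in mind: a digit followed by a period inside an option (for example "3.5") starts a new "question", and a short-answer question that happens to contain an uppercase A-D is classified as multiple-choice.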

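3.3 Answer Extraction

Extract the answers to the multiple-choice and short-answer questions from the answer text.

Text cleaning: replace removes all spaces and newlines from the string to simplify regex matching.
Regular expression definitions:
- choice_pattern matches multiple-choice answers: a number and a period followed by one or more uppercase letters.
- short_pattern matches short-answer answers: a number and a period followed by characters that are not uppercase letters.
Answer matching: findall is applied with each pattern to find all multiple-choice and short-answer answers.
Dictionary conversion: the matches are converted to dictionaries whose keys are the question numbers (as integers) and whose values are the corresponding answers.
Sorting: sorted orders the items of both dictionaries by question number.
Result list: an empty list, answers, is initialized to hold the final sorted answer strings.
Output: the sorted multiple-choice answers are looped over; each is renumbered from 1, formatted as a string, and appended to answers.
Return value: the function returns the answers list, which contains all multiple-choice answers ordered by question number.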

def chinese_multiple_choice_answers(questions_with_answers):
    # Remove spaces and newlines so the patterns match across line breaks
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")

    # print(questions_with_answers)
    # Regular expressions for multiple-choice and short-answer answers
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

    # Find all matching answers
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)

    # Convert the matches to dictionaries keyed by question number
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}

    # Sort by question number
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())

    answers = []

    # Collect the multiple-choice answers, renumbered from 1
    for idx in range(len(sorted_choice_answers)):
        answers.append(f"{idx+1}. {sorted_choice_answers[idx][1]}")
    return answers
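The behaviour of the two answer patterns is easiest to see on a small input. The sketch below runs them on an invented answer cell; the split_answers helper and the sample are hypothetical, while the patterns and the whitespace stripping mirror the function above.

```python
import re

CHOICE_PATTERN = re.compile(r'(\d+)\.([A-Z]+)')   # e.g. "1.B" or "2.AC"
SHORT_PATTERN = re.compile(r'(\d+)\.([^A-Z]+)')   # e.g. "3.作者认为……"


def split_answers(raw):
    """Return (choice_answers, short_answers) dictionaries keyed by question number."""
    compact = raw.replace(' ', '').replace('\n', '')
    choice = {int(n): a for n, a in CHOICE_PATTERN.findall(compact)}
    short = {int(n): a for n, a in SHORT_PATTERN.findall(compact)}
    return choice, short


# Invented sample: two multiple-choice answers and one short answer.
sample = '1. B\n2. AC\n3. 作者认为……'
choice_answers, short_answers = split_answers(sample)
print(choice_answers)
print(short_answers)
```

Note that a short answer containing an uppercase letter would be cut off at that letter, since short_pattern stops at the first character in A-Z. This does not affect the baseline, which only returns the multiple-choice answers.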

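3.4 Generating Prompts and Responses (Chinese)

The prompt, which forms the input part, combines the instructions with the reading material.

Function definition: get_prompt_cn takes one parameter, text, expected to be a Chinese reading passage.
Formatted string: the function uses a Python f-string (formatted string literal) to build a multi-line string, prompt, that carries the guidance and requirements for the question writer.
Prompt content: the prompt is laid out with a heading and subheadings that spell out the points the question writer must follow: understanding the important concepts in the text, understanding the meaning of key sentences, and analyzing the claims, evidence, and argumentation methods.
Reading text: at the end of the prompt, the {text} placeholder inserts the text argument, so the question writer sees the reading material directly.
Return value: the function returns the assembled prompt string.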

def get_prompt_cn(text):
    prompt = f'''
    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不⽤给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：
    
    ### 回答要求
    (1)理解文中重要概念的含义
    (2)理解文中重要句子的含意
    (3)分析论点、论据和论证方法
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

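The process_cn function processes the rows of the DataFrame and produces the formatted prompts (inputs) and the matching question-and-answer texts (outputs).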

def process_cn(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        # print(data_options)
        # print(data_answers)
        
        if(len(data_answers)==len(data_options)):
            res = ''
            for id_,question in enumerate(data_options):
                res += f'''
{question['question']}?
                '''+'\n'
                for choise in question['choices']:
                    res = res+ choise[0] + choise[1]+ '\n'
                res = res + '答案:' + str(data_answers[id_].split('.')[-1])  + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
        # break
    return res_input,res_output
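The per-question formatting inside process_cn can be sketched in isolation as follows. This is a simplified illustration, not the baseline's exact output layout: the format_qa helper and the sample data are hypothetical, the extra '?' and padding whitespace are omitted, and the answer is stripped of the leading space that split('.')[-1] leaves behind for strings such as '1. B'.

```python
def format_qa(question, answer_line):
    """Render one parsed question and its answer as a block of text.

    question    -- dict with 'question' (str) and 'choices' (list of (label, text))
    answer_line -- string such as '1. B', as returned by the answer extractor
    """
    # Keep only what follows the last period, without surrounding whitespace.
    answer = answer_line.split('.')[-1].strip()
    lines = [question['question']]
    lines.extend(label + text for label, text in question['choices'])
    lines.append('答案:' + answer)
    return '\n'.join(lines)


# Invented sample in the shape produced by the two extraction functions.
sample_question = {
    'question': '1.下列说法正确的一项是',
    'choices': [('A', '.甲'), ('B', '.乙'), ('C', '.丙'), ('D', '.丁')],
}
formatted = format_qa(sample_question, '1. B')
print(formatted)
```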

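English data is processed in much the same way as the Chinese data, so it is not covered in detail here; see the linked code for the specifics.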

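3.5 Model Fine-Tuning

Save the Chinese and English data generated above as output.jsonl and submit it for fine-tuning on the following platform:
Large Model Custom Training Platform (大模型定制训练平台)
For the fine-tuning steps, see the tutorial: Datawhale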
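The export to output.jsonl mentioned above can be sketched as follows. The write_jsonl helper and the demo data are hypothetical, and the "input" and "output" field names are an assumption based on the sample record shown in section 3.6; the schema the platform actually requires should be checked against its documentation.

```python
import json
import os
import tempfile


def write_jsonl(inputs, outputs, path):
    """Write one JSON object per line, pairing each prompt with its target text."""
    with open(path, 'w', encoding='utf-8') as f:
        for prompt, answer in zip(inputs, outputs):
            record = {'input': prompt, 'output': answer}
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


# Invented stand-ins for the lists returned by the processing functions.
demo_inputs = ['prompt 1', 'prompt 2']
demo_outputs = ['1.问题?\nA.甲\n答案:A', '1.Question?\nA. x\nanswer:A']

demo_path = os.path.join(tempfile.mkdtemp(), 'output.jsonl')
write_jsonl(demo_inputs, demo_outputs, demo_path)

# Read the file back to confirm that every line is a valid JSON record.
with open(demo_path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
print(len(records))
```

With the real data, the Chinese and English input and output lists would be concatenated before the call.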

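The competition requires the spark-13b model. The basic fine-tuning requirement is: given a reading text as input, return 4 Q&A groups (single-choice questions); for the output format, see the "output sample data" (输出样例数据).

Teams can use the open-source tool GitHub - iflytek/spark-ai-python (the Python SDK for the Spark large model) to test the output of the fine-tuned model.

This run used LoRA-based fine-tuning with the parameters below; training took 60 minutes: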

3.6 Local Model Inference

!pip install --upgrade spark_ai_python

Define a custom evaluation prompt:

prompt = {"input": "\n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不⽤给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。\nThe reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:\n    \n    ### 回答要求\n    (1)Understanding the main idea of the main idea.\n    (2)Understand the specific information in the text.\n    (3)infering the meaning of words and phrases from the context\n    \n    \n    ### 阅读文本\n    Bike Rental & Guided Tours Welcome to Amsterdam, welcome to MacBike. You see much more from the seat of a bike! Cycling is the most\neconomical, sustainable and fun way to explore the city, with its beautiful canals, parks, squares and countless lights.\nYou can also bike along lovely landscapes outside of Amsterdam.\nWhy MacBike MacBike has been around for almost 30 years and is the biggest bicycle rental company in Amsterdam. With over 2,500 bikes stored in our five rental shops at strategic locations, we make sure there is always a bike available for you. We offer the newest bicycles in a wide variety, including basic bikes with foot brake (AU 4), bikes with hand\nbrake and gears (HI-I'4), bikes with child seats, and children's bikes.                                                                       Price: 1 hour, 3 hours, 1 day(24hours), Each additional day                                                                                          Hand Brake, Three Gears: €7.50, €11.00, €14.75, €8.00                                                       Foot Brake, No Gears: €5.00, €7.50, €9.75, €6.00                     The 2.5-hour tour covers the Gooyer Windmill, the Skinny Bridge, the Rijksmuseum, Heineken Brewery and much more. The tour departs from D.m Square every hour on the hour, starting at 1:00 pm every day. You can buy\nyour ticket in a MacBike shop or book online.\n    ", "output": "\n                1. What is an advantage of MacBike?\n                A. It gives children a discount.\n                B. It of offers many types of bikes.\n                C. It organizes free cycle tours.\n                D. It has over 2,500 rental shops.\n                answer:B\n                \n\n                2. How much do you pay for renting a bike with hand brake and three gears for two days?\n                A. €15.75.\n                B. €19.50.\n                C. €22.75.\n                D. €29.50.\n                answer:D\n                \n\n                3. Where does the guided city tour start?\n                A. The Gooyer, Windmill.\n                B. The Skinny Bridge.\n                C. Heineken Brewery.\n                D. D.m Square.\n                answer:D\n                \n"}
prompt = prompt['input']

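Replace the SPARKAI_APP_ID, SPARKAI_API_SECRET, SPARKAI_API_KEY, serviceId, and resourceId values in the code below with the parameters of your own fine-tuned model, which are shown after the model is published in the iFLYTEK model service.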

from sparkai.llm.llm import ChatSparkLLM, ChunkPrintHandler
from sparkai.core.messages import ChatMessage

SPARKAI_URL = 'wss://xingchen-api.cn-huabei-1.xf-yun.com/v1.1/chat'
# Credentials for calling the Spark model; following the Feishu document, look them up in the iFLYTEK fine-tuning console (https://training.xfyun.cn/modelService)
SPARKAI_APP_ID = 'd267e690'
SPARKAI_API_SECRET = 'OWY4MzIxZDViMjZhYTJmMGM0NmE2MDFl'
SPARKAI_API_KEY = 'ee818f1ad22392da20b1057960e9c27e'
serviceId = 'xspark13b6k'  
resourceId = '5115672164671488'

if __name__ == '__main__':
    spark = ChatSparkLLM(
        spark_api_url=SPARKAI_URL,
        spark_app_id=SPARKAI_APP_ID,
        spark_api_key=SPARKAI_API_KEY,
        spark_api_secret=SPARKAI_API_SECRET,
        spark_llm_domain=serviceId,
        model_kwargs={"patch_id": resourceId},
        streaming=False,
    )
    messages = [ChatMessage(
        role="user",
        content=prompt
    )]
    handler = ChunkPrintHandler()
    a = spark.generate([messages], callbacks=[handler])
    print(a.generations[0][0].text)
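Since the task requires 4 Q&A groups per passage, it can help to check the generated text before submitting. The sketch below is one possible check, assuming the output follows the "answer:X" or "答案:X" convention seen in the sample record and in the training data; check_output and the sample text are hypothetical, and the official scoring may use different criteria.

```python
import re

# A question starts a line with a number and a period, e.g. "1. What ...".
QUESTION_START = re.compile(r'^\s*\d+\.', re.MULTILINE)
# An answer line looks like "answer:B" or "答案:B" (half- or full-width colon).
ANSWER_LINE = re.compile(
    r'^\s*(?:answer|答案)\s*[:：]\s*([A-D])\s*$',
    re.MULTILINE | re.IGNORECASE,
)


def check_output(generated, expected=4):
    """Return (is_valid, answers) for a generated Q&A text."""
    n_questions = len(QUESTION_START.findall(generated))
    answers = ANSWER_LINE.findall(generated)
    is_valid = n_questions == expected and len(answers) == expected
    return is_valid, answers


# Invented sample with four questions in the expected layout.
sample = ''.join(
    f'{i}. Question {i}?\nA. a\nB. b\nC. c\nD. d\nanswer:B\n\n' for i in range(1, 5)
)
ok, answers = check_output(sample)
print(ok, answers)
```

In practice the text returned by the model (a.generations[0][0].text in the code above) would be passed to check_output in place of the sample.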