Datawhale AI 夏令营-星火大模型驱动阅读理解题库构建挑战赛-CSDN博客

本文链接：https://blog.csdn.net/2201_75858338/article/details/141039946

星火大模型驱动阅读理解题库构建挑战赛baseline01

1.数据处理

1.1 语文问答数据制作

In [2]

!pip install pandas openpyxl

Looking in indexes: https://mirror.baidu.com/pypi/simple/, https://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: pandas in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (2.2.2)
WARNING: Skipping page https://mirror.baidu.com/pypi/simple/openpyxl/ because the GET request got Content-Type: application/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
Collecting openpyxl
  Downloading https://mirrors.aliyun.com/pypi/packages/c0/da/977ded879c29cbd04de313843e76868e6e13408a94ed6b987245dc7c8506/openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 250.9/250.9 kB 2.5 MB/s eta 0:00:00a 0:00:01
Requirement already satisfied: numpy>=1.22.4 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from pandas) (2024.1)
Collecting et-xmlfile (from openpyxl)
  Downloading https://mirrors.aliyun.com/pypi/packages/96/c2/3dd434b0108730014f1b96fd286040dc3bcb70066346f7e01ec2ac95865f/et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Requirement already satisfied: six>=1.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.5

In [3]

# coding~

import pandas as pd
import re

# 读取Excel文件
df = pd.read_excel('训练集-语文.xlsx')
df = df.replace('．', '.', regex=True)
df = df.replace('（', '(', regex=True)


# 读取第二行（即第三行）“选项”列的内容
second_row_option_content = df.loc[2, '选项']

# 显示第二行“选项”列的内容
print(second_row_option_content)

7.下列对文本相关内容和艺术特色的分析鉴赏，不正确的一项是(3分)
A.作者在写南国的风物时，用了“那一块一块的稻田”那-堆-堆的房屋”等，语言的节奏感符合火
车行进时的动态感。
B.作者认为车过潭江的部分是“新宁铁路中的一段最美丽的工程”，既在于这里风景的优美，更在于
工程体现了机械的诗意。
C.作者认为如果只把“月夜”“ 花朝”“青山” 一类的东西当作写诗的材料，其实是不懂诗，依
据是这些材料本身缺乏生命力.
D.“诗应该给人以创造的喜悦，诗应该散布生命”是作者对诗的认识，也是他认为机械具有诗意的一一
个重要前提.
8.本文在写“机械的诗”时再写到工人，请简要分析二者之间的内在联系。(6分)
9.这篇随笔的最后段跳转到作者在上海的生活见闻，这样写有什么好处?请结合文本简要分析.

In [73]

def chinese_multiple_choice_questions(questions_with_answers):
    # 输入的题目文本
    text = questions_with_answers

    # 正则表达式模式
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

    # 找到所有问题
    questions = question_pattern.findall(text)

    # 初始化选择题和简答题列表
    multiple_choice_questions = []
    short_answer_questions = []

        # 处理每个问题
    for id,question in enumerate(questions):
        # 检查是否是选择题
        if re.search(r'[A-D]', question):
            
            choices = choice_pattern.findall(question)
            question_text = re.split(r'\n', question.split('(')[0])[0]
            
            
            pattern_question = re.compile(r'(\d+)\.(.*)')
            matches_question = str(id+1)+'.'+ pattern_question.findall(question_text)[0][1] # 取出问题后重排序
            # print(str(id+1)+'.'+matches_question)
            
            multiple_choice_questions.append({
                'question': matches_question,
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())
    return multiple_choice_questions

In [74]

questions_list = []
for data_id in range(len(df[:3])):
    second_row_option_content = df.loc[data_id, '选项']
    questions_list.append(chinese_multiple_choice_questions(second_row_option_content))

In [75]

def chinese_multiple_choice_answers(questions_with_answers):
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")
    
    # print(questions_with_answers)
    # 使用正则表达式匹配答案
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

    # 找到所有匹配的答案
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)

    # 将匹配结果转换为字典
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}

    # 按序号重新排序
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())
    
    answers = []

    # 输出结果
    
    # print("选择题答案：")
    for id in range(len(sorted_choice_answers)):
        answers.append(f"{id+1}. {sorted_choice_answers[id][1]}")
    return answers

In [76]

# 读取第二行（即第三行）“选项”列的内容
second_row_option_content = df.loc[60, '答案']

# 显示第二行“选项”列的内容
print(second_row_option_content)


chinese_multiple_choice_answers(second_row_option_content)

4.B
5.窗子既是指现实世界中的窗子，可以是铁纱窗，或者是玻璃窗；窗子又是指隔绝自己生活与他人世界的象征。
有的人坐在窗子里面，有的人行走在窗子外面，而一扇窗子隔绝出来的，是两个截然不同的世界，窗外的人固然不了解窗里的人，窗里的人，也永远不能了解窗外的人。
6.你、我的代表了两种不同的人生视角，观看自己生活的视角和观看他人生活的视角。窗外是劳作、奔波、挣扎、穷苦，窗内是奢侈、悠闲、烦闷、无聊。这是两个世界，两种生活。蕴含着作者的态度：窗里窗外是两个世界，窗外的人无法理解窗内，窗内的人也无法走进窗外，我们只能以一个旁观者的角度对待世界，不要以为自己真正的解了什么而私下满足，“天知道那是罪过”。

['1. B']

In [77]

df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)

In [78]

def get_prompt_cn(text):
    prompt = f'''
    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不⽤给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：
    
    ### 回答要求
    (1)理解文中重要概念的含义
    (2)理解文中重要句子的含意
    (3)分析论点、论据和论证方法
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

In [105]

def process_cn(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        # print(data_options)
        # print(data_answers)
        
        if(len(data_answers)==len(data_options)):
            res = ''
            for id_,question in enumerate(data_options):
                res += f'''
{question['question']}?
                '''+'\n'
                for choise in question['choices']:
                    res = res+ choise[0] + choise[1]+ '\n'
                res = res + '答案:' + str(data_answers[id_].split('.')[-1])  + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
        # break
    return res_input,res_output

In [106]

cn_input,cn_output = process_cn(df)

In [107]

len(cn_input)

1.2 英语问答制作

In [108]

# coding~

import pandas as pd

# 读取Excel文件
df = pd.read_excel('训练集-英语.xlsx')
df = df.replace('．', '.', regex=True).replace('А.', 'A.', regex=True).replace('В.', 'B.', regex=True).replace('С.', 'C.', regex=True).replace('D.', 'D.', regex=True)
# df = df.replace('（', '(', regex=True)

# 读取第二行（即第三行）“选项”列的内容
second_row_option_content = df.loc[0, '选项']

# 显示第二行“选项”列的内容
print(second_row_option_content)

21. What is an advantage of MacBike?
A. It gives children a discount.
B. It of offers many types of bikes.
C. It organizes free cycle tours.
D. It has over 2,500 rental shops.
22. How much do you pay for renting a bike with hand brake and three gears for two days?
A. €15.75.
B. €19.50.
C. €22.75.
D. €29.50.
23. Where does the guided city tour start?
A. The Gooyer, Windmill.
C. Heineken Brewery.
B. The Skinny Bridge.
D. D.m Square.

In [109]


def remove_whitespace_and_newlines(input_string):
    # 使用str.replace()方法删除空格和换行符
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

In [110]

import re

# 示例文本
text = """
32. B. The underlying logic of the effect.                                                   33.D. estimates were not fully independent.
34.C. The discussion process.            35.D. Approving.
"""
def get_answers(text):
    text = remove_whitespace_and_newlines(text)
    # 正则表达式模式
    pattern = re.compile(r'(\d)\s*([A-D])')

    # 查找所有匹配项
    matches = pattern.findall(text)
    res = []
    # 打印结果
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

In [111]

# 示例输入
input_string = "28. A. It is simple and plain.              29. D. Influential.                                30. D.33%.                                             31. B. Male chefs on TV programmes."
res = get_answers(input_string)
print(res)

['A', 'D', 'D', 'B']

In [112]

import re

# 示例文本
text = second_row_option_content

def get_questions(text):
    text = text.replace('\n', '  ')+'  '
    # print(text)
    # 正则表达式模式
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)

    # 查找所有匹配项
    matches = pattern.findall(text)

    # 存储结果的字典列表
    questions_dict_list = []

    # 打印结果
    for match in matches:
        question, option1, option2, option3, option4 = match
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        
        # 提取选项字母和内容
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    return questions_dict_list

# 调用函数并打印结果
questions = get_questions(text)
for q in questions:
    print(q)

{'question': ' What is an advantage of MacBike?', 'options': {'A': 'A. It gives children a discount.', 'B': 'B. It of offers many types of bikes.', 'C': 'C. It organizes free cycle tours.', 'D': 'D. It has over 2,500 rental shops.'}}
{'question': ' How much do you pay for renting a bike with hand brake and three gears for two days?', 'options': {'A': 'A. €15.75.', 'B': 'B. €19.50.', 'C': 'C. €22.75.', 'D': 'D. €29.50.'}}
{'question': ' Where does the guided city tour start?', 'options': {'A': 'A. The Gooyer, Windmill.', 'B': 'B. The Skinny Bridge.', 'C': 'C. Heineken Brewery.', 'D': 'D. D.m Square.'}}

In [113]

def get_prompt_en(text):
    prompt = f'''
    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不⽤给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:
    
    ### 回答要求
    (1)Understanding the main idea of the main idea.
    (2)Understand the specific information in the text.
    (3)infering the meaning of words and phrases from the context
    
    
    ### 阅读文本
    {text}
    '''
    
    return prompt

In [114]

def process_en(df): 
    res_input = []
    res_output = []
    for id in range(len(df)):
        data_options = df.loc[id, '选项']
        data_answers = df.loc[id,'答案']
        data_prompt = df.loc[id,'阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        # print(data_options)
        # print(data_answers)

        if(len(data_answers)==len(data_options)):
            res = ''
            for id,question in enumerate(data_options):
                res += f'''
                {id+1}.{question['question']}
                {question['options']['A']}
                {question['options']['B']}
                {question['options']['C']}
                {question['options']['D']}
                answer:{data_answers[id]}
                '''+'\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input,res_output
    # break

In [115]

en_input,en_output = process_en(df)

In [116]


# 将两个列表转换为DataFrame
df_new = pd.DataFrame({'input': cn_input+cn_input[:30]+en_input+en_input[:20], 'output': cn_output+cn_output[:30]+en_output+en_output[:20]})

df_new

                                                 input  \
0    \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
1    \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
2    \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
3    \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
4    \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
..                                                 ...   
147  \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
148  \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
149  \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
150  \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   
151  \n    你是⼀个⾼考选择题出题专家，你出的题有⼀定深度，你将根据阅读文本，出4道单项选择...   

                                                output  
0    \n1.下列关于原文内容的理解和分析,正确的一项是?\n                \n...  
1    \n1.下列对原文相关内容的理解和分析,正确的一项是?\n                \...  
2    \n1.下列对文本相关内容和艺术特色的分析鉴赏，不正确的一项是?\n            ...  
3    \n1.下列关于原文内容的理解和分析，不正确的一-项是?\n                ...  
4    \n1.下列对原文相关内容的理解和分析，不正确的一项是?\n                ...  
..                                                 ...  
147  \n                1. What makes the applicatio...  
148  \n                1. Where is the Welsh Proms ...  
149  \n                1. How did the cockatoos get...  
150  \n                1. Which of the following be...  
151  \n                1. What is the first paragra...  

[152 rows x 2 columns]

In [117]

import json


# 打开一个文件用于写入 JSONL，并设置编码为 UTF-8
with open('output.jsonl', 'w', encoding='utf-8') as f:
    # 遍历每一行并将其转换为 JSON
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False,)
        # 将 JSON 字符串写入文件，并添加换行符
        f.write(row_json + '\n')

# 打印确认信息
print("JSONL 文件已生成")

接下来参考Docs点击进入。