Datawhale AI Summer Camp: Spark LLM-Driven Reading Comprehension Question Bank Construction Challenge, LLM Fine-Tuning in Practice

ky93

Published 2024-08-11 23:34:25

Tags: artificial intelligence, AIGC, language models, spark, spark-ml

Article link: https://blog.csdn.net/weixin_73630100/article/details/141114193

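Background

"Getting Started with LLM Fine-Tuning from Scratch" is a learning activity in the fourth session of Datawhale's 2024 AI Summer Camp (the "LLM technology" track). It is a hands-on course built around the "Spark LLM-Driven Reading Comprehension Question Bank Construction Challenge" hosted on the iFLYTEK Open Platform.
- Suitable for learners who want to understand how fine-tuning works, learn how to prepare fine-tuning data, and learn how to generate exam QA pairs with a fine-tuned LLM.

Course summary: using the iFLYTEK LLM custom training platform and the spark-13b model, fine-tune a model that generates question-answer (QA) pairs for gaokao Chinese modern-text reading passages and English reading passages.

Task-driven learning

These notes follow the LLM fine-tuning track of the Datawhale summer camp.

Competition entered: Spark LLM-Driven Reading Comprehension Question Bank Construction Challenge

What the competition involves in practice:

Fine-tune the spark-13b model, then publish it as a service API that the competition platform calls for evaluation.

LLM fine-tuning in practice

Basic steps:

- Process the data
- Upload the data
- Fine-tune the model (on the dataset prepared in the previous steps)
- Publish the model (this generates the API)
- Hand the API to the platform for testing

Main application in this exercise

Fine-tune the LLM on data enriched for one specific task (building a reading-comprehension question bank) to improve its actual performance when tested on that task.

Baseline used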

https://aistudio.baidu.com/projectdetail/8225663

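A closer look at the code

Dataset preprocessing (using the "Chinese" data as the example)

Environment setup (for the Chinese Q&A data; pandas and related libraries handle the data operations)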

    !pip install pandas openpyxl

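Load the data (read it in with pandas)

    import pandas as pd
    import re

    # Read the Excel file
    df = pd.read_excel('训练集-语文.xlsx')
    # Normalize full-width punctuation to half-width so the regexes below match
    df = df.replace('．', '.', regex=True)
    df = df.replace('（', '(', regex=True)


    # Read the '选项' (options) column of the row at index 2 (the third data row)
    second_row_option_content = df.loc[2, '选项']

    # Show the content of that cell
    print(second_row_option_content)

Clean the data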

    def chinese_multiple_choice_questions(questions_with_answers):
        # Raw question text from the '选项' (options) column
        text = questions_with_answers

        # Regex patterns: one numbered question at a time, then each A-D choice
        question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
        choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

        # Find all questions
        questions = question_pattern.findall(text)

        # Collect multiple-choice and short-answer questions separately
        multiple_choice_questions = []
        short_answer_questions = []

        # Process each question
        for idx, question in enumerate(questions):
            # Treat the question as multiple-choice if it contains any of A-D
            if re.search(r'[A-D]', question):
                choices = choice_pattern.findall(question)
                question_text = re.split(r'\n', question.split('(')[0])[0]

                # Extract the question stem and renumber it sequentially
                pattern_question = re.compile(r'(\d+)\.(.*)')
                matches_question = str(idx + 1) + '.' + pattern_question.findall(question_text)[0][1]

                multiple_choice_questions.append({
                    'question': matches_question,
                    'choices': choices
                })
            else:
                short_answer_questions.append(question.strip())
        # Only the multiple-choice questions are returned
        return multiple_choice_questions

    # Try the parser on the first three rows
    questions_list = []
    for data_id in range(len(df[:3])):
        second_row_option_content = df.loc[data_id, '选项']
        questions_list.append(chinese_multiple_choice_questions(second_row_option_content))

    def chinese_multiple_choice_answers(questions_with_answers):
        # Strip spaces and newlines so "1. A" and "1.A" are treated alike
        questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")

        # Match answers with regular expressions:
        # uppercase letters are choice answers, anything else is a short answer
        choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
        short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

        # Find all matching answers
        choice_matches = choice_pattern.findall(questions_with_answers)
        short_matches = short_pattern.findall(questions_with_answers)

        # Convert the matches to dictionaries keyed by question number
        choice_answers = {int(index): answer for index, answer in choice_matches}
        short_answers = {int(index): answer for index, answer in short_matches}

        # Sort by question number
        sorted_choice_answers = sorted(choice_answers.items())
        sorted_short_answers = sorted(short_answers.items())

        # Keep only the choice answers, renumbered from 1
        answers = []
        for idx in range(len(sorted_choice_answers)):
            answers.append(f"{idx+1}. {sorted_choice_answers[idx][1]}")
        return answers

    # Read the '答案' (answers) column of the row at index 60
    second_row_option_content = df.loc[60, '答案']

    # Show the content of that cell
    print(second_row_option_content)

    chinese_multiple_choice_answers(second_row_option_content)
    df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)
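
To make the behavior of the regular expressions above easier to see, here is a minimal sketch that runs the question-splitting pattern and the choice-answer pattern on small made-up strings. The sample text is an assumption for illustration only and is not taken from the competition data.

```python
import re

# Same patterns as in the two functions above.
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
answer_pattern = re.compile(r'(\d+)\.([A-Z]+)')

# Hypothetical samples in the normalized format (half-width '.' after numbers).
sample_options = "1.Which statement is correct?\nA.First\nB.Second\n2.Explain the title."
sample_answers = "1.B\n2.C\n3.本文以时间为线索。"

# Each match runs from one "N." up to (but not including) the next "N.".
questions = question_pattern.findall(sample_options)
print(len(questions))

# Only answers made of uppercase letters are captured; item 3 is skipped.
answers = answer_pattern.findall(sample_answers.replace(" ", "").replace("\n", ""))
print(answers)
```

One caveat worth keeping in mind: because a question boundary is any run of digits followed by a dot, a decimal such as "2.5" inside a question would also be treated as the start of a new question.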

Coaching the model (just like what you do when coaching an AI catgirl!)

    def get_prompt_cn(text):
        # The prompt stays in Chinese because the task data is Chinese. It asks the
        # model to act as a gaokao question writer and produce 4 single-choice
        # questions (4 options each, exactly 1 correct answer) for the passage.
        prompt = f'''
        你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：
        
        ### 回答要求
        (1)理解文中重要概念的含义
        (2)理解文中重要句子的含意
        (3)分析论点、论据和论证方法
        
        
        ### 阅读文本
        {text}
        '''
        
        return prompt

    def process_cn(df):
        res_input = []
        res_output = []
        for idx in range(len(df)):
            data_options = df.loc[idx, '选项']
            data_answers = df.loc[idx, '答案']
            data_prompt = df.loc[idx, '阅读文本']
            data_options = chinese_multiple_choice_questions(data_options)
            data_answers = chinese_multiple_choice_answers(data_answers)
            data_prompt = get_prompt_cn(data_prompt)

            # Keep a sample only when every parsed question has a parsed answer
            if len(data_answers) == len(data_options):
                res = ''
                for id_, question in enumerate(data_options):
                    res += f'''
                    {question['question']}?
                    ''' + '\n'
                    for choice in question['choices']:
                        res = res + choice[0] + choice[1] + '\n'
                    res = res + '答案:' + str(data_answers[id_].split('.')[-1]) + '\n'
                res_output.append(res)
                res_input.append(data_prompt)
        return res_input, res_output

    cn_input, cn_output = process_cn(df)
    len(cn_input)
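
The next step in the workflow is uploading the data. The sample record used in the local test later in this post has the keys `input` and `output`, so one plausible way to store the pairs is one JSON object per line. The sketch below assumes that layout; the helper name `write_jsonl`, the file name, and the toy values are hypothetical, and the exact upload format should be checked against the platform's documentation.

```python
import json
import os
import tempfile

def write_jsonl(inputs, outputs, path):
    """Write paired prompts and targets as one JSON object per line."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, target in zip(inputs, outputs):
            record = {"input": prompt, "output": target}
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(inputs)

# Toy stand-ins for cn_input / cn_output (made-up values).
demo_inputs = ["prompt for passage 1", "prompt for passage 2"]
demo_outputs = ["1. question ...\n答案:A\n", "1. question ...\n答案:C\n"]

# Hypothetical output location in the system temp directory.
demo_path = os.path.join(tempfile.gettempdir(), "demo_finetune_data.jsonl")
written = write_jsonl(demo_inputs, demo_outputs, demo_path)
print(written, "records written to", demo_path)
```

With the real data, the call would be `write_jsonl(cn_input, cn_output, some_path)`.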

Test the fine-tuned model locally

Set up the environment (install the Python SDK for the iFLYTEK Spark API)

    !pip install --upgrade spark_ai_python

Build the prompt that guides the model

    prompt = {"input": "\n    你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。\nThe reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:\n    \n    ### 回答要求\n    (1)Understanding the main idea of the main idea.\n    (2)Understand the specific information in the text.\n    (3)infering the meaning of words and phrases from the context\n    \n    \n    ### 阅读文本\n    Bike Rental & Guided Tours Welcome to Amsterdam, welcome to MacBike. You see much more from the seat of a bike! Cycling is the most\neconomical, sustainable and fun way to explore the city, with its beautiful canals, parks, squares and countless lights.\nYou can also bike along lovely landscapes outside of Amsterdam.\nWhy MacBike MacBike has been around for almost 30 years and is the biggest bicycle rental company in Amsterdam. With over 2,500 bikes stored in our five rental shops at strategic locations, we make sure there is always a bike available for you. We offer the newest bicycles in a wide variety, including basic bikes with foot brake (AU 4), bikes with hand\nbrake and gears (HI-I'4), bikes with child seats, and children's bikes.                                                                       Price: 1 hour, 3 hours, 1 day(24hours), Each additional day                                                                                          Hand Brake, Three Gears: €7.50, €11.00, €14.75, €8.00                                                       Foot Brake, No Gears: €5.00, €7.50, €9.75, €6.00                     The 2.5-hour tour covers the Gooyer Windmill, the Skinny Bridge, the Rijksmuseum, Heineken Brewery and much more. The tour departs from D.m Square every hour on the hour, starting at 1:00 pm every day. You can buy\nyour ticket in a MacBike shop or book online.\n    ", "output": "\n                1. What is an advantage of MacBike?\n                A. It gives children a discount.\n                B. It of offers many types of bikes.\n                C. It organizes free cycle tours.\n                D. It has over 2,500 rental shops.\n                answer:B\n                \n\n                2. How much do you pay for renting a bike with hand brake and three gears for two days?\n                A. €15.75.\n                B. €19.50.\n                C. €22.75.\n                D. €29.50.\n                answer:D\n                \n\n                3. Where does the guided city tour start?\n                A. The Gooyer, Windmill.\n                B. The Skinny Bridge.\n                C. Heineken Brewery.\n                D. D.m Square.\n                answer:D\n                \n"}
    # Only the input half of the sample record is sent to the model
    prompt = prompt['input']

Run the test

    from sparkai.llm.llm import ChatSparkLLM, ChunkPrintHandler
    from sparkai.core.messages import ChatMessage

    SPARKAI_URL = 'wss://xingchen-api.cn-huabei-1.xf-yun.com/v1.1/chat'
    # Credentials for calling the Spark LLM. Follow the Feishu doc and look them up in the iFLYTEK fine-tuning console (https://training.xfyun.cn/modelService)
    SPARKAI_APP_ID = ''
    SPARKAI_API_SECRET = ''
    SPARKAI_API_KEY = ''
    serviceId = ''  
    resourceId = ''

    if __name__ == '__main__':
        spark = ChatSparkLLM(
            spark_api_url=SPARKAI_URL,
            spark_app_id=SPARKAI_APP_ID,
            spark_api_key=SPARKAI_API_KEY,
            spark_api_secret=SPARKAI_API_SECRET,
            spark_llm_domain=serviceId,
            model_kwargs={"patch_id": resourceId},
            streaming=False,
        )
        messages = [ChatMessage(
            role="user",
            content=prompt
        )]
        handler = ChunkPrintHandler()
        a = spark.generate([messages], callbacks=[handler])
        print(a.generations[0][0].text)
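
The prompt asks for single-choice questions with four options and exactly one answer each, so it can help to sanity-check a generation before relying on it. The sketch below is one possible check, not part of the baseline: the helper name `check_generation` is hypothetical, and it assumes answers are labelled `答案:` or `answer:` as in the training outputs shown in this post.

```python
import re

# Answer lines such as "答案:B" or "answer: B" (full-width colon also accepted).
ANSWER_RE = re.compile(r'(?:答案|answer)\s*[:：]\s*([A-D])\b', re.IGNORECASE)
# Option lines that start with A-D followed by '.', '．' or '、'.
OPTION_RE = re.compile(r'^\s*([A-D])\s*[.．、]', re.MULTILINE)

def check_generation(text, expected_questions=4):
    """Count answers and options in generated text and compare with the target."""
    answers = [a.upper() for a in ANSWER_RE.findall(text)]
    n_options = len(OPTION_RE.findall(text))
    ok = len(answers) == expected_questions and n_options == 4 * expected_questions
    return {"answers": answers, "n_options": n_options, "ok": ok}

# Made-up generation with a single question, adapted from the sample record above.
sample = (
    "1. What is an advantage of MacBike?\n"
    "A. It gives children a discount.\n"
    "B. It offers many types of bikes.\n"
    "C. It organizes free cycle tours.\n"
    "D. It has over 2,500 rental shops.\n"
    "answer:B\n"
)
report = check_generation(sample, expected_questions=1)
print(report)
```

On real output, the call would be `check_generation(a.generations[0][0].text)`, which uses the default of four expected questions.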

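Additional background on LLM fine-tuning

Introduction:

LLM fine-tuning is a technique in machine learning and AI that lets developers take a large model that has already been pretrained (usually called a foundation model or base model) and adjust it to fit a specific task or domain. It can greatly reduce the time and resources needed compared with training a model from scratch, while maintaining or improving the model's performance.

Key points:

- Pretrained model: before fine-tuning, the model has already been pretrained on a large amount of data and has learned general language patterns, image features, and so on.
- Task-specific data: to fine-tune, developers need to collect data related to the target task. The data can be text, images, audio, or other forms.
- Fine-tuning process: during fine-tuning, the model's parameters are updated on the new data so that it fits the new task better. This often means freezing part of the pretrained parameters while training the rest.
- Less data: compared with training from scratch, fine-tuning usually needs much less data, because the model already carries prior knowledge.
- Fast deployment: fine-tuning lets a model be applied to a new domain or task quickly, which speeds up development and deployment.
- Continual learning: a fine-tuned model can keep training on new data to follow changing data distributions or task requirements.
- Domain adaptability: fine-tuning lets a model fit a specific domain or industry better and improves its results in that application scenario.
- Resource efficiency: because training does not start from scratch, fine-tuning saves a large amount of compute and time.

Common techniques

- Full-parameter fine-tuning: all of the model's parameters are updated on the new task. This suits cases where the new task differs a lot from the pretraining task.
- Partial-parameter fine-tuning: only some parts of the model are updated, such as the top layers or certain specific layers, while the other layers stay unchanged. This suits cases where the new task is close to the pretraining task.
- Frozen-layer fine-tuning: most layers of the pretrained model are frozen and only a few layers (usually the top ones) are trained on the new task. This helps preserve the model's generalization ability while adapting to the new task.
- Multi-task learning: the model is trained to perform several tasks at once. Sharing the lower-level representations can improve efficiency and performance.
- Domain adaptation: fine-tuning on data from a specific domain helps the model understand and adapt to that domain's language or features.
- Task adapter layers: one or more new layers are added specifically to adapt to the new task, without changing the original structure of the pretrained model.
- Data augmentation: transforming the training data (for images, for example rotation, scaling, and cropping) increases its diversity, which helps generalization.
- Regularization: L1, L2, or other regularization methods prevent overfitting and keep the model stable, which matters especially during fine-tuning.
- Learning-rate adjustment: lowering the learning rate during fine-tuning lets the model adjust its parameters more finely and avoids destroying what it learned in pretraining.
- Meta-learning: by learning how to adapt quickly to new tasks, the model can converge faster when it meets a new problem.
- Transfer learning: a model trained on a large task transfers its knowledge to a related smaller task, which usually involves fine-tuning the model's parameters.
- Zero-shot learning and few-shot learning: the model handles a new task with no labeled examples (zero-shot) or is adapted with only a few (few-shot), relying mainly on its generalization ability and the knowledge learned in pretraining.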
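
To make the idea of frozen-layer fine-tuning with a small learning rate concrete, here is a toy sketch in plain Python. It is an illustration under simplifying assumptions (a two-parameter "head" on top of a single frozen weight, trained with gradient descent on made-up data) and says nothing about how spark-13b is actually tuned on the platform.

```python
# Toy model: prediction = head_w * (base_w * x) + head_b
# base_w stands in for a frozen "pretrained" layer; only the head is trained.

def mse(data, base_w, head_w, head_b):
    """Mean squared error of the toy model on (x, y) pairs."""
    return sum((head_w * base_w * x + head_b - y) ** 2 for x, y in data) / len(data)

def fine_tune_head(data, base_w, head_w, head_b, lr=0.01, epochs=200):
    """Gradient descent on head_w and head_b only; base_w is never updated."""
    n = len(data)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            feature = base_w * x                      # output of the frozen layer
            error = head_w * feature + head_b - y
            grad_w += 2 * error * feature / n
            grad_b += 2 * error / n
        head_w -= lr * grad_w
        head_b -= lr * grad_b
    return head_w, head_b

# Made-up "new task": y = 3x + 1, which the head can fit given base_w = 2.
data = [(x, 3.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
base_w = 2.0

loss_before = mse(data, base_w, 1.0, 0.0)
head_w, head_b = fine_tune_head(data, base_w, 1.0, 0.0)
loss_after = mse(data, base_w, head_w, head_b)
print(loss_before, loss_after)
```

The frozen weight keeps whatever it learned earlier, and the head alone absorbs the new task. In a real model the same split applies to whole layers rather than single numbers.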
