Practical Tips for Improving the Reading Comprehension Ability of Large Language Models (LLMs) [A Dataset Augmentation Approach]


Introduction: Following up on my previous post, I competed in stage two of the finals but unfortunately did not make the leaderboard. The organizers, however, open-sourced the first-place solution, so let's learn from it.
Previous post 📚: Practical Tips for Mathematical Reasoning with Large Language Models (LLMs) [Applying Chain-of-Thought (CoT)]


✅ Study notes of a second-year NLP master's student; first post of 2025

About the author: Wang Linyong, NPU, class of 2023, Computer Technology
Research interests: text generation, large language models
Competition link: MindSpore Model Development Challenge [Model Fine-tuning Track, Stage Two], 2024, Huawei Technologies Co., Ltd.
Project link: https://github.com/mindspore-courses/competition
Competition title: MindSpore Model Development Challenge [Model Fine-tuning Track, Stage Two]




1 Task Overview

● This task requires running the baseline on a dataset of Chinese and English multiple-choice questions and fine-tuning the InternLM-7B model in MindFormers (with LoRA or another fine-tuning algorithm). On the premise that the model's original capability is not lost (it must retain at least 90% of it), the fine-tuned model's answer accuracy must improve over the baseline. Entries are ranked by a combination of the low-parameter ratio and the accuracy, and 1 gold, 2 silver, and 3 bronze awards are selected.

  1. The model's original capability is measured by its reading comprehension on the SQuAD dataset, evaluated with F1 Score and Em Score. After fine-tuning, both metrics must stay above the given thresholds for a submission to count as valid; see the guide for how to run this evaluation and for the reference thresholds of F1 Score and Em Score.
  2. Single-choice accuracy: the model runs inference on the test set (not public, same format as the training set, consisting of a number of single-choice questions) and generates answers; the final score is the fraction of questions in the test set answered correctly:

Accuracy = number of correctly answered questions / total number of questions in the test set (note: the baseline accuracy is 40%; use it as a reference when fine-tuning).

  3. Low-parameter ratio: the proportion of fine-tuned parameters over the model's total parameters. Contestants must report the computed ratio with their submission; see the low-parameter ratio calculation below.

Low-parameter ratio = number of fine-tuned parameters / total number of model parameters

  4. Combined ranking by low-parameter ratio and accuracy: a lower ratio and a higher accuracy are both better, weighted as follows (a short worked example follows this list):

Score = (100% − low-parameter ratio × 10) × 0.3 + accuracy × 0.7

  5. More than 27,000 mixed Chinese and English questions are provided as the training set. Contestants may adjust the dataset size to their own situation; it is recommended to weigh fine-tuning and inference time, compute requirements, preservation of the model's original capability, and the accuracy gain when deciding on the training set size.
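
As a quick check of the weighting, here is a small sketch of the composite score; the 0.25% ratio is a made-up illustrative value, and 95.9% is the first-place accuracy mentioned in the next section:

def composite_score(low_param_ratio: float, accuracy: float) -> float:
    # (100% - low-parameter ratio x 10) x 0.3 + accuracy x 0.7, with both inputs given as fractions
    return (1.0 - low_param_ratio * 10) * 0.3 + accuracy * 0.7

# e.g. a hypothetical LoRA ratio of 0.25% combined with an accuracy of 95.9%
print(composite_score(0.0025, 0.959))  # 0.2925 + 0.6713 = 0.9638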

2 Award Announcement

● The gold award is worth 100,000 RMB, which is quite enviable. I remember the best accuracy I managed at the time was around 60%, so how did the first-place team reach 95.9%? Let's find out.

[Figure: award announcement]

3 Overall Approach

3.1 Data Preprocessing

3.1.1 Converting csv to json

● Download the original MMLU and CMMLU datasets as described in the guide, then run mmlu_cmmlu_csv2json.py, replacing the dataset paths and the save path in the script with your own.

The mmlu_cmmlu_csv2json.py code below mainly converts the csv data into json format:

import os
import json
import random
import pandas as pd
random.seed(42)


if __name__ == "__main__":
    save_path = "/path/to/origin_train_alpaca_format.json"
    cmmlu_path_list = ["/path/to/cmmlu/dev/", "/path/to/cmmlu/test/"]
    mmlu_path_list = ["/path/to/mmlu/data/dev", "/path/to/mmlu/data/test", "/path/to/mmlu/data/val"]

    train_data = []

    csv_files = [file for file in os.listdir(cmmlu_path_list[0]) if file.endswith(".csv")]
    for file in csv_files:
        data_list = []
        for folder_path in cmmlu_path_list:
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            if "cmmlu" not in file_path:
                df.columns = ["Question", "A", "B", "C", "D", "Answer"]
            # Convert the DataFrame into a list of dicts and append each item to the list
            dict_data = df.to_dict(orient="records")
            for item in dict_data:
                domain =  file.replace("_dev", "").replace("_test", "").replace("_val", "").replace("_", " ").replace(".csv", "")
                data_list.append({
                    "instruction": f"Here is a question about {domain}, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
                    "input": "Question: " + item["Question"] + " \nA." + str(item["A"]) + "\nB." + str(item["B"]) + "\nC." + str(item["C"]) + "\nD." + str(item["D"]),
                    "output": "The right option is " + item["Answer"] + "."
                })
        random.shuffle(data_list)
        train_data.extend(data_list)
        print("cmmlu: ", domain, len(data_list))

    csv_files = [file for file in os.listdir(mmlu_path_list[0]) if file.endswith(".csv")]
    for file in csv_files:
        data_list = []
        i = 0
        for folder_path in mmlu_path_list:
            i += 1
            if i == 2:
                file = file.replace("_dev", "_test")
            elif i == 3:
                file = file.replace("_test", "_val")
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            if "cmmlu" not in file_path:
                df.columns = ["Question", "A", "B", "C", "D", "Answer"]
            # Convert the DataFrame into a list of dicts and append each item to the list
            dict_data = df.to_dict(orient="records")
            for item in dict_data:
                domain =  file.replace("_dev", "").replace("_test", "").replace("_val", "").replace("_", " ").replace(".csv", "")
                data_list.append({
                    "instruction": f"Here is a question about {domain}, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
                    "input": "Question: " + item["Question"] + " \nA." + str(item["A"]) + "\nB." + str(item["B"]) + "\nC." + str(item["C"]) + "\nD." + str(item["D"]),
                    "output": "The right option is " + item["Answer"] + "."
                })
        random.shuffle(data_list)
        train_data.extend(data_list)
        print("mmlu: ", domain, len(data_list))

    with open(save_path, "w", encoding="utf-8") as json_file:
        json.dump(train_data, json_file, ensure_ascii=False, indent=4)

    print("train_data: ", len(train_data))

3.1.2 Shuffling the options, building new samples, and splitting the dataset ⭐️⭐️⭐️

● Run data_preprocess.py, replacing the dataset paths and the save path in the script with your own.

Shuffling and construction strategy (a minimal code sketch of this step follows the examples below):

  1. For every sample in the dataset, build 6 new samples: the output is completed with the full text of the correct option, and in new samples 2-6 the positions of the four options A/B/C/D are re-ordered. An example follows (original sample):
{
	"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
	"input": "Question: 下渗率总是[ ]。 \nA.小于、等于下渗能力\nB.等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
	"output": "The right option is A."
}
  2. The 1st new sample (only the output is completed with the full text of the option):
>>> Fill in the full answer text for the output option:
{
	"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
	"input": "Question: 下渗率总是[ ]。 \nA.小于、等于下渗能力\nB.等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
	"output": "The right option is A.小于、等于下渗能力"
}
  3. New samples 2-6 (the output is completed with the full text of the option, and the order of the options is re-shuffled):
>>> Swap options A and D in the input, i.e. ABCD → DBCA:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.小于下渗能力\nB.等于下渗能力\nC.大于下渗能力\nD.小于、等于下渗能力",
    "output": "The right option is D.小于、等于下渗能力"
}
>>> Swap options A and C in the input, i.e. ABCD → CBAD:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.大于下渗能力\nB.小于下渗能力\nC.小于、等于下渗能力\nD.等于下渗能力",
    "output": "The right option is C.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BDCA:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.大于下渗能力\nC.小于下渗能力\nD.小于、等于下渗能力",
    "output": "The right option is D.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BCAD:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.小于下渗能力\nC.小于、等于下渗能力\nD.大于下渗能力",
    "output": "The right option is C.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BACD:
{
	"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
	"input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.小于、等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
	"output": "The right option is B.小于、等于下渗能力"
}
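
Below is a minimal sketch of this augmentation step; the permutation list and helper function are my own illustration, and the exact re-orderings used in the winning data_preprocess.py may differ:

import re

# Six orderings: the first keeps the options unchanged (sample 1), the other five re-order them
# (samples 2-6). The i-th letter names the original option placed at new position i.
PERMUTATIONS = ["ABCD", "DBCA", "CBAD", "BDCA", "BCAD", "BACD"]

def augment_sample(sample: dict) -> list[dict]:
    """Build 6 new samples: complete the output with the option text and re-order the options."""
    question, *options = re.split(r"\s*\n[ABCD]\.", sample["input"])   # question text + 4 option texts
    answer_letter = sample["output"].strip().rstrip(".")[-1]           # e.g. "A"
    answer_text = options["ABCD".index(answer_letter)]
    new_samples = []
    for perm in PERMUTATIONS:
        reordered = [options["ABCD".index(old)] for old in perm]       # option texts in the new order
        new_letter = "ABCD"[perm.index(answer_letter)]                 # where the correct answer moved
        new_samples.append({
            "instruction": sample["instruction"],
            "input": question + " " + "".join(
                f"\n{letter}.{text}" for letter, text in zip("ABCD", reordered)),
            "output": f"The right option is {new_letter}.{answer_text}",
        })
    return new_samples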

Data split: build the training set and the self-test validation and test sets from the new dataset above (a short sketch of the split follows this list).

  1. Each original sample yields 6 new samples, so the data can be arranged into 6 "epochs" (each epoch contains every question of the original data):
    Training set = epochs 1-5
  2. Shuffle the 6th-epoch data within each category:
    Validation set = the first 20 samples of each category in epoch 6
    Test set = samples 20-40 of each category in epoch 6
  3. Training set size: 27604 × 5
    Validation set size: 20 × 124 (number of categories)
    Test set size: 20 × 124 (number of categories)
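
A short sketch of the split under the numbers above (27,604 original questions, 124 categories); here augmented[q] holds the 6 augmented versions of question q, and the "category" field is an assumption of this sketch:

import random
from collections import defaultdict

def split_dataset(augmented):
    # Epochs 1-5 -> training set: the k-th augmented version of every question forms "epoch" k.
    train = [samples[k] for k in range(5) for samples in augmented]

    # Epoch 6 -> validation/test, grouped and shuffled per category.
    by_category = defaultdict(list)
    for samples in augmented:
        sixth = samples[5]
        by_category[sixth["category"]].append(sixth)

    valid, test = [], []
    for cat_samples in by_category.values():
        random.shuffle(cat_samples)
        valid.extend(cat_samples[:20])      # first 20 samples of each category
        test.extend(cat_samples[20:40])     # samples 20-40 of each category
    return train, valid, test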

3.1.3 Converting json to mindrecord

● Run alpaca_data_preprocess_v1.py; this script fixes a bug in the original script that occurs when a sample's length is exactly equal to seq_length.

# Copyright 2023 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
transform alpaca dataset to mindrecord.
"""
import argparse
import json
import os
import numpy as np

from mindspore.mindrecord import FileWriter

from internlm_tokenizer import InternLMTokenizer


IGNORE_TOKEN_ID = -100


def get_chat_format_data(ori_data):
    """Format original data

    Args:
        ori_data (dict): input data sample.

    Returns:
        dict: data sample with chat format.
    """
    input_str = ori_data["input"]
    instruction_str = ori_data["instruction"]
    output_str = ori_data["output"]
    data = dict()
    if input_str != "":
        data["user"] = f"<|User|>:{instruction_str}\n{input_str}"
    else:
        data["user"] = f"<|User|>:{instruction_str}"
    data["bot"] = f"<|Bot|>:{output_str}"
    return data


def preprocess(sources, tokenizer, seq_length, bos_token="<s>", eos_token="</s>"):
    """conversation preprocess."""
    input_ids = []
    labels = []
    over_num = 0
    for source in sources:
        data = get_chat_format_data(source)
        special_tokens_map = {"<eoh>": 103167, "<eoa>": 103166, "nl_id": 13}
        token_ids = tokenizer.encode(bos_token, add_special_tokens=False)
        human_s = data["user"]
        ass_s = data["bot"]

        human_ids = tokenizer.encode(human_s, add_special_tokens=False) + \
                    [special_tokens_map["<eoh>"], special_tokens_map["nl_id"]]

        ass_template_ids = tokenizer.encode("<|Bot|>:", add_special_tokens=False)

        ignore_len = len(human_ids) + len(ass_template_ids)

        ass_ids = (ass_template_ids + tokenizer.encode(ass_s[8:], add_special_tokens=False) + \
                   [special_tokens_map["<eoa>"], special_tokens_map["nl_id"]])

        targets = np.ones([seq_length,])
        token_ids += human_ids + ass_ids

        if len(token_ids) >= seq_length:
            over_num += 1
            token_ids = token_ids[:seq_length-1]
            token_ids += tokenizer.encode(eos_token, add_special_tokens=False)
            targets[:] = IGNORE_TOKEN_ID
        else:
            token_ids += tokenizer.encode(eos_token, add_special_tokens=False)
            ignore_len_end = seq_length - len(token_ids)
            token_ids = np.pad(token_ids, (0, ignore_len_end), 'constant', constant_values=(0, 0))
            targets = np.array(token_ids)
            targets[:ignore_len + 1] = IGNORE_TOKEN_ID
            targets[-ignore_len_end:] = IGNORE_TOKEN_ID

        input_ids.append(np.array(token_ids).astype(np.int32))
        labels.append(np.array(targets).astype(np.int32))

    print("over_num: ", over_num)
    return dict(
        input_ids=input_ids,
        labels=labels
    )


class SupervisedDataset:
    """Dataset for supervised fine-tuning."""

    def __init__(self, raw_data, tokenizer, seq_length):
        super(SupervisedDataset, self).__init__()

        sources = []
        for example in raw_data:
            sources.append(example)
        data_dict = preprocess(sources, tokenizer, seq_length)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return dict(
            input_ids=self.input_ids[i],
            labels=self.labels[i]
        )


def tokenize_qa(tokenizer, file_path, seq_length):
    raw_data = json.load(open(file_path, "r"))
    dataset_cls = SupervisedDataset(raw_data, tokenizer, seq_length)
    for i in range(len(dataset_cls)):
        yield dataset_cls[i]


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--mindrecord_schema", type=str, default="internlm_alpaca")
    parser.add_argument("--input_glob", type=str, default="./alpaca_data.json")
    parser.add_argument("--output_file", type=str, default="./alpaca_processed/alpaca.mindrecord")
    parser.add_argument("--model_file", type=str, default="./tokenizer.model")
    parser.add_argument("--file_partition", type=int, default=1)
    parser.add_argument("--seq_length", type=int, default=2048)
    args = parser.parse_args()

    out_dir, out_file = os.path.split(os.path.abspath(args.output_file))
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)

    schema = {'input_ids': {"type": "int32", "shape": [-1]},
              'labels': {"type": "int32", "shape": [-1]}}

    writer = FileWriter(file_name=args.output_file,
                        shard_num=args.file_partition)
    writer.add_schema(schema, args.mindrecord_schema)

    # Start to load tokenizer
    if not os.path.exists(args.model_file):
        raise FileNotFoundError(f"file {args.model_file} do not exists.")

    transforms_count = 0

    word_tokenizer = InternLMTokenizer(vocab_file=args.model_file)
    for x in tokenize_qa(word_tokenizer, args.input_glob, args.seq_length + 1):
        transforms_count += 1
        writer.write_raw_data([x])
    print("Transformed {} records.".format(transforms_count))

    writer.commit()
    out_file = args.output_file
    if args.file_partition > 1:
        out_file += '0'
    print("Transform finished, output files refer: {}".format(out_file))

Code walkthrough:

1. Input and output

Input: a JSON-format Alpaca dataset file (e.g. alpaca_data.json) containing the training data in conversation form (each sample has input, instruction, and output fields).
Output: a MindRecord file (.mindrecord format) used for training and evaluation in the MindSpore framework; the output path can be given via a command-line argument.
2. Main components
Step 1: data formatting (get_chat_format_data):

  • Each input sample is converted by get_chat_format_data into a chat format, adding markers such as <|User|> and <|Bot|> to distinguish the user's and the bot's turns.
  • The function formats each sample's instruction, input, and output fields into dialogue text: input becomes the user's input and output becomes the bot's output.
    Example:
data = {
   "user": "<|User|>:instruction_text\ninput_text",
   "bot": "<|Bot|>:output_text"
}

Step 2: conversation preprocessing (preprocess):

  • Each conversation is tokenized and converted into the format the model needs for training.
  • The InternLMTokenizer turns the user and bot text into token IDs.
  • Special tokens (such as <eoh> and <eoa>) are handled here; they separate the different parts of the conversation (e.g. the user's input and the bot's output).
  • If the token sequence exceeds the given seq_length (maximum sequence length), it is truncated so that every sample fits the required length.

Step 3: dataset wrapping (SupervisedDataset):

  • The SupervisedDataset class wraps the processed data, storing each sample's input_ids and labels for training.
  • Each sample has two fields: input_ids (the model input) and labels (the corresponding labels for supervised training).

Step 4: writing the data (MindRecord):

  • MindSpore's FileWriter writes the data in MindRecord format, following the given schema (which defines the type and shape of input_ids and labels).
  • The dataset is saved in shards (partitions); if file_partition is greater than 1, the data is split across multiple files.

3. Command-line arguments

  • --mindrecord_schema: schema name for the MindRecord file.
  • --input_glob: path of the input raw JSON data file.
  • --output_file: path of the output MindRecord file.
  • --model_file: path of the tokenizer model file used for tokenization.
  • --file_partition: number of shards; if greater than 1, multiple .mindrecord files are written.
  • --seq_length: maximum sequence length, i.e. the maximum number of tokens per input and label.

4. Execution flow

  • Parse the command-line arguments.
  • Check whether the output directory exists and create it if not.
  • Load the tokenizer model (InternLMTokenizer) and use it to tokenize the input text.
  • Call tokenize_qa to convert the data sample by sample and write each record to the MindRecord file.
  • Count and print the number of converted records.

5. Special tokens

  • <|User|> and <|Bot|> mark the user's and the bot's utterances in the conversation.
  • <eoh> (end of human) and <eoa> (end of answer) mark the end of the user's and the bot's turns, respectively.
  • nl_id is the ID of the newline token (13), appended after <eoh> and <eoa> to separate the different parts of the conversation.
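
To make the labeling concrete, here is a toy illustration of how the prompt and padding positions end up masked with IGNORE_TOKEN_ID; the token IDs are made up, and the real script works on actual tokenizer output:

import numpy as np

IGNORE_TOKEN_ID = -100
seq_length = 12

# Made-up token IDs purely for illustration.
bos = [1]
human_ids = [501, 502, 503]        # user turn, incl. <eoh> and the trailing newline token
ass_template_ids = [504]           # "<|Bot|>:"
answer_ids = [601, 602, 2]         # answer tokens, <eoa>/newline, then eos

token_ids = bos + human_ids + ass_template_ids + answer_ids
pad_len = seq_length - len(token_ids)
token_ids = token_ids + [0] * pad_len          # right-pad to seq_length with 0

ignore_len = len(human_ids) + len(ass_template_ids)
labels = np.array(token_ids)
labels[:ignore_len + 1] = IGNORE_TOKEN_ID      # mask bos + the whole prompt
labels[-pad_len:] = IGNORE_TOKEN_ID            # mask the padding
print(labels)  # only the answer tokens keep their IDs and contribute to the loss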

Run the scripts:

  1. Convert the training set from json to mindrecord with seq_length set to 1024 (replace the paths with your own):
cd /home/ma-user/work/mindformers/research/internlm/

python alpaca_data_preprocess_v1.py \
--mindrecord_schema internlm_alpaca \
--input_glob /path/to/train_alpaca_format.json \
--output_file /path/to/train_1024.mindrecord \
--model_file /home/ma-user/work/tokenizer.model \
--seq_length 1024
  2. Convert the validation set from json to mindrecord with seq_length set to 1023 (during mindformers training, the validation data length = training data length − 1); replace the paths with your own:
cd /home/ma-user/work/mindformers/research/internlm/

python alpaca_data_preprocess.py \
--mindrecord_schema internlm_alpaca \
--input_glob /path/to/valid_alpaca_format.json \
--output_file /path/to/valid_1024.mindrecord \
--model_file /home/ma-user/work/tokenizer.model \
--seq_length 1023

3.2 Model Training

3.2.1 Environment

  • The official image provided in the guide
  • A 4-card NPU environment (64 GB of device memory)
  • 500 GB of disk

3.2.2 Fine-tuning configuration

● Starting from the mindformers/research/internlm/finetune_internlm_7b_lora_mmlu_64G.yaml provided in the guide, make the following changes (everything else stays as in the guide) and fine-tune on 4 cards; the resulting file is lora_finetune.yaml:

only_save_strategy: False

runner_config:
  epochs: 4
  batch_size: 4

model_config:
    seq_length: 1024
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'

do_eval: True
eval_step_interval: 1726
eval_epoch_interval: -1

eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/path/to/valid_1024.mindrecord"
    shuffle: False
  input_columns: ["input_ids", "labels"]

callbacks:
  save_checkpoint_steps: 3452
  keep_checkpoint_max: 10

● The complete lora_finetune.yaml file:

seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
src_strategy_path_or_dir: ''
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'finetune'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'internlm_7b_lora'

# runner config
runner_config:
  epochs: 4
  batch_size: 4
  sink_mode: True
  sink_size: 2

# optimizer
optimizer:
  type: FP32StateAdamWeightDecay
  beta1: 0.9
  beta2: 0.999
  eps: 1.e-8
  weight_decay: 0.01

# lr schedule
lr_schedule:
  type: CosineWithWarmUpLR
  learning_rate: 5.e-5
  warmup_ratio: 0.03
  total_steps: -1 # -1 means it will load the total steps of the dataset

# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "labels"]  # "input_ids", "labels" , labels are used in instruction finetune.
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
# if True, do evaluate during the training process. if false, do nothing.
# note that the task trainer should support _evaluate_in_training function.
do_eval: True
eval_step_interval: 1726        # num of step intervals between each eval, -1 means no step end eval.
eval_epoch_interval: -1        # num of epoch intervals between each eval, 1 means eval on every epoch end.

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/home/ma-user/work/dataset_v2/valid_1024.mindrecord"
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 4
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "internlm_7b_lora"
    save_checkpoint_steps: 3452
    save_trainable_params: False    # Whether to save fine-tuned weights additionally.
    integrated_save: False
    async_save: False
    keep_checkpoint_max: 10
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "58GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# parallel context config
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64

# model config
model:
  model_config:
    type: InternLMConfig
    batch_size: 1 # add for increase predict
    seq_length: 1024
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 103168
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 2
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float32"
    rotary_dtype: "float16"
    param_init_type: "float16"
    has_bias: True
    use_past: False
    scaling_factor: 1.0
    extend_method: "None"
    use_flash_attention: False
    offset: 0
    checkpoint_name_or_path: "internlm_7b_lora"
    repetition_penalty: 1.02
    max_decode_length: 100
    top_k: 50
    top_p: 0.9
    do_sample: True
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
  arch:
    type: InternLMForCausalLM

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    type: InternLMTokenizer
    vocab_file: '/home/ma-user/work/tokenizer.model'
  type: LlamaProcessor

# metric
metric:
  type: PerplexityMetric

# wrapper cell config
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 16384
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
layer_scale: False
layer_decay: 0.65
lr_scale_factor: 256

# aicc
remote_save_url: "Please input obs url on AICC platform."

3.2.3 Fine-tuning logs

● Training loss:

[Figure: training loss curve]

● Validation loss:

[Figure: validation loss curve]

3.3 Model Evaluation

3.3.1 Evaluating the original capability

● At run time, point to the fine-tuned model weights. Compared with the original file, the configuration changes are:

batch_size: 8
max_device_memory: "58GB"

pet_config:
  pet_type: lora
  # configuration of lora
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: '.*wq|.*wk|.*wv|.*wo'

● The complete original-capability evaluation configuration file predict_eval_squad.yaml:

seed: 0
output_dir: './output'  # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'internlm_7b'

# runner config
runner_config:
  epochs: 1
  batch_size: 8
  sink_mode: True
  sink_size: 2

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 1
  model_parallel: 8
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "internlm_7b"
    save_checkpoint_steps: 500
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "58GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# parallel context config
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: False
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64

# model config
model:
  model_config:
    type: InternLMConfig
    batch_size: 2 # add for increase predict
    seq_length: 8192
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 103168
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 2
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_type: "float16"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    has_bias: True
    use_past: True
    block_size: 16
    num_blocks: 512
    is_dynamic: True
    scaling_factor: 1.0
    extend_method: "None"
    offset: 0
    checkpoint_name_or_path: "internlm_7b"
    repetition_penalty: 1.0
    max_decode_length: 700
    max_new_tokens: 20
    top_k: 3
    top_p: 0.8
    do_sample: False
    is_dynamic: False
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
  arch:
    type: InternLMForCausalLM

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    type: InternLMTokenizer
    vocab_file: '/home/ma-user/work/tokenizer.model'
  type: LlamaProcessor

# metric
metric:
  type: EmF1Metric

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True

# aicc
remote_save_url: "Please input obs url on AICC platform."

● Original-capability evaluation results (meeting the requirement):

F1 score: 49.542794601854574
Em score: 31.398161586840832
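
For reference, Em and F1 for SQuAD-style answers are typically computed along the lines below; this is a simplified sketch, not the exact EmF1Metric implementation in MindFormers:

from collections import Counter

def em_score(prediction: str, reference: str) -> float:
    # Exact match: 1.0 if the normalized strings are identical.
    return float(prediction.strip().lower() == reference.strip().lower())

def f1_score(prediction: str, reference: str) -> float:
    # Token-level F1 between the prediction and the reference answer.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)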

3.3.2 Low-parameter ratio

[Figure: low-parameter ratio calculation]
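
As a rough cross-check of the figure above, the LoRA parameter count for rank-16 adapters on the wq/wk/wv/wo projections can be estimated from the model dimensions in the config; the 7.32B total is an assumed round number, so this is only a back-of-the-envelope estimate, not the submitted result:

# InternLM-7B dimensions from the config: hidden_size 4096, 32 layers,
# LoRA rank 16 on the four 4096 x 4096 attention projections wq/wk/wv/wo.
hidden_size, num_layers, rank, n_proj = 4096, 32, 16, 4

# Each adapted projection adds two low-rank matrices: A (rank x hidden) and B (hidden x rank).
lora_params = num_layers * n_proj * 2 * rank * hidden_size
total_params = 7.32e9            # assumed total parameter count of InternLM-7B

print(lora_params)                     # 16,777,216
print(lora_params / total_params)      # roughly 0.23%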

3.3.3 Evaluating reading comprehension ability

● At run time, point to the fine-tuned model weights. Compared with the original file, the configuration changes are:

pet_config:
  pet_type: lora
  # configuration of lora
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: '.*wq|.*wk|.*wv|.*wo'

max_new_tokens: 200

● The complete predict_mmlu.yaml file:

seed: 0
output_dir: './output'  # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'internlm_7b'

# runner config
runner_config:
  epochs: 1
  batch_size: 1
  sink_mode: True
  sink_size: 2

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 1
  model_parallel: 8
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "internlm_7b"
    save_checkpoint_steps: 500
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "26GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# parallel context config
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: False
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64

# model config
model:
  model_config:
    type: InternLMConfig
    batch_size: 1 # add for increase predict
    seq_length: 2048
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 103168
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 2
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_type: "float16"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    has_bias: True
    use_past: True
    block_size: 16
    num_blocks: 512
    is_dynamic: True
    scaling_factor: 1.0
    extend_method: "None"
    offset: 0
    checkpoint_name_or_path: "internlm_7b"
    repetition_penalty: 1.0
    max_decode_length: 700
    max_new_tokens: 200
    top_k: 3
    top_p: 0.8
    do_sample: False
    is_dynamic: False
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
  arch:
    type: InternLMForCausalLM

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    type: InternLMTokenizer
    vocab_file: '/home/ma-user/work/tokenizer.model'
  type: LlamaProcessor

# metric
metric:
  type: EmF1Metric

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True

# aicc
remote_save_url: "Please input obs url on AICC platform."

● Self-test result: on a test set of 2480 samples (20 per category), the accuracy is 96.85%.
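
Accuracy on the self-test set can be computed by matching the option letter in the generated text against the gold answer, roughly as below; this assumes the model answers in the trained "The right option is X" format, and the winning solution's evaluation script may differ in details:

import re

def extract_option(generated: str) -> str:
    # Pull the option letter out of an answer like "The right option is D.小于、等于下渗能力".
    match = re.search(r"The right option is\s*([ABCD])", generated)
    return match.group(1) if match else ""

def accuracy(generations: list[str], gold_letters: list[str]) -> float:
    correct = sum(extract_option(g) == a for g, a in zip(generations, gold_letters))
    return correct / len(gold_letters)

# e.g. accuracy(model_outputs, answers) -> 0.9685 on the 2480-sample self-test set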




⭐️ ⭐️ Written on January 16, 2025, at 20:22, from my desk in the lab
