Practical Tips for Improving the Reading Comprehension Ability of Large Language Models (LLMs) [A Dataset Augmentation Approach]


Introduction: Following up on my previous post, I competed in stage two of the finals but unfortunately did not make the leaderboard. The organizers, however, open-sourced the first-place solution, so let's learn from it.
Previous post 📚: Practical Tips for Mathematical Reasoning with Large Language Models (LLMs) [Applying Chain-of-Thought (CoT)]


✅ Study notes of a second-year NLP master's student; first post of 2025

About the author: Wang Linyong, NPU, class of 2023, Computer Technology
Research interests: text generation, large language models
Competition link: MindSpore Model Development Challenge [Model Fine-tuning Track, Stage Two], 2024, Huawei Technologies Co., Ltd.
Project link: https://github.com/mindspore-courses/competition
Competition title: MindSpore Model Development Challenge [Model Fine-tuning Track, Stage Two]




1 Task Overview

● This task requires running the baseline on a dataset of Chinese and English multiple-choice questions and fine-tuning the InternLM-7B model in MindFormers (with LoRA or another fine-tuning algorithm). On the premise that the model's original capability is not lost (it must retain at least 90% of it), the fine-tuned model's answer accuracy must improve over the baseline. Entries are ranked by a combination of the low-parameter ratio and the accuracy, and 1 gold, 2 silver, and 3 bronze awards are selected.

  1. The model's original capability is measured by its reading comprehension on the SQuAD dataset, evaluated with F1 Score and Em Score. After fine-tuning, both metrics must stay above the given thresholds for a submission to count as valid; see the guide for how to run this evaluation and for the reference thresholds of F1 Score and Em Score.
  2. Single-choice accuracy: the model runs inference on the test set (not public, same format as the training set, consisting of a number of single-choice questions) and generates answers; the final score is the fraction of questions in the test set answered correctly:

Accuracy = number of correctly answered questions / total number of questions in the test set (note: the baseline accuracy is 40%; use it as a reference when fine-tuning).

  3. Low-parameter ratio: the proportion of fine-tuned parameters over the model's total parameters. Contestants must report the computed ratio with their submission; see the low-parameter ratio calculation below.

Low-parameter ratio = number of fine-tuned parameters / total number of model parameters

  4. Combined ranking by low-parameter ratio and accuracy: a lower ratio and a higher accuracy are both better, weighted as follows (a short worked example follows this list):

Score = (100% − low-parameter ratio × 10) × 0.3 + accuracy × 0.7

  5. More than 27,000 mixed Chinese and English questions are provided as the training set. Contestants may adjust the dataset size to their own situation; it is recommended to weigh fine-tuning and inference time, compute requirements, preservation of the model's original capability, and the accuracy gain when deciding on the training set size.
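
As a quick check of the weighting, here is a small sketch of the composite score; the 0.25% ratio is a made-up illustrative value, and 95.9% is the first-place accuracy mentioned in the next section:

def composite_score(low_param_ratio: float, accuracy: float) -> float:
    # (100% - low-parameter ratio x 10) x 0.3 + accuracy x 0.7, with both inputs given as fractions
    return (1.0 - low_param_ratio * 10) * 0.3 + accuracy * 0.7

# e.g. a hypothetical LoRA ratio of 0.25% combined with an accuracy of 95.9%
print(composite_score(0.0025, 0.959))  # 0.2925 + 0.6713 = 0.9638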

2 Award Announcement

● The gold award is worth 100,000 RMB, which is quite enviable. I remember the best accuracy I managed at the time was around 60%, so how did the first-place team reach 95.9%? Let's find out.

[Figure: award announcement]

3 Overall Approach

3.1 Data Preprocessing

3.1.1 Converting csv to json

● Download the original MMLU and CMMLU datasets as described in the guide, then run mmlu_cmmlu_csv2json.py, replacing the dataset paths and the save path in the script with your own.

The mmlu_cmmlu_csv2json.py code below mainly converts the csv data into json format:

import os
import json
import random
import pandas as pd
random.seed(42)


if __name__ == "__main__":
    save_path = "/path/to/origin_train_alpaca_format.json"
    cmmlu_path_list = ["/path/to/cmmlu/dev/", "/path/to/cmmlu/test/"]
    mmlu_path_list = ["/path/to/mmlu/data/dev", "/path/to/mmlu/data/test", "/path/to/mmlu/data/val"]

    train_data = []

    csv_files = [file for file in os.listdir(cmmlu_path_list[0]) if file.endswith(".csv")]
    for file in csv_files:
        data_list = []
        for folder_path in cmmlu_path_list:
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            if "cmmlu" not in file_path:
                df.columns = ["Question", "A", "B", "C", "D", "Answer"]
            # Convert the DataFrame into a list of dicts and append each item to the list
            dict_data = df.to_dict(orient="records")
            for item in dict_data:
                domain =  file.replace("_dev", "").replace("_test", "").replace("_val", "").replace("_", " ").replace(".csv", "")
                data_list.append({
                    "instruction": f"Here is a question about {domain}, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
                    "input": "Question: " + item["Question"] + " \nA." + str(item["A"]) + "\nB." + str(item["B"]) + "\nC." + str(item["C"]) + "\nD." + str(item["D"]),
                    "output": "The right option is " + item["Answer"] + "."
                })
        random.shuffle(data_list)
        train_data.extend(data_list)
        print("cmmlu: ", domain, len(data_list))

    csv_files = [file for file in os.listdir(mmlu_path_list[0]) if file.endswith(".csv")]
    for file in csv_files:
        data_list = []
        i = 0
        for folder_path in mmlu_path_list:
            i += 1
            if i == 2:
                file = file.replace("_dev", "_test")
            elif i == 3:
                file = file.replace("_test", "_val")
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            if "cmmlu" not in file_path:
                df.columns = ["Question", "A", "B", "C", "D", "Answer"]
            # Convert the DataFrame into a list of dicts and append each item to the list
            dict_data = df.to_dict(orient="records")
            for item in dict_data:
                domain =  file.replace("_dev", "").replace("_test", "").replace("_val", "").replace("_", " ").replace(".csv", "")
                data_list.append({
                    "instruction": f"Here is a question about {domain}, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
                    "input": "Question: " + item["Question"] + " \nA." + str(item["A"]) + "\nB." + str(item["B"]) + "\nC." + str(item["C"]) + "\nD." + str(item["D"]),
                    "output": "The right option is " + item["Answer"] + "."
                })
        random.shuffle(data_list)
        train_data.extend(data_list)
        print("mmlu: ", domain, len(data_list))

    with open(save_path, "w", encoding="utf-8") as json_file:
        json.dump(train_data, json_file, ensure_ascii=False, indent=4)

    print("train_data: ", len(train_data))

3.1.2 Shuffling the options, building new samples, and splitting the dataset ⭐️⭐️⭐️

● Run data_preprocess.py, replacing the dataset paths and the save path in the script with your own.

Shuffling and construction strategy (a minimal code sketch of this step follows the examples below):

  1. For every sample in the dataset, build 6 new samples: the output is completed with the full text of the correct option, and in new samples 2-6 the positions of the four options A/B/C/D are re-ordered. An example follows (original sample):
{
	"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
	"input": "Question: 下渗率总是[ ]。 \nA.小于、等于下渗能力\nB.等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
	"output": "The right option is A."
}
  2. The 1st new sample (only the output is completed with the full text of the option):
>>> Fill in the full answer text for the output option:
{
	"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
	"input": "Question: 下渗率总是[ ]。 \nA.小于、等于下渗能力\nB.等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
	"output": "The right option is A.小于、等于下渗能力"
}
  3. New samples 2-6 (the output is completed with the full text of the option, and the order of the options is re-shuffled):
>>> Swap options A and D in the input, i.e. ABCD → DBCA:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.小于下渗能力\nB.等于下渗能力\nC.大于下渗能力\nD.小于、等于下渗能力",
    "output": "The right option is D.小于、等于下渗能力"
}
>>> Swap options A and C in the input, i.e. ABCD → CBAD:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.大于下渗能力\nB.小于下渗能力\nC.小于、等于下渗能力\nD.等于下渗能力",
    "output": "The right option is C.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BDCA:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.大于下渗能力\nC.小于下渗能力\nD.小于、等于下渗能力",
    "output": "The right option is D.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BCAD:
{
    "instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
    "input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.小于下渗能力\nC.小于、等于下渗能力\nD.大于下渗能力",
    "output": "The right option is C.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BACD:
{
	"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
	"input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.小于、等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
	"output": "The right option is B.小于、等于下渗能力"
}
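
Below is a minimal sketch of this augmentation step; the permutation list and helper function are my own illustration, and the exact re-orderings used in the winning data_preprocess.py may differ:

import re

# Six orderings: the first keeps the options unchanged (sample 1), the other five re-order them
# (samples 2-6). The i-th letter names the original option placed at new position i.
PERMUTATIONS = ["ABCD", "DBCA", "CBAD", "BDCA", "BCAD", "BACD"]

def augment_sample(sample: dict) -> list[dict]:
    """Build 6 new samples: complete the output with the option text and re-order the options."""
    question, *options = re.split(r"\s*\n[ABCD]\.", sample["input"])   # question text + 4 option texts
    answer_letter = sample["output"].strip().rstrip(".")[-1]           # e.g. "A"
    answer_text = options["ABCD".index(answer_letter)]
    new_samples = []
    for perm in PERMUTATIONS:
        reordered = [options["ABCD".index(old)] for old in perm]       # option texts in the new order
        new_letter = "ABCD"[perm.index(answer_letter)]                 # where the correct answer moved
        new_samples.append({
            "instruction": sample["instruction"],
            "input": question + " " + "".join(
                f"\n{letter}.{text}" for letter, text in zip("ABCD", reordered)),
            "output": f"The right option is {new_letter}.{answer_text}",
        })
    return new_samples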

Data split: build the training set and the self-test validation and test sets from the new dataset above (a short sketch of the split follows this list).

  1. Each original sample yields 6 new samples, so the data can be arranged into 6 "epochs" (each epoch contains every question of the original data):
    Training set = epochs 1-5
  2. Shuffle the 6th-epoch data within each category:
    Validation set = the first 20 samples of each category in epoch 6
    Test set = samples 20-40 of each category in epoch 6
  3. Training set size: 27604 × 5
    Validation set size: 20 × 124 (number of categories)
    Test set size: 20 × 124 (number of categories)
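
A short sketch of the split under the numbers above (27,604 original questions, 124 categories); here augmented[q] holds the 6 augmented versions of question q, and the "category" field is an assumption of this sketch:

import random
from collections import defaultdict

def split_dataset(augmented):
    # Epochs 1-5 -> training set: the k-th augmented version of every question forms "epoch" k.
    train = [samples[k] for k in range(5) for samples in augmented]

    # Epoch 6 -> validation/test, grouped and shuffled per category.
    by_category = defaultdict(list)
    for samples in augmented:
        sixth = samples[5]
        by_category[sixth["category"]].append(sixth)

    valid, test = [], []
    for cat_samples in by_category.values():
        random.shuffle(cat_samples)
        valid.extend(cat_samples[:20])      # first 20 samples of each category
        test.extend(cat_samples[20:40])     # samples 20-40 of each category
    return train, valid, test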

3.1.3 Converting json to mindrecord

● Run alpaca_data_preprocess_v1.py; this script fixes a bug in the original script that occurs when a sample's length is exactly equal to seq_length.

# Copyright 2023 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
transform alpaca dataset to mindrecord.
"""
import argparse
import json
import os
import numpy as np

from mindspore.mindrecord import FileWriter

from internlm_tokenizer import InternLMTokenizer


IGNORE_TOKEN_ID = -100


def get_chat_format_data(ori_data):
    """Format original data

    Args:
        ori_data (dict): input data sample.

    Returns:
        dict: data sample with chat format.
    """
    input_str = ori_data["input"]
    instruction_str = ori_data["instruction"]
    output_str = ori_data["output"]
    data = dict()
    if input_str != "":
        data["user"] = f"<|User|>:{instruction_str}\n{input_str}"
    else:
        data["user"] = f"<|User|>:{instruction_str}"
    data["bot"] = f"<|Bot|>:{output_str}"
    return data


def preprocess(sources, tokenizer, seq_length, bos_token="<s>", eos_token="</s>"):
    """conversation preprocess."""
    input_ids = []
    labels = []
    over_num = 0
    for source in sources:
        data = get_chat_format_data(source)
        special_tokens_map = {"<eoh>": 103167, "<eoa>": 103166, "nl_id": 13}
        token_ids = tokenizer.encode(bos_token, add_special_tokens=False)
        human_s = data["user"]
        ass_s = data["bot"]

        human_ids = tokenizer.encode(human_s, add_special_tokens=False) + \
                    [special_tokens_map["<eoh>"], special_tokens_map["nl_id"]]

        ass_template_ids = tokenizer.encode("<|Bot|>:", add_special_tokens=False)

        ignore_len = len(human_ids) + len(ass_template_ids)

        ass_ids = (ass_template_ids + tokenizer.encode(ass_s[8:], add_special_tokens=False) + \
                   [special_tokens_map["<eoa>"], special_tokens_map["nl_id"]])

        targets = np.ones([seq_length,])
        token_ids += human_ids + ass_ids

        if len(token_ids) >= seq_length:
            over_num += 1
            token_ids = token_ids[:seq_length-1]
            token_ids += tokenizer.encode(eos_token, add_special_tokens=False)
            targets[:] = IGNORE_TOKEN_ID
        else:
            token_ids += tokenizer.encode(eos_token, add_special_tokens=False)
            ignore_len_end = seq_length - len(token_ids)
            token_ids = np.pad(token_ids, (0, ignore_len_end), 'constant', constant_values=(0, 0))
            targets = np.array(token_ids)
            targets[:ignore_len + 1] = IGNORE_TOKEN_ID
            targets[-ignore_len_end:] = IGNORE_TOKEN_ID

        input_ids.append(np.array(token_ids).astype(np.int32))
        labels.append(np.array(targets).astype(np.int32))

    print("over_num: ", over_num)
    return dict(
        input_ids=input_ids,
        labels=labels
    )


class SupervisedDataset:
    """Dataset for supervised fine-tuning."""

    def __init__(self, raw_data, tokenizer, seq_length):
        super(SupervisedDataset, self).__init__()

        sources = []
        for example in raw_data:
            sources.append(example)
        data_dict = preprocess(sources, tokenizer, seq_length)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return dict(
            input_ids=self.input_ids[i],
            labels=self.labels[i]
        )


def tokenize_qa(tokenizer, file_path, seq_length):
    raw_data = json.load(open(file_path, "r"))
    dataset_cls = SupervisedDataset(raw_data, tokenizer, seq_length)
    for i in range(len(dataset_cls)):
        yield dataset_cls[i]


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--mindrecord_schema", type=str, default="internlm_alpaca")
    parser.add_argument("--input_glob", type=str, default="./alpaca_data.json")
    parser.add_argument("--output_file", type=str, default="./alpaca_processed/alpaca.mindrecord")
    parser.add_argument("--model_file", type=str, default="./tokenizer.model")
    parser.add_argument("--file_partition", type=int, default=1)
    parser.add_argument("--seq_length", type=int, default=2048)
    args = parser.parse_args()

    out_dir, out_file = os.path.split(os.path.abspath(args.output_file))
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)

    schema = {'input_ids': {"type": "int32", "shape": [-1]},
              'labels': {"type": "int32", "shape": [-1]}}

    writer = FileWriter(file_name=args.output_file,
                        shard_num=args.file_partition)
    writer.add_schema(schema, args.mindrecord_schema)

    # Start to load tokenizer
    if not os.path.exists(args.model_file):
        raise FileNotFoundError(f"file {args.model_file} do not exists.")

    transforms_count = 0

    word_tokenizer = InternLMTokenizer(vocab_file=args.model_file)
    for x in tokenize_qa(word_tokenizer, args.input_glob, args.seq_length + 1):
        transforms_count += 1
        writer.write_raw_data([x])
    print("Transformed {} records.".format(transforms_count))

    writer.commit()
    out_file = args.output_file
    if args.file_partition > 1:
        out_file += '0'
    print("Transform finished, output files refer: {}".format(out_file))

Code walkthrough:

1. Input and output

Input: a JSON-format Alpaca dataset file (e.g. alpaca_data.json) containing the training data in conversation form (each sample has input, instruction, and output fields).
Output: a MindRecord file (.mindrecord format) used for training and evaluation in the MindSpore framework; the output path can be given via a command-line argument.
2. Main components
Step 1: data formatting (get_chat_format_data):

  • Each input sample is converted by get_chat_format_data into a chat format, adding markers such as <|User|> and <|Bot|> to distinguish the user's and the bot's turns.
  • The function formats each sample's instruction, input, and output fields into dialogue text: input becomes the user's input and output becomes the bot's output.
    Example:
data = {
   "user": "<|User|>:instruction_text\ninput_text",
   "bot": "<|Bot|>:output_text"
}

Step 2: conversation preprocessing (preprocess):

  • Each conversation is tokenized and converted into the format the model needs for training.
  • The InternLMTokenizer turns the user and bot text into token IDs.
  • Special tokens (such as <eoh> and <eoa>) are handled here; they separate the different parts of the conversation (e.g. the user's input and the bot's output).
  • If the token sequence exceeds the given seq_length (maximum sequence length), it is truncated so that every sample fits the required length.

Step 3: dataset wrapping (SupervisedDataset):

  • The SupervisedDataset class wraps the processed data, storing each sample's input_ids and labels for training.
  • Each sample has two fields: input_ids (the model input) and labels (the corresponding labels for supervised training).

Step 4: writing the data (MindRecord):

  • MindSpore's FileWriter writes the data in MindRecord format, following the given schema (which defines the type and shape of input_ids and labels).
  • The dataset is saved in shards (partitions); if file_partition is greater than 1, the data is split across multiple files.

3. Command-line arguments

  • --mindrecord_schema: schema name for the MindRecord file.
  • --input_glob: path of the input raw JSON data file.
  • --output_file: path of the output MindRecord file.
  • --model_file: path of the tokenizer model file used for tokenization.
  • --file_partition: number of shards; if greater than 1, multiple .mindrecord files are written.
  • --seq_length: maximum sequence length, i.e. the maximum number of tokens per input and label.

4. Execution flow

  • Parse the command-line arguments.
  • Check whether the output directory exists and create it if not.
  • Load the tokenizer model (InternLMTokenizer) and use it to tokenize the input text.
  • Call tokenize_qa to convert the data sample by sample and write each record to the MindRecord file.
  • Count and print the number of converted records.

5. Special tokens

  • <|User|> and <|Bot|> mark the user's and the bot's utterances in the conversation.
  • <eoh> (end of human) and <eoa> (end of answer) mark the end of the user's and the bot's turns, respectively.
  • nl_id is the ID of the newline token (13), appended after <eoh> and <eoa> to separate the different parts of the conversation.
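
To make the labeling concrete, here is a toy illustration of how the prompt and padding positions end up masked with IGNORE_TOKEN_ID; the token IDs are made up, and the real script works on actual tokenizer output:

import numpy as np

IGNORE_TOKEN_ID = -100
seq_length = 12

# Made-up token IDs purely for illustration.
bos = [1]
human_ids = [501, 502, 503]        # user turn, incl. <eoh> and the trailing newline token
ass_template_ids = [504]           # "<|Bot|>:"
answer_ids = [601, 602, 2]         # answer tokens, <eoa>/newline, then eos

token_ids = bos + human_ids + ass_template_ids + answer_ids
pad_len = seq_length - len(token_ids)
token_ids = token_ids + [0] * pad_len          # right-pad to seq_length with 0

ignore_len = len(human_ids) + len(ass_template_ids)
labels = np.array(token_ids)
labels[:ignore_len + 1] = IGNORE_TOKEN_ID      # mask bos + the whole prompt
labels[-pad_len:] = IGNORE_TOKEN_ID            # mask the padding
print(labels)  # only the answer tokens keep their IDs and contribute to the loss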

Run the scripts:

  1. Convert the training set from json to mindrecord with seq_length set to 1024 (replace the paths with your own):
cd /home/ma-user/work/mindformers/research/internlm/

python alpaca_data_preprocess_v1.py \
--mindrecord_schema internlm_alpaca \
--input_glob /path/to/train_alpaca_format.json \
--output_file /path/to/train_1024.mindrecord \
--model_file /home/ma-user/work/tokenizer.model \
--seq_length 1024
  2. Convert the validation set from json to mindrecord with seq_length set to 1023 (during mindformers training, the validation data length = training data length − 1); replace the paths with your own:
cd /home/ma-user/work/mindformers/research/internlm/

python alpaca_data_preprocess.py \
--mindrecord_schema internlm_alpaca \
--input_glob /path/to/valid_alpaca_format.json \
--output_file /path/to/valid_1024.mindrecord \
--model_file /home/ma-user/work/tokenizer.model \
--seq_length 1023

3.2 Model Training

3.2.1 Environment

  • The official image provided in the guide
  • A 4-card NPU environment (64 GB of device memory)
  • 500 GB of disk

3.2.2 Fine-tuning configuration

● Starting from the mindformers/research/internlm/finetune_internlm_7b_lora_mmlu_64G.yaml provided in the guide, make the following changes (everything else stays as in the guide) and fine-tune on 4 cards; the resulting file is lora_finetune.yaml:

only_save_strategy: False

runner_config:
  epochs: 4
  batch_size: 4

model_config:
    seq_length: 1024
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'

do_eval: True
eval_step_interval: 1726
eval_epoch_interval: -1

eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/path/to/valid_1024.mindrecord"
    shuffle: False
  input_columns: ["input_ids", "labels"]

callbacks:
  save_checkpoint_steps: 3452
  keep_checkpoint_max: 10

● The complete lora_finetune.yaml file:

seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
src_strategy_path_or_dir: ''
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'finetune'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'internlm_7b_lora'

# runner config
runner_config:
  epochs: 4
  batch_size: 4
  sink_mode: True
  sink_size: 2

# optimizer
optimizer:
  type: FP32StateAdamWeightDecay
  beta1: 0.9
  beta2: 0.999
  eps: 1.e-8
  weight_decay: 0.01

# lr schedule
lr_schedule:
  type: CosineWithWarmUpLR
  learning_rate: 5.e-5
  warmup_ratio: 0.03
  total_steps: -1 # -1 means it will load the total steps of the dataset

# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: True
  input_columns: ["input_ids", "labels"]  # "input_ids", "labels" , labels are used in instruction finetune.
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
# if True, do evaluate during the training process. if false, do nothing.
# note that the task trainer should support _evaluate_in_training function.
do_eval: True
eval_step_interval: 1726        # num of step intervals between each eval, -1 means no step end eval.
eval_epoch_interval: -1        # num of epoch intervals between each eval, 1 means eval on every epoch end.

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/home/ma-user/work/dataset_v2/valid_1024.mindrecord"
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 4
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "internlm_7b_lora"
    save_checkpoint_steps: 3452
    save_trainable_params: False    # Whether to save fine-tuned weights additionally.
    integrated_save: False
    async_save: False
    keep_checkpoint_max: 10
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "58GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# parallel context config
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64

# model config
model:
  model_config:
    type: InternLMConfig
    batch_size: 1 # add for increase predict
    seq_length: 1024
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 103168
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 2
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float32"
    rotary_dtype: "float16"
    param_init_type: "float16"
    has_bias: True
    use_past: False
    scaling_factor: 1.0
    extend_method: "None"
    use_flash_attention: False
    offset: 0
    checkpoint_name_or_path: "internlm_7b_lora"
    repetition_penalty: 1.02
    max_decode_length: 100
    top_k: 50
    top_p: 0.9
    do_sample: True
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
  arch:
    type: InternLMForCausalLM

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    type: InternLMTokenizer
    vocab_file: '/home/ma-user/work/tokenizer.model'
  type: LlamaProcessor

# metric
metric:
  type: PerplexityMetric

# wrapper cell config
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 16384
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
layer_scale: False
layer_decay: 0.65
lr_scale_factor: 256

# aicc
remote_save_url: "Please input obs url on AICC platform."

3.2.3 Fine-tuning logs

● Training loss:

[Figure: training loss curve]

● Validation loss:

[Figure: validation loss curve]

3.3 Model Evaluation

3.3.1 Evaluating the original capability

● At run time, point to the fine-tuned model weights. Compared with the original file, the configuration changes are:

batch_size: 8
max_device_memory: "58GB"

pet_config:
  pet_type: lora
  # configuration of lora
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: '.*wq|.*wk|.*wv|.*wo'

● The complete original-capability evaluation configuration file predict_eval_squad.yaml:

seed: 0
output_dir: './output'  # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'internlm_7b'

# runner config
runner_config:
  epochs: 1
  batch_size: 8
  sink_mode: True
  sink_size: 2

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 1
  model_parallel: 8
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "internlm_7b"
    save_checkpoint_steps: 500
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "58GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# parallel context config
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: False
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64

# model config
model:
  model_config:
    type: InternLMConfig
    batch_size: 2 # add for increase predict
    seq_length: 8192
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 103168
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 2
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_type: "float16"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    has_bias: True
    use_past: True
    block_size: 16
    num_blocks: 512
    is_dynamic: True
    scaling_factor: 1.0
    extend_method: "None"
    offset: 0
    checkpoint_name_or_path: "internlm_7b"
    repetition_penalty: 1.0
    max_decode_length: 700
    max_new_tokens: 20
    top_k: 3
    top_p: 0.8
    do_sample: False
    is_dynamic: False
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
  arch:
    type: InternLMForCausalLM

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    type: InternLMTokenizer
    vocab_file: '/home/ma-user/work/tokenizer.model'
  type: LlamaProcessor

# metric
metric:
  type: EmF1Metric

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True

# aicc
remote_save_url: "Please input obs url on AICC platform."

● Original-capability evaluation results (meeting the requirement):

F1 score: 49.542794601854574
Em score: 31.398161586840832
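
For reference, Em and F1 for SQuAD-style answers are typically computed along the lines below; this is a simplified sketch, not the exact EmF1Metric implementation in MindFormers:

from collections import Counter

def em_score(prediction: str, reference: str) -> float:
    # Exact match: 1.0 if the normalized strings are identical.
    return float(prediction.strip().lower() == reference.strip().lower())

def f1_score(prediction: str, reference: str) -> float:
    # Token-level F1 between the prediction and the reference answer.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)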

3.3.2 Low-parameter ratio

[Figure: low-parameter ratio calculation]
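
As a rough cross-check of the figure above, the LoRA parameter count for rank-16 adapters on the wq/wk/wv/wo projections can be estimated from the model dimensions in the config; the 7.32B total is an assumed round number, so this is only a back-of-the-envelope estimate, not the submitted result:

# InternLM-7B dimensions from the config: hidden_size 4096, 32 layers,
# LoRA rank 16 on the four 4096 x 4096 attention projections wq/wk/wv/wo.
hidden_size, num_layers, rank, n_proj = 4096, 32, 16, 4

# Each adapted projection adds two low-rank matrices: A (rank x hidden) and B (hidden x rank).
lora_params = num_layers * n_proj * 2 * rank * hidden_size
total_params = 7.32e9            # assumed total parameter count of InternLM-7B

print(lora_params)                     # 16,777,216
print(lora_params / total_params)      # roughly 0.23%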

3.3.3 Evaluating reading comprehension ability

● At run time, point to the fine-tuned model weights. Compared with the original file, the configuration changes are:

pet_config:
  pet_type: lora
  # configuration of lora
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: '.*wq|.*wk|.*wv|.*wo'

max_new_tokens: 200

● The complete predict_mmlu.yaml file:

seed: 0
output_dir: './output'  # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'internlm_7b'

# runner config
runner_config:
  epochs: 1
  batch_size: 1
  sink_mode: True
  sink_size: 2

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: ""
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
  data_parallel: 1
  model_parallel: 8
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "internlm_7b"
    save_checkpoint_steps: 500
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "26GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# parallel context config
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: False
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64

# model config
model:
  model_config:
    type: InternLMConfig
    batch_size: 1 # add for increase predict
    seq_length: 2048
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 103168
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 2
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_type: "float16"
    softmax_compute_type: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    has_bias: True
    use_past: True
    block_size: 16
    num_blocks: 512
    is_dynamic: True
    scaling_factor: 1.0
    extend_method: "None"
    offset: 0
    checkpoint_name_or_path: "internlm_7b"
    repetition_penalty: 1.0
    max_decode_length: 700
    max_new_tokens: 200
    top_k: 3
    top_p: 0.8
    do_sample: False
    is_dynamic: False
    pet_config:
      pet_type: lora
      # configuration of lora
      lora_rank: 16
      lora_alpha: 32
      lora_dropout: 0.05
      target_modules: '.*wq|.*wk|.*wv|.*wo'
  arch:
    type: InternLMForCausalLM

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    type: InternLMTokenizer
    vocab_file: '/home/ma-user/work/tokenizer.model'
  type: LlamaProcessor

# metric
metric:
  type: EmF1Metric

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True

# aicc
remote_save_url: "Please input obs url on AICC platform."

● Self-test result: on a test set of 2480 samples (20 per category), the accuracy is 96.85%.
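
Accuracy on the self-test set can be computed by matching the option letter in the generated text against the gold answer, roughly as below; this assumes the model answers in the trained "The right option is X" format, and the winning solution's evaluation script may differ in details:

import re

def extract_option(generated: str) -> str:
    # Pull the option letter out of an answer like "The right option is D.小于、等于下渗能力".
    match = re.search(r"The right option is\s*([ABCD])", generated)
    return match.group(1) if match else ""

def accuracy(generations: list[str], gold_letters: list[str]) -> float:
    correct = sum(extract_option(g) == a for g, a in zip(generations, gold_letters))
    return correct / len(gold_letters)

# e.g. accuracy(model_outputs, answers) -> 0.9685 on the 2480-sample self-test set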




⭐️ ⭐️ Written on January 16, 2025, at 20:22, from my desk in the lab
