Introduction: Following up on my previous post, after taking part in stage two of the finals I unfortunately did not make the leaderboard, but the organizers have open-sourced the first-place solution, so let's learn from it.
Previous post 📚: Practical Tips for Mathematical Reasoning with Large Language Models (LLMs) [How to Apply Chain-of-Thought (CoT)]
✅ Study notes of a second-year NLP master's student; first post of 2025
About the author: Wang Linyong, NPU, class of 2023, Computer Technology
Research interests: text generation, large language models
Competition link: MindSpore Model Development Challenge [Model Fine-Tuning Track, Stage Two], 2024, Huawei Technologies Co., Ltd.
Project link: https://github.com/mindspore-courses/competition
Competition title: MindSpore Model Development Challenge [Model Fine-Tuning Track, Stage Two]
1 Task Description
● This task requires running the baseline on a dataset of Chinese and English multiple-choice questions and fine-tuning the InternLM-7B model in MindFormers (with LoRA or another fine-tuning algorithm). Under the constraint that the model's original capability is not lost (it must stay at 90% of the original or above), the fine-tuned model's accuracy on mathematical-operation questions must improve relative to the baseline. Entries are ranked by a combination of the low-parameter ratio and the accuracy, and 1 gold, 2 silver, and 3 bronze awards are selected.
- The model's original capability is measured by its reading comprehension on the SQuAD dataset, evaluated with F1 Score and Em Score. After fine-tuning, both metrics must stay above the given thresholds for the submission to count as valid; see the handbook for how to run the original-capability evaluation and for the reference thresholds of F1 Score and Em Score.
- Single-choice accuracy: the model runs inference on a test set (not public, in the same format as the training set, consisting of a number of single-choice questions), and the score is the fraction of questions answered correctly:
Accuracy = number of correctly answered questions / total number of questions in the test set (note: the baseline accuracy is 40%; use it as the reference when fine-tuning).
- Low-parameter ratio: the proportion of fine-tuned parameters in the total parameter count. Contestants must report this ratio when submitting; see the low-parameter ratio calculation below for how it is computed.
Low-parameter ratio = number of fine-tuned parameters / total number of model parameters
- Combined ranking of low-parameter ratio and accuracy: the lower the ratio and the higher the accuracy, the better. The weighted score is computed as follows (a small sanity-check sketch in Python follows this list):
(100% - low-parameter ratio × 10) × 0.3 + accuracy × 0.7
- The task provides 27k+ mixed Chinese and English questions as the training set. Contestants may adjust the dataset size to their own situation; it is recommended to weigh fine-tuning and inference time, compute requirements, preservation of the model's original capability, and the accuracy gain when deciding how much training data to use.
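● As a quick sanity check of the scoring formula above, here is a minimal Python sketch; the 0.45% ratio and 96.85% accuracy plugged in are only illustrative numbers, not official results:
def weighted_score(low_param_ratio: float, accuracy: float) -> float:
    """Both arguments are fractions in [0, 1]; implements the weighting defined above."""
    return (1.0 - low_param_ratio * 10) * 0.3 + accuracy * 0.7


if __name__ == "__main__":
    # e.g. an adapter that trains 0.45% of the weights and reaches 96.85% accuracy
    print(f"score = {weighted_score(0.0045, 0.9685):.4f}")  # prints roughly 0.964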
2 Award Announcement
● The gold prize is 100,000 RMB, quite enviable. As I recall, the best accuracy I managed to reach was around 60%, so how did the first-place team get to 95.9%? Let's find out together!
3 Overall Approach
3.1 Data Preprocessing
3.1.1 Converting csv to json
● Obtain the original MMLU and CMMLU datasets as described in the handbook, then run mmlu_cmmlu_csv2json.py, replacing the dataset paths and the save path in the script with your own.
● The mmlu_cmmlu_csv2json.py code mainly converts the csv data into json format:
import os
import json
import random
import pandas as pd

random.seed(42)

if __name__ == "__main__":
    save_path = "/path/to/origin_train_alpaca_format.json"
    cmmlu_path_list = ["/path/to/cmmlu/dev/", "/path/to/cmmlu/test/"]
    mmlu_path_list = ["/path/to/mmlu/data/dev", "/path/to/mmlu/data/test", "/path/to/mmlu/data/val"]
    train_data = []

    # CMMLU: every subject csv exists with the same name in both the dev and the test folder
    csv_files = [file for file in os.listdir(cmmlu_path_list[0]) if file.endswith(".csv")]
    for file in csv_files:
        data_list = []
        for folder_path in cmmlu_path_list:
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            if "cmmlu" not in file_path:
                df.columns = ["Question", "A", "B", "C", "D", "Answer"]
            # Convert the DataFrame to a list of record dicts and append the formatted samples
            dict_data = df.to_dict(orient="records")
            for item in dict_data:
                domain = file.replace("_dev", "").replace("_test", "").replace("_val", "").replace("_", " ").replace(".csv", "")
                data_list.append({
                    "instruction": f"Here is a question about {domain}, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
                    "input": "Question: " + item["Question"] + " \nA." + str(item["A"]) + "\nB." + str(item["B"]) + "\nC." + str(item["C"]) + "\nD." + str(item["D"]),
                    "output": "The right option is " + item["Answer"] + "."
                })
        random.shuffle(data_list)
        train_data.extend(data_list)
        print("cmmlu: ", domain, len(data_list))

    # MMLU: the dev/test/val folders use different file-name suffixes
    csv_files = [file for file in os.listdir(mmlu_path_list[0]) if file.endswith(".csv")]
    for file in csv_files:
        data_list = []
        i = 0
        for folder_path in mmlu_path_list:
            i += 1
            if i == 2:
                file = file.replace("_dev", "_test")
            elif i == 3:
                file = file.replace("_test", "_val")
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            if "cmmlu" not in file_path:
                df.columns = ["Question", "A", "B", "C", "D", "Answer"]
            # Convert the DataFrame to a list of record dicts and append the formatted samples
            dict_data = df.to_dict(orient="records")
            for item in dict_data:
                domain = file.replace("_dev", "").replace("_test", "").replace("_val", "").replace("_", " ").replace(".csv", "")
                data_list.append({
                    "instruction": f"Here is a question about {domain}, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
                    "input": "Question: " + item["Question"] + " \nA." + str(item["A"]) + "\nB." + str(item["B"]) + "\nC." + str(item["C"]) + "\nD." + str(item["D"]),
                    "output": "The right option is " + item["Answer"] + "."
                })
        random.shuffle(data_list)
        train_data.extend(data_list)
        print("mmlu: ", domain, len(data_list))

    with open(save_path, "w", encoding="utf-8") as json_file:
        json.dump(train_data, json_file, ensure_ascii=False, indent=4)
    print("train_data: ", len(train_data))
3.1.2 Shuffling the answer options, building new samples, and splitting the dataset ⭐️⭐️⭐️
● Run data_preprocess.py, replacing the dataset paths and the save path in the script with your own.
● Shuffling and construction strategy (a minimal reproduction sketch is given after the examples below):
- For every sample in the dataset, 6 new samples are built. Each new sample completes the output with the full text of the chosen option, and in samples 2-6 the positions of the four options A/B/C/D are re-ordered. Example (original sample):
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.小于、等于下渗能力\nB.等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
"output": "The right option is A."
}
- The 1st new sample (only the output is completed with the option's full text):
>>> Complete the output with the option's full text:
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.小于、等于下渗能力\nB.等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
"output": "The right option is A.小于、等于下渗能力"
}
- The 2nd-6th new samples (the output is completed with the option's full text, and the order of the options is changed):
>>> Swap options A and D in the input, i.e. ABCD → DBCA:
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.小于下渗能力\nB.等于下渗能力\nC.大于下渗能力\nD.小于、等于下渗能力",
"output": "The right option is D.小于、等于下渗能力"
}
>>> Swap options A and C in the input, i.e. ABCD → CBAD:
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.大于下渗能力\nB.小于下渗能力\nC.小于、等于下渗能力\nD.等于下渗能力",
"output": "The right option is C.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BDCA:
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.大于下渗能力\nC.小于下渗能力\nD.小于、等于下渗能力",
"output": "The right option is D.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BCAD:
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.小于下渗能力\nC.小于、等于下渗能力\nD.大于下渗能力",
"output": "The right option is C.小于、等于下渗能力"
}
>>> Re-order the options in the input: ABCD → BACD:
{
"instruction": "Here is a question about college engineering hydrology, the correct answer is one of the options A/B/C/D. Please select the correct option and answer the question with 'The right option is'.",
"input": "Question: 下渗率总是[ ]。 \nA.等于下渗能力\nB.小于、等于下渗能力\nC.小于下渗能力\nD.大于下渗能力",
"output": "The right option is B.小于、等于下渗能力"
}
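● The official data_preprocess.py is not reproduced here. Below is a hedged, minimal sketch of the augmentation idea (option permutation plus answer completion), assuming the alpaca-format samples produced above; PERMS and permute_options are my own illustrative names, not the competition code:
# Illustrative sketch only, not the official data_preprocess.py.
PERMS = ["ABCD", "DBCA", "CBAD", "BDCA", "BCAD", "BACD"]  # identity first, then 5 re-orderings

def permute_options(sample: dict, perm: str) -> dict:
    """Rebuild one alpaca-style sample with its options arranged in `perm` order.

    perm is read position-wise: the new option A takes the text of the old option perm[0], etc.
    """
    question, *options = sample["input"].split("\n")
    texts = {opt[0]: opt[2:] for opt in options}          # {"A": "...", "B": "...", ...}
    answer_letter = sample["output"].strip()[-2]          # "The right option is A." -> "A"
    new_options = [f"{new}.{texts[old]}" for new, old in zip("ABCD", perm)]
    new_letter = "ABCD"[perm.index(answer_letter)]        # where the old correct option landed
    return {
        "instruction": sample["instruction"],
        "input": "\n".join([question] + new_options),
        "output": f"The right option is {new_letter}.{texts[answer_letter]}",
    }
Applying permute_options(sample, p) for every p in PERMS yields the 6 samples per original question described above.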
● Dataset split: the training set and the self-test validation and test sets are built from the new data obtained above.
- Each original sample yields 6 new samples, so the data can be organized into 6 epochs of data (each epoch contains every question of the original data).
- Training set = the first 5 epochs of data.
- The data of the 6th epoch is shuffled within each category, then:
Validation set = the first 20 samples of each category in the 6th epoch.
Test set = samples 20 to 40 of each category in the 6th epoch.
- Sizes: training set 27604 × 5; validation set 20 × 124 (number of categories); test set 20 × 124 (number of categories).
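● As a hedged illustration of this split (not the official script; the grouping helper and names below are my own), one way to realize it in Python:
import random
from collections import defaultdict

random.seed(42)

def split_dataset(epochs, n_eval=20):
    """epochs: 6 lists of alpaca-style samples, one list per augmentation round.

    Returns (train, valid, test): epochs 1-5 form the training set; the 6th epoch is
    shuffled within each category, the first 20 samples per category go to validation
    and the next 20 per category go to the self-test set.
    """
    train = [s for epoch in epochs[:5] for s in epoch]

    by_category = defaultdict(list)
    for s in epochs[5]:
        # "Here is a question about <domain>, the correct answer ..." -> <domain>
        domain = s["instruction"].split("about ", 1)[1].split(",", 1)[0]
        by_category[domain].append(s)

    valid, test = [], []
    for samples in by_category.values():
        random.shuffle(samples)
        valid.extend(samples[:n_eval])
        test.extend(samples[n_eval:2 * n_eval])
    return train, valid, test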
3.1.3 Converting json to mindrecord
● Run alpaca_data_preprocess_v1.py, which fixes a bug in the original script that is triggered when a sample's length is exactly equal to seq_length.
# Copyright 2023 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
transform alpaca dataset to mindrecord.
"""
import argparse
import json
import os

import numpy as np
from mindspore.mindrecord import FileWriter

from internlm_tokenizer import InternLMTokenizer

IGNORE_TOKEN_ID = -100


def get_chat_format_data(ori_data):
    """Format original data

    Args:
        ori_data (dict): input data sample.

    Returns:
        dict: data sample with chat format.
    """
    input_str = ori_data["input"]
    instruction_str = ori_data["instruction"]
    output_str = ori_data["output"]
    data = dict()
    if input_str != "":
        data["user"] = f"<|User|>:{instruction_str}\n{input_str}"
    else:
        data["user"] = f"<|User|>:{instruction_str}"
    data["bot"] = f"<|Bot|>:{output_str}"
    return data


def preprocess(sources, tokenizer, seq_length, bos_token="<s>", eos_token="</s>"):
    """conversation preprocess."""
    input_ids = []
    labels = []
    over_num = 0
    for source in sources:
        data = get_chat_format_data(source)
        # special token ids: end-of-human, end-of-answer, and the newline token
        special_tokens_map = {"<eoh>": 103167, "<eoa>": 103166, "nl_id": 13}
        token_ids = tokenizer.encode(bos_token, add_special_tokens=False)
        human_s = data["user"]
        ass_s = data["bot"]

        human_ids = tokenizer.encode(human_s, add_special_tokens=False) + \
            [special_tokens_map["<eoh>"], special_tokens_map["nl_id"]]
        ass_template_ids = tokenizer.encode("<|Bot|>:", add_special_tokens=False)
        ignore_len = len(human_ids) + len(ass_template_ids)

        # skip the literal "<|Bot|>:" prefix (8 characters) of ass_s; its token ids are prepended instead
        ass_ids = (ass_template_ids + tokenizer.encode(ass_s[8:], add_special_tokens=False) +
                   [special_tokens_map["<eoa>"], special_tokens_map["nl_id"]])
        targets = np.ones([seq_length, ])
        token_ids += human_ids + ass_ids
        if len(token_ids) >= seq_length:
            # over-length samples: truncate, append eos, and mask the whole sequence out of the loss
            over_num += 1
            token_ids = token_ids[:seq_length - 1]
            token_ids += tokenizer.encode(eos_token, add_special_tokens=False)
            targets[:] = IGNORE_TOKEN_ID
        else:
            # pad to seq_length, then mask the <s> + user prompt + "<|Bot|>:" prefix and the padded tail
            token_ids += tokenizer.encode(eos_token, add_special_tokens=False)
            ignore_len_end = seq_length - len(token_ids)
            token_ids = np.pad(token_ids, (0, ignore_len_end), 'constant', constant_values=(0, 0))
            targets = np.array(token_ids)
            targets[:ignore_len + 1] = IGNORE_TOKEN_ID
            targets[-ignore_len_end:] = IGNORE_TOKEN_ID
        input_ids.append(np.array(token_ids).astype(np.int32))
        labels.append(np.array(targets).astype(np.int32))
    print("over_num: ", over_num)
    return dict(
        input_ids=input_ids,
        labels=labels
    )


class SupervisedDataset:
    """Dataset for supervised fine-tuning."""

    def __init__(self, raw_data, tokenizer, seq_length):
        super(SupervisedDataset, self).__init__()
        sources = []
        for example in raw_data:
            sources.append(example)
        data_dict = preprocess(sources, tokenizer, seq_length)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return dict(
            input_ids=self.input_ids[i],
            labels=self.labels[i]
        )


def tokenize_qa(tokenizer, file_path, seq_length):
    raw_data = json.load(open(file_path, "r"))
    dataset_cls = SupervisedDataset(raw_data, tokenizer, seq_length)
    for i in range(len(dataset_cls)):
        yield dataset_cls[i]


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--mindrecord_schema", type=str, default="internlm_alpaca")
    parser.add_argument("--input_glob", type=str, default="./alpaca_data.json")
    parser.add_argument("--output_file", type=str, default="./alpaca_processed/alpaca.mindrecord")
    parser.add_argument("--model_file", type=str, default="./tokenizer.model")
    parser.add_argument("--file_partition", type=int, default=1)
    parser.add_argument("--seq_length", type=int, default=2048)
    args = parser.parse_args()

    out_dir, out_file = os.path.split(os.path.abspath(args.output_file))
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)

    schema = {'input_ids': {"type": "int32", "shape": [-1]},
              'labels': {"type": "int32", "shape": [-1]}}

    writer = FileWriter(file_name=args.output_file,
                        shard_num=args.file_partition)
    writer.add_schema(schema, args.mindrecord_schema)

    # Start to load tokenizer
    if not os.path.exists(args.model_file):
        raise FileNotFoundError(f"file {args.model_file} do not exists.")

    transforms_count = 0
    word_tokenizer = InternLMTokenizer(vocab_file=args.model_file)
    for x in tokenize_qa(word_tokenizer, args.input_glob, args.seq_length + 1):
        transforms_count += 1
        writer.write_raw_data([x])
    print("Transformed {} records.".format(transforms_count))
    writer.commit()

    out_file = args.output_file
    if args.file_partition > 1:
        out_file += '0'
    print("Transform finished, output files refer: {}".format(out_file))
● Code walkthrough:
1. Input and output
Input: a JSON file in the Alpaca format (e.g. alpaca_data.json), in which every training sample has the fields input, instruction, and output.
Output: a MindRecord file (usually with the .mindrecord extension) used for training and evaluation in the MindSpore framework; the output path is given as a command-line argument.
2. Main components
Step 1: format conversion (get_chat_format_data):
- Each input sample is converted into a chat format by get_chat_format_data, which adds the markers <|User|> and <|Bot|> to distinguish the user's and the bot's turns.
- The function formats the instruction, input, and output fields of each sample into dialogue text: the instruction and input become the user turn, and the output becomes the bot turn.
Example: data = { "user": "<|User|>:instruction_text\ninput_text", "bot": "<|Bot|>:output_text" }
Step 2: conversation preprocessing (preprocess):
- Each conversation is tokenized and converted into the format the model is trained on.
- InternLMTokenizer turns the user and bot text into token IDs.
- This step also inserts the special tokens (such as <eoh> and <eoa>) that separate the different parts of a conversation (the user's input and the bot's output).
- If a token sequence exceeds seq_length (the maximum sequence length), it is truncated so that every sample fits the required length.
Step 3: dataset wrapping (SupervisedDataset):
- The SupervisedDataset class wraps the processed data, storing each sample's input_ids and labels for training.
- Each sample therefore has two fields: input_ids (the model input) and labels (the targets used for supervised training).
Step 4: writing MindRecord:
- MindSpore's FileWriter writes the data in MindRecord format according to the declared schema (which defines the types and shapes of input_ids and labels).
- The dataset is saved in shards: if file_partition is greater than 1, the data is split across multiple files.
3. Command-line arguments
--mindrecord_schema: schema name of the MindRecord file.
--input_glob: path of the input JSON data file.
--output_file: path of the output MindRecord file.
--model_file: path of the tokenizer model used for tokenization.
--file_partition: number of shards; if greater than 1, several .mindrecord files are produced.
--seq_length: maximum sequence length, i.e. the maximum number of tokens per input and label.
4. Execution flow
- Parse the command-line arguments.
- Create the output directory if it does not exist.
- Load the tokenizer model (InternLMTokenizer) and use it to tokenize the input text.
- Call tokenize_qa to convert the data sample by sample into MindRecord format and write it to the file.
- Count and print the number of converted records.
5. Special tokens
<|User|> and <|Bot|> mark the user's and the bot's utterances in a conversation.
<eoh> (end of human) and <eoa> (end of answer) mark the end of the user turn and the bot turn, respectively.
nl_id is the id of the newline token, appended after <eoh> and <eoa> to separate the parts of a conversation.
● Running the script:
- Convert the training-set json to mindrecord with seq_length set to 1024 (replace the paths with your own):
cd /home/ma-user/work/mindformers/research/internlm/
python alpaca_data_preprocess_v1.py \
--mindrecord_schema internlm_alpaca \
--input_glob /path/to/train_alpaca_format.json \
--output_file /path/to/train_1024.mindrecord \
--model_file /home/ma-user/work/tokenizer.model \
--seq_length 1024
- Convert the validation-set json to mindrecord with seq_length set to 1023 (during mindformers training, the validation samples are one token shorter than the training samples, i.e. length = training length - 1); replace the paths with your own:
cd /home/ma-user/work/mindformers/research/internlm/
python alpaca_data_preprocess.py \
--mindrecord_schema internlm_alpaca \
--input_glob /path/to/valid_alpaca_format.json \
--output_file /path/to/valid_1024.mindrecord \
--model_file /home/ma-user/work/tokenizer.model \
--seq_length 1023
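● Optionally, the generated file can be sanity-checked by reading it back (my own snippet, not part of the competition code; it assumes MindSpore is installed and the path is replaced with your own):
import mindspore.dataset as ds

# Read back the first record of the generated mindrecord and check the shapes.
dataset = ds.MindDataset(dataset_files="/path/to/train_1024.mindrecord",
                         columns_list=["input_ids", "labels"], shuffle=False)
for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    # the preprocessing script stores seq_length + 1 tokens per sample, so 1025 here
    print(row["input_ids"].shape, row["labels"].shape)
    break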
3.2 Model Training
3.2.1 Environment
- The official image provided in the handbook: a 4-card NPU environment (64 GB of NPU memory in total).
- Disk: 500 GB.
3.2.2 Fine-tuning configuration
● The configuration is based on mindformers/research/internlm/finetune_internlm_7b_lora_mmlu_64G.yaml from the handbook, with the changes below (everything else is kept as in the handbook). Fine-tuning is run on 4 cards; the resulting file is lora_finetune.yaml.
only_save_strategy: False
runner_config:
  epochs: 4
  batch_size: 4
model_config:
  seq_length: 1024
  pet_config:
    pet_type: lora
    # configuration of lora
    lora_rank: 16
    lora_alpha: 32
    lora_dropout: 0.05
    target_modules: '.*wq|.*wk|.*wv|.*wo'
do_eval: True
eval_step_interval: 1726
eval_epoch_interval: -1
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/path/to/valid_1024.mindrecord"
    shuffle: False
  input_columns: ["input_ids", "labels"]
callbacks:
  save_checkpoint_steps: 3452
  keep_checkpoint_max: 10
● The complete lora_finetune.yaml file:
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
src_strategy_path_or_dir: ''
auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'finetune'
# trainer config
trainer:
type: CausalLanguageModelingTrainer
model_name: 'internlm_7b_lora'
# runner config
runner_config:
epochs: 4
batch_size: 4
sink_mode: True
sink_size: 2
# optimizer
optimizer:
type: FP32StateAdamWeightDecay
beta1: 0.9
beta2: 0.999
eps: 1.e-8
weight_decay: 0.01
# lr sechdule
lr_schedule:
type: CosineWithWarmUpLR
learning_rate: 5.e-5
warmup_ratio: 0.03
total_steps: -1 # -1 means it will load the total steps of the dataset
# dataset
train_dataset: &train_dataset
data_loader:
type: MindDataset
dataset_dir: ""
shuffle: True
input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune.
num_parallel_workers: 8
python_multiprocessing: False
drop_remainder: True
repeat: 1
numa_enable: False
prefetch_size: 1
train_dataset_task:
type: CausalLanguageModelDataset
dataset_config: *train_dataset
# if True, do evaluate during the training process. if false, do nothing.
# note that the task trainer should support _evaluate_in_training function.
do_eval: True
eval_step_interval: 1726 # num of step intervals between each eval, -1 means no step end eval.
eval_epoch_interval: -1 # num of epoch intervals between each eval, 1 means eval on every epoch end.
# eval dataset
eval_dataset: &eval_dataset
data_loader:
type: MindDataset
dataset_dir: "/home/ma-user/work/dataset_v2/valid_1024.mindrecord"
shuffle: False
input_columns: ["input_ids", "labels"]
num_parallel_workers: 8
python_multiprocessing: False
drop_remainder: False
repeat: 1
numa_enable: False
prefetch_size: 1
eval_dataset_task:
type: CausalLanguageModelDataset
dataset_config: *eval_dataset
# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
data_parallel: 4
model_parallel: 1
pipeline_stage: 1
micro_batch_num: 1
vocab_emb_dp: True
gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# recompute config
recompute_config:
recompute: True
parallel_optimizer_comm_recompute: False
mp_comm_recompute: True
recompute_slice_activation: True
# callbacks
callbacks:
- type: MFLossMonitor
- type: CheckpointMointor
prefix: "internlm_7b_lora"
save_checkpoint_steps: 3452
save_trainable_params: False # Whether to save fine-tuned weights additionally.
integrated_save: False
async_save: False
keep_checkpoint_max: 10
- type: ObsMonitor
# mindspore context init config
context:
mode: 0 #0--Graph Mode; 1--Pynative Mode
device_target: "Ascend"
enable_graph_kernel: False
graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
max_call_depth: 10000
max_device_memory: "58GB"
save_graphs: False
save_graphs_path: "./graph"
device_id: 0
# parallel context config
parallel:
parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
gradients_mean: False
enable_alltoall: False
full_batch: True
search_mode: "sharding_propagation"
enable_parallel_optimizer: True
strategy_ckpt_config:
save_file: "./ckpt_strategy.ckpt"
only_trainable_params: False
parallel_optimizer_config:
gradient_accumulation_shard: False
parallel_optimizer_threshold: 64
# model config
model:
model_config:
type: InternLMConfig
batch_size: 1 # add for increase predict
seq_length: 1024
hidden_size: 4096
num_layers: 32
num_heads: 32
vocab_size: 103168
multiple_of: 256
rms_norm_eps: 1.0e-6
bos_token_id: 1
eos_token_id: 2
pad_token_id: 2
ignore_token_id: -100
compute_dtype: "float16"
layernorm_compute_type: "float32"
softmax_compute_type: "float32"
rotary_dtype: "float16"
param_init_type: "float16"
has_bias: True
use_past: False
scaling_factor: 1.0
extend_method: "None"
use_flash_attention: False
offset: 0
checkpoint_name_or_path: "internlm_7b_lora"
repetition_penalty: 1.02
max_decode_length: 100
top_k: 50
top_p: 0.9
do_sample: True
pet_config:
pet_type: lora
# configuration of lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: '.*wq|.*wk|.*wv|.*wo'
arch:
type: InternLMForCausalLM
processor:
return_tensors: ms
tokenizer:
unk_token: '<unk>'
bos_token: '<s>'
eos_token: '</s>'
pad_token: '</s>'
type: InternLMTokenizer
vocab_file: '/home/ma-user/work/tokenizer.model'
type: LlamaProcessor
# metric
metric:
type: PerplexityMetric
# wrapper cell config
runner_wrapper:
type: MFTrainOneStepCell
scale_sense:
type: DynamicLossScaleUpdateCell
loss_scale_value: 16384
scale_factor: 2
scale_window: 1000
use_clip_grad: True
eval_callbacks:
- type: ObsMonitor
auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10
profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
layer_scale: False
layer_decay: 0.65
lr_scale_factor: 256
# aicc
remote_save_url: "Please input obs url on AICC platform."
3.2.3 Fine-tuning logs
● Training loss:
● Validation loss:
3.3 Model Evaluation
3.3.1 Original-capability evaluation
● When running the evaluation, the path of the fine-tuned checkpoint must be specified. Compared with the original file, the configuration changes are:
batch_size: 8
max_device_memory: "58GB"
pet_config:
  pet_type: lora
  # configuration of lora
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: '.*wq|.*wk|.*wv|.*wo'
● The complete original-capability evaluation configuration predict_eval_squad.yaml:
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'
# trainer config
trainer:
type: CausalLanguageModelingTrainer
model_name: 'internlm_7b'
# runner config
runner_config:
epochs: 1
batch_size: 8
sink_mode: True
sink_size: 2
# eval dataset
eval_dataset: &eval_dataset
data_loader:
type: MindDataset
dataset_dir: ""
shuffle: False
input_columns: ["input_ids", "labels"]
num_parallel_workers: 8
python_multiprocessing: False
drop_remainder: False
repeat: 1
numa_enable: False
prefetch_size: 1
eval_dataset_task:
type: CausalLanguageModelDataset
dataset_config: *eval_dataset
# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
data_parallel: 1
model_parallel: 8
pipeline_stage: 1
micro_batch_num: 1
vocab_emb_dp: True
gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# recompute config
recompute_config:
recompute: True
parallel_optimizer_comm_recompute: False
mp_comm_recompute: True
recompute_slice_activation: True
# callbacks
callbacks:
- type: MFLossMonitor
- type: CheckpointMointor
prefix: "internlm_7b"
save_checkpoint_steps: 500
keep_checkpoint_max: 3
integrated_save: False
async_save: False
- type: ObsMonitor
# mindspore context init config
context:
mode: 0 #0--Graph Mode; 1--Pynative Mode
device_target: "Ascend"
enable_graph_kernel: False
graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
max_call_depth: 10000
max_device_memory: "58GB"
save_graphs: False
save_graphs_path: "./graph"
device_id: 0
# parallel context config
parallel:
parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
gradients_mean: False
enable_alltoall: False
full_batch: True
search_mode: "sharding_propagation"
enable_parallel_optimizer: False
strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
parallel_optimizer_config:
gradient_accumulation_shard: False
parallel_optimizer_threshold: 64
# model config
model:
model_config:
type: InternLMConfig
batch_size: 2 # add for increase predict
seq_length: 8192
hidden_size: 4096
num_layers: 32
num_heads: 32
vocab_size: 103168
multiple_of: 256
rms_norm_eps: 1.0e-6
bos_token_id: 1
eos_token_id: 2
pad_token_id: 2
ignore_token_id: -100
compute_dtype: "float16"
layernorm_compute_type: "float16"
softmax_compute_type: "float16"
rotary_dtype: "float16"
param_init_type: "float16"
has_bias: True
use_past: True
block_size: 16
num_blocks: 512
is_dynamic: True
scaling_factor: 1.0
extend_method: "None"
offset: 0
checkpoint_name_or_path: "internlm_7b"
repetition_penalty: 1.0
max_decode_length: 700
max_new_tokens: 20
top_k: 3
top_p: 0.8
do_sample: False
is_dynamic: False
pet_config:
pet_type: lora
# configuration of lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: '.*wq|.*wk|.*wv|.*wo'
arch:
type: InternLMForCausalLM
processor:
return_tensors: ms
tokenizer:
unk_token: '<unk>'
bos_token: '<s>'
eos_token: '</s>'
pad_token: '</s>'
type: InternLMTokenizer
vocab_file: '/home/ma-user/work/tokenizer.model'
type: LlamaProcessor
# metric
metric:
type: EmF1Metric
eval_callbacks:
- type: ObsMonitor
auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10
profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
# aicc
remote_save_url: "Please input obs url on AICC platform."
● Original-capability evaluation results (the requirement is met):
F1 score: 49.542794601854574
Em score: 31.398161586840832
3.3.2 Low-parameter ratio
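● A rough, hedged back-of-the-envelope estimate based only on the configuration above (lora_rank 16, LoRA applied to wq/wk/wv/wo, 32 layers, hidden size 4096; the 7.32B total parameter count is my assumption for InternLM-7B). The ratio actually submitted should be computed from the real checkpoint's parameter list:
# Hedged estimate of the LoRA trainable-parameter ratio from the config values above.
# The total parameter count below is an assumption, not an official number.
hidden_size, num_layers, rank = 4096, 32, 16
targets_per_layer = 4                      # wq, wk, wv, wo, each a hidden_size x hidden_size projection

lora_params = num_layers * targets_per_layer * rank * (hidden_size + hidden_size)
total_params = 7.32e9                      # approximate size of InternLM-7B
print(f"LoRA params ~ {lora_params / 1e6:.1f} M, ratio ~ {lora_params / total_params:.4%}")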
3.3.3 Multiple-choice accuracy evaluation
● When running the evaluation, the path of the fine-tuned checkpoint must be specified. Compared with the original file, the configuration changes are:
pet_config:
  pet_type: lora
  # configuration of lora
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: '.*wq|.*wk|.*wv|.*wo'
max_new_tokens: 200
● The complete predict_mmlu.yaml file:
seed: 0
output_dir: './output' # path to save checkpoint/strategy
load_checkpoint: './internlm.ckpt'
auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
use_parallel: False
run_mode: 'predict'
# trainer config
trainer:
type: CausalLanguageModelingTrainer
model_name: 'internlm_7b'
# runner config
runner_config:
epochs: 1
batch_size: 1
sink_mode: True
sink_size: 2
# eval dataset
eval_dataset: &eval_dataset
data_loader:
type: MindDataset
dataset_dir: ""
shuffle: False
input_columns: ["input_ids", "labels"]
num_parallel_workers: 8
python_multiprocessing: False
drop_remainder: False
repeat: 1
numa_enable: False
prefetch_size: 1
eval_dataset_task:
type: CausalLanguageModelDataset
dataset_config: *eval_dataset
# default parallel of device num = 8 for Atlas 800T A2
parallel_config:
data_parallel: 1
model_parallel: 8
pipeline_stage: 1
micro_batch_num: 1
vocab_emb_dp: True
gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1
# recompute config
recompute_config:
recompute: True
parallel_optimizer_comm_recompute: False
mp_comm_recompute: True
recompute_slice_activation: True
# callbacks
callbacks:
- type: MFLossMonitor
- type: CheckpointMointor
prefix: "internlm_7b"
save_checkpoint_steps: 500
keep_checkpoint_max: 3
integrated_save: False
async_save: False
- type: ObsMonitor
# mindspore context init config
context:
mode: 0 #0--Graph Mode; 1--Pynative Mode
device_target: "Ascend"
enable_graph_kernel: False
graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
max_call_depth: 10000
max_device_memory: "26GB"
save_graphs: False
save_graphs_path: "./graph"
device_id: 0
# parallel context config
parallel:
parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
gradients_mean: False
enable_alltoall: False
full_batch: True
search_mode: "sharding_propagation"
enable_parallel_optimizer: False
strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
parallel_optimizer_config:
gradient_accumulation_shard: False
parallel_optimizer_threshold: 64
# model config
model:
model_config:
type: InternLMConfig
batch_size: 1 # add for increase predict
seq_length: 2048
hidden_size: 4096
num_layers: 32
num_heads: 32
vocab_size: 103168
multiple_of: 256
rms_norm_eps: 1.0e-6
bos_token_id: 1
eos_token_id: 2
pad_token_id: 2
ignore_token_id: -100
compute_dtype: "float16"
layernorm_compute_type: "float16"
softmax_compute_type: "float16"
rotary_dtype: "float16"
param_init_type: "float16"
has_bias: True
use_past: True
block_size: 16
num_blocks: 512
is_dynamic: True
scaling_factor: 1.0
extend_method: "None"
offset: 0
checkpoint_name_or_path: "internlm_7b"
repetition_penalty: 1.0
max_decode_length: 700
max_new_tokens: 200
top_k: 3
top_p: 0.8
do_sample: False
is_dynamic: False
pet_config:
pet_type: lora
# configuration of lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: '.*wq|.*wk|.*wv|.*wo'
arch:
type: InternLMForCausalLM
processor:
return_tensors: ms
tokenizer:
unk_token: '<unk>'
bos_token: '<s>'
eos_token: '</s>'
pad_token: '</s>'
type: InternLMTokenizer
vocab_file: '/home/ma-user/work/tokenizer.model'
type: LlamaProcessor
# metric
metric:
type: EmF1Metric
eval_callbacks:
- type: ObsMonitor
auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10
profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
# aicc
remote_save_url: "Please input obs url on AICC platform."
● Self-test result: on a test set of 2480 samples (20 per category), the accuracy is 96.85%.
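● For reference, a hedged sketch of how such an accuracy number can be computed from the generations (my own helpers, not the official evaluation script): it extracts the letter after "The right option is" and compares it with the gold letter.
import re

# Illustrative only: extract_option and accuracy are hypothetical helpers.
OPTION_PATTERN = re.compile(r"The right option is\s*([ABCD])")

def extract_option(text):
    match = OPTION_PATTERN.search(text)
    return match.group(1) if match else None

def accuracy(predictions, references):
    """predictions: generated strings; references: gold outputs such as 'The right option is D.xxx'."""
    correct = 0
    for pred, ref in zip(predictions, references):
        if extract_option(pred) is not None and extract_option(pred) == extract_option(ref):
            correct += 1
    return correct / len(references)

# usage: accuracy(generated_texts, [sample["output"] for sample in test_samples])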
⭐️ ⭐️ Written on January 16, 2025, at 20:22, at my desk in the lab