Notes on Fine-Tuning InternLM-Chat-7B with XTuner

m0_49908454

Last modified: 2024-01-19 12:40:58

Tags: notes, language model

First published: 2024-01-19 11:46:21

Article link: https://blog.csdn.net/m0_49908454/article/details/135688928

Fine-tuning results

Before fine-tuning

After fine-tuning

Fine-tuning

Preparation

Download the model

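# Create a directory for the model files so they do not end up scattered around
# (-p also creates ~/weitiao if it does not exist yet)
mkdir -p ~/weitiao/internlm-chat-7b

# Install the library used for pulling model files
pip install modelscope

# Download the model files from ModelScope
cd ~/weitiao
apt install git git-lfs -y
git lfs install
git lfs clone https://modelscope.cn/Shanghai_AI_Laboratory/internlm-chat-7b.git -b v1.0.3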
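
A large-model clone can finish without errors while some weight files are still small Git LFS pointer stubs, for example when git-lfs was not active during the clone. The following is a minimal sketch of a check for that case. The function name find_lfs_pointers is made up for illustration; it relies only on the fact that every LFS pointer file starts with the line "version https://git-lfs.github.com/spec/v1".

```python
from pathlib import Path

# Every Git LFS pointer stub begins with this exact line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    model_dir = Path(model_dir)
    pointers = []
    for path in sorted(model_dir.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(model_dir).parts:
            continue
        with open(path, "rb") as f:
            head = f.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


if __name__ == "__main__":
    # Assumed location, matching the directory used in the commands above.
    model_dir = Path.home() / "weitiao" / "internlm-chat-7b"
    if model_dir.is_dir():
        for stub in find_lfs_pointers(model_dir):
            print("still an LFS pointer:", stub)
```

If the script prints any paths, running git lfs pull inside the cloned repository should fetch the real files.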

Install XTuner

# Clone the source code at version 0.1.9
git clone -b v0.1.9 https://github.com/InternLM/xtuner

# If you cannot access GitHub, clone from Gitee instead:
# git clone -b v0.1.9 https://gitee.com/Internlm/xtuner

# Enter the source directory
cd xtuner

# Install XTuner from source
pip install -e '.[all]'

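Prepare the data

Optionally, download the training set:

https://huggingface.co/datasets/timdettmers/openassistant-guanaco/tree/main

Prepare the fine-tuning data

Create a JSON file whose content follows the sample below. Copy and paste the entries n times as a crude form of data augmentation, because a very small dataset will not fine-tune the model effectively; the sample only illustrates the format. Both sample inputs mean "Please introduce yourself", and the output means "I am 臭屁娃娃's little assistant, built on the 7B InternLM (书生·浦语) model from Shanghai AI Laboratory".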

[
    {
        "conversation": [
            {
                "input": "请介绍一下你自己",
                "output": "我是臭屁娃娃的小助手，内在是上海AI实验室书生·浦语的7B大模型哦"
            }
        ]
    },
    {
        "conversation": [
            {
                "input": "请做一下自我介绍",
                "output": "我是臭屁娃娃的小助手，内在是上海AI实验室书生·浦语的7B大模型哦"
            }
        ]
    }
]
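
Before training, it can help to confirm that the JSON file really has this shape. The sketch below checks only what the sample shows: a top-level array of records, each with a "conversation" list whose turns carry non-empty "input" and "output" strings. The function names are hypothetical, and the check is an assumption drawn from the sample, not XTuner's own validation.

```python
import json


def validate_conversations(records):
    """Return a list of problems; an empty list means the format looks right."""
    if not isinstance(records, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, record in enumerate(records):
        turns = record.get("conversation") if isinstance(record, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {i}: missing or empty 'conversation' list")
            continue
        for j, turn in enumerate(turns):
            if not isinstance(turn, dict):
                problems.append(f"record {i}, turn {j}: must be an object")
                continue
            for key in ("input", "output"):
                value = turn.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append(
                        f"record {i}, turn {j}: '{key}' must be a non-empty string")
    return problems


def validate_file(path):
    """Load a JSON file as UTF-8 and validate its records."""
    with open(path, encoding="utf-8") as f:
        return validate_conversations(json.load(f))
```

For example, validate_file('123.json') returning an empty list suggests the file matches the sample format.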

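A Python script can generate the dataset. Create a file named generate_data.py in the data directory, paste in the following code, and run it. The script writes 123.json to the directory it is run from, and the config file edited later expects that file at ./123.json, so make sure the two locations match.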

import json

# Enter your name
name = '臭屁娃娃'
# Number of repetitions
n = 10000

data = [
    {
        "conversation": [
            {
                "input": "请做一下自我介绍",
                "output": "我是{}的小助手，内在是上海AI实验室书生·浦语的7B大模型哦".format(name)
            }
        ]
    }
]

for i in range(n):
    data.append(data[0])

with open('123.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
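
The script above repeats a single question. As a possible variation, the sketch below cycles through several phrasings of the question (the two from the JSON sample) while keeping the same answer. The helper names build_dataset and write_dataset are illustrative, and whether varied phrasings actually improve the result is an assumption that was not measured here.

```python
import json


def build_dataset(name, questions, n):
    """Build n single-turn records, cycling through the given questions."""
    answer = "我是{}的小助手，内在是上海AI实验室书生·浦语的7B大模型哦".format(name)
    return [
        {"conversation": [{"input": questions[i % len(questions)],
                           "output": answer}]}
        for i in range(n)
    ]


def write_dataset(records, path):
    """Write records as UTF-8 JSON, keeping Chinese characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    questions = ["请介绍一下你自己", "请做一下自我介绍"]
    records = build_dataset("臭屁娃娃", questions, 10)
    print(len(records), "records built")
```

Calling write_dataset(records, '123.json') would then produce a file in the same format as the original script.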

Fine-tuning

Prepare the config file

cd ~/weitiao
xtuner list-cfg
xtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .
# Rename the copy
mv internlm_chat_7b_qlora_oasst1_e3_copy.py internlm_chat_7b_qlora_123_e3.py

Modify the config file

# Modify the import section
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from xtuner.dataset.map_fns import template_map_fn_factory
# Point the model to the local path
- pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
+ pretrained_model_name_or_path = './internlm-chat-7b'
# Point the training dataset to the local path
- data_path = 'timdettmers/openassistant-guanaco'
+ data_path = './123.json'
# Modify the train_dataset object
train_dataset = dict(
    type=process_hf_dataset,
-   dataset=dict(type=load_dataset, path=data_path),
+   dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
-   dataset_map_fn=oasst1_map_fn,
+   dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)
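
A typo in one of these edits usually shows up only after the training launch has spent time loading everything. The sketch below is a hypothetical pre-flight check, not part of XTuner: it parses the edited config with Python's ast module (without importing it), reads the two path settings, verifies that they exist relative to a base directory, and flags any leftover reference to oasst1_map_fn.

```python
import ast
from pathlib import Path

PATH_SETTINGS = ("pretrained_model_name_or_path", "data_path")


def read_string_settings(config_text, names):
    """Collect top-level `name = 'string'` assignments from config source."""
    found = {}
    for node in ast.parse(config_text).body:
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            continue
        target = node.targets[0]
        if (isinstance(target, ast.Name) and target.id in names
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            found[target.id] = node.value.value
    return found


def check_config(config_path, base_dir="."):
    """Return a list of problems found in the edited config file."""
    text = Path(config_path).read_text(encoding="utf-8")
    settings = read_string_settings(text, PATH_SETTINGS)
    problems = []
    for name in PATH_SETTINGS:
        if name not in settings:
            problems.append(f"{name}: not found as a plain string assignment")
        elif not (Path(base_dir) / settings[name]).exists():
            problems.append(f"{name}: {settings[name]!r} does not exist")
    if "oasst1_map_fn" in text:
        problems.append("oasst1_map_fn is still referenced")
    return problems
```

Run from ~/weitiao, check_config('./internlm_chat_7b_qlora_123_e3.py') returning an empty list would suggest that the paths resolve and the old map function is gone.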

At this point we have the model files, the config file, and the training data, so fine-tuning can begin.

Fine-tuning

Training

# Single GPU
## Train with the config file edited above (renamed in the earlier mv step)
xtuner train ./internlm_chat_7b_qlora_123_e3.py

# Multiple GPUs
NPROC_PER_NODE=${GPU_NUM} xtuner train ./internlm_chat_7b_qlora_123_e3.py

# To enable DeepSpeed acceleration, add --deepspeed deepspeed_zero2
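
When the launch is scripted, the single-GPU, multi-GPU, and DeepSpeed variants differ only in one environment variable and one flag. The sketch below assembles them in one place. build_train_command is a made-up helper; it uses only the command, flag, and variable named above, and it builds the command without running it.

```python
import os


def build_train_command(config_path, gpu_num=1, deepspeed=None):
    """Return (argv, env) for launching `xtuner train`; nothing is executed."""
    argv = ["xtuner", "train", config_path]
    if deepspeed:
        # For example deepspeed="deepspeed_zero2", as mentioned above.
        argv += ["--deepspeed", deepspeed]
    env = dict(os.environ)
    if gpu_num > 1:
        # The multi-GPU form above sets NPROC_PER_NODE to the GPU count.
        env["NPROC_PER_NODE"] = str(gpu_num)
    return argv, env


if __name__ == "__main__":
    argv, env = build_train_command(
        "./internlm_chat_7b_qlora_123_e3.py", gpu_num=2,
        deepspeed="deepspeed_zero2")
    print(" ".join(argv))
```

To actually start training, the result could be passed to subprocess.run(argv, env=env, check=True).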

Convert to a HuggingFace model

Convert the resulting PTH model to a HuggingFace model, that is, generate the Adapter folder:

mkdir hf
export MKL_SERVICE_FORCE_INTEL=1
xtuner convert pth_to_hf ./internlm_chat_7b_qlora_123_e3.py ./work_dirs/internlm_chat_7b_qlora_123_e3/epoch_1.pth ./hf

At this point, the hf folder is what is commonly called the "LoRA model files".
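
The conversion command above hard-codes epoch_1.pth. If training ran for more epochs, a later checkpoint may be the one to convert. The sketch below picks the highest-numbered checkpoint in a work directory, assuming the files follow the epoch_N.pth naming seen in the command; the helper name is hypothetical.

```python
import re
from pathlib import Path


def latest_epoch_checkpoint(work_dir):
    """Return the epoch_N.pth path with the largest N, or None if none exist."""
    best = None
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = re.fullmatch(r"epoch_(\d+)\.pth", path.name)
        if match is None:
            continue
        epoch = int(match.group(1))
        # Compare numerically so that epoch_10 ranks above epoch_2.
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return best[1] if best else None


if __name__ == "__main__":
    print(latest_epoch_checkpoint("./work_dirs/internlm_chat_7b_qlora_123_e3"))
```

The returned path can be used in place of the epoch_1.pth argument of xtuner convert pth_to_hf.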

Deployment and testing

Merge the HuggingFace adapter into the large language model:

xtuner convert merge ./internlm-chat-7b ./hf ./merged --max-shard-size 2GB
# xtuner convert merge \
#     ${NAME_OR_PATH_TO_LLM} \
#     ${NAME_OR_PATH_TO_ADAPTER} \
#     ${SAVE_PATH} \
#     --max-shard-size 2GB

Chat with the merged model:

# Chat with the merged model (Float 16)
xtuner chat ./merged --prompt-template internlm_chat

# Load with 4-bit quantization
# xtuner chat ./merged --bits 4 --prompt-template internlm_chat

Source: https://github.com/InternLM/tutorial/blob/vansin-patch-4/xtuner/README.md

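Summary

1. Prepare the model, XTuner, and the data

2. Modify the config file

3. Train to obtain the epoch checkpoint (.pth) files

4. Convert the PTH checkpoint to HuggingFace format

5. Merge the HuggingFace adapter into the base model

6. Chat with the merged model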
