[InternLM (书生·浦语)] LLM Hands-On: Fine-Tuning LLMs with XTuner: Multimodal & Agent

Note: training exactly as the tutorial describes will raise an error; see this issue for the fix: https://github.com/InternLM/Tutorial/issues/725

1. Multimodal Training and Testing with XTuner

In this lesson we will learn how to fine-tune a multimodal LLM with XTuner. This part requires a 30% slice of an A100 with 24 GB of GPU memory.

Here is what the multimodal LLM can do after you finish this section:

Before finetuning, the multimodal LLM (InternLM_Chat_1.8B_llava) can only write captions for an image. [screenshot]

After finetuning, the multimodal LLM (InternLM_Chat_1.8B_llava) can answer questions about the image.
[screenshot]

Now, let's get started with today's lesson!

1.1. Giving the LLM Electronic Eyes: How Multimodal LLMs Work

1.1.1. Text-Only (Single Modality)

Input text → text embedding model → text vectors → LLM → output text

1.1.2. Text + Image (Multimodal)

Input image → Image Projector → image vectors
Input text → text embedding model → text vectors
Image vectors + text vectors → LLM → output text
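
Conceptually, the Image Projector is just a small network that maps vision-encoder features into the LLM's embedding space, so that the image "tokens" can be concatenated with the text token embeddings and fed to the LLM as one sequence. The minimal PyTorch sketch below only illustrates this idea; the class name, layer sizes, and tensor shapes are illustrative assumptions, not XTuner's actual implementation.

import torch
import torch.nn as nn

class ImageProjector(nn.Module):
    """Toy projector: maps vision-encoder features into the LLM embedding space (illustrative sizes)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_feats)

projector = ImageProjector()
image_feats = torch.randn(1, 576, 1024)   # hypothetical CLIP ViT-L/14-336 patch features
text_embeds = torch.randn(1, 32, 2048)    # hypothetical text token embeddings from the LLM
image_embeds = projector(image_feats)
# The image "tokens" and text tokens are concatenated into one sequence for the LLM.
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
print(inputs_embeds.shape)                # torch.Size([1, 608, 2048])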

1.2. Which Model of Electronic Eye: A Brief Look at LLaVA

Haotian Liu et al. used GPT-4V to generate descriptions for image data, and from these built a large number of <question text><image> -- <answer text> pairs. Using these pairs together with a text-only LLM, they trained an Image Projector.

The text-only LLM that is used, together with the trained Image Projector, is collectively called the LLaVA model.

1.2.1. LLaVA Training Stage (Diagram)

[Diagram] Training stage: on a GPU, text + image question-answer pairs (the training data), together with the text-only LLM, are used to train the Image Projector.

1.2.2. LLaVA Testing Stage (Diagram)

[Diagram] Testing stage: on a GPU, the input image goes through the Image Projector and, together with the text-only LLM, produces the output text.

Training and testing the Image Projector is somewhat like the LoRA fine-tuning scheme we covered earlier.

Both start from an existing LLM and use new data to train a small new file.

The difference is that an LLM with a LoRA gains a new soul (a new persona), while an LLM with an Image Projector finally gains eyes.

1.3. Quick Start

1.3.1. Environment Setup

1.3.1.1. Preparing the Dev Machine

See the earlier articles.

1.3.1.2. Installing XTuner
# If you are on the InternStudio platform, clone an existing environment that already has PyTorch:
# pytorch    2.0.1   py3.10_cuda11.7_cudnn8.5.0_0

cd ~ && studio-conda xtuner0.1.17
# If you are on another platform:
# conda create --name xtuner0.1.17 python=3.10 -y

# Activate the environment
conda activate xtuner0.1.17
# Go to the home directory (~ means "the current user's home path")
cd ~
# Create a version-specific folder and enter it, so you can follow along with this tutorial
mkdir -p /root/xtuner0117 && cd /root/xtuner0117

# Pull the v0.1.17 source code
git clone -b v0.1.17  https://github.com/InternLM/xtuner
# If you cannot access GitHub, pull from Gitee instead:
# git clone -b v0.1.15 https://gitee.com/Internlm/xtuner

# Enter the source directory
cd /root/xtuner0117/xtuner

# Install XTuner from source
pip install -e '.[all]' && cd ~

If the install is too slow, press Ctrl + C to abort and rerun it through a mirror: pip install -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple/

If no errors appeared during this process, the environment needed to run XTuner has been installed successfully. For many beginners, getting the environment set up is already more than half the battle!
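
As an optional sanity check, you can confirm the install from Python; this assumes the package exposes a __version__ attribute, as recent XTuner releases do:

# Optional sanity check; assumes xtuner exposes __version__ (true for recent releases).
import xtuner
print(xtuner.__version__)   # expect 0.1.17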

1.3.2. Overview

In this section we will build our own <question text><image>--<answer text> data pairs and, starting from the text-only model InternLM2_Chat_1.8B, use the LLaVA approach to train an Image Projector file for InternLM2_Chat_1.8B to use.

In the LLaVA approach, the process of giving the LLM visual capability is exactly the process of training the Image Projector file.
This process has two stages: Pretrain and Finetune.

[Diagram] Pretrain stage: on a GPU, images + captions (short text), together with the text-only LLM (InternLM2_Chat_1.8B), produce a Pretrained LLaVA. Finetune stage: on a GPU, images + complex conversation text further train it into a Finetuned LLaVA.

1.3.3. Pretrain Stage

In the Pretrain stage we use a large number of image + simple-text (caption, i.e. image title) pairs so that the LLM learns the general features of images. In other words, it takes a rough look at a large number of pictures.

After the Pretrain stage finishes, the model already has visual capability! However, because the training data consisted only of images and their captions, no matter what the user asks, the model will just reply with a caption of the input image. In other words, at this point the model can only "write captions" for the input image.

The Pretrain stage is comparable to the pre-training work done when developing an LLM, and its hardware requirements are very high. Students with 8 GPUs and energy to spare can try it themselves; see the XTuner-LLaVA docs and the LLaVA project for details.

NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2

NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2

For this camp we have already provided the output of the Pretrain stage: the iter_2181.pth file. This is the "kindergarten-level" Image Projector! Just take iter_2181.pth with you into the next stage and continue with Finetune.

1.3.4. Finetune Stage

In the Finetune stage we use image + complex-text pairs to further train the Image Projector obtained from Pretrain, i.e. iter_2181.pth.

1.3.4.1. Building the Training Data
1.3.4.1.1. Format
[
    {
        "id": "any string",
        "image": "Relative path of the image file. Relative to what? Relative to the image_folder path you specify in the config file later.",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nQuestion 1."
            },
            {
                "from": "gpt",
                "value": "Answer 1."
            },
            {
                "from": "human",
                "value": "Question 2."
            },
            {
                "from": "gpt",
                "value": "Answer 2."
            },
            # ......
            {
                "from": "human",
                "value": "Question n."
            },
            {
                "from": "gpt",
                "value": "Answer n."
            },
        ]
    },

    # The second training sample starts below.

    {
        "id": "any string",
        "image": "Relative path of the image file, again relative to the image_folder specified in the config file.",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nQuestion 1."
            },
            # ......
            {
                "from": "gpt",
                "value": "Answer n."
            }
        ]
    }
]

Note: in every training sample, the first question from human must be preceded by the image placeholder, i.e. <image>.

Example
[
  {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  },
  {
    "id": "000000052846",
    "image": "coco/train2017/000000052846.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhere is the cat positioned in the image?"
      },
      {
        "from": "gpt",
        "value": "The cat is positioned on top of the back of the couch in the living room."
      }
    ]
  }
 ]
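
If you later build your own data, a small sanity check like the sketch below can catch the most common mistakes: a first human turn without the <image> placeholder, turns that do not alternate between human and gpt, or an image path that does not resolve under image_folder. The paths are the ones used in this tutorial; adjust them to your own setup.

import json
import os

# Tutorial paths; change these to your own data file and image folder.
data_path = "/root/tutorial/xtuner/llava/llava_data/repeated_data.json"
image_folder = "/root/tutorial/xtuner/llava/llava_data/"

with open(data_path) as f:
    samples = json.load(f)

for sample in samples:
    convs = sample["conversations"]
    # The first human turn must carry the <image> placeholder.
    assert convs[0]["from"] == "human" and convs[0]["value"].startswith("<image>"), sample["id"]
    # Turns must alternate human / gpt.
    assert all(c["from"] == ("human" if i % 2 == 0 else "gpt") for i, c in enumerate(convs)), sample["id"]
    # The image path is relative to image_folder.
    assert os.path.exists(os.path.join(image_folder, sample["image"])), sample["image"]

print(f"Checked {len(samples)} samples, all OK.")
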
1.3.4.1.2. Creating the Data

Following the approach of the LLaVA authors, we can send our own image to GPT and ask it to generate a number of question-answer pairs in the format above.

Create a dataset for me, following this format.

[
  {
    "id": "<random_number_string>",
    "image": "test_img/oph.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nDescribe this image."
      },
      {
        "from": "gpt",
        "value": "<answer1>"
      },
      {
        "from": "human",
        "value": "<question2>"
      },
      {
        "from": "gpt",
        "value": "<answer2>"
      },
      {
        "from": "human",
        "value": "<question3>"
      },
      {
        "from": "gpt",
        "value": "<answer3>"
      }
    ]
  }
]

The questions and answers, please generate for me, based on the image I sent to you. The questions should go from the shallow to the deep, and the answers should be as detailed and correct as possible. The questions and answers should stick to the contents of the image itself, like objects, people, equipment, environment, purpose, color, attitude, etc. 5 question and answer pairs.


To make it easy to follow along, the question-answer data for this sample image (repeated_data.json) can be generated by running the script below (it repeats the data 200 times):

cd ~ && git clone https://github.com/InternLM/tutorial -b camp2 && conda activate xtuner0.1.17 && cd tutorial

python /root/tutorial/xtuner/llava/llava_data/repeat.py \
  -i /root/tutorial/xtuner/llava/llava_data/unique_data.json \
  -o /root/tutorial/xtuner/llava/llava_data/repeated_data.json \
  -n 200
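
repeat.py just duplicates the handful of hand-written question-answer samples so that this tiny dataset can be fitted during Finetune. If you cannot clone the tutorial repository, a rough stand-in would look like the sketch below (an approximation, not the original script):

# Rough stand-in for repeat.py: duplicate every sample n times (a sketch, not the original script).
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", required=True)
parser.add_argument("-o", "--output", required=True)
parser.add_argument("-n", "--repeats", type=int, default=200)
args = parser.parse_args()

with open(args.input) as f:
    unique_samples = json.load(f)

# Repeat the whole list n times and write it back out.
with open(args.output, "w") as f:
    json.dump(unique_samples * args.repeats, f, ensure_ascii=False, indent=2)
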
1.3.4.2. Preparing the Config File

If you are too lazy to edit the config file yourself, or your edits keep failing, we have prepared a ready-made fool_config file in the repository. Run:

cp /root/tutorial/xtuner/llava/llava_data/internlm2_chat_1_8b_llava_tutorial_fool_config.py /root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py
1.3.4.2.1. Creating the Config File
# List XTuner's built-in config files matching this prefix
xtuner list-cfg -p llava_internlm2_chat_1_8b

# Copy the config file into /root/tutorial/xtuner/llava
xtuner copy-cfg \
  llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
  /root/tutorial/xtuner/llava

At this point the file structure under your /root/tutorial/xtuner/llava/ directory should look like this:

1.3.4.2.2. Modifying the Config File

In llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py, modify the following:

  • pretrained_pth
  • llm_name_or_path
  • visual_encoder_name_or_path
  • data_root
  • data_path
  • image_folder
  • batch_size
  • evaluation_inputs
# Model
- llm_name_or_path = 'internlm/internlm2-chat-1_8b'
+ llm_name_or_path = '/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b'
- visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
+ visual_encoder_name_or_path = '/root/share/new_models/openai/clip-vit-large-patch14-336'

# Specify the pretrained pth
- pretrained_pth = './work_dirs/llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth'  # noqa: E501
+ pretrained_pth = '/root/share/new_models/xtuner/iter_2181.pth'

# Data
- data_root = './data/llava_data/'
+ data_root = '/root/tutorial/xtuner/llava/llava_data/'
- data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
+ data_path = data_root + 'repeated_data.json'
- image_folder = data_root + 'llava_images'
+ image_folder = data_root

# Scheduler & Optimizer
- batch_size = 16  # per_device
+ batch_size = 1  # per_device


# evaluation_inputs
- evaluation_inputs = ['请描述一下这张图片','Please describe this picture']
+ evaluation_inputs = ['Please describe this picture','What is the equipment in the image?']
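
Before launching training, you can optionally check that the edited config still parses and that the new paths took effect. XTuner configs are plain mmengine config files, so (with the xtuner0.1.17 environment active) they can be loaded as sketched below; the printed values should match the edits above.

from mmengine.config import Config

# Load the edited copy of the config and spot-check the overridden variables.
cfg = Config.fromfile(
    "/root/tutorial/xtuner/llava/"
    "llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py"
)
print(cfg.llm_name_or_path)   # local internlm2-chat-1_8b path
print(cfg.pretrained_pth)     # /root/share/new_models/xtuner/iter_2181.pth
print(cfg.data_path)          # should end with repeated_data.json
print(cfg.batch_size)         # 1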

1.3.4.3. Start Finetuning
cd /root/tutorial/xtuner/llava/
xtuner train /root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py --deepspeed deepspeed_zero2

Error encountered during training:

Traceback (most recent call last):
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/root/xtuner0117/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 221, in before_train
    self._generate_samples(runner, max_new_tokens=50)
  File "/root/xtuner0117/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 207, in _generate_samples
    self._eval_images(runner, model, device, max_new_tokens,
  File "/root/xtuner0117/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 146, in _eval_images
    generation_output = model.generate(
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/peft/peft_model.py", line 1491, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/transformers/generation/utils.py", line 1759, in generate
    result = self._sample(
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/transformers/generation/utils.py", line 2391, in _sample
    model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/transformers/generation/utils.py", line 1322, in _get_initial_cache_position
    past_length = model_kwargs["past_key_values"][0][0].shape[2]
TypeError: 'NoneType' object is not subscriptable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/xtuner0117/xtuner/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/root/xtuner0117/xtuner/xtuner/tools/train.py", line 338, in main
    runner.train()
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
    model = self.train_loop.run()  # type: ignore
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/mmengine/runner/loops.py", line 271, in run
    self.runner.call_hook('before_train')
  File "/root/.conda/envs/xtuner0.1.17/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1273, in call_hook
    raise TypeError(f'{e} in {hook}') from e
TypeError: 'NoneType' object is not subscriptable in <xtuner.engine.hooks.evaluate_chat_hook.EvaluateChatHook object at 0x7f4a8469f3a0>

Solution: see the issue linked at the top of this article (https://github.com/InternLM/Tutorial/issues/725).
[screenshot]

1.3.5. Comparing Performance Before and After Finetune

1.3.5.1. Before Finetune

That is, load the 1.8B model and the Pretrain-stage output (iter_2181) into GPU memory.

# Work around a small MKL bug
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU

# Convert the .pth checkpoint to HuggingFace format
xtuner convert pth_to_hf \
  llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain \
  /root/share/new_models/xtuner/iter_2181.pth \
  /root/tutorial/xtuner/llava/llava_data/iter_2181_hf

# Launch!
xtuner chat /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b \
  --visual-encoder /root/share/new_models/openai/clip-vit-large-patch14-336 \
  --llava /root/tutorial/xtuner/llava/llava_data/iter_2181_hf \
  --prompt-template internlm2_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

Q1: Describe this image.
Q2: What is the equipment in the image?

1.3.5.2. After Finetune

That is, load the 1.8B model and the Finetune-stage output into GPU memory.

# Work around a small MKL bug
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU

# Convert the .pth checkpoint to HuggingFace format
xtuner convert pth_to_hf \
  /root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py \
  /root/tutorial/xtuner/llava/work_dirs/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy/iter_1200.pth \
  /root/tutorial/xtuner/llava/llava_data/iter_1200_hf

# Launch!
xtuner chat /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b \
  --visual-encoder /root/share/new_models/openai/clip-vit-large-patch14-336 \
  --llava /root/tutorial/xtuner/llava/llava_data/iter_1200_hf \
  --prompt-template internlm2_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

Q1: Describe this image.
Q2: What is the equipment in the image?

Comparison before and after Finetune:

Before Finetune: it only writes captions.
[screenshot]

After Finetune: it answers questions.
[screenshot]

Advanced Assignment

The platform's built-in resources would not run in either CPU mode or GPU mode.
[screenshot]
After applying for GPU resources, the job stayed stuck in the queue (for a whole day, over multiple attempts).
[screenshot]
So I gave up on the platform's "scarce resources" and deployed locally; the result is shown below.
[screenshot] I spun up a spot instance on a cloud server (under 1 RMB), set up an intranet tunnel for forwarding, and then modified the app.py code.
Note that the API call differs between gradio versions; the following is for 4.10.0.

import gradio as gr
from gradio_client import Client

# Connect to the model service exposed by the cloud server (through the tunnel).
client = Client("http://{public IP of the cloud server}:6999")

def chat(message, history):
    # Forward the user's message to the remote /chat endpoint and return its reply.
    result = client.predict(message, api_name="/chat")
    return result

gr.ChatInterface(
    chat,
    title="InternLM2-Chat-Demo",
    description="InternLM is mainly developed by Shanghai AI Laboratory.",
).queue(1).launch()

[screenshot]
