一、环境准备
(一)开发机准备
具体操作与之前步骤相同,需注意——需要的GPU资源为24GB 30% 的 A100。
(二)XTuner安装
# 如果你是在 InternStudio 平台,则从本地 clone 一个已有 pytorch 的环境:
# pytorch 2.0.1 py3.10_cuda11.7_cudnn8.5.0_0
cd ~ && studio-conda xtuner0.1.17
# 如果你是在其他平台:
# conda create --name xtuner0.1.17 python=3.10 -y
# 激活环境
conda activate xtuner0.1.17
# 进入家目录 (~的意思是 “当前用户的home路径”)
cd ~
# 创建版本文件夹并进入,以跟随本教程
mkdir -p /root/xtuner0117 && cd /root/xtuner0117
# 拉取 0.1.17 的版本源码
git clone -b v0.1.17 https://github.com/InternLM/xtuner
# 无法访问github的用户请从 gitee 拉取:
# git clone -b v0.1.15 https://gitee.com/Internlm/xtuner
# 进入源码目录
cd /root/xtuner0117/xtuner
# 从源码安装 XTuner
pip install -e '.[all]' && cd ~
假如速度太慢可以
Ctrl + C
退出后换成pip install -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple/
假如在这一过程中没有出现任何的报错的话,那也就意味着我们成功安装好支持 XTuner 所运行的环境啦。其实对于很多的初学者而言,安装好环境意味着成功了一大半!
二、概述
在本节中,我们将 自己构造
<question text><image>--<answer text>
数据对,基于InternLM2_Chat_1.8B这个文本单模态模型,使用LLaVA方案,训练一个给InternLM2_Chat_1.8B使用的Image Projector文件。
LLaVA方案中,给LLM增加视觉能力的过程,即是训练Image Projector文件的过程。 该过程分为2个阶段:Pretrain和Finetune。
三、Pretrain阶段
在Pretrain阶段,我们会使用大量的图片+简单文本(caption, 即图片标题)
数据对,使LLM理解图像中的普遍特征。即,对大量的图片进行粗看。
Pretrain阶段训练完成后,此时的模型已经有视觉能力了!但是由于训练数据中都是图片+图片标题,所以此时的模型虽然有视觉能力,但无论用户问它什么,它都只会回答输入图片的标题。即,此时的模型只会给输入图像“写标题”。
Pretrain阶段相当于是开发LLM时预训练工作,对硬件要求非常高,有8卡的学有余力同学可以自行尝试。详见XTuner-LLaVA和LLaVA。
在本次实战营中,我们已经为大家提供了Pretrain阶段的产物——iter_2181.pth
文件。它就是幼稚园阶段的Image Projector!大家带着iter_2181.pth
文件继续进入下一阶段进行Finetune即可。
四、Finetune阶段
在Finetune阶段,我们会使用图片+复杂文本
数据对,来对Pretrain得到的Image Projector即iter_2181.pth进行进一步的训练。
(一)训练数据构建
1、格式
[
{
"id": "随便什么字符串",
"image": "图片文件的相对位置。相对谁?相对你后面config文件里指定的image_folder参数的路径。",
"conversation": [
{
"from": "human",
"value": "<image>\n第1个问题。"
},
{
"from": "gpt",
"value": "第1个回答"
},
{
"from": "human",
"value": "第2个问题。"
},
{
"from": "gpt",
"value": "第2个回答"
},
# ......
{
"from": "human",
"value": "第n个问题。"
},
{
"from": "gpt",
"value": "第n个回答"
},
]
},
# 下面是第2组训练数据了。
{
"id": "随便什么字符串",
"image": "图片文件的相对位置。相对谁?相对你后面config文件里指定的image_folder参数的路径。",
"conversation": [
{
"from": "human",
"value": "<image>\n第1个问题。"
},
# ......
{
"from": "gpt",
"value": "第n个回答"
}
]
}
]
注意:每组训练数据的第1个来自human的问题前,要加上图片占位符,即
<image>
示例
[
{
"id": "000000033471",
"image": "coco/train2017/000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
},
{
"id": "000000052846",
"image": "coco/train2017/000000052846.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhere is the cat positioned in the image?"
},
{
"from": "gpt",
"value": "The cat is positioned on top of the back of the couch in the living room."
}
]
}
]
2、制作
我们可以效法LLaVA作者的做法,将自己的图片发送给GPT,要求其按照上述格式生成若干条问答对。
prompts
Create a dataset for me, following this format.
[
{
"id": "<random_number_string>",
"image": "test_img/oph.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nDescribe this image."
},
{
"from": "gpt",
"value": "<answer1>"
},
{
"from": "human",
"value": "<question2>"
},
{
"from": "gpt",
"value": "<answer2>"
},
{
"from": "human",
"value": "<question3>"
},
{
"from": "gpt",
"value": "<answer3>"
}
]
}
]
为了方便大家跟随课程,针对这张示例图片的问答对数据(repeat_data.json),大家按照下面的脚本运行就可以生成啦~(重复200次)
cd ~ && git clone https://github.com/InternLM/tutorial -b camp2 && conda activate xtuner0.1.17 && cd tutorial
python /root/tutorial/xtuner/llava/llava_data/repeat.py \
-i /root/tutorial/xtuner/llava/llava_data/unique_data.json \
-o /root/tutorial/xtuner/llava/llava_data/repeated_data.json \
-n 200
(二)准备配置文件
如果你懒到不想自己改配置文件,或者怎么改都失败。我们准备了一个fool_config文件在仓库里。运行:
cp /root/tutorial/xtuner/llava/llava_data/internlm2_chat_1_8b_llava_tutorial_fool_config.py /root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py
1、创建配置文件
# 查询xtuner内置配置文件
xtuner list-cfg -p llava_internlm2_chat_1_8b
# 拷贝配置文件到当前目录
xtuner copy-cfg \
llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
/root/tutorial/xtuner/llava
当前你的/root/tutorial/xtuner/llava/
目录下的文件结构应该是这样:
|-- llava_data
| |-- repeat.py
| |-- repeated_data.json
| |-- test_img
| | `-- oph.jpg
| `-- unique_data.json
`-- llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py
2、修改配置文件
修改llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py
文件中的:
- pretrained_pth
- llm_name_or_path
- visual_encoder_name_or_path
- data_root
- data_path
- image_folder
# Model
- llm_name_or_path = 'internlm/internlm2-chat-1_8b'
+ llm_name_or_path = '/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b'
- visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
+ visual_encoder_name_or_path = '/root/share/new_models/openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
- pretrained_pth = './work_dirs/llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
+ pretrained_pth = '/root/share/new_models/xtuner/iter_2181.pth'
# Data
- data_root = './data/llava_data/'
+ data_root = '/root/tutorial/xtuner/llava/llava_data/'
- data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
+ data_path = data_root + 'repeated_data.json'
- image_folder = data_root + 'llava_images'
+ image_folder = data_root
# Scheduler & Optimizer
- batch_size = 16 # per_device
+ batch_size = 1 # per_device
# evaluation_inputs
- evaluation_inputs = ['请描述一下这张图片','Please describe this picture']
+ evaluation_inputs = ['Please describe this picture','What is the equipment in the image?']
(三)开始Finetune
cd /root/tutorial/xtuner/llava/
xtuner train /root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py --deepspeed deepspeed_zero2
五、对比Finetune前后的性能差异
(一)Finetune前
即:加载 1.8B 和 Pretrain阶段产物(iter_2181) 到显存。
# 解决小bug
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
# pth转huggingface
xtuner convert pth_to_hf \
llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain \
/root/share/new_models/xtuner/iter_2181.pth \
/root/tutorial/xtuner/llava/llava_data/iter_2181_hf
# 启动!
xtuner chat /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b \
--visual-encoder /root/share/new_models/openai/clip-vit-large-patch14-336 \
--llava /root/tutorial/xtuner/llava/llava_data/iter_2181_hf \
--prompt-template internlm2_chat \
--image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg
Q1: Describe this image.
Q2: What is the equipment in the image?
(二)Finetune后
即:加载 1.8B 和 Fintune阶段产物 到显存。
# 解决小bug
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
# pth转huggingface
xtuner convert pth_to_hf \
/root/tutorial/xtuner/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py \
/root/tutorial/xtuner/llava/work_dirs/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy/iter_1200.pth \
/root/tutorial/xtuner/llava/llava_data/iter_1200_hf
# 启动!
xtuner chat /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b \
--visual-encoder /root/share/new_models/openai/clip-vit-large-patch14-336 \
--llava /root/tutorial/xtuner/llava/llava_data/iter_1200_hf \
--prompt-template internlm2_chat \
--image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg
Q1: Describe this image.
Q2: What is the equipment in the image?
Finetune前后效果对比:
Finetune前:只会打标题
Finetune后:会回答问题了