[InternLM Camp] InternVL Multimodal Model Deployment and Fine-Tuning in Practice

神奇的独角膏

Last modified: 2024-08-20 22:48:53

Tags: natural language processing

First published: 2024-08-20 22:45:41

Article link: https://blog.csdn.net/m0_52468897/article/details/141351671

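Challenge task: fine-tune the model with QLoRA (the "cold-joke master"), reproduce the fine-tuning result, and get the model to successfully tell a joke about a meme image.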

1. Introduction to InternVL

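InternVL is a deep learning model for multimodal tasks, designed to process and understand several types of input data, such as images and text. It combines a vision model with a language model and can perform complex cross-modal tasks such as image-text matching and image captioning. An overview of the model follows: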

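First, the input image goes through the dynamic high resolution module, which splits it into tiles. The tiles are then fed into the ViT module to extract visual features. Next, Pixel Shuffle is applied to the visual features, and an MLP Projector maps them into the feature space that the large language model can process. Finally, the text is encoded by the Tokenizer and fed into the large language model as well.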
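To give a feel for how many visual tokens this pipeline hands to the language model, here is a small back-of-the-envelope sketch. The 448-pixel tile size comes from this post; the ViT patch size of 14, the Pixel Shuffle downsample ratio of 0.5 (spatial resolution traded for channels) and the extra thumbnail tile are assumptions based on commonly reported InternVL2 settings, not values stated here.

```python
def visual_tokens_per_tile(tile_size=448, patch_size=14, downsample=0.5):
    """Tokens one tile contributes after ViT patching and Pixel Shuffle.

    patch_size and downsample are assumed values, not taken from this post.
    """
    patches_per_side = tile_size // patch_size   # 448 // 14 = 32
    vit_tokens = patches_per_side ** 2           # 32 * 32 = 1024
    # Downsampling each spatial side by `downsample` keeps downsample**2 of the tokens.
    return int(vit_tokens * downsample ** 2)


def visual_tokens(num_tiles, use_thumbnail=True):
    """Total visual tokens for an image split into `num_tiles` tiles.

    Assumes one extra thumbnail tile is appended when there is more than one tile.
    """
    extra = 1 if use_thumbnail and num_tiles > 1 else 0
    return (num_tiles + extra) * visual_tokens_per_tile()


print(visual_tokens_per_tile())  # tokens for a single 448x448 tile
print(visual_tokens(6))          # tokens for six tiles plus a thumbnail
```

Under these assumptions, the number of tiles chosen by the dynamic high resolution step directly controls how much of the context length the image takes up.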

1.1 Dynamic High Resolution

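In InternVL, the vision module is a fine-tuned ViT and the LLM module is an InternLM model. What makes the vision module special is Dynamic High Resolution, which lets the ViT capture finer image detail and so improves the expressiveness of the visual features. The full input image is first resized so that its sides are multiples of 448, and regions are then cropped from it according to a predefined set of aspect ratios. The details are shown in the figure: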
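The resize-and-crop step can be sketched roughly as follows. This is a simplified illustration, not the InternVL implementation: the function names pick_grid and tile_boxes, the max_tiles limit of 6, and the tie-breaking rule (prefer fewer tiles) are all assumptions made for the example.

```python
TILE = 448  # tile side length mentioned in the text above


def pick_grid(width, height, max_tiles=6):
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's.

    Ties are broken in favour of fewer tiles (an assumption of this sketch).
    """
    aspect = width / height
    grids = [(cols, rows)
             for cols in range(1, max_tiles + 1)
             for rows in range(1, max_tiles + 1)
             if cols * rows <= max_tiles]
    return min(grids, key=lambda g: (abs(aspect - g[0] / g[1]), g[0] * g[1]))


def tile_boxes(cols, rows, tile=TILE):
    """Crop boxes (left, top, right, bottom) in the resized image, row by row."""
    return [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            for r in range(rows) for c in range(cols)]


cols, rows = pick_grid(896, 448)             # a 2:1 landscape image
resized_size = (cols * TILE, rows * TILE)    # resize target, a multiple of 448
boxes = tile_boxes(cols, rows)               # one box per tile sent to the ViT
print(resized_size, boxes)
```

With an image library such as Pillow, each box could be passed to Image.crop after resizing the image to resized_size.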

1.2 Pixel Shuffle

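Pixel Shuffle is a common operation in super-resolution tasks, and PyTorch provides an official implementation, nn.PixelShuffle(upscale_factor). The class rearranges the elements of a tensor: for an input of shape [B, C, H, W], PixelShuffle changes both the number of channels and the spatial size of the feature map, giving an output of shape [B, C / r², rH, rW] for upscale factor r.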

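What Pixel Shuffle does is turn an H × W low-resolution input image into an rH × rW high-resolution image through a sub-pixel operation. It does not produce the high-resolution image directly by interpolation or similar means. Instead, a convolution first produces a feature map with $r^{2}$ channels per output channel (with the same spatial size as the low-resolution input), and periodic shuffling then rearranges it into the high-resolution image, where $r$ is the upscaling factor, i.e. the magnification of the image.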
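The rearrangement itself is easy to reproduce. The sketch below reimplements the periodic shuffling in NumPy so the index movement is visible; the function name pixel_shuffle is chosen for this example, and the channel ordering is meant to follow the convention of PyTorch's nn.PixelShuffle.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange [B, C*r*r, H, W] into [B, C, H*r, W*r] (periodic shuffling)."""
    b, c, h, w = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    out_c = c // (r * r)
    # Split the channel axis into (out_c, r, r): the two r-axes are sub-pixel offsets.
    x = x.reshape(b, out_c, r, r, h, w)
    # Interleave the offsets with the spatial axes: (b, out_c, h, r, w, r).
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(b, out_c, h * r, w * r)


# Four 2x2 channels become one 4x4 channel when r = 2.
low_res = np.arange(16).reshape(1, 4, 2, 2)
high_res = pixel_shuffle(low_res, 2)
print(high_res.shape)
print(high_res[0, 0])
```

Each 2 × 2 block of the output is filled from the four input channels at the same spatial position, which is why no interpolation is needed.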

2. InternVL Deployment and Fine-Tuning in Practice

The task chosen here is to get InternVL2-2B to tell a joke about a meme image, in line with the challenge task above. XTuner is used to fine-tune InternVL, and LMDeploy is used to deploy it.

Copy the InternVL2-2B model, which is already mounted under the share folder, into a working directory:

cd /root
mkdir -p model
cp -r /root/share/new_models/OpenGVLab/InternVL2-2B /root/model/

Set up the virtual environment and dependencies, then download and install XTuner:

mkdir -p /root/InternLM/code
cd /root/InternLM/code
git clone -b v0.1.23  https://github.com/InternLM/XTuner

cd /root/InternLM/code/XTuner
pip install -e '.[deepspeed]'

Install LMDeploy:

pip install lmdeploy==0.5.3

2.1 Preparing the Fine-Tuning Dataset

The dataset used is zhongshsh/CLoT-Oogiri-GO from Hugging Face, with special thanks to its authors.

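The dataset was downloaded from the official site, deduplicated, filtered to keep only the Chinese data, and converted into the format XTuner requires. The processed data is already in share; copy it out: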

pip install datasets matplotlib Pillow timm
mkdir -p /root/InternLM/datasets
cp -r /root/share/new_models/datasets/CLoT_cn_2000 /root/InternLM/datasets/

Let's look at one image from the dataset, here the one that belongs to the first record in the jsonl file. First copy the image into the InternLM folder:

cp /root/InternLM/datasets/CLoT_cn_2000/ex_images/007aPnLRgy1hb39z0im50j30ci0el0wm.jpg /root/InternLM/

The response given in the dataset is as follows:

2.2 Deploying InternVL for Inference Before Fine-Tuning

Use LMDeploy's built-in pipeline tool for out-of-the-box inference. First create the file /root/InternLM/code/test_lmdeploy.py:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('/root/model/InternVL2-2B')

image = load_image('/root/InternLM/007aPnLRgy1hb39z0im50j30ci0el0wm.jpg')
response = pipe(('请你根据这张图片，讲一个脑洞大开的梗', image))
print(response.text)

Running inference gives the following reply:

Inference shows that the 2B model cannot tell a good joke out of the box, so the next step is to fine-tune it.

2.3 Fine-Tuning InternVL and Running Inference

The dataset format is:

# For efficient training, make sure the data is in this format:
{
    "id": "000000033471",
    "image": ["coco/train2017/000000033471.jpg"], # for text-only samples, this field is None or absent
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      }
    ]
  }
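For anyone preparing their own data, a record in this shape can be built and sanity-checked with a few lines of Python. The helper names make_record and check_record are hypothetical, and the checks (alternating human/gpt turns, one <image> tag per listed image) are this sketch's own sanity rules rather than requirements documented by XTuner.

```python
import json


def make_record(sample_id, image_path, question, answer):
    """Build one single-image training record in the conversation format above."""
    return {
        "id": sample_id,
        "image": [image_path],
        "conversations": [
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }


def check_record(record):
    """Return a list of problems found in a record (an empty list means none)."""
    problems = []
    turns = record.get("conversations", [])
    if not turns:
        problems.append("no conversations")
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected {expected!r}")
    images = record.get("image") or []
    n_tags = sum(turn.get("value", "").count("<image>") for turn in turns)
    if n_tags != len(images):
        problems.append("number of <image> tags does not match number of images")
    return problems


record = make_record(
    "000000033471",
    "coco/train2017/000000033471.jpg",
    "What are the colors of the bus in the image?",
    "The bus in the image is white and red.",
)
print(json.dumps(record, ensure_ascii=False, indent=2))
print(check_record(record))
```

Running check_record over every entry before training is a cheap way to catch malformed samples early.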

The dataset that can be used directly for fine-tuning is in the data just copied into InternLM/datasets. Next, modify the InternVL config in XTuner, i.e. the file /root/InternLM/code/XTuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_qlora_finetune.py:

# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import AutoTokenizer

from xtuner.dataset import InternVL_V1_5_Dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import InternVL_V1_5
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
path = '/root/model/InternVL2-2B'

# Data
data_root = '/root/InternLM/datasets/CLoT_cn_2000/'
data_path = data_root + 'ex_cn.json'
image_folder = data_root
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 6656

# Scheduler & Optimizer
batch_size = 4  # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 6
optim_type = AdamW
# official 1024 -> 4e-5
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0.05
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 1000
save_total_limit = 1  # Maximum checkpoints to keep (-1 means unlimited)

#######################################################################
#            PART 2  Model & Tokenizer & Image Processor              #
#######################################################################
model = dict(
    type=InternVL_V1_5,
    model_path=path,
    freeze_llm=True,
    freeze_visual_encoder=True,
    quantization_llm=True,  # or False
    quantization_vit=False,  # or True and uncomment visual_encoder_lora
    # comment the following lines if you don't want to use Lora in llm
    llm_lora=dict(
        type=LoraConfig,
        r=128,
        lora_alpha=256,
        lora_dropout=0.05,
        target_modules=None,
        task_type='CAUSAL_LM'),
    # uncomment the following lines if you want to use Lora in visual encoder # noqa
    # visual_encoder_lora=dict(
    #     type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05,
    #     target_modules=['attn.qkv', 'attn.proj', 'mlp.fc1', 'mlp.fc2'])
)

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
llava_dataset = dict(
    type=InternVL_V1_5_Dataset,
    model_path=path,
    data_paths=data_path,
    image_folders=image_folder,
    template=prompt_template,
    max_length=max_length)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=llava_dataset,
    sampler=dict(
        type=LengthGroupedSampler,
        length_property='modality_length',
        per_device_batch_size=batch_size * accumulative_counts),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=path,
    trust_remote_code=True)

custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        save_optimizer=False,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
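As a rough check on what these settings imply, the arithmetic below estimates the length of the run. The sample count of 2000 is an assumption inferred from the dataset folder name CLoT_cn_2000, the helper name schedule_summary is made up for this sketch, and the warmup length is rounded here, which may differ slightly from how the training framework converts epochs to iterations.

```python
import math


def schedule_summary(num_samples, batch_size, accumulative_counts,
                     max_epochs, warmup_ratio, num_gpus=1):
    """Estimate iteration counts implied by the config values above."""
    iters_per_epoch = math.ceil(num_samples / (batch_size * num_gpus))
    total_iters = iters_per_epoch * max_epochs
    return {
        "iters_per_epoch": iters_per_epoch,
        "total_iters": total_iters,
        # Samples contributing to one optimizer update.
        "effective_batch_size": batch_size * accumulative_counts * num_gpus,
        # Gradients are accumulated over `accumulative_counts` iterations.
        "optimizer_updates": math.ceil(total_iters / accumulative_counts),
        # Linear warmup runs for warmup_ratio * max_epochs epochs.
        "warmup_iters": round(warmup_ratio * max_epochs * iters_per_epoch),
    }


# num_samples=2000 is an assumption; the other values come from the config.
summary = schedule_summary(num_samples=2000, batch_size=4,
                           accumulative_counts=4, max_epochs=6,
                           warmup_ratio=0.03)
print(summary)
```

Under that assumption a single-GPU run lasts 3000 iterations, which matches the iter_3000.pth checkpoint used in the weight-merging step, and with save_steps = 1000 and save_total_limit = 1 only that last checkpoint is kept.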

Train with the config above:

cd /root/InternLM/code/XTuner

NPROC_PER_NODE=1 xtuner train /root/InternLM/code/XTuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_qlora_finetune.py  --work-dir /root/InternLM/work_dir/internvl_ft_run_8_filter  --deepspeed deepspeed_zero1

Merge the weights with the official script:

cd /root/InternLM/code/XTuner
# transfer weights
python3 xtuner/configs/internvl/v1_5/convert_to_official.py xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_qlora_finetune.py /root/InternLM/work_dir/internvl_ft_run_8_filter/iter_3000.pth /root/InternLM/InternVL2-2B/

The final model is saved in /root/InternLM/InternVL2-2B/, with the following file layout:

Replace the contents of test_lmdeploy.py with the following code, which points to the new model path, and run it to see the effect:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('/root/InternLM/InternVL2-2B')

image = load_image('/root/InternLM/007aPnLRgy1hb39z0im50j30ci0el0wm.jpg')
response = pipe(('请你根据这张图片，讲一个脑洞大开的梗', image))
print(response.text)

cd /root/InternLM/code

python3 test_lmdeploy.py

As you can see, the result after fine-tuning is quite good!
