InternVL 微调

最新推荐文章于 2025-04-17 15:11:43 发布

chasemydreamidea

最新推荐文章于 2025-04-17 15:11:43 发布

阅读量891

点赞数 23

文章标签： chatgpt AIGC

本文链接：https://blog.csdn.net/chasemydreamidea/article/details/142633222

版权

一：概述

InternVL是一个开源的多模态视觉语言模型系列，它在视觉与语言交叉领域展现出了强大的能力和广泛的应用前景。以下是对InternVL的详细介绍：

1. 背景与定位

InternVL被定位为GPT-4V的开创性开源替代品，旨在通过开源套件缩小与商业多模态模型的差距。它结合了视觉和语言模型，能够同时处理图片和文本信息，执行复杂的跨模态任务。

2. 规模与性能

参数规模：InternVL-Chat-V1.5模型参数达到34B，其中核心视觉模型InternViT的参数为6B，较ViT-22B有明显扩大。
性能表现：InternVL在很多视觉语言评估任务上表现出色，如MMMU、DocVQA等成绩接近商业模型GPT-4V，语义分割mIoU也高于GPT-4V。

3. 多语言支持

InternVL不仅支持英语，还支持中文等其他语言，在多语言零画识别、翻译等任务上表现出色。截至当前时间，InternVL已经能够原生支持全球110多种语言的多语言生成。

模型特点

1. 可拓展性强

InternVL提供分级模型，包括2B参数的mini版本InternVL-Chat，也具备很强的功能。此外，还提供8位整型版本进行高效推理。

2. 开放性

InternVL采用MIT许可，所有模型、代码和数据都开源在GitHub上（GitHub地址：OpenGVLab/InternVL），方便开发者参考和应用。

3. 独特性

动态高分辨率：InternVL的视觉模块采用了动态高分辨率的预处理方式，以提高视觉特征的表达能力。
对齐策略：视觉模型和语言模型通过对齐策略结合在一起，确保在处理相同或相似的任务时表现一致。
像素随机播放：Pixel Shuffle在超分任务中是一个常见的操作，PyTorch中有官方实现，即nn.PixelShuffle（upscale_factor）该类的作用就是将一个tensor中的元素值进行重排列，假设tensor维度为[B， C， H， W]， PixelShuffle操作不仅可以改变tensor的通道数，也会改变特征图的大小。

InterLM的整体架构：

二：InterLM部署实践

我们选定的任务是让InternVL-2B生成文生图提示词，这个任务需要VLM对图片有格式化的描述并输出。

让我们来一起完成一个用VLM模型进行冷笑话生成，让你的模型说出很逗的冷笑话吧。在这里，我们微调InterenVL使用xtuner。部署InternVL使用lmdeploy。

<1> 准备InternVL模型

我们使用InternVL2-2B模型。该模型已在share文件夹下挂载好，现在让我们把移动出来。

cd /root
mkdir -p model

# cp 模型

cp -r /root/share/new_models/OpenGVLab/InternVL2-2B /root/model/

<2>准备环境

        这里我们来手动配置下xtuner。

配置虚拟环境

conda create --name xtuner python=3.10 -y

# 激活虚拟环境（注意：后续的所有操作都需要在这个虚拟环境中进行）
conda activate xtuner

# 安装一些必要的库
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# 安装其他依赖
apt install libaio-dev
pip install transformers==4.39.3
pip install streamlit==1.36.0

安装 xTuner

# 创建一个目录，用来存放源代码
mkdir -p /root/InternLM/code

cd /root/InternLM/code

git clone -b v0.1.23  https://github.com/InternLM/XTuner

进入XTuner目录

cd /root/InternLM/code/XTuner
pip install -e '.[deepspeed]'

安装 LMDeploy

pip install lmdeploy==0.5.3

安装验证

xtuner version

##命令

xtuner help

之前的版本是0.1.21

如果版本不对，就删了对应的包。重新下载，前面的任务中，也用到了XTuner，下载的时候如果没有删除原来的，会显示已存在。然后版本不匹配，所以需要删除后，重新下载。

<3>准备微调数据集

我们这里使用huggingface上的zhongshsh/CLoT-Oogiri-GO据集.

@misc{zhong2023clot,
  title={Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation},
  author={Zhong, Shanshan and Huang, Zhongzhan and Gao, Shanghua and Wen, Weushao and Lin, Liang and Zitnik, Marinka and Zhou, Pan},
  journal={arXiv preprint arXiv:2312.02439},
  year={2023}
}

数据集我们只保留中文数据等操作。并制作成XTuner需要的形式。并已在share里，我们一起从share里挪出数据集。

## 首先让我们安装一下需要的包
pip install datasets matplotlib Pillow timm

## 让我们把数据集挪出来
cp -r /root/share/new_models/datasets/CLoT_cn_2000 /root/InternLM/datasets/

让我们打开数据集的一张图看看，我们选择 jsonl里的第一条数据对应的图片。首先，我们先把这张图片挪动到InternLM文件夹下面。

cp InternLM/datasets/ex_images/MjxjVcrFf9TFLbr2BKR4Py1L5qAic8K4VzEQAsTph0ztWe9vj3d8DGDdAC3tJV0aiaOrSBcsKpBIXIAh6O1CDXcA.jpg InternLM/

在执行这个一步骤时，一定要注意路径问题，我在这块卡了好久😭

<4>InternVL 推理部署

使用pipeline进行推理

之后我们使用lmdeploy自带的pipeline工具进行开箱即用的推理流程，首先我们新建一个文件。

touch /root/InternLM/code/test_lmdeploy.py
cd /root/InternLM/code/

然后把以下代码拷贝进test_lmdeploy.py中。

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('/root/model/InternVL2-2B')

image = load_image('/root/InternLM/007aPnLRgy1hb39z0im50j30ci0el0wm.jpg')
response = pipe(('请你根据这张图片，讲一个脑洞大开的梗', image))
print(response.text)

运行执行推理结果。

python3 test_lmdeploy.py

推理后

推理出来有什么文字是纯随机的，并不一定和展示结果完全一致哦～

推理后我们发现直接使用2b模型不能很好的讲出梗，现在我们要对这个2b模型进行微调。

运行结果如下所示：

<5> InternVL 微调

准备数据集

数据集格式为：



# 为了高效训练，请确保数据格式为：
{
    "id": "000000033471",
    "image": ["coco/train2017/000000033471.jpg"], # 如果是纯文本，则该字段为 None 或者不存在
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      }
    ]
  }

配置微调参数

让我们一起修改XTuner下 InternVL的config，文件在： /根/InternLM/代码/XTuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_qlora_finetune.py

# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import AutoTokenizer

from xtuner.dataset import InternVL_V1_5_Dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import InternVL_V1_5
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
path = '/root/model/InternVL2-2B'

# Data
data_root = '/root/InternLM/datasets/CLoT_cn_2000/'
data_path = data_root + 'ex_cn.json'
image_folder = data_root
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 6656

# Scheduler & Optimizer
batch_size = 4  # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 6
optim_type = AdamW
# official 1024 -> 4e-5
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0.05
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 1000
save_total_limit = 1  # Maximum checkpoints to keep (-1 means unlimited)

#######################################################################
#            PART 2  Model & Tokenizer & Image Processor              #
#######################################################################
model = dict(
    type=InternVL_V1_5,
    model_path=path,
    freeze_llm=True,
    freeze_visual_encoder=True,
    quantization_llm=True,  # or False
    quantization_vit=False,  # or True and uncomment visual_encoder_lora
    # comment the following lines if you don't want to use Lora in llm
    llm_lora=dict(
        type=LoraConfig,
        r=128,
        lora_alpha=256,
        lora_dropout=0.05,
        target_modules=None,
        task_type='CAUSAL_LM'),
    # uncomment the following lines if you don't want to use Lora in visual encoder # noqa
    # visual_encoder_lora=dict(
    #     type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05,
    #     target_modules=['attn.qkv', 'attn.proj', 'mlp.fc1', 'mlp.fc2'])
)

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
llava_dataset = dict(
    type=InternVL_V1_5_Dataset,
    model_path=path,
    data_paths=data_path,
    image_folders=image_folder,
    template=prompt_template,
    max_length=max_length)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=llava_dataset,
    sampler=dict(
        type=LengthGroupedSampler,
        length_property='modality_length',
        per_device_batch_size=batch_size * accumulative_counts),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=path,
    trust_remote_code=True)

custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        save_optimizer=False,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed evrionment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)

开始训练

这里使用之前搞好的configs进行训练。咱们要调整一下batch size，并且使用qlora。要不半卡不够用的 QAQ。

注意：在训练前需要将这个机器的配置修改为50%。否则会出现溢出的现象，进而导致训练失败。

cd XTuner

NPROC_PER_NODE=1 xtuner train /root/InternLM/code/XTuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_qlora_finetune.py  --work-dir /root/InternLM/work_dir/internvl_ft_run_8_filter  --deepspeed deepspeed_zero1

经过漫长的等待最终训练好了。

合并权重&&模型转换

用官方脚本进行权重合并

如果这里你执行的epoch不是6，是小一些的数字。你可能会发现internvl_ft_run_8_filter下没有iter_3000.pth，那你需要把iter_3000.pth切换成你internvl_ft_run_8_filter目录下的pth即可。

cd XTuner
# transfer weights
python3 xtuner/configs/internvl/v1_5/convert_to_official.py xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_qlora_finetune.py /root/InternLM/work_dir/internvl_ft_run_8_filter/iter_3000.pth /root/InternLM/InternVL2-2B/

最后我们的模型在：/root/InternLM/convert_model/，文件格式：

.
|-- added_tokens.json
|-- config.json
|-- configuration_intern_vit.py
|-- configuration_internlm2.py
|-- configuration_internvl_chat.py
|-- conversation.py
|-- generation_config.json
|-- model.safetensors
|-- modeling_intern_vit.py
|-- modeling_internlm2.py
|-- modeling_internvl_chat.py
|-- special_tokens_map.json
|-- tokenization_internlm2.py
|-- tokenizer.model
`-- tokenizer_config.json

微调后效果对比

现在我们微调好啦，让我们再来试试这张图片吧！

我们把下面的代码替换进test_lmdeploy.py中，然后跑一下效果。

再换一张图片，看下效果：
接下来将数据集的一张图片同样复制到InternLM中，对它进行询问。

注意：如果你想要用相对路径，就得退到根目录中，然后复制这个相对路径进行复制。

cp /root/InternLM/datasets/ex_images/007aPnLRgy1gm64tivj5kj30ci0elqbm.jpg /root/InternLM/

接着我们更换这个图片的路径，最后保存文件重新运行一下就可。

python3 test_lmdeploy.py

再来一张吧！！！，过程一样就不说了。就看看最终的效果即可。

到此为止，InterVL微调结束。