📝 Introduction
In object detection, many neural-network models have long delivered excellent performance for accurate detection and segmentation. With the rise of multimodal models, however, their remarkable image-analysis abilities open new opportunities for the field: they not only understand image content in depth but can also express that understanding as text, which greatly broadens their range of applications. This tutorial therefore walks through how to fine-tune a mainstream multimodal large model for an object-detection task. Using Qwen2.5-VL as the example, its strong multimodal analysis capability means we do not need to pretrain a new model from scratch on massive data; fine-tuning alone is enough to obtain effective object detection, offering a fresh approach for the field.
📚 Links and Resources
Author: Li Xinyu (李馨雨), researcher at the Emotion Machine Lab. Email: wind.340171@gmail.com
Model: Qwen2.5-VL-3B-Instruct: Hugging Face | ModelScope (the download commands below use ModelScope, which is more convenient)
Dataset: TextVQA_GroundingTask_bbox: Hugging Face | ModelScope
Code: GitHub
Visualization tool: SwanLab project page with the training-metric curves
Related links:
SwanLab official documentation, to help you get started with deep learning.
💻 Training Task Setup
1. Overview of Fine-Tuning Methods
- Partial-parameter fine-tuning: a targeted adaptation strategy that updates only a subset of the pretrained model's parameters while keeping the rest frozen. (A minimal sketch of this approach follows this list.) Its advantages are:
  - The compute cost is relatively low because not all parameters are optimized, which makes it feasible with limited resources, e.g. on a single GPU or in a memory-constrained environment.
  - It reduces the risk of overfitting, since restricting which parameters can change keeps the model from fitting the training data too closely.
  - The drawback is that it may not exploit the full potential of the pretrained model, because only part of the parameters are optimized; on some complex tasks performance can fall short of full-parameter fine-tuning.
- Full-parameter fine-tuning: a direct and intuitive approach that updates all of the model's parameters during fine-tuning.
  - Its advantage is that it makes full use of the pretrained knowledge and adapts it precisely to the target task, often achieving the best performance.
  - The drawback is the high compute cost, especially for models with a huge number of parameters: full-parameter fine-tuning needs a lot of GPU memory and compute, which can become a bottleneck for multi-model deployment and real-time applications.
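As a concrete illustration of partial-parameter fine-tuning (a minimal sketch only; this tutorial itself uses full-parameter fine-tuning), one common choice is to freeze the vision encoder and train the rest of the model. The "visual." prefix comes from the module tree printed in the next section:
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto"
)
for name, param in model.named_parameters():
    # Freeze every parameter of the vision encoder (the "visual." prefix in the
    # printed module tree); the language model and lm_head stay trainable.
    param.requires_grad = not name.startswith("visual.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")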
2. Model Overview
- Qwen2.5-VL technical report: [Qwen2.5-VL Technical Report]
- Code: [Qwen2.5-VL]
A multimodal model mainly consists of three parts: a **Vision Encoder**, a **Language Model (LM)**, and a **multimodal fusion module (Connector)**. Like Qwen2-VL, Qwen2.5-VL does not use a heavyweight Connector: a single MLP projects the visual features into the LM's embedding space. The code below prints the model structure:
MODEL_PATH = '/data/nvme1/weights/Qwen2_5-VL-3B-Instruct'
from transformers import Qwen2_5_VLForConditionalGeneration
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_PATH, torch_dtype="auto", device_map="auto"
)
print(model)
The output is as follows:
Qwen2_5_VLForConditionalGeneration(
(visual): Qwen2_5_VisionTransformerPretrainedModel(
(patch_embed): Qwen2_5_VisionPatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): Qwen2_5_VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2_5_VLVisionBlock(
(norm1): Qwen2RMSNorm((1280,), eps=1e-06)
(norm2): Qwen2RMSNorm((1280,), eps=1e-06)
(attn): Qwen2_5_VLVisionSdpaAttention(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): Qwen2_5_VLMLP(
(gate_proj): Linear(in_features=1280, out_features=3420, bias=True)
(up_proj): Linear(in_features=1280, out_features=3420, bias=True)
(down_proj): Linear(in_features=3420, out_features=1280, bias=True)
(act_fn): SiLU()
)
)
)
(merger): Qwen2_5_VLPatchMerger(
(ln_q): Qwen2RMSNorm((1280,), eps=1e-06)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=2048, bias=True)
)
)
)
(model): Qwen2_5_VLModel(
(embed_tokens): Embedding(151936, 2048)
(layers): ModuleList(
(0-35): 36 x Qwen2_5_VLDecoderLayer(
(self_attn): Qwen2_5_VLSdpaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=True)
(k_proj): Linear(in_features=2048, out_features=256, bias=True)
(v_proj): Linear(in_features=2048, out_features=256, bias=True)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): Qwen2_5_VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
(up_proj): Linear(in_features=2048, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((2048,), eps=1e-06)
(rotary_emb): Qwen2_5_VLRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)
Qwen2.5-VL-3B-Instruct is built on the Qwen2.5 architecture, has roughly 3 billion parameters, and is designed for instruction tuning. During pretraining it learns general language and visual knowledge from massive text and image data, so it can understand and generate natural-language text while processing the images that go with it. Through instruction tuning it is optimized for instruction-following tasks such as question answering, text generation, and image description. It performs well on multimodal tasks, combining image content with textual semantics to produce accurate and coherent answers, and it shows a degree of reasoning ability and creativity when handling more complex tasks.
Download command:
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct --local_dir /data/nvme1/weights/Qwen/Qwen2.5-VL-3B-Instruct
3. Dataset Overview
TextVQA_GT_bbox is a visual question answering (VQA) dataset on Hugging Face. It focuses on text-centric VQA, is derived from TextVQA, and additionally provides ground-truth bounding boxes. Each sample contains an image, a question about the image, and the corresponding answers; the bounding box tells the model exactly where the relevant text is in the image, which helps it answer more accurately. The dataset keeps only the single-object question-answer pairs from TextVQA, retaining 4,370 of the 5,000 samples.
The goal of this tutorial is to fine-tune Qwen2.5-VL-3B-Instruct using the questions together with the ground-truth bounding boxes; a concrete sample is shown in the Dataset Preparation section below.
The paper "MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs" uses this dataset to study the attention patterns of MLLMs.
Download command:
modelscope download --dataset Tina12345/textVQA_groundingtask_bbox --local_dir /data/nvme0/textvqa_bbox
4. Training Framework
Hugging Face Transformers is an open-source Python library that is widely used for natural language processing (NLP) tasks. It provides a large collection of pretrained language models (such as BERT, GPT, T5, RoBERTa, and DistilBERT) and supports fine-tuning and deployment with both PyTorch and TensorFlow.
The core strength of Transformers is its unified and concise API, which lets researchers and developers quickly implement tasks such as text classification, named entity recognition, question answering, and text generation. It also integrates with the Hugging Face Model Hub, a platform hosting tens of thousands of community-contributed models, so users can load existing models directly or upload their own for sharing and reuse.
On the performance side, Transformers supports mixed-precision training, distributed training, and ONNX export, covering the whole pipeline from research prototypes to production deployment. Together with companion libraries such as Datasets, Tokenizers, and Accelerate, Hugging Face forms a complete development ecosystem that greatly speeds up model development and experiment iteration.
Reference: https://huggingface.co/docs/transformers/index
📜 Dataset Preparation
First, the dataset can be downloaded directly from Hugging Face:
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("jrzhang/TextVQA_GT_bbox")
If Hugging Face is not accessible, you can also download the dataset locally from ModelScope, either with the code below or from the command line.
Python download:
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('Tina12345/textVQA_groundingtask_bbox', subset_name='default', split='train', cache_dir="./data")
Command-line download:
modelscope download --dataset Tina12345/textVQA_groundingtask_bbox --local_dir /data/nvme0/textvqa_bbox
⚠️ Note:
When you download the dataset from ModelScope you will run into FAQ item 1, "the dataset downloaded from ModelScope cannot be loaded"; the answer is there 👉 1. The dataset downloaded from ModelScope cannot be loaded
After downloading, reshape the dataset slightly and save it as a JSONL file for the later training steps.
The original dataset format is:
{
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1024x681 at 0x7FA58E1DB340>,
'question': 'what is the name of the company on the card?',
'answer': ['blink', 'intergrative nutrition', 'blink', 'blink', 'blink', 'blink', 'blink', 'blink', 'blink', 'blink'],
'dataset_id': '36269',
'bbox': [712.0, 255.0, 64.0, 43.0]
}
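A record like the one above can be inspected with a quick script (a sketch, assuming the dataset was downloaded to /data/nvme0/textvqa_bbox as in the commands above; if loading fails, see FAQ item 1 at the end of this post):
from datasets import load_dataset

ds = load_dataset("/data/nvme0/textvqa_bbox", split="train")
sample = ds[0]
print(sample["question"], sample["answer"], sample["bbox"])
sample["image"].save("sample.jpg")  # look at the image the bbox refers to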
We only need the image, question, and bbox fields: question becomes the user-side query and bbox becomes the assistant-side response. I follow swift's query-response data format.
Each sample is saved in the following format:
{"image": ["./data/test/003001.jpg"], "query": "what is written on the ghost?", "response": "{\"bbox_2d\": [467.0, 628.0, 54.0, 33.0]}"}
Note that Qwen provides its own template for grounding fine-tuning tasks (see 👉 Qwen2.5-vl-finetune), so the {"bbox_2d": [467.0, 628.0, 54.0, 33.0]} response above follows the official Grounding Example.
The conversion script is saved as scripts/convert2sft_format.py:
import json
import os
from tqdm import tqdm
from datasets import load_dataset
def convert_to_sft_format(data_path, save_path, type='train'):
    # Load the dataset
    dataset = load_dataset(data_path, split='train')
    # Each record goes into a JSONL file; the images are saved separately
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    # Make sure the image sub-folder (./data/train or ./data/test) exists
    os.makedirs(os.path.join(save_path, type), exist_ok=True)
    # Create the JSONL file
    jsonl_file = os.path.join(save_path, f"{type}.jsonl")
    with open(jsonl_file, 'w', encoding='utf-8') as jsonl_out:
        # Iterate over the dataset, save the images, and write the other fields to the JSONL file
        for idx, sample in tqdm(enumerate(dataset), total=len(dataset)):
            if type == 'train':
                if idx >= 3000:  # keep the first 3000 samples for training
                    break
            elif type == 'test':
                # keep samples 3001 to 3100 for testing
                if idx < 3000 or idx >= 3100:
                    continue
            # Save the image
            image = sample['image']
            # Build the file name (000001.jpg, 000002.jpg, ...), zero-padded to 6 digits
            filename = f"{idx + 1:06d}.jpg"
            jpg_path = os.path.join(save_path, type)
            output_path = os.path.join(jpg_path, filename)
            # Convert to RGB so the image can be written as JPEG
            image.convert("RGB").save(output_path)
            # Save the remaining fields
            # Bounding-box information
            bbox = sample['bbox']
            bbox_dict = {"bbox_2d": bbox}
            formatted_json = json.dumps(bbox_dict, indent=None)
            data = {
                "image": [output_path],
                "query": sample['question'],
                "response": formatted_json,
            }
            # Write the record to the JSONL file
            jsonl_out.write(json.dumps(data, ensure_ascii=False) + '\n')
    print(f"All images and data have been saved to {save_path} and {jsonl_file}")
# Example call
convert_to_sft_format(data_path='/data/nvme0/textvqa_bbox', save_path='./data', type='test')
The images are saved to data/train (or data/test for the test split), and train.jsonl / test.jsonl are saved in the data folder.
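As an optional sanity check before training (a sketch), the snippet below reads one converted record back, draws its box, and saves the result. It assumes the [x1, y1, x2, y2] interpretation used in the training prompt and falls back to [x, y, w, h] if the values do not look like corner coordinates:
import json
from PIL import Image, ImageDraw

with open("./data/train.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

image = Image.open(record["image"][0]).convert("RGB")
bbox = json.loads(record["response"])["bbox_2d"]
# Interpret as [x1, y1, x2, y2] when the last two values are larger than the first two,
# otherwise treat them as width/height.
x1, y1, a, b = bbox
xyxy = [x1, y1, a, b] if a > x1 and b > y1 else [x1, y1, x1 + a, y1 + b]
ImageDraw.Draw(image).rectangle(xyxy, outline="red", width=3)
image.save("bbox_check.jpg")
print(record["query"], bbox)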
🚀 Fine-Tuning Code
1. Environment Setup
- Hardware overview:
**GPU:** 9 × NVIDIA H20 96GB
**CPU:** AMD EPYC 9K84 96-Core Processor
**OS:** TencentOS Server 3.1 (Final)
**Python version:** 3.10.17
- Python training environment:
pip install modelscope qwen_vl_utils transformers peft diffusers torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 swanlab deepspeed
In practice RTX 3090s also work (8 × 3090 24GB), although the parameters below then need to be adjusted.
2. Data Preprocessing
This step is the core of fine-tuning a large model and the place where bugs most often appear. Two things need to be handled correctly, both of them arguments of Trainer; once they are in place, the rest is straightforward.
- train_dataset: the training data, in the Dataset format that Hugging Face can read
- data_collator: combines samples into batches and preprocesses them so they reach the model in the right format, including padding, tensor conversion, and truncation
The code lives in vision_datacollator.py; the details are explained below.
1. train_dataset
The key is simply matching the expected format. transformers is Hugging Face's open-source library for training and running large models, and for the data to be loaded and processed correctly it must be a Dataset object, the class provided by the Hugging Face datasets library for representing and manipulating datasets.
DatasetDict({
train: Dataset({
features: ['image', 'query', 'response'],
num_rows: 3000
})
test: Dataset({
features: ['image', 'query', 'response'],
num_rows: 100
})
})
The corresponding code:
################
# Dataset
################
# 1. Load the saved JSONL files; datasets.load_dataset returns the Dataset format that HF expects
train_dataset = datasets.load_dataset("json", data_files=data_args.train_dataset_name)
test_dataset = datasets.load_dataset("json", data_files=data_args.test_dataset_name)
# 2. Build a DatasetDict; this only exists to keep train and test together so the test data is easy to read later
raw_dataset = datasets.DatasetDict({
"train": train_dataset["train"],
"test": test_dataset["train"]
})
print(raw_dataset)
# 3. Normalize the column names for the batch processing later on
def preporocess_textvqa(example):
return {
"image": example["image"],
"user": example["query"],
"assistant": example["response"],
}
raw_dataset = raw_dataset.map(
preporocess_textvqa,
remove_columns=raw_dataset["train"].column_names,
desc="Preprocessing textvqa dataset",
)
# 4. Pass the datasets to Trainer
train_dataset=raw_dataset["train"],
eval_dataset=(
raw_dataset["test"] if training_args.eval_strategy != "no" else None
),
2. data_collator
Because this tutorial rescales coordinates, we need to write the data_collator ourselves and reshape the data for training.
- Resize the images
The images in the original dataset come in different, usually large, sizes, while multimodal models typically expect bounded image sizes during training (e.g. 256×256 or 512×512). The images therefore need to be normalized to a common maximum side length:
# Resize the image; for the grounding task the bbox coordinates must be rescaled with the same factor
def resize_with_max_side(image, max_side_length):
    # Get the original size
    width, height = image.size
    # Compute the scale factor so the longer side equals max_side_length
    scale = min(max_side_length / width, max_side_length / height)
    # Compute the new size
    new_width = int(width * scale)
    new_height = int(height * scale)
    # Resize the image
    resized_image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
    return resized_image, scale
- Rescale the coordinates
Since the image was resized, the bbox coordinates must be mapped to the corresponding positions in the resized image:
def resize_bbox(bbox, scale):
    # Rescale the bounding-box coordinates
    return [int(coord * scale) for coord in bbox]
- Build the input_ids
Building on the two steps above, we first extract the normalized image, question, and answer:
question = example["user"]
answer = example["assistant"]
# 需要读取图像,需要确保是RGB图像
image_path = example['image'][0]
image = Image.open(image_path)
# 输出缩放后的图像以及缩放倍率
image, scale = resize_with_max_side(
image, max_side_length=self.max_img_side_length
)
# 缩放answer的坐标值
# answer是一个json字符串,解析成字典
answer = json.loads(answer)
answer = {"bbox_2d": resize_bbox(answer["bbox_2d"],scale)}
# 转化新的answer
answer = json.dumps(answer, indent=None)
The image, question, and answer are then turned into tokens by the model's processor:
prompt = "Please enclose the corresponding positions using coordinate boxes. Examples of coordinate value formats: [x1,y1,x2,y2]"
question = '<image>\n'+ question+prompt
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": question},
],
}
]
prompt = self.processor.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
answer = f"{answer}<|im_end|>\n"
input_ids = self.processor(
images=[image],
text=prompt + answer,
return_tensors="pt",
max_length=self.max_seq_length,
truncation=False,
padding=False,
)
answer_ids = self.processor.tokenizer(
answer, add_special_tokens=False, return_tensors="pt"
)
ignore_ids_len = len(input_ids["input_ids"][0]) - len(
answer_ids["input_ids"][0]
)
input_ids["labels"] = torch.cat(
[
torch.tensor([IGNORE_INDEX] * ignore_ids_len).unsqueeze(0),
answer_ids["input_ids"],
],
dim=1,
)
Add position_ids:
position_ids, _ = self.get_rope_index_2(
self.processor.image_processor.merge_size,
input_ids["input_ids"],
input_ids["image_grid_thw"],
)
input_ids["position_ids"] = position_ids
Pad up to max_seq_length:
# padding
if len(input_ids["labels"]) < self.max_seq_length:
input_ids["input_ids"] = torch.cat(
[
input_ids["input_ids"],
torch.tensor(
[self.processor.tokenizer.pad_token_id]
* (self.max_seq_length - len(input_ids["input_ids"]))
).unsqueeze(0),
],
dim=1,
)
input_ids["labels"] = torch.cat(
[
input_ids["labels"],
torch.tensor(
[IGNORE_INDEX]
* (self.max_seq_length - len(input_ids["labels"]))
).unsqueeze(0),
],
dim=1,
)
input_ids["attention_mask"] = input_ids["input_ids"].ne(
self.processor.tokenizer.pad_token_id
)
# padding position_ids
pad_length = self.max_seq_length - input_ids["position_ids"].shape[2]
input_ids["position_ids"] = torch.nn.functional.pad(
input_ids["position_ids"], (0, pad_length), "constant", 1
)
Truncate anything longer than max_seq_length:
# truncate
if len(input_ids["input_ids"][0]) > self.max_seq_length:
input_ids["input_ids"] = input_ids["input_ids"][
:, : self.max_seq_length
]
input_ids["labels"] = input_ids["labels"][:, : self.max_seq_length]
input_ids["attention_mask"] = input_ids["attention_mask"][
:, : self.max_seq_length
]
input_ids["position_ids"] = input_ids["position_ids"][
:, : self.max_seq_length
]
Finally, gather everything into the batched input_ids:
batch_input_ids = {
"input_ids": torch.cat(
[input_ids["input_ids"] for input_ids in batch_input_ids], dim=0
),
"attention_mask": torch.cat(
[input_ids["attention_mask"] for input_ids in batch_input_ids], dim=0
),
"labels": torch.cat(
[input_ids["labels"] for input_ids in batch_input_ids], dim=0
),
"pixel_values": torch.cat(
[input_ids["pixel_values"] for input_ids in batch_input_ids], dim=0
),
"image_grid_thw": torch.cat(
[input_ids["image_grid_thw"] for input_ids in batch_input_ids], dim=0
),
"position_ids": torch.cat(
[input_ids["position_ids"] for input_ids in batch_input_ids], dim=1
),
}
return batch_input_ids
The complete code for this part:
from typing import Optional, Tuple
import copy
import transformers
import torch
from PIL import Image
import json
IGNORE_INDEX = -100
# Resize the image; for the grounding task the bbox coordinates must be rescaled with the same factor
def resize_with_max_side(image, max_side_length):
    # Get the original size
    width, height = image.size
    # Compute the scale factor so the longer side equals max_side_length
    scale = min(max_side_length / width, max_side_length / height)
    # Compute the new size
    new_width = int(width * scale)
    new_height = int(height * scale)
    # Resize the image
    resized_image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
    return resized_image, scale
def resize_bbox(bbox, scale):
    # Rescale the bounding-box coordinates
    return [int(coord * scale) for coord in bbox]
class Qwen2_5VLCollator:
def __init__(
self, processor, max_seq_length=1024, max_img_side_length=1024, **kwargs
):
self.processor = processor
# to fix bug in Qwen2.5VL
self.processor.tokenizer.chat_template = "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
self.max_seq_length = max_seq_length
self.max_img_side_length = max_img_side_length
def __call__(self, examples):
batch_input_ids = []
for example in examples:
            # The expected record format is shown below:
"""
{"image": ["./data/train/000001.jpg"], "query": "what is the name of the company on the card?", "response": "{\n \"bbox_2d\": [\n 712.0,\n 255.0,\n 64.0,\n 43.0\n ]\n}"}
"""
question = example["user"]
answer = example["assistant"]
            # Read the image and make sure it is RGB
            image_path = example['image'][0]
            image = Image.open(image_path).convert("RGB")
            # Get the resized image and the scale factor
image, scale = resize_with_max_side(
image, max_side_length=self.max_img_side_length
)
            # Rescale the coordinates in the answer
            # answer is a JSON string; parse it into a dict
answer = json.loads(answer)
answer = {"bbox_2d": resize_bbox(answer["bbox_2d"],scale)}
            # Serialize the new answer back to JSON
answer = json.dumps(answer, indent=None)
            # Not sure whether this extra prompt is strictly necessary; it is appended to every question here
prompt = "Please enclose the corresponding positions using coordinate boxes. Examples of coordinate value formats: [x1,y1,x2,y2]"
question = '<image>\n'+ question+prompt
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": question},
],
}
]
prompt = self.processor.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
answer = f"{answer}<|im_end|>\n"
input_ids = self.processor(
images=[image],
text=prompt + answer,
return_tensors="pt",
max_length=self.max_seq_length,
truncation=False,
padding=False,
)
answer_ids = self.processor.tokenizer(
answer, add_special_tokens=False, return_tensors="pt"
)
ignore_ids_len = len(input_ids["input_ids"][0]) - len(
answer_ids["input_ids"][0]
)
input_ids["labels"] = torch.cat(
[
torch.tensor([IGNORE_INDEX] * ignore_ids_len).unsqueeze(0),
answer_ids["input_ids"],
],
dim=1,
)
# position_ids
position_ids, _ = self.get_rope_index_2(
self.processor.image_processor.merge_size,
input_ids["input_ids"],
input_ids["image_grid_thw"],
)
input_ids["position_ids"] = position_ids
# padding
if len(input_ids["labels"]) < self.max_seq_length:
input_ids["input_ids"] = torch.cat(
[
input_ids["input_ids"],
torch.tensor(
[self.processor.tokenizer.pad_token_id]
* (self.max_seq_length - len(input_ids["input_ids"]))
).unsqueeze(0),
],
dim=1,
)
input_ids["labels"] = torch.cat(
[
input_ids["labels"],
torch.tensor(
[IGNORE_INDEX]
* (self.max_seq_length - len(input_ids["labels"]))
).unsqueeze(0),
],
dim=1,
)
input_ids["attention_mask"] = input_ids["input_ids"].ne(
self.processor.tokenizer.pad_token_id
)
# padding position_ids
pad_length = self.max_seq_length - input_ids["position_ids"].shape[2]
input_ids["position_ids"] = torch.nn.functional.pad(
input_ids["position_ids"], (0, pad_length), "constant", 1
)
# truncate
if len(input_ids["input_ids"][0]) > self.max_seq_length:
input_ids["input_ids"] = input_ids["input_ids"][
:, : self.max_seq_length
]
input_ids["labels"] = input_ids["labels"][:, : self.max_seq_length]
input_ids["attention_mask"] = input_ids["attention_mask"][
:, : self.max_seq_length
]
input_ids["position_ids"] = input_ids["position_ids"][
:, : self.max_seq_length
]
# batching
batch_input_ids.append(input_ids)
batch_input_ids = {
"input_ids": torch.cat(
[input_ids["input_ids"] for input_ids in batch_input_ids], dim=0
),
"attention_mask": torch.cat(
[input_ids["attention_mask"] for input_ids in batch_input_ids], dim=0
),
"labels": torch.cat(
[input_ids["labels"] for input_ids in batch_input_ids], dim=0
),
"pixel_values": torch.cat(
[input_ids["pixel_values"] for input_ids in batch_input_ids], dim=0
),
"image_grid_thw": torch.cat(
[input_ids["image_grid_thw"] for input_ids in batch_input_ids], dim=0
),
"position_ids": torch.cat(
[input_ids["position_ids"] for input_ids in batch_input_ids], dim=1
),
}
return batch_input_ids
def get_rope_index_2(
self,
spatial_merge_size: Optional[int] = 2,
input_ids: Optional[torch.LongTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
second_per_grid_ts: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Calculate the 3D rope index based on image and video's temporal, height and width in LLM.
Explanation:
Each embedding sequence contains vision embedding and text embedding or just contains text embedding.
            For pure text embedding sequence, the rotary position embedding has no difference from modern LLMs.
Examples:
input_ids: [T T T T T], here T is for text.
temporal position_ids: [0, 1, 2, 3, 4]
height position_ids: [0, 1, 2, 3, 4]
width position_ids: [0, 1, 2, 3, 4]
For vision and text embedding sequence, we calculate 3D rotary position embedding for vision part
            and 1D rotary position embedding for the text part.
Examples:
Assume we have a video input with 3 temporal patches, 2 height patches and 2 width patches.
input_ids: [V V V V V V V V V V V V T T T T T], here V is for vision.
vision temporal position_ids: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
vision height position_ids: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
vision width position_ids: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
text temporal position_ids: [3, 4, 5, 6, 7]
text height position_ids: [3, 4, 5, 6, 7]
text width position_ids: [3, 4, 5, 6, 7]
Here we calculate the text start position_ids as the max vision position_ids plus 1.
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
The temporal, height and width of feature shape of each image in LLM.
video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
The temporal, height and width of feature shape of each video in LLM.
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
Returns:
position_ids (`torch.LongTensor` of shape `(3, batch_size, sequence_length)`)
mrope_position_deltas (`torch.Tensor` of shape `(batch_size)`)
"""
image_token_id = 151655
video_token_id = 151656
vision_start_token_id = 151652
mrope_position_deltas = []
if input_ids is not None and (
image_grid_thw is not None or video_grid_thw is not None
):
total_input_ids = input_ids
if attention_mask is None:
attention_mask = torch.ones_like(total_input_ids)
position_ids = torch.ones(
3,
input_ids.shape[0],
input_ids.shape[1],
dtype=input_ids.dtype,
device=input_ids.device,
)
image_index, video_index = 0, 0
for i, input_ids in enumerate(total_input_ids):
input_ids = input_ids[attention_mask[i] == 1]
image_nums, video_nums = 0, 0
vision_start_indices = torch.argwhere(
input_ids == vision_start_token_id
).squeeze(1)
vision_tokens = input_ids[vision_start_indices + 1]
image_nums = (vision_tokens == image_token_id).sum()
video_nums = (vision_tokens == video_token_id).sum()
input_tokens = input_ids.tolist()
llm_pos_ids_list: list = []
st = 0
remain_images, remain_videos = image_nums, video_nums
for _ in range(image_nums + video_nums):
if image_token_id in input_tokens and remain_images > 0:
ed_image = input_tokens.index(image_token_id, st)
else:
ed_image = len(input_tokens) + 1
if video_token_id in input_tokens and remain_videos > 0:
ed_video = input_tokens.index(video_token_id, st)
else:
ed_video = len(input_tokens) + 1
if ed_image < ed_video:
t, h, w = (
image_grid_thw[image_index][0],
image_grid_thw[image_index][1],
image_grid_thw[image_index][2],
)
image_index += 1
remain_images -= 1
ed = ed_image
else:
t, h, w = (
video_grid_thw[video_index][0],
video_grid_thw[video_index][1],
video_grid_thw[video_index][2],
)
video_index += 1
remain_videos -= 1
ed = ed_video
llm_grid_t, llm_grid_h, llm_grid_w = (
t.item(),
h.item() // spatial_merge_size,
w.item() // spatial_merge_size,
)
text_len = ed - st
st_idx = (
llm_pos_ids_list[-1].max() + 1
if len(llm_pos_ids_list) > 0
else 0
)
llm_pos_ids_list.append(
torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx
)
t_index = (
torch.arange(llm_grid_t)
.view(-1, 1)
.expand(-1, llm_grid_h * llm_grid_w)
.flatten()
)
h_index = (
torch.arange(llm_grid_h)
.view(1, -1, 1)
.expand(llm_grid_t, -1, llm_grid_w)
.flatten()
)
w_index = (
torch.arange(llm_grid_w)
.view(1, 1, -1)
.expand(llm_grid_t, llm_grid_h, -1)
.flatten()
)
llm_pos_ids_list.append(
torch.stack([t_index, h_index, w_index]) + text_len + st_idx
)
st = ed + llm_grid_t * llm_grid_h * llm_grid_w
if st < len(input_tokens):
st_idx = (
llm_pos_ids_list[-1].max() + 1
if len(llm_pos_ids_list) > 0
else 0
)
text_len = len(input_tokens) - st
llm_pos_ids_list.append(
torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx
)
llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
position_ids[..., i, attention_mask[i] == 1] = llm_positions.to(
position_ids.device
)
mrope_position_deltas.append(
llm_positions.max() + 1 - len(total_input_ids[i])
)
mrope_position_deltas = torch.tensor(
mrope_position_deltas, device=input_ids.device
).unsqueeze(1)
return position_ids, mrope_position_deltas
else:
if attention_mask is not None:
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
position_ids = (
position_ids.unsqueeze(0)
.expand(3, -1, -1)
.to(attention_mask.device)
)
max_position_ids = position_ids.max(0, keepdim=False)[0].max(
-1, keepdim=True
)[0]
mrope_position_deltas = max_position_ids + 1 - attention_mask.shape[-1]
else:
position_ids = (
torch.arange(input_ids.shape[1], device=input_ids.device)
.view(1, 1, -1)
.expand(3, input_ids.shape[0], -1)
)
mrope_position_deltas = torch.zeros(
[input_ids.shape[0], 1],
device=input_ids.device,
dtype=input_ids.dtype,
)
return position_ids, mrope_position_deltas
################
# Data collator map
################
vision_data_collator_map = {
"Qwen2_5VLCollator": Qwen2_5VLCollator,
}
3. Parameter Settings
1. Initialize the model, data, and training arguments
Arguments such as model_name_or_path need to change frequently, and editing them inside the code each time is tedious, so we pass them in through script/config files. That requires declaring the argument dataclasses first:
## Argument definitions
################
# Model arguments
################
@dataclass
class ModelArguments:
auto_model_class: Optional[str] = field(
default="AutoModelForCausalLM",
metadata={
"help": (
"The auto model class to use for the model. Default is AutoModelForCausalLM."
)
},
)
model_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": "Path to pretrained model or model identifier from huggingface.co/models."
},
)
processor_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": "Path to pretrained processor or processor identifier from huggingface.co/models."
},
)
trust_remote_code: Optional[bool] = field(
default=True,
metadata={
"help": "Whether to trust the remote code when loading the model and processor. default is True."
},
)
torch_dtype: Optional[str] = field(
default="bfloat16",
metadata={"help": "The torch dtype to use for the model. Default is bfloat16."},
)
def __post_init__(self):
if self.processor_name_or_path is None:
self.processor_name_or_path = self.model_name_or_path
################
# datasets arguments
################
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
train_dataset_name: Optional[str] = field(
default=None,
metadata={"help": "The name of the train dataset to use (via the datasets library)."},
)
test_dataset_name: Optional[str] = field(
default=None,
metadata={"help": "The name of the test dataset to use (via the datasets library)."},
)
data_collator: Optional[str] = field(
default="vision_data_collator",
metadata={
"help": (
"The data collator to use for the dataset. Default is vision_data_collator."
)
},
)
max_seq_length: Optional[int] = field(
default=1024,
metadata={
"help": (
"The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
)
},
)
max_image_side: Optional[int] = field(
default=256,
metadata={
"help": ("The size of the image to use for the dataset. Default is 224.")
},
)
################
# lora arguments
################
@dataclass
class LoraArguments:
use_lora: bool = False
r: int = 8
lora_alpha: int = 32
target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
bias = "none"
task_type: str = "CAUSAL_LM"
lora_dropout: float = 0.05
inference_mode: bool = False
Because this tutorial uses full-parameter fine-tuning, the LoRA arguments are not used; readers who are interested can try LoRA fine-tuning on their own.
2. Script files
This tutorial uses single-node multi-GPU distributed training, so there are quite a few script files. The command below shows how they fit together; each file is then explained in turn.
bash scripts/sft_vqa_8gpu-z2.sh configs/SFT_Qwen2_5-VL-3B-Instruct_vqa.yaml
- scripts/sft_vqa_8gpu-z2.sh:
########################################################
# train sft.py with 8 gpus in deepspeed zero2 bf16
########################################################
accelerate launch \
    --num_processes 8 \
    --main_process_port 25001 \
    --config_file configs/deepspeed_bf16_zero2.yaml \
    sft.py \
    --config $1
This script uses the accelerate launcher to manage multi-GPU training, here with 8 GPUs. Training runs with DeepSpeed's ZeRO-2 optimization and bf16 (bfloat16) precision for better efficiency. The script loads configs/deepspeed_bf16_zero2.yaml, which defines the distributed-training settings, and its entry point is sft.py, which receives an external --config argument pointing to the training configuration file. If the number of GPUs changes, only num_processes needs to be modified; everything else stays the same.
- configs/deepspeed_bf16_zero2.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
deepspeed_multinode_launcher: standard
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: false
zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
This file defines the DeepSpeed distributed-training configuration; the values are essentially defaults and do not need to be changed.
- configs/SFT_Qwen2_5-VL-3B-Instruct_vqa.yaml:
# Model settings; see trl.ModelConfig for the available parameters
model_name_or_path: /home/jiangqiushan/test/models/Qwen2.5-VL-3B-Instruct
auto_model_class: "Qwen2_5_VLForConditionalGeneration"
torch_dtype: bfloat16
# Dataset settings; see sft.DataTrainingArguments
train_dataset_name: ./data/train.jsonl
test_dataset_name: ./data/test.jsonl
preprocessing_num_workers: 1
data_collator: "Qwen2_5VLCollator"
max_seq_length: 256
# Training settings; see transformers.TrainingArguments and trl.SFTConfig
## Training hyperparameters
seed: 2025
data_seed: 2025
remove_unused_columns: False # must be set to False here
## Batch size and number of training steps
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
max_steps: 2000
## Learning rate
learning_rate: 1.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
## Training efficiency
gradient_checkpointing: false
bf16: true
bf16_full_eval: true
## Evaluation
eval_strategy: steps
eval_steps: 0.1
## Output and logging
output_dir: /home/jiangqiushan/test/models/SFT_Qwen2_5-VL-3B-Instruct_vqa
save_steps: 0.2
save_total_limit: 1
report_to: swanlab
logging_first_step: true
logging_steps: 0.001
The parameters in this file are essentially the model, dataset, and training arguments described above; adjust them to your training needs.
💡 Note:
Because this tutorial fixes max_steps, the effective number of epochs ends up being large and the model will overfit; if you would rather train by epochs, set that explicitly instead.
4. Training & Saving the Model
Below is the training and saving code, which lives in sft.py.
def main(data_args, training_args, model_args, lora_args):
################
# Prepare something
################
output_dir = training_args.output_dir
dir_path, model_name = os.path.split(output_dir)
new_model_name = device_type + "_" + model_name
training_args.output_dir = os.path.join(dir_path, new_model_name)
training_args.run_name = new_model_name
set_seeds(training_args.seed)
################
# Model init kwargs & Tokenizer
################
# load processor
processor = AutoProcessor.from_pretrained(
pretrained_model_name_or_path=model_args.processor_name_or_path,
trust_remote_code=model_args.trust_remote_code,
local_files_only=True,
)
# load and construct model
    model_class = getattr(transformers, model_args.auto_model_class)  # dynamically resolve the model class
if model_class is None:
raise ValueError(f"Model class {model_args.auto_model_class} is not available.")
model = model_class.from_pretrained(
pretrained_model_name_or_path=model_args.model_name_or_path,
torch_dtype=getattr(torch, model_args.torch_dtype),
trust_remote_code=model_args.trust_remote_code,
local_files_only=True,
)
if lora_args.use_lora:
lora_config = LoraConfig(
r=lora_args.r,
lora_alpha=lora_args.lora_alpha,
target_modules=lora_args.target_modules,
lora_dropout=lora_args.lora_dropout,
bias=lora_args.bias,
)
model = get_peft_model(model, lora_config)
################
# Dataset
################
train_dataset = datasets.load_dataset("json", data_files=data_args.train_dataset_name)
test_dataset = datasets.load_dataset("json", data_files=data_args.test_dataset_name)
    # Build the DatasetDict
raw_dataset = datasets.DatasetDict({
"train": train_dataset["train"],
"test": test_dataset["train"]
})
print(raw_dataset)
# data formatting
def preporocess_textvqa(example):
return {
"image": example["image"],
"user": example["query"],
"assistant": example["response"],
}
raw_dataset = raw_dataset.map(
preporocess_textvqa,
remove_columns=raw_dataset["train"].column_names,
desc="Preprocessing textvqa dataset",
)
data_collator = vision_data_collator_map[data_args.data_collator](
processor=processor,
max_seq_length=data_args.max_seq_length,
max_img_side_length=data_args.max_image_side,
)
################
# Training
################
last_checkpoint = None # load last checkpoint if available
if (
os.path.isdir(training_args.output_dir)
and not training_args.overwrite_output_dir
):
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
print(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Initialize our Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=raw_dataset["train"],
eval_dataset=(
raw_dataset["test"] if training_args.eval_strategy != "no" else None
),
data_collator=data_collator,
)
trainer.train(resume_from_checkpoint=last_checkpoint)
trainer.save_model(training_args.output_dir)
if __name__ == "__main__":
dataclass_types = (
DataTrainingArguments,
TrainingArguments,
ModelArguments,
LoraArguments,
)
parser = TrlParser(dataclass_types)
data_args, training_args, model_args, lora_args = parser.parse_args_and_config()
main(data_args, training_args, model_args, lora_args)
5. Complete Code
GitHub repo 👉 [textvqa_grounding_task_qwen2.5-vl-ft](https://github.com/828Tina/textvqa_grounding_task_qwen2.5-vl-ft)
The project layout is as follows:
project/
├── configs/
│ ├── deepspeed_bf16_zero2.yaml
│ ├── deepspeed_bf16_zero3.yaml
│ └── SFT_Qwen2_5-VL-3B-Instruct_vqa.yaml
├── scripts/
│ ├── merge_model.py
│ ├── convert2sft_format.py
│ ├── download_data.sh
│ ├── download_model.sh
│ ├── download_dayiwan.sh
│ └── sft_vqa_8gpu-z2.sh
├── README.md
├── requirements.txt
├── sft.py
├── utils.py
└── vision_datacollator.py
📈 SwanLab Visualization Results
Link 👉 SwanLab
📌 Inference with the Fine-Tuned Model
Because server resources are tight at the moment, this part will be filled in later 👋
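Until then, here is a rough, assumption-heavy sketch of what inference against the fine-tuned checkpoint could look like: the checkpoint path is a placeholder, resize_with_max_side is reused from vision_datacollator.py, and the prompt construction mirrors the training collator:
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from vision_datacollator import resize_with_max_side  # reuse the training-time resizing

CKPT = "/path/to/SFT_Qwen2_5-VL-3B-Instruct_vqa"  # placeholder: the fine-tuned output_dir
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    CKPT, torch_dtype="auto", device_map="auto"
)
# If the processor was not saved alongside the checkpoint, load it from the base model instead.
processor = AutoProcessor.from_pretrained(CKPT)

image = Image.open("./data/test/003001.jpg").convert("RGB")
image, scale = resize_with_max_side(image, max_side_length=256)  # match max_image_side used in training
question = "what is written on the ghost?"
prompt = "Please enclose the corresponding positions using coordinate boxes. Examples of coordinate value formats: [x1,y1,x2,y2]"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "<image>\n" + question + prompt},  # mirrors the training prompt
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], text=text, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected: a JSON string such as {"bbox_2d": [...]} in resized-image coordinates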
⚙️F&A
1、魔搭社区下载的数据集用不了
由于本身数据集是来源于huggingface,魔搭社区上传的数据集会有dataset_infos.json文件,该文件是上传时自动生成,用以在数据预览功能里展示数据集中每一类别标签,但是不符合huggingface的格式,我们在使用的时候会调用datasets库,然后会报下面的错误:
代码:
from datasets import load_dataset
DATA_PATH = '/data/nvme0/textvqa_bbox'
ds = load_dataset(DATA_PATH,split='train')
Error:
TypeError: Value.__init__() missing 1 required positional argument: 'dtype'
Fix:
Delete the dataset_infos.json file from the locally downloaded dataset folder.
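For example (a sketch, assuming the download path used earlier):
import os

os.remove("/data/nvme0/textvqa_bbox/dataset_infos.json")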
References
https://github.com/QwenLM/Qwen2.5-VL/tree/main
https://github.com/huggingface/trl
https://github.com/huggingface/transformers
https://www.modelscope.cn/datasets/Tina12345/textVQA_groundingtask_bbox/summary
https://huggingface.co/datasets/jrzhang/TextVQA_GT_bbox
https://github.com/modelscope/ms-swift/tree/main
Multimodal LLM application practice (1): efficient hotel image classification by fine-tuning LLaVA
A guide to distributed fine-tuning of multimodal LLMs with swift and InternVL (with code and data)