[mPLUG-Owl2.1 Inference] Multimodal large model, image-to-text code example

Background

mPLUG-Owl2 is a multimodal large language model (MLLM) proposed by Alibaba's DAMO Academy. It is the first MLLM to reach state-of-the-art results on both pure-text and multimodal benchmarks with significant improvements. Compared with models of similar size, mPLUG-Owl2 surpasses the strong baseline LLaVA-1.5 in many respects. Moreover, even with a smaller vision backbone (ViT-L, 0.3B vs. ViT-G, 1.9B), mPLUG-Owl2 outperforms Qwen-VL by a wide margin, especially on low-level perception tasks (Q-Bench).
mPLUG-Owl2.1 is the Chinese-enhanced version of mPLUG-Owl2. The weights are available on HuggingFace.
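If you want to fetch the weights programmatically, a minimal sketch with huggingface_hub follows. The repo id shown is a placeholder I made up for illustration; substitute the actual mPLUG-Owl2.1 repository id listed on the HuggingFace model page.

from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the actual mPLUG-Owl2.1 repository id on HuggingFace.
model_path = snapshot_download(
    repo_id="<org>/<mplug-owl2.1-repo>",
    local_dir="./mplug-owl2.1",  # local directory where the weights are stored
)
print(model_path)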

Why this post exists

The code in the official documentation throws errors when run as-is.

Usage

Clone the repository and navigate to the mPLUG-Owl2 folder

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2

Install packages

The official instructions create a fresh virtual environment with a set of pinned packages. Setting up yet another environment is a hassle, and I wanted to keep using my own, but running pip install -e . would overwrite some of my packages. So instead, here are the extra packages I installed manually:

pip install bitsandbytes
pip install icecream
pip install accelerate==0.21.0
pip install flash-attn --no-build-isolation

Edit config.json in the model directory

Change vocab_size from 151936 to 151851 (a small snippet for patching the file follows the traceback below).
If you don't change it, you get the following error:

Traceback (most recent call last):
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/demo.py", line 15, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path,
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/builder.py", line 117, in load_pretrained_model
    model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3677, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4104, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/modeling_utils.py", line 886, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 358, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151851, 4096]) in "weight" (which has shape torch.Size([151936, 4096])), this look incorrect.
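If you'd rather patch the file programmatically than edit it by hand, here is a minimal sketch; it assumes model_path points at the downloaded model directory (the same path used in the full script below).

import json
import os

model_path = ''  # Model Path, same directory as in the demo script
config_file = os.path.join(model_path, 'config.json')

with open(config_file) as f:
    config = json.load(f)

config['vocab_size'] = 151851  # was 151936; must match the checkpoint's embedding shape

with open(config_file, 'w') as f:
    json.dump(config, f, indent=2)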

Adjust device_map

The model can only run on a single GPU; splitting it across devices is not supported.
If you do not set device="cuda:0", the default device_map='auto' is used and raises the error below.
Also, if your GPU has less than 22 GB of VRAM, quantize the model by setting load_8bit=True. The loading call is sketched after the traceback.

Traceback (most recent call last):
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/demo.py", line 15, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit=False, load_4bit=False, device="cuda",bf16=False,fp16=True,)
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/builder.py", line 117, in load_pretrained_model
    model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3593, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1867, in _get_no_split_modules
    raise ValueError(
ValueError: MplugOwlVisualAbstractorModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.
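In short, the loading call I use looks like this (the same call appears in the full script further down); set load_8bit=True only if VRAM is tight, otherwise leave it False.

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    load_8bit=True,   # quantize when the GPU has < 22 GB VRAM
    device="cuda:0",  # pin to one GPU; the default device_map='auto' raises the error above
)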

Adjust image_tensor

Here the tensor must be bfloat16, so change float16 to bfloat16 (the one-line fix follows the traceback); otherwise you get this error:

Traceback (most recent call last):
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/demo.py", line 48, in <module>
    output_ids = model.generate(
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/modeling_qwen.py", line 1213, in generate
    return super().generate(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/modeling_mplug_owl2.py", line 400, in forward
    self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/modeling_mplug_owl2.py", line 84, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images)
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/modeling_mplug_owl2.py", line 62, in encode_images
    image_features = self.get_model().vision_model(images).last_hidden_state
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/visual_encoder.py", line 431, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/hyh/mPLUG-Owl-main/mPLUG-Owl2/mplug_owl2/model/visual_encoder.py", line 115, in forward
    image_embeds = self.patch_embed(pixel_values)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/sakura/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: expected scalar type Half but found BFloat16
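The fix is the single cast below, exactly as it appears in the full script:

image_tensor = process_images([image], image_processor)
image_tensor = image_tensor.to(model.device, dtype=torch.bfloat16)  # bfloat16, not float16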

Full code

import torch
from PIL import Image
from transformers import TextStreamer

from mplug_owl2.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from mplug_owl2.conversation import conv_templates, SeparatorStyle
from mplug_owl2.model.builder import load_pretrained_model
from mplug_owl2.mm_utils import process_images, tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

image_file = '' # Image Path
model_path = '' # Model Path
# query = "请用中文进行回答,识别图像中的文字并整理。"
query = "Describe the image."
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path,
                                                                        None,
                                                                        model_name,
                                                                        load_8bit=True,
                                                                        # load_4bit=False,
                                                                        device="cuda:0",
                                                                        )
print("模型加载完成")
conv = conv_templates["mplug_owl2"].copy()
roles = conv.roles

image = Image.open(image_file).convert('RGB')
max_edge = max(image.size) # We recommend resizing to a squared image for BEST performance.
image = image.resize((max_edge, max_edge))

image_tensor = process_images([image], image_processor)
image_tensor = image_tensor.to(model.device, dtype=torch.bfloat16)

inp = DEFAULT_IMAGE_TOKEN + query
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
stop_str = conv.sep2
keywords = [stop_str]
print("图片转向量")
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

temperature = 0.7
max_new_tokens = 512
print("开始推理")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        streamer=streamer,
        use_cache=True,
        stopping_criteria=[stopping_criteria])

outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
print(outputs)

Result

mPLUG-Owl2.1 is billed as the Chinese-enhanced version of mPLUG-Owl2, yet it did not handle the Chinese query.
It answered:

I'm sorry, but I cannot provide a translation for the given text as it is not in English. The text appears to be in a foreign language.