MLM | InternVL: a detailed guide to InternVL (a pioneering open-source alternative to GPT-4V that narrows the gap with commercial multimodal models through its open-source suite), covering its introduction, installation, usage, and case applications
Introduction to InternVL
InternVL scales ViT up to 6B parameters and aligns it with large language models. InternVL is an open-source family of multimodal vision-language models that handles a wide range of tasks at the intersection of vision and language. Its main features and contributions include:
>> Scale: the InternVL-Chat series scales its LLM up to 34B parameters, larger than previous open-source models, and the core vision encoder, InternViT, has 6B parameters, far larger than the ViTs commonly available as open source (ViT-22B itself is not publicly released).
>> Strong performance: InternVL surpasses or matches the state of the art on many vision-language benchmarks such as MMMU and DocVQA, approaching the commercial GPT-4V; its frozen-backbone semantic segmentation mIoU is also several points higher than that of ViT-22B (see the tables below).
>> Multilingual support: besides English, InternVL supports Chinese and other languages, performing strongly on multilingual zero-shot recognition and image-text retrieval.
>> Scalability: InternVL offers models at several sizes; even the 2B-parameter Mini-InternVL-Chat is highly capable, and an INT8 version is provided for efficient inference.
>> Openness: InternVL is released under the MIT license, with all models, code, and data open-sourced on GitHub for developers to study and build on.
>> Versatility: beyond image-text dialogue, InternVL also handles image classification, semantic segmentation, video classification, and image-text matching; object detection and instance segmentation support is under active development.
In short, InternVL is one of the most capable, comprehensive, and open vision-language model families to date. It surpasses previous open-source work in scale, performance, and extensibility, approaching commercial-grade quality and laying a solid foundation for research and applications in the vision-language field.
1. Changelog
2024/05/13: InternVL can now serve as the text encoder for diffusion models, natively supporting multilingual generation in more than 110 languages. See MuLan for details.
2024/04/28: We released the INT8 version of InternVL-Chat-V1-5; see the HF link.
2024/04/28: We achieved SOTA performance (75.74) on the InfographicVQA benchmark; see here.
2024/04/18: InternVL-Chat-V1.5 has been released at the HF link, approaching the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista.
2024/02/27: InternVL was accepted to CVPR 2024!
2024/02/24: The InternVL-Chat models have been integrated into VLMEvalKit.
2024/02/21: InternVL-Chat-V1.2-Plus achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for details.
2024/02/12: InternVL-Chat-V1.2 has been released. It reaches 51.6 on MMMU val and 82.3 on MMBench test. For more details, see our blog, the SFT data, or try our demo. The model is now available on HuggingFace, and the training/evaluation data and scripts are open-sourced.
2024/02/04: InternVL-Chat-V1.1 reaches 44.67% on MMVP, higher than GPT-4V!
2024/01/27: We released the 448-resolution model, reaching 76.6 on MMBench dev; see here.
2024/01/24: InternVL-Chat-V1.1 has been released, supporting Chinese and with stronger OCR capability; see here or try our demo.
2024/01/16: We released customized mmcv/mmsegmentation/mmdetection code integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.
2. Documentation
How to install the environment? [link]
How to reproduce the SFT stage of InternVL-Chat-V1.2? [link]
How to fine-tune InternVL-Chat-V1.2 on a custom dataset? [link]
How to evaluate InternVL-Chat-V1-5? [link]
How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (recommended) [link]
How to deploy a local demo? [link]
How to run InternVL 1.5-8bit on Nvidia V100 GPUs? [link] [Chinese tutorial]
How to perform batch inference? [link]
Inference acceleration with LMDeploy [link] [Chinese tutorial]
3. Comparison with SOTA VLLMs
4. What Can InternVL Do?
Visual Perception
Linear-Probe Image Classification [see details]
ViT-22B uses the private JFT-3B dataset.

method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch
---|---|---|---|---|---|---|---
OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4
DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5
EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1
MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | -
ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | -
InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1
Semantic Segmentation [see details]

method | decoder | #param (train/total) | crop size | mIoU
---|---|---|---|---
OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3
ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6
InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6)
ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7
InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2)
ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3
InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6)
Zero-Shot Image Classification [see details]

method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet
---|---|---|---|---|---|---
OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0
EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6
ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6
InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6
Multilingual Zero-Shot Image Classification [see details]
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT)
---|---|---|---|---|---
Taiyi-CLIP-ViT-H | - | 54.4 | - | - | -
WuKong-ViT-L-G | - | 57.5 | - | - | -
CN-CLIP-ViT-H | - | 59.6 | - | - | -
AltCLIP-ViT-L | 74.5 | 59.6 | - | - | -
EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2
OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8
InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7
Zero-Shot Video Classification [see details]

method | #frame | K400 | K600 | K700
---|---|---|---|---
OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2
EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4
InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7
ViCLIP | 8 | 75.7 | 73.5 | 66.4
InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5
Cross-Modal Retrieval
English Zero-Shot Image-Text Retrieval [see details]

model | Flickr30K image-to-text R@1/5/10 | Flickr30K text-to-image R@1/5/10 | COCO image-to-text R@1/5/10 | COCO text-to-image R@1/5/10 | avg
---|---|---|---|---|---
OpenCLIP-G | 92.9 / 99.3 / 99.8 | 79.5 / 95.0 / 97.1 | 67.3 / 86.9 / 92.6 | 51.4 / 74.9 / 83.0 | 85.0
EVA-02-CLIP-E+ | 93.9 / 99.4 / 99.8 | 78.8 / 94.2 / 96.8 | 68.8 / 87.8 / 92.8 | 51.1 / 75.0 / 82.7 | 85.1
EVA-CLIP-8B | 95.6 / 99.6 / 99.9 | 80.8 / 95.5 / 97.6 | 70.3 / 89.3 / 93.9 | 53.0 / 76.0 / 83.4 | 86.2
InternVL-C (ours) | 94.7 / 99.6 / 99.9 | 81.7 / 96.0 / 98.2 | 70.6 / 89.0 / 93.5 | 54.1 / 77.3 / 84.6 | 86.6
InternVL-G (ours) | 95.7 / 99.7 / 99.9 | 85.0 / 97.0 / 98.6 | 74.9 / 91.3 / 95.2 | 58.6 / 81.3 / 88.0 | 88.8
Chinese Zero-Shot Image-Text Retrieval [see details]

model | Flickr30K-CN image-to-text R@1/5/10 | Flickr30K-CN text-to-image R@1/5/10 | COCO-CN image-to-text R@1/5/10 | COCO-CN text-to-image R@1/5/10 | avg
---|---|---|---|---|---
CN-CLIP-ViT-H | 81.6 / 97.5 / 98.8 | 71.2 / 91.4 / 95.5 | 63.0 / 86.6 / 92.9 | 69.2 / 89.9 / 96.1 | 86.1
OpenCLIP-XLM-R-H | 86.1 / 97.5 / 99.2 | 71.0 / 90.5 / 94.9 | 70.0 / 91.5 / 97.0 | 66.1 / 90.8 / 96.0 | 87.6
InternVL-C (ours) | 90.3 / 98.8 / 99.7 | 75.1 / 92.9 / 96.4 | 68.8 / 92.0 / 96.7 | 68.9 / 91.9 / 96.5 | 89.0
InternVL-G (ours) | 92.9 / 99.4 / 99.8 | 77.7 / 94.8 / 97.3 | 71.4 / 93.9 / 97.7 | 73.8 / 94.4 / 98.1 | 90.9
Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

method | EN | ES | FR | ZH | IT | KO | RU | JP | average
---|---|---|---|---|---|---|---|---|---
AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7
OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6
InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1
InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6
Multimodal Dialogue (see "Comparison with SOTA VLLMs" above)
5. Model Zoo
Vision Large Language Model
Model | Date | Download | Note |
---|---|---|---|
Mini-InternVL-Chat-2B-V1.5 (Preview version) | 2024.05.19 | 🤗 HF link | 🚀🚀 Only 2B parameters, anyone can deploy it locally. |
InternVL-Chat-V1.5-Int8 | 2024.04.28 | 🤗 HF link | The INT8 version of InternVL-Chat-V1-5 |
InternVL-Chat-V1.5 | 2024.04.18 | 🤗 HF link | supports 4K images; super strong OCR; approaching the performance of GPT-4V and Gemini Pro on benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 HF link | more SFT data and stronger |
InternVL-Chat-V1.2 | 2024.02.11 | 🤗 HF link | scaling up the LLM to 34B |
InternVL-Chat-V1.1 | 2024.01.24 | 🤗 HF link | supports Chinese and stronger OCR |
InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
Vision-Language Foundation Model
Model | Date | Download | Note |
---|---|---|---|
InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 HF link | supports dynamic resolution, super strong OCR (🔥new) |
InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 HF link | 448 resolution |
InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 HF link | 448 resolution |
InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP |
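All of the checkpoints above are hosted on HuggingFace. If you want to fetch a model ahead of time rather than relying on the on-the-fly cache, the following is a small sketch using huggingface_hub (the local directory name is arbitrary):
from huggingface_hub import snapshot_download
# downloads the full repository (weights + remote code) into a local folder
snapshot_download(repo_id='OpenGVLab/InternVL-Chat-V1-5',
                  local_dir='./InternVL-Chat-V1-5')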
Installation and Usage of InternVL
1. Installation
T1. CLI usage
Quick start with HuggingFace
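The article does not pin exact package versions, so the following is only a minimal environment sketch for the HuggingFace quick start (the package names beyond torch/torchvision/transformers are assumptions based on what the remote code of these checkpoints typically imports):
conda create -n internvl python=3.10 -y
conda activate internvl
# core dependencies; timm / einops / sentencepiece are assumed extras often needed by the remote code
pip install torch torchvision transformers pillow timm einops sentencepiece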
Using InternViT-6B
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
'OpenGVLab/InternViT-6B-224px',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).cuda().eval()
image = Image.open('./examples/image1.jpg').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
outputs = model(pixel_values)
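The forward pass above returns the vision features. As a rough sketch, assuming the remote code returns a standard transformers output object (this attribute naming is an assumption; check the model card if it differs), the shapes can be inspected like this:
# assumption: a standard transformers vision output with last_hidden_state (patch tokens)
# and, possibly, a pooled per-image embedding
print(outputs.last_hidden_state.shape)
pooled = getattr(outputs, 'pooler_output', None)
if pooled is not None:
    print(pooled.shape)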
Using InternVL-C (contrastive) and InternVL-G (generative)
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer
model = AutoModel.from_pretrained(
'OpenGVLab/InternVL-14B-224px',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')
tokenizer = AutoTokenizer.from_pretrained(
'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0 # set pad_token_id to 0
images = [
Image.open('./examples/image1.jpg').convert('RGB'),
Image.open('./examples/image2.jpg').convert('RGB'),
Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
prefix + 'a photo of a red panda', # English
prefix + '一张熊猫的照片', # Chinese
prefix + '二匹の猫の写真' # Japanese
]
pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
truncation=True, padding='max_length').input_ids.cuda()
# InternVL-C
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
# [2.2949e-02, 9.7656e-01, 5.9903e-06],
# [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# InternVL-G
logits_per_image, logits_per_text = model(
image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
# [8.6060e-03, 9.9219e-01, 2.8759e-06],
# [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
# dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
pixel_values=pixel_values,
input_ids=tokenized.input_ids.cuda(),
attention_mask=tokenized.attention_mask.cuda(),
num_beams=5,
min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
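The contrastive logits computed above can be used directly for zero-shot image-text matching. A minimal sketch that continues from the variables already defined (probs, logits_per_text, and texts from the InternVL-C/InternVL-G calls):
# pick the best text for each image and the best image for each text
image_to_text = probs.argmax(dim=-1)                            # [num_images]
text_to_image = logits_per_text.softmax(dim=-1).argmax(dim=-1)  # [num_texts]
for i, j in enumerate(image_to_text.tolist()):
    print(f'image {i} -> text {j}: {texts[j]}')
print('best image per text:', text_to_image.tolist())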
Using InternVL-Chat
from transformers import AutoTokenizer, AutoModel
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
path = "OpenGVLab/InternVL-Chat-V1-5"
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).eval().cuda()
# Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
# model = AutoModel.from_pretrained(
# path,
# torch_dtype=torch.bfloat16,
# low_cpu_mem_usage=True,
# trust_remote_code=True,
# device_map='auto').eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(
num_beams=1,
max_new_tokens=512,
do_sample=False,
)
# single-round single-image conversation
question = "请详细描述图片" # Please describe the picture in detail
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)
# multi-round single-image conversation
question = "请详细描述图片" # Please describe the picture in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "请根据图片写一首诗" # Please write a poem according to the picture
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# batch inference (single image per sample)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
image_counts = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
questions = ["Describe the image in detail."] * len(image_counts)
responses = model.batch_chat(tokenizer, pixel_values,
image_counts=image_counts,
questions=questions,
generation_config=generation_config)
for question, response in zip(questions, responses):
    print(question)
    print(response)
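The number of 448x448 tiles produced per image depends on its aspect ratio and on max_num, which in turn determines how many visual tokens the model sees. A quick check, reusing the dynamic_preprocess helper defined above (the example paths are the same ones assumed earlier):
# count how many tiles each example image is split into (plus one thumbnail when use_thumbnail=True)
for img_path in ['./examples/image1.jpg', './examples/image2.jpg']:
    img = Image.open(img_path).convert('RGB')
    tiles = dynamic_preprocess(img, image_size=448, use_thumbnail=True, max_num=6)
    print(img_path, img.size, '->', len(tiles), 'tiles')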
Inference Acceleration with LMDeploy
If you need to speed up inference for the InternVL-Chat models, we recommend using LMDeploy.
In the following subsections, we use the InternVL-Chat-V1-5 model as an example to show how to use LMDeploy.
Setting up the inference environment
First, set up the inference environment with the following steps:
Note that the LMDeploy PyPI package depends on CUDA 12.x by default. For CUDA 11.x environments, please refer to the installation guide.
conda create -n internvl python=3.10 -y
conda activate internvl
pip install timm torchvision==0.17.2
pip install lmdeploy
Offline inference pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')
image = load_image('examples/image2.jpg')
response = pipe(('描述这张图片', image))
print(response)
For more information on using the VLM pipeline, including multi-image inference and multi-turn dialogue, please refer to this guide.
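As a quick reference, the pipeline object also accepts a list of prompts for batched inference. The sketch below follows the common LMDeploy usage pattern; treat it as an assumption and confirm the details against the linked guide:
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')
# one (prompt, image) tuple per sample; responses come back in the same order
prompts = [('Describe this image.', load_image(p))
           for p in ['examples/image1.jpg', 'examples/image2.jpg']]
responses = pipe(prompts)
for r in responses:
    print(r)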
Online inference service
LMDeploy can package a VLM into an OpenAI-compatible service with one command, integrating seamlessly with the OpenAI API.
The service can be started with a single command:
lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5
You can run lmdeploy serve api_server -h to view the api_server arguments, e.g., --tp to set the degree of tensor parallelism, --session-len to specify the maximum length of the context window, and --cache-max-entry-count to adjust the fraction of GPU memory reserved for the k/v cache.
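For example, an illustrative launch on two GPUs (the values below are assumptions for demonstration, not tuned recommendations):
lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5 --tp 2 --session-len 8192 --cache-max-entry-count 0.5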
For more details, including starting the service with Docker, the RESTful API, and how to integrate with OpenAI, please refer to this guide.
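Once the server is running, it can be queried with the standard OpenAI Python client. A sketch assuming LMDeploy's default listening address and port (commonly 0.0.0.0:23333; adjust if you changed the server settings) and a publicly reachable image URL as a placeholder:
from openai import OpenAI
client = OpenAI(api_key='none', base_url='http://0.0.0.0:23333/v1')  # assumption: default port
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image.'},
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/image.jpg'}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)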
2. Usage
(1) Online use
Case Applications of InternVL
Continuously being updated...