BLIP-2 Official Library Study Notes

BLIP-2 proposes a generic pre-training strategy that leverages frozen image encoders and large language models, bridging them with a lightweight Transformer to achieve both high efficiency and state-of-the-art performance. With significantly fewer trainable parameters, BLIP-2 performs strongly on VQAv2 and outperforms the Flamingo80B model.

Hugging Face documentation

Overview

The BLIP-2 model was proposed in "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 outperforms Flamingo, an 80-billion-parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.

The abstract from the paper is the following:

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. Despite having significantly fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on various vision-language tasks. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2, with 54x fewer trainable parameters. We also demonstrate the model's emerging capability of zero-shot image-to-text generation that can follow natural language instructions.

Blip2Config

( vision_config = None, qformer_config = None, text_config = None, num_query_tokens = 32, **kwargs )
  • vision_config (dict, optional) – Dictionary of configuration options used to initialize Blip2VisionConfig.

  • qformer_config (dict, optional) – Dictionary of configuration options used to initialize Blip2QFormerConfig.

  • text_config (dict, optional) – Dictionary of configuration options used to initialize the text (language model) configuration.

  • num_query_tokens (int, optional, defaults to 32) – The number of query tokens passed through the Transformer.

  • kwargs (optional) – Dictionary of keyword arguments.

Blip2Config is the configuration class used to store the configuration of a Blip2ForConditionalGeneration model. It is used to instantiate a BLIP-2 model according to the specified arguments, defining the vision model, Q-Former model, and language model configurations.
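
The configuration dump shown below is presumably produced by instantiating a default Blip2Config and printing it; a minimal sketch:

from transformers import Blip2Config

# Build a Blip2Config with default (Salesforce/blip2-opt-2.7b style) settings
configuration = Blip2Config()

# The config prints as JSON, with nested vision / Q-Former / text sub-configs
print(configuration)

Output: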

Blip2Config {
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "model_type": "blip-2",
  "num_query_tokens": 32,
  "qformer_config": {
    "model_type": "blip_2_qformer"
  },
  "text_config": {
    "model_type": "opt"
  },
  "transformers_version": "4.35.2",
  "use_decoder_only_language_model": true,
  "vision_config": {
    "model_type": "blip_2_vision_model"
  }
}
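
Besides the default constructor, a Blip2Config can also be assembled from its three sub-configurations via the from_vision_qformer_text_configs helper (following the pattern in the Transformers documentation); a minimal sketch, using OPTConfig as the text config to match the opt-2.7b checkpoints:

from transformers import Blip2VisionConfig, Blip2QFormerConfig, OPTConfig, Blip2Config

# Build each sub-configuration explicitly...
vision_config = Blip2VisionConfig()
qformer_config = Blip2QFormerConfig()
text_config = OPTConfig()

# ...and combine them into a single Blip2Config
config = Blip2Config.from_vision_qformer_text_configs(vision_config, qformer_config, text_config)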

Blip2VisionConfig

( hidden_size = 1408, intermediate_size = 6144, num_hidden_layers = 39, num_attention_heads = 16, image_size = 224, patch_size = 14, hidden_act = 'gelu', layer_norm_eps = 1e-06, attention_dropout = 0.0, initializer_range = 1e-10, qkv_bias = True, **kwargs )
  • hidden_size (int, optional, defaults to 1408) – Dimensionality of the encoder layers and the pooler layer.

  • intermediate_size (int, optional, defaults to 6144) – Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

  • num_hidden_layers (int, optional, defaults to 39) – Number of hidden layers in the Transformer encoder.

This is the configuration class used to store the configuration of a Blip2VisionModel.

from transformers import Blip2VisionConfig, Blip2VisionModel

# Initializing a Blip2VisionConfig with Salesforce/blip2-opt-2.7b style configuration
configuration = Blip2VisionConfig()

# Initializing a Blip2VisionModel (with random weights) from the Salesforce/blip2-opt-2.7b style configuration
model = Blip2VisionModel(configuration)

# Accessing the model configuration
configuration = model.config

Output:

Blip2VisionConfig {
  "attention_dropout": 0.0,
  "hidden_act": "gelu",
  "hidden_size": 1408,
  "image_size": 224,
  "initializer_range": 1e-10,
  "intermediate_size": 6144,
  "layer_norm_eps": 1e-06,
  "model_type": "blip_2_vision_model",
  "num_attention_heads": 16,
  "num_hidden_layers": 39,
  "patch_size": 14,
  "qkv_bias": true,
  "transformers_version": "4.35.2"
}

Blip2QFormerConfig

( vocab_size = 30522, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, max_position_embeddings = 512, initializer_range = 0.02, layer_norm_eps = 1e-12, pad_token_id = 0, position_embedding_type = 'absolute', cross_attention_frequency = 2, encoder_hidden_size = 1408, **kwargs )
  • vocab_size (int, optional, defaults to 30522) – Vocabulary size of the Q-Former model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling the model.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

from transformers import Blip2QFormerConfig, Blip2QFormerModel

# Initializing a BLIP-2 Salesforce/blip2-opt-2.7b style configuration
configuration = Blip2QFormerConfig()

# Initializing a model (with random weights) from the Salesforce/blip2-opt-2.7b style configuration
model = Blip2QFormerModel(configuration)
# Accessing the model configuration
configuration = model.config

Output:

Blip2QFormerConfig {
  "attention_probs_dropout_prob": 0.1,
  "cross_attention_frequency": 2,
  "encoder_hidden_size": 1408,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "blip_2_qformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "vocab_size": 30522
}

Blip2Model

outputs = model(**inputs)

from PIL import Image
import requests
from transformers import Blip2Processor, Blip2Model
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
model.to(device)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)

outputs = model(**inputs)
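
The forward call above only returns the model's raw outputs. To actually decode an answer to the prompt, one would typically switch to Blip2ForConditionalGeneration and call generate(); a minimal sketch reusing the processor, inputs, and device defined above:

from transformers import Blip2ForConditionalGeneration

# Load the conditional-generation variant of the same checkpoint
gen_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

# Autoregressively generate the answer tokens and decode them to text
generated_ids = gen_model.generate(**inputs)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)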

get_text_features

text_features = model.get_text_features(**inputs)

import torch
from transformers import AutoTokenizer, Blip2Model

model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

tokenizer = AutoTokenizer.from_pretrained("Salesforce/blip2-opt-2.7b")
inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
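
The returned object is the (frozen) language model's output for the tokenized text. Assuming the default return_dict behaviour of the OPT-based, decoder-only checkpoint, the logits can be inspected directly; a hedged check:

# Expected shape: (batch_size, sequence_length, vocab_size) for the decoder-only OPT text model
print(text_features.logits.shape)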

get_image_features

image_outputs = model.get_image_features(**inputs)

import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Blip2Model

model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
image_outputs = model.get_image_features(**inputs)
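
get_image_features returns the frozen vision encoder's output. With the default 224x224 resolution and 14x14 patches (see the Blip2VisionConfig defaults above), the sequence length should be 16*16 patch tokens plus one class token, each of width 1408; a hedged check:

# Expected: torch.Size([1, 257, 1408]) -> 256 patch tokens + 1 class token, hidden_size 1408
print(image_outputs.last_hidden_state.shape)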

get_qformer_features

qformer_outputs = model.get_qformer_features(**inputs)

import torch
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2Model

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
qformer_outputs = model.get_qformer_features(**inputs)
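
get_qformer_features runs the frozen image features through the Q-Former's learned query tokens, so the hidden states should have one row per query (32 by default, hidden size 768, per the Blip2QFormerConfig defaults above); a hedged check:

# Expected: torch.Size([1, 32, 768]) -> one embedding per learned query token
print(qformer_outputs.last_hidden_state.shape)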