Application areas
- Vision + NLP
- Visual question answering (VQA)
- Visual commonsense reasoning (VCR)
- Image region grounding
- Image retrieval
For individual users facing a huge volume of online video, being able to quickly and accurately find videos of interest from a keyword or a free-text description matters a great deal. The same need applies to personal storage: users want to search the videos they have recorded themselves on their phones or in cloud drives.
For video editors and production teams, searching a large media asset library for specific clips or footage is routine daily work. Accurate, efficient video retrieval lets them pin down matching material in a short time and meaningfully improves creative productivity.
Natural-language video retrieval based on large models
Multimodal representation models map text, images, audio, and video into high-dimensional vector representations, also known as embeddings. These embeddings capture the semantics of the content and place it in a continuous vector space in which semantically similar items end up close to each other.
The current mainstream recipe for vision + language multimodal models is to take a pretrained large language model and a pretrained image encoder and connect them with an image-text feature alignment module, so that the language model can consume image features and carry out deeper question answering and reasoning.
This reuses unimodal models trained on abundant single-modality data, reduces the dependence on high-quality paired image-text data, and bridges the two modalities' representations through feature alignment and instruction tuning.
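As a minimal sketch of this embedding-based retrieval idea (assuming the Hugging Face transformers CLIP API and the openai/clip-vit-base-patch32 checkpoint; frame sampling and file handling are left out, and any aligned dual encoder would work the same way):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP-style text/image dual encoder exposes the same interface.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(query: str) -> torch.Tensor:
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize

def embed_frames(frames: list[Image.Image]) -> torch.Tensor:
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def rank_frames(query: str, frames: list[Image.Image]) -> torch.Tensor:
    # Cosine similarity between the query embedding and each sampled keyframe embedding,
    # returning frame indices sorted from best to worst match.
    sims = embed_text(query) @ embed_frames(frames).T
    return sims.squeeze(0).argsort(descending=True)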
| Model / framework | Modality | Encoder / technique | Source / notes |
|---|---|---|---|
| CLIP | NLP+Vision | ViT / ResNet | OpenAI (open source) |
| Video-LLaVA | NLP+Vision | Not specified | |
| Chinese-CLIP | NLP+Vision | ViT / ResNet | Alibaba DAMO Academy |
| BLIP-2 | NLP+Vision | ViT | Salesforce |
| InstructBLIP | NLP+Vision | Not specified | |
| MiniGPT-4 | NLP+Vision | Not specified | |
| VideoChat | NLP+Vision | Not specified | |
| ALIGN | NLP+Vision | EfficientNet | encoder uses depthwise convolutions |
| Data2Vec | NLP+Vision+Audio | ViT (vision); transformer encoders for text/audio | Meta |
| X-LLM | Cross-modal | Large language model | Not specified |
CLIP
https://openai.com/research/clip
OpenAI codebase: https://github.com/openai/CLIP/blob/main/clip/model.py#L94
Image Encoder
CLIP-ResNet
# Stem of CLIP's ModifiedResNet (from the linked model.py): three conv + BN + ReLU
# blocks followed by average pooling, instead of a single 7x7 conv as in a standard ResNet.
self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(width // 2)
self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(width // 2)
self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
self.bn3 = nn.BatchNorm2d(width)
self.avgpool = nn.AvgPool2d(2)
# forward: x = relu(bn(conv(x))) for each of the three convs, then avgpool
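For completeness, a minimal usage sketch of the open-source clip package (the "RN50" weight name is taken from the repo; "frame.jpg" is a hypothetical local file):

import clip
import torch
from PIL import Image

# Load the ResNet-50 CLIP variant together with its matching preprocessing pipeline.
model, preprocess = clip.load("RN50", device="cpu")

image = preprocess(Image.open("frame.jpg")).unsqueeze(0)
text = clip.tokenize(["a dog playing on the beach", "a city street at night"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)   # similarity of the frame to each caption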
CLIP-ViT
CLIPVisionTower(
(vision_tower): CLIPVisionModel(
(vision_model): CLIPVisionTransformer(
(embeddings): CLIPVisionEmbeddings(
(patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
(position_embedding): Embedding(257, 1024)
)
(pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder): CLIPEncoder(
(layers): ModuleList(
(0-23): 24 x CLIPEncoderLayer(
(self_attn): CLIPAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(mlp): CLIPMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
)
(layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
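The shapes in the dump are consistent with a ViT-L/14 vision tower: hidden size 1024, 24 encoder layers, and 257 position embeddings, i.e. (224/14)^2 = 256 patch tokens plus one [CLS] token (the 224x224 input resolution is an assumption). A quick check:

image_size, patch_size = 224, 14                 # 224 is the usual ViT-L/14 resolution (assumption)
num_patches = (image_size // patch_size) ** 2    # 16 * 16 = 256
num_positions = num_patches + 1                  # + [CLS] token
assert num_positions == 257                      # matches Embedding(257, 1024) in the dump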
Chinese-CLIP-ViT
ChineseCLIPModel(
(text_model): ChineseCLIPTextModel(
(embeddings): ChineseCLIPTextEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): ChineseCLIPTextEncoder(
(layer): ModuleList(
(0-11): 12 x ChineseCLIPTextLayer(
(attention): ChineseCLIPTextAttention(
(self): ChineseCLIPTextSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): ChineseCLIPTextSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): ChineseCLIPTextIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): ChineseCLIPTextOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
)
(vision_model): ChineseCLIPVisionTransformer(
(embeddings): ChineseCLIPVisionEmbeddings(
(patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
(position_embedding): Embedding(50, 768)
)
(pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder): ChineseCLIPVisionEncoder(
(layers): ModuleList(
(0-11): 12 x ChineseCLIPVisionLayer(
(self_attn): ChineseCLIPVisionAttention(
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): ChineseCLIPVisionMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
(layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(visual_projection): Linear(in_features=768, out_features=512, bias=False)
(text_projection): Linear(in_features=768, out_features=512, bias=False)
)
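A minimal Chinese-CLIP usage sketch via transformers (the checkpoint name OFA-Sys/chinese-clip-vit-base-patch16 is an assumption; note the dump above is from a 32x32-patch variant, so the exact checkpoint may differ):

import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

image = Image.open("frame.jpg")                       # hypothetical local file
# Chinese queries: "a dog running on the beach", "a city street at night"
texts = ["一只在沙滩上奔跑的狗", "夜晚的城市街道"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # image-to-text similarity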
BLIP-2
- Model architecture
Source code: src/transformers/models/blip_2/modeling_blip_2.py
Blip2VisionModel
Blip2VisionModel(
(embeddings): Blip2VisionEmbeddings(
(patch_embedding): Conv2d(3, 1408, kernel_size=(14, 14), stride=(14, 14))
)
(encoder): Blip2Encoder(
(layers): ModuleList(
(0-38): 39 x Blip2EncoderLayer(
(self_attn): Blip2Attention(
(dropout): Dropout(p=0.0, inplace=False)
(qkv): Linear(in_features=1408, out_features=4224, bias=True)
(projection): Linear(in_features=1408, out_features=1408, bias=True)
)
(layer_norm1): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
(mlp): Blip2MLP(
(activation_fn): GELUActivation()
(fc1): Linear(in_features=1408, out_features=6144, bias=True)
(fc2): Linear(in_features=6144, out_features=1408, bias=True)
)
(layer_norm2): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
)
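The dump shows the large EVA-style ViT tower used by BLIP-2 (39 layers, hidden size 1408, 14x14 patches). A minimal captioning/VQA usage sketch via transformers (the Salesforce/blip2-opt-2.7b checkpoint name is an assumption; any BLIP-2 checkpoint exposes the same interface):

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("frame.jpg")                        # hypothetical local file
prompt = "Question: what is happening in this frame? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])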
ALIGN
https://arxiv.org/pdf/2102.05918.pdf
https://drive.weixin.qq.com/s?k=ALsAZwePAAYFqQHIPVAF8ABAZPAA8
AlignVisionModel(
(embeddings): AlignVisionEmbeddings(
(padding): ZeroPad2d((0, 1, 0, 1))
(convolution): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=valid, bias=False)
(batchnorm): BatchNorm2d(64, eps=0.001, momentum=0.99, affine=True, track_running_stats=True)
(activation): SiLU()
)
(encoder): AlignVisionEncoder(
(blocks): ModuleList(
(0): AlignVisionBlock(
(depthwise_conv): AlignVisionDepthwiseLayer(
(depthwise_conv_pad): ZeroPad2d((0, 1, 0, 1))
(depthwise_conv): AlignVisionDepthwiseConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=same, groups=64, bias=False)
(depthwise_norm): BatchNorm2d(64, eps=0.001, momentum=0.99, affine=True, track_running_stats=True)
(depthwise_act): SiLU()
)
(squeeze_excite): AlignVisionSqueezeExciteLayer(
(squeeze): AdaptiveAvgPool2d(output_size=1)
(reduce): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1), padding=same)
(expand): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), padding=same)
(act_reduce): SiLU()
(act_expand): Sigmoid()
)
(projection): AlignVisionFinalBlockLayer(
(project_conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), padding=same, bias=False)
(project_bn): BatchNorm2d(32, eps=0.001, momentum=0.99, affine=True, track_running_stats=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
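The dump above is truncated after the first encoder block, but it already shows the EfficientNet-style pattern noted in the table: a depthwise 3x3 convolution (groups equal to channels), a squeeze-and-excite gate, and a 1x1 projection. A minimal PyTorch sketch of the same pattern, with channel sizes taken from the dump:

import torch
import torch.nn as nn

class DepthwiseSEBlock(nn.Module):
    """EfficientNet-style block mirroring the dump: depthwise conv -> SE gate -> 1x1 projection."""
    def __init__(self, channels: int = 64, se_reduced: int = 16, out_channels: int = 32):
        super().__init__()
        # groups=channels makes the 3x3 conv depthwise (one filter per input channel)
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()
        # squeeze-and-excite: global pool, reduce, expand, sigmoid gate
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.reduce = nn.Conv2d(channels, se_reduced, 1)
        self.expand = nn.Conv2d(se_reduced, channels, 1)
        # final 1x1 projection, as in AlignVisionFinalBlockLayer
        self.project = nn.Conv2d(channels, out_channels, 1, bias=False)
        self.project_bn = nn.BatchNorm2d(out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn(self.depthwise(x)))
        gate = torch.sigmoid(self.expand(self.act(self.reduce(self.pool(x)))))
        x = x * gate                                   # channel-wise re-weighting
        return self.project_bn(self.project(x))

# Example: DepthwiseSEBlock()(torch.randn(1, 64, 112, 112)).shape -> torch.Size([1, 32, 112, 112])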
Data2Vec
Text encoder
Data2VecTextModel(
(embeddings): Data2VecTextForTextEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=1)
(position_embeddings): Embedding(512, 768, padding_idx=1)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): Data2VecTextEncoder(
(layer): ModuleList(
(0-11): 12 x Data2VecTextLayer(
(attention): Data2VecTextAttention(
(self): Data2VecTextSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): Data2VecTextSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): Data2VecTextIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): Data2VecTextOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): Data2VecTextPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
Audio encoder
Data2VecAudioModel(
(feature_extractor): Data2VecAudioFeatureEncoder(
(conv_layers): ModuleList(
(0): Data2VecAudioConvLayer(
(conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(1-4): 4 x Data2VecAudioConvLayer(
(conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
(5-6): 2 x Data2VecAudioConvLayer(
(conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation): GELUActivation()
)
)
)
(feature_projection): Data2VecAudioFeatureProjection(
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(projection): Linear(in_features=512, out_features=768, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(encoder): Data2VecAudioEncoder(
(pos_conv_embed): Data2VecAudioPositionalConvEmbedding(
(layers): ModuleList(
(0-4): 5 x Data2VecAudioPositionalConvLayer(
(conv): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
(padding): Data2VecAudioPadLayer()
(activation): GELUActivation()
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0-11): 12 x Data2VecAudioEncoderLayer(
(attention): Data2VecAudioAttention(
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout): Dropout(p=0.1, inplace=False)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(feed_forward): Data2VecAudioFeedForward(
(intermediate_dropout): Dropout(p=0.1, inplace=False)
(intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
(output_dense): Linear(in_features=3072, out_features=768, bias=True)
(output_dropout): Dropout(p=0.1, inplace=False)
)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
)
)
Vision (image) encoder
Data2VecVisionModel(
(embeddings): Data2VecVisionEmbeddings(
(patch_embeddings): Data2VecVisionPatchEmbeddings(
(projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
)
(dropout): Dropout(p=0.0, inplace=False)
)
(encoder): Data2VecVisionEncoder(
(layer): ModuleList(
      (0-11): 12 x Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Identity() in layer 0; Data2VecVisionDropPath(p) in layers 1-11, with p increasing linearly from ~0.009 to 0.1
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
)
)
(layernorm): Identity()
)
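All three Data2Vec encoders are available in transformers and, as the dumps show, produce 768-dimensional hidden states. A minimal sketch of extracting features from each modality (the facebook/data2vec-* checkpoint names are assumptions):

import numpy as np
from transformers import (
    AutoTokenizer, Data2VecTextModel,
    AutoFeatureExtractor, Data2VecAudioModel,
    AutoImageProcessor, Data2VecVisionModel,
)

# Text
tok = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
text_model = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")
text_out = text_model(**tok("a dog running on the beach", return_tensors="pt"))
print(text_out.last_hidden_state.shape)        # [1, seq_len, 768]

# Audio (expects a 16 kHz mono waveform)
fe = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base-960h")
audio_model = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base-960h")
waveform = np.zeros(16000, dtype=np.float32)   # 1 second of silence as a stand-in
audio_out = audio_model(**fe(waveform, sampling_rate=16000, return_tensors="pt"))
print(audio_out.last_hidden_state.shape)       # [1, num_frames, 768]

# Vision (a random image as a stand-in)
ip = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
vision_model = Data2VecVisionModel.from_pretrained("facebook/data2vec-vision-base")
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
vision_out = vision_model(**ip(image, return_tensors="pt"))
print(vision_out.last_hidden_state.shape)      # [1, 197, 768]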
Joint tokenization of text and image
The text prompt is extended with image placeholder tokens according to the number of image tokens produced for the input image:
The prompt text is then tokenized together with the image placeholders, producing the input token ids:
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
input_ids shape: [1, 81]
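tokenizer_image_token comes from the LLaVA-style codebase: it splits the prompt on the <image> placeholder, tokenizes each text chunk normally, and splices a special IMAGE_TOKEN_INDEX between the chunks so that the projected image features can later be inserted at those positions. A simplified sketch of that behavior (not the exact upstream implementation; the sentinel value -200 follows LLaVA's convention):

import torch

IMAGE_TOKEN_INDEX = -200   # sentinel id for image positions (LLaVA convention, assumption)

def tokenizer_image_token_sketch(prompt: str, tokenizer, image_token_index: int = IMAGE_TOKEN_INDEX):
    """Tokenize the text chunks around each <image> placeholder and interleave the image sentinel."""
    chunks = [tokenizer(chunk, add_special_tokens=False).input_ids
              for chunk in prompt.split("<image>")]
    input_ids = [tokenizer.bos_token_id] if tokenizer.bos_token_id is not None else []
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(image_token_index)    # one sentinel per <image> occurrence
        input_ids.extend(chunk)
    return torch.tensor(input_ids, dtype=torch.long)

# Usage mirroring the line above:
# input_ids = tokenizer_image_token_sketch(prompt, tokenizer).unsqueeze(0)   # shape [1, seq_len]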
Vision Transformer
(1) Patch embedding: for a 224x224 input image split into fixed-size 16x16 patches, we get (224/16) x (224/16) = 196 patches, i.e. an input sequence of length 196. Each patch flattens to a 16x16x3 = 768-dimensional vector; the linear projection layer has shape 768xN (here N = 768), so after projection the input is still 196x768: 196 tokens, each of dimension 768. A special [CLS] token is prepended, giving a final shape of 197x768. At this point, patch embedding has turned a vision problem into a sequence-to-sequence problem.
(2) Positional encoding (standard learnable 1D position embeddings): ViT also adds position embeddings. The position embedding can be thought of as a table with N rows, where N equals the input sequence length; each row is a vector whose dimension matches the token embedding dimension (768). Note that position embeddings are summed with the token embeddings, not concatenated, so after adding them the shape is still 197x768.
(3) LN / multi-head attention / LN: LayerNorm keeps the shape at 197x768. In multi-head self-attention, the input is first projected to q, k, v. With a single head, q, k, v each have shape 197x768; with 12 heads (768/12 = 64), each head's q, k, v have shape 197x64 and there are 12 such sets. The 12 head outputs are concatenated back to 197x768, followed by another LayerNorm, so the shape remains 197x768.
(4) MLP: the dimension is expanded and then shrunk back: 197x768 is expanded to 197x3072 and projected back to 197x768. (A minimal sketch of these four steps follows below.)
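A minimal PyTorch sketch of steps (1) through (4), tracking the shapes for a 224x224 image with 16x16 patches (a shape sanity check rather than a full ViT implementation):

import torch
import torch.nn as nn

B, C, H, W, P, D = 1, 3, 224, 224, 16, 768    # batch, channels, image size, patch size, hidden dim
x = torch.randn(B, C, H, W)

# (1) patch embedding: a strided conv is equivalent to "split into 16x16 patches + linear projection"
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)
tokens = patch_embed(x).flatten(2).transpose(1, 2)           # [1, 196, 768]
cls_token = nn.Parameter(torch.zeros(1, 1, D)).expand(B, -1, -1)
tokens = torch.cat([cls_token, tokens], dim=1)               # [1, 197, 768]

# (2) learnable 1D position embeddings, added (not concatenated)
pos_embed = nn.Parameter(torch.zeros(1, 197, D))
tokens = tokens + pos_embed                                   # [1, 197, 768]

# (3) LayerNorm + multi-head self-attention (12 heads of dim 768/12 = 64)
ln1 = nn.LayerNorm(D)
attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)
h = ln1(tokens)
attn_out, _ = attn(h, h, h)                                   # [1, 197, 768]
tokens = tokens + attn_out                                    # residual connection

# (4) MLP: expand to 4x the width, then project back
mlp = nn.Sequential(nn.LayerNorm(D), nn.Linear(D, 3072), nn.GELU(), nn.Linear(3072, D))
tokens = tokens + mlp(tokens)                                 # [1, 197, 768]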
References
- Video retrieval: https://www.cnblogs.com/VideoCloudTech/p/17987835
- Surveys
  - https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
  - https://arxiv.org/pdf/2302.10035.pdf
  - https://github.com/wangxiao5791509/MultiModal_BigModels_Survey
- https://github.com/salesforce/LAVIS (a multimodal model library)
- https://zhuanlan.zhihu.com/p/445122996
