Multimodal Large Models

Application Directions

  • Vision + NLP
  • Visual question answering (VQA)
  • Visual commonsense reasoning (VCR)
  • Image region grounding
  • Image retrieval

For individual users facing a huge amount of online video, being able to find videos of interest quickly and accurately from a keyword or a description matters a great deal. The same need exists for searching one's own recordings stored on a phone or in a personal cloud drive.
For video editors and production teams, searching a large media asset library for the clips or footage they need is routine daily work. Precise and efficient video retrieval lets them locate matching material in a short time and substantially improves creative efficiency.
Natural-Language Video Retrieval Based on Large Models
Multimodal representation models can convert text, images, audio, and video into vector representations in a high-dimensional space, also known as embeddings. These embeddings capture the semantics of the content and map it into a continuous vector space in which semantically similar items end up close to each other.
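A minimal retrieval sketch of this idea, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint (the frame file names are hypothetical): embed a text query and candidate video frames into the same space, then rank the frames by cosine similarity.

# Sketch: rank candidate frames against a text query using CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

frames = [Image.open(p) for p in ["frame_001.jpg", "frame_002.jpg"]]  # hypothetical frame files
query = "a dog playing on the beach"

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=frames, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))

# L2-normalize so the dot product equals cosine similarity
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)  # one score per frame; higher = more similar
best_frame = scores.argmax().item()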

The current mainstream approach to vision + language multimodal large models is to take a pretrained large language model and a pretrained image encoder and connect them through an image-text feature alignment module, so that the language model can understand image features and perform deeper question answering and reasoning.
This reuses single-modality models trained on abundant single-modality data, reduces the dependence on high-quality image-text pairs, and bridges the representations of the two modalities through feature alignment, instruction tuning, and similar techniques; a toy sketch of such an alignment module follows.
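The sketch below is purely illustrative (it is not the implementation of any specific model, and the dimensions are made up): patch features from a frozen image encoder are projected into the LLM's embedding space and prepended to the text embeddings.

# Illustrative alignment module: project frozen vision features into the LLM token space.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM token-embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: [batch, num_patches, vision_dim]
        return self.proj(vision_feats)  # [batch, num_patches, llm_dim]

# Usage: concatenate projected image tokens with the text embeddings before the LLM.
projector = VisionToLLMProjector()
image_feats = torch.randn(1, 256, 1024)   # e.g. ViT patch features (hypothetical shape)
text_embeds = torch.randn(1, 32, 4096)    # embeddings of the tokenized text prompt
llm_inputs = torch.cat([projector(image_feats), text_embeds], dim=1)  # fed to the (frozen) LLM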

| Model / Framework | Modalities | Encoder / Technique | Source / Notes |
| --- | --- | --- | --- |
| CLIP | NLP + Vision | ViT / ResNet | OpenAI, open source |
| Video-LLaVA | NLP + Vision | - | unspecified |
| Chinese-CLIP | NLP + Vision | ViT / ResNet | DAMO Academy |
| BLIP-2 | NLP + Vision | ViT | Salesforce |
| InstructBLIP | NLP + Vision | - | unspecified |
| MiniGPT-4 / VideoChat | NLP + Vision | EfficientNet | unspecified |
| ALIGN | NLP + Vision | EfficientNet | encoder uses depthwise convolutions |
| Data2Vec | Text / Audio / Vision | ViT | Meta |
| X-LLM | Cross-modal | Large language model | unspecified |

CLIP

https://openai.com/research/clip
OpenAI codebase: https://github.com/openai/CLIP/blob/main/clip/model.py#L94
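A minimal usage sketch assuming the open-source clip package from the repository above (the image path is hypothetical); it mirrors the repository's README example:

# pip install git+https://github.com/openai/CLIP.git
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)  # similarity distribution over the three captions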

Image Encoder

CLIP-ResNet

The stem of CLIP's ModifiedResNet (clip/model.py) uses three 3x3 convolutions, each followed by BatchNorm and ReLU, and a 2x2 average pool, instead of the usual single 7x7 convolution:

conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)   # + BatchNorm2d + ReLU
conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)    # + BatchNorm2d + ReLU
conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)         # + BatchNorm2d + ReLU
avgpool = nn.AvgPool2d(2)

CLIP-ViT

CLIPVisionTower(
  (vision_tower): CLIPVisionModel(
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(257, 1024)
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-23): 24 x CLIPEncoderLayer(
            (self_attn): CLIPAttention(
              (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
              (activation_fn): QuickGELUActivation()
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            )
            (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
  )
)
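The printout above corresponds to the ViT-L/14 vision tower (24 layers, hidden size 1024, 14x14 patches, 257 position embeddings); assuming the transformers library, it can be reproduced with:

from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
print(vision_tower)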

Chinese-CLIP (text encoder + ViT vision encoder)

ChineseCLIPModel(
  (text_model): ChineseCLIPTextModel(
    (embeddings): ChineseCLIPTextEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ChineseCLIPTextEncoder(
      (layer): ModuleList(
        (0-11): 12 x ChineseCLIPTextLayer(
          (attention): ChineseCLIPTextAttention(
            (self): ChineseCLIPTextSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ChineseCLIPTextSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): ChineseCLIPTextIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): ChineseCLIPTextOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (vision_model): ChineseCLIPVisionTransformer(
    (embeddings): ChineseCLIPVisionEmbeddings(
      (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
      (position_embedding): Embedding(50, 768)
    )
    (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (encoder): ChineseCLIPVisionEncoder(
      (layers): ModuleList(
        (0-11): 12 x ChineseCLIPVisionLayer(
          (self_attn): ChineseCLIPVisionAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): ChineseCLIPVisionMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (visual_projection): Linear(in_features=768, out_features=512, bias=False)
  (text_projection): Linear(in_features=768, out_features=512, bias=False)
)
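A minimal usage sketch with the transformers ChineseCLIP classes (the checkpoint name is an assumption and its exact dimensions may differ from the default-config printout above; the image path is hypothetical):

import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

inputs = processor(text=["一只猫", "一只狗"], images=Image.open("example.jpg"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text match probabilities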

BLIP-2

  • Model structure
Source code: src/transformers/models/blip_2/modeling_blip_2.py
Blip2VisionModel
Blip2VisionModel(
  (embeddings): Blip2VisionEmbeddings(
    (patch_embedding): Conv2d(3, 1408, kernel_size=(14, 14), stride=(14, 14))
  )
  (encoder): Blip2Encoder(
    (layers): ModuleList(
      (0-38): 39 x Blip2EncoderLayer(
        (self_attn): Blip2Attention(
          (dropout): Dropout(p=0.0, inplace=False)
          (qkv): Linear(in_features=1408, out_features=4224, bias=True)
          (projection): Linear(in_features=1408, out_features=1408, bias=True)
        )
        (layer_norm1): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
        (mlp): Blip2MLP(
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1408, out_features=6144, bias=True)
          (fc2): Linear(in_features=6144, out_features=1408, bias=True)
        )
        (layer_norm2): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (post_layernorm): LayerNorm((1408,), eps=1e-06, elementwise_affine=True)
)
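The Blip2VisionModel above (39 layers, hidden size 1408) is the ViT-g vision tower used inside the full BLIP-2 pipeline. A minimal captioning sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint and a hypothetical image file:

import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

inputs = processor(images=Image.open("example.jpg"), return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]  # generated caption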

ALIGN

https://arxiv.org/pdf/2102.05918.pdf
https://drive.weixin.qq.com/s?k=ALsAZwePAAYFqQHIPVAF8ABAZPAA8
AlignVisionModel(
  (embeddings): AlignVisionEmbeddings(
    (padding): ZeroPad2d((0, 1, 0, 1))
    (convolution): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=valid, bias=False)
    (batchnorm): BatchNorm2d(64, eps=0.001, momentum=0.99, affine=True, track_running_stats=True)
    (activation): SiLU()
  )
  (encoder): AlignVisionEncoder(
    (blocks): ModuleList(
      (0): AlignVisionBlock(
        (depthwise_conv): AlignVisionDepthwiseLayer(
          (depthwise_conv_pad): ZeroPad2d((0, 1, 0, 1))
          (depthwise_conv): AlignVisionDepthwiseConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=same, groups=64, bias=False)
          (depthwise_norm): BatchNorm2d(64, eps=0.001, momentum=0.99, affine=True, track_running_stats=True)
          (depthwise_act): SiLU()
        )
        (squeeze_excite): AlignVisionSqueezeExciteLayer(
          (squeeze): AdaptiveAvgPool2d(output_size=1)
          (reduce): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1), padding=same)
          (expand): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), padding=same)
          (act_reduce): SiLU()
          (act_expand): Sigmoid()
        )
        (projection): AlignVisionFinalBlockLayer(
          (project_conv): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), padding=same, bias=False)
          (project_bn): BatchNorm2d(32, eps=0.001, momentum=0.99, affine=True, track_running_stats=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
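The printout above is truncated after the first block. The AlignVisionDepthwiseConv2d it contains is a standard depthwise convolution (one spatial filter per channel, with groups equal to the channel count), the EfficientNet building block noted in the comparison table; a minimal sketch of the idea:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)  # per-channel spatial filtering
pointwise = nn.Conv2d(64, 32, kernel_size=1, bias=False)                        # 1x1 projection that mixes channels
y = pointwise(depthwise(x))  # shape: [1, 32, 56, 56]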

Data2Vec

Text encoder

Data2VecTextModel(
  (embeddings): Data2VecTextForTextEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768, padding_idx=1)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): Data2VecTextEncoder(
    (layer): ModuleList(
      (0-11): 12 x Data2VecTextLayer(
        (attention): Data2VecTextAttention(
          (self): Data2VecTextSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): Data2VecTextSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): Data2VecTextIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecTextOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): Data2VecTextPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

Audio encoder

Data2VecAudioModel(
  (feature_extractor): Data2VecAudioFeatureEncoder(
    (conv_layers): ModuleList(
      (0): Data2VecAudioConvLayer(
        (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
        (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation): GELUActivation()
      )
      (1-4): 4 x Data2VecAudioConvLayer(
        (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
        (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation): GELUActivation()
      )
      (5-6): 2 x Data2VecAudioConvLayer(
        (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
        (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (activation): GELUActivation()
      )
    )
  )
  (feature_projection): Data2VecAudioFeatureProjection(
    (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (projection): Linear(in_features=512, out_features=768, bias=True)
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (encoder): Data2VecAudioEncoder(
    (pos_conv_embed): Data2VecAudioPositionalConvEmbedding(
      (layers): ModuleList(
        (0-4): 5 x Data2VecAudioPositionalConvLayer(
          (conv): Conv1d(768, 768, kernel_size=(19,), stride=(1,), padding=(9,), groups=16)
          (padding): Data2VecAudioPadLayer()
          (activation): GELUActivation()
          (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=False)
        )
      )
    )
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x Data2VecAudioEncoderLayer(
        (attention): Data2VecAudioAttention(
          (k_proj): Linear(in_features=768, out_features=768, bias=True)
          (v_proj): Linear(in_features=768, out_features=768, bias=True)
          (q_proj): Linear(in_features=768, out_features=768, bias=True)
          (out_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
        (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (feed_forward): Data2VecAudioFeedForward(
          (intermediate_dropout): Dropout(p=0.1, inplace=False)
          (intermediate_dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
          (output_dense): Linear(in_features=3072, out_features=768, bias=True)
          (output_dropout): Dropout(p=0.1, inplace=False)
        )
        (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)

Vision (image) encoder

Data2VecVisionModel(
  (embeddings): Data2VecVisionEmbeddings(
    (patch_embeddings): Data2VecVisionPatchEmbeddings(
      (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    )
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (encoder): Data2VecVisionEncoder(
    (layer): ModuleList(
      (0): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Identity()
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (1): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.00909090880304575)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (2): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.0181818176060915)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (3): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.027272727340459824)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (4): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.036363635212183)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (5): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.045454543083906174)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (6): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.054545458406209946)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (7): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.06363636255264282)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (8): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.0727272778749466)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (9): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.08181818574666977)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (10): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.09090909361839294)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (11): Data2VecVisionLayer(
        (attention): Data2VecVisionAttention(
          (attention): Data2VecVisionSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=False)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): Data2VecVisionSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): Data2VecVisionIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): Data2VecVisionOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (drop_path): Data2VecVisionDropPath(p=0.10000000149011612)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
    )
  )
  (layernorm): Identity()
)
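A sketch showing how the three Data2Vec encoders above can be loaded from transformers (the checkpoint names are assumptions); note that Data2Vec shares a self-supervised training objective across modalities, but the three encoders are separate models rather than one joint model:

from transformers import Data2VecAudioModel, Data2VecTextModel, Data2VecVisionModel

text_encoder = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")
audio_encoder = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base-960h")
vision_encoder = Data2VecVisionModel.from_pretrained("facebook/data2vec-vision-base")
# Each encoder produces 768-dim hidden states for its own modality.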

Joint tokenization of text and images

  1. The text prompt is extended with image placeholder tokens according to the number of input images.

  2. The prompt text and the image placeholders are tokenized jointly, producing token IDs (a simplified sketch of this step follows below):

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
input_ids shape: [1, 81]
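A simplified sketch of what such a joint tokenization step does, modeled on LLaVA's tokenizer_image_token (the real implementation handles the BOS token and separators more carefully):

# Split the prompt on the image placeholder, tokenize the text chunks,
# and splice a sentinel image-token id between them.
import torch

IMAGE_TOKEN_INDEX = -200        # sentinel id, later replaced by projected image features
IMAGE_PLACEHOLDER = "<image>"

def tokenize_with_image(prompt: str, tokenizer) -> torch.Tensor:
    chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(IMAGE_PLACEHOLDER)]
    input_ids = chunks[0]
    for chunk in chunks[1:]:
        # drop the chunk's leading BOS token (assuming the tokenizer prepends one)
        input_ids = input_ids + [IMAGE_TOKEN_INDEX] + chunk[1:]
    return torch.tensor(input_ids, dtype=torch.long)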

Vision Transformer

(1) Patch embedding: with a 224x224 input image split into fixed 16x16 patches, each image yields 224x224 / (16x16) = 196 patches, so the input sequence length is 196. Each patch has dimension 16x16x3 = 768, and the linear projection layer has shape 768xN (N = 768), so after the projection the shape is still 196x768: 196 tokens, each of dimension 768. A special [CLS] token is then prepended, giving a final shape of 197x768. At this point the patch embedding has turned a vision problem into a sequence-to-sequence problem.
(2) Positional encoding (standard learnable 1D position embeddings): ViT also adds position encodings. The position encoding can be thought of as a table with N rows, where N equals the input sequence length; each row is a vector whose dimension matches the token embedding dimension (768). Note that the position encoding is added (summed), not concatenated, so the shape remains 197x768 (see the PyTorch sketch after this list).
(3) LN / multi-head attention / LN: the LayerNorm output is still 197x768. In multi-head self-attention, the input is first projected to q, k, v. With a single head, q, k, and v each have shape 197x768; with 12 heads (768/12 = 64), each head's q, k, v have shape 197x64, and there are 12 such groups. The 12 heads' outputs are concatenated back to 197x768, followed by another LayerNorm, keeping the shape at 197x768.
(4) MLP: the dimension is expanded and then reduced back: 197x768 is expanded to 197x3072 and then reduced to 197x768.
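The dimension bookkeeping in steps (1) and (2) can be checked with a few lines of PyTorch (a sketch, not any particular ViT implementation):

# 224x224 image -> 196 patch tokens -> prepend [CLS] -> add position embeddings -> 197 x 768
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)      # linear projection of 16x16x3 patches
tokens = patch_embed(img).flatten(2).transpose(1, 2)             # [1, 196, 768]

cls_token = nn.Parameter(torch.zeros(1, 1, 768))
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))               # learnable 1D position embeddings

x = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1)      # [1, 197, 768]
x = x + pos_embed                                                 # position info is added, not concatenated
print(x.shape)                                                    # torch.Size([1, 197, 768])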

References

  • Video retrieval: https://www.cnblogs.com/VideoCloudTech/p/17987835
  • Surveys:
    • https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
    • https://arxiv.org/pdf/2302.10035.pdf
    • https://github.com/wangxiao5791509/MultiModal_BigModels_Survey
  • https://github.com/salesforce/LAVIS: a multimodal model library
  • https://zhuanlan.zhihu.com/p/445122996