72_text_generation\unimo-text Understanding

By jointly learning from images, text, and image-text pairs, the UNIMO model builds strong cross-modal representations that unify visual and textual features in the same semantic space. It uses Transformer encoder layers with multi-head attention and cross-modal contrastive learning over a large-scale corpus to improve generalization, and it applies to tasks such as image retrieval, text understanding, and semantic fusion.

UNIMO learns from different modalities of data, including images, texts and image-text pairs, thus achieving more robust and generalizable representations for both textual and visual input.
From the standpoint of human cognition, so-called intelligence means imitating human intelligence; innovative model architectures are generally inspired by how humans perceive and think.
Humans perceive the world through many modalities, such as sound, vision and language.
Images, text, and image-text pairs.
UNIMO learns visual representations and textual representations simultaneously, and unifies them into the same semantic space via cross-modal contrastive learning (CMCL) based on a large-scale corpus of image collections, text corpus and image-text pairs.
In effect there are three sources being fused and contrasted: images, text, and image-text pairs.
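UNIMO's actual training code is not shown here, but a minimal sketch of an InfoNCE-style cross-modal contrastive objective conveys the idea of pulling paired image and text representations together in one space. The function name cmcl_loss, the temperature value, and the assumption of one positive pair per batch row are my own illustrative choices, not UNIMO's implementation.

import paddle
import paddle.nn.functional as F

def cmcl_loss(image_feats, text_feats, temperature=0.07):
    # Illustrative InfoNCE-style contrastive loss over paired image/text
    # features of shape [batch, hidden]; not UNIMO's exact training code.
    image_feats = F.normalize(image_feats, axis=-1)   # move to cosine-similarity space
    text_feats = F.normalize(text_feats, axis=-1)

    # Similarity matrix: entry (i, j) compares image i with text j
    logits = paddle.matmul(image_feats, text_feats, transpose_y=True) / temperature

    # Positives sit on the diagonal: image i is paired with text i
    labels = paddle.arange(logits.shape[0])

    # Symmetric loss over image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(paddle.transpose(logits, [1, 0]), labels)
    return (loss_i2t + loss_t2i) / 2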

import paddle.nn as nn


class UNIMOEmbeddings(nn.Layer):
    """Combines word, position and token_type embeddings."""

    def __init__(self,
                 vocab_size,
                 hidden_size=768,
                 hidden_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=4):
        super(UNIMOEmbeddings, self).__init__()
        # Three lookup tables, all projecting into the same hidden_size space.
        # hidden_dropout_prob is kept for config compatibility; dropout itself
        # is applied in UNIMOModel (see the printed structure below).
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings,
                                                hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)

    def forward(self, input_ids, token_type_ids, position_ids):
        input_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        # The three embeddings are simply added element-wise
        embeddings = input_embeddings + position_embeddings + token_type_embeddings
        return embeddings

All Transformer-family encoders build their input this way: the word, position and token-type embeddings are summed element-wise into a single vector per token.
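As a quick sanity check of that sum, here is a small usage sketch; the batch size, sequence length and vocabulary size are arbitrary illustrative values, not the real configuration.

import paddle

emb = UNIMOEmbeddings(vocab_size=18000)

input_ids = paddle.randint(0, 18000, shape=[2, 8])            # [batch, seq_len]
position_ids = paddle.arange(8).unsqueeze(0).expand([2, 8])   # 0..seq_len-1 per row
token_type_ids = paddle.zeros([2, 8], dtype='int64')          # single segment

out = emb(input_ids, token_type_ids, position_ids)
print(out.shape)   # [2, 8, 768]: one summed vector per token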

UNIMOLMHeadModel(
  (unimo): UNIMOModel(
    (embeddings): UNIMOEmbeddings(
      (word_embeddings): Embedding(18000, 768, sparse=False)
      (position_embeddings): Embedding(513, 768, sparse=False)
      (token_type_embeddings): Embedding(4, 768, sparse=False)
    )
    (encoder_norm): LayerNorm(normalized_shape=[768], epsilon=1e-05)
    (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
    (encoder): TransformerEncoder(
      (layers): LayerList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiHeadAttention(
            (q_proj): Linear(in_features=768, out_features=768, dtype=float32)
            (k_proj): Linear(in_features=768, out_features=768, dtype=float32)
            (v_proj): Linear(in_features=768, out_features=768, dtype=float32)
            (out_proj): Linear(in_features=768, out_features=768, dtype=float32)
          )
          (linear1): Linear(in_features=768, out_features=3072, dtype=float32)
          (dropout): Dropout(p=0, axis=None, mode=upscale_in_train)
          (linear2): Linear(in_features=3072, out_features=768, dtype=float32)
          (norm1): LayerNorm(normalized_shape=[768], epsilon=1e-05)
          (norm2): LayerNorm(normalized_shape=[768], epsilon=1e-05)
          (dropout1): Dropout(p=0.1, axis=None, mode=upscale_in_train)
          (dropout2): Dropout(p=0.1, axis=None, mode=upscale_in_train)
        )
        (1)-(11): 11 more TransformerEncoderLayer blocks with exactly the same structure as layer (0)
      )
    )
  )
  (lm_head): UNIMOLMHead(
    (transform): Linear(in_features=768, out_features=768, dtype=float32)
    (layer_norm): LayerNorm(normalized_shape=[768], epsilon=1e-05)
  )
)


Stepping through the code in a debugger is the clearest way to see how the model structure is defined: what each layer is, what its structure and parameters are, and what its inputs and outputs look like. It is even clearer than the documentation.
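The printed structure above can be reproduced by loading the pretrained checkpoint from PaddleNLP and printing the model (assuming paddlenlp is installed; 'unimo-text-1.0' is the checkpoint name distributed with PaddleNLP, which matches the 18000-token vocabulary and 12 encoder layers shown above).

from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer

# Downloads the weights on first use and rebuilds the structure printed above
model = UNIMOLMHeadModel.from_pretrained('unimo-text-1.0')
tokenizer = UNIMOTokenizer.from_pretrained('unimo-text-1.0')

print(model)   # same layer-by-layer dump as shown above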

https://blog.csdn.net/qq_15821487/article/details/120035220 — reference on how Python argument passing works.

import paddle

# paddle.gather selects the rows of `input` (axis=0) at the given indices
input = paddle.to_tensor([[1, 2], [3, 4], [5, 6]])
index = paddle.to_tensor([0, 1])
output = paddle.gather(input, index, axis=0)
# expected output: [[1, 2], [3, 4]]

Truncation: when a function is unclear, just jump into its source code and take a look; after reading enough of them, you remember them naturally.
If you cannot find the underlying function inside the file, hold Ctrl and pick the matching function from the pop-up list by comparing the parameters.
This is the general-purpose generation interface; chatbots use the same one.

    The interface for generation task. This method can generate sequences
    by using decoding strategy. Currently, there are three decoding
    strategies supported: "greedy_search", "sampling" and "beam_search".
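A small usage sketch of that interface, reusing the model and tokenizer loaded above; the prompt text and decoding parameters are illustrative, and the preprocessing follows the PaddleNLP unimo-text examples (check your PaddleNLP version for the exact argument names).

# Illustrative Chinese prompt; gen_encode prepares input_ids, token_type_ids,
# position_ids and attention_mask for generation.
inputs = tokenizer.gen_encode("深度学习与自然语言处理",
                              return_tensors=True,
                              add_start_token_for_decoding=True)

# decode_strategy can be "greedy_search", "sampling" or "beam_search"
ids, scores = model.generate(**inputs,
                             decode_strategy='beam_search',
                             num_beams=4,
                             max_length=64)

print(tokenizer.convert_ids_to_string(ids[0].numpy().tolist()))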

text retrieval (text-based retrieval)

Cross-Modal Contrastive Learning (joint image-text fusion learning)

Text Rewriting (text rewriting and augmentation)

Text Enhance Vision

The modalities fuse and reinforce one another across their semantic feature spaces.
