UNIMO learns from different modalities of data, including images, texts and image-text
pairs, thus achieving more robust and generalizable
representations for both textual and visual input.
From the perspective of human perception, "intelligence" here means imitating human intelligence; novel model architectures are usually inspired by how humans think.
Humans perceive the world through many modalities, such as sound, vision and language.
For UNIMO, those modalities are images, text, and image-text pairs.
UNIMO learns visual
representations and textual representations simultaneously, and unifies them into the same semantic
space via cross-modal contrastive learning (CMCL)
based on a large-scale corpus of image collections,
text corpus and image-text pairs.
In effect, three kinds of data are fused and contrasted: images, text, and image-text pairs.
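To make CMCL concrete, here is a minimal InfoNCE-style sketch of a cross-modal contrastive loss. This is an illustration of the general idea, not UNIMO's actual implementation (UNIMO additionally augments positive and negative pairs with rewriting and retrieval):

import paddle
import paddle.nn.functional as F

def cmcl_loss(image_feats, text_feats, temperature=0.07):
    """Minimal sketch: matched image-text pairs (the diagonal) are pulled
    together; all other in-batch pairs are pushed apart."""
    image_feats = F.normalize(image_feats, axis=-1)
    text_feats = F.normalize(text_feats, axis=-1)
    # Pairwise cosine similarities scaled by temperature: [batch, batch].
    logits = paddle.matmul(image_feats, text_feats, transpose_y=True) / temperature
    labels = paddle.arange(logits.shape[0])  # true pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(paddle.transpose(logits, [1, 0]), labels)
    return (loss_i2t + loss_t2i) / 2

Because both directions (image-to-text and text-to-image) are averaged, the loss is symmetric across the two modalities.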
import paddle.nn as nn

class UNIMOEmbeddings(nn.Layer):
    # Includes embeddings from word, position and token_type.
    def __init__(self,
                 vocab_size,
                 hidden_size=768,
                 hidden_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=4):
        super(UNIMOEmbeddings, self).__init__()
        # hidden_dropout_prob is accepted for config compatibility;
        # dropout itself is applied in UNIMOModel (see the dump below).
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings,
                                                hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)

    def forward(self, input_ids, token_type_ids, position_ids):
        input_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        # The three embedding tensors are summed element-wise.
        embeddings = input_embeddings + position_embeddings + token_type_embeddings
        return embeddings
Transformer-family models all build their input encodings this way: the three embedding tensors are simply added together.
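As a quick sanity check of the shapes (the vocab size of 18000 matches the dump below):

import paddle

emb = UNIMOEmbeddings(vocab_size=18000)
input_ids = paddle.to_tensor([[5, 7, 9]])      # [batch=1, seq_len=3]
token_type_ids = paddle.zeros_like(input_ids)  # all segment-0 tokens
position_ids = paddle.arange(3).unsqueeze(0)   # [[0, 1, 2]]
out = emb(input_ids, token_type_ids, position_ids)
print(out.shape)  # [1, 3, 768]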
UNIMOLMHeadModel(
  (unimo): UNIMOModel(
    (embeddings): UNIMOEmbeddings(
      (word_embeddings): Embedding(18000, 768, sparse=False)
      (position_embeddings): Embedding(513, 768, sparse=False)
      (token_type_embeddings): Embedding(4, 768, sparse=False)
    )
    (encoder_norm): LayerNorm(normalized_shape=[768], epsilon=1e-05)
    (dropout): Dropout(p=0.1, axis=None, mode=upscale_in_train)
    (encoder): TransformerEncoder(
      (layers): LayerList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiHeadAttention(
            (q_proj): Linear(in_features=768, out_features=768, dtype=float32)
            (k_proj): Linear(in_features=768, out_features=768, dtype=float32)
            (v_proj): Linear(in_features=768, out_features=768, dtype=float32)
            (out_proj): Linear(in_features=768, out_features=768, dtype=float32)
          )
          (linear1): Linear(in_features=768, out_features=3072, dtype=float32)
          (dropout): Dropout(p=0, axis=None, mode=upscale_in_train)
          (linear2): Linear(in_features=3072, out_features=768, dtype=float32)
          (norm1): LayerNorm(normalized_shape=[768], epsilon=1e-05)
          (norm2): LayerNorm(normalized_shape=[768], epsilon=1e-05)
          (dropout1): Dropout(p=0.1, axis=None, mode=upscale_in_train)
          (dropout2): Dropout(p=0.1, axis=None, mode=upscale_in_train)
        )
        (1)-(11): eleven more TransformerEncoderLayer blocks, identical to (0), elided here
      )
    )
  )
  (lm_head): UNIMOLMHead(
    (transform): Linear(in_features=768, out_features=768, dtype=float32)
    (layer_norm): LayerNorm(normalized_shape=[768], epsilon=1e-05)
  )
)
Stepping through the code in a debugger is the clearest way to understand a model definition: what each layer is, how it is structured, what its parameters are, and what its inputs and outputs look like. It is often clearer than the documentation.
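For reference, the dump above can be reproduced by loading the checkpoint and printing the model (assuming PaddleNLP with the unimo-text-1.0 weights):

from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer

model = UNIMOLMHeadModel.from_pretrained('unimo-text-1.0')
tokenizer = UNIMOTokenizer.from_pretrained('unimo-text-1.0')
print(model)  # prints the layer tree shown above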
See https://blog.csdn.net/qq_15821487/article/details/120035220 for a reference on how Python argument passing works.
import paddle

# paddle.gather selects the entries of `input` along `axis` given by `index`.
input = paddle.to_tensor([[1, 2], [3, 4], [5, 6]])
index = paddle.to_tensor([0, 1])
output = paddle.gather(input, index, axis=0)
# expected output: [[1, 2], [3, 4]]
This is how sequence truncation is done, for example. When a function is unclear, just jump into its source code and take a look; read enough source and it sticks naturally.
If jumping into the file does not land on the right function, hold Ctrl and pick the matching one from the pop-up list by comparing the parameters.
The generic generation interface; chatbots use this same function. From its docstring:

The interface for generation tasks. This method generates sequences using a chosen decoding strategy. Currently, three decoding strategies are supported: "greedy_search", "sampling" and "beam_search".
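Continuing from the model and tokenizer loaded above, a minimal generation call might look like the following sketch. The gen_encode argument names follow PaddleNLP's unimo-text examples as I remember them and may differ across versions:

model.eval()
inputs = tokenizer.gen_encode(
    '今天天气真好',                      # source text
    return_tensors=True,
    add_start_token_for_decoding=True)  # appends the decoding start token
ids, scores = model.generate(
    input_ids=inputs['input_ids'],
    token_type_ids=inputs['token_type_ids'],
    position_ids=inputs['position_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=32,
    decode_strategy='beam_search',      # or 'greedy_search' / 'sampling'
    num_beams=4)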
Key ingredients of UNIMO's cross-modal pre-training:
Text retrieval.
Cross-Modal Contrastive Learning (CMCL): joint image-text learning.
Text Rewriting: rewriting text for augmentation.
Text enhances vision, and vision enhances text: the modalities fuse with and enrich each other across the various semantic feature spaces.
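To make text rewriting concrete, here is a purely illustrative example of how rewriting can manufacture extra positive and hard-negative pairs for the contrastive loss sketched earlier. The strings and pairing scheme are hypothetical, not from the UNIMO codebase:

# Purely illustrative augmentation for CMCL pair construction.
caption = 'A man plays a guitar on the beach.'
positive = 'A man is playing a guitar by the sea.'  # paraphrase -> extra positive pair
negative = 'A dog plays a guitar on the beach.'     # swapped detail -> hard negative
# Both augmented pairs feed the same contrastive objective as the original pair.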