Data dimension changes and masks in the Transformer

LBWNB、

Tags: deep learning, artificial intelligence, natural language processing, machine learning, python

Article link: https://blog.csdn.net/qq_38356492/article/details/112570640

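This article describes how data dimensions change through the Encoder and Decoder of the Transformer model, and how masking is applied. In the Encoder, the <pad> positions of the input sequence are masked. In the Decoder, the article treats the Encoder's output as not needing a padding mask, and additionally masks future tokens in Self-Attention so that the target sequence is predicted step by step.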

Some functions we will use

nn.Embedding() and padding_idx

import torch
import torch.nn as nn

a = torch.LongTensor([0,1,2,3,4])
emb = nn.Embedding(10,5,padding_idx=0)
emb(a)

## output:
# tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
#         [ 1.2378,  0.2666,  0.3143, -0.4785, -0.0261],
#         [ 0.9385,  1.8889, -2.1237,  0.9485, -0.5656],
#         [-0.6171,  0.3276, -0.5347,  0.1167, -0.7167],
#         [-0.5722,  1.7916, -2.8614,  0.1669,  1.2874]],
#        grad_fn=<EmbeddingBackward>)

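In this sequence, 0 is the index of <pad>. With padding_idx=0, the embedding vector for that index is initialized to all zeros and receives no gradient updates, so it stays zero throughout training.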
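To make the "no embedding for <pad>" behaviour concrete, here is a minimal sketch that checks both the looked-up vector and the gradient of the padding row. The token ids and the variable names `pad_rows_are_zero` and `pad_grad_is_zero` are made up for illustration and are not from the original article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for reproducibility of this sketch

emb = nn.Embedding(10, 5, padding_idx=0)
tokens = torch.LongTensor([[3, 7, 0, 0]])  # hypothetical sentence padded with index 0

out = emb(tokens)        # shape: [1, 4, 5]
out.sum().backward()     # send a gradient of 1 to every looked-up element

# The two <pad> positions come back as all-zero vectors ...
pad_rows_are_zero = bool((out[0, 2:] == 0).all())
# ... and row 0 of the weight matrix accumulates no gradient.
pad_grad_is_zero = bool((emb.weight.grad[0] == 0).all())
print(pad_rows_are_zero, pad_grad_is_zero)
```

Rows that correspond to real tokens (here ids 3 and 7) still receive a gradient as usual; only the padding row is frozen.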

unsqueeze(1): adding a dimension

a = torch.LongTensor([1,2,3])
print(a.shape)
a = a.unsqueeze(1)
print(a.shape)

# torch.Size([3])
# torch.Size([3, 1])

np.triu builds an upper-triangular matrix: ones in the upper-right part, zeros in the lower-left part

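import numpy as np

print(np.triu(np.ones((1, 8, 8)), k=1).astype('uint8'))
# k=1 starts at the first diagonal above the main one (column index 1 in row 0); the main diagonal and everything below it become 0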
# [[[0 1 1 1 1 1 1 1]
#   [0 0 1 1 1 1 1 1]
#   [0 0 0 1 1 1 1 1]
#   [0 0 0 0 1 1 1 1]
#   [0 0 0 0 0 1 1 1]
#   [0 0 0 0 0 0 1 1]
#   [0 0 0 0 0 0 0 1]
#   [0 0 0 0 0 0 0 0]]]
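A matrix of this shape is commonly used as the "future" mask in the Decoder's Self-Attention: position i may not look at positions greater than i. The sketch below wraps the np.triu call in a helper; the function name `subsequent_mask` and the convention that True means "blocked" are assumptions made here for illustration, not something the original article specifies.

```python
import numpy as np
import torch

def subsequent_mask(size):
    """Return a [1, size, size] boolean mask, True where attention is NOT allowed."""
    upper = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    return torch.from_numpy(upper).bool()

mask = subsequent_mask(4)
print(mask.shape)
print(mask[0])
```

With this convention the mask can be passed straight to masked_fill: row i is True exactly at the columns that come after position i.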

Finding all positions of a specific value in a torch.Tensor

test_tensor = torch.LongTensor([[1,0,1],[0,1,0]])  # example tensor (not in the original)
value = 1
print((test_tensor==value).nonzero())

masked_fill usage

a = torch.LongTensor([[1,2,3],[4,5,6]])
masking = torch.LongTensor([[1,1,1],[0,0,0]])
masking = (masking==0)  # boolean mask: True wherever the original value is 0
a.masked_fill(masking,value=8)
# tensor([[1, 2, 3],
#         [8, 8, 8]])

LayerNorm

a = torch.rand((2,50,1,64))
# normalized_shape here is the input tensor's shape without its first (batch) dimension
layer_norm = nn.LayerNorm(a.size()[1:],eps=1e-6)
layer_norm(a)
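As a quick sanity check of what this LayerNorm does, the sketch below verifies that each sample is normalized to roughly zero mean and unit standard deviation over all non-batch dimensions. The names `per_sample_mean` and `per_sample_std` are introduced here only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a = torch.rand((2, 50, 1, 64))

# Same construction as above: normalize over every dimension except the batch one.
layer_norm = nn.LayerNorm(a.size()[1:], eps=1e-6)
out = layer_norm(a)

# Statistics are computed per sample over the flattened [50, 1, 64] block.
per_sample_mean = out.reshape(2, -1).mean(dim=1)
per_sample_std = out.reshape(2, -1).std(dim=1, unbiased=False)
print(out.shape, per_sample_mean, per_sample_std)
```

Note that many Transformer implementations normalize over the last (embedding) dimension only, e.g. nn.LayerNorm(64); normalizing over the whole [max_len, 1, embed_size] block as done here is the choice made in this article.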

torch.eq

>>> torch.eq(torch.tensor([[1, 2], [3, 4]]), torch.tensor([[1, 1], [4, 4]]))
tensor([[ True, False],
        [False,  True]])
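Combining torch.eq with unsqueeze gives one possible way to build a padding mask from a batch of token ids. This is a sketch under assumptions: `PAD_IDX = 0` and the token ids are hypothetical, and the [batch_size, 1, max_len] layout is just one common convention.

```python
import torch

PAD_IDX = 0  # assumed index of <pad>
tokens = torch.LongTensor([[5, 8, 2, 0, 0],
                           [7, 3, 0, 0, 0]])  # hypothetical batch, [batch_size, max_len]

pad_mask = torch.eq(tokens, PAD_IDX)  # [2, 5], True at <pad> positions
pad_mask = pad_mask.unsqueeze(1)      # [2, 1, 5], broadcasts over query positions
print(pad_mask.shape)
```

The extra dimension lets the same mask be broadcast against a [batch_size, max_len, max_len] score matrix, so every query row ignores the same padded key positions.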

Encoder

Input: batch of sentences with max length = 50

embedding_layer + positional_embedding

[batch_size,max_len,1,embed_size]

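In the embedding layer, we need to set the index corresponding to <pad> to 0. Before the first pass into Multi-head attention, its embedding can then be zeroed directly in the embedding layer (e.g. with padding_idx=0).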
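The shape above could be produced roughly as follows. This is only a sketch: the article does not show how the positional embedding is built or where the extra dimension of size 1 comes from, so the learned `pos_emb` layer, the `unsqueeze(2)` call, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, max_len, vocab_size, embed_size = 2, 50, 100, 64  # illustrative sizes
PAD_IDX = 0

tok_emb = nn.Embedding(vocab_size, embed_size, padding_idx=PAD_IDX)
pos_emb = nn.Embedding(max_len, embed_size)  # assumed: learned positional embedding

tokens = torch.randint(1, vocab_size, (batch_size, max_len))
tokens[:, 40:] = PAD_IDX                  # pretend the last 10 positions are padding

positions = torch.arange(max_len).unsqueeze(0)   # [1, max_len], broadcast over the batch
word_vectors = tok_emb(tokens)                   # [batch_size, max_len, embed_size]
x = word_vectors + pos_emb(positions)            # same shape after broadcasting
x = x.unsqueeze(2)                               # [batch_size, max_len, 1, embed_size]
print(x.shape)
```

In this sketch the <pad> rows of `word_vectors` are zero, but after the positional embedding is added they no longer are, which is one reason the attention scores still need an explicit mask.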

Encoder block(xN stacks)

Input size: [batch_size,max_length,1,embed_size]

Multi-head attention

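After the input has gone through the steps above, we obtain the three matrices Q, K and V. Before applying softmax(), since we do not want to attend to the content at the <pad> positions, we use masked_fill to set the attention scores at all original <pad> positions to -inf, so that softmax() assigns them a weight of 0.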
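The masking step can be sketched for a single attention head as follows. Random tensors stand in for Q, K and V, the learned projections and the split into heads are left out, and the token ids and `PAD_IDX = 0` are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, max_len, d_k = 2, 5, 8  # illustrative sizes
PAD_IDX = 0

tokens = torch.LongTensor([[5, 8, 2, 0, 0],
                           [7, 3, 4, 6, 0]])  # hypothetical padded batch
Q = torch.rand(batch_size, max_len, d_k)
K = torch.rand(batch_size, max_len, d_k)
V = torch.rand(batch_size, max_len, d_k)

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # [batch, max_len, max_len]
pad_mask = (tokens == PAD_IDX).unsqueeze(1)           # [batch, 1, max_len]
scores = scores.masked_fill(pad_mask, float('-inf'))  # block <pad> key positions
weights = F.softmax(scores, dim=-1)                   # masked columns become 0
output = weights @ V                                  # [batch, max_len, d_k]
print(weights[0])
```

Each row of `weights` still sums to 1, but the columns belonging to <pad> tokens contribute nothing to `output`. If an entire row were masked, softmax would return NaN, so every sentence needs at least one non-<pad> token.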

Next, a detailed explanation of Multi_head ...
