attention_mask in MixText

IT_BD_Zhang

Category: Software Engineering Applications and Practice. Tags: deep learning, natural language processing

Article link: https://blog.csdn.net/m0_52073096/article/details/121872326

2021SC@SDUSC

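Masks fall roughly into two types.

Padding mask: in NLP tasks, sentences differ in length, so shorter sequences are usually padded, typically by appending zero vectors. The padded positions should have no effect, but attention applies softmax to the scores, and even a score of 0 contributes to the result (e^0 = 1). The padded positions therefore have to be masked out explicitly. There are two main kinds of padding mask:

· key mask: applied after the scores are computed and before softmax. The scores of padded keys are set to a very large negative number (for example -1e12), so their weights are almost 0 after softmax.

· query mask: applied after softmax, so the corresponding entries can simply be set to 0.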
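The difference between the two masks can be checked with a small sketch in plain Python. The `softmax` helper, the sample scores, and the mask values are illustrative assumptions, not code from MixText; the last key is assumed to be padding.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores of one query over three keys
scores = [2.0, 1.0, 0.0]
key_mask = [1, 1, 0]  # 1 = real token, 0 = padding

# Without a mask the padded key still receives weight, because e^0 = 1
unmasked = softmax(scores)

# Key mask: overwrite padded scores with a large negative number before softmax
masked_scores = [s if keep else -1e12 for s, keep in zip(scores, key_mask)]
masked = softmax(masked_scores)

# Query mask: applied after softmax; a padded query's whole row is zeroed
query_is_real = 0  # assumption: this query position is padding
query_masked_row = [w * query_is_real for w in masked]

print(unmasked)
print(masked)
print(query_masked_row)
```

With the key mask the padded key's weight drops to zero and the remaining weights still sum to 1, which is why it has to happen before softmax rather than after.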

# If no mask is passed in, default to all ones (no position is masked).
if attention_mask is None:
    if input_ids2 is not None:
        attention_mask2 = torch.ones_like(input_ids2)
    attention_mask = torch.ones_like(input_ids)
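The all-ones default treats every position as a real token, so padded batches need a mask supplied by the caller. A minimal sketch of how such a mask might be derived, assuming padding uses token ID 0 (`pad_token_id` and the sample IDs are hypothetical and not taken from MixText):

```python
import torch

pad_token_id = 0  # assumption: padding uses ID 0

# Two hypothetical sequences; the first is padded to length 5
input_ids = torch.tensor([[101, 2023, 102, 0, 0],
                          [101, 2023, 2003, 2307, 102]])

# 1 where the token is real, 0 where it is padding
attention_mask = (input_ids != pad_token_id).long()

# The fallback used when no mask is supplied: nothing is masked
default_mask = torch.ones_like(input_ids)

print(attention_mask)
print(default_mask)
```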

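The dimensions of attention_mask must be compatible with the multi-head hidden_states, so two singleton dimensions are inserted, turning the shape [batch_size, seq_len] into [batch_size, 1, 1, seq_len]: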

# [batch_size, seq_len] -> [batch_size, 1, 1, seq_len]
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)

# Cast to the dtype of the model parameters
extended_attention_mask = extended_attention_mask.to(
    dtype=next(self.parameters()).dtype)

The masked tokens are simply given a weight of -10000, so that they have essentially no effect in self-attention.

extended_attention_mask
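The source breaks off at this point. As a sketch of how the -10000 weight is commonly applied, the following assumes the BERT-style conversion `(1.0 - mask) * -10000.0`, which may differ from the actual MixText line; the tensor sizes and the all-zero `scores` are placeholders chosen for illustration.

```python
import torch

batch_size, num_heads, seq_len = 2, 4, 5

# Hypothetical mask: the first sequence has two padded positions
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# [batch_size, seq_len] -> [batch_size, 1, 1, seq_len]
extended = attention_mask.unsqueeze(1).unsqueeze(2).to(dtype=torch.float32)

# Assumed conversion: 0 for real tokens, -10000 for padded tokens
extended = (1.0 - extended) * -10000.0

# Placeholder attention scores of shape [batch, heads, query_len, key_len]
scores = torch.zeros(batch_size, num_heads, seq_len, seq_len)

# The mask broadcasts over heads and query positions before softmax
probs = torch.softmax(scores + extended, dim=-1)

print(extended.shape)
print(probs[0, 0, 0])
```

Because the mask has singleton dimensions for heads and query positions, one mask per sequence is enough; broadcasting applies it to every head and every query row.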
