UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost
Skipping the preamble; straight to the point:
UniDrop adds three kinds of dropout to the Transformer: feature dropout, structure dropout, and data dropout. The paper argues that each plays a different role in preventing the Transformer from overfitting and in improving model robustness.
Feature Dropout (FD)
In addition to the two dropouts each Transformer layer already has (denoted FD-1 and FD-2), two new ones are added: FD-3 and FD-4.
FD-1 (already in the Transformer)
Using the BERT implementation as an example, since it is easier to follow:
class BertSelfAttention(nn.Module):
    def forward(self, hidden_states, attention_mask=None, head_mask=None):  # other arguments omitted
        ...  # many lines omitted here
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel forward() function).
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)  # Here: what the paper calls FD-1

        # Mask heads if we want to.
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)
FD-1 is the dropout applied to the attention probabilities right after the softmax above.
FD-2 (already in the Transformer)
Again, the code:
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)  # Here: what the paper calls FD-2
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
FD-2 is the dropout applied to the sub-layer output before the residual connection and layer normalization.
FD-3 (proposed in the paper)
Called query, key, value dropout. The idea is that FD-1, applied to the attention weights, may throw away critical information, so dropout is added to Q, K, and V beforehand to mitigate that risk. (Worried about losing information on the attention-weighted QK, so the fix is to drop some of it earlier??)
It sits roughly at the position shown in the paper's figure: Q, K, and V each get their own dropout right after the projections; a minimal sketch is given below.
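For concreteness, here is a tiny single-head sketch of where FD-3 would go, written as a standalone module; the class name, projection layers, and the 0.1 dropout rates are my own assumptions for illustration, not the authors' code.

import math
import torch
import torch.nn as nn

class AttentionWithFD3(nn.Module):
    def __init__(self, d_model=512, p_fd3=0.1, p_fd1=0.1):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.fd3_dropout = nn.Dropout(p_fd3)  # FD-3: dropout on Q, K, V (hypothetical placement)
        self.fd1_dropout = nn.Dropout(p_fd1)  # FD-1: dropout on the attention probabilities

    def forward(self, x):
        # FD-3 is applied right after the Q/K/V projections, before the scores are computed.
        q = self.fd3_dropout(self.q_proj(x))
        k = self.fd3_dropout(self.k_proj(x))
        v = self.fd3_dropout(self.v_proj(x))
        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
        probs = self.fd1_dropout(torch.softmax(scores, dim=-1))  # FD-1, as in the BERT code above
        return torch.matmul(probs, v)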
FD-4 (proposed in the paper)
Called output dropout: dropout is added to the output of the last layer of the Transformer encoder or decoder, right before the downstream task head.
Taking BertForSequenceClassification as an example:
outputs = self.bert(
    input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)  # Here: dropout on the pooled output, the FD-4 position
logits = self.classifier(pooled_output)
So dropout is applied to BERT's pooled output before the classification head.
The authors' own experiments, however, are done on the plain Transformer rather than BERT; a sketch of the same idea in that setting is given below.
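For the plain Transformer setting, FD-4 would look something like the following; the encoder configuration, pooling, and classification head are my own assumptions, and only the placement of the dropout follows the paper's description.

import torch
import torch.nn as nn

d_model, num_classes = 512, 2                    # hypothetical sizes
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
fd4_dropout = nn.Dropout(p=0.1)                  # FD-4: dropout on the last layer's output
classifier = nn.Linear(d_model, num_classes)     # hypothetical downstream head

src = torch.randn(10, 32, d_model)               # (seq_len, batch, d_model), dummy input
hidden = encoder(src)                            # output of the last encoder layer
hidden = fd4_dropout(hidden)                     # FD-4 applied before the downstream task
logits = classifier(hidden.mean(dim=0))          # mean-pool over tokens, then classify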
Structure Dropout
There are three kinds of structure dropout: LayerDrop, DropHead, and HeadMask.
This work adopts LayerDrop (sketched below).
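The UniDrop code is not released yet, so this is my own minimal sketch of the LayerDrop idea (skip an entire layer with probability p during training); the class and parameter names are assumptions.

import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    # Wraps a stack of layers; during training each whole layer is skipped
    # with probability p (the LayerDrop idea).
    def __init__(self, layers, p=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p = p

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(1).item() < self.p:
                continue  # drop this layer entirely; the identity path keeps shapes intact
            x = layer(x)
        return x

layers = [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(6)]
encoder = LayerDropEncoder(layers, p=0.2)
out = encoder(torch.randn(10, 32, 512))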
Data Dropout
Each word in a sentence is randomly deleted with a pre-set probability p. It is applied at the input layer, so the corresponding word embeddings are simply dropped; a sketch follows below.
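A minimal sketch of data dropout as I read it, applied to a token list before the embedding lookup; the function name and the guard for empty output are my own additions.

import random

def data_dropout(tokens, p=0.1):
    # Delete each token independently with probability p, so the corresponding
    # word embeddings never reach the model.
    kept = [tok for tok in tokens if random.random() >= p]
    # Guard against dropping everything in a very short sentence.
    return kept if kept else tokens

print(data_dropout(["the", "cat", "sat", "on", "the", "mat"], p=0.2))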
—————————————————————————————————————————————
The theoretical formulation and the experimental analysis are skipped here.
The code has not been released yet; whether it actually helps, I'll report back after trying it on my own project. That's it, dismissed!