UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost
Skipping the preamble; straight to the point:
UniDrop adds three kinds of dropout to the Transformer: feature dropout, structure dropout, and data dropout. The paper argues that each plays a different role in preventing the Transformer from overfitting and in improving model robustness.
Feature Dropout (FD)
In addition to the two dropouts each Transformer layer already has (denoted FD-1 and FD-2), two new ones are added: FD-3 and FD-4.
FD-1 (already in the Transformer)
Using the BERT implementation as an example, since it is easier to follow:
class BertSelfAttention(nn.Module):
    def forward(self, hidden_states, attention_mask=None, head_mask=None):  # other arguments omitted
        ...  # many lines omitted here
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel forward() function).
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)  # Here: what the paper calls FD-1

        # Mask heads if we want to.
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)
FD-1 is the dropout applied to the attention probabilities right after the softmax above.
FD-2 (already in the Transformer)
Again, the code:
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)  # Here: what the paper calls FD-2
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
FD-2 is the dropout applied to the sub-layer output before the residual connection and layer normalization.
FD-3 (proposed in the paper)
Called query, key, value dropout. The idea is that FD-1, applied to the attention weights, may throw away critical information, so dropout is added to Q, K, and V beforehand to mitigate that risk. (Worried about losing information on the attention-weighted QK, so the fix is to drop some of it earlier??)
It sits roughly at the position shown in the paper's figure: Q, K, and V each get their own dropout right after the projections; a minimal sketch is given below.
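For concreteness, here is a tiny single-head sketch of where FD-3 would go, written as a standalone module; the class name, projection layers, and the 0.1 dropout rates are my own assumptions for illustration, not the authors' code.

import math
import torch
import torch.nn as nn

class AttentionWithFD3(nn.Module):
    def __init__(self, d_model=512, p_fd3=0.1, p_fd1=0.1):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.fd3_dropout = nn.Dropout(p_fd3)  # FD-3: dropout on Q, K, V (hypothetical placement)
        self.fd1_dropout = nn.Dropout(p_fd1)  # FD-1: dropout on the attention probabilities

    def forward(self, x):
        # FD-3 is applied right after the Q/K/V projections, before the scores are computed.
        q = self.fd3_dropout(self.q_proj(x))
        k = self.fd3_dropout(self.k_proj(x))
        v = self.fd3_dropout(self.v_proj(x))
        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
        probs = self.fd1_dropout(torch.softmax(scores, dim=-1))  # FD-1, as in the BERT code above
        return torch.matmul(probs, v)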
FD-4 (proposed in the paper)
Called output dropout: dropout is added to the output of the last layer of the Transformer encoder or decoder, right before the downstream task head.
Taking BertForSequenceClassification as an example:
outputs = self.bert(
    input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)  # Here: dropout on the pooled output, the FD-4 position
logits = self.classifier(pooled_output)
So dropout is applied to BERT's pooled output before the classification head.
The authors' own experiments, however, are done on the plain Transformer rather than BERT; a sketch of the same idea in that setting is given below.
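For the plain Transformer setting, FD-4 would look something like the following; the encoder configuration, pooling, and classification head are my own assumptions, and only the placement of the dropout follows the paper's description.

import torch
import torch.nn as nn

d_model, num_classes = 512, 2                    # hypothetical sizes
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
fd4_dropout = nn.Dropout(p=0.1)                  # FD-4: dropout on the last layer's output
classifier = nn.Linear(d_model, num_classes)     # hypothetical downstream head

src = torch.randn(10, 32, d_model)               # (seq_len, batch, d_model), dummy input
hidden = encoder(src)                            # output of the last encoder layer
hidden = fd4_dropout(hidden)                     # FD-4 applied before the downstream task
logits = classifier(hidden.mean(dim=0))          # mean-pool over tokens, then classify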
Structure Dropout
There are three kinds of structure dropout: LayerDrop, DropHead, and HeadMask.
This work adopts LayerDrop (sketched below).
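The UniDrop code is not released yet, so this is my own minimal sketch of the LayerDrop idea (skip an entire layer with probability p during training); the class and parameter names are assumptions.

import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    # Wraps a stack of layers; during training each whole layer is skipped
    # with probability p (the LayerDrop idea).
    def __init__(self, layers, p=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p = p

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(1).item() < self.p:
                continue  # drop this layer entirely; the identity path keeps shapes intact
            x = layer(x)
        return x

layers = [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(6)]
encoder = LayerDropEncoder(layers, p=0.2)
out = encoder(torch.randn(10, 32, 512))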
Data Dropout
Each word in a sentence is randomly deleted with a pre-set probability p. It is applied at the input layer, so the corresponding word embeddings are simply dropped; a sketch follows below.
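A minimal sketch of data dropout as I read it, applied to a token list before the embedding lookup; the function name and the guard for empty output are my own additions.

import random

def data_dropout(tokens, p=0.1):
    # Delete each token independently with probability p, so the corresponding
    # word embeddings never reach the model.
    kept = [tok for tok in tokens if random.random() >= p]
    # Guard against dropping everything in a very short sentence.
    return kept if kept else tokens

print(data_dropout(["the", "cat", "sat", "on", "the", "mat"], p=0.2))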
—————————————————————————————————————————————
The theoretical formulation and the experimental analysis are skipped here.
The code has not been released yet; whether it actually helps, I'll report back after trying it on my own project. That's it, dismissed!