Reference: https://www.bilibili.com/video/BV1J441137V6?from=search&seid=10108532718266855888
Paper Notes
The encoder is a stack of 6 identical layers. Each layer has two sub-layers: the first is multi-head self-attention, the second is a position-wise feed-forward network (FFN). Each sub-layer is wrapped in a residual connection followed by layer normalization. d_model = 512, h = 8 (h attention heads).
1. Self-attention
q: query (to match others)
k: key (to be matched)
v: information to be extracted
In self-attention, every query q does attention against every key k; the attention score measures how well the two vectors match.
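A toy sketch of one query attending over three keys (all numbers made up, d_k = 2):
import torch, math

q = torch.tensor([1.0, 0.0])                 # one query
K = torch.tensor([[1.0, 0.0],                # key 1: matches q well
                  [0.0, 1.0],                # key 2: orthogonal to q
                  [0.5, 0.5]])               # key 3: partial match
V = torch.tensor([[10.0], [20.0], [30.0]])   # information attached to each key
scores = K @ q / math.sqrt(2)                # dot products: how well q matches each k
weights = torch.softmax(scores, dim=0)       # normalized attention weights, largest for key 1
output = weights @ V                         # extract a weighted mix of the values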
2. Multi-head attention
Split the original q, k, v into h parts, feed each part to its own self-attention, then concat the h results.
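A quick shape walk-through of the split-and-concat (toy sizes; the real module below does this with learned projections):
import torch

x = torch.randn(6, 10, 512)                            # (batch, len, d_model), d_model = 512
h, d_k = 8, 512 // 8                                   # h heads, each of width d_k = 64
heads = x.view(6, 10, h, d_k).transpose(1, 2)          # (batch, h, len, d_k): h parallel attentions
merged = heads.transpose(1, 2).contiguous().view(6, 10, 512)  # concat the heads back to d_model
assert torch.equal(x, merged)                          # split + concat is lossless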
3. Positional Encoding
No position information in self-attention: self-attention is a global operation with no notion of position; a token x is processed the same way whether another token is adjacent or far away. Since we want the model to take position into account, we add e_i to a_i, where e_i is hand-crafted prior knowledge and differs for every position.
Why add rather than concat? Suppose a one-hot vector p_i whose i-th entry is 1 and all others 0. Concatenating x_i with p_i and multiplying by W is equivalent to a_i + e_i: split W into [W_x | W_p], then W · concat(x_i, p_i) = W_x x_i + W_p p_i = a_i + e_i, where e_i is the i-th column of W_p. The only difference is that here e_i is hand-crafted rather than learned.
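A small numerical check of that equivalence (toy sizes, random W; x, p_i, W_x, W_p are the hypothetical symbols from above):
import torch

d, N, i = 4, 10, 3                        # embedding dim, number of positions, a position index (toy)
x = torch.randn(d)                        # token embedding x_i
p = torch.zeros(N); p[i] = 1.0            # one-hot position vector p_i
W = torch.randn(d, d + N)                 # weight applied to the concatenated vector
W_x, W_p = W[:, :d], W[:, d:]             # block acting on x, block acting on p

lhs = W @ torch.cat([x, p])               # W * concat(x_i, p_i)
rhs = W_x @ x + W_p[:, i]                 # a_i + e_i, where e_i is the i-th column of W_p
assert torch.allclose(lhs, rhs, atol=1e-5)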
4. Decoder
Take translating "机器学习" (machine learning) as an example: first feed the decoder a begin-of-sequence token, then feed its output back in as the next input, and so on.
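A minimal sketch of that loop, assuming a hypothetical decoder(tokens, memory) that returns per-position logits (none of these names come from the code below):
def greedy_decode(decoder, memory, bos_id, eos_id, max_len=50):
    ys = [bos_id]                                  # start with the begin-of-sequence token
    for _ in range(max_len):
        logits = decoder(ys, memory)               # decoder re-reads everything generated so far
        next_token = logits[-1].argmax(-1).item()  # pick the most likely next token
        ys.append(next_token)                      # feed the prediction back in as input
        if next_token == eos_id:
            break
    return ys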
Layer Norm vs. Batch Norm: the Transformer uses Layer Norm. Batch Norm normalizes each feature dimension across the different samples of a batch; Layer Norm normalizes across the feature dimensions of a single sample, and is often used with RNNs.
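A quick illustration of the different normalization axes (toy sizes):
import torch
import torch.nn as nn

x = torch.randn(6, 512)          # (batch, features)
bn = nn.BatchNorm1d(512)         # Batch Norm: per feature, statistics over the 6 samples
ln = nn.LayerNorm(512)           # Layer Norm: per sample, statistics over the 512 features
print(bn(x).shape, ln(x).shape)  # same shape either way; only the statistics differ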
Code Notes
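All snippets below are PyTorch and assume the following imports:
import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F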
0. Cloning modules
def _get_clones(module, N):
    "Produce N independent deep copies of a module (separate weights)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
1. Self-attention: Softmax(QK^T / sqrt(d_k)) V
def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # block masked positions before softmax
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
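A quick shape check of attention() with toy tensors:
q = k = v = torch.randn(2, 8, 10, 64)  # (batch, heads, len, d_k)
out, attn = attention(q, k, v)
print(out.shape)                       # torch.Size([2, 8, 10, 64])
print(attn.shape)                      # torch.Size([2, 8, 10, 10]): one weight per query-key pair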
2. Assemble self-attention into multi-head attention: q, k, v first pass through linear layers, are split into chunks according to the number of heads, and each chunk goes through self-attention (the matrix multiplications above).
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h, dropout=0.1, batch_size=6):
        "Take in model size and number of heads."
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.batch_size = batch_size
        self.d_k = d_model // h
        self.h = h
        # 4 linear layers: 3 to project q/k/v, 1 for the final output
        self.linears = _get_clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        # query, key, value: [batch_size*len, d_model]
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        # 1) Project and split into heads: [batch_size, n_heads, len, d_k]
        query, key, value = \
            [l(x).view(self.batch_size, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
        # 3) "Concat" the heads using a view and apply a final linear: [batch_size*len, d_model]
        x = x.transpose(1, 2).contiguous().view(-1, self.h * self.d_k)
        return self.linears[-1](x)
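A toy usage check (batch_size must match the constructor argument, since the module hard-codes it):
mha = MultiHeadAttention(d_model=512, h=8, batch_size=2)
x = torch.randn(2 * 10, 512)  # flattened (batch*len, d_model), per the comment in forward()
out = mha(x, x, x)            # self-attention: query = key = value
print(out.shape)              # torch.Size([20, 512])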
3. Assemble multi-head attention and the FFN into an encoder layer
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, batch_size=6):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, nhead, dropout=dropout,
                                            batch_size=batch_size)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu

    def forward(self, src):
        q = k = v = src
        src2 = self.self_attn(q, k, v)  # Multi-Head Attention
        src = src + src2                # Add
        src = self.norm1(src)           # Norm
        # Position-wise Feed-Forward Network
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))  # Feed Forward
        src2 = src + src2               # Add
        src = self.norm2(src2)          # Norm
        return src
4. Clone several encoder layers to form the Transformer encoder
class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        "encoder_layer: the layer to clone; num_layers: the number of encoder layers"
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers

    def forward(self, src):
        output = src
        for layer in self.layers:
            output = layer(output)
        return output
5. Absolute positional encoding
class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # pe.shape = (max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # pe is a buffer, so it carries no gradient; just add it to the input
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
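A quick check of the encoding shape (toy length):
pe = PositionalEncoding(d_model=512, dropout=0.0)
x = torch.zeros(10, 512)  # a length-10 sequence of d_model-dim embeddings
print(pe(x).shape)        # torch.Size([10, 512]); row i of pe was added to token i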
6. Put it together as the Transformer encoder
class Transformer(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 dim_feedforward=2048, dropout=0.1, batch_size=6):
        super().__init__()
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, batch_size)
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers)
        self.d_model = d_model
        self.nhead = nhead
        self.pos_encoding = PositionalEncoding(d_model, dropout=dropout)
        # Initialize after all submodules are registered.
        self._reset_parameters()

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src):
        src = self.pos_encoding(src)
        return self.encoder(src)
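An end-to-end smoke test of the encoder (toy input, using the flattened layout assumed by MultiHeadAttention above):
model = Transformer(d_model=512, nhead=8, num_encoder_layers=6, batch_size=2)
src = torch.randn(2 * 10, 512)  # flattened (batch*len, d_model)
out = model(src)
print(out.shape)                # torch.Size([20, 512])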
Reference code:
DETR: https://github.com/facebookresearch/detr/blob/master/models/transformer.py
DETR positional encoding: https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py
BoTNet: PyTorch version Bottleneck Transformers (GitHub)
The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html