PyTorch Implementation of the Transformer
Transformer Architecture
The Transformer model consists of two parts, an encoder and a decoder. The decoder's output is passed through a linear layer, and a softmax is then computed over the result.
d_model = 512 # Embedding Size
d_ff = 2048 # FeedForward dimension
d_k = d_v = 64 # dimension of K(=Q), V
n_layers = 6 # number of Encoder and Decoder layers
n_heads = 8 # number of heads in Multi-Head Attention
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.projection = nn.Linear(d_model, tgt_vocab_size, bias=False)

    def forward(self, enc_inputs, dec_inputs):
        enc_outputs, enc_self_attns = self.encoder(enc_inputs)
        dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(dec_inputs, enc_inputs, enc_outputs)
        dec_logits = self.projection(dec_outputs)  # dec_logits: [batch_size, tgt_len, tgt_vocab_size]
        return dec_logits.view(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns
Instantiate a Transformer model. Its inputs are enc_inputs with shape [batch_size, src_len] and dec_inputs with shape [batch_size, tgt_len]:
model = Transformer()
outputs, enc_self_attns, dec_self_attns, dec_enc_attns = model(enc_inputs, dec_inputs)
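Because the returned logits are already flattened to [batch_size * tgt_len, tgt_vocab_size], they can be fed directly to nn.CrossEntropyLoss. A minimal training-step sketch, assuming dec_outputs holds the target token indices with shape [batch_size, tgt_len] (the names dec_outputs, criterion and optimizer here are illustrative assumptions, not taken from the code above):

# Hypothetical training step; dec_outputs (target token ids) is assumed to come
# from the same data-preparation step that produced enc_inputs and dec_inputs.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

outputs, _, _, _ = model(enc_inputs, dec_inputs)   # [batch_size * tgt_len, tgt_vocab_size]
loss = criterion(outputs, dec_outputs.view(-1))    # targets flattened to [batch_size * tgt_len]
optimizer.zero_grad()
loss.backward()
optimizer.step()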
Encoder
The Encoder consists of an embedding stage followed by 6 EncoderLayers. The input to the first EncoderLayer is enc_outputs, the sum of the word embedding and the positional embedding, with shape [batch_size, seq_len, d_model], i.e. (1, 5, 512).
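For reference, here is a minimal sketch of the sinusoidal positional encoding that gets added to the word embedding. It follows the standard formulation from "Attention Is All You Need"; the class name PositionalEncoding and the max_len parameter are assumptions for illustration, not names defined earlier in this post.

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Sketch: precompute a sin/cos position table and add it to the input embedding.
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))    # [1, max_len, d_model]

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]; add the matching slice of the position table
        return x + self.pe[:, :x.size(1), :]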
Attention_mask: [batch_size, seq_len_q, seq_len], i.e. (1,