Learning the Transformer from Source Code: English→German Translation (a PyTorch Implementation)


For the theory behind the Transformer, I suggest reading the paper or other people's blog posts.
Recommended:

  1. Blog 1
  2. A very detailed explanation of the mask mechanism in the Transformer
  3. A layer-by-layer breakdown of Self-Attention, MultiHead-Attention and Masked-Attention
  4. An intuitive explanation of Q, K and V in the Transformer
  5. A beginner-friendly, in-depth walkthrough of the Transformer

Model Code

I made a few modifications to, and wrapped, the reference code.

Transformer model (model.py)

import math
import torch
import torch.nn as nn
import numpy as np
import argparse

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--d_k", type=int, default=64, help="dim of attention's Key")
parser.add_argument("--d_v", type=int, default=64, help="dim of attention's Value")
parser.add_argument("--d_model", type=int, default=512, help="embedding dim of a word")
parser.add_argument("--n_heads", type=int, default=8, help="number of attention heads")
parser.add_argument("--device", type=str, default="cuda", help="device used for training")
parser.add_argument("--d_ff", type=int, default=2048, help="dim of the feed-forward hidden layer")
parser.add_argument("--n_layers", type=int, default=6, help="number of Encoder/Decoder layers (blocks)")

args = parser.parse_args()


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.2, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)  # [max_len, d_model] = [5000, 512]
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # [5000, 1]
        div_term = torch.exp(torch.arange(
            0, d_model, 2).float() * (-math.log(10000.0) / d_model))  # [256]
        pe[:, 0::2] = torch.sin(position * div_term)  # even columns: [5000, 256]
        pe[:, 1::2] = torch.cos(position * div_term)  # odd columns:  [5000, 256]
        pe = pe.unsqueeze(0).transpose(0, 1)  # [max_len, 1, d_model] = [5000, 1, 512]
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        x: [seq_len, batch_size, d_model]
        """
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
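
# The buffer above implements the fixed sinusoidal encoding from "Attention Is All You Need":
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
# div_term is the 10000^(-2i / d_model) factor, computed in log space for numerical stability.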


def get_attn_pad_mask(seq_q, seq_k):
    # Purpose of the pad mask: when the value vectors are averaged, force alpha_ij = 0 for
    # <pad> positions so that attention never looks at pad vectors.
    """Here q and k denote two sequences (unrelated to the q, k of the attention mechanism),
    e.g. encoder_inputs (x1, x2, ..., xm) and encoder_inputs (x1, x2, ..., xm).
    Both the encoder and the decoder may call this function, so seq_len depends on the caller.
    seq_q: [batch_size, seq_len]
    seq_k: [batch_size, seq_len]
    seq_len could be src_len or tgt_len
    seq_len in seq_q and seq_len in seq_k may differ
    """
    batch_size, len_q = seq_q.size()  # seq_q is only used to expand the dimension
    batch_size, len_k = seq_k.size()
    # eq(zero) is the PAD token
    # e.g. seq_k = [[1,2,3,4,0], [1,2,3,5,0]]
    # [batch_size, 1, len_k], True means masked
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)
    # [batch_size, len_q, len_k]: a cube made of batch_size such matrices
    return pad_attn_mask.expand(batch_size, len_q, len_k)


def get_attn_subsequence_mask(seq):  # input: [batch_size, seq_len]
    """It helps to print the output once to see what it looks like.
    seq: [batch_size, tgt_len]
    """
    attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
    # attn_shape: [batch_size, tgt_len, tgt_len]
    subsequence_mask = np.triu(np.ones(attn_shape), k=1)  # upper-triangular matrix
    subsequence_mask = torch.from_numpy(subsequence_mask).byte()
    return subsequence_mask  # [batch_size, tgt_len, tgt_len]
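
# A toy illustration (hypothetical input) of the two masks above:
#   seq = torch.tensor([[1, 2, 3, 0]])   # one sentence whose last token is <pad> (index 0)
#   get_attn_pad_mask(seq, seq)[0]       # True marks the <pad> column in every row
#     -> [[False, False, False, True],
#         [False, False, False, True],
#         [False, False, False, True],
#         [False, False, False, True]]
#   get_attn_subsequence_mask(seq)[0]    # 1 marks future positions (strict upper triangle)
#     -> [[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [0, 0, 0, 0]]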


class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        """
        Q: [batch_size, n_heads, len_q, d_k]
        K: [batch_size, n_heads, len_k, d_k]
        V: [batch_size, n_heads, len_v(=len_k), d_v]
        attn_mask: [batch_size, n_heads, seq_len, seq_len]
        Note: in the encoder-decoder attention layer, len_q (q1..qt) and len_k (k1..km) may differ.
        Also: the mask operation is implemented identically in the Encoder and the Decoder; both the <PAD>
        mask and the decoder (look-ahead) mask are realized simply through the attn_mask that is passed in.
        """
        scores = torch.matmul(Q, K.transpose(-1, -2)) / \
                 np.sqrt(args.d_k)  # scores : [batch_size, n_heads, len_q, len_k]
        # Fill scores with -1e9 at every element whose corresponding entry in attn_mask is True.
        # Fills elements of self tensor with value where mask is True.
        scores.masked_fill_(attn_mask, -1e9)
        # The shape of mask must be broadcastable with the shape of the underlying tensor.
        # After the Q·K dot product, the mask is applied before the softmax; masked positions must be set
        # to (negative) infinity so that their softmax output becomes 0.
        attn = nn.Softmax(dim=-1)(scores)  # softmax over the last dimension (len_k)
        # scores : [batch_size, n_heads, len_q, len_k] * V: [batch_size, n_heads, len_v(=len_k), d_v]
        # context: [batch_size, n_heads, len_q, d_v]
        context = torch.matmul(attn, V)
        # context: the [z1, z2, ...] vectors; attn: the attention weight matrix (kept for visualization)
        return context, attn
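
# Why -1e9 before the softmax: a masked score contributes (almost) zero weight, e.g.
#   softmax([2.1, 0.3, -1e9]) ≈ [0.858, 0.142, 0.000]
# so <pad> and future positions contribute nothing to the weighted sum over V.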


class MultiHeadAttention(nn.Module):
    """这个Attention类可以实现:
    Encoder的Self-Attention
    Decoder的Masked Self-Attention
    Encoder-Decoder的Attention
    输入:seq_len x d_model
    输出:seq_len x d_model
    """

    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(args.d_model, args.d_k * args.n_heads,
                             bias=False)  # q and k must have the same dim, otherwise the dot product is impossible
        self.W_K = nn.Linear(args.d_model, args.d_k * args.n_heads, bias=False)
        self.W_V = nn.Linear(args.d_model, args.d_v * args.n_heads, bias=False)
        # This linear layer guarantees that the multi-head attention output is still seq_len x d_model
        self.fc = nn.Linear(args.n_heads * args.d_v, args.d_model, bias=False)
        # Define LayerNorm once here so that its affine parameters are registered and trained
        # (creating it inside forward would re-initialize it on every call).
        self.layer_norm = nn.LayerNorm(args.d_model)

    def forward(self, input_Q, input_K, input_V, attn_mask):
        """
        input_Q: [batch_size, len_q, d_model]
        input_K: [batch_size, len_k, d_model]
        input_V: [batch_size, len_v(=len_k), d_model]
        attn_mask: [batch_size, seq_len, seq_len]
        """
        residual, batch_size = input_Q, input_Q.size(0)
        # As an implementation trick, the projection matrices of all heads are applied in one linear
        # transformation, and the result is then split into the individual heads.
        # B: batch_size, S: seq_len, D: dim
        # (B, S, D) -proj-> (B, S, D_new) -split-> (B, S, Head, W) -trans-> (B, Head, S, W)
        #           linear projection       split into heads

        # Q: [batch_size, n_heads, len_q, d_k]
        Q = self.W_Q(input_Q).view(batch_size, -1,
                                   args.n_heads, args.d_k).transpose(1, 2)
        # K: [batch_size, n_heads, len_k, d_k]  # K and V always have the same length; their dims may differ
        K = self.W_K(input_K).view(batch_size, -1,
                                   args.n_heads, args.d_k).transpose(1, 2)
        # V: [batch_size, n_heads, len_v(=len_k), d_v]
        V = self.W_V(input_V).view(batch_size, -1,
                                   args.n_heads, args.d_v).transpose(1, 2)

        # Because of the multiple heads, the mask matrix has to be expanded to 4 dimensions
        # attn_mask: [batch_size, seq_len, seq_len] -> [batch_size, n_heads, seq_len, seq_len]
        attn_mask = attn_mask.unsqueeze(1).repeat(1, args.n_heads, 1, 1)

        # context: [batch_size, n_heads, len_q, d_v], attn: [batch_size, n_heads, len_q, len_k]
        context, attn = ScaledDotProductAttention()(Q, K, V, attn_mask)
        # Concatenate the output vectors of the different heads
        # context: [batch_size, n_heads, len_q, d_v] -> [batch_size, len_q, n_heads * d_v]
        context = context.transpose(1, 2).reshape(
            batch_size, -1, args.n_heads * args.d_v)

        # This linear layer guarantees that the multi-head attention output is still seq_len x d_model
        output = self.fc(context)  # [batch_size, len_q, d_model]
        return self.layer_norm(output + residual), attn  # residual connection + LayerNorm
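
# With the default arguments (d_model=512, n_heads=8, d_k=d_v=64), the split above is:
#   (B, S, 512) -W_Q-> (B, S, 512) -view-> (B, S, 8, 64) -transpose-> (B, 8, S, 64)
# i.e. each head attends in its own 64-dimensional subspace.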


class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(args.d_model, args.d_ff, bias=False),
            nn.ReLU(),
            nn.Linear(args.d_ff, args.d_model, bias=False)
        )
        # As in MultiHeadAttention, define LayerNorm once so that its parameters are learned.
        self.layer_norm = nn.LayerNorm(args.d_model)

    def forward(self, inputs):
        """
        inputs: [batch_size, seq_len, d_model]
        """
        residual = inputs
        output = self.fc(inputs)
        # [batch_size, seq_len, d_model]
        return self.layer_norm(output + residual)  # residual connection + LayerNorm


class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        """E
        enc_inputs: [batch_size, src_len, d_model]
        enc_self_attn_mask: [batch_size, src_len, src_len]  mask矩阵(pad mask or sequence mask)
        """
        # enc_outputs: [batch_size, src_len, d_model], attn: [batch_size, n_heads, src_len, src_len]
        # 第一个enc_inputs * W_Q = Q
        # 第二个enc_inputs * W_K = K
        # 第三个enc_inputs * W_V = V
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs,
                                               enc_self_attn_mask)  # enc_inputs to same Q,K,V(未线性变换前)
        enc_outputs = self.pos_ffn(enc_outputs)
        # enc_outputs: [batch_size, src_len, d_model]
        return enc_outputs, attn


class DecoderLayer(nn.Module):
    def __init__(self):
        super(DecoderLayer, self).__init__()
        self.dec_self_attn = MultiHeadAttention()
        self.dec_enc_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, dec_inputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask):
        """
        dec_inputs: [batch_size, tgt_len, d_model]
        enc_outputs: [batch_size, src_len, d_model]
        dec_self_attn_mask: [batch_size, tgt_len, tgt_len]
        dec_enc_attn_mask: [batch_size, tgt_len, src_len]
        """
        # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len]
        dec_outputs, dec_self_attn = self.dec_self_attn(dec_inputs, dec_inputs, dec_inputs,
                                                        dec_self_attn_mask)  # Q, K, V all come from the decoder's own input
        # dec_outputs: [batch_size, tgt_len, d_model], dec_enc_attn: [batch_size, n_heads, tgt_len, src_len]
        dec_outputs, dec_enc_attn = self.dec_enc_attn(dec_outputs, enc_outputs, enc_outputs,
                                                      dec_enc_attn_mask)  # Q comes from the decoder, K and V from the encoder
        # [batch_size, tgt_len, d_model]
        dec_outputs = self.pos_ffn(dec_outputs)
        # dec_self_attn and dec_enc_attn are returned for visualization only
        return dec_outputs, dec_self_attn, dec_enc_attn


class Encoder(nn.Module):
    def __init__(self, src_vocab_size):
        super(Encoder, self).__init__()
        self.src_emb = nn.Embedding(src_vocab_size, args.d_model,
                                    padding_idx=0)  # token embedding; the <PAD> position is not learned
        self.pos_emb = PositionalEncoding(
            args.d_model)  # the positional encoding in the Transformer is fixed, not learned
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(
            args.n_layers)])  # each layer returns enc_outputs, attn | enc_outputs: [batch_size, src_len, d_model]

    def forward(self, enc_inputs):
        """enc_outputs = {Tensor: 2} tensor([[[ 0.8747,  0.6650, -0.7362,  ...,  0.1541,  0.1041, -1.0131],\n         [ 0.5965, -0.7938,  0.8106,  ..., -0.5160, -1.2227,  0.2863],\n         [-1.5760,  0.0621, -0.7876,  ..., -1.8266, -0.9605, -1.8297],\n         ...,\n         [-0.0097, -0.1358,  … View
        enc_inputs: [batch_size, src_len]
        """
        enc_outputs = self.src_emb(
            enc_inputs)  # [batch_size, src_len, d_model]
        enc_outputs = self.pos_emb(enc_outputs.transpose(0, 1)).transpose(
            0, 1)  # [batch_size, src_len, d_model] 这一步处理加撒上了 pos_emb
        # Encoder输入序列的pad mask矩阵  将元素为0的地方标识为TRUE 非零位置标识为False
        enc_self_attn_mask = get_attn_pad_mask(
            enc_inputs, enc_inputs)  # [batch_size, src_len, src_len]
        enc_self_attns = []  # 在计算中不需要用到,它主要用来保存你接下来返回的attention的值(这个主要是为了你画热力图等,用来看各个词之间的关系
        for layer in self.layers:  # for循环访问nn.ModuleList对象
            # 上一个block的输出enc_outputs作为当前block的输入
            # enc_outputs: [batch_size, src_len, d_model], enc_self_attn: [batch_size, n_heads, src_len, src_len]
            enc_outputs, enc_self_attn = layer(enc_outputs,
                                               enc_self_attn_mask)  # 传入的enc_outputs其实是input,传入mask矩阵是因为你要做self attention
            enc_self_attns.append(enc_self_attn)  # 这个只是为了可视化
        return enc_outputs, enc_self_attns


class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size):
        super(Decoder, self).__init__()
        self.tgt_emb = nn.Embedding(
            tgt_vocab_size, args.d_model, padding_idx=0)  # embedding table for the decoder input
        self.pos_emb = PositionalEncoding(args.d_model)
        self.layers = nn.ModuleList([DecoderLayer()
                                     for _ in range(args.n_layers)])  # the decoder blocks

    def forward(self, dec_inputs, enc_inputs, enc_outputs):
        """
        dec_inputs: [batch_size, tgt_len]
        enc_inputs: [batch_size, src_len]
        enc_outputs: [batch_size, src_len, d_model]   # used in the Encoder-Decoder Attention layer
        """
        dec_outputs = self.tgt_emb(
            dec_inputs)  # [batch_size, tgt_len, d_model]
        dec_outputs = self.pos_emb(dec_outputs.transpose(0, 1)).transpose(0, 1).to(
            args.device)  # [batch_size, tgt_len, d_model]
        # Pad mask for the decoder input sequence (this toy example has no pad in the decoder input,
        # but real applications do)
        dec_self_attn_pad_mask = get_attn_pad_mask(dec_inputs, dec_inputs).to(
            args.device)  # [batch_size, tgt_len, tgt_len]

        # Masked Self-Attention: the current time step must not see future information
        dec_self_attn_subsequence_mask = get_attn_subsequence_mask(dec_inputs).to(
            args.device)  # [batch_size, tgt_len, tgt_len]

        # In the decoder the two mask matrices are added (masking both the pad positions and the future time steps)
        dec_self_attn_mask = torch.gt((dec_self_attn_pad_mask + dec_self_attn_subsequence_mask),
                                      0).to(
            args.device)  # [batch_size, tgt_len, tgt_len]; torch.gt compares element-wise: 1 where greater, 0 otherwise

        # This mask is used in the encoder-decoder attention layer.
        # get_attn_pad_mask builds the pad mask of enc_inputs (the encoder provides K and V; the attention
        # output is a weighted average of v1, v2, ..., vm, so the weights of the v_i belonging to pad must be
        # set to 0 so that attention never attends to pad vectors).
        # dec_inputs only provides the size for the expand.
        dec_enc_attn_mask = get_attn_pad_mask(
            dec_inputs, enc_inputs)  # [batch_size, tgt_len, src_len]

        dec_self_attns, dec_enc_attns = [], []
        for layer in self.layers:
            # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [batch_size, n_heads, tgt_len, src_len]
            # each decoder block takes the previous block's output dec_outputs (changing) and the encoder's output enc_outputs (fixed)
            dec_outputs, dec_self_attn, dec_enc_attn = layer(dec_outputs, enc_outputs, dec_self_attn_mask,
                                                             dec_enc_attn_mask)
            dec_self_attns.append(dec_self_attn)
            dec_enc_attns.append(dec_enc_attn)
        # dec_outputs: [batch_size, tgt_len, d_model]
        return dec_outputs, dec_self_attns, dec_enc_attns


class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size):
        super(Transformer, self).__init__()
        self.encoder = Encoder(src_vocab_size).to(args.device)
        self.decoder = Decoder(tgt_vocab_size).to(args.device)
        self.projection = nn.Linear(
            args.d_model, tgt_vocab_size, bias=False).to(args.device)

    def forward(self, enc_inputs, dec_inputs):
        """Transformers的输入:两个序列
        enc_inputs: [batch_size, src_len]
        dec_inputs: [batch_size, tgt_len]
        """
        # tensor to store decoder outputs
        # outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

        # enc_outputs: [batch_size, src_len, d_model], enc_self_attns: [n_layers, batch_size, n_heads, src_len, src_len]
        # after the Encoder, the output still has shape [batch_size, src_len, d_model]
        enc_outputs, enc_self_attns = self.encoder(enc_inputs)
        # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attns: [n_layers, batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [n_layers, batch_size, tgt_len, src_len]
        dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(
            dec_inputs, enc_inputs, enc_outputs)
        # dec_outputs: [batch_size, tgt_len, d_model] -> dec_logits: [batch_size, tgt_len, tgt_vocab_size]
        dec_logits = self.projection(dec_outputs)

        return dec_logits.view(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns
        # return dec_outputs, enc_self_attns, dec_self_attns, dec_enc_attns
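
A quick shape check of the model above (a minimal sketch, not part of the original code; it assumes the code is saved as model.py, as the training script below does, and that the --device argument matches your hardware, e.g. run with --device cpu on a machine without a GPU):

import torch
from model import Transformer, args  # args is the module-level argparse namespace in model.py

# hypothetical toy vocabulary sizes, just to exercise the forward pass
model = Transformer(src_vocab_size=100, tgt_vocab_size=120)
enc_inputs = torch.randint(1, 100, (2, 7)).to(args.device)   # [batch_size, src_len], no pad index 0
dec_inputs = torch.randint(1, 120, (2, 9)).to(args.device)   # [batch_size, tgt_len]
logits, enc_attns, dec_self_attns, dec_enc_attns = model(enc_inputs, dec_inputs)
print(logits.shape)  # torch.Size([18, 120]) = [batch_size * tgt_len, tgt_vocab_size]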

Training

To make the training results reproducible, a random seed is set.

import torch
import torch.nn as nn
import torch.optim as optim
from data_utils import make_data
import torch.utils.data as Data
from model import Transformer
import os
import numpy as np

def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(11)

src_word2idx, src_idx2word, tgt_word2idx, tgt_idx2word, enc_inputs, dec_inputs, dec_outputs = make_data("newstest2013.en","newstest2013.de")


class MyDataSet(Data.Dataset):
    """自定义DataLoader"""

    def __init__(self, enc_inputs, dec_inputs, dec_outputs):
        super(MyDataSet, self).__init__()
        self.enc_inputs = enc_inputs
        self.dec_inputs = dec_inputs
        self.dec_outputs = dec_outputs

    def __len__(self):
        return self.enc_inputs.shape[0]

    def __getitem__(self, idx):
        return self.enc_inputs[idx], self.dec_inputs[idx], self.dec_outputs[idx]


BATCH_SIZE = 32
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

loader = Data.DataLoader(
    MyDataSet(enc_inputs, dec_inputs, dec_outputs), BATCH_SIZE, True)

model = Transformer(src_vocab_size=len(src_word2idx), tgt_vocab_size=len(tgt_word2idx))
model = model.to(device)

# The loss is created with ignore_index=0: the index of "<pad>" is 0, so no loss is computed for
# "<pad>" positions (pad carries no meaning, so there is nothing to learn from it).
criterion = nn.CrossEntropyLoss(ignore_index=0)
# https://ifwind.github.io/2021/08/17/Transformer%E7%9B%B8%E5%85%B3%E2%80%94%E2%80%94%EF%BC%887%EF%BC%89Mask%E6%9C%BA%E5%88%B6/#padding-mask%E4%BD%BF%E7%94%A8%E6%A1%88%E4%BE%8B
optimizer = optim.SGD(model.parameters(), lr=1e-3,
                      momentum=0.99)  # Adam did not work well here


for epoch in range(1000):
    model.train()
    for enc_inputs, dec_inputs, dec_outputs in loader:
        """
        enc_inputs: [batch_size, src_len]
        dec_inputs: [batch_size, tgt_len]
        dec_outputs: [batch_size, tgt_len]
        """

        enc_inputs, dec_inputs, dec_outputs = enc_inputs.to(
            device), dec_inputs.to(device), dec_outputs.to(device)
        # outputs: [batch_size * tgt_len, tgt_vocab_size]
        outputs, enc_self_attns, dec_self_attns, dec_enc_attns = model(
            enc_inputs, dec_inputs)
        loss = criterion(outputs, dec_outputs.view(-1))
        print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss.item()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()



def greedy_decoder(model, enc_input, start_symbol):
    """贪心编码
    For simplicity, a Greedy Decoder is Beam search when K=1. This is necessary for inference as we don't know the
    target sequence input. Therefore we try to generate the target input word by word, then feed it into the transformer.
    Starting Reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html#greedy-decoding
    :param model: Transformer Model
    :param enc_input: The encoder input
    :param start_symbol: The start symbol. In this setup it is '<bos>', whose index comes from tgt_word2idx
    :return: The target input
    """
    enc_outputs, enc_self_attns = model.encoder(enc_input)
    # initialize an empty tensor: tensor([], size=(1, 0), dtype=torch.int64)
    dec_input = torch.zeros(1, 0).type_as(enc_input.data)
    terminal = False
    next_symbol = start_symbol
    while ((not terminal) and (len(dec_input[0]) < 30)):
        # print(len(dec_input[0]))  # debug: current decoded length
        # Prediction phase: dec_input grows step by step (one newly predicted word is appended each time)
        dec_input = torch.cat([dec_input.to(device), torch.tensor([[next_symbol]], dtype=enc_input.dtype).to(device)],
                              -1)
        dec_outputs, _, _ = model.decoder(dec_input, enc_input, enc_outputs)
        projected = model.projection(dec_outputs)
        prob = projected.squeeze(0).max(dim=-1, keepdim=False)[1]
        # Incremental update (we want the predictions for already-generated words to stay the same):
        # predictions for earlier positions are ignored; only the newest predicted word is appended to the input.
        # Take the currently predicted word (an index). The output z_t corresponding to x'_t is used to
        # predict the next word; z_1, z_2, ..., z_{t-1} are not used.
        next_word = prob.data[-1]
        next_symbol = next_word
        if next_symbol == tgt_word2idx["<eos>"]:
            terminal = True
        # print(next_word)

    # greedy_dec_predict = torch.cat(
    #     [dec_input.to(device), torch.tensor([[next_symbol]], dtype=enc_input.dtype).to(device)],
    #     -1)
    greedy_dec_predict = dec_input[:, 1:]
    return greedy_dec_predict


sentences = [
    # enc_input                dec_input           dec_output
    ['in addition, the law are limit <pad>',
     '', '']
]

for sen in sentences:
    enc_inputs = torch.LongTensor(
        [[src_word2idx[word] if word in src_word2idx.keys() else src_word2idx['<unk>'] for word in sen[0].split()]])
    dec_inputs, dec_outputs = torch.LongTensor([]), torch.LongTensor([])

for i in range(len(enc_inputs)):
    greedy_dec_predict = greedy_decoder(model, enc_inputs.to(device), start_symbol=tgt_word2idx["<bos>"])
    print(enc_inputs[i], '->', greedy_dec_predict.squeeze())
    print([src_idx2word[t.item()] for t in enc_inputs[i]], '->',
          [tgt_idx2word[n.item()] for n in greedy_dec_predict.squeeze()])

Training Data Format

Raw data
English
[image]
German
[image]

The preprocessing code (data_utils.py, imported by the training script above):


# -*- coding: utf-8 -*-
import torch
from collections import Counter
from copy import *

# Extra vocabulary symbols
pad_token = "<pad>"
unk_token = "<unk>"
bos_token = "<bos>"
eos_token = "<eos>"

extra_tokens = [pad_token, unk_token, bos_token, eos_token]

PAD = extra_tokens.index(pad_token)
UNK = extra_tokens.index(unk_token)
BOS = extra_tokens.index(bos_token)
EOS = extra_tokens.index(eos_token)


def convert_text2idx(examples, word2idx):
    return [[word2idx[w] if w in word2idx else UNK
             for w in sent] for sent in examples]


def read_corpus(src_path, max_len, lower_case=False):
    print('Reading examples from {}..'.format(src_path))
    src_sents = []
    empty_lines, exceed_lines = 0, 0
    with open(src_path) as src_file:
        for idx, src_line in enumerate(src_file):
            if idx % 10000 == 0:
                print('  reading {} lines..'.format(idx))
            if src_line.strip() == '':  # remove empty lines
                empty_lines += 1
                continue
            if lower_case:  # check lower_case
                src_line = src_line.lower()

            src_words = src_line.strip().split()
            if max_len is not None and len(src_words) > max_len:
                exceed_lines += 1
                continue
            src_sents.append(src_words)

    print('Removed {} empty lines'.format(empty_lines),
          'and {} lines exceeding the length {}'.format(exceed_lines, max_len))
    print('Result: {} lines remained'.format(len(src_sents)))
    return src_sents


def read_parallel_corpus(src_path, tgt_path, max_len, lower_case=False):
    print('Reading examples from {} and {}..'.format(src_path, tgt_path))
    src_sents, tgt_sents = [], []
    empty_lines, exceed_lines = 0, 0
    with open(src_path) as src_file, open(tgt_path) as tgt_file:
        for idx, (src_line, tgt_line) in enumerate(zip(src_file, tgt_file)):
            if idx % 10000 == 0:
                print('  reading {} lines..'.format(idx))
            if src_line.strip() == '' or tgt_line.strip() == '':  # remove empty lines
                empty_lines += 1
                continue
            if lower_case:  # check lower_case
                src_line = src_line.lower()
                tgt_line = tgt_line.lower()

            src_words = src_line.strip().split()
            tgt_words = tgt_line.strip().split()
            if max_len is not None and (len(src_words) > max_len or len(tgt_words) > max_len):
                exceed_lines += 1
                continue
            src_sents.append(src_words)
            tgt_sents.append(tgt_words)

    print('Filtered {} empty lines'.format(empty_lines),
          'and {} lines exceeding the length {}'.format(exceed_lines, max_len))
    print('Result: {} lines remained'.format(len(src_sents)))
    return src_sents, tgt_sents


def build_vocab(examples, max_size, min_freq, extra_tokens):
    '''
    :param examples: input array  [[sentence list], []]
    :param max_size: maximum vocabulary size
    :param min_freq: minimum frequency
    :param extra_tokens: list
    :return: counter, word2idx, idx2word
    '''
    print('Creating vocabulary with max limit {}..'.format(max_size))
    counter = Counter()  # word counter
    word2idx, idx2word = {}, []
    if extra_tokens:
        idx2word += extra_tokens
        word2idx = {word: idx for idx, word in enumerate(extra_tokens)}
    min_freq = max(min_freq, 1)
    max_size = max_size + len(idx2word) if max_size else None
    for sent in examples:
        for w in sent:
            counter.update([w])
    # first sort items in alphabetical order and then by frequency
    sorted_counter = sorted(counter.items(), key=lambda tup: tup[0])
    sorted_counter.sort(key=lambda tup: tup[1], reverse=True)

    for word, freq in sorted_counter:
        # drop words rarer than min_freq and enforce the maximum vocabulary size
        if freq < min_freq or (max_size and len(idx2word) == max_size):
            break
        idx2word.append(word)
        word2idx[word] = len(idx2word) - 1

    print('Vocabulary of size {} has been created'.format(len(idx2word)))
    return counter, word2idx, idx2word
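
# Toy example (hypothetical input):
#   build_vocab([["a", "b", "a"]], max_size=10, min_freq=1, extra_tokens=extra_tokens)
# returns idx2word = ['<pad>', '<unk>', '<bos>', '<eos>', 'a', 'b'] and the matching word2idx,
# i.e. the special tokens always occupy indices 0..3 and real words follow, most frequent first.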


def add_pad_and_cut(inputs, length, dec_inputs_label=False, dec_outputs_label=False):
    '''
    :param inputs: list of sentences (each a list of token ids)
    :param length: maximum sentence length
    :param dec_inputs_label: True -> prepend BOS (used for dec_inputs)
    :param dec_outputs_label: True -> append EOS (used for dec_outputs)
    :return: the padded / truncated sentences
    '''

    for id, l in enumerate(inputs):
        if dec_inputs_label == True:
            l.insert(0, BOS)
        if len(l) < length:
            if dec_outputs_label == True:
                l.append(EOS)
            while len(l) < length:
                l.append(0)
        else:
            inputs[id] = l[:length]
            if dec_outputs_label == True:
                inputs[id][length - 1] = EOS
    return inputs


def make_data(src_file,tgt_file):
    src_sents, tgt_sents = read_parallel_corpus(src_file, tgt_file, 100, lower_case=True)
    src_counter, src_word2idx, src_idx2word = build_vocab(src_sents, 800, 0, extra_tokens)
    tgt_counter, tgt_word2idx, tgt_idx2word = build_vocab(tgt_sents, 800, 0, extra_tokens)

    enc_inputs = convert_text2idx(src_sents, src_word2idx)
    dec_inputs = convert_text2idx(tgt_sents, tgt_word2idx)
    dec_outputs = deepcopy(dec_inputs)

    enc_inputs = add_pad_and_cut(enc_inputs, 30)  # maximum number of tokens per sentence
    dec_inputs = add_pad_and_cut(dec_inputs, 30, dec_inputs_label=True)
    dec_outputs = add_pad_and_cut(dec_outputs, 30, dec_outputs_label=True)
    return src_word2idx, src_idx2word, tgt_word2idx, tgt_idx2word, torch.LongTensor(
        enc_inputs), torch.LongTensor(dec_inputs), torch.LongTensor(dec_outputs)


# src_word2idx, src_idx2word, tgt_word2idx, tgt_idx2word, enc_inputs, dec_inputs, dec_outputs = make_data()
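
As a quick sanity check of add_pad_and_cut (a sketch, not part of the original script; it assumes the code above is saved as data_utils.py and uses the PAD=0, BOS=2, EOS=3 indices defined there):

from copy import deepcopy
from data_utils import add_pad_and_cut

sents = [[5, 6, 7], [8, 9, 10, 11, 12, 13]]                        # two hypothetical id sequences
print(add_pad_and_cut(deepcopy(sents), 5))                         # [[5, 6, 7, 0, 0], [8, 9, 10, 11, 12]]
print(add_pad_and_cut(deepcopy(sents), 5, dec_inputs_label=True))  # [[2, 5, 6, 7, 0], [2, 8, 9, 10, 11]]
print(add_pad_and_cut(deepcopy(sents), 5, dec_outputs_label=True)) # [[5, 6, 7, 3, 0], [8, 9, 10, 11, 3]]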

Example

pad_token = "<pad>"
unk_token = "<unk>"
bos_token = "<bos>"
eos_token = "<eos>"
sentences = [
    # the source and the target sentence need not contain the same number of words
    # (here P, S and E stand for the padding, start and end symbols)
    # enc_input                dec_input           dec_output
    ['a republican strategy to counter the re-election of obama P',
     'S eine republikanische strategie, um der wiederwahl von obama entgegenzutreten',
     'eine republikanische strategie, um der wiederwahl von obama entgegenzutreten E'],
    ['republican leaders justified their policy by the need to combat electoral fraud. P',
     'S die führungskräfte der republikaner rechtfertigen ihre politik mit der notwendigkeit, den wahlbetrug zu bekämpfen.',
     'die führungskräfte der republikaner rechtfertigen ihre politik mit der notwendigkeit, den wahlbetrug zu bekämpfen. E'],
    ['however, the brennan centre considers this a myth, stating that electoral fraud is rarer in the united states than the number of people killed by lightning. P',
     'S allerdings hält das brennan center letzteres für einen mythos, indem es bekräftigt, dass der wahlbetrug in den usa seltener ist als die anzahl der vom blitzschlag getöteten menschen.',
     'allerdings hält das brennan center letzteres für einen mythos, indem es bekräftigt, dass der wahlbetrug in den usa seltener ist als die anzahl der vom blitzschlag getöteten menschen. E']
]

Converting the data above to ids makes it ready for training, but note that the special symbols have to be added:

enc_inputs:  ********<pad>       # no <BOS> and no <EOS> added
dec_inputs:  <BOS>********<pad>  # no <EOS> added
dec_outputs: ********<EOS>       # no <BOS> added
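
For example, with the special-token indices from data_utils (PAD=0, UNK=1, BOS=2, EOS=3), a hypothetical sentence whose words map to the ids 5 6 7, padded to a length of 6, becomes:

enc_inputs:  5 6 7 0 0 0   # source ids, padded with <pad>=0
dec_inputs:  2 5 6 7 0 0   # <bos>=2 prepended, then padded
dec_outputs: 5 6 7 3 0 0   # <eos>=3 appended, then padded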

Results

Loss
Input
[image]

Output
[image]

Known Issues

  1. Adding a Softmax activation after the Transformer output makes training very difficult and frequently leads to vanishing gradients (nn.CrossEntropyLoss already applies log-softmax internally, so an extra Softmax squashes the logits and their gradients).
  2. The training vocabulary is entirely lowercase and not properly tokenized, so it even contains entries such as "you." with a trailing period.
  3. Training is also very difficult with the Adam optimizer.

Comments, discussion and corrections are welcome!
