Lesson 12. Seq2Seq and Attention

Latest recommended article published on 2022-03-07 14:19:15

tzc_fly

Column: 白景屹的Pytorch笔记本. Tags: nlp, deep learning

Article link: https://blog.csdn.net/qq_40943760/article/details/113061820

Seq2Seq: Encoder-Decoder for Machine Translation

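Seq2Seq is a popular branch of NLP whose models are commonly applied to machine translation and chatbots. It grew out of the original Encoder-Decoder architecture. Between 2014 and 2015 the attention mechanism appeared, and combining it with Seq2Seq further improved model performance.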

This lesson first implements an Encoder-Decoder model and applies it to a machine translation task.

Machine Translation Dataset and Preprocessing

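The experiments use nmt, a lightweight machine translation dataset (available in the author's personal resources). It is small, which keeps experiments quick: en-cn holds English-Chinese pairs and en-fr holds English-French pairs. The Encoder-Decoder will translate English into Chinese. The en-cn training file contains only 14533 sentence pairs. Each line holds an English sentence and its Chinese translation separated by a tab; as the samples show, the Chinese side mixes Traditional and Simplified characters. The sentences are short, for example the first 6 pairs of train.txt: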

Anyone can do that.	任何人都可以做到。
How about another piece of cake?	要不要再來一塊蛋糕？
She married him.	她嫁给了他。
I don't like learning irregular verbs.	我不喜欢学习不规则动词。
It's a whole new ball game for me.	這對我來說是個全新的球類遊戲。
He's sleeping like a baby.	他正睡着，像个婴儿一样。

The raw corpus needs preprocessing, so first import the required packages and modules:

import os
import sys
import math
from collections import Counter
import numpy as np
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

Next, import nltk, which is used here to tokenize the English text:

import nltk

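After a fresh nltk install, tokenization depends on the punkt tokenizer models (also available in the author's personal resources). Unzip punkt and place it under the current virtual environment; assuming the environment is named TORCH, the directory layout is "TORCH/nltk_data/tokenizers/punkt". The function load_data reads the sentences and converts each one into a list of tokens, adding boundary markers so that every sentence starts with BOS and ends with EOS: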

# Read sentence pairs and turn each sentence into a token list wrapped in BOS ... EOS
def load_data(in_file):
    cn=[]
    en=[]
    with open(in_file,'r',encoding='utf-8') as f:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory at once
        for line in f:
            # strip() removes surrounding whitespace; split("\t") separates the two languages
            line=line.strip().split("\t") # line[0]: English sentence, line[1]: Chinese sentence
            
            # nltk.word_tokenize splits English into words;
            # BOS = Beginning of Sentence, EOS = End of Sentence
            en.append(["BOS"]+nltk.word_tokenize(line[0].lower())+["EOS"])
            
            # Chinese is split character by character
            cn.append(["BOS"]+[c for c in line[1]]+["EOS"])
            
    return en,cn


train_file="./nmt/en-cn/train.txt"
dev_file="./nmt/en-cn/dev.txt"

train_en,train_cn=load_data(train_file)
dev_en,dev_cn=load_data(dev_file)

Inspect the tokenization result:

# Inspect the tokenization result
dev_en[0],dev_cn[0]

"""
(['BOS', 'she', 'put', 'the', 'magazine', 'on', 'the', 'table', '.', 'EOS'],
 ['BOS', '她', '把', '雜', '誌', '放', '在', '桌', '上', '。', 'EOS'])
"""

Build the vocabulary by counting word frequencies {word: count}, then derive word_to_idx from the words kept in the vocabulary:

UNK_IDX=0
PAD_IDX=1

def build_dict(sentences,max_words=50000):
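    # Count token frequencies with Counter
    word_count=Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s]+=1
    ls=word_count.most_common(max_words) # list of (word, count) tuples, most frequent first
    
    total_words=len(ls)+2 # +2 for UNK and PAD
    
    # word_to_idx {word:idx}; indices 0 and 1 are reserved for UNK and PAD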
    word_dict={w[0]:index+2 for index,w in enumerate(ls)}
    word_dict["UNK"]=UNK_IDX
    word_dict["PAD"]=PAD_IDX
    
    return word_dict,total_words

en_dict,en_total_words=build_dict(train_en)
cn_dict,cn_total_words=build_dict(train_cn)

Build idx_to_word {idx: word}:

# idx_to_word {idx:word}
inv_en_dict={v:k for k,v in en_dict.items()}
inv_cn_dict={v:k for k,v in cn_dict.items()}

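Convert the English and Chinese tokens to indices and sort the pairs by sentence length, so that the sentences within a batch have similar lengths. For the built-in higher-order function sorted, see Python Notebook, Lesson 5: Python Functions (II):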

def encode(en_sentences,cn_sentences,en_dict,cn_dict,sort_by_len=True):
    # dict.get(k, d) returns d when k is missing, so unknown words map to UNK_IDX
    out_en_sentences=[[en_dict.get(word,UNK_IDX) for word in sent] for sent in en_sentences]
    out_cn_sentences=[[cn_dict.get(word,UNK_IDX) for word in sent] for sent in cn_sentences]
    
    # Return the indices that sort a list of sentences by token count
    def len_argsort(seq):
        # sorted(iterable, key=None, reverse=False) sorts ascending by default; key is a function
        # See Lesson 5: Python Functions (II)
        return sorted(range(len(seq)),key=lambda x:len(seq[x]))
    
    if sort_by_len:
        # Sort by English length and reuse the same order for Chinese so the pairs stay aligned
        sorted_index=len_argsort(out_en_sentences)
        out_en_sentences=[out_en_sentences[i] for i in sorted_index]
        out_cn_sentences=[out_cn_sentences[i] for i in sorted_index]
        
    return out_en_sentences,out_cn_sentences

train_en,train_cn=encode(train_en,train_cn,en_dict,cn_dict)
dev_en,dev_cn=encode(dev_en,dev_cn,en_dict,cn_dict)

Look at the first 10 English sentences. Because sorted is ascending by default, the shortest sentences come first:

train_en[:10]

"""
[[2, 475, 4, 3],
 [2, 1318, 126, 3],
 [2, 1707, 126, 3],
 [2, 254, 126, 3],
 [2, 1318, 126, 3],
 [2, 130, 11, 3],
 [2, 2045, 126, 3],
 [2, 693, 126, 3],
 [2, 2266, 126, 3],
 [2, 1707, 126, 3]]
"""

Convert the indices back to the original tokens, for example for the 24th sentence:

[inv_en_dict[i] for i in train_en[23]],[inv_cn_dict[i] for i in train_cn[23]]

"""
(['BOS', 'why', 'me', '?', 'EOS'],
 ['BOS', '为', '什', '么', '是', '我', '？', 'EOS'])
"""

Split the data into batches:

def get_minibatch(n:"total number of sentence pairs in the dataset",minibatch_size,shuffle=True):
    idx_list=np.arange(0,n,minibatch_size) # start index of each minibatch
    if shuffle:
        np.random.shuffle(idx_list)
    minibatches=[]
    for idx in idx_list:
        minibatches.append(np.arange(idx,min(idx+minibatch_size,n)))
    return minibatches

def prepare_data(seqs):
    # Pad every sentence to the longest length in the batch by appending 0s.
    # Note that 0 is also UNK_IDX (not PAD_IDX); this is harmless here because the
    # true lengths are returned and later used to ignore the padded positions.
    lengths=[len(seq) for seq in seqs]
    n_samples=len(seqs)
    max_len=np.max(lengths)
    
    x=np.zeros((n_samples,max_len)).astype('int32')
    x_lengths=np.array(lengths).astype('int32')
    
    for idx,seq in enumerate(seqs):
        x[idx,:lengths[idx]]=seq

    return x,x_lengths

def gen_examples(en_sentences,cn_sentences,batch_size):
    minibatches=get_minibatch(len(en_sentences),batch_size)
    all_ex=[]
    for minibatch in minibatches:
        mb_en_sentences=[en_sentences[t] for t in minibatch]
        mb_cn_sentences=[cn_sentences[t] for t in minibatch]
        
        mb_x,mb_x_len=prepare_data(mb_en_sentences)
        mb_y,mb_y_len=prepare_data(mb_cn_sentences)
        
        # mb_x [batch_size, longest English sentence in this batch]
        # mb_x_len [batch_size,]
        # mb_y [batch_size, longest Chinese sentence in this batch]
        # mb_y_len [batch_size,]
        all_ex.append((mb_x,mb_x_len,mb_y,mb_y_len))
        
    return all_ex

batch_size=64
train_data=gen_examples(train_en,train_cn,batch_size)
#  random.shuffle(x)->"Shuffle list x"
random.shuffle(train_data)
dev_data=gen_examples(dev_en,dev_cn,batch_size)
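To make the padding step concrete, here is a minimal sketch on a toy batch. The helper pad_batch and the index values are assumptions for illustration; it follows the same idea as prepare_data, returning both the padded array and the true lengths.

```python
import numpy as np

def pad_batch(seqs, pad_value=0):
    """Right-pad variable-length index lists into one [n_samples, max_len] array."""
    lengths = np.array([len(s) for s in seqs], dtype="int32")
    padded = np.full((len(seqs), lengths.max()), pad_value, dtype="int32")
    for row, seq in enumerate(seqs):
        padded[row, :len(seq)] = seq
    return padded, lengths

# Three toy sentences of different lengths (indices are made up)
toy_batch = [[2, 475, 4, 3], [2, 8, 90, 17, 4, 3], [2, 5, 3]]
x, x_len = pad_batch(toy_batch)
print(x)      # shape (3, 6); shorter rows end in zeros
print(x_len)  # true lengths, needed later for packing and for the loss mask
```

Keeping the lengths next to the padded array is the important design choice: every later stage (packing for the GRU, the loss mask, the attention mask) relies on them rather than on the padding value.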

Encoder-Decoder Model

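At its core, an Encoder-Decoder model is two recurrent neural networks (usually GRUs) connected together. Suppose we have one Seq pair, an English sentence and a Chinese sentence, both already tokenized. Let $x$ denote the English tokens and $y$ the Chinese tokens, that is:
$(x,y):[x_{1},x_{2},x_{3}]\mid[y_{1},y_{2},y_{3},y_{4}]$
In the usual Seq2Seq setup, $(x_{1},x_{2},x_{3},y_{1})$ is the input and $(y_{2},y_{3},y_{4})$ is the label. During training in this lesson, the Decoder is fed the ground-truth tokens $(y_{1},y_{2},y_{3})$ step by step (teacher forcing), so each input token is paired with the token that follows it as its label.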
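The input/label pairing on the Chinese side is just a one-position shift of the same sequence. The following small sketch, using a sentence from the dataset as a token list, shows the alignment; the variable names are chosen for this example only.

```python
# Target sentence, already tokenized; indices are omitted for readability
y = ["BOS", "为", "什", "么", "是", "我", "？", "EOS"]

decoder_input = y[:-1]   # everything except the final EOS
decoder_label = y[1:]    # everything except the leading BOS

# At each step the decoder sees one token and must predict the next one
for step, (inp, lab) in enumerate(zip(decoder_input, decoder_label)):
    print(step, inp, "->", lab)
```

The training code later applies the same shift to whole batches with mb_y[:,:-1] and mb_y[:,1:].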

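The network structure of the Encoder-Decoder is as follows:
fig1
In this structure, the Encoder's initial hidden state $h_{0}$ can be a zero vector. The Decoder outputs the predictions $(yp_{2},yp_{3},yp_{4})$, which are compared with the labels $(y_{2},y_{3},y_{4})$, so machine translation reduces to an ordinary classification task over the vocabulary at each step. The Decoder is in effect a language model: given the current Chinese token, it predicts the following Chinese tokens one by one.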

The Encoder-Decoder uses GRUs, so first look at GRU in PyTorch. A GRU can be seen as a simplified LSTM; it is used in the same way, except that it has no cell state.

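GRU parameters:
        input_size - dimension of the input word vectors;
        hidden_size - dimension of the hidden state (and of each output vector, per direction);
        num_layers - number of stacked GRU layers;
        batch_first - whether the batch dimension comes first; default False, i.e. (seq_len, batch, input_size)
        
Inputs: 
		input, (h_0)
        input has shape (seq_len, batch, input_size), where input_size is the word-vector dimension;
        h_0 is the initial hidden state of the GRU; it defaults to zeros
        num_directions=2 if the GRU is bidirectional, otherwise 1
        h_0 has shape (num_layers * num_directions, batch, hidden_size)

Outputs: 
		output, (h_n)
        output has shape (seq_len, batch, num_directions * hidden_size)
        h_n has shape (num_layers * num_directions, batch, hidden_size)
        (batch_first=True only changes input and output; h_0 and h_n keep the layout above)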
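The shapes listed above can be checked directly. This is a minimal sketch with arbitrary sizes chosen for illustration, using a bidirectional single-layer GRU with batch_first=True as in the attention Encoder later in this lesson.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, input_size, hidden_size = 4, 7, 10, 16

gru = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, input_size)

output, h_n = gru(x)      # h_0 is omitted, so it defaults to zeros
print(output.shape)       # torch.Size([4, 7, 32]) -> [batch, seq_len, 2 * hidden_size]
print(h_n.shape)          # torch.Size([2, 4, 16]) -> [layers * directions, batch, hidden_size]

# The forward direction ends at the last time step, the backward direction at the first
forward_matches = torch.allclose(h_n[0], output[:, -1, :hidden_size])
backward_matches = torch.allclose(h_n[1], output[:, 0, hidden_size:])
print(forward_matches, backward_matches)
```

The last two checks explain why the bidirectional Encoder concatenates hid[-2] and hid[-1]: they are the final states of the forward and backward passes.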

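To form a batch, sequences are usually padded at the end with a filler token (such as "<PAD>"). When an RNN processes such a batch it wastes computation on most sentences, because the padding carries no information. nn.utils.rnn.pack_padded_sequence lets the RNN stop at the true end of each sequence. With its default setting (enforce_sorted=True), the batch must first be sorted by sequence length in descending order, longest first. The packed output is then padded back into a regular tensor with nn.utils.rnn.pad_packed_sequence. The result has the same shape as running the RNN over the padded batch directly, but the padded steps are skipped.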
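A small sketch of the pack/unpack round trip, with toy sizes chosen for this example: two sequences padded to length 4, the second with only 2 real steps.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
gru = nn.GRU(input_size=5, hidden_size=8, batch_first=True)

x = torch.randn(2, 4, 5)
lengths = torch.tensor([4, 2])   # true lengths, already in descending order
x[1, 2:] = 0.0                   # padded positions of the second sequence

packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_out, h_n = gru(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)                              # torch.Size([2, 4, 8])
print(out[1, 2:].abs().sum().item())          # 0.0: padded steps are filled with zeros
print(torch.allclose(h_n[0, 1], out[1, 1]))   # True: h_n is taken at the last real step
```

Without packing, h_n for the short sequence would be the state after also consuming the padding, which is not the state one wants to hand to the Decoder.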

The Encoder model:

class PlainEncoder(nn.Module):
    def __init__(self,vocab_size,hidden_size,dropout=0.2):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,hidden_size)
        
        # GRU is used like LSTM, but without a cell state
        self.rnn=nn.GRU(hidden_size,hidden_size,batch_first=True)
        
        self.dropout=nn.Dropout(dropout)
    
    def forward(self,x:"one batch: [batch_size, seq_len (longest sentence)]",lengths:"length of each sentence in the batch"):
        # Sort the sequences in the batch by length, longest first.
        # tensor.sort(dim, descending) returns (sorted values, indices into the original tensor)
        sorted_len,sorted_index=lengths.sort(0,descending=True)
        # Reorder the rows of x, e.g. arr[[m,n,k]] picks rows m, n, k of arr
        x_sorted=x[sorted_index] # [batch_size,seq_len]
        
        embeded=self.dropout(self.embed(x_sorted)) # [batch_size,seq_len,hidden_size]
        
        """
        Pack the batch so that the RNN skips the trailing padding.
        torch.nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=False)
        input: padded batch of shape (T×B×*), or (B×T×*) when batch_first=True; T is the longest length.
               With the default enforce_sorted=True the sequences must be sorted longest first.
        lengths: length of each sequence (a list or a CPU tensor).
        Returns a PackedSequence object.
        """
        packed_embeded=nn.utils.rnn.pack_padded_sequence(embeded,sorted_len.long().cpu(),batch_first=True)
        
        # packed_out is a PackedSequence
        # hid [num_layers * num_directions, batch_size, hidden_size]
        packed_out,hid=self.rnn(packed_embeded)
        """
        Pad the packed output back into a regular tensor; padded positions are filled with 0.
        torch.nn.utils.rnn.pad_packed_sequence(sequence, batch_first)
        Returns a tuple: the padded batch and the length of each sequence before padding.
        """
        out,_=nn.utils.rnn.pad_packed_sequence(packed_out,batch_first=True) # out [batch_size, seq_len, num_directions * hidden_size]
        
        # sorted_index maps sorted positions to original positions;
        # sorting it in ascending order yields the indices that restore the original order,
        # which is required for the outputs to line up with the Chinese targets
        _,original_idx=sorted_index.sort(0,descending=False)
        
        # contiguous() returns a tensor that is laid out contiguously in memory
        out=out[original_idx.long()].contiguous() # [batch_size, seq_len, num_directions * hidden_size]
        
        hid=hid[:,original_idx.long()].contiguous() # [num_layers * num_directions, batch_size, hidden_size]
        
        # Keep only the last row of hid, i.e. the hidden state of the last layer
        # hid[[-1]] :[1, batch_size, hidden_size]
        return out,hid[[-1]]

The Decoder model:

class PlainDecoder(nn.Module):
    def __init__(self,vocab_size,hidden_size,dropout=0.2):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,hidden_size)
        self.rnn=nn.GRU(hidden_size,hidden_size,batch_first=True)
        self.fc=nn.Linear(hidden_size,vocab_size)
        self.dropout=nn.Dropout(dropout)
        
    def forward(self,y:"[batch_size,seq_len]",y_lengths:"length of each sentence in the batch",hid:"[1, batch_size, hidden_size]"):
        sorted_len,sorted_index=y_lengths.sort(dim=0,descending=True)
        y_sorted=y[sorted_index] # [batch_size,seq_len]
        hid=hid[:,sorted_index]  # [1, batch_size, hidden_size]
        
        embeded=self.dropout(self.embed(y_sorted)) # [batch_size,seq_len,hidden_size]
        
        packed_embeded=nn.utils.rnn.pack_padded_sequence(embeded,sorted_len.long().cpu().data.numpy(),batch_first=True)
        packed_out,hid=self.rnn(packed_embeded,hid)        
        out,_= nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True) #out [batch_size, seq_len, num_directions * hidden_size]
        
        _, original_idx = sorted_index.sort(dim=0,descending=False)
        
        out=out[original_idx.long()].contiguous() #out [batch_size, seq_len, num_directions * hidden_size]
        hid=hid[:,original_idx.long()].contiguous() # [num_layers * num_directions, batch_size, hidden_size]
        
        # self.fc(out) [batch_size, seq_len, vocab_size]
        # log_softmax normalizes along dim=-1, i.e. over the vocab_size axis
        output = F.log_softmax(self.fc(out), dim=-1) # [batch_size, seq_len, vocab_size]
        
        return output,hid[[-1]]

Combine the Encoder and Decoder into a Seq2Seq model:

class PlainSeq2Seq(nn.Module):
    def __init__(self,encoder,decoder):
        super().__init__()
        self.encoder=encoder
        self.decoder=decoder
        
    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out,hid=self.encoder(x,x_lengths)
        output,hid=self.decoder(y,y_lengths,hid)
        return output
    
    def translate(self, x, x_lengths, y:"BOS token, shaped [batch_size,seq_len=1]", max_length=10):
        """Starting from the Chinese BOS, predict the next character one step at a time"""
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        
        for i in range(max_length):
            # output [batch_size, seq_len=1, vocab_size]
            output, hid = self.decoder(y=y,
                    y_lengths=torch.ones(batch_size).long().to(y.device),
                    hid=hid)
            
            # tensor.max(dim) returns (max values, indices of the max values)
            y = output.max(dim=2)[1].view(batch_size, 1) # y [batch_size,1]
            preds.append(y)
            
        return torch.cat(preds, dim=1) # [batch_size,max_length]

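The Seq2Seq class defines the instance method translate, which works as follows:

The English tokens are fed to the Encoder, which produces the output hidden state;
The Decoder's GRU always runs for a fixed number of steps, max_length;
The BOS marker is used as the first Chinese token and, together with the Encoder's hidden state, is fed to the Decoder to obtain the first output. The fully connected layer followed by log_softmax turns this output into scores over the Chinese vocabulary, and the highest-scoring token is selected. That token is fed back in as the next Decoder input, and repeating this yields the list of output Chinese tokens;
The list is scanned in order, and if the EOS marker appears, the tokens before it form the Chinese result.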
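The steps above amount to greedy decoding followed by truncation at EOS. The sketch below illustrates only that control flow; toy_scores is a hypothetical lookup table standing in for the real Decoder, which also depends on the hidden state and is not modeled here.

```python
# Hypothetical stand-in for the decoder: previous token -> scores over next tokens
toy_scores = {
    "BOS": {"我": 0.7, "你": 0.2, "EOS": 0.1},
    "我": {"有": 0.6, "EOS": 0.4},
    "有": {"空": 0.8, "EOS": 0.2},
    "空": {"。": 0.9, "EOS": 0.1},
    "。": {"EOS": 0.95, "我": 0.05},
    "EOS": {"EOS": 1.0},
}

def greedy_decode(scores, max_length=10, bos="BOS", eos="EOS"):
    """Feed each prediction back in for exactly max_length steps, then cut at EOS."""
    prev, preds = bos, []
    for _ in range(max_length):
        prev = max(scores[prev], key=scores[prev].get)  # argmax over next-token scores
        preds.append(prev)
    cut = preds.index(eos) if eos in preds else len(preds)
    return preds, preds[:cut]

raw, translation = greedy_decode(toy_scores)
print(len(raw))              # always max_length tokens are produced
print("".join(translation))  # only the tokens before the first EOS are kept
```

Running a fixed number of steps keeps batched decoding simple, at the cost of computing steps after EOS whose outputs are then discarded.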

Loss Function

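First look at gather: torch.gather collects the values at specified positions along a given dimension of the input. Its parameters are:

torch.gather:
	input(tensor): the tensor to read from;
	dim(int): the dimension along which to index;
	index(LongTensor): positions along dim whose values are taken from input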
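A short sketch of how gather picks, for each position, the log-probability of the correct word. The probabilities and targets are made-up values for illustration.

```python
import torch

# Log-probabilities for 3 positions over a 4-word vocabulary
log_probs = torch.log(torch.tensor([[0.1, 0.2, 0.3, 0.4],
                                    [0.7, 0.1, 0.1, 0.1],
                                    [0.25, 0.25, 0.25, 0.25]]))
target = torch.tensor([[3], [0], [2]])   # index of the correct word at each position

# Along dim=1, row i of the result is log_probs[i, target[i, 0]]
picked = log_probs.gather(1, target)     # shape [3, 1]
print(picked.exp().squeeze(1))           # probabilities of the correct words: 0.4, 0.7, 0.25
```

Negating and averaging picked gives the negative log-likelihood, which is what the custom loss in this lesson computes.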

A custom loss function is used so that padded positions can be excluded from the loss:

class LanguageModelCriterion(nn.Module):
    """Custom loss: masked negative log-likelihood"""
    def __init__(self):
        super().__init__()
        
    def forward(self,input,target,mask):
        # input [batch_size, seq_len, vocab_size]
        # target [batch_size, seq_len]
        input=input.contiguous().view(-1,input.size(2)) # [*,vocab_size]
        target=target.contiguous().view(-1,1) # [*,1]
        
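        # mask marks which positions hold real tokens and which are padding
        # mask [batch_size, seq_len]
        mask=mask.contiguous().view(-1,1) # reshape to [*,1]
        
        # input is the Decoder output, which already went through log_softmax,
        # so negating the gathered values gives the negative log-likelihood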
        output=-input.gather(1,target)*mask
        output=torch.sum(output)/torch.sum(mask)
        
        return output

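The loss uses a mask, a [batch_size, seq_len] matrix of 0s and 1s whose entry is 1 where the position holds a real token of the sentence and 0 where it is padding. Multiplying by the mask removes the padded positions from the loss, and dividing by torch.sum(mask) averages the loss over the real tokens only.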
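The effect of the mask can be verified on random data. In this sketch (toy sizes and random values chosen for illustration), the masked loss is compared with a direct average over the real tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 2, 4, 6
log_probs = F.log_softmax(torch.randn(batch, seq_len, vocab), dim=-1)
target = torch.randint(0, vocab, (batch, seq_len))
lengths = torch.tensor([4, 2])   # the second sentence has 2 padded steps

# mask[b, t] is 1.0 while t < lengths[b], else 0.0 (built by broadcasting)
mask = (torch.arange(seq_len)[None, :] < lengths[:, None]).float()

token_nll = -log_probs.gather(2, target.unsqueeze(-1)).squeeze(-1)   # [batch, seq_len]
masked_loss = (token_nll * mask).sum() / mask.sum()

# Reference: average the per-token loss over the real tokens only
reference = torch.cat([token_nll[0, :4], token_nll[1, :2]]).mean()
print(mask)
print(torch.allclose(masked_loss, reference))
```

The same broadcasting trick (arange compared against the lengths) is what the training loop uses to build mb_out_mask.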

Instantiate the model and define the optimizer:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dropout=0.2
hidden_size=100

encoder=PlainEncoder(vocab_size=en_total_words,
                    hidden_size=hidden_size,
                    dropout=dropout)

decoder=PlainDecoder(vocab_size=cn_total_words,
                     hidden_size=hidden_size,
                     dropout=dropout)

model=PlainSeq2Seq(encoder,decoder)
model=model.to(device)

loss_fn=LanguageModelCriterion().to(device)
optimizer=torch.optim.Adam(model.parameters())

Training and Testing

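In practice, training a good machine translation model requires a large corpus and can take a long time (the author cites around two weeks). The dataset in this experiment is small and training is quick. The training function is: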

def train(model,data,num_epochs=30):
    for epoch in range(num_epochs):
        model.train()
        for it,(mb_x,mb_x_len,mb_y,mb_y_len) in enumerate(data):
            # mb_x [batch_size, longest English sentence in this batch]
            # mb_x_len [batch_size,]
            # mb_y [batch_size, longest Chinese sentence in this batch]
            # mb_y_len [batch_size,]
            # torch.from_numpy(ndarray) converts an ndarray to a tensor
            mb_x=torch.from_numpy(mb_x).to(device).long()
            mb_x_len=torch.from_numpy(mb_x_len).to(device).long()
            
            # For the Chinese sentence: BOS 为 什 么 是 我 EOS
            # the decoder input is:    BOS 为 什 么 是 我
            # and the label is:        为 什 么 是 我 EOS
            mb_input=torch.from_numpy(mb_y[:,:-1]).to(device).long()
            mb_output=torch.from_numpy(mb_y[:,1:]).to(device).long()
            
            mb_y_len=torch.from_numpy(mb_y_len-1).to(device).long()
            # Guard against sentences whose length became 0
            mb_y_len[mb_y_len<=0]=1
            
            mb_pred=model(mb_x,mb_x_len,mb_input,mb_y_len)
            
            # The mask marks real tokens versus padding
            # Positions 0 .. (longest target length - 1)
            mb_out_mask=torch.arange(mb_y_len.max().item(),device=device) # [max_len]
            # Broadcast comparison of [1,max_len] against [batch_size,1]:
            # for each sentence, positions smaller than its length are real tokens
            mb_out_mask=mb_out_mask.unsqueeze(dim=0) < mb_y_len.unsqueeze(dim=-1) # [batch_size,max_len]
            
            mb_out_mask=mb_out_mask.float()
            
            loss=loss_fn(mb_pred,mb_output,mb_out_mask)
            
            # Compute gradients and update the model
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping: see Lesson 5, Language Models
            torch.nn.utils.clip_grad_norm_(model.parameters(),5.0)
            optimizer.step()
            
            if it % 100 == 0:
                print("Epoch:",epoch,"iter:",it,"loss:",loss.item())
                
        # Evaluate on the dev set
        if epoch % 5 == 0:
            evaluate(model, dev_data)

Note the validation function evaluate: it runs one epoch of forward passes on dev and prints the average per-token loss:

def evaluate(model, data):
    model.eval()
    total_num_words = total_loss = 0.
    with torch.no_grad():
        for it, (mb_x, mb_x_len, mb_y, mb_y_len) in enumerate(data):
            mb_x = torch.from_numpy(mb_x).to(device).long()
            mb_x_len = torch.from_numpy(mb_x_len).to(device).long()
            mb_input = torch.from_numpy(mb_y[:, :-1]).to(device).long()
            mb_output = torch.from_numpy(mb_y[:, 1:]).to(device).long()
            mb_y_len = torch.from_numpy(mb_y_len-1).to(device).long()
            mb_y_len[mb_y_len<=0] = 1

            mb_pred = model(mb_x, mb_x_len, mb_input, mb_y_len)

            mb_out_mask = torch.arange(mb_y_len.max().item(), device=device)[None, :] < mb_y_len[:, None]
            mb_out_mask = mb_out_mask.float()

            loss = loss_fn(mb_pred, mb_output, mb_out_mask)

            num_words = torch.sum(mb_y_len).item()
            total_loss += loss.item() * num_words
            total_num_words += num_words
    print("Evaluation loss", total_loss/total_num_words)

Train the model:

train(model, train_data, num_epochs=20)

"""
Epoch: 0 iter: 0 loss: 8.087753295898438
Epoch: 0 iter: 100 loss: 5.2040791511535645
Epoch: 0 iter: 200 loss: 5.6370744705200195
Evaluation loss 4.843846693269169
...
Epoch: 15 iter: 0 loss: 2.4589157104492188
Epoch: 15 iter: 100 loss: 2.647231101989746
Epoch: 15 iter: 200 loss: 3.7229344844818115
Evaluation loss 3.2680165134782917
...
Epoch: 19 iter: 0 loss: 2.302227020263672
Epoch: 19 iter: 100 loss: 2.4202232360839844
Epoch: 19 iter: 200 loss: 3.61832857131958
"""

Use the model for machine translation:

# Translate a dev sentence with the model
def translate_dev(i:"index of the sentence"):
    en_sent = " ".join([inv_en_dict[w] for w in dev_en[i]])
    print(en_sent)
    cn_sent = " ".join([inv_cn_dict[w] for w in dev_cn[i]])
    print("".join(cn_sent))

    mb_x = torch.from_numpy(np.array(dev_en[i]).reshape(1, -1)).long().to(device)
    mb_x_len = torch.from_numpy(np.array([len(dev_en[i])])).long().to(device)
    
    # cn_dict is the Chinese word_to_idx {word:idx}
    bos = torch.Tensor([[cn_dict["BOS"]]]).long().to(device)  # [batch_size=1,seq_len=1]

    translation= model.translate(mb_x, mb_x_len, bos) # [batch_size=1,max_length=10]
    translation = [inv_cn_dict[i] for i in translation.data.cpu().numpy().reshape(-1)]
    trans = []
    for word in translation:
        if word != "EOS":
            trans.append(word)
        else:
            break
    print("".join(trans))
    
for i in range(100,120):
    print("句子序号:",i)
    translate_dev(i)

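The results are as follows (句子序号 means "sentence index"; each group shows the English source, the reference Chinese, and the model's translation):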

句子序号: 100
BOS you have nice skin . EOS
BOS 你的皮膚真好。 EOS
你最好的很好。
句子序号: 101
BOS you 're UNK correct . EOS
BOS 你部分正确。 EOS
你是个好人的。
句子序号: 102
BOS everyone admired his courage . EOS
BOS 每個人都佩服他的勇氣。 EOS
大家都沒有人。
句子序号: 103
BOS what time is it ? EOS
BOS 几点了？ EOS
那裡有什么？
句子序号: 104
BOS i 'm free tonight . EOS
BOS 我今晚有空。 EOS
我有一個好人。
句子序号: 105
BOS here is your book . EOS
BOS 這是你的書。 EOS
這是你的朋友。
句子序号: 106
BOS they are at lunch . EOS
BOS 他们在吃午饭。 EOS
他們在家裡。
…
句子序号: 119
BOS i made a mistake . EOS
BOS 我犯了一個錯。 EOS
我有一個漂亮的。

Adding Luong Attention

Luong Attention

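The two common attention mechanisms are Bahdanau Attention and Luong Attention. They are similar in theory, and Luong Attention is generally the more widely used. Attention typically combines the outputs of the original Encoder and the original Decoder into a new output:
fig2
Let the Encoder output be the sequence $o_{s}$ (each element is a word vector) and the original Decoder output be the sequence $o_{c}$. The Attention layer processes the two feature sequences as follows: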

Compute the score. The word vectors in $o_{s}$ first pass through a fully connected layer, $W_{a}o_{s}$, which maps them to another representation. Each word vector in $o_{c}$ is then dotted with every transformed vector in $W_{a}o_{s}$:
For one Seq, suppose $o_{s}$ has shape $[\text{EnglishSeqLen}, \text{EncoderHiddenSize}]$;
$o_{c}$ has shape $[\text{ChineseSeqLen}, \text{DecoderHiddenSize}]$;
then $W_{a}$ corresponds to the PyTorch layer nn.Linear(enc_hidden_size, dec_hidden_size, bias=False);
taking the dot product of each word in $o_{c}$ with every row of $W_{a}o_{s}$ gives, for each word in $o_{c}$, a vector of shape $[\text{EnglishSeqLen},]$. The score is:
$score(o_{c},o_{s})=o_{c}(W_{a}o_{s})^{\top}$
Apply softmax to the score to obtain the weights $a(score)$:
$a(score)=\mathrm{softmax}(score, dim=-1)$
The tensor $score$ has shape $[\text{ChineseSeqLen}, \text{EnglishSeqLen}]$. After the softmax over the last dimension, row $i$ gives the importance of each English token for the $i$-th Chinese token;
Blend the weights back into the Encoder output $o_{s}$ to get $o_{new}$:
$o_{new}=a(score)\,o_{s}$
$o_{new}$ has shape $[\text{ChineseSeqLen}, \text{EncoderHiddenSize}]$ and can be seen as a feature that carries the importance of each English token. It is concatenated with the original Decoder output $o_{c}$ along the last dimension, giving a tensor of shape:
$[\text{ChineseSeqLen}, \text{EncoderHiddenSize}+\text{DecoderHiddenSize}]$
A fully connected layer then restores the dimension. Overall:
$o_{h}=\tanh(W_{c}[o_{new};o_{c}])$
$o_{h}$ usually has the same shape as $o_{c}$: $[\text{ChineseSeqLen}, \text{DecoderHiddenSize}]$;
Map $o_{h}$ to shape $[\text{ChineseSeqLen}, \text{VocabSize}]$, a score for every vocabulary word at each position; taking the highest-scoring word at each position gives the translation.

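In the process above, the learnable parameters are $W_{a}$ and $W_{c}$. $W_{a}$ transforms the Encoder output features so that their dot product with the features of a Chinese token measures the importance of each English token. In other words, for a given Chinese token it yields how much each English token matters to it, which is the "attention". $W_{c}$ mainly maps the concatenated tensor back to the Decoder dimension.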
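The computation can be traced for a single Seq with a few tensor operations. This is a minimal sketch under assumed toy sizes; the sizes and variable names are chosen for illustration. In the implementation later in this lesson the Encoder is bidirectional, so the encoder feature size there is 2*enc_hidden_size, and the operations are batched with torch.bmm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
en_len, cn_len = 5, 3            # EnglishSeqLen, ChineseSeqLen
enc_hidden, dec_hidden = 8, 6    # EncoderHiddenSize, DecoderHiddenSize

o_s = torch.randn(en_len, enc_hidden)   # Encoder outputs
o_c = torch.randn(cn_len, dec_hidden)   # raw Decoder outputs

W_a = nn.Linear(enc_hidden, dec_hidden, bias=False)
W_c = nn.Linear(enc_hidden + dec_hidden, dec_hidden)

score = o_c @ W_a(o_s).T                 # [cn_len, en_len]
a = F.softmax(score, dim=-1)             # each row sums to 1
o_new = a @ o_s                          # [cn_len, enc_hidden]
o_h = torch.tanh(W_c(torch.cat([o_new, o_c], dim=-1)))   # [cn_len, dec_hidden]

print(score.shape, o_new.shape, o_h.shape)
print(a.sum(dim=-1))                     # one attention distribution per Chinese token
```

Each row of a is a distribution over the English positions, so o_new is a weighted average of Encoder outputs tailored to one Chinese token.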

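In the original Encoder-Decoder model, all information about the English sentence is compressed into the Encoder's output hidden state, which inevitably loses a lot of information and hurts the translation. With attention, each output vector of the original Decoder is fused with the information of the English tokens most relevant to it, which can improve the accuracy of the translated Chinese token.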

Encoder-Decoder with Luong Attention

Implement the Encoder following the description above:

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, enc_hidden_size, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(enc_hidden_size * 2, dec_hidden_size)

    def forward(self, x, lengths):
        sorted_len, sorted_idx = lengths.sort(0, descending=True)
        x_sorted = x[sorted_idx.long()]
        embedded = self.dropout(self.embed(x_sorted))
        
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, sorted_len.long().cpu().data.numpy(), batch_first=True)
        packed_out, hid = self.rnn(packed_embedded)
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        out = out[original_idx.long()].contiguous() # [batch_size, seq_len, num_directions=2 * enc_hidden_size]
        hid = hid[:, original_idx.long()].contiguous() # [num_layers=1 * num_directions=2, batch_size, enc_hidden_size]
        
        # hid[m] [batch_size, enc_hidden_size]
        hid = torch.cat([hid[-2], hid[-1]], dim=1) # [batch_size, enc_hidden_size*2]
        hid = torch.tanh(self.fc(hid)).unsqueeze(0) # [1,batch_size,dec_hidden_size]

        return out, hid

Implement the Attention layer:

class Attention(nn.Module):
    def __init__(self, enc_hidden_size, dec_hidden_size):
        super().__init__()
        self.enc_hidden_size = enc_hidden_size
        self.dec_hidden_size = dec_hidden_size
        self.linear_in = nn.Linear(enc_hidden_size*2, dec_hidden_size, bias=False)
        self.linear_out = nn.Linear(enc_hidden_size*2 + dec_hidden_size, dec_hidden_size)
        
    def forward(self, output:"decoder GRU output (seq_len is the Chinese length)", context:"encoder output (seq_len is the English length)", mask):
        # output: [batch_size, output_len, dec_hidden_size]
        # context: [batch_size, context_len, 2*enc_hidden_size]
    
        batch_size = output.size(0)
        output_len = output.size(1)
        input_len = context.size(1)
        
        context_in = self.linear_in(context.view(batch_size*input_len, -1)).view(                
            batch_size, input_len, -1) # batch_size, context_len, dec_hidden_size
        
        # context_in.transpose(1,2): batch_size, dec_hidden_size, context_len 
        # output: batch_size, output_len, dec_hidden_size
        attn = torch.bmm(output, context_in.transpose(1,2))  # batch_size, output_len, context_len

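        # mask has the same shape as attn; True (1) marks positions that are not real tokens
        # tensor.masked_fill(mask, value) returns a new tensor with value written where mask is True,
        # so the result must be assigned back (the in-place variant is masked_fill_)
        # A large negative score makes softmax give those positions a weight of about 0
        attn = attn.masked_fill(mask.bool(), -1e6)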

        attn = F.softmax(attn, dim=2)  # batch_size, output_len, context_len

        context = torch.bmm(attn, context) 
        # batch_size, output_len, 2*enc_hidden_size
        
        output = torch.cat((context, output), dim=2) # batch_size, output_len, enc_hidden_size*2 + dec_hidden_size

        output = output.view(batch_size*output_len, -1)
        output = torch.tanh(self.linear_out(output))
        output = output.view(batch_size, output_len, -1) # batch_size, output_len, dec_hidden_size
        return output, attn

The new Decoder is the original Decoder plus the Attention layer:

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super(Decoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.attention = Attention(enc_hidden_size, dec_hidden_size)
        self.rnn = nn.GRU(embed_size, dec_hidden_size, batch_first=True)
        self.out = nn.Linear(dec_hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_mask(self, y_len:"Chinese lengths", x_len:"English lengths"):
        max_y_len = y_len.max()
        max_x_len = x_len.max()
        x_mask = torch.arange(max_x_len, device=x_len.device)[None, :] < x_len[:, None] # [batch_size,max_x_len]
        y_mask = torch.arange(max_y_len, device=x_len.device)[None, :] < y_len[:, None] # [batch_size,max_y_len]
        # The product below is True where Chinese position i and English position j are both real tokens;
        # ~ inverts it, so in the returned mask True marks the positions to be masked out
        mask = ~(y_mask[:, :, None] & x_mask[:, None, :]) # [batch_size,max_y_len,max_x_len]
        return mask
    
    def forward(self, ctx:"encoder output", ctx_lengths, y, y_lengths, hid:"[1,batch_size,dec_hidden_size]"):
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()] # [batch_size,y_len]
        hid = hid[:, sorted_idx.long()]
        
        y_sorted = self.dropout(self.embed(y_sorted)) # [batch_size,y_len,embed_size]

        packed_seq = nn.utils.rnn.pack_padded_sequence(y_sorted, sorted_len.long().cpu().data.numpy(), batch_first=True)
        out, hid = self.rnn(packed_seq, hid)
        unpacked, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True) # [batch_size,y_len,dec_hidden_size]
        
        _, original_idx = sorted_idx.sort(0, descending=False)
        
        output_seq = unpacked[original_idx.long()].contiguous() # [batch_size,y_len,dec_hidden_size]
        
        hid = hid[:, original_idx.long()].contiguous() # [1,batch_size,dec_hidden_size]

        mask = self.create_mask(y_lengths, ctx_lengths) # [batch_size,y_len,x_len]

        # output [batch_size, y_len, dec_hidden_size]
        # attn [batch_size,y_len,x_len]
        output, attn = self.attention(output_seq, ctx, mask)
        
        output = F.log_softmax(self.out(output), dim=-1) # batch_size, output_len, vocab_size
        
        return output, hid, attn

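To keep padding out of the attention weights, a mask is applied inside the Attention layer:
$[\text{BatchSize}, \text{ChineseSeqLen}, \text{EnglishSeqLen}]$
Take the $s$-th Seq in a batch as an example (suppose its English side has n tokens, $n < \text{EnglishSeqLen}$). For the $i$-th Chinese token of that Seq (row $i$ of the matrix mask[s]), only the first n entries (the real English tokens) are valid. The remaining positions are mapped onto $score(o_{c},o_{s})$ and the $score$ values there are set to a very large negative number (-1e6 in the code), so that $a(score)$ gives them a weight of approximately 0. This concentrates the attention on the English tokens that actually belong to the Seq.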
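A small numeric sketch of the masking step, with made-up scores: one Chinese token attends over four English positions, the last two of which are padding.

```python
import torch
import torch.nn.functional as F

score = torch.tensor([[2.0, 1.0, 0.5, 3.0]])            # scores for 4 English positions
invalid = torch.tensor([[False, False, True, True]])    # True marks padded positions

# masked_fill is out-of-place: it returns a new tensor and leaves `score` unchanged
masked = score.masked_fill(invalid, -1e6)
weights = F.softmax(masked, dim=-1)

print(weights)   # roughly 0.73 and 0.27 for the real tokens, 0 for the padding
print(score)     # the original scores are untouched
```

Note that the padded position with the highest raw score (3.0) would have dominated the softmax without the mask, which is exactly the failure the mask prevents.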

Combine the Encoder and the new Decoder into a Seq2Seq model:

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out, hid = self.encoder(x, x_lengths)
        output, hid, attn = self.decoder(ctx=encoder_out, 
                    ctx_lengths=x_lengths,
                    y=y,
                    y_lengths=y_lengths,
                    hid=hid)
        return output
    
    def translate(self, x, x_lengths, y, max_length=100):
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid, attn = self.decoder(ctx=encoder_out, 
                    ctx_lengths=x_lengths,
                    y=y,
                    y_lengths=torch.ones(batch_size).long().to(y.device),
                    hid=hid)
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)
            attns.append(attn)
        return torch.cat(preds, dim=1)

Instantiate the model:

dropout = 0.2
embed_size = hidden_size = 100
encoder = Encoder(vocab_size=en_total_words,
                       embed_size=embed_size,
                      enc_hidden_size=hidden_size,
                       dec_hidden_size=hidden_size,
                      dropout=dropout)
decoder = Decoder(vocab_size=cn_total_words,
                      embed_size=embed_size,
                      enc_hidden_size=hidden_size,
                       dec_hidden_size=hidden_size,
                      dropout=dropout)

model = Seq2Seq(encoder, decoder)
model = model.to(device)
loss_fn = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())

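With attention added, the model's inputs and outputs are the same as those of the original Encoder-Decoder, so the previously defined training, validation, and translation functions can be reused. Training proceeds as follows: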

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

train(model, train_data, num_epochs=20)

"""
Epoch: 0 iter: 0 loss: 8.07270622253418
Epoch: 0 iter: 100 loss: 5.43229866027832
Epoch: 0 iter: 200 loss: 5.791189670562744
Evaluation loss 5.098033384093281
...
Evaluation loss 2.9741732850286744
...
Epoch: 19 iter: 0 loss: 1.8926399946212769
Epoch: 19 iter: 100 loss: 1.9622280597686768
Epoch: 19 iter: 200 loss: 3.3936874866485596
"""

Run machine translation:

for i in range(100,120):
    print("句子序号:",i)
    translate_dev(i)

The results are:

句子序号: 100
BOS you have nice skin . EOS
BOS 你的皮膚真好。 EOS
你不是否明白。
句子序号: 101
BOS you 're UNK correct . EOS
BOS 你部分正确。 EOS
你是个想的。
句子序号: 102
BOS everyone admired his courage . EOS
BOS 每個人都佩服他的勇氣。 EOS
每個人都都在家了。
句子序号: 103
BOS what time is it ? EOS
BOS 几点了？ EOS
它是什麼時候的？
句子序号: 104
BOS i 'm free tonight . EOS
BOS 我今晚有空。 EOS
我今晚了。
句子序号: 105
BOS here is your book . EOS
BOS 這是你的書。 EOS
这里有你的書。
句子序号: 106
BOS they are at lunch . EOS
BOS 他们在吃午饭。 EOS
他們在午餐。
…
句子序号: 119
BOS i made a mistake . EOS
BOS 我犯了一個錯。 EOS
我做了一個錯誤。