Understanding Attention and Seq2Seq Through NMT in Practice

beyourselfwb

Article link: https://blog.csdn.net/baidu_20163013/article/details/94461060

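I have recently been reading research on NMT (neural machine translation). There are many papers, and every few months a new one proposes a new model or an improvement. As a beginner, I think it is better to first understand the basic ideas and to practice by implementing the simplest model.

This post uses the PyTorch tutorial "Translation with a Sequence to Sequence Network and Attention" as a worked example to introduce Seq2Seq and the attention mechanism, and along the way to look at the simplest NMT model. Without further ado, let's get started.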

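Task overview

The task is simple: French -> English translation. In the examples below, > marks the source-language input sentence, = marks the target-language reference sentence, and < marks the translation produced by the model.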

[KEY: > input, = target, < output]

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .

> vous etes trop maigre .
= you re too skinny .
< you re all alone .

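The seq2seq network in this post follows Google's NIPS 2014 paper, Sequence to Sequence Learning with Neural Networks. The overall architecture is shown in the figure below:

[Figure: overall Encoder-Decoder (seq2seq) architecture]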

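The input is a sequence of French words with an <EOS> token (End of Sentence) appended. Each word is converted to its index in the vocabulary and fed to the Encoder one at a time. The Encoder contains a GRU (a variant of the RNN) that consumes the inputs step by step until it reaches <EOS>. The Encoder's final hidden state (for a single-layer GRU this is the same as its last output vector) then becomes the initial hidden state of the Decoder. On the Decoder side, the target-language sequence, with an <SOS> token (Start of Sentence) prepended, is fed in one token at a time, and the Decoder emits "the cat is black" word by word before stopping.

This classic Encoder-Decoder framework relies only on the single fixed-length vector produced by the Encoder when decoding, so quality degrades badly on long input sequences. The attention mechanism was proposed to address this. The attention version in this post follows Bahdanau et al.'s ICLR 2015 paper, Neural Machine Translation by Jointly Learning to Align and Translate.

Zhang Junlin's blog has a detailed introduction to the essence of attention. The gist is that when the Decoder generates each output word, the words of the Encoder's input sequence influence it to different degrees. For example, given the English input "Tom chase Jerry", the Encoder-Decoder framework generates the Chinese words "汤姆", "追逐", "杰瑞" one by one. When producing "杰瑞", the word "Jerry" clearly matters most, but the classic model has no way to express this, which is why attention is introduced. The rest of this post focuses on understanding attention at the code level.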

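Code walkthrough

The first step is to import the dependencies, mostly torch modules. Because the French text contains accented characters, unicodedata is imported as well: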

from __future__ import unicode_literals, print_function, division

import random
import re
import unicodedata
from io import open

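import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

# Used by initHidden() and the tensor helpers below; falls back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")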

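Data preprocessing

The language class follows a common pattern: it builds the vocabulary of one language. word2index maps a word to its index, which is what the model takes as input; index2word maps an index output by the model back to the corresponding word; SOS and EOS are two special tokens marking the start and end of a sentence; n_words is the vocabulary size.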

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

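Converting special characters. I did not dig into the details; I unit-tested the function on a few French sentences and looked at the output. Roughly, it turns accented characters such as à and è, which look like English letters, into the plain letters.

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
# Converts accented characters such as à and è to plain a and e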
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
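To see what the function does, here is a small standalone check. It is a sketch that re-declares the same logic under the hypothetical name unicode_to_ascii and runs it on a few French words: NFD normalization splits an accented letter into a base letter plus a combining mark (Unicode category Mn), and dropping the marks leaves the base letter.

```python
import unicodedata


def unicode_to_ascii(s):
    # NFD splits "è" into "e" + a combining accent (category "Mn");
    # filtering out "Mn" characters keeps only the base letters.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


print(unicode_to_ascii("poète"))         # poete
print(unicode_to_ascii("romancière"))    # romanciere
print(unicode_to_ascii("délicieux ça"))  # delicieux ca
```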

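Strip out stray characters, leaving words and punctuation separated by spaces. These two regular expressions confused me at first; once I pulled each one out and unit-tested it on its own, it was clear what each does. This is my go-to trick for understanding a complex system or piece of code, informally known as taking the wheel apart.

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # Insert a space before each . ! ?
    s = re.sub(r"([.!?])", r" \1", s)
    # Replace every run of one or more characters other than a-z, A-Z, . ! ? with a single space
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)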
    return s
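The "take the wheel apart" approach for the two regular expressions might look like the sketch below. Each substitution is applied on its own to a made-up sample string, so its effect can be read directly from the output.

```python
import re

# Step 1 in isolation: put a space in front of sentence punctuation.
s1 = "vous etes trop maigre!"
step1 = re.sub(r"([.!?])", r" \1", s1)
print(step1)  # vous etes trop maigre !

# Step 2 in isolation: any run of characters that are not letters
# or . ! ? (apostrophes, commas, spaces, digits) becomes one space.
s2 = "elle n'est pas poete, mais romanciere ."
step2 = re.sub(r"[^a-zA-Z.!?]+", r" ", s2)
print(step2)  # elle n est pas poete mais romanciere .
```

This also explains why the sample sentences contain forms such as "n est" and "you re": the apostrophe is replaced by a space.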

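Reading the data; the download link is here. In the raw data each line has English on the left and French on the right. This post translates French to English, so the pairs have to be reversed, i.e. reverse=True. The sentence pairs that are read are returned in pairs.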

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

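Notes on some constants used in the code:

# <SOS> and <EOS> have indices 0 and 1 in the vocabulary
SOS_token = 0
EOS_token = 1

# From the corpus, keep only sentences shorter than 10 tokens
# whose English side starts with one of eng_prefixes;
# these form the dataset for this task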
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

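# Probability of teacher forcing: with this probability the target sequence
# is fed as the decoder input; otherwise the decoder's own previous output is fed
teacher_forcing_ratio = 0.5

# Size of the hidden layer, which is also the word-embedding dimension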
hidden_size = 256

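Filter condition and filtering:

# Keep only pairs where both sentences have fewer than 10 tokens and the English
# sentence (p[1], since the pairs were reversed) starts with one of eng_prefixes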
def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH and p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
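A quick way to check the filter is to run it on a handful of pairs. The sketch below restates the condition under the hypothetical name filter_pair and uses three made-up [French, English] pairs: one that passes, one rejected by the prefix rule, and one rejected by the length rule.

```python
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filter_pair(p):
    # p[0] is the French sentence, p[1] the English one
    return (len(p[0].split(' ')) < MAX_LENGTH
            and len(p[1].split(' ')) < MAX_LENGTH
            and p[1].startswith(eng_prefixes))


# Made-up pairs for illustration only
pairs = [
    ["je suis fatigue .", "i am tired ."],
    ["le chat est noir .", "the cat is black ."],
    ["il est l un des meilleurs joueurs que nous ayons jamais eus dans cette equipe .",
     "he is one of the best players we have ever had on this team ."],
]

kept = [p for p in pairs if filter_pair(p)]
print(kept)  # [['je suis fatigue .', 'i am tired .']]
```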

Preparing the data:

# The raw corpus has 135,842 records
# After filtering, 10,599 parallel sentence pairs remain
# Vocabulary sizes:
# fra 4345
# eng 2803
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

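The model

Now for the main part. Start with the Encoder. Its structure is simple: one Embedding and one GRU. The data flow is easy to follow from the model diagram, but at the code level you have to work out the shape of every variable and how it is transformed. forward functions are full of reshaping calls such as view(1, 1, -1), squeeze and unsqueeze, matrix multiplications such as torch.bmm, and softmax normalization such as F.softmax(matrix, dim=1). Each of these is worth pulling out and testing with a small example to learn what its arguments mean and what it does. In short: whichever wheel you do not understand, take it off and study it.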

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
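The reshaping calls mentioned above can be tried out in isolation. Below is a minimal sketch with toy sizes (a 10-word vocabulary and 4-dimensional vectors, chosen only for illustration, not the 256 used in this post) that prints the shape after each step of an Encoder-style forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(10, 4)      # toy sizes: 10 words, 4-dim vectors
gru = nn.GRU(4, 4)                   # input_size=4, hidden_size=4

word = torch.tensor([3])             # one word index, shape [1]
embedded = embedding(word)           # shape [1, 4]
step = embedded.view(1, 1, -1)       # shape [1, 1, 4]: (seq_len, batch, features)

hidden = torch.zeros(1, 1, 4)        # same layout as initHidden()
output, hidden = gru(step, hidden)   # both have shape [1, 1, 4]

scores = torch.randn(1, 5)
weights = F.softmax(scores, dim=1)   # normalizes along dim 1; the row sums to 1

print(embedded.shape, step.shape)    # torch.Size([1, 4]) torch.Size([1, 1, 4])
print(output.shape, hidden.shape)    # torch.Size([1, 1, 4]) torch.Size([1, 1, 4])
print(weights.unsqueeze(0).shape)    # torch.Size([1, 1, 5])
print(weights.squeeze(0).shape)      # torch.Size([5])
```

With a single-layer GRU and a single time step, output and hidden hold the same values, which is why the Encoder's last output and its final hidden state can be used interchangeably here.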

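The Encoder's model structure is shown below:

[Figure: Encoder model structure]

The Decoder without attention is shown below. One thing I do not understand is the purpose of the relu in forward. What would happen without this activation? Vanishing gradients? Exploding gradients?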

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

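        # Embedding: the first argument is the vocabulary size, the second the embedding dimension
        # The Decoder picks its output from the target vocabulary, so num_embeddings = output_size
        # Here the embedding dimension equals the hidden size, so embedding_dim = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        # GRU takes two size arguments; here both are hidden_size (256):
        # input_size  = size of each input vector (the embedding dimension)
        # hidden_size = size of the hidden state
        self.gru = nn.GRU(hidden_size, hidden_size)
        # Linear takes two arguments: in_features, out_features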
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

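    # An RNN-style module takes two inputs: the current input and the previous hidden state
    # input: shape [1, 1] for the initial <SOS>; in the training loop later steps pass shape [1] or a 0-dim tensor
    def forward(self, input, hidden):
        # embedded shape: [1, 1, 256] only when input has shape [1, 1]
        embedded = self.embedding(input)
        # view normalizes every case to [1, 1, 256], the (seq_len, batch, features) layout GRU expects
        output = embedded.view(1, 1, -1)
        # It is unclear to me why relu is applied here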
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output[0])
        output = self.softmax(output)
        return output, hidden

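The DecoderRNN model structure is shown below:

[Figure: DecoderRNN model structure]

As noted earlier, to improve the translation of long sentences, this post adds an attention mechanism on top of the Encoder-Decoder framework; this kind is usually called soft attention. That said, the training pairs here are all under 10 words, so the difference between using attention and not using it does not show.

The decoder actually used in this post is AttnDecoderRNN: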

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
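        # embedded shape : [1, 1, 256] (when input has shape [1, 1])
        # hidden shape : [1, 1, 256]
        embedded = self.embedding(input)
        # view normalizes inputs of shape [1, 1], [1] or 0-dim to [1, 1, 256]
        embedded = embedded.view(1, 1, -1)
        # Dropout is applied to the input embedding as well (I am not sure why)
        embedded = self.dropout(embedded)

        ####### Part 1: compute the attention weights #######
        # (from the input embedding and the previous hidden state;
        #  the encoder outputs are not used at this step)

        # embedded[0] and hidden[0] both have shape [1, 256]
        # concatenated along dim 1 they become [1, 512]
        cat_res = torch.cat((embedded[0], hidden[0]), 1)

        # A fully connected layer maps [1, 512] to [1, max_length] = [1, 10]
        attn_res = self.attn(cat_res)

        # Softmax over dim 1, so the 10 weights sum to 1
        attn_weights = F.softmax(attn_res, dim=1)

        ####### Part 2: context vector = weights * values #######

        # bmm: batch matrix multiplication
        # unsqueeze(0) turns [1, 10] into [1, 1, 10]
        # [1, 1, 10] times [1, 10, 256]
        # gives [1, 1, 256]
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        ### I do not fully understand the purpose of the steps below ###

        # Concatenate two [1, 256] tensors again, giving [1, 512]
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        # Another fully connected layer maps it back to [1, 256]
        # unsqueeze(0) then makes it [1, 1, 256]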
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
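Written as equations, the forward pass above can be summarized as follows. The symbols are my own notation rather than anything from the tutorial: e_t is the embedding of the current decoder input (after dropout), h_{t-1} the previous hidden state, o_i the i-th row of encoder_outputs, L = max_length, and W_a, W_c, W_o with their biases stand for the weights of attn, attn_combine and out.

```latex
\begin{aligned}
a_t &= \operatorname{softmax}\big(W_a\,[e_t ; h_{t-1}] + b_a\big) \in \mathbb{R}^{L} \\
c_t &= \sum_{i=1}^{L} a_{t,i}\, o_i \\
x_t &= \operatorname{ReLU}\big(W_c\,[e_t ; c_t] + b_c\big) \\
h_t &= \operatorname{GRU}(x_t,\, h_{t-1}) \\
p_t &= \operatorname{log\,softmax}\big(W_o\, h_t + b_o\big)
\end{aligned}
```

Read this way, a_t depends only on e_t and h_{t-1}; the encoder outputs enter only through the weighted sum c_t.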

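A rough structure diagram of the Attention Decoder is shown below:

[Figure: rough structure of the Attention Decoder]

It is fairly easy to follow. The Decoder has 3 inputs: Input (the current word of the Decoder-side input sequence), Hidden (the hidden state output by the previous step), and Encoder outputs (the list of outputs from every Encoder timestep). The attention weights are computed from Input and Hidden and then combined with Encoder outputs. Note that the code does this with torch.bmm, a batched matrix multiplication of [1, 1, 10] by [1, 10, 256]: each Encoder output row is scaled by its weight and the scaled rows are summed into one context vector. The element-wise product, which multiplies values at corresponding positions and differs from a dot product or a matrix multiplication, describes only the per-row scaling step; it is illustrated in the figure below:

[Figure: element-wise product]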
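To check what the torch.bmm call in the decoder computes, the sketch below compares it with an explicit weighted sum of the encoder output rows. The tensors are random values with the shapes from the code comments, not real model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_length, hidden_size = 10, 256

attn_weights = F.softmax(torch.randn(1, max_length), dim=1)  # [1, 10]
encoder_outputs = torch.randn(max_length, hidden_size)       # [10, 256]

# What the decoder does: [1, 1, 10] x [1, 10, 256] -> [1, 1, 256]
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                         encoder_outputs.unsqueeze(0))

# The same thing spelled out: scale each row by its weight
# (element-wise, with broadcasting), then sum over the 10 rows.
scaled_rows = attn_weights.squeeze(0).unsqueeze(1) * encoder_outputs  # [10, 256]
weighted_sum = scaled_rows.sum(dim=0)                                 # [256]

print(attn_applied.shape)  # torch.Size([1, 1, 256])
print(torch.allclose(attn_applied[0, 0], weighted_sum, atol=1e-5))
```

So the element-wise scaling is only half of the operation; the sum over timesteps is what collapses the 10 rows into a single context vector.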

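In short, that is how attention works. For the data flow of the whole Attention Decoder, see the diagram below:

[Figure: data flow of the Attention Decoder]

By stepping through with a debugger I worked out the shape of every Tensor (noted in the code comments above), and the arithmetic is clear. But even knowing how it is computed, I still do not quite see why it is designed this way. Is there a theoretical basis, or a particular paper this implementation follows? For example, I do not understand the following points:

Why is the computation of the attention weights simply concatenating the input embedding with the previous hidden vector, passing the result through one fully connected (linear) layer, and applying softmax? Is it really that simple? Why would the result represent how much each Encoder timestep matters?
After attention_weights and encoder_outputs are multiplied, why is the result concatenated with the input embedding again, passed through another fully connected layer, and then through relu to add a non-linearity? What is this sequence of operations for?
What are the main advantages of GRU over LSTM?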

What follows converts each sentence into a Tensor. It is all simple, routine Python:


# Convert a sentence to a list of word indices
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


# Append the EOS token and convert the index list to a Tensor
def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


# Convert a sentence pair into a tuple of two Tensors
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
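To see the shapes these helpers produce, here is a small sketch with a hypothetical three-word vocabulary. The function name tensor_from_sentence and the dictionary are illustrative stand-ins for the Lang-based versions above, and the tensor is kept on the CPU.

```python
import torch

EOS_token = 1
# Hypothetical toy vocabulary; indices 0 and 1 are reserved for SOS and EOS
word2index = {"he": 2, "is": 3, "painting": 4}


def tensor_from_sentence(word2index, sentence):
    indexes = [word2index[word] for word in sentence.split(' ')]
    indexes.append(EOS_token)
    # view(-1, 1) gives one row per token: shape [num_tokens + 1, 1]
    return torch.tensor(indexes, dtype=torch.long).view(-1, 1)


t = tensor_from_sentence(word2index, "he is painting")
print(t.view(-1).tolist())  # [2, 3, 4, 1]
print(t.shape)              # torch.Size([4, 1])
print(t[0].shape)           # torch.Size([1])
```

Indexing one row, as the training loop does with input_tensor[ei] and target_tensor[di], yields a tensor of shape [1], which is why the models reshape the embedding with view(1, 1, -1).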

Training

# One training step (one iteration) on a single sentence pair, not a pass over the whole dataset
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    # Zero the optimizer gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # input_length and target_length are the numbers of tokens in the
    # encoder and decoder input sequences; they bound the loops below
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # encoder_outputs starts as zeros with max_length rows, each of the encoder's hidden_size
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Encoder loop: one step per token of the input sentence
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        # encoder_outputs stores the output of every time step;
        # it is later multiplied by the attention weights to give a context vector
        encoder_outputs[ei] = encoder_output[0, 0]

    # The Decoder input starts as the <SOS> token
    decoder_input = torch.tensor([[SOS_token]], device=device)

    # The encoder's final hidden state becomes the initial decoder_hidden
    decoder_hidden = encoder_hidden

    # With probability teacher_forcing_ratio (1/2 here):
    # True: feed the target-language words from the training pair as the decoder inputs
    # False: feed the word the decoder predicted at the previous time step as the next input
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            # Use the target-language word from the training pair as the next input
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            # topv: the largest log-probability (the decoder outputs log_softmax)
            # topi: the position of that largest value
            # decoder_output has shape [1, target_vocab_size],
            # so topi is the index of the predicted word in the vocabulary
            topv, topi = decoder_output.topk(1)
            # detach: the chosen index is used as plain input, with no gradient history
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
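The per-step loss and the greedy choice of the next input can be examined on their own. The sketch below uses random scores over a toy 6-word vocabulary and assumes criterion is nn.NLLLoss, which is an assumption on my part (the criterion is not shown above, but it is the natural match for log_softmax outputs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 6                                  # toy size for illustration

logits = torch.randn(1, vocab_size)
decoder_output = F.log_softmax(logits, dim=1)   # shape [1, 6], log-probabilities
target = torch.tensor([2])                      # like target_tensor[di], shape [1]

criterion = nn.NLLLoss()                        # assumed loss function
loss = criterion(decoder_output, target)        # equals -decoder_output[0, 2]

topv, topi = decoder_output.topk(1)             # both have shape [1, 1]
next_input = topi.squeeze().detach()            # 0-dim tensor holding a word index

print(topi.shape, next_input.dim())             # torch.Size([1, 1]) 0
print(loss.item(), -decoder_output[0, 2].item())
```

Summing this loss over the decoder steps and dividing by target_length, as train does in its return statement, gives an average loss per target token.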

That is roughly it. The code with detailed comments has been uploaded to my github.
