Deep NMT: An Introduction to Sequence to Sequence Learning with Neural Networks

Latest recommended article published on 2024-07-11 09:23:29

JohnBanana


Category: NLP  Tags: deep learning, algorithms

Article link: https://blog.csdn.net/john_hongming/article/details/108768608


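Overview:

Abstract: The paper proposes a general end-to-end approach to sequence-to-sequence learning in which both the Encoder and the Decoder are multilayer LSTMs. The model achieves very good results on machine translation.

Introduction: To handle variable-length inputs and variable-length outputs, LSTMs are used as the Encoder and the Decoder, with good results.

The Model: Two different LSTMs serve as the Encoder and the Decoder. The Encoder encodes the source sentence into a fixed-length vector, and the Decoder generates the target-language words one at a time.

Experiment: Experiments and results on WMT14 English-to-French translation.

Related Work: Related work.

Conclusion: Summary and outlook.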

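Introduction:

What the paper covers: The paper proposes an end-to-end approach to sequence learning. A multilayer LSTM maps the input sequence to a fixed-dimensional vector (encoding), and a second deep LSTM decodes the target sequence from that vector. The method is applied to the English-to-French translation task of the WMT-14 dataset. The LSTM also handles long sentences well, thanks to its ability to model long-term dependencies (by analogy, in traffic forecasting it can remember traffic states from long ago and use them when predicting the current state). The LSTM further learns phrase and sentence representations that are sensitive to word order and relatively insensitive to active versus passive voice. Finally, reversing the order of the words in every input sequence (but not the target sequence) markedly improves the LSTM's performance, because it introduces many short-term dependencies between the input and the target, which makes the optimization problem easier. The effect of these short-term dependencies can be understood from two angles: (1) Reversal shortens the distance between the first words of the output sequence and their corresponding input words. With reversed input, the last word the model reads is the first word of the source sentence, so it sits right next to the first output word and more of its information is retained. Compared with reading the input in its original order, this greatly shortens the distance between the first input word and the first output word. (2) The output sequence is predicted as a chain of conditional probabilities, so when the earlier words are predicted more accurately, the words later in the sequence are predicted more accurately as well.

Background: DNNs are powerful models that perform very well on difficult learning tasks such as speech recognition and object recognition. The reason is that they can carry out arbitrary parallel computation in a modest number of steps, and given a large enough training set their parameters can be learned. However, DNNs can only be applied to tasks whose input and output dimensions are both fixed, while in many problems the sequence lengths are not known in advance, for example speech recognition and machine translation. So although DNNs are very capable, they require a sufficiently large labeled training set and do not fit sequence-to-sequence tasks in which the input and output lengths are not fixed.

Approach: First, one LSTM reads the input sequence, with <EOS> marking the end of a sentence, and produces a large fixed-dimensional vector representation. Then another LSTM decodes the output sequence from that representation. (This second LSTM is essentially a language model, except that its initial state is the vector obtained by encoding the input sequence.) LSTMs can learn tasks with long-range dependencies, which makes them a natural fit for sequence-to-sequence problems. A useful property of the LSTM is that it is not limited by sentence length: it can map sentences of different lengths to a vector of fixed dimension.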

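Introduction to BLEU:

How do we judge whether a machine translation is good or bad?

Human evaluation: people score the translation subjectively.

Pros: accurate.

Cons: slow and expensive.

Automatic evaluation: the translation is scored automatically against a defined metric.

Pros: fairly accurate, fast, and free.

Cons: may deviate somewhat from human evaluation.

The metric: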

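Candidate: the the the the the the the. Count(the) = 7

Reference 1: the cat is on the mat. Count_clip,1(the) = min(7, 2) = 2 (the smaller of the candidate count and the count in this reference)

Reference 2: there is a cat on the mat. Count_clip,2(the) = min(7, 1) = 1 (the smaller of the candidate count and the count in this reference)

Count_clip(the) = max(2, 1) = 2 (the larger of the two per-reference values)

The result is 2/7.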
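The clipped count above can be reproduced with a few lines of Python. This is only a minimal sketch of modified n-gram precision; the function name `clipped_precision` and the whitespace tokenization are assumptions made for illustration, not part of the original post.

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Return (clipped n-gram matches, total candidate n-grams)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    # For each n-gram, remember the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip every candidate count by that per-reference maximum.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
clipped, total = clipped_precision(candidate, references)
print(f"{clipped}/{total}")  # prints 2/7
```

Passing `n=2`, `n=3`, or `n=4` gives the higher-order precisions that BLEU combines with the unigram one.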

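Problem with using only 1-grams: translating each word in isolation already earns a high score, and the fluency of the sentence is ignored entirely.

Solution: combine several n-gram orders. BLEU uses 1-grams through 4-grams.

The n-gram precision above favors short sentences:

For example, a candidate consisting only of "the cat" matches Reference 1 on every 1-gram and 2-gram it contains, yet it is far from a complete translation.

Solution: a brevity penalty computed from r, the reference length, and c, the candidate length.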
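As a rough illustration of how r and c enter the score, the sketch below uses the standard BLEU brevity penalty (1 when c > r, otherwise exp(1 - r/c)) and a uniform geometric mean of the n-gram precisions. The function names are hypothetical, and with several references r is assumed to be the reference length closest to c.

```python
import math

def brevity_penalty(c, r):
    """Penalty for candidates that are not longer than the reference (c: candidate length, r: reference length)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)

def bleu_from_precisions(precisions, c, r):
    """Combine 1..N-gram precisions with uniform weights, then apply the brevity penalty."""
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(c, r) * math.exp(log_mean)

# A 2-token candidate against a 6-token reference is penalized heavily,
# even if all of its n-grams match.
print(brevity_penalty(2, 6))
print(bleu_from_precisions([1.0, 1.0], c=2, r=6))
```

The penalty only shrinks the score of short candidates; long candidates are already penalized by the precision terms.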

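Introduction to the Seq2Seq model:

Task setting: take a sequence as input and produce a sequence as output.

Basic idea: an Encoder encodes the input sequence into a fixed-length vector, and a Decoder uses this vector to generate the output.

The embedded tokens are processed by an RNN variant (GRU or LSTM). Decoding continues until EOS (end of sentence) is produced or a fixed sequence length is reached.

1. Deep neural networks have been very successful, but they struggle with sequence-to-sequence problems.

2. The paper uses a new Seq2Seq model structure to solve sequence-to-sequence problems, in which both the Encoder and the Decoder are LSTMs.

3. Researchers have already done a great deal of work on this problem, including Seq2Seq models and attention mechanisms.

4. The paper's deep Seq2Seq model achieves very good results on machine translation.

The inputs A, B, C share one set of LSTM parameters.

The later (decoding) part shares a single set of LSTM parameters as well.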

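Tricks:

1. Use different LSTMs for the Encoder and the Decoder.

2. Deep LSTMs work better than shallow LSTMs.

3. Feeding the source sentence in reverse order substantially improves translation quality.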

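Objective of the LSTM:

p(y_1, ..., y_{T'} | x_1, ..., x_T) = \prod_{t=1}^{T'} p(y_t | v, y_1, ..., y_{t-1})

Goal: find the output sequence with the highest probability.

(x_1, ..., x_T) -- the input sequence;

v -- the hidden state (fixed-dimensional representation) obtained from the input sequence;

(y_1, ..., y_{T'}) -- the output sequence the model predicts for that input sequence;

T and T' are not necessarily equal.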
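To make the factorization concrete: the probability of an output sequence is the product of the per-step probabilities p(y_t | v, y_1, ..., y_{t-1}), which in log space becomes a sum. The sketch below assumes the decoder has already produced one vector of unnormalized scores (logits) per step; the names `log_softmax` and `sequence_log_prob` and the toy numbers are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one step's vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(step_logits, target_ids):
    """Sum over t of log p(y_t | v, y_1..y_{t-1}); step_logits[t] are the decoder scores at step t."""
    return sum(log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids))

# Toy example: vocabulary of size 3, where index 2 plays the role of <EOS>.
step_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
target_ids = [0, 1, 2]
log_p = sequence_log_prob(step_logits, target_ids)
print(log_p, math.exp(log_p))
```

Working in log space avoids the underflow that multiplying many small probabilities would cause for long sentences.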

Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to “establish communication” between the input and the output. We found this simple data transformation to greatly improve the performance of the LSTM

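The model uses two LSTMs. The first (the encoder LSTM) reads the input sequence and takes its last hidden state as the fixed-dimensional vector v. The second (the decoder LSTM) then computes the probability of the output sequence with the standard LSTM-LM formulation.

In the formula, each distribution p(y_t | v, y_1, ..., y_{t-1}) is represented by a softmax over all words in the vocabulary. Every sentence must also end with the special end-of-sentence symbol <EOS>, which lets the model define a distribution over sequences of all possible lengths. The output of each LSTM step is a vector (an embedding) that is passed through the softmax.

The authors' actual model differs from the description above in three ways:

1) Two different LSTMs are used, one for the input sequence and one for the output sequence. This helps with training on multiple language pairs: the input is in one language and the output can be in a second or third language, so a trained encoder can be paired with different decoders to handle different tasks;

2) Deep LSTMs significantly outperform shallow LSTMs, so the paper uses an LSTM with 4 layers;

3) The input sequence is fed to the model in reverse order. For example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the model maps c, b, a to α, β, γ. This places a very close to α and b fairly close to β, which makes it easy for SGD to "establish communication" between the input and the output. This simple data transformation greatly improves the LSTM's performance.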
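The reversal trick is purely a data transformation, so it can be sketched without any model. The helper names below are hypothetical, and the distance calculation assumes an idealized one-to-one, same-order alignment between source and target words, which real sentence pairs only approximate.

```python
def make_training_pair(source_tokens, target_tokens, eos="<EOS>"):
    """Reverse the source side only; the target keeps its order and gets an end marker."""
    return list(reversed(source_tokens)), list(target_tokens) + [eos]

def alignment_distances(n, reverse):
    """Steps between the k-th source word and the k-th target word when the
    source (n words) is fed first and the target follows immediately."""
    distances = []
    for k in range(n):
        src_pos = (n - 1 - k) if reverse else k  # where source word k is read
        tgt_pos = n + k                          # where target word k is produced
        distances.append(tgt_pos - src_pos)
    return distances

print(make_training_pair(["a", "b", "c"], ["α", "β", "γ"]))
print(alignment_distances(3, reverse=False))  # [3, 3, 3]
print(alignment_distances(3, reverse=True))   # [1, 3, 5]
```

Under this assumption the average distance stays the same, but the first source word ends up adjacent to the first target word, which is the short-term dependency the notes describe.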

实验部分：

The core of our experiments involved training a large deep LSTM on many sentence pairs. We trained it by maximizing the log probability of a correct translation T given the source sentence S, so the training objective is

1/|\mathcal{S}| \sum_{(T,S) \in \mathcal{S}} \log p(T | S)

That is, a large, deep LSTM is trained on many sentence pairs to maximize the average log probability of the correct translation T given the source sentence S.

where \mathcal{S} is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:

\hat{T} = \arg\max_T p(T | S)

We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation. At each timestep we extend each partial hypothesis in the beam with every possible word in the vocabulary. This greatly increases the number of the hypotheses so we discard all but the B most likely hypotheses according to the model’s log probability. As soon as the “<EOS>” symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam search (Table 1). We also used the LSTM to rescore the 1000-best lists produced by the baseline system [29]. To rescore an n-best list, we computed the log probability of every hypothesis with our LSTM and took an even average with their score and the LSTM’s score.
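The rescoring step at the end of this passage amounts to averaging two scores per hypothesis and re-sorting. The sketch below assumes both scores are on comparable log scales and uses a hypothetical `lstm_log_prob` callable as a stand-in for the trained model; the toy numbers are made up for illustration.

```python
def rescore_nbest(nbest, lstm_log_prob):
    """nbest: list of (hypothesis, baseline_score) pairs.
    Returns the hypotheses sorted by the even average of both scores, best first."""
    rescored = [(hyp, 0.5 * (base + lstm_log_prob(hyp))) for hyp, base in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Toy stand-in for the LSTM: a lookup table of made-up log probabilities.
toy_lstm_scores = {"translation A": -4.0, "translation B": -2.0}
nbest = [("translation A", -1.0), ("translation B", -1.5)]
ranked = rescore_nbest(nbest, toy_lstm_scores.get)
print(ranked)  # B now ranks above A
```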

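The most likely translation is found with beam search.

Viterbi algorithm:

For example, BIOE tagging in named entity recognition.

O marks a token outside any entity, B the beginning of an entity, I a token inside an entity, and E the end of an entity.

Exhaustive search: enumerate every possible sequence together with its probability.

Greedy search: pick the most probable option at every step.

Dynamic programming: at each step keep the probabilities accumulated so far (such as p1 * p2), then at the end choose the combination with the highest probability.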
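The dynamic-programming idea can be sketched as a small Viterbi decoder over tag states. This is a generic illustration rather than anything specific to the paper: the `emissions` and `transitions` tables are assumed to hold log scores, and the two-state toy values are made up.

```python
def viterbi(emissions, transitions):
    """emissions[t][s]: log score of state s at step t.
    transitions[p][s]: log score of moving from state p to state s.
    Returns (best state path, its total log score)."""
    n_states = len(emissions[0])
    best = list(emissions[0])   # best score of any path ending in each state
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for s in range(n_states):
            # Keep only the best predecessor for each state.
            prev = max(range(n_states), key=lambda p: best[p] + transitions[p][s])
            new_best.append(best[prev] + transitions[prev][s] + emissions[t][s])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    state = max(range(n_states), key=lambda s: best[s])
    score = best[state]
    path = [state]
    for pointers in reversed(backpointers):  # walk back to recover the path
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, score

emissions = [[-1.0, -2.0], [-2.0, -0.5], [-0.3, -3.0]]
transitions = [[-0.2, -1.5], [-1.0, -0.4]]
print(viterbi(emissions, transitions))
```

Exhaustive search over the same tables would score every one of the 2^3 paths; Viterbi reaches the same optimum while keeping only one best partial path per state at each step.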

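Beam search

Source: https://zhuanlan.zhihu.com/p/82829880

Beam search is an improvement on greedy search. It explores a larger search space than greedy search, but far less than the exponential space of exhaustive search, so it is a compromise between the two. Beam search has one hyperparameter, the beam size, denoted K. At the first time step, the K words with the highest conditional probability are selected as the first words of the candidate output sequences. At each later time step, all extensions of the previous step's candidates are considered and the K with the highest conditional probability are kept as the candidate output sequences for that step. K candidates are maintained throughout, and at the end the best of the K is chosen.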
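A minimal sketch of this procedure follows. It assumes a hypothetical `step_log_probs(prefix)` function that returns a dict of next-token log probabilities; the toy model is invented so that greedy search (K = 1) and K = 2 give different answers, and finished hypotheses are set aside once they end in `<EOS>`.

```python
import math

def beam_search(step_log_probs, beam_size, eos="<EOS>", max_len=20):
    """Keep the beam_size best prefixes at each step; return (tokens, log probability)."""
    beam = [((), 0.0)]
    complete = []
    for _ in range(max_len):
        # Extend every prefix in the beam with every possible next token.
        candidates = [
            (prefix + (token,), score + log_p)
            for prefix, score in beam
            for token, log_p in step_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = []
        for hyp in candidates[:beam_size]:
            if hyp[0][-1] == eos:
                complete.append(hyp)   # finished: remove from the beam
            else:
                beam.append(hyp)
        if not beam:
            break
    pool = complete if complete else beam
    return max(pool, key=lambda hyp: hyp[1])

def toy_model(prefix):
    """Made-up next-token distribution, for illustration only."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"<EOS>": math.log(0.55), "c": math.log(0.45)}
    if prefix == ("b",):
        return {"<EOS>": math.log(0.9), "c": math.log(0.1)}
    return {"<EOS>": 0.0}

print(beam_search(toy_model, beam_size=1))  # greedy picks "a" first
print(beam_search(toy_model, beam_size=2))  # the wider beam finds "b <EOS>"
```

With K = 1 the search commits to "a" because it is locally best, ending with probability 0.33; with K = 2 it keeps "b" alive and finds the sequence with probability 0.36.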