Effective Approaches to Attention-based Neural Machine Translation

GaryChern

Category: DeepLearningPapers. Tags: machine translation, natural language processing, deep learning

Article link: https://blog.csdn.net/qq_30521843/article/details/115668409

Abstract
Introduction
Neural Machine Translation
Attention-based Models
others
    Experimental details
    Results and analysis

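This paper is from EMNLP 2015. Starting from basic attention, it studies several variations, trying different score functions and different alignment functions. It mainly presents how two kinds of attention, Global and Local, work in detail.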

Abstract

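Goals of the paper:
1) Explore more effective architectures on top of existing attentional NMT.
2) Propose two simple and effective attention mechanisms, Global and Local, and validate them on WMT translation tasks (local attention gains 5.0 BLEU over non-attentional systems).
Results:
1) A result of 25.9 BLEU on the WMT'15 English to German translation task, obtained with an ensemble of models using different attention architectures.
2) An improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.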

Introduction

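1) Recent progress in NMT.
2) Advantages of NMT.
3) Attention is commonly used in encoder-decoder models that connect different modalities.
4) Two kinds of attention, Global and Local, which build on "Neural machine translation by jointly learning to align and translate" and "Show, attend and tell: Neural image caption generation with visual attention" respectively.
5) A brief overview of the paper's other contributions on attention, such as how the score is computed and how the alignment weights are computed.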

Neural Machine Translation

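1) Reviews the background and current state of NMT, for example the basic modeling principle of the encoder-decoder and the fact that the decoder's decomposition of the conditional probability is usually modeled with an RNN or one of its variants, and lists several papers in this area.
2) Briefly notes that the architecture in this paper builds on "Sequence to sequence learning with neural networks" and "Addressing the rare word problem in neural machine translation", using a stacking LSTM, as shown in the figure below:
[Figure: stacking LSTM architecture for NMT]
It also states the training objective used in the paper:
$J_t=\sum_{(x,y)\in\mathbb{D}}-\log p(y|x)$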
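The objective is the negative log-likelihood of the target sentence given the source sentence, which decomposes into a sum over target positions. A minimal sketch of that sum for one sentence pair follows; the function name `sentence_nll` and the toy probabilities are my own illustration, not something from the paper.

```python
import math

def sentence_nll(step_probs, target_ids):
    """Negative log-likelihood of one target sentence.

    step_probs[j] is the model's predicted distribution at target step j
    (a list of probabilities over the vocabulary); target_ids[j] is the
    index of the reference word at that step.
    """
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target_ids))

# Toy example: two decoding steps over a two-word vocabulary (made-up numbers).
step_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
loss = sentence_nll(step_probs, target_ids)
```

The corpus-level objective is then the sum of this quantity over all sentence pairs in the training set.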

Attention-based Models

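This part is the core of the paper. It details how the two kinds of attention, Global and Local, are computed step by step, and also proposes an input-feeding approach. The two kinds of attention differ in whether all source positions take part in the computation.
The generation process is given by the following formulas:
$\tilde{h}_t=\tanh(W_c[c_t;h_t])$
$p(y_t|y_{<t},x)=\mathrm{softmax}(W_s\tilde{h}_t)$

The first formula produces the attentional hidden state at time $t$ by concatenating the context vector $c_t$ with the target hidden state $h_t$ (this concat is a concatenation layer, not the score function). The second formula produces the predictive distribution, i.e. the prediction at time $t$.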
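As a rough illustration of these two steps, here is a NumPy sketch of one decoding step: the context vector and the target hidden state are concatenated and passed through a tanh layer, and the result is projected onto the vocabulary with a softmax. The names `attentional_step`, `W_c`, `W_s` and the toy sizes are assumptions chosen for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attentional_step(h_t, c_t, W_c, W_s):
    """One decoding step: attentional state, then predictive distribution.

    h_t : (d,)    target hidden state
    c_t : (d,)    source-side context vector
    W_c : (d, 2d) combines the concatenation [c_t; h_t]
    W_s : (V, d)  projects onto a vocabulary of size V
    """
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))
    p = softmax(W_s @ h_tilde)
    return h_tilde, p

rng = np.random.default_rng(0)
d, V = 4, 6  # toy sizes, for illustration only
h_t, c_t = rng.normal(size=d), rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
W_s = rng.normal(size=(V, d))
h_tilde, p = attentional_step(h_t, c_t, W_c, W_s)
```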

Global Attention

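The key to the whole structure is the figure below:
[Figure: global attention model]
$a_t$ is an alignment vector whose length equals the number of source time steps. It is the central step in computing attention, obtained by comparing the current target hidden state $h_t$ with each source hidden state $\bar{h}_s$. The formula is:
$a_t(s)=\mathrm{align}(h_t,\bar{h}_s)=\frac{\exp(\mathrm{score}(h_t,\bar{h}_s))}{\sum_{s'}\exp(\mathrm{score}(h_t,\bar{h}_{s'}))}$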

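The authors give three ways to compute the score, namely dot, general and concat (here $\bar{h}_s$ is a source hidden state):
$\mathrm{score}(h_t,\bar{h}_s)=h_t^T\bar{h}_s$ (dot)
$\mathrm{score}(h_t,\bar{h}_s)=h_t^TW_a\bar{h}_s$ (general)
$\mathrm{score}(h_t,\bar{h}_s)=v_a^T\tanh(W_a[h_t;\bar{h}_s])$ (concat)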
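To make the three score functions and the softmax alignment concrete, here is a small NumPy sketch. It assumes the source hidden states are stacked as rows of a matrix `H_s`; the function names and the toy sizes are hypothetical, chosen only for this example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def score_dot(h_t, H_s):
    # h_t: (d,), H_s: (S, d) -> one score per source position, shape (S,)
    return H_s @ h_t

def score_general(h_t, H_s, W_a):
    # h_t^T W_a h_s for every source state h_s; W_a: (d, d)
    return H_s @ (W_a.T @ h_t)

def score_concat(h_t, H_s, W_a, v_a):
    # v_a^T tanh(W_a [h_t; h_s]); W_a: (k, 2d), v_a: (k,)
    S = H_s.shape[0]
    pairs = np.concatenate([np.tile(h_t, (S, 1)), H_s], axis=1)  # (S, 2d)
    return np.tanh(pairs @ W_a.T) @ v_a

def global_attention(H_s, scores):
    """Alignment weights over all source positions and the context vector."""
    a_t = softmax(scores)  # (S,), sums to 1
    c_t = a_t @ H_s        # weighted average of source states, (d,)
    return a_t, c_t

rng = np.random.default_rng(0)
S, d = 5, 4  # toy sizes, for illustration only
h_t = rng.normal(size=d)
H_s = rng.normal(size=(S, d))
a_t, c_t = global_attention(H_s, score_dot(h_t, H_s))
```

With `W_a` set to the identity matrix, the general score reduces to the dot score, which is a quick sanity check on the implementation.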
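The authors also mention the approach used in their earlier experiments, a location-based function:
$a_t=\mathrm{softmax}(W_ah_t)$
Here the weights over the source hidden states come from a softmax computed on the current target hidden state alone. The two are compared in the figure below:
[Figure: comparison of the two approaches]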

For the ideas behind Global Attention, see this paper: Neural machine translation by jointly learning to align and translate

Local Attention

The key to the whole structure is the figure below:
[Figure: local attention model]
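The main question here is which source hidden states take part in the computation. The model first computes an aligned position $p_t$, which determines the window $[p_t-D,p_t+D]$; according to the paper, $D$ is chosen empirically. Because the window has a fixed width, $a_t$ has a fixed size of $2D+1$, unlike in Global attention, where its length varies with the source sentence.

The paper considers two ways of choosing $p_t$:
1) Monotonic alignment (local-m): $p_t=t$, and $a_t$ is computed with the same formula as in Global attention, restricted to the window.
2) Predictive alignment (local-p): $p_t=S\cdot\mathrm{sigmoid}(v_p^T\tanh(W_ph_t))$, where $W_p,v_p$ are parameters learned to predict the position and $S$ is the source sentence length; the sigmoid keeps $p_t$ in $[0,S]$.

To pull the weights toward $p_t$, so that the source positions inside the window get appropriate weights, local-p also multiplies the original alignment weights by a Gaussian centered at $p_t$. The weights become:
$a_t(s)=\mathrm{align}(h_t,\bar{h}_s)\exp\left(-\frac{(s-p_t)^2}{2\sigma^2}\right)$
with $\sigma=D/2$. Here $\bar{h}_s$ is a source hidden state, $p_t$ is a real number, and $s$ is an integer within the window centered at $p_t$.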
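The local-p computation can be sketched in NumPy as follows. This is only an illustration under my own assumptions: source positions are indexed from 0, the window is taken as all integer positions within distance $D$ of $p_t$ (clipped to the sentence), and the dot score is used inside the window. The function name `local_p_attention` and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_p_attention(h_t, H_s, W_p, v_p, D):
    """Predictive local attention (local-p) for one target step.

    h_t : (d,)   target hidden state
    H_s : (S, d) source hidden states, one row per position
    W_p : (k, d), v_p : (k,) position-prediction parameters
    D   : half-width of the window
    """
    S = H_s.shape[0]
    # Real-valued aligned position in [0, S].
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    # Integer positions inside [p_t - D, p_t + D] (assumed 0-based indexing).
    positions = np.arange(S)
    window = positions[np.abs(positions - p_t) <= D]
    # Softmax alignment restricted to the window (dot score assumed here).
    align = softmax(H_s[window] @ h_t)
    # Gaussian centered at p_t with sigma = D / 2; no renormalization afterwards.
    sigma = D / 2.0
    a_t = align * np.exp(-((window - p_t) ** 2) / (2.0 * sigma ** 2))
    c_t = a_t @ H_s[window]
    return p_t, window, a_t, c_t

rng = np.random.default_rng(0)
S, d, k, D = 30, 4, 3, 10  # toy sizes; D = 10 matches the paper's window size
h_t = rng.normal(size=d)
H_s = rng.normal(size=(S, d))
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)
p_t, window, a_t, c_t = local_p_attention(h_t, H_s, W_p, v_p, D)
```

Since the Gaussian factor is at most 1, the resulting weights sum to at most 1 rather than exactly 1.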

Input-feeding Approach

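The idea is to concatenate the attentional vector $\tilde{h}_t$ (the attention-augmented state at time $t$) with the input at the next time step, so that the model is informed of its past alignment decisions. The procedure is shown in the figure below:
[Figure: input-feeding approach]
Those are what I personally consider to be all the important parts of the paper.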
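Input feeding can be illustrated with a toy decoder loop in which the attentional vector from the previous step is concatenated with the current input embedding. To keep it short, this sketch swaps the paper's stacking LSTM for a single plain tanh RNN cell and uses dot-score global attention; the function name, parameter names and sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def decode_with_input_feeding(embeddings, H_s, W_in, W_h, W_c):
    """Toy decoder with input feeding (plain tanh RNN cell, not an LSTM).

    embeddings : (T, e) target-side input embeddings
    H_s        : (S, d) source hidden states
    W_in       : (d, e + d) acts on [x_t; previous h_tilde]
    W_h        : (d, d) recurrent weights
    W_c        : (d, 2d) combines [c_t; h_t]
    """
    d = W_h.shape[0]
    h = np.zeros(d)
    h_tilde = np.zeros(d)  # nothing to feed back at the first step
    outputs = []
    for x_t in embeddings:
        # Input feeding: previous attentional vector joins the current input.
        rnn_input = np.concatenate([x_t, h_tilde])
        h = np.tanh(W_in @ rnn_input + W_h @ h)
        a_t = softmax(H_s @ h)  # dot-score global attention
        c_t = a_t @ H_s
        h_tilde = np.tanh(W_c @ np.concatenate([c_t, h]))
        outputs.append(h_tilde)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, S, e, d = 3, 5, 4, 4  # toy sizes, for illustration only
outputs = decode_with_input_feeding(
    rng.normal(size=(T, e)),
    rng.normal(size=(S, d)),
    rng.normal(size=(d, e + d)),
    rng.normal(size=(d, d)),
    rng.normal(size=(d, 2 * d)),
)
```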

others

The rest of the paper consists of two main parts: the experiments and the analysis of the results.

Experimental details:

1: Data preprocessing and architecture
1) filter out sentence pairs whose lengths exceed 50 words and shuffle mini-batches as we proceed.
2) stacking LSTM models have 4 layers, each with 1000 cells, and 1000-dimensional embeddings.
2: Parameters
1) our parameters are uniformly initialized in [−0.1, 0.1]
3: Training tricks
1) train for 10 epochs using plain SGD
2) a simple learning rate schedule is employed – we start with a learning rate of 1; after 5 epochs, we begin to halve the learning rate every epoch,
3) our mini-batch size is 128
4) the normalized gradient is rescaled whenever its norm exceeds 5.
5) dropout with probability 0.2 for our LSTMs
6) For dropout models, we train for 12 epochs and start halving the learning rate after 8 epochs. For local attention models, we empirically set the window size D = 10.
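The learning rate schedule and the gradient rescaling can be sketched in a few lines of plain Python. This reflects my reading of the quoted settings: epochs are counted from 1, and the first halving takes effect in the epoch right after the threshold. The function names are hypothetical.

```python
import math

def learning_rate(epoch, base_lr=1.0, halve_after=5):
    """Learning rate for a 1-based epoch: constant, then halved every epoch."""
    return base_lr * 0.5 ** max(0, epoch - halve_after)

def rescale_gradient(grad, max_norm=5.0):
    """Rescale a gradient (list of floats) if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return list(grad)

# Plain models: 10 epochs, halving starts after epoch 5.
plain_schedule = [learning_rate(ep) for ep in range(1, 11)]
# Dropout models: 12 epochs, halving starts after epoch 8.
dropout_schedule = [learning_rate(ep, halve_after=8) for ep in range(1, 13)]

clipped = rescale_gradient([6.0, 8.0])  # norm 10 is rescaled down to norm 5
```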

Results and analysis

Paper: https://arxiv.org/pdf/1508.04025

Implementation: MATLAB
Hardware: Tesla K40 (1K target words per second)
