Notes on NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE by Dzmitry Bahdanau et al. (2016)
Traditional
An encoder neural network reads and encodes a source sentence
into a fixed-length vector.
A decoder then outputs a translation from the encoded vector.
The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair,
is jointly trained to maximize the probability of a correct translation given a source sentence.
Issue: the network must compress all the necessary information of a source sentence into a fixed-length vector, which makes it difficult to cope with long sentences, especially ones longer than those in the training corpus.
Our work
Align and translate jointly.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
The model predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.
It does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This allows the model to cope better with long sentences.
Problem formulation of translation
Translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e., argmax_y p(y | x).
In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus.
Background: the basic Encoder–Decoder
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors x = (x_1, ..., x_{Tx}), into a vector c. The most common approach is to use an RNN such that

h_t = f(x_t, h_{t-1})

and

c = q({h_1, ..., h_{Tx}}),

where h_t is the hidden state at time t, and f and q are nonlinear functions.
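The basic encoder can be sketched as follows. This is a minimal illustration, assuming f is a plain tanh recurrence and q simply returns the last hidden state; the weight names and dimensions are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                     # toy embedding and hidden sizes
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))

def encode(xs):
    """h_t = f(x_t, h_{t-1}); returns all hidden states and c = q(...) = h_Tx."""
    h = np.zeros(d_h)
    hs = []
    for x in xs:                     # xs: sequence of input vectors x_1..x_Tx
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return np.stack(hs), hs[-1]      # hidden states and the fixed-length vector c

xs = rng.normal(size=(5, d_in))      # a toy "source sentence" of length 5
hs, c = encode(xs)
print(hs.shape, c.shape)             # (5, 8) (8,)
```

Note that c here is a single fixed-length vector regardless of sentence length, which is exactly the bottleneck the paper addresses.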
The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'-1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = ∏_{t=1}^{Ty} p(y_t | {y_1, ..., y_{t-1}}, c),    (2)

where y = (y_1, ..., y_{Ty}). With an RNN, each conditional probability is modeled as

p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c),

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN.
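The chain-rule decomposition can be sketched as below. This is a toy version: g is assumed to be a softmax over a linear readout (the paper uses a deeper network), and all weights and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d_h, d_c = 10, 8, 8               # toy vocab size, decoder state, context dims
W_s = rng.normal(scale=0.1, size=(d_h, d_h))
W_y = rng.normal(scale=0.1, size=(d_h, V))
W_c = rng.normal(scale=0.1, size=(d_h, d_c))
W_out = rng.normal(scale=0.1, size=(V, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_p_y(y, c):
    """log p(y) = sum_t log p(y_t | y_<t, c), with g a softmax readout."""
    s = np.zeros(d_h)
    prev = np.zeros(V)               # one-hot of previous word (all zeros at t=1)
    total = 0.0
    for y_t in y:
        s = np.tanh(W_s @ s + W_y @ prev + W_c @ c)   # toy decoder recurrence
        p = softmax(W_out @ s)       # g(y_{t-1}, s_t, c): distribution over vocab
        total += np.log(p[y_t])
        prev = np.eye(V)[y_t]
    return total

c = rng.normal(size=d_c)
print(log_p_y([3, 1, 4], c))         # log-probability of a toy target sentence
```

Training maximizes this log-probability, summed over sentence pairs in the parallel corpus.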
Learning to align and translate
decoder
define each conditional probability in Eq. (2) as:

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i-1}, y_{i-1}, c_i).

Here the probability is conditioned on a distinct context vector c_i for each target word y_i.
The context vector c_i depends on a sequence of annotations (h_1, ..., h_{Tx}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence.
The context vector c_i is computed as a weighted sum of these annotations:

c_i = ∑_{j=1}^{Tx} α_{ij} h_j,

and

α_{ij} = exp(e_{ij}) / ∑_{k=1}^{Tx} exp(e_{ik}),

where e_{ij} = a(s_{i-1}, h_j)
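The weighting is simply a softmax over the alignment scores, followed by a weighted sum of the annotations. A minimal sketch, taking the scores e_{ij} as given (the alignment model a that produces them is described next):

```python
import numpy as np

def context_vector(e_i, H):
    """c_i = sum_j alpha_ij h_j, with alpha_i = softmax(e_i) over source positions."""
    e_i = e_i - e_i.max()                # subtract max for numerical stability
    alpha = np.exp(e_i) / np.exp(e_i).sum()
    return alpha, alpha @ H              # H: (Tx, d) annotations; c_i: (d,)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy annotations h_1..h_3
alpha, c = context_vector(np.array([2.0, 0.0, 0.0]), H)
print(alpha.round(3), c.round(3))
```

The weights α_{ij} form a probability distribution over source positions, so c_i is an expected annotation: a "soft search" rather than a hard choice of one source word.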
is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.
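In the paper, the alignment model a is parametrized as a small feedforward network, e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j), trained jointly with the rest of the system. A sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_s, d_h, d_a = 6, 4, 5              # toy decoder-state, annotation, alignment dims
W_a = rng.normal(scale=0.1, size=(d_a, d_s))
U_a = rng.normal(scale=0.1, size=(d_a, d_h))
v_a = rng.normal(scale=0.1, size=d_a)

def align(s_prev, h_j):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j): the additive alignment score."""
    return v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)

s_prev = rng.normal(size=d_s)
H = rng.normal(size=(3, d_h))        # annotations h_1..h_3
e = np.array([align(s_prev, h) for h in H])
print(e.shape)                       # one score per source position
```

Because a is differentiable, the gradient of the translation loss flows back through the attention weights, so alignment is learned jointly with translation rather than as a separate step.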
encoder
We would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN.
A BiRNN consists of a forward and a backward RNN. The forward RNN f→ reads the input sequence as it is ordered (from x_1 to x_{Tx}) and calculates a sequence of forward hidden states (h→_1, ..., h→_{Tx}). The backward RNN f← reads the sequence in reverse order (from x_{Tx} to x_1), resulting in a sequence of backward hidden states (h←_1, ..., h←_{Tx}). In this way, the annotation h_j contains the summaries of both the preceding words and the following words.
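The annotation h_j is obtained by concatenating the forward and backward states for position j. A minimal sketch, reusing a simple tanh recurrence for both directions (weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h = 4, 8
Wf_x = rng.normal(scale=0.1, size=(d_h, d_in))  # forward RNN weights
Wf_h = rng.normal(scale=0.1, size=(d_h, d_h))
Wb_x = rng.normal(scale=0.1, size=(d_h, d_in))  # backward RNN weights
Wb_h = rng.normal(scale=0.1, size=(d_h, d_h))

def run_rnn(xs, W_x, W_h):
    h, out = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        out.append(h)
    return np.stack(out)

def birnn_annotations(xs):
    """h_j = [h_fwd_j ; h_bwd_j]: each annotation summarizes the whole sentence."""
    fwd = run_rnn(xs, Wf_x, Wf_h)                 # reads x_1 .. x_Tx
    bwd = run_rnn(xs[::-1], Wb_x, Wb_h)[::-1]     # reads x_Tx .. x_1, then re-align
    return np.concatenate([fwd, bwd], axis=1)     # shape (Tx, 2*d_h)

xs = rng.normal(size=(5, d_in))
H = birnn_annotations(xs)
print(H.shape)                       # (5, 16): one annotation per source position
```

These annotations H are exactly what the alignment model scores and the context vectors average over in the decoder.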