Andrew Ng Deep Learning Specialization, Course 5, Week 3, Programming Assignment 1: Neural Machine Translation

Table of Contents

Packages

In [1]:

from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K
import tensorflow as tf
import numpy as np

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

1 - Translating Human Readable Dates Into Machine Readable Dates

  • The model you will build here could be used to translate from one language to another, such as translating from English to Hindi.
  • However, language translation requires massive datasets and usually takes days of training on GPUs.
  • To give you a place to experiment with these models without using massive datasets, we will perform a simpler "date translation" task.
  • The network will take as input dates written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987").
  • The network will translate them into standardized, machine readable dates (e.g. "1958-08-29", "1968-03-30", "1987-06-24").
  • We will have the network learn to output dates in the common machine-readable format YYYY-MM-DD.

1.1 - Dataset

We will train the model on a dataset of 10,000 human readable dates and their equivalent, standardized, machine readable dates. Let's run the following cells to load the dataset and print some examples.

In [2]:

m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)
100%|██████████| 10000/10000 [00:00<00:00, 22947.31it/s]

In [3]:

dataset[:10]

Out[3]:

[('9 may 1998', '1998-05-09'),
 ('10.11.19', '2019-11-10'),
 ('9/10/70', '1970-09-10'),
 ('saturday april 28 1990', '1990-04-28'),
 ('thursday january 26 1995', '1995-01-26'),
 ('monday march 7 1983', '1983-03-07'),
 ('sunday may 22 1988', '1988-05-22'),
 ('08 jul 2008', '2008-07-08'),
 ('8 sep 1999', '1999-09-08'),
 ('thursday january 1 1981', '1981-01-01')]

You've loaded:

  • dataset: a list of tuples of (human readable date, machine readable date).
  • human_vocab: a python dictionary mapping all characters used in the human readable dates to an integer-valued index.
  • machine_vocab: a python dictionary mapping all characters used in machine readable dates to an integer-valued index.
    • Note: These indices are not necessarily consistent with human_vocab.
  • inv_machine_vocab: the inverse dictionary of machine_vocab, mapping from indices back to characters.
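Before preprocessing, it can help to sanity-check these vocabularies. The short snippet below is a quick inspection (not part of the original notebook); the sizes in the comments match the one-hot shapes printed further down.

print("human_vocab size:", len(human_vocab))      # 37 characters, per Xoh.shape below
print("machine_vocab size:", len(machine_vocab))  # 11 characters: digits 0-9 and '-'
print("machine characters:", [inv_machine_vocab[i] for i in range(len(machine_vocab))])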

Let's preprocess the data and map the raw text data into the index values.

  • We will set Tx=30
    • We assume Tx is the maximum length of the human readable date.
    • If we get a longer input, we would have to truncate it.
  • We will set Ty=10
    • "YYYY-MM-DD" is 10 characters long.

In [4]:

Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)
X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)

You now have:

  • X: a processed version of the human readable dates in the training set.
    • Each character in X is replaced by an index (integer) mapped to the character using human_vocab.
    • Each date is padded to ensure a length of Tx using a special character (<pad>).
    • X.shape = (m, Tx) where m is the number of training examples in a batch.
  • Y: a processed version of the machine readable dates in the training set.
    • Each character is replaced by the index (integer) it is mapped to in machine_vocab.
    • Y.shape = (m, Ty).
  • Xoh: one-hot version of X
    • Each index in X is converted to the one-hot representation (if the index is 2, the one-hot version has index position 2 set to 1, and the remaining positions set to 0).
    • Xoh.shape = (m, Tx, len(human_vocab))
  • Yoh: one-hot version of Y
    • Each index in Y is converted to the one-hot representation.
    • Yoh.shape = (m, Ty, len(machine_vocab)).
    • len(machine_vocab) = 11 since there are 10 numeric digits (0 to 9) and the - symbol.
  • Let's also look at some examples of preprocessed training examples.
  • Feel free to play with index in the cell below to navigate the dataset and see how source/target dates are preprocessed.

In [5]:

index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])
Source date: 9 may 1998
Target date: 1998-05-09

Source after preprocessing (indices): [12  0 24 13 34  0  4 12 12 11 36 36 36 36 36 36 36 36 36 36 36 36 36 36
 36 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10 10  9  0  1  6  0  1 10]

Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
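To make the preprocessing concrete, here is a rough sketch of how a single source date could be converted to indices and then to one-hot vectors. It mirrors what preprocess_data does, but the date_to_indices helper and the '<pad>' / '<unk>' vocabulary keys are assumptions about how nmt_utils builds human_vocab, so treat it as illustrative rather than the exact implementation.

def date_to_indices(date_str, vocab, length):
    # Lower-case, truncate to `length` characters, map each character to its
    # vocabulary index, then pad with the assumed '<pad>' token up to `length`.
    s = date_str.lower()[:length]
    idx = [vocab.get(ch, vocab.get('<unk>', 0)) for ch in s]
    idx += [vocab['<pad>']] * (length - len(idx))
    return np.array(idx)

x_example = date_to_indices(dataset[0][0], human_vocab, Tx)            # shape (Tx,)
xoh_example = to_categorical(x_example, num_classes=len(human_vocab))  # shape (Tx, 37)
print(x_example[:10])
print(xoh_example.shape)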

2 - Neural Machine Translation with Attention

  • If you had to translate a book's paragraph from French to English, you would not read the whole paragraph, then close the book and translate.
  • Even during the translation process, you would read/re-read and focus on the parts of the French paragraph corresponding to the parts of the English you are writing down.
  • The attention mechanism tells a Neural Machine Translation model where it should pay attention at any step.

2.1 - Attention Mechanism

In this part, you will implement the attention mechanism presented in the lecture videos.

  • Here is a figure to remind you how the model works.
    • The diagram on the left shows the attention model.
    • The diagram on the right shows what one "attention" step does to calculate the attention variables α〈t,t′〉.
    • The attention variables α〈t,t′〉 are used to compute the context variable context〈t〉 for each timestep in the output (t = 1, …, Ty).

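In equation form: the attention weights for each output timestep t are normalized so that Σ_{t′=1…Tx} α〈t,t′〉 = 1 (via a softmax), and the context is their weighted sum of the pre-attention activations a〈t′〉 described below: context〈t〉 = Σ_{t′=1…Tx} α〈t,t′〉 · a〈t′〉.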

Here are some properties of the model that you may notice:

Pre-attention and Post-attention LSTMs on both sides of the attention mechanism

  • There are two separate LSTMs in this model (see diagram on the left): pre-attention and post-attention LSTMs.
  • Pre-attention Bi-LSTM: the one at the bottom of the picture is a bi-directional LSTM and comes before the attention mechanism.
    • The attention mechanism is shown in the middle of the left-hand diagram.
    • The pre-attention Bi-LSTM goes through Tx time steps.
  • Post-attention LSTM: the one at the top of the diagram comes after the attention mechanism.
    • The post-attention LSTM goes through Ty time steps.
  • The post-attention LSTM passes the hidden state s〈t〉 and cell state c〈t〉 from one time step to the next.
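As a minimal Keras sketch of these two LSTMs (the hidden sizes n_a = 32 and n_s = 64 are assumptions chosen for illustration, and the attention block that sits between them is omitted here):

n_a = 32  # units of the pre-attention LSTM (so the Bi-LSTM outputs 2*n_a per step)
n_s = 64  # units of the post-attention LSTM

# Pre-attention Bi-LSTM: runs over all Tx input characters and returns one
# activation a<t'> per timestep, with forward and backward states concatenated.
X_in = Input(shape=(Tx, len(human_vocab)))
a = Bidirectional(LSTM(n_a, return_sequences=True))(X_in)   # shape (m, Tx, 2*n_a)

# Post-attention LSTM: will be called Ty times, once per output character,
# carrying a hidden state s<t> and a cell state c<t> from one step to the next.
post_attention_LSTM_cell = LSTM(n_s, return_state=True)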

An LSTM has both a hidden state and cell state

  • In the lecture videos, we used only a basic RNN for the post-attention sequence model.
    • This means that the only state the RNN carried was the hidden state s〈t〉.
  • In this assignment, we are using an LSTM instead of a basic RNN.
    • So the LSTM has both the hidden state s〈t〉 and the cell state c〈t〉.
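As a small illustration of this distinction, a Keras LSTM built with return_state=True returns its output together with both states. The shapes below (a context of one timestep with 2*n_a = 64 features) are assumptions made just for this example:

n_a, n_s = 32, 64
post_lstm = LSTM(n_s, return_state=True)

# One post-attention step on a placeholder context<t>. With return_state=True
# (and return_sequences left False) the layer returns (output, s<t>, c<t>),
# where the output equals s<t>, so we keep s and c and feed them to the next step.
context = Input(shape=(1, 2 * n_a))
s0 = Input(shape=(n_s,), name="s0")
c0 = Input(shape=(n_s,), name="c0")
s, _, c = post_lstm(context, initial_state=[s0, c0])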

Each time step does not use predictions from the previous time step

  • Unlike the text generation examples earlier in the course, in this model the post-attention LSTM at time t does not take the previous time step's prediction y〈t−1〉 as input.
  • The post-attention LSTM at time t only takes the hidden state s〈t〉 and cell state c〈t〉 as input.
  • We have designed the model this way because unlike language generation (where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.

Concatenation of hidden states from the forward and backward pre-attention LSTMs

  • a→〈t〉: hidden state of the forward-direction, pre-attention LSTM.
  • a←〈t〉: hidden state of the backward-direction, pre-attention LSTM.
  • a〈t〉 = [a→〈t〉, a←〈t〉]: the concatenation of the activations of both the forward-direction a→〈t〉 and backward-direction a←〈t〉 of the pre-attention Bi-LSTM.

Computing "energies" e〈t,t′〉e〈t,t′〉 as a function of s〈t−1〉s〈t−1〉 and a〈t′〉a〈t′〉

  • Recall from the lesson video "Attention Model" (6:45 to 8:16) the definition of "e" as a function of s〈t−1〉 and a〈t′〉.
    • "e" is called the "energies" variable.
    • s〈t−1〉 is the hidden state of the post-attention LSTM.
    • a〈t′〉 is the hidden state of the pre-attention LSTM.
    • s〈t−1〉 and a〈t′〉 are fed into a simple neural network, which learns the function to output e〈t,t′〉.
    • e〈t,t′〉 is then used when computing the attention α〈t,t′〉 that y〈t〉 should pay to a〈t′〉.
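Putting the pieces above together, here is a small NumPy sketch of a single attention step for one output timestep t. It is a toy illustration only: the sizes and the single random linear layer that stands in for the learned "small neural network" are assumptions, not the assignment's actual layers.

import numpy as np

Tx, n_a, n_s = 30, 32, 64
np.random.seed(0)
a = np.random.randn(Tx, 2 * n_a)    # pre-attention Bi-LSTM activations a<t'>
s_prev = np.random.randn(n_s)       # post-attention hidden state s<t-1>

# "Energies": s<t-1> is repeated Tx times, concatenated with each a<t'>, and
# passed through a stand-in linear layer to produce one e<t,t'> per input step.
W = np.random.randn(2 * n_a + n_s, 1) * 0.01
concat = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=-1)   # (Tx, n_s + 2*n_a)
e = concat @ W                                                    # (Tx, 1)

# Softmax over the Tx input timesteps gives the attention weights alpha<t,t'>,
# and context<t> is the alpha-weighted sum of the a<t'>.
alphas = np.exp(e - e.max()) / np.sum(np.exp(e - e.max()))        # (Tx, 1), sums to 1
context = np.sum(alphas * a, axis=0)                              # (2*n_a,)
print(alphas.sum(), context.shape)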