Table of Contents
- Packages
- 1 - Translating Human Readable Dates Into Machine Readable Dates
- 2 - Neural Machine Translation with Attention
- 3 - Visualizing Attention (Optional / Ungraded)
Packages
In [1]:
from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K
import tensorflow as tf
import numpy as np

from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline
1 - Translating Human Readable Dates Into Machine Readable Dates
- The model you will build here could be used to translate from one language to another, such as translating from English to Hindi.
- However, language translation requires massive datasets and usually takes days of training on GPUs.
- To give you a place to experiment with these models without using massive datasets, we will perform a simpler "date translation" task.
- The network will input a date written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987")
- The network will translate them into standardized, machine readable dates (e.g. "1958-08-29", "1968-03-30", "1987-06-24").
- We will have the network learn to output dates in the common machine-readable format YYYY-MM-DD.
1.1 - Dataset
We will train the model on a dataset of 10,000 human readable dates and their equivalent, standardized, machine readable dates. Let's run the following cells to load the dataset and print some examples.
In [2]:
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)
100%|██████████| 10000/10000 [00:00<00:00, 22947.31it/s]
In [3]:
dataset[:10]
Out[3]:
[('9 may 1998', '1998-05-09'),
 ('10.11.19', '2019-11-10'),
 ('9/10/70', '1970-09-10'),
 ('saturday april 28 1990', '1990-04-28'),
 ('thursday january 26 1995', '1995-01-26'),
 ('monday march 7 1983', '1983-03-07'),
 ('sunday may 22 1988', '1988-05-22'),
 ('08 jul 2008', '2008-07-08'),
 ('8 sep 1999', '1999-09-08'),
 ('thursday january 1 1981', '1981-01-01')]
You've loaded:
- dataset: a list of tuples of (human readable date, machine readable date).
- human_vocab: a python dictionary mapping all characters used in the human readable dates to an integer-valued index.
- machine_vocab: a python dictionary mapping all characters used in machine readable dates to an integer-valued index.
  - Note: these indices are not necessarily consistent with human_vocab.
- inv_machine_vocab: the inverse dictionary of machine_vocab, mapping from indices back to characters (sketched below).
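For orientation, here is a minimal sketch of what these objects might look like; the exact characters and indices come from load_dataset in nmt_utils, so the values below are illustrative only, not the actual contents:

# Illustrative only -- the real dictionaries are built by load_dataset(m) in nmt_utils.
# human_vocab maps every character appearing in the human readable dates to an index,
# plus special '<unk>' and '<pad>' tokens.
example_human_vocab = {' ': 0, '0': 3, '9': 12, 'a': 13, '<unk>': 35, '<pad>': 36}

# machine_vocab only needs the characters of "YYYY-MM-DD": '-' and the digits 0-9.
example_machine_vocab = {'-': 0, '0': 1, '1': 2, '9': 10}

# inv_machine_vocab is machine_vocab inverted; it is used to turn predicted indices
# back into characters when decoding the model's output.
example_inv_machine_vocab = {idx: ch for ch, idx in example_machine_vocab.items()}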
Let's preprocess the data and map the raw text data into the index values.
- We will set Tx=30
  - We assume Tx is the maximum length of the human readable date.
  - If we get a longer input, we would have to truncate it (a rough sketch of this mapping and padding follows below).
- We will set Ty=10
  - "YYYY-MM-DD" is 10 characters long.
In [4]:
Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)
X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)
You now have:
- X: a processed version of the human readable dates in the training set.
  - Each character in X is replaced by an index (integer) mapped to the character using human_vocab.
  - Each date is padded to ensure a length of Tx using a special character (<pad>).
  - X.shape = (m, Tx), where m is the number of training examples in a batch.
- Y: a processed version of the machine readable dates in the training set.
  - Each character is replaced by the index (integer) it is mapped to in machine_vocab.
  - Y.shape = (m, Ty).
- Xoh: one-hot version of X (a sketch of this conversion follows the list).
  - Each index in X is converted to the one-hot representation (if the index is 2, the one-hot version has index position 2 set to 1, and the remaining positions are 0).
  - Xoh.shape = (m, Tx, len(human_vocab)).
- Yoh: one-hot version of Y.
  - Each index in Y is converted to the one-hot representation.
  - Yoh.shape = (m, Ty, len(machine_vocab)).
  - len(machine_vocab) = 11 since there are 10 numeric digits (0 to 9) and the - symbol.
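The one-hot conversion itself can be sketched with Keras's to_categorical; the actual work happens inside preprocess_data in nmt_utils, so the snippet below is illustrative and assumes X and Y already hold the integer indices described above:

from tensorflow.keras.utils import to_categorical

# Illustrative: expand the integer-index arrays into one-hot tensors.
Xoh = to_categorical(X, num_classes=len(human_vocab))    # shape (m, Tx, len(human_vocab))
Yoh = to_categorical(Y, num_classes=len(machine_vocab))  # shape (m, Ty, len(machine_vocab))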
- Let's also look at some examples of preprocessed training examples.
- Feel free to play with index in the cell below to navigate the dataset and see how source/target dates are preprocessed.
In [5]:
index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])
Source date: 9 may 1998
Target date: 1998-05-09

Source after preprocessing (indices): [12  0 24 13 34  0  4 12 12 11 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10 10  9  0  1  6  0  1 10]

Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
2 - Neural Machine Translation with Attention
- If you had to translate a book's paragraph from French to English, you would not read the whole paragraph, then close the book and translate.
- Even during the translation process, you would read/re-read and focus on the parts of the French paragraph corresponding to the parts of the English you are writing down.
- The attention mechanism tells a Neural Machine Translation model where it should pay attention at any step.
2.1 - Attention Mechanism
In this part, you will implement the attention mechanism presented in the lecture videos.
- Here is a figure to remind you how the model works.
- The diagram on the left shows the attention model.
- The diagram on the right shows what one "attention" step does to calculate the attention variables $\alpha^{\langle t, t' \rangle}$.
- The attention variables $\alpha^{\langle t, t' \rangle}$ are used to compute the context variable $context^{\langle t \rangle}$ for each timestep in the output ($t = 1, \ldots, T_y$), as written out below.
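For reference, the context vector is the attention-weighted sum of the pre-attention hidden states, with weights that sum to 1 (this is the definition used in the lecture):

$$context^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} a^{\langle t' \rangle}, \qquad \text{with } \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} = 1.$$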
Here are some properties of the model that you may notice:
Pre-attention and Post-attention LSTMs on both sides of the attention mechanism
- There are two separate LSTMs in this model (see diagram on the left): a pre-attention LSTM and a post-attention LSTM.
- Pre-attention Bi-LSTM: the one at the bottom of the picture; it is a bi-directional LSTM and comes before the attention mechanism.
  - The attention mechanism is shown in the middle of the left-hand diagram.
  - The pre-attention Bi-LSTM goes through $T_x$ time steps.
- Post-attention LSTM: the one at the top of the diagram; it comes after the attention mechanism.
  - The post-attention LSTM goes through $T_y$ time steps.
  - The post-attention LSTM passes the hidden state $s^{\langle t \rangle}$ and cell state $c^{\langle t \rangle}$ from one time step to the next.
An LSTM has both a hidden state and cell state
- In the lecture videos, we used only a basic RNN for the post-attention sequence model.
  - This means that the state captured by the RNN consisted of only the hidden state $s^{\langle t \rangle}$.
- In this assignment, we are using an LSTM instead of a basic RNN.
  - So the LSTM has both the hidden state $s^{\langle t \rangle}$ and the cell state $c^{\langle t \rangle}$, as the sketch below illustrates.
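As a minimal sketch of this, assuming a post-attention state size n_s = 64 and a 64-dimensional context vector (the sizes and the variable names below, such as post_lstm, are illustrative choices rather than the ones used later in the assignment):

from tensorflow.keras.layers import Input, LSTM

n_s = 64   # assumed size of the post-attention LSTM state

# One time step of an assumed 64-dimensional context vector, plus the incoming
# hidden state s and cell state c carried over from the previous time step.
context = Input(shape=(1, 64))
s0 = Input(shape=(n_s,))
c0 = Input(shape=(n_s,))

# With return_state=True the LSTM returns (output, final hidden state, final cell state);
# feeding [s, c] back in via initial_state is how both states are passed across time steps.
post_lstm = LSTM(n_s, return_state=True)
s, _, c = post_lstm(context, initial_state=[s0, c0])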
Each time step does not use predictions from the previous time step
- Unlike the text generation examples earlier in the course, in this model the post-attention LSTM at time $t$ does not take the previous time step's prediction $y^{\langle t-1 \rangle}$ as input.
- The post-attention LSTM at time $t$ only takes the hidden state $s^{\langle t \rangle}$ and cell state $c^{\langle t \rangle}$ as input.
- We have designed the model this way because unlike language generation (where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
Concatenation of hidden states from the forward and backward pre-attention LSTMs
- $\overrightarrow{a}^{\langle t \rangle}$: hidden state of the forward-direction, pre-attention LSTM.
- $\overleftarrow{a}^{\langle t \rangle}$: hidden state of the backward-direction, pre-attention LSTM.
- $a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}, \overleftarrow{a}^{\langle t \rangle}]$: the concatenation of the activations of both the forward-direction $\overrightarrow{a}^{\langle t \rangle}$ and backward-direction $\overleftarrow{a}^{\langle t \rangle}$ of the pre-attention Bi-LSTM, as the sketch below illustrates.
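As a minimal sketch of this concatenation, assuming one-hot inputs of length Tx over len(human_vocab) = 37 characters and a pre-attention hidden size n_a = 32 (the hidden size is an assumption for illustration):

from tensorflow.keras.layers import Input, LSTM, Bidirectional

Tx = 30
n_a = 32                  # assumed hidden size of each direction of the pre-attention LSTM
human_vocab_size = 37     # from len(human_vocab) above

X_in = Input(shape=(Tx, human_vocab_size))

# Bidirectional with the default merge_mode='concat' returns, at every time step,
# the concatenation [a_forward ; a_backward], so each output vector has size 2 * n_a.
a = Bidirectional(LSTM(n_a, return_sequences=True))(X_in)   # shape: (m, Tx, 2 * n_a)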
Computing "energies" e〈t,t′〉e〈t,t′〉 as a function of s〈t−1〉s〈t−1〉 and a〈t′〉a〈t′〉
- Recall in the lesson videos "Attention Model", at time 6:45 to 8:16, the definition of "e" as a function of s〈t−1〉s〈t−1〉 and a〈t〉a〈t〉.
- "e" is called the "energies" variable.
- s〈t−1〉s〈t−1〉 is the hidden state of the post-attention LSTM
- a〈t′〉a〈t′〉 is the hidden state of the pre-attention LSTM.
- s〈t−1〉s〈t−1〉 and a〈t〉a〈t〉 are fed into a simple neural network, which learns the function to output e〈t,t′〉e〈t,t′〉.
- e〈t,t′〉e〈t,t′〉 is then used when computing the attention α〈t,t′〉α〈t,t′〉 that y〈t〉y〈t〉 should pay to a〈t′〉a〈t′〉.
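Putting the pieces together, here is a minimal Keras sketch of one attention step, assuming the sizes n_a = 32 and n_s = 64 from the sketches above; the layer sizes and names (densor1, densor2, and so on) are illustrative choices, not necessarily those of the graded implementation, and the layers are defined once so they can be shared across all Ty output time steps:

from tensorflow.keras.layers import Input, RepeatVector, Concatenate, Dense, Activation, Dot
import tensorflow.keras.backend as K

Tx, n_a, n_s = 30, 32, 64   # assumed sizes

# Shared layers: a small network that turns [a<t'>, s<t-1>] into an energy e<t,t'>,
# a softmax over the Tx energies to get the attention weights alpha<t,t'>,
# and a dot product that forms the context vector.
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")
# Softmax over the Tx axis so the weights for each output step sum to 1
# (a softmax helper along axis 1, like the one in nmt_utils, could be used here instead).
activator = Activation(lambda x: K.softmax(x, axis=1), name="attention_weights")
dotor = Dot(axes=1)

def one_step_attention(a, s_prev):
    """a: (m, Tx, 2*n_a) pre-attention Bi-LSTM outputs; s_prev: (m, n_s) previous post-attention hidden state."""
    s_prev = repeator(s_prev)             # (m, Tx, n_s): copy s<t-1> for every input time step
    concat = concatenator([a, s_prev])    # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                   # (m, Tx, 10): intermediate energies
    energies = densor2(e)                 # (m, Tx, 1): one energy e<t,t'> per input time step
    alphas = activator(energies)          # (m, Tx, 1): attention weights alpha<t,t'>
    context = dotor([alphas, a])          # (m, 1, 2*n_a): weighted sum of the a<t'>
    return context

# Example wiring with symbolic inputs:
a_in = Input(shape=(Tx, 2 * n_a))
s_prev_in = Input(shape=(n_s,))
context = one_step_attention(a_in, s_prev_in)   # (m, 1, 2*n_a)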