Text Classification with Movie Reviews - TF Hub


This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary, or two-class, classification, an important and widely applicable kind of machine learning problem.

We’ll use the IMDB dataset, which contains the text of 50000 movie reviews from the Internet Movie Database. These are split into 25000 reviews for training and 25000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses tf.keras, a high-level API to build and train models in TensorFlow, and TensorFlow Hub, a library and platform for transfer learning. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide.

Each example is a sentence representing the movie review, paired with a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review.
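The (sentence, label) structure described above can be illustrated with a minimal sketch. The four reviews below are made-up stand-ins, not real IMDB entries, and the names examples and LABEL_NAMES are hypothetical. The sketch decodes each integer label and checks that the two classes are equally represented.

```python
from collections import Counter

# Hypothetical stand-ins for a few (review, label) pairs; not real IMDB data.
examples = [
    ("A moving, beautifully shot film.", 1),
    ("Two hours I will never get back.", 0),
    ("Great cast, sharp script.", 1),
    ("Flat characters and a dull plot.", 0),
]

# 0 is a negative review, 1 is a positive review.
LABEL_NAMES = {0: "negative", 1: "positive"}

# Count how many examples fall into each class.
label_counts = Counter(label for _, label in examples)
is_balanced = label_counts[0] == label_counts[1]

for text, label in examples:
    print(f"{LABEL_NAMES[label]:>8}: {text}")
print("balanced:", is_balanced)
```

The same Counter check could be run on the real label array once the dataset is loaded.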

Build the model:
The neural network is created by stacking layers, which requires three main architectural decisions:

  • How to represent the text?
  • How many layers to use in the model?
  • How many hidden units to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.
One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which has two advantages:

  • we don’t have to worry about text preprocessing
  • we can benefit from transfer learning

For this example, we will use a model from TensorFlow Hub called google/nnlm-en-dim50/2.
There are two other models to test for the sake of this tutorial:

  • google/nnlm-en-dim50-with-normalization/2 - same as google/nnlm-en-dim50/2, but with additional text normalization to remove punctuation. This can help to get better coverage of in-vocabulary embeddings for tokens in your input text
  • google/nnlm-en-dim128-with-normalization/2 - a larger model with an embedding dimension of 128 instead of the smaller 50

Let’s first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that the output shape of the produced embeddings is as expected: (num_examples, embedding_dimension).
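To make the (num_examples, embedding_dimension) shape concrete, here is a toy sketch in plain Python. It is not the TensorFlow Hub model: the three-dimensional vocabulary, the lowercasing, the whitespace tokenization, and the averaging step are all assumptions chosen for illustration. It only shows how sentences of different lengths end up as fixed-length vectors.

```python
# Toy embedding table with dimension 3. The vectors are made up for illustration.
EMBEDDING_DIM = 3
TOY_VOCAB = {
    "great": [0.9, 0.1, 0.0],
    "boring": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.5],
}
UNKNOWN = [0.0] * EMBEDDING_DIM  # fallback for out-of-vocabulary tokens


def embed_sentence(sentence):
    """Average the vectors of the whitespace-separated tokens (an assumed combiner)."""
    tokens = sentence.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [TOY_VOCAB.get(tok, UNKNOWN) for tok in tokens]
    # zip(*vectors) groups the values of each dimension across all tokens.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed_batch(sentences):
    """Embed each sentence; every result has length EMBEDDING_DIM."""
    return [embed_sentence(s) for s in sentences]


batch = embed_batch(["Great movie", "boring unseenword movie"])
shape = (len(batch), len(batch[0]))
print(shape)  # (num_examples, embedding_dimension)
```

Whatever the sentence length, each row has EMBEDDING_DIM entries, which is what lets a fixed-size Dense layer consume the output.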

The layers are stacked sequentially to build the classifier:

  • The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The model that we are using (google/nnlm-en-dim50/2) splits the sentence into tokens, embeds each token and then combines the embeddings. The resulting dimensions are: (num_examples, embedding_dimension)
  • This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units
  • The last layer is densely connected with a single output node. This outputs logits: the log-odds of the true class, according to the model
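Since the last layer outputs logits (log-odds) rather than probabilities, it may help to see how the two relate. The sketch below uses only the standard library. The helper names sigmoid and log_odds are illustrative and not part of the notebook.

```python
import math


def sigmoid(logit):
    """Map a logit (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def log_odds(probability):
    """Inverse of sigmoid: map a probability to its log-odds."""
    return math.log(probability / (1.0 - probability))


for logit in (-2.0, 0.0, 2.0):
    p = sigmoid(logit)
    predicted = 1 if logit > 0.0 else 0  # same decision as p > 0.5
    print(f"logit={logit:+.1f}  p(positive)={p:.3f}  predicted label={predicted}")
```

A logit of 0 corresponds to a probability of 0.5, so thresholding logits at 0.0 is equivalent to thresholding probabilities at 0.5. This is why the accuracy metric in this notebook's code uses threshold=0.0.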

Hidden units:
The above model has two intermediate or “hidden” layers, between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, this makes the network more computationally expensive and may lead to learning unwanted patterns - patterns that improve performance on training data but not on the test data. This is called overfitting, and we’ll explore it later.
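One rough way to see how cost grows with the number of hidden units is to count trainable parameters in the Dense layers. The sketch below assumes a 50-dimensional embedding feeding one hidden Dense layer and a single-unit output. The helper dense_param_count is hypothetical, and the count excludes the parameters of the embedding layer itself.

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


EMBEDDING_DIM = 50  # output size of google/nnlm-en-dim50/2

for hidden_units in (16, 64, 256):
    hidden = dense_param_count(EMBEDDING_DIM, hidden_units)
    output = dense_param_count(hidden_units, 1)
    print(f"{hidden_units:>3} hidden units -> {hidden + output} "
          "trainable parameters in the Dense layers")
```

With 16 hidden units the two Dense layers hold 816 + 17 = 833 parameters, and the count grows linearly with the hidden width.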

Loss function and optimizer:
A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), we’ll use the binary_crossentropy loss function.

This isn’t the only choice for a loss function. You could, for instance, choose mean_squared_error. But, generally, binary_crossentropy is better for dealing with probabilities - it measures the “distance” between probability distributions, or in our case, between the ground-truth distribution and the predictions.
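To give a feel for the difference between the two losses, the following sketch computes both by hand for a single positive example. It is a standalone illustration, not the Keras implementation. The function names are hypothetical, and the cross-entropy uses a common numerically stable formulation for logits.

```python
import math


def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one example, given a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def squared_error(logit, label):
    """Squared error between the sigmoid probability and the label."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return (p - label) ** 2


label = 1  # a positive review
for logit in (4.0, 0.0, -4.0):
    print(f"logit={logit:+.1f}  bce={bce_from_logit(logit, label):.3f}  "
          f"mse={squared_error(logit, label):.3f}")
```

For a confidently wrong prediction (a logit of -4 on a positive example), the squared error stays below 1 while the cross-entropy keeps growing with the size of the mistake. This is one reason cross-entropy is usually preferred for this kind of output.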

When training, we want to check the accuracy of the model on data it hasn’t seen before. Create a validation set by setting apart 10000 examples from the original training data. (Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy.)

Train the model for 40 epochs in mini-batches of 512 samples. This is 40 iterations over all samples in the partial_x_train and partial_y_train tensors. While training, monitor the model’s loss and accuracy on the 10000 samples from the validation set.
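The numbers above determine how many gradient updates the training run performs. The arithmetic below is a small sketch that assumes the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

TOTAL_TRAIN = 25000      # IMDB training reviews
VALIDATION_SIZE = 10000  # held out for validation
BATCH_SIZE = 512
EPOCHS = 40

partial_train_size = TOTAL_TRAIN - VALIDATION_SIZE
# Assumption: the last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(partial_train_size / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
last_batch_size = partial_train_size - (steps_per_epoch - 1) * BATCH_SIZE

print("training examples:", partial_train_size)
print("steps per epoch:  ", steps_per_epoch)
print("total steps:      ", total_steps)
print("last batch size:  ", last_batch_size)
```

Under that assumption, each epoch consists of 30 updates (29 full batches and one batch of 152 samples), for 1200 updates over the 40 epochs.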

Notice that the training loss decreases and the training accuracy increases with each epoch. This is expected when using gradient descent optimization - each update is designed to reduce the training loss.
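The behaviour described above can be seen in a much simpler setting. The sketch below runs plain gradient descent on a one-dimensional quadratic. It is not the optimizer or the loss used in this notebook; the function, starting point, and learning rate are arbitrary choices for illustration.

```python
def loss(w):
    """A simple convex loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0
learning_rate = 0.1
losses = []
for step in range(10):
    losses.append(loss(w))
    w -= learning_rate * grad(w)  # step against the gradient

for step, value in enumerate(losses, start=1):
    print(f"step {step:>2}: loss={value:.4f}")
```

On this toy problem the loss shrinks at every step. With mini-batches and a real network the per-step loss is noisier, but the per-epoch training loss typically follows the same downward trend.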

This isn’t the case for the validation loss and accuracy - they seem to peak after about twenty epochs. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.
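One way to locate the point where validation performance stops improving is to look for the epoch with the lowest validation loss. The values below are hypothetical and are not results from this notebook. In a real run they could be read from the val_loss entry of the training history.

```python
# Hypothetical per-epoch validation losses; not results from this notebook.
val_loss = [0.62, 0.48, 0.39, 0.34, 0.31, 0.30, 0.31, 0.33, 0.36, 0.40]

# Index of the smallest validation loss; epochs are numbered from 1.
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1
epochs_past_best = len(val_loss) - best_epoch

print(f"best epoch: {best_epoch} (val_loss={val_loss[best_index]:.2f})")
print(f"epochs trained past the best point: {epochs_past_best}")
```

Stopping training near that epoch is the idea behind early stopping, which Keras offers through the tf.keras.callbacks.EarlyStopping callback.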

import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import matplotlib.pyplot as plt

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"],
                                  batch_size=-1, as_supervised=True)

train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

print("Training entries: {}, test entries: {}".format(len(train_examples), len(test_examples)))


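# Pre-trained text embedding from TensorFlow Hub, used as the first layer.
embedding_url = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding_url, input_shape=[], dtype=tf.string, trainable=True)

# Stack the layers: embedding -> Dense(16, relu) -> single-unit output (logits).
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))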


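# The optimizer is not named in the text above; 'adam' is assumed here as a common default.
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])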

x_val = train_examples[:10000]
partial_x_train = train_examples[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

# Evaluate once on the held-out test examples (not the raw test_data tuple).
results = model.evaluate(test_examples, test_labels)
print(results)


history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()


plt.clf()  # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()






