How to Predict Sentiment From Movie Reviews (Project)

Sentiment analysis is a natural language processing problem where text is understood and the underlying intent is predicted. In this lesson you will discover how you can predict the sentiment of movie reviews as either positive or negative in Python using the Keras deep learning library. After completing this step-by-step tutorial, you will know:

  • About the IMDB sentiment analysis problem for natural language processing and how to load it in Keras.
  • How to use word embedding in Keras for natural language problems.
  • How to develop and evaluate a Multilayer Perceptron model for the IMDB problem.
  • How to develop a one-dimensional convolutional neural network model for the IMDB problem.

1.1 Movie Review Sentiment Classification Dataset

The dataset used in this project is the Large Movie Review Dataset, often referred to as the IMDB dataset. It contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The data was collected by Stanford researchers and was used in a 2011 paper in which a 50-50 split of the data was used for training and test. An accuracy of 88.89% was achieved.

1.2 Load the IMDB Dataset With Keras

Keras provides built-in access to the IMDB dataset. The imdb.load_data() function allows you to load the dataset in a format that is ready for use in neural network and deep learning models. The words have been replaced by integers that indicate the rank of each word by overall frequency in the dataset. Each review is therefore represented as a sequence of integers.
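If you want to inspect a review as text rather than as integers, the imdb.get_word_index() function returns the word-to-index mapping. The sketch below reverses that mapping; note that, by default, Keras reserves the lowest indices for padding, start-of-sequence and out-of-vocabulary markers, so the decoding applies an offset of 3 (treat this offset as an assumption to verify against your Keras version).

# Sketch: map an integer-encoded review back to words for inspection.
# Assumes the default index_from=3 offset used by imdb.load_data().
from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data()
word_index = imdb.get_word_index()                          # word -> frequency rank
reverse_index = {i + 3: w for w, i in word_index.items()}   # undo the reserved-index offset
print(' '.join(reverse_index.get(i, '?') for i in X_train[0]))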

Calling imdb.load_data() the first time will download the IMDB dataset to your computer and cache it in your home directory under ~/.keras/datasets/ as a file of roughly 32 megabytes (the exact filename, e.g. imdb.pkl or imdb.npz, depends on your Keras version). Usefully, the imdb.load_data() function provides additional arguments, including the number of top words to load (words outside the most frequent N are replaced with a placeholder value in the returned data), the number of most frequent words to skip (to avoid very common words such as the) and the maximum length of reviews to support. Let's load the dataset and calculate some properties of it. We will start off by loading some libraries and loading the entire IMDB dataset as a training dataset.
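As a short sketch of those optional arguments (using the num_words, skip_top and maxlen keyword names from the newer Keras API; check the signature of your installed version), a constrained load might look like this:

# Sketch: a constrained load of the IMDB dataset.
from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data(
    num_words=5000,  # keep only the 5,000 most frequent words
    skip_top=0,      # optionally skip the most frequent words (e.g. the)
    maxlen=None)     # optionally drop reviews longer than this many words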

# Load the IMDB Dataset
import numpy as np
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

 Next we can display the shape of the training dataset.

# Display The Shape of the IMDB Dataset
# summarize size
print("Training data: ")
print(X.shape)
print(y.shape)

Running this snippet, we can see that there are 50,000 records (the 25,000 training and 25,000 test reviews combined).

# Display The Classes in the IMDB Dataset
# Summarize number of classes
print("Classes: ")
print(np.unique(y))

We can see that it is a binary classification problem for good and bad sentiment in the review.

 Next we can get an idea of the total number of unique words in the dataset.

# Display The Number of Unique Words in the IMDB Dataset
# Summarize number of words
print("Number of words:")
print(len(np.unique(np.hstack(X))))

Interestingly, we can see that there are just under 90,000 unique words across the entire dataset.

Number of words:
88585

 Finally, we can get an idea of the average review length.

# Plot the distribution of Review Lengths
# Summarize review length
print("Review length: ")
result = list(map(len, X))
print("Mean %.2f words (%f)" % (np.mean(result),np.std(result)))
# plot review length as a boxplot and histogram
pyplot.subplot(121)
pyplot.boxplot(result)
pyplot.subplot(122)
pyplot.hist(result)
pyplot.show()

We can see that the average review has about 235 words, with a standard deviation of about 173 words.

Review length: 
Mean 234.76 words (172.911495)
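These statistics and the plot can guide the choice of a fixed review length later on. As a rough check (the 500-word cap used in the rest of this lesson is one reasonable choice, not the only one), you can measure how many reviews a given cap would leave untruncated; this continues from the snippet above, reusing result and np.

# Sketch: what fraction of reviews would a 500-word cap leave untruncated?
cap = 500
coverage = np.mean(np.array(result) <= cap)
print("Fraction of reviews with %d words or fewer: %.2f" % (cap, coverage))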

The full code listing is provided below for completeness.

# The full code listing is provided below for completeness
import numpy as np
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)
# summarize size
print("Training data: ")
print(X.shape)
print(y.shape)

# Summarize number of classes
print("Classes: ")
print(np.unique(y))
# Summarize number of words
print("Number of words:")
print(len(np.unique(np.hstack(X))))
# Summarize review length
print("Review length: ")
result = list(map(len, X))
print("Mean %.2f words (%f)" %(np.mean(result), np.std(result)))

# plot review length as a boxplot and histogram
pyplot.subplot(121)
pyplot.boxplot(result)
pyplot.subplot(122)
pyplot.hist(result)
pyplot.show()

1.3 Word Embeddings

A recent breakthrough in the field of natural language processing is called word embedding. This is a technique where words are encoded as real-valued vectors in a high dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Discrete words are mapped to vectors of continuous numbers. This is useful when working with natural language problems with neural networks as we require numbers as input values.

Keras provides a convenient way to convert positive integer representations of words into a word embedding via an Embedding layer. The layer takes arguments that define the mapping, including the maximum number of expected words, also called the vocabulary size (i.e. the largest integer value that will be seen as an input). The layer also allows you to specify the dimensionality of each word vector, called the output dimension.

We would like to use a word embedding representation for the IMDB dataset. Let’s say that we are only interested in the 5,000 most frequently used words in the dataset. Therefore our vocabulary size will be 5,000. We can choose to use a 32-dimensional vector to represent each word. Finally, we may choose to cap the maximum review length at 500 words, truncating reviews longer than that and padding reviews shorter than that with 0 values. We would load the IMDB dataset as follows:

# Only Load the Top 5000 words in the IMDB Review
imdb.load_data(num_words=5000)

We would then use the Keras utility to truncate or pad the dataset to a length of 500 for each observation using the sequence.pad_sequences() function.

# Pad Reviews in the IMDB Dataset
from keras.preprocessing import sequence

X_train = sequence.pad_sequences(X_train, maxlen=500)
X_test = sequence.pad_sequences(X_test, maxlen=500)
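On a toy example, the padding and truncation behavior looks like this (by default pad_sequences pads and truncates at the start of each sequence):

# Sketch: pad_sequences pads short sequences with zeros and truncates long ones,
# both at the start of the sequence by default.
from keras.preprocessing import sequence
print(sequence.pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=5))
# Expected output:
# [[0 0 1 2 3]
#  [2 3 4 5 6]]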

Finally, later on, the first layer of our model would be a word embedding layer created using the Embedding class as follows:

# Define a Word Embedding Representation
from keras.layers.embeddings import Embedding
Embedding(5000, 32, input_length=500)

The output of this first layer would be a matrix with the size 32 × 500 for a given movie review training or test pattern in integer format. Now that we know how to load the IMDB dataset in Keras and how to use a word embedding representation for it, let’s develop and evaluate some models.
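Before moving on, a quick sanity check of that 32 × 500 shape: wrap the layer in a model on its own and inspect its output shape (note that Keras reports it with the batch dimension first, then the 500 time steps, then the 32-dimensional word vectors).

# Sketch: inspect the output shape of the Embedding layer on its own.
from keras.models import Sequential
from keras.layers.embeddings import Embedding

check = Sequential()
check.add(Embedding(5000, 32, input_length=500))
print(check.output_shape)  # expected: (None, 500, 32)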

1.4 Simple Multilayer Perceptron Model

We can start off by developing a simple Multilayer Perceptron model with a single hidden layer. The word embedding representation is a true innovation and we will demonstrate what would have been considered world class results in 2011 with a relatively simple neural network. Let’s start off by importing the classes and functions required for this model and initializing the random number generator to a constant value to ensure we can easily reproduce the results.

# Load Classes and Functions and Seed Random Number Generator
# MLP for the IMDB problem
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

Next we will load the IMDB dataset. We will simplify the dataset as discussed during the section on word embeddings. Only the top 5,000 words will be loaded. We will also use a 50%/50% split of the dataset into training and test. This is a good standard split methodology.

# Load and Split the IMDB Dataset
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

We will bound reviews at 500 words, truncating longer reviews and zero-padding shorter reviews.

# Pad IMDB Review to a Fixed Length
from keras.preprocessing import sequence
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

Now we can create our model. We will use an Embedding layer as the input layer, setting the vocabulary to 5,000, the word vector size to 32 dimensions and the input length to 500. The output of this first layer will be a 32 × 500 sized matrix, as discussed in the previous section. We will flatten the Embedding layer's output to one dimension, then use one dense hidden layer of 250 units with a rectifier activation function. The output layer has one neuron and uses a sigmoid activation to output values between 0 and 1 as predictions. The model uses logarithmic loss and is optimized using the efficient ADAM optimization procedure.

# Define a Multilayer Perceptron Model
# Create the model
model = Sequential()
model.add(Embedding(top_words,32,input_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

We can fit the model and use the test set as validation data while training. This model overfits very quickly, so we will use very few training epochs, in this case just 2. There is a lot of data, so we will use a batch size of 128. After the model is trained, we evaluate its accuracy on the test dataset.

# Fit and Evaluate the Multilayer Perceptron Model
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=1)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

 The full code listing is provided below for completeness.

# Multilayer Perceptron Model for the IMDB Dataset
# MLP for the IMDB problem
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# load the dataset but only keep the top n words, zero the rest

top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test),epochs=2,batch_size=128,verbose=1)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" %(scores[1]*100))

Running this example fits the model and summarizes the estimated performance. We can see that this very simple model achieves a score of nearly 87%, which is in the neighborhood of the original paper, with very little effort.

Accuracy: 86.98%

I'm sure we can do better if we trained this network further, perhaps using a larger embedding and adding more hidden layers; a rough sketch of that idea follows. After that, let's try a different network type.
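The layer sizes below are illustrative guesses, not tuned values; treat this as a starting point for experimentation rather than a recommendation.

# Sketch: a slightly larger MLP - a bigger word embedding and one extra hidden layer.
model = Sequential()
model.add(Embedding(top_words, 64, input_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])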

1.5 One-Dimensional Convolutional Neural Network

Convolutional neural networks were designed to honor the spatial structure in image data whilst being robust to the position and orientation of learned objects in the scene. This same principle can be used on sequences, such as the one-dimensional sequence of words in a movie review. The same properties that make the CNN model attractive for learning to recognize objects in images can help to learn structure in paragraphs of words, namely the technique's invariance to the specific position of features.

Keras supports one-dimensional convolutions and pooling via the Convolution1D and MaxPooling1D classes respectively. Again, let’s import the classes and functions needed for this example and initialize our random number generator to a constant value so that we can easily reproduce results.

# Import Classes and Functions and Seed Random Number Generator
# CNN for the IMDB problem
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

 We can also load and prepare our IMDB dataset as we did before.

# Load, Split and Pad IMDB Dataset
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# pad dataset to a maximum review length in words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

We can now define our convolutional neural network model. This time, after the Embedding input layer, we insert a Convolution1D layer. This convolutional layer has 32 feature maps and reads the embedded word representations with a kernel size of 3, i.e. three word vectors at a time. The convolutional layer is followed by a MaxPooling1D layer with a pool size and stride of 2, which halves the size of the feature maps from the convolutional layer. The rest of the network is the same as the neural network above.

# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(filters=32, kernel_size=3, padding='same',activation= 'relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

         We also fit the network the same as before.

# Fit and Evaluate the CNN Model
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128,verbose=1)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Running the example, we are first presented with a summary of the network structure (not shown here). We can see that the convolutional layer preserves the dimensionality of the Embedding input layer: 32-dimensional word vectors with a maximum of 500 words per review. The pooling layer then compresses this representation by halving it. The model offers a small but welcome improvement over the Multilayer Perceptron above, with an accuracy of 88.67%.
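Since the summary itself is not reproduced here, the expected output shape of each layer (batch dimension omitted, assuming the layer arguments defined above) is roughly as follows:

# Expected layer output shapes (batch dimension omitted), given the arguments above:
# Embedding(5000, 32, input_length=500) -> (500, 32)
# Convolution1D(32, 3, padding='same')  -> (500, 32)  'same' padding keeps the sequence length
# MaxPooling1D(pool_size=2)             -> (250, 32)  pooling halves the sequence length
# Flatten()                             -> (8000,)
# Dense(250)                            -> (250,)
# Dense(1)                              -> (1,)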

 The full code listing is provided below for completeness.

# CNN for the IMDB problem
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

# pad dataset to a maximum review length in words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(filters=32,kernel_size=3, padding='same',activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
print(model.summary())
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128,verbose=1)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

# Output from Evaluating the CNN Model

Accuracy: 88.67%

Again, there is a lot of opportunity for further optimization, such as deeper and/or larger convolutional layers. One interesting idea is to set the max pooling layer to use a pool size of 500, the full sequence length. This would compress each feature map to a single value, giving one 32-element vector per review, and may boost performance; a sketch of this idea follows.
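A hedged sketch of that pooling idea uses Keras's GlobalMaxPooling1D layer, which takes the maximum over the whole sequence (equivalent to a pool size of 500 here), so no Flatten layer is needed. This variant is untested; treat it as a starting point only.

# Sketch: collapse each of the 32 feature maps to a single value per review.
from keras.layers import GlobalMaxPooling1D

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])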

1.6 Summary

In this lesson you discovered the IMDB sentiment analysis dataset for natural language processing. You learned how to develop deep learning models for sentiment analysis including:

  • How to load and review the IMDB dataset within Keras.
  • How to develop a Multilayer Perceptron model for sentiment analysis.
  • How to develop a one-dimensional convolutional neural network model for sentiment analysis.

This tutorial concludes Part V and your introduction to convolutional neural networks in Keras. Next, in Part VI, we will discover a different type of neural network intended to learn and predict sequences, called recurrent neural networks.
