Sentiment analysis is an important direction in natural language processing; its goal is to enable computers to understand the sentiment expressed in text. Here we use the IMDB movie-review dataset to predict whether a review judges a film to be good or bad, and use this task to study the sentiment-analysis problem.
1. The data is loaded directly with Keras's imdb.load_data() function.
2. Keras converts the positive-integer word indices into word vectors through an embedding layer (Embedding). The embedding layer must be given the vocabulary size (the largest word index expected) and the dimension of each output word vector.
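Because the embedding layer expects every review to have the same length, each sequence of word indices is first truncated or padded to a fixed size. The behaviour of Keras's sequence.pad_sequences (defaults padding='pre', truncating='pre') can be sketched in plain Python; pad_to is a hypothetical helper name used only for illustration:

```python
def pad_to(seq, maxlen, value=0):
    """Truncate or left-pad a list of word indices to exactly maxlen items,
    mirroring keras pad_sequences with its default 'pre' padding/truncating."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                    # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # zero-pad at the front

print(pad_to([4, 17, 92], 5))  # → [0, 0, 4, 17, 92]
print(pad_to([4, 17, 92], 2))  # → [17, 92]
```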
# -*- coding: utf-8 -*-
from keras.datasets import imdb
import numpy as np
from keras.preprocessing import sequence
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.layers import Dense, Flatten
from keras.models import Sequential
seed = 7
top_words = 5000
max_words = 500
out_dimension = 32
batch_size = 128
epochs = 10
def create_model():
    model = Sequential()
    # embedding layer
    model.add(Embedding(top_words, out_dimension, input_length=max_words))
    # 1-D convolutional layer
    model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(250, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model
if __name__ == '__main__':
    np.random.seed(seed=seed)
    # load the data
    (x_train, y_train), (x_validation, y_validation) = imdb.load_data(num_words=top_words)
    # pad/truncate every review to a fixed length
    x_train = sequence.pad_sequences(x_train, maxlen=max_words)
    x_validation = sequence.pad_sequences(x_validation, maxlen=max_words)
    # build and train the model
    model = create_model()
    model.fit(x_train, y_train, validation_data=(x_validation, y_validation),
              batch_size=batch_size, epochs=epochs, verbose=2)
Output:
Using TensorFlow backend.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 500, 32) 160000
_________________________________________________________________
conv1d_1 (Conv1D) (None, 500, 32) 3104
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 8000) 0
_________________________________________________________________
dense_1 (Dense) (None, 250) 2000250
_________________________________________________________________
dense_2 (Dense) (None, 1) 251
=================================================================
Total params: 2,163,605
Trainable params: 2,163,605
Non-trainable params: 0
_________________________________________________________________
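The parameter counts in the summary above can be checked by hand; a quick sketch in plain Python (no Keras needed):

```python
top_words, out_dim, max_words = 5000, 32, 500

embedding = top_words * out_dim           # 5000*32 = 160000 embedding weights
conv1d = 3 * out_dim * 32 + 32            # kernel_size*in_channels*filters + bias = 3104
flat_features = (max_words // 2) * 32     # 250*32 = 8000 features after pooling
dense_1 = flat_features * 250 + 250       # 2000250
dense_2 = 250 * 1 + 1                     # 251

total = embedding + conv1d + dense_1 + dense_2
print(total)  # → 2163605, matching "Total params: 2,163,605"
```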
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
 - 31s - loss: 0.4808 - acc: 0.7374 - val_loss: 0.2800 - val_acc: 0.8843
Epoch 2/10
 - 31s - loss: 0.2234 - acc: 0.9118 - val_loss: 0.2727 - val_acc: 0.8858
Epoch 3/10
 - 33s - loss: 0.1737 - acc: 0.9339 - val_loss: 0.2918 - val_acc: 0.8807
Epoch 4/10
 - 33s - loss: 0.1293 - acc: 0.9540 - val_loss: 0.3168 - val_acc: 0.8777
Epoch 5/10
 - 35s - loss: 0.0841 - acc: 0.9744 - val_loss: 0.3721 - val_acc: 0.8751
Epoch 6/10
 - 33s - loss: 0.0450 - acc: 0.9904 - val_loss: 0.4340 - val_acc: 0.8730
Epoch 7/10
 - 32s - loss: 0.0212 - acc: 0.9966 - val_loss: 0.5029 - val_acc: 0.8703
Epoch 8/10
 - 31s - loss: 0.0085 - acc: 0.9993 - val_loss: 0.5897 - val_acc: 0.8688
Epoch 9/10
 - 31s - loss: 0.0027 - acc: 0.9998 - val_loss: 0.6597 - val_acc: 0.8694
Epoch 10/10
 - 31s - loss: 0.0013 - acc: 0.9999 - val_loss: 0.7108 - val_acc: 0.8697
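Note that the training accuracy climbs toward 100% while the validation accuracy peaks early and then slowly declines as val_loss rises: a classic sign of overfitting. Reading the best epoch off the log can be sketched like this (the val_acc values are copied from the output above):

```python
# validation accuracy per epoch, copied from the training log above
val_acc = [0.8843, 0.8858, 0.8807, 0.8777, 0.8751,
           0.8730, 0.8703, 0.8688, 0.8694, 0.8697]

# epoch numbers are 1-based in the log
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
print(best_epoch, val_acc[best_epoch - 1])  # → 2 0.8858
```

In practice, rather than hand-picking an epoch afterwards, one would pass a keras.callbacks.EarlyStopping callback (e.g. monitoring 'val_loss') to model.fit so training stops once validation performance stops improving.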