【Keras-MLP】IMDb


IMDb (Internet Movie Database) catalogs more than 4 million movies.
The dataset used here contains 50,000 movie reviews, split evenly into 25,000 training and 25,000 test samples, each labeled as a positive or negative review.

Sentiment analysis, also called opinion mining, uses natural language processing and text analysis to identify an author's attitude, emotion, or evaluation of a given topic.

Its business value is that it gives an early view of how customers feel about a company or product, so that marketing strategy can be adjusted accordingly.

This post introduces the IMDb movie review dataset, preprocesses the text with word embeddings, and builds MLP deep-learning models for sentiment analysis.

For models based on RNN and LSTM, see the companion post 【Keras-RNN】IMDb.

1 Data processing

1.1 Download and extract the dataset

import urllib.request
import os
import tarfile

# download the dataset archive into data/ (create the folder if it is missing)
url="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath="data/aclImdb_v1.tar.gz"
if not os.path.isdir("data"):
    os.makedirs("data")
if not os.path.isfile(filepath):
    result=urllib.request.urlretrieve(url,filepath)
    print('downloaded:',result)
# extract the archive into data/aclImdb
if not os.path.exists("data/aclImdb"):
    tfile = tarfile.open("data/aclImdb_v1.tar.gz", 'r:gz')
    result=tfile.extractall('data/')

1.2 Read the dataset

Import the required packages

from keras.datasets import imdb
from keras.preprocessing import sequence # pad/truncate every integer sequence to length 100
from keras.preprocessing.text import Tokenizer # build the word-to-index dictionary

Remove HTML tags from the review text:

# strip HTML tags such as <br /> from a review
import re
def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)
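For example (a quick check with a made-up string):

print(rm_tags("This film is great!<br /><br />Highly recommended."))
# This film is great!Highly recommended.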

Next we need a read_files function that reads the reviews and their labels; its definition follows the directory tree below. The dataset is organised as follows (numbers in parentheses are file counts):

  • aclImdb
    • train (25000)
      • pos (12500)
      • neg (12500)
    • test (25000)
      • pos (12500)
      • neg (12500)
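The read_files function collects the file paths under pos/ and neg/ for the given split, strips the HTML tags from every review with rm_tags, and returns the labels and texts. It is the same definition that is reused in section 3.1.

import os
def read_files(filetype):
    path = "data/aclImdb/"
    file_list=[]

    # collect the 12500 positive review files
    positive_path=path + filetype+"/pos/"
    for f in os.listdir(positive_path):
        file_list+=[positive_path+f]

    # collect the 12500 negative review files
    negative_path=path + filetype+"/neg/"
    for f in os.listdir(negative_path):
        file_list+=[negative_path+f]

    print('read',filetype, 'files:',len(file_list))

    # positive reviews come first, so the labels are 12500 ones then 12500 zeros
    all_labels = ([1] * 12500 + [0] * 12500)

    all_texts  = []
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]

    return all_labels,all_texts

Call it on the training and test splits: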
y_train,train_text=read_files("train")
y_test,test_text=read_files("test")

Output

read train files: 25000
read test files: 25000

Take a look at one sample:

train_text[0]

Output

‘This film was enjoyable but for the wrong reasons. The co-ordination of the action sequences are laughable and make the film have some funny slap stick moments. Robert Ginty and Fred Williamson have a memorable scene together near the end, where Williamson says to Ginty, “you sure do get around buddy boy!”. I did enjoy this film only for Ginty and Williamson, but not for the storyline that must have been written on the back of a napkin in four lines and the rest ad-libbed most likely. A film with over 30 parts only has a credit list for 10 or so. It seems odd that no one else has a credit in the film, maybe they had some insight into how the finished product would look. The one thing this film does have going for it is that it is quite violent, so that tripled with Fred Williamson and Robert Ginty make for a film worth seeing.’

Check its label:

print(y_train[0])

output

1

1.3 Build a word-to-index dictionary

For training, every review has to be converted from text into a list of integers. We do this with a dictionary that maps words to numbers, built from word frequency: for a 2000-word dictionary, the words in the training reviews are ranked by how often they appear, the most frequent word is mapped to 1, the second most frequent to 2, and so on.
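As a small illustration (a toy corpus made up for this example, not the IMDb data), the most frequent word receives the smallest index; the real IMDb dictionary is built right below.

from keras.preprocessing.text import Tokenizer

toy_token = Tokenizer(num_words=10)
toy_token.fit_on_texts(["good movie", "good film", "bad film good"])
print(toy_token.word_index)
# e.g. {'good': 1, 'film': 2, 'movie': 3, 'bad': 4} -- 'good' is the most frequent word
print(toy_token.texts_to_sequences(["good bad movie"]))
# [[1, 4, 3]] under the mapping above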

# build a dictionary of the 2000 most frequent words
token = Tokenizer(num_words=2000)
# scan all training reviews and rank the words by frequency
token.fit_on_texts(train_text)
print(token.document_count)

output

25000

Inspect the mapping:

print(token.word_index)

The word-to-index mapping looks like this:


{'the': 1, 'and': 2, 'a': 3, 'of': 4, 'to': 5, 'is': 6, 'in': 7, 'it': 8, 'i': 9, 'this': 10, 'that': 11, 'was': 12, 'as': 13, 'for': 14, 'with': 15, 'movie': 16, 'but': 17, 'film': 18……

1.4 Convert the reviews to integer sequences using the dictionary

x_train_seq = token.texts_to_sequences(train_text)
x_test_seq  = token.texts_to_sequences(test_text)
print(train_text[0])

output

This film was enjoyable but for the wrong reasons. The co-ordination of the action sequences are laughable and make the film have some funny slap stick moments. Robert Ginty and Fred Williamson have a memorable scene together near the end, where Williamson says to Ginty, “you sure do get around buddy boy!”. I did enjoy this film only for Ginty and Williamson, but not for the storyline that must have been written on the back of a napkin in four lines and the rest ad-libbed most likely. A film with over 30 parts only has a credit list for 10 or so. It seems odd that no one else has a credit in the film, maybe they had some insight into how the finished product would look. The one thing this film does have going for it is that it is quite violent, so that tripled with Fred Williamson and Robert Ginty make for a film worth seeing.

After conversion:


print(x_train_seq[0])

output

[10, 18, 12, 733, 17, 14, 1, 351, 1001, 1, 997, 4, 1, 202, 840, 22, 1322, 2, 93, 1, 18, 24, 45, 158, 1225, 384, 665, 2, 1792, 24, 3, 902, 132, 291, 746, 1, 126, 117, 554, 5, 21, 248, 78, 74, 183, 426, 9, 118, 353, 10, 18, 60, 14, 2, 17, 20, 14, 1, 763, 11, 211, 24, 73, 394, 19, 1, 141, 4, 3, 7, 683, 407, 2, 1, 356, 87, 1323, 3, 18, 15, 116, 1084, 527, 60, 43, 3, 1103, 1025, 14, 160, 38, 34, 8, 182, 1026, 11, 53, 27, 331, 43, 3, 1103, 7, 1, 18, 275, 32, 65, 45, 79, 85, 1, 1762, 58, 164, 1, 27, 151, 10, 18, 123, 24, 166, 14, 8, 6, 11, 8, 6, 175, 1109, 34, 11, 15, 1792, 2, 665, 93, 14, 3, 18, 286, 315]

1.5 Pad all sequences to the same length

We set the length to 100.

x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=100)

If a sequence is longer than 100, the beginning is truncated; if it is shorter, zeros are padded at the front.
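A quick sketch of this behaviour with pad_sequences' defaults (padding='pre', truncating='pre'), before applying it to the real data below:

from keras.preprocessing import sequence

# shorter than maxlen: zeros are added at the front
print(sequence.pad_sequences([[1, 2, 3]], maxlen=5))              # [[0 0 1 2 3]]
# longer than maxlen: the front of the sequence is cut off
print(sequence.pad_sequences([[1, 2, 3, 4, 5, 6, 7]], maxlen=5))  # [[3 4 5 6 7]]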

print('before pad_sequences length=',len(x_train_seq[0]))
print(x_train_seq[0])

print('after pad_sequences length=',len(x_train[0]))
print(x_train[0])

output
Before truncation:

before pad_sequences length= 143
[10, 18, 12, 733, 17, 14, 1, 351, 1001, 1, 997, 4, 1, 202, 840, 22, 1322, 2, 93, 1, 18, 24, 45, 158, 1225, 384, 665, 2, 1792, 24, 3, 902, 132, 291, 746, 1, 126, 117, 554, 5, 21, 248, 78, 74, 183, 426, 9, 118, 353, 10, 18, 60, 14, 2, 17, 20, 14, 1, 763, 11, 211, 24, 73, 394, 19, 1, 141, 4, 3, 7, 683, 407, 2, 1, 356, 87, 1323, 3, 18, 15, 116, 1084, 527, 60, 43, 3, 1103, 1025, 14, 160, 38, 34, 8, 182, 1026, 11, 53, 27, 331, 43, 3, 1103, 7, 1, 18, 275, 32, 65, 45, 79, 85, 1, 1762, 58, 164, 1, 27, 151, 10, 18, 123, 24, 166, 14, 8, 6, 11, 8, 6, 175, 1109, 34, 11, 15, 1792, 2, 665, 93, 14, 3, 18, 286, 315]

After truncation:

after pad_sequences length= 100
[ 74 183 426 9 118 353 10 18 60 14 2 17 20 14
1 763 11 211 24 73 394 19 1 141 4 3 7 683
407 2 1 356 87 1323 3 18 15 116 1084 527 60 43
3 1103 1025 14 160 38 34 8 182 1026 11 53 27 331
43 3 1103 7 1 18 275 32 65 45 79 85 1 1762
58 164 1 27 151 10 18 123 24 166 14 8 6 11
8 6 175 1109 34 11 15 1792 2 665 93 14 3 18
286 315]


print('before pad_sequences length=',len(x_train_seq[4]))
print(x_train_seq[4])

print('after pad_sequences length=',len(x_train[4]))
print(x_train[4])

output

before pad_sequences length= 63
[9, 82, 215, 10, 16, 7, 1, 747, 50, 9, 12, 708, 149, 150, 2, 8, 127, 68, 52, 1, 22, 34, 641, 2, 32, 1475, 119, 967, 7, 3, 92, 57, 510, 132, 6, 50, 1, 376, 3, 19, 4, 29, 3, 10, 6, 27, 1389, 2, 158, 16, 14, 358, 2, 842, 192, 408, 713, 2, 1, 356, 4, 1, 1357]


after pad_sequences length= 100
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 9 82 215 10 16
7 1 747 50 9 12 708 149 150 2 8 127 68 52
1 22 34 641 2 32 1475 119 967 7 3 92 57 510
132 6 50 1 376 3 19 4 29 3 10 6 27 1389
2 158 16 14 358 2 842 192 408 713 2 1 356 4
1 1357]

2 IMDb sentiment analysis with an MLP model

2.1 Build Model

Word embedding is a natural language processing technique; we add an embedding layer as the first layer of the model, which looks like this:

model.add(Embedding(output_dim=32,
                    input_dim=2000, 
                    input_length=100))

The input dimension is 2000 because our dictionary contains 2000 words; the output dimension is 32, meaning each word index is mapped to a 32-dimensional vector, so that words with similar meanings end up closer together in the embedding space; the input length is 100 because every review has been truncated or padded to 100.

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation,Flatten
from keras.layers.embeddings import Embedding
model = Sequential()

model.add(Embedding(output_dim=32,
                    input_dim=2000, 
                    input_length=100))
model.add(Dropout(0.2))

model.add(Flatten())

model.add(Dense(units=256,
                activation='relu' ))

model.add(Dropout(0.2))

model.add(Dense(units=1,
                activation='sigmoid' ))

print(model.summary())

output

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               819456    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________
None

Parameter counts:
embedding_1: 32 * 2000 = 64000
dense_1: 3200 * 256 + 256 = 819456
dense_2: 256 * 1 + 1 = 257
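These numbers can also be read directly from the layers as a sanity check (a small sketch using the model built above):

# print each layer's trainable parameter count
for layer in model.layers:
    print(layer.name, layer.count_params())
# embedding layer: 2000 words * 32 dimensions    = 64000
# dense_1:         3200 inputs * 256 units + 256 = 819456
# dense_2:         256 inputs * 1 unit + 1       = 257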

2.2 Training process

model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

train_history =model.fit(x_train, y_train,batch_size=100, 
                         epochs=10,verbose=2,
                         validation_split=0.2)

For an explanation of these parameters, see 【Keras-MLP】MNIST or the Keras documentation.

output

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 5s - loss: 0.4754 - acc: 0.7577 - val_loss: 0.5774 - val_acc: 0.7230
Epoch 2/10
 - 1s - loss: 0.2650 - acc: 0.8895 - val_loss: 0.4413 - val_acc: 0.8040
Epoch 3/10
 - 1s - loss: 0.1529 - acc: 0.9437 - val_loss: 0.6667 - val_acc: 0.7530
Epoch 4/10
 - 1s - loss: 0.0787 - acc: 0.9724 - val_loss: 0.8080 - val_acc: 0.7568
Epoch 5/10
 - 1s - loss: 0.0486 - acc: 0.9821 - val_loss: 1.0333 - val_acc: 0.7340
Epoch 6/10
 - 1s - loss: 0.0361 - acc: 0.9873 - val_loss: 1.1274 - val_acc: 0.7444
Epoch 7/10
 - 1s - loss: 0.0300 - acc: 0.9893 - val_loss: 1.1928 - val_acc: 0.7450
Epoch 8/10
 - 1s - loss: 0.0261 - acc: 0.9905 - val_loss: 1.0813 - val_acc: 0.7746
Epoch 9/10
 - 1s - loss: 0.0256 - acc: 0.9908 - val_loss: 1.3871 - val_acc: 0.7270
Epoch 10/10
 - 1s - loss: 0.0254 - acc: 0.9909 - val_loss: 1.4879 - val_acc: 0.7212

Visualize the results:

%pylab inline
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

Call it to see how the accuracy evolves:

show_train_history(train_history,'acc','val_acc')

[Figure: training and validation accuracy over epochs]

Call it to see how the loss evolves:

show_train_history(train_history,'loss','val_loss')

[Figure: training and validation loss over epochs]

The model is clearly overfitting: training accuracy approaches 99% while the validation loss keeps growing.

2.3 Evaluate the model's accuracy

scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]

Note: scores[0] is the loss; scores[1] is the accuracy.
output

25000/25000 [==============================] - 2s 77us/step
0.80772

2.4 Predict probabilities

The output layer uses a sigmoid activation, so for this binary classification problem each prediction is a probability between 0 and 1.

probility=model.predict(x_test)
probility[:10]

output

array([[ 0.99994957],
       [ 0.99999988],
       [ 0.76855087],
       [ 0.99998605],
       [ 0.9999541 ],
       [ 0.99995875],
       [ 0.99999988],
       [ 0.99994111],
       [ 0.99983203],
       [ 0.99997246]], dtype=float32)
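These probabilities can be turned into class labels by thresholding at 0.5, which is what predict_classes does in the next step (a small sketch, also handy for newer Keras versions where Sequential.predict_classes has been removed):

import numpy as np

# 1 if the predicted probability is above 0.5, otherwise 0
predict_from_prob = (probility > 0.5).astype('int32')
print(predict_from_prob[:10])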

2.5 Predict classes

predict=model.predict_classes(x_test)
predict[:10]

output

array([[1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1]], dtype=int32)

2.6 Inspect a review and its prediction

SentimentDict={1:'positive',0:'negative'}
def display_test_Sentiment(i):
    print(test_text[i])
    print('ground truth:',SentimentDict[y_test[i]],
          'prediction:',SentimentDict[predict[i][0]])

Call it:

display_test_Sentiment(2)

output

BLACK WATER is a thriller that manages to completely transcend it’s limitations (it’s an indie flick) by continually subverting expectations to emerge as an intense experience.In the tradition of all good animal centered thrillers ie Jaws, The Edge, the original Cat People, the directors know that restraint and what isn’t shown are the best ways to pack a punch. The performances are real and gripping, the crocdodile is extremely well done, indeed if the Black Water website is to be believed that’s because they used real crocs and the swamp location is fabulous.If you are after a B-grade gore fest croc romp forget Black Water but if you want a clever, suspenseful ride that will have you fearing the water and wondering what the hell would I do if i was up that tree then it’s a must see.
ground truth: positive prediction: positive

2.7 Predict a review of Beauty and the Beast

input_text='''
Oh dear, oh dear, oh dear: where should I start folks. I had low expectations already because I hated each and every single trailer so far, but boy did Disney make a blunder here. I’m sure the film will still make a billion dollars - hey: if Transformers 11 can do it, why not Belle? - but this film kills every subtle beautiful little thing that had made the original special, and it does so already in the very early stages. It’s like the dinosaur stampede scene in Jackson’s King Kong: only with even worse CGI (and, well, kitchen devices instead of dinos).
The worst sin, though, is that everything (and I mean really EVERYTHING) looks fake. What’s the point of making a live-action version of a beloved cartoon if you make every prop look like a prop? I know it’s a fairy tale for kids, but even Belle’s village looks like it had only recently been put there by a subpar production designer trying to copy the images from the cartoon. There is not a hint of authenticity here. Unlike in Jungle Book, where we got great looking CGI, this really is the by-the-numbers version and corporate filmmaking at its worst. Of course it’s not really a “bad” film; those 200 million blockbusters rarely are (this isn’t ‘The Room’ after all), but it’s so infuriatingly generic and dull - and it didn’t have to be. In the hands of a great director the potential for this film would have been huge.
Oh and one more thing: bad CGI wolves (who actually look even worse than the ones in Twilight) is one thing, and the kids probably won’t care. But making one of the two lead characters - Beast - look equally bad is simply unforgivably stupid. No wonder Emma Watson seems to phone it in: she apparently had to act against an guy with a green-screen in the place where his face should have been.
'''

The review above was copied from IMDb and assigned to the string input_text. Preprocess it the same way as the training data:

# step 1: convert the review text to an integer sequence
input_seq = token.texts_to_sequences([input_text])
# step 2: pad/truncate the sequence to length 100
pad_input_seq  = sequence.pad_sequences(input_seq , maxlen=100)

Then predict the result with the trained model:

predict_result=model.predict_classes(pad_input_seq)
predict_result[0][0]

output

0

Convert the result with the SentimentDict defined earlier:

SentimentDict[predict_result[0][0]]

output

'negative'

2.8 Predict whether a review is positive or negative

Wrap the steps from section 2.7 into a function, defined as follows:

def predict_review(input_text):
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq  = sequence.pad_sequences(input_seq , maxlen=100)
    predict_result=model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])

Reviews to test can be copied from, for example:
https://www.imdb.com/title/tt2771200

Call it:

predict_review('''
It’s hard to believe that the same talented director who made the influential cult action classic The Road Warrior had anything to do with this disaster.
Road Warrior was raw, gritty, violent and uncompromising, and this movie is the exact opposite. It’s like Road Warrior for kids who need constant action in their movies.
This is the movie. The good guys get into a fight with the bad guys, outrun them, they break down in their vehicle and fix it. Rinse and repeat. The second half of the movie is the first half again just done faster.
The Road Warrior may have been a simple premise but it made you feel something, even with it’s opening narration before any action was even shown. And the supporting characters were given just enough time for each of them to be likable or relatable.
In this movie there is absolutely nothing and no one to care about. We’re supposed to care about the characters because… well we should. George Miller just wants us to, and in one of the most cringe worthy moments Charlize Theron’s character breaks down while dramatic music plays to try desperately to make us care.
Tom Hardy is pathetic as Max. One of the dullest leading men I’ve seen in a long time. There’s not one single moment throughout the entire movie where he comes anywhere near reaching the same level of charisma Mel Gibson did in the role. Gibson made more of an impression just eating a tin of dog food. I’m still confused as to what accent Hardy was even trying to do.
I was amazed that Max has now become a cartoon character as well. Gibson’s Max was a semi-realistic tough guy who hurt, bled, and nearly died several times. Now he survives car crashes and tornadoes with ease?
In the previous movies, fuel and guns and bullets were rare. Not anymore. It doesn’t even seem Post-Apocalyptic. There’s no sense of desperation anymore and everything is too glossy looking. And the main villain’s super model looking wives with their perfect skin are about as convincing as apocalyptic survivors as Hardy’s Australian accent is. They’re so boring and one-dimensional, George Miller could have combined them all into one character and you wouldn’t miss anyone.
Some of the green screen is very obvious and fake looking, and the CGI sandstorm is laughably bad. It wouldn’t look out of place in a Pixar movie.
There’s no tension, no real struggle, or any real dirt and grit that Road Warrior had. Everything George Miller got right with that masterpiece he gets completely wrong here.
''')

output

negative

2.9 Save the model

model_json = model.to_json()
with open("SaveModel/Imdb_RNN_model.json", "w") as json_file:
    json_file.write(model_json)

model.save_weights("SaveModel/Imdb_RNN_model.h5")
print("Saved model to disk")

3 A larger MLP model for IMDb sentiment analysis

In section 2 the dictionary had 2000 words and reviews were cut to length 100, which gave roughly 80% accuracy. In this section we enlarge the dictionary to 3800 words and the review length to 380.

3.1 Data preprocessing

The steps are the same as described above.

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import numpy as np
np.random.seed(10)
import re
re_tag = re.compile(r'<[^>]+>')

def rm_tags(text):
    return re_tag.sub('', text)

import os
def read_files(filetype):
    path = "data/aclImdb/"
    file_list=[]

    positive_path=path + filetype+"/pos/"
    for f in os.listdir(positive_path):
        file_list+=[positive_path+f]
    
    negative_path=path + filetype+"/neg/"
    for f in os.listdir(negative_path):
        file_list+=[negative_path+f]
        
    print('read',filetype, 'files:',len(file_list))
       
    all_labels = ([1] * 12500 + [0] * 12500) 
    
    all_texts  = []
    
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
            
    return all_labels,all_texts

y_train,train_text=read_files("train")
y_test,test_text=read_files("test")

# build the word-to-index dictionary from the 3800 most frequent words
token = Tokenizer(num_words=3800)
token.fit_on_texts(train_text)

# convert the reviews to integer sequences
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq  = token.texts_to_sequences(test_text)

# pad/truncate all sequences to length 380
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test  = sequence.pad_sequences(x_test_seq,  maxlen=380)

3.2 Build Model

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation,Flatten
from keras.layers.embeddings import Embedding
model = Sequential()

model.add(Embedding(output_dim=32,
                    input_dim=3800, 
                    input_length=380))
model.add(Dropout(0.2))

model.add(Flatten())

model.add(Dense(units=256,
                activation='relu' ))
model.add(Dropout(0.2))

model.add(Dense(units=1,
                activation='sigmoid' ))

model.summary()

output

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_1 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 12160)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3113216   
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 3,235,073
Trainable params: 3,235,073
Non-trainable params: 0
_________________________________________________________________

Parameter counts:

embedding_1: 32 * 3800 = 121600

dense_1: 12160 * 256 + 256 = 3113216

dense_2: 256 * 1 + 1 = 257

3.3 Training

model.compile(loss='binary_crossentropy', 
              #optimizer='rmsprop', 
              optimizer='adam', 
              metrics=['accuracy'])

train_history =model.fit(x_train, y_train,batch_size=100, 
                         epochs=10,verbose=2,
                         validation_split=0.2)

output

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 2s - loss: 0.0159 - acc: 0.9943 - val_loss: 1.3720 - val_acc: 0.7720
Epoch 2/10
 - 2s - loss: 0.0091 - acc: 0.9972 - val_loss: 1.1946 - val_acc: 0.8074
Epoch 3/10
 - 2s - loss: 0.0084 - acc: 0.9972 - val_loss: 1.3041 - val_acc: 0.7916
Epoch 4/10
 - 2s - loss: 0.0082 - acc: 0.9972 - val_loss: 1.5168 - val_acc: 0.7724
Epoch 5/10
 - 2s - loss: 0.0088 - acc: 0.9969 - val_loss: 1.0484 - val_acc: 0.8422
Epoch 6/10
 - 2s - loss: 0.0119 - acc: 0.9958 - val_loss: 1.3346 - val_acc: 0.7970
Epoch 7/10
 - 2s - loss: 0.0097 - acc: 0.9971 - val_loss: 1.2102 - val_acc: 0.8164
Epoch 8/10
 - 2s - loss: 0.0106 - acc: 0.9962 - val_loss: 1.2153 - val_acc: 0.8168
Epoch 9/10
 - 2s - loss: 0.0057 - acc: 0.9980 - val_loss: 1.4007 - val_acc: 0.7938
Epoch 10/10
 - 2s - loss: 0.0062 - acc: 0.9977 - val_loss: 1.1795 - val_acc: 0.8264

Visualize the results:

%pylab inline
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

Call it to view the accuracy curves:

show_train_history(train_history,'acc','val_acc')

[Figure: training and validation accuracy over epochs]

Call it to view the loss curves:

show_train_history(train_history,'loss','val_loss')

[Figure: training and validation loss over epochs]

3.4 Evaluate the model

scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]

output

25000/25000 [==============================] - 2s 84us/step
0.8526

3.5 Predict probabilities and classes

Predicted probabilities:

probility=model.predict(x_test)
probility[:10]

output

array([[1.        ],
       [0.9997482 ],
       [0.9999901 ],
       [0.99978524],
       [1.        ],
       [0.99999976],
       [0.9993693 ],
       [0.998722  ],
       [0.99913496],
       [0.08446117]], dtype=float32)

Predicted classes:

predict=model.predict_classes(x_test)
predict

output

array([[1],
       [1],
       [1],
       ...,
       [0],
       [0],
       [1]], dtype=int32)

3.6 Inspect a sample and its prediction

SentimentDict={1:'positive',0:'negative'}
def display_test_Sentiment(i):
    print(test_text[i])
    print('label:',SentimentDict[y_test[i]],'prediction:',SentimentDict[predict[i][0]])

Call it:

display_test_Sentiment(2)

output

BLACK WATER is a thriller that manages to completely transcend it’s limitations (it’s an indie flick) by continually subverting expectations to emerge as an intense experience.In the tradition of all good animal centered thrillers ie Jaws, The Edge, the original Cat People, the directors know that restraint and what isn’t shown are the best ways to pack a punch. The performances are real and gripping, the crocdodile is extremely well done, indeed if the Black Water website is to be believed that’s because they used real crocs and the swamp location is fabulous.If you are after a B-grade gore fest croc romp forget Black Water but if you want a clever, suspenseful ride that will have you fearing the water and wondering what the hell would I do if i was up that tree then it’s a must see.
label: positive prediction: positive

3.7 Predict a new review

Same as in section 2, except maxlen is now 380:

def predict_review(input_text):
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq  = sequence.pad_sequences(input_seq , maxlen=380)
    predict_result=model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])

Call it:

predict_review('''
It’s hard to believe that the same talented director who made the influential cult action classic The Road Warrior had anything to do with this disaster.
Road Warrior was raw, gritty, violent and uncompromising, and this movie is the exact opposite. It’s like Road Warrior for kids who need constant action in their movies.
This is the movie. The good guys get into a fight with the bad guys, outrun them, they break down in their vehicle and fix it. Rinse and repeat. The second half of the movie is the first half again just done faster.
The Road Warrior may have been a simple premise but it made you feel something, even with it’s opening narration before any action was even shown. And the supporting characters were given just enough time for each of them to be likable or relatable.
In this movie there is absolutely nothing and no one to care about. We’re supposed to care about the characters because… well we should. George Miller just wants us to, and in one of the most cringe worthy moments Charlize Theron’s character breaks down while dramatic music plays to try desperately to make us care.
Tom Hardy is pathetic as Max. One of the dullest leading men I’ve seen in a long time. There’s not one single moment throughout the entire movie where he comes anywhere near reaching the same level of charisma Mel Gibson did in the role. Gibson made more of an impression just eating a tin of dog food. I’m still confused as to what accent Hardy was even trying to do.
I was amazed that Max has now become a cartoon character as well. Gibson’s Max was a semi-realistic tough guy who hurt, bled, and nearly died several times. Now he survives car crashes and tornadoes with ease?
In the previous movies, fuel and guns and bullets were rare. Not anymore. It doesn’t even seem Post-Apocalyptic. There’s no sense of desperation anymore and everything is too glossy looking. And the main villain’s super model looking wives with their perfect skin are about as convincing as apocalyptic survivors as Hardy’s Australian accent is. They’re so boring and one-dimensional, George Miller could have combined them all into one character and you wouldn’t miss anyone.
Some of the green screen is very obvious and fake looking, and the CGI sandstorm is laughably bad. It wouldn’t look out of place in a Pixar movie.
There’s no tension, no real struggle, or any real dirt and grit that Road Warrior had. Everything George Miller got right with that masterpiece he gets completely wrong here.
''')

output

negative

3.8 Save the model

model_json = model.to_json()
with open("SaveModel/Imdb_RNN_model.json", "w") as json_file:
    json_file.write(model_json)

model.save_weights("SaveModel/Imdb_RNN_model.h5")
print("Saved model to disk")

Test accuracy improved from about 80% to about 85%.


Acknowledgement

The code in this post comes from the book《TensorFlow+Keras深度学习人工智能实践应用》by 林大贵. Please credit the source when quoting or reposting. If the book interests you, consider buying a copy!
