How to Build a Neural Network with Keras: Predicting the Sentiment of Movie Reviews

Keras is currently one of the most popular deep learning libraries. It is remarkably simple to use and lets you build a neural network with just a few lines of code. In this post, you will learn how to build a neural network with Keras that classifies the sentiment of movie reviews into two categories: positive or negative. We will use the well-known imdb review dataset for sentiment analysis to do this. With a few changes, the model we build can also be applied to other machine learning problems.

What is Keras?

Keras is an open-source Python library that lets you build neural networks easily with just a few lines of code. The library can run on top of TensorFlow, Microsoft Cognitive Toolkit, Theano and MXNet. TensorFlow and Theano are the most commonly used numerical platforms for building deep learning algorithms in Python, but they can be fairly complex and difficult to use. By contrast, Keras provides a simple way to create deep learning models. It was created to make building neural networks as fast and easy as possible. Its creator, François Chollet, focused on minimalism, modularity and Python friendliness. Keras can run on both GPU and CPU, and it supports Python 2 and 3.

What is Sentiment Analysis?

With sentiment analysis we want to determine, for example, a speaker's or writer's attitude (i.e. sentiment) towards a document, interaction or event. It is therefore a natural language processing problem in which the text has to be understood in order to predict the underlying intent. Sentiment is mostly classified into positive, negative and neutral categories. Using sentiment analysis, we can, for instance, predict a customer's opinion of and attitude towards a product based on a review they wrote. Because of this, sentiment analysis is widely applied to reviews, surveys, documents and much more.

The imdb Dataset

The imdb sentiment classification dataset consists of 50,000 movie reviews from imdb users, labeled as positive (1) or negative (0). The reviews are preprocessed, and each one is encoded as a sequence of word indexes in the form of integers. The words in the reviews are indexed by their overall frequency in the dataset; for example, the integer "2" encodes the second most frequent word in the data. The 50,000 reviews are split into 25,000 for training and 25,000 for testing. The dataset was created by researchers at Stanford University and published in a paper in 2011, in which they achieved 88.89% accuracy. It was also used in the "Bag of Words Meets Bags of Popcorn" Kaggle competition in 2011.
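To get a feel for this frequency-based encoding, you can peek at the word index that Keras ships with the dataset. This short snippet is an optional illustration, not part of the original tutorial:

from keras.datasets import imdb

word_index = imdb.get_word_index()
# Sort words by index: in the raw index, 1 is the most frequent word, 2 the second most frequent, and so on
# (load_data later shifts these indexes by 3 reserved ids)
most_frequent = sorted(word_index.items(), key=lambda item: item[1])[:5]
print(most_frequent)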

Building the Neural Network

We start by importing the required dependencies:

import matplotlib

import matplotlib.pyplot as plt

import numpy as np

from keras.utils import to_categorical

from keras import models

from keras import layers

We continue by downloading the imdb dataset, which is conveniently built into Keras. Since we do not want a 50/50 train-test split, we merge the data and the targets right after downloading, so that we can do an 80/20 split later.

from keras.datasets import imdb

(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)

data = np.concatenate((training_data, testing_data), axis=0)

targets = np.concatenate((training_targets, testing_targets), axis=0)

Now we can take a look at the dataset:

print("Categories:", np.unique(targets))

print("Number of unique words:", len(np.unique(np.hstack(data))))

Categories: [0 1]

Number of unique words: 9998

length = [len(i) for i in data]

print("Average Review length:", np.mean(length))

print("Standard Deviation:", round(np.std(length)))

Average Review length: 234.75892

Standard Deviation: 173.0

You can see that the dataset is labeled with two categories, 0 and 1, which represent the sentiment of the review. The whole dataset contains 9,998 unique words, the average review length is 234 words, and the standard deviation is 173 words.
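Since matplotlib is already imported, a quick optional way to see this distribution of review lengths (not part of the original tutorial) is a simple histogram:

plt.hist(length, bins=50)
plt.xlabel("Review length (words)")
plt.ylabel("Number of reviews")
plt.title("Distribution of review lengths")
plt.show()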

Now let's look at a single training example:

print("Label:", targets[0])

Label: 1

print(data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

Above you can see the first review of the dataset, which is labeled as positive (1). The code below retrieves the dictionary that maps word indexes back to the original words so that we can read the review. It replaces every unknown word with a "#". It does this by using the get_word_index() function.

index = imdb.get_word_index()

reverse_index = dict([(value, key) for (key, value) in index.items()])

decoded = " ".join( [reverse_index.get(i - 3, "#") for i in data[0]] )

print(decoded)

# this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

We vectorize every review and fill it with zeros so that each one contains exactly 10,000 numbers. That means we fill every review that is shorter than 10,000 with zeros. We do this because the longest review is nearly that long, and every input to our neural network needs to have the same size.

def vectorize(sequences, dimension = 10000):
    # Create an all-zero matrix of shape (number of reviews, dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the entries at the word indexes of this review to 1
        results[i, sequence] = 1
    return results

data = vectorize(data)

targets = np.array(targets).astype("float32")

Now we split the data into a training set and a test set. The training set will contain 40,000 reviews and the test set 10,000.

test_x = data[:10000]

test_y = targets[:10000]

train_x = data[10000:]

train_y = targets[10000:]

We can now build our simple neural network. We start by defining the type of model we want to build. There are two types of models available in Keras: the Sequential model and the Model class used with the functional API. We then simply add the input, hidden and output layers. Between them, we use dropout to prevent overfitting. Every layer is a "Dense" layer, which means it is fully connected. In the hidden layers we use the relu activation function and in the output layer the sigmoid function. Finally, we let Keras print a summary of the model we just built.

model = models.Sequential()

# Input - Layer
model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))

# Hidden - Layers
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))

model.add(layers.Dense(50, activation = "relu"))

model.add(layers.Dropout(0.2, noise_shape=None, seed=None))

model.add(layers.Dense(50, activation = "relu"))

# Output - Layer
model.add(layers.Dense(1, activation = "sigmoid"))

model.summary()

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

dense_1 (Dense) (None, 50) 500050

_________________________________________________________________

dropout_1 (Dropout) (None, 50) 0

_________________________________________________________________

dense_2 (Dense) (None, 50) 2550

_________________________________________________________________

dropout_2 (Dropout) (None, 50) 0

_________________________________________________________________

dense_3 (Dense) (None, 50) 2550

_________________________________________________________________

dense_4 (Dense) (None, 1) 51

=================================================================

Total params: 505,201

Trainable params: 505,201

Non-trainable params: 0

_________________________________________________________________
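For comparison, the same architecture could also be written with the functional API mentioned above. This is just an equivalent sketch, not code from the original article:

from keras.models import Model
from keras.layers import Input, Dense, Dropout

inputs = Input(shape=(10000, ))
x = Dense(50, activation = "relu")(inputs)
x = Dropout(0.3)(x)
x = Dense(50, activation = "relu")(x)
x = Dropout(0.2)(x)
x = Dense(50, activation = "relu")(x)
outputs = Dense(1, activation = "sigmoid")(x)
functional_model = Model(inputs=inputs, outputs=outputs)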

Now we need to compile our model, which is nothing more than configuring the model for training. We use the "adam" optimizer, binary cross-entropy as the loss, and accuracy as our evaluation metric.

model.compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = ["accuracy"]
)
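For intuition, binary cross-entropy for a single prediction can be written out in plain NumPy. This helper is purely illustrative and not part of the original code; Keras computes the loss internally:

def binary_crossentropy(y_true, y_pred, eps = 1e-7):
    # Clip the prediction to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_crossentropy(1.0, 0.9))  # small loss: confident and correct
print(binary_crossentropy(1.0, 0.1))  # large loss: confident but wrong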

We can now train our model. We do this with a batch_size of 500 and only two epochs, because I found that the model overfits if we train it longer. We save the results in the "results" variable:

results = model.fit(
    train_x, train_y,
    epochs = 2,
    batch_size = 500,
    validation_data = (test_x, test_y)
)

Train on 40000 samples, validate on 10000 samples

Epoch 1/2

40000/40000 [==============================] - 5s 129us/step - loss: 0.4051 - acc: 0.8212 - val_loss: 0.2635 - val_acc: 0.8945

Epoch 2/2

40000/40000 [==============================] - 4s 90us/step - loss: 0.2122 - acc: 0.9190 - val_loss: 0.2598 - val_acc: 0.8950

It is now time to evaluate our model:

print(np.mean(results.history["val_acc"]))

0.894750000536
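Alternatively, Keras' built-in evaluate method reports the loss and accuracy on the held-out data directly. This call is an optional addition, not part of the original evaluation:

test_loss, test_acc = model.evaluate(test_x, test_y, batch_size=500)
print("Test-Accuracy:", test_acc)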

The code for the whole model:

import numpy as np

from keras.utils import to_categorical

from keras import models

from keras import layers

from keras.datasets import imdb

(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)

data = np.concatenate((training_data, testing_data), axis=0)

targets = np.concatenate((training_targets, testing_targets), axis=0)

def vectorize(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

data = vectorize(data)
targets = np.array(targets).astype("float32")

test_x = data[:10000]

test_y = targets[:10000]

train_x = data[10000:]

train_y = targets[10000:]

model = models.Sequential()

# Input - Layer

model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))

# Hidden - Layers

model.add(layers.Dropout(0.3, noise_shape=None, seed=None))

model.add(layers.Dense(50, activation = "relu"))

model.add(layers.Dropout(0.2, noise_shape=None, seed=None))

model.add(layers.Dense(50, activation = "relu"))

# Output- Layer

model.add(layers.Dense(1, activation = "sigmoid"))

model.summary()

# compiling the model

model.compile(

optimizer = "adam",

loss = "binary_crossentropy",

metrics = ["accuracy"]

)

results = model.fit(
    train_x, train_y,
    epochs = 2,
    batch_size = 500,
    validation_data = (test_x, test_y)
)

print("Test-Accuracy:", np.mean(results.history["val_acc"]))

Finally

If you liked this article, please follow and share it. Thank you!
