Deep Learning with Python and Keras (News Classification: A Multiclass Problem)

1. Preparing the Data

We use the Reuters dataset, which contains many short newswires and their corresponding topics, published by Reuters in 1986. There are 46 different topics, and each topic has at least 10 examples in the training set.

Load the Reuters dataset:

import os
os.environ['KERAS_BACKEND'] = 'tensorflow'  # must be set before Keras is imported

from keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
    path='D:/jupyter/deepLearning/reuters.npz', num_words=10000)
print(len(train_data), len(test_data))
train_data[10]

As with the IMDB dataset, the argument num_words=10000 restricts the data to the 10,000 most frequently occurring words. There are 8,982 training examples and 2,246 test examples.

8982 2246
[1, 245, 273, 207, 156, 53, 74, 160, 26, 14, 46, 296, 26, 39, 74, 2979, 3554, 14, 46, 4689, 4329, 86, 61, 3499, 4795, 14, 61, 451, 4329, 17, 12]
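
Because of num_words=10000, no word index should exceed 9,999, which we can verify (a small sanity check, not part of the original post):

# confirm that num_words=10000 capped the vocabulary
max_index = max(max(sequence) for sequence in train_data)
print(max_index)  # expected: 9999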

The indices can be decoded back to text:

word_index = reuters.get_word_index(path='D:/jupyter/deepLearning/reuters_word_index.json')
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# indices are offset by 3, because 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown"
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decoded_newswire)
? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3
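
Each label is an integer topic index between 0 and 45 (46 topics in all), which is easy to check (a hypothetical snippet, not in the original post):

# topic labels are integers in [0, 45], one per newswire
print(min(train_labels), max(train_labels))  # expected: 0 45
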
2. Processing the Data

Vectorize the data:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the entries at this sample's word indices to 1
    return results

x_train = vectorize_sequences(train_data)  # vectorize the training data
x_test = vectorize_sequences(test_data)  # vectorize the test data
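
Each sample is now a 10,000-dimensional vector of 0s and 1s. A quick shape check (a minimal sketch, not in the original post):

print(x_train.shape)  # expected: (8982, 10000)
print(x_test.shape)   # expected: (2246, 10000)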

Vectorize the labels with one-hot encoding: the one-hot encoding of a label is an all-zero vector in which only the element at the label's index is 1.
A manual implementation:

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

Keras has a built-in method for this:

from keras.utils.np_utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
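
Both approaches produce identical arrays, which is easy to verify (a hypothetical check, not in the original post):

# the built-in utility matches the manual one-hot implementation
print(np.array_equal(to_one_hot(train_labels), to_categorical(train_labels)))  # expected: True
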
3. Building the Network

Stacking Dense layers: any information relevant to the classification problem that a layer drops can never be recovered by later layers. A 16-dimensional intermediate layer would therefore be too limited to learn to separate 46 different classes, so we use larger layers with 64 units.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

The last layer is a Dense layer of size 46: for every input sample, the network outputs a 46-dimensional vector, where output[i] is the probability that the sample belongs to class i, and the 46 probabilities sum to 1.
The loss function is categorical_crossentropy (categorical cross-entropy).

# compile the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
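
An equivalent alternative (a sketch, not used in the rest of this post) is to keep the labels as integer tensors and use sparse_categorical_crossentropy, which is mathematically identical but expects integer labels instead of one-hot vectors:

# alternative: integer labels + sparse categorical cross-entropy
# (if you use this, fit on train_labels directly rather than one_hot_train_labels)
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])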

# set aside a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

Train the model for 20 epochs:

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
Train on 7982 samples, validate on 1000 samples
Epoch 1/20
7982/7982 [==============================] - 4s 557us/step - loss: 2.5241 - acc: 0.4977 - val_loss: 1.7183 - val_acc: 0.6120
Epoch 2/20
7982/7982 [==============================] - 1s 155us/step - loss: 1.4443 - acc: 0.6889 - val_loss: 1.3496 - val_acc: 0.7090
Epoch 3/20
7982/7982 [==============================] - 1s 156us/step - loss: 1.0993 - acc: 0.7641 - val_loss: 1.1745 - val_acc: 0.7430
Epoch 4/20
7982/7982 [==============================] - 1s 152us/step - loss: 0.8729 - acc: 0.8157 - val_loss: 1.0842 - val_acc: 0.7580
Epoch 5/20
7982/7982 [==============================] - 1s 155us/step - loss: 0.7061 - acc: 0.8492 - val_loss: 0.9869 - val_acc: 0.7830
Epoch 6/20
7982/7982 [==============================] - 1s 167us/step - loss: 0.5696 - acc: 0.8790 - val_loss: 0.9418 - val_acc: 0.8040
Epoch 7/20
7982/7982 [==============================] - 1s 152us/step - loss: 0.4626 - acc: 0.9034 - val_loss: 0.9092 - val_acc: 0.8030
Epoch 8/20
7982/7982 [==============================] - 1s 155us/step - loss: 0.3728 - acc: 0.9221 - val_loss: 0.9330 - val_acc: 0.7910
Epoch 9/20
7982/7982 [==============================] - 1s 158us/step - loss: 0.3052 - acc: 0.9315 - val_loss: 0.8901 - val_acc: 0.8060
Epoch 10/20
7982/7982 [==============================] - 1s 170us/step - loss: 0.2547 - acc: 0.9415 - val_loss: 0.9053 - val_acc: 0.8140
Epoch 11/20
7982/7982 [==============================] - 1s 158us/step - loss: 0.2191 - acc: 0.9473 - val_loss: 0.9172 - val_acc: 0.8110
Epoch 12/20
7982/7982 [==============================] - 1s 163us/step - loss: 0.1877 - acc: 0.9513 - val_loss: 0.9061 - val_acc: 0.8130
Epoch 13/20
7982/7982 [==============================] - 1s 158us/step - loss: 0.1704 - acc: 0.9523 - val_loss: 0.9317 - val_acc: 0.8090
Epoch 14/20
7982/7982 [==============================] - 1s 155us/step - loss: 0.1534 - acc: 0.9555 - val_loss: 0.9633 - val_acc: 0.8050
Epoch 15/20
7982/7982 [==============================] - 1s 163us/step - loss: 0.1393 - acc: 0.9562 - val_loss: 0.9672 - val_acc: 0.8130
Epoch 16/20
7982/7982 [==============================] - 1s 159us/step - loss: 0.1315 - acc: 0.9559 - val_loss: 1.0246 - val_acc: 0.8030
Epoch 17/20
7982/7982 [==============================] - 1s 155us/step - loss: 0.1221 - acc: 0.9575 - val_loss: 1.0278 - val_acc: 0.7990
Epoch 18/20
7982/7982 [==============================] - 1s 158us/step - loss: 0.1199 - acc: 0.9570 - val_loss: 1.0403 - val_acc: 0.8040
Epoch 19/20
7982/7982 [==============================] - 1s 157us/step - loss: 0.1140 - acc: 0.9593 - val_loss: 1.0962 - val_acc: 0.7940
Epoch 20/20
7982/7982 [==============================] - 1s 158us/step - loss: 0.1113 - acc: 0.9595 - val_loss: 1.0677 - val_acc: 0.7980
4. Plotting the Accuracy and Loss Curves

All accuracy and loss curves later in this post can be drawn by calling these two functions:

import matplotlib.pyplot as plt

def plt_loss(history):

    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1, len(loss) + 1)

    plt.figure()

    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

    plt.show()

def plt_acc(history):

    acc = history.history['acc']
    val_acc = history.history['val_acc']

    epochs = range(1, len(acc) + 1)

    plt.figure()

    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()

    plt.show()
plt_loss(history)
plt_acc(history)
  • Loss curve

[Figure: training and validation loss]

  • Accuracy curve

[Figure: training and validation accuracy]

5. Generating Predictions

The model starts to overfit after roughly the 9th epoch. Train a new network from scratch for 9 epochs, then evaluate it on the test set.

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(partial_x_train,
          partial_y_train,
          epochs=9,
          batch_size=512,
          validation_data=(x_val, y_val))
results = model.evaluate(x_test, one_hot_test_labels)
Train on 7982 samples, validate on 1000 samples
Epoch 1/9
7982/7982 [==============================] - 2s 205us/step - loss: 2.5398 - acc: 0.5226 - val_loss: 1.6733 - val_acc: 0.6570
Epoch 2/9
7982/7982 [==============================] - 1s 158us/step - loss: 1.3712 - acc: 0.7121 - val_loss: 1.2758 - val_acc: 0.7210
Epoch 3/9
7982/7982 [==============================] - 1s 155us/step - loss: 1.0136 - acc: 0.7781 - val_loss: 1.1303 - val_acc: 0.7530
Epoch 4/9
7982/7982 [==============================] - 1s 159us/step - loss: 0.7976 - acc: 0.8251 - val_loss: 1.0539 - val_acc: 0.7590
Epoch 5/9
7982/7982 [==============================] - 1s 159us/step - loss: 0.6393 - acc: 0.8624 - val_loss: 0.9754 - val_acc: 0.7920
Epoch 6/9
7982/7982 [==============================] - 1s 159us/step - loss: 0.5124 - acc: 0.8923 - val_loss: 0.9102 - val_acc: 0.8140
Epoch 7/9
7982/7982 [==============================] - 1s 167us/step - loss: 0.4123 - acc: 0.9137 - val_loss: 0.8932 - val_acc: 0.8210
Epoch 8/9
7982/7982 [==============================] - 1s 162us/step - loss: 0.3354 - acc: 0.9288 - val_loss: 0.8732 - val_acc: 0.8260
Epoch 9/9
7982/7982 [==============================] - 1s 161us/step - loss: 0.2782 - acc: 0.9371 - val_loss: 0.9337 - val_acc: 0.8010
2246/2246 [==============================] - 1s 233us/step

As the results show, this approach reaches a test accuracy of about 78%, while a completely random classifier scores only about 18%.

print(results)
[1.0222080025626206, 0.7756010686194165]
# accuracy of a completely random baseline
import copy
test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits_array = np.array(test_labels) == np.array(test_labels_copy)
print(float(np.sum(hits_array)) / len(test_labels))
0.182546749777382

Generate predictions on new data:

predictions = model.predict(x_test)
predictions[0].shape

Each element of predictions is a vector of length 46:

(46,)
np.sum(predictions[5])

All of its entries sum to 1:

1.0

The class with the highest predicted probability:

np.argmax(predictions[0])
3
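
Taking the argmax across all test samples gives the predicted class for every newswire; comparing it against the integer labels reproduces the accuracy reported by model.evaluate (a hypothetical sketch, not in the original post):

# predicted class index for every test sample
predicted_classes = np.argmax(predictions, axis=1)
# fraction of correct predictions; should be close to results[1]
print(np.mean(predicted_classes == np.array(test_labels)))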