Digit Recognizer
Gang Ma
9/10/2018
- 1. Introduction
- 2. Data Preprocessing
- 2.1 Loading the Data
- 2.2 Checking for Null and Missing Values
- 2.3 Standardization and Normalization
- 2.4 Reshaping the Data
- 2.5 Encoding Labels as One-Hot Vectors
- 2.6 Splitting Training and Validation Sets
- 3. CNN
- 3.1 Defining the Model
- 3.2 Defining the Optimizer and Learning-Rate Annealer
- 3.3 Data Augmentation
- 4. Evaluating the Model
- 4.1 Training and Validation Curves
- 4.2 Confusion Matrix
- 5. Prediction and Submission
- 5.1 Predicting and Submitting Results
1. Introduction
A 7-layer CNN based on Keras (with TensorFlow as the backend).
- Data preprocessing
- CNN model and training
- Prediction and submission
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop,Nadam
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
2. Data Preprocessing
2.1 Loading the Data
# Load the data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(Y_train)
Y_train.value_counts()
1 4684
7 4401
3 4351
9 4188
2 4177
6 4137
0 4132
4 4072
8 4063
5 3795
Name: label, dtype: int64
There are 10 classes in total, and the dataset is balanced (each class has roughly the same number of samples).
2.2 Checking for Null and Missing Values
# Check the data
X_train.isnull().any().describe()
count 784
unique 1
top False
freq 784
dtype: object
test.isnull().any().describe()
count 784
unique 1
top False
freq 784
dtype: object
There are no missing values, so no rows need to be dropped or imputed; the data can be used as-is.
2.3 Standardization and Normalization
Grayscale normalization rescales the pixel values from [0, 255] to [0, 1].
# Normalize the data
X_train = X_train / 255.0
test = test / 255.0
# Alternatively, rescale to [-1, 1]:
# X_train = X_train / 127.5 - 1.
# test = test / 127.5 - 1.
2.4 Reshaping the Data
# Reshape image in 3 dimensions (height = 28px, width = 28px , canal = 1)
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)
The training and test images are stored by pandas as flat vectors of length 784, i.e. 28 × 28 pixels. Reshape them into the [None, 28, 28, 1] image format: grayscale images have 1 channel, while RGB images have 3.
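The reshape above can be sketched with plain NumPy, using a hypothetical batch of 5 flattened images:

```python
import numpy as np

# Hypothetical batch of 5 flattened grayscale images, 784 pixels each
flat = np.zeros((5, 784))

# -1 lets NumPy infer the batch dimension; the result matches the
# [None, 28, 28, 1] format the CNN expects
imgs = flat.reshape(-1, 28, 28, 1)
print(imgs.shape)  # (5, 28, 28, 1)
```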
2.5 Encoding Labels as One-Hot Vectors
# Encode labels to one hot vectors (ex : 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)
The labels run from 0 to 9 and need to be one-hot encoded. If the labels ran from 1 to 10, you would first subtract 1 and then encode.
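A minimal NumPy sketch of the same encoding (`to_categorical` produces the same result), including the shift needed for hypothetical 1-to-10 labels:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Each label selects one row of the identity matrix
    return np.eye(num_classes)[labels]

# 2 -> [0,0,1,0,0,0,0,0,0,0]
print(one_hot(np.array([2]))[0])
# Labels running 1..10 must be shifted down by 1 before encoding
print(one_hot(np.array([10]) - 1)[0])
```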
2.6 Splitting Training and Validation Sets
# Set the random seed
random_seed = 2
# Split the train and the validation set
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)
The validation set is 10% of the 42000 samples. Since the data is balanced, the per-class proportions need no special handling during the split.
For imbalanced data, train_test_split accepts a stratify argument (pass the label array, e.g. stratify=Y_train) to preserve class proportions in the split.
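A sketch of a stratified split on hypothetical imbalanced labels; note that `stratify` takes the label array itself, not a boolean:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 zeros and 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=2)
print(np.bincount(y_va))  # [9 1] -- the 9:1 ratio is preserved
```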
Take a look at a few examples after the split.
X_train.shape
(37800, 28, 28, 1)
# Some examples
g = plt.imshow(X_train[2][:,:,0])
3. CNN
3.1 Defining the Model
# Set the CNN model
# My CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*3 -> Flatten -> Dense -> Dropout -> Out
model = Sequential()

model.add(Conv2D(filters=32, kernel_size=(5, 5), padding='Same',
                 activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(filters=32, kernel_size=(5, 5), padding='Same',
                 activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters=128, kernel_size=(2, 2), padding='Same',
                 activation='relu'))
model.add(Conv2D(filters=128, kernel_size=(2, 2), padding='Same',
                 activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation="softmax"))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_11 (Conv2D) (None, 28, 28, 32) 832
_________________________________________________________________
conv2d_12 (Conv2D) (None, 28, 28, 32) 25632
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 14, 14, 32) 0
_________________________________________________________________
dropout_8 (Dropout) (None, 14, 14, 32) 0
_________________________________________________________________
conv2d_13 (Conv2D) (None, 14, 14, 64) 18496
_________________________________________________________________
conv2d_14 (Conv2D) (None, 14, 14, 64) 36928
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 7, 7, 64) 0
_________________________________________________________________
dropout_9 (Dropout) (None, 7, 7, 64) 0
_________________________________________________________________
conv2d_15 (Conv2D) (None, 7, 7, 128) 32896
_________________________________________________________________
conv2d_16 (Conv2D) (None, 7, 7, 128) 65664
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 3, 3, 128) 0
_________________________________________________________________
dropout_10 (Dropout) (None, 3, 3, 128) 0
_________________________________________________________________
flatten_3 (Flatten) (None, 1152) 0
_________________________________________________________________
dense_5 (Dense) (None, 256) 295168
_________________________________________________________________
dropout_11 (Dropout) (None, 256) 0
_________________________________________________________________
dense_6 (Dense) (None, 10) 2570
=================================================================
Total params: 478,186
Trainable params: 478,186
Non-trainable params: 0
_________________________________________________________________
3.2 Defining the Optimizer and Learning-Rate Annealer
Once the model is defined, attach an optimizer, a loss function, and an evaluation metric:
- categorical cross-entropy loss
- Nadam optimizer
- "accuracy" metric
# Another optimizer
# from keras.optimizers import Nadam
optimizer = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08, schedule_decay=0.004)
# Define the optimizer
# optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
# Compile the model
model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])
To tune the learning rate more effectively, use Keras's ReduceLROnPlateau callback, which automatically reduces the learning rate during training.
# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc',
                                            patience=3,
                                            verbose=1,
                                            factor=0.5,
                                            min_lr=0.00001)
epochs = 40  # Turn epochs to 30 to get 0.9967 accuracy
batch_size = 128
3.3 Data Augmentation
For image data, augmentation via rotations, shifts, and similar transforms helps prevent overfitting and improves generalization.
# Without data augmentation I obtained an accuracy of 0.98114
#history = model.fit(X_train, Y_train, batch_size = batch_size, epochs = epochs,
# validation_data = (X_val, Y_val), verbose = 2)
# With data augmentation to prevent overfitting (accuracy 0.99286)
datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,  # randomly zoom images
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,  # randomly flip images horizontally
    vertical_flip=False)  # randomly flip images vertically
datagen.fit(X_train)
The augmentation applies four transforms:
- rotate images by up to 10 degrees
- randomly zoom by up to 10%
- randomly shift horizontally by up to 10%
- randomly shift vertically by up to 10%
Compared with model.fit, model.fit_generator saves memory by streaming batches from the generator.
# Fit the model
# ImageDataGenerator().flow yields an iterator of augmented batches
history = model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                              epochs=epochs, validation_data=(X_val, Y_val),
                              verbose=2, steps_per_epoch=X_train.shape[0] // batch_size,
                              callbacks=[learning_rate_reduction])
Epoch 1/50
- 7s - loss: 0.6120 - acc: 0.8000 - val_loss: 0.0753 - val_acc: 0.9783
Epoch 2/50
- 6s - loss: 0.1312 - acc: 0.9617 - val_loss: 0.0382 - val_acc: 0.9869
Epoch 3/50
- 6s - loss: 0.1047 - acc: 0.9694 - val_loss: 0.0393 - val_acc: 0.9867
Epoch 4/50
- 6s - loss: 0.0852 - acc: 0.9744 - val_loss: 0.0353 - val_acc: 0.9890
Epoch 5/50
- 5s - loss: 0.0749 - acc: 0.9778 - val_loss: 0.0324 - val_acc: 0.9900
Epoch 6/50
- 6s - loss: 0.0708 - acc: 0.9797 - val_loss: 0.0246 - val_acc: 0.9917
Epoch 7/50
- 6s - loss: 0.0667 - acc: 0.9800 - val_loss: 0.0349 - val_acc: 0.9902
Epoch 8/50
- 5s - loss: 0.0648 - acc: 0.9813 - val_loss: 0.0228 - val_acc: 0.9929
Epoch 9/50
- 6s - loss: 0.0591 - acc: 0.9837 - val_loss: 0.0223 - val_acc: 0.9933
Epoch 10/50
- 5s - loss: 0.0567 - acc: 0.9835 - val_loss: 0.0226 - val_acc: 0.9943
Epoch 11/50
- 5s - loss: 0.0548 - acc: 0.9847 - val_loss: 0.0290 - val_acc: 0.9910
Epoch 12/50
- 5s - loss: 0.0533 - acc: 0.9842 - val_loss: 0.0333 - val_acc: 0.9917
Epoch 13/50
- 6s - loss: 0.0543 - acc: 0.9835 - val_loss: 0.0235 - val_acc: 0.9940
Epoch 00013: ReduceLROnPlateau reducing learning rate to 0.0010000000474974513.
Epoch 14/50
- 6s - loss: 0.0418 - acc: 0.9881 - val_loss: 0.0182 - val_acc: 0.9940
Epoch 15/50
- 6s - loss: 0.0360 - acc: 0.9897 - val_loss: 0.0221 - val_acc: 0.9940
Epoch 16/50
- 6s - loss: 0.0369 - acc: 0.9890 - val_loss: 0.0308 - val_acc: 0.9936
Epoch 00016: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 17/50
- 5s - loss: 0.0294 - acc: 0.9918 - val_loss: 0.0172 - val_acc: 0.9962
Epoch 18/50
- 6s - loss: 0.0266 - acc: 0.9920 - val_loss: 0.0175 - val_acc: 0.9955
Epoch 19/50
- 5s - loss: 0.0278 - acc: 0.9916 - val_loss: 0.0200 - val_acc: 0.9960
Epoch 20/50
- 5s - loss: 0.0251 - acc: 0.9925 - val_loss: 0.0225 - val_acc: 0.9945
Epoch 00020: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 21/50
- 5s - loss: 0.0228 - acc: 0.9935 - val_loss: 0.0188 - val_acc: 0.9960
Epoch 22/50
- 5s - loss: 0.0211 - acc: 0.9932 - val_loss: 0.0201 - val_acc: 0.9964
Epoch 23/50
- 5s - loss: 0.0243 - acc: 0.9926 - val_loss: 0.0184 - val_acc: 0.9964
Epoch 24/50
- 5s - loss: 0.0222 - acc: 0.9936 - val_loss: 0.0189 - val_acc: 0.9960
Epoch 25/50
- 5s - loss: 0.0217 - acc: 0.9934 - val_loss: 0.0176 - val_acc: 0.9964
Epoch 00025: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 26/50
- 6s - loss: 0.0211 - acc: 0.9935 - val_loss: 0.0170 - val_acc: 0.9967
Epoch 27/50
- 5s - loss: 0.0185 - acc: 0.9947 - val_loss: 0.0190 - val_acc: 0.9955
Epoch 28/50
- 5s - loss: 0.0197 - acc: 0.9940 - val_loss: 0.0180 - val_acc: 0.9957
Epoch 29/50
- 5s - loss: 0.0190 - acc: 0.9945 - val_loss: 0.0173 - val_acc: 0.9962
Epoch 00029: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 30/50
- 5s - loss: 0.0173 - acc: 0.9947 - val_loss: 0.0174 - val_acc: 0.9964
Epoch 31/50
- 5s - loss: 0.0176 - acc: 0.9947 - val_loss: 0.0179 - val_acc: 0.9962
Epoch 32/50
- 5s - loss: 0.0177 - acc: 0.9947 - val_loss: 0.0179 - val_acc: 0.9964
Epoch 00032: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 33/50
- 6s - loss: 0.0154 - acc: 0.9957 - val_loss: 0.0176 - val_acc: 0.9964
Epoch 34/50
- 6s - loss: 0.0159 - acc: 0.9949 - val_loss: 0.0167 - val_acc: 0.9967
Epoch 35/50
- 6s - loss: 0.0165 - acc: 0.9950 - val_loss: 0.0171 - val_acc: 0.9964
Epoch 00035: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 36/50
- 5s - loss: 0.0153 - acc: 0.9955 - val_loss: 0.0175 - val_acc: 0.9960
Epoch 37/50
- 5s - loss: 0.0179 - acc: 0.9941 - val_loss: 0.0172 - val_acc: 0.9960
Epoch 38/50
- 5s - loss: 0.0180 - acc: 0.9945 - val_loss: 0.0171 - val_acc: 0.9962
Epoch 00038: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 39/50
- 6s - loss: 0.0165 - acc: 0.9952 - val_loss: 0.0173 - val_acc: 0.9962
Epoch 40/50
- 5s - loss: 0.0148 - acc: 0.9957 - val_loss: 0.0175 - val_acc: 0.9962
Epoch 41/50
- 5s - loss: 0.0181 - acc: 0.9947 - val_loss: 0.0175 - val_acc: 0.9962
Epoch 42/50
- 5s - loss: 0.0165 - acc: 0.9947 - val_loss: 0.0174 - val_acc: 0.9962
Epoch 43/50
- 5s - loss: 0.0160 - acc: 0.9949 - val_loss: 0.0177 - val_acc: 0.9962
Epoch 44/50
- 5s - loss: 0.0157 - acc: 0.9953 - val_loss: 0.0176 - val_acc: 0.9962
Epoch 45/50
- 5s - loss: 0.0171 - acc: 0.9949 - val_loss: 0.0178 - val_acc: 0.9962
Epoch 46/50
- 6s - loss: 0.0167 - acc: 0.9951 - val_loss: 0.0177 - val_acc: 0.9960
Epoch 47/50
- 5s - loss: 0.0162 - acc: 0.9950 - val_loss: 0.0177 - val_acc: 0.9962
Epoch 48/50
- 5s - loss: 0.0176 - acc: 0.9952 - val_loss: 0.0176 - val_acc: 0.9962
Epoch 49/50
- 5s - loss: 0.0169 - acc: 0.9948 - val_loss: 0.0176 - val_acc: 0.9962
Epoch 50/50
- 5s - loss: 0.0164 - acc: 0.9951 - val_loss: 0.0176 - val_acc: 0.9962
4. Evaluating the Model
4.1 Training and Validation Curves
# Plot the loss and accuracy curves for training and validation
fig, ax = plt.subplots(2,1)
ax[0].plot(history.history['loss'], color='b', label="Training loss")
ax[0].plot(history.history['val_loss'], color='r', label="Validation loss")
legend = ax[0].legend(loc='best', shadow=True)
ax[1].plot(history.history['acc'], color='b', label="Training accuracy")
ax[1].plot(history.history['val_acc'], color='r',label="Validation accuracy")
legend = ax[1].legend(loc='best', shadow=True)
4.2 Confusion Matrix
# Look at confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert predicted probabilities to classes
Y_pred_classes = np.argmax(Y_pred, axis=1)
# Convert one-hot validation labels back to classes
Y_true = np.argmax(Y_val, axis=1)
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10))
Inspect the misclassified samples and plot them.
errors = (Y_pred_classes - Y_true != 0)
Y_pred[errors].shape
(16, 10)
# Display some error results
# Errors are differences between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)
Y_pred_classes_errors = Y_pred_classes[errors]  # predicted class (single value)
Y_pred_errors = Y_pred[errors]  # predicted probabilities
Y_true_errors = Y_true[errors]  # true class (single value)
X_val_errors = X_val[errors]  # image data
def display_errors(errors_index, img_errors, pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows, ncols, sharex=True, sharey=True, figsize=(16, 8))
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row, col].imshow((img_errors[error]).reshape((28, 28)))
            ax[row, col].set_title("Predicted label :{}\nTrue label :{}".format(
                pred_errors[error], obs_errors[error]))
            n += 1
# Probabilities of the wrongly predicted labels
Y_pred_errors_prob = np.max(Y_pred_errors, axis=1)
# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))
# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors
# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)
# Top 6 errors
most_important_errors = sorted_delta_errors[-6:]
# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
most_important_errors
array([ 3, 9, 6, 15, 2, 13], dtype=int64)
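The np.take / np.diagonal step above extracts, for each misclassified sample, the probability the model assigned to its true class; a tiny sketch with hypothetical values:

```python
import numpy as np

# Hypothetical predicted probabilities for two misclassified samples
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])
true_labels = np.array([2, 1])  # their true classes

# np.take gathers columns 2 and 1 for every row; taking the diagonal
# keeps row i paired with its own true label
true_probs = np.diagonal(np.take(probs, true_labels, axis=1))
print(true_probs)  # [0.2 0.3]
```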
5. Prediction and Submission
5.1 Predicting and Submitting Results
# predict results
results = model.predict(test)
# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)
submission.to_csv("cnn_mnist4.csv",index=False)
The CNN architecture follows a Kaggle kernel, with some minor changes to the content and structure.