Kaggle Starter: Digit Recognition with a CNN

Digit Recognizer

Gang Ma

9/10/2018
  • 1. Introduction
  • 2. Data preprocessing
    • 2.1 Load the data
    • 2.2 Check for null and missing values
    • 2.3 Standardization and normalization
    • 2.4 Reshape the data
    • 2.5 Encode labels as one-hot vectors
    • 2.6 Split training and validation sets
  • 3. CNN
    • 3.1 Define the model
    • 3.2 Define the optimizer and learning-rate annealer
    • 3.3 Data augmentation
  • 4. Evaluate the model
    • 4.1 Training and validation curves
    • 4.2 Confusion matrix
  • 5. Predict and submit
    • 5.1 Predict and submit the results

1. Introduction

A 7-layer CNN built with Keras (TensorFlow backend). The workflow:

  • Data preprocessing
  • CNN model and training
  • Predict and submit the results

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

np.random.seed(2)

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop,Nadam
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau

sns.set(style='white', context='notebook', palette='deep')

2. Data preprocessing

2.1 Load the data

# Load the data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
Y_train = train["label"]

# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1) 

# free some space
del train 

g = sns.countplot(Y_train)

Y_train.value_counts()
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

There are 10 classes in total, and the dataset is balanced (each class has roughly the same number of samples).

2.2 Check for null and missing values

# Check the data
X_train.isnull().any().describe()
count       784
unique        1
top       False
freq        784
dtype: object
test.isnull().any().describe()
count       784
unique        1
top       False
freq        784
dtype: object

There are no missing values, so no rows need to be dropped or imputed; the data can be used as-is.

2.3 Standardization and normalization

Grayscale normalization rescales the pixel values from [0, 255] to [0, 1].

# Normalize the data
X_train = X_train / 255.0
test = test / 255.0
# alternatively, scale to [-1, 1]:
# X_train = X_train / 127.5 - 1.
# test = test / 127.5 - 1.
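Both scalings can be checked on a toy pixel array (a minimal numpy sketch; `pixels` is a made-up example, not the competition data):

```python
import numpy as np

# Toy 8-bit pixel values spanning the full 0..255 range
pixels = np.array([0, 64, 128, 255], dtype=np.float64)

# [0, 1] scaling, as used above
scaled01 = pixels / 255.0

# [-1, 1] scaling, the commented-out alternative
scaled11 = pixels / 127.5 - 1.0

print(scaled01.min(), scaled01.max())  # 0.0 1.0
print(scaled11.min(), scaled11.max())  # -1.0 1.0
```

Either range works; what matters is that the inputs are small and centered consistently between training and test data.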

2.4 Reshape the data

# Reshape image in 3 dimensions (height = 28px, width = 28px , canal = 1)
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

The training and test images are 28 × 28, stored by pandas as flat vectors of length 784. Reshape them to the image format [None, 28, 28, 1], where the last dimension is the channel count: 1 for grayscale, 3 for RGB.
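The flat-vector-to-image reshape can be illustrated on random data of the same shape (a sketch; the 5-row array stands in for a small `DataFrame.values`):

```python
import numpy as np

# 5 fake flattened images, each a length-784 row as stored in the CSV
flat = np.random.rand(5, 784)

# -1 lets numpy infer the sample count; (28, 28, 1) is height x width x channel
images = flat.reshape(-1, 28, 28, 1)

print(images.shape)  # (5, 28, 28, 1)
```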

2.5 Encode labels as one-hot vectors

# Encode labels to one hot vectors (ex : 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)

The labels range from 0 to 9 and must be one-hot encoded. If the labels ran from 1 to 10 instead, they would first need 1 subtracted before encoding.
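What to_categorical produces can be reproduced with a plain numpy identity-matrix lookup (a sketch of the encoding, not the Keras implementation):

```python
import numpy as np

labels = np.array([2, 0, 9])   # example class indices in [0, 9]
one_hot = np.eye(10)[labels]   # row i of eye(10) is the one-hot vector for class i

print(one_hot[0])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]

# Labels in 1..10 would be shifted down first:
labels_1_to_10 = np.array([1, 10])
shifted = np.eye(10)[labels_1_to_10 - 1]
```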

2.6 Split training and validation sets

# Set the random seed
random_seed = 2
# Split the train and the validation set 
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)

The validation set is 10% of the 42,000 samples. Since the classes are balanced, the split does not need to preserve per-class proportions.
For imbalanced data, train_test_split accepts a stratify argument (pass the label array, e.g. stratify=Y_train) to produce a stratified split.
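A stratified split on imbalanced labels looks like this (a sketch on synthetic data; the names and class counts are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 90 samples of class 0, 10 of class 1
X = np.random.rand(100, 4)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=2)

# The 9:1 class ratio is preserved in both splits
print(np.bincount(y_tr))  # [81  9]
print(np.bincount(y_va))  # [9 1]
```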

Some examples after the split:

X_train.shape
(37800, 28, 28, 1)
# Some examples
g = plt.imshow(X_train[2][:,:,0])

3. CNN

3.1 Define the model

# Set the CNN model 
# my CNN architechture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out

model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (28,28,1)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Conv2D(filters = 128, kernel_size = (2,2),padding = 'Same', 
                 activation ='relu'))
model.add(Conv2D(filters = 128, kernel_size = (2,2),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_11 (Conv2D)           (None, 28, 28, 32)        832       
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 28, 28, 32)        25632     
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_8 (Dropout)          (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 14, 14, 64)        18496     
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 14, 14, 64)        36928     
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
dropout_9 (Dropout)          (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 7, 7, 128)         32896     
_________________________________________________________________
conv2d_16 (Conv2D)           (None, 7, 7, 128)         65664     
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 3, 3, 128)         0         
_________________________________________________________________
dropout_10 (Dropout)         (None, 3, 3, 128)         0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 1152)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               295168    
_________________________________________________________________
dropout_11 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 10)                2570      
=================================================================
Total params: 478,186
Trainable params: 478,186
Non-trainable params: 0
_________________________________________________________________

3.2 Define the optimizer and learning-rate annealer

Once the model is defined, attach an optimizer, a loss function, and an evaluation metric:

  • categorical cross-entropy loss
  • the Nadam optimizer
  • the accuracy metric

# Another optimizer
# from keras.optimizers import Nadam
optimizer = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08, schedule_decay=0.004)

# Define the optimizer
# optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
# Compile the model
model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])

To tune the learning rate during training, use the Keras ReduceLROnPlateau callback, which automatically lowers the learning rate when the monitored metric stops improving.

# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)
epochs = 50  # turning epochs down to 30 already gets ~0.9967 accuracy
batch_size = 128
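With factor=0.5 and min_lr=1e-5, each plateau halves the learning rate down to the floor; the sequence the callback steps through from Nadam's initial 0.002 can be traced in plain Python (a sketch of the arithmetic, not Keras internals):

```python
lr = 0.002
schedule = [lr]
while lr > 1e-5:
    lr = max(lr * 0.5, 1e-5)   # factor=0.5, clipped at min_lr
    schedule.append(lr)

print(schedule)
# [0.002, 0.001, 0.0005, 0.00025, 0.000125, 6.25e-05, 3.125e-05, 1.5625e-05, 1e-05]
```

These are exactly the values ReduceLROnPlateau reports in the training log further down.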

3.3 Data augmentation

For image data, augmentation (rotation, shifting, and so on) expands the training set, which helps prevent overfitting and improves generalization.

# Without data augmentation i obtained an accuracy of 0.98114
#history = model.fit(X_train, Y_train, batch_size = batch_size, epochs = epochs, 
#          validation_data = (X_val, Y_val), verbose = 2)
# With data augmentation to prevent overfitting (accuracy 0.99286)

datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images


datagen.fit(X_train)

The augmentation applies four random transforms:

  • rotate images by up to 10 degrees
  • zoom by up to 10%
  • shift horizontally by up to 10% of the width
  • shift vertically by up to 10% of the height
Compared with model.fit, model.fit_generator consumes batches from a generator, which saves memory.

# Fit the model
# ImageDataGenerator().flow yields an iterable stream of augmented batches
history = model.fit_generator(datagen.flow(X_train,Y_train, batch_size=batch_size),
                              epochs = epochs, validation_data = (X_val,Y_val),
                              verbose = 2, steps_per_epoch=X_train.shape[0] // batch_size
                              , callbacks=[learning_rate_reduction])
Epoch 1/50
 - 7s - loss: 0.6120 - acc: 0.8000 - val_loss: 0.0753 - val_acc: 0.9783
Epoch 2/50
 - 6s - loss: 0.1312 - acc: 0.9617 - val_loss: 0.0382 - val_acc: 0.9869
Epoch 3/50
 - 6s - loss: 0.1047 - acc: 0.9694 - val_loss: 0.0393 - val_acc: 0.9867
Epoch 4/50
 - 6s - loss: 0.0852 - acc: 0.9744 - val_loss: 0.0353 - val_acc: 0.9890
Epoch 5/50
 - 5s - loss: 0.0749 - acc: 0.9778 - val_loss: 0.0324 - val_acc: 0.9900
Epoch 6/50
 - 6s - loss: 0.0708 - acc: 0.9797 - val_loss: 0.0246 - val_acc: 0.9917
Epoch 7/50
 - 6s - loss: 0.0667 - acc: 0.9800 - val_loss: 0.0349 - val_acc: 0.9902
Epoch 8/50
 - 5s - loss: 0.0648 - acc: 0.9813 - val_loss: 0.0228 - val_acc: 0.9929
Epoch 9/50
 - 6s - loss: 0.0591 - acc: 0.9837 - val_loss: 0.0223 - val_acc: 0.9933
Epoch 10/50
 - 5s - loss: 0.0567 - acc: 0.9835 - val_loss: 0.0226 - val_acc: 0.9943
Epoch 11/50
 - 5s - loss: 0.0548 - acc: 0.9847 - val_loss: 0.0290 - val_acc: 0.9910
Epoch 12/50
 - 5s - loss: 0.0533 - acc: 0.9842 - val_loss: 0.0333 - val_acc: 0.9917
Epoch 13/50
 - 6s - loss: 0.0543 - acc: 0.9835 - val_loss: 0.0235 - val_acc: 0.9940

Epoch 00013: ReduceLROnPlateau reducing learning rate to 0.0010000000474974513.
Epoch 14/50
 - 6s - loss: 0.0418 - acc: 0.9881 - val_loss: 0.0182 - val_acc: 0.9940
Epoch 15/50
 - 6s - loss: 0.0360 - acc: 0.9897 - val_loss: 0.0221 - val_acc: 0.9940
Epoch 16/50
 - 6s - loss: 0.0369 - acc: 0.9890 - val_loss: 0.0308 - val_acc: 0.9936

Epoch 00016: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 17/50
 - 5s - loss: 0.0294 - acc: 0.9918 - val_loss: 0.0172 - val_acc: 0.9962
Epoch 18/50
 - 6s - loss: 0.0266 - acc: 0.9920 - val_loss: 0.0175 - val_acc: 0.9955
Epoch 19/50
 - 5s - loss: 0.0278 - acc: 0.9916 - val_loss: 0.0200 - val_acc: 0.9960
Epoch 20/50
 - 5s - loss: 0.0251 - acc: 0.9925 - val_loss: 0.0225 - val_acc: 0.9945

Epoch 00020: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 21/50
 - 5s - loss: 0.0228 - acc: 0.9935 - val_loss: 0.0188 - val_acc: 0.9960
Epoch 22/50
 - 5s - loss: 0.0211 - acc: 0.9932 - val_loss: 0.0201 - val_acc: 0.9964
Epoch 23/50
 - 5s - loss: 0.0243 - acc: 0.9926 - val_loss: 0.0184 - val_acc: 0.9964
Epoch 24/50
 - 5s - loss: 0.0222 - acc: 0.9936 - val_loss: 0.0189 - val_acc: 0.9960
Epoch 25/50
 - 5s - loss: 0.0217 - acc: 0.9934 - val_loss: 0.0176 - val_acc: 0.9964

Epoch 00025: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 26/50
 - 6s - loss: 0.0211 - acc: 0.9935 - val_loss: 0.0170 - val_acc: 0.9967
Epoch 27/50
 - 5s - loss: 0.0185 - acc: 0.9947 - val_loss: 0.0190 - val_acc: 0.9955
Epoch 28/50
 - 5s - loss: 0.0197 - acc: 0.9940 - val_loss: 0.0180 - val_acc: 0.9957
Epoch 29/50
 - 5s - loss: 0.0190 - acc: 0.9945 - val_loss: 0.0173 - val_acc: 0.9962

Epoch 00029: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 30/50
 - 5s - loss: 0.0173 - acc: 0.9947 - val_loss: 0.0174 - val_acc: 0.9964
Epoch 31/50
 - 5s - loss: 0.0176 - acc: 0.9947 - val_loss: 0.0179 - val_acc: 0.9962
Epoch 32/50
 - 5s - loss: 0.0177 - acc: 0.9947 - val_loss: 0.0179 - val_acc: 0.9964

Epoch 00032: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.
Epoch 33/50
 - 6s - loss: 0.0154 - acc: 0.9957 - val_loss: 0.0176 - val_acc: 0.9964
Epoch 34/50
 - 6s - loss: 0.0159 - acc: 0.9949 - val_loss: 0.0167 - val_acc: 0.9967
Epoch 35/50
 - 6s - loss: 0.0165 - acc: 0.9950 - val_loss: 0.0171 - val_acc: 0.9964

Epoch 00035: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.
Epoch 36/50
 - 5s - loss: 0.0153 - acc: 0.9955 - val_loss: 0.0175 - val_acc: 0.9960
Epoch 37/50
 - 5s - loss: 0.0179 - acc: 0.9941 - val_loss: 0.0172 - val_acc: 0.9960
Epoch 38/50
 - 5s - loss: 0.0180 - acc: 0.9945 - val_loss: 0.0171 - val_acc: 0.9962

Epoch 00038: ReduceLROnPlateau reducing learning rate to 1e-05.
Epoch 39/50
 - 6s - loss: 0.0165 - acc: 0.9952 - val_loss: 0.0173 - val_acc: 0.9962
Epoch 40/50
 - 5s - loss: 0.0148 - acc: 0.9957 - val_loss: 0.0175 - val_acc: 0.9962
Epoch 41/50
 - 5s - loss: 0.0181 - acc: 0.9947 - val_loss: 0.0175 - val_acc: 0.9962
Epoch 42/50
 - 5s - loss: 0.0165 - acc: 0.9947 - val_loss: 0.0174 - val_acc: 0.9962
Epoch 43/50
 - 5s - loss: 0.0160 - acc: 0.9949 - val_loss: 0.0177 - val_acc: 0.9962
Epoch 44/50
 - 5s - loss: 0.0157 - acc: 0.9953 - val_loss: 0.0176 - val_acc: 0.9962
Epoch 45/50
 - 5s - loss: 0.0171 - acc: 0.9949 - val_loss: 0.0178 - val_acc: 0.9962
Epoch 46/50
 - 6s - loss: 0.0167 - acc: 0.9951 - val_loss: 0.0177 - val_acc: 0.9960
Epoch 47/50
 - 5s - loss: 0.0162 - acc: 0.9950 - val_loss: 0.0177 - val_acc: 0.9962
Epoch 48/50
 - 5s - loss: 0.0176 - acc: 0.9952 - val_loss: 0.0176 - val_acc: 0.9962
Epoch 49/50
 - 5s - loss: 0.0169 - acc: 0.9948 - val_loss: 0.0176 - val_acc: 0.9962
Epoch 50/50
 - 5s - loss: 0.0164 - acc: 0.9951 - val_loss: 0.0176 - val_acc: 0.9962

4. Evaluate the model

4.1 Training and validation curves

# Plot the loss and accuracy curves for training and validation 
fig, ax = plt.subplots(2,1)
ax[0].plot(history.history['loss'], color='b', label="Training loss")
ax[0].plot(history.history['val_loss'], color='r', label="validation loss")
legend = ax[0].legend(loc='best', shadow=True)

ax[1].plot(history.history['acc'], color='b', label="Training accuracy")
ax[1].plot(history.history['val_acc'], color='r',label="Validation accuracy")
legend = ax[1].legend(loc='best', shadow=True)

[Figure: training and validation loss/accuracy curves]

4.2 Confusion matrix

# Look at confusion matrix 

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert predicted probabilities to classes 
Y_pred_classes = np.argmax(Y_pred,axis = 1) 
# Convert one-hot validation labels back to class indices
Y_true = np.argmax(Y_val,axis = 1) 
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10)) 

[Figure: confusion matrix on the validation set]

Inspect the misclassified samples and how the predictions relate to the true labels:

errors = (Y_pred_classes - Y_true != 0)
Y_pred[errors].shape
(16, 10)
# Display some error results 

# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)

Y_pred_classes_errors = Y_pred_classes[errors] # predicted class indices
Y_pred_errors = Y_pred[errors]                 # predicted probability vectors
Y_true_errors = Y_true[errors]                 # true class indices
X_val_errors = X_val[errors]                   # image data

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True,figsize=(16, 8))
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Maximum predicted probability for each misclassified sample
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_dela_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors 
most_important_errors = sorted_dela_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)

[Figure: the 6 misclassified examples with the largest confidence gap]

most_important_errors
array([ 3,  9,  6, 15,  2, 13], dtype=int64)
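The np.take / np.diagonal combination used above to pull out each sample's true-class probability can be verified on a toy array (a minimal sketch):

```python
import numpy as np

# 2 misclassified samples, 3 classes: rows are predicted probability vectors
probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2]])
true_labels = np.array([2, 0])

# np.take(probs, true_labels, axis=1) gathers the true-label columns for every
# row; the diagonal keeps only row i paired with true_labels[i]
true_probs = np.diagonal(np.take(probs, true_labels, axis=1))
print(true_probs)  # [0.2 0.5]

# Equivalent, and cheaper for large arrays: direct fancy indexing
assert (true_probs == probs[np.arange(len(true_labels)), true_labels]).all()
```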

5. Predict and submit

5.1 Predict and submit the results

# predict results
results = model.predict(test)

# select the index with the maximum probability
results = np.argmax(results,axis = 1)

results = pd.Series(results,name="Label")
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

submission.to_csv("cnn_mnist4.csv",index=False)

The CNN architecture follows a Kaggle kernel, with minor changes to the structure and code.
