Getting Started with Kaggle: Handwritten Digit Recognition

0. Preface

  • Competition description
    MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision. Since its release in 1999, this classic dataset of handwritten digit images has served as the basis for benchmarking classification algorithms. Even as new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
    In this competition, the goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. Kaggle has curated a set of tutorial-style kernels covering everything from regression to neural networks, and encourages you to experiment with different algorithms to learn first-hand which methods work well and how techniques compare.

  • Practice skills
    Computer vision fundamentals, including simple neural networks
    Classification methods such as SVM and K-nearest neighbors

  • Competition page: https://www.kaggle.com/c/digit-recognizer/overview

1. Packages Used

import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")

2. Importing the Data

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
  • View basic information
train.info()

[figure: train.info() output]

test.info()

[figure: test.info() output]

  • Check for missing values
print(train.isnull().any().describe())
print()
print(test.isnull().any().describe())

[figure: missing-value check output]
There are no missing values: isnull().any() is False for every column, so describe() reports a single unique value.

  • Check the data dimensions
print(train.shape)
print(test.shape)
print(train.head())

[figure: shape and head() output]
The training-set labels are in the first column ('label').

3. Feature Preprocessing

  • Separate the feature columns and the label column in the training set
X = train.iloc[:,1:]
y = train.iloc[:,0]
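  • As a quick sanity check (not in the original post), the ten digit classes in the label column should be roughly balanced:
# distribution of the ten digit classes in the training labels
print(y.value_counts().sort_index())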
  • Display the first 10 digits in the training set
plt.figure(figsize = (10,5))

for num in range(0,10):
    plt.subplot(2,5,num+1)
    # reshape the length-784 vector into a 28x28 matrix
    grid_data = X.iloc[num].values.reshape(28,28)
    # display the image in grayscale
    plt.imshow(grid_data, interpolation = "none", cmap = "Greys")

[figure: the first 10 training digits]

  • Feature preprocessing: scale the feature values into a common range
scaler = MinMaxScaler().fit(X)   # fit the scaler on the training features only
X = scaler.transform(X)
print(X)
test = scaler.transform(test)    # apply the same scaling to the test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 14)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
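
MinMaxScaler learns a per-pixel minimum and maximum from whatever data it is fitted on. Since MNIST pixel intensities always lie in 0-255, a common and simpler alternative (shown only as an aside, with hypothetical variable names X_alt and test_alt) is to divide by 255 directly:
# alternative scaling: map raw 0-255 pixel intensities straight into [0, 1]
X_alt = train.iloc[:, 1:].values / 255.0
test_alt = pd.read_csv('test.csv').values / 255.0
print(X_alt.min(), X_alt.max())   # 0.0 1.0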

4. Principal Component Analysis

The training set has 784 feature columns, so we first reduce its dimensionality.
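
When n_components is given to PCA as a float between 0 and 1, it keeps the smallest number of leading components whose cumulative explained variance reaches that fraction, which is why the values tried below run from 0.70 to 0.89 rather than being integer component counts. A short illustration (not in the original post), using the scaled X_train from the split above:
# fit a full PCA once and look at the cumulative explained variance of the leading components
pca_full = PCA().fit(X_train)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
# number of components needed to reach 78% of the variance
# (compare with section 5, where PCA(n_components=0.78) keeps 39 components)
print(np.argmax(cumvar >= 0.78) + 1)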

  • Use PCA to reduce the dimensionality
all_scores = [] # for the plot below
# generate the list of n_components values to try
n_components = np.linspace(0.7,0.9,num=20, endpoint=False)
print(n_components)
def get_accuracy_score(n, X_train, X_test, y_train, y_test):
    '''Return the SVM's accuracy when PCA retains a fraction n of the explained variance'''
    t0 = time()
    pca = PCA(n_components = n)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    # use a support vector machine classifier
    clf = svm.SVC()
    clf.fit(X_train_pca, y_train)
    # compute the accuracy on the held-out split
    accuracy = clf.score(X_test_pca, y_test)
    t1 = time()
    print('n_components:{:.2f} , accuracy:{:.4f} , time elaps:{:.2f}s'.format(n, accuracy, t1-t0))
    return accuracy 

for n in n_components:
    score = get_accuracy_score(n,X_train, X_test, y_train, y_test)
    all_scores.append(score)  
[0.7  0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8  0.81 0.82 0.83
 0.84 0.85 0.86 0.87 0.88 0.89]
n_components:0.70 , accuracy:0.9750 , time elaps:26.44s
n_components:0.71 , accuracy:0.9757 , time elaps:26.35s
n_components:0.72 , accuracy:0.9769 , time elaps:27.06s
n_components:0.73 , accuracy:0.9760 , time elaps:27.49s
n_components:0.74 , accuracy:0.9776 , time elaps:27.72s
n_components:0.75 , accuracy:0.9781 , time elaps:28.63s
n_components:0.76 , accuracy:0.9781 , time elaps:29.46s
n_components:0.77 , accuracy:0.9781 , time elaps:30.22s
n_components:0.78 , accuracy:0.9783 , time elaps:31.08s
n_components:0.79 , accuracy:0.9776 , time elaps:33.13s
n_components:0.80 , accuracy:0.9779 , time elaps:35.57s
n_components:0.81 , accuracy:0.9771 , time elaps:36.11s
n_components:0.82 , accuracy:0.9774 , time elaps:36.17s
n_components:0.83 , accuracy:0.9769 , time elaps:36.98s
n_components:0.84 , accuracy:0.9755 , time elaps:38.15s
n_components:0.85 , accuracy:0.9748 , time elaps:39.21s
n_components:0.86 , accuracy:0.9748 , time elaps:40.42s
n_components:0.87 , accuracy:0.9729 , time elaps:42.43s
n_components:0.88 , accuracy:0.9721 , time elaps:44.55s
n_components:0.89 , accuracy:0.9717 , time elaps:47.55s
  • Plot accuracy against n_components
plt.plot(n_components, all_scores, '-o')
plt.xlabel('n_components')
plt.ylabel('accuracy')
plt.show()

[figure: accuracy vs. n_components]
Accuracy peaks when n_components is 0.78.

  • Baseline SVM model
# find the misclassified samples
pca = PCA(n_components = 0.78)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

clf = svm.SVC()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)

errors = (y_pred != y_test)
y_pred_errors = y_pred[errors]         # wrongly predicted labels
y_test_errors = y_test[errors].values  # corresponding true labels
X_test_errors = X_test[errors]         # corresponding feature rows


# inspect the misclassified samples
print(y_pred_errors[:5]) # predicted labels
print(y_test_errors[:5]) # true labels
print(X_test_errors[:5]) # feature values
[5 0 8 6 9]  
[8 9 6 8 7]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
  • Visualize the misclassified digits
n = 0
nrows = 2
ncols = 5

fig, ax = plt.subplots(nrows,ncols,figsize=(10,6))

for row in range(nrows):
    for col in range(ncols):
        ax[row,col].imshow((X_test_errors[n]).reshape((28,28)), cmap = "Greys")
        ax[row,col].set_title("Predict:{}\nTrue: {}".format(y_pred_errors[n],y_test_errors[n]))
        n += 1

[figure: ten misclassified digits with predicted and true labels]
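
Section 0 also lists K-nearest neighbors among the classification methods worth trying. For comparison with the baseline SVM above, here is a minimal sketch (not part of the original post) on the same PCA-reduced split:
# hypothetical KNN baseline on the same PCA-reduced split, for comparison with the SVM above
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
print('KNN accuracy:', knn.score(X_test_pca, y_test))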

5. Modeling

pca = PCA(n_components=0.78)  # accuracy was highest at n_components = 0.78
pca.fit(X)
print(pca.n_components_)
# apply the PCA transform to both the training and test sets
X = pca.transform(X)
test = pca.transform(test)
39

PCA reduces both the training and test sets to 39 principal components.
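
As a quick check (not in the original post), the variance actually retained by the fitted PCA should be at least the requested 0.78:
# the 39 retained components should together explain at least 78% of the variance
print(pca.explained_variance_ratio_.sum())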

  • Hyperparameter tuning
# use an SVM for prediction and tune its hyperparameters with grid search

clf_svc = GridSearchCV(estimator=svm.SVC(),
                       param_grid={'C': [1, 2, 4, 5], 'kernel': ['linear', 'rbf', 'sigmoid']},
                       cv=5, verbose=2)

clf_svc.fit(X, y)  # run the grid search
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] C=1, kernel=linear ..............................................

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[CV] ............................... C=1, kernel=linear, total=  27.5s
[CV] C=1, kernel=linear ..............................................

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   27.4s remaining:    0.0s

[CV] ............................... C=1, kernel=linear, total=  27.7s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=  25.9s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=  26.8s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=  27.5s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total=  24.5s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total=  23.8s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total=  23.7s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total=  23.6s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total=  23.5s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total=  29.4s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total=  29.6s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total=  30.4s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total=  29.5s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total=  29.4s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total=  36.7s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total=  36.6s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total=  35.7s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total=  37.1s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total=  37.6s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total=  23.1s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total=  23.5s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total=  23.1s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total=  22.5s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total=  22.7s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total=  27.3s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total=  27.0s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total=  27.3s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total=  27.3s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total=  27.2s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total=  53.3s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total=  52.7s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total=  52.0s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total=  54.8s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total=  53.2s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total=  20.9s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total=  21.0s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total=  20.8s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total=  22.2s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total=  22.2s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total=  26.3s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total=  26.4s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total=  25.9s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total=  26.6s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total=  27.3s
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.3min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total=  23.5s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total=  23.4s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total=  23.4s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total=  23.9s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total=  23.8s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total=  29.1s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total=  28.3s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total=  30.1s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total=  33.0s
[CV] C=5, kernel=sigmoid .............................................

[CV] .............................. C=5, kernel=sigmoid, total=  32.3s

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 32.6min finished



# show the parameter combination with the highest accuracy
print(clf_svc.best_params_)
{'C': 5, 'kernel': 'rbf'}
# show the best cross-validated accuracy
print(clf_svc.best_score_)
0.9828809523809524
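
The log above shows the whole search took about 33 minutes with a single worker (n_jobs=1, the default). As an aside (not part of the original run), the same search could be spread across all CPU cores; GridSearchCV also refits the best parameter combination on the full X and y by default, and that refitted model is what clf_svc.predict uses below:
# hypothetical parallel variant of the same grid search; refit=True (the default)
# retrains the best (C, kernel) combination on all of X, y for later predictions
clf_svc = GridSearchCV(estimator=svm.SVC(),
                       param_grid={'C': [1, 2, 4, 5], 'kernel': ['linear', 'rbf', 'sigmoid']},
                       cv=5, n_jobs=-1, verbose=2)
clf_svc.fit(X, y)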
  • Prediction
preds = clf_svc.predict(test)
image_id = pd.Series(range(1,len(preds)+1))
result_2 = pd.DataFrame({'ImageId': image_id,'Label':preds})
# save the predictions as a CSV file
result_2.to_csv('result_svc.csv',index = False)
print('Over')
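
Before uploading, it can be worth confirming that the file matches the layout the competition expects: an ImageId column counting from 1 and a Label column, one row per test image (28000 for this competition). A small check, not in the original post:
# quick sanity check of the submission file before uploading to Kaggle
submission = pd.read_csv('result_svc.csv')
print(submission.shape)   # (28000, 2)
print(submission.head())  # columns: ImageId, Label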

6. Submitting the Results

[figure: Kaggle submission score]
