Class Notes - Machine Learning (2) - Handwritten Digit Recognition

Lab 4: Handwritten Digit Recognition

Built-in dataset

# Load the data
from sklearn.datasets import load_digits
digits = load_digits()
print(digits)
{'data': array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ..., 
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]]), 'target': array([0, 1, 2, ..., 8, 9, 8]), 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'images': array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
        [  0.,   0.,  13., ...,  15.,   5.,   0.],
        [  0.,   3.,  15., ...,  11.,   8.,   0.],
        ..., 
        [  0.,   4.,  11., ...,  12.,   7.,   0.],
        [  0.,   2.,  14., ...,  12.,   0.,   0.],
        [  0.,   0.,   6., ...,   0.,   0.,   0.]],

       [[  0.,   0.,   0., ...,   5.,   0.,   0.],
        [  0.,   0.,   0., ...,   9.,   0.,   0.],
        [  0.,   0.,   3., ...,   6.,   0.,   0.],
        ..., 
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   0., ...,  10.,   0.,   0.]],

       [[  0.,   0.,   0., ...,  12.,   0.,   0.],
        [  0.,   0.,   3., ...,  14.,   0.,   0.],
        [  0.,   0.,   8., ...,  16.,   0.,   0.],
        ..., 
        [  0.,   9.,  16., ...,   0.,   0.,   0.],
        [  0.,   3.,  13., ...,  11.,   5.,   0.],
        [  0.,   0.,   0., ...,  16.,   9.,   0.]],

       ..., 
       [[  0.,   0.,   1., ...,   1.,   0.,   0.],
        [  0.,   0.,  13., ...,   2.,   1.,   0.],
        [  0.,   0.,  16., ...,  16.,   5.,   0.],
        ..., 
        [  0.,   0.,  16., ...,  15.,   0.,   0.],
        [  0.,   0.,  15., ...,  16.,   0.,   0.],
        [  0.,   0.,   2., ...,   6.,   0.,   0.]],

       [[  0.,   0.,   2., ...,   0.,   0.,   0.],
        [  0.,   0.,  14., ...,  15.,   1.,   0.],
        [  0.,   4.,  16., ...,  16.,   7.,   0.],
        ..., 
        [  0.,   0.,   0., ...,  16.,   2.,   0.],
        [  0.,   0.,   4., ...,  16.,   2.,   0.],
        [  0.,   0.,   5., ...,  12.,   0.,   0.]],

       [[  0.,   0.,  10., ...,   1.,   0.,   0.],
        [  0.,   2.,  16., ...,   1.,   0.,   0.],
        [  0.,   0.,  15., ...,  15.,   0.,   0.],
        ..., 
        [  0.,   4.,  16., ...,  16.,   6.,   0.],
        [  0.,   8.,  16., ...,  16.,   8.,   0.],
        [  0.,   1.,   8., ...,  12.,   1.,   0.]]]), 'DESCR': "Optical Recognition of Handwritten Digits Data Set\n===================================================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\nReferences\n----------\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. 
School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000.\n"}
# Get data, target and images separately
X, y, images = digits["data"], digits["target"], digits["images"]
print(X.shape, y.shape, images.shape)  # 1797 images, each 8x8 pixels
(1797, 64) (1797,) (1797, 8, 8)
# Inspect the data of a single image
print(images[100])
print(X[100].reshape(8,8))  # X is simply images with each 8x8 image flattened into a row of 64 values
[[  0.   0.   0.   2.  13.   0.   0.   0.]
 [  0.   0.   0.   8.  15.   0.   0.   0.]
 [  0.   0.   5.  16.   5.   2.   0.   0.]
 [  0.   0.  15.  12.   1.  16.   4.   0.]
 [  0.   4.  16.   2.   9.  16.   8.   0.]
 [  0.   0.  10.  14.  16.  16.   4.   0.]
 [  0.   0.   0.   0.  13.   8.   0.   0.]
 [  0.   0.   0.   0.  13.   6.   0.   0.]]
[[  0.   0.   0.   2.  13.   0.   0.   0.]
 [  0.   0.   0.   8.  15.   0.   0.   0.]
 [  0.   0.   5.  16.   5.   2.   0.   0.]
 [  0.   0.  15.  12.   1.  16.   4.   0.]
 [  0.   4.  16.   2.   9.  16.   8.   0.]
 [  0.   0.  10.  14.  16.  16.   4.   0.]
 [  0.   0.   0.   0.  13.   8.   0.   0.]
 [  0.   0.   0.   0.  13.   6.   0.   0.]]
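The two printouts above are identical for image 100. The same relationship can be checked for the whole dataset at once. This is a minimal sketch that assumes only NumPy and the built-in dataset; the names `flattened` and `same` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, images = digits["data"], digits["images"]

# Flatten every 8x8 image into one row of 64 values and compare with X
flattened = images.reshape(len(images), -1)
same = np.array_equal(flattened, X)
print(flattened.shape, same)
```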
# Plot one row of the data as an image
%matplotlib inline
import matplotlib.pyplot as plt
#plt.imshow(images[100])
plt.imshow(X[100].reshape(8,8))
print(y[100])  # check the label -> 4


# Machine learning workflow
# Use the KNN algorithm

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)

# Create the model
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate the model
print("KNN digit recognition accuracy:", knn.score(X_test, y_test))


KNN digit recognition accuracy: 0.986111111111
# Check the prediction against one image
print(y_pred[50])
plt.imshow(X_test[50].reshape(8,8))


# Look at the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        29
          1       0.94      1.00      0.97        46
          2       1.00      1.00      1.00        37
          3       1.00      1.00      1.00        39
          4       1.00      1.00      1.00        25
          5       0.97      0.97      0.97        37
          6       0.98      1.00      0.99        41
          7       1.00      1.00      1.00        29
          8       1.00      0.95      0.98        44
          9       1.00      0.94      0.97        33

avg / total       0.99      0.99      0.99       360
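The precision and recall columns of a report like the one above can be recomputed by hand from the confusion matrix: precision divides the correct predictions for a class by everything predicted as that class, and recall divides them by everything that truly belongs to it. The sketch below assumes the same KNN setup on the built-in dataset; it rebuilds the model so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)
y_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true labels, columns: predicted labels
tp = np.diag(cm)                       # correctly classified samples per class
precision = tp / cm.sum(axis=0)        # divide by column sums (all predictions of the class)
recall = tp / cm.sum(axis=1)           # divide by row sums (all true members of the class)
print(np.round(precision, 2))
print(np.round(recall, 2))
```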
# Use a support vector machine (SVM)
from sklearn.svm import SVC  # SVC is for classification; SVR is the regression counterpart
svc = SVC(gamma=0.001, C=1)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print("SVM digit recognition accuracy:", svc.score(X_test, y_test))
# Look at the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
SVM digit recognition accuracy: 0.988888888889
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        29
          1       0.98      1.00      0.99        46
          2       1.00      1.00      1.00        37
          3       1.00      1.00      1.00        39
          4       1.00      1.00      1.00        25
          5       0.97      0.95      0.96        37
          6       0.98      1.00      0.99        41
          7       1.00      1.00      1.00        29
          8       1.00      0.98      0.99        44
          9       0.97      0.97      0.97        33

avg / total       0.99      0.99      0.99       360

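Handwritten digit dataset from Kaggle

# Load the data (the built-in dataset is a dict-like object whose data can be read directly; the CSV file has a header row)
import pandas as pd
digits = pd.read_csv("data/digit_train.csv")
print(digits.shape)  # 785 columns = 1 label + 784 pixels; sqrt(784) = 28, so each image is 28x28, a higher resolution than before
(42000, 785)
# Get the features and labels (X and y)
X = digits.iloc[:,1:].values  # drop the label column
y = digits.iloc[:,0].values
print(X.shape, y.shape)
(42000, 784) (42000,)
# Plot a digit image. With %matplotlib inline there is no need to call plt.show() every time
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(X[300].reshape(28,28))
print(y[300])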


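# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=33)
# Use the KNN classifier (KNN has no real training step: fit essentially stores or indexes the training data, and prediction computes the distance from each query point to the training points)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print("KNN digit recognition accuracy:", knn.score(X_test, y_test))
KNN digit recognition accuracy: 0.965158730159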
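To make the distance-based idea behind KNN concrete, here is a minimal brute-force sketch in NumPy. It is not how scikit-learn implements `KNeighborsClassifier` (which can use tree indexes and handles ties differently); the helper `knn_predict` and the toy arrays are hypothetical names made up for this illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=5):
    """Predict the label of one point x by majority vote among its k nearest neighbours."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbours
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]), k=3)
print(pred_a, pred_b)
```

Every prediction touches all training rows and all 784 columns, which is why the cost grows with both the number of samples and the number of dimensions.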
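# The KNN run above takes a very long time because the data is high-dimensional
# Solution: dimensionality reduction (merge redundant, highly correlated features, i.e. columns, such as date of birth and age)
from sklearn.decomposition import PCA  # PCA (principal component analysis)
pca = PCA(n_components=50, whiten=True)  # target number of dimensions: compress 784 down to 50, then split afterwards (note that fitting before the split lets the test rows influence the projection)
X_pca = pca.fit_transform(X)  # the compressed data
print(X_pca.shape)
(42000, 50)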
# Re-split the reduced data into training and test sets and train again
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=33)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print("KNN digit recognition accuracy:", knn.score(X_test, y_test))
KNN digit recognition accuracy: 0.95873015873
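# How many dimensions should we reduce to?
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X)
EV_list = pca.explained_variance_
EVR_list = []
for i in range(len(EV_list)):
    EVR_list.append(EV_list[i]/EV_list[0])  # variance of each component relative to the first; look for the sharp drop
for j in range(len(EVR_list)):
    if EVR_list[j] < 0.1:
        print("Recommended number of dimensions:", j)
        break
Recommended number of dimensions: 22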
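The rule above compares each component's variance with that of the first component. Another common heuristic, shown here only as an alternative and not as the method used in these notes, is to keep enough components to reach a chosen share of the total variance. The 95% threshold and the use of the small built-in dataset (so the example runs without the Kaggle CSV) are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)
# Running total of the fraction of variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 95%
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% of the variance:", n_95)
```

scikit-learn can apply the same idea directly: passing a float between 0 and 1 as `n_components`, for example `PCA(n_components=0.95)`, selects the number of components by explained variance.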
# Run it again with the recommended number of dimensions
from sklearn.decomposition import PCA  # PCA (principal component analysis)
pca = PCA(n_components=22, whiten=True)  # compress 784 dimensions down to 22, then split afterwards
#pca.fit(X)
#X_pca = pca.transform(X)  # these two steps can be merged into the single call below
X_pca = pca.fit_transform(X)  # the compressed data
print(X_pca.shape)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=33)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print("KNN digit recognition accuracy:", knn.score(X_test, y_test))
(42000, 22)
KNN digit recognition accuracy: 0.967222222222
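# ------ Complete handwritten digit recognition with a support vector machine
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the data
digits = pd.read_csv("data/digit_train.csv")
X = digits.iloc[:,1:].values
y = digits.iloc[:,0].values

# Reduce dimensionality (recommended number of dimensions: 22)
pca = PCA(n_components=22, whiten=True) 
X_pca = pca.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=0)

# Train with SVM
# Parameters: kernel="rbf" implicitly maps the data into a higher-dimensional space, so the decision surface can be curved;
# C is the penalty on misclassified samples: the larger it is, the harder the model tries to fit them, but too large a value overfits;
# gamma sets how far the influence of a single training sample reaches (larger gamma = narrower reach, more complex boundary)
svc = SVC(C=15)
svc.fit(X_train, y_train)
print("SVM digit recognition accuracy:", svc.score(X_test, y_test))

# Evaluate the model
y_pred = svc.predict(X_test)
# Note: the signature is classification_report(y_true, y_pred); with the arguments in the order used here,
# the precision and recall columns of the report below are swapped and support counts predicted labels
print(classification_report(y_pred, y_test))
SVM digit recognition accuracy: 0.975317460317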
             precision    recall  f1-score   support

          0       0.99      0.98      0.98      1255
          1       0.99      0.99      0.99      1424
          2       0.98      0.97      0.97      1298
          3       0.96      0.98      0.97      1281
          4       0.97      0.98      0.98      1234
          5       0.97      0.97      0.97      1117
          6       0.99      0.98      0.98      1262
          7       0.97      0.98      0.97      1323
          8       0.97      0.97      0.97      1202
          9       0.95      0.96      0.96      1204

avg / total       0.98      0.98      0.98     12600
# Inspect the model parameters
print(svc)
SVC(C=15, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
# Decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
print("Decision tree digit recognition accuracy:", dtc.score(X_test, y_test))
Decision tree digit recognition accuracy: 0.831349206349
# Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print("Random forest digit recognition accuracy:", rfc.score(X_test, y_test))
Random forest digit recognition accuracy: 0.917222222222
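# Check the models on the test data provided by Kaggle
import pandas as pd
test = pd.read_csv("data/digit_test.csv")
XX = test.values
# Caution: fitting a new PCA on the test file puts XX_pca in a different feature space from the one
# the models were trained in, which may explain why the three models disagree below.
# Reusing the PCA fitted on the training data, i.e. pca.transform(XX), keeps the spaces consistent.
# The outputs shown here come from the fit_transform call as written.
pca = PCA(n_components=22, whiten=True)
XX_pca = pca.fit_transform(XX)
yy = svc.predict(XX_pca)
print(yy[:10])
[2 0 8 2 4 3 0 9 0 7]
# Random forest predictions
yy = rfc.predict(XX_pca)
print(yy[:10])
[2 0 8 2 7 2 0 9 0 7]
# KNN
yy = knn.predict(XX_pca)
print(yy[:10])
[2 0 5 2 0 6 0 3 0 4]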
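The cell above calls `fit_transform` on the Kaggle test file, so the test rows are projected with a different PCA from the one the classifiers were trained with. One way to rule this out is to bundle PCA and the classifier in a scikit-learn `Pipeline`, which fits the PCA once on the training data and reuses it for any new data. The sketch below uses the built-in dataset in place of the Kaggle CSV files so that it runs on its own; `X_new` and `y_new` stand in for unseen data and are names introduced here.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.3, random_state=0)

# PCA is fitted on X_train only; predict/score apply that same projection to new data
model = make_pipeline(PCA(n_components=22, whiten=True), SVC(C=15))
model.fit(X_train, y_train)

acc = model.score(X_new, y_new)
print("Pipeline accuracy on unseen data:", acc)
print(model.predict(X_new[:10]))
```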
# Plot to verify
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(XX[6].reshape(28,28))


Choosing hyperparameters

  • Cross-validation (k-fold validation)
  • Grid search
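Before combining the two ideas, here is k-fold cross-validation on its own: the data is cut into k parts, each part serves once as the validation set while the model trains on the rest, and the k scores are averaged. This is a minimal sketch on the built-in dataset; the choice of KNN, 5 folds and a stratified shuffled split are assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 5 folds, shuffled, with the class proportions kept similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```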
# Built-in handwritten digit dataset
# Support vector machine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV  # grid search with cross-validation

digits = load_digits()
X, y = digits["data"], digits["target"]
svc = SVC()

# Grid search with cross-validation (to find the best parameters)
# Note: gamma has no effect when kernel="linear", so those combinations are redundant
params = {"C":[1,5,10,15,20,50,100], "gamma":[0.0001, 0.001,0.005, 0.01, 0.05, 0.1], "kernel":["rbf", "linear"]}
grid = GridSearchCV(svc, params, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_score_)
print(grid.best_params_)
0.972732331664
{'C': 5, 'gamma': 0.001, 'kernel': 'rbf'}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
svc = SVC(C=5, gamma=0.001, kernel="rbf")
svc.fit(X_train, y_train)
print("SVM digit recognition accuracy:", svc.score(X_test, y_test))
SVM digit recognition accuracy: 0.990740740741
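In the run above the grid search sees all of X, including the rows that later end up in the test split, so the final score is somewhat optimistic. A variation, sketched here under the assumption of a much smaller grid to keep it quick, is to split first, search on the training part only, and score once on the held-out part. Passing a list of dictionaries also avoids pairing `gamma` with the linear kernel, where it has no effect.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# One sub-grid per kernel, so gamma is only searched for the RBF kernel
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.0001, 0.001, 0.01]},
]
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)  # cross-validation happens inside the training set only

print(grid.best_params_)
test_acc = grid.score(X_test, y_test)  # the refitted best model, scored on unseen data
print("Held-out accuracy:", test_acc)
```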

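In-class exercise

Apply the random forest algorithm to the built-in handwritten digit dataset and use cross-validated grid search (5-fold) to find the best combination of the following parameters:

  • n_estimators: 5, 10, 15, 20
  • n_jobs: 1, 3, 5, 8
  • max_features: "auto", 0.1, 0.2, 0.3, 0.5, 0.8, 1
  • oob_score: True, False
    Then build a learner with the best parameters found and evaluate the model.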
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# # Load the data
# digits = load_digits()
# X, y = digits["data"], digits["target"]

# # # Reduce dimensionality (recommended number of dimensions: 22)
# # pca = PCA(n_components=22, whiten=True) 
# # X_pca = pca.fit_transform(X)

rfc = RandomForestClassifier()

# Grid search with cross-validation (to find the best parameters)
# Note: n_jobs only sets how many CPU cores are used, so it affects speed rather than accuracy;
# oob_score from the exercise is not searched here
params = {"n_estimators":[5,10,15,20], "n_jobs":[1,3,5,8], "max_features":["auto", 0.1, 0.2, 0.3, 0.5, 0.8, 1]}
grid = GridSearchCV(rfc, params, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_score_)
print(grid.best_params_)
0.930439621592
{'max_features': 0.1, 'n_estimators': 20, 'n_jobs': 5}
# Build a learner with the best parameters found above and evaluate the model

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Random forest
rfc = RandomForestClassifier(n_estimators=20, n_jobs=5, max_features=0.1)
rfc.fit(X_train, y_train)
print("Random forest digit recognition accuracy:", rfc.score(X_test, y_test))

Random forest digit recognition accuracy: 0.962962962963
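The solution above leaves `oob_score` out of the search. A possible variant that includes it is sketched below, with several assumptions: the grid is reduced to keep it fast, `"sqrt"` replaces `"auto"` (which recent scikit-learn releases no longer accept for `max_features`), `n_jobs` is fixed because it only affects speed, `random_state=0` is set for repeatability, and the search runs on the training split only.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

params = {
    "n_estimators": [10, 20],
    "max_features": ["sqrt", 0.1, 0.5],
    "oob_score": [True, False],  # only changes whether an out-of-bag estimate is computed
}
grid = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs=1),
                    params, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
rf_acc = grid.score(X_test, y_test)
print("Held-out accuracy:", rf_acc)
```

With few trees, scikit-learn may warn that some samples have no out-of-bag score; the search still completes. Since `oob_score` does not change how the trees are built, it is not expected to change the cross-validated accuracy for a fixed random seed.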