Machine Learning Case Notes: Handwritten Digit Recognition with SVM

Latest recommended article published on 2024-07-27 10:49:46

我是朵儿


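Tags: machine learning, svm, pca, handwritten digit recognition

This article works through four cases with different preprocessing to explore handwritten digit recognition with SVM: grayscale images, binary images, and each of these combined with PCA (principal component analysis) for dimensionality reduction. The experiments show that binarization and PCA markedly improve recognition accuracy and training efficiency. Finally, the article covers SVM parameter tuning with GridSearchCV, which further improves model performance.

Summary generated automatically by CSDN.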

1. Imports

import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt, matplotlib.image as mpimg
import time
import warnings
from sklearn import svm
from sklearn.model_selection import GridSearchCV

%matplotlib inline
warnings.filterwarnings('ignore')  # suppress warnings

%matplotlib inline
This magic command is only needed in Jupyter. With it, whenever you call a matplotlib.pyplot plotting function such as plot(), or create a figure, the image is rendered inline in the notebook output.

2. Data Processing

2.1 Reading the file

# Read the CSV data file
data = pd.read_csv('E:/Digit_Recognizer/train.csv')
print("Train Data Shape is: ",data.shape)

Train Data Shape is: (42000, 785)

label = data.label                  # take the label column of data and call it label
data = data.drop('label',axis=1)    # drop(['column name'], axis=0/1 (rows/columns), inplace=True); inplace=True modifies the original data
print("Data Shape: ",data.shape)
print("Label Shape: ",label.shape)

Data Shape: (42000, 784)
Label Shape: (42000,)

data.columns  # the .columns and .index attributes return the column index and the row index of the dataset

Index(['pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6',
       'pixel7', 'pixel8', 'pixel9',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=784)

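2.2 Using reshape to convert each 1-D array into a 2-D 28x28 array, so the grayscale images can be printed and viewed.
(I did not fully follow this part.)

for digit in range(0,4):
    train_0=data[label==digit]  # train_0 is the subset of rows whose label equals digit
    data_new=[]
    for idx in train_0.index:
        val=train_0.loc[idx].values.reshape(28,28)  # .loc[idx] selects the row with index label idx; reshape turns it into a (28, 28) array
        data_new.append(val)                        # append the reshaped image to the data_new list
    plt.figure(figsize=(25,25))
    for k in range(1,5):                  # subplot positions are 1-based, so 1 to 4; a separate name avoids reusing the outer loop variable
        ax1=plt.subplot(1, 20, k)
        ax1.imshow(data_new[k],cmap='gray')   # shows the samples at list index 1 to 4 (index 0 is skipped)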

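plt.figure()
Creates a new figure. figsize sets the width and height of the figure in inches.
plt.subplot
Creates a subplot. The arguments are (rows, columns, index); here it returns the subplot at the given index (1 to 4) in a grid of 1 row and 20 columns.
imshow()
Displays a 2-D array as an image. The first argument is the stored 28x28 image, and cmap='gray' renders it in grayscale.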
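
Since the reshape step caused some confusion above, here is a minimal sketch of what it does. The array `flat` is a hypothetical stand-in for one CSV row (pixel0 to pixel783), not real image data.

```python
import numpy as np

# Hypothetical flat image: 784 values standing in for one CSV row (pixel0..pixel783).
flat = np.arange(784)

# reshape fills the result row by row, so pixel k lands at row k // 28, column k % 28.
image = flat.reshape(28, 28)

k = 100
row, col = divmod(k, 28)
print(image.shape)                  # (28, 28)
print(row, col, image[row, col])    # 3 16 100
```

In other words, the 784 columns of the table are simply the 28 image rows laid end to end.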

2.3 Splitting the dataset into an 80% training set and a 20% test set

train, test,train_labels, test_labels = train_test_split(data, label, train_size=0.8, random_state=42) # random_state acts as the random seed, so the split is reproducible
print("Train Data Shape: ",train.shape)
print("Train Label Shape: ",train_labels.shape)
print("Test Data Shape: ",test.shape)
print("Test Label Shape: ",test_labels.shape)

Train Data Shape: (33600, 784)
Train Label Shape: (33600,)
Test Data Shape: (8400, 784)
Test Label Shape: (8400,)
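
As a small illustration of what random_state buys you, the sketch below splits a hypothetical toy array twice with the same seed and checks that both splits agree. The toy data and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and 10 labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce the same shuffled split.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, train_size=0.8, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, train_size=0.8, random_state=42)

print(X_tr1.shape, X_te1.shape)        # (8, 2) (2, 2)
print(np.array_equal(y_te1, y_te2))    # True
```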

3.SVM

i=5000   # number of training samples used in each case
score=[]
fittime=[]
scoretime=[]
clf = svm.SVC(random_state=42)
print("Default Parameters are: \n",clf.get_params())   # get the default parameters of svm.SVC

For a detailed explanation of each parameter of svm.SVC(), see:
[https://blog.csdn.net/weixin_41990278/article/details/93137009]

Case 1 - Grayscale Images

start_time = time.time()
clf.fit(train[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time      # elapsed fitting time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(fittime))) # %H:%M:%S is hours:minutes:seconds
start_time = time.time()
score=clf.score(test,test_labels)
print("Accuracy for grayscale: ",score)
scoretime = time.time() - start_time
print("Time consumed to score: ",time.strftime("%H:%M:%S", time.gmtime(scoretime)))  # %H:%M:%S is hours:minutes:seconds
case1=[score,fittime,scoretime]

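time.time(): returns the current time as a timestamp (seconds since the epoch).
clf.fit(X, y): X is the input data (here the first 5000 training rows) and y holds the labels.
See also: the model.fit() function.
.ravel(): flattens a multi-dimensional array to one dimension, returning a view rather than a copy whenever possible.
See also: usage of and differences between ravel(), flatten() and squeeze().
time.strftime(): takes a time tuple (struct_time) and returns it as a readable string in the given format. See also: the time strftime() method.
time.gmtime(): converts a number of seconds into a struct_time in UTC (time zone 0). For a real timestamp, a computer set to UTC+8 would show a time 8 hours later; here the argument is an elapsed duration, so the result is simply its hours, minutes and seconds.
clf.score(): score(self, X, y, sample_weight=None)
Provides a default evaluation rule: it scores the trained model on the test set and returns the mean accuracy, from 0 to 1, where 1 is best.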
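
The helper calls described above can be tried in isolation. This is a minimal sketch with a hypothetical duration and a hypothetical small array, showing how the elapsed time is formatted and how ravel() differs from flatten().

```python
import time
import numpy as np

# gmtime() turns a number of seconds into a struct_time; strftime() formats it.
# Applied to an elapsed duration below 24 hours, this yields hours:minutes:seconds.
elapsed = 3725  # hypothetical duration in seconds: 1 h 2 min 5 s
formatted = time.strftime("%H:%M:%S", time.gmtime(elapsed))
print(formatted)  # 01:02:05

# ravel() returns a view when it can; flatten() always returns a copy.
a = np.zeros((2, 3))
print(np.shares_memory(a, a.ravel()))    # True
print(np.shares_memory(a, a.flatten()))  # False
```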
Why is the accuracy so low? It really is this low.
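
One plausible explanation, offered here as an assumption rather than something the post verifies: scikit-learn releases before 0.22 defaulted to gamma='auto', that is 1/n_features, and with raw 0 to 255 pixel values the squared distances between images are so large that the RBF kernel underflows to zero. Newer releases default to gamma='scale', which takes the variance of the data into account. The sketch below uses random arrays as hypothetical stand-ins for two digit images, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 784
gamma = 1.0 / n_features          # what gamma='auto' means in scikit-learn

def rbf(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Hypothetical stand-ins for two digit images with raw 0..255 intensities.
x_gray = rng.integers(0, 256, n_features).astype(float)
y_gray = rng.integers(0, 256, n_features).astype(float)

# The same two "images" after binarization (every value > 0 becomes 1).
x_bin = (x_gray > 0).astype(float)
y_bin = (y_gray > 0).astype(float)

k_gray = rbf(x_gray, y_gray, gamma)
k_bin = rbf(x_bin, y_bin, gamma)
print(k_gray)  # underflows to 0.0: the two samples look completely unrelated
print(k_bin)   # at least exp(-1), because the squared distance is at most 784
```

If this is indeed the cause, scaling the pixels (binarizing, or standardizing as in the PCA cases) keeps the kernel values in a useful range, which would be consistent with the jump in accuracy in the later cases.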

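Case 2 - Binary Images

Put simply, every value greater than 0 is replaced with 1, which turns the images from grayscale into black and white. As before, reshape converts each 1-D array into a 2-D 28x28 array so that the binary images can be plotted and viewed.

test_b=test.copy()     # copy the test and train data; plain assignment (test_b=test) would only
train_b=train.copy()   # alias the same DataFrame, so the grayscale data used in Case 3 would be binarized too
test_b[test_b>0]=1     # set every value greater than 0 in test and train to 1
train_b[train_b>0]=1
for digit in range(0,4):        # same processing as in the grayscale case
    train_0=train_b[train_labels==digit]
    data_new=[]
    for idx in train_0.index:
        val=train_0.loc[idx].values.reshape(28,28)
        data_new.append(val)
    plt.figure(figsize=(25,25))
    for k in range(1,5):
        ax1=plt.subplot(1, 20, k)    # ax1 is the k-th subplot in a grid of 1 row and 20 columns
        ax1.imshow(data_new[k],cmap='binary')  # switch the color map to binary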
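
A point worth checking when "copying" a DataFrame before binarizing it: plain assignment only creates a second name for the same object. The sketch below uses a hypothetical 2x2 frame (the names `original`, `alias` and `independent` are made up for the example) to show the difference from .copy().

```python
import pandas as pd

# Hypothetical 2x2 frame standing in for the pixel table.
original = pd.DataFrame({"pixel0": [0, 200], "pixel1": [35, 0]})

alias = original                 # same object under a second name
independent = original.copy()    # separate data

independent[independent > 0] = 1     # binarize the copy only
print(original["pixel0"].tolist())   # [0, 200]: untouched

alias[alias > 0] = 1                 # binarize through the alias
print(original["pixel0"].tolist())   # [0, 1]: the original changed too
print(alias is original)             # True
```

So if the grayscale data is needed again later, as in a grayscale + PCA case, the binarized version should be built from a copy.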

The code for training the model is the same as above.

start_time = time.time()
clf.fit(train_b[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time      # elapsed fitting time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(fittime)))
start_time = time.time()
score=clf.score(test_b,test_labels)     # score on the binarized test set
print("Accuracy for binary: ",score)
scoretime = time.time() - start_time    # elapsed scoring time
print("Time consumed to score: ",time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case2=[score,fittime,scoretime]

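Compared with the grayscale case, training is much faster, and the accuracy of Case 2 (91%) is far higher than that of Case 1 (9.3%).
However, the high dimensionality of the data still makes computation slow, so the next step uses PCA (principal component analysis) to reduce the number of dimensions.

Case 3 - Grayscale + Dimensionality Reduction (PCA)

Before running PCA, standardize the data. The steps are:
import the packages >> standardize the data >> fit PCA >> transform the data >> compute the explained-variance percentages and their cumulative sum >> choose the number of principal components >> refit PCA and reduce the dimensions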

from sklearn.preprocessing import StandardScaler       # sklearn.preprocessing: data preprocessing
from sklearn.decomposition import PCA as sklearnPCA   # PCA for dimensionality reduction

# Standardize the data first
sc = StandardScaler().fit(train)      # sc stores the computed mean and variance
X_std_train = sc.transform(train)   # use the mean and variance in sc to standardize train
X_std_test = sc.transform(test)

# If n_components is not set, all components are kept
sklearn_pca = sklearnPCA().fit(X_std_train)   # sklearnPCA is the alias of the imported PCA class; fit PCA on the standardized training set
train_pca = sklearn_pca.transform(X_std_train)   # project the data onto the fitted components
test_pca = sklearn_pca.transform(X_std_test)

# Percentage of variance explained by each selected component
# If n_components is not set, all components are kept and the ratios sum to 1.0
var_per = sklearn_pca.explained_variance_ratio_   # variance ratio of each of the n retained components
cum_var_per = sklearn_pca.explained_variance_ratio_.cumsum()  # cumulative sum of the variance ratios

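StandardScaler().fit(): computes the mean and variance of the training data, which are then used to transform the data.
See also: the differences between fit, fit_transform and transform, and how to use them.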
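
To make the fit/transform split concrete, here is a minimal sketch on hypothetical toy arrays (`train_toy` and `test_toy` are made up for the example). It checks that the test rows are scaled with the training statistics, and that fit_transform is just fit followed by transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 training rows and 2 test rows with 2 features.
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test_toy = np.array([[2.5, 25.0], [5.0, 50.0]])

sc_toy = StandardScaler().fit(train_toy)   # learns the mean and scale from the training rows
test_std = sc_toy.transform(test_toy)      # reuses those training statistics

# Equivalent manual computation (StandardScaler uses the population std, ddof=0).
manual = (test_toy - train_toy.mean(axis=0)) / train_toy.std(axis=0)
print(np.allclose(test_std, manual))       # True

# fit_transform(X) gives the same result as fit(X) followed by transform(X).
both = StandardScaler().fit_transform(train_toy)
print(np.allclose(both, sc_toy.transform(train_toy)))  # True
```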

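PCA methods
.fit(X, y=None): fit(X) trains the PCA model on the data X.
fit() is the common method in scikit-learn: every estimator that needs training has a fit() method, and it is the "training" step of the algorithm. Because PCA is an unsupervised learning algorithm, y is simply None here.
.transform(X): converts the data X into its reduced-dimension form. Once the model is fitted, any new input data can be reduced with transform.
See also: an introduction to PCA, its attributes and other methods.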
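
The following sketch illustrates the two methods on hypothetical random data (not the digit images): PCA is fitted on the training rows only, and transform then projects new rows onto the learned components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 training rows and 20 test rows with 5 features.
X_train_toy = rng.normal(size=(100, 5))
X_test_toy = rng.normal(size=(20, 5))

pca = PCA(n_components=2).fit(X_train_toy)   # learn the components from training data only
test_reduced = pca.transform(X_test_toy)     # project new data onto those components
print(test_reduced.shape)                    # (20, 2)

# With the default whiten=False, transform centers with the training mean and projects.
manual = (X_test_toy - pca.mean_) @ pca.components_.T
print(np.allclose(test_reduced, manual))     # True
```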

# Keep about 90% of the information by selecting the components whose cumulative variance stays within 0.90.
n_comp=len(cum_var_per[cum_var_per <= 0.90])  # count the components whose cumulative variance ratio is at most 0.90
print("Keeping 90% Info with ",n_comp," components")   # print the result
sklearn_pca = sklearnPCA(n_components=n_comp)
train_pca = sklearn_pca.fit_transform(X_std_train)  # fit PCA on X_std_train and return the reduced data
test_pca = sklearn_pca.transform(X_std_test)  # reduce X_std_test with the PCA fitted on the training data (no refitting)
print("Shape before PCA for Train: ",X_std_train.shape)
print("Shape after PCA for Train: ",train_pca.shape)
print("Shape before PCA for Test: ",X_std_test.shape)
print("Shape after PCA for Test: ",test_pca.shape)

sklearnPCA is sklearn.decomposition.PCA
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)
Here n_components is the number of principal components n that PCA keeps, that is, the number of features that remain. (See also: other attributes.)
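
The component-count rule used in this section can be checked on a tiny example. The ratios below are hypothetical numbers chosen for illustration, not values from the digit data.

```python
import numpy as np

# Hypothetical explained-variance ratios for four components (they sum to 1).
var_ratio = np.array([0.5, 0.3, 0.15, 0.05])
cum = var_ratio.cumsum()                     # approximately [0.5, 0.8, 0.95, 1.0]

# Rule used in this post: count the components whose cumulative share is <= 0.90.
n_at_most = len(cum[cum <= 0.90])            # 2 components, about 80% of the variance

# Alternative: the smallest number of components that reaches at least 90%.
n_at_least = int(np.searchsorted(cum, 0.90) + 1)   # 3 components, about 95%
print(n_at_most, n_at_least)                 # 2 3
```

The "<= 0.90" rule therefore keeps slightly less than 90% of the variance. scikit-learn's PCA also accepts a float between 0 and 1 as n_components (for example n_components=0.90 with svd_solver='full'), in which case it picks the number of components needed to explain at least that share itself.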

# Fit and score on the same number of samples after dimensionality reduction, and compare the accuracy
start_time = time.time()
clf.fit(train_pca[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(fittime)))
start_time = time.time()
score=clf.score(test_pca,test_labels)
print("Accuracy for grayscale: ",score)
scoretime = time.time() - start_time
print("Time consumed to score model: ",time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case3=[score,fittime,scoretime]

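The code block is the same as above.
Result after principal component analysis:
it runs very quickly, and the accuracy reaches 91.87%.

Case 4 - Binary Images + Dimensionality Reduction (PCA)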

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as sklearnPCA

sc = StandardScaler().fit(train_b)
X_std_train = sc.transform(train_b)
X_std_test = sc.transform(test_b)

sklearn_pca = sklearnPCA().fit(X_std_train)
#train_pca_b = sklearn_pca.transform(X_std_train)
#test_pca_b = sklearn_pca.transform(X_std_test)

var_per = sklearn_pca.explained_variance_ratio_
cum_var_per = sklearn_pca.explained_variance_ratio_.cumsum()

n_comp=len(cum_var_per[cum_var_per <= 0.90])
print("Keeping 90% Info with ",n_comp," components")
sklearn_pca = sklearnPCA(n_components=n_comp)
train_pca_b = sklearn_pca.fit_transform(X_std_train)
test_pca_b = sklearn_pca.transform(X_std_test)
print("Shape before PCA for Train: ",X_std_train.shape)
print("Shape after PCA for Train: ",train_pca_b.shape)
print("Shape before PCA for Test: ",X_std_test.shape)
print("Shape after PCA for Test: ",test_pca_b.shape)

start_time = time.time()
clf.fit(train_pca_b[:i], train_labels[:i].values.ravel())
fittime = time.time() - start_time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(fittime)))
start_time = time.time()
score=clf.score(test_pca_b,test_labels)
print("Accuracy for binary + PCA: ",score)
scoretime = time.time() - start_time
print("Time consumed to score model: ",time.strftime("%H:%M:%S", time.gmtime(scoretime)))
case4=[score,fittime,scoretime]

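The code is the same as above, with the input datasets changed to train_b and test_b
and the outputs to train_pca_b and test_pca_b.

Comparing the four cases

Print the data of all four cases for comparison.

head =["Accuracy","FittingTime","ScoringTime"]   # name of each row of the table
print("\t\t case1 \t\t\t case2 \t\t\t case3 \t\t\t case4")   # \t is the tab character, used to align the columns without a real table
for h, c1, c2, c3, c4 in zip(head, case1, case2, case3, case4):
    print("{}\t{}\t{}\t{}\t{}".format(h, c1, c2, c3, c4))

Conclusions:

By simplifying the problem in Case 2 (converting the images to binary), the accuracy for the chosen number of samples rose from 9% to 91%.

By reducing the dimensionality in Cases 3 and 4, the training time was shortened considerably.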

4. Training data size versus accuracy, fitting time and scoring time

See how the size of the training data affects accuracy.

from tqdm import tqdm   # tqdm is a fast, extensible progress bar for Python

fit_time=[]
score=[]
score_time=[]
for j in tqdm(range(1000,31000,5000)):   # from 1000 up to (but not including) 31000, in steps of 5000
    start_time = time.time()
    clf.fit(train_pca_b[:j], train_labels[:j].values.ravel())
    fit_time.append(time.time() - start_time)
    start_time = time.time()
    score.append(clf.score(test_pca_b,test_labels))  # score on the PCA-reduced test set and append the result to the score list
    score_time.append(time.time() - start_time)

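tqdm: a fast, extensible progress bar for Python. I do not fully understand how to use it yet.
See also: an introduction to tqdm and its common methods.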
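
For the basic usage seen here, tqdm only needs to wrap the iterable of a for loop. The sketch below is a minimal example; the fallback definition is an assumption added so that it still runs where tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:            # fallback so the sketch still runs without tqdm installed
    def tqdm(iterable, **kwargs):
        return iterable

sizes = []
# tqdm wraps any iterable and yields its items unchanged while drawing a progress bar.
for j in tqdm(range(1000, 31000, 5000)):
    sizes.append(j)

print(sizes)   # [1000, 6000, 11000, 16000, 21000, 26000]
```

The loop body is unchanged by the wrapper; the only visible effect is the progress bar, which is handy when each iteration is a slow model fit.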

x=list(range(1000,31000,5000))    # x=[1000,6000,11000,16000,21000,26000]
plt.figure(figsize=[20,5])   # create a figure 20 inches wide and 5 inches high
ax1=plt.subplot(1, 2,1)  # ax1 is the first subplot in a grid of 1 row and 2 columns
ax1.plot(x,score,'-o')   # solid line with filled circle markers
plt.xlabel('Number of Training Samples')   # x-axis label
plt.ylabel('Accuracy')                     # y-axis label
ax2=plt.subplot(1, 2,2)
ax2.plot(x,score_time,'-o')
ax2.plot(x,fit_time,'-o')
plt.xlabel('Number of Training Samples')
plt.ylabel('Time to Compute Score/Fit (sec)')
plt.legend(['score_time','fitting_time'])   # plt.legend adds a legend to the plot

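.plot(x, y, format_string)
x-axis data, y-axis data, and a format string that controls the line style (made up of color, line-style and marker characters). See also: details of the plt.plot() function.
Plotting functions used:
plt.figure(): creates a figure
plt.subplot(): creates a subplot
plt.xlabel(): sets the x-axis label
plt.ylabel(): sets the y-axis label
plt.legend(): adds a legend to the plot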

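5. SVM parameter selection with GridSearchCV

Of the SVC parameters, we will tune gamma and C. Gamma is the parameter of the Gaussian (RBF) kernel, which handles non-linear classification; C is the parameter of the soft-margin cost function, also known as the misclassification cost. A large C gives low bias and high variance, and a small C the opposite.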
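
For reference, the two parameters appear in the standard textbook formulation as follows. This is the usual binary soft-margin form with slack variables $\xi_i$ and feature map $\phi$, given as background rather than taken from the post; SVC extends it to the ten digit classes with a one-vs-one scheme.

```latex
% RBF (Gaussian) kernel: gamma controls how quickly similarity decays with distance
K(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^{2}\right)

% Soft-margin objective: C weighs the misclassification penalty against margin width
\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

A larger gamma makes each training point's influence more local, and a larger C penalizes slack more heavily, which is why both push the model toward lower bias and higher variance.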

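To find the parameter combination that gives the highest accuracy, use GridSearchCV from the scikit-learn library. GridSearchCV performs an exhaustive search over the specified parameter values for an estimator.
Store the parameter values to pass to GridSearchCV in parameters, keep the number of cross-validation folds at 3, and use the support vector machine as the estimator.

parameters = {'gamma': [1, 0.1, 0.01, 0.001],     # create a dictionary
             'C': [1000, 100, 10, 1]}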

p = GridSearchCV(clf , param_grid=parameters, cv=3)

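GridSearchCV()
estimator (clf): the classifier to use (the SVM)
param_grid: the values to try for the parameters being optimized, given as a dictionary or a list of dictionaries
cv: the cross-validation setting, None by default. It sets the number of folds; the default was 3-fold in older scikit-learn releases and is 5-fold from version 0.22 on. It can also be a generator that yields train/test splits.
See also: parameters, methods and examples of GridSearchCV (grid search).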
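
To get a feel for how much work this grid implies, the combinations can be enumerated with scikit-learn's ParameterGrid helper. This is a small sketch that only counts the combinations and fits nothing.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'gamma': [1, 0.1, 0.01, 0.001],
              'C': [1000, 100, 10, 1]}

combos = list(ParameterGrid(parameters))
print(len(combos))        # 16 parameter combinations
print(len(combos) * 3)    # 48 fits with cv=3, plus one final refit on all the data
```

That multiplication is the reason the search is run on a 5000-sample subset rather than on the full training set.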

# Tune the parameters on top of Case 4
X=train_pca_b[:i]    # the first 5000 rows of the PCA-reduced training set
y=train_labels[:i].values.ravel()
start_time = time.time()
p.fit(X,y)
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))

print("Scores for all Parameter Combination: \n",p.cv_results_['mean_test_score'])  # 'mean_test_score' is one of the keys of cv_results_
print("\nOptimal C and Gamma Combination: ",p.best_params_)
print("\nMaximum Accuracy achieved on LeftOut Data: ",p.best_score_)

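p.cv_results_: the results of the cross-validated search; it is a dictionary with many entries.
.best_params_: the best parameter combination
.best_score_: the best mean cross-validated score
For the other attributes, see the linked reference.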
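
The attributes above can be explored quickly on a small dataset. The sketch below is an illustration under stated assumptions: it uses scikit-learn's bundled 8x8 digits (load_digits) in place of the Kaggle data, only the first 300 samples, and a reduced grid, so its scores are not comparable with the ones in this post.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV

# Assumption: scikit-learn's bundled 8x8 digits stand in for the Kaggle data.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:300], y_small[:300]

search = GridSearchCV(svm.SVC(random_state=42),
                      param_grid={'gamma': [0.01, 0.001], 'C': [10, 1]},
                      cv=3)
search.fit(X_small, y_small)

# One mean cross-validated score per parameter combination, in the order of 'params'.
for params, mean_score in zip(search.cv_results_['params'],
                              search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 3))

print(search.best_params_)   # the combination with the highest mean score
print(search.best_score_)    # that highest mean cross-validated accuracy
```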

# To verify, pass the best parameters to the classifier and check the score.
C=p.best_params_['C']
gamma=p.best_params_['gamma']
clf=svm.SVC(C=C,gamma=gamma, random_state=42)

start_time = time.time()
clf.fit(train_pca_b[:i], train_labels[:i].values.ravel())
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
print("Accuracy for binary: ",clf.score(test_pca_b,test_labels))

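Results: for the selected training samples, the best parameters raise the accuracy on the binarized images from 91% (Case 2) to 93.7%.
Now use all of the training examples: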

start_time = time.time()
clf.fit(train_pca_b, train_labels.values.ravel())
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
print("Accuracy for binary: ",clf.score(test_pca_b,test_labels))

The accuracy improved a lot.