First, try training the SVM model without PCA dimensionality reduction:
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import datetime
(X_train,Y_train),(X_test,Y_test) = mnist.load_data()
X_train_1 = X_train.reshape(60000,784)
Y_train_1 = Y_train.reshape(-1,1)
starttime = datetime.datetime.now() #start timing the SVM training
### Train with a support vector machine
svc = SVC() #the default parameters are fine here; in my tests they come very close to the accuracy of a manual parameter search
x_train,x_test,y_train,y_test = train_test_split(X_train_1, Y_train_1, test_size = 0.25, random_state = 1)
y_train = y_train.ravel() #sklearn expects 1-D labels; without .ravel() it emits a column-vector warning
svc.fit(x_train,y_train)
accuracy = svc.score(x_test,y_test)
print("accuracy is ",accuracy)
endtime = datetime.datetime.now()
time = (endtime - starttime).seconds
print("time is ",time)
# accuracy is 0.9766
# time is 189
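The claim above that SVC's defaults come close to a manual parameter search can be checked with a quick grid search. A minimal sketch on sklearn's bundled digits dataset (much smaller than MNIST, so it runs in seconds; the grid values here are illustrative assumptions, not tuned choices):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Accuracy with SVC's default hyperparameters (C=1.0, kernel='rbf', gamma='scale')
default_acc = SVC().fit(x_tr, y_tr).score(x_te, y_te)

# Small illustrative grid around the defaults
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.001]}, cv=3)
grid.fit(x_tr, y_tr)
tuned_acc = grid.score(x_te, y_te)

print("default:", default_acc, "tuned:", tuned_acc)
```

On this dataset the two accuracies land very close together, which matches the experience above.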
Now try the SVM model with PCA dimensionality reduction:
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import datetime
(X_train,Y_train),(X_test,Y_test) = mnist.load_data()
X_train_1 = X_train.reshape(60000,784)
Y_train_1 = Y_train.reshape(-1,1)
#Keep a real copy of the training data: plain assignment only shares memory, and copy=False below modifies the array in place
X_train_copy = X_train_1.copy()
starttime = datetime.datetime.now()
###Find the n_components that keeps 95% of the variance
pca_1 = PCA(n_components=0.95, copy = False) #copy=False transforms the data array in place
#PCA centers the data automatically (subtracts the per-feature mean)
X_reduce = pca_1.fit_transform(X_train_1)
n_x = pca_1.n_components_ #the fitted attribute n_components_ holds the chosen integer; plain n_components is still the 0.95 ratio passed in
###Reduce the data with the n_components found above
pca = PCA(n_components=n_x, copy = False)
X_reduce_final = pca.fit_transform(X_train_copy)
print("X_reduce_final :", X_reduce_final.shape)
# X_reduce_final : (60000, 154) — down from 784 to 154 dimensions while keeping 95% of the variance. Nice!
### Train with a support vector machine
svc = SVC()
x_train,x_test,y_train,y_test = train_test_split(X_reduce_final, Y_train_1, test_size = 0.25, random_state = 1)
y_train = y_train.ravel()
svc.fit(x_train,y_train)
accuracy = svc.score(x_test,y_test)
print("accuracy is ",accuracy)
endtime = datetime.datetime.now()
time = (endtime - starttime).seconds
print("time is ",time)
# accuracy is 0.9806
# time is 67
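The n_components=0.95 shorthand used above can be unpacked: PCA exposes the per-component explained-variance ratios, and passing a float in (0, 1) simply picks the smallest component count whose cumulative ratio reaches that threshold. A small sketch on sklearn's bundled digits dataset (8×8 images, so 64 features instead of 784):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Fit with all components and accumulate the explained-variance ratios
full = PCA().fit(X)
cum = np.cumsum(full.explained_variance_ratio_)
n_95 = int(np.searchsorted(cum, 0.95, side="right") + 1)

# Passing a float makes PCA do the same computation internally;
# the chosen integer lands in the fitted attribute n_components_
auto = PCA(n_components=0.95).fit(X)
print(n_95, auto.n_components_)
```

The two numbers agree, which is also why the script above reads the result back from n_components_ rather than from the 0.95 it passed in.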
As the results show, PCA cuts the training time by about 65% (189 s down to 67 s), and accuracy even improves slightly.
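One caveat about the script above: PCA is fitted on all 60,000 images before the train/test split, so information from the held-out quarter leaks into the projection. A leak-free variant fits PCA on the training split only, which an sklearn Pipeline handles automatically. A minimal sketch on the bundled digits dataset (standing in for MNIST to keep it fast):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# fit() learns the PCA projection from x_tr only; score() reuses that same
# projection on x_te, so the test split never influences the components
model = make_pipeline(PCA(n_components=0.95), SVC())
model.fit(x_tr, y_tr)
acc = model.score(x_te, y_te)
print("accuracy:", acc)
```

In practice the leakage here is mild, but the pipeline version is the habit worth keeping.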
Analysis:
- In every MNIST digit image the pixels near the border are almost always background, so dropping them barely affects recognition. That is why the dimensionality can shrink so much while still keeping 95% of the variance; with far fewer features, the learning algorithm naturally runs much faster, which explains the large drop in training time.
- The slight gain in accuracy is probably because PCA discards some noise. By analogy: when a photo looks a bit blurry, applying "sharpen" can make it appear clearer.
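The denoising intuition in the second point can be made concrete with synthetic data: if samples truly live in a low-dimensional subspace plus noise, projecting onto the top components and mapping back with inverse_transform removes most of the noise. A sketch with made-up dimensions (a rank-10 signal embedded in 100 dimensions; all sizes and noise levels here are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 1000 samples that truly live in a 10-dim subspace of a 100-dim space
basis = rng.normal(size=(10, 100))
signal = rng.normal(size=(1000, 10)) @ basis
noisy = signal + rng.normal(scale=0.5, size=signal.shape)

# Project onto the top 10 components, then map back to 100 dims
pca = PCA(n_components=10).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

# The reconstruction should sit closer to the clean signal than the noisy input does
noise_mse = np.mean((noisy - signal) ** 2)
recon_mse = np.mean((denoised - signal) ** 2)
print(noise_mse, recon_mse)
```

The reconstruction keeps essentially all of the signal but only the fraction of the noise that falls inside the 10 retained directions, so its error is far below that of the raw noisy data.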