import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False)  # as_frame=False returns NumPy arrays; recent scikit-learn versions return a pandas DataFrame by default
mnist
Output:
{'data': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.]]),
 'target': array(['5', '0', '4', ..., '4', '5', '6'], dtype=object),
 'frame': None,
 'categories': {},
 'feature_names': ['pixel1', ...],
 ...}
X, y = mnist['data'], mnist['target']
# For most datasets we would normally have to split the data into training and test sets ourselves, but MNIST comes pre-split: the first 60,000 samples are the training set and the last 10,000 are the test set.
X.shape
Output: (70000, 784)
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)  # labels arrive as strings ('5', '0', ...), so cast them to float
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)
X_train.shape
Output: (60000, 784)
y_train.shape
Output: (60000,)
X_test.shape
Output: (10000, 784)
y_test.shape
Output: (10000,)
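As a quick sanity check (a minimal sketch using matplotlib; this was not part of the original run), each 784-dimensional row can be reshaped into a 28×28 grayscale image and displayed:

import matplotlib.pyplot as plt
some_digit = X_train[0].reshape(28, 28)  # each row is a flattened 28x28 image
plt.imshow(some_digit, cmap='binary')
plt.title(f'label: {y_train[0]}')
plt.show()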
## Using KNN for Recognition
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train, y_train)
Output:
Wall time: 53.5 s
KNeighborsClassifier()
%time knn_clf.score(X_test, y_test)
Output:
Wall time: 19min 57s
Parser : 120 ms
0.9688
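Fitting is fast but scoring takes almost 20 minutes. KNN is a lazy learner: fit merely stores the training data, and all the work happens at prediction time, where every test sample is compared against all 60,000 training samples in 784 dimensions. One simple mitigation (an assumption on my part, not something the original run used) is to parallelize the neighbor search across CPU cores with the n_jobs parameter:

knn_clf = KNeighborsClassifier(n_jobs=-1)  # n_jobs=-1 uses every available CPU core for the neighbor search
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)  # same accuracy; wall time drops roughly with the core count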
## Dimensionality Reduction with PCA
from sklearn.decomposition import PCA
pca = PCA(0.9)  # a float n_components asks PCA to keep enough components to explain 90% of the variance
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_train_reduction.shape
Output: (60000, 87)
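So 87 of the original 784 dimensions are enough to retain 90% of the variance. We can verify this directly (a quick check using scikit-learn's standard fitted attributes):

pca.n_components_  # 87: the number of components PCA chose
pca.explained_variance_ratio_.sum()  # just above 0.9 by construction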
knn_clf = KNeighborsClassifier()
%time knn_clf.fit(X_train_reduction, y_train)
Output:
Wall time: 3.27 s
KNeighborsClassifier()
X_test_reduction = pca.transform(X_test)
%time knn_clf.score(X_test_reduction, y_test)
Output:
Wall time: 1min 49s
0.9728
Notice that accuracy on the reduced data is actually higher (0.9728 vs. 0.9688), even though PCA threw information away. This is because the low-variance components PCA discards carry a large share of the noise in the original data, so dimensionality reduction also acts as a denoising step, and that can improve accuracy.
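We can make this denoising effect visible (a minimal sketch, assuming matplotlib is available) by mapping the reduced data back to 784 dimensions with inverse_transform and comparing a reconstructed digit with the original; the round-tripped image looks smoother because the dropped components mostly encoded noise:

import matplotlib.pyplot as plt

X_train_restored = pca.inverse_transform(X_train_reduction)  # back to 784 dimensions; information in the dropped components is gone

fig, axes = plt.subplots(1, 2)
axes[0].imshow(X_train[0].reshape(28, 28), cmap='binary')
axes[0].set_title('original')
axes[1].imshow(X_train_restored[0].reshape(28, 28), cmap='binary')
axes[1].set_title('after PCA round-trip')
plt.show()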