Importing the dataset
# Import the MNIST dataset
import numpy as np
import os, gzip
# Load the MNIST dataset from local .gz files
def load_data(data_folder):
    files = [
        'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
        't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
    ]
    paths = []
    for fname in files:
        paths.append(os.path.join(data_folder, fname))
    # Label files: skip the 8-byte header, the rest is one uint8 label per sample
    with gzip.open(paths[0], 'rb') as lbpath:
        y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)
    # Image files: skip the 16-byte header, the rest is 28x28 uint8 pixels per sample
    with gzip.open(paths[1], 'rb') as imgpath:
        x_train = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
    with gzip.open(paths[2], 'rb') as lbpath:
        y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)
    with gzip.open(paths[3], 'rb') as imgpath:
        x_test = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
    return (x_train, y_train), (x_test, y_test)
(train_images, train_labels), (test_images, test_labels) = load_data('D:/MNIST')
print ("mnist data loaded")
print ("original training data shape:",train_images.shape)
print ("original testing data shape:",test_images.shape)
Output:
mnist data loaded
original training data shape: (60000, 28, 28)
original testing data shape: (10000, 28, 28)
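The `offset=8` and `offset=16` arguments in `load_data` skip the IDX file header: a 4-byte magic number followed by one 4-byte big-endian integer per dimension (one dimension for labels, three for images). The sketch below parses such a header from raw bytes. The helper name `parse_idx_header` and the synthetic byte strings are illustrative assumptions, not part of the original code.

```python
import struct

def parse_idx_header(raw):
    """Return (dtype_code, shape, header_size) for an IDX byte string."""
    # Magic number: two zero bytes, one data-type code, one dimension count.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return dtype_code, shape, 4 + 4 * ndim

# Synthetic headers (hypothetical data): 3 labels, and 2 images of 28x28.
label_raw = struct.pack(">HBBI", 0, 0x08, 1, 3) + bytes([5, 0, 4])
image_raw = struct.pack(">HBBIII", 0, 0x08, 3, 2, 28, 28) + bytes(2 * 28 * 28)

print(parse_idx_header(label_raw))  # (8, (3,), 8)
print(parse_idx_header(image_raw))  # (8, (2, 28, 28), 16)
```

The data-type code `0x08` denotes unsigned bytes, which matches the `np.uint8` passed to `np.frombuffer` above.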
Preprocessing the dataset
# Flatten each image into a one-dimensional vector
train_data=train_images.reshape(60000,784)
test_data=test_images.reshape(10000,784)
print ("training data shape after reshape:",train_data.shape)
print ("testing data shape after reshape:",test_data.shape)
Output:
training data shape after reshape: (60000, 784)
testing data shape after reshape: (10000, 784)
The output shows that every image is now a one-dimensional array of length 784.
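As a small variation on the reshape step, passing `-1` lets NumPy infer the flattened length, so the sample count does not have to be hard-coded. The array `toy_images` below is a hypothetical stand-in for a small batch of MNIST-sized images.

```python
import numpy as np

# Hypothetical batch of five 28x28 grayscale images.
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Keep the sample axis and let NumPy infer 28 * 28 = 784 for the rest.
toy_flat = toy_images.reshape(len(toy_images), -1)
print(toy_flat.shape)  # (5, 784)

# Flattening is row-major: pixel (r, c) lands at column r * 28 + c.
print(toy_flat[2, 3 * 28 + 7] == toy_images[2, 3, 7])  # True
```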
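PCA dimensionality reduction
# Reduce the dimensionality of the data with principal component analysis (PCA)
# The main reason is that running KNN classification directly in the original
# 784-dimensional feature space is computationally expensive,
# so PCA is used to extract the main features of the original data.
# Keep the first 100 principal components, giving a 100-dimensional feature space
from sklearn.decomposition import PCA
pca = PCA(n_components = 100)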
pca.fit(train_data) #fit PCA with training data instead of the whole dataset
train_data_pca = pca.transform(train_data)
test_data_pca = pca.transform(test_data)
print("PCA completed with 100 components")
print ("training data shape after PCA:",train_data_pca.shape)
print ("testing data shape after PCA:",test_data_pca.shape)
Output:
PCA completed with 100 components
training data shape after PCA: (60000, 100)
testing data shape after PCA: (10000, 100)
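The output shows that the processed dataset has the following properties:
- There are 60,000 training samples and 10,000 test samples.
- Every sample lives in a 100-dimensional feature space.
All data used for training and testing is now prepared.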
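To make the fit/transform split concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition: the mean and the principal directions are estimated from the training data only, and the same projection is then applied to the test data. The helper names `pca_fit` and `pca_transform` and the random toy data are assumptions for illustration. scikit-learn's `PCA` may choose a different solver, so its output can differ in component sign and in small numerical details.

```python
import numpy as np

def pca_fit(X, n_components):
    """Estimate the mean and top principal directions from X (samples x features)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of total variance captured by each kept component.
    explained_ratio = s[:n_components] ** 2 / np.sum(s ** 2)
    return mean, components, explained_ratio

def pca_transform(X, mean, components):
    """Center X with the training mean and project onto the kept directions."""
    return (X - mean) @ components.T

# Hypothetical toy data standing in for the flattened images.
rng = np.random.default_rng(0)
toy_train = rng.normal(size=(200, 20))
toy_test = rng.normal(size=(50, 20))

mean, components, ratio = pca_fit(toy_train, n_components=5)
toy_train_pca = pca_transform(toy_train, mean, components)
toy_test_pca = pca_transform(toy_test, mean, components)
print(toy_train_pca.shape, toy_test_pca.shape)  # (200, 5) (50, 5)
```

Summing `ratio` gives a rough idea of how much variance the kept components retain, which is one way to judge whether a choice such as 100 components is reasonable.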
KNN classification
# Run KNN classification on the PCA-reduced MNIST data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_data_pca, train_labels)
# Compute the mean accuracy on the test set
knn.score(test_data_pca, test_labels)
Output:
0.9737
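For intuition about what the classifier does with `n_neighbors=3`, the following is a brute-force NumPy sketch of k-nearest-neighbor prediction: compute distances from each test point to every training point, take the k closest, and return the majority label. The function `knn_predict` and the toy points are hypothetical. `KNeighborsClassifier` may use tree-based search and its own tie-breaking, so this is an illustration of the idea and not a reproduction of the library's implementation.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(test_X ** 2, axis=1)[:, None]
          - 2.0 * test_X @ train_X.T
          + np.sum(train_X ** 2, axis=1)[None, :])
    # Indices of the k smallest distances per test point (order within k not needed).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    votes = train_y[nearest]
    # Majority vote; ties go to the smallest label in this sketch.
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical toy data: two well-separated groups of points.
toy_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
toy_y = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.2, 0.1], [5.5, 5.2]])
true_y = np.array([0, 1])

pred = knn_predict(toy_X, toy_y, queries, k=3)
print(pred)                     # [0 1]
print(np.mean(pred == true_y))  # 1.0
```

The distance matrix has one row per test point and one column per training point, which is why reducing 784 features to 100 beforehand lowers the cost of each distance computation.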