Python: KNN classification on the MNIST dataset after PCA dimensionality reduction, 97% accuracy

Loading the dataset

# Load the MNIST dataset
import numpy as np
import os, gzip
# Read the local MNIST files (gzip-compressed IDX format)
def load_data(data_folder):

    files = [
          'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
          't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
    ]
    paths = []
    for fname in files:
        paths.append(os.path.join(data_folder,fname))
        
    with gzip.open(paths[0], 'rb') as lbpath:
        y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)
        
    with gzip.open(paths[1], 'rb') as imgpath:
        x_train = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
            
    with gzip.open(paths[2], 'rb') as lbpath:
        y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)
        
    with gzip.open(paths[3], 'rb') as imgpath:
        x_test = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
            
    return (x_train, y_train), (x_test, y_test)
    
(train_images, train_labels), (test_images, test_labels) = load_data('D:/MNIST')

print ("mnist data loaded")
print ("original training data shape:",train_images.shape)
print ("original testing data shape:",test_images.shape)

Output:
mnist data loaded
original training data shape: (60000, 28, 28)
original testing data shape: (10000, 28, 28)
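
The offset=8 and offset=16 arguments in load_data skip the IDX file headers. As a minimal sketch, assuming the standard MNIST IDX layout (big-endian 32-bit fields: a magic number whose last byte is the number of dimensions, followed by one size per dimension), the header could be inspected as follows. The helper name read_idx_header is hypothetical and not part of the original code, and the byte strings are synthetic stand-ins for the decompressed files.

```python
import struct

def read_idx_header(raw):
    """Parse an IDX header from raw (decompressed) bytes.

    Hypothetical helper for illustration only.
    Returns (magic, dims, header_size).
    """
    # The magic number is a big-endian uint32; its lowest byte is the number of dimensions.
    magic, = struct.unpack('>I', raw[:4])
    ndim = magic & 0xFF
    # Each dimension size is another big-endian uint32.
    dims = struct.unpack('>' + 'I' * ndim, raw[4:4 + 4 * ndim])
    header_size = 4 + 4 * ndim
    return magic, dims, header_size

# Synthetic label file: magic 2049, 3 labels, then the label bytes.
label_bytes = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])
# Synthetic image file: magic 2051, 2 images of 28x28 pixels, all zeros.
image_bytes = struct.pack('>IIII', 2051, 2, 28, 28) + bytes(2 * 28 * 28)

label_header = read_idx_header(label_bytes)
image_header = read_idx_header(image_bytes)
print(label_header)  # (2049, (3,), 8)
print(image_header)  # (2051, (2, 28, 28), 16)
```

The header sizes of 8 and 16 bytes match the offsets passed to np.frombuffer for the label and image files respectively.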

Preprocessing the dataset

# Flatten each 28x28 image into a 1-D vector of 784 values
train_data=train_images.reshape(60000,784)
test_data=test_images.reshape(10000,784)
print ("training data shape after reshape:",train_data.shape)
print ("testing data shape after reshape:",test_data.shape)

Output:
training data shape after reshape: (60000, 784)
testing data shape after reshape: (10000, 784)

The output shows that every image is now a 1-D array of length 784.
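
The reshape calls above hard-code the sample counts 60000 and 10000. A small sketch of an alternative that lets NumPy infer the sizes is shown here; the zero-filled array is only a stand-in for the real images (an assumption for illustration).

```python
import numpy as np

# Stand-in for the MNIST arrays: 5 fake 28x28 uint8 images.
images = np.zeros((5, 28, 28), dtype=np.uint8)

# len(images) gives the sample count; -1 lets NumPy infer 28 * 28 = 784.
flat = images.reshape(len(images), -1)
print(flat.shape)  # (5, 784)
```

Written this way, the same line works for both the training and the test set, or for any subset of them.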

PCA dimensionality reduction

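# Reduce the dimensionality of the data with principal component analysis (PCA).
# Running KNN classification in the original 784-dimensional feature space is computationally expensive,
# so PCA is used to extract the main features of the original data.
# Keep the first 100 principal components, which gives a 100-dimensional feature space.
from sklearn.decomposition import PCA
pca = PCA(n_components = 100)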
pca.fit(train_data) #fit PCA with training data instead of the whole dataset
train_data_pca = pca.transform(train_data)
test_data_pca = pca.transform(test_data)
print("PCA completed with 100 components")
print ("training data shape after PCA:",train_data_pca.shape)
print ("testing data shape after PCA:",test_data_pca.shape)

Output:
PCA completed with 100 components
training data shape after PCA: (60000, 100)
testing data shape after PCA: (10000, 100)

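The output shows that the processed dataset has the following properties:

  1. There are 60000 training samples and 10000 test samples.
  2. Every sample lives in a 100-dimensional feature space.

All the data used for training and testing is now ready.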
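
The choice of 100 components is taken as given in this post. One common way to judge such a choice is the fraction of variance each component explains. The following is a NumPy-only sketch of that idea on synthetic data (not MNIST, and not scikit-learn's implementation); the 95% threshold and all variable names are assumptions for illustration. With scikit-learn, the fitted pca.explained_variance_ratio_ attribute holds the same kind of per-component ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 samples, 20 features,
# with nearly all variance lying in 3 hidden directions.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

# PCA via SVD of the mean-centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of total variance explained by each component, and its running sum.
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components that keeps at least 95% of the variance.
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto the first k principal directions.
X_reduced = X_centered @ Vt[:k].T
print(k, X_reduced.shape)
```

On the real data, the same cumulative curve would show how much variance the first 100 components retain.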

KNN classification

# Run KNN classification on the PCA-reduced MNIST data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_data_pca, train_labels)
# Compute and print the accuracy on the test set
print(knn.score(test_data_pca, test_labels))

Output:
0.9737
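
To make the classification step concrete, here is a brute-force NumPy sketch of what a k-nearest-neighbors prediction does: find the k closest training points and take a majority vote. It is an illustration of the idea on toy data, not scikit-learn's implementation; the function name knn_predict and the toy points are assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Predict labels for query_x by majority vote among the k nearest training points."""
    # Squared Euclidean distance between every query and every training point.
    diffs = query_x[:, None, :] - train_x[None, :, :]
    dists = np.sum(diffs**2, axis=2)
    # Indices of the k closest training points for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the neighbors' labels (ties go to the smallest label).
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated clusters labeled 0 and 1.
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])
query_x = np.array([[0.05, 0.05], [4.9, 5.0]])

pred = knn_predict(train_x, train_y, query_x, k=3)
accuracy = np.mean(pred == np.array([0, 1]))
print(pred)      # [0 1]
print(accuracy)  # 1.0
```

This version builds the full query-by-training difference array in memory, so it is only suitable for small inputs. Every query is compared against every training point, which is why reducing the feature dimension from 784 to 100 lowers the cost of each comparison.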
