【Data】Reading the MNIST Dataset

htshinichi

Article link: https://blog.csdn.net/u013597931/article/details/80099243

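A while ago I trained a CNN on the MNIST dataset. Lately I have been studying classical machine learning algorithms, so I plan to try an SVM on it next. Before training the SVM, though, let's first learn how to read the MNIST data files.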

【Dataset Overview】

First, let's look at the description on the official site:
[Image: screenshot of the official file format description]

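Both the training set (train) and the test set (test) are split into two files: a label file and an image file.
A label file begins with two integers: the magic number and the number of items (labels).
An image file begins with four integers: the magic number, the number of images, the number of rows, and the number of columns.
Each header field is a 32-bit big-endian integer, which is why the code below unpacks it with the '>I' format.
From the headers we can see that the training set holds 60000 samples, the test set holds 10000, and every image is 28×28 pixels.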
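To make the header layout concrete, here is a minimal sketch that parses only the header fields. The helper name `parse_idx_header` is hypothetical, and the magic numbers 2049 (label files) and 2051 (image files) are taken from the IDX format description rather than from this post. The example runs on synthetic bytes, so it does not need the real files.

```python
import struct

# Magic numbers from the IDX format description:
# 2049 (0x00000801) for label files, 2051 (0x00000803) for image files.
LABEL_MAGIC = 2049
IMAGE_MAGIC = 2051

def parse_idx_header(data):
    """Return (magic, dims), where dims holds the sizes that follow the magic number."""
    (magic,) = struct.unpack_from('>I', data, 0)
    if magic == LABEL_MAGIC:
        # Label file: one more integer, the number of items
        dims = struct.unpack_from('>I', data, 4)
    elif magic == IMAGE_MAGIC:
        # Image file: number of images, rows, columns
        dims = struct.unpack_from('>III', data, 4)
    else:
        raise ValueError('unexpected magic number: %d' % magic)
    return magic, dims

# Synthetic headers shaped like the training-set files
image_header = struct.pack('>IIII', IMAGE_MAGIC, 60000, 28, 28)
label_header = struct.pack('>II', LABEL_MAGIC, 60000)
print(parse_idx_header(image_header))  # (2051, (60000, 28, 28))
print(parse_idx_header(label_header))  # (2049, (60000,))
```

Checking the magic number first is a cheap way to catch a label file passed where an image file was expected.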

【Reading the MNIST Dataset】

Reading the MNIST dataset really just means reading binary files.
Method 1:

import numpy as np
import struct

def load_images(file_name):
    # Open in 'rb' mode because the file is raw binary, not text.
    # The with statement closes the file even if reading fails.
    with open(file_name, 'rb') as binfile:
        # Read the whole file into memory as bytes
        buffers = binfile.read()
    # The header is four big-endian unsigned 32-bit integers
    magic, num, rows, cols = struct.unpack_from('>IIII', buffers, 0)
    # Total number of pixels, one byte each (60000*28*28 for the training set)
    bits = num * rows * cols
    # Unpack the pixel bytes that follow the 16-byte header
    images = struct.unpack_from('>' + str(bits) + 'B', buffers, struct.calcsize('>IIII'))
    # Reshape to [num, rows * cols], i.e. [60000, 784] for the training set
    images = np.reshape(images, [num, rows * cols])
    return images

def load_labels(file_name):
    # Open the file in binary mode and read all of it
    with open(file_name, 'rb') as binfile:
        buffers = binfile.read()
    # The header is two big-endian unsigned 32-bit integers; num is the label count
    magic, num = struct.unpack_from('>II', buffers, 0)
    # Unpack the label bytes that follow the 8-byte header
    labels = struct.unpack_from('>' + str(num) + 'B', buffers, struct.calcsize('>II'))
    # Convert to a one-dimensional array of length num
    labels = np.reshape(labels, [num])
    return labels

Usage:

# Replace absolute_path with the directory that holds the MNIST files
filename_train_images = 'absolute_path\\train-images.idx3-ubyte'
filename_train_labels = 'absolute_path\\train-labels.idx1-ubyte'
filename_test_images = 'absolute_path\\t10k-images.idx3-ubyte'
filename_test_labels = 'absolute_path\\t10k-labels.idx1-ubyte'
train_images = load_images(filename_train_images)
train_labels = load_labels(filename_train_labels)
test_images = load_images(filename_test_images)
test_labels = load_labels(filename_test_labels)
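Unpacking every pixel into a Python tuple, as `load_images` does, can be slow and memory-hungry for the full training set. A possible alternative, shown here only as a sketch, is to let NumPy interpret the bytes directly with `np.frombuffer`. The function name `load_images_frombuffer` is hypothetical, and it takes the file contents as bytes so that the example can run on a tiny synthetic file.

```python
import struct
import numpy as np

def load_images_frombuffer(buffers):
    """Decode the bytes of an IDX image file into a [num, rows * cols] uint8 array."""
    magic, num, rows, cols = struct.unpack_from('>IIII', buffers, 0)
    offset = struct.calcsize('>IIII')  # the 16 header bytes
    # View the pixel bytes as uint8 without building a Python tuple first
    pixels = np.frombuffer(buffers, dtype=np.uint8, count=num * rows * cols, offset=offset)
    return pixels.reshape(num, rows * cols)

# Tiny synthetic "file": 2 images of 2x3 pixels with values 0..11
fake = struct.pack('>IIII', 2051, 2, 2, 3) + bytes(range(12))
images = load_images_frombuffer(fake)
print(images.shape)        # (2, 6)
print(images[1].tolist())  # [6, 7, 8, 9, 10, 11]
```

Note that an array created this way from a bytes object is read-only and has dtype uint8; call `.copy()` on it if it needs to be modified in place.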

Method 2:

import numpy as np
import struct
import os

def load_mnist_train(path, kind='train'):
    # Build the file names, e.g. train-labels.idx1-ubyte
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        # 8-byte header: magic number and number of labels
        magic, n = struct.unpack('>II', lbpath.read(8))
        # The rest of the file is one byte per label
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        # 16-byte header: magic number, image count, rows, columns
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        # One byte per pixel; each image becomes a row of rows * cols (784) values
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), rows * cols)
    return images, labels

def load_mnist_test(path, kind='t10k'):
    # Same file layout as the training set; only the file name prefix differs
    return load_mnist_train(path, kind)

Usage:

# Replace absolute_path with the directory that holds the MNIST files
path = 'absolute_path'
train_images, train_labels = load_mnist_train(path)
test_images, test_labels = load_mnist_test(path)
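One way to sanity-check a loader like this without the real files is a round trip: write a few fake images and labels in the same layout, then read them back. The sketch below assumes the file naming scheme used above. Both helper names (`write_idx_pair`, `read_idx_pair`) are hypothetical, and the consistency check between the two headers is an addition that the original functions do not perform.

```python
import os
import struct
import tempfile
import numpy as np

def write_idx_pair(path, kind, images, labels):
    """Write a label file and an image file in the IDX layout (hypothetical helper)."""
    num, rows, cols = images.shape
    with open(os.path.join(path, '%s-labels.idx1-ubyte' % kind), 'wb') as f:
        f.write(struct.pack('>II', 2049, num))
        f.write(labels.astype(np.uint8).tobytes())
    with open(os.path.join(path, '%s-images.idx3-ubyte' % kind), 'wb') as f:
        f.write(struct.pack('>IIII', 2051, num, rows, cols))
        f.write(images.astype(np.uint8).tobytes())

def read_idx_pair(path, kind):
    """Read the pair back with np.fromfile and check that the headers agree."""
    with open(os.path.join(path, '%s-labels.idx1-ubyte' % kind), 'rb') as f:
        magic, n = struct.unpack('>II', f.read(8))
        labels = np.fromfile(f, dtype=np.uint8)
    with open(os.path.join(path, '%s-images.idx3-ubyte' % kind), 'rb') as f:
        magic, num, rows, cols = struct.unpack('>IIII', f.read(16))
        images = np.fromfile(f, dtype=np.uint8).reshape(num, rows * cols)
    if not (n == num == len(labels)):
        raise ValueError('label count and image count do not match')
    return images, labels

with tempfile.TemporaryDirectory() as tmp:
    # Three fake 28x28 images with pixel values cycling through 0..255
    fake_images = (np.arange(3 * 28 * 28) % 256).reshape(3, 28, 28)
    fake_labels = np.array([5, 0, 4], dtype=np.uint8)
    write_idx_pair(tmp, 'train', fake_images, fake_labels)
    images, labels = read_idx_pair(tmp, 'train')
    print(images.shape, labels.tolist())  # (3, 784) [5, 0, 4]
```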

Let's plot the first 30 digits to take a look, using the same procedure as for the digits dataset earlier.

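import matplotlib.pyplot as plt

# np, train_images and train_labels come from the code above
fig = plt.figure(figsize=(8, 8))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(30):
    # Restore the flat 784-value row to a 28x28 image
    image = np.reshape(train_images[i], [28, 28])
    ax = fig.add_subplot(6, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(image, cmap=plt.cm.binary, interpolation='nearest')
    # Write the label near the top-left corner of each subplot
    ax.text(0, 7, str(train_labels[i]))
plt.show()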

[Image: the first 30 training digits, each with its label]

OK, the data is loaded and ready for the training that follows.
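Before feeding the pixels to an SVM, a common preprocessing step is to rescale them from the 0 to 255 byte range to floats between 0 and 1. The post does not say whether the author does this, so treat the following as a sketch of one option; the function name `scale_pixels` is hypothetical.

```python
import numpy as np

def scale_pixels(images):
    """Convert uint8 pixels in [0, 255] to float32 values in [0, 1]."""
    return images.astype(np.float32) / 255.0

# A tiny stand-in for a row of pixel values
raw = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = scale_pixels(raw)
print(scaled.dtype, scaled.min(), scaled.max())  # float32 0.0 1.0
```

With the loaders above, this would be applied as `scale_pixels(train_images)` and `scale_pixels(test_images)`, using the same scaling for both sets.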

The code is available at: https://github.com/htshinichi/ML_practice/tree/master
