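The structure of the MNIST handwritten digit files

1. About the MNIST handwritten digit files
The MNIST handwritten digit database can be downloaded from http://yann.lecun.com/exdb/mnist/ .

MNIST images are already grayscale.

The layout of the MNIST files is described below, taking train-images as the example.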

FILE FORMATS FOR THE MNIST DATABASE

The data is stored in a very simple file format designed for storing vectors and multidimensional matrices. General info on this format is given at the end of this page, but you don't need to read that to use the data files.

All the integers in the files are stored in the MSB first (high endian) format used by most non-Intel processors. Users of Intel processors and other low-endian machines must flip the bytes of the header.

There are 4 files:

train-images-idx3-ubyte: training set images
train-labels-idx1-ubyte: training set labels
t10k-images-idx3-ubyte:  test set images
t10k-labels-idx1-ubyte:  test set labels

The training set contains 60000 examples, and the test set 10000 examples.

The first 5000 examples of the test set are taken from the original NIST training set. The last 5000 are taken from the original NIST test set. The first 5000 are cleaner and easier than the last 5000.

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The labels values are 0 to 9.

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

TEST SET LABEL FILE (t10k-labels-idx1-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  10000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The labels values are 0 to 9.

TEST SET IMAGE FILE (t10k-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  10000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
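
The tables above also fix how large each file must be: a header of known length followed by one byte per label or per pixel. A minimal sketch of that arithmetic follows; the function names are hypothetical and are not part of the MNIST description.

```cpp
#include <cstdint>

// Expected size in bytes of an image file: a 16-byte header
// (magic number, image count, rows, columns) plus one byte per pixel.
// The function names are illustrative, not part of any library.
std::uint64_t ExpectedImageFileSize(std::uint64_t images,
                                    std::uint64_t rows,
                                    std::uint64_t cols)
{
    return 16 + images * rows * cols;
}

// Expected size in bytes of a label file: an 8-byte header
// (magic number, item count) plus one byte per label.
std::uint64_t ExpectedLabelFileSize(std::uint64_t labels)
{
    return 8 + labels;
}
```

For the training set this gives 16 + 60000 × 28 × 28 = 47040016 bytes of images and 8 + 60000 = 60008 bytes of labels. Comparing these figures with the size of the decompressed files is a quick sanity check.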


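First, the data is stored in binary, so the file must be opened in binary mode ('rb' in Python). Second, only the [value] column is actually present in the file; [offset], [type], and [description] merely describe it and are not stored.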

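The offsets show that the pixel data starts at byte 16. Each header integer is 32 bits (4 bytes), so four 32-bit integers must be read before the pixels: the magic number, the number of images, the number of rows, and the number of columns. In Python the struct module is convenient for binary files: struct.unpack_from('>IIII', buf, index) reads four unsigned 32-bit integers in big-endian order.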

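The dataset site says "Users of Intel processors and other low-endian machines must flip the bytes of the header." My machine has an Intel processor, yet when I tried it the header still had to be decoded as big-endian: that yields "2051 60000 28 28", while decoding it as little-endian gives wrong values. The byte order is a property of the file, not of the machine reading it, and the '>' format already performs the flip that the note asks for. A quick experiment is enough to confirm this.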
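
The experiment can be reproduced without the data files by decoding the four magic-number bytes both ways. A minimal sketch, with hypothetical function names:

```cpp
#include <cstdint>

// Interpret four bytes as a 32-bit unsigned integer, most significant
// byte first (the order used in the MNIST files).
std::uint32_t DecodeBigEndian(const unsigned char bytes[4])
{
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
}

// Interpret the same four bytes least significant byte first, which is
// what a little-endian machine sees if it copies them into an integer.
std::uint32_t DecodeLittleEndian(const unsigned char bytes[4])
{
    return (static_cast<std::uint32_t>(bytes[3]) << 24) |
           (static_cast<std::uint32_t>(bytes[2]) << 16) |
           (static_cast<std::uint32_t>(bytes[1]) << 8) |
            static_cast<std::uint32_t>(bytes[0]);
}
```

For the image-file magic number, stored as the bytes 00 00 08 03, the big-endian decode gives 2051, while the little-endian decode gives 0x03080000 = 50855936.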

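The MNIST Handwritten Digits database contains two sets of images of the digits 0-9: a training set and a test set.

Each image has an 8-bit gray depth (pixel values 0 to 255), and each image can be represented as a vector of 784 values (28 × 28).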
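
Since pixels are stored row by row, the position of a pixel in the 784-element vector, and its byte offset in the image file, follow directly from its row and column. A small sketch of that mapping; the constant and function names are hypothetical:

```cpp
#include <cstddef>

const std::size_t kRows = 28;
const std::size_t kCols = 28;
const std::size_t kPixelsPerImage = kRows * kCols;  // 784

// Position of pixel (row, col) inside one image's 784-element vector.
// Rows and columns are counted from 0.
std::size_t PixelIndex(std::size_t row, std::size_t col)
{
    return row * kCols + col;
}

// Byte offset of that pixel inside the image file: the 16-byte header,
// then all earlier images, then the position inside this image.
std::size_t PixelFileOffset(std::size_t image, std::size_t row, std::size_t col)
{
    return 16 + image * kPixelsPerImage + PixelIndex(row, col);
}
```

For example, the first pixel of the second image (image index 1) sits at byte 16 + 784 = 800.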


To get an intuitive feel for the data, the binary file can be rendered as images, for example with matplotlib.
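
A plotting library is not strictly needed for a first look. As a rough text-only stand-in for a matplotlib rendering, the sketch below turns one image into lines of characters; RenderAscii and its default threshold of 128 are hypothetical choices, not something taken from the dataset.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Render one image as text, one output line per pixel row.
// Pixels at or above the threshold become '#', the rest become '.'.
// Returns an empty string if there are fewer than rows * cols pixels.
std::string RenderAscii(const std::vector<unsigned char>& pixels,
                        std::size_t rows, std::size_t cols,
                        unsigned char threshold = 128)
{
    std::string out;
    if (pixels.size() < rows * cols)
    {
        return out;
    }
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < cols; ++c)
        {
            out += (pixels[r * cols + c] >= threshold) ? '#' : '.';
        }
        out += '\n';
    }
    return out;
}
```

Passing the 784 bytes of one image with rows = 28 and cols = 28 and printing the result to a terminal shows the outline of the digit.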

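As the description above shows, the MNIST database consists of four files, each of them a plain binary file, so they can be opened with C++ file streams. Apart from a few header bytes, each file is nothing but data. The header can therefore be read first and the magic number used to verify that the file really is an MNIST file.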

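Because MNIST stores its integers big-endian, unlike the little-endian layout used by most Intel processors, copying the first four header bytes directly into an int or long does not give the correct value; the bytes must first be converted from big-endian to little-endian. Moreover, the widths of int and long differ between platforms, so reading the first four bytes into an int or long is not portable.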
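
If the four bytes are copied into an integer anyway on a little-endian machine, the value comes out byte-reversed, and "flipping" means reversing those bytes. A minimal sketch using the fixed-width std::uint32_t, which sidesteps the width problem of int and long; SwapBytes32 is a hypothetical helper name:

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit value: the lowest byte becomes the
// highest, and so on.
std::uint32_t SwapBytes32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8) |
           ((v & 0xFF000000u) >> 24);
}
```

For instance, the magic number 2051 read this way on a little-endian machine appears as 0x03080000, and SwapBytes32 turns it back into 0x00000803. The swap is only correct on a little-endian host, so the code still depends on the byte order of the machine.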

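In the C++ standard, the size of char is defined as exactly one byte, and this does not change from one processor to another. The header can therefore be stored in char arrays, and converting from big-endian to little-endian with a char array is also easy.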

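Big-endian and little-endian are defined as follows:
Big-endian: the most significant byte is stored at the lowest memory address and the least significant byte at the highest address.
Little-endian: the least significant byte is stored at the lowest memory address and the most significant byte at the highest address. Intel processors are generally little-endian.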
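
Which of the two layouts a given machine uses can be checked by looking at the bytes of a known value in memory. A small sketch, with hypothetical function names:

```cpp
#include <cstdint>
#include <cstring>

// Copy the in-memory bytes of a 32-bit value into out[0..3],
// lowest address first.
void BytesInMemory(std::uint32_t value, unsigned char out[4])
{
    std::memcpy(out, &value, 4);
}

// True when the least significant byte sits at the lowest address.
bool HostIsLittleEndian()
{
    unsigned char bytes[4];
    BytesInMemory(0x01020304u, bytes);
    return bytes[0] == 0x04;
}
```

On a little-endian machine the value 0x01020304 is laid out as 04 03 02 01; on a big-endian machine it is laid out as 01 02 03 04, the same order the MNIST headers use.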
   
The headers of the image file and the label file are defined as follows:
struct MNISTImageFileHeader
{
    unsigned char MagicNumber[4];
    unsigned char NumberOfImages[4];
    unsigned char NumberOfRows[4];
    unsigned char NumberOfColums[4];
};

struct MNISTLabelFileHeader
{
    unsigned char MagicNumber[4];
    unsigned char NumberOfLabels[4];
};
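Because big-endian puts the most significant byte at the lowest address, the first element of the char array holds the most significant byte of the original number. To convert the array to an integer, shift the first element left by 24 bits, the second by 16 bits, and the third by 8 bits, then add them to the fourth. The code below does this with a loop.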
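int ConvertCharArrayToInt(unsigned char* array, int LengthOfArray)
{
    // Reject a null or empty array, because array[0] is read below.
    if (!array || LengthOfArray <= 0)
    {
        return -1;
    }
    // Start with the most significant byte, then shift in the rest.
    int result = static_cast<signed int>(array[0]);
    for (int i = 1; i < LengthOfArray; i++)
    {
        result = (result << 8) + array[i];
    }
    return result;
}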

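With this conversion from a big-endian char array to an integer in place, an MNIST file can be read like any ordinary file: after the header has been read, the data section is read into a char array with the usual file-reading calls and then copied into an OpenCV Mat object. The complete code for reading the MNIST files follows.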
#ifndef MNIST_H
#define MNIST_H

#include <iostream>
#include <fstream>
#include <string>
#include <opencv2/opencv.hpp>

struct MNISTImageFileHeader
{
    unsigned char MagicNumber[4];
    unsigned char NumberOfImages[4];
    unsigned char NumberOfRows[4];
    unsigned char NumberOfColums[4];
};


struct MNISTLabelFileHeader
{
    unsigned char MagicNumber[4];
    unsigned char NumberOfLabels[4];
};

const int MAGICNUMBEROFIMAGE = 2051;
const int MAGICNUMBEROFLABEL = 2049;

int ConvertCharArrayToInt(unsigned char* array, int LengthOfArray);

bool IsImageDataFile(unsigned char* MagicNumber, int LengthOfArray);

bool IsLabelDataFile(unsigned char* MagicNumber, int LengthOfArray);

cv::Mat ReadData(std::fstream& DataFile, int NumberOfData, int DataSizeInBytes);

cv::Mat ReadImageData(std::fstream& ImageDataFile, int NumberOfImages);

cv::Mat ReadLabelData(std::fstream& LabelDataFile, int NumberOfLabel);

cv::Mat ReadImages(std::string& FileName);

cv::Mat ReadLabels(std::string& FileName);




#endif // MNIST_H

/**
 * @file ReadData.cpp The file contains the functions used to read image data
 *                    and label data from the original MNIST files
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-09
 *
 * @function
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-09      1.0.0      build the module
 */

#include "MNIST.h"


/**
 * @brief IsImageDataFile  Check whether the input MagicNumber is equal to
 *                         MAGICNUMBEROFIMAGE
 * @param MagicNumber      The array holding the magic number to be checked
 * @param LengthOfArray    The length of the array
 * @return true, if the magic number matches;
 *         false, otherwise.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */
bool IsImageDataFile(unsigned char* MagicNumber, int LengthOfArray)
{
    int MagicNumberOfImage = ConvertCharArrayToInt(MagicNumber, LengthOfArray);
    if (MagicNumberOfImage == MAGICNUMBEROFIMAGE)
    {
        return true;
    }

    return false;
}




/**
 * @brief IsLabelDataFile  Check whether the input MagicNumber is equal to
 *                         MAGICNUMBEROFLABEL
 * @param MagicNumber      The array holding the magic number to be checked
 * @param LengthOfArray    The length of the array
 * @return true, if the magic number matches;
 *         false, otherwise.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */
bool IsLabelDataFile(unsigned char *MagicNumber, int LengthOfArray)
{
    int MagicNumberOfLabel = ConvertCharArrayToInt(MagicNumber, LengthOfArray);
    if (MagicNumberOfLabel == MAGICNUMBEROFLABEL)
    {
        return true;
    }

    return false;
}




/**
 * @brief ReadData  Read the data from an opened file
 * @param DataFile  The file from which the data is read.
 * @param NumberOfData  The number of data items
 * @param DataSizeInBytes  The size of each data item in bytes
 * @return A Mat in which each row is one data item.
 *         Returns an empty Mat if the file is not open or an error flag
 *                 was set while reading the data.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */

cv::Mat ReadData(std::fstream& DataFile, int NumberOfData, int DataSizeInBytes)
{
    cv::Mat DataMat;

    // Read the data only if the file is open and the sizes are positive.
    if (DataFile.is_open() && NumberOfData > 0 && DataSizeInBytes > 0)
    {
        int AllDataSizeInBytes = DataSizeInBytes * NumberOfData;
        unsigned char* TmpData = new unsigned char[AllDataSizeInBytes];
        DataFile.read((char *)TmpData, AllDataSizeInBytes);

        // Convert the buffer to a Mat only if the whole read succeeded.
        // clone() copies the bytes, so TmpData can be freed afterwards.
        if (!DataFile.fail())
        {
            DataMat = cv::Mat(NumberOfData, DataSizeInBytes, CV_8UC1,
                              TmpData).clone();
        }

        delete [] TmpData;
        DataFile.close();
    }

    return DataMat;
}




/**
 * @brief ReadImageData  Read the image data from the MNIST file.
 * @param ImageDataFile  The file which contains the images.
 * @param NumberOfImages The number of images.
 * @return A Mat containing the images; each row of the Mat is one image.
 *         Returns an empty Mat if the file is closed or the data does not
 *                match the number.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */
cv::Mat ReadImageData(std::fstream& ImageDataFile, int NumberOfImages)
{
    int ImageSizeInBytes = 28 * 28;

    return ReadData(ImageDataFile, NumberOfImages, ImageSizeInBytes);
}



/**
 * @brief ReadLabelData Read the label data from the MNIST file.
 * @param LabelDataFile The file which contains the labels.
 * @param NumberOfLabel The number of labels.
 * @return A Mat containing the labels; each row of the Mat is one label.
 *         Returns an empty Mat if the file is closed or the data does not
 *                match the number.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */
cv::Mat ReadLabelData(std::fstream& LabelDataFile, int NumberOfLabel)
{
    int LabelSizeInBytes = 1;

    return ReadData(LabelDataFile, NumberOfLabel, LabelSizeInBytes);
}




/**
 * @brief ReadImages Read the images from an MNIST image file.
 * @param FileName  The name of the file.
 * @return A Mat containing the images; each row of the Mat is one image.
 *         Returns an empty Mat if the file cannot be opened or the magic
 *         number does not match.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */
cv::Mat ReadImages(std::string& FileName)
{
    std::fstream File(FileName.c_str(), std::ios_base::in | std::ios_base::binary);

    if (!File.is_open())
    {
        return cv::Mat();
    }

    MNISTImageFileHeader FileHeader;
    File.read((char *)(&FileHeader), sizeof(FileHeader));

    // Reject a truncated header or a wrong magic number.
    if (File.fail() || !IsImageDataFile(FileHeader.MagicNumber, 4))
    {
        return cv::Mat();
    }

    int NumberOfImage = ConvertCharArrayToInt(FileHeader.NumberOfImages, 4);

    return ReadImageData(File, NumberOfImage);
}




/**
 * @brief ReadLabels  Read the labels from an MNIST label file.
 * @param FileName  The name of the file.
 * @return A Mat containing the labels; each row of the Mat is one label.
 *         Returns an empty Mat if the file cannot be opened or the magic
 *         number does not match.
 *
 * @author sheng
 * @version 1.0.0
 * @date  2014-04-08
 *
 * @history     <author>      <date>      <version>      <description>
 *               sheng      2014-04-08      1.0.0      build the function
 */
cv::Mat ReadLabels(std::string& FileName)
{
    std::fstream File(FileName.c_str(), std::ios_base::in | std::ios_base::binary);

    if (!File.is_open())
    {
        return cv::Mat();
    }

    MNISTLabelFileHeader FileHeader;
    File.read((char *)(&FileHeader), sizeof(FileHeader));

    // Reject a truncated header or a wrong magic number.
    if (File.fail() || !IsLabelDataFile(FileHeader.MagicNumber, 4))
    {
        return cv::Mat();
    }

    int NumberOfLabels = ConvertCharArrayToInt(FileHeader.NumberOfLabels, 4);

    return ReadLabelData(File, NumberOfLabels);
}








