Introduction to the ubyte Format of the MNIST Dataset

xwhking

Published 2024-07-07 11:29:55

Tags: MNIST, data loading, python

Article link: https://blog.csdn.net/Go_ahead_forever/article/details/140243709

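The train-images-idx3-ubyte file stores the handwritten digit images of the MNIST dataset. Like the label file described later in this article, it uses a simple, compact binary format, laid out as follows: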

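Header:
The header carries the information that describes the file contents. Every header field is a 4-byte unsigned integer stored in big-endian byte order:
- Magic number: the first 4 bytes of the file, identifying the file type. For image files the magic number is 2051 (decimal), i.e. 0x00000803.
- Number of images: the next 4 bytes, giving the number of images in the file (60,000 for the training set).
- Image height: the next 4 bytes, giving the height of each image in pixels (the number of rows, 28 for MNIST).
- Image width: the 4 bytes after the height, giving the width of each image in pixels (the number of columns, 28 for MNIST).
Image data:
The image data follows the header immediately and holds the pixels of all images. Each pixel takes 1 byte, a grayscale value between 0 and 255.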
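
As a side note, the magic number carries more than a file-type tag. In the IDX format, as far as I understand it, the third byte is a data-type code (0x08 for unsigned byte) and the fourth byte is the number of dimensions, which is where the "idx3" in the file name comes from. The sketch below illustrates this; `decode_magic` is a hypothetical helper name, not part of any library.

```python
import struct

def decode_magic(magic):
    """Split an IDX magic number into its data-type code and dimension count.

    Assumes the usual IDX layout: two zero bytes, then a type code,
    then the number of dimensions.
    """
    raw = struct.pack('>I', magic)  # the 4 bytes as they appear in the file
    return {'type_code': raw[2], 'num_dims': raw[3]}

image_info = decode_magic(2051)  # magic number of the image file
label_info = decode_magic(2049)  # magic number of the label file
print(image_info, label_info)
```

Under these assumptions the image file reports 3 dimensions (image index, row, column) and the label file reports 1.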

Concretely, the file layout can be sketched as follows:

[header]
| magic number (4 bytes) | number of images (4 bytes) | number of rows (4 bytes) | number of columns (4 bytes) |
[images]
| pixel 1 (1 byte) | pixel 2 (1 byte) | ... | pixel N (1 byte) |

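The images are stored one after another, and the data of each image is in row-major order: the pixels within a row run from left to right, and the rows follow one another from top to bottom.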
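
Because the header has a fixed size and the pixels are stored in row-major order, the byte offset of any single pixel can be computed directly. The following is a minimal sketch on a tiny synthetic file built in memory; `pixel_offset` and `HEADER_SIZE` are names introduced here for illustration only.

```python
import struct

HEADER_SIZE = 16  # four big-endian 32-bit integers

def pixel_offset(index, row, col, num_rows, num_columns):
    """Byte offset of pixel (row, col) of image `index` in an idx3 file."""
    return HEADER_SIZE + (index * num_rows + row) * num_columns + col

# Build a tiny synthetic file in memory: 2 images of 2x3 pixels.
header = struct.pack('>IIII', 2051, 2, 2, 3)
pixels = bytes(range(12))  # pixel values 0..11 in storage order
data = header + pixels

# Pixel (row=1, col=2) of image 1 is the last byte stored.
offset = pixel_offset(1, 1, 2, num_rows=2, num_columns=3)
value = data[offset]
print(offset, value)
```

For real MNIST images (28 x 28), the same formula places the first pixel of image 1 at byte 16 + 784 = 800.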

Example code for reading the `train-images-idx3-ubyte` file

The following Python example reads and parses the train-images-idx3-ubyte file:

import struct
import numpy as np
import matplotlib.pyplot as plt

def read_images(file_path):
    with open(file_path, 'rb') as file:
        # Read the magic number and the image metadata
        magic_number, num_images, num_rows, num_columns = struct.unpack('>IIII', file.read(16))
        print(f"Magic Number: {magic_number}, Number of Images: {num_images}, Rows: {num_rows}, Columns: {num_columns}")
        
        # Read all images
        images = np.fromfile(file, dtype=np.uint8).reshape(num_images, num_rows, num_columns)
    
    return images

# Usage example
file_path = 'train-images-idx3-ubyte'
images = read_images(file_path)
print(f"First Image Shape: {images[0].shape}")

# Display the first image
plt.imshow(images[0], cmap='gray')
plt.show()

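In this code:

- struct.unpack decodes the binary header read from the file.
- '>IIII' means four 4-byte unsigned integers in big-endian order, corresponding to the magic number, the number of images, the image height, and the image width.
- np.fromfile reads the remaining pixel data from the file, which is then reshaped to (num_images, num_rows, num_columns).

With the code above, all image data in the train-images-idx3-ubyte file can be read into a NumPy array and the first image displayed.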
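
The reader above trusts the header as-is. A slightly more defensive variant, sketched below, checks the magic number and the amount of pixel data before reshaping, which helps catch a wrong file (for example the label file, or a still-compressed download). `read_images_checked` and `IMAGE_MAGIC` are hypothetical names chosen for this sketch, and it uses np.frombuffer on the bytes read from the file rather than np.fromfile.

```python
import os
import struct
import tempfile

import numpy as np

IMAGE_MAGIC = 2051  # magic number of idx3 image files

def read_images_checked(file_path):
    """Read an idx3 image file, validating the header before reshaping."""
    with open(file_path, 'rb') as file:
        header = file.read(16)
        if len(header) != 16:
            raise ValueError("file too short to contain an idx3 header")
        magic, num_images, num_rows, num_columns = struct.unpack('>IIII', header)
        if magic != IMAGE_MAGIC:
            raise ValueError(f"unexpected magic number {magic}, expected {IMAGE_MAGIC}")
        expected = num_images * num_rows * num_columns
        pixels = np.frombuffer(file.read(), dtype=np.uint8)
        if pixels.size != expected:
            raise ValueError(f"expected {expected} pixel bytes, found {pixels.size}")
    return pixels.reshape(num_images, num_rows, num_columns)

# Round-trip check on a tiny synthetic file (2 images of 2x3 pixels).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack('>IIII', IMAGE_MAGIC, 2, 2, 3) + bytes(range(12)))
    tmp_path = tmp.name
demo = read_images_checked(tmp_path)
os.remove(tmp_path)
print(demo.shape)
```

Note that np.frombuffer returns a read-only array here; call .copy() on the result if the pixels need to be modified in place.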

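train-labels-idx1-ubyte is the label file of the MNIST dataset, stored in the same ubyte format. It is a binary file holding the label that corresponds to each handwritten digit image. Its structure is very simple and compact, as described below: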

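Header:
The header carries the information that describes the file contents. Both fields are 4-byte unsigned integers stored in big-endian byte order:
- Magic number: the first 4 bytes of the file, identifying the file type. For label files the magic number is 2049 (decimal), i.e. 0x00000801.
- Number of labels: the next 4 bytes, giving the number of labels in the file.
Label data:
The label data follows the header immediately and holds the labels of all images. Each label takes 1 byte, a digit between 0 and 9.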

Concretely, the file layout can be sketched as follows:

[header]
| magic number (4 bytes) | number of items (4 bytes) |
[labels]
| label 1 (1 byte) | label 2 (1 byte) | ... | label N (1 byte) |
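
To make the layout concrete, the sketch below builds a tiny label file in memory and decodes it again. Since each label is a single byte, iterating over the bytes after the 8-byte header yields the labels directly. `parse_labels` and `LABEL_MAGIC` are hypothetical names used only for this illustration, and the five label values are made-up sample data.

```python
import struct

LABEL_MAGIC = 2049  # magic number of idx1 label files

def parse_labels(data):
    """Decode the bytes of an idx1 label file into a list of integers."""
    magic, num_labels = struct.unpack('>II', data[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected magic number {magic}, expected {LABEL_MAGIC}")
    body = data[8:8 + num_labels]
    if len(body) != num_labels:
        raise ValueError(f"expected {num_labels} labels, found {len(body)}")
    return list(body)  # iterating over bytes yields one int per label

# Synthetic file contents: an 8-byte header followed by 5 one-byte labels.
raw = struct.pack('>II', LABEL_MAGIC, 5) + bytes([5, 0, 4, 1, 9])
sample_labels = parse_labels(raw)
print(len(raw), sample_labels)
```

The total file size is therefore always 8 bytes plus one byte per label.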

Example code for reading the `train-labels-idx1-ubyte` file

The following Python example reads and parses the train-labels-idx1-ubyte file:

import struct

def read_labels(file_path):
    with open(file_path, 'rb') as file:
        # Read the magic number and the number of labels
        magic_number, num_labels = struct.unpack('>II', file.read(8))
        print(f"Magic Number: {magic_number}, Number of Labels: {num_labels}")
        
        # Read all labels, one byte at a time
        labels = []
        for _ in range(num_labels):
            label = struct.unpack('B', file.read(1))[0]
            labels.append(label)
    
    return labels

# Usage example
file_path = 'train-labels-idx1-ubyte'
labels = read_labels(file_path)
print(f"First 10 Labels: {labels[:10]}")
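
Once both files are loaded, the images and labels are typically used together, with the i-th label describing the i-th image. The sketch below shows one possible way to check that the two counts agree and to select all images of a given digit. `pair_dataset` is a hypothetical helper, and the arrays are synthetic stand-ins for the results of read_images and read_labels.

```python
import numpy as np

def pair_dataset(images, labels):
    """Return (images, labels) as arrays after checking that the counts match."""
    labels = np.asarray(labels, dtype=np.uint8)
    if images.shape[0] != labels.shape[0]:
        raise ValueError(
            f"{images.shape[0]} images but {labels.shape[0]} labels"
        )
    return images, labels

# Synthetic stand-ins: 4 blank 28x28 images and 4 made-up labels.
images = np.zeros((4, 28, 28), dtype=np.uint8)
labels = [5, 0, 4, 0]

images, labels = pair_dataset(images, labels)
zeros = images[labels == 0]  # boolean mask selects every image labelled 0
print(zeros.shape)
```

Converting the label list to a NumPy array is what makes the boolean mask `labels == 0` possible; with a plain Python list the comparison would not be element-wise.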