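- The MNIST database is one of the standard introductory datasets in machine learning and is widely used. It contains 70,000 images of the handwritten digits 0-9, each 28x28 pixels. The original MNIST is distributed in binary form and needs a series of conversions before it becomes local image files, which makes it inconvenient to extend or visualize the dataset. This post exports MNIST to local image files so that others can extend the dataset further.
- This post uses the MNIST dataset bundled with keras. The bundled dataset is already split into a train set and a test set, so each set is saved locally to its own folder.
- Straight to the code: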
- import necessary packages
import os
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from matplotlib.image import imsave
import itertools
- get the data which has been shuffled and split between train and test sets.
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print("X_train original shape", X_train.shape)
print("y_train original shape", y_train.shape)
- results: as the output shows, the train set contains 60,000 images.
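Beyond the overall shape, it can help to look at how many samples each digit has, since the export below creates one folder per digit. The following is a minimal sketch, not part of the original code: `count_per_class` is a hypothetical helper, and the labels in the demo are stand-ins so the snippet runs without downloading anything.

```python
from collections import Counter


def count_per_class(labels):
    """Return {digit: number of samples}, sorted by digit."""
    # int() turns NumPy integer labels into plain Python ints
    counts = Counter(int(label) for label in labels)
    return dict(sorted(counts.items()))


# Stand-in labels (an assumption for the demo);
# in this post's setting, pass y_train or y_test instead.
demo_labels = [5, 0, 4, 1, 9, 2, 1, 3, 1, 4]
print(count_per_class(demo_labels))
# -> {0: 1, 1: 3, 2: 1, 3: 1, 4: 2, 5: 1, 9: 1}
```

Called as `count_per_class(y_train)`, the values should add up to the number of images reported above.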
- show some demo images
# plot the first nine training images in a 3x3 grid, titled with their labels
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[i]))
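- the destination folder structure we intend to use:
Outer structure: one folder for the train set and one for the test set (train_path and test_path in the code).
Inner structure: inside each of those, one subfolder per digit label (0-9) holding that digit's .png images.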
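The layout described above can also be created up front. This is only a sketch under stated assumptions: `make_layout` is a hypothetical helper, and the folder names `train` and `test` are assumptions, since the post does not name the folders.

```python
import os
import tempfile


def make_layout(root, splits=("train", "test"), num_classes=10):
    """Create root/<split>/<digit>/ for every split and digit.

    Returns a dict mapping each split name to its folder path.
    """
    split_paths = {}
    for split in splits:
        split_paths[split] = os.path.join(root, split)
        for digit in range(num_classes):
            # exist_ok=True makes the call safe to repeat
            os.makedirs(os.path.join(split_paths[split], str(digit)), exist_ok=True)
    return split_paths


# Demo in a throwaway directory; use a real root folder when exporting.
with tempfile.TemporaryDirectory() as root:
    paths = make_layout(root)
    print(sorted(os.listdir(paths["train"])))
    # -> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```

With the folders created in advance, the export loops no longer need to check whether each destination folder exists.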
- processing the train set: image names are numbered starting from 0.
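# NOTE: train_path is never defined in the original code; this location is an assumption
train_path = os.path.join('mnist', 'train')
image_counter = itertools.count(0)
for image, label in zip(X_train, y_train):
    dest_folder = os.path.join(train_path, str(label))
    image_name = next(image_counter)
    image_path = os.path.join(dest_folder, str(image_name) + '.png')
    # makedirs also creates missing parent folders, which mkdir does not
    os.makedirs(dest_folder, exist_ok=True)
    imsave(image_path, image, cmap='gray')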
- processing the test set: numbering continues from where the train set left off.
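# NOTE: test_path is never defined in the original code; this location is an assumption
test_path = os.path.join('mnist', 'test')
for image, label in zip(X_test, y_test):
    dest_folder = os.path.join(test_path, str(label))
    # image_counter is the same counter used for the train set, so names do not restart at 0
    image_name = next(image_counter)
    image_path = os.path.join(dest_folder, str(image_name) + '.png')
    # makedirs also creates missing parent folders, which mkdir does not
    os.makedirs(dest_folder, exist_ok=True)
    imsave(image_path, image, cmap='gray')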
- results:
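One way to sanity-check the export is to count the .png files in each label folder and compare the totals with the dataset sizes. The sketch below is an illustration rather than part of the original code: `count_images` is a hypothetical helper, and the demo runs on a tiny fake export made of empty files.

```python
import os
import tempfile


def count_images(split_path):
    """Return {label_folder: number of .png files} for one exported split."""
    counts = {}
    for label in sorted(os.listdir(split_path)):
        folder = os.path.join(split_path, label)
        if os.path.isdir(folder):
            counts[label] = sum(name.endswith(".png") for name in os.listdir(folder))
    return counts


# Demo on a tiny fake export (empty files stand in for real images).
with tempfile.TemporaryDirectory() as root:
    for label, names in {"0": ["0.png", "1.png"], "1": ["2.png"]}.items():
        os.makedirs(os.path.join(root, label))
        for name in names:
            open(os.path.join(root, label, name), "wb").close()
    print(count_images(root))
    # -> {'0': 2, '1': 1}
```

After a real export, `sum(count_images(train_path).values())` should equal `len(X_train)`, and likewise for the test set.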