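When training a network, you will often find that a dataset stores its data and labels separately. Any batch shuffle then has to preserve the one-to-one correspondence between each sample and its label. One way to implement this is shown below.
Take MNIST, the thoroughly well-worn dataset of image recognition, as an example:
The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/ and consists of four parts: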
Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, 60,000 samples)
Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, 60,000 labels)
Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, 10,000 samples)
Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, 10,000 labels)
We first read the data in as arrays:
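import os
import struct

import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    # NOTE: these names must match the unzipped files on disk. The archives
    # listed above unzip to '<kind>-labels-idx1-ubyte' (with a hyphen), so
    # adjust the patterns below if your files are named that way.
    labels_path = os.path.join(path, '%s-labels.idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images.idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        # 8-byte header: magic number and label count (big-endian).
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        # 16-byte header: magic number, image count, rows, columns.
        magic, num, rows, cols = struct.unpack('>IIII', imgpath.read(16))
        # Flatten each 28x28 image into a 784-element row.
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
    return images, labels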
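To make the header layout that `struct.unpack` relies on more concrete, here is a small sketch that builds a tiny label buffer and image buffer in memory and parses them the same way. The helper name `parse_idx` and the fake data are assumptions made for illustration only, and the sketch uses `np.frombuffer` on bytes rather than `np.fromfile` on a file handle.

```python
import struct

import numpy as np


def parse_idx(label_bytes, image_bytes):
    """Parse IDX-style label and image buffers (hypothetical helper)."""
    # Label header: two big-endian unsigned 32-bit ints (magic, item count).
    magic, n = struct.unpack('>II', label_bytes[:8])
    labels = np.frombuffer(label_bytes, dtype=np.uint8, offset=8)
    # Image header: four big-endian unsigned 32-bit ints
    # (magic, image count, rows, columns).
    magic, num, rows, cols = struct.unpack('>IIII', image_bytes[:16])
    pixels = np.frombuffer(image_bytes, dtype=np.uint8, offset=16)
    # One flattened image per row, as in load_mnist.
    return pixels.reshape(num, rows * cols), labels


# Fake data: 3 labels and 3 images of 2x2 pixels each.
fake_labels = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])
fake_images = struct.pack('>IIII', 2051, 3, 2, 2) + bytes(range(12))

images, labels = parse_idx(fake_labels, fake_images)
print(images.shape, labels.tolist())
```

Reading the row and column counts from the header, instead of hard-coding 784, lets the same parsing logic work for images of any size.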
Visualize the loaded images with matplotlib.pyplot:
import os

import matplotlib.pyplot as plt


def show():
    X_train, y_train = load_mnist(os.getcwd())
    fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True)
    ax = ax.flatten()
    for i in range(10):
        img = X_train[y_train == i][0].reshape(28, 28)  # show digits 0-9
        # img = X_train[y_train == 7][i].reshape(28, 28)  # show only the digit 7
        ax[i].imshow(img, cmap='Greys', interpolation='nearest')
    ax[0].set_xticks([])
    ax[0].set_yticks([])
    plt.tight_layout()
    plt.show()
Visualization result (a 2x5 grid showing one sample of each digit, 0 through 9):
Batch shuffle during training:
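import numpy as np


def batch_shuffle(images, labels, batch_size):
    # images and labels must have the same length.
    # Save the RNG state so that both shuffles draw the same permutation.
    state = np.random.get_state()
    np.random.shuffle(images)
    # Restore the state: labels are shuffled in exactly the same order.
    np.random.set_state(state)
    np.random.shuffle(labels)
    # Both arrays were shuffled in place; return the first batch_size pairs.
    return images[:batch_size], labels[:batch_size]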
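The function above shuffles both arrays in place. As an alternative sketch, not the method used in this post, the same pairing can be kept by drawing a single permutation of row indices and applying it to both arrays, which leaves the inputs untouched. The name `batch_shuffle_indexed` and the optional `rng` parameter are assumptions made for this example.

```python
import numpy as np


def batch_shuffle_indexed(images, labels, batch_size, rng=None):
    """Return one random batch without modifying the inputs (hypothetical variant)."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    rng = np.random.default_rng() if rng is None else rng
    # One permutation of row indices is applied to both arrays,
    # so each selected image keeps its own label.
    idx = rng.permutation(len(images))[:batch_size]
    return images[idx], labels[idx]


# Toy data in which every image row encodes its own label.
labels = np.arange(10, dtype=np.uint8)
images = np.repeat(labels[:, None], 4, axis=1)

batch_x, batch_y = batch_shuffle_indexed(
    images, labels, batch_size=4, rng=np.random.default_rng(0))
```

Because the toy image rows hold their own label value, comparing `batch_x[:, 0]` with `batch_y` is a quick way to confirm that the pairs stayed aligned.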
The data is now prepared, and training can proceed from here.
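In a training loop, it is common to shuffle once per epoch and then walk through the data batch by batch, rather than reshuffling the whole dataset for every batch. The following is a minimal sketch of that idea under the same pairing requirement; the generator name `iterate_minibatches` and the toy data are assumptions for illustration, not part of the code above.

```python
import numpy as np


def iterate_minibatches(images, labels, batch_size, rng=None):
    """Yield shuffled (images, labels) batches that cover the data once."""
    rng = np.random.default_rng() if rng is None else rng
    # A single shared permutation keeps samples and labels paired.
    idx = rng.permutation(len(images))
    for start in range(0, len(idx), batch_size):
        batch_idx = idx[start:start + batch_size]
        yield images[batch_idx], labels[batch_idx]


# Toy data in which every image row encodes its own label.
labels = np.arange(10, dtype=np.uint8)
images = np.repeat(labels[:, None], 4, axis=1)

seen = []
for batch_x, batch_y in iterate_minibatches(images, labels, batch_size=4):
    # Each row's pixel values still equal its label.
    assert (batch_x[:, 0] == batch_y).all()
    seen.extend(batch_y.tolist())
```

With 10 samples and a batch size of 4, the last batch is smaller than the others; whether to keep or drop such a remainder is a design choice left to the training code.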