The MXNet deep learning framework is becoming increasingly popular, but many people are unsure how to use it correctly: preprocessing the training data, generating the data format the network actually consumes, saving the network model, saving the training logs, and so on. There are plenty of tricks scattered around the web, but most are fragmented. In this post I will walk through building a complete handwritten-digit (MNIST) recognition system from scratch.
I. Installing Python and MXNet
Trick: installing MXNet with pip on Windows may fail, because the Windows build of MXNet depends on the VC++ 2015 (or another version) runtime. This problem does not occur on Linux.
II. Obtaining the handwritten-digit dataset
import mxnet as mx

mnist = mx.test_utils.get_mnist()  # fetch the handwritten-digit dataset
Running this line downloads the MNIST dataset, which consists of four compressed files:
t10k-images-idx3-ubyte.gz: test-set images (gzipped binary)
t10k-labels-idx1-ubyte.gz: labels for the test-set images (gzipped binary)
train-images-idx3-ubyte.gz: training-set images (gzipped binary)
train-labels-idx1-ubyte.gz: labels for the training-set images (gzipped binary)
People often ask me why the downloaded files cannot be opened as images, what the pictures inside look like, and how to read them for training. The answer is that the images have been serialized into binary files; they are not the original image files. So how exactly are the images laid out inside these files? The code below makes it completely clear.
Since I had already downloaded these four files, I read them directly from disk. There is no need to download them again, and sometimes the download does not even complete.
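The read_data helper used in the snippet below parses the raw IDX format of these files. Here is a minimal sketch of such a helper, assuming the same (label, image) return order and the usual scale-to-[0, 1] normalization; mx.test_utils ships an equivalent implementation:

```python
import gzip
import struct

import numpy as np

def read_data(image_url, label_url):
    """Parse a gzipped IDX image file and label file into numpy arrays.

    Each IDX file starts with a big-endian header (magic number, item
    count, and for images the row/column sizes) followed by raw uint8 data.
    """
    with gzip.open(label_url, 'rb') as f:
        magic, num = struct.unpack('>II', f.read(8))      # 8-byte header
        label = np.frombuffer(f.read(), dtype=np.uint8)
    with gzip.open(image_url, 'rb') as f:
        magic, num, rows, cols = struct.unpack('>IIII', f.read(16))
        image = np.frombuffer(f.read(), dtype=np.uint8)
        # reshape to (N, channel, height, width) and scale to [0, 1]
        image = image.reshape(num, 1, rows, cols).astype(np.float32) / 255.0
    return label, image
```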
train_data_path = 'mnist_data/train-images-idx3-ubyte.gz'
train_label_path = 'mnist_data/train-labels-idx1-ubyte.gz'
test_data_path = 'mnist_data/t10k-images-idx3-ubyte.gz'
test_label_path = 'mnist_data/t10k-labels-idx1-ubyte.gz'

train_label, train_data = read_data(image_url=train_data_path, label_url=train_label_path)
test_label, test_data = read_data(image_url=test_data_path, label_url=test_label_path)

print('shape of train_data:', train_data.shape)
print('shape of train_label:', train_label.shape)
print('shape of test_data:', test_data.shape)
print('shape of test_label:', test_label.shape)
Output:
shape of train_data: (60000, 1, 28, 28)
shape of train_label: (60000,)
shape of test_data: (10000, 1, 28, 28)
shape of test_label: (10000,)

If this is your first download, then after running
mnist = mx.test_utils.get_mnist()
you already have a complete handwritten-digit object, mnist, and can pull out the training and test data directly:
train_image = mnist['train_data']
train_image_label = mnist['train_label']
test_image = mnist['test_data']
test_image_label = mnist['test_label']
III. Processing the data
Having downloaded the MNIST dataset and extracted its contents, how do we convert the data into the format training actually needs? The print output above already tells us the images are 28×28 single-channel grayscale. Since we do not rescale the images, the network input must be (batch_size, channel, height, width), so we need to turn the 60000 training images and 10000 test images into iterators yielding (60000 // batch_size) and (10000 // batch_size) blocks of shape (batch_size, channel, height, width), respectively.
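The batch arithmetic above can be sanity-checked with plain NumPy (batch_size = 64 is an assumed value here; use whatever your config specifies):

```python
import numpy as np

batch_size = 64
train = np.zeros((60000, 1, 28, 28), dtype=np.float32)  # stand-in for train_data

# 60000 // 64 = 937 full batches; the trailing 60000 % 64 = 32 images
# are dropped by the integer division.
num_batches = len(train) // batch_size
batches = train[:num_batches * batch_size].reshape(
    num_batches, batch_size, 1, 28, 28)

print(num_batches)     # 937
print(batches.shape)   # (937, 64, 1, 28, 28)
```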
Why must training data in MXNet be fed in as an iterator? 1) It is simple. 2) In MXNet, unlike TensorFlow, you cannot write a for loop that explicitly pushes each batch into the network. So how do you use MXNet's iterators correctly? Sometimes the iterator classes MXNet provides do not cover every need, and you have to subclass them yourself.
Anyone familiar with MXNet knows that its networks mainly accept two kinds of input: 1) image files, 2) NDArray data.
For the former, MXNet's im2rec.py script conveniently packs all the images into a .rec file, which is then fed to the network, again as an iterator object. However, generating the .rec file is slow and takes considerable extra disk space. Is there a way to skip it? Yes: as mentioned above, subclass DataIter and return an iterator object whose every step yields a complete (batch_size, channel, height, width) data block, streaming data into the network continuously. The full code:
class Batch(object):
    def __init__(self, data, label):
        self.data = data
        self.label = label
class Inter(mx.io.DataIter):
    def __init__(self, batch_size, train_data, train_label):
        super(Inter, self).__init__()
        self.batch_size = batch_size
        self.begin = 0
        self.index = 0
        self.train_data = train_data
        self.train_label = train_label
        self.train_count = len(train_data)
        assert len(train_data) == len(train_label), 'Error'
        assert (self.train_count >= self.batch_size) and (self.batch_size > 0), 'Error'
        self.train_batches = self.train_count // self.batch_size

    def __iter__(self):
        return self

    def reset(self):
        self.begin = 0
        self.index = 0

    def next(self):
        if self.iter_next():
            return self.getdata()
        else:
            raise StopIteration

    def __next__(self):
        return self.next()

    def iter_next(self):
        return self.begin < self.train_batches

    def get_batch_images_labels(self):
        data = self.train_data[self.index:self.index + self.batch_size, :, :, :]
        label = self.train_label[self.index:self.index + self.batch_size]
        return data, label

    def getdata(self):
        images, labels = self.get_batch_images_labels()  # fetch batches sequentially
        data_all = [mx.nd.array(images)]
        label_all = [mx.nd.array(labels)]
        self.index += self.batch_size
        self.begin += 1
        return Batch(data_all, label_all)

    def getlabel(self):
        pass

    def getindex(self):
        return None

    def getpad(self):
        pass
The Inter class is a simple re-implementation of MXNet's native iterator class DataIter. As the code shows, each iteration returns a Batch object carrying data and labels, where data has shape (batch_size, channel, height, width) and label has shape (batch_size,). Pay special attention to the iterator's reset method.
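The role of reset() is easiest to see in isolation. Below is a stripped-down, NumPy-only version of the same cursor-based pattern (the class and variable names are mine, not part of the original code):

```python
import numpy as np

class MiniIter:
    """Minimal cursor-based batch iterator, mirroring the Inter class above."""
    def __init__(self, batch_size, data):
        self.batch_size = batch_size
        self.data = data
        self.index = 0

    def reset(self):
        # Rewind the cursor so the next epoch starts from the beginning.
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index + self.batch_size > len(self.data):
            raise StopIteration  # without reset(), the iterator stays exhausted
        batch = self.data[self.index:self.index + self.batch_size]
        self.index += self.batch_size
        return batch

it = MiniIter(batch_size=4, data=np.arange(10))
print([b.tolist() for b in it])   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print([b.tolist() for b in it])   # [] -- already exhausted
it.reset()
print([b.tolist() for b in it])   # [[0, 1, 2, 3], [4, 5, 6, 7]] again
```

This is exactly why the training loop later calls data_train.reset() at the start of each epoch: without it, the second epoch would see no data at all.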
IV. Building the neural network
MXNet has two very important components: symbol and gluon. We can build a neural network with either of them, and we can also build one directly from NDArray objects. I will give code for each approach in turn.
First, the classic LeNet-5 architecture diagram (widely available online):
Trick: notice that the diagram assumes a 32×32 input, while our image matrices are 28×28. The implementation below therefore adjusts the layer sizes slightly.
1. Building LeNet-5 with symbol:
def get_net(class_num, bn_mom=0.99, filter_list=(6, 16)):
    data = mx.sym.Variable('data')
    # batch-normalize the input
    imput = mx.sym.BatchNorm(data=data, fix_gamma=True, eps=1e-5,
                             momentum=bn_mom, name='bn_imput')
    # layer_1: convolution
    layer_1 = mx.sym.Convolution(data=imput, num_filter=filter_list[0], kernel=(5, 5),
                                 stride=(2, 2), pad=(2, 2), no_bias=False,
                                 name="conv_layer_1")
    bn_layer_1 = mx.sym.BatchNorm(data=layer_1, fix_gamma=False, eps=1e-5,
                                  momentum=bn_mom, name='bn_layer_1')
    a_bn_layer_1 = mx.sym.Activation(data=bn_layer_1, act_type='relu',
                                     name='relu_a_bn_layer_1')
    # layer_2: convolution
    bn_layer_2 = mx.sym.BatchNorm(data=a_bn_layer_1, fix_gamma=True, eps=1e-5,
                                  momentum=bn_mom, name='bn_layer_2')
    conv_layer_2 = mx.sym.Convolution(data=bn_layer_2, num_filter=filter_list[1],
                                      kernel=(5, 5), stride=(1, 1), pad=(0, 0),
                                      no_bias=False, name="conv_layer_2")
    bn_layer_2_1 = mx.sym.BatchNorm(data=conv_layer_2, fix_gamma=False, eps=1e-5,
                                    momentum=bn_mom, name='bn_layer_2_1')
    a_bn_layer_2 = mx.sym.Activation(data=bn_layer_2_1, act_type='relu',
                                     name='relu_a_a_bn_layer_2')
    # pooling layer
    pooling_layer_2 = mx.symbol.Pooling(data=a_bn_layer_2, kernel=(5, 5), stride=(2, 2),
                                        pad=(2, 2), pool_type='max',
                                        name='pooling_layer_2')
    # fully connected layers
    fc = mx.symbol.FullyConnected(data=pooling_layer_2, num_hidden=120, flatten=True,
                                  no_bias=False, name='fc')
    bn1_fc = mx.sym.BatchNorm(data=fc, fix_gamma=False, eps=1e-5, momentum=bn_mom,
                              name='bn1_fc')
    fc1 = mx.symbol.FullyConnected(data=bn1_fc, num_hidden=84, flatten=True,
                                   no_bias=False, name='fc1')
    bn1_fc1 = mx.sym.BatchNorm(data=fc1, fix_gamma=False, eps=1e-5, momentum=bn_mom,
                               name='bn1_fc1')
    fc2 = mx.symbol.FullyConnected(data=bn1_fc1, num_hidden=class_num, flatten=True,
                                   no_bias=False, name='fc2')
    bn1_fc2 = mx.sym.BatchNorm(data=fc2, fix_gamma=False, eps=1e-5, momentum=bn_mom,
                               name='bn1_fc2')
    return mx.symbol.SoftmaxOutput(data=bn1_fc2, name='softmax')
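The "slight adjustment" for the 28×28 input is easy to verify with the standard output-size formula, out = floor((h + 2p - k) / s) + 1, applied layer by layer:

```python
def out_size(h, k, s, p):
    """Spatial output size of a conv/pool layer: floor((h + 2p - k) / s) + 1."""
    return (h + 2 * p - k) // s + 1

h = 28                            # MNIST input, not LeNet's original 32
h = out_size(h, k=5, s=2, p=2)    # conv_layer_1  -> 14
h = out_size(h, k=5, s=1, p=0)    # conv_layer_2  -> 10
h = out_size(h, k=5, s=2, p=2)    # pooling_layer_2 -> 5
print(h, 16 * h * h)              # 5 400 -> 400 inputs reach the first FC layer
```

The stride-2 first convolution replaces LeNet-5's first pooling stage, which is how the smaller 28×28 input still ends up as 16 feature maps of 5×5 before the 120-unit fully connected layer.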
2. Building LeNet-5 with gluon:
from mxnet.gluon import nn

def create_net():
    net = nn.Sequential()
    with net.name_scope():
        net.add(
            nn.BatchNorm(epsilon=1e-5, momentum=0.9),
            nn.Conv2D(channels=6, kernel_size=5, strides=2, padding=2, activation='relu'),
            nn.BatchNorm(epsilon=1e-5, momentum=0.9),
            nn.Conv2D(channels=16, kernel_size=5, strides=1, padding=0, activation='relu'),
            nn.BatchNorm(epsilon=1e-5, momentum=0.9),
            nn.AvgPool2D(pool_size=2, strides=2, padding=2),
            nn.Flatten(),
            nn.BatchNorm(epsilon=1e-5, momentum=0.9),
            nn.Dense(120, activation='relu'),
            nn.BatchNorm(epsilon=1e-5, momentum=0.9),
            nn.Dense(84, activation='relu'),
            nn.BatchNorm(epsilon=1e-5, momentum=0.9),
            nn.Dense(10)
        )
    return net
3. Building LeNet-5 with MXNet's ndarray (as distinct from NumPy's array):
import mxnet as mx
from mxnet import nd

ctx = mx.cpu()  # compute device

# first conv layer: 6 output channels, 5x5 kernel
W1 = nd.random_normal(shape=(6, 1, 5, 5), scale=.1, ctx=ctx)
b1 = nd.zeros(W1.shape[0], ctx=ctx)
# second conv layer: 16 output channels, 5x5 kernel
W2 = nd.random_normal(shape=(16, 6, 5, 5), scale=.1, ctx=ctx)
b2 = nd.zeros(W2.shape[0], ctx=ctx)
# first fully connected layer (16 * 5 * 5 = 400 inputs)
W3 = nd.random_normal(shape=(400, 120), scale=.1, ctx=ctx)
b3 = nd.zeros(W3.shape[1], ctx=ctx)
# second fully connected layer
W4 = nd.random_normal(shape=(W3.shape[1], 84), scale=.1, ctx=ctx)
b4 = nd.zeros(W4.shape[1], ctx=ctx)
# third fully connected (output) layer
W5 = nd.random_normal(shape=(W4.shape[1], 10), scale=.1, ctx=ctx)
b5 = nd.zeros(W5.shape[1], ctx=ctx)

params = [W1, b1, W2, b2, W3, b3, W4, b4, W5, b5]
for param in params:
    param.attach_grad()

def net(X, verbose=False):
    # The batch-normalization steps of the symbol/gluon versions are omitted
    # here: in the ndarray API, BatchNorm would need its own gamma/beta and
    # moving-statistic arrays as explicit parameters, which would clutter
    # the example.
    X = X.as_in_context(W1.context)
    # first conv layer (stride 2, pad 2: 28x28 -> 14x14)
    h1_conv = nd.Convolution(data=X, weight=W1, bias=b1, kernel=W1.shape[2:],
                             stride=(2, 2), pad=(2, 2), num_filter=W1.shape[0])
    h1_activation = nd.relu(h1_conv)
    # second conv layer (no padding: 14x14 -> 10x10)
    h2_conv = nd.Convolution(data=h1_activation, weight=W2, bias=b2,
                             kernel=W2.shape[2:], num_filter=W2.shape[0])
    h2_activation = nd.relu(h2_conv)
    # pooling layer (10x10 -> 5x5)
    pooling_layer_2 = nd.Pooling(data=h2_activation, kernel=(5, 5), stride=(2, 2),
                                 pad=(2, 2), pool_type='max')
    # flatten to (batch_size, 16 * 5 * 5)
    fla = nd.flatten(data=pooling_layer_2)
    # fully connected layer 1
    fullcollect_layer = nd.relu(nd.dot(fla, W3) + b3)
    # fully connected layer 2
    fullcollect_layer_2 = nd.relu(nd.dot(fullcollect_layer, W4) + b4)
    # fully connected layer 3 (output logits)
    fullcollect_layer_3 = nd.dot(fullcollect_layer_2, W5) + b5
    if verbose:
        print('network structure:')
        print('first conv layer:', h1_activation.shape)
        print('second conv layer:', h2_activation.shape)
        print('pooling layer:', pooling_layer_2.shape)
        print('first FC layer:', fullcollect_layer.shape)
        print('second FC layer:', fullcollect_layer_2.shape)
        print('output layer:', fullcollect_layer_3.shape)
    return fullcollect_layer_3
V. Training the model
With the data processed and the model built, we can feed the data into the network and train.
Trick: the training code differs slightly depending on which component you used to build the network. Below I explain how to write the training program for each case.
1. If the network was built with the symbol component described above, the training program looks like this.
1) Configure the training-log output format:
# check that the output paths exist
Util.check_all_path([config.saved_model_path,
                     config.train_test_log_save_path.replace('/resnet_log.log', '')])
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO,
                    format='%(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename=config.train_test_log_save_path,
                    filemode='w')
2) Create the data-iterator objects:
train_data, train_label, test_data, test_label = get_all_avaliable_data(
    config.train_data_path, config.train_label_path,
    config.test_data_path, config.test_label_path)
data_train = Inter(config.batch_size, train_data, train_label)      # training-set iterator
_eval_data = Inter(config.batch_size * 2, test_data, test_label)    # test-set iterator
3) Train:
softmax_out = get_net(class_num=10, bn_mom=0.99, filter_list=[6, 16])
model = mx.mod.Module(symbol=softmax_out, context=mx.cpu(),
                      data_names=['data'], label_names=['softmax_label'])
model.fit(data_train,
          eval_data=_eval_data,
          optimizer='sgd',
          initializer=mx.init.Xavier(rnd_type='gaussian', factor_type='in', magnitude=2),
          eval_metric=['acc', 'ce'],
          optimizer_params={'learning_rate': config.learning_rate,
                            'momentum': config.momentum},
          batch_end_callback=mx.callback.Speedometer(config.batch_size, 1),
          epoch_end_callback=mx.callback.do_checkpoint(config.saved_model_path),
          num_epoch=config.num_epoch)
4) The complete code for this part:
import logging

import mxnet as mx

from net import get_net
from tool import Inter, Test
from util import Util
import config
from lodad_data import get_all_avaliable_data

# check that the output paths exist
Util.check_all_path([config.saved_model_path,
                     config.train_test_log_save_path.replace('/resnet_log.log', '')])
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO,
                    format='%(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename=config.train_test_log_save_path,
                    filemode='w')

if __name__ == '__main__':
    softmax_out = get_net(class_num=10, bn_mom=0.99, filter_list=[6, 16])
    model = mx.mod.Module(symbol=softmax_out, context=mx.cpu(),
                          data_names=['data'], label_names=['softmax_label'])
    train_data, train_label, test_data, test_label = get_all_avaliable_data(
        config.train_data_path, config.train_label_path,
        config.test_data_path, config.test_label_path)
    data_train = Inter(config.batch_size, train_data, train_label)      # training-set iterator
    _eval_data = Inter(config.batch_size * 2, test_data, test_label)    # test-set iterator
    model.fit(data_train,
              eval_data=_eval_data,
              optimizer='sgd',
              initializer=mx.init.Xavier(rnd_type='gaussian', factor_type='in', magnitude=2),
              eval_metric=['acc', 'ce'],
              optimizer_params={'learning_rate': config.learning_rate,
                                'momentum': config.momentum},
              batch_end_callback=mx.callback.Speedometer(config.batch_size, 1),
              epoch_end_callback=mx.callback.do_checkpoint(config.saved_model_path),
              num_epoch=config.num_epoch)
2. If the network was built with the gluon component described above, the complete training code is as follows:
def accuracy(output, label):
    return nd.mean(output.argmax(axis=1) == label).asscalar()

def evaluate_accuracy(_test_data, net):
    acc = 0.
    for batch in _test_data:
        data = batch.data[0].as_in_context(ctx)
        label = batch.label[0].as_in_context(ctx)
        output = net(data)
        acc += accuracy(output, label)
    return acc / eval_data_batch_count

def main():
    train_data, train_label, test_data, test_label = get_all_avaliable_data(
        config.train_data_path, config.train_label_path,
        config.test_data_path, config.test_label_path)
    data_train = Inter(config.batch_size, train_data, train_label)
    _eval_data = Inter(config.batch_size, test_data, test_label)
    global train_data_batch_count
    global eval_data_batch_count
    global train_step
    train_step = 0
    train_data_batch_count = len(train_data) // config.batch_size   # 937
    eval_data_batch_count = len(test_data) // config.batch_size     # 156
    # save the training log
    log = open(file='train_test_log/resnet_log.log', mode='w')
    softmax_cross_entropy_loss = gluon.loss.SoftmaxCrossEntropyLoss()
    net = create_net()
    net.initialize(ctx=ctx)  # initialize the network parameters
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {"learning_rate": 0.5})
    for epoch in range(5):
        all_train_loss = 0.
        all_train_acc = 0.
        data_train.reset()   # without this line, the dataset is only iterated once
        _eval_data.reset()   # without this line, the dataset is only iterated once
        for batch in data_train:
            train_step += 1
            data = batch.data[0].as_in_context(ctx)    # choose the compute device
            label = batch.label[0].as_in_context(ctx)
            with autograd.record():
                output = net(data)
                loss = softmax_cross_entropy_loss(output, label)
            loss.backward()
            trainer.step(config.batch_size)
            train_loss = nd.mean(loss).asscalar()
            train_acc = accuracy(output, label)
            all_train_loss += train_loss
            all_train_acc += train_acc
            log.writelines("Epoch:%d, train_step: %d, loss: %f, Train_acc: %f \n"
                           % (epoch, train_step, train_loss, train_acc))
        test_acc = evaluate_accuracy(_eval_data, net)
        log.writelines("\n\nEpoch:%d, avg_train_loss: %f, avg_train_acc: %f, Test_acc: %f \n"
                       % (epoch, all_train_loss / train_data_batch_count,
                          all_train_acc / train_data_batch_count, test_acc))

if __name__ == '__main__':
    main()

Some of the utility functions used above are not shown in full. You can download them from my GitHub, or leave a comment and I will share the complete code.
3. The locally saved training log:
Epoch:0, train_step: 1, loss: 2.417782, Train_acc: 0.093750
Epoch:0, train_step: 2, loss: 2.147448, Train_acc: 0.218750
Epoch:0, train_step: 3, loss: 2.077140, Train_acc: 0.406250
Epoch:0, train_step: 4, loss: 1.847961, Train_acc: 0.437500
Epoch:0, train_step: 5, loss: 1.075216, Train_acc: 0.671875
Epoch:0, train_step: 6, loss: 0.592741, Train_acc: 0.859375
Epoch:0, train_step: 7, loss: 0.643913, Train_acc: 0.828125
Epoch:0, train_step: 8, loss: 0.837896, Train_acc: 0.796875
Epoch:0, train_step: 9, loss: 0.582398, Train_acc: 0.859375
Epoch:0, train_step: 10, loss: 0.750824, Train_acc: 0.765625
Epoch:0, train_step: 11, loss: 0.532329, Train_acc: 0.781250
Epoch:0, train_step: 12, loss: 0.583528, Train_acc: 0.796875
Epoch:0, train_step: 13, loss: 0.422033, Train_acc: 0.921875
Epoch:0, train_step: 14, loss: 0.829014, Train_acc: 0.718750
Epoch:0, train_step: 15, loss: 0.643326, Train_acc: 0.812500
Epoch:0, train_step: 16, loss: 0.667152, Train_acc: 0.828125
Epoch:0, train_step: 17, loss: 0.743936, Train_acc: 0.796875
Epoch:0, train_step: 18, loss: 0.640609, Train_acc: 0.718750
Epoch:0, train_step: 19, loss: 0.578947, Train_acc: 0.843750
Epoch:0, train_step: 20, loss: 0.678622, Train_acc: 0.796875
Epoch:0, train_step: 21, loss: 0.659916, Train_acc: 0.781250
Epoch:0, train_step: 22, loss: 0.886372, Train_acc: 0.703125
Epoch:0, train_step: 23, loss: 0.498017, Train_acc: 0.812500
Epoch:0, train_step: 24, loss: 0.339886, Train_acc: 0.890625
Epoch:0, train_step: 25, loss: 0.383869, Train_acc: 0.890625
Epoch:0, train_step: 26, loss: 0.352800, Train_acc: 0.890625
Epoch:0, train_step: 27, loss: 0.235351, Train_acc: 0.921875
Epoch:0, train_step: 28, loss: 0.335911, Train_acc: 0.906250
Epoch:0, train_step: 29, loss: 0.321678, Train_acc: 0.906250
Epoch:0, train_step: 30, loss: 0.214269, Train_acc: 0.937500
Epoch:0, train_step: 31, loss: 0.194405, Train_acc: 0.937500
Epoch:0, train_step: 32, loss: 0.229423, Train_acc: 0.937500
Epoch:0, train_step: 33, loss: 0.357825, Train_acc: 0.921875
Epoch:0, train_step: 34, loss: 0.093697, Train_acc: 0.984375
Epoch:0, train_step: 35, loss: 0.236372, Train_acc: 0.906250
Epoch:0, train_step: 36, loss: 0.171640, Train_acc: 0.921875
Epoch:0, train_step: 37, loss: 0.760929, Train_acc: 0.828125
Epoch:0, train_step: 38, loss: 0.425227, Train_acc: 0.890625
Epoch:0, train_step: 39, loss: 0.419191, Train_acc: 0.875000
Epoch:0, train_step: 40, loss: 0.206767, Train_acc: 0.906250
Epoch:0, train_step: