MindSpore Learning: Network Migration Debugging and Tuning
- Using ResNet50 as an example
Migration Workflow
- Migration targets: network implementation, dataset, convergence accuracy, training performance
- Reproducing metrics: reproduce not only the training phase but also the inference phase; minor differences fall within the normal fluctuation range.
- Reproduction steps: single-step reproduction plus full-network integration. First reproduce the result of a single step, i.e. capture the state of the network after running only the first step (the results after data preprocessing, weight initialization, forward computation, loss computation, backward gradient computation and optimizer update), then iterate multiple times to reproduce the full network run.
Preparation
- Install MindSpore, Python and the related environment.
- ResNet50 is a classic deep neural network in CV; the mainstream ResNet family implementations include ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152. ResNet50 uses the ImageNet2012 dataset.
Network Analysis
- MindSpore supports both dynamic graph (PyNative) mode and static graph (Graph) mode. Dynamic graph mode is flexible and easy to debug, so it is mainly used for network debugging; static graph mode has better performance and is mainly used for whole-network training. When analyzing missing operators and features, analyze both modes separately.
- If any operators or features are missing, first consider composing them from existing operators or features.
- ResNet family network structure
- Operator analysis: refer to the operator mapping.
  Matched operators: (nn.Conv2D-nn.Conv2d, nn.BatchNorm2D-nn.BatchNorm2d, nn.ReLU-nn.ReLU, nn.MaxPool2D-nn.MaxPool2d, nn.Linear-nn.Dense, torch.flatten-nn.Flatten)
  Missing operator: nn.AdaptiveAvgPool2D
  Workaround for the missing operator: in ResNet50 the input image shape is fixed at N,3,224,224, where N is the batch size, 3 is the number of channels, and 224 and 224 are the image width and height. The operators that change the image size in the network are Conv2d and MaxPool2d, and their effect on the shape is fixed, so the input and output shapes of nn.AdaptiveAvgPool2D can be determined in advance. Once those shapes are computed, the operator can be implemented with nn.AvgPool2d or ops.ReduceMean, so the missing operator has a substitute and does not affect network training (see the sketch below).
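A minimal sketch of this substitution (it assumes the (N, 2048, 7, 7) feature map that ResNet50 produces right before the global average pooling):
import numpy as np
import mindspore.ops as ops
from mindspore import Tensor

# With a fixed 224x224 input, the feature map before the final pooling is always
# (N, 2048, 7, 7), so AdaptiveAvgPool2d((1, 1)) reduces to a plain mean over H and W.
x = Tensor(np.ones((1, 2048, 7, 7)).astype(np.float32))
reduce_mean = ops.ReduceMean(keep_dims=True)
y = reduce_mean(x, (2, 3))
print(y.shape)  # (1, 2048, 1, 1), same result as nn.AdaptiveAvgPool2d((1, 1)) in PyTorch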
- Other functionality mapping

PyTorch feature | MindSpore equivalent
---|---
nn.init.kaiming_normal_ | initializer(init='HeNormal')
nn.init.constant_ | initializer(init='Constant')
nn.Sequential | nn.SequentialCell
nn.Module | nn.Cell
torch.distributed | context.set_auto_parallel_context
torch.optim.SGD | nn.optim.SGD or nn.optim.Momentum
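As an illustration of the first two rows, weight initialization in MindSpore is typically passed in when the layer or parameter is created; a small sketch (not code from the original script):
import mindspore.nn as nn
from mindspore.common.initializer import initializer, HeNormal

# torch.nn.init.kaiming_normal_(conv.weight) roughly corresponds to creating the
# layer (or its parameter data) with a HeNormal initializer in MindSpore.
conv = nn.Conv2d(3, 64, kernel_size=3, weight_init=HeNormal())
# initializer() can also build the data explicitly and assign it afterwards.
conv.weight.set_data(initializer(HeNormal(), conv.weight.shape, conv.weight.dtype))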
Network Script Development
- CIFAR-10 and CIFAR-100 dataset download: http://www.cs.toronto.edu/~kriz/cifar.html
- CIFAR-10: 10 classes, 60,000 32*32 color images stored as binary files; the data is processed in dataset.py.
  - Training set: 50,000 images
  - Test set: 10,000 images
- ImageNet2012 download: https://image-net.org/
- ImageNet2012: 1000 classes of 224*224 color images in JPEG format; the data is processed in dataset.py.
  - Training set: 1,281,167 images
  - Test set: 50,000 images
Dataset Processing
- Data preprocessing with MindData mainly consists of the following steps:
  - Pass in the data path and read the data files.
  - Parse the data.
  - Process the data (common operations such as splitting, shuffle, data augmentation, etc.).
  - Distribute the data (dispatch it in units of batch_size; distributed training involves multi-device distribution).
- The ResNet50 network uses the ImageNet2012 dataset; the PyTorch preprocessing looks like this:
# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
input_image = Image.open(filename)  # filename: path to an arbitrary input JPEG
preprocess = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model
- The main operations are Resize, CenterCrop and Normalize.
Data processing developed with MindData:
"""
create train or eval dataset.
"""
from mindspore import dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C
import mindspore.dataset.transforms.c_transforms as C2
# Create the train or eval dataset (dataset path, batch_size, rank_size: number of devices, rank_id: index of this device among all devices, training mode)
def create_dataset(dataset_path, batch_size=32, rank_size=1, rank_id=0, do_train=True):
    # num_parallel_workers: parallel degree of the data processing pipeline
    # num_shards: total number of devices for distributed training, which equals the number of data shards
    # shard_id: the index of the current device among all distributed training devices,
    #           which equals the data shard index for the current device
    data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=do_train,
                                     num_shards=rank_size, shard_id=rank_id)
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
# define map operations
trans = [
C.Decode(),
C.Resize(256),
C.CenterCrop(224),
C.Normalize(mean=mean, std=std),
C.HWC2CHW()
]
    type_cast_op = C2.TypeCast(mstype.int32)  # cast the label column to int32
# call data operations by map
data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
# apply batch operations batch_size
data_set = data_set.batch(batch_size, drop_remainder=do_train)
return data_set
- Distributed training additionally requires the num_shards and shard_id parameters (a single-device usage sketch follows below).
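A hypothetical single-device call of create_dataset (the path is a placeholder matching the training command used later):
ds_train = create_dataset("./data/imagenet_original/train/", batch_size=32, rank_size=1, rank_id=0, do_train=True)
print(ds_train.get_dataset_size())  # number of batches per epoch
for item in ds_train.create_dict_iterator(num_epochs=1):
    print(item["image"].shape, item["label"].shape)  # (32, 3, 224, 224) (32,)
    break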
Subnet Development: training subnet and loss subnet
- Splitting the different modules or submodules of the network into separate subnets and developing them individually ensures that the subnets can be developed in parallel without interfering with each other.
Analyzing the ResNet50 network code, it can be divided into the following subnets:
- conv1x1, conv3x3: convolutions with different kernel_size values.
- BasicBlock: the minimal subnet of ResNet18 and ResNet34 in the ResNet family, composed of Conv, BN, ReLU and a residual connection.
- BottleNeck: the minimal subnet of ResNet50, ResNet101 and ResNet152, which has one more Conv + BN + ReLU stage than BasicBlock, and the position of the downsampling convolution is also different.
- ResNet: the network that wraps the BasicBlock, BottleNeck and Layer structures; different ResNet family networks can be constructed by passing in different parameters. This structure also uses some custom initialization features from PyTorch.
Re-implement conv3x3 and conv1x1:
import mindspore.nn as nn
# 3x3 convolution
def _conv3x3(in_channel, out_channel, stride=1):
return nn.Conv2d(in_channel, out_channel, kernel_size=3, stride=stride, padding=0, pad_mode='same')
# 1x1 convolution
def _conv1x1(in_channel, out_channel, stride=1):
return nn.Conv2d(in_channel, out_channel, kernel_size=1, stride=stride, padding=0, pad_mode='same')
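The residual blocks below also call _conv7x7, _bn, _bn_last and _fc helpers that are not listed in this note; a minimal sketch of plausible definitions (the BatchNorm and Dense hyperparameters are assumptions):
def _conv7x7(in_channel, out_channel, stride=1):
    return nn.Conv2d(in_channel, out_channel, kernel_size=7, stride=stride, padding=0, pad_mode='same')

def _bn(channel):
    return nn.BatchNorm2d(channel, eps=1e-4, momentum=0.9, gamma_init=1, beta_init=0)

def _bn_last(channel):
    # gamma initialized to 0 so the residual branch starts close to identity
    return nn.BatchNorm2d(channel, eps=1e-4, momentum=0.9, gamma_init=0, beta_init=0)

def _fc(in_channel, out_channel):
    return nn.Dense(in_channel, out_channel, has_bias=True)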
Re-implement BasicBlock and BottleNeck:
# Residual subnet for ResNet50/ResNet101/ResNet152 (input channels, output channels, convolution stride), e.g. ResidualBlock(3, 256, stride=2)
class ResidualBlock(nn.Cell):
    expansion = 4  # bottleneck expansion ratio
    def __init__(self, in_channel, out_channel, stride=1):
        super(ResidualBlock, self).__init__()
        self.stride = stride
        channel = out_channel // self.expansion
        self.conv1 = _conv1x1(in_channel, channel, stride=1)  # 1x1 convolution
        self.bn1 = _bn(channel)  # BatchNorm
        if self.stride != 1:  # stride is not 1
            self.e2 = nn.SequentialCell([_conv3x3(channel, channel, stride=1), _bn(channel),
                                         nn.ReLU(), nn.MaxPool2d(kernel_size=2, stride=2, pad_mode='same')])
        else:  # stride is 1
            self.conv2 = _conv3x3(channel, channel, stride=stride)
            self.bn2 = _bn(channel)
        self.conv3 = _conv1x1(channel, out_channel, stride=1)  # 1x1 convolution
        self.bn3 = _bn_last(out_channel)  # last BatchNorm of the block
        self.relu = nn.ReLU()  # activation
        self.down_sample = False  # downsampling flag
        if stride != 1 or in_channel != out_channel:  # downsampling needed
            self.down_sample = True
        self.down_sample_layer = None
        if self.down_sample:  # downsampling branch
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride), _bn(out_channel)])
    def construct(self, x):
        identity = x
        out = self.conv1(x)  # 1x1 convolution
        out = self.bn1(out)  # BatchNorm
        out = self.relu(out)  # activation
        if self.stride != 1:  # stride is not 1
            out = self.e2(out)
        else:  # stride is 1
            out = self.conv2(out)
            out = self.bn2(out)
            out = self.relu(out)
        out = self.conv3(out)  # 1x1 convolution
        out = self.bn3(out)  # BatchNorm
        if self.down_sample:  # downsampling: project identity to the output shape
            identity = self.down_sample_layer(identity)
        out = out + identity  # residual addition
        out = self.relu(out)  # activation
        return out
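A quick shape check of the bottleneck block above (assuming the helper functions sketched earlier are defined):
import numpy as np
from mindspore import Tensor

block = ResidualBlock(64, 256, stride=1)
x = Tensor(np.ones((1, 64, 56, 56)).astype(np.float32))
print(block(x).shape)  # (1, 256, 56, 56); the spatial size halves when stride=2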
# Residual subnet for ResNet18 and ResNet34 (input channels, output channels, convolution stride), e.g. ResidualBlockBase(3, 256, stride=2)
class ResidualBlockBase(nn.Cell):
    def __init__(self, in_channel, out_channel, stride=1):
        super(ResidualBlockBase, self).__init__()
        self.conv1 = _conv3x3(in_channel, out_channel, stride=stride)  # 3x3 convolution
        self.bn1d = _bn(out_channel)  # BatchNorm
        self.conv2 = _conv3x3(out_channel, out_channel, stride=1)  # 3x3 convolution
        self.bn2d = _bn(out_channel)  # BatchNorm
        self.relu = nn.ReLU()  # activation
        self.down_sample = False  # downsampling flag
        if stride != 1 or in_channel != out_channel:
            self.down_sample = True
        self.down_sample_layer = None  # downsampling branch
        if self.down_sample:
            self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                        _bn(out_channel)])
    # Structure of Figure 2 in the paper: residual addition after two 3x3 subnets
    def construct(self, x):
        identity = x  # input
        out = self.conv1(x)  # 3x3 convolution with the given stride
        out = self.bn1d(out)  # BatchNorm
        out = self.relu(out)  # activation
        out = self.conv2(out)  # 3x3 convolution with stride 1
        out = self.bn2d(out)  # BatchNorm
        if self.down_sample:  # if input and output shapes differ, project the input so it can be added as the residual
            identity = self.down_sample_layer(identity)
        out = out + identity  # residual addition
        out = self.relu(out)  # activation
        return out
Re-implement the full ResNet network:
# Using ResNet50 as an example
import mindspore.ops as ops

class ResNet(nn.Cell):
    """
    block (Cell): residual block type
    layer_nums (list): number of blocks in each stage
    in_channels (list): input channels of each stage
    out_channels (list): output channels of each stage
    strides (list): stride of each stage
    Examples:
        >>> ResNet(ResidualBlock,
        >>>        [3, 4, 6, 3],
        >>>        [64, 256, 512, 1024],
        >>>        [256, 512, 1024, 2048],
        >>>        [1, 2, 2, 2],
        >>>        10)
    """
    def __init__(self, block, layer_nums, in_channels, out_channels, strides, num_classes):
        super(ResNet, self).__init__()
        if not len(layer_nums) == len(in_channels) == len(out_channels) == 4:  # validate the inputs
            raise ValueError("the length of layer_num, in_channels, out_channels list must be 4!")
        # Layer 1: 7x7 convolution + pooling, stride 2
        self.conv1 = _conv7x7(3, 64, stride=2)
        self.bn1 = _bn(64)  # BatchNorm
        self.relu = ops.ReLU()  # activation
        # Max pooling, 3x3 kernel, stride 2, same padding
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same")
        # Stage 1: 3*3 layers, input channels 64, output channels 256, stride 1
        self.layer1 = self._make_layer(block, layer_nums[0], in_channel=in_channels[0], out_channel=out_channels[0],
                                       stride=strides[0])
        # Stage 2: 4*3 layers, input channels 256, output channels 512, stride 2
        self.layer2 = self._make_layer(block, layer_nums[1], in_channel=in_channels[1], out_channel=out_channels[1],
                                       stride=strides[1])
        # Stage 3: 6*3 layers, input channels 512, output channels 1024, stride 2
        self.layer3 = self._make_layer(block, layer_nums[2], in_channel=in_channels[2], out_channel=out_channels[2],
                                       stride=strides[2])
        # Stage 4: 3*3 layers, input channels 1024, output channels 2048, stride 2
        self.layer4 = self._make_layer(block, layer_nums[3], in_channel=in_channels[3], out_channel=out_channels[3],
                                       stride=strides[3])
        # Output layers
        self.mean = ops.ReduceMean(keep_dims=True)  # global average pooling
        self.flatten = nn.Flatten()  # flatten
        self.end_point = _fc(out_channels[3], num_classes)  # fully connected layer
    def _make_layer(self, block, layer_num, in_channel, out_channel, stride):
        """
        Args:
            block (Cell): residual block
            layer_num (int): number of blocks in this stage
            in_channel (int): input channels of this stage
            out_channel (int): output channels of this stage
            stride (int): stride of the first convolution layer.
        Examples:
            >>> _make_layer(ResidualBlock, 3, 128, 256, 2)
        """
        layers = []  # layers of this stage
        resnet_block = block(in_channel, out_channel, stride=stride)  # residual block
        layers.append(resnet_block)  # add the first block, which has a different stride and input/output channels
        for _ in range(1, layer_num):  # add the remaining blocks with stride 1
            resnet_block = block(out_channel, out_channel, stride=1)
            layers.append(resnet_block)
        return nn.SequentialCell(layers)  # assemble the stage
    def construct(self, x):
        # Layer 1: 7x7 convolution, stride 2
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        # Max pooling, 3x3 kernel, stride 2, same padding
        c1 = self.maxpool(x)
        # Layers 2-49: the 48 intermediate layers in the 4 stages
        c2 = self.layer1(c1)  # layers 2-10
        c3 = self.layer2(c2)  # layers 11-22
        c4 = self.layer3(c3)  # layers 23-40
        c5 = self.layer4(c4)  # layers 41-49
        # Output layer: global average pooling + fully connected layer
        out = self.mean(c5, (2, 3))
        out = self.flatten(out)
        out = self.end_point(out)
        return out
Pass in the ResNet50 layer configuration to construct the full ResNet50 network:
# class_num: number of classes in the dataset, e.g. net = resnet50(10)
def resnet50(class_num=10):
return ResNet(ResidualBlock,
[3, 4, 6, 3],
[64, 256, 512, 1024],
[256, 512, 1024, 2048],
[1, 2, 2, 2],
class_num)
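A quick sanity check of the assembled network (again assuming the helper functions above are defined):
import numpy as np
from mindspore import Tensor

net = resnet50(class_num=10)
dummy = Tensor(np.ones((1, 3, 224, 224)).astype(np.float32))
print(net(dummy).shape)  # expected (1, 10)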
Other Modules
- Backward construction, gradient clipping, optimizer, learning rate generation, etc.
ResNet50 training mainly involves the following:
- the SGD + Momentum optimizer
- WeightDecay (but not applied to the gamma and bias of BatchNorm)
- a cosine LR schedule
- Label Smoothing
Implement the SGD optimizer with Momentum and apply WeightDecay to all weights except the gamma and bias of BN:
# SGD optimizer with Momentum; group the parameters so weight decay skips BN gamma/beta and biases
decayed_params = []
no_decayed_params = []
for param in net.trainable_params():
if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
decayed_params.append(param)
else:
no_decayed_params.append(param)
group_params = [{'params': decayed_params, 'weight_decay': weight_decay},
{'params': no_decayed_params},
{'order_params': net.trainable_params()}]
opt = Momentum(group_params, lr, momentum)
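The lr passed to Momentum above comes from the cosine LR schedule; a minimal sketch using MindSpore's built-in helper (step_size and the hyperparameter values here are illustrative assumptions):
import mindspore.nn as nn

step_size = 100    # batches per epoch, i.e. dataset.get_dataset_size()
epoch_size = 90    # total number of training epochs
lr = nn.dynamic_lr.cosine_decay_lr(min_lr=0.0, max_lr=0.1,
                                   total_step=epoch_size * step_size,
                                   step_per_epoch=step_size,
                                   decay_epoch=epoch_size)
print(len(lr), lr[0], lr[-1])  # one learning-rate value per training step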
Define the loss function and implement Label Smoothing:
import mindspore.nn as nn
from mindspore import Tensor
from mindspore import dtype as mstype
from mindspore.nn import LossBase
import mindspore.ops as ops
# define cross entropy loss
class CrossEntropySmooth(LossBase):
"""CrossEntropy"""
def __init__(self, sparse=True, reduction='mean', smooth_factor=0., num_classes=1000):
super(CrossEntropySmooth, self).__init__()
self.onehot = ops.OneHot()
self.sparse = sparse
self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
self.off_value = Tensor(1.0 * smooth_factor / (num_classes - 1), mstype.float32)
self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction=reduction)
def construct(self, logit, label):
if self.sparse:
label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value)
loss = self.ce(logit, label)
return loss
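A hypothetical smoke test of the loss defined above:
import numpy as np
from mindspore import Tensor

loss_fn = CrossEntropySmooth(sparse=True, reduction='mean', smooth_factor=0.1, num_classes=1000)
logits = Tensor(np.random.randn(2, 1000).astype(np.float32))
labels = Tensor(np.array([3, 7]).astype(np.int32))
print(loss_fn(logits, labels))  # a scalar loss value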
End-to-End Pipeline
Standalone Training
- The code above is reorganized as follows:
.
├── scripts
│   ├── run_distribute_train.sh    # launch Ascend distributed training (8 devices)
│   ├── run_eval.sh                # launch Ascend evaluation
│   └── run_standalone_train.sh    # launch Ascend standalone training (single device)
├── src
│   ├── config.py                  # configuration file
│   ├── cross_entropy_smooth.py    # loss definition
│   ├── dataset.py                 # data preprocessing
│   └── resnet.py                  # network structure
├── eval.py                        # inference flow
└── train.py                       # training flow
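src/config.py is not listed in this note; a minimal sketch built on easydict (installed in the run commands below), where all values are illustrative assumptions rather than verified hyperparameters:
# src/config.py (sketch)
from easydict import EasyDict as ed

config = ed({
    "class_num": 1000,
    "batch_size": 256,
    "lr": 0.1,                    # initial (maximum) learning rate
    "lr_end": 0.0,                # final learning rate of the cosine schedule
    "warmup": 0,                  # warmup epochs
    "epoch_size": 90,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "label_smooth_factor": 0.1,
    "save_checkpoint": True,
    "save_checkpoint_epochs": 5,
    "keep_checkpoint_max": 10,
    "save_checkpoint_path": "./ckpt",
})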
train.py is defined as follows:
import os
import argparse
import ast
from mindspore import context, set_seed, Model
from mindspore.nn import Momentum
from mindspore.context import ParallelMode
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
from mindspore.communication import init
from mindspore.common import initializer
import mindspore.nn as nn
from src.config import config
from src.dataset import create_dataset
from src.resnet import resnet50
from src.cross_entropy_smooth import CrossEntropySmooth
# Set the random seed
set_seed(1)
# Parse command-line arguments
parser = argparse.ArgumentParser(description='Image classification')
parser.add_argument('--run_distribute', type=ast.literal_eval, default=False, help='Run distribute')  # distributed training
parser.add_argument('--device_num', type=int, default=1, help='Device num.')  # number of devices
parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')  # dataset storage path
parser.add_argument('--device_target', type=str, default='GPU', choices=("Ascend", "GPU", "CPU"),
                    help='Device target, support Ascend,GPU,CPU')  # target device
args_opt = parser.parse_args()
if __name__ == '__main__':
    # 1 Parse arguments and set up the basic environment
    # For distributed training, read the device information from environment variables
    device_id = int(os.getenv('DEVICE_ID', '0'))  # default device
    rank_size = int(os.getenv('RANK_SIZE', '1'))
    rank_id = int(os.getenv('RANK_ID', '0'))
    # init context: training environment
    # standalone; dynamic graph mode: PYNATIVE_MODE, static graph mode: GRAPH_MODE; platforms: Ascend, GPU, CPU
    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
    # multi-device distributed training
    if rank_size > 1:
        context.set_auto_parallel_context(device_num=rank_size, parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True)
        context.set_auto_parallel_context(all_reduce_fusion_config=[85, 160])
        init()
    # 2 Define the dataset
    dataset = create_dataset(args_opt.dataset_path, config.batch_size, rank_size, rank_id)
    step_size = dataset.get_dataset_size()
    # 3 Define the network structure
    net = resnet50(class_num=config.class_num)
    # Weight initialization
    for _, cell in net.cells_and_names():
        if isinstance(cell, nn.Conv2d):  # initialize convolution weights with XavierUniform
            cell.weight.set_data(initializer.initializer(initializer.XavierUniform(), cell.weight.shape,
                                                         cell.weight.dtype))
        if isinstance(cell, nn.Dense):  # initialize dense-layer weights with TruncatedNormal
            cell.weight.set_data(initializer.initializer(initializer.TruncatedNormal(), cell.weight.shape,
                                                         cell.weight.dtype))
    # 4 Define the loss function and optimizer
    # Learning rate: cosine decay with warmup
    lr = nn.dynamic_lr.cosine_decay_lr(config.lr_end, config.lr, config.epoch_size * step_size, step_size, config.warmup)
    # Weight-decay grouping for the SGD optimizer with Momentum
    decayed_params = []
    no_decayed_params = []
    for param in net.trainable_params():
        if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
            decayed_params.append(param)
        else:
            no_decayed_params.append(param)
    group_params = [{'params': decayed_params, 'weight_decay': config.weight_decay},
                    {'params': no_decayed_params},
                    {'order_params': net.trainable_params()}]
    opt = Momentum(group_params, lr, config.momentum)
    # Cross-entropy loss with label smoothing
    loss = CrossEntropySmooth(sparse=True, reduction="mean", smooth_factor=config.label_smooth_factor,
                              num_classes=config.class_num)
    # 5 Define the model and callbacks
    model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})  # or metrics={'top_1_accuracy', 'top_5_accuracy'}
    # Callbacks: record step time and loss, and save checkpoints
    time_cb = TimeMonitor(data_size=step_size)
    loss_cb = LossMonitor()
    cb = [time_cb, loss_cb]
    if config.save_checkpoint:
        # Alternatively: save_checkpoint_steps=config.save_checkpoint_epochs * step_size
        config_ck = CheckpointConfig(save_checkpoint_steps=5, keep_checkpoint_max=config.keep_checkpoint_max)
        ckpt_cb = ModelCheckpoint(prefix="resnet", directory=config.save_checkpoint_path, config=config_ck)
        cb += [ckpt_cb]
    model.train(config.epoch_size, dataset, callbacks=cb, sink_size=step_size, dataset_sink_mode=False)
Run the Training
source activate py37_ms16
pip install easydict
python train.py --dataset_path=./data/imagenet_original/train/
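eval.py from the directory tree follows the same pattern; a minimal sketch (the checkpoint path and dataset path are assumptions):
# eval.py (sketch)
from mindspore import context, Model, load_checkpoint, load_param_into_net
from src.config import config
from src.dataset import create_dataset
from src.resnet import resnet50
from src.cross_entropy_smooth import CrossEntropySmooth

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
net = resnet50(class_num=config.class_num)
load_param_into_net(net, load_checkpoint("./ckpt/resnet-90_5004.ckpt"))  # hypothetical checkpoint name
loss = CrossEntropySmooth(sparse=True, reduction="mean",
                          smooth_factor=config.label_smooth_factor, num_classes=config.class_num)
model = Model(net, loss_fn=loss, metrics={'acc'})
dataset = create_dataset("./data/imagenet_original/val/", config.batch_size, do_train=False)
print(model.eval(dataset))  # e.g. {'acc': ...}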
Error Summary
Out of memory: Device(id:0) memory isn't enough and alloc failed, kernel name:
- Fixed by reducing batch_size.
- Common causes of insufficient device memory: batch_size too large, model too large, data shape too large, etc.