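Parallelize takes a forward-network builder function and builds a data-parallel network that runs across devices.
Its implementation can be summarized in the following parts:
- run the forward-network builder function on each device;
- add the corresponding gradient operators;
- add the gradient synchronization operators;
- add the optimizer;
- apply memory optimization.
By default, Parallelize produces a net of type DAGNet.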
Check the current device scope.
CurrentDeviceScope() reads an attribute of a global ThreadLocal object.
assert scope.CurrentDeviceScope() is None \
or scope.CurrentDeviceScope().device_type == caffe2_pb2.CPU, \
"Parallelize must be called without device-scope, \
device scope was: {}".format(scope.CurrentDeviceScope())
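The thread-local scope pattern behind this check can be sketched in plain Python. This is a simplified illustration, not the Caffe2 implementation: the names `_threadlocal_scope`, `device_scope`, `current_device_scope` and `check_no_device_scope` are hypothetical, and device options are represented as plain strings.

```python
import contextlib
import threading

# Hypothetical stand-in for the global thread-local object holding the scope.
_threadlocal_scope = threading.local()


def current_device_scope():
    """Return the device option active on this thread, or None."""
    return getattr(_threadlocal_scope, "devicescope", None)


@contextlib.contextmanager
def device_scope(device_option):
    """Set the device option for the duration of a with-block."""
    previous = current_device_scope()
    _threadlocal_scope.devicescope = device_option
    try:
        yield
    finally:
        # Restore whatever was active before, even if the block raised.
        _threadlocal_scope.devicescope = previous


def check_no_device_scope():
    """Accept only 'no scope' or a CPU scope, like the guard above."""
    current = current_device_scope()
    assert current is None or current == "CPU", \
        "must be called without device-scope, device scope was: {}".format(current)
```

Because the state lives in a `threading.local`, a scope entered on one thread is not visible from another thread.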
If devices is not specified, it is set to all available GPUs.
if devices is None:
devices = list(range(0, workspace.NumCudaDevices()))
Set the device type and device prefix on model_helper_obj.
if not cpu_device:
for gpu in devices:
if gpu >= workspace.NumCudaDevices():
log.warning("** Only {} GPUs available, GPUs {} requested".format(
workspace.NumCudaDevices(), devices))
break
model_helper_obj._device_type = caffe2_pb2.CUDA
model_helper_obj._device_prefix = "gpu"
model_helper_obj._shared_model = False
device_name = "GPU"
assert shared_model is False, "Shared model only supported on CPU"
else:
model_helper_obj._device_type = caffe2_pb2.CPU
model_helper_obj._device_prefix = "cpu"
device_name = "CPU"
model_helper_obj._shared_model = shared_model
if shared_model and rendezvous is not None:
assert False, "Shared model only supported on single-node currently"
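When a rendezvous is used, 8 extra_workers are added; the source itself marks this number as a best guess.
By default each device gets 4 threads.
The result is stored in model_helper_obj.net.Proto().num_workers.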
log.info("Parallelizing model for devices: {}".format(devices))
extra_workers = 8 if rendezvous is not None else 0 # best-guess
num_workers = len(devices) * num_threads_per_device + extra_workers
max_concurrent_distributed_ops =\
min(max_concurrent_distributed_ops, num_workers - 1)
model_helper_obj.net.Proto().num_workers = num_workers
model_helper_obj.net.Proto().type = net_type
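The worker arithmetic above can be checked in isolation. The helper below is a sketch that only mirrors the three assignments in the quoted code; the function name `plan_workers` and its parameter names are hypothetical.

```python
def plan_workers(num_devices, num_threads_per_device,
                 uses_rendezvous, max_concurrent_distributed_ops):
    """Return (num_workers, capped max_concurrent_distributed_ops)."""
    # 8 extra workers are reserved when a rendezvous is used.
    extra_workers = 8 if uses_rendezvous else 0
    num_workers = num_devices * num_threads_per_device + extra_workers
    # Distributed ops never get more than num_workers - 1 slots.
    max_ops = min(max_concurrent_distributed_ops, num_workers - 1)
    return num_workers, max_ops
```

For example, 2 devices with 4 threads each and no rendezvous give 8 workers, so a requested limit of 16 concurrent distributed ops is capped at 7.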
Record on model_helper_obj the devices in use and whether a rendezvous is used.
_broadcast_context is used in the _SyncAllParamsDistributed function. _sync_barrier_net is deprecated.
# Store some information in the model -- a bit ugly
model_helper_obj._devices = devices
model_helper_obj._rendezvous = rendezvous
model_helper_obj._sync_barrier_net = None
model_helper_obj._broadcast_context = None
model_helper_obj._grad_names = []
Check that model_helper_obj has the correct type.
assert isinstance(model_helper_obj, model_helper.ModelHelper)
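Take a shallow copy to keep track of the params that were already in the model: they are not data parallel, so they need to be handled separately.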
# Keep track of params that were in the model before: they are not
# data parallel, so we need to handle them separately
non_datapar_params = copy.copy(model_helper_obj.params)
Next, create the input and model training operators.
# Add input and model
log.info("Create input and model training operators")
Create the dictionary of losses and set loss_scale. num_shards is the number of machines in the distributed run.
losses_by_gpu = {}
num_shards = 1 if rendezvous is None else rendezvous['num_shards']
loss_scale = 1.0 / (len(devices) * num_shards)
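To see why the scale is 1 / (devices x shards), note that scaling each replica's loss this way makes the sum over all replicas equal to their average. The sketch below is a plain-Python illustration; `loss_scale` and `combined_loss` are hypothetical helper names, and per-replica losses are ordinary floats.

```python
def loss_scale(num_devices, num_shards=1):
    """Scale applied to every replica's loss."""
    return 1.0 / (num_devices * num_shards)


def combined_loss(per_replica_losses, num_devices, num_shards=1):
    """Sum of the scaled losses, which equals the mean over all replicas."""
    scale = loss_scale(num_devices, num_shards)
    return sum(scale * loss for loss in per_replica_losses)
```

With 4 devices on one shard and losses 2, 4, 6 and 8, the combined loss is 5, the mean of the four values.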
Parameter update strategy: use either param_update_builder_fun or optimizer_builder_fun, but not both.
has_parameter_updates = param_update_builder_fun is not None or \
optimizer_builder_fun is not None
assert not (
param_update_builder_fun is not None and
optimizer_builder_fun is not None
), 'Can only specify one of param_update_builder_fun, optimizer_builder_fun'
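Check that a model used for validation/testing has init_params set to False; otherwise, running the param init net would overwrite the values synchronized by the training net.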
# Check that a model that is used for validation/testing has
# init_params False, otherwise running the param init net will overwrite
# synchronized values by the training net
if not has_parameter_updates and model_helper_obj.init_params:
log.warning('')
log.warning("############# WARNING #############")
log.warning("Model {}/{} is used for testing/validation but".format(
model_helper_obj.name, model_helper_obj))
log.warning("has init_params=True!")
log.warning("This can conflict with model training.")
log.warning("Please ensure model = ModelHelper(init_params=False)")
log.warning('####################################')
log.warning('')
# TODO: make into assert
For each device, call input_builder_fun and forward_pass_builder_fun to compute the losses. _ValidateParams checks for duplicate params.
for device in devices:
device_opt = core.DeviceOption(model_helper_obj._device_type, device)
with core.DeviceScope(device_opt):
with core.NameScope("{}_{}".format(model_helper_obj._device_prefix,
device)):
log.info("Model for {} : {}".format(device_name, device))
input_builder_fun(model_helper_obj)
losses = forward_pass_builder_fun(model_helper_obj, loss_scale)
# Losses are not needed for test net
if has_parameter_updates:
assert isinstance(losses, list), \
'Model builder function must return list of loss blobs'
for loss in losses:
assert isinstance(loss, core.BlobReference), \
'Model builder func must return list of loss blobs'
losses_by_gpu[device] = losses
_ValidateParams(model_helper_obj.params)
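The duplicate check can also be written with collections.Counter. This is an alternative sketch, not the source's implementation: `find_duplicates` and `validate_params` are hypothetical names, params are plain strings, and each duplicated name is reported once rather than once per extra occurrence.

```python
from collections import Counter


def find_duplicates(params):
    """Return the sorted names that occur more than once in params."""
    counts = Counter(params)
    return sorted(name for name, n in counts.items() if n > 1)


def validate_params(params):
    """Raise AssertionError if any param name is duplicated."""
    dupes = find_duplicates(params)
    assert not dupes, "Duplicate entries in params: {}".format(dupes)
```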
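Build the parameter map from the per-device blobs. This creates a _device_grouped_blobs attribute on model_helper_obj.
A name with a single leading underscore is private by convention. For module-level names, import * skips them by default, and the next person who uses the code (or you yourself) can tell that the name is meant for internal use only.
_GroupByDevice returns an OrderedDict.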
# Create parameter map
model_helper_obj._device_grouped_blobs =\
_GroupByDevice(model_helper_obj, devices,
model_helper_obj.params, non_datapar_params)
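ModelHelper.GetComputedParams returns the computed params in the current namescope. "Computed params" are not optimized by gradient descent but are computed directly from the data, such as the running mean and variance of spatial batch normalization.
Add the computed params to model_helper_obj.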
# computed params
computed_params_grouped =\
_GroupByDevice(model_helper_obj, devices,
model_helper_obj.GetComputedParams(''), [])
model_helper_obj._device_grouped_blobs.update(computed_params_grouped)
model_helper_obj._param_names =\
list(viewkeys(model_helper_obj._device_grouped_blobs))
model_helper_obj._computed_param_names =\
list(viewkeys(computed_params_grouped))
Add the gradient operators.
if has_parameter_updates:
log.info("Adding gradient operators")
_AddGradientOperators(devices, model_helper_obj, losses_by_gpu)
Transform the net after it has been built.
if net_transformer_fun:
net_transformer_fun(
model_helper_obj,
len(devices),
model_helper_obj._device_prefix,
model_helper_obj._device_type)
If there are no parameter updates, return. _InferBlobDevice assigns each blob a device option taken from the first operator that references it.
if not has_parameter_updates:
log.info("Parameter update function not defined --> only forward")
_InferBlobDevice(model_helper_obj)
return
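In Parallelize, combine_spatial_bn defaults to False and is supported only on CPU. _InterleaveOps makes sure that the same operator on different devices is placed adjacently. _InterDeviceBatchNormalization adds the operators that implement SyncBN on CPU.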
if combine_spatial_bn:
assert(cpu_device), \
'combine_spatial_bn is currently only supported on the CPU'
assert(has_parameter_updates), \
'combine_spatial_bn should only be used for train model'
_InterleaveOps(model_helper_obj)
_InterDeviceBatchNormalization(model_helper_obj)
Check again for duplicate params.
_ValidateParams(model_helper_obj.params)
Group the gradients by device and register them in the blob lookup table.
# Group gradients by device and register to blob lookup
param_to_grad = model_helper_obj.param_to_grad
grads_ordered = [param_to_grad[p] for p in
model_helper_obj.params if p in param_to_grad]
non_datapar_grads = [param_to_grad[p] for p in non_datapar_params]
gradients_grouped = _GroupByDevice(
model_helper_obj,
devices,
grads_ordered,
non_datapar_grads
)
model_helper_obj._device_grouped_blobs.update(gradients_grouped)
model_helper_obj._grad_names = list(viewkeys(gradients_grouped))
model_helper_obj._losses_by_gpu = losses_by_gpu
_InferBlobDevice assigns each blob a device option taken from the first operator that references it.
_InferBlobDevice(model_helper_obj)
log.info("Add gradient all-reduces for SyncSGD")
if broadcast_computed_params:
_BroadcastComputedParams(devices, model_helper_obj, rendezvous, use_nccl)
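_GetReverseOrderedGrads returns the gradients (with the namescope stripped) in reverse order, which is the best order for synchronization.
_AllReduceBlobs performs the all-reduce over this list of gradients.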
if len(model_helper_obj._grad_names) > 0:
# Gradients in reverse order
reverse_ordered_grads = _GetReverseOrderedGrads(model_helper_obj)
assert(len(reverse_ordered_grads) > 0)
_AllReduceBlobs(
reverse_ordered_grads,
devices,
model_helper_obj,
model_helper_obj.net,
rendezvous,
use_nccl,
max_concurrent_distributed_ops,
)
else:
log.info("NOTE: Param builder function did not create any parameters.")
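all_params looks out of place here, but it is presumably captured at this point because _PruneParametersForSharing modifies the params, and all_params is later passed to _RemapParameterBlobsForSharedModel.
_PruneParametersForSharing removes the non-master params so that they do not receive parameter update operators.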
log.info("Post-iteration operators for updating params")
num_shards = 1 if rendezvous is None else rendezvous['num_shards']
all_params = set(model_helper_obj.GetParams(''))
if shared_model:
_PruneParametersForSharing(model_helper_obj)
If param_update_builder_fun is defined, run it; otherwise run optimizer_builder_fun. No device option is set around optimizer_builder_fun here; that is handled in _build.
if param_update_builder_fun is not None:
for device in devices:
device_opt = core.DeviceOption(model_helper_obj._device_type, device)
with core.DeviceScope(device_opt):
with core.NameScope(
"{}_{}".format(model_helper_obj._device_prefix, device)
):
param_update_builder_fun(model_helper_obj)
else:
log.info("Calling optimizer builder function")
optimizer = optimizer_builder_fun(model_helper_obj)
model_helper_obj._optimizer = optimizer
_ComputeBlobsToSync collects all blobs that are generated by the param init net and are "data parallel", i.e. assigned to a device.
(sync_blobs, sync_names) = _ComputeBlobsToSync(model_helper_obj)
sync_blobs_grouped = _GroupByDevice(
model_helper_obj,
devices,
sync_blobs,
[],
)
model_helper_obj._device_grouped_blobs.update(sync_blobs_grouped)
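_InferBlobDevice assigns each blob a device option taken from the first operator that references it.
_AnalyzeOperators looks at all operators and checks whether they cross device scopes.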
_InferBlobDevice(model_helper_obj)
_AnalyzeOperators(model_helper_obj)
ModelHelper.Proto() calls self.net.Proto().
# Configure dagnet to run with only one worker on the first iteration,
# to prevent concurrency problems with allocs and nccl.
arg = model_helper_obj.Proto().arg.add()
arg.name = "first_iter_only_one_worker"
arg.i = 1
Synchronize the initial parameters of the net.
# Add initial parameter syncs
log.info("Add initial parameter sync")
_SyncAllParams(
devices,
model_helper_obj,
model_helper_obj.param_init_net,
model_helper_obj.param_init_net,
rendezvous,
sync_names,
max_concurrent_distributed_ops=1
)
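Handle any operations that need to run after the parameter sync, i.e. making sure that multi-precision copies of the parameters are up to date.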
# Handle any operations that need to be done after parameter sync
# i.e. making sure multi-precision copies of parameters are up-to-date
if post_sync_builder_fun is not None:
for device in devices:
device_opt = core.DeviceOption(model_helper_obj._device_type, device)
with core.DeviceScope(device_opt):
with core.NameScope(
"{}_{}".format(model_helper_obj._device_prefix, device)
):
post_sync_builder_fun(model_helper_obj)
Optimize gradient memory or enable dynamic memory management; the assert below rejects using both together.
assert not (optimize_gradient_memory and dynamic_memory_management), \
"""It is not advised to use gradient optimization ('memonger')
with dynamic memory management."""
if optimize_gradient_memory:
_OptimizeGradientMemorySimple(model_helper_obj, losses_by_gpu, devices)
if dynamic_memory_management:
_AddDynamicMemoryOptimization(model_helper_obj, blobs_to_keep, devices)
Can _data_parallel_model_init_nets ever hold more than one net? At this point the model ends up with _data_parallel_model_init_nets and _data_parallel_model_nets, each holding a single net.
model_helper_obj._data_parallel_model_init_nets = [
model_helper_obj.param_init_net,
]
model_helper_obj._data_parallel_model_nets = [
model_helper_obj.net
]
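_AddBarrierToModelNets synchronizes DPM at the start of each epoch. This lets shards that start an epoch early wait for the slower shards. Without it, the fast shards would begin training the next epoch while the stragglers are still blocked on IO, and might time out after 30 seconds (_DEFAULT_TIMEOUT_SEC). model.param_init_net is passed in so that the barrier net can be run as part of param_init_net.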
_AddBarrierToModelNets(model_helper_obj, barrier_net_timeout_sec)
_RemapParameterBlobsForSharedModel remaps the non-master params onto the master copy, so the non-master params are no longer used.
if shared_model:
_RemapParameterBlobsForSharedModel(model_helper_obj, all_params)
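Parallelize_BMUF creates a model that runs on many GPUs, together with a net for parameter updates that can be run independently for a number of iterations, followed by another net that runs once to compute the final parameter update according to the block-wise model-update filtering rule described in: Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-block Parallel Optimization and Blockwise Model-Update Filtering (ICASSP 2016).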
def Parallelize_BMUF(
model_helper_obj,
input_builder_fun,
forward_pass_builder_fun,
param_update_builder_fun,
block_learning_rate=1.0,
block_momentum=None,
devices=None,
rendezvous=None,
net_type='dag',
master_device=None,
use_nccl=False,
nesterov=False,
optimize_gradient_memory=False,
reset_momentum_sgd=False,
warmup_iterations=None,
max_concurrent_distributed_ops=4,
add_blobs_to_sync=None,
num_threads_per_device=4,
cpu_device=False,
barrier_net_timeout_sec=_DEFAULT_BARRIER_NET_TIMEOUT_SEC,
):
'''
Function to create model that run on many GPUs and creates a net for
parameter_updates that can be run independently for number of iterations
then followed by another net that runs once to compute the final parameter
updates according to block wise model update filtering rule described
in : Scalable Training of Deep Learning Machines by Incremental Block
Training with Intra-block Parallel Optimization and Blockwise Model-Update
Filtering (ICASSP 2016).
'''
Check for duplicate params.
enumerate() wraps an iterable (such as a list, tuple, or string) into a sequence of (index, item) pairs, and is typically used in for loops.
def _ValidateParams(params):
set_params = set(params)
if len(params) > len(set_params):
dupes = []
sp = sorted(params)
for j, p in enumerate(sp):
if j > 0 and sp[j - 1] == p:
dupes.append(p)
assert len(params) == len(set_params), \
"Duplicate entries in params: {}".format(dupes)
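## _GroupByDevice
Groups blobs by device, returning a map of [blobname] = {0: BlobRef, 1: ..}.
It returns an ordered dictionary, which preserves the original order.
Only len(non_data_params) is used, so passing non_data_params itself does not seem necessary.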
'''
Groups blobs by device, returning a map of [blobname] = {0: BlobRef, 1: ..}.
Returns ordered dictionary, ensuring the original order.
'''
grouped = OrderedDict()
# Only consider params that were created to be "data parallel"
params = params[len(non_data_params):]
Each entry in params should be a BlobReference or a GradientSlice.
for _i, p in enumerate(params):
assert isinstance(p, core.BlobReference) or \
isinstance(p, core.GradientSlice), \
"Param {} is not BlobReference or GradientSlice".format(p)
Get gpuid by parsing the namescope.
GetNameScope, like stripBlobName, operates on the name string.
name = stripBlobName(p)
gpuid = None
if isinstance(p, core.BlobReference):
gpuid = int(p.GetNameScope().split("_")[1].split("/")[0])
assert "{}_{}/".format(model._device_prefix, gpuid) in p.GetNameScope(),\
"Param {} expected to have namescope '{}_{}'".format(str(p), model._device_prefix, gpuid)
else:
gpuid = int(p.indices.GetNameScope().split("_")[1].split("/")[0])
assert "{}_{}/".format(model._device_prefix, gpuid) in p.indices.GetNameScope(),\
"Indices {} expected to have namescope '{}_{}'".format(str(p), model._device_prefix, gpuid)
assert "{}_{}/".format(model._device_prefix, gpuid) in p.values.GetNameScope(),\
"Values {} expected to have namescope '{}_{}'".format(str(p), model._device_prefix, gpuid)
Add the name to the dictionary if it is not there yet.
if name not in grouped:
grouped[name] = {}
grouped[name][gpuid] = p
return grouped
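The grouping logic can be tried without Caffe2 by working on plain blob-name strings. In this sketch, `group_by_device` is a hypothetical helper; it assumes names of the form "<prefix>_<id>/<name>" and stores the full name where the source stores a BlobReference or GradientSlice.

```python
from collections import OrderedDict


def group_by_device(blob_names, device_prefix="gpu"):
    """Map each stripped blob name to {device_id: full_blob_name}."""
    grouped = OrderedDict()
    for full_name in blob_names:
        # "gpu_1/fc_w" -> scope "gpu_1", name "fc_w"
        scope, _, name = full_name.partition("/")
        prefix, _, device_id = scope.partition("_")
        assert prefix == device_prefix and device_id.isdigit(), \
            "Blob {} expected to have namescope '{}_<id>/'".format(
                full_name, device_prefix)
        grouped.setdefault(name, {})[int(device_id)] = full_name
    return grouped
```

Calling it on `["gpu_0/fc_w", "gpu_0/fc_b", "gpu_1/fc_w", "gpu_1/fc_b"]` yields the keys `fc_w` and `fc_b` in first-seen order, each mapping device ids 0 and 1 to the corresponding full name.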
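## _InferBlobDevice
Define a map_ops function that records, for each input and output blob of an operator, that operator's device_option. The first operator that mentions a blob wins.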
mapping = {}
def map_ops(proto):
for op in proto.op:
device_option = op.device_option
if op.type == "Iter":
# Hack for Iters which have blob in CPU context
device_option = caffe2_pb2.DeviceOption()
device_option.device_type = caffe2_pb2.CPU
for b in list(op.input) + list(op.output):
if b not in mapping:
mapping[b] = device_option
if op.type.startswith('RecurrentNetwork'):
step_args = [a for a in op.arg if a.name.endswith("step_net")]
for step_arg in step_args:
map_ops(step_arg.n)
map_ops(model.param_init_net.Proto())
map_ops(model.net.Proto())
model._blob_to_device = mapping
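A stripped-down version of this mapping, using dictionaries in place of operator protos, might look as follows. `infer_blob_devices` and the dictionary keys are hypothetical, devices are plain strings, and the recursion into RecurrentNetwork step nets is left out.

```python
def infer_blob_devices(ops):
    """Map each blob name to the device of the first op that mentions it."""
    mapping = {}
    for op in ops:
        # Iter ops keep their blob in CPU context, whatever the op's device.
        device = "CPU" if op["type"] == "Iter" else op["device"]
        for blob in op["inputs"] + op["outputs"]:
            # setdefault keeps the first assignment, like `if b not in mapping`.
            mapping.setdefault(blob, device)
    return mapping
```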
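## _IsGPUBlob
If _InferBlobDevice has recorded a device option for the blob (under its bare name or prefixed with the first device's namescope), decide from that; otherwise, decide from the model's device type.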
if blob_name in model._blob_to_device:
return model._blob_to_device[blob_name].device_type == caffe2_pb2.CUDA
else:
blob_name = "{}_{}/{}".format(
model._device_prefix, model._devices[0], blob_name
)
if blob_name not in model._blob_to_device:
return model._device_type == caffe2_pb2.CUDA
return model._blob_to_device[blob_name].device_type == caffe2_pb2.CUDA
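_RunComparison does not appear to be called anywhere.
For each GPU, _AddGradientOperators creates a loss gradient from each loss and then adds the gradient operators.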
def _AddGradientOperators(devices, model, losses_by_gpu):
def create_grad(lossp):
return model.ConstantFill(lossp, str(lossp) + "_grad", value=1.0)
loss_grad = {}
# Explicitly need to create gradients on each GPU
for gpu_id in devices:
device = core.DeviceOption(model._device_type, gpu_id)
with core.DeviceScope(device):
for l in losses_by_gpu[gpu_id]:
lg = create_grad(l)
loss_grad[str(l)] = str(lg)
model.AddGradientOperators(loss_grad)
The first GPU acts as the master.
# Copy params from gpu_0 to other
master_dev = devices[0]
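If NCCL is used, call net.NCCLBroadcast.
Note that the root is the root _rank_, not the root _device_. Therefore root=0 is always used, regardless of which devices are used.
list(viewvalues(model._device_grouped_blobs[param])) converts the per-device dict for the param into a list.
core.DeviceOption builds a DeviceOption proto from a device type and a device id.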
if use_nccl:
if _IsGPUBlob(model, param):
master_device_opt = core.DeviceOption(model._device_type, master_dev)
with core.DeviceScope(master_device_opt):
# Note that the root is the root _rank_ and not the root
# _device_. Thus we always use root=0, regardless of the
# devices used.
net.NCCLBroadcast(
list(viewvalues(model._device_grouped_blobs[param])),
list(viewvalues(model._device_grouped_blobs[param])),
root=0,
)
return
Otherwise, use net.Copy.
for dev_idx in devices[1:]:
if _IsGPUBlob(model, param):
device_opt = core.DeviceOption(caffe2_pb2.CUDA, dev_idx)
else:
device_opt = core.DeviceOption(caffe2_pb2.CPU, 0)
with core.DeviceScope(device_opt):
net.Copy(
model._device_grouped_blobs[param][master_dev],
model._device_grouped_blobs[param][dev_idx]
)
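## _AllReduce
On GPU with NCCL enabled, NCCLAllreduce is used directly.
control_input is an extra "fake" input used to express control dependencies in the operator graph. It can be used to make sure an operator does not run until another operator is ready, e.g. for scheduling control. These inputs are used only for scheduling inside the Net class and are not passed to the operator implementation as actual inputs.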
blobs_group = list(viewvalues(model._device_grouped_blobs[param]))
if model._device_type == caffe2_pb2.CUDA and use_nccl:
# TODO: for _shared_model, do only NCCLReduce
model.NCCLAllreduce(
blobs_group, blobs_group, control_input=control_input
)
return
On CUDA GPUs, query p2p_access_pattern.
if model._device_type == caffe2_pb2.CUDA:
p2p_access_pattern = workspace.GetCudaPeerAccessPattern()
else:
p2p_access_pattern = None
Create a Sum op for 2 or more blobs on different devices, saving the result on the first device.
Here model is model_helper_obj.
def sumN(*dev_indices):
"""Create a Sum op for 2 or more blobs on different devices.
Saves the result on the first device.
Arguments:
dev_indices -- a list of device indices, which can be translated into
CUDA identifiers with model._devices
"""
devices = [model._devices[idx] for idx in dev_indices]
blobs = [blobs_group[idx] for idx in dev_indices]
for i, peer in enumerate(devices):
if i == 0:
continue # Skip the first device
if p2p_access_pattern is not None and not p2p_access_pattern[
devices[0], peer
]:
# Copy from peer to d0
blobs[i] = model.Copy(
blobs[i],
'gpu_{}/{}_gpu{}_copy'.format(devices[0], param, peer)
)
device_opt = core.DeviceOption(model._device_type, devices[0])
with core.DeviceScope(device_opt):
net.Sum(blobs, [blobs[0]], name='dpm')
Use tree reduction.
if len(devices) == 16:
# Special tree reduction for 16 gpus, TODO generalize like in muji.py
for j in range(8):
sumN(j * 2, j * 2 + 1)
for j in range(4):
sumN(j * 4, j * 4 + 2)
for j in range(2):
sumN(j * 8, j * 8 + 4)
sumN(0, 8)
elif len(devices) == 8:
for j in range(4):
sumN(j * 2, j * 2 + 1)
for j in range(2):
sumN(j * 4, j * 4 + 2)
sumN(0, 4)
elif len(devices) == 4:
sumN(0, 1)
sumN(2, 3)
sumN(0, 2)
else:
sumN(*range(len(devices)))
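The hard-coded cases for 4, 8 and 16 devices follow one pattern: sum neighbours at stride 1, then at stride 2, and so on, until everything is accumulated on device 0. The sketch below generalizes that pattern to any power of two. It is an illustration only: `tree_reduce_schedule` and `simulate_tree_reduce` are hypothetical helpers, and the source itself falls back to a single Sum over all devices for other sizes.

```python
def tree_reduce_schedule(num_devices):
    """Return (dst, src) pairs; each pair means 'add src into dst'."""
    assert num_devices > 0 and num_devices & (num_devices - 1) == 0, \
        "this sketch only handles powers of two"
    schedule = []
    stride = 1
    while stride < num_devices:
        for dst in range(0, num_devices, 2 * stride):
            schedule.append((dst, dst + stride))
        stride *= 2
    return schedule


def simulate_tree_reduce(values):
    """Apply the schedule to per-device values; the total lands on index 0."""
    values = list(values)
    for dst, src in tree_reduce_schedule(len(values)):
        values[dst] += values[src]
    return values[0]
```

For 4 devices the schedule is (0, 1), (2, 3), (0, 2), which matches the three sumN calls in the 4-device branch above.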
Broadcast the result to the other GPUs.
# TODO: for _shared_model, no need to broadcast
_Broadcast(devices, model, net, param)
## _ComputeBlobsToSync
All blobs that are generated by param_init_net and are "data parallel", i.e. assigned to a device, are synchronized.
sync_names = set()
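If the model is shared, params are not synced and only the computed params are collected; otherwise, iterate over the ops in param_init_net and collect their outputs.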
# We don't sync params if the model is shared
if model._shared_model:
blobs_to_sync = [str(p) for p in model.GetComputedParams('')]
sync_names = [stripBlobName(p) for p in blobs_to_sync]
else:
blobs_to_sync = []
for op in model.param_init_net.Proto().op:
dp_outputs = [
o for o in op.output
if o.startswith("{}_".format(model._device_prefix))
]
sync_names.update([stripBlobName(o) for o in dp_outputs])
blobs_to_sync.extend(dp_outputs)
# Sanity check
diff = set(model._param_names) - sync_names
assert diff == set(), \
"Some params not instantiated in param init net: {}".format(diff)
Remove duplicates and sort.
# Remove duplicates and sort
prefixlen = len(model._device_prefix) + 1
def extract_sort_key(b):
# Sort first based on device id, and then by whole string
deviceid = int(b[prefixlen:b.index(scope._NAMESCOPE_SEPARATOR)])
return (deviceid, b)
blobs_to_sync = sorted(
list(set(blobs_to_sync)),
key=extract_sort_key)
Build BlobReference objects from the strings.
blobs_to_sync = [core.BlobReference(b) for b in blobs_to_sync]
return (blobs_to_sync, sync_names)
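The sort key matters because a plain string sort would put "gpu_10/..." before "gpu_2/...". The following sketch reproduces the deduplicate-and-sort step on plain strings; `sort_blobs_for_sync` is a hypothetical name and the "/" separator is passed in explicitly rather than read from scope._NAMESCOPE_SEPARATOR.

```python
def sort_blobs_for_sync(blob_names, device_prefix="gpu", separator="/"):
    """Deduplicate, then sort by numeric device id and then by full name."""
    prefix_len = len(device_prefix) + 1  # skip "gpu_"

    def sort_key(name):
        device_id = int(name[prefix_len:name.index(separator)])
        return (device_id, name)

    return sorted(set(blob_names), key=sort_key)
```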
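The data parallel model creates a net in which the ops of each device are grouped together. This function, _InterleaveOps, interleaves the ops so that each op for each device is next to each other in the net, a bit like combining decks of cards. This ensures that progress is made along the critical path roughly concurrently for each device, which is important because multi-device batch normalization requires extra intra-node synchronization.
Build a dictionary and append each op to the list of its device.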
orig_ops = list(model.net.Proto().op)
num_devices = len(model._devices)
num_ops_per_dev = len(orig_ops) // num_devices
assert num_devices * num_ops_per_dev == len(orig_ops), \
'Number of ops per device in original net is not uniform'
new_ops = []
ops = {d: [] for d in range(num_devices)}
for op in orig_ops:
ops[op.device_option.cuda_gpu_id].append(op)
Add the ops to new_ops in device order.
for j in range(num_ops_per_dev):
tp = None
for d in model._devices:
if tp is None:
tp = ops[d][j].type
new_ops.append(ops[d][j])
# Sanity
assert ops[d][j].type == tp, \
"Type mismatch {} / {}".format(tp, ops[d][j].type)
Replace the net's op list.
del model.net.Proto().op[:]
model.net.Proto().op.extend(new_ops)
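The interleaving can be demonstrated on (device, op_type) tuples. This is a simplified sketch: `interleave_ops` is a hypothetical helper, and it keys the per-device lists by the device ids passed in, whereas the source keys them by range(num_devices) and reads op.device_option.cuda_gpu_id.

```python
def interleave_ops(ops, devices):
    """Reorder device-grouped ops so the j-th op of every device is adjacent."""
    per_device = {d: [] for d in devices}
    for device, op_type in ops:
        per_device[device].append(op_type)
    lengths = {len(v) for v in per_device.values()}
    assert len(lengths) == 1, "Number of ops per device is not uniform"
    interleaved = []
    for j in range(lengths.pop()):
        expected = per_device[devices[0]][j]
        for d in devices:
            # Sanity check: every device runs the same op type at step j.
            assert per_device[d][j] == expected, \
                "Type mismatch {} / {}".format(expected, per_device[d][j])
            interleaved.append((d, per_device[d][j]))
    return interleaved
```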
## _InterDeviceBatchNormalization
Get the original op list.
orig_ops = list(model.net.Proto().op)
new_ops = []
num_devices = len(model._devices)
batch_norm_ops = []
injected_ops = []
State for the SpatialBN forward pass.
spatial_bn_phase = False
sums_blobs = []
sumsq_blobs = []
name = []
input_blob_name = None
State for the SpatialBN backward pass.
spatial_bn_gradient_phase = False
scale_grad_blobs = []
bias_grad_blobs = []
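For operators other than SpatialBN and SpatialBNGradient: if the loop is in spatial_bn_phase, add Sum operators that compute the combined sums and sumsq, then reset the related variables. This code path is CPU-only, since combine_spatial_bn is only supported on CPU.
ChannelStats computes the per-channel sum and sum of squares.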
for op in orig_ops:
if op.type != 'SpatialBN' and op.type != 'SpatialBNGradient':
if spatial_bn_phase:
new_ops.extend(injected_ops)
new_ops.append(
core.CreateOperator("Sum",
sums_blobs,
input_blob_name + "_sums_combined"))
new_ops.append(
core.CreateOperator("Sum",
sumsq_blobs,
input_blob_name + "_sumsq_combined"))
new_ops.extend(batch_norm_ops)
injected_ops = []
batch_norm_ops = []
sums_blobs = []
sumsq_blobs = []
spatial_bn_phase = False
input_blob_name = None
If the loop is in spatial_bn_gradient_phase:
elif spatial_bn_gradient_phase:
new_ops.extend(injected_ops)
scale_blob = \
"cpu_0/" + stripBlobName(scale_grad_blobs[0]) + "_combined"
bias_blob = \
"cpu_0/" + stripBlobName(bias_grad_blobs[0]) + "_combined"
new_ops.append(
core.CreateOperator("Sum", scale_grad_blobs, scale_blob))
new_ops.append(
core.CreateOperator("Sum", bias_grad_blobs, bias_blob))
for blob in scale_grad_blobs:
new_ops.append(
core.CreateOperator("Copy", scale_blob, blob))
for blob in bias_grad_blobs:
new_ops.append(core.CreateOperator("Copy", bias_blob, blob))
new_ops.extend(batch_norm_ops)
injected_ops = []
batch_norm_ops = []
scale_grad_blobs = []
bias_grad_blobs = []
spatial_bn_gradient_phase = False
new_ops.append(op)
For a SpatialBN operator, enter spatial_bn_phase and add a ChannelStats operator.
elif op.type == 'SpatialBN':
spatial_bn_phase = True
if input_blob_name is None:
input_blob_name = op.input[0]
name = op.input[0]
injected_ops.append(
core.CreateOperator(
"ChannelStats",
name,
[name + "_sums", name + "_sumsq"]))
sums_blobs.append(name + "_sums")
sumsq_blobs.append(name + "_sumsq")
op.input.append(input_blob_name + "_sums_combined")
op.input.append(input_blob_name + "_sumsq_combined")
op.arg.extend([utils.MakeArgument("num_batches", num_devices)])
batch_norm_ops.append(op)
For a SpatialBNGradient operator, enter spatial_bn_gradient_phase and add a ChannelBackpropStats operator.
elif op.type == 'SpatialBNGradient':
spatial_bn_gradient_phase = True
injected_ops.append(
core.CreateOperator("ChannelBackpropStats",
[op.input[0], op.input[3], op.input[4],
op.input[2]],
[op.output[1], op.output[2]]))
scale_grad_blobs.append(op.output[1])
bias_grad_blobs.append(op.output[2])
op.arg.extend([utils.MakeArgument("num_batches", num_devices)])
op.input.extend([op.output[1], op.output[2]])
batch_norm_ops.append(op)
Replace the net's op list.
assert not spatial_bn_phase, \
"Net modification for inter-device batch normalization failed"
del model.net.Proto().op[:]
model.net.Proto().op.extend(new_ops)
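The reason per-device sums and sums of squares are enough for synchronized batch normalization is that the global mean and variance can be recovered from their totals. The sketch below shows that arithmetic for a single channel with plain Python lists. It illustrates the underlying math only, not how SpatialBN consumes the combined blobs; `channel_stats` and `combined_mean_var` are hypothetical names.

```python
def channel_stats(values):
    """Per-device statistics for one channel: (sum, sum of squares)."""
    return sum(values), sum(v * v for v in values)


def combined_mean_var(per_device_values):
    """Global mean and (population) variance from per-device statistics."""
    total = sum(len(v) for v in per_device_values)
    stats = [channel_stats(v) for v in per_device_values]
    sums = sum(s for s, _ in stats)      # role of the "_sums_combined" blob
    sumsq = sum(q for _, q in stats)     # role of the "_sumsq_combined" blob
    mean = sums / total
    var = sumsq / total - mean * mean    # E[x^2] - E[x]^2
    return mean, var
```

With values [1, 2] on one device and [3, 4] on another, this gives mean 2.5 and variance 1.25, the same as computing them over [1, 2, 3, 4] directly.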
## GetCheckpointParams
Returns the set of blobs needed for a full checkpoint: the blobs of the first GPU plus the iteration blobs.
startswith() checks whether a string starts with the given substring.
(all_blobs, _) = _ComputeBlobsToSync(model)
first_gpu_blobs = {
b
for b in all_blobs
if str(b)
.startswith("{}_{}/".format(model._device_prefix, model._devices[0]))
}
Add the iteration blobs that do not have a namescope separately, since it is important to checkpoint the iteration counter.
# Add iteration blobs that do not have namescope separately, since
# it is important to checkpoint iteration counter
iteration_blobs = set()
for op in model.net.Proto().op:
if op.type == 'Iter' or op.type == 'AtomicIter':
if not op.output[0].startswith("{}_".format(model._device_prefix)):
iteration_blobs.add(op.output[0])
Return first_gpu_blobs together with the iteration blobs.
return first_gpu_blobs.union(iteration_blobs)
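The selection rule can be expressed on plain strings as a small sketch. `checkpoint_blobs` is a hypothetical helper; `iter_outputs` stands for the first output of every Iter/AtomicIter op, which the source collects from model.net.

```python
def checkpoint_blobs(sync_blobs, iter_outputs, devices, device_prefix="gpu"):
    """Blobs of the first device plus iteration blobs without a device scope."""
    first_prefix = "{}_{}/".format(device_prefix, devices[0])
    first_device_blobs = {b for b in sync_blobs if b.startswith(first_prefix)}
    iteration_blobs = {
        b for b in iter_outputs if not b.startswith(device_prefix + "_")
    }
    return first_device_blobs | iteration_blobs
```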
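This function should be called after loading parameters from a checkpoint or an initial parameters file.
_ComputeBlobsToSync collects all blobs that are generated by param_init_net and are "data parallel", i.e. assigned to a device.
stripBlobName extracts the parameter name.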
if not hasattr(model, "_checkpoint_net"):
if blobs is None:
(_, uniq_blob_names) = _ComputeBlobsToSync(model)
else:
uniq_blob_names = [stripBlobName(p) for p in blobs]
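Synchronize to the blob lookup map, as the provided blobs might include non-parameters, such as momentum blobs.
GetDevices returns the devices that were set at the start of Parallelize.
# Synchronize to the blob lookup map, as the provided
# blobs might have non-parameters, such as momentum blobs.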
log.info("Creating checkpoint synchronization net")
devices = model.GetDevices()
for name in uniq_blob_names:
if name not in model._device_grouped_blobs:
grouped = {
d:
core.BlobReference("{}_{}{}{}".format(
model._device_prefix,
d,
scope._NAMESCOPE_SEPARATOR,
name)
) for d in devices}
model._device_grouped_blobs[name] = grouped
Create the _checkpoint_net net.
model._checkpoint_net = core.Net("checkpoint_sync_net")
model._checkpoint_net.RunAllOnGPU()
If running distributed, create checkpoint_init_net.
checkpoint_init_net = None
if (model._rendezvous is not None and model._rendezvous['num_shards'] > 1):
checkpoint_init_net = core.Net("checkpoint_init_net")
checkpoint_init_net.RunAllOnGPU()
Add the ops that synchronize all params.
_SyncAllParams(
devices,
model,
checkpoint_init_net,
model._checkpoint_net,
model._rendezvous,
uniq_blob_names,
max_concurrent_distributed_ops=1
)
On a single machine this step is skipped, because checkpoint_init_net is None.
if (checkpoint_init_net):
workspace.RunNetOnce(checkpoint_init_net)
Call into C++ to construct the net.
workspace.CreateNet(model._checkpoint_net)
Run the _checkpoint_net net.
# Run the sync
log.info("Run checkpoint net")
workspace.RunNet(model._checkpoint_net.Proto().name)