Analysis of the Caffe2 Parallelize Function


Parallelize takes input/forward-pass builder functions and constructs a data-parallel network that runs across devices (a minimal usage sketch follows the list below).
Its implementation can be summarized as the following steps:

  • Run the forward-network builder functions on each device;
  • Add the corresponding gradient operators;
  • Add gradient synchronization operators;
  • Add the optimizer;
  • Perform memory optimization.
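
For orientation, here is a minimal usage sketch (the tiny model, blob names, dimensions, and the helper names add_input/add_forward/add_optimizer are invented for illustration, not taken from this post): the builder callbacks are plain functions that Parallelize invokes once per device, inside the device scope and name scope it sets up.

    from caffe2.python import brew, data_parallel_model, model_helper, optimizer

    def add_input(model):
        # Assumed: "data" and "label" are provided externally (DB reader, FeedBlob, ...).
        pass

    def add_forward(model, loss_scale):
        # Toy forward pass; must return a list of loss BlobReferences.
        fc = brew.fc(model, "data", "fc", dim_in=16, dim_out=10)
        _, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
        return [model.Scale(loss, "loss_scaled", scale=loss_scale)]

    def add_optimizer(model):
        return optimizer.build_sgd(model, base_learning_rate=0.1)

    train_model = model_helper.ModelHelper(name="train")
    data_parallel_model.Parallelize(
        train_model,
        input_builder_fun=add_input,
        forward_pass_builder_fun=add_forward,
        optimizer_builder_fun=add_optimizer,
        devices=[0, 1],                  # data parallel over two GPUs
    )

Blobs created inside the builders end up under per-device namescopes such as gpu_0/fc and gpu_1/fc.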

##Parallelize

[Flowchart: Parallelize call flow: set ModelHelper device attributes -> input_builder_fun / forward_pass_builder_fun -> _ValidateParams -> _GroupByDevice -> _AddGradientOperators -> net_transformer_fun -> (combine_spatial_bn? _InterleaveOps -> _InterDeviceBatchNormalization) -> _ValidateParams -> _GroupByDevice -> _InferBlobDevice -> _BroadcastComputedParams -> _GetReverseOrderedGrads -> _AllReduceBlobs -> (param_update_builder_fun | optimizer_builder_fun) -> _ComputeBlobsToSync -> _GroupByDevice -> _InferBlobDevice -> _AnalyzeOperators -> _SyncAllParams -> (post_sync_builder_fun) -> _OptimizeGradientMemorySimple / _AddDynamicMemoryOptimization -> _AddBarrierToModelNets -> _RemapParameterBlobsForSharedModel -> _data_parallel_model_init_nets, _data_parallel_model_nets]

By default, Parallelize produces a network of type DAGNet.

First, check the current device scope.
CurrentDeviceScope() reads an attribute on a global thread-local object.

    assert scope.CurrentDeviceScope() is None \
        or scope.CurrentDeviceScope().device_type == caffe2_pb2.CPU, \
        "Parallelize must be called without device-scope, \
        device scope was: {}".format(scope.CurrentDeviceScope())

如果未指定devices,则设置为所有可用gpu。

    if devices is None:
        devices = list(range(0, workspace.NumCudaDevices()))

Set the device type and device prefix on model_helper_obj.

    if not cpu_device:
        for gpu in devices:
            if gpu >= workspace.NumCudaDevices():
                log.warning("** Only {} GPUs available, GPUs {} requested".format(
                    workspace.NumCudaDevices(), devices))
                break
        model_helper_obj._device_type = caffe2_pb2.CUDA
        model_helper_obj._device_prefix = "gpu"
        model_helper_obj._shared_model = False
        device_name = "GPU"
        assert shared_model is False, "Shared model only supported on CPU"
    else:
        model_helper_obj._device_type = caffe2_pb2.CPU
        model_helper_obj._device_prefix = "cpu"
        device_name = "CPU"
        model_helper_obj._shared_model = shared_model
        if shared_model and rendezvous is not None:
            assert "Shared model only supported on single-node currently"

The rendezvous (multi-node) communication mode adds 8 extra_workers as a best guess.
By default each device is allocated 4 threads.
The total is written to model_helper_obj.net.Proto().num_workers.

    log.info("Parallelizing model for devices: {}".format(devices))
    extra_workers = 8 if rendezvous is not None else 0  # best-guess
    num_workers = len(devices) * num_threads_per_device + extra_workers
    max_concurrent_distributed_ops =\
        min(max_concurrent_distributed_ops, num_workers - 1)
    model_helper_obj.net.Proto().num_workers = num_workers
    model_helper_obj.net.Proto().type = net_type

Record the devices used by model_helper_obj and whether rendezvous is used.
_broadcast_context is used in the _SyncAllParamsDistributed function; _sync_barrier_net is already deprecated.

    # Store some information in the model -- a bit ugly
    model_helper_obj._devices = devices
    model_helper_obj._rendezvous = rendezvous
    model_helper_obj._sync_barrier_net = None

    model_helper_obj._broadcast_context = None
    model_helper_obj._grad_names = []

Check that model_helper_obj is of the correct type.

    assert isinstance(model_helper_obj, model_helper.ModelHelper)

Shallow-copy the params that were already in the model: they are not data parallel, so they need to be handled separately.

    # Keep track of params that were in the model before: they are not
    # data parallel, so we need to handle them separately
    non_datapar_params = copy.copy(model_helper_obj.params)

Next, the input and model training operators are created.

    # Add input and model
    log.info("Create input and model training operators")

Create a dict of per-device losses and compute loss_scale; num_shards is the number of machines in a distributed run, so with, say, 4 devices and 2 shards, loss_scale is 1/8.

    losses_by_gpu = {}
    num_shards = 1 if rendezvous is None else rendezvous['num_shards']
    loss_scale = 1.0 / (len(devices) * num_shards)
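
For multi-machine training, rendezvous is a plain dict. A rough sketch of its shape (key names as used by the Caffe2 resnet50 example trainer; creating the kv_handler/store blob is elided here):

    # Hypothetical 2-machine setup; this process is shard 0.
    rendezvous = dict(
        kv_handler="store_handler",   # blob name of a previously created KV store handler
        shard_id=0,
        num_shards=2,
        engine="GLOO",
        transport="tcp",
        interface="",
    )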

Parameter update strategy: exactly one of param_update_builder_fun and optimizer_builder_fun may be provided.

    has_parameter_updates = param_update_builder_fun is not None or \
        optimizer_builder_fun is not None
    assert not (
        param_update_builder_fun is not None and
        optimizer_builder_fun is not None
    ), 'Can only specify one of param_update_builder_fun, optimizer_builder_fun'

Check that a model used for validation/testing has init_params set to False; otherwise running the param init net would overwrite the values that the training net has synchronized.

    # Check that a model that is used for validation/testing has
    # init_params False, otherwise running the param init net will overwrite
    # synchronized values by the training net
    if not has_parameter_updates and model_helper_obj.init_params:
        log.warning('')
        log.warning("############# WARNING #############")
        log.warning("Model {}/{} is used for testing/validation but".format(
            model_helper_obj.name, model_helper_obj))
        log.warning("has init_params=True!")
        log.warning("This can conflict with model training.")
        log.warning("Please ensure model = ModelHelper(init_params=False)")
        log.warning('####################################')
        log.warning('')
        # TODO: make into assert
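
In other words, a validation/test model is built roughly like this (a sketch reusing the hypothetical builders from above; note init_params=False and the absence of an update/optimizer builder):

    test_model = model_helper.ModelHelper(name="test", init_params=False)
    data_parallel_model.Parallelize(
        test_model,
        input_builder_fun=add_input,           # same builders as the train model
        forward_pass_builder_fun=add_forward,
        param_update_builder_fun=None,         # no parameter updates -> forward only
        devices=[0, 1],
    )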

For each device, call input_builder_fun and forward_pass_builder_fun to compute the losses. _ValidateParams checks for duplicate parameters.

    for device in devices:
        device_opt = core.DeviceOption(model_helper_obj._device_type, device)
        with core.DeviceScope(device_opt):
            with core.NameScope("{}_{}".format(model_helper_obj._device_prefix,
                                               device)):
                log.info("Model for {} : {}".format(device_name, device))
                input_builder_fun(model_helper_obj)
                losses = forward_pass_builder_fun(model_helper_obj, loss_scale)
                # Losses are not needed for test net
                if has_parameter_updates:
                    assert isinstance(losses, list), \
                        'Model builder function must return list of loss blobs'
                    for loss in losses:
                        assert isinstance(loss, core.BlobReference), \
                            'Model builder func must return list of loss blobs'

                losses_by_gpu[device] = losses
    _ValidateParams(model_helper_obj.params)

Build a parameter map from the per-device blobs; model_helper_obj gains a _device_grouped_blobs attribute.
A single leading underscore marks a name as "private": even in `import *` scenarios, the next person using the code (or you yourself) knows the name is for internal use only.
_GroupByDevice returns an OrderedDict.

    # Create parameter map
    model_helper_obj._device_grouped_blobs =\
        _GroupByDevice(model_helper_obj, devices,
                       model_helper_obj.params, non_datapar_params)
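
To illustrate the resulting structure (hypothetical parameter names, two GPUs): keys are namescope-stripped parameter names, and each value maps a device id to that device's BlobReference.

    from caffe2.python import core

    expected = {
        "fc/w": {0: core.BlobReference("gpu_0/fc/w"), 1: core.BlobReference("gpu_1/fc/w")},
        "fc/b": {0: core.BlobReference("gpu_0/fc/b"), 1: core.BlobReference("gpu_1/fc/b")},
    }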

ModelHelper.GetComputedParams returns the computed params in the current namescope. "Computed params" are not optimized by gradient descent but computed directly from the data, such as the running mean and variance of spatial batch normalization.
The computed params are added to model_helper_obj as well.

    # computed params
    computed_params_grouped =\
        _GroupByDevice(model_helper_obj, devices,
                       model_helper_obj.GetComputedParams(''), [])
    model_helper_obj._device_grouped_blobs.update(computed_params_grouped)

    model_helper_obj._param_names =\
        list(viewkeys(model_helper_obj._device_grouped_blobs))
    model_helper_obj._computed_param_names =\
        list(viewkeys(computed_params_grouped))

Add the gradient operators.

    if has_parameter_updates:
        log.info("Adding gradient operators")
        _AddGradientOperators(devices, model_helper_obj, losses_by_gpu)

Transform the net after it has been built.

    if net_transformer_fun:
        net_transformer_fun(
            model_helper_obj,
            len(devices),
            model_helper_obj._device_prefix,
            model_helper_obj._device_type)

If there are no parameter updates, return early. _InferBlobDevice assigns each blob a device based on the device option of the operator that outputs it.

    if not has_parameter_updates:
        log.info("Parameter update function not defined --> only forward")
        _InferBlobDevice(model_helper_obj)
        return

In Parallelize, combine_spatial_bn defaults to False and is currently only supported on CPU. _InterleaveOps rearranges the net so that the same operator on different devices sits adjacently. _InterDeviceBatchNormalization appends operators that implement synchronized batch normalization (SyncBN) on the CPU.

    if combine_spatial_bn:
        assert(cpu_device), \
            'combine_spatial_bn is currently only supported on the CPU'
        assert(has_parameter_updates), \
            'combine_spatial_bn should only be used for train model'
        _InterleaveOps(model_helper_obj)
        _InterDeviceBatchNormalization(model_helper_obj)

Check again for duplicate parameters.

    _ValidateParams(model_helper_obj.params)

Group the gradients by device and register them in the blob lookup table.

    # Group gradients by device and register to blob lookup
    param_to_grad = model_helper_obj.param_to_grad
    grads_ordered = [param_to_grad[p] for p in
                     model_helper_obj.params if p in param_to_grad]
    non_datapar_grads = [param_to_grad[p] for p in non_datapar_params]

    gradients_grouped = _GroupByDevice(
        model_helper_obj,
        devices,
        grads_ordered,
        non_datapar_grads
    )
    model_helper_obj._device_grouped_blobs.update(gradients_grouped)
    model_helper_obj._grad_names = list(viewkeys(gradients_grouped))
    model_helper_obj._losses_by_gpu = losses_by_gpu

_InferBlobDevice assigns each blob a device based on the device option of the operator that outputs it.

    _InferBlobDevice(model_helper_obj)

_BroadcastComputedParams broadcasts the computed params to all devices.


    log.info("Add gradient all-reduces for SyncSGD")
    if broadcast_computed_params:
        _BroadcastComputedParams(devices, model_helper_obj, rendezvous, use_nccl)

_GetReverseOrderedGrads returns the (namescope-stripped) gradients in reverse order, which is the optimal synchronization order.
_AllReduceBlobs all-reduces this list of gradients.


    if len(model_helper_obj._grad_names) > 0:
        # Gradients in reverse order
        reverse_ordered_grads = _GetReverseOrderedGrads(model_helper_obj)
        assert(len(reverse_ordered_grads) > 0)
        _AllReduceBlobs(
            reverse_ordered_grads,
            devices,
            model_helper_obj,
            model_helper_obj.net,
            rendezvous,
            use_nccl,
            max_concurrent_distributed_ops,
        )
    else:
        log.info("NOTE: Param builder function did not create any parameters.")

It looks like all_params is captured in an odd place (before pruning).
_PruneParametersForSharing removes non-master parameters so that they do not receive parameter update operators.

    log.info("Post-iteration operators for updating params")
    num_shards = 1 if rendezvous is None else rendezvous['num_shards']

    all_params = set(model_helper_obj.GetParams(''))
    if shared_model:
        _PruneParametersForSharing(model_helper_obj)

If param_update_builder_fun is defined, run it for every device; otherwise run optimizer_builder_fun. optimizer_builder_fun does not set a device option itself; that is handled in the optimizer's _build.

    if param_update_builder_fun is not None:
        for device in devices:
            device_opt = core.DeviceOption(model_helper_obj._device_type, device)
            with core.DeviceScope(device_opt):
                with core.NameScope(
                    "{}_{}".format(model_helper_obj._device_prefix, device)
                ):
                    param_update_builder_fun(model_helper_obj)
    else:
        log.info("Calling optimizer builder function")
        optimizer = optimizer_builder_fun(model_helper_obj)
        model_helper_obj._optimizer = optimizer
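
For reference, a param_update_builder_fun might look roughly like the sketch below (plain SGD with hypothetical blob names; since it runs inside each device's scope, the gradients it reads are that device's already all-reduced copies):

    from caffe2.python import brew

    def add_param_update(model):
        # param := param + LR * grad   (LR is negative)
        iteration = brew.iter(model, "iter")
        lr = model.LearningRate(iteration, "lr", base_lr=-0.01, policy="fixed")
        one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
        for param in model.GetParams():
            grad = model.param_to_grad[param]
            model.WeightedSum([param, one, grad, lr], param)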

_ComputeBlobsToSync collects all blobs that are produced by the param init net and are "data parallel", i.e. assigned to a device.

    (sync_blobs, sync_names) = _ComputeBlobsToSync(model_helper_obj)
    sync_blobs_grouped = _GroupByDevice(
        model_helper_obj,
        devices,
        sync_blobs,
        [],
    )
    model_helper_obj._device_grouped_blobs.update(sync_blobs_grouped)

_InferBlobDevice assigns each blob a device based on the device option of the operator that outputs it.
_AnalyzeOperators inspects all operators and checks whether any of them cross device scopes.

    _InferBlobDevice(model_helper_obj)
    _AnalyzeOperators(model_helper_obj)

ModelHelper.Proto() simply calls self.net.Proto().

    # Configure dagnet to run with only one worker on the first iteration,
    # to prevent concurrency problems with allocs and nccl.
    arg = model_helper_obj.Proto().arg.add()
    arg.name = "first_iter_only_one_worker"
    arg.i = 1

Synchronize the initial network parameters via _SyncAllParams.


    # Add initial parameter syncs
    log.info("Add initial parameter sync")
    _SyncAllParams(
        devices,
        model_helper_obj,
        model_helper_obj.param_init_net,
        model_helper_obj.param_init_net,
        rendezvous,
        sync_names,
        max_concurrent_distributed_ops=1
    )

Handle any operations that need to be done after parameter sync, i.e. making sure multi-precision copies of parameters are up to date.

    # Handle any operations that need to be done after parameter sync
    # i.e. making sure multi-precision copies of parameters are up-to-date
    if post_sync_builder_fun is not None:
        for device in devices:
            device_opt = core.DeviceOption(model_helper_obj._device_type, device)
            with core.DeviceScope(device_opt):
                with core.NameScope(
                    "{}_{}".format(model_helper_obj._device_prefix, device)
                ):
                    post_sync_builder_fun(model_helper_obj)

Optimize gradient memory (via memonger) or enable dynamic memory management; the two are not meant to be combined.


    assert not (optimize_gradient_memory and dynamic_memory_management), \
        """It is not advised to use gradient optimization ('memonger')
        with dynamic memory management."""

    if optimize_gradient_memory:
        _OptimizeGradientMemorySimple(model_helper_obj, losses_by_gpu, devices)

    if dynamic_memory_management:
        _AddDynamicMemoryOptimization(model_helper_obj, blobs_to_keep, devices)

Could _data_parallel_model_init_nets ever hold more than one net? In the end we obtain _data_parallel_model_init_nets and _data_parallel_model_nets.

    model_helper_obj._data_parallel_model_init_nets = [
        model_helper_obj.param_init_net,
    ]

    model_helper_obj._data_parallel_model_nets = [
        model_helper_obj.net
    ]

_AddBarrierToModelNets synchronizes the DPM at the start of each epoch. This lets shards that start an epoch early wait for the slow shards. Without it, fast shards would start training the next epoch while blocked on IO for the stragglers, and could time out after 30 seconds (_DEFAULT_TIMEOUT_SEC). We pass in model.param_init_net so that the barrier net can run as part of param_init_net.

    _AddBarrierToModelNets(model_helper_obj, barrier_net_timeout_sec)

_RemapParameterBlobsForSharedModel removes non-master parameters.

    if shared_model:
        _RemapParameterBlobsForSharedModel(model_helper_obj, all_params)

##Parallelize_BMUF

A function to create a model that runs on many GPUs and to create a net for parameter updates that can be run independently for a number of iterations, followed by another net that runs once to compute the final parameter updates according to the block-wise model update filtering rule described in: Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-block Parallel Optimization and Blockwise Model-Update Filtering (ICASSP 2016).

def Parallelize_BMUF(
    model_helper_obj,
    input_builder_fun,
    forward_pass_builder_fun,
    param_update_builder_fun,
    block_learning_rate=1.0,
    block_momentum=None,
    devices=None,
    rendezvous=None,
    net_type='dag',
    master_device=None,
    use_nccl=False,
    nesterov=False,
    optimize_gradient_memory=False,
    reset_momentum_sgd=False,
    warmup_iterations=None,
    max_concurrent_distributed_ops=4,
    add_blobs_to_sync=None,
    num_threads_per_device=4,
    cpu_device=False,
    barrier_net_timeout_sec=_DEFAULT_BARRIER_NET_TIMEOUT_SEC,
):
    '''
    Function to create model that run on many GPUs and creates a net for
    parameter_updates that can be run independently for number of iterations
    then followed by another net that runs once to compute the final parameter
    updates according to block wise model update filtering rule described
    in : Scalable Training of Deep Learning Machines by Incremental Block
    Training with Intra-block Parallel Optimization and Blockwise Model-Update
    Filtering (ICASSP 2016).
    '''

##_ValidateParams

Checks whether there are duplicate parameters.
enumerate() wraps an iterable (such as a list, tuple, or string) into an indexed sequence, yielding both the index and the element; it is typically used in for loops.

def _ValidateParams(params):
    set_params = set(params)
    if len(params) > len(set_params):
        dupes = []
        sp = sorted(params)
        for j, p in enumerate(sp):
            if j > 0 and sp[j - 1] == p:
                dupes.append(p)

        assert len(params) == len(set_params), \
            "Duplicate entries in params: {}".format(dupes)

##_GroupByDevice
Groups blobs by device, returning a map of [blobname] = {0: BlobRef, 1: ..}.
Returns an ordered dictionary to preserve the original order.
Only len(non_data_params) is actually used, so passing the full non_data_params list does not seem strictly necessary.

    '''
    Groups blobs by device, returning a map of [blobname] = {0: BlobRef, 1: ..}.
    Returns ordered dictionary, ensuring the original order.
    '''
    grouped = OrderedDict()
    # Only consider params that were created to be  "data parallel"
    params = params[len(non_data_params):]

Each param must be a BlobReference or a GradientSlice.

    for _i, p in enumerate(params):
        assert isinstance(p, core.BlobReference) or \
            isinstance(p, core.GradientSlice), \
            "Param {} is not BlobReference or GradientSlice".format(p)

Extract the gpuid.
GetNameScope likewise just operates on the name string.

        name = stripBlobName(p)
        gpuid = None

        if isinstance(p, core.BlobReference):
            gpuid = int(p.GetNameScope().split("_")[1].split("/")[0])
            assert "{}_{}/".format(model._device_prefix, gpuid) in p.GetNameScope(),\
                "Param {} expected to have namescope '{}_{}'".format(str(p), model._device_prefix, gpuid)
        else:
            gpuid = int(p.indices.GetNameScope().split("_")[1].split("/")[0])
            assert "{}_{}/".format(model._device_prefix, gpuid) in p.indices.GetNameScope(),\
                "Indices {} expected to have namescope '{}_{}'".format(str(p), model._device_prefix, gpuid)
            assert "{}_{}/".format(model._device_prefix, gpuid) in p.values.GetNameScope(),\
                "Values {} expected to have namescope '{}_{}'".format(str(p), model._device_prefix, gpuid)

If the name is not yet in the dictionary, add it.

        if name not in grouped:
            grouped[name] = {}
        grouped[name][gpuid] = p

    return grouped

##_InferBlobDevice
Define a map_ops function that records, for each operator's input and output blobs, the operator's device_option.

    mapping = {}

    def map_ops(proto):
        for op in proto.op:
            device_option = op.device_option
            if op.type == "Iter":
                # Hack for Iters which have blob in CPU context
                device_option = caffe2_pb2.DeviceOption()
                device_option.device_type = caffe2_pb2.CPU
            for b in list(op.input) + list(op.output):
                if b not in mapping:
                    mapping[b] = device_option
            if op.type.startswith('RecurrentNetwork'):
                step_args = [a for a in op.arg if a.name.endswith("step_net")]
                for step_arg in step_args:
                    map_ops(step_arg.n)
    map_ops(model.param_init_net.Proto())
    map_ops(model.net.Proto())
    model._blob_to_device = mapping

##_IsGPUBlob
If _InferBlobDevice has already recorded a device option for this blob, use it directly; otherwise fall back to the model's device type.

    if blob_name in model._blob_to_device:
        return model._blob_to_device[blob_name].device_type == caffe2_pb2.CUDA
    else:
        blob_name = "{}_{}/{}".format(
            model._device_prefix, model._devices[0], blob_name
        )
        if blob_name not in model._blob_to_device:
            return model._device_type == caffe2_pb2.CUDA
        return model._blob_to_device[blob_name].device_type == caffe2_pb2.CUDA

_RunComparison does not seem to be called from anywhere.

##_AddGradientOperators

For each GPU, construct the loss gradient from the loss and add the gradient operators.

def _AddGradientOperators(devices, model, losses_by_gpu):
    def create_grad(lossp):
        return model.ConstantFill(lossp, str(lossp) + "_grad", value=1.0)

    loss_grad = {}
    # Explicitly need to create gradients on each GPU
    for gpu_id in devices:
        device = core.DeviceOption(model._device_type, gpu_id)
        with core.DeviceScope(device):
            for l in losses_by_gpu[gpu_id]:
                lg = create_grad(l)
                loss_grad[str(l)] = str(lg)

    model.AddGradientOperators(loss_grad)
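
With two GPUs and one loss per device, the dict handed to AddGradientOperators ends up looking like this (blob names follow the hypothetical sketch above; ConstantFill creates a gradient of 1.0 for each loss):

    loss_grad = {
        "gpu_0/loss_scaled": "gpu_0/loss_scaled_grad",
        "gpu_1/loss_scaled": "gpu_1/loss_scaled_grad",
    }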

##_Broadcast

The first GPU acts as the master.

    # Copy params from gpu_0 to other
    master_dev = devices[0]

If NCCL is used, add net.NCCLBroadcast.
Note that the root is the root _rank_, not the root _device_, so root=0 is always used regardless of which devices are involved.
list(viewvalues(model._device_grouped_blobs[param])) turns the per-device dict into a list.
core.DeviceOption is a helper that builds a caffe2_pb2.DeviceOption protobuf from a device type and device id.

    if use_nccl:
        if _IsGPUBlob(model, param):
            master_device_opt = core.DeviceOption(model._device_type, master_dev)
            with core.DeviceScope(master_device_opt):
                # Note that the root is the root _rank_ and not the root
                # _device_. Thus we always use root=0, regardless of the
                # devices used.
                net.NCCLBroadcast(
                    list(viewvalues(model._device_grouped_blobs[param])),
                    list(viewvalues(model._device_grouped_blobs[param])),
                    root=0,
                )
                return

Otherwise, fall back to net.Copy.

    for dev_idx in devices[1:]:
        if _IsGPUBlob(model, param):
            device_opt = core.DeviceOption(caffe2_pb2.CUDA, dev_idx)
        else:
            device_opt = core.DeviceOption(caffe2_pb2.CPU, 0)
        with core.DeviceScope(device_opt):
            net.Copy(
                model._device_grouped_blobs[param][master_dev],
                model._device_grouped_blobs[param][dev_idx]
            )

##_AllReduce
If running on GPU with NCCL enabled, use NCCLAllreduce directly.
control_input is an extra "fake" input that expresses a control dependency in the operator graph: it ensures an operator does not run before another operator is ready, i.e. scheduling control. These are only used for scheduling within the Net class and are not passed as actual inputs to the operator implementation.

    blobs_group = list(viewvalues(model._device_grouped_blobs[param]))
    if model._device_type == caffe2_pb2.CUDA and use_nccl:
        # TODO: for _shared_model, do only NCCLReduce
        model.NCCLAllreduce(
            blobs_group, blobs_group, control_input=control_input
        )
        return

If running on CUDA GPUs, query the peer-to-peer access pattern (p2p_access_pattern).

    if model._device_type == caffe2_pb2.CUDA:
        p2p_access_pattern = workspace.GetCudaPeerAccessPattern()
    else:
        p2p_access_pattern = None

sumN creates a Sum op for 2 or more blobs on different devices and stores the result on the first device.
Here model is the model_helper_obj passed in.


    def sumN(*dev_indices):
        """Create a Sum op for 2 or more blobs on different devices.
        Saves the result on the first device.
        Arguments:
        dev_indices -- a list of device indices, which can be translated into
                       CUDA identifiers with model._devices
        """
        devices = [model._devices[idx] for idx in dev_indices]
        blobs = [blobs_group[idx] for idx in dev_indices]
        for i, peer in enumerate(devices):
            if i == 0:
                continue  # Skip the first device
            if p2p_access_pattern is not None and not p2p_access_pattern[
                devices[0], peer
            ]:
                # Copy from peer to d0
                blobs[i] = model.Copy(
                    blobs[i],
                    'gpu_{}/{}_gpu{}_copy'.format(devices[0], param, peer)
                )
        device_opt = core.DeviceOption(model._device_type, devices[0])
        with core.DeviceScope(device_opt):
            net.Sum(blobs, [blobs[0]], name='dpm')

Use tree reduction for 16, 8, or 4 devices; otherwise sum everything in a single step.


    if len(devices) == 16:
        # Special tree reduction for 16 gpus, TODO generalize like in muji.py
        for j in range(8):
            sumN(j * 2, j * 2 + 1)
        for j in range(4):
            sumN(j * 4, j * 4 + 2)
        for j in range(2):
            sumN(j * 8, j * 8 + 4)
        sumN(0, 8)
    elif len(devices) == 8:
        for j in range(4):
            sumN(j * 2, j * 2 + 1)
        for j in range(2):
            sumN(j * 4, j * 4 + 2)
        sumN(0, 4)
    elif len(devices) == 4:
        sumN(0, 1)
        sumN(2, 3)
        sumN(0, 2)
    else:
        sumN(*range(len(devices)))
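
For example, with 8 devices the pairwise Sum schedule unfolds as follows (each pair (a, b) means "add the blob on device b into the blob on device a"):

    rounds = [
        [(0, 1), (2, 3), (4, 5), (6, 7)],   # partial sums on devices 0, 2, 4, 6
        [(0, 2), (4, 6)],                   # partial sums on devices 0 and 4
        [(0, 4)],                           # device 0 now holds the full sum
    ]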

Broadcast the result to the other GPUs.

    # TODO: for _shared_model, no need to broadcast
    _Broadcast(devices, model, net, param)

##_ComputeBlobsToSync
We sync all blobs that are produced by the param_init_net and are "data parallel", i.e. assigned to a device.

    sync_names = set()

If the model is shared, parameters are not synced; otherwise iterate over the ops in param_init_net and collect their outputs.

    # We don't sync params if the model is shared
    if model._shared_model:
        blobs_to_sync = [str(p) for p in model.GetComputedParams('')]
        sync_names = [stripBlobName(p) for p in blobs_to_sync]
    else:
        blobs_to_sync = []

        for op in model.param_init_net.Proto().op:
            dp_outputs = [
                o for o in op.output
                if o.startswith("{}_".format(model._device_prefix))
            ]
            sync_names.update([stripBlobName(o) for o in dp_outputs])
            blobs_to_sync.extend(dp_outputs)

        # Sanity check
        diff = set(model._param_names) - sync_names
        assert diff == set(), \
           "Some params not instantiated in param init net: {}".format(diff)

Remove duplicates and sort.

    # Remove duplicates and sort
    prefixlen = len(model._device_prefix) + 1

    def extract_sort_key(b):
        # Sort first based on device id, and then by whole string
        deviceid = int(b[prefixlen:b.index(scope._NAMESCOPE_SEPARATOR)])
        return (deviceid, b)

    blobs_to_sync = sorted(
        list(set(blobs_to_sync)),
        key=extract_sort_key)
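
A worked example of the sort key (device prefix "gpu", so prefixlen is 4; scope._NAMESCOPE_SEPARATOR is "/"):

    b = "gpu_1/fc/w"
    prefixlen = len("gpu") + 1                 # 4
    deviceid = int(b[prefixlen:b.index("/")])  # b[4:5] == "1" -> 1
    assert (deviceid, b) == (1, "gpu_1/fc/w")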

Turn the strings back into BlobReferences.

    blobs_to_sync = [core.BlobReference(b) for b in blobs_to_sync]
    return (blobs_to_sync, sync_names)

##_InterleaveOps

The data-parallel model creates a net in which the operators of each device are grouped together. This function interleaves the operators so that, position by position, each device's operator sits next to the others in the net, a bit like riffling decks of cards together. This ensures that every device progresses along the critical path at roughly the same time, which matters because multi-device batch normalization requires extra intra-node synchronization.

Build a dict mapping each device to its operators.

    orig_ops = list(model.net.Proto().op)
    num_devices = len(model._devices)
    num_ops_per_dev = len(orig_ops) // num_devices
    assert num_devices * num_ops_per_dev == len(orig_ops), \
           'Number of ops per device in original net is not uniform'
    new_ops = []
    ops = {d: [] for d in range(num_devices)}
    for op in orig_ops:
        ops[op.device_option.cuda_gpu_id].append(op)

For each position, append the operator of every device, in device order, to new_ops.

    for j in range(num_ops_per_dev):
        tp = None
        for d in model._devices:
            if tp is None:
                tp = ops[d][j].type
            new_ops.append(ops[d][j])
            # Sanity
            assert ops[d][j].type == tp, \
                "Type mismatch {} / {}".format(tp, ops[d][j].type)

Replace the net's operator list.

    del model.net.Proto().op[:]
    model.net.Proto().op.extend(new_ops)
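
A toy reproduction of the reordering with string placeholders (two devices, three ops each):

    orig_ops = ["gpu_0/A", "gpu_0/B", "gpu_0/C", "gpu_1/A", "gpu_1/B", "gpu_1/C"]
    num_devices = 2
    per_dev = len(orig_ops) // num_devices
    interleaved = [orig_ops[d * per_dev + j]
                   for j in range(per_dev) for d in range(num_devices)]
    # interleaved == ["gpu_0/A", "gpu_1/A", "gpu_0/B", "gpu_1/B", "gpu_0/C", "gpu_1/C"]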

##_InterDeviceBatchNormalization

[Flowchart: _InterDeviceBatchNormalization: orig_ops -> add n ChannelStats ops -> Sum -> feed the Sum outputs into every SpatialBN -> add n ChannelBackpropStats ops -> Sum -> Copy (broadcast) -> new_ops]

Get the original list of operators.

    orig_ops = list(model.net.Proto().op)
    new_ops = []
    num_devices = len(model._devices)
    batch_norm_ops = []
    injected_ops = []

State for the SpatialBN forward phase.

    spatial_bn_phase = False
    sums_blobs = []
    sumsq_blobs = []
    name = []
    input_blob_name = None

State for the SpatialBN backward phase.

    spatial_bn_gradient_phase = False
    scale_grad_blobs = []
    bias_grad_blobs = []

For operators other than SpatialBN and SpatialBNGradient: if we are still in the spatial_bn_phase, first emit the injected ops, add Sum operators that combine the per-device sums and sumsq, then emit the buffered SpatialBN ops and reset the phase state. The Sum ops here run on the CPU (combine_spatial_bn is CPU-only).
ChannelStats computes the per-channel sums and sums of squares.

    for op in orig_ops:
        if op.type != 'SpatialBN' and op.type != 'SpatialBNGradient':
            if spatial_bn_phase:
                new_ops.extend(injected_ops)
                new_ops.append(
                    core.CreateOperator("Sum",
                                        sums_blobs,
                                        input_blob_name + "_sums_combined"))
                new_ops.append(
                    core.CreateOperator("Sum",
                                        sumsq_blobs,
                                        input_blob_name + "_sumsq_combined"))
                new_ops.extend(batch_norm_ops)
                injected_ops = []
                batch_norm_ops = []
                sums_blobs = []
                sumsq_blobs = []
                spatial_bn_phase = False
                input_blob_name = None

If we are in the spatial_bn_gradient_phase:

            elif spatial_bn_gradient_phase:
                new_ops.extend(injected_ops)
                scale_blob = \
                    "cpu_0/" + stripBlobName(scale_grad_blobs[0]) + "_combined"
                bias_blob = \
                    "cpu_0/" + stripBlobName(bias_grad_blobs[0]) + "_combined"
                new_ops.append(
                    core.CreateOperator("Sum", scale_grad_blobs, scale_blob))
                new_ops.append(
                    core.CreateOperator("Sum", bias_grad_blobs, bias_blob))
                for blob in scale_grad_blobs:
                    new_ops.append(
                        core.CreateOperator("Copy", scale_blob, blob))
                for blob in bias_grad_blobs:
                    new_ops.append(core.CreateOperator("Copy", bias_blob, blob))
                new_ops.extend(batch_norm_ops)
                injected_ops = []
                batch_norm_ops = []
                scale_grad_blobs = []
                bias_grad_blobs = []
                spatial_bn_gradient_phase = False
            new_ops.append(op)

If the operator is a SpatialBN, enter the spatial_bn_phase and add a ChannelStats operator.

        elif op.type == 'SpatialBN':
            spatial_bn_phase = True
            if input_blob_name is None:
                input_blob_name = op.input[0]
            name = op.input[0]
            injected_ops.append(
                core.CreateOperator(
                    "ChannelStats",
                    name,
                    [name + "_sums", name + "_sumsq"]))
            sums_blobs.append(name + "_sums")
            sumsq_blobs.append(name + "_sumsq")
            op.input.append(input_blob_name + "_sums_combined")
            op.input.append(input_blob_name + "_sumsq_combined")
            op.arg.extend([utils.MakeArgument("num_batches", num_devices)])
            batch_norm_ops.append(op)

If the operator is a SpatialBNGradient, enter the spatial_bn_gradient_phase and add a ChannelBackpropStats operator.

        elif op.type == 'SpatialBNGradient':
            spatial_bn_gradient_phase = True
            injected_ops.append(
                core.CreateOperator("ChannelBackpropStats",
                                    [op.input[0], op.input[3], op.input[4],
                                     op.input[2]],
                                    [op.output[1], op.output[2]]))
            scale_grad_blobs.append(op.output[1])
            bias_grad_blobs.append(op.output[2])
            op.arg.extend([utils.MakeArgument("num_batches", num_devices)])
            op.input.extend([op.output[1], op.output[2]])
            batch_norm_ops.append(op)

Replace the net's operator list.

    assert not spatial_bn_phase, \
        "Net modification for inter-device batch normalization failed"
    del model.net.Proto().op[:]
    model.net.Proto().op.extend(new_ops)

##GetCheckpointParams
Returns the set of blobs required for a full checkpoint: the blobs of the first GPU plus the iteration counter blobs.
startswith() checks whether a string begins with the given substring.

    (all_blobs, _) = _ComputeBlobsToSync(model)
    first_gpu_blobs = {
        b
        for b in all_blobs
        if str(b)
        .startswith("{}_{}/".format(model._device_prefix, model._devices[0]))
    }

Separately add the iteration blobs that have no namescope, since checkpointing the iteration counter matters.

    # Add iteration blobs that do not have namescope separately, since
    # it is important to checkpoint iteration counter
    iteration_blobs = set()
    for op in model.net.Proto().op:
        if op.type == 'Iter' or op.type == 'AtomicIter':
            if not op.output[0].startswith("{}_".format(model._device_prefix)):
                iteration_blobs.add(op.output[0])

Return first_gpu_blobs together with the iteration blobs.

    return first_gpu_blobs.union(iteration_blobs)

##FinalizeAfterCheckpoint

[Flowchart: FinalizeAfterCheckpoint(model, blobs): if no _checkpoint_net exists: (blobs given? stripBlobName : _ComputeBlobsToSync) -> GetDevices -> Net -> RunAllOnGPU -> _SyncAllParams -> CreateNet; finally RunNet]

This function should be called after parameters have been loaded from a checkpoint or an initial parameter file.

_ComputeBlobsToSync collects all blobs that are produced by param_init_net and are "data parallel", i.e. assigned to a device.
stripBlobName extracts the parameter name.

    if not hasattr(model, "_checkpoint_net"):
        if blobs is None:
            (_, uniq_blob_names) = _ComputeBlobsToSync(model)
        else:
            uniq_blob_names = [stripBlobName(p) for p in blobs]

Synchronize against the blob lookup map, since the provided blobs might include non-parameters such as momentum blobs.
GetDevices returns the devices recorded at the start of Parallelize.

        # Synchronize to the blob lookup map, as the provided
        # blobs might have non-parameters, such as momemtum blobs.
        log.info("Creating checkpoint synchronization net")
        devices = model.GetDevices()
        for name in uniq_blob_names:
            if name not in model._device_grouped_blobs:
                grouped = {
                    d:
                    core.BlobReference("{}_{}{}{}".format(
                        model._device_prefix,
                        d,
                        scope._NAMESCOPE_SEPARATOR,
                        name)
                    ) for d in devices}
                model._device_grouped_blobs[name] = grouped

Create the _checkpoint_net.

        model._checkpoint_net = core.Net("checkpoint_sync_net")
        model._checkpoint_net.RunAllOnGPU()

In the distributed case, also create a checkpoint_init_net.

        checkpoint_init_net = None
        if (model._rendezvous is not None and model._rendezvous['num_shards'] > 1):
            checkpoint_init_net = core.Net("checkpoint_init_net")
            checkpoint_init_net.RunAllOnGPU()

Add ops that synchronize all params.

        _SyncAllParams(
            devices,
            model,
            checkpoint_init_net,
            model._checkpoint_net,
            model._rendezvous,
            uniq_blob_names,
            max_concurrent_distributed_ops=1
        )

On a single machine this does not need to run.

        if (checkpoint_init_net):
            workspace.RunNetOnce(checkpoint_init_net)

Call into C++ to construct the net.

    workspace.CreateNet(model._checkpoint_net)

Run the _checkpoint_net.

    # Run the sync
    log.info("Run checkpoint net")
    workspace.RunNet(model._checkpoint_net.Proto().name)
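
A typical restore flow, sketched (the checkpoint path and db_type are placeholders, and train_model is the hypothetical model from earlier): load the first device's blobs from a checkpoint, then let FinalizeAfterCheckpoint propagate them to the other devices (and shards).

    from caffe2.python import core, data_parallel_model, workspace

    workspace.RunOperatorOnce(core.CreateOperator(
        "Load", [], [],
        absolute_path=1,
        db="/path/to/checkpoint_db",   # placeholder
        db_type="minidb",
        load_all=1,
    ))
    data_parallel_model.FinalizeAfterCheckpoint(train_model)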