MMCV 1.6.0 Runner/Hook: OptimizerHook (backward pass + parameter update), Fp16OptimizerHook, custom optimizers, and config settings

qq_41627642

Last modified: 2024-07-29 15:34:18

First published: 2024-07-29 14:30:43

Article link: https://blog.csdn.net/qq_41627642/article/details/140767156

OptimizerHook

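This code defines a class named OptimizerHook, a hook that holds custom operations for the optimizer. It performs the backward pass and the parameter update, and can optionally clip gradients and detect anomalous parameters, which helps keep training stable and makes debugging easier.

Class definition
OptimizerHook inherits from Hook and implements optimizer-related operations.
Parameters
grad_clip: a dict that configures gradient clipping. Defaults to None.
detect_anomalous_params: a bool used only for debugging; enabling it slows training down. It detects anomalous parameters that are not included in the computational graph rooted at the loss. Defaults to False.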

@HOOKS.register_module()
class OptimizerHook(Hook):
    """A hook contains custom operations for the optimizer.

    Args:
        grad_clip (dict, optional): A config dict to control the clip_grad.
            Default: None.
        detect_anomalous_params (bool): This option is only used for
            debugging which will slow down the training speed.
            Detect anomalous parameters that are not included in
            the computational graph with `loss` as the root.
            There are two cases

                - Parameters were not used during
                  forward pass.
                - Parameters were not used to produce
                  loss.
            Default: False.
    """

    def __init__(self,
                 grad_clip: Optional[dict] = None,
                 detect_anomalous_params: bool = False):
        self.grad_clip = grad_clip
        self.detect_anomalous_params = detect_anomalous_params

    def clip_grads(self, params):
        params = list(
            filter(lambda p: p.requires_grad and p.grad is not None, params))
        if len(params) > 0:
            return clip_grad.clip_grad_norm_(params, **self.grad_clip)

    def after_train_iter(self, runner):
        runner.optimizer.zero_grad()
        if self.detect_anomalous_params:
            self.detect_anomalous_parameters(runner.outputs['loss'], runner)
        runner.outputs['loss'].backward()

        if self.grad_clip is not None:
            grad_norm = self.clip_grads(runner.model.parameters())
            if grad_norm is not None:
                # Add grad norm to the logger
                runner.log_buffer.update({'grad_norm': float(grad_norm)},
                                         runner.outputs['num_samples'])
        runner.optimizer.step()

    def detect_anomalous_parameters(self, loss: Tensor, runner) -> None:
        logger = runner.logger
        parameters_in_graph = set()
        visited = set()

        def traverse(grad_fn):
            if grad_fn is None:
                return
            if grad_fn not in visited:
                visited.add(grad_fn)
                if hasattr(grad_fn, 'variable'):
                    parameters_in_graph.add(grad_fn.variable)
                parents = grad_fn.next_functions
                if parents is not None:
                    for parent in parents:
                        grad_fn = parent[0]
                        traverse(grad_fn)

        traverse(loss.grad_fn)
        for n, p in runner.model.named_parameters():
            if p not in parameters_in_graph and p.requires_grad:
                logger.log(
                    level=logging.ERROR,
                    msg=f'{n} with shape {p.size()} is not '
                    f'in the computational graph \n')

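Main logic
Initialization

Accepts two optional arguments, grad_clip and detect_anomalous_params, and stores them as instance attributes.
clip_grads method

Keeps only the parameters that require gradients and whose grad is not None.
If any remain, clips their gradients with clip_grad.clip_grad_norm_ and returns the total gradient norm.
after_train_iter method

Called after every training iteration.
Zeroes the optimizer's gradients.
If anomalous-parameter detection is enabled, calls detect_anomalous_parameters.
Runs backpropagation to compute the gradients.
If gradient clipping is enabled, calls clip_grads and logs the returned gradient norm (the total norm measured before clipping) as grad_norm.
Calls optimizer.step() to update the model parameters.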
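
The clipping rule can be sketched in plain Python. This is a simplified illustration on lists of floats rather than tensors, and the function name `clip_by_total_norm` is made up for the example. It assumes the behaviour documented for `torch.nn.utils.clip_grad_norm_`: all gradients are treated as one vector, scaled down when their joint norm exceeds `max_norm`, and the norm measured before clipping is returned.

```python
import math


def clip_by_total_norm(grads, max_norm, norm_type=2.0):
    """Scale all gradients in place so their joint norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter.
    Returns the total norm measured BEFORE clipping.
    """
    flat = [g for grad in grads for g in grad]
    total_norm = sum(abs(g) ** norm_type for g in flat) ** (1.0 / norm_type)
    # Only ever scale down, never up.
    clip_coef = min(1.0, max_norm / (total_norm + 1e-6))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= clip_coef
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # joint L2 norm is 5
norm_before = clip_by_total_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))
print(norm_before, norm_after)
```

Because the returned value is the pre-clipping norm, a logged grad_norm far above max_norm is expected and simply shows how often clipping is active.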
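detect_anomalous_parameters method

Detects anomalous parameters that are not part of the computational graph.
Traverses the graph starting from loss.grad_fn and collects the parameters it reaches.
Compares the model's parameters against that set and logs an error for every parameter that requires a gradient but is missing from the graph.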
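
The traversal can be illustrated without PyTorch. The sketch below uses a hypothetical `Node` class as a stand-in for autograd nodes; it only mirrors the two attributes the hook relies on (`next_functions` and, on leaf nodes, `variable`) and walks the graph iteratively instead of recursively.

```python
class Node:
    """Stand-in for an autograd graph node (hypothetical, not a torch class)."""

    def __init__(self, parents=(), variable=None):
        # Mirror the shape of grad_fn.next_functions: a tuple of (node, index).
        self.next_functions = tuple((p, 0) for p in parents)
        if variable is not None:
            # Only leaf nodes that accumulate into a parameter expose .variable
            self.variable = variable


def params_in_graph(root):
    """Collect every parameter reachable from the root node."""
    found, visited = set(), set()
    stack = [root]
    while stack:  # iterative form of the recursive traverse()
        node = stack.pop()
        if node is None or node in visited:
            continue
        visited.add(node)
        if hasattr(node, 'variable'):
            found.add(node.variable)
        stack.extend(parent for parent, _ in node.next_functions)
    return found


# Toy model with three named parameters; 'head.bias' never reaches the loss.
named_params = {'backbone.weight': 'w0', 'head.weight': 'w1', 'head.bias': 'b1'}
leaf_w0 = Node(variable='w0')
leaf_w1 = Node(variable='w1')
loss_node = Node(parents=[Node(parents=[leaf_w0, None]), leaf_w1])

reached = params_in_graph(loss_node)
missing = [n for n, p in named_params.items() if p not in reached]
print(missing)
```

In the real hook the same comparison runs over runner.model.named_parameters(), and only parameters with requires_grad=True are reported.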
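Summary
OptimizerHook provides a flexible way to manage and debug the optimization step. Gradient clipping guards against exploding gradients, and anomalous-parameter detection helps find parameters that never contribute to the loss, for example layers that are defined but unused. This is particularly useful when training and debugging large models.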

Fp16OptimizerHook (optimizer hook with FP16 support)

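This code defines a class named Fp16OptimizerHook, which inherits from OptimizerHook and adds FP16 support. It is used for mixed precision training, which speeds up training and reduces GPU memory usage.

Class definition
Fp16OptimizerHook inherits from OptimizerHook and implements the operations needed for FP16 training.
Parameters
grad_clip: a dict that configures gradient clipping. Defaults to None.
coalesce: a bool indicating whether to coalesce small gradient tensors to improve communication efficiency. Defaults to True.
bucket_size_mb: an int giving the gradient bucket size in MB. Defaults to -1.
loss_scale: a float, str, or dict that configures loss scaling. A float selects static loss scaling with that value; a str must be 'dynamic' and selects dynamic loss scaling; a dict holds the arguments for GradScaler. Defaults to 512.
distributed: a bool indicating whether distributed training is used. Defaults to True.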

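    # In the mmcv source this class sits inside the branch taken for
    # PyTorch >= 1.6, hence the extra level of indentation.
    @HOOKS.register_module()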
    class Fp16OptimizerHook(OptimizerHook):
        """FP16 optimizer hook (using PyTorch's implementation).

        If you are using PyTorch >= 1.6, torch.cuda.amp is used as the backend,
        to take care of the optimization procedure.

        Args:
            loss_scale (float | str | dict): Scale factor configuration.
                If loss_scale is a float, static loss scaling will be used with
                the specified scale. If loss_scale is a string, it must be
                'dynamic', then dynamic loss scaling will be used.
                It can also be a dict containing arguments of GradScaler.
                Defaults to 512. For Pytorch >= 1.6, mmcv uses official
                implementation of GradScaler. If you use a dict version of
                loss_scale to create GradScaler, please refer to:
                https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler
                for the parameters.

        Examples:
            >>> loss_scale = dict(
            ...     init_scale=65536.0,
            ...     growth_factor=2.0,
            ...     backoff_factor=0.5,
            ...     growth_interval=2000
            ... )
            >>> optimizer_hook = Fp16OptimizerHook(loss_scale=loss_scale)
        """

        def __init__(self,
                     grad_clip: Optional[dict] = None,
                     coalesce: bool = True,
                     bucket_size_mb: int = -1,
                     loss_scale: Union[float, str, dict] = 512.,
                     distributed: bool = True):
            self.grad_clip = grad_clip
            self.coalesce = coalesce
            self.bucket_size_mb = bucket_size_mb
            self.distributed = distributed
            self._scale_update_param = None
            if loss_scale == 'dynamic':
                self.loss_scaler = GradScaler()
            elif isinstance(loss_scale, float):
                self._scale_update_param = loss_scale
                self.loss_scaler = GradScaler(init_scale=loss_scale)
            elif isinstance(loss_scale, dict):
                self.loss_scaler = GradScaler(**loss_scale)
            else:
                raise ValueError('loss_scale must be of type float, dict, or '
                                 f'"dynamic", got {loss_scale}')

        def before_run(self, runner) -> None:
            """Preparing steps before Mixed Precision Training."""
            # wrap model mode to fp16
            wrap_fp16_model(runner.model)
            # resume from state dict
            if 'fp16' in runner.meta and 'loss_scaler' in runner.meta['fp16']:
                scaler_state_dict = runner.meta['fp16']['loss_scaler']
                self.loss_scaler.load_state_dict(scaler_state_dict)

        def copy_grads_to_fp32(self, fp16_net: nn.Module,
                               fp32_weights: Tensor) -> None:
            """Copy gradients from fp16 model to fp32 weight copy."""
            for fp32_param, fp16_param in zip(fp32_weights,
                                              fp16_net.parameters()):
                if fp16_param.grad is not None:
                    if fp32_param.grad is None:
                        fp32_param.grad = fp32_param.data.new(
                            fp32_param.size())
                    fp32_param.grad.copy_(fp16_param.grad)

        def copy_params_to_fp16(self, fp16_net: nn.Module,
                                fp32_weights: Tensor) -> None:
            """Copy updated params from fp32 weight copy to fp16 model."""
            for fp16_param, fp32_param in zip(fp16_net.parameters(),
                                              fp32_weights):
                fp16_param.data.copy_(fp32_param.data)

        def after_train_iter(self, runner) -> None:
            """Backward optimization steps for Mixed Precision Training. For
            dynamic loss scaling, please refer to
            https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.

            1. Scale the loss by a scale factor.
            2. Backward the loss to obtain the gradients.
            3. Unscale the optimizer’s gradient tensors.
            4. Call optimizer.step() and update scale factor.
            5. Save loss_scaler state_dict for resume purpose.
            """
            # clear grads of last iteration
            runner.model.zero_grad()
            runner.optimizer.zero_grad()

            self.loss_scaler.scale(runner.outputs['loss']).backward()
            self.loss_scaler.unscale_(runner.optimizer)
            # grad clip
            if self.grad_clip is not None:
                grad_norm = self.clip_grads(runner.model.parameters())
                if grad_norm is not None:
                    # Add grad norm to the logger
                    runner.log_buffer.update({'grad_norm': float(grad_norm)},
                                             runner.outputs['num_samples'])
            # backward and update scaler
            self.loss_scaler.step(runner.optimizer)
            self.loss_scaler.update(self._scale_update_param)

            # save state_dict of loss_scaler
            runner.meta.setdefault(
                'fp16', {})['loss_scaler'] = self.loss_scaler.state_dict()

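Main logic
Initialization

Accepts five arguments: grad_clip, coalesce, bucket_size_mb, loss_scale, and distributed. All except loss_scale are stored as instance attributes.
Creates the GradScaler object according to the type of loss_scale.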
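
The branching on loss_scale can be isolated in a small pure-Python sketch. The helper name `scaler_kwargs` is invented for illustration; it returns the keyword arguments that would go to GradScaler together with the value stored in _scale_update_param, following the __init__ shown above.

```python
def scaler_kwargs(loss_scale):
    """Return (GradScaler keyword arguments, fixed scale passed to update())."""
    if loss_scale == 'dynamic':
        return {}, None
    elif isinstance(loss_scale, float):
        # Static scaling: the same value is passed to update() every iteration.
        return {'init_scale': loss_scale}, loss_scale
    elif isinstance(loss_scale, dict):
        return dict(loss_scale), None
    raise ValueError('loss_scale must be of type float, dict, or '
                     f'"dynamic", got {loss_scale}')


print(scaler_kwargs(512.))
print(scaler_kwargs('dynamic'))
print(scaler_kwargs(dict(init_scale=65536.0, growth_interval=2000)))

try:
    scaler_kwargs(512)  # an int is not a float, so it is rejected
    int_accepted = True
except ValueError:
    int_accepted = False
print(int_accepted)
```

Two details follow from the check order: a static scale must be written as a float (512., not 512), and in the static case the stored value is handed to update() after each step, which resets the scale and keeps it constant.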
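before_run method

Preparation steps before mixed precision training starts.
Wraps the model for FP16 with wrap_fp16_model.
If runner.meta contains a saved loss_scaler state, restores it.
copy_grads_to_fp32 method

Copies the gradients of the FP16 model to the FP32 weight copy.
copy_params_to_fp16 method

Copies the updated parameters from the FP32 weight copy back to the FP16 model.
after_train_iter method

Called after every training iteration.
Zeroes the gradients left over from the previous iteration.
Scales the loss and backpropagates it to compute the gradients.
Unscales the optimizer's gradient tensors.
If gradient clipping is enabled, calls clip_grads and logs the returned gradient norm as grad_norm.
Calls optimizer.step() through the scaler and updates the scale factor.
Saves the loss_scaler state dict in runner.meta so that training can be resumed.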
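
How the scale factor evolves under dynamic loss scaling can be shown with a toy class. `ToyScaler` is a pure-Python imitation written for this example, not torch's GradScaler; it assumes the documented GradScaler behaviour: the step is skipped and the scale multiplied by backoff_factor when the gradients contain inf/NaN, and the scale is multiplied by growth_factor after growth_interval consecutive clean steps.

```python
class ToyScaler:
    """Pure-Python imitation of dynamic loss scaling (not torch's GradScaler)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0
        self._found_inf = False

    def step(self, grads_finite):
        """Return True if the optimizer step would be applied."""
        self._found_inf = not grads_finite
        return grads_finite

    def update(self):
        if self._found_inf:
            # Overflow: shrink the scale and restart the streak.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0


scaler = ToyScaler(init_scale=1024.0, growth_interval=3)
scales, applied = [], []
for grads_finite in [True, False, True, True, True]:
    applied.append(scaler.step(grads_finite))
    scaler.update()
    scales.append(scaler.scale)
print(applied)
print(scales)
```

The small growth_interval is chosen only to make the effect visible in five iterations; the docstring example above uses 2000.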
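Summary
Fp16OptimizerHook provides a flexible way to run mixed precision training. By relying on PyTorch's torch.cuda.amp module it can speed up training noticeably and reduce GPU memory usage. It keeps the gradient clipping of OptimizerHook. Note, however, that the after_train_iter shown above does not call detect_anomalous_parameters and the __init__ shown above does not set detect_anomalous_params, so this hook does not perform anomalous-parameter detection.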

# fp16 settings
fp16 = dict(loss_scale=512.)

You can also set fp16 = dict(loss_scale='dynamic') to enable dynamic loss scaling.
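
Since loss_scale may also be a dict, the same config field can carry GradScaler arguments. The fragment below is a sketch that reuses the values from the Fp16OptimizerHook docstring example; it assumes the fp16 dict is forwarded to the hook unchanged, as the two settings above suggest.

```python
# fp16 settings, dict form: the keys are GradScaler arguments
fp16 = dict(
    loss_scale=dict(
        init_scale=65536.0,
        growth_factor=2.0,
        backoff_factor=0.5,
        growth_interval=2000))
```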

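Customize optimizers supported by PyTorch

All optimizers implemented by PyTorch are already supported; the only change needed is the optimizer field of the config file. For example, to use Adam (note that performance may drop considerably), modify it as follows.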

optimizer = dict(type='Adam', lr=0.0003, weight_decay=0.0001)

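To change the learning rate of the model, simply modify lr in the optimizer config. The remaining arguments can be set directly according to the PyTorch API documentation.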

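Customize self-implemented optimizers

1. Define a new optimizer

A customized optimizer can be defined as follows. Suppose you want to add an optimizer named MyOptimizer with arguments a, b, and c. You need to create a new directory named mmdet/core/optimizer, and then implement the new optimizer in a file, e.g. mmdet/core/optimizer/my_optimizer.py: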

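from mmcv.runner.optimizer import OPTIMIZERS
from torch.optim import Optimizer


@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):

    def __init__(self, params, a, b, c):
        # params comes first, as in every torch.optim optimizer;
        # a, b and c are stored as per-group defaults.
        super().__init__(params, defaults=dict(a=a, b=b, c=c))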

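2. Add the optimizer to the registry

For the module defined above to be found, it must first be imported into the main namespace. There are two ways to do this.
(1) Modify mmdet/core/optimizer/__init__.py to import it. The newly defined module should be imported in mmdet/core/optimizer/__init__.py so that the registry finds the new module and adds it: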

from .my_optimizer import MyOptimizer

(2) Use custom_imports in the config to manually import it

custom_imports = dict(imports=['mmdet.core.optimizer.my_optimizer'], allow_failed_imports=False)

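The module mmdet.core.optimizer.my_optimizer is imported at the beginning of the program, and the class MyOptimizer is then registered automatically. Note that only the module containing the MyOptimizer class should be listed; mmdet.core.optimizer.my_optimizer.MyOptimizer cannot be imported directly.
In fact, users can use an entirely different directory structure with this import method, as long as the module root can be located in PYTHONPATH.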
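
Why the import matters can be seen with a toy registry. `ToyRegistry` below is a minimal stand-in written for this example, not mmcv's Registry: registration happens as a side effect of executing the decorated class definition, so a class whose module was never imported cannot be resolved by name.

```python
class ToyRegistry:
    """Minimal name -> class registry (illustrative, not mmcv's Registry)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        args = dict(cfg)  # copy so the caller's config is untouched
        cls_name = args.pop('type')
        if cls_name not in self._modules:
            raise KeyError(f'{cls_name} is not in the {self.name} registry')
        return self._modules[cls_name](**args)


TOY_OPTIMIZERS = ToyRegistry('optimizer')

# Before the defining code runs, the name cannot be resolved.
try:
    TOY_OPTIMIZERS.build(dict(type='MyOptimizer', a=1, b=2, c=3))
    found_before = True
except KeyError:
    found_before = False


# Executing the class definition (which is what importing its module does)
# performs the registration.
@TOY_OPTIMIZERS.register_module()
class MyOptimizer:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


opt = TOY_OPTIMIZERS.build(dict(type='MyOptimizer', a=1, b=2, c=3))
print(found_before, type(opt).__name__, opt.a)
```

Both options above, the import in __init__.py and custom_imports, exist only to make sure the decorated definition is executed before the config is built.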

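3. Specify the optimizer in the config file

You can then use MyOptimizer in the optimizer field of config files. In the configs, optimizers are defined by the optimizer field, as follows: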

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

To use your own optimizer, the field can be changed to

optimizer = dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value)

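Customize the optimizer constructor

Some models need parameter-specific optimization settings, such as weight decay for BatchNorm layers. Users can make these fine-grained adjustments by customizing the optimizer constructor.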

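from mmcv.utils import build_from_cfg

from mmcv.runner.optimizer import OPTIMIZER_BUILDERS, OPTIMIZERS
from mmdet.utils import get_root_logger
from .my_optimizer import MyOptimizer


@OPTIMIZER_BUILDERS.register_module()
class MyOptimizerConstructor(object):

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        self.optimizer_cfg = optimizer_cfg
        self.paramwise_cfg = paramwise_cfg

    def __call__(self, model):
        # Build custom parameter groups here if needed; this placeholder
        # simply passes all model parameters to the configured optimizer.
        cfg = dict(self.optimizer_cfg, params=model.parameters())
        my_optimizer = build_from_cfg(cfg, OPTIMIZERS)
        return my_optimizer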

The default optimizer constructor is implemented in mmcv/mmcv/runner/optimizer/default_constructor.py, and it can also serve as a template for new optimizer constructors.

@OPTIMIZER_BUILDERS.register_module()
class DefaultOptimizerConstructor:
    """Default constructor for optimizers.

    By default each parameter share the same optimizer settings, and we
    provide an argument ``paramwise_cfg`` to specify parameter-wise settings.
    It is a dict and may contain the following fields:

    - ``custom_keys`` (dict): Specified parameters-wise settings by keys. If
      one of the keys in ``custom_keys`` is a substring of the name of one
      parameter, then the setting of the parameter will be specified by
      ``custom_keys[key]`` and other setting like ``bias_lr_mult`` etc. will
      be ignored. It should be noted that the aforementioned ``key`` is the
      longest key that is a substring of the name of the parameter. If there
      are multiple matched keys with the same length, then the key with lower
      alphabet order will be chosen.
      ``custom_keys[key]`` should be a dict and may contain fields ``lr_mult``
      and ``decay_mult``. See Example 2 below.
    - ``bias_lr_mult`` (float): It will be multiplied to the learning
      rate for all bias parameters (except for those in normalization
      layers).
    - ``bias_decay_mult`` (float): It will be multiplied to the weight
      decay for all bias parameters (except for those in
      normalization layers and depthwise conv layers).
    - ``norm_decay_mult`` (float): It will be multiplied to the weight
      decay for all weight and bias parameters of normalization
      layers.
    - ``dwconv_decay_mult`` (float): It will be multiplied to the weight
      decay for all weight and bias parameters of depthwise conv
      layers.
    - ``bypass_duplicate`` (bool): If true, the duplicate parameters
      would not be added into optimizer. Default: False.

    Args:
        model (:obj:`nn.Module`): The model with parameters to be optimized.
        optimizer_cfg (dict): The config dict of the optimizer.
            Positional fields are

                - `type`: class name of the optimizer.

            Optional fields are

                - any arguments of the corresponding optimizer type, e.g.,
                  lr, weight_decay, momentum, etc.
        paramwise_cfg (dict, optional): Parameter-wise options.

    Example 1:
        >>> model = torch.nn.modules.Conv1d(1, 1, 1)
        >>> optimizer_cfg = dict(type='SGD', lr=0.01, momentum=0.9,
        >>>                      weight_decay=0.0001)
        >>> paramwise_cfg = dict(norm_decay_mult=0.)
        >>> optim_builder = DefaultOptimizerConstructor(
        >>>     optimizer_cfg, paramwise_cfg)
        >>> optimizer = optim_builder(model)

    Example 2:
        >>> # assume model have attribute model.backbone and model.cls_head
        >>> optimizer_cfg = dict(type='SGD', lr=0.01, weight_decay=0.95)
        >>> paramwise_cfg = dict(custom_keys={
                '.backbone': dict(lr_mult=0.1, decay_mult=0.9)})
        >>> optim_builder = DefaultOptimizerConstructor(
        >>>     optimizer_cfg, paramwise_cfg)
        >>> optimizer = optim_builder(model)
        >>> # Then the `lr` and `weight_decay` for model.backbone is
        >>> # (0.01 * 0.1, 0.95 * 0.9). `lr` and `weight_decay` for
        >>> # model.cls_head is (0.01, 0.95).
    """

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        if not isinstance(optimizer_cfg, dict):
            raise TypeError('optimizer_cfg should be a dict',
                            f'but got {type(optimizer_cfg)}')
        self.optimizer_cfg = optimizer_cfg
        self.paramwise_cfg = {} if paramwise_cfg is None else paramwise_cfg
        self.base_lr = optimizer_cfg.get('lr', None)
        self.base_wd = optimizer_cfg.get('weight_decay', None)
        self._validate_cfg()
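
The custom_keys rule from the docstring (longest matching key wins, ties broken alphabetically) can be sketched in a few lines. The helper `resolve_param_settings` and the parameter names are hypothetical, and the sketch covers only custom_keys, not bias_lr_mult, norm_decay_mult, or the other fields.

```python
import math


def resolve_param_settings(param_name, base_lr, base_wd, custom_keys):
    """Return (lr, weight_decay) for one parameter name."""
    # Longest key first; ties are broken alphabetically because sorted() is
    # stable and the inner sort is alphabetical.
    sorted_keys = sorted(sorted(custom_keys), key=len, reverse=True)
    for key in sorted_keys:
        if key in param_name:
            lr_mult = custom_keys[key].get('lr_mult', 1.0)
            decay_mult = custom_keys[key].get('decay_mult', 1.0)
            return base_lr * lr_mult, base_wd * decay_mult
    return base_lr, base_wd


custom_keys = {
    'backbone': dict(lr_mult=0.1, decay_mult=0.9),
    'backbone.layer4': dict(lr_mult=0.5),
}
for name in ['backbone.layer1.conv.weight',
             'backbone.layer4.conv.weight',
             'cls_head.fc.weight']:
    print(name, resolve_param_settings(name, 0.01, 0.95, custom_keys))
```

The second name matches both keys, and the longer one, 'backbone.layer4', decides its settings; the third matches nothing and keeps the base values.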

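Additional settings

Use gradient clipping to stabilize training:

Some models need gradient clipping to stabilize the training process. An example is as follows: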

optimizer_config = dict(
    _delete_=True, grad_clip=dict(max_norm=35, norm_type=2))

If your config inherits a base config that already sets optimizer_config, you may need _delete_=True to override the unnecessary settings.
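
The effect of _delete_ can be illustrated with a simplified merge function. `merge_cfg` is a sketch written for this example, not mmcv's Config implementation, and the key legacy_option in the base config is hypothetical: without _delete_ the child dict is merged into the inherited one, with _delete_=True the inherited dict is discarded first.

```python
def merge_cfg(base, child):
    """Merge child into base; a dict carrying _delete_=True replaces its base."""
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict):
            value = dict(value)  # copy so the caller's dict is not modified
            delete = value.pop('_delete_', False)
            if delete or not isinstance(merged.get(key), dict):
                merged[key] = merge_cfg({}, value)
            else:
                merged[key] = merge_cfg(merged[key], value)
        else:
            merged[key] = value
    return merged


base = dict(optimizer_config=dict(grad_clip=None, legacy_option=1))
clip = dict(max_norm=35, norm_type=2)

plain = merge_cfg(base, dict(optimizer_config=dict(grad_clip=clip)))
deleted = merge_cfg(
    base, dict(optimizer_config=dict(_delete_=True, grad_clip=clip)))
print(plain['optimizer_config'])
print(deleted['optimizer_config'])
```

In the first result the inherited legacy_option survives next to grad_clip; in the second only grad_clip remains.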

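Use a momentum schedule to accelerate model convergence:

Momentum scheduling is supported: it modifies the model's momentum according to the learning rate, which can make the model converge faster. A momentum scheduler is usually used together with an LR scheduler; for example, the following config is used in 3D detection to accelerate convergence. For more details, refer to the implementations of CyclicLrUpdater and CyclicMomentumUpdater.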

lr_config = dict(
    policy='cyclic',
    target_ratio=(10, 1e-4),
    cyclic_times=1,
    step_ratio_up=0.4,
)
momentum_config = dict(
    policy='cyclic',
    target_ratio=(0.85 / 0.95, 1),
    cyclic_times=1,
    step_ratio_up=0.4,
)
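
To see what these numbers mean, the one-cycle shape can be approximated in plain Python. This is a simplified sketch, not the hook implementation: it assumes cosine annealing between the phase endpoints and a single cycle (cyclic_times=1), and the base values 0.001 for lr and 0.95 for momentum as well as the 1000-iteration length are made up for the example.

```python
import math


def annealing_cos(start, end, factor):
    """Cosine interpolation from start (factor=0) to end (factor=1)."""
    return end + 0.5 * (start - end) * (math.cos(math.pi * factor) + 1)


def cyclic_value(base, target_ratio, step_ratio_up, cur_iter, max_iters):
    """Value of a one-cycle schedule at cur_iter (cyclic_times=1 assumed)."""
    up_iters = int(step_ratio_up * max_iters)
    if cur_iter < up_iters:
        # Phase 1: base -> base * target_ratio[0]
        return annealing_cos(base, base * target_ratio[0], cur_iter / up_iters)
    # Phase 2: base * target_ratio[0] -> base * target_ratio[1]
    progress = (cur_iter - up_iters) / (max_iters - up_iters)
    return annealing_cos(base * target_ratio[0], base * target_ratio[1],
                         progress)


max_iters = 1000
for it in (0, 200, 400, 700, 1000):
    lr = cyclic_value(0.001, (10, 1e-4), 0.4, it, max_iters)
    momentum = cyclic_value(0.95, (0.85 / 0.95, 1), 0.4, it, max_iters)
    print(it, lr, momentum)
```

Under these assumptions the learning rate rises to ten times its base value during the first 40% of the iterations and then decays to 1e-4 of the base, while the momentum moves in the opposite direction: it dips from 0.95 to 0.85 at the learning-rate peak and returns to 0.95 at the end.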