Training with EMA (Exponential Moving Average)

ZhengXinTang

Last modified: 2024-01-20 15:52:05

First published: 2023-11-12 16:05:05

Article link: https://blog.csdn.net/chumingqian/article/details/134360843

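1. Introduction to EMA

The class implementations shown here come from timm == 0.6.11.

Exponential Moving Average (EMA) for models in PyTorch.
Purpose: EMA maintains a moving average of the model's state dict, covering both parameters and buffers. The technique is commonly used in training schemes where a smoothed version of the weights is needed to reach the best performance.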
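
As a minimal sketch of the idea in plain Python (no PyTorch; the function name `ema_step` and the sample values are illustrative assumptions, not part of timm), this is the update an EMA applies to each value it tracks:

```python
def ema_step(ema_value, new_value, decay):
    """One EMA update: keep `decay` of the old average, blend in the rest."""
    return decay * ema_value + (1.0 - decay) * new_value


# Hypothetical noisy "weight" trajectory that oscillates around 1.0.
raw_values = [1.4, 0.6, 1.3, 0.7, 1.2, 0.8]

ema = raw_values[0]  # start the average at the first observation
smoothed = []
for value in raw_values:
    ema = ema_step(ema, value, decay=0.9)
    smoothed.append(ema)

# The smoothed series swings far less than the raw series.
raw_spread = max(raw_values) - min(raw_values)
smoothed_spread = max(smoothed) - min(smoothed)
print(f"raw spread: {raw_spread:.3f}, smoothed spread: {smoothed_spread:.3f}")
```

The classes in this article apply the same formula to every tensor in the model's state dict.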

1.1 The v1 version


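import logging
from collections import OrderedDict
from copy import deepcopy

import torch

_logger = logging.getLogger(__name__)


class ModelEma: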
    """ Model Exponential Moving Average (DEPRECATED)

    Keep a moving average of everything in the model state_dict (parameters and buffers).
    This version is deprecated, it does not work with scripted models. Will be removed eventually.

    This is intended to allow functionality like
    https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage

    A smoothed version of the weights is necessary for some training schemes to perform well.
    E.g. Google's hyper-params for training MNASNet, MobileNet-V3, EfficientNet, etc that use
    RMSprop with a short 2.4-3 epoch decay period and slow LR decay rate of .96-.99 requires EMA
    smoothing of weights to match results. Pay attention to the decay constant you are using
    relative to your update count per epoch.

    To keep EMA from using GPU resources, set device='cpu'. This will save a bit of memory but
    disable validation of the EMA weights. Validation will have to be done manually in a separate
    process, or after the training stops converging.

    This class is sensitive where it is initialized in the sequence of model init,
    GPU assignment and distributed training wrappers.
    """
    def __init__(self, model, decay=0.9999, device='', resume=''):
        # make a copy of the model for accumulating moving average of weights
        self.ema = deepcopy(model)
        self.ema.eval()
        self.decay = decay
        self.device = device  # perform ema on different device from model if set
        if device:
            self.ema.to(device=device)
        self.ema_has_module = hasattr(self.ema, 'module')
        if resume:
            self._load_checkpoint(resume)
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def _load_checkpoint(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        assert isinstance(checkpoint, dict)
        if 'state_dict_ema' in checkpoint:
            new_state_dict = OrderedDict()
            for k, v in checkpoint['state_dict_ema'].items():
                # ema model may have been wrapped by DataParallel, and need module prefix
                if self.ema_has_module:
                    name = 'module.' + k if not k.startswith('module') else k
                else:
                    name = k
                new_state_dict[name] = v
            self.ema.load_state_dict(new_state_dict)
            _logger.info("Loaded state_dict_ema")
        else:
            _logger.warning("Failed to find state_dict_ema, starting from loaded model weights")

    def update(self, model):
        # correct a mismatch in state dict keys
        needs_module = hasattr(model, 'module') and not self.ema_has_module
        with torch.no_grad():
            msd = model.state_dict()
            for k, ema_v in self.ema.state_dict().items():
                if needs_module:
                    k = 'module.' + k
                model_v = msd[k].detach()
                if self.device:
                    model_v = model_v.to(device=self.device)
                ema_v.copy_(ema_v * self.decay + (1. - self.decay) * model_v)

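Methods:

__init__: initializes the EMA model by making a copy of the supplied model and setting the decay rate and device placement. The copy is put in evaluation mode and gradients are disabled for its parameters.

_load_checkpoint: loads a checkpoint for the EMA model. It handles differences in state dict key naming caused by the DataParallel wrapper.

update:
updates the EMA parameters as a weighted average of the current EMA parameters and the original model's parameters.

Features:

The model and its EMA counterpart can live on different devices.
Handles state dict key mismatches caused by the DataParallel wrapper.
The v1 version is deprecated because it does not work with scripted models.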
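
The key mismatch is just a `module.` prefix on every state dict key. The sketch below uses a hypothetical helper, `align_module_prefix`, which is not part of timm and handles both directions, whereas ModelEma only adds the prefix. Plain integers stand in for tensors so it runs without PyTorch:

```python
from collections import OrderedDict


def align_module_prefix(state_dict, target_has_module):
    """Return a copy of `state_dict` whose keys match the target model.

    Hypothetical helper: adds or strips the 'module.' prefix that a
    DataParallel-style wrapper puts in front of every key.
    """
    prefix = "module."
    aligned = OrderedDict()
    for key, value in state_dict.items():
        has_prefix = key.startswith(prefix)
        if target_has_module and not has_prefix:
            key = prefix + key
        elif not target_has_module and has_prefix:
            key = key[len(prefix):]
        aligned[key] = value
    return aligned


# Placeholder values stand in for tensors.
plain = OrderedDict([("conv.weight", 1), ("conv.bias", 2)])
wrapped = align_module_prefix(plain, target_has_module=True)
print(list(wrapped))
print(list(align_module_prefix(wrapped, target_has_module=False)))
```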

1.2 The v2 version

import logging
from collections import OrderedDict
from copy import deepcopy

import torch
import torch.nn as nn

_logger = logging.getLogger(__name__)

class ModelEmaV2(nn.Module):
    """ Model Exponential Moving Average V2

    Keep a moving average of everything in the model state_dict (parameters and buffers).
    V2 of this module is simpler, it does not match params/buffers based on name but simply
    iterates in order. It works with torchscript (JIT of full model).

    This is intended to allow functionality like
    https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage

    A smoothed version of the weights is necessary for some training schemes to perform well.
    E.g. Google's hyper-params for training MNASNet, MobileNet-V3, EfficientNet, etc that use
    RMSprop with a short 2.4-3 epoch decay period and slow LR decay rate of .96-.99 requires EMA
    smoothing of weights to match results. Pay attention to the decay constant you are using
    relative to your update count per epoch.

    To keep EMA from using GPU resources, set device='cpu'. This will save a bit of memory but
    disable validation of the EMA weights. Validation will have to be done manually in a separate
    process, or after the training stops converging.

    This class is sensitive where it is initialized in the sequence of model init,
    GPU assignment and distributed training wrappers.
    """
    def __init__(self, model, decay=0.9999, device=None):
        super(ModelEmaV2, self).__init__()
        # make a copy of the model for accumulating moving average of weights
        self.module = deepcopy(model)
        self.module.eval()
        self.decay = decay
        self.device = device  # perform ema on different device from model if set
        if self.device is not None:
            self.module.to(device=device)

    def _update(self, model, update_fn):
        with torch.no_grad():
            for ema_v, model_v in zip(self.module.state_dict().values(), model.state_dict().values()):
                if self.device is not None:
                    model_v = model_v.to(device=self.device)
                ema_v.copy_(update_fn(ema_v, model_v))

    def update(self, model):  # blend the EMA parameters toward the model using the decay rate
        self._update(model, update_fn=lambda e, m: self.decay * e + (1. - self.decay) * m)

    def set(self, model):  # overwrite the EMA parameters with the given model's parameters
        self._update(model, update_fn=lambda e, m: m)

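ModelEmaV2: similar to ModelEma, but with a simpler implementation. It also maintains a moving average of the model's state dict, and it is designed to work with torchscript (JIT of the full model).

Methods:

__init__: similar to ModelEma, but adds a call to super() to initialize the nn.Module base class. It has no resume argument and does not call requires_grad_(False) on the copy.

_update: helper that updates the EMA parameters, taking a custom update function as an argument.

update: updates the EMA parameters using the decay rate.

set: sets the EMA parameters to exactly match the parameters of the supplied model.

Features:

A simpler, more direct implementation than ModelEma.
Compatible with torchscript.
Matches parameters by their order rather than by name.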
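
To illustrate matching by order, the sketch below pairs state dict entries by position with `zip`, as `_update` does, and checks the result against values computed by key name. The two small `nn.Sequential` models are stand-ins chosen for illustration. The two approaches agree only because both models have the same architecture, so their state dicts list entries in the same order:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
ema_copy = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))  # same architecture
decay = 0.9

# Reference values computed by key name, the way ModelEma (v1) pairs entries.
model_sd = model.state_dict()
expected = {
    k: decay * v.float() + (1.0 - decay) * model_sd[k].float()
    for k, v in ema_copy.state_dict().items()
}

# V2-style update: pair entries purely by position in the state dict.
with torch.no_grad():
    for ema_v, model_v in zip(ema_copy.state_dict().values(), model.state_dict().values()):
        ema_v.copy_(decay * ema_v + (1.0 - decay) * model_v)

matches = all(
    torch.allclose(v.float(), expected[k]) for k, v in ema_copy.state_dict().items()
)
print("positional update matches name-based update:", matches)
```

If the two models differed in structure, `zip` would silently pair unrelated tensors, so the EMA copy should always be created from the model it tracks.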

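Differences between v1 and v2:

Design complexity: ModelEmaV2 is simpler and more direct because it avoids matching parameters by name.
Compatibility: unlike ModelEma, ModelEmaV2 works with torchscript.
Parameter matching: ModelEma matches parameters and buffers by name, whereas ModelEmaV2 matches them by their order in the state dict.
Versioning and use cases: ModelEma is deprecated and is a poor fit for newer training schemes, especially those that need scripting.
Both classes serve essentially the same purpose but take different approaches, which makes ModelEmaV2 the better fit for modern PyTorch workflows that use scripting.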

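2. Usage

Using ModelEmaV2 during training differs slightly from using ModelEma. This section is a step-by-step guide to integrating ModelEmaV2 into an existing training loop, and it also explains the role of the decay parameter and the use of pretrained weights.

Because the v1 version is deprecated, only V2 is covered here.

2.1 Steps

2.1.1 Initializing the EMA class

Initialization:
Define your own model first.
Then pass the model as an argument when instantiating ModelEmaV2, and set decay to suit your training strategy. One way to choose is to run with two different values, for example 0.9 and then 0.5, and see whether the higher or the lower decay works better for your setup.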

model = YourModel()  # Replace with your model
ema = ModelEmaV2(model, decay=0.9999)

Device placement: if you train on a specific device such as a GPU, make sure both your model and the EMA model are moved to that device.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
ema.module.to(device)

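2.1.2 Training phase

Note that during training the model you call is still your own original model (self.net in the code here).

Only after the loss has been backpropagated and the optimizer has updated the parameters do you pass the model to the EMA object and call its update() function, which applies the moving-average update to the EMA parameters.

In other words, during training the EMA is invoked after backpropagation and the parameter update.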

for i, (spec, cof, label) in enumerate(tqdm(self.train_data_loader, desc='training process')):
    spec_data, cof_data, label = spec.cuda().float(), cof.cuda().float(), label.long().cuda()

    self.optimizer.zero_grad()
    model_out = self.net(spec_data, cof_data)
    # assumes the loss function is stored as self.criterion
    loss = self.criterion(model_out, label)
    loss.backward()
    self.optimizer.step()
    # update the EMA model only after the backward pass and the optimizer step
    self.ema.update(self.net)

To repeat: the EMA model must be updated only after each backward pass and optimizer step.

for epoch in range(num_epochs):
    for batch in dataloader:
        inputs, targets = batch
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        ema.update(model)
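
For a version of this loop that runs on its own, here is a small sketch on synthetic data. `SimpleEma` is an illustrative stand-in that mirrors the update rule of ModelEmaV2; it is not the timm class, and the model, data, learning rate and decay are all assumptions chosen to keep the example tiny:

```python
from copy import deepcopy

import torch
import torch.nn as nn


class SimpleEma:
    """Minimal stand-in for ModelEmaV2 (illustrative, not the timm class)."""

    def __init__(self, model, decay=0.9999):
        self.module = deepcopy(model).eval()
        self.decay = decay

    @torch.no_grad()
    def update(self, model):
        for ema_v, model_v in zip(self.module.state_dict().values(), model.state_dict().values()):
            ema_v.copy_(self.decay * ema_v + (1.0 - self.decay) * model_v)


torch.manual_seed(0)
model = nn.Linear(2, 1)
ema = SimpleEma(model, decay=0.9)
initial_weight = model.weight.detach().clone()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
inputs = torch.randn(32, 2)                  # synthetic inputs
targets = inputs.sum(dim=1, keepdim=True)    # synthetic target: x0 + x1

for step in range(20):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()     # 1) update the raw weights
    ema.update(model)    # 2) then fold them into the moving average

# How far each set of weights has moved from the shared starting point.
raw_shift = (model.weight.detach() - initial_weight).abs().sum().item()
ema_shift = (ema.module.weight.detach() - initial_weight).abs().sum().item()
print(f"raw shift: {raw_shift:.4f}, ema shift: {ema_shift:.4f}")
```

The EMA copy never receives gradients of its own. It changes only through `update`, which is why that call has to follow `optimizer.step()`.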

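2.1.3 Inference phase

The place where the EMA weights are actually used is inference. The averaged weights are usually better suited to prediction, so once training has produced them, evaluation runs on the EMA model rather than on the raw model.

Validation: run validation with the EMA-updated weights.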

ema.module.eval()  # Set EMA model to evaluation mode
with torch.no_grad():
    for batch in validation_dataloader:
        inputs, targets = batch
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = ema.module(inputs)  # Use EMA model for predictions
        # Compute validation metrics

2.1.4 Saving parameters

Checkpointing: save the state dicts of both the regular model and the EMA model.

torch.save({
    'model_state_dict': model.state_dict(),
    'ema_state_dict': ema.module.state_dict(),
    # ... other states like optimizer, epoch, etc.
}, 'checkpoint.pth')

Resuming training: to resume from a checkpoint, load both state dicts.

checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
ema.module.load_state_dict(checkpoint['ema_state_dict'])
# Load other states
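
The save and restore steps can be checked end to end with a small round trip. This sketch writes to an in-memory buffer instead of `checkpoint.pth`, and a plain `nn.Linear` copy stands in for `ema.module`; both choices are assumptions made so the example needs no files and no timm install:

```python
import io
from copy import deepcopy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)
ema_module = deepcopy(model)          # stands in for ema.module
with torch.no_grad():
    ema_module.weight.mul_(0.5)       # make the EMA copy differ from the model

# Save both state dicts into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
torch.save(
    {"model_state_dict": model.state_dict(), "ema_state_dict": ema_module.state_dict()},
    buffer,
)

# Restore into freshly constructed modules, mapping tensors to the CPU first.
buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu")
restored_model = nn.Linear(3, 2)
restored_ema = nn.Linear(3, 2)
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_ema.load_state_dict(checkpoint["ema_state_dict"])

print("model restored:", torch.equal(restored_model.weight, model.weight))
print("ema restored:", torch.equal(restored_ema.weight, ema_module.weight))
```

Keeping the two state dicts under separate keys matters: restoring the EMA from the raw model's weights would throw away the averaging history.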

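2.2 Effect of the decay parameter

The decay parameter of ModelEmaV2 plays a crucial role:

It determines how much weight the moving average gives to the current model parameters relative to the historical ones.

A higher decay (close to 1) gives the historical parameters more weight, so the EMA weights are updated more smoothly and more slowly.
A lower decay makes the EMA weights more sensitive to recent changes in the model parameters.

The right decay depends on your training dynamics and on the total number of training steps. A fixed decay is the simplest choice. When the decay is varied over time, the usual approach is a warmup that starts from a lower effective decay and raises it toward the target value, so that the untrained initial weights are forgotten quickly.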
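
One way to vary the decay over time is a warmup schedule. The sketch below uses the formula that TensorFlow documents for `tf.train.ExponentialMovingAverage` when `num_updates` is supplied, `min(decay, (1 + num_updates) / (10 + num_updates))`. ModelEmaV2 as listed in this article does not do this by itself, so the function `warmup_decay` is a hypothetical helper you would have to wire in by hand, for example by assigning its result to `ema.decay` before each update:

```python
def warmup_decay(target_decay, num_updates):
    """Effective decay after `num_updates` EMA steps (TensorFlow-style warmup)."""
    return min(target_decay, (1.0 + num_updates) / (10.0 + num_updates))


# The effective decay starts low and climbs until it is capped at the target.
for n in (0, 10, 100, 1000, 100000):
    print(n, round(warmup_decay(0.9999, n), 4))
```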

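In more detail:
Higher decay (close to 1): when decay is set close to 1, the EMA model gives more weight to the older (historical) parameters and less to the most recently updated ones. The EMA weights therefore become smoother and more stable over time. The averaged weights respond more slowly to change, which helps reduce the effect of noisy updates.

Lower decay (further from 1): a lower decay makes the EMA model put more emphasis on recent model updates. The EMA weights are then less smooth, because they react more strongly to the latest changes in the model parameters. This lets them follow new trends more quickly, but it also leaves them more exposed to noise and sudden changes.

In summary, a higher decay (close to 1) makes the EMA weights smoother by giving history more weight, which yields weights that are more stable but less responsive. A lower decay reduces the smoothing, making the weights more sensitive to recent changes at the cost of stability. The appropriate value depends on the specific requirements of the training run and on the nature of the data.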
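
To make "smoother" and "slower" concrete, a decay value can be turned into a rough number of updates. The sketch below computes the half-life (how many updates it takes for an old value's contribution to halve) and the common rule-of-thumb window 1 / (1 - decay). The function names and the `steps_per_epoch` value are illustrative assumptions:

```python
import math


def ema_half_life(decay):
    """Number of updates after which an old weight's contribution has halved."""
    return math.log(0.5) / math.log(decay)


def ema_time_constant(decay):
    """Rough averaging window in updates, 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


steps_per_epoch = 500  # hypothetical: dataset_size / batch_size
for decay in (0.9, 0.99, 0.999, 0.9999):
    window = ema_time_constant(decay)
    print(f"decay={decay}: half-life {ema_half_life(decay):.0f} updates, "
          f"window {window:.0f} updates = {window / steps_per_epoch:.2f} epochs")
```

Comparing the window with the number of updates per epoch is the check the timm docstring asks for: a decay that averages over many epochs on a small dataset may lag far behind the raw model.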

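2.3 Pretrained weights

Using pretrained weights:

When using ModelEmaV2, it can help to load pretrained weights into the original model before initializing ModelEmaV2, especially when you are fine-tuning or have a specific starting point.
ModelEmaV2 copies the model at construction time, so the EMA model then starts from those pretrained weights, which may lead to faster convergence and possibly better final performance, particularly in fine-tuning scenarios.
If you are training from scratch, however, initializing ModelEmaV2 with a model that has no pretrained weights is also fine. The EMA model adapts as training progresses.
In short, ModelEmaV2 maintains a smoother, more stable version of the model weights, which matters most in the later stages of training or in fine-tuning. The decay parameter is the key control over how much smoothing is applied. Pretrained weights can be useful with ModelEmaV2 but are not strictly necessary, especially when training from scratch.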
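
The ordering matters because the EMA copy is taken once, at construction. The sketch below uses `deepcopy` directly to stand in for what `ModelEmaV2.__init__` does, and a freshly initialized `nn.Linear` plays the role of a pretrained checkpoint; both are assumptions for illustration only:

```python
from copy import deepcopy

import torch
import torch.nn as nn

torch.manual_seed(0)
pretrained = nn.Linear(4, 2)               # pretend these are pretrained weights
pretrained_state = pretrained.state_dict()

# Wrong order: the EMA copy is taken before the weights are loaded.
model = nn.Linear(4, 2)
stale_ema = deepcopy(model)                # the copy ModelEmaV2.__init__ would make
model.load_state_dict(pretrained_state)

# Right order: load the weights first, then take the copy.
model2 = nn.Linear(4, 2)
model2.load_state_dict(pretrained_state)
fresh_ema = deepcopy(model2)

print("stale copy matches model:", torch.equal(stale_ema.weight, model.weight))
print("fresh copy matches model:", torch.equal(fresh_ema.weight, model2.weight))
```

If the EMA object already exists when the weights are loaded, calling `ema.set(model)` from the V2 listing copies the current model weights into it.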

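2.4 A bug encountered

The error encountered was RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment:

It means the deepcopy call in ModelEmaV2's initialization failed. PyTorch raises it when deepcopy reaches a tensor that is not a graph leaf, which typically happens with models that contain certain custom layers or that store computed tensors as attributes.

Check for layers or parameters that cannot be copied: some custom layers or parameters in a PyTorch model may not support deep copying. If your model contains such layers, consider changing it to use only deepcopy-compatible layers.
Update PyTorch: make sure you are on a recent PyTorch release. Issues like this are sometimes fixed in newer versions.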
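
One way this error can arise is sketched below. `CachingLayer` is a hypothetical layer invented for this example: it stores the result of an autograd operation as an attribute, and that non-leaf tensor is what deepcopy refuses to copy. Your own model may trigger the error through a different layer, so treat this as a way to recognize the pattern rather than as the cause in every case:

```python
from copy import deepcopy

import torch
import torch.nn as nn


class CachingLayer(nn.Module):
    """Hypothetical layer that keeps a computed (non-leaf) tensor as an attribute."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))
        self.cached = None

    def forward(self, x):
        self.cached = self.weight * 2   # non-leaf: produced by an autograd op
        return x * self.cached


layer = CachingLayer()
layer(torch.ones(3))                    # populates the cached non-leaf tensor

try:
    deepcopy(layer)
    copy_failed = False
except RuntimeError as err:
    copy_failed = True
    print("deepcopy failed:", err)

# Dropping (or detaching) the cached tensor makes the module copyable again.
layer.cached = None
layer_copy = deepcopy(layer)
print("copy succeeded after clearing the cache")
```

If a model can be cleaned up this way before it is handed to ModelEmaV2, the stock deepcopy-based constructor keeps working and no changes to the class are needed.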

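Workaround: a custom deep copy. Instead of using deepcopy, you can write a custom function that creates a copy of the model by building a new instance and copying every parameter and buffer from the original model into it. This also means the original __init__() must stop calling deepcopy() for self.module.

Replace the copy with the following: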

def custom_deepcopy(model):
    model_copy = type(model)()  # new instance; assumes the constructor needs no arguments
    model_copy.load_state_dict(model.state_dict())  # Copy parameters and buffers
    return model_copy

self.ema = ModelEmaV2(custom_deepcopy(self.net), decay=0.9999)

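The original __init__() must also be changed so that self.module no longer uses deepcopy(). With this change the class stores the very object it is given, so always pass in a separate copy as shown; otherwise the EMA would share its weights with the model being trained.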

    def __init__(self, model, decay=0.9999, device=None):
        super(ModelEmaV2, self).__init__()
        # make a copy of the model for accumulating moving average of weights
        # self.module =    deepcopy(model)
        self.module = model
        self.module.eval()
        self.decay = decay
        self.device = device  # perform ema on different device from model if set
        if self.device is not None:
            self.module.to(device=device)