Distinct penalization in torch.optim

MilkLeong

Last modified 2024-06-20 18:22:12

Tags: machine learning, python, pytorch

First published 2024-06-20 18:06:32

Article link: https://blog.csdn.net/milkleong/article/details/139839294

While reading the torch.optim documentation, I came across "distinct penalization":
Remember that parameters() returns an iterable that contains all learnable parameters, including biases and other parameters that may prefer distinct penalization. To address this, one can specify individual penalization weights for each parameter group

I didn't quite understand it, so I asked ChatGPT and am recording the answer here.

【distinct penalization】
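In deep learning, "penalization" usually refers to regularization, a technique for preventing a model from overfitting. Regularization works by adding an extra term to the loss function that penalizes the complexity of the model's parameters. Common forms are L1 regularization (also known as Lasso) and L2 regularization (also known as Ridge, which for plain SGD coincides with weight decay).

"Distinct penalization" means applying different amounts of regularization to different model parameters. In a neural network, different parameters may call for different regularization strengths. For example, weights may need stronger regularization to prevent overfitting, while biases may need weaker regularization, or none at all, because they usually do not cause overfitting.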
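
To make the "extra term in the loss" concrete, here is a minimal sketch that writes the penalty by hand, with a separate strength for the weight and for the bias. The layer sizes, the random data, and the names `lam_weight` and `lam_bias` are assumptions made up for illustration. The factor 0.5 is a convention chosen so that the penalty's gradient comes out as the strength times the parameter, which is the same quantity SGD's `weight_decay` adds to the gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

# Hypothetical penalty strengths, one per kind of parameter.
lam_weight, lam_bias = 1e-2, 0.0

data_loss = nn.functional.mse_loss(layer(x), y)
penalty = (
    0.5 * lam_weight * layer.weight.pow(2).sum()
    + 0.5 * lam_bias * layer.bias.pow(2).sum()
)
loss = data_loss + penalty  # the quantity one would actually minimize

# Gradient of the penalty alone: strength * parameter.
# retain_graph=True keeps the graph usable for a later loss.backward().
grad_w, grad_b = torch.autograd.grad(
    penalty, [layer.weight, layer.bias], retain_graph=True
)
print(torch.allclose(grad_w, lam_weight * layer.weight.detach()))
print(torch.count_nonzero(grad_b).item())
```

With a strength of 0 the bias receives no penalty gradient at all, which is the hand-written counterpart of giving a parameter group `weight_decay=0`.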

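In PyTorch's torch.optim optimizers, distinct penalization is implemented by specifying a different regularization weight for each parameter group. A parameter group is a dictionary that holds a set of parameters together with their optimization options (learning rate, regularization weight, and so on). Setting different regularization weights for different groups applies different amounts of regularization to different parts of the model.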

Here is a simple example showing how to set different regularization weights for different parameter groups in PyTorch:

import torch
from torch import nn
from torch.optim import SGD

# Assume we have a simple model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)

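# Define the parameter groups: one for each layer's weight and one for each layer's bias
params = [
    {'params': model[0].weight, 'weight_decay': 1e-5},  # stronger regularization on the weights
    {'params': model[0].bias, 'weight_decay': 0},       # no regularization on the biases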
    {'params': model[2].weight, 'weight_decay': 1e-5},
    {'params': model[2].bias, 'weight_decay': 0},
]

# Create the optimizer from the parameter groups
optimizer = SGD(params, lr=0.01)

# The optimizer now updates each parameter according to its group's settings

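In this example, we created separate parameter groups for the model's weights and biases and gave them different weight decay (regularization) values. This applies different amounts of regularization to the weights and the biases, which is exactly distinct penalization.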
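
One way to see the per-group setting in action is to take an optimizer step while the loss gradient is zero, so that any change in a parameter can only come from weight decay. The sketch below is an illustration under assumptions: the single `nn.Linear` layer is a stand-in, and the values of `lr` and `wd` are deliberately exaggerated so the effect is easy to see.

```python
import torch
from torch import nn
from torch.optim import SGD

torch.manual_seed(0)
layer = nn.Linear(10, 5)
lr, wd = 0.1, 0.5  # exaggerated values, chosen only to make the effect visible

optimizer = SGD(
    [
        {"params": [layer.weight], "weight_decay": wd},
        {"params": [layer.bias], "weight_decay": 0.0},
    ],
    lr=lr,
)

weight_before = layer.weight.detach().clone()
bias_before = layer.bias.detach().clone()

# Pretend the loss gradient is exactly zero for every parameter.
for p in layer.parameters():
    p.grad = torch.zeros_like(p)

optimizer.step()

# With plain SGD the decayed weights are scaled by (1 - lr * wd);
# the bias group has weight_decay 0, so the bias stays where it was.
print(torch.allclose(layer.weight, weight_before * (1 - lr * wd)))
print(torch.equal(layer.bias, bias_before))
```

The scaling factor `(1 - lr * wd)` holds for plain SGD without momentum. Other optimizers apply weight decay differently, so the check would need to be adapted for them.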

【weight_decay】
weight_decay is a closely related concept. It is usually passed as an argument to the optimizer, for example:

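from torch import nn, optim

# Stand-in for the module that `self` refers to in the documentation snippet
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))

# Split the parameters by name into bias terms and everything else
bias_params = [p for name, p in model.named_parameters() if 'bias' in name]
others = [p for name, p in model.named_parameters() if 'bias' not in name]

optimizer = optim.SGD([
    {'params': others},                         # inherits weight_decay=1e-2
    {'params': bias_params, 'weight_decay': 0}  # overrides it for the biases
], weight_decay=1e-2, lr=1e-2)
# In this manner, bias terms are isolated from non-bias terms, and a weight_decay of 0 is set specifically for the bias terms, as to avoid any penalization for this group.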
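
To check how the optimizer-level keyword arguments combine with per-group overrides, the resulting `param_groups` can be inspected after construction. This is a small sketch, and the single `nn.Linear` layer is a hypothetical stand-in rather than anything from the original snippet.

```python
from torch import nn, optim

layer = nn.Linear(10, 5)  # hypothetical stand-in module

optimizer = optim.SGD(
    [
        {"params": [layer.weight]},                   # no override: uses the defaults
        {"params": [layer.bias], "weight_decay": 0},  # overrides weight_decay only
    ],
    weight_decay=1e-2,
    lr=1e-2,
)

# Each group is a dict; options it did not set are filled in from the defaults.
for group in optimizer.param_groups:
    print(len(group["params"]), group["lr"], group["weight_decay"])
```

Both groups should report the shared learning rate, while only the first picks up the default weight decay. The second keeps its own value of 0.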