Gradient Explosion and Saving the Runtime State (torch.save)

Latest recommended article published on 2024-08-12 11:05:45

dadaHaHa1234

Article link: https://blog.csdn.net/qq_32425195/article/details/109090255

1. Determine whether the training crash is caused by exploding or vanishing gradients

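Method 1: Following https://zhuanlan.zhihu.com/p/32154263, monitor the gradient values and parameter values in TensorBoard to find the cause. This approach is easy to inspect visually but poorly suited to quantitative statistics.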

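Method 2: Log the gradients and the values of every layer in the network to a CSV file, then analyze the statistics (for example, the maximum and minimum gradient values) to determine whether gradient explosion caused the network to crash.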
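A minimal sketch of Method 2, assuming that the "values" of a layer means its weights and that one row per parameter per step is enough. The helper name `log_grad_stats`, the column names, and the in-memory buffer (a stand-in for a real CSV file) are illustrative choices, not part of the original post.

```python
import csv
import io

import torch
import torch.nn as nn


def log_grad_stats(model, step, writer):
    """Write one CSV row per parameter: gradient and weight statistics (hypothetical helper)."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue  # parameter did not take part in this backward pass
        g = p.grad.detach()
        w = p.detach()
        writer.writerow([
            step, name,
            g.min().item(), g.max().item(), g.norm(2).item(),
            w.min().item(), w.max().item(),
        ])


model = nn.Linear(5, 5)
buffer = io.StringIO()  # stand-in for open("grad_stats.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["step", "name", "grad_min", "grad_max", "grad_norm",
                 "weight_min", "weight_max"])

x = torch.tensor([120.0, 200.0, 0.5, -200.0, 1.0])
target = torch.tensor([2.0, 1.0, 5.0, 10.0, 40.0])
loss = nn.L1Loss()(model(x), target)
loss.backward()
log_grad_stats(model, step=0, writer=writer)

# Read the log back, as an offline analysis script would.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Calling the helper right after every `loss.backward()` gives a per-step history that can be scanned for sudden jumps in `grad_max` or `grad_norm`.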

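If gradient explosion is what crashes the network, a clipping threshold must be set. The threshold itself should also be chosen from these statistics.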
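One possible way to derive the threshold from logged statistics is to take a high percentile of the gradient norms recorded during healthy training. The post does not say which statistic to use, so the 95th percentile and the function name `suggest_clip_threshold` below are assumptions for illustration only.

```python
import math
import statistics


def suggest_clip_threshold(grad_norms, percentile=95):
    """Return the given percentile of the finite recorded gradient norms (hypothetical helper)."""
    # Drop inf/nan entries so the crash itself does not distort the statistic.
    finite = [n for n in grad_norms if math.isfinite(n)]
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(finite, n=100, method="inclusive")
    return cuts[percentile - 1]


# Norms logged over 101 steps, plus two values from the step that blew up.
recorded = [float(i) for i in range(101)] + [float("inf"), float("nan")]
threshold = suggest_clip_threshold(recorded)
```

The resulting value can then be passed as `max_norm` or `clip_value` to the clipping utilities discussed in this post.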

2. If gradient explosion is the cause, apply gradient clipping.

Method 1: Gradient norm clipping:

import torch.nn as nn
import torch.optim as optim
import torch

class LinearNet(nn.Module):
    def __init__(self, features_in=5, features_out=5):
        super().__init__()
        self.linear = nn.Linear(features_in, features_out)
        self._init_weight()

    def forward(self, x):
        return self.linear(x)
    def _init_weight(self):
        nn.init.constant_(self.linear.weight, val=1)
        nn.init.constant_(self.linear.bias, val=0)
# Model, loss function, and optimizer
net = LinearNet()
loss_fn = nn.L1Loss()  # mean absolute error
optimizer = optim.SGD(net.parameters(), lr=0.1)
# Network input and target
x = torch.tensor([120.0, 200.0, 0.5, -200.0, 1.0])
target_value = torch.tensor([2.0, 1.0, 5.0, 10.0, 40.0])
# Compute the loss
predict = net(x)
loss = loss_fn(predict, target_value)

loss.backward()
print("grad before clip:" + str(net.linear.weight.grad))
# nn.utils.clip_grad_value_(net.linear.weight, clip_value=1.1)
nn.utils.clip_grad_norm_(net.linear.weight, max_norm=2, norm_type=float("inf"))
print("grad after clip:" + str(net.linear.weight.grad))

nn.utils.clip_grad_norm_:

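Source code: https://blog.csdn.net/qq_40178291/article/details/100853237

The scaling coefficient is max_norm divided by the norm of the gradients of parameters, where norm_type selects the norm: 1 gives the sum of absolute values, 2 gives the Euclidean norm, and inf gives the maximum absolute value. With the inf norm, if the largest gradient magnitude exceeds max_norm, the gradients are rescaled so that the largest magnitude becomes max_norm; if it is below max_norm, the gradients are left unchanged.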
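The coefficient described above can be recomputed by hand and compared with what `nn.utils.clip_grad_norm_` does. This is a sketch, not the library's code: the helper `clip_coefficient` is hypothetical, and the small `1e-6` term in the denominator reflects my understanding of the PyTorch implementation and should be treated as an assumption.

```python
import torch
import torch.nn as nn


def clip_coefficient(grads, max_norm, norm_type):
    """Recompute the factor by which all gradients are scaled (hypothetical helper)."""
    if norm_type == float("inf"):
        # inf norm: the largest absolute gradient entry across all tensors
        total_norm = max(g.abs().max() for g in grads)
    else:
        # p-norm over all gradients, computed from the per-tensor norms
        per_tensor = torch.stack([g.norm(norm_type) for g in grads])
        total_norm = torch.norm(per_tensor, norm_type)
    # Gradients are only ever scaled down, so the factor is capped at 1.
    return min(float(max_norm / (total_norm + 1e-6)), 1.0)


w = nn.Parameter(torch.zeros(5))
w.grad = torch.tensor([24.0, 40.0, 0.1, -40.0, 0.2])

coef = clip_coefficient([w.grad], max_norm=2.0, norm_type=float("inf"))
expected = w.grad * coef  # manual result, computed before the in-place clip

nn.utils.clip_grad_norm_([w], max_norm=2.0, norm_type=float("inf"))
```

Because every entry is multiplied by the same factor, the direction of the gradient is preserved and only its magnitude shrinks.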

nn.utils.clip_grad_value_

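Sets every gradient element of the parameters that is greater than clip_value to clip_value.

Here "greater than" and "less than" compare absolute values, and the sign is preserved when a value is replaced. In other words, each gradient element is clamped to the range [-clip_value, clip_value].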
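A small illustration of value clipping on a hand-set gradient. The numbers are made up for the example; only `nn.utils.clip_grad_value_` itself comes from the post.

```python
import torch
import torch.nn as nn

w = nn.Parameter(torch.zeros(4))
w.grad = torch.tensor([24.0, -40.0, 0.1, -0.5])

# Entries whose magnitude exceeds 1.1 are clamped; the sign is kept.
nn.utils.clip_grad_value_([w], clip_value=1.1)
clipped = w.grad.clone()
```

Unlike norm clipping, this treats each element independently, so the direction of the overall gradient can change.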

1. To reproduce the runtime state, save the model, the inputs, and so on when an anomaly is detected, then load the saved data when debugging.

# Save a single tensor

torch.save(tensor, path)  -->  tensor = torch.load(path)

# Save several objects: store them together in one dict

torch.save({'exp': a, 'exp2': a + 2}, path)  -->  dict_tensor = torch.load(path)
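As a sketch of how this might look for a real crash snapshot, the dict can hold the model weights, the optimizer state, and the offending batch. The key names, the file name `crash_snapshot.pt`, and the step number are assumptions chosen for this example.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(5, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.tensor([120.0, 200.0, 0.5, -200.0, 1.0])
target = torch.tensor([2.0, 1.0, 5.0, 10.0, 40.0])

# Save everything needed to replay the failing step.
path = os.path.join(tempfile.mkdtemp(), "crash_snapshot.pt")
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "input": x,
        "target": target,
        "step": 123,
    },
    path,
)

# Later, in a debugging session: rebuild the model and reload the batch.
snapshot = torch.load(path)
restored = nn.Linear(5, 5)
restored.load_state_dict(snapshot["model"])
replayed_output = restored(snapshot["input"])
```

Saving `state_dict()` rather than the module object keeps the file loadable even if the class definition moves to another file later.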

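2. When a gradient problem appears and you cannot find its cause, you can skip the update for that iteration whenever the problem occurs. (This only handles cases where that particular input is faulty, for example input data that makes the gradients nan; it does nothing to improve the model itself.)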
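A minimal sketch of skipping a faulty iteration, assuming that "faulty" means a non-finite loss or gradient. The function name `train_step` and its return convention are hypothetical.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, loss_fn, x, target):
    """Run one update; return False and leave the weights untouched if it is non-finite."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    if not torch.isfinite(loss):
        return False  # skip: the loss is already nan/inf
    loss.backward()
    grads_ok = all(
        torch.isfinite(p.grad).all()
        for p in model.parameters()
        if p.grad is not None
    )
    if not grads_ok:
        optimizer.zero_grad()  # discard the bad gradients
        return False
    optimizer.step()
    return True


model = nn.Linear(5, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.L1Loss()
target = torch.tensor([2.0, 1.0, 5.0, 10.0, 40.0])

good_x = torch.tensor([1.0, 2.0, 0.5, -2.0, 1.0])
bad_x = torch.tensor([1.0, float("nan"), 0.5, -2.0, 1.0])

before = model.weight.detach().clone()
skipped = not train_step(model, optimizer, loss_fn, bad_x, target)
unchanged = torch.equal(model.weight.detach(), before)
updated = train_step(model, optimizer, loss_fn, good_x, target)
```

Counting how often steps are skipped is worthwhile: frequent skips suggest a problem in the model or learning rate rather than in individual batches.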

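The calls involved:

optimizer.zero_grad(): resets the gradients of all parameters (to zero, or to None by default in recent PyTorch versions)

loss.backward(): computes the gradient of the loss with respect to each parameter and accumulates it into that parameter's .grad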
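The interaction between these two calls can be seen on a tiny example: without a reset in between, a second backward pass adds to the gradient already stored. The tensor and the loss here are made up for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w * 3).sum().backward()
first = w.grad.clone()        # gradient of sum(3 * w) is 3 per element

(w * 3).sum().backward()      # no zero_grad() in between: gradients add up
accumulated = w.grad.clone()

optimizer.zero_grad()
# Depending on the PyTorch version, the gradient is now None or all zeros.
cleared = w.grad is None or bool((w.grad == 0).all())
```

This is why a training loop normally resets the gradients once per iteration before calling `backward()`.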