(A note up front: these are notes I wrote in a Markdown file last year. When I moved them to CSDN, the externally linked images all failed to import and the table of contents broke, so some figures may appear out of place. Thanks for bearing with me.)
(Also, there isn't much of my own thinking in this post, so tagging it as original feels a bit cheeky. If you think that's going too far, say so and I'll change it.)
A comparison of commonly used methods for updating the learning rate
- I. The model used for the comparison
- II. No automatic adjustment of the gradient update step size
- III. Automatic adjustment of the gradient update step size
- IV. Adjusting the learning rate every epoch
  - 1.torch.optim.lr_scheduler.LambdaLR
  - 2.torch.optim.lr_scheduler.MultiplicativeLR
  - 3.torch.optim.lr_scheduler.StepLR
  - 4.torch.optim.lr_scheduler.MultiStepLR
  - 5.torch.optim.lr_scheduler.ExponentialLR
  - 6.torch.optim.lr_scheduler.CosineAnnealingLR
  - 7.torch.optim.lr_scheduler.ReduceLROnPlateau
  - 8.torch.optim.lr_scheduler.CyclicLR
- V. References
I. The model used for the comparison
# Use the MNIST dataset to try out and compare the different learning-rate settings
from pathlib import Path
import requests
import torch
import pickle
import gzip
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim

# download the pickled MNIST dataset if it is not already on disk
DATA_PATH = Path("G:/学习/ml/datasets")
PATH = DATA_PATH / "mnist"
PATH.mkdir(parents=True, exist_ok=True)
URL = "http://deeplearning.net/data/mnist/"
FILENAME = "mnist.pkl.gz"
if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).open("wb").write(content)
with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

def correct_get(a, b):
    # compute accuracy: a holds the model outputs (logits), b the integer labels
    dev1 = torch.device('cpu')
    outputs_numpy = a.to(dev1).detach().numpy()
    train_get = np.argmax(outputs_numpy, axis=1)
    count = 0
    for i in range(len(train_get)):
        if train_get[i] == b[i]:
            count += 1
    return count / len(train_get), count

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 9, 3, 1, 1)
        self.conv2 = nn.Conv2d(9, 36, 3, 1, 1)
        self.fc1 = nn.Linear(7 * 7 * 36, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, 7 * 7 * 36)
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x

# plt.imshow(x_train[0].reshape((28, 28)), cmap="gray")
# plt.show()
x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
n, c = x_train.shape
print(x_train.shape, y_train.shape, x_valid.shape, y_valid.shape)
x_train = x_train.view(-1, 1, 28, 28)
x_valid = x_valid.view(-1, 1, 28, 28)
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=100, shuffle=True)
valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=10000)
Net_mnist = Net()
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Net_mnist = Net_mnist.to(device=dev)
params = Net_mnist.parameters()
opt = optim.SGD(params, lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
loss_list = []
for i in range(10):
    for j, data in enumerate(train_dl):
        inputs, labels = data
        inputs, labels = inputs.to(device=dev), labels.to(device=dev)
        opt.zero_grad()
        outputs = Net_mnist(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        opt.step()
    # evaluate on the validation set once per epoch
    with torch.no_grad():
        for valid_x, valid_y in valid_dl:
            valid_x = valid_x.to(device=dev)
            valid_outputs = Net_mnist(valid_x)
            valid_correct, _ = correct_get(valid_outputs, valid_y)
            print('epoch:%d, correct:%.3f' % (i, valid_correct))
            loss_list.append(valid_correct)
plt.plot(loss_list)
plt.show()
The optimizer used above is SGD; the sections below look at what changes when the optimizer is swapped out.
II. No automatic adjustment of the gradient update step size
1.torch.optim.SGD
torch.optim.SGD
(Don't ask why I didn't attach the link to the subheading itself.)
Reference on momentum: http://www.cs.toronto.edu/~hinton/absps/momentum.pdf (the paper link is usually included in the source-code comments)
def __init__(self, params, lr=required, momentum=0, dampening=0,weight_decay=0, nesterov=False)
- Using stochastic gradient descent (SGD) with momentum:
opt = optim.SGD(params, lr=3e-3, momentum=0.9)
The resulting accuracy curve:
- Enabling Nesterov momentum:
Even if the current gradient is zero, the parameters are still updated using the gradient at the point that momentum alone would carry them to. Intuitively: as you roll toward the bottom of a slope, it predicts that you will roll up the opposite side and lets you brake ahead of time. [3]
opt = optim.SGD(params, lr=3e-3, momentum=0.9, dampening=0, weight_decay=0, nesterov=True)  # I don't fully understand the remaining parameters yet; I'll come back to this after reading the paper
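As a rough sketch of the difference (the textbook formulation, not something stated in the original post; note that PyTorch's SGD implements a slightly rearranged variant of momentum, as its documentation points out):
Classical momentum: v_{t+1} = mu * v_t - lr * grad_f(theta_t),  theta_{t+1} = theta_t + v_{t+1}
Nesterov momentum:  v_{t+1} = mu * v_t - lr * grad_f(theta_t + mu * v_t),  theta_{t+1} = theta_t + v_{t+1}
The Nesterov version evaluates the gradient at the look-ahead point theta_t + mu * v_t, which is what produces the "brake in advance" behaviour described above.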
III. Automatic adjustment of the gradient update step size
1.torch.optim.Adagrad
Paper: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
def __init__(self, params, lr=1e-2, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
Adagrad adapts the learning rate for each parameter individually: as the total distance a parameter has been updated grows, its learning rate shrinks. [2]
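A rough sketch of the rule (the standard Adagrad update, not quoted from the post): each parameter accumulates the squares of its own past gradients, and that accumulator scales the step down:
G_t = G_{t-1} + g_t^2
theta_{t+1} = theta_t - lr / (sqrt(G_t) + eps) * g_t
Because G_t only grows, the effective step size for frequently updated parameters keeps shrinking.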
- Set the optimizer to:
opt = optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
The resulting accuracy curve:
2.torch.optim.Adadelta
Paper: ADADELTA: An Adaptive Learning Rate Method
def __init__(self, params, lr=1.0, rho=0.9, eps=1e-6, weight_decay=0)
- Adadelta is an optimization method that improves on Adagrad. [2]
Set the optimizer to:
opt = optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-6, weight_decay=0)
The resulting accuracy curve:
3.torch.optim.Adam
Papers:
Adam: A Method for Stochastic Optimization
On the Convergence of Adam and Beyond
def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False)
- Set the optimizer as follows:
opt = optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
The resulting accuracy curve:
4.torch.optim.AdamW
Paper: Decoupled Weight Decay Regularization
def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2, amsgrad=False)
- Set the optimizer to:
opt = optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
The resulting accuracy curve:
5.torch.optim.SparseAdam
Paper: Adam: A Method for Stochastic Optimization
def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
- Set the optimizer to:
opt = optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
Running this raises: SparseAdam does not support dense gradients, please consider Adam instead.
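That error is expected here: the gradients of this convolutional network are dense. SparseAdam is meant for parameters whose gradients are sparse. A minimal sketch of a case where it does apply (a hypothetical example, not part of the original experiment):
import torch
import torch.nn as nn
import torch.optim as optim
emb = nn.Embedding(1000, 16, sparse=True)         # sparse=True makes the weight gradient a sparse tensor
opt_sparse = optim.SparseAdam(emb.parameters(), lr=0.001)
idx = torch.randint(0, 1000, (32,))
loss = emb(idx).sum()
loss.backward()                                   # emb.weight.grad is now a sparse tensor
opt_sparse.step()                                 # works, because the gradient is sparse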
6.torch.optim.Adamax
def __init__(self, params, lr=2e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
- Set the optimizer to:
opt = optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
The resulting accuracy curve:
7.torch.optim.ASGD
Paper: Acceleration of stochastic approximation by averaging
def __init__(self, params, lr=1e-2, lambd=1e-4, alpha=0.75, t0=1e6, weight_decay=0)
- Set the optimizer to:
opt = optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
The resulting accuracy curve:
8.torch.optim.LBFGS
def __init__(self,
params,
lr=1,
max_iter=20,
max_eval=None,
tolerance_grad=1e-7,
tolerance_change=1e-9,
history_size=100,
line_search_fn=None)
Simply swapping this optimizer into the training loop fails with: TypeError: step() missing 1 required positional argument: 'closure'
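The reason is that LBFGS may re-evaluate the model several times per step (for its line search), so opt.step() must be given a closure that recomputes the loss. A minimal sketch of how the training loop from section I would change (reusing its variables; an assumed usage, not something the original post ran to produce a curve):
opt = optim.LBFGS(Net_mnist.parameters(), lr=1, max_iter=20)
for j, data in enumerate(train_dl):
    inputs, labels = data
    inputs, labels = inputs.to(device=dev), labels.to(device=dev)
    def closure():
        # re-evaluate the loss with gradients; LBFGS may call this several times per step
        opt.zero_grad()
        outputs = Net_mnist(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        return loss
    opt.step(closure)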
9.torch.optim.RMSprop
def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8, weight_decay=0, momentum=0, centered=False)
- Set the optimizer to:
opt = optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
The resulting accuracy curve:
10.torch.optim.Rprop
def __init__(self, params, lr=1e-2, etas=(0.5, 1.2), step_sizes=(1e-6, 50))
- Set the optimizer to:
opt = optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
The resulting accuracy curve:
IV. Adjusting the learning rate every epoch
For these experiments, the optimizer definition and the training loop from section I are modified as follows:
opt = optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
lr_lambda = lambda epoch: 1 / (epoch + 1)
scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda, last_epoch=-1)
criterion = nn.CrossEntropyLoss()
loss_list = []
lr_list = []
for i in range(10):
    for j, data in enumerate(train_dl):
        inputs, labels = data
        inputs, labels = inputs.to(device=dev), labels.to(device=dev)
        opt.zero_grad()
        outputs = Net_mnist(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        opt.step()
    scheduler.step()                               # adjust the learning rate once per epoch
    lr_list.append(scheduler.get_last_lr()[0])     # record the learning rate for this epoch
    with torch.no_grad():
        for valid_x, valid_y in valid_dl:
            valid_x = valid_x.to(device=dev)
            valid_outputs = Net_mnist(valid_x)
            valid_correct, _ = correct_get(valid_outputs, valid_y)
            print('epoch:%d, correct:%.3f' % (i, valid_correct))
            loss_list.append(valid_correct)
plt.plot(loss_list)
plt.show()
plt.figure()
plt.plot(lr_list)
plt.show()
The snippet above already uses a custom, lambda-defined learning-rate adjustment.
1.torch.optim.lr_scheduler.LambdaLR
torch.optim.lr_scheduler.LambdaLR
def __init__(self, optimizer, lr_lambda, last_epoch=-1)
The adjustment rule is: lr = base_lr * lmbda(self.last_epoch) [5]
- Definition:
lr_lambda = lambda epoch: 1/(epoch+1)
scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda, last_epoch=-1)
The resulting accuracy curve:
The resulting learning-rate curve:
2.torch.optim.lr_scheduler.MultiplicativeLR
torch.optim.lr_scheduler.MultiplicativeLR
Adjusts the learning rate by a configurable multiplicative factor; lr_lambda can be a single function that returns the factor, or a list of such functions, one per parameter group (see the sketch at the end of this subsection).
def __init__(self, optimizer, lr_lambda, last_epoch=-1)
- Define the function:
lmbda = lambda epoch: 0.95
scheduler = optim.lr_scheduler.MultiplicativeLR(opt, lr_lambda=lmbda)
The resulting accuracy curve:
The resulting learning-rate curve:
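As mentioned above, lr_lambda can also be a list with one factor function per parameter group. A minimal sketch (a hypothetical grouping, not part of the original experiment):
opt = optim.SGD([
    {'params': Net_mnist.conv1.parameters()},
    {'params': Net_mnist.fc3.parameters()},
], lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.MultiplicativeLR(
    opt, lr_lambda=[lambda epoch: 0.95,    # factor for the conv1 group
                    lambda epoch: 0.90])   # factor for the fc3 group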
3.torch.optim.lr_scheduler.StepLR
torch.optim.lr_scheduler.StepLR
Decays the learning rate at a fixed interval of step_size epochs; gamma is the multiplicative factor applied at each step (the starting value is the learning rate set in the optimizer).
def __init__(self, optimizer, step_size, gamma=0.1, last_epoch=-1)
- Definition:
scheduler = optim.lr_scheduler.StepLR(opt, 2, gamma=0.1, last_epoch=-1)
The resulting accuracy curve:
The resulting learning-rate curve:
4.torch.optim.lr_scheduler.MultiStepLR
torch.optim.lr_scheduler.MultiStepLR
Decays the learning rate of each parameter group by gamma once the epoch count reaches one of the specified milestones. This decay can happen at the same time as other changes to the learning rate made outside this scheduler.
def __init__(self, optimizer, milestones, gamma=0.1, last_epoch=-1)
- Definition:
scheduler = optim.lr_scheduler.MultiStepLR(opt, milestones=[3,8], gamma=0.1)
The resulting accuracy curve:
The resulting learning-rate curve:
5.torch.optim.lr_scheduler.ExponentialLR
torch.optim.lr_scheduler.ExponentialLR
Decays the learning rate exponentially, following lr = base_lr * gamma ** epoch [5]
def __init__(self, optimizer, gamma, last_epoch=-1)
- Definition:
scheduler = optim.lr_scheduler.ExponentialLR(opt, gamma=0.1, last_epoch=-1)
The resulting accuracy curve:
The resulting learning-rate curve:
6.torch.optim.lr_scheduler.CosineAnnealingLR
torch.optim.lr_scheduler.CosineAnnealingLR
Anneals the learning rate along a cosine curve with a period of 2*T_max epochs, taking the initial learning rate as the maximum; the learning rate is reset to that maximum at the start of each period. [5]
Formula (the original post pasted a screenshot of someone else's rendering here; this is the standard cosine-annealing rule):
eta_t = eta_min + (1/2) * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_max))
def __init__(self, optimizer, T_max, eta_min=0, last_epoch=-1)
- Definition:
scheduler = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=2, eta_min=0, last_epoch=-1)
The resulting accuracy curve:
The resulting learning-rate curve:
7.torch.optim.lr_scheduler.ReduceLROnPlateau
torch.optim.lr_scheduler.ReduceLROnPlateau
Adapts the learning rate to training progress: when the monitored metric stops improving, the learning rate is reduced. This scheduler has no get_last_lr() method, so the learning rate cannot be recorded epoch by epoch the way it is above.
def __init__(self, optimizer, mode='min', factor=0.1, patience=10,
verbose=False, threshold=1e-4, threshold_mode='rel',
cooldown=0, min_lr=0, eps=1e-8)
- Usage (note: mode='max' because the monitored metric here is validation accuracy; use 'min' when monitoring a loss):
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='max', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
# inside the training loop, step with the monitored validation metric
scheduler.step(valid_correct)
The resulting accuracy curve:
8.torch.optim.lr_scheduler.CyclicLR
torch.optim.lr_scheduler.CyclicLR
Sets the learning rate of each parameter group according to the cyclical learning rate policy (CLR). The policy cycles the learning rate between two boundaries at a constant frequency, as described in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled per iteration or per cycle. Because the cyclical policy changes the learning rate after every batch, step() should be called after each batch has been used for training.
def __init__(self,
optimizer,
base_lr,
max_lr,
step_size_up=2000,
step_size_down=None,
mode='triangular',
gamma=1.,
scale_fn=None,
scale_mode='cycle',
cycle_momentum=True,
base_momentum=0.8,
max_momentum=0.9,
last_epoch=-1)
- Usage example (step() is called inside the dataloader loop; see [4] and the sketch below):
scheduler = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=0.01, max_lr=0.1)  # opt must be an optimizer with momentum (cycle_momentum=True by default)
The resulting accuracy curve:
The resulting learning-rate curve:
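Since CyclicLR updates the learning rate after every batch, the scheduler.step() call moves inside the batch loop, unlike the per-epoch template at the start of this section. A minimal sketch (reusing the variables from section I; an illustration of the step placement, not the exact code behind the curves above):
opt = optim.SGD(Net_mnist.parameters(), lr=0.01, momentum=0.9)   # momentum is needed because cycle_momentum=True by default
scheduler = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=0.01, max_lr=0.1)
lr_list = []
for i in range(10):
    for inputs, labels in train_dl:
        inputs, labels = inputs.to(device=dev), labels.to(device=dev)
        opt.zero_grad()
        loss = criterion(Net_mnist(inputs), labels)
        loss.backward()
        opt.step()
        scheduler.step()                             # step once per batch, not once per epoch
        lr_list.append(scheduler.get_last_lr()[0])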
[4] also covers torch.optim.lr_scheduler.OneCycleLR and torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.
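For completeness, a minimal sketch of constructing those two (based on their documented signatures only; not something tested in this post). OneCycleLR is also stepped once per batch, while CosineAnnealingWarmRestarts is typically stepped once per epoch:
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.01, steps_per_epoch=len(train_dl), epochs=10)   # call scheduler.step() after every batch
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    opt, T_0=2)                                                   # the cosine schedule restarts every T_0 epochs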
V. References
1.An overview of gradient descent optimization algorithms
2.各种优化方法总结比较(sgd/momentum/Nesterov/adagrad/adadelta)
4.https://pytorch.org/docs/stable/optim.html?highlight=sgd#