Deep Learning: Understanding Gradient Descent Through torch.optim.SGD()

Introduction

I hope that after reading this article you will come away with a fresh understanding of how to write parameter-optimization code.

Main text:

On how to learn deep learning, the original author's advice is: get hands-on and understand the principles directly in code. If complex code is impenetrable, start with simple code. For example, if you have not fully understood gradient descent, write a few lines of code first and run a simple example.

Loss function: $loss = w_1^2 + 2 w_2^2$

Gradient: $\frac{d\,loss}{dw} = [2 w_1,\ 4 w_2]$

Initial weights: $w_0 = [0.5, 0.5]$

Gradient update:

$$\begin{aligned} w &= w - learning\_rate \cdot \frac{d\,loss}{dw} \\ &= [0.5, 0.5] - 0.01 \cdot [1, 2] \\ &= [0.49, 0.48] \end{aligned}$$

Original form:

import torch
# without zeroing the gradient between steps
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
opti = torch.optim.SGD([w], lr=0.01)
loss.backward()
print("w grad", w.grad.data)
print("w:", w)
opti.step()
print("w:", w)
opti.step()
print("w:", w)
opti.step()
print("w:", w)

"""
w grad tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4900, 0.4800], requires_grad=True)
w: tensor([0.4800, 0.4600], requires_grad=True)
w: tensor([0.4700, 0.4400], requires_grad=True)
"""


Change 1: update the parameters one more time

Computing by hand from the update formula (this time the gradient is recomputed at w = [0.49, 0.48], giving [0.98, 1.92]):

$$\begin{aligned} w &= w - learning\_rate \cdot \frac{d\,loss}{dw} \\ &= [0.49, 0.48] - 0.01 \cdot [0.98, 1.92] \\ &= [0.4802, 0.4608] \end{aligned}$$

# with gradient zeroing and the loss recomputed before each step
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
opti = torch.optim.SGD([w], lr=0.01)
loss.backward()
print("w grad", w.grad.data)
print("w:", w)
opti.step()
print("w:", w)
opti.zero_grad()
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
loss.backward()
print("w grad", w.grad.data)
opti.step()
print("w:", w)
opti.zero_grad()
loss = w[0] ** 2 + 2 * w[1] ** 2
loss.backward()
print("w grad", w.grad.data)
opti.step()
print("w:", w)

"""
w grad tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4900, 0.4800], requires_grad=True)
w grad tensor([0.9800, 1.9200])
w: tensor([0.4802, 0.4608], requires_grad=True)
w grad tensor([0.9604, 1.8432])
w: tensor([0.4706, 0.4424], requires_grad=True)
"""

All opti.step() does is mechanically perform one more gradient update with whatever gradient is currently stored in w.grad; fresh gradients only appear after another backward() call. The sketch below reproduces a single step by hand.
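To make that concrete, here is a minimal hand-rolled sketch of the same first update, using only tensor operations (for plain SGD this mirrors what step() does: an in-place w ← w − lr · w.grad):

# hand-rolled equivalent of one opti.step() for vanilla SGD
import torch
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
loss.backward()
with torch.no_grad():     # keep the update itself out of the autograd graph
    w -= 0.01 * w.grad    # [0.5, 0.5] - 0.01 * [1, 2]
print(w)                  # tensor([0.4900, 0.4800], requires_grad=True)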

Change 2: adding weight decay

Introducing weight decay is equivalent to adding the squared L2 norm of the parameter vector to the loss function:

$$\begin{aligned} loss &= loss + \frac{\lambda}{2}\|w\|^2 \\ &= w_1^2 + 2 w_2^2 + \frac{\lambda}{2}\left(w_1^2 + w_2^2\right) \end{aligned}$$

In the code, weight_decay=1, so:

$$\begin{aligned} loss &= \frac{3}{2} w_1^2 + \frac{5}{2} w_2^2 \\ \frac{d\,loss}{dw} &= [3 w_1,\ 5 w_2] \end{aligned}$$

$$\begin{aligned} w &= w - learning\_rate \cdot \frac{d\,loss}{dw} \\ &= [0.5, 0.5] - 0.01 \cdot [1.5, 2.5] \\ &= [0.485, 0.475] \end{aligned}$$

# with weight decay
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
opti = torch.optim.SGD([w], lr=0.01, weight_decay=1)
loss.backward(retain_graph=True)
print("w grad", w.grad.data)
print("w:", w)
opti.step()
print("w:", w)
opti.step()
print("w:", w)
opti.step()
"""
w grad tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4850, 0.4750], requires_grad=True)
w: tensor([0.4702, 0.4502], requires_grad=True)
"""

Change 3: adding momentum

The momentum method introduces a velocity term: each step no longer subtracts $learning\_rate \cdot grad$ but the accumulated momentum $v_t$:

$$v_{t+1} = gamma \cdot v_t + learning\_rate \cdot grad$$

$v_t$ is initialized to 0, and here we take $gamma = 0.1$, matching the momentum=0.1 used in the sketch at the end of this section.

First update:

$$v_1 = 0.1 \cdot 0 + 0.01 \cdot [1, 2] = [0.01, 0.02]$$

$$\begin{aligned} w &= w - v_1 \\ &= [0.5, 0.5] - [0.01, 0.02] \\ &= [0.49, 0.48] \end{aligned}$$

So far the result is identical to SGD without momentum.

Second update (the gradient is not recomputed, so it is still [1, 2]):

$$\begin{aligned} v_2 &= 0.1 \cdot v_1 + 0.01 \cdot [1, 2] \\ &= [0.001, 0.002] + [0.01, 0.02] \\ &= [0.011, 0.022] \end{aligned}$$

$$\begin{aligned} w &= w - v_2 \\ &= [0.49, 0.48] - [0.011, 0.022] \\ &= [0.479, 0.458] \end{aligned}$$
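The original post has no code block for this change, so below is a minimal sketch in the same style as the earlier examples (gradient computed once and never zeroed, so it stays [1, 2]). One assumption worth naming: PyTorch implements momentum as a buffer b_t = momentum · b_{t-1} + grad with update w = w − lr · b_t, which is equivalent to the v_t formulation above when the learning rate is constant.

# momentum SGD sketch (momentum=0.1); gradient is not zeroed between steps
import torch
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
opti = torch.optim.SGD([w], lr=0.01, momentum=0.1)
loss.backward()
print("w grad", w.grad)   # tensor([1., 2.])
opti.step()
print("w:", w)            # [0.4900, 0.4800], i.e. w - v1 with v1 = [0.01, 0.02]
opti.step()               # stored gradient is still [1, 2]
print("w:", w)            # [0.4790, 0.4580], i.e. w - v2 with v2 = [0.011, 0.022]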

Change 4: the Adagrad optimizer

SGD uses one learning rate for every parameter. Adagrad instead gives each parameter its own learning rate: the larger the accumulated gradient, the smaller the learning rate, and vice versa.

Adagrad maintains a tensor $s_t$ with the same shape as $w$, where $t$ is the step count; by default $s_0 = 0$.

First step ($\odot$ denotes element-wise multiplication):

$$\begin{aligned} s_1 &= s_0 + grad \odot grad \\ &= 0 + [1, 2] \odot [1, 2] \\ &= [1, 4] \end{aligned}$$

$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_1}} \odot grad \\ &= [0.5, 0.5] - \frac{0.01}{[1, 2]} \odot [1, 2] \\ &= [0.49, 0.49] \end{aligned}$$

Second step:

$$\begin{aligned} s_2 &= s_1 + grad \odot grad \\ &= [1, 4] + [0.98, 1.96] \odot [0.98, 1.96] \\ &= [1, 4] + [0.9604, 3.8416] \\ &= [1.9604, 7.8416] \end{aligned}$$

$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_2}} \odot grad \\ &= [0.49, 0.49] - \frac{0.01}{[\sqrt{1.9604}, \sqrt{7.8416}]} \odot [0.98, 1.96] \\ &= [0.483, 0.483] \end{aligned}$$

Adagrad is essentially a correction to the learning rate: $\frac{learning\_rate}{\sqrt{s_t}}$. The larger a parameter's gradients have been, the smaller its effective learning rate. In this example, $w_2$'s gradient is larger than $w_1$'s, so its learning rate is smaller; the two effects cancel exactly, and both parameters receive the same update. The formula also shows that Adagrad must maintain a tensor $s_t$ with the same shape as $w$, so enabling it costs at least one extra parameter-sized block of memory.

import torch
w = torch.tensor([0.5,0.5],requires_grad=True)
loss = w[0].clone()**2 + 2*w[1].clone()**2
opti = torch.optim.Adagrad([w],lr=0.01,lr_decay=0, weight_decay=0, initial_accumulator_value=0)
loss.backward(retain_graph=True)
print("w grad:",w.grad)
print("w:",w)
opti.step()
print("w:",w)
opti.zero_grad()
loss = w[0].clone()**2 + 2*w[1].clone()**2
loss.backward(retain_graph=True)
print("w grad:",w.grad)
opti.step()
print("w:",w)
"""
w grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4900, 0.4900], requires_grad=True)
w grad: tensor([0.9800, 1.9600])
w: tensor([0.4830, 0.4830], requires_grad=True)
"""

Change 5: RMSprop

RMSprop is a refinement of Adagrad. Adagrad's accumulator is $s_{t+1} = s_t + grad \odot grad$, while RMSprop replaces it with an exponential moving average:

$$s_{t+1} = \alpha \cdot s_t + (1 - \alpha)\, grad \odot grad$$

In other words, it adds one tunable hyperparameter $\alpha$.

First step (with $\alpha = 0.5$):

$$\begin{aligned} s_1 &= 0.5\, s_0 + 0.5 \cdot grad \odot grad \\ &= 0 + 0.5 \cdot [1, 2] \odot [1, 2] \\ &= [0.5, 2] \end{aligned}$$

$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_1}} \odot grad \\ &= [0.5, 0.5] - \frac{0.01}{[\frac{1}{\sqrt{2}}, \sqrt{2}]} \odot [1, 2] \\ &= [0.4859, 0.4859] \end{aligned}$$

Second step:

$$\begin{aligned} s_2 &= 0.5\, s_1 + 0.5 \cdot grad \odot grad \\ &= 0.5 \cdot [0.5, 2] + 0.5 \cdot [0.9717, 1.9434] \odot [0.9717, 1.9434] \\ &= [0.25, 1] + [0.4721, 1.8884] \\ &= [0.7221, 2.8884] \end{aligned}$$

$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_2}} \odot grad \\ &= [0.4859, 0.4859] - \frac{0.01}{[\sqrt{0.7221}, \sqrt{2.8884}]} \odot [0.9717, 1.9434] \\ &= [0.474465, 0.474465] \end{aligned}$$

The hand computation differs from the code output by about $10^{-4}$, which comes from rounding in the intermediate values.

If $\alpha = 0$, then $s_t = grad \odot grad$ and the update degenerates to $w = w - learning\_rate \cdot sign(grad)$: the step size no longer carries any information about the gradient's magnitude, so RMSprop's adaptive scaling is lost entirely (see the sketch after the code block below).

import torch
w = torch.tensor([0.5,0.5],requires_grad=True)
loss = w[0].clone()**2 + 2*w[1].clone()**2
opti = torch.optim.RMSprop([w],lr=0.01,alpha=0.5, eps=0,momentum=0,weight_decay=0)
loss.backward(retain_graph=True)
print("w grad:",w.grad)
print("w:",w)
opti.step()
print("w:",w)
opti.zero_grad()
loss = w[0].clone()**2 + 2*w[1].clone()**2
loss.backward(retain_graph=True)
print("w grad:",w.grad)
opti.step()
print("w:",w)
"""
w grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4859, 0.4859], requires_grad=True)
w grad: tensor([0.9717, 1.9434])
w: tensor([0.4744, 0.4744], requires_grad=True)
"""

Change 6: Adam

Adam's update rule:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\ w &= w - learning\_rate \cdot \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t / (1 - \beta_2^t)}} \end{aligned}$$

First step (with $\beta_1 = 0.5$ and $\beta_2 = 0$, as in the code of Change 7; $\beta_2 = 0$ means $v_t = g_t^2$ and $1 - \beta_2^t = 1$, so $v_t$ needs no bias correction):

$$\begin{aligned} m_1 &= 0.5 \cdot [0, 0] + 0.5 \cdot [1, 2] = [0.5, 1] \\ v_1 &= [1, 2] \odot [1, 2] = [1, 4] \end{aligned}$$

$$\begin{aligned} w &= w - 0.01 \cdot \frac{[0.5, 1] / 0.5}{[1, 2]} \\ &= [0.5, 0.5] - [0.01, 0.01] \\ &= [0.49, 0.49] \end{aligned}$$

Second step:

$$\begin{aligned} m_2 &= 0.5 \cdot [0.5, 1] + 0.5 \cdot [0.98, 1.96] = [0.74, 1.48] \\ v_2 &= [0.98, 1.96] \odot [0.98, 1.96] \\ w &= [0.49, 0.49] - 0.01 \cdot \frac{[0.74, 1.48] / 0.75}{[0.98, 1.96]} \\ &= [0.4799, 0.4799] \end{aligned}$$
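The original post gives no standalone code for this change, so below is a minimal sketch with the hyperparameters assumed in the hand computation (lr=0.01, betas=(0.5, 0), eps=0, loss recomputed each step); it should reproduce the numbers above:

# Adam sketch matching the hand computation above
import torch
w = torch.tensor([0.5, 0.5], requires_grad=True)
opti = torch.optim.Adam([w], lr=0.01, betas=(0.5, 0), eps=0)
for step in range(2):
    opti.zero_grad()
    loss = w[0] ** 2 + 2 * w[1] ** 2
    loss.backward()
    opti.step()
    print("w:", w)   # expect [0.4900, 0.4900], then [0.4799, 0.4799]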

Change 7: combining gradient descent with a learning-rate schedule

PyTorch's learning-rate adjustment strategies fall into three groups:

(1) fixed rules, such as step decay (StepLR) and exponential decay (ExponentialLR); see the sketch after this list

(2) adaptive adjustment based on a monitored metric, such as ReduceLROnPlateau

(3) custom schedules
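For completeness, the rule-based schedulers in (1) follow the same usage pattern; here is a minimal sketch (the step_size and gamma values are arbitrary, chosen only for illustration):

# (1) rule-based schedule: halve the learning rate every 2 epochs
import torch
w = torch.tensor([0.5, 0.5], requires_grad=True)
opti = torch.optim.SGD([w], lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(opti, step_size=2, gamma=0.5)
for epoch in range(4):
    loss = w[0] ** 2 + 2 * w[1] ** 2
    loss.backward()
    opti.step()
    opti.zero_grad()
    scheduler.step()                        # conventional order: after opti.step()
    print(epoch, scheduler.get_last_lr())   # 0.01, 0.005, 0.005, 0.0025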

The code below uses a custom schedule as the example: the lambda function returns the epoch count, so the adjusted learning rate equals the initial learning rate multiplied by the epoch number. Note that in this example the scheduler is stepped before the gradient update.

import torch
w = torch.tensor([0.5,0.5],requires_grad=True)
opti = torch.optim.Adam([w],lr=0.01,eps=0,betas=(0.5,0))
scheduler = torch.optim.lr_scheduler.LambdaLR(opti, lr_lambda=lambda x:x)
for epoch in range(1, 3):
    loss = w[0].clone()**2 + 2*w[1].clone()**2
    loss.backward(retain_graph=True)
    print("\n=== epoch:",epoch)
    print("w grad:",w.grad)
    print("w before step:",w)
    scheduler.step()
    print("current lr:",scheduler.get_last_lr()[0])
    opti.step()
    print("w after step:",w)
    opti.zero_grad()

"""
=== epoch: 1
w grad: tensor([1., 2.])
w before step: tensor([0.5000, 0.5000], requires_grad=True)
current lr: 0.01
w after step: tensor([0.4900, 0.4900], requires_grad=True)

=== epoch: 2
w grad: tensor([0.9800, 1.9600])
w before step: tensor([0.4900, 0.4900], requires_grad=True)
current lr: 0.02
w after step: tensor([0.4699, 0.4699], requires_grad=True)
"""

In round 1 the learning rate is 0.01 × 1, unchanged, so the result matches Change 6.

In round 2 the learning rate is 0.01 × 2 = 0.02, so:

$$\begin{aligned} w &= [0.49, 0.49] - 0.02 \cdot \frac{[0.74, 1.48] / 0.75}{[0.98, 1.96]} \\ &= [0.49, 0.49] - [0.0201, 0.0201] \\ &= [0.4699, 0.4699] \end{aligned}$$

which matches the code output.

Summary

Run simple examples, vary the hyperparameters, and check whether your hand computations agree with the code's output. This approach both deepens your understanding of deep-learning fundamentals and gets you up to speed with PyTorch code faster.

References

  • https://www.zhihu.com/question/437199981/answer/3252164609