Introduction
I hope that after reading this article you will look at writing parameter-optimization code with fresh eyes.

Main text:
On how to learn deep learning, the original author's advice is: get hands-on, and understand the principles of deep learning through code. If complex code is hard to follow, start from simple code. For example, if you have not fully understood gradient descent, write a few lines of code first and run a simple example.
Loss function: $loss = w_1^2 + 2w_2^2$

Gradient: $\frac{\partial loss}{\partial w} = [2w_1,\ 4w_2]$

Initial value: $w_0 = [0.5,\ 0.5]$

Gradient update:
$$\begin{aligned} w &= w - learning\_rate \cdot \frac{\partial loss}{\partial w} \\ &= [0.5,\ 0.5] - 0.01 \cdot [1,\ 2] \\ &= [0.49,\ 0.48] \end{aligned}$$
The original form:
```python
import torch

# Without zeroing or recomputing the gradient: step() keeps
# reusing the gradient stored by the single backward() call.
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
opti = torch.optim.SGD([w], lr=0.01)
loss.backward()
print("w.grad:", w.grad.data)
print("w:", w)
opti.step()
print("w:", w)
opti.step()
print("w:", w)
opti.step()
print("w:", w)
"""
w.grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4900, 0.4800], requires_grad=True)
w: tensor([0.4800, 0.4600], requires_grad=True)
w: tensor([0.4700, 0.4400], requires_grad=True)
"""
```
Change 1: perform one more parameter update
Computing the second step by hand from the formula:
$$\begin{aligned} w &= w - learning\_rate \cdot \frac{\partial loss}{\partial w} \\ &= [0.49,\ 0.48] - 0.01 \cdot [0.98,\ 1.92] \\ &= [0.4802,\ 0.4608] \end{aligned}$$
```python
# With gradient zeroing: recompute the loss and gradient before each step.
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
opti = torch.optim.SGD([w], lr=0.01)
loss.backward()
print("w.grad:", w.grad.data)
print("w:", w)
opti.step()
print("w:", w)
opti.zero_grad()
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
loss.backward()
print("w.grad:", w.grad.data)
opti.step()
print("w:", w)
opti.zero_grad()
loss = w[0] ** 2 + 2 * w[1] ** 2
loss.backward()
print("w.grad:", w.grad.data)
opti.step()
print("w:", w)
"""
w.grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4900, 0.4800], requires_grad=True)
w.grad: tensor([0.9800, 1.9200])
w: tensor([0.4802, 0.4608], requires_grad=True)
w.grad: tensor([0.9604, 1.8432])
w: tensor([0.4706, 0.4424], requires_grad=True)
"""
```
All opti.step() does is mechanically apply one more gradient update, using whatever gradient is currently stored in w.grad.
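To see what that mechanical update actually is, the following sketch reproduces a single SGD step by hand, with no optimizer at all; it applies the plain rule $w \leftarrow w - lr \cdot w.grad$ and lands on the same values as opti.step():

```python
import torch

# Manual equivalent of torch.optim.SGD's step() for a single tensor.
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
loss.backward()            # w.grad is now [1., 2.]

lr = 0.01
with torch.no_grad():      # parameter updates must not be tracked by autograd
    w -= lr * w.grad       # [0.5, 0.5] - 0.01 * [1, 2] = [0.49, 0.48]

print("w:", w)             # tensor([0.4900, 0.4800], requires_grad=True)
```

The `torch.no_grad()` context mirrors what optimizers do internally: the update itself is not part of the computation graph.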
Change 2: add weight decay
Introducing weight decay is equivalent to adding half the squared L2 norm of the parameter vector to the loss, i.e.:
$$\begin{aligned} loss &= loss + \frac{\lambda}{2}\|w\|^2 \\ &= w_1^2 + 2w_2^2 + \frac{\lambda}{2}(w_1^2 + w_2^2) \end{aligned}$$
In the code weight_decay=1, i.e.:
$$loss = \frac{3}{2}w_1^2 + \frac{5}{2}w_2^2, \qquad \frac{\partial loss}{\partial w} = [3w_1,\ 5w_2]$$
$$\begin{aligned} w &= w - learning\_rate \cdot \frac{\partial loss}{\partial w} \\ &= [0.5,\ 0.5] - 0.01 \cdot [1.5,\ 2.5] \\ &= [0.485,\ 0.475] \end{aligned}$$
```python
# Introduce weight decay
w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
opti = torch.optim.SGD([w], lr=0.01, weight_decay=1)
loss.backward(retain_graph=True)
print("w.grad:", w.grad.data)
print("w:", w)
opti.step()
print("w:", w)
# The stored gradient [1, 2] is stale, but the decay term
# weight_decay * w is recomputed from the current w each step.
opti.step()
print("w:", w)
"""
w.grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4850, 0.4750], requires_grad=True)
w: tensor([0.4702, 0.4502], requires_grad=True)
"""
```
Change 3: introduce momentum
The momentum method introduces the notion of velocity: each update no longer subtracts learning_rate*grad, but the momentum term
$$v_{t+1} = gamma \cdot v_t + learning\_rate \cdot grad$$
The initial value of $v_t$ is 0. In the code, gamma=0.1.
First update:
$$v_1 = 0.1 \cdot 0 + 0.01 \cdot [1,\ 2] = [0.01,\ 0.02]$$
$$\begin{aligned} w &= w - v_1 \\ &= [0.5,\ 0.5] - [0.01,\ 0.02] \\ &= [0.49,\ 0.48] \end{aligned}$$
So far this matches plain SGD without momentum.
Second update:
$$\begin{aligned} v_2 &= 0.1 \cdot v_1 + 0.01 \cdot [1,\ 2] \\ &= [0.001,\ 0.002] + [0.01,\ 0.02] \\ &= [0.011,\ 0.022] \end{aligned}$$
$$\begin{aligned} w &= w - v_2 \\ &= [0.49,\ 0.48] - [0.011,\ 0.022] \\ &= [0.479,\ 0.458] \end{aligned}$$
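This change has no listing of its own, so here is a minimal sketch in the style of the earlier examples: a single backward() call, the stale gradient [1, 2] reused for both steps, and gamma passed as PyTorch's momentum argument. It reproduces the hand-computed values:

```python
import torch

w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0] ** 2 + 2 * w[1] ** 2
opti = torch.optim.SGD([w], lr=0.01, momentum=0.1)
loss.backward()             # grad = [1., 2.], reused (stale) for both steps
opti.step()                 # v1 = [0.01, 0.02],  w = [0.49, 0.48]
print("w:", w)
opti.step()                 # v2 = [0.011, 0.022], w = [0.479, 0.458]
print("w:", w)
"""
w: tensor([0.4900, 0.4800], requires_grad=True)
w: tensor([0.4790, 0.4580], requires_grad=True)
"""
```

Note that PyTorch internally keeps the buffer as $b_t = \mu \cdot b_{t-1} + g_t$ and applies $w \leftarrow w - lr \cdot b_t$; with a constant learning rate this coincides with the $v_t$ formulation above.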
Change 4: the Adagrad optimizer
SGD uses the same learning rate for every parameter. Adagrad uses a different effective learning rate per parameter: the larger a parameter's (accumulated) gradients, the smaller its learning rate, and vice versa.
Adagrad maintains a tensor $s_t$ with the same shape as $w$, where $t$ is the step count and $s_0 = 0$ by default.
First step ($\odot$ denotes element-wise multiplication):
$$\begin{aligned} s_1 &= s_0 + grad \odot grad \\ &= s_0 + [1,\ 2] \odot [1,\ 2] \\ &= [1,\ 4] \end{aligned}$$
$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_1}} \odot grad \\ &= [0.5,\ 0.5] - \frac{0.01}{[1,\ 2]} \odot [1,\ 2] \\ &= [0.49,\ 0.49] \end{aligned}$$
Second step:
$$\begin{aligned} s_2 &= s_1 + grad \odot grad \\ &= s_1 + [0.98,\ 1.96] \odot [0.98,\ 1.96] \\ &= [1,\ 4] + [0.9604,\ 3.8416] \\ &= [1.9604,\ 7.8416] \end{aligned}$$
$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_2}} \odot grad \\ &= [0.49,\ 0.49] - \frac{0.01}{[\sqrt{1.9604},\ \sqrt{7.8416}]} \odot [0.98,\ 1.96] \\ &= [0.483,\ 0.483] \end{aligned}$$
Adagrad is essentially a per-parameter correction to the learning rate: $\frac{learning\_rate}{\sqrt{s_t}}$. The formula shows that the larger a parameter's gradient, the smaller its effective learning rate. In this example $w_2$ has a larger gradient than $w_1$ and therefore a smaller effective learning rate; the two effects cancel exactly, so $w_1$ and $w_2$ receive identical updates. The formula also shows that Adagrad must maintain a tensor $s_t$ with the same shape as $w$: introducing Adagrad costs at least one extra parameter-sized copy of memory.
```python
import torch

w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
opti = torch.optim.Adagrad([w], lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)
loss.backward(retain_graph=True)
print("w.grad:", w.grad)
print("w:", w)
opti.step()
print("w:", w)
opti.zero_grad()
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
loss.backward(retain_graph=True)
print("w.grad:", w.grad)
opti.step()
print("w:", w)
"""
w.grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4900, 0.4900], requires_grad=True)
w.grad: tensor([0.9800, 1.9600])
w: tensor([0.4830, 0.4830], requires_grad=True)
"""
```
Change 5: RMSprop
RMSprop is an improvement on Adagrad. Adagrad's accumulation rule is
$$s_{t+1} = s_t + grad \odot grad$$
while RMSprop uses an exponential moving average:
$$s_{t+1} = \alpha \cdot s_t + (1-\alpha)\, grad \odot grad$$
In other words, there is one extra tunable hyperparameter.
First step:
$$\begin{aligned} s_1 &= 0.5\,s_0 + 0.5 \cdot grad \odot grad \\ &= 0.5\,s_0 + 0.5 \cdot [1,\ 2] \odot [1,\ 2] \\ &= [0.5,\ 2] \end{aligned}$$
$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_1}} \odot grad \\ &= [0.5,\ 0.5] - \frac{0.01}{[\sqrt{0.5},\ \sqrt{2}]} \odot [1,\ 2] \\ &= [0.4859,\ 0.4859] \end{aligned}$$
Second step:
$$\begin{aligned} s_2 &= 0.5\,s_1 + 0.5 \cdot grad \odot grad \\ &= 0.5\,s_1 + 0.5 \cdot [0.9717,\ 1.9434] \odot [0.9717,\ 1.9434] \\ &= [0.25,\ 1] + [0.4721,\ 1.8884] \\ &= [0.7221,\ 2.8884] \end{aligned}$$
$$\begin{aligned} w &= w - \frac{learning\_rate}{\sqrt{s_2}} \odot grad \\ &= [0.4859,\ 0.4859] - \frac{0.01}{[\sqrt{0.7221},\ \sqrt{2.8884}]} \odot [0.9717,\ 1.9434] \\ &= [0.474465,\ 0.474465] \end{aligned}$$
The hand computation differs from the code output by about one part in ten thousand, caused by rounding the intermediate values.
If we set alpha=0, then $s_{t+1} = grad \odot grad$ and the update rule degenerates to
$$w = w - learning\_rate \cdot \frac{grad}{|grad|} = w - learning\_rate \cdot \mathrm{sign}(grad)$$
so every parameter moves by exactly learning_rate regardless of the gradient's magnitude; only the sign of the gradient matters, the step size never adapts, and the network cannot be optimized properly.
```python
import torch

w = torch.tensor([0.5, 0.5], requires_grad=True)
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
opti = torch.optim.RMSprop([w], lr=0.01, alpha=0.5, eps=0, momentum=0, weight_decay=0)
loss.backward(retain_graph=True)
print("w.grad:", w.grad)
print("w:", w)
opti.step()
print("w:", w)
opti.zero_grad()
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
loss.backward(retain_graph=True)
print("w.grad:", w.grad)
opti.step()
print("w:", w)
"""
w.grad: tensor([1., 2.])
w: tensor([0.5000, 0.5000], requires_grad=True)
w: tensor([0.4859, 0.4859], requires_grad=True)
w.grad: tensor([0.9717, 1.9434])
w: tensor([0.4744, 0.4744], requires_grad=True)
"""
```
Change 6: Adam
Adam's update rule:
$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\ w &= w - learning\_rate \cdot \frac{m_t / (1-\beta_1^t)}{\sqrt{v_t / (1-\beta_2^t)}} \end{aligned}$$
Here the code uses betas=(0.5, 0), i.e. $\beta_1=0.5$ and $\beta_2=0$, so $v_t = g_t^2$ with no averaging.
First step:
$$m_1 = 0.5 \cdot [0,\ 0] + 0.5 \cdot [1,\ 2] = [0.5,\ 1], \qquad v_1 = [1,\ 2] \odot [1,\ 2] = [1,\ 4]$$
$$\begin{aligned} w &= w - 0.01 \cdot \frac{[0.5,\ 1]/0.5}{\sqrt{[1,\ 4]}} \\ &= [0.5,\ 0.5] - [0.01,\ 0.01] \\ &= [0.49,\ 0.49] \end{aligned}$$
Second step:
$$\begin{aligned} m_2 &= 0.5 \cdot [0.5,\ 1] + 0.5 \cdot [0.98,\ 1.96] = [0.74,\ 1.48] \\ v_2 &= [0.98,\ 1.96] \odot [0.98,\ 1.96] = [0.9604,\ 3.8416] \\ w &= w - 0.01 \cdot \frac{[0.74,\ 1.48]/0.75}{\sqrt{[0.9604,\ 3.8416]}} \\ &= [0.49,\ 0.49] - 0.01 \cdot \frac{[0.74,\ 1.48]/0.75}{[0.98,\ 1.96]} \\ &= [0.4799,\ 0.4799] \end{aligned}$$
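The two Adam steps above can be confirmed with a listing in the same style as the earlier changes; betas=(0.5, 0) and eps=0 match the hand computation:

```python
import torch

w = torch.tensor([0.5, 0.5], requires_grad=True)
opti = torch.optim.Adam([w], lr=0.01, eps=0, betas=(0.5, 0))

loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
loss.backward()
print("w.grad:", w.grad)   # tensor([1., 2.])
opti.step()
print("w:", w)             # tensor([0.4900, 0.4900], requires_grad=True)

opti.zero_grad()
loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
loss.backward()
print("w.grad:", w.grad)   # tensor([0.9800, 1.9600])
opti.step()
print("w:", w)             # tensor([0.4799, 0.4799], requires_grad=True)
```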
Change 7: combine a learning-rate schedule with gradient descent
PyTorch's learning-rate adjustment strategies fall into three categories:
(1) fixed rules, e.g. step decay (StepLR) or exponential decay (ExponentialLR);
(2) adaptive adjustment based on a monitored metric, e.g. ReduceLROnPlateau;
(3) custom adjustment.
The code below uses a custom schedule: a lambda is defined so that the adjusted learning rate equals the initial learning rate multiplied by the epoch number. Note that in this example the learning rate is adjusted before the gradient update.
```python
import torch

w = torch.tensor([0.5, 0.5], requires_grad=True)
opti = torch.optim.Adam([w], lr=0.01, eps=0, betas=(0.5, 0))
# adjusted lr = initial lr * epoch number
scheduler = torch.optim.lr_scheduler.LambdaLR(opti, lr_lambda=lambda x: x)
for epoch in range(2):
    loss = w[0].clone() ** 2 + 2 * w[1].clone() ** 2
    loss.backward(retain_graph=True)
    print("\n=== round:", epoch + 1)
    print("w.grad:", w.grad)
    print("w before the update:", w)
    scheduler.step()   # in this example the lr is adjusted *before* the update
    print("current lr:", scheduler.get_last_lr()[0])
    opti.step()
    print("w after the update:", w)
    opti.zero_grad()
"""
=== round: 1
w.grad: tensor([1., 2.])
w before the update: tensor([0.5000, 0.5000], requires_grad=True)
current lr: 0.01
w after the update: tensor([0.4900, 0.4900], requires_grad=True)

=== round: 2
w.grad: tensor([0.9800, 1.9600])
w before the update: tensor([0.4900, 0.4900], requires_grad=True)
current lr: 0.02
w after the update: tensor([0.4699, 0.4699], requires_grad=True)
"""
```
In the first round the learning rate is 0.01*1, unchanged, so the result matches Change 6.
In the second round the learning rate is 0.01*2 = 0.02, so:
$$\begin{aligned} w &= w - 0.02 \cdot \frac{[0.74,\ 1.48]/0.75}{[0.98,\ 1.96]} \\ &= [0.49,\ 0.49] - 0.02 \cdot \frac{[0.74,\ 1.48]/0.75}{[0.98,\ 1.96]} \\ &= [0.4699,\ 0.4699] \end{aligned}$$
This matches the code output.
Summary
Run code on simple examples and try out a variety of parameters, then check whether the hand-computed results agree with the code's output. This approach both deepens your understanding of deep-learning fundamentals and gets you up to speed with PyTorch code faster.
References
- https://www.zhihu.com/question/437199981/answer/3252164609