Deep Learning (23) Stochastic Gradient Descent I: Introduction to Stochastic Gradient Descent
Outline
- What’s Gradient
- What does it mean
- How to Search
- AutoGrad
1. What’s Gradient?
- Derivative
- Partial derivative
- Gradient
$∇f=\left(\frac{∂f}{∂x_1},\ \frac{∂f}{∂x_2},\ \dots,\ \frac{∂f}{∂x_n}\right)$
2. What does it mean?
As can be seen, the gradient is the vector whose components are the partial derivatives of the function with respect to x and y. Where the function rises, the gradient points outward (uphill); where the function rises quickly, the gradient vector has a large magnitude (long arrows in the figure); where the function is nearly flat or the gradient is zero, the magnitude is small (short arrows).
In the figure above, the blue region is where the function value is small: the arrows spread outward from it, and the gradient direction is the direction in which the function value increases. The red region is where the function value is large: the arrows converge toward it, and the opposite direction of the gradient is the direction in which the function value decreases.
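As a minimal sketch of this idea, assume the concrete function $f(x,y)=x^2+y^2$ (an illustrative choice, not taken from the figure): its gradient points away from the minimum at the origin, and its magnitude grows with the distance from that minimum.

```python
import numpy as np

def grad_f(x, y):
    """Gradient of f(x, y) = x**2 + y**2 (an illustrative choice)."""
    return np.array([2.0 * x, 2.0 * y])

for point in [(0.5, 0.0), (1.0, 1.0), (2.0, -2.0)]:
    g = grad_f(*point)
    print(point, "gradient:", g, "magnitude:", np.linalg.norm(g))
# The gradient at each point aims away from the minimum at (0, 0);
# the farther the point, the larger the gradient's magnitude (longer arrow).
```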
3. How to search?
- $∇f →$ larger value
- Search for minima:
  $θ_{t+1}=θ_t-α_t ∇f(θ_t)$
  where the learning rate is denoted $lr$, $α$, or $η$.
4. For instance
$θ_{t+1}=θ_t-α_t ∇f(θ_t)$
(1) Function:
$J(θ_1,θ_2)=θ_1^2+θ_2^2$
(2) Objective:
$\min_{θ_1,θ_2} J(θ_1,θ_2)$
(3) Update rules:
$θ_1:=θ_1-α \frac{d}{dθ_1}J(θ_1,θ_2)$
$θ_2:=θ_2-α \frac{d}{dθ_2}J(θ_1,θ_2)$
(4) Derivatives:
$\frac{d}{dθ_1} J(θ_1,θ_2)=\frac{d}{dθ_1} θ_1^2+\frac{d}{dθ_1} θ_2^2=2θ_1$
$\frac{d}{dθ_2} J(θ_1,θ_2)=\frac{d}{dθ_2} θ_1^2+\frac{d}{dθ_2} θ_2^2=2θ_2$
5. Learning Process
(1) Learning Process-1
(2) Learning Process-2
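As a minimal sketch of the learning process, assuming a starting point of (2, −3) and a learning rate of 0.1, gradient descent on $J(θ_1,θ_2)=θ_1^2+θ_2^2$ converges toward the minimum at the origin:

```python
# Plain-Python gradient descent on J(θ1, θ2) = θ1² + θ2²
theta1, theta2 = 2.0, -3.0    # assumed starting point
lr = 0.1                      # assumed learning rate (α / η)

for step in range(50):
    grad1 = 2 * theta1        # dJ/dθ1 = 2·θ1
    grad2 = 2 * theta2        # dJ/dθ2 = 2·θ2
    theta1 -= lr * grad1      # θ1 := θ1 − α·dJ/dθ1
    theta2 -= lr * grad2      # θ2 := θ2 − α·dJ/dθ2

print(theta1, theta2)         # both values approach 0, the minimizer of J
```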
6. AutoGrad
- `with tf.GradientTape() as tape:`
  - Build computation graph
  - $loss=f_θ(x)$
- `[w_grad] = tape.gradient(loss, [w])`
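A minimal runnable sketch of this pattern, assuming a linear model $f_θ(x)=xw+b$ and a squared-error loss chosen only for illustration:

```python
import tensorflow as tf

x = tf.constant(2.0)
y_true = tf.constant(7.0)        # assumed target value, for illustration only
w = tf.Variable(1.0)
b = tf.Variable(0.5)

with tf.GradientTape() as tape:
    # build the computation graph inside the tape: loss = f_θ(x)
    y_pred = x * w + b
    loss = tf.square(y_pred - y_true)

# gradients of the loss with respect to the parameters we want to update
[w_grad, b_grad] = tape.gradient(loss, [w, b])
print(w_grad, b_grad)
```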
7. GradientTape
(1) `with tf.GradientTape() as tape`: place all computations involving the parameters to be updated inside this context, so that the tape can record their gradients;
(2) `grad1 = tape.gradient(y, [w])`: here y is the loss and [w] is the list of parameters to be updated;
As can be seen, the result is [None]. This is because the function placed inside the tape is $y_2=x*w$, which is unrelated to y, so the return value is [None];
(3) `grad2 = tape.gradient(y2, [w])`: the gradient used to update w, i.e. $\frac{∂y_2}{∂w}=x=2$
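A minimal runnable sketch of the behaviour in (2) and (3): x = 2 matches the text above, the value of w is an assumption, and persistent=True is used only so that gradient() can be called twice (see the next section).

```python
import tensorflow as tf

x = tf.constant(2.0)     # chosen so that ∂y2/∂w = x = 2, matching the text
w = tf.Variable(1.5)     # the value of w is an assumption for illustration

y = x * w                # computed OUTSIDE the tape, so it is never recorded

with tf.GradientTape(persistent=True) as tape:
    y2 = x * w           # recorded by the tape

grad1 = tape.gradient(y, [w])    # [None]: y is unrelated to anything on the tape
grad2 = tape.gradient(y2, [w])   # [tf.Tensor(2.0, ...)]: ∂y2/∂w = x = 2
print(grad1)
print(grad2)
del tape                 # release the tape's resources after persistent use
```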
8. Persistent GradientTape (multiple calls)
`with tf.GradientTape(persistent=True) as tape`: set the parameter persistent=True so that gradient() can be called on the same tape multiple times;
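A short sketch of why the flag matters, with assumed values: without persistent=True a tape is consumed by its first gradient() call, while a persistent tape can be reused.

```python
import tensorflow as tf

x = tf.constant(2.0)
w = tf.Variable(3.0)

# Non-persistent tape: only one gradient() call is allowed.
with tf.GradientTape() as tape:
    y2 = x * w
tape.gradient(y2, [w])               # first call works
try:
    tape.gradient(y2, [w])           # second call is rejected
except RuntimeError as err:
    print("second call failed:", err)

# Persistent tape: gradient() can be called repeatedly.
with tf.GradientTape(persistent=True) as tape:
    y2 = x * w
print(tape.gradient(y2, [w]))        # [2.0]
print(tape.gradient(y2, [w]))        # [2.0] again, no error
del tape                             # free the tape's resources when done
```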
9. 2nd-order
- $y=xw+b$
- $\frac{∂y}{∂w}=x$
- $\frac{∂^2 y}{∂w^2}=\frac{∂y'}{∂w}=\frac{∂x}{∂w}=\text{None}$
- Two `tf.GradientTape()` contexts must be nested, because a second-order derivative is required.
10. Second-Order Differentiation in Practice
```python
import tensorflow as tf

w = tf.Variable(1.0)
b = tf.Variable(2.0)
x = tf.Variable(3.0)

with tf.GradientTape() as t1:
    with tf.GradientTape() as t2:
        y = x * w + b
    # first-order gradients, recorded by the inner tape
    dy_dw, dy_db = t2.gradient(y, [w, b])
# second-order gradient of y with respect to w, recorded by the outer tape
d2y_dw2 = t1.gradient(dy_dw, w)

print(dy_dw)    # 3.0  -> ∂y/∂w = x = 3
print(dy_db)    # 1.0  -> ∂y/∂b = 1
print(d2y_dw2)  # None -> dy_dw = x does not depend on w

assert dy_dw.numpy() == 3.0
assert d2y_dw2 is None
```
Running the code gives dy_dw = 3.0, dy_db = 1.0, and d2y_dw2 = None, consistent with the assertions at the end of the script.