In neural networks, one frequently encounters problems involving many variables at once.
Vector and matrix calculus extends single-variable calculus to this setting.
1 Vector Gradient
Vector gradient: let $g(\bold w)$ be a differentiable scalar-valued function of $m$ variables, where $\bold w=[w_1,\dots,w_m]^T$. The gradient of $g(\bold w)$ is the vector of partial derivatives of $g$:

$$\nabla g=\frac{\partial g}{\partial \bold w}= \left(\begin{matrix} \frac{\partial g}{\partial w_1} \\ \vdots \\ \frac{\partial g}{\partial w_m} \end{matrix} \right)$$
Similarly, one can define the second-order gradient matrix, or Hessian matrix:
$$\frac{\partial^2 g}{\partial \bold w^2}=\left( \begin{matrix} \frac{\partial^2 g}{\partial w_1^2} & \cdots & \frac{\partial^2 g}{\partial w_1\partial w_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 g}{\partial w_m\partial w_1} & \cdots & \frac{\partial^2 g}{\partial w_m^2} \end{matrix}\right)$$
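As a concrete check, the gradient and Hessian above can be approximated numerically by central finite differences. This is a minimal sketch using NumPy; the example function `g` and the step sizes `eps` are illustrative choices, not part of the original text:

```python
import numpy as np

def grad(g, w, eps=1e-6):
    # Central-difference approximation of the gradient vector:
    # component i is (g(w + eps*e_i) - g(w - eps*e_i)) / (2*eps).
    d = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        d[i] = (g(w + e) - g(w - e)) / (2 * eps)
    return d

def hessian(g, w, eps=1e-4):
    # Column j of the Hessian is the finite-difference derivative
    # of the gradient with respect to w_j.
    m = len(w)
    H = np.zeros((m, m))
    for j in range(m):
        e = np.zeros_like(w)
        e[j] = eps
        H[:, j] = (grad(g, w + e) - grad(g, w - e)) / (2 * eps)
    return H

g = lambda w: w[0]**2 + 3 * w[0] * w[1]   # illustrative scalar function
w = np.array([1.0, 2.0])
print(grad(g, w))      # analytic gradient: [2*w0 + 3*w1, 3*w0] = [8, 3]
print(hessian(g, w))   # analytic Hessian: [[2, 3], [3, 0]]
```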
Generalizing to a vector-valued function, we have

$$g(\bold w)=[g_1(\bold w),\cdots,g_n(\bold w)]^T$$
This leads to the definition of the Jacobian matrix:

$$\frac{\partial g}{\partial \bold w}=\left( \begin{matrix} \frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_1}{\partial w_m} & \cdots & \frac{\partial g_n}{\partial w_m} \end{matrix}\right)$$
Under this convention, the columns of the Jacobian matrix are the gradients of the corresponding component functions $g_i(\bold w)$.
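The Jacobian can likewise be approximated by finite differences, filling one row per variable $w_j$ so that each column ends up being the gradient of one component function $g_i$. The example function `g` below is an illustrative choice:

```python
import numpy as np

def jacobian(g, w, eps=1e-6):
    # Returns the m x n matrix whose (j, i) entry approximates
    # dg_i / dw_j, matching the convention in the text: column i
    # is the gradient of the component function g_i(w).
    n = len(g(w))
    m = len(w)
    J = np.zeros((m, n))
    for j in range(m):
        e = np.zeros(m)
        e[j] = eps
        J[j, :] = (g(w + e) - g(w - e)) / (2 * eps)
    return J

g = lambda w: np.array([w[0] * w[1], w[0] + w[1]])   # illustrative g: R^2 -> R^2
w = np.array([2.0, 3.0])
print(jacobian(g, w))
# column 0 = gradient of w0*w1 = [3, 2]; column 1 = gradient of w0+w1 = [1, 1]
```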
2 Differentiation Formulas
The commonly used differentiation formulas are:
$$\frac{\partial f(\bold w)g(\bold w)}{\partial \bold w} = \frac{\partial f(\bold w)}{\partial \bold w}g(\bold w)+f(\bold w)\frac{\partial g(\bold w)}{\partial \bold w} \tag{1}$$

$$\frac{\partial f(\bold w)/g(\bold w)}{\partial \bold w} = \frac{\frac{\partial f(\bold w)}{\partial \bold w}g(\bold w)-f(\bold w)\frac{\partial g(\bold w)}{\partial \bold w}}{g^2(\bold w)} \tag{2}$$

$$\frac{\partial f(g(\bold w))}{\partial \bold w} = f'(g(\bold w))\frac{\partial g(\bold w)}{\partial \bold w} \tag{3}$$

$$\frac{\partial \bold a^T\bold w}{\partial \bold w}=\bold a \tag{4}$$
$$\frac{\partial \bold w^TA\bold w}{\partial \bold w}=A\bold w+A^T\bold w \tag{5}$$
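The linear rule (4), $\partial(\bold a^T\bold w)/\partial \bold w = \bold a$, and the quadratic-form rule (5), $\partial(\bold w^TA\bold w)/\partial \bold w = A\bold w + A^T\bold w$, can be checked numerically against a finite-difference gradient. The random `a`, `A`, and `w` below are arbitrary test data:

```python
import numpy as np

def grad(g, w, eps=1e-6):
    # Central-difference gradient of a scalar function g at w.
    d = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        d[i] = (g(w + e) - g(w - e)) / (2 * eps)
    return d

rng = np.random.default_rng(0)
a = rng.normal(size=3)
A = rng.normal(size=(3, 3))
w = rng.normal(size=3)

# Linear rule: gradient of a^T w is a.
print(np.allclose(grad(lambda w: a @ w, w), a, atol=1e-5))
# Quadratic-form rule: gradient of w^T A w is A w + A^T w.
print(np.allclose(grad(lambda w: w @ A @ w, w), A @ w + A.T @ w, atol=1e-4))
```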
3 Optimization Methods
Gradient descent (Gradient descent): gradient descent minimizes a given cost function $J(\bold w)$ by repeatedly stepping against its gradient:

$$\Delta \bold w(t)=-\alpha(t)\frac{\partial J(\bold w)}{\partial \bold w}$$

where $\alpha(t)>0$ is the step size (learning rate).
The optimization steps are:
- start from an initial value $\bold w(0)$
- compute the gradient $\nabla J(\bold w)$ at $\bold w(0)$
- move some distance in the direction of the negative gradient (steepest descent)
- repeat the steps above until successive points are sufficiently close
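The steps above can be sketched as follows; the quadratic cost, fixed step size, and tolerance are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, tol=1e-8, max_iter=10_000):
    # Iterate w(t+1) = w(t) - alpha * grad J(w(t)) until successive
    # points are sufficiently close (step norm below tol).
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = w - alpha * grad_J(w)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Illustrative cost J(w) = (w0 - 1)^2 + 2*(w1 + 2)^2, minimized at (1, -2).
grad_J = lambda w: np.array([2 * (w[0] - 1), 4 * (w[1] + 2)])
w_opt = gradient_descent(grad_J, [0.0, 0.0])
print(w_opt)   # converges to approximately [1, -2]
```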
4 Lagrange Multiplier Method
In general, a constrained optimization problem can be stated as:

$$\min J(\bold w), \quad \text{subject to} \quad H_i(\bold w) = 0, \quad i=1,\dots,k.$$
where $J$ is the cost function and the $H_i$ are the constraints.
The Lagrange multiplier method is widely used for solving constrained optimization problems.
It constructs the Lagrange function:
$$L(\bold w,\lambda_1,\lambda_2,\dots,\lambda_k)=J(\bold w)+\sum_{i=1}^k \lambda_iH_i(\bold w)$$
where the $\lambda_i$ are the Lagrange multipliers.
Setting the gradient of $L$ to zero yields the candidate extrema; with respect to $\bold w$ this gives:

$$\frac{\partial J(\bold w)}{\partial \bold w} + \sum_{i=1}^k\lambda_i\frac{\partial H_i(\bold w)}{\partial \bold w}=0$$

while setting $\partial L/\partial \lambda_i=0$ recovers the constraints $H_i(\bold w)=0$.
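For a quadratic cost with a single linear constraint, the stationarity conditions above form a linear system in $(\bold w, \lambda)$ that can be solved directly. The specific `J` and `H` below are illustrative, not from the original text:

```python
import numpy as np

# Illustrative problem: min J(w) = w1^2 + w2^2 subject to
# H(w) = w1 + w2 - 1 = 0.
# Stationarity of L = J + lam*H gives 2*w1 + lam = 0 and
# 2*w2 + lam = 0; together with the constraint w1 + w2 = 1
# this is a 3x3 linear system in (w1, w2, lam).
M = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
w1, w2, lam = np.linalg.solve(M, b)
print(w1, w2, lam)   # 0.5 0.5 -1.0
```

The solution $(0.5, 0.5)$ is the point on the line $w_1 + w_2 = 1$ closest to the origin, as expected.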
5 Projection Methods
If the constraints are simple (for example, normalization of the parameter vector), they define a simple set of admissible values.
Admissible parameter values can then be found with a projection method:
- use an unconstrained optimization method
- after each step of the optimization method, orthogonally project the intermediate solution onto the constraint set
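A minimal sketch of this projected-descent scheme, assuming the constraint is normalization of the parameter vector (so the orthogonal projection onto the unit sphere is division by the norm). The cost function and step size are illustrative choices:

```python
import numpy as np

def projected_gradient_descent(grad_J, project, w0, alpha=0.1, iters=500):
    # Unconstrained gradient step, then orthogonal projection of the
    # intermediate solution back onto the constraint set.
    w = project(np.asarray(w0, dtype=float))
    for _ in range(iters):
        w = project(w - alpha * grad_J(w))
    return w

# Illustrative problem: min J(w) = ||w - c||^2 subject to ||w|| = 1.
# The minimizer on the unit sphere is c / ||c||.
c = np.array([3.0, 4.0])
grad_J = lambda w: 2 * (w - c)
project = lambda w: w / np.linalg.norm(w)   # projection onto the unit sphere
w_star = projected_gradient_descent(grad_J, project, [1.0, 0.0])
print(w_star)   # converges to c/||c|| = [0.6, 0.8]
```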