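1. Introduction to gradients
Training a deep learning model is, at its core, minimizing a loss. The loss is minimized by computing gradients and then updating the parameters with an optimization algorithm; common choices include SGD, Momentum, Adagrad, RMSProp, and Adam. This post summarizes how the gradients are computed.
2. The chain rule
Computing gradients directly by (numerical) differentiation is too expensive. The error backpropagation algorithm (BP) makes the computation far more efficient, and it is based mainly on the chain rule.
The chain rule gives the derivative of a composite function: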
For example, the multivariate composite function $f=x^2y$ can be viewed as $f(x,y)=p(x)\,q(y)$, where $p(x)=x^2$ and $q(y)=y$.
$$\frac{\partial f}{\partial x}=\frac{\partial f}{\partial p}\cdot\frac{\partial p}{\partial x}=q\cdot 2x=2xy$$
$$\frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\cdot\frac{\partial q}{\partial y}=p\cdot 1=x^2$$
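The two partial derivatives above can be sanity-checked numerically. The sketch below is only an illustration: the function names, the test point (3, 2), and the step size `eps` are assumptions chosen here, not part of the original post.

```python
def f(x, y):
    """f(x, y) = p(x) * q(y) with p(x) = x**2 and q(y) = y."""
    return x ** 2 * y


def analytic_grad(x, y):
    """Partials from the chain rule: df/dx = 2xy, df/dy = x**2."""
    return 2 * x * y, x ** 2


def numeric_grad(x, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy


x0, y0 = 3.0, 2.0  # arbitrary test point
print(analytic_grad(x0, y0))  # (12.0, 9.0)
print(numeric_grad(x0, y0))   # approximately the same values
```

The finite-difference version is far slower than the analytic one when there are many variables, which is the motivation for backpropagation given earlier.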
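Once the partial derivatives are available and the gradient of the output y is known, the gradient of each input equals its partial derivative multiplied by the gradient of y. A few cases come up often:
1. For a + b = y, backpropagation gives a and b the same gradient, equal to the gradient of y.
2. For a * b = y, the gradients of a and b are b and a respectively, each multiplied by the gradient of y; for matrix operations, transposes are involved.
3. For a max operation, the gradient flows only to the input that attained the maximum.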
Figure from 梯度是如何计算的 (How gradients are computed)
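The three backward rules listed above can be sketched as tiny functions. This is a minimal scalar illustration, not the author's code: the function names, the upstream gradient `dy`, and the tie-breaking rule in `max_backward` are all assumptions made here.

```python
def add_backward(dy):
    """y = a + b: both inputs receive the upstream gradient unchanged."""
    return dy, dy


def mul_backward(a, b, dy):
    """y = a * b: each input's gradient is the other input times dy."""
    return b * dy, a * dy


def max_backward(a, b, dy):
    """y = max(a, b): only the larger input receives the gradient.

    Ties are routed to `a` here; that tie-breaking rule is an assumption.
    """
    return (dy, 0.0) if a >= b else (0.0, dy)


dy = 2.0  # assumed upstream gradient dL/dy
print(add_backward(dy))            # (2.0, 2.0)
print(mul_backward(3.0, 4.0, dy))  # (8.0, 6.0)
print(max_backward(3.0, 4.0, dy))  # (0.0, 2.0)
```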
3. Gradient computation for logistic regression
Logistic regression proceeds as follows:
- Fully connected layer: $z=w^Tx+b=w_1x_1+w_2x_2+b$
- Activation layer: $\hat{y}=a=\sigma(z)$
- Loss layer (binary cross-entropy): $L(a,y)=-\left(y\log(a)+(1-y)\log(1-a)\right)$
The activation function here is the sigmoid, $a=\sigma(z)=\frac{1}{1+e^{-z}}$, whose derivative is
$$\frac{da}{dz}=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\cdot\left(1-\frac{1}{1+e^{-z}}\right)$$
$$\frac{da}{dz}=a(1-a)$$
First, differentiate $L(a,y)$ with respect to $a$:
$$\frac{dL(a,y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a}$$
By the chain rule,
$$\frac{dL(a,y)}{dz}=\frac{dL}{da}\cdot\frac{da}{dz}$$
Therefore, writing $dz$ as shorthand for $\frac{dL(a,y)}{dz}$,
$$dz=\frac{dL(a,y)}{dz}=\frac{dL}{da}\cdot\frac{da}{dz}=\left[-\frac{y}{a}+\frac{1-y}{1-a}\right]\cdot a(1-a)$$
$$dz=a-y$$
Going one step further to $w$ and $b$, and averaging over the $m$ training samples:
$$dw_1=\frac{1}{m}\sum_{i=1}^{m}x_1^{(i)}(a^{(i)}-y^{(i)})$$
$$dw_2=\frac{1}{m}\sum_{i=1}^{m}x_2^{(i)}(a^{(i)}-y^{(i)})$$
$$db=\frac{1}{m}\sum_{i=1}^{m}(a^{(i)}-y^{(i)})$$
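The formulas for $dw_1$, $dw_2$, and $db$ can be written in vectorized form and compared against finite differences. The sketch below assumes the cost is the mean of the per-sample losses (which is what the $\frac{1}{m}$ factor suggests); the data, the parameter values, and the helper names are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, b, X, y):
    """Mean binary cross-entropy; X has shape (n, m), y has shape (m,)."""
    a = sigmoid(w @ X + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))


def gradients(w, b, X, y):
    """dw_j = mean(x_j * (a - y)), db = mean(a - y), as derived above."""
    a = sigmoid(w @ X + b)
    dz = a - y                    # per-sample dL/dz
    dw = X @ dz / X.shape[1]      # shape (n,)
    db = dz.mean()
    return dw, db


rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))       # two features, m = 5 samples (made up)
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
w, b = np.array([0.3, -0.2]), 0.1

dw, db = gradients(w, b, X, y)

# Independent check with central finite differences.
eps = 1e-6
num_dw = np.array([
    (loss(w + eps * e, b, X, y) - loss(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(2)
])
num_db = (loss(w, b + eps, X, y) - loss(w, b - eps, X, y)) / (2 * eps)
print(np.allclose(dw, num_dw, atol=1e-6), np.isclose(db, num_db, atol=1e-6))
```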
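4. Deriving gradients in matrix form
For derivatives of a scalar with respect to a matrix, see another blog post that I reposted, 矩阵求导术 (Matrix Differentiation Techniques).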
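While the topic is fresh, let us derive the gradients in matrix form, stating the result first. Suppose $D=WX$, and let dD, dW, and dX denote the gradients of the loss with respect to D, W, and X. Then
dW = dD.dot(X.T)
dX = W.T.dot(dD)
Here a*b denotes element-wise multiplication of two matrices, and a.dot(b) denotes the matrix product.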
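The stated result can be checked numerically before deriving it. In the sketch below, the shapes, the upstream gradient `G`, and the scalar loss `sum(G * (W X))` are hypothetical choices made only so that the gradient with respect to D is known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
G = rng.normal(size=(3, 5))   # assumed upstream gradient dD = dL/dD


def scalar_loss(W, X):
    """Hypothetical scalar loss L = sum(G * (W X)), so that dL/dD = G."""
    return np.sum(G * W.dot(X))


dD = G
dW = dD.dot(X.T)              # shape (3, 4), same as W
dX = W.T.dot(dD)              # shape (4, 5), same as X


def numeric_grad(f, A, eps=1e-6):
    """Central finite differences, perturbing one entry of A at a time."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = f()
        A[idx] = old - eps
        down = f()
        A[idx] = old          # restore the original entry
        grad[idx] = (up - down) / (2 * eps)
    return grad


num_dW = numeric_grad(lambda: scalar_loss(W, X), W)
num_dX = numeric_grad(lambda: scalar_loss(W, X), X)
print(np.allclose(dW, num_dW, atol=1e-6), np.allclose(dX, num_dX, atol=1e-6))
```

Note that each gradient has the same shape as the matrix it belongs to, which is a quick way to remember where the transposes go.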
4.1 Background
In single-variable calculus, the differential $df$ and the derivative $f'(x)$ are related by
$$df=f'(x)\,dx$$
In multivariable calculus, the total differential $df$ and the gradient $\frac{\partial f}{\partial x}$ are related by
$$df=\sum_{i=1}^{n}\frac{\partial f}{\partial x_i}dx_i=\left(\frac{\partial f}{\partial x}\right)^Tdx$$
In other words, the total differential $df$ is the inner product of the gradient vector, of shape (n,1), and the differential vector $dx$, also of shape (n,1).
Similarly, the differential $df$ and the matrix derivative $\frac{\partial f}{\partial X}$ (the derivative of a scalar with respect to a matrix) are related by
$$df=\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\partial f}{\partial X_{ij}}dX_{ij}=\mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^TdX\right)$$
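Here tr denotes the trace of a matrix, that is, the sum of the diagonal entries of a square matrix. It satisfies the following property: for matrices A and B of the same size,
$$\mathrm{tr}(A^TB)=\sum_{i,j}A_{ij}B_{ij}$$
An example helps to see why this holds.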
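As one possible illustration of the trace identity, take two small matrices. The $2\times 2$ values below are made up for this example and do not come from the original post.

$$A=\begin{bmatrix}1&2\\3&4\end{bmatrix},\quad B=\begin{bmatrix}5&6\\7&8\end{bmatrix},\quad A^TB=\begin{bmatrix}1&3\\2&4\end{bmatrix}\begin{bmatrix}5&6\\7&8\end{bmatrix}=\begin{bmatrix}26&30\\38&44\end{bmatrix}$$
$$\mathrm{tr}(A^TB)=26+44=70,\qquad \sum_{i,j}A_{ij}B_{ij}=1\cdot 5+2\cdot 6+3\cdot 7+4\cdot 8=70$$

Each diagonal entry of $A^TB$ pairs one column of $A$ with the same column of $B$, so summing the diagonal adds up every product $A_{ij}B_{ij}$ exactly once.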
Commonly used rules for differentials:
- $d(X\pm Y)=dX\pm dY$
- $d(XY)=(dX)Y+X\,dY$
- $d(X^T)=(dX)^T$
- $d\,\mathrm{tr}(X)=\mathrm{tr}(dX)$
- $dX^{-1}=-X^{-1}\,dX\,X^{-1}$, which can be proved by taking the differential of $XX^{-1}=I$
- Determinant: $d|X|=\mathrm{tr}(X^{*}dX)$, where $X^{*}$ is the adjugate matrix of $X$; when $X$ is invertible this can be written as $d|X|=|X|\,\mathrm{tr}(X^{-1}dX)$. This can be proved with the Laplace expansion.
- Element-wise multiplication: $d(X\odot Y)=(dX)\odot Y+X\odot dY$, where $\odot$ denotes element-wise multiplication of matrices $X$ and $Y$ of the same size
- Element-wise function: $d\sigma(X)=\sigma'(X)\odot dX$, where $\sigma(X)=[\sigma(X_{ij})]$ applies the scalar function to each entry and $\sigma'(X)=[\sigma'(X_{ij})]$ applies the scalar derivative to each entry
Element-wise means that the function value or the derivative is computed for each entry of the matrix separately. For example, with $X=[x_1,x_2]$,
$$d(\sin X)=[\cos x_1\,dx_1,\ \cos x_2\,dx_2]=\cos X\odot dX$$
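Several of the rules above can be checked numerically by replacing the differential $dX$ with a small finite perturbation and comparing the actual change with the first-order prediction. The matrices below and the helper `rel_err` are arbitrary choices for this sketch, and agreement is only expected up to terms of second order in the perturbation.

```python
import numpy as np

X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])           # arbitrary invertible matrix
dX = 1e-7 * np.array([[1.0, 2.0, -1.0],
                      [0.0, 1.0, 3.0],
                      [2.0, -1.0, 1.0]])  # small perturbation standing in for dX


def rel_err(actual, predicted):
    """Relative gap between a finite change and its first-order prediction."""
    return np.linalg.norm(actual - predicted) / np.linalg.norm(predicted)


X_inv = np.linalg.inv(X)

# d(X^{-1}) = -X^{-1} dX X^{-1}
err_inv = rel_err(np.linalg.inv(X + dX) - X_inv, -X_inv @ dX @ X_inv)

# d|X| = |X| tr(X^{-1} dX)
actual_det = np.linalg.det(X + dX) - np.linalg.det(X)
predicted_det = np.linalg.det(X) * np.trace(X_inv @ dX)
err_det = abs(actual_det - predicted_det) / abs(predicted_det)

# d(sin X) = cos X ⊙ dX  (element-wise function)
err_sin = rel_err(np.sin(X + dX) - np.sin(X), np.cos(X) * dX)

print(err_inv, err_det, err_sin)  # each should be tiny (second order in dX)
```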
Trace tricks
- A scalar equals its own trace: $a=\mathrm{tr}(a)$
- Transpose: $\mathrm{tr}(A^T)=\mathrm{tr}(A)$
- Linearity: $\mathrm{tr}(A\pm B)=\mathrm{tr}(A)\pm\mathrm{tr}(B)$
- Swapping a matrix product: $\mathrm{tr}(AB)=\mathrm{tr}(BA)$. Suppose $A$ is $m\times n$ and $B$ is $n\times m$; then $(AB)_{ii}=\sum_{j=1}^{n}a_{ij}b_{ji}$, so $\mathrm{tr}(AB)=\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}b_{ji}=\sum_{i=1}^{n}\sum_{j=1}^{m}b_{ij}a_{ji}=\mathrm{tr}(BA)$
- Swapping matrix and element-wise products: $\mathrm{tr}(A^T(B\odot C))=\mathrm{tr}((A\odot B)^TC)$, where $A$, $B$, and $C$ all have the same size $m\times n$; both sides equal $\sum_{i,j}A_{ij}B_{ij}C_{ij}$
Reference: 【089】深度学习读书笔记:P29证明迹Tr(AB)=Tr(BA) (Deep learning reading notes, p. 29: proof that Tr(AB)=Tr(BA))
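The two swap rules are easy to confirm on random matrices. The shapes and variable names in this sketch are arbitrary assumptions; `P`, `Q`, and `R` play the roles of $A$, $B$, and $C$ in the element-wise rule.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))

# Matrix-product swap: tr(AB) = tr(BA), although AB is 3x3 and BA is 4x4.
lhs_swap, rhs_swap = np.trace(A @ B), np.trace(B @ A)

# Element-wise swap: P, Q, R all share the same (m, n) shape.
P, Q, R = (rng.normal(size=(3, 4)) for _ in range(3))
lhs_had = np.trace(P.T @ (Q * R))
rhs_had = np.trace((P * Q).T @ R)
elementwise_sum = np.sum(P * Q * R)

print(np.isclose(lhs_swap, rhs_swap))
print(np.isclose(lhs_had, rhs_had), np.isclose(lhs_had, elementwise_sum))
```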
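4.2 Backpropagation derivation for a three-layer neural network
import numpy as np
# Forward pass (sigmoid and cross_entropy are assumed to be defined elsewhere)
Z_1 = np.dot(W_1.T, X) + b_1    # shape (N1, M); N1 is the number of neurons in the first hidden layer
A_1 = sigmoid(Z_1)              # shape (N1, M)
Z_2 = np.dot(W_2.T, A_1) + b_2  # shape (N2, M); N2 is the number of neurons in the output layer
A_2 = sigmoid(Z_2)              # shape (N2, M); N2 = 1 in this example
L = cross_entropy(A_2, Y)       # scalar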
The loss function in matrix form is
$$L=-\left(Y\odot\log(A_2)+(1-Y)\odot\log(1-A_2)\right)I$$
Here $I$ is an all-ones matrix of shape (M,1); multiplying by it sums the (1,M) row of per-sample losses.
- Take the differential $dL$. From $d(XY)=(dX)Y+X\,dY$ and $d(X\odot Y)=(dX)\odot Y+X\odot dY$ we get
$$dL=-\left(dY\odot\log(A_2)+Y\odot d\log(A_2)+d(1-Y)\odot\log(1-A_2)+(1-Y)\odot d\log(1-A_2)\right)I$$
The differential of a constant matrix is the zero matrix, and $d\log(A_2)=\frac{1}{A_2}\odot dA_2$, $d\log(1-A_2)=-\frac{1}{1-A_2}\odot dA_2$ (the divisions are element-wise). Substituting gives
$$dL=-\left(Y\odot\frac{1}{A_2}\odot dA_2-(1-Y)\odot\frac{1}{1-A_2}\odot dA_2\right)I$$
$$dL=\left(\frac{A_2-Y}{A_2\odot(1-A_2)}\odot dA_2\right)I$$
Continuing with the differentials of $A_2$ and $Z_2$ brings in $dW_2$:
$$dA_2=A_2\odot(1-A_2)\odot dZ_2$$
$$dZ_2=d(W_2^T)A_1+W_2^TdA_1+db_2$$
Here $A_1$, $W_2$, and $b_2$ are all variables. Substituting $dA_2$ into $dL$ gives
$$dL=\left((A_2-Y)\odot dZ_2\right)I$$
$$dL=\left((A_2-Y)\odot[d(W_2^T)A_1]+(A_2-Y)\odot[W_2^TdA_1]+(A_2-Y)\odot db_2\right)I$$
Next, use the trace tricks to move $dW$ to the far right. Since $dL$ is a scalar, it equals its own trace:
$$dL=\mathrm{tr}(dL)=\mathrm{tr}\left(((A_2-Y)\odot dZ_2)I\right)$$
Because $(A_2-Y)\odot dZ_2$ has the same size as $I^T$, we have
$$dL=\mathrm{tr}(dL)=\mathrm{tr}\left(((A_2-Y)\odot dZ_2)I\right)=\mathrm{tr}\left(I((A_2-Y)\odot dZ_2)\right)=\mathrm{tr}\left((I^T)^T((A_2-Y)\odot dZ_2)\right)$$
Applying the rule $\mathrm{tr}(A^T(B\odot C))=\mathrm{tr}((A\odot B)^TC)$ gives
$$dL=\mathrm{tr}\left((I^T)^T((A_2-Y)\odot dZ_2)\right)=\mathrm{tr}\left([I^T\odot(A_2-Y)]^TdZ_2\right)$$
Comparing this with $dL=\mathrm{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^TdZ_2\right)$ yields
$$\frac{\partial L}{\partial Z_2}=I^T\odot(A_2-Y)=A_2-Y$$
Therefore, with $dZ_2$ now denoting the gradient $\frac{\partial L}{\partial Z_2}$ as it would in code,
$$dZ_2=A_2-Y$$
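The result $dZ_2=A_2-Y$ can be checked against finite differences on a small random network. Everything concrete in this sketch is an assumption made for illustration: the layer sizes, the random data, the zero biases, and the helper `loss_from_Z2`, which uses the summed (not averaged) cross-entropy to match the all-ones vector $I$ in the matrix-form loss.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_from_Z2(Z_2, Y):
    """Summed binary cross-entropy as a function of the output pre-activation."""
    A_2 = sigmoid(Z_2)
    return -np.sum(Y * np.log(A_2) + (1 - Y) * np.log(1 - A_2))


rng = np.random.default_rng(0)
N0, N1, N2, M = 3, 4, 1, 5            # layer sizes and batch size (made up)
X = rng.normal(size=(N0, M))
Y = rng.integers(0, 2, size=(N2, M)).astype(float)
W_1, b_1 = rng.normal(size=(N0, N1)), np.zeros((N1, 1))
W_2, b_2 = rng.normal(size=(N1, N2)), np.zeros((N2, 1))

# Forward pass, following the code at the start of this section.
Z_1 = np.dot(W_1.T, X) + b_1
A_1 = sigmoid(Z_1)
Z_2 = np.dot(W_2.T, A_1) + b_2
A_2 = sigmoid(Z_2)

dZ_2 = A_2 - Y                        # the result derived above

# Finite-difference gradient of L with respect to each entry of Z_2.
eps = 1e-6
num_dZ_2 = np.zeros_like(Z_2)
for idx in np.ndindex(Z_2.shape):
    step = np.zeros_like(Z_2)
    step[idx] = eps
    num_dZ_2[idx] = (loss_from_Z2(Z_2 + step, Y)
                     - loss_from_Z2(Z_2 - step, Y)) / (2 * eps)

print(np.allclose(dZ_2, num_dZ_2, atol=1e-6))
```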
To be continued…