Deep Learning
Basic
- Neural networks:
  - Supervised learning: each input x is paired with one label y.
  - Sigmoid: an activation function, $sigmoid(x)=\frac{1}{1+e^{-x}}$
  - ReLU: rectified linear unit.
Logistic Regression
Used for binary classification: given an input x, predict a label y ∈ {0, 1}.
Notation:
$$
\begin{aligned}
&(x,y),\quad x\in\mathbb{R}^{n_x},\quad y\in\{0,1\}\\
&m=m_{train}\ \text{training examples},\quad m_{test}\ \text{test examples}\\
&\text{training set: }\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}\\
&X=\left[\begin{matrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{matrix}\right]\in\mathbb{R}^{n_x\times m}\\
&\hat{y}=P(y=1\mid x),\quad \hat{y}=\sigma(w^Tx+b),\quad w\in\mathbb{R}^{n_x},\quad b\in\mathbb{R}\\
&\sigma(z)=\frac{1}{1+e^{-z}}
\end{aligned}
$$
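As a minimal sketch of the notation above, assuming NumPy, the prediction $\hat{y}=\sigma(w^Tx+b)$ can be computed for every column of $X$ at once. The sizes `n_x = 3`, `m = 4` and the helper name `predict_proba` are illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """Return y_hat = sigmoid(w^T X + b), one probability per column of X."""
    return sigmoid(np.dot(w.T, X) + b)

n_x, m = 3, 4                    # illustrative sizes (assumption)
X = np.ones((n_x, m))            # each column is one example x^(i)
w = np.zeros((n_x, 1))           # weights as a column vector
b = 0.0                          # scalar bias
y_hat = predict_proba(w, b, X)   # shape (1, m); all entries are sigmoid(0)
```

With all-zero weights every prediction is 0.5, which is why training has to move `w` and `b` away from this starting point.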
Loss function
Single example. Squared error is one candidate loss, but logistic regression uses the cross-entropy loss, which follows from maximizing the log-likelihood $\log p(y\mid x)$:
$$
\begin{aligned}
&\text{Squared error: }\mathcal{L}(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2\\
&p(y\mid x)=\hat{y}^{y}(1-\hat{y})^{(1-y)}\\
&\min\ \text{cost}\rightarrow\max\ \log p(y\mid x)\\
&\mathcal{L}(\hat{y},y)=-(y\log(\hat{y})+(1-y)\log(1-\hat{y}))\\
&y=1:\ \mathcal{L}(\hat{y},y)=-\log\hat{y},\quad\text{want }\log\hat{y}\text{ larger}\Rightarrow\hat{y}\text{ larger}\\
&y=0:\ \mathcal{L}(\hat{y},y)=-\log(1-\hat{y}),\quad\text{want }\log(1-\hat{y})\text{ larger}\Rightarrow\hat{y}\text{ smaller}
\end{aligned}
$$
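To make the two cases concrete, here is a small sketch of the cross-entropy loss for one example in plain Python. The function name `cross_entropy` and the probabilities 0.9 and 0.1 are illustrative assumptions.

```python
import math

def cross_entropy(y_hat, y):
    """Loss L(y_hat, y) = -(y log y_hat + (1 - y) log(1 - y_hat)) for one example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label y = 1: a prediction close to 1 is cheap, a prediction close to 0 is costly.
loss_good = cross_entropy(0.9, 1)   # equals -log(0.9)
loss_bad = cross_entropy(0.1, 1)    # equals -log(0.1)
```

The loss is only defined for `0 < y_hat < 1`; a sigmoid output satisfies this in exact arithmetic, though practical code often clips `y_hat` to stay clear of `log(0)`.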
cost function
$$
\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})
$$
Gradient Descent
Find w, b that minimize J(w, b).
Repeat:
$$
\begin{aligned}
w&:=w-\alpha\frac{\partial\mathcal{J}(w,b)}{\partial w}\quad(\text{the derivative is written } dw)\\
b&:=b-\alpha\frac{\partial\mathcal{J}(w,b)}{\partial b}\quad(\text{the derivative is written } db)
\end{aligned}
$$
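The update rule can be tried on a one-dimensional toy problem. The sketch below assumes a hypothetical convex cost $J(w)=(w-3)^2$, which is not the logistic regression cost; it only illustrates the repeated step `w := w - alpha * dw`.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeat w := w - alpha * dw, where dw = grad(w)."""
    w = w0
    for _ in range(steps):
        dw = grad(w)          # derivative of the cost at the current w
        w = w - alpha * dw    # step against the gradient
    return w

# Hypothetical cost J(w) = (w - 3)^2 with derivative dJ/dw = 2 (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, alpha=0.1, steps=100)
```

Here `w_min` ends up very close to 3, the minimizer of the toy cost. A learning rate that is too large would make the iterates diverge instead.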
Computation Graph
example:
$$
J=3(a+bc)
$$
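A computation graph evaluates $J$ left to right and its derivatives right to left. The sketch below walks through both passes for this example; the intermediate names `u` and `v` and the input values are assumptions chosen for illustration.

```python
def forward_backward(a, b, c):
    """Forward and backward pass through the graph of J = 3 (a + b c)."""
    # Forward pass: one node per elementary operation.
    u = b * c
    v = a + u
    J = 3 * v
    # Backward pass: chain rule, starting from the output.
    dv = 3.0          # dJ/dv
    da = dv * 1.0     # dJ/da = dJ/dv * dv/da
    du = dv * 1.0     # dJ/du = dJ/dv * dv/du
    db = du * c       # dJ/db = dJ/du * du/db
    dc = du * b       # dJ/dc = dJ/du * du/dc
    return J, da, db, dc

J, da, db, dc = forward_backward(a=5.0, b=3.0, c=2.0)
```

Each backward line reuses the derivative computed just before it, which is the same pattern the logistic regression derivatives follow.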
Gradient descent on one example, using the computation graph:
recap:
$$
\begin{aligned}
&z=w^Tx+b\\
&\hat{y}=a=\sigma(z)=\frac{1}{1+e^{-z}}\\
&\mathcal{L}(a,y)=-(y\log(a)+(1-y)\log(1-a))
\end{aligned}
$$
The graph:
$$
\begin{aligned}
&'da'=\frac{d\mathcal{L}(a,y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a}\\
&'dz'=\frac{d\mathcal{L}(a,y)}{dz}=\frac{d\mathcal{L}}{da}\cdot\frac{da}{dz}=a-y\\
&'dw_1'=x_1\cdot dz,\ \dots\\
&w_1:=w_1-\alpha\,dw_1,\ \dots
\end{aligned}
$$
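The result $dz=a-y$ can be sanity-checked numerically. The sketch below, in plain Python with two features, compares the analytic $dw_1=x_1\,dz$ against a central finite difference of the loss. The parameter values and the names `loss`, `gradients` and `eps` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Cross-entropy loss of one example with two features."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def gradients(w1, w2, b, x1, x2, y):
    """Analytic derivatives: dz = a - y, dw_j = x_j * dz, db = dz."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y
    return x1 * dz, x2 * dz, dz

w1, w2, b = 0.5, -0.3, 0.1      # illustrative parameters (assumption)
x1, x2, y = 1.0, 2.0, 1         # one illustrative training example
dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)

# Central finite difference of the loss with respect to w1.
eps = 1e-6
dw1_numeric = (loss(w1 + eps, w2, b, x1, x2, y)
               - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
```

The two values of the derivative agree to many decimal places, which is a cheap way to catch sign or index mistakes in a hand-derived gradient.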
Gradient descent on m examples, using the computation graph:
recap:
$$
\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(a^{(i)},y^{(i)})
$$
The graph (shown for two features, $w_1$ and $w_2$):
$$
\begin{aligned}
&\frac{\partial}{\partial w_1}\mathcal{J}(w,b)=\frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial w_1}\mathcal{L}(a^{(i)},y^{(i)})\\
&\mathcal{J}=0;\ dw_1=0;\ dw_2=0;\ db=0\\
&\text{For } i=1 \text{ to } m:\ \{\\
&\quad a^{(i)}=\sigma(w^Tx^{(i)}+b)\\
&\quad \mathcal{J}+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]\\
&\quad dz^{(i)}=a^{(i)}-y^{(i)}\\
&\quad dw_1+=x_1^{(i)}dz^{(i)}\\
&\quad dw_2+=x_2^{(i)}dz^{(i)}\\
&\quad db+=dz^{(i)}\\
&\}\\
&\mathcal{J}/=m;\ dw_1/=m;\ dw_2/=m;\ db/=m\\
&dw_1=\frac{\partial\mathcal{J}}{\partial w_1}\\
&w_1:=w_1-\alpha\,dw_1
\end{aligned}
$$
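The loop above translates almost line by line into plain Python. This is a sketch for two features; the function name `cost_and_gradients`, the tiny data set and the learning rate are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradients(w1, w2, b, xs, ys):
    """One pass over m examples: accumulate J, dw1, dw2, db, then average."""
    m = len(xs)
    J = dw1 = dw2 = db = 0.0
    for (x1, x2), y in zip(xs, ys):
        a = sigmoid(w1 * x1 + w2 * x2 + b)
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        dz = a - y
        dw1 += x1 * dz
        dw2 += x2 * dz
        db += dz
    return J / m, dw1 / m, dw2 / m, db / m

xs = [(1.0, 2.0), (3.0, 4.0)]   # two illustrative examples (assumption)
ys = [1, 0]
J, dw1, dw2, db = cost_and_gradients(0.0, 0.0, 0.0, xs, ys)

alpha = 0.1                      # illustrative learning rate
w1_new = 0.0 - alpha * dw1       # one gradient descent step on w1
```

The explicit loop over examples, plus one accumulator per feature, is what vectorization removes.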
Vectorization
vectorized
`z = np.dot(w.T, x) + b`
logistic regression derivatives:
change:
$$
\begin{aligned}
&dw_1=0,\ dw_2=0\rightarrow dw=\texttt{np.zeros}((n_x,1))\\
&\begin{cases}dw_1+=x_1^{(i)}dz^{(i)}\\ dw_2+=x_2^{(i)}dz^{(i)}\end{cases}\rightarrow dw+=x^{(i)}dz^{(i)}\\
&Z=\left(\;\begin{matrix} z^{(1)} & z^{(2)} & \dots & z^{(m)}\end{matrix}\;\right)=w^TX+b\\
&A=\sigma(Z)\\
&dZ=A-Y=\left(\;\begin{matrix} a^{(1)}-y^{(1)} & a^{(2)}-y^{(2)} & \dots & a^{(m)}-y^{(m)}\end{matrix}\;\right)\\
&db=\frac{1}{m}\sum_{i=1}^m dz^{(i)}=\frac{1}{m}\texttt{np.sum}(dZ)\\
&dw=\frac{1}{m}X\,dZ^T=\frac{1}{m}\left(x^{(1)}dz^{(1)}+x^{(2)}dz^{(2)}+\dots+x^{(m)}dz^{(m)}\right)
\end{aligned}
$$
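To check that the vectorized formulas compute the same quantities as the per-example loop, the sketch below implements both with NumPy and compares them on a tiny data set. The function names and the data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_loop(w, b, X, Y):
    """Per-example accumulation: dw += x^(i) dz^(i), db += dz^(i)."""
    n_x, m = X.shape
    dw = np.zeros((n_x, 1))
    db = 0.0
    for i in range(m):
        x_i = X[:, i:i + 1]                           # column vector, shape (n_x, 1)
        a_i = sigmoid(np.dot(w.T, x_i) + b)           # shape (1, 1)
        dz_i = (a_i - Y[0, i]).item()                 # scalar a^(i) - y^(i)
        dw += x_i * dz_i
        db += dz_i
    return dw / m, db / m

def gradients_vectorized(w, b, X, Y):
    """Same quantities without an explicit loop over examples."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                   # shape (1, m)
    dZ = A - Y                                        # shape (1, m)
    dw = np.dot(X, dZ.T) / m                          # shape (n_x, 1)
    db = np.sum(dZ) / m                               # scalar
    return dw, db

X = np.array([[1.0, 3.0],
              [2.0, 4.0]])       # columns are the examples (1, 2) and (3, 4)
Y = np.array([[1, 0]])
w = np.zeros((2, 1))
b = 0.0
dw_loop, db_loop = gradients_loop(w, b, X, Y)
dw_vec, db_vec = gradients_vectorized(w, b, X, Y)
```

Both versions return the same `dw` and `db`; the vectorized one replaces the Python loop with a single matrix product.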
Implementing:
$$
\begin{aligned}
Z&=w^TX+b=\texttt{np.dot}(w^T,X)+b\\
A&=\sigma(Z)\\
J&=-\frac{1}{m}\sum_{i=1}^m\left(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right)\\
dZ&=A-Y\\
dw&=\frac{1}{m}X\,dZ^T\\
db&=\frac{1}{m}\texttt{np.sum}(dZ)\\
w&:=w-\alpha\,dw\\
b&:=b-\alpha\,db
\end{aligned}
$$
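Put together, the steps above form one training iteration. The following is a minimal sketch of a full training loop with NumPy; the function name `train`, the one-feature data set, the learning rate and the iteration count are assumptions chosen so the example stays small.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.5, iterations=200):
    """Vectorized logistic regression trained with batch gradient descent."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    costs = []
    for _ in range(iterations):
        A = sigmoid(np.dot(w.T, X) + b)                        # forward pass
        J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # cost
        dZ = A - Y                                             # backward pass
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        w = w - alpha * dw                                     # parameter update
        b = b - alpha * db
        costs.append(float(J))
    return w, b, costs

# Illustrative data: one feature, negative inputs labeled 0, positive labeled 1.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b, costs = train(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
```

There is still one explicit loop, over iterations, and that one cannot be vectorized away because each step depends on the previous parameters.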
broadcasting
`np.dot(w.T, X) + b`

Here b is a scalar; NumPy broadcasts it across the (1, m) row vector produced by `np.dot(w.T, X)`.
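A short sketch of what broadcasting does in this expression, assuming small hand-picked arrays (all values below are illustrative): a scalar is added to every entry, and a column vector is stretched across columns.

```python
import numpy as np

w = np.array([[1.0], [2.0]])            # shape (2, 1)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])         # shape (2, 3): three examples as columns
b = 0.5                                 # scalar bias

Z = np.dot(w.T, X) + b                  # (1, 3) + scalar: b is added to each entry

# The same rule stretches a (2, 1) column across the 3 columns of X.
B = np.array([[10.0], [20.0]])
shifted = X + B                         # (2, 3) + (2, 1) -> (2, 3)
```

Broadcasting is convenient but can hide shape bugs, so it helps to check shapes explicitly when a result looks wrong.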
A note on numpy
- `a = np.random.randn(5)` creates a rank-1 array of shape `(5,)`, which is neither a row vector nor a column vector. Avoid it, or convert it with `a = a.reshape(5, 1)`.
- `assert a.shape == (5, 1)` checks the shape explicitly.
- `a = np.random.randn(5, 1)` creates a column vector directly.
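The sketch below shows why rank-1 arrays are easy to misuse: transposing one does nothing, so what looks like an outer product silently becomes a scalar. Variable names are illustrative.

```python
import numpy as np

a = np.random.randn(5)           # rank-1 array, shape (5,)
inner = np.dot(a, a.T)           # a.T has the same shape (5,), so this is a scalar

col = a.reshape(5, 1)            # explicit column vector
assert col.shape == (5, 1)       # cheap shape check, as recommended in the notes
outer = np.dot(col, col.T)       # (5, 1) x (1, 5): a genuine outer product

col2 = np.random.randn(5, 1)     # column vector from the start
```

Using explicit `(n, 1)` or `(1, n)` shapes keeps the code aligned with the matrix formulas in these notes.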
Shallow Neural Network
Representation
2 layer NN:
$$
\begin{aligned}
&\text{Input layer}\rightarrow\text{hidden layer}\rightarrow\text{output layer}\\
&a^{[0]}\rightarrow a^{[1]}\rightarrow a^{[2]}\\
&z^{[1]}=W^{[1]}a^{[0]}+b^{[1]}\\
&a^{[1]}=\sigma(z^{[1]})\\
&z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}\\
&a^{[2]}=\sigma(z^{[2]})=\hat y
\end{aligned}
$$
computing:
$$
\begin{aligned}
&z_i^{[1]}=w_i^{[1]T}x+b_i^{[1]}\\
&a_i^{[1]}=\sigma(z_i^{[1]})\\
&\left[\begin{matrix} w_1^{[1]T}\\w_2^{[1]T}\\w_3^{[1]T}\\w_4^{[1]T} \end{matrix}\right]\cdot\left[\begin{matrix} x_1\\x_2\\x_3 \end{matrix}\right]+\left[\begin{matrix} b_1^{[1]}\\b_2^{[1]}\\b_3^{[1]}\\b_4^{[1]} \end{matrix}\right]=\left[\begin{matrix} z_1^{[1]}\\z_2^{[1]}\\z_3^{[1]}\\z_4^{[1]} \end{matrix}\right]
\end{aligned}
$$
Vectorize:
$$
\begin{aligned}
&x^{(i)}\rightarrow a^{[2](i)}=\hat y^{(i)}\\
&Z^{[1]}=W^{[1]}X+b^{[1]}\\
&A^{[1]}=\sigma(Z^{[1]})\\
&Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}\\
&A^{[2]}=\sigma(Z^{[2]})\\
&W^{[1]}\cdot\left[\begin{matrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{matrix}\right]+b^{[1]}=\left[\begin{matrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{matrix}\right]=Z^{[1]}
\end{aligned}
$$
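The vectorized forward pass can be sketched in a few lines of NumPy. The layer sizes, the random inputs and the function name `forward` are assumptions for illustration; the sigmoid is used in both layers, as in the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass of a 2-layer network for all m examples at once."""
    Z1 = np.dot(W1, X) + b1      # (n1, m); b1 of shape (n1, 1) is broadcast
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2     # (n2, m)
    A2 = sigmoid(Z2)             # predictions y_hat, one per column
    return Z1, A1, Z2, A2

n0, n1, n2, m = 3, 4, 1, 5       # illustrative sizes (assumption)
rng = np.random.default_rng(0)
X = rng.standard_normal((n0, m))
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))
Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)

# Column i of the batched result equals the forward pass of example i alone.
single = forward(X[:, 0:1], W1, b1, W2, b2)[3]
```

Stacking examples as columns means each column of `A2` depends only on the matching column of `X`, which is what makes the batched form equivalent to looping over examples.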
Activation functions
$$
\begin{aligned}
&\text{sigmoid: } a=\frac{1}{1+e^{-z}},\quad a'=a(1-a)\\
&\text{tanh: } a=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}},\quad a\in(-1,1),\quad a'=1-a^2\\
&\text{ReLU: } a=\max(0,z)\\
&\text{Leaky ReLU: } a=\max(0.01z,z)
\end{aligned}
$$
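These four activations, and the two derivatives listed, are short to write with NumPy. In this sketch the function names and the `slope` parameter name are assumptions; the value 0.01 comes from the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1 - a)            # a' = a (1 - a)

def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2             # a' = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

z = np.array([-1.0, 0.0, 2.0])    # illustrative inputs
relu_out = relu(z)
leaky_out = leaky_relu(z)
```

The sigmoid derivative peaks at 0.25 while the tanh derivative peaks at 1, one reason tanh is often preferred over sigmoid in hidden layers.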
Gradient descent
computation
$$
\begin{aligned}
&z^{[1]}=W^{[1]}x+b^{[1]}\rightarrow a^{[1]}=\sigma(z^{[1]})\rightarrow z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}\rightarrow a^{[2]}=\sigma(z^{[2]})\rightarrow\mathcal{L}(a^{[2]},y)\\
&dz^{[2]}=a^{[2]}-y\\
&dW^{[2]}=dz^{[2]}a^{[1]T}\\
&db^{[2]}=dz^{[2]}\\
&dz^{[1]}=W^{[2]T}dz^{[2]}*a'^{[1]}\quad(*\text{ is the elementwise product},\ a'^{[1]}=\sigma'(z^{[1]}))\\
&dW^{[1]}=dz^{[1]}\cdot x^T\\
&db^{[1]}=dz^{[1]}
\end{aligned}
$$
The derivation of $dz^{[1]}$ involves matrix calculus.
the dimension
$$
\begin{aligned}
&x:(n_0,m),\quad W^{[1]}:(n_1,n_0)\rightarrow a^{[1]}:(n_1,m)\\
&W^{[2]}:(n_2,n_1)\rightarrow a^{[2]}:(n_2,m)
\end{aligned}
$$
vectorize
$$
\begin{aligned}
&dZ^{[2]}=A^{[2]}-Y\\
&dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
&db^{[2]}=\frac{1}{m}\texttt{np.sum}(dZ^{[2]},\ \texttt{axis}=1,\ \texttt{keepdims}=\texttt{True})\\
&dZ^{[1]}=W^{[2]T}dZ^{[2]}*A'^{[1]}\\
&dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T\\
&db^{[1]}=\frac{1}{m}\texttt{np.sum}(dZ^{[1]},\ \texttt{axis}=1,\ \texttt{keepdims}=\texttt{True})
\end{aligned}
$$
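The vectorized gradients can be verified with a finite-difference check on a single weight. The sketch below assumes sigmoid activations in both layers, so that $A'^{[1]}=A^{[1]}(1-A^{[1]})$; the layer sizes, the random data and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2, X):
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def cost(W1, b1, W2, b2, X, Y):
    """Average cross-entropy over the m examples."""
    _, A2 = forward(W1, b1, W2, b2, X)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(W1, b1, W2, b2, X, Y):
    """Vectorized gradients of the cost for a 2-layer sigmoid network."""
    m = X.shape[1]
    A1, A2 = forward(W1, b1, W2, b2, X)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))    # elementwise product with sigma'(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))              # n0 = 3 features, m = 5 examples
Y = np.array([[0, 1, 1, 0, 1]])
W1, b1 = rng.standard_normal((4, 3)) * 0.5, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1))
dW1, db1, dW2, db2 = backward(W1, b1, W2, b2, X, Y)

# Central finite difference of the cost with respect to W1[0, 0].
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (cost(W1_plus, b1, W2, b2, X, Y)
           - cost(W1_minus, b1, W2, b2, X, Y)) / (2 * eps)
```

`keepdims=True` keeps `db1` and `db2` as `(n, 1)` columns, so they have the same shapes as `b1` and `b2` in the update step.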
Random Initialization
$$
\begin{aligned}
W^{[1]}&=\texttt{np.random.randn}(2,2)*0.01\\
b^{[1]}&=\texttt{np.zeros}((2,1))
\end{aligned}
$$
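A sketch of this initialization for a whole 2-layer network, assuming NumPy. The function name `initialize`, the `scale` and `seed` parameters and the layer sizes are illustrative. Small random weights break the symmetry between hidden units; with all-zero weights every hidden unit would compute the same function.

```python
import numpy as np

def initialize(n0, n1, n2, scale=0.01, seed=0):
    """Small random weights and zero biases for a 2-layer network."""
    np.random.seed(seed)                       # fixed seed only for reproducibility
    W1 = np.random.randn(n1, n0) * scale       # randn takes the sizes as separate ints
    b1 = np.zeros((n1, 1))                     # zeros takes a shape tuple
    W2 = np.random.randn(n2, n1) * scale
    b2 = np.zeros((n2, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize(n0=2, n1=2, n2=1)
```

The biases can start at zero because the random weights already make the hidden units differ.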
Deep neural network
notation
$$
\begin{aligned}
&\text{Example: an } L\text{-layer NN}\\
&a^{[l]}\rightarrow\text{activations of layer } l\\
&w^{[l]}\rightarrow\text{weights for } z^{[l]}\\
&\hat y=a^{[L]}
\end{aligned}
$$
Forward propagation
$$
\begin{aligned}
&\text{for } l=1,2,3,\dots\\
&\quad z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}\\
&\quad \text{cache } z^{[l]},w^{[l]},b^{[l]}\\
&\quad a^{[l]}=g^{[l]}(z^{[l]})
\end{aligned}
$$
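A minimal sketch of this loop for an arbitrary number of layers, assuming NumPy. The function name `forward_propagation`, the layer sizes and the choice of ReLU for hidden layers are assumptions. The cache here also stores $a^{[l-1]}$, which is an addition to the list in the notes, because the backward pass needs it to form $dw^{[l]}$.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters, activations):
    """parameters: list of (W, b) per layer; activations: list of g per layer."""
    A = X                                      # a^[0] is the input
    caches = []
    for (W, b), g in zip(parameters, activations):
        A_prev = A
        Z = np.dot(W, A_prev) + b              # z^[l] = w^[l] a^[l-1] + b^[l]
        A = g(Z)                               # a^[l] = g^[l](z^[l])
        caches.append((A_prev, Z, W, b))       # saved for the backward pass
    return A, caches

layer_sizes = [3, 4, 2, 1]                     # n[0..3], illustrative (assumption)
m = 5
rng = np.random.default_rng(0)
parameters = [(rng.standard_normal((n_l, n_prev)) * 0.01, np.zeros((n_l, 1)))
              for n_prev, n_l in zip(layer_sizes[:-1], layer_sizes[1:])]
activations = [relu, relu, sigmoid]
X = rng.standard_normal((layer_sizes[0], m))
y_hat, caches = forward_propagation(X, parameters, activations)
```

The loop over layers stays explicit; only the loop over examples is removed by vectorization.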
Backward propagation
$$
\begin{aligned}
&da^{[l]}\rightarrow da^{[l-1]}\quad(\text{also producing } dz^{[l]},dw^{[l]},db^{[l]})\\
&dz^{[l]}=da^{[l]}*g^{[l]'}(z^{[l]})=w^{[l+1]T}dz^{[l+1]}*g^{[l]'}(z^{[l]})\\
&dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}\\
&db^{[l]}=dz^{[l]}\\
&da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}
\end{aligned}
$$
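One backward step for a single layer can be sketched as follows, assuming NumPy and a batch of m examples. The averaging by `1/m` follows the vectorized two-layer formulas earlier in these notes; the function names and the all-ones test values are illustrative assumptions.

```python
import numpy as np

def relu_prime(Z):
    """Derivative of ReLU: 1 where Z > 0, else 0."""
    return (Z > 0).astype(float)

def layer_backward(dA, Z, A_prev, W, g_prime):
    """Given da^[l], return da^[l-1], dw^[l], db^[l] for one layer."""
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                           # elementwise product
    dW = np.dot(dZ, A_prev.T) / m                  # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m     # same shape as b
    dA_prev = np.dot(W.T, dZ)                      # passed on to layer l-1
    return dA_prev, dW, db

# Illustrative layer: 3 inputs, 2 units, 4 examples, all values set to one.
W = np.ones((2, 3))
b = np.zeros((2, 1))
A_prev = np.ones((3, 4))
Z = np.dot(W, A_prev) + b                          # every entry equals 3
dA = np.ones((2, 4))
dA_prev, dW, db = layer_backward(dA, Z, A_prev, W, relu_prime)
```

Calling this function from the last layer down to the first, feeding each `dA_prev` into the next call, gives the full backward pass.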
matrix dimensions
$$
\begin{aligned}
&dw^{[l]},\ w^{[l]}:(n^{[l]},n^{[l-1]})\\
&db^{[l]},\ b^{[l]}:(n^{[l]},1)\\
&Z^{[l]},\ A^{[l]}:(n^{[l]},m)
\end{aligned}
$$