Lecture 5: Backpropagation
From a one-layer NN to a multi-layer NN
The two-layer case:
$$
\begin{aligned}
x &= z^{(1)} = a^{(1)} \\
z^{(2)} &= W^{(1)}x + b^{(1)} \\
a^{(2)} &= f(z^{(2)}) \\
z^{(3)} &= W^{(2)}a^{(2)} + b^{(2)} \\
a^{(3)} &= f(z^{(3)}) \\
s &= U^T a^{(3)}
\end{aligned}
$$
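The forward pass above can be sketched in a few lines of plain Python. This is only an illustration: the notes do not fix the activation $f$, so the sigmoid is an assumption, and the helper names (`matvec`, `forward`) and all numeric values are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(t):
    # Assumed activation f; the notes leave f unspecified.
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, W1, b1, W2, b2, U, f=sigmoid):
    """Return every intermediate quantity of the two-layer network."""
    a1 = list(x)                                       # x = z1 = a1
    z2 = [t + b for t, b in zip(matvec(W1, a1), b1)]   # z2 = W1 x + b1
    a2 = [f(t) for t in z2]                            # a2 = f(z2)
    z3 = [t + b for t, b in zip(matvec(W2, a2), b2)]   # z3 = W2 a2 + b2
    a3 = [f(t) for t in z3]                            # a3 = f(z3)
    s = sum(u * a for u, a in zip(U, a3))              # s = U^T a3
    return {"a1": a1, "z2": z2, "a2": a2, "z3": z3, "a3": a3, "s": s}

# Hypothetical toy parameters: 2 inputs, 3 hidden units, 2 top units.
x = [1.0, -2.0]
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.5, 0.9], [0.6, 0.4, -0.2]]
b2 = [0.05, -0.05]
U = [1.5, -2.0]

cache = forward(x, W1, b1, W2, b2, U)
print(cache["s"])
```

Keeping every $z^{(l)}$ and $a^{(l)}$ in a cache is a common design choice, since the backward pass reuses exactly these values.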
For $W^{(2)}$:

$$
\frac{\partial s}{\partial W_{ij}^{(2)}} = \delta_i^{(3)} a_j^{(2)}
$$
In matrix notation
$$
\begin{aligned}
\frac{\partial s}{\partial W^{(2)}} &= \delta^{(3)} (a^{(2)})^T \\
\delta^{(3)} &= U \odot f'(z^{(3)})
\end{aligned}
$$
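As a sanity check, the outer-product formula for $\partial s / \partial W^{(2)}$ can be compared with a central finite difference. The sketch below assumes a sigmoid activation and uses hypothetical values for $a^{(2)}$, $W^{(2)}$, $b^{(2)}$ and $U$; none of these come from the lecture.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def score(a2, W2, b2, U):
    """Return s = U^T f(W2 a2 + b2) together with z3."""
    z3 = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    s = sum(u * sigmoid(t) for u, t in zip(U, z3))
    return s, z3

def grad_W2(a2, W2, b2, U):
    """Analytic gradient: delta3 = U * f'(z3), then outer product with a2."""
    _, z3 = score(a2, W2, b2, U)
    delta3 = [u * dsigmoid(t) for u, t in zip(U, z3)]
    return [[d * a for a in a2] for d in delta3]

# Hypothetical toy values.
a2 = [0.2, 0.7, 0.5]
W2 = [[0.3, -0.5, 0.9], [0.6, 0.4, -0.2]]
b2 = [0.05, -0.05]
U = [1.5, -2.0]

analytic = grad_W2(a2, W2, b2, U)

def numeric(i, j, eps=1e-6):
    """Central finite difference with respect to W2[i][j]."""
    Wp = [row[:] for row in W2]
    Wm = [row[:] for row in W2]
    Wp[i][j] += eps
    Wm[i][j] -= eps
    return (score(a2, Wp, b2, U)[0] - score(a2, Wm, b2, U)[0]) / (2 * eps)

max_err = max(abs(analytic[i][j] - numeric(i, j))
              for i in range(len(W2)) for j in range(len(a2)))
print(max_err)  # should be tiny if the formula is right
```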
Then we need to calculate $\frac{\partial s}{\partial W^{(1)}}$.
Expand $s$ so that a single entry $W^{(1)}_{ij}$ is visible. Here $C$ collects the terms that do not depend on $W^{(1)}_{ij}$:

$$
\begin{aligned}
s &= U^T f\left(W^{(2)} f(W^{(1)}x + b^{(1)}) + b^{(2)}\right) \\
&= U^T f\left(W^{(2)} f\left([\dots;\ W^{(1)}_{ij}x_j + C + b^{(1)}_i;\ \dots]\right) + b^{(2)}\right) \\
&= U^T f\left(W^{(2)}
\begin{bmatrix}
\vdots \\
f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) \\
\vdots
\end{bmatrix}
+ b^{(2)}\right) \\
&= U^T f\left(
\begin{bmatrix}
W_{1i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C \\
W_{2i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C \\
\vdots \\
W_{ni}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C
\end{bmatrix}
+ b^{(2)}\right) \\
&= U^T f\left(
\begin{bmatrix}
W_{1i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_1 \\
W_{2i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_2 \\
\vdots \\
W_{ni}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_n
\end{bmatrix}
\right) \\
&= U^T
\begin{bmatrix}
f\left(W_{1i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_1\right) \\
f\left(W_{2i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_2\right) \\
\vdots \\
f\left(W_{ni}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_n\right)
\end{bmatrix} \\
&= \sum_k U_k f\left(W_{ki}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_k\right)
\end{aligned}
$$
Thus,
$$
\begin{aligned}
\frac{\partial s}{\partial W^{(1)}_{ij}}
&= \sum_k U_k f'(z^{(3)}_k)\, W^{(2)}_{ki}\, f'(z^{(2)}_i)\, x_j \\
&= \left[(W^{(2)}_{\cdot i})^T \left(U \odot f'(z^{(3)})\right)\right] f'(z^{(2)}_i)\, x_j \\
\frac{\partial s}{\partial W^{(1)}_{i\cdot}}
&= \left[(W^{(2)}_{\cdot i})^T \left(U \odot f'(z^{(3)})\right)\right] f'(z^{(2)}_i)\, x^T \\
\frac{\partial s}{\partial W^{(1)}}
&= \left[\left((W^{(2)})^T \left(U \odot f'(z^{(3)})\right)\right) \odot f'(z^{(2)})\right] x^T \\
&= \left[\left((W^{(2)})^T \delta^{(3)}\right) \odot f'(z^{(2)})\right] x^T
\end{aligned}
$$
Note: I am not sure my own derivation is correct, but it matches the result in the course slides. Corrections from readers are welcome.
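One way to gain confidence in the derivation is to evaluate the index form $\sum_k U_k f'(z^{(3)}_k) W^{(2)}_{ki} f'(z^{(2)}_i) x_j$ and the matrix form $[((W^{(2)})^T \delta^{(3)}) \odot f'(z^{(2)})]\, x^T$ side by side and compare both with a finite difference. The sketch below assumes a sigmoid activation; all parameter values are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    s = sigmoid(t)
    return s * (1.0 - s)

# Hypothetical toy values (not from the lecture).
x = [1.0, -2.0]
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.5, 0.9], [0.6, 0.4, -0.2]]
b2 = [0.05, -0.05]
U = [1.5, -2.0]

def score(W1_):
    """Forward pass as a function of the first-layer weights only."""
    z2 = [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W1_, b1)]
    a2 = [sigmoid(t) for t in z2]
    z3 = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    s = sum(u * sigmoid(t) for u, t in zip(U, z3))
    return s, z2, z3

_, z2, z3 = score(W1)
n_out, n_hid, n_in = len(U), len(z2), len(x)

# Index form: sum over k of U_k f'(z3_k) W2_ki, times f'(z2_i) x_j.
index_form = [[sum(U[k] * dsigmoid(z3[k]) * W2[k][i] for k in range(n_out))
               * dsigmoid(z2[i]) * x[j]
               for j in range(n_in)] for i in range(n_hid)]

# Matrix form: delta2 = (W2^T delta3) * f'(z2), gradient = delta2 x^T.
delta3 = [U[k] * dsigmoid(z3[k]) for k in range(n_out)]
delta2 = [sum(W2[k][i] * delta3[k] for k in range(n_out)) * dsigmoid(z2[i])
          for i in range(n_hid)]
matrix_form = [[d * xj for xj in x] for d in delta2]

def numeric(i, j, eps=1e-6):
    """Central finite difference with respect to W1[i][j]."""
    Wp = [row[:] for row in W1]
    Wm = [row[:] for row in W1]
    Wp[i][j] += eps
    Wm[i][j] -= eps
    return (score(Wp)[0] - score(Wm)[0]) / (2 * eps)

cells = [(i, j) for i in range(n_hid) for j in range(n_in)]
form_gap = max(abs(index_form[i][j] - matrix_form[i][j]) for i, j in cells)
fd_gap = max(abs(matrix_form[i][j] - numeric(i, j)) for i, j in cells)
print(form_gap, fd_gap)  # both should be close to zero
```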
In general, for layer $l$:

$$
\begin{aligned}
\delta^{(l)} &= \left((W^{(l)})^T \delta^{(l+1)}\right) \odot f'(z^{(l)}) \\
\frac{\partial}{\partial W^{(l)}} E_R &= \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}
\end{aligned}
$$
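The recursion can be turned into a loop over layers. The following is a minimal sketch, not a reference implementation: it assumes a sigmoid activation and assumes $E_R = s + \frac{\lambda}{2}\sum_l \lVert W^{(l)} \rVert_F^2$, chosen so that the regularizer contributes exactly $\lambda W^{(l)}$ to the gradient. Lists are 0-indexed, so `Ws[l]` maps `a[l]` to `z[l + 1]`; the function names and numbers are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def forward(x, Ws, bs, U):
    """a[0] = x, z[l+1] = Ws[l] a[l] + bs[l], a[l+1] = f(z[l+1])."""
    a, z = [list(x)], [None]
    for W, b in zip(Ws, bs):
        z.append([sum(w * v for w, v in zip(row, a[-1])) + bi
                  for row, bi in zip(W, b)])
        a.append([sigmoid(t) for t in z[-1]])
    s = sum(u * v for u, v in zip(U, a[-1]))
    return s, a, z

def objective(x, Ws, bs, U, lam):
    """Assumed regularized objective: s + (lam / 2) * sum of squared weights."""
    s, _, _ = forward(x, Ws, bs, U)
    return s + 0.5 * lam * sum(w * w for W in Ws for row in W for w in row)

def backward(x, Ws, bs, U, lam):
    """Gradients of the objective with respect to every weight matrix."""
    _, a, z = forward(x, Ws, bs, U)
    delta = [u * dsigmoid(t) for u, t in zip(U, z[-1])]   # top-layer delta
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        W = Ws[l]
        # delta^{(l+1)} (a^{(l)})^T + lambda W^{(l)}
        grads[l] = [[delta[i] * a[l][j] + lam * W[i][j]
                     for j in range(len(a[l]))] for i in range(len(delta))]
        if l > 0:
            # delta^{(l)} = (W^T delta^{(l+1)}) * f'(z^{(l)})
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(a[l]))]
            delta = [bk * dsigmoid(t) for bk, t in zip(back, z[l])]
    return grads

# Hypothetical toy network: 2 inputs, 3 hidden units, 2 top units.
x = [1.0, -2.0]
Ws = [[[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]],
      [[0.3, -0.5, 0.9], [0.6, 0.4, -0.2]]]
bs = [[0.0, 0.1, -0.1], [0.05, -0.05]]
U = [1.5, -2.0]
lam = 0.1

grads = backward(x, Ws, bs, U, lam)

def numeric(l, i, j, eps=1e-6):
    """Central finite difference with respect to Ws[l][i][j]."""
    Wp = [[row[:] for row in W] for W in Ws]
    Wm = [[row[:] for row in W] for W in Ws]
    Wp[l][i][j] += eps
    Wm[l][i][j] -= eps
    return (objective(x, Wp, bs, U, lam) - objective(x, Wm, bs, U, lam)) / (2 * eps)

max_err = max(abs(grads[l][i][j] - numeric(l, i, j))
              for l in range(len(Ws))
              for i in range(len(Ws[l]))
              for j in range(len(Ws[l][0])))
print(max_err)  # should be tiny
```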
BP
$$
f(x,y,z) = (x+y)z
$$
Define:
$$
\begin{aligned}
q &= x + y, & \frac{\partial q}{\partial x} &= 1, & \frac{\partial q}{\partial y} &= 1 \\
f &= qz, & \frac{\partial f}{\partial q} &= z, & \frac{\partial f}{\partial z} &= q
\end{aligned}
$$
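Then we can apply the chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = z$, and likewise $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial y} = z$.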
sequenceDiagram
x->>q: x
y->>q: y
q->>f: q = x + y
z->>f: z
Recursively apply the chain rule through each node.
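The same computation can be written as a forward pass followed by a backward pass that visits the nodes in reverse order. This is a small sketch; the function name `forward_backward` and the input values are chosen only for illustration.

```python
def forward_backward(x, y, z):
    """Evaluate f = (x + y) * z and its gradient via the chain rule."""
    # Forward pass: compute each node's value.
    q = x + y
    f = q * z

    # Backward pass: start at f and move back through the graph.
    df_dq = z              # f = q * z
    df_dz = q
    df_dx = df_dq * 1.0    # q = x + y, so dq/dx = 1
    df_dy = df_dq * 1.0    # dq/dy = 1
    return f, (df_dx, df_dy, df_dz)

f, grads = forward_backward(-2.0, 5.0, -4.0)
print(f, grads)  # -12.0 (-4.0, -4.0, 3.0)
```

Each node only needs its local derivative and the gradient arriving from above, which is what makes the recursion work for arbitrary graphs.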
Three other descriptions and viewpoints of BP:
- Functions as Circuits
- High-level flowgraph; move back through the graph.
- Delta error signals in real NN
Paper Reading
FastText:
- average word vector.
- hierarchical softmax
- We construct a Huffman tree based on word frequency. Each non-leaf node is a binary (two-class) classifier, and each leaf denotes a word.
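The two ideas above can be illustrated together. The sketch below builds a Huffman tree from word frequencies with `heapq`, averages some word vectors into a hidden vector, and computes each word's probability as a product of binary decisions along its path. It is a toy under stated assumptions, not FastText's actual code: the frequencies, the vectors, the sigmoid form of each node classifier, and all names (`build_huffman`, `word_prob`, `node_vecs`) are hypothetical.

```python
import heapq
import itertools
import math

def build_huffman(freqs):
    """Return {word: [(inner_node_id, bit), ...]}, the path from root to leaf."""
    tie = itertools.count()        # tie-breaker so the heap never compares nodes
    heap = [(f, next(tie), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    node_ids = itertools.count()
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, n2 = heapq.heappop(heap)
        inner = ("inner", next(node_ids), n1, n2)
        heapq.heappush(heap, (f1 + f2, next(tie), inner))
    paths = {}

    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, nid, left, right = node
            walk(left, path + [(nid, 0)])
            walk(right, path + [(nid, 1)])

    walk(heap[0][2], [])
    return paths

def word_prob(path, hidden, node_vecs):
    """Multiply the binary classifier outputs along the path to a word."""
    p = 1.0
    for nid, bit in path:
        score = sum(h * v for h, v in zip(hidden, node_vecs[nid]))
        p_right = 1.0 / (1.0 + math.exp(-score))
        p *= p_right if bit == 1 else 1.0 - p_right
    return p

# Hypothetical word frequencies.
freqs = {"the": 50, "of": 30, "cat": 10, "sat": 7, "zebra": 3}
paths = build_huffman(freqs)

# Hidden vector: average of (hypothetical) context word vectors.
context = [[0.2, -0.1], [0.4, 0.3], [-0.6, 0.5]]
hidden = [sum(col) / len(context) for col in zip(*context)]

# One hypothetical parameter vector per inner node (n - 1 inner nodes).
node_vecs = {nid: [0.1 * (nid + 1), -0.2 * nid] for nid in range(len(freqs) - 1)}

probs = {w: word_prob(p, hidden, node_vecs) for w, p in paths.items()}
total = sum(probs.values())
print({w: len(p) for w, p in paths.items()}, total)
```

Frequent words end up near the root, so their probabilities need fewer classifier evaluations, and the leaf probabilities sum to 1 by construction.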