CS224N notes_chapter5_Backpropagation

Lecture 5: Backpropagation

From a one-layer NN to a multi-layer NN

Two-layer case:
$$
\begin{aligned}
x &= z^{(1)} = a^{(1)} \\
z^{(2)} &= W^{(1)}x+b^{(1)} \\
a^{(2)} &= f(z^{(2)}) \\
z^{(3)} &= W^{(2)}a^{(2)} +b^{(2)} \\
a^{(3)} &= f(z^{(3)}) \\
s &= U^Ta^{(3)}
\end{aligned}
$$
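The forward pass defined by these equations can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture: the notes leave the activation $f$ unspecified, so the sigmoid used here is an assumption, and the helper names (`matvec`, `forward`) and the toy weights are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(t):
    # Assumed activation f; the notes do not fix a particular choice.
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, W1, b1, W2, b2, U, f=sigmoid):
    """Return the score s and the intermediate z/a values of the two-layer net."""
    z2 = [z + b for z, b in zip(matvec(W1, x), b1)]   # z2 = W1 x + b1
    a2 = [f(z) for z in z2]                           # a2 = f(z2)
    z3 = [z + b for z, b in zip(matvec(W2, a2), b2)]  # z3 = W2 a2 + b2
    a3 = [f(z) for z in z3]                           # a3 = f(z3)
    s = sum(u * a for u, a in zip(U, a3))             # s = U^T a3
    return s, {"z2": z2, "a2": a2, "z3": z3, "a3": a3}

# Hypothetical toy sizes: 2 inputs and 2 units in each hidden layer.
x = [1.0, -1.0]
W1 = [[0.5, -0.5], [0.25, 0.75]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]
U = [1.0, 1.0]
s, cache = forward(x, W1, b1, W2, b2, U)
```

Keeping the intermediate values in `cache` matters because the backward pass reuses $z^{(2)}$, $a^{(2)}$ and $z^{(3)}$.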
For $W^{(2)}$,
$$
\frac{\partial s}{\partial W_{ij}^{(2)}} = \delta_i^{(3)} a_j^{(2)}
$$
In matrix notation,
$$
\frac{\partial s}{\partial W^{(2)}} = \delta^{(3)} (a^{(2)})^T, \qquad \delta^{(3)}=U \odot f'(z^{(3)})
$$
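One way to sanity-check the formula $\partial s / \partial W^{(2)} = \delta^{(3)} (a^{(2)})^T$ is to compare it against a central finite difference. The sketch below does this for a single entry of $W^{(2)}$, holding $a^{(2)}$ fixed. It assumes a sigmoid activation, which the notes do not specify, and all numeric values and function names are hypothetical.

```python
import math

def sigmoid(t):
    # Assumed activation f.
    return 1.0 / (1.0 + math.exp(-t))

def score(W2, a2, b2, U):
    """s = U^T f(W2 a2 + b2) for a fixed hidden activation a2; also returns z3."""
    z3 = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    return sum(u * sigmoid(z) for u, z in zip(U, z3)), z3

def grad_W2(W2, a2, b2, U):
    """Analytic gradient: delta3 = U * f'(z3) elementwise, dW2 = outer(delta3, a2)."""
    _, z3 = score(W2, a2, b2, U)
    delta3 = [u * sigmoid(z) * (1.0 - sigmoid(z)) for u, z in zip(U, z3)]
    return [[d * a for a in a2] for d in delta3]

# Hypothetical toy values.
a2 = [0.3, 0.9]
W2 = [[0.2, -0.4], [0.7, 0.1]]
b2 = [0.05, -0.05]
U = [1.5, -2.0]

analytic = grad_W2(W2, a2, b2, U)

# Central finite difference for the single entry W2[0][1].
eps = 1e-6
W2[0][1] += eps
s_plus, _ = score(W2, a2, b2, U)
W2[0][1] -= 2 * eps
s_minus, _ = score(W2, a2, b2, U)
W2[0][1] += eps  # restore the original value
numeric = (s_plus - s_minus) / (2 * eps)
```

If the derivation is right, `analytic[0][1]` and `numeric` should agree to several decimal places.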
Then we need to calculate $\frac{\partial s}{\partial W^{(1)}}$.
$$
\begin{aligned}
s &= U^T f\left(W^{(2)} f\left(W^{(1)}x+b^{(1)}\right)+b^{(2)}\right) \\
&= U^T f\left(W^{(2)} f\left(\left[\ldots;\ W^{(1)}_{ij}x_j + C + b^{(1)}_i;\ \ldots\right]\right)+b^{(2)}\right) \\
&= U^T f\left( W^{(2)} \left[\begin{array}{c} \vdots \\ f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) \\ \vdots \end{array}\right] + b^{(2)}\right) \\
&= U^T f\left( \left[\begin{array}{c}
W_{1i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C \\
W_{2i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C \\
\vdots \\
W_{ni}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C
\end{array}\right] + b^{(2)}\right) \\
&= U^T f\left( \left[\begin{array}{c}
W_{1i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_1 \\
W_{2i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_2 \\
\vdots \\
W_{ni}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_n
\end{array}\right] \right) \\
&= U^T \left[\begin{array}{c}
f\left(W_{1i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_1\right) \\
f\left(W_{2i}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_2\right) \\
\vdots \\
f\left(W_{ni}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_n\right)
\end{array}\right] \\
&= \sum_k U_k f\left(W_{ki}^{(2)} f(W^{(1)}_{ij}x_j + C + b^{(1)}_i) + C + b^{(2)}_k\right)
\end{aligned}
$$
Here each $C$ stands for terms that do not depend on the single entry $W^{(1)}_{ij}$ (the $C$'s in different positions are not the same value), and the inner argument $W^{(1)}_{ij}x_j + C + b^{(1)}_i$ is the $i$-th component $z^{(2)}_i$.
Thus,
$$
\begin{aligned}
\frac{\partial s}{\partial W^{(1)}_{ij}} &= \sum_k U_k f'(z^{(3)}_k)\, W^{(2)}_{ki}\, f'(z^{(2)}_i)\, x_j \\
&= \left[ (W^{(2)}_{\cdot i})^T \left(U \odot f'(z^{(3)})\right) \right] f'(z^{(2)}_i)\, x_j \\
\frac{\partial s}{\partial W^{(1)}_{i\cdot}} &= \left[ (W^{(2)}_{\cdot i})^T \left(U \odot f'(z^{(3)})\right) \right] f'(z^{(2)}_i)\, x^T \\
\frac{\partial s}{\partial W^{(1)}} &= \left[ \left( (W^{(2)})^T \left(U \odot f'(z^{(3)})\right) \right) \odot f'(z^{(2)}) \right] x^T \\
&= \left[ \left( (W^{(2)})^T \delta^{(3)} \right) \odot f'(z^{(2)}) \right] x^T
\end{aligned}
$$
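(I am not sure whether my own derivation is correct, but it matches the result in the lecture slides; corrections from readers are welcome.)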
In general, for layer $l$:
$$
\begin{aligned}
\delta^{(l)} &= \left((W^{(l)})^T\delta^{(l+1)}\right)\odot f'(z^{(l)}) \\
\frac{\partial}{\partial W^{(l)}}E_R &= \delta^{(l+1)}(a^{(l)})^T+\lambda W^{(l)}
\end{aligned}
$$
where $E_R$ is the regularized error and the $\lambda W^{(l)}$ term comes from the regularizer.
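The recursion above can be sketched as a loop that walks from the top layer down, reusing each $\delta$ to compute the one below it. This is an illustrative sketch under stated assumptions rather than the lecture's code: the activation is taken to be a sigmoid, and the objective `s + (lam / 2) * sum of squared weights` is a stand-in for $E_R$ chosen so that its gradient has the $\lambda W^{(l)}$ term. The code uses 0-based indices, so `Ws[0]` plays the role of $W^{(1)}$ and `a[0]` is the input $x$.

```python
import math

def f(t):
    # Assumed activation (sigmoid).
    return 1.0 / (1.0 + math.exp(-t))

def f_prime(t):
    return f(t) * (1.0 - f(t))

def forward(x, Ws, bs, U):
    """Return s plus the lists a[l], z[l]; index 0 holds the input layer."""
    a, z = [x], [x]
    for W, b in zip(Ws, bs):
        z_next = [sum(w * v for w, v in zip(row, a[-1])) + bi for row, bi in zip(W, b)]
        z.append(z_next)
        a.append([f(t) for t in z_next])
    s = sum(u * v for u, v in zip(U, a[-1]))
    return s, a, z

def objective(x, Ws, bs, U, lam):
    """Assumed stand-in for E_R: score plus (lam / 2) * squared weight norm."""
    s, _, _ = forward(x, Ws, bs, U)
    reg = sum(w * w for W in Ws for row in W for w in row)
    return s + 0.5 * lam * reg

def backward(x, Ws, bs, U, lam):
    """Gradients of the objective w.r.t. every weight matrix via the delta recursion."""
    _, a, z = forward(x, Ws, bs, U)
    delta = [u * f_prime(t) for u, t in zip(U, z[-1])]  # top-layer delta
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        W = Ws[l]
        # Gradient for this layer: outer(delta, a[l]) + lam * W
        grads[l] = [[d * v + lam * W[i][j] for j, v in enumerate(a[l])]
                    for i, d in enumerate(delta)]
        if l > 0:
            # Propagate: delta_below = (W^T delta) * f'(z[l]), elementwise
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(a[l]))]
            delta = [bk * f_prime(t) for bk, t in zip(back, z[l])]
    return grads

# Hypothetical toy network: 3 inputs, two hidden layers of 2 units each.
x = [0.5, -1.0, 2.0]
Ws = [[[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]],
      [[0.7, -0.5], [0.2, 0.9]]]
bs = [[0.0, 0.1], [-0.1, 0.0]]
U = [1.0, -2.0]
lam = 0.01

grads = backward(x, Ws, bs, U, lam)

# Finite-difference check on one entry of the first weight matrix.
eps = 1e-6
Ws[0][1][2] += eps
j_plus = objective(x, Ws, bs, U, lam)
Ws[0][1][2] -= 2 * eps
j_minus = objective(x, Ws, bs, U, lam)
Ws[0][1][2] += eps  # restore
numeric = (j_plus - j_minus) / (2 * eps)
```

Comparing `grads[0][1][2]` with `numeric` gives a quick check that the recursion and the outer-product formula agree with a direct numerical derivative.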

BP

Consider the function
$$
f(x,y,z)=(x+y)z
$$
Define:
$$
\begin{aligned}
q &= x+y, & \frac{\partial q}{\partial x} &= 1, & \frac{\partial q}{\partial y} &= 1 \\
f &= qz, & \frac{\partial f}{\partial q} &= z, & \frac{\partial f}{\partial z} &= q
\end{aligned}
$$
Then we can apply the chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = z$, $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial y} = z$, and $\frac{\partial f}{\partial z} = q$.

sequenceDiagram
x->>q: x
y->>q: y
q->>f: q = x + y
z->>f: z

Recursively apply chain rule through each node.
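For the small example $f(x,y,z)=(x+y)z$, applying the chain rule node by node can be written out directly. The sketch below is only an illustration; the function name and the input values are hypothetical.

```python
def forward_backward(x, y, z):
    """Forward and backward pass through the graph q = x + y, f = q * z."""
    # Forward pass: compute each node from its inputs.
    q = x + y
    f = q * z
    # Backward pass: start from df/df = 1 and move back through each node.
    df_df = 1.0
    df_dq = z * df_df      # local gradient of f = q * z w.r.t. q
    df_dz = q * df_df      # local gradient of f = q * z w.r.t. z
    df_dx = 1.0 * df_dq    # local gradient of q = x + y w.r.t. x is 1
    df_dy = 1.0 * df_dq    # local gradient of q = x + y w.r.t. y is 1
    return f, (df_dx, df_dy, df_dz)

# Hypothetical inputs.
f_value, grads = forward_backward(-2.0, 5.0, -4.0)
# f_value is -12.0 and grads is (-4.0, -4.0, 3.0)
```

Each node only needs its local gradient and the gradient arriving from above, which is what lets the same procedure scale to large graphs.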

Three other descriptions and viewpoints of backpropagation

  • Functions as Circuits
  • High-level flowgraph; move back through the graph.
  • Delta error signals in real NN

Paper Reading

FastText:

  • average word vector.
  • hierarchical softmax
    • We construct a Huffman tree based on word frequency. Each non-leaf node is a two-class classifier, and each leaf denotes a word.
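The two ideas above can be sketched together: average the word vectors into a hidden vector, build a Huffman tree from word counts, and compute a word's probability as the product of binary classifier outputs along its root-to-leaf path. This is a toy illustration and not fastText's actual implementation; the word counts, vectors, node parameters and function names are all hypothetical, and each classifier is assumed to be a sigmoid over a dot product.

```python
import heapq
import itertools
import math

def build_huffman(freqs):
    """Return {word: [(node_id, bit), ...]}, the path from the root to each leaf.

    Each internal node gets an integer id; bit 0/1 says which child to follow.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares subtrees
    heap = [(freq, next(counter), word) for word, freq in freqs.items()]
    heapq.heapify(heap)
    node_ids = itertools.count()
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)   # the two least frequent subtrees
        f1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f0 + f1, next(counter), (next(node_ids), left, right)))
    paths = {}

    def walk(tree, path):
        if isinstance(tree, str):           # a leaf is a word
            paths[tree] = path
            return
        node_id, left, right = tree
        walk(left, path + [(node_id, 0)])
        walk(right, path + [(node_id, 1)])

    walk(heap[0][2], [])
    return paths

def word_prob(path, h, node_vecs):
    """P(word | h): product of two-class classifier outputs along the path."""
    p = 1.0
    for node_id, bit in path:
        score = sum(a * b for a, b in zip(node_vecs[node_id], h))
        p_right = 1.0 / (1.0 + math.exp(-score))
        p *= p_right if bit == 1 else 1.0 - p_right
    return p

# Hypothetical data.
freqs = {"the": 50, "cat": 10, "sat": 8, "mat": 3}
word_vecs = {"the": [0.4, 0.0], "cat": [0.0, -0.2]}
context = ["the", "cat"]
# Hidden vector = average of the context word vectors.
h = [sum(word_vecs[w][d] for w in context) / len(context) for d in range(2)]
node_vecs = {0: [0.5, 0.3], 1: [-0.4, 0.8], 2: [0.1, 0.1]}  # one vector per internal node

paths = build_huffman(freqs)
probs = {w: word_prob(p, h, node_vecs) for w, p in paths.items()}
```

Frequent words end up with short paths, so they need fewer classifier evaluations, and because every internal node splits its probability mass between two children, the leaf probabilities sum to one without an explicit normalization over the vocabulary.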