[Introduction to Deep Learning] NNDL Study Notes (2)

 

Chapter 2: How the backpropagation algorithm works

Backpropagation: a fast algorithm for computing the gradient of the cost function.

Warm up: a fast matrix-based approach to computing the output from a neural network

                                                      a^l_j = \sigma(\sum_kw^l_{jk}a^{l-1}_k+b^l_j)\Rightarrow a^l=\sigma(w^l\cdot a^{l-1}+b^l)

weighted input: z^l = w^l\cdot a^{l-1}+b^l

elementwise multiplication: s \odot t
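In NumPy the Hadamard product s \odot t is simply s * t, and the matrix-based forward pass above is a short loop. A minimal sketch (the names sigmoid, feedforward, weights, and biases are my own, not taken from the book's code):

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(weights, biases, a):
    """Compute a^l = sigma(w^l a^{l-1} + b^l) layer by layer.

    weights[l] has shape (n_l, n_{l-1}), biases[l] has shape (n_l, 1),
    and a is a column vector of input activations for layer 1."""
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b    # weighted input z^l
        a = sigmoid(z)          # activation a^l
    return a
```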

The two assumptions we need about the cost function

1. The cost function can be written as an average C=\frac{1}{n}\sum_x C_x over cost functions C_x for individual training examples x.

2. The cost for a single example can be written as a function of the output activations from the neural network: C_x = C_x(a^L)
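For example, the quadratic cost satisfies both assumptions:

                                                      C = \frac{1}{2n}\sum_x \|y(x)-a^L(x)\|^2 = \frac{1}{n}\sum_x C_x, \qquad C_x = \frac{1}{2}\|y-a^L\|^2

It is an average over per-example costs C_x, and each C_x depends (for a fixed training input x, so y is a constant) only on the output activations a^L.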

The four fundamental equations behind backpropagation

The error  in the jth neuron in the lth layer: \delta^l_j \equiv\frac{\partial C}{\partial z^l_j}

The four equations below hold for any activation function σ(·), not just the sigmoid, and for any cost function satisfying the two assumptions above:

Equation1: the error in the output layer \delta ^L


                                                                     \delta^L_j=\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j),

                                                            matrix form: \delta^L = \nabla_a C \odot \sigma' (z^L)


It is easily computed: 

Since \sigma(z) = \frac{1}{1+e^{-z}}, once z^L_j is known from the forward pass, \sigma'(z^L_j) = \sigma(z^L_j)(1-\sigma(z^L_j)) is easy to compute.

If we're using the quadratic cost function, then C=\frac{1}{2}\sum_j(y_j-a_j^L)^2, so \frac{\partial C}{\partial a^L_j}=(a^L_j-y_j).

matrix form: \delta^L = (a^L-y) \odot \sigma'(z^L)
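A small sketch of BP1 for sigmoid neurons and the quadratic cost (a_L, y, z_L stand for a^L, y, z^L; the names are illustrative, not from the book's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    return sigmoid(z) * (1.0 - sigmoid(z))

def output_error(a_L, y, z_L):
    """BP1 for the quadratic cost: delta^L = (a^L - y) ⊙ sigma'(z^L)."""
    return (a_L - y) * sigmoid_prime(z_L)
```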

Equation2: the error \delta^l in terms of the next layer, \delta^{l+1}


                                                                 \delta^l = ((w^{l+1})^T\delta^{l+1})\odot \sigma'(z^l),

This follows from the chain rule, using z^{l+1}_k = \sum_j w^{l+1}_{kj}\sigma(z^l_j) + b^{l+1}_k:

                                                                      \delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j)
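In code, BP2 is a single backward step from layer l+1 to layer l. A sketch (w_next, delta_next, z stand for w^{l+1}, \delta^{l+1}, z^l; illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop_error(w_next, delta_next, z):
    """BP2: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)."""
    return np.dot(w_next.T, delta_next) * sigmoid_prime(z)
```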


Equation3: An equation for the rate of change of the cost with respect to any bias in the network.


                                                                 \frac{\partial C}{\partial b^l_j}=\delta ^l_j


Equation4: An equation for the rate of change of the cost with respect to any weight in the network


                                                          \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\delta^l_j,     \frac{\partial C}{\partial w}=a_{in}\delta_{out}
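Collecting BP3 and BP4 over a whole layer (my rewriting, consistent with the component forms above), the bias gradient is just \delta^l and the weight gradient is an outer product:

                                                          \nabla_{b^l} C = \delta^l, \qquad \nabla_{w^l} C = \delta^l\,(a^{l-1})^T

With column vectors in NumPy this outer product is np.dot(delta, a_prev.T), where delta and a_prev are illustrative names for \delta^l and a^{l-1}.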


A weight or bias feeding into the final layer will learn slowly if the output neuron has either low activation (≈0) or high activation (≈1), because the sigmoid has then saturated and \sigma'(z^L_j) \approx 0.

More generally, by BP4 a weight will learn slowly if its input neuron has low activation (a^{l-1}_k \approx 0) or if its output neuron has saturated (\sigma'(z^l_j) \approx 0, so \delta^l_j is small).

The backpropagation algorithm
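A minimal sketch of the full algorithm for a single training example, combining the feedforward pass with BP1-BP4 for sigmoid neurons and the quadratic cost (names are my own, loosely modeled on the style of the book's network.py rather than copied from it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(weights, biases, x, y):
    """Return (nabla_b, nabla_w): layer-by-layer gradients of the quadratic
    cost for one training example (x, y)."""
    # 1. Input: set the activation a^1 of the input layer.
    activation = x
    activations = [x]   # a^1, a^2, ..., a^L
    zs = []             # z^2, ..., z^L
    # 2. Feedforward: z^l = w^l a^{l-1} + b^l, a^l = sigma(z^l).
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # 3. Output error, BP1 (quadratic cost): delta^L = (a^L - y) ⊙ sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    nabla_b[-1] = delta                                     # BP3
    nabla_w[-1] = np.dot(delta, activations[-2].T)          # BP4
    # 4. Backpropagate the error, BP2: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l).
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta                                 # BP3
        nabla_w[-l] = np.dot(delta, activations[-l - 1].T)  # BP4
    # 5. Output: the gradient of the cost, given by BP3 and BP4.
    return nabla_b, nabla_w
```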

Exercise

1. Suppose we modify a single neuron in a feedforward network so that the output from the neuron is given by f(\sum_j w_j x_j + b), where f is some function other than the sigmoid. How should we modify the backpropagation algorithm in this case?

In BP1 and BP2, replace \sigma' with f' (so \delta^L_j = \frac{\partial C}{\partial a^L_j} f'(z^L_j) and \delta^l = ((w^{l+1})^T\delta^{l+1})\odot f'(z^l)); BP3 and BP4 do not change.

2. Linear neurons: Suppose we replace the usual non-linear σ function with σ(z)=z throughout the network. Rewrite the backpropagation algorithm for this case.

Since \sigma(z)=z, we have \sigma'(z)=1 everywhere, so the \odot\,\sigma' factors drop out: BP1 becomes \delta^L = \nabla_a C (for the quadratic cost, \delta^L = a^L - y) and BP2 becomes \delta^l = (w^{l+1})^T\delta^{l+1}; BP3 and BP4 are unchanged.

The code for backpropagation

For a mini-batch of size m, we apply backpropagation to each training example x in the mini-batch and then average the resulting gradients in the gradient-descent update of the weights and biases.

Fully matrix-based approach: it's possible to modify the backpropagation algorithm so that it computes the gradients for all training examples in a mini-batch simultaneously, taking full advantage of modern libraries for linear algebra.
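A sketch of what "fully matrix-based" means in practice, assuming the m examples of a mini-batch are stacked as the columns of an input matrix X and a target matrix Y (these names and the averaging convention are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop_batch(weights, biases, X, Y):
    """Backpropagation for a whole mini-batch at once.

    X has shape (n_in, m), Y has shape (n_out, m): one column per training
    example. Returns gradients already averaged over the m examples."""
    m = X.shape[1]
    activation = X
    activations = [X]
    zs = []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b   # b broadcasts across the m columns
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # BP1 (quadratic cost) applied to all m columns simultaneously.
    delta = (activations[-1] - Y) * sigmoid_prime(zs[-1])
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    nabla_b[-1] = delta.sum(axis=1, keepdims=True) / m
    nabla_w[-1] = np.dot(delta, activations[-2].T) / m   # sums the per-example outer products
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta.sum(axis=1, keepdims=True) / m
        nabla_w[-l] = np.dot(delta, activations[-l - 1].T) / m
    return nabla_b, nabla_w
```

The speedup comes from replacing m small matrix-vector products per layer with one large matrix-matrix product, which linear-algebra libraries handle much more efficiently.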

The big picture

                                                                     \Delta C\approx\frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}

 

This suggests that a possible approach to computing \partial C/\partial w^l_{jk} is to carefully track how a small change in w^l_{jk} propagates to cause a small change in C.

The change \Delta w^l_{jk} first causes a change in the activation of the jth neuron in layer l:

                                                                    \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}

For a single neuron in the next layer:

                                                                   \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j}\Delta a^l_j

Imagine a single path from the weight w^l_{jk} through the network to the cost C. Following the change along that path gives

                                                                   \Delta C \approx \frac{\partial C}{\partial a^L_m}\frac{\partial a^L_m}{\partial a^{L-1}_n}\cdots\frac{\partial a^{l+1}_q}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}}\Delta w^l_{jk}

There are many such paths from w^l_{jk} to C, and to compute the total change in C we sum over all of them:

                                                                   \Delta C \approx \sum_{mnp\ldots q}\frac{\partial C}{\partial a^L_m}\frac{\partial a^L_m}{\partial a^{L-1}_n}\cdots\frac{\partial a^{l+1}_q}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}}\Delta w^l_{jk}

What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron's activation with respect to the other neuron's activation.
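The picture of a small change in w^l_{jk} propagating through to C also suggests a simple sanity check: the gradient produced by BP1-BP4 should agree with the numerical estimate (C(w+\epsilon)-C(w-\epsilon))/2\epsilon. A hedged sketch on a tiny 2-3-1 network (the sizes, seed, and names are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# A tiny 2-3-1 network with fixed random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
x = rng.standard_normal((2, 1))
y = np.array([[0.5]])

def cost(ws):
    """Quadratic cost C = 0.5 * ||y - a^L||^2 for the single example (x, y)."""
    a = x
    for w, b in zip(ws, biases):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((y - a) ** 2)

# Backprop gradient of C with respect to the first weight matrix (BP1, BP2, BP4).
z1 = np.dot(weights[0], x) + biases[0]
a1 = sigmoid(z1)
z2 = np.dot(weights[1], a1) + biases[1]
a2 = sigmoid(z2)
delta2 = (a2 - y) * sigmoid_prime(z2)                      # BP1
delta1 = np.dot(weights[1].T, delta2) * sigmoid_prime(z1)  # BP2
grad_w0 = np.dot(delta1, x.T)                              # BP4: a_in * delta_out

# Numerical estimate of the same partial derivative for one entry of that matrix.
eps = 1e-6
w_plus = [weights[0].copy(), weights[1]]
w_minus = [weights[0].copy(), weights[1]]
w_plus[0][0, 0] += eps
w_minus[0][0, 0] -= eps
numeric = (cost(w_plus) - cost(w_minus)) / (2 * eps)

print(grad_w0[0, 0], numeric)   # the two values should agree to several decimal places
```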

 

 

 

 
