How Backpropagation Works

In order to minimize the cost function, we want small changes in the weights to lead to small changes in the output. We can then use this property to adjust the weights step by step, bringing the network closer to the behavior we want. This is also one reason why we need a smooth activation function rather than a step function with outputs in {0, 1}.


1. The intuitive idea

$$\Delta Cost \approx \sum_j \frac{\partial Cost}{\partial w_j}\, \Delta w_j$$

$\Delta w_j$ should be small enough to ensure this is a good approximation.

If we want to decrease $Cost$, we need $\Delta Cost < 0$. Therefore, $\Delta w_j$ should have the opposite sign of $\frac{\partial Cost}{\partial w_j}$.

Here, let $\Delta w = -\eta \nabla_w Cost$, where $\eta$ is the learning rate. Choosing $\eta = \epsilon / \|\nabla_w Cost\|$, where $\epsilon = \|\Delta w\|$, minimizes $\Delta Cost$ at each update ($\|\Delta w\|$ is kept small to ensure the approximation holds).
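
As a concrete illustration of this update rule, here is a minimal NumPy sketch. The quadratic toy cost $Cost(w) = \|w\|^2 / 2$ and the learning rate are illustrative choices, not taken from the text above.

```python
import numpy as np

def gradient_step(w, grad_cost, eta=0.1):
    """One gradient-descent update: Delta w = -eta * grad(Cost)."""
    return w - eta * grad_cost

# Toy cost Cost(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([1.0, -2.0, 0.5])
for _ in range(50):
    w = gradient_step(w, grad_cost=w)
print(w)  # approaches [0, 0, 0], the minimizer of the toy cost
```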




2. Backpropagation

Warm-up:

$\nabla_w Cost$ (or $\frac{\partial Cost}{\partial w}$) tells us how quickly the cost changes when we update $w$.

2.1 Introduction to Notation

$w^l_{jk}$: the weight of the connection from the $k$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer.

$a^l_j$: the activation (output) of the $j$-th neuron in the $l$-th layer.

$z^l_j$: the weighted input to the $j$-th neuron in the $l$-th layer.

The Detailed Notation Image

$a^l_j = f\left(\sum_k w^l_{jk}\, a^{l-1}_k\right)$ can be vectorized as:

$$a^l = f(w^l a^{l-1})$$

(In the picture above, $w^3$ is a $2 \times 4$ matrix: the first row corresponds to the first neuron in layer 3 and the second row to the second neuron; $a^3$ is a $2 \times 1$ matrix, i.e. a vector.)

For convenience, we also define:

$$z^l = w^l a^{l-1}$$
$$a^l = f(z^l)$$
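
A minimal sketch of this vectorized forward pass in NumPy. The sigmoid activation and the 4-to-2 layer shapes are assumptions chosen to match the picture above; bias terms are omitted, as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, a):
    """Feed a through the network: z^l = w^l a^{l-1}, a^l = f(z^l).

    weights is the list [w^2, w^3, ...]; a is the input column vector a^1.
    Returns the weighted inputs z and the activations a for every layer.
    """
    zs, activations = [], [a]
    for w in weights:
        z = w @ a          # z^l = w^l a^{l-1}
        a = sigmoid(z)     # a^l = f(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Example with the shapes from the picture: 4 neurons feeding 2 neurons.
w3 = np.random.randn(2, 4)   # a 2 x 4 weight matrix
a2 = np.random.randn(4, 1)   # a 4 x 1 activation vector
zs, acts = forward([w3], a2)
print(acts[-1].shape)        # (2, 1)
```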



2.2 The Fundamental Equations

To begin, we define the ERROR of a neuron:

$$\delta^l_j = \frac{\partial Cost}{\partial z^l_j}$$

This definition makes sense. Suppose we add a small change $\Delta z^l_j$ to the weighted input $z^l_j$ of a neuron. Its output then becomes $f(z^l_j + \Delta z^l_j)$ instead of $f(z^l_j)$. The change propagates through the later layers, and the overall change in the cost is approximately $\frac{\partial Cost}{\partial z^l_j}\, \Delta z^l_j$.

So if $\frac{\partial Cost}{\partial z^l_j}$ is close to zero, changing this input cannot change the cost much, and we can regard the neuron as nearly optimal. Heuristically, then, $\frac{\partial Cost}{\partial z^l_j}$ is a measure of the error in the neuron.
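
A quick numerical check of this approximation for a single sigmoid neuron with a quadratic cost (both choices are just for illustration, not fixed by the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid neuron with quadratic cost: Cost(z) = 0.5 * (f(z) - y)^2
z, y = 0.7, 1.0
cost = lambda z: 0.5 * (sigmoid(z) - y) ** 2
delta = (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))   # analytic dCost/dz

# Perturbing z by dz changes the cost by approximately delta * dz:
dz = 1e-3
print(cost(z + dz) - cost(z), delta * dz)   # the two values nearly agree
```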

  1. An equation for the error in the output layer

    $\delta^L = \frac{\partial Cost}{\partial z^L}$ ($L$ denotes the last layer)

    $\delta^L = \frac{\partial Cost}{\partial a^L} \odot \frac{\partial a^L}{\partial z^L} = \frac{\partial Cost}{\partial a^L} \odot f'(z^L) = \nabla_{a^L} Cost \odot f'(z^L)$ ($\odot$ denotes the componentwise product)


  2. An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:

$$\delta^l = \frac{\partial Cost}{\partial z^l} = \frac{\partial Cost}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial z^l} = \left((w^{l+1})^T \delta^{l+1}\right) \odot f'(z^l)$$

More specifically, for an individual neuron:

$$\delta^l_j = \frac{\partial Cost}{\partial z^l_j} = \sum_k \frac{\partial Cost}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$

Meanwhile,
$$z^{l+1}_k = \sum_j w^{l+1}_{kj}\, a^l_j = \sum_j w^{l+1}_{kj}\, f(z^l_j)$$

so $\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\, f'(z^l_j)$, and therefore
$$\delta^l_j = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k w^{l+1}_{kj}\, \delta^{l+1}_k\, f'(z^l_j)$$

Multiplying by $(w^l)^T$ to go from $\delta^l$ to $\delta^{l-1}$ can be thought of as propagating the error backward through the network. Taking the Hadamard product with $f'(z^{l-1})$ then moves the error backward through the activation function in layer $l-1$.
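
To make the backward pass concrete, here is a minimal NumPy sketch of equations 1 and 2 together. It assumes a sigmoid activation and a quadratic cost (so $\nabla_{a^L} Cost = a^L - y$); both are illustrative assumptions, as are the layer sizes in the usage example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward_errors(weights, zs, activations, y):
    """Compute delta^l for every layer, from the output layer backward.

    weights = [w^2, ..., w^L]; zs and activations come from the forward pass.
    """
    # Equation 1: delta^L = (a^L - y) * f'(z^L)   (quadratic cost assumed)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    deltas = [delta]
    # Equation 2: delta^l = ((w^{l+1})^T delta^{l+1}) * f'(z^l)
    for w_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = (w_next.T @ delta) * sigmoid_prime(z)
        deltas.append(delta)
    return list(reversed(deltas))   # [delta^2, ..., delta^L]

# Usage with a random 4 -> 3 -> 2 network (arbitrary shapes):
weights = [np.random.randn(3, 4), np.random.randn(2, 3)]
a = np.random.randn(4, 1)
zs, activations = [], [a]
for w in weights:                       # forward pass, as in the earlier sketch
    zs.append(w @ activations[-1])
    activations.append(sigmoid(zs[-1]))
deltas = backward_errors(weights, zs, activations, y=np.ones((2, 1)))
print([d.shape for d in deltas])        # [(3, 1), (2, 1)]
```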


  3. An equation for the rate of change of the cost with respect to any weight in the network:
    $\frac{\partial Cost}{\partial w^l_{jk}} = \frac{\partial Cost}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\, a^{l-1}_k$

This follows because $z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k$.

Written with fewer indices:

$$\frac{\partial Cost}{\partial w} = a_{\rm in}\, \delta_{\rm out}$$

(in matrix form, $\frac{\partial Cost}{\partial w^l} = \delta^l\, (a^{l-1})^T$)

(example figure)

It's understood that $a_{\rm in}$ is the activation of the neuron that feeds into the weight $w$, and $\delta_{\rm out}$ is the error of the neuron that the weight $w$ feeds into.

Note that if $a_{\rm in}$ is small, $\frac{\partial Cost}{\partial w}$ is also small. We then say the weight learns slowly, meaning it does not change much during gradient descent. Similarly, recall from the graph of the sigmoid function that $\sigma$ becomes very flat at both ends, so its derivative is approximately 0 there; $\delta_{\rm out}$ is then small as well. In this case it is common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly).
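
In the vectorized form, this gradient is just an outer product of the layer's error vector with the previous layer's activations. A tiny sketch with arbitrary toy shapes:

```python
import numpy as np

# Toy shapes: layer l-1 has 4 neurons, layer l has 2 (illustrative sizes only).
delta_l = np.random.randn(2, 1)   # error vector delta^l of layer l
a_prev  = np.random.randn(4, 1)   # activations a^{l-1} of the previous layer

# dCost/dw^l_{jk} = delta^l_j * a^{l-1}_k, i.e. an outer product of the two vectors:
grad_w = delta_l @ a_prev.T
print(grad_w.shape)               # (2, 4) -- the same shape as w^l
```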


Pseudocode for updating $w$ ($m$ is the size of the mini-batch):

$$w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l}\, (a^{x,l-1})^T$$
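
A minimal sketch of this mini-batch update, reusing the forward and backward_errors helpers from the earlier sketches (those helper names, and the absence of bias terms, are assumptions of these examples, not part of the pseudocode above):

```python
import numpy as np

def update_mini_batch(weights, batch, eta):
    """Apply w^l <- w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T.

    batch is a list of (x, y) training pairs; m = len(batch).
    """
    grad_sums = [np.zeros_like(w) for w in weights]
    for x, y in batch:
        zs, activations = forward(weights, x)                   # one forward pass
        deltas = backward_errors(weights, zs, activations, y)   # one backward pass
        for i, (delta, a_prev) in enumerate(zip(deltas, activations[:-1])):
            grad_sums[i] += delta @ a_prev.T                    # accumulate delta (a)^T
    m = len(batch)
    return [w - (eta / m) * g for w, g in zip(weights, grad_sums)]
```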

What's clever about backpropagation is that it enables us to compute all the partial derivatives $\frac{\partial Cost}{\partial w^l_{jk}}$ simultaneously, using just one forward pass through the network followed by one backward pass.
