To minimize the cost function, we want small changes in the weights to lead to small changes in the output. We can then exploit this property to adjust the weights step by step, bringing the network ever closer to the behavior we want. This is also one reason why we need a smooth activation function rather than a {0, 1} step function.
1. The intuitive idea
To first order, $\Delta Cost \approx \sum_j \frac{\partial Cost}{\partial w_j} \Delta w_j$, so $\Delta w_j$ should be small enough for this approximation to hold.

We want to decrease the cost, i.e. $\Delta Cost < 0$. Therefore $\Delta w_j$ should have the opposite sign of $\frac{\partial Cost}{\partial w_j}$.

Here, let $\Delta w = -\eta \nabla_w Cost$, where $\eta$ is the learning rate. Choosing $\eta = \epsilon / \|\nabla_w Cost\|$ minimizes $\Delta Cost$ at each update while fixing $\epsilon = \|\Delta w\|$ ($\|\Delta w\|$ is kept small to ensure the approximation holds).
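As a minimal sketch of this update rule (the function names and the toy quadratic cost are my own illustration, not from these notes):

```python
import numpy as np

def gradient_descent_step(w, grad_cost, epsilon=0.01):
    """One update: Delta_w = -eta * grad(Cost), with eta = epsilon / ||grad||
    so that ||Delta_w|| = epsilon stays small (keeping the linear
    approximation of Cost valid)."""
    grad = grad_cost(w)
    eta = epsilon / (np.linalg.norm(grad) + 1e-12)  # avoid division by zero
    return w - eta * grad

# Toy example: Cost(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(5):
    w = gradient_descent_step(w, lambda w: 2 * w)
print(w)  # moves toward the minimum at the origin
```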
2. Back-Propagation
Warm-up:

$\nabla_w Cost$, i.e. $\frac{\partial Cost}{\partial w}$, tells us how quickly the cost changes when we update $w$.
2.1 Introduction to Notations
$w^l_{jk}$: the weight of the connection from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.

$a^l_j$: the activation or output of the $j^{th}$ neuron in the $l^{th}$ layer.

$z^l_j$: the weighted input to the $j^{th}$ neuron in the $l^{th}$ layer.

$$a^l_j = f\Big(\sum_k w^l_{jk}\, a^{l-1}_k\Big)$$
This can be vectorized as:

$$a^l = f(w^l a^{l-1})$$

(In the picture above, $w^3$ is a $2 \times 4$ matrix, whose first row corresponds to the first neuron and second row to the second neuron in layer 3; $a^3$ is a $2 \times 1$ matrix, i.e. a vector.)

For convenience, we also define the weighted input as a vector:

$$z^l = w^l a^{l-1}, \quad \text{so that} \quad a^l = f(z^l)$$
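A NumPy sketch of this vectorized forward step (the 4-to-2 layer shapes mirror the $2 \times 4$ example above; `sigmoid` is one possible choice of smooth $f$, and biases are omitted to match the equations in these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shapes matching the text: layer l-1 has 4 neurons, layer l has 2,
# so w^l is 2x4 and a^{l-1} is 4x1.
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 4))       # w^l
a_prev = rng.standard_normal((4, 1))  # a^{l-1}

z = w @ a_prev   # z^l = w^l a^{l-1}
a = sigmoid(z)   # a^l = f(z^l), a 2x1 vector
```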
2.2 The Fundamental Equations
To begin, we define the error in a neuron as:

$$\delta^l_j \equiv \frac{\partial Cost}{\partial z^l_j}$$

This definition makes sense. Suppose we add a little change $\Delta z^l_j$ to the weighted input $z^l_j$ of a neuron. Its output becomes $f(z^l_j + \Delta z^l_j)$ instead of $f(z^l_j)$. The change propagates through the later layers, and the overall change in the cost is approximately $\frac{\partial Cost}{\partial z^l_j} \Delta z^l_j$.

In this case, if $\frac{\partial Cost}{\partial z^l_j}$ is close to zero, no small change to $z^l_j$ can alter the cost much, so we can regard this neuron as near optimal. In this heuristic sense, $\frac{\partial Cost}{\partial z^l_j}$ is a measure of the error in the neuron.
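One can sanity-check this interpretation numerically: perturb a single $z$ by a tiny $\Delta z$ and compare the actual change in the cost against the prediction $\delta \, \Delta z$ (a finite-difference sketch; the one-neuron setup with a quadratic cost is just my illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron feeding a quadratic cost: Cost(z) = 0.5 * (f(z) - y)^2
z, y, dz = 0.3, 1.0, 1e-5
cost = lambda z: 0.5 * (sigmoid(z) - y) ** 2

# Analytic error: delta = dCost/dz = (f(z) - y) * f'(z)
delta = (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))

print(cost(z + dz) - cost(z))  # actual change in the cost
print(delta * dz)              # predicted change, nearly identical
```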
- An equation for the error in the output layer:

$$\delta^L = \frac{\partial Cost}{\partial z^L} \quad (L \text{ denotes the last layer})$$

$$\delta^L = \frac{\partial Cost}{\partial a^L} \frac{\partial a^L}{\partial z^L} = \frac{\partial Cost}{\partial a^L} f'(z^L) = \nabla_{a^L} Cost \odot f'(z^L)$$

($\odot$ denotes the componentwise, or Hadamard, product)

- An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:

$$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot f'(z^l)$$
More specifically, for an individual neuron:

$$\delta^l_j = \frac{\partial Cost}{\partial z^l_j} = \sum_k \frac{\partial Cost}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$

Meanwhile, since $z^{l+1}_k = \sum_j w^{l+1}_{kj} f(z^l_j)$,

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} f'(z^l_j)$$

So

$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k f'(z^l_j)$$

which is the componentwise form of the matrix equation above.
Computing $(w^l)^T \delta^l$ to obtain $\delta^{l-1}$ can be thought of as propagating the error backward through the network. We then take the Hadamard product with $f'(z^{l-1})$, which moves the error backward through the activation function in layer $l-1$.
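Putting the two error equations together, here is a sketch of the backward sweep (assuming the lists `ws` and `zs` came from a forward pass, and a quadratic cost so that $\nabla_{a^L} Cost = a^L - y$; the names and scaffolding are mine, not from these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backward_errors(ws, zs, a_last, y):
    """Return [delta^2, ..., delta^L] given weights ws = [w^2, ..., w^L]
    and weighted inputs zs = [z^2, ..., z^L] from a forward pass."""
    # Output layer: delta^L = grad_a Cost (*) f'(z^L); quadratic cost here.
    delta = (a_last - y) * sigmoid_prime(zs[-1])
    deltas = [delta]
    # Backward: delta^l = ((w^{l+1})^T delta^{l+1}) (*) f'(z^l)
    for w_next, z in zip(reversed(ws[1:]), reversed(zs[:-1])):
        delta = (w_next.T @ delta) * sigmoid_prime(z)
        deltas.insert(0, delta)
    return deltas
```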
- An equation for the rate of change of the cost with respect to any weight in the network:
$$\frac{\partial Cost}{\partial w^l_{jk}} = \frac{\partial Cost}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \, a^{l-1}_k$$

For a less index-heavy version:

$$\frac{\partial Cost}{\partial w} = a_{in} \, \delta_{out}$$

where it is understood that $a_{in}$ is the activation of the neuron feeding into the weight $w$, and $\delta_{out}$ is the error of the neuron that the weight $w$ feeds into.
Note that if $a_{in}$ is small, $\frac{\partial Cost}{\partial w}$ is also small; we say the weight learns slowly, meaning it does not change much during gradient descent. Recall from the graph of the sigmoid function that $\sigma$ becomes very flat at both tails, where its derivative is approximately 0. In this case it is common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly).
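A quick numerical illustration of saturation (the printed values are simply $\sigma'(z)$ at a few points):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_prime(z))
# sigma'(0) = 0.25, but sigma'(10) is about 4.5e-5: a saturated neuron
# contributes almost no gradient, so its incoming weights barely move.
```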
Pseudocode for updating $w$ (where $m$ is the batch size):

$$w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} \big(a^{x,l-1}\big)^T$$
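A runnable sketch of this mini-batch update, reusing `sigmoid` and `backward_errors` from the sketch above (again my own scaffolding under the same no-bias assumption, not the author's code):

```python
import numpy as np

def update_mini_batch(ws, batch, eta):
    """Apply w^l <- w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T
    over a batch of (x, y) pairs of column vectors."""
    m = len(batch)
    grads = [np.zeros_like(w) for w in ws]
    for x, y in batch:
        # Forward pass: record activations a^l and weighted inputs z^l.
        activations, zs = [x], []
        for w in ws:
            zs.append(w @ activations[-1])
            activations.append(sigmoid(zs[-1]))
        # Backward pass: errors delta^l for every layer.
        deltas = backward_errors(ws, zs, activations[-1], y)
        for l, delta in enumerate(deltas):
            grads[l] += delta @ activations[l].T  # delta^l (a^{l-1})^T
    return [w - (eta / m) * g for w, g in zip(ws, grads)]
```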
What's clever about backpropagation is that it enables us to compute all the partial derivatives $\partial Cost / \partial w_j$ simultaneously, using just one forward pass through the network followed by one backward pass.