Artificial Neural Networks: Mathematics of Backpropagation (Part 4)

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

Up until now, we haven't utilized any of the expressive non-linear power of neural networks - all of our simple one-layer models corresponded to a linear model such as multinomial logistic regression. These one-layer models had a simple derivative. We only had one set of weights that fed directly into our output, and it was easy to compute the derivative with respect to these weights. However, what happens when we want to use a deeper model? What happens when we start stacking layers?

No longer is there a linear relation between a change in the weights and a change in the output. Any perturbation at a particular layer will be further transformed in successive layers. So, then, how do we compute the gradient for all weights in our network? This is where we use the backpropagation algorithm.

Backpropagation, at its core, simply consists of repeatedly applying the chain rule through all of the possible paths in our network. However, there are an exponential number of directed paths from the input to the output. Backpropagation's real power arises in the form of a dynamic programming algorithm, where we reuse intermediate results to calculate the gradient. We transmit intermediate errors backwards through a network, thus leading to the name backpropagation. In fact, backpropagation is closely related to forward propagation, but instead of propagating the inputs forward through the network, we propagate the error backwards.

Most explanations of backpropagation start directly with a general theoretical derivation, but I’ve found that computing the gradients by hand naturally leads to the backpropagation algorithm itself, and that’s what I’ll be doing in this blog post. This is a lengthy section, but I feel that this is the best way to learn how backpropagation works.

I’ll start with a simple one-path network, and then move on to a network with multiple units per layer. Finally, I’ll derive the general backpropagation algorithm. Code for the backpropagation algorithm will be included in my next installment, where I derive the matrix form of the algorithm.

Examples: Deriving the base rules of backpropagation

Remember that our ultimate goal in training a neural network is to find, for each weight, the gradient of the error with respect to that weight:
\[
\frac{\partial E}{\partial w_{i\to j}}
\]
We do this so that we can update the weights incrementally using stochastic gradient descent:
\[
w_{i\to j} = w_{i\to j} - \eta\frac{\partial E}{\partial w_{i\to j}}
\]
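To make the update rule concrete, here is a minimal Python sketch of one stochastic gradient descent step. The names `weights`, `grads`, and `eta` are illustrative (not from the original post), and `grads` is assumed to already hold $\frac{\partial E}{\partial w_{i\to j}}$ for each weight:

```python
# A minimal sketch of the update w_{i->j} <- w_{i->j} - eta * dE/dw_{i->j}.
# `weights` and `grads` are dicts keyed by edge (i, j); `eta` is the learning rate.
def sgd_step(weights, grads, eta=0.1):
    for edge, grad in grads.items():
        weights[edge] -= eta * grad
    return weights
```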

For a single unit in a general network, we can have several cases: the unit may have only one input and one output (case 1), the unit may have multiple inputs (case 2), or the unit may have multiple outputs (case 3). Technically there is a fourth case: a unit may have multiple inputs and outputs. But as we will see, the multiple input case and the multiple output case are independent, and we can simply combine the rules we learn for case 2 and case 3 for this case.

I will go over each of these cases in turn with relatively simple multilayer networks, and along the way will derive some general rules for backpropagation. At the end, we can combine all of these rules into a single grand unified backpropagation algorithm for arbitrary networks.

Case 1: Single input and single output

Suppose we have the following network:

A simple "one path" network.

We can explicitly write out the value of each variable in this network:
\[
\begin{aligned}
s_j &= w_1 \cdot x_i \\
z_j &= \sigma(s_j) = \sigma(w_1 \cdot x_i) \\
s_k &= w_2 \cdot z_j \\
z_k &= \sigma(s_k) = \sigma(w_2 \cdot \sigma(w_1 \cdot x_i)) \\
s_o &= w_3 \cdot z_k \\
\hat{y}_i &= s_o = w_3 \cdot \sigma(w_2 \cdot \sigma(w_1 \cdot x_i)) \\
E &= \frac{1}{2}(\hat{y}_i - y_i)^2 = \frac{1}{2}\left(w_3 \cdot \sigma(w_2 \cdot \sigma(w_1 \cdot x_i)) - y_i\right)^2
\end{aligned}
\]
For this simple example, it's easy to find all of the derivatives by hand. In fact, let's do that now. I am going to color code certain parts of the derivation, and see if you can deduce a pattern that we might exploit in an iterative algorithm. First, let's find the derivative for $w_{k\to o}$ (that is, $w_3$ above; remember that $\hat{y}_i = w_{k\to o} z_k$, as our output is a linear unit):
\[
\begin{aligned}
\frac{\partial E}{\partial w_{k\to o}} &= \frac{\partial}{\partial w_{k\to o}} \frac{1}{2}(\hat{y}_i - y_i)^2 \\
&= \frac{\partial}{\partial w_{k\to o}} \frac{1}{2}(w_{k\to o} \cdot z_k - y_i)^2 \\
&= (w_{k\to o} \cdot z_k - y_i)\frac{\partial}{\partial w_{k\to o}}(w_{k\to o} \cdot z_k - y_i) \\
&= (\hat{y}_i - y_i)(z_k)
\end{aligned}
\]
Finding the weight update for $w_{j\to k}$ is also relatively simple:
\[
\begin{aligned}
\frac{\partial E}{\partial w_{j\to k}} &= \frac{\partial}{\partial w_{j\to k}} \frac{1}{2}(\hat{y}_i - y_i)^2 \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial w_{j\to k}}(w_{k\to o} \cdot \sigma(w_{j\to k} \cdot z_j) - y_i) \\
&= (\hat{y}_i - y_i)(w_{k\to o})\frac{\partial}{\partial w_{j\to k}}\sigma(w_{j\to k} \cdot z_j) \\
&= (\hat{y}_i - y_i)(w_{k\to o})\,\sigma(s_k)(1 - \sigma(s_k))\frac{\partial}{\partial w_{j\to k}}(w_{j\to k} \cdot z_j) \\
&= (\hat{y}_i - y_i)(w_{k\to o})\,\sigma(s_k)(1 - \sigma(s_k))(z_j)
\end{aligned}
\]
Again, finding the weight update for $w_{i\to j}$ consists of some straightforward calculus:
\[
\begin{aligned}
\frac{\partial E}{\partial w_{i\to j}} &= \frac{\partial}{\partial w_{i\to j}} \frac{1}{2}(\hat{y}_i - y_i)^2 \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial w_{i\to j}}(\hat{y}_i - y_i) \\
&= (\hat{y}_i - y_i)(w_{k\to o})\frac{\partial}{\partial w_{i\to j}}\sigma(w_{j\to k} \cdot \sigma(w_{i\to j} \cdot x_i)) \\
&= (\hat{y}_i - y_i)(w_{k\to o})(\sigma(s_k)(1 - \sigma(s_k)))(w_{j\to k})\frac{\partial}{\partial w_{i\to j}}\sigma(w_{i\to j} \cdot x_i) \\
&= (\hat{y}_i - y_i)(w_{k\to o})(\sigma(s_k)(1 - \sigma(s_k)))(w_{j\to k})(\sigma(s_j)(1 - \sigma(s_j)))(x_i)
\end{aligned}
\]

By now, you should be seeing a pattern emerging, a pattern that hopefully we could encode with backpropagation. We are reusing multiple values as we compute the updates for weights that appear earlier and earlier in the network. Specifically, we see the derivative of the network error, the weighted derivative of unit $k$'s output with respect to $s_k$, and the weighted derivative of unit $j$'s output with respect to $s_j$.
So, in summary, for this simple network, we have:
\[
\begin{aligned}
\Delta w_{i\to j} &= -\eta\left[(\hat{y}_i - y_i)(w_{k\to o})(\sigma(s_k)(1 - \sigma(s_k)))(w_{j\to k})(\sigma(s_j)(1 - \sigma(s_j)))(x_i)\right] \\
\Delta w_{j\to k} &= -\eta\left[(\hat{y}_i - y_i)(w_{k\to o})(\sigma(s_k)(1 - \sigma(s_k)))(z_j)\right] \\
\Delta w_{k\to o} &= -\eta\left[(\hat{y}_i - y_i)(z_k)\right]
\end{aligned}
\]
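To check the pattern numerically, here is a small Python sketch (my own illustration, not code from the post) that evaluates the one-path network and the three hand-derived gradients, and compares one of them against a finite-difference estimate. The variable names and example values are arbitrary:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w1, w2, w3, x, y):
    """Forward pass of the one-path network; returns the intermediates and the error."""
    s_j = w1 * x
    z_j = sigmoid(s_j)
    s_k = w2 * z_j
    z_k = sigmoid(s_k)
    y_hat = w3 * z_k            # linear output unit
    E = 0.5 * (y_hat - y) ** 2
    return s_j, z_j, s_k, z_k, y_hat, E

def hand_gradients(w1, w2, w3, x, y):
    """The three derivatives exactly as derived above."""
    s_j, z_j, s_k, z_k, y_hat, _ = forward(w1, w2, w3, x, y)
    dE_dw3 = (y_hat - y) * z_k
    dE_dw2 = (y_hat - y) * w3 * sigmoid(s_k) * (1 - sigmoid(s_k)) * z_j
    dE_dw1 = (y_hat - y) * w3 * sigmoid(s_k) * (1 - sigmoid(s_k)) * w2 \
             * sigmoid(s_j) * (1 - sigmoid(s_j)) * x
    return dE_dw1, dE_dw2, dE_dw3

# Sanity check against a central finite difference (arbitrary example values).
w1, w2, w3, x, y = 0.5, -0.3, 0.8, 1.5, 1.0
eps = 1e-6
numeric_dw1 = (forward(w1 + eps, w2, w3, x, y)[-1]
               - forward(w1 - eps, w2, w3, x, y)[-1]) / (2 * eps)
print(hand_gradients(w1, w2, w3, x, y)[0], numeric_dw1)  # the two should agree closely
```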

Case 2: Handling multiple inputs

Consider the more complicated network, where a unit may have more than one input:

What happens to a weight when it leads to a unit that has multiple inputs? Is $w_{i\to k}$'s update rule affected by $w_{j\to k}$'s update rule? To see, let's derive the update for $w_{i\to k}$ by hand:
\[
\begin{aligned}
\frac{\partial E}{\partial w_{i\to k}} &= \frac{\partial}{\partial w_{i\to k}} \frac{1}{2}(\hat{y}_i - y_i)^2 \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial w_{i\to k}}(z_k w_{k\to o}) \\
&= (\hat{y}_i - y_i)(w_{k\to o})\frac{\partial}{\partial w_{i\to k}}\sigma(s_k) \\
&= (\hat{y}_i - y_i)(\sigma(s_k)(1 - \sigma(s_k))w_{k\to o})\frac{\partial}{\partial w_{i\to k}}(z_i w_{i\to k} + z_j w_{j\to k}) \\
&= (\hat{y}_i - y_i)(\sigma(s_k)(1 - \sigma(s_k))w_{k\to o})\,z_i
\end{aligned}
\]

Here we see that the update for $w_{i\to k}$ does not depend on $w_{j\to k}$'s derivative, leading to our first rule: the derivative for a weight is not dependent on the derivatives of any of the other weights in the same layer. Thus we can update weights in the same layer in isolation. There is a natural ordering of the updates - they only depend on the values of other weights in the same layer, and (as we shall see) the derivatives of weights further along in the network. This ordering is good news for the backpropagation algorithm.

Case 3: Handling multiple outputs

Now let's examine the case where a hidden unit has more than one output.

Based on the previous sections, the only "new" type of weight update is the derivative of $w_{in\to i}$. The difference in the multiple output case is that unit $i$ has more than one immediate successor, so (spoiler!) we must sum the error accumulated along all paths that are rooted at unit $i$. Let's explicitly derive the weight update for $w_{in\to i}$ (to keep track of what's going on, we define $\sigma_i(\cdot)$ as the activation function for unit $i$):
\[
\begin{aligned}
\frac{\partial E}{\partial w_{in\to i}} &= \frac{\partial}{\partial w_{in\to i}} \frac{1}{2}(\hat{y}_i - y_i)^2 \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial w_{in\to i}}(z_j w_{j\to o} + z_k w_{k\to o}) \\
&= (\hat{y}_i - y_i)\frac{\partial}{\partial w_{in\to i}}(\sigma_j(s_j) w_{j\to o} + \sigma_k(s_k) w_{k\to o}) \\
&= (\hat{y}_i - y_i)\left(w_{j\to o}\sigma_j'(s_j)\frac{\partial s_j}{\partial w_{in\to i}} + w_{k\to o}\sigma_k'(s_k)\frac{\partial s_k}{\partial w_{in\to i}}\right) \\
&= (\hat{y}_i - y_i)\left(w_{j\to o}\sigma_j'(s_j)\frac{\partial}{\partial w_{in\to i}} z_i w_{i\to j} + w_{k\to o}\sigma_k'(s_k)\frac{\partial}{\partial w_{in\to i}} z_i w_{i\to k}\right) \\
&= (\hat{y}_i - y_i)\left(w_{j\to o}\sigma_j'(s_j)\frac{\partial}{\partial w_{in\to i}} \sigma_i(s_i) w_{i\to j} + w_{k\to o}\sigma_k'(s_k)\frac{\partial}{\partial w_{in\to i}} \sigma_i(s_i) w_{i\to k}\right) \\
&= (\hat{y}_i - y_i)\left(w_{j\to o}\sigma_j'(s_j) w_{i\to j}\sigma_i'(s_i)\frac{\partial s_i}{\partial w_{in\to i}} + w_{k\to o}\sigma_k'(s_k) w_{i\to k}\sigma_i'(s_i)\frac{\partial s_i}{\partial w_{in\to i}}\right) \\
&= (\hat{y}_i - y_i)\left(w_{j\to o}\sigma_j'(s_j) w_{i\to j}\sigma_i'(s_i) + w_{k\to o}\sigma_k'(s_k) w_{i\to k}\sigma_i'(s_i)\right)x_i
\end{aligned}
\]
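As a sanity check on that last line, here is a short Python sketch (again my own illustration, with arbitrary names and values) that evaluates the multiple-output network $in \to i \to \{j, k\} \to o$ and compares the derived expression for $\frac{\partial E}{\partial w_{in\to i}}$ against a finite-difference estimate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

def error(w_in_i, w_i_j, w_i_k, w_j_o, w_k_o, x, y):
    """Forward pass of the multiple-output network: in -> i -> {j, k} -> o."""
    s_i = w_in_i * x
    z_i = sigmoid(s_i)
    s_j, s_k = w_i_j * z_i, w_i_k * z_i
    z_j, z_k = sigmoid(s_j), sigmoid(s_k)
    y_hat = z_j * w_j_o + z_k * w_k_o    # linear output unit
    return 0.5 * (y_hat - y) ** 2, s_i, s_j, s_k, y_hat

# Arbitrary example values.
params = dict(w_in_i=0.4, w_i_j=-0.6, w_i_k=0.9, w_j_o=0.7, w_k_o=-0.2, x=1.3, y=0.5)
E, s_i, s_j, s_k, y_hat = error(**params)

# The derivative we just derived: a sum over the two paths rooted at unit i.
analytic = (y_hat - params["y"]) * (
    params["w_j_o"] * dsigmoid(s_j) * params["w_i_j"] * dsigmoid(s_i)
    + params["w_k_o"] * dsigmoid(s_k) * params["w_i_k"] * dsigmoid(s_i)
) * params["x"]

# Finite-difference check on w_in_i.
eps = 1e-6
E_plus = error(**{**params, "w_in_i": params["w_in_i"] + eps})[0]
E_minus = error(**{**params, "w_in_i": params["w_in_i"] - eps})[0]
print(analytic, (E_plus - E_minus) / (2 * eps))  # the two should agree closely
```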

There are two things to note here. The first, and most relevant, is our second derived rule: the weight update for a weight leading to a unit with multiple outputs is dependent on derivatives that reside on both paths.

But more generally, and more importantly, we begin to see the relation between backpropagation and forward propagation. During backpropagation, we compute the error of the output. We then pass the error backward and weight it along each edge. When we come to a unit, we multiply the weighted backpropagated error by the unit's derivative. We then continue backpropagating this error in the same fashion, all the way to the input. Backpropagation, much like forward propagation, is a recursive algorithm. In the next section, I introduce the notion of an  error signal, which allows us to rewrite our weight updates in a compact form.

Error Signals

Deriving all of the weight updates by hand is intractable, especially if we have hundreds of units and many layers. But we saw a pattern emerge in the last few sections - the error is propagated backwards through the network. In this section, we define the error signal, which is simply the accumulated error at each unit. For now, let's just consider the contribution of a single training instance (so we use $\hat{y}$ instead of $\hat{y}_i$).

We define the recursive error signal at unit $j$ as:
\[
\delta_j = \frac{\partial E}{\partial s_j}
\]
In layman's terms, it is a measure of how much the network error varies with the input to unit $j$. Using the error signal has some nice properties - namely, we can rewrite backpropagation in a more compact form. To see this, let's expand $\delta_j$:
\[
\begin{aligned}
\delta_j &= \frac{\partial E}{\partial s_j} \\
&= \frac{\partial}{\partial s_j} \frac{1}{2}(\hat{y} - y)^2 \\
&= (\hat{y} - y)\frac{\partial \hat{y}}{\partial s_j}
\end{aligned}
\]
Consider the case where unit $j$ is an output node. This means that $\hat{y} = f_j(s_j)$ (if unit $j$'s activation function is $f_j(\cdot)$), so $\frac{\partial \hat{y}}{\partial s_j}$ is simply $f_j'(s_j)$, giving us $\delta_j = (\hat{y} - y)f_j'(s_j)$.

Otherwise, unit $j$ is a hidden node that leads to another layer of nodes $k \in \text{outs}(j)$. We can expand $\frac{\partial \hat{y}}{\partial s_j}$ further, using the chain rule:
\[
\frac{\partial \hat{y}}{\partial s_j} = \frac{\partial \hat{y}}{\partial z_j}\frac{\partial z_j}{\partial s_j} = \frac{\partial \hat{y}}{\partial z_j}f_j'(s_j)
\]
Take note of the term $\frac{\partial \hat{y}}{\partial z_j}$. Multiple units depend on $z_j$; specifically, all of the units $k \in \text{outs}(j)$. We saw in the section on multiple outputs that a weight leading to a unit with multiple outputs does have an effect on those output units. But for each unit $k$ we have $s_k = z_j w_{j\to k}$, with no $s_k$ depending on any of the others. Therefore, we can use the chain rule again and sum over the output nodes $k \in \text{outs}(j)$:
\[
\frac{\partial \hat{y}}{\partial s_j} = f_j'(s_j)\sum_{k \in \text{outs}(j)} \frac{\partial \hat{y}}{\partial s_k}\frac{\partial s_k}{\partial z_j} = f_j'(s_j)\sum_{k \in \text{outs}(j)} \frac{\partial \hat{y}}{\partial s_k} w_{j\to k}
\]
Plugging this back into $\delta_j = (\hat{y} - y)\frac{\partial \hat{y}}{\partial s_j}$, we get:
\[
\delta_j = (\hat{y} - y)f_j'(s_j)\sum_{k \in \text{outs}(j)} \frac{\partial \hat{y}}{\partial s_k} w_{j\to k}
\]
Based on our definition of the error signal, we know that $\delta_k = (\hat{y} - y)\frac{\partial \hat{y}}{\partial s_k}$, so if we push $(\hat{y} - y)$ into the summation, we get the following recursive relation:
\[
\delta_j = f_j'(s_j)\sum_{k \in \text{outs}(j)} \delta_k w_{j\to k}
\]
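The recursion above maps directly onto a memoized function. Here is a minimal Python sketch under my own assumptions about how the network is stored (an `outs` adjacency map, a weight dict `w`, precomputed derivatives `f_prime`, and a single output unit); none of these names come from the post:

```python
# A minimal sketch of the recursive error signal, under the assumed storage:
# outs[j] lists the successors of unit j, w[(j, k)] is the weight on edge j -> k,
# f_prime[j] holds f_j'(s_j) from the forward pass, and `output` is the output unit.
def error_signal(j, y_hat, y, outs, w, f_prime, output, cache=None):
    cache = {} if cache is None else cache
    if j in cache:                  # reuse signals already computed (dynamic programming)
        return cache[j]
    if j == output:                 # base case: delta_o = (y_hat - y) * f_o'(s_o)
        delta = (y_hat - y) * f_prime[j]
    else:                           # recursive case: delta_j = f_j'(s_j) * sum_k delta_k * w_{j->k}
        delta = f_prime[j] * sum(
            error_signal(k, y_hat, y, outs, w, f_prime, output, cache) * w[(j, k)]
            for k in outs[j])
    cache[j] = delta
    return delta
```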
We now have a compact representation of the backpropagated error. The last thing to do is tie everything together with a general algorithm.

The general form of backpropagation

Recall the simple network from the first section:

We can use the definition of $\delta_i$ to derive the values of all the error signals in the network:
\[
\begin{aligned}
\delta_o &= (\hat{y} - y) && \text{(the derivative of a linear function is 1)} \\
\delta_k &= \delta_o w_{k\to o}\,\sigma(s_k)(1 - \sigma(s_k)) \\
\delta_j &= \delta_k w_{j\to k}\,\sigma(s_j)(1 - \sigma(s_j))
\end{aligned}
\]
Also remember that the explicit weight updates for this network were of the form:
\[
\begin{aligned}
\Delta w_{i\to j} &= -\eta\left[(\hat{y}_i - y_i)(w_{k\to o})(\sigma(s_k)(1 - \sigma(s_k)))(w_{j\to k})(\sigma(s_j)(1 - \sigma(s_j)))(x_i)\right] \\
\Delta w_{j\to k} &= -\eta\left[(\hat{y}_i - y_i)(w_{k\to o})(\sigma(s_k)(1 - \sigma(s_k)))(z_j)\right] \\
\Delta w_{k\to o} &= -\eta\left[(\hat{y}_i - y_i)(z_k)\right]
\end{aligned}
\]
By substituting each of the error signals, we get:
\[
\begin{aligned}
\Delta w_{k\to o} &= -\eta\delta_o z_k \\
\Delta w_{j\to k} &= -\eta\delta_k z_j \\
\Delta w_{i\to j} &= -\eta\delta_j x_i
\end{aligned}
\]
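A quick way to see the compression the error signals buy us: the following Python sketch (illustrative values and names of my own choosing) computes $\delta_o$, $\delta_k$, and $\delta_j$ for the simple chain network and forms the updates in the compact form; the numbers it prints should match what the explicit Case 1 expressions give:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass of the simple chain network (illustrative values).
w_ij, w_jk, w_ko, x, y, eta = 0.5, -0.3, 0.8, 1.5, 1.0, 0.1
s_j = w_ij * x;   z_j = sigmoid(s_j)
s_k = w_jk * z_j; z_k = sigmoid(s_k)
y_hat = w_ko * z_k                      # linear output unit

# Error signals, computed backwards through the network.
delta_o = (y_hat - y)                   # derivative of a linear output is 1
delta_k = delta_o * w_ko * sigmoid(s_k) * (1 - sigmoid(s_k))
delta_j = delta_k * w_jk * sigmoid(s_j) * (1 - sigmoid(s_j))

# Weight updates in the compact form Delta w = -eta * delta * (incoming z).
dw_ko = -eta * delta_o * z_k
dw_jk = -eta * delta_k * z_j
dw_ij = -eta * delta_j * x
print(dw_ij, dw_jk, dw_ko)
```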
As another example, let's look at the more complicated network from the section on handling multiple outputs:
We can again derive all of the error signals:
\[
\begin{aligned}
\delta_o &= (\hat{y} - y) \\
\delta_k &= \delta_o w_{k\to o}\,\sigma(s_k)(1 - \sigma(s_k)) \\
\delta_j &= \delta_o w_{j\to o}\,\sigma(s_j)(1 - \sigma(s_j)) \\
\delta_i &= \sigma(s_i)(1 - \sigma(s_i))\sum_{k \in \text{outs}(i)} \delta_k w_{i\to k}
\end{aligned}
\]
Although we did not derive all of these weight updates by hand, by using the error signals, the weight updates become (and you can check this by hand, if you'd like):
\[
\begin{aligned}
\Delta w_{k\to o} &= -\eta\delta_o z_k \\
\Delta w_{j\to o} &= -\eta\delta_o z_j \\
\Delta w_{i\to k} &= -\eta\delta_k z_i \\
\Delta w_{i\to j} &= -\eta\delta_j z_i \\
\Delta w_{in\to i} &= -\eta\delta_i x_i
\end{aligned}
\]
It should be clear by now that we've derived a general form of the weight updates, which is simply $\Delta w_{i\to j} = -\eta\delta_j z_i$.

The last thing to consider is the case where we use a minibatch of instances to compute the gradient. Because we treat each training instance $y_i$ as independent, we sum over all training instances to compute the full update for a weight (we typically scale by the minibatch size $N$ so that steps are not sensitive to the magnitude of $N$). For each separate training instance $y_i$, we add a superscript $(y_i)$ to the values that change for each training example:
\[
\Delta w_{i\to j} = -\frac{\eta}{N}\sum_{y_i} \delta_j^{(y_i)} z_i^{(y_i)}
\]
Thus, the general form of the backpropagation algorithm for updating the weights consists of the following steps (a loop-based sketch follows the list):
  1. Feed the training instances forward through the network, and record each $s_j^{(y_i)}$ and $z_j^{(y_i)}$.
  2. Calculate the error signal $\delta_j^{(y_i)}$ for all units $j$ and each training example $y_i$. If $j$ is an output node, then $\delta_j^{(y_i)} = f_j'(s_j^{(y_i)})(\hat{y}_i - y_i)$. If $j$ is not an output node, then $\delta_j^{(y_i)} = f_j'(s_j^{(y_i)})\sum_{k \in \text{outs}(j)} \delta_k^{(y_i)} w_{j\to k}$.
  3. Update the weights with the rule $\Delta w_{i\to j} = -\frac{\eta}{N}\sum_{y_i} \delta_j^{(y_i)} z_i^{(y_i)}$.
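Ahead of the matrix form in the next post, here is a loop-based Python sketch of these three steps for a fully connected network with sigmoid hidden units and a single linear output. The representation (`weights[l][i][j]` connecting unit i in layer l to unit j in layer l+1) and all names are my own assumptions, not the post's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_minibatch(weights, batch, eta=0.1):
    """One minibatch update for a fully connected network with sigmoid hidden
    units and a single linear output unit. weights[l][i][j] connects unit i in
    layer l to unit j in layer l + 1; batch is a list of (x, y) pairs where x is
    a list of inputs and y is a scalar target."""
    N = len(batch)
    grads = [[[0.0 for _ in row] for row in layer] for layer in weights]

    for x, y in batch:
        # Step 1: forward pass, recording s and z for every layer.
        zs, ss = [x], []
        for l, layer in enumerate(weights):
            s = [sum(zs[-1][i] * layer[i][j] for i in range(len(layer)))
                 for j in range(len(layer[0]))]
            last = (l == len(weights) - 1)
            zs.append(s if last else [sigmoid(v) for v in s])  # linear output, sigmoid hidden
            ss.append(s)

        # Step 2: error signals, from the output unit backwards.
        y_hat = zs[-1][0]
        deltas = [None] * len(weights)
        deltas[-1] = [y_hat - y]                # f'(s) = 1 for the linear output
        for l in range(len(weights) - 2, -1, -1):
            deltas[l] = [sigmoid(ss[l][j]) * (1 - sigmoid(ss[l][j]))
                         * sum(deltas[l + 1][k] * weights[l + 1][j][k]
                               for k in range(len(deltas[l + 1])))
                         for j in range(len(ss[l]))]

        # Step 3a: accumulate dE/dw_{i->j} = delta_j * z_i for this example.
        for l in range(len(weights)):
            for i in range(len(weights[l])):
                for j in range(len(weights[l][i])):
                    grads[l][i][j] += deltas[l][j] * zs[l][i]

    # Step 3b: average over the minibatch and take a gradient step.
    for l in range(len(weights)):
        for i in range(len(weights[l])):
            for j in range(len(weights[l][i])):
                weights[l][i][j] -= eta / N * grads[l][i][j]
    return weights
```

For a 2-3-1 network, for example, `weights` would be a list holding a 2x3 and a 3x1 nested list of small random values; the matrix form in the next post replaces the inner loops with matrix-vector products.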

Conclusions

Hopefully you've gained a full understanding of the backpropagation algorithm with this derivation. Although we've fully derived the general backpropagation algorithm in this chapter, it's still not in a form amenable to programming or scaling up. In the next post, I will go over the matrix form of backpropagation, along with a working example that trains a basic neural network on MNIST.

