A Derivation of Backpropagation in Matrix Form (Repost)


Backpropagation is an algorithm for training neural networks, used together with an optimization routine such as gradient descent. Gradient descent needs the gradient of the loss function with respect to every weight in the network in order to perform a weight update and minimize the loss. Backpropagation computes these gradients in a systematic way. Backpropagation together with gradient descent is arguably the single most important algorithm for training deep neural networks and could be said to be the driving force behind the recent emergence of deep learning.

Any layer of a neural network can be considered as an affine transformation followed by the application of a nonlinear function. A vector is received as input and multiplied with a matrix to produce an output, to which a bias vector may be added before passing the result through an activation function such as the sigmoid.

$$\text{Input} = x$$
$$\text{Output} = f(Wx + b)$$
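As a minimal NumPy sketch of this view of a single layer (the sigmoid activation and the 3-by-4 weight shape are illustrative assumptions, not taken from the text):

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 4 inputs, 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # weight matrix
b = rng.standard_normal((3, 1))   # bias vector
x = rng.standard_normal((4, 1))   # input column vector

output = sigmoid(W @ x + b)       # f(Wx + b)
```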

Consider a neural network with two hidden layers like the one shown below. It has no bias units. We derive the forward and backward pass equations in their matrix form.

[Figure: Neural Network]

The forward propagation equations are as follows:

$$\text{Input} = x_0$$
$$\text{Hidden layer 1 output} = x_1 = f_1(W_1 x_0)$$
$$\text{Hidden layer 2 output} = x_2 = f_2(W_2 x_1)$$
$$\text{Output} = x_3 = f_3(W_3 x_2)$$
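A direct NumPy translation of this forward pass might look like the sketch below. The layer sizes (5, 3 and 2 units) follow the dimensionality checks used later in the derivation; the input size of 4 and the sigmoid activations are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Dimensions follow the sanity checks below; the input size (4) is arbitrary.
x0 = rng.standard_normal((4, 1))   # input
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
W3 = rng.standard_normal((2, 3))

f1 = f2 = f3 = sigmoid             # activation functions (assumed sigmoid)

x1 = f1(W1 @ x0)                   # hidden layer 1 output
x2 = f2(W2 @ x1)                   # hidden layer 2 output
x3 = f3(W3 @ x2)                   # network output
```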

To train this neural network, you could use either batch gradient descent or stochastic gradient descent. Stochastic gradient descent uses a single instance of data to perform a weight update, whereas batch gradient descent uses a complete batch of data.

For simplicity, let's assume this is a multiple regression problem.

Stochastic update loss function: $E = \frac{1}{2}\|z - t\|_2^2$

Batch update loss function: $E = \frac{1}{2}\sum_{i \in \text{Batch}} \|z_i - t_i\|_2^2$

Here $t$ is the ground truth for that instance and $z$ is the network's output for it.

We will only consider the stochastic update loss function. All the results hold for the batch version as well.
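For concreteness, both loss functions could be written as short helpers like the following sketch (here Z and T are assumed to hold one instance per column):

```python
import numpy as np

def stochastic_loss(z, t):
    # E = 1/2 * ||z - t||_2^2 for a single instance.
    return 0.5 * np.sum((z - t) ** 2)

def batch_loss(Z, T):
    # Sum of per-instance losses over a batch; Z and T hold one instance per column.
    return 0.5 * np.sum((Z - T) ** 2)
```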

Let us look at the loss function from a different perspective. Given an input $x_0$, the output $x_3$ is determined by $W_1, W_2$ and $W_3$. So the only tunable parameters in $E$ are $W_1, W_2$ and $W_3$. To reduce the value of the error function, we have to change these weights in the negative direction of the gradient of the loss function with respect to these weights.

$$w = w - \alpha_w \frac{\partial E}{\partial w} \quad \text{for all the weights } w$$

Here $\alpha_w$ is a scalar for this particular weight, called the learning rate. Its value is decided by the optimization technique used. I highly recommend reading An overview of gradient descent optimization algorithms for more information about the various gradient descent techniques and learning rates.
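A plain gradient descent step for a single weight could be sketched as below; the default learning rate of 0.01 is an arbitrary choice, and more elaborate optimizers mainly change how this step is computed:

```python
def gradient_descent_step(w, dE_dw, learning_rate=0.01):
    # Move w in the negative direction of its gradient.
    return w - learning_rate * dE_dw
```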

Backpropagation equations can be derived by repeatedly applying the chain rule. First we derive these for the weights in $W_3$:

$$
\begin{aligned}
\frac{\partial E}{\partial W_3} &= (x_3 - t)\,\frac{\partial x_3}{\partial W_3} \\
&= \left[(x_3 - t) \circ f_3'(W_3 x_2)\right]\frac{\partial (W_3 x_2)}{\partial W_3} \\
&= \left[(x_3 - t) \circ f_3'(W_3 x_2)\right] x_2^T
\end{aligned}
$$

Let $\delta_3 = (x_3 - t) \circ f_3'(W_3 x_2)$, so that

$$\frac{\partial E}{\partial W_3} = \delta_3 x_2^T$$

Here $\circ$ denotes the Hadamard (element-wise) product. Let's sanity check this by looking at the dimensionalities. $\frac{\partial E}{\partial W_3}$ must have the same dimensions as $W_3$, which is $2 \times 3$. The dimensions of $(x_3 - t)$ are $2 \times 1$ and $f_3'(W_3 x_2)$ is also $2 \times 1$, so $\delta_3$ is $2 \times 1$ as well. $x_2$ is $3 \times 1$, so the dimensions of $\delta_3 x_2^T$ are $2 \times 3$, which is the same as $W_3$.
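In code, the same gradient and dimensionality check might look like the following sketch (the sigmoid activation is an assumption, and the target t is random, used purely for shape checking):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W3 = rng.standard_normal((2, 3))
x2 = rng.standard_normal((3, 1))
t  = rng.standard_normal((2, 1))

x3 = sigmoid(W3 @ x2)

delta3 = (x3 - t) * sigmoid_prime(W3 @ x2)   # Hadamard product, 2x1
dE_dW3 = delta3 @ x2.T                       # outer product, 2x3

assert dE_dW3.shape == W3.shape              # same dimensions as W3
```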

Now for the weights in $W_2$:

$$
\begin{aligned}
\frac{\partial E}{\partial W_2} &= (x_3 - t)\,\frac{\partial x_3}{\partial W_2}
= \left[(x_3 - t) \circ f_3'(W_3 x_2)\right]\frac{\partial (W_3 x_2)}{\partial W_2}
= \delta_3\,\frac{\partial (W_3 x_2)}{\partial W_2} \\
&= W_3^T \delta_3\,\frac{\partial x_2}{\partial W_2}
= \left[W_3^T \delta_3 \circ f_2'(W_2 x_1)\right]\frac{\partial (W_2 x_1)}{\partial W_2}
= \delta_2\, x_1^T
\end{aligned}
$$

where $\delta_2 = W_3^T \delta_3 \circ f_2'(W_2 x_1)$.

Let's sanity check this too. $W_2$'s dimensions are $3 \times 5$. $\delta_3$ is $2 \times 1$ and $W_3$ is $2 \times 3$, so $W_3^T \delta_3$ is $3 \times 1$. $f_2'(W_2 x_1)$ is $3 \times 1$, so $\delta_2$ is also $3 \times 1$. $x_1$ is $5 \times 1$, so $\delta_2 x_1^T$ is $3 \times 5$, which is the same as $W_2$. So this checks out.
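The corresponding sketch for $\delta_2$ and $\partial E / \partial W_2$, again with an assumed sigmoid activation and random data used only to verify the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W2, W3 = rng.standard_normal((3, 5)), rng.standard_normal((2, 3))
x1 = rng.standard_normal((5, 1))
t  = rng.standard_normal((2, 1))

x2 = sigmoid(W2 @ x1)
x3 = sigmoid(W3 @ x2)

delta3 = (x3 - t) * sigmoid_prime(W3 @ x2)          # 2x1
delta2 = (W3.T @ delta3) * sigmoid_prime(W2 @ x1)   # 3x1, backpropagated error
dE_dW2 = delta2 @ x1.T                              # 3x5

assert dE_dW2.shape == W2.shape
```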

Similarly for $W_1$:

$$\frac{\partial E}{\partial W_1} = \left[W_2^T \delta_2 \circ f_1'(W_1 x_0)\right] x_0^T = \delta_1 x_0^T$$

We can observe a recursive pattern emerging in the backpropagation equations. The forward and backward passes can be summarized as follows:

The neural network has $L$ layers. $x_0$ is the input vector, $x_L$ is the output vector, and $t$ is the truth vector. The weight matrices are $W_1, W_2, \ldots, W_L$ and the activation functions are $f_1, f_2, \ldots, f_L$.

Forward Pass:

$$x_i = f_i(W_i x_{i-1})$$
$$E = \tfrac{1}{2}\|x_L - t\|_2^2$$

Backward Pass:

$$\delta_L = (x_L - t) \circ f_L'(W_L x_{L-1})$$
$$\delta_i = W_{i+1}^T \delta_{i+1} \circ f_i'(W_i x_{i-1})$$
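The backward pass translates into a short loop over the layers. The sketch below assumes sigmoid activations and 0-based Python lists, so W[0] holds $W_1$ and x[i] holds $x_i$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward_pass(W, x, t):
    # W: list of weight matrices [W1, ..., WL] (index 0 holds W1).
    # x: list of layer outputs [x0, ..., xL] cached during the forward pass.
    # Returns the deltas [delta1, ..., deltaL].
    L = len(W)
    deltas = [None] * L
    # delta_L = (x_L - t) o f_L'(W_L x_{L-1})
    deltas[L - 1] = (x[L] - t) * sigmoid_prime(W[L - 1] @ x[L - 1])
    # delta_i = W_{i+1}^T delta_{i+1} o f_i'(W_i x_{i-1})
    for i in range(L - 2, -1, -1):
        deltas[i] = (W[i + 1].T @ deltas[i + 1]) * sigmoid_prime(W[i] @ x[i])
    return deltas
```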

Weight Update:

$$\frac{\partial E}{\partial W_i} = \delta_i\, x_{i-1}^T$$
$$W_i = W_i - \alpha_{W_i} \circ \frac{\partial E}{\partial W_i}$$
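Finally, the gradients and the weight update, continuing from the forward and backward sketches above (a single scalar learning rate is used here for simplicity, although the equation allows a per-weight $\alpha_{W_i}$):

```python
def update_weights(W, x, deltas, lr=0.01):
    # W: [W1, ..., WL]; x: [x0, ..., xL] from the forward pass;
    # deltas: [delta1, ..., deltaL] from backward_pass above.
    for i in range(len(W)):
        dE_dWi = deltas[i] @ x[i].T      # dE/dW_i = delta_i x_{i-1}^T
        W[i] = W[i] - lr * dE_dWi        # gradient descent step
    return W
```

Running the forward pass, backward_pass and update_weights repeatedly over individual instances is ordinary stochastic gradient descent; summing the gradients over a batch before updating gives the batch version.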

The backpropagation equations, represented using matrices, have two advantages.

One could easily convert these equations to code using either NumPy in Python or MATLAB. The matrix form is much closer to the way neural networks are implemented in libraries. Using matrix operations also speeds up the implementation, since one can use high-performance matrix primitives from BLAS. GPUs are likewise well suited to these computations, because matrix operations parallelize naturally.

The matrix version of backpropagation is intuitive to derive and easy to remember, as it avoids the confusing and cluttered derivations involving summations and multiple subscripts.
