Intro to Deep Learning & Backpropagation: an introduction to deep learning models and a detailed derivation of the backpropagation algorithm



Deep Learning vs. Machine Learning

The major difference between deep learning and classical machine learning techniques is the problem-solving approach. Deep learning techniques tend to solve a problem end to end, whereas classical machine learning techniques usually need the problem to be broken down into parts that are solved separately, with the results combined at the final stage.

Forward Propagation

The general procedure is the following:

$$
\begin{aligned}
a^{(1)}(x) &= w^{(1)^T} \cdot x + b^{(1)} \\
h^{(1)}(x) &= g_1(a^{(1)}(x)) \\
a^{(2)}(x) &= w^{(2)^T} \cdot h^{(1)}(x) + b^{(2)} \\
h^{(2)}(x) &= g_2(a^{(2)}(x)) \\
&\ \ \vdots \\
a^{(L+1)}(x) &= w^{(L+1)^T} \cdot h^{(L)}(x) + b^{(L+1)} \\
h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x))
\end{aligned}
$$


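  • $w^{(i)^T}$ has dimension (# of (hidden) units in layer $i$) $\times$ (# of (hidden) units in layer $i-1$), so $w^{(i)}$ itself has dimension (# of (hidden) units in layer $i-1$) $\times$ (# of (hidden) units in layer $i$). Layer $0$ is the input $x$.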

  • $b^{(i)}$, $a^{(i)}$, and $h^{(i)}$ have the same dimension: (# of (hidden) units in layer $i$) $\times$ $1$.

  • $g_i$ is an activation function. Sigmoid, tanh, and ReLU are common activation functions. In the last layer, the choice of activation function $g_{L+1}$ depends on the problem; sigmoid and softmax are the usual choices.
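The forward pass above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the helper names (`affine`, `forward`), the toy layer sizes, and the weight values are all made up for the example. It assumes each $w^{(i)}$ is stored as a nested list with one row per unit of layer $i-1$ and one column per unit of layer $i$, so that $w^{(i)^T} \cdot h^{(i-1)}(x)$ is well defined.

```python
import math

def sigmoid(z):
    """Element-wise logistic sigmoid on a list of floats."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    """Softmax on a list of floats, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(w, h, b):
    """Compute a = w^T h + b, with w stored as n_prev rows by n_cur columns."""
    n_prev, n_cur = len(w), len(w[0])
    return [sum(w[i][j] * h[i] for i in range(n_prev)) + b[j]
            for j in range(n_cur)]

def forward(x, weights, biases, activations):
    """Apply h^(k) = g_k(w^(k)T h^(k-1) + b^(k)) layer by layer, with h^(0) = x."""
    h = x
    for w, b, g in zip(weights, biases, activations):
        h = g(affine(w, h, b))
    return h

# Hypothetical toy network: 3 inputs -> 2 hidden units (sigmoid) -> 4 outputs (softmax).
w1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]           # 3 rows, 2 columns
b1 = [0.0, 0.1]
w2 = [[0.3, -0.1, 0.2, 0.0], [0.5, 0.1, -0.3, 0.2]]   # 2 rows, 4 columns
b2 = [0.0, 0.0, 0.0, 0.0]

f_x = forward([1.0, 2.0, 3.0], [w1, w2], [b1, b2], [sigmoid, softmax])
print(f_x)  # four positive numbers that sum to 1
```

With softmax as $g_{L+1}$, the output can be read as a probability vector over the four classes.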

Loss functions of neural networks

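Common loss functions include cross-entropy loss, hinge loss, and triplet loss. In fact, depending on the specific problem, we can define an arbitrary loss function. We can even build a loss around AUC, although AUC itself is not differentiable, so in practice this usually means optimizing a differentiable surrogate.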

In this post, we focus on the cross-entropy loss.

For example, let the output of the neural network be $(0.6, 0.2, 0.1, 0.1)$ and the true label be $(0, 1, 0, 0)$. Then the cross-entropy loss is

$$
\ell \left(\left[\begin{array}{l} 0.6 \\ 0.2 \\ 0.1 \\ 0.1 \end{array}\right],\left[\begin{array}{l} 0 \\ 1 \\ 0 \\ 0 \end{array}\right]\right) = -(0 \cdot \log 0.6 + 1 \cdot \log 0.2 + 0 \cdot \log 0.1 + 0 \cdot \log 0.1) = -\log 0.2
$$

From now on, we denote the output of the neural network by $f(x)$ and the true label of $x$ by $y$. The cross-entropy loss is then written as

$$
\ell \left( f(x), y\right) = - \log f(x)_y
$$

where $f(x)_y$ is the $y$-th entry of $f(x)$.
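As a quick check of the worked example, the sketch below evaluates the loss both as the full sum over the one-hot label and as the single term $-\log f(x)_y$. The function names are illustrative, and the code assumes a 0-based class index, so the label $(0, 1, 0, 0)$ corresponds to `y = 1`.

```python
import math

def cross_entropy(f_x, y):
    """Return -log f(x)_y, where y is the 0-based index of the true class."""
    return -math.log(f_x[y])

def cross_entropy_one_hot(f_x, one_hot):
    """Same loss written as the full sum -sum_j y_j * log f(x)_j."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, f_x))

f_x = [0.6, 0.2, 0.1, 0.1]
loss_index = cross_entropy(f_x, 1)
loss_sum = cross_entropy_one_hot(f_x, [0, 1, 0, 0])
print(loss_index, loss_sum)  # both equal -log(0.2), roughly 1.609
```

Only the entry for the true class contributes, which is why the compact form $-\log f(x)_y$ is enough.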


In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule.
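To make the idea of reusing intermediate terms concrete, here is a minimal sketch on a hypothetical scalar network with one hidden unit. It is not the general derivation: it assumes sigmoid activations in both layers, a single positive example (so the loss is $-\log f(x)$), and made-up parameter values. The backward pass starts at the output and reuses each gradient it has already computed, and a finite-difference check is included for comparison.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Scalar two-layer network: x -> a1 -> h1 -> a2 -> f."""
    w1, b1, w2, b2 = params
    a1 = w1 * x + b1
    h1 = sigmoid(a1)
    a2 = w2 * h1 + b2
    f = sigmoid(a2)
    return a1, h1, a2, f

def loss(params, x):
    """Cross-entropy -log f(x) for an example whose true label is 1."""
    return -math.log(forward(params, x)[3])

def backward(params, x):
    """Chain rule from the last layer backward, reusing earlier results."""
    w1, b1, w2, b2 = params
    a1, h1, a2, f = forward(params, x)
    d_a2 = f - 1.0                  # derivative of -log(sigmoid(a2)) w.r.t. a2
    d_w2 = d_a2 * h1
    d_b2 = d_a2
    d_h1 = d_a2 * w2                # reuses d_a2
    d_a1 = d_h1 * h1 * (1.0 - h1)   # sigmoid'(a1) = h1 * (1 - h1)
    d_w1 = d_a1 * x
    d_b1 = d_a1
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, x) - loss(down, x)) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]  # hypothetical values for w1, b1, w2, b2
x = 2.0
analytic = backward(params, x)
numeric = numeric_grad(params, x)
print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places. Note that `d_a2` is computed once and then feeds every gradient further back.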

In the following, we compute $\frac{\partial \ell}{\partial f(x)}$, $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$, $\frac{\partial \ell}{\partial h^{(k)}(x)}$, $\frac{\partial \ell}{\partial a^{(k)}(x)}$, $\frac{\partial \ell}{\partial w^{(k)}}$, and $\frac{\partial \ell}{\partial b^{(k)}}$ in turn, and then summarize the backpropagation process.

Compute $\frac{\partial \ell}{\partial f(x)}$

First consider a single element:

$$
\begin{aligned}
\frac{\partial \ell}{\partial f(x)_j} &= \frac{\partial -\log f(x)_y}{\partial f(x)_j} \\
&= \frac{-1}{f(x)_y} \cdot \frac{\partial f(x)_y}{\partial f(x)_j} \\
&= \frac{-1\cdot I(y=j)}{f(x)_y}
\end{aligned}
$$
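The element-wise result can be sanity-checked numerically. The sketch below reuses the earlier example output $(0.6, 0.2, 0.1, 0.1)$ with a 0-based class index `y = 1`, and compares the analytic gradient with central finite differences. Like the derivation, it treats the entries of $f(x)$ as independent variables; the function names and the step size `eps` are assumptions made for illustration.

```python
import math

def loss(f_x, y):
    """Cross-entropy -log f(x)_y for a 0-based class index y."""
    return -math.log(f_x[y])

def grad_loss_wrt_output(f_x, y):
    """Analytic gradient: entry j is -1 / f(x)_y if j == y, else 0."""
    return [(-1.0 / f_x[y]) if j == y else 0.0 for j in range(len(f_x))]

def numeric_grad(f_x, y, eps=1e-6):
    """Central finite differences, perturbing one entry of f(x) at a time."""
    grad = []
    for j in range(len(f_x)):
        up = list(f_x)
        up[j] += eps
        down = list(f_x)
        down[j] -= eps
        grad.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grad

f_x = [0.6, 0.2, 0.1, 0.1]
analytic = grad_loss_wrt_output(f_x, 1)
numeric = numeric_grad(f_x, 1)
print(analytic)  # [0.0, -5.0, 0.0, 0.0]
print(numeric)   # close to the analytic values
```

Only the entry at the true class is nonzero, matching the indicator $I(y=j)$ in the last line of the derivation.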







