Table of Contents
- Links to previous posts
- Deep Learning vs. Machine Learning
- Forward Propagation
- Loss functions of neural network
- Back-propagation
- compute $\frac{\partial \ell}{\partial f(x)}$
- compute $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$
- compute $\frac{\partial \ell}{\partial h^{(k)}(x)}$
- compute $\frac{\partial \ell}{\partial a^{(k)}(x)}$
- compute $\frac{\partial \ell}{\partial w^{(k)}}$
- compute $\frac{\partial \ell}{\partial b^{(k)}}$
- Back-propagation Procedure (using SGD) Summary
- Debugging: Gradient Checking
- Dropout
- Early Stopping
- SGD Convergence
- Links to previous posts
Links to previous posts
Deep Learning vs. Machine Learning
The major difference between Deep Learning and Machine Learning is the problem-solving approach. Deep Learning techniques tend to solve a problem end to end, whereas Machine Learning techniques need the problem to be broken down into parts that are solved separately, with the results combined at the final stage.
Forward Propagation
The general procedure is the following:
$$
\begin{aligned}
a^{(1)}(x) &= w^{(1)^T} \cdot x + b^{(1)} \\
h^{(1)}(x) &= g_1(a^{(1)}(x)) \\
a^{(2)}(x) &= w^{(2)^T} \cdot h^{(1)}(x) + b^{(2)} \\
h^{(2)}(x) &= g_2(a^{(2)}(x)) \\
&\dots \\
a^{(L+1)}(x) &= w^{(L+1)^T} \cdot h^{(L)}(x) + b^{(L+1)} \\
h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x))
\end{aligned}
$$
Note:
- $w^{(i)}$ has dimension: (# of (hidden) units in layer $i$) $\times$ (# of (hidden) units in layer $i-1$).
- $b^{(i)}, a^{(i)}, h^{(i)}$ have the same dimension: (# of (hidden) units in layer $i$, $1$).
- $g_i$ is an activation function. Sigmoid, tanh, and ReLU are common activation functions. In the last layer, the choice of activation function $g_{L+1}$ depends on the problem; it is usually sigmoid or softmax.
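The forward pass above can be sketched in NumPy. This is a minimal illustration, not a full implementation: the two-layer network, its sizes, and the tanh/softmax choice are made-up example values, and each weight matrix `W` is stored with shape (# units in layer $k$) $\times$ (# units in layer $k-1$), matching the note above, so no transpose is needed in the code.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward propagation: a^(k) = W^(k) h^(k-1) + b^(k), h^(k) = g_k(a^(k))."""
    h = x
    cache = []                      # keep (a, h) per layer; back-propagation reuses them
    for W, b, g in zip(weights, biases, activations):
        a = W @ h + b               # pre-activation a^(k)(x)
        h = g(a)                    # post-activation h^(k)(x)
        cache.append((a, h))
    return h, cache                 # h is the network output f(x)

def softmax(a):
    e = np.exp(a - a.max())         # subtract max for numerical stability
    return e / e.sum()

# Hypothetical network: 3 inputs -> 4 hidden units (tanh) -> 2 outputs (softmax)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros((4, 1)), np.zeros((2, 1))]
x = rng.normal(size=(3, 1))

f_x, cache = forward(x, weights, biases, [np.tanh, softmax])
print(f_x.sum())                    # softmax output sums to 1
```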
Loss functions of neural network
Common loss functions are cross-entropy loss, hinge loss, triplet loss, etc. In fact, depending on the specific problem, we can define arbitrary loss functions. We can also use AUC as a loss.
In this post, we focus on the cross-entropy loss.
For example, let the output of the neural network be $(0.6, 0.2, 0.1, 0.1)$ and the true label be $(0, 1, 0, 0)$. Then we can write the cross-entropy loss as

$$
\ell \left(\left[\begin{array}{l} 0.6 \\ 0.2 \\ 0.1 \\ 0.1 \end{array}\right],\left[\begin{array}{l} 0 \\ 1 \\ 0 \\ 0 \end{array}\right]\right) = -(0 \cdot \log 0.6 + 1 \cdot \log 0.2 + 0 \cdot \log 0.1 + 0 \cdot \log 0.1) = -\log 0.2
$$
From now on, we call the output of the neural network $f(x)$ and the true label w.r.t. $x$ is $y$. Then the cross-entropy loss is written as

$$
\ell \left( f(x), y\right) = - \log f(x)_y
$$

where $f(x)_y$ is the $y$-th entry of $f(x)$.
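As a quick sanity check, the worked example above can be computed directly. This is a minimal sketch; the function name `cross_entropy` is ours, and it assumes `f_x` is a vector of class probabilities and `y` an integer class index.

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy loss: -log of the probability assigned to the true class y."""
    return -np.log(f_x[y])

f_x = np.array([0.6, 0.2, 0.1, 0.1])   # network output from the example above
loss = cross_entropy(f_x, y=1)          # true label (0, 1, 0, 0) -> class index 1
print(loss)                             # -log 0.2, about 1.609
```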
Back-propagation
In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule.
In the following, we compute $\frac{\partial \ell}{\partial f(x)}$, $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$, $\frac{\partial \ell}{\partial h^{(k)}(x)}$, $\frac{\partial \ell}{\partial a^{(k)}(x)}$, $\frac{\partial \ell}{\partial w^{(k)}}$, and $\frac{\partial \ell}{\partial b^{(k)}}$, and then summarize the back-propagation process.
compute $\frac{\partial \ell}{\partial f(x)}$
First consider a single element:
$$
\begin{aligned}
\frac{\partial \ell}{\partial f(x)_j} &= \frac{\partial \left(-\log f(x)_y\right)}{\partial f(x)_j} \\
&= \frac{-1}{f(x)_y} \cdot \frac{\partial f(x)_y}{\partial f(x)_j} \\
&= \frac{-1\cdot I(y=j)}{f(x)_y}
\end{aligned}
$$
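The result $\frac{\partial \ell}{\partial f(x)_j} = \frac{-1 \cdot I(y=j)}{f(x)_y}$ says the gradient is zero everywhere except at the true-class entry. A small numerical check on the example output illustrates this; the helper name and the central-difference step size are our own choices for the sketch.

```python
import numpy as np

def grad_loss_wrt_fx(f_x, y):
    """d ell / d f(x)_j = -I(y == j) / f(x)_y: nonzero only at the true class."""
    g = np.zeros_like(f_x)
    g[y] = -1.0 / f_x[y]
    return g

f_x = np.array([0.6, 0.2, 0.1, 0.1])
y = 1
analytic = grad_loss_wrt_fx(f_x, y)     # [0, -1/0.2, 0, 0] = [0, -5, 0, 0]

# Verify by central differences (a small-scale gradient check)
eps = 1e-6
numeric = np.zeros_like(f_x)
for j in range(len(f_x)):
    plus, minus = f_x.copy(), f_x.copy()
    plus[j] += eps
    minus[j] -= eps
    numeric[j] = (-np.log(plus[y]) + np.log(minus[y])) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to 0
```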