Intro to Deep Learning & Backpropagation

This post compares deep learning with machine learning, introduces the forward propagation process, and explains common neural-network loss functions, in particular the cross-entropy loss. It then works through the backpropagation algorithm in detail, including the steps for computing the gradient of each parameter, and describes gradient checking as a debugging method. It also covers the regularization technique Dropout, in both its training-time and test-time behavior, as well as early stopping to avoid overfitting. Finally, it touches on the convergence conditions of SGD.

Links to previous posts

Deep Learning vs. Machine Learning

The major difference between deep learning and machine learning techniques is the problem-solving approach. Deep learning techniques tend to solve the problem end to end, whereas machine learning techniques need the problem to be broken down into different parts, each solved first, with their results combined at the final stage.

Forward Propagation

The general procedure is the following:

$$
\begin{aligned}
a^{(1)}(x) &= w^{(1)^T} \cdot x + b^{(1)} \\
h^{(1)}(x) &= g_1(a^{(1)}(x)) \\
a^{(2)}(x) &= w^{(2)^T} \cdot h^{(1)}(x) + b^{(2)} \\
h^{(2)}(x) &= g_2(a^{(2)}(x)) \\
&\;\;\vdots \\
a^{(L+1)}(x) &= w^{(L+1)^T} \cdot h^{(L)}(x) + b^{(L+1)} \\
h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x))
\end{aligned}
$$

Note:

  • $w^{(i)}$ has dimension (# of (hidden) units in layer $i$) $\times$ (# of (hidden) units in layer $i-1$).

  • $b^{(i)}, a^{(i)}, h^{(i)}$ have the same dimension: (# of (hidden) units in layer $i$, $1$).

  • $g_i$ is an activation function; sigmoid, tanh, and ReLU are common choices. In the last layer, the choice of activation function $g_{L+1}$ depends on the problem, and is usually sigmoid or softmax.
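To make the recursion above concrete, here is a minimal NumPy sketch of the forward pass. It follows the shape convention from the note (rows of each weight matrix index the units of the current layer, so the linear step is a plain matrix-vector product); the layer sizes, random initialization, and the sigmoid/softmax choices are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))      # subtract the max for numerical stability
    return e / np.sum(e)

def forward(x, weights, biases, activations):
    """Run the recursion a^(k) = W^(k) h^(k-1) + b^(k), h^(k) = g_k(a^(k)).

    weights[k] has shape (# units in layer k+1, # units in layer k);
    biases[k] and the intermediate activations are 1-D vectors.
    """
    h = x
    for W, b, g in zip(weights, biases, activations):
        a = W @ h + b              # pre-activation a^(k)
        h = g(a)                   # activation h^(k) = g_k(a^(k))
    return h                       # h^(L+1) = f(x)

# Toy example: a 3 -> 4 -> 2 network (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x = rng.normal(size=3)
f_x = forward(x, weights, biases, [sigmoid, softmax])
print(f_x, f_x.sum())              # the softmax output sums to 1
```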

Loss functions of neural network

Common loss functions are cross-entropy loss, hinge loss, triplet loss, etc. In fact, depending on the specific problem, we can define arbitrary loss functions. We can also use AUC as a loss.

In this post, we focus on the cross-entropy loss.

For example, let the output of the neural network be $(0.6, 0.2, 0.1, 0.1)$ and the true label be $(0, 1, 0, 0)$; then we can write the cross-entropy loss as

$$
\ell \left(\left[\begin{array}{l} 0.6 \\ 0.2 \\ 0.1 \\ 0.1 \end{array}\right],\left[\begin{array}{l} 0 \\ 1 \\ 0 \\ 0 \end{array}\right]\right) = -(0 \cdot \log 0.6 + 1 \cdot \log 0.2 + 0 \cdot \log 0.1 + 0 \cdot \log 0.1) = -\log 0.2
$$

From now on, we call the output of the neural network $f(x)$, and the true label w.r.t. $x$ is $y$; then the cross-entropy loss is written as

$$
\ell \left( f(x), y\right) = - \log f(x)_y
$$

where $f(x)_y$ is the $y$-th entry of $f(x)$.
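As a sanity check on the definition, a minimal NumPy sketch (using the example values from above, which are just illustrative) reproduces the same $-\log 0.2$:

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy loss -log f(x)_y for a single example.

    f_x: predicted probability vector (the network output),
    y:   index of the true class.
    """
    return -np.log(f_x[y])

f_x = np.array([0.6, 0.2, 0.1, 0.1])   # network output from the example above
y = 1                                   # true label (0, 1, 0, 0) -> class index 1
print(cross_entropy(f_x, y))            # ~1.609 = -log(0.2)
```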

Back-propagation

In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule.
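In the notation of the forward pass above, this layer-by-layer reuse can be written schematically as a standard form of the chain rule (stated here for orientation; the individual terms are derived step by step below):

$$
\frac{\partial \ell}{\partial h^{(k)}(x)}
= \left(\frac{\partial a^{(k+1)}(x)}{\partial h^{(k)}(x)}\right)^{\!T} \frac{\partial \ell}{\partial a^{(k+1)}(x)},
\qquad
\frac{\partial \ell}{\partial a^{(k)}(x)}
= \left(\frac{\partial h^{(k)}(x)}{\partial a^{(k)}(x)}\right)^{\!T} \frac{\partial \ell}{\partial h^{(k)}(x)}
$$

so the gradient at each layer reuses the gradient already computed for the layer after it.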

In the following, we compute $\frac{\partial \ell}{\partial f(x)}$, $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$, $\frac{\partial \ell}{\partial h^{(k)}(x)}$, $\frac{\partial \ell}{\partial a^{(k)}(x)}$, $\frac{\partial \ell}{\partial w^{(k)}}$, and $\frac{\partial \ell}{\partial b^{(k)}}$ one by one, and then summarize the back-propagation process.

Compute $\frac{\partial \ell}{\partial f(x)}$

First consider a single element:
$$
\begin{aligned}
\frac{\partial \ell}{\partial f(x)_j} &= \frac{\partial \left(-\log f(x)_y\right)}{\partial f(x)_j} \\
&= \frac{-1}{f(x)_y} \cdot \frac{\partial f(x)_y}{\partial f(x)_j} \\
&= \frac{-1\cdot I(y=j)}{f(x)_y}
\end{aligned}
$$
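To see this formula in action, the sketch below compares the analytic gradient $-I(y=j)/f(x)_y$ against a central finite-difference estimate (the example vector, label index, and step size are arbitrary choices for illustration):

```python
import numpy as np

def loss(f_x, y):
    return -np.log(f_x[y])              # cross-entropy loss -log f(x)_y

def analytic_grad(f_x, y):
    """d loss / d f(x)_j = -I(y = j) / f(x)_y."""
    grad = np.zeros_like(f_x)
    grad[y] = -1.0 / f_x[y]
    return grad

def numerical_grad(f_x, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(f_x)
    for j in range(f_x.size):
        plus, minus = f_x.copy(), f_x.copy()
        plus[j] += eps
        minus[j] -= eps
        grad[j] = (loss(plus, y) - loss(minus, y)) / (2 * eps)
    return grad

f_x = np.array([0.6, 0.2, 0.1, 0.1])
y = 1
print(analytic_grad(f_x, y))            # [0, -5, 0, 0]
print(numerical_grad(f_x, y))           # agrees up to finite-difference error
```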
