Table of Contents
- Links to previous posts
- Deep Learning vs. Machine Learning
- Forward Propagation
- Loss functions of neural network
- Back-propagation
- compute $\frac{\partial \ell}{\partial f(x)}$
- compute $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$
- compute $\frac{\partial \ell}{\partial h^{(k)}(x)}$
- compute $\frac{\partial \ell}{\partial a^{(k)}(x)}$
- compute $\frac{\partial \ell}{\partial w^{(k)}}$
- compute $\frac{\partial \ell}{\partial b^{(k)}}$
- Back-propagation Procedure (using SGD) Summary
- Debugging: Gradient Checking
- Dropout
- Early Stopping
- SGD Convergence
- Links to previous posts
Links to previous posts
Deep Learning vs. Machine Learning
The major difference between Deep Learning and Machine Learning is the problem-solving approach. Deep Learning techniques tend to solve a problem end to end, whereas Machine Learning techniques need the problem to be broken down into parts that are solved separately, with the results combined at the final stage.
Forward Propagation
The general procedure is the following:
$$
\begin{aligned}
a^{(1)}(x) &= w^{(1)^T} \cdot x + b^{(1)} \\
h^{(1)}(x) &= g_1(a^{(1)}(x)) \\
a^{(2)}(x) &= w^{(2)^T} \cdot h^{(1)}(x) + b^{(2)} \\
h^{(2)}(x) &= g_2(a^{(2)}(x)) \\
&\dots \\
a^{(L+1)}(x) &= w^{(L+1)^T} \cdot h^{(L)}(x) + b^{(L+1)} \\
h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x))
\end{aligned}
$$
Note:
- $w^{(i)}$ has dimension: (# of (hidden) units in layer $i$) $\times$ (# of (hidden) units in layer $i-1$).
- $b^{(i)}, a^{(i)}, h^{(i)}$ have the same dimension: (# of (hidden) units in layer $i$, $1$).
- $g_i$ is an activation function. Sigmoid, tanh, and ReLU are common activation functions. In the last layer, the choice of activation function $g_{L+1}$ depends on the problem; it is usually sigmoid or softmax.
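The forward pass above can be sketched in NumPy. This is a minimal illustration, not a full implementation: the two-layer network, its sizes, and the tanh/softmax choice are made-up example values, and each weight matrix `W` is stored with shape (# units in layer $k$) $\times$ (# units in layer $k-1$), matching the note above, so no transpose is needed in the code.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward propagation: a^(k) = W^(k) h^(k-1) + b^(k), h^(k) = g_k(a^(k))."""
    h = x
    cache = []                      # keep (a, h) per layer; back-propagation reuses them
    for W, b, g in zip(weights, biases, activations):
        a = W @ h + b               # pre-activation a^(k)(x)
        h = g(a)                    # post-activation h^(k)(x)
        cache.append((a, h))
    return h, cache                 # h is the network output f(x)

def softmax(a):
    e = np.exp(a - a.max())         # subtract max for numerical stability
    return e / e.sum()

# Hypothetical network: 3 inputs -> 4 hidden units (tanh) -> 2 outputs (softmax)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros((4, 1)), np.zeros((2, 1))]
x = rng.normal(size=(3, 1))

f_x, cache = forward(x, weights, biases, [np.tanh, softmax])
print(f_x.sum())                    # softmax output sums to 1
```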
Loss functions of neural network
Common loss functions are cross-entropy loss, hinge loss, triplet loss, etc. In fact, depending on the specific problem, we can define arbitrary loss functions. We can also use AUC as a loss.
In this post, we focus on the cross-entropy loss.
For example, let the output of the neural network be $(0.6, 0.2, 0.1, 0.1)$ and the true label be $(0, 1, 0, 0)$. Then we can write the cross-entropy loss as

$$
\ell \left(\left[\begin{array}{l} 0.6 \\ 0.2 \\ 0.1 \\ 0.1 \end{array}\right],\left[\begin{array}{l} 0 \\ 1 \\ 0 \\ 0 \end{array}\right]\right) = -(0 \cdot \log 0.6 + 1 \cdot \log 0.2 + 0 \cdot \log 0.1 + 0 \cdot \log 0.1) = -\log 0.2
$$
From now on, we call the output of the neural network $f(x)$ and the true label w.r.t. $x$ is $y$. Then the cross-entropy loss is written as

$$
\ell \left( f(x), y\right) = - \log f(x)_y
$$

where $f(x)_y$ is the $y$-th entry of $f(x)$.
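As a quick sanity check, the worked example above can be computed directly. This is a minimal sketch; the function name `cross_entropy` is ours, and it assumes `f_x` is a vector of class probabilities and `y` an integer class index.

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy loss: -log of the probability assigned to the true class y."""
    return -np.log(f_x[y])

f_x = np.array([0.6, 0.2, 0.1, 0.1])   # network output from the example above
loss = cross_entropy(f_x, y=1)          # true label (0, 1, 0, 0) -> class index 1
print(loss)                             # -log 0.2, about 1.609
```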
Back-propagation
In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule.
In the following, we compute $\frac{\partial \ell}{\partial f(x)}$, $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$, $\frac{\partial \ell}{\partial h^{(k)}(x)}$, $\frac{\partial \ell}{\partial a^{(k)}(x)}$, $\frac{\partial \ell}{\partial w^{(k)}}$, and $\frac{\partial \ell}{\partial b^{(k)}}$, and then summarize the back-propagation process.
compute $\frac{\partial \ell}{\partial f(x)}$
First consider a single element:
$$
\begin{aligned}
\frac{\partial \ell}{\partial f(x)_j} &= \frac{\partial \left(-\log f(x)_y\right)}{\partial f(x)_j} \\
&= \frac{-1}{f(x)_y} \cdot \frac{\partial f(x)_y}{\partial f(x)_j} \\
&= \frac{-1\cdot I(y=j)}{f(x)_y}
\end{aligned}
$$
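The result $\frac{\partial \ell}{\partial f(x)_j} = \frac{-1 \cdot I(y=j)}{f(x)_y}$ says the gradient is zero everywhere except at the true-class entry. A small numerical check on the example output illustrates this; the helper name and the central-difference step size are our own choices for the sketch.

```python
import numpy as np

def grad_loss_wrt_fx(f_x, y):
    """d ell / d f(x)_j = -I(y == j) / f(x)_y: nonzero only at the true class."""
    g = np.zeros_like(f_x)
    g[y] = -1.0 / f_x[y]
    return g

f_x = np.array([0.6, 0.2, 0.1, 0.1])
y = 1
analytic = grad_loss_wrt_fx(f_x, y)     # [0, -1/0.2, 0, 0] = [0, -5, 0, 0]

# Verify by central differences (a small-scale gradient check)
eps = 1e-6
numeric = np.zeros_like(f_x)
for j in range(len(f_x)):
    plus, minus = f_x.copy(), f_x.copy()
    plus[j] += eps
    minus[j] -= eps
    numeric[j] = (-np.log(plus[y]) + np.log(minus[y])) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to 0
```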