Back-Propagation and Other Differentiation Algorithms
- Forward propagation: The inputs x provide the initial information that then propagates up through the hidden units at each layer and finally produces the prediction ŷ.
- Back-propagation: allows the information from the cost to flow backward through the network in order to compute the gradient (a minimal code sketch after this list illustrates both passes).
- The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
- In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇θJ(θ) .
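As a minimal sketch of this division of labor (the tiny one-hidden-layer architecture and all names below are illustrative, not from the text): forward propagation produces ŷ, back-propagation computes the gradient of the cost with respect to the parameters, and a separate SGD step uses that gradient to learn.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input
y = 1.0                       # target
W = rng.normal(size=(4, 3))   # hidden-layer weights
v = rng.normal(size=4)        # output weights

# Forward propagation: information flows from x up through the hidden
# units and finally produces the prediction y_hat.
h = sigmoid(W @ x)
y_hat = v @ h
J = 0.5 * (y_hat - y) ** 2    # squared-error cost

# Back-propagation: information from the cost flows backward through the
# network to compute the gradient of J with respect to the parameters.
dJ_dyhat = y_hat - y
grad_v = dJ_dyhat * h
dJ_dh = dJ_dyhat * v
grad_W = (dJ_dh * h * (1 - h))[:, None] * x[None, :]

# A separate algorithm (here, one SGD step) performs learning with the gradient.
lr = 0.1
v -= lr * grad_v
W -= lr * grad_W
```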
Computational Graphs
To describe the back-propagation algorithm precisely, it helps to have a more formal computational graph language.
- Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.
- An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.
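A minimal sketch of this graph language (the class and function names are illustrative assumptions, not from the text): each node holds one variable, and each operation is a simple function of its parent nodes; composing operations expresses more complicated functions.

```python
class Node:
    """One node in the computational graph: a single variable."""
    def __init__(self, value=None, op=None, parents=()):
        self.value = value      # scalar, vector, matrix, tensor, ...
        self.op = op            # the operation that produced this variable
        self.parents = parents  # the operation's input nodes

def apply_op(op, *parents):
    # Evaluate a simple operation on its parent variables, creating a new node.
    return Node(value=op(*[p.value for p in parents]), op=op, parents=parents)

# Build the graph for z = (x + y) * y by composing two allowable operations.
x = Node(value=2.0)
y = Node(value=3.0)
s = apply_op(lambda a, b: a + b, x, y)   # s = x + y
z = apply_op(lambda a, b: a * b, s, y)   # z = s * y
print(z.value)                           # 15.0
```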
Chain Rule of Calculus
The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

dz/dx = (dz/dy)(dy/dx).
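As a quick numeric check of this identity (the choices g(x) = x² and f(y) = sin(y) are illustrative), the analytic derivative f′(y)·g′(x) can be compared against a finite-difference approximation:

```python
import numpy as np

# z = f(g(x)) = sin(x**2), so dz/dx = cos(x**2) * 2x by the chain rule.
x = 1.5
y = x ** 2                     # y = g(x)
analytic = np.cos(y) * 2 * x   # (dz/dy) * (dy/dx)

# Central finite difference as an independent estimate of dz/dx.
eps = 1e-6
numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)
print(analytic, numeric)       # the two values should agree closely
```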
We can generalize this beyond the scalar case. Suppose that x ∈ ℝᵐ, y ∈ ℝⁿ, g maps from ℝᵐ to ℝⁿ, and f maps from ℝⁿ to ℝ. If y = g(x) and z = f(y), then

∂z/∂xᵢ = Σⱼ (∂z/∂yⱼ)(∂yⱼ/∂xᵢ).
In vector notation, this may be equivalently written as

∇_x z = (∂y/∂x)ᵀ ∇_y z,

where ∂y/∂x is the n × m Jacobian matrix of g.
From this we see that the gradient of a variable x can be obtained by multiplying the transpose of the Jacobian matrix ∂y/∂x by the gradient ∇_y z.
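A small numeric sketch of the vector case (the particular g and f below are illustrative assumptions): with a linear g(x) = Ax, the Jacobian ∂y/∂x is simply the matrix A, so ∇_x z is the Jacobian transpose applied to ∇_y z.

```python
import numpy as np

# x in R^3, y = g(x) = A x in R^2, z = f(y) = sum(y_j^2) in R.
x = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])

y = A @ x                # the Jacobian dy/dx of g is A (2 x 3)
z = np.sum(y ** 2)

grad_y = 2 * y           # grad_y z for f(y) = sum(y_j^2)
grad_x = A.T @ grad_y    # grad_x z = (dy/dx)^T grad_y z

# Finite-difference check of grad_x, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (np.sum((A @ (x + eps * e)) ** 2) - np.sum((A @ (x - eps * e)) ** 2)) / (2 * eps)
    for e in np.eye(3)
])
print(grad_x, numeric)   # the two gradients should agree closely
```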