A Detailed Mathematical Derivation of Fully Connected Neural Networks (in English)

This article gives a detailed account of the forward-propagation and backward-propagation processes of a fully connected neural network (FCNN), covering concepts such as activation functions, loss functions, and gradient descent. Taking the simple sigmoid function as an example, it explains how the network's weights and biases are updated, and briefly mentions deep learning as well as topics that may follow, such as support vector machines (SVM) and generative adversarial networks (GAN).

The author has just begun the second semester of the final year of high school and is quite interested in machine learning. Since I will be going abroad for university, I have recently been planning to post English mathematical introductions and derivations of machine learning algorithms. My mathematical knowledge is very limited, so I would appreciate readers pointing out any mistakes in the details~

Let us start with the construction principles and training method of the most basic fully connected neural network (Fully Connected Feedforward Neural Network).

Today I'm going to talk about how a neural network (NN) may be trained to fit or classify data (which are essentially the same task). The simplest form of network is used: the fully connected feedforward neural network.

As Prof. Hungyi Li points out about the nature of fitting models, the process of training a model can be divided into four steps:

  1. Define the model function $f_\theta(X)$, where $\theta$ represents all parameters of the model, whose values are to be "learned" through training.
  2. Input training data $X$ and receive outcomes $f_\theta(X)$ from the model.
  3. Compare the model's outcomes $f_\theta(X)$ to the training targets $Y$ through a predefined loss function (usually a mean-squared loss or a divergence).
  4. Minimize the loss (usually through gradient descent or one of its alternatives); a minimal code sketch of this whole loop follows the list.
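To make the four steps concrete, here is a minimal NumPy sketch of such a training loop. It assumes a toy linear model $f_\theta(x)=wx+b$, a mean-squared loss, and plain gradient descent; the data, learning rate, and choice of model are illustrative and not part of the original derivation.

```python
import numpy as np

# Step 1: define the model f_theta(X); here theta = (w, b), a toy linear model.
def f(theta, X):
    w, b = theta
    return w * X + b

# Step 3: a predefined loss function (mean-squared loss).
def loss(theta, X, Y):
    return np.mean((f(theta, X) - Y) ** 2)

# Illustrative training data generated around Y = 3X + 0.5.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)
Y = 3.0 * X + 0.5 + 0.01 * rng.standard_normal(50)

theta = np.array([0.0, 0.0])  # parameters to be "learned"
lr = 0.5                      # learning rate

for step in range(2000):
    # Step 2: input the training data and receive the model's outcomes.
    pred = f(theta, X)
    # Step 4: minimize the loss via gradient descent on w and b.
    grad_w = np.mean(2.0 * (pred - Y) * X)
    grad_b = np.mean(2.0 * (pred - Y))
    theta = theta - lr * np.array([grad_w, grad_b])

print(theta, loss(theta, X, Y))  # theta should approach (3.0, 0.5)
```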

We are going to start with the first step now. In the case of a neural network, the model function $f_\theta(X)$ can be better illustrated with the following graph.
[Figure: structure of a fully connected feedforward neural network with an input layer, hidden layers, and an output layer]
The leftmost layer, the "Input Layer", is where $X$ is fed into the model, and outcomes are produced by the rightmost layer, the "Output Layer". Between these two are the "Hidden Layers" that transform the input data. Between each pair of successive layers, the data are multiplied by weights and summed into a single value going into a specific neuron of the next layer, and that neuron outputs another value as a function of its input; this function is the "Activation Function" of the neural network. The nomenclature can be a bit confusing because a neural network can have different activation functions for different neurons (i.e. the activation function is neuron-specific rather than NN-specific).

Forward Propagation

We now give a mathematical account of the working mechanism of an NN, known as "Forward Propagation". Let $\vec{a}$ be the vector entering each layer, $\vec{z}$ be the vector output by each layer, and $\sigma(\vec{x})$ be the activation function in vector form; the computation at the $i$'th layer is then
$$\vec{z_i}=\sigma(\vec{a_i})$$
An activation function is usually chosen as one that converges to $1$ or $0$ quickly and monotonically, and that is differentiable on its domain. A good example is the "Sigmoid Function" $S(x)$, defined as:
$$S(x)=\frac{1}{1+e^{-x}}$$
whose graph looks like the figure below.
[Figure: graph of the sigmoid function $S(x)$]
Its derivative, $S'(x)$, can be easily found as
$$S'(x)=S(x)\cdot\left(1-S(x)\right)$$
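For completeness, the identity can be verified in one short calculation (a standard step not spelled out above):
$$S'(x)=\frac{d}{dx}\left(1+e^{-x}\right)^{-1}=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}=\frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}=S(x)\cdot\left(1-S(x)\right)$$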
The outputs are then weighted and summed up with constants to form $\vec{a_{i+1}}$, which can be defined as
$$\vec{a_{i+1}}=W_{i+1}\vec{z_i}+\vec{b_{i+1}}$$
where $\vec{b_i}$ is a vector of constants, usually called the "Bias", and the entry $w_{jk}$ of the weight matrix $W_{i+1}$ denotes the weight of the current layer's $k$'th output toward the summation at the $j$'th neuron of the next layer. With these two equations, the forward-propagation process is mathematically defined. As for the dimensions of the vectors $\vec{z}$ and $\vec{a}$, they clearly depend on the dimensions of the training data $X$ and $Y$, as well as on the number of neurons the developer assigns to each layer of the network.

At the last layer, a normalization is usually applied with a soft-max function, but for simplicity it is not included in this article; readers can easily add it to the model described here with a little research.
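To tie the two forward-propagation equations together, here is a minimal NumPy sketch of a fully connected feedforward pass. The layer sizes, the random initialization, and the use of the sigmoid as every layer's activation are illustrative assumptions rather than anything prescribed by the article.

```python
import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + e^{-x}), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative layer sizes: 3 inputs -> 4 hidden neurons -> 2 outputs.
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # W_{i+1}
biases  = [rng.standard_normal(m) for m in sizes[1:]]                           # b_{i+1}

def forward(x):
    """Forward propagation: a_{i+1} = W_{i+1} z_i + b_{i+1}, then z_{i+1} = sigma(a_{i+1})."""
    z = x
    for W, b in zip(weights, biases):
        a = W @ z + b   # weighted sum plus bias
        z = sigmoid(a)  # activation
    return z

print(forward(np.array([0.5, -1.0, 2.0])))
```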

Backward Propagation

Scientists studied the perceptron intensively in past decades because it is a simplified NN with no hidden layers; this mathematical simplicity makes it easy to train with gradient descent. In contrast, scientists struggled for a long time to find the partial derivative of the loss function with respect to each parameter of an NN model, until the "Backward Propagation" approach was worked out toward the end of the 20th century. Though it took extensive effort to discover, the approach is very simple in theory, and it is fundamentally based on the chain rule of partial derivatives,
$$\frac{\partial a}{\partial c}=\frac{\partial a}{\partial b}\frac{\partial b}{\partial c}$$
and on a dummy variable, $\delta$, introduced to keep the derivation coherent.

We first define the layer-specific variable $\vec{\delta_i}$ as
$$\vec{\delta_i}=\frac{\partial L}{\partial \vec{a_i}}$$
Using this dummy variable, we can compute
$$\begin{aligned} \frac{\partial L}{\partial W_i}&=\frac{\partial L}{\partial \vec{a_i}}\frac{\partial \vec{a_i}}{\partial W_i}\\ &=\vec{\delta_i}\frac{\partial \vec{a_i}}{\partial W_i}\\ \frac{\partial L}{\partial \vec{b_i}}&=\frac{\partial L}{\partial \vec{a_i}}\frac{\partial \vec{a_i}}{\partial \vec{b_i}}\\ &=\vec{\delta_i}\frac{\partial \vec{a_i}}{\partial \vec{b_i}} \end{aligned}$$
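The article stops at these two identities, but they already suggest how the gradients are computed in practice: since $\vec{a_i}=W_i\vec{z_{i-1}}+\vec{b_i}$, the factor $\partial \vec{a_i}/\partial W_i$ contributes the previous layer's output $\vec{z_{i-1}}$, the factor $\partial \vec{a_i}/\partial \vec{b_i}$ is the identity, and $\vec{\delta_i}$ itself can be propagated backward with the same chain rule. The sketch below reuses the illustrative sigmoid network from the forward-propagation example and assumes a mean-squared loss $L=\tfrac{1}{2}\lVert \vec{z}_{\text{last}}-\vec{y}\rVert^2$; every name and size here is an assumption for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)  # S'(x) = S(x)(1 - S(x))

# Same illustrative network as in the forward-propagation sketch.
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.standard_normal(m) for m in sizes[1:]]

def backprop(x, y):
    """Return dL/dW_i and dL/db_i for the mean-squared loss L = ||z_last - y||^2 / 2."""
    # Forward pass, storing every a_i and z_i for reuse in the backward pass.
    zs, activations = [x], []
    for W, b in zip(weights, biases):
        a = W @ zs[-1] + b
        activations.append(a)
        zs.append(sigmoid(a))

    # delta_i = dL/da_i, starting from the last layer: (z_last - y) * S'(a_last).
    delta = (zs[-1] - y) * sigmoid_prime(activations[-1])
    grads_W, grads_b = [], []
    for i in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, zs[i]))  # dL/dW_i = delta_i z_{i-1}^T
        grads_b.insert(0, delta)                   # dL/db_i = delta_i
        if i > 0:
            # Propagate delta backward through W_i and the previous activation.
            delta = (weights[i].T @ delta) * sigmoid_prime(activations[i - 1])
    return grads_W, grads_b

gW, gb = backprop(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0]))
print([g.shape for g in gW], [g.shape for g in gb])
```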
