A Detailed Mathematical Derivation of Fully Connected Neural Networks (in English)

This article gives a detailed account of the forward-propagation and backward-propagation processes of a fully connected neural network (FCNN), covering concepts such as activation functions, loss functions, and gradient descent. Taking the simple sigmoid function as an example, it explains how the network's weights and biases are updated, and briefly mentions deep learning as well as topics that may follow, such as support vector machines (SVM) and generative adversarial networks (GAN).

The author has just begun the second semester of the final year of high school and is quite interested in machine learning. Since I will be going abroad for university, I have recently been planning to post English mathematical introductions and derivations of machine learning algorithms. My mathematical knowledge is very limited, so I would appreciate readers pointing out any mistakes in the details~

Let us start with the construction principles and training method of the most basic fully connected neural network (Fully Connected Feedforward Neural Network).

Today I'm going to talk about how a neural network (NN) may be trained to fit or classify data (which are essentially the same task). The simplest form of network is used: the fully connected feedforward neural network.

As Prof. Hungyi Li points out about the nature of fitting models, the process of training a model can be divided into four steps:

  1. Define the model function $f_\theta(X)$, where $\theta$ represents all parameters of the model, whose values are to be "learned" through training.
  2. Input training data $X$ and receive outcomes $f_\theta(X)$ from the model.
  3. Compare the model's outcomes $f_\theta(X)$ to the training targets $Y$ through a predefined loss function (usually a mean-squared loss or a divergence).
  4. Minimize the loss (usually through gradient descent or one of its alternatives); a minimal code sketch of this whole loop follows the list.
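To make the four steps concrete, here is a minimal NumPy sketch of such a training loop. It assumes a toy linear model $f_\theta(x)=wx+b$, a mean-squared loss, and plain gradient descent; the data, learning rate, and choice of model are illustrative and not part of the original derivation.

```python
import numpy as np

# Step 1: define the model f_theta(X); here theta = (w, b), a toy linear model.
def f(theta, X):
    w, b = theta
    return w * X + b

# Step 3: a predefined loss function (mean-squared loss).
def loss(theta, X, Y):
    return np.mean((f(theta, X) - Y) ** 2)

# Illustrative training data generated around Y = 3X + 0.5.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)
Y = 3.0 * X + 0.5 + 0.01 * rng.standard_normal(50)

theta = np.array([0.0, 0.0])  # parameters to be "learned"
lr = 0.5                      # learning rate

for step in range(2000):
    # Step 2: input the training data and receive the model's outcomes.
    pred = f(theta, X)
    # Step 4: minimize the loss via gradient descent on w and b.
    grad_w = np.mean(2.0 * (pred - Y) * X)
    grad_b = np.mean(2.0 * (pred - Y))
    theta = theta - lr * np.array([grad_w, grad_b])

print(theta, loss(theta, X, Y))  # theta should approach (3.0, 0.5)
```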

We are going to start with the first step now. In the case of a neural network, the model function $f_\theta(X)$ can be better illustrated with the following graph.
[Figure: structure of a fully connected feedforward neural network with an input layer, hidden layers, and an output layer]
The leftmost layer, the "Input Layer", is where $X$ is fed into the model, and outcomes are produced by the rightmost layer, the "Output Layer". Between these two are the "Hidden Layers" that transform the input data. Between each pair of successive layers, the data are multiplied by weights and summed into a single value going into a specific neuron of the next layer, and that neuron outputs another value as a function of its input; this function is the "Activation Function" of the neural network. The nomenclature can be a bit confusing because a neural network can have different activation functions for different neurons (i.e. the activation function is neuron-specific rather than NN-specific).

Forward Propagation

We now give a mathematical account of the working mechanism of an NN, known as "Forward Propagation". Let $\vec{a}$ be the vector entering each layer, $\vec{z}$ be the vector output by each layer, and $\sigma(\vec{x})$ be the activation function in vector form; the computation at the $i$'th layer is then
$$\vec{z_i}=\sigma(\vec{a_i})$$
An activation function is usually chosen as one that converges to $1$ or $0$ quickly and monotonically, and that is differentiable on its domain. A good example is the "Sigmoid Function" $S(x)$, defined as:
$$S(x)=\frac{1}{1+e^{-x}}$$
whose graph looks like the figure below.
[Figure: graph of the sigmoid function $S(x)$]
Its derivative, $S'(x)$, can be easily found as
$$S'(x)=S(x)\cdot\left(1-S(x)\right)$$
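For completeness, the identity can be verified in one short calculation (a standard step not spelled out above):
$$S'(x)=\frac{d}{dx}\left(1+e^{-x}\right)^{-1}=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}=\frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}=S(x)\cdot\left(1-S(x)\right)$$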
The outputs are then weighted and summed up with constants to form $\vec{a_{i+1}}$, which can be defined as
$$\vec{a_{i+1}}=W_{i+1}\vec{z_i}+\vec{b_{i+1}}$$
where $\vec{b_i}$ is a vector of constants, usually called the "Bias", and the entry $w_{jk}$ of the weight matrix $W_{i+1}$ denotes the weight of the current layer's $k$'th output toward the summation at the $j$'th neuron of the next layer. With these two equations, the forward-propagation process is mathematically defined. As for the dimensions of the vectors $\vec{z}$ and $\vec{a}$, they clearly depend on the dimensions of the training data $X$ and $Y$, as well as on the number of neurons the developer assigns to each layer of the network.

At the last layer, a normalization is usually applied with a soft-max function, but for simplicity it is not included in this article; readers can easily add it to the model described here with a little research.
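To tie the two forward-propagation equations together, here is a minimal NumPy sketch of a fully connected feedforward pass. The layer sizes, the random initialization, and the use of the sigmoid as every layer's activation are illustrative assumptions rather than anything prescribed by the article.

```python
import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + e^{-x}), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative layer sizes: 3 inputs -> 4 hidden neurons -> 2 outputs.
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # W_{i+1}
biases  = [rng.standard_normal(m) for m in sizes[1:]]                           # b_{i+1}

def forward(x):
    """Forward propagation: a_{i+1} = W_{i+1} z_i + b_{i+1}, then z_{i+1} = sigma(a_{i+1})."""
    z = x
    for W, b in zip(weights, biases):
        a = W @ z + b   # weighted sum plus bias
        z = sigmoid(a)  # activation
    return z

print(forward(np.array([0.5, -1.0, 2.0])))
```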

Backward Propagation

Scientists studied the perceptron intensively in past decades because it is a simplified NN with no hidden layers; this mathematical simplicity makes it easy to train with gradient descent. In contrast, scientists struggled for a long time to find the partial derivative of the loss function with respect to each parameter of an NN model, until the "Backward Propagation" approach was worked out toward the end of the 20th century. Though it took extensive effort to discover, the approach is very simple in theory, and it is fundamentally based on the chain rule of partial derivatives,
$$\frac{\partial a}{\partial c}=\frac{\partial a}{\partial b}\frac{\partial b}{\partial c}$$
and on a dummy variable, $\delta$, introduced to keep the derivation coherent.

We first define the layer-specific variable $\vec{\delta_i}$ as
$$\vec{\delta_i}=\frac{\partial L}{\partial \vec{a_i}}$$
Using this dummy variable, we can compute
$$\begin{aligned} \frac{\partial L}{\partial W_i}&=\frac{\partial L}{\partial \vec{a_i}}\frac{\partial \vec{a_i}}{\partial W_i}\\ &=\vec{\delta_i}\frac{\partial \vec{a_i}}{\partial W_i}\\ \frac{\partial L}{\partial \vec{b_i}}&=\frac{\partial L}{\partial \vec{a_i}}\frac{\partial \vec{a_i}}{\partial \vec{b_i}}\\ &=\vec{\delta_i}\frac{\partial \vec{a_i}}{\partial \vec{b_i}} \end{aligned}$$
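The article stops at these two identities, but they already suggest how the gradients are computed in practice: since $\vec{a_i}=W_i\vec{z_{i-1}}+\vec{b_i}$, the factor $\partial \vec{a_i}/\partial W_i$ contributes the previous layer's output $\vec{z_{i-1}}$, the factor $\partial \vec{a_i}/\partial \vec{b_i}$ is the identity, and $\vec{\delta_i}$ itself can be propagated backward with the same chain rule. The sketch below reuses the illustrative sigmoid network from the forward-propagation example and assumes a mean-squared loss $L=\tfrac{1}{2}\lVert \vec{z}_{\text{last}}-\vec{y}\rVert^2$; every name and size here is an assumption for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)  # S'(x) = S(x)(1 - S(x))

# Same illustrative network as in the forward-propagation sketch.
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.standard_normal(m) for m in sizes[1:]]

def backprop(x, y):
    """Return dL/dW_i and dL/db_i for the mean-squared loss L = ||z_last - y||^2 / 2."""
    # Forward pass, storing every a_i and z_i for reuse in the backward pass.
    zs, activations = [x], []
    for W, b in zip(weights, biases):
        a = W @ zs[-1] + b
        activations.append(a)
        zs.append(sigmoid(a))

    # delta_i = dL/da_i, starting from the last layer: (z_last - y) * S'(a_last).
    delta = (zs[-1] - y) * sigmoid_prime(activations[-1])
    grads_W, grads_b = [], []
    for i in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, zs[i]))  # dL/dW_i = delta_i z_{i-1}^T
        grads_b.insert(0, delta)                   # dL/db_i = delta_i
        if i > 0:
            # Propagate delta backward through W_i and the previous activation.
            delta = (weights[i].T @ delta) * sigmoid_prime(activations[i - 1])
    return grads_W, grads_b

gW, gb = backprop(np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0]))
print([g.shape for g in gW], [g.shape for g in gb])
```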
