


We start with the most basic case, the fully connected feedforward neural network: how it is constructed and how it is trained.

Today I’m going to talk about how a neural network (NN) can be trained to fit or classify data (the two tasks are essentially the same). I use the simplest form of network: the fully connected feedforward neural network.

Following Prof. Hungyi Li’s view of model fitting, the process of training a model can be divided into four steps:

  1. Define the model function $f_\theta(X)$, where $\theta$ represents all parameters of the model, whose values are to be “learned” through training.
  2. Feed the training data $X$ into the model and collect its outcomes $f_\theta(X)$.
  3. Compare the model’s outcomes $f_\theta(X)$ with the training targets $Y$ through a predefined loss function (usually mean-squared error or a divergence).
  4. Minimize the loss (usually through gradient descent or one of its variants).
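
The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the model is a straight line rather than a neural network, the loss is mean-squared error, and the names `model`, `mse_loss`, `mse_grad`, and `train`, the toy data, and the learning rate are all hypothetical choices that do not come from the original lecture.

```python
# Step 1: define the model f_theta(x). A straight line is assumed here for brevity.
def model(theta, x):
    w, b = theta
    return w * x + b

# Step 3: mean-squared loss between the model's outcomes and the targets.
def mse_loss(theta, xs, ys):
    return sum((model(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Analytic gradient of the mean-squared loss with respect to (w, b).
def mse_grad(theta, xs, ys):
    n = len(xs)
    dw = sum(2.0 * (model(theta, x) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (model(theta, x) - y) for x, y in zip(xs, ys)) / n
    return dw, db

# Step 4: minimize the loss with plain gradient descent.
def train(xs, ys, lr=0.05, steps=2000):
    theta = (0.0, 0.0)
    for _ in range(steps):
        # Step 2 happens inside mse_grad: the model is evaluated on every x.
        dw, db = mse_grad(theta, xs, ys)
        theta = (theta[0] - lr * dw, theta[1] - lr * db)
    return theta

# Toy training data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta = train(xs, ys)
print(theta, mse_loss(theta, xs, ys))
```

A neural network changes only step 1 (a richer $f_\theta$) and the way the gradient in step 4 is obtained; the overall loop stays the same.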

We start with the first step. For a neural network, the model function $f_\theta(X)$ is best illustrated with the following graph.
The leftmost layer, the “Input Layer”, is where $X$ enters the model, and the outcomes leave through the rightmost layer, the “Output Layer”. Between these two are the “Hidden Layers”, which transform the data. Between each pair of successive layers, the outputs of one layer are multiplied by weights and summed into a single value for each neuron of the next layer; the neuron then outputs a function of that value. This function is the “Activation Function” of a neural network. The name can be a bit confusing, because different neurons in the same network may use different activation functions (i.e. the activation function is neuron-specific rather than NN-specific).

Forward Propagation

We now give a mathematical account of how an NN works, known as “Forward Propagation”. Let $\vec{a}$ be the vector entering each layer, $\vec{z}$ the vector leaving it, and $\sigma(\vec{x})$ the activation function in vector form (applied element-wise). The computation at the $i$'th layer is then
$$\vec{z_i}=\sigma(\vec{a_i})$$
An activation function is usually chosen to be monotonic, to saturate quickly toward $1$ or $0$, and to be differentiable over its domain. A good example is the “Sigmoid Function” $S(x)$, defined as:
$$S(x)=\frac{1}{1+e^{-x}}$$
whose graph looks like the figure below.
Its derivative, $S'(x)$, is easily found to be
$$S'(x)=S(x)\cdot\left(1-S(x)\right)$$
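
The sigmoid and its derivative can be checked numerically. The sketch below assumes scalar inputs of moderate size (for very negative `x`, `math.exp(-x)` overflows, which is not handled here); the helper `numerical_derivative` is a hypothetical name introduced only for the sanity check.

```python
import math

def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """S'(x) = S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to cross-check the closed form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Compare the closed-form derivative with the finite-difference estimate.
for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_prime(x), numerical_derivative(sigmoid, x))
```

The two printed columns should agree to many decimal places, which is a quick way to catch a sign or algebra slip before the derivative is used in training.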
The outputs are then weighted, summed, and offset by constants to form $\vec{a_{i+1}}$, defined as
$$\vec{a_{i+1}}=W_{i+1}\vec{z_i}^T+\vec{b_{i+1}}$$
where $\vec{b_i}$ is a vector of constants, usually called the “Bias”, and the matrix entry $w_{ij}$ is the weight of the current layer’s $j$’th output in the sum for the $i$’th neuron of the next layer. These two equations define the forward propagation process mathematically. As for the dimensions of the vectors $\vec{z}$ and $\vec{a}$, they depend on the dimensions of the training data $X$ and $Y$, as well as on the number of neurons the developer assigns to each layer of the network.
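
The two forward-propagation equations can be sketched in plain Python as follows. Several things here are assumptions made for illustration: every neuron uses the sigmoid, the input layer passes $X$ through unchanged, each $W$ is stored as a list of rows, and the names `affine` and `forward` as well as the example weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, z, b):
    """a_next = W z + b, where row i of W holds the weights w_ij of neuron i."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, layers):
    """Propagate x through `layers`, a list of (W, b) pairs.

    Returns every layer input a_i and every layer output z_i,
    with z_list[0] being the raw input x.
    """
    z = list(x)
    a_list, z_list = [], [z]
    for W, b in layers:
        a = affine(W, z, b)                 # weighted sum plus bias
        z = [sigmoid(a_k) for a_k in a]     # element-wise activation
        a_list.append(a)
        z_list.append(z)
    return a_list, z_list

# A 2-3-1 network with arbitrary example weights.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]], [0.0, -1.0, 0.5]),  # 2 inputs -> 3 hidden
    ([[1.0, -1.0, 0.5]], [0.1]),                                 # 3 hidden -> 1 output
]
a_list, z_list = forward([1.0, 1.0], layers)
print(a_list[0])     # inputs to the hidden layer
print(z_list[-1])    # network output
```

The shapes make the dimension remark concrete: the first $W$ has as many columns as $X$ has entries, the last $W$ has as many rows as $Y$ has entries, and the sizes in between are free design choices.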

At the last layer, the outputs are usually normalized with a soft-max function, but for simplicity this is not included in this article; readers can add it to the model described here with a little research.
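
For readers who want a starting point, a minimal soft-max sketch is given here. It is not part of the model built in this article; the function name `softmax` and the max-subtraction step (a common trick to avoid overflow in `math.exp`) are choices made for this illustration.

```python
import math

def softmax(v):
    """Map a list of real scores to positive values that sum to 1."""
    m = max(v)                              # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Subtracting the maximum does not change the result, because the common factor $e^{-m}$ cancels between numerator and denominator.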

Backward Propagation

Perceptrons were researched intensively in earlier decades, because a perceptron is a simplified NN with no hidden layers, and this mathematical simplicity makes it easy to train with gradient descent. In contrast, researchers struggled for a long time to find the partial derivative of the loss function with respect to each parameter of an NN model, until the “Backward Propagation” approach was worked out in the late twentieth century. Although it took extensive effort to find, the approach is very simple in theory. It rests on the chain rule of partial derivatives,
$$\frac{\partial a}{\partial c}=\frac{\partial a}{\partial b}\frac{\partial b}{\partial c}$$
and on an auxiliary variable, $\delta$, introduced to tie the derivation together.

We first define the layer-specific variable $\vec{\delta_i}$ as
$$\vec{\delta_i}=\frac{\partial L}{\partial \vec{a_i}}$$
Using this variable, we can compute
$$\begin{aligned} \frac{\partial L}{\partial W_i}&=\frac{\partial L}{\partial \vec{a_i}}\frac{\partial \vec{a_i}}{\partial W_i}\\ &=\vec{\delta_i}\frac{\partial \vec{a_i}}{\partial W_i} \\ \frac{\partial L}{\partial \vec{b_i}}&=\frac{\partial L}{\partial \vec{a_i}}\frac{\partial \vec{a_i}}{\partial\vec{b_i}}\\ &=\vec{\delta_i}\frac{\partial \vec{a_i}}{\partial\vec{b_i}} \end{aligned}$$
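
The chain-rule factorization can be checked numerically on a single layer. The sketch below rests on assumptions that go beyond the text so far: the layer is the last one, every neuron uses the sigmoid, and the loss is $L=\frac{1}{2}\sum_k (z_k-y_k)^2$, so that $\delta_k=(z_k-y_k)\,S'(a_k)$. Because $\vec{a}=W\vec{z}_{prev}+\vec{b}$, the remaining factors are $\partial a_k/\partial w_{kj}=z_{prev,j}$ and $\partial a_k/\partial b_k=1$. All function names and numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_input(W, b, z_prev):
    """a = W z_prev + b for one layer."""
    return [sum(w * z for w, z in zip(row, z_prev)) + b_k for row, b_k in zip(W, b)]

def loss(W, b, z_prev, y):
    """L = 1/2 * sum_k (sigmoid(a_k) - y_k)^2 (assumed loss for this sketch)."""
    a = layer_input(W, b, z_prev)
    return 0.5 * sum((sigmoid(a_k) - y_k) ** 2 for a_k, y_k in zip(a, y))

def gradients(W, b, z_prev, y):
    """dL/dW and dL/db via delta = dL/da and the chain rule."""
    a = layer_input(W, b, z_prev)
    z = [sigmoid(a_k) for a_k in a]
    # delta_k = (z_k - y_k) * S'(a_k), with S'(a_k) = z_k * (1 - z_k)
    delta = [(z_k - y_k) * z_k * (1.0 - z_k) for z_k, y_k in zip(z, y)]
    dW = [[d_k * z_j for z_j in z_prev] for d_k in delta]  # da_k/dw_kj = z_prev_j
    db = list(delta)                                       # da_k/db_k = 1
    return dW, db

W = [[0.2, -0.4], [0.7, 0.1]]
b = [0.05, -0.3]
z_prev = [0.6, 0.9]
y = [1.0, 0.0]
dW, db = gradients(W, b, z_prev, y)

# Central finite differences on one weight and one bias as a cross-check.
h = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += h
W_minus = [row[:] for row in W]
W_minus[0][1] -= h
numeric_dW01 = (loss(W_plus, b, z_prev, y) - loss(W_minus, b, z_prev, y)) / (2.0 * h)

b_plus = b[:]
b_plus[1] += h
b_minus = b[:]
b_minus[1] -= h
numeric_db1 = (loss(W, b_plus, z_prev, y) - loss(W, b_minus, z_prev, y)) / (2.0 * h)

print(dW[0][1], numeric_dW01)
print(db[1], numeric_db1)
```

Agreement between the analytic and finite-difference values suggests that, at least under these assumptions, the gradient really does split into $\delta$ times a local derivative of $\vec{a}$.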

