Adventure of the Neurons: Theory Behind the Neural Networks

Neural networks are widely used in artificial intelligence. Thanks to their scalability and flexibility, they can be a solution for most predictive problems: regression, classification, forecasting, object recognition, speech recognition, NLP, and so on. So what are neural networks, what makes them capable of solving all these problems, and how do they learn to make predictions? To understand this, we need to know more about neurons and the math behind them. In this article I am going to explain how neurons learn from given data and how they are used in prediction. We will inspect the theory behind neural networks and the training process from A to Z. You will get the answers to the following questions in this article:

  • What are neurons and what are their duties?
  • How are neurons connected in a neural network?
  • How do neural networks make predictions from given data?
  • What happens in each iteration in the network?
  • What happens after each iteration?
  • What changes happen in a neural network during training?
  • When should the training process stop?

Neurons in The Network

Neurons are the essential components of a neural network; a neural network is formed by connecting neurons to one another. Unlike the human nervous system, neurons in a neural network are connected to each other through layers. If every neuron were connected to every other neuron, processing time would be extremely high. To reduce processing time and save the computer's computational power, layers are used: every neuron in a layer is connected only to the neurons in the next layer. The figure below represents the basic structure of a neural network with three layers.

[Figure: basic structure of a neural network with three layers (input, hidden, and output)]

As you can see from the figure above, there are three layers: input, hidden, and output. These are the main layers of a neural network. The neurons in the input layer represent the variables in the dataset. The neuron in the output layer represents the final predicted value after the input values pass through every neuron in the hidden layer. While there is only one input layer and one output layer, the number of hidden layers can be increased. The performance of a neural network therefore depends on the number of layers and the number of neurons in each layer: the predictive performance of a network with 3 hidden layers of 3 neurons each may differ from that of a network with a single hidden layer containing a single neuron. So, what does the network look like if we add another hidden layer with 3 neurons? The figure below shows a network with 2 hidden layers.

[Figure: a neural network with two hidden layers]

We can add more hidden layers and neurons to the network; there is no limit. But we shouldn't forget that adding more layers and neurons doesn't mean our network will predict better. Moreover, when we increase the number of hidden layers and neurons, the training time increases because of the calculations performed in each neuron. What we need to do is find the best structure for our network.
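
To make the layer idea concrete, here is a minimal sketch (not from the original article) that represents a fully connected network as one weight matrix per pair of adjacent layers; the layer sizes below are purely illustrative.

```python
import numpy as np

# Hypothetical architecture: 2 inputs, two hidden layers of 3 neurons, 1 output.
layer_sizes = [2, 3, 3, 1]

rng = np.random.default_rng(42)

# One weight matrix and one bias vector per pair of adjacent layers.
# weights[k] has shape (neurons in layer k, neurons in layer k + 1), so every
# neuron in a layer is connected to every neuron in the next layer.
weights = [rng.uniform(0, 1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.uniform(0, 1, size=n) for n in layer_sizes[1:]]

for k, (w, b) in enumerate(zip(weights, biases)):
    print(f"layer {k} -> layer {k + 1}: weights {w.shape}, biases {b.shape}")
```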

Feeding The Neurons

Neural networks work over iterations, and every iteration trains the model to get closer to the best prediction. Feeding the neurons is the main movement that trains our network; in neural networks this is called "feed forward". It means taking data from the previously connected neurons, doing calculations with that data, and sending the result to the next connected neurons. When the final calculation is done in the output neuron, another observation is taken from the data set and fed forward again. This process keeps going until the network's prediction is close enough to the actual value being predicted. The most important thing here is what calculations are performed in the neurons. We can basically call these calculations weighting. We shouldn't forget that we are going to make predictions using this network, so the neurons need to be weighted to make the best prediction.

Okay, let’s inspect this feeding process in depth. Let’s assume that we have a neural network with 1 input layer with 2 input neurons, 1 hidden layer with 2 neurons and 1 output layer with 1 neuron. So, first we can inspect the feeding process of a neuron in the hidden layer;

[Figure: the feeding process of a single hidden neuron: X1 and X2 are weighted by W1 and W2, summed together with the bias b, and passed through the activation function]

In the figure above, there are two neurons in the input layer. It means we have two variables in our data set that we are going to use for training. As you can see, every connection between the input neurons and the neuron in the hidden layer has a "W" value. These "W" values are the weights a single neuron keeps for every neuron that feeds it. At the first step, the "W" and "b" values can be generated randomly between 0 and 1. During the iterations (steps) these values will be updated in order to reach the best predicted value. We will see these iterations in the next sections.

The H1 neuron is fed by the X1 and X2 neurons, so it has weights W1 and W2 for X1 and X2. If you look at the calculation part, each value coming from the input layer is multiplied by its weight and the results are summed. There is also one more value, represented as "b". This value is the intercept of the neuron, also called the bias, and it is added to the weighted sum. After the summation, the result passes through the activation function, which performs a kind of transformation. The value obtained from the activation function is the final value of this neuron at the current step.
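
As a quick sketch of this computation in Python (the weight and bias numbers below are made up for illustration, not taken from the article's figure):

```python
import numpy as np

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the incoming values plus the bias, then the activation."""
    return activation(np.dot(inputs, weights) + bias)

x = np.array([5.0, 7.0])   # values coming from X1 and X2
w = np.array([0.3, 0.6])   # W1 and W2 stored in the H1 neuron (illustrative)
b = 0.1                    # the intercept / bias of H1 (illustrative)

h1 = neuron_output(x, w, b, activation=lambda z: z)  # identity activation
print(h1)  # 0.3 * 5 + 0.6 * 7 + 0.1 = 5.8
```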

Activation Functions

There are different activation functions that we can choose from according to the value we want to predict. Let's assume we want to predict 2 classes, labeled 0 and 1. What we need is a probability value between 0 and 1 in order to decide the predicted class: if the predicted value is less than 0.5 (meaning it is closer to class 0) we say the prediction is 0; if it is greater than 0.5 we say the prediction is 1. But without an activation function it is possible to get predicted values outside the 0-1 range. To sort out this problem, we can use an activation function that keeps neuron values between 0 and 1. This activation function is called "Sigmoid". Other commonly used activation functions are presented in the figure below.

[Figure: commonly used activation functions: Sigmoid, Hyperbolic Tangent, ReLU, and Identity]

I already explained when we should use the sigmoid function. If you look at the figure above, there are also other activation functions. For example, the hyperbolic tangent function can be used when we need an output between -1 and 1, and the ReLU function can be used when the output should not be less than 0. The identity function returns the output as it was: the values calculated in the neurons stay unchanged. One of these functions should be chosen according to the variable that is going to be predicted.
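
For reference, a minimal sketch of these four activation functions in Python:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # negative values become 0

def identity(z):
    return z                          # returns the value unchanged

z = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, identity):
    print(f.__name__, f(z))
```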

Example Case for Training Network

Okay! We can set up the whole network with an example case. Let's assume we are going to train a network to predict tree ages using the length and the width of each tree. It means we will use two variables as predictors and one variable as the target. The data set that is going to be used for training is shown below.

[Figure: the training data set with Width, Length, and Age columns; the first observation is Width = 5, Length = 7, Age = 5]

As mentioned before, neural networks get trained over iterations. In every iteration (step) the network takes samples from the data set and does the calculations in the neurons. It can take a single observation or a small batch. In this example we will take one sample (observation) per step, and it will be the first observation of our dataset.

We should also decide which activation function is going to be used in the network. As mentioned, it depends on what you want to predict with the network. If we look at our data set, age, the dependent variable, contains numeric values: we won't predict the type of the trees, we will predict how old they are. This leads us to activation functions whose output is linear. The first option is ReLU and the second option is Identity, because these functions do not transform the values to between 0 and 1. The difference between them is that the Identity function returns the input value as it was, while the ReLU function returns 0 if the input is negative and the value itself otherwise. Since age cannot be negative, using the ReLU function would make sense, but I want to keep this tutorial as simple as possible. That's why I will use the Identity function as the activation; it is much easier to follow in this format. It sounds weird, but using the Identity function effectively means not using any activation function at all. Therefore, in the following figures I didn't add any activation function step (it returns the calculated value as it was).

The figure below represents the neural network and the calculations for the first iteration. As you can see, there are two neurons in the input layer, "Width" and "Length". Because it is the first iteration, the values of "Width" and "Length" come from the first observation, 5 and 7. There are also 2 neurons in the hidden layer and one neuron in the output layer. If you look carefully, every connection between neurons has a "w" weight value. As mentioned before, weights can be chosen randomly between 0 and 1; they can be outside this range, but it is better to set them between 0 and 1 for the first step. In the following iterations they are going to be changed anyway. The values of these weights, which have been randomly assigned to the connections, are also shown above the neurons. We sometimes speak of "weights for every connection", but in practice they are stored in the neurons, not assigned to the connections: every neuron stores, in a matrix structure, a "w" value for every connection that comes from the previous layer. As noted above, we could also have used the ReLU activation function, since we want to predict tree ages and ages cannot be negative, but to keep the example simple I decided to use the Identity activation function, so there is no activation function step in the figure below (the identity function does not transform the values calculated in the neurons; it keeps them as they are).
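
A minimal sketch of this first feed-forward pass is shown below. The weights and biases here are generated randomly, so they are not the ones in the figure and the resulting prediction will not match the 6.25 obtained there.

```python
import numpy as np

def identity(z):
    return z

# First observation of the data set: Width = 5, Length = 7.
x = np.array([5.0, 7.0])

# Randomly initialized weights and biases between 0 and 1 (illustrative only).
rng = np.random.default_rng(0)
W_hidden = rng.uniform(0, 1, size=(2, 2))   # w1..w4, stored in H1 and H2
b_hidden = rng.uniform(0, 1, size=2)        # biases of H1 and H2
W_output = rng.uniform(0, 1, size=(2, 1))   # w5 and w6, stored in O1
b_output = rng.uniform(0, 1, size=1)        # bias of O1

h = identity(x @ W_hidden + b_hidden)             # values of H1 and H2
y_prediction = identity(h @ W_output + b_output)  # value of O1, the predicted age
print(y_prediction)
```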

[Figure: the network and calculations for the first iteration, with inputs Width = 5 and Length = 7, randomly assigned weights, and the predicted value 6.25]

After the calculations in the neurons, 6.25 was found as the final result of the first iteration. Technically, we can say that it is the first predicted "Age" value for the given first observation's Width and Length values. The next thing we should do is compare this result with the actual Age of that observation. The age of the tree in the first observation is 5. It means we are 1.25 away from the actual value, and we need to update the neurons' weights according to this distance. This process is called back-propagation and we will see it in the next section. But before that, I want to show you what would happen if we used 2 hidden layers with 2 neurons each in this artificial neural network, because you might wonder how the calculation would work in such a scenario.

[Figure: the same example with two hidden layers of two neurons each; the predicted value becomes 5.68]

In the figure above, you can see that one more layer has been added to the previous network. Accordingly, our predicted value changed to 5.68, as expected. This was just an example; we will use the previous network structure in the following sections.

Upgrading Neurons to Up Levels

This may be the most important section of this article, because I will explain how neurons' weights get changed to make good predictions. I named this section "Upgrading Neurons to Up Levels" because after every iteration in the network we change the weight values of the neurons. In other words, we upgrade them to a higher level to make better predictions.

As mentioned in the previous section, neurons' weights and biases (intercepts) need to be updated after each iteration by considering the difference between the predicted and actual value. This process is called back-propagation in artificial neural networks, and the algorithm we are going to use for it is Stochastic Gradient Descent. During training, back-propagation has to be completed after every feed-forward pass (iteration); it is with the help of back-propagation that networks get trained. The following figure represents the back-propagation process.

[Figure: the back-propagation process: the loss computed at the output is propagated back to update the weights and biases]

To update the weights and biases, we first have to find the difference between the actual and predicted value. Two functions are often used for calculating this difference: Mean Squared Error and Cross-Entropy. In some cases the Sum of Squared Errors is used too. These functions are generally called loss functions or cost functions. While Mean Squared Error is usually preferred for regression problems, Cross-Entropy is preferred for classification problems. In this example we are trying to estimate tree ages, not types of trees; that's why we can use Mean Squared Error.

The Mean Squared Error function is defined as:

MSE = (1/n) · Σ (Y_true - Y_prediction)²   (summed over the n observations)

Now we can calculate the difference between the actual value and the predicted value in the previous example using MSE. The predicted age for the first observation was 6.25 and the actual age was 5. Because we take only one sample per iteration, "n" is equal to 1.

MSE = (1/1) · (5 - 6.25)² = (-1.25)² = 1.5625

According to the calculation above, the MSE is 1.5625. We can use the term loss for this value. If we had taken 32 observations in each iteration, we would have had to calculate the MSE from the predicted and actual values of all 32 observations.
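
As a small sketch, the loss can be computed like this; the single-observation case reproduces the 1.5625 from the example:

```python
import numpy as np

def mse(y_true, y_prediction):
    """Mean Squared Error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_prediction = np.asarray(y_prediction, dtype=float)
    return np.mean((y_true - y_prediction) ** 2)

print(mse([5.0], [6.25]))   # one observation per iteration (n = 1) -> 1.5625
# With a batch of 32 observations, the same call would average 32 squared errors.
```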

Okay, we have the loss; what's next? Now we have to figure out how to change the weights and biases by considering their effect on the loss, so we need a method that can do this. In math, this method is the partial derivative. The partial derivatives of the loss function L (MSE) with respect to the weights and biases tell us how the weights and biases influence the loss, which is exactly what we are looking for. We need to perform the following partial derivative operations:

[Figure: the partial derivatives of the loss L with respect to each weight and bias]

The figure above shows the partial derivatives for each weight and bias. Once we have the partial derivatives, we multiply each one by the learning rate and subtract the result from the corresponding weight. The learning rate controls how much the weights change: with a high learning rate the weights change much more than with a low one. Therefore, we need to set the optimum learning rate for our network.
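
In symbols, the update described above is the standard gradient-descent step, with η denoting the learning rate:

    w_new = w_old - η · ∂L/∂w        b_new = b_old - η · ∂L/∂b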

Now it is time to learn how to find the partial derivatives. The math goes a little deep here, and if you don't know anything about partial derivatives you can watch this video. We will go through the derivative operations for W1 (dL/dW1). To find this partial derivative we can apply the chain rule, as shown in the formula below. H1 represents the value of the H1 neuron after the weighting operation, Yprediction represents the result of the O1 neuron, and w1 is the weight stored in the H1 neuron for the tree Width (the first neuron in the input layer).

∂L/∂w1 = (∂L/∂Y_prediction) · (∂Y_prediction/∂H1) · (∂H1/∂w1)

To evaluate the expression above, we need the formulas for Yprediction, H1 and the loss L; then we can obtain all of the partial derivatives. If we look at the first term, the YPrediction formula is the final calculation of the predicted value in the O1 neuron.

Y_prediction = w5 · H1 + w6 · H2 + b3   (the weighted sum computed in the O1 neuron)

And the formula used to compute the value of the H1 neuron is:

H1 = w1 · X1 + w2 · X2 + b1   (the weighted sum computed in the H1 neuron)

The last formula we need is the loss function MSE (L). We shouldn't forget that we take only 1 observation in each step, so "n" is equal to 1:

L = MSE = (Y_true - Y_prediction)²   (since n = 1)

Okay! Now we are good to go, and we can take the partial derivatives one by one, starting with the first one, "dL/dYPrediction":

∂L/∂Y_prediction = -2 · (Y_true - Y_prediction)

The partial derivative operation has been performed above; the basic idea is to find how the loss is influenced by Yprediction. As mentioned before, you can watch this video to understand these operations. The next partial derivative is "dYpred/dH1". Normally, H1 is a function that returns its value through an activation function, and in such cases we would have to take the derivative of H1 as well. But in this case we used the identity function, and the derivative of the identity function f(x) = x is equal to 1, so we can treat H1 as if it were just a parameter of YPrediction. If we had used the Sigmoid function, we would have had to take its derivative (that is another scenario).

∂Y_prediction/∂H1 = w5

The last derivative is “dH1/dw1”;

∂H1/∂w1 = X1

Now we have all the pieces needed to find the partial derivative of the loss with respect to the "W1" weight. All we need to do is bring them together.

∂L/∂w1 = -2 · (Y_true - Y_prediction) · w5 · X1

The next thing we should do is place the values of YTrue, YPrediction, w5 and x1 into the formula above:

[Figure: evaluating ∂L/∂w1 with Y_true = 5, Y_prediction = 6.25, X1 = 5, and the current value of w5]
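
If you want to double-check this chain-rule result symbolically, here is a small sketch using SymPy. The bias names b1 and b3 are my own labels for the H1 and O1 intercepts; they do not appear explicitly in the figures.

```python
import sympy as sp

x1, x2, w1, w2, w5, w6, b1, b3, y_true = sp.symbols(
    "x1 x2 w1 w2 w5 w6 b1 b3 y_true")
H2 = sp.Symbol("H2")                    # value of the second hidden neuron

H1 = w1 * x1 + w2 * x2 + b1             # hidden neuron H1 (identity activation)
y_pred = w5 * H1 + w6 * H2 + b3         # output neuron O1 (identity activation)
L = (y_true - y_pred) ** 2              # MSE with n = 1

# The chain rule gives dL/dw1 = -2 * (y_true - y_pred) * w5 * x1; the
# difference below simplifies to 0, confirming the hand derivation.
print(sp.simplify(sp.diff(L, w1) - (-2 * (y_true - y_pred) * w5 * x1)))
```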

Stochastic Gradient Descent

Yes! We now know how the w1 value influences the loss. The next thing we should do is decide how to change the "W1" value in order to minimize the loss. For this we'll use the algorithm called Stochastic Gradient Descent (SGD); the partial derivative operation we did in the previous section is actually part of the gradient computation. We can basically say that we train our network by using back-propagation as a stochastic gradient computing technique. In SGD we take single samples from the data, pass them through the network, and find out how to change the weights and intercepts in order to minimize the loss. This happens in every iteration until we reach the minimum threshold: if after a while we can't get a decrease in the loss larger than the threshold value, we can stop training. To understand this, see the following figure:

[Figure: stochastic gradient descent: the loss decreases quickly in the early iterations and then flattens out as the weights approach the values that minimize it]

In SGD, after we find the partial derivatives for all weights and intercepts, we multiply them by the learning rate and subtract the results from the old weights and intercepts. So far we have only found the partial derivative for "w1", so we can do the following operations to find the new, updated "w1" value:

w1_new = w1 - (learning rate) · ∂L/∂w1 = -0.1125

As you can see from the operations above, our new W1 value is -0.1125. It means we will replace the old "W1" value (which is stored in the H1 neuron) with -0.1125 and use this value in the next iteration. We have only found the new W1 value so far; we must also find the new values of the other weights and intercepts. The figure below shows the partial derivatives for the other weight and intercept values:

[Figure: the partial derivatives of the loss with respect to the remaining weights and intercepts]

Now we can calculate the new weights to complete the first back-propagation pass:

[Figure: the updated weight and bias values after the first back-propagation pass]

We have all the new "w" and "b" values after the first back-propagation. Let's look at our network with the new values and its result after the feed-forward step:

[Figure: the network with the updated weights and biases; the next feed-forward pass predicts -3.33]

As you can see from the result, the predicted value is now -3.33, which is negative. Do we need to worry? The answer is no! We have only completed the first iteration with the first sample of the data set; there are more samples in our data that the network has to process, and neural networks require many more iterations with other samples to get trained. The main questions might be: how many iterations have to be completed, or when should our training process stop? The answer is hiding behind SGD.
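
Putting the whole loop together, here is a compact sketch of repeated feed-forward and back-propagation with SGD for this fully linear (identity-activation) network. The extra data rows and the learning rate are made up for illustration; only the first observation (5, 7 -> 5) comes from the article.

```python
import numpy as np

# Toy data set: Width, Length -> Age. Only the first row is from the article.
X = np.array([[5.0, 7.0], [4.0, 6.0], [7.0, 9.0]])
y = np.array([5.0, 4.0, 7.0])

rng = np.random.default_rng(1)
W1 = rng.uniform(0, 1, size=(2, 2)); b1 = rng.uniform(0, 1, size=2)
W2 = rng.uniform(0, 1, size=(2, 1)); b2 = rng.uniform(0, 1, size=1)

lr = 0.001                                   # learning rate (illustrative)
for step in range(10000):
    i = step % len(X)                        # one observation per iteration
    x, y_true = X[i], y[i]

    h = x @ W1 + b1                          # hidden layer, identity activation
    y_pred = (h @ W2 + b2)[0]                # output neuron, identity activation
    loss = (y_true - y_pred) ** 2            # MSE with n = 1

    # Back-propagation (gradients of the loss for every weight and bias):
    dL_dy = -2.0 * (y_true - y_pred)
    dL_dW2 = np.outer(h, [dL_dy]);  dL_db2 = np.array([dL_dy])
    dL_dh = dL_dy * W2[:, 0]
    dL_dW1 = np.outer(x, dL_dh);    dL_db1 = dL_dh

    # SGD update: subtract the learning-rate-scaled gradients.
    W2 -= lr * dL_dW2;  b2 -= lr * dL_db2
    W1 -= lr * dL_dW1;  b1 -= lr * dL_db1

print("loss on the last sample:", loss)
```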

When should the training process stop?

The purpose of the iterations is to reach a threshold of minimum decrease in the loss. We should define this threshold value before the training phase of our network. Let's say we define the threshold as 0.05 and the maximum number of iterations as 10000: we keep iterating, up to 10000 times, until the loss stops decreasing by more than 0.05. The decrease in loss will be larger in the first iterations, but after a while it won't change that much, and when the network reaches the minimum threshold, training stops.

For example, at iteration 8909 we took a sample, passed it through the network, and found a loss of 0.3559; at iteration 8910 we took another sample and found a loss of 0.3315; at iteration 8911 we took another sample and found a loss of 0.35080. The average decrease in loss here is between 0.0244 and 0.0026, which is less than our threshold of 0.05, so we can stop training. Alternatively, we can define a minimum error that stops the training of our network: the network keeps training until it reaches a given minimum error, a.k.a. loss (and of course we also define a maximum number of iterations).
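
A sketch of this stopping rule in Python; the train_one_step function is a hypothetical stand-in for one feed-forward plus back-propagation pass, and its loss values are synthetic.

```python
def train_one_step(iteration):
    """Hypothetical stand-in for one feed-forward + back-propagation pass.
    It just returns a synthetic, gradually flattening loss curve."""
    return 2.0 / (1.0 + 0.2 * iteration)

threshold = 0.05          # minimum decrease in loss we still care about
max_iterations = 10000    # hard cap on the number of iterations

previous_loss = None
for iteration in range(max_iterations):
    loss = train_one_step(iteration)
    if previous_loss is not None and previous_loss - loss < threshold:
        print(f"stop at iteration {iteration}: loss {loss:.4f}, "
              f"decrease smaller than {threshold}")
        break
    previous_loss = loss
```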

There is one more problem we need to worry about: overfitting. This problem occurs when our ML model performs perfectly on the training set but does not perform well on new data (the test data). It means our network can reach a minimal loss after thousands of iterations yet still perform poorly on the test set that we keep out of the training set to evaluate performance. In such cases we can split off a validation part from our data and monitor the loss on that part too. If the loss on the validation set keeps increasing while the training loss keeps decreasing, we should stop training. We can see this in the figure below:
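
One common way to implement this check is early stopping with a small "patience" (a term not used in the article): keep training while the validation loss improves, and stop after it has failed to improve a few times in a row. A sketch with made-up loss values:

```python
patience = 3                         # how many worsening checks we tolerate
best_validation_loss = float("inf")
checks_without_improvement = 0

history = [(0.90, 0.95), (0.60, 0.70), (0.40, 0.55), (0.30, 0.53),
           (0.22, 0.58), (0.17, 0.66), (0.13, 0.74)]   # (train, validation)

for check, (train_loss, validation_loss) in enumerate(history):
    if validation_loss < best_validation_loss:
        best_validation_loss = validation_loss
        checks_without_improvement = 0
    else:
        checks_without_improvement += 1
        if checks_without_improvement >= patience:
            print(f"stop at check {check}: training loss still falls "
                  f"({train_loss}), but validation loss keeps rising")
            break
```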

[Figure: training loss keeps decreasing while validation loss starts to increase; that is the point where training should stop to avoid overfitting]

Conclusions

In this article we learned how the training process of a neural network model works. Of course, it was just a basic example with a basic network structure; the purpose of this article was to explain the idea and the math behind neural networks. More complex network structures can be built, and with the help of programming languages the math operations in these structures can be done easily, although they will require much more computational power. If you have knowledge of a programming language, you can apply these steps and create a neural network model from scratch.

What about afterwards? Once you fully understand the operations in this article, you can go further: check other back-propagation and gradient descent approaches, and different types of neural networks such as CNNs, RNNs, LSTMs, and so on. Each algorithm or approach has been developed to solve different problems.

I hope it was helpful, please don’t hesitate to ask questions…

Translated from: https://medium.com/@sergencansiz/adventure-of-the-neurons-theory-behind-the-neural-networks-5d19c594ca16
