Demystified Back-propagation in Machine Learning: The Hidden Maths You Want to Know About

By Ibrahima “Prof” Traore, ML expert at Wildcard / w6d.io

In this article, you will learn different math concepts such as gradient descent, derivatives, matrix, chain rule, and how to use those concepts to explain and solve some back propagation examples from scratch (in artificial neural network). Neural networks are a family of powerful machine learning models. This technology has been proven to excel at solving a variety of complex problems in engineering, science, finance, market analysis and many more.

Note: Basic knowledge about derivatives is recommended to get the best out of this article.

There are two directions in which information flows in a neural network:

  • Forward propagation (also called forward pass or inference)

  • Backward propagation

The first one refers to the calculation and storage of intermediate variables (as inputs and outputs) in a neural network. The second one, Back propagation (short for backward propagation of errors) is an algorithm used for supervised learning of artificial neural networks using gradient descent.

This article will be divided into three main parts:

  • The hidden math you need for back propagation

  • Forward propagation in artificial neural network

  • Back propagation in artificial neural network

Part I: The Hidden Math You Need for Back-propagation

(Figure: the error plotted against a single weight w, with the tangent line at the current point.)

The goal of training a model is to find a set of weights proven to be good, or good enough, at solving the specific problem. Therefore, we must find weights that result in a minimum amount of errors or losses when evaluating the examples in the training dataset. To fulfill this task, we will use the above derivative.

Prior to starting the training, parameters are usually generated randomly. We need derivatives to adjust them so that the global error becomes minimal. That way, the weights and biases end up well adapted to making good predictions. The derivative shows us the direction to take, the adjustment to apply to every weight and bias, which parameters to bring down, which value to subtract, how much to add, and when to stop. In the above graph, our goal is to find the weight value that reduces the error to its minimum.

The red line is in fact the tangent line. Without going into deeper formulas and demonstrations, here is how we correct the weights: in each iteration (epoch), the current weight is updated (as shown below). Consider the following function (you can see its representation above):

(Equation: the example function of the single weight w whose graph is shown above.)

Of course, we could differentiate this function, set the derivative to zero, and be done.

(Equation: the derivative of the function set equal to zero.)

The above function only depends on one variable, ‘w’, but in deep learning, functions can depend on many variables. In this case, the above method can be difficult to solve. That’s the situation where we will use a Gradient descent algorithm.

So, as I already mentioned, let's take a random weight as the initial value, say w0 = 10.0. We will look for the direction to take (by “direction” I mean whether to decrease or increase the weight), and we will again use the derivative, but in another way. The step of each move (the learning rate) will be arbitrarily set to 0.01 (see the step of each movement in the picture below).

(Figure: the successive weight values produced at each gradient-descent step.)

This formula, based on the tangent at a particular point, simply means: in each iteration (each step), the current weight value is obtained from the previous one by subtracting the quantity on the right of the minus sign in the formula above. In this picture, you can see the different values of the adjusted weights. Note that in an artificial neural network, we multiply this quantity by the learning rate (step value).

The above graph is based on this technique. This is an example of gradient descent. In a real neural network, the graph will be the error curve to be minimized and the variable `w` will be a network weight.

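To make this concrete, here is a minimal Python sketch of the idea. The error function below is a stand-in (the article's exact function is only shown as an image); the initial weight w0 = 10.0 and the learning rate 0.01 match the values quoted in the text.

```python
# Minimal gradient-descent sketch. The error function is a hypothetical
# stand-in with its minimum at w = 3; only w0 and the learning rate come
# from the article.

def error(w):
    return (w - 3.0) ** 2          # assumed error curve, minimum at w = 3

def error_grad(w):
    return 2.0 * (w - 3.0)         # its derivative

w = 10.0                           # initial weight w0 = 10.0
lr = 0.01                          # learning rate (step) = 0.01

for epoch in range(1000):
    w = w - lr * error_grad(w)     # w_new = w_old - lr * dE/dw

print(round(w, 4))                 # approaches 3.0, where the error is minimal
```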

(Figure: the gradient-descent steps converging towards the minimum of the error curve.)

Derivatives of some neural network activation functions we will use in our model

  1. Relu (activation function)

We will use this activation function for the first hidden layer. It simply converts negative values to zero; the positive values remain unchanged.

relu(x) = max(0, x)

Example:

relu([ 0.93412086, -0.89987134, 0.07139904, 0.63705336]) = [0.93412086, 0. , 0.07139904, 0.63705336]

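In numpy, this behaviour can be reproduced in one line. A small sketch (the article does not show its own implementation):

```python
import numpy as np

def relu(x):
    # element-wise max(0, x): negatives become 0, positives pass through
    return np.maximum(0.0, x)

x = np.array([0.93412086, -0.89987134, 0.07139904, 0.63705336])
print(relu(x))   # -> [0.93412086 0.         0.07139904 0.63705336]
```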

2. Tanh (hyperbolic tangent activation function)

In math we have circular trigonometry (the familiar sin, cos, tan functions we saw at school) and hyperbolic trigonometry (cosh, sinh, tanh, where the h stands for hyperbolic). Every hyperbolic trigonometric function is built from the exponential function. So the hyperbolic tangent is:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Similar to the sigmoid activation function, the advantage of tanh is that the negative inputs will be mapped strongly negative and the zero inputs will be mapped near zero in the tanh graph. The derivative is:

tanh′(x) = 1 − tanh²(x)

3. Softmax (activation function)

We will use this function in our model's output layer. In general, softmax is the output layer's activation function, used to obtain probabilities at the output.

softmax(x_i) = e^(x_i) / (e^(x_1) + e^(x_2) + … + e^(x_n)),  for i = 1, …, n

Example: if there are 3 inputs, n equals 3, we can simply write:

s1 = e^(x_1) / (e^(x_1) + e^(x_2) + e^(x_3))
s2 = e^(x_2) / (e^(x_1) + e^(x_2) + e^(x_3))
s3 = e^(x_3) / (e^(x_1) + e^(x_2) + e^(x_3))

which leads us after developing to:

(Equations: the three softmax outputs written out after expanding the expressions.)
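A small numpy sketch of softmax, for reference. The subtraction of the maximum is a standard numerical-stability trick added in this sketch, not something shown in the article:

```python
import numpy as np

def softmax(x):
    # exponentiate and normalise so the outputs sum to 1
    e = np.exp(x - np.max(x))   # subtracting max(x) avoids overflow, result unchanged
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
s = softmax(x)
print(s, s.sum())   # three probabilities that sum to 1.0
```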

The Chain Rule

A neural network model is composed of layers, and each layer has its activation function, like the ones we just talked about. From the input layer to the output layer, parameters pass through these activation functions.

A given layer's output becomes an input for the nodes of the next layer. We end up with what we call composite functions.

As an example, let's define a simple math function called “g” which depends on three variables x, y, z. Let's suppose g(x, y, z) = (2x + y) * z. For example, if x = 1, y = -5, z = 7, the result will be:

g(1, -5, 7) = (2*1 + (-5)) * 7
g(1, -5, 7) = (2 - 5) * 7
g(1, -5, 7) = -3 * 7
g(1, -5, 7) = -21

This is a typical example of a multivariable function. What about its derivatives with respect to the variables x, y, z?

Most frequently, the derivative of the “g” function (with respect to x, for example) is written g′. Instead, we will use the partial-derivative symbol. So all of the following expressions mean the same thing:

g′ = g′_x = ∂g/∂x

If we have a multivariable function (which is the situation we will meet almost all the time in neural networks), the derivatives with respect to each variable are called partial derivatives:

∂g/∂x,  ∂g/∂y,  ∂g/∂z

The rule of derivatives for multivariable functions is: “If you are differentiating with respect to one variable, all other variables must be considered as constant.” Back to our “g” function, the partial derivatives with respect to x, y and z are:

∂g/∂x = 2z,  ∂g/∂y = z,  ∂g/∂z = 2x + y
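These partial derivatives can be checked numerically with finite differences; the quick sketch below is an illustration added here, not part of the original article:

```python
def g(x, y, z):
    return (2 * x + y) * z

x, y, z = 1.0, -5.0, 7.0
h = 1e-6

# numerical partial derivatives: vary one variable, hold the others constant
dg_dx = (g(x + h, y, z) - g(x, y, z)) / h   # expected 2*z = 14
dg_dy = (g(x, y + h, z) - g(x, y, z)) / h   # expected z = 7
dg_dz = (g(x, y, z + h) - g(x, y, z)) / h   # expected 2*x + y = -3

print(round(dg_dx, 3), round(dg_dy, 3), round(dg_dz, 3))
```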

Now consider a neural network with an input layer, one hidden layer, and an output layer, respectively using the activation functions we have seen. Let's call them f = relu, g = tanh and h = softmax. From the input layer to the hidden layer, the input X is activated with the “f” function, so the hidden layer's entry is f(X). From the hidden layer to the output layer, the input (now f(X)) is activated through the “g” function, so the output layer receives the input g(f(X)); the output layer is then activated with its input (now g(f(X))) through the function “h”, and the network outputs h(g(f(X))). To make things short, let's try to differentiate this:

h(g(f(X)))

Let’s call this function “p”.

p(X) = h(g(f(X)))

By chaining the derivatives, each function must be differentiated with respect to its input. So the derivative of the “p” function is:

∂p/∂X = (∂h/∂g) · (∂g/∂f) · (∂f/∂X)

The “h” function's input is “g”, so we differentiate “h” with respect to “g”.

The “g” function's input is “f”, so we differentiate “g” with respect to “f”.

The “f” function's input is “X”, so we differentiate “f” with respect to “X”.

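The chain rule can be illustrated on plain numbers before moving to matrices. The sketch below composes the three kinds of activation on a single scalar and compares the product of the three derivatives with a finite-difference estimate. A sigmoid stands in for softmax, since softmax of a single scalar is not meaningful; the scalar set-up is only an illustration, not the article's network:

```python
import numpy as np

f = lambda x: np.maximum(0.0, x)          # relu
g = np.tanh                               # tanh
h = lambda x: 1.0 / (1.0 + np.exp(-x))    # scalar stand-in for softmax (sigmoid)

p = lambda x: h(g(f(x)))                  # p(X) = h(g(f(X)))

x = 0.5

# analytic chain rule: dp/dx = h'(g(f(x))) * g'(f(x)) * f'(x)
df = 1.0 if x > 0 else 0.0                # relu derivative
dg = 1.0 - np.tanh(f(x)) ** 2             # tanh derivative
dh = h(g(f(x))) * (1.0 - h(g(f(x))))      # sigmoid derivative
chain = dh * dg * df

# finite-difference estimate of the same derivative
eps = 1e-6
numeric = (p(x + eps) - p(x - eps)) / (2 * eps)

print(round(float(chain), 6), round(float(numeric), 6))   # the two values agree
```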

The input “X” can be images or any data extracted from audio files, from finance, from the weather, stream data like handwriting, covid19 symptoms… and the output “p(X)” can be for example the classification of an input, a new image, words, prediction or anything else…

This is the basis of back propagation in neural networks.

The neural network input X we mentioned is composed of many data points, which can be arranged in a certain order in something called a matrix.

A Little Bit About Matrices

Understanding matrices is necessary before diving into the math of back propagation. A matrix (plural: matrices) is a set of elements arranged in rows and columns so as to form a rectangular array.

w = [[w11, w12, w13, w14],
     [w21, w22, w23, w24],
     [w31, w32, w33, w34]]

It is good practice to represent data in a matrix. Here, “w” means weights (a commonly used term in neural networks) because we are dealing with parameters. w11 is the element at the intersection of row 1 and column 1. Suppose this matrix comes from a model, sitting between the input layer and the first hidden layer.

Can you imagine the number of input nodes and the number of this hidden layer's nodes from this matrix?

Can you figure it out? We have three rows, so this model has three inputs, and we have four columns, so there are 4 nodes in this hidden layer.

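In numpy terms, that reasoning is simply the shape of the array. A sketch with placeholder values (the actual weights are whatever the model learns):

```python
import numpy as np

# hypothetical weight matrix between a 3-node layer and a 4-node layer
w = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])

rows, cols = w.shape
print(rows, cols)        # 3 inputs feeding 4 hidden nodes
print(w[0, 0])           # w11: row 1, column 1
```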

Part II: Forward Propagation in Artificial Neural Network

(Figure: the network architecture, with input layer i, hidden layers j and k, and output layer o.)

Can you find the matrix we must get from the input layer i to the hidden layer j?

The above model's architecture has 2 nodes in the input layer, 2 hidden layers of 4 nodes each, and an output layer composed of 3 nodes.

As we said in the previous part, we will use the relu activation for the first hidden layer, tanh for the second hidden layer, and softmax for the output layer. We use softmax because we will deal with probabilities. Our learning rate will be:

lr = 0.01.

So let’s build matrices (see the matrix part of this article). Let’s consider simply two numbers as follows, instead of taking images or any other data we mentioned earlier:

inputs = i = [i1, i2] = [0.2, 0.1] = two numerical symptoms of covid19

and we want the desired output to be :

outputs = [o1, o2, o3] = [1.0, 0.0, 0.0] = the probability to have covid19

We have three outputs and two inputs. We will use the Python numpy library to perform the calculations. Each edge has its weight, and each node has an input value (entry value) and an output value (obtained by applying the activation function to its input). Every node and layer has a name (see the graphic above). During forward propagation, we generate the weights randomly to start (I use the numpy library). The first hidden layer's matrix (layer j): 4 nodes receive 2 inputs each.

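A sketch of that random initialisation with numpy. The seed and the drawn values are arbitrary illustrations; the article's figures show its own random draws:

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed, only for reproducibility

i = np.array([0.2, 0.1])           # inputs i1, i2

wij = rng.random((2, 4))           # input layer (2 nodes) -> hidden layer j (4 nodes)
bj  = rng.random(4)                # one bias per j node

wjk = rng.random((4, 4))           # hidden layer j (4) -> hidden layer k (4)
bk  = rng.random(4)

wko = rng.random((4, 3))           # hidden layer k (4) -> output layer o (3)
bo  = rng.random(3)

print(wij.shape, wjk.shape, wko.shape)   # (2, 4) (4, 4) (4, 3)
```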

(Figure: the randomly generated weights on the edges between layer i and layer j.)

The above image gives:

(Matrix: wij, the 2 × 4 matrix of randomly generated weights from layer i to layer j.)

and the biases: there are 4 nodes in this j layer, so four biases:

(Matrix: bj, the four bias values of layer j.)

Note that the first row holds input i1's weights towards the j1, j2, j3 and j4 nodes, and the second row holds input i2's weights towards j1, j2, j3 and j4, according to our model. The second hidden layer's matrix (layer k): 4 nodes receive 4 inputs each.

(Figure: the randomly generated weights on the edges between layer j and layer k.)

The above image gives us:

(Matrix: wjk, the 4 × 4 matrix of randomly generated weights from layer j to layer k.)

and the biases: there are 4 nodes in this k layer, so four biases:

(Matrix: bk, the four bias values of layer k.)

The output layer's matrix (layer o): 3 nodes receive 4 inputs each.

(Figure: the randomly generated weights on the edges between layer k and the output layer o.)

gives the matrix:

(Matrix: wko, the 4 × 3 matrix of randomly generated weights from layer k to the output layer o.)

and the biases: there are 3 nodes in this o layer, so three biases:

(Matrix: bo, the three bias values of the output layer.)

Let’s calculate the forward propagation result by using a matrix operation.

j_inputs = i · wij + bj

j_inputs are:

(Matrix: the numerical values of j_inputs.)

Using the Relu activation function, layer j's output will be j_outputs. Here is how we proceed, using node j1 as an example:

(Equation: j1_output = relu(j1_input), computed with the numerical value of j1_input.)

j_outputs are:

(Matrices: the numerical values of j_outputs and of the next layer's inputs, k_inputs.)

Using the Tanh activation function, layer k's output gives k_outputs:

(Matrices: the numerical values of k_inputs and k_outputs.)

and using the Softmax activation function, o_outputs are:

o_outputs = [0.29228018, 0.03001013, 0.67770969]

The result of our model’s first forward propagation is : o_output = [0.29228018, 0.03001013, 0.67770969]. We expected [1.0, 0.0, 0.0]. Therefore we must adjust the parameters so that our result comes closer to [1.0, 0.0, 0.0]. In order to do that we must use back propagation to reduce the errors considerably.

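Putting the three steps together, the whole forward pass can be sketched as follows. It reuses the shapes from the initialisation sketch above; because the weights here are freshly drawn at random, the printed probabilities will differ from the article's [0.29228018, 0.03001013, 0.67770969]:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
i = np.array([0.2, 0.1])

wij, bj = rng.random((2, 4)), rng.random(4)
wjk, bk = rng.random((4, 4)), rng.random(4)
wko, bo = rng.random((4, 3)), rng.random(3)

# layer j: weighted sum, then relu
j_inputs  = i @ wij + bj
j_outputs = relu(j_inputs)

# layer k: weighted sum, then tanh
k_inputs  = j_outputs @ wjk + bk
k_outputs = np.tanh(k_inputs)

# output layer: weighted sum, then softmax -> three probabilities
o_inputs  = k_outputs @ wko + bo
o_outputs = softmax(o_inputs)

print(o_outputs, o_outputs.sum())   # three values summing to 1.0
```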

Part III: Back Propagation in Artificial Neural Network

As our o_outputs are probabilities, we will use the cross-entropy error.

The cross-entropy loss function is:

J = − Σ (for i = 1 … N) y_i · log(ŷ_i)

N is the number of outputs, so N = 3. y_i is the i-th element of our expected result, so an element of the [1.0, 0.0, 0.0] array. ŷ_i is the i-th element of our observed result, so for each i, an element of the output array.

which leads us to:

J = −(1 · log(o1_output) + 0 · log(o2_output) + 0 · log(o3_output)) = −log(o1_output)

The expected output values are fixed, so they won't change. The cross-entropy's variation therefore depends on o1_output, o2_output and o3_output. Let's differentiate it with respect to these variables.

Remember the derivative of the logarithm:

d/dx log(x) = 1/x

so, with the expected output [1.0, 0.0, 0.0]:

∂J/∂o1_output = −1/o1_output,  ∂J/∂o2_output = 0,  ∂J/∂o3_output = 0

To sum up, we have a column matrix with the 3 results, so three rows packed into 1 column:

∂J/∂o_output = [ −1/o1_output, 0, 0 ]ᵀ
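Numerically, with the expected output [1.0, 0.0, 0.0] and the forward-pass result quoted above, the loss and this gradient column can be computed as below. The tiny epsilon only guards against log(0) and is an addition of this sketch:

```python
import numpy as np

y     = np.array([1.0, 0.0, 0.0])                       # expected output
o_out = np.array([0.29228018, 0.03001013, 0.67770969])  # observed output

eps = 1e-12                                             # guard against log(0)
J = -np.sum(y * np.log(o_out + eps))                    # cross-entropy loss
dJ_do = -y / (o_out + eps)                              # [-1/o1, 0, 0]

print(round(J, 5))      # ~1.23, the error to be reduced
print(dJ_do)            # roughly [-3.42, 0, 0]
```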

In this neural network, there are twelve weights between layer k and the output layer o:

wko = [[wk1o1, wk1o2, wk1o3],
       [wk2o1, wk2o2, wk2o3],
       [wk3o1, wk3o2, wk3o3],
       [wk4o1, wk4o2, wk4o3]]

“wk1o1” simply means the weight from node k1 to node o1. Then there are the output layer's biases:

(Matrix: the output layer's biases bo1, bo2, bo3.)

Here, back propagation's first step is to find the derivative for each weight. We must find the derivative of the cross-entropy loss function with respect to all the weights and biases. Remember, our cross-entropy is J.

(Equation: the chain of derivatives linking J to the weights and biases.)

o_output is obtained by applying the Softmax function to its input o_input matrix. o_output = Softmax(o_input)

but o_input = k_output * wko + bo

so if we replace the o_input variable, we have (see the softmax function above):

o_output = Softmax(k_output * wko + bo)

As we are dealing with matrices, we can simply write:

[o1_output, o2_output, o3_output] = [Softmax(o1_input), Softmax(o2_input), Softmax(o3_input)]

In terms of derivatives, we have:

(Equation: the derivative of o_output with respect to o_input, i.e. the derivative of the softmax.)

which equals:

(Equation: the softmax derivative expressed in terms of o_output.)

As we don't have direct access to the weights and biases yet, let's continue chaining the derivatives. We know that o_input = k_output * wko + bo. Good, we now have direct access to the weights and biases, which is exactly what we want. Since o_input is the entry of the layer, the layer's activation function has not yet been applied to it. In terms of derivatives, we have:

(Equation: the derivative of o_input, written out from o_input = k_output · wko + bo.)

So the derivatives with respect to wko and bo will be:

∂o_input/∂wko = k_output,  ∂o_input/∂bo = 1

As they are matrices, we have:

(Matrices: the same derivatives written out element by element.)

After differentiating the above expression, we get this nice result. Don't miss this rule: if you are differentiating with respect to one variable, all other variables must be considered as constant.

(Equations: the element-wise derivatives of each o_input with respect to the corresponding weights and bias.)

In the first part of this article, we talked about the chain rule. We will use it now. The derivative of our loss function (cross-entropy) with respect to the weights wko and the biases bo is built by starting from the end (o_output) and differentiating each output with respect to its input: o_output must be differentiated with respect to its entry o_input (already done above), and o_input in turn with respect to the weights. To make a long story short:

∂J/∂wko = (∂J/∂o_output) · (∂o_output/∂o_input) · (∂o_input/∂wko)

wko is a matrix of 4 rows and 3 columns (remember):

(Matrix: wko written out as its 4 × 3 grid of weights.)

We must use the chain rule for each element of this matrix. o1_input involves the first column's weights (wk1o1, wk2o1, wk3o1, wk4o1) because it collects their contributions; once activated, it gives o1_output.

(Equations: the chain rule applied, one derivative per weight, to each of the twelve weights in wko.)

To sum it up, we have:

(Matrix: the 4 × 3 matrix of derivatives ∂J/∂wko.)

And the updated weights (wko) formula is:

wko_new = wko − lr · (∂J/∂wko)
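As a compact numpy sketch of this update: chaining ∂J/∂o_output through the softmax derivative collapses to the well-known form o_output − y for softmax combined with cross-entropy, and that shortcut is used below in place of the explicit matrix chain shown in the images. The k_outputs and the initial wko, bo values are placeholders so the sketch runs on its own:

```python
import numpy as np

lr = 0.01
y  = np.array([1.0, 0.0, 0.0])

# placeholder values standing in for the forward-pass results
k_outputs = np.array([0.5, 0.2, 0.8, 0.1])
o_outputs = np.array([0.29228018, 0.03001013, 0.67770969])
wko = np.zeros((4, 3))
bo  = np.zeros(3)

# softmax + cross-entropy: the chained derivative reduces to (o_outputs - y)
delta_o = o_outputs - y

dJ_dwko = np.outer(k_outputs, delta_o)   # shape (4, 3), one gradient per weight
dJ_dbo  = delta_o                        # one gradient per output bias

wko = wko - lr * dJ_dwko                 # updated weights
bo  = bo  - lr * dJ_dbo                  # updated biases
print(wko.shape, bo.shape)
```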

Concerning the biases, we have:

∂J/∂bo = (∂J/∂o_output) · (∂o_output/∂o_input) · (∂o_input/∂bo),  with ∂o_input/∂bo = 1

because the other variables are considered to be constant. Their derivatives are zero. In the same manner:

(Equations: the remaining bias derivatives, obtained in the same manner.)

The updated biases are:

bo_new = bo − lr · (∂J/∂bo)

Before anything else, let's talk about the error. In the previous part, the error was the cross-entropy. What is the error we will use here? In this hidden layer, each node receives a little error from the output layer.

So Total_error = Error_from_o1 + Error_from_o2 + Error_from_o3

(Equations: the total error at a hidden node written as the sum of the contributions coming from o1, o2 and o3.)

which gives:

(Equation: the resulting expression for the error reaching the hidden layer.)

wjk is a matrix of 4 rows and 4 columns (remember):

(Matrix: wjk written out as its 4 × 4 grid of weights.)

We must calculate the derivative for each one (like we did previously).

(Equation: the chain rule for ∂J/∂wjk.)

The first terms of the chain rule formula: as you can see, k_outputs is not directly linked to J, so we must chain derivatives to reach it. We have already calculated

(Equation: the part of the chain already computed above.)

then

(Equations: the remaining factors of the chain rule down to wjk.)

To sum it up:

(Equation: the full expression for ∂J/∂wjk.)

Using the learning rate, the updated weights are:

wjk_new = wjk − lr · (∂J/∂wjk)

With the same method, we update the wij weights.

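For completeness, here is a sketch of how all of those updates chain together in numpy over several epochs, under the same assumptions as the earlier sketches (randomly drawn weights, and the softmax-plus-cross-entropy simplification delta_o = o_output − y):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

lr = 0.01
i  = np.array([0.2, 0.1])
y  = np.array([1.0, 0.0, 0.0])

rng = np.random.default_rng(0)
wij, bj = rng.random((2, 4)), rng.random(4)
wjk, bk = rng.random((4, 4)), rng.random(4)
wko, bo = rng.random((4, 3)), rng.random(3)

for epoch in range(200):
    # forward pass
    j_in = i @ wij + bj;       j_out = relu(j_in)
    k_in = j_out @ wjk + bk;   k_out = np.tanh(k_in)
    o_in = k_out @ wko + bo;   o_out = softmax(o_in)

    # backward pass: chain the derivatives layer by layer
    delta_o = o_out - y                                  # softmax + cross-entropy
    delta_k = (delta_o @ wko.T) * (1.0 - k_out ** 2)     # through tanh
    delta_j = (delta_k @ wjk.T) * (j_in > 0)             # through relu

    # gradient-descent updates for every weight matrix and bias
    wko -= lr * np.outer(k_out, delta_o);  bo -= lr * delta_o
    wjk -= lr * np.outer(j_out, delta_k);  bk -= lr * delta_k
    wij -= lr * np.outer(i, delta_j);      bj -= lr * delta_j

print(o_out)   # gradually moves towards the target [1.0, 0.0, 0.0]
```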

Thanks for reading, and don't miss our next article, “Why are weights randomly initialized in a neural network?”, written in collaboration with Anselme, Machine Learning engineer at Wildcard and regular author on the w6d Medium.

Originally published at http://github.com.

Translated from: https://medium.com/swlh/demystified-back-propagation-in-machine-learning-the-hidden-maths-you-want-to-know-about-990843a76f58
