
The Simple Mathematics Behind Deep Learning

Deep learning is one of the most important pillars of machine learning, and it is built on artificial neural networks. It is extremely popular because of its rich applications in image recognition, speech recognition, natural language processing (NLP), bioinformatics, drug design, and many other areas. Although major tech companies offer rich and efficient deep learning libraries and packages that are ready to use with little background knowledge, it is still worth understanding the small but impressive mathematics behind these models, especially the rules at work inside artificial neural networks (ANNs). There are many ways to understand how ANNs work, but we will begin with a very basic data-fitting example that explains the working of neural networks perfectly.

[Figure 1: land-fertility data; circles mark fertile sites and crosses mark infertile sites]

Suppose we are given some land-fertility data for a region (see Figure 1), where circles represent fertile land and crosses represent infertile land. Obviously, this data is available only for finitely many sites in the region. If we wish to know the character of the land at an arbitrary point in the region, we would like a mathematical transformation that takes the location of a site as input and maps it onto either a circle or a cross: if the land is fertile it is mapped to a circle (category A), otherwise to a cross (category B). So the idea is to use the given data to provide information about points where no data is available. Mathematically, this is what we call curve fitting. It is achieved by constructing a transformation rule that sends every point of R² to either a circle (fertile) or a cross (infertile). There may be many ways to construct such transformations, and this is an open area for users and researchers. Here we will use a magical function called the sigmoid function. The sigmoid function is like a step function, but it is continuous and differentiable, which makes it very interesting and important. Its mathematical expression is

σ(x) = 1 / (1 + e^(−x))

[Figure 2: graph of the sigmoid function]
Figure 2

Figure 2 shows the graph of the sigmoid function, which is sometimes called the logistic function. It is in fact a smoothed version of the step function. It is widely used in ANNs, probably because of its similarity to real neurons in the brain: when there is enough input (x is large) it outputs 1, and otherwise it remains inactive.
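This step-like behaviour is easy to check numerically. The helper below is a minimal sketch (the name `sigmoid` is simply our choice):

```python
import math

def sigmoid(x):
    """Logistic function: a smooth step from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Large positive input -> output near 1 (the "neuron" fires);
# large negative input -> output near 0 (it stays inactive).
print(sigmoid(10))   # ~0.99995
print(sigmoid(-10))  # ~0.00005
print(sigmoid(0))    # 0.5, exactly at the transition
```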

The steepness and transition point of the sigmoid function in its basic form may not suit every situation, so we adjust them simply by scaling and shifting the argument. For example, if we draw

σ(ax + b)

then it looks like


[Figure 3: graphs of σ(ax + b) for various values of a and b]
Figure 3

This shows that we can control the steepness and the transition point of the sigmoid function by choosing suitable values of a and b.
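The effect of a and b can be seen concretely in a small numerical sketch (the parameter values below are purely illustrative):

```python
import math

def sigmoid(x, a=1.0, b=0.0):
    """sigmoid(a*x + b): 'a' controls steepness, 'b' shifts the transition."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

# The transition (output = 0.5) happens where a*x + b = 0, i.e. at x = -b/a.
print(sigmoid(3.0, a=2.0, b=-6.0))   # 0.5: transition moved to x = 3
# A larger 'a' makes the step sharper around that point:
print(sigmoid(3.1, a=20.0, b=-60.0)) # ~0.88, already well past the midpoint
print(sigmoid(2.9, a=20.0, b=-60.0)) # ~0.12, already well below it
```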


In neural networks, this shifting and scaling is called weighting and biasing the input. Here 'x' is the input, 'a' is the weight, and 'b' is the bias. Finding the optimal values of 'a' and 'b' is extremely important for building any efficient neural network model. To be clear, everything explained so far is a single-input structure, i.e. 'x' is a scalar.

Now we will use linear algebra to scale this concept to more than one input: instead of taking x as a single input, we take 'X' as a vector. We therefore define the sigmoid function

σ(X) = (σ(x1), σ(x2), …, σ(xm))

This definition is important to understand: it takes the components of 'X' (the input vector) and maps them componentwise through the sigmoid function. To introduce weight and bias into the input, which is now a vector, we replace 'a' by a weight matrix 'W' and 'b' by a bias vector 'B'. The scaled system therefore becomes

σ(WX + B)

Here 'W' is a weight matrix of order m × m, 'X' is the input vector of length m, and 'B' is the bias vector of length m. Recursive use of this sigmoid function will now lead us into the magical world of neurons, layers, inputs, and outputs. Let us try to understand it with an example, taking an input vector of length 2, say X = [x1, x2], bias B = [b1, b2], and weight matrix W = [w11 w12; w21 w22]. Here X is the input layer (or simply the input), and the operation works as follows:

[Figure 4: the two input neurons x1 and x2 feeding the next layer through W and B]
Figure 4

σ(WX + B) = ( σ(w11 x1 + w12 x2 + b1), σ(w21 x1 + w22 x2 + b2) )

After this first operation, we obtain a new layer:

[Diagram: the two new neurons, each connected by cross arrows to both x1 and x2]

Here the cross arrows indicate that both x1 and x2 are involved in creating each of the new neurons. This whole procedure lets us create new neurons from the existing neurons (the input data). The same simple idea scales to an input vector of any finite length, say m (in the example above it was a vector of length two); in that case we can write the same sigmoid operation as
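The single-layer operation σ(WX + B) described above can be sketched in plain Python; the weight and bias values below are arbitrary, purely for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(W, X, B):
    """One layer: apply sigma componentwise to the vector W*X + B."""
    return [sigmoid(sum(w * x for w, x in zip(row, X)) + b)
            for row, b in zip(W, B)]

# The 2-input example from the text: X = [x1, x2], W = [[w11, w12], [w21, w22]].
X = [1.0, 0.5]
W = [[2.0, -1.0],
     [0.5,  3.0]]
B = [0.0, -1.0]
print(layer(W, X, B))  # two new neurons, each strictly between 0 and 1
```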

σ(WX + B), with X, B ∈ Rᵐ and W an m × m matrix

To connect everything once again: so far we have applied the sigmoid function to the given input vector (its entries are called neurons in ANN terminology) and created another layer of neurons, as explained above. We now apply the sigmoid function again to the newly created neurons. Here we can play with the weight matrix, which changes the number of neurons in the next layer. For example, when applying the sigmoid function to the new layer, choosing a weight matrix of order 3 × 2 and a bias vector of length three produces three new neurons in the next layer. This flexibility in the order of the weight matrix gives us the desired number of neurons in each subsequent layer. Once we have the second layer of neurons, applying the sigmoid function to it gives a third, and continuing recursively we can have as many layers as we like. Since these layers act as intermediate layers, they are known as hidden layers, and the recursive use of the sigmoid function (or any activation function) takes us deep into learning the data, which is probably why we call it deep learning. If we repeat this whole process four times, we will have four layers in the neural network model and the following mathematical function and layers.

F(X) = σ( W[4] σ( W[3] σ( W[2] X + B[2] ) + B[3] ) + B[4] ),

with W[2] of order 2 × 2, W[3] of order 3 × 2, and W[4] of order 2 × 3, together with bias vectors B[2], B[3], B[4] of lengths 2, 3, and 2.

Finally, we have a mathematical function F(X) that actually fits the given data. How many hidden layers and how many neurons to create is entirely up to the user; naturally, more hidden layers and intermediate neurons produce a more complex F(X). Coming back to our F(X): if one wishes to count how many weight coefficients and bias components are used, they number 23 in total. All 23 of these parameters must be optimized to obtain the best fit of the given data.
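Stacking layers of sizes 2 → 2 → 3 → 2 as described gives exactly 23 parameters. The sketch below wires up such an F(X) with random, untrained weights (so its outputs are meaningless until the parameters are optimized); the helper names are our own:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(W, X, B):
    """One layer: apply sigma componentwise to W*X + B."""
    return [sigmoid(sum(w * x for w, x in zip(row, X)) + b)
            for row, b in zip(W, B)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)  # random, untrained parameters
W2, B2 = rand_matrix(2, 2), [0.0, 0.0]       # 4 weights + 2 biases
W3, B3 = rand_matrix(3, 2), [0.0, 0.0, 0.0]  # 6 weights + 3 biases
W4, B4 = rand_matrix(2, 3), [0.0, 0.0]       # 6 weights + 2 biases -> 23 total

def F(X):
    """F(X) = sigma(W4 sigma(W3 sigma(W2 X + B2) + B3) + B4)."""
    return layer(W4, layer(W3, layer(W2, X, B2), B3), B4)

print(F([0.7, 0.2]))  # two outputs, each strictly between 0 and 1
```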

Reconnecting with our fertile-land classifier: if the value of F(X) is close to 1, X is mapped into category A (fertile land); if F(X) is close to 0, X is mapped into category B (infertile land). In practice, we will establish a decision rule that classifies the data. Looking carefully at Figure 1, there are 20 data points; these shall be used as target outputs to train the model. Here, training the model means finding the optimal values of all 23 parameters that give the best fit of the given data. There are two types of target data: category A (circles) and category B (crosses). Let x(i), i = 1, …, 20, be the data points whose images in Figure 1 are either circles or crosses. Now we classify

y(x(i)) = (1, 0) if x(i) belongs to category A (circle), and y(x(i)) = (0, 1) if x(i) belongs to category B (cross).

Here (x(i), y(x(i))) are the given data points (see Figure 1). These y(x(i)), i = 1, …, 20, shall be used as target vectors to find the optimal values of all the parameters (weights and biases). We define the following cost function (objective function):

Cost = (1/20) Σᵢ₌₁²⁰ (1/2) ‖ y(x(i)) − F(x(i)) ‖²

Looking carefully at this function, it has two important ingredients: the given data points y(x(i)), and the function F(x) created by the recursive application of the sigmoid function, which involves all the still-unknown weights and biases. The factor 1/20 normalizes the function over the 20 data points, and the factor 1/2 is there for convenience when differentiating; neither matters from the optimization point of view (why?). Our objective now is to find the minimum value of this cost function. Ideally it would be zero, but in reality it cannot be. The values of the weights and biases at which the cost function is minimal are their optimal values, and determining them is what is actually termed training the neural network. To obtain these optimal values, one needs an optimization algorithm to minimize the cost function, such as gradient descent or stochastic gradient descent. How these algorithms work and how they minimize the cost function is a story for another day.
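The cost function above translates directly into code. In this sketch, F is only a toy stand-in for the trained network (it just squashes each input through the sigmoid), chosen so the example is self-contained:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cost(points, targets, F):
    """Cost = (1/N) * sum_i (1/2) * ||y(x_i) - F(x_i)||^2."""
    total = 0.0
    for x, y in zip(points, targets):
        total += 0.5 * sum((yi - fi) ** 2 for yi, fi in zip(y, F(x)))
    return total / len(points)

# Toy stand-in for the trained network, purely illustrative.
F = lambda X: [sigmoid(v) for v in X]
points  = [[5.0, -5.0], [-5.0, 5.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(cost(points, targets, F))  # small but nonzero, as the text predicts
```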

After minimizing the cost function, we have the optimal parameter values, which we substitute into F(x) so that we can evaluate F(x) for every input x. If F(x) is near 1, x falls in category A; if F(x) is near 0, x falls in category B. Our classifier is ready to use. We can even draw a boundary line on the data set separating the two categories.
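The decision rule in the paragraph above can be written as a tiny helper, assuming for simplicity a scalar network output (the name `classify` and the 0.5 threshold are our own choices):

```python
def classify(Fx, threshold=0.5):
    """Decision rule: output near 1 -> category A (fertile),
    output near 0 -> category B (infertile)."""
    return "A (fertile)" if Fx > threshold else "B (infertile)"

print(classify(0.93))  # A (fertile)
print(classify(0.08))  # B (infertile)
```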

In this study we used the sigmoid function to train the model, but in general there are many more activation functions that can be used in a similar way.

Congratulations: you have learned the fundamental mathematics behind deep-learning-based classifiers and how it is put into practice.

Higham, Catherine F., and Desmond J. Higham. “Deep learning: An introduction for applied mathematicians.” SIAM Review 61.4 (2019): 860–891.


Translated from: https://medium.com/artifical-mind/simple-mathematics-behind-deep-learning-c38152c8b534
