New Ways for Optimizing Gradient Descent

The new era of machine learning and artificial intelligence is the deep learning era. It not only delivers remarkable accuracy but also has an enormous appetite for data. Using neural networks, functions of far greater complexity can be fitted to a given set of data points.

But there are a few precise techniques that make working with neural networks far more effective and insightful.

Xavier Initialization

Let us assume that we are training a huge neural network. For simplicity, the constant (bias) term is zero and the activation function is the identity.

For the given conditions, we can write the following gradient descent update and express the target variable in terms of the weights of all layers and the input a[0].

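In the notation of the text (with \alpha denoting the learning rate and J the cost function, symbols assumed here), the gradient descent update and the resulting expression for the prediction are:

W^{[l]} := W^{[l]} - \alpha \, \frac{\partial J}{\partial W^{[l]}}

\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} a^{[0]}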

For ease of understanding, let us consider all the weights to be equal.

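In symbols (with W denoting the shared weight matrix, notation assumed for this sketch):

W^{[1]} = W^{[2]} = \cdots = W^{[L-1]} = W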

Here we treat the last layer's weight differently because it produces the output value; in the case of binary classification, the output may pass through a sigmoid or ReLU function.

When we substitute these weights into the expression for the target variable, we obtain a new expression for y, the prediction of the target variable.

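Under the same assumptions, substituting the shared matrix W into the product above gives:

\hat{y} = W^{[L]} \, W^{\,L-1} \, a^{[0]}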

Let us consider two different situations for the weights.

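For example (the specific values are illustrative, not taken from the original figure), the two situations can be written as scaled identity matrices:

\text{Case 1: } W = 1.5\,I \;\Rightarrow\; \hat{y} = W^{[L]} (1.5)^{L-1} a^{[0]}

\text{Case 2: } W = 0.5\,I \;\Rightarrow\; \hat{y} = W^{[L]} (0.5)^{L-1} a^{[0]}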

In case 1, when the weight matrix is raised to the power of L-1 in a very deep network, the value of y becomes exponentially large. Likewise, in case 2, the value of y becomes exponentially small. These are the exploding and vanishing gradient problems, respectively. They hurt the accuracy of gradient descent and demand far more time to train on the data.

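A minimal NumPy sketch of this effect (the depth, the 2-dimensional input, and the 1.5/0.5 scaling factors are illustrative choices):

import numpy as np

L = 100                      # number of layers (illustrative)
a0 = np.ones(2)              # input a[0]

for scale in (1.5, 0.5):     # case 1: weights slightly above identity; case 2: slightly below
    W = scale * np.eye(2)    # shared weight matrix for layers 1..L-1
    a = a0.copy()
    for _ in range(L - 1):   # forward pass with identity activations and zero bias
        a = W @ a
    print(scale, a[0])       # roughly 1e17 for 1.5, roughly 1e-30 for 0.5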

To avoid these circumstances we need to initialize our weights more carefully and more systematically. One way of doing this is by Xavier Initialization.

If we consider a single neuron, as in logistic regression, the dimension of the weight vector is determined by the dimension of a single input example. Hence we can set the variance of the weights to 1/n, where n is the number of input features. As the dimension of the input example increases, the dimension of the weight vector must grow with it, which is why the spread of each weight is tied to n.

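To see why 1/n is a sensible scale, consider the pre-activation of a single neuron with n inputs (a standard variance argument, assuming zero-mean, independent weights and inputs):

z = \sum_{i=1}^{n} w_i x_i, \qquad \operatorname{Var}(z) = n \, \operatorname{Var}(w_i) \operatorname{Var}(x_i)

so choosing \operatorname{Var}(w_i) = 1/n keeps \operatorname{Var}(z) \approx \operatorname{Var}(x_i) regardless of how large the input dimension becomes.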

Once we apply this technique to deeper neural networks, the weight initialization for each layer can be expressed as follows.

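A minimal NumPy sketch of this per-layer initialization (the layer sizes are illustrative; n[l-1] is the number of units feeding into layer l):

import numpy as np

layer_dims = [784, 256, 128, 10]   # illustrative layer sizes, including the input layer

weights = {}
for l in range(1, len(layer_dims)):
    n_prev = layer_dims[l - 1]     # number of inputs to layer l
    # Xavier-style scaling: each weight has variance 1/n[l-1]
    weights[l] = np.random.randn(layer_dims[l], n_prev) * np.sqrt(1.0 / n_prev)
    # a common variant for ReLU activations scales by sqrt(2 / n[l-1]) instead (He initialization)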

Similarly, there are various other ways to define this variance and multiply it into the randomly initialized weights; for instance, a variance of 2/n is a common choice when the activations are ReLU.

Improving Gradient Computation

Let us consider the function f(x) = x³ and compute its gradient at x = 1. Using such a simple function makes the concept easy to understand and verify. By differentiation, we know that the slope of the function at x = 1 is exactly 3.

Now, let us estimate the slope at x = 1 numerically. We evaluate the function at x = 1 + delta, where delta is a very small quantity (say 0.001). The slope of the function is then approximated by the slope of the secant line between x = 1 and x = 1 + delta (the hypotenuse of a small triangle drawn on the curve).

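With delta = 0.001, the one-sided estimate works out to:

\frac{f(1+\delta) - f(1)}{\delta} = \frac{(1.001)^3 - 1^3}{0.001} = \frac{0.003003001}{0.001} = 3.003001 \approx 3.003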

Hence the slope is 3.003, with an error of 0.003. Now, let us define the difference interval differently and calculate the slope again.

Now we compute the slope of a larger triangle whose base runs from 1 - delta to 1 + delta. Calculating the slope in this manner reduces the error dramatically, to about 0.000001.

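With the same delta = 0.001, the two-sided (centered) estimate works out to:

\frac{f(1+\delta) - f(1-\delta)}{2\delta} = \frac{(1.001)^3 - (0.999)^3}{0.002} = \frac{1.003003001 - 0.997002999}{0.002} = 3.000001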

Hence, we can infer that defining the slope in this manner helps us compute the gradient of a function more accurately. This demonstration shows one way to improve the gradient computation, and thereby gradient descent itself.

One thing to note is that computing gradients this way, although more accurate, increases the time required to calculate them: each gradient now needs two function evaluations instead of one.

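A short Python sketch contrasting the two estimates (the function and delta are the ones used above; the helper names are ours):

def f(x):
    return x ** 3

def one_sided_grad(f, x, delta=1e-3):
    # forward difference: one extra function evaluation per gradient
    return (f(x + delta) - f(x)) / delta

def two_sided_grad(f, x, delta=1e-3):
    # centered difference: two function evaluations, but much smaller error
    return (f(x + delta) - f(x - delta)) / (2 * delta)

print(one_sided_grad(f, 1.0))   # ~3.003001
print(two_sided_grad(f, 1.0))   # ~3.000001

In practice this two-sided form is often reserved for gradient checking rather than the training loop itself, precisely because of the extra function evaluations.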

Translated from: https://towardsdatascience.com/new-ways-for-optimizing-gradient-descent-42ce313fccae
