Mathematical Justification of Stochastic Gradient Descent
Minimising the cost function is the holy grail of machine learning, and gradient descent is our master key to all cost-minimisation problems. But gradient descent is slow: it uses the entire training dataset for a single iteration. The problem is exacerbated when working with datasets of millions of rows. Stochastic gradient descent (SGD) comes to our rescue.
Nearly all of deep learning is powered by the SGD algorithm.
SGD approximates the gradient using a single example. This approximation is fairly accurate and works well in practice. But how can a calculation involving millions of rows be approximated using only one row?
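To make this concrete, here is a minimal sketch of single-example SGD for linear regression with squared loss; the model, learning rate and synthetic data are illustrative assumptions, not part of the original article.

```python
import numpy as np

# Minimal sketch: single-example SGD for linear regression with squared loss.
# The model (y ~ X @ w), learning rate and synthetic data are assumptions made
# purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 5))                   # "millions of rows"
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=X.shape[0])

w = np.zeros(5)
lr = 0.01

for step in range(10_000):
    i = rng.integers(X.shape[0])                      # pick ONE random training example
    x_i, y_i = X[i], y[i]
    grad = 2 * (x_i @ w - y_i) * x_i                  # gradient of the loss on that single example
    w -= lr * grad                                    # update using the one-example approximation
```

Each update touches one row instead of all one million, yet w still moves towards true_w.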
Let us look at the general equation of the gradient:
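In LaTeX, assuming the usual form of the cost as the mean of the per-example losses, the equation reads:

\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L\left(x^{(i)}, y^{(i)}, \theta\right)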
where L stands for the loss function, J is the cost function, and m is the number of training examples.
1) The key insight here is that the gradient is an expectation, and an expectation can be approximated using only a few examples. This insight shows that our approximation is accurate and underpins the convergence guarantees of the SGD algorithm.
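In symbols (a sketch using the same notation, with \hat{p}_{\text{data}} denoting the empirical distribution over the training set):

\nabla_\theta J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}}\left[\nabla_\theta L(x, y, \theta)\right]

so the average of \nabla_\theta L over a handful of sampled examples is an unbiased estimate of the full gradient.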
2) The standard error of a mean estimated from m examples is inversely proportional to the square root of the number of examples m.
Consider two scenarios: estimating the mean using 10 examples versus 1,000 examples. The latter needs 100 times more computation but reduces the error by only a factor of 10.
We get diminishing returns from using more examples to approximate the mean.
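Concretely, if \sigma is the standard deviation of a single-example estimate, then

\text{SE}(\hat{\mu}_m) = \frac{\sigma}{\sqrt{m}}, \qquad \frac{\text{SE}(\hat{\mu}_m)}{\text{SE}(\hat{\mu}_{100m})} = \frac{\sigma / \sqrt{m}}{\sigma / \sqrt{100\,m}} = 10,

so 100 times more computation buys only a 10-fold reduction in the standard error.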
3) Small batches can offer a regularizing effect by adding noise to the gradient.
4) In practice there is redundancy in the training examples: a large number of examples all make very similar contributions to the gradient. SGD can eliminate this redundancy.
Mini-batch gradient descent: similar to SGD, but instead of a single example we use more (generally a few hundred) examples per update. All of the above arguments remain valid for mini-batch gradient descent.
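For comparison, here is a mini-batch version of the earlier sketch; the setup is the same hypothetical linear-regression problem, and the batch size of 256 is an arbitrary illustrative choice.

```python
import numpy as np

# Mini-batch variant of the earlier sketch; the linear-regression setup and
# batch size are again illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=X.shape[0])

w = np.zeros(5)
lr = 0.01
batch_size = 256                                      # "a few hundred" examples per update

for step in range(2_000):
    idx = rng.integers(X.shape[0], size=batch_size)   # sample a mini-batch of row indices
    X_b, y_b = X[idx], y[idx]
    grad = 2 * X_b.T @ (X_b @ w - y_b) / batch_size   # average gradient over the mini-batch
    w -= lr * grad                                    # same update rule, lower-variance estimate
```

Each update averages over 256 rows, so the gradient estimate is less noisy than single-example SGD while still avoiding a full pass over the million-row dataset.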
Translated from: https://medium.com/@nrkivar/mathematical-justification-of-stochastic-gradient-descent-6ebd4f109c3f