Why Gradient Descent and the Normal Equation Are Bad for Linear Regression

Introduction

Most ML courses start with linear regression and with gradient descent and/or the normal equation as ways to solve it. Probably the best-known course, Andrew Ng's, also introduces linear regression as a very basic machine learning algorithm and shows how to solve it with gradient descent and the normal equation. Unfortunately, those are usually quite terrible ways to do it. In fact, if you have ever used LinearRegression from Scikit-learn, you have used alternative methods!

Problem outline

In linear regression, we have to estimate the parameters theta, the coefficients of the linear combination of terms for regression (where x_0 = 1 and theta_0 is the free term/bias):

h_theta(x) = theta_0 * x_0 + theta_1 * x_1 + ... + theta_n * x_n = theta^T * x

We do it by minimizing the residual sum of squares (RSS), i.e. the average of the squared differences between the outputs of our model and the true values:

J(theta) = 1/(2n) * sum_{i=1..n} (h_theta(x^(i)) - y^(i))^2
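
As a quick illustration, here is a minimal NumPy sketch of this cost, assuming the design matrix X already contains a leading column of ones:

```python
import numpy as np

def cost(theta, X, y):
    """Cost J(theta) = 1/(2n) * sum of squared residuals."""
    residuals = X @ theta - y
    return residuals @ residuals / (2 * len(y))
```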

Gradient descent method

Gradient descent, a very general method for function optimization, iteratively approaches a local minimum of the function. Since the loss function for linear regression is quadratic, it is also convex, i.e. there is a unique local and global minimum. We approach it by taking steps based on the negative gradient and a chosen learning rate alpha:

repeat until convergence:
    theta_j := theta_j - alpha * dJ(theta)/d(theta_j)    (simultaneously for all j)

where the partial derivatives of the cost are dJ(theta)/d(theta_j) = 1/n * sum_{i=1..n} (h_theta(x^(i)) - y^(i)) * x_j^(i)
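
A minimal NumPy sketch of this update rule (batch gradient descent; the learning rate and iteration count below are arbitrary placeholders):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for linear regression.
    X: (n, p) design matrix whose first column is ones (bias), y: (n,) targets."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / n  # gradient of the averaged squared error
        theta -= alpha * gradient
    return theta
```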

Why is this approach bad in most cases? The main reasons are:

  1. It’s slow — iteratively approaching the minimum takes quite a bit of time, especially computing the gradient. While there are methods of speeding this up (stochastic gradient descent, parallel computing, using other gradient methods), this is an inherently slow algorithm for general convex optimization.

  2. It does not arrive exactly at the minimum — with gradient descent, you are guaranteed to never reach the exact minimum, be it a local or the global one. That's because you are only as precise as the gradient and the learning rate alpha allow. This may be quite a problem if you want a really accurate solution.

  3. It introduces a new hyperparameter alpha — you have to optimize the learning rate alpha, which is a tradeoff between speed (approaching the minimum faster) and accuracy (arriving closer to the minimum). While you can use an adaptive learning rate, it is more complicated and still introduces a new hyperparameter.

So why do we even bother with gradient descent for linear regression? There are two main reasons:

  1. Educational purpose — since linear regression is so simple, it's easy to introduce the concept of gradient descent with this algorithm. While it's not a good choice for this particular problem in practice, it's very important for neural networks. That's most probably why Andrew Ng chose this way in his course, and everyone else blindly followed, without explicitly stating that you should not do this in practice.

  2. Extremely big data — if you have huge amounts of data and have to use parallel and/or distributed computing, the gradient approach is very easy to apply: you just partition the data into chunks, send them to different machines and compute the gradient contributions on many cores/machines (see the sketch after this list). Most often, though, you don't have such needs or computational capabilities.

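A rough sketch of why the gradient parallelizes so easily: it is a sum over samples, so per-chunk contributions can be computed independently and then combined. The helper functions below are only illustrative, not tied to any particular framework:

```python
import numpy as np

def partial_gradient(X_chunk, y_chunk, theta):
    """Gradient contribution of one data chunk; could run on a separate core or machine."""
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def full_gradient(chunks, theta, n_total):
    """Combine the per-chunk contributions and normalize by the total sample count."""
    return sum(partial_gradient(Xc, yc, theta) for Xc, yc in chunks) / n_total
```
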
Normal equation method

The quadratic cost function was originally chosen for linear regression because of its nice mathematical properties. It's easy to use and we are able to get a closed-form solution, i.e. a mathematical formula for the theta parameters: the normal equation. In the derivation below we drop the 1/(2n) factor, since it vanishes anyway.

J(theta) = (X * theta - y)^T * (X * theta - y)

grad_theta J(theta) = 2 * X^T * (X * theta - y) = 0

We arrive at a system of linear equations and finally at the normal equation:

X^T * X * theta = X^T * y   =>   theta = (X^T * X)^(-1) * X^T * y
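
In code, the normal equation is a one-liner (shown purely for illustration; as discussed below, you should not solve linear regression this way in practice):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y -- the closed-form solution, for illustration only."""
    return np.linalg.inv(X.T @ X) @ X.T @ y
```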

And why is this approach bad too? The main reasons are:

  1. It's slow — having a short, nice equation does not mean that computing it is fast. Matrix multiplication is O(n³) and so is matrix inversion. This is actually slower than gradient descent for even modestly sized datasets.

  2. It's numerically unstable — the matrix multiplication X^T * X squares the condition number of the matrix, and later we additionally have to multiply the result by X^T. This can make the results extremely unstable, and it is the main reason why this method is almost never used outside of pen-and-paper linear algebra or statistics courses (a quick demonstration of the condition-number squaring follows this list). Not even solving the system with a Cholesky decomposition will save it.

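A quick NumPy demonstration of the condition-number squaring; the nearly collinear column is contrived just to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[:, 0] = X[:, 1] + 1e-6 * rng.normal(size=1000)  # two nearly collinear columns -> ill-conditioned X

print(np.linalg.cond(X))        # large condition number
print(np.linalg.cond(X.T @ X))  # roughly its square -- far worse
```
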
This method should never be used in practice in machine learning. It is nice for mathematical analysis, but that’s it. However, it has become the basis for methods that are actually used by Scikit-learn and other libraries.

So what does everyone use?

Now that we've seen the downsides of the approaches shown in ML courses, let's see what is used in practice. In Scikit-learn's LinearRegression we can see:

[Scikit-learn LinearRegression source: fit() delegates the least squares problem to scipy.linalg.lstsq]

So Scikit-learn does not bother with its own implementation; instead, it just uses Scipy. And in scipy.linalg.lstsq we can see that even this library does not use its own implementation, instead delegating to LAPACK:

[scipy.linalg.lstsq source: the actual solve is handled by the LAPACK drivers gelsd, gelsy or gelss]

Finally, we arrive at the gelsd, gelsy and gelss entries in the Intel LAPACK documentation:

[Intel LAPACK documentation entries for gelsd (SVD, divide and conquer), gelsy (complete orthogonal factorization) and gelss (SVD)]
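
For reference, scipy.linalg.lstsq exposes the choice of LAPACK driver directly; a minimal usage sketch on synthetic data:

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# 'gelsd' (divide-and-conquer SVD) is the default; 'gelsy' and 'gelss' are the alternatives
theta, residues, rank, singular_values = lstsq(X, y, lapack_driver="gelsd")
```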

Two out of those three methods use the Singular Value Decomposition (SVD), a very important algorithm in both numerical methods and machine learning. You may have heard about it in the context of NLP or recommender systems, where it's used for dimensionality reduction. It turns out it's also used for practical linear regression, where it provides a reasonably fast and very accurate way of solving the least squares problem that lies at the heart of linear regression.

SVD and Moore-Penrose pseudoinverse

If we stop one step before the normal equations, we get a regular least squares problem:

X * theta = y

Since X is almost never square (usually we have more samples than features, i.e. a "tall and skinny" matrix X), this equation does not have an exact solution. Instead, we use the least squares approximation, i.e. the theta vector that brings X * theta as close as possible to y in terms of Euclidean distance (the L2 norm):

theta_hat = argmin_theta || X * theta - y ||_2

This problem (OLS, Ordinary Least Squares) can be solved in many ways, but it turns out that we have a very useful theorem to help us:

Theorem: the vector theta_hat = X^+ * y, where X^+ is the Moore-Penrose pseudoinverse of X, is a solution of the least squares problem above (and the one with the smallest L2 norm).

The Moore-Penrose pseudoinverse generalizes the matrix inverse to arbitrary matrices, even non-square ones! In practice it's calculated through the SVD, the Singular Value Decomposition. We decompose the matrix X into a product of three matrices:

X = U * Sigma * V^T

The Moore-Penrose pseudoinverse is then defined as:

X^+ = V * Sigma^+ * U^T

As you can see, once we have the SVD, computing the pseudoinverse is quite a trivial operation, since the Sigma matrix is diagonal: Sigma^+ is obtained simply by taking the reciprocal of each non-zero singular value and transposing.

Finally, we arrive at a very practical formula for the linear regression coefficient vector:

theta = X^+ * y = V * Sigma^+ * U^T * y
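
A minimal NumPy sketch of this formula (np.linalg.pinv and np.linalg.lstsq do the same job with more care; the data below is synthetic):

```python
import numpy as np

def svd_solve(X, y, rcond=1e-15):
    """Least squares via the SVD-based pseudoinverse: theta = V Sigma^+ U^T y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)  # invert only non-negligible singular values
    return Vt.T @ (s_inv * (U.T @ y))

# sanity check on synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=200)

assert np.allclose(svd_solve(X, y), np.linalg.lstsq(X, y, rcond=None)[0])
```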

This is what is used in practice by Scikit-learn, Scipy, Numpy and a number of other packages. There are of course optimizations that can enhance performance, like the divide-and-conquer approach for faster SVD computation (used by Scikit-learn and Scipy by default), but those are implementation details. The main idea remains the same: use the SVD and the Moore-Penrose pseudoinverse.

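As a quick sanity check on synthetic data, Scikit-learn's LinearRegression and the pseudoinverse route agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.5]) + 5.0 + 0.1 * rng.normal(size=500)

model = LinearRegression().fit(X, y)

X_b = np.column_stack([np.ones(len(X)), X])  # add the bias column explicitly
theta = np.linalg.pinv(X_b) @ y

assert np.allclose(theta[0], model.intercept_)
assert np.allclose(theta[1:], model.coef_)
```
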
The advantages of this method are:

  1. Reasonably fast — while the SVD is not cheap to compute, it is still quite fast in practice. Many years of research have also contributed to the speed of modern implementations, allowing parallel and distributed computation of the decomposition.

  2. Extremely numerically stable — numerical stability is not an issue when using the SVD. What's more, it allows us to get very precise results.

  3. Arrives exactly at the global minimum — this method is accurate almost down to machine epsilon, so we really get the best solution possible.

Beware though — this article is about plain linear regression, not about regularized versions like LASSO or ElasticNet! While this method works wonders for linear regression, with regularization we no longer have the nice least squares minimization and have to use e.g. coordinate descent.

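For contrast, a minimal example of a regularized model on synthetic data; Scikit-learn fits Lasso iteratively with coordinate descent rather than through the SVD-based least squares path:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# No closed-form solution with an L1 penalty; this is solved iteratively (coordinate descent)
lasso = Lasso(alpha=0.05).fit(X, y)
print(lasso.coef_)
```
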
Summary

In this article you've learned what is really happening under the hood of Scikit-learn's LinearRegression. While gradient descent and the normal equation have their applications (education and mathematical analysis, respectively), in practice we use the Moore-Penrose pseudoinverse computed via the SVD to get accurate predictions from our linear regression models.

Sources:

Original article: https://towardsdatascience.com/why-gradient-descent-and-normal-equation-are-bad-for-linear-regression-928f8b32fa4f
