An accessible introduction to gradient descent, using linear regression as an example, with Python sample code

Reposted from: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/


An Introduction to Gradient Descent and Linear Regression

Gradient descent is one of those “greatest hits” algorithms that can offer a new perspective for solving problems. Unfortunately, it’s rarely taught in undergraduate computer science programs. In this post I’ll give an introduction to the gradient descent algorithm, and walk through an example that demonstrates how gradient descent can be used to solve machine learning problems such as linear regression.

At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.

It’s sometimes difficult to see how this mathematical explanation translates into a practical setting, so it’s helpful to look at an example. The canonical example when explaining gradient descent is linear regression.

Code for this example can be found here.

Linear Regression Example

Simply stated, the goal of linear regression is to fit a line to a set of points. Consider the following data.

[Figure: points_for_linear_regression1 — scatter plot of the example data points]

Let’s suppose we want to model the above set of points with a line. To do this we’ll use the standard y = mx + b line equation where m is the line’s slope and b is the line’s y-intercept. To find the best line for our data, we need to find the best set of slope m and y-intercept b values.

A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is. This function will take in an (m, b) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (x, y) point in our data set and sum the square distances between each point’s y value and the candidate line’s y value (computed at mx + b). It’s conventional to square this distance to ensure that it is positive and to make our error function differentiable. In Python, computing the error for a given line looks like:

PYTHON
# y = mx + b
# m is slope, b is y-intercept
def computeErrorForLineGivenPoints(b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        totalError += (points[i].y - (m * points[i].x + b)) ** 2
    return totalError / float(len(points))
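The function above assumes each point exposes `.x` and `.y` attributes; the post doesn’t show its data format, so here is a minimal way to exercise it with a hypothetical `namedtuple`-based point type:

```python
from collections import namedtuple

# Hypothetical point type; the post's data format isn't shown
Point = namedtuple("Point", ["x", "y"])

# y = mx + b
# m is slope, b is y-intercept
def computeErrorForLineGivenPoints(b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        totalError += (points[i].y - (m * points[i].x + b)) ** 2
    return totalError / float(len(points))

# Points lying exactly on y = 2x + 1 give zero error for (b=1, m=2)
points = [Point(0, 1), Point(1, 3), Point(2, 5)]
print(computeErrorForLineGivenPoints(1, 2, points))  # 0.0
print(computeErrorForLineGivenPoints(0, 2, points))  # 1.0
```

Note that the function returns the *mean* of the squared distances (it divides by the number of points), which scales the error independently of data set size.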

Formally, this error function (the mean squared error) looks like:

Error(m, b) = (1/N) · Σᵢ₌₁..N (yᵢ − (m·xᵢ + b))²

Lines that fit our data better (where better is defined by our error function) will result in lower error values. If we minimize this function, we will get the best line for our data. Since our error function consists of two parameters (m and b) we can visualize it as a two-dimensional surface. This is what it looks like for our data set:

[Figure: gradient_descent_error_surface — 3D surface of the error function over (m, b)]

Each point in this two-dimensional space represents a line. The height of the function at each point is the error value for that line. You can see that some lines yield smaller error values than others (i.e., fit our data better). When we run gradient descent search, we will start from some location on this surface and move downhill to find the line with the lowest error.
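One way to build such a surface yourself is to evaluate the error on a grid of (m, b) pairs; the resulting array can then be handed to, e.g., matplotlib’s `plot_surface`. This sketch uses NumPy and synthetic stand-in data (the post’s actual data set is only linked, not listed):

```python
import numpy as np

# Synthetic stand-in data: points on y = 2x + 1
xs = np.arange(10, dtype=float)
ys = 2.0 * xs + 1.0

ms = np.linspace(-2, 6, 100)   # candidate slopes
bs = np.linspace(-4, 6, 100)   # candidate y-intercepts
M, B = np.meshgrid(ms, bs)

# Mean squared error for every (m, b) pair on the grid
E = np.mean((ys[:, None, None] - (M[None, :, :] * xs[:, None, None] + B[None, :, :])) ** 2, axis=0)

# The surface bottoms out near the true parameters (m = 2, b = 1)
i, j = np.unravel_index(np.argmin(E), E.shape)
print(M[i, j], B[i, j], E[i, j])
```

The lowest point of this surface corresponds to the best-fitting line, which is exactly the point gradient descent will walk toward.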

To run gradient descent on this error function, we first need to compute its gradient. The gradient will act like a compass and always point us downhill. To compute it, we will need to differentiate our error function. Since our function is defined by two parameters (m and b), we will need to compute a partial derivative for each. These derivatives work out to be:

∂/∂m Error(m, b) = (2/N) · Σᵢ₌₁..N −xᵢ · (yᵢ − (m·xᵢ + b))
∂/∂b Error(m, b) = (2/N) · Σᵢ₌₁..N −(yᵢ − (m·xᵢ + b))

We now have all the tools needed to run gradient descent. We can initialize our search to start at any pair of m and b values (i.e., any line) and let the gradient descent algorithm march downhill on our error function towards the best line. Each iteration will update m and b to a line that yields slightly lower error than the previous iteration. The direction to move in for each iteration is calculated using the two partial derivatives from above and looks like this:

PYTHON
def stepGradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        b_gradient += -(2/N) * (points[i].y - ((m_current*points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

The learningRate variable controls how large of a step we take downhill during each iteration. If we take too large of a step, we may step over the minimum. However, if we take small steps, it will require many iterations to arrive at the minimum.
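This trade-off can be seen concretely with a small sketch. It reuses the `stepGradient` function above on synthetic stand-in data (the post’s data set is only linked): a moderate rate steadily lowers the error, while an overly large rate overshoots the minimum and the error blows up.

```python
from collections import namedtuple

# Hypothetical point type and synthetic data standing in for the post's data set
Point = namedtuple("Point", ["x", "y"])
points = [Point(float(x), 2.0 * x + 1.0) for x in range(10)]

def error(b, m, points):
    # Mean squared error of the line y = mx + b
    return sum((p.y - (m * p.x + b)) ** 2 for p in points) / float(len(points))

def stepGradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        b_gradient += -(2/N) * (points[i].y - ((m_current * points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    return [b_current - (learningRate * b_gradient), m_current - (learningRate * m_gradient)]

results = {}
for rate in [0.01, 0.5]:
    b, m = 0.0, 0.0
    for _ in range(10):
        b, m = stepGradient(b, m, points, rate)
    results[rate] = error(b, m, points)
    print("learning rate", rate, "-> error after 10 steps:", results[rate])
```

With rate 0.01 the error drops below its starting value after ten steps; with rate 0.5 each step lands farther from the minimum than the last.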

Below are some snapshots of gradient descent running for 2000 iterations on our example problem. We start out at the point m = -1, b = 0. At each iteration, m and b are updated to values that yield slightly lower error than the previous iteration. The left plot displays the current location of the gradient descent search (blue dot) and the path taken to get there (black line). The right plot displays the corresponding line for the current search location. Eventually we end up with a pretty accurate fit.
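The driver loop behind such a run can be sketched as follows. The data here is synthetic (points on y = 2x + 1), standing in for the post’s linked data set, and the starting point, learning rate, and iteration count match those described above:

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
# Synthetic data standing in for the post's data set
points = [Point(float(x), 2.0 * x + 1.0) for x in range(10)]

def error(b, m, points):
    return sum((p.y - (m * p.x + b)) ** 2 for p in points) / float(len(points))

def stepGradient(b_current, m_current, points, learningRate):
    N = float(len(points))
    b_gradient = sum(-(2 / N) * (p.y - (m_current * p.x + b_current)) for p in points)
    m_gradient = sum(-(2 / N) * p.x * (p.y - (m_current * p.x + b_current)) for p in points)
    return b_current - learningRate * b_gradient, m_current - learningRate * m_gradient

# Start at m = -1, b = 0 and take 2000 steps with learning rate 0.0005
b, m = 0.0, -1.0
start_error = error(b, m, points)
for _ in range(2000):
    b, m = stepGradient(b, m, points, 0.0005)
print(start_error, "->", error(b, m, points))
```

After 2000 steps the parameters have moved from the starting line toward the true values (m near 2, b near 1), and the error is a small fraction of where it began.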

[Figure: gradient_descent_search — left: search path on the error surface; right: fitted line at the current search location]

We can also observe how the error changes as we move toward the minimum. A good way to ensure that gradient descent is working correctly is to make sure that the error decreases for each iteration. Below is a plot of error values for the first 100 iterations of the above gradient search.
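That sanity check is easy to automate. This sketch (minimal re-implementations of the functions above, on synthetic stand-in data) records the error at every iteration and verifies that it never increases:

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
# Synthetic data standing in for the post's data set
points = [Point(float(x), 2.0 * x + 1.0) for x in range(10)]

def error(b, m, points):
    return sum((p.y - (m * p.x + b)) ** 2 for p in points) / float(len(points))

def stepGradient(b, m, points, learningRate):
    N = float(len(points))
    b_grad = sum(-(2 / N) * (p.y - (m * p.x + b)) for p in points)
    m_grad = sum(-(2 / N) * p.x * (p.y - (m * p.x + b)) for p in points)
    return b - learningRate * b_grad, m - learningRate * m_grad

b, m = 0.0, -1.0
errors = []
for _ in range(100):
    b, m = stepGradient(b, m, points, 0.0005)
    errors.append(error(b, m, points))

# With a suitable learning rate, the error falls on every iteration
assert all(later <= earlier for earlier, later in zip(errors, errors[1:]))
print(errors[0], "->", errors[-1])
```

If this assertion ever fails, the usual culprit is a learning rate that is too large for the problem.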

[Figure: gradient_descent_error_by_iteration — error value versus iteration for the first 100 iterations]

We’ve now seen how gradient descent can be applied to solve a linear regression problem. While the model in our example was a line, the concept of minimizing a cost function to tune parameters also applies to regression problems that use higher order polynomials and other problems found around the machine learning world.

While we were only able to scratch the surface of gradient descent here, there are several additional concepts that are good to be aware of but that we weren’t able to discuss. A few of these include:

  • Convexity – In our linear regression problem, there was only one minimum. Our error surface was convex. Regardless of where we started, we would eventually arrive at the absolute minimum. In general, this need not be the case. It’s possible to have a problem with local minima that a gradient search can get stuck in. There are several approaches to mitigate this (e.g., stochastic gradient search).
  • Performance – We used vanilla gradient descent with a learning rate of 0.0005 in the above example and ran it for 2000 iterations. There are approaches, such as line search, that can reduce the number of iterations required. For the above example, line search reduces the number of iterations needed to arrive at a reasonable solution from several thousand to around 50.
  • Convergence – We didn’t talk about how to determine when the search finds a solution. This is typically done by looking for small changes in error iteration-to-iteration (e.g., where the gradient is near zero).
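A minimal sketch of such a convergence check, reusing compact versions of the functions above on synthetic stand-in data (the tolerance and iteration cap are illustrative values, not from the post): stop when the iteration-to-iteration change in error becomes tiny.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
# Synthetic data standing in for the post's data set
points = [Point(float(x), 2.0 * x + 1.0) for x in range(10)]

def error(b, m, points):
    return sum((p.y - (m * p.x + b)) ** 2 for p in points) / float(len(points))

def stepGradient(b, m, points, learningRate):
    N = float(len(points))
    b_grad = sum(-(2 / N) * (p.y - (m * p.x + b)) for p in points)
    m_grad = sum(-(2 / N) * p.x * (p.y - (m * p.x + b)) for p in points)
    return b - learningRate * b_grad, m - learningRate * m_grad

b, m = 0.0, 0.0
prev_error = error(b, m, points)
iterations = 0
max_iterations = 100000   # safety cap so a bad learning rate can't loop forever
tolerance = 1e-9          # illustrative threshold on the per-iteration improvement
while iterations < max_iterations:
    b, m = stepGradient(b, m, points, 0.01)
    iterations += 1
    current_error = error(b, m, points)
    if abs(prev_error - current_error) < tolerance:
        break
    prev_error = current_error
print("converged after", iterations, "iterations, error", current_error)
```

A near-zero change in the error implies the gradient itself is near zero, so either stopping criterion works in practice.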

For more information about gradient descent, linear regression, and other machine learning topics, I would strongly recommend Andrew Ng’s machine learning course on Coursera.

Example Code

Example code for the problem described above can be found here

Edit: I chose to use the linear regression example above for simplicity. We used gradient descent to iteratively estimate m and b; however, we could also have solved for them directly. My intention was to illustrate how gradient descent can be used to iteratively estimate/tune parameters, as this is required for many different problems in machine learning.


Gradient Descent for Simple Linear Regression (translated appendix)

Gradient descent is an optimization algorithm for finding the parameters of a linear regression model. We first define a loss function J(θ), where θ denotes the model parameters, and then iteratively adjust θ so that the value of J(θ) is minimized.

In simple (one-variable) linear regression, we assume a linear relationship between the target variable y and the feature variable x. The goal is to find a line such that predictions made from x along this line have the smallest possible error against the true values y.

Gradient descent determines the direction in which to update the parameters by computing the partial derivative of the loss function with respect to them, ∂J(θ)/∂θ. By updating the parameters iteratively, the loss is gradually reduced. The concrete steps are:

1. Initialize the parameter θ.
2. Compute the partial derivative ∂J(θ)/∂θ of the loss function with respect to θ.
3. Use the computed derivative and the learning rate to determine the update direction and step size.
4. Update the parameter: θ = θ − learning rate × ∂J(θ)/∂θ.
5. Repeat steps 2–4 until a stopping condition is met (e.g., a maximum number of iterations is reached, or the change in the loss falls below a set threshold).

By iterating these updates, gradient descent finds the parameter values that minimize J(θ).

The cited references explain why the update uses subtraction: when the derivative is negative, the current parameter lies to the left of the minimum, and subtracting a negative value increases the parameter, moving it toward the extremum; when the derivative is positive, the parameter is decreased. In this way, gradient descent steps progressively closer to the minimum of the loss function.

They also describe the loss function J(θ) in linear regression: it is obtained by plugging the feature values x into the model, then summing the squares of the differences between the predictions and the true values y. The goal of gradient descent is precisely to find the parameter values that minimize this loss. The derivative term in the update rule is the simplified form of ∂J(θ)/∂θ, which determines both the direction and the size of each parameter update.

References:
1. 一元线性回归梯度下降法(通俗易懂,初学专属) — https://blog.csdn.net/m0_63867120/article/details/127073912
2. 梯度下降算法--一元线性回归 — https://blog.csdn.net/weixin_44246836/article/details/125128880
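The five steps above can be sketched for a single parameter θ using a toy quadratic loss (purely illustrative, not the regression loss from the post):

```python
# Toy one-parameter loss J(theta) = (theta - 3)^2, minimized at theta = 3
def J(theta):
    return (theta - 3) ** 2

def dJ(theta):
    return 2 * (theta - 3)                     # step 2: derivative of the loss

theta = 0.0                                    # step 1: initialize the parameter
learning_rate = 0.1
for _ in range(100):                           # step 5: repeat (fixed cap for simplicity)
    theta = theta - learning_rate * dJ(theta)  # steps 3-4: update rule
print(theta)  # approaches 3
```

Because the derivative at θ = 0 is negative (−6), the subtraction in the update rule increases θ, moving it toward the minimum at 3, exactly as the discussion above describes.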