Optional Lab: Gradient Descent for Linear Regression

Goals

In this lab, you will automate the process of optimizing $w$ and $b$ using gradient descent.

Tools

In this lab, we will make use of:

  • NumPy, a popular library for scientific computing
  • Matplotlib, a popular library for plotting data
  • plotting routines in the lab_utils_uni.py file in the local directory
import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients

Problem Statement

House price prediction, again using a data set of two points, (1.0, 300) and (2.0, 500), where x is the size in 1000 sqft and y is the price in thousands of dollars.

# Load our data set
x_train = np.array([1.0, 2.0])   #features
y_train = np.array([300.0, 500.0])   #target value

Compute Cost

As in the previous lab, we reuse the cost function below.

#Function to calculate the cost
def compute_cost(x, y, w, b):
   
    m = x.shape[0] 
    cost = 0
    
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost

    return total_cost
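
As a quick sanity check (an added example, not part of the original lab), the cost should be essentially zero at $w = 200$, $b = 100$, since that line passes exactly through both training points:

# Sanity check (assumed example, not in the original lab): the line w=200, b=100
# passes exactly through (1.0, 300) and (2.0, 500), so the cost is 0.
print(compute_cost(x_train, y_train, w=200, b=100))   # 0.0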

Gradient Descent Summary

We use the cost function to measure and minimize the error between the predictions and the target values:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \tag{1}$$
Gradient descent is defined as:
$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \\ \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{2} \\ b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \\ \rbrace \end{align*}$$
Note that $w$ and $b$ are updated simultaneously: both gradients are computed from the current parameter values before either parameter is updated, rather than updating one parameter and then using its updated value to update the other.
The gradients are defined as:
$$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)} \tag{3}\\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) \tag{4} \end{align}$$
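
As a worked example (not part of the original lab), applying equations (3) and (4) by hand at $w=0$, $b=0$ on the two training points gives:

$$\frac{\partial J(w,b)}{\partial w}\bigg|_{w=0,\,b=0} = \frac{1}{2}\big[(0-300)\cdot 1.0 + (0-500)\cdot 2.0\big] = -650, \qquad \frac{\partial J(w,b)}{\partial b}\bigg|_{w=0,\,b=0} = \frac{1}{2}\big[(0-300) + (0-500)\big] = -400$$

These are the dj_dw and dj_db values printed at iteration 0 of the gradient descent run later in this lab.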

Implement Gradient Descent

To implement gradient descent for a single feature, you will need three functions:

  • compute_gradient, implementing equations (3) and (4) above
  • compute_cost, implementing equation (1) above (same code as the previous lab)
  • gradient_descent, utilizing the two functions above to iteratively update the parameters

Conventions:

  • The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t is With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.

Compute Gradient

compute_gradient implements equations (3) and (4) above and returns $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$. The embedded comments describe the operations.

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):  
        f_wb = w * x[i] + b 
        dj_dw_i = (f_wb - y[i]) * x[i] 
        dj_db_i = f_wb - y[i] 
        dj_db += dj_db_i
        dj_dw += dj_dw_i 
    dj_dw = dj_dw / m 
    dj_db = dj_db / m 
        
    return dj_dw, dj_db
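
As a quick check (an added example, not in the original lab), the gradient at $w=0$, $b=0$ should match the hand-worked values in the summary above and the iteration-0 values printed later:

# Quick check (assumed example, not in the original lab): gradient at w=0, b=0.
tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, 0, 0)
print(f"dj_dw: {tmp_dj_dw}, dj_db: {tmp_dj_db}")   # expect -650.0 and -400.0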

The lectures described how gradient descent uses the partial derivative of the cost with respect to a parameter at a point to update that parameter.
Let's use our compute_gradient function to find and plot some partial derivatives of the cost function with respect to one of the parameters, $w$.

plt_gradients(x_train,y_train, compute_cost, compute_gradient)
plt.show()

[Output figure: left, cost vs. $w$ with the gradient at several points; right, quiver plot of the gradient over $w$ and $b$]

The left plot shows the slope, or partial derivative, of the cost with respect to $w$ at three points on the cost curve, with $b$ fixed at 100. On the right side of the minimum the derivative is positive, while on the left it is negative. Due to the 'bowl shape' of the cost function, the derivatives always lead gradient descent toward the bottom, where the gradient is zero.
The 'quiver plot' on the right provides a way to view the gradient with respect to both parameters. The arrow sizes reflect the magnitude of the gradient at that point, and the direction and slope of each arrow reflect the ratio of $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ at that point.
Note that the gradient points away from the minimum. Review equation (2) above: the scaled gradient is subtracted from the current value of $w$ or $b$, which moves the parameter in a direction that reduces cost.
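
To make this concrete, here is a small numerical check (an added sketch, not part of the original lab): starting from an arbitrary point and subtracting the scaled gradient once, as in equation (2), should reduce the cost. The starting values w_tmp, b_tmp, and alpha_tmp below are hypothetical.

# Hypothetical one-step check (not in the original lab): subtracting the scaled
# gradient from (w, b) should reduce the cost.
w_tmp, b_tmp, alpha_tmp = 100.0, 50.0, 1.0e-2
dj_dw_tmp, dj_db_tmp = compute_gradient(x_train, y_train, w_tmp, b_tmp)
cost_before = compute_cost(x_train, y_train, w_tmp, b_tmp)
cost_after  = compute_cost(x_train, y_train,
                           w_tmp - alpha_tmp * dj_dw_tmp,
                           b_tmp - alpha_tmp * dj_db_tmp)
print(f"cost before: {cost_before:0.1f}, cost after one step: {cost_after:0.1f}")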

Gradient Descent

The function gradient_descent below implements equation (2). The details of the implementation are described in the comments. We will use this function to find the optimal values of $w$ and $b$ for the training data.

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function): 
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters  
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient
      
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    
    w = copy.deepcopy(w_in) # avoid modifying global w_in
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    p_history = []
    b = b_in
    w = w_in
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (2) above
        b = b - alpha * dj_db                            
        w = w - alpha * dj_dw                            

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(x, y, w , b))
            p_history.append([w,b])
        # Print the cost at intervals of num_iters/10, i.e. 10 times over the run (or every iteration if num_iters < 10)
        # 0.3e means scientific notation with three decimal places
        if i% math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
 
    return w, b, J_history, p_history #return w and J,w history for graphing
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")

The output is:

Iteration    0: Cost 7.93e+04  dj_dw: -6.500e+02, dj_db: -4.000e+02   w:  6.500e+00, b: 4.00000e+00
Iteration 1000: Cost 3.41e+00  dj_dw: -3.712e-01, dj_db:  6.007e-01   w:  1.949e+02, b: 1.08228e+02
Iteration 2000: Cost 7.93e-01  dj_dw: -1.789e-01, dj_db:  2.895e-01   w:  1.975e+02, b: 1.03966e+02
Iteration 3000: Cost 1.84e-01  dj_dw: -8.625e-02, dj_db:  1.396e-01   w:  1.988e+02, b: 1.01912e+02
Iteration 4000: Cost 4.28e-02  dj_dw: -4.158e-02, dj_db:  6.727e-02   w:  1.994e+02, b: 1.00922e+02
Iteration 5000: Cost 9.95e-03  dj_dw: -2.004e-02, dj_db:  3.243e-02   w:  1.997e+02, b: 1.00444e+02
Iteration 6000: Cost 2.31e-03  dj_dw: -9.660e-03, dj_db:  1.563e-02   w:  1.999e+02, b: 1.00214e+02
Iteration 7000: Cost 5.37e-04  dj_dw: -4.657e-03, dj_db:  7.535e-03   w:  1.999e+02, b: 1.00103e+02
Iteration 8000: Cost 1.25e-04  dj_dw: -2.245e-03, dj_db:  3.632e-03   w:  2.000e+02, b: 1.00050e+02
Iteration 9000: Cost 2.90e-05  dj_dw: -1.082e-03, dj_db:  1.751e-03   w:  2.000e+02, b: 1.00024e+02
(w,b) found by gradient descent: (199.9929,100.0116)

As shown in the lecture slides, the cost starts large and decreases rapidly; the partial derivatives $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ also get smaller.
As the run approaches the 'bottom of the bowl', the partial derivatives shrink and the descent slows.
So progress per step slows even though the learning rate alpha remains unchanged.
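
One way to see this in the numbers (an added check, not part of the original lab) is to evaluate the gradient at an early and a late point from the recorded parameter history:

# Added check (not in the original lab): the gradient magnitude shrinks as the
# parameters approach the minimum.
w_early, b_early = p_hist[0]     # parameters after the first iteration
w_late,  b_late  = p_hist[-1]    # parameters after the last iteration
print("early gradient:", compute_gradient(x_train, y_train, w_early, b_early))
print("late  gradient:", compute_gradient(x_train, y_train, w_late,  b_late))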

Cost Versus Iterations of Gradient Descent

A plot of cost versus iterations is a useful measure of progress in gradient descent; in a successful run, the cost should always decrease.
The change in cost is so rapid initially that it is useful to plot the initial descent on a different scale than the final descent.
In the plots below, note the scale of the cost and the iteration ranges.

# plot cost versus iteration  
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12,4))
ax1.plot(J_hist[:100])
ax2.plot(1000 + np.arange(len(J_hist[1000:])), J_hist[1000:])
ax1.set_title("Cost vs. iteration(start)");  ax2.set_title("Cost vs. iteration (end)")
ax1.set_ylabel('Cost')            ;  ax2.set_ylabel('Cost') 
ax1.set_xlabel('iteration step')  ;  ax2.set_xlabel('iteration step') 
plt.show()

[Output figure: cost vs. iteration at the start (left) and at the end (right) of the run]

Predictions

Now that you have discovered the optimal values of $w$ and $b$, you can use the model to predict housing prices.
As expected, the predicted values are nearly identical to the training values for the same inputs. Furthermore, a value not in the training set is in line with the expected value.

print(f"1000 sqft house prediction {w_final*1.0 + b_final:0.1f} Thousand dollars")
print(f"1200 sqft house prediction {w_final*1.2 + b_final:0.1f} Thousand dollars")
print(f"2000 sqft house prediction {w_final*2.0 + b_final:0.1f} Thousand dollars")

The output is:

1000 sqft house prediction 300.0 Thousand dollars
1200 sqft house prediction 340.0 Thousand dollars
2000 sqft house prediction 500.0 Thousand dollars
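
The same predictions can be made for several sizes at once with NumPy broadcasting (a minimal added sketch, not part of the original lab; x_new is a hypothetical array of sizes in 1000 sqft):

# Vectorized predictions (assumed example, not in the original lab).
x_new = np.array([1.0, 1.2, 2.0])            # sizes in 1000 sqft
y_pred = w_final * x_new + b_final           # prices in thousands of dollars
print(y_pred)                                # approximately [300. 340. 500.]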

Plotting

You can show the progress of gradient descent during its execution by plotting the cost over iterations on a contour plot of $cost(w,b)$.

fig, ax = plt.subplots(1,1, figsize=(12, 6))
plt_contour_wgrad(x_train, y_train, p_hist, ax)

[Output figure: contour plot of cost(w,b) with the gradient descent path overlaid]

The contour plot above shows $cost(w,b)$ over a range of $w$ and $b$. Cost levels are represented by the rings, and the red arrows overlaid on the plot trace the path of gradient descent.
 
Here are some things to note:

  • The path makes steady (monotonic) progress toward its goal.
  • The initial steps are much larger than the steps near the goal.

[Output figure: zoomed contour plot showing the final steps of gradient descent]

Zooming in, we can see the final steps of gradient descent. Note that the distance between steps shrinks as the gradient approaches zero.

Increased Learning Rate

The lectures discussed the appropriate value of the learning rate, $\alpha$. The larger $\alpha$ is, the faster gradient descent will converge to a solution, but if it is too large, gradient descent will diverge. The run above is an example of a solution that converges nicely.
Let's try increasing the value of $\alpha$ and see what happens:

# initialize parameters
w_init = 0
b_init = 0
# set alpha to a large value
iterations = 10
tmp_alpha = 8.0e-1
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)

The output is:

Iteration    0: Cost 2.58e+05  dj_dw: -6.500e+02, dj_db: -4.000e+02   w:  5.200e+02, b: 3.20000e+02
Iteration    1: Cost 7.82e+05  dj_dw:  1.130e+03, dj_db:  7.000e+02   w: -3.840e+02, b:-2.40000e+02
Iteration    2: Cost 2.37e+06  dj_dw: -1.970e+03, dj_db: -1.216e+03   w:  1.192e+03, b: 7.32800e+02
Iteration    3: Cost 7.19e+06  dj_dw:  3.429e+03, dj_db:  2.121e+03   w: -1.551e+03, b:-9.63840e+02
Iteration    4: Cost 2.18e+07  dj_dw: -5.974e+03, dj_db: -3.691e+03   w:  3.228e+03, b: 1.98886e+03
Iteration    5: Cost 6.62e+07  dj_dw:  1.040e+04, dj_db:  6.431e+03   w: -5.095e+03, b:-3.15579e+03
Iteration    6: Cost 2.01e+08  dj_dw: -1.812e+04, dj_db: -1.120e+04   w:  9.402e+03, b: 5.80237e+03
Iteration    7: Cost 6.09e+08  dj_dw:  3.156e+04, dj_db:  1.950e+04   w: -1.584e+04, b:-9.80139e+03
Iteration    8: Cost 1.85e+09  dj_dw: -5.496e+04, dj_db: -3.397e+04   w:  2.813e+04, b: 1.73730e+04
Iteration    9: Cost 5.60e+09  dj_dw:  9.572e+04, dj_db:  5.916e+04   w: -4.845e+04, b:-2.99567e+04

Above, $w$ and $b$ bounce back and forth between positive and negative, and their absolute values grow with each iteration. Furthermore, $\frac{\partial J(w,b)}{\partial w}$ changes sign each iteration, and the cost is increasing rather than decreasing. This is a clear sign that the learning rate is too large and the solution is diverging.
Let's visualize this with a plot.
[Output figure: left, $w$ over the first few diverging iterations; right, 3-D view of cost over $w$ and $b$ with the diverging path]

The plot on the left shows $w$ over the first few steps of gradient descent: $w$ oscillates between positive and negative and the cost grows rapidly.
Gradient descent is operating on both $w$ and $b$ simultaneously, so the 3-D plot on the right is needed for the complete picture.
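
A simple programmatic way to spot this behavior (an added sketch, not part of the original lab) is to check whether the recorded cost ever increases between consecutive iterations:

# Added divergence check (not in the original lab): in a healthy run the cost
# should never increase from one iteration to the next.
cost_increased = any(c2 > c1 for c1, c2 in zip(J_hist, J_hist[1:]))
print("cost increased at some step:", cost_increased)   # True for this large alpha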

Congratulations!

In this lab you:

  • delved into the details of gradient descent for a single variable.
  • developed a routine to compute the gradient
  • visualized what the gradient is
  • completed a gradient descent routine
  • utilized gradient descent to find parameters
  • examined the impact of sizing the learning rate