Optional Lab: Gradient Descent for Linear Regression



In this lab, you will automate the process of optimizing $w$ and $b$ using gradient descent.


In this lab, we will make use of:

  • NumPy, a popular library for scientific computing
  • Matplotlib, a popular library for plotting data
  • plotting routines in the lab_utils_uni.py file in the local directory
import math, copy
import numpy as np
import matplotlib.pyplot as plt
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients

Problem Statement

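The task is again house price prediction, using the same data set of two points: (1.0, 300) and (2.0, 500). The feature is the house size in units of 1000 sqft and the target is the price in thousands of dollars.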

# Load our data set
x_train = np.array([1.0, 2.0])   #features
y_train = np.array([300.0, 500.0])   #target value

Compute Cost


#Function to calculate the cost
def compute_cost(x, y, w, b):
    m = x.shape[0] 
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost

    return total_cost
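
The loop above can also be written with NumPy array operations. The following is a minimal sketch, assuming the inputs are one-dimensional NumPy arrays; the name compute_cost_vectorized is hypothetical and not part of the lab.

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Squared-error cost J(w, b) computed without an explicit Python loop.

    Hypothetical helper (not part of the lab); it is meant to return the
    same value as the loop-based compute_cost.
    """
    m = x.shape[0]
    errors = w * x + b - y            # prediction error for every example
    return np.sum(errors ** 2) / (2 * m)

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

cost_at_zero = compute_cost_vectorized(x_train, y_train, 0, 0)
cost_at_fit = compute_cost_vectorized(x_train, y_train, 200, 100)
print(cost_at_zero, cost_at_fit)
```

With w = 200 and b = 100 the line passes through both training points, so the cost is zero there.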

Gradient Descent Summary

Gradient descent uses the cost function to minimize the error between the predicted values and the actual values:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2\tag{1}$$
where $f_{w,b}(x^{(i)}) = w x^{(i)} + b$ is the model's prediction for example $i$. Gradient descent is described as:
$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{2} \; \newline b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace \end{align*}$$
Note that $w$ and $b$ are updated simultaneously: both gradients are evaluated at the current $(w, b)$, rather than updating one parameter and then using its new value to update the other. The gradient is defined as:
$$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{3}\\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{4}\\ \end{align}$$
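
To make the simultaneous update concrete, here is a small sketch that takes one step from w = b = 0 on the lab's two data points, once simultaneously and once sequentially. It uses plain Python lists rather than NumPy, and the names gradient, w_sim, b_sim, w_seq and b_seq are illustrative assumptions, not part of the lab.

```python
# Two training examples from the lab: (1.0, 300) and (2.0, 500).
x = [1.0, 2.0]
y = [300.0, 500.0]
alpha = 1.0e-2

def gradient(w, b):
    """Return (dJ/dw, dJ/db) following equations (3) and (4)."""
    m = len(x)
    dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
    return dj_dw, dj_db

# Simultaneous update: both gradients are evaluated at the OLD (w, b).
w, b = 0.0, 0.0
dj_dw, dj_db = gradient(w, b)
w_sim = w - alpha * dj_dw
b_sim = b - alpha * dj_db

# Sequential update (not what equation (2) prescribes): the gradient for b
# is evaluated after w has already been changed.
w_seq = w - alpha * gradient(w, b)[0]
b_seq = b - alpha * gradient(w_seq, b)[1]

print(w_sim, b_sim)
print(w_seq, b_seq)
```

The two variants agree on w but produce different values of b, which is why the order of evaluation matters.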

Implement Gradient Descent


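  • compute_gradient, implementing equations (3) and (4) above
  • compute_cost, implementing equation (1) above (same code as in the previous lab)
  • gradient_descent, which uses compute_gradient and compute_cost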


  • Python variables that hold partial derivatives follow this naming pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t. is With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.

Compute Gradient

compute_gradient implements equations (3) and (4) above and returns $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$. The comments in the code describe the key operations.

def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        f_wb = w * x[i] + b                # model prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]     # summand of equation (3)
        dj_db_i = f_wb - y[i]              # summand of equation (4)
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_dw, dj_db
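
One way to gain confidence in a hand-written gradient is to compare it with a finite-difference approximation of the cost. The sketch below does this with plain Python lists; the function names and the step size eps are assumptions made for illustration and are not part of the lab.

```python
def cost(x, y, w, b):
    """Squared-error cost J(w, b), equation (1)."""
    m = len(x)
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

def analytic_gradient(x, y, w, b):
    """Equations (3) and (4), written with plain Python lists."""
    m = len(x)
    dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
    return dj_dw, dj_db

def numeric_gradient(x, y, w, b, eps=1e-4):
    """Central-difference approximation of the same two partial derivatives."""
    dj_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
    dj_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

x = [1.0, 2.0]
y = [300.0, 500.0]
exact = analytic_gradient(x, y, 0.0, 0.0)
approx = numeric_gradient(x, y, 0.0, 0.0)
print(exact, approx)
```

If the two results disagree by more than rounding error, the analytic gradient (or the cost) likely contains a bug.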

Let's use the compute_gradient function to find and plot some partial derivatives of the cost function with respect to one of the parameters, $w$.

plt_gradients(x_train,y_train, compute_cost, compute_gradient)


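The left plot shows the slope of the cost curve, that is $\frac{\partial J(w,b)}{\partial w}$, at three points, with $b$ fixed at 100. On the right side of the plot the derivative is positive, and on the left side it is negative. Because the cost function is bowl-shaped, the derivatives always lead gradient descent toward the bottom, where the gradient is zero.
The "quiver plot" on the right provides a way to view the gradient of both parameters. The arrow size reflects the magnitude of the gradient at that point, and the direction and slope of the arrow reflect the ratio of $\frac{\partial J(w,b)}{\partial w}$ to $\frac{\partial J(w,b)}{\partial b}$ at that point.
Note that the gradient points away from the minimum. Review equation (2) above. The scaled gradient is subtracted from the current value of $w$ or $b$. This moves the parameter in a direction that will reduce cost.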
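
The sign behaviour described above can be checked numerically. This is a small sketch using plain Python lists, with b held at 100 as in the left plot; the helper name dj_dw and the three sample values of w are chosen for illustration.

```python
x = [1.0, 2.0]
y = [300.0, 500.0]
b = 100.0          # b is held fixed, as in the left plot
alpha = 1.0e-2

def dj_dw(w):
    """Partial derivative of the cost with respect to w, equation (3)."""
    return sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / len(x)

for w in (100.0, 200.0, 300.0):
    grad = dj_dw(w)
    w_next = w - alpha * grad   # subtracting the scaled gradient moves w toward 200
    print(f"w = {w:5.1f}  dj_dw = {grad:7.1f}  next w = {w_next:6.2f}")
```

Left of the minimum the derivative is negative, so subtracting it increases w; right of the minimum it is positive, so subtracting it decreases w.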

Gradient Descent

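The function gradient_descent below implements equation (2). The comments describe the implementation details. We will use this function to find the optimal values of $w$ and $b$ on the training data.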

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient

    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """
    w = copy.deepcopy(w_in)  # avoid modifying global w_in
    b = b_in
    # Lists to store the cost J and the parameters [w,b] at each iteration, primarily for graphing later
    J_history = []
    p_history = []

    for i in range(num_iters):
        # Calculate the gradient at the current (w, b) using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update parameters using equation (2) above; both gradients were
        # computed before either update, so the update is simultaneous
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J and parameters at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
        # Print the cost 10 times over the run, or at every iteration if num_iters < 10
        # (0.3e formats a number in scientific notation with three decimal places)
        if i % math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # return w, b and the J, [w,b] histories for graphing

# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")


Iteration    0: Cost 7.93e+04  dj_dw: -6.500e+02, dj_db: -4.000e+02   w:  6.500e+00, b: 4.00000e+00
Iteration 1000: Cost 3.41e+00  dj_dw: -3.712e-01, dj_db:  6.007e-01   w:  1.949e+02, b: 1.08228e+02
Iteration 2000: Cost 7.93e-01  dj_dw: -1.789e-01, dj_db:  2.895e-01   w:  1.975e+02, b: 1.03966e+02
Iteration 3000: Cost 1.84e-01  dj_dw: -8.625e-02, dj_db:  1.396e-01   w:  1.988e+02, b: 1.01912e+02
Iteration 4000: Cost 4.28e-02  dj_dw: -4.158e-02, dj_db:  6.727e-02   w:  1.994e+02, b: 1.00922e+02
Iteration 5000: Cost 9.95e-03  dj_dw: -2.004e-02, dj_db:  3.243e-02   w:  1.997e+02, b: 1.00444e+02
Iteration 6000: Cost 2.31e-03  dj_dw: -9.660e-03, dj_db:  1.563e-02   w:  1.999e+02, b: 1.00214e+02
Iteration 7000: Cost 5.37e-04  dj_dw: -4.657e-03, dj_db:  7.535e-03   w:  1.999e+02, b: 1.00103e+02
Iteration 8000: Cost 1.25e-04  dj_dw: -2.245e-03, dj_db:  3.632e-03   w:  2.000e+02, b: 1.00050e+02
Iteration 9000: Cost 2.90e-05  dj_dw: -1.082e-03, dj_db:  1.751e-03   w:  2.000e+02, b: 1.00024e+02
(w,b) found by gradient descent: (199.9929,100.0116)

As shown in the lecture slides, the cost starts out large and declines rapidly. The partial derivatives $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ also shrink as the run progresses.
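
Because the partial derivatives shrink as the parameters approach the minimum, their magnitude can serve as a stopping test for "repeat until convergence". The sketch below is a hypothetical variant of the lab's loop written with plain Python lists; the function name, the tolerance tol and the cap max_iters are assumptions, not values from the lab.

```python
import math

def gradient_descent_until_converged(x, y, w, b, alpha, tol=1e-3, max_iters=100_000):
    """Run gradient descent until the gradient magnitude drops below tol.

    Hypothetical variant of the lab's gradient_descent; tol and max_iters
    are assumed settings. Returns (w, b, iterations_used).
    """
    m = len(x)
    for i in range(max_iters):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m
        if math.hypot(dj_dw, dj_db) < tol:
            return w, b, i
        # simultaneous update of both parameters
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b, max_iters

w_fit, b_fit, n_iters = gradient_descent_until_converged(
    [1.0, 2.0], [300.0, 500.0], w=0.0, b=0.0, alpha=1.0e-2)
print(f"stopped after {n_iters} iterations: w = {w_fit:.4f}, b = {b_fit:.4f}")
```

A fixed iteration count, as used in the lab, is simpler to reason about; a tolerance-based stop trades that for not having to guess how many iterations are enough.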

Cost Versus Iterations of Gradient Descent

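A plot of cost versus iterations is a useful measure of progress in gradient descent. In a successful run, the cost should always decrease.
The cost changes so rapidly at the start that it is useful to plot the initial descent on a different scale than the final descent.
In the plots below, note the scale of the cost on the axes and the iteration step.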

# plot cost versus iteration
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12,4))
ax1.plot(J_hist[:100])    # initial descent
ax2.plot(1000 + np.arange(len(J_hist[1000:])), J_hist[1000:])    # final descent
ax1.set_title("Cost vs. iteration (start)");  ax2.set_title("Cost vs. iteration (end)")
ax1.set_ylabel('Cost')            ;  ax2.set_ylabel('Cost')
ax1.set_xlabel('iteration step')  ;  ax2.set_xlabel('iteration step')
plt.show()



Now that the optimal values of $w$ and $b$ have been found, the model can be used to predict house prices.

print(f"1000 sqft house prediction {w_final*1.0 + b_final:0.1f} Thousand dollars")
print(f"1200 sqft house prediction {w_final*1.2 + b_final:0.1f} Thousand dollars")
print(f"2000 sqft house prediction {w_final*2.0 + b_final:0.1f} Thousand dollars")


1000 sqft house prediction 300.0 Thousand dollars
1200 sqft house prediction 340.0 Thousand dollars
2000 sqft house prediction 500.0 Thousand dollars
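
The three predictions can also be produced through a small helper. This is a sketch only: the function name predict is hypothetical, and the parameter values are copied from the gradient descent output above rather than recomputed.

```python
def predict(size_1000_sqft, w, b):
    """Linear model f_wb(x) = w * x + b, with x in units of 1000 sqft."""
    return w * size_1000_sqft + b

# Parameters copied from the gradient descent output above.
w_final, b_final = 199.9929, 100.0116

for size in (1.0, 1.2, 2.0):
    price = predict(size, w_final, b_final)
    print(f"{size * 1000:.0f} sqft house prediction {price:0.1f} Thousand dollars")
```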


You can show the progress of gradient descent during its execution by plotting the cost at each iteration on a contour plot of cost(w, b).

fig, ax = plt.subplots(1,1, figsize=(12, 6))
plt_contour_wgrad(x_train, y_train, p_hist, ax)


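The contour plot above shows cost(w, b) over a range of $w$ and $b$. Cost levels are represented by the rings. Overlaid, in red arrows, is the path of gradient descent.

  • The path makes steady (monotonic) progress toward its goal.
  • The initial steps are much larger than the steps near the goal.

Zooming in, we can see the final steps of gradient descent. Note that the distance between steps shrinks as the gradient approaches zero.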
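
The shrinking step size follows directly from equation (2): each update moves (w, b) by alpha times the gradient. The sketch below computes the step length from a few gradient values copied from the training log above; the variable names are illustrative.

```python
import math

alpha = 1.0e-2
# (iteration, dj_dw, dj_db) copied from the training log above
logged = [
    (0,    -6.500e+02, -4.000e+02),
    (1000, -3.712e-01,  6.007e-01),
    (5000, -2.004e-02,  3.243e-02),
    (9000, -1.082e-03,  1.751e-03),
]

step_lengths = []
for iteration, dj_dw, dj_db in logged:
    # Each update moves (w, b) by alpha times the gradient vector.
    step = alpha * math.hypot(dj_dw, dj_db)
    step_lengths.append(step)
    print(f"iteration {iteration:4d}: step length {step:.3e}")
```

The learning rate stays constant throughout; only the gradient magnitude changes, which is why the steps become shorter without any adjustment to alpha.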

Increased Learning Rate

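The lecture discussed the proper value of the learning rate $\alpha$. The larger $\alpha$ is, the faster gradient descent converges to a solution. But if it is too large, gradient descent will diverge. The run above is an example of a solution that converges nicely.
Let's try increasing $\alpha$ and see what happens: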

# initialize parameters
w_init = 0
b_init = 0
# set alpha to a large value
iterations = 10
tmp_alpha = 8.0e-1
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)


Iteration    0: Cost 2.58e+05  dj_dw: -6.500e+02, dj_db: -4.000e+02   w:  5.200e+02, b: 3.20000e+02
Iteration    1: Cost 7.82e+05  dj_dw:  1.130e+03, dj_db:  7.000e+02   w: -3.840e+02, b:-2.40000e+02
Iteration    2: Cost 2.37e+06  dj_dw: -1.970e+03, dj_db: -1.216e+03   w:  1.192e+03, b: 7.32800e+02
Iteration    3: Cost 7.19e+06  dj_dw:  3.429e+03, dj_db:  2.121e+03   w: -1.551e+03, b:-9.63840e+02
Iteration    4: Cost 2.18e+07  dj_dw: -5.974e+03, dj_db: -3.691e+03   w:  3.228e+03, b: 1.98886e+03
Iteration    5: Cost 6.62e+07  dj_dw:  1.040e+04, dj_db:  6.431e+03   w: -5.095e+03, b:-3.15579e+03
Iteration    6: Cost 2.01e+08  dj_dw: -1.812e+04, dj_db: -1.120e+04   w:  9.402e+03, b: 5.80237e+03
Iteration    7: Cost 6.09e+08  dj_dw:  3.156e+04, dj_db:  1.950e+04   w: -1.584e+04, b:-9.80139e+03
Iteration    8: Cost 1.85e+09  dj_dw: -5.496e+04, dj_db: -3.397e+04   w:  2.813e+04, b: 1.73730e+04
Iteration    9: Cost 5.60e+09  dj_dw:  9.572e+04, dj_db:  5.916e+04   w: -4.845e+04, b:-2.99567e+04

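Above, $w$ and $b$ bounce back and forth between positive and negative values, and their absolute values grow with each iteration. Further, on each iteration $\frac{\partial J(w,b)}{\partial w}$ changes sign and the cost increases rather than decreases. This is a clear sign that the learning rate is too large and the solution is diverging.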
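
The symptom described above, a cost that rises from one iteration to the next, is easy to test for in code. The following sketch reruns the first ten steps with both learning rates used in this lab, using plain Python lists; run_costs and is_diverging are hypothetical helpers, and the check is a simple heuristic rather than a general convergence test.

```python
def run_costs(alpha, num_iters, x=(1.0, 2.0), y=(300.0, 500.0)):
    """Run gradient descent from w = b = 0 and return the cost after each step."""
    m = len(x)
    w, b = 0.0, 0.0
    costs = []
    for _ in range(num_iters):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        costs.append(sum((w * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m))
    return costs

def is_diverging(costs):
    """Hypothetical check: True if the cost rose on every recorded step."""
    return all(later > earlier for earlier, later in zip(costs, costs[1:]))

print(is_diverging(run_costs(alpha=8.0e-1, num_iters=10)))
print(is_diverging(run_costs(alpha=1.0e-2, num_iters=10)))
```

In practice such a check could be used to stop a run early and retry with a smaller learning rate.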

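The left plot shows the progression of $w$ over the first few steps of gradient descent: $w$ oscillates between positive and negative values while the cost grows rapidly.
Gradient descent is operating on both $w$ and $b$ simultaneously, so one needs the 3-D plot on the right for the complete picture.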


In this lab you:

  • delved into the details of gradient descent for a single variable
  • developed a routine to compute the gradient
  • visualized what the gradient is
  • completed a gradient descent routine
  • utilized gradient descent to find parameters
  • examined the impact of sizing the learning rate
