【Machine Learning】2. Cost Function and Gradient Descent

1. Cost Function

1.1 Introduction

The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$

$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w,b$.
  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
  • These differences are summed over all $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.
  • The goal is to find the parameters that minimize the cost function. The cost is only a scalar measure of fit; dividing by $2m$ rather than $m$ makes the derivative tidier (the 2 cancels the factor produced by differentiating the square) and does not change which $w,b$ minimize it.

Note: in lecture, summation ranges typically run from 1 to m, while in code they run from 0 to m-1.
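To make equation (1) concrete, here is a worked calculation on a hypothetical two-example dataset. The values $x = (1, 2)$, $y = (300, 500)$ and the parameters $w = 100$, $b = 100$ are assumptions chosen only for illustration; they do not come from the text above.

```latex
\begin{align*}
f_{w,b}(x^{(0)}) &= 100 \cdot 1 + 100 = 200, & (200 - 300)^2 &= 10000 \\
f_{w,b}(x^{(1)}) &= 100 \cdot 2 + 100 = 300, & (300 - 500)^2 &= 40000 \\
J(100, 100) &= \frac{1}{2 \cdot 2}\,(10000 + 40000) = 12500
\end{align*}
```

With $w = 200$, $b = 100$ both predictions match the targets exactly, so the cost drops to 0.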

1.2 Implementation

The code below calculates cost by looping over each example. In each loop:

  • $f_{w,b}$, a prediction, is calculated.
  • the difference between the target and the prediction is calculated and squared.
  • this is added to the total cost.

def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.

    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters

    Returns:
      total_cost (float): The cost of using w,b as the parameters for linear regression
                          to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]

    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b            # prediction for example i
        cost = (f_wb - y[i]) ** 2      # squared error for example i
        cost_sum = cost_sum + cost     # accumulate over all examples
    total_cost = (1 / (2 * m)) * cost_sum

    return total_cost
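The loop above can also be written with NumPy array operations. The following is a minimal sketch, not part of the original lab: the function name `compute_cost_vectorized` and the demo arrays `x_demo`, `y_demo` are hypothetical and introduced only for illustration.

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Squared-error cost of equation (1), computed without an explicit loop."""
    m = x.shape[0]
    errors = w * x + b - y               # f_wb - y for every example at once
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical data, for illustration only.
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])
cost_demo = compute_cost_vectorized(x_demo, y_demo, w=100.0, b=100.0)
print(cost_demo)
```

For these values the squared errors are 10000 and 40000, so the function returns 12500. The loop version is easier to map onto the formula; the vectorized version is usually faster on large arrays.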

2. Gradient Descent

2.1 Introduction

For a linear model that predicts $f_{w,b}(x^{(i)})$:
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$$
In linear regression, you use the training data to fit the parameters $w$, $b$ by minimizing a measure of the error between the predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. This measure is called the cost, $J(w,b)$. In training, you measure the cost over all of the training samples $x^{(i)},y^{(i)}$:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$
Gradient descent was described as:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \; \newline
b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace
\end{align*}$$
where the parameters $w$, $b$ are updated simultaneously.

The gradient is defined as follows (obtained by taking the partial derivatives of the cost function):
$$\begin{align}
\frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}\\
\frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\
\end{align}$$
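The step from the cost in equation (2) to equations (4) and (5) is an application of the chain rule. A short sketch of the derivation, using only the symbols already defined:

```latex
\begin{align*}
\frac{\partial J(w,b)}{\partial w}
  &= \frac{1}{2m} \sum_{i=0}^{m-1} 2\,\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)\,
     \frac{\partial f_{w,b}(x^{(i)})}{\partial w}
   = \frac{1}{m} \sum_{i=0}^{m-1} \bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)} \\
\frac{\partial J(w,b)}{\partial b}
  &= \frac{1}{2m} \sum_{i=0}^{m-1} 2\,\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)\,
     \frac{\partial f_{w,b}(x^{(i)})}{\partial b}
   = \frac{1}{m} \sum_{i=0}^{m-1} \bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)
\end{align*}
```

This uses $\frac{\partial f_{w,b}(x^{(i)})}{\partial w} = x^{(i)}$ and $\frac{\partial f_{w,b}(x^{(i)})}{\partial b} = 1$ for $f_{w,b}(x^{(i)}) = wx^{(i)} + b$. The factor of 2 from the square cancels the 2 in $\frac{1}{2m}$.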
Here, simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
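The difference between a simultaneous and a non-simultaneous update can be seen in a single step. The sketch below is illustrative only: the helper `gradients`, the toy data, and the starting values are all assumptions, not part of the original lab.

```python
import numpy as np

def gradients(x, y, w, b):
    """Gradients of the squared-error cost, equations (4) and (5)."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

# Hypothetical toy data and settings, chosen only for illustration.
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])
w, b, alpha = 0.0, 0.0, 0.01

# Simultaneous: evaluate both partial derivatives at the same (w, b), then update.
dj_dw, dj_db = gradients(x, y, w, b)
w_new = w - alpha * dj_dw
b_new = b - alpha * dj_db

# Not simultaneous: the gradient for b is evaluated after w has already changed.
dj_dw, _ = gradients(x, y, w, b)
w_bad = w - alpha * dj_dw
_, dj_db_bad = gradients(x, y, w_bad, b)
b_bad = b - alpha * dj_db_bad

print(f"simultaneous:     w = {w_new}, b = {b_new}")
print(f"not simultaneous: w = {w_bad}, b = {b_bad}")
```

Both variants produce the same w after one step, but b differs because the second variant mixes the old b with the new w when computing its gradient.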

2.2 Implementation

You will implement the gradient descent algorithm for one feature. You will need three functions:

  • compute_gradient, implementing equations (4) and (5) above
  • compute_cost, implementing equation (2) above (code from the previous section)
  • gradient_descent, utilizing compute_gradient and compute_cost


  • The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t. stands for "with respect to", as in the partial derivative of $J(w,b)$ with respect to $b$.


compute_gradient implements (4) and (5) above and returns $\frac{\partial J(w,b)}{\partial w}$, $\frac{\partial J(w,b)}{\partial b}$. The embedded comments describe the operations.

def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression.

    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters

    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b                # prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]     # term i of the sum in equation (4)
        dj_db_i = f_wb - y[i]              # term i of the sum in equation (5)
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
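One common way to sanity-check an analytic gradient such as equations (4) and (5) is to compare it against a central finite-difference approximation of the cost. The sketch below assumes compact, vectorized stand-ins named `cost` and `gradient` (hypothetical names, not the lab's functions) plus made-up data and parameter values.

```python
import numpy as np

def cost(x, y, w, b):
    """Squared-error cost, equation (2)."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def gradient(x, y, w, b):
    """Analytic gradient, equations (4) and (5)."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

def numerical_gradient(x, y, w, b, eps=1e-4):
    """Central-difference approximation of (dJ/dw, dJ/db)."""
    dj_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
    dj_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

# Hypothetical data and parameter values, for illustration only.
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])
analytic = gradient(x_demo, y_demo, 100.0, 100.0)
numeric = numerical_gradient(x_demo, y_demo, 100.0, 100.0)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two pairs agree to several significant digits, the analytic formulas are very likely implemented correctly. A sign error or a missing factor of $x^{(i)}$ shows up as a clear mismatch.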

Gradient Descent

Now that gradients can be computed, gradient descent, described in equation (3) above, can be implemented in gradient_descent below. The details of the implementation are described in the comments. You will then use this function to find optimal values of $w$ and $b$ on the training data.

import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha.

    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : learning rate
      num_iters (int)   : number of iterations to run gradient descent
      cost_function     : function to call to produce cost
      gradient_function : function to call to produce gradient

    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """
    # Lists to store the cost J and the parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    # Copy the initial values so the caller's w_in, b_in are not modified
    w = copy.deepcopy(w_in)
    b = copy.deepcopy(b_in)

    for i in range(num_iters):
        # Calculate both gradients at the current (w, b) before updating either parameter
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update parameters using equation (3) above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save the cost J and the parameters at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))   # record the cost
            p_history.append([w, b])                      # record the parameter values
        # Print the cost 10 times over the run, or at every iteration if num_iters < 10
        if i % math.ceil(num_iters / 10) == 0:
            # {i:4} pads to width 4; 0.3e is scientific notation with 3 decimal places
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # return final w, b and the histories for graphing
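
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2  # learning rate alpha
# run gradient descent
# x_train, y_train: 1-D NumPy arrays of training inputs and targets, assumed to be defined beforehand
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
# 8.4f sets the output format: minimum width 8, 4 digits after the decimal point, padded with spaces on the left
# 08.4f pads with zeros on the left instead; see Python's format specification for details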
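As an optional sanity check, the parameters found by gradient descent can be compared with a closed-form least-squares fit. The sketch below is self-contained and rests on assumptions: `fit_line_gd` is a compact, hypothetical stand-in for the functions above, the data are made up, and `np.polyfit` is used as an independent reference rather than as part of the method described in this post.

```python
import numpy as np

def fit_line_gd(x, y, alpha=1.0e-2, num_iters=10000):
    """Batch gradient descent for f(x) = w*x + b, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = w * x + b - y
        dj_dw, dj_db = np.mean(err * x), np.mean(err)   # equations (4) and (5)
        w, b = w - alpha * dj_dw, b - alpha * dj_db     # simultaneous update
    return w, b

# Hypothetical training data, for illustration only.
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([300.0, 500.0, 650.0, 900.0])

w_gd, b_gd = fit_line_gd(x_demo, y_demo)
# Degree-1 least-squares fit; coefficients are returned as [slope, intercept].
w_ls, b_ls = np.polyfit(x_demo, y_demo, deg=1)

print(f"gradient descent: w = {w_gd:.4f}, b = {b_gd:.4f}")
print(f"np.polyfit      : w = {w_ls:.4f}, b = {b_ls:.4f}")
```

If the two results differ noticeably, the usual suspects are too few iterations or a learning rate that is too small (slow convergence) or too large (divergence).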




