[Machine Learning] 2. Cost Function and Gradient Descent

一藏过往

Last modified on 2023-12-08 21:44:09

Category: Machine Learning. Tags: machine learning, artificial intelligence, deep learning, python

First published on 2023-12-08 18:08:18

Article link: https://blog.csdn.net/qq_42887833/article/details/134884567

1. Cost Function

1.1 Introduction

The equation for cost with one variable is:
$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$

where
$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$

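$f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w, b$.
$(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
These squared differences are summed over all the $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.
The goal is to minimize the cost function in order to find the desired parameters. The cost is only a scalar measure of fit: dividing by $2m$ rather than $m$ makes the derivative cleaner, because the 2 produced by differentiating the square cancels, and it does not change which $w, b$ minimize the cost.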

Note: in lecture, summation ranges typically run from 1 to m, while in code they run from 0 to m-1.
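
As a quick check of equation (1), here is a worked example on a hypothetical two-point training set. The data and parameter values are assumed purely for illustration and do not come from the original post: $x = (1, 2)$, $y = (300, 500)$, $w = 100$, $b = 100$, so $m = 2$.

```latex
\begin{align*}
f_{w,b}(x^{(0)}) &= 100 \cdot 1 + 100 = 200 \\
f_{w,b}(x^{(1)}) &= 100 \cdot 2 + 100 = 300 \\
J(w,b) &= \frac{1}{2 \cdot 2}\left[(200 - 300)^2 + (300 - 500)^2\right]
        = \frac{10000 + 40000}{4} = 12500
\end{align*}
```

With $w = 200$, $b = 100$ both predictions match their targets exactly, so the cost for this data set drops to 0.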

1.2 Code Implementation

The code below calculates cost by looping over each example. In each loop:

$f_{w,b}$, a prediction, is calculated.
the difference between the target and the prediction is calculated and squared.
this is added to the total cost.

def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    
    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
               to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0] 
    
    cost_sum = 0 
    for i in range(m): 
        f_wb = w * x[i] + b   
        cost = (f_wb - y[i]) ** 2  
        cost_sum = cost_sum + cost  
    total_cost = (1 / (2 * m)) * cost_sum  

    return total_cost
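
The same cost can be computed without an explicit Python loop. The following is a minimal sketch, assuming `x` and `y` are 1-D NumPy arrays of equal length. The name `compute_cost_vectorized` and the sample data are hypothetical and not part of the original post.

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Vectorized form of equation (1); hypothetical helper for illustration."""
    m = x.shape[0]
    errors = w * x + b - y              # prediction minus target, for all examples at once
    return np.sum(errors ** 2) / (2 * m)

# Illustrative data (assumed): two training examples
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])

cost_demo = compute_cost_vectorized(x_demo, y_demo, w=100.0, b=100.0)
cost_fit = compute_cost_vectorized(x_demo, y_demo, w=200.0, b=100.0)
print(cost_demo, cost_fit)
```

For small data sets the loop and the vectorized form should give the same value; the vectorized form mainly pays off when `m` is large.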

2. Gradient Descent

2.1 Introduction

For a linear model that predicts $f_{w,b}(x^{(i)})$ :
$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$
In linear regression, you use the input training data to fit the parameters $w$, $b$ by minimizing a measure of the error between the predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the cost, $J(w,b)$. In training, you measure the cost over all of the training samples $(x^{(i)}, y^{(i)})$:
$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2\tag{2}$
Gradient descent was described as:

$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \; \newline b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace \end{align*}$
where the parameters $w$, $b$ are updated simultaneously.

The gradient is defined as follows (obtained by taking the partial derivatives of the cost function):
$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}\\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\ \end{align}$
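
Equations (4) and (5) follow from applying the chain rule to the cost in equation (2): the factor 2 produced by differentiating the square cancels the $\frac{1}{2}$ in $\frac{1}{2m}$. A short derivation, using only the definitions above:

```latex
\begin{align*}
\frac{\partial J(w,b)}{\partial w}
  &= \frac{1}{2m} \sum_{i=0}^{m-1} \frac{\partial}{\partial w}\left(wx^{(i)} + b - y^{(i)}\right)^2
   = \frac{1}{2m} \sum_{i=0}^{m-1} 2\left(wx^{(i)} + b - y^{(i)}\right) x^{(i)}
   = \frac{1}{m} \sum_{i=0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x^{(i)} \\
\frac{\partial J(w,b)}{\partial b}
  &= \frac{1}{2m} \sum_{i=0}^{m-1} \frac{\partial}{\partial b}\left(wx^{(i)} + b - y^{(i)}\right)^2
   = \frac{1}{2m} \sum_{i=0}^{m-1} 2\left(wx^{(i)} + b - y^{(i)}\right) \cdot 1
   = \frac{1}{m} \sum_{i=0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)
\end{align*}
```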
Here, simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
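
To make the distinction concrete, here is a small sketch contrasting a simultaneous update with a sequential one, in which the gradient for $b$ is computed after $w$ has already been changed. The helper `gradient`, the data, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

def gradient(x, y, w, b):
    """Equations (4) and (5) in vectorized form (illustrative helper)."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

# Illustrative data and settings (assumed, not from the original post)
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])
w, b, alpha = 0.0, 0.0, 0.1

# Simultaneous update: both partial derivatives are evaluated at the old (w, b)
dj_dw, dj_db = gradient(x, y, w, b)
w_sim = w - alpha * dj_dw
b_sim = b - alpha * dj_db

# Sequential update (not what equation (3) describes):
# the gradient for b is evaluated after w has already moved
w_seq = w - alpha * gradient(x, y, w, b)[0]
b_seq = b - alpha * gradient(x, y, w_seq, b)[1]

print("simultaneous:", w_sim, b_sim)
print("sequential:  ", w_seq, b_seq)
```

The two variants agree on `w` after this single step but produce different values of `b`, which is why the gradients are computed first and the parameters updated afterwards.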

2.2 Code Implementation

You will implement the gradient descent algorithm for one feature. You will need three functions.

compute_gradient, implementing equations (4) and (5) above
compute_cost, implementing equation (2) above (code from section 1.2 above)
gradient_descent, utilizing compute_gradient and compute_cost

Conventions:

The naming of python variables containing partial derivatives follows this pattern, $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
w.r.t. stands for With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.

compute_gradient (Gradient Computation)

compute_gradient implements (4) and (5) above and returns $\frac{\partial J(w,b)}{\partial w}$ , $\frac{\partial J(w,b)}{\partial b}$ . The embedded comments describe the operations.

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):  
        f_wb = w * x[i] + b 
        dj_dw_i = (f_wb - y[i]) * x[i] 
        dj_db_i = f_wb - y[i] 
        dj_db += dj_db_i
        dj_dw += dj_dw_i 
    dj_dw = dj_dw / m 
    dj_db = dj_db / m 
        
    return dj_dw, dj_db
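
One way to gain confidence in an analytic gradient is to compare it with a central finite-difference estimate of the cost. The sketch below restates the cost and gradient compactly with NumPy so that the block is self-contained. The names `cost`, `analytic_gradient`, and `numerical_gradient`, the step size `eps`, and the sample data are all hypothetical.

```python
import numpy as np

def cost(x, y, w, b):
    """Equation (2), vectorized."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def analytic_gradient(x, y, w, b):
    """Equations (4) and (5), vectorized."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

def numerical_gradient(x, y, w, b, eps=1e-6):
    """Central finite differences of the cost in w and in b."""
    dj_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
    dj_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

# Illustrative data and parameters (assumed)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])
analytic = analytic_gradient(x, y, 0.5, 0.1)
numeric = numerical_gradient(x, y, 0.5, 0.1)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two results differ by more than rounding error, the analytic formula or its implementation likely contains a mistake.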

Gradient Descent

Now that gradients can be computed, gradient descent, described in equation (3) above, can be implemented in gradient_descent below. The details of the implementation are described in the comments. Afterwards, you will use this function to find optimal values of $w$ and $b$ on the training data.

import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function): 
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient 
      
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    
    # Copy the initial values so the caller's w_in, b_in are never modified
    w = copy.deepcopy(w_in)
    b = copy.deepcopy(b_in)
    # Lists to store the cost J and the parameters [w, b] at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = w - alpha * dj_dw                            

        # Save cost J and the parameters at each iteration
        if i < 100000:      # prevent resource exhaustion
            # list.append() adds an element to the end of the list
            J_history.append(cost_function(x, y, w, b))  # record the cost
            p_history.append([w, b])                     # record the parameter values w, b
        # Print the cost at 10 evenly spaced intervals, or at every iteration if num_iters < 10
        if i % math.ceil(num_iters / 10) == 0:
            # {i:4} pads i to width 4; 0.3e is scientific notation with three decimal places
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # return w, b and the J and [w, b] histories for graphing

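# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2  # learning rate alpha
# run gradient descent
# (x_train and y_train are the 1-D training arrays; they are assumed to be defined beforehand)
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
# 8.4f: minimum field width of 8 with 4 digits after the decimal point, padded with spaces on the left
# 08.4f: the same, but padded with zeros on the left; see Python's format specification for details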
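
As a sanity check on the whole procedure, the result of gradient descent can be compared with a closed-form least-squares fit. The sketch below is self-contained and uses a compact vectorized loop rather than the functions above. The training set is assumed for illustration, and `np.polyfit` serves only as a reference solution.

```python
import numpy as np

# Illustrative training set (assumed, not from the original post)
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])

w, b = 0.0, 0.0
alpha, num_iters = 1.0e-2, 10000

for _ in range(num_iters):
    err = w * x_demo + b - y_demo
    dj_dw = np.mean(err * x_demo)                  # equation (4)
    dj_db = np.mean(err)                           # equation (5)
    w, b = w - alpha * dj_dw, b - alpha * dj_db    # simultaneous update, equation (3)

# Reference: degree-1 least-squares fit, returned as (slope, intercept)
w_ref, b_ref = np.polyfit(x_demo, y_demo, deg=1)

print(f"gradient descent: w = {w:8.4f}, b = {b:8.4f}")
print(f"least squares   : w = {w_ref:8.4f}, b = {b_ref:8.4f}")
```

With a fixed number of iterations, gradient descent approaches the least-squares solution without reaching it exactly. A small remaining gap is expected and shrinks as the iteration count grows.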