【Machine Learning】2. Cost Function and Gradient Descent

1. Cost Function

1.1 Introduction

The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$

where
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w,b$.
  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
  • These differences are summed over all the $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.
  • The goal is to minimize the cost function to obtain the desired parameters. The cost is just a scalar measure of fit; dividing by $2m$ rather than $m$ makes the derivative cleaner and does not change where the minimum is.

Note: in lecture, summation ranges are typically from 1 to m, while the code runs from 0 to m-1.
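
For intuition, consider a hypothetical two-example training set (values chosen purely for illustration): $x = (1, 2)$, $y = (300, 500)$. With $w = 100$, $b = 100$ the predictions are $200$ and $300$, so
$$J(100, 100) = \frac{1}{2 \cdot 2}\left((200 - 300)^2 + (300 - 500)^2\right) = \frac{10000 + 40000}{4} = 12500,$$
while $w = 200$, $b = 100$ fits both points exactly and gives $J = 0$.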

1.2 Code Implementation

The code below calculates cost by looping over each example. In each loop:

  • f_wb, a prediction is calculated
  • the difference between the target and the prediction is calculated and squared.
  • this is added to the total cost.
def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    
    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
               to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0] 
    
    cost_sum = 0 
    for i in range(m): 
        f_wb = w * x[i] + b            # model prediction for example i
        cost = (f_wb - y[i]) ** 2      # squared error for example i
        cost_sum = cost_sum + cost     # accumulate over all m examples
    total_cost = (1 / (2 * m)) * cost_sum  # divide by 2m per equation (1)

    return total_cost
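
As a quick sanity check, compute_cost can be run on a tiny hypothetical dataset (the arrays and values below are illustrative only, matching the worked example in 1.1):

import numpy as np

x_tmp = np.array([1.0, 2.0])      # hypothetical feature values
y_tmp = np.array([300.0, 500.0])  # hypothetical targets
print(compute_cost(x_tmp, y_tmp, 100, 100))  # 12500.0, as in the worked example
print(compute_cost(x_tmp, y_tmp, 200, 100))  # 0.0, a perfect fit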

2. Gradient Descent

2.1 Introduction

For a linear model that predicts $f_{w,b}(x^{(i)})$:
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$$
In linear regression, you utilize input training data to fit the parameters $w$, $b$ by minimizing a measure of the error between our predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the $cost$, $J(w,b)$. In training you measure the cost over all of our training samples $x^{(i)}, y^{(i)}$:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$
Gradient descent was described as:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \; \newline
b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace
\end{align*}$$
where parameters $w$, $b$ are updated simultaneously.

The gradient is defined as (obtained by taking the partial derivatives of the cost function):
$$\begin{align}
\frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}\\
\frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\
\end{align}$$
Here, simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
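
As a minimal sketch of what "simultaneously" means in code (the helper function and data below are illustrative, not from the original post), both gradients are evaluated at the current $(w, b)$ before either parameter is changed:

import numpy as np

def step(x, y, w, b, alpha):
    """One simultaneous gradient-descent step for f = w*x + b (illustrative helper)."""
    m = x.shape[0]
    err = (w * x + b) - y               # errors computed with the *current* w, b
    dj_dw = np.dot(err, x) / m          # equation (4)
    dj_db = err.sum() / m               # equation (5)
    return w - alpha * dj_dw, b - alpha * dj_db   # both updates use the old w, b

w, b = step(np.array([1.0, 2.0]), np.array([300.0, 500.0]), 0.0, 0.0, alpha=0.01)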

2.2 Code Implementation

You will implement the gradient descent algorithm for one feature. You will need three functions:

  • compute_gradient implementing equations (4) and (5) above
  • compute_cost implementing equation (2) above (the code from section 1.2)
  • gradient_descent, utilizing compute_gradient and compute_cost

Conventions:

  • The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t. is With Respect To, as in the partial derivative of $J(w,b)$ with respect to $b$.

compute_gradient (gradient computation)

compute_gradient implements (4) and (5) above and returns $\frac{\partial J(w,b)}{\partial w}$, $\frac{\partial J(w,b)}{\partial b}$. The embedded comments describe the operations.

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):  
        f_wb = w * x[i] + b             # prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]  # per-example contribution to dJ/dw, equation (4)
        dj_db_i = f_wb - y[i]           # per-example contribution to dJ/db, equation (5)
        dj_db += dj_db_i
        dj_dw += dj_dw_i 
    dj_dw = dj_dw / m                   # average over the m examples
    dj_db = dj_db / m 
        
    return dj_dw, dj_db
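
Equations (4) and (5) can also be evaluated without an explicit Python loop. A vectorized sketch (not part of the original post, assuming x and y are 1-D NumPy arrays) that returns the same values as compute_gradient:

import numpy as np

def compute_gradient_vectorized(x, y, w, b):
    """Gradients of equations (4) and (5) via NumPy array operations (illustrative)."""
    err = (w * x + b) - y                 # prediction error for every example at once
    dj_dw = np.dot(err, x) / x.shape[0]   # equation (4)
    dj_db = err.mean()                    # equation (5)
    return dj_dw, dj_db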

Gradient Descent

Now that gradients can be computed, gradient descent, described in equation (3) above, can be implemented below in gradient_descent. The details of the implementation are described in the comments. Below, you will utilize this function to find optimal values of $w$ and $b$ on the training data.

import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function): 
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient 
      
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    
    w = copy.deepcopy(w_in) # avoid modifying global w_in
    b = b_in
    # Lists to store cost J and parameters [w,b] at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = w - alpha * dj_dw                            

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(x, y, w , b))  # record the cost
            p_history.append([w,b])                        # record the parameter values
            # append() adds an element to the end of a list
        # Print the cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
        if i% math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
            # i:4 means a field width of 4; 0.3e means scientific notation with 3 decimal places
 
    return w, b, J_history, p_history #return w and J,w history for graphing
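
The driver code below assumes a training set x_train, y_train has already been defined. As a purely illustrative stand-in, a minimal two-point dataset could be used:

import numpy as np

# Hypothetical training data: any small 1-D dataset works for this demo.
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
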
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2  # learning rate alpha
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
# 8.4f means a minimum field width of 8 with 4 digits after the decimal point, padded with spaces;
# 08.4f would pad with zeros instead (see Python's format-specification documentation for details)
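
Once w_final and b_final have been found, the fitted line can be used to make predictions. A brief hypothetical example:

# Predict the target for a new input using the learned linear model f(x) = w*x + b.
x_new = 1.2                    # hypothetical new feature value
y_pred = w_final * x_new + b_final
print(f"prediction for x = {x_new}: {y_pred:0.1f}")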