Machine Learning Week 1 (Andrew Ng): Introduction to Machine Learning, Types of Learning, and a Univariate Linear Regression Implementation

Week one

1.1、Supervised learning-part-01(regression)

Learn from being given the “right answer”: given inputs x and their outputs y, the machine learns to predict the correct output for a new input x.

1.2、Supervised learning-part-02(classification)

Classification predicts categories, i.e. a small, discrete set of possible outputs. The key question is how to find a boundary line that separates the classes.

1.3、Unsupervised learning-part-01

The algorithm is not told the right answers in advance; it has to find structure in the data on its own.
e.g. clustering algorithms

1.4、Unsupervised learning-part-02

Anomaly detection
Dimensionality reduction

1.5、Jupyter notebooks
2.1、Linear regression model, parts 01-02


  • ŷ (y-hat) is the model’s prediction, while y is the actual output or “target” variable.
  • Linear regression with one input variable is called univariate linear regression; a minimal sketch of the model follows below.
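As a quick illustration, here is a minimal sketch of the univariate model $f_{w,b}(x) = wx + b$ in NumPy; the parameter values are made up purely for illustration:

import numpy as np

def predict(x, w, b):
    """Univariate linear regression model: f_wb(x) = w * x + b."""
    return w * x + b

# Made-up parameter values, purely for illustration
print(predict(np.array([1.0, 2.0, 3.0]), w=2.0, b=0.5))   # [2.5 4.5 6.5]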
2.2、Cost Function Formula

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \tag{1}$$
The extra division by 2 just makes some of our later calculations neater: the 2 cancels when the square is differentiated.

def compute_cost(x, y, w, b):
    """
    Computes the cost function (1) for linear regression.
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns
      total_cost (float): the cost of using w,b as parameters for linear regression
    """
    m = x.shape[0]
    cost = 0

    for i in range(m):
        f_wb = w * x[i] + b                  # model prediction for example i
        cost = cost + (f_wb - y[i])**2       # accumulate squared error
    total_cost = 1 / (2 * m) * cost

    return total_cost

In this class, we also use a simplified version that fixes b = 0 and minimizes $J(w)$ over $w$ alone.
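A minimal sketch of the simplified cost, reusing the compute_cost above with b fixed at 0 and a hypothetical toy dataset that lies on the line y = x (so the cost should bottom out at w = 1):

import numpy as np

# Hypothetical toy data lying on the line y = x (an assumption for illustration)
x_tmp = np.array([1.0, 2.0, 3.0])
y_tmp = np.array([1.0, 2.0, 3.0])

# Evaluate the simplified cost J(w) with b fixed at 0; it bottoms out at w = 1
for w in [0.0, 0.5, 1.0, 1.5]:
    print(f"w = {w:3.1f}  ->  J(w) = {compute_cost(x_tmp, y_tmp, w, 0):.3f}")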

2.3、Visualizing The Cost Function

The cost of the original model, with both w and b as parameters, is plotted as a 3D surface.

2.4、Visualization Examples

Different choices of w and b correspond to different points on the cost surface and therefore to different fitted lines.

3.1、Gradient Descent

Intuition: standing at the current point, look around in every direction, pick the direction of steepest descent, take a small step, and repeat.
For the calculation:

$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$

  • $\alpha$ is the learning rate, a small positive number that controls the step size (a small update sketch follows below).
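One detail the update rule hides: both parameters should be updated simultaneously, i.e. the new b must not be computed from an already-updated w. A minimal sketch with hypothetical gradient values (not computed from any real data):

# Hypothetical current values and gradients, purely to illustrate the simultaneous update
w, b, alpha = 0.0, 0.0, 0.01
dj_dw, dj_db = -650.0, -400.0

tmp_w = w - alpha * dj_dw     # both temporaries use the *old* w and b
tmp_b = b - alpha * dj_db
w, b = tmp_w, tmp_b           # then assign together
print(w, b)                   # 6.5 4.0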
3.1.1 The relation between slope and w

When the current point is to the right of the minimum, the slope of the tangent line (e.g. 2/1) is positive, so $\frac{\partial J(w,b)}{\partial w}$ is positive. The update therefore decreases $w$, and that is how $w$ moves toward the correct value.
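A minimal numeric sketch of this sign argument, using a hypothetical toy cost $J(w) = w^2$ rather than the course's cost function:

# Toy cost J(w) = w**2 (a hypothetical stand-in, so dJ/dw = 2*w)
w, alpha = 3.0, 0.1
for step in range(3):
    grad = 2 * w              # positive while w > 0 (the point is to the right of the minimum)
    w = w - alpha * grad      # a positive derivative therefore pushes w down, toward the minimum
    print(f"step {step}: grad = {grad:+.2f}, new w = {w:.3f}")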

3.1.2 The learning rate

If $\alpha$ is too small, gradient descent may be very slow.
If $\alpha$ is too large, gradient descent may overshoot the minimum and fail to converge, or even diverge.
We should also notice that updating $w$ never makes the point leave the curve: the update only moves $w$ along the horizontal axis, and the cost on the vertical axis is simply recomputed at the new $w$.
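The same toy cost $J(w) = w^2$ also illustrates both failure modes (again only a sketch, not the course's data): a small $\alpha$ converges steadily, while a too-large $\alpha$ overshoots more with every step and diverges.

# Learning-rate effect on the same toy cost J(w) = w**2
def run_toy(alpha, w=3.0, steps=10):
    """Apply w := w - alpha * dJ/dw repeatedly for J(w) = w**2."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print("alpha = 0.1 ->", run_toy(0.1))   # w shrinks toward the minimum at 0
print("alpha = 1.1 ->", run_toy(1.1))   # |w| grows every step: gradient descent diverges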

3.2、Gradient Descent for linear regression

Depending on where you initialize the parameters w and b, gradient descent can in general end up at different local minima.
But it turns out that with a squared error cost function and linear regression, the cost function is bowl-shaped (convex), so it has a single global minimum and never has multiple local minima.

  • $\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$
  • $\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$
def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):  
        f_wb = w * x[i] + b 
        dj_dw_i = (f_wb - y[i]) * x[i] 
        dj_db_i = f_wb - y[i] 
        dj_db += dj_db_i
        dj_dw += dj_dw_i 
    dj_dw = dj_dw / m 
    dj_db = dj_db / m 
        
    return dj_dw, dj_db

The overall implementation:

import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function): 
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters  
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient
      
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    
    w = copy.deepcopy(w_in) # avoid modifying global w_in
    b = b_in
    # Lists to store cost J and parameters [w,b] at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = w - alpha * dj_dw                            

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(x, y, w , b))
            p_history.append([w,b])
        # Print cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
        if i% math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
 
    return w, b, J_history, p_history #return w and J,w history for graphing


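The driver code below uses x_train and y_train, which are not defined in this post; as a minimal assumption, any two equal-length 1-D NumPy arrays will do. Here is a hypothetical toy dataset:

import numpy as np

# Hypothetical toy dataset (an assumption, not from this post): one feature, two examples
x_train = np.array([1.0, 2.0])      # e.g. size in 1000s of square feet
y_train = np.array([300.0, 500.0])  # e.g. price in 1000s of dollars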
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")

Finally, to be more precise, this gradient descent process is called batch gradient descent. The term batch gradient descent refers to the fact that on every step of gradient descent, we look at all of the training examples, instead of just a subset of the training data.
