Machine learning week 2 (Andrew Ng): Multiple linear regression and tricks for gradient descent (feature scaling, polynomial regression)

Week two: Regression with multiple input variables

1、Multiple linear regression

The model’s prediction with multiple variables is given by the linear model:
$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + \dots + w_{n-1}x_{n-1} + b$$

or in vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

1.1、Vectorization

In code: f = np.dot(w, x) + b

  • Shorter code
  • Faster than an explicit loop
    - Specialized (parallel/SIMD) hardware multiplies and adds the terms together rather than one element at a time.
    - When updating $\mathbf{w}$, e.g. $w_j = w_j - 0.1\,d_j$ for every $j$, a single vectorized update w = w - 0.1 * d is more efficient (see the sketch below).
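A minimal sketch (not from the course material) comparing an element-by-element dot product with np.dot; the array size and timing method are arbitrary illustrative choices:

import time
import numpy as np

def my_dot(a, b):
    """Dot product computed element by element, for comparison only."""
    result = 0.0
    for i in range(a.shape[0]):
        result += a[i] * b[i]
    return result

np.random.seed(1)
a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

tic = time.time(); c = my_dot(a, b);  loop_ms = 1000 * (time.time() - tic)
tic = time.time(); c = np.dot(a, b);  vec_ms  = 1000 * (time.time() - tic)

print(f"loop version:        {loop_ms:.1f} ms")
print(f"np.dot (vectorized): {vec_ms:.1f} ms")   # typically orders of magnitude faster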
1.2、Lab: Python and NumPy
  • 1-D array, shape (n,): n elements indexed [0] through [n-1].
    "1-D" does not mean the array has only one element. On the contrary, it means the array has a single dimension: one row of elements.
1.2.1、The creation
import numpy as np

a = np.zeros(4);                print(f"np.zeros(4) :   a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,2));            print(f"np.zeros(4,2) :  a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,));             print(f"np.zeros(4,) :  a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.random_sample(4); print(f"np.random.random_sample(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
'''
np.zeros(4) :   a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.zeros(4,2) :  a = [[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]], a shape = (4, 2), a data type = float64
np.zeros(4,) :  a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.random.random_sample(4): a = [0.93520726 0.67662427 0.59914887 0.55925697], a shape = (4,), a data type = float64
'''

a = np.arange(4.);              print(f"np.arange(4.):     a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
# note the float literal 4., which yields a float64 array
a = np.random.rand(4);          print(f"np.random.rand(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
'''
np.arange(4.):     a = [0. 1. 2. 3.], a shape = (4,), a data type = float64
np.random.rand(4): a = [0.62211656 0.87796894 0.86304218 0.80868278], a shape = (4,), a data type = float64
'''


a = np.array([5,4,3,2]);  print(f"np.array([5,4,3,2]):  a = {a},     a shape = {a.shape}, a data type = {a.dtype}")
a = np.array([5.,4,3,2]); print(f"np.array([5.,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
'''
np.array([5,4,3,2]):  a = [5 4 3 2],     a shape = (4,), a data type = int64
np.array([5.,4,3,2]): a = [5. 4. 3. 2.], a shape = (4,), a data type = float64
'''
1.2.2、The operations on vectors
  • Indexing means referring to an element of an array by its position within the array.
    - a.shape[0] returns the length of the array
  • Slicing means getting a subset of elements from an array based on their indices.
    With a = np.arange(10), i.e. a = [0 1 2 3 4 5 6 7 8 9]:
    • c = a[2:7:2]; print("a[2:7:2] = ", c)  # a[2:7:2] = [2 4 6]
    • c = a[3:];    print("a[3:] = ", c)     # a[3:] = [3 4 5 6 7 8 9]
  • Operations on a single vector; here a = np.array([1, 2, 3, 4]):
a = np.array([1, 2, 3, 4])
b = -a 
print(f"b = -a        : {b}")
b = np.sum(a) 
print(f"b = np.sum(a) : {b}")
b = np.mean(a)
print(f"b = np.mean(a): {b}")
b = a**2
print(f"b = a**2      : {b}")
'''
b = -a        : [-1 -2 -3 -4]
b = np.sum(a) : 10
b = np.mean(a): 2.5
b = a**2      : [ 1  4  9 16]
'''
  • Operations on two vectors (see the sketch after this list)
    - c = a + b   # element-wise sum (a and b must have the same shape)
    - b = 5 * a   # scalar-vector multiplication
    - c = np.dot(a, b)   # dot product of two 1-D vectors, returns a scalar
  • shape
    • For a scalar, np.shape returns (); for a 1-D vector a, a.shape returns (length of a,).
    • For a 2-D array such as a = [[0 1] [2 3] [4 5]]:
      # a.shape: (3, 2), a[2].shape: (2,)
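A minimal runnable sketch of these vector-vector operations; the example values are arbitrary:

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([-1, -2, 3, 4])

print(f"a + b        = {a + b}")          # element-wise sum: [0 0 6 8]
print(f"5 * a        = {5 * a}")          # scalar multiple:  [ 5 10 15 20]
print(f"np.dot(a, b) = {np.dot(a, b)}")   # scalar: 1*-1 + 2*-2 + 3*3 + 4*4 = 20
print(f"np.dot(a, b).shape = {np.dot(a, b).shape}")  # () - a scalar has an empty shape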
1.2.3、Matrices

Matrices are two-dimensional arrays. m is often the number of rows and n the number of columns. Matrices have a two-dimensional (2-D) index [m, n].

  • Creation
a = np.zeros((1, 5))                                       
print(f"a shape = {a.shape}, a = {a},{a.dtype}")                     

a = np.zeros((2, 1))                                                                   
print(f"a shape = {a.shape}, a = {a}") 

a = np.random.random_sample((1, 1))  
print(f"a shape = {a.shape}, a = {a}")
"""
a shape = (1, 5), a = [[0. 0. 0. 0. 0.]],float64
a shape = (2, 1), a = [[0.]
[0.]] 
# Notice further that NumPy, when printing, prints one row per line.
a shape = (1, 1), a = [[0.04997798]]
"""
a = np.array([[5], [4], [3]]);   print(f" a shape = {a.shape}, np.array: a = {a}")
#  a shape = (3, 1), np.array: a = [[5]
#  [4]
#  [3]]
a = np.arange(6).reshape(-1, 2)# It is equal to a = np.arange(6).reshape(3, 2)
"""
a= [[0 1]
 [2 3]
 [4 5]]
""" 
# The -1 argument tells the routine to compute the number of rows given the size of the array and the number of columns.
  • Slicing (2-D); with a = np.arange(20).reshape(-1, 10), a runnable sketch follows below:
    a[:, 2:7:1] = [[ 2  3  4  5  6]
                   [12 13 14 15 16]]
    a[:, 2:7:1].shape = (2, 5), a 2-D array
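A minimal runnable sketch of 2-D slicing, using the a = np.arange(20).reshape(-1, 10) shown above:

import numpy as np

a = np.arange(20).reshape(-1, 10)   # 2 rows, 10 columns: [[0..9], [10..19]]

# columns 2 through 6 of every row
print("a[:, 2:7:1] =\n", a[:, 2:7:1])            # [[ 2  3  4  5  6] [12 13 14 15 16]]
print("a[:, 2:7:1].shape =", a[:, 2:7:1].shape)  # (2, 5)

# a single row
print("a[1, :] =", a[1, :])   # [10 11 12 13 14 15 16 17 18 19]
print("a[1]    =", a[1])      # same as above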
1.3 Gradient descent for multiple linear regression

Repeat until convergence:

$$w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \qquad \text{for } j = 0 \dots n-1$$

$$b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}$$

where:

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)$$

1.4、The Lab
  • $\mathbf{x}^{(i)}$ is the vector containing example i: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
  • The code
import copy
import math
import numpy as np

def compute_cost(X, y, w, b): 
    """
    compute cost
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):                                
        f_wb_i = np.dot(X[i], w) + b           #(n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2       #scalar
    cost = cost / (2 * m)                      #scalar    
    return cost


def compute_gradient(X, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]   
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]    
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m                                
        
    return dj_db, dj_dw


def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    """
    Performs batch gradient descent to learn theta. Updates theta by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters  
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent
      
    Returns:
      w (ndarray (n,)) : Updated values of parameters 
      b (scalar)       : Updated value of parameter 
      """
    
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  #avoid modifying global w within function
    b = b_in
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db,dj_dw = gradient_function(X, y, w, b)   ##None

        # Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw               ##None
        b = b - alpha * dj_db               ##None
      
        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(X, y, w, b))

        # Print cost every at intervals 10 times or as many iterations if < 10
        if i% math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}   ")
        
    return w, b, J_history #return final w,b and J history for graphing
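A small usage sketch; the training data, initial parameters, learning rate, and iteration count below are illustrative assumptions rather than values prescribed by the course:

# tiny made-up dataset: 3 examples, 4 features (size, bedrooms, floors, age)
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [ 852, 2, 1, 35]], dtype=float)
y_train = np.array([460., 232., 178.])

w_init = np.zeros(X_train.shape[1])
b_init = 0.

w_final, b_final, J_hist = gradient_descent(X_train, y_train, w_init, b_init,
                                            compute_cost, compute_gradient,
                                            alpha=5.0e-7, num_iters=1000)
print(f"w, b found by gradient descent: {w_final}, {b_final:0.2f}")
print(f"prediction for the first example: {np.dot(X_train[0], w_final) + b_final:0.2f}, target: {y_train[0]}")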
2、Gradient descent in practice
2.1、Feature scaling

[Figure: contour plots of the cost J(w,b) with unscaled features (left) and scaled features (right)]

As you can see, when x1 and x2 are on different scales, their contributions to the final cost also differ. In the left plot you can picture the cost surface as three-dimensional terrain, a rugged mountain path: the direction that gradient descent points in may lead to different points and does not aim at the minimum. In the right plot the inputs share the same scale; the terrain is still three-dimensional, but the contours are well layered and nearly symmetric, so the negative gradient keeps pointing toward the minimum. That is why feature scaling speeds up convergence. (Adapted from the article "Feature Scaling 的意义".)


  • Mean normalization
    $$x_i := \frac{x_i - \mu_i}{\max - \min}$$

  • Z-score normalization
    $$x^{(i)}_j = \frac{x^{(i)}_j - \mu_j}{\sigma_j}$$
    where $j$ selects a feature or a column in the $\mathbf{X}$ matrix, and:
    $$\sigma^2_j = \frac{1}{m} \sum_{i=0}^{m-1} \left(x^{(i)}_j - \mu_j\right)^2$$
    After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.

    Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation previously computed from the training set (see the usage sketch after the code below).

    • The code:
def zscore_normalize_features(X):
    """
    computes  X, zcore normalized by column
    
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
      
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu     = np.mean(X, axis=0)                 # mu will have shape (n,)
    # axis = 0 means Vertical axis while axis = 1 means Horizontal
    # find the standard deviation of each column/feature
    sigma  = np.std(X, axis=0)                  # sigma will have shape (n,)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma      

    return (X_norm, mu, sigma)
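A short usage sketch; the X_train values below are made-up data. It also shows how a new example is normalized with the stored mu and sigma rather than with statistics of its own:

X_train = np.array([[952., 2.], [1244., 3.], [1947., 3.], [1725., 3.]])  # made-up data

X_norm, mu, sigma = zscore_normalize_features(X_train)
print(f"mu = {mu}, sigma = {sigma}")
print(f"per-column mean after normalization: {np.mean(X_norm, axis=0)}")  # ~0
print(f"per-column std  after normalization: {np.std(X_norm, axis=0)}")   # ~1

# a new example must reuse the training-set mu and sigma
x_new = np.array([1200., 2.])
x_new_norm = (x_new - mu) / sigma
print(f"normalized new example: {x_new_norm}")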
2.2、The learning rate

With a small enough $\alpha$, $J(\mathbf{w},b)$ should decrease on every iteration.
[Figure: cost J(w,b) plotted against the number of iterations]
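A practical way to check the learning rate is to plot the cost history that gradient_descent returns; a minimal sketch, assuming matplotlib is installed and J_hist is the history returned by the earlier gradient_descent call:

import matplotlib.pyplot as plt

# J_hist: list of costs returned by gradient_descent(...) above
plt.plot(J_hist)
plt.xlabel("iteration step")
plt.ylabel("cost J(w,b)")
plt.title("Cost versus iteration")
plt.show()
# If the curve increases or oscillates, alpha is too large;
# if it decreases but very slowly, alpha can likely be increased.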

2.3、Feature engineering

Feature engineering means using intuition or domain knowledge to design new features, by transforming or combining the original ones, so that the learning algorithm can make accurate predictions more easily; a small sketch follows below.
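For instance (a hypothetical illustration in the spirit of the lecture's house-price example, not code from the course), if the raw features are a lot's frontage and depth, their product, the lot area, may be a more predictive feature:

import numpy as np

# made-up raw features: frontage and depth of each lot (metres)
frontage = np.array([24., 15., 30., 18.])
depth    = np.array([40., 32., 25., 38.])

area = frontage * depth            # engineered feature
X = np.c_[frontage, depth, area]   # original features plus the new one
print(X)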


2.4、Polynomial regression

  When we start from a simple quadratic target, $y = 1 + x^2$, a straight line in $x$ is not a great fit.
  If we replace the original feature with a version that squares the $x$ value, we can fit $y = w_0 x_0^2 + b$. Let's try it by swapping X for x**2 (a runnable sketch follows below).
  Then we can use $y = w_0 X + b$, ordinary linear regression on the engineered feature, to fit the data.
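A minimal end-to-end sketch of this idea; the data generation, learning rate, and iteration count are illustrative assumptions, and gradient_descent, compute_cost, and compute_gradient are the functions defined in section 1.4:

import numpy as np

x = np.arange(0, 20, 1)     # raw input values 0..19
y = 1 + x**2                # quadratic target

X = (x**2).reshape(-1, 1)   # engineered feature: x squared, as an (m,1) matrix

w_final, b_final, _ = gradient_descent(X, y, np.zeros(1), 0.,
                                       compute_cost, compute_gradient,
                                       alpha=1e-5, num_iters=10000)
print(f"w = {w_final}, b = {b_final:0.4f}")   # w should come out close to 1; b converges more slowly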
[Figure: the quadratic data y = 1 + x² together with the near-perfect fit obtained using the squared feature]
As we can see, the fit is nearly perfect. Here we knew in advance that an $x^2$ term was required, but it may not always be obvious which features are needed. One could add a variety of potential features and try to find the most useful ones. For example, what if we had instead tried $y = w_0 x_0 + w_1 x_1^2 + w_2 x_2^3 + b$?

X = np.c_[x, x**2, x**3]   # <-- added engineered features

Then we obtain approximately $0.08x + 0.54x^2 + 0.03x^3 + 0.0106$.
Gradient descent has emphasized the feature that best fits the data, the $x^2$ column, by making its weight $w_1$ (0.54) larger relative to the others.
Another way to think about this is to note that we are still doing linear regression once the new features have been created; consequently, the best features are those that are linear with respect to the target. This is best understood with an example, as in the figure below.
[Figure: each engineered feature (x, x², x³) plotted against the target y; the x² feature varies linearly with y]

2.5、Scikit-Learn

Please refer to sklearn (the scikit-learn documentation); a brief sketch of the equivalent workflow follows below.
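A brief sketch, not the course's exact lab code, of performing the same normalization and gradient-descent regression with scikit-learn's StandardScaler and SGDRegressor; the training data below is made up:

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# made-up training data: 4 examples, 2 features
X_train = np.array([[952., 2.], [1244., 3.], [1947., 3.], [1725., 3.]])
y_train = np.array([271.5, 232., 509.8, 394.])

scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)   # z-score normalization

sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X_norm, y_train)                # linear regression fitted by stochastic gradient descent
print(f"w = {sgdr.coef_}, b = {sgdr.intercept_}")

y_pred = sgdr.predict(X_norm)
print(f"predictions: {y_pred}, targets: {y_train}")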
