Python3: Implementing a Linear Regression Model and Z-score Normalization
Preface: These are rough notes I took while following Andrew Ng's machine learning course. They cover the code implementation of linear regression and of Z-score normalization, along with the main linear regression formulas. Given my limited experience there may be mistakes; corrections and suggestions are very welcome!
Univariate Linear Regression: Gradient Descent Code
The three steps of univariate gradient descent
- Compute the gradient

Apply the update rule:

$$\begin{align*} \text{repeat}&\text{ until convergence: } \lbrace \\ w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{1} \\ b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \tag{2} \\ &\rbrace \end{align*}$$

where the partial derivatives are

$$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)} \tag{3} \\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{4} \end{align}$$
- Compute the cost

Apply the cost formula:

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{5}$$
- Gradient descent

Repeat the two steps above until convergence.
Code implementation
import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients
# Cost function
def compute_cost(x, y, w, b):
m = x.shape[0]
cost = 0
for i in range(m):
f_wb = w * x[i] + b
cost = cost + (f_wb - y[i])**2
total_cost = 1 / (2 * m) * cost
return total_cost
# Compute the partial derivatives of the cost with respect to w and b
def compute_gradient(x, y, w, b):
"""
    Computes the gradient for linear regression.
    Args:
x (ndarray (m,)): Data, m examples
y (ndarray (m,)): target values
w,b (scalar) : model parameters
    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
dj_db (scalar): The gradient of the cost w.r.t. the parameter b
"""
# Number of training examples
m = x.shape[0]
dj_dw = 0
dj_db = 0
for i in range(m):
f_wb = w * x[i] + b
dj_dw_i = (f_wb - y[i]) * x[i]
dj_db_i = f_wb - y[i]
dj_db += dj_db_i
dj_dw += dj_dw_i
dj_dw = dj_dw / m
dj_db = dj_db / m
return dj_dw, dj_db
# Gradient descent
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
"""
Performs gradient descent to fit w,b. Updates w,b by taking
num_iters gradient steps with learning rate alpha
    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : Learning rate
      num_iters (int)   : number of iterations to run gradient descent
      cost_function     : function to call to produce cost
      gradient_function : function to call to produce gradient
    Returns:
w (scalar): Updated value of parameter after running gradient descent
b (scalar): Updated value of parameter after running gradient descent
J_history (List): History of cost values
p_history (list): History of parameters [w,b]
"""
    w = copy.deepcopy(w_in)  # avoid modifying global w_in
    # arrays to store the cost J and the parameters [w,b] at each iteration, for graphing later
J_history = []
p_history = []
b = b_in
w = w_in
for i in range(num_iters):
        # Calculate the gradient using gradient_function
dj_dw, dj_db = gradient_function(x, y, w , b)
        # Update the parameters using equations (1) and (2) above
b = b - alpha * dj_db
w = w - alpha * dj_dw
        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
J_history.append( cost_function(x, y, w , b))
p_history.append([w,b])
        # Print the cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
if i% math.ceil(num_iters/10) == 0:
print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
f"w: {w: 0.3e}, b:{b: 0.5e}")
return w, b, J_history, p_history #return w and J,w history for graphing
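A quick way to sanity-check the routine is to run it on a tiny, made-up data set. The values below (two data points, learning rate, iteration count) are illustrative only and rely on the imports and functions defined above, not on the original course lab:

# hypothetical toy data: two (feature, target) pairs
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

w_init, b_init = 0.0, 0.0      # start far from the optimum
tmp_alpha = 1.0e-2             # learning rate
iterations = 10000             # number of gradient-descent steps

w_final, b_final, J_hist, p_hist = gradient_descent(
    x_train, y_train, w_init, b_init, tmp_alpha, iterations,
    compute_cost, compute_gradient)
print(f"(w, b) found by gradient descent: ({w_final:8.4f}, {b_final:8.4f})")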
Notes
- In the gradient descent update, the derivatives must be computed first and w, b updated simultaneously.
- The initial w and b do not need to be near the optimum; they are deliberately chosen far from the best values so the descent can be observed.
- The learning rate α must be chosen carefully; if it is too large, the iteration diverges.
Multivariate Linear Regression: Gradient Descent Code
The data matrix
Training examples
Suppose there are m examples and n features. The data can be stored in an (m, n) matrix $\mathbf{X}$:

$$\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}$$
where:
- $\mathbf{x}^{(i)}$ is one example: $\mathbf{x}^{(i)} = (x^{(i)}_{0}, x^{(i)}_{1}, \cdots, x^{(i)}_{n-1})$
- $x^{(i)}_{j}$ is element j of example i. The superscript is the example index; the subscript is the feature index.
The parameter vector
$\mathbf{w}$ is a vector with n elements:
- each element of $\mathbf{w}$ is associated with one feature
- it is written as a column vector

$$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots \\ w_{n-1} \end{pmatrix}$$

- $b$ is a scalar parameter

The initial value supplied for the parameters is therefore also a vector $\mathbf{w}$.
Computing the linear regression prediction $f(\mathbf{x})$
Without vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = w_0 x_0 + w_1 x_1 + \cdots + w_{n-1} x_{n-1} + b \tag{1}$$
With vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}$$

Here $\cdot$ denotes the vector dot product.
Implementation
Loop version, corresponding to equation (1):
def predict_single_loop(x, w, b):
"""
single predict using linear regression 用线性回归预测一次
Args:
x (ndarray): Shape (n,) example with multiple features(一个样本的向量)
w (ndarray): Shape (n,) model parameters(权重的向量)
b (scalar): model parameter(模型标参)
Returns:
p (scalar): prediction(函数预测值)
"""
n = x.shape[0]
p = 0
for i in range(n):
p_i = x[i] * w[i]
p = p + p_i
p = p + b
return p
Dot-product version, corresponding to equation (2):
def predict(x, w, b):
"""
single predict using linear regression
Args:
x (ndarray): Shape (n,) example with multiple features
w (ndarray): Shape (n,) model parameters
b (scalar): model parameter
Returns:
p (scalar): prediction
"""
p = np.dot(x, w) + b
return p
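A small hedged check of both prediction routines on made-up numbers (the vectors below are illustrative and assume the NumPy import from the top of the post):

# hypothetical single example with 4 features, plus made-up parameters
x_vec = np.array([2104., 5., 1., 45.])
w_vec = np.array([0.39, 18.75, -53.36, -26.42])
b_scalar = 785.18

print(predict_single_loop(x_vec, w_vec, b_scalar))   # loop version, equation (1)
print(predict(x_vec, w_vec, b_scalar))               # np.dot version, equation (2); same value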
Computing the cost function $J(\mathbf{w},b)$
The cost $J(\mathbf{w},b)$ is:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$
where, in vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4}$$
Implementation
def compute_cost(X, y, w, b):
"""
compute cost
Args:
X (ndarray (m,n)): Data, m examples with n features(所有输入的矩阵)
y (ndarray (m,)) : target values(目标变量)
w (ndarray (n,)) : model parameters(权重向量)
b (scalar) : model parameter(标参)
Returns:
cost (scalar): cost(代价函数)
"""
m = X.shape[0]
cost = 0.0
for i in range(m):
f_wb_i = np.dot(X[i], w) + b #(n,)(n,) = scalar (see np.dot)
cost = cost + (f_wb_i - y[i])**2 #scalar
cost = cost / (2 * m) #scalar
return cost
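As a sketch, the cost can be evaluated on a tiny made-up training set (the matrix below is illustrative only; it is reused in the later sketches):

# hypothetical training set: 3 examples, 4 features each
X_train = np.array([[2104., 5., 1., 45.],
                    [1416., 3., 2., 40.],
                    [ 852., 2., 1., 35.]])
y_train = np.array([460., 232., 178.])

w_init = np.zeros(X_train.shape[1])
b_init = 0.0
print(f"Cost at initial (w, b): {compute_cost(X_train, y_train, w_init, b_init):0.2f}")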
Computing the gradient
Parameter update rule:

$$\begin{align*} \text{repeat}&\text{ until convergence: } \lbrace \\ & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j = 0 \ldots n-1 \tag{5} \\ & b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ & \rbrace \end{align*}$$
Derivative terms:

$$\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})\,x^{(i)}_{j} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align}$$

where m is the number of examples in the data set, $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the prediction, and $y^{(i)}$ is the target value.
def compute_gradient(X, y, w, b):
"""
Computes the gradient for linear regression
Args:
X (ndarray (m,n)): Data, m examples with n features(数据矩阵)
y (ndarray (m,)) : target values(目标值)
w (ndarray (n,)) : model parameters(参数向量)
b (scalar) : model parameter(标参)
Returns:
dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
dj_db (scalar): The gradient of the cost w.r.t. the parameter b.
"""
m,n = X.shape #(number of examples, number of features)
dj_dw = np.zeros((n,))
dj_db = 0.
for i in range(m):
err = (np.dot(X[i], w) + b) - y[i]
for j in range(n):
dj_dw[j] = dj_dw[j] + err * X[i, j]
dj_db = dj_db + err
dj_dw = dj_dw / m
dj_db = dj_db / m
return dj_db, dj_dw
Implementing gradient descent
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
"""
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha
Args:
X (ndarray (m,n)) : Data, m examples with n features
y (ndarray (m,)) : target values
w_in (ndarray (n,)) : initial model parameters
b_in (scalar) : initial model parameter
cost_function : function to compute cost
gradient_function : function to compute the gradient
alpha (float) : Learning rate
num_iters (int) : number of iterations to run gradient descent
    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
      J_history (list) : History of cost values
"""
# An array to store cost J and w's at each iteration primarily for graphing later
J_history = []
w = copy.deepcopy(w_in) #avoid modifying global w within function
b = b_in
for i in range(num_iters):
# Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w, b)
# Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
# Save cost J at each iteration
if i<100000: # prevent resource exhaustion
J_history.append( cost_function(X, y, w, b))
        # Print the cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
if i% math.ceil(num_iters / 10) == 0:
print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f} ")
return w, b, J_history #return final w,b and J history for graphing
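A short, hypothetical run of this routine on the made-up X_train and y_train from the cost sketch above (the learning rate and iteration count are illustrative):

initial_w = np.zeros(X_train.shape[1])
initial_b = 0.0
alpha = 5.0e-7        # small, because the features are not yet scaled
iterations = 1000

w_final, b_final, J_hist = gradient_descent(
    X_train, y_train, initial_w, initial_b,
    compute_cost, compute_gradient, alpha, iterations)
print(f"w found: {w_final}, b found: {b_final:0.2f}")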
Optimizing the regression model
Feature scaling
Goal: make gradient descent run faster.
Problem: different features have different value ranges. When the scales differ greatly, gradient descent zig-zags back and forth and converges slowly.
Fix: rescale the features so they share roughly the same order of magnitude, e.g. roughly within (-1, 1).
Mixing the three methods below can introduce some variation in model accuracy; which one to use depends on the algorithm.
If the model was trained on feature-scaled data, the features used for prediction must be scaled the same way.
Simple feature scaling
Divide each feature by the maximum value of its range:

$$x_{j,\text{scaled}} = \frac{x_j}{x_{max}}$$

- $x_j$ denotes the j-th feature of the feature vector.
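A minimal sketch of this rescaling with NumPy; the function name max_scale_features is my own, not from the course lab:

def max_scale_features(X):
    """Divide each column by its maximum so the features fall roughly into (0, 1]."""
    x_max = np.max(X, axis=0)    # per-feature maximum, shape (n,)
    return X / x_max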
Mean normalization
Rescales the values so that they are distributed around the origin (in all four quadrants).
Formula:

$$x_j := \frac{x_j - \mu_j}{x_{max} - x_{min}}$$

- $x_j$ denotes the j-th feature of the feature vector.
- $\mu_j$ is the mean of that feature.
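A corresponding sketch; again, mean_normalize_features is a made-up name used only for illustration:

def mean_normalize_features(X):
    """Center each column on its mean and divide by its range (max - min)."""
    mu = np.mean(X, axis=0)                           # per-feature mean, shape (n,)
    x_range = np.max(X, axis=0) - np.min(X, axis=0)   # per-feature range, shape (n,)
    return (X - mu) / x_range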
Z-score normalization
Also places the data around the origin.
After Z-score normalization, every feature has mean 0 and standard deviation 1.
Formula:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

- $\sigma_j$: the standard deviation of the j-th feature. (The code implementation is given in the last section of this post.)
Checking whether gradient descent has converged
Plot a learning curve
- Plot the learning curve, i.e. $J(\mathbf{w},b)$ against the iteration number.
- $J(\mathbf{w},b)$ must decrease after every iteration.
- If it ever increases, check whether:
  - $\alpha$ is too large
  - there is a bug in the code
  - the "$-$" in $w = w - \alpha\,d$ was written as "$+$"

Automatic convergence test
Let $\epsilon$ be a small positive number, e.g. $10^{-3}$. If the decrease of $J(\mathbf{w},b)$ in one iteration is smaller than $\epsilon$, the algorithm can be considered converged.
Because a good $\epsilon$ is hard to pick, convergence is usually judged by inspecting the learning curve instead.
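As a sketch, the J_hist list returned by gradient_descent above can drive both checks; has_converged is a hypothetical helper and the epsilon value is only an example:

def has_converged(J_history, epsilon=1e-3):
    """Return True if the most recent decrease in cost is smaller than epsilon."""
    if len(J_history) < 2:
        return False
    return (J_history[-2] - J_history[-1]) < epsilon

# learning curve: cost J(w,b) against iteration number
plt.plot(J_hist)
plt.xlabel("iteration")
plt.ylabel("cost J(w,b)")
plt.show()
print("converged:", has_converged(J_hist))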
Choosing a learning rate
Ruling out problems
Method: run with a very small learning rate and watch the learning curve; this helps tell a bug apart from a learning-rate problem.
Tip: start around 0.001 and increase by roughly 3x each time (0.001, 0.003, 0.01, 0.03, 0.1, 0.3) to home in on the best value.
Result: choose the largest learning rate that does not cause divergence. (A small sweep is sketched below.)
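A minimal sketch of such a sweep, reusing the multivariate gradient_descent and the made-up X_train, y_train from earlier; the features are Z-score normalized first (my own choice) so that the textbook range of learning rates is usable:

mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)
X_norm = (X_train - mu) / sigma

# try a few candidate learning rates and compare the final cost of each run
for alpha_try in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    _, _, J_hist = gradient_descent(X_norm, y_train,
                                    np.zeros(X_norm.shape[1]), 0.0,
                                    compute_cost, compute_gradient,
                                    alpha_try, 100)
    print(f"alpha = {alpha_try:5.3f}  final cost = {J_hist[-1]:0.3e}")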
Feature engineering
Concept: use domain knowledge and the existing features to create new features, usually by transforming or combining them, so that the algorithm can make accurate predictions more easily.
Example: a piece of land has the features length ($x_1$) and width ($x_2$), giving the regression model $f(x) = w_1 x_1 + w_2 x_2 + b$. Length and width can be combined into an area feature ($x_3 = x_1 x_2$), and the extended model becomes $f(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$, which usually predicts the target more accurately.
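A minimal sketch of this kind of feature construction (the array values are made up):

# hypothetical land data: columns are length (x1) and width (x2)
X_land = np.array([[20., 30.],
                   [15., 40.],
                   [25., 25.]])
area = X_land[:, 0] * X_land[:, 1]     # engineered feature x3 = x1 * x2
X_eng = np.c_[X_land, area]            # columns are now x1, x2, x3
print(X_eng)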
Polynomial regression
Applying feature engineering to the inputs yields a polynomial regression model such as $f(x) = w_1 x + w_2 x^2 + w_3 x^3 + b$, i.e. a curve rather than a straight line. In essence, the best features are the ones that are linear with respect to the target: if the target really varies with the square of the input, then plotting the target against the squared feature gives a straight line. Polynomial regression can therefore be understood as ordinary linear regression in which each higher-order term is treated as its own feature.
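A sketch of fitting such a model with the multivariate routines above: build x, x², x³ as separate columns, normalize them, and run gradient descent (the data and hyperparameters are illustrative only):

# hypothetical target that is quadratic in x
x = np.arange(0, 20, 1, dtype=float)
y = 1 + x**2

# engineer polynomial features: x, x^2, x^3 as separate columns
X_poly = np.c_[x, x**2, x**3]
# Z-score normalize so gradient descent converges with a moderate learning rate
X_poly = (X_poly - np.mean(X_poly, axis=0)) / np.std(X_poly, axis=0)

w_fit, b_fit, _ = gradient_descent(X_poly, y, np.zeros(3), 0.0,
                                   compute_cost, compute_gradient, 1e-1, 1000)
print(f"w: {w_fit}, b: {b_fit:0.2f}")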
Z-score normalization (implementation)
Formula:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$
Code implementation:
# Z-score normalization
def zscore_normalize_features(X):
    """
    Computes X, z-score normalized by column.
    Args:
      X (ndarray): Shape (m,n) input data, m examples, n features
    Returns:
      X_norm (ndarray): Shape (m,n) input normalized by column
      mu (ndarray)    : Shape (n,) mean of each feature
      sigma (ndarray) : Shape (n,) standard deviation of each feature
    """
    # find the mean of each column/feature
    mu = np.mean(X, axis=0)      # mu will have shape (n,)
    # find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)    # sigma will have shape (n,)
    # element-wise: subtract that column's mu from each example, divide by that column's std
    X_norm = (X - mu) / sigma
    return (X_norm, mu, sigma)
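A quick usage sketch, applied to the made-up X_train from earlier; note that any example we later predict on must be normalized with the same mu and sigma obtained from the training data:

X_norm, mu, sigma = zscore_normalize_features(X_train)
print("mu:", mu, "sigma:", sigma)

# a new (made-up) example must be scaled with the training-set mu and sigma
x_new = np.array([1200., 3., 1., 40.])
x_new_norm = (x_new - mu) / sigma
print(x_new_norm)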