II. Linear Regression

bestwangyulu

Tags: python, machine learning, data analysis, sklearn, linear regression

Article link: https://blog.csdn.net/Y_L_Wang/article/details/123970981

1. Theoretical derivation

A linear model assumes that the prediction $\hat{y}$ is a linear combination of the attributes $x=\{x_1,x_2,\cdots,x_p\}$, that is:
$\begin{aligned} \hat{y} &=\theta_0+\theta_1 x_1+\theta_2 x_2+ \cdots + \theta_p x_p\\ & =\theta_0 x_0+\theta_1 x_1+\theta_2 x_2+ \cdots + \theta_p x_p \qquad \text{where } x_0 \equiv 1 \end{aligned}$
Let $\theta=\{\theta_0,\theta_1,\theta_2,\cdots,\theta_p\}^T$ and $x_b=\{x_0,x_1,\cdots,x_p\}$; then:
$\hat{y} =x_b \theta$

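Suppose we have data $X=\{x^{(1)},x^{(2)},\cdots,x^{(N)}\}^T$ with corresponding targets $y=\{y^{(1)},y^{(2)},\cdots,y^{(N)}\}^T$, where $x^{(i)}=\{x_1^{(i)},x_2^{(i)},\cdots,x_p^{(i)}\}$. For convenience, write $X_b=\{x_b^{(1)},x_b^{(2)},\cdots,x_b^{(N)}\}^T$, where $x_b^{(i)}$ is $x^{(i)}$ with the constant $x_0^{(i)} \equiv 1$ prepended.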

Define the loss function:
$\begin{aligned} J(\theta)&=\sum_{i=1}^N (y^{(i)}-x_b^{(i)}\theta)^2 \\ &=(y-X_b\theta)^T(y-X_b\theta) \end{aligned}$
Setting $\frac{dJ(\theta)}{d\theta}=0$ gives:

$X_b^TX_b\theta=X_b^Ty$
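
The step from the loss function to the equation above can be spelled out by expanding the quadratic form. This is a standard matrix-calculus sketch, using $\frac{d}{d\theta}(\theta^TA\theta)=2A\theta$ for symmetric $A$:
$\begin{aligned} J(\theta) &= y^Ty - 2\theta^TX_b^Ty + \theta^TX_b^TX_b\theta \\ \frac{dJ(\theta)}{d\theta} &= -2X_b^Ty + 2X_b^TX_b\theta = 0 \end{aligned}$
Dividing by 2 and moving $X_b^Ty$ to the right-hand side yields $X_b^TX_b\theta=X_b^Ty$.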

Assuming $X_b^TX_b$ is invertible, we get:

$\hat{\theta}=(X_b^TX_b)^{-1}X_b^Ty$

Correspondingly:
$\hat{y} =X_b(X_b^TX_b)^{-1}X_b^Ty$
*The procedure above is known as the normal equation (closed-form solution) of linear regression.*
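
The closed form relies on $X_b^TX_b$ being invertible. As a rough sketch of what can be done when that assumption fails, the example below builds synthetic data with two perfectly collinear features and falls back on `np.linalg.lstsq`, which returns a minimum-norm least-squares solution without forming an inverse. The data and variable names are illustrative assumptions, not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
# The second feature is an exact multiple of the first (perfect collinearity)
X = np.column_stack([x1, 2.0 * x1])
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=N)

# Prepend the constant column x_0 = 1
X_b = np.hstack([np.ones((N, 1)), X])

# X_b has 3 columns but only rank 2, so X_b^T X_b is singular
rank = np.linalg.matrix_rank(X_b)

# lstsq solves the least-squares problem directly and handles rank deficiency
theta_hat, residuals, lstsq_rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
y_hat = X_b @ theta_hat
mse = np.mean((y_hat - y) ** 2)
print(rank, mse)
```

The individual coefficients of the collinear features are not uniquely determined in this situation, but the fitted values $\hat{y}$ still are.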

2. Code implementation

Implementation based on the normal equation

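# Load the data
from sklearn.datasets import fetch_california_housing
datasets = fetch_california_housing()
X, y = datasets.data, datasets.target
# Preprocessing: standardize the features
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X = ss.fit_transform(X)

import numpy as np
# Prepend the constant column x_0 = 1, so theta_hat[0] is the intercept
X_b = np.hstack([np.ones((len(X), 1)), X])
# Estimate the parameters with the normal equation
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
# Compute the fitted values
y_hat = X_b @ theta_hat
# Compute the mean squared error; y and y_hat are both 1-D,
# which avoids accidental broadcasting to an N x N matrix
mse = np.mean((y_hat - y) ** 2)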

Implementation based on gradient descent

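def loss(X_b, y, theta):
    # Scaled loss J(theta) / (2N); it has the same minimizer as J(theta)
    r = y - X_b @ theta
    return r @ r / (2 * len(X_b))

def loss_d_theta(X_b, y, theta):
    # Gradient of the scaled loss above with respect to theta
    return X_b.T @ (X_b @ theta - y) / len(X_b)

def GD(X_b, y, theta_0, lr=0.1, e=1e-8, max_iter=10000):
    theta = theta_0
    l = loss(X_b, y, theta)
    for _ in range(max_iter):
        d = loss_d_theta(X_b, y, theta)
        new_theta = theta - lr * d
        new_l = loss(X_b, y, new_theta)
        # Stop once the loss changes by less than the tolerance e
        if np.abs(l - new_l) < e:
            return new_theta
        theta = new_theta
        l = new_l
    # Iteration budget exhausted: return the latest estimate
    return theta

theta_0 = np.zeros(X_b.shape[1])
theta_hat = GD(X_b, y, theta_0)
y_hat = X_b @ theta_hat
np.mean((y_hat - y) ** 2)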
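
Gradient descent only works if the gradient function really is the derivative of the loss being monitored. A minimal sketch of a finite-difference check is shown below. It assumes the loss is scaled as $J(\theta)/(2N)$, whose gradient is $X_b^T(X_b\theta-y)/N$, the form used by the gradient function above. The function names `half_mse`, `half_mse_grad`, and `numerical_grad` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def half_mse(X_b, y, theta):
    # Scaled loss J(theta) / (2N)
    r = y - X_b @ theta
    return r @ r / (2 * len(X_b))

def half_mse_grad(X_b, y, theta):
    # Analytic gradient of half_mse with respect to theta
    return X_b.T @ (X_b @ theta - y) / len(X_b)

def numerical_grad(f, theta, h=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

# Small synthetic problem (illustrative data only)
rng = np.random.default_rng(0)
X_b = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 3))])
y = rng.normal(size=30)
theta = rng.normal(size=4)

analytic = half_mse_grad(X_b, y, theta)
numeric = numerical_grad(lambda t: half_mse(X_b, y, t), theta)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

If the loss and the gradient used different scalings (for example, a summed loss with an averaged gradient), this difference would be large rather than close to zero.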

3. Linear regression in `sklearn`

# Fit the model and predict
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# fit returns the estimator itself; the parameters are in lr.intercept_ and lr.coef_
lr.fit(X, y)
y_hat = lr.predict(X)

# Evaluation metrics (sklearn expects y_true first, then y_pred)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
{
    "MSE": mean_squared_error(y, y_hat),
    "R2": r2_score(y, y_hat),
    "evs": explained_variance_score(y, y_hat)
}
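
To connect the `sklearn` estimator with the notation of section 1, the sketch below reads $\hat{\theta}$ from a fitted `LinearRegression` through its `intercept_` and `coef_` attributes and compares it with a least-squares solution computed in NumPy. It also shows that the argument order of `r2_score` matters, since the signature is `r2_score(y_true, y_pred)`. The synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with known coefficients (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# fit returns the estimator itself, not the parameter vector
model = LinearRegression().fit(X, y)
# Stack intercept and coefficients into theta_hat = (theta_0, theta_1, theta_2)
theta_hat = np.concatenate([[model.intercept_], model.coef_])

# Least-squares solution on the design matrix with a leading column of ones
X_b = np.hstack([np.ones((len(X), 1)), X])
theta_ls = np.linalg.lstsq(X_b, y, rcond=None)[0]

y_hat = model.predict(X)
r2_correct = r2_score(y, y_hat)   # (y_true, y_pred)
r2_swapped = r2_score(y_hat, y)   # swapped arguments give a different value
print(theta_hat, theta_ls, r2_correct, r2_swapped)
```

The mean squared error is symmetric in its two arguments, so only `r2_score` and `explained_variance_score` are affected by the order.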

4. Polynomial regression

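For univariate nonlinear data, one option is to fit a polynomial, that is:
$\hat{y} =\theta_0+\theta_1 x+\theta_2 x^2+ \cdots + \theta_n x^n$
Writing $x_d=\{1,x,x^2,\cdots,x^n\}$ and $\theta=\{\theta_0,\theta_1,\cdots,\theta_n\}^T$, this becomes:
$\hat{y} =x_d \theta$
The univariate polynomial regression problem is thereby converted into a multiple linear regression problem.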

The sklearn.preprocessing.PolynomialFeatures transformer adds higher-degree terms as extra features for univariate data.
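
As a small illustration of what the transformer produces, the sketch below applies `PolynomialFeatures(degree=2)` to two hand-picked values. The input values are arbitrary and chosen only so that the output is easy to read.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples of a single feature x
X = np.array([[2.0], [3.0]])

pf = PolynomialFeatures(degree=2)
X_poly = pf.fit_transform(X)

# Each row becomes (1, x, x^2):
# [[1. 2. 4.]
#  [1. 3. 9.]]
print(X_poly)
```

The leading column of ones plays the role of $x_0 \equiv 1$; it can be switched off with `include_bias=False` when the downstream model fits its own intercept.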

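import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
# Quadratic relationship plus Gaussian noise
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)
ss = StandardScaler()
X = ss.fit_transform(X)

# Add higher-degree features
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
X = pf.fit_transform(X)

# Fit the linear regression
lr = LinearRegression()
lr.fit(X, y)
y_hat = lr.predict(X)
mse = mean_squared_error(y, y_hat)

# Plot the fitted curve; sort by x so the line is drawn from left to right
import matplotlib.pyplot as plt
order = np.argsort(x)
plt.scatter(x, y)
plt.plot(x[order], y_hat[order], color="red")
plt.show()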

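5. The `Pipeline` mechanism

A Pipeline chains several transformers and a final estimator together, for example adding polynomial features, standardizing, and fitting a linear regression, to form a typical machine-learning workflow. It brings two main benefits:

Calling fit and predict directly trains and predicts with every step in the pipeline.
It can be combined with grid search to select parameters.

Code: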

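from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
# Pass each (name, estimator) step in processing order
pipe_lr=Pipeline([
    ("PF",PolynomialFeatures(degree=2)),
    ("SS",StandardScaler()),
    ("LR",LinearRegression()),
])
# X_train and y_train are not defined in the original; as an assumption,
# split the raw univariate data x, y generated in section 4
X_train, X_test, y_train, y_test = train_test_split(x.reshape(-1, 1), y, random_state=0)
pipe_lr.fit(X_train, y_train)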
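
The second benefit listed above, combining a pipeline with grid search, can be sketched as follows. The example assumes synthetic quadratic data similar to section 4 and searches over the polynomial degree with `GridSearchCV`. The candidate degrees, the 5-fold setting, and the scoring choice are illustrative assumptions rather than settings from the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data with Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, size=200)

pipe = Pipeline([
    ("PF", PolynomialFeatures()),
    ("SS", StandardScaler()),
    ("LR", LinearRegression()),
])

# A step's parameter is addressed as <step name>__<parameter name>
param_grid = {"PF__degree": [1, 2, 3, 4, 5]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_degree = search.best_params_["PF__degree"]
best_cv_mse = -search.best_score_   # the score is the negated MSE
print(best_degree, best_cv_mse)
```

Because the whole pipeline is refit inside each cross-validation fold, the scaler and the polynomial expansion only ever see the training part of that fold.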