ML Linear Regression: An Introduction to the ML Approach to Linear Regression (CSDN blog)

Article link: https://blog.csdn.net/qq_43627705/article/details/121132772

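Linear Regression

These notes draw mainly on Zhou Zhihua's "Watermelon Book" (Machine Learning) and its companion, the "Pumpkin Book".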

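1. Overview

Given an example $x = (x_1;x_2;...;x_d)$ described by $d$ attributes, where $x_i$ is the value of $x$ on the $i$-th attribute, a linear model tries to learn a function that predicts through a linear combination of the attributes:
$f(x) = \omega_1 x_1 + \omega_2 x_2 + ... + \omega_d x_d + b$
In vector form:
$f(x) = \omega^T x + b$
where $\omega = (\omega_1; \omega_2;...; \omega_d)$. Once $\omega$ and $b$ are learned, the model is fully determined. Because $\omega$ directly reflects how much each attribute contributes to the prediction, linear models are highly interpretable.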
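As a minimal sketch of the vector form, the prediction is just a dot product plus a bias. The weights, bias, and input below are illustrative values chosen for this example, not learned from data.

```python
import numpy as np

# Illustrative (assumed) parameters for a model with d = 3 attributes.
omega = np.array([0.2, 0.5, 0.3])
b = 1.0

def f(x, omega, b):
    """Linear model: weighted sum of the attributes plus a bias term."""
    return np.dot(omega, x) + b

x = np.array([1.0, 2.0, 3.0])  # one example with three attribute values
prediction = f(x, omega, b)
print(prediction)
```

With these weights the second attribute has the largest coefficient, so a unit change in it moves the prediction the most, which is the interpretability property described above.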

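2. Linear Regression

2.1 Basic Model

Given a dataset $D = \{(x_1,y_1), (x_2,y_2),...,(x_m,y_m)\}$, where $x_i = (x_{i1}; x_{i2};...; x_{id})$ and $y_i \in \mathbb{R}$.

For simplicity, first consider a single input attribute. Linear regression then tries to learn
$f(x_i) = \omega x_i + b$ such that $f(x_i) \simeq y_i$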

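2.2 Univariate Objective / Loss Function

The mean squared error, which corresponds to the Euclidean distance, measures the difference between $f(x)$ and $y$. Minimizing it is the least squares method: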
$(\omega^*, b^*) = \mathop{\arg\min}\limits_{(\omega, b)} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \mathop{\arg\min}\limits_{(\omega, b)} \sum_{i=1}^{m} (y_i - \omega x_i - b)^2$

Let $E(\omega,b) = \sum_{i=1}^{m} (y_i - \omega x_i - b)^2$. Taking the partial derivatives with respect to $\omega$ and $b$ gives
$\frac{\partial E(\omega,b)}{\partial \omega} = 2\left(\omega \sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b) x_i\right)$

$\frac{\partial E(\omega,b)}{\partial b} = 2(mb - \sum_{i=1}^m (y_i - \omega x_i))$
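One way to sanity-check the two partial derivatives of $E(\omega, b)$ is to compare them with central finite differences. The sketch below assumes the five training points used in the implementation section; the evaluation point `(w, b)` and the step `eps` are arbitrary choices made for this check.

```python
import numpy as np

def sse(w, b, x, y):
    """Sum of squared errors E(w, b) for the univariate model w*x + b."""
    return np.sum((y - w * x - b) ** 2)

def grad_sse(w, b, x, y):
    """Analytic partial derivatives of E with respect to w and b."""
    dw = 2 * (w * np.sum(x ** 2) - np.sum((y - b) * x))
    db = 2 * (len(x) * b - np.sum(y - w * x))
    return dw, db

x = np.array([4.0, 8.0, 5.0, 10.0, 12.0])
y = np.array([20.0, 50.0, 30.0, 70.0, 60.0])
w, b, eps = 1.5, -2.0, 1e-6  # arbitrary evaluation point and step size

dw, db = grad_sse(w, b, x, y)
# Central differences approximate each partial derivative numerically.
dw_num = (sse(w + eps, b, x, y) - sse(w - eps, b, x, y)) / (2 * eps)
db_num = (sse(w, b + eps, x, y) - sse(w, b - eps, x, y)) / (2 * eps)
print(dw, dw_num)
print(db, db_num)
```

Because $E$ is quadratic in both parameters, the central difference should agree with the analytic value up to floating-point rounding.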

[Figure: derivation of the partial derivatives, taken from the Pumpkin Book]

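Setting both partial derivatives to zero yields the closed-form solution for the optimal $\omega$ and $b$, where $\bar x = \frac{1}{m}\sum_{i=1}^{m} x_i$:
$\omega = \frac{\sum_{i=1}^{m} y_i (x_i - \bar x)}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}(\sum_{i=1}^{m} x_i)^2}$

$b = \frac{1}{m} \sum_{i=1}^{m} (y_i - \omega x_i)$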

[Figure: derivation of the closed-form solution, taken from the Pumpkin Book]
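As a sketch of the standard argument (not necessarily step for step what the Pumpkin Book does), the closed form follows from setting the two partial derivatives of $E(\omega, b)$ to zero, writing $\bar x$ and $\bar y$ for the sample means of the $x_i$ and $y_i$:

```latex
\begin{aligned}
\frac{\partial E}{\partial b} = 0
  &\;\Rightarrow\; b = \frac{1}{m}\sum_{i=1}^{m}(y_i - \omega x_i) = \bar y - \omega \bar x \\
\frac{\partial E}{\partial \omega} = 0
  &\;\Rightarrow\; \omega \sum_{i=1}^{m} x_i^2 = \sum_{i=1}^{m} (y_i - b)\,x_i
   = \sum_{i=1}^{m} y_i x_i - (\bar y - \omega \bar x)\sum_{i=1}^{m} x_i \\
  &\;\Rightarrow\; \omega\Big(\sum_{i=1}^{m} x_i^2 - \bar x \sum_{i=1}^{m} x_i\Big)
   = \sum_{i=1}^{m} y_i x_i - \bar y \sum_{i=1}^{m} x_i \\
  &\;\Rightarrow\; \omega = \frac{\sum_{i=1}^{m} y_i (x_i - \bar x)}
   {\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\big(\sum_{i=1}^{m} x_i\big)^2}
\end{aligned}
```

The last step uses $\bar y \sum_i x_i = \bar x \sum_i y_i$ and $\bar x \sum_i x_i = \frac{1}{m}(\sum_i x_i)^2$.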

2.3 Multivariate Linear Regression

Collect $(\omega, b)$ into a single vector, which we continue to denote $\omega$, and represent the dataset $D$ as an $m \times (d + 1)$ matrix $X$. Each row corresponds to one example: the first $d$ entries are its $d$ attribute values and the last entry is always 1.
$X = \begin{bmatrix} x_{11}& x_{12}& \cdots & x_{1d} & 1\\ x_{21}& x_{22}& \cdots & x_{2d} & 1\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ x_{m1}& x_{m2}& \cdots & x_{md} & 1 \end{bmatrix} =\begin{bmatrix} x_{1}^T & 1\\ x_{2}^T & 1\\ \vdots & \vdots\\ x_{m}^T & 1 \end{bmatrix}$

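Writing the labels as a vector $y = (y_1;y_2;...;y_m)$ as well, the optimization objective becomes
$\omega^*= \mathop{\arg\min}\limits_{\omega} (y-X\omega)^T(y-X\omega)$
As in the univariate case, $\omega$ is estimated by least squares. When $X^TX$ is invertible, this gives
$\omega^* = (X^TX)^{-1}X^Ty$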
[Figure: derivation of the multivariate solution, taken from the Pumpkin Book]
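A sketch of the usual matrix-calculus route to this solution (a standard derivation, which may differ in detail from the Pumpkin Book) expands the objective $E(\omega) = (y - X\omega)^T(y - X\omega)$ and sets its gradient to zero:

```latex
\begin{aligned}
E(\omega) &= y^T y - 2\,\omega^T X^T y + \omega^T X^T X\,\omega \\
\frac{\partial E}{\partial \omega} &= 2\,X^T X\,\omega - 2\,X^T y = 2\,X^T (X\omega - y) \\
\frac{\partial E}{\partial \omega} = 0 &\;\Rightarrow\; X^T X\,\omega = X^T y \\
&\;\Rightarrow\; \omega^* = (X^T X)^{-1} X^T y \quad \text{if } X^T X \text{ is invertible}
\end{aligned}
```

The invertibility condition fails when attributes are linearly dependent or when there are fewer examples than columns of $X$; in that case the equation $X^T X\,\omega = X^T y$ has more than one solution.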

2.3 Univariate Linear Regression Implementation

'''
From a Huawei Cloud AI training camp case study
'''

import numpy as np
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
%matplotlib inline

# Load a local font file to avoid garbled Chinese characters in plots
# font_set = FontProperties(fname=r"./work/ simsun.ttc", size=12)

# Build the training set (NumPy arrays keep the arithmetic in fit() element-wise)
x_train = np.array([4, 8, 5, 10, 12])
y_train = np.array([20, 50, 30, 70, 60])

# Plotting helper: scatter plot of the training points
def draw(x_train, y_train):
    plt.scatter(x_train, y_train)

# Closed-form univariate linear regression
def fit(x_train, y_train):
    m = len(x_train)
    # Numerator: sum of y_i * (x_i - mean(x))
    numerator = np.sum(np.multiply(y_train, (x_train - np.mean(x_train))))
    # Denominator: sum of x_i^2 minus (1/m) * (sum of x_i)^2
    denominator = np.sum(np.square(x_train)) - (1 / m) * (np.sum(x_train)) ** 2
    w = numerator / denominator
    b = (1 / m) * np.sum(y_train - np.multiply(w, x_train))
    # print('w = %s\nb = %s' % (w, b))
    return w, b
    
# Prediction function
def predict(w, b, x):
    y = np.multiply(w, x) + b
    return y

# Evaluate the fitted line on test inputs and plot it
def fit_test(w, b):
    x = np.linspace(4, 15, 9)  # linspace creates evenly spaced values
    y = predict(w, b, x)
    plt.plot(x, y)
    plt.show()


if __name__ == "__main__":
    draw(x_train, y_train)
    w, b = fit(x_train, y_train)
    print(w, b)  # print the slope and intercept
    fit_test(w, b)  # plot the fitted line
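A quick way to check a closed-form fit like this is to compare it with NumPy's own degree-1 polynomial fit. The sketch below re-implements the formula in a standalone helper (the name `fit_closed_form` is introduced here for illustration) and uses the same five training points.

```python
import numpy as np

x = np.array([4, 8, 5, 10, 12], dtype=float)
y = np.array([20, 50, 30, 70, 60], dtype=float)

def fit_closed_form(x, y):
    """Closed-form least squares for y ~ w*x + b (illustrative helper)."""
    m = len(x)
    w = np.sum(y * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    b = np.mean(y - w * x)
    return w, b

w, b = fit_closed_form(x, y)
# np.polyfit returns coefficients from highest degree down: [slope, intercept].
w_ref, b_ref = np.polyfit(x, y, deg=1)
print(w, b)
print(w_ref, b_ref)
```

Working the formula by hand on this data gives $\omega = 256 / 44.8 = 40/7$ and $b = 46 - 7.8\,\omega = 10/7$, so both lines should print values close to 5.714 and 1.429.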

2.4 Multivariate Linear Regression Implementation

# Multivariate linear regression
# Import modules
import numpy as np

# Build the data: the first three columns are the features X, the last column is the target Y
data = np.array([[3, 2, 9, 20],
                 [4, 10, 2, 72],
                 [3, 4, 9, 21],
                 [12, 3, 4, 20]])
# print("data:", data, "\n")

X = data[:, :-1]
Y = data[:, -1]

# Append a column of ones so the last weight acts as the intercept b.
# Plain ndarrays and the @ operator are used in place of the legacy np.mat matrix class.
X = np.c_[X, np.ones(X.shape[0])]

# print("X:", X, "\n")
# print("Y:", Y, "\n")

# Fit by the normal equation
def fit(X, Y):
    w = np.linalg.inv(X.T @ X) @ X.T @ Y  # w = (X^T X)^{-1} X^T y
    return w

def predict(X, w):
    X = np.asarray(X)
    X = np.c_[X, np.ones(X.shape[0])]  # same column of ones as in training
    y = X @ w
    return y

if __name__ == "__main__":
    w = fit(X, Y)
    y = predict([[60, 60, 60]], w)  # test on a new sample
    print(w, y)
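Explicitly inverting $X^TX$ can be numerically fragile when features are nearly collinear. A hedged alternative, sketched here on the same toy data, is to let `np.linalg.lstsq` solve the least-squares problem directly and compare the result with the normal-equation solution. The variable names are chosen for this example.

```python
import numpy as np

data = np.array([[3, 2, 9, 20],
                 [4, 10, 2, 72],
                 [3, 4, 9, 21],
                 [12, 3, 4, 20]], dtype=float)
X = np.c_[data[:, :-1], np.ones(len(data))]  # features plus a column of ones
y = data[:, -1]

# Normal equation with an explicit inverse, as in the formula for omega*.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; rcond=None selects NumPy's default cutoff for small singular values.
w_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

print(w_inv)
print(w_lstsq)
print(rank)
```

On this small, full-rank example the two weight vectors should agree closely. The solver route is mainly useful when $X^TX$ is close to singular, where `np.linalg.inv` may raise an error or return inaccurate values.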

2.5 Wrapping a Custom Linear Regression Model

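from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # California housing dataset
from sklearn.metrics import r2_score
import numpy as np

# A hand-written linear regression class
class LinearRegression_():
    def __init__(self, w=None):
        self.w = w  # omega, with the intercept b as its last entry

    # Fit
    def fit(self, X, Y):
        X = np.asarray(X)
        X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for b
        Y = np.asarray(Y)  # make sure the target is an array
        self.w = np.linalg.inv(X.T @ X) @ X.T @ Y  # normal equation: w = (X^T X)^{-1} X^T y
        # print(self.w)

    # Predict
    def predict(self, X):
        X = np.asarray(X)
        X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for b
        y = X @ self.w  # compute the predictions
        return y

if __name__ == "__main__":
    # Load and split the data the same way as in the sklearn example of section 2.6
    housevalue = fetch_california_housing()
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(
        housevalue.data, housevalue.target, test_size=0.3, random_state=420)

    clf = LinearRegression_()  # instantiate
    clf.fit(Xtrain, Ytrain)  # train
    y_pred = clf.predict(Xtest)  # predict
    print(r2_score(Ytest, y_pred))  # evaluate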

2.6 sklearn Implementation

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # California housing dataset
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd

housevalue = fetch_california_housing()  # load the data
X = pd.DataFrame(data=housevalue.data, columns=housevalue.feature_names)
Y = housevalue.target
# X.head()
# Y[:5]

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=420)  # split the dataset

lr = LinearRegression()  # instantiate
lr.fit(Xtrain, Ytrain)  # fit the model

y_pred = lr.predict(Xtest)  # predict

print('r2_score: %s' % r2_score(Ytest, y_pred))

# Inspect the model coefficients
print('Coefficients:', lr.coef_)
print('Intercept:', lr.intercept_)
print(list(zip(X.columns, lr.coef_)))

2.6.2 Parameters

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

fit_intercept : bool, default=True

Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

normalize : bool, default=False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False. This parameter was deprecated in scikit-learn 1.0 and removed in 1.2, which is why the signature above shows normalize='deprecated'.

copy_X : bool, default=True

If True, X will be copied; else, it may be overwritten.

n_jobs : int, default=None

The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

positive : bool, default=False

When set to True, forces the coefficients to be positive. This option is only supported for dense arrays.
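To see what `fit_intercept` and `positive` do in practice, the sketch below fits three `LinearRegression` variants on synthetic, noise-free data. The data-generating weights (3, -2) and bias 5 are assumptions made for this illustration, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x1 - 2*x2 + 5 with no noise (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

with_b = LinearRegression(fit_intercept=True).fit(X, y)
no_b = LinearRegression(fit_intercept=False).fit(X, y)
nonneg = LinearRegression(positive=True).fit(X, y)

print(with_b.coef_, with_b.intercept_)  # should recover the true weights and bias
print(no_b.coef_, no_b.intercept_)      # intercept is fixed at 0.0
print(nonneg.coef_, nonneg.intercept_)  # all coefficients are constrained to be >= 0
```

With `fit_intercept=False` the model has no bias term, so it cannot represent the offset of 5. With `positive=True` the negative true weight on the second feature cannot be recovered, because coefficients are constrained to be non-negative.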

These are study notes only; they will be removed on request in case of infringement.