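ML: Linear Regression

Linear Regression

These notes mainly follow Zhou Zhihua's "watermelon book" (Machine Learning) and its companion, the "pumpkin book".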

1. Overview

Given an example described by $d$ attributes, $x = (x_1; x_2; \dots; x_d)$, where $x_i$ is the value of $x$ on the $i$-th attribute, a linear model tries to learn a function that makes predictions through a linear combination of the attributes, i.e.
$$f(x) = \omega_1 x_1 + \omega_2 x_2 + \dots + \omega_d x_d + b$$
In vector form,
$$f(x) = \omega^T x + b$$
where $\omega = (\omega_1; \omega_2; \dots; \omega_d)$. Once $\omega$ and $b$ are determined, the model is determined. $\omega$ directly expresses the importance of each attribute in the prediction, so linear models have good interpretability.
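
As a small illustration of the vector form, the sketch below evaluates $f(x) = \omega^T x + b$ for a single example. The function name `linear_predict` and all numbers are made up for illustration and do not come from the source.

```python
def linear_predict(omega, x, b):
    """Return f(x) = omega^T x + b for one example given as plain lists."""
    if len(omega) != len(x):
        raise ValueError("omega and x must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(omega, x)) + b


# Hypothetical weights for d = 3 attributes.
omega = [0.2, 0.5, 0.3]
x = [1.0, 2.0, 3.0]
print(linear_predict(omega, x, b=1.0))
```

A larger weight means the corresponding attribute moves the prediction more per unit change, which is the sense in which $\omega$ is interpretable.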

2. Linear Regression

2.1 Basic Model

Given a dataset $D = \{(x_1,y_1), (x_2,y_2), \dots, (x_m,y_m)\}$, where $x_i = (x_{i1}; x_{i2}; \dots; x_{id})$ and $y_i \in \mathbb{R}$.

For simplicity, assume there is a single input attribute. Linear regression then tries to learn
$$f(x_i) = \omega x_i + b, \quad \text{such that } f(x_i) \simeq y_i$$

2.2 Univariate Objective / Loss Function

The mean squared error is used to measure the difference between $f(x)$ and $y$; it corresponds to the Euclidean distance. Solving the model by minimizing it is the least squares method.
$$(\omega^*, b^*) = \mathop{\arg\min}\limits_{(\omega, b)} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \mathop{\arg\min}\limits_{(\omega, b)} \sum_{i=1}^{m} (y_i - \omega x_i - b)^2$$

Let $E(\omega,b) = \sum_{i=1}^{m} (y_i - \omega x_i - b)^2$. Taking the partial derivatives with respect to $\omega$ and $b$ gives
$$\frac{\partial E(\omega,b)}{\partial \omega} = 2\left(\omega \sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b)\,x_i\right)$$

$$\frac{\partial E(\omega,b)}{\partial b} = 2\left(mb - \sum_{i=1}^{m} (y_i - \omega x_i)\right)$$
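
The two partial derivatives can be sanity-checked numerically. The sketch below compares them with central finite differences on the five training points used in the implementation section of these notes. The function names and the evaluation point $(\omega, b) = (1, 2)$ are arbitrary choices for illustration.

```python
def loss(w, b, xs, ys):
    """E(w, b) = sum_i (y_i - w*x_i - b)^2."""
    return sum((y - w * x - b) ** 2 for x, y in zip(xs, ys))


def grad(w, b, xs, ys):
    """Analytic partial derivatives of E with respect to w and b."""
    m = len(xs)
    dw = 2 * (w * sum(x * x for x in xs) - sum((y - b) * x for x, y in zip(xs, ys)))
    db = 2 * (m * b - sum(y - w * x for x, y in zip(xs, ys)))
    return dw, db


def numeric_grad(w, b, xs, ys, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    dw = (loss(w + eps, b, xs, ys) - loss(w - eps, b, xs, ys)) / (2 * eps)
    db = (loss(w, b + eps, xs, ys) - loss(w, b - eps, xs, ys)) / (2 * eps)
    return dw, db


xs = [4, 8, 5, 10, 12]
ys = [20, 50, 30, 70, 60]
print(grad(1.0, 2.0, xs, ys))          # analytic
print(numeric_grad(1.0, 2.0, xs, ys))  # numeric, should agree closely
```

Because $E$ is quadratic in $\omega$ and $b$, the central difference agrees with the analytic gradient up to floating-point rounding.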

The derivation is taken from the pumpkin book.

[Figures: derivation of the two partial derivatives, from the pumpkin book]

Setting both partial derivatives to zero gives the closed-form solution for the optimal $\omega$ and $b$:
$$\omega = \frac{\sum_{i=1}^{m} y_i (x_i - \bar x)}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}$$

$$b = \frac{1}{m} \sum_{i=1}^{m} (y_i - \omega x_i)$$

where $\bar x = \frac{1}{m}\sum_{i=1}^{m} x_i$ is the mean of the $x_i$.
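
As a worked check of the closed form, take the five training points used in the implementation section of these notes, $x = (4, 8, 5, 10, 12)$ and $y = (20, 50, 30, 70, 60)$, so that $m = 5$, $\sum x_i = 39$, $\sum y_i = 230$ and $\bar x = 7.8$. The arithmetic below is a hand calculation added for illustration.

```latex
\begin{aligned}
\sum_{i=1}^{5} y_i (x_i - \bar x) &= 20(-3.8) + 50(0.2) + 30(-2.8) + 70(2.2) + 60(4.2) = 256 \\
\sum_{i=1}^{5} x_i^2 - \frac{1}{5}\Big(\sum_{i=1}^{5} x_i\Big)^2 &= 349 - \frac{39^2}{5} = 349 - 304.2 = 44.8 \\
\omega &= \frac{256}{44.8} = \frac{40}{7} \approx 5.714 \\
b &= \frac{1}{5}\Big(230 - \frac{40}{7}\cdot 39\Big) = \frac{10}{7} \approx 1.429
\end{aligned}
```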

The derivation is taken from the pumpkin book.

[Figure: derivation of the closed-form solution, from the pumpkin book]

2.3 Multivariate Linear Regression

Write $(\omega, b)$ as a single vector $\omega$, with $b$ as its last component, and represent the dataset $D$ as an $m \times (d + 1)$ matrix $X$. Each row corresponds to one example: the first $d$ elements are the example's $d$ attribute values, and the last element is always 1,
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1\\ x_{21} & x_{22} & \cdots & x_{2d} & 1\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{bmatrix} = \begin{bmatrix} x_{1}^T & 1\\ x_{2}^T & 1\\ \vdots & \vdots\\ x_{m}^T & 1 \end{bmatrix}$$

Writing the labels as a vector as well, $y = (y_1; y_2; \dots; y_m)$, the optimization objective becomes
$$\omega^* = \mathop{\arg\min}\limits_{\omega} (y - X\omega)^T (y - X\omega)$$
Likewise, estimating $\omega$ by least squares gives, when $X^TX$ is invertible,
$$\omega = (X^TX)^{-1}X^Ty$$
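The derivation is taken from the pumpkin book.

[Figure: derivation of the multivariate closed-form solution, from the pumpkin book]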
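
The closed form can be sketched in NumPy as follows. This is a minimal sketch on made-up, noise-free data, and the helper name `normal_equation` is hypothetical. Instead of forming the inverse explicitly, it solves the linear system $X^TX\omega = X^Ty$ with `np.linalg.solve`, and it compares the result with `np.linalg.lstsq`, which also copes with the case where $X^TX$ is not invertible.

```python
import numpy as np


def normal_equation(features, y):
    """Solve (X^T X) w = X^T y, where X is `features` plus a column of ones."""
    features = np.asarray(features, dtype=float)
    X = np.c_[features, np.ones(features.shape[0])]  # last column is the bias term
    return np.linalg.solve(X.T @ X, X.T @ np.asarray(y, dtype=float))


# Made-up data generated from y = 2*x1 - 3*x2 + 5 (no noise).
features = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 4.0], [5.0, 5.0]])
y = 2 * features[:, 0] - 3 * features[:, 1] + 5

w = normal_equation(features, y)
w_lstsq, *_ = np.linalg.lstsq(np.c_[features, np.ones(len(features))], y, rcond=None)
print(w)        # approximately [2, -3, 5]
print(w_lstsq)  # approximately the same
```

The last entry of the returned vector is the bias $b$, matching the convention of placing the column of ones last in $X$.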

2.3 Univariate Linear Regression Implementation

'''
From a Huawei Cloud AI training camp case study.
'''

import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter-only magic; uncomment when running in a notebook

# To render Chinese labels without garbled characters, load a local font file:
# from matplotlib.font_manager import FontProperties
# font_set = FontProperties(fname=r"./work/ simsun.ttc", size=12)

# Build the training set (NumPy arrays, so element-wise arithmetic is well defined)
x_train = np.array([4, 8, 5, 10, 12])
y_train = np.array([20, 50, 30, 70, 60])

# Scatter plot of the training data
def draw(x_train, y_train):
    plt.scatter(x_train, y_train)

# Fit a univariate linear regression with the closed-form solution
def fit(x_train, y_train):
    m = len(x_train)
    numerator = np.sum(np.multiply(y_train, x_train - np.mean(x_train)))
    denominator = np.sum(np.square(x_train)) - (1 / m) * np.sum(x_train) ** 2
    w = numerator / denominator
    b = (1 / m) * np.sum(y_train - np.multiply(w, x_train))
    # print('w = %s\nb = %s' % (w, b))
    return w, b

# Prediction function
def predict(w, b, x):
    y = np.multiply(w, x) + b
    return y

# Evaluate the fitted line on a range of test inputs and plot it
def fit_test(w, b):
    x = np.linspace(4, 15, 9)  # linspace creates an evenly spaced sequence
    y = predict(w, b, x)
    plt.plot(x, y)
    plt.show()


if __name__ == "__main__":
    draw(x_train, y_train)
    w, b = fit(x_train, y_train)
    print(w, b)  # slope and intercept
    fit_test(w, b)  # plot the fitted line
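
One way to cross-check the closed-form fit is to compare it with `np.polyfit`, which fits a least-squares polynomial and, for `deg=1`, returns the slope and intercept of the line. The sketch below recomputes the closed form inline on the same five training points so that it runs on its own; using `np.polyfit` as the reference is a choice made here, not part of the original case study.

```python
import numpy as np

x_train = np.array([4, 8, 5, 10, 12], dtype=float)
y_train = np.array([20, 50, 30, 70, 60], dtype=float)

# Closed-form slope and intercept from section 2.2.
m = len(x_train)
w = np.sum(y_train * (x_train - x_train.mean())) / (
    np.sum(x_train ** 2) - np.sum(x_train) ** 2 / m
)
b = np.mean(y_train - w * x_train)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

print(w, b)
print(slope, intercept)
```

Both lines should print roughly 5.714 and 1.429, that is $40/7$ and $10/7$.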

2.4 Multivariate Linear Regression Implementation

# Multivariate linear regression
import numpy as np

# Build the data: the first three columns are the independent variables X,
# the last column is the dependent variable Y
data = np.array([[3, 2, 9, 20],
                 [4, 10, 2, 72],
                 [3, 4, 9, 21],
                 [12, 3, 4, 20]])
# print("data:", data, "\n")

X = data[:, :-1]
Y = data[:, -1]

X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for the bias term

# print("X:", X, "\n")
# print("Y:", Y, "\n")

# Fit with the closed-form solution w = (X^T X)^{-1} X^T y
def fit(X, Y):
    w = np.linalg.inv(X.T @ X) @ X.T @ Y
    return w

def predict(X, w):
    X = np.asarray(X)
    X = np.c_[X, np.ones(X.shape[0])]  # append the column of ones
    y = X @ w
    return y

if __name__ == "__main__":
    w = fit(X, Y)
    y = predict([[60, 60, 60]], w)  # test on one new sample
    print(y)

2.5 Wrapping a Custom Linear Regression Model

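from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # California housing dataset
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd

# Hand-written linear regression class
class LinearRegression_():
    def __init__(self, w=None):
        self.w = w  # omega: the weights followed by the bias

    # Fit
    def fit(self, X, Y):
        X = np.asarray(X)
        X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for b
        Y = np.asarray(Y)
        self.w = np.linalg.inv(X.T @ X) @ X.T @ Y  # w = (X^T X)^{-1} X^T y
        # print(self.w)

    # Predict
    def predict(self, X):
        X = np.asarray(X)
        X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for b
        y = X @ self.w  # compute the predictions
        return y

if __name__ == "__main__":
    # Same data and split as in the sklearn example in section 2.6
    housevalue = fetch_california_housing()
    X = pd.DataFrame(data=housevalue.data, columns=housevalue.feature_names)
    Y = housevalue.target
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=420)

    clf = LinearRegression_()  # instantiate
    clf.fit(Xtrain, Ytrain)  # train
    y_pred = clf.predict(Xtest)  # predict
    print(r2_score(Ytest, y_pred))  # evaluate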

2.6 sklearn Implementation

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # California housing dataset
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd

housevalue = fetch_california_housing()  # load the data
X = pd.DataFrame(data=housevalue.data, columns=housevalue.feature_names)
Y = housevalue.target
# X.head()
# Y[:5]

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=420)  # split the dataset

lr = LinearRegression()  # instantiate
lr.fit(Xtrain, Ytrain)  # fit the model

y_pred = lr.predict(Xtest)  # predict

print('r2_score: %s' % r2_score(Ytest, y_pred))

# Inspect the model coefficients
print('Coefficients:', lr.coef_)
print('Intercept:', lr.intercept_)
print(list(zip(X.columns, lr.coef_)))
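
The examples above score the model with `r2_score`. For a single target, that metric is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on made-up numbers; the function name `r2` is chosen here for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
print(r2(y_true, y_true))                # perfect predictions -> 1.0
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean -> 0.0
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean, and negative values are possible for models that do worse than that.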

2.6.2 Parameters

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

fit_intercept : bool, default=True

Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

normalize : bool, default=False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False. Note that this parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions scale the data beforehand instead.

copy_X : bool, default=True

If True, X will be copied; else, it may be overwritten.

n_jobs : int, default=None

The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

positive : bool, default=False

When set to True, forces the coefficients to be positive. This option is only supported for dense arrays.
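
To make the effect of `fit_intercept` concrete, the sketch below fits the same made-up, noise-free data twice. The data and variable names are assumptions chosen for illustration; only `LinearRegression`, `fit_intercept`, `coef_` and `intercept_` come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noiseless data from y = 3*x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3 * X[:, 0] + 1

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print(with_intercept.coef_, with_intercept.intercept_)  # slope about 3, intercept about 1
print(no_intercept.coef_, no_intercept.intercept_)      # line forced through the origin
```

With `fit_intercept=False` the intercept is fixed at 0, so the slope has to absorb the offset and the fit is worse unless the data are already centered.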


For use as study notes only; will be removed on request in case of infringement.
