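Linear Regression
These notes draw mainly on Zhou Zhihua's "Machine Learning" (the Watermelon Book) and its companion derivation guide, the Pumpkin Book.
1. Overview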
Given an example $x = (x_1; x_2; ...; x_d)$ described by $d$ attributes, where $x_i$ is the value of $x$ on the $i$-th attribute, a linear model tries to learn a function that makes predictions through a linear combination of the attributes:
$$f(x) = \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_d x_d + b \tag{1}$$
In vector form:
$$f(x) = \omega^T x + b \tag{2}$$
where $\omega = (\omega_1; \omega_2; ...; \omega_d)$. Once $\omega$ and $b$ are determined, the model is fixed. Because $\omega$ directly expresses how much each attribute matters to the prediction, linear models are highly interpretable.
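As a small illustration of the vector form, the sketch below evaluates $f(x) = \omega^T x + b$ with NumPy. The function name `linear_model` and the weights, bias, and sample values are made up for illustration only.

```python
import numpy as np

def linear_model(x, w, b):
    """Evaluate f(x) = w^T x + b for a single example x."""
    return float(np.dot(w, x) + b)

# Hypothetical values, chosen only for illustration
w = np.array([0.2, 0.5, 0.3])
b = 1.0
x = np.array([1.0, 2.0, 3.0])

print(linear_model(x, w, b))
```

A larger entry of `w` means the corresponding attribute moves the prediction more per unit change, which is the sense in which the weights are interpretable.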
2. Linear Regression
2.1 Basic model
Given a dataset $D = \{(x_1,y_1), (x_2,y_2), ..., (x_m,y_m)\}$, where $x_i = (x_{i1}; x_{i2}; ...; x_{id})$ and $y_i \in \mathbb{R}$.
In the simplest case, with a single input attribute, linear regression tries to learn
$$f(x_i) = \omega x_i + b, \quad \text{such that } f(x_i) \simeq y_i \tag{3}$$
2.2 Univariate objective / loss function
The difference between $f(x)$ and $y$ is measured by the mean squared error, which corresponds to the Euclidean distance. Solving for the model by minimizing it is the least squares method:
$$(\omega^*, b^*) = \mathop{\arg\min}\limits_{(\omega, b)} \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \mathop{\arg\min}\limits_{(\omega, b)} \sum_{i=1}^{m} (y_i - \omega x_i - b)^2 \tag{4}$$
Let $E(\omega,b) = \sum_{i=1}^{m} (y_i - \omega x_i - b)^2$. Taking the partial derivatives with respect to $\omega$ and $b$ gives
$$\frac{\partial E(\omega,b)}{\partial \omega} = 2\left(\omega \sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b)\,x_i\right) \tag{5}$$
$$\frac{\partial E(\omega,b)}{\partial b} = 2\left(mb - \sum_{i=1}^{m} (y_i - \omega x_i)\right) \tag{6}$$
The derivation follows the Pumpkin Book.
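One way to sanity-check the two partial derivatives of $E(\omega, b)$ is to compare them with central finite differences. The sketch below does this on the toy data used in the implementation section; the function names `sse` and `sse_grad` and the test point are assumptions made for illustration.

```python
import numpy as np

def sse(w, b, x, y):
    """Squared-error objective E(w, b) for one input attribute."""
    return np.sum((y - w * x - b) ** 2)

def sse_grad(w, b, x, y):
    """Analytic partial derivatives of E with respect to w and b."""
    m = len(x)
    dw = 2 * (w * np.sum(x ** 2) - np.sum((y - b) * x))
    db = 2 * (m * b - np.sum(y - w * x))
    return dw, db

x = np.array([4.0, 8.0, 5.0, 10.0, 12.0])
y = np.array([20.0, 50.0, 30.0, 70.0, 60.0])
w0, b0, eps = 1.5, -2.0, 1e-6  # arbitrary test point and step size

# Central finite differences approximate the same derivatives numerically
num_dw = (sse(w0 + eps, b0, x, y) - sse(w0 - eps, b0, x, y)) / (2 * eps)
num_db = (sse(w0, b0 + eps, x, y) - sse(w0, b0 - eps, x, y)) / (2 * eps)

print(sse_grad(w0, b0, x, y))
print((num_dw, num_db))
```

The two printed pairs should agree to several significant digits, since $E$ is quadratic in both parameters.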
Setting equations (5) and (6) to zero yields the closed-form solution for the optimal $\omega$ and $b$:
$$\omega = \frac{\sum_{i=1}^{m} y_i (x_i - \bar x)}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2} \tag{7}$$
$$b = \frac{1}{m} \sum_{i=1}^{m} (y_i - \omega x_i) \tag{8}$$
where $\bar x = \frac{1}{m}\sum_{i=1}^{m} x_i$ is the mean of $x$.
The derivation follows the Pumpkin Book.
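The step from the zero-gradient conditions to the closed form can be sketched as follows. Here $\bar y = \frac{1}{m}\sum_{i=1}^{m} y_i$ is notation introduced only for this sketch, and the last step uses $\bar y \sum_i x_i = \bar x \sum_i y_i$ and $\bar x \sum_i x_i = \frac{1}{m}(\sum_i x_i)^2$.

```latex
\begin{align*}
\frac{\partial E}{\partial b} = 0
  &\;\Rightarrow\; b = \frac{1}{m}\sum_{i=1}^{m}(y_i - \omega x_i) = \bar y - \omega \bar x \\
\frac{\partial E}{\partial \omega} = 0
  &\;\Rightarrow\; \omega \sum_{i=1}^{m} x_i^2 = \sum_{i=1}^{m} (y_i - b)\,x_i
   = \sum_{i=1}^{m} y_i x_i - (\bar y - \omega \bar x)\sum_{i=1}^{m} x_i \\
  &\;\Rightarrow\; \omega\left(\sum_{i=1}^{m} x_i^2 - \bar x \sum_{i=1}^{m} x_i\right)
   = \sum_{i=1}^{m} y_i x_i - \bar y \sum_{i=1}^{m} x_i \\
  &\;\Rightarrow\; \omega = \frac{\sum_{i=1}^{m} y_i (x_i - \bar x)}
   {\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}
\end{align*}
```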
2.3 Multivariate linear regression
Absorb $b$ into the parameter vector, so that $\omega$ now denotes $(\omega; b)$ with $d+1$ entries, and represent the dataset $D$ as an $m \times (d+1)$ matrix $X$. Each row corresponds to one example: the first $d$ elements are its $d$ attribute values and the last element is always 1:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1\\ x_{21} & x_{22} & \cdots & x_{2d} & 1\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{bmatrix} = \begin{bmatrix} x_{1}^T & 1\\ x_{2}^T & 1\\ \vdots & \vdots\\ x_{m}^T & 1 \end{bmatrix} \tag{9}$$
Writing the labels in vector form as well, $y = (y_1; y_2; ...; y_m)$, the optimization objective becomes
$$\omega^* = \mathop{\arg\min}\limits_{\omega} (y - X\omega)^T (y - X\omega) \tag{10}$$
As before, estimating $\omega$ by least squares gives, provided $X^TX$ is invertible,
$$\omega = (X^TX)^{-1}X^Ty \tag{11}$$
The derivation follows the Pumpkin Book.
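A short sketch of how the closed form arises from the matrix objective, writing $E(\omega)$ as shorthand (introduced here) for the quantity being minimized, and assuming $X^TX$ is invertible in the last step:

```latex
\begin{align*}
E(\omega) &= (y - X\omega)^T (y - X\omega)
           = y^T y - 2\,\omega^T X^T y + \omega^T X^T X \omega \\
\frac{\partial E}{\partial \omega} &= 2\,X^T X \omega - 2\,X^T y = 2\,X^T (X\omega - y) \\
\frac{\partial E}{\partial \omega} = 0 &\;\Rightarrow\; X^T X\,\omega = X^T y
  \;\Rightarrow\; \omega = (X^T X)^{-1} X^T y
\end{align*}
```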
2.3 Univariate linear regression implementation
'''
Adapted from a Huawei Cloud AI training camp example.
'''
import numpy as np
import matplotlib.pyplot as plt

# In a Jupyter notebook, add `%matplotlib inline` here to render plots inline.
# Optional: load a local font file if plot labels need Chinese glyphs
# from matplotlib.font_manager import FontProperties
# font_set = FontProperties(fname=r"./work/ simsun.ttc", size=12)

# Training data
x_train = np.array([4, 8, 5, 10, 12], dtype=float)
y_train = np.array([20, 50, 30, 70, 60], dtype=float)

# Scatter plot of the training points
def draw(x_train, y_train):
    plt.scatter(x_train, y_train)

# Closed-form fit of univariate linear regression
def fit(x_train, y_train):
    x = np.asarray(x_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    m = len(x)
    numerator = np.sum(y * (x - np.mean(x)))                 # numerator of equation (7)
    denominator = np.sum(np.square(x)) - np.sum(x) ** 2 / m  # denominator of equation (7)
    w = numerator / denominator
    b = np.sum(y - w * x) / m                                # equation (8)
    return w, b

# Prediction function
def predict(w, b, x):
    return np.multiply(w, x) + b

# Evaluate the fitted line on a grid of test inputs and plot it
def fit_test(w, b):
    x = np.linspace(4, 15, 9)  # 9 evenly spaced points between 4 and 15
    y = predict(w, b, x)
    plt.plot(x, y)
    plt.show()

if __name__ == "__main__":
    draw(x_train, y_train)
    w, b = fit(x_train, y_train)
    print(w, b)     # slope and intercept
    fit_test(w, b)  # plot the fitted line
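A quick way to check a closed-form fit like this one is to compare it with `np.polyfit`, which fits a least-squares polynomial (degree 1 here). The sketch below reuses the same toy data and writes the intercept as $\bar y - \omega \bar x$, which is algebraically the same as equation (8).

```python
import numpy as np

x = np.array([4, 8, 5, 10, 12], dtype=float)
y = np.array([20, 50, 30, 70, 60], dtype=float)

# Closed-form slope and intercept, written with sample means
w = np.sum(y * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / len(x))
b = y.mean() - w * x.mean()

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(x, y, deg=1)

print(w, b)              # approximately 5.714 and 1.429 (40/7 and 10/7)
print(slope, intercept)
```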
2.4 Multivariate linear regression implementation
# Multivariate linear regression
import numpy as np

# Toy data: the first three columns are the features X, the last column is the target Y
data = np.array([[3, 2, 9, 20],
                 [4, 10, 2, 72],
                 [3, 4, 9, 21],
                 [12, 3, 4, 20]], dtype=float)
X = data[:, :-1]
Y = data[:, -1]
X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for the bias term, equation (9)

# Closed-form least-squares fit
def fit(X, Y):
    w = np.linalg.inv(X.T @ X) @ X.T @ Y  # equation (11)
    return w

# Predict for raw feature rows (the bias column is appended here)
def predict(X, w):
    X = np.asarray(X, dtype=float)
    X = np.c_[X, np.ones(X.shape[0])]
    return X @ w

if __name__ == "__main__":
    w = fit(X, Y)
    y = predict([[60, 60, 60]], w)  # predict for one new sample
    print(w, y)
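Forming an explicit inverse matches equation (11) directly, but it fails when $X^TX$ is singular and can lose precision when it is ill-conditioned. A common alternative, shown here as a sketch rather than as the author's method, is `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse. The data is the toy array from this section.

```python
import numpy as np

data = np.array([[3, 2, 9, 20],
                 [4, 10, 2, 72],
                 [3, 4, 9, 21],
                 [12, 3, 4, 20]], dtype=float)
X = np.c_[data[:, :-1], np.ones(len(data))]  # features plus bias column
y = data[:, -1]

# Normal-equation solution with an explicit inverse
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; returns a minimum-norm solution if X is rank deficient
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(w_inv)
print(w_lstsq)
```

On this small, full-rank example both approaches give the same parameters up to rounding.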
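2.5 Wrapping a custom linear regression model
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # California housing dataset
from sklearn.metrics import r2_score
import numpy as np

# Hand-written linear regression class
class LinearRegression_():
    def __init__(self, w=None):
        self.w = w  # omega; after fitting, the last entry is the bias b

    # Fit with the closed-form least-squares solution
    def fit(self, X, Y):
        X = np.asarray(X, dtype=float)
        X = np.c_[X, np.ones(X.shape[0])]  # equation (9): append a column of ones for b
        Y = np.asarray(Y, dtype=float)     # convert the targets to an array
        self.w = np.linalg.inv(X.T @ X) @ X.T @ Y  # equation (11)
        return self

    # Predict
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        X = np.c_[X, np.ones(X.shape[0])]  # append the column of ones for b
        return X @ self.w

if __name__ == "__main__":
    # Load the California housing data and hold out 30% for testing
    housevalue = fetch_california_housing()
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(
        housevalue.data, housevalue.target, test_size=0.3, random_state=420)
    clf = LinearRegression_()         # instantiate
    clf.fit(Xtrain, Ytrain)           # train
    y_pred = clf.predict(Xtest)       # predict
    print(r2_score(Ytest, y_pred))    # evaluate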
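The evaluation above relies on `r2_score`. For a single target, the coefficient of determination is $R^2 = 1 - SS_{res}/SS_{tot}$, and a plain NumPy sketch of it looks like this. The helper name `r2` and the numbers are made up for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]   # made-up targets with mean 6.0
print(r2(y_true, y_true))       # perfect predictions give 1.0
print(r2(y_true, [6.0] * 4))    # always predicting the mean gives 0.0
```

A score of 1 means the predictions match the targets exactly, while 0 means the model does no better than predicting the mean.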
2.6 sklearn implementation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # California housing dataset
from sklearn.metrics import r2_score
import pandas as pd

housevalue = fetch_california_housing()  # load the data
X = pd.DataFrame(data=housevalue.data, columns=housevalue.feature_names)
Y = housevalue.target
# X.head()
# Y[:5]
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=420)  # split the dataset
lr = LinearRegression()     # instantiate
lr.fit(Xtrain, Ytrain)      # fit the model
y_pred = lr.predict(Xtest)  # predict
print('r2_score: %s' % r2_score(Ytest, y_pred))
# Inspect the model coefficients
print('Coefficients:', lr.coef_)
print('Intercept:', lr.intercept_)
print(list(zip(X.columns, lr.coef_)))
2.6.2 Parameters
class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)
The signature above comes from a scikit-learn release in which normalize was already deprecated. The parameter was removed in version 1.2, so on current releases scale the features beforehand (for example with StandardScaler) instead of passing normalize.
fit_intercept : bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False.
copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten.
n_jobs : int, default=None
The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
positive : bool, default=False
When set to True, forces the coefficients to be positive. This option is only supported for dense arrays.
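To see what `fit_intercept` and `positive` do in practice, the sketch below fits three variants on synthetic, noise-free data. The data, the true weights, and the variable names are assumptions made for illustration, and `normalize` is left out because it no longer exists in recent scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic targets: known weights, an intercept of 4.0, and no noise
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0

default = LinearRegression().fit(X, y)                          # fits an intercept
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)  # model forced through the origin
non_negative = LinearRegression(positive=True).fit(X, y)        # coefficients constrained to be >= 0

print(default.coef_, default.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)
print(non_negative.coef_)
```

With the default settings the true weights and intercept are recovered. The other two variants fit a more restricted model, so their coefficients differ from the true ones.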
These are personal study notes only. Any infringing material will be removed on request.