线性回归与应用
1 线性回归
1.1 简介
1886年,查尔斯·达尔文的表弟弗朗西斯·高尔顿在研究动植物哪些特征可以遗传时,创造了“回归”一词。虽然高尔顿发明了回归一词,但将回归置于稳固数学基础之上的是卡尔·皮尔逊。1898年,皮尔逊在论文中提出了线性回归方法。自该论文出版后,许多学者扩展了该理论。特别是,Joseph Berkson 在1944年提出了逻辑回归模型,并用该模型分析医学研究中的二分类问题,这是最早的分类算法之一。
1.2 基础知识
本节的矩阵微分公式均采用分母布局。下面给出最小二乘法推导中会用到的几个矩阵微分结论。
1. 已知 $\mathbf{a}^T=\begin{pmatrix} a_1, a_2, \cdots, a_n\end{pmatrix}$,$\mathbf{x}^T=\begin{pmatrix}x_1, x_2, \cdots, x_n\end{pmatrix}$,求 $\frac{\partial{\mathbf{a}^T\mathbf{x}}}{\partial{\mathbf{x}}}$。
解:
$$\begin{aligned} \frac{\partial{\mathbf{a}^T\mathbf{x}}}{\partial{\mathbf{x}}} &= \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{\mathbf{x}}} \\ &= \begin{pmatrix} \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{x_1}} \\ \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{x_2}} \\ \vdots \\ \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{x_n}} \end{pmatrix} \\ &= \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} \\ &= \mathbf{a} \end{aligned}$$
同理,$\frac{\partial{\mathbf{x}^T\mathbf{a}}}{\partial{\mathbf{x}}} = \mathbf{a}$。
2. 已知 $A=\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn}\end{pmatrix}$,$\mathbf{x}^T=\begin{pmatrix}x_1, x_2, \cdots, x_n\end{pmatrix}$,求 $\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}}$。
解:
$$\begin{aligned} \mathbf{x}^TA &= \begin{pmatrix} a_{11}x_1 + a_{21}x_2 + \cdots + a_{n1}x_n, & a_{12}x_1 + a_{22}x_2 + \cdots + a_{n2}x_n, & \cdots, & a_{1n}x_1 + a_{2n}x_2 + \cdots + a_{nn}x_n \end{pmatrix} \\ &= \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i}, & \sum_{i=1}^n{a_{i2}x_i}, & \cdots, & \sum_{i=1}^n{a_{in}x_i} \end{pmatrix} \end{aligned}$$
那么
$$\begin{aligned} \mathbf{x}^TA\mathbf{x} &= \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i}, & \sum_{i=1}^n{a_{i2}x_i}, & \cdots, & \sum_{i=1}^n{a_{in}x_i} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \\ &= x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i} \end{aligned}$$
于是
$$\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = \begin{pmatrix} \frac{\partial{(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i})}}{\partial{x_1}} \\ \frac{\partial{(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i})}}{\partial{x_2}} \\ \vdots \\ \frac{\partial{(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i})}}{\partial{x_n}} \end{pmatrix}$$
我们对其中的一个元素求导:
$$\begin{aligned} \frac{\partial{(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i})}}{\partial{x_1}} &= \sum_{i=1}^n{a_{i1}x_i} + a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\ &= \sum_{i=1}^n{a_{i1}x_i} + \sum_{j=1}^n{a_{1j}x_j} \end{aligned}$$
那么,对每个分量按同样方式求导,上面的列向量可化简为:
$$\begin{aligned} \frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} & = \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i} + \sum_{j=1}^n{a_{1j}x_j} \\ \sum_{i=1}^n{a_{i2}x_i} + \sum_{j=1}^n{a_{2j}x_j} \\ \vdots \\ \sum_{i=1}^n{a_{in}x_i} + \sum_{j=1}^n{a_{nj}x_j} \end{pmatrix} \\ & = \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i} \\ \sum_{i=1}^n{a_{i2}x_i} \\ \vdots \\ \sum_{i=1}^n{a_{in}x_i} \end{pmatrix} + \begin{pmatrix} \sum_{j=1}^n{a_{1j}x_j} \\ \sum_{j=1}^n{a_{2j}x_j} \\ \vdots \\ \sum_{j=1}^n{a_{nj}x_j} \end{pmatrix} \end{aligned}$$
注意到第一个列向量就是 $A^T\mathbf{x}$,第二个列向量就是 $A\mathbf{x}$,于是上式可化简为:
$$\begin{aligned} \frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} &= A^T\mathbf{x} + A\mathbf{x} \\ &= (A^T + A)\mathbf{x} \end{aligned}$$
如果 $A$ 是对称矩阵,则
$$\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = 2A\mathbf{x}$$
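为了直观检验上面两个矩阵微分结论,下面给出一段简短的数值验证示意代码(非原文内容,其中的 numerical_grad 等名称均为本文示例假设):用中心差分近似梯度,再与解析结果比较。

import numpy as np

# 数值验证(分母布局下)两个结论:d(a^T x)/dx = a,d(x^T A x)/dx = (A^T + A)x。
rng = np.random.default_rng(0)
n = 4
a = rng.random(n)
x = rng.random(n)
A = rng.random((n, n))
eps = 1e-6

def numerical_grad(f, x0):
    """对标量函数 f 在 x0 处做中心差分,返回与 x0 同形状的近似梯度。"""
    g = np.zeros_like(x0)
    for i in range(x0.size):
        d = np.zeros_like(x0)
        d[i] = eps
        g[i] = (f(x0 + d) - f(x0 - d)) / (2 * eps)
    return g

print(np.allclose(numerical_grad(lambda v: a @ v, x), a))                  # 结论 1,输出 True
print(np.allclose(numerical_grad(lambda v: v @ A @ v, x), (A.T + A) @ x))  # 结论 2,输出 True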
1.3 一元线性回归
1.3.1 模型
一般情况下,一元线性回归模型的假设函数为:
$$y = wx + b$$
其中,$w \in R$,$b \in R$ 为模型参数,也称为回归系数。当参数 $w$ 和 $b$ 确定后,我们可以使用该模型对新的样本数据进行预测。
1)如何确定模型参数(回归系数)?
答:通过历史数据确定模型参数,然后使用该模型预测新的数据。
2)怎样通过历史数据确定模型参数?
答:利用最小二乘法去拟合模型参数。
1.3.2 最小二乘法
最小二乘法(Least Squares, LS)是一种常用的数据拟合方法,其原理是寻找一条直线,使得各数据点到该直线的纵向偏差(残差)的平方和最小。假设训练集 $D=\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$ 有 $n$ 个样本,那么均方误差损失函数为:
$$\begin{aligned} L(w, b) &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\ &= \frac{1}{n}\sum_{i=1}^{n}(y_i - (wx_i + b))^2 \end{aligned}$$
其中,$\hat{y}_i$ 为模型在第 $i$ 个训练样本上的输出;$y_i$ 为训练集的真实值;$y_i - \hat{y}_i$ 为残差;$(y_i - \hat{y}_i)^2$ 为平方误差,对其在全体样本上取平均即得均方误差。当均方误差损失函数最小时,通过训练集拟合出来的模型最好。
(1) 如何求均方误差损失函数的最小值?
答:根据最优化理论可知,当优化目标为凸函数时,函数的局部最小值就是全局最小值。
均方误差损失函数是关于 $w$ 和 $b$ 的凸函数,因此只要令其偏导数为零、求出极小值点,就得到了函数的最小值。求导过程如下:
$$\begin{aligned} \frac{\partial L(w,b)}{\partial w} &= \frac{1}{n}\sum_{i=1}^{n}2\times(y_i - (wx_i + b))(-x_i) \\ &= \frac{2}{n}\sum_{i=1}^{n}((wx_i + b)x_i - x_iy_i) \\ &= 0 \end{aligned}$$
$$\begin{aligned} \frac{\partial L(w,b)}{\partial b} &= \frac{1}{n}\sum_{i=1}^{n}2\times(y_i - (wx_i + b))\times(-1) \\ &= \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) \\ &= 0 \end{aligned}$$
由 $\frac{\partial L(w,b)}{\partial b} = 0$ 可知,
$$\begin{aligned} \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i) &= 0 \\ \frac{nb}{n} + \frac{1}{n}\sum_{i=1}^{n}(wx_i - y_i) &= 0 \\ b &= \frac{1}{n}\sum_{i=1}^{n}(y_i - wx_i) \\ b &= \frac{1}{n}\sum_{i=1}^{n}y_i - \frac{w}{n}\sum_{i=1}^{n}x_i \end{aligned}$$
即,
$$b = \bar{y} - w\bar{x}$$
其中,$\bar{y}=\frac{1}{n}\sum_{i=1}^n{y_i}$,$\bar{x}=\frac{1}{n}\sum_{i=1}^n{x_i}$。
然后,将 $b$ 代入 $\frac{\partial L(w,b)}{\partial w} = 0$ 可得,
$$\begin{aligned} \frac{2}{n}\sum_{i=1}^{n}((wx_i + b)x_i - x_iy_i) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}((wx_i + b)x_i - x_iy_i) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}((wx_i + \bar{y} - w\bar{x})x_i - x_iy_i) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}(wx_ix_i + \bar{y}x_i - w\bar{x}x_i - x_iy_i) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}(wx_ix_i - w\bar{x}x_i + \bar{y}x_i - x_iy_i) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}(wx_i(x_i - \bar{x}) - x_i(y_i - \bar{y})) &= 0 \\ \frac{1}{n}\sum_{i=1}^{n}wx_i(x_i - \bar{x}) - \frac{1}{n}\sum_{i=1}^{n}x_i(y_i - \bar{y}) &= 0 \\ \frac{w}{n}\sum_{i=1}^{n}x_i(x_i - \bar{x}) - \frac{1}{n}\sum_{i=1}^{n}x_i(y_i - \bar{y}) &= 0 \\ w\sum_{i=1}^{n}x_i(x_i - \bar{x}) &= \sum_{i=1}^{n}x_i(y_i - \bar{y}) \\ w &= \frac{\sum_{i=1}^{n}x_i(y_i - \bar{y})}{\sum_{i=1}^{n}x_i(x_i - \bar{x})} \\ w &= \frac{\sum_{i=1}^{n}x_iy_i - \sum_{i=1}^{n}x_i\bar{y}}{\sum_{i=1}^{n}x_i^2 - \sum_{i=1}^{n}x_i\bar{x}} \\ w &= \frac{\sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i} \end{aligned}$$
即,
$$w = \frac{\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}$$
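下面用一段简短的 Python 示意代码(非原文代码,数据为虚构的示例)按上面的闭式解直接计算 $w$ 和 $b$:

import numpy as np

# 按一元线性回归的闭式解计算 w 和 b(合成数据,仅作演示)。
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # 大致满足 y ≈ 2x

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
w = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
b = y_bar - w * x_bar
print(w, b)   # w 约为 1.96,b 约为 0.14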
1.4 多元线性回归
1.4.1 模型
多元线性回归是一种统计学习方法,用于建立多个自变量和一个因变量之间的线性关系模型。与简单线性回归不同,多元线性回归可以考虑多个自变量对因变量的影响,并估计每个自变量对因变量的贡献。一般情况下,多元回归模型的假设函数为:
$$\mathbf{y} = X\mathbf{w} + b$$
其中,$\mathbf{w}=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix}$ 和 $b$ 为模型参数,也称回归系数。为了方便计算,通常将 $b$ 并入权向量 $\mathbf{w}$,则模型可以简化为:
$$\mathbf{y}_{n\times1} = X_{n\times{(d+1)}}\mathbf{w}_{(d+1)\times{1}}$$
其中,
$$\mathbf{w}=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \\ b\end{pmatrix},\qquad X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} & 1 \end{pmatrix}$$
其中,$n$ 表示实例数目,$d$ 表示数据维数。
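为直观起见,下面用一小段示意代码(非原文代码,数据为虚构示例)演示如何在特征矩阵右侧追加一列 1,得到 $n\times(d+1)$ 的设计矩阵,从而把偏置并入权向量:

import numpy as np

# 构造 n×(d+1) 的设计矩阵:在原始特征右侧追加一列 1(示意数据,n=3,d=2)。
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
X_design = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])
print(X_design.shape)   # (3, 3),即 n×(d+1)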
1.4.2 最小二乘法
多元线性回归模型的求解方法和一元线性回归模型求解方法类似。均方误差损失函数为:
$$\begin{aligned} L(\mathbf{w}) &= \frac{1}{n}(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) \\ &= \frac{1}{n}(\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w}) \\ &= \frac{1}{n}(\mathbf{y}^T - \mathbf{w}^TX^T)(\mathbf{y} - X\mathbf{w}) \\ &= \frac{1}{n}(\mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\mathbf{w} - \mathbf{w}^TX^T\mathbf{y} + \mathbf{w}^TX^TX\mathbf{w}) \\ &= \frac{1}{n}(\mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\mathbf{w} - (\mathbf{y}^TX\mathbf{w})^T + \mathbf{w}^TX^TX\mathbf{w}) \end{aligned}$$
因为 $\mathbf{y}^TX\mathbf{w}$ 是一个标量,所以 $\mathbf{y}^TX\mathbf{w} = (\mathbf{y}^TX\mathbf{w})^T$。
因此,上式可化简为
$$L(\mathbf{w}) = \frac{1}{n}(\mathbf{y}^T\mathbf{y} - 2\mathbf{y}^TX\mathbf{w} + \mathbf{w}^TX^TX\mathbf{w})$$
对上式关于 $\mathbf{w}$ 求导,即
$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{n}\left(\frac{\partial{\mathbf{y}^T\mathbf{y}}}{\partial\mathbf{w}} - 2\frac{\partial{\mathbf{y}^TX\mathbf{w}}}{\partial{\mathbf{w}}} + \frac{\partial{\mathbf{w}^TX^TX\mathbf{w}}}{\partial{\mathbf{w}}}\right)$$
由于向量 $\mathbf{y}$ 与向量 $\mathbf{w}$ 无关,$\frac{\partial{\mathbf{y}^T\mathbf{y}}}{\partial\mathbf{w}} = \mathbf{0}$,所以
$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{n}\left(\frac{\partial{\mathbf{w}^TX^TX\mathbf{w}}}{\partial{\mathbf{w}}} - 2\frac{\partial{\mathbf{y}^TX\mathbf{w}}}{\partial{\mathbf{w}}}\right)$$
由于 $X^TX$ 是对称矩阵,根据 1.2 节的两个矩阵微分结论($\frac{\partial{\mathbf{a}^T\mathbf{x}}}{\partial{\mathbf{x}}} = \mathbf{a}$,以及对称矩阵情形下 $\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = 2A\mathbf{x}$)可知,
$$\begin{aligned} \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} &= \frac{2}{n}(X^TX\mathbf{w} - X^T\mathbf{y}) \\ &= 0 \end{aligned}$$
即,
$$X^TX\mathbf{w} = X^T\mathbf{y}$$
如果 $X^TX$ 的逆矩阵存在,那么
$$\mathbf{w} = (X^TX)^{-1}X^T\mathbf{y}$$
如果逆矩阵不存在,那么可以使用梯度下降法求解,即
$$\mathbf{w} := \mathbf{w} - \frac{2\eta}{n}(X^TX\mathbf{w} - X^T\mathbf{y})$$
注意:如果逆矩阵不存在,可以考虑使用迭代的思想(例如,梯度下降法)求解最优值。梯度下降法是一种常用的无约束优化方法,在数据科学和机器学习领域中广泛应用。与直接求逆不同,梯度下降法不需要计算逆矩阵,因此具有更好的数值稳定性。
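下面给出一段简短的示意代码(非原文代码,数据与变量名均为示例假设),在合成数据上分别用正规方程和上式的梯度下降更新求解 $\mathbf{w}$:

import numpy as np

# 在合成数据上演示正规方程解与梯度下降解(仅作示意)。
rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # 追加一列 1,把偏置并入权向量
w_true = np.array([1.5, -2.0, 0.7, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=n)

# 正规方程:等价于 w = (X^T X)^{-1} X^T y,数值上用 solve 求解线性方程组更稳妥
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# 梯度下降:w := w - (2*eta/n) * (X^T X w - X^T y)
w_gd = np.zeros(d + 1)
eta = 0.05
for _ in range(5000):
    w_gd -= (2 * eta / n) * (X.T @ X @ w_gd - X.T @ y)

print(w_ne)   # 两者都应接近 w_true
print(w_gd)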
1.5 代码
https://www.kaggle.com/code/wenshulee/linear-regression
1.5.1 线性回归
import numpy as np


class LinearRegression:
    def __init__(self, n_iter=1000, eta=1e-3, tol=0.001):
        self.n_iter = n_iter  # the number of iterations.
        self.eta = eta  # learning rate.
        self.tol = tol  # tolerance.
        self.weight = None  # weight.
        self.y_hat = None  # predictions on the training set.
        self.residual_square = []  # loss history.

    @staticmethod
    def preprocessing_data(X):
        """ Preprocess data: append a column of ones for the bias term.

        Parameters
        ----------
        X: {ndarray or DataFrame}
            input data.
        """
        one_column = np.ones((len(X), 1))
        X = np.hstack((X, one_column))
        return X

    @staticmethod
    def loss(y, y_pred):
        """ Calculate the mean squared error loss.

        Parameters
        ----------
        y: ndarray
            ground truth.
        y_pred: ndarray
            predicted values.
        """
        return np.sum((y - y_pred) ** 2) / y.size

    def gradient_decent(self, X, y):
        """ Gradient descent.

        Parameters
        ----------
        X: ndarray
            training independent variables.
        y: ndarray
            training dependent variables.
        """
        loss_old = np.inf  # initialize loss_old.
        for i in range(self.n_iter):
            self.y_hat = self.__predict(X)  # predict y.
            loss = self.loss(y, self.y_hat)  # calculate loss.
            self.residual_square.append(loss)
            print(f'==============={self.n_iter}\\{i}===============')
            print(f'Loss: {loss:.4f}')
            print(f'Tolerance: {self.tol}')
            print(f'Loss difference: {(loss_old - loss):.4f}')
            if self.tol is not None and loss_old - loss < self.tol:
                break
            loss_old = loss
            gradient = X.T @ (self.y_hat - y) / y.size
            self.weight -= self.eta * gradient

    def fit(self, train_X, train_y, random_state):
        """ Train the model.

        Parameters
        ----------
        train_X: ndarray
            training features.
        train_y: ndarray
            training target.
        random_state: int
            random seed.
        """
        np.random.seed(random_state)
        train_X = self.preprocessing_data(train_X)
        self.weight = np.random.random((train_X.shape[1], 1))
        self.gradient_decent(train_X, train_y)

    def __predict(self, X):
        """ Predict on preprocessed features (with the bias column).

        Parameters
        ----------
        X: ndarray
            preprocessed features.
        """
        return X @ self.weight

    def predict(self, X):
        """ Predict on raw features.

        Parameters
        ----------
        X: {ndarray or DataFrame}
            training features or testing features.
        """
        X = self.preprocessing_data(X)
        return X @ self.weight
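下面是一个简短的使用示例(非原文内容,数据为虚构的合成数据),用于检查上面实现的 LinearRegression 是否能收敛到合理的参数:

import numpy as np

# 在合成数据上训练上面的 LinearRegression(仅作示意,训练过程会逐步打印损失)。
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
w_demo = np.array([[1.0], [-2.0], [0.5]])
y_demo = X_demo @ w_demo + 3.0 + 0.05 * rng.normal(size=(200, 1))

lr_demo = LinearRegression(n_iter=3000, eta=0.05, tol=1e-8)
lr_demo.fit(X_demo, y_demo, random_state=0)
print(lr_demo.weight.ravel())   # 前三个分量应接近 [1.0, -2.0, 0.5],最后一个分量应接近偏置 3.0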
1.5.2 训练集和测试集划分
def split_train_test(data, train_ratio, random_state, **kargs):
    """ Split the dataset into a train dataset and a test dataset.

    Parameters
    ----------
    data: {np.ndarray or pd.DataFrame}
        input data.
    train_ratio: float
        train ratio.
    random_state: int
        random seed.
    """
    data = data if isinstance(data, np.ndarray) else np.array(data)
    np.random.seed(random_state)
    index_list = np.random.permutation(len(data))
    train_size = int(len(data) * train_ratio)
    train_data = data[index_list[:train_size]]
    X_train = train_data[:, :-1]
    y_train = train_data[:, -1].reshape((train_size, 1))
    test_data = data[index_list[train_size:]]
    X_test = test_data[:, :-1]
    y_test = test_data[:, -1].reshape((len(data) - train_size, 1))
    return X_train, y_train, X_test, y_test
1.5.3 数据归一化
(1) 最大值-最小值归一化
$$\hat{X} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
def min_max_scaler(data):
    data = data if isinstance(data, np.ndarray) else np.array(data)
    max_columns = np.max(data[:, :-1], axis=0)
    min_columns = np.min(data[:, :-1], axis=0)
    data[:, :-1] = (data[:, :-1] - min_columns) / (max_columns - min_columns)
    return data
(2) 标准化
$$\hat{X} = \frac{X - \mu}{\sigma}$$
def standard_scaler(data):
    data = data if isinstance(data, np.ndarray) else np.array(data)
    average_cols = np.mean(data[:, :-1], axis=0)
    std_cols = np.std(data[:, :-1], axis=0)
    data[:, :-1] = (data[:, :-1] - average_cols) / std_cols
    return data
1.6 实战
(1) 下载数据集
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
- winequality-red.csv
- winequality.names
(2) 导入所需要的包
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt
(3) 加载红酒数据集
path = r"/kaggle/input/wine-quality/wine_quality/winequality-red.csv"
data = pd.read_csv(path, sep=';')
data
(4) 查看数据数目,特征数目和数据类型
- 查看数据数目主要是想了解数据集的大小,查看特征数目是想了解数据维度,查看数据类型是想通过数据类型选择模型(回归模型,分类模型);
- 数据量为1599,特征数目为11,quality是整数类型,根据业务场景,最后想预测红酒的质量评分,这种评分可以比较大小,所以做回归模型比较好。
data.info()
(5) 检查有没有重复数据和缺失数据
- 通过观察发现重复数据有240条;
- 删除重复数据;
- 通过观察发现缺失数据为 0,无需处理。
num_duplicated_data = data.duplicated().sum()
num_duplicated_data
data_duplicated = data[data.duplicated()]
data_duplicated # duplicated records.
# remove duplicated data.
data = data.drop_duplicates()
data
# check missing records.
data.isnull().sum().sum()
(6) 异常值处理
- 检查异常值数目;
- 删除异常值。
# True indicates the presence of an outlier.
Q1 = data.quantile(0.25)  # lower quartile.
Q3 = data.quantile(0.75)  # upper quartile.
IQR = Q3 - Q1  # interquartile range.
((data < (Q1 - 1.5*IQR)) | (data > (Q3 + 1.5*IQR))).any(axis=1).sum() # the number of outlier records.
# remove outlier data.
data = data[~(((data < (Q1 - 1.5*IQR)) | (data > (Q3 + 1.5*IQR))).any(axis=1))]
data
(7) 相关性分析
- 由于皮尔逊相关系数只适用于线性关系,所以先画散点图矩阵,观察这种关系;
- 观察散点图矩阵,找出与目标存在线性关系的特征,或特征与特征存在线性关系的这些特征;
a. 通过下图可以看出,特征与目标之间的关系无法看出线性关系;
b. fixed acidity和density呈正相关;fixed acidity和pH呈负相关。在模型建立好之后,去除共线性问题时,再考虑。 - 计算相关性矩阵,并画热图;
- 根据结论,筛选特征。
# create a scatter plot matrix.
pairplot = sns.pairplot(data)
# iterate over the flattened axes of the pairplot.
for ax in pairplot.axes.flatten():
    ax.set_xlabel(ax.get_xlabel(), fontsize=14)  # set the x label font size to 14.
    ax.set_ylabel(ax.get_ylabel(), fontsize=14)  # set the y label font size to 14.
    ax.tick_params(axis='both', labelsize=10)    # set the tick label font size to 10.
虽然我们无法从散点图矩阵中观察到与 quality 呈线性关系的特征,但是我们可以用箱线图来观察其他特征与 quality 的关系。
fig, axes = plt.subplots(6, 2, figsize=(70, 160), dpi=250)
for i, feature in enumerate(data.columns[:-1]):
    row_index = i // 2  # row index.
    col_index = i % 2   # column index.
    sns.boxplot(x=feature, y='quality', data=data, orient='h', ax=axes[row_index, col_index])  # box plot.
    axes[row_index, col_index].set_title(f'Boxplot of {feature} by quality', fontsize=30)
    axes[row_index, col_index].set_xlabel(f'{feature}', fontsize=30)
    axes[row_index, col_index].set_ylabel('quality', fontsize=30)
    axes[row_index, col_index].tick_params(axis='both', labelsize=30)
# calculate correlational matrix
corr = data.corr()
sns.set(font_scale=3)
fig, axes = plt.subplots(figsize=(80, 80), dpi=300)
sns.heatmap(corr, annot=True, cmap='coolwarm')
结论:从箱线图和热图看出,随着 pH 增大,quality 没有明显变化,所以我们可以考虑将该特征删除;同理,也可以将 residual sugar 删除。由于 free sulfur dioxide 和 total sulfur dioxide 与 quality 呈非线性关系,对线性回归模型的贡献很小,所以也考虑将这两个特征删除。
# remove the columns of pH, residual sugar, free sulfur dioxide, total sulfur dioxide.
data = data.drop(['pH', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide'], axis=1)
data
(8) 划分数据集
- 训练集
- 测试集
注: 训练集:测试集 = 7:3 或 8:2
def split_train_test(data, train_ratio, random_state, **kargs):
    """ Split the dataset into a train dataset and a test dataset.

    Parameters
    ----------
    data: {np.ndarray or pd.DataFrame}
        input data.
    train_ratio: float
        train ratio.
    random_state: int
        random seed.
    """
    data = data if isinstance(data, np.ndarray) else np.array(data)
    np.random.seed(random_state)
    index_list = np.random.permutation(len(data))
    train_size = int(len(data) * train_ratio)
    train_data = data[index_list[:train_size]]
    X_train = train_data[:, :-1]
    y_train = train_data[:, -1].reshape((train_size, 1))
    test_data = data[index_list[train_size:]]
    X_test = test_data[:, :-1]
    y_test = test_data[:, -1].reshape((len(data) - train_size, 1))
    return X_train, y_train, X_test, y_test
X_train, y_train, X_test, y_test = split_train_test(data, train_ratio=0.8, random_state=30)
(9) 构建回归模型
- 初始化模型参数
- 构造损失函数
- 训练模型
- 利用梯度下降法求解模型的近似最优参数,即权重和偏置
- 预测
class LinearRegression:
    def __init__(self, n_iter=1000, eta=1e-3, tol=0.001):
        self.n_iter = n_iter  # the number of iterations.
        self.eta = eta  # learning rate.
        self.tol = tol  # tolerance.
        self.weight = None  # weight.
        self.y_hat = None  # predictions on the training set.
        self.residual_square = []  # loss history.

    @staticmethod
    def preprocessing_data(X):
        """ Preprocess data: append a column of ones for the bias term.

        Parameters
        ----------
        X: {ndarray or DataFrame}
            input data.
        """
        one_column = np.ones((len(X), 1))
        X = np.hstack((X, one_column))
        return X

    @staticmethod
    def loss(y, y_pred):
        """ Calculate the mean squared error loss.

        Parameters
        ----------
        y: ndarray
            ground truth.
        y_pred: ndarray
            predicted values.
        """
        return np.sum((y - y_pred) ** 2) / y.size

    def gradient_decent(self, X, y):
        """ Gradient descent.

        Parameters
        ----------
        X: ndarray
            training independent variables.
        y: ndarray
            training dependent variables.
        """
        loss_old = np.inf  # initialize loss_old.
        for i in range(self.n_iter):
            self.y_hat = self.__predict(X)  # predict y.
            loss = self.loss(y, self.y_hat)  # calculate loss.
            self.residual_square.append(loss)
            print(f'==============={self.n_iter}\\{i}===============')
            print(f'Loss: {loss:.4f}')
            print(f'Tolerance: {self.tol}')
            print(f'Loss difference: {(loss_old - loss):.4f}')
            if self.tol is not None and loss_old - loss < self.tol:
                break
            loss_old = loss
            gradient = X.T @ (self.y_hat - y) / y.size
            self.weight -= self.eta * gradient

    def fit(self, train_X, train_y, random_state):
        """ Train the model.

        Parameters
        ----------
        train_X: ndarray
            training features.
        train_y: ndarray
            training target.
        random_state: int
            random seed.
        """
        np.random.seed(random_state)
        train_X = self.preprocessing_data(train_X)
        self.weight = np.random.random((train_X.shape[1], 1))
        self.gradient_decent(train_X, train_y)

    def __predict(self, X):
        """ Predict on preprocessed features (with the bias column).

        Parameters
        ----------
        X: ndarray
            preprocessed features.
        """
        return X @ self.weight

    def predict(self, X):
        """ Predict on raw features.

        Parameters
        ----------
        X: {ndarray or DataFrame}
            training features or testing features.
        """
        X = self.preprocessing_data(X)
        return X @ self.weight
lr = LinearRegression() # instance.
lr.fit(X_train, y_train, random_state=30)
y_pred = lr.predict(X_test)
# 可视化
plt.figure(figsize=(60, 8), dpi=200)
plt.plot(y_test, 'k-*')
plt.plot(y_pred, 'r--')
plt.xlim((0, len(y_pred)))
plt.legend(("y", 'y_pred'))
plt.figure(figsize=(70, 8), dpi=200)
plt.plot(lr.residual_square, 'k-*')
plt.xlim((0, len(lr.residual_square)))
plt.title("Mean Squared Error")
(10) 微调模型
- 从上面的预测结果来看,模型表现不尽人意;
- 我们试试从数据归一化的角度入手,看能否提高模型的性能。
def min_max_scaler(data):
    data = data if isinstance(data, np.ndarray) else np.array(data)
    max_columns = np.max(data[:, :-1], axis=0)
    min_columns = np.min(data[:, :-1], axis=0)
    data[:, :-1] = (data[:, :-1] - min_columns) / (max_columns - min_columns)
    return data
norm_data = min_max_scaler(data)
norm_data
X_norm_train, y_norm_train, X_norm_test, y_norm_test = split_train_test(norm_data, train_ratio=0.8, random_state=30)
lr = LinearRegression(eta=0.01) # instance.
lr.fit(X_norm_train, y_norm_train, random_state=30)
y_norm_pred = lr.predict(X_norm_test)
plt.figure(figsize=(60, 8), dpi=200)
plt.plot(y_norm_test, 'k-*')
plt.plot(y_norm_pred, 'r--')
plt.xlim((0, len(y_norm_pred)))
plt.legend(("y_norm_test", 'y_norm_pred'))
plt.figure(figsize=(70, 8), dpi=200)
plt.plot(lr.residual_square, 'k-*')
plt.xlim((0, len(lr.residual_square)))
plt.title("Mean Squared Error")
最大值-最小值归一化后的数据:收敛速度有所加快,但精度变差。
def standard_scaler(data):
    data = data if isinstance(data, np.ndarray) else np.array(data)
    average_cols = np.mean(data[:, :-1], axis=0)
    std_cols = np.std(data[:, :-1], axis=0)
    data[:, :-1] = (data[:, :-1] - average_cols) / std_cols
    return data
std_data = standard_scaler(data)
std_data
X_std_train, y_std_train, X_std_test, y_std_test = split_train_test(std_data, train_ratio=0.8, random_state=30)
lr = LinearRegression(eta=0.01) # instance.
lr.fit(X_std_train, y_std_train, random_state=30)
y_std_pred = lr.predict(X_std_test)
plt.figure(figsize=(60, 8), dpi=200)
plt.plot(y_std_test, 'k-*')
plt.plot(y_std_pred, 'r--')
plt.xlim((0, len(y_std_pred)))
plt.legend(("y_std_test", 'y_std_pred'))
plt.figure(figsize=(70, 8), dpi=200)
plt.plot(lr.residual_square, 'k-*')
plt.xlim((0, len(lr.residual_square)))
plt.title("Mean Squared Error")
通过上文对比分析,标准化之后的模型的均方误差变差,训练速度变慢,所以我们使用原始数据进行训练。