The general form of multiple linear regression is defined as:
$$
\begin{cases} \hat y_i = f(x_i) = \theta_0 + \sum\limits_{j=1}^n \theta_j x_i^j \\ y_i = \hat y_i + \epsilon_i \end{cases}
$$
where $\theta_0$ is the bias term, $x_i^j$ is the value of the $j$-th feature of the $i$-th sample, $\theta_j$ is the weight of the $j$-th feature, $\hat y_i$ is the predicted value for the $i$-th sample $x_i$, $y_i$ is the actual value corresponding to $x_i$, and $\epsilon_i$ is the residual between the prediction $\hat y_i$ and the actual value $y_i$.
Letting $x_i^0 = 1$, $\hat y_i = \theta_0 + \sum\limits_{j=1}^n \theta_j x_i^j$ can be rewritten as $\hat y_i = \theta^T x_i$.
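As a quick illustration (a minimal NumPy sketch with made-up numbers, not part of the original derivation), absorbing the bias into $\theta$ just means prepending a column of ones to the data matrix:

```python
import numpy as np

# Hypothetical toy data: m = 4 samples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
m, n = X_raw.shape

# Prepend x_i^0 = 1 so the bias theta_0 is absorbed into theta.
X = np.hstack([np.ones((m, 1)), X_raw])   # shape (m, n+1)

theta = np.array([0.5, 1.0, -0.3])        # [theta_0, theta_1, theta_2]

# Vectorized prediction: each row computes theta^T x_i.
y_hat = X @ theta
print(y_hat)
```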
Assume the samples $x_i$ are independent and identically distributed, and that the residuals $\epsilon_i$ follow a normal distribution with mean $0$ and variance $\sigma^2$. The probability density function of the residual is:
$$
p(\epsilon_i) = \frac {1}{\sqrt{2\pi}\sigma} \exp\left(-\frac {\epsilon_i^2}{2 \sigma^2}\right) = \frac {1}{\sqrt{2\pi}\sigma} \exp\left(-\frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2}\right)
$$
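A small numeric check of this density (the residual values and $\sigma$ are arbitrary; `scipy.stats.norm` is used purely as a cross-check):

```python
import numpy as np
from scipy.stats import norm

sigma = 0.8                       # assumed residual standard deviation
eps = np.array([-1.0, 0.0, 0.5])  # example residuals

# Density from the formula above.
pdf_manual = np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Cross-check against scipy's normal pdf with mean 0 and scale sigma.
pdf_scipy = norm.pdf(eps, loc=0.0, scale=sigma)

assert np.allclose(pdf_manual, pdf_scipy)
print(pdf_manual)  # largest at eps = 0, as noted below
```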
Our goal is to minimize $\vert \epsilon_i \vert$, and $p(\epsilon_i)$ is maximized as $\epsilon_i$ approaches $\mu = 0$ (the peak of the bell curve). Therefore, given $x_i$ and $y_i$, we can use maximum likelihood estimation to obtain $\theta$:
$$
L(\theta) = \prod\limits_{i=1}^m p(y_i|x_i;\theta) = \prod\limits_{i=1}^m \frac {1}{\sqrt{2\pi}\sigma} \exp\left(-\frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2}\right)
$$
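A brief numeric aside (my own illustration, not from the original): the raw product of many densities underflows in floating point, which is one practical reason for the log transform introduced next:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
eps = rng.normal(0.0, 1.0, size=m)           # simulated residuals, sigma = 1

densities = np.exp(-eps**2 / 2) / np.sqrt(2 * np.pi)

print(np.prod(densities))           # underflows to 0.0 for large m
print(np.sum(np.log(densities)))    # the log-likelihood stays finite
```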
Taking the logarithm of $L(\theta)$ gives $\log L(\theta)$: the product becomes a sum and the exponential disappears, which reduces the complexity of the function. Moreover, since $\log$ is monotonically increasing, $L(\theta)$ and $\log L(\theta)$ rise and fall together. Therefore, the $\theta$ that maximizes $L(\theta)$ is exactly the $\theta$ that maximizes $\log L(\theta)$.
$$
\begin{aligned} \log L(\theta) &= \sum\limits_{i=1}^m \log \left(\frac {1}{\sqrt{2\pi}\sigma} \exp\left(-\frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2}\right)\right)\\ &= \sum\limits_{i=1}^m \left(\log \frac {1}{\sqrt{2\pi}\sigma} -\frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2}\right) \\ &= \sum\limits_{i=1}^m \log \frac {1}{\sqrt{2\pi}\sigma} - \sum\limits_{i=1}^m \frac {(y_i-\theta^Tx_i)^2}{2 \sigma^2} \end{aligned}
$$
Since $\sum\limits_{i=1}^m \log \frac {1}{\sqrt{2\pi}\sigma}$ and $2\sigma^2$ are constants, $\log L(\theta)$ attains its maximum exactly when $\sum\limits_{i=1}^m (y_i-\theta^Tx_i)^2$ attains its minimum.
This yields the least-squares objective function:
$$
J(\theta) = \sum\limits_{i=1}^m (y_i - \theta^Tx_i)^2
$$
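To sanity-check the equivalence derived above, here is a sketch with synthetic data and an arbitrary fixed $\sigma$: sweeping a single weight over a grid, the log-likelihood peaks exactly where $J(\theta)$ bottoms out.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = np.linspace(0, 5, m)
y = 2.0 + 1.5 * x + rng.normal(0.0, 0.5, size=m)   # true weights: (2.0, 1.5)

sigma = 0.5
grid = np.linspace(0.0, 3.0, 601)                  # candidate values for theta_1

# Least-squares objective and log-likelihood over the grid (theta_0 held at 2.0).
J = np.array([np.sum((y - (2.0 + t * x))**2) for t in grid])
log_L = m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - J / (2 * sigma**2)

# Both criteria pick the same theta_1.
assert grid[np.argmin(J)] == grid[np.argmax(log_L)]
print(grid[np.argmin(J)])
```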
$J(\theta)$ attains its minimum where its partial derivatives are zero. We therefore take the partial derivatives of $J(\theta)$ and solve for the $\theta$ at which they vanish.
$$
J_\theta = \frac {\partial J}{\partial \theta} = \begin{bmatrix} \frac {\partial J}{\partial \theta_0} & \frac {\partial J}{\partial \theta_1} & \dots & \frac {\partial J}{\partial \theta_n} \end{bmatrix}^T = \mathbf 0
$$
Take $\frac {\partial J}{\partial \theta_0}$ as an example (note that with $x_i^0 = 1$ the sum runs from $j = 0$):
$$
\begin{aligned} \frac {\partial J}{\partial \theta_0} &= \frac {\partial}{\partial \theta_0} \sum\limits_{i=1}^m(y_i - \theta^Tx_i)^2 \\ &= \frac {\partial}{\partial \theta_0}\left[(y_1 - \theta^Tx_1)^2 + (y_2 - \theta^Tx_2)^2 + \dots + (y_m - \theta^Tx_m)^2\right] \\ &= \frac {\partial}{\partial \theta_0}\left[\left(y_1 - \sum\limits_{j=0}^n \theta_jx_1^j\right)^2 + \left(y_2 - \sum\limits_{j=0}^n \theta_jx_2^j\right)^2 + \dots + \left(y_m - \sum\limits_{j=0}^n \theta_jx_m^j\right)^2\right] \\ &= 2\left(y_1 - \sum\limits_{j=0}^n \theta_jx_1^j\right)(-x_1^0) + 2\left(y_2 - \sum\limits_{j=0}^n \theta_jx_2^j\right)(-x_2^0) + \dots + 2\left(y_m - \sum\limits_{j=0}^n \theta_jx_m^j\right)(-x_m^0) \\ &= -2 \begin{bmatrix} y_1 - \sum\limits_{j=0}^n\theta_jx_1^j & y_2 - \sum\limits_{j=0}^n\theta_jx_2^j & \dots & y_m - \sum\limits_{j=0}^n\theta_jx_m^j \end{bmatrix} \begin{bmatrix} x_1^0 & x_2^0 & \dots & x_m^0 \end{bmatrix}^T \\ &= -2 \begin{bmatrix} y_1 - \theta^Tx_1 & y_2 - \theta^Tx_2 & \dots & y_m - \theta^Tx_m \end{bmatrix} \begin{bmatrix} x_1^0 & x_2^0 & \dots & x_m^0 \end{bmatrix}^T \end{aligned}
$$
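The closed-form expression for $\frac{\partial J}{\partial \theta_0}$ can be verified numerically; the following sketch (random synthetic data) compares it against a central-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # x_i^0 = 1 column
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Closed-form partial derivative w.r.t. theta_0, as derived above.
grad0 = -2 * np.dot(y - X @ theta, X[:, 0])

# Finite-difference approximation for comparison.
J = lambda t: np.sum((y - X @ t)**2)
h = 1e-6
e0 = np.zeros(n + 1)
e0[0] = h
grad0_fd = (J(theta + e0) - J(theta - e0)) / (2 * h)

print(grad0, grad0_fd)   # should agree to ~6 digits
```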
In the same way, we can obtain the partial derivatives $\frac {\partial J}{\partial \theta_1}$ through $\frac {\partial J}{\partial \theta_n}$:
$$
\frac {\partial J}{\partial \theta_0} = -2 \begin{bmatrix} y_1 - \theta^Tx_1 & y_2 - \theta^Tx_2 & \dots & y_m - \theta^Tx_m \end{bmatrix} \begin{bmatrix} x_1^0 & x_2^0 & \dots & x_m^0 \end{bmatrix}^T \\ \frac {\partial J}{\partial \theta_1} = -2 \begin{bmatrix} y_1 - \theta^Tx_1 & y_2 - \theta^Tx_2 & \dots & y_m - \theta^Tx_m \end{bmatrix} \begin{bmatrix} x_1^1 & x_2^1 & \dots & x_m^1 \end{bmatrix}^T \\ \vdots \\ \frac {\partial J}{\partial \theta_n} = -2 \begin{bmatrix} y_1 - \theta^Tx_1 & y_2 - \theta^Tx_2 & \dots & y_m - \theta^Tx_m \end{bmatrix} \begin{bmatrix} x_1^n & x_2^n & \dots & x_m^n \end{bmatrix}^T
$$
Rewriting this in matrix form (stacking the partials as a row vector, i.e. $J_\theta^T$):
$$
\begin{aligned} J_\theta^T = \begin{bmatrix} \frac {\partial J}{\partial \theta_0} & \frac {\partial J}{\partial \theta_1} & \dots & \frac {\partial J}{\partial \theta_n} \end{bmatrix} &= -2 \begin{bmatrix} y_1 - \theta^Tx_1 & y_2 - \theta^Tx_2 & \dots & y_m - \theta^Tx_m \end{bmatrix} \begin{bmatrix} x_1^0 & x_1^1 & \dots & x_1^n \\ x_2^0 & x_2^1 & \dots & x_2^n \\ \vdots & \vdots & \ddots & \vdots \\ x_m^0 & x_m^1 & \dots & x_m^n \end{bmatrix} \\ &= -2\left( \begin{bmatrix} y_1 & y_2 & \dots & y_m \end{bmatrix} - \begin{bmatrix} \theta^Tx_1 & \theta^Tx_2 & \dots & \theta^Tx_m \end{bmatrix} \right) \mathbf X \\ &= -2\left(\mathbf Y^T - \theta^T \begin{bmatrix} x_1 & x_2 & \dots & x_m \end{bmatrix}\right) \mathbf X \\ &= -2(\mathbf Y^T - \theta^T \mathbf X^T) \mathbf X \end{aligned}
$$
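The same finite-difference check works for the whole gradient in matrix form (again a sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # m x (n+1)
Y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Row-vector gradient from the derivation: -2 (Y^T - theta^T X^T) X.
grad = -2 * (Y - X @ theta) @ X        # shape (n+1,)

# Numerical gradient via central differences, one component at a time.
J = lambda t: np.sum((Y - X @ t)**2)
h = 1e-6
grad_fd = np.array([
    (J(theta + h * e) - J(theta - h * e)) / (2 * h)
    for e in np.eye(n + 1)
])

assert np.allclose(grad, grad_fd, rtol=1e-4)
```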
Setting $J_\theta^T = -2(\mathbf Y^T - \theta^T \mathbf X^T) \mathbf X = \mathbf 0$, we obtain:
$$
\mathbf Y^T \mathbf X = \theta^T \mathbf X^T \mathbf X \\ \theta^T = \mathbf Y^T \mathbf X(\mathbf X^T \mathbf X)^{-1} \\ \theta = (\theta^T)^T = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf Y
$$

(Note that $\mathbf X$ is $m \times (n+1)$ and generally not square, so $(\mathbf X^T \mathbf X)^{-1}$ cannot be split into $\mathbf X^{-1}(\mathbf X^T)^{-1}$; the last step uses the fact that $\mathbf X^T \mathbf X$, and hence its inverse, is symmetric.)
Therefore, when $\theta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf Y$ (assuming $\mathbf X^T \mathbf X$ is invertible), $\hat y_i = \theta^Tx_i$ best describes the linear relationship between $\mathbf X$ and $\mathbf Y$.
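In code, the closed-form solution is best computed without an explicit matrix inverse; solving the linear system $\mathbf X^T \mathbf X \, \theta = \mathbf X^T \mathbf Y$ is cheaper and more stable. A sketch on synthetic data, cross-checked against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5])
Y = X @ true_theta + rng.normal(0.0, 0.1, size=m)

# Normal equation: theta = (X^T X)^{-1} X^T Y, via a linear solve.
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta)        # close to [2.0, -1.0, 0.5]
assert np.allclose(theta, theta_lstsq)
```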
In practice, however, not every objective function can be minimized by setting partial derivatives to zero. For example, for $z = y^2 - x^2$, the point where $\frac {\partial z}{\partial y} = 0$ and $\frac {\partial z}{\partial x} = 0$, namely $x = 0, y = 0$, is a saddle point rather than an extremum. For this reason, gradient descent or Newton's method is commonly used in practice to find an approximate extremum of the objective function. Once the regression parameters have been obtained, the coefficient of determination $R^2$ can be computed to evaluate the goodness of fit of the regression function.
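To close, a minimal sketch of the two ideas just mentioned: batch gradient descent on $J(\theta)$ (the step size and iteration count are hand-tuned for this toy data), followed by the $R^2$ computation.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 200
X = np.hstack([np.ones((m, 1)), rng.uniform(0, 5, size=(m, 1))])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, size=m)

# Plain batch gradient descent on the least-squares objective.
theta = np.zeros(2)
lr = 1e-4
for _ in range(20_000):
    grad = -2 * X.T @ (y - X @ theta)   # gradient of J(theta)
    theta -= lr * grad

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
y_hat = X @ theta
ss_res = np.sum((y - y_hat)**2)
ss_tot = np.sum((y - y.mean())**2)
print(theta, 1 - ss_res / ss_tot)   # theta near [1.0, 2.0], R^2 close to 1
```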