#! https://zhuanlan.zhihu.com/p/406879862
Ridge Regression and Least Squares
There are many ways to deal with overfitting; three common ones are collecting more data, feature selection, and regularization. Ridge regression is what we usually call $\ell_2$ regularization; here we study the ridge form of the least squares method from the previous article.
First, a quick review of least squares:

$$L=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2,$$

which gives
$$\hat{w}=\underset{w}{argmin}\,L=(X^TX)^{-1}X^TY$$
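As a quick sanity check, here is a minimal numpy sketch of this closed form (the variable names and the toy data are my own invention, not from the original article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, p features; true weights chosen arbitrarily.
N, p = 100, 3
X = rng.normal(size=(N, p))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: w_hat = (X^T X)^{-1} X^T Y.
# np.linalg.solve is preferred over forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_hat)  # should be close to w_true
```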
For the ridge-regression form of least squares, the loss gains an $\ell_2$ penalty:
$$\begin{aligned} L&=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2 + \lambda w^Tw\\ &=\begin{pmatrix}w^Tx_1-y_1 & w^Tx_2-y_2 & \cdots & w^Tx_N-y_N\end{pmatrix}\begin{pmatrix} w^Tx_1-y_1 \\ w^Tx_2-y_2 \\ \vdots \\ w^Tx_N-y_N \end{pmatrix} + \lambda w^Tw \\ &=(w^TX^T-Y^T)(Xw-Y) + \lambda w^Tw \\ &=w^T(X^TX+\lambda I)w - 2w^TX^TY+Y^TY \end{aligned}$$
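The algebra above is easy to verify numerically. The following sketch (with arbitrary toy values, names of my choosing) checks that the summed residual form and the expanded quadratic form agree:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 50, 4, 0.7
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)
w = rng.normal(size=p)

# Summed / residual form of the ridge loss.
loss_direct = np.sum((X @ w - Y) ** 2) + lam * w @ w

# Expanded quadratic form: w^T(X^TX + lam*I)w - 2 w^T X^T Y + Y^T Y.
A = X.T @ X + lam * np.eye(p)
loss_expanded = w @ A @ w - 2 * w @ (X.T @ Y) + Y @ Y

print(np.isclose(loss_direct, loss_expanded))  # True
```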
Taking the derivative and setting it to zero,

$$\frac{\partial L}{\partial w} = 2(X^TX+\lambda I)w - 2X^TY=0$$
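One way to convince yourself of this gradient formula is a finite-difference check (a rough sketch; the step size and tolerance are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 50, 4, 0.7
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)
w = rng.normal(size=p)

def loss(w):
    return np.sum((X @ w - Y) ** 2) + lam * w @ w

# Analytic gradient: 2(X^TX + lam*I)w - 2X^TY.
grad = 2 * (X.T @ X + lam * np.eye(p)) @ w - 2 * X.T @ Y

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.allclose(grad, grad_fd, atol=1e-4))  # True
```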
so

$$\hat{w} = (X^TX+\lambda I)^{-1}X^TY.$$

Note that for $\lambda>0$ the matrix $X^TX+\lambda I$ is positive definite and hence always invertible, unlike $X^TX$ itself.
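A minimal sketch of the ridge solution (again with invented toy data); at the solution the gradient from the previous step should vanish:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, lam = 100, 5, 1.0
X = rng.normal(size=(N, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

# Ridge closed form: w = (X^TX + lam*I)^{-1} X^T Y,
# applied via a linear solve rather than an explicit inverse.
A = X.T @ X + lam * np.eye(p)
w_ridge = np.linalg.solve(A, X.T @ Y)

# The gradient 2(X^TX + lam*I)w - 2X^TY should be ~0 at the solution.
grad = 2 * A @ w_ridge - 2 * X.T @ Y
print(np.allclose(grad, 0.0))  # True
```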
Another view of $\ell_2$-regularized least squares
For the weights $w$ in least squares, assume the prior $w\sim N(0, \sigma^2_0)$ and take a maximum a posteriori (MAP) view. We still rely on a conclusion from the previous article (https://blog.csdn.net/weixin_49708196/article/details/120034186?spm=1001.2014.3001.5501): from the maximum-likelihood perspective, least squares is equivalent to maximum-likelihood estimation of a linear model with Gaussian noise, i.e.
$$y|w, x \sim N(w^Tx, \sigma^2)$$
$$\begin{aligned} \hat{w} &= \underset{w}{argmax}\, p(w|Y) \\ &= \underset{w}{argmax}\, \frac{p(Y|w)p(w)}{p(Y)} \\ &= \underset{w}{argmax}\, \prod_{i=1}^{N}p(y_i|w)\,p(w) \\ &= \underset{w}{argmax}\, \sum_{i=1}^{N}\log (p(y_i|w)) + \log (p(w)) \\ &= \underset{w}{argmax}\, \sum_{i=1}^{N} \log \left(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\|y_i-w^Tx_i\|^2}{2\sigma ^ 2}}\right) + \log \left(\frac{1}{\sqrt{2\pi }\sigma_{0}}e^{-\frac{\|w\|^2}{2\sigma ^2_0}}\right) \\ &= \underset{w}{argmin}\, \sum_{i=1}^{N}\frac{\|y_i-w^Tx_i\|^2}{2\sigma ^ 2} + \frac{\|w\|^2}{2\sigma^2_0} \\ &= \underset{w}{argmin}\, \sum_{i=1}^{N} \|y_i-w^Tx_i\|^2 + \frac{\sigma^2}{\sigma^2_0} \|w\|^2 \end{aligned}$$

(When switching from argmax to argmin, the terms $\log(\sqrt{2\pi}\sigma)$ and $\log(\sqrt{2\pi}\sigma_0)$ do not depend on $w$ and are dropped.)
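The equivalence can also be checked numerically: minimizing the negative log posterior directly should recover the ridge closed form with $\lambda = \sigma^2/\sigma^2_0$. A sketch using scipy (all constants here are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p = 200, 3
sigma, sigma0 = 0.5, 2.0  # noise and prior std devs (arbitrary)
X = rng.normal(size=(N, p))
w_true = rng.normal(scale=sigma0, size=p)
Y = X @ w_true + rng.normal(scale=sigma, size=N)

# Negative log posterior, up to additive constants in w.
def neg_log_posterior(w):
    return (np.sum((Y - X @ w) ** 2) / (2 * sigma**2)
            + np.sum(w ** 2) / (2 * sigma0**2))

w_map = minimize(neg_log_posterior, x0=np.zeros(p)).x

# Ridge closed form with lambda = sigma^2 / sigma0^2.
lam = sigma**2 / sigma0**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# The two should agree up to optimizer tolerance.
print(np.allclose(w_map, w_ridge, atol=1e-4))  # True
```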
The last line is exactly the ridge-regression form of least squares, with $\lambda = \sigma^2/\sigma^2_0$. So we can conclude that ridge regression amounts to placing a zero-mean Gaussian prior on the weights $w$, and this assumption is what produces the weight-decay effect.
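The decay itself is easy to observe: as $\lambda$ grows (equivalently, as the prior tightens with smaller $\sigma^2_0$), the norm of the solution shrinks. A small illustrative sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 5
X = rng.normal(size=(N, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

# ||w|| decreases as the penalty lambda grows.
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    print(f"lambda={lam:7.1f}  ||w|| = {np.linalg.norm(w):.4f}")
```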