The loss function of linear regression
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$$
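As a minimal numerical sketch (the data and parameter values below are made up purely for illustration), this loss can be computed directly:

```python
import numpy as np

# Toy data: m = 5 samples, first column is the intercept term (hypothetical values)
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 4.5]])
y = np.array([1.1, 2.0, 2.4, 3.1, 4.6])
theta = np.array([0.5, 0.9])

def least_squares_loss(theta, X, y):
    """J(theta) = 1/2 * sum_i (y_i - theta^T x_i)^2"""
    residuals = y - X @ theta
    return 0.5 * np.sum(residuals ** 2)

print(least_squares_loss(theta, X, y))
```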
The probabilistic interpretation of least squares in linear regression (frequentist view: maximum likelihood estimation)
When facing a regression problem, why do we use linear regression with a least-squares cost function, i.e. one half of the sum of squared differences?
Here is a probabilistic explanation:
The values predicted by the fitted line inevitably deviate from the true values, so we assume the following equation:
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where the error terms of the individual samples are independent and identically distributed, following a Gaussian (normal) distribution. (This can be justified by the central limit theorem.)
That is:
$$\epsilon^{(i)} \sim N(0,\sigma^2)$$
$$P(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\left(y^{(i)}-\theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
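As a quick sanity check (a sketch, not part of the original derivation), this density can be coded directly and verified to integrate to approximately 1:

```python
import numpy as np

def gaussian_pdf(eps, sigma):
    # N(0, sigma^2) density, matching the formula above
    return np.exp(-eps ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# A probability density should integrate to ~1; simple Riemann sum over a wide interval
eps = np.linspace(-10.0, 10.0, 100001)
area = gaussian_pdf(eps, sigma=1.5).sum() * (eps[1] - eps[0])  # ≈ 1.0
print(area)
```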
The mean being 0 is easy to understand. Therefore,
$$P(y^{(i)}\mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\left(y^{(i)}-\theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
In other words, given $x^{(i)}$ and parameters $\theta$, the probability that the prediction matches the true value $y^{(i)}$ follows a normal distribution, and we want the value of $\theta$ that maximizes this probability.
Using maximum likelihood estimation:
$$\begin{aligned} L(\theta) &=\prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)} ; \theta\right) \\ &=\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \end{aligned}$$
$$\begin{aligned} l(\theta) &=\ln L(\theta) \\ &=\ln \prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \\ &=\sum_{i=1}^{m} \ln \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \\ &=m \ln \frac{1}{\sqrt{2 \pi} \sigma}-\frac{1}{\sigma^{2}} \cdot \frac{1}{2} \sum_{i=1}^{m}\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2} \end{aligned}$$
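The last step of this derivation can be checked numerically. The sketch below (with made-up synthetic data) compares the log-likelihood summed term by term against the closed form $m\ln\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_i(y^{(i)}-\theta^Tx^{(i)})^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 50, 0.8
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=m)
theta = np.array([0.9, 2.1])

residuals = y - X @ theta
# Direct log-likelihood: sum over samples of the log Gaussian density
ll_direct = np.sum(np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
                   - residuals ** 2 / (2 * sigma ** 2))
# Closed form from the derivation above
J = 0.5 * np.sum(residuals ** 2)
ll_closed = m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - J / sigma ** 2

print(np.isclose(ll_direct, ll_closed))  # True
```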
From this derivation, maximizing $l(\theta)$ amounts to minimizing the second term, i.e. the function $J(\theta)$:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$$
This is exactly the objective function of least-squares estimation.
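Minimizing $J(\theta)$ has a well-known closed-form solution via the normal equations, $\theta = (X^TX)^{-1}X^Ty$ (a standard result not derived in this article). A quick sketch on synthetic data, cross-checked against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([0.5, -1.5]) + rng.normal(scale=0.3, size=m)

# Normal equations: solve (X^T X) theta = X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
# Reference: NumPy's built-in least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_ne, theta_lstsq))  # True
```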
The loss function of ridge regression
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2+\lambda\|\theta\|_2^2$$
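This regularized loss is also minimized in closed form: setting its gradient $-X^T(y - X\theta) + 2\lambda\theta$ to zero gives a linear system. The sketch below (synthetic data, hypothetical values) solves it and verifies the gradient vanishes at the solution:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimizer of J(theta) = 1/2 * sum (y_i - theta^T x_i)^2 + lam * ||theta||_2^2.
    # grad J = -X^T (y - X theta) + 2*lam*theta = 0  =>  (X^T X + 2*lam*I) theta = X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
theta = ridge_fit(X, y, lam=0.5)

grad = -X.T @ (y - X @ theta) + 2 * 0.5 * theta
print(np.allclose(grad, 0.0))  # True
```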
The probabilistic interpretation of ridge regression (Bayesian view: maximum a posteriori estimation)
From the Bayesian perspective:
With the Gaussian noise $\epsilon$ introduced earlier, we know that:
$$y^{(i)}\mid x^{(i)};\theta \sim N(\theta^T x^{(i)}, \sigma^2)$$
That is:
$$P(y\mid\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-\theta^T x)^2}{2\sigma^2}\right)$$
We further assume that the parameter $\theta$ also follows a Gaussian distribution:
$$P(\theta) = \frac{1}{\sqrt{2\pi}\sigma_0}\exp\left(-\frac{\|\theta\|^2_2}{2\sigma_0^2}\right)$$
and we use Bayes' theorem:
$$P(\theta\mid y) = \frac{P(y\mid\theta)P(\theta)}{P(y)}$$
By maximum a posteriori estimation (the evidence $P(y)$ does not depend on $\theta$ and can be dropped):
$$\begin{aligned} \hat{\theta} &=\arg \max _{\theta} P(\theta \mid y)=\arg \max _{\theta} P(y \mid \theta) \cdot P(\theta) \\ &=\arg \max _{\theta} \log [P(y \mid \theta) \cdot P(\theta)] \\ &=\arg \max _{\theta} \log \left(\frac{1}{\sqrt{2 \pi} \sigma} \cdot \frac{1}{\sqrt{2 \pi} \sigma_{0}}\right)+\log \exp \left\{-\frac{\left(y-\theta^{T} x\right)^{2}}{2 \sigma^{2}}-\frac{\|\theta\|^{2}}{2 \sigma_{0}^{2}}\right\} \\ &=\arg \min _{\theta} \frac{\left(y-\theta^{T} x\right)^{2}}{2 \sigma^{2}}+\frac{\|\theta\|^{2}}{2 \sigma_{0}^{2}} \\ &=\arg \min _{\theta}\left(y-\theta^{T} x\right)^{2}+\frac{\sigma^{2}}{\sigma_{0}^{2}}\|\theta\|^{2} \end{aligned}$$
MAP: $$\hat{\theta}_{MAP} = \arg\min_{\theta} \sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2+\frac{\sigma^2}{\sigma_0^2}\|\theta\|^2$$
Ridge regression: $$J(\theta) = \sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2+\lambda\|\theta\|^2$$
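Comparing the two expressions, $\lambda$ corresponds to the variance ratio $\sigma^2/\sigma_0^2$. A small sketch of this correspondence (all numeric values made up), including the flat-prior limit where MAP falls back to MLE:

```python
import numpy as np

sigma, sigma0 = 0.5, 2.0
lam = sigma ** 2 / sigma0 ** 2  # lambda in ridge regression is the ratio sigma^2 / sigma0^2

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=sigma, size=40)

# Minimizer of sum_i (y_i - theta^T x_i)^2 + lam * ||theta||^2:
# setting the gradient -2 X^T (y - X theta) + 2 lam theta to zero
# gives (X^T X + lam I) theta = X^T y.
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# A very flat prior (sigma0 large) drives lam -> 0 and MAP approaches the MLE
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
theta_flat = np.linalg.solve(X.T @ X + 1e-12 * np.eye(2), X.T @ y)
print(np.allclose(theta_flat, theta_mle))  # True
```

Note that the prior shrinks the estimate: the MAP solution always has norm no larger than the MLE solution.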
Conclusion
Least-squares estimation (LSE) <==> maximum likelihood estimation (MLE), when the noise is Gaussian
Regularized least squares (RLSE) <==> maximum a posteriori estimation (MAP), when both the prior and the noise are Gaussian
Maximum a posteriori estimation versus maximum likelihood estimation
Compared with maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP) additionally assumes prior knowledge that the parameters follow some distribution.