Regularization
Background: Overfitting
As the figure shows, with only a first-order term the fit is inadequate (underfitting), while with a fifth-order term the model overfits: the hypothesis fits the given data very well but predicts poorly on new data. The sketch below illustrates this contrast.
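To make the contrast concrete, here is a minimal numpy sketch (my own example, not from the notes) that fits the same noisy samples with a degree-1 and a degree-5 polynomial; the high-degree fit reaches a much lower training error but oscillates between the samples, so it generalizes poorly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)  # noisy samples

for degree in (1, 5):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y_hat - y) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")
```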
Ways to address overfitting:
1. Reduce the number of features.
2. Regularization:
- Keep all the features, but shrink the coefficients $\theta_j$ to weaken their influence.
- Regularization is especially useful when we have many features that each contribute only a small amount.
Regularized Linear Regression
For the following hypothesis:
$$\theta_0 + \theta_1x + \theta_2 x^2 + \theta_3x^3 + \theta_4x^4$$
Suppressing the influence of $\theta_3x^3 + \theta_4x^4$ makes the curve smoother, so the cost function is redefined as:
$$\min_{\theta} \frac{1}{2m}\left[ \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2\right]$$
A more general definition:
$$\min_{\theta} \frac{1}{2m}\left[ \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 +\lambda\sum_{j=1}^{n}\theta_j^2\right]$$
Note that $\theta_0$ is not included here: $\theta_0$ is the constant term, which only shifts the curve. We want the curve to be smoother without changing its position. The objective depends on $\lambda$ and on the squared coefficients.
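As a sketch of how this objective could be computed, assuming a design matrix `X` whose first column is all ones (so `theta[0]` is the intercept and is excluded from the penalty); the names `X`, `y`, `theta`, and `lam` are my own:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized squared-error cost; theta[0] (the intercept) is not penalized."""
    m = len(y)
    residual = X @ theta - y                    # h_theta(x^(i)) - y^(i) for all i
    penalty = lam * np.sum(theta[1:] ** 2)      # lambda * sum_j theta_j^2, j >= 1
    return (np.sum(residual ** 2) + penalty) / (2 * m)
```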
Gradient Descent
$$\begin{aligned}
&\text{Repeat } \{\\
&\quad \theta_0 = \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \\
&\quad \theta_j = \theta_j - \alpha \left[ \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \\
&\}\ \text{until the convergence condition is satisfied}
\end{aligned}$$
In the gradient-descent updates, regularization appears through the $\frac{\lambda}{m}\theta_j$ term.
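A minimal sketch of one such update in numpy, under the same assumption that column 0 of `X` is all ones so that $\theta_0$ is not shrunk:

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update; theta[0] is not shrunk."""
    m = len(y)
    error = X @ theta - y                       # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m                    # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]           # add (lambda/m) * theta_j for j >= 1
    return theta - alpha * grad
```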
Normal Equation
$$\theta = (X^TX+\lambda\cdot L)^{-1}X^T Y$$
$$L = \begin{bmatrix} 0&&&\\ &1&&\\ &&\ddots&\\ &&&1 \end{bmatrix}$$
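A minimal numpy sketch of this closed-form solution; `lam` and the shape of `X` (first column of ones) are assumptions on my part:

```python
import numpy as np

def normal_equation(X, y, lam):
    """Closed-form regularized solution; L is the identity with a 0 for theta_0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```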
Regularized Logistic Regression
The logistic regression cost function:
$$J(\theta) = -\frac{1}{m}\sum_{i = 1}^m \left( y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1- y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right)$$
Adding one term regularizes it:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left( y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1- y^{(i)}) \log\left(1- h_\theta(x^{(i)})\right)\right) + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$
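A minimal sketch of this regularized cost in numpy, assuming 0/1 labels in `y` and a leading column of ones in `X`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lambda / 2m) * sum_j theta_j^2 for j >= 1."""
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cross_entropy + (lam / (2 * m)) * np.sum(theta[1:] ** 2)
```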
In the gradient-descent iterations:
$$\theta_0 = \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_j = \theta_j - \alpha \left(\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \right)$$
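In code the update mirrors the linear-regression step, with the sigmoid as the hypothesis; a minimal sketch under the same assumptions as before:

```python
import numpy as np

def logistic_gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient step; only the hypothesis differs from linear regression."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))      # sigmoid hypothesis h_theta
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]           # theta_0 is not regularized
    return theta - alpha * grad
```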
MLE and MAP
Assume the data follows a distribution model:
$$d \sim p(d;\theta)$$
The goal is to estimate $\theta$ so that the model best matches the distribution of the data.
MLE (Maximum Likelihood Estimation)
Choose the parameter $\theta$ that maximizes the probability of the observed data.
For a given parameter $\theta$, the probability of the observed data is:
$$L(\theta) = p(D;\theta) = \prod_{i=1}^m p(d^{(i)};\theta)$$
Taking the logarithm does not affect monotonicity, so the maximizer is unchanged:
$$l(\theta) = \log L(\theta) = \sum_{i=1}^m \log p(d^{(i)};\theta)$$
$$\theta_{MLE} = \arg \max_\theta \sum_{i=1}^m \log p(d^{(i)};\theta)$$
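A small numeric illustration (my own example, not from the notes): for data drawn from a Gaussian with unknown mean and known variance, scanning the log-likelihood over candidate $\theta$ values recovers the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.normal(loc=3.0, scale=1.0, size=200)    # observed data, true mean = 3.0

def log_likelihood(theta, data, sigma=1.0):
    # sum_i log p(d^(i); theta) for a Gaussian with mean theta and std sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - theta) ** 2 / (2 * sigma**2))

thetas = np.linspace(0.0, 6.0, 601)
theta_mle = thetas[np.argmax([log_likelihood(t, d) for t in thetas])]
print(theta_mle, d.mean())                      # both close to the true mean 3.0
```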
MAP (Maximum A Posteriori Estimation)
By Bayes' rule, the posterior probability is:
$$p(\theta| D ) = p(\theta) \frac{p( D|\theta)}{p(D)}$$
$p(\theta)$: the prior probability, based on experience or a guess made before seeing any data.
$p(D)$: the probability of the observed data,
$$p(D) = \int_\theta p(\theta)p(D|\theta)\,d\theta$$
The value of $p(D)$ does not depend on $\theta$.
$$\begin{aligned}
\theta_{MAP} &= \arg \max_\theta p(\theta|D) \\
&= \arg \max_\theta p(\theta) \frac{p(D|\theta)}{p(D)}\\
&= \arg \max_\theta p(\theta)p(D| \theta)\\
&= \arg \max_\theta\left(\log p(\theta) + \sum_{i=1}^m\log p(d^{(i)};\theta)\right)
\end{aligned}$$
Comparing MLE and MAP
MLE:
$$\theta_{MLE} = \arg \max_\theta \sum_{i=1}^m \log p(d^{(i)};\theta)$$
MAP:
$$\theta_{MAP} = \arg \max_\theta\left(\log p(\theta) + \sum_{i=1}^m\log p(d^{(i)};\theta)\right)$$
Maximum likelihood estimation ignores the distribution of $\theta$ itself, implicitly treating $\theta$ as uniformly distributed, whereas MAP assumes $\theta$ follows some specific distribution. The sketch below contrasts the two on a small example.
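A small illustration (my own example): estimating a Gaussian mean from only a few samples, where MAP with a zero-mean Gaussian prior shrinks the estimate toward 0 while MLE uses the data alone.

```python
import numpy as np

rng = np.random.default_rng(2)
d = rng.normal(loc=2.0, scale=1.0, size=5)      # only 5 observations, true mean = 2.0
sigma, tau = 1.0, 1.0                           # likelihood std, prior std

theta_mle = d.mean()                            # uses the data only
# Closed-form MAP for a Gaussian likelihood with a N(0, tau^2) prior on the mean:
theta_map = d.sum() / (len(d) + sigma**2 / tau**2)
print(theta_mle, theta_map)                     # MAP is pulled toward the prior mean 0
```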
MLE solution
For each example $(x^{(i)},y^{(i)})$ in the data set,
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ follows a Gaussian distribution, $\epsilon^{(i)} \sim N(0,\sigma^2)$.
It follows that $y^{(i)}$ is also Gaussian:
$$y^{(i)} \mid x^{(i)};\theta \sim N(\theta^Tx^{(i)},\sigma^2)$$
Aside: the probability density function of the Gaussian distribution is
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
The log-likelihood then becomes:
$$l(\theta) = \log L(\theta) = m\log \frac{1}{\sqrt{2\pi}\sigma} - \frac{\sum_{i=1}^m\left(y^{(i)} - \theta^Tx^{(i)}\right)^2}{2\sigma^2}$$
Maximizing $l(\theta)$ is therefore equivalent to
$$\theta_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m\left(y^{(i)} - \theta^Tx^{(i)}\right)^2$$
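A minimal sketch (my own example) confirming that, under this Gaussian-noise model, the MLE coincides with the ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # features [1, x]
theta_true = np.array([0.5, 2.0])
y = X @ theta_true + rng.normal(0, 0.3, 50)                 # y = theta^T x + epsilon

theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)           # minimizes the squared error
print(theta_mle)                                            # close to [0.5, 2.0]
```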
MAP solution
Assume $\theta$ follows a Gaussian distribution, $\theta\sim N(0,\lambda^2)$, so that
$$p(\theta) = \frac{1}{(\sqrt{2\pi}\lambda)^n}\exp\left(-\frac{\theta^T\theta}{2\lambda^2}\right)$$
$$\begin{aligned}
\theta_{MAP} &= \arg \max_\theta\left(\log p(\theta) + \sum_{i=1}^m\log p(d^{(i)};\theta)\right)\\
&= \arg \max_\theta \left(n\log\frac{1}{\sqrt{2\pi}\lambda}-\frac{\theta^T\theta}{2\lambda^2} + \sum_{i=1}^{m}\log\left(\frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^Tx^{(i)} \right)^2}{2\sigma^2}\right)\right)\right)\\
&= \arg \max_\theta\left(n\log\frac{1}{\sqrt{2\pi}\lambda} -\frac{\theta^T\theta}{2\lambda^2} + m \log\frac{1}{\sqrt{2\pi}\sigma} - \sum_{i=1}^{m}\frac{\left( y^{(i)} - \theta^Tx^{(i)} \right)^2}{2\sigma^2}\right)
\end{aligned}$$
which is equivalent to
$$\theta_{MAP} = \arg\min_\theta \left(\frac{\theta^T\theta}{2\lambda^2} + \sum_{i=1}^m\frac{\left(y^{(i)} - \theta^Tx^{(i)}\right)^2}{2\sigma^2}\right)$$
This has the same form as the regularized linear-regression cost above, with regularization strength $\frac{\sigma^2}{\lambda^2}$, so L2 regularization can be read as placing a Gaussian prior on $\theta$.
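A minimal sketch (my own example) of the closed-form minimizer of this objective, which is the ridge-regression solution with regularization strength $\sigma^2/\lambda^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
sigma, lam = 0.3, 1.0                            # noise std, prior std lambda
y = X @ theta_true + rng.normal(0, sigma, 50)

reg = (sigma / lam) ** 2                         # sigma^2 / lambda^2
theta_map = np.linalg.solve(X.T @ X + reg * np.eye(3), X.T @ y)
print(theta_map)                                 # slightly shrunk toward 0
```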