Machine Learning (3): Regularization

Background: Overfitting

(Figure omitted: polynomial fits of increasing degree.)
As the figure shows, a model with only a first-order term does not fit well enough (underfitting), while a model with a fifth-order term overfits: the hypothesis fits the given data very well, but generalizes poorly to new data.
Two ways to address overfitting:

1. Reduce the number of features.

2. Regularization:

  • Keep all the features, but shrink the coefficients $\theta_j$ to weaken their influence.
  • Regularization works well when we have many features that each contribute a small amount.

线性回归正则化 Regularized Linear Regression

Consider the hypothesis

$$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

Suppressing the influence of $\theta_3 x^3 + \theta_4 x^4$ makes the curve smoother. The cost function can be redefined to penalize those terms heavily:

$$\min_{\theta} \frac{1}{2m}\left[ \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2\right]$$

The more general definition:

$$\min_{\theta} \frac{1}{2m}\left[ \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 +\lambda\sum_{j=1}^{n}\theta_j^2\right]$$

Note that $\theta_0$ is not regularized: it is the constant term, a vertical shift. We want the curve to be smoother, not to change its position. This objective trades off fitting the data against keeping the squared coefficients small, with $\lambda$ controlling the balance.

Gradient Descent

$$
\begin{aligned}
&\text{Repeat until convergence:}\\
&\quad \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \\
&\quad \theta_j := \theta_j - \alpha \left[ \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]
\end{aligned}
$$
In gradient descent, regularization shows up as the extra $\frac{\lambda}{m}\theta_j$ term in the update for each $\theta_j$ with $j \ge 1$.
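The update loop above can be sketched in NumPy as follows (the function name and the toy data are my own illustration; the intercept column $x_0 = 1$ is assumed to be the first column of `X`):

```python
import numpy as np

def gradient_descent_ridge(X, y, lam=1.0, alpha=0.1, iters=1000):
    """L2-regularized linear regression via batch gradient descent.

    X is assumed to already contain a leading column of ones (x_0 = 1);
    theta_0 (the intercept) is deliberately left out of the penalty.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m      # (1/m) * sum_i (h(x_i) - y_i) * x_i
        grad[1:] += (lam / m) * theta[1:]     # shrink every theta_j except theta_0
        theta -= alpha * grad
    return theta
```

With `lam=0` this reduces to ordinary unregularized gradient descent.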

Normal Equation

$$\theta = \left(X^TX+\lambda\cdot L\right)^{-1}X^T y$$
$$L = \begin{bmatrix} 0&&&\\ &1&&\\ &&\ddots&\\ &&&1 \end{bmatrix}$$
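The closed form translates directly into NumPy (function name and test data are illustrative assumptions):

```python
import numpy as np

def normal_equation_ridge(X, y, lam=1.0):
    """Closed-form regularized solution: theta = (X^T X + lam * L)^{-1} X^T y,
    where L is the identity matrix with its (0, 0) entry zeroed so that the
    intercept theta_0 is not penalized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

Because of the zero in the top-left corner of $L$, increasing $\lambda$ shrinks the slope coefficients but leaves the intercept free.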

Regularized Logistic Regression

The logistic regression cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i = 1}^m \left( y^{(i)} \log h_\theta(x^{(i)}) + (1- y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right)$$

Adding one term regularizes it:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left( y^{(i)} \log h_\theta(x^{(i)}) + (1- y^{(i)}) \log\left(1- h_\theta(x^{(i)})\right)\right) + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$
The gradient descent updates become:

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left(\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \right)$$

These look identical to the linear-regression updates, but here $h_\theta$ is the sigmoid hypothesis.
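A minimal sketch of one such update step (helper names and data are my own; only the hypothesis changes relative to the linear case):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_step(theta, X, y, lam, alpha):
    """One regularized gradient-descent step for logistic regression.
    Same gradient shape as the linear case, but h_theta is the sigmoid."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m            # (1/m) * sum_i (h(x_i) - y_i) * x_i
    grad[1:] += (lam / m) * theta[1:]   # theta_0 is not regularized
    return theta - alpha * grad
```

Iterating this step until convergence minimizes the regularized cost $J(\theta)$ above.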

MLE and MAP

Assume the data follow a parametric model:
$$d \sim p(d;\theta)$$
The goal is to estimate $\theta$ so that the model best matches the distribution of the data.

MLE: Maximum Likelihood Estimation

Choose the parameter $\theta$ that makes the observed data most probable. For a given $\theta$, the probability of the data (assuming independent samples) is:

$$L(\theta) = p(D;\theta) = \prod_{i=1}^m p(d^{(i)};\theta)$$

Taking the logarithm does not affect monotonicity, so the extrema are unchanged:

$$l(\theta) = \log L(\theta) = \sum_{i=1}^m \log p(d^{(i)};\theta)$$

$$\theta_{MLE} = \arg \max_\theta \sum_{i=1}^m \log p(d^{(i)};\theta)$$
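As a concrete instance (a Bernoulli coin-flip model, chosen here purely for illustration), the MLE maximizes exactly this log-likelihood and lands on the sample mean:

```python
import numpy as np

# Bernoulli model: d^(i) in {0, 1}, p(d; theta) = theta^d * (1 - theta)^(1 - d).
# l(theta) = sum_i [ d_i * log(theta) + (1 - d_i) * log(1 - theta) ];
# setting dl/dtheta = 0 gives the closed form theta_MLE = mean(d).
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Numeric arg-max of l(theta) over a fine grid, avoiding log(0) at the ends.
thetas = np.linspace(0.01, 0.99, 9801)
loglik = data.sum() * np.log(thetas) + (1 - data).sum() * np.log(1 - thetas)
theta_mle = thetas[np.argmax(loglik)]   # should agree with data.mean()
```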

MAP: Maximum A Posteriori Estimation

By Bayes' rule, the posterior probability is

$$p(\theta\mid D ) = p(\theta)\, \frac{p( D\mid\theta)}{p(D)}$$

$p(\theta)$: the prior probability — what we believe or guess about $\theta$ before seeing any data.
$p(D)$: the probability of the data,
$$p(D) = \int_\theta p(\theta)\,p(D\mid\theta)\,d\theta$$
Since $p(D)$ does not depend on $\theta$, it can be dropped from the maximization:
$$
\begin{aligned}
\theta_{MAP} &= \arg \max_\theta p(\theta\mid D) \\
&= \arg \max_\theta p(\theta)\, \frac{p(D\mid\theta)}{p(D)}\\
&= \arg \max_\theta p(\theta)\,p(D\mid \theta)\\
&=\arg \max_\theta\left(\log p(\theta) + \sum_{i=1}^m\log p(d^{(i)};\theta)\right)
\end{aligned}
$$

Comparing MLE and MAP

MLE:
$$\theta_{MLE} = \arg \max_\theta \sum_{i=1}^m \log p(d^{(i)};\theta)$$
MAP:
$$\theta_{MAP} = \arg \max_\theta\left(\log p(\theta) + \sum_{i=1}^m\log p(d^{(i)};\theta)\right)$$

Maximum likelihood ignores the distribution of $\theta$ itself — in effect it treats $\theta$ as uniformly distributed — whereas MAP assumes $\theta$ follows some prior distribution.

MLE solution

For each $(x^{(i)},y^{(i)})$ in the data set,
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where the noise $\epsilon^{(i)}$ is Gaussian:
$$\epsilon^{(i)} \sim N(0,\sigma^2)$$

It follows that $y^{(i)}$ is also Gaussian:
$$y^{(i)} \mid x^{(i)};\theta \sim N(\theta^Tx^{(i)},\sigma^2)$$

Recall the Gaussian probability density function:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The log-likelihood then becomes
$$l(\theta) = \log L(\theta) = m\log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{\sum_{i=1}^m\left(y^{(i)} - \theta^Tx^{(i)}\right)^2}{2\sigma^2}$$
Maximizing $l(\theta)$ is therefore equivalent to minimizing the squared error:
$$\theta_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m\left(y^{(i)} - \theta^Tx^{(i)}\right)^2$$
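This equivalence can be checked numerically (the synthetic data and the `log_likelihood` helper are my own illustration): the least-squares solution is also the maximizer of the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=50)]        # design matrix with intercept
sigma = 0.5                                        # noise scale, assumed known
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=50)

def log_likelihood(theta):
    # l(theta) = m * log(1 / (sqrt(2*pi) * sigma)) - sum(residuals^2) / (2 * sigma^2)
    resid = y - X @ theta
    return len(y) * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - resid @ resid / (2 * sigma**2)

# The squared-error minimizer; perturbing it in any direction lowers l(theta).
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]
```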

MAP solution

Assume $\theta$ has a Gaussian prior, $\theta\sim N(0,\lambda^2 I)$, so that
$$p(\theta) = \frac{1}{(\sqrt{2\pi}\,\lambda)^n}\exp\left(-\frac{\theta^T\theta}{2\lambda^2}\right)$$

$$
\begin{aligned}
\theta_{MAP} &= \arg \max_\theta\left(\log p(\theta) + \sum_{i=1}^m\log p(d^{(i)};\theta)\right)\\
&=\arg \max_\theta \left(n\log\frac{1}{\sqrt{2\pi}\,\lambda}-\frac{\theta^T\theta}{2\lambda^2} + \sum_{i=1}^{m}\log\left(\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^Tx^{(i)} \right)^2}{2\sigma^2}\right)\right)\right)\\
&=\arg \max_\theta\left(n\log\frac{1}{\sqrt{2\pi}\,\lambda} -\frac{\theta^T\theta}{2\lambda^2} + m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{i=1}^{m}\frac{\left( y^{(i)} - \theta^Tx^{(i)} \right)^2}{2\sigma^2}\right)
\end{aligned}
$$

Dropping the constant terms, this is equivalent to
$$\theta_{MAP} = \arg\min_\theta \left(\frac{\theta^T\theta}{2\lambda^2} + \sum_{i=1}^m\frac{\left(y^{(i)} - \theta^Tx^{(i)}\right)^2}{2\sigma^2}\right)$$
which is exactly L2-regularized (ridge) least squares: the Gaussian prior on $\theta$ plays the role of the regularization term.
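A quick numeric check (synthetic data and helper names are my own): setting the gradient of the MAP objective to zero gives the ridge normal equation with effective regularization strength $\sigma^2/\lambda^2$, and the resulting $\theta$ does minimize the objective.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
sigma, lam = 0.3, 1.0
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=30)

def map_objective(theta):
    # theta^T theta / (2 lam^2)  +  sum_i (y_i - theta^T x_i)^2 / (2 sigma^2)
    return theta @ theta / (2 * lam**2) + np.sum((y - X @ theta)**2) / (2 * sigma**2)

# Gradient = 0  =>  (X^T X + (sigma^2 / lam^2) I) theta = X^T y
reg = sigma**2 / lam**2
theta_map = np.linalg.solve(X.T @ X + reg * np.eye(3), X.T @ y)
```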
