Logistic Regression Formula Derivation

I used to think the logistic regression formula (the sigmoid function) was an arbitrary invention. Only after taking a machine learning course did I learn that it is backed by real mathematical theory, namely Bayes' theorem.

Why the Sigmoid Function?

Recall the Bayesian classifier: it is a generative learning method. To obtain $P(Y|X)$, we rewrite it in terms of $P(Y)$ and $P(X|Y)$ and estimate those two quantities from the dataset. This raises a question: could we instead estimate $P(Y|X)$ directly?

In the logistic regression model, we make the following assumptions:

  • $X$ is a real-valued vector of $n$ features, $\langle X_1, X_2, \dots, X_n \rangle$.
  • $Y$ is a Boolean variable.
  • Given $Y$, all the $X_i$ are conditionally independent of one another:
    $$P(X|Y)=P(X_1,X_2,\dots,X_n|Y)=P(X_1|Y)\,P(X_2|Y)\cdots P(X_n|Y)$$
  • Given $Y=y_k$, each $X_i$ follows a Gaussian distribution $N(\mu_{ik},\sigma_i)$:
    $$(X_i|Y=y_k)\sim N(\mu_{ik},\sigma_i)$$
    $$P(X_i|Y=0)=\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left\{-\frac{(X_i-\mu_{i0})^2}{2\sigma_i^2}\right\}$$
  • The class prior of $Y$ is a Bernoulli distribution:
    $$P(Y=1)=\pi,\quad P(Y=0)=1-\pi$$
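The generative model these assumptions describe can be sampled directly. A minimal sketch with two features; every parameter value below is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (illustrative values, not from the text):
pi = 0.4                       # P(Y = 1), the Bernoulli prior
mu0 = np.array([0.0, 1.0])     # mu_{i0}: per-feature means given Y = 0
mu1 = np.array([2.0, -1.0])    # mu_{i1}: per-feature means given Y = 1
sigma = np.array([1.0, 0.5])   # sigma_i: per-feature std, shared across classes

# Draw one sample from the assumed model:
y = rng.random() < pi          # Y ~ Bernoulli(pi)
mu = mu1 if y else mu0
x = rng.normal(mu, sigma)      # X_i | Y = y_k  ~  N(mu_{ik}, sigma_i)
```

Note that $\sigma_i$ depends only on the feature $i$, not on the class; that detail is what makes the derivation below come out linear in $X_i$.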

Now we can derive the formula.

$$
\begin{aligned}
P(Y=1|X)
&=\frac{P(X|Y=1)\,P(Y=1)}{P(X)} \\
&=\frac{P(X|Y=1)\,P(Y=1)}{P(X|Y=1)\,P(Y=1)+P(X|Y=0)\,P(Y=0)} \\
&=\frac{1}{1+\frac{P(X|Y=0)\,P(Y=0)}{P(X|Y=1)\,P(Y=1)}} \\
&=\frac{1}{1+\frac{1-\pi}{\pi}\cdot\frac{P(X|Y=0)}{P(X|Y=1)}} \\
&=\frac{1}{1+\frac{1-\pi}{\pi}\cdot\frac{\prod_i P(X_i|Y=0)}{\prod_i P(X_i|Y=1)}} \\
&=\frac{1}{1+\exp\left\{\ln\left(\frac{1-\pi}{\pi}\cdot\frac{\prod_i P(X_i|Y=0)}{\prod_i P(X_i|Y=1)}\right)\right\}} \\
&=\frac{1}{1+\exp\left\{\ln\left(\frac{1-\pi}{\pi}\right)+\ln\left(\frac{\prod_i P(X_i|Y=0)}{\prod_i P(X_i|Y=1)}\right)\right\}} \\
&=\frac{1}{1+\exp\left\{\ln\left(\frac{1-\pi}{\pi}\right)+\sum_i\left(\ln P(X_i|Y=0)-\ln P(X_i|Y=1)\right)\right\}}
\end{aligned}
$$
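This chain of equalities can be checked numerically: under the assumptions above, Bayes' rule and the final $1/(1+\exp\{\cdot\})$ form must give the same posterior. A minimal sketch, with all parameter values assumed for illustration:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Univariate normal density N(mu, sigma), applied elementwise
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative parameters (assumed values, not from the text):
pi = 0.4
mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, -1.0])
sigma = np.array([1.0, 0.5])
x = np.array([1.0, 0.3])

# Left-hand side: Bayes' rule applied directly
p1 = np.prod(gauss(x, mu1, sigma)) * pi
p0 = np.prod(gauss(x, mu0, sigma)) * (1 - pi)
bayes = p1 / (p0 + p1)

# Right-hand side: the 1 / (1 + exp{...}) form from the last line
z = np.log((1 - pi) / pi) + np.sum(
    np.log(gauss(x, mu0, sigma)) - np.log(gauss(x, mu1, sigma)))
sigmoid_form = 1.0 / (1.0 + np.exp(z))

assert np.isclose(bayes, sigmoid_form)
```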

Consider the term inside the sum in the denominator:
$$\ln P(X_i|Y=0)-\ln P(X_i|Y=1)$$

Since we have the probability density functions
$$P(X_i|Y=0)=\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left\{-\frac{(X_i-\mu_{i0})^2}{2\sigma_i^2}\right\},\quad
P(X_i|Y=1)=\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left\{-\frac{(X_i-\mu_{i1})^2}{2\sigma_i^2}\right\}$$

it follows (the normalization constants cancel because $\sigma_i$ is shared between the two classes) that
$$
\begin{aligned}
&\ln P(X_i|Y=0)-\ln P(X_i|Y=1) \\
&=-\frac{(X_i-\mu_{i0})^2}{2\sigma_i^2}+\frac{(X_i-\mu_{i1})^2}{2\sigma_i^2} \\
&=\frac{-(X_i-\mu_{i0})^2+(X_i-\mu_{i1})^2}{2\sigma_i^2} \\
&=\frac{-X_i^2+2X_i\mu_{i0}-\mu_{i0}^2+X_i^2-2X_i\mu_{i1}+\mu_{i1}^2}{2\sigma_i^2} \\
&=\frac{2(\mu_{i0}-\mu_{i1})X_i-\mu_{i0}^2+\mu_{i1}^2}{2\sigma_i^2} \\
&=\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}X_i+\frac{-\mu_{i0}^2+\mu_{i1}^2}{2\sigma_i^2}
\end{aligned}
$$
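The key point is that the log-density ratio is *linear* in $X_i$. This is easy to verify numerically; a one-feature sketch with made-up values:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    # Log of the univariate normal density N(mu, sigma)
    return -np.log(np.sqrt(2 * np.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

# Illustrative scalars for a single feature i (assumed values):
mu0, mu1, sigma, x = 0.5, 2.0, 1.3, 0.7

# Direct computation of ln P(x|Y=0) - ln P(x|Y=1)
direct = log_gauss(x, mu0, sigma) - log_gauss(x, mu1, sigma)

# The linear form derived above: slope * x + intercept
linear = (mu0 - mu1) / sigma**2 * x + (-mu0**2 + mu1**2) / (2 * sigma**2)

assert np.isclose(direct, linear)
```

If $\sigma_i$ differed between the classes, the $X_i^2$ terms would no longer cancel and the expression would be quadratic in $X_i$ rather than linear.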

Therefore,

$$
\begin{aligned}
P(Y=1|X)
&=\frac{1}{1+\exp\left\{\ln\left(\frac{1-\pi}{\pi}\right)+\sum_i\left(\ln P(X_i|Y=0)-\ln P(X_i|Y=1)\right)\right\}} \\
&=\frac{1}{1+\exp\left\{\ln\left(\frac{1-\pi}{\pi}\right)+\sum_i\left(\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}X_i+\frac{-\mu_{i0}^2+\mu_{i1}^2}{2\sigma_i^2}\right)\right\}} \\
&=\frac{1}{1+\exp\left\{\ln\left(\frac{1-\pi}{\pi}\right)+\sum_i\frac{-\mu_{i0}^2+\mu_{i1}^2}{2\sigma_i^2}+\sum_i\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}X_i\right\}}
\end{aligned}
$$

If we now define
$$w_0=\ln\left(\frac{1-\pi}{\pi}\right)+\sum_i\frac{-\mu_{i0}^2+\mu_{i1}^2}{2\sigma_i^2}$$
$$w_i=\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}$$

then
$$P(Y=1|X)=\frac{1}{1+\exp\left\{w_0+\sum_i w_iX_i\right\}}$$

Writing the weights as a vector, this becomes
$$P(Y=1|X)=\frac{1}{1+e^{wX+b}}$$
which is the familiar sigmoid function; the textbook form $\sigma(z)=\frac{1}{1+e^{-z}}$ is recovered simply by flipping the sign of the weights.
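Putting the pieces together: the weights defined above reproduce the exact Bayes posterior under the Gaussian naive Bayes assumptions. A small numerical check with illustrative (assumed) parameters:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Univariate normal density N(mu, sigma), applied elementwise
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative parameters (assumed values, not from the text):
pi = 0.4
mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, -1.0])
sigma = np.array([1.0, 0.5])
x = np.array([1.0, 0.3])

# Weights exactly as defined in the derivation
w0 = np.log((1 - pi) / pi) + np.sum((-mu0**2 + mu1**2) / (2 * sigma**2))
w = (mu0 - mu1) / sigma**2

# Sigmoid form vs. Bayes' rule applied directly
sigmoid = 1.0 / (1.0 + np.exp(w0 + w @ x))
p1 = np.prod(gauss(x, mu1, sigma)) * pi
p0 = np.prod(gauss(x, mu0, sigma)) * (1 - pi)
assert np.isclose(sigmoid, p1 / (p0 + p1))
```

This is the classical connection between Gaussian naive Bayes and logistic regression: the generative model implies a posterior of exactly logistic form, though logistic regression itself estimates $w$ directly without assuming Gaussian features.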

Loss Function Derivation

We use maximum likelihood estimation (MLE). However, $P(\langle X_i, y_i\rangle \mid w)$ is hard to estimate, since the dataset rarely contains enough such data, so we settle for the weaker maximum conditional likelihood estimation (MCLE) and maximize $P(Y=y_i \mid X_i, w)$ instead.
Here we need a concrete, real-world scenario.
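The MCLE objective, maximizing $\sum_i \ln P(Y=y_i \mid X_i, w)$, is (after negation) the familiar binary cross-entropy loss. A minimal sketch, using the standard $\sigma(z)=1/(1+e^{-z})$ sign convention; all data values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_conditional_log_likelihood(w, b, X, y):
    # MCLE objective: maximize sum_i ln P(Y = y_i | X_i, w);
    # equivalently, minimize the binary cross-entropy below.
    p = sigmoid(X @ w + b)   # P(Y = 1 | X_i, w)
    eps = 1e-12              # numerical guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Tiny illustrative dataset (assumed values):
X = np.array([[0.5], [2.0], [-1.0]])
y = np.array([1, 1, 0])
loss = neg_conditional_log_likelihood(np.array([1.0]), 0.0, X, y)
```

Because this loss is convex in $w$, it can be minimized by gradient descent, which is how logistic regression is trained in practice.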
