Logistic Regression

【Maximum-Likelihood Interpretation of Cross-Entropy】

Probabilistic assumptions:

$$P(y=1 \mid x;\theta) = h_\theta(x)$$

$$P(y=0 \mid x;\theta) = 1 - h_\theta(x)$$

Note: $h_\theta(x)$ is simply $\hat{y}$.

Combining these two expressions gives

$$P(y \mid x;\theta) = \left( h_\theta(x) \right)^y \left( 1 - h_\theta(x) \right)^{1-y}$$

The likelihood function of the parameters $\theta$ is defined as

$$\begin{aligned}L(\theta) &= L(\theta;X,y) = P(y \mid x;\theta) \\ &= \prod_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)};\theta\right) \\ &= \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1-y^{(i)}}\end{aligned}$$

The log-likelihood of $\theta$ is

$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \sum_{i=1}^{m} \left[ y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left( 1 - h_\theta(x^{(i)}) \right) \right] \end{aligned}$$

Maximizing $l(\theta)$ is therefore equivalent to minimizing

$$J(\theta) = -l(\theta) = -\sum_{i=1}^{m} \left[ y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left( 1 - h_\theta(x^{(i)}) \right) \right]$$
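As a minimal sketch, the cost $J(\theta)$ above can be computed directly (the function names `sigmoid` and `cross_entropy_loss` are illustrative, not from the original):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    """J(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ], y_i in {0, 1}."""
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every sample
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```

Minimizing this quantity with, e.g., gradient descent recovers the maximum-likelihood estimate of $\theta$.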

【Interpretation via Log-Odds】

Given

$$P(y=1 \mid x)=\frac{1}{1+e^{-z}}=\frac{e^z}{1+e^z}$$

$$P(y=0 \mid x)=1-\frac{1}{1+e^{-z}}=\frac{1}{1+e^z}$$

by the definition of log-odds (the logarithm of the ratio of the probability that an event occurs to the probability that it does not),

$$\log\frac{P(y=1 \mid x)}{P(y=0 \mid x)}=z=wx+b$$

Hence the log-odds of the event $y=1$ is a linear function of the features $x$.

This also shows how the sigmoid function can be derived:

$$\begin{aligned} \log\frac{\hat{y}}{1-\hat{y}}&=z \\ \frac{\hat{y}}{1-\hat{y}}&=e^z \\ \hat{y}&=e^z\left(1-\hat{y}\right) \\ \hat{y}&=e^z-\hat{y}e^z \\ \left(1+e^z\right)\hat{y}&=e^z \\ \hat{y}&=\frac{e^z}{1+e^z} \\ \hat{y}&=\frac{1}{1+e^{-z}} \end{aligned}$$
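The derivation says the log-odds (logit) and the sigmoid are inverses of each other, which is easy to check numerically (a sketch; the helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds log(p / (1 - p)), the inverse of the sigmoid for p in (0, 1)."""
    return np.log(p / (1.0 - p))
```

For any real $z$, `logit(sigmoid(z))` returns $z$ (up to floating-point error).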

【Loss Function under $\pm 1$ Labels】

We use the sigmoid function to convert $\theta^Tx$ into a probability:

$$P(y=1 \mid x;\theta)=h_\theta(x)=\mathrm{sigmoid}(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$$

$$P(y=-1 \mid x;\theta)=1-h_\theta(x)=1-\mathrm{sigmoid}(\theta^Tx)=\frac{1}{1+e^{\theta^Tx}}$$

Combining these two expressions gives

$$P(y \mid x;\theta)=\frac{1}{1+e^{-y\theta^Tx}}$$

(This is the elegant part: the label $y \in \{+1,-1\}$ is absorbed directly into the sigmoid.)

The likelihood function of the parameters $\theta$ is defined as

$$\begin{aligned} L(\theta) &= L(\theta;X,y) = P(y \mid x;\theta) \\ &= \prod_{i=1}^{m}P\left(y^{(i)} \mid x^{(i)};\theta\right) \\ &= \prod_{i=1}^{m} \frac{1}{1+e^{-y^{(i)} \theta^T x^{(i)}}} \end{aligned}$$

The log-likelihood of $\theta$ is

$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= -\sum_{i=1}^{m} \log\left( 1+e^{-y^{(i)} \theta^T x^{(i)}} \right) \end{aligned}$$

Maximizing $l(\theta)$ is therefore equivalent to minimizing

$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\log\left( 1+e^{-y^{(i)} \theta^T x^{(i)}} \right)$$
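This $\pm 1$-label cost can be sketched as follows (the function name is illustrative; `np.log1p(x)` computes $\log(1+x)$ accurately for small arguments, which keeps the loss stable when the margin $y\,\theta^Tx$ is large):

```python
import numpy as np

def logistic_loss_pm1(theta, X, y):
    """J(theta) = (1/m) * sum_i log(1 + exp(-y_i * theta^T x_i)), y_i in {-1, +1}."""
    m = X.shape[0]
    margins = y * (X @ theta)          # y_i * theta^T x_i for every sample
    return np.sum(np.log1p(np.exp(-margins))) / m
```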

To summarize, we derived the likelihood functions, in the probabilistic sense, for labels in $\{0,1\}$ and for labels in $\{-1,1\}$.

For labels in $\{0,1\}$, we do not expand $h_\theta(x)$ into its explicit form; instead, the exponent trick $\left( h_\theta(x) \right)^y \left( 1 - h_\theta(x) \right)^{1-y}$ uses the label to unify the $y=1$ and $y=0$ cases.

For labels in $\{-1,1\}$, we expand $h_\theta(x)$ into its explicit form and observe that the two expressions $\frac{1}{1+e^{-\theta^Tx}}$ and $\frac{1}{1+e^{\theta^Tx}}$ differ only by a sign in the exponent, so the label itself can unify the $y=1$ and $y=-1$ cases.
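The two conventions describe the same model, so the per-sample losses must agree once the labels are mapped via $y_{01} = (y_{\pm 1}+1)/2$. A quick numerical check (a sketch; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_01(z, y01):
    """Per-sample cross-entropy with y in {0, 1}, z = theta^T x."""
    h = sigmoid(z)
    return -(y01 * np.log(h) + (1.0 - y01) * np.log(1.0 - h))

def loss_pm1(z, ypm):
    """Per-sample logistic loss with y in {-1, +1}, z = theta^T x."""
    return np.log1p(np.exp(-ypm * z))
```

For every $z$ and label pair related by $y_{01} = (y_{\pm 1}+1)/2$, the two functions return the same value up to floating-point error.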
