【Datawhale - Machine Learning - Task03 - Logistic Regression】

Chapter 3: Logistic Regression (Log-Odds Regression)
The regression models introduced earlier do not directly meet the needs of a binary classification task: the prediction they produce is a real value, which must be converted into a 0/1 label.
This motivates the sigmoid function:

$$y = \frac{1}{1+e^{-z}}$$
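As a minimal sketch (my own illustration, not from the original notes), the sigmoid is a one-liner in NumPy; the function name `sigmoid` is an assumption of this example:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to (0, 1) via y = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Large positive scores approach 1, large negative scores approach 0.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```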

Treating y as the probability that x is a positive example, 1 − y is then the probability that it is a negative example, so

$$\ln\frac{y}{1-y} = \boldsymbol{w}^{\mathrm{T}}\boldsymbol{x} + b$$

Here $\frac{y}{1-y}$ is called the odds, and $\ln\frac{y}{1-y}$ is called the log-odds (logit).
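As a quick numerical illustration (an example of my own, not from the text): for $y = 0.8$,

$$\frac{y}{1-y} = \frac{0.8}{0.2} = 4, \qquad \ln\frac{y}{1-y} = \ln 4 \approx 1.386,$$

so odds above 1 (log-odds above 0) mean the positive class is the more likely one.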

$$\ln\frac{p(y=1\mid \boldsymbol{x})}{p(y=0\mid \boldsymbol{x})} = \boldsymbol{w}^{\mathrm{T}}\boldsymbol{x} + b$$
$$p(y=1\mid \boldsymbol{x}) = \frac{e^{\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b}}{1+e^{\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b}}$$
$$p(y=0\mid \boldsymbol{x}) = \frac{1}{1+e^{\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b}}$$

Estimating w and b by maximum likelihood, we maximize the log-likelihood

$$\ell(\boldsymbol{w},b) = \sum_{i=1}^{m}\ln p\left(y_i\mid \boldsymbol{x}_i;\boldsymbol{w},b\right)$$

Writing $\boldsymbol{\beta} = (\boldsymbol{w}; b)$ and $\hat{\boldsymbol{x}} = (\boldsymbol{x}; 1)$ so that $\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}} = \boldsymbol{w}^{\mathrm{T}}\boldsymbol{x} + b$, the objective can be written as

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\ln\left(y_i\, p_1\left(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta}\right) + (1-y_i)\, p_0\left(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta}\right)\right)$$

where

$$p_0\left(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta}\right) = \frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}, \qquad p_1\left(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta}\right) = \frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}.$$

Substituting these into the expression above gives


$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\ln\left(y_i\,\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}} + (1-y_i)\,\frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}\right)$$
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\ln\left(\frac{y_i\, e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i} + (1-y_i)}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}\right)$$
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\left(\ln\left(y_i\, e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i} + 1 - y_i\right) - \ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$$

Since $y_i$ is either 0 or 1:

When $y_i = 0$: $\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\left(-\ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$

When $y_i = 1$: $\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\left(y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i - \ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$
Combining the two cases gives

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m}\left(y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i - \ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$$
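A minimal NumPy sketch of this objective, assuming a design matrix `X_hat` whose last column is all ones so that `beta` stacks (w; b); the variable names are my own:

```python
import numpy as np

def log_likelihood(beta, X_hat, y):
    """sum_i ( y_i * beta^T x_hat_i - ln(1 + exp(beta^T x_hat_i)) )."""
    z = X_hat @ beta                      # beta^T x_hat_i for every sample
    # np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way.
    return np.sum(y * z - np.logaddexp(0.0, z))

# Toy usage: 3 samples, 2 features plus a bias column of ones.
X_hat = np.array([[1.0, 2.0, 1.0],
                  [0.5, -1.0, 1.0],
                  [-2.0, 0.3, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(log_likelihood(np.zeros(3), X_hat, y))  # equals -3 * ln(2) at beta = 0
```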

Information theory:
Self-information: $I(x) = -\log_b p(x)$, the amount of information carried by observing the outcome x.
Information entropy: the expected value of self-information; it measures the uncertainty of a random variable X, and a larger value means greater uncertainty.
Relative entropy (KL divergence): measures the difference between two distributions; a typical use is to measure the gap between an ideal distribution p(x) and a model (approximating) distribution q(x).


$$D_{KL}(p\parallel q) = \sum_{x} p(x)\log_b\frac{p(x)}{q(x)}$$
$$D_{KL}(p\parallel q) = \sum_{x} p(x)\left(\log_b p(x) - \log_b q(x)\right)$$
$$D_{KL}(p\parallel q) = \sum_{x} p(x)\log_b p(x) - \sum_{x} p(x)\log_b q(x)$$

Here, $-\sum_{x} p(x)\log_b q(x)$ is called the cross-entropy.
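A small sketch of these quantities for discrete distributions (my own example, assuming base-e logarithms and strictly positive probabilities):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * ln(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * ln(q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p, q = [0.7, 0.3], [0.5, 0.5]
entropy_p = -np.sum(np.asarray(p) * np.log(p))
# The gap between cross-entropy and entropy is exactly the KL divergence.
print(np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy_p))  # True
```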

From the frequentist viewpoint, p(x) is unknown but fixed, so $\sum_{x} p(x)\log_b p(x)$ is a constant; therefore minimizing the relative entropy reduces to minimizing the cross-entropy.
Ideal distribution:


$$p(y_i) = \begin{cases} p(1)=1,\; p(0)=0, & y_i = 1 \\ p(1)=0,\; p(0)=1, & y_i = 0 \end{cases}$$

Model (approximating) distribution:


$$q(y_i) = \begin{cases} \dfrac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}}} = p_1\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right), & y_i = 1 \\[2ex] \dfrac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}}} = p_0\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right), & y_i = 0 \end{cases}$$

Then the cross-entropy of a single sample is


$$-\sum_{y_i} p(y_i)\log_b q(y_i)$$
$$= -p(1)\log_b p_1\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right) - p(0)\log_b p_0\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right)$$
$$= -y_i\log_b p_1\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right) - (1-y_i)\log_b p_0\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right)$$

Setting b = e gives $-y_i\ln p_1\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right) - (1-y_i)\ln p_0\left(\hat{\boldsymbol{x}};\boldsymbol{\beta}\right)$.

Summing over all m samples and simplifying, minimizing the total cross-entropy is equivalent to maximizing
$$\sum_{i=1}^{m}\left(y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i - \ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$$
In short, the two routes (maximum likelihood and cross-entropy minimization) arrive at the same objective.
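To make this concrete, here is a small numerical check of my own (the values of `beta` and `x_hat` are arbitrary illustrations): the per-sample cross-entropy $-y_i\ln p_1 - (1-y_i)\ln p_0$ equals the negative of the per-sample log-likelihood term $y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i - \ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i})$:

```python
import numpy as np

beta = np.array([0.4, -0.7, 0.1])      # illustrative parameters (w; b)
x_hat = np.array([1.5, 2.0, 1.0])      # illustrative sample with trailing bias 1
z = beta @ x_hat
p1 = np.exp(z) / (1.0 + np.exp(z))     # p(y=1 | x)
p0 = 1.0 - p1                          # p(y=0 | x)

for y in (0.0, 1.0):
    cross_entropy = -y * np.log(p1) - (1.0 - y) * np.log(p0)
    log_lik_term = y * z - np.log(1.0 + np.exp(z))
    print(np.isclose(cross_entropy, -log_lik_term))  # True for both labels
```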

Gradient descent: an iterative optimization method. It exploits the fact that the gradient points in the direction in which the function value increases fastest, so each iteration moves in the direction opposite to the gradient, making the function value decrease as the iterations proceed.
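A minimal gradient-descent sketch for this setting (my own code, not from the notes), using the standard gradient of the negative log-likelihood, $\sum_i \hat{\boldsymbol{x}}_i\left(p_1(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta}) - y_i\right)$; the learning rate, iteration count, and toy data are arbitrary illustrative choices:

```python
import numpy as np

def fit_logistic_gd(X_hat, y, lr=0.1, n_iters=2000):
    """Gradient descent on the negative log-likelihood of logistic regression."""
    beta = np.zeros(X_hat.shape[1])
    for _ in range(n_iters):
        p1 = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))  # p(y=1 | x) per sample
        grad = X_hat.T @ (p1 - y)                   # gradient of the NLL
        beta -= lr * grad                           # step against the gradient
    return beta

# Toy data: one feature plus a bias column; the two classes overlap slightly.
X_hat = np.array([[2.5, 1.0], [1.0, 1.0], [0.5, 1.0],
                  [-0.5, 1.0], [-1.0, 1.0], [-2.5, 1.0]])
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(fit_logistic_gd(X_hat, y))
```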
Newton's method: like gradient descent it is iterative, except that it additionally requires $x^{t+1}$ to be a minimum point within a neighborhood of $x^{t}$ (in practice, the minimizer of a local second-order approximation of the function).
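A corresponding Newton's-method sketch (again my own illustration), using the standard Hessian of the negative log-likelihood, $\sum_i \hat{\boldsymbol{x}}_i\hat{\boldsymbol{x}}_i^{\mathrm{T}}\,p_1(1-p_1)$; this formula is standard but not derived in the notes above:

```python
import numpy as np

def fit_logistic_newton(X_hat, y, n_iters=10):
    """Newton's method on the negative log-likelihood of logistic regression."""
    beta = np.zeros(X_hat.shape[1])
    for _ in range(n_iters):
        p1 = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))  # p(y=1 | x) per sample
        grad = X_hat.T @ (p1 - y)                   # first derivative
        w = p1 * (1.0 - p1)                         # per-sample curvature weights
        hess = X_hat.T @ (X_hat * w[:, None])       # Hessian: X^T diag(w) X
        beta -= np.linalg.solve(hess, grad)         # Newton update step
    return beta

# Same toy data as the gradient-descent sketch; Newton needs far fewer steps.
X_hat = np.array([[2.5, 1.0], [1.0, 1.0], [0.5, 1.0],
                  [-0.5, 1.0], [-1.0, 1.0], [-2.5, 1.0]])
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(fit_logistic_newton(X_hat, y))
```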

Thanks to the Datawhale team for their contributions. This study mainly followed the video:
https://www.bilibili.com/video/BV1Mh411e7VU/?p=5&spm_id_from=333.880.my_history.page.click&vd_source=7f1a93b833d8a7093eb3533580254fe4
