Chapter 3: Logistic Regression (Logit Regression)
The regression models introduced earlier do not meet the needs of a binary classification task: a regression model produces real-valued predictions, which must be converted into 0/1 labels.
This motivates the sigmoid function.
Treating y as the probability that x is a positive example, 1-y is the probability that it is a negative example, so
$$\ln\frac{y}{1-y}=\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b$$
$\frac{y}{1-y}$ is called the odds, and $\ln\frac{y}{1-y}$ is called the log-odds (logit).
$$\ln\frac{p(y=1\mid \boldsymbol{x})}{p(y=0\mid \boldsymbol{x})}=\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b$$
$$p(y=1\mid \boldsymbol{x})=\frac{e^{\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b}}{1+e^{\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b}}$$
$$p(y=0\mid \boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b}}$$
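The two posterior probabilities above can be checked numerically; a minimal sketch (the weight vector w, bias b, and input x below are made-up values for illustration):

```python
import math

def p1(w, b, x):
    """p(y=1|x): sigmoid of the linear score w^T x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.exp(z) / (1.0 + math.exp(z))

def p0(w, b, x):
    """p(y=0|x): 1 / (1 + e^{w^T x + b})."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(z))

w, b, x = [0.5, -0.25], 0.1, [1.0, 2.0]   # hypothetical values
assert abs(p1(w, b, x) + p0(w, b, x) - 1.0) < 1e-12  # the two probabilities sum to 1
```

Note that for large positive scores `math.exp(z)` can overflow; production code would use a numerically stable sigmoid.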
Estimating w and b by maximum likelihood gives
$$\ell(\boldsymbol{w},b)=\sum_{i=1}^{m}\ln p(y_i\mid \boldsymbol{x}_i;\boldsymbol{w},b)$$
With $\boldsymbol{\beta}=(\boldsymbol{w};b)$ and $\hat{\boldsymbol{x}}=(\boldsymbol{x};1)$, this can be written as
$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\ln\left(y_i p_1(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta})+(1-y_i)p_0(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta})\right)$$
Substituting
$$p_0(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta})=\frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}},\qquad p_1(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta})=\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}$$
into the expression above gives
$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\ln\left(y_i\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}+(1-y_i)\frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}\right)$$
$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\ln\left(\frac{y_i e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}+(1-y_i)}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}}\right)$$
$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(\ln\left(y_i e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}+1-y_i\right)-\ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$$
Since $y_i$ is either 0 or 1:
When $y_i=0$: $\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(-\ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$
When $y_i=1$: $\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i-\ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$
Combining the two cases:
$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i-\ln\left(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}_i}\right)\right)$$
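The case analysis above can be sanity-checked numerically: the direct form with $p_1$ and $p_0$ and the simplified combined form should agree on any data. A small sketch (the parameter vector and data are made up):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log_lik_direct(beta, X, y):
    """sum of ln(y_i * p1 + (1 - y_i) * p0), each x_hat already includes the trailing 1."""
    total = 0.0
    for x_hat, yi in zip(X, y):
        z = dot(beta, x_hat)
        p1 = math.exp(z) / (1.0 + math.exp(z))
        total += math.log(yi * p1 + (1 - yi) * (1.0 - p1))
    return total

def log_lik_simplified(beta, X, y):
    """The combined form: sum of (y_i * beta^T x_hat_i - ln(1 + e^{beta^T x_hat_i}))."""
    return sum(yi * dot(beta, x_hat) - math.log(1.0 + math.exp(dot(beta, x_hat)))
               for x_hat, yi in zip(X, y))

beta = [0.4, -0.7, 0.1]                       # hypothetical (w; b)
X = [[1.0, 2.0, 1.0], [0.5, -1.0, 1.0]]       # each x_hat = (x; 1)
y = [1, 0]
assert abs(log_lik_direct(beta, X, y) - log_lik_simplified(beta, X, y)) < 1e-12
```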
Information theory:
Self-information: $I(x)=-\log_b p(x)$, the information carried by observing the outcome x.
Information entropy: the expected value of self-information; it measures the uncertainty of a random variable X, with larger values meaning more uncertainty.
Relative entropy: measures the difference between two distributions; a typical use is to quantify the gap between an ideal distribution p(x) and a model distribution q(x).
$$D_{KL}(p\parallel q)=\sum_{x}p(x)\log_b\frac{p(x)}{q(x)}$$
$$D_{KL}(p\parallel q)=\sum_{x}p(x)\left(\log_b p(x)-\log_b q(x)\right)$$
$$D_{KL}(p\parallel q)=\sum_{x}p(x)\log_b p(x)-\sum_{x}p(x)\log_b q(x)$$
Here $-\sum_{x}p(x)\log_b q(x)$ is called the cross-entropy.
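The decomposition above can be verified on a toy pair of distributions (the probabilities below are made-up values):

```python
import math

def kl(p, q, b=2):
    """D_KL(p || q) = sum p(x) * log_b(p(x)/q(x)); terms with p(x)=0 contribute 0."""
    return sum(pi * math.log(pi / qi, b) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q, b=2):
    """H(p, q) = -sum p(x) * log_b q(x)."""
    return -sum(pi * math.log(qi, b) for pi, qi in zip(p, q) if pi > 0)

def entropy(p, b=2):
    """H(p) = -sum p(x) * log_b p(x): the expected self-information."""
    return -sum(pi * math.log(pi, b) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]   # "ideal" distribution
q = [0.5, 0.3, 0.2]   # "model" distribution
# D_KL(p || q) = cross-entropy minus entropy, matching the decomposition
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
```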
From the frequentist viewpoint, p(x) is unknown but fixed, so $\sum_{x}p(x)\log_b p(x)$ is a constant; therefore minimizing the relative entropy reduces to minimizing the cross-entropy.
The ideal distribution is
$$p(y_{i})=\left\{\begin{aligned} & p(1)=1,\ p(0)=0, && y_i=1 \\ & p(1)=0,\ p(0)=1, && y_i=0 \end{aligned}\right.$$
and the model distribution is
$$q(y_{i})=\left\{\begin{aligned} & \frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}}}=p_1(\hat{\boldsymbol{x}};\boldsymbol{\beta}), && y_i=1 \\ & \frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol{x}}}}=p_0(\hat{\boldsymbol{x}};\boldsymbol{\beta}), && y_i=0 \end{aligned}\right.$$
The cross-entropy for a single sample is then
$$-\sum_{y_i}p(y_i)\log_b q(y_i)$$
$$=-p(1)\log_b p_1(\hat{\boldsymbol{x}};\boldsymbol{\beta})-p(0)\log_b p_0(\hat{\boldsymbol{x}};\boldsymbol{\beta})$$
$$=-y_i\log_b p_1(\hat{\boldsymbol{x}};\boldsymbol{\beta})-(1-y_i)\log_b p_0(\hat{\boldsymbol{x}};\boldsymbol{\beta})$$
Setting $b=e$ gives
$$-y_i\ln p_1(\hat{\boldsymbol{x}};\boldsymbol{\beta})-(1-y_i)\ln p_0(\hat{\boldsymbol{x}};\boldsymbol{\beta})$$
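This per-sample expression with b=e is the familiar binary cross-entropy loss; a small sketch (the predicted probabilities are made-up values):

```python
import math

def binary_cross_entropy(yi, p1):
    """Per-sample cross-entropy with b = e: -y*ln(p1) - (1-y)*ln(1-p1)."""
    return -yi * math.log(p1) - (1 - yi) * math.log(1.0 - p1)

# for y=1 the loss is -ln(p1): it shrinks as the predicted p1 approaches 1
assert binary_cross_entropy(1, 0.9) < binary_cross_entropy(1, 0.6)
# for y=0 and p1=0.5, the loss is -ln(0.5) = ln 2
assert abs(binary_cross_entropy(0, 0.5) - math.log(2)) < 1e-12
```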
Summing over all samples and simplifying, minimizing the total cross-entropy is equivalent to maximizing
$$\sum_{i=1}^{m}\left(y_{i} \boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}-\ln \left(1+e^{\boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}}\right)\right)$$
In short, the two routes (maximum likelihood and cross-entropy minimization) arrive at the same objective.
Gradient descent: an iterative algorithm that exploits the fact that the gradient points in the direction of fastest increase of a function; each iteration moves in the direction opposite to the gradient, so the function value decreases as the iterations proceed.
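A minimal gradient-descent sketch for the logistic-regression objective above, minimizing $-\ell(\boldsymbol{\beta})$, whose gradient is $\sum_i (p_1(\hat{\boldsymbol{x}}_i;\boldsymbol{\beta})-y_i)\hat{\boldsymbol{x}}_i$. The toy data, learning rate, and step count are made-up values:

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, lr=0.1, steps=200):
    """Minimize -l(beta) by moving against its gradient at each iteration."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x_hat, yi in zip(X, y):
            err = sigmoid(dot(beta, x_hat)) - yi   # (p1 - y_i)
            for j in range(len(beta)):
                grad[j] += err * x_hat[j]
        beta = [bj - lr * gj for bj, gj in zip(beta, grad)]
    return beta

X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]  # each x_hat = (x; 1)
y = [0, 0, 1, 1]                                      # toy separable labels
beta = gradient_descent(X, y)
# the fitted model should separate the toy data
assert sigmoid(dot(beta, X[0])) < 0.5 and sigmoid(dot(beta, X[3])) > 0.5
```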
Newton's method: iterative like gradient descent, but it additionally requires that $x^{t+1}$ be a minimum point within a neighborhood of $x^{t}$ (each step jumps to the minimum of a local second-order approximation).
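A one-variable Newton-step sketch: the update $x^{t+1}=x^{t}-f'(x^{t})/f''(x^{t})$ lands on the minimum of the local quadratic model. The test function and starting point here are made up:

```python
def newton_minimize(f_prime, f_double_prime, x0, steps=20):
    """1-D Newton's method: each step minimizes the local quadratic approximation."""
    x = x0
    for _ in range(steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# minimize f(x) = (x - 3)^2 + 1; f'(x) = 2(x - 3), f''(x) = 2
x_min = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0)
assert abs(x_min - 3.0) < 1e-9  # exact on a quadratic after one step
```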
Thanks to the Datawhale team for their contributions; this study mainly followed the video:
https://www.bilibili.com/video/BV1Mh411e7VU/?p=5&spm_id_from=333.880.my_history.page.click&vd_source=7f1a93b833d8a7093eb3533580254fe4