I used to think the logistic regression formula (the sigmoid function) was something made up by hand. Only after taking a machine learning course did I learn that it is backed by real mathematical theory, namely Bayes' theorem.
Why the Sigmoid Function?
Recall the Bayes classifier: it is a generative learning method. To obtain $P(Y|X)$, we rewrite it in terms of $P(Y)$ and $P(X|Y)$ and then estimate those two quantities from the dataset. This raises a question: can we estimate $P(Y|X)$ directly instead?
In the logistic regression model, we make the following assumptions:
- Let $X$ be a real-valued vector of $n$ features, $\langle X_1, X_2, \dots, X_n \rangle$.
- Let $Y$ be a Boolean-valued variable.
- Assume that, given $Y$, all of the $X_i$ are conditionally independent, i.e.
$$P(X|Y) = P(X_1, X_2, \dots, X_n|Y) = P(X_1|Y)P(X_2|Y) \dots P(X_n|Y)$$
- Assume that, given $Y=y_k$, each $X_i$ follows a Gaussian distribution $N(\mu_{ik}, \sigma_i)$, i.e. $(X_i|Y=y_k) \sim N(\mu_{ik}, \sigma_i)$. For example,
$$P\left( X_i|Y=0 \right) =\frac{1}{\sqrt{2\pi}\sigma _i}\exp \left\{ -\frac{\left( X_i-\mu _{i0} \right) ^2}{2\sigma _{i}^{2}} \right\}$$
- Assume the class prior on $Y$ is Bernoulli, i.e. $P(Y=1)=\pi$ and $P(Y=0)=1-\pi$.
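All of the generative parameters in these assumptions ($\pi$, $\mu_{ik}$, and the shared per-feature $\sigma_i$) can be estimated directly from a labeled dataset. A minimal sketch in Python (the function name and estimator choices are my own, not from the original post):

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate the Gaussian naive Bayes parameters pi, mu_ik, sigma_i.

    X: (m, n) real-valued feature matrix; y: (m,) 0/1 labels.
    sigma_i is shared across the two classes, matching the assumption above.
    """
    pi = y.mean()  # P(Y=1), the Bernoulli prior
    # per-class, per-feature means: mu[k, i] = mean of X_i among examples with Y=k
    mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
    # residuals against each example's own class mean share one variance per feature
    resid = X - mu[y.astype(int)]
    sigma = resid.std(axis=0)
    return pi, mu, sigma
```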
With these assumptions in place, we can start deriving the formula.
$$\begin{aligned}
P\left( Y=1|X \right) &= \frac{P\left( X|Y=1 \right) P\left( Y=1 \right)}{P\left( X \right)} \\
&= \frac{P\left( X|Y=1 \right) P\left( Y=1 \right)}{P\left( X|Y=1 \right) P\left( Y=1 \right) +P\left( X|Y=0 \right) P\left( Y=0 \right)} \\
&= \frac{1}{1+\frac{P\left( X|Y=0 \right) P\left( Y=0 \right)}{P\left( X|Y=1 \right) P\left( Y=1 \right)}} \\
&= \frac{1}{1+\frac{1-\pi}{\pi}\cdot \frac{P\left( X|Y=0 \right)}{P\left( X|Y=1 \right)}} \\
&= \frac{1}{1+\frac{1-\pi}{\pi}\cdot \frac{\prod_i{P\left( X_i|Y=0 \right)}}{\prod_i{P\left( X_i|Y=1 \right)}}} \\
&= \frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi}\cdot \frac{\prod_i{P\left( X_i|Y=0 \right)}}{\prod_i{P\left( X_i|Y=1 \right)}} \right) \right\}} \\
&= \frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi} \right) +\ln \left( \frac{\prod_i{P\left( X_i|Y=0 \right)}}{\prod_i{P\left( X_i|Y=1 \right)}} \right) \right\}} \\
&= \frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi} \right) +\sum_i{\left( \ln P\left( X_i|Y=0 \right) -\ln P\left( X_i|Y=1 \right) \right)} \right\}}
\end{aligned}$$
Now look at the last term in the denominator's exponent,
$$\ln P\left( X_i|Y=0 \right) -\ln P\left( X_i|Y=1 \right)$$
Since we have the probability density functions
$$P\left( X_i|Y=0 \right) =\frac{1}{\sqrt{2\pi}\sigma _i}\exp \left\{ -\frac{\left( X_i-\mu _{i0} \right) ^2}{2\sigma _{i}^{2}} \right\}, \qquad P\left( X_i|Y=1 \right) =\frac{1}{\sqrt{2\pi}\sigma _i}\exp \left\{ -\frac{\left( X_i-\mu _{i1} \right) ^2}{2\sigma _{i}^{2}} \right\}$$
it follows that
$$\begin{aligned}
\ln P\left( X_i|Y=0 \right) -\ln P\left( X_i|Y=1 \right) &= -\frac{\left( X_i-\mu _{i0} \right) ^2}{2\sigma _{i}^{2}}+\frac{\left( X_i-\mu _{i1} \right) ^2}{2\sigma _{i}^{2}} \\
&= \frac{-\left( X_i-\mu _{i0} \right) ^2+\left( X_i-\mu _{i1} \right) ^2}{2\sigma _{i}^{2}} \\
&= \frac{-X_i^2+2X_i\mu _{i0}-\mu _{i0}^{2}+X_i^2-2X_i\mu _{i1}+\mu _{i1}^{2}}{2\sigma _{i}^{2}} \\
&= \frac{2\left( \mu _{i0}-\mu _{i1} \right) X_i-\mu _{i0}^{2}+\mu _{i1}^{2}}{2\sigma _{i}^{2}} \\
&= \frac{\mu _{i0}-\mu _{i1}}{\sigma _{i}^{2}}X_i+\frac{-\mu _{i0}^{2}+\mu _{i1}^{2}}{2\sigma _{i}^{2}}
\end{aligned}$$
Note that the quadratic $X_i^2$ terms cancel, leaving an expression that is linear in $X_i$; this cancellation is exactly why the equal-variance assumption (the same $\sigma_i$ for both classes) matters.
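The algebra above can be sanity-checked numerically: the log-density difference computed from the two Gaussians should equal the simplified linear form at every point. A small Python sketch (helper names are hypothetical):

```python
import math

def log_density_diff(x, mu0, mu1, sigma):
    """ln P(x|Y=0) - ln P(x|Y=1), computed directly from the Gaussian densities."""
    def log_norm(x, mu, s):
        return -0.5 * math.log(2 * math.pi) - math.log(s) - (x - mu) ** 2 / (2 * s ** 2)
    return log_norm(x, mu0, sigma) - log_norm(x, mu1, sigma)

def linear_form(x, mu0, mu1, sigma):
    """The simplified result: (mu0 - mu1)/sigma^2 * x + (mu1^2 - mu0^2)/(2 sigma^2)."""
    return (mu0 - mu1) / sigma ** 2 * x + (mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2)
```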
Therefore,
$$\begin{aligned}
P\left( Y=1|X \right) &= \frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi} \right) +\sum_i{\left( \ln P\left( X_i|Y=0 \right) -\ln P\left( X_i|Y=1 \right) \right)} \right\}} \\
&= \frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi} \right) +\sum_i{\left( \frac{\mu _{i0}-\mu _{i1}}{\sigma _{i}^{2}}X_i+\frac{-\mu _{i0}^{2}+\mu _{i1}^{2}}{2\sigma _{i}^{2}} \right)} \right\}} \\
&= \frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi} \right) +\sum_i{\frac{-\mu _{i0}^{2}+\mu _{i1}^{2}}{2\sigma _{i}^{2}}}+\sum_i{\frac{\mu _{i0}-\mu _{i1}}{\sigma _{i}^{2}}X_i} \right\}}
\end{aligned}$$
That is,
$$P\left( Y=1|X \right)=\frac{1}{1+\exp \left\{ \ln \left( \frac{1-\pi}{\pi} \right) +\sum_i{\frac{-\mu _{i0}^{2}+\mu _{i1}^{2}}{2\sigma _{i}^{2}}}+\sum_i{\frac{\mu _{i0}-\mu _{i1}}{\sigma _{i}^{2}}X_i} \right\}}$$
Let
$$w_0=\ln \left( \frac{1-\pi}{\pi} \right)+\sum_i{\frac{-\mu _{i0}^{2}+\mu _{i1}^{2}}{2\sigma _{i}^{2}}}, \qquad w_i=\frac{\mu _{i0}-\mu _{i1}}{\sigma _{i}^{2}}$$
Then
$$P\left( Y=1|X \right)=\frac{1}{1+\exp \left\{ w_0+\sum_i{w_iX_i} \right\}}$$
That is,
$$P\left( Y=1|X \right)=\frac{1}{1+e^{wX+b}}$$
which is the sigmoid function we usually talk about (here the sign is absorbed into the weights; writing $w'=-w$, $b'=-b$ recovers the more common form $\frac{1}{1+e^{-(w'X+b')}}$).
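The whole chain can be verified numerically: computing the posterior directly by Bayes' rule and computing it through the derived weights $w_0$, $w_i$ give the same number. A single-feature Python sketch (all names are mine, not from the original post):

```python
import math

def gnb_posterior(x, pi, mu0, mu1, sigma):
    """P(Y=1|X=x) computed directly via Bayes' rule for one Gaussian feature."""
    def density(x, mu, s):
        return math.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)
    p1 = density(x, mu1, sigma) * pi        # P(X|Y=1) P(Y=1)
    p0 = density(x, mu0, sigma) * (1 - pi)  # P(X|Y=0) P(Y=0)
    return p1 / (p0 + p1)

def sigmoid_posterior(x, pi, mu0, mu1, sigma):
    """The same posterior via the derived weights w0 and w1."""
    w0 = math.log((1 - pi) / pi) + (mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2)
    w1 = (mu0 - mu1) / sigma ** 2
    return 1.0 / (1.0 + math.exp(w0 + w1 * x))
```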
Deriving the Loss Function
We adopt maximum likelihood estimation (MLE). However, $P(\langle X_i, y_i \rangle | w)$ is hard to estimate, since the dataset rarely contains enough such joint observations. We therefore fall back to the weaker maximum conditional likelihood estimation (MCLE) and maximize $P(Y=y_i|X_i, w)$ instead.
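Concretely, maximizing the conditional likelihood $\prod_i P(Y=y_i|X_i,w)$ is equivalent to minimizing the negative conditional log-likelihood, i.e. the familiar logistic loss. A minimal sketch, following this post's sign convention $P(Y=1|X)=\frac{1}{1+\exp\{w_0+\sum_i w_iX_i\}}$ (function name is my own):

```python
import math

def neg_cond_log_likelihood(w0, w, X, y):
    """- sum_i ln P(Y=y_i | X_i, w), with this post's convention
    P(Y=1|X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = w0 + sum(wj * xj for wj, xj in zip(w, xi))
        p1 = 1.0 / (1.0 + math.exp(z))
        p = p1 if yi == 1 else 1.0 - p1
        total -= math.log(p)
    return total
```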
Here we need a concrete scenario.