Exponential family distributions
$$
P(y;\eta)=b(y)\exp\left[\eta^T T(y)-a(\eta)\right]
$$
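where $\eta$ is the natural parameter, $T(y)$ is the sufficient statistic, and $a(\eta)$ is the normalization term (the log-partition function), which makes the distribution sum or integrate to 1.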
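Modeling the binomial distribution
Logistic regression assumes that the target variable follows a binomial distribution with a single trial (that is, a Bernoulli distribution), where $y\in\{0,1\}$ and $\phi$ is the probability that $y=1$: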
$$
\begin{aligned}
P(y;\theta)&=\phi^y(1-\phi)^{1-y}\\
&=\exp\left[y\log\phi+(1-y)\log(1-\phi)\right]\\
&=\exp\left[\left(\log\frac{\phi}{1-\phi}\right)y+\log(1-\phi)\right]
\end{aligned}
$$
Let $\log\frac{\phi}{1-\phi}=\eta$. Solving for $\phi$ gives $\phi=\frac{1}{1+e^{-\eta}}$. This is where the sigmoid function comes from: we pass $\eta$ through the sigmoid and use the result as the probability of the binomial distribution. Denote

$$
\mathrm{sigmoid}(x)=\sigma(x)=\frac{1}{1+e^{-x}}
$$
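The correspondence between $\phi$ and $\eta$ can be checked numerically. The following is a minimal sketch in plain Python, not part of the original post. The helper names (`logit`, `bernoulli_pmf`, `bernoulli_expfam`) are hypothetical, and it assumes $b(y)=1$, $T(y)=y$ and $a(\eta)=-\log(1-\phi)=\log(1+e^{\eta})$, which is what matching the last line of the expansion against the exponential-family form suggests.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)); the branch avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(phi: float) -> float:
    """Natural parameter eta = log(phi / (1 - phi)), valid for 0 < phi < 1."""
    return math.log(phi / (1.0 - phi))


def bernoulli_pmf(y: int, phi: float) -> float:
    """Direct form phi^y * (1 - phi)^(1 - y)."""
    return phi ** y * (1.0 - phi) ** (1 - y)


def bernoulli_expfam(y: int, eta: float) -> float:
    """Exponential-family form exp(eta * y - a(eta)) with a(eta) = log(1 + e^eta)."""
    a = math.log1p(math.exp(eta))
    return math.exp(eta * y - a)


phi = 0.3                      # illustrative value, not from the original post
eta = logit(phi)
print(sigmoid(eta))            # should recover phi up to rounding
print(bernoulli_pmf(1, phi), bernoulli_expfam(1, eta))
print(bernoulli_pmf(0, phi), bernoulli_expfam(0, eta))
```

The two forms should agree up to floating-point rounding for both values of $y$, and applying `sigmoid` to `logit(phi)` should return `phi`.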
The derivative of the sigmoid function is

$$
\begin{aligned}
\frac{d\sigma(x)}{dx}&=\frac{e^{-x}}{(1+e^{-x})^2}\\
&=\frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}\\
&=\frac{1}{1+e^{-x}}\cdot\frac{(1+e^{-x})-1}{1+e^{-x}}\\
&=\frac{1}{1+e^{-x}}\left(1-\frac{1}{1+e^{-x}}\right)\\
&=\sigma(x)(1-\sigma(x))
\end{aligned}
$$
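As a quick sanity check on this identity, the closed form $\sigma(x)(1-\sigma(x))$ can be compared with a central finite difference. This is only an illustrative sketch; `sigmoid_grad` and `central_difference` are names chosen here, and the step size `h` is an arbitrary assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)); fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Closed-form derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-3.0, -0.5, 0.0, 2.0):
    # The two columns should agree to several decimal places.
    print(x, sigmoid_grad(x), central_difference(sigmoid, x))
```

The derivative peaks at $x=0$, where $\sigma(0)=0.5$ and the slope is $0.25$, and it decays toward zero as $|x|$ grows.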
Parameter estimation
Here we make one more assumption, $\eta=\theta^Tx$, and estimate the parameter $\theta$ by maximum likelihood. In the formulas that follow, $\sigma(x)$ is shorthand for $\sigma(\theta^Tx)$. The likelihood of $n$ independent samples is

$$
L(\theta)=\prod_{i=1}^{n}p(y^{(i)};\theta)=\prod_{i=1}^{n}\sigma(x^{(i)})^{y^{(i)}}\left(1-\sigma(x^{(i)})\right)^{1-y^{(i)}}
$$
Taking the logarithm gives the log-likelihood function

$$
\begin{aligned}
\ell(\theta)&=\sum_{i=1}^{n}\log p(y^{(i)};\theta)\\
&=\sum_{i=1}^{n}\log\left[\sigma(x^{(i)})^{y^{(i)}}\left(1-\sigma(x^{(i)})\right)^{1-y^{(i)}}\right]\\
&=\sum_{i=1}^{n}\left[y^{(i)}\log\sigma(x^{(i)})+(1-y^{(i)})\log\left(1-\sigma(x^{(i)})\right)\right]
\end{aligned}
$$
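The log-likelihood can be evaluated directly from the last line of this expression. The sketch below is one possible way to do it in plain Python, with $\sigma$ applied to $\theta^Tx^{(i)}$. It uses the identity $1-\sigma(z)=\sigma(-z)$ and a numerically stable `log_sigmoid` helper; the function names and the toy data are assumptions made for illustration only.

```python
import math


def log_sigmoid(z: float) -> float:
    """log(sigma(z)) = -log(1 + e^{-z}), arranged so exp() never overflows."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_likelihood(theta, xs, ys) -> float:
    """Sum over samples of y*log(sigma(z)) + (1-y)*log(1-sigma(z)), z = theta^T x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(t * xi for t, xi in zip(theta, x))
        # log(1 - sigma(z)) equals log(sigma(-z)).
        total += y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)
    return total


# Made-up toy data: the first feature is a constant 1 acting as a bias term.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
ys = [1, 0, 1]

print(log_likelihood([0.0, 0.0], xs, ys))   # every sample has probability 0.5 here
print(log_likelihood([0.1, 0.8], xs, ys))
```

With $\theta=0$ every sample is assigned probability $0.5$, so the log-likelihood is $n\log 0.5$; a $\theta$ that fits the labels better yields a value closer to zero.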
Next, take the partial derivative of $\ell(\theta)$ with respect to $\theta$. For a single sample $(x,y)$, dropping the index $i$, the chain rule gives

$$
\begin{aligned}
\frac{\partial \ell(\theta)}{\partial \theta}
&=\frac{\partial \ell(\theta)}{\partial \sigma(x)}\frac{\partial \sigma(x)}{\partial \theta^Tx}\frac{\partial \theta^Tx}{\partial \theta}\\
&=\left(\frac{y}{\sigma(x)}-\frac{1-y}{1-\sigma(x)}\right)\left[\sigma(x)(1-\sigma(x))\right]x\\
&=\left[y(1-\sigma(x))-(1-y)\sigma(x)\right]x\\
&=\left[y-y\sigma(x)-\sigma(x)+y\sigma(x)\right]x\\
&=(y-\sigma(x))x
\end{aligned}
$$
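The result $(y-\sigma(x))x$ can be verified against a numerical gradient of the single-sample log-likelihood. The following sketch assumes $\sigma$ is applied to $\theta^Tx$; the function names, the sample and the parameter values are all made up for the check.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-z)); fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def log_likelihood_one(theta, x, y) -> float:
    """Single-sample log-likelihood y*log(p) + (1-y)*log(1-p), p = sigma(theta^T x)."""
    p = sigmoid(dot(theta, x))
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)


def gradient_one(theta, x, y):
    """Analytic gradient (y - sigma(theta^T x)) * x from the derivation."""
    err = y - sigmoid(dot(theta, x))
    return [err * xi for xi in x]


def numeric_gradient(theta, x, y, h: float = 1e-6):
    """Central finite difference in each coordinate of theta."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        diff = log_likelihood_one(up, x, y) - log_likelihood_one(down, x, y)
        grad.append(diff / (2.0 * h))
    return grad


theta = [0.2, -0.4, 0.7]   # arbitrary parameter values
x = [1.0, 2.0, -1.5]       # arbitrary sample; first entry plays the role of a bias
for y in (0, 1):
    print(y, gradient_one(theta, x, y), numeric_gradient(theta, x, y))
```

The analytic and numerical gradients should match to several decimal places for both labels, which is a cheap way to catch sign errors in the derivation.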
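We use stochastic gradient descent, meaning that each step uses only a single sample $\left(x^{(i)},y^{(i)}\right)$ to compute the partial derivative. Because the goal is to maximize $\ell(\theta)$, the step moves along the gradient, which is equivalent to gradient descent on $-\ell(\theta)$. With learning rate $\alpha$, the update rule for $\theta$ is

$$
\theta:=\theta+\alpha\left(y^{(i)}-\sigma(x^{(i)})\right)x^{(i)}
$$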
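Put together, the update rule leads to a very short training loop. The code below is a minimal sketch in plain Python rather than a reference implementation. The function name `sgd_fit`, the learning rate, the number of epochs, the shuffling scheme and the toy data are all assumptions chosen for illustration, and $\sigma$ is applied to $\theta^Tx^{(i)}$.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic sigmoid; the branch avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def sgd_fit(xs, ys, alpha: float = 0.1, epochs: int = 200, seed: int = 0):
    """Fit theta with one-sample updates theta := theta + alpha * (y - sigma) * x."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    theta = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)             # visit samples in a new random order each epoch
        for i in order:
            err = ys[i] - sigmoid(dot(theta, xs[i]))
            theta = [t + alpha * err * xi for t, xi in zip(theta, xs[i])]
    return theta


# Made-up toy data: constant 1 as bias, label is 1 when the second feature is positive.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 0, 1, 1, 1]

theta = sgd_fit(xs, ys)
preds = [1 if sigmoid(dot(theta, x)) >= 0.5 else 0 for x in xs]
print(theta)
print(preds)
```

On linearly separable data like this toy set the weights keep growing the longer training runs, so in practice a fixed epoch budget, a decaying learning rate, or regularization is commonly used to keep $\theta$ bounded.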