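Modeling with the multinomial distribution
Softmax regression assumes that the target variable $y\in\{1,\dots,k\}$ follows a multinomial distribution with class probabilities $\phi_1,\dots,\phi_k$: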
$$
\begin{aligned}
P(y;\eta)&=\prod_{i=1}^{k}\phi_i^{1\{y=i\}}\\
&=\left(\prod_{i=1}^{k-1}\phi_i^{1\{y=i\}}\right)\phi_k^{1-\sum_{i=1}^{k-1}1\{y=i\}}\\
&=\left(\prod_{i=1}^{k-1}\left(\frac{\phi_i}{\phi_k}\right)^{1\{y=i\}}\right)\phi_k\\
&=\exp\left(\sum_{i=1}^{k-1}\log\left(\frac{\phi_i}{\phi_k}\right)T(y)_i+\log\phi_k\right)
\end{aligned}
$$
where $\eta$ and $T(y)$ are vectors of dimension $k-1$, with $T(y)_i=1\{y=i\}$:
$$
\begin{aligned}
\eta&=\left[\log\frac{\phi_1}{\phi_k},\dots,\log\frac{\phi_{k-1}}{\phi_k}\right]^T\\
T(1)&=\left[1,0,0,\dots,0\right]^T\\
T(2)&=\left[0,1,0,\dots,0\right]^T\\
&\cdots\\
T(k-1)&=\left[0,0,0,\dots,1\right]^T\\
T(k)&=\left[0,0,0,\dots,0\right]^T
\end{aligned}
$$
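The exponential-family form above can be checked numerically. The following is a minimal sketch, assuming classes are numbered $1,\dots,k$ and using made-up probabilities; the function names are illustrative and not part of the original derivation.

```python
import math

def sufficient_statistic(y, k):
    """T(y): a (k-1)-dimensional indicator vector, all zeros when y == k."""
    return [1.0 if y == i else 0.0 for i in range(1, k)]

def natural_parameters(phi):
    """eta_i = log(phi_i / phi_k) for i = 1..k-1."""
    return [math.log(p / phi[-1]) for p in phi[:-1]]

def exp_family_prob(y, phi):
    """P(y) written as exp(eta^T T(y) + log phi_k)."""
    eta = natural_parameters(phi)
    t = sufficient_statistic(y, len(phi))
    inner = sum(e * ti for e, ti in zip(eta, t))
    return math.exp(inner + math.log(phi[-1]))

phi = [0.2, 0.5, 0.3]  # illustrative class probabilities (assumption)
probs = [exp_family_prob(y, phi) for y in (1, 2, 3)]
```

Each entry of `probs` should match the corresponding $\phi_y$ up to floating-point error, including the last class, for which $T(k)$ is the zero vector.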
For convenience, let
$$
\eta_i=\log\frac{\phi_i}{\phi_k},\quad i=1,\dots,k
$$
where $\eta_k=\log\frac{\phi_k}{\phi_k}=0$, the same form of assumption as in logistic regression. Then
$$
\begin{aligned}
e^{\eta_i}&=\frac{\phi_i}{\phi_k}\\
\phi_ke^{\eta_i}&=\phi_i\\
\phi_k\sum_{j=1}^{k}e^{\eta_j}&=\sum_{j=1}^{k}\phi_j=1\\
\phi_k&=\frac{1}{\sum_{j=1}^{k}e^{\eta_j}}\\
\Rightarrow\phi_i&=\phi_ke^{\eta_i}=\frac{e^{\eta_i}}{\sum_{j=1}^{k}e^{\eta_j}}
\end{aligned}
$$
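The mapping from $\eta$ back to $\phi$ derived above is the softmax function. Here is a minimal sketch of it; subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption, not something stated in the derivation, and it does not change the result.

```python
import math

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j), shifted by max(eta) for stability."""
    shift = max(eta)
    exps = [math.exp(e - shift) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]

phi = [0.1, 0.6, 0.3]                        # illustrative probabilities (assumption)
eta = [math.log(p / phi[-1]) for p in phi]   # eta_i = log(phi_i / phi_k), so eta_k = 0
recovered = softmax(eta)                     # should reproduce phi
large = softmax([1000.0, 1000.0])            # would overflow without the shift
```

Applying `softmax` to $\eta_i=\log(\phi_i/\phi_k)$ recovers the original probabilities, which is the round trip the derivation describes.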
As in logistic regression, softmax regression also assumes $\eta_i=\theta_i^Tx$, so
$$
\begin{aligned}
&P(y=i\mid x;\theta)=\phi_i=\frac{e^{\theta_i^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\\
\phi&=\left[\frac{e^{\theta_1^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\dots,\frac{e^{\theta_{k-1}^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\frac{e^{\theta_k^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\right]^T\\
&=\left[\frac{e^{\theta_1^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\dots,\frac{e^{\theta_{k-1}^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\frac{1}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\right]^T
\end{aligned}
$$
The predicted class is then
$$
y=\operatorname*{arg\,max}_i P(y=i\mid x;\theta)=\operatorname*{arg\,max}_i\phi_i
$$
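The prediction rule can be sketched in a few lines. This is only an illustration: `theta` is assumed to be a list of $k$ parameter vectors with the last one set to zero, classes are numbered from 1, and the numbers are made up.

```python
import math

def class_probabilities(theta, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x)."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # stability shift; leaves the probabilities unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(theta, x):
    """Return the class in 1..k with the largest probability phi_i."""
    phi = class_probabilities(theta, x)
    return max(range(len(phi)), key=phi.__getitem__) + 1

# Hypothetical parameters for k = 3 classes and 2 features; theta_k = 0.
theta = [[2.0, -1.0], [-1.0, 2.0], [0.0, 0.0]]
labels = [predict(theta, x) for x in ([3.0, 0.0], [0.0, 3.0], [-1.0, -1.0])]
```

Because the denominator is shared by all classes, the arg max over $\phi_i$ is the same as the arg max over the scores $\theta_i^Tx$.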
Parameter estimation
The parameters are estimated by maximizing the log-likelihood
$$
\begin{aligned}
\ell(\theta)&=\sum_{i=1}^{n}\log P(y^{(i)}\mid x^{(i)};\theta)\\
&=\sum_{i=1}^{n}\log\prod_{j=1}^{k}\phi_j^{1\{y^{(i)}=j\}}\\
&=\sum_{i=1}^{n}\sum_{j=1}^{k}1\{y^{(i)}=j\}\log\phi_j=\sum_{i=1}^{n}\log\phi_{y^{(i)}}
\end{aligned}
$$
Here $\phi_j$ is evaluated at $x^{(i)}$, so each sample contributes the log-probability of its own class.
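As a rough illustration of the log-likelihood, the sketch below sums $\log\phi_{y^{(i)}}$ over a small made-up data set. The data, the parameter layout (a list of $k$ vectors), and the function names are assumptions for the example.

```python
import math

def class_probabilities(theta, x):
    """phi_j = exp(theta_j^T x) / sum_l exp(theta_l^T x)."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(theta, xs, ys):
    """Sum over samples of log phi_{y^(i)}, with labels in 1..k."""
    return sum(math.log(class_probabilities(theta, x)[y - 1])
               for x, y in zip(xs, ys))

# Made-up samples: first feature is a constant bias term.
xs = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.0]]
ys = [1, 3, 2, 1]
theta_zero = [[0.0, 0.0] for _ in range(3)]
ll_uniform = log_likelihood(theta_zero, xs, ys)
```

With all parameters at zero every class has probability $1/3$, so `ll_uniform` equals $4\log\frac{1}{3}$, a convenient sanity check before training.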
To differentiate $\ell(\theta)$, consider the contribution of a single sample $(x,y)$. Because $1\{y=j\}=0$ for every $j\neq y$, that contribution is $\ell(\theta)=\log\phi_y$, where $\phi_i=e^{\theta_i^Tx}/\sum_{j=1}^{k}e^{\theta_j^Tx}$. For $i=1,\dots,k-1$ the chain rule gives
$$
\frac{\partial\ell(\theta)}{\partial\theta_i}=\frac{\partial\ell(\theta)}{\partial\phi_y}\frac{\partial\phi_y}{\partial e^{\theta_i^Tx}}\frac{\partial e^{\theta_i^Tx}}{\partial\theta_i}
$$
When $i\neq y$, the numerator of $\phi_y$ does not depend on $\theta_i$, so
$$
\begin{aligned}
\frac{\partial\ell(\theta)}{\partial\theta_i}&=\frac{1}{\phi_y}\frac{0-e^{\theta_y^Tx}}{\left(\sum_{j=1}^{k}e^{\theta_j^Tx}\right)^2}e^{\theta_i^Tx}x\\
&=-\frac{1}{\phi_y}\phi_i\phi_yx\\
&=-\phi_ix
\end{aligned}
$$
When $i=y$,
$$
\begin{aligned}
\frac{\partial\ell(\theta)}{\partial\theta_i}&=\frac{\partial\ell(\theta)}{\partial\theta_y}=\frac{1}{\phi_y}\frac{\left(\sum_{j=1}^{k}e^{\theta_j^Tx}\right)-e^{\theta_y^Tx}}{\left(\sum_{j=1}^{k}e^{\theta_j^Tx}\right)^2}e^{\theta_y^Tx}x\\
&=\frac{1}{\phi_y}\phi_y\left(1-\phi_y\right)x\\
&=\left(1-\phi_y\right)x
\end{aligned}
$$
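A finite-difference check is a cheap way to catch sign or index mistakes in a hand-derived gradient. The sketch below compares the single-sample closed form $\left(1\{i=y\}-\phi_i\right)x$, which covers both the $i=y$ and $i\neq y$ cases, against central differences of $\log\phi_y$. All $k$ parameter vectors are treated as free here purely for the check, and the numbers are made up.

```python
import math

def class_probabilities(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_log_likelihood(theta, x, y):
    """log phi_y for one sample, labels in 1..k."""
    return math.log(class_probabilities(theta, x)[y - 1])

def analytic_gradient(theta, x, y):
    """Row i holds (1{i == y} - phi_i) * x."""
    phi = class_probabilities(theta, x)
    return [[((1.0 if i == y - 1 else 0.0) - phi[i]) * xj for xj in x]
            for i in range(len(theta))]

def numeric_gradient(theta, x, y, h=1e-6):
    """Central differences, perturbing one parameter at a time."""
    grad = []
    for i, row in enumerate(theta):
        grad_row = []
        for j in range(len(row)):
            plus = [r[:] for r in theta]
            minus = [r[:] for r in theta]
            plus[i][j] += h
            minus[i][j] -= h
            diff = sample_log_likelihood(plus, x, y) - sample_log_likelihood(minus, x, y)
            grad_row.append(diff / (2 * h))
        grad.append(grad_row)
    return grad

theta = [[0.3, -0.2], [-0.5, 0.8], [0.0, 0.0]]  # illustrative values
x, y = [1.0, 2.0], 2
max_gap = max(abs(a - b)
              for ra, rb in zip(analytic_gradient(theta, x, y),
                                numeric_gradient(theta, x, y))
              for a, b in zip(ra, rb))
```

A small `max_gap` indicates that the closed form and the numerical derivative agree at this point; it is a spot check, not a proof.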
The parameters are updated with stochastic gradient steps, each using a single sample $\left(x^{(i)},y^{(i)}\right)$ to compute the partial derivatives. Because $\ell(\theta)$ is maximized, each step moves along the gradient $\left(1\{j=y^{(i)}\}-\phi_j\right)x^{(i)}$, which is equivalent to gradient descent on $-\ell(\theta)$. For $j=1,\cdots,k-1$, with $\phi_j$ evaluated at $x^{(i)}$ and learning rate $\alpha$, the update of $\theta$ is
$$
\theta_j:=\left\{\begin{array}{ll}
\theta_j-\alpha\phi_jx^{(i)}, & j\ne y^{(i)}\\
\theta_j+\alpha\left(1-\phi_{y^{(i)}}\right)x^{(i)}, & j=y^{(i)}
\end{array}\right.
$$
while $\theta_k=\vec{0}$ stays fixed.
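Putting the pieces together, here is a minimal training sketch under stated assumptions: it performs stochastic gradient ascent on the log-likelihood using the per-sample gradient $\left(1\{j=y\}-\phi_j\right)x$ for $j=1,\dots,k-1$ and keeps $\theta_k$ at zero. The toy data, learning rate, epoch count, and function names are all illustrative choices, not values from the original text.

```python
import math
import random

def class_probabilities(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(theta, x):
    phi = class_probabilities(theta, x)
    return max(range(len(phi)), key=phi.__getitem__) + 1

def sgd_train(xs, ys, k, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent on the log-likelihood, one sample per step."""
    rng = random.Random(seed)
    dim = len(xs[0])
    theta = [[0.0] * dim for _ in range(k)]  # theta_k (last row) is never updated
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            x, y = xs[n], ys[n]
            phi = class_probabilities(theta, x)
            for j in range(k - 1):
                coeff = (1.0 if j == y - 1 else 0.0) - phi[j]
                for c in range(dim):
                    theta[j][c] += alpha * coeff * x[c]
    return theta

# Toy, linearly separable data; the first feature is a constant bias term.
xs = [[1.0, 2.0, 0.0], [1.0, 3.0, 0.5], [1.0, 0.0, 2.0],
      [1.0, 0.5, 3.0], [1.0, -2.0, -2.0], [1.0, -3.0, -1.5]]
ys = [1, 1, 2, 2, 3, 3]
theta = sgd_train(xs, ys, k=3)
predictions = [predict(theta, x) for x in xs]
```

Fixing $\theta_k$ at zero removes the redundancy in the parameterization: adding the same vector to every $\theta_j$ leaves all $\phi_j$ unchanged, so one class can serve as the reference without losing expressiveness.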