Softmax: notes

Modeling with the multinomial distribution

Softmax regression assumes that the target variable $y\in\{1,\dots,k\}$ follows a multinomial distribution with class probabilities $\phi_1,\dots,\phi_k$. This distribution can be written in exponential-family form:
$$
\begin{aligned}
P(y;\eta)&=\prod_{i=1}^{k}\phi_i^{1\{y=i\}}\\
&=\left(\prod_{i=1}^{k-1}\phi_i^{1\{y=i\}}\right)\phi_k^{1-\sum_{i=1}^{k-1}1\{y=i\}}\\
&=\left(\prod_{i=1}^{k-1}\left(\frac{\phi_i}{\phi_k}\right)^{1\{y=i\}}\right)\phi_k\\
&=\exp\left(\sum_{i=1}^{k-1}\log\left(\frac{\phi_i}{\phi_k}\right)T(y)_i+\log\phi_k\right)
\end{aligned}
$$
where $\eta$ and $T(y)$ are $(k-1)$-dimensional vectors:
$$
\begin{aligned}
\eta&=\left[\log\frac{\phi_1}{\phi_k},\dots,\log\frac{\phi_{k-1}}{\phi_k}\right]^T\\
T(1)&=\left[1,0,0,\dots,0\right]^T\\
T(2)&=\left[0,1,0,\dots,0\right]^T\\
&\cdots\\
T(k-1)&=\left[0,0,0,\dots,1\right]^T\\
T(k)&=\left[0,0,0,\dots,0\right]^T
\end{aligned}
$$
For convenience, define
$$
\eta_i=\log\frac{\phi_i}{\phi_k},\quad i=1,\dots,k
$$
so that $\eta_k=\log\frac{\phi_k}{\phi_k}=0$. Then
$$
\begin{aligned}
e^{\eta_i}&=\frac{\phi_i}{\phi_k}\\
\phi_k e^{\eta_i}&=\phi_i\\
\phi_k\sum_{i=1}^{k}e^{\eta_i}&=\sum_{i=1}^{k}\phi_i=1\\
\phi_k&=\frac{1}{\sum_{j=1}^{k}e^{\eta_j}}\\
\Rightarrow\ \phi_i&=\phi_k e^{\eta_i}=\frac{e^{\eta_i}}{\sum_{j=1}^{k}e^{\eta_j}}
\end{aligned}
$$
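The round trip between $\phi$ and $\eta$ can be checked numerically. The following is a minimal sketch; the function names and the example probabilities are illustrative assumptions, not part of the original notes.

```python
import math

def eta_from_phi(phi):
    """eta_i = log(phi_i / phi_k), using the last class k as the reference."""
    return [math.log(p / phi[-1]) for p in phi]

def phi_from_eta(eta):
    """Invert the map: phi_i = exp(eta_i) / sum_j exp(eta_j)."""
    exps = [math.exp(e) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]

phi = [0.2, 0.5, 0.3]           # illustrative class probabilities, k = 3
eta = eta_from_phi(phi)         # eta[-1] is log(phi_k / phi_k) = 0
recovered = phi_from_eta(eta)   # should match phi up to rounding
```

Applying `phi_from_eta` to `eta` gives back the original probabilities, which is what the last line of the derivation states.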
As in logistic regression, softmax regression assumes the linear form $\eta_i=\theta_i^Tx$, so
$$
\begin{aligned}
&P(y=i|x;\theta)=\phi_i=\frac{e^{\theta_i^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\\
\phi&=\left[\frac{e^{\theta_1^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\dots,\frac{e^{\theta_{k-1}^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\frac{e^{\theta_k^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\right]^T\\
&=\left[\frac{e^{\theta_1^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\dots,\frac{e^{\theta_{k-1}^Tx}}{\sum_{j=1}^{k}e^{\theta_j^Tx}},\frac{1}{\sum_{j=1}^{k}e^{\theta_j^Tx}}\right]^T
\end{aligned}
$$
The last equality uses $\theta_k=\vec{0}$, which follows from $\eta_k=0$.

$$
y=\operatorname*{arg\,max}_i P(y=i|x;\theta)=\operatorname*{arg\,max}_i\phi_i
$$
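The class probabilities and the arg-max prediction can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: `theta` stores only $\theta_1,\dots,\theta_{k-1}$ (with $\theta_k$ fixed to zero), the function names are hypothetical, and subtracting the maximum score before exponentiating is a common numerical safeguard that the notes do not mention.

```python
import math

def class_probabilities(theta, x):
    """Return [phi_1, ..., phi_k] given theta = [theta_1, ..., theta_{k-1}]."""
    # Score of class i is theta_i^T x; the reference class k has score 0.
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta] + [0.0]
    shift = max(scores)  # shifting all scores leaves the ratios unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(theta, x):
    """Return the most probable class, numbered 1..k."""
    phi = class_probabilities(theta, x)
    return max(range(len(phi)), key=phi.__getitem__) + 1

theta = [[1.0, -1.0], [0.5, 2.0]]   # illustrative parameters: k = 3, two features
x = [1.0, 2.0]
phi = class_probabilities(theta, x)  # scores are -1.0, 4.5 and 0.0
label = predict(theta, x)            # class 2 has the largest score
```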

Parameter estimation

The parameters are estimated by maximizing the log-likelihood over the $n$ training samples:
$$
\begin{aligned}
\ell(\theta)&=\sum_{i=1}^{n}\log P(y^{(i)}|x^{(i)};\theta)\\
&=\sum_{i=1}^{n}\log\prod_{j=1}^{k}\phi_j^{1\{y^{(i)}=j\}}\\
&=\sum_{i=1}^{n}\sum_{j=1}^{k}1\{y^{(i)}=j\}\log\phi_j\\
&=\sum_{i=1}^{n}\log\phi_{y^{(i)}}
\end{aligned}
$$
where, in the $i$-th term, $\phi$ is evaluated at $x^{(i)}$.
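As a sketch of how this log-likelihood might be computed, the code below sums the log-probability of the true class over all samples. The names `log_probabilities` and `log_likelihood` and the toy data are assumptions for illustration; working in log space through a log-sum-exp is a numerical convenience added here, not something the notes prescribe.

```python
import math

def log_probabilities(theta, x):
    """Return [log phi_1, ..., log phi_k]; theta holds theta_1..theta_{k-1}."""
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta] + [0.0]
    top = max(scores)
    # log of the normalizer sum_j exp(score_j), computed stably
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]

def log_likelihood(theta, xs, ys):
    """Sum over samples of log phi_{y}, with labels numbered 1..k."""
    return sum(log_probabilities(theta, x)[y - 1] for x, y in zip(xs, ys))

# With all parameters at zero every class has probability 1/k,
# so the log-likelihood is n * log(1/k).
theta = [[0.0, 0.0], [0.0, 0.0]]                       # k = 3, two features
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]
ys = [1, 3, 2, 2]
ll = log_likelihood(theta, xs, ys)
```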
Next, take the partial derivatives of $\ell(\theta)$ with respect to $\theta_i$, $i=1,\dots,k-1$, for a single sample $(x,y)$. Only the term of the true class survives in the sum, so $\ell(\theta)=\log\phi_y$, and by the chain rule
$$
\frac{\partial\ell(\theta)}{\partial\theta_i}=\frac{\partial\ell(\theta)}{\partial\phi_y}\frac{\partial\phi_y}{\partial e^{\theta_i^Tx}}\frac{\partial e^{\theta_i^Tx}}{\partial\theta_i}
$$
where $\phi_y=e^{\theta_y^Tx}/\sum_{j=1}^{k}e^{\theta_j^Tx}$. When $i\neq y$, the numerator of $\phi_y$ does not depend on $\theta_i$, so
$$
\begin{aligned}
\frac{\partial\ell(\theta)}{\partial\theta_i}&=\frac{1}{\phi_y}\frac{0-e^{\theta_y^Tx}}{\left(\sum_{j=1}^{k}e^{\theta_j^Tx}\right)^2}e^{\theta_i^Tx}x\\
&=-\frac{1}{\phi_y}\phi_y\phi_ix\\
&=-\phi_ix
\end{aligned}
$$
When $i=y$,
$$
\begin{aligned}
\frac{\partial\ell(\theta)}{\partial\theta_i}&=\frac{\partial\ell(\theta)}{\partial\theta_y}=\frac{1}{\phi_y}\frac{\left(\sum_{j=1}^{k}e^{\theta_j^Tx}\right)-e^{\theta_y^Tx}}{\left(\sum_{j=1}^{k}e^{\theta_j^Tx}\right)^2}e^{\theta_y^Tx}x\\
&=\frac{1}{\phi_y}\phi_y\left(1-\phi_y\right)x\\
&=\left(1-\phi_y\right)x
\end{aligned}
$$
Both cases can be written together as $\partial\ell(\theta)/\partial\theta_i=\left(1\{y=i\}-\phi_i\right)x$.
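A finite-difference check is a simple way to test a single-sample gradient of $\log\phi_y$. The sketch below compares the closed form $\left(1\{y=i\}-\phi_i\right)x$ against central differences; all names, the step size `eps`, and the example values are illustrative assumptions.

```python
import math

def probabilities(theta, x):
    # theta holds theta_1..theta_{k-1}; the reference class k has score 0.
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta] + [0.0]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_lik(theta, x, y):
    """Single-sample log-likelihood log phi_y, labels numbered 1..k."""
    return math.log(probabilities(theta, x)[y - 1])

def analytic_gradient(theta, x, y):
    """Closed form (1{y=i} - phi_i) * x for i = 1..k-1."""
    phi = probabilities(theta, x)
    return [[((1.0 if y == i + 1 else 0.0) - phi[i]) * v for v in x]
            for i in range(len(theta))]

def numeric_gradient(theta, x, y, eps=1e-6):
    """Central differences, one parameter at a time."""
    grad = []
    for i, row in enumerate(theta):
        grad_row = []
        for d in range(len(row)):
            plus = [r[:] for r in theta]
            minus = [r[:] for r in theta]
            plus[i][d] += eps
            minus[i][d] -= eps
            grad_row.append((log_lik(plus, x, y) - log_lik(minus, x, y)) / (2 * eps))
        grad.append(grad_row)
    return grad

def max_gap(theta, x, y):
    """Largest absolute difference between the two gradients."""
    pairs = zip(analytic_gradient(theta, x, y), numeric_gradient(theta, x, y))
    return max(abs(a - n) for ra, rn in pairs for a, n in zip(ra, rn))

theta = [[0.3, -0.2], [0.1, 0.4]]   # k = 3 classes, two features
x = [1.0, 2.0]
gap_true_class_2 = max_gap(theta, x, 2)   # y is one of the parameterized classes
gap_true_class_3 = max_gap(theta, x, 3)   # y is the reference class k
```

Both gaps should be tiny, which supports the sign and the index in each case: the true class receives $(1-\phi_y)x$ and every other class receives $-\phi_ix$.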
Training uses stochastic gradient descent on the negative log-likelihood $-\ell(\theta)$, that is, each step uses only one sample $\left(x^{(i)},y^{(i)}\right)$ to compute the partial derivatives. For each class $j=1,\cdots,k-1$, the update rule for $\theta_j$ is
$$
\theta_j:=\left\{\begin{array}{ll}
\theta_j-\alpha\phi_jx^{(i)}, & j\ne y^{(i)}\\
\theta_j+\alpha\left(1-\phi_j\right)x^{(i)}, & j=y^{(i)}
\end{array}\right.
$$
where $\alpha$ is the learning rate and $\phi_j$ is evaluated at $x^{(i)}$. The reference parameter stays fixed at $\theta_k=\vec{0}$.
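A minimal training loop based on the per-sample gradient $\left(1\{y=j\}-\phi_j\right)x$ of the log-likelihood might look like the sketch below. The toy dataset, the learning rate, the number of epochs, the fixed random seed, and all function names are assumptions made for illustration. The first feature of each sample is a constant 1 that acts as a bias term.

```python
import math
import random

def probabilities(theta, x):
    # theta holds theta_1..theta_{k-1}; theta_k is fixed to the zero vector.
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta] + [0.0]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(theta, x, y, alpha):
    """Update theta in place using one sample (x, y), labels numbered 1..k."""
    phi = probabilities(theta, x)
    for j, row in enumerate(theta):
        coeff = (1.0 if y == j + 1 else 0.0) - phi[j]  # (1{y=j} - phi_j)
        for d in range(len(row)):
            row[d] += alpha * coeff * x[d]

def log_likelihood(theta, xs, ys):
    return sum(math.log(probabilities(theta, x)[y - 1]) for x, y in zip(xs, ys))

# Toy data: three well-separated classes in two dimensions plus a bias feature.
xs = [[1.0, 2.0, 0.0], [1.0, 3.0, 0.5],
      [1.0, 0.0, 2.0], [1.0, 0.5, 3.0],
      [1.0, -2.0, -2.0], [1.0, -3.0, -1.5]]
ys = [1, 1, 2, 2, 3, 3]

theta = [[0.0, 0.0, 0.0] for _ in range(2)]   # k - 1 = 2 parameter vectors
before = log_likelihood(theta, xs, ys)

rng = random.Random(0)                        # fixed seed for reproducibility
order = list(range(len(xs)))
for epoch in range(200):
    rng.shuffle(order)                        # visit samples in random order
    for i in order:
        sgd_step(theta, xs[i], ys[i], alpha=0.1)

after = log_likelihood(theta, xs, ys)
predictions = [max(range(3), key=probabilities(theta, x).__getitem__) + 1 for x in xs]
```

On this separable toy set the log-likelihood rises from its initial value of $n\log(1/k)$ toward zero, and the trained model recovers the training labels.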
