These notes are mainly based on shuhuai008's whiteboard-derivation video on Bilibili: Boltzmann Machine (147 min).
Index of all notes in the series: Machine Learning – Whiteboard Derivation Series Notes.
Reference: Deep Learning (Goodfellow et al.), Section 20.1.
1. Introduction
A Boltzmann machine is a fully connected network in which every node is a discrete binary random variable; it was proposed to address the problem of getting stuck in local minima. The visible units $v$, hidden units $h$, and connection weights are:
$$v\in\{0,1\}^D,\qquad h\in\{0,1\}^P$$
$$L=\big[L_{ij}\big]_{D\times D},\qquad J=\big[J_{ij}\big]_{P\times P},\qquad W=\big[W_{ij}\big]_{D\times P}$$
where $L$ holds the visible–visible weights, $J$ the hidden–hidden weights, and $W$ the visible–hidden weights.
$$\left\{\begin{aligned}p(v,h)&=\frac{1}{Z}\exp\{-E(v,h)\}\\E(v,h)&=-\Big(v^TWh+\frac{1}{2}v^TLv+\frac{1}{2}h^TJh\Big)\end{aligned}\right.$$
$$\theta=\{W,L,J\}$$
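As a concrete reading of these formulas, here is a minimal NumPy sketch (the function names `energy`, `log_partition`, and `log_prob` are illustrative, not from the video). Note that computing $Z$ exactly requires summing over all $2^{D+P}$ binary configurations, which is only feasible for tiny models:

```python
import numpy as np
from itertools import product

def energy(v, h, W, L, J):
    # E(v,h) = -(v^T W h + 1/2 v^T L v + 1/2 h^T J h)
    return -(v @ W @ h + 0.5 * v @ L @ v + 0.5 * h @ J @ h)

def log_partition(W, L, J):
    # Z = sum of exp{-E(v,h)} over all 2^(D+P) binary states -- tiny models only.
    D, P = W.shape
    states_v = [np.array(s, float) for s in product([0, 1], repeat=D)]
    states_h = [np.array(s, float) for s in product([0, 1], repeat=P)]
    energies = [energy(v, h, W, L, J) for v in states_v for h in states_h]
    return np.log(np.sum(np.exp(-np.array(energies))))

def log_prob(v, h, W, L, J):
    # log p(v,h) = -E(v,h) - log Z
    return -energy(v, h, W, L, J) - log_partition(W, L, J)
```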
2. Gradient of the Log-Likelihood
Sample set: $V$, with $|V|=N$.
$$P(v)=\sum_h p(v,h)$$
$$\frac{1}{N}\sum_{v\in V}\log P(v)\quad\leftarrow\text{log-likelihood}$$
$$\frac{\partial}{\partial\theta}\,\frac{1}{N}\sum_{v\in V}\log P(v)=\frac{1}{N}\sum_{v\in V}\frac{\partial\log P(v)}{\partial\theta}\quad\leftarrow\text{gradient of the log-likelihood}$$
$$\frac{\partial\log P(v)}{\partial\theta}=\sum_v\sum_h p(v,h)\cdot\frac{\partial E(v,h)}{\partial\theta}-\sum_h p(h|v)\cdot\frac{\partial E(v,h)}{\partial\theta}$$
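This identity follows by writing $P(v)$ in terms of the partition function and differentiating both terms; a compact derivation:

$$\log P(v)=\log\sum_h e^{-E(v,h)}-\log\underbrace{\sum_v\sum_h e^{-E(v,h)}}_{Z}$$

$$\frac{\partial\log P(v)}{\partial\theta}=-\sum_h\underbrace{\frac{e^{-E(v,h)}}{\sum_{h'}e^{-E(v,h')}}}_{p(h|v)}\frac{\partial E(v,h)}{\partial\theta}+\sum_v\sum_h\underbrace{\frac{e^{-E(v,h)}}{Z}}_{p(v,h)}\frac{\partial E(v,h)}{\partial\theta}$$

which rearranges to the expression above.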
In particular, for $\theta=W$ we have $\frac{\partial E(v,h)}{\partial W}=-vh^T$, so
$$\frac{\partial\log P(v)}{\partial W}=\sum_v\sum_h p(v,h)\cdot(-vh^T)-\sum_h p(h|v)\cdot(-vh^T)=\sum_h p(h|v)\cdot vh^T-\sum_v\sum_h p(v,h)\cdot vh^T$$
Therefore,
$$\begin{aligned}\frac{1}{N}\sum_{v\in V}\frac{\partial\log P(v)}{\partial W}&=\frac{1}{N}\sum_{v\in V}\sum_h p(h|v)\cdot vh^T-\frac{1}{N}\sum_{v\in V}\sum_v\sum_h p(v,h)\cdot vh^T\\&=\frac{1}{N}\sum_{v\in V}\sum_h p(h|v)\cdot vh^T-\sum_v\sum_h p(v,h)\cdot vh^T\\&=E_{P_{data}}\big[vh^T\big]-E_{P_{model}}\big[vh^T\big]\end{aligned}$$
where
$$P_{data}=P_{data}(v)\,P_{model}(h|v),\qquad P_{model}=P_{model}(h,v)=P_{model}(v)\,P_{model}(h|v)$$
3. MCMC-Based Stochastic Gradient Ascent
From the derivation above, the same argument applied to each parameter matrix gives the updates:
$$\Delta W=\alpha\Big(E_{P_{data}}\big[vh^T\big]-E_{P_{model}}\big[vh^T\big]\Big)$$
$$\Delta L=\alpha\Big(E_{P_{data}}\big[vv^T\big]-E_{P_{model}}\big[vv^T\big]\Big)$$
$$\Delta J=\alpha\Big(E_{P_{data}}\big[hh^T\big]-E_{P_{model}}\big[hh^T\big]\Big)$$
where $\alpha$ is the learning rate, and $P_{data}$ and $P_{model}$ are as defined above.
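Both expectations are intractable to compute exactly, so they are estimated with Gibbs sampling. From the energy function, the single-unit conditionals follow directly (assuming $L$ and $J$ are symmetric with zero diagonal, i.e., no self-connections):

$$p(v_i=1\mid v_{-i},h)=\sigma\Big(\sum_j W_{ij}h_j+\sum_{k\neq i}L_{ik}v_k\Big),\qquad p(h_j=1\mid v,h_{-j})=\sigma\Big(\sum_i W_{ij}v_i+\sum_{k\neq j}J_{jk}h_k\Big)$$

where $\sigma(x)=1/(1+e^{-x})$.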
The parameters are then updated iteratively:
$$W^{(t+1)}=W^{(t)}+\Delta W$$
(and likewise for $L$ and $J$).
Element-wise, the update for a single weight reads
$$\Delta w_{ij}=\alpha\Big(\underbrace{E_{P_{data}}\big[v_ih_j\big]}_{\text{positive phase}}-\underbrace{E_{P_{model}}\big[v_ih_j\big]}_{\text{negative phase}}\Big)$$
The positive phase raises the probability of configurations observed in the data, while the negative phase lowers the probability of configurations the model generates on its own.
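Putting the pieces together, here is a minimal NumPy sketch of the MCMC-based stochastic gradient ascent. The names `gibbs_sweep` and `train_step` and the hyperparameters `alpha`, `k` are illustrative choices, not from the source: the positive phase clamps $v$ to a data vector and samples $h\sim p(h|v)$ by Gibbs sampling over the hidden units only, and the negative phase runs a free Gibbs chain to approximate $E_{P_{model}}[\cdot]$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, clamp_v=False):
    # One sequential Gibbs sweep using the conditionals above.
    # Assumes L, J symmetric with zero diagonal (no self-connections).
    # With clamp_v=True the visible units stay fixed (positive phase).
    if not clamp_v:
        for i in range(v.size):
            v[i] = rng.random() < sigmoid(W[i] @ h + L[i] @ v)
    for j in range(h.size):
        h[j] = rng.random() < sigmoid(W[:, j] @ v + J[j] @ h)
    return v, h

def train_step(V_batch, W, L, J, alpha=0.01, k=10):
    # One stochastic-gradient-ascent step on a batch of binary data vectors.
    N, _ = V_batch.shape
    P = J.shape[0]
    grads = {name: np.zeros_like(M) for name, M in [("W", W), ("L", L), ("J", J)]}
    for v_data in V_batch:
        # Positive phase: clamp v to the data, sample h ~ p(h|v).
        v = v_data.astype(float).copy()
        h = (rng.random(P) < 0.5).astype(float)
        for _ in range(k):
            v, h = gibbs_sweep(v, h, W, L, J, clamp_v=True)
        grads["W"] += np.outer(v, h)
        grads["L"] += np.outer(v, v)
        grads["J"] += np.outer(h, h)
        # Negative phase: free-running chain approximates E_{P_model}[.].
        for _ in range(k):
            v, h = gibbs_sweep(v, h, W, L, J, clamp_v=False)
        grads["W"] -= np.outer(v, h)
        grads["L"] -= np.outer(v, v)
        grads["J"] -= np.outer(h, h)
    W += alpha * grads["W"] / N
    L += alpha * grads["L"] / N
    J += alpha * grads["J"] / N
    for M in (L, J):
        # Re-symmetrize and zero the diagonal so L, J stay valid.
        M[:] = (M + M.T) / 2
        np.fill_diagonal(M, 0.0)
    return W, L, J
```

In practice a short chain of `k` sweeps per phase is a crude estimator; longer chains (or persistent chains that carry state across steps) reduce the bias at extra cost.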