These notes are based mainly on the whiteboard-derivation video by shuhuai008 on Bilibili: Confronting the Partition Function (134 min)
Index post for all notes in this series: Machine Learning - Whiteboard Derivation Series Notes
Confronting the Partition Function, corresponding to Chapter 18 of the Deep Learning book (known in Chinese as the "flower book")
Motivation: the learning problem and the evaluation problem
1. The Log-Likelihood Gradient
$x \in \mathbb{R}^p$ or $x \in \{0,1\}^p$
Many probabilistic graphical models (typically undirected ones) are defined by an unnormalized $\hat p(x;\theta)$, which we must divide by a partition function $Z(\theta)$ to normalize: $p(x;\theta)=\frac{1}{Z(\theta)}\hat p(x;\theta)$
The partition function $Z(\theta)$ is the integral or sum of the unnormalized probability over all states:
$$Z(\theta)=\int \hat p(x;\theta)\,dx \quad\text{or}\quad Z(\theta)=\sum_x \hat p(x;\theta)$$
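For a discrete model with a small state space, the sum defining $Z(\theta)$ can be computed by brute force. A minimal sketch, assuming a hypothetical fully-visible Boltzmann-style model $\hat p(x;W)=\exp(\tfrac12 x^\top W x)$ over $x\in\{0,1\}^p$ (the model and all names here are illustrative, not from the source):

```python
import itertools
import numpy as np

# Hypothetical toy model: unnormalized probability over binary vectors
#   p_hat(x; W) = exp(0.5 * x^T W x),  x in {0,1}^p
rng = np.random.default_rng(0)
p = 4
W = rng.normal(scale=0.3, size=(p, p))
W = (W + W.T) / 2                      # symmetric weights

def p_hat(x, W):
    """Unnormalized probability of one binary configuration."""
    return np.exp(0.5 * x @ W @ x)

# Z(theta) = sum of p_hat over all 2^p states
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=p)]
Z = sum(p_hat(x, W) for x in states)

# After dividing by Z, the probabilities sum to 1
probs = np.array([p_hat(x, W) / Z for x in states])
print(Z, probs.sum())
```

The exponential cost of this enumeration ($2^p$ states) is exactly why the partition function becomes intractable for realistic $p$.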
ML learning: Given $X=\{x_i\}_{i=1}^N$, estimate $\theta$:
$$\theta=\arg\max_{\theta}\, p(X;\theta)=\arg\max_{\theta}\prod_{i=1}^N p(x_i;\theta)$$
Taking the log,
$$\begin{aligned}\theta&=\arg\max_{\theta}\log\prod_{i=1}^N p(x_i;\theta)\\&=\arg\max_{\theta}\sum_{i=1}^N\log p(x_i;\theta)\\&=\arg\max_{\theta}\sum_{i=1}^N\left(\log\hat p(x_i;\theta)-\log Z(\theta)\right)\\&=\arg\max_{\theta}\sum_{i=1}^N\log\hat p(x_i;\theta)-N\cdot\log Z(\theta)\\&=\arg\max_{\theta}\frac{1}{N}\sum_{i=1}^N\log\hat p(x_i;\theta)-\log Z(\theta)\end{aligned}$$
(pulling out the factor $\frac1N$ does not change the maximizer)
$$l(\theta)=\frac{1}{N}\sum_{i=1}^N\log\hat p(x_i;\theta)-\log Z(\theta)$$
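To make the objective concrete, here is a minimal sketch evaluating $l(\theta)$ for a hypothetical one-parameter model over $x\in\{0,1\}$ with $\hat p(x;\theta)=e^{\theta x}$, so that $Z(\theta)=1+e^{\theta}$ is known in closed form (the model and the `data` array are illustrative assumptions):

```python
import numpy as np

# l(θ) = (1/N) Σ log p_hat(x_i;θ) - log Z(θ)
# for p_hat(x;θ) = exp(θ x), x ∈ {0,1}, hence Z(θ) = 1 + exp(θ)
def log_likelihood(theta, data):
    log_p_hat = theta * data                 # log p_hat(x;θ) = θ x
    log_Z = np.log(1.0 + np.exp(theta))      # exact, since the state space is tiny
    return log_p_hat.mean() - log_Z

data = np.array([1., 0., 1., 1.])            # illustrative observations
print(log_likelihood(0.0, data))             # at θ = 0 this equals -log 2
```

In general $\log Z(\theta)$ has no closed form, which is what makes both evaluation and learning hard; the toy model only illustrates the shape of the objective.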
Taking the gradient,
$$\nabla_\theta l(\theta)=\frac{1}{N}\sum_{i=1}^N\nabla_\theta\log\hat p(x_i;\theta)-\nabla_\theta\log Z(\theta)$$
For the second term, using $\frac{1}{Z(\theta)}=\frac{p(x;\theta)}{\hat p(x;\theta)}$:
$$\begin{aligned}\nabla_\theta\log Z(\theta)&=\frac{1}{Z(\theta)}\nabla_\theta Z(\theta)\\&=\frac{p(x;\theta)}{\hat p(x;\theta)}\nabla_\theta\int\hat p(x;\theta)\,dx\\&=\frac{p(x;\theta)}{\hat p(x;\theta)}\int\nabla_\theta\hat p(x;\theta)\,dx\\&=\int\frac{p(x;\theta)}{\hat p(x;\theta)}\nabla_\theta\hat p(x;\theta)\,dx\\&=\int p(x;\theta)\,\nabla_\theta\log\hat p(x;\theta)\,dx\\&=E_{p(x;\theta)}\left[\nabla_\theta\log\hat p(x;\theta)\right]\end{aligned}$$
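The identity $\nabla_\theta\log Z(\theta)=E_{p(x;\theta)}[\nabla_\theta\log\hat p(x;\theta)]$ can be checked numerically. A minimal sketch, again assuming the hypothetical one-parameter model $\hat p(x;\theta)=e^{\theta x}$ over $x\in\{0,1\}$, where $\nabla_\theta\log\hat p(x;\theta)=x$:

```python
import numpy as np

# Check: d/dθ log Z(θ) == E_{p(x;θ)}[d/dθ log p_hat(x;θ)]
# for p_hat(x;θ) = exp(θ x) over x ∈ {0,1}, so Z(θ) = 1 + exp(θ)
theta = 0.7
states = np.array([0.0, 1.0])

p_hat = np.exp(theta * states)        # unnormalized probabilities
Z = p_hat.sum()
p = p_hat / Z                         # normalized p(x;θ)

# Left side: finite-difference derivative of log Z
eps = 1e-6
Z_plus = np.exp((theta + eps) * states).sum()
lhs = (np.log(Z_plus) - np.log(Z)) / eps

# Right side: E_{p(x;θ)}[∇_θ log p_hat(x;θ)] = E_{p(x;θ)}[x]
rhs = (p * states).sum()

print(lhs, rhs)                       # the two sides agree up to O(eps)
```

Both sides equal $\sigma(\theta)=\frac{e^\theta}{1+e^\theta}$ for this model, matching the derivation above.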
2. Stochastic Maximum Likelihood
$$\begin{aligned}\nabla_\theta l(\theta)&=\frac{1}{N}\sum_{i=1}^N\nabla_\theta\log\hat p(x_i;\theta)-E_{p(x;\theta)}\left[\nabla_\theta\log\hat p(x;\theta)\right]\\&=\underbrace{E_{P_{data}}\left[\nabla_\theta\log\hat p(x;\theta)\right]}_{\text{positive phase}}-\underbrace{E_{P_{model}}\left[\nabla_\theta\log\hat p(x;\theta)\right]}_{\text{negative phase}}\end{aligned}$$
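Stochastic maximum likelihood estimates both expectations with samples: the positive phase from the data, the negative phase from samples drawn from the current model. A minimal sketch for the hypothetical one-parameter model $\hat p(x;\theta)=e^{\theta x}$ over $x\in\{0,1\}$; the helper names (`sample_model`, `data`, `lr`) are illustrative, and exact sampling here stands in for the MCMC (e.g. Gibbs chains) a real model would need:

```python
import numpy as np

# One-parameter toy model: p_hat(x;θ) = exp(θ x), x ∈ {0,1},
# so ∇_θ log p_hat(x;θ) = x.
rng = np.random.default_rng(1)

def grad_log_p_hat(x, theta):
    return x                                      # = ∇_θ log p_hat(x;θ)

def sample_model(theta, n):
    # This tiny model admits exact sampling; in general the negative
    # phase requires MCMC samples from p(x;θ).
    p1 = np.exp(theta) / (1.0 + np.exp(theta))    # p(x=1;θ)
    return (rng.random(n) < p1).astype(float)

data = np.array([1., 1., 0., 1.])                 # illustrative observations
theta, lr = 0.0, 0.5
for _ in range(200):
    positive = grad_log_p_hat(data, theta).mean()                       # E_{P_data}
    negative = grad_log_p_hat(sample_model(theta, 256), theta).mean()   # E_{P_model}
    theta += lr * (positive - negative)           # ascend ∇_θ l(θ)

print(theta)   # fluctuates near log 3, where p(x=1;θ) matches the data mean 0.75
```

The gradient ascent drives the model's expected sufficient statistics toward the data's, which is exactly the positive-phase-minus-negative-phase structure derived above.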