LDA model

This post takes a close look at the LDA model: the Dirichlet distribution, the difference between generative and discriminative models, the LDA model description and the interpretation of its parameters, and the general framework of variational inference. LDA is a topic-modeling method that describes the word distribution of documents through a Dirichlet prior and multinomial distributions.


1 Dirichlet distribution

1.1 Binomial distribution and Beta distribution

  • Bayes rule: $posterior \propto likelihood \times prior$

  • Conjugate prior: chosen so that the posterior has a convenient mathematical form when estimating the parameter $p$ by maximum a posteriori (MAP)

    • $X|\theta \sim B(n, \theta)$, $f(x; n, \theta)=\binom{n}{x}\theta^{x}(1-\theta)^{n-x}$

    • $\theta \sim \operatorname{Be}(\alpha, \beta)$, $f(\theta; \alpha, \beta)=\mathrm{B}(\alpha, \beta)^{-1}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$

  • With a Beta prior on the binomial parameter $p$, the posterior is again a Beta distribution:

    $posterior: \quad \theta|X \sim Be(\alpha+k, \beta+n-k)$
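
A minimal numerical sketch of this conjugate update; the prior hyperparameters and the counts $n$, $k$ below are arbitrary illustration values:

```python
import numpy as np
from scipy import stats

# Illustrative values: prior Be(alpha, beta); data: k successes in n trials.
alpha, beta = 2.0, 3.0
n, k = 10, 7

# Conjugate update: the posterior is Be(alpha + k, beta + n - k).
post = stats.beta(alpha + k, beta + n - k)

# Cross-check against Bayes' rule on a grid: posterior ∝ likelihood × prior.
theta = np.linspace(1e-6, 1 - 1e-6, 1001)
unnorm = stats.binom.pmf(k, n, theta) * stats.beta.pdf(theta, alpha, beta)
grid_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

print(np.max(np.abs(grid_post - post.pdf(theta))))  # close to 0
```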

1.2 Multinomial distribution and Dirichlet distribution

  • The multinomial distribution generalizes the binomial distribution to trials with $K$ possible outcomes; likewise, the Dirichlet distribution generalizes the Beta distribution.

    • $\vec{X}|\vec{\theta}\sim Mult(n,\vec{\theta}), \quad f(\vec{x}; n, \vec{\theta})=\frac{n!}{x_{1}! \cdots x_{k}!} \prod_{i=1}^{k}\theta_i^{x_i}, \quad \sum_{i=1}^{k} \theta_i=1$

    • $\vec{\theta} \sim Dir(\vec{\alpha}), \quad f(\vec{\theta}; \vec{\alpha})=\mathrm{B}(\vec{\alpha})^{-1}\prod_{i=1}^{k}\theta_i^{\alpha_i-1}, \quad \sum_{i=1}^{k} \theta_i=1$

  • $posterior: \quad \vec{\theta}|\vec{X} \sim Dir(\vec{\alpha}+\vec{x})$
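
The Dirichlet–multinomial conjugacy can be sketched the same way; the hyperparameters and counts below are made-up illustration values:

```python
import numpy as np

# Illustrative values: Dir(alpha) prior over K = 3 outcomes, observed counts x.
alpha = np.array([1.0, 2.0, 3.0])
x = np.array([5, 0, 2])

# Conjugate update: the posterior is Dir(alpha + x).
alpha_post = alpha + x

# Posterior mean of theta (E[theta_i] = alpha_i / sum(alpha) for a Dirichlet).
print(alpha_post / alpha_post.sum())
```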

1.3 Exponential family

2 Discriminative model vs. Generative model

  • label vector $y$, feature vector $x$
  • A generative model describes how a feature vector is "generated" from a label, i.e. it models $p(x|y)$.
  • A discriminative model "assigns" a label given a feature vector, i.e. it models $p(y|x)$.
  • The two are connected by Bayes' rule, $p(y|x) \propto p(x|y)\,p(y)$; for example, naive Bayes is generative while logistic regression is discriminative (a numerical sketch follows this list).
  • Supervised and unsupervised learning differ in whether the label vector is observed.
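
A minimal sketch of turning the generative pieces into a discriminative quantity via Bayes' rule; the class prior and class-conditional probabilities are made-up toy numbers:

```python
import numpy as np

# Toy binary problem: y ∈ {0, 1}, a single binary feature x ∈ {0, 1}.
p_y = np.array([0.6, 0.4])            # prior p(y)
p_x_given_y = np.array([[0.8, 0.2],   # p(x | y=0) for x = 0, 1
                        [0.3, 0.7]])  # p(x | y=1) for x = 0, 1

# Discriminative quantity p(y | x) obtained from the generative model:
x = 1
joint = p_y * p_x_given_y[:, x]       # p(x, y) = p(x | y) p(y)
p_y_given_x = joint / joint.sum()     # Bayes' rule: normalize over y
print(p_y_given_x)                    # posterior over labels given x = 1
```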

4 LDA model

4.1 model description

  • 4.1.1 Notation

    • (superscript $v$) word:
      vocabulary indexed by $\{1...V\}$; a word is a one-hot vector with $w^v=1$ and $w^u=0$ for $u \neq v$

    • (subscript $n$) document: a sequence of $N$ words $\mathbf{w}=(w_1, ..., w_N)$

    • (subscript $d$) corpus: $M$ documents $D=\{\mathbf{w}_1, ..., \mathbf{w}_M\}$

  • 4.1.2 model

    • $N \sim Poisson(\lambda)$
    • $\theta \sim Dir(\alpha)$
    • For each word $w_n$ (a sampling sketch of this generative process appears after Section 4.1.5):
      • topic $z_n \sim Mult(\theta)$
      • word $w_n \sim p(w_n|z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
  • 4.1.3 parameters interpretation

    • corpus level: $\alpha, \beta$ (i.e. one sample per corpus)
    • document level: $\theta_d$
    • word level: $z_{dn}, w_{dn}$
  • 4.1.4 Simplifying

    • the dimension $k$ of the Dirichlet distribution (i.e. the dimension of the topic variable $z$) is fixed and known
    • the word probabilities are parameterized by a $k \times V$ matrix $\beta$, a fixed quantity to be estimated
    • the document length $N$ is not critical: it follows the Poisson assumption, and its randomness can be ignored
  • 4.1.5 model Distribution

    • given the parameters $\alpha, \beta$, the joint distribution of $\theta, \mathbf{z}, \mathbf{w}$ is:
      $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)=p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_{n} \mid \theta)\, p(w_{n} \mid z_{n}, \beta)$

      • Note $p(z_n|\theta)=\theta_i$ for the unique $i$ such that $z_n^i=1$
      • De Finetti’s representation theorem: Infinitely exchangeable observations are conditionally independent and identically distributed relative to the conditioned latent variable.
    • document marginal distribution (a Monte Carlo sketch of this integral appears after Section 4.1.5):
      $p(\mathbf{w} \mid \alpha, \beta)=\int p(\theta \mid \alpha)\left(\prod_{n=1}^{N} \sum_{z_{n}} p(z_{n} \mid \theta)\, p(w_{n} \mid z_{n}, \beta)\right) d\theta$

    • corpus marginal distribution:
      $p(D \mid \alpha, \beta)=\prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$
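
As a minimal sketch of Sections 4.1.2 and 4.1.5, the code below first samples a document from the generative process and then estimates the document marginal $p(\mathbf{w}\mid\alpha,\beta)$ by Monte Carlo over $\theta$; the values of $k$, $V$, $\lambda$, $\alpha$, $\beta$ and the number of samples are made up for illustration, not taken from any corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, lam = 3, 8, 10                      # number of topics, vocabulary size, Poisson rate
alpha = np.full(k, 0.5)                   # Dirichlet hyperparameter (symmetric, illustrative)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word matrix, rows sum to 1

def generate_document():
    """Sample one document (a list of word indices) from the LDA generative process."""
    N = rng.poisson(lam)                        # document length N ~ Poisson(lambda)
    theta = rng.dirichlet(alpha)                # topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)              # topic assignment z_n ~ Mult(theta)
        words.append(rng.choice(V, p=beta[z]))  # word w_n ~ Mult(beta_{z_n})
    return words

def log_marginal(words, num_samples=5000):
    """Monte Carlo estimate of log p(w | alpha, beta): average over theta ~ Dir(alpha)
    of prod_n sum_z p(z_n | theta) p(w_n | z_n, beta)."""
    thetas = rng.dirichlet(alpha, size=num_samples)  # (S, k)
    word_probs = thetas @ beta[:, words]             # (S, N): mixture prob of each word
    return np.log(word_probs.prod(axis=1).mean())

doc = generate_document()
print(doc, log_marginal(doc))
```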

4.2 model relationship

  • 4.2.1 Unigram
  • 4.2.2 Mixture of Unigrams
  • 4.2.3 Probabilistic latent semantic indexing
  • 4.2.4 geometric interpretation

4.3 barriers to learning and inference

  • 3 key targets of a probabilistic graphical model
    • representation
    • learning: structure learning and parameter estimation
    • inference: computing the conditional probabilities of the remaining variables given observed variables

5 Variational Inference

5.1 General framework

  • why VI

    • an EM-like approximate inference method: it searches for a local optimum

    • The E-step of the EM algorithm:

      fix $\theta_t$ and set $q_{t+1}(z)=p(z|x,\theta_t)$, which maximizes $L(q, x, \theta_t)$

      If the posterior $p(z|x,\theta_t)$ is intractable (e.g. when there are multiple coupled latent variables), variational inference is used instead.

  • general method

    • Target: find a tractable distribution $q(z)$ to approximate the posterior, i.e. minimize the KL divergence $KL(q(z)\,||\,p(z|x))$

      • parameters and constants are omitted in the notation below.
    • Key decomposition (numerically checked in the first sketch at the end of this section):
      $\ln p(x)=L(q(z))+KL(q(z)\,||\,p(z|x))$

    • ELBO, the (log) evidence lower bound:
      $\begin{aligned} L(q(z)) &= <\ln p(x,z)>_q - <\ln q(z)>_q \\ &= <\ln p(x|z)>_q - KL(q(z)\,||\,p(z)) \end{aligned}$
      $\ln p(x) \geq L(q(z))$

    • Mean-field approximation:
      The mean-field idea originates in statistical mechanics, where it reduces a many-body problem to a one-body problem. It assumes the variational distribution factorizes:
      $q(z)=\prod_i q_i(z_i)=\prod_i q_i$

    • ELBO with mean-field:

      $\begin{aligned} L(q) &= <E(x,z)>_q + H(q) \\ \text{Energy}\quad E(x,z) &= \ln p(x,z) \\ \text{Entropy}\quad H(q) &= -\int dz\, q(z)\ln q(z) \end{aligned}$

    • partition:
      $\{z_i, \bar{z_i}\} = \{z\}$

    • Energy:
      $\begin{aligned} <E(x,z)>_q &= <<E(x,z)>_{\bar{q_i}}>_{q_i} \\ &= <\bar{E}(x,z_i)>_{q_i} \\ &= <\ln\exp\bar{E}(x,z_i)>_{q_i} \\ &= \int q_i \ln q_i^*(z_i,x)\, dz_i + \ln Z \\ q_i^*(z_i) &= Z^{-1}\exp\{<E(x,z)>_{\bar{q_i}}\} \end{aligned}$

      • Entropy:
        $H(q)=\sum_i H_i=\sum_i \int -dz_i\, q_i(z_i)\ln q_i(z_i)$
    • ELBO:
      $L(q) = -KL(q_i(z_i)\,||\,q_i^*(x,z_i)) + \sum_{j\neq i} H_j + \ln Z$

    • optimum when $q_{j\neq i}$ are held fixed (see the coordinate-update sketch at the end of this section):
      $\begin{aligned} q_i(z_i) &= q_i^*(x,z_i) \\ &= Z^{-1}\exp\{<E(x,z)>_{\bar{q_i}}\} \\ \ln q_i(z_i) &= <\ln p(x,z)>_{\bar{q_i}} + const \end{aligned}$
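
A minimal numerical check of the decomposition $\ln p(x) = L(q(z)) + KL(q(z)\,||\,p(z|x))$, using a made-up two-state discrete model and an arbitrary $q(z)$:

```python
import numpy as np

# Toy model: latent z ∈ {0, 1}, one observed outcome x (all numbers are made up).
p_z = np.array([0.7, 0.3])          # prior p(z)
p_x_given_z = np.array([0.2, 0.9])  # likelihood p(x | z) of the observed x
q_z = np.array([0.5, 0.5])          # an arbitrary variational distribution q(z)

p_xz = p_z * p_x_given_z            # joint p(x, z)
p_x = p_xz.sum()                    # evidence p(x)
p_z_given_x = p_xz / p_x            # exact posterior p(z | x)

elbo = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))   # <ln p(x,z)>_q - <ln q>_q
kl = np.sum(q_z * np.log(q_z / p_z_given_x))                    # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)       # the two numbers agree
```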
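
And a sketch of the mean-field coordinate update $\ln q_i(z_i) = <\ln p(x,z)>_{\bar{q_i}} + const$ on a made-up joint table over two binary latent variables (the table entries and the number of iterations are arbitrary):

```python
import numpy as np

# Made-up joint p(z1, z2) (think of it as p(x, z) with the observed x absorbed).
p = np.array([[0.30, 0.10],
              [0.15, 0.45]])
p = p / p.sum()

# Mean-field factorization q(z1, z2) = q1(z1) q2(z2), initialized uniformly.
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])

for _ in range(50):
    # ln q1(z1) = <ln p(z1, z2)>_{q2} + const  ->  exponentiate and normalize
    log_q1 = np.log(p) @ q2
    q1 = np.exp(log_q1 - log_q1.max())
    q1 /= q1.sum()
    # ln q2(z2) = <ln p(z1, z2)>_{q1} + const
    log_q2 = q1 @ np.log(p)
    q2 = np.exp(log_q2 - log_q2.max())
    q2 /= q2.sum()

print(q1, q2)   # mean-field approximation to the marginals of p
```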

6 Sampling
