LDA model

This post takes a close look at the LDA model: the Dirichlet distribution, the difference between generative and discriminative models, the LDA model description and the interpretation of its parameters, and the general framework of variational inference. LDA is a topic-modeling method that describes the word distribution of documents through a Dirichlet prior and multinomial distributions.


1 Dirichlet distribution

1.1 Binomial distribution and Beta distribution

  • Bayes rule: $posterior \propto likelihood \times prior$

  • Conjugate prior: chosen so that the posterior has a convenient mathematical form when estimating the parameter $p$ by maximum a posteriori (MAP)

    • $X|\theta \sim B(n, \theta)$, $f(x; n, \theta)=\binom{n}{x}\theta^{x}(1-\theta)^{n-x}$

    • $\theta \sim \operatorname{Be}(\alpha, \beta)$, $f(\theta; \alpha, \beta)=\mathrm{B}(\alpha, \beta)^{-1}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$

  • With a Beta prior on the binomial parameter $p$, the posterior is again a Beta distribution:

    $posterior: \quad \theta|X \sim Be(\alpha+k, \beta+n-k)$
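
A minimal numerical sketch of this conjugate update; the prior hyperparameters and the counts $n$, $k$ below are arbitrary illustration values:

```python
import numpy as np
from scipy import stats

# Illustrative values: prior Be(alpha, beta); data: k successes in n trials.
alpha, beta = 2.0, 3.0
n, k = 10, 7

# Conjugate update: the posterior is Be(alpha + k, beta + n - k).
post = stats.beta(alpha + k, beta + n - k)

# Cross-check against Bayes' rule on a grid: posterior ∝ likelihood × prior.
theta = np.linspace(1e-6, 1 - 1e-6, 1001)
unnorm = stats.binom.pmf(k, n, theta) * stats.beta.pdf(theta, alpha, beta)
grid_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

print(np.max(np.abs(grid_post - post.pdf(theta))))  # close to 0
```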

1.2 Multinomial distribution and Dirichlet distribution

  • The multinomial distribution generalizes the binomial distribution to trials with $K$ possible outcomes; likewise, the Dirichlet distribution generalizes the Beta distribution.

    • $\vec{X}|\vec{\theta}\sim Mult(n,\vec{\theta}), \quad f(\vec{x}; n, \vec{\theta})=\frac{n!}{x_{1}! \cdots x_{k}!} \prod_{i=1}^{k}\theta_i^{x_i}, \quad \sum_{i=1}^{k} \theta_i=1$

    • $\vec{\theta} \sim Dir(\vec{\alpha}), \quad f(\vec{\theta}; \vec{\alpha})=\mathrm{B}(\vec{\alpha})^{-1}\prod_{i=1}^{k}\theta_i^{\alpha_i-1}, \quad \sum_{i=1}^{k} \theta_i=1$

  • $posterior: \quad \vec{\theta}|\vec{X} \sim Dir(\vec{\alpha}+\vec{x})$
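
The Dirichlet–multinomial conjugacy can be sketched the same way; the hyperparameters and counts below are made-up illustration values:

```python
import numpy as np

# Illustrative values: Dir(alpha) prior over K = 3 outcomes, observed counts x.
alpha = np.array([1.0, 2.0, 3.0])
x = np.array([5, 0, 2])

# Conjugate update: the posterior is Dir(alpha + x).
alpha_post = alpha + x

# Posterior mean of theta (E[theta_i] = alpha_i / sum(alpha) for a Dirichlet).
print(alpha_post / alpha_post.sum())
```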

1.3 Exponential family

2 Discriminative model vs. Generative model

  • label vector $y$, feature vector $x$
  • A generative model describes how a feature vector is "generated" from a label, i.e. it models $p(x|y)$.
  • A discriminative model "assigns" a label given a feature vector, i.e. it models $p(y|x)$.
  • The two are connected by Bayes' rule, $p(y|x) \propto p(x|y)\,p(y)$; for example, naive Bayes is generative while logistic regression is discriminative (a numerical sketch follows this list).
  • Supervised and unsupervised learning differ in whether the label vector is observed.
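
A minimal sketch of turning the generative pieces into a discriminative quantity via Bayes' rule; the class prior and class-conditional probabilities are made-up toy numbers:

```python
import numpy as np

# Toy binary problem: y ∈ {0, 1}, a single binary feature x ∈ {0, 1}.
p_y = np.array([0.6, 0.4])            # prior p(y)
p_x_given_y = np.array([[0.8, 0.2],   # p(x | y=0) for x = 0, 1
                        [0.3, 0.7]])  # p(x | y=1) for x = 0, 1

# Discriminative quantity p(y | x) obtained from the generative model:
x = 1
joint = p_y * p_x_given_y[:, x]       # p(x, y) = p(x | y) p(y)
p_y_given_x = joint / joint.sum()     # Bayes' rule: normalize over y
print(p_y_given_x)                    # posterior over labels given x = 1
```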

4 LDA model

4.1 model description

  • 4.1.1 Notation

    • (superscript $v$) word:
      vocabulary indexed by $\{1...V\}$; a word is a one-hot vector with $w^v=1$ and $w^u=0$ for $u \neq v$

    • (subscript $n$) document: a sequence of $N$ words $\mathbf{w}=(w_1, ..., w_N)$

    • (subscript $d$) corpus: $M$ documents $D=\{\mathbf{w}_1, ..., \mathbf{w}_M\}$

  • 4.1.2 model

    • $N \sim Poisson(\lambda)$
    • $\theta \sim Dir(\alpha)$
    • For each word $w_n$ (a sampling sketch of this generative process appears after Section 4.1.5):
      • topic $z_n \sim Mult(\theta)$
      • word $w_n \sim p(w_n|z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
  • 4.1.3 parameters interpretation

    • corpus level: $\alpha, \beta$ (i.e. one sample per corpus)
    • document level: $\theta_d$
    • word level: $z_{dn}, w_{dn}$
  • 4.1.4 Simplifying

    • the dimension $k$ of the Dirichlet distribution (i.e. the dimension of the topic variable $z$) is fixed and known
    • the word probabilities are parameterized by a $k \times V$ matrix $\beta$, a fixed quantity to be estimated
    • the document length $N$ is not critical: it follows the Poisson assumption, and its randomness can be ignored
  • 4.1.5 model Distribution

    • given the parameters $\alpha, \beta$, the joint distribution of $\theta, \mathbf{z}, \mathbf{w}$ is:
      $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)=p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_{n} \mid \theta)\, p(w_{n} \mid z_{n}, \beta)$

      • Note $p(z_n|\theta)=\theta_i$ for the unique $i$ such that $z_n^i=1$
      • De Finetti’s representation theorem: Infinitely exchangeable observations are conditionally independent and identically distributed relative to the conditioned latent variable.
    • document marginal distribution (a Monte Carlo sketch of this integral appears after Section 4.1.5):
      $p(\mathbf{w} \mid \alpha, \beta)=\int p(\theta \mid \alpha)\left(\prod_{n=1}^{N} \sum_{z_{n}} p(z_{n} \mid \theta)\, p(w_{n} \mid z_{n}, \beta)\right) d\theta$

    • corpus marginal distribution:
      $p(D \mid \alpha, \beta)=\prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$
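
As a minimal sketch of Sections 4.1.2 and 4.1.5, the code below first samples a document from the generative process and then estimates the document marginal $p(\mathbf{w}\mid\alpha,\beta)$ by Monte Carlo over $\theta$; the values of $k$, $V$, $\lambda$, $\alpha$, $\beta$ and the number of samples are made up for illustration, not taken from any corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, lam = 3, 8, 10                      # number of topics, vocabulary size, Poisson rate
alpha = np.full(k, 0.5)                   # Dirichlet hyperparameter (symmetric, illustrative)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word matrix, rows sum to 1

def generate_document():
    """Sample one document (a list of word indices) from the LDA generative process."""
    N = rng.poisson(lam)                        # document length N ~ Poisson(lambda)
    theta = rng.dirichlet(alpha)                # topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)              # topic assignment z_n ~ Mult(theta)
        words.append(rng.choice(V, p=beta[z]))  # word w_n ~ Mult(beta_{z_n})
    return words

def log_marginal(words, num_samples=5000):
    """Monte Carlo estimate of log p(w | alpha, beta): average over theta ~ Dir(alpha)
    of prod_n sum_z p(z_n | theta) p(w_n | z_n, beta)."""
    thetas = rng.dirichlet(alpha, size=num_samples)  # (S, k)
    word_probs = thetas @ beta[:, words]             # (S, N): mixture prob of each word
    return np.log(word_probs.prod(axis=1).mean())

doc = generate_document()
print(doc, log_marginal(doc))
```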

4.2 model relationship

  • 4.2.1 Unigram
  • 4.2.2 Mixture of Unigrams
  • 4.2.3 Probabilistic latent semantic indexing
  • 4.2.4 geometric interpretation

4.3 barriers to learning and inference

  • 3 key targets of a probabilistic graphical model
    • representation
    • learning: structure learning and parameter estimation
    • inference: computing the conditional probabilities of the remaining variables given observed variables

5 Variational Inference

5.1 General framework

  • why VI

    • an EM-like approximate inference method: it searches for a local optimum

    • The E-step of the EM algorithm:

      fix $\theta_t$ and set $q_{t+1}(z)=p(z|x,\theta_t)$, which maximizes $L(q, x, \theta_t)$

      If the posterior $p(z|x,\theta_t)$ is intractable (e.g. when there are multiple coupled latent variables), variational inference is used instead.

  • general method

    • Target: find a tractable distribution $q(z)$ to approximate the posterior, i.e. minimize the KL divergence $KL(q(z)\,||\,p(z|x))$

      • parameters and constants are omitted in the notation below.
    • Key decomposition (numerically checked in the first sketch at the end of this section):
      $\ln p(x)=L(q(z))+KL(q(z)\,||\,p(z|x))$

    • ELBO, the (log) evidence lower bound:
      $\begin{aligned} L(q(z)) &= <\ln p(x,z)>_q - <\ln q(z)>_q \\ &= <\ln p(x|z)>_q - KL(q(z)\,||\,p(z)) \end{aligned}$
      $\ln p(x) \geq L(q(z))$

    • Mean-field approximation:
      The mean-field idea originates in statistical mechanics, where it reduces a many-body problem to a one-body problem. It assumes the variational distribution factorizes:
      $q(z)=\prod_i q_i(z_i)=\prod_i q_i$

    • ELBO with mean-field:

      $\begin{aligned} L(q) &= <E(x,z)>_q + H(q) \\ \text{Energy}\quad E(x,z) &= \ln p(x,z) \\ \text{Entropy}\quad H(q) &= -\int dz\, q(z)\ln q(z) \end{aligned}$

    • partition:
      $\{z_i, \bar{z_i}\} = \{z\}$

    • Energy:
      $\begin{aligned} <E(x,z)>_q &= <<E(x,z)>_{\bar{q_i}}>_{q_i} \\ &= <\bar{E}(x,z_i)>_{q_i} \\ &= <\ln\exp\bar{E}(x,z_i)>_{q_i} \\ &= \int q_i \ln q_i^*(z_i,x)\, dz_i + \ln Z \\ q_i^*(z_i) &= Z^{-1}\exp\{<E(x,z)>_{\bar{q_i}}\} \end{aligned}$

      • Entropy:
        $H(q)=\sum_i H_i=\sum_i \int -dz_i\, q_i(z_i)\ln q_i(z_i)$
    • ELBO:
      $L(q) = -KL(q_i(z_i)\,||\,q_i^*(x,z_i)) + \sum_{j\neq i} H_j + \ln Z$

    • optimum when $q_{j\neq i}$ are held fixed (see the coordinate-update sketch at the end of this section):
      $\begin{aligned} q_i(z_i) &= q_i^*(x,z_i) \\ &= Z^{-1}\exp\{<E(x,z)>_{\bar{q_i}}\} \\ \ln q_i(z_i) &= <\ln p(x,z)>_{\bar{q_i}} + const \end{aligned}$
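
A minimal numerical check of the decomposition $\ln p(x) = L(q(z)) + KL(q(z)\,||\,p(z|x))$, using a made-up two-state discrete model and an arbitrary $q(z)$:

```python
import numpy as np

# Toy model: latent z ∈ {0, 1}, one observed outcome x (all numbers are made up).
p_z = np.array([0.7, 0.3])          # prior p(z)
p_x_given_z = np.array([0.2, 0.9])  # likelihood p(x | z) of the observed x
q_z = np.array([0.5, 0.5])          # an arbitrary variational distribution q(z)

p_xz = p_z * p_x_given_z            # joint p(x, z)
p_x = p_xz.sum()                    # evidence p(x)
p_z_given_x = p_xz / p_x            # exact posterior p(z | x)

elbo = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))   # <ln p(x,z)>_q - <ln q>_q
kl = np.sum(q_z * np.log(q_z / p_z_given_x))                    # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)       # the two numbers agree
```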
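
And a sketch of the mean-field coordinate update $\ln q_i(z_i) = <\ln p(x,z)>_{\bar{q_i}} + const$ on a made-up joint table over two binary latent variables (the table entries and the number of iterations are arbitrary):

```python
import numpy as np

# Made-up joint p(z1, z2) (think of it as p(x, z) with the observed x absorbed).
p = np.array([[0.30, 0.10],
              [0.15, 0.45]])
p = p / p.sum()

# Mean-field factorization q(z1, z2) = q1(z1) q2(z2), initialized uniformly.
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])

for _ in range(50):
    # ln q1(z1) = <ln p(z1, z2)>_{q2} + const  ->  exponentiate and normalize
    log_q1 = np.log(p) @ q2
    q1 = np.exp(log_q1 - log_q1.max())
    q1 /= q1.sum()
    # ln q2(z2) = <ln p(z1, z2)>_{q1} + const
    log_q2 = q1 @ np.log(p)
    q2 = np.exp(log_q2 - log_q2.max())
    q2 /= q2.sum()

print(q1, q2)   # mean-field approximation to the marginals of p
```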

6 Sampling
