LDA Topic Model

布纸所云

Category: Natural Language Processing

Article link: https://blog.csdn.net/XindiOntheWay/article/details/81479032

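LDA (Latent Dirichlet Allocation) is a generative probabilistic topic model used to uncover the latent topics in a large document collection or corpus. For each document in the corpus, LDA defines the following generative process: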

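For each word position in the document, draw a topic from the document's topic distribution
Draw a word from the word distribution of the topic just drawn
Repeat the two steps above until every word in the document has been generated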

LDA treats each document as a mixture of several topics, and each topic as a probability distribution over words.
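The generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the topic distribution `theta`, and the per-topic word distributions `phi` are made-up values, and `generate_document` is a hypothetical helper name, not something from the original post. In full LDA these distributions are themselves drawn from Dirichlet priors rather than fixed by hand.

```python
import random

def generate_document(theta, phi, vocab, n_words, rng):
    """Generate one document word by word.

    theta: topic distribution of this document (length K)
    phi:   phi[z] is the word distribution of topic z (length V each)
    """
    topics, words = [], []
    for _ in range(n_words):
        # Step 1: draw a topic from the document's topic distribution.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Step 2: draw a word from the word distribution of that topic.
        w = rng.choices(vocab, weights=phi[z])[0]
        topics.append(z)
        words.append(w)
    return topics, words

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["gene", "dna", "ball", "game"]
theta = [0.7, 0.3]                 # this document is mostly about topic 0
phi = [
    [0.5, 0.4, 0.05, 0.05],        # topic 0 favors "gene" and "dna"
    [0.05, 0.05, 0.5, 0.4],        # topic 1 favors "ball" and "game"
]

rng = random.Random(0)
topics, words = generate_document(theta, phi, vocab, n_words=20, rng=rng)
print(topics)
print(words)
```

Every word gets its own topic draw, which is why a single document can mix several topics.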

Background

Conjugate prior distribution

In Bayesian probability theory, if the posterior distribution $p(\theta|x)$ is in the same family as the prior distribution $p(\theta)$, the prior and the posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

For example, the Beta distribution is the conjugate prior of the binomial distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution.

By Bayes' rule, the posterior distribution is proportional to the likelihood times the prior:

$p(\theta|x)=\frac{p(x|\theta)p(\theta)}{p(x)}=\frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)\,d\theta} \propto p(x|\theta)p(\theta)$
where

$p(x|\theta)$ is the likelihood,

$p(\theta)$ is the prior belief,

$p(x)$ is the evidence.
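As a small numerical check of Bayes' rule and of Beta-binomial conjugacy, the sketch below computes the posterior on a grid as likelihood times prior divided by the evidence, and compares it with the closed-form Beta posterior. The prior parameters `a`, `b` and the data `k`, `n` are made-up values chosen for illustration, and the helper names are hypothetical. With a $\mathrm{Beta}(a,b)$ prior and $k$ successes in $n$ trials, the posterior should be $\mathrm{Beta}(a+k,\ b+n-k)$.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

def grid_posterior(a, b, k, n, grid):
    """Posterior on a grid: likelihood * prior / evidence.

    The binomial coefficient is omitted because it cancels in the normalization.
    """
    unnorm = [t**k * (1 - t)**(n - k) * beta_pdf(t, a, b) for t in grid]
    step = grid[1] - grid[0]
    evidence = sum(unnorm) * step  # midpoint-rule approximation of p(x)
    return [u / evidence for u in unnorm]

a, b = 2.0, 3.0   # assumed prior: Beta(2, 3)
k, n = 7, 10      # assumed data: 7 successes in 10 trials

m = 2000
grid = [(i + 0.5) / m for i in range(m)]  # midpoints of m equal bins on (0, 1)

numeric = grid_posterior(a, b, k, n, grid)
closed_form = [beta_pdf(t, a + k, b + n - k) for t in grid]

max_err = max(abs(p - q) for p, q in zip(numeric, closed_form))
print("largest difference between grid and closed-form posterior:", max_err)
```

The two curves agree up to numerical integration error, which is the practical benefit of a conjugate prior: the integral in the denominator never has to be computed explicitly.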

Dirichlet Distribution

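The Dirichlet distribution is a probability distribution over $k$ ($k\geq 2$) variables $X_1,X_2,\cdots,X_k$, where $x_i \in (0,1)$ and $\sum_{i=1}^{k}x_i=1$. Its parameter is $\vec\alpha=\{\alpha_1,\alpha_2,\cdots,\alpha_k\}$ with $\alpha_i>0$ (the $\alpha_i$ need not be integers; any positive real number is allowed).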

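The larger $\alpha_i$ is relative to the other parameters, the more weight $X_i$ receives (recall $\sum_i x_i=1$); on average, $E[X_i]=\alpha_i/\sum_j \alpha_j$
When all $\alpha_i$ are equal, the distribution is symmetric
When the $\alpha_i<1$, they act as an anti-weight that pushes the $x_i$ toward the extremes, so the density piles up near the corners and edges of the simplex
When the $\alpha_i>1$, the $x_i$ cluster around a central value
When $\alpha_1=\cdots=\alpha_k=1$, the distribution is uniform over the simplex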

The figure shows the three-variable Dirichlet distribution under the following parameter settings:
1. $\alpha_1=\alpha_2=\alpha_3=1$
2. $\alpha_1=\alpha_2=\alpha_3=10$
3. $\alpha_1=1, \alpha_2=10, \alpha_3=5$
4. $\alpha_1=\alpha_2=\alpha_3=0.2$
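To get a feel for these four settings without a plot, one option is to draw samples and look at two summaries: the mean of each coordinate, and the average size of the largest coordinate (closer to 1 means samples sit near a corner of the simplex). The sketch below uses only the standard library and samples a Dirichlet by normalizing independent Gamma draws, which is a standard construction; the helper names `sample_dirichlet` and `summarize` and the sample count are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def summarize(alpha, n_samples, rng):
    """Return (per-coordinate sample mean, average of the largest coordinate)."""
    k = len(alpha)
    sums = [0.0] * k
    max_sum = 0.0
    for _ in range(n_samples):
        x = sample_dirichlet(alpha, rng)
        for i in range(k):
            sums[i] += x[i]
        max_sum += max(x)
    return [s / n_samples for s in sums], max_sum / n_samples

rng = random.Random(0)
settings = [
    [1, 1, 1],         # uniform over the simplex
    [10, 10, 10],      # symmetric, concentrated around the center
    [1, 10, 5],        # asymmetric, most weight on the second coordinate
    [0.2, 0.2, 0.2],   # symmetric, pushed toward the corners
]

results = {}
for alpha in settings:
    mean, avg_max = summarize(alpha, 20000, rng)
    results[tuple(alpha)] = (mean, avg_max)
    print(alpha, [round(v, 3) for v in mean], round(avg_max, 3))
```

The sample means should track $\alpha_i/\sum_j\alpha_j$, and the average largest coordinate should be smallest for $\alpha_i=10$ and largest for $\alpha_i=0.2$, matching the qualitative behavior listed above.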