Topic Models (1): Basic Principles of LDA

I. Mathematical Foundations

*** Binomial distribution ***

The binomial distribution describes $n$ independent Bernoulli trials, each with success probability $p$. The probability of exactly $k$ successes is:

$$P(X=k\mid n,p)=\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}$$
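The binomial formula can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` and the values `n = 10`, `p = 0.3` are illustrative assumptions, not part of the original article.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k | n, p): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative values (assumed): 10 trials, success probability 0.3.
n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probabilities over k = 0..n form a distribution, so they sum to 1
# (up to floating-point rounding).
total = sum(probs)
```

Summing over all possible values of $k$ is a quick sanity check that the formula defines a valid probability distribution.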

*** Multinomial distribution ***

The multinomial distribution generalizes the binomial distribution to more than two outcomes:

$$P(x_1,x_2,\cdots,x_k\mid n,p_1,p_2,\cdots,p_k)=\frac{n!}{\prod_{i=1}^k x_i!}\prod_{i=1}^k p_i^{x_i}$$
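To make the generalization concrete, here is a small sketch of the multinomial probability. The name `multinomial_pmf` is a hypothetical helper introduced for illustration; with two categories it should reduce to the binomial case.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(x_1..x_k | n, p_1..p_k), where n = sum(counts)."""
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! * ... * x_k!), kept in exact integers.
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)
    return coeff * prod(p**x for p, x in zip(probs, counts))


# With k = 2 categories this is the binomial probability of 3 successes
# in 10 trials with p = 0.3 (illustrative values).
two_category = multinomial_pmf([3, 7], [0.3, 0.7])

# Three equally likely categories, one observation of each.
three_category = multinomial_pmf([1, 1, 1], [1 / 3, 1 / 3, 1 / 3])
```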

*** Beta distribution ***

$$B(x\mid\alpha,\beta)=\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$$

where

$$B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

and $x\in[0,1]$.

$\Gamma(x)$ is the Gamma function, defined by the integral:

$$\Gamma(x)=\int_0^{\infty}t^{x-1}e^{-t}\,dt$$
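The Beta density can be evaluated directly from the Gamma function. The sketch below uses `math.gamma` from the standard library; the helper names and the parameters $\alpha=2$, $\beta=5$ are assumptions chosen for illustration. It checks numerically that the density integrates to 1 over $[0,1]$.

```python
from math import gamma


def beta_function(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of the Beta distribution at x in [0, 1]."""
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_function(a, b)


def integrate_midpoint(f, lo=0.0, hi=1.0, steps=10_000):
    """Simple midpoint-rule integration, accurate enough for a sanity check."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


# Illustrative parameters (assumed): alpha = 2, beta = 5.
area = integrate_midpoint(lambda x: beta_pdf(x, 2.0, 5.0))
```

The factor $1/B(\alpha,\beta)$ is exactly what makes the area equal to 1, which is why it is called the normalizing constant.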

*** Dirichlet distribution ***

The Dirichlet distribution generalizes the Beta distribution to higher dimensions:

$$Dir(x_1,x_2,\cdots,x_k\mid\alpha_1,\alpha_2,\cdots,\alpha_k)=\frac{\Gamma(\sum_{i=1}^k\alpha_i)}{\prod_{i=1}^k\Gamma(\alpha_i)}\prod_{i=1}^k x_i^{\alpha_i-1}$$

where $x_i\in[0,1]$, $\sum_{i=1}^k x_i=1$, and $\Gamma(x)$ is the Gamma function.

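When all $\alpha_i$ take the same value, the distribution is called a symmetric Dirichlet distribution. In that case there is only one parameter, $\alpha$, called the concentration parameter. When $\alpha<1$, the probability mass of a sample concentrates on a few components, so a few topics dominate and stand out clearly; when $\alpha>1$, samples move toward the uniform vector and the topics are spread more evenly.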
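The effect of the concentration parameter can be seen by sampling. The sketch below draws from a symmetric Dirichlet distribution by normalizing independent Gamma variates, a standard construction, and compares the average largest component for a small and a large $\alpha$. The helper names, the dimension `k = 5`, the trial count, and the seed are all illustrative assumptions.

```python
import random


def sample_symmetric_dirichlet(alpha: float, k: int, rng: random.Random):
    """Draw one sample from Dir(alpha, ..., alpha) with k components."""
    # Normalizing k independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_component(alpha: float, k: int = 5,
                           trials: int = 2000, seed: int = 0) -> float:
    """Average of the largest component over many samples."""
    rng = random.Random(seed)
    return sum(
        max(sample_symmetric_dirichlet(alpha, k, rng)) for _ in range(trials)
    ) / trials


# Small alpha: one component tends to dominate each sample.
sparse = mean_largest_component(0.1)
# Large alpha: components tend to be balanced, near 1/k each.
even = mean_largest_component(10.0)
```

In LDA terms, a small $\alpha$ on the document-topic prior corresponds to documents that are each dominated by a few topics.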

*** Conjugate prior distribution ***

Bayes' theorem gives the formula below. Since $P(x)$ depends only on $x$ and not on $\theta$, it acts purely as a normalizing constant and can be ignored when optimizing over the parameters.

$$P(\theta\mid x)=\frac{P(x\mid\theta)P(\theta)}{P(x)}\propto P(x\mid\theta)P(\theta)$$

Here $P(\theta\mid x)$ is the posterior distribution, $P(x\mid\theta)$ is the likelihood function, and $P(\theta)$ is the prior distribution.

When the prior $P(\theta)$ and the posterior $P(\theta\mid x)$ belong to the same family of distributions, they are called conjugate distributions, and the prior $P(\theta)$ is called the conjugate prior of the likelihood function $P(x\mid\theta)$.

*** Beta-Binomial conjugacy ***

The Beta distribution is:

$$B(x\mid\alpha,\beta)=\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$$

The binomial distribution is:

$$P(X=k\mid n,p)=\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}$$

If multiplying the binomial likelihood by a Beta prior yields a posterior that is again a Beta distribution, then the Beta and binomial distributions are conjugate, and the Beta distribution is the conjugate prior of the binomial distribution. The proof is as follows:

$$\begin{aligned} P(x\mid k)&\propto\frac{n!}{k!(n-k)!}x^k(1-x)^{n-k}\cdot\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1} \\ &=\frac{n!}{k!(n-k)!\,B(\alpha,\beta)}x^{(k+\alpha)-1}(1-x)^{(n-k+\beta)-1} \\ &\propto x^{(k+\alpha)-1}(1-x)^{(n-k+\beta)-1} \end{aligned}$$

Here $x$ is the success probability (the parameter $\theta$ in Bayes' theorem) and $k$ is the observed number of successes in $n$ trials. The leading factor does not depend on $x$, so after normalization the posterior is $B(x\mid k+\alpha,\,n-k+\beta)$. The derivation therefore shows that the posterior follows a Beta distribution.
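The conjugacy result can be verified numerically: the product of the binomial likelihood and the Beta prior should differ from the Beta density with parameters $(\alpha+k,\ \beta+n-k)$ only by a factor that does not depend on $x$. The following sketch checks this at a few points; the function names and the numbers ($\alpha=2$, $\beta=3$, $k=7$, $n=10$) are assumptions for illustration.

```python
from math import comb, gamma


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of the Beta distribution at x in (0, 1)."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def posterior_params(alpha: float, beta: float, k: int, n: int):
    """Beta prior + k successes in n trials -> Beta posterior parameters."""
    return alpha + k, beta + (n - k)


def unnormalized_posterior(x, alpha, beta, k, n):
    """Binomial likelihood times Beta prior, without dividing by P(data)."""
    likelihood = comb(n, k) * x**k * (1 - x) ** (n - k)
    return likelihood * beta_pdf(x, alpha, beta)


# Illustrative values (assumed): Beta(2, 3) prior, 7 successes in 10 trials.
alpha, beta, k, n = 2.0, 3.0, 7, 10
a_post, b_post = posterior_params(alpha, beta, k, n)

# If the posterior is Beta(a_post, b_post), this ratio is the same for every x.
ratios = [
    unnormalized_posterior(x, alpha, beta, k, n) / beta_pdf(x, a_post, b_post)
    for x in (0.2, 0.5, 0.8)
]
```

In practice this means a Bayesian update reduces to adding the observed success and failure counts to the prior parameters, with no integration required.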

*** Dirichlet-Multinomial conjugacy ***

The Dirichlet distribution is:

$$Dir(x_1,x_2,\cdots,x_k\mid\alpha_1,\alpha_2,\cdots,\alpha_k)=\frac{\Gamma(\sum_{i=1}^k\alpha_i)}{\prod_{i=1}^k\Gamma(\alpha_i)}\prod_{i=1}^k x_i^{\alpha_i-1}$$

The multinomial distribution is:

$$P(x_1,x_2,\cdots,x_k\mid n,p_1,p_2,\cdots,p_k)=\frac{n!}{\prod_{i=1}^k x_i!}\prod_{i=1}^k p_i^{x_i}$$

If multiplying the multinomial likelihood by a Dirichlet prior yields a posterior that is again a Dirichlet distribution, then the Dirichlet and multinomial distributions are conjugate, and the Dirichlet distribution is the conjugate prior of the multinomial distribution. The proof is as follows:

$$\begin{aligned} P(\mathbf{x}\mid\mathbf{k})&\propto\frac{n!}{\prod_{i=1}^K k_i!}\prod_{i=1}^K x_i^{k_i}\cdot\frac{\Gamma(\sum_{i=1}^K\alpha_i)}{\prod_{i=1}^K\Gamma(\alpha_i)}\prod_{i=1}^K x_i^{\alpha_i-1} \\ &=\frac{n!\,\Gamma(\sum_{i=1}^K\alpha_i)}{\prod_{i=1}^K k_i!\prod_{i=1}^K\Gamma(\alpha_i)}\prod_{i=1}^K x_i^{(k_i+\alpha_i)-1} \\ &\propto\prod_{i=1}^K x_i^{(k_i+\alpha_i)-1} \end{aligned}$$

The leading factor does not depend on $\mathbf{x}$, so normalizing gives

$$P(\mathbf{x}\mid\mathbf{k})=\frac{\Gamma(\sum_{i=1}^K(k_i+\alpha_i))}{\prod_{i=1}^K\Gamma(k_i+\alpha_i)}\prod_{i=1}^K x_i^{(k_i+\alpha_i)-1}$$

where

$$\sum_{i=1}^K k_i=n,\qquad\sum_{i=1}^K x_i=1$$

Here $\mathbf{x}=(x_1,\cdots,x_K)$ plays the role of the parameter $\theta$ in Bayes' theorem, and the counts $\mathbf{k}=(k_1,\cdots,k_K)$ are the observed data. The derivation therefore shows that the posterior follows a Dirichlet distribution with parameters $\alpha_i+k_i$.
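The same numerical check works in the multivariate case. In the sketch below, the posterior parameters are obtained by adding the observed counts to the prior parameters, and the ratio between the unnormalized posterior and the Dirichlet density with those parameters is compared at several points of the simplex. All names and numbers are illustrative assumptions; the multinomial coefficient is omitted because it is constant in $\mathbf{x}$.

```python
from math import gamma, prod


def dirichlet_pdf(x, alpha):
    """Density of Dir(alpha) at a point x on the probability simplex."""
    norm = gamma(sum(alpha)) / prod(gamma(a) for a in alpha)
    return norm * prod(xi ** (a - 1) for xi, a in zip(x, alpha))


def posterior_alpha(alpha, counts):
    """Dirichlet prior + multinomial counts -> Dirichlet posterior parameters."""
    return [a + k for a, k in zip(alpha, counts)]


def unnormalized_posterior(x, alpha, counts):
    """Multinomial likelihood kernel times the Dirichlet prior density."""
    likelihood_kernel = prod(xi**k for xi, k in zip(x, counts))
    return likelihood_kernel * dirichlet_pdf(x, alpha)


# Illustrative values (assumed): three categories, six observations.
alpha = [1.0, 2.0, 3.0]
counts = [4, 0, 2]
post = posterior_alpha(alpha, counts)

# Points on the simplex (each sums to 1); the ratio should not depend on x.
points = [(0.2, 0.3, 0.5), (0.6, 0.1, 0.3), (0.1, 0.8, 0.1)]
ratios = [
    unnormalized_posterior(x, alpha, counts) / dirichlet_pdf(x, post)
    for x in points
]
```

This additive update of counts is the property that LDA inference relies on when word and topic counts are combined with Dirichlet priors.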

*** An important property of the Beta / Dirichlet distributions ***

If
