Dirichlet Process Models (2): Three Classic Analogies for the Dirichlet Process

duskwaitor

Article link: https://blog.csdn.net/duskwaitor/article/details/41743063

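This post walks through three classic analogies for the Dirichlet process, one at a time:

First, the stick-breaking model:

The generative model that assigns observations to groups can be viewed as a stick-breaking process. It splits the total probability mass of a variable, which is 1, into k non-overlapping parts. (The source material calls this the "support" of the variable; what is being carved up here is the unit of probability, not the "support" of association-rule mining.) We start with a stick of unit length and break it, generating random points on the stick with the following algorithm: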

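(1) Draw a random variable beta1 ~ Beta(1, alpha0).

(2) Use this random variable to mark the break point on the stick.

(3) Repeat k-1 times:

Draw a random variable betai ~ Beta(1, alpha0).

Determine the next random variable I marking a break point; this break falls on the piece of stick left over after the previous break. Under the standard stick-breaking rule, the i-th fraction is applied to whatever remains of the stick:

$$I_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j)$$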

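In this analogy the stick ends up cut into k pieces. The length of each piece is a probability, and the k probabilities together form one discrete probability distribution. Because the break points are random, each run yields a different such distribution, which is why the Dirichlet process is a distribution over distributions. Note that the pieces are produced in a definite order while the stick is being broken, but the resulting probability distribution (the inner distribution in "distribution over distributions") has no order. Also, k may be infinite, in which case the number of pieces is infinite as well.

Below is a Python implementation of stick breaking: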

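import numpy as np
from numpy.random import beta

def stick_breaking(alpha, k):
    # Break fractions beta_i ~ Beta(1, alpha), one per piece.
    betas = beta(1, alpha, k)
    # Stick length left before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining_pieces = np.append(1, np.cumprod(1 - betas[:-1]))
    p = betas * remaining_pieces
    # The stick is truncated at k pieces, so renormalize to sum to 1.
    return p/p.sum()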
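
As a complementary sketch (not the code above), the same break rule can be written with only the standard library and without the final renormalization, which makes it visible how much of the stick is still unbroken after k pieces. The function name `stick_breaking_with_leftover` and the `seed` parameter are assumptions made for this illustration.

```python
import random

def stick_breaking_with_leftover(alpha, k, seed=None):
    """Break a unit stick k times; return (pieces, leftover)."""
    rng = random.Random(seed)
    pieces = []
    remaining = 1.0  # length of stick not yet broken off
    for _ in range(k):
        frac = rng.betavariate(1, alpha)  # beta_i ~ Beta(1, alpha)
        pieces.append(frac * remaining)   # beta_i * prod_{j<i} (1 - beta_j)
        remaining *= 1.0 - frac
    return pieces, remaining

if __name__ == "__main__":
    for alpha in (0.5, 5.0):
        pieces, leftover = stick_breaking_with_leftover(alpha, k=10, seed=0)
        print(alpha, [round(p, 3) for p in pieces], round(leftover, 3))
```

With a small alpha the first few pieces tend to take most of the stick; with a larger alpha the mass is spread over more pieces and more of the stick is typically left over after k breaks.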

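Chinese restaurant model:

Suppose a friend of yours went to a Chinese restaurant yesterday:

(1) The restaurant is empty.

(2) The first person, Alice, walks in, sits down at a table, and orders (that is, chooses the parameters for this group). Anyone else who picks Alice's table must eat what Alice ordered.

(3) Bob comes in second. With probability alpha/(1+alpha) he sits at a new table and orders; with probability 1/(1+alpha) he sits at Alice's table.

...

(n+2) The (n+1)-th person comes in. With probability alpha/(n+alpha) they sit at a new table and order; with probability nk/(n+alpha) they sit at table k, where nk people are already seated.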

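Note a few things:

a. The more people at a table, the more likely newcomers are to sit there. In other words, the clustering follows the Matthew effect (the rich get richer).

b. There is always some probability that a person opens a new table.

c. How likely a new table is depends on alpha, which we call the dispersion parameter. It controls how spread out the data are: the lower alpha is, the fewer and tighter the clusters; the larger alpha is, the more dispersed the data and the more clusters we get.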
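
The seating rule can be sketched as a small function that returns the probabilities facing the next customer. This is only an illustration; the name `seating_probabilities` and the list-of-counts representation are assumptions, not part of the original post.

```python
def seating_probabilities(table_counts, alpha):
    """Return (probs_existing, prob_new) for the next customer.

    table_counts[k] is the number of customers already at table k.
    """
    n = sum(table_counts)  # customers seated so far
    denom = n + alpha
    probs_existing = [n_k / denom for n_k in table_counts]
    prob_new = alpha / denom
    return probs_existing, prob_new

if __name__ == "__main__":
    # Three tables with 5, 3 and 1 customers; alpha = 1.
    existing, new = seating_probabilities([5, 3, 1], alpha=1.0)
    print(existing, new)  # [0.5, 0.3, 0.1] 0.1
```

The fullest table gets the largest share (point a), the new-table probability is positive for any alpha > 0 (point b), and raising alpha shifts probability from the existing tables to a new one (point c).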

Below is Ruby code for the Chinese restaurant process:

# Generate table assignments for `num_customers` customers, according to
# a Chinese Restaurant Process with dispersion parameter `alpha`.
#
# returns an array of integer table assignments
def chinese_restaurant_process(num_customers, alpha)
  return [] if num_customers <= 0

  table_assignments = [1] # first customer sits at table 1
  next_open_table = 2 # index of the next empty table

  # Now generate table assignments for the rest of the customers.
  1.upto(num_customers - 1) do |i|
    if rand < alpha.to_f / (alpha + i)
      # Customer sits at new table.
      table_assignments << next_open_table
      next_open_table += 1
    else
      # Customer sits at an existing table.
      # He chooses which table to sit at by giving equal weight to each
      # customer already sitting at a table.
      which_table = table_assignments[rand(table_assignments.size)]
      table_assignments << which_table
    end
  end

  table_assignments
end

Running this code:

> chinese_restaurant_process(num_customers = 10, alpha = 1)
1, 2, 3, 4, 3, 3, 2, 1, 4, 3 # table assignments from run 1
1, 1, 1, 1, 1, 1, 2, 2, 1, 3 # table assignments from run 2
1, 2, 2, 1, 3, 3, 2, 1, 3, 4 # table assignments from run 3

> chinese_restaurant_process(num_customers = 10, alpha = 3)
1, 2, 1, 1, 3, 1, 2, 3, 4, 5
1, 2, 3, 3, 4, 3, 4, 4, 5, 5
1, 1, 2, 3, 1, 4, 4, 3, 1, 1

> chinese_restaurant_process(num_customers = 10, alpha = 5)
1, 2, 1, 3, 4, 5, 6, 7, 1, 8
1, 2, 3, 3, 4, 5, 6, 5, 6, 7
1, 2, 3, 4, 5, 6, 2, 7, 2, 1

As alpha increases, the number of tables increases.
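
This trend can also be checked against a closed form. Customer i+1 opens a new table with probability alpha/(alpha+i), so the expected number of tables after n customers is the sum of those probabilities for i = 0, ..., n-1. The sketch below compares that sum with a simulation. It is a Python illustration written for this purpose (the function names are assumptions), and it only tracks whether a new table is opened, since that is all the table count depends on.

```python
import random

def expected_tables(num_customers, alpha):
    """Expected table count: sum of the new-table probabilities."""
    return sum(alpha / (alpha + i) for i in range(num_customers))

def simulate_tables(num_customers, alpha, rng):
    """Seat customers one by one and return the number of tables used."""
    tables = 0
    for i in range(num_customers):  # i customers are already seated
        if rng.random() < alpha / (alpha + i):
            tables += 1  # open a new table
    return tables

if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (1, 3, 5):
        runs = [simulate_tables(10, alpha, rng) for _ in range(20000)]
        print(alpha, round(expected_tables(10, alpha), 3), sum(runs) / len(runs))
```

For 10 customers the closed form gives roughly 2.93, 4.81 and 5.84 tables for alpha = 1, 3 and 5, in line with the upward trend in the sample runs.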

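Polya urn model:

The urn contains alpha*G0(x) balls of color x, where G0 is our base distribution and G0(x) is the probability of drawing x from G0.

At each time step, draw a ball from the urn, note its color, and then return the original ball to the urn together with a new ball of the same color.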
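
The urn scheme can be sketched as follows. For illustration, the base distribution G0 is assumed to be uniform over a small, made-up list of colors; the function name `polya_urn` and its parameters are likewise assumptions. Drawing from an urn that holds alpha*G0(x) initial balls of color x plus one added ball per earlier draw is equivalent to drawing a fresh color from G0 with probability alpha/(alpha+n), and otherwise repeating a uniformly chosen earlier draw, which is what the code does.

```python
import random

def polya_urn(num_draws, alpha, base_colors, seed=None):
    """Return a list of colors drawn from a Polya urn with base G0.

    G0 is assumed uniform over base_colors (an illustrative choice).
    """
    rng = random.Random(seed)
    draws = []
    for n in range(num_draws):  # n balls have been added so far
        if rng.random() < alpha / (alpha + n):
            color = rng.choice(base_colors)  # initial mass: draw from G0
        else:
            color = rng.choice(draws)  # an added ball: repeat a past draw
        draws.append(color)
    return draws

if __name__ == "__main__":
    print(polya_urn(15, alpha=1.0, base_colors=["red", "green", "blue"], seed=0))
```

The branch probabilities are the same as the seating rule of the Chinese restaurant model: colors that have already been drawn often become more likely to be drawn again.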

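The Dirichlet process model can be used for clustering. Its advantage is that the number of clusters does not have to be specified in advance. Someone has written Dirichlet process clustering code along with a very small toy example; see the address below: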

http://www.ece.sunysb.edu/~zyweng/dpcluster.html
