GSDMM Short Text Clustering

[1] Yin J., Wang J. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014: 233-242.

GSDMM (a Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model) [1] differs from LDA in that it targets short documents and assumes each document belongs to exactly one topic, whereas LDA assumes each document mixes several topics and estimates each topic's contribution to the document. Short text clustering is the task of grouping a large collection of short texts (e.g., microblog posts or comments) by some similarity measure and partitioning them into a small number of clusters.

Advantages

  1. Infers the number of clusters automatically and converges quickly
    • You only set an upper bound K; the algorithm settles on the actual number of clusters by itself.
  2. Strikes a balance between completeness and homogeneity
    • Completeness: every short text involved in the computation ends up assigned to some concrete cluster
    • Homogeneity: all short texts assigned to the same cluster are strongly similar to one another
  3. Handles sparse, high-dimensional short texts well, and yields representative words for each cluster (a sketch follows this list)
  4. Outperforms other clustering algorithms in the experiments reported in [1]
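
As a sketch of point 3: once a model has been fit (the MovieGroupProcess class is given in the Code section below), the representative words of each cluster can be read directly off its cluster_word_distribution. top_words is a hypothetical helper, not part of the original code:

def top_words(mgp, n=5):
    # For each non-empty cluster, print the n most frequent words in its count dictionary.
    for z, dist in enumerate(mgp.cluster_word_distribution):
        if mgp.cluster_doc_count[z] > 0:
            words = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:n]
            print("cluster %d: %s" % (z, words))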

Example

The paper explains the model through a Movie Group Process (MGP) analogy. The professor of a movie discussion course wants to divide the students into groups so that students in the same group have watched the same movies and therefore have more to discuss. The professor asks the students to write down, within a few minutes, the movies they have seen (because of the time limit the lists will not be long, and will mostly contain recently watched or favorite movies). The professor then needs a way to split the students into groups according to their movie lists: students in the same group should have similar lists, and students in different groups should have dissimilar ones.

At the start, all students are randomly assigned to one of K groups and write down their lists of favorite movies. The professor then reads each student's list in turn; after a student's list has been read, that student updates their group, choosing one that satisfies one or both of the following conditions:

  • the new group has more students
  • the students in the new group have movie lists more similar to the student's own

This process is repeated until no student changes group any more, at which point we have the desired grouping.
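
These two conditions correspond to the two factors of the conditional distribution from which GSDMM samples a cluster $z$ for document $d$; this is formula (3) of Yin and Wang [1], which the `score` method in the code below implements:

$$
p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d}\left(n^{w}_{z,\neg d} + \beta\right)}{\prod_{i=1}^{N_d}\left(n_{z,\neg d} + V\beta + i - 1\right)}
$$

Here $m_{z,\neg d}$ is the number of documents in cluster $z$, $n^{w}_{z,\neg d}$ the count of word $w$ in cluster $z$, and $n_{z,\neg d}$ the total word count of cluster $z$ (each excluding the current document $d$); $N_d$ is the length of document $d$, $D$ the number of documents, $K$ the cluster upper bound, and $V$ the vocabulary size. The first factor favors larger groups (condition 1), and the second favors groups whose word counts overlap the document's words (condition 2).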

Code

The core is a single MovieGroupProcess class, whose constructor takes the following parameters:

  • K: upper bound on the number of clusters
  • alpha: controls the probability that a student joins a currently empty group; when it is 0, no one will ever join an empty group
  • beta: controls how strongly a student weights similarity of interests; when it is low, a student prefers to join a group of students with similar interests rather than a more popular group
  • n_iters: number of iterations

from numpy.random import multinomial
from numpy import log, exp
from numpy import argmax

class MovieGroupProcess:
    def __init__(self, K=8, alpha=0.1, beta=0.1, n_iters=30):

        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.n_iters = n_iters

        # slots for computed variables
        self.number_docs = None
        self.vocab_size = None
        self.cluster_doc_count = [0 for _ in range(K)]
        self.cluster_word_count = [0 for _ in range(K)]
        self.cluster_word_distribution = [{} for i in range(K)]

    @staticmethod
    def from_data(K, alpha, beta, D, vocab_size, cluster_doc_count, cluster_word_count, cluster_word_distribution):
        '''
        Reconstitute a MovieGroupProcess from previously fit data
        :param K: upper bound on the number of clusters
        :param alpha: prior probability mass for joining an empty cluster
        :param beta: word-similarity prior
        :param D: number of documents the model was fit on
        :param vocab_size: vocabulary size of the fit corpus
        :param cluster_doc_count: number of documents per cluster
        :param cluster_word_count: total word count per cluster
        :param cluster_word_distribution: per-cluster word-count dictionaries
        :return: a MovieGroupProcess carrying the given fitted state
        '''
        mgp = MovieGroupProcess(K, alpha, beta, n_iters=30)
        mgp.number_docs = D
        mgp.vocab_size = vocab_size
        mgp.cluster_doc_count = cluster_doc_count
        mgp.cluster_word_count = cluster_word_count
        mgp.cluster_word_distribution = cluster_word_distribution
        return mgp

    @staticmethod
    def _sample(p):
        '''
        Sample with probability vector p from a multinomial distribution
        :param p: list
            List of probabilities representing probability vector for the multinomial distribution
        :return: int
            index of randomly selected output
        '''
        return [i for i, entry in enumerate(multinomial(1, p)) if entry != 0][0]

    def fit(self, docs, vocab_size):
        '''
        Cluster the input documents
        :param docs: list of list
            list of lists containing the unique token set of each document
        :param vocab_size: total vocabulary size across all documents
        :return: list of length len(docs)
            cluster label for each document
        '''
        alpha, beta, K, n_iters, V = self.alpha, self.beta, self.K, self.n_iters, vocab_size

        D = len(docs)
        self.number_docs = D
        self.vocab_size = vocab_size

        # unpack to easy var names
        m_z, n_z, n_z_w = self.cluster_doc_count, self.cluster_word_count, self.cluster_word_distribution
        cluster_count = K
        d_z = [None for i in range(len(docs))]

        # initialize the clusters
        for i, doc in enumerate(docs):

            # choose a random initial cluster for the doc
            z = self._sample([1.0 / K for _ in range(K)])
            d_z[i] = z
            m_z[z] += 1
            n_z[z] += len(doc)

            for word in doc:
                if word not in n_z_w[z]:
                    n_z_w[z][word] = 0
                n_z_w[z][word] += 1

        for _iter in range(n_iters):
            total_transfers = 0

            for i, doc in enumerate(docs):

                # remove the doc from its current cluster
                z_old = d_z[i]

                m_z[z_old] -= 1
                n_z[z_old] -= len(doc)

                for word in doc:
                    n_z_w[z_old][word] -= 1

                    # compact dictionary to save space
                    if n_z_w[z_old][word] == 0:
                        del n_z_w[z_old][word]

                # draw sample from distribution to find new cluster
                p = self.score(doc)
                z_new = self._sample(p)

                # transfer doc to the new cluster
                if z_new != z_old:
                    total_transfers += 1

                d_z[i] = z_new
                m_z[z_new] += 1
                n_z[z_new] += len(doc)

                for word in doc:
                    if word not in n_z_w[z_new]:
                        n_z_w[z_new][word] = 0
                    n_z_w[z_new][word] += 1

            cluster_count_new = sum([1 for v in m_z if v > 0])
            # total_transfers counts documents that changed cluster this iteration
            print("In stage %d: transferred %d clusters with %d clusters populated" % (
                _iter, total_transfers, cluster_count_new))
            if total_transfers == 0 and cluster_count_new == cluster_count and _iter > 25:
                print("Converged.  Breaking out.")
                break
            cluster_count = cluster_count_new
        self.cluster_word_distribution = n_z_w
        return d_z

    def score(self, doc):
        '''
        Score a document
        Implements formula (3) of Yin and Wang 2014.
        http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/KDD14-GSDMM.pdf
        :param doc: list[str]: The doc token stream
        :return: list[float]: A length K probability vector where each component represents
                              the probability of the document appearing in a particular cluster
        '''
        alpha, beta, K, V, D = self.alpha, self.beta, self.K, self.vocab_size, self.number_docs
        m_z, n_z, n_z_w = self.cluster_doc_count, self.cluster_word_count, self.cluster_word_distribution

        p = [0 for _ in range(K)]

        #  We break the formula into the following pieces
        #  p = N1*N2/(D1*D2) = exp(lN1 - lD1 + lN2 - lD2)
        #  lN1 = log(m_z[z] + alpha)
        #  lD1 = log(D - 1 + K*alpha)
        #  lN2 = log(product(n_z_w[w] + beta)) = sum(log(n_z_w[w] + beta))
        #  lD2 = log(product(n_z[z] + V*beta + i - 1)) = sum(log(n_z[z] + V*beta + i - 1))

        lD1 = log(D - 1 + K * alpha)
        doc_size = len(doc)
        for label in range(K):
            lN1 = log(m_z[label] + alpha)
            lN2 = 0
            lD2 = 0
            for word in doc:
                lN2 += log(n_z_w[label].get(word, 0) + beta)
            for j in range(1, doc_size + 1):
                lD2 += log(n_z[label] + V * beta + j - 1)
            p[label] = exp(lN1 - lD1 + lN2 - lD2)

        # normalize the probability vector
        pnorm = sum(p)
        pnorm = pnorm if pnorm > 0 else 1
        return [pp / pnorm for pp in p]

    def choose_best_label(self, doc):
        '''
        Choose the highest probability label for the input document
        :param doc: list[str]: The doc token stream
        :return: tuple of (most probable cluster label, its probability)
        '''
        p = self.score(doc)
        return argmax(p), max(p)
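
Since from_data reconstitutes a model from its fitted statistics, a fitted model can be persisted and reloaded, for example as plain JSON. The helpers save_mgp/load_mgp below are a minimal sketch, not part of the class:

import json

def save_mgp(mgp, path):
    # Dump the sufficient statistics of a fitted model to a JSON file.
    state = {
        "K": mgp.K, "alpha": mgp.alpha, "beta": mgp.beta,
        "D": mgp.number_docs, "vocab_size": mgp.vocab_size,
        "cluster_doc_count": mgp.cluster_doc_count,
        "cluster_word_count": mgp.cluster_word_count,
        "cluster_word_distribution": mgp.cluster_word_distribution,
    }
    with open(path, "w") as f:
        json.dump(state, f)

def load_mgp(path):
    # Rebuild the model via MovieGroupProcess.from_data.
    with open(path) as f:
        s = json.load(f)
    return MovieGroupProcess.from_data(
        s["K"], s["alpha"], s["beta"], s["D"], s["vocab_size"],
        s["cluster_doc_count"], s["cluster_word_count"],
        s["cluster_word_distribution"])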

Test

def compute_V(texts):
    V = set()
    for text in texts:
        for word in text:
            V.add(word)
    return len(V)

def test_short_text():
    # there is no perfect segmentation of this text data:
    texts = [
        "where the red dog lives",
        "red dog lives in the house",
        "blue cat eats mice",
        "monkeys hate cat but love trees",
        "green cat eats mice",
        "orange elephant never forgets",
        "orange elephant must forget",
        "monkeys eat banana",
        "monkeys live in trees",
        "elephant",
        "cat",
        "dog",
        "monkeys"
    ]

    texts = [text.split() for text in texts]
    V = compute_V(texts)
    mgp = MovieGroupProcess(K=30, n_iters=100, alpha=0.2, beta=0.01)
    y = mgp.fit(texts, V)

test_short_text()
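
To inspect what the model learned, one might extend test_short_text, e.g. by appending the following to the end of the function body (a hypothetical usage sketch, not part of the original test):

    # Report each document's fitted cluster and the model's current best label for it.
    for doc, z in zip(texts, y):
        best, p = mgp.choose_best_label(doc)
        print("cluster %d (best %d, p=%.2f): %s" % (z, best, p, " ".join(doc)))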

Results

In stage 0: transferred 8 clusters with 6 clusters populated
In stage 1: transferred 5 clusters with 7 clusters populated
In stage 2: transferred 7 clusters with 6 clusters populated
In stage 3: transferred 5 clusters with 8 clusters populated
In stage 4: transferred 5 clusters with 6 clusters populated
In stage 5: transferred 6 clusters with 9 clusters populated
In stage 6: transferred 7 clusters with 7 clusters populated
In stage 7: transferred 7 clusters with 6 clusters populated
In stage 8: transferred 6 clusters with 6 clusters populated
In stage 9: transferred 4 clusters with 7 clusters populated
In stage 10: transferred 7 clusters with 9 clusters populated
In stage 11: transferred 6 clusters with 7 clusters populated
In stage 12: transferred 6 clusters with 9 clusters populated
In stage 13: transferred 6 clusters with 8 clusters populated
In stage 14: transferred 5 clusters with 7 clusters populated
In stage 15: transferred 4 clusters with 5 clusters populated
In stage 16: transferred 6 clusters with 6 clusters populated
In stage 17: transferred 2 clusters with 5 clusters populated
In stage 18: transferred 3 clusters with 5 clusters populated
In stage 19: transferred 3 clusters with 7 clusters populated
In stage 20: transferred 3 clusters with 6 clusters populated
In stage 21: transferred 5 clusters with 8 clusters populated
In stage 22: transferred 4 clusters with 6 clusters populated
In stage 23: transferred 4 clusters with 5 clusters populated
In stage 24: transferred 5 clusters with 8 clusters populated
In stage 25: transferred 7 clusters with 7 clusters populated
In stage 26: transferred 6 clusters with 8 clusters populated
In stage 27: transferred 4 clusters with 4 clusters populated
In stage 28: transferred 4 clusters with 7 clusters populated
In stage 29: transferred 5 clusters with 6 clusters populated
In stage 30: transferred 3 clusters with 6 clusters populated
In stage 31: transferred 7 clusters with 8 clusters populated
In stage 32: transferred 5 clusters with 6 clusters populated
In stage 33: transferred 3 clusters with 5 clusters populated
In stage 34: transferred 6 clusters with 5 clusters populated
In stage 35: transferred 5 clusters with 7 clusters populated
In stage 36: transferred 6 clusters with 6 clusters populated
In stage 37: transferred 7 clusters with 7 clusters populated
In stage 38: transferred 3 clusters with 6 clusters populated
In stage 39: transferred 5 clusters with 7 clusters populated
In stage 40: transferred 4 clusters with 6 clusters populated
In stage 41: transferred 6 clusters with 7 clusters populated
In stage 42: transferred 2 clusters with 5 clusters populated
In stage 43: transferred 2 clusters with 7 clusters populated
In stage 44: transferred 4 clusters with 5 clusters populated
In stage 45: transferred 4 clusters with 7 clusters populated
In stage 46: transferred 5 clusters with 6 clusters populated
In stage 47: transferred 3 clusters with 6 clusters populated
In stage 48: transferred 4 clusters with 5 clusters populated
In stage 49: transferred 1 clusters with 5 clusters populated
In stage 50: transferred 3 clusters with 7 clusters populated
In stage 51: transferred 6 clusters with 6 clusters populated
In stage 52: transferred 4 clusters with 6 clusters populated
In stage 53: transferred 4 clusters with 5 clusters populated
In stage 54: transferred 2 clusters with 6 clusters populated
In stage 55: transferred 3 clusters with 6 clusters populated
In stage 56: transferred 5 clusters with 7 clusters populated
In stage 57: transferred 6 clusters with 8 clusters populated
In stage 58: transferred 5 clusters with 6 clusters populated
In stage 59: transferred 4 clusters with 6 clusters populated
In stage 60: transferred 3 clusters with 5 clusters populated
In stage 61: transferred 4 clusters with 6 clusters populated
In stage 62: transferred 7 clusters with 8 clusters populated
In stage 63: transferred 5 clusters with 5 clusters populated
In stage 64: transferred 1 clusters with 6 clusters populated
In stage 65: transferred 2 clusters with 6 clusters populated
In stage 66: transferred 2 clusters with 5 clusters populated
In stage 67: transferred 7 clusters with 6 clusters populated
In stage 68: transferred 9 clusters with 7 clusters populated
In stage 69: transferred 5 clusters with 6 clusters populated
In stage 70: transferred 5 clusters with 7 clusters populated
In stage 71: transferred 4 clusters with 5 clusters populated
In stage 72: transferred 4 clusters with 6 clusters populated
In stage 73: transferred 4 clusters with 5 clusters populated
In stage 74: transferred 4 clusters with 5 clusters populated
In stage 75: transferred 4 clusters with 6 clusters populated
In stage 76: transferred 5 clusters with 6 clusters populated
In stage 77: transferred 6 clusters with 7 clusters populated
In stage 78: transferred 4 clusters with 5 clusters populated
In stage 79: transferred 4 clusters with 7 clusters populated
In stage 80: transferred 5 clusters with 5 clusters populated
In stage 81: transferred 4 clusters with 5 clusters populated
In stage 82: transferred 5 clusters with 5 clusters populated
In stage 83: transferred 7 clusters with 5 clusters populated
In stage 84: transferred 3 clusters with 5 clusters populated
In stage 85: transferred 4 clusters with 7 clusters populated
In stage 86: transferred 5 clusters with 6 clusters populated
In stage 87: transferred 4 clusters with 8 clusters populated
In stage 88: transferred 5 clusters with 8 clusters populated
In stage 89: transferred 7 clusters with 7 clusters populated
In stage 90: transferred 4 clusters with 6 clusters populated
In stage 91: transferred 5 clusters with 7 clusters populated
In stage 92: transferred 6 clusters with 6 clusters populated
In stage 93: transferred 4 clusters with 8 clusters populated
In stage 94: transferred 4 clusters with 7 clusters populated
In stage 95: transferred 4 clusters with 5 clusters populated
In stage 96: transferred 3 clusters with 6 clusters populated
In stage 97: transferred 3 clusters with 5 clusters populated
In stage 98: transferred 4 clusters with 6 clusters populated
In stage 99: transferred 3 clusters with 7 clusters populated
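
Note that on this toy corpus the number of populated clusters stays well below the bound K=30, but no iteration ever reaches zero transfers, so the convergence branch in fit is never taken and the run uses all 100 iterations.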