Computing LDA Document Similarity with JS Distance

Ace Cheney

Categories: NLP, Notes, python. Tags: python, natural language processing, document similarity, LDA

Article link: https://blog.csdn.net/Accelerato/article/details/114578934

Problem statement:

[Image not available, cited from [1]]

[Image not available, cited from [2]]

Implementation source code:

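import gensim
from gensim import matutils
topicmodel = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=topic_num, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)
# minimum_probability=0.0 keeps low-probability topics, so each row is a full distribution
vec = topicmodel.get_document_topics(corpus, minimum_probability=0.0)
topicmatrix = matutils.corpus2dense(vec, topic_num).transpose()  # (documents x topics)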

Output: topicmatrix has shape (number of documents x number of topics).
Values: each entry is the proportion of that topic in the document.
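
To make the shape of topicmatrix concrete, here is a minimal NumPy-only sketch of the sparse-to-dense step. The helper name dense_topic_matrix and the toy values are hypothetical and only mimic the (topic_id, probability) pairs that get_document_topics returns; this is not gensim's implementation.

```python
import numpy as np

def dense_topic_matrix(doc_topics, num_topics):
    """Build a (num_docs, num_topics) array from per-document (topic_id, probability) pairs."""
    matrix = np.zeros((len(doc_topics), num_topics))
    for row, pairs in enumerate(doc_topics):
        for topic_id, prob in pairs:
            matrix[row, topic_id] = prob
    return matrix

# Hypothetical output for 2 documents and 3 topics. Topic 2 is absent from the
# first document, as happens when its probability falls below the threshold.
toy_doc_topics = [[(0, 0.9), (1, 0.1)], [(0, 0.2), (1, 0.3), (2, 0.5)]]
toy_matrix = dense_topic_matrix(toy_doc_topics, 3)
print(toy_matrix.shape)        # (2, 3): one row per document, one column per topic
print(toy_matrix.sum(axis=1))  # each row sums to about 1
```

Topics missing from a document's list simply become zeros in its row.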

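import numpy as np
import scipy.stats as sts

def JS_distance(p, q):
    # Despite the name, this returns a similarity: 1 minus the JS divergence.
    # sts.entropy(p, M) is the KL divergence KL(p || M) in nats,
    # so the result lies in [1 - ln 2, 1].
    M = (p + q) / 2
    return 1 - 0.5 * (sts.entropy(p, M) + sts.entropy(q, M))


def DistanceMatrix(data):
    # Pairwise similarity between the rows (documents) of data.
    LenRow, LenColumn = data.shape
    Dis_Mat = np.zeros((LenRow, LenRow))
    for i in range(LenRow):
        # The measure is symmetric, so compute only j > i and mirror it.
        for j in range(i + 1, LenRow):
            Dis_Mat[i, j] = Dis_Mat[j, i] = JS_distance(data[i], data[j])
        Dis_Mat[i, i] = 1  # a document is maximally similar to itself
    return Dis_Mat

sim_matrix = DistanceMatrix(topicmatrix)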
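
As a quick sanity check of the measure above, the following sketch evaluates it on a few toy distributions and compares it with scipy.spatial.distance.jensenshannon, which returns the square root of the JS divergence (natural log by default). The name js_similarity and the toy vectors are illustrative assumptions, not part of the original code.

```python
import numpy as np
import scipy.stats as sts
from scipy.spatial.distance import jensenshannon

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (in nats), same formula as JS_distance."""
    m = (p + q) / 2
    return 1 - 0.5 * (sts.entropy(p, m) + sts.entropy(q, m))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

same = js_similarity(p, p)                  # identical distributions
different = js_similarity(p, q)             # overlapping but different
disjoint = js_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# SciPy returns sqrt(JS divergence), so square it before subtracting from 1.
via_scipy = 1 - jensenshannon(p, q) ** 2

print(same, different, disjoint, via_scipy)
```

Identical distributions score 1, while fully disjoint ones score 1 - ln 2 (about 0.307) rather than 0, because the divergence is measured in nats. If a [0, 1] range is wanted, one option is to pass base=2 to sts.entropy.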

Known issues:

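The implementation is slow and does not saturate the CPU: it is roughly 50 times slower than sklearn's sklearn.metrics.pairwise.paired_distances. However, sklearn.metrics.pairwise.paired_distances does not implement the JS distance.
Suggestions from anyone who knows a better implementation would be very welcome.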
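
One possible direction, offered as a sketch rather than a benchmarked solution: replace the inner Python loop with NumPy operations, using the identity JSD(p, q) = H(M) - (H(p) + H(q)) / 2, where H is the Shannon entropy and M = (p + q) / 2. The function names entropy_rows and js_similarity_matrix are hypothetical, and the code assumes every row has a positive sum.

```python
import numpy as np

def entropy_rows(x):
    """Shannon entropy (nats) of each row; zero entries contribute nothing."""
    safe = np.where(x > 0, x, 1.0)  # log(1) = 0, so zeros drop out of the sum
    return -np.sum(x * np.log(safe), axis=-1)

def js_similarity_matrix(data):
    """Return 1 - JS divergence (natural log) for every pair of rows in data."""
    data = np.asarray(data, dtype=float)
    data = data / data.sum(axis=1, keepdims=True)  # normalise rows to distributions
    h = entropy_rows(data)                         # H(p_i) for every document
    n = data.shape[0]
    sim = np.ones((n, n))
    for i in range(n):
        m = 0.5 * (data[i] + data)                 # mixtures of row i with all rows
        jsd = entropy_rows(m) - 0.5 * (h[i] + h)   # JSD = H(M) - (H(p) + H(q)) / 2
        sim[i] = 1.0 - jsd
    np.fill_diagonal(sim, 1.0)
    return sim

toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
toy_sim = js_similarity_matrix(toy)
print(toy_sim)  # identical rows give 1.0; disjoint rows give 1 - ln 2, about 0.307
```

This keeps a single Python loop over documents and only an (n, k) temporary per iteration, so memory stays modest. The actual speed-up depends on the number of documents and topics and has not been measured here.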

References

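[1] Che Lei, Yang Xiaoping. A news topic discovery model based on multi-feature fusion text clustering (多特征融合文本聚类的新闻话题发现模型).
[2] Wang Chunlong, Zhang Jingxu. Application of an improved K-means algorithm based on LDA in text clustering [J]. Journal of Computer Applications (计算机应用), 2014, 34(1): 249-254.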
