Text clustering in Python with SciPy hierarchical clustering

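This post describes how to use Python's SciPy library for hierarchical clustering to divide 1000+ articles into groups. The tree shown in the dendrogram is cut at the third level, and the fcluster function assigns each article to a cluster. To identify the topic of each group, the suggested approach is to align the clustering results with the original articles and then use the pandas groupby function to count the most frequent words in each group, which yields keywords for each group.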

I have a text corpus of 1000+ articles, one article per line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles.

This is the code I used to do the clustering

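# Agglomerative clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac

# X is the sparse document-term matrix built earlier (not shown here);
# linkage() needs a dense array, hence toarray().
tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")

# Draw the dendrogram of the resulting tree
plt.clf()
hac.dendrogram(tree)
plt.show()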

and I got this plot

[Figure: dendrogram of the clustering (t2qv5.png)]

Then I cut off the tree at the third level with fcluster(). (With the 'maxclust' criterion, the value 3 asks for at most three flat clusters.)

from scipy.cluster.hierarchy import fcluster

clustering = fcluster(tree, 3, criterion='maxclust')
print(clustering)

and I got this output:

[2 2 2 ..., 2 2 2]
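The printed array is truncated, so it does not show how the articles are spread over the clusters. A minimal sketch of counting the cluster sizes with np.unique follows. The toy `points` array is an assumption that stands in for `X.toarray()`; the real corpus is not available here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 2-D points standing in for X.toarray() (assumption, not the real data).
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
tree = linkage(points, method="complete", metric="euclidean")

# 'maxclust' asks for at most 3 flat clusters.
labels = fcluster(tree, 3, criterion="maxclust")

# Count how many observations received each cluster label.
cluster_ids, sizes = np.unique(labels, return_counts=True)
size_by_cluster = dict(zip(cluster_ids.tolist(), sizes.tolist()))
print(size_by_cluster)
```

If one cluster holds almost every article, the cut may be too coarse to suggest distinct topics.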

My question is: how can I find the 10 most frequent words in each cluster, in order to suggest a topic for each cluster?

Solution

You can do the following:

1. Align your results (your clustering variable) with your input (the 1000+ articles), so that each article is paired with its cluster number.

2. Using the pandas library, use the groupby function with the cluster number as its key.

3. Per group (using the get_group function), fill up a defaultdict of integers with a count for every word you encounter.

4. Sort the dictionary of word counts in descending order and take your desired number of most frequent words.
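The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the helper name `top_words_per_cluster`, the toy `articles` and `labels`, and the naive lowercase whitespace tokenizer are all hypothetical, and the loop iterates over the groups directly rather than calling get_group for each key.

```python
from collections import defaultdict

import pandas as pd


def top_words_per_cluster(articles, labels, n=10):
    """Return {cluster_label: [(word, count), ...]} with the n most frequent words."""
    # Step 1: pair each article with its cluster label (same order as the input).
    df = pd.DataFrame({"text": list(articles), "cluster": list(labels)})

    result = {}
    # Step 2: group the articles by cluster label.
    for label, group in df.groupby("cluster"):
        # Step 3: count every word encountered in this cluster.
        counts = defaultdict(int)
        for text in group["text"]:
            for word in text.lower().split():  # naive tokenizer (assumption)
                counts[word] += 1
        # Step 4: sort by count (descending), break ties alphabetically, keep n.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[int(label)] = ranked[:n]
    return result


# Toy data standing in for the real corpus and the fcluster output.
articles = [
    "cats purr and cats sleep",
    "dogs bark and dogs run",
    "cats chase mice",
]
labels = [1, 2, 1]
top = top_words_per_cluster(articles, labels, n=2)
print(top)
```

On real articles, common words such as "and" will dominate the counts, so removing stop words before counting is probably needed to get useful topic hints.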

Good luck with what you're doing and please do accept my answer if it's what you're looking for.
