Text clustering in Python with SciPy hierarchical clustering


江边的石头房子


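This post shows how to use Python's SciPy library for hierarchical clustering to split 1000+ articles into groups. A dendrogram is plotted, the tree is cut into at most three flat clusters, and fcluster assigns each article to a cluster. To identify the topic of each cluster, the suggested approach is to align the clustering result with the original articles, then use the pandas groupby function to count the most frequent words in each cluster and use them as that cluster's keywords.

Summary generated automatically by CSDN.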

I have a text corpus that contains 1000+ articles, one per line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles.

This is the code I used to do the clustering:

# Agglomerative clustering
# X is the sparse feature matrix of the articles (its construction is not shown)
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac

tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
plt.clf()
hac.dendrogram(tree)
plt.show()

and I got a dendrogram plot.

Then I cut the tree with fcluster(), asking for at most three flat clusters:

from scipy.cluster.hierarchy import fcluster

# With criterion='maxclust', t=3 means "form at most 3 flat clusters"
clustering = fcluster(tree, 3, criterion='maxclust')
print(clustering)

and I got this output:

[2 2 2 ..., 2 2 2]
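
The printed array is truncated, so it does not show how the articles are spread over the clusters. A minimal sketch of one way to check this, assuming a small random matrix as a stand-in for X.toarray() (the toy data and the names toy_features and labels are illustrative, not from the question), counts the cluster sizes with numpy.unique:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for X.toarray(): 20 "documents" with 5 features each (made-up data).
rng = np.random.default_rng(0)
toy_features = rng.random((20, 5))

tree = linkage(toy_features, method="complete", metric="euclidean")

# criterion="maxclust" asks for at most 3 flat clusters; labels start at 1.
labels = fcluster(tree, t=3, criterion="maxclust")

# Count how many documents ended up in each cluster.
cluster_ids, sizes = np.unique(labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"cluster {cluster_id}: {size} documents")
```

If nearly all articles land in a single cluster, the top-word lists for the remaining clusters will be based on very few documents.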

My question is: how can I find the 10 most frequent words in each cluster, in order to suggest a topic for each cluster?

Solution

You can do the following:

Align your results (the clustering variable) with your input (the 1000+ articles).

Using the pandas library, group the articles with the groupby function, using the cluster number as the key.

For each group (retrieved with the get_group method), fill a defaultdict of integers with a count for every word you encounter.

Sort the dictionary of word counts in descending order and take as many of the most frequent words as you need.
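
The steps above can be sketched as follows. This is one possible implementation, not the answerer's own code: the articles list and clustering array are tiny made-up stand-ins for the corpus and the fcluster output, top_words_per_cluster is a hypothetical helper name, tokenization is a naive lowercase whitespace split, and collections.Counter is used in place of a defaultdict of integers because its most_common method also does the sorting.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Made-up stand-ins: one article per entry and one cluster label per article
# (in the question these would be the corpus lines and the fcluster output).
articles = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
clustering = np.array([1, 1, 2, 2])


def top_words_per_cluster(texts, labels, n=10):
    """Return {cluster label: [(word, count), ...]} with the n most frequent words."""
    # Step 1: align each article with its cluster label.
    frame = pd.DataFrame({"text": texts, "cluster": labels})
    result = {}
    # Step 2: group the articles by cluster number.
    for cluster_id, group in frame.groupby("cluster"):
        # Step 3: count every word seen in this cluster.
        counts = Counter()
        for text in group["text"]:
            # Naive tokenization: lowercase and split on whitespace.
            counts.update(text.lower().split())
        # Step 4: keep the n most frequent words, highest count first.
        result[cluster_id] = counts.most_common(n)
    return result


top = top_words_per_cluster(articles, clustering, n=3)
for cluster_id, words in top.items():
    print(cluster_id, words)
```

With raw counts like these, common function words such as "the" tend to dominate, so filtering out stop words before counting is probably needed to get usable topic hints.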

Good luck with what you're doing and please do accept my answer if it's what you're looking for.
