Topic Clustering and Visualization of the English 20 Newsgroups Dataset with BERTopic

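This article shows how to use the BERTopic model for topic clustering and visualization on the 20 Newsgroups dataset. BERTopic is a deep-learning-based topic modeling method, and the dataset contains text from 20 different newsgroups covering a range of topics. For model construction, the article discusses the choice of embedding model and the tuning of the UMAP dimensionality reduction and clustering parameters. After training, the topics are evaluated and interpreted through several visualizations: bar charts, a 2D map, a hierarchical clustering diagram, and a similarity-matrix heatmap. The article also mentions the topic coherence score as a model evaluation metric.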


Introduction to BERTopic

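BERTopic is a deep-learning-based topic modeling method. It builds on embeddings from BERT-style models. BERT is a pretraining approach for NLP that effectively captures the deep semantic information in sentences.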
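The pipeline summarized at the top of this post (embedding model, UMAP reduction, clustering, then the four visualizations) might be wired together roughly as follows. This is a minimal sketch, not the author's exact configuration: the embedding model name `all-MiniLM-L6-v2`, the use of HDBSCAN as the clusterer, all parameter values, and the helper names `topic_model_settings`, `build_topic_model`, and `fit_and_visualize` are assumptions chosen for illustration.

```python
def topic_model_settings(min_cluster_size=15, random_state=42):
    """Return (umap_kwargs, hdbscan_kwargs) for this sketch.

    All values are illustrative assumptions, not settings from the article.
    """
    umap_kwargs = dict(
        n_neighbors=15,       # size of the local neighborhood
        n_components=5,       # dimensionality passed on to the clusterer
        min_dist=0.0,         # pack points tightly, which helps clustering
        metric="cosine",
        random_state=random_state,  # fixed seed for repeatable topics
    )
    hdbscan_kwargs = dict(
        min_cluster_size=min_cluster_size,  # larger value -> fewer, broader topics
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    return umap_kwargs, hdbscan_kwargs


def build_topic_model(embedding_model="all-MiniLM-L6-v2",
                      min_cluster_size=15, random_state=42):
    """Assemble a BERTopic model from explicit UMAP and HDBSCAN components."""
    # Third-party imports stay local so the helpers above need no extra packages.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    umap_kwargs, hdbscan_kwargs = topic_model_settings(min_cluster_size, random_state)
    return BERTopic(
        embedding_model=embedding_model,
        umap_model=UMAP(**umap_kwargs),
        hdbscan_model=HDBSCAN(**hdbscan_kwargs),
        language="english",
        verbose=True,
    )


def fit_and_visualize(docs):
    """Fit the model on a list of strings and build the four plotly figures."""
    topic_model = build_topic_model()
    topics, _probs = topic_model.fit_transform(docs)
    figures = {
        "barchart": topic_model.visualize_barchart(top_n_topics=8),
        "map_2d": topic_model.visualize_topics(),
        "hierarchy": topic_model.visualize_hierarchy(),
        "heatmap": topic_model.visualize_heatmap(),
    }
    return topic_model, topics, figures
```

Calling `fit_and_visualize(docs)` on a list of document strings returns the fitted model, one topic id per document, and a dictionary of plotly figures that can be shown with `.show()` or saved with `.write_html(...)`. Raising `min_cluster_size` is one way to get fewer, broader topics.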

20 newsgroups dataset

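The 20 newsgroups dataset, loaded in scikit-learn with fetch_20newsgroups, contains text from 20 different newsgroups. Each newsgroup holds many posts, for a total of roughly 18,000 documents. The texts cover a range of subjects, including technology, politics, sports, and recreation. Each document carries a label indicating the newsgroup it belongs to. It is one of the commonly used benchmark datasets for text classification and topic modeling.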
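Loading the documents and their newsgroup labels could look like the sketch below. Stripping headers, footers, and quoted replies is an assumption made here to keep metadata out of the topics; it is not a step stated in the article. The helper names `drop_empty_documents` and `load_20newsgroups` are hypothetical.

```python
def drop_empty_documents(texts, labels):
    """Keep only documents that still contain non-whitespace text."""
    kept = [(text, label) for text, label in zip(texts, labels) if text.strip()]
    return [text for text, _ in kept], [label for _, label in kept]


def load_20newsgroups(subset="all"):
    """Fetch the dataset and return (docs, labels, label_names).

    The first call downloads the data unless a local cache already exists.
    """
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset=subset,                            # "train", "test", or "all"
        remove=("headers", "footers", "quotes"),  # assumption: drop metadata
    )
    # Removing headers and quotes can leave some posts empty, so filter them.
    docs, labels = drop_empty_documents(bunch.data, list(bunch.target))
    return docs, labels, list(bunch.target_names)
```

With `docs, labels, names = load_20newsgroups()`, the expression `names[labels[0]]` gives the newsgroup of the first document. The labels are not needed to fit a topic model, but they are handy for checking whether the discovered topics line up with the original newsgroups.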

Downloading the 20 newsgroups dataset

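The loader needs network access the first time it runs; if the network is available, it can be used directly. Otherwise, download the 20news-bydate_py3.pkz file manually and place it in the appropriate scikit_learn_data folder.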

The loader source is in the site-packages folder at \site-packages\sklearn\datasets\_twenty_newsgroups.py
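To check whether a manually downloaded cache file sits where the loader will look for it, a small helper along these lines may help. It assumes scikit-learn's default behavior of using the `SCIKIT_LEARN_DATA` environment variable when set and `~/scikit_learn_data` otherwise; the function names `expected_cache_path` and `cache_is_present` are hypothetical.

```python
import os

# Cache file name as given in the text above.
CACHE_FILE = "20news-bydate_py3.pkz"


def expected_cache_path(data_home=None):
    """Return the path where the cached dataset file is expected.

    Assumption: mirrors scikit-learn's default data-home resolution.
    """
    if data_home is None:
        data_home = os.environ.get(
            "SCIKIT_LEARN_DATA", os.path.join("~", "scikit_learn_data")
        )
    return os.path.join(os.path.expanduser(data_home), CACHE_FILE)


def cache_is_present(data_home=None):
    """True if the cache file already exists, so no download should be needed."""
    return os.path.isfile(expected_cache_path(data_home))
```

If `cache_is_present()` returns `True`, calling `fetch_20newsgroups(download_if_missing=False)` should load from disk without touching the network; when the file is missing, that call raises an error rather than downloading.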
