[NLP] Chinese Text Clustering in Python

栗子ma

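Categories: word segmentation, hierarchical clustering, text clustering, Tf-idf, NLP, Python. Tags: Chinese word segmentation, hierarchical clustering, Tf-idf, NLP, text clustering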

Article link: https://blog.csdn.net/sinat_40431164/article/details/81092288

1. Prepare the texts to be clustered. Ten Weibo posts are used here.

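import os

path = 'E:/work/@@@@/开发事宜/大数据平台/5. 标签设计/文本测试数据/微博/'
titles = []  # file names, used later as labels
files = []   # file contents, one string per post
for filename in os.listdir(path):
    titles.append(filename)
    # A UTF-8 .txt file saved with a BOM decodes with a stray leading character '\ufeff'
    # (the BOM itself). To drop it, set encoding to 'utf-8_sig' or 'utf_8_sig'
    # (both resolve to the codec documented as 'utf-8-sig').
    with open(os.path.join(path, filename), encoding='utf-8_sig') as f:
        files.append(f.read())
for i in range(len(titles)):
    print(i, titles[i], files[i])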

0 #我有特别的推倒技巧#之佟掌柜收李逍遥.txt 【#我有特别的推倒技巧#之佟掌柜收李逍遥】@胡歌 此前对南都称，“我要海陆空三栖”，可是还没等他征服海陆空呢，@素描闫妮 就把他先收了……他俩主演的电视剧《生活启示录》收视率稳居卫视黄金档排行第1名，在豆瓣甚至被刷到了8 .6的高分。佟掌柜是怎么推倒李逍遥而又不违和的？http://t.cn/Rvbj6oy
1 从前的称是16两为一斤，为什么要16两为一斤呢？.txt 从前的称是16两为一斤，为什么要16两为一斤呢？古人把北斗七星、南斗六星以及福、禄、寿三星，共16颗星比作16两，商人卖东西，要讲究商德，不能缺斤短两；如果耍手腕，克扣一两就减福，克扣二两就损禄，克扣三两就折寿。所以这个数字和古代人对诚信的美好愿望分不开。
2 作死男晒酒驾.txt #作死男晒酒驾# “他还晒出自己闯红灯20多次，最终此人自首。”请问“亲爱的交警同志”你们在干神马事情去啦！！！是不是撞死人呢，你们才会管啊，开宝马的命太金贵了，而我们的命太.......！！！[怒][泪] |作死男晒酒驾
3 冀中星被移送检察院审查起诉.txt #新闻追踪#：【冀中星被移送检察院审查起诉】首都机场公安分局对冀中星爆炸案侦查终结，目前已移送朝阳检察院审查起诉。7月20日18时24分，冀中星在首都机场T3航站楼B口外引爆自制炸药。案发当天除冀中星左手腕因被炸截肢外，无其他人伤亡。7月29日，冀中星因涉嫌爆炸罪被批捕。http://t.cn/zQHjr0S
4 宝马男微博炫富晒酒驾挑逗交警 最终因被人肉求饶[汗].txt 【宝马男微博炫富晒酒驾挑逗交警 最终因被人肉求饶[汗]】"开车喝酒是不是违反交规？@深圳交警 今晚猎虎吗？"，在向交警挑衅后，他又晒出车牌号，称自己闯红灯20多次，随即遭人肉，并陆续被曝光私人信息。很快，他发微博求饶。26日，交警传唤该男子，该男子炫富违法车辆已被查扣。http://t.cn/Rvb9gF6
5 广场舞出口世界.txt #广场舞出口世界# 澳大利亚引进广场舞顺带引进大妈的疑惑：1）是否可以申请技术移民2）是否属于物种入侵3）亚洲女子天团进入澳洲是否会影响当地娱乐圈的圈态平衡4）是否能够接受“中国大妈一旦引进，一概不退不换”的要求
6 方舟子：锤子改口号，换汤不换药！.txt 【方舟子：锤子改口号，换汤不换药！】遭到方舟子举报虚假宣传后，锤子手机官网修改了宣传口号，将"东半球最好用的手机"改成"全球第二好用的智能手机"。对此，方舟子称，被举报后，罗永浩一边说着"呵呵"，一边偷偷改了广告用语，改成了”全球第二好用的智能手机"等，但这仍然是换汤不换药的虚假广告..
7 杜汶泽宣布暂别香港 (2).txt #杜汶泽宣布暂别香港#杜先生长的丑，嘴巴臭，爪子贱不是你的错，你出来吓人乱说话熏到人就是你的不对了，大陆人怎么了，大陆人敢作敢当说不安逸你就不安逸你，大陆人让你无地自容的本事还是绰绰有余的，滚吧，杜狗！ 
8 杜汶泽宣布暂别香港.txt #杜汶泽宣布暂别香港# 虽说言论自由无可厚非，但攻击民族种族国家，涉及歧视他人的行为仍然不是营销策略中可以突破的下线。把无耻当有趣，是多无聊的人才能干出的事儿啊！该。。只有这个字能概括。
9 首都机场爆炸案嫌犯冀中星 移送检方审查起诉.txt #豫广微新闻#【首都机场爆炸案嫌犯冀中星 移送检方审查起诉】 据报道，首都机场公安分局对冀中星爆炸案侦查终结，目前已移送朝阳检察院审查起诉。7月20日，山东籍男子冀中星在首都机场T3航站楼引爆自制炸药，案发当天除冀中星左手腕因被炸截肢外，无其他人伤亡
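
The BOM issue mentioned in the loading code can be reproduced in isolation. The following is a minimal sketch, not part of the original post: the helper name `read_both_ways` and the temporary file `sample.txt` are made up for illustration. It writes a small UTF-8 file with a BOM and reads it back with both codecs.

```python
import codecs
import os
import tempfile

def read_both_ways(text):
    """Write `text` as UTF-8 with a BOM, then read it back with two codecs."""
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = os.path.join(tmp, "sample.txt")  # hypothetical file name
        with open(sample_path, "wb") as f:
            f.write(codecs.BOM_UTF8 + text.encode("utf-8"))
        with open(sample_path, encoding="utf-8") as f:
            plain = f.read()      # keeps the BOM as '\ufeff'
        with open(sample_path, encoding="utf-8-sig") as f:
            stripped = f.read()   # BOM removed by the codec
    return plain, stripped

plain, stripped = read_both_ways("微博")
print(repr(plain))
print(repr(stripped))
```

Reading with `utf-8-sig` is also safe for files that have no BOM, so it is a reasonable default when the origin of the text files is unknown.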

2. Wrap jieba segmentation in a function. A user-defined dictionary and a stop word list are also needed.

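With a user dictionary, jieba treats each user-defined word as a single token instead of splitting it. For example, before "佟掌柜" is added, it is segmented into "佟" and "掌柜"; after it is added, "佟掌柜" is kept whole. The resulting token list is also filtered against the stop word list, so punctuation and other meaningless words or characters do not appear in the final result.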

# Load the stop word list (one word per line)
def stopwordslist(stopwords_filepath):
    with open(stopwords_filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Segment a piece of text into words
def segment(text, userdict_filepath="userdict2.txt", stopwords_filepath='stopwords.txt'):
    import jieba
    jieba.load_userdict(userdict_filepath)
    # Path of the stop word file; a set makes the membership test below fast
    stopwords = set(stopwordslist(stopwords_filepath))
    seg_list = jieba.cut(text, cut_all=False)  # accurate mode
    seg_list_without_stopwords = []
    for word in seg_list:
        if word not in stopwords and word != '\t':
            seg_list_without_stopwords.append(word)
    return seg_list_without_stopwords

The user-defined dictionary is named userdict2.txt and saved in the project folder:

杜汶泽
佟掌柜
南都
生活启示录
第1名
违和
两
南斗六星
福禄寿三星
颗
酒驾
晒出
亲爱的
命
金贵
冀中星
7月20日
T3航站楼
B口
案发当天
7月29日
微博
炫富
广场舞
物种入侵
虚假
宣传口号
好用
不对
大陆人
才能
干出
能
微新闻
被炸

Stop word list: https://blog.csdn.net/shijiebei2009/article/details/39696571. Save it as stopwords.txt in the project folder.
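
To see why a custom dictionary changes the tokens, here is a toy sketch. It is a greedy forward-maximum-matching segmenter, which is not jieba's actual algorithm and is used only to illustrate the effect of adding words and filtering stop words. The function name `forward_max_match` and the tiny vocabularies are assumptions made for this example.

```python
def forward_max_match(text, vocabulary, stopwords=frozenset()):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    # Drop stop words, as segment() does after jieba.cut().
    return [t for t in tokens if t not in stopwords]

base_vocab = {"掌柜", "推倒"}          # toy stand-in for the built-in dictionary
stopwords = {"是", "的", "？"}          # toy stand-in for stopwords.txt
text = "佟掌柜是怎么推倒李逍遥的？"

before = forward_max_match(text, base_vocab, stopwords)
after = forward_max_match(text, base_vocab | {"佟掌柜", "李逍遥", "怎么"}, stopwords)
print(before)
print(after)
```

In the first call "佟掌柜" is split into "佟" and "掌柜"; once the word is in the vocabulary it survives as one token, which is the behavior the user dictionary is meant to achieve.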

3. Segment the list of files with the tokenizer

totalvocab_tokenized = []
for i in files:
    allwords_tokenized = segment(i, "userdict2.txt", 'stopwords.txt')
    totalvocab_tokenized.extend(allwords_tokenized)
print(len(totalvocab_tokenized))  # 371 tokens before deduplication, 256 after

371
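
The deduplicated count mentioned in the comment can be obtained with a set. A minimal sketch on a made-up token list (the tokens below are illustrative, not the real output of `segment`):

```python
from collections import Counter

# Hypothetical tokens standing in for totalvocab_tokenized
tokens = ["冀中星", "首都机场", "爆炸案", "冀中星", "审查起诉", "首都机场", "冀中星"]

unique_tokens = set(tokens)   # deduplicated vocabulary
counts = Counter(tokens)      # how often each token occurs

print(len(tokens), len(unique_tokens))
print(counts.most_common(2))
```

Applied to the real data, `len(set(totalvocab_tokenized))` would give the deduplicated size.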

4. Build the Tf-idf matrix

from sklearn.feature_extraction.text import TfidfVectorizer
#max_df: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
#min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
# stop_words='english' only filters English stop words; Chinese stop words are already removed inside segment()
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, max_features=200000,
                                 min_df=0.1, stop_words='english',
                                 use_idf=True, tokenizer=segment)
# terms is the list of features used in the tf-idf matrix, i.e. the vocabulary
# terms = tfidf_vectorizer.get_feature_names_out()  # length 258; older scikit-learn versions use get_feature_names()
tfidf_matrix = tfidf_vectorizer.fit_transform(files)  # fit the vectorizer to the documents
print(tfidf_matrix.shape)  # (10, 258): 10 documents, 258 features

(10, 258)
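
To make the numbers inside the matrix less opaque, the weighting can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and ignores `max_df`, `min_df` and `max_features`. The function name `tfidf_rows` and the three tiny documents are made up for illustration.

```python
import math

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalized rows."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain the term.
    df = {t: sum(t in doc for doc in tokenized_docs) for t in vocabulary}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        raw = [doc.count(t) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows

# Hypothetical, already segmented documents
docs = [["酒驾", "交警", "宝马"], ["酒驾", "交警", "微博"], ["广场舞", "大妈"]]
vocabulary, rows = tfidf_rows(docs)
for term, weight in zip(vocabulary, rows[0]):
    print(term, round(weight, 3))
```

In the first document "宝马" receives a higher weight than "酒驾" because it occurs in fewer documents, which is the effect that makes Tf-idf useful for separating topics.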

5. Compute document similarity

from sklearn.metrics.pairwise import cosine_similarity
# Note: with dist, the similarity between any two or more documents can be measured.
# cosine_similarity returns an array with shape (n_samples_X, n_samples_Y)
dist = 1 - cosine_similarity(tfidf_matrix)  # cosine distance: 0 means the same direction
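
For reference, the quantity computed here is one minus the cosine of the angle between two document vectors. A minimal pure-Python sketch on made-up vectors (the helper names are assumptions for this example; scikit-learn does the same computation on the sparse matrix):

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_matrix(rows):
    """Square matrix of 1 - cosine similarity for every pair of rows."""
    return [[1 - cosine_similarity_pair(u, v) for v in rows] for u in rows]

# Three hypothetical document vectors
rows = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
toy_dist = cosine_distance_matrix(rows)
for line in toy_dist:
    print([round(x, 3) for x in line])
```

Documents that share no terms, such as the first and third vectors, end up at distance 1, and a document compared with itself is at distance 0 up to floating-point error.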

6. Obtain the clusters

from scipy.cluster.hierarchy import ward, dendrogram, linkage
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # so that Chinese labels render correctly
# Equivalent shortcut: linkage_matrix = ward(dist)
# Method 'ward' requires the distance metric to be Euclidean.
# Note: linkage() treats a 2-D array as observation vectors, so each row of the square
# matrix dist is used here as a feature vector and Euclidean distances are computed
# between rows. To cluster on dist itself as precomputed distances, pass a condensed
# matrix instead, e.g. scipy.spatial.distance.squareform(dist, checks=False).
linkage_matrix = linkage(dist, method='ward', metric='euclidean', optimal_ordering=False)
# linkage_matrix[i] tells us which clusters were merged at step i.
# Each row of the resulting array has the format [idx1, idx2, dist, sample_count]
print(linkage_matrix)
for index, title in enumerate(titles):
    print(index, title)

[[ 3.          9.          0.35350366  2.        ]
 [ 2.          4.          1.08521531  2.        ]
 [ 7.          8.          1.2902641   2.        ]
 [ 0.          5.          1.39239608  2.        ]
 [ 1.          6.          1.40430097  2.        ]
 [13.         14.          1.42131068  4.        ]
 [12.         15.          1.4744491   6.        ]
 [11.         16.          1.62772682  8.        ]
 [10.         17.          2.2853395  10.        ]]

0 #我有特别的推倒技巧#之佟掌柜收李逍遥.txt
1 从前的称是16两为一斤，为什么要16两为一斤呢？.txt
2 作死男晒酒驾.txt
3 冀中星被移送检察院审查起诉.txt
4 宝马男微博炫富晒酒驾挑逗交警 最终因被人肉求饶[汗].txt
5 广场舞出口世界.txt
6 方舟子：锤子改口号，换汤不换药！.txt
7 杜汶泽宣布暂别香港 (2).txt
8 杜汶泽宣布暂别香港.txt
9 首都机场爆炸案嫌犯冀中星 移送检方审查起诉.txt
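
The linkage matrix can be hard to read because indices of 10 and above do not refer to documents: index `n + i` stands for the cluster created by row `i`, where `n` is the number of documents. The sketch below copies the printed matrix as plain tuples and expands every merge into the documents it contains. The helper name `merge_members` is made up for this example.

```python
# Rows copied from the printed linkage matrix: (idx1, idx2, dist, sample_count)
linkage_rows = [
    (3, 9, 0.35350366, 2),
    (2, 4, 1.08521531, 2),
    (7, 8, 1.2902641, 2),
    (0, 5, 1.39239608, 2),
    (1, 6, 1.40430097, 2),
    (13, 14, 1.42131068, 4),
    (12, 15, 1.4744491, 6),
    (11, 16, 1.62772682, 8),
    (10, 17, 2.2853395, 10),
]
n_leaves = 10  # number of documents

def merge_members(rows, n_leaves):
    """Expand each merge step into the sorted list of documents it contains."""
    members = {i: [i] for i in range(n_leaves)}
    steps = []
    for step, (left, right, height, count) in enumerate(rows):
        merged = sorted(members[left] + members[right])
        assert len(merged) == count  # sample_count must match the expanded cluster
        members[n_leaves + step] = merged
        steps.append((n_leaves + step, height, merged))
    return steps

steps = merge_members(linkage_rows, n_leaves)
for cluster_id, height, docs in steps:
    print(cluster_id, height, docs)
```

The first three merges pair documents 3 and 9 (the two 冀中星 posts), 2 and 4 (the drunk-driving posts), and 7 and 8 (the 杜汶泽 posts), which matches the topics visible in the titles.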

7. Visualization

plt.figure(figsize=(25, 10))
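plt.title('中文文本层次聚类树状图')  # "Dendrogram of hierarchical clustering of Chinese texts"
plt.xlabel('微博标题')  # "Weibo post title"
plt.ylabel('距离（越低表示文本越类似）')  # "Distance (lower means the texts are more similar)"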
dendrogram(
    linkage_matrix,
    labels=titles, 
    leaf_rotation=-70,  # rotates the x axis labels
    leaf_font_size=12  # font size for the x axis labels
)
plt.show()
plt.close()
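
Reading clusters off the dendrogram amounts to cutting the tree at a chosen height. As a rough sketch of that idea in pure Python, the function below applies only the merges whose distance does not exceed a threshold and returns the resulting groups. The name `cut_tree_at` and the threshold 1.3 are assumptions chosen for illustration; SciPy offers `fcluster` with `criterion='distance'` for the same purpose.

```python
# Rows copied from the linkage matrix printed in step 6
linkage_rows = [
    (3, 9, 0.35350366, 2),
    (2, 4, 1.08521531, 2),
    (7, 8, 1.2902641, 2),
    (0, 5, 1.39239608, 2),
    (1, 6, 1.40430097, 2),
    (13, 14, 1.42131068, 4),
    (12, 15, 1.4744491, 6),
    (11, 16, 1.62772682, 8),
    (10, 17, 2.2853395, 10),
]

def cut_tree_at(rows, n_leaves, threshold):
    """Return flat clusters (lists of document indices) below the cut height."""
    members = {i: [i] for i in range(n_leaves)}
    active = set(range(n_leaves))  # clusters that exist at the current height
    for step, (left, right, height, _count) in enumerate(rows):
        if height > threshold:
            break  # rows are sorted by merge height for Ward linkage
        new_id = n_leaves + step
        members[new_id] = members[left] + members[right]
        active -= {left, right}
        active.add(new_id)
    return sorted(sorted(members[c]) for c in active)

clusters = cut_tree_at(linkage_rows, 10, threshold=1.3)
print(clusters)
```

With this threshold the three topical pairs stay together and the remaining four posts are left as singletons; a higher threshold merges more posts into fewer clusters.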
