High-Frequency Words, Word Clouds, Text Centroids, and Clustering

KaZaKun

Last modified: 2022-02-16 17:57:58


Category: Python. Tags: natural language processing, python

First published: 2021-09-02 17:09:26

Article link: https://blog.csdn.net/weixin_43238031/article/details/120064530


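This article analyzes 5,000 negative reviews of a hairy-crab e-commerce seller and finds that the main complaints are crabs that are not fresh, bad smells, little roe, a lot of crab waste, small size, and poor taste. Word-frequency statistics, vector representations, and cluster analysis show that customer dissatisfaction centers on product quality and seller service. K-means clustering shows that the reviews fall into several groups, and that some reviews are representative of the overall dissatisfaction.

Summary generated automatically by CSDN.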

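We have 5,000 negative reviews of a hairy-crab (大闸蟹) e-commerce seller in a txt file, one review per line; examples follow. The task is to run a text analysis on them: read in all documents and segment them into words, count word frequencies, find the high-frequency words, choose a feature set, generate a vector representation for each review, and compute the distance between reviews (any definition, e.g. Euclidean or cosine). Can we find the "centroid" of all reviews, or a representative review, and print its original text? Besides word clouds, what other visualizations are there for multi-document data?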

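There was a rotten smell as soon as I opened the box; they are not fresh. When I steamed them today there was also an odd smell, the claws had no fresh flavor at all, and inside the shell it was all something like crab waste, with only a tiny bit of roe that did not taste good. That would be tolerable; my biggest worry is getting sick with diarrhea after eating these crabs.
Never mind eating them; one whiff when you open the box makes you want to throw up. An unscrupulous seller. Everyone, be very careful.
I reported it, but never got a single call from the seller. This was my first time buying these for the elderly in my family and this is what happened! A rather disappointing purchase!
Fairly dissatisfied. The crabs are dirty, the shells are thick and not white, and they are smaller than I expected. I might as well have bought cheaper ones. The only good point is that they were still alive.
Not worth it at all; the worst hairy crabs I have ever seen. Dirty, little paste and no yellow fat, and the males are definitely "bathed crabs" (洗澡蟹). Everyone should watch out for this seller's crabs.
Only two males were around the expected 4-5 liang; the rest were all smaller and short of the advertised 5 liang or 3.5 liang. Worst of all, the females were all rather dirty and had a bitter taste!
Very different from the description, and one was spoiled. I won't buy again.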
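
The task leaves the distance measure open (Euclidean or cosine). As a minimal sketch of how the two differ, the following compares them on made-up count vectors; the vectors and function names are illustrative assumptions, not part of the original code. Treating an all-zero vector as cosine distance 1.0 is also an assumption made here.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a review with no feature words is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical count vectors over three feature words.
short_review = [1, 1, 0]
long_review = [3, 3, 0]    # same words as short_review, each used three times
other_review = [0, 0, 2]   # shares no feature word with the other two

print(euclidean_distance(short_review, long_review))  # grows with review length
print(cosine_distance(short_review, long_review))     # close to 0: same direction
print(cosine_distance(short_review, other_review))    # 1.0: no shared words
```

Euclidean distance separates a short review from a longer one that uses the same words, while cosine distance treats them as nearly identical. Which behavior is preferable depends on whether review length should count as a difference.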

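I. Tasks

1. Count word frequencies, find the high-frequency words, and draw a word cloud.
2. Choose a feature set; generate a vector for each review using the count, weight, and one-hot methods; compute the distances between reviews; and output the text centroid.
3. Another visualization method: K-means cluster analysis.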

Implementation

1. Count word frequencies, find the high-frequency words, and draw a word cloud.

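import re  # regular expressions (not used in the code shown)
import collections  # word-frequency counting
import numpy as np  # numerical arrays
import jieba  # Chinese word segmentation
import wordcloud  # word-cloud rendering
from PIL import Image  # image loading
import matplotlib.pyplot as plt  # plotting
from sklearn.decomposition import PCA  # dimensionality reduction
from sklearn.cluster import KMeans  # clustering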

# Load the stop-word list (one word per line)
def get_stopwords(filepath):
    with open(filepath, 'r', encoding='utf-8') as stopfile:
        stopwords = [line.strip() for line in stopfile.readlines()]
    return set(stopwords)

# Segment every review with jieba and drop the stop words
def obj_word(words_filepath, stopwords):
    obj_words = []
    with open(words_filepath, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            for word in jieba.cut(line.strip().replace(' ','')):
                if word not in stopwords:
                    obj_words.append(word)
    return obj_words

# Count word frequencies
def word_freqcy(obj_word):
    word_freq = collections.Counter(obj_word)  # frequency of each segmented word
    return word_freq


# Print the high-frequency words
def topn_word_freq(word_freq, topn=50):
    word_freq_topn = word_freq.most_common(topn)  # the topn most frequent words
    print("Top %d high-frequency words:" % topn)
    print(word_freq_topn)  # print for inspection


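# Word cloud
def draw_cloud(word_freq):
    mask = np.array(Image.open('wordcloud.png'))  # mask image that defines the cloud's shape
    wc = wordcloud.WordCloud(
        font_path='C:/Windows/Fonts/simhei.ttf',  # a font with Chinese glyphs (Windows path)
        mask=mask,                # shape mask
        max_words=150,            # maximum number of words shown
        max_font_size=100,        # maximum font size
        background_color='white'
    )
    wc.generate_from_frequencies(word_freq)  # build the cloud from the word -> count mapping
    plt.imshow(wc)   # render the word cloud
    plt.axis('off')  # hide the axes
    plt.show()       # display the figure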
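
To try the counting step without the review file or jieba, here is a minimal sketch on pre-tokenized toy data. The token lists, the stop-word set, and the helper name `count_words` are all hypothetical stand-ins for what `obj_word` and `word_freqcy` would produce from the real corpus.

```python
from collections import Counter

# Hypothetical pre-tokenized reviews (stand-ins for the output of jieba.cut).
tokenized_reviews = [
    ["crab", "is", "not", "fresh", "smell"],
    ["smell", "is", "bad", "crab", "small"],
    ["crab", "dirty", "small"],
]
stopwords = {"is", "not"}  # assumed stop-word set

def count_words(reviews, stopwords):
    """Count every token across all reviews, skipping stop words."""
    counts = Counter()
    for tokens in reviews:
        counts.update(t for t in tokens if t not in stopwords)
    return counts

word_freq = count_words(tokenized_reviews, stopwords)
top3 = word_freq.most_common(3)  # list of (word, count), highest count first
print(top3)
```

The resulting `Counter` is a word-to-count mapping, which is the kind of input `generate_from_frequencies` is given in `draw_cloud`.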

High-frequency word results:
[Image: high-frequency word output]
Word cloud result:

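2. Choose a feature set; generate a vector for each review using the count, weight, and one-hot methods; compute the distances between reviews; and output the text centroid.

A word is treated as a feature word when it occurs at least 10 times in the corpus (the code tests `>= 10`). Each review then gets a feature-word vector, built in one of three ways:
(1) Weight: each feature word's count in the review, divided by the total count of all words kept in that review after stop-word removal (this is what `get_vector_weight` computes).
(2) One-hot encoding: a 0/1 flag recording whether each feature word occurs.
(3) Count: the number of times each feature word occurs.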
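
As a small worked example of the three representations, the sketch below encodes one toy review. The feature list and tokens are hypothetical. The weight denominator follows the article's `get_vector_weight` code, which sums the counts of all retained words in the review, including those that are not feature words.

```python
from collections import Counter

features = ["crab", "smell", "small"]               # hypothetical feature words
review_tokens = ["crab", "smell", "smell", "late"]  # "late" is not a feature word
counts = Counter(review_tokens)

# Count method: how often each feature word occurs (Counter gives 0 if absent).
freq_vec = [counts[f] for f in features]

# One-hot method: 1 if the feature word occurs at all, else 0.
onehot_vec = [1 if counts[f] > 0 else 0 for f in features]

# Weight method: count divided by the number of retained tokens in the review.
total = sum(counts.values())
weight_vec = [counts[f] / total if total else 0.0 for f in features]

print(freq_vec)    # [1, 2, 0]
print(onehot_vec)  # [1, 1, 0]
print(weight_vec)  # [0.25, 0.5, 0.0]
```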

# Select the feature words
def feature(word_freq):
    features = []
    for k in word_freq:
        if 10 <= word_freq[k] <= 100000:  # keep words seen at least 10 times
            features.append(k)
    print("Feature words:")
    print(features)
    return features

Feature word results:
[Image: feature word output]
The three vector-generation methods:

# Count method: one entry per feature word, holding its count in the review
def get_vector_freq(lineword_freq, feature):
    vector = []
    for f in feature:
        if f in lineword_freq:
            vector.append(lineword_freq[f])
        else:
            vector.append(0)
    return vector


# One-hot method: 1 if the feature word occurs in the review, else 0
def get_vector_onehot(lineword_freq, feature):
    vector = []
    for f in feature:
        if f in lineword_freq:
            vector.append(1)
        else:
            vector.append(0)
    return vector


# Weight method: count divided by the review's total word count
def get_vector_weight(lineword_freq, feature):
    vector = []
    # total count of all retained words in the review, not only feature words
    length = np.sum(list(lineword_freq.values()))
    for f in feature:
        if f in lineword_freq:
            vector.append(lineword_freq[f]/length)
        else:
            vector.append(0)
    return vector

# Build the review-by-feature matrix
def get_vs_list(word_filepath, features, stopwords):
    vs_list = []
    content = {}  # line index -> original review text
    with open(word_filepath, 'r', encoding='utf-8') as file:
        line = file.readline()
        i = 0
        while line:
            content[i] = line
            obj_linewords = []
            for w in jieba.cut(line.strip()):
                if w not in stopwords:
                    obj_linewords.append(w)
            lineword_freq = word_freqcy(obj_linewords)
            # one-hot here; swap in get_vector_freq or get_vector_weight for the other methods
            v = get_vector_onehot(lineword_freq, features)
            vs_list.append(v)
            line = file.readline()
            i += 1
    return vs_list, content


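# Find the "centroid" review: the one with the smallest total distance to all reviews
def dist_matrix(vs_list, content):
    length = len(vs_list)
    distance = []
    vector1 = np.array(vs_list)  # all review vectors, shape (n_reviews, n_features)
    for i in range(length):
        vector2 = vs_list[i]
        # broadcasting subtracts review i from every row, so dis is the square root
        # of the summed squared Euclidean distances from review i to all reviews
        dis = np.sqrt(np.sum(np.square(vector1 - vector2)))
        distance.append(dis)
        # print("Total distance from review %d to all reviews: %f" % ((i+1), distance[i]))
    pos = distance.index(min(distance))
    print("Review %d is the centroid review" % (pos+1))
    print("Review text: " + content[pos])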

def main():
    stopword = get_stopwords('stopwords_list.txt')
    object_word = obj_word('online_reviews_texts.txt', stopword)
    word_freq = word_freqcy(object_word)
    topn_word_freq(word_freq)
    draw_cloud(word_freq)
    features = feature(word_freq)
    vs_list, content = get_vs_list('online_reviews_texts.txt', features, stopword)
    dist_matrix(vs_list, content)
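
`dist_matrix` never computes a mean vector. It picks the review whose summed squared distance to all reviews is smallest. For squared Euclidean distance, that is the same review as the one nearest to the mean of all vectors, because the sum over j of ||x_i - x_j||^2 equals n * ||x_i - mean||^2 plus a constant that does not depend on i. The sketch below checks this on made-up 2-feature vectors; the data and function names are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def medoid_by_total_distance(vectors):
    """Index of the vector with the smallest summed squared distance to all vectors."""
    totals = [sum(squared_distance(v, w) for w in vectors) for v in vectors]
    return totals.index(min(totals))

def nearest_to_mean(vectors):
    """Index of the vector closest to the component-wise mean."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    dists = [squared_distance(v, mean) for v in vectors]
    return dists.index(min(dists))

# Hypothetical count vectors over two feature words.
vectors = [[0, 0], [2, 0], [1, 1], [5, 5]]

print(medoid_by_total_distance(vectors))  # 2
print(nearest_to_mean(vectors))           # 2
```

On real data, computing the mean once and measuring each review's distance to it needs work proportional to the number of reviews, instead of comparing every pair of reviews.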

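Results. Centroid review with one-hot vectors:
[Image: centroid review output, one-hot vectors]
Centroid review with weight vectors:

Centroid review with count vectors: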

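3. K-means cluster analysis

(1) The vector sets produced by the weight, one-hot, and count methods were each reduced with PCA; the results are shown below. In the plots, the one-hot vectors give the most widely spread sample points, so the one-hot vector set is used for the K-means text clustering and the distribution plots that follow.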

    # PCA down to 2 dimensions for plotting
    pca = PCA(n_components=2)
    reduced_vs = pca.fit_transform(vs_list)
    # Scatter plot of the reduced vectors
    plt.scatter(reduced_vs[:,0], reduced_vs[:,1], c='y', marker='.')
    plt.show()

[Image: PCA scatter plots of the three vector sets]

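(2) Before K-means clustering, the data is usually reduced with PCA (principal component analysis); here enough components are kept to retain 80% of the variance of the original data.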
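
To make "retain 80%" concrete: the number of components is chosen so that the leading components' share of the total variance reaches the target. The sketch below shows the idea on made-up per-component variances. It illustrates the selection rule only and is not scikit-learn's implementation; the function name and numbers are assumptions.

```python
def components_for_variance(variances, target=0.8):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches the target."""
    total = sum(variances)
    cumulative = 0.0
    # Components are taken in order of decreasing variance.
    for k, var in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += var / total
        if cumulative >= target:
            return k
    return len(variances)

# Hypothetical variances along four principal components (ratios 0.6, 0.25, 0.1, 0.05).
variances = [6.0, 2.5, 1.0, 0.5]

print(components_for_variance(variances, target=0.8))  # 2 (0.6 + 0.25 = 0.85)
print(components_for_variance(variances, target=0.9))  # 3 (0.85 + 0.1 = 0.95)
```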

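The reduced data is then clustered with K-means for a range of cluster counts, and the distortion curve is plotted. In the figure, the horizontal axis is the number of clusters and the vertical axis is the clustering error (inertia): the sum of squared distances from each point to the center of its cluster. Lower values mean tighter clusters, but inertia generally keeps falling as clusters are added, so the count is chosen at the elbow of the curve rather than at its minimum.
By the elbow rule, the curve flattens at about 7 clusters, so the rest of the analysis uses n = 7.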
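
As a sketch of what the vertical axis measures, the function below computes inertia for a fixed set of centers: each point contributes its squared distance to the nearest center. The points and centers are made up, and the centers are given by hand rather than fitted by K-means, so this only illustrates the quantity, not the clustering itself.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def inertia(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Hypothetical data: two tight pairs of points, far apart from each other.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]

one_center = [[5, 1]]             # mean of all four points
two_centers = [[0, 1], [10, 1]]   # mean of each pair
four_centers = points             # one center per point

print(inertia(points, one_center))    # 104
print(inertia(points, two_centers))   # 4
print(inertia(points, four_centers))  # 0
```

Going from one center to two removes almost all of the error, while going from two to four removes only a little more. That bend is the elbow the rule looks for.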

    # PCA, keeping 80% of the variance
    pca = PCA(n_components=0.8)
    vs = pca.fit_transform(vs_list)
    # Elbow rule: fit K-means for 1..max_k clusters and record the inertia
    max_k = 30  # renamed from iter to avoid shadowing the built-in
    clf_inertia = [0.] * max_k
    for i in range(1, max_k + 1):
        clf = KMeans(n_clusters=i, max_iter=300)
        clf.fit(vs)
        clf_inertia[i-1] = clf.inertia_
    # Distortion curve
    plt.figure()
    plt.plot(range(1, max_k + 1), clf_inertia, c='b')
    plt.xlabel('center_num')
    plt.ylabel('inertia')
    plt.show()

[Image: distortion (elbow) curve]
(3) K-means clustering

    # Use 7 cluster centers
    k = 7
    clf = KMeans(n_clusters=k)
    clf.fit(vs)

    # Collect the reviews and vectors of each cluster, and plot the clustering result
    review = []
    with open('online_reviews_texts.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        review = np.array(lines)
    vs_list = np.array(vs_list)
    review_dict = {}  # reviews in each cluster
    vector_dict = {}  # vectors of the reviews in each cluster
    color = ['r','y','b','g','c','m','k']  # one color per cluster (k = 7)
    for i in range(k):
        members = clf.labels_ == i  # boolean mask of the reviews in cluster i
        review_dict[i] = review[members]
        vector_dict[i] = vs_list[members]
        xs = reduced_vs[members, 0]  # 2-D PCA coordinates from step (1)
        ys = reduced_vs[members, 1]
        plt.scatter(xs, ys, c=color[i], marker='.')
    plt.show()
    near_center_point(review_dict, vector_dict, num=7)

    
# For each cluster, print the num reviews closest to the cluster's centroid
def near_center_point(review_dict, vector_dict, num=7):
    length = len(review_dict)
    for i in range(0, length):
        distance = []
        content = review_dict[i]
        vector = vector_dict[i]
        leng = len(review_dict[i])
        for h in range(leng):
            vector2 = vector[h]
            # root of the summed squared distances from review h to every review in cluster i
            dis = np.sqrt(np.sum(np.square(vector - vector2)))
            distance.append(dis)
        pos = np.argsort(distance)[0:num]  # indices of the num smallest totals
        print(pos)
        print('\nCluster', (i+1), ': the', num, 'reviews closest to the centroid are:')
        print(content[pos])

if __name__=='__main__':
    main()

[Image: clustering result]
(4) Output
