Word Frequency Statistics and Word Clouds with Python

徐木叶

Published 2024-09-06 10:00:00

Tags: python, data mining

Article link: https://blog.csdn.net/m0_55024942/article/details/141931845

import pandas as pd
import requests
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

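# Read the Excel file
df = pd.read_excel("D:/Working/Suncent/DataCleaning/滤清/滤清-Amazon评论明细报表.xlsx")
# Keep only reviews rated below 4 stars
df = df[df['评论星级']<4]
count_ = len(df)
# Drop missing review texts first, then convert the rest to strings
# (astype(str) would turn NaN into the string "nan", which dropna() cannot remove)
comments = df["评论内容"].dropna().astype(str)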

# Create an empty Counter object
word_count = Counter()

# Stop words: words that carry no meaning for this analysis
stop_words = ['which','even','also','than',"not",'as','too',"it's",'when','if','so','just','from','it','any',"a", "an", "the", "and", "or", "but", "to", "of", "in", "on", "at", "for", "with", "is", "are", "was", "were", "be", "been", "this", "that", "these", "those", "can", "could", "may", "might", "will", "would", "should", "shall", "must", "had", "has", "have", "do", "does", "did", "it", "its", "they", "their", "them", "he", "him", "his", "she", "her", "hers", "you", "your", "yours", "we", "our", "us", "ours", "i", "me", "my", "mine"]

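# Iterate over all reviews and count word frequencies
for comment in comments:
    # Remove periods, split the review on whitespace, lowercase each word, and drop stop words
    comment = comment.replace(".", "")
    words = [word.lower() for word in comment.split() if word.lower() not in stop_words]
    # Add this review's words to the running Counter
    word_count.update(words)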

# Create an empty DataFrame to hold the results (columns: word, count, frequency)
result_df = pd.DataFrame(columns=["词组", "频数", "频率"])

# Iterate over all words and save the results (the translation step is disabled)
for word, count in word_count.items():
    # # Call the Baidu Translate API to translate the word
    # res = requests.get(f"http://fanyi.baidu.com/transapi?from=en&to=zh&query={word}")
    # # Parse the translation result
    # trans_word = res.json()["data"][0]["dst"]

    # Frequency: occurrences of the word divided by the number of reviews
    freq = count / len(comments)
    # Append the row to the DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    # result_df = result_df.append({"词组": word, "频数": count, "频率": freq}, ignore_index=True)
    result_df = pd.concat([result_df, pd.DataFrame({"词组": [word], "频数": [count], "频率": [freq]})], ignore_index=True)
result_df = result_df.sort_values(by='频数', ascending=False)
# Export the results to an Excel file
result_df.to_excel("D:/Working/Suncent/DataCleaning/滤清/滤清-词频统计（去掉介词等,星级低于4）.xlsx", index=False)

import seaborn as sns

# Set the font to Microsoft YaHei
plt.rcParams['font.family'] = 'Microsoft YaHei'

# Set the color palette
sns.set_palette("colorblind")

# Draw the English word cloud
wordcloud = WordCloud(background_color="white", width=6000, height=5000,max_words=500).generate_from_frequencies(word_count)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title(f'来自滤清器{count_}条低于4星的评论')
plt.show()
#
# # Draw the Chinese word cloud
# zh_word_count = Counter(result_df["翻译后的词组"])
# zh_wordcloud = WordCloud(background_color="white", font_path="simhei.ttf").generate_from_frequencies(zh_word_count)
# plt.imshow(zh_wordcloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()

This code is a Python script that analyzes and visualizes review data from an Excel file. Each step is explained in detail below:

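Import the required libraries:
- pandas: data processing and analysis.
- requests: sending HTTP requests (used only by the commented-out translation call).
- collections.Counter: counting hashable objects.
- wordcloud: generating word cloud images.
- matplotlib.pyplot: plotting.
- seaborn: enhanced plotting features.
Read the Excel file:
- Use pandas' read_excel function to read the Excel file at the given path.
- Keep only the rows whose review star rating (评论星级) is below 4.
Count the filtered reviews:
- Use len(df) to count the filtered reviews and store the result in the variable count_.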
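Convert the review text to strings and remove missing values:
- astype(str) converts the review text (评论内容) to the string type.
- dropna() removes missing values. The order matters: astype(str) turns a missing value into the string "nan", so dropna() only has an effect when it runs before astype(str).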
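Define an empty Counter object, word_count, to hold the word frequencies.
Define a list of meaningless words, stop_words, which are ignored when counting.
Iterate over all reviews and count word frequencies:
- Loop over each review in the comments Series.
- Remove periods from the review, split it on whitespace into a list of words, and lowercase each word.
- Filter out the stop words.
- Update the counts with Counter's update method.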
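The counting step above only strips periods, so tokens such as "leaks," and "it!" are counted separately from "leaks" and "it". The following is a minimal sketch of one possible alternative, not the author's method: it extracts words with a regular expression and keeps the stop words in a set. The sample reviews, the shortened stop-word set, and the names TOKEN_RE and tokenize are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample reviews; the original script reads them from an Excel column.
sample_comments = [
    "The filter didn't fit. Returned it!",
    "Leaks, leaks, leaks... not worth it.",
]

# A set gives constant-time membership tests; this is a shortened stop-word list.
stop_words = {"the", "it", "not", "a", "an"}

# Match runs of letters, optionally followed by an apostrophe part (e.g. "didn't").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its word tokens without stop words."""
    return [w for w in TOKEN_RE.findall(text.lower()) if w not in stop_words]


word_count = Counter()
for comment in sample_comments:
    word_count.update(tokenize(comment))

# Show the three most frequent words with their counts.
print(word_count.most_common(3))
```

Because punctuation never reaches the Counter, "Leaks," and "leaks..." both count toward "leaks". The pattern only handles ASCII letters and the straight apostrophe, so it would need adjusting for other scripts or typographic quotes.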
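Define an empty DataFrame, result_df, with columns for the word (词组), its count (频数), and its frequency (频率).
Iterate over the words in word_count, compute the frequency, and save the results:
- Loop over each word and its count in word_count.
- Compute each word's frequency (its count divided by the total number of reviews). This is the average number of occurrences per review, so it can exceed 1.
- Append the word, count, and frequency to result_df.
- Sort result_df by count in descending order.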
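Since the frequency above divides total occurrences by the number of reviews, a word repeated many times in one review can look more widespread than it is. As a hedged illustration, the sketch below computes that value next to the share of reviews that contain the word at least once. The tokenized sample data and the names term_count, review_count, per_review, and share are assumptions, not part of the original script.

```python
from collections import Counter

# Hypothetical, already-tokenized reviews (one list of words per review).
tokenized_reviews = [
    ["leaks", "leaks", "leaks", "worth"],
    ["filter", "fit", "returned"],
    ["filter", "leaks"],
    ["cheap", "filter"],
]

term_count = Counter()    # total occurrences of each word
review_count = Counter()  # number of reviews that contain each word

for words in tokenized_reviews:
    term_count.update(words)
    review_count.update(set(words))  # a set counts each word once per review

n_reviews = len(tokenized_reviews)
for word, count in term_count.most_common():
    per_review = count / n_reviews          # what the script stores as 频率
    share = review_count[word] / n_reviews  # fraction of reviews containing the word
    print(f"{word}: count={count}, per_review={per_review:.2f}, share_of_reviews={share:.2f}")
```

In this sample, "leaks" has the highest count but appears in fewer reviews than "filter", which is the kind of difference the second measure makes visible.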
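Export the results to an Excel file:
- Use the to_excel function to write result_df to the Excel file at the given path.
Set the plotting font and palette:
- Set matplotlib's font to Microsoft YaHei so that Chinese text displays correctly.
- Set seaborn's palette to colorblind.
Draw the English word cloud:
- Generate the word cloud with the WordCloud class.
- Display it with the imshow function.
- Hide the axes.
- Set the chart title and show the chart.
The commented-out block draws a Chinese word cloud; it is commented out because the Chinese font path and the translation API implementation are missing.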

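Note: the code has some potential issues and areas for improvement:

The Baidu Translate API call is commented out; to use it, uncomment it and provide a valid API key.
Building result_df with pd.concat inside the loop may not be the most efficient approach; consider building a list first and converting it to a DataFrame once after the loop.
The word cloud drawing does not use seaborn; it uses matplotlib directly, so the seaborn palette setting has no effect on the word cloud.
The Chinese word cloud drawing is commented out; to use it, provide a Chinese font path and implement the translation step.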
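For the pd.concat point above, one way to build the list first might look like the sketch below. It assumes pandas is installed; the small word_count and the n_comments value are hypothetical stand-ins for the script's Counter and len(comments).

```python
from collections import Counter

import pandas as pd

# Hypothetical counts standing in for the script's word_count.
word_count = Counter({"filter": 3, "leaks": 4, "fit": 1})
n_comments = 4  # stands in for len(comments)

# Collect plain dicts in the loop; no DataFrame is created per word.
rows = [
    {"词组": word, "频数": count, "频率": count / n_comments}
    for word, count in word_count.items()
]

# Build the DataFrame once, then sort by count as the original script does.
result_df = pd.DataFrame(rows, columns=["词组", "频数", "频率"])
result_df = result_df.sort_values(by="频数", ascending=False, ignore_index=True)
print(result_df)
```

Each pd.concat call in a loop copies the rows accumulated so far, so the work grows roughly quadratically with the number of words; constructing the DataFrame once avoids that.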