Data Analysis and Visualization for the Four Great Classical Novels - CSDN Blog

Article link: https://blog.csdn.net/2402_87625148/article/details/147046368

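I. Overview of the Functional Modules

When applying natural language processing to the Four Great Classical Novels, the data analysis and visualization part presents the results through several kinds of charts, which makes the features and meaning of the texts easier to understand. It covers pie-chart and bar-chart views of the word frequency statistics, a character relationship graph, and a word cloud.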

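II. Detailed Description of Each Module

(1) Word Frequency Statistics Module

1. Purpose: Count how often each word occurs in the segmented text and obtain the high-frequency words.

2. Implementation code: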

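from collections import Counter

def word_frequency(words):
    # Count how many times each token occurs in the segmented word list
    word_counts = Counter(words)
    # With no argument, most_common() returns every (word, count) pair, highest count first
    top_words = word_counts.most_common()
    return top_words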

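3. Code explanation: collections.Counter tallies the segmented word list words, counting how many times each word occurs. The most_common() method then returns a list of (word, count) tuples sorted from the highest count to the lowest.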
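To make the return value concrete, here is a minimal usage sketch. The helper name top_word_frequency, its top_n parameter, and the sample token list are illustrative assumptions and not part of the original code; the sample stands in for real segmentation output.

```python
from collections import Counter

def top_word_frequency(words, top_n=None):
    """Return (word, count) pairs sorted by count, limited to top_n if given."""
    # most_common(None) returns all pairs; most_common(n) returns only the n most frequent
    return Counter(words).most_common(top_n)

# Hypothetical token list standing in for real segmentation output
sample_words = ["悟空", "师父", "悟空", "八戒", "师父", "悟空"]

print(top_word_frequency(sample_words))
# [('悟空', 3), ('师父', 2), ('八戒', 1)]
print(top_word_frequency(sample_words, top_n=2))
# [('悟空', 3), ('师父', 2)]
```

Passing top_n to most_common() avoids building the full sorted list when only the leading words are needed.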

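(2) Pie Chart Visualization Module

1. Purpose: Read the text from a txt file (data already in memory can be used directly instead), compute the word frequencies, and draw a pie chart that shows the proportional distribution of the high-frequency words.

2. Implementation code: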

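import matplotlib.pyplot as plt

# Chinese labels need a CJK-capable font; this assumes SimHei is installed
plt.rcParams['font.sans-serif'] = ['SimHei']

def generate_pie_chart_from_txt(file_path, top_n=10):
    # Read the whole text file
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    # word_segmentation is the tokenization function from the segmentation step; it is not defined in this post
    words = word_segmentation(content)
    word_freq = word_frequency(words)
    # Keep only the top_n most frequent words
    top_words = word_freq[:top_n]
    labels = [word for word, _ in top_words]
    sizes = [freq for _, freq in top_words]
    # autopct prints each slice's share of the plotted total
    plt.pie(sizes, labels=labels, autopct='%1.1f%%')
    plt.title("词频分布饼状图")  # "Word frequency distribution pie chart"
    plt.show()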

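3. Code explanation: The function first reads the text from the given path file_path, then runs word segmentation (word_segmentation) and word frequency counting (word_frequency). It keeps the top_n most frequent words, using the words as labels and their counts as sizes. Finally, matplotlib's plt.pie() draws the pie chart; the autopct parameter prints each slice's percentage of the plotted total, and plt.show() displays the figure.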
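Because only the top_n words are plotted, the percentages in the pie are shares of those words alone, not of the whole text. One possible variation, sketched below under stated assumptions, folds every remaining word into a single extra slice so that the percentages relate to the full token count. The function name pie_slices and the other_label parameter are hypothetical and do not appear in the original code.

```python
def pie_slices(word_freq, top_n=10, other_label="other"):
    """Build labels and sizes for a pie chart, merging the tail into one slice.

    word_freq is a list of (word, count) pairs sorted by count, highest first,
    such as the list returned by word_frequency().
    """
    top_words = word_freq[:top_n]
    labels = [word for word, _ in top_words]
    sizes = [count for _, count in top_words]
    # Sum the counts of every word outside the top_n
    rest = sum(count for _, count in word_freq[top_n:])
    if rest > 0:
        labels.append(other_label)
        sizes.append(rest)
    return labels, sizes

# Hypothetical frequency list for illustration
labels, sizes = pie_slices([("a", 6), ("b", 3), ("c", 1)], top_n=2)
print(labels, sizes)
# ['a', 'b', 'other'] [6, 3, 1]
```

The returned labels and sizes can be passed to plt.pie(sizes, labels=labels, autopct='%1.1f%%') in the same way as in the function above. For a full novel the merged slice will usually dominate the chart, so whether to include it depends on what the chart is meant to show.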

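(3) Bar Chart Visualization Module

1. Purpose: Present the word frequency statistics as a bar chart so that the counts of the high-frequency words can be compared clearly.

2. Implementation code: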

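import matplotlib.pyplot as plt

# Chinese labels need a CJK-capable font; this assumes SimHei is installed
plt.rcParams['font.sans-serif'] = ['SimHei']

def generate_bar_chart(word_freq, top_n=10):
    # Keep only the top_n most frequent (word, count) pairs
    top_words = word_freq[:top_n]
    labels = [word for word, _ in top_words]
    sizes = [freq for _, freq in top_words]
    plt.bar(labels, sizes)
    plt.title("词频分布柱状图")  # "Word frequency distribution bar chart"
    plt.xlabel("词语")  # "Word"
    plt.ylabel("频率")  # "Frequency"
    # Rotate the x tick labels so neighbouring words do not overlap
    plt.xticks(rotation=45)
    plt.show()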

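3. Code explanation: The function takes the top_n most frequent words from the word frequency result word_freq and extracts the word labels (labels) and their counts (sizes). plt.bar() draws the bar chart; the code then sets the chart title, the x-axis label (xlabel) and the y-axis label (ylabel). plt.xticks(rotation=45) rotates the x-axis tick labels by 45 degrees so they are easier to read, and plt.show() displays the figure.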
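Rotating the tick labels is one way to keep them readable. Another option, shown here only as a sketch, is to draw horizontal bars so each word sits on its own row. The function name build_horizontal_bar_chart is hypothetical, the Agg backend is chosen on the assumption that the code may run without a display, and English axis text is used so the sketch does not depend on a Chinese font being installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

def build_horizontal_bar_chart(word_freq, top_n=10):
    """Return a Figure with one horizontal bar per (word, count) pair."""
    top_words = word_freq[:top_n]
    labels = [word for word, _ in top_words]
    counts = [count for _, count in top_words]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(labels, counts)
    # barh draws the first item at the bottom; invert so the most frequent word is on top
    ax.invert_yaxis()
    ax.set_xlabel("Count")
    ax.set_ylabel("Word")
    ax.set_title("Top word frequencies")
    fig.tight_layout()
    return fig
```

Returning the Figure instead of calling plt.show() leaves the caller free to save it, for example with fig.savefig("bar_chart.png"), where the file name is only an example.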

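(4) Character Relationship Graph Module

1. Purpose: Analyze the relationships between characters in the text by building a graph from how often they co-occur, and visualize how strongly the characters are connected.

2. Implementation code: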

import networkx as nx
import matplotlib.pyplot as plt

# Chinese names and the title need a CJK-capable font; this assumes SimHei is installed
plt.rcParams['font.sans-serif'] = ['SimHei']

def visualize_relationship_graph(characters, text):
    # Undirected graph with one node per character
    G = nx.Graph()
    G.add_nodes_from(characters)
    # Split the text into sentences on the Chinese full stop
    sentences = text.split('。')
    for sentence in sentences:
        # Characters whose names appear in this sentence
        present_characters = [char for char in characters if char in sentence]
        if len(present_characters) > 1:
            # Visit every unordered pair of co-occurring characters
            for i in range(len(present_characters)):
                for j in range(i + 1, len(present_characters)):
                    if G.has_edge(present_characters[i], present_characters[j]):
                        G[present_characters[i]][present_characters[j]]['weight'] += 1
                    else:
                        G.add_edge(present_characters[i], present_characters[j], weight=1)
    pos = nx.spring_layout(G)
    nx.draw(G, pos, with_labels=True, node_size=1500, node_color='skyblue', font_size=10, font_weight='bold', arrows=False)
    # Label each edge with its co-occurrence count
    labels = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
    plt.title("人物关系图")  # "Character relationship graph"
    plt.show()

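3. Code explanation: The function first creates an undirected networkx graph G and adds the input character list characters as nodes. It splits the text into sentences (using the Chinese full stop as the delimiter), iterates over the sentences, and finds the characters that appear in each one. When a sentence contains more than one character, an edge is added between each pair of those characters, and the edge weight is the number of sentences in which the pair co-occurs. nx.spring_layout() computes the layout and nx.draw() draws the graph with the chosen node and label styles. nx.get_edge_attributes() retrieves the edge weights, nx.draw_networkx_edge_labels() draws them as edge labels, and plt.show() displays the figure.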
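The co-occurrence counting described above can be separated from the drawing, which makes it easier to inspect the weights. The following is a minimal sketch using only the standard library. The function name cooccurrence_weights is hypothetical, and splitting on ！ and ？ in addition to 。 is a variation introduced here, not something the original code does.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_weights(characters, text):
    """Count, for each pair of characters, the sentences that mention both."""
    weights = Counter()
    # Assumption: treat 。, ！ and ？ as sentence boundaries
    for sentence in re.split(r"[。！？]", text):
        # present keeps the order of the characters list, so each pair has one stable key
        present = [name for name in characters if name in sentence]
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical sample text for illustration
names = ["刘备", "关羽", "张飞"]
sample = "刘备与关羽同行。关羽、张飞饮酒！刘备、关羽、张飞结义。"
print(cooccurrence_weights(names, sample))
# Counter({('刘备', '关羽'): 2, ('关羽', '张飞'): 2, ('刘备', '张飞'): 1})
```

Each ((a, b), weight) item can then be added to a graph with G.add_edge(a, b, weight=weight). Note that matching by substring counts a name only when it appears literally, so aliases and courtesy names would need to be mapped to one canonical name beforehand.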

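(5) Word Cloud Visualization Module

1. Purpose: Generate a word cloud from the segmentation result, emphasizing the words that occur most often so that the themes and focus of the text stand out.

2. Implementation code: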

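from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(words, output_file):
    # WordCloud.generate() expects a single string, so join the tokens with spaces
    text = " ".join(words)
    # font_path must point to a font with Chinese glyphs; simhei.ttf has to be resolvable on this machine
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", width=800, height=400).generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    # Save before plt.show(), because the figure may be gone once the window is closed
    plt.savefig(output_file)
    plt.show()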

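3. Code explanation: The segmented word list words is joined with spaces into a single string text. The WordCloud class generates the word cloud from it, with parameters for the font path (a font that supports Chinese must be specified; simhei.ttf is used here), background color, width, and height. plt.imshow() renders the word cloud, plt.axis("off") hides the axes, plt.savefig() saves the image to the path output_file, and plt.show() displays the figure.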
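A word cloud of raw tokens is often dominated by particles and other function words. One way to address this, sketched below, is to build a filtered frequency dictionary first. The function name cloud_frequencies, its parameters, and the sample stopword list are all hypothetical; the original post does not filter its input.

```python
from collections import Counter

def cloud_frequencies(words, stopwords=(), min_length=2, max_words=200):
    """Return a {word: count} dict with short tokens and stopwords removed."""
    stop = set(stopwords)
    counts = Counter(
        w for w in words
        if len(w) >= min_length and w not in stop
    )
    # Keep only the max_words most frequent entries
    return dict(counts.most_common(max_words))

# Hypothetical token list and stopword list for illustration
tokens = ["宝玉", "了", "黛玉", "宝玉", "说道", "的"]
print(cloud_frequencies(tokens, stopwords=["说道"]))
# {'宝玉': 2, '黛玉': 1}
```

The resulting dictionary can be passed to WordCloud(...).generate_from_frequencies(...) in place of generate(text), which uses the given counts directly rather than tokenizing a joined string again.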