Natural Language Processing: Using the NLP Tool spaCy to Find Words That Describe a Character in the Four Great Classical Novels

Article link: https://blog.csdn.net/2402_87625148/article/details/146321368

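I. Features Implemented by the Program

1. Word segmentation: splits the continuous text of the Four Great Classical Novels into individual words for later analysis, and removes stop words so that the analysis focuses on meaningful tokens.

2. Word-frequency statistics: counts how often each segmented word occurs and extracts the high-frequency words, which reveals the core content and common expressions of the text.

3. Part-of-speech classification and saving: tags each word with its part of speech and saves the words, grouped by tag, to a txt file, so that the part-of-speech distribution of the text can be studied.

4. Visualization:

● Pie chart: reads data from a txt file, computes word frequencies, and draws a pie chart showing the share of each high-frequency word.

● Bar chart: presents the word-frequency results as a bar chart, making it easy to compare how often each high-frequency word occurs.

● Relationship graph: analyzes the relationships between characters by building a graph from their co-occurrence in the text, and visualizes how strongly the characters are connected.

● Word cloud: generates a word cloud from the segmentation results, emphasizing the most frequent words and showing the themes and focus of the text at a glance.

5. Custom dictionary: lets the user manually add words to the segmentation dictionary, which improves segmentation accuracy for domain-specific vocabulary and technical terms.

6. Entity statistics and saving:

● Person names: identifies and counts person names in the text and saves them to a txt file, which supports analysis of the characters in a work.

● Place names: extracts place names from the text and saves them to a txt file, which can be used to study the locations and geographic background of a work.

● Weapons: finds weapon-related words in the text through keyword matching and saves them, providing data for analyzing combat elements and weapon descriptions.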

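II. Design Approach

1. Data acquisition and preprocessing: obtain the Four Great Classical Novels as txt files from the web or another reliable source. Clean the raw text by removing special characters and unifying the encoding. Load a stop-word list and, during segmentation, filter out words that carry little meaning on their own, such as modal particles and conjunctions.

2. Functional modules: break the overall NLP task into independent modules, each implementing one specific function such as segmentation or word-frequency counting. This modular design keeps the program structure clear and makes it easy to maintain and extend.

3. Visualization: choose a suitable visualization tool and chart type for each result. Pie charts suit proportions, bar charts make values easy to compare, relationship graphs show the connections between characters directly, and word clouds highlight high-frequency words.

4. Entity recognition and statistics: use part-of-speech tagging and keyword matching to identify person names, place names, and weapons in the text. Person and place names are selected by their part-of-speech tags ("nr" marks a person name and "ns" marks a place name); weapons are matched against a predefined keyword list.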
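
The entity step in item 4 can be illustrated without running a tagger. The sketch below assumes the tagged input is already available as (word, flag) pairs in the style of jieba.posseg output. The sample pairs, the sample sentence, and the helper names entities_by_flag and count_keywords are hypothetical and are not part of the original program.

```python
from collections import Counter

def entities_by_flag(tagged, flag):
    """Count the words whose part-of-speech flag equals `flag`."""
    return Counter(word for word, f in tagged if f == flag)

def count_keywords(text, keywords):
    """Count non-overlapping occurrences of each keyword that appears in text."""
    return Counter({kw: text.count(kw) for kw in keywords if kw in text})

# Hypothetical tagged pairs, shaped like jieba.posseg output
tagged = [("宝玉", "nr"), ("来到", "v"), ("金陵", "ns"), ("宝玉", "nr")]

names = entities_by_flag(tagged, "nr")    # person names
places = entities_by_flag(tagged, "ns")   # place names
weapons = count_keywords("宝剑出鞘，刀光剑影", ["剑", "刀", "枪"])

print(names.most_common())
print(places.most_common())
print(weapons.most_common())
```

Returning a Counter rather than a plain list keeps the counts together with the entities, so the most frequent names or places can be read off with most_common().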

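III. Main Libraries and Functions Used

1. jieba: a Chinese word-segmentation library.

● jieba.lcut(text): segments the text in precise mode and returns the result as a list of words.

● jieba.posseg.lcut(text): segments the text and tags parts of speech, returning a list of pairs that each unpack into a word and its part-of-speech flag.

● jieba.add_word(word): adds a custom word to the segmentation dictionary.

2. networkx: a library for creating, manipulating, and studying complex network structures.

● nx.Graph(): creates an undirected graph object.

● G.add_nodes_from(nodes): adds nodes to the graph.

● G.add_edge(u, v, weight=1): adds an edge between nodes u and v, optionally with a weight.

● nx.spring_layout(G): computes node positions for drawing the graph.

● nx.draw(G, pos, ...): draws the graph, where pos gives the node positions.

3. matplotlib: a data-visualization library.

● plt.pie(sizes, labels=labels, autopct='%1.1f%%'): draws a pie chart; sizes gives the size of each wedge, labels gives the wedge labels, and autopct formats the percentage shown on each wedge.

● plt.bar(x, height): draws a bar chart; x gives the bar positions and height gives the bar heights.

● plt.show(): displays the figure.

4. wordcloud: a library for generating word clouds.

● WordCloud(font_path="simhei.ttf", background_color="white", width=800, height=400).generate(text): generates a word cloud from text; the font path, background color, width, and height are configurable.

5. collections.Counter: a class that counts occurrences of hashable objects.

● Counter(words).most_common(): counts each word in the list and returns (word, count) pairs ordered from most to least frequent.

6. nltk: the Natural Language Toolkit.

● nltk.corpus.stopwords.words('chinese'): returns the Chinese stop-word list.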
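
Of the calls listed above, Counter is part of the Python standard library, so its behavior can be shown without installing anything. The following is a minimal sketch. The helper name top_n_words and the sample word list are hypothetical; in the program the words would come from jieba.lcut.

```python
from collections import Counter

def top_n_words(words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)

# Hypothetical pre-segmented words
sample = ["宝玉", "黛玉", "宝玉", "贾母", "宝玉", "黛玉"]
print(top_n_words(sample, 2))  # [('宝玉', 3), ('黛玉', 2)]
```

Passing n to most_common avoids sorting and slicing the full list by hand when only the top few entries are needed for a chart.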

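IV. Test Data

The full text of Dream of the Red Chamber (《红楼梦》) is used as test data. As a classic of Chinese literature it has many characters, an intricate plot, and rich language, so it exercises every feature of the program.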
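
Before the text is passed to the program, the cleaning mentioned in the design section (unifying the encoding and removing stray characters) might look like the sketch below. The fallback encoding list, the choice of characters to strip, and the helper names decode_text and clean_text are assumptions for illustration; the original post does not show its cleaning code.

```python
import re

def decode_text(raw, encodings=("utf-8", "gb18030")):
    """Decode bytes by trying each encoding in turn (assumed fallback order)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode text with any of {encodings}")

def clean_text(text):
    """Strip layout characters that carry no meaning in Chinese prose."""
    text = text.replace("\ufeff", "")          # drop a stray byte-order mark
    text = re.sub(r"[ \t\u3000]+", "", text)   # remove ASCII and full-width spaces
    text = re.sub(r"\r\n?", "\n", text)        # normalize line endings
    text = re.sub(r"\n{2,}", "\n", text)       # collapse blank lines
    return text.strip()

# Usage with a file on disk (path taken from the program's main block):
#   from pathlib import Path
#   text = clean_text(decode_text(Path("hongloumeng.txt").read_bytes()))
raw = "　　宝玉 笑道。\r\n\r\n黛玉不语。".encode("gb18030")
print(clean_text(decode_text(raw)))
```

Trying UTF-8 first and a GB-family encoding second is one common arrangement for Chinese txt files downloaded from the web; other sources may need a different list.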

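# Complete program listing (requires jieba, networkx, matplotlib, wordcloud, and nltk)
import jieba
import jieba.posseg as pseg
import networkx as nx
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
import nltk

# Use a CJK-capable font so Chinese labels render (assumes SimHei is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Download the NLTK stop-word corpus
nltk.download('stopwords')
# Combine NLTK's Chinese stop words with a few classical-Chinese particles
stop_words = set(stopwords.words('chinese'))
custom_stopwords = ["之", "其", "也", "矣", "乎", "者", "邪", "哉"]
stop_words.update(custom_stopwords)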

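# 1. Word segmentation
def word_segmentation(text):
    """Segment text with jieba, dropping stop words and single-character tokens."""
    words = jieba.lcut(text)
    filtered_words = [word for word in words if word not in stop_words and len(word) > 1]
    return filtered_words

# 2. Word-frequency statistics
def word_frequency(words):
    """Return (word, count) pairs ordered from most to least frequent."""
    word_counts = Counter(words)
    top_words = word_counts.most_common()
    return top_words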

# 3. Group words by part of speech and save them to a txt file
def pos_classification_save(words, text, output_file):
    word_set = set(words)  # set lookup keeps the membership test fast on a full novel
    words_and_pos = pseg.lcut(text)
    pos_dict = {}
    for word, flag in words_and_pos:
        if word in word_set:
            pos_dict.setdefault(flag, []).append(word)

    with open(output_file, 'w', encoding='utf-8') as f:
        for pos, word_list in pos_dict.items():
            f.write(f"{pos}: {', '.join(word_list)}\n")

# 4. Read a txt file and draw a pie chart of its most frequent words
def generate_pie_chart_from_txt(file_path, top_n=10):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    words = word_segmentation(content)
    word_freq = word_frequency(words)
    top_words = word_freq[:top_n]
    labels = [word for word, _ in top_words]
    sizes = [freq for _, freq in top_words]

    plt.pie(sizes, labels=labels, autopct='%1.1f%%')
    plt.title("Word Frequency Pie Chart")
    plt.show()

# 5. Bar chart of the most frequent words
def generate_bar_chart(word_freq, top_n=10):
    top_words = word_freq[:top_n]
    labels = [word for word, _ in top_words]
    sizes = [freq for _, freq in top_words]

    plt.bar(labels, sizes)
    plt.title("Word Frequency Bar Chart")
    plt.xlabel("Word")
    plt.ylabel("Frequency")
    plt.xticks(rotation=45)
    plt.show()

# 6. Character relationship graph
def visualize_relationship_graph(characters, text):
    G = nx.Graph()
    G.add_nodes_from(characters)
    # Two characters are linked each time they appear in the same sentence
    sentences = text.split('。')
    for sentence in sentences:
        present_characters = [char for char in characters if char in sentence]
        if len(present_characters) > 1:
            for i in range(len(present_characters)):
                for j in range(i + 1, len(present_characters)):
                    u, v = present_characters[i], present_characters[j]
                    if G.has_edge(u, v):
                        G[u][v]['weight'] += 1
                    else:
                        G.add_edge(u, v, weight=1)

    pos = nx.spring_layout(G)
    nx.draw(G, pos, with_labels=True, node_size=1500, node_color='skyblue', font_size=10, font_weight='bold', arrows=False)
    # Label each edge with its co-occurrence count
    labels = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
    plt.title("Character Relationship Graph")
    plt.show()

# 7. Word cloud
def generate_wordcloud(words, output_file):
    text = " ".join(words)
    # font_path must point to a font that contains Chinese glyphs
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", width=800, height=400).generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.savefig(output_file)
    plt.show()

# 8. Manually maintained custom dictionary
def add_custom_dict(custom_words):
    for word in custom_words:
        jieba.add_word(word)

# 9. Count person names and save them to a txt file
def count_and_save_names(text, output_file):
    words_and_pos = pseg.lcut(text)
    # jieba tags person names with the flag 'nr'
    names = Counter(word for word, flag in words_and_pos if flag == 'nr')
    with open(output_file, 'w', encoding='utf-8') as f:
        for name, count in names.most_common():
            f.write(f"{name}\t{count}\n")

# 10. Count place names and save them to a txt file
def count_and_save_places(text, output_file):
    words_and_pos = pseg.lcut(text)
    # jieba tags place names with the flag 'ns'
    places = Counter(word for word, flag in words_and_pos if flag == 'ns')
    with open(output_file, 'w', encoding='utf-8') as f:
        for place, count in places.most_common():
            f.write(f"{place}\t{count}\n")

# 11. Count weapon mentions and save them to a txt file (simple keyword matching; refine as needed)
def count_and_save_weapons(text, output_file):
    weapon_keywords = ["剑", "刀", "枪", "棍", "斧", "弓箭"]
    with open(output_file, 'w', encoding='utf-8') as f:
        for keyword in weapon_keywords:
            # str.count tallies non-overlapping occurrences of the keyword
            f.write(f"{keyword}\t{text.count(keyword)}\n")

if __name__ == "__main__":
    # Assumes the text of the novels is already available; Dream of the Red Chamber is used here
    with open('hongloumeng.txt', 'r', encoding='utf-8') as file:
        text = file.read()

    # Register custom words first so that they affect every later segmentation
    custom_words = ["荣国府", "宁国府"]
    add_custom_dict(custom_words)

    # Word segmentation
    words = word_segmentation(text)

    # Word-frequency statistics
    word_freq = word_frequency(words)

    # Save words grouped by part of speech
    pos_classification_save(words, text, 'pos_result.txt')

    # Pie chart
    generate_pie_chart_from_txt('hongloumeng.txt')

    # Bar chart
    generate_bar_chart(word_freq)

    # Assumed list of characters
    characters = ["贾宝玉", "林黛玉", "薛宝钗", "贾母", "王熙凤"]
    # Character relationship graph
    visualize_relationship_graph(characters, text)

    # Word cloud
    generate_wordcloud(words, 'wordcloud_result.png')

    # Count and save person names
    count_and_save_names(text, 'names_result.txt')

    # Count and save place names
    count_and_save_places(text, 'places_result.txt')

    # Count and save weapons
    count_and_save_weapons(text, 'weapons_result.txt')
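
The co-occurrence counting inside visualize_relationship_graph can also be tried on its own, without networkx or matplotlib. The sketch below is a standard-library variant, not the program's own code. It assumes that sentences end at "。", "！" or "？" (the listing above splits on "。" only), and the helper name cooccurrence_counts and the sample text are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(characters, text):
    """Count how many sentences mention each pair of characters together."""
    counts = Counter()
    for sentence in re.split(r"[。！？]", text):
        # Sorting gives every pair one canonical (a, b) ordering
        present = sorted(c for c in characters if c in sentence)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Hypothetical sample text
sample_text = "宝玉和黛玉说话。宝玉见了贾母！黛玉与宝玉同行？贾母独坐。"
pairs = cooccurrence_counts(["宝玉", "黛玉", "贾母"], sample_text)
for (a, b), weight in pairs.most_common():
    print(a, b, weight)
```

Each ((a, b), weight) entry corresponds to one G.add_edge(a, b, weight=weight) call, so the same counts could feed the graph drawing step. Because the match is a plain substring test, a full name such as "贾宝玉" is only counted where the text spells it out in full.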