I. Preparation
1. Copy the following article into a Word document:
Japan officially starts discharging nuclear-contaminated water into the sea: is seafood still safe to eat? (qq.com) https://mp.weixin.qq.com/s/R3_D0K4O7l-HLasEcg1TgQ
2. Install the dependencies:
python-docx, jieba, pandas
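All three packages are on PyPI and can be installed in one command (this assumes `pip` for your target Python interpreter is on your PATH; note the package is installed as `python-docx` but imported as `docx`):

```shell
pip install python-docx jieba pandas
```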
II. Python Code
import docx
import jieba
from collections import Counter
import pandas as pd
'''
1. Read the .docx file.
Document: represents the whole docx file
paragraphs: the list of paragraphs in the document
text: the text content of a paragraph
'''
doc = docx.Document(r"C:\Users\Wendy\Desktop\python.docx")
content = "".join([para.text for para in doc.paragraphs])
'''
2. Segment the text into words with jieba.
'''
seg_list = jieba.cut(content, cut_all=False)  # precise mode: split the text into words and punctuation tokens
seg_list = [word for word in seg_list if len(word) > 1]  # drop punctuation and single-character tokens
'''
3. Count word frequencies.
'''
counter = Counter(seg_list)
# print(type(counter))  # Counter, a dict subclass mapping word -> frequency
# for key, count in counter.items():
#     print(key, count)
'''
4. Convert the Counter to a DataFrame and sort by descending frequency.
'''
df = pd.DataFrame(counter.items(), columns=["word", "count"])
df = df.sort_values(by="count", ascending=False, ignore_index=True)
print(df.head(10))
The result (the 10 most frequent words) is shown in the figure:
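As a quick sanity check of steps 3 and 4, here is a minimal self-contained sketch that runs without the .docx file, using a hand-made token list in place of jieba's output (the tokens are hypothetical sample data, not from the article):

```python
from collections import Counter

import pandas as pd

# Pretend jieba has already segmented the text into these tokens (sample data).
tokens = ["核污染", "排海", "海鲜", "核污染", "日本", "核污染", "排海"]

counter = Counter(tokens)  # word -> frequency, e.g. "核污染" -> 3

# Counter.items() yields (word, count) pairs, which DataFrame accepts directly.
df = pd.DataFrame(counter.items(), columns=["word", "count"])
df = df.sort_values(by="count", ascending=False, ignore_index=True)

print(df.head(3))  # the most frequent word, "核污染", lands in row 0
```

The same two-line Counter-to-DataFrame conversion is what the full script applies to the Word document's text.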