Learning Python: Using the jieba Library and Counting Word Frequencies in Text (Part 13)

Latest recommended article published on 2024-07-26 17:36:26

Blessings_14

Tags: python, learning, search engine

Article link: https://blog.csdn.net/m0_74421158/article/details/130577535

Composite Data Types

Using the jieba library
Word-frequency statistics for text

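Using the jieba library

Basic introduction to jieba

Overview of jieba

jieba is an excellent third-party library for Chinese word segmentation

Chinese text must be segmented to obtain individual words
jieba is a third-party library, so it must be installed separately
jieba provides three segmentation modes; at the simplest, you only need to learn one function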

Installing jieba

python -m pip install --upgrade pip  # upgrade pip
pip install jieba  # install the jieba library

Note: download speed depends on your network environment

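How jieba segmentation works

jieba segmentation relies on a Chinese dictionary

It uses a Chinese dictionary to estimate how likely adjacent characters are to belong together
Characters with a high association probability are grouped into words, which forms the segmentation result
Beyond segmentation, users can also add their own custom words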
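
The idea in the list above can be sketched with a toy example. This is not jieba's actual implementation: the dictionary `TOY_FREQ`, its counts, and the functions `segment` and `add_word` are hypothetical, chosen only to show how dictionary-based probabilities can favor one segmentation over another, here via a simple dynamic program.

```python
import math

# Hypothetical mini-dictionary: word -> made-up frequency count.
TOY_FREQ = {
    "中国": 50, "是": 80, "一个": 40, "一": 20, "个": 20,
    "伟大": 30, "的": 100, "国家": 45, "国": 10, "家": 10,
}
TOTAL = sum(TOY_FREQ.values())


def word_logprob(word):
    """Log-probability of a word; unknown single characters get a small floor count."""
    return math.log(TOY_FREQ.get(word, 0.5) / TOTAL)


def segment(text, max_len=4):
    """Return the segmentation with the highest total log-probability."""
    n = len(text)
    # best[i] = (score, start index of the last word) for the best split of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Multi-character candidates must be dictionary words.
            if len(word) > 1 and word not in TOY_FREQ:
                continue
            score = best[start][0] + word_logprob(word)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the recorded split points.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


def add_word(word, freq=10):
    """Add a user-defined word to the toy dictionary."""
    global TOTAL
    TOY_FREQ[word] = freq
    TOTAL = sum(TOY_FREQ.values())
```

With this toy dictionary, `segment("中国是一个伟大的国家")` returns `['中国', '是', '一个', '伟大', '的', '国家']`, because each dictionary word is more probable than the product of its single characters. A string of unknown characters falls back to one character per token until `add_word` registers it as a word.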

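How to use jieba

The three segmentation modes

Precise mode, full mode, search-engine mode

Precise mode: splits the text precisely, with no redundant words
Full mode: scans out every possible word in the text, so the result contains redundancy
Search-engine mode: starts from the precise-mode result and splits long words again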

Commonly used jieba function: jieba.lcut(s)

Precise mode, full mode, search-engine mode
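
A short sketch of how the three modes are typically invoked, assuming jieba is installed. `jieba.lcut(s)`, `jieba.lcut(s, cut_all=True)`, and `jieba.lcut_for_search(s)` are jieba's list-returning segmentation calls; the helper name `show_modes` is illustrative only. The exact tokens depend on the jieba version and its dictionary, so no output is shown here.

```python
def show_modes(text):
    """Segment `text` in the three jieba modes and return the results as a dict."""
    import jieba  # third-party; imported here so the module loads without it

    return {
        "precise": jieba.lcut(text),             # precise mode, no redundancy
        "full": jieba.lcut(text, cut_all=True),  # full mode, redundant candidates
        "search": jieba.lcut_for_search(text),   # search-engine mode
    }
```

Calling `show_modes("中国是一个伟大的国家")` and printing each entry is a quick way to compare the modes side by side. In precise mode the tokens joined together reproduce the input text. Custom words can be registered beforehand with `jieba.add_word(w)`.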

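Word-frequency statistics for text

Problem analysis

Requirements

Given an article, which words appear in it?
Which words appear most often?
How does the approach differ for English and Chinese text?

Texts

English text: Hamlet, used to analyze word frequency
Hamlet: https://python123.io/resources/pye/hamlet.txt

Chinese text: Romance of the Three Kingdoms (《三国演义》), used to analyze characters: https://python123.io/resources/pye/threekingdoms.txt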

English word-frequency statistics

# CalHamletV1.py
def getText():
    # Read the whole file and normalize it to lowercase
    with open("hamlet.txt", "r") as f:
        txt = f.read().lower()
    # Replace punctuation with spaces so split() yields clean words
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    # Default to 0 the first time a word is seen
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort by count, most frequent first
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
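
As a possible alternative, the same counting can be sketched more compactly with the standard library's `collections.Counter` and `str.translate`. The function name `top_words` is hypothetical. Unlike the script above, this version uses `string.punctuation`, which also treats the apostrophe and the backtick as separators.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent lowercase words in `text` as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)
```

For the Hamlet file, one would read the text first and pass it in, for example `top_words(open("hamlet.txt").read())`.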

Chinese word-frequency statistics

Version 1: inaccurate results

# CalThreeKingdomsV1.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens (punctuation, particles, and so on)
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort by count, most frequent first
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))

Version 2 builds on the previous one by adding an exclusion set and merging the different names used for the same character

# CalThreeKingdomsV2.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Map aliases to one canonical name per character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    # pop() avoids a KeyError if an excluded word never occurred
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
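
As the alias list grows, the `elif` chain becomes hard to maintain. One possible refactoring, sketched below, keeps the aliases in a lookup table. The names `ALIASES`, `EXCLUDES`, and `count_names` are hypothetical; the alias pairs and excluded words are the ones used in CalThreeKingdomsV2.py, and the function takes an already segmented word list so it does not depend on jieba.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif chain in CalThreeKingdomsV2.py.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_names(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count multi-character tokens, merging aliases and skipping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        if name in excludes:
            continue
        counts[name] += 1
    return counts
```

With the segmented text from the script, `count_names(words).most_common(10)` would give the ten most frequent names. Adding a new alias or excluded word then only requires editing the two tables.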