Python - Using the jieba Library

Latest recommended article published on 2025-03-09 10:46:14

->yjy

Article link: https://blog.csdn.net/2301_79602614/article/details/143722678

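Overview of the jieba library

jieba is an excellent third-party library for Chinese word segmentation

Chinese text has no spaces between words, so it must be segmented to obtain individual words
jieba is a third-party library and must be installed separately
jieba offers three segmentation modes; for the simplest use, a single function is enough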

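The three segmentation modes of jieba

Precise mode, full mode, search engine mode

Precise mode: splits the text precisely into words, with no redundant words
Full mode: scans out every possible word in the text, so the result contains redundancy
Search engine mode: builds on precise mode and splits long words again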

Installing jieba

From the command line (cmd): pip install jieba

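How jieba segmentation works

It uses a Chinese lexicon to estimate how likely Chinese characters are to occur together
Characters with a high probability of occurring together are grouped into words, which gives the segmentation result
Besides segmentation, users can also add their own custom words (for example via jieba.add_word)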
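
The idea described above can be illustrated with a toy sketch. This is not jieba's implementation: the lexicon, the frequencies, and the function name `segment` are all made up for illustration, and jieba's real dictionary and its handling of unknown words are far more elaborate. The sketch picks the split of a sentence whose words have the highest combined probability under a small frequency table.

```python
import math

# Hypothetical toy lexicon: word -> frequency (made-up numbers, not jieba's data)
FREQ = {
    "努力": 50, "学习": 60, "努力学习": 30, "力学": 8,
    "在": 100, "努": 1, "力": 10, "学": 10, "习": 2,
}
TOTAL = sum(FREQ.values())


def segment(sentence, freq=FREQ, total=TOTAL, max_len=4):
    """Return the split of `sentence` with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = sentence[i:j]
            # Unknown single characters get a smoothing count of 1,
            # unknown multi-character strings are not allowed as words
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    # Walk the stored end indices to rebuild the best split
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("在努力学习"))
```

With these invented frequencies, the single word 努力学习 scores higher than the pair 努力 + 学习, so it is kept whole. Lowering its frequency in the table would make the two-word split win instead.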

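Common jieba functions

Function	Description
jieba.cut(s)	Precise mode; returns an iterable (a generator)
jieba.cut(s, cut_all=True)	Full mode; yields every possible word in the text s
jieba.cut_for_search(s)	Search engine mode; yields a segmentation suited to building a search index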

'''
@Author: yjy
@Time: 2024/11/12
'''
import jieba

s = 'yjy在努力学习Python'
print(jieba.cut(s)) # <generator object Tokenizer.cut at 0x0000021EFBCA4040>
print(jieba.cut(s,cut_all=True)) # <generator object Tokenizer.cut at 0x000001A6DE434040>
print(jieba.cut_for_search(s)) # <generator object Tokenizer.cut_for_search at 0x000002BDA3E73890>
print(list(jieba.cut(s))) # ['yjy', '在', '努力学习', 'Python']
print(list(jieba.cut(s,cut_all=True))) # ['yjy', '在', '努力', '努力学习', '力学', '学习', 'Python']
print(list(jieba.cut_for_search(s))) # ['yjy', '在', '努力', '力学', '学习', '努力学习', 'Python']

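Example: word frequency counting

Problem analysis:
Word frequency counting in text

Requirement: given an article, which words appear in it, and which appear most often?
How do we do that?

We use the text at
https://python123.io/resources/pye/hamlet.txt as the example:
"Word frequency counting for Hamlet (English)"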

# Denoise and normalize the text
def getText():
    # Assumes hamlet.txt has been downloaded to the working directory
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so that split() yields clean words
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt

# Represent word frequencies with a dictionary
if __name__ == '__main__':
    hamletTxt = getText()
    words = hamletTxt.split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1  # count occurrences of each word
    items = list(counts.items())
    # print(items)  # ('the', 1138), ('tragedy', 3), ('of', 669), ('hamlet', 462), ...
    items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

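"""
Basic syntax of format strings
A format string is built from {} and :, where {} is the placeholder and the format specifier follows the colon. The specifier can set alignment, fill character, width, precision, and more.

Alignment and width
Alignment:
<: left-aligned
>: right-aligned
^: centered
Width:
Sets the minimum width of the field. If the content is shorter than that width, it is padded with spaces or with the specified fill character.
Examples
Left-aligned (<)

print("{0:<10}".format("hello"))
Output:

hello     
Explanation: "hello" is left-aligned in a field of width 10, padded with spaces on the right.

Right-aligned (>)

print("{0:>10}".format("hello"))
Output:

     hello
Explanation: "hello" is right-aligned in a field of width 10, padded with spaces on the left.

Centered (^)

print("{0:^10}".format("hello"))
Output:

  hello   
Explanation: "hello" is centered in a field of width 10; the five padding spaces are split as two on the left and three on the right.

Fill character
A fill character can be given before the alignment flag. The default fill character is a space.

print("{0:*<10}".format("hello"))  # left-aligned, padded with *
print("{0:*>10}".format("hello"))  # right-aligned, padded with *
print("{0:*^10}".format("hello"))  # centered, padded with *
Output:

hello*****
*****hello
**hello***
"""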
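
As an alternative to the hand-written dictionary loop, the same count-and-sort steps can be sketched with `collections.Counter` from the standard library. This is only one possible variant, not the author's code: the function name `top_words` and the sample sentence are placeholders, and the sample stands in for the contents of hamlet.txt. One difference from the code above: `string.punctuation` also contains the apostrophe, so contractions are split here as well.

```python
from collections import Counter
import string


def top_words(text, n=3):
    """Return the n most frequent words of an English text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in a single pass
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


# Stand-in text for illustration; replace with the contents of hamlet.txt
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample):
    # Same layout as above, written as an f-string
    print(f"{word:<10}{count:>5}")
```

`Counter.most_common` returns the pairs already sorted by count, so the explicit `items.sort(...)` step is not needed.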

Chinese text: analyzing the characters in Romance of the Three Kingdoms (《三国演义》), https://python123.io/resources/pye/threekingdoms.txt

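'''
Segment Chinese text and represent word frequencies with a dictionary
'''
import jieba
# "实验文本.txt" is the author's local file name, presumably a saved copy of threekingdoms.txt
with open("实验文本.txt", "r", encoding="utf-8") as f:
    text = f.read()
words = jieba.lcut(text)  # precise mode; lcut returns a list
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens such as punctuation
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))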

The output shows that the same person is counted separately under each of their names or titles, so the code needs to be revised:

import jieba
# Same local file as in the previous script (the author's own file name)
with open("实验文本.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)  # precise mode
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge different names of the same person into one key
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '关公' or word == '云长':
        rword = "关羽"
    elif word == '玄德' or word == '玄德曰':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError when the word never occurs
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print(word, count)
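
As the list of aliases grows, the if/elif chain above gets long. One possible refactoring, shown here as a sketch rather than the author's method, keeps the aliases in a dictionary and filters excluded words while counting. The names `ALIASES`, `EXCLUDES`, and `count_characters` are made up for this example, and the token list is a hypothetical stand-in for the result of `jieba.lcut(txt)`.

```python
from collections import Counter

# Alias -> canonical name, taken from the if/elif chain above
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names (same set as above)
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_characters(words):
    """Count words, merging aliases and skipping single characters and excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1 or word in EXCLUDES:
            continue
        # Fall back to the word itself when it has no alias entry
        counts[ALIASES.get(word, word)] += 1
    return counts


# Hypothetical pre-segmented tokens standing in for jieba.lcut(txt)
tokens = ["却说", "玄德", "曰", "孔明", "诸葛亮", "丞相", "曹操", "玄德曰", "将军"]
print(count_characters(tokens).most_common(3))
```

Adding another alias then means adding one dictionary entry, and skipping excluded words during counting removes the need for a separate deletion loop.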