Python from Beginner to Obsessed, Day 5: Using the jieba Library

Article link: https://blog.csdn.net/ssh18581030544/article/details/112570150

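Installing jieba

jieba is a third-party Chinese word segmentation library for Python. It does not ship with Python, so you need to install it yourself over the network.
Press Win+R, type cmd, and press Enter to open a command prompt.
Command-line installation: pip install jieba
Installation from within PyCharm:
1. Open PyCharm, click File in the upper-left corner, then click Settings.
2. Click Project: untitled, then click Project Interpreter.
3. Double-click pip in the package list, or click the green "+" button.
4. Type jieba into the search box, then click Install Package at the bottom.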
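
To confirm that the installation worked, one option is to ask Python whether the package can be found. The sketch below uses only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None

if is_installed("jieba"):
    print("jieba is installed")
else:
    print("jieba is missing; run: pip install jieba")
```

If the message says jieba is missing even though pip reported success, pip probably installed into a different Python interpreter than the one running the script.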

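Using jieba

Three ways to import a library:

jieba is used as the example here; other libraries are imported the same way.

 1. from jieba import *  (* imports every public name that jieba provides; you can also import just the names you need, e.g. from jieba import lcut. Functions are then called directly, e.g. lcut())
 2. import jieba  (functions are called through the module name, e.g. jieba.lcut())
 3. import jieba as j  (gives the library a custom short name for convenience, e.g. j.lcut())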
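
The same import forms work for any module. As a small sketch that runs without installing anything, the example below uses the standard-library `math` module in place of jieba. It imports names individually rather than with `*`, since a star import makes it hard to tell where a name came from.

```python
# Style 1: import selected names, then call them directly
from math import sqrt, floor
print(sqrt(16))         # 4.0
print(floor(2.7))       # 2

# Style 2: import the module, then prefix calls with the module name
import math
print(math.sqrt(16))    # 4.0

# Style 3: import the module under a shorter alias
import math as m
print(m.sqrt(16))       # 4.0
```

Styles 2 and 3 refer to the same module object, so `m.sqrt` and `math.sqrt` are the same function.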

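The three segmentation modes of jieba

1. Precise mode: splits the sentence as accurately as possible; suited to text analysis
jieba.cut(s) returns an iterable (a generator), meaning you can loop over it with for
jieba.lcut(s) returns a list; this is the recommended form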

>>> import jieba
>>> s='人生苦短，我学Pyhon'
>>> jieba.cut(s)
<generator object Tokenizer.cut at 0x00000126F4BC3EB0>
>>> jieba.lcut(s)
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\185810~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.804 seconds.
Prefix dict has been built successfully.
['人生', '苦短', '，', '我学', 'Pyhon']
>>> for i in jieba.cut(s):  # iterate over the generator
	print(i)

人生
苦短
，
我学
Pyhon

2. Full mode: scans out every fragment of the sentence that can form a word; it cannot resolve ambiguity
jieba.cut(s, cut_all=True)
jieba.lcut(s, cut_all=True)

>>> jieba.cut(s,cut_all=True)
<generator object Tokenizer.cut at 0x00000126F4BEBD60>
>>> jieba.lcut(s,cut_all=True)
['人生', '苦短', '，', '我', '学', 'Pyhon']
>>> for i in jieba.cut(s,cut_all=True):
	print(i)
	
人生
苦短
，
我
学
Pyhon

3. Search engine mode: builds on precise mode and splits long words further to improve recall
jieba.cut_for_search(s)
jieba.lcut_for_search(s)

>>> jieba.cut_for_search(s)
<generator object Tokenizer.cut_for_search at 0x00000126F4BC3EB0>
>>> jieba.lcut_for_search(s)
['人生', '苦短', '，', '我学', 'Pyhon']
>>>

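4. Adding new words to the segmentation dictionary
jieba.add_word(w) adds the new word w to the dictionary currently in use. The change applies only to the running program and does not modify the dictionary file on disk. During segmentation the new word is kept whole, and the rest of the text is segmented as usual.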

>>> jieba.add_word('我学Pyhon')
>>> jieba.lcut(s)
['人生', '苦短', '，', '我学Pyhon']
>>>

A quick exercise:
Count how many times each word appears in a list

ls = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合", "综合", "综合",
      "师范", "理工", "综合", "理工", "综合", "综合", "综合", "综合", "综合", "理工",
      "理工", "理工", "理工", "师范", "综合", "农林", "理工", "综合", "理工", "理工",
      "理工", "综合", "理工", "综合", "综合", "理工", "农林", "民族", "军事"]  # inside brackets a list can span several lines without backslashes
count = {}  # dictionary that maps each word to its count
for word in ls:  # iterate over the list
    count[word] = count.get(word, 0) + 1  # count.get(word, 0) returns the current count of word, or 0 if word is not in the dictionary yet; adding 1 updates the tally
for i in count:  # iterate over the dictionary
    print("{}:{}".format(i, count[i]))  # i is a key, count[i] is the value stored under that key
Output:
综合:20
理工:13
师范:2
农林:2
民族:1
军事:1
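
The standard library also offers `collections.Counter`, which does the same tallying in one call. This is an alternative to the dictionary loop above, shown on a shortened sample list rather than the full one.

```python
from collections import Counter

words = ["综合", "理工", "综合", "师范", "理工", "综合"]  # shortened sample list
count = Counter(words)   # tallies each distinct element

# most_common() yields (word, count) pairs from most to least frequent
for word, n in count.most_common():
    print("{}:{}".format(word, n))
# 综合:3
# 理工:2
# 师范:1
```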

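What if we are given a whole article and want to count how often each word appears?
An English article does not need jieba for segmentation, because the words are already separated by spaces or other symbols. Let's download an English essay and try it.
Approach:
1. If the text was saved to a file, open the file first.
2. If punctuation should not be counted, use the replace function to substitute spaces for it.
3. Put the cleaned text into a list for processing.
4. Use a dictionary to count how many times each word appears.
5. To rank the words, convert the dictionary into a list first: a dictionary cannot be sorted in place, while a list can. Skip this step if no ranking is needed.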
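
Step 2 can also be done in a single pass with `str.translate` instead of one `replace` call per symbol. The sketch below is an alternative under that assumption, using a made-up sample sentence and the ASCII punctuation set from the standard `string` module.

```python
import string

text = "Hello, world! It's a dream; a big dream."   # made-up sample text

# Build a table that maps every ASCII punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = text.lower().translate(table)

words = cleaned.split()   # split on runs of whitespace
print(words)
# ['hello', 'world', 'it', 's', 'a', 'dream', 'a', 'big', 'dream']
```

Note that the apostrophe is treated as punctuation here, so "it's" becomes two tokens.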

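# Word-frequency count for an English text
def get_txt():  # function that loads and cleans the text
    with open('D:\\日常文件\\英语作文.txt', 'r') as f:  # open and read the file
        txt = f.read()
    txt = txt.lower()  # convert to lowercase (uppercase would work too) so that a word capitalized at the start of a sentence is counted together with its lowercase form
    for ch in '!@#$%^&*,.;\'[]{}:"\\/':  # replace symbols with spaces so they are not counted
        txt = txt.replace(ch, " ")
    return txt
text = get_txt()  # call the function
words = text.split()  # put the words into a list one by one; split() returns a list
count = {}  # dictionary of word -> number of occurrences
for word in words:  # iterate over the word list
    count[word] = count.get(word, 0) + 1  # count.get(word, 0) returns the current count of word, or 0 if word is not in the dictionary yet; adding 1 updates the tally
items = list(count.items())  # convert the key-value pairs into a list of tuples so they can be sorted
print(items)  # print the converted list
items.sort(key=lambda x: x[1], reverse=True)  # reverse is optional: reverse=True sorts in descending order, the default is reverse=False. key is also optional: a function that returns the value to sort by
print(items)  # print the sorted list
for i in range(5):  # take the top 5
    word, counts = items[i]  # unpack the (word, count) tuple
    print("{}:{}".format(word, counts))
Output: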
[('i', 22), ('used', 1), ('to', 9), ('have', 5), ('many', 2), ('dreams', 2), ('as', 1), ('grow', 2), ('up', 2), ('still', 1), ('so', 3), ('every', 1), ('dream', 6), ('makes', 4), ('me', 6), ('change', 3), ('when', 3), ('was', 1), ('six', 1), ('years', 1), ('old', 1), ('dreamed', 1), ('be', 5), ('a', 11), ('student', 3), ('this', 4), ('made', 1), ('lot', 2), ('studied', 1), ('hard', 3), ('and', 8), ('did', 1), ('more', 6), ('homework', 1), ('but', 2), ('became', 1), ('wanted', 1), ('writer', 2), ('because', 3), ('love', 1), ('writing', 1), ('much', 1), ('being', 1), ('means', 2), ('people', 1), ('will', 1), ('read', 1), ('my', 7), ('assages', 1), ('can', 2), ('share', 1), ('ideas', 1), ('with', 5), ('everyone', 1), ('well', 1), ('it', 2), ('also', 1), ('you', 1), ('keep', 1), ('calm', 1), ('help', 1), ('become', 2), ('wiser', 1), ('now', 1), ('am', 2), ('dreaming', 2), ('kid', 4), ('want', 1), ('stay', 2), ('parents', 5), ('often', 1), ('maybe', 1), ('is', 1), ('very', 1), ('strange', 1), ('find', 1), ('lost', 1), ('lots', 1), ('of', 3), ('amazing', 1), ('feelings', 1), ('ofbeing', 1), ('play', 1), ('spend', 1), ('them', 1), ('think', 1), ('must', 1), ('study', 1), ('make', 1), ('not', 1), ('work', 1), ('they', 1), ('done', 1), ('for', 1), ('sister', 1), ('hope', 1), ('again', 1), ('time', 1), ('the', 2), ('know', 1), ('hardships', 1), ('changed', 1)]
[('i', 22), ('a', 11), ('to', 9), ('and', 8), ('my', 7), ('dream', 6), ('me', 6), ('more', 6), ('have', 5), ('be', 5), ('with', 5), ('parents', 5), ('makes', 4), ('this', 4), ('kid', 4), ('so', 3), ('change', 3), ('when', 3), ('student', 3), ('hard', 3), ('because', 3), ('of', 3), ('many', 2), ('dreams', 2), ('grow', 2), ('up', 2), ('lot', 2), ('but', 2), ('writer', 2), ('means', 2), ('can', 2), ('it', 2), ('become', 2), ('am', 2), ('dreaming', 2), ('stay', 2), ('the', 2), ('used', 1), ('as', 1), ('still', 1), ('every', 1), ('was', 1), ('six', 1), ('years', 1), ('old', 1), ('dreamed', 1), ('made', 1), ('studied', 1), ('did', 1), ('homework', 1), ('became', 1), ('wanted', 1), ('love', 1), ('writing', 1), ('much', 1), ('being', 1), ('people', 1), ('will', 1), ('read', 1), ('assages', 1), ('share', 1), ('ideas', 1), ('everyone', 1), ('well', 1), ('also', 1), ('you', 1), ('keep', 1), ('calm', 1), ('help', 1), ('wiser', 1), ('now', 1), ('want', 1), ('often', 1), ('maybe', 1), ('is', 1), ('very', 1), ('strange', 1), ('find', 1), ('lost', 1), ('lots', 1), ('amazing', 1), ('feelings', 1), ('ofbeing', 1), ('play', 1), ('spend', 1), ('them', 1), ('think', 1), ('must', 1), ('study', 1), ('make', 1), ('not', 1), ('work', 1), ('they', 1), ('done', 1), ('for', 1), ('sister', 1), ('hope', 1), ('again', 1), ('time', 1), ('know', 1), ('hardships', 1), ('changed', 1)]
i:22
a:11
to:9
and:8
my:7
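
For comparison, the same pipeline can be written more compactly with `re.findall` and `collections.Counter`. This is an alternative sketch, not the script used above, and it runs on a short in-memory sample instead of the essay file.

```python
import re
from collections import Counter

text = "I used to have many dreams. As I grow up, I still have so many dreams!"

# Treat each run of letters or apostrophes as a word
words = re.findall(r"[a-z']+", text.lower())
count = Counter(words)

# most_common(n) returns at most n pairs, so a short text cannot cause an IndexError
for word, n in count.most_common(3):
    print("{}:{}".format(word, n))
```

Unlike indexing `items[i]` inside `range(5)`, `most_common(3)` simply returns fewer pairs when the text has fewer than three distinct words.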

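Now let's try word-frequency counting on a Chinese document
I downloaded a random news article.
Approach:
1. A Chinese sentence is a continuous run of characters, containing words, idioms and so on with no spaces between them, so jieba is needed to segment it.
2. Open and read the file.
3. Segment the text into single characters or words.
4. Define an empty dictionary to hold each word and its count as key-value pairs.
5. To rank the words, convert the dictionary into a list first: a dictionary cannot be sorted in place, while a list can. Skip this step if no ranking is needed.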
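
Step 5 can be sketched on its own with the built-in `sorted`, which accepts the dictionary's key-value pairs directly and returns a new list. The counts below are sample values chosen for illustration.

```python
count = {"疫情": 6, "国家": 2, "防控": 4}   # sample counts for illustration

# sorted() takes any iterable of (key, value) pairs and returns a new list
ranked = sorted(count.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)          # [('疫情', 6), ('防控', 4), ('国家', 2)]

# A dict built from the sorted pairs keeps that order (Python 3.7+)
ordered = dict(ranked)
print(list(ordered))   # ['疫情', '防控', '国家']
```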

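import jieba
with open("D:\\日常文件\\新闻.txt", 'r', encoding='UTF-8') as f:  # open and read the file
    text = f.read()
words = jieba.lcut(text)  # segment the text into a list of words
count = {}  # dictionary of word -> number of occurrences
for word in words:
    if len(word) == 1:  # skip single-character tokens such as punctuation
        continue
    else:
        count[word] = count.get(word, 0) + 1
items = list(count.items())  # list of (word, count) tuples
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(7):  # take the top 7
    word, counts = items[i]
    print("{:<5}:{}个".format(word, counts))
Output: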
疫情        :6个
防控        :4个
农村        :3个
国家        :2个
卫生        :2个
健康        :2个