nlp-demo01: Basic Usage of jieba (Python)

HJZ11

Published 2020-04-28 23:24:51


Category: NLP (Natural Language Processing)

Article link: https://blog.csdn.net/HJZ11/article/details/105827304


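Extension 1: jieba part-of-speech tags, which follow the ICT (Institute of Computing Technology) Chinese POS tag set
Extension 2: Keyword extraction with the TextRank algorithm
Reference: the paper "TextRank: Bringing Order into Texts"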
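
To make the TextRank idea concrete, here is a minimal sketch in plain Python. It is not jieba's implementation (jieba exposes the algorithm as `jieba.analyse.textrank`); the function name `textrank_keywords`, its parameters, and the pre-segmented token list are all assumptions made for illustration. Words that co-occur within a small window are linked in an undirected graph, and a PageRank-style iteration then scores each word.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=5, damping=0.85, iterations=30, top_k=3):
    """Rank tokens with a TextRank-style score (illustrative sketch only)."""
    # Undirected weighted graph: the edge weight counts how often two
    # distinct tokens appear within `window` positions of each other.
    weights = defaultdict(lambda: defaultdict(float))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if left != right:
                weights[left][right] += 1.0
                weights[right][left] += 1.0

    # PageRank-style iteration: each node shares its score with its
    # neighbours in proportion to the edge weights.
    scores = {node: 1.0 for node in weights}
    for _ in range(iterations):
        new_scores = {}
        for node in weights:
            incoming = sum(
                weights[nb][node] / sum(weights[nb].values()) * scores[nb]
                for nb in weights[node]
            )
            new_scores[node] = (1 - damping) + damping * incoming
        scores = new_scores

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical pre-segmented text; "自然语言" neighbours every other word.
tokens = ["自然语言", "处理", "自然语言", "世界", "自然语言", "分词"]
print(textrank_keywords(tokens, window=2))
```

With `window=2` only adjacent words are linked, so the word that sits next to all the others ends up with the highest score.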
Extension 3: Keyword extraction with the TF-IDF algorithm

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
jieba.analyse.set_idf_path(file_name)
jieba.analyse.set_stop_words(file_name)

Example of a custom IDF (inverse document frequency) file: idf.txt.big
Example of a custom stop-word file: stop_words.txt
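
The scoring behind TF-IDF keyword extraction can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not jieba's code: `tfidf_keywords` is a hypothetical helper, the input is already segmented, the IDF values are made up, and unknown words are assumed to fall back to the median IDF. The `top_k` and `with_weight` parameters only mirror the roles of `topK` and `withWeight` in the signature above.

```python
from collections import Counter
from statistics import median


def tfidf_keywords(tokens, idf, stop_words=frozenset(), top_k=20, with_weight=False):
    """Score pre-segmented tokens by term frequency times IDF (sketch)."""
    kept = [t for t in tokens if t not in stop_words]
    if not kept:
        return []
    counts = Counter(kept)
    total = sum(counts.values())
    # Assumption: words missing from the IDF table get the median IDF.
    fallback_idf = median(idf.values())
    scores = {
        word: count / total * idf.get(word, fallback_idf)
        for word, count in counts.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranked if with_weight else [word for word, _ in ranked]


# Hypothetical IDF table and stop-word set, for illustration only.
toy_idf = {"人工智能": 8.0, "发展": 3.0, "未来": 5.0, "的": 0.5}
tokens = ["人工智能", "的", "发展", "人工智能", "未来"]
print(tfidf_keywords(tokens, toy_idf, stop_words={"的"}, top_k=2))
# -> ['人工智能', '未来']
```

A custom IDF file and a custom stop-word file play the roles of `toy_idf` and `stop_words` here: the first changes how rare each word is considered to be, the second removes words before scoring.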

# -*- coding: utf-8 -*-
# Import the jieba module. Install it with:
#   pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple/   (Tsinghua mirror)
import jieba

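# Basic usage
word_list = jieba.cut("欢迎来到自然语言的世界！")
print("【基本应用】：{}".format("/".join(word_list)))
# Full mode (cut_all=True): emit every dictionary word found in the sentence
word_list = jieba.cut("我来到湖南国防科技大学", cut_all=True)
print("【全模式】：{}".format("/".join(word_list)))
# Accurate mode (the default): a single, non-overlapping segmentation
word_list = jieba.cut("我来到湖南国防科技大学")
print("【精确模式】：{}".format("/".join(word_list)))
# Search-engine mode: accurate mode plus the shorter words inside long words
word_list = jieba.cut_for_search("我来到湖南国防科技大学")
print("【搜索引擎模式】：{}".format("/".join(word_list)))
# Dictionary only (HMM=False): no new-word discovery
word_list = jieba.cut("我在台电大厦上班", HMM=False)
print("【仅词典模式】：{}".format("/".join(word_list)))
# HMM enabled (the default): words missing from the dictionary can be recognized
word_list = jieba.cut("我在台电大厦上班")
print("【HMM新词发现模式】：{}".format("/".join(word_list)))
print('-' * 50)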
# API differences
print("API 区别 cut-lcut")
# jieba.cut returns a generator, which can be consumed only once
# jieba.lcut returns a list

cut_word_list = jieba.cut("我来到湖南国防科技大学")
print("【cut API 返回的数据类型】：{}".format(type(cut_word_list)))
print("【cut API 返回结果】：{}".format('/'.join(cut_word_list)))
print("【cut API 返回结构【再次获取】：{}".format('/'.join(cut_word_list)))
print("")
lcut_word_list = jieba.lcut("我来到湖南国防科技大学")
print("【lcut API 返回的数据类型】：{}".format(type(lcut_word_list)))
print("【lcut API 返回结果】：{}".format('/'.join(lcut_word_list)))
print("【lcut API 返回结构【再次获取】：{}".format('/'.join(lcut_word_list)))

print("*"*50)
# Part-of-speech tagging
import jieba.posseg as pseg
sentence = "我觉得人工智能未来的发展非常不错"
# Segmentation plus POS tagging
words = pseg.cut(sentence)
print("%8s\t%8s" % ("【单词】", "【词性】"))
for word, flag in words:
    print("%8s\t%8s" % (word, flag))

print("#"*50)
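sentence = "我的希望是希望张晚霞的背影被晚霞映红"
words = pseg.cut(sentence)
print("%8s\t%8s" % ("【单词】", "【词性】"))
for word, flag in words:
    print("%8s\t%8s" % (word, flag))
print("用jieba.add_word这个API更改单词和词性后结果")
# add_word registers a word and, optionally, its tag. The tag string is used
# as given, so "人名" (person name) is printed verbatim in the result;
# the conventional tag for a person name is "nr".
jieba.add_word("张晚霞", tag="人名")
jieba.add_word('希望', tag="vn")
words = pseg.cut(sentence)
print("%8s\t%8s" % ("【单词】", "【词性】"))
for word, flag in words:
    print("%8s\t%8s" % (word, flag))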
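
The difference between full mode and accurate mode at the top of the script is easier to see with a toy dictionary. The sketch below is a simplified illustration, not jieba's source: `TOY_DICT` and its frequencies are invented, and the helper names `build_dag`, `cut_all` and `cut_accurate` are hypothetical. Full mode lists the dictionary words found in the sentence, while accurate mode picks the single path with the highest total log-probability. No HMM step is included, so this corresponds roughly to the dictionary-only behaviour.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real counts).
TOY_DICT = {
    "我": 50, "来到": 30, "湖南": 20, "南国": 2, "国防": 15,
    "国防科": 1, "国防科技": 3, "国防科技大学": 5, "科技": 25, "大学": 25,
}


def build_dag(sentence, dictionary):
    """Map each start index to the inclusive end indices of dictionary words."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in dictionary]
        dag[start] = ends or [start]  # an unknown character stands alone
    return dag


def cut_all(sentence, dictionary):
    """Full mode: list the dictionary words found, overlaps included."""
    dag = build_dag(sentence, dictionary)
    words, covered_to = [], -1
    for start in range(len(sentence)):
        ends = dag[start]
        if len(ends) == 1 and start > covered_to:
            # Only one candidate and not yet covered by an earlier word.
            words.append(sentence[start:ends[0] + 1])
            covered_to = ends[0]
        else:
            for end in ends:
                if end > start:  # skip bare single characters here
                    words.append(sentence[start:end + 1])
                    covered_to = end
    return words


def cut_accurate(sentence, dictionary):
    """Accurate mode: the path through the DAG with the best log-probability."""
    dag = build_dag(sentence, dictionary)
    log_total = math.log(sum(dictionary.values()))
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(dictionary.get(sentence[start:end + 1], 1)) - log_total
             + best[end + 1][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = best[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


sentence = "我来到湖南国防科技大学"
print("/".join(cut_all(sentence, TOY_DICT)))
print("/".join(cut_accurate(sentence, TOY_DICT)))
```

The dynamic programming runs from the end of the sentence backwards, so each position only needs the best score of the suffix that follows a candidate word.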

C:\Anaconda3\python.exe D:/AI/07-NLP/[20200418]_NLP基础【三】、Seq2Seq【一】/05_随堂代码/natural_language/chinese_word_segmentation/01-jieba-baseUse-hjz.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\hujiou\AppData\Local\Temp\jieba.cache
Loading model cost 0.869 seconds.
Prefix dict has been built successfully.
【基本应用】：欢迎/来到/自然语言/的/世界/！
【全模式】：我/来到/湖南/南国/国防/国防科/国防科技/国防科技大学/科技/大学
【精确模式】：我/来到/湖南/国防科技大学
【搜索引擎模式】：我/来到/湖南/国防/科技/大学/国防科/国防科技大学
【仅词典模式】：我/在/台/电/大厦/上班
【HMM新词发现模式】：我/在/台电/大厦/上班
--------------------------------------------------
API 区别 cut-lcut
【cut API 返回的数据类型】：<class 'generator'>
【cut API 返回结果】：我/来到/湖南/国防科技大学
【cut API 返回结构【再次获取】：

【lcut API 返回的数据类型】：<class 'list'>
【lcut API 返回结果】：我/来到/湖南/国防科技大学
【lcut API 返回结构【再次获取】：我/来到/湖南/国防科技大学
**************************************************
    【单词】	    【词性】
       我	       r
      觉得	       v
    人工智能	       n
      未来	       t
       的	      uj
      发展	      vn
      非常	       d
      不错	       a
##################################################
    【单词】	    【词性】
       我	       r
       的	      uj
      希望	       v
       是	       v
      希望	       v
       张	       q
      晚霞	       n
       的	      uj
      背影	       n
       被	       p
      晚霞	       n
      映红	      nr
用jieba.add_word这个API更改单词和词性后结果
    【单词】	    【词性】
       我	       r
       的	      uj
      希望	      vn
       是	       v
      希望	      vn
     张晚霞	      人名
       的	      uj
      背影	       n
       被	       p
      晚霞	       n
      映红	      nr

Process finished with exit code 0
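
The empty second line in the cut output above comes from generator semantics rather than from jieba itself. The sketch below reproduces the effect with a stand-in generator; `toy_cut` is a hypothetical function used only for illustration and does no real segmentation.

```python
def toy_cut(words):
    """Stand-in for jieba.cut: yields tokens lazily, one at a time."""
    for word in words:
        yield word


tokens = ["我", "来到", "湖南", "国防科技大学"]

lazy = toy_cut(tokens)
first_pass = "/".join(lazy)    # consumes the generator
second_pass = "/".join(lazy)   # nothing left to yield, so this is empty

# Materializing the tokens into a list lets them be reused,
# which is what working with jieba.lcut's list result gives you.
materialized = list(toy_cut(tokens))

print(repr(first_pass))
print(repr(second_pass))
print("/".join(materialized) == "/".join(materialized))
```

If the segmentation result is needed more than once, calling `jieba.lcut` or wrapping `jieba.cut(...)` in `list(...)` avoids the empty second pass.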
