Keyword extraction and sentiment analysis in pandas: jieba, jiagu, snownlp, and related methods

dair6

Article link: https://blog.csdn.net/dair6/article/details/121427216

1. Using jieba for word segmentation

(1) Installation

pip install jieba

(2) jieba.cut: segment text into words

jieba.cut returns an iterable generator, so it can be used directly in a for loop.

    import jieba

    sentence = '维生素含叶酸'
    for word in jieba.cut(sentence):
        print(word)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.412 seconds.
Prefix dict has been built successfully.
维生素
含
叶酸

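In general, the most commonly used segmentation mode is accurate mode, which tries to cut the sentence in the most precise way. Accurate mode is the default, so jieba.cut(sentence, cut_all=False) is equivalent to jieba.cut(sentence).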

(3) jieba.lcut: segment text into a list of words

jieba.lcut returns a list.

    import jieba

    sentence = '维生素含叶酸'
    print(jieba.lcut(sentence))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.355 seconds.
Prefix dict has been built successfully.
['维生素', '含', '叶酸']

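(4) jieba.load_userdict(): load a user-defined dictionary

load_userdict() loads a user-defined dictionary file.

Each line of the dictionary has the following format:

玻璃杯子 frequency (an integer, optional) POS tag (noun, adverb, and so on; optional)

Create a file a.txt with the following content:

玻璃杯子

Without the custom dictionary a.txt:

    import jieba

    sentence = '玻璃杯子'
    for word in jieba.cut(sentence):
        print(word)

Output: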

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.411 seconds.
Prefix dict has been built successfully.
玻璃杯
子

With the custom dictionary a.txt:

    import jieba

    sentence = '玻璃杯子'
    jieba.load_userdict('a.txt')
    for word in jieba.cut(sentence):
        print(word)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.332 seconds.
Prefix dict has been built successfully.
玻璃杯子
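The dictionary file can also be generated from code. The sketch below writes the three accepted line shapes (word only, word plus frequency, word plus frequency plus POS tag) to a UTF-8 file. The helper userdict_line, the frequencies, and the temporary path are illustrative assumptions; only the words themselves come from this post.

```python
import os
import tempfile


def userdict_line(word, freq=None, tag=None):
    """Build one line: the word, then an optional frequency, then an optional POS tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(int(freq)))
    if tag is not None:
        parts.append(tag)
    return ' '.join(parts)


# Hypothetical entries: the frequencies and the tag are made up for illustration.
entries = [
    ('玻璃杯子', None, None),  # word only
    ('维生素', 10, None),      # word + frequency
    ('叶酸', 10, 'n'),         # word + frequency + POS tag
]
lines = [userdict_line(*entry) for entry in entries]

# jieba expects the dictionary file to be UTF-8 encoded.
path = os.path.join(tempfile.mkdtemp(), 'a.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')

print(lines)
```

The resulting path can then be passed to jieba.load_userdict(path).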

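(5) jieba.analyse.set_stop_words(): set a stop word list

This function loads a stop word list. Words in the list are never returned as keywords by the keyword extraction in step (6).

The stop word file has one word per line:

玻璃
狗狗
详情

It is called as follows:

    import jieba.analyse
    jieba.analyse.set_stop_words('data/dictionary/停用词表.txt')

(6) jieba.analyse.extract_tags and jieba.analyse.textrank: extract keywords

    import jieba.analyse

    sentence = '玻璃杯子维生素叶酸'
    # extract_tags ranks keywords by TF-IDF weight
    for a, b in jieba.analyse.extract_tags(sentence, topK=2, withWeight=True, allowPOS=['a', 'n']):
        print(a, b)
    # textrank ranks keywords with the TextRank algorithm
    for a, b in jieba.analyse.textrank(sentence, topK=2, withWeight=True, allowPOS=['a', 'n']):
        print(a, b)

Output: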

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.407 seconds.
Prefix dict has been built successfully.
玻璃杯 5.2174708746
叶酸 4.94667223337
玻璃杯 1.0
叶酸 0.9961264494011037

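Parameters

topK: after the keywords are ranked by importance, the top topK are kept as the final keywords. The default is 20.

withWeight: whether to return each keyword's weight. The default is False, which returns no weights.

allowPOS: restricts the part-of-speech tags a keyword may have, for example only nouns and adjectives, or only person names. This is quite useful.

Note also that for jieba.analyse.textrank, allowPOS defaults to ('ns', 'n', 'vn', 'v').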

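The jieba part-of-speech tag table is as follows:

a adjective

c conjunction

d adverb

e interjection

f locative word

h prefix

k suffix

m numeral

mq numeral-classifier compound

n noun

nr person name

nr1 Chinese surname

nr2 Chinese given name

nrj Japanese person name

nrf transliterated person name

ns place name

nsf transliterated place name

nt organization name

nz other proper noun

nl nominal idiom

ng nominal morpheme

r pronoun

rr personal pronoun

rz demonstrative pronoun

rzt temporal demonstrative pronoun

rzs locative demonstrative pronoun

s place word

t time word

tg temporal morpheme

v verb

vd adverbial verb

vn nominal verb

vshi the verb 是 ("to be")

vyou the verb 有 ("to have")

vi intransitive verb

vl verbal idiom

w punctuation

wkz left bracket, full-width: （〔［｛《【〖〈 half-width: ( [ { <

wky right bracket, full-width: ）〕］｝》】〗〉 half-width: ) ] } >

wyz left quotation mark, full-width: “ ‘ 『

wyy right quotation mark, full-width: ” ’ 』

wj full stop, full-width: 。

ww question mark, full-width: ？ half-width: ?

wt exclamation mark, full-width: ！ half-width: !

wd comma, full-width: ， half-width: ,

wf semicolon, full-width: ； half-width: ;

wn enumeration comma, full-width: 、

wm colon, full-width: ： half-width: :

ws ellipsis, full-width: …… …

wp dash, full-width: —— －－ ——－ half-width: — ----

wb percent and per-mille signs, full-width: ％ ‰ half-width: %

wh unit symbols, full-width: ￥＄￡ ° ℃ half-width: $

x string

xu URL

z status word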

2. jiagu

(1) Installation

pip install -U jiagu  # -U upgrades jiagu to the latest version

(2) jiagu.seg: segment text into words

    import jiagu
    sentence = '玻璃杯子维生素叶酸'
    print(jiagu.seg(sentence))

Output:

['玻璃', '杯子', '维生素', '叶酸']

(3) jiagu.load_userdict: load a user-defined dictionary

The custom dictionary a.txt contains:

玻璃杯子

The code is:

    import jiagu
    sentence = '玻璃杯子维生素叶酸'
    jiagu.load_userdict('a.txt')
    print(jiagu.seg(sentence))

Output:

['玻璃杯子', '维生素', '叶酸']

(4) jiagu.keywords: extract keywords

    import jiagu
    sentence = '玻璃杯子维生素叶酸'
    print(jiagu.keywords(sentence, 3))

Here the number 3 is the number of keywords to extract.

Output:

['维生素', '杯子', '叶酸']

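(5) jiagu.summarize: text summarization

jiagu.summarize(sentence, 1)  # 1 is the number of summary sentences to return

(6) jiagu.findword: new word discovery

jiagu.findword('file1', 'file2')  # file1: input text file; file2: output file for the discovered words

(7) Sentiment analysis

    import jiagu
    sentence = '玻璃杯子维生素叶酸'
    for word in jiagu.seg(sentence):
        print(word, jiagu.sentiment(word))

Output: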

玻璃 ('negative', 0.7499999999999999)
杯子 ('positive', 0.5)
维生素 ('positive', 0.5)
叶酸 ('positive', 0.5)

Reference: https://github.com/ownthink/Jiagu

3. snownlp

(1) Installation

pip install snownlp

(2) Chinese sentence splitting and sentiment analysis

    from snownlp import SnowNLP

    sentence = '玻璃杯子维生素叶酸,卧室，省电费，短发'
    for word in SnowNLP(sentence).sentences:
        print(word, SnowNLP(word).sentiments)

Output:

玻璃杯子维生素叶酸,卧室 0.06258896727225804
省电费 0.5467035561854678
短发 0.5

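As the output shows, the part of the original text separated by the ASCII comma ',' was not split: '卧室' should be on its own line instead of being attached to the first line.

The floating-point number on the right is the sentiment score, that is, the probability between 0 and 1 that the text is positive.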
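One way to work around the ASCII comma problem is to split the text yourself before scoring it. The sketch below uses only the standard library. The helper split_clauses and its set of delimiters are assumptions chosen for this example, not part of snownlp.

```python
import re

# Split on both full-width and ASCII commas, semicolons, and sentence-ending marks.
_DELIMS = re.compile(r'[,，;；。!！?？\n]+')


def split_clauses(text):
    """Return the non-empty pieces of text found between punctuation marks."""
    return [piece.strip() for piece in _DELIMS.split(text) if piece.strip()]


sentence = '玻璃杯子维生素叶酸,卧室，省电费，短发'
print(split_clauses(sentence))
# ['玻璃杯子维生素叶酸', '卧室', '省电费', '短发']
```

Each piece can then be scored separately with SnowNLP(piece).sentiments.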
