Python Crawlers: jieba

写一篇多根头发

Tags: python

Article link: https://blog.csdn.net/qq_43685335/article/details/108753968

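Three segmentation modes:

Accurate mode: tries to cut the sentence into the most precise segments
jieba.cut('string')
Full mode: scans out every word that could possibly be formed from the sentence
jieba.cut('string', cut_all=True)
Search engine mode: builds on accurate mode and splits long words again, which suits search engine indexing
jieba.cut_for_search('string')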

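Basic usage of jieba segmentation:

res = jieba.cut('string') returns a generator
Print the tokens with / as the separator
print('/'.join(res))
Convert to a list
list(res) or list(word for word in res)
jieba.lcut('string') returns a list directly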

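Loading a custom dictionary:

Static loading
Dictionary requirements: a UTF-8 encoded file with one entry per line; the first column is the word, the second the frequency (optional), and the third the part-of-speech tag (optional), separated by spaces
jieba.load_userdict(filePath)
Dynamic loading:
jieba.add_word(word, freq=None, tag=None)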

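Stop words:

Removing stop words after segmentation
Put the stop words in a list, then loop over the segmented tokens and drop every token that appears in that list.
Removing stop words before segmentation (applies to keyword extraction with jieba.analyse)
import jieba.analyse as ana
File format: UTF-8 encoded, one stop word per line
ana.set_stop_words(filePath)
topK: the number of keywords to return, ranked by TF-IDF weight
res = ana.extract_tags('string', topK=20)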

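Part-of-speech tagging

import jieba.posseg as psg
Use psg to segment a string; the usage mirrors jieba itself

res = psg.cut('string')
for pair in res:
    print(pair.word, pair.flag)
res = psg.lcut('string')
The word of the second pair is res[1].word
Its part-of-speech tag is res[1].flag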

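Using jieba with NLTK

NLTK's tokenizers generally rely on whitespace to separate tokens, which Chinese text does not provide. The usual approach is therefore to segment with jieba first, join the tokens into one continuous space-separated text, and then pass that text to NLTK.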
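
The conversion step can be sketched in plain Python. The token list below is copied from the output shown later in this article rather than computed here, the helper name `to_space_delimited` is made up, and `str.split()` stands in for a whitespace-based tokenizer; NLTK itself is not imported in this sketch.

```python
def to_space_delimited(tokens):
    """Join segmented tokens into one space-separated string.

    Tokens that are empty or whitespace-only are skipped so that the
    result splits back into exactly the remaining tokens.
    """
    return " ".join(tok for tok in tokens if tok.strip())


# Tokens as produced by jieba.lcut in this article's example output.
tokens = ["郭靖", "和", "黄蓉", "和", "牢山", "三十六", "剑"]

spaced = to_space_delimited(tokens)
print(spaced)

# A whitespace-based tokenizer now recovers the jieba tokens.
recovered = spaced.split()
print(recovered == tokens)
```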

import jieba
import jieba.analyse as ana
import jieba.posseg as psg

if __name__ == '__main__':
    # Named text rather than str to avoid shadowing the built-in type
    text = '郭靖和黄蓉和牢山三十六剑'

    # Accurate mode
    res = jieba.cut(text)
    print(res)
    # <generator object Tokenizer.cut at 0x0000023C3533DE40>
    print('/'.join(res))
    # 郭靖/和/黄蓉/和/牢山/三十六/剑

    # Convert to a list (a generator can only be consumed once,
    # so cut() is called again before each conversion)
    res = jieba.cut(text)
    li = list(word for word in res)
    res = jieba.cut(text)
    li_1 = list(res)
    print(li, li_1)
    # ['郭靖', '和', '黄蓉', '和', '牢山', '三十六', '剑'] ['郭靖', '和', '黄蓉', '和', '牢山', '三十六', '剑']

    # Return a list directly
    res = jieba.lcut(text)
    print(res)
    # ['郭靖', '和', '黄蓉', '和', '牢山', '三十六', '剑']

    # Full mode
    res = jieba.cut(text, cut_all=True)
    print(' '.join(res))
    # 郭 靖 和 黄 蓉 和 牢 山 三十 三十六 十六 剑

    # Search engine mode
    res = jieba.cut_for_search(text)
    print('/'.join(res))
    # 郭靖/和/黄蓉/和/牢山/三十/十六/三十六/剑

    # Modify the dictionary
    # Add a word
    jieba.add_word(word='牢山三十六剑', freq=10, tag='n')

    res = jieba.lcut(text)
    print(res)
    # ['郭靖', '和', '黄蓉', '和', '牢山三十六剑']

    # Delete the word
    jieba.del_word('牢山三十六剑')
    res = jieba.lcut(text)
    print(res)
    # ['郭靖', '和', '黄蓉', '和', '牢山', '三十六', '剑']

    # Load a user dictionary from a file
    filePath = './dict.txt'
    jieba.load_userdict(filePath)
    res = jieba.lcut(text)
    print(res)
    # ['郭靖', '和', '黄蓉', '和', '牢山三十六剑']

    # Segment first, then remove the tokens found in the stop-word list
    filePath = './stop.txt'
    with open(filePath, 'r', encoding='utf-8') as fp:
        stop = set(fp.read().splitlines())

    res = [w for w in res if w not in stop]
    print(res)
    # ['郭靖', '黄蓉', '牢山三十六剑']

    # Register the stop words before extracting keywords
    ana.set_stop_words(filePath)
    res = ana.extract_tags(text, topK=20)
    print(res)
    # ['郭靖', '黄蓉', '牢山三十六剑']

    # Part-of-speech tagging
    res = psg.cut(text)
    print(res)
    # <generator object cut at 0x00000241E2C8FCF0>

    # Loop variable named pair rather than re, which is also a standard module name
    for pair in res:
        print(pair.word, pair.flag)
    # Printed one pair per line: 郭靖 nr, 和 c, 黄蓉 nr, 和 c, 牢山三十六剑 n

    res = psg.lcut(text)

    print(res)
    # [pair('郭靖', 'nr'), pair('和', 'c'), pair('黄蓉', 'nr'), pair('和', 'c'), pair('牢山三十六剑', 'n')]
    print(res[1].word)
    # 和
