2021SC@SDUSC
1. Segmenting text with the jieba.posseg module
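from __future__ import print_function
import sys
sys.path.append("../")  # lets the script find a local jieba checkout one directory up
import jieba.posseg as pseg

def cuttest(test_sent):
    # pseg.cut yields (word, flag) pairs; flag is the part-of-speech tag
    result = pseg.cut(test_sent)
    for word, flag in result:
        print(word, "/", flag, ", ", end=' ')
    print("")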
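Test output:
As you can see, every token is labeled with its part of speech. Readers who are not familiar with the part-of-speech tags can look them up in a tag reference or in the official jieba documentation.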
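To make the tags easier to read, the sketch below maps a few common jieba flags to plain-English names. The table TAG_NAMES and the helpers describe and demo are hypothetical names introduced here for illustration, and the table covers only a handful of tags, not the full tag set.

```python
# Partial lookup table for common jieba POS flags (illustrative, not the full tag set).
TAG_NAMES = {
    "n": "noun",
    "nr": "person name",
    "ns": "place name",
    "nt": "organization name",
    "v": "verb",
    "a": "adjective",
    "d": "adverb",
    "r": "pronoun",
    "m": "numeral",
    "p": "preposition",
    "c": "conjunction",
    "x": "punctuation or unknown",
}


def describe(pairs):
    """Return 'word/flag (name)' strings for an iterable of (word, flag) pairs."""
    return [
        "%s/%s (%s)" % (word, flag, TAG_NAMES.get(flag, "other"))
        for word, flag in pairs
    ]


def demo(sentence):
    """Tag a sentence with jieba and describe each token (requires jieba)."""
    import jieba.posseg as pseg  # imported lazily so describe() works without jieba
    return describe(pseg.cut(sentence))
```

Calling demo on a sentence of your choice returns one annotated string per token. Flags missing from the table are reported as "other".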
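2. Changing the tokenizer
jieba.posseg.POSTokenizer(tokenizer=None)
Creates a new custom POS tokenizer. The tokenizer argument specifies the jieba.Tokenizer instance used internally for segmentation.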
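import jieba
import jieba.posseg as psg

# Wrap the default global tokenizer (jieba.dt) in a POS tokenizer
dt = psg.POSTokenizer(tokenizer=jieba.dt)
words = dt.cut("你真好,你真棒")
for word in words:
    # each item is a pair object exposing .word and .flag
    print(word.word, word.flag)
print(type(dt))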
Output:
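One reason to pass your own tokenizer is to keep custom vocabulary separate from the global jieba.dt. The following is a minimal sketch under that assumption: build_pos_tokenizer and to_tuples are hypothetical helper names, and the (word, tag) input format is an illustrative choice, not something the library prescribes.

```python
def build_pos_tokenizer(user_words):
    """Return a POSTokenizer backed by its own jieba.Tokenizer.

    user_words: iterable of (word, tag) tuples (hypothetical input format).
    Requires jieba to be installed.
    """
    import jieba
    import jieba.posseg as psg

    tok = jieba.Tokenizer()  # independent of the global jieba.dt
    for word, tag in user_words:
        tok.add_word(word, tag=tag)  # register the word together with its POS tag
    return psg.POSTokenizer(tokenizer=tok)


def to_tuples(pairs):
    """Convert an iterable of pair-like items into plain (word, flag) tuples."""
    return [tuple(p) for p in pairs]
```

For example, to_tuples(build_pos_tokenizer([("你真棒", "l")]).cut("你真好,你真棒")) would tag the sentence with the extra word registered only on the private tokenizer. The word and tag in this call are made up for illustration.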
3. Paddle mode also supports POS tagging
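from __future__ import print_function
import sys
sys.path.append("../")  # lets the script find a local jieba checkout one directory up
import jieba
import jieba.posseg as pseg

# Paddle mode has to be enabled first (jieba 0.40 or later, with PaddlePaddle installed);
# without this call, use_paddle=True has no effect and the default mode is used
jieba.enable_paddle()

def cuttest(test_sent):
    result = pseg.cut(test_sent, use_paddle=True)
    for word, flag in result:
        print(word, "/", flag, ", ", end=' ')
    print("")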
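In paddle mode the flags include four uppercase named-entity categories (PER, LOC, ORG, TIME) alongside the lowercase POS tags. The sketch below filters those entities out of a tagged result. ENTITY_TAGS, extract_entities and demo_paddle are hypothetical names used only for this illustration.

```python
# Named-entity categories emitted in paddle mode (uppercase flags).
ENTITY_TAGS = {"PER", "LOC", "ORG", "TIME"}


def extract_entities(pairs):
    """Keep only the (word, flag) pairs whose flag is a named-entity category."""
    return [(word, flag) for word, flag in pairs if flag in ENTITY_TAGS]


def demo_paddle(sentence):
    """Tag a sentence in paddle mode and return its named entities.

    Requires jieba 0.40 or later with PaddlePaddle installed.
    """
    import jieba
    import jieba.posseg as pseg

    jieba.enable_paddle()  # must be called before use_paddle=True takes effect
    return extract_entities(pseg.cut(sentence, use_paddle=True))
```

Because extract_entities accepts any iterable of (word, flag) pairs, it can be tried on hand-written tuples without installing PaddlePaddle.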