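2021SC@SDUSC The main segmentation functions provided by jieba are as follows:
- jieba.cut: this method takes three main input parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; and the HMM parameter, which controls whether the HMM model is used. (The signature quoted further down also carries a use_paddle flag.)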
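'''
Main jieba segmentation function; returns a generator.
Parameters:
- sentence: the text to segment.
- cut_all: segmentation mode. True for full mode, False for accurate mode.
- HMM: whether to use the Hidden Markov Model.
'''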
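As this shows, jieba.cut returns an iterable generator, so a for loop can be used to obtain each word produced by the segmentation (alternatively, jieba.lcut returns the result directly as a list).
cut_all=True, HMM=_ (any value) corresponds to full mode, in which every word that appears in the dictionary is cut out; the implementing function is __cut_all.
cut_all=False, HMM=False corresponds to accurate mode without the HMM; the segmentation with the highest joint probability is selected under a unigram language model, and the implementing function is __cut_DAG_NO_HMM.
cut_all=False, HMM=True corresponds to accurate mode with the HMM; on top of the maximum joint probability segmentation, the HMM is used to recognize out-of-vocabulary words, and the implementing function is __cut_DAG.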
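To make the difference between full mode and accurate mode concrete, here is a minimal sketch on a toy dictionary. It is not jieba's actual __cut_all or __cut_DAG_NO_HMM: the names TOY_FREQ, toy_full_mode and toy_accurate_mode are hypothetical and the frequencies are made up for illustration. It only shows the two ideas described above: enumerating every dictionary word, versus picking the split with the largest product of unigram probabilities.

```python
import math

# Toy word frequencies (made-up numbers, NOT jieba's real dictionary).
TOY_FREQ = {"清华": 50, "华大": 5, "大学": 80, "清华大学": 40,
            "清": 3, "华": 3, "大": 10, "学": 10}
TOTAL = sum(TOY_FREQ.values())


def toy_full_mode(sentence):
    """Yield every dictionary word of length >= 2 found in the sentence."""
    n = len(sentence)
    for i in range(n):
        for j in range(i + 2, n + 1):
            if sentence[i:j] in TOY_FREQ:
                yield sentence[i:j]


def toy_accurate_mode(sentence):
    """Return the split with the largest product of unigram probabilities."""
    n = len(sentence)
    # best[i] = (best log-probability for sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Single characters are always allowed, with a frequency of at least 1.
            if word in TOY_FREQ or j == i + 1:
                logp = math.log(TOY_FREQ.get(word, 1)) - math.log(TOTAL)
                candidates.append((logp + best[j][0], j))
        best[i] = max(candidates)
    # Follow the stored end indices from the start of the sentence.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(list(toy_full_mode("清华大学")))   # ['清华', '清华大学', '华大', '大学']
print(toy_accurate_mode("清华大学"))     # ['清华大学']
```

With these toy frequencies, full mode reports all four overlapping dictionary words, while accurate mode keeps only the single most probable split.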
    def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
        """
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.
        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        """
        is_paddle_installed = check_paddle_install['is_paddle_installed']
        sentence = strdecode(sentence)
        if use_paddle and is_paddle_installed:
            # if sentence is null, it will raise core exception in paddle.
            if sentence is None or len(sentence) == 0:
                return
            import jieba.lac_small.predict as predict
            results = predict.get_sent(sentence)
            for sent in results:
                if sent is None:
                    continue
                yield sent
            return
        re_han = re_han_default
        re_skip = re_skip_default
        if cut_all:
            cut_block = self.__cut_all
        elif HMM:
            cut_block = self.__cut_DAG
        else:
            cut_block = self.__cut_DAG_NO_HMM
        blocks = re_han.split(sentence)
        for blk in blocks:
            if not blk:
                continue
            if re_han.match(blk):
                for word in cut_block(blk):
                    yield word
            else:
                tmp = re_skip.split(blk)
                for x in tmp:
                    if re_skip.match(x):
                        yield x
                    elif not cut_all:
                        for xx in x:
                            yield xx
                    else:
                        yield x
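The second half of cut splits the sentence into blocks before any dictionary lookup happens. The following standalone sketch reproduces just that control flow. The two regular expressions are assumed to approximate jieba's re_han_default and re_skip_default and should be checked against the installed version; split_blocks and whole_block are hypothetical names, and whole_block is only a stand-in for the real cut_block function.

```python
import re

# Assumed approximations of jieba's default patterns (verify against your version).
RE_HAN = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)")
RE_SKIP = re.compile(r"(\r\n|\s)")


def whole_block(blk):
    """Stand-in for cut_block: returns the block unsegmented."""
    yield blk


def split_blocks(sentence, cut_block=whole_block, cut_all=False):
    # The capturing group makes split() keep the matched blocks in the result.
    for blk in RE_HAN.split(sentence):
        if not blk:
            continue
        if RE_HAN.match(blk):
            # Han / alphanumeric block: hand it to the segmentation function.
            yield from cut_block(blk)
        else:
            # Everything else (punctuation, whitespace) is handled here.
            for x in RE_SKIP.split(blk):
                if RE_SKIP.match(x):
                    yield x            # whitespace is emitted unchanged
                elif not cut_all:
                    yield from x       # accurate mode: one character at a time
                else:
                    yield x            # full mode: the run is emitted as a whole


print(list(split_blocks("你好, world!")))
# ['你好', ',', ' ', 'world', '!']
```

Only the blocks matched by the first pattern ever reach the segmentation function; punctuation and whitespace bypass the dictionary entirely.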
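- jieba.cut_for_search: this method takes two parameters: the string to be segmented, and whether to use the HMM model. It is intended for segmenting text when a search engine builds an inverted index, so its granularity is relatively fine.
The code below shows how this works: each word is first obtained from cut in accurate mode; for a word longer than 2 characters, a sliding window extracts every two-character sub-word that is present in the prefix dictionary; for a word longer than 3 characters, the same is done for three-character sub-words. The sub-words are emitted before the word itself.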
    def cut_for_search(self, sentence, HMM=True):
        """
        Finer segmentation for search engines.
        """
        words = self.cut(sentence, HMM=HMM)
        for w in words:
            if len(w) > 2:
                # xrange comes from jieba._compat (an alias of range on Python 3)
                for i in xrange(len(w) - 1):
                    gram2 = w[i:i + 2]
                    if self.FREQ.get(gram2):
                        yield gram2
            if len(w) > 3:
                for i in xrange(len(w) - 2):
                    gram3 = w[i:i + 3]
                    if self.FREQ.get(gram3):
                        yield gram3
            yield w
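The sliding-window step can be tried in isolation. The sketch below applies the same loop to an already segmented word list, using a small made-up frequency table in place of self.FREQ. The names search_grams and TOY_FREQ are hypothetical, and the input words are supplied by hand rather than produced by jieba.

```python
# Toy stand-in for self.FREQ (made-up counts, not jieba's dictionary).
TOY_FREQ = {"中国": 100, "科学": 80, "学院": 60, "科学院": 40, "中国科学院": 20}


def search_grams(words, freq):
    """Emit dictionary bigrams and trigrams of each long word, then the word."""
    for w in words:
        if len(w) > 2:
            for i in range(len(w) - 1):
                gram2 = w[i:i + 2]
                if freq.get(gram2):
                    yield gram2
        if len(w) > 3:
            for i in range(len(w) - 2):
                gram3 = w[i:i + 3]
                if freq.get(gram3):
                    yield gram3
        yield w


# Pretend these words came out of accurate-mode segmentation.
print(list(search_grams(["中国科学院", "计算所"], TOY_FREQ)))
# ['中国', '科学', '学院', '科学院', '中国科学院', '计算所']
```

Because freq.get is used as the test, a sub-word whose stored frequency is 0 is skipped just like one that is missing from the table.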