The word segmentation methods in jieba
I've been looking at natural language processing recently, and some of the code I've written uses the jieba library, which works quite well in practice. So I took the opportunity to read through its word segmentation code as well (the keyword extraction part was already covered in an earlier post), and I'd like to share some of its methods here.
First, the entry function:
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)
re_skip_default = re.compile("(\r\n|\s)", re.U)
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
def cut(self, sentence, cut_all=False, HMM=True):
'''
    The main function that segments an entire sentence that contains
    Chinese characters into separated words.
    Parameters:
- sentence: The str(unicode) to be segmented.
- cut_all: Model type. True for full pattern, False for accurate pattern.
- HMM: Whether to use the Hidden Markov Model.
'''
sentence = strdecode(sentence)
if cut_all:
re_han = re_han_cut_all
re_skip = re_skip_cut_all
else:
        re_han = re_han_default  # matches runs of Han characters, letters, digits and +#&._
        re_skip = re_skip_default  # whitespace runs (\r\n or \s) are split out and skipped
if cut_all:
cut_block = self.__cut_all
elif HMM:
cut_block = self.__cut_DAG
else:
cut_block = self.__cut_DAG_NO_HMM
blocks = re_han.split(sentence)
for blk in blocks:
if not blk:
continue
if re_han.match(blk):
for word in cut_block(blk):
                yield word  # yield turns the function into a generator
else:
tmp = re_skip.split(blk)
for x in tmp:
if re_skip.match(x):
yield x
elif not cut_all:
for xx in x:
yield xx
else:
yield x
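Before looking at the individual cut_block implementations, a quick sketch of how this entry function behaves may help. The first print shows how re_han_default.split partitions a mixed sentence into blocks (the sample sentence is my own); the jieba.cut calls then contrast full mode with the default accurate mode. The expected outputs in the comments may vary slightly across jieba and dictionary versions:

import re
import jieba

re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)

# Because the pattern has a capturing group, re.split keeps the matched
# blocks; the unmatched stretches (punctuation, spaces) fall through to
# the re_skip branch of cut above.
print(re_han_default.split("jieba分词, 测试123!"))
# ['', 'jieba分词', ', ', '测试123', '!']

# Full mode yields every dictionary word found in each block;
# accurate mode picks a single best segmentation.
print("/".join(jieba.cut("我来到北京清华大学", cut_all=True)))
# 我/来到/北京/清华/清华大学/华大/大学
print("/".join(jieba.cut("我来到北京清华大学", cut_all=False)))
# 我/来到/北京/清华大学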
Segmenting with full mode
As for what exactly "full mode" means, you'll see once you've read the code.
# builds self.FREQ; the parameter f is an opened dictionary file object
def gen_pfdict(self, f):
lfreq = {}
    ltotal = 0  # running total of all word frequencies
f_name = resolve_filename(f)
for lineno, line in enumerate(f, 1):
try:
line = line.strip().decode('utf-8')
            word, freq = line.split(' ')[:2]
            freq = int(freq)
            lfreq[word] = freq
            ltotal += freq
            # record every prefix of the word with frequency 0, so the
            # dict can be used as a prefix trie when building the DAG
            for ch in xrange(len(word)):
                wfrag = word[:ch + 1]
                if wfrag not in lfreq:
                    lfreq[wfrag] = 0
        except ValueError:
            raise ValueError(
                'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
    f.close()
    return lfreq, ltotal
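To see what gen_pfdict actually builds, here is a hand-rolled illustration of the prefix-dictionary structure it returns (the two words and their frequencies are made up for the example, not taken from jieba's dict.txt):

# Every prefix of every word gets an entry; prefixes that are not
# themselves dictionary words are stored with frequency 0.
lfreq, ltotal = {}, 0
for word, freq in [("北京", 34488), ("北京大学", 963)]:  # made-up sample
    lfreq[word] = freq
    ltotal += freq
    for ch in range(len(word)):
        wfrag = word[:ch + 1]
        if wfrag not in lfreq:
            lfreq[wfrag] = 0
print(lfreq)
# {'北京': 34488, '北': 0, '北京大学': 963, '北京大': 0}

The zero-frequency prefixes are what let the DAG-building step (get_DAG, not shown here) extend a candidate fragment character by character and stop as soon as a lookup misses, without the fragment having to be a real word at every step.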