jieba.suggest_freq()
Source code
def suggest_freq(self, segment, tune=False):
    """
    Suggest word frequency to force the characters in a word to be
    joined or splitted.

    Parameter:
        - segment : The segments that the word is expected to be cut into,
                    If the word should be treated as a whole, use a str.
        - tune : If True, tune the word frequency.

    Note that HMM may affect the final result. If the result doesn't change,
    set HMM=False.
    """
    self.check_initialized()
    ftotal = float(self.total)
    freq = 1
    if isinstance(segment, string_types):
        word = segment
        for seg in self.cut(word, HMM=False):
            freq *= self.FREQ.get(seg, 1) / ftotal
        freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1))
    else:
        segment = tuple(map(strdecode, segment))
        word = ''.join(segment)
        for seg in segment:
            freq *= self.FREQ.get(seg, 1) / ftotal
        freq = min(int(freq * self.total), self.FREQ.get(word, 0))
    if tune:
        self.add_word(word, freq)
    return freq
split
- Test code
import jieba
# print(jieba.suggest_freq(('中', '国'), True))
print('/'.join(jieba.cut('同学中国人比例很高', HMM=False)))
- Output
同学/中国/人/比例/很/高
- Test code
import jieba
print(jieba.suggest_freq(('中', '国'), True))
print('/'.join(jieba.cut('同学中国人比例很高', HMM=False)))
- Output
121  # the frequency for '中' and '国' appearing together as the word '中国'; explained below
Loading model cost 0.550 seconds.
同学/中/国人/比例/很/高
Comparing the two runs, '中国' goes from being kept together to being split apart.
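To see what the returned 121 actually changes, you can peek at the default tokenizer's internal frequency table. This is a minimal sketch that relies on the internal attributes jieba.dt (the default Tokenizer), jieba.dt.FREQ and jieba.dt.total; the exact counts depend on your dictionary version.
import jieba

jieba.initialize()
print(jieba.dt.total)                          # total corpus count, 60101967 with the default dictionary
print(jieba.dt.FREQ.get('中国'))                # original count of '中国'
print(jieba.suggest_freq(('中', '国'), True))   # suggested, lowered count (121 in the run above)
print(jieba.dt.FREQ.get('中国'))                # FREQ['中国'] has been overwritten with that value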
Explanation of the key steps
ftotal is the total number of word occurrences in the corpus; freq is initialized to 1 and accumulates the suggested frequency for '中国'; string_types is str, and the isinstance check decides which branch runs: if segment is a str, the if branch (the join case) is executed, otherwise, as in this example where a tuple is passed, the else branch (the split case) is executed.
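In other words, the two call forms used in this post map directly onto the two branches; a quick sketch reusing the calls that appear below:
import jieba

jieba.suggest_freq(('中', '国'), tune=True)   # tuple -> else branch: lower FREQ['中国'] so the word splits
jieba.suggest_freq('台中', tune=True)          # str   -> if branch: raise FREQ['台中'] so the word joins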
First, this line runs for '中':
freq *= self.FREQ.get(seg, 1) / ftotal
- If '中' were not in the dictionary, freq = 1/60101967;
- '中' is of course in the dictionary, so freq = (count of '中' in the dictionary) / 60101967.
Then the same line runs for '国':
freq *= self.FREQ.get(seg, 1) / ftotal
- If '国' were not in the dictionary, freq = 1/60101967;
- '国' is of course in the dictionary, so freq = freq * (count of '国' in the dictionary) / 60101967.
This leaves freq as a very small number (the product of two probabilities).
Next, this line runs:
freq = min(int(freq * self.total), self.FREQ.get(word, 0))
This takes the smaller of freq * self.total (the suggested count for '中国') and the count '中国' already has in the dictionary, self.FREQ.get(word, 0).
The result is 121, so the count has clearly been lowered, which is also why '中' and '国' end up split in the final output.
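As a rough numerical sketch of that computation (the per-character counts and the count for '中国' below are made up for illustration; only ftotal = 60101967 and the final 121 come from the run above):
ftotal = 60101967.0              # total word count of the default dictionary
freq = 1.0
freq *= 85000 / ftotal           # hypothetical count of '中'
freq *= 85600 / ftotal           # hypothetical count of '国'
suggested = int(freq * ftotal)   # 121 with these made-up counts
print(min(suggested, 383785))    # 383785 stands in for FREQ.get('中国', 0); prints 121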
Note: even with this function, a word may still fail to be split.
For example:
import jieba
print(jieba.suggest_freq(('中', '国'), True))
print('/'.join(jieba.cut('中国共产党员毛泽东', HMM=False)))
The output still shows '中国共产党' kept together, probably because '中国共产党' is just too strong, haha!
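The reason is easy to check: lowering FREQ['中国'] does not touch the longer dictionary entry '中国共产党', so the best segmentation path can still go through the longer word. A sketch, again poking at the internal jieba.dt.FREQ dict (the exact counts depend on your dictionary):
import jieba

jieba.suggest_freq(('中', '国'), True)
print(jieba.dt.FREQ.get('中国'))        # lowered to the suggested value (121 above)
print(jieba.dt.FREQ.get('中国共产党'))   # unchanged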
join
- Test code
import jieba
# print(jieba.suggest_freq('台中', True))
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
- Output
- Test code
import jieba
print(jieba.suggest_freq('台中', True))
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
- Output
Tip: as an aside, '台中' gets joined here with a suggested frequency of just 69, so it is not hard to understand why, in the earlier '中国共产党员毛泽东' example, things stayed joined at 121.
Explanation of the key steps
The beginning is the same as before, so it is not repeated; from here on, since '台中' is a str, the if branch is executed.
After the loop handles '台', freq has been multiplied by FREQ.get('台', 1) / ftotal; after '中', it has been multiplied by FREQ.get('中', 1) / ftotal as well.
Note that the if branch ends with freq = max(int(freq * self.total) + 1, self.FREQ.get(word, 1)), which adds 1 to the suggested count, so if the code keeps running on the same word, the suggested frequency will normally go up by 1 each time.
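A quick sketch of that effect (the concrete values depend on your dictionary; starting from the 69 above they would simply go up by one per call):
import jieba

print(jieba.suggest_freq('台中', True))   # e.g. 69 on the first call
print(jieba.suggest_freq('台中', True))   # 70: the stored value is read back and + 1 is applied
print(jieba.suggest_freq('台中', True))   # 71, and so on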
jieba.add_word() & del_word()
Source code
def add_word(self, word, freq=None, tag=None):
    """
    Add a word to dictionary.

    freq and tag can be omitted, freq defaults to be a calculated value
    that ensures the word can be cut out.
    """
    self.check_initialized()
    word = strdecode(word)
    freq = int(freq) if freq is not None else self.suggest_freq(word, False)
    self.FREQ[word] = freq
    self.total += freq
    if tag:
        self.user_word_tag_tab[word] = tag
    for ch in xrange(len(word)):
        wfrag = word[:ch + 1]
        if wfrag not in self.FREQ:
            self.FREQ[wfrag] = 0
    if freq == 0:
        finalseg.add_force_split(word)

def del_word(self, word):
    """
    Convenient function for deleting a word.
    """
    self.add_word(word, 0)
add_word() is used to forcibly adjust a word's frequency. If only the first argument is given and no frequency is specified, freq defaults to suggest_freq(word, False), so the effect is the same as using jieba.suggest_freq().
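For instance, a minimal sketch (the word '数据中台' is just a hypothetical new term; any string you want segmented as one unit works):
import jieba

jieba.add_word('数据中台')               # no freq given: defaults to suggest_freq('数据中台', False)
print(jieba.dt.FREQ.get('数据中台'))      # the calculated frequency now stored in the dictionary
print('/'.join(jieba.cut('数据中台建设', HMM=False)))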
Passing an explicit frequency of 0 instead forces a split. For example, back to the earlier case:
import jieba
print(jieba.add_word('中国共产党', 0))
print(jieba.add_word('中国', 0))
print('/'.join(jieba.cut('中国共产党员毛泽东', HMM=False)))
The output:
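Since del_word(word) is literally add_word(word, 0) (see the source above), the same force-split effect can also be written as:
import jieba

jieba.del_word('中国共产党')   # equivalent to jieba.add_word('中国共产党', 0)
jieba.del_word('中国')
print('/'.join(jieba.cut('中国共产党员毛泽东', HMM=False)))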