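Introduction
This post contains code for four dictionary-based unigram Chinese word segmentation algorithms: forward maximum matching, backward maximum matching, full segmentation, and Viterbi. I wrote the first three myself; the fourth is an improved version of the code from the post "维特比算法 实现中文分词 python实现" (Viterbi algorithm for Chinese word segmentation, Python implementation). The full segmentation algorithm was hard to write, and I could not find usable Python code for it online, so I had to write it myself. My version is verbose (I will clean it up when I find time), so I give no explanation and go straight to the code and examples. (If you have concise, readable code, please send it to me. Thanks!)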
Forward maximum matching algorithm
-
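Load the dictionary, in the form "word": probability
>>> word_prob = {
...     "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005, "天气": 0.06,
...     "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
...     "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06, "有": 0.05,
...     "很": 0.03, "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01,
...     "程": 0.005, "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005,
...     "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005,
... }
>>> print(sum(word_prob.values()))
1.0000000000000002
The printed value is the sum of the probabilities of all words in the dictionary; the trailing digits are floating-point rounding error.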
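An exact comparison with 1.0 would fail for a sum like this one, so a tolerance-based check may be a safer way to confirm that a dictionary is a valid probability distribution. The following is only a minimal sketch: `sample_prob`, `is_normalized`, and `normalize` are names made up for this illustration and are not part of the original code.

```python
import math

# Assumption: a small excerpt of word_prob, used only for illustration.
sample_prob = {"北京": 0.03, "的": 0.08, "天气": 0.06}


def is_normalized(prob, tol=1e-9):
    """Return True if the probabilities sum to 1 within an absolute tolerance."""
    return math.isclose(sum(prob.values()), 1.0, abs_tol=tol)


def normalize(prob):
    """Return a copy of the dictionary rescaled so that its values sum to 1."""
    total = sum(prob.values())
    return {word: p / total for word, p in prob.items()}


# The excerpt alone does not sum to 1; the rescaled copy does.
print(is_normalized(sample_prob))
print(is_normalized(normalize(sample_prob)))
```

Rescaling keeps the relative weights of the words unchanged, which is all the matching and scoring steps depend on.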
-
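Forward maximum matching; returns the segmentation result
def Max_Matching_forward(input_str):
    # The longest dictionary word bounds the size of the matching window.
    max_len = max(len(w) for w in word_prob)
    segments = []
    i = 0
    while i < len(input_str):
        # Try the longest candidate first, then shrink the window from the right.
        for j in range(min(i + max_len, len(input_str)), i, -1):
            # If nothing matches, fall back to a single character so that
            # out-of-dictionary characters cannot cause an endless loop.
            if input_str[i:j] in word_prob or j == i + 1:
                segments.append(input_str[i:j])
                i = j
                break
    return segments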
-
Examples:
>>> print(Max_Matching_forward("北京的天气真好啊"))
['北京', '的', '天气', '真好啊']
>>> print(Max_Matching_forward("今天的课程内容很有意思"))
['今天', '的', '课程', '内容', '很有', '意思']
>>> print(Max_Matching_forward("经常有意见分歧"))
['经常', '有意见', '分歧']
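Maximum matching is greedy: it only looks at the dictionary keys and never consults the stored probabilities. As a rough sketch of how a unigram model could compare two candidate segmentations, the snippet below sums the log-probabilities of the words in each one. `probs` and `log_score` are hypothetical names introduced for this illustration, and the dictionary is a small excerpt of `word_prob`.

```python
import math

# Assumption: only the entries needed for this example, copied from word_prob.
probs = {"经常": 0.08, "有": 0.05, "意见": 0.08, "有意见": 0.02, "分歧": 0.04}


def log_score(segments, prob):
    """Sum the log-probabilities of the words; -inf if any word is unknown."""
    if any(word not in prob for word in segments):
        return float("-inf")
    return sum(math.log(prob[word]) for word in segments)


greedy = ["经常", "有意见", "分歧"]     # result of forward maximum matching
other = ["经常", "有", "意见", "分歧"]  # an alternative segmentation

# A higher score means the segmentation is more probable under the unigram model.
print(log_score(greedy, probs))
print(log_score(other, probs))
```

Working in log space avoids multiplying many small probabilities together, which would underflow on long sentences.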
Backward maximum matching algorithm
-
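Backward maximum matching; returns the segmentation result (uses the dictionary from the forward matching section)
def Max_Matching_backward(input_str):
    max_len = max(len(w) for w in word_prob.keys())
    j = len(input_str)
    # Clamp at 0 so that inputs shorter than max_len do not give a negative index.
    i = max(j - max_len, 0)
    segments = []
    while(1):
        if input_str[i:j] in word_prob.keys():
            if i == 0:
                segments.append(input_str[i:j])
                break
            segments.append(input_str[i:j])
            j = i
            i = max(i-max_len,0)
        else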