1. Enumeration-based Chinese word segmentation
- Forward maximum matching
- Example: 我们经常有意见分歧 ("we often have disagreements")
- Dictionary: ["我们", "经常", "有", "有意见", "意见", "分歧"]
- Set max_len = 5
- Run forward maximum matching:
- (1)
- [我们经常有]意见分歧 (not in the dictionary; shrink the window by one)
- [我们经常]有意见分歧 (not in the dictionary; shrink the window by one)
- [我们经]常有意见分歧 (not in the dictionary; shrink the window by one)
- [我们]经常有意见分歧 (in the dictionary; emit "我们")
- (2)
- [经常有意见]分歧 (not in the dictionary; shrink the window by one)
- [经常有意]见分歧 (not in the dictionary; shrink the window by one)
- [经常有]意见分歧 (not in the dictionary; shrink the window by one)
- [经常]有意见分歧 (in the dictionary; emit "经常")
- ...
- The final segmentation is ["我们", "经常", "有意见", "分歧"]
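The forward pass above can be condensed into a small function. This is a minimal sketch; the single-character fallback (which the walkthrough never needs) is an assumption for inputs that no dictionary word covers:

```python
def forward_max_match(text, dictionary, max_len=5):
    # Repeatedly take the longest dictionary word starting at the current
    # position, shrinking the window one character at a time.
    result = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            word = text[:size]
            if word in dictionary or size == 1:
                # Fall back to a single character when nothing matches.
                result.append(word)
                text = text[size:]
                break
    return result

words = ["我们", "经常", "有", "有意见", "意见", "分歧"]
print(forward_max_match("我们经常有意见分歧", words))
# → ['我们', '经常', '有意见', '分歧']
```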
- Backward maximum matching
- Same example as above; the procedure is similar, but the window slides in from the right:
- (1)
- 我们经常[有意见分歧] (not in the dictionary; shrink the window by one)
- 我们经常有[意见分歧] (not in the dictionary; shrink the window by one)
- 我们经常有意[见分歧] (not in the dictionary; shrink the window by one)
- 我们经常有意见[分歧] (in the dictionary; emit "分歧")
- (2)
- 我们[经常有意见] (not in the dictionary; shrink the window by one)
- 我们经[常有意见] (not in the dictionary; shrink the window by one)
- 我们经常[有意见] (in the dictionary; emit "有意见")
- ...
- The final segmentation is ["我们", "经常", "有意见", "分歧"]
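Backward maximum matching is the mirror image: the same window logic, applied to suffixes instead of prefixes (a sketch under the same single-character-fallback assumption):

```python
def backward_max_match(text, dictionary, max_len=5):
    # Repeatedly take the longest dictionary word ending at the current
    # right edge of the text, shrinking the window one character at a time.
    result = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            word = text[-size:]
            if word in dictionary or size == 1:
                result.append(word)
                text = text[:-size]
                break
    result.reverse()  # words were collected right to left
    return result

words = ["我们", "经常", "有", "有意见", "意见", "分歧"]
print(backward_max_match("我们经常有意见分歧", words))
# → ['我们', '经常', '有意见', '分歧']
```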
- Using the two matching methods above, for the string "我们学习人工智能,人工智能是未来" we can enumerate some of the possible segmentations:
- [我们, 学习, 人工智能, 人工智能, 是, 未来]
- [我们, 学习, 人工, 智能, 人工智能, 是, 未来]
- [我们, 学习, 人工, 智能, 人工, 智能, 是, 未来]
- [我们, 学习, 人工智能, 人工, 智能, 是, 未来]
- ...
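These candidates can be generated exhaustively with a short recursion over dictionary prefixes (a sketch; the comma is dropped here, since it is not in the dictionary):

```python
def all_segmentations(text, dictionary):
    # Try every dictionary word that is a prefix of the text, then recurse
    # on the remainder; the empty string has exactly one (empty) segmentation.
    if not text:
        return [[]]
    results = []
    for size in range(1, len(text) + 1):
        word = text[:size]
        if word in dictionary:
            for rest in all_segmentations(text[size:], dictionary):
                results.append([word] + rest)
    return results

words = ["我们", "学习", "人工", "智能", "人工智能", "是", "未来"]
for seg in all_segmentations("我们学习人工智能人工智能是未来", words):
    print(seg)
```

Each occurrence of 人工智能 can stand alone or split into 人工 + 智能, so this input yields exactly the four segmentations listed above.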
Suppose we are also given unigram probabilities: p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15
We can then score each segmentation by its negative log probability: under a unigram model the sentence probability is the product of the word probabilities, so taking -log turns it into a sum, and a lower score means a more likely segmentation.
- -log p(我们,学习,人工智能,人工智能,是,未来) = -log p(我们) - log p(学习) - log p(人工智能) - log p(人工智能) - log p(是) - log p(未来)
- -log p(我们,学习,人工,智能,人工智能,是,未来) = -log p(我们) - log p(学习) - log p(人工) - log p(智能) - log p(人工智能) - log p(是) - log p(未来)
- -log p(我们,学习,人工,智能,人工,智能,是,未来) = -log p(我们) - log p(学习) - log p(人工) - log p(智能) - log p(人工) - log p(智能) - log p(是) - log p(未来)
- -log p(我们,学习,人工智能,人工,智能,是,未来) = -log p(我们) - log p(学习) - log p(人工智能) - log p(人工) - log p(智能) - log p(是) - log p(未来)
- ...
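Plugging in the given probabilities confirms the ranking (log base 10 here; any base gives the same ordering):

```python
import math

p = {"我们": 0.25, "学习": 0.15, "人工": 0.05, "智能": 0.1,
     "人工智能": 0.2, "未来": 0.1, "是": 0.15}

def neg_log_prob(seg):
    # Sum of -log10 p(w) over the words of one segmentation.
    return sum(-math.log10(p[w]) for w in seg)

candidates = [
    ["我们", "学习", "人工智能", "人工智能", "是", "未来"],
    ["我们", "学习", "人工", "智能", "人工智能", "是", "未来"],
    ["我们", "学习", "人工", "智能", "人工", "智能", "是", "未来"],
    ["我们", "学习", "人工智能", "人工", "智能", "是", "未来"],
]
best = min(candidates, key=neg_log_prob)
print(best)  # the first candidate: p(人工智能) > p(人工) * p(智能)
```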
```python
import math

dic_words = [
    "北京", "的", "天", "气", "天气",
    "真", "好", "真好", "啊", "真好啊",
    "今", "今天", "课程", "内容", "有",
    "很", "很有", "意思", "有意思", "课",
    "程", "经常", "意见", "意", "见",
    "有意见", "分歧", "分", "歧",
]

word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005,
    "天气": 0.06, "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
    "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06, "有": 0.05, "很": 0.03,
    "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01,
    "程": 0.005, "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005,
    "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}

def cal_score(segment: list) -> float:
    # Score of a segmentation: sum of -log10 unigram probabilities.
    total = 0.0
    for word in segment:
        total += -1.0 * math.log10(word_prob[word])
    return total

def word_segment_naive(input_str: str):
    # Enumerate every split point, run forward and backward maximum
    # matching on both halves, and keep the candidate with the lowest score.
    segments = []
    for i in range(len(input_str)):
        pre, post = input_str[:i], input_str[i:]
        for matcher in (pre_segment_naive, post_segment_naive):
            s1 = matcher(pre)
            s2 = matcher(post)
            if (len(s1) > 0 or len(pre) == 0) and len(s2) > 0:
                segment = s1 + s2
                if segment not in segments:
                    segments.append(segment)
    best_segment, best_score = [], -1.0
    for seg in segments:
        score = cal_score(seg)
        if best_score < 0 or score < best_score:
            best_score = score
            best_segment = seg
    return best_segment

def pre_segment_naive(words: str):
    # Forward maximum matching with a window of at most max_len characters.
    segment = []
    max_len = 5
    while len(words) > 0:
        use_len = min(max_len, len(words))
        while use_len > 0:
            tmp_src = words[:use_len]
            if tmp_src in word_prob:
                segment.append(tmp_src)
                words = words[use_len:]
                break
            use_len -= 1
        if use_len == 0:  # nothing matched: give up on this half
            break
    return segment

def post_segment_naive(words: str):
    # Backward maximum matching: the window slides in from the right.
    segment = []
    max_len = 5
    while len(words) > 0:
        use_len = min(max_len, len(words))
        while use_len > 0:
            tmp_src = words[-use_len:]
            if tmp_src in word_prob:
                segment.append(tmp_src)
                words = words[:len(words) - use_len]
                break
            use_len -= 1
        if use_len == 0:
            break
    segment.reverse()  # words were collected right to left
    return segment

def Print(word: str):
    # Convenience helper: one word of the best segmentation per line.
    for s in word_segment_naive(word):
        print(s)

if __name__ == '__main__':
    print(word_segment_naive("北京的天气真好啊"))
    print(word_segment_naive("今天的课程内容很有意思"))
    print(word_segment_naive("经常有意见分歧"))
```
The output is:

```
['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']
```
2. Optimizing the pipeline with the Viterbi algorithm
- Build a weighted directed graph (DAG) from the dictionary, the input sentence, and word_prob
- Run the Viterbi algorithm to find the best path through the graph, i.e. the best segmentation of the sentence
- Return the result
```python
import math

dic_words = [
    "北京",
    "的", "天",
    "气", "天气",
    "真", "好", "真好", "啊", "真好啊",
    "今", "今天", "课程", "内容", "有",
    "很", "很有", "意思", "有意思", "课",
    "程", "经常", "意见", "意", "见",
    "有意见", "分歧", "分", "歧"
]

word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005,
    "天气": 0.06, "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
    "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06, "有": 0.05, "很": 0.03,
    "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01,
    "程": 0.005, "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005,
    "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}

def get_segment(words: str):
    # Greedily split the input into the shortest dictionary words; these
    # atoms become the nodes of the segmentation graph.
    segment = []
    tmp = ""
    for ch in words:
        tmp += ch
        if tmp in dic_words:
            segment.append(tmp)
            tmp = ""
    # If characters are left over, merge them back into the previous atom
    # until the merged string is a dictionary word.
    while len(tmp) > 0:
        tmp = segment[-1] + tmp
        del segment[-1]
        if tmp in dic_words:
            segment.append(tmp)
            tmp = ""
    return segment

def word_segment_viterbi(words: str):
    segment = get_segment(words)
    # graph[i] holds edges (target node, word, -log10 probability);
    # node i sits before atom i, and node len(segment) is the end.
    graph = [[] for _ in range(len(segment) + 1)]
    for i in range(len(segment)):
        graph[i].append((i + 1, segment[i],
                         -1.0 * math.log10(word_prob[segment[i]])))
    # Add longer edges for dictionary words that span several atoms.
    for i in range(len(segment)):
        word = segment[i]
        for j in range(i + 1, len(segment)):
            word += segment[j]
            if word in dic_words:
                graph[i].append((j + 1, word,
                                 -1.0 * math.log10(word_prob[word])))
    dp = [-1 for _ in range(len(segment) + 1)]   # best score to each node
    pre = [-1 for _ in range(len(segment) + 1)]  # predecessor on best path
    dfs(dp, pre, graph, 0, 0, 0.0)
    # Recover the best path by walking the predecessor links backwards.
    ret = []
    L = len(dp) - 1
    while True:
        R = L
        L = pre[R]
        if R == 0:
            break
        ret.append("".join(segment[L:R]))
    ret.reverse()
    return ret

def dfs(dp: list, pre: list, graph: list, u: int, v: int, val: float):
    # Relax every edge out of node v; re-visit a node whenever a
    # lower-cost path to it is found.
    pre[v] = u
    dp[v] = val
    for x, _, w in graph[v]:
        if dp[x] == -1 or dp[x] > val + w:
            dfs(dp, pre, graph, v, x, val + w)

if __name__ == '__main__':
    print(word_segment_viterbi("北京的天气真好啊"))
    print(word_segment_viterbi("今天的课程内容很有意思"))
    print(word_segment_viterbi("经常有意见分歧"))
```
The output is:

```
['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']
```
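The recursive `dfs` above may re-visit a node each time a cheaper path to it appears; because the word graph is a DAG whose edges only point forward, the same minimum-cost path can also be computed in a single left-to-right dynamic-programming pass over character positions, which is the textbook Viterbi formulation. A sketch, reusing the same `word_prob` (the name `word_segment_dp` is illustrative):

```python
import math

word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005,
    "天气": 0.06, "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
    "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06, "有": 0.05, "很": 0.03,
    "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01,
    "程": 0.005, "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005,
    "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}

def word_segment_dp(text: str):
    # dp[j]: lowest total -log10 p over segmentations of text[:j];
    # back[j]: start index of the last word on that best path.
    n = len(text)
    dp = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    dp[0] = 0.0
    for i in range(n):
        if dp[i] == math.inf:
            continue  # position i is unreachable
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in word_prob:
                cost = dp[i] - math.log10(word_prob[word])
                if cost < dp[j]:
                    dp[j] = cost
                    back[j] = i
    # Walk the back-pointers from the end to recover the words.
    result, j = [], n
    while j > 0:
        i = back[j]
        result.append(text[i:j])
        j = i
    result.reverse()
    return result

print(word_segment_dp("经常有意见分歧"))
# → ['经常', '有意见', '分歧']
```

This visits each (start, end) pair at most once, so it runs in O(n²) substring checks instead of the potentially exponential re-exploration of the recursive version.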