# given a dict
li = ["北京大学","生前","来","应聘","大学生","前来","北京"]
dic = {w:i for i,w in enumerate(li)}
print(dic)
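Forward max matching
The window slides from the start of the string, taking at most max_length characters each time.
If the window is not in the dictionary, drop its last character and try again.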
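# forward max matching
# the longest word in the dict has 4 characters, so any max_length >= 4 works
# string: test string
max_length = 5
string = "北京大学生前来应聘"
# store the answers
ans = []
while len(string) > 0:
    # the window cannot be longer than the remaining string
    right = min(len(string), max_length)
    sub_string = string[:right]
    while len(sub_string) > 0:
        # accept a dictionary word; fall back to a single character
        # so the outer loop always advances on unknown input
        if sub_string in dic or len(sub_string) == 1:
            ans.append(sub_string)
            string = string[len(sub_string):]
            break
        else:
            # no match: drop the last character and retry
            sub_string = sub_string[:len(sub_string)-1]
print(ans)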
['北京大学', '生前', '来', '应聘']
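Backward max matching
The window slides from the end of the string. If the window is not in the dictionary, drop its first character and try again.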
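# backward max matching
res = []
max_length = 5
string = "北京大学生前来应聘"
while len(string) > 0:
    l = len(string)
    # take the last max_length characters, or the whole string if it is shorter
    if len(string) < max_length:
        sub_string = string
    else:
        sub_string = string[len(string) - max_length:]
    while len(sub_string) > 0:
        # accept a dictionary word; fall back to a single character
        # so the outer loop always advances on unknown input
        if sub_string in dic or len(sub_string) == 1:
            res.append(sub_string)
            string = string[:l - len(sub_string)]
            break
        else:
            # no match: drop the first character and retry
            sub_string = sub_string[1:]
# words were collected from right to left, so reverse them
print(res[::-1])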
['北京', '大学生', '前来', '应聘']
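Bidirectional max matching
Algorithm:
Compare the forward max matching result with the backward max matching result.
If the two results contain different numbers of words, return the one with fewer words.
If they contain the same number of words: when the segmentations are identical, return either one; when they differ, return the one with fewer single-character words.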
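The bidirectional rules can be sketched roughly as follows. This is only a minimal illustration, not a reference implementation: the function names (forward_max_match, backward_max_match, bidirectional_max_match) and the vocab set are made up for this example, unknown single characters are assumed to be emitted as one-character words, and returning the backward result when both sides have the same number of single-character words is an assumption, since the rules do not cover that case.

```python
def forward_max_match(text, vocab, max_length):
    """Greedy left-to-right matching; unknown single characters are kept as words."""
    words = []
    while text:
        size = min(len(text), max_length)
        # Shrink the window from the right until it is a known word (or one char).
        while size > 1 and text[:size] not in vocab:
            size -= 1
        words.append(text[:size])
        text = text[size:]
    return words


def backward_max_match(text, vocab, max_length):
    """Greedy right-to-left matching; unknown single characters are kept as words."""
    words = []
    while text:
        size = min(len(text), max_length)
        # Shrink the window from the left until it is a known word (or one char).
        while size > 1 and text[-size:] not in vocab:
            size -= 1
        words.append(text[-size:])
        text = text[:-size]
    # Words were collected from the end of the string, so reverse them.
    return words[::-1]


def count_single_chars(words):
    return sum(1 for w in words if len(w) == 1)


def bidirectional_max_match(text, vocab, max_length):
    fwd = forward_max_match(text, vocab, max_length)
    bwd = backward_max_match(text, vocab, max_length)
    # Different word counts: prefer the segmentation with fewer words.
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Same count and identical segmentation: either one will do.
    if fwd == bwd:
        return fwd
    # Same count, different segmentation: prefer fewer single-character words.
    # Assumption: on a tie, the backward result is returned.
    if count_single_chars(fwd) < count_single_chars(bwd):
        return fwd
    return bwd


vocab = {"北京大学", "生前", "来", "应聘", "大学生", "前来", "北京"}
print(bidirectional_max_match("北京大学生前来应聘", vocab, max_length=5))
# ['北京', '大学生', '前来', '应聘']
```

For the test string, both directions produce four words, but the forward result contains one single-character word ("来") and the backward result contains none, so the backward segmentation is returned.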