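The KMP Algorithm
How can we tell whether a substring occurs inside another string?
The KMP algorithm is one way to do it.
For an introduction to the algorithm itself, see Ruan Yifeng's tutorial: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html
For the next array, I would add one more reference:
Comic: What Is the KMP Algorithm? (mp.weixin.qq.com)
next[x]: x is the index of the pattern character currently being compared (the array is 0-indexed, so x is also the number of characters that have already matched).
The value of next[x] says where in the pattern matching should resume when position x fails to match.
Neither reference gives code for computing the next array, so I provide mine below. Note that my code stores the partial match values from Ruan Yifeng's table, so a mismatch at position x looks up the entry at x - 1.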
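To make the lookup concrete, here is a minimal sketch. It assumes the partial match table convention (entry k is the length of the longest proper prefix of pattern[:k+1] that is also its suffix), so a mismatch at pattern position x reads entry x - 1. The name `resume_position`, the hard-coded table, and the -1 sentinel are illustrative choices, not part of either tutorial.

```python
# Partial match table for the pattern "ABCDABD" (hard-coded for illustration).
PATTERN = "ABCDABD"
TABLE = [0, 0, 0, 0, 1, 2, 0]


def resume_position(table, x):
    """Return the pattern index to compare next after a mismatch at index x.

    pattern[0:x] has already matched. When x == 0 nothing has matched, so the
    caller should advance in the text instead; that case is signalled with -1.
    """
    if x == 0:
        return -1
    return table[x - 1]


# Show the fallback position for a mismatch at every pattern index.
for x, ch in enumerate(PATTERN):
    print(x, ch, resume_position(TABLE, x))
```

For example, a mismatch at index 6 (the final "D") resumes at index 2, because the matched text "ABCDAB" ends with "AB", which is also the start of the pattern.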
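// The following is quoted from his blog.
- For "A", the prefix and suffix sets are both empty, so the common element has length 0;
- For "AB", the prefixes are [A] and the suffixes are [B], so the common element has length 0;
- For "ABC", the prefixes are [A, AB] and the suffixes are [BC, C], so the common element has length 0;
- For "ABCD", the prefixes are [A, AB, ABC] and the suffixes are [BCD, CD, D], so the common element has length 0;
- For "ABCDA", the prefixes are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", with length 1;
- For "ABCDAB", the prefixes are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", with length 2;
- For "ABCDABD", the prefixes are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the common element has length 0.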
---
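Besides the approach above, that tutorial also presents a DP solution. For the partial match values it reads:
dp[0] = 0
for each i >= 1: j = dp[i-1]
while j > 0 and substr[i] != substr[j]: j = dp[j-1]
if substr[i] == substr[j]: j += 1
dp[i] = j
Here j is the partial match value carried over from the previous position, i.e. how many leading characters of substr are currently matched by the suffix ending at i.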
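The DP idea can be sketched as runnable Python. This is a minimal version of the standard prefix-function recurrence, assuming the same partial match convention as the quoted examples; the name `prefix_table` is my own and does not come from either tutorial.

```python
def prefix_table(pattern):
    """Compute the partial match table of `pattern` in linear time.

    table[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it.
    """
    table = [0] * len(pattern)
    j = 0  # length of the border carried over from the previous position
    for i in range(1, len(pattern)):
        # Fall back to shorter and shorter borders until one can be extended.
        while j > 0 and pattern[i] != pattern[j]:
            j = table[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        table[i] = j
    return table


print(prefix_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

The fallback step reuses earlier table entries, so no prefix or suffix strings are ever materialized.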
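def next_array_cal(s):
    # Partial match table: next_array[k] belongs to the prefix s[:k + 1].
    next_array = [0]
    for i in range(2, len(s) + 1):  # which prefix of s is being processed
        subs = s[:i]
        tmpleft = []
        tmpright = []
        for j in range(1, i):  # list every proper prefix and suffix of subs
            tmpleft.append(subs[:j])   # each proper prefix
            tmpright.append(subs[j:])  # each proper suffix
        # print(subs, tmpleft, tmpright)
        # The longest common prefix/suffix marks how much of the comparison
        # already done on subs can be reused after a mismatch.
        tmp_value = [len(x) for x in tmpleft if x in tmpright]
        # print(tmp_value)
        # tmpleft is ordered shortest first, so take the maximum length.
        next_array.append(max(tmp_value, default=0))
    return next_array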
# x = next_array_cal(subs)
# print(x)
def kmp(s, subs):
    next_array = next_array_cal(subs)
    subs_index = 0
    s_index = 0
    length = len(subs)
    while subs_index < length:  # subs_index == length means the whole pattern matched
        # Not enough text left to match the rest of the pattern: no match.
        if length - subs_index > len(s) - s_index:
            return -1
        if subs[subs_index] == s[s_index]:
            subs_index += 1
            s_index += 1
        elif subs_index == 0:
            # Even the first character of subs failed, so move to the next character of s.
            s_index += 1
        else:
            # Position subs_index did not match, so look up the entry for the last
            # matched position, subs_index - 1, to find where in subs to resume.
            subs_index = next_array[subs_index - 1]
    # Returns (index just past the match, matched text).
    return s_index, s[s_index - length:s_index]

s = 'BBC ABCDAB ABCDABCDABDE'
subs = 'ABCDABD'
print(kmp(s, subs))  # (22, 'ABCDABD')
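The search above returns the index just past the match, or -1 on failure. As a point of comparison, here is a self-contained sketch that returns the start index instead, mirroring `str.find`, and builds its table with the linear DP recurrence rather than by listing prefixes and suffixes. The name `kmp_find` and the choice to return 0 for an empty pattern are my assumptions.

```python
def kmp_find(text, pattern):
    """Return the start index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0  # assumption: match str.find, which returns 0 for an empty pattern

    # Build the partial match table in linear time.
    table = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = table[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        table[i] = j

    # Scan the text once; j is the number of pattern characters matched so far.
    j = 0
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


print(kmp_find('BBC ABCDAB ABCDABCDABDE', 'ABCDABD'))  # 15
```

Comparing the result with `text.find(pattern)` on a handful of inputs is a cheap way to sanity-check any hand-written matcher.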
It turns out there is no need to compute the whole array in advance; each value can be computed when it is needed.
def get_next(s):
    # Length of the longest proper prefix of s that is also a suffix of s.
    tmpleft = []
    tmpright = []
    for j in range(1, len(s)):  # list every proper prefix and suffix of s
        tmpleft.append(s[:j])   # each proper prefix
        tmpright.append(s[j:])  # each proper suffix
    # print(s, tmpleft, tmpright)
    tmp_value = [len(x) for x in tmpleft if x in tmpright]
    # print(tmp_value)
    # tmpleft is ordered shortest first, so take the maximum length.
    return max(tmp_value, default=0)
def kmp(s, subs):
    subs_index = 0
    s_index = 0
    length = len(subs)
    while subs_index < length:  # subs_index == length means the whole pattern matched
        # Not enough text left to match the rest of the pattern: no match.
        if length - subs_index > len(s) - s_index:
            return -1
        if subs[subs_index] == s[s_index]:
            subs_index += 1
            s_index += 1
        elif subs_index == 0:
            s_index += 1
        else:
            # The matched text s[s_index - subs_index:s_index] equals subs[:subs_index],
            # so its value is computed on demand instead of being read from a table.
            subs_index = get_next(s[s_index - subs_index:s_index])
    # Returns (index just past the match, matched text).
    return s_index, s[s_index - length:s_index]

s = 'BBC ABCDAB ABCDABCDABDE'
subs = 'ABCDABD'
print(kmp(s, subs))  # (22, 'ABCDABD')
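Computing a value on demand with a fresh prefix/suffix scan repeats work every time the same position mismatches. One possible middle ground, sketched here under my own assumptions, is to compute values lazily but cache them with `functools.lru_cache`. The names `make_lazy_next` and `border` are hypothetical, and the recursion depth grows with the pattern length, so this is only suitable for short patterns.

```python
from functools import lru_cache


def make_lazy_next(pattern):
    """Return a function border(i) that computes partial match values lazily."""

    @lru_cache(maxsize=None)
    def border(i):
        # Length of the longest proper prefix of pattern[:i + 1] that is also its suffix.
        if i == 0:
            return 0
        j = border(i - 1)
        # Fall back through shorter borders until one can be extended.
        while j > 0 and pattern[i] != pattern[j]:
            j = border(j - 1)
        if pattern[i] == pattern[j]:
            j += 1
        return j

    return border


border = make_lazy_next("ABCDABD")
print(border(5))  # 2
print([border(i) for i in range(7)])  # [0, 0, 0, 0, 1, 2, 0]
```

Each value is computed at most once, and only for the positions the search actually asks about plus the earlier positions they depend on.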