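Edit distance: LeetCode code repository: https://github.com/vivianLL/LeetCode
1. Computing the edit distance (LeetCode 72)
The edit distance between two strings is the minimum number of edit operations needed to transform one into the other. Only three operations on the characters of a string are allowed: substitution, insertion, and deletion.
Edit distance is one of the important text-comparison algorithms in natural language processing (NLP), and a useful tool for picking out strings from groups of similar strings. It was proposed by the Russian scientist Levenshtein, so it is also called the Levenshtein distance, or the LD algorithm.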
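Approach:
Define a function edit(i, j): the edit distance between the length-i prefix of the first string and the length-j prefix of the second string.
This gives the following dynamic-programming recurrence:
- if i == 0 and j == 0, edit(i, j) = 0
- if i == 0 and j > 0, edit(i, j) = j
- if i > 0 and j == 0, edit(i, j) = i
- if i ≥ 1 and j ≥ 1, edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) }, where f(i, j) = 1 if the i-th character of the first string differs from the j-th character of the second string, and f(i, j) = 0 otherwise.
Evaluating edit(i, j) for every i and j fills a two-dimensional matrix.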
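The recurrence can be transcribed almost literally into a memoized recursive function. This is only a minimal sketch: the names edit_distance and edit are illustrative, and functools.lru_cache is used here as one convenient way to cache each edit(i, j) so that every (i, j) pair is computed once.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Edit distance between a and b via the recurrence edit(i, j)."""

    @lru_cache(maxsize=None)
    def edit(i: int, j: int) -> int:
        # Base cases: one of the two prefixes is empty.
        if i == 0:
            return j
        if j == 0:
            return i
        # f(i, j) is 0 when the i-th and j-th characters match, else 1.
        f = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            edit(i - 1, j) + 1,      # delete a[i-1]
            edit(i, j - 1) + 1,      # insert b[j-1]
            edit(i - 1, j - 1) + f,  # substitute, or keep a matching character
        )

    return edit(len(a), len(b))


print(edit_distance("horse", "ros"))  # 3
```

Because the recursion depth grows with len(a) + len(b), this form is only suitable for short strings; the iterative matrix version avoids that limit.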
A Python implementation follows:
import heapq


class Solution:
    def minDistance(self, word1: str, word2: str) -> int:
        # # Method 1: plain recursion; high complexity (exponential in the worst case)
        # if len(word1) == 0:
        #     return len(word2)
        # elif len(word2) == 0:
        #     return len(word1)
        # elif word1[-1] == word2[-1]:
        #     return self.minDistance(word1[:-1], word2[:-1])
        # else:
        #     return min(self.minDistance(word1, word2[:-1]) + 1,
        #                self.minDistance(word1[:-1], word2) + 1,
        #                self.minDistance(word1[:-1], word2[:-1]) + 1)

        # # Method 2: double loop filling a 2-D matrix; time O(mn), space O(mn)
        # if len(word1) == 0:
        #     return len(word2)
        # elif len(word2) == 0:
        #     return len(word1)
        # M = len(word1)
        # N = len(word2)
        # output = [[0] * (N + 1) for _ in range(M + 1)]
        # for i in range(M + 1):
        #     for j in range(N + 1):
        #         if i == 0 and j == 0:
        #             output[i][j] = 0
        #         elif i == 0 and j != 0:
        #             output[i][j] = j
        #         elif i != 0 and j == 0:
        #             output[i][j] = i
        #         elif word1[i - 1] == word2[j - 1]:
        #             output[i][j] = output[i - 1][j - 1]
        #         else:
        #             output[i][j] = min(output[i - 1][j - 1] + 1, output[i - 1][j] + 1, output[i][j - 1] + 1)
        # return output[M][N]

        # # Method 3: single rolling row; time O(mn), space O(n) (tmp has N + 1 entries)
        # if len(word1) == 0:
        #     return len(word2)
        # elif len(word2) == 0:
        #     return len(word1)
        # M = len(word1)
        # N = len(word2)
        # tmp = [i for i in range(N + 1)]
        # value = None
        #
        # for i in range(M):
        #     tmp[0] = i + 1
        #     last = i              # value of the upper-left cell
        #     for j in range(N):
        #         if word1[i] == word2[j]:
        #             value = last
        #         else:
        #             value = 1 + min(last, tmp[j], tmp[j + 1])
        #         last = tmp[j + 1]
        #         tmp[j + 1] = value
        # return value

        # Method 4: best-first search over (distance, remaining word1, remaining word2)
        heap = [(0, word1, word2)]
        seen = set()
        while len(heap) > 0:
            dist, w1, w2 = heapq.heappop(heap)
            if w1 == w2:
                return dist
            if (w1, w2) not in seen:
                seen.add((w1, w2))
                # Matching trailing characters cost nothing, so strip them first.
                while w1 and w2 and w1[-1] == w2[-1]:
                    w1 = w1[:-1]
                    w2 = w2[:-1]
                # Then try delete, insert and replace, each at cost 1.
                heapq.heappush(heap, (dist + 1, w1[:-1], w2))
                heapq.heappush(heap, (dist + 1, w1, w2[:-1]))
                heapq.heappush(heap, (dist + 1, w1[:-1], w2[:-1]))


sol = Solution()
ans = sol.minDistance("se", "")
print(ans)
ans = sol.minDistance("intention", "execution")
print(ans)
ans = sol.minDistance("dinitrophenylhydrazine", "benzalphenylhydrazone")
print(ans)
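The rolling-row idea from the commented-out variant can also be tried as a standalone function. The following is a sketch under stated assumptions rather than the author's code: the name edit_distance_rolling is hypothetical, and swapping the arguments so that the shorter string lies along the row (which relies on edit distance being symmetric) is an addition that reduces the extra space to O(min(m, n)).

```python
def edit_distance_rolling(a: str, b: str) -> int:
    """Edit distance using a single row of the DP matrix."""
    # Assumption: keep the shorter string along the row to minimise memory.
    if len(b) > len(a):
        a, b = b, a
    # row[j] holds the distance between the current prefix of a and b[:j].
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        diag = row[0]  # value of the upper-left cell, dp[i-1][0]
        row[0] = i     # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            up = row[j]  # dp[i-1][j], about to be overwritten
            if ca == cb:
                row[j] = diag
            else:
                # row[j - 1] already holds dp[i][j-1] for the current row.
                row[j] = 1 + min(diag, up, row[j - 1])
            diag = up
    return row[-1]


print(edit_distance_rolling("se", ""))                   # 2
print(edit_distance_rolling("intention", "execution"))  # 5
```

Only the final distance is available from this version; recovering the edit operations themselves requires the full matrix.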
2. Recovering the edit operations from the backtracking path
For example, take
str1=bcdabcdef
str2=abcdefbcd
The resulting matrix is shown in the figure:
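The matrix for this example can also be regenerated with a short script. This is a minimal sketch: build_matrix and format_matrix are hypothetical helper names, and the printed layout (str1 across the top, str2 down the side, # standing for the empty prefix) is an assumption, chosen so that "left" means one fewer character of str1 and "up" means one fewer character of str2.

```python
def build_matrix(s1: str, s2: str) -> list:
    """Return dp where dp[i][j] is the edit distance of s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp


def format_matrix(s1: str, s2: str, dp: list) -> str:
    """Render dp with s1 along the top and s2 down the side."""
    lines = ["   " + " ".join(f"{c:>2}" for c in "#" + s1)]
    for j, label in enumerate("#" + s2):
        # Printed row j shows dp[i][j] for every prefix length i of s1.
        cells = " ".join(f"{dp[i][j]:>2}" for i in range(len(s1) + 1))
        lines.append(f"{label:>2} " + cells)
    return "\n".join(lines)


str1, str2 = "bcdabcdef", "abcdefbcd"
dp = build_matrix(str1, str2)
print(format_matrix(str1, str2, dp))
print("edit distance:", dp[len(str1)][len(str2)])
```

The bottom-right cell of the printed grid is the edit distance of the two full strings, and it is where the backtracking starts.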
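Approach:
To find the backtracking path, start from the bottom-right element and check, cell by cell, how the current value was obtained. A cell can sometimes be reached in more than one way, which means that several different operation sequences give the same result. The red arrows in the figure above mark the backtracking path; reversing it gives the actual sequence of edit operations. In the rules below, m indexes str1 and n indexes str2, with str1 laid out horizontally, so moving left drops a character of str1 and moving up drops a character of str2.
- Move left, i.e. dp[m][n] = dp[m-1][n] + 1: delete a character
- Move diagonally with the value unchanged: the characters are equal, so no operation is needed
- Move diagonally with the value increased by 1, i.e. dp[m][n] = dp[m-1][n-1] + 1: substitute a character
- Move up, i.e. dp[m][n] = dp[m][n-1] + 1: insert a character
If ai≠bj, backtrack to whichever of the upper-left, upper, and left cells holds the smallest value; if several cells share the smallest value, prefer the upper-left cell, then the upper cell, then the left cell.
If backtracking goes to the upper-left cell, add Ai to aligned string A and Bj to aligned string B;
If backtracking goes to the upper cell, add Bj to aligned string B and _ to aligned string A;
If backtracking goes to the left cell, add _ to aligned string B and Ai to aligned string A;
Once the whole path has been traversed, the aligned strings are complete.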
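The alignment rule above (aligned strings padded with _) can be sketched as follows. This is an illustrative sketch, not the code used in the rest of this post: the function name align and the gap parameter are hypothetical, dp[i][j] is indexed by the prefix lengths of A and B, and "upper cell" is taken to mean the cell with one fewer character of B, as in the rules above.

```python
def align(a: str, b: str, gap: str = "_"):
    """Return (aligned_a, aligned_b), padded with gap characters."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            # Equal characters: diagonal move, no operation.
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and j > 0 and dp[i - 1][j - 1] <= min(dp[i][j - 1], dp[i - 1][j]):
            # Upper-left cell is smallest (it wins ties): substitution.
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] <= dp[i - 1][j]):
            # Upper cell: B contributes a character, A gets a gap (insertion).
            out_a.append(gap); out_b.append(b[j - 1])
            j -= 1
        else:
            # Left cell: A contributes a character, B gets a gap (deletion).
            out_a.append(a[i - 1]); out_b.append(gap)
            i -= 1
    # The path was collected from the end, so reverse it.
    return "".join(reversed(out_a)), "".join(reversed(out_b))


top, bottom = align("bcdabcdef", "abcdefbcd")
print(top)
print(bottom)
```

Both returned strings have the same length, and the number of columns in which they differ equals the edit distance, provided the gap character does not occur in either input.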
def minDistance(word1, word2):
    """Return the full (M+1) x (N+1) edit-distance matrix, not just the distance."""
    # Empty inputs need no special case: the loops fill row 0 and column 0.
    M = len(word1)
    N = len(word2)
    output = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                output[i][j] = 0
            elif i == 0 and j != 0:
                output[i][j] = j
            elif i != 0 and j == 0:
                output[i][j] = i
            elif word1[i - 1] == word2[j - 1]:
                output[i][j] = output[i - 1][j - 1]
            else:
                output[i][j] = min(output[i - 1][j - 1] + 1, output[i - 1][j] + 1, output[i][j - 1] + 1)
    # The backtracking below needs every cell, so the whole matrix is returned.
    return output


def backtrackingPath(word1, word2):
    dp = minDistance(word1, word2)
    m = len(dp) - 1
    n = len(dp[0]) - 1
    operation = []
    spokenstr = []
    writtenstr = []
    # Walk back from the bottom-right cell until dp[0][0] is reached.
    # Ties are broken in the order: insert, delete, replace, match.
    while n > 0 or m > 0:
        if n and dp[m][n - 1] + 1 == dp[m][n]:
            # Insert word2[n-1]
            print("insert %c\n" % (word2[n - 1]))
            spokenstr.append("insert")
            writtenstr.append(word2[n - 1])
            operation.append("NULLREF:" + word2[n - 1])
            n -= 1
            continue
        if m and dp[m - 1][n] + 1 == dp[m][n]:
            # Delete word1[m-1]
            print("delete %c\n" % (word1[m - 1]))
            spokenstr.append(word1[m - 1])
            writtenstr.append("delete")
            operation.append(word1[m - 1] + ":NULLHYP")
            m -= 1
            continue
        if dp[m - 1][n - 1] + 1 == dp[m][n]:
            # Replace word1[m-1] with word2[n-1]
            print("replace %c %c\n" % (word1[m - 1], word2[n - 1]))
            spokenstr.append(word1[m - 1])
            writtenstr.append(word2[n - 1])
            operation.append(word1[m - 1] + ":" + word2[n - 1])
            n -= 1
            m -= 1
            continue
        if dp[m - 1][n - 1] == dp[m][n]:
            # Characters match: no operation
            spokenstr.append(' ')
            writtenstr.append(' ')
            operation.append(word1[m - 1])
            n -= 1
            m -= 1
    # The path was collected from the end, so reverse it.
    spokenstr = spokenstr[::-1]
    writtenstr = writtenstr[::-1]
    operation = operation[::-1]
    # print(spokenstr, writtenstr)
    # print(operation)
    return spokenstr, writtenstr, operation
Another text-comparison algorithm, the longest common subsequence, is also based on dynamic programming.