Declaration: Here is the link to the original paper. It was written by Eugene W. Myers in 1986. I (j_now; you may also call me cydooc) translated parts of the paper into Chinese and added a few annotations. If this infringes your copyright, please leave me a message and I will remove this post ASAP. Thanks.
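Let A and B be strings, with A = "abcabba" and B = "cbabac". Write N = len(A) and M = len(B), where len is the function that returns the length of a string.
Definition LCS: Longest Common Subsequence, which is different from the Longest Common Substring. In this example the longest common subsequences are "caba", "baba", etc., whereas the longest common substrings are "ab" and "ba"; i.e. the latter requires all of its characters to be contiguous. Finding a Longest Common Subsequence is harder than finding a Longest Common Substring. Below, LCS is short for Longest Common Subsequence. The LCS is not unique, and Myers' algorithm finds only one of them.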
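Definition SES: Shortest Edit Script. The algorithm can output a script that turns file A into file B. Myers' SES has only two kinds of commands: delete a character from file A, and insert a character from file B. An SES is typically used to create a patch. Other diff tools may also have a Replace command, e.g. Review Board, which uses a variant of the Myers algorithm. PS: the time complexity of the Myers algorithm is O(D*(M+N)), where D = len(SES).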
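As an illustration of the two command types, the sketch below applies an edit script to A. The tuple encoding ("D", pos) and ("I", pos, ch) is an assumption made for this example, not a format defined by the paper or by any diff tool: pos is a 1-based position in A, "D" deletes the character at pos, and "I" inserts ch right after position pos (pos 0 means before the first character).

```python
def apply_script(a: str, script: list) -> str:
    """Apply delete/insert commands (hypothetical tuple format) to string a."""
    deletes = {cmd[1] for cmd in script if cmd[0] == "D"}
    inserts = {}
    for cmd in script:
        if cmd[0] == "I":
            inserts.setdefault(cmd[1], []).append(cmd[2])

    out = list(inserts.get(0, []))          # insertions before the first character
    for pos, ch in enumerate(a, start=1):
        if pos not in deletes:
            out.append(ch)                  # character of A that is kept
        out.extend(inserts.get(pos, []))    # characters inserted after position pos
    return "".join(out)


# One shortest script for the running example: 3 deletions + 2 insertions, so D = 5.
script = [("D", 1), ("D", 2), ("I", 3, "b"), ("D", 6), ("I", 7, "c")]
print(apply_script("abcabba", script))  # cbabac
```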
Lemma: len(A) + len(B) = N + M = 2*len(LCS) + len(SES)
Proof omitted.
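Since the proof is omitted, a quick numerical check may help. The sketch below computes len(LCS) and len(SES) independently with two textbook dynamic-programming tables (not Myers' algorithm) and compares both sides of the identity; lcs_length and ses_length are names invented for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ses_length(a: str, b: str) -> int:
    """Minimum number of single-character deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))      # "" -> b[:j] needs j insertions
    for i, ch_a in enumerate(a, start=1):
        cur = [i]                       # a[:i] -> "" needs i deletions
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1])                   # match: no edit needed
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))  # delete or insert
        prev = cur
    return prev[-1]


A, B = "abcabba", "cbabac"
print(len(A) + len(B), 2 * lcs_length(A, B) + ses_length(A, B))  # 13 13
```

For the running example this gives 7 + 6 = 2*4 + 5.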
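Definition Edit Graph: the edit graph is a grid in which the characters of A are laid out left to right along the horizontal axis and the characters of B top to bottom along the vertical axis. Mark every point (x, y) where the x-th character of A equals the y-th character of B, and draw a diagonal edge from (x-1, y-1) ending at that point.
Snake: a run of diagonal edges of length greater than or equal to 0 is a snake.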
The graph above is a redrawn version. The original image was generated by the application built by Nicholas Butler. You can download it here.
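The edit graph and the snake definition can be sketched in a few lines of Python. The helper names match_points and follow_snake are assumptions for this example; follow_snake returns the end of the longest possible snake from a point, although by the definition any shorter prefix of it (including length 0) is also a snake.

```python
def match_points(a: str, b: str) -> list:
    """Points (x, y), 1-based, where the x-th char of a equals the y-th char of b.

    Each such point is the end of a diagonal edge coming from (x-1, y-1).
    """
    return [(x, y)
            for y in range(1, len(b) + 1)
            for x in range(1, len(a) + 1)
            if a[x - 1] == b[y - 1]]


def follow_snake(a: str, b: str, x: int, y: int) -> tuple:
    """Slide from (x, y) along diagonal edges as far as possible."""
    while x < len(a) and y < len(b) and a[x] == b[y]:
        x, y = x + 1, y + 1
    return x, y


A, B = "abcabba", "cbabac"
print(len(match_points(A, B)))    # 14
print(follow_snake(A, B, 3, 2))   # (5, 4): a snake of length 2
print(follow_snake(A, B, 0, 0))   # (0, 0): a snake of length 0
```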
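Definition D-path: a D-path starts at (0, 0) and contains exactly D non-diagonal edges, i.e. D is the number of horizontal moves plus the number of vertical moves.
Recursive definition of a D-path: a D-path consists of a (D-1)-path + one non-diagonal edge e + a snake, where e starts at the end point of the (D-1)-path and the snake starts at the end point of e.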
The graph above is a redrawn version. The original image was generated by the application built by Nicholas Butler. You can download it here.
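To see the D-path definition at work, the sketch below counts the non-diagonal edges of a path given as a list of grid points. The function name count_d and the list-of-points representation are assumptions for this example; diagonal steps are checked against the strings so that only real edit-graph edges are accepted.

```python
def count_d(a: str, b: str, path: list) -> int:
    """Return D, the number of non-diagonal edges of a path starting at (0, 0)."""
    if not path or path[0] != (0, 0):
        raise ValueError("a D-path must start at (0, 0)")
    d = 0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        step = (x1 - x0, y1 - y0)
        if step == (1, 1):
            # A diagonal edge exists only where the characters match.
            if a[x0] != b[y0]:
                raise ValueError(f"no diagonal edge from {(x0, y0)}")
        elif step in ((1, 0), (0, 1)):
            d += 1  # horizontal = delete from A, vertical = insert from B
        else:
            raise ValueError(f"illegal step {step}")
    return d


A, B = "abcabba", "cbabac"
path = [(0, 0), (1, 0), (2, 0), (3, 1), (3, 2), (4, 3),
        (5, 4), (6, 4), (7, 5), (7, 6)]
print(count_d(A, B, path))   # 5, so this is a 5-path
x, y = path[-1]
print(x - y)                 # 1, the diagonal on which the path ends
```

This particular path reaches (N, M) = (7, 6), so it corresponds to an edit script with 5 commands.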
Corollary: v ∈ snake and v ∈ diagonal k => ∀ v' ∈ snake, v' ∈ diagonal k (diagonal k is the set of points (x, y) with x - y = k).
Lemma 1: A D-path must end on diagonal k ∈ { −D, −D+2, ..., D−2, D }.
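Proof:
The function endpoint returns the end point of a path, and the function start returns its start point.
When D = 0, a 0-path consists only of diagonal edges starting at (0, 0), so every point on it lies on diagonal 0.
Assume that for D = i the lemma holds, i.e. the end point of an i-path lies on diagonal k, k ∈ { -i, -i+2, ..., i-2, i }.
When D = i+1, by the recursive definition a D-path consists of a (D-1)-path + one non-diagonal edge e + a snake:
- the end point of the i-path lies on diagonal k,
- if e is a horizontal edge pointing right, then endpoint(e) ∈ diagonal k + 1; if e is a vertical edge pointing down, then endpoint(e) ∈ diagonal k - 1
- ∵ start(snake) = endpoint(e) and endpoint(e) ∈ diagonal k ± 1, ∴ endpoint(snake) ∈ diagonal k ± 1, because a snake never leaves its diagonal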
A 0-path consists solely of diagonal edges and starts on diagonal 0. Hence it must end on diagonal 0. Assume inductively that a D-path must end on diagonal k in { −D, −D+2, ..., D−2, D }. Every (D+1)-path consists of a prefix D-path, ending on say diagonal k, a non-diagonal edge ending on diagonal k+1 or k−1, and a snake that must also end on diagonal k+1 or k−1. It then follows that every (D+1)-path must end on a diagonal in { (−D)±1, (−D+2)±1, ..., (D−2)±1, (D)±1 } = { −D−1, −D+1, ..., D−1, D+1 }. Thus the result holds by induction.
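The induction can also be checked by brute force on the running example. The sketch below follows the recursive definition literally (previous end points, then one non-diagonal edge, then a snake of any length) and collects every point where some D-path can end; it is an exhaustive check written for this post, not the greedy furthest-reaching search of Myers' algorithm, and d_path_endpoints is a made-up name.

```python
def d_path_endpoints(a: str, b: str, max_d: int) -> list:
    """ends[d] = set of all points at which some d-path can end."""
    def snake_closure(points):
        # A snake may have any length >= 0, so keep every point of each diagonal run.
        seen = set()
        for x, y in points:
            seen.add((x, y))
            while x < len(a) and y < len(b) and a[x] == b[y]:
                x, y = x + 1, y + 1
                seen.add((x, y))
        return seen

    ends = [snake_closure({(0, 0)})]
    for _ in range(max_d):
        stepped = set()
        for x, y in ends[-1]:
            if x < len(a):
                stepped.add((x + 1, y))   # horizontal edge: diagonal k + 1
            if y < len(b):
                stepped.add((x, y + 1))   # vertical edge: diagonal k - 1
        ends.append(snake_closure(stepped))
    return ends


A, B = "abcabba", "cbabac"
ends = d_path_endpoints(A, B, 6)
for d, points in enumerate(ends):
    allowed = set(range(-d, d + 1, 2))    # { -d, -d+2, ..., d-2, d }
    assert all(x - y in allowed for x, y in points)
print((7, 6) in ends[5])  # True: the bottom-right corner is first reached with D = 5
```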
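k ∈ { -D, -D+2, ..., D-2, D }, i.e. for a D-path with endpoint(D-path) ∈ diagonal k: if D is even then k is also even, and vice versa (if D is odd then k is odd).
Proof: D is the number of horizontal moves plus the number of vertical moves, and k is the intercept of the end point (its horizontal coordinate minus its vertical coordinate). Diagonal moves change both coordinates by the same amount, so k = (number of horizontal moves) - (number of vertical moves). Let x be the number of horizontal moves and y the number of vertical moves. If x = 2i+1 and y = 2j+1, then D = x + y is even and k = 2i+1 - (2j+1) = 2i-2j, so k is also even.
If x = 2i+1 and y = 2j, then D is odd and k = 2i+1-2j = 2(i-j)+1, so k is also odd.
If x = 2i and y = 2j, then D is even and k = 2i-2j = 2(i-j), so k is also even.
If x = 2i and y = 2j+1, then D is odd and k = 2i-(2j+1) = 2(i-j)-1, so k is also odd.
Q.E.D.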
From Lemma 1 we get the following.
Suppose there are two paths from (0, 0) to (x, y), an i-path and a j-path. Then i - j is an even number (page 10, the last paragraph).
Because the i-path and the j-path have the same end point, both paths end on the same diagonal k. So
- k ∈ { -i, -i+2, ..., i-2, i }, say k = i-2m
- k ∈ { -j, -j+2, ..., j-2, j }, say k = j-2n
∴ i-2m = j-2n, so i-j = 2(m-n) is a multiple of 2. Thus i-j is an even number.
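This parity claim is easy to test exhaustively on the running example. The sketch below records, for every grid point, the set of all D values of paths from (0, 0) that end there; the function name all_d_values and the dictionary layout are assumptions for this illustration.

```python
def all_d_values(a: str, b: str) -> dict:
    """reach[(x, y)] = set of every D such that some D-path ends at (x, y)."""
    reach = {}
    for x in range(len(a) + 1):
        for y in range(len(b) + 1):
            ds = {0} if (x, y) == (0, 0) else set()
            if x > 0:                                   # arrive by a horizontal edge
                ds |= {d + 1 for d in reach[(x - 1, y)]}
            if y > 0:                                   # arrive by a vertical edge
                ds |= {d + 1 for d in reach[(x, y - 1)]}
            if x > 0 and y > 0 and a[x - 1] == b[y - 1]:
                ds |= reach[(x - 1, y - 1)]             # diagonal edge: D unchanged
            reach[(x, y)] = ds
    return reach


A, B = "abcabba", "cbabac"
reach = all_d_values(A, B)
for (x, y), ds in reach.items():
    # Every D reaching (x, y) has the parity of k = x - y,
    # so any two of them differ by an even number.
    assert all((d - (x - y)) % 2 == 0 for d in ds)
print(min(reach[(7, 6)]))  # 5, the length of the shortest edit script
```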
To be continued....
Reference:
- Eugene W. Myers paper: An O(ND) Difference Algorithm and Its Variations
- Myers Diff paper -- 2