Myers Diff paper -- 0

Declaration: Here is the original paper link. The paper was written by Eugene W. Myers in 1986. I (j_now; you may also call me cydooc) translated parts of the paper into Chinese and added a few annotations. If this infringes your copyright, please leave me a message and I will remove this post as soon as possible. Thanks.


Let A and B be strings, with A = "abcabba" and B = "cbabac". Write N = len(A) and M = len(B), where len is the function that returns the length of a string.

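Definition (LCS): Longest Common Subsequence, not to be confused with Longest Common Substring. In this example the longest common subsequences include "caba" and "baba", whereas the longest common substrings are "ab" and "ba"; i.e. a substring must consist of consecutive characters, while a subsequence need not. Computing a longest common subsequence is harder than computing a longest common substring. Below, LCS always stands for Longest Common Subsequence. The LCS is not unique; Myers' algorithm finds only one of them.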

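Definition (SES): Shortest Edit Script. The algorithm outputs a script that transforms file A into file B. Myers' SES uses only two kinds of commands: delete a character from file A, and insert a character from file B. An SES is typically used to create a patch. Other diff implementations may also have a replace command, for example the variant of Myers' algorithm used in Review Board.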

PS: The time complexity of Myers' algorithm is O(D*(M+N)), where D = len(SES).

Lemma: len(A) + len(B) = N + M = 2*len(LCS) + len(SES)
Proof omitted. In the example above, 7 + 6 = 13 = 2*4 + 5.
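The identity can be checked numerically. The sketch below computes len(LCS) and len(SES) independently with two plain dynamic programs (again not Myers' algorithm; `lcs_length` and `ses_length` are illustrative names) and compares both sides.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence (textbook DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(b)]


def ses_length(a: str, b: str) -> int:
    """Minimum number of single-character deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))      # "" -> b[:j] needs j insertions
    for i, ca in enumerate(a, 1):
        cur = [i]                       # a[:i] -> "" needs i deletions
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1])                  # matching characters cost nothing
            else:
                cur.append(1 + min(prev[j], cur[j - 1])) # delete from a or insert from b
        prev = cur
    return prev[len(b)]


A, B = "abcabba", "cbabac"
print(len(A) + len(B), 2 * lcs_length(A, B) + ses_length(A, B))   # 13 13
```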

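Definition (Edit graph): The edit graph is a grid. The characters of A are laid out from left to right along the horizontal (x) axis, and the characters of B from top to bottom along the vertical (y) axis. For every point (x, y) where the x-th character of A equals the y-th character of B, draw a diagonal edge that ends at that point, i.e. from (x-1, y-1) to (x, y).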

Snake: a run of zero or more consecutive diagonal edges is a snake.
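As a rough illustration of diagonal edges and snakes, the sketch below lists the points where diagonal edges end and follows a snake as far as it can go. Note the assumptions: Python strings are 0-based, so the edge from (x, y) to (x+1, y+1) exists when a[x] == b[y], and `follow_snake` always returns the longest possible snake, although any shorter run of diagonal edges is a snake too. Both function names are made up for this example.

```python
def diagonal_edge_ends(a: str, b: str) -> list:
    """Points (x, y) where the x-th char of a equals the y-th char of b (1-based)."""
    return [(x, y)
            for y in range(1, len(b) + 1)
            for x in range(1, len(a) + 1)
            if a[x - 1] == b[y - 1]]


def follow_snake(a: str, b: str, x: int, y: int) -> tuple:
    """Follow diagonal edges from (x, y) as far as possible; return the endpoint."""
    while x < len(a) and y < len(b) and a[x] == b[y]:
        x, y = x + 1, y + 1
    return x, y


A, B = "abcabba", "cbabac"
print(len(diagonal_edge_ends(A, B)))   # 14 diagonal edges in the edit graph
print(follow_snake(A, B, 0, 0))        # (0, 0): a snake of length 0
print(follow_snake(A, B, 3, 2))        # (5, 4): a snake of length 2
```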

The graph above is an artificial version; the original image was generated by the application built by Nicholas Butler. You can download it here.

PS: Computing the LCS with Myers' algorithm amounts to finding a path from (0, 0) to (N, M) that contains as many diagonal edges (snakes) as possible.

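Definition (D-path): A D-path is a path that starts at (0, 0) and contains exactly D non-diagonal edges, i.e. D is the number of horizontal moves plus the number of vertical moves.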

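Recursive definition of a D-path: a D-path consists of a (D-1)-path, followed by one non-diagonal edge e, followed by a snake. Edge e starts at the endpoint of the (D-1)-path, and the snake starts at the endpoint of e.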


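Definition (diagonal k): Diagonal k is the set of vertices (x, y) that satisfy k = x - y, i.e. a line with slope 1 whose intercept on the x-axis is k. Since N = len(A) and M = len(B), k ∈ [-M, N]. [Note: the slope is 1 rather than -1 because the y-axis of the edit graph points downward.]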
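A few lines of Python can illustrate the diagonal numbering and how each kind of edge changes k. The helper names `diagonal_of` and `vertices_on_diagonal` are assumptions made for this example only.

```python
def diagonal_of(x: int, y: int) -> int:
    """Index k of the diagonal that contains the vertex (x, y)."""
    return x - y


def vertices_on_diagonal(k: int, n: int, m: int) -> list:
    """All vertices of an n-by-m edit graph that lie on diagonal k."""
    return [(x, x - k) for x in range(n + 1) if 0 <= x - k <= m]


N, M = 7, 6   # len("abcabba"), len("cbabac")
ks = sorted({diagonal_of(x, y) for x in range(N + 1) for y in range(M + 1)})
print(ks[0], ks[-1])                    # -6 7, i.e. k ranges over [-M, N]
print(vertices_on_diagonal(7, N, M))    # [(7, 0)]

x, y = 3, 1
k = diagonal_of(x, y)
print(diagonal_of(x + 1, y) - k)        # 1: a horizontal edge moves to diagonal k+1
print(diagonal_of(x, y + 1) - k)        # -1: a vertical edge moves to diagonal k-1
print(diagonal_of(x + 1, y + 1) - k)    # 0: a diagonal edge stays on diagonal k
```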

The graph above is an artificial version; the original image was generated by the application built by Nicholas Butler. You can download it here.


Corollary: A snake is a segment of some diagonal k. If its diagonal edges end at the points (x1, y1), (x2, y2) ... (xj, yj), then, with 1-based indexing, A[x1] = B[y1], A[x2] = B[y2] ... A[xj] = B[yj].

Corollary: if v ∈ snake and v ∈ diagonal k, then ∀ v' ∈ snake, v' ∈ diagonal k.


Lemma 1: A D-path must end on diagonal k ∈ { −D, −D+2, . . . D−2, D }.
Proof:
Let endpoint(p) denote the endpoint of a path p and start(p) its starting point.
When D = 0, a 0-path consists solely of diagonal edges and starts at (0, 0), so all of its points lie on diagonal 0.

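Suppose the lemma holds for D = i, i.e. an i-path ends on diagonal k with k ∈ { -i, -i+2, ... i-2, i }.

When D = i+1, by the recursive definition, a D-path consists of a (D-1)-path, a non-diagonal edge e, and a snake:

  • The i-path ends on diagonal k.
  • If e is a horizontal edge (to the right), then endpoint(e) ∈ diagonal k+1; if e is a vertical edge (downward), then endpoint(e) ∈ diagonal k-1.
  • ∵ start(snake) = endpoint(e), endpoint(e) ∈ diagonal k ± 1, and a snake stays on a single diagonal, ∴ endpoint(snake) ∈ diagonal k ± 1.

Hence an (i+1)-path ends on a diagonal in { -i-1, -i+1, ... i-1, i+1 }, which completes the induction.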

The original proof in the paper reads: A 0-path consists solely of diagonal edges and starts on diagonal 0. Hence it must end on diagonal 0. Assume inductively that a D-path must end on diagonal k in { −D, −D+2, . . . D−2, D }. Every (D+1)-path consists of a prefix D-path, ending on say diagonal k, a non-diagonal edge ending on diagonal k+1 or k−1, and a snake that must also end on diagonal k+1 or k−1. It then follows that every (D+1)-path must end on a diagonal in { (−D)±1, (−D+2)±1, . . . (D−2)±1, (D)±1 } = { −D−1, −D+1, . . . D−1, D+1 }. Thus the result holds by induction.
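The lemma can also be checked by brute force on the running example. The sketch below is only a verification aid, not Myers' greedy algorithm: it follows the recursive definition of a D-path literally and collects every possible endpoint. The name `d_path_endpoints` and the 0-based string indexing are my own assumptions.

```python
def d_path_endpoints(a: str, b: str, max_d: int) -> list:
    """levels[d] is the set of all points at which some d-path can end."""
    n, m = len(a), len(b)

    def snake_points(x, y):
        """(x, y) itself plus every point reachable from it along diagonal edges."""
        points = [(x, y)]
        while x < n and y < m and a[x] == b[y]:
            x, y = x + 1, y + 1
            points.append((x, y))
        return points

    levels = [set(snake_points(0, 0))]        # 0-paths: diagonal edges only
    for _ in range(max_d):
        nxt = set()
        for x, y in levels[-1]:               # endpoint of a (D-1)-path
            if x < n:                         # horizontal edge e, then a snake
                nxt.update(snake_points(x + 1, y))
            if y < m:                         # vertical edge e, then a snake
                nxt.update(snake_points(x, y + 1))
        levels.append(nxt)
    return levels


A, B = "abcabba", "cbabac"
levels = d_path_endpoints(A, B, 6)
for d, points in enumerate(levels):
    # Every endpoint lies on a diagonal in {-d, -d+2, ..., d-2, d}.
    assert all((x - y) in range(-d, d + 1, 2) for x, y in points)
print(min(d for d, pts in enumerate(levels) if (len(A), len(B)) in pts))   # 5
```

The last line prints the smallest D for which a D-path reaches (N, M), which is len(SES) for this example.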


In other words, since k ∈ { -D, -D+2 ... D-2, D }: for a D-path with endpoint(D-path) ∈ diagonal k, if D is even then k is also even, and if D is odd then k is also odd.

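Proof: D is the number of horizontal moves plus the number of vertical moves, and k is the intercept. Let x be the number of horizontal moves and y the number of vertical moves, so D = x + y and k = x - y.
If x = 2i+1 and y = 2j+1, then D is even and k = 2i+1 - (2j+1) = 2(i-j), so k is even as well.
If x = 2i+1 and y = 2j, then D is odd and k = 2i+1 - 2j = 2(i-j)+1, so k is odd as well.
If x = 2i and y = 2j, (omitted)
If x = 2i and y = 2j+1, (omitted)

QED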
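The four cases boil down to one observation: D - k = (x + y) - (x - y) = 2y is always even. A quick numeric check over small move counts, with `same_parity` as an illustrative helper name:

```python
def same_parity(horizontal: int, vertical: int) -> bool:
    """True if D = horizontal + vertical and k = horizontal - vertical share parity."""
    d = horizontal + vertical
    k = horizontal - vertical
    return d % 2 == k % 2     # in Python, k % 2 is 0 or 1 even when k is negative


# Check every combination of up to 49 horizontal and 49 vertical moves.
print(all(same_parity(x, y) for x in range(50) for y in range(50)))   # True
```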


According to Lemma 1, we get the following.

Suppose there are two paths from (0, 0) to (x, y), an i-path and a j-path. Then i - j is an even number (page 10, last paragraph).

Because the i-path and the j-path have the same endpoint, both paths end on the same diagonal k. So

  • k ∈ { -i, -i+2, ..., i-2, i }, say k = i - 2m
  • k ∈ { -j, -j+2, ..., j-2, j }, say k = j - 2n

∴ i - 2m = j - 2n, so i - j = 2(m - n) is a multiple of 2. Thus i - j is an even number.
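This consequence can be observed experimentally. The sketch below walks many random paths from (0, 0) to (N, M) in the example edit graph and records how many non-diagonal edges each one uses. The walker `random_path_d` is a hypothetical helper written only for this check, and the seed is arbitrary.

```python
import random


def random_path_d(a: str, b: str, rng: random.Random) -> int:
    """Walk a random path from (0, 0) to (len(a), len(b)); return its D."""
    x = y = d = 0
    while (x, y) != (len(a), len(b)):
        moves = []
        if x < len(a):
            moves.append((1, 0))                      # horizontal edge
        if y < len(b):
            moves.append((0, 1))                      # vertical edge
        if x < len(a) and y < len(b) and a[x] == b[y]:
            moves.append((1, 1))                      # diagonal edge
        dx, dy = rng.choice(moves)
        x, y = x + dx, y + dy
        if (dx, dy) != (1, 1):
            d += 1                                    # only non-diagonal edges count
    return d


A, B = "abcabba", "cbabac"
rng = random.Random(0)
ds = {random_path_d(A, B, rng) for _ in range(500)}
# All paths end at (7, 6) on diagonal k = 1, so every D found is odd.
print(all(d % 2 == 1 for d in ds))   # True
```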
To be continued....

