Difference-Based File Synchronization - The rsync Algorithm

Note: this article is reposted from http://www.samba.org/rsync/tech_report/tech_report.html and is kept here purely for convenient reference. The algorithm synchronizes a file between two different computers: by examining the differences between the two copies, it brings them into sync at the lowest possible transfer cost.

Abstract:

This report presents an algorithm for updating a file on one machine to be identical to a file on another machine. We assume that the two machines are connected by a low-bandwidth high-latency bi-directional communications link. The algorithm identifies parts of the source file which are identical to some part of the destination file, and only sends those parts which cannot be matched in this way. Effectively, the algorithm computes a set of differences without having both files on the same machine. The algorithm works best when the files are similar, but will also function correctly and reasonably efficiently when the files are quite different.

The problem

Imagine you have two files, A and B, and you wish to update B to be the same as A. The obvious method is to copy A onto B.

Now imagine that the two files are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto B will be slow. To make it faster you could compress A before sending it, but that will usually only gain a factor of 2 to 4.

Now assume that A and B are quite similar, perhaps both derived from the same original file. To really speed things up you would need to take advantage of this similarity. A common method is to send just the differences between A and B down the link and then use this list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn't need the differences). This is the problem that rsync addresses.

The rsync algorithm efficiently computes which parts of a source file match some part of an existing destination file. These parts need not be sent across the link; all that is needed is a reference to the part of the destination file. Only parts of the source file which are not matched in this way need to be sent verbatim. The receiver can then construct a copy of the source file using the references to parts of the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any of a range of common compression algorithms, for further speed improvements.

The rsync algorithm

Suppose we have two general purpose computers 1 and 2. Computer 1 has access to a file A and 2 has access to file B, where A and B are "similar". There is a slow communications link between 1 and 2.

The rsync algorithm consists of the following steps:

1. 2 splits the file B into a series of non-overlapping fixed-sized blocks of size S bytes. The last block may be shorter than S bytes.

2. For each of these blocks 2 calculates two checksums: a weak "rolling" 32-bit checksum (described below) and a strong 128-bit MD4 checksum.

3. 2 sends these checksums to 1.

4. 1 searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same weak and strong checksum as one of the blocks of B. This can be done in a single pass very quickly using a special property of the rolling checksum described below.

5. 1 sends 2 a sequence of instructions for constructing a copy of A. Each instruction is either a reference to a block of B, or literal data. Literal data is sent only for those sections of A which did not match any of the blocks of B.

The end result is that 2 gets a copy of A , but only the pieces of A that are not found in B (plus a small amount of data for checksums and block indexes) are sent over the link. The algorithm also only requires one round trip, which minimises the impact of the link latency.
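To make steps 1 and 2 concrete, here is a minimal Python sketch of the signature generation on machine 2. It is illustrative rather than rsync's actual source: zlib.adler32 stands in for the paper's own rolling checksum (defined in the next section), and MD5 stands in for the 128-bit MD4 checksum, since Python's hashlib does not reliably provide MD4.

```python
import hashlib
import zlib

S = 700  # block size in bytes; an arbitrary choice for this sketch

def signature(data: bytes, s: int = S):
    """Steps 1-2: split B into fixed-size blocks and checksum each block."""
    sig = []
    for off in range(0, len(data), s):
        block = data[off:off + s]             # the last block may be shorter
        weak = zlib.adler32(block)            # stand-in for the rolling checksum
        strong = hashlib.md5(block).digest()  # stand-in for the MD4 checksum
        sig.append((weak, strong))
    return sig  # step 3: this list is what 2 sends to 1
```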

The most important details of the algorithm are the rolling checksum and the associated multi-alternate search mechanism which allows the all-offsets checksum search to proceed very quickly. These will be discussed in greater detail below.

Rolling checksum

The weak rolling checksum used in the rsync algorithm needs to have the property that it is very cheap to calculate the checksum of a buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots X_n$ and the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired by Mark Adler's adler-32 checksum. Our checksum is defined by

$$a(k,l) = \Big(\sum_{i=k}^{l} X_i\Big) \bmod M$$

$$b(k,l) = \Big(\sum_{i=k}^{l} (l-i+1)\,X_i\Big) \bmod M$$

$$s(k,l) = a(k,l) + 2^{16}\, b(k,l)$$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$. For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can be computed very efficiently using the recurrence relations

$$a(k+1,l+1) = \big(a(k,l) - X_k + X_{l+1}\big) \bmod M$$

$$b(k+1,l+1) = \big(b(k,l) - (l-k+1)\,X_k + a(k+1,l+1)\big) \bmod M$$

Thus the checksum can be calculated for blocks of length S at all possible offsets within a file in a "rolling" fashion, with very little computation at each point.
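In Python, the definition and its O(1) sliding update might look like the following sketch (the function names are ours, not rsync's):

```python
M = 1 << 16

def weak_checksum(block: bytes):
    """Compute s(k,l), a(k,l), b(k,l) for a whole block, per the formulas above."""
    a = b = 0
    n = len(block)
    for i, x in enumerate(block):
        a = (a + x) % M
        b = (b + (n - i) * x) % M  # byte i carries weight l-i+1 = n-i
    return a + (b << 16), a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, n: int):
    """Slide an n-byte window right by one byte in O(1), per the recurrences."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a + (b << 16), a, b
```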

Despite its simplicity, this checksum was found to be quite adequate as a first level check for a match of two file blocks. We have found in practice that the probability of this checksum matching when the blocks are not equal is quite low. This is important because the much more expensive strong checksum must be calculated for each block where the weak checksum matches.

Checksum searching

Once 1 has received the list of checksums of the blocks of B, it must search A for any blocks at any offset that match the checksum of some block of B. The basic strategy is to compute the 32-bit rolling checksum for a block of length S starting at each byte of A in turn, and for each checksum, search the list for a match. To do this our implementation uses a simple 3 level searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and a $2^{16}$ entry hash table. The list of checksum values (i.e., the checksums from the blocks of B) is sorted according to the 16-bit hash of the 32-bit rolling checksum. Each entry in the hash table points to the first element of the list for that hash value, or contains a null value if no element of the list has that hash value.

At each offset in the file the 32-bit rolling checksum and its 16-bit hash are calculated. If the hash table entry for that hash value is not a null value, the second level check is invoked.

The second level check involves scanning the sorted checksum list starting with the entry pointed to by the hash table entry, looking for an entry whose 32-bit rolling checksum matches the current value. The scan terminates when it reaches an entry whose 16-bit hash differs. If this search finds a match, the third level check is invoked.

The third level check involves calculating the strong checksum for the current offset in the file and comparing it with the strong checksum value in the current list entry. If the two strong checksums match, we assume that we have found a block of A which matches a block of B . In fact the blocks could be different, but the probability of this is microscopic, and in practice this is a reasonable assumption.
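A sketch of the three levels in Python, continuing the hypothetical helpers above (a dict of lists stands in for the paper's sorted list plus hash table, and MD5 again stands in for MD4):

```python
import hashlib

def build_table(sig):
    """Level 1: bucket the signature entries by a 16-bit hash of the weak checksum."""
    table = {}
    for index, (weak, strong) in enumerate(sig):
        table.setdefault(weak & 0xFFFF, []).append((weak, strong, index))
    return table

def find_match(table, weak: int, window: bytes):
    """Levels 1-3: hash lookup, then weak comparison, then strong comparison."""
    strong = None
    for w, s_, index in table.get(weak & 0xFFFF, ()):  # level 1: hash bucket
        if w != weak:                                  # level 2: 32-bit weak value
            continue
        if strong is None:                             # level 3: computed lazily,
            strong = hashlib.md5(window).digest()      # only when the weak value hits
        if s_ == strong:
            return index                               # matching block of B
    return None
```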

When a match is found, 1 sends 2 the data in A between the current file offset and the end of the previous match, followed by the index of the block in B that matched. This data is sent immediately a match is found, which allows us to overlap the communication with further computation.

If no match is found at a given offset in the file, the rolling checksum is updated to the next offset and the search proceeds. If a match is found, the search is restarted at the end of the matched block. This strategy saves a considerable amount of computation for the common case where the two files are nearly identical. In addition, it would be a simple matter to encode the block indexes as runs, for the common case where a portion of A matches a series of blocks of B in order.
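Putting the pieces together, the sender's scan over A could look like the following sketch. It assumes the table was built from weak_checksum values (not the adler-32 stand-in used earlier), and it buffers the instruction list where the real implementation streams each instruction as soon as it is produced:

```python
def generate_delta(a: bytes, table, s: int):
    """Scan A at every offset, emitting ('block', index) and ('literal', data)."""
    delta, lit_start, off = [], 0, 0
    if len(a) >= s:
        _, wa, wb = weak_checksum(a[:s])
    while off + s <= len(a):
        idx = find_match(table, wa + (wb << 16), a[off:off + s])
        if idx is not None:
            if lit_start < off:
                delta.append(("literal", a[lit_start:off]))  # unmatched bytes
            delta.append(("block", idx))                     # reference into B
            off += s                                         # restart after the match
            lit_start = off
            if off + s <= len(a):
                _, wa, wb = weak_checksum(a[off:off + s])
        else:
            if off + s < len(a):                             # roll to the next offset
                _, wa, wb = roll(wa, wb, a[off], a[off + s], s)
            off += 1
    if lit_start < len(a):
        delta.append(("literal", a[lit_start:]))             # trailing literal data
    return delta
```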

Pipelining

The above sections describe the process for constructing a copy of one file on a remote system. If we have several files to copy, we can gain a considerable latency advantage by pipelining the process.

This involves 2 initiating two independent processes. One of the processes generates and sends the checksums to 1 while the other receives the difference information from 1 and reconstructs the files.

If the communications link is buffered then these two processes can proceed independently and the link should be kept fully utilised in both directions for most of the time.

 

 

Another reposted article (placed alongside the first purely for comparison):

The Ideas Behind the rsync Algorithm: Starting from Polling and Interrupts

 

Almost every Linux expert and ordinary administrator uses rsync, yet hardly anyone has really understood it. I don't say that to mock people for not digging deeper: as a user, there is genuinely no need to spend time understanding a tool down to the bottom; if that were required, you might as well work out every solution from first principles whenever you hit a problem, and theorems and formulas would be useless. Ours is an age of socialised production; nobody has to do everything themselves any more, and doing your own job well, each playing their own part, is what makes life work. Knowing how to use the rsync program is enough, but savouring the ideas behind rsync over tea is still great fun. I googled for almost a whole afternoon without finding any good material, until at last I found the original paper by rsync's author. It was superb, exactly what I wanted; reading it went down like good liquor, smooth at the lips and warm all the way down...

Suppose you have two identical file systems in different places; keeping them synchronised when needed is clearly necessary. The simplest method is to copy the source file system straight onto the backup, but that is the irresponsible way, and it wastes enormous resources. Take a 10 GB file to which I appended a single character: the best sync strategy is to append one character to that file on the backup system too. Easy to say, hard to do, and not merely hard but very hard! Why? It sounds simple: I modify file A, and the system tells the backup that A was modified and how it was modified; done. If the system really could do that, nothing would be better, and it would indeed be simple: an ordinary notification scheme, the observer pattern, would handle it. The crux is how to detect that someone modified file A in the first place.

Yes, you can use inotify, just as in my own recent work, but then you run into the problem of missed events; and even if you solve that, the system only learns that file A was modified. How it was modified, and by whom, remain unknown. Why doesn't the kernel's inotify implementation notify the user of the modified content as well? Simple: that would be very complex. A successful operating system must not favour any user policy, nor help implement any user policy; it only provides users a minimal set of interfaces. Why should it broadcast the modified contents of one particular file? Is that necessary? Who would it serve, besides backup mirroring systems and a few ill-intentioned programs? Providing inotify at all was already a stretch. My view is that inotify was a mistake from the start: the file system has no business using interrupts to notify user space of everything, because that is too much "policy". I think the inotify mechanism would be better implemented on top of signals, which would be more uniform. In Linux, user space has frighteningly many ways to learn about kernel activity, when two kinds of interface would suffice: by the theory of operating system design, one polling mechanism and one interrupt mechanism can solve everything. For monitoring file changes, inotify is the interrupt approach; I would rather see the interrupt approach unified onto signals, and I wish Linux implemented a proper interface of its own for polling.

The interrupt approach looks very efficient: in theory it incurs no performance loss and does no useless work, since processing happens only when the event occurs. Unfortunately, in Linux the interrupt-style notification of file changes may miss events, and even when it doesn't, the interrupt is under no obligation to tell the application what exactly changed; the application still has to work that out itself. What about polling, then? Polling works! And how do you implement it? There is a ready-made implementation: rsync.

rsync's strength is that it transfers only the parts of a file that were actually changed: if only a small part changed, only that part needs to be sent. That tells you its defining feature: it must have a highly efficient algorithm for discovering which parts of a file have changed. As an alternative, you could of course implement it like this: split the file into blocks, and every so often send, for each block of the backup system's file, the first few characters, the last few characters, and a checksum of the whole block over to the source system. That avoids transmitting most of the middle of each block, but it is still slow, and precise verification is awkward. Or split the file in half and compare by bisection, half at a time. Not bad either, but it only suits the case where characters are changed in place. Imagine prepending one character at the very front of the file: every following character shifts back by one position, so no block's checksum will match, and in the end the whole file must be transferred.
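A tiny demonstration (ours, not the article's) makes this alignment problem concrete:

```python
import zlib

def fixed_block_sums(data: bytes, s: int = 4):
    """Checksum each fixed-position block, the way the naive scheme would."""
    return [zlib.adler32(data[i:i + s]) for i in range(0, len(data), s)]

old = b"abcdefghijkl"
new = b"X" + old                    # prepend a single byte
print(fixed_block_sums(old))        # checksums of the original blocks
print(fixed_block_sums(new))        # every block has shifted; nothing lines up
```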

Splitting into blocks is unavoidable; the question is how to verify them. So the author of the rsync algorithm hit on transferring only checksums. The most naive way: split a file into n blocks of size s, compute each block's checksum, and transfer those checksums to where the source file lives. Does the source side likewise split its file into n fixed parts of size s? No! Because a file may have data appended at the end, characters added at the front, or deletions in the middle, the source should take a block of size s starting at every possible offset and compare it against the received checksums one by one. If an identical one is found, that block is unchanged, and scanning continues from the position just past the end of that block, which saves a great many computations. Very good, but stop! Things tend to go wrong just when it is time to celebrate: how efficient is this, if a checksum must be computed at every offset of the source file? If the checksum is MD5, that is expensive, so some kind of filter is needed: use the simplest possible test to detect a non-match, so the expensive checksum need not be computed at all; and once the simple test matches, the probability that the expensive checksum also matches will be high. In other words, the cheap test gives advance knowledge of how likely a match is: where the probability is high, compute the expensive checksum (it may still fail to match, but a match is likely); where it is low, don't compute it at all, because the algorithm knows it is certain not to match. Hence the rsync algorithm adopts two checksums, one weak and one strong. "Weak" means different blocks may yield the same checksum value; "strong" means different blocks will, for all practical purposes, never yield the same value. Note that the weak checksum is only a filter: it screens out blocks that cannot possibly match, but it cannot by itself establish a successful match. Since its purpose is to block meaningless strong-checksum computations, the weak checksum must itself be cheap, otherwise you might as well compute the strong checksum directly.

At this point things become clear: what we need is a highly efficient way to compute the weak checksum. The author chose rolling checksums. A rolling checksum refers to the many checksums over a sequence, where each value can be obtained in two ways: the ordinary way, by evaluating the function from scratch, or from the checksum of the window one element earlier in the sequence. A rolling checksum has a recursive flavour; indeed it is recursion, and very simple recursion at that. Here are the formulas:

$$a(k,l) = \Big(\sum_{i=k}^{l} X_i\Big) \bmod M$$

$$b(k,l) = \Big(\sum_{i=k}^{l} (l-i+1)\,X_i\Big) \bmod M$$

$$s(k,l) = a(k,l) + 2^{16}\, b(k,l)$$

$$a(k+1,l+1) = \big(a(k,l) - X_k + X_{l+1}\big) \bmod M$$

$$b(k+1,l+1) = \big(b(k,l) - (l-k+1)\,X_k + a(k+1,l+1)\big) \bmod M$$

In the formulas above, $s$ is the final checksum.

A secondary-school student could come up with this algorithm: the checksum computation snowballs along, and it looks very simple! With this simple, fast test, rsync rapidly discards the great majority of block matches that cannot possibly succeed, which is very efficient. Up to now I have been avoiding the question of block size: what value is best? There is a time/space trade-off here. If you want the minimum network traffic, transferring only the changed part, not one byte more or less (not less, certainly; less would mean a broken algorithm), then choose a block size of 1 and verify character by character. But is that worthwhile? A checksum is larger than a single character... If you choose something bigger, say 5, then a 13-byte file with one character appended becomes 14 bytes; matching will find the first 10 characters, and the actual transfer will be 4 characters, bytes 11 through 14, which fall inside the backup file's third block. So choosing the block size is itself a policy. The rsync program has a -B parameter for the block size, which you can tune: pick it smaller for small files, and larger for big files with small changes. With a smaller block, less redundant data is transferred, but more time goes into computing checksums, and vice versa.
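For reference, the stock rsync program exposes this knob as -B/--block-size; the paths and the value below are purely illustrative:

```
rsync -av --block-size=2048 /data/bigfile backup:/data/bigfile
```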

With the algorithm understood, how do you implement it? Essentially as a three-level comparison. After the source file system receives the checksums, it builds a hash table from each block's checksum. Then, as the offset advances through the source file, it first looks in the hash table; on a hit it walks the chain hanging off that hash entry and compares the weak checksum; if the weak checksum passes, it computes the strong checksum; if the strong checksum passes too, that block of the source and backup files is identical and needs no synchronisation, and scanning continues at the first offset past this block. If they differ, scanning moves on to the next offset. And whenever a match succeeds, the data lying between the previous successful match and this block is transmitted.
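The receiving side's job is the mirror image: it replays the instruction stream against its old copy of the file. A minimal sketch, assuming the ('literal', bytes) / ('block', index) instruction format from the earlier sketches and block size s:

```python
def reconstruct(old_file: bytes, delta, s: int) -> bytes:
    """Rebuild the new file from literal data and references into the old file."""
    out = bytearray()
    for kind, value in delta:
        if kind == "literal":
            out += value  # data that crossed the link verbatim
        else:             # "block": copy block `value` of the old file locally
            out += old_file[value * s:(value + 1) * s]
    return bytes(out)
```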

As you can see, the essence of the rsync algorithm is its double-checksum scheme, with the weak checksum computed by the efficient rolling method: given the checksum of the previous window, the checksum of the next window can be derived very quickly.
