Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression

Abstract: Remote Differential Compression (RDC) protocols can effectively update files over a limited-bandwidth network when two sites have roughly similar files; neither site needs to know the content of the other's files a priori. We present a heuristic approach to identify and transfer the file differences that is based on finding similar files, subdividing the files into chunks, and comparing chunk signatures. Our work significantly improves upon previous protocols such as LBFS and RSYNC in three ways. Firstly, we present a novel algorithm to efficiently find the client files that are the most similar to a given server file. Our algorithm requires 96 bits of metadata per file, independent of file size, and thus allows us to keep the metadata in memory and eliminate the need for expensive disk seeks. Secondly, we show that RDC can be applied recursively to signatures to reduce the transfer cost for large files. Thirdly, we describe new ways to subdivide files into chunks that identify file differences more accurately. We have implemented our approach in DFSR, a state-based multimaster file replication service shipping as part of Windows Server 2003 R2. Our experimental results show that similarity detection produces results comparable to LBFS while incurring a much smaller overhead for maintaining the metadata. Recursive signature transfer further increases replication efficiency by up to several orders of magnitude.

Introduction

The first contribution is a novel and very efficient way for a client to locate a set of files that are likely to be similar to the file that needs to be transferred from a server.

The second contribution is that the LBFS RDC protocol can be applied recursively, by treating the signatures generated as a result of chunking as a new input to the protocol. (In other words, run RDC once more over the signatures produced by chunking?)

The third contribution is a chunking algorithm (the local maxima chunking algorithm) that identifies file differences more accurately, and an analysis of its quality.

The basic RDC protocol

Using Similarity Detection to find RDC Candidates (this problem is really about how to find which client files to compare against the server file)

(1) Finding RDC candidates

In this section, we describe a technique that allows the client C to efficiently select a small subset of its files (Fc1, Fc2, ... Fcn) that are similar to a file Fs that needs to be transferred from S using the RDC protocol. (This set usually contains fewer than 10 files.)

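The similarity of two files can be defined by the number of chunks they have in common or, equivalently, by the number of chunk signatures they have in common.

The main problem is finding the most similar files: the problem we need to solve is to identify the files in Files that have the highest degree of similarity with Fs, that is, those with Sim(Fs, Fci) >= s, where s is a threshold.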
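The exact formula for Sim is not given in these notes. As a minimal sketch, assuming a Jaccard-style ratio (shared distinct signatures divided by all distinct signatures), the threshold test could look as follows. The function names and the ratio itself are assumptions for illustration, not the paper's definition.

```python
def similarity(sigs_a, sigs_b):
    """Assumed similarity measure: shared distinct signatures over all distinct signatures."""
    a, b = set(sigs_a), set(sigs_b)
    if not a and not b:
        return 1.0  # two empty files are treated as identical
    return len(a & b) / len(a | b)


def files_above_threshold(server_sigs, client_files, s):
    """Return the names of client files whose similarity with Fs is at least s.

    client_files maps a file name to that file's list of chunk signatures.
    """
    return [name for name, sigs in client_files.items()
            if similarity(server_sigs, sigs) >= s]
```

Computing this directly requires the full signature sets of both sides, which is exactly the cost the trait-based heuristic in the next subsection avoids.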

(2) Using traits to encode similarity information

File similarity can be approximated with the following heuristic, which makes use of a compact summary of a file's signatures called its set of traits. The traits can be cached as part of the file metadata.

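The problem is solved in eight steps:

1. The client sends a request for Fs to the server.

2. S sends Traits(Fs) to the client.

3. The client uses Traits(Fs) to identify a set of similar files.

4. The client divides the similar files into chunks and computes the signature of each chunk.

5. The server divides Fs into chunks, computes their signatures, and sends them to the client as ((Sigs1, Lens1), ...).

6. The client receives the signatures and records which chunk corresponds to each distinct signature.

7. The client requests the corresponding data chunks, that is, the ones whose signatures it could not match locally.

8. The server sends the requested data chunks.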
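Steps 4 to 8 can be sketched in a single process as below. This is a simplified illustration under stated assumptions: chunks have a fixed length of 4 bytes (real RDC uses content-defined boundaries), signatures are truncated SHA-256 digests, and `replicate` and its helpers are hypothetical names rather than anything from DFSR.

```python
import hashlib

CHUNK = 4  # assumed fixed chunk length, for illustration only


def chunk(data: bytes):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def signature(c: bytes) -> bytes:
    return hashlib.sha256(c).digest()[:8]


def replicate(server_file: bytes, similar_client_files):
    """Rebuild server_file on the client; return (result, chunks sent over the wire)."""
    # Step 4: the client chunks its similar files and indexes chunks by signature.
    local = {}
    for f in similar_client_files:
        for c in chunk(f):
            local[signature(c)] = c
    # Step 5: the server chunks Fs and sends (signature, length) pairs.
    server_chunks = chunk(server_file)
    sig_list = [(signature(c), len(c)) for c in server_chunks]
    # Steps 6-7: the client finds the signatures it cannot satisfy locally.
    missing = [i for i, (s, _) in enumerate(sig_list) if s not in local]
    # Step 8: the server sends only the requested chunks.
    received = {i: server_chunks[i] for i in missing}
    # The client reassembles Fs from received and local chunks.
    result = b"".join(
        received[i] if i in received else local[s]
        for i, (s, _) in enumerate(sig_list)
    )
    return result, len(missing)
```

For example, `replicate(b"aaaabbbbccccdddd", [b"aaaabbbbXXXXdddd"])` rebuilds the server file while transferring only the one chunk the client lacks.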

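(3) Computing the set of traits for a file

Traits(F) is derived from the chunk signatures of F, as follows:

1. Apply t hash functions to the file's entire signature set to obtain IS1, IS2, ..., ISt.

2. Compute the pre-traits PT1, PT2, ..., PTt.

3. Select b bits from each of PT1 through PTt. Experiments show that (b=6, t=16) and (b=4, t=24) both work well.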
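The three steps above can be sketched as follows. The notes do not say how each pre-trait is derived from its image set or which bits are kept. This sketch assumes the pre-trait is the minimum hash value (min-wise hashing) and that the low-order b bits are selected. The hash family (SHA-256 salted with the function index) is also an assumption.

```python
import hashlib


def hash_i(i: int, signature: bytes) -> int:
    """i-th member of a hypothetical hash family: SHA-256 salted with the index i."""
    digest = hashlib.sha256(i.to_bytes(4, "big") + signature).digest()
    return int.from_bytes(digest[:8], "big")


def traits(signatures, t=16, b=6):
    """Return t traits of b bits each for a non-empty collection of chunk signatures."""
    mask = (1 << b) - 1
    result = []
    for i in range(t):
        image_set = [hash_i(i, s) for s in signatures]  # step 1: IS_i
        pre_trait = min(image_set)                      # step 2: PT_i (assumed min-wise)
        result.append(pre_trait & mask)                 # step 3: keep b bits
    return result
```

With either (b=6, t=16) or (b=4, t=24) the result occupies t * b = 96 bits, which matches the per-file metadata size quoted in the abstract.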

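(4) Computing the pre-traits efficiently

Compare the high-order bytes first, and only then the low-order bytes.

(5) Finding similar files using a given set of traits

For the parameters (b=6, t=16), the lower bound is 5/16, with a false positive rate of one in three hundred thousand. 5/16 can therefore be used as the cutoff threshold.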
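The quoted false positive rate can be sanity-checked with a short calculation. The sketch below assumes that each trait of two unrelated files matches independently with probability 2^-b, so the number of matching traits follows a binomial distribution. That model is an assumption made here, not a derivation taken from the paper.

```python
from math import comb


def false_positive_rate(t=16, b=6, min_matches=5):
    """Probability that two unrelated files agree on at least min_matches of t traits."""
    p = 2.0 ** -b  # assumed chance that a single b-bit trait matches by accident
    return sum(
        comb(t, k) * p**k * (1 - p) ** (t - k)
        for k in range(min_matches, t + 1)
    )


rate = false_positive_rate()
print(f"about one in {1 / rate:,.0f}")
```

Under this model the result for (b=6, t=16) with a 5/16 cutoff is on the order of one in three hundred thousand, consistent with the figure above.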

To improve both precision and recall, we could increase the total number of bits. For instance, switching to (b=5, t=24) would dramatically improve precision at the cost of increasing memory consumption for file traits.
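The lookup in this subsection can be sketched as a scan over the in-memory traits of all client files. The function names are hypothetical. The defaults `min_matches=5` and `limit=10` are assumptions taken from the 5/16 cutoff and the note that the candidate set usually has fewer than 10 files.

```python
def matching_traits(a, b):
    """Count the positions at which two trait vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def similar_files(server_traits, client_traits, min_matches=5, limit=10):
    """Return up to `limit` client file names, best match first.

    client_traits maps a file name to that file's trait vector.
    """
    hits = []
    for name, tr in client_traits.items():
        m = matching_traits(server_traits, tr)
        if m >= min_matches:
            hits.append((m, name))
    hits.sort(key=lambda pair: (-pair[0], pair[1]))  # most matches first, then by name
    return [name for _, name in hits[:limit]]
```

Because each file contributes only a small fixed-size trait vector, the whole table fits in memory and the scan needs no disk access.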

Recursive RDC Signature transfer

For instance, assuming the size of Fs is 10 GB and the average chunk size is 2 KB, Fs will be divided into 5 million chunks, corresponding to about 60 MB of signature information that needs to be sent over the network.
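The arithmetic behind these figures is worked through below. It assumes decimal units and 12 bytes of signature-and-length data per chunk; the 12-byte figure is inferred from 60 MB divided by 5 million chunks and is not stated in the notes.

```python
def signature_overhead(file_size, avg_chunk_size, bytes_per_entry=12):
    """Return (number of chunks, bytes of signature data) for one file."""
    chunks = file_size // avg_chunk_size
    return chunks, chunks * bytes_per_entry


# 10 GB file, 2 KB average chunk size, decimal units assumed.
chunks, sig_bytes = signature_overhead(10 * 10**9, 2 * 10**3)
print(chunks, sig_bytes)
```

Sending tens of megabytes of signatures before any data moves is the cost that recursive signature transfer is meant to reduce.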

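1. The server chunks its signature list again and computes signatures over those chunks, producing a list of recursive signatures and lengths.

2. The client likewise first computes the level-1 signatures locally, then computes recursive signatures over them, compares these with the recursive signatures received from the server, and reports the ones that do not match to the server.

3. The server sends the level-1 signatures covered by the non-matching recursive signatures.

4. The client can then continue with the regular comparison.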
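One level of this recursion can be sketched as below. For brevity the signature list is cut into fixed groups of `n` entries, whereas the protocol would chunk the signature list with the same content-defined method used for file data. The group size and all function names are assumptions for illustration.

```python
import hashlib


def sig(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:12]


def groups(sigs, n):
    """Cut a signature list into consecutive groups of n entries (assumed fixed size)."""
    return [sigs[i:i + n] for i in range(0, len(sigs), n)]


def recursive_signatures(sigs, n):
    """One recursive signature per group of level-1 signatures."""
    return [sig(b"".join(g)) for g in groups(sigs, n)]


def level1_signatures_to_send(server_sigs, client_sigs, n=4):
    """Level-1 signatures the server must still send after the recursive comparison."""
    client_recursive = set(recursive_signatures(client_sigs, n))  # step 2
    needed = []
    for g in groups(server_sigs, n):                              # step 1
        if sig(b"".join(g)) not in client_recursive:              # mismatch found in step 2
            needed.extend(g)                                      # step 3
    return needed
```

When only one level-1 signature differs, the server resends just the group that contains it instead of the whole signature list.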

Improved Chunking Algorithms

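(1) Slack is essentially about how the minimum chunk length is handled: within the slack range, chunking is not strictly content-defined (CDC).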

(2) An interval filter is obtained by partitioning the hash values v into two sets, H (head) and T (tail). The average chunk size when using an interval filter can be shown to be e*h, while the slack is 0.77.

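(3) Using local maxima. An h-local maximum is a position whose hash value v is strictly larger than the hash values at the preceding h and following h positions. The probability that a given position is a cut-point is 1/(2h+1), assuming the hash values are independent, identically distributed, and ties are negligible: each of the 2h+1 positions in the window is then equally likely to hold the largest value.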

We can use the above observation to compute cut points by examining every position only 1 + ln(h)/h times on average.
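The definition of an h-local maximum translates directly into the naive scan below. It is a sketch of the definition only: it inspects up to 2h neighbours per position and so does not achieve the 1 + ln(h)/h bound, and it assumes that positions without a full window on both sides are never cut points. The function name is hypothetical.

```python
def local_max_cut_points(hashes, h):
    """Return the positions whose hash value is strictly larger than that of
    the h preceding and the h following positions."""
    cuts = []
    for i in range(h, len(hashes) - h):
        v = hashes[i]
        window = range(i - h, i + h + 1)
        if all(v > hashes[j] for j in window if j != i):
            cuts.append(i)
    return cuts


print(local_max_cut_points([1, 5, 2, 3, 9, 4, 0, 7, 6], 2))  # prints [4]
```

Two cut points found this way are always more than h positions apart, because each must exceed every hash value within distance h. This gives a built-in minimum chunk length.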

Implementation in DFSR

This section gives a rough description of how files are transferred between machines using RDC.

Experimental Results

Compared against: an LBFS-like approach, RSYNC, and the local diff utilities xdelta and BSDiff.

The experiments mainly compare the following:

(1) Reducing the overhead of computing chunks

RDC and RSYNC appear comparable, while the local diff utilities require much more CPU and memory.

(2) Tuning the chunking parameters and the recursion level

Tune the parameters to find w and h.

(3) Evaluating similarity detection

Compare the parameters used for similarity detection.

(4) Deployment example
