LZ4 explained

Reference material:

http://fastcompression.blogspot.com/2011/05/lz4-explained.html

The article has three main parts:

1) From the beginning up to "Regarding the way LZ4 searches and finds matches...": using the diagram, it analyzes the structure of an LZ4 sequence and the meaning of each field.

2) From "Regarding the way LZ4 searches and finds matches..." up to the final "Note": it describes how LZ4 searches for matches, and how the decoder decodes them.

3) The "Note" part: this article describes the raw format produced by LZ4 compression; to make the compressed data usable by other programs that understand LZ4, the raw format has to be wrapped into another format, which is covered in http://fastcompression.blogspot.fr/2013/04/lz4-streaming-format-final.html

LZ4 explained


At popular request, this post tries to explain the LZ4 inner workings, in order to allow any programmer to develop their own version, potentially using a language other than the one provided on Google Code (which is C).

The most important design principle behind LZ4 has been simplicity. It keeps the code simple and the execution fast.

Let's start with the compressed data format.

The compressed block is composed of sequences.
Each sequence starts with a  token.
The token is a one-byte value, separated into two 4-bit fields (which therefore range from 0 to 15).
The first field uses the 4 high bits of the token, and indicates the length of literals. If it is 0, then there is no literal. If it is 15, then we need to add some more bytes to indicate the full length. Each additional byte then represents a value of 0 to 255, which is added to the previous value to produce a total length. When the byte value is 255, another byte follows.
There can be any number of such bytes following the token; there is no "size limit". As a side note, this is the reason why a non-compressible input data block can be expanded by up to 0.4%.
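To make this concrete, here is a minimal sketch of how a decoder might read the token and the literal length. It is not the reference implementation; the function and variable names are illustrative only, and the same helper can be reused later for the match length.

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one length field. 'nibble' is the 4-bit value taken from the token;
       '*ip' points at the optional extra length bytes and is advanced past them. */
    static size_t decode_length(size_t nibble, const uint8_t **ip)
    {
        size_t len = nibble;
        if (nibble == 15) {              /* 15 means "more bytes follow"       */
            uint8_t b;
            do {
                b = *(*ip)++;
                len += b;                /* each additional byte adds 0 to 255 */
            } while (b == 255);          /* 255 means yet another byte         */
        }
        return len;
    }

    /* Usage for the literal length (high 4 bits of the token):
           uint8_t token   = *ip++;
           size_t  lit_len = decode_length(token >> 4, &ip);
       The next lit_len bytes are literals, copied verbatim to the output.     */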

Following the token and optional literal length bytes are the literals themselves. Literals are uncompressed bytes, to be copied as-is.
There are exactly as many of them as the previously decoded length of literals indicates. It is possible that there are zero literals.

Following the literals is the offset. This is a 2-byte value, between 0 and 65535. It represents the position of the match to be copied from. Note that 0 is an invalid value, never used. 1 means "current position - 1 byte". 65536 cannot be coded, so the maximum offset value is really 65535. The value is stored using "little endian" format.

Then we need to extract the matchlength. For this, we use the second token field, a 4-bit value, from 0 to 15. There is a base length to apply, which is the minimum length of a match, called minmatch. This minimum is 4. As a consequence, a value of 0 means a match length of 4 bytes, and a value of 15 means a match length of 19+ bytes.
Similar to the literal length, on reaching the highest possible value (15), additional bytes are output, one at a time, with values ranging from 0 to 255. They are added to the total to provide the final matchlength. A 255 value means there is another byte to read and add. There is no limit to the number of optional bytes that can be output this way (this points towards a maximum achievable compression ratio of ~250).
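Continuing the sketch above (still illustrative, not the reference decoder, and reusing its decode_length() helper), the offset and matchlength could be read like this:

    #include <stddef.h>
    #include <stdint.h>

    #define MINMATCH 4   /* a match length field of 0 means a 4-byte match */

    static size_t decode_length(size_t nibble, const uint8_t **ip);  /* from the previous sketch */

    /* Read the 2-byte little-endian offset, then the match length taken from
       the low 4 bits of the token (plus its optional extra bytes). */
    static void read_match(const uint8_t **ip, uint8_t token,
                           size_t *offset, size_t *matchlen)
    {
        const uint8_t *p = *ip;
        *offset = (size_t)p[0] | ((size_t)p[1] << 8);            /* 1..65535, 0 is invalid */
        *ip = p + 2;
        *matchlen = MINMATCH + decode_length(token & 0x0F, ip);  /* 0 => 4, 15 => 19+      */
    }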

With the  offset and the  matchlength, the decoder can now proceed to copy the repetitive data from the already decoded buffer. Note that it is necessary to pay attention to  overlapped copy, when  matchlength > offset (typically when there are numerous consecutive zeroes).
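The copy itself can be sketched as follows (again purely illustrative; the reference decoder uses faster copy strategies). A plain memcpy() is not safe here, because when matchlength > offset the source and destination regions overlap forward; a byte-by-byte copy naturally replicates the repeated pattern, and with offset 1 it behaves like RLE.

    #include <stddef.h>
    #include <stdint.h>

    /* Copy 'matchlen' bytes from 'offset' bytes back in the already-decoded
       output. Works even when matchlen > offset (overlapped copy). */
    static void copy_match(uint8_t *op, size_t offset, size_t matchlen)
    {
        const uint8_t *match = op - offset;
        while (matchlen--)
            *op++ = *match++;
    }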

By decoding the  matchlength, we reach the end of the sequence, and start another one.

Graphically, the sequence looks like this :

[Figure: layout of an LZ4 sequence — token (1 byte) | optional literal length bytes | literals | offset (2 bytes) | optional match length bytes]



Note that the last sequence stops right after the literals field.

There are specific parsing rules to respect to be compatible with the reference decoder :
1) The last 5 bytes are always literals
2) The last match cannot start within the last 12 bytes
Consequently, a file with less than 13 bytes can only be represented as literals
These rules are in place to benefit speed and ensure buffer limits are never crossed.
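As an illustration only, here is a sketch of how an encoder could gate match emission according to the rules above. The constant names are chosen to mirror those rules; everything else (function name, parameters) is an assumption for the example, not the reference encoder.

    #include <stddef.h>

    #define MFLIMIT      12   /* no match may start within the last 12 bytes      */
    #define LASTLITERALS 5    /* the last 5 bytes are always emitted as literals,
                                 which the encoder enforces when writing the tail */

    /* May a match start at input position 'pos' of a 'srcSize'-byte block?
       Since positions srcSize-12 .. srcSize-1 are off-limits, inputs of fewer
       than 13 bytes end up encoded as literals only. */
    static int match_allowed(size_t pos, size_t srcSize)
    {
        return (srcSize > MFLIMIT) && (pos + MFLIMIT < srcSize);
    }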

Regarding the way LZ4 searches and finds matches, note that there is no restriction on the method used. It could be a full search, using advanced structures such as MMC, BST, or standard hash chains; a fast scan; a 2D hash table; or whatever else. Advanced parsing can also be achieved while respecting full format compatibility (typically achieved by LZ4-HC).

The "fast" version of LZ4 hosted on Google Code uses a fast scan strategy, which is a single-cell wide hash table. Each position in the input data block gets "hashed", using the first 4 bytes (minmatch). Then the position is stored at the hashed position.
The size of the hash table can be modified while respecting full format compatibility. For restricted memory systems, this is an important feature, since the hash size can be reduced to 12 bits, or even 10 bits (1024 positions, needing only 4K). Obviously, the smaller the table, the more collisions (false positive) we get, reducing compression effectiveness. But it nonetheless still works, and remain fully compatible with more complex and memory-hungry versions. The decoder do not care of the method used to find matches, and requires no additional memory.
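For illustration, the indexing step of such a fast scan could look like the sketch below. The multiplicative constant is a Knuth-style hash as used by the reference implementation; the table size and the surrounding names are assumptions made for the example.

    #include <stdint.h>
    #include <string.h>

    #define HASHLOG 12   /* 2^12 cells; can be reduced to 10 for very small memory budgets */

    /* Hash the first 4 bytes (minmatch) at position 'p'. */
    static uint32_t hash4(const uint8_t *p)
    {
        uint32_t sequence;
        memcpy(&sequence, p, sizeof(sequence));
        return (sequence * 2654435761U) >> (32 - HASHLOG);
    }

    /* Single-cell-wide table: simply remember the last position seen for this hash.
       Collisions only cost compression ratio, never correctness, since every
       candidate match is verified by comparing the actual bytes. */
    static void insert_position(uint32_t *table, const uint8_t *base, const uint8_t *p)
    {
        table[hash4(p)] = (uint32_t)(p - base);
    }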


Note : the format above describes the content of an LZ4 compressed block. It is the raw compression format, with no additional feature, and is intended to be integrated into a program, which will wrap it inside its own custom envelope information.
If you are looking for a portable and interoperable format, which can be understood by other LZ4-compatible programs, you'll have to look at the LZ4 Framing format. In a nutshell, the framing format allows the compression of large files or data streams of arbitrary size, and will organize data into a flow of smaller compressed blocks with (optionally) verified checksums.

29 comments:
  1. I read your spec to reimplement from the description. I think it is complete, the only thing that surprised me is that the matches are allowed to overlap forwards. It might be worth mentioning. My impl was intended to experiment with vectorization but in the end it did not work.

      • Yes, this is a classic LZ77 design.
        With matches allowed to overlap forward, it provides the equivalent of RLE (Run Length Encoding) for free, and even covers repeated 2-byte / 4-byte sequences, which are very common.
        This is in contrast with LZ78, for example, which never takes advantage of overlap. Neither do PPM, BWT, etc.

        That being said, I'm not sure I understand how this prevented your vectorization experiment from working.

        Rgds

      • With the specification of "The last match cannot start within the last 12 bytes" to be handled by the reference decoder, it is not an equivalent of RLE for free. However, anything that compresses well with RLE is very likely going to compress well with LZ4, unless you pick a worst case of unique byte tokens repeated 12 at a time.

      • Not to be confused : minimum match length is still only 4.
        So, if a token is repeated 12 times, it will be caught by the algorithm.

        The only exception is for the last 12 bytes within input data. Even though this restriction is supposed to have a negative impact on compression ratio, its impact on real-life data is negligible.

  2. Thank you for creating a clear and easily understood specification.

    I believe you should add a specification of the size limit of literal length and match length. As specified currently, a correct decoder must be able to process an infinite number of bytes in either field. The best would be to specify the maximum value (not length) of either field as a power of two. The maximum length can then be inferred.

    There is a typographical error in your specification: "additional" is the correct spelling.

      • It was in the initial spirit of the specification that sizes (of literal length or match length) can be unlimited.
        In practice though, it is necessarily limited by the maximum size that the current implementation supports, which is ~1.9 GB.
        A future implementation may support larger block sizes though.

        There is also a theoretical issue with limited literal length : in case of compressing an encrypted file, it is possible that the compressed output consists only of literals. In this case, the size of literal length is the same as the size of the file. Thus, it cannot be limited.

        Would you mind telling why you think enforcing a limit on length would be beneficial ?

        Typo corrected. Thanks for the hint.

      • Hello Yann, Perhaps it is pedantic, but with no limit specified, a "correct" implementation is impossible to create.
        Another concern is efficiency. To determine a length field value > 14, "we need to add some more bytes to indicate the full length. Each additional byte then represent a value of 0 to 255, which is added to the previous value to produce a total length. When the byte value is 255, another byte is output."

      I understand you chose to add byte values, rather than use a compressed integer, such as using the bottom 7 bits as the next most significant bits (with the top bit as a signal that another byte is needed). I believe this was to reduce the output size when the lengths are small. However, with an unlimited length field, we can have a huge number of bytes representing the length. So it seems there must be a limit on the length, or the compression becomes inefficient.

      • Regarding maximum size : 
        since block input size is currently limited to 1.9GB, what about limiting length sizes to this value too ?

        Regarding length encoding :
        The LZ4 format was defined years ago. Initially, it was just created to "learn" compression, so its primary design was simplicity. High speed was then a "side-effect". Since then, priorities have been a bit reversed, but the format remained "stable", a key property to build trust around it.

        I can understand that different trade-offs can be invented, and may seem better. And indeed, if I had to re-invent LZ4 today, I would probably change a few things, including the way "big lengths" are encoded.

        But don't expect these corner-case scenarios to really make a difference in "normal" circumstances. A few people have already attempted such variants, and their benchmarks show that in most circumstances the difference is small (<1%) if only the length encoding is modified.

        Larger difference can be achieved by modifying the fixed 64KB frame, allowing repetitions at larger distances, but with bigger impact on performance and complexity. (You can have a look at Shrinker and LZnib for example)

      • "the difference is small (<1%) if only length encoding is modified": I expect this to be true if only small length values are encoded. This was why I expected a fairly small length limit: I assumed the LZ4 format was only useful in cases where lengths are small, as the length encoding is poor for large lengths. A length of (2^16, 16K) requires 65 bytes, 64K requires 257 bytes, 256K requires 2028 bytes, and so forth. I am not speaking of the algorithm to compute the literals, but simply the length representation. Whether such lengths would be computed, I don't know.

      • I cannot say what the best limit would be, you would have to decide, as you are the expert. Hopefully I explained why it was surprising the limit is currently infinite, and why I expected a small limit.

      • Sure. It's possible to introduce the notion of "implementation-dependent limit".

        For example, current LZ4 reference C implementation has an implementation-limit of 1.9 GB. But other implementations could have different limits.

        This seems more important for decoders. So, whenever a decoder detects a length beyond its limit, it could refuse to continue decoding, and send an error message instead.

      • There's a way to handle the encoding limit without hurting the compression ratio or limiting the total file size (e.g. staying streaming-compatible): you just have to specify that a given literal length is never followed by an offset (just like the last block). That way you can easily have a literal limit of 1 GB (30 bits), and if you want to encode literals larger than this, you just have to stop at 1 GB and start a new literal run, which will only take a few bytes every gigabyte. BTW, thanks for the description and kudos for this smart and fast design!

      • Yes, I realized that point later on.
        Unfortunately, at that point, LZ4 is already widely deployed, with a stable format. It's no longer possible to change that now.

        A correct "limit" for streaming would probably be something like ~4KB. There is a direct relation between this limit and the amount of "memory buffer" a streaming implementation must allocate.

        Currently, my "work around" to this issue is to use "small blocks", typically 64KB. So the issue is solved by the upper "streaming layer" format.

      • In hindsight, we found an integer overflow (http://blog.securitymouse.com/2014/06/understanding-lz4-memory-corruption.html) right there!

  3. I would like to see a bigger version of the tiny picture.

    Would you be so kind as to upload one?

      • Which tiny picture are you talking about ? The first one on the top left ? It's just an illustration, taken from http://fastcompression.blogspot.fr/p/compression-benchmark.html

  4. Did you also evaluate Base-128 Varints (how Protocol Buffers encode ints) for lengths? I assume they might be slightly smaller, but slower, since they require more arithmetic operations.

  5. Can you tell me how much memory is consumed by LZ4 during decompression?

      • The algorithm itself doesn't consume any memory.

        The memory used is limited to the input and output buffers, so it is implementation-dependent.

      • Sir, I have only 64 KB of memory, so what should I do? What should be defined as the chunk size?

      • Well, it can be any value you want.
        The LZ4 compression algorithm (lz4.c & lz4.h) doesn't define a chunk size. You can select 64 KB, 32 KB, 16 KB, or even a weird 10936 bytes; there is no limitation. This parameter is fully implementation-specific.
        Since I don't know what the source of the data is, what the surrounding buffer environment looks like, etc., it's not possible to be more precise.

        LZ4 is known to work on systems with specs as low as 1979's Atari XL or 1984's Amstrad. So there is no blocking point in making it work within 64 KB.

        Regards

  6. How do I compress a full directory or folder?

  7. Compressing directories, or even file attributes, is outside of the scope of LZ4. LZ4 has the same responsibility as zlib, and therefore compresses a "stream of bytes", irrespective of metadata.

    To compress directory, there are 2 possible methods :

    1) On Windows : use the LZ4 installer program, at http://fastcompression.blogspot.fr/p/lz4.html. It adds a new context menu option when right-clicking a folder : "compress with LZ4". The resulting file will be the compressed directory. You can, of course, regenerate the directory by decompressing the file (just double-click on it).

    2) On Linux : use 'tar' to aggregate the directory content, and pipe the result to lz4 (exactly the same as with gzip).

  8. Thanks for the excellent and concise description. There's only one detail that seems to be missing (or maybe I missed it): You say that the token gets divided into two four bit fields. However, I don't think you said how those pack into the byte.

    Are they packed little endian with the first field in bits 0-3 and the second field in bits 4-7, or big endian? 

    I suppose I could go look at the code. It just seems a shame that it isn't included with this otherwise-complete looking description.

      • The format is explained in more detail in the file LZ4_format_description.txt, which is provided with the source code, and can be consulted online here :
        https://code.google.com/p/lz4/source/browse/trunk/lz4_format_description.txt

        Within it, you'll find a more precise answer to your question :
        "Each sequence starts with a token.
        The token is a one byte value, separated into two 4-bits fields.
        Therefore each field ranges from 0 to 15.


        The first field uses the 4 high-bits of the token.
        It provides the length of literals to follow."

        I'll update this blog post to add this information.

  9. First of all thanks for this amazing LZ4 and also for this short explanation.
    If I got this right, given that the offset is hard-coded to 2 bytes, we can only reference matches up to 64K back. And this could explain why, in my use case, the compression ratio is not increasing when I feed it bigger buffers, even if the repeated data is very frequent.
    So here comes the question: theoretically, if memory is not an issue, by increasing the 'offset' to e.g. up to 32 bits (no more fixed size at that point) and making the hash table bigger, we could achieve a more than trivial improvement in compression ratio on bigger buffers.
    Does this sound good or am I missing something?
    Thanks in advance.

    1. Hi Carlo.

      Indeed, a 32-bit offset would open larger perspectives for finding duplicated sequences. However, it would also make the cost of offsets larger.

      Currently, a "match" (duplicate sequence) costs approximately 3 bytes : 2-bytes offset + 1 byte token. Your proposal would increase that cost to 5 bytes. That means you'll need matches of length at least 6 to compress. It also means that you'll need matches which are all, on average 2 bytes larger, to pay off.

      Will that work ? Well, maybe. The problem is, there is no single definitive answer to this question. You'll have to test your hypothesis on your use cases to find out if this modification is worthwhile.

      Some trivial corner cases :
      1) Input source size <= 64 KB ? The modification will obviously be detrimental.
      2) Large Repeated pattern at distance > 64 KB ? The modification will obviously benefit a lot.

      Unfortunately, real use cases are more complex. 
      My guess is that, on average, the 32-bits offset strategy will cost more than it gains. But then again, if your use case contains large repeated patterns at large distances, it will more likely be a win.

    2. Thanks for the advice, Yann.
      I was also underestimating the power of the L1 cache: when increasing the hash table beyond your good default, it always starts to decrease performance anyway.
      But at the same time I had the chance to play a bit with it and seem to have found a good tradeoff, working decently in all use cases. I'll drop you an email with more details through the dedicated form.
      Thanks again.


