Linux C lzma: A Brief Introduction to the LZMA Algorithm

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998[1] and was first used in the 7z format of the 7-Zip archiver. The algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio (generally higher than bzip2)[2][3] and a variable compression-dictionary size (up to 4 GB),[4] while still maintaining decompression speed similar to other commonly used compression algorithms.[5]

LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters. LZMA2 supports arbitrarily scalable multithreaded compression and decompression and efficient compression of data which is partially incompressible.


Overview

LZMA uses a dictionary compression algorithm (a variant of LZ77 with huge dictionary sizes and special support for repeatedly used match distances), whose output is then encoded with a range encoder, using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations.

Prior to LZMA, most encoder models were purely byte-based (i.e. they coded each bit using only a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context. Furthermore, compared to classic dictionary compression (such as the one used in zip and gzip formats), the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems.

Compressed format overview

In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type.

There are 7 types of packets, each identified by the bit sequence that introduces it:

LIT (0 + byteCode): a single byte encoded using the adaptive binary range coder.
MATCH (1+0 + len + dist): a typical LZ77 sequence with explicitly encoded length and distance.
SHORTREP (1+1+0+0): a one-byte LZ77 sequence whose distance is the last used LZ77 distance.
LONGREP[0] (1+1+0+1 + len): an LZ77 sequence whose distance is the last used LZ77 distance.
LONGREP[1] (1+1+1+0 + len): an LZ77 sequence whose distance is the second last used LZ77 distance.
LONGREP[2] (1+1+1+1+0 + len): an LZ77 sequence whose distance is the third last used LZ77 distance.
LONGREP[3] (1+1+1+1+1 + len): an LZ77 sequence whose distance is the fourth last used LZ77 distance.

LONGREP[*] refers to LONGREP[0-3] packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP.

LONGREP[n] packets remove the distance used from the list of the most recent distances and reinsert it at the front, to avoid a useless repeated entry; MATCH simply adds the new distance to the front even if it is already present in the list; SHORTREP and LONGREP[0] don't alter the list.
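
The following C sketch illustrates one way a decoder might maintain the four most recent distances as described above; the names (rep, rep_push, rep_reuse) are illustrative, not taken from any particular implementation.

    #include <stdint.h>

    /* The four most recently used match distances, rep[0] being the newest. */
    static uint32_t rep[4];

    /* MATCH: push a newly decoded distance to the front unconditionally. */
    static void rep_push(uint32_t dist)
    {
        rep[3] = rep[2];
        rep[2] = rep[1];
        rep[1] = rep[0];
        rep[0] = dist;
    }

    /* LONGREP[n]: move the n-th most recent distance to the front.
     * For n == 0 (SHORTREP and LONGREP[0]) the list is left unchanged. */
    static void rep_reuse(unsigned n)
    {
        uint32_t dist = rep[n];
        for (unsigned i = n; i > 0; i--)
            rep[i] = rep[i - 1];
        rep[0] = dist;
    }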

The length is encoded as follows:

0 + 3 bits: the 3 context-encoded bits give lengths 2 to 9
1+0 + 3 bits: the 3 context-encoded bits give lengths 10 to 17
1+1 + 8 bits: the 8 context-encoded bits give lengths 18 to 273
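
A minimal sketch of the corresponding decoding logic; decode_context_bit() and decode_context_bits() are assumed helpers (illustrative names) that decode a single context-coded bit and a fixed number of context-coded bits through a probability bit-tree, each call using its own context in a real decoder.

    #include <stdint.h>

    extern int decode_context_bit(void);
    extern uint32_t decode_context_bits(unsigned n);

    /* Decode a match length in the range 2..273. */
    static uint32_t decode_length(void)
    {
        if (decode_context_bit() == 0)
            return 2 + decode_context_bits(3);    /* lengths 2..9   */
        if (decode_context_bit() == 0)
            return 10 + decode_context_bits(3);   /* lengths 10..17 */
        return 18 + decode_context_bits(8);       /* lengths 18..273 */
    }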

As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if the copy was performed byte by byte, keeping the distance constant.
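
A sketch of this byte-by-byte copy, which is what makes overlapping matches (length greater than distance) well defined; the function name and buffer layout are illustrative.

    #include <stdint.h>
    #include <stddef.h>

    /* Copy `len` bytes from `dist + 1` bytes back in the output buffer,
     * one byte at a time so that source and destination may overlap
     * (distance 0 refers to the most recently written byte). `pos` is
     * the current write position; the new position is returned. */
    static size_t lz_copy(uint8_t *dict, size_t pos, uint32_t dist, uint32_t len)
    {
        while (len-- > 0) {
            dict[pos] = dict[pos - dist - 1];
            pos++;
        }
        return pos;
    }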

Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary.

The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed. Distance slots 0-3 directly encode distances 0-3. For a slot s ≥ 4, the distance is decoded as a binary concatenation of, from most to least significant: two bits depending on the slot (a 1 followed by the low bit of s), then floor(s/2) − 1 further bits. For slots 4-13 all of these further bits are context encoded; for slots 14-63 the upper floor(s/2) − 5 bits are encoded with fixed 0.5 probability and only the lowest 4 ("align") bits are context encoded. A C sketch of this reconstruction follows.
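
A sketch of the reconstruction just described; decode_context_bits_reverse() and decode_fixed_bits() are assumed helpers (illustrative names) for context-coded bits (decoded least significant first) and fixed 0.5-probability bits.

    #include <stdint.h>

    extern uint32_t decode_context_bits_reverse(unsigned n);
    extern uint32_t decode_fixed_bits(unsigned n);

    /* Reconstruct a distance from its 6-bit distance slot. */
    static uint32_t decode_distance(unsigned slot)
    {
        if (slot < 4)
            return slot;                            /* slots 0-3 encode distances 0-3 */

        unsigned extra = (slot >> 1) - 1;           /* number of further bits */
        uint32_t dist = (2 | (slot & 1)) << extra;  /* a 1 plus the slot's low bit */

        if (slot < 14) {
            dist += decode_context_bits_reverse(extra);   /* all context coded */
        } else {
            dist += decode_fixed_bits(extra - 4) << 4;    /* fixed 0.5 probability */
            dist += decode_context_bits_reverse(4);       /* 4 "align" bits */
        }
        return dist;
    }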

Decompression algorithm details

No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text.

The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source[6] from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference isn't ideal, any programmer should be able to check the claims below with a few hours of work.

Range coding of bits

LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder.

Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable prob (typically implemented using a 16-bit data type) representing the predicted probability of the bit being 1, which is read and updated by the range decoder (and should be initialized to 2^10, representing 0.5 probability).
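
As a concrete sketch (illustrative, not taken from any particular decoder), such a context might be declared as follows.

    #include <stdint.h>

    /* One adaptive context: an 11-bit probability of the next bit being 1,
     * stored in a 16-bit integer and initialized to 2^10, i.e. 0.5. */
    typedef uint16_t prob_t;

    #define PROB_BITS 11
    #define PROB_INIT ((prob_t)1 << (PROB_BITS - 1))   /* 1024 */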

Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding.

The range decoder state consists of two unsigned 32-bit variables, range (representing the range size), and code (representing the encoded point within the range).

Initialization of the range decoder consists of setting range to 2^32 − 1, and code to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored.
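
A minimal C sketch of this state and its initialization, with illustrative names (the kernel's own structures differ in detail):

    #include <stdint.h>
    #include <stddef.h>

    struct rc_dec {
        uint32_t range;        /* current range size */
        uint32_t code;         /* encoded point within the range */
        const uint8_t *in;     /* compressed input stream */
        size_t in_pos;         /* next input byte to read */
    };

    /* range = 2^32 - 1; code = bytes 1..4 of the stream, big-endian;
     * byte 0 of the stream is ignored. */
    static void rc_init(struct rc_dec *rc, const uint8_t *in)
    {
        rc->range = UINT32_MAX;
        rc->in = in;
        rc->in_pos = 1;                   /* skip the ignored first byte */
        rc->code = 0;
        for (int i = 0; i < 4; i++)
            rc->code = (rc->code << 8) | rc->in[rc->in_pos++];
    }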

Normalization proceeds in this way:

Shift both range and code left by 8 bits

Read a byte from the compressed stream

Set the least significant 8 bits of code to the byte value read
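
The same three steps as a C sketch, continuing from the rc_dec structure above (callers invoke this only when range has fallen below 2^24):

    /* Normalization: shift range and code left by 8 bits and pull in the
     * next byte of the compressed stream as the low 8 bits of code. */
    static void rc_normalize(struct rc_dec *rc)
    {
        rc->range <<= 8;
        rc->code = (rc->code << 8) | rc->in[rc->in_pos++];
    }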

Context-based range decoding of a bit using the prob probability variable proceeds in this way:

If range is less than 2^24, perform normalization

Set bound to floor(range / 2^11) * prob

If code is less than bound:

Set range to bound

Set prob to prob + floor((2^11 - prob) / 2^5)

Return bit 0

Otherwise (if code is greater than or equal to the bound):

Set range to range - bound

Set code to code - bound

Set prob to prob - floor(prob / 2^5)

Return bit 1
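
The same procedure as a C sketch, building on prob_t, rc_dec and rc_normalize from the sketches above:

    #define RC_TOP        (1u << 24)   /* normalization threshold, 2^24 */
    #define RC_MODEL_BITS 11           /* prob is an 11-bit probability */
    #define RC_MOVE_BITS  5            /* adaptation shift, 2^5 */

    /* Decode one bit using the adaptive probability *prob and update it. */
    static int rc_bit(struct rc_dec *rc, prob_t *prob)
    {
        if (rc->range < RC_TOP)
            rc_normalize(rc);

        /* The division by 2^11 (the shift) happens before the
         * multiplication, keeping the product within 32 bits. */
        uint32_t bound = (rc->range >> RC_MODEL_BITS) * *prob;

        if (rc->code < bound) {
            rc->range = bound;
            *prob += ((1u << RC_MODEL_BITS) - *prob) >> RC_MOVE_BITS;
            return 0;
        } else {
            rc->range -= bound;
            rc->code -= bound;
            *prob -= *prob >> RC_MOVE_BITS;
            return 1;
        }
    }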

Fixed-probability range decoding of a bit proceeds in this way:

If range is less than 2^24, perform normalization

Set range to floor(range / 2)

If code is less than range:

Return bit 0

Otherwise (if code is greater than or equal to range):

Set code to code - range

Return bit 1

The Linux kernel implementation of fixed-probability decoding in rc_direct, for performance reasons, doesn't include a conditional branch, but instead subtracts range from code unconditionally, and uses the resulting sign bit both to decide the bit to return and to generate a mask that is combined with range and added back to code.
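
A branch-free C sketch of that idea, again building on the earlier sketches (a simplified rendering of the trick, not a verbatim copy of the kernel function):

    /* Fixed-probability (0.5) decoding of one bit without a conditional
     * branch: subtract unconditionally, then use the sign bit. */
    static int rc_direct_bit(struct rc_dec *rc)
    {
        if (rc->range < RC_TOP)
            rc_normalize(rc);

        rc->range >>= 1;
        rc->code -= rc->range;

        /* mask is 0xFFFFFFFF if the subtraction went negative (bit 0),
         * or 0 if code was already >= range (bit 1). */
        uint32_t mask = (uint32_t)0 - (rc->code >> 31);
        rc->code += rc->range & mask;   /* undo the subtraction when bit is 0 */

        return (int)(mask + 1);         /* 0xFFFFFFFF + 1 wraps to 0; 0 + 1 = 1 */
    }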

Note that:

The division by 2^11 when computing bound, and the floor operation, are done before the multiplication, not after (apparently to avoid requiring fast hardware support for 32-bit multiplication with a 64-bit result).
