Physical Lossless Compression Algorithms Explained: Prefix Codes, Huffman Coding, and RLE

Article link: https://blog.csdn.net/weixin_45937291/article/details/119416885

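Basics

Types of compression

Logical compression and physical compression

Logical compression: exploits the meaning of the data, so it only applies to a specific domain, such as audio recordings.

Physical compression: only needs the bits of the data, not what those bits mean.

Lossy compression and lossless compression

Lossy compression: can achieve a better compression ratio, but decoding is only approximate; the exact source text cannot be recovered.

Lossless compression: the source text can be recovered exactly.

This article focuses on physical, lossless compression algorithms.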

Prefix-free Encoding/Decoding

As an example, consider the following code:

A	E	N	O	T
000	01	001	10	11

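No character's codeword is a prefix of another character's codeword; a code with this property is called prefix-free.

A prefix-free code can always be decoded unambiguously: reading the bits from left to right, at most one codeword can match at any point.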
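
The table above can be tried out with a minimal Python sketch. It assumes bits are held as a string of '0' and '1' characters; the helper names `is_prefix_free`, `encode`, and `decode` are illustrative and do not come from the original post.

```python
# Code table from the example above.
CODE = {"A": "000", "E": "01", "N": "001", "O": "10", "T": "11"}

def is_prefix_free(code):
    """True if no codeword is a prefix of a different entry's codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

def encode(text, code):
    """Concatenate the codewords of the characters in text."""
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    """Greedy left-to-right decoding; only valid for a prefix-free code."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(encode("ANT", CODE))       # 00000111
print(decode("00000111", CODE))  # ANT
```

Greedy matching is safe here precisely because the code is prefix-free: once a codeword matches, no longer codeword can start with the same bits.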

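Prefix tree (trie)

For encoding purposes, a trie can be understood as follows: each leaf represents one character, and the path from the root to that leaf determines the character's codeword. The trie for the code table above is:
[Figure: trie for the code table above]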
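
Since the figure is not available here, the following Python sketch builds the same trie from the code table and decodes by walking it. The nested-dict representation and the names `build_trie` and `decode_with_trie` are assumptions made for illustration; the code is assumed to be prefix-free with non-empty codewords.

```python
CODE = {"A": "000", "E": "01", "N": "001", "O": "10", "T": "11"}

def build_trie(code):
    """Inner nodes are dicts keyed by '0'/'1'; leaves are the characters."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch  # the last bit of the codeword leads to the leaf
    return root

def decode_with_trie(bits, root):
    """Follow one edge per bit; emit a character at each leaf and restart."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("input ended in the middle of a codeword")
    return "".join(out)

trie = build_trie(CODE)
print(decode_with_trie("00000111", trie))  # ANT
```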

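Huffman coding

Huffman trie

Given a source text S, how do we choose the best trie, that is, the one that minimizes the length of the encoded text C?

Building the Huffman trie:

Count the frequency of each character that occurs in the source text.
For each character, create a trie of height 0 whose only node represents that character; its initial weight is the character's frequency in S.
Find the two tries with the smallest weights, create a new node as the root of a new trie, and make the two tries the children of that root. The weight of the new trie is the sum of the two weights.
Repeat the previous step until only one trie remains.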
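
The steps above can be sketched in Python with the standard `heapq` module as the priority queue. This is one possible sketch, not the author's implementation: the tuple-based trie, the tie-breaking counter, and the name `huffman_code` are assumptions. Ties between equal weights may be broken differently than in a hand-built trie, so individual codewords can differ while the total encoded length stays the same.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(text):
    """Return a dict mapping each character of text to its codeword."""
    if not text:
        return {}
    freq = Counter(text)
    tiebreak = count()  # unique counter so tries themselves are never compared
    # A trie is either a character (leaf) or a (left, right) pair.
    heap = [(weight, next(tiebreak), ch) for ch, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # the two lightest tries
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    _, _, trie = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            # Assumption: a text with a single distinct character gets "0".
            code[node] = prefix or "0"
    walk(trie, "")
    return code

code = huffman_code("GREENENERGY")
print(sum(len(code[ch]) for ch in "GREENENERGY"))  # 25
```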

Example:

The source text is GREENENERGY.

The character frequencies are G:2, R:2, E:4, N:2, Y:1.

The resulting trie is:

[Figure: Huffman trie for GREENENERGY]

GREENENERGY $\rightarrow$ 000 10 01 01 11 01 11 01 10 000 001
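
Reading the codewords off the encoding above gives G=000, R=10, E=01, N=11, Y=001. The short Python check below reproduces the encoded text and compares its length with a fixed-length code; the 3-bit baseline is an assumption added for comparison (5 distinct characters need 3 bits each).

```python
from collections import Counter

text = "GREENENERGY"
code = {"G": "000", "R": "10", "E": "01", "N": "11", "Y": "001"}

encoded = " ".join(code[ch] for ch in text)
freq = Counter(text)

# Encoded length = sum over characters of frequency * codeword length.
weighted_length = sum(freq[ch] * len(word) for ch, word in code.items())
fixed_length = 3 * len(text)  # hypothetical fixed-length baseline

print(encoded)                        # 000 10 01 01 11 01 11 01 10 000 001
print(weighted_length, fixed_length)  # 25 33
```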

Pseudocode

# Huffman::encoding(S,C)
# S: input-stream with characters in S_char
# C: output-stream
// get frequencies
f <- array indexed by S_char, initially all-0
while S is non-empty do increase f[S.pop()] by 1
// initialize PQ
Q <- min-oriented priority queue that stores tries
for all c in S_char with f[c] > 0 do
	Q.insert(single-node trie for c with weight f[c])
// build decoding trie
while Q.size > 1 do
	T1 <- Q.deleteMin(), f1 <- weight of T1
	T2 <- Q.deleteMin(), f2 <- weight of T2
	Q.insert(trie with T1, T2 as subtries and weight f1+f2)
T <- Q.deleteMin()
C.append(encoding trie T)
// actual encoding
Re-set input-stream S
C.append(encoding S according to T)

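Decoding requires the same trie that was used for encoding, which is why the encoder writes the trie to C before the encoded text.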

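Run-length encoding (RLE)

Basics

This article only considers the case where both the source text and the encoded text are binary.

Encoding idea:

Take S = 00000 111 0000 as a running example.
Record the first bit of S, here 0.
Then record how many times 0 repeats, here 5.
Next record how many times 1 repeats, here 3.
Then record how many times 0 repeats, here 4.
So S can be encoded as 0, 5, 3, 4.
The key question is how to encode the run lengths 5, 3, 4, which are positive integers.
Note also that RLE pays off only when runs of identical bits tend to be long.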
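
The first half of the idea, splitting a bit string into its first bit and a list of run lengths, can be sketched in Python with `itertools.groupby`. The function names `to_runs` and `from_runs` are illustrative, and bits are assumed to be a string of '0' and '1' characters.

```python
from itertools import groupby

def to_runs(bits):
    """Split a non-empty bit string into (first bit, list of run lengths)."""
    if not bits:
        raise ValueError("expected a non-empty bit string")
    runs = [len(list(group)) for _, group in groupby(bits)]
    return bits[0], runs

def from_runs(first_bit, runs):
    """Rebuild the bit string; the bit value alternates from run to run."""
    out, bit = [], first_bit
    for k in runs:
        out.append(bit * k)
        bit = "1" if bit == "0" else "0"
    return "".join(out)

print(to_runs("000001110000"))      # ('0', [5, 3, 4])
print(from_runs("0", [5, 3, 4]))    # 000001110000
```

Only the first bit needs to be stored because consecutive runs in a binary string always alternate between 0 and 1.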

Elias gamma coding

The run lengths are encoded with Elias gamma coding:

A positive integer $k$ is encoded as $\lfloor \log_2 k \rfloor$ zeros followed by the binary representation of $k$.

For example:

$k$	$\lfloor \log_2 k \rfloor$	$k$ in binary	Encoding
1	0	1	1
2	1	10	010
3	1	11	011
4	2	100	00100
5	2	101	00101
6	2	110	00110
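
The table can be reproduced with a few lines of Python. This is a sketch under the assumption that bits are strings of '0' and '1'; the names `elias_gamma` and `elias_gamma_decode` are illustrative. It relies on the fact that the binary representation of $k$ has $\lfloor \log_2 k \rfloor + 1$ digits.

```python
def elias_gamma(k):
    """Encode a positive integer as floor(log2 k) zeros plus k in binary."""
    if k < 1:
        raise ValueError("Elias gamma coding is defined for k >= 1")
    binary = bin(k)[2:]  # binary digits of k, always starting with '1'
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits, pos=0):
    """Decode one codeword starting at bits[pos]; return (k, next position)."""
    zeros = 0
    while pos + zeros < len(bits) and bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1  # zeros, then zeros + 1 binary digits
    if end > len(bits):
        raise ValueError("truncated Elias gamma codeword")
    return int(bits[pos + zeros:end], 2), end

print([elias_gamma(k) for k in range(1, 7)])
# ['1', '010', '011', '00100', '00101', '00110']
```

The leading zeros tell the decoder how many binary digits follow, so codewords can be concatenated without separators.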

Pseudocode

# encoding
S: input-stream of bits, C: output-stream
b <- S.top(); C.append(b)
while S is non-empty do
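	k <- 1; S.pop() // the first bit of the run always equals b
	// get length of run
	while (S is non-empty and S.top() == b) do
		++k; S.pop()
	// compute and append Elias gamma code
	K <- empty string
	while k > 1
		C.append(0)
		K.prepend(k mod 2) // bits of k are produced from least to most significant
		k <- floor(k/2)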
	K.prepend(1) // K is binary encoding of k.
	C.append(K)
	b <- 1-b

# decoding
C: input-stream of bits, S: output-stream
b <- C.pop()
while C is non-empty
	len <- 0
	while C.pop() == 0 do ++len
	k <- 1
	for(j<-1 to len) do k <- k*2 + C.pop()
	for(j<-1 to k) do S.append(b)
	b <- 1-b
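
For readers who want to run the scheme, here is a compact Python sketch of the same encoder and decoder. It assumes the input is a non-empty string of '0' and '1' characters, and it indexes into the string instead of popping from a stream; the names `rle_encode` and `rle_decode` are illustrative.

```python
def rle_encode(bits):
    """First bit, then the Elias gamma code of each run length."""
    if not bits:
        raise ValueError("expected a non-empty bit string")
    out = [bits[0]]
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:  # find the end of the run
            j += 1
        binary = bin(j - i)[2:]                      # run length in binary
        out.append("0" * (len(binary) - 1) + binary)
        i = j
    return "".join(out)

def rle_decode(code):
    """Inverse of rle_encode."""
    bit, pos, out = code[0], 1, []
    while pos < len(code):
        zeros = 0
        while code[pos] == "0":  # count the leading zeros of the codeword
            zeros += 1
            pos += 1
        k = int(code[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        out.append(bit * k)
        bit = "1" if bit == "0" else "0"
    return "".join(out)

print(rle_encode("000001110000"))    # 00010101100100
print(rle_decode("00010101100100"))  # 000001110000
```

The 12-bit example encodes to 14 bits, which illustrates the earlier remark that RLE only helps when runs are long; a string of 100 zeros followed by 50 ones encodes to 25 bits.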

Lempel-Ziv-Welch (LZW)

Basics

Huffman and RLE take advantage of frequent or repeated single characters.

In some settings, certain substrings occur very frequently:

English text:
Frequent digraphs: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
Frequent trigraphs: THE, AND, THA, ENT, ION, TIO, FOR, NDE
HTML: <a href, <img src, <br>
Video: background that repeats between frames

Lempel-Ziv-Welch compression takes advantage of such substrings without needing to know beforehand what they are.

How it works

The procedure is as follows:

[Figure: the LZW procedure]

Example:

[Figure: worked LZW example]

Pseudocode

# encoding
S: input-stream of characters, C: output-stream
Initialize dictionary D with ASCII in a trie
idx <- 128
while S is non-empty do
	v <- root of trie D
	while (S is non-empty and v has a child c labelled S.top())
		v <- c; S.pop()
	C.append(codenumber stored at v)
	if S is non-empty
		create child of v labelled S.top() with codenumber idx
		++idx

# decoding
C: input-stream of integers, S: output-stream
D <- dictionary that maps {0,...,127} to ASCII
idx <- 128
code <- C.pop(); s <- D(code); S.append(s)
while there are more codes in C do
	s_prev <- s; code <- C.pop();
	if code < idx
		s<-D(code)
	else if code == idx
		s <- s_prev + s_prev[0]
	else FAIL // Encoding was invalid
	S.append(s)
	D.insert(idx,s_prev+s[0])
	++idx
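
The pseudocode above translates fairly directly into Python. The sketch below swaps the trie for a plain `dict` keyed by strings, which is simpler to write but not what the pseudocode uses; it assumes ASCII input, and the names `lzw_encode` and `lzw_decode` are illustrative.

```python
def lzw_encode(text):
    """Return the list of code numbers for an ASCII string."""
    dictionary = {chr(i): i for i in range(128)}
    idx = 128
    codes, w = [], ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                      # keep extending the current match
        else:
            codes.append(dictionary[w])  # emit the longest match
            dictionary[w + ch] = idx     # learn the match plus one character
            idx += 1
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes

def lzw_decode(codes):
    """Rebuild the text, reconstructing the dictionary one step behind."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(128)}
    idx = 128
    s = dictionary[codes[0]]
    out = [s]
    for code in codes[1:]:
        s_prev = s
        if code < idx:
            s = dictionary[code]
        elif code == idx:                # code the decoder has not added yet
            s = s_prev + s_prev[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(s)
        dictionary[idx] = s_prev + s[0]
        idx += 1
    return "".join(out)

print(lzw_encode("ABABABA"))             # [65, 66, 128, 130]
print(lzw_decode([65, 66, 128, 130]))    # ABABABA
```

The last code in this example, 130, is the special case `code == idx`: the encoder used an entry in the same step in which it created it, so the decoder has to infer it from the previous string.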

bzip2

[Figure: overview of the bzip2 pipeline]

Move-to-front transform (MTF)

[Figure: move-to-front example]

# encoding
L <- array with S_char in some pre-agreed, fixed order(usually ASCII)
while S is not empty do
	c <- S.pop()
	i <- index such that L[i] = c
	C.append(i)
	for j = i - 1 down to 0
		swap L[j] and L[j+1]

# decoding
L <- array with S_char in some pre-agreed, fixed order(usually ASCII)
while C is not empty do
	i <- next integer from C
	S.append(L[i])
	for j = i - 1 down to 0
		swap L[j] and L[j+1]
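
A minimal Python sketch of the same transform follows. It uses `list.pop` and `list.insert` to move the character to the front instead of the swap loop, which has the same effect; the explicit `alphabet` parameter stands in for the pre-agreed order and the function names are illustrative.

```python
def mtf_encode(text, alphabet):
    """Replace each character by its current index, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in text:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(indices, alphabet):
    """Inverse transform: both sides update the table in the same way."""
    table = list(alphabet)
    out = []
    for i in indices:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

print(mtf_encode("aabbbc", "abc"))              # [0, 0, 1, 0, 0, 2]
print(mtf_decode([0, 0, 1, 0, 0, 2], "abc"))    # aabbbc
```

Repeated characters turn into runs of zeros, which is what makes the output easier to compress in a later stage.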

Burrows-Wheeler transform (BWT)

[Figure: Burrows-Wheeler transform example]

# decoding
C: string of characters, S: output-stream
A <- array of size n // leftmost column
for i=0 to n-1
	A[i] <- (C[i], i) // store character and index
Stably sort A by character
for j=0 to n-1 // where is the $-char?
	if C[j]==$ break

repeat
	S.append(character stored in A[j])
	j <- index stored in A[j]
until we have appended $
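
To experiment with the decoder, the Python sketch below pairs it with a naive encoder that sorts all cyclic rotations and takes the last column. The encoder is an assumption added for illustration (the post only gives decoding pseudocode) and is far too slow for large inputs; the names `bwt_encode` and `bwt_decode` and the `end` parameter are illustrative.

```python
def bwt_encode(text, end="$"):
    """Naive BWT: sort all rotations of text + end, keep the last column."""
    if end in text:
        raise ValueError("the end marker must not occur in the text")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def bwt_decode(code, end="$"):
    """Follows the pseudocode: stable sort of (character, index) pairs."""
    # sorted() is stable, so equal characters keep their order from code.
    a = sorted(((ch, i) for i, ch in enumerate(code)), key=lambda pair: pair[0])
    j = code.index(end)  # this row of the sorted matrix is the original text
    out = []
    while True:
        ch, j = a[j]     # first character of row j, then jump to the next row
        out.append(ch)
        if ch == end:
            break
    return "".join(out[:-1])  # drop the end marker

print(bwt_encode("banana"))    # annb$aa
print(bwt_decode("annb$aa"))   # banana
```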