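Basics
Types of compression
Logical compression vs. physical compression
Logical compression: exploits the meaning of the data, so it only applies to specific domains, such as audio recordings.
Physical compression: only needs the bits of the data, not what those bits mean.
Lossy compression vs. lossless compression
Lossy compression: achieves a better compression ratio, but decoding is only approximate; the exact source text cannot be recovered.
Lossless compression: the source text can be recovered exactly.
This article focuses on physical, lossless compression algorithms.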
Prefix-free Encoding/Decoding
For example, consider the following code:
A | E | N | O | T |
---|---|---|---|---|
000 | 01 | 001 | 10 | 11 |
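No codeword is a prefix of another codeword; such a code is called prefix-free.
A prefix-free code can always be decoded unambiguously.
Trie (prefix tree)
For encoding purposes, a trie can be read as follows: each leaf represents one character, and the path from the root to that leaf spells out the character's codeword. For the table above, the trie has the root-to-leaf paths 000 → A, 001 → N, 01 → E, 10 → O and 11 → T.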
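The way a trie drives decoding can be illustrated with a small Python sketch. It uses the code table from above; the nested-dictionary representation and the names `build_trie`, `encode` and `decode` are assumptions made for illustration, not part of the original notes.

```python
# Code table from the text: no codeword is a prefix of another.
CODE = {"A": "000", "E": "01", "N": "001", "O": "10", "T": "11"}


def build_trie(code):
    """Build a binary trie as nested dicts; leaves hold the character."""
    root = {}
    for char, bits in code.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = char  # the last edge leads to a leaf
    return root


def encode(text, code):
    """Concatenate the codewords of all characters in text."""
    return "".join(code[ch] for ch in text)


def decode(bits, trie):
    """Walk the trie bit by bit; emit a character at every leaf."""
    out = []
    node = trie
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = trie            # restart at the root
    if node is not trie:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


TRIE = build_trie(CODE)
```

Because the code is prefix-free, the decoder never has to look ahead: the first leaf it reaches is the only possible character.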
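Huffman coding
Huffman coding tree
Given a source text S, how do we choose the trie that minimizes the length of the encoded text C?
Building the Huffman tree:
- Count the frequency of every character that occurs in the source text.
- For each character, create a trie of height 0 whose only node represents that character; its initial weight is the character's frequency in S.
- Find the two tries with the smallest weights, create a new node as the root of a new trie, and make the two tries the children of that root. The weight of the new trie is the sum of the two weights.
- Repeat the previous step until only one trie remains.
For example:
The source text is GREENENERGY.
The character frequencies are G:2, R:2, E:4, N:2, Y:1.
The resulting trie assigns the codewords G = 000, Y = 001, E = 01, R = 10, N = 11, so:
GREENENERGY → 000 10 01 01 11 01 11 01 10 000 001 (25 bits)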
Pseudocode
# Huffman::encoding(S,C)
# S: input-stream with characters in S_char
# C: output-stream
// get frequencies
f <- array indexed by S_char, initially all-0
while S is non-empty do increase f[S.pop()] by 1
// initialize PQ
Q <- min-oriented priority queue that stores tries
for all c in S_char with f[c] > 0 do
    Q.insert(single-node trie for c with weight f[c])
// build decoding trie
while Q.size > 1 do
    T1 <- Q.deleteMin(), f1 <- weight of T1
    T2 <- Q.deleteMin(), f2 <- weight of T2
    Q.insert(trie with T1, T2 as subtries and weight f1+f2)
T <- Q.deleteMin()
C.append(encoding trie T)
// actual encoding
Re-set input-stream S
C.append(encoding S according to T)
Decoding requires the same trie that was used for encoding, which is why the trie is written to C ahead of the encoded text.
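The construction can be sketched in Python with `heapq` as the min-oriented priority queue. This is a minimal sketch under a few assumptions: the input is a non-empty string, a trie is represented as either a character (leaf) or a pair of subtries, and the function name `huffman_code` is hypothetical. Ties between equal weights may be broken differently than in the example above, so the individual codewords can differ while the total encoded length stays the same.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each character of a non-empty text to its codeword."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, trie). A trie is either a
    # character (leaf) or a pair (left, right). The unique tie_breaker
    # keeps tuple comparison from ever reaching the trie itself.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # smallest weight
        w2, _, t2 = heapq.heappop(heap)   # second smallest weight
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    _, _, trie = heap[0]

    code = {}

    def walk(node, prefix):
        if isinstance(node, str):
            code[node] = prefix or "0"    # text with a single distinct character
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")

    walk(trie, "")
    return code


def huffman_encode(text):
    """Return (code table, encoded bit string) for a non-empty text."""
    code = huffman_code(text)
    return code, "".join(code[ch] for ch in text)
```

For GREENENERGY, `huffman_encode` yields a 25-bit string, the same length as the encoding shown in the example.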
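Run-length encoding (RLE)
Basics
This article only considers the case where both the source text and the encoded text are binary.
The idea behind the encoding:
- Take S = 00000 111 0000 as a running example.
- Record the first bit of S, here 0.
- Then record how many times 0 occurs consecutively, here 5.
- Next record how many times 1 occurs consecutively, here 3.
- Then record how many times 0 occurs consecutively, here 4.
- So S can be encoded as 0,5,3,4.
- The key question is how to encode the positive integers 5, 3, 4 that give the run lengths.
- Note also that RLE pays off only when the same bit tends to repeat in long runs.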
Elias gamma coding
The run lengths are encoded with Elias gamma coding:
A positive integer $k$ is encoded as $\lfloor \log_2 k \rfloor$ zeros followed by the binary representation of $k$.
For example:
$k$ | $\lfloor \log_2 k \rfloor$ | binary representation of $k$ | code |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 10 | 010 |
3 | 1 | 11 | 011 |
4 | 2 | 100 | 00100 |
5 | 2 | 101 | 00101 |
6 | 2 | 110 | 00110 |
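The rule behind the table can be written down directly in Python. The sketch below works on bit strings (Python `str` of "0" and "1") for readability; the function names are hypothetical, and a real implementation would operate on a bit stream.

```python
def elias_gamma_encode(k):
    """Return the Elias gamma code of a positive integer k as a bit string."""
    if k < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(k)[2:]                       # binary representation of k
    return "0" * (len(binary) - 1) + binary   # floor(log2 k) zeros, then k


def elias_gamma_decode(bits):
    """Decode one gamma code from the front of bits; return (k, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":                 # count the leading zeros
        zeros += 1
    end = 2 * zeros + 1                       # the code is 2*zeros + 1 bits long
    return int(bits[zeros:end], 2), bits[end:]
```

The leading zeros tell the decoder how many bits of $k$ follow, so several codes can be concatenated without separators.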
Pseudocode
# encoding
S: input-stream of bits, C: output-stream
b <- S.top(); C.append(b)
while S is non-empty do
    k <- 1; S.pop() // the first bit of the run
    // get length of run
    while (S is non-empty and S.top() == b) do
        ++k; S.pop()
    // compute and append Elias gamma code
    K <- empty string
    while k > 1
        C.append(0)
        K.prepend(k mod 2)
        k <- floor(k/2)
    K.prepend(1) // K is binary encoding of k.
    C.append(K)
    b <- 1-b
# decoding
C: input-stream of bits, S: output-stream
b <- C.pop()
while C is non-empty
    len <- 0
    while C.pop() == 0 do ++len
    k <- 1
    for (j <- 1 to len) do k <- k*2 + C.pop()
    for (j <- 1 to k) do S.append(b)
    b <- 1-b
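A round trip through both procedures can be sketched in Python. The sketch assumes non-empty bit strings as input and output rather than streams, and the names `gamma`, `rle_encode` and `rle_decode` are hypothetical.

```python
def gamma(k):
    """Elias gamma code of a positive integer k as a bit string."""
    binary = bin(k)[2:]
    return "0" * (len(binary) - 1) + binary


def rle_encode(s):
    """Encode a non-empty bit string: first bit, then gamma-coded run lengths."""
    out = [s[0]]
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # extend the current run
            j += 1
        out.append(gamma(j - i))             # run length is j - i
        i = j
    return "".join(out)


def rle_decode(c):
    """Invert rle_encode."""
    bit = c[0]
    pos = 1
    out = []
    while pos < len(c):
        zeros = 0
        while c[pos] == "0":                 # leading zeros give the code length
            zeros += 1
            pos += 1
        k = int(c[pos:pos + zeros + 1], 2)   # the next zeros + 1 bits spell k
        pos += zeros + 1
        out.append(bit * k)
        bit = "1" if bit == "0" else "0"     # runs alternate between 0 and 1
    return "".join(out)
```

For S = 000001110000 the result is 00010101100100, which is 14 bits for a 12-bit input, whereas a run of 100 equal bits shrinks to 14 bits. This matches the remark that RLE only helps when runs are long.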
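Lempel-Ziv-Welch algorithm (LZW)
Basics
Huffman coding and RLE take advantage of frequent or repeated single characters.
In certain settings, some substrings occur very frequently:
- English text:
These digraphs occur frequently: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
These trigraphs occur frequently: THE, AND, THA, ENT, ION, TIO, FOR, NDE
- HTML: <a href, <img src, <br>
- Video: the background repeated across frames
Lempel-Ziv-Welch compression takes advantage of such substrings without needing to know beforehand what they are.
Implementation
The process is as follows: start with a dictionary D that maps the 128 ASCII characters to the code numbers 0 to 127. Repeatedly find the longest prefix of the remaining input that is in D, output its code number, and add that prefix extended by the next input character to D under the next free code number. The decoder rebuilds the same dictionary while it reads the codes.
Example: ANANAN is encoded as 65, 78, 128, 128, with AN = 128, NA = 129 and ANA = 130 added to D along the way.
Pseudocode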
# encoding
S: input-stream of characters, C: output-stream
Initialize dictionary D with ASCII in a trie
idx <- 128
while S is non-empty do
    v <- root of trie D
    while (S is non-empty and v has a child c labelled S.top())
        v <- c; S.pop()
    C.append(codenumber stored at v)
    if S is non-empty
        create child of v labelled S.top() with codenumber idx
        ++idx
# decoding
C: input-stream of integers, S: output-stream
D <- dictionary that maps {0,...,127} to ASCII
idx <- 128
code <- C.pop(); s <- D(code); S.append(s)
while there are more codes in C do
    s_prev <- s; code <- C.pop()
    if code < idx
        s <- D(code)
    else if code == idx
        s <- s_prev + s_prev[0]
    else FAIL // Encoding was invalid
    S.append(s)
    D.insert(idx, s_prev + s[0])
    ++idx
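Both directions can be sketched in Python. Note one deliberate simplification: the encoder below stores the dictionary as a plain `dict` keyed by strings instead of a trie. This works because every prefix of a dictionary entry is itself an entry, so extending the match one character at a time finds the longest match. The function names are hypothetical, and the input is assumed to be a non-empty ASCII string.

```python
def lzw_encode(text):
    """Return the list of code numbers for a non-empty ASCII string."""
    dictionary = {chr(i): i for i in range(128)}
    next_code = 128
    codes = []
    i = 0
    while i < len(text):
        # Find the longest dictionary entry that matches at position i.
        j = i + 1
        while j < len(text) and text[i:j + 1] in dictionary:
            j += 1
        codes.append(dictionary[text[i:j]])
        if j < len(text):
            # Add the match extended by the next character.
            dictionary[text[i:j + 1]] = next_code
            next_code += 1
        i = j
    return codes


def lzw_decode(codes):
    """Invert lzw_encode, rebuilding the dictionary one step behind the encoder."""
    dictionary = {i: chr(i) for i in range(128)}
    next_code = 128
    s = dictionary[codes[0]]
    out = [s]
    for code in codes[1:]:
        s_prev = s
        if code < next_code:
            s = dictionary[code]
        elif code == next_code:
            s = s_prev + s_prev[0]   # code refers to the entry being built
        else:
            raise ValueError("invalid code sequence")
        out.append(s)
        dictionary[next_code] = s_prev + s[0]
        next_code += 1
    return "".join(out)
```

The input AAAA exercises the special case `code == next_code`: it encodes to 65, 128, 65, and the decoder sees code 128 before it has finished adding that entry.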
bzip2
Move-to-Front transform (MTF)
# encoding
L <- array with S_char in some pre-agreed, fixed order (usually ASCII)
while S is not empty do
    c <- S.pop()
    i <- index such that L[i] = c
    C.append(i)
    for j = i - 1 down to 0
        swap L[j] and L[j+1]
# decoding
L <- array with S_char in some pre-agreed, fixed order (usually ASCII)
while C is not empty do
    i <- next integer from C
    S.append(L[i])
    for j = i - 1 down to 0
        swap L[j] and L[j+1]
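The same transform in Python, as a minimal sketch. The alphabet is passed in explicitly as the pre-agreed order, and the function names are hypothetical. Instead of the chain of swaps, the symbol is removed and reinserted at the front, which leaves the list in the same state.

```python
def mtf_encode(text, alphabet):
    """Replace each character by its current index, then move it to the front."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        i = symbols.index(ch)
        out.append(i)
        symbols.insert(0, symbols.pop(i))  # same effect as the swaps above
    return out


def mtf_decode(indices, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for i in indices:
        out.append(symbols[i])
        symbols.insert(0, symbols.pop(i))
    return "".join(out)
```

With the alphabet abc, the text aabbbc becomes 0, 0, 1, 0, 0, 2: repeated characters turn into runs of zeros.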
Burrows-Wheeler transform (BWT)
# decoding
C: string of characters, S: output-stream
A <- array of size n // leftmost column
for i = 0 to n-1
    A[i] <- (C[i], i) // store character and index
Stably sort A by character
for j = 0 to n-1 // where is the $-char?
    if C[j] == $ break
repeat
    S.append(character stored in A[j])
    j <- index stored in A[j]
until we have appended $
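The decoding procedure can be checked with a small Python sketch. The notes only give the decoder, so the encoder here is an assumption: the usual naive construction that appends the sentinel $, sorts all rotations, and takes the last column. It also assumes the text does not contain $ and that $ sorts before every other character, which holds for ASCII letters. The function names are hypothetical.

```python
def bwt_encode(text):
    """Naive BWT: last column of the sorted rotations of text + '$'."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def bwt_decode(c):
    """Invert the BWT following the decoding pseudocode above."""
    # A holds (character, index in c). sorted() is stable, so equal
    # characters keep their relative order from c.
    a = sorted(((ch, i) for i, ch in enumerate(c)), key=lambda pair: pair[0])
    j = c.index("$")              # row whose last character is the sentinel
    out = []
    while True:
        ch, j = a[j]              # append the character, follow the stored index
        out.append(ch)
        if ch == "$":
            break
    return "".join(out[:-1])      # drop the end-of-text sentinel
```

For banana the encoder produces annb$aa, and the decoder recovers banana from it. Sorting all rotations takes quadratic space, so this encoder is only meant for small examples.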