Data Compression

Reference:

Elements of Information Theory, 2nd Edition

Slides of EE4560, TUD


Problem description:

Let $X_1, X_2, \cdots, X_n$ be independent, identically distributed random variables drawn from the probability mass function $p(x)$. We wish to find short descriptions for such sequences of random variables.

How to solve?

Assign short descriptions to the most frequent outcomes of the data source and, necessarily, longer descriptions to the less frequent outcomes.

Most frequent outcomes? $\to$ Typical sequences $\to$ AEP $\to$ Data compression

Consequences of the AEP: Data Compression

We divide all sequences in $\mathcal X^n$ into two sets: the typical set $A_\epsilon^{(n)}$ and its complement, the non-typical set $\overline{A_\epsilon^{(n)}}$.

  • Order all elements in $A_\epsilon^{(n)}$ and $\overline{A_\epsilon^{(n)}}$ and represent each element by an index
  • Since $\left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$, indexing the sequences in $A_\epsilon^{(n)}$ requires no more than $n(H(X)+\epsilon)+1$ bits; the extra bit is needed in case $n(H(X)+\epsilon)$ is not an integer
  • Since $\left|\overline{A_\epsilon^{(n)}}\right| \leq |\mathcal X|^n$, we can index each sequence in $\overline{A_\epsilon^{(n)}}$ using no more than $n\log|\mathcal X|+1$ bits, where $|\mathcal X|$ is the cardinality (number of elements) of the source alphabet
  • To distinguish between $A_\epsilon^{(n)}$ and $\overline{A_\epsilon^{(n)}}$, we need one additional bit; the sketch below adds up these contributions
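
As a small numerical illustration (not taken from the slides), the following sketch, assuming a Bernoulli(0.2) source and arbitrarily chosen $n$ and $\epsilon$, adds up the description lengths of the two-set scheme above: $n(H(X)+\epsilon)+2$ bits for a typical sequence and $n\log|\mathcal X|+2$ bits for a non-typical one.

```python
import math

# Hypothetical example: Bernoulli(p) source over a binary alphabet, |X| = 2.
p = 0.2        # P(X = 1); chosen only for illustration
n = 1000       # block length
eps = 0.05     # epsilon in the definition of the typical set

# Entropy of the source in bits per symbol.
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

alphabet_size = 2

# Typical sequence: n(H + eps) + 1 index bits, plus 1 flag bit.
bits_typical = n * (H + eps) + 1 + 1

# Non-typical sequence: n log|X| + 1 index bits, plus 1 flag bit.
bits_nontypical = n * math.log2(alphabet_size) + 1 + 1

print(f"H(X) = {H:.3f} bits/symbol")
print(f"typical sequence:     <= {bits_typical:.1f} bits (~{bits_typical / n:.3f} bits/symbol)")
print(f"non-typical sequence: <= {bits_nontypical:.1f} bits (~{bits_nontypical / n:.3f} bits/symbol)")
```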


Note the following features of the coding scheme:

  • The typical sequences have short descriptions of length $\approx nH(X)$

  • We used a brute-force method to enumerate the elements in $\overline{A_\epsilon^{(n)}}$, without taking into account the fact that the number of elements in $\overline{A_\epsilon^{(n)}}$ is less than the number of elements in $\mathcal X^n$

  • The code is one-to-one and easily decodable; the initial bit acts as a flag bit to indicate the length of the codeword that follows

We use the notation $x^n$ to denote a sequence $x_1, x_2, \ldots, x_n$. Let $l(x^n)$ be the length of the codeword corresponding to $x^n$. If $n$ is sufficiently large so that $\Pr\left\{A_\epsilon^{(n)}\right\} \geq 1-\delta$, the expected length of the codeword is
$$
\begin{aligned}
E\left(l\left(X^{n}\right)\right) &= \sum_{x^{n}} p\left(x^{n}\right) l\left(x^{n}\right)\\
&= \sum_{x^{n}\in A_\epsilon^{(n)}} p\left(x^{n}\right) l\left(x^{n}\right)+\sum_{x^{n}\in \overline{A_\epsilon^{(n)}}} p\left(x^{n}\right) l\left(x^{n}\right)\\
&\le \sum_{x^{n}\in A_\epsilon^{(n)}} p\left(x^{n}\right) \left(n(H(X)+\epsilon)+2\right)+\sum_{x^{n}\in \overline{A_\epsilon^{(n)}}} p\left(x^{n}\right) \left(n\log |\mathcal X|+2\right)\\
&= \Pr\left\{A_{\epsilon}^{(n)}\right\}\left(n(H(X)+\epsilon)+2\right)+\Pr\left\{\overline{A_{\epsilon}^{(n)}}\right\}\left(n\log |\mathcal X|+2\right)\\
&\le \left(n(H(X)+\epsilon)+2\right)+\delta\left(n\log |\mathcal X|+2\right)\\
&= n\left[H(X)+\epsilon+\frac{2}{n}+\delta\left(\log |\mathcal X|+\frac{2}{n}\right)\right]\\
&= n\left[H(X)+\epsilon'\right]
\end{aligned}
$$
where $\epsilon'$ can be made arbitrarily small by an appropriate choice of $n$ (and of $\epsilon$ and $\delta$). Hence we have proved the following theorem.

Theorem 1:

Let $X^n$ be i.i.d. $\sim p(x)$. Let $\epsilon > 0$. Then there exists a code that maps sequences $x^n$ of length $n$ into binary strings such that the mapping is one-to-one (and therefore invertible) and
$$
E\left[\frac{1}{n} l\left(X^{n}\right)\right] \leq H(X)+\epsilon \tag{1}
$$
for $n$ sufficiently large.

Thus, we can represent sequences $X^n$ using $nH(X)$ bits on average.

How can the probability of error be made arbitrarily small? And what if the code alphabet is not binary?

Theorem 2 (Source coding theorem):

Given a discrete memoryless i.i.d. source $\{X_n, n\in \mathbb Z\} \sim p(x^n)$, we can encode source messages of length $n$ into codewords of length $l$ from a code alphabet of size $r$ with arbitrarily small probability of error $P_e \le \delta$ if and only if
$$
r^l \ge 2^{n(H(X)+\epsilon)} \tag{2}
$$
Proof:

The number of elements in $A_\epsilon^{(n)}$ satisfies $\left|A_\epsilon^{(n)}\right| \le 2^{n(H(X)+\epsilon)} \le r^l$, so that the number of codewords is larger than the number of typical source words, and $P_e \le \Pr\left(\overline{A_\epsilon^{(n)}}\right) \le \delta$ (by the AEP).
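
Rearranged, condition (2) says $l \ge n(H(X)+\epsilon)/\log_2 r$. A one-line check of this requirement, with the values of $n$, $H(X)$, $\epsilon$, and $r$ assumed purely for illustration:

```python
import math

n, H, eps, r = 1000, 0.72, 0.05, 4   # assumed values, for illustration only
l_min = math.ceil(n * (H + eps) / math.log2(r))   # smallest l with r**l >= 2**(n*(H+eps))
print(f"codewords must have length l >= {l_min} symbols over an alphabet of size {r}")
```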

Remark:

  • Code construction based on typical sets requires long source sequences ($n\to \infty$).
  • For short sequences, this coding recipe leads to an inefficient representation of the information.
  • Can we do better than this for small $n$? (see the Shannon Code and Huffman Code sections below)

Source Codes

Information produced by a discrete information source is represented using the alphabet $\mathcal X = \{x_1, \cdots, x_k\}$.

Definition 1 (Source code):

A source code $C$ for a random variable $X$ is a mapping $C : \mathcal X \mapsto \mathcal C$, the set of finite-length strings of symbols from an $r$-ary alphabet. $C(x)$ denotes the codeword corresponding to $x$, and its length is denoted by $l(x)$.

Definition 2 (Extension):

The extension $C^{*}$ of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings over the code alphabet, defined by
$$
C\left(x_{1} x_{2} \cdots x_{n}\right)=C\left(x_{1}\right) C\left(x_{2}\right) \cdots C\left(x_{n}\right)
$$
where $C\left(x_{1}\right) C\left(x_{2}\right) \cdots C\left(x_{n}\right)$ indicates concatenation of the corresponding codewords.

E.g., if $C\left(x_{1}\right)=00$ and $C\left(x_{2}\right)=11$, then $C\left(x_{1} x_{2}\right)=0011$.



Definition 3 (Non-singular):

A non-singular code is a code that uniquely maps each of the source symbols $x\in \mathcal X$ into a codeword $C(x)$. That is,
$$
x_i \ne x_j \Longrightarrow C(x_i) \ne C(x_j)
$$
Definition 4 (Uniquely decodable):

A code is uniquely decodable if and only if its $n$-extension is non-singular for all $n$. That is,
$$
\{x_1,\cdots,x_n\}_1 \ne \{x_1,\cdots,x_n\}_2 \Longrightarrow [C(x_1),\cdots,C(x_n)]_1 \ne [C(x_1),\cdots,C(x_n)]_2
$$
Definition 5 (Prefix/Instantaneous code):

A code is called a prefix or instantaneous code if no codeword is a prefix of any other codeword. It can be decoded without reference to future codewords since the end of a codeword is immediately recognizable.

Remark: A prefix code can be represented by an $r$-ary tree ($r$ is the size of the code alphabet), where each codeword corresponds to a leaf of the pruned tree.


The branches of the tree represent the symbols of the codeword. For example, the $r$ branches arising from the root node represent the $r$ possible values of the first symbol of the codeword. Then each codeword is represented by a leaf on the tree. The path from the root traces out the symbols of the codeword. The prefix condition on the codewords implies that no codeword is an ancestor of any other codeword on the tree. Hence, each codeword eliminates its descendants as possible codewords.


Examples:

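A small sketch with hypothetical codes for a four-symbol alphabet (not the codes from the slides), checking the non-singular condition of Definition 3 and the prefix condition of Definition 5 directly:

```python
# Hypothetical example codes for symbols x1..x4 (not the ones from the slides).
codes = {
    "singular":      {"x1": "0", "x2": "0",   "x3": "1",   "x4": "1"},
    "non-singular":  {"x1": "0", "x2": "010", "x3": "01",  "x4": "10"},
    "instantaneous": {"x1": "0", "x2": "10",  "x3": "110", "x4": "111"},
}

def is_non_singular(code):
    """Definition 3: distinct source symbols get distinct codewords."""
    words = list(code.values())
    return len(set(words)) == len(words)

def is_prefix_free(code):
    """Definition 5: no codeword is a prefix of any other codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

for name, code in codes.items():
    print(f"{name:13s}  non-singular: {is_non_singular(code)}  prefix-free: {is_prefix_free(code)}")
```

The second code is non-singular but not instantaneous, since 0 is a prefix of 01 and 010; the third is a prefix code.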

N.B. Morse code is non-singular, uniquely decodable, but not instantaneous.

Kraft Inequality

We wish to construct instantaneous codes of minimum expected length to describe a given source. It is clear that we cannot assign short codewords to all source symbols and still be prefix-free. The set of codeword lengths possible for instantaneous codes is limited by the following inequality.

Theorem 3 (Kraft inequality):

For any instantaneous code over an alphabet of size $r$, the codeword lengths $l(x_1),\cdots,l(x_k)$ must satisfy the inequality
$$
\sum_{i=1}^k r^{-l(x_i)} \le 1 \tag{3}
$$
Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Proof:

Let $l_{\max}$ be the length of the longest codeword. A codeword at level $l\left(x_{i}\right)$ has $r^{l_{\max}-l\left(x_{i}\right)}$ descendants at level $l_{\max}$. In order to be a prefix code, each of these descendant sets must be disjoint, and the total number of nodes in these sets is at most $r^{l_{\max}}$. Hence, summing over all the codewords, we obtain
$$
\sum_{i=1}^{k} r^{l_{\max}-l\left(x_{i}\right)} \leq r^{l_{\max}}
$$
and thus
$$
\sum_{i=1}^{k} r^{-l\left(x_{i}\right)} \leq 1
$$
N.B. For any countably infinite set of codewords that form a prefix code, the codeword lengths also satisfy the Kraft inequality, i.e.,
$$
\sum_{i=1}^\infty r^{-l(x_i)} \le 1
$$
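
A minimal sketch of both directions of Theorem 3, under assumed codeword lengths: it computes the Kraft sum exactly and, when the inequality holds, realizes the converse by assigning consecutive $r$-ary intervals to the lengths in increasing order (one standard construction, not necessarily the one in the slides).

```python
from fractions import Fraction

def kraft_sum(lengths, r=2):
    """Left-hand side of the Kraft inequality (3), computed exactly."""
    return sum(Fraction(1, r ** l) for l in lengths)

def prefix_code_from_lengths(lengths, r=2):
    """Build a prefix code with the given lengths (assumes the Kraft sum is <= 1).

    Codewords are assigned in order of increasing length: the codeword of
    length l is the l-digit r-ary expansion of the running Kraft sum, so the
    codeword intervals are consecutive and disjoint.
    """
    assert kraft_sum(lengths, r) <= 1, "lengths violate the Kraft inequality"
    code, acc = [], Fraction(0)
    for l in sorted(lengths):
        digits, frac = [], acc
        for _ in range(l):                  # first l r-ary digits of acc
            frac *= r
            digits.append(str(int(frac)))
            frac -= int(frac)
        code.append("".join(digits))
        acc += Fraction(1, r ** l)          # move past this codeword's interval
    return code

lengths = [1, 2, 3, 3]                      # assumed lengths; Kraft sum equals 1
print(kraft_sum(lengths))                   # 1
print(prefix_code_from_lengths(lengths))    # ['0', '10', '110', '111']
```

For the lengths $(1,2,3,3)$ and $r=2$ the Kraft sum equals 1, and the construction returns the prefix code $0, 10, 110, 111$.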

Optimal Codes

What is the minimum expected length of the prefix code? How do we find the prefix code with minimum expected length?

Bounds on the optimal code length

Consider a constrained optimization problem:
$$
\min_{l(x_i)} \sum_{i=1}^k p(x_i)\,l(x_i) \quad \text{subject to } \sum_{i=1}^k r^{-l(x_i)} \le 1 \tag{4}
$$
Lagrange multiplier technique:
$$
\min_{l\left(x_{i}\right)} J\left(l\left(x_{i}\right), \lambda\right)=\min_{l\left(x_{i}\right)}\left(\sum_{i=1}^{k} p\left(x_{i}\right) l\left(x_{i}\right)+\lambda\left(\sum_{i=1}^{k} r^{-l\left(x_{i}\right)}-1\right)\right)
$$
Hence
$$
\begin{aligned}
&\frac{\partial J}{\partial l\left(x_{i}\right)}=p\left(x_{i}\right)-\lambda r^{-l\left(x_{i}\right)} \ln r = 0 \quad \Rightarrow \quad r^{-l^{*}\left(x_{i}\right)}=\frac{p\left(x_{i}\right)}{\lambda \ln r} \\
&\sum_{i=1}^{k} r^{-l^{*}\left(x_{i}\right)} \leq 1 \;\Rightarrow\; \lambda \ln r \geq 1 \;\Rightarrow\; l^{*}\left(x_{i}\right) \geq -\log_{r} p\left(x_{i}\right)
\end{aligned}
$$
The average codelength, $E\,l(X)$, then becomes
$$
\sum_{i=1}^{k} p\left(x_{i}\right) l^{*}\left(x_{i}\right) \geq -\sum_{i=1}^{k} p\left(x_{i}\right) \log_{r} p\left(x_{i}\right)=H(X)
$$
As a consequence, we have that for any instantaneous code
$$
\operatorname{E} l(X) \geq H(X) \tag{5}
$$
with equality iff $r^{-l^{*}\left(x_{i}\right)}=p\left(x_{i}\right)$.

In the case that $-\log p\left(x_{i}\right)$ is not an integer, we should choose a set of codeword lengths "close" to the optimal set. Shannon suggested rounding up to the nearest integer:
$$
-\log p\left(x_{i}\right) \leq l\left(x_{i}\right) < -\log p\left(x_{i}\right)+1
$$
This choice satisfies Kraft's inequality, and we conclude that the expected length of the optimal code for a given source distribution satisfies
$$
H(X) \leq E\, l(X) \leq H(X)+1 \tag{6}
$$

  • There is an overhead of at most 1 bit per symbol, due to the fact that $-\log p\left(x_{i}\right)$ is not always an integer
  • The overhead can be reduced by combining symbols into sequences

Encoding of sequences of length $n$:
$$
H\left(X_{1}, \ldots, X_{n}\right) \leq E\, l\left(X_{1}, \ldots, X_{n}\right)<H\left(X_{1}, \ldots, X_{n}\right)+1
$$
Define $L_n$ to be the expected codeword length per input symbol, that is,
$$
L_n=\frac{1}{n}E\,l(X_1,\cdots,X_n) \tag{7}
$$
Assuming symbols are drawn i.i.d. according to $p\left(x^{n}\right)$, we have that $H\left(X_{1}, \ldots, X_{n}\right)=n H(X)$ and we conclude that
$$
H(X) \leq L_n<H(X)+\frac{1}{n} \tag{8}
$$

Equation (8) parallels the bound in (1): in both cases, roughly $H(X)$ bits per source symbol suffice once long blocks are encoded.
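
A small sketch of this effect, assuming a Bernoulli(0.9) source: it applies the rounded-up lengths $\lceil -\log_2 p \rceil$ from the previous subsection to blocks of $n$ symbols and prints the per-symbol rate $L_n$, which falls toward $H(X)$ as $n$ grows.

```python
import math
from itertools import product

p1 = 0.9                                  # assumed P(X = 1)
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

def per_symbol_rate(n):
    """L_n when each block of n symbols gets a codeword of length ceil(-log2 p(block))."""
    total = 0.0
    for block in product([0, 1], repeat=n):
        p = math.prod(p1 if b == 1 else 1 - p1 for b in block)
        total += p * math.ceil(-math.log2(p))
    return total / n

print(f"H(X) = {H:.4f} bits/symbol")
for n in (1, 2, 4, 8):
    print(f"n = {n}: L_n = {per_symbol_rate(n):.4f}")
```

These block lengths satisfy the Kraft inequality, so a prefix code with them exists, and the resulting $L_n$ obeys the bound (8).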

For a sequence of symbols that is not necessarily i.i.d., we have the bound
$$
\frac{H\left(X_{1}, \ldots, X_{n}\right)}{n} \leq L_n<\frac{H\left(X_{1}, \ldots, X_{n}\right)}{n}+\frac{1}{n} \tag{9}
$$
For stationary processes, we have the entropy rate
$$
H_\infty (X)=\lim_{n\to \infty} \frac{H(X_1,\cdots,X_n)}{n}
$$
Therefore, $L_n \rightarrow H_{\infty}(X)$ as $n \rightarrow \infty$, which provides another justification for the definition of the entropy rate: it is the expected number of bits per symbol required to describe the process.




Do there exist uniquely decodable, non-instantaneous codes that achieve shorter expected codelengths?

We have the following result (by McMillan):

The codeword lengths of any uniquely decodable code must satisfy the Kraft inequality.

This rather surprising result implies that the class of uniquely decodable codes does not offer any further choices for the set of codeword lengths than the class of prefix codes!

Shannon Code

In Shannon coding, the symbols are arranged in order from most probable to least probable and assigned codewords by taking the first $l_i=\lceil -\log p(x_i) \rceil$ bits from the binary expansion of the cumulative probability $F(x_i)=\sum\limits_{j=1}^{i-1}p(x_j)$.
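
A minimal sketch of this construction, with the input distribution assumed purely for illustration: sort symbols by decreasing probability, accumulate $F(x_i)$, and emit the first $\lceil -\log_2 p(x_i)\rceil$ bits of its binary expansion.

```python
import math

def shannon_code(pmf):
    """Shannon code: first ceil(-log2 p) bits of the cumulative probability."""
    # Sort symbols from most probable to least probable.
    symbols = sorted(pmf, key=pmf.get, reverse=True)
    code, F = {}, 0.0
    for x in symbols:
        l = math.ceil(-math.log2(pmf[x]))
        # Take the first l bits of the binary expansion of F.
        bits, frac = [], F
        for _ in range(l):
            frac *= 2
            bits.append("1" if frac >= 1 else "0")
            frac -= int(frac)
        code[x] = "".join(bits)
        F += pmf[x]                     # cumulative probability of symbols so far
    return code

pmf = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}   # assumed distribution
print(shannon_code(pmf))
```

For the distribution in the sketch this yields the prefix code 00, 01, 100, 101, 110.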


The Shannon code is asymptotically optimal as $n\to \infty$. However, for finite $n$ the Shannon code may be much worse than the optimal code for some particular symbol.

For example, let $p(x_1)=0.99$ and $p(x_2)=0.01$. Obviously, an optimal code is $C(x_1)=0$ and $C(x_2)=1$. The Shannon code, though asymptotically optimal, assigns a codeword of length $\lceil \log 100 \rceil=7$ to $x_2$. Note that in this case $H(X)=0.08$ and $1< E\,l(X)=1.06 < H(X)+1$.

Huffman Code

Theorem 4 (Huffman coding):

Huffman coding is optimal, i.e., if $C^*$ is a Huffman code and $C'$ is any other uniquely decodable code, then $E\,l(C^*)\le E\,l(C')$.

Remarks:

  • The lengths are ordered inversely with the probabilities
  • The two longest codewords have the same length
  • Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols

How to construct Huffman codes:

  • For the binary case, the Huffman code arranges the messages in order of decreasing probability and joins the two least probable source symbols together, resulting in a new message alphabet with one less symbol.
  • The new messages are reordered, after which the two least probable symbols are again joined together.
  • Repeat until only two symbols remain, which are joined into a single node (the root).
  • At every joining step, assign 0 and 1 to the two probabilities being joined.
  • To read off a codeword, start at a symbol and follow the joins up to the root, collecting the assigned 0s and 1s along the route; the codeword is this bit string in reversed order (see the sketch below).
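
A minimal heap-based sketch of this procedure (a standard formulation, not necessarily the slides'), with the input distribution assumed for illustration: the two least probable entries are repeatedly joined and 0 and 1 are prepended on the two joined branches, so prepending handles the "reverse" step automatically.

```python
import heapq
from itertools import count

def huffman_code(pmf):
    """Binary Huffman code for a distribution given as {symbol: probability}."""
    tiebreak = count()                       # breaks ties without comparing dicts
    # Each heap entry: (probability, tiebreak, {symbol: partial codeword}).
    heap = [(p, next(tiebreak), {x: ""}) for x, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)  # least probable
        p1, _, group1 = heapq.heappop(heap)  # second least probable
        merged = {}
        for x, word in group0.items():       # branch labelled 0
            merged[x] = "0" + word
        for x, word in group1.items():       # branch labelled 1
            merged[x] = "1" + word
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}   # assumed distribution
code = huffman_code(pmf)
avg_length = sum(pmf[x] * len(code[x]) for x in pmf)
print(code)
print(f"expected length = {avg_length:.2f} bits/symbol")
```

For the assumed distribution the expected length comes out to 2.20 bits per symbol, against an entropy of about 2.12 bits.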


N.B. The result is not unique: ties in the joining step may be broken differently, yielding a different code tree (and different codewords) with the same expected length.

Observations:

  • Huffman coding is not ideal, since it is a bottom-up approach that requires calculating the probabilities of all source sequences and constructing the corresponding complete code tree.
  • It cannot easily be extended to longer block lengths without redoing all the calculations.

Arithmetic Coding

The Huffman coding procedure described above is optimal for encoding a random variable with a known distribution symbol by symbol. However, because the codeword lengths of a Huffman code are restricted to be integers, there can be a loss of up to 1 bit per symbol in coding efficiency. We could alleviate this loss by using blocks of input symbols; however, the complexity of this approach increases exponentially with the block length. We now describe a method of encoding without this inefficiency. In arithmetic coding, instead of using a separate sequence of bits for each symbol, we represent the source sequence by a subinterval of the unit interval.

N.B. The Huffman code is still optimal in the sense that, if the whole sequence is encoded as one block rather than symbol by symbol, no code can achieve a shorter expected length.
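
A minimal sketch of the interval idea only, with the distribution and input string assumed for illustration (the bit-output and decoding procedures are in the slides cited below): each successive symbol narrows the current interval in proportion to its probability, and roughly $\lceil -\log_2(\text{interval width}) \rceil + 1$ bits are enough to name a point inside the final interval.

```python
import math

def arithmetic_interval(sequence, pmf):
    """Narrow [low, high) once per symbol, in proportion to the symbol probabilities."""
    symbols = sorted(pmf)                    # fixed symbol order defines the subintervals
    low, high = 0.0, 1.0
    for s in sequence:
        width = high - low
        offset = 0.0
        for x in symbols:                    # locate the subinterval of s
            if x == s:
                low, high = low + offset * width, low + (offset + pmf[x]) * width
                break
            offset += pmf[x]
    return low, high

pmf = {"a": 0.7, "b": 0.2, "c": 0.1}         # assumed distribution
low, high = arithmetic_interval("aabac", pmf)
bits = math.ceil(-math.log2(high - low)) + 1 # enough bits to name a point inside
print(f"interval = [{low:.6f}, {high:.6f}), about {bits} bits for the whole sequence")
```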

[Encoding and Decoding details: Slides 40-52]
