Reference:
Elements of Information Theory, 2nd Edition
Slides of EE4560, TUD
Content
For a stationary discrete source, the minimum number of bits needed to represent the source signal with arbitrarily small probability of error is given by the entropy rate $H_\infty(X)$.
In many situations, however, it is not necessary to perfectly represent the source signal.
For instance, the description of an arbitrary real number requires an infinite number of bits, so a finite representation of a continuous random variable can never be perfect.
How well can we do? $\to$ Define the “goodness” of a representation of a source $\to$ Define a distortion measure.
Given a source distribution and a distortion measure,
- What is the minimum expected distortion achievable at a particular bit rate? $D(R)$
- What is the minimum rate description required to achieve a particular distortion? $R(D)$
Quantization
Let $\hat X(X)$ denote the representation of the random variable $X$. Using $R$ bits to represent $X$, the function $\hat X$ can take on $2^R$ values.
Problem: find the optimal set of values for $\hat X$ and the regions associated with each value of $\hat X$.
An $L$-level quantizer is characterized by a set of $L+1$ decision levels or decision thresholds $x_0<x_1<\cdots<x_L$ and a set $\hat{\mathcal X}=\{\hat x_k,\ k=1,\cdots,L\}$ such that $\hat x=\hat x_k$ if and only if $x_{k-1}\le x<x_k$, where $x_0=-\infty$ and $x_L=\infty$.
The numbers $\hat x_k$ are called the reconstruction values or reproduction levels, and the intervals $\mathcal C_k=[x_{k-1},x_k)$ are usually referred to as the decision intervals or quantization cells.
The map $\hat X:\mathcal X\mapsto\hat{\mathcal X}$, given by
$$\hat X(x)=\hat x_k\quad\text{for }x\in\mathcal C_k,\ k=1,\cdots,L,$$
is a staircase function by definition.
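As a sketch, the staircase map can be written directly in code. The thresholds and reproduction levels below are hypothetical values for a 4-level quantizer, chosen only for illustration:

```python
import numpy as np

# Hypothetical 4-level quantizer: interior thresholds x_1 < x_2 < x_3
# (x_0 = -inf and x_L = +inf are implicit) and levels xhat_1 .. xhat_4.
thresholds = np.array([-1.0, 0.0, 1.0])
levels = np.array([-1.5, -0.5, 0.5, 1.5])

def quantize(x):
    """Map each x to the reproduction level of the cell C_k containing it."""
    # side="right" implements the half-open convention x_{k-1} <= x < x_k
    k = np.searchsorted(thresholds, x, side="right")
    return levels[k]

print(quantize(np.array([-2.3, -0.2, 0.7, 5.0])))  # -> [-1.5 -0.5  0.5  1.5]
```

Plotting `quantize` over a range of inputs would reproduce the staircase shape described above.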
In order to find an optimal quantizer, that is, to find optimal decision and reproduction levels, we need a rule for quantitatively assigning a distortion value to every possible approximation of the source samples.
Definition 1 (distortion measure):
A distortion function or distortion measure is a mapping
$$d:\mathcal X\times\hat{\mathcal X}\mapsto\mathbb R^{+}$$
from the set of source alphabet–reproduction alphabet pairs into the set of nonnegative numbers. The distortion $d(x,\hat x)$ is a measure of the cost of representing the symbol $x$ by the symbol $\hat x$.
Examples:
- Hamming distortion (probability-of-error distortion measure)
$$d(x,\hat x)=\begin{cases}0&\text{if }x=\hat x\\ 1&\text{if }x\neq\hat x\end{cases}$$
$$E\,d(X,\hat X)=\Pr(X=\hat X)\cdot 0+\Pr(X\neq\hat X)\cdot 1=\Pr(X\neq\hat X)$$
- Squared-error distortion
$$d(x,\hat x)=(x-\hat x)^2$$
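Both measures are one-liners in code; the small sketch below uses made-up symbol sequences to show that the empirical mean of the Hamming distortion is exactly the empirical error probability:

```python
import numpy as np

def hamming_distortion(x, xhat):
    """0 where symbols agree, 1 where they differ; E d = Pr(X != Xhat)."""
    return np.where(np.asarray(x) == np.asarray(xhat), 0, 1)

def squared_error(x, xhat):
    return (np.asarray(x) - np.asarray(xhat)) ** 2

# Illustrative binary sequences: mismatches at positions 2 and 4
x    = np.array([0, 1, 1, 0, 1])
xhat = np.array([0, 1, 0, 0, 0])
print(hamming_distortion(x, xhat).mean())  # empirical Pr(X != Xhat) = 0.4
print(squared_error(x, xhat).mean())       # coincides here, since errors are +/-1
```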
Assume a squared-error distortion measure. What are the optimal reproduction levels and optimal quantization cells?
That is, we wish to find the function $\hat X(X)$ such that $\hat X$ takes on at most $L=2^R$ values and minimizes $E(X-\hat X)^2$:
$$E(X-\hat X)^2=\sum_{k=1}^{L}\int_{\mathcal C_k}(x-\hat x_k)^2\,p(x)\,dx \tag{1}$$
- If the quantization cells $\mathcal C_k$ are known:
The optimal reproduction levels are found by
$$\left.\frac{\partial E(X-\hat X)^2}{\partial\hat x_k}\right|_{\hat x_k=\hat x_k^*}=-2\int_{x\in\mathcal C_k}(x-\hat x_k^*)\,p(x)\,dx=0,$$
so that
$$\hat x_k^*=\frac{\int_{x\in\mathcal C_k}x\,p(x)\,dx}{\int_{x\in\mathcal C_k}p(x)\,dx}.$$
Since
$$\int_{x\in\mathcal C_k}p(x)\,dx=\Pr(x\in\mathcal C_k),$$
we have, using Bayes’ rule, that
$$\frac{p(x)}{\Pr(x\in\mathcal C_k)}=\frac{p(x\mid x\in\mathcal C_k)}{\Pr(x\in\mathcal C_k\mid x)},$$
so that
$$\hat x_k^*=\int_{x\in\mathcal C_k}x\,\frac{p(x)}{\Pr(x\in\mathcal C_k)}\,dx=\int_{x\in\mathcal C_k}x\,\frac{p(x\mid x\in\mathcal C_k)}{1}\,dx=E(X\mid x\in\mathcal C_k). \tag{2}$$
This is the conditional mean or centroid of the quantization cell $\mathcal C_k$.
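Equation (2) can be checked numerically. As a worked example (assuming a standard Gaussian source, which the text does not fix), take the cell $\mathcal C_k=[0,\infty)$: the centroid should equal $E(X\mid X\ge 0)=\sqrt{2/\pi}\approx 0.798$.

```python
import math

def phi(x):
    """Standard normal density p(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Midpoint rule on [0, 10], which approximates the cell [0, oo)
a, b, n = 0.0, 10.0, 200_000
h = (b - a) / n
xs = [a + (i + 0.5) * h for i in range(n)]
num = sum(x * phi(x) for x in xs) * h   # integral of x p(x) over C_k
den = sum(phi(x) for x in xs) * h       # Pr(x in C_k), here 1/2
print(num / den, math.sqrt(2 / math.pi))
```

The two printed values agree to several decimal places, confirming the centroid formula for this cell.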
- If the reproduction levels $\hat x_k$ are known:
Given a set $\{\hat x_i\}$ of reconstruction points, the distortion is minimized by mapping a source random variable to the representation $\hat x_i$ that is closest to it. The partition of $\mathcal X$ into regions defined by this mapping is called a Voronoi partition.
- The Voronoi regions are determined by the optimal reproduction points, whereas the optimal reproduction points are obtained given the Voronoi regions. How can this circular problem be solved?
Iterative descent algorithm (Lloyd, 1957):
- start with an initial collection of reproduction points
- optimize the partitions for these levels by using a minimum distortion mapping (nearest neighbour quantization)
- optimize the set of reproduction levels for the given partition (replace the old values by the centroids of the partition cells)
The alternation is continued until convergence to a local, if not global, optimum.
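The three steps above can be sketched in a few lines. The sketch below assumes a standard Gaussian source, approximated by training samples, with $L=4$ levels (i.e. $R=2$ bits); the sample size and initial levels are arbitrary choices, not part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=50_000)      # empirical stand-in for p(x)
L = 4
levels = np.linspace(-2.0, 2.0, L)     # initial reproduction points

for _ in range(100):
    # Step 1: nearest-neighbour (Voronoi) partition for the current levels
    cells = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
    # Step 2: replace each level by the centroid of its partition cell
    new = np.array([samples[cells == k].mean() if np.any(cells == k)
                    else levels[k] for k in range(L)])
    if np.allclose(new, levels, atol=1e-6):   # converged (local optimum)
        break
    levels = new

cells = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
mse = np.mean((samples - levels[cells]) ** 2)
print(np.sort(levels), mse)
```

For a Gaussian source the levels should approach the known Lloyd–Max optimum for $L=4$ (approximately $\pm 0.45$ and $\pm 1.51$, distortion about $0.118$), up to sampling noise and the possibility of a merely local optimum.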
Instead of quantizing a single random variable, let us assume that we are given a set of $n$ i.i.d. random variables $X_1,\ldots,X_n$ drawn from a Gaussian distribution, which we want to represent by $nR$ bits:
- we will represent the entire sequence by a single index taking $2^{nR}$ values;
- this treatment of entire sequences at once achieves a lower distortion for the same rate than independent quantization of the individual samples.
Apparently, rectangular grid points (arising from independent descriptions) do not fill up the space efficiently:
Definition 2 (dimensionless normalized second moment of inertia):
Let $\nu$ denote the volume of a quantization cell. The dimensionless normalized second moment of inertia $G(\mathcal C_k)$ of a quantization cell is defined by
$$G(\mathcal C_k)=\frac{1}{n\,\nu^{1+2/n}}\int_{\mathcal C_k}\left\|x-\hat x_k\right\|^2\,dx.$$
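As a quick sanity check of this definition, consider one dimension ($n=1$): a cell is an interval of length $\nu$, and with the reproduction point at its midpoint the normalization removes all dependence on $\nu$, leaving the familiar factor $1/12$:

```latex
G(\mathcal C_k)
  = \frac{1}{1\cdot\nu^{1+2}} \int_{-\nu/2}^{\nu/2} x^{2}\,dx
  = \frac{1}{\nu^{3}} \cdot \frac{\nu^{3}}{12}
  = \frac{1}{12}.
```

Cells whose shape packs space more efficiently than a cube have $G$ below that of the cube, which is what makes vector quantization in higher dimensions attractive.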