Reference:
Elements of Information Theory, 2nd Edition
Slides of EE4560, TUD
AEP
In information theory, the analog of the law of large numbers is the asymptotic equipartition property (AEP). It is a direct consequence of the weak law of large numbers.
For independent, identically distributed (i.i.d.) random variables $X_1,\cdots, X_n$, the weak law of large numbers states that
$$
\frac{1}{n}\sum_{i=1}^n X_i\to EX\quad \text{in probability}
$$
The AEP states that
$$
-\frac{1}{n}\log p(X_1,\cdots,X_n)\to -E\log p(X)=H(X)\quad \text{in probability}
$$
To show this rigorously, we are going to introduce some definitions and theorems.
Definition 1:
Given a sequence of random variables $X_1,X_2,\cdots$, we say that the sequence converges to a random variable $X$:
- In probability if for every $\epsilon >0$, $\Pr\{|X_n-X|>\epsilon\}\to 0$
- In mean square if $E(X_n-X)^2\to 0$
- With probability 1 (also called almost surely) if $\Pr\{\lim_{n\to \infty}X_n=X\}=1$
Theorem 1 (Weak law of large numbers):
Let $X_1,X_2,\cdots$ be a sequence of i.i.d. random variables with mean $EX$. Then for any $\epsilon>0$ and $\delta>0$, there exists an $n_0$ such that for any $n>n_0$
$$
\Pr\left( \left|\frac{1}{n}\sum_{i=1}^n X_i-EX \right|<\epsilon \right)\ge 1-\delta\tag{1}
$$
It follows directly from Chebyshev's inequality, provided the $X_i$ have finite variance.
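The Chebyshev step can be sketched as follows, assuming the $X_i$ have a finite variance $\sigma^2$ (the symbols $\sigma^2$ and $\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i$ are introduced here for illustration only):

$$
\Pr\left\{\left|\bar X_n-EX\right|\ge\epsilon\right\}\le\frac{\operatorname{Var}(\bar X_n)}{\epsilon^2}=\frac{\sigma^2}{n\epsilon^2}
$$

Under that assumption, any $n_0\ge \sigma^2/(\delta\epsilon^2)$ makes the right-hand side smaller than $\delta$ for all $n>n_0$, and taking the complement of the event gives Eq. $(1)$.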
Theorem 2 (AEP):
If $X_1,X_2,\cdots$ are i.i.d. $\sim p(x)$, i.e., $\{ X_n, n\in \mathbb Z \}\sim p(x^n)$, then for any $\epsilon>0$ and $\delta>0$, there exists an $n_0$ such that for any $n>n_0$
$$
\Pr\left( \left|-\frac{1}{n}\log p(X_1,\cdots,X_n)-H(X) \right|<\epsilon \right)\ge 1-\delta\tag{2}
$$
Proof: Since the $X_i$ are i.i.d., so are the random variables $\log p(X_i)$. Hence, by the weak law of large numbers,
$$
\begin{aligned}
-\frac{1}{n}\log p(X_1,\cdots,X_n)&=-\frac{1}{n}\log \prod_{i=1}^n p(X_i)=-\frac{1}{n}\sum_{i=1}^n\log p(X_i)\\
&\to -E\log p(X)\quad \text{in probability}\\
&=H(X)
\end{aligned}
$$
As a consequence, when $n$ is large, the probability $p(x_1,\cdots,x_n)$ of a sequence drawn from the source is, with high probability, close to $2^{-nH(X)}$.
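The convergence of the sample entropy can be checked numerically. The following is a minimal sketch, not part of the original notes: the Bernoulli source with $\Pr(X=1)=2/3$, the fixed seed, and the helper names `entropy` and `sample_entropy` are all assumptions chosen for illustration.

```python
import math
import random


def entropy(p):
    """Entropy in bits of a pmf given as {symbol: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def sample_entropy(seq, p):
    """-(1/n) log2 p(x_1, ..., x_n) for an i.i.d. source with pmf p."""
    return -sum(math.log2(p[x]) for x in seq) / len(seq)


# Illustrative source (an assumption, not taken from the notes' slides).
p = {0: 1 / 3, 1: 2 / 3}
H = entropy(p)
rng = random.Random(0)  # fixed seed so the run is repeatable

for n in (10, 100, 1000, 10000):
    # Draw n i.i.d. symbols according to p.
    seq = rng.choices(list(p), weights=list(p.values()), k=n)
    print(n, round(sample_entropy(seq, p), 4), "vs H =", round(H, 4))
```

As $n$ grows, the printed sample entropy should settle near $H(X)$, which is what Theorem 2 predicts; for small $n$ the deviation can still be noticeable.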
Typical Set
We can divide the set of all sequences into two sets: the typical set, where the sample entropy is close to the true entropy (made precise below), and the non-typical set, which contains the other sequences.
Definition 2 (Typical set):
The typical set, denoted by $A_\epsilon^{(n)}$, is defined by
$$
A_\epsilon^{(n)}=\left\{(x_1,\cdots,x_n):\left|-\frac{1}{n}\log p(x_1,\cdots,x_n)-H(X)\right|<\epsilon \right\} \tag{3}
$$
It is the set of sequences $(x_1,\cdots, x_n)\in \mathcal X^n$ having the property
$$
2^{-n(H(X)-\epsilon)}>p(x_1,\cdots,x_n)>2^{-n(H(X)+\epsilon)}\tag{4}
$$
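To see why the property in Eq. $(4)$ is the same condition as the one in Eq. $(3)$, here is a short derivation, assuming logarithms are taken to base 2 (which the $2^{-nH(X)}$ notation suggests):

$$
\begin{aligned}
\left|-\frac{1}{n}\log p(x_1,\cdots,x_n)-H(X)\right|<\epsilon
&\iff H(X)-\epsilon<-\frac{1}{n}\log p(x_1,\cdots,x_n)<H(X)+\epsilon\\
&\iff -n(H(X)+\epsilon)<\log p(x_1,\cdots,x_n)<-n(H(X)-\epsilon)\\
&\iff 2^{-n(H(X)+\epsilon)}<p(x_1,\cdots,x_n)<2^{-n(H(X)-\epsilon)}
\end{aligned}
$$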
As a consequence of the AEP, we can show that the set $A_\epsilon^{(n)}$ has the following properties:
Theorem 3 (Properties of typical sets):
- If $\left(x_{1}, x_{2}, \ldots, x_{n}\right) \in A_{\epsilon}^{(n)}$, then $H(X)-\epsilon \leq-\frac{1}{n} \log p\left(x_{1}, x_{2}, \ldots, x_{n}\right) \leq H(X)+\epsilon$.
- $\Pr\left\{A_{\epsilon}^{(n)}\right\}>1-\epsilon$ for $n$ sufficiently large.
- $\left|A_{\epsilon}^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$, where $|A|$ denotes the number of elements in the set $A$.
- $\left|A_{\epsilon}^{(n)}\right| \geq(1-\epsilon) 2^{n(H(X)-\epsilon)}$ for $n$ sufficiently large.
Thus, the typical set has probability nearly $1$, all elements of the typical set are nearly equiprobable with probability close to $2^{-nH}$, and the number of elements in the typical set is nearly $2^{nH}$.
Proof:
- The proof of property $(1)$ is immediate from the definition of $A_\epsilon^{(n)}$.
- $\Pr\left\{A_{\epsilon}^{(n)}\right\}$ denotes the probability of the event $(X_1,X_2,\cdots,X_n)\in A_\epsilon^{(n)}$. Property $(2)$ follows directly from Theorem 2, since the probability of this event tends to $1$ as $n\to \infty$. Thus, for any $\delta>0$, there exists an $n_0$ such that for all $n\ge n_0$, we have
  $$
  \begin{aligned}
  \Pr\left\{A_{\epsilon}^{(n)}\right\}&=\Pr\left\{(X_1,X_2,\cdots,X_n)\in A_\epsilon^{(n)}\right\}\\
  &=\Pr\left\{\left| -\frac{1}{n}\log p(X_1,\cdots,X_n)-H(X) \right|<\epsilon\right\}\\
  &> 1-\delta
  \end{aligned}
  $$
  Setting $\delta=\epsilon$ gives property $(2)$.
- To prove property $(3)$, we can use Eq. $(4)$ and write
  $$
  \begin{aligned}
  1&=\sum _{\mathbf x\in \mathcal X^n}p(\mathbf x)\ge\sum _{\mathbf x\in A_\epsilon^{(n)}}p(\mathbf x)\ge \sum _{\mathbf x\in A_\epsilon^{(n)}}2^{-n(H(X)+\epsilon)}\\
  &=2^{-n(H(X)+\epsilon)}\left|A_\epsilon^{(n)}\right|
  \end{aligned}
  $$
  Hence $\left|A_{\epsilon}^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$.
- Property $(4)$ can be derived from property $(2)$ and Eq. $(4)$. For sufficiently large $n$, $\Pr\{A_{\epsilon}^{(n)}\}>1-\epsilon$, so that
  $$
  \begin{aligned}
  1-\epsilon&<\Pr\{A_{\epsilon}^{(n)}\}=\Pr\left\{(X_1,X_2,\cdots,X_n)\in A_\epsilon^{(n)}\right\}\\
  &=\sum_{\mathbf x\in A_{\epsilon}^{(n)}}p(\mathbf x)\le \sum_{\mathbf x\in A_{\epsilon}^{(n)}}2^{-n(H(X)-\epsilon)}=2^{-n(H(X)-\epsilon)}\left|A_\epsilon^{(n)}\right|
  \end{aligned}
  $$
  Hence $\left|A_{\epsilon}^{(n)}\right| \geq(1-\epsilon) 2^{n(H(X)-\epsilon)}$.
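The four properties can be checked numerically for a small source. The sketch below assumes an i.i.d. Bernoulli source and groups sequences by their number of ones, since all sequences with the same count have the same probability; the function name `typical_set_stats` and the parameter values are hypothetical choices for illustration.

```python
import math


def typical_set_stats(n, p1, eps):
    """Return (H, |A_eps^(n)|, Pr{A_eps^(n)}) for an i.i.d. Bernoulli(p1) source."""
    p0 = 1.0 - p1
    H = -(p0 * math.log2(p0) + p1 * math.log2(p1))
    size, prob = 0, 0.0
    for k in range(n + 1):  # k = number of ones in the sequence
        # log2 of the probability of any single sequence with k ones
        log_p = k * math.log2(p1) + (n - k) * math.log2(p0)
        if abs(-log_p / n - H) < eps:  # membership test from the definition
            count = math.comb(n, k)  # number of sequences with k ones
            size += count
            # Work in the log domain to avoid overflow for large n.
            prob += 2.0 ** (math.log2(count) + log_p)
    return H, size, prob


eps = 0.1  # illustrative tolerance (an assumption)
for n in (25, 100, 400):
    H, size, prob = typical_set_stats(n, 2 / 3, eps)
    print(
        f"n={n}: Pr(A)={prob:.4f}, (1/n)log2|A|={math.log2(size) / n:.4f}, "
        f"H-eps={H - eps:.4f}, H+eps={H + eps:.4f}"
    )
```

The upper bound on $|A_\epsilon^{(n)}|$ holds for every $n$, whereas the probability bound and the lower bound on the size are only guaranteed for $n$ sufficiently large, so the small-$n$ rows may fall short of $1-\epsilon$.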
Roughly speaking, a typical sequence is one in which the proportion of occurrences of each alphabet symbol is close to that symbol's true probability:
$$
N(x_i)/n\approx p(x_i)
$$
where $N(x_i)$ denotes the number of occurrences of the symbol $x_i$ in the sequence.
Examples: slides 17-22.
Discussion: We have that $\Pr (A_\epsilon^{(n)})\to 1$ as $n\to \infty$. Does this imply that sequences in $\overline{A_\epsilon^{(n)}}$ have lower probability than the ones in $A_\epsilon^{(n)}$?
No. $\Pr (A_\epsilon^{(n)})\to 1$ means that $(X_1,X_2,\cdots, X_n)$ is likely to be in the typical set as $n\to \infty$. But a single typical sequence does not necessarily have the highest probability. For example:
Consider a stochastic process consisting of i.i.d. Bernoulli random variables having probabilities $\Pr (X=0)=1/3$ and $\Pr (X=1)=2/3$.
The most likely sequence (length 6): 1 1 1 1 1 1
A typical sequence (length 6): 1 0 1 1 1 0
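The two sequences can be compared directly. This is a minimal sketch using exact fractions; the helper names `prob` and `sample_entropy` are hypothetical and the sequences are assumed to be i.i.d. draws from the source above.

```python
import math
from fractions import Fraction

# Source from the example: Pr(X=0) = 1/3, Pr(X=1) = 2/3.
p = {0: Fraction(1, 3), 1: Fraction(2, 3)}
H = -sum(float(q) * math.log2(float(q)) for q in p.values())


def prob(seq):
    """Exact probability of an i.i.d. sequence under p."""
    result = Fraction(1)
    for x in seq:
        result *= p[x]
    return result


def sample_entropy(seq):
    """-(1/n) log2 p(x_1, ..., x_n) in bits."""
    return -math.log2(float(prob(seq))) / len(seq)


most_likely = (1, 1, 1, 1, 1, 1)
typical = (1, 0, 1, 1, 1, 0)

for name, seq in (("most likely", most_likely), ("typical", typical)):
    gap = abs(sample_entropy(seq) - H)
    print(f"{name}: p = {prob(seq)}, |sample entropy - H| = {gap:.4f}")
```

The all-ones sequence has the larger probability, yet its sample entropy is far from $H(X)$, so it falls outside $A_\epsilon^{(6)}$ for small $\epsilon$. The second sequence has symbol proportions that match the source exactly, which puts its sample entropy at $H(X)$.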
High-probability Sets
From the definition of $A_\epsilon^{(n)}$, it is clear that $A_\epsilon^{(n)}$ is a fairly small set that contains most of the probability. But from the definition, it is not clear whether it is the smallest high-probability set. We first give a definition of the smallest high-probability set:
Definition 3 (The smallest high-probability set):
For each $n=1,2,\cdots$, let $Q_q^{(n)}\subset \mathcal X^n$ be the smallest set with
$$
\Pr\{Q_q^{(n)}\}\ge 1-q \tag{5}
$$
We can show that, for sufficiently large $n$ and sufficiently small $q$, the sets $A_\epsilon^{(n)}$ and $Q_q^{(n)}$ have a significant intersection:
$$
\begin{aligned}
\Pr\{A_\epsilon^{(n)} \cap Q_q^{(n)} \} &=\Pr\{A_\epsilon^{(n)}\}+\Pr\{Q_q^{(n)}\}-\Pr\{A_\epsilon^{(n)} \cup Q_q^{(n)}\}\\
&>1-\delta+1-q-1\\
&=1-\delta-q
\end{aligned}
$$
Next, we show that $|A_\epsilon^{(n)}|$ and $|Q_q^{(n)}|$ are about the same.
Theorem 4:
For any $0<q<1$, we have
$$
\frac{1}{n}\log |Q_q^{(n)}|>H(X)-\epsilon ' \tag{6}
$$
where $\epsilon'>0$ can be made arbitrarily small for $n$ sufficiently large.
Proof:
$$
\begin{aligned}
1-\delta-q&<\Pr\{A_\epsilon^{(n)} \cap Q_q^{(n)} \} =\sum_{x^n\in A_\epsilon^{(n)} \cap Q_q^{(n)}}p(x^n)\\
&<\sum_{x^n\in A_\epsilon^{(n)} \cap Q_q^{(n)}}2^{-n(H(X)-\epsilon)}=|A_\epsilon^{(n)} \cap Q_q^{(n)}|2^{-n(H(X)-\epsilon)}\\
&\le |Q_q^{(n)}|2^{-n(H(X)-\epsilon)}
\end{aligned}
$$
Hence, $|Q_q^{(n)}|>(1-\delta-q)2^{n(H(X)-\epsilon)}$. Taking logarithms and dividing by $n$ gives $\frac{1}{n}\log|Q_q^{(n)}|>H(X)-\epsilon+\frac{1}{n}\log(1-\delta-q)$, which is Eq. $(6)$ with $\epsilon'=\epsilon-\frac{1}{n}\log(1-\delta-q)$. This requires $\delta+q<1$, and the last term vanishes as $n\to\infty$.
Thus, $Q_q^{(n)}$ must have at least about $2^{nH}$ elements, to first order in the exponent, while $A_\epsilon^{(n)}$ has about $2^{n(H\pm \epsilon)}$ elements. Therefore, $A_\epsilon^{(n)}$ is about the same size as the smallest high-probability set.
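The size of the smallest high-probability set can also be computed directly for a small source. The sketch below assumes an i.i.d. Bernoulli source with $\Pr(X=1)=p_1>1/2$ and builds the set greedily, adding sequences in order of decreasing probability until the mass reaches $1-q$; the function name `smallest_set_rate` and the values of `n` and `q` are hypothetical.

```python
import math


def smallest_set_rate(n, p1, q):
    """(1/n) log2 |Q_q^(n)| for an i.i.d. Bernoulli(p1) source, assuming p1 > 1/2."""
    p0 = 1.0 - p1
    size, mass = 0, 0.0
    # With p1 > 1/2, sequences with more ones are more probable,
    # so the greedy order is k = n, n-1, ..., 0 ones.
    for k in range(n, -1, -1):
        p_seq = p1 ** k * p0 ** (n - k)  # probability of one such sequence
        count = math.comb(n, k)          # number of sequences with k ones
        if mass + count * p_seq < 1.0 - q:
            size += count
            mass += count * p_seq
        else:
            # Take only as many sequences from this group as are needed.
            size += math.ceil((1.0 - q - mass) / p_seq)
            break
    return math.log2(size) / n


p1, q = 2 / 3, 0.1  # illustrative values (assumptions)
H = -((1 - p1) * math.log2(1 - p1) + p1 * math.log2(p1))
for n in (50, 200, 500):
    print(f"n={n}: (1/n)log2|Q| = {smallest_set_rate(n, p1, q):.4f}, H = {H:.4f}")
```

Under these assumptions the printed rate should stay close to $H(X)$, in line with Theorem 4. The plain floating-point products limit this sketch to moderate $n$; a log-domain version would be needed for much longer sequences.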