Asymptotic Equipartition Property

Reference:

Elements of Information Theory, 2nd Edition

Slides of EE4560, TUD

AEP

In information theory, the analog of the law of large numbers is the asymptotic equipartition property (AEP). It is a direct consequence of the weak law of large numbers.

For independent, identically distributed (i.i.d.) random variables $X_1,\cdots,X_n$, the weak law of large numbers states that
$$\frac{1}{n}\sum_{i=1}^n X_i\to EX\quad \text{in probability}$$
The AEP states that
$$-\frac{1}{n}\log p(X_1,\cdots,X_n)\to -E\log p(X)=H(X)\quad \text{in probability}$$
To show this rigorously, we are going to introduce some definitions and theorems.


Definition 1:

Given a sequence of random variables $X_1,X_2,\cdots$, we say that the sequence converges to a random variable $X$:

  1. In probability if for every $\epsilon>0$, $\Pr\{|X_n-X|>\epsilon\}\to 0$
  2. In mean square if $E(X_n-X)^2\to 0$
  3. With probability 1 (also called almost surely) if $\Pr\{\lim_{n\to \infty}X_n=X\}=1$

Theorem 1 (Weak law of large numbers):

Let $X_1,X_2,\cdots$ be a sequence of i.i.d. random variables with mean $EX$. Then for any $\epsilon>0$ and $\delta>0$, there exists an $n_0$ such that for any $n>n_0$
$$\Pr\left( \left|\frac{1}{n}\sum_{i=1}^n X_i-EX \right|<\epsilon \right)\ge 1-\delta\tag{1}$$
When the variance of $X$ is finite, this is a direct result of Chebyshev's inequality.
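To make the role of Chebyshev's inequality concrete, here is a minimal sketch, assuming a finite variance $\sigma^2$. Chebyshev gives $\Pr(|\frac{1}{n}\sum_i X_i-EX|\ge\epsilon)\le\sigma^2/(n\epsilon^2)$, so any $n\ge\sigma^2/(\epsilon^2\delta)$ satisfies (1). The names `chebyshev_n0` and `empirical_miss_rate`, the fair-coin source, and the seed are illustrative assumptions, not taken from the references.

```python
import math
import random

def chebyshev_n0(variance, eps, delta):
    """An n that makes the Chebyshev bound variance / (n * eps**2) at most delta."""
    return math.ceil(variance / (eps ** 2 * delta))

def empirical_miss_rate(n, eps, trials=2000, p=0.5, seed=0):
    """Fraction of trials where the mean of n Bernoulli(p) draws misses p by >= eps."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= eps:
            misses += 1
    return misses / trials

# Fair coin: variance = p * (1 - p) = 1/4 (assumed example source).
n0 = chebyshev_n0(variance=0.25, eps=0.1, delta=0.05)
print("Chebyshev n0:", n0)
print("empirical miss rate at n0:", empirical_miss_rate(n0, eps=0.1))
```

The Chebyshev bound is usually loose, so the observed miss rate at this $n_0$ is typically far below $\delta$.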

Theorem 2 (AEP):

If $X_1,X_2,\cdots$ are i.i.d. $\sim p(x)$, i.e., $\{X_n, n\in \mathbb Z\}\sim p(x^n)=\prod_{i=1}^n p(x_i)$, then for any $\epsilon>0$ and $\delta>0$, there exists an $n_0$ such that for any $n>n_0$
$$\Pr\left( \left|-\frac{1}{n}\log p(X_1,\cdots,X_n)-H(X) \right|<\epsilon \right)\ge 1-\delta\tag{2}$$
Proof: Functions of independent random variables are also independent, so the $\log p(X_i)$ are i.i.d. Hence, by the weak law of large numbers,
$$\begin{aligned} -\frac{1}{n}\log p(X_1,\cdots,X_n)&=-\frac{1}{n}\log \prod_{i=1}^n p(X_i)=-\frac{1}{n}\sum_{i=1}^n\log p(X_i)\\ &\to -E\log p(X)\quad \text{in probability}\\ &=H(X) \end{aligned}$$


As a consequence, when $n$ is large, the probability $p(x_1,\cdots,x_n)$ of a sequence drawn from the source is, with high probability, close to $2^{-nH(X)}$.
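A quick numerical check of the AEP: the sketch below draws i.i.d. sequences and compares the sample entropy $-\frac{1}{n}\log p(X_1,\cdots,X_n)$ with $H(X)$. The Bernoulli source with $\Pr(X=1)=2/3$, the seed, and the helper names `entropy` and `sample_entropy` are assumptions chosen for illustration; logarithms are base 2.

```python
import math
import random

def entropy(pmf):
    """Entropy H(X) in bits of a pmf given as {symbol: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def sample_entropy(seq, pmf):
    """-(1/n) log2 p(x_1, ..., x_n) for an i.i.d. source with the given pmf."""
    return -sum(math.log2(pmf[x]) for x in seq) / len(seq)

pmf = {0: 1 / 3, 1: 2 / 3}   # assumed example source
rng = random.Random(0)       # fixed seed so the run is reproducible
H = entropy(pmf)

for n in (10, 100, 1000, 10000):
    seq = rng.choices(list(pmf), weights=list(pmf.values()), k=n)
    print(n, round(sample_entropy(seq, pmf), 4), "H =", round(H, 4))
```

As $n$ grows, the printed sample entropies should settle near $H(X)\approx 0.918$ bits.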

Typical Set

We can divide the set of all sequences into two sets: the typical set, where the sample entropy is close to the true entropy (made precise in Definition 2), and the non-typical set, which contains the other sequences.


Definition 2 (Typical set):

The typical set, denoted by $A_\epsilon^{(n)}$, is defined by
$$A_\epsilon^{(n)}=\left\{(x_1,\cdots,x_n):\left|-\frac{1}{n}\log p(x_1,\cdots,x_n)-H(X)\right|<\epsilon \right\} \tag{3}$$
Equivalently, it is the set of sequences $(x_1,\cdots, x_n)\in \mathcal X^n$ having the property
$$2^{-n(H(X)-\epsilon)}>p(x_1,\cdots,x_n)>2^{-n(H(X)+\epsilon)}\tag{4}$$
As a consequence of the AEP, we can show that the set $A_\epsilon^{(n)}$ has the following properties:

Theorem 3 (Properties of typical sets):

  1. If $\left(x_{1}, x_{2}, \ldots, x_{n}\right) \in A_{\epsilon}^{(n)}$, then $H(X)-\epsilon \leq-\frac{1}{n} \log p\left(x_{1}, x_{2}, \ldots, x_{n}\right) \leq H(X)+\epsilon$.
  2. $\Pr\left\{A_{\epsilon}^{(n)}\right\}>1-\epsilon$ for $n$ sufficiently large.
  3. $\left|A_{\epsilon}^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$, where $|A|$ denotes the number of elements in the set $A$.
  4. $\left|A_{\epsilon}^{(n)}\right| \geq(1-\epsilon) 2^{n(H(X)-\epsilon)}$ for $n$ sufficiently large.

Thus, the typical set has probability nearly $1$, all elements of the typical set are nearly equiprobable with probability close to $2^{-nH}$, and the number of elements in the typical set is nearly $2^{nH}$.
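The properties in Theorem 3 can be checked numerically for a small source. The sketch below is one way to do this, assuming an i.i.d. Bernoulli source with $\Pr(X=1)=p_1$. Since all sequences with the same number of ones have the same probability, it groups them by that count instead of enumerating $\mathcal X^n$. The function name `typical_set_stats` and the parameter values are illustrative assumptions.

```python
import math

def typical_set_stats(p1, n, eps):
    """Return (H, |A|, Pr{A}) for the typical set of an i.i.d. Bernoulli(p1) source."""
    H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
    size, prob = 0, 0.0
    for k in range(n + 1):  # k = number of ones in the sequence
        log_p = k * math.log2(p1) + (n - k) * math.log2(1 - p1)  # log2 p(x^n)
        if abs(-log_p / n - H) < eps:  # membership test from Eq. (3)
            count = math.comb(n, k)    # number of sequences with exactly k ones
            size += count
            # Work in the log domain so large n does not overflow or underflow.
            prob += 2.0 ** (math.log2(count) + log_p)
    return H, size, prob

eps = 0.1
for n in (6, 50, 500):
    H, size, prob = typical_set_stats(2 / 3, n, eps)
    upper = 2.0 ** (n * (H + eps))               # bound of property 3
    lower = (1 - eps) * 2.0 ** (n * (H - eps))   # bound of property 4
    print(n, size, round(prob, 4), size <= upper, size >= lower)
```

For small $n$ the probability of the typical set can be far below $1-\epsilon$ and the lower bound can fail; properties 2 and 4 are only claimed for $n$ sufficiently large.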


Proof:

  • Property $(1)$ is immediate from the definition of $A_\epsilon^{(n)}$.

  • $\Pr\left\{A_{\epsilon}^{(n)}\right\}$ denotes the probability of the event $(X_1,X_2,\cdots,X_n)\in A_\epsilon^{(n)}$. Property $(2)$ follows directly from Theorem 2, since the probability of this event tends to $1$ as $n\to \infty$. Thus, for any $\delta>0$, there exists an $n_0$ such that for all $n\ge n_0$, we have

    $$\begin{aligned} \Pr\left\{A_{\epsilon}^{(n)}\right\}&=\Pr\left\{(X_1,X_2,\cdots,X_n)\in A_\epsilon^{(n)}\right\}\\&=\Pr\left\{\left| -\frac{1}{n}\log p(X_1,\cdots,X_n)-H(X) \right|<\epsilon\right\}\\ &> 1-\delta \end{aligned}$$

    Setting $\delta=\epsilon$ gives property $(2)$.

  • To prove property $(3)$, we can use Eq. $(4)$ and write
    $$\begin{aligned} 1&=\sum _{\mathbf x\in \mathcal X^n}p(\mathbf x)\ge\sum _{\mathbf x\in A_\epsilon^{(n)}}p(\mathbf x)\ge \sum _{\mathbf x\in A_\epsilon^{(n)}}2^{-n(H(X)+\epsilon)}\\&=2^{-n(H(X)+\epsilon)}\left|A_\epsilon^{(n)}\right| \end{aligned}$$
    Hence $\left|A_{\epsilon}^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$.

  • Property $(4)$ can be derived from property $(2)$ and Eq. $(4)$. For sufficiently large $n$, $\Pr\{A_{\epsilon}^{(n)}\}>1-\epsilon$, so that
    $$\begin{aligned} 1-\epsilon&<\Pr\{A_{\epsilon}^{(n)}\}=\Pr\left\{(X_1,X_2,\cdots,X_n)\in A_\epsilon^{(n)}\right\}\\ &=\sum_{\mathbf x\in A_{\epsilon}^{(n)}}p(\mathbf x)\le \sum_{\mathbf x\in A_{\epsilon}^{(n)}}2^{-n(H(X)-\epsilon)}=2^{-n(H(X)-\epsilon)}\left|A_\epsilon^{(n)}\right| \end{aligned}$$
    Hence $\left|A_{\epsilon}^{(n)}\right| \geq(1-\epsilon) 2^{n(H(X)-\epsilon)}$.


Roughly speaking, a typical sequence is one in which the proportion of occurrences of each alphabet symbol is close to that symbol's true probability:
$$N(x_i)/n\approx p(x_i)$$
where $N(x_i)$ is the number of times the symbol $x_i$ occurs in the sequence.
Examples: slides 17-22.

Discussion: We have that $\Pr (A_\epsilon^{(n)})\to 1$ as $n\to \infty$. Does this imply that sequences in $\overline{A_\epsilon^{(n)}}$ have lower probability than the ones in $A_\epsilon^{(n)}$?

$\Pr (A_\epsilon^{(n)})\to 1$ means that $(X_1,X_2,\cdots, X_n)$ is likely to be in the typical set as $n\to \infty$. But an individual typical sequence does not necessarily have the highest probability. For example:

Consider a stochastic process consisting of i.i.d. Bernoulli random variables with probabilities $\Pr (X=0)=1/3$ and $\Pr (X=1)=2/3$.

The most likely sequence (length 6): 1 1 1 1 1 1

A typical sequence (length 6): 1 0 1 1 1 0
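The two sequences above can be compared directly. The following sketch, using exact fractions, computes each sequence's probability and its sample entropy for the source with $\Pr(X=0)=1/3$ and $\Pr(X=1)=2/3$; the helper names `seq_prob` and `sample_entropy` are assumptions for illustration.

```python
from fractions import Fraction
import math

p = {0: Fraction(1, 3), 1: Fraction(2, 3)}
H = -sum(float(q) * math.log2(float(q)) for q in p.values())  # entropy in bits

def seq_prob(seq):
    """Exact probability of an i.i.d. sequence."""
    prob = Fraction(1)
    for x in seq:
        prob *= p[x]
    return prob

def sample_entropy(seq):
    """-(1/n) log2 p(x_1, ..., x_n)."""
    return -math.log2(float(seq_prob(seq))) / len(seq)

most_likely = (1, 1, 1, 1, 1, 1)
typical = (1, 0, 1, 1, 1, 0)
for name, seq in (("most likely", most_likely), ("typical", typical)):
    print(name, seq_prob(seq), round(sample_entropy(seq), 4), "H =", round(H, 4))
```

The all-ones sequence has probability $64/729$, four times the $16/729$ of the typical sequence, yet its sample entropy differs from $H(X)$ by $1/3$ bit, so it lies outside $A_\epsilon^{(6)}$ for any $\epsilon\le 1/3$. The typical sequence has exactly the proportions $1/3$ and $2/3$, so its sample entropy equals $H(X)$.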

High-probability Sets

From the definition of $A_\epsilon^{(n)}$, it is clear that $A_\epsilon^{(n)}$ is a fairly small set that contains most of the probability. But from the definition, it is not clear whether it is the smallest such set. We first define the smallest high-probability set:

Definition 3 (The smallest high-probability set):

For each $n=1,2,\cdots$, let $Q_q^{(n)}\subset \mathcal X^n$ be the smallest set with
$$\Pr\{Q_q^{(n)}\}\ge 1-q \tag{5}$$
We can show that, for sufficiently small $q$, the sets $A_\epsilon^{(n)}$ and $Q_q^{(n)}$ have a significant intersection. For $n$ large enough that $\Pr\{A_\epsilon^{(n)}\}>1-\delta$, and since $\Pr\{A_\epsilon^{(n)} \cup Q_q^{(n)}\}\le 1$,
$$\begin{aligned} \Pr\{A_\epsilon^{(n)} \cap Q_q^{(n)} \} &=\Pr\{A_\epsilon^{(n)}\}+\Pr\{Q_q^{(n)}\}-\Pr\{A_\epsilon^{(n)} \cup Q_q^{(n)}\}\\ &>1-\delta+1-q-1\\ &=1-\delta-q \end{aligned}$$
We now show that $|A_\epsilon^{(n)}|$ and $|Q_q^{(n)}|$ are about the same.

Theorem 4:

For any $0<q<1$ and $n$ sufficiently large, we have
$$\frac{1}{n}\log |Q_q^{(n)}|>H(X)-\epsilon ' \tag{6}$$
where $\epsilon '$ can be made arbitrarily small.

Proof:
$$\begin{aligned} 1-\delta-q&<\Pr\{A_\epsilon^{(n)} \cap Q_q^{(n)} \} =\sum_{x^n\in A_\epsilon^{(n)} \cap Q_q^{(n)}}p(x^n)\\ &<\sum_{x^n\in A_\epsilon^{(n)} \cap Q_q^{(n)}}2^{-n(H(X)-\epsilon)}=|A_\epsilon^{(n)} \cap Q_q^{(n)}|2^{-n(H(X)-\epsilon)}\\ &\le |Q_q^{(n)}|2^{-n(H(X)-\epsilon)} \end{aligned}$$

Hence, $|Q_q^{(n)}|>(1-\delta-q)2^{n(H(X)-\epsilon)}$. Taking logarithms and dividing by $n$ gives $\frac{1}{n}\log |Q_q^{(n)}|>H(X)-\epsilon+\frac{1}{n}\log(1-\delta-q)$, so $(6)$ holds with $\epsilon '=\epsilon-\frac{1}{n}\log(1-\delta-q)$, which can be made arbitrarily small by taking $\epsilon$ small and $n$ large (with $\delta<1-q$).

Thus, $Q_q^{(n)}$ must have at least $2^{nH}$ elements, to first order in the exponent, while $A_\epsilon^{(n)}$ has about $2^{n(H\pm \epsilon)}$ elements. Therefore, $A_\epsilon^{(n)}$ is about the same size as the smallest high-probability set.
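Theorem 4 can be illustrated numerically. The sketch below, assuming an i.i.d. Bernoulli source with $\Pr(X=1)=p_1\ge 1/2$, builds the smallest high-probability set greedily by adding the most probable sequences first until the probability reaches $1-q$, and reports $\frac{1}{n}\log_2|Q_q^{(n)}|$. The function name `smallest_high_prob_set_size` and the parameter values are assumptions for illustration; exact rational arithmetic (`fractions.Fraction`) avoids rounding at the threshold.

```python
from fractions import Fraction
import math

def smallest_high_prob_set_size(p1, n, q):
    """|Q_q^(n)| for an i.i.d. Bernoulli source with Pr(X=1) = p1 >= 1/2.

    p1 and q are expected to be Fractions so that all comparisons are exact.
    """
    target = 1 - q
    size, mass = 0, Fraction(0)
    for k in range(n, -1, -1):  # more ones = more probable, so start at k = n
        p_seq = p1 ** k * (1 - p1) ** (n - k)  # probability of one such sequence
        count = math.comb(n, k)                # sequences with exactly k ones
        if mass + count * p_seq >= target:
            # Only part of this class is needed to reach the target mass.
            return size + math.ceil((target - mass) / p_seq)
        size += count
        mass += count * p_seq
    return size

p1, q = Fraction(2, 3), Fraction(1, 10)  # assumed example values
H = -sum(float(p) * math.log2(float(p)) for p in (p1, 1 - p1))
for n in (6, 50, 200):
    size = smallest_high_prob_set_size(p1, n, q)
    print(n, size, round(math.log2(size) / n, 4), "H =", round(H, 4))
```

For finite $n$ the printed rate need not equal $H(X)$; Theorem 4 only bounds it from below by $H(X)-\epsilon '$ for large $n$.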
