References:
- Elements of Information Theory, 2nd Edition
- Slides of EE4560, TUD
Differential Entropy
We now introduce the concept of differential entropy, which is the entropy of a continuous random variable.
Definition 1 (Differential Entropy):
The differential entropy $h(X)$ of a continuous random variable $X$ with density $f(x)$ is defined as
$$h(X)=-\int_S f(x)\log f(x)\,dx \tag{1}$$
where $S$ is the support set of $X$ (i.e., where $f(x)>0$).
- $h(X)$ is sometimes denoted $h(f)$, just as $H(X)$ is sometimes denoted $H(p)$.
- $\log$ here denotes $\log_2$, so entropies are measured in bits; do not forget to convert the base when a derivation uses natural logarithms.
Examples:
- Uniform distribution $X \sim \mathrm{Unif}(0,a)$:
$$h(X)=-\int_{0}^{a}\frac{1}{a} \log \frac{1}{a}\,dx=\log a \tag{2}$$
  - larger $a \to$ larger uncertainty $\to$ larger $h(X)$
  - For $0<a<1$, the differential entropy $\log a$ is negative! This differs from $H(X)$, which is always $\ge 0$.
  - However, $2^{h(X)}=2^{\log a}=a$ is always positive.
- Normal distribution $X \sim \mathcal N(\mu,\sigma^2)$:
$$\begin{aligned} h(f)&=-\int f(x)\log f(x)\,dx=-\int \frac{f(x)}{\ln 2}\left[-\frac{(x-\mu)^2}{2\sigma^2}-\ln \sqrt{2\pi \sigma^2} \right]dx\\ &=\frac{1}{\ln 2}\left[\frac{E(X-\mu)^2}{2\sigma^2}+\frac{1}{2}\ln (2\pi \sigma^2)\right]=\frac{1}{2}\log (2\pi e \sigma^2) \end{aligned}\tag{3}$$
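As a quick numerical sanity check of Eqs. (2) and (3) (an illustrative sketch, not part of the reference), the snippet below evaluates $-\int f\log_2 f\,dx$ by numerical integration for a uniform and a normal density (with assumed parameters $a=0.5$, $\sigma=2$) and compares the results with the closed forms $\log_2 a$ and $\frac{1}{2}\log_2(2\pi e\sigma^2)$.

```python
import numpy as np

def differential_entropy(f, grid):
    """Numerically approximate h = -integral f(x) log2 f(x) dx on a grid."""
    fx = f(grid)
    mask = fx > 0                       # integrate only over the support
    integrand = -fx[mask] * np.log2(fx[mask])
    return np.trapz(integrand, grid[mask])

a, sigma = 0.5, 2.0                     # assumed example parameters

# Uniform(0, a): closed form is log2(a)  (negative for a < 1)
x_uni = np.linspace(0.0, a, 100_001)
h_uni = differential_entropy(lambda x: np.full_like(x, 1.0 / a), x_uni)
print(h_uni, np.log2(a))                # both ~ -1.0

# Normal(0, sigma^2): closed form is 0.5 * log2(2*pi*e*sigma^2)
x_norm = np.linspace(-10 * sigma, 10 * sigma, 200_001)
gauss = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h_norm = differential_entropy(gauss, x_norm)
print(h_norm, 0.5 * np.log2(2 * np.pi * np.e * sigma**2))  # both ~ 3.047
```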
Definition 2 (Joint Differential Entropy):
The joint differential entropy of a set $X_1,X_2,\cdots,X_n$ of random variables with density $f(x_1,x_2,\cdots,x_n)$ is defined as
$$h(X_1,X_2,\cdots,X_n)=-\int f(x^n)\log f(x^n)\,dx^n \tag{4}$$
N.B. $x^n$ here is shorthand for $(x_1,x_2,\cdots,x_n)$.
Definition 3 (Conditional Differential Entropy):
If $X, Y$ have a joint density function $f(x, y)$, we can define the conditional differential entropy $h(X \mid Y)$ as
$$h(X\mid Y)=-\int f(x, y) \log f(x \mid y)\, dx\, dy=h(X, Y)-h(Y)\tag{5}$$
Definition 4 (Mutual Information):
The mutual information $I(X;Y)$ between two random variables $X$ and $Y$ with joint density $f(x,y)$ is defined as
$$\begin{aligned} I(X;Y)&=\iint f(x,y)\log \frac{f(x,y)}{f(x)f(y)}\,dx\, dy\\ &=h(X)-h(X\mid Y)=h(Y)-h(Y\mid X)\\ &=h(X)+h(Y)-h(X,Y) \end{aligned}\tag{6}$$
N.B. $I(X;Y)\ge 0$, with equality if and only if $X$ and $Y$ are independent.
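As a worked illustration of Eq. (6) (my own example, not from the reference): for jointly Gaussian $X,Y$ with equal variances $\sigma^2$ and correlation coefficient $\rho$, substituting $h(X)=h(Y)=\frac{1}{2}\log 2\pi e\sigma^2$ and $h(X,Y)=\frac{1}{2}\log\big((2\pi e)^2\sigma^4(1-\rho^2)\big)$ into $I(X;Y)=h(X)+h(Y)-h(X,Y)$ gives $I(X;Y)=-\frac{1}{2}\log(1-\rho^2)\ge 0$, with equality exactly when $\rho=0$. The sketch below compares this closed form with a rough histogram-based estimate from samples (bin count and sample size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, n = 0.8, 1.0, 1_000_000

# Closed form from Eq. (6) for jointly Gaussian X, Y: I(X;Y) = -1/2 log2(1 - rho^2)
closed_form = -0.5 * np.log2(1 - rho**2)

# Sample a correlated Gaussian pair and estimate I(X;Y) from a 2-D histogram
x = rng.normal(0, sigma, n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(0, sigma, n)

bins = 200
pxy, _, _ = np.histogram2d(x, y, bins=bins)
pxy /= n                                    # joint pmf of the discretized pair
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginal pmfs
nz = pxy > 0
mi_est = np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz]))

print(closed_form, mi_est)   # ~ 0.737 vs a rough (slightly biased) estimate
```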
Gaussian Channels
The most important continuous-alphabet channel is the Gaussian channel. This is a time-discrete channel with output $Y_i$ at time $i$, where $Y_i$ is the sum of the input $X_i$ and the noise $Z_i$. The noise $Z_i$ is drawn i.i.d. from a Gaussian distribution with variance $N$. Thus,
$$Y_i=X_i+Z_i,\quad Z_i \sim \mathcal N(0,N) \tag{7}$$
The noise $Z_i$ is assumed to be independent of the signal $X_i$.
The most common limitation on the input is an energy or power constraint. We assume an average power constraint: for any codeword $(x_1,x_2,\ldots,x_n)$ transmitted over the channel, we require that
$$\frac{1}{n}\sum_{i=1}^n x_i^2\le P \tag{8}$$
[Example-Binary input, Gaussian noise: Slide 6-7]
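The slide example itself is not reproduced here; the following is a minimal simulation sketch of my own (with assumed values of $P$ and $N$) of binary antipodal signaling over the channel of Eq. (7): the inputs $\pm\sqrt{P}$ meet the power constraint with equality, the decoder takes the sign of $Y$, and the simulated error rate agrees with the theoretical value $Q\big(\sqrt{P/N}\big)$.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
P, N, n = 1.0, 0.25, 1_000_000           # assumed signal power, noise variance, #uses

bits = rng.integers(0, 2, n)
x = np.sqrt(P) * (2 * bits - 1)           # antipodal inputs +-sqrt(P), average power P
y = x + rng.normal(0.0, np.sqrt(N), n)    # Gaussian channel: Y = X + Z, Z ~ N(0, N)

bits_hat = (y > 0).astype(int)            # minimum-distance (sign) decoding
p_err_sim = np.mean(bits_hat != bits)

# Theoretical error probability Q(sqrt(P/N)), written via erfc
p_err_theory = 0.5 * erfc(sqrt(P / N) / sqrt(2))
print(p_err_sim, p_err_theory)            # both ~ 0.0228 for P/N = 4
```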
Gaussian Channel Capacity
Definition 5 (Information Capacity):
The information capacity of the Gaussian channel with power constraint $P$ is
$$C=\max _{f(x):\, E X^{2} \leq P} I(X ; Y)=\frac{1}{2} \log \left(1+\frac{P}{N}\right) \tag{9}$$
where the maximum is achieved when $X\sim \mathcal N(0,P)$.
Proof: Expanding $I(X ; Y)$, we have
$$\begin{aligned} I(X ; Y) &=h(Y)-h(Y \mid X) \\ &=h(Y)-h(X+Z \mid X) \\ &=h(Y)-h(Z\mid X) \\ &=h(Y)-h(Z) \end{aligned}$$
since $Z$ is independent of $X$. From Eq. $(3)$, $h(Z)=\frac{1}{2} \log 2 \pi e N$. Also,
$$E Y^{2}=E(X+Z)^{2}=E X^{2}+2\, E X\, E Z+E Z^{2}=P+N$$
since $X$ and $Z$ are independent and $E Z=0$. Given $E Y^{2}=P+N$, the entropy of $Y$ is bounded by $\frac{1}{2} \log 2 \pi e(P+N)$ by Theorem 8.6.5 (the normal distribution maximizes the entropy for a given variance) [book, p. 254]. Applying this result to bound the mutual information, we obtain
$$\begin{aligned} I(X ; Y) &=h(Y)-h(Z) \\ & \leq \frac{1}{2} \log 2 \pi e(P+N)-\frac{1}{2} \log 2 \pi e N \\ &=\frac{1}{2} \log \left(1+\frac{P}{N}\right) \end{aligned}$$
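To make the optimality of the Gaussian input concrete, here is an illustrative numerical sketch (not from the reference, parameters chosen arbitrarily): the mutual information $h(Y)-h(Z)$ achieved by a uniform input with the same power $P$ is computed by numerically convolving the input density with the noise density, and it comes out strictly below the capacity $\frac{1}{2}\log\big(1+\frac{P}{N}\big)$.

```python
import numpy as np

P, N = 4.0, 1.0
capacity = 0.5 * np.log2(1 + P / N)       # Eq. (9), achieved by X ~ N(0, P)

# Suboptimal input with the same power: X ~ Unif(-a, a), a chosen so that E X^2 = P
a = np.sqrt(3 * P)
dx = 0.005
x = np.arange(-30, 30, dx)
f_x = np.where(np.abs(x) <= a, 1 / (2 * a), 0.0)        # input density
f_z = np.exp(-x**2 / (2 * N)) / np.sqrt(2 * np.pi * N)  # noise density N(0, N)

# Density of Y = X + Z is the convolution of the input and noise densities
f_y = np.convolve(f_x, f_z, mode="same") * dx

nz = f_y > 0
h_y = -np.sum(f_y[nz] * np.log2(f_y[nz])) * dx          # numerical h(Y) in bits
h_z = 0.5 * np.log2(2 * np.pi * np.e * N)               # closed-form h(Z)

print(capacity)      # ~ 1.161 bits per transmission
print(h_y - h_z)     # I(X;Y) for the uniform input: strictly smaller
```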
Next, it will be shown that this capacity is also the supremum of the rates achievable for the channel, i.e., the operational capacity.
Definition 6 (Code):
An $(M, n)$ code for the Gaussian channel with power constraint $P$ consists of the following:
- An index set $\{1,2, \ldots, M\}$
- An encoding function $x:\{1,2, \ldots, M\} \rightarrow \mathcal{X}^{n}$, yielding codewords $x^{n}(1), x^{n}(2), \ldots, x^{n}(M)$, satisfying the power constraint $P$; that is, for every codeword
$$\sum_{i=1}^{n} x_{i}^{2}(w) \leq n P, \quad w=1,2, \ldots, M$$
- A decoding function $g: \mathcal{Y}^{n} \rightarrow\{1,2, \ldots, M\}$
N.B. The rate is $R=\frac{\log M}{n}$, as defined for the discrete channel.
Definition 7 (Achievable):
A rate $R$ is said to be achievable for a Gaussian channel with a power constraint $P$ if there exists
- a sequence of $\left(2^{n R}, n\right)$ codes
- with codewords satisfying the power constraint
- such that the maximal probability of error $\lambda^{(n)}$ tends to zero.
The capacity of the channel is the supremum of the achievable rates.
Theorem 1 (The capacity of a Gaussian channel):
The capacity of a Gaussian channel with power constraint $P$ and noise variance $N$ is
$$C=\frac{1}{2} \log \left(1+\frac{P}{N}\right) \quad \text{bits per transmission} \tag{10}$$
[Proof: book 266-268]
A plausibility argument (sphere packing): with high probability the received vector $Y^n$ lies in a sphere of radius $\sqrt{n(P+N)}$, while the noise confines the received values for each codeword to a small sphere of radius roughly $\sqrt{nN}$. The number of non-overlapping small spheres that fit inside the big sphere is at most $\left(\frac{n(P+N)}{nN}\right)^{n/2}$, so roughly $\frac{1}{2}\log\left(1+\frac{P}{N}\right)$ bits can be conveyed reliably per transmission.
Band-Limited Channel
A common model for communication over a radio network or a telephone line is a band-limited channel with white noise. This is a continuous-time channel. The output of such a channel can be described as the convolution
$$Y(t)=(X(t)+Z(t))*h(t)\tag{11}$$
where
- $Y(t)$ is the output signal waveform
- $X(t)$ is the input signal waveform
- $Z(t)$ is the white Gaussian noise waveform
- $h(t)$ is the impulse response of an ideal bandpass filter (which cuts out all frequencies greater than $W$).
Theorem 2 (The sampling theorem):
A function $f(t)$ that is band-limited to $W$ is completely determined by samples of the function spaced $\frac{1}{2W}$ seconds apart.
[Proof: book 271]
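An illustrative sketch of Theorem 2 (not the book's proof): a band-limited signal is rebuilt from its samples taken $\frac{1}{2W}$ seconds apart via the interpolation formula $f(t)=\sum_n f\!\left(\frac{n}{2W}\right)\operatorname{sinc}(2Wt-n)$, with $\operatorname{sinc}(x)=\frac{\sin \pi x}{\pi x}$. The band limit $W$ and the test signal below are assumed for the example.

```python
import numpy as np

W = 4.0                               # band limit in Hz (assumed for the example)
fs = 2 * W                            # Nyquist sampling rate: 2W samples per second

def f(t):
    # A test signal whose spectrum lies entirely below W = 4 Hz
    return np.sin(2 * np.pi * 1.5 * t) + 0.5 * np.cos(2 * np.pi * 3.0 * t)

n = np.arange(-200, 201)              # sample indices (truncated sum)
samples = f(n / fs)                   # f(n / 2W)

t = np.linspace(-2, 2, 1001)
# Interpolation: f(t) = sum_n f(n/2W) * sinc(2W t - n); np.sinc is sin(pi x)/(pi x)
recon = np.array([np.sum(samples * np.sinc(fs * ti - n)) for ti in t])

print(np.max(np.abs(recon - f(t))))   # small (limited by truncating the infinite sum)
```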
Now we can formulate the problem of communication over a bandlimited channel:
- Bandwidth: $W$
- Number of samples per second: $2W$
- Signal power: $P$
- Noise power: $N=N_0W$, where $N_0$ is the noise power spectral density
If the channel is used over the time interval $[0,T]$, then
- the energy per sample is $\frac{PT}{2WT}=\frac{P}{2W}$
- the noise variance per sample is $\frac{N_0 WT}{2WT}=\frac{N_0}{2}$
Using Theorem 1 (Eq. $(10)$), we obtain the capacity per sample:
$$C=\frac{1}{2} \log \left(1+\frac{P/(2 W)}{N_{0}/2}\right)=\frac{1}{2} \log \left(1+\frac{P}{N_{0} W}\right) \quad \text{bits per sample} \tag{12}$$
Since there are $2W$ samples per second, the capacity per second is
$$C=W\log \left( 1+\frac{P}{N_0W} \right) \quad \text{bits per second} \tag{13}$$
N.B. If $W\to \infty$, then using $\ln (1+x)\sim x ~(x\to 0)$, the capacity approaches $C=\frac{P \log e}{N_0}~\mathrm{bps}$.
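A small numerical sketch of Eq. (13) and of this infinite-bandwidth limit (signal power and noise PSD are assumed values): as $W$ grows, $C=W\log_2\!\big(1+\frac{P}{N_0W}\big)$ increases but saturates at $\frac{P}{N_0}\log_2 e$.

```python
import numpy as np

P, N0 = 1.0, 1e-2                     # assumed signal power (W) and noise PSD (W/Hz)

def capacity(W):
    """Eq. (13): capacity of the band-limited AWGN channel in bits per second."""
    return W * np.log2(1 + P / (N0 * W))

for W in [10, 100, 1_000, 10_000, 100_000]:
    print(W, capacity(W))             # increases with W but flattens out

# Infinite-bandwidth limit: P * log2(e) / N0
print(P * np.log2(np.e) / N0)         # ~ 144.27 bits per second
```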
Definition 8 (Bandwidth Efficiency):
Bandwidth efficiency $\eta$ is defined as the rate $R$ (in $\mathrm{bit/s}$) divided by the bandwidth $W$ (in $\mathrm{Hz}$):
$$\eta=\frac{R}{W}~\mathrm{bit/s/Hz} \tag{14}$$
From the channel capacity formula it follows that
$$R<C=W \log \left(1+\frac{P}{W N_{0}}\right)=W \log \left(1+\frac{R E_{b}}{W N_{0}}\right)$$
where $E_{b}$ is the energy per bit. Hence,
$$\eta<\log \left(1+\eta \frac{E_{b}}{N_{0}}\right), \quad \text{i.e.,} \quad \frac{E_{b}}{N_{0}}>\frac{2^{\eta}-1}{\eta} \tag{15}$$
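Eq. (15) is easy to tabulate (an illustrative sketch): the minimum required $E_b/N_0$ grows with the bandwidth efficiency $\eta$, and as $\eta\to 0$ it approaches $\ln 2\approx 0.693$, i.e. about $-1.59~\mathrm{dB}$, the Shannon limit.

```python
import numpy as np

def ebn0_min(eta):
    """Minimum Eb/N0 (linear scale) required for bandwidth efficiency eta, Eq. (15)."""
    return (2.0**eta - 1.0) / eta

for eta in [0.01, 0.5, 1.0, 2.0, 4.0, 8.0]:
    lin = ebn0_min(eta)
    print(f"eta = {eta:5}:  Eb/N0 > {lin:8.3f}  ({10 * np.log10(lin):6.2f} dB)")

# As eta -> 0, the bound tends to ln 2 ~ 0.693, i.e. about -1.59 dB (Shannon limit)
print(10 * np.log10(np.log(2)))
```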
Parallel Gaussian Channels
Consider $k$ independent Gaussian channels in parallel, with noise variances $N_1,\ldots,N_k$ and a common total power constraint $P$. To maximize the total capacity we must distribute the power among the channels; the problem to be solved is
$$\begin{aligned} &\text{minimize} && -\sum_{j=1}^{k}C_j =-\sum_{j=1}^{k} \frac{1}{2}\log \left(1+\frac{P_j}{N_j} \right)\\ &\text{subject to} && \sum_{j=1}^{k} P_j \le P \end{aligned}\tag{16}$$
Using Lagrange multipliers gives the function
$$L(P_1,\cdots,P_k,\lambda)=-\sum_{j=1}^{k} \frac{1}{2}\log \left(1+\frac{P_j}{N_j} \right)+\lambda\left(\sum_{j=1}^{k} P_j -P\right)$$
KKT conditions:
$$\sum_{j=1}^{k} P_j \le P,\quad \lambda\ge 0\\ \nabla _{P_j}L=0 \Longrightarrow P_j=\frac{1}{2\lambda}-N_j\\ \lambda\left(\sum_{j=1}^{k} P_j - P\right)=0$$
Together with the condition that the $P_j$ are nonnegative, this gives the solution
$$P_j=\max \left\{0,\frac{1}{2\lambda}-N_j\right\}\triangleq(\nu-N_j)^+ \tag{17}$$
where $\nu$ is chosen such that $\sum_j (\nu -N_j)^+=P$.
This solution is illustrated graphically in Figure 9.4 of the book. The vertical levels indicate the noise levels in the various channels. As the signal power is increased from zero, we allot the power to the channels with the lowest noise. When the available power is increased still further, some of the power is put into noisier channels.
The process by which the power is distributed among the various bins is identical to the way in which water distributes itself in a vessel, hence this process is sometimes referred to as water-filling.
[Example: Slides 23-25]
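A minimal water-filling sketch (my own implementation of Eq. (17), not code from the slides): the water level $\nu$ is found by bisection so that $\sum_j(\nu-N_j)^+=P$, after which the least-noisy channels are filled first; the noise variances below are assumed example values.

```python
import numpy as np

def water_filling(noise, P, tol=1e-12):
    """Return the allocation P_j = (nu - N_j)^+ with sum_j P_j = P (Eq. (17))."""
    noise = np.asarray(noise, dtype=float)
    lo, hi = noise.min(), noise.max() + P          # the water level nu lies in here
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        used = np.maximum(nu - noise, 0.0).sum()
        lo, hi = (nu, hi) if used < P else (lo, nu)
    return np.maximum(0.5 * (lo + hi) - noise, 0.0)

noise = [1.0, 4.0, 9.0]        # assumed noise variances N_j of three parallel channels
for P in [2.0, 10.0, 30.0]:
    alloc = water_filling(noise, P)
    total_C = np.sum(0.5 * np.log2(1 + alloc / np.array(noise)))
    print(P, np.round(alloc, 3), round(total_C, 3))
```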
Gaussian Channels with Feedback
The feedback allows the input of the channel to depend on the past values of the output:
- Capacity without feedback:
$$\max _{\operatorname{tr}\left(K_{X}\right) \leq n P} \frac{1}{2 n} \log \frac{\left|K_{X}+K_{Z}\right|}{\left|K_{Z}\right|}\tag{18}$$
- Capacity with feedback:
$$\max _{\operatorname{tr}\left(K_{X}\right) \leq n P} \frac{1}{2 n} \log \frac{\left|K_{X+Z}\right|}{\left|K_{Z}\right|}\tag{19}$$
where $K_{(\cdot)}$ denotes an $n \times n$ covariance matrix; a numerical sanity check of Eq. (18) is sketched below.
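As a sanity check of the determinant expression in Eq. (18) (an illustrative sketch, not a computation of the actual maximum): for white noise $K_Z=NI$ and the i.i.d. input $K_X=PI$, the objective $\frac{1}{2n}\log\frac{|K_X+K_Z|}{|K_Z|}$ reduces to the memoryless capacity $\frac{1}{2}\log\big(1+\frac{P}{N}\big)$ of Eq. (10).

```python
import numpy as np

def rate(K_X, K_Z):
    """Objective of Eq. (18): (1/2n) log2 |K_X + K_Z| / |K_Z|."""
    n = K_X.shape[0]
    _, logdet_num = np.linalg.slogdet(K_X + K_Z)   # natural-log determinants
    _, logdet_den = np.linalg.slogdet(K_Z)
    return (logdet_num - logdet_den) / (2 * n * np.log(2))

n, P, N = 8, 4.0, 1.0                 # assumed block length, power, noise variance
K_X = P * np.eye(n)                   # i.i.d. input, tr(K_X) = nP
K_Z = N * np.eye(n)                   # white noise

print(rate(K_X, K_Z), 0.5 * np.log2(1 + P / N))   # both ~ 1.161
```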
Remarks:
- Memoryless channels: feedback does not increase capacity!
- Channels with memory: feedback does increase capacity!
- Feedback does not improve capacity by more than $\frac{1}{2}$ bit:
$$C_{\text{with FB}} \leq C_{\text{without FB}}+\frac{1}{2} \tag{20}$$
- Feedback does not improve capacity by more than a factor of two:
$$C_{\text{with FB}} \leq 2\, C_{\text{without FB}} \tag{21}$$
- Conclusion: feedback may help, but not much!