References:
- Elements of Information Theory, 2nd Edition
- Slides of EE4560, TUD
How to measure information? → Hartley’s approach [slides 5] → Problems?
No allowance is made for the fact that
- the $n$ alphabet elements may have unequal probabilities of occurrence
- there may be a dependence between successive symbols
→ Shannon’s method: associate information with uncertainty, using the concept of probability.
Entropy
Definition 1 (Entropy):
Let $X$ be a discrete random variable with alphabet $\mathcal X$ and probability mass function $p(x)=\Pr(X=x),\ x\in \mathcal X$. The entropy $H(X)$ is defined by
$$H(X)=-\sum_{x\in \mathcal X}p(x)\log p(x)=-E\log p(X)\tag{1}$$
Sometimes we will use the notation $H(p)$. (Define $-p(x)\log p(x)=0$ if $p(x)=0$.)
Interpretation:
- a measure of the uncertainty of a random variable
- the minimum average number of ‘yes/no’ questions needed to determine which alphabet symbol $x\in \mathcal X$ occurred [slides 9, 10, Ex 2.1]
- the minimum number of bits to represent the outcome of an experiment (Shannon’s first coding theorem) [slides 12-14]
- Why the logarithm? So that the measure grows linearly with system size and “behaves like information” (it is additive for independent events).
Properties:
- $H(X)\ge 0$, with equality iff $\exists i$ such that $p(x_i)=1$.
- $H(X)$ is a concave function of $p(x)$ (since $\nabla^2 H\preceq 0$).
- $H(X)\le \log k$, where $k=|\mathcal X|$, with equality iff $p(x_i)=1/k$ for all $i$ (Lagrangian method). Entropy is maximal for the uniform distribution (maximum uncertainty); see the numerical sketch below.
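As a quick numerical sketch of Definition 1 and the properties above (the `entropy` helper and the example pmfs are my own illustrations, not from the book or slides):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # drop zero-probability symbols (0 log 0 = 0)
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([1.0, 0.0, 0.0]))               # 0.0  -> a certain outcome carries no uncertainty
print(entropy([0.25, 0.25, 0.25, 0.25]))      # 2.0  = log2(4), the maximum for k = 4 symbols
print(entropy([0.5, 0.25, 0.125, 0.125]))     # 1.75 < 2.0, a non-uniform pmf gives less
```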
Definition 2 (Joint Entropy):
Let $(X,Y)$ be a pair of discrete random variables with alphabets $\mathcal X$ and $\mathcal Y$, respectively, and joint probability mass function $p(x,y)=\Pr(X=x,Y=y),\ x\in \mathcal X,\ y\in \mathcal Y$. The joint entropy $H(X,Y)$ is defined by
$$H(X,Y)=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(x,y)=-E\log p(X,Y)\tag{2}$$
Definition 3 (Conditional Entropy):
What is the entropy of $Y$ if we know $X=x$?
$$H(Y|X=x)=-\sum_{y\in \mathcal Y}p(y|x)\log p(y|x)$$
By averaging over all values $x\in \mathcal X$, we obtain the average amount of information in $Y$ given foreknowledge of $X$:
$$H(Y|X)=\sum_{x\in \mathcal X}p(x)H(Y|X=x)=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(y|x)=-E\log p(Y|X)\tag{3}$$
The naturalness of the definition of joint entropy and conditional entropy is exhibited by the fact that the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other.
Theorem 1 (Chain Rule for Entropy):
$$H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)\tag{4}$$
Proof:
$$H(X)=-\sum_{x\in \mathcal X}p(x)\log p(x)=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(x)$$
$$\begin{aligned} H(X)+H(Y|X)&=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)[\log p(x)+\log p(y|x)]\\ &=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(x,y)\\ &=H(X,Y) \end{aligned}$$
Remarks:
- $H(X|Y)\neq H(Y|X)$ unless $H(X)=H(Y)$.
- Conditioning decreases the entropy (uncertainty): $H(Y|X)\le H(Y)$.
  Intuitively, if $X$ and $Y$ are related, then observing $X$ gives us some information about $Y$, so the uncertainty about $Y$ decreases.
  Mathematically, this amounts to proving $-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(y|x)\le -\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(y)$, i.e.,
  $$E_{XY}\log \frac{p(X)p(Y)}{p(X,Y)}\le 0$$
  This follows from Jensen’s inequality (the first-order condition for concave functions):
  $$E_{XY}\log \frac{p(X)p(Y)}{p(X,Y)}\le \log\left(E_{XY}\frac{p(X)p(Y)}{p(X,Y)}\right)=\log 1=0$$
- As a consequence, $H(X,Y)\le H(X)+H(Y)$, with equality iff $p(x,y)=p(x)p(y)$.
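These identities are easy to check numerically from any joint pmf. A minimal sketch, assuming an arbitrary 2×3 joint table of my own and a small `H` helper (neither is from the book or slides):

```python
import numpy as np

def H(p):
    """Entropy of a pmf given as an array of probabilities (any shape)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# an arbitrary joint pmf p(x, y) on a 2 x 3 alphabet (rows: x, columns: y)
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)                     # marginal p(x)
p_y = p_xy.sum(axis=0)                     # marginal p(y)

# H(Y|X) = sum_x p(x) H(Y | X = x), where p(y|x) = p(x,y) / p(x)
H_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(np.isclose(H(p_xy), H(p_x) + H_y_given_x))   # True: chain rule H(X,Y) = H(X) + H(Y|X)
print(bool(H_y_given_x <= H(p_y)))                 # True: conditioning decreases entropy
print(bool(H(p_xy) <= H(p_x) + H(p_y)))            # True: H(X,Y) <= H(X) + H(Y)
```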
Generalization:
$$H(X_1,\cdots,X_n)=\sum_{i=1}^nH(X_i|X_{i-1},\cdots,X_1)\le \sum_{i=1}^nH(X_i)$$
Proof:
$$\begin{aligned} H(X_1,\cdots,X_n)&=-\sum_{x_1,\cdots,x_n}p(x_1,\cdots, x_n)\log p(x_1,\cdots, x_n)\\ &=-\sum_{x_1,\cdots,x_n}p(x_1,\cdots, x_n)\log \prod_{i=1}^n p(x_i|x_{i-1},\cdots,x_1)\\ &=-\sum_{i=1}^n\sum_{x_1,\cdots,x_n}p(x_1,\cdots, x_n)\log p(x_i|x_{i-1},\cdots,x_1)\\ &=-\sum_{i=1}^n\sum_{x_1,\cdots,x_i}p(x_1,\cdots, x_i)\log p(x_i|x_{i-1},\cdots,x_1)\\ &=\sum_{i=1}^nH(X_i|X_{i-1},\cdots,X_1) \end{aligned}$$
Mutual Information
The mutual information is a measure of the amount of information that one random variable contains about another random variable.
Definition 4 (Mutual Information):
Let $(X,Y)$ be a pair of random variables with joint probability mass function $p(x,y)$ and marginal probability mass functions $p(x)$ and $p(y)$, respectively. The mutual information $I(X;Y)$ is defined by
$$I(X;Y)=\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\tag{5}$$
Corollary:
- $I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$
  The mutual information is the reduction in the uncertainty of one random variable due to the knowledge of the other.
- $I(X;Y)=I(Y;X)$
- $I(X;Y)=H(X)+H(Y)-H(X,Y)$
- $I(X;X)=H(X)$
Properties:
- $I(X;Y)\ge 0$, with equality iff $X$ and $Y$ are independent.
- $I(X;Y)\le H(X)$, with equality iff $X$ is completely determined by $Y$, i.e., $\forall j\ \exists i: p(x_i|y_j)=1$.
- $I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$ and a convex function of $p(y|x)$ for fixed $p(x)$. [book P33]
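A numerical sketch of Equation (5) and the corollaries above, reusing the same invented joint table and `H` helper (illustrative assumptions, not from the source material):

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_xy = np.array([[0.10, 0.20, 0.10],           # same arbitrary joint pmf as before
                 [0.30, 0.05, 0.25]])
p_x = p_xy.sum(axis=1, keepdims=True)           # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)           # marginal p(y), shape (1, 3)

# Equation (5): I(X;Y) = sum_{x,y} p(x,y) log p(x,y) / (p(x) p(y))
mask = p_xy > 0
I = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

print(np.isclose(I, H(p_x) + H(p_y) - H(p_xy)))  # True: I(X;Y) = H(X) + H(Y) - H(X,Y)
print(bool(I >= 0))                              # True: mutual information is non-negative
```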
Definition 5 (Conditional Mutual Information):
Let $(X,Y,Z)$ be random variables with joint probability mass function $p(x,y,z)$ and marginal probability mass functions $p(x)$, $p(y)$ and $p(z)$, respectively. The conditional mutual information $I(X;Y|Z)$ is defined by
$$\begin{aligned} I(X;Y|Z) &=\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} p(x, y, z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)} \\ &=H(X | Z)-H(X | Y, Z) \end{aligned}\tag{6}$$
Interpretation:
- The conditional mutual information is the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given.
- After introducing the conditional mutual information, we are able to formulate the chain rule for mutual information.
Theorem 2 (Chain Rule for Mutual Information):
Let $(X_1,\cdots,X_n,Y)$ be a collection of random variables with joint probability mass function $p(x_1,\cdots,x_n,y)$. Then
$$I(X_1,\cdots,X_n;Y)=\sum_{i=1}^nI(X_i;Y|X_{i-1},\cdots,X_1)\tag{7}$$
Proof: Since mutual information can be expressed in terms of entropies, it satisfies a chain rule as well.
$$\begin{aligned} I(X_{1}, X_{2}, \ldots, X_{n} ; Y) &=H(X_{1}, X_{2}, \ldots, X_{n})-H(X_{1}, X_{2}, \ldots, X_{n} | Y) \\ &=\sum_{i=1}^{n} H(X_{i} | X_{i-1}, \ldots, X_{1})-\sum_{i=1}^{n} H(X_{i} |X_{i-1}, \ldots, X_{1}, Y) \\ &=\sum_{i=1}^{n} I(X_{i} ; Y| X_{1}, X_{2}, \ldots, X_{i-1}) \end{aligned}$$
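Before moving on, here is a small numerical sketch of Equation (6). The three-way pmf is an invented random example; it computes $I(X;Y|Z)$ as $H(X|Z)-H(X|Y,Z)$ via joint entropies:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
p_xyz = rng.random((2, 3, 2))
p_xyz /= p_xyz.sum()                 # an arbitrary joint pmf p(x, y, z)

# Equation (6): I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
#             = [H(X,Z) - H(Z)] - [H(X,Y,Z) - H(Y,Z)]
H_xz = H(p_xyz.sum(axis=1))          # marginalize out y -> p(x, z)
H_z  = H(p_xyz.sum(axis=(0, 1)))     # p(z)
H_yz = H(p_xyz.sum(axis=0))          # p(y, z)
I_xy_given_z = (H_xz - H_z) - (H(p_xyz) - H_yz)

print(I_xy_given_z)                  # some non-negative number
print(bool(I_xy_given_z >= -1e-12))  # conditional mutual information is never negative
```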
Entropy Rates
A discrete information source is a source emitting a sequence of symbols from a fixed finite alphabet according to a probability law. It can be modelled by a stochastic process, which is a sequence of random variables characterised by a joint probability function
$$p(x_1,\cdots,x_n)=\Pr((X_1,\cdots,X_n)=(x_1,\cdots,x_n)),\quad \forall n$$
If the random variables are mutually independent, that is, $p(x_1,\cdots,x_n)=\prod_i p(x_i)$, we say that the process is memoryless. An information source having memory can be described by a Markov process.
Definition 6 (Markov Process):
A discrete stochastic process is said to be a Markov process of order $m$ if, for $n=1,2,\cdots$,
$$\Pr(X_n=x_n|X_{n-1}=x_{n-1},\cdots,X_1=x_1)=\Pr(X_n=x_n|X_{n-1}=x_{n-1},\cdots,X_{n-m}=x_{n-m})\tag{8}$$
i.e., a Markov process of order $m$ is a stochastic process in which each random variable depends on the $m$ preceding variables and is conditionally independent of all earlier random variables.
Remarks:
- The conditional probability of a value $x_n$ of the random variable $X_n$, given all preceding values, equals $p(x_n|x_{n-1},\cdots,x_1)=p(x_n|x_{n-1},\cdots,x_{n-m})$.
- The values $S_i=(x_{n-1},\cdots,x_{n-m})$ form the conditioning part of the conditional probability distribution of $X_n$; $S_i$ is called the state of the Markov chain.
- Besides giving the conditional probability distribution of $X_n$, one can also characterize the Markov process by a probability transition matrix containing the transition probabilities between the different states [slides 40-43, Ex 4.7]; a small sampling sketch follows below.
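As a minimal illustration (an invented 2-state example, not Ex 4.7 from the slides), here is a transition matrix of a first-order Markov source and how one would sample a realization from it:

```python
import numpy as np

# transition matrix of a first-order binary Markov source:
# row i holds Pr(X_n = j | X_{n-1} = i)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

rng = np.random.default_rng(1)
state, samples = 0, []
for _ in range(12):
    state = rng.choice(2, p=P[state])   # draw the next symbol given the current state
    samples.append(int(state))
print(samples)                           # one realization of the source
```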
How do we measure the information of a process with memory?
If we have a sequence of n n n random variables, a natural question to ask is: How does the entropy of the sequence grow with n n n? We define the entropy rate as this rate of growth as follows.
Definition 7 (Entropy Rate):
The entropy rate of a stochastic process $\{X_i\}$ is defined by
$$H_\infty(X)=\lim_{n\to \infty}\frac{1}{n}H(X_1,X_2,\cdots,X_n)\tag{9}$$
when the limit exists.
Remarks:
- For independent, identically distributed (i.i.d.) random variables,
  $$H_\infty(X)=\lim _{n\to \infty}\frac{1}{n}nH(X_1)=H(X_1)$$
  The entropy rate of the stochastic process $\{X_i\}$ equals the entropy of the individual random variables $X_i$.
- We can also define a related quantity:
  $$H'_\infty(X)=\lim_{n\to \infty}H(X_n|X_{n-1},\cdots,X_1)\tag{10}$$
  when the limit exists.
- $H_\infty(X)$ is the per-symbol entropy of the $n$ random variables, while $H'_\infty(X)$ is the conditional entropy of the last random variable given the past.
- For stationary processes, $H_\infty(X)=H'_{\infty}(X)$. [book P75]
- For a stationary Markov process of order $m$, the entropy rate is given by
  $$H_\infty(X)=H'_{\infty}(X)=H(X_n|X_{n-1},\cdots,X_{n-m})$$
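For a stationary first-order Markov chain with transition matrix $P$ and stationary distribution $\mu$, this conditional entropy works out to $H(X_n|X_{n-1})=\sum_i \mu_i\, H(P_{i\cdot})$, i.e., a $\mu$-weighted average of the row entropies of $P$. A sketch using the invented transition matrix from above:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P = np.array([[0.9, 0.1],               # same invented transition matrix as above
              [0.4, 0.6]])

# stationary distribution mu solves mu P = mu (left eigenvector for eigenvalue 1)
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

# entropy rate: H_inf(X) = H(X_n | X_{n-1}) = sum_i mu_i * H(P[i, :])
rate = sum(mu[i] * H(P[i]) for i in range(len(mu)))
print(mu)    # stationary distribution, here [0.8, 0.2]
print(rate)  # per-symbol entropy of this stationary Markov source (about 0.57 bits)
```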