References:
- Elements of Information Theory, 2nd Edition
- Slides of EE4560, TUD
How to measure information? → Hartley’s approach [slides 5] → Problems?
No allowance is made for the fact that
- the $n$ alphabet elements may have unequal probabilities of occurrence
- there may be a dependence between successive symbols
→ Shannon’s method: associate information with uncertainty, using the concept of probability.
Entropy
Definition 1 (Entropy):
Let $X$ be a discrete random variable with alphabet $\mathcal X$ and probability mass function $p(x)=\Pr(X=x),\ x\in \mathcal X$. The entropy $H(X)$ is defined by
$$H(X)=-\sum_{x\in \mathcal X}p(x)\log p(x)=-E\log p(X)\tag{1}$$
Sometimes we will use the notation $H(p)$. (Define $-p(x)\log p(x)=0$ if $p(x)=0$.)
Interpretation:
- a measure of the uncertainty of a random variable
- the minimum average number of ‘yes/no’ questions needed to determine which alphabet symbol $x\in \mathcal X$ occurred [slides 9, 10, Ex 2.1]
- the minimum number of bits to represent the outcome of an experiment (Shannon’s first coding theorem) [slides 12-14]
- Why the logarithm? So that the measure grows linearly with system size and “behaves like information” (it is additive for independent events).
Properties:
- $H(X)\ge 0$, with equality iff $\exists i$ such that $p(x_i)=1$.
- $H(X)$ is a concave function of $p(x)$ (since $\nabla^2 H\preceq 0$).
- $H(X)\le \log k$, where $k=|\mathcal X|$, with equality iff $p(x_i)=1/k$ for all $i$ (Lagrangian method). Entropy is maximal for the uniform distribution (maximum uncertainty); see the numerical sketch below.
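As a quick numerical sketch of Definition 1 and the properties above (the `entropy` helper and the example pmfs are my own illustrations, not from the book or slides):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # drop zero-probability symbols (0 log 0 = 0)
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([1.0, 0.0, 0.0]))               # 0.0  -> a certain outcome carries no uncertainty
print(entropy([0.25, 0.25, 0.25, 0.25]))      # 2.0  = log2(4), the maximum for k = 4 symbols
print(entropy([0.5, 0.25, 0.125, 0.125]))     # 1.75 < 2.0, a non-uniform pmf gives less
```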
Definition 2 (Joint Entropy):
Let $(X,Y)$ be a pair of discrete random variables with alphabets $\mathcal X$ and $\mathcal Y$, respectively, and joint probability mass function $p(x,y)=\Pr(X=x,Y=y),\ x\in \mathcal X,\ y\in \mathcal Y$. The joint entropy $H(X,Y)$ is defined by
$$H(X,Y)=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(x,y)=-E\log p(X,Y)\tag{2}$$
Definition 3 (Conditional Entropy):
What is the entropy of $Y$ if we know $X=x$?
$$H(Y|X=x)=-\sum_{y\in \mathcal Y}p(y|x)\log p(y|x)$$
By averaging over all values $x\in \mathcal X$, we obtain the average amount of information in $Y$ given foreknowledge of $X$:
$$H(Y|X)=\sum_{x\in \mathcal X}p(x)H(Y|X=x)=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(y|x)=-E\log p(Y|X)\tag{3}$$
The naturalness of the definition of joint entropy and conditional entropy is exhibited by the fact that the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other.
Theorem 1 (Chain Rule for Entropy):
$$H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)\tag{4}$$
Proof:
$$H(X)=-\sum_{x\in \mathcal X}p(x)\log p(x)=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(x)$$
$$\begin{aligned} H(X)+H(Y|X)&=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)[\log p(x)+\log p(y|x)]\\ &=-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(x,y)\\ &=H(X,Y) \end{aligned}$$
Remarks:
- $H(X|Y)\neq H(Y|X)$ unless $H(X)=H(Y)$.
- Conditioning decreases the entropy (uncertainty): $H(Y|X)\le H(Y)$.
  Intuitively, if $X$ and $Y$ are related, then observing $X$ gives us some information about $Y$, so the uncertainty about $Y$ decreases.
  Mathematically, this amounts to proving $-\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(y|x)\le -\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log p(y)$, i.e.,
  $$E_{XY}\log \frac{p(X)p(Y)}{p(X,Y)}\le 0$$
  This follows from Jensen’s inequality (the first-order condition for concave functions):
  $$E_{XY}\log \frac{p(X)p(Y)}{p(X,Y)}\le \log\left(E_{XY}\frac{p(X)p(Y)}{p(X,Y)}\right)=\log 1=0$$
- As a consequence, $H(X,Y)\le H(X)+H(Y)$, with equality iff $p(x,y)=p(x)p(y)$.
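These identities are easy to check numerically from any joint pmf. A minimal sketch, assuming an arbitrary 2×3 joint table of my own and a small `H` helper (neither is from the book or slides):

```python
import numpy as np

def H(p):
    """Entropy of a pmf given as an array of probabilities (any shape)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# an arbitrary joint pmf p(x, y) on a 2 x 3 alphabet (rows: x, columns: y)
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)                     # marginal p(x)
p_y = p_xy.sum(axis=0)                     # marginal p(y)

# H(Y|X) = sum_x p(x) H(Y | X = x), where p(y|x) = p(x,y) / p(x)
H_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(np.isclose(H(p_xy), H(p_x) + H_y_given_x))   # True: chain rule H(X,Y) = H(X) + H(Y|X)
print(bool(H_y_given_x <= H(p_y)))                 # True: conditioning decreases entropy
print(bool(H(p_xy) <= H(p_x) + H(p_y)))            # True: H(X,Y) <= H(X) + H(Y)
```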
Generalization:
$$H(X_1,\cdots,X_n)=\sum_{i=1}^nH(X_i|X_{i-1},\cdots,X_1)\le \sum_{i=1}^nH(X_i)$$
Proof:
$$\begin{aligned} H(X_1,\cdots,X_n)&=-\sum_{x_1,\cdots,x_n}p(x_1,\cdots, x_n)\log p(x_1,\cdots, x_n)\\ &=-\sum_{x_1,\cdots,x_n}p(x_1,\cdots, x_n)\log \prod_{i=1}^n p(x_i|x_{i-1},\cdots,x_1)\\ &=-\sum_{i=1}^n\sum_{x_1,\cdots,x_n}p(x_1,\cdots, x_n)\log p(x_i|x_{i-1},\cdots,x_1)\\ &=-\sum_{i=1}^n\sum_{x_1,\cdots,x_i}p(x_1,\cdots, x_i)\log p(x_i|x_{i-1},\cdots,x_1)\\ &=\sum_{i=1}^nH(X_i|X_{i-1},\cdots,X_1) \end{aligned}$$
Mutual Information
The mutual information is a measure of the amount of information that one random variable contains about another random variable.
Definition 4 (Mutual Information):
Let $(X,Y)$ be a pair of random variables with joint probability mass function $p(x,y)$ and marginal probability mass functions $p(x)$ and $p(y)$, respectively. The mutual information $I(X;Y)$ is defined by
$$I(X;Y)=\sum_{x\in \mathcal X}\sum_{y\in \mathcal Y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\tag{5}$$
Corollary:
- $I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$
  The mutual information is the reduction in the uncertainty of one random variable due to the knowledge of the other.
- $I(X;Y)=I(Y;X)$
- $I(X;Y)=H(X)+H(Y)-H(X,Y)$
- $I(X;X)=H(X)$
Properties:
- $I(X;Y)\ge 0$, with equality iff $X$ and $Y$ are independent.
- $I(X;Y)\le H(X)$, with equality iff $X$ is completely determined by $Y$, i.e., $\forall j\ \exists i: p(x_i|y_j)=1$.
- $I(X;Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$ and a convex function of $p(y|x)$ for fixed $p(x)$. [book P33]
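A numerical sketch of Equation (5) and the corollaries above, reusing the same invented joint table and `H` helper (illustrative assumptions, not from the source material):

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_xy = np.array([[0.10, 0.20, 0.10],           # same arbitrary joint pmf as before
                 [0.30, 0.05, 0.25]])
p_x = p_xy.sum(axis=1, keepdims=True)           # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)           # marginal p(y), shape (1, 3)

# Equation (5): I(X;Y) = sum_{x,y} p(x,y) log p(x,y) / (p(x) p(y))
mask = p_xy > 0
I = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

print(np.isclose(I, H(p_x) + H(p_y) - H(p_xy)))  # True: I(X;Y) = H(X) + H(Y) - H(X,Y)
print(bool(I >= 0))                              # True: mutual information is non-negative
```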
Definition 5 (Conditional Mutual Information):
Let $(X,Y,Z)$ be random variables with joint probability mass function $p(x,y,z)$ and marginal probability mass functions $p(x)$, $p(y)$ and $p(z)$, respectively. The conditional mutual information $I(X;Y|Z)$ is defined by
$$\begin{aligned} I(X;Y|Z) &=\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} p(x, y, z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)} \\ &=H(X | Z)-H(X | Y, Z) \end{aligned}\tag{6}$$
Interpretation:
- The conditional mutual information is the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given.
- After introducing the conditional mutual information, we are able to formulate the chain rule for mutual information.
Theorem 2 (Chain Rule for Mutual Information):
Let $(X_1,\cdots,X_n,Y)$ be a collection of random variables with joint probability mass function $p(x_1,\cdots,x_n,y)$. Then
$$I(X_1,\cdots,X_n;Y)=\sum_{i=1}^nI(X_i;Y|X_{i-1},\cdots,X_1)\tag{7}$$
Proof: Since mutual information can be expressed in terms of entropies, it satisfies a chain rule as well.
$$\begin{aligned} I(X_{1}, X_{2}, \ldots, X_{n} ; Y) &=H(X_{1}, X_{2}, \ldots, X_{n})-H(X_{1}, X_{2}, \ldots, X_{n} | Y) \\ &=\sum_{i=1}^{n} H(X_{i} | X_{i-1}, \ldots, X_{1})-\sum_{i=1}^{n} H(X_{i} |X_{i-1}, \ldots, X_{1}, Y) \\ &=\sum_{i=1}^{n} I(X_{i} ; Y| X_{1}, X_{2}, \ldots, X_{i-1}) \end{aligned}$$
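Before moving on, here is a small numerical sketch of Equation (6). The three-way pmf is an invented random example; it computes $I(X;Y|Z)$ as $H(X|Z)-H(X|Y,Z)$ via joint entropies:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
p_xyz = rng.random((2, 3, 2))
p_xyz /= p_xyz.sum()                 # an arbitrary joint pmf p(x, y, z)

# Equation (6): I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
#             = [H(X,Z) - H(Z)] - [H(X,Y,Z) - H(Y,Z)]
H_xz = H(p_xyz.sum(axis=1))          # marginalize out y -> p(x, z)
H_z  = H(p_xyz.sum(axis=(0, 1)))     # p(z)
H_yz = H(p_xyz.sum(axis=0))          # p(y, z)
I_xy_given_z = (H_xz - H_z) - (H(p_xyz) - H_yz)

print(I_xy_given_z)                  # some non-negative number
print(bool(I_xy_given_z >= -1e-12))  # conditional mutual information is never negative
```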
Entropy Rates
A discrete information source is a source emitting a sequence of symbols from a fixed finite alphabet according to a probability law. It can be modelled by a stochastic process, which is a sequence of random variables characterised by a joint probability function
$$p(x_1,\cdots,x_n)=\Pr((X_1,\cdots,X_n)=(x_1,\cdots,x_n)),\quad \forall n$$
If the random variables are mutually independent, that is, $p(x_1,\cdots,x_n)=\prod_i p(x_i)$, we say that the process is memoryless. An information source having memory can be described by a Markov process.
Definition 6 (Markov Process):
A discrete stochastic process is said to be a Markov process of order $m$ if, for $n=1,2,\cdots$,
$$\Pr(X_n=x_n|X_{n-1}=x_{n-1},\cdots,X_1=x_1)=\Pr(X_n=x_n|X_{n-1}=x_{n-1},\cdots,X_{n-m}=x_{n-m})\tag{8}$$
i.e., a Markov process of order $m$ is a stochastic process in which each random variable depends on the $m$ preceding variables and is conditionally independent of all earlier random variables.
Remarks:
- The conditional probability of a value $x_n$ of the random variable $X_n$, given all preceding values, equals $p(x_n|x_{n-1},\cdots,x_1)=p(x_n|x_{n-1},\cdots,x_{n-m})$.
- The values $S_i=(x_{n-1},\cdots,x_{n-m})$ form the conditioning part of the conditional probability distribution of $X_n$; $S_i$ is called the state of the Markov chain.
- Besides giving the conditional probability distribution of $X_n$, one can also characterize the Markov process by a probability transition matrix containing the transition probabilities between the different states [slides 40-43, Ex 4.7]; a small sampling sketch follows below.
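As a minimal illustration (an invented 2-state example, not Ex 4.7 from the slides), here is a transition matrix of a first-order Markov source and how one would sample a realization from it:

```python
import numpy as np

# transition matrix of a first-order binary Markov source:
# row i holds Pr(X_n = j | X_{n-1} = i)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

rng = np.random.default_rng(1)
state, samples = 0, []
for _ in range(12):
    state = rng.choice(2, p=P[state])   # draw the next symbol given the current state
    samples.append(int(state))
print(samples)                           # one realization of the source
```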
How do we measure the information of a process with memory?
If we have a sequence of n n n random variables, a natural question to ask is: How does the entropy of the sequence grow with n n n? We define the entropy rate as this rate of growth as follows.
Definition 7 (Entropy Rate):
The entropy rate of a stochastic process $\{X_i\}$ is defined by
$$H_\infty(X)=\lim_{n\to \infty}\frac{1}{n}H(X_1,X_2,\cdots,X_n)\tag{9}$$
when the limit exists.
Remarks:
- For independent, identically distributed (i.i.d.) random variables,
  $$H_\infty(X)=\lim _{n\to \infty}\frac{1}{n}nH(X_1)=H(X_1)$$
  The entropy rate of the stochastic process $\{X_i\}$ equals the entropy of the individual random variables $X_i$.
- We can also define a related quantity:
  $$H'_\infty(X)=\lim_{n\to \infty}H(X_n|X_{n-1},\cdots,X_1)\tag{10}$$
  when the limit exists.
- $H_\infty(X)$ is the per-symbol entropy of the $n$ random variables, while $H'_\infty(X)$ is the conditional entropy of the last random variable given the past.
- For stationary processes, $H_\infty(X)=H'_{\infty}(X)$. [book P75]
- For a stationary Markov process of order $m$, the entropy rate is given by
  $$H_\infty(X)=H'_{\infty}(X)=H(X_n|X_{n-1},\cdots,X_{n-m})$$
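For a stationary first-order Markov chain with transition matrix $P$ and stationary distribution $\mu$, this conditional entropy works out to $H(X_n|X_{n-1})=\sum_i \mu_i\, H(P_{i\cdot})$, i.e., a $\mu$-weighted average of the row entropies of $P$. A sketch using the invented transition matrix from above:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P = np.array([[0.9, 0.1],               # same invented transition matrix as above
              [0.4, 0.6]])

# stationary distribution mu solves mu P = mu (left eigenvector for eigenvalue 1)
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

# entropy rate: H_inf(X) = H(X_n | X_{n-1}) = sum_i mu_i * H(P[i, :])
rate = sum(mu[i] * H(P[i]) for i in range(len(mu)))
print(mu)    # stationary distribution, here [0.8, 0.2]
print(rate)  # per-symbol entropy of this stationary Markov source (about 0.57 bits)
```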