Notes on Information Theory Concepts
Information Content and Entropy
Consider a random source $\mathcal{X}$ with symbol set $\{x_0, x_1, x_2, x_3, x_4 \cdots x_{n-1}\}$. If the probability of drawing any particular symbol is $p(x_i)$, the information content of that symbol is
$$I(x_i) = -\log_2 p(x_i)$$
The entropy of this source is then
$$H(\mathcal{X}) = -\sum_{i=0}^{n-1} p(x_i)\log_2 p(x_i)$$
which is the average information content of the source. Here $\mathcal{X}$ denotes the source. Note that $H(\mathcal{X})$ is a function of the source's symbol probabilities, not a function of a random variable, and it is not a random variable itself.
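To make these two formulas concrete, here is a minimal NumPy sketch (the 4-symbol distribution is a made-up example, not from the notes) that computes each $I(x_i)$ and averages them into $H(\mathcal{X})$:

```python
import numpy as np

# Hypothetical 4-symbol source: p(x_i) for x_0..x_3 (must sum to 1).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Self-information I(x_i) = -log2 p(x_i), in bits.
I = -np.log2(p)
print(I)   # [1. 2. 3. 3.]

# Entropy H(X) = sum_i p(x_i) * I(x_i): the average information per symbol.
H = np.sum(p * I)
print(H)   # 1.75 bits/symbol
```

Rarer symbols carry more information, and the entropy weights each symbol's information content by how often that symbol occurs.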
Joint Entropy
For two random sources $\mathcal{X}, \mathcal{Y}$ with symbol sets $\{x_0, x_1, x_2, x_3, x_4 \cdots x_{n-1}\}$ and $\{y_0, y_1, y_2, y_3, y_4 \cdots y_{m-1}\}$, the joint entropy is:
$$\begin{aligned}
H(\mathcal{X,Y}) &= -\sum_{i=0}^{n-1}\sum_{j=0}^{m-1} p(x_i, y_j)\log_2 p(x_i, y_j) \\
&= -E\big[\log_2 p(X,Y)\big]
\end{aligned}$$
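A minimal sketch of the double sum, using a made-up joint table $p(x_i, y_j)$ (rows index $x$, columns index $y$); the convention $0 \log_2 0 = 0$ is handled by masking out zero entries:

```python
import numpy as np

# Hypothetical joint distribution p(x_i, y_j); all entries sum to 1.
p_xy = np.array([[0.5,   0.25],
                 [0.125, 0.125]])

# H(X,Y) = -sum_{i,j} p(x_i,y_j) log2 p(x_i,y_j), skipping zero entries.
mask = p_xy > 0
H_xy = -np.sum(p_xy[mask] * np.log2(p_xy[mask]))
print(H_xy)  # 1.75 bits
```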
Transition Probabilities
For a discrete memoryless channel, the input symbol is a random variable $X$ drawn from the input alphabet $\mathcal{X}=\{x_0, x_1, x_2, x_3, x_4 \cdots x_{J-1}\}$,
and the output symbol is a random variable $Y$ drawn from the output alphabet $\mathcal{Y}=\{y_0, y_1, y_2, y_3, y_4 \cdots y_{K-1}\}$.
The channel is described by a set of transition probabilities
$$p(y_k \mid x_j) = P(Y = y_k \mid X = x_j) \quad \text{for all } j \text{ and } k$$
which are collected into the transition probability matrix
$$P = \begin{bmatrix}
p(y_0|x_0) & p(y_1|x_0) & \cdots & p(y_{K-1}|x_0) \\
p(y_0|x_1) & p(y_1|x_1) & \cdots & p(y_{K-1}|x_1) \\
\vdots & \vdots & \ddots & \vdots \\
p(y_0|x_{J-1}) & p(y_1|x_{J-1}) & \cdots & p(y_{K-1}|x_{J-1})
\end{bmatrix}$$
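As a concrete (hypothetical) instance, here is the transition matrix of a binary symmetric channel with crossover probability $\varepsilon$; row $j$ holds $p(y_k \mid x_j)$, so every row must sum to 1:

```python
import numpy as np

# Hypothetical binary symmetric channel: J = K = 2, crossover prob eps.
eps = 0.1
P = np.array([[1 - eps, eps],
              [eps,     1 - eps]])

# Each row of the transition matrix is a distribution over Y.
assert np.allclose(P.sum(axis=1), 1.0)

# For an input distribution p(x_j), the output distribution is
# p(y_k) = sum_j p(x_j) p(y_k | x_j), i.e. a row vector times P.
p_x = np.array([0.5, 0.5])
p_y = p_x @ P
print(p_y)  # [0.5 0.5]
```

Multiplying an input distribution (as a row vector) by $P$ gives the output distribution, which is why the rows, not the columns, are normalized.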
Conditional Entropy and Mutual Information
Conditional Entropy
Given the output $Y = y_k$, define the conditional entropy of the random variable $X$, which takes values in $\mathcal{X}$:
$$H(\mathcal{X}|Y=y_k) = -\sum_{j=0}^{J-1} p(x_j|y_k)\log_2 p(x_j|y_k)$$
$H(\mathcal{X}|Y)$ is itself a random variable (note that the conditioning here is on $Y$, not on $\mathcal{Y}$). Its expectation can be written as:
$$\begin{aligned}
H(\mathcal{X}|\mathcal{Y}) &= \sum_{k=0}^{K-1} H(\mathcal{X}|Y=y_k)\,p(y_k) \\
&= -\sum_{k=0}^{K-1}\sum_{j=0}^{J-1} p(x_j|y_k)\log_2\big(p(x_j|y_k)\big)\,p(y_k) \\
&= -\sum_{k=0}^{K-1}\sum_{j=0}^{J-1} p(x_j, y_k)\log_2 p(x_j|y_k)
\end{aligned}$$
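The last line of this derivation translates directly into code; below is a minimal sketch on the same made-up joint table as before (rows index $x_j$, columns index $y_k$):

```python
import numpy as np

# Hypothetical joint distribution p(x_j, y_k); rows index x, columns y.
p_xy = np.array([[0.5,   0.25],
                 [0.125, 0.125]])

p_y = p_xy.sum(axis=0)        # marginal p(y_k)
p_x_given_y = p_xy / p_y      # p(x_j | y_k) = p(x_j, y_k) / p(y_k)

# H(X|Y) = -sum_{j,k} p(x_j, y_k) log2 p(x_j | y_k)
mask = p_xy > 0
H_x_given_y = -np.sum(p_xy[mask] * np.log2(p_x_given_y[mask]))
print(H_x_given_y)  # ~0.7956 bits
```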
Mutual Information
Mutual information is the information about $\mathcal{X}$ that remains once we remove the uncertainty that is still left after observing the output $\mathcal{Y}$; that is, it excludes the uncertainty (information loss) that the channel $\mathcal{X} \to \mathcal{Y}$ introduces.
$$\begin{aligned}
I(\mathcal{X,Y}) &= H(\mathcal{X}) - H(\mathcal{X|Y}) \\
&= -\left(\sum_{j=0}^{J-1} p(x_j)\log_2 p(x_j) - \sum_{k=0}^{K-1}\sum_{j=0}^{J-1} p(x_j, y_k)\log_2 p(x_j|y_k)\right) \\
&= -\left(\sum_{j=0}^{J-1}\sum_{k=0}^{K-1} p(x_j, y_k)\log_2 p(x_j) - \sum_{k=0}^{K-1}\sum_{j=0}^{J-1} p(x_j, y_k)\log_2 p(x_j|y_k)\right) \\
&= -\sum_{j=0}^{J-1}\sum_{k=0}^{K-1} p(x_j, y_k)\log_2\cfrac{p(x_j)\,p(y_k)}{p(x_j, y_k)} \\
&= \sum_{j=0}^{J-1}\sum_{k=0}^{K-1} p(x_j, y_k)\log_2\cfrac{p(x_j, y_k)}{p(x_j)\,p(y_k)}
\end{aligned}$$
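A sketch checking numerically that the first and the last line of this derivation agree, again on the hypothetical joint table:

```python
import numpy as np

# Hypothetical joint distribution p(x_j, y_k); rows index x, columns y.
p_xy = np.array([[0.5,   0.25],
                 [0.125, 0.125]])
p_x = p_xy.sum(axis=1)   # marginal p(x_j)
p_y = p_xy.sum(axis=0)   # marginal p(y_k)
mask = p_xy > 0

# Last line: I(X,Y) = sum_{j,k} p(x_j,y_k) log2( p(x_j,y_k) / (p(x_j)p(y_k)) )
I_direct = np.sum(p_xy[mask] * np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask]))

# First line: I(X,Y) = H(X) - H(X|Y)
H_x = -np.sum(p_x * np.log2(p_x))
H_x_given_y = -np.sum(p_xy[mask] * np.log2((p_xy / p_y)[mask]))

print(I_direct, H_x - H_x_given_y)  # both ~0.0157 bits
```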
Chain Rule for Entropy
$$H(\mathcal{X,Y}) = H(\mathcal{X}) + H(\mathcal{Y|X})$$
$$H(\mathcal{X}) - H(\mathcal{X|Y}) = H(\mathcal{Y}) - H(\mathcal{Y|X})$$
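Both identities can be checked numerically; here is a sketch of the first one on the same hypothetical joint table (the second identity follows the same pattern with the two axes swapped):

```python
import numpy as np

# Hypothetical joint distribution p(x_j, y_k); rows index x, columns y.
p_xy = np.array([[0.5,   0.25],
                 [0.125, 0.125]])
mask = p_xy > 0

H_xy = -np.sum(p_xy[mask] * np.log2(p_xy[mask]))          # H(X,Y)

p_x = p_xy.sum(axis=1)
H_x = -np.sum(p_x * np.log2(p_x))                         # H(X)

p_y_given_x = p_xy / p_x[:, None]                         # p(y_k | x_j)
H_y_given_x = -np.sum(p_xy[mask] * np.log2(p_y_given_x[mask]))  # H(Y|X)

print(H_xy, H_x + H_y_given_x)  # both sides equal 1.75 bits
```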
Relative Entropy (KL Divergence)
Relative entropy characterizes how far apart two distributions are; in statistics it is also called the KL divergence. Note that it is not symmetric in its two arguments, so it is not a true distance metric.
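For two distributions $p$ and $q$ over the same symbol set, the standard definition (in bits) is

$$D(p\|q) = \sum_{i} p(x_i)\log_2\cfrac{p(x_i)}{q(x_i)}$$

Comparing with the last line of the mutual information derivation above, $I(\mathcal{X,Y})$ is exactly the relative entropy between the joint distribution $p(x_j, y_k)$ and the product of the marginals $p(x_j)\,p(y_k)$.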