Self-Information: $I(x) = \log \frac{1}{P(x)}$
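For example, with a base-2 logarithm a fair coin flip carries $I(\text{heads}) = \log_2 \frac{1}{1/2} = 1$ bit, while an outcome of probability $1/8$ carries $\log_2 8 = 3$ bits.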
Entropy: the average information,
$H(X) = E[I(X)] = E\left(\log \frac{1}{P(X)}\right) = \sum_{x \in X} P(x)\log \frac{1}{P(x)}$
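A minimal sketch of this definition in Python (the helper name and example distributions are mine, and base-2 logs are assumed so entropy comes out in bits):

```python
import math

def entropy(p):
    """H(X) = sum over x of p(x) * log2(1/p(x)); zero-probability terms contribute nothing."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits
```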
Relative entropy:
$D(P \| Q) = \sum_{x \in X} P(x)\log \frac{P(x)}{Q(x)}$
KL-divergence: $D(P \| Q)$ is just another name for relative entropy.
Cross entropy:
$H(P \| Q) = \sum_{x \in X} P(x)\log \frac{1}{Q(x)}$
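Expanding $\log \frac{1}{Q(x)}$ as $\log \frac{1}{P(x)} + \log \frac{P(x)}{Q(x)}$ makes the relation to relative entropy explicit:
$H(P \| Q) = \sum_{x \in X} P(x)\log \frac{1}{P(x)} + \sum_{x \in X} P(x)\log \frac{P(x)}{Q(x)} = H(P) + D(P \| Q)$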
Minimizing relative entropy is equivalent to minimizing cross entropy: for a fixed $P$ they differ only by $H(P)$, which does not depend on $Q$.
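A quick numerical check of that claim, with arbitrary example distributions (all names here are illustrative; base-2 logs again):

```python
import math

def entropy(p):  # same helper as in the entropy sketch above
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P||Q) = sum over x of P(x) * log2(1/Q(x))."""
    return sum(px * math.log2(1.0 / qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(P||Q) = sum over x of P(x) * log2(P(x)/Q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]
# Cross entropy = entropy + KL divergence; H(P) is constant in Q,
# so minimizing H(P||Q) over Q also minimizes D(P||Q).
print(cross_entropy(P, Q))               # ~1.28
print(entropy(P) + kl_divergence(P, Q))  # same value
```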
Conditional Entropy: how much uncertainty is left in one variable given knowledge of the other.
First recall: Conditional Expectation
$E(Y|X=x)$ is a fixed number.
$E(Y|X)$ is a random variable (a function of $X$).
$H(Y|X=x)$ is analogous to the conditional expectation evaluated at a particular value $x$.
$H(Y|X)$ is a little different from the conditional expectation above: here we take a further expectation over $X$, so that:
$$
\begin{aligned}
H(Y|X) &= \sum_{x \in X} p(x)\, H(Y|X=x) \\
&= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x)\log p(y|x) \\
&= -\sum_{x \in X}\sum_{y \in Y} p(x)\, p(y|x)\log p(y|x) \\
&= -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(y|x) \\
&= -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log \frac{p(x,y)}{p(x)} \\
&= \sum_{x \in X}\sum_{y \in Y} p(x,y)\log \frac{p(x)}{p(x,y)}
\end{aligned}
$$
Relation Between Joint and Conditional Entropy
$H(X, Y) = H(X) + H(Y|X)$
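A small sketch verifying this chain rule on a made-up joint table $p(x, y)$ (the table and helper names are mine):

```python
import math

# Hypothetical joint distribution p(x, y) as a nested dict: joint[x][y]
joint = {
    'a': {'0': 0.3, '1': 0.1},
    'b': {'0': 0.2, '1': 0.4},
}

def joint_entropy(p):
    """H(X, Y) = -sum over (x, y) of p(x,y) * log2 p(x,y)."""
    return -sum(pxy * math.log2(pxy)
                for row in p.values() for pxy in row.values() if pxy > 0)

def marginal_entropy_x(p):
    """H(X), with p(x) = sum over y of p(x,y)."""
    return -sum(sum(row.values()) * math.log2(sum(row.values())) for row in p.values())

def conditional_entropy_y_given_x(p):
    """H(Y|X) = -sum over (x, y) of p(x,y) * log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
    total = 0.0
    for row in p.values():
        px = sum(row.values())
        for pxy in row.values():
            if pxy > 0:
                total -= pxy * math.log2(pxy / px)
    return total

# Chain rule: H(X, Y) = H(X) + H(Y|X)
print(joint_entropy(joint))                                           # ~1.846
print(marginal_entropy_x(joint) + conditional_entropy_y_given_x(joint))  # same value
```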
Mutual Information
$I(X;Y) = I(Y;X)$ by symmetry.
$I(X;Y) = H(X) - H(X|Y)$: the reduction in uncertainty about $X$ due to knowledge of $Y$.
$= H(Y) - H(Y|X)$: the reduction in uncertainty about $Y$ due to knowledge of $X$.
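Using the same made-up joint table, a self-contained sketch that computes both forms; the conditional entropies come from the chain rule above, $H(Y|X) = H(X,Y) - H(X)$ and, with the roles swapped, $H(X|Y) = H(X,Y) - H(Y)$:

```python
import math

# Same hypothetical joint distribution p(x, y), flattened to (x, y) -> probability
joint = {('a', '0'): 0.3, ('a', '1'): 0.1, ('b', '0'): 0.2, ('b', '1'): 0.4}

# Marginals p(x) and p(y)
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

def entropy(dist):
    """Entropy of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

H_xy = entropy(joint)
I_xy = entropy(px) - (H_xy - entropy(py))   # H(X) - H(X|Y)
I_yx = entropy(py) - (H_xy - entropy(px))   # H(Y) - H(Y|X)
print(I_xy, I_yx)  # both ~0.125, identical by symmetry
```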