Chapter 1.6 : Information Theory
PRML, Oxford University Deep Learning Course, Machine Learning, Pattern Recognition
Christopher M. Bishop, PRML, Chapter 1 Introduction
1. Information h(x)
Given a random variable x, we ask how much information is received when we observe a specific value of this variable.
- The amount of information can be viewed as the "degree of surprise" on learning the value of x.
- information h(x):

  h(x) = −log₂ p(x)    (1.92)

  where the negative sign ensures that information is positive or zero.
- the units of h(x):
  - using logarithms to the base of 2, the units of h(x) are bits ('binary digits');
  - using logarithms to the base of e (natural logarithms), the units of h(x) are nats.
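As a quick numerical check of (1.92), here is a minimal Python sketch (not part of the book); base 2 gives bits, base e gives nats:

```python
import math

def information(p, base=2):
    """h(x) = -log p(x): low-probability (surprising) events carry more information."""
    return -math.log(p, base)

print(information(0.5))          # 1.0 bit: a fair coin flip
print(information(0.125))        # ~3.0 bits: a 1-in-8 event is more surprising
print(information(0.5, math.e))  # ~0.693 nats (= ln 2)
```

Note that information is additive for independent events: if x and y are independent, h(x, y) = −log₂ p(x)p(y) = h(x) + h(y).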
2. Entropy H(x): average amount of information
2.1 Entropy H(x)
First, we interpret entropy as the average amount of information needed to specify the state of a random variable.
Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x) and is given as
- discrete entropy for a discrete random variable:

  H[x] = −∑_x p(x) log₂ p(x)    (1.93)

- or differential/continuous entropy for a continuous random variable:

  H[x] = −∫ p(x) ln p(x) dx    (1.104)

- Note that lim_{p→0} p ln p = 0, so we take p(x) ln p(x) = 0 whenever we encounter a value of x for which p(x) = 0.
- A nonuniform distribution has a smaller entropy than a uniform one.
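The uniform-versus-nonuniform comparison can be checked numerically. The sketch below (my Python, not from the book) uses Bishop's eight-state example: the uniform distribution gives 3 bits, the sharply peaked one only 2 bits:

```python
import math

def entropy(probs):
    """H[x] = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0 (eq. 1.93)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1/8] * 8                                         # 8 equally likely states
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]  # same 8 states, peaked

print(entropy(uniform))     # 3.0 bits
print(entropy(nonuniform))  # 2.0 bits: sharper distribution, lower entropy
```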
2.2 Noiseless coding theorem (Shannon, 1948)
The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
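Bishop illustrates this bound with an eight-state distribution and a prefix-free code whose average length exactly attains the 2-bit entropy. A Python sketch of that check (the code table follows the book's example):

```python
import math

# An 8-state variable and a prefix-free binary code for its states.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ['0', '10', '110', '1110', '111100', '111101', '111110', '111111']

H = -sum(p * math.log2(p) for p in probs)                # entropy: the lower bound
avg_len = sum(p * len(c) for p, c in zip(probs, codes))  # expected code length

print(H, avg_len)  # both 2.0 bits: this code attains the entropy lower bound
```

Shorter codewords are assigned to the more probable states; no codeword is a prefix of another, so a stream of codewords can be decoded unambiguously.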
2.3 Alternative view of entropy H(x)
Second, we introduce the concept of entropy as it arose in physics, in the context of equilibrium thermodynamics, where it was later given a deeper interpretation as a measure of disorder through developments in statistical mechanics.
Consider a set of N identical objects that are to be divided amongst a set of bins, such that there are n_i objects in the i-th bin. Consider the number of different ways of allocating the objects to the bins.
- There are N ways to choose the first object, (N − 1) ways to choose the second object, and so on, leading to a total of N! ways to allocate all N objects to the bins.
- However, we don't wish to distinguish between rearrangements of objects within each bin. In the i-th bin there are n_i! ways of reordering the objects, and so the total number of ways of allocating the N objects to the bins, called the multiplicity, is given by

  W = N! / ∏_i n_i!    (1.94)
- The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant

  H = (1/N) ln W = (1/N) ln N! − (1/N) ∑_i ln n_i!    (1.95)
- We now consider the limit N → ∞, in which the fractions n_i/N are held fixed, and apply Stirling's approximation

  ln N! ≃ N ln N − N    (1.96)
- which gives (using ∑_i n_i = N)

  H ≃ (1/N)(N ln N − N) − (1/N) ∑_i (n_i ln n_i − n_i)
    = −∑_i (n_i/N) ln(n_i/N) → −∑_i p_i ln p_i    (1.97)

  where p_i = lim_{N→∞} (n_i/N) is the probability that an object is assigned to the i-th bin.
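The convergence of (1/N) ln W to −∑_i p_i ln p_i can be checked numerically. The helper below is my own sketch (not from the book); it evaluates ln n! via the log-gamma function, ln n! = lgamma(n + 1):

```python
import math

def multiplicity_entropy(counts):
    """H = (1/N) ln W, with W = N! / prod_i n_i!  (eqs. 1.94-1.95)."""
    N = sum(counts)
    ln_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return ln_W / N

p = [0.5, 0.3, 0.2]
target = -sum(pi * math.log(pi) for pi in p)  # ~1.0297 nats

for N in (10, 1000, 100000):
    counts = [round(N * pi) for pi in p]
    print(N, multiplicity_entropy(counts))
# the printed values approach the target entropy as N grows
```

Putting all objects into a single bin gives W = 1 and hence H = 0, the minimum; spreading them evenly maximizes the multiplicity and hence the entropy, matching the "measure of disorder" interpretation.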