Chapter 1.6 : Information Theory
PRML, Oxford University Deep Learning Course, Machine Learning, Pattern Recognition
Christopher M. Bishop, PRML, Chapter 1 Introduction
1. Information h(x)
Given a random variable x, we ask how much information is received when we observe a specific value of this variable.
- The amount of information can be viewed as the "degree of surprise" on learning the value of x.
- information h(x):

  h(x) = −log₂ p(x)    (1.92)

  where the negative sign ensures that information is positive or zero.
- the units of h(x):
  - using logarithms to the base of 2, the units of h(x) are bits ('binary digits');
  - using logarithms to the base of e (natural logarithms), the units of h(x) are nats.
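As a quick numerical check of (1.92), here is a minimal Python sketch (not part of the book); base 2 gives bits, base e gives nats:

```python
import math

def information(p, base=2):
    """h(x) = -log p(x): low-probability (surprising) events carry more information."""
    return -math.log(p, base)

print(information(0.5))          # 1.0 bit: a fair coin flip
print(information(0.125))        # ~3.0 bits: a 1-in-8 event is more surprising
print(information(0.5, math.e))  # ~0.693 nats (= ln 2)
```

Note that information is additive for independent events: if x and y are independent, h(x, y) = −log₂ p(x)p(y) = h(x) + h(y).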
2. Entropy H(x): average amount of information
2.1 Entropy H(x)
First, we interpret entropy as the average amount of information needed to specify the state of a random variable.
Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x) and is given as
- discrete entropy for a discrete random variable:

  H[x] = −∑_x p(x) log₂ p(x)    (1.93)

- or differential/continuous entropy for a continuous random variable:

  H[x] = −∫ p(x) ln p(x) dx    (1.104)

- Note that lim_{p→0} p ln p = 0, so we take p(x) ln p(x) = 0 whenever we encounter a value of x for which p(x) = 0.
- A nonuniform distribution has a smaller entropy than a uniform one.
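The uniform-versus-nonuniform comparison can be checked numerically. The sketch below (my Python, not from the book) uses Bishop's eight-state example: the uniform distribution gives 3 bits, the sharply peaked one only 2 bits:

```python
import math

def entropy(probs):
    """H[x] = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0 (eq. 1.93)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1/8] * 8                                         # 8 equally likely states
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]  # same 8 states, peaked

print(entropy(uniform))     # 3.0 bits
print(entropy(nonuniform))  # 2.0 bits: sharper distribution, lower entropy
```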
2.2 Noiseless coding theorem (Shannon, 1948)
The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
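Bishop illustrates this bound with an eight-state distribution and a prefix-free code whose average length exactly attains the 2-bit entropy. A Python sketch of that check (the code table follows the book's example):

```python
import math

# An 8-state variable and a prefix-free binary code for its states.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ['0', '10', '110', '1110', '111100', '111101', '111110', '111111']

H = -sum(p * math.log2(p) for p in probs)                # entropy: the lower bound
avg_len = sum(p * len(c) for p, c in zip(probs, codes))  # expected code length

print(H, avg_len)  # both 2.0 bits: this code attains the entropy lower bound
```

Shorter codewords are assigned to the more probable states; no codeword is a prefix of another, so a stream of codewords can be decoded unambiguously.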
2.3 Alternative view of entropy H(x)
Second, we introduce the concept of entropy as it arose in physics, in the context of equilibrium thermodynamics, where it was later given a deeper interpretation as a measure of disorder through developments in statistical mechanics.
Consider a set of N identical objects that are to be divided amongst a set of bins, such that there are n_i objects in the i-th bin. Consider the number of different ways of allocating the objects to the bins.
- There are N ways to choose the first object, (N − 1) ways to choose the second object, and so on, leading to a total of N! ways to allocate all N objects to the bins.
- However, we don't wish to distinguish between rearrangements of objects within each bin. In the i-th bin there are n_i! ways of reordering the objects, and so the total number of ways of allocating the N objects to the bins, called the multiplicity, is given by

  W = N! / ∏_i n_i!    (1.94)
- The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant

  H = (1/N) ln W = (1/N) ln N! − (1/N) ∑_i ln n_i!    (1.95)
- We now consider the limit N → ∞, in which the fractions n_i/N are held fixed, and apply Stirling's approximation

  ln N! ≃ N ln N − N    (1.96)
- which gives (using ∑_i n_i = N)

  H ≃ (1/N)(N ln N − N) − (1/N) ∑_i (n_i ln n_i − n_i)
    = −∑_i (n_i/N) ln(n_i/N) → −∑_i p_i ln p_i    (1.97)

  where p_i = lim_{N→∞} (n_i/N) is the probability that an object is assigned to the i-th bin.
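The convergence of (1/N) ln W to −∑_i p_i ln p_i can be checked numerically. The helper below is my own sketch (not from the book); it evaluates ln n! via the log-gamma function, ln n! = lgamma(n + 1):

```python
import math

def multiplicity_entropy(counts):
    """H = (1/N) ln W, with W = N! / prod_i n_i!  (eqs. 1.94-1.95)."""
    N = sum(counts)
    ln_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return ln_W / N

p = [0.5, 0.3, 0.2]
target = -sum(pi * math.log(pi) for pi in p)  # ~1.0297 nats

for N in (10, 1000, 100000):
    counts = [round(N * pi) for pi in p]
    print(N, multiplicity_entropy(counts))
# the printed values approach the target entropy as N grows
```

Putting all objects into a single bin gives W = 1 and hence H = 0, the minimum; spreading them evenly maximizes the multiplicity and hence the entropy, matching the "measure of disorder" interpretation.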