Preface
If you find any errata or have a good idea, please contact me at tongust@163.com.
Ex. 8
Ex. 8.1
First of all, we need the non-negativity of the Kullback-Leibler divergence. Here I give a brief derivation.

To prove it, we use Jensen's inequality:

$$f\big(\mathbb{E}[X]\big) \le \mathbb{E}\big[f(X)\big]$$

The constraint the formula must satisfy is that $f(x)$ be a convex function on a convex set.

In this case, we pick the simple convex function $f(x) = -\ln(x)$, which has the property:

$$-\ln\big(\mathbb{E}[X]\big) \le \mathbb{E}\big[-\ln(X)\big]$$

Substitute $X = q(x)/p(x)$ with $x \sim p(x)$:

$$\int p(x)\ln\frac{p(x)}{q(x)}\,dx \;\ge\; -\ln\left(\int p(x)\,\frac{q(x)}{p(x)}\,dx\right) \;=\; -\ln\left(\int q(x)\,dx\right)$$

As we know, $q(x)$ is a distribution function, so it is obvious that $\int q(x)\,dx = 1$ and the right-hand side equals $-\ln(1) = 0$.

Therefore, we get the non-negativity of the KL divergence:

$$\mathrm{KL}(p\,\|\,q) = \int p(x)\ln\frac{p(x)}{q(x)}\,dx \;\ge\; 0,$$

with equality if and only if $p = q$.
Now note that $\mathbb{E}_q[\ln(r(Y)/q(Y))] = -\mathrm{KL}(q\,\|\,r) \le 0$, with equality when $r = q$; that is, (8.61) is maximized as a function of $r(y)$ when $r(y) = q(y)$. Apply this with $q(y) = \Pr(Z^m \mid Z, \theta)$ and $r(y) = \Pr(Z^m \mid Z, \theta')$, where from (8.46)

$$R(\theta', \theta) = \mathbb{E}\big[\ln \Pr(Z^m \mid Z, \theta') \,\big|\, Z, \theta\big].$$

Then

$$R(\theta', \theta) - R(\theta, \theta) = \mathbb{E}_q\!\left[\ln\frac{r(Y)}{q(Y)}\right] = -\mathrm{KL}(q\,\|\,r) \le 0.$$

Hence $R(\theta', \theta) - R(\theta, \theta) \le 0$.
◻
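As a quick sanity check, here is a minimal Python sketch (the distributions and helper names are my own, not from ESL or the exercise): it draws random discrete distributions, verifies $\mathrm{KL}(p\,\|\,q) \ge 0$, and confirms that $\mathbb{E}_q[\ln r(Y)]$ over random candidates $r$ is largest at $r = q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(n):
    """Draw a random discrete distribution on n points (full support)."""
    w = rng.random(n)
    return w / w.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

n = 5
q = random_dist(n)

# KL divergence is non-negative for random pairs (p, q).
assert all(kl(random_dist(n), random_dist(n)) >= -1e-12 for _ in range(1000))

# E_q[ln r(Y)] = E_q[ln q(Y)] - KL(q||r) is maximized at r = q.
best = np.sum(q * np.log(q))                                  # value at r = q
others = [np.sum(q * np.log(random_dist(n))) for _ in range(1000)]
assert best >= max(others)
print("KL >= 0 and E_q[ln r(Y)] is maximized at r = q: checks passed")
```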
Ex. 8.2
Off topic
Since this exercise is based on a paper [1], I would say that our lovely authors of ESL simply overestimated poor readers like me by assuming we are all excellent mathematicians. ;)
Proof
I will use the notation of [1] instead of that of ESL, which is a bit confusing to me. (Denote $Z^m$ as $y$.)
We want to prove: for a fixed value of $\theta$, the function

$$F(P, \theta) = \mathbb{E}_P\big[\ln P(y, z \mid \theta)\big] + H(P) = \sum_y P(y)\ln\frac{P(y, z \mid \theta)}{P(y)}$$

is maximized over distributions $P(y)$ by $P(y) = P(y \mid z, \theta)$.
Following the hint, we use a Lagrange multiplier to enforce the constraint $\sum_y P(y) = 1$ and rewrite the objective as the Lagrangian:

$$L(P(y), \lambda) = \sum_y P(y)\ln\frac{P(y, z \mid \theta)}{P(y)} + \lambda\Big(\sum_y P(y) - 1\Big)$$
To find the stationary points, we set the gradient of $L(P(y), \lambda)$ with respect to (w.r.t.) each $P(y_i)$, $i = 1, 2, \ldots, n$, to zero:

$$\frac{\partial L}{\partial P(y_i)} = \ln P(y_i, z \mid \theta) - \ln P(y_i) - 1 + \lambda = 0$$

Simplifying:

$$P(y_i) = P(y_i, z \mid \theta)\, e^{\lambda - 1} \tag{4}$$
From (4), it follows that $P(y)$ must be proportional to $P_\theta(z, y) = P(y, z \mid \theta)$. We also know that $\sum_y P(y) = 1$.
Summing (4) over the $y_i$, we see:

$$1 = \sum_i P(y_i) = e^{\lambda - 1}\sum_i P(y_i, z \mid \theta) = e^{\lambda - 1}\, P(z \mid \theta), \qquad\text{so}\qquad e^{\lambda - 1} = \frac{1}{P(z \mid \theta)} \tag{7}$$
Substituting (7) into (4):

$$P(y_i) = \frac{P(y_i, z \mid \theta)}{P(z \mid \theta)} = P(y_i \mid z, \theta)$$

So for fixed $\theta$, $F$ is maximized when $P(y)$ is the conditional distribution of the unobserved data given the observed data.
◻
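A small numerical check of this result, using the Neal–Hinton form $F(P, \theta) = \mathbb{E}_P[\ln P(y, z \mid \theta)] + H(P)$; the joint table below is made up for illustration, not taken from the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up slice of a joint distribution: P(y, z_obs | theta) over 4 values
# of the unobserved y, with the observed data z fixed at z_obs. It need not
# sum to 1 over y (its sum is P(z_obs | theta)).
p_joint = rng.random(4) / 10.0

def F(P):
    """Neal-Hinton free energy: E_P[ln P(y, z|theta)] + entropy of P."""
    return float(np.sum(P * np.log(p_joint)) - np.sum(P * np.log(P)))

def random_dist(n):
    w = rng.random(n)
    return w / w.sum()

# The claimed maximizer: the conditional P(y | z_obs, theta).
posterior = p_joint / p_joint.sum()

assert all(F(posterior) >= F(random_dist(4)) - 1e-12 for _ in range(2000))
print("F(P, theta) is maximized at P(y) = P(y | z, theta): check passed")
```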
Ex. 8.3
Ex. 8.4
Ex. 8.5
Ex. 8.6
Ex. 8.7
Proof that f(x) is non-decreasing under the update (8.63)
From (8.62), $g(x, y) \le f(x)$ for all $x, y$ and $g(x, x) = f(x)$. Since $x^{s+1} = \arg\max_x g(x, x^s)$ by (8.63), we have

$$f(x^{s+1}) \;\ge\; g(x^{s+1}, x^s) \;\ge\; g(x^s, x^s) \;=\; f(x^s)$$
◻
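To make the argument concrete, here is a minimal sketch of a minorization iteration with a minorizer of my own choosing (not from the book): since $f(x) = \sin(x)$ has $f''(x) \ge -1$, the quadratic $g(x, y) = f(y) + \cos(y)(x - y) - \tfrac{1}{2}(x - y)^2$ satisfies (8.62), and the update (8.63) becomes $x^{s+1} = x^s + \cos(x^s)$.

```python
import math

def f(x):
    return math.sin(x)

def mm_update(x):
    """argmax over x' of the quadratic minorizer
    g(x', x) = f(x) + cos(x)(x' - x) - (x' - x)^2 / 2."""
    return x + math.cos(x)

x = -1.0
values = [f(x)]
for _ in range(20):
    x = mm_update(x)
    values.append(f(x))

# f is non-decreasing along the minorization iterates, as proved above.
assert all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
print(f"converged to x = {x:.6f}, f(x) = {values[-1]:.6f}")  # x -> pi/2, f -> 1
```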
Proof that the EM algorithm (Sec. 8.5.2) is an example of a minorization algorithm
This exercise asks us to show the following: the function

$$g(\theta', \theta) = Q(\theta', \theta) + \ell(\theta; Z) - Q(\theta, \theta)$$

minorizes the observed-data log-likelihood, i.e.

$$g(\theta', \theta) \le \ell(\theta'; Z) \tag{2}$$

for all $\theta'$, with equality when $\theta' = \theta$.
On one hand, from (8.46), we have:

$$\ell(\theta; Z) = Q(\theta, \theta) - R(\theta, \theta)$$
Hence, the left-hand side (l.h.s.) of (2) can be simplified as:

$$g(\theta', \theta) = Q(\theta', \theta) + \ell(\theta; Z) - Q(\theta, \theta) = Q(\theta', \theta) - R(\theta, \theta) \tag{4}$$
On the other hand, also from (8.46), the right-hand side (r.h.s.) of (2) can be written as:

$$\ell(\theta'; Z) = Q(\theta', \theta) - R(\theta', \theta) \tag{5}$$
From Ex. 8.1, we see that $R(\theta', \theta) \le R(\theta, \theta)$.
Therefore (4) $\le$ (5), i.e. $g(\theta', \theta) \le \ell(\theta'; Z)$, with equality when $\theta' = \theta$. Moreover, since $\ell(\theta; Z) - Q(\theta, \theta)$ does not depend on $\theta'$, maximizing $g(\theta', \theta)$ over $\theta'$ is equivalent to maximizing $Q(\theta', \theta)$, so the M-step is exactly the minorization update (8.63). This finishes our demonstration.
◻
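A direct consequence is that the observed-data log-likelihood can never decrease across EM iterations. The following sketch checks this on a toy problem of my own construction (a two-component Gaussian mixture with unit variances, fitting the two means and the mixing weight):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a two-component mixture with unit variances.
z = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

def normal_pdf(z, mu):
    return np.exp(-0.5 * (z - mu) ** 2) / np.sqrt(2.0 * np.pi)

def log_lik(z, pi, mu1, mu2):
    """Observed-data log-likelihood l(theta; Z)."""
    return float(np.sum(np.log(pi * normal_pdf(z, mu1) + (1 - pi) * normal_pdf(z, mu2))))

pi, mu1, mu2 = 0.5, -1.0, 1.0        # crude starting values
trace = [log_lik(z, pi, mu1, mu2)]
for _ in range(50):
    # E-step: responsibilities P(component 1 | z_i, theta).
    a = pi * normal_pdf(z, mu1)
    b = (1 - pi) * normal_pdf(z, mu2)
    r = a / (a + b)
    # M-step: maximize Q(theta', theta) -- equivalently the minorizer g.
    pi = float(r.mean())
    mu1 = float(np.sum(r * z) / np.sum(r))
    mu2 = float(np.sum((1 - r) * z) / np.sum(1 - r))
    trace.append(log_lik(z, pi, mu1, mu2))

# Minorization guarantees the log-likelihood never decreases.
assert all(b >= a - 1e-7 for a, b in zip(trace, trace[1:]))
print(f"pi = {pi:.3f}, mu1 = {mu1:.3f}, mu2 = {mu2:.3f}")
```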
References
[1] Neal, Radford M., and Geoffrey E. Hinton. "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants." In Learning in Graphical Models, Springer Netherlands, 1998, pp. 355–368.
Ex. 15 Random Forests
Ex. 15.1
For bagging we have $B$ trees. These trees are identically distributed, but not i.i.d. (independent and identically distributed): the trees are correlated, with pairwise correlation

$$\rho = \frac{\mathrm{Cov}(x_i, x_j)}{\sigma^2}.$$

Because the $x_i$ are identically distributed, they share the same variance $\sigma^2$.
The variance of the average of the $B$ trees is

$$\mathrm{Var}\Big(\frac{1}{B}\sum_{i=1}^{B} x_i\Big) = \frac{1}{B^2}\Big(\sum_i \mathrm{Var}(x_i) + \sum_{i \ne j} \mathrm{Cov}(x_i, x_j)\Big) = \frac{1}{B^2}\big(B\sigma^2 + B(B-1)\rho\sigma^2\big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.$$

1. $\rho = 0$: the trees are uncorrelated and the formula reduces to $\sigma^2 / B$, which vanishes as $B \to \infty$, just as for an i.i.d. average.

2. $\rho > 0$: as $B \to \infty$ the second term vanishes but the first does not, so the variance only decreases to $\rho\sigma^2$; the correlation between trees limits the benefit of averaging.
◻
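A quick simulation of this formula (the equicorrelated construction below is my own choice, not from ESL): draw $B$ variables sharing a common factor so that their pairwise correlation is exactly $\rho$, and compare the empirical variance of their average with $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)

def avg_variance(rho, sigma, B, n_rep=100_000):
    """Empirical variance of the mean of B equicorrelated variables.

    x_i = sqrt(rho)*g + sqrt(1-rho)*e_i gives Var(x_i) = sigma^2 and
    Corr(x_i, x_j) = rho for i != j.
    """
    g = rng.normal(0.0, sigma, size=(n_rep, 1))   # shared factor
    e = rng.normal(0.0, sigma, size=(n_rep, B))   # idiosyncratic noise
    x = np.sqrt(rho) * g + np.sqrt(1.0 - rho) * e
    return float(x.mean(axis=1).var())

sigma, B = 2.0, 50
for rho in [0.0, 0.3, 0.8]:
    theory = rho * sigma**2 + (1.0 - rho) / B * sigma**2
    print(f"rho={rho:.1f}: simulated {avg_variance(rho, sigma, B):.4f}, "
          f"formula {theory:.4f}")
```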