Solutions to The Elements of Statistical Learning


Preface

If you find any errata or have a good idea, please contact me via tongust@163.com.

Ex. 8

Ex 8.1

First of all, we need the Kullback–Leibler divergence; here I give a brief derivation.
To prove that the KL divergence is nonnegative, we use Jensen's inequality: for a function $f$ that is convex on a convex set,

$$\int p(x)\,f(x)\,dx \ge f\!\left(\int p(x)\,x\,dx\right) \qquad (1)$$

In this case, we construct a simple convex function $f(x) = -\ln(x)$, which gives the following property:

$$-\int p(x)\ln(x)\,dx \ge -\ln\!\left(\int p(x)\,x\,dx\right) \qquad (2)$$

Substitute $x = \frac{q(x)}{p(x)}$ into (2):

$$-\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx \ge -\ln\!\left(\int p(x)\,\frac{q(x)}{p(x)}\,dx\right) \qquad (3)$$

$$-\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx \ge -\ln\!\left(\int q(x)\,dx\right) \qquad (4)$$

Since $q(x)$ is a distribution function, it is obvious that $\int q(x)\,dx = 1$, so the right-hand side of (4) is $-\ln 1 = 0$.
Therefore, we get the nonnegativity of the KL divergence:

$$D_{KL}(p\,\|\,q) = \int p(x)\ln\!\left(\frac{p(x)}{q(x)}\right)dx \ge 0 \qquad (5)$$

We have shown that (8.61) is maximized as a function of $r(y)$ when $r(y) = q(y)$; this is exactly the KL inequality (5). Now $R(\theta^*, \theta) = E\big[\ln \Pr(Z^m \mid Z, \theta^*) \,\big|\, Z, \theta\big]$, so taking $p(y) = \Pr(Z^m \mid Z, \theta)$ and $q(y) = \Pr(Z^m \mid Z, \theta^*)$ in (5) gives

$$R(\theta, \theta) - R(\theta^*, \theta) = D_{KL}\big(\Pr(Z^m \mid Z, \theta)\,\big\|\,\Pr(Z^m \mid Z, \theta^*)\big) \ge 0,$$

i.e. $R(\theta^*, \theta) \le R(\theta, \theta)$: as a function of its first argument, $R$ is maximized at $\theta^* = \theta$.
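The nonnegativity in (5) is easy to sanity-check numerically. Below is a minimal sketch for discrete distributions (NumPy assumed available; the helper name `kl` is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Discrete form of D_KL(p || q) from eq. (5)."""
    return float(np.sum(p * np.log(p / q)))

# Random strictly positive distributions on 10 points.
for _ in range(1000):
    p = rng.random(10) + 1e-3; p /= p.sum()
    q = rng.random(10) + 1e-3; q /= q.sum()
    assert kl(p, q) >= 0.0      # nonnegativity, eq. (5)

# Equality holds iff p == q (up to floating point).
p = rng.random(10); p /= p.sum()
assert abs(kl(p, p)) < 1e-12
```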

Ex. 8.2

Off topic

Since this exercise is based on a paper [1], I would say that our lovely authors of ESL overestimated poor readers like me, expecting us to be excellent mathematicians. ;)

Proof

I will use the notation of [1] instead of ESL's, which I find a bit confusing. (Denote $Z^m$ by $y$.)
We want to prove: for a fixed value $\theta$, there is a unique distribution $\tilde P_\theta$, given by $\tilde P_\theta(y) = P(y \mid z, \theta)$, which maximizes the log-likelihood (8.48).
From the hint, we use a Lagrange multiplier and rewrite the objective:

$$L(P(y), \lambda) = \sum_{i=1}^{n} P(y_i)\ln P(z, y_i \mid \theta) \;-\; \sum_{i=1}^{n} P(y_i)\ln P(y_i) \;+\; \lambda\Big(1 - \sum_{i=1}^{n} P(y_i)\Big) \qquad (1)$$

To find the stationary points, we set the gradient of $L(P(y), \lambda)$ with respect to (W.R.T.) each $P(y_i)$, $i = 1, 2, \dots, n$, to zero:

$$\frac{\partial L}{\partial P(y_i)} = \ln P(z, y_i \mid \theta) - \ln P(y_i) - 1 - \lambda = 0 \qquad (2)$$

To simplify:

$$\ln P(y_i) = \ln P(z, y_i \mid \theta) - 1 - \lambda \qquad (3)$$

$$P(y_i) = \exp(-1 - \lambda)\,P(z, y_i \mid \theta), \qquad i = 1, 2, \dots, n \qquad (4)$$

From (4), it follows that $P(y)$ must be proportional to $P(z, y \mid \theta)$. We also know that $\sum_y P(y) = 1$.
Summing (4) over $y$, we see:

$$1 = \sum_y P(y) = \exp(-1 - \lambda) \sum_y P(z, y \mid \theta) \qquad (5)$$

$$\sum_y P(z, y \mid \theta) = P(z \mid \theta) = \frac{1}{\exp(-1 - \lambda)} \qquad (6)$$

$$\exp(-1 - \lambda) = \frac{1}{P(z \mid \theta)} \qquad (7)$$

Substitute (7) into (4):

$$P(y_i) = \frac{P(z, y_i \mid \theta)}{P(z \mid \theta)} = P(y_i \mid z, \theta) \qquad (8)$$
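The conclusion (8) can be checked numerically: for a toy joint $P(z, y_i \mid \theta)$, the posterior should beat every other distribution under the objective (8.48), and attain the value $\ln P(z \mid \theta)$. A minimal sketch (the names `joint`, `posterior`, and `F` are my own):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint P(z, y_i | theta) over n hidden values y_i for one observed z.
n = 6
joint = rng.random(n) + 1e-3          # P(z, y_i | theta); need not sum to 1
posterior = joint / joint.sum()       # P(y_i | z, theta), eq. (8)

def F(P):
    """Objective (8.48): E_P[ln P(z, y | theta)] - E_P[ln P(y)]."""
    return float(np.sum(P * np.log(joint)) - np.sum(P * np.log(P)))

# F at the posterior equals ln P(z|theta)...
assert np.isclose(F(posterior), np.log(joint.sum()))
# ...and no other distribution does better.
for _ in range(1000):
    P = rng.random(n) + 1e-3; P /= P.sum()
    assert F(P) <= F(posterior) + 1e-12
```

This works because $F(P) = \ln P(z \mid \theta) - D_{KL}(P \,\|\, P(\cdot \mid z, \theta))$, so the KL result of Ex. 8.1 makes the posterior the unique maximizer.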

Ex. 8.3

Ex. 8.4

Ex. 8.5

Ex. 8.6

Ex. 8.7

Proof that f(x) is non-decreasing under update (8.63)

From (8.62), $g(x, x^s) \le f(x)$ with equality at $x = x^s$, and (8.63) chooses $x^{s+1}$ to maximize $g(\cdot, x^s)$. Hence

$$f(x^{s+1}) \ge g(x^{s+1}, x^s) \ge g(x^s, x^s) = f(x^s) \qquad (1)$$

where the first inequality holds because $g$ minorizes $f$, and the second because $x^{s+1}$ maximizes $g(\cdot, x^s)$.

Proof that the EM algorithm (Sec. 8.5.2) is an example of an MM algorithm

This exercise requires us to show the following:

$$Q(\theta', \theta) + \log \Pr(Z \mid \theta) - Q(\theta, \theta) \;\le\; \log \Pr(Z \mid \theta') \qquad (2)$$

i.e. the left-hand side minorizes the observed-data log-likelihood, with equality at $\theta' = \theta$.

On one hand, from (8.46) we have:

$$\log \Pr(Z \mid \theta) = Q(\theta, \theta) - R(\theta, \theta) \qquad (3)$$

Hence, the left-hand side (l.h.s.) of (2) can be simplified as:

$$Q(\theta', \theta) + Q(\theta, \theta) - R(\theta, \theta) - Q(\theta, \theta) = Q(\theta', \theta) - R(\theta, \theta) \qquad (4)$$

On the other hand, also from (8.46), the r.h.s. of (2) can be written as:

$$\log \Pr(Z \mid \theta') = Q(\theta', \theta) - R(\theta', \theta) \qquad (5)$$

From Ex. 8.1, we see:

$$R(\theta', \theta) \le R(\theta, \theta) \qquad (6)$$

$$-R(\theta, \theta) \le -R(\theta', \theta) \qquad (7)$$

$$Q(\theta', \theta) - R(\theta, \theta) \le Q(\theta', \theta) - R(\theta', \theta) \qquad (8)$$

Combining (4), (5) and (8) gives (2), which completes the proof.
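The monotonicity that this minorization guarantees can be observed on a toy EM run. Below is a sketch for a two-component Gaussian mixture with unit variances (the data, initialization, and helper name `loglik` are my own choices, not from ESL), asserting that the log-likelihood never decreases across iterations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data from a two-component Gaussian mixture (unit variances).
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

def loglik(pi, m1, m2):
    """Observed-data log-likelihood log Pr(Z | theta)."""
    d1 = pi * np.exp(-0.5 * (x - m1) ** 2)
    d2 = (1 - pi) * np.exp(-0.5 * (x - m2) ** 2)
    return float(np.sum(np.log((d1 + d2) / np.sqrt(2 * np.pi))))

pi, m1, m2 = 0.5, -1.0, 1.0
ll = [loglik(pi, m1, m2)]
for _ in range(50):
    # E-step: responsibilities gamma_i = Pr(component 1 | x_i).
    d1 = pi * np.exp(-0.5 * (x - m1) ** 2)
    d2 = (1 - pi) * np.exp(-0.5 * (x - m2) ** 2)
    g = d1 / (d1 + d2)
    # M-step: maximize Q(theta', theta).
    pi = g.mean()
    m1 = np.sum(g * x) / np.sum(g)
    m2 = np.sum((1 - g) * x) / np.sum(1 - g)
    ll.append(loglik(pi, m1, m2))

# MM property: the log-likelihood is non-decreasing, as in (1).
assert all(b >= a - 1e-9 for a, b in zip(ll, ll[1:]))
```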

Reference

[1] Neal, Radford M., and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models. Springer Netherlands, 2000:355-368.

Ex. 15 Random Forest

Ex 15.1

For bagging we have $B$ trees, and these trees are identically distributed but not i.i.d. (independent and identically distributed): the trees are correlated, with pairwise correlation $\rho = \frac{\mathrm{Cov}(x_i, x_j)}{\sigma^2}$. Since the $x_i$ are identically distributed, they share the same variance $\sigma^2$.

$$y = \frac{1}{B}\sum_{i=1}^{B} x_i, \qquad E[y] = E[x]$$

$$\mathrm{Var}[y] = E[y^2] - E^2[x] = \frac{1}{B^2}E\Big[\sum_{i=1}^{B} x_i^2 + \sum_{i \ne j} x_i x_j\Big] - E^2[x] = \frac{B}{B^2}E[x^2] + \frac{B^2 - B}{B^2}E_{i \ne j}[x_i x_j] - E^2[x] \qquad (1)$$

1. ρ=0

$$\rho = \frac{E\big[(x_i - E[x])(x_j - E[x])\big]}{\sigma^2} = \frac{E[x_i x_j] - E^2[x]}{\sigma^2} = 0 \;\Longrightarrow\; E_{i \ne j}[x_i x_j] = E^2[x] \qquad (2)$$

Substituting (2) into (1):

$$\mathrm{Var}[y] = \frac{1}{B^2}\Big(B\,E[x^2] + (B^2 - B)\,E^2[x] - B^2 E^2[x]\Big) = \frac{1}{B}\sigma^2$$

2. ρ>0

$$E[x_i x_j] = \rho\sigma^2 + E^2[x]$$

$$\mathrm{Var}[y] = \frac{B}{B^2}E[x^2] + \frac{B^2 - B}{B^2}\big(\rho\sigma^2 + E^2[x]\big) - E^2[x] = \frac{1}{B}\sigma^2 + \rho\sigma^2 - \frac{1}{B}\rho\sigma^2 = \rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2$$
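The result $\mathrm{Var}[y] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ can be verified by Monte Carlo. A minimal sketch using equicorrelated Gaussians (the shared-factor construction and parameter values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

B, rho, sigma2, n_trials = 25, 0.4, 1.0, 200_000

# Equicorrelated variables: x_i = sqrt(rho)*z0 + sqrt(1-rho)*z_i
# gives Var[x_i] = 1 and Corr(x_i, x_j) = rho for i != j.
z0 = rng.standard_normal((n_trials, 1))
zi = rng.standard_normal((n_trials, B))
x = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * zi

y = x.mean(axis=1)                    # the bagged average
empirical = y.var()
theoretical = rho * sigma2 + (1 - rho) / B * sigma2

assert abs(empirical - theoretical) < 0.01
```

Note that as $B \to \infty$ the variance approaches $\rho\sigma^2$, which is why random forests work to *decorrelate* the trees.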
