Stanford Machine Learning Course CS229: Notes on the Mathematical Derivations


If x is a row vector, then

x x^T = ⟨x, x⟩ = ‖x‖_2^2 = tr(x^T x)

Linear regression

If A, B, C, and D are matrices of compatible dimensions and a is a real number:

tr ABC = tr CAB = tr BCA,
tr ABCD = tr DABC = tr CDAB = tr BCDA,
tr A = tr A^T,
tr(A + B) = tr A + tr B,
tr aA = a tr A,
∇_A tr AB = B^T,
∇_{A^T} f(A) = (∇_A f(A))^T,
∇_{A^T} tr AB = B,
∇_A tr ABA^T C = CAB + C^T AB^T,
∇_{A^T} tr ABA^T C = B^T A^T C^T + BA^T C,
if C = I, then ∇_{A^T} tr ABA^T = B^T A^T + BA^T,
∇_A |A| = |A| (A^{-1})^T.

Applying these identities to the least-squares cost J(θ) = (1/2)(Xθ − y⃗)^T (Xθ − y⃗):

∇_θ J(θ) = ∇_θ (1/2)(Xθ − y⃗)^T (Xθ − y⃗)
= (1/2) ∇_θ (θ^T X^T Xθ − θ^T X^T y⃗ − y⃗^T Xθ + y⃗^T y⃗)
= (1/2) ∇_θ tr(θ^T X^T Xθ − θ^T X^T y⃗ − y⃗^T Xθ + y⃗^T y⃗)
= (1/2) ∇_θ (tr θ^T X^T Xθ − 2 tr y⃗^T Xθ)
= (1/2)(X^T Xθ + X^T Xθ − 2 X^T y⃗)
= X^T Xθ − X^T y⃗.

Setting this gradient to zero yields the normal equations X^T Xθ = X^T y⃗, hence

θ = (X^T X)^{-1} X^T y⃗.
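The closed-form solution is easy to check numerically. Below is a minimal NumPy sketch; the synthetic data, shapes, and variable names are illustrative choices of mine, not from the notes:

```python
import numpy as np

# Minimal sketch of the normal equations derived above:
# theta = (X^T X)^{-1} X^T y, on synthetic data.
rng = np.random.default_rng(0)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])  # design matrix with intercept
true_theta = np.array([1.0, 2.0, -0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

# Solve X^T X theta = X^T y directly instead of forming the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to true_theta
```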

Locally weighted linear regression

w^{(i)} = exp( −(x^{(i)} − x)^2 / (2τ^2) ),
X^T W Xθ = X^T W y⃗,
θ = (X^T W X)^{-1} X^T W y⃗
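A sketch of the weighted normal equations above in NumPy. The helper name lwr_predict and the default bandwidth τ are my own choices:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at one query point."""
    # Gaussian weights w_i = exp(-||x_i - x||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Solve X^T W X theta = X^T W y, then predict at the query point.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```

Unlike ordinary least squares, θ is re-solved for every query point, so the fit adapts locally; the caller is responsible for including an intercept column in X and x_query.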

Newton’s method:

θ := θ − H^{-1} ∇_θ ℓ(θ),  where  H_{ij} = ∂²ℓ(θ) / ∂θ_i ∂θ_j.

Fitting (locally weighted) logistic regression:
The log-likelihood function for logistic regression is

ℓ(θ) = Σ_{i=1}^m [ y^{(i)} log h(x^{(i)}) + (1 − y^{(i)}) log(1 − h(x^{(i)})) ]

For any vector z, it holds true that z^T H z ≤ 0 (so ℓ is concave). The gradient and Hessian are

∂ℓ(θ)/∂θ_k = Σ_{i=1}^m (y^{(i)} − h(x^{(i)})) x_k^{(i)},

H_{kl} = ∂²ℓ(θ)/∂θ_k ∂θ_l = −Σ_{i=1}^m (∂h(x^{(i)})/∂θ_l) x_k^{(i)} = −Σ_{i=1}^m h(x^{(i)})(1 − h(x^{(i)})) x_l^{(i)} x_k^{(i)}.

The Exponential family

We say that a class of distributions is in the exponential family if it can be written in the form

p(y; η) = b(y) exp( η^T T(y) − a(η) ).
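As a quick worked example (standard, though not spelled out in the original notes): the Bernoulli(ϕ) distribution belongs to the exponential family, since

p(y; ϕ) = ϕ^y (1 − ϕ)^{1−y} = exp( y log(ϕ/(1−ϕ)) + log(1 − ϕ) ),

which matches the form above with η = log(ϕ/(1−ϕ)), T(y) = y, a(η) = −log(1 − ϕ) = log(1 + e^η), and b(y) = 1.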

Jensen's inequality
Suppose we start with the inequality in the basic definition of a convex function
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)  for 0 ≤ θ ≤ 1.

Using induction, this can be fairly easily extended to convex combinations of more than two points,
f( Σ_{i=1}^k θ_i x_i ) ≤ Σ_{i=1}^k θ_i f(x_i)  for Σ_{i=1}^k θ_i = 1, θ_i ≥ 0 ∀i.

In fact, this can also be extended to infinite sums or integrals. In the latter case, the inequality can be written as
f( ∫ p(x) x dx ) ≤ ∫ p(x) f(x) dx  for ∫ p(x) dx = 1, p(x) ≥ 0 ∀x.

Because p(x) integrates to 1, it is common to consider it as a probability density, in which case the previous equation can be written in terms of expectations,
f(E[x]) ≤ E[f(x)]
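A quick numeric illustration with the convex function f(x) = x² (my own toy example, not from the notes):

```python
import numpy as np

# Check f(E[x]) <= E[f(x)] for f(x) = x^2 by Monte Carlo.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
print(np.mean(x) ** 2)   # f(E[x]): about 1.0
print(np.mean(x ** 2))   # E[f(x)]: about 5.0, larger as Jensen predicts
```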

Learning theory
Lemma. (The union bound). Let A_1, A_2, …, A_k be k different events (that may not be independent). Then

P(A_1 ∪ ⋯ ∪ A_k) ≤ P(A_1) + ⋯ + P(A_k)

Lemma. (Hoeffding inequality). Let Z_1, …, Z_m be m independent and identically distributed (iid) random variables drawn from a Bernoulli(ϕ) distribution. Let ϕ̂ = (1/m) Σ_{i=1}^m Z_i be the mean of these random variables, and let any γ > 0 be fixed. Then

P(|ϕ − ϕ̂| > γ) ≤ 2 exp(−2γ²m)

This lemma (which in learning theory is also called the Chernoff bound) says that if we take ϕ̂, the average of m Bernoulli(ϕ) random variables, to be our estimate of ϕ, then the probability of it being far from the true value is small, as long as m is large.
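The bound is easy to probe by simulation. The constants below (ϕ, m, γ, number of trials) are arbitrary illustrative choices:

```python
import numpy as np

# Compare the empirical tail probability P(|phi - phi_hat| > gamma)
# against the Hoeffding bound 2*exp(-2*gamma^2*m).
rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.3, 200, 0.05, 20_000
Z = rng.random((trials, m)) < phi            # Bernoulli(phi) samples
phi_hat = Z.mean(axis=1)                     # one estimate per trial
print((np.abs(phi_hat - phi) > gamma).mean())  # empirical tail probability
print(2 * np.exp(-2 * gamma ** 2 * m))         # Hoeffding upper bound
```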

Empirical risk minimization picks the hypothesis ĥ = arg min_{h∈H} ϵ̂(h).

For a hypothesis h, we define the training error (also called the empirical risk or empirical error in learning theory) to be
ϵ̂(h) = (1/m) Σ_{i=1}^m 1{h(x^{(i)}) ≠ y^{(i)}}

The generalization error is defined as ϵ(h) = P_{(x,y)∼D}(h(x) ≠ y).

With Z_j = 1{h_i(x^{(j)}) ≠ y^{(j)}}, the training error can be written as ϵ̂(h_i) = (1/m) Σ_{j=1}^m Z_j.

If uniform convergence holds (|ϵ(h) − ϵ̂(h)| ≤ γ for all h ∈ H), then

ϵ(ĥ) ≤ ϵ(h*) + 2γ,

where h* = arg min_{h∈H} ϵ(h).
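In code, the training error is just the average of the indicators Z_j; a trivial sketch (the helper name is mine):

```python
import numpy as np

def empirical_error(h, X, y):
    # eps_hat(h) = (1/m) * sum of 1{h(x_i) != y_i}
    return np.mean(h(X) != y)
```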

Theorem. Let |H| = k, and let any m, δ be fixed. Then with probability at least 1 − δ, we have that
ϵ(ĥ) ≤ (min_{h∈H} ϵ(h)) + 2 √( (1/(2m)) log(2k/δ) )

Corollary. Let |H| = k, and let any γ, δ be fixed. Then for ϵ(ĥ) ≤ min_{h∈H} ϵ(h) + 2γ to hold with probability at least 1 − δ, it suffices that
m ≥ (1/(2γ²)) log(2k/δ) = O( (1/γ²) log(k/δ) ).
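The corollary reads directly as a sample-size calculator; a direct transcription (the function name and example values are mine):

```python
import numpy as np

def sample_complexity(k, gamma, delta):
    # m >= 1/(2 gamma^2) * log(2k / delta)
    return int(np.ceil(np.log(2 * k / delta) / (2 * gamma ** 2)))

print(sample_complexity(k=10_000, gamma=0.05, delta=0.01))  # 2902
```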

Factor Analysis:

Marginals and conditionals of Gaussians.

Suppose x = [x_1; x_2] is jointly Gaussian with mean μ = [μ_1; μ_2] and covariance Σ. Then

Cov(x) = Σ = [Σ_{11} Σ_{12}; Σ_{21} Σ_{22}]
= E[(x − μ)(x − μ)^T]
= E[ [x_1 − μ_1; x_2 − μ_2] [x_1 − μ_1; x_2 − μ_2]^T ]
= E[ (x_1 − μ_1)(x_1 − μ_1)^T, (x_1 − μ_1)(x_2 − μ_2)^T ; (x_2 − μ_2)(x_1 − μ_1)^T, (x_2 − μ_2)(x_2 − μ_2)^T ].

The conditional distribution of x_1 given x_2 is Gaussian with mean and covariance

μ_{1|2} = μ_1 + Σ_{12} Σ_{22}^{-1} (x_2 − μ_2),
Σ_{1|2} = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21}.

To derive the conditionals above, we define V ∈ R^{(m+n)×(m+n)} as V = Σ^{-1} and need the lemma below:

V = [V_{AA} V_{AB}; V_{BA} V_{BB}] = Σ^{-1},

[A B; C D]^{-1} = [ M^{-1}, −M^{-1} B D^{-1} ; −D^{-1} C M^{-1}, D^{-1} + D^{-1} C M^{-1} B D^{-1} ],

where M = A − BD^{-1}C. Using this formula, it follows that

[Σ_{AA} Σ_{AB}; Σ_{BA} Σ_{BB}] = [V_{AA} V_{AB}; V_{BA} V_{BB}]^{-1}
= [ (V_{AA} − V_{AB} V_{BB}^{-1} V_{BA})^{-1}, −(V_{AA} − V_{AB} V_{BB}^{-1} V_{BA})^{-1} V_{AB} V_{BB}^{-1} ; −V_{BB}^{-1} V_{BA} (V_{AA} − V_{AB} V_{BB}^{-1} V_{BA})^{-1}, (V_{BB} − V_{BA} V_{AA}^{-1} V_{AB})^{-1} ].
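The block-inversion lemma is easy to sanity-check numerically; the matrix sizes and random entries below are arbitrary:

```python
import numpy as np

# Numeric check of the block-inversion lemma with M = A - B D^{-1} C.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)) + 5 * np.eye(3)   # keep blocks well-conditioned
B = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 3))
D = rng.normal(size=(2, 2)) + 5 * np.eye(2)
Dinv = np.linalg.inv(D)
Minv = np.linalg.inv(A - B @ Dinv @ C)
block_inv = np.block([[Minv, -Minv @ B @ Dinv],
                      [-Dinv @ C @ Minv, Dinv + Dinv @ C @ Minv @ B @ Dinv]])
full = np.block([[A, B], [C, D]])
print(np.allclose(block_inv, np.linalg.inv(full)))  # True
```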

And the "completion of squares" trick. Consider the quadratic function (1/2) z^T A z + b^T z + c, where A is a symmetric, nonsingular matrix. Then one can verify directly that
(1/2) z^T A z + b^T z + c = (1/2)(z + A^{-1}b)^T A (z + A^{-1}b) + c − (1/2) b^T A^{-1} b.
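A direct numeric verification of this identity (arbitrary A, b, z, c of my choosing):

```python
import numpy as np

# Verify the completion-of-squares identity for a symmetric nonsingular A.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
A = A + A.T + 6 * np.eye(3)                 # symmetric, nonsingular
b, z, c = rng.normal(size=3), rng.normal(size=3), 1.7
Ainv_b = np.linalg.solve(A, b)              # A^{-1} b
lhs = 0.5 * z @ A @ z + b @ z + c
rhs = 0.5 * (z + Ainv_b) @ A @ (z + Ainv_b) + c - 0.5 * b @ Ainv_b
print(np.isclose(lhs, rhs))  # True
```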

EM for factor analysis
In the factor analysis model, we posit a joint distribution on (x, z) as follows, where z ∈ R^k is a latent random variable:

z ~ N(0, I),
x|z ~ N(μ + Λz, Ψ).

Here, the parameters of our model are the vector μ ∈ R^n, the matrix Λ ∈ R^{n×k}, and the diagonal matrix Ψ ∈ R^{n×n}. The value of k is usually chosen to be smaller than n.
Thus, we imagine that each datapoint x^{(i)} is generated by sampling a k-dimensional multivariate Gaussian z^{(i)}. Then it is mapped to a k-dimensional affine space of R^n by computing μ + Λz^{(i)}. Lastly, x^{(i)} is generated by adding covariance-Ψ noise to μ + Λz^{(i)}. Equivalently:
z ~ N(0, I),
ϵ ~ N(0, Ψ),
x = μ + Λz + ϵ,

where ϵ and z are independent.
[z; x] ~ N( [0⃗; μ], [I, Λ^T; Λ, ΛΛ^T + Ψ] )

The log-likelihood of the parameters is

ℓ(μ, Λ, Ψ) = log ∏_{i=1}^m 1 / ( (2π)^{n/2} |ΛΛ^T + Ψ|^{1/2} ) · exp( −(1/2)(x^{(i)} − μ)^T (ΛΛ^T + Ψ)^{-1} (x^{(i)} − μ) ).

E-step: the posterior of z^{(i)} given x^{(i)} is Gaussian,

z^{(i)} | x^{(i)}; μ, Λ, Ψ ~ N( μ_{z^{(i)}|x^{(i)}}, Σ_{z^{(i)}|x^{(i)}} ),

with

μ_{z^{(i)}|x^{(i)}} = Λ^T (ΛΛ^T + Ψ)^{-1} (x^{(i)} − μ),
Σ_{z^{(i)}|x^{(i)}} = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ.

M-step: the parameter updates are

Λ = ( Σ_{i=1}^m (x^{(i)} − μ) μ_{z^{(i)}|x^{(i)}}^T ) ( Σ_{i=1}^m μ_{z^{(i)}|x^{(i)}} μ_{z^{(i)}|x^{(i)}}^T + Σ_{z^{(i)}|x^{(i)}} )^{-1},

μ = (1/m) Σ_{i=1}^m x^{(i)},

Φ = (1/m) Σ_{i=1}^m [ x^{(i)} x^{(i)T} − x^{(i)} μ_{z^{(i)}|x^{(i)}}^T Λ^T − Λ μ_{z^{(i)}|x^{(i)}} x^{(i)T} + Λ ( μ_{z^{(i)}|x^{(i)}} μ_{z^{(i)}|x^{(i)}}^T + Σ_{z^{(i)}|x^{(i)}} ) Λ^T ],

and Ψ is obtained by setting Ψ_{ii} = Φ_{ii} (i.e., letting Ψ be the diagonal matrix containing only the diagonal entries of Φ).
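The E- and M-steps above vectorize cleanly. Below is a hedged one-iteration sketch that transcribes the updates literally; the function and variable names are mine, and no numerical hardening is attempted:

```python
import numpy as np

def factor_analysis_em_step(X, mu, Lam, Psi):
    """One EM iteration for factor analysis, transcribing the updates above.

    X: (m, n) data matrix; mu: (n,); Lam (= Lambda): (n, k); Psi: (n, n) diagonal.
    """
    m, n = X.shape
    k = Lam.shape[1]
    Xc = X - mu                                   # rows are (x_i - mu)^T
    G = np.linalg.inv(Lam @ Lam.T + Psi)          # (Lambda Lambda^T + Psi)^{-1}

    # E-step: rows of Mu_zx are mu_{z_i|x_i}^T; Sig_zx is shared across i.
    Mu_zx = Xc @ G @ Lam                          # (m, k)
    Sig_zx = np.eye(k) - Lam.T @ G @ Lam          # (k, k)

    # M-step.
    Lam_new = (Xc.T @ Mu_zx) @ np.linalg.inv(Mu_zx.T @ Mu_zx + m * Sig_zx)
    mu_new = X.mean(axis=0)
    Phi = (X.T @ X
           - X.T @ Mu_zx @ Lam_new.T
           - Lam_new @ Mu_zx.T @ X
           + Lam_new @ (Mu_zx.T @ Mu_zx + m * Sig_zx) @ Lam_new.T) / m
    Psi_new = np.diag(np.diag(Phi))               # keep only diagonal entries of Phi
    return mu_new, Lam_new, Psi_new
```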
