Linear regression
Locally weighted linear regression
$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right), \qquad X^T W X \theta = X^T W \vec{y}, \qquad \theta = (X^T W X)^{-1} X^T W \vec{y}$$
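As a minimal sketch (assuming NumPy; the Gaussian kernel and query-point convention follow the weight formula above), the weighted normal equations give a direct implementation of locally weighted linear regression:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.8):
    """Predict at x_query by locally weighted least squares.

    Weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2)); theta solves
    (X^T W X) theta = X^T W y.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# On noiseless linear data, LWR recovers the underlying line exactly.
X = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])
y = 2.0 + 3.0 * X[:, 1]
pred = lwr_predict(X, y, np.array([1.0, 0.5]))
```

Because $\theta$ is re-fit for every query point, the method is non-parametric: the whole training set is needed at prediction time.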
Newton’s method:
$$\theta := \theta - H^{-1}\nabla_\theta \ell(\theta), \qquad H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \partial \theta_j}.$$
Fitting locally weighted logistic regression with Newton's method:
The log-likelihood function for logistic regression:
$$\ell(\theta) = \sum_{i=1}^{m} y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)}))$$
For any vector $z$, it holds true that $z^T H z \le 0$; that is, the Hessian of $\ell(\theta)$ is negative semidefinite, so $\ell$ is concave and Newton's method finds its global maximum.
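Newton's method applied to this log-likelihood can be sketched as follows (assuming NumPy and $h(x) = 1/(1 + e^{-\theta^T x})$; since the Hessian of $\ell$ is $-X^T S X$ with $S = \mathrm{diag}(h(1-h))$, the update $\theta := \theta - H^{-1}\nabla_\theta\ell(\theta)$ becomes an addition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=10):
    """Maximize the logistic log-likelihood with Newton's method.

    Gradient of l: X^T (y - h). Hessian of l: -X^T S X, S = diag(h(1-h)),
    so theta := theta - H^{-1} grad = theta + (X^T S X)^{-1} X^T (y - h).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        S = np.diag(h * (1.0 - h))
        theta += np.linalg.solve(X.T @ S @ X, X.T @ (y - h))
    return theta

# Tiny example; the labels are not linearly separable, so the MLE is finite.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = logistic_newton(X, y, n_iter=25)
```

Note that on linearly separable data the MLE does not exist ($\|\theta\| \to \infty$), so the iteration would diverge.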
The Exponential family
We say that a class of distributions is in the exponential family if it can be written in the form
$$p(y;\eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right).$$
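For instance, the Bernoulli distribution with mean $\phi$ fits this form (a standard example, written out here as a check):

```latex
p(y;\phi) = \phi^{y}(1-\phi)^{1-y}
          = \exp\!\left( y \log\frac{\phi}{1-\phi} + \log(1-\phi) \right)
```

so $\eta = \log(\phi/(1-\phi))$, $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1 + e^{\eta})$, and $b(y) = 1$.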
Jensen's inequality
Suppose we start with the inequality in the basic definition of a convex function
$$f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y) \quad \text{for } 0 \le \theta \le 1.$$
Using induction, this can be fairly easily extended to convex combinations of more than two points,
$$f\left(\sum_{i=1}^{k}\theta_i x_i\right) \le \sum_{i=1}^{k}\theta_i f(x_i) \quad \text{for } \sum_{i=1}^{k}\theta_i = 1,\ \theta_i \ge 0\ \forall i.$$
In fact, this can also be extended to infinite sums or integrals. In the latter case, the inequality can be written as
$$f\left(\int p(x)\,x\,dx\right) \le \int p(x)f(x)\,dx \quad \text{for } \int p(x)\,dx = 1,\ p(x) \ge 0\ \forall x.$$
Because p(x) integrates to 1, it is common to consider it as a probability density, in which case the previous equation can be written in terms of expectations,
$$f(E[x]) \le E[f(x)].$$
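The expectation form is easy to check numerically for a convex function such as $f(x) = x^2$ (a quick illustration, assuming NumPy; the sampling distribution is an arbitrary choice):

```python
import numpy as np

# Draw a large sample; any distribution works since f(x) = x^2 is convex.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)

lhs = x.mean() ** 2        # f(E[x])
rhs = (x ** 2).mean()      # E[f(x)]
```

For this particular $f$, the gap $E[f(x)] - f(E[x])$ is exactly the (population-style) sample variance of $x$.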
Learning theory
Lemma (the union bound). Let $A_1, A_2, \ldots, A_k$ be $k$ different events (that may not be independent). Then
$$P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k).$$
Lemma (Hoeffding inequality). Let $Z_1, \ldots, Z_m$ be $m$ independent and identically distributed (iid) random variables drawn from a Bernoulli($\phi$) distribution, let $\hat{\phi} = \frac{1}{m}\sum_{i=1}^{m} Z_i$ be their mean, and let any $\gamma > 0$ be fixed. Then
$$P(|\hat{\phi} - \phi| > \gamma) \le 2\exp(-2\gamma^2 m)$$
This lemma (which in learning theory is also called the Chernoff bound) says that if we take $\hat{\phi}$, the average of $m$ Bernoulli($\phi$) random variables, to be our estimate of $\phi$, then the probability of being far from the true value is small, so long as $m$ is large.
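The bound can be sanity-checked by simulation (a sketch, assuming NumPy; the values of $\phi$, $m$, and $\gamma$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, m, gamma, trials = 0.3, 200, 0.05, 5000

# phi_hat for each of `trials` repetitions of m Bernoulli(phi) draws.
phi_hat = rng.binomial(m, phi, size=trials) / m
empirical = np.mean(np.abs(phi_hat - phi) > gamma)  # observed tail frequency
bound = 2.0 * np.exp(-2.0 * gamma ** 2 * m)         # Hoeffding bound
```

The observed frequency should sit below the bound, which here is $2e^{-1} \approx 0.74$; Hoeffding is quite loose for these parameters.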
For a hypothesis $h$, we define the training error (also called the empirical risk or empirical error in learning theory) to be
$$\hat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1\{h(x^{(i)}) \ne y^{(i)}\},$$
and the generalization error to be
$$\epsilon(h) = P_{(x,y)\sim\mathcal{D}}(h(x) \ne y).$$
Letting $Z_j = 1\{h_i(x^{(j)}) \ne y^{(j)}\}$, the training error of a fixed hypothesis $h_i$ can be written as
$$\hat{\epsilon}(h_i) = \frac{1}{m}\sum_{j=1}^{m} Z_j.$$
$$\epsilon(\hat{h}) \le \epsilon(h^*) + 2\gamma$$
Theorem. Let $|\mathcal{H}| = k$, and let any $m, \delta$ be fixed. Then with probability at least $1 - \delta$, we have that
$$\epsilon(\hat{h}) \le \left(\min_{h\in\mathcal{H}}\epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}.$$
Corollary. Let $|\mathcal{H}| = k$, and let any $\gamma, \delta$ be fixed. Then for $\epsilon(\hat{h}) \le \min_{h\in\mathcal{H}}\epsilon(h) + 2\gamma$ to hold with probability at least $1 - \delta$, it suffices that
$$m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta} = O\left(\frac{1}{\gamma^2}\log\frac{k}{\delta}\right).$$
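The corollary translates directly into a sample-size calculator (a minimal sketch; the example numbers are arbitrary):

```python
import math

def sample_complexity(k, gamma, delta):
    """Smallest integer m with m >= (1 / (2 gamma^2)) * log(2k / delta).

    Guarantees, with probability >= 1 - delta, that
    eps(h_hat) <= min_h eps(h) + 2*gamma for a finite class of size k.
    """
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

m_needed = sample_complexity(k=10_000, gamma=0.1, delta=0.05)
```

Note that $m$ grows only logarithmically in $k$, which is what makes even very large finite hypothesis classes learnable, while halving $\gamma$ quadruples the required sample size.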
Factor Analysis:
Marginals and conditionals of Gaussians. Suppose $x \sim \mathcal{N}(\mu, \Sigma)$ is partitioned as $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ with $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$. Then
$$\mathrm{Cov}(x) = \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} = E\left[(x-\mu)(x-\mu)^T\right] = E\begin{bmatrix} (x_1-\mu_1)(x_1-\mu_1)^T & (x_1-\mu_1)(x_2-\mu_2)^T \\ (x_2-\mu_2)(x_1-\mu_1)^T & (x_2-\mu_2)(x_2-\mu_2)^T \end{bmatrix}$$
The conditional distribution $x_1 \mid x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$ has
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
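These two formulas are easy to wrap as a helper (a sketch, assuming NumPy; `n1` denotes the dimension of $x_1$):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, x2, n1):
    """Mean and covariance of x1 | x2 for a partitioned joint Gaussian.

    mu_{1|2}    = mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
    Sigma_{1|2} = Sigma11 - Sigma12 Sigma22^{-1} Sigma21
    """
    mu1, mu2 = mu[:n1], mu[n1:]
    S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
    S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]
    K = S12 @ np.linalg.inv(S22)   # "regression" coefficient of x1 on x2
    return mu1 + K @ (x2 - mu2), S11 - K @ S21

# 2-d example: Sigma = [[2, 1], [1, 2]], mu = 0, conditioning on x2 = 2.
mu_c, Sig_c = gaussian_conditional(np.zeros(2),
                                   np.array([[2.0, 1.0], [1.0, 2.0]]),
                                   x2=np.array([2.0]), n1=1)
```

For this example $\mu_{1|2} = 0 + 1 \cdot \tfrac{1}{2}(2 - 0) = 1$ and $\Sigma_{1|2} = 2 - 1 \cdot \tfrac{1}{2} \cdot 1 = 1.5$.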
To deduce the above conditionals, we define $V \in \mathbb{R}^{(m+n)\times(m+n)}$ and need the lemma below:
$$V = \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix} = \Sigma^{-1}, \qquad \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{bmatrix}$$
where $M = A - BD^{-1}C$. Using this formula, it follows that
$$\begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} = \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix}^{-1} = \begin{bmatrix} (V_{AA} - V_{AB}V_{BB}^{-1}V_{BA})^{-1} & -(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA})^{-1}V_{AB}V_{BB}^{-1} \\ -V_{BB}^{-1}V_{BA}(V_{AA} - V_{AB}V_{BB}^{-1}V_{BA})^{-1} & (V_{BB} - V_{BA}V_{AA}^{-1}V_{AB})^{-1} \end{bmatrix}$$
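The block-inverse lemma, and hence this expression, can be verified numerically (a sketch, assuming NumPy; a random symmetric positive definite matrix is used so that every block inverse exists):

```python
import numpy as np

rng = np.random.default_rng(2)
na, nb = 3, 2
V = rng.normal(size=(na + nb, na + nb))
V = V @ V.T + (na + nb) * np.eye(na + nb)   # symmetric positive definite

A, B = V[:na, :na], V[:na, na:]
C, D = V[na:, :na], V[na:, na:]
Dinv = np.linalg.inv(D)
Minv = np.linalg.inv(A - B @ Dinv @ C)      # M = A - B D^{-1} C

# Assemble the inverse block by block, exactly as in the lemma.
top = np.hstack([Minv, -Minv @ B @ Dinv])
bot = np.hstack([-Dinv @ C @ Minv, Dinv + Dinv @ C @ Minv @ B @ Dinv])
V_inv_blocks = np.vstack([top, bot])
```

Comparing `V_inv_blocks` with `np.linalg.inv(V)` confirms the formula to machine precision.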
And the "completion of squares" trick: consider the quadratic function $\frac{1}{2}z^T A z + b^T z + c$, where $A$ is a symmetric, nonsingular matrix. Then one can verify directly that
$$\frac{1}{2}z^T A z + b^T z + c = \frac{1}{2}(z + A^{-1}b)^T A (z + A^{-1}b) + c - \frac{1}{2}b^T A^{-1} b.$$
EM for factor analysis
In the factor analysis model, we posit a joint distribution on $(x, z)$ as follows, where $z \in \mathbb{R}^k$ is a latent random variable:
$$z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi).$$
Here, the parameters of our model are the vector $\mu \in \mathbb{R}^n$, the matrix $\Lambda \in \mathbb{R}^{n \times k}$, and the diagonal matrix $\Psi \in \mathbb{R}^{n \times n}$. The value of $k$ is usually chosen to be smaller than $n$.
Thus, we imagine that each datapoint $x^{(i)}$ is generated by sampling a $k$-dimensional multivariate Gaussian $z^{(i)}$; equivalently,
$$z \sim \mathcal{N}(0, I), \qquad \epsilon \sim \mathcal{N}(0, \Psi), \qquad x = \mu + \Lambda z + \epsilon,$$
where ϵ and z are independent.
$$\ell(\mu, \Lambda, \Psi) = \log\prod_{i=1}^{m}\frac{1}{(2\pi)^{n/2}\left|\Lambda\Lambda^T + \Psi\right|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu)^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)\right).$$
E-step: the posterior of each latent variable is Gaussian,
$$z^{(i)} \mid x^{(i)}; \mu, \Lambda, \Psi \sim \mathcal{N}\left(\mu_{z^{(i)}|x^{(i)}},\ \Sigma_{z^{(i)}|x^{(i)}}\right)$$
$$\mu_{z^{(i)}|x^{(i)}} = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(x^{(i)} - \mu), \qquad \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda$$
M-step: the parameter updates are
$$\Lambda = \left(\sum_{i=1}^{m}(x^{(i)} - \mu)\,\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^{m}\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)^{-1}$$
$$\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$
$$\Phi = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T} - x^{(i)}\mu_{z^{(i)}|x^{(i)}}^T\Lambda^T - \Lambda\mu_{z^{(i)}|x^{(i)}}x^{(i)T} + \Lambda\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\Lambda^T$$
and we obtain $\Psi$ by setting $\Psi_{ii} = \Phi_{ii}$ (i.e., letting $\Psi$ be the diagonal matrix containing only the diagonal entries of $\Phi$).
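Putting the E-step and M-step together, EM for factor analysis can be sketched as follows (assuming NumPy; the initialization and iteration count are arbitrary choices, and $\mu$ is fixed at the sample mean since its update never changes across iterations):

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=100, seed=0):
    """EM for x = mu + Lambda z + eps, z ~ N(0, I_k), eps ~ N(0, Psi diagonal)."""
    m, n = X.shape
    mu = X.mean(axis=0)                      # mu = (1/m) sum_i x^(i)
    Xc = X - mu
    rng = np.random.default_rng(seed)
    Lam = rng.normal(scale=0.1, size=(n, k))
    Psi = np.diag(Xc.var(axis=0))
    for _ in range(n_iter):
        # E-step: z|x ~ N(mu_z, Sigma_z); Sigma_z is shared by all points.
        G = np.linalg.inv(Lam @ Lam.T + Psi)
        Mu_z = Xc @ G @ Lam                  # row i holds mu_{z^(i)|x^(i)}
        Sigma_z = np.eye(k) - Lam.T @ G @ Lam
        # M-step: sum_i E[z^(i) z^(i)T] = Mu_z^T Mu_z + m * Sigma_z.
        Ezz = Mu_z.T @ Mu_z + m * Sigma_z
        Lam = (Xc.T @ Mu_z) @ np.linalg.inv(Ezz)
        Phi = (Xc.T @ Xc - Xc.T @ Mu_z @ Lam.T
               - Lam @ Mu_z.T @ Xc + Lam @ Ezz @ Lam.T) / m
        Psi = np.diag(np.diag(Phi))          # keep only the diagonal of Phi
    return mu, Lam, Psi
```

At convergence, $\Lambda\Lambda^T + \Psi$ approximates the sample covariance of the data, up to what a rank-$k$-plus-diagonal model can capture.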