Linear Regression
Suppose the dataset is:
$$D=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$$
In what follows we write:
$$X=(x_1,x_2,\dots,x_N)^T,\quad Y=(y_1,y_2,\dots,y_N)^T$$
Linear regression assumes:
$$f(w)=w^Tx$$
Least Squares
For this problem, define the loss function as the squared error under the 2-norm:
$$L(w)=\sum_{i=1}^N\|w^Tx_i-y_i\|^2_2$$
Expanding:
$$L(w)=(w^Tx_1-y_1,\dots,w^Tx_N-y_N)\cdot(w^Tx_1-y_1,\dots,w^Tx_N-y_N)^T\\ =(w^TX^T-Y^T)\cdot(Xw-Y)\\ =w^TX^TXw-Y^TXw-w^TX^TY+Y^TY\\ =w^TX^TXw-2w^TX^TY+Y^TY$$
where the last step uses the fact that $Y^TXw=w^TX^TY$ is a scalar.
The $\hat{w}$ minimizing this value is:
$$\hat{w}=\mathop{\arg\min}_w L(w)\\ \longrightarrow\frac{\partial}{\partial w}L(w)=0\\ \longrightarrow 2X^TX\hat{w}-2X^TY=0\\ \longrightarrow\hat{w}=(X^TX)^{-1}X^TY=X^+Y$$
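As a quick numerical check of the closed form (a sketch on toy data; `w_true`, the sample size, and the noise level are all illustrative assumptions), the normal-equation solution agrees with NumPy's least-squares solver:

```python
import numpy as np

# Toy data for a 2-feature linear model (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # rows are the samples x_i^T
w_true = np.array([1.5, -2.0])
Y = X @ w_true + 0.1 * rng.normal(size=100)

# Normal-equation solution: w_hat = (X^T X)^{-1} X^T Y
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq solves the same least-squares problem more stably
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(w_hat, w_lstsq)
```

In practice `lstsq` (or `solve` on the normal equations) is preferred over forming `(X^T X)^{-1}` explicitly, since explicit inversion is slower and less accurate.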
In this expression, $(X^TX)^{-1}X^T$ is also called the pseudo-inverse, written $X^+$. For an $X$ with full row rank or full column rank this can be computed directly, but for a non-full-rank sample matrix we use the singular value decomposition (SVD) of $X$:
$$X=U\Sigma V^T$$
from which $X^+=V\Sigma^+U^T$, where $\Sigma^+$ inverts the nonzero singular values.
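A sketch of the SVD route for a rank-deficient design matrix (the matrix shapes and the truncation tolerance `tol` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# A deliberately rank-deficient matrix: 6 samples, 5 features, rank 3
X = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 5))
Y = rng.normal(size=6)

# Thin SVD: X = U diag(s) V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Sigma^+: invert only the singular values above a tolerance
tol = 1e-10 * s.max()
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

X_pinv = Vt.T @ np.diag(s_inv) @ U.T     # X^+ = V Sigma^+ U^T
assert np.allclose(X_pinv, np.linalg.pinv(X, rcond=1e-10))

w_hat = X_pinv @ Y   # minimum-norm least-squares solution
```

Here `(X^T X)` is singular, so the normal equations cannot be solved directly; the SVD-based pseudo-inverse picks out the minimum-norm solution among all least-squares minimizers.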
Geometrically, least squares sums the squared distances between the model and the observed values. Suppose the sample vectors span a $p$-dimensional space $X=\mathrm{Span}(x_1,\dots,x_N)$. The model can be written as $f(w)=X\beta$, i.e. some combination of $x_1,\dots,x_N$. Least squares asks that $Y$ be as close to this model as possible, so their difference must be perpendicular to the spanned space:
$$X^T\cdot(Y-X\beta)=0\longrightarrow\beta=(X^TX)^{-1}X^TY$$
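The orthogonality condition can be verified numerically (a sketch on simulated data; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)

# Least-squares coefficients via the normal equations
beta = np.linalg.solve(X.T @ X, X.T @ Y)
residual = Y - X @ beta

# The residual Y - X beta is orthogonal to every column of X
assert np.allclose(X.T @ residual, 0.0, atol=1e-10)
```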
MLE with Gaussian Noise
The one-dimensional Gaussian distribution:
$$N(\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The $p$-dimensional Gaussian distribution:
$$N(\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
Maximum likelihood estimation (MLE):
$$\theta_{MLE}=\mathop{\arg\max}_{\theta}P(x|\theta)$$
For the one-dimensional case, write $y=w^Tx+\epsilon$ with $\epsilon\sim N(0,\sigma^2)$, so that $y\sim N(w^Tx,\sigma^2)$. Substituting into the maximum likelihood estimate:
$$L(w)=\log p(Y|X,w)=\log\prod_{i=1}^N p(y_i|x_i,w)\\ =\sum_{i=1}^N\log\left(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}}\right)\\ =\sum_{i=1}^N\left[\log\frac{1}{\sqrt{2\pi}}+\log\frac{1}{\sigma}-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}\right]$$
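Since only the last term depends on $w$, maximizing this log-likelihood over $w$ is the same problem as least squares. A small sketch with simulated data (the sample size, coefficients, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
Y = X @ np.array([0.7, -1.2]) + 0.3 * rng.normal(size=200)
sigma = 0.3   # noise standard deviation, assumed known here

def log_likelihood(w):
    # log p(Y | X, w) under the Gaussian noise model above
    r = Y - X @ w
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - r**2 / (2 * sigma**2))

# The least-squares solution...
w_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)

# ...attains a likelihood at least as high as any perturbed w
for _ in range(100):
    w_other = w_ls + 0.1 * rng.normal(size=2)
    assert log_likelihood(w_ls) >= log_likelihood(w_other)
```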
As a sub-problem, consider the MLE of the parameters of a single Gaussian from samples $x_1,\dots,x_N$. For $\mu_{MLE}$ we have:
$$\mu_{MLE}=\mathop{\arg\max}_\mu\log p(x|\theta)\\ =\mathop{\arg\max}_\mu\sum_{i=1}^N-\frac{(x_i-\mu)^2}{2\sigma^2}\\ =\mathop{\arg\min}_\mu\sum_{i=1}^N(x_i-\mu)^2$$
Taking the derivative:
$$\frac{\partial}{\partial{\mu}}\sum_{i=1}^N(x_i-\mu)^2=\sum_{i=1}^N 2(x_i-\mu)(-1)=0\\ \longrightarrow\sum_{i=1}^{N}(x_i-\mu)=0\\ \longrightarrow\sum_{i=1}^N x_i=N\mu\\ \longrightarrow\mu_{MLE}=\frac{1}{N}\sum_{i=1}^N x_i$$
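A quick check of the result on simulated data (the mean, scale, and sample size are illustrative): the MLE is just the sample mean, and it indeed minimizes the sum of squares:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_mle = x.mean()   # the closed-form MLE derived above

# The sample mean minimizes the sum of squared deviations
sq = lambda m: np.sum((x - m) ** 2)
assert sq(mu_mle) <= sq(mu_mle + 0.01)
assert sq(mu_mle) <= sq(mu_mle - 0.01)
```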
For $\sigma^2_{MLE}$ we similarly obtain from the same expression:
$$\sigma^2_{MLE}=\mathop{\arg\max}_\sigma\log P(x|\theta)\\ =\mathop{\arg\max}_\sigma\sum_{i=1}^N\left(-\log\sigma-\frac{1}{2\sigma^2}(x_i-\mu)^2\right)$$
Taking the derivative:
$$\frac{\partial}{\partial\sigma}\sum_{i=1}^N\left(-\log\sigma-\frac{(x_i-\mu)^2}{2\sigma^2}\right)=\sum_{i=1}^N\left(-\frac{1}{\sigma}+(x_i-\mu)^2\sigma^{-3}\right)=0\\ \longrightarrow\sum_{i=1}^N\left(-\sigma^2+(x_i-\mu)^2\right)=0\\ \longrightarrow\sigma^2_{MLE}=\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2$$
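Numerically (on illustrative simulated data), this $\sigma^2_{MLE}$ is exactly what `np.var` computes with its default `ddof=0`:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=3.0, size=10_000)

mu_mle = x.mean()
var_mle = np.mean((x - mu_mle) ** 2)   # closed-form sigma^2_MLE

# np.var divides by N by default (ddof=0), matching the MLE
assert np.isclose(var_mle, np.var(x))
```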
However, this $\sigma^2_{MLE}$ is a biased estimator. Taking its expectation (with $\mu$ replaced by $\mu_{MLE}$):
$$E[\sigma_{MLE}^2]=\frac{1}{N}\sum_{i=1}^N E\left[(x_i-\mu_{MLE})^2\right]\\ =E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i^2-2x_i\mu_{MLE}+\mu^2_{MLE})\right]\\ =E\left[\frac{1}{N}\sum_{i=1}^N x_i^2-2\mu^2_{MLE}+\mu^2_{MLE}\right]\\ =E\left[\frac{1}{N}\sum_{i=1}^N x_i^2-\mu^2_{MLE}\right]\\ =E\left[\frac{1}{N}\sum_{i=1}^N x^2_i-\mu^2-(\mu^2_{MLE}-\mu^2)\right]\\ =E\left[\frac{1}{N}\sum_{i=1}^N x^2_i-\mu^2\right]-E\left[\mu^2_{MLE}-\mu^2\right]$$
where the third line uses $\frac{1}{N}\sum_{i=1}^N x_i=\mu_{MLE}$.
Now,
$$E\left[\frac{1}{N}\sum_{i=1}^N x^2_i-\mu^2\right]=\frac{1}{N}\sum_{i=1}^N\left(E(x_i^2)-\mu^2\right)\\ =\frac{1}{N}\sum_{i=1}^N Var(x_i)=\frac{1}{N}\sum_{i=1}^N\sigma^2=\sigma^2$$
$$E\left[\mu^2_{MLE}-\mu^2\right]=E(\mu^2_{MLE})-\mu^2=Var(\mu_{MLE})\\ =Var\left[\frac{1}{N}\sum^N_{i=1}x_i\right]=\frac{1}{N^2}\sum^N_{i=1}Var(x_i)\\ =\frac{1}{N^2}\sum_{i=1}^N\sigma^2=\frac{\sigma^2}{N}$$
(the first step uses $E(\mu_{MLE})=\mu$; the variance splits across the sum because the $x_i$ are independent).
So finally:
$$E[\sigma^2_{MLE}]=\frac{N-1}{N}\sigma^2$$
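A small simulation reproduces this $\frac{N-1}{N}$ shrinkage (the sample size, trial count, and $\sigma^2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials, sigma2 = 5, 200_000, 4.0

# Repeatedly draw N samples and compute both variance estimates
samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
var_mle = samples.var(axis=1, ddof=0)        # biased: divides by N
var_unbiased = samples.var(axis=1, ddof=1)   # unbiased: divides by N-1

# E[sigma^2_MLE] ≈ (N-1)/N * sigma^2, while ddof=1 recovers sigma^2
assert abs(var_mle.mean() - (N - 1) / N * sigma2) < 0.05
assert abs(var_unbiased.mean() - sigma2) < 0.05
```

This is the reason `np.var(..., ddof=1)` exists: `ddof` subtracts the degrees of freedom consumed by estimating the mean.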
The unbiased estimator of $\sigma^2$ is therefore:
$$\hat{\sigma}^2=\frac{1}{N-1}\sum_{i=1}^N(x_i-\mu_{MLE})^2$$