Learning problems over continuous random variables are called regression. The most common kind of regression is linear regression:
$$y \mid x;\theta = h(x) + \epsilon$$
where $\epsilon \sim N(0, \sigma^2)$ is called the error term. Applying GLM, the hypothesis corresponding to the normal distribution is[^1]
$$h(x) = \theta^Tx$$
Applying MLE then yields the log likelihood
$$\begin{array}{rcl} l(\theta) &=& \sum\limits_{i=1}^m\log\frac{1}{\sqrt{2\pi}\sigma}\exp\big(-\frac{(y^{(i)}-h(x^{(i)}))^2}{2\sigma^2}\big)\\ &=& -\frac{1}{2\sigma^2}\sum\limits_{i=1}^m(y^{(i)} - h(x^{(i)}))^2 + C\\ &=& -\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum\limits_{i=1}^m\epsilon_i^2 + C \end{array}$$
Define the cost function
$$J(\theta) = \frac{1}{2}\sum\limits_{i=1}^m\epsilon_i^2$$
so minimizing the cost function is equivalent to the original problem. The normal equation[^2] gives the closed-form solution
$$\theta = (X^TX)^{-1}X^T\vec{y}$$
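As a quick sanity check, here is a minimal NumPy sketch of the closed-form solution; the toy data and variable names are my own, not from the text:

```python
import numpy as np

# Toy data (illustrative only): m = 50 samples, n = 2 features.
rng = np.random.default_rng(0)
m, n = 50, 2
x = rng.normal(size=(m, n))
X = np.hstack([np.ones((m, 1)), x])        # design matrix with intercept column
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(scale=0.5, size=m)   # y = X theta + eps

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                               # approximately theta_true
```

In practice `np.linalg.lstsq` (QR/SVD based) is numerically more robust than forming $X^TX$ at all.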
If $X^TX$ is not invertible, possible causes include:

- redundant (duplicated) features in the training set
- fewer samples than parameters

In either case, removing some features always works. If it is certain that no features are redundant, one can instead increase the number of samples or use the regularized normal equation
$$\theta = \left(X^TX+ \lambda\left[\begin{array}{ccccc} 0 & 0 & 0 & \cdots & 0\\ 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0\\ & \vdots && \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 1\\ \end{array}\right]_{(n+1)\times(n+1)} \right)^{-1}X^T\vec{y}$$
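A sketch of the regularized variant (the helper name is mine; it assumes the first column of $X$ is the intercept, which the matrix above leaves unpenalized):

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Regularized normal equation; the intercept (column 0) is not penalized."""
    d = X.shape[1]             # d = n + 1, intercept column included
    R = np.eye(d)
    R[0, 0] = 0.0              # the 0 in the top-left corner of the matrix above
    return np.linalg.solve(X.T @ X + lam * R, X.T @ y)
```

For $\lambda > 0$ the penalty typically restores invertibility when $X^TX$ itself is singular.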
Model Testing
To evaluate the model's goodness of fit, define
$$\begin{array}{rlcl} \text{SSE} & \text{(the sum of squares due to error)} &=& 2J(\theta)\\ \text{SSR} & \text{(sum of squares of the regression)} &=& (m - 1)s_h^2\\ \text{SST} & \text{(total sum of squares)} &=& (m - 1)s_y^2\\ r^2 & \text{(coefficient of determination)} &=& 1 - \text{SSE} / \text{SST} \end{array}$$
One can prove the sum-of-squares decomposition
$$\text{SST} = \text{SSE} + \text{SSR}$$
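These quantities are straightforward to compute; the sketch below (the helper name `fit_statistics` is mine) also verifies the decomposition numerically. Note that the decomposition holds for the least-squares $\theta$ of a model with an intercept term:

```python
import numpy as np

def fit_statistics(X, y, theta):
    """SSE, SSR, SST and r^2 for a fitted linear model (sketch)."""
    h = X @ theta                          # fitted values h(x^(i))
    sse = np.sum((y - h) ** 2)             # SSE = 2 * J(theta)
    ssr = np.sum((h - y.mean()) ** 2)      # SSR = (m - 1) * s_h^2
    sst = np.sum((y - y.mean()) ** 2)      # SST = (m - 1) * s_y^2
    r2 = 1.0 - sse / sst
    # Sum-of-squares decomposition for the least-squares fit with intercept.
    assert np.isclose(sst, sse + ssr)
    return sse, ssr, sst, r2
```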
Observe that SSE is constrained by the $n + 1$ conditions imposed by $\theta_0, \theta_1, \dots, \theta_n$[^3], so its degrees of freedom are
$$\begin{array}{rlcl} \text{DFE} & \text{(degrees-of-freedom in the error)} &=& m - n - 1 \end{array}$$
while the degrees of freedom of SSR and SST are $n$ and $m - 1$ respectively[^4]. From these, define
$$\begin{array}{rlcl} \text{MSE} & \text{(mean squared error)} &=& \text{SSE} / \text{DFE}\\ \text{RMSE} & \text{(root mean squared error)} &=& \sqrt{\text{MSE}}\\ \text{adj-}r^2 & \text{(df adjusted }r^2\text{)} &=& 1 - \text{MSE} / s_y^2 \end{array}$$
One can test the hypothesis
$$H_0 : \vec\theta = 0$$
with the statistic
$$F = \frac{\text{SSR} / n}{\text{MSE}} \sim F(n, \text{DFE})$$
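Assuming SciPy is available, the test can be carried out as follows (a sketch reusing `fit_statistics` from above; the names are mine):

```python
import numpy as np
from scipy import stats

def f_test(X, y, theta):
    """Overall F-test of H0 (sketch)."""
    m, d = X.shape                     # d = n + 1
    n = d - 1
    sse, ssr, _, _ = fit_statistics(X, y, theta)
    dfe = m - n - 1                    # DFE
    mse = sse / dfe                    # MSE
    F = (ssr / n) / mse
    p_value = stats.f.sf(F, n, dfe)    # P(F(n, DFE) > F)
    return F, p_value
```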
If the test is significant, one should then run a t-test on each individual parameter, $H_0 : \theta_i = 0$. Since MSE is an unbiased estimate of $\sigma^2$, write
$$\hat{\sigma}^2 = \text{MSE}$$
From the formula
$$Cov(\theta) = \hat{\sigma}^2\left(X^TX\right)^{-1}$$
we obtain $\sigma_j^2 = Var(\theta_j)$ from the diagonal, and it is easy to see that
$$\hat\theta_j \sim N(\theta_j, \sigma_j^2)$$
Construct the pivotal quantity
$$\frac{\hat\theta_j - \theta_j}{\sigma_j} \sim t(\text{DFE})$$
which measures how well each individual predictor fits the target variable. For confidence intervals on predictions, see 多元线性回归的预测.
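A sketch of the per-coefficient t-tests built from $Cov(\theta) = \hat\sigma^2(X^TX)^{-1}$ (the helper name and the two-sided p-value convention are my additions):

```python
import numpy as np
from scipy import stats

def t_tests(X, y, theta):
    """Per-coefficient t-tests of H0: theta_j = 0 (sketch)."""
    m, d = X.shape
    dfe = m - d                                  # m - n - 1
    resid = y - X @ theta
    sigma2_hat = resid @ resid / dfe             # MSE, unbiased for sigma^2
    cov = sigma2_hat * np.linalg.inv(X.T @ X)    # Cov(theta)
    sigma_j = np.sqrt(np.diag(cov))              # standard errors
    t = theta / sigma_j                          # pivot evaluated under H0
    p_values = 2 * stats.t.sf(np.abs(t), dfe)    # two-sided p-values
    return t, p_values
```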
Simple Linear Regression
For $X = \left[\begin{array}{cc}x&1\end{array}\right]$ we can solve
$$\begin{array}{rcl} \theta &=& \left[\begin{array}{cc} \sum\limits_{i=1}^m x_i^2 & \sum\limits_{i=1}^m x_i\\ \sum\limits_{i=1}^m x_i & m \end{array}\right]^{-1}\left[\begin{array}{c} x^T\\ \mathbf{1}^T \end{array}\right]\vec{y}\\ &=& \frac{1}{m\sum\limits_{i=1}^m x_i^2 - \left(\sum\limits_{i=1}^m x_i\right)^2}\left[\begin{array}{cc} m & -\sum\limits_{i=1}^m x_i\\ -\sum\limits_{i=1}^m x_i & \sum\limits_{i=1}^m x_i^2 \end{array}\right]\left[\begin{array}{c} \sum\limits_{i=1}^m x_iy_i\\\sum\limits_{i=1}^m y_i \end{array}\right] \end{array}$$
Introduce the notation
$$\begin{array}{rcl} l_x &=& (m-1)s_x^2\\ l_{xy} &=& (m-1)s_{xy} \end{array}$$
Substituting the identities
$$\begin{array}{rcl} \sum\limits_{i=1}^m x_i &=& m\bar x\\ \sum\limits_{i=1}^m y_i &=& m\bar y\\ \sum\limits_{i=1}^m x_i^2 &=& l_x + m\bar{x}^2\\ \sum\limits_{i=1}^m x_iy_i &=& l_{xy} + m\bar x\bar y \end{array}$$
into the expression for $\theta$ gives
$$\begin{array}{rcl} \theta &=& \frac{1}{ml_x}\left[\begin{array}{cc} m & -m\bar x\\ -m\bar x & l_x + m\bar{x}^2 \end{array}\right]\left[\begin{array}{c} l_{xy} + m\bar x\bar y\\ m\bar y \end{array}\right]\\ &=& \left[\begin{array}{c} l_{xy} / l_x\\ \bar y - \bar x\theta_1 \end{array}\right] \end{array}$$
One can show that
$$\begin{array}{rcl} \text{SSR} &=& \theta_1^2 l_x\\ r^2 &=& \rho_{xy}^2 \end{array}$$
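The closed form is easy to verify in code (a minimal sketch; the names are mine). It returns both coefficients together with $\text{SSR} = \theta_1^2 l_x$ and $r^2 = \rho_{xy}^2$:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form fit of y = theta_1 * x + theta_0 via l_x and l_xy (sketch)."""
    l_x = np.sum((x - x.mean()) ** 2)                  # (m - 1) * s_x^2
    l_xy = np.sum((x - x.mean()) * (y - y.mean()))     # (m - 1) * s_xy
    theta1 = l_xy / l_x
    theta0 = y.mean() - x.mean() * theta1
    ssr = theta1 ** 2 * l_x                            # SSR = theta_1^2 * l_x
    r2 = np.corrcoef(x, y)[0, 1] ** 2                  # r^2 = rho_xy^2
    return theta0, theta1, ssr, r2
```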
LWR
Locally weighted regression is a non-parametric estimation method. Unlike linear regression, the cost function used by LWR introduces a weight $w$. A common choice is
$$w^{(i)} = \exp\big(-\frac{1}{2}(x^{(i)} - x)^T\Sigma^{-1}(x^{(i)} - x)\big)$$
where $\Sigma$ is called the bandwidth. The modified cost function is
$$J(\theta) = \sum\limits_{i = 1}^m w^{(i)}(y^{(i)} - h(x^{(i)}))^2$$
Correspondingly, the statistics defined under Model Testing must also take the weights $w$ into account.
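Because LWR has no global parameter vector, a weighted fit is solved at every query point. Below is a sketch using the weighted normal equation $\theta = (X^TWX)^{-1}X^TW\vec y$, a standard weighted-least-squares result that the text does not derive; the function and variable names are mine:

```python
import numpy as np

def lwr_predict(X, y, x_query, Sigma):
    """Predict at a single query point with locally weighted regression (sketch).

    X: (m, n) raw features; an intercept column is appended internally.
    Sigma: (n, n) bandwidth matrix.
    """
    diff = X - x_query                       # rows are x^(i) - x
    S_inv = np.linalg.inv(Sigma)
    # w^(i) = exp(-0.5 * (x^(i) - x)^T Sigma^{-1} (x^(i) - x))
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, S_inv, diff))
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return np.concatenate(([1.0], x_query)) @ theta
```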
References
多元线性回归中的 T 检验怎样理解?其 p 值为什么划定在 0.05?
[^2]: Define the design matrix
    $$X_{m\times(n+1)} = \left(\begin{array}{c} (x^{(1)})^T\\ (x^{(2)})^T\\ \vdots\\ (x^{(m)})^T \end{array}\right)$$
    and let
    $$\vec{y} = \left(\begin{array}{c} y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)} \end{array}\right)$$
    Thus
    $$\begin{array}{rcl} J(\theta) &=& \frac{1}{2}(X\theta-\vec{y})^T(X\theta-\vec{y})\\ \nabla_\theta J &=& \frac{1}{2}\nabla_\theta(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}) \end{array}$$
    Since every term is simply a real number,
    $$\begin{array}{rcl} \nabla_\theta J &=& \frac{1}{2}\nabla_\theta \mathrm{tr}(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta)\\ &=& \frac{1}{2}\big(\nabla_\theta \mathrm{tr}(\theta^TX^TX\theta) - 2\nabla_\theta \mathrm{tr}(\vec{y}^TX\theta)\big) \end{array}$$
    By the properties of the matrix derivative,
    $$\begin{array}{rcl} \nabla_\theta \mathrm{tr}(\vec{y}^TX\theta) &=& \nabla_{(\theta^T)^T} \mathrm{tr}(\theta^TX^T\vec{y})\\ &=& (\nabla_{\theta^T} \mathrm{tr}(\theta^TX^T\vec{y}))^T\\ &=& X^T\vec{y}\\ \nabla_\theta \mathrm{tr}(\theta^TX^TX\theta) &=& \nabla_\theta \mathrm{tr}(\theta\theta^TX^TX)\\ &=& \nabla_\theta \mathrm{tr}(\theta I\theta^TX^TX)\\ &=& 2X^TX\theta \end{array}$$
    Therefore
    $$\nabla_\theta J = X^TX\theta - X^T\vec{y}$$
    Setting $\nabla_\theta J \equiv 0$ gives
    $$X^TX\theta = X^T\vec{y}$$
    which completes the derivation.