Chapter 3: Linear Models
Basic form:

$$f(\boldsymbol{x})=w_1 x_1+w_2 x_2+\ldots+w_d x_d+b=\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}+b,$$

where $\left(x_1 ; x_2 ; \ldots ; x_d\right)$ are the values of sample $\boldsymbol{x}$ on its $d$ attributes.
A linear model tries to produce a prediction for a sample as a linear combination of the sample's attributes.
Linear regression: given a dataset $D=\left\{\left(\boldsymbol{x}_1, y_1\right),\left(\boldsymbol{x}_2, y_2\right), \ldots,\left(\boldsymbol{x}_m, y_m\right)\right\}$, where $\boldsymbol{x}_i=\left(x_{i1} ; x_{i2} ; \ldots ; x_{id}\right)$ and $y_i \in \mathbb{R}$, "linear regression" tries to learn a linear model that predicts the real-valued output label as accurately as possible.
Consider first the simplest case, in which there is only a single input attribute. Linear regression tries to learn
$$f\left(x_i\right)=w x_i+b, \quad \text{such that} \quad f\left(x_i\right) \simeq y_i.$$

Using the mean squared error as the performance measure, minimizing it yields the optimal solution:
$$\begin{aligned} \left(w^*, b^*\right) & =\underset{(w, b)}{\arg \min } \sum_{i=1}^m\left(f\left(x_i\right)-y_i\right)^2 \\ & =\underset{(w, b)}{\arg \min } \sum_{i=1}^m\left(y_i-w x_i-b\right)^2 . \end{aligned}$$
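The closed-form solution of this objective is derived below; as a numerical sanity check, it can also be minimized directly. A minimal gradient-descent sketch (the learning rate, iteration count, and toy data are illustrative, not from the text):

```python
import numpy as np

# Toy data from the line y = 2x + 1, so we know the minimizer.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Plain gradient descent on the mean squared error.
w, b = 0.0, 0.0
for _ in range(5000):
    err = w * x + b - y
    w -= 0.01 * 2 * np.mean(err * x)   # d/dw of the mean squared error
    b -= 0.01 * 2 * np.mean(err)       # d/db of the mean squared error

print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```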
Orthogonal regression vs. linear regression: for a dataset $D=\left\{\left(\boldsymbol{x}_1, y_1\right),\left(\boldsymbol{x}_2, y_2\right), \ldots,\left(\boldsymbol{x}_m, y_m\right)\right\}$, if we look for a line such that each point's perpendicular distance to the line is as small as possible, that is orthogonal regression; if instead we make the distance from each point's $y$ value to the line as small as possible, that is linear regression.
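The two criteria can be compared on the same synthetic data. A sketch (the data and noise levels are illustrative): ordinary least squares minimizes vertical distances, while the orthogonal-regression line points along the first principal component of the centered point cloud.

```python
import numpy as np

# Hypothetical 2D data: roughly y = 2x + 1, with noise in both coordinates.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)
x = x + rng.normal(0, 0.5, 50)

xm, ym = x.mean(), y.mean()

# Ordinary linear regression: minimize vertical (y-direction) distances.
w_ols = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)

# Orthogonal regression: minimize perpendicular distances. The best-fit
# direction is the leading singular vector of the centered points.
pts = np.column_stack([x - xm, y - ym])
_, _, vt = np.linalg.svd(pts, full_matrices=False)
dx, dy = vt[0]
w_tls = dy / dx  # slope of the orthogonal-regression line

print(w_ols, w_tls)  # both near 2, but generally not equal
```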
Returning to the main thread: having obtained the mean-squared-error objective, we can solve for the model with the least squares method (model estimation based on minimizing the mean squared error).
Let

$$E_{(w, b)}=\sum_{i=1}^m\left(y_i-w x_i-b\right)^2.$$

Differentiating with respect to $w$ and $b$ gives
$$\begin{aligned} & \frac{\partial E_{(w, b)}}{\partial w}=2\left(w \sum_{i=1}^m x_i^2-\sum_{i=1}^m\left(y_i-b\right) x_i\right), \\ & \frac{\partial E_{(w, b)}}{\partial b}=2\left(m b-\sum_{i=1}^m\left(y_i-w x_i\right)\right). \end{aligned}$$
Setting these derivatives to zero yields the closed-form solutions for the optimal $w$ and $b$:
$$w=\frac{\sum_{i=1}^m y_i\left(x_i-\bar{x}\right)}{\sum_{i=1}^m x_i^2-\frac{1}{m}\left(\sum_{i=1}^m x_i\right)^2},$$
$$b=\frac{1}{m} \sum_{i=1}^m\left(y_i-w x_i\right),$$
where $\bar{x}=\frac{1}{m} \sum_{i=1}^m x_i$ is the mean of $x$.
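These closed-form formulas can be checked numerically. A small sketch on noise-free toy data (the values are chosen for illustration, so the recovered parameters are known):

```python
import numpy as np

# Toy data from a known line y = 3x + 2, so we can check the result.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 2.0

m = len(x)
x_bar = x.mean()

# w = sum_i y_i (x_i - x_bar) / (sum_i x_i^2 - (sum_i x_i)^2 / m)
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / m)
# b = mean of (y_i - w x_i)
b = np.mean(y - w * x)

print(w, b)  # 3.0 2.0 on this noise-free data
```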
When a sample $\boldsymbol{x}$ has multiple attributes, applying the least squares method as in the univariate case, we can analogously obtain for multivariate linear regression
$$\begin{aligned} \left(\boldsymbol{w}^*, b^*\right) & =\underset{(\boldsymbol{w}, b)}{\arg \min } \sum_{i=1}^m\left(f\left(\boldsymbol{x}_i\right)-y_i\right)^2 \\ & =\underset{(\boldsymbol{w}, b)}{\arg \min } \sum_{i=1}^m\left(y_i-f\left(\boldsymbol{x}_i\right)\right)^2 \\ & =\underset{(\boldsymbol{w}, b)}{\arg \min } \sum_{i=1}^m\left(y_i-\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_i+b\right)\right)^2 \end{aligned}$$
For convenience, let $\hat{\boldsymbol{w}}=(\boldsymbol{w} ; b)=\left(w_1 ; \ldots ; w_d ; b\right) \in \mathbb{R}^{(d+1) \times 1}$ (the semicolons indicate stacking into a column vector) and $\hat{\boldsymbol{x}}_i=\left(x_{i1} ; \ldots ; x_{id} ; 1\right) \in \mathbb{R}^{(d+1) \times 1}$. The objective above then simplifies to
$$\begin{aligned} \hat{\boldsymbol{w}}^* & =\underset{\hat{\boldsymbol{w}}}{\arg \min } \sum_{i=1}^m\left(y_i-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_i\right)^2 \\ & =\underset{\hat{\boldsymbol{w}}}{\arg \min } \sum_{i=1}^m\left(y_i-\hat{\boldsymbol{x}}_i^{\mathrm{T}} \hat{\boldsymbol{w}}\right)^2 \end{aligned}$$
where the two forms are equal because $\hat{\boldsymbol{x}}_i$ and $\hat{\boldsymbol{w}}$ are both column vectors, so $\boldsymbol{a}^{\mathrm{T}} \boldsymbol{b}=\boldsymbol{b}^{\mathrm{T}} \boldsymbol{a}$.
By the definition of the vector inner product, the objective can be written as
$$\hat{\boldsymbol{w}}^*=\underset{\hat{\boldsymbol{w}}}{\arg \min }\left[\begin{array}{lll} y_1-\hat{\boldsymbol{x}}_1^{\mathrm{T}} \hat{\boldsymbol{w}} & \cdots & y_m-\hat{\boldsymbol{x}}_m^{\mathrm{T}} \hat{\boldsymbol{w}} \end{array}\right]\left[\begin{array}{c} y_1-\hat{\boldsymbol{x}}_1^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \vdots \\ y_m-\hat{\boldsymbol{x}}_m^{\mathrm{T}} \hat{\boldsymbol{w}} \end{array}\right]$$
where

$$\begin{aligned} {\left[\begin{array}{c} y_1-\hat{\boldsymbol{x}}_1^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \vdots \\ y_m-\hat{\boldsymbol{x}}_m^{\mathrm{T}} \hat{\boldsymbol{w}} \end{array}\right] } & =\left[\begin{array}{c} y_1 \\ \vdots \\ y_m \end{array}\right]-\left[\begin{array}{c} \hat{\boldsymbol{x}}_1^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \vdots \\ \hat{\boldsymbol{x}}_m^{\mathrm{T}} \hat{\boldsymbol{w}} \end{array}\right] \\ & =\boldsymbol{y}-\left[\begin{array}{c} \hat{\boldsymbol{x}}_1^{\mathrm{T}} \\ \vdots \\ \hat{\boldsymbol{x}}_m^{\mathrm{T}} \end{array}\right] \cdot \hat{\boldsymbol{w}} \\ & =\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}} \end{aligned}$$
Therefore

$$\hat{\boldsymbol{w}}^*=\underset{\hat{\boldsymbol{w}}}{\arg \min }(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})$$
Let $E_{\hat{\boldsymbol{w}}}=(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})$. Expanding gives
$$E_{\hat{\boldsymbol{w}}}=\boldsymbol{y}^{\mathrm{T}} \boldsymbol{y}-\boldsymbol{y}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \boldsymbol{y}+\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}$$
Differentiating with respect to $\hat{\boldsymbol{w}}$ gives
$$\frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}=\frac{\partial \boldsymbol{y}^{\mathrm{T}} \boldsymbol{y}}{\partial \hat{\boldsymbol{w}}}-\frac{\partial \boldsymbol{y}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}}-\frac{\partial \hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \boldsymbol{y}}{\partial \hat{\boldsymbol{w}}}+\frac{\partial \hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}}$$
By the matrix differentiation identities

$$\frac{\partial \boldsymbol{a}^{\mathrm{T}} \boldsymbol{x}}{\partial \boldsymbol{x}}=\frac{\partial \boldsymbol{x}^{\mathrm{T}} \boldsymbol{a}}{\partial \boldsymbol{x}}=\boldsymbol{a}, \quad \frac{\partial \boldsymbol{x}^{\mathrm{T}} \mathbf{A} \boldsymbol{x}}{\partial \boldsymbol{x}}=\left(\mathbf{A}+\mathbf{A}^{\mathrm{T}}\right) \boldsymbol{x},$$

we obtain
$$\begin{gathered} \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}=\mathbf{0}-\mathbf{X}^{\mathrm{T}} \boldsymbol{y}-\mathbf{X}^{\mathrm{T}} \boldsymbol{y}+\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}+\mathbf{X}^{\mathrm{T}} \mathbf{X}\right) \hat{\boldsymbol{w}} \\ \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}=2 \mathbf{X}^{\mathrm{T}}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y}) \end{gathered}$$
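This gradient formula can be sanity-checked against central finite differences of $E_{\hat{\boldsymbol{w}}}$ on random data; a quick sketch (not part of the derivation, data are arbitrary):

```python
import numpy as np

# Random problem instance.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
y = rng.normal(size=10)
w = rng.normal(size=4)

# E(w) = (y - Xw)^T (y - Xw) and its analytic gradient 2 X^T (Xw - y).
E = lambda v: (y - X @ v) @ (y - X @ v)
grad = 2 * X.T @ (X @ w - y)

# Central finite differences along each coordinate direction.
eps = 1e-6
grad_fd = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                    for e in np.eye(4)])

print(np.allclose(grad, grad_fd, atol=1e-4))  # True
```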
Setting this gradient to zero, when $\mathbf{X}^{\mathrm{T}} \mathbf{X}$ is a full-rank matrix or a positive definite matrix, we get
$$\hat{\boldsymbol{w}}^*=\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}} \boldsymbol{y},$$
where $\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1}$ is the inverse of the matrix $\mathbf{X}^{\mathrm{T}} \mathbf{X}$. Letting $\hat{\boldsymbol{x}}_i=\left(\boldsymbol{x}_i, 1\right)$, the final learned multivariate linear regression model is
$$f\left(\hat{\boldsymbol{x}}_i\right)=\hat{\boldsymbol{x}}_i^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}} \boldsymbol{y} .$$
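Putting the pieces together, the multivariate closed form amounts to appending a constant-1 column to the data matrix and solving the normal equations. A sketch on synthetic noise-free data (the true parameter values are illustrative); `np.linalg.solve` is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

# Synthetic data with known parameters, so the fit can be checked.
rng = np.random.default_rng(0)
m, d = 100, 3
X_raw = rng.normal(size=(m, d))
w_true = np.array([1.5, -2.0, 0.5])
b_true = 4.0
y = X_raw @ w_true + b_true

# Augment each sample: x̂_i = (x_i; 1), so b becomes the last weight.
X = np.hstack([X_raw, np.ones((m, 1))])

# Solve X^T X ŵ = X^T y, i.e. ŵ* = (X^T X)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat)  # ≈ [1.5, -2.0, 0.5, 4.0]
```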