Linear Regression Model
Overview
Given a dataset $D=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$ with $x^{(i)}\in\mathcal{X}\subseteq\mathbb{R}^n$, $y^{(i)}\in\mathcal{Y}\subseteq\mathbb{R}$, $i=1,2,\cdots,m$, where $x^{(i)}=(x_1^{(i)},x_2^{(i)},\cdots,x_n^{(i)})^T$; $m$ is the number of samples and $n$ is the number of features.
A linear model tries to learn a function that predicts via a linear combination of the attributes, i.e.
$$f(x)=w_1x_1+w_2x_2+\cdots+w_nx_n+b,$$
where $w$ holds the weights and $b$ is the intercept. In vector form this is
$$f(x)=w^Tx+b$$
To simplify the notation, let
$$x^{(i)}=(x_1^{(i)},x_2^{(i)},\cdots,x_n^{(i)},1)^T,\quad w=(w_1,w_2,\cdots,w_n,b)^T,$$
so that the model simplifies to
$$f(x)=\sum_{i=1}^{n+1}w_ix_i=w^Tx$$
(with $x_{n+1}=1$ and $w_{n+1}=b$ under the augmented convention above).
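As a concrete illustration, the augmented form $f(x)=w^Tx$ can be sketched in NumPy (the weights and the sample below are made-up values, not from the text):

```python
import numpy as np

# Hypothetical example: n = 3 features; the weights w_1..w_n and the
# intercept b are folded into one parameter vector, as in the text.
w = np.array([2.0, -1.0, 0.5, 3.0])   # (w_1, w_2, w_3, b)

def predict(x):
    """f(x) = w^T x, with x augmented by a trailing 1 for the intercept."""
    x_aug = np.append(x, 1.0)
    return w @ x_aug

print(predict(np.array([1.0, 2.0, 4.0])))  # 2*1 - 1*2 + 0.5*4 + 3 = 5.0
```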
Our goal is to learn the parameter $w$ from the given dataset $D$: for a sample $x^{(i)}$, the prediction $\hat{y}^{(i)}=w^Tx^{(i)}$ should be as close as possible to the true value $y^{(i)}$. Using the squared loss, the loss of the model on the training set $D$ is
$$L(w)=\sum_{i=1}^m(f(x^{(i)})-y^{(i)})^2=\sum_{i=1}^m(w^Tx^{(i)}-y^{(i)})^2$$
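The loss can be computed directly from this definition; a minimal NumPy sketch with a made-up two-sample dataset:

```python
import numpy as np

def squared_loss(w, X_aug, y):
    """L(w) = sum_i (w^T x^(i) - y^(i))^2, where the rows of X_aug
    already carry the trailing 1 for the intercept."""
    residuals = X_aug @ w - y
    return np.sum(residuals ** 2)

# Tiny made-up dataset: two samples, one feature plus the bias column.
X_aug = np.array([[1.0, 1.0],
                  [2.0, 1.0]])
y = np.array([3.0, 5.0])
# With w = (2, 1) both residuals are zero, so the loss is 0.0.
print(squared_loss(np.array([2.0, 1.0]), X_aug, y))  # 0.0
```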
Our goal thus becomes minimizing this loss. To make the later differentiation cleaner, we fold a factor of $1/2$ into the loss (so from here on $L(w)=\frac{1}{2}\sum_{i=1}^m(w^Tx^{(i)}-y^{(i)})^2$) and seek:
$$w^*=\mathop{\arg\min}_{w}\frac{1}{2}\sum_{i=1}^m(w^Tx^{(i)}-y^{(i)})^2$$
To find the $w$ that minimizes $L(w)$, we can use either of two methods: gradient descent or the normal equation.
Gradient Descent
The idea of gradient descent is: start from a randomly chosen parameter combination $(w_1,w_2,\ldots,w_n,b)$, evaluate the loss, then move to the next parameter combination that decreases the loss the most, and keep doing so until reaching a local minimum. In general, different initial parameter choices may lead to different local minima; for the squared loss here, however, the objective is convex in $w$, so any local minimum is also global.
The gradient descent update rule is:
repeat until convergence {
$$w_j:=w_j-\alpha\frac{\partial}{\partial w_j}L(w)$$
}
To implement this algorithm, the key is the derivative of the loss with respect to $w_j$:
$$\begin{aligned}\frac{\partial}{\partial w_j}L(w)&=\frac{\partial}{\partial w_j}\frac{1}{2}\sum_{i=1}^m(w^Tx^{(i)}-y^{(i)})^2\\&=\frac{\partial}{\partial w_j}\frac{1}{2}\sum_{i=1}^m\bigl(w_1x_1^{(i)}+\cdots+w_jx_j^{(i)}+\cdots+w_{n+1}x_{n+1}^{(i)}-y^{(i)}\bigr)^2\\&=2\cdot\frac{1}{2}\sum_{i=1}^m\bigl(w_1x_1^{(i)}+\cdots+w_{n+1}x_{n+1}^{(i)}-y^{(i)}\bigr)\cdot x_j^{(i)}\\&=\sum_{i=1}^m(w^Tx^{(i)}-y^{(i)})x_j^{(i)}\end{aligned}$$
repeat until convergence {
$$w_j:=w_j+\alpha\sum_{i=1}^m(y^{(i)}-w^Tx^{(i)})x_j^{(i)}\quad\text{(for every }j\text{)}$$
}
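The update above, applied to all $j$ simultaneously, can be sketched in NumPy. The data here is synthetic, generated from known weights so the iteration should recover them; the step size $\alpha$ and iteration count are illustrative choices, not prescribed by the text:

```python
import numpy as np

def gradient_descent(X_aug, y, alpha=0.005, iters=2000):
    """Batch gradient descent for the halved squared loss:
    w := w + alpha * X^T (y - Xw), i.e. the update rule for every j at once."""
    w = np.zeros(X_aug.shape[1])
    for _ in range(iters):
        w += alpha * X_aug.T @ (y - X_aug @ w)
    return w

# Made-up data generated from known weights.
rng = np.random.default_rng(0)
m = 100
X_aug = np.hstack([rng.normal(size=(m, 2)), np.ones((m, 1))])
true_w = np.array([2.0, -3.0, 0.5])   # (w_1, w_2, b)
y = X_aug @ true_w
w = gradient_descent(X_aug, y)
print(np.round(w, 4))
```

With noiseless data the iterates converge to the generating weights, provided $\alpha$ is small enough for the largest eigenvalue of $X^TX$.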
Normal Equation
The normal-equation approach finds the loss-minimizing parameters by solving the equation $\frac{\partial}{\partial w}L(w)=0$ directly.
Matrix Derivatives
Suppose a function $f:\mathbb{R}^{m\times n}\to\mathbb{R}$ maps $m\times n$ matrices to real numbers. For a matrix $A$, its derivative is defined as:
$$\frac{\partial f(A)}{\partial A}=\begin{bmatrix}\frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}}\\\vdots &\ddots&\vdots\\\frac{\partial f}{\partial A_{m1}}&\cdots&\frac{\partial f}{\partial A_{mn}}\end{bmatrix}$$
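This entry-wise definition can be sanity-checked numerically with finite differences. The sketch below uses the $2\times 2$ example function worked through in this section; the values of $A$ are made up:

```python
import numpy as np

def numerical_matrix_grad(f, A, eps=1e-6):
    """Entry-wise central differences: entry (i, j) of the result
    approximates the partial derivative of f with respect to A_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# f(A) = (3/2) A11 + 5 A12^2 + A21 A22, as in the text's example.
f = lambda A: 1.5 * A[0, 0] + 5 * A[0, 1] ** 2 + A[1, 0] * A[1, 1]
A = np.array([[1.0, 2.0], [3.0, 4.0]])
analytic = np.array([[1.5, 10 * A[0, 1]], [A[1, 1], A[1, 0]]])
print(np.allclose(numerical_matrix_grad(f, A), analytic))  # True
```

Since $f$ is quadratic, central differences are exact up to floating-point rounding.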
For example, let $A=\begin{bmatrix}A_{11}&A_{12}\\A_{21}&A_{22}\end{bmatrix}$ be a $2\times2$ matrix, and define $f:\mathbb{R}^{2\times2}\to\mathbb{R}$ by:
$$f(A)=\frac{3}{2}A_{11}+5A_{12}^2+A_{21}A_{22}$$
Then
$$\frac{\partial f(A)}{\partial A}=\begin{bmatrix}\frac{3}{2}&10A_{12}\\A_{22}&A_{21}\end{bmatrix}$$
We also need the trace of a matrix, written $\mathrm{tr}$. For an $n\times n$ square matrix $A$, the trace is defined as the sum of the diagonal entries:
$$\mathrm{tr}\,A=\sum_{i=1}^nA_{ii}$$
If $A$ and $B$ are matrices such that $AB$ is square, the trace satisfies the following identities (in the last three, $a$ is a scalar and the shapes are such that each expression is defined):
$$\mathrm{tr}\,AB=\mathrm{tr}\,BA\\\mathrm{tr}\,ABC=\mathrm{tr}\,CAB=\mathrm{tr}\,BCA\\\mathrm{tr}\,A=\mathrm{tr}\,A^T\\\mathrm{tr}(A+B)=\mathrm{tr}\,A+\mathrm{tr}\,B\\\mathrm{tr}\,aA=a\,\mathrm{tr}\,A$$
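These identities are easy to spot-check numerically on random matrices (a quick NumPy sketch; shapes are chosen so that every product is square where it needs to be):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))     # AB is 3x3 and BA is 4x4, both square
C = rng.normal(size=(3, 3))
D = rng.normal(size=(3, 3))

tr = np.trace
assert np.isclose(tr(A @ B), tr(B @ A))
assert np.isclose(tr(A @ B @ C), tr(C @ A @ B))
assert np.isclose(tr(A @ B @ C), tr(B @ C @ A))
assert np.isclose(tr(C), tr(C.T))
assert np.isclose(tr(C + D), tr(C) + tr(D))
assert np.isclose(tr(2.5 * C), 2.5 * tr(C))
print("all trace identities hold")
```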
We will also use the following matrix-derivative facts:
$$\frac{\partial\,\mathrm{tr}\,AB}{\partial A}=B^T\\\frac{\partial f(A)}{\partial A^T}=\Bigl(\frac{\partial f(A)}{\partial A}\Bigr)^T\\\frac{\partial\,\mathrm{tr}\,ABA^TC}{\partial A}=CAB+C^TAB^T\\\frac{\partial |A|}{\partial A}=|A|(A^{-1})^T$$
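The first and third of these can likewise be verified by finite differences (a sketch with made-up shapes; both functions are at most quadratic in $A$, so central differences are essentially exact):

```python
import numpy as np

def num_grad(f, A, eps=1e-6):
    # Entry-wise central differences over the matrix argument.
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        E = np.zeros_like(A)
        E[idx] = eps
        G[idx] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(2, 3))     # AB is square, so tr(AB) is defined
C = rng.normal(size=(3, 3))
B2 = rng.normal(size=(2, 2))    # for tr(A B A^T C), B must be 2x2 here

# d tr(AB) / dA = B^T
g1 = num_grad(lambda A: np.trace(A @ B), A)
print(np.allclose(g1, B.T, atol=1e-5))

# d tr(A B A^T C) / dA = CAB + C^T A B^T
g2 = num_grad(lambda A: np.trace(A @ B2 @ A.T @ C), A)
print(np.allclose(g2, C @ A @ B2 + C.T @ A @ B2.T, atol=1e-5))
```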
Now we write the loss $L(w)$ in vector form. Let
$$X=\begin{bmatrix}(x^{(1)})^T\\(x^{(2)})^T\\\vdots\\(x^{(m)})^T\end{bmatrix}=\begin{bmatrix}x_1^{(1)}&x_2^{(1)}&\cdots& x_n^{(1)}&1\\x_1^{(2)}&x_2^{(2)}&\cdots&x_n^{(2)}&1\\\vdots&\vdots&\ddots&\vdots&\vdots\\x_1^{(m)}&x_2^{(m)}&\cdots&x_n^{(m)}&1\end{bmatrix},\quad y=\begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix},\quad w=\begin{bmatrix}w_1\\w_2\\\vdots\\w_n\\b\end{bmatrix}$$
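Building the design matrix $X$ from raw samples amounts to appending a column of ones; a small made-up example:

```python
import numpy as np

# Made-up raw samples: m = 4 samples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])
# Append the column of ones so the last weight plays the role of b.
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])
print(X.shape)       # (4, 3)
print(X[:, -1])      # the last column is all ones
```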
Then
$$Xw-y=\begin{bmatrix}f(x^{(1)})-y^{(1)}\\f(x^{(2)})-y^{(2)}\\\vdots\\f(x^{(m)})-y^{(m)}\end{bmatrix},\qquad\frac{1}{2}(Xw-y)^T(Xw-y)=\frac{1}{2}\sum_{i=1}^m(f(x^{(i)})-y^{(i)})^2=L(w)$$
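The equality between the vectorized expression and the elementwise sum can be checked directly on random made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 4
X = np.hstack([rng.normal(size=(m, n)), np.ones((m, 1))])
y = rng.normal(size=m)
w = rng.normal(size=n + 1)

# Vectorized form: (1/2) (Xw - y)^T (Xw - y)
r = X @ w - y
vectorized = 0.5 * (r @ r)

# Elementwise form: (1/2) * sum_i (w^T x^(i) - y^(i))^2
looped = 0.5 * sum((w @ X[i] - y[i]) ** 2 for i in range(m))

print(np.isclose(vectorized, looped))  # True
```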
Differentiating with respect to $w$ (expanding the quadratic; the cross terms are scalars, so each equals its own transpose):
$$\begin{aligned}\frac{\partial}{\partial w}L(w)&=\frac{\partial}{\partial w}\frac{1}{2}(Xw-y)^T(Xw-y)\\&=\frac{1}{2}\frac{\partial}{\partial w}\bigl(w^TX^TXw-w^TX^Ty-y^TXw+y^Ty\bigr)\\&=\frac{1}{2}\bigl(X^TXw+X^TXw-X^Ty-X^Ty\bigr)\\&=X^TXw-X^Ty\end{aligned}$$
Setting this to $0$ gives the normal equation:
$$X^TXw=X^Ty$$
When $X^TX$ is invertible, this yields:
$$w^*=(X^TX)^{-1}X^Ty$$
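In code, this solution is best obtained by solving the linear system $X^TXw=X^Ty$ rather than forming the inverse explicitly; a NumPy sketch on synthetic data with known weights and a little noise:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 200, 3
X = np.hstack([rng.normal(size=(m, n)), np.ones((m, 1))])
true_w = np.array([1.0, -2.0, 0.5, 4.0])     # (w_1, w_2, w_3, b)
y = X @ true_w + 0.01 * rng.normal(size=m)   # small observation noise

# Solve X^T X w = X^T y directly; np.linalg.solve is numerically
# preferable to computing (X^T X)^(-1) and multiplying.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_star, 2))
```

When $X^TX$ is singular or ill-conditioned, `np.linalg.lstsq(X, y, rcond=None)` gives a least-squares solution without requiring invertibility.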
The learned linear regression model is therefore:
$$f(x^{(i)})=(w^*)^Tx^{(i)}$$