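Deriving the formula for the weights $\hat{w}$
Derivation outline
The math symbols in the flowchart do not render correctly; the subheadings below cover the same steps.
Deriving the loss function $E_{\hat{w}}$ by least squares
First, combine $\vec{w}$ and $b$ into $\hat{w}$
The multivariate linear regression model is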
$$f(x_{i}) = \vec{w}^{\mathrm{T}}\vec{x_{i}} + b$$
where $b$ is a scalar and $\vec{w}$ and $\vec{x_{i}}$ are column vectors. Writing out the vectors gives
$$f(x_{i}) =\begin{pmatrix} w_{1}& w_{2} &\cdots & w_{d} \end{pmatrix}\begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{id} \end{pmatrix} + b$$
Expanding the dot product gives
$$f(x_{i}) = w_{1}x_{i1} +w_{2}x_{i2}+\cdots +w_{d}x_{id} + b$$
Writing $b$ as $w_{d+1}$ and multiplying it by 1 gives
$$f(x_{i}) = w_{1}x_{i1} +w_{2}x_{i2}+\cdots +w_{d}x_{id} +w_{d+1}\cdot 1$$
If we treat $w_{d+1}$ as the last element of $\vec{w}^{\mathrm{T}}$ and 1 as the last element of $\vec{x_{i}}$, then
$$f(x_{i}) =\begin{pmatrix} w_{1}& w_{2} &\cdots & w_{d} & w_{d+1} \end{pmatrix}\begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{id}\\ 1 \end{pmatrix}$$
Denote the row vector $\begin{pmatrix} w_{1}& w_{2} &\cdots & w_{d} & w_{d+1} \end{pmatrix}$ by $\hat{w}^{\mathrm{T}}$ and the column vector $\begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{id}\\ 1 \end{pmatrix}$ by $\hat{x_{i}}$. The multivariate linear regression model can then be written as
$$f(\hat{x_{i}}) = \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{i}}}$$
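The bias-absorbing step can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `augment` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def augment(x):
    """Append a constant 1 to a feature vector, giving x_hat."""
    return np.append(x, 1.0)

# Hypothetical values for d = 3 features.
w = np.array([0.5, -1.0, 2.0])   # weight vector w
b = 0.25                         # scalar bias b
x_i = np.array([1.0, 2.0, 3.0])  # one sample x_i

w_hat = np.append(w, b)          # w_hat = (w; b), so w_{d+1} = b
x_hat = augment(x_i)             # x_hat = (x_i; 1)

f_original = w @ x_i + b         # w^T x_i + b
f_augmented = w_hat @ x_hat      # w_hat^T x_hat

print(f_original, f_augmented)
```

Both expressions evaluate the same model; the augmented form simply moves $b$ inside the dot product.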
Next, write down the loss function
$$E_{\hat{w}} = \sum_{i=1}^{m}(y_{i}-f(\vec{\hat{x_{i}}}))^2$$
Substituting $f(\hat{x_{i}}) = \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{i}}}$ gives
$$E_{\hat{w}} = \sum_{i=1}^{m}(y_{i} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{i}}})^2$$
Define the variables needed for vectorization
$$X = \begin{pmatrix} x_{11} &x_{12} &\cdots &x_{1d} & 1 \\ x_{21} &x_{22} &\cdots &x_{2d} & 1\\ \vdots & \vdots & \ddots &\vdots &\vdots \\ x_{m1} &x_{m2} &\cdots &x_{md} & 1 \end{pmatrix} = \begin{pmatrix} \vec{x_{1}}^{\mathrm{T}} & 1\\ \vec{x_{2}}^{\mathrm{T}} & 1\\ \vdots & \vdots\\ \vec{x_{m}}^{\mathrm{T}} & 1 \end{pmatrix} = \begin{pmatrix} \vec{\hat{x_{1}}}^{\mathrm{T}}\\ \vec{\hat{x_{2}}}^{\mathrm{T}}\\ \vdots\\ \vec{\hat{x_{m}}}^{\mathrm{T}} \end{pmatrix}$$
$$\vec{y} = \begin{pmatrix} y_{1} &y_{2} &\cdots & y_{m} \end{pmatrix} ^{\mathrm{T}}$$
$\vec{y}$ is an $m \times 1$ column vector.
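As an illustration of these definitions, the sketch below builds $X$ by appending a column of ones to a raw feature array and stores $\vec{y}$ as an $m \times 1$ column. It assumes NumPy; the function name `design_matrix` and the data are hypothetical.

```python
import numpy as np

def design_matrix(features):
    """Append a column of ones to an (m, d) feature array, giving an (m, d+1) matrix X."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([features, ones])

# Hypothetical data: m = 4 samples, d = 2 features.
features = np.array([[1.0, 2.0],
                     [2.0, 0.5],
                     [3.0, 1.5],
                     [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 6.5]).reshape(-1, 1)  # m x 1 column vector

X = design_matrix(features)   # row i of X is x_hat_i^T
print(X.shape, y.shape)
```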
$$\begin{aligned} E_{\hat{w}} &= \sum_{i=1}^{m}(y_{i} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{i}}})^2 \\&= (y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}})^2 + (y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}})^2 + \cdots + (y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}})^2 \\&= \begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}} &y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}} & \cdots & y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix}\begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix} \end{aligned}$$
Now
$$\begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix} = \begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{m} \end{pmatrix} - \begin{pmatrix} \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix}$$
where $\begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{m} \end{pmatrix}$ is the vector $\vec{y}$ defined above, an $m \times 1$ column vector, and $\begin{pmatrix} \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix}$ is also an $m \times 1$ column vector.
Consider a single element of the second vector, for example $\vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}$. It is a scalar ($\vec{\hat{w}}^{\mathrm{T}}$ is a row vector and $\vec{\hat{x_{1}}}$ is a column vector), and transposing a scalar leaves it unchanged, so
$$\vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}} = (\vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}})^{\mathrm{T}} = \vec{\hat{x_{1}}}^{\mathrm{T}}\vec{\hat{w}}$$
Therefore
$$\begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix} = \begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{m} \end{pmatrix} - \begin{pmatrix} \vec{\hat{x_{1}}}^{\mathrm{T}}\vec{\hat{w}}\\ \vec{\hat{x_{2}}}^{\mathrm{T}}\vec{\hat{w}}\\ \vdots\\ \vec{\hat{x_{m}}}^{\mathrm{T}}\vec{\hat{w}} \end{pmatrix}$$
By the definition of matrix multiplication,
$$\begin{pmatrix} \vec{\hat{x_{1}}}^{\mathrm{T}}\vec{\hat{w}}\\ \vec{\hat{x_{2}}}^{\mathrm{T}}\vec{\hat{w}}\\ \vdots\\ \vec{\hat{x_{m}}}^{\mathrm{T}}\vec{\hat{w}} \end{pmatrix} = \begin{pmatrix} \vec{\hat{x_{1}}}^{\mathrm{T}}\\ \vec{\hat{x_{2}}}^{\mathrm{T}}\\ \vdots\\ \vec{\hat{x_{m}}}^{\mathrm{T}} \end{pmatrix} \cdot \vec{\hat{w}}$$
This holds because $\vec{\hat{w}}$ and $\vec{\hat{x_{i}}}$ are both $(d+1) \times 1$ column vectors and each $\vec{\hat{x_{i}}}^{\mathrm{T}}$ is a $1 \times (d+1)$ row vector, so the stacked matrix is $m \times (d+1)$ and its product with $\vec{\hat{w}}$ is an $m \times 1$ column vector.
Hence
$$\begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix} =\begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{m} \end{pmatrix} - \begin{pmatrix} \vec{\hat{x_{1}}}^{\mathrm{T}}\\ \vec{\hat{x_{2}}}^{\mathrm{T}}\\ \vdots\\ \vec{\hat{x_{m}}}^{\mathrm{T}} \end{pmatrix} \cdot \vec{\hat{w}}$$
where the first vector on the right is the $\vec{y}$ defined above and the stacked matrix is the $X$ defined above.
So
$$\begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}}\\ y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}}\\ \vdots\\ y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix} = \vec{y} - X\vec{\hat{w}}, \qquad \begin{pmatrix} y_{1} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{1}}} &y_{2} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{2}}} & \cdots & y_{m} - \vec{\hat{w}}^{\mathrm{T}}\vec{\hat{x_{m}}} \end{pmatrix} = (\vec{y} - X\vec{\hat{w}})^{\mathrm{T}}$$
and therefore
$$E_{\hat{w}} = (\vec{y} - X\vec{\hat{w}})^{\mathrm{T}}(\vec{y} - X\vec{\hat{w}})$$
This is the expression after $\arg\min$ in Eq. (3.9) of the Watermelon Book (西瓜书).
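The equivalence between the sum form and the matrix form of $E_{\hat{w}}$ can be checked numerically. This is a minimal sketch assuming NumPy; the function names `loss_sum` and `loss_vectorized` and the random data are hypothetical, and $\vec{y}$ is stored as a 1-D array for convenience.

```python
import numpy as np

def loss_sum(X, y, w_hat):
    """E as an explicit sum over samples: sum_i (y_i - w_hat^T x_hat_i)^2."""
    total = 0.0
    for x_hat_i, y_i in zip(X, y):
        total += (y_i - w_hat @ x_hat_i) ** 2
    return total

def loss_vectorized(X, y, w_hat):
    """E in matrix form: (y - X w_hat)^T (y - X w_hat)."""
    r = y - X @ w_hat   # residual vector y - X w_hat
    return r @ r

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])  # m = 5, d = 2
y = rng.normal(size=5)
w_hat = rng.normal(size=3)

print(loss_sum(X, y, w_hat), loss_vectorized(X, y, w_hat))
```

The two printed values should agree up to floating-point rounding.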
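Proving that the loss function $E_{\hat{w}}$ is a convex function of $\hat{w}$
First, the background needed to prove that the loss function is a convex function of $\hat{w}$: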
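Definition of a convex set: let $D \subseteq R^n$, where $R^n$ is the $n$-dimensional vector space. If for any $x,y\in D$ and any $a\in [0,1]$ we have $ax + (1-a)y\in D$, then $D$ is called a convex set.
Geometrically, a set is convex if, whenever two points belong to the set, every point on the line segment joining them also belongs to the set.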
Definition of the gradient: suppose the partial derivatives $\frac{\partial f(\vec{x})}{\partial x_{i}}\ (i = 1,2,\cdots,n)$ of the $n$-variable function $f(\vec{x})$ with respect to each component $x_{i}$ of $\vec{x}=\begin{pmatrix} x_{1} &x_{2} &\cdots & x_{n} \end{pmatrix} ^{\mathrm{T}}$ all exist. Then $f(\vec{x})$ is said to be first-order differentiable at $\vec{x}$, and the vector
$$\nabla f(\vec{x}) = \begin{pmatrix} \frac{\partial f(\vec{x})}{\partial x_{1}}\\ \frac{\partial f(\vec{x})}{\partial x_{2}}\\ \vdots\\ \frac{\partial f(\vec{x})}{\partial x_{n}} \end{pmatrix}$$
is called the first derivative, or gradient, of $f(\vec{x})$ at $\vec{x}$, written $\nabla f(\vec{x})$ (a column vector).
Definition of the Hessian matrix: suppose the second-order partial derivatives $\frac{\partial^2 f(\vec{x})}{\partial x_{i}\partial x_{j}}\ (i = 1,2,\cdots,n;\ j = 1,2,\cdots,n)$ of the $n$-variable function $f(\vec{x})$ with respect to the components of $\vec{x}=\begin{pmatrix} x_{1} &x_{2} &\cdots & x_{n} \end{pmatrix} ^{\mathrm{T}}$ all exist. Then $f(\vec{x})$ is said to be second-order differentiable at $\vec{x}$, and the matrix
$$\nabla^2 f(\vec{x}) = \begin{pmatrix} \frac{\partial^2 f(\vec{x})}{\partial x_{1}^2} &\frac{\partial^2 f(\vec{x})}{\partial x_{1}\partial x_{2}} &\cdots &\frac{\partial^2 f(\vec{x})}{\partial x_{1}\partial x_{n}} \\ \frac{\partial^2 f(\vec{x})}{\partial x_{2}\partial x_{1}} &\frac{\partial^2 f(\vec{x})}{\partial x_{2}^2} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial x_{2}\partial x_{n}}\\ \vdots & \vdots & \ddots &\vdots \\ \frac{\partial^2 f(\vec{x})}{\partial x_{n}\partial x_{1}} &\frac{\partial^2 f(\vec{x})}{\partial x_{n}\partial x_{2}} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial x_{n}^2} \end{pmatrix}$$
is called the second derivative, or Hessian matrix, of $f(\vec{x})$ at $\vec{x}$, written $\nabla^2 f(\vec{x})$. If all second-order partial derivatives of $f(\vec{x})$ are continuous, then $\frac{\partial^2 f(\vec{x})}{\partial x_{i}\partial x_{j}} = \frac{\partial^2 f(\vec{x})}{\partial x_{j}\partial x_{i}}$, and in that case $\nabla^2 f(\vec{x})$ is a symmetric matrix.
Convexity criterion for multivariate real-valued functions: let $D \subset R^n$ be a nonempty open convex set (a subset of the $n$-dimensional vector space), and let $f : D \subset R^n \rightarrow R$ (a mapping from the $n$-dimensional vector space to the real numbers) be twice continuously differentiable on $D$. If the Hessian matrix $\nabla^2 f(\vec{x})$ is positive definite on $D$, then $f(\vec{x})$ is a strictly convex function on $D$.
Sufficiency theorem for convex functions: if $f :R^n \rightarrow R$ is convex and $f(\vec{x})$ is continuously differentiable, then $\vec{x}^*$ is a global minimizer if and only if $\nabla f(\vec{x}^*)=\vec{0}$, where $\nabla f(\vec{x})$ is the first derivative (gradient) of $f(\vec{x})$ with respect to $\vec{x}$.
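To make the convexity criterion concrete, the sketch below takes a small quadratic function whose Hessian is positive definite and checks the usual convexity inequality $f(ax + (1-a)y) \le a f(x) + (1-a) f(y)$ on random points. The inequality is the standard definition of a convex function, which the text does not spell out; the matrix `A` and the sampling scheme are hypothetical, and NumPy is assumed.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric, positive definite (hypothetical example)

def f(x):
    """Quadratic f(x) = x^T A x; its Hessian is A + A^T = 2A."""
    return x @ A @ x

hessian_eigenvalues = np.linalg.eigvalsh(2 * A)   # all positive for this A

rng = np.random.default_rng(2)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    a = rng.uniform(0.0, 1.0)
    lhs = f(a * x + (1 - a) * y)          # value at a point on the segment
    rhs = a * f(x) + (1 - a) * f(y)       # same combination of endpoint values
    if lhs > rhs + 1e-12:                 # small tolerance for rounding
        violations += 1

print(hessian_eigenvalues, violations)
```

A random check like this cannot prove convexity; it only illustrates what the theorem guarantees.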
Matrix calculus: the following three facts are needed
- The [scalar-by-vector] derivative is
$$\frac{\partial y}{\partial \vec{x}} = \begin{pmatrix} \frac{\partial y}{\partial x_{1}}\\ \frac{\partial y}{\partial x_{2}}\\ \vdots\\ \frac{\partial y}{\partial x_{n}} \end{pmatrix}$$
in denominator layout (used by default), or
$$\frac{\partial y}{\partial \vec{x}} = \begin{pmatrix} \frac{\partial y}{\partial x_{1}} &\frac{\partial y}{\partial x_{2}} & \cdots & \frac{\partial y}{\partial x_{n}} \end{pmatrix}$$
in numerator layout, where $\vec{x}=\begin{pmatrix} x_{1} &x_{2} &\cdots & x_{n} \end{pmatrix} ^{\mathrm{T}}$ is an $n$-dimensional column vector and $y$ is a scalar function of the $n$ components of $\vec{x}$.
- From the [scalar-by-vector] formula we can derive
$$\frac{\partial \vec{x}^{\mathrm{T}}\vec{a} }{\partial \vec{x}} = \frac{\partial \vec{a}^{\mathrm{T}}\vec{x} }{\partial \vec{x}} = \begin{pmatrix} \frac{\partial (a_{1}x_{1} + a_{2}x_{2} + \cdots + a_{n}x_{n} )}{\partial x_{1}}\\ \frac{\partial (a_{1}x_{1} + a_{2}x_{2} + \cdots + a_{n}x_{n} )}{\partial x_{2}}\\ \vdots\\ \frac{\partial (a_{1}x_{1} + a_{2}x_{2} + \cdots + a_{n}x_{n} )}{\partial x_{n}} \end{pmatrix} = \begin{pmatrix} a_{1}\\ a_{2}\\ \vdots\\ a_{n} \end{pmatrix} = \vec{a}$$
- Similarly, we can derive
$$\frac{\partial \vec{x}^{\mathrm{T}}B\vec{x}}{\partial \vec{x}} = (B+B^{\mathrm{T}})\vec{x}$$
The proof is as follows. Expanding the quadratic form,
$$\begin{aligned} \vec{x}^{\mathrm{T}}B\vec{x} &= \begin{pmatrix} x_{1} & x_{2} &\cdots & x_{n} \end{pmatrix}\begin{pmatrix} b_{11} & b_{12} & \cdots &b_{1n} \\ b_{21} &b_{22} &\cdots &b_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ b_{n1} &b_{n2} &\cdots & b_{nn} \end{pmatrix} \begin{pmatrix} x_{1}\\ x_{2}\\ \vdots\\ x_{n} \end{pmatrix} \\&= (b_{11}x_{1} + b_{21}x_{2} + \cdots + b_{n1}x_{n})x_{1} + (b_{12}x_{1} + b_{22}x_{2} + \cdots + b_{n2}x_{n})x_{2} + \cdots + (b_{1n}x_{1} + b_{2n}x_{2} + \cdots + b_{nn}x_{n})x_{n} \end{aligned}$$
By the [scalar-by-vector] formula,
$$\frac{\partial \vec{x}^{\mathrm{T}}B\vec{x}}{\partial \vec{x}} = \begin{pmatrix} \frac{\partial (\vec{x}^{\mathrm{T}}B\vec{x})}{\partial x_{1}}\\ \frac{\partial (\vec{x}^{\mathrm{T}}B\vec{x})}{\partial x_{2}}\\ \vdots\\ \frac{\partial (\vec{x}^{\mathrm{T}}B\vec{x})}{\partial x_{n}} \end{pmatrix}$$
Consider the first component alone and differentiate the expansion term by term with respect to $x_{1}$:
$$\frac{\partial }{\partial x_{1}}\left[(b_{11}x_{1} + b_{21}x_{2} + \cdots + b_{n1}x_{n})x_{1}\right] = 2b_{11}x_{1} + b_{21}x_{2} + \cdots + b_{n1}x_{n}$$
$$\frac{\partial }{\partial x_{1}}\left[(b_{12}x_{1} + b_{22}x_{2} + \cdots + b_{n2}x_{n})x_{2}\right] = b_{12}x_{2}$$
$$\cdots$$
$$\frac{\partial }{\partial x_{1}}\left[(b_{1n}x_{1} + b_{2n}x_{2} + \cdots + b_{nn}x_{n})x_{n}\right] = b_{1n}x_{n}$$
Adding these up,
$$\frac{\partial (\vec{x}^{\mathrm{T}}B\vec{x})}{\partial x_{1}} = (b_{11} + b_{11})x_{1} + (b_{21} +b_{12})x_{2} + \cdots + (b_{n1}+ b_{1n})x_{n}$$
The other components follow in the same way, so
$$\begin{aligned} \frac{\partial \vec{x}^{\mathrm{T}}B\vec{x}}{\partial \vec{x}} &= \begin{pmatrix} (b_{11} + b_{11})x_{1} + (b_{21} +b_{12})x_{2} + \cdots + (b_{n1}+ b_{1n})x_{n}\\ (b_{12} + b_{21})x_{1} + (b_{22} +b_{22})x_{2} + \cdots + (b_{n2}+ b_{2n})x_{n}\\ \vdots\\ (b_{1n} + b_{n1})x_{1} + (b_{2n} +b_{n2})x_{2} + \cdots + (b_{nn}+ b_{nn})x_{n} \end{pmatrix} \\&= (B+B^{\mathrm{T}})\vec{x} \end{aligned}$$
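The two derivative identities above can be sanity-checked with finite differences. The following sketch assumes NumPy; `numerical_gradient` is a hypothetical helper using central differences, and `a`, `B`, and `x` are random test values. `B` is deliberately not symmetric, so the check distinguishes $(B+B^{\mathrm{T}})\vec{x}$ from $2B\vec{x}$.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
n = 4
a = rng.normal(size=n)
B = rng.normal(size=(n, n))   # deliberately not symmetric
x = rng.normal(size=n)

grad_linear = numerical_gradient(lambda v: a @ v, x)         # should match a
grad_quadratic = numerical_gradient(lambda v: v @ B @ v, x)  # should match (B + B^T) x

print(np.allclose(grad_linear, a, atol=1e-5))
print(np.allclose(grad_quadratic, (B + B.T) @ x, atol=1e-5))
```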
Next we prove that the loss function is a convex function of $\hat{w}$. Applying the two derivative identities above, with $\vec{a} = X^{\mathrm{T}}\vec{y}$ and $B = X^{\mathrm{T}}X$:
$$\begin{aligned} \frac{\partial E_{\hat{w}}}{\partial \hat{w}} &= \frac{\partial }{\partial \hat{w}}[ (\vec{y} - X\vec{\hat{w}})^{\mathrm{T}}(\vec{y} - X\vec{\hat{w}})] \\&= \frac{\partial }{\partial \hat{w}}[ (\vec{y}^{\mathrm{T}} - \vec{\hat{w}}^{\mathrm{T}}X^{\mathrm{T}})(\vec{y} - X\vec{\hat{w}})] \\&= \frac{\partial }{\partial \hat{w}}[ \vec{y}^{\mathrm{T}}\vec{y} - \vec{y}^{\mathrm{T}}X\vec{\hat{w}} - \vec{\hat{w}}^{\mathrm{T}}X^{\mathrm{T}}\vec{y} + \vec{\hat{w}}^{\mathrm{T}}X^{\mathrm{T}}X\vec{\hat{w}}] \\&= -X^{\mathrm{T}}\vec{y} -X^{\mathrm{T}}\vec{y} + (X^{\mathrm{T}}X + X^{\mathrm{T}}X)\vec{\hat{w}} \\&= 2X^{\mathrm{T}}(X\vec{\hat{w}} - \vec{y}) \end{aligned}$$
This is Eq. (3.10) in the Watermelon Book (西瓜书).
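The closed-form gradient $2X^{\mathrm{T}}(X\vec{\hat{w}} - \vec{y})$ can be compared against a finite-difference estimate of the loss. This is a sketch under stated assumptions: NumPy is available, the function names are hypothetical, and the data are random.

```python
import numpy as np

def loss(X, y, w_hat):
    """E = (y - X w_hat)^T (y - X w_hat)."""
    r = y - X @ w_hat
    return r @ r

def loss_gradient(X, y, w_hat):
    """Closed-form gradient 2 X^T (X w_hat - y)."""
    return 2 * X.T @ (X @ w_hat - y)

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(4)
X = np.hstack([rng.normal(size=(6, 2)), np.ones((6, 1))])  # m = 6, d = 2
y = rng.normal(size=6)
w_hat = rng.normal(size=3)

analytic = loss_gradient(X, y, w_hat)
numeric = numerical_gradient(lambda v: loss(X, y, v), w_hat)
print(np.allclose(analytic, numeric, atol=1e-4))
```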
$$\frac{\partial^2 E_{\hat{w}}}{\partial \vec{\hat{w}}\partial \vec{\hat{w}}^{\mathrm{T}} } = \frac{\partial }{\partial \vec{\hat{w}}}\left(\frac{\partial E_{\hat{w}}}{\partial \vec{\hat{w}}}\right) =\frac{\partial }{\partial \vec{\hat{w}}}\left [ 2X^{\mathrm{T}}(X\vec{\hat{w}} - \vec{y}) \right ] = 2X^{\mathrm{T}}X$$
which is the Hessian matrix.
Here we assume that $X^{\mathrm{T}}X$ is positive definite. The Hessian $2X^{\mathrm{T}}X$ is then positive definite as well, so by the convexity criterion above $E_{\hat{w}}$ is a strictly convex function of $\hat{w}$.
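Whether the positive-definiteness assumption holds depends on the data: $X^{\mathrm{T}}X$ is positive definite exactly when the columns of $X$ are linearly independent. The sketch below, assuming NumPy, tests this through the eigenvalues of $X^{\mathrm{T}}X$; the helper `is_positive_definite`, its tolerance, and both example matrices are hypothetical.

```python
import numpy as np

def is_positive_definite(A, rel_tol=1e-10):
    """Check a symmetric matrix by requiring its smallest eigenvalue to be clearly positive."""
    eigenvalues = np.linalg.eigvalsh(A)   # ascending order for symmetric A
    return bool(eigenvalues[0] > rel_tol * eigenvalues[-1])

# Linearly independent columns: X^T X is positive definite.
X_full = np.array([[1.0, 2.0, 1.0],
                   [2.0, 0.5, 1.0],
                   [3.0, 1.5, 1.0],
                   [4.0, 3.0, 1.0]])

# Second feature is exactly twice the first: columns are linearly dependent.
X_deficient = np.array([[1.0, 2.0, 1.0],
                        [2.0, 4.0, 1.0],
                        [3.0, 6.0, 1.0],
                        [4.0, 8.0, 1.0]])

print(is_positive_definite(X_full.T @ X_full))
print(is_positive_definite(X_deficient.T @ X_deficient))
```

When the check fails, $X^{\mathrm{T}}X$ is singular and the inverse used in the closed-form solution does not exist.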
Take the first-order partial derivative of the loss function $E_{\hat{w}}$ with respect to $\hat{w}$
$$\frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2X^{\mathrm{T}}(X\vec{\hat{w}} - \vec{y})$$
Set the first-order partial derivative equal to $\vec{0}$ and solve for $\hat{w}$
$$\frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2X^{\mathrm{T}}(X\vec{\hat{w}} - \vec{y}) = \vec{0}$$
Expanding the parentheses gives
$$2X^{\mathrm{T}}X\vec{\hat{w}} - 2X^{\mathrm{T}}\vec{y} = \vec{0}$$
Moving the second term to the right-hand side,
$$2X^{\mathrm{T}}X\vec{\hat{w}} = 2X^{\mathrm{T}}\vec{y}$$
Dividing both sides by 2,
$$X^{\mathrm{T}}X\vec{\hat{w}} = X^{\mathrm{T}}\vec{y}$$
Left-multiplying both sides by $(X^{\mathrm{T}}X)^{-1}$, which exists because $X^{\mathrm{T}}X$ is assumed positive definite, gives
$$\vec{\hat{w}}^{*} = (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\vec{y}$$
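The final formula can be tried on synthetic data. The following is a minimal sketch, assuming NumPy and an invertible $X^{\mathrm{T}}X$; the function name `fit_normal_equation`, the "true" weights, and the noise-free targets are hypothetical. It solves the linear system $X^{\mathrm{T}}X\vec{\hat{w}} = X^{\mathrm{T}}\vec{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a numerical design choice and not part of the derivation.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve X^T X w_hat = X^T y for w_hat (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(3)
m, d = 50, 2
features = rng.normal(size=(m, d))
X = np.hstack([features, np.ones((m, 1))])   # append the column of ones
true_w_hat = np.array([2.0, -1.0, 0.5])      # hypothetical (w_1, w_2, b)
y = X @ true_w_hat                           # noise-free targets

w_star = fit_normal_equation(X, y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # NumPy's least-squares solver, for comparison
gradient_at_solution = 2 * X.T @ (X @ w_star - y)

print(w_star)
```

With noise-free targets the recovered weights should match `true_w_hat` up to rounding, and the gradient at the solution should be numerically zero.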