Suppose we have data $(X, Y)$, where $X \in \mathbb{R}^{m \times d}$, $Y \in \mathbb{R}^{m \times 1}$, $m$ is the number of samples, and $d$ is the number of features. Consider the least-squares solution
$$\theta_0 = \left( X^{\rm T} X \right)^{-1} X^{\rm T} Y = \Sigma_0^{-1} X^{\rm T} Y \tag{1}$$
$$\Rightarrow \Sigma_0 \theta_0 = X^{\rm T} Y \tag{2}$$
When new data $\left( X_1, Y_1 \right)$ arrive, we update the model to obtain the new regression coefficients
$$\begin{aligned} \theta_1 &= \left( \begin{bmatrix} X \\ X_1 \end{bmatrix}^{\rm T} \begin{bmatrix} X \\ X_1 \end{bmatrix} \right)^{-1} \begin{bmatrix} X \\ X_1 \end{bmatrix}^{\rm T} \begin{bmatrix} Y \\ Y_1 \end{bmatrix} \\ &= \Sigma_1^{-1} \begin{bmatrix} X \\ X_1 \end{bmatrix}^{\rm T} \begin{bmatrix} Y \\ Y_1 \end{bmatrix} \end{aligned} \tag{3}$$
where
$$\begin{aligned} \Sigma_1 &= \begin{bmatrix} X \\ X_1 \end{bmatrix}^{\rm T} \begin{bmatrix} X \\ X_1 \end{bmatrix} \\ &= X^{\rm T} X + X_1^{\rm T} X_1 \\ &= \Sigma_0 + X_1^{\rm T} X_1 \end{aligned} \tag{4}$$
$$\Rightarrow \Sigma_0 = \Sigma_1 - X_1^{\rm T} X_1 \tag{5}$$
From the result of Eq. (4), induction gives
$$\Sigma_k = \Sigma_{k-1} + X_k^{\rm T} X_k \tag{6}$$
$$\begin{aligned} \begin{bmatrix} X \\ X_1 \end{bmatrix}^{\rm T} \begin{bmatrix} Y \\ Y_1 \end{bmatrix} &= X^{\rm T} Y + X_1^{\rm T} Y_1 \\ &= \Sigma_0 \theta_0 + X_1^{\rm T} Y_1 \quad \text{// substitute Eq. (2)} \\ &= \left( \Sigma_1 - X_1^{\rm T} X_1 \right) \theta_0 + X_1^{\rm T} Y_1 \quad \text{// substitute Eq. (5)} \\ &= \Sigma_1 \theta_0 + X_1^{\rm T} \left( Y_1 - X_1 \theta_0 \right) \end{aligned} \tag{7}$$
Substituting Eq. (7) back into Eq. (3):
$$\begin{aligned} \theta_1 &= \Sigma_1^{-1} \left( \Sigma_1 \theta_0 + X_1^{\rm T} \left( Y_1 - X_1 \theta_0 \right) \right) \\ &= \theta_0 + \Sigma_1^{-1} X_1^{\rm T} \left( Y_1 - X_1 \theta_0 \right) \end{aligned} \tag{8}$$
From the result of Eq. (8), induction gives
$$\theta_k = \theta_{k-1} + \Sigma_k^{-1} X_k^{\rm T} \left( Y_k - X_k \theta_{k-1} \right) \tag{9}$$
At this point we can already compute the least-squares solution recursively. The procedure, which we call Algorithm 1, is:
- Update $\Sigma_k = \Sigma_{k-1} + X_k^{\rm T} X_k$ according to Eq. (6);
- Update $\theta_k = \theta_{k-1} + \Sigma_k^{-1} X_k^{\rm T} \left( Y_k - X_k \theta_{k-1} \right)$ according to Eq. (9).
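Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not production code; the function name `rls_naive_step` and the initialization from the first $d$ samples are my own choices, made so that $\Sigma$ starts out invertible.

```python
import numpy as np

def rls_naive_step(Sigma, theta, Xk, Yk):
    """One step of Algorithm 1. Xk: (1, d) row-vector sample, Yk: (1, 1) target."""
    Sigma = Sigma + Xk.T @ Xk  # Eq. (6)
    theta = theta + np.linalg.inv(Sigma) @ Xk.T @ (Yk - Xk @ theta)  # Eq. (9)
    return Sigma, theta

# Demo: the recursion reproduces the batch least-squares solution exactly.
rng = np.random.default_rng(0)
d = 3
X = rng.normal(size=(50, d))
Y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.01 * rng.normal(size=(50, 1))

# Initialize from the first d samples so Sigma is invertible.
Sigma = X[:d].T @ X[:d]
theta = np.linalg.inv(Sigma) @ X[:d].T @ Y[:d]
for k in range(d, 50):
    Sigma, theta = rls_naive_step(Sigma, theta, X[k:k+1], Y[k:k+1])

theta_batch = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(theta, theta_batch))  # True: the recursion is exact algebra
```

Note that each step calls `np.linalg.inv` on $\Sigma_k$, which is exactly the cost the rest of the article works to remove.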
However, this procedure has two problems:
- Inverting the matrix $\Sigma_k$ is computationally expensive. Can we avoid inverting $\Sigma_k$ during the recursion and update its inverse directly instead?
- The entries of $\Sigma_k$ keep growing as more data arrive, which may cause numerical overflow.
To address these problems we reshape the formulas further, using the Sherman-Morrison-Woodbury identity:
$$\left( A + U V^{\rm T} \right)^{-1} = A^{-1} - A^{-1} U \left( I + V^{\rm T} A^{-1} U \right)^{-1} V^{\rm T} A^{-1}$$
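The identity is easy to sanity-check numerically. The sketch below builds a symmetric positive-definite $A$ (so it is guaranteed invertible) and compares both sides; the dimensions and random seed are arbitrary choices for illustration.

```python
import numpy as np

# Numeric check of (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
rng = np.random.default_rng(1)
d, r = 5, 2
B = rng.normal(size=(d, d))
A = B @ B.T + d * np.eye(d)  # symmetric positive definite, hence invertible
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))

lhs = np.linalg.inv(A + U @ V.T)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.eye(r) + V.T @ Ainv @ U) @ V.T @ Ainv
print(np.allclose(lhs, rhs))  # True
```

The payoff is that the inner inverse is $r \times r$ rather than $d \times d$; in the rank-one update below, $r = 1$ and the inverse degenerates to a scalar reciprocal.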
The inverse of Eq. (6) can then be written as
$$\begin{aligned} \Sigma_k^{-1} &= \left( \Sigma_{k-1} + X_k^{\rm T} X_k \right)^{-1} \\ &= \Sigma_{k-1}^{-1} - \Sigma_{k-1}^{-1} X_k^{\rm T} \left( I + X_k \Sigma_{k-1}^{-1} X_k^{\rm T} \right)^{-1} X_k \Sigma_{k-1}^{-1} \end{aligned} \tag{10}$$
Let $P_k = \Sigma_k^{-1}$; Eq. (10) becomes:
$$P_k = P_{k-1} - P_{k-1} X_k^{\rm T} \left( I + X_k P_{k-1} X_k^{\rm T} \right)^{-1} X_k P_{k-1} \tag{11}$$
and Eq. (9) becomes:
$$\theta_k = \theta_{k-1} + P_k X_k^{\rm T} \left( Y_k - X_k \theta_{k-1} \right) \tag{12}$$
Note that Eq. (11) still contains an inverse of $I + X_k P_{k-1} X_k^{\rm T}$, so problem 1 seems unresolved: we avoided inverting $\Sigma_k$ only to introduce a new inverse. In fact, if the data arrive one sample at a time, $X_k$ is a row vector (throughout this article a sample is represented as a row vector, because each row of the data matrix is one sample), so $I + X_k P_{k-1} X_k^{\rm T}$ evaluates to a scalar. No matrix inversion is needed; we simply take its reciprocal, i.e.
$$P_k = P_{k-1} - \frac{P_{k-1} X_k^{\rm T} X_k P_{k-1}}{1 + X_k P_{k-1} X_k^{\rm T}} \tag{13}$$
This gives a new recursive procedure, which we call Algorithm 2:
- Update $P_k = P_{k-1} - \dfrac{P_{k-1} X_k^{\rm T} X_k P_{k-1}}{1 + X_k P_{k-1} X_k^{\rm T}}$ according to Eq. (13);
- Update $\theta_k = \theta_{k-1} + P_k X_k^{\rm T} \left( Y_k - X_k \theta_{k-1} \right)$ according to Eq. (12).
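Algorithm 2 can be sketched as follows. The function name `rls_step` and the common initialization $\theta_0 = 0$, $P_0 = cI$ with large $c$ are illustrative choices, not from the derivation above; the demo checks the result against the batch solution on noiseless data.

```python
import numpy as np

def rls_step(P, theta, xk, yk):
    """One step of Algorithm 2, single-sample case.

    xk: (1, d) row vector, yk: (1, 1). Since xk is a row vector,
    1 + xk P xk^T is a scalar: Eq. (13) needs only a scalar division.
    """
    denom = 1.0 + (xk @ P @ xk.T).item()
    P = P - (P @ xk.T @ xk @ P) / denom            # Eq. (13)
    theta = theta + P @ xk.T @ (yk - xk @ theta)   # Eq. (12)
    return P, theta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Y = X @ np.array([[2.0], [0.0], [-1.0], [3.0]])

P = 1e6 * np.eye(4)        # large P0 ~ "almost no prior"
theta = np.zeros((4, 1))
for k in range(100):
    P, theta = rls_step(P, theta, X[k:k+1], Y[k:k+1])

theta_batch = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(theta, theta_batch, atol=1e-4))  # True
```

Initializing $P_0 = cI$ corresponds to starting from $\Sigma_0 = c^{-1}I$, i.e. a tiny ridge penalty, which is why the match to the batch solution is near-exact rather than exact.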
The recursive algorithm in some textbooks may not take exactly this form, so let us tidy the procedure a bit further. In some books, $K_k = P_k X_k^{\rm T}$ is called the gain, and $Y_k - X_k \theta_{k-1}$ is called the innovation, i.e. the new information introduced by the sample.
$$\begin{aligned} K_k &= P_k X_k^{\rm T} \\ &= \left( P_{k-1} - P_{k-1} X_k^{\rm T} \left( I + X_k P_{k-1} X_k^{\rm T} \right)^{-1} X_k P_{k-1} \right) X_k^{\rm T} \quad \text{// substitute Eq. (11)} \\ &= P_{k-1} X_k^{\rm T} \left( I - \left( I + X_k P_{k-1} X_k^{\rm T} \right)^{-1} X_k P_{k-1} X_k^{\rm T} \right) \\ &= P_{k-1} X_k^{\rm T} \left( I + X_k P_{k-1} X_k^{\rm T} \right)^{-1} \left( \left( I + X_k P_{k-1} X_k^{\rm T} \right) - X_k P_{k-1} X_k^{\rm T} \right) \\ &= P_{k-1} X_k^{\rm T} \left( I + X_k P_{k-1} X_k^{\rm T} \right)^{-1} \end{aligned} \tag{14}$$
Substituting Eq. (14) into Eq. (11) gives
$$P_k = P_{k-1} - K_k X_k P_{k-1} = \left( I - K_k X_k \right) P_{k-1} \tag{15}$$
Algorithm 2 can therefore be rewritten in the following form, which we call Algorithm 3:
- Update the gain $K_k = P_{k-1} X_k^{\rm T} \left( I + X_k P_{k-1} X_k^{\rm T} \right)^{-1}$ according to Eq. (14);
- Update $P_k = \left( I - K_k X_k \right) P_{k-1}$ according to Eq. (15);
- Update the regression coefficients $\theta_k = \theta_{k-1} + K_k \left( Y_k - X_k \theta_{k-1} \right)$.
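The three steps of Algorithm 3 translate directly into code. As before, `rls_gain_step` and the $P_0 = cI$, $\theta_0 = 0$ initialization are illustrative; for a single sample the inverted quantity in the gain is a scalar, so the division replaces the matrix inverse.

```python
import numpy as np

def rls_gain_step(P, theta, xk, yk):
    """One step of Algorithm 3 (gain form), single-sample case."""
    # Gain K_k = P_{k-1} x_k^T (I + x_k P_{k-1} x_k^T)^{-1}, Eq. (14);
    # the inverted term is a scalar because xk is a row vector.
    K = P @ xk.T / (1.0 + (xk @ P @ xk.T).item())
    P = (np.eye(P.shape[0]) - K @ xk) @ P           # Eq. (15)
    theta = theta + K @ (yk - xk @ theta)           # gain times innovation
    return P, theta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
Y = X @ np.array([[1.0], [-1.0], [0.5]])

P = 1e6 * np.eye(3)
theta = np.zeros((3, 1))
for k in range(200):
    P, theta = rls_gain_step(P, theta, X[k:k+1], Y[k:k+1])

theta_batch = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(theta, theta_batch, atol=1e-4))  # True
```

Algorithms 2 and 3 are algebraically identical; the gain form is simply the arrangement most often seen in textbook treatments of recursive least squares and the Kalman filter.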