This is a consolidation of several references on the method of least squares; corrections are welcome!
Original article 1
Original article 2
Linear regression
Linear regression assumes a linear relationship between the features and the results in a dataset:
$$y = mx + c$$
Here y is the result, x is the feature, m is the slope, and c is the intercept.
We need to find the m and c for which m*x + c is as close as possible to the true y. The error between the estimate and the true value is measured with the squared difference (using the raw difference alone, negative errors could cancel positive ones). The function that measures the error between the true and predicted values is called the square loss function; writing the loss as L, we have:
$$L_n = (y_n-(mx_n+c))^2$$
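As a quick illustration, the per-point squared loss translates directly into code; the numbers below are made up for the example:

```python
# Squared loss for one data point under a candidate line y = m*x + c.
def squared_loss(y_true, x, m, c):
    residual = y_true - (m * x + c)
    return residual ** 2

# True y = 5, prediction 2*2 + 0 = 4, so the loss is (5-4)^2 = 1.
print(squared_loss(5.0, 2.0, 2.0, 0.0))  # -> 1.0
```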
The average loss over the whole dataset is:
$$L=\frac{1}{N} \sum_{n=1}^{N}L_n(y_n,f(x_n;c,m))$$
We want the best-matching m and c, i.e. those that minimize L; in mathematical notation:
$$\underset{m,c}{\arg\min}\ \frac{1}{N}\sum_{n=1}^{N}L_n(y_n;c,m)$$
The method of least squares finds the optimum of an objective function by minimizing the sum of squared errors, hence its alternative name, the least-squares method. Below we use it to find the optimal linear regression parameters.
Least squares
The dataset consists of N points, each a pair {x, y}, where x is the feature and y the observed result. Define the linear regression model as:
$$f(x;m,c)=mx+c$$
The average loss function is:
$$\begin{aligned} L &=\frac{1}{N}\sum_{n=1}^{N}L_n(y_n,f(x_n;c,m))\\ &=\frac{1}{N}\sum_{n=1}^{N}(y_n-f(x_n;c,m))^2\\ &=\frac{1}{N}\sum_{n=1}^{N}(y_n-(c+mx_n))^2\\ &=\frac{1}{N}\sum_{n=1}^{N}(y_n-c-mx_n)(y_n-c-mx_n)\\ &=\frac{1}{N}\sum_{n=1}^{N}(y_n^2-2y_nc-2y_nmx_n+c^2+2cmx_n+m^2x_n^2)\\ &=\frac{1}{N}\sum_{n=1}^{N}(y_n^2-2y_nc+2mx_n(c-y_n)+c^2+m^2x_n^2) \end{aligned}$$
To minimize L, its partial derivatives with respect to c and m must be zero. We therefore compute each partial derivative, set it to zero, and solve for c and m; the resulting c and m give the best-fitting model.
Partial derivative with respect to c:
Since we are differentiating with respect to c, drop the terms of L that do not contain c, leaving:
$$\frac{1}{N}\sum_{n=1}^{N}(c^2-2y_nc+2cmx_n)$$
Moving factors that do not carry the index n outside the sums gives:
$$c^2+2cm\frac{1}{N}\left(\sum_{n=1}^{N}x_n\right)-2c\frac{1}{N}\left(\sum_{n=1}^{N}y_n\right)$$
Differentiating with respect to c:
$$\frac{\partial L }{\partial c}=2c+2m\frac{1}{N}\left(\sum_{n=1}^{N}x_n\right)-\frac{2}{N}\left(\sum_{n=1}^{N}y_n\right)$$
Partial derivative with respect to m:
Since we are differentiating with respect to m, drop the terms of L that do not contain m, leaving:
$$\frac{1}{N}\sum_{n=1}^{N}(m^2x_n^2-2y_nmx_n+2cmx_n)$$
Moving factors that do not carry the index n outside the sums gives:
$$m^2\frac{1}{N}\sum_{n=1}^{N}x_n^2+2m\frac{1}{N}\sum_{n=1}^{N}x_n(c-y_n)$$
Differentiating with respect to m:
$$\frac{\partial L }{\partial m}=2m\frac{1}{N}\sum_{n=1}^{N}x_n^2+\frac{2}{N}\sum_{n=1}^{N}x_n(c-y_n)$$
Solving for m and c:
Set the partial derivative with respect to c to zero and solve:
$$2c+2m\frac{1}{N}\left(\sum_{n=1}^{N}x_n\right)-\frac{2}{N}\left(\sum_{n=1}^{N}y_n\right)=0$$
$$2c=\frac{2}{N}\left(\sum_{n=1}^{N}y_n\right)-2m\frac{1}{N}\left(\sum_{n=1}^{N}x_n\right)$$
$$c=\frac{1}{N}\left(\sum_{n=1}^{N}y_n\right)-m\frac{1}{N}\left(\sum_{n=1}^{N}x_n\right)$$
As this solution shows, the expression contains two averages:
$$\overline{x}=\frac{1}{N}\sum_{n=1}^{N}x_n,\qquad \overline{y}=\frac{1}{N}\sum_{n=1}^{N}y_n$$
Then:
$$c=\overline{y}-m\overline{x}$$
Set the partial derivative with respect to m to zero and solve:
$$2m\frac{1}{N}\sum_{n=1}^{N}x_n^2+\frac{2}{N}\sum_{n=1}^{N}x_n(c-y_n)=0$$
Substituting the expression for c in terms of the averages gives:
$$m\frac{1}{N}\sum_{n=1}^{N}x_n^2+\frac{1}{N}\sum_{n=1}^{N}x_n(\overline{y}-m\overline{x}-y_n)=0$$
$$m\left(\frac{1}{N}\sum_{n=1}^{N}x_n^2-\frac{1}{N}\overline{x}\sum_{n=1}^{N}x_n\right)=\frac{1}{N}\sum_{n=1}^{N}(x_ny_n-x_n\overline{y})$$
Let:
$$\overline{x^2} =\frac{1}{N}\sum_{n=1}^{N}x_n^2, \qquad \overline{xy}=\frac{1}{N}\sum_{n=1}^{N}x_ny_n$$
Then:
$$m=\frac{\overline{xy}-\overline{x}\,\overline{y}}{\overline{x^2}-\overline{x}^2}$$
At this point both m and c have been computed.
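The closed-form expressions for m and c above translate directly into code. A minimal sketch, with made-up sample data that lies exactly on y = 2x + 1:

```python
# Ordinary least squares for a line, using the averages derived above:
#   m = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2)
#   c = mean(y) - m*mean(x)
def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    x2_bar = sum(x * x for x in xs) / n
    m = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    c = y_bar - m * x_bar
    return m, c

# Points on y = 2x + 1 should be recovered exactly.
m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c)  # -> 2.0 1.0
```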
Weighted least squares:
The ordinary least squares solved above treats every point in the time series as equally important, but in reality the points of a series do not influence the future equally: generally, recent data matters more than distant data. A more reasonable approach is therefore to weight the data, assigning larger weights to recent points and smaller weights to older ones. Weighted least squares uses exponential weights W (0 < W < 1); the parameter estimates after weighting should satisfy:
$$L_n = W_n(y_n-(mx_n+c))^2$$
$$L=\frac{1}{N} \sum_{n=1}^{N}L_n(y_n,f(x_n;c,m))$$
$$\underset{m,c}{\arg\min}\ \frac{1}{N}\sum_{n=1}^{N}L_n(y_n;c,m)=\underset{m,c}{\arg\min}\ \frac{1}{N}\sum_{n=1}^{N}W_n(y_n-(mx_n+c))^2$$
As before, the average loss function is:
$$\begin{aligned} L &=\frac{1}{N}\sum_{n=1}^{N}L_n(y_n,f(x_n;c,m))\\ &=\frac{1}{N}\sum_{n=1}^{N}W_n(y_n-f(x_n;c,m))^2\\ &=\frac{1}{N}\sum_{n=1}^{N}W_n(y_n-(c+mx_n))^2\\ &=\frac{1}{N}\sum_{n=1}^{N}W_n(y_n-c-mx_n)(y_n-c-mx_n)\\ &=\frac{1}{N}\sum_{n=1}^{N}W_n(y_n^2-2y_nc-2y_nmx_n+c^2+2cmx_n+m^2x_n^2)\\ &=\frac{1}{N}\sum_{n=1}^{N}W_n(y_n^2-2y_nc+2mx_n(c-y_n)+c^2+m^2x_n^2) \end{aligned}$$
To minimize L, its partial derivatives with respect to c and m must again be zero. We compute each partial derivative, set it to zero, and solve for c and m; the resulting c and m give the best-fitting model.
Partial derivative with respect to c:
Since we are differentiating with respect to c, drop the terms of L that do not contain c, leaving:
$$\frac{1}{N}\sum_{n=1}^{N}W_n(c^2-2y_nc+2cmx_n)$$
Moving factors that do not carry the index n outside the sums gives:
$$c^2\frac{1}{N}\sum_{n=1}^{N}W_n+2cm\frac{1}{N}\left(\sum_{n=1}^{N}W_nx_n\right)-2c\frac{1}{N}\left(\sum_{n=1}^{N}W_ny_n\right)$$
Differentiating with respect to c:
$$\frac{\partial L }{\partial c}=2c\frac{1}{N}\sum_{n=1}^{N}W_n+2m\frac{1}{N}\left(\sum_{n=1}^{N}W_nx_n\right)-\frac{2}{N}\left(\sum_{n=1}^{N}W_ny_n\right)$$
Partial derivative with respect to m:
Since we are differentiating with respect to m, drop the terms of L that do not contain m, leaving:
$$\frac{1}{N}\sum_{n=1}^{N}W_n(m^2x_n^2-2y_nmx_n+2cmx_n)$$
Moving factors that do not carry the index n outside the sums gives:
$$m^2\frac{1}{N}\sum_{n=1}^{N}W_nx_n^2+2m\frac{1}{N}\sum_{n=1}^{N}W_nx_n(c-y_n)$$
Differentiating with respect to m:
$$\frac{\partial L }{\partial m}=2m\frac{1}{N}\sum_{n=1}^{N}W_nx_n^2+\frac{2}{N}\sum_{n=1}^{N}W_nx_n(c-y_n)$$
Solving for m and c:
Set the partial derivative with respect to c to zero and solve:
$$2c\frac{1}{N}\sum_{n=1}^{N}W_n+2m\frac{1}{N}\left(\sum_{n=1}^{N}W_nx_n\right)-\frac{2}{N}\left(\sum_{n=1}^{N}W_ny_n\right)=0$$
$$2c=\frac{\frac{2}{N}\sum_{n=1}^{N}W_ny_n-2m\frac{1}{N}\sum_{n=1}^{N}W_nx_n}{\frac{1}{N}\sum_{n=1}^{N}W_n}$$
$$c=\frac{\frac{1}{N}\sum_{n=1}^{N}W_ny_n-m\frac{1}{N}\sum_{n=1}^{N}W_nx_n}{\frac{1}{N}\sum_{n=1}^{N}W_n}$$
Set the partial derivative with respect to m to zero and solve:
$$2m\frac{1}{N}\sum_{n=1}^{N}W_nx_n^2+\frac{2}{N}\sum_{n=1}^{N}W_nx_n(c-y_n)=0$$
Substituting the expression for c gives:
$$2m\frac{1}{N}\sum_{n=1}^{N}W_nx_n^2+\frac{2}{N}\sum_{n=1}^{N}W_nx_n\left(\frac{\frac{1}{N}\sum_{n=1}^{N}W_ny_n-m\frac{1}{N}\sum_{n=1}^{N}W_nx_n}{\frac{1}{N}\sum_{n=1}^{N}W_n}-y_n\right)=0$$
$$m = \frac{\left(\frac{1}{N}\sum_{n=1}^{N}W_nx_ny_n\right)\left(\frac{1}{N}\sum_{n=1}^{N}W_n\right)-\left(\frac{1}{N}\sum_{n=1}^{N}W_nx_n\right)\left(\frac{1}{N}\sum_{n=1}^{N}W_ny_n\right)}{\left(\frac{1}{N}\sum_{n=1}^{N}W_nx_n^2\right)\left(\frac{1}{N}\sum_{n=1}^{N}W_n\right)-\left(\frac{1}{N}\sum_{n=1}^{N}W_nx_n\right)^2}$$
At this point both m and c have been computed.
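The weighted closed form also translates directly into code. A minimal sketch, with made-up data and weights (writing $S_w$, $S_x$, $S_y$, $S_{xx}$, $S_{xy}$ for the weighted averages):

```python
# Weighted least squares for a line, using the weighted averages from the
# derivation above:
#   m = (S_xy*S_w - S_x*S_y) / (S_xx*S_w - S_x^2)
#   c = (S_y - m*S_x) / S_w
def fit_line_weighted(xs, ys, ws):
    n = len(xs)
    s_w = sum(ws) / n
    s_x = sum(w * x for w, x in zip(ws, xs)) / n
    s_y = sum(w * y for w, y in zip(ws, ys)) / n
    s_xx = sum(w * x * x for w, x in zip(ws, xs)) / n
    s_xy = sum(w * x * y for w, x, y in zip(ws, xs, ys)) / n
    m = (s_xy * s_w - s_x * s_y) / (s_xx * s_w - s_x ** 2)
    c = (s_y - m * s_x) / s_w
    return m, c

# The data lies exactly on y = 2x + 1, so any positive weights recover it.
m, c = fit_line_weighted([0, 1, 2, 3], [1, 3, 5, 7], [0.5, 0.7, 0.9, 1.0])
print(round(m, 6), round(c, 6))  # -> 2.0 1.0
```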
Matrix derivation
The trace of an n×n matrix A is the sum of the elements on its main diagonal, written tr(A). That is,
$$tr(A)=\sum_{i=1}^{n}a_{ii}$$
- Theorem 1: tr(AB) = tr(BA)
Proof:
$$tr(AB)=\sum_{i=1}^{n}(AB)_{ii}=\sum_{i=1}^{n}\sum_{j=1}^{m}a_{ij}b_{ji}=\sum_{j=1}^{m}\sum_{i=1}^{n}b_{ji}a_{ij}=\sum_{j=1}^{m}(BA)_{jj}=tr(BA)$$
- Theorem 2: $tr(ABC)=tr(CAB)=tr(BCA)$
- Theorem 3: $\frac{\partial{tr(AB)}}{\partial A}=\frac{\partial{tr(BA)}}{\partial A}=B^T$
where A is an m×n matrix and B is n×m:
$$tr(AB)=tr\left(\begin{matrix}a_{11}&a_{12}&\cdots&a_{1n}\\a_{21}&a_{22}&\cdots&a_{2n}\\\vdots&\vdots&\ddots&\vdots\\a_{m1}&a_{m2}&\cdots&a_{mn}\end{matrix}\right) \left(\begin{matrix}b_{11}&b_{12}&\cdots&b_{1m}\\b_{21}&b_{22}&\cdots&b_{2m}\\\vdots&\vdots&\ddots&\vdots\\b_{n1}&b_{n2}&\cdots&b_{nm}\end{matrix}\right)$$
Considering only the diagonal elements, we have
$$tr(AB)=\sum_{i=1}^{n}a_{1i}b_{i1}+\sum_{i=1}^{n}a_{2i}b_{i2}+\ldots+\sum_{i=1}^{n}a_{mi}b_{im}=\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}b_{ji}$$
$$\frac{\partial tr(AB)}{\partial a_{ij}}=b_{ji}\Rightarrow \frac{\partial tr(AB)}{\partial A}=B^T$$
- Theorem 4: $\frac{\partial{tr(A^TB)}}{\partial A}=\frac{\partial{tr(BA^T)}}{\partial A}=B$
Proof:
$$\frac{\partial{tr(A^TB)}}{\partial A}=\frac{\partial{tr((A^TB)^T)}}{\partial A}=\frac{\partial{tr(B^TA)}}{\partial A}=\frac{\partial{tr(AB^T)}}{\partial A}=(B^T)^T=B$$
- Theorem 5: $tr(A)=tr(A^T)$
- Theorem 6: if a is a scalar, then tr(a) = a
- Theorem 7: $\frac{\partial tr(ABA^TC)}{\partial A}=CAB+C^TAB^T$
Proof (applying the product rule to the two occurrences of A, holding the other fixed in each term):
$$\frac{\partial tr(ABA^TC)}{\partial A}=\frac{\partial tr(ABA^TC)}{\partial A}+\frac{\partial tr(A^TCAB)}{\partial A}=(BA^TC)^T+CAB=C^TAB^T+CAB$$
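The trace identities above are easy to sanity-check numerically. A sketch with random matrices (the shapes are arbitrary but chosen to be compatible):

```python
# Numeric spot-check of trace Theorems 1, 2, and 5 with random matrices.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

assert np.isclose(np.trace(A @ B), np.trace(B @ A))          # Theorem 1: tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))  # Theorem 2: cyclic property
assert np.isclose(np.trace(C), np.trace(C.T))                # Theorem 5: tr(A) = tr(A^T)
print("trace identities hold")
```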
Matrix derivation of least squares:
Let:
$$x=\left(\begin{matrix}x_0^{(1)}&x_0^{(2)}&\cdots&x_0^{(m)}\\x_1^{(1)}&x_1^{(2)}&\cdots&x_1^{(m)}\\\vdots&\vdots&\ddots&\vdots\\x_n^{(1)}&x_n^{(2)}&\cdots&x_n^{(m)} \end{matrix}\right)\qquad \theta=\left(\begin{matrix}\theta_0\\\theta_1\\\vdots\\\theta_n\end{matrix}\right)\qquad X=x^T\qquad Y=\left(\begin{matrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{matrix}\right)$$
Each column of x is the feature vector of one sample (indexed 0 through n), and the columns run over the m samples. θ holds the coefficient of each feature, X is the feature (design) matrix, and Y is the vector of actual results.
Then:
$$X\theta-Y=\left(\begin{matrix}\sum_{i=0}^{n}x_i^{(1)}\theta_i-y^{(1)}\\\sum_{i=0}^{n}x_i^{(2)}\theta_i-y^{(2)}\\\vdots\\\sum_{i=0}^{n}x_i^{(m)}\theta_i-y^{(m)}\end{matrix}\right)=\left(\begin{matrix}h_\theta(x^{(1)})-y^{(1)}\\h_\theta(x^{(2)})-y^{(2)}\\\vdots\\h_\theta(x^{(m)})-y^{(m)}\end{matrix}\right)$$
The objective function:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2=\frac{1}{2}tr[(X\theta-Y)^T(X\theta-Y)]$$
Minimizing the objective yields the best-matching θ; differentiate the objective:
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &= \frac{1}{2}\frac{\partial tr(\theta^TX^TX\theta-\theta^T X^TY-Y^TX\theta+Y^TY)}{\partial \theta}\\&= \frac{1}{2}\left[\frac{\partial tr(\theta^TX^TX\theta)}{\partial \theta}-\frac{\partial tr(\theta^T X^TY)}{\partial \theta}-\frac{\partial tr(Y^TX\theta)}{\partial \theta}\right]\\& =\frac{1}{2}[X^TX\theta+X^TX\theta-X^TY-X^TY]\\&=X^TX\theta-X^TY \end{aligned}$$
Set the derivative to zero and solve:
$$X^TX\theta-X^TY=0$$
$$\theta = (X^TX)^{-1}X^TY$$
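Solving the normal equations $X^TX\theta = X^TY$ is a one-liner with NumPy. The data below is made up for illustration: points on y = 2x + 1, with a column of ones in X so that θ = [c, m]:

```python
# Solving the normal equations X^T X θ = X^T Y with NumPy.
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])

# np.linalg.solve is numerically preferable to forming (X^T X)^{-1} explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)  # -> [1. 2.], i.e. c = 1, m = 2
```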
Matrix derivation of weighted least squares:
The weight matrix:
$$W=\left(\begin{matrix}w_1&0&0&\cdots&0\\0&w_2&0&\cdots&0\\0&0&w_3&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\0&0&0&\cdots&w_m\end{matrix}\right)$$
W is an m×m matrix, and the objective function becomes:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}w_i(h_\theta(x^{(i)})-y^{(i)})^2=\frac{1}{2}tr[(X\theta-Y)^TW(X\theta-Y)]$$
As before, minimizing the objective yields the best-matching θ; differentiate the objective:
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &= \frac{1}{2}\frac{\partial tr(\theta^TX^TWX\theta-\theta^T X^TWY-Y^TWX\theta+Y^TWY)}{\partial \theta}\\&= \frac{1}{2}\left[\frac{\partial tr(\theta^TX^TWX\theta)}{\partial \theta}-\frac{\partial tr(\theta^T X^TWY)}{\partial \theta}-\frac{\partial tr(Y^TWX\theta)}{\partial \theta}\right]\\& =\frac{1}{2}[X^TWX\theta+X^TW^TX\theta-X^TWY-X^TW^TY] \end{aligned}$$
And since W is diagonal (hence symmetric, $W^T = W$):
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2}[X^TWX\theta+X^TW^TX\theta-X^TWY-X^TW^TY]\\&=\frac{1}{2}[X^TWX\theta+X^TWX\theta-X^TWY-X^TWY]\\& =X^TWX\theta-X^TWY \end{aligned}$$
Set the derivative to zero and solve:
$$X^TWX\theta-X^TWY=0$$
$$\theta = (X^TWX)^{-1}X^TWY$$
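The weighted solution $\theta = (X^TWX)^{-1}X^TWY$ works the same way in NumPy. The data and weights below are illustrative (heavier weights on later points, as the weighted section suggests):

```python
# Solving the weighted normal equations X^T W X θ = X^T W Y with NumPy.
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])
W = np.diag([0.5, 0.7, 0.9, 1.0])  # larger weights on more recent points

theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
print(theta)  # data lies exactly on y = 2x + 1, so θ = [1, 2] for any positive weights
```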