First, the cost function $J(\theta)$ is defined as

$$
J(\theta)=\frac{1}{2m}\sum_{i=1}^m\left(\theta_0x_0^{(i)}+\theta_1x_1^{(i)}+\cdots+\theta_nx_n^{(i)}-y^{(i)}\right)^2
$$
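As a concrete reading of this definition, the following minimal sketch evaluates $J(\theta)$ with plain Python lists. The function name `cost`, the toy data, and the convention that $x_0^{(i)}=1$ (an intercept column) are assumptions made for illustration, not something the text above specifies.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum over samples of (x_i . theta - y_i)^2."""
    m = len(y)
    total = 0.0
    for row, target in zip(X, y):
        # Prediction for one sample: theta_0*x_0 + ... + theta_n*x_n
        prediction = sum(t * x for t, x in zip(theta, row))
        total += (prediction - target) ** 2
    return total / (2 * m)


# Hypothetical toy data: each row is [x_0, x_1], with x_0 = 1 assumed.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

perfect = cost([0.0, 2.0], X, y)  # theta that fits y = 2 * x_1 exactly
zero = cost([0.0, 0.0], X, y)     # all-zero parameters
print(perfect, zero)
```

With the perfectly fitting parameters the cost is zero; any other choice gives a positive value.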
Taking the partial derivative with respect to an arbitrary parameter $\theta_j$ gives

$$
\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m x_j^{(i)}\left(\theta_0x_0^{(i)}+\theta_1x_1^{(i)}+\cdots+\theta_nx_n^{(i)}-y^{(i)}\right)
$$
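One way to sanity-check this partial derivative is to compare it with a finite-difference estimate of the cost function. The sketch below does that on toy data. The names `cost`, `gradient` and `numeric_gradient`, and the data itself, are hypothetical and chosen only for illustration.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(y)
    return sum((sum(t * x for t, x in zip(theta, row)) - target) ** 2
               for row, target in zip(X, y)) / (2 * m)


def gradient(theta, X, y):
    """Closed-form partials: dJ/dtheta_j = 1/m * sum_i x_j^(i) * residual_i."""
    m = len(y)
    grad = [0.0] * len(theta)
    for row, target in zip(X, y):
        residual = sum(t * x for t, x in zip(theta, row)) - target
        for j, x_j in enumerate(row):
            grad[j] += x_j * residual / m
    return grad


def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data with an assumed intercept column x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.5, 1.0]

analytic = gradient(theta, X, y)
numeric = numeric_gradient(theta, X, y)
print(analytic, numeric)
```

The two results should agree to within rounding error, because $J$ is quadratic in $\theta$ and central differences are exact for quadratics.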
At an extremum every partial derivative equals 0, so

$$
\frac{1}{m}\sum_{i=1}^m x_j^{(i)}\left(\theta_0x_0^{(i)}+\theta_1x_1^{(i)}+\cdots+\theta_nx_n^{(i)}-y^{(i)}\right)=0
$$
Multiplying both sides by $m$ and writing out the sum, this becomes the product of a row vector and a column vector:

$$
\left[\begin{matrix} x_j^{(1)}&x_j^{(2)}&\cdots&x_j^{(m)} \end{matrix}\right]
\left[\begin{matrix}
\theta_0x_0^{(1)}+\theta_1x_1^{(1)}+\cdots+\theta_nx_n^{(1)}-y^{(1)}\\
\theta_0x_0^{(2)}+\theta_1x_1^{(2)}+\cdots+\theta_nx_n^{(2)}-y^{(2)}\\
\vdots \\
\theta_0x_0^{(m)}+\theta_1x_1^{(m)}+\cdots+\theta_nx_n^{(m)}-y^{(m)}
\end{matrix}\right]
=0
$$
The second factor above (the column vector) can in turn be expanded as

$$
\left[\begin{matrix}
\theta_0x_0^{(1)}+\theta_1x_1^{(1)}+\cdots+\theta_nx_n^{(1)}-y^{(1)}\\
\theta_0x_0^{(2)}+\theta_1x_1^{(2)}+\cdots+\theta_nx_n^{(2)}-y^{(2)}\\
\vdots \\
\theta_0x_0^{(m)}+\theta_1x_1^{(m)}+\cdots+\theta_nx_n^{(m)}-y^{(m)}
\end{matrix}\right]=
\left[\begin{matrix}
x_0^{(1)}&x_1^{(1)}&\cdots&x_n^{(1)}\\
x_0^{(2)}&x_1^{(2)}&\cdots&x_n^{(2)}\\
\vdots & \vdots & \ddots &\vdots\\
x_0^{(m)}&x_1^{(m)}&\cdots&x_n^{(m)}
\end{matrix}\right]
\left[\begin{matrix} \theta_0\\ \theta_1\\ \vdots \\ \theta_n \end{matrix}\right]-
\left[\begin{matrix} y^{(1)}\\ y^{(2)}\\ \vdots \\ y^{(m)} \end{matrix}\right]
$$
Therefore, for every parameter $\theta_j$ we have

$$
\left[\begin{matrix} x_j^{(1)}&x_j^{(2)}&\cdots&x_j^{(m)} \end{matrix}\right]
\left(
\left[\begin{matrix}
x_0^{(1)}&x_1^{(1)}&\cdots&x_n^{(1)}\\
x_0^{(2)}&x_1^{(2)}&\cdots&x_n^{(2)}\\
\vdots & \vdots & \ddots &\vdots\\
x_0^{(m)}&x_1^{(m)}&\cdots&x_n^{(m)}
\end{matrix}\right]
\left[\begin{matrix} \theta_0\\ \theta_1\\ \vdots \\ \theta_n \end{matrix}\right]-
\left[\begin{matrix} y^{(1)}\\ y^{(2)}\\ \vdots \\ y^{(m)} \end{matrix}\right]
\right)=0
$$
Stacking the equations obtained for all parameters $\theta_0,\theta_1,\dots,\theta_n$ into one system gives

$$
\left[\begin{matrix}
x_0^{(1)}&x_0^{(2)}&\cdots&x_0^{(m)}\\
x_1^{(1)}&x_1^{(2)}&\cdots&x_1^{(m)}\\
\vdots & \vdots & \ddots &\vdots\\
x_n^{(1)}&x_n^{(2)}&\cdots&x_n^{(m)}
\end{matrix}\right]
\left(
\left[\begin{matrix}
x_0^{(1)}&x_1^{(1)}&\cdots&x_n^{(1)}\\
x_0^{(2)}&x_1^{(2)}&\cdots&x_n^{(2)}\\
\vdots & \vdots & \ddots &\vdots\\
x_0^{(m)}&x_1^{(m)}&\cdots&x_n^{(m)}
\end{matrix}\right]
\left[\begin{matrix} \theta_0\\ \theta_1\\ \vdots \\ \theta_n \end{matrix}\right]-
\left[\begin{matrix} y^{(1)}\\ y^{(2)}\\ \vdots \\ y^{(m)} \end{matrix}\right]
\right)=0
$$
Let

$$
X=\left[\begin{matrix}
x_0^{(1)}&x_1^{(1)}&\cdots&x_n^{(1)}\\
x_0^{(2)}&x_1^{(2)}&\cdots&x_n^{(2)}\\
\vdots & \vdots & \ddots &\vdots\\
x_0^{(m)}&x_1^{(m)}&\cdots&x_n^{(m)}
\end{matrix}\right],\quad
\Theta=\left[\begin{matrix} \theta_0\\ \theta_1\\ \vdots \\ \theta_n \end{matrix}\right],\quad
Y=\left[\begin{matrix} y^{(1)}\\ y^{(2)}\\ \vdots \\ y^{(m)} \end{matrix}\right]
$$
Then the system above can be written as

$$
X^T(X\Theta-Y)=0
$$
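The left-hand side $X^T(X\Theta-Y)$ is simply the vector of all $n+1$ partial derivatives scaled by $m$, which a few lines of code make concrete. The helper names `matvec`, `transpose` and `normal_residual`, along with the toy data, are hypothetical and used only for this sketch.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def normal_residual(X, theta, y):
    """Compute X^T (X theta - y); it is the zero vector at the optimum."""
    residual = [p - t for p, t in zip(matvec(X, theta), y)]
    return matvec(transpose(X), residual)


# Hypothetical toy data with an assumed intercept column x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

away = normal_residual(X, [0.5, 1.0], y)   # not optimal: nonzero vector
at_opt = normal_residual(X, [0.0, 2.0], y)  # exact fit: zero vector
print(away, at_opt)
```

Row $j$ of $X^T$ is exactly the row vector $[x_j^{(1)}\ \cdots\ x_j^{(m)}]$ from the per-parameter equation, so each entry of the result corresponds to one $\theta_j$.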
Expanding and simplifying (the last step assumes that $X^TX$ is invertible):

$$
\begin{aligned}
X^TX\Theta-X^TY&=0\\
X^TX\Theta&=X^TY\\
\Theta&=(X^TX)^{-1}X^TY
\end{aligned}
$$
This yields the Normal Equation

$$
\Theta=(X^TX)^{-1}X^TY
$$
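A minimal sketch of applying the Normal Equation in code follows. Note one deliberate substitution: rather than forming $(X^TX)^{-1}$ explicitly, it solves the equivalent linear system $X^TX\Theta=X^TY$ by Gaussian elimination, which gives the same $\Theta$ whenever $X^TX$ is invertible. The function names, the singularity threshold, and the toy data are all assumptions made for illustration.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # illustrative threshold
            raise ValueError("X^T X is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def normal_equation(X, y):
    """Return theta satisfying X^T X theta = X^T y."""
    Xt = [list(col) for col in zip(*X)]  # rows of X^T = columns of X
    XtX = [[sum(a * b for a, b in zip(r, c)) for c in Xt] for r in Xt]
    Xty = [sum(a * b for a, b in zip(r, y)) for r in Xt]
    return solve_linear_system(XtX, Xty)


# Hypothetical data generated from y = 1 + 2 * x_1, with x_0 = 1 assumed.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

theta = normal_equation(X, y)
print(theta)
```

On this noise-free data the recovered parameters should match the generating line, intercept 1 and slope 2, up to rounding.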
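Solving for the parameters directly with this formula avoids repeated iteration, and the result is exact rather than an iterative approximation. The drawback is that inverting the matrix $X^TX$ takes $O(n^3)$ time, so when the number of parameters is large ($n\ge 10^5$) this approach is essentially infeasible and gradient descent should be used instead.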
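For comparison, here is a rough sketch of the gradient descent alternative on the same kind of problem. The learning rate, iteration count, function name and toy data are arbitrary assumptions for this example. In practice they need tuning, and too large a learning rate makes the iteration diverge.

```python
def gradient_descent(X, y, learning_rate=0.1, iterations=5000):
    """Minimise J(theta) by repeatedly stepping against the gradient."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Residuals x_i . theta - y_i for the current parameters.
        residuals = [sum(t * x for t, x in zip(theta, row)) - target
                     for row, target in zip(X, y)]
        # dJ/dtheta_j = 1/m * sum_i x_j^(i) * residual_i
        grad = [sum(row[j] * r for row, r in zip(X, residuals)) / m
                for j in range(n)]
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data generated from y = 1 + 2 * x_1, with x_0 = 1 assumed.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

theta_gd = gradient_descent(X, y)
print(theta_gd)
```

Each iteration costs $O(mn)$ and never forms or inverts $X^TX$, which is why it remains usable when $n$ is large. The price is that the answer is only approached iteratively.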