This article covers the mathematical details of solving linear regression for its optimal parameters with gradient descent.
Linear Regression
Let

$$Y = \left[\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{matrix}\right],\quad X = \left[\begin{matrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_m & 1 \end{matrix}\right],\quad w = \left[\begin{matrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ b \end{matrix}\right],\quad x_i \in \mathbb{R}^n,\quad \hat Y = Xw$$
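To make the notation concrete, here is a minimal NumPy sketch that appends a column of ones to the raw feature matrix (so the bias b sits as the last entry of w) and evaluates the prediction vector Xw. The sample numbers and array names are illustrative assumptions, not part of the derivation.

import numpy as np

# m = 4 samples with n = 2 features each (made-up numbers, for illustration only)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([3.0, 2.5, 5.0, 7.5])

# Append a column of ones so the last component of w acts as the bias b
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])  # shape (m, n+1)

w = np.zeros(X.shape[1])   # [w_1, ..., w_n, b], shape (n+1,)
Y_hat = X.dot(w)           # predictions, shape (m,)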
objective function:
$$J(w)=\frac{1}{2m}(Y-\hat Y)^T(Y-\hat Y)$$
gradient of objective function:
Using the rule for differentiating a scalar with respect to a vector:
$$\frac{\partial u^Tv}{\partial x}=\frac{\partial u}{\partial x}v+\frac{\partial v}{\partial x}u$$
With $u = v = Y-Xw$ and $\frac{\partial (Y-Xw)}{\partial w} = -X^T$ (denominator layout), this gives

$$\frac{\partial J(w)}{\partial w}=\frac{1}{2m}\left[\frac{\partial (Y-Xw)}{\partial w}(Y-Xw)+\frac{\partial (Y-Xw)}{\partial w}(Y-Xw)\right]=-\frac{1}{m}X^T(Y-Xw)$$
gradient descent:
$$w = w-\eta \frac{\partial J(w)}{\partial w}=w+\eta\frac{1}{m}X^T(Y-Xw)$$
NumPy implementation
def compute_square_loss(X, y, theta):
    # Squared-error loss with the 1/(2m) convention: J(theta) = 1/(2m) * (y - X theta)^T (y - X theta)
    n_instance, n_feature = X.shape
    loss = 1 / (2 * n_instance) * (y - X.dot(theta.T)).dot((y - X.dot(theta.T)).T)
    return loss
def compute_square_loss_gradient(X, y, theta):
    # Gradient of the squared loss: dJ/dtheta = -1/m * X^T (y - X theta)
    n_instance, n_feature = X.shape
    return -1 / n_instance * X.T.dot((y - X.dot(theta.T)).T)
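Putting the two functions together, a plain batch gradient descent loop looks like the sketch below. The function name batch_gradient_descent, the step size eta, and num_iter are illustrative choices, not taken from the text above.

def batch_gradient_descent(X, y, eta=0.05, num_iter=1000):
    # Repeatedly apply theta <- theta - eta * dJ/dtheta = theta + (eta/m) * X^T (y - X theta)
    n_instance, n_feature = X.shape
    theta = np.zeros(n_feature)          # start from the zero vector
    loss_history = []
    for _ in range(num_iter):
        theta = theta - eta * compute_square_loss_gradient(X, y, theta)
        loss_history.append(compute_square_loss(X, y, theta))
    return theta, loss_history

With a toy X and y such as the ones sketched earlier, the returned loss_history should decrease toward zero provided eta is small enough.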
If a regularization term is added, then for Ridge Regression (where the bias b is regularized together with the weights),
objective function:
$$J(w)=\frac{1}{2m}(Y-\hat Y)^T(Y-\hat Y) + \lambda w^Tw$$
gradient of objective function:
$$\frac{\partial J(w)}{\partial w}=\frac{1}{2m}\left[\frac{\partial (Y-Xw)}{\partial w}(Y-Xw)+\frac{\partial (Y-Xw)}{\partial w}(Y-Xw)\right]+2\lambda w=-\frac{1}{m}X^T(Y-Xw)+2\lambda w$$
gradient descent:
$$w = w-\eta \frac{\partial J(w)}{\partial w}=w+\eta\left(\frac{1}{m}X^T(Y-Xw)-2\lambda w\right)$$
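In NumPy the regularized gradient only adds the 2λw term to the unregularized one, and the update rule follows directly. The function name compute_regularized_square_loss_gradient and the argument lambda_reg below are illustrative, mirroring the functions above.

def compute_regularized_square_loss_gradient(X, y, theta, lambda_reg):
    # Ridge gradient: -1/m * X^T (y - X theta) + 2 * lambda * theta
    n_instance, n_feature = X.shape
    return -1 / n_instance * X.T.dot(y - X.dot(theta.T)) + 2 * lambda_reg * theta

# One ridge gradient descent step, matching w = w + eta * (1/m * X^T(Y - Xw) - 2*lambda*w):
# theta = theta - eta * compute_regularized_square_loss_gradient(X, y, theta, lambda_reg)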