Algorithm Principles
Ridge regression is simply linear regression with an L2 regularization term added, while lasso regression adds an L1 regularization term instead. For the linear regression algorithm itself, see my earlier post: https://blog.csdn.net/just_sort/article/details/101216607 . Building on linear regression, the loss function of ridge regression can be written as:
$$J_w = \min_w \left\{ \left\| Xw - y \right\|^2 + \alpha \left\| w \right\|^2 \right\}$$
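To make the loss concrete, here is a minimal sketch that evaluates the ridge objective on tiny made-up data (the arrays `X`, `y`, `w` and the value of `alpha` below are arbitrary illustrations, not from the post):

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.1])
alpha = 0.1

# Ridge loss: ||Xw - y||^2 + alpha * ||w||^2
residual = X @ w - y
loss = residual @ residual + alpha * (w @ w)
print(loss)
```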
where $\alpha \ge 0$. The larger $\alpha$ is, the stronger the effect of the regularization (penalty) term; the smaller $\alpha$ is, the weaker the penalty. In the special case $\alpha = 0$, ridge regression degenerates to ordinary linear regression. Implementing ridge regression is also quite simple: we only need to add one extra term when computing the gradient in the linear regression implemented earlier. Here, let us solve for $w$ by finding the stationary point via the Lagrangian method.
- First, for linear regression, formulate the constrained optimization problem (whose Lagrangian gives the ridge objective):
$$J_w = \min_w \left\{ \left\| Xw - y \right\|^2 \right\} \quad \text{s.t. } \left\| w \right\|^2 \leq t$$
- In the linear regression post, the loss function was shown to be expressible as:
$$\left\| Xw - y \right\|^2 = (Xw - y)^T (Xw - y)$$
- Taking the derivative with respect to the parameter $w$ and setting it to zero gives:
$$2X^T(Xw - y) = 0$$
- Solving gives:
$$w = (X^T X)^{-1} X^T y$$
Similarly, for ridge regression the loss function is:
$$\left\| Xw - y \right\|^2 + \alpha \left\| w \right\|^2 = (Xw - y)^T (Xw - y) + \alpha w^T w$$
Taking the derivative with respect to the parameter $w$ and setting it to zero gives:
$$2X^T(Xw - y) + 2\alpha w = 0$$
- Solving gives:
$$w = (X^T X + \alpha I)^{-1} X^T y$$
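As a sanity check on this closed-form solution, the following NumPy sketch (with made-up random data) solves $(X^TX + \alpha I)w = X^Ty$ and verifies that the ridge gradient vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # made-up design matrix
y = X @ np.array([1.0, -2.0, 0.5])      # targets from an arbitrary weight vector
alpha = 0.5

# Closed-form ridge solution: w = (X^T X + alpha * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# The gradient 2 X^T (X w - y) + 2 alpha w should vanish at the solution
grad = 2 * X.T @ (X @ w_ridge - y) + 2 * alpha * w_ridge
```

Setting `alpha = 0` in the same formula reproduces the ordinary least-squares solution from the previous step.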
OK, that completes the derivation. However, in order to reuse the linear regression code from last time, we will still use SGD to optimize the ridge regression objective. The implementation is as follows:
```python
# coding=utf-8
from linear_regression import LinearRegreession  # class name spelling follows the linked module
from linear_regression import run_time
from linear_regression import load_data
from linear_regression import min_max_scale
from linear_regression import train_test_split
from linear_regression import get_r2
from numpy import ndarray


class Ridge(LinearRegreession):
    """Ridge regression class.

    Loss function:
        L = (y - y_hat) ^ 2 + L2
        L = (y - W * X - b) ^ 2 + α * (W, b) ^ 2
    Partial derivative of W:
        dL/dW = -2 * (y - W * X - b) * X + 2 * α * W
        dL/dW = -2 * (y - y_hat) * X + 2 * α * W
    Partial derivative of b:
        dL/db = -2 * (y - W * X - b) + 2 * α * b
        dL/db = -2 * (y - y_hat) + 2 * α * b
    ----------------------------------------------------------------
    Hyperparameters:
        bias: b
        weights: W
        alpha: α
    """

    def __init__(self):
        super(Ridge, self).__init__()
        self.alpha = None

    def get_gradient_delta(self, Xi, yi):
        # Same update as plain linear regression, plus the extra L2 penalty term.
        y_hat = self._predict(Xi)
        bias_grad_delta = yi - y_hat - self.alpha * self.bias
        weights_grad_delta = [(yi - y_hat) * Xij - self.alpha * wj
                              for Xij, wj in zip(Xi, self.weights)]
        return bias_grad_delta, weights_grad_delta

    def fit(self, X, y, lr, epochs, alpha, method="batch", sample_rate=1.0):
        self.alpha = alpha
        assert method in ("batch", "stochastic")
        if method == "batch":
            self.batch_gradient_descent(X, y, lr, epochs)
        if method == "stochastic":
            self.stochastic_gradient_descent(X, y, lr, epochs, sample_rate)


@run_time
def main():
    print("Testing the performance of Ridge Regressor (stochastic)...")
    # Load data
    data, label = load_data()
    data = min_max_scale(data)
    # Split data randomly, train set rate 70%
    data_train, data_test, label_train, label_test = train_test_split(
        data, label, random_state=10)
    # Train model
    reg = Ridge()
    reg.fit(X=data_train, y=label_train, lr=0.001, epochs=1000,
            method="stochastic", sample_rate=0.5, alpha=1e-4)
    # Model evaluation
    get_r2(reg, data_test, label_test)


if __name__ == "__main__":
    main()
```
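Since the class above depends on the author's `linear_regression` module, here is a self-contained sketch of the same per-sample update rule (the function name `ridge_sgd` and all data below are my own illustration, not part of the original code):

```python
import numpy as np

def ridge_sgd(X, y, lr=0.01, epochs=200, alpha=1e-4, seed=0):
    """Minimal ridge-regression SGD sketch, no external module required.

    Uses the same gradients as the Ridge class above:
        w += lr * ((y - y_hat) * x - alpha * w)
        b += lr * ((y - y_hat)     - alpha * b)
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        # Visit samples in a fresh random order each epoch
        for i in rng.permutation(n):
            y_hat = X[i] @ w + b
            err = y[i] - y_hat
            w += lr * (err * X[i] - alpha * w)
            b += lr * (err - alpha * b)
    return w, b
```

On noise-free linear data the recovered weights should approach the true ones, shrunk very slightly toward zero by the tiny `alpha`.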