Mathematical Derivation
Derivation of the parameter estimates in the linear regression model:
$$f(x_i)=wx_i+b, \text{ such that } f(x_i)\approx y_i$$
The mean squared error is the most commonly used performance measure for regression tasks; it quantifies the difference between $f(x)$ and $y$:
$$\begin{aligned} (w^*,b^*) & = \arg\min_{(w,b)}\sum^{m}_{i=1}(y_i-f(x_i))^2 \\ & = \arg\min_{(w,b)}\sum^{m}_{i=1}(y_i-wx_i-b)^2 \end{aligned}$$
Let $E(w,b)=\sum^{m}_{i=1}(y_i-wx_i-b)^2$. Taking partial derivatives with respect to $w$ and $b$:
$$\frac{\partial E(w,b)}{\partial w}=2\left(w\sum^{m}_{i=1}x_i^2-\sum^{m}_{i=1}(y_i-b)x_i\right)$$
$$\frac{\partial E(w,b)}{\partial b}=2\left(mb-\sum^{m}_{i=1}(y_i-wx_i)\right)$$
Setting the two derivatives above to zero gives:
$$w=\frac{\sum^{m}_{i=1}y_i(x_i-\overline x)}{\sum^{m}_{i=1}x_i^2-\frac{1}{m}(\sum^{m}_{i=1}x_i)^2}, \quad \overline x=\frac{1}{m}\sum^{m}_{i=1}x_i$$
$$b=\frac{1}{m}\sum^{m}_{i=1}(y_i-wx_i)$$
The detailed derivation of $w$ is as follows:
$$\begin{aligned} w\sum^{m}_{i=1}x_i^2 & = \sum^{m}_{i=1}(y_i-b)x_i \\ & = \sum^{m}_{i=1}\left(y_i-\frac{1}{m}\sum^{m}_{i=1}(y_i-wx_i)\right)x_i \\ & = \sum^{m}_{i=1}y_ix_i-\frac{1}{m}\sum^{m}_{i=1}(y_i-wx_i)\sum^{m}_{i=1}x_i \\ & = \sum^{m}_{i=1}y_ix_i-\frac{1}{m}\sum^{m}_{i=1}y_i\sum^{m}_{i=1}x_i +\frac{1}{m}w\sum^{m}_{i=1}x_i \sum^{m}_{i=1}x_i \\ & = \sum^{m}_{i=1}y_i\left(x_i-\frac{1}{m}\sum^{m}_{i=1}x_i\right) +\frac{1}{m}w\left(\sum^{m}_{i=1}x_i\right)^2 \end{aligned}$$
$$w=\frac{\sum^{m}_{i=1}y_i(x_i-\frac{1}{m}\sum^{m}_{i=1}x_i)}{\sum^{m}_{i=1}x_i^2-\frac{1}{m}(\sum^{m}_{i=1}x_i)^2}$$
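The closed-form expressions for $w$ and $b$ can be sanity-checked numerically. A minimal sketch on noiseless toy data (the variable names are illustrative):

```python
import numpy as np

# Toy 1-D data that exactly follows y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

m = len(x)
x_bar = x.mean()

# Closed-form solution derived above
w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - (np.sum(x)) ** 2 / m)
b = np.mean(y - w * x)

print(w, b)  # recovers the true slope 3 and intercept 2
```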
The vector-form derivation is as follows:
$$f(x)=w_1x_1+w_2x_2+\cdots+w_dx_d+b$$
In vector form:
$$f(\boldsymbol x)=\boldsymbol w^T\boldsymbol x+b$$
where $\boldsymbol w=(w_1;w_2;\cdots;w_d)$.
Given a dataset $D=\{(\boldsymbol x_1,y_1),(\boldsymbol x_2,y_2),\cdots,(\boldsymbol x_m,y_m)\}$, where $\boldsymbol x_i=(x_{i1};x_{i2};\cdots;x_{id})$ and $y_i\in \mathbb R$.
Let $\hat{\boldsymbol w}=(\boldsymbol w;b)$, and represent the dataset $D$ as an $m\times (d+1)$ matrix $\boldsymbol X$, in which each row corresponds to one example: the first $d$ elements of a row are the example's $d$ attribute values, and the last element is fixed to 1, i.e.:
$$\boldsymbol X= \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{pmatrix} = \begin{pmatrix} \boldsymbol x_{1}^{T} & 1 \\ \boldsymbol x_{2}^{T} & 1 \\ \vdots & \vdots \\ \boldsymbol x_{m}^{T} & 1 \end{pmatrix}$$
Writing the labels as a vector $\boldsymbol y=(y_1;y_2;\cdots;y_m)$ as well, we have:
$$\hat{\boldsymbol w}^*=\arg\min_{\hat{\boldsymbol w}}(\boldsymbol y-\boldsymbol X\hat{\boldsymbol w})^T(\boldsymbol y-\boldsymbol X\hat{\boldsymbol w})$$
Let $E_{\hat{\boldsymbol w}}=(\boldsymbol y-\boldsymbol X\hat{\boldsymbol w})^T(\boldsymbol y-\boldsymbol X\hat{\boldsymbol w})$. Differentiating with respect to $\hat{\boldsymbol w}$ gives:
$$\frac{\partial E_{\hat{\boldsymbol w}}}{\partial \hat{\boldsymbol w}}=2\boldsymbol X^T(\boldsymbol X\hat{\boldsymbol w}-\boldsymbol y)$$
Setting this to zero (and assuming $\boldsymbol X^T\boldsymbol X$ is invertible) yields:
$$\hat{\boldsymbol w}^*=(\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y$$
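The normal-equation solution can be verified against NumPy's built-in least-squares solver. A minimal sketch on randomly generated data (the coefficients and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.normal(size=(m, d))
# Linear target with a small amount of noise
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.01 * rng.normal(size=m)

# Append the constant-1 column so that b is absorbed into w_hat
X1 = np.hstack([X, np.ones((m, 1))])

# w_hat* = (X^T X)^{-1} X^T y; pinv is numerically safer than an explicit inverse
w_hat = np.linalg.pinv(X1.T @ X1) @ X1.T @ y

# Cross-check with the library least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # the two solutions agree
```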
NumPy Implementation
- The main body of the regression model:
import numpy as np

def linear_loss(X, y, w, b):
    num_train = X.shape[0]
    num_feature = X.shape[1]
    # Model formula
    y_hat = np.dot(X, w) + b
    # Loss function (mean squared error)
    loss = np.sum((y_hat - y) ** 2) / num_train
    # Partial derivatives of the parameters
    # (the constant factor 2 from differentiation is folded into the learning rate)
    dw = np.dot(X.T, (y_hat - y)) / num_train
    db = np.sum((y_hat - y)) / num_train
    return y_hat, loss, dw, db
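The returned gradients can be checked with finite differences. A self-contained sketch on random toy data (the function is restated from above); note that the true gradient of the loss is exactly twice `dw`/`db`, since the code omits the factor 2 from differentiation and absorbs it into the learning rate:

```python
import numpy as np

# Restated from above
def linear_loss(X, y, w, b):
    num_train = X.shape[0]
    y_hat = np.dot(X, w) + b
    loss = np.sum((y_hat - y) ** 2) / num_train
    dw = np.dot(X.T, (y_hat - y)) / num_train
    db = np.sum((y_hat - y)) / num_train
    return y_hat, loss, dw, db

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
w = rng.normal(size=(3, 1))
b = 0.5
eps = 1e-6

_, _, dw, db = linear_loss(X, y, w, b)

# Central-difference gradient of the loss w.r.t. each component of w
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    dw_num[j] = (linear_loss(X, y, w_plus, b)[1]
                 - linear_loss(X, y, w_minus, b)[1]) / (2 * eps)
db_num = (linear_loss(X, y, w, b + eps)[1]
          - linear_loss(X, y, w, b - eps)[1]) / (2 * eps)

# The numerical gradient is twice dw/db, confirming the omitted factor 2
print(np.allclose(2 * dw, dw_num, atol=1e-4))
print(np.isclose(2 * db, db_num, atol=1e-4))
```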
- Parameter initialization:

def initialize_params(dims):
    w = np.zeros((dims, 1))
    b = 0
    return w, b
- Model training via gradient descent:

def linear_train(X, y, learning_rate, epochs):
    w, b = initialize_params(X.shape[1])
    loss_list = []
    for i in range(1, epochs):
        # Current predictions, loss, and parameter gradients
        y_hat, loss, dw, db = linear_loss(X, y, w, b)
        loss_list.append(loss)
        # Gradient-descent parameter update
        w += -learning_rate * dw
        b += -learning_rate * db
        # Print the iteration count and loss
        if i % 10000 == 0:
            print('epoch %d loss %f' % (i, loss))
    # Save the parameters
    params = {
        'w': w,
        'b': b
    }
    # Save the gradients
    grads = {
        'dw': dw,
        'db': db
    }
    return loss_list, loss, params, grads
- A simple training run on the diabetes dataset from sklearn. Data preparation:

from sklearn.datasets import load_diabetes
from sklearn.utils import shuffle

diabetes = load_diabetes()
data = diabetes.data
target = diabetes.target

# Shuffle the data
X, y = shuffle(data, target, random_state=13)
X = X.astype(np.float32)

# Simple train/test split
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))

print('X_train=', X_train.shape)
print('X_test=', X_test.shape)
print('y_train=', y_train.shape)
print('y_test=', y_test.shape)
X_train= (397, 10)
X_test= (45, 10)
y_train= (397, 1)
y_test= (45, 1)
- Run the training:

loss_list, loss, params, grads = linear_train(X_train, y_train, 0.001, 100000)
epoch 10000 loss 5611.704502
epoch 20000 loss 5258.726277
epoch 30000 loss 4960.271811
epoch 40000 loss 4707.234957
epoch 50000 loss 4492.067734
epoch 60000 loss 4308.511724
epoch 70000 loss 4151.375892
epoch 80000 loss 4016.352800
epoch 90000 loss 3899.866551
- Inspect the learned parameter values:
print(params)
{'w': array([[ 34.67484945],
[ -5.41125944],
[ 173.52846987],
[ 125.02618256],
[ 30.88182121],
[ 23.07784921],
[ -110.84261285],
[ 107.50915451],
[ 145.73740332],
[ 108.18736122]]), 'b': 155.69121346255798}
- Define a prediction function and predict on the test set:

def predict(X, params):
    w = params['w']
    b = params['b']
    y_pred = np.dot(X, w) + b
    return y_pred
y_pred = predict(X_test, params)
y_pred[:5]
array([[ 132.02545017],
[ 141.77423134],
[ 150.48352021],
[ 128.48666753],
[ 147.29312454]])
- Plot the predictions against the ground truth with matplotlib:
import matplotlib.pyplot as plt
f = X_test.dot(params['w']) + params['b']
plt.scatter(range(X_test.shape[0]), y_test)
plt.plot(f, color = 'darkorange')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
- The decrease of the loss during training:
import matplotlib.pyplot as plt
plt.plot(loss_list, color = 'blue')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()
- Wrapping everything into a linear regression class:
import numpy as np
from sklearn.utils import shuffle
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt

class lr_model():
    def __init__(self):
        pass

    # Data preparation
    def prepare_data(self):
        # Diabetes dataset
        data = load_diabetes().data
        target = load_diabetes().target
        # Shuffle the data; random_state is the random seed
        X, y = shuffle(data, target, random_state=42)
        X = X.astype(np.float32)
        y = y.reshape((-1, 1))
        data = np.concatenate((X, y), axis=1)
        return data

    # Parameter initialization
    def initialize_params(self, dims):
        w = np.zeros((dims, 1))
        b = 0
        return w, b

    # Main body of the regression model
    def linear_loss(self, X, y, w, b):
        num_train = X.shape[0]
        num_feature = X.shape[1]
        # Model formula
        y_hat = np.dot(X, w) + b
        # Loss function
        loss = np.sum((y_hat - y) ** 2) / num_train
        # Partial derivatives of the parameters
        dw = np.dot(X.T, (y_hat - y)) / num_train
        db = np.sum((y_hat - y)) / num_train
        return y_hat, loss, dw, db

    # Model training via gradient descent
    def linear_train(self, X, y, learning_rate, epochs):
        w, b = self.initialize_params(X.shape[1])
        for i in range(1, epochs):
            # Current predictions, loss, and parameter gradients
            y_hat, loss, dw, db = self.linear_loss(X, y, w, b)
            # Gradient-descent parameter update
            w += -learning_rate * dw
            b += -learning_rate * db
            # Print the iteration count and loss
            if i % 10000 == 0:
                print('epoch %d loss %f' % (i, loss))
        # Save the parameters
        params = {
            'w': w,
            'b': b
        }
        # Save the gradients
        grads = {
            'dw': dw,
            'db': db
        }
        return loss, params, grads

    # Prediction function for the validation/test set
    def predict(self, X, params):
        w = params['w']
        b = params['b']
        y_pred = np.dot(X, w) + b
        return y_pred

    # k-fold cross-validation
    def linear_cross_validation(self, data, k, randomize=True):
        if randomize:
            data = list(data)
            # sklearn's shuffle is not in-place: reassign the result
            data = shuffle(data)
        slices = [data[i::k] for i in range(k)]
        for i in range(k):
            validation = slices[i]
            train = [row
                     for s in slices if s is not validation for row in s]
            train = np.array(train)
            validation = np.array(validation)
            yield train, validation
if __name__ == '__main__':
    lr = lr_model()
    data = lr.prepare_data()
    # Collect each fold's final loss (initialized once, outside the loop)
    loss5 = []
    # Train/validation splits via 5-fold cross-validation
    for train, validation in lr.linear_cross_validation(data, 5):
        # Columns 0-10: forward slice, column 10 excluded
        X_train = train[:, :10]
        # Column -1: backward slice
        y_train = train[:, -1].reshape((-1, 1))
        X_valid = validation[:, :10]
        y_valid = validation[:, -1].reshape((-1, 1))
        # print('X_train=', X_train.shape)  # X_train= (353, 10)
        # print('X_valid=', X_valid.shape)  # X_valid= (89, 10)
        # print('y_train=', y_train.shape)  # y_train= (353, 1)
        # print('y_valid=', y_valid.shape)  # y_valid= (89, 1)
        # Run the training
        loss, params, grads = lr.linear_train(X_train, y_train, 0.001, 100000)
        # print('loss=', loss)
        # loss= 3798.9563843996484
        # print('params=', params)
        # params= {'w': array([[ 34.67484945],
        #        [ -5.41125944],
        #        [ 173.52846987],
        #        [ 125.02618256],
        #        [ 30.88182121],
        #        [ 23.07784921],
        #        [-110.84261285],
        #        [ 107.50915451],
        #        [ 145.73740332],
        #        [ 108.18736122]]), 'b': 155.69121346255798}
        # print('grads=', grads)
        # grads= {'dw': array([[-0.13523112],
        #        [ 0.21054297],
        #        [-1.25840588],
        #        [-0.86467876],
        #        [-0.03181374],
        #        [ 0.04389929],
        #        [ 0.7288374 ],
        #        [-0.60464886],
        #        [-0.98535644],
        #        [-0.65978939]]), 'db': -0.0004228883425417108}
        loss5.append(loss)
        score = np.mean(loss5)
        print('five-fold cross validation score is', score)
        y_pred = lr.predict(X_valid, params)
        valid_score = np.sum(((y_pred - y_valid) ** 2)) / len(X_valid)
        print('valid score is', valid_score)
        f = X_valid.dot(params['w']) + params['b']
        # scatter(x, y) draws circles at the positions given by vectors x and y
        plt.scatter(range(X_valid.shape[0]), y_valid)
        plt.plot(f, color='darkorange')
        plt.xlabel('X')
        plt.ylabel('y')
        plt.show()
epoch 10000 loss 5611.704502
epoch 20000 loss 5258.726277
epoch 30000 loss 4960.271811
epoch 40000 loss 4707.234957
epoch 50000 loss 4492.067734
epoch 60000 loss 4308.511724
epoch 70000 loss 4151.375892
epoch 80000 loss 4016.352800
epoch 90000 loss 3899.866551
five-fold cross validation score is 3798.9563843996484
valid score is 4214.092765494475
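The stride slice `data[i::k]` inside `linear_cross_validation` deals the rows round-robin into k folds, each fold then serving once as the validation set. A minimal illustration on toy data:

```python
data = list(range(10))
k = 5
# Row i goes to fold i % k
slices = [data[i::k] for i in range(k)]
print(slices)  # [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]

# One fold is held out for validation; the rest form the training set
validation = slices[0]
train = [x for s in slices if s is not validation for x in s]
print(sorted(train))  # [1, 2, 3, 4, 6, 7, 8, 9]
```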
References:
1. Zhou Zhihua (周志华), *Machine Learning*, Chapter 3: Linear Models