Simple Linear Regression Algorithm
The goal is to find $a$ and $b$ such that
$$\sum_{i=1}^m (y_i - a x_i - b)^2$$
is as small as possible.
Setting the partial derivatives of this sum with respect to $a$ and $b$ to zero gives the least-squares solution:
$$a = \frac{\sum_{i=1}^m (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^m (x_i - \overline{x})^2}$$
$$b = \overline{y} - a\overline{x}$$
Using the formulas above, we can implement a simple linear regression solver ourselves:
import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        # Only handles one-dimensional feature data
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # Vectorized computation: a large speedup over an explicit Python loop
        num = (x_train - x_mean).dot(y_train - y_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        return self.a_ * x_single + self.b_
Testing the algorithm
import numpy as np
m = 100
x = np.random.random(size=m)
y = x * 2 + 3 + np.random.normal(size=m)  # true slope 2, intercept 3, plus Gaussian noise
reg = SimpleLinearRegression()
reg.fit(x, y)
reg.a_
>>>1.909471288723333
reg.b_
>>>3.0055997219091517
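As a quick sanity check (this comparison is my addition, not part of the original), the fitted coefficients can be compared with numpy's built-in least-squares fit:
# np.polyfit with degree 1 returns [slope, intercept] from its own least-squares fit
slope, intercept = np.polyfit(x, y, 1)
# slope should match reg.a_ and intercept should match reg.b_ up to floating-point error
print(slope, intercept)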
Plotting the results
import matplotlib.pyplot as plt
y_hat = reg.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat, color='r')
plt.show()
Evaluating the regression algorithm
Mean squared error (MSE): $\frac{1}{m} \sum_{i=1}^m (y_i - \widehat{y_i})^2$
from sklearn.metrics import mean_squared_error
mean_squared_error(y, y_hat)
Root mean squared error (RMSE): $\sqrt{MSE}$
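The RMSE has no dedicated function call in the code here, so one straightforward way to get it (a small sketch reusing the y and y_hat arrays from the example above) is to take the square root of the MSE:
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y, y_hat))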
Mean absolute error (MAE): $\frac{1}{m} \sum_{i=1}^m |y_i - \widehat{y_i}|$
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y, y_hat)
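For intuition, the same metrics can also be computed directly with numpy (a sketch using the y and y_hat arrays from the example above); the results should match the sklearn functions:
import numpy as np

mse = np.mean((y - y_hat) ** 2)    # matches mean_squared_error(y, y_hat)
mae = np.mean(np.abs(y - y_hat))   # matches mean_absolute_error(y, y_hat)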
R squared:
$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} = 1 - \frac{\sum_i(\widehat{y}_i - y_i)^2}{\sum_i(\overline{y} - y_i)^2} = 1 - \frac{\sum_i(\widehat{y}_i - y_i)^2 / m}{\sum_i(\overline{y} - y_i)^2 / m} = 1 - \frac{MSE}{Var}$$
The numerator describes the error made by the model's predictions. The denominator describes the error made by the model $y = \overline{y}$, i.e., the error produced when the prediction is always $\overline{y}$ no matter what $x$ is; this is also called the baseline.
The larger R squared is, the better; if the model makes no errors at all, R squared equals 1. If R squared < 0, the trained model is extremely poor, doing even worse than the baseline.
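As a small sketch (reusing the y and y_hat arrays from the simple example above), R squared can be computed either from this definition or with sklearn's r2_score; both should give the same value:
import numpy as np
from sklearn.metrics import r2_score

r2_manual = 1 - np.mean((y - y_hat) ** 2) / np.var(y)  # 1 - MSE / Var
r2_sklearn = r2_score(y, y_hat)                        # same result from scikit-learn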
Multiple Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
boston = datasets.load_boston()  # load the Boston housing data (load_boston was removed in scikit-learn 1.2, so this requires an older version)
x = boston.data
y = boston.target
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=100)
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
lin_reg.coef_  # coefficients
lin_reg.intercept_  # intercept
lin_reg.score(x_test, y_test)
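For LinearRegression, score returns the R squared on the data passed in, so it should agree with computing r2_score on the model's predictions (a small sketch using the variables defined above):
from sklearn.metrics import r2_score

y_hat_test = lin_reg.predict(x_test)
r2_score(y_test, y_hat_test)  # same value as lin_reg.score(x_test, y_test)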