Machine Learning Series Notes 4: The Linear Regression Algorithm
Introduction
Linear regression is a classic algorithm for solving regression problems. Its main characteristics:
- solves regression problems
- simple idea, easy to implement
- the foundation of many powerful nonlinear models
- results are highly interpretable
- by analyzing the data and building the model, the results of linear regression can teach us genuine knowledge about the real world
- embodies many important ideas in machine learning
Take house-price prediction as an example.
Assuming that house area and house price have some linear relationship, we can look for a "straight line" that best "fits" the relationship between the sample feature (house area) and the sample output label (house price).
Once we know the parameters of this line, say $y = x + 100$, we can use this mapping from area to price to predict prices: for a new house of $250\,m^2$, the prediction is $250 + 100 = 350$ (in units of 10,000 yuan).
Linear regression with only a single sample feature is called simple linear regression.
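This toy mapping can be sketched directly. Note that the line $y = x + 100$ is the assumed example above, not a fitted model, and `predict_price` is a name introduced here purely for illustration:

```python
def predict_price(area_m2, a=1.0, b=100.0):
    """Price (in units of 10,000 yuan) under the assumed line y = a*x + b."""
    return a * area_m2 + b

print(predict_price(250))  # → 350.0
```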
Suppose we have found the best-fitting line $y = ax + b$. Then for each sample point $x^{(i)}$, the line gives the prediction $\hat{y}^{(i)} = ax^{(i)} + b$, while the true value is $y^{(i)}$.
We define the gap between $\hat{y}^{(i)}$ and $y^{(i)}$ as:

$$\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
Considering all samples:

$$\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
We want $\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ to be as small as possible. Substituting $\hat{y}^{(i)} = ax^{(i)} + b$ into it gives:

$$\sum_{i=1}^{m}\left(y^{(i)} - ax^{(i)} - b\right)^2$$
Variants of this function come in two forms: a loss function (measuring how poorly the model fits the data) or a utility function (measuring how well it fits). This yields the general recipe:
- by analyzing the problem, determine its loss function or utility function;
- by optimizing that loss or utility function, obtain the regression model.
Nearly all parametric learning algorithms follow this same pattern:
- linear regression
- polynomial regression
- logistic regression
- SVM
- neural networks
- …
Precisely because most machine-learning algorithms share this pattern, the discipline of optimization is extremely important; within it, the subfield of convex optimization deals with a particularly tractable class of optimization problems.
Least Squares
Returning to the simple linear-regression house-price problem, our goal is to make $\sum_{i=1}^{m}(y^{(i)} - ax^{(i)} - b)^2$ as small as possible; the standard tool for this is the method of least squares.
Derivation of the least-squares solution
Since the variables in our objective $\sum_{i=1}^{m}(y^{(i)} - ax^{(i)} - b)^2$ are $a$ and $b$, define

$$J(a,b) = \sum_{i=1}^{m}\left(y^{(i)} - ax^{(i)} - b\right)^2$$
At a minimum of $J$, both partial derivatives vanish, so we set up the equations:

$$\frac{\partial J(a,b)}{\partial a} = 0, \qquad \frac{\partial J(a,b)}{\partial b} = 0$$
Applying the chain rule and solving gives:

$$b = \bar{y} - a\bar{x}$$

$$a = \frac{\sum_{i=1}^{m}\left(x^{(i)}y^{(i)} - x^{(i)}\bar{y}\right)}{\sum_{i=1}^{m}\left(\left(x^{(i)}\right)^2 - \bar{x}x^{(i)}\right)}$$
The expression for $a$ can be simplified. First note the identity:

$$\sum_{i=1}^{m} x^{(i)}\bar{y} = \bar{y}\sum_{i=1}^{m} x^{(i)} = m\,\bar{y}\bar{x} = \bar{x}\sum_{i=1}^{m} y^{(i)} = \sum_{i=1}^{m}\bar{x}y^{(i)} = \sum_{i=1}^{m}\bar{x}\bar{y}$$
Hence $a$ simplifies as follows:

$$a = \frac{\sum_{i=1}^{m}\left(x^{(i)}y^{(i)} - x^{(i)}\bar{y} - \bar{x}y^{(i)} + \bar{x}\bar{y}\right)}{\sum_{i=1}^{m}\left(\left(x^{(i)}\right)^2 - \bar{x}x^{(i)} - \bar{x}x^{(i)} + \bar{x}^2\right)} = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2}$$
Finally, the derivation yields:

$$a = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2}, \qquad b = \bar{y} - a\bar{x}$$
Implementing simple linear regression
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1.,2.,3.,4.,5.])
y = np.array([1.,3.,2.,3.,5.])
plt.scatter(x,y)
plt.axis([0,6,0,6])
plt.show()
$$a = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2}, \qquad b = \bar{y} - a\bar{x}$$
x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0  # numerator
d = 0.0    # denominator
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num / d
b = y_mean - a * x_mean
a
0.8
b
0.39999999999999947
Visualizing the prediction:
y_hat = a*x+b
plt.scatter(x,y)
plt.plot(x,y_hat,color='r')
# prediction
x_predict = 6
y_predict = a * x_predict+b
plt.scatter(x_predict,y_predict,color='green')
plt.show()
print(y_predict)
5.2
The red line is the fitted line obtained with simple linear regression, and the green point is the prediction for the input x = 6.
A custom SimpleLinearRegression
import numpy as np
class SimpleLinearRegression1(object):
    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the SLR model on the training set."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only process single-feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = 0.0  # numerator
        d = 0.0    # denominator
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i - x_mean) * (y_i - y_mean)
            d += (x_i - x_mean) ** 2
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a test set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only process single-feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single datum x_single, return its predicted value."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"
Testing the custom SLR1
from relatedCode.SimpleLinearRegression import SimpleLinearRegression1
reg1 = SimpleLinearRegression1()
reg1.fit(x,y)
SimpleLinearRegression1()
reg1.predict(np.array([x_predict]))
array([5.2])
Prediction result:
y_hat1 = reg1.predict(x)
plt.scatter(x,y)
plt.plot(x,y_hat1,color='r')
plt.show()
Vectorized implementation of SimpleLinearRegression
Suppose we have two vectors:

$$w = \left(w^{(1)}, w^{(2)}, \dots, w^{(m)}\right), \qquad v = \left(v^{(1)}, v^{(2)}, \dots, v^{(m)}\right)$$

Then

$$\sum_{i=1}^{m} w^{(i)} \cdot v^{(i)} = w \cdot v$$

We say an expression of this form admits vectorized computation.
Since in $a = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^2}$ both the numerator and the denominator have exactly this dot-product form in $x$ and $y$, we can improve SimpleLinearRegression1 so that $a$ and $b$ are no longer computed with a for loop.
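As a quick illustration (a minimal sketch with made-up numbers), the explicit loop and NumPy's dot product compute the same quantity:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

loop_sum = sum(w_i * v_i for w_i, v_i in zip(w, v))  # Σ w(i)·v(i), element by element
vectorized = w.dot(v)                                # w · v, a single vectorized call

print(loop_sum, vectorized)  # 32.0 32.0
```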
class SimpleLinearRegression2(object):
    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the SLR model on the training set."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only process single-feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # Replace the for loop of SimpleLinearRegression1 with vector dot products:
        num = (x_train - x_mean).dot(y_train - y_mean)  # numerator
        d = (x_train - x_mean).dot(x_train - x_mean)    # denominator
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a test set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only process single-feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single datum x_single, return its predicted value."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"
Comparing the running efficiency of the two implementations
from relatedCode.SimpleLinearRegression import SimpleLinearRegression2
reg2 = SimpleLinearRegression2()
reg2.fit(x,y)
SimpleLinearRegression2()
y_hat2 = reg2.predict(x)
plt.scatter(x,y)
plt.plot(x,y_hat2,color='r')
plt.show()
The result is identical to SLR1; next, benchmark the performance.
- Generate a data set:
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2.0 + 3.0 + np.random.normal(0,1,size=m)
- Compare the training time of the two implementations with the `%timeit` magic command:
%timeit reg1.fit(big_x,big_y)
%timeit reg2.fit(big_x,big_y)
781 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.8 ms ± 552 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
reg1.a_
1.9994703899928086
reg2.a_
1.9994703899928812
reg1.b_
2.9997681141058745
reg2.b_
2.999768114105838
Metrics for evaluating regression algorithms
From the discussion above, the natural criterion for evaluating linear regression is its loss. But looking at

$$\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^2$$

we see that its magnitude depends on $m$, the number of samples, which makes it unsuitable for comparing across data sets. Hence the mean squared error (MSE):

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^2$$
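To make the dependence on $m$ concrete, here is a small sketch (toy numbers chosen purely for illustration): duplicating a test set doubles the summed squared error, while the MSE is unchanged:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

sse = np.sum((y_true - y_pred) ** 2)  # depends on the number of samples
mse = sse / len(y_true)               # normalized by m

# Duplicate every sample: same model quality, twice the data.
y_true2 = np.concatenate([y_true, y_true])
y_pred2 = np.concatenate([y_pred, y_pred])
sse2 = np.sum((y_true2 - y_pred2) ** 2)
mse2 = sse2 / len(y_true2)

print(sse2 == 2 * sse, mse2 == mse)  # True True
```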
Likewise, to keep the error in the same units as $y$ itself, we take the square root of the MSE, giving the root mean squared error (RMSE):

$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^2}$$
Similarly, another criterion is the mean absolute error (MAE):

$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|$$
Unlike classification accuracy, which lives on a fixed scale (1 is best, 0 is worst), none of these three metrics is bounded, so they are hard to interpret across problems.
Next, implement these three metrics in code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
Load the Boston housing data (note: `load_boston` was removed from scikit-learn in version 1.2, so an older version is assumed here):
boston = datasets.load_boston()
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
x = boston.data[:,5] # use only the RM feature (number of rooms)
x.shape
(506,)
y = boston.target
y.shape
(506,)
plt.scatter(x,y)
plt.show()
The scatter plot shows many points piled up at y = 50: the prices are capped at 50, so these censored samples need to be removed.
np.max(y)
50.0
x = x[y<50.0]
y = y[y<50.0]
plt.scatter(x,y)
plt.show()
Applying simple linear regression
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=666)
x_train.shape
(392,)
x_test.shape
(98,)
from relatedCode.SimpleLinearRegression import SimpleLinearRegression2
reg = SimpleLinearRegression2()
reg.fit(x_train,y_train)
SimpleLinearRegression2()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
plt.scatter(x_train,y_train)
plt.plot(x_train,reg.predict(x_train),color='r')
plt.show()
The red line shows the model's fit; now use the model to predict on the test set.
y_predict = reg.predict(x_test)
MSE
mse_test = np.sum((y_predict-y_test)**2) / len(y_test)
mse_test
24.156602134387438
RMSE
rmse_test = np.sqrt(mse_test)
rmse_test
4.914936635846635
MAE
mae_test = np.sum(np.absolute(y_predict-y_test))/len(y_test)
mae_test
3.5430974409463873
Wrap the three error computations into metrics.py:
import numpy as np
from math import sqrt

def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)

def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict."""
    return sqrt(mean_squared_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict."""
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
from relatedCode.metrics import mean_squared_error
from relatedCode.metrics import root_mean_squared_error
from relatedCode.metrics import mean_absolute_error
mse = mean_squared_error(y_test,y_predict)
rmse = root_mean_squared_error(y_test,y_predict)
mae = mean_absolute_error(y_test,y_predict)
print("mean_squared_error:{}".format(mse))
print("root_mean_squared_error:{}".format(rmse))
print("mean_absolute_error:{}".format(mae))
mean_squared_error:24.156602134387438
root_mean_squared_error:4.914936635846635
mean_absolute_error:3.5430974409463873
MSE and MAE in scikit-learn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
mse_skl = mean_squared_error(y_test,y_predict)
mae_skl = mean_absolute_error(y_test,y_predict)
print("mean_squared_error of sklearn:{}".format(mse_skl))
print("mean_absolute_error of sklearn:{}".format(mae_skl))
mean_squared_error of sklearn:24.156602134387438
mean_absolute_error of sklearn:3.5430974409463873
R Squared
Because the three metrics above all lack the fixed 0-to-1 scale that classification accuracy enjoys, we introduce R squared, a near-ideal metric for evaluating linear regression:
$$R^2 = 1 - \frac{\sum_i\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{\sum_i\left(\bar{y} - y^{(i)}\right)^2}$$
In the ratio $\frac{\sum_i\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{\sum_i\left(\bar{y} - y^{(i)}\right)^2}$:
- the numerator is the error produced by our model's predictions;
- the denominator is the error produced by always predicting $y = \bar{y}$; this constant predictor $y = \bar{y}$ is called the baseline model.
Hence R Squared has the following properties:
- $R^2 \le 1$
- the larger $R^2$, the better; a model that makes no errors attains the maximum value $R^2 = 1$
- when our model performs exactly as well as the baseline model, $R^2 = 0$
- if $R^2 < 0$, our model is worse than the baseline (predicting the mean); most likely the data has no linear relationship at all.
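These properties can be verified with a small sketch (toy numbers chosen purely for illustration): a perfect model gets 1, the mean baseline gets 0, and anti-correlated predictions go negative:

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, as defined above."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
baseline = np.full_like(y, np.mean(y))  # always predict the mean
bad = np.array([4.0, 3.0, 2.0, 1.0])    # anti-correlated predictions

print(r2(y, y))         # 1.0  (perfect model)
print(r2(y, baseline))  # 0.0  (baseline model)
print(r2(y, bad))       # negative (worse than the baseline)
```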
With a slight transformation, $R^2$ can be rewritten as:

$$R^2 = 1 - \frac{\sum_i\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{\sum_i\left(\bar{y} - y^{(i)}\right)^2} = 1 - \frac{\left(\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2\right)/m}{\left(\sum_{i=1}^{m}\left(y^{(i)} - \bar{y}\right)^2\right)/m} = 1 - \frac{MSE\left(\hat{y}, y\right)}{Var\left(y\right)}$$
In code:
1 - mean_squared_error(y_test,y_predict)/np.var(y_test)
0.6129316803937322
Wrap this computation into a method in metrics:
def r2_score(y_true, y_predict):
    """Compute the R Square between y_true and y_predict."""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
from relatedCode.metrics import r2_score
r2_score(y_test,y_predict)
0.6129316803937322
sklearn provides the same function for us to call:
from sklearn.metrics import r2_score
r2_score(y_true=y_test,y_pred=y_predict)
0.6129316803937324
In fact, sklearn's `score` method computes exactly this R^2:
method | description
---|---
`fit(self, X, y[, sample_weight])` | Fit linear model.
`get_params(self[, deep])` | Get parameters for this estimator.
`predict(self, X)` | Predict using the linear model.
`score(self, X, y[, sample_weight])` | Return the coefficient of determination R^2 of the prediction.
`set_params(self, **params)` | Set the parameters of this estimator.
Implementing multiple linear regression
For multiple linear regression, the objective is still to make $\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$ as small as possible; only now $\hat{y}^{(i)}$ has a more complex expression:
$$\hat{y}^{(i)} = \theta_0 + \theta_1 X_1^{(i)} + \theta_2 X_2^{(i)} + \dots + \theta_n X_n^{(i)}$$
So the final goal is to find $\theta_0, \theta_1, \dots, \theta_n$ making $\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$ as small as possible. Let

$$\theta = \left(\theta_0, \theta_1, \dots, \theta_n\right)^T$$
$$\hat{y}^{(i)} = \theta_0 X_0^{(i)} + \theta_1 X_1^{(i)} + \theta_2 X_2^{(i)} + \dots + \theta_n X_n^{(i)}$$
- where $X_0^{(i)} \equiv 1$; also let $X^{(i)} = \left(X_0^{(i)}, X_1^{(i)}, X_2^{(i)}, \dots, X_n^{(i)}\right)$
This gives the vectorized form of $\hat{y}^{(i)}$:

$$\hat{y}^{(i)} = X^{(i)} \cdot \theta$$
where $X^{(i)}$ holds all the features of a single sample. Stacking all the sample vectors into one matrix:
$$X = \begin{bmatrix} 1 & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)} \end{bmatrix} = \begin{bmatrix} X^{(1)} \\ X^{(2)} \\ \vdots \\ X^{(m)} \end{bmatrix}$$
Hence:

$$\hat{y} = X \cdot \theta$$
The regression objective thus changes from

$$\min \left\{\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2\right\}$$

to

$$\min \left\{\left(y - X\theta\right)^T\left(y - X\theta\right)\right\}$$
Differentiating the objective with respect to $\theta$ (matrix calculus) and setting the gradient to zero yields the $\theta$ that minimizes the prediction error, known as the normal equation of multiple linear regression:

$$\theta = \left(X^T X\right)^{-1} X^T y$$
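As a quick check (synthetic data and names chosen purely for illustration), the normal equation recovers known coefficients exactly on noise-free data:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 100, 3
X_raw = rng.random((m, n))
true_theta = np.array([4.0, 2.0, -1.0, 0.5])       # intercept + 3 coefficients
X = np.hstack([np.ones((m, 1)), X_raw])            # prepend the column of ones (X0 ≡ 1)
y = X.dot(true_theta)                              # noise-free targets

theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)  # the normal equation
print(np.allclose(theta, true_theta))              # True
```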
A custom multiple linear regression model
import numpy as np
from .metrics import r2_score
class LinearRegression:
    def __init__(self):
        """Initialize the Linear Regression model."""
        self.coef_ = None          # coefficients
        self.interception_ = None  # intercept
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Train the linear regression model on the training set via the normal equation."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        X = np.hstack([np.ones(shape=(len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_train)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_pre):
        """Given a prediction set, return the vector of predictions."""
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_pre.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"
        X = np.hstack([np.ones(shape=(len(X_pre), 1)), X_pre])
        return X.dot(self._theta)

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        return r2_score(y_true=y_test, y_predict=y_predict)

    def __repr__(self):
        return "LinearRegression()"
Testing
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from relatedCode.LinearRegression import LinearRegression
from relatedCode.model_selection import train_test_split
if __name__ == '__main__':
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    reg = LinearRegression()
    reg.fit_normal(X_train, y_train)
    print("linear regression coefficients:", reg.coef_)
    print("linear regression intercept:", reg.interception_)
    score = reg.score(X_test, y_test)
    print("R^2 on the test set:", score)
Output:
linear regression coefficients: [-1.20354261e-01 3.64423279e-02 -3.61493155e-02 5.12978140e-02
-1.15775825e+01 3.42740062e+00 -2.32311760e-02 -1.19487594e+00
2.60101728e-01 -1.40219119e-02 -8.35430488e-01 7.80472852e-03
-3.80923751e-01]
linear regression intercept: 34.11739972322438
R^2 on the test set: 0.8129794056212711
Linear regression in scikit-learn
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from relatedCode.model_selection import train_test_split
def prepareData():
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    return train_test_split(X, y, seed=666)

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = prepareData()
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)
    score = lin_reg.score(X_test, y_test)
    print("R^2 of scikit-learn's linear model:", score)
Output:
R^2 of scikit-learn's linear model: 0.8129794056212809
KNN Regressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import datasets
from relatedCode.model_selection import train_test_split
def prepareData():
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    return train_test_split(X, y, seed=666)

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = prepareData()
    knn_reg = KNeighborsRegressor()
    knn_reg.fit(X_train, y_train)
    knn_reg_score = knn_reg.score(X_test, y_test)
    print("R^2 of the KNN Regressor:", knn_reg_score)
Output:
R^2 of the KNN Regressor: 0.5865412198300899
Improving the KNN regressor with grid search:
from sklearn.neighbors import KNeighborsRegressor
from sklearn import datasets
from relatedCode.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
def prepareData():
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50.0]
    y = y[y < 50.0]
    return train_test_split(X, y, seed=666)

def searchParams(knn_reg, X_train, y_train):
    param_grid = [
        {
            "weights": ["uniform"],
            "n_neighbors": [i for i in range(1, 11)]
        },
        {
            "weights": ["distance"],
            "n_neighbors": [i for i in range(1, 11)],
            "p": [i for i in range(1, 6)]
        }
    ]
    grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_estimator_

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = prepareData()
    knn_reg = searchParams(KNeighborsRegressor(), X_train, y_train)
    knn_reg_score = knn_reg.score(X_test, y_test)
    print("R^2 of the KNN Regressor:", knn_reg_score)
Output:
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 1.2s finished
R^2 of the KNN Regressor: 0.7160666820548708
Interpretability of linear regression
from sklearn import datasets
from sklearn.linear_model import LinearRegression
import numpy as np
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
lin_reg = LinearRegression()
lin_reg.fit(X, y)
sorted_coef_index = np.argsort(lin_reg.coef_)
print("coefficient indices sorted by value:", sorted_coef_index)
print("features by influence on price (ascending):", boston.feature_names[sorted_coef_index])
print(boston.DESCR)
Output:
coefficient indices sorted by value: [ 4 7 10 12 0 2 6 9 11 1 8 3 5]
features by influence on price (ascending): ['NOX' 'DIS' 'PTRATIO' 'LSTAT' 'CRIM' 'INDUS' 'AGE' 'TAX' 'B' 'ZN' 'RAD'
'CHAS' 'RM']
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
...
Comparing the sorted coefficients against the feature descriptions, the result matches everyday intuition: the way real-world prices respond to each factor agrees with what linear regression learned.
For example, NOX, the nitric-oxides concentration in the air around the house, is negatively correlated with price, while RM, the average number of rooms per dwelling, is positively correlated.
This is what we call the interpretability of linear regression: from the trained parameters of the model, we can read off how strongly each factor influences the target.
Summary of the linear regression algorithm
- Linear regression is a typical parametric learning algorithm, whereas KNN is non-parametric.
- It can only solve regression problems, though it is the basis of many classification methods (such as logistic regression); KNN, by contrast, can solve both classification and regression problems.
- Prerequisite for use: it assumes the data is linear; KNN makes no assumptions about the data.
- Advantage: strong interpretability. It is a white-box algorithm: from the trained model we can learn genuine knowledge.
References
liuyubobo:https://github.com/liuyubobobo/Play-with-Machine-Learning-Algorithms
liuyubobo:https://coding.imooc.com/class/169.html
莫烦Python:https://www.bilibili.com/video/BV1xW411Y7Qd?from=search&seid=11773783205368546023