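I. Multiple Linear Regression
As shown in the figure, in multiple linear regression each sample x contains several feature values, so we no longer consider only x1 but also need to consider x2, x3, x4, ..., xn. The model therefore has n+1 parameters, θ0 through θn.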
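The approach is the same as for simple linear regression, only with more features: find θ0, θ1, θ2, θ3, ..., θn such that the following expression, the sum of squared errors over the m training samples, is as small as possible.
$$\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}, \qquad \hat{y}^{(i)} = \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \dots + \theta_n x_n^{(i)}$$
We arrange θ0, θ1, θ2, θ3, ..., θn as a row vector and transpose it into a column vector θ. We then arrange x1, x2, x3, ..., xn of sample i as a row vector X_b^(i), with a constant x0 = 1 prepended so that it lines up with θ0. Rearranging the expression gives:
$$\hat{y}^{(i)} = X_b^{(i)} \cdot \theta, \qquad \hat{y} = X_b\,\theta$$
Multiple linear regression therefore becomes the problem of estimating θ such that the following matrix expression is as small as possible.
$$(y - X_b\theta)^{T}(y - X_b\theta)$$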
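For reference, the minimiser of this objective has a closed form, usually called the normal equation, and it is the expression that `fit_normal` evaluates in the next section. The derivation below is a sketch under two assumptions: X_b denotes the m × (n+1) matrix whose rows are the samples with a leading 1, and X_b^T X_b is invertible.

```latex
\begin{aligned}
J(\theta) &= (y - X_b\theta)^{T}(y - X_b\theta) \\
\nabla_{\theta} J(\theta) &= -2\,X_b^{T}(y - X_b\theta) = 0 \\
X_b^{T}X_b\,\theta &= X_b^{T}y \\
\theta &= (X_b^{T}X_b)^{-1}X_b^{T}y
\end{aligned}
```

The first component of θ is the intercept θ0 and the remaining n components are the feature coefficients.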
II. Implementing Multiple Linear Regression
1. A custom LinearRegression class
Code example:
import numpy as np
from sklearn.metrics import r2_score


class LinearRegression:

    def __init__(self):
        # Coefficients, intercept, and the full parameter vector theta
        self.coef_ = None
        self.interception_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Fit the model on X_train and y_train using the normal equation."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        # Prepend a column of ones so that theta[0] acts as the intercept
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        # Normal equation: theta = (X_b^T X_b)^-1 X_b^T y
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        # The first entry is the intercept, the rest are the coefficients
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Given the data set X_predict, return the vector of predictions."""
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Return the R^2 score of the model on X_test and y_test."""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
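`fit_normal` above inverts X_b^T X_b explicitly, which can be numerically fragile when features are strongly correlated. As a hedged illustration, the sketch below solves the same least-squares problem with `np.linalg.lstsq` and compares it with the normal equation. The helper name `fit_lstsq` and the synthetic data are assumptions made for this example, not part of the original class.

```python
import numpy as np


def fit_lstsq(X, y):
    """Return theta (intercept first) from a least-squares solver."""
    X_b = np.hstack([np.ones((len(X), 1)), X])
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta


# Synthetic data (hypothetical): 200 samples, 3 features, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([4.0, 1.5, -2.0, 0.5])  # intercept, then coefficients
y = true_theta[0] + X @ true_theta[1:] + rng.normal(scale=0.01, size=200)

# Normal equation, written the same way as in fit_normal
X_b = np.hstack([np.ones((len(X), 1)), X])
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver applied to the same problem
theta_lstsq = fit_lstsq(X, y)

print(theta_normal)
print(theta_lstsq)
```

On well-conditioned data like this the two results should agree to within floating-point error, so the choice mostly matters when X_b^T X_b is close to singular.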
2. Testing the class in Jupyter
Code example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()
x = boston.data
y = boston.target
# Drop the samples whose target is capped at 50.0
x = x[y < 50.0]
y = y[y < 50.0]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=666)
# Import the custom class defined above
from mySklearn.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
reg.score(X_test, y_test)
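The `score` call above returns the R² value computed by `r2_score`. To make that number easier to read, here is a minimal sketch of the same quantity written out by hand, R² = 1 - SS_res / SS_tot. The function name `r2_by_hand` and the toy arrays are assumptions used only for illustration.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Perfect predictions leave no residual error
perfect = r2_by_hand(y_true, y_true)
# Always predicting the mean is the baseline that R^2 compares against
baseline = r2_by_hand(y_true, np.full(4, y_true.mean()))
# Predictions that are off by 1 on two of the four samples
close = r2_by_hand(y_true, np.array([2.0, 5.0, 7.0, 10.0]))

print(perfect, baseline, close)
```

A score near 1 means the model explains most of the variance in the targets, a score of 0 means it does no better than predicting the mean, and a negative score means it does worse.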
III. Solving Regression Problems with Sklearn
1. Linear regression
Code example:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)
lin_reg.score(X_test,y_test)
2. KNN
Code example:
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
# With default parameters:
# knn_reg.fit(X_train, y_train)
# knn_reg.score(X_test, y_test)
# Define the parameter grid for the grid search
from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=1, verbose=2)
grid_search.fit(X_train, y_train)
grid_search.score(X_test, y_test)
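After a grid search it is usually worth checking which combination won and how its cross-validation score compares with the test score. The sketch below shows the relevant `GridSearchCV` attributes. It assumes synthetic data from `make_regression` and a smaller grid than the one above, purely to keep the example quick to run.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (hypothetical stand-in for the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# A reduced version of the grid, chosen only for this illustration
param_grid = [
    {"weights": ["uniform"], "n_neighbors": [1, 3, 5]},
    {"weights": ["distance"], "n_neighbors": [1, 3, 5], "p": [1, 2]},
]

grid_search = GridSearchCV(KNeighborsRegressor(), param_grid)
grid_search.fit(X_train, y_train)

# Best hyperparameters and their mean cross-validated R^2 on the training set
print(grid_search.best_params_)
print(grid_search.best_score_)

# The best estimator is refit on the whole training set by default
best_knn = grid_search.best_estimator_
print(best_knn.score(X_test, y_test))
```

Note that `best_score_` is averaged over cross-validation folds of the training data, so it generally differs from the score on the held-out test set, and only the latter is comparable with the linear regression score.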
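IV. On the Interpretability of Linear Regression
Even when the final linear regression model predicts poorly, we can still inspect the coefficient it assigns to each feature to judge what that feature means in practice.
Code example: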
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
# We are not making predictions here, so we do not need a prediction score or a test set
lin_reg.fit(x,y)
lin_reg.coef_
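Output:
The result is an array with one coefficient per feature. The sign of each coefficient tells us whether the feature is positively or negatively associated with the house price: with a positive coefficient, a larger feature value predicts a higher price, and with a negative coefficient, a larger feature value predicts a lower price. The absolute value of a coefficient indicates how strong the influence is, provided the features are on comparable scales.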
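Coefficient magnitudes depend on the units of each feature, so it can help to standardise the features before ranking them. The following sketch illustrates the effect on synthetic data. The two features, their scales, and the helper `fit_coefs` are assumptions invented for this example and are unrelated to the housing data set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two hypothetical features on very different scales
small_scale = rng.normal(0.0, 1.0, size=n)
large_scale = rng.normal(0.0, 1000.0, size=n)
X = np.column_stack([small_scale, large_scale])
y = 2.0 * small_scale + 0.01 * large_scale + rng.normal(0.0, 0.1, size=n)


def fit_coefs(X, y):
    """Least-squares fit with an intercept; return only the feature coefficients."""
    X_b = np.hstack([np.ones((len(X), 1)), X])
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta[1:]


# Coefficients in the original units
raw_coefs = fit_coefs(X, y)

# Coefficients after scaling each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coefs = fit_coefs(X_std, y)

print("raw:", raw_coefs)
print("standardised:", std_coefs)
print("ranking by |coef|:", np.argsort(-np.abs(std_coefs)))
```

In this setup the raw coefficient of the second feature is tiny only because its values are large. After standardisation its coefficient becomes the larger of the two, which better reflects how much the feature moves the prediction across its typical range.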