ŷ = a_0 + a_1 x_1 + a_2 x_2 + … + a_n x_n
ŷ^(i) = a_0 + a_1 x_1^(i) + a_2 x_2^(i) + … + a_n x_n^(i)
(the superscript (i) denotes the i-th sample)
Objective: make the following as small as possible:
∑_i ( y^(i) − ŷ^(i) )²
Let a = (a_0, a_1, a_2, …, a_n)^T, and introduce a constant feature x_0^(i) = 1 so that the prediction becomes
ŷ^(i) = a_0 x_0^(i) + a_1 x_1^(i) + a_2 x_2^(i) + … + a_n x_n^(i)
x^(i) = (x_0^(i), x_1^(i), x_2^(i), …, x_n^(i))^T
ŷ^(i) = x^(i) · a
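The dot-product form of the prediction can be checked numerically; a minimal sketch with made-up numbers (the coefficient values here are hypothetical, not from the notes):

```python
import numpy as np

# hypothetical coefficient vector a = (a0, a1, ..., an)^T, here n = 3
a = np.array([2.0, 0.5, -1.0, 3.0])   # a0 plays the role of the intercept

# one sample x^(i) with the constant feature x0 = 1 prepended
x_i = np.array([1.0, 4.0, 2.0, 0.5])

# ŷ^(i) = x^(i) · a
y_hat = x_i @ a
print(y_hat)   # 2 + 0.5*4 - 1*2 + 3*0.5 = 3.5
```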
== The normal-equation solution of multiple linear regression ==
a = (X_b^T X_b)^(−1) X_b^T y, where X_b is the sample matrix X with a column of ones prepended.
Advantage: no normalization of the data is required.
In the vector a, a_0 is the intercept and a_1 through a_n are the coefficients.
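The normal-equation fit can be sketched directly in NumPy (synthetic, noise-free data and variable names are my own, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_a = np.array([5.0, 1.0, -2.0, 3.0])   # intercept plus 3 coefficients
y = true_a[0] + X @ true_a[1:]             # noise-free for clarity

# prepend the constant column x0 = 1
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# normal equation: a = (X_b^T X_b)^(-1) X_b^T y
a = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

intercept, coef = a[0], a[1:]
print(intercept, coef)   # recovers 5.0 and [1.0, -2.0, 3.0]
```

In practice `np.linalg.lstsq` (or `scipy.linalg.lstsq`) is preferred over forming the explicit inverse, for numerical stability.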
Linear regression in scikit-learn
In [531]: from sklearn.linear_model import LinearRegression
In [532]: lin_reg = LinearRegression()
In [533]: lin_reg.fit(X_train,y_train)
In [534]: lin_reg.coef_ # coefficients
In [535]: lin_reg.intercept_ # intercept
In [537]: lin_reg.score(X_test,y_test)
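The transcript omits the data setup. A self-contained version of the same steps (a synthetic dataset from `make_regression` stands in for the housing data the notebook uses; the split parameters are my own):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the notebook's housing data
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print(lin_reg.coef_)                  # coefficients a1..an
print(lin_reg.intercept_)             # intercept a0
print(lin_reg.score(X_test, y_test))  # R^2 on the test set
```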
KNN Regressor
In [540]: from sklearn.neighbors import KNeighborsRegressor
In [541]: knn_reg = KNeighborsRegressor()
In [542]: knn_reg.fit(X_train,y_train)
Out[542]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
In [543]: knn_reg.score(X_test,y_test)
Out[543]: 0.5865412198300899
KNN regression with grid search
In [546]: from sklearn.model_selection import GridSearchCV
In [547]: knn_reg = KNeighborsRegressor()
In [549]: grid_search = GridSearchCV(knn_reg,param_grid,n_jobs=-1,verbose=1)
In [550]: grid_search.fit(X_train,y_train)
In [551]: grid_search.best_params_
Out[551]: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
In [552]: grid_search.score(X_test,y_test)
Out[552]: 0.7044357727037996
In [553]: grid_search.best_score_
Out[553]: 0.634093080186858
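The transcript never shows the `param_grid` definition (input 548 is missing); a grid consistent with the reported `best_params_` might look like the sketch below, with synthetic stand-in data. Note that `best_score_` is the mean cross-validated score on the training folds, while `score(X_test, y_test)` evaluates the refitted best estimator on held-out data, which is why the two numbers differ.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# hypothetical grid: 'distance' weighting additionally searches
# the Minkowski power p
param_grid = [
    {'weights': ['uniform'], 'n_neighbors': list(range(1, 11))},
    {'weights': ['distance'], 'n_neighbors': list(range(1, 11)),
     'p': list(range(1, 6))},
]

grid_search = GridSearchCV(KNeighborsRegressor(), param_grid,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))  # test-set R^2 of the best estimator
print(grid_search.best_score_)            # mean CV score on the training folds
```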
More thoughts on linear regression
Linear regression is interpretable with respect to the data:
# feature names ordered by coefficient, from smallest to largest
In [554]: boston.feature_names[np.argsort(lin_reg.coef_)]
Out[554]:
array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
'B', 'ZN', 'CHAS', 'RAD', 'RM'], dtype='<U7')
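The indexing trick above works because `np.argsort` returns the permutation that sorts the coefficients in ascending order; applying it to the feature-name array lists features from most negative to most positive influence. A small self-contained illustration (the names and coefficient values here are made up for the example):

```python
import numpy as np

# hypothetical feature names and fitted coefficients
feature_names = np.array(['RM', 'NOX', 'CRIM', 'CHAS'])
coef = np.array([3.8, -17.8, -0.1, 2.7])

# indices that sort coef ascending, used to reorder the names
print(feature_names[np.argsort(coef)])   # ['NOX' 'CRIM' 'CHAS' 'RM']
```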