Machine Learning with scikit-learn: Linear Regression
This post covers four regression approaches: linear regression solved via the normal equation, linear regression solved via gradient descent, ridge regression, and lasso regression.
California housing dataset: fetch_california_housing
The dataset is downloaded automatically on first load; by default it is stored under C:\Users\<username>\scikit_learn_data.
DESCR
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
...
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing as fch  # California housing dataset
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import joblib
# Load the dataset
ch = fch()
x = ch.data
y = ch.target
# Inspect the dataset
print(x[0])
print(y[0])
print('-'*100)
print(ch.target_names)
print('-'*100)
print(ch.DESCR)  # 8 features, 20640 samples, 1 target
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=22)
# Preprocessing: standardize the features and the target
stds = StandardScaler()
x_train = stds.fit_transform(x_train)
x_test = stds.transform(x_test)
# Note: refitting the same scaler on y overwrites the x statistics, so every later
# inverse_transform call maps values back to the original target scale. Using two
# separate scalers (one for x, one for y) would be cleaner.
y_train = stds.fit_transform(y_train.reshape(-1, 1))
y_test = stds.transform(y_test.reshape(-1, 1))
Linear regression via the normal equation: LinearRegression
The normal equation
Regression problems typically use a least-squares loss, i.e. they minimize the sum of squared residuals.
The normal equation solves this in closed form by setting the gradient of the loss to zero, which gives w = (XᵀX)⁻¹Xᵀy.
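The closed-form solution can be checked directly with NumPy. A minimal sketch on synthetic data (the data and variable names here are illustrative, not from the housing dataset):

```python
import numpy as np

# Toy data: 100 samples, 3 features, known true weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=100)

# Prepend a bias column of ones, then solve (X^T X) w = X^T y
Xb = np.hstack([np.ones((100, 1)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w)  # first entry ≈ bias (≈ 0 here), the rest ≈ [2, -1, 0.5]
```

Using `np.linalg.solve` instead of explicitly inverting XᵀX is the numerically preferred way to evaluate the normal equation.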
# Train and evaluate the model
lr = LinearRegression()
lr.fit(x_train, y_train)
# Inspect the regression coefficients, i.e. the weights
print('Weights:', lr.coef_)
# Predict house prices
pre = lr.predict(x_test)
mse = mean_squared_error(y_test, pre)  # mean squared error: the mean of the squared differences between predictions and true values
print('MSE before inverse standardization:', mse)
pre = stds.inverse_transform(pre)
y_it_test = stds.inverse_transform(y_test)
mse = mean_squared_error(y_it_test, pre)
print('MSE after inverse standardization:', mse)
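The two MSE values are related exactly: inverse-standardizing multiplies every residual by the target's standard deviation σ_y, so the squared error scales by σ_y². A quick check on toy numbers (all names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
y_true = rng.normal(loc=2.0, scale=1.2, size=(200, 1))

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y_true)
pred_scaled = y_scaled + 0.1  # pretend predictions, offset by a constant

mse_scaled = mean_squared_error(y_scaled, pred_scaled)
mse_orig = mean_squared_error(
    scaler.inverse_transform(y_scaled),
    scaler.inverse_transform(pred_scaled),
)
# mse_orig equals mse_scaled * sigma_y^2 (scaler.scale_ holds sigma_y)
print(mse_orig, mse_scaled * scaler.scale_[0] ** 2)
```

So the two numbers printed above carry the same information, just in different units; the post-inverse value is in the original target units.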
# Save the model
joblib.dump(lr, './model/lr.pkl')
# Load the model
model = joblib.load('./model/lr.pkl')
Linear regression via gradient descent: SGDRegressor
Gradient descent
Gradient descent repeatedly computes the gradient of the loss function and updates the weights as w ← w − learning_rate × gradient, stepping against the gradient on each iteration.
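The update rule can be written out as a minimal batch gradient-descent loop for least squares (synthetic data for illustration; sklearn's SGDRegressor uses per-sample stochastic updates rather than this full-batch sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.01, size=200)

w = np.zeros(2)
eta = 0.1  # learning rate
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the mean squared error
    w -= eta * grad                        # move against the gradient
print(w)  # ≈ [1.5, -0.5]
```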
sgd = SGDRegressor(eta0=0.001, max_iter=1000, learning_rate='invscaling', early_stopping=True)
sgd.fit(x_train, y_train.ravel())  # SGDRegressor expects a 1-D target array
pre = sgd.predict(x_test)
mse = mean_squared_error(y_test, pre)  # mean squared error: mean of the squared differences between predictions and true values
print('MSE before inverse standardization:', mse)
pre = stds.inverse_transform(pre.reshape(-1, 1))
y_it_test = stds.inverse_transform(y_test)
mse = mean_squared_error(y_it_test, pre)
print('MSE after inverse standardization:', mse)
Ridge regression: Ridge, adding an L2 penalty
Ridge regression adds an L2 penalty that discourages large weights, making the model smoother and less prone to overfitting.
rg = Ridge(alpha=1.0)  # alpha: regularization strength; with L2, larger alpha shrinks the coefficients more, reducing overfitting
rg.fit(x_train, y_train)
pre = rg.predict(x_test)
mse = mean_squared_error(y_test, pre)
print('MSE before inverse standardization:', mse)
pre = stds.inverse_transform(pre.reshape(-1,1))
y_it_test = stds.inverse_transform(y_test)
mse = mean_squared_error(y_it_test, pre)
print('MSE after inverse standardization:', mse)
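How strongly alpha shrinks the weights is easy to see by fitting Ridge over a range of alphas and watching the L2 norm of the coefficients fall (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    r = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(r.coef_))
    print(f'alpha={alpha:>8}: ||w|| = {norms[-1]:.4f}')
# the coefficient norm decreases monotonically as alpha grows
```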
Lasso regression: Lasso, adding an L1 penalty
Lasso regression adds an L1 penalty to reduce model complexity; it acts like feature selection by driving some weights exactly to zero (sparse model parameters).
ls = Lasso(alpha=0.0001)  # alpha: regularization strength; with L1, larger alpha makes the weights sparser, acting like feature selection
ls.fit(x_train,y_train)
pre = ls.predict(x_test)
mse = mean_squared_error(y_test,pre)
print('MSE before inverse standardization:', mse)
pre = stds.inverse_transform(pre.reshape(-1,1))
y_it_test = stds.inverse_transform(y_test)
mse = mean_squared_error(y_it_test, pre)
print('MSE after inverse standardization:', mse)
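The sparsifying effect of the L1 penalty can be demonstrated on synthetic data where only a few features matter: as alpha grows, more coefficients become exactly zero (illustrative sketch, not the housing data):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# only the first three features actually influence the target
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

zero_counts = []
for alpha in [0.001, 0.1, 1.0]:
    l = Lasso(alpha=alpha).fit(X, y)
    zero_counts.append(int(np.sum(l.coef_ == 0)))
    print(f'alpha={alpha}: {zero_counts[-1]} of 10 coefficients are exactly 0')
```

At small alpha the fit is close to ordinary least squares with mostly nonzero weights; at large alpha the irrelevant features (and eventually the weak ones) are pruned away.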
Adding a bias column to the dataset
# Add a bias (intercept) column of ones to the feature matrix
x_train = np.hstack([np.ones((x_train.shape[0], 1)), x_train])
print(x_train.shape)
x_test = np.hstack([np.ones((x_test.shape[0], 1)), x_test])
ls = Lasso(alpha=0.0001, fit_intercept=False)  # with a manual bias column, disable the built-in intercept so the ones column takes that role
ls.fit(x_train,y_train)
pre = ls.predict(x_test)
mse = mean_squared_error(y_test,pre)
print('MSE before inverse standardization:', mse)
pre = stds.inverse_transform(pre.reshape(-1,1))
y_it_test = stds.inverse_transform(y_test)
mse = mean_squared_error(y_it_test, pre)
print('MSE after inverse standardization:', mse)
All of the code above was run in a Jupyter notebook environment.