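PM2.5 Prediction Based on Regression Analysis
Background
The BlackFriday data set provides factors related to product sales (Purchase), including Gender, Age, City_Category, Stay_In_City, Stay_In_Current_City_Years, Marital_Status, and Product_Category, all of which are categorical variables. Split the original data set into a training set (80%) and a test set (20%), and build a model to predict product sales.
Data Preprocessing
- Import libraries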
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import warnings
# Suppress warnings to keep the output readable
warnings.filterwarnings('ignore')
from sklearn import linear_model
- Load the data
df=pd.read_csv('BlackFriday')
- Inspect the data
df.columns
['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Purchase']
df.dtypes
- Handle missing values
1. Check for missing values
df.isna().sum()
2. Drop unused columns
df.drop('User_ID',axis=1,inplace=True)
df.drop('Product_ID',axis=1,inplace=True)
- Change data types
df['Product_Category_1']=df['Product_Category_1'].astype('str')
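The cast matters because, by default, `pd.get_dummies` one-hot encodes only object, string, and category columns and passes numeric columns through unchanged. A minimal sketch on a made-up frame (the column name `code` and its values are illustrative assumptions, not taken from the BlackFriday data):

```python
import pandas as pd

# Toy frame (assumption): `code` holds category labels stored as integers.
toy = pd.DataFrame({"code": [1, 2, 3, 1]})

# Numeric column: get_dummies leaves it as one numeric column.
encoded_numeric = pd.get_dummies(toy)

# After casting to str, the same column is one-hot encoded.
encoded_str = pd.get_dummies(toy.astype({"code": str}))

print(list(encoded_numeric.columns))
print(list(encoded_str.columns))
```

Any other integer-coded categorical column would need the same cast before encoding if it is meant to be treated as categorical.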
- Split into training and test sets
y=df['Purchase']
x=df.drop('Purchase',axis=1)
x=pd.get_dummies(x)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 1)
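Lasso Regression
Requirements
Build a Lasso regression model and use grid search with 5-fold cross-validation to determine the optimal penalty factor; with the optimal penalty factor, evaluate the prediction accuracy of Lasso regression on the training set and the test set.
Model parameters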
from sklearn.linear_model import Lasso
lasso=Lasso()
lasso.get_params()
{'alpha': 1.0,
'copy_X': True,
'fit_intercept': True,
'max_iter': 1000,
'normalize': False,
'positive': False,
'precompute': False,
'random_state': None,
'selection': 'cyclic',
'tol': 0.0001,
'warm_start': False}
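Of these parameters, `alpha` is the penalty factor: it weights the L1 penalty on the coefficients, so larger values push more coefficients to exactly zero. A rough sketch on synthetic data (the data set and the three alpha values are assumptions chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 10 features, only 3 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Count the non-zero coefficients for a small, a moderate and a huge penalty.
nonzero_counts = {}
for alpha in [0.01, 1.0, 1e6]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a very large penalty every coefficient is shrunk to zero and the model predicts only the intercept, which is why alpha has to be tuned rather than fixed in advance.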
Grid search for the optimal penalty factor
lasso=Lasso()
parameters={'alpha':np.arange(0.1,1,0.1)}
lasso_cv=GridSearchCV(lasso,param_grid=parameters,cv=5)
lasso_cv.fit(x,y)
print(lasso_cv.best_params_)
print(lasso_cv.best_score_)
best_params_: {'alpha': 0.1}
best_score_: 0.628258
lasso=Lasso()
parameters={'alpha':np.arange(0.01,0.1,0.01)}
lasso_cv=GridSearchCV(lasso,param_grid=parameters,cv=5)
lasso_cv.fit(x,y)
print(lasso_cv.best_params_)
print(lasso_cv.best_score_)
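best_params_: {'alpha': 0.01}
best_score_: 0.628261
Further tuning brings no noticeable improvement (the best cross-validated R² only rises from 0.628258 to 0.628261), so tuning stops here with the optimal parameter alpha=0.01.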
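The searches above are fitted on the full data set (x, y). A common alternative is to run the search on the training split only, so that the test split plays no part in choosing alpha. A minimal sketch on synthetic data (the data, the grid, and all variable names here are illustrative assumptions, not the author's setup):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem standing in for the real data (assumption).
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# 5-fold grid search over a small alpha grid, using the training split only.
alphas = [0.01, 0.1, 1.0]
search = GridSearchCV(Lasso(), param_grid={"alpha": alphas}, cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["alpha"]
# GridSearchCV refits the best model on the whole training split by default,
# so score() here is the R² of that model on held-out data.
test_r2 = search.score(X_te, y_te)
print(best_alpha, round(test_r2, 3))
```

Because the test split is never seen during the search, the final test score is a cleaner estimate of how the tuned model generalizes.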
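Train the model with the optimal penalty factor
lasso=Lasso(alpha=0.01)  # optimal alpha found by the grid search
lasso.fit(x_train,y_train)
print(lasso.score(x_train,y_train))  # R² on the training set
print(lasso.score(x_test,y_test))  # R² on the test set
Training-set R²: 0.631
Test-set R²: 0.628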
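Model performance
- Root mean squared error (RMSE)
y_pre=lasso.predict(x_test)  # predictions on the test set
y_hat=lasso.predict(x_train)  # predictions on the training set
# RMSE = sqrt(sum of squared errors / number of samples)
rmse_lasso=((y_hat-y_train).T.dot(y_hat-y_train)/len(y_train))**(0.5)  # training set
rfmse_lasso=((y_pre-y_test).T.dot(y_pre-y_test)/len(y_test))**(0.5)  # test set
rmse_lasso=3026
rfmse_lasso=3030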
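As a sanity check, the dot-product formula can be compared with `mean_squared_error` from scikit-learn on a few hand-made numbers (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hand-made targets and predictions (assumption, for illustration only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Same dot-product formula as in the text: sqrt(sum of squared errors / n).
residual = y_pred - y_true
rmse_manual = (residual.T.dot(residual) / len(y_true)) ** 0.5

# Library route: square root of the mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

The residuals are -1, 0, 1, and 2, so the mean squared error is 6/4 = 1.5 and both routes give its square root.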
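- Plot
# Sample 1500 points for plotting
df_random=df.sample(1500)  # with too small a sample, some category of some feature may be missing, so get_dummies would produce fewer columns
x_rand=df_random.drop('Purchase',axis=1)
x_rand=pd.get_dummies(x_rand).reindex(columns=x.columns,fill_value=0)  # align columns with the training features
y_rand=df_random['Purchase']
y_rand_pre=lasso.predict(x_rand)  # predict with the model fitted above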
plt.figure()
plt.plot(list(range(len(y_rand_pre))),y_rand,color='blue',label='label')
plt.scatter(list(range(len(y_rand_pre))),y_rand_pre,color='red',label='predict')
plt.title('blackfriday')
plt.legend()
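The comment on the sampling line points at a real risk: a sample that lacks some category yields fewer dummy columns than the model was trained on. One way to guard against this, sketched on a toy frame (column names and values are illustrative assumptions), is to reindex the encoded sample to the full column list:

```python
import pandas as pd

# Toy data (assumption): the full frame contains three cities.
full = pd.DataFrame({"City": ["A", "B", "C", "A"], "Gender": ["F", "M", "F", "M"]})
full_cols = pd.get_dummies(full).columns

# A small sample that happens to contain only city "A".
subset = full.iloc[[0, 3]]
subset_encoded = pd.get_dummies(subset)

# Add the missing dummy columns, filled with 0, in the original column order.
aligned = subset_encoded.reindex(columns=full_cols, fill_value=0)

print(subset_encoded.shape[1], aligned.shape[1])
```

After alignment the sample has exactly the columns the model expects, regardless of how many rows are drawn.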
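Decision Tree
Requirements
Decision trees can also be used for regression. Following the documentation of the DecisionTreeRegressor class in the sklearn.tree module, build a model to predict product sales (the model still needs tuning), and evaluate the prediction accuracy of the selected best model on the training set and the test set.
Model parameters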
from sklearn.tree import DecisionTreeRegressor
tree=DecisionTreeRegressor()
tree.get_params()
{'criterion': 'mse',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'presort': False,
'random_state': None,
'splitter': 'best'}
Grid search for the optimal parameters
tree=DecisionTreeRegressor()
parameters={'max_depth':np.arange(10,20,1)}
tree_cv=GridSearchCV(tree,param_grid=parameters,cv=5)
tree_cv.fit(x,y)
print(tree_cv.best_params_)
print(tree_cv.best_score_)
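best_params_: {'max_depth': 17}
best_score_: 0.64
Train the model with the optimal max_depth
tree=DecisionTreeRegressor(max_depth=17)
tree.fit(x_train,y_train)
print(tree.score(x_train,y_train))  # R² on the training set
print(tree.score(x_test,y_test))  # R² on the test set
Training-set R²: 0.67
Test-set R²: 0.63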
Evaluate feature importance with the fitted decision tree
stat=pd.DataFrame(columns=['importance','feature'])
stat['importance']=tree.feature_importances_
stat['feature']=x.columns
stat.sort_values(by='importance',ascending=False,inplace=True)
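Because the features are one-hot encoded, the table above ranks individual dummy columns rather than the original variables. If a per-variable ranking is wanted, one option is to sum the importances of the dummies that belong to the same variable. A rough sketch on toy data (the data, the `__` separator, and all names are assumptions made for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumption): the target depends mostly on City, slightly on Gender.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "A", "B", "C", "A", "B"],
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
})
target = [10, 20, 30, 12, 18, 32, 10, 22]

# A distinctive separator makes it easy to recover the original column name.
X_toy = pd.get_dummies(toy, prefix_sep="__")
tree_toy = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, target)

# Importance per dummy column, then summed per original variable.
per_dummy = pd.Series(tree_toy.feature_importances_, index=X_toy.columns)
per_feature = (
    per_dummy.groupby(lambda name: name.split("__")[0])
    .sum()
    .sort_values(ascending=False)
)
print(per_feature)
```

The summed importances still add up to 1, so they can be read as each original variable's share of the total impurity reduction.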