In the previous sections we preprocessed the features through feature engineering and constructed new ones. Now we can move on to modeling, feature selection, and hyperparameter tuning.
The overall plan:
- Build a model on all current features with a tree model (xgb/randomForest) to serve as a baseline;
- Select the features to enter the model;
- Build models with xgb, randomForest, and lightgbm, and pick the champion model;
- Tune the champion model's hyperparameters and save the model;
- Optionally try model ensembling.
Initial Modeling
I modeled all features with an xgb model and 5-fold cross-validation; the results are below.
[On the first run I did not log-transform the target, and the error was huge. This again shows that in regression models the target variable usually needs a log transform; other transforms that make it closer to normally distributed also work.]
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer

xgb_model = xgb.XGBRegressor()
# 5-fold CV on all features; note the log-transformed target
score = cross_val_score(xgb_model, cv=5, X=train_tree_data.iloc[:, 1:-1], y=np.log(train_tree_data.iloc[:, -1]), verbose=1, scoring=make_scorer(mean_absolute_error, greater_is_better=False))
print('MAE_AVG:', np.mean(score))
#MAE_AVG: -0.17081140057724026
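The effect of the log transform on a skewed target can be seen on synthetic data. The lognormal draw and the moment-based `skew` helper below are purely illustrative stand-ins for the price column, not the competition data:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy right-skewed "prices", standing in for the real target column
prices = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

def skew(x):
    # sample skewness: third standardized moment
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

print('skew before log:', skew(prices))           # strongly right-skewed
print('skew after  log:', skew(np.log(prices)))   # close to 0 (near-normal)
```

A skewness near zero after the transform is what makes the squared/absolute error of a regression model behave sensibly across the whole price range.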
During feature construction, grouped features were built only from the brand–price relationship; bodyType (body style) can be grouped the same way. The code is identical to the brand grouping, so it is not repeated here. With the new features added, the model improves slightly:
#train_tree_data['bodyType'].value_counts()
#0.0 37076
#1.0 30252
#2.0 26846
#3.0 12413
#4.0 8665
#5.0 6608
#6.0 5977
#7.0 998
#Name: bodyType, dtype: int64
#print('MAE_AVG:',np.mean(score))
#MAE_AVG: -0.17035740685819228
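The bodyType grouping skipped above can be sketched like the brand version. The toy frame below stands in for `train_tree_data`, and the feature names (`bodyType_mean` etc.) follow the naming pattern of the brand features:

```python
import pandas as pd

# toy stand-in for train_tree_data; the real frame has many more columns
df = pd.DataFrame({'bodyType': [0, 0, 1, 1, 2],
                   'price':    [100, 200, 150, 250, 300]})

# aggregate price statistics per bodyType, mirroring the brand_* features
agg = df.groupby('bodyType')['price'].agg(
    bodyType_mean='mean', bodyType_median='median',
    bodyType_max='max', bodyType_min='min').reset_index()
df = df.merge(agg, on='bodyType', how='left')
print(df.loc[0, 'bodyType_mean'])  # 150.0
```

On the real data these statistics would be computed on the training split only and merged onto both train and test, to avoid leaking target information.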
Feature Selection
Sort the features by the base model's feature importances in descending order, compute the cumulative importance, and build models with the feature subsets reaching 50%, 55%, ..., 100% cumulative importance. The best subset is then chosen from the model results.
import pandas as pd

# fit on all features to obtain feature importances
xgb_model.fit(train_tree_data.iloc[:, 1:-1], np.log(train_tree_data.iloc[:, -1]))
fea_imp_data = pd.DataFrame({'fea_name': train_tree_data.columns[1:-1], 'fea_imp': xgb_model.feature_importances_})
fea_imp_data.sort_values(by='fea_imp', ascending=False, inplace=True)
fea_imp_data = fea_imp_data[fea_imp_data['fea_imp'] > 0]  # drop features the model never used
fea_imp_data.reset_index(drop=True, inplace=True)
fea_imp_data['sum_imp'] = np.cumsum(fea_imp_data['fea_imp']) * 100  # cumulative importance in %
#fea_imp_data
#fea_imp fea_name sum_imp
#0 0.138571 v_12 13.8571
#1 0.128571 v_3 26.7143
#2 0.088571 usedDate 35.5714
#3 0.072857 v_10 42.8571
#4 0.065714 v_14 49.4286
#5 0.060000 v_0 55.4286
#6 0.051429 v_1 60.5714
#7 0.047143 v_11 65.2857
#8 0.045714 power_kr 69.8571
#9 0.040000 v_6 73.8571
#10 0.038571 notRepairedDamage 77.7143
#11 0.035714 v_9 81.2857
#12 0.028571 power 84.1429
#13 0.021429 kilometer 86.2857
#14 0.017143 v_8 88.0000
#15 0.014286 brand 89.4286
#16 0.014286 v_5 90.8571
#17 0.012857 brand_max 92.1429
#18 0.011429 v_13 93.2857
#19 0.008571 brand_mean 94.1429
#20 0.008571 v_7 95.0000
#21 0.007143 brand_amount 95.7143
#22 0.007143 name 96.4286
#23 0.005714 bodyType_mean 97.0000
#24 0.005714 v_4 97.5714
#25 0.004286 bodyType 98.0000
#26 0.004286 bodyType_median 98.4286
#27 0.002857 v_2 98.7143
#28 0.002857 bodyType_min 99.0000
#29 0.002857 brand_median 99.2857
#30 0.002857 brand_min 99.5714
#31 0.002857 bodyType_max 99.8571
#32 0.001429 fuelType 100.0000
for idx in range(50, 105, 5):
    model_fea = fea_imp_data.loc[fea_imp_data['sum_imp'] <= idx, 'fea_name'].values
    print('Cumulative importance: {}%, number of features: {}'.format(idx, len(model_fea)))
    xgb_model = xgb.XGBRegressor()
    score = cross_val_score(xgb_model, cv=5, X=train_tree_data.loc[:, model_fea], y=np.log(train_tree_data.iloc[:, -1]), verbose=1, scoring=make_scorer(mean_absolute_error, greater_is_better=False))
    print('MAE_AVG:', np.mean(score))
#Partial results:
#Cumulative importance: 90%, number of features: 16
#MAE_AVG: -0.17037598469975157
#Cumulative importance: 95%, number of features: 20
#MAE_AVG: -0.17028678726922397  #best
#Cumulative importance: 100%, number of features: 33
#MAE_AVG: -0.1704971429858277
The 95% cumulative-importance subset performs best, giving a regression model with 20 features. When several subsets score similarly, prefer the one with fewer features; the same principle applies to classification models.
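The "prefer fewer features when scores are close" rule can be written as a simple tolerance check. The scores below are copied from the CV runs above; the 0.0005 tolerance is an arbitrary illustrative choice:

```python
# negative MAE per feature-subset size, copied from the CV runs above
results = {16: -0.17038, 20: -0.17029, 33: -0.17050}

best = max(results.values())  # scores are negative MAE, so higher is better
tol = 0.0005                  # hypothetical acceptable loss versus the best score
# smallest feature subset whose score is within tol of the best
chosen = min(n for n, s in results.items() if best - s <= tol)
print(chosen)  # 16
```

Under this tolerance the 16-feature subset would be picked over the slightly better 20-feature one; tightening `tol` to zero recovers the article's choice.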
Model Selection
Build models with xgb, lgb, and randomForest respectively:
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor

lgb_rg = lgb.LGBMRegressor()
lgb_score = cross_val_score(lgb_rg, cv=5, X=train_tree_data.loc[:, model_fea], y=np.log(train_tree_data.iloc[:, -1]), verbose=1, scoring=make_scorer(mean_absolute_error, greater_is_better=False))
print('LGB_MAE_AVG:', np.mean(lgb_score))

rf = RandomForestRegressor()
rf_score = cross_val_score(rf, cv=5, X=train_tree_data.loc[:, model_fea], y=np.log(train_tree_data.iloc[:, -1]), verbose=1, scoring=make_scorer(mean_absolute_error, greater_is_better=False))
print('RF_MAE_AVG:', np.mean(rf_score))
#LGB_MAE_AVG: -0.14097477619456394
#RF_MAE_AVG: -0.13618652767192513
Hyperparameter Tuning
The results above show that the random forest performs best (its negative MAE is the highest), so we tune its hyperparameters next.
Common tuning approaches include greedy search, grid search, and Bayesian optimization.
Grid search is the most commonly used; its drawback is that it is very time-consuming.
from sklearn.model_selection import GridSearchCV

max_depth = [3, 5, 10, 15, 20]
n_estimators = [100, 200, 300, 500]
params = {'n_estimators': n_estimators, 'max_depth': max_depth}
rf = RandomForestRegressor()
gs = GridSearchCV(rf, params, cv=5)  # 20 combinations x 5 folds = 100 fits
gs.fit(train_tree_data.loc[:, model_fea], np.log(train_tree_data.iloc[:, -1]))
print(gs.best_params_)
It ran for an hour without finishing.
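When the full grid is too slow, one common fallback is RandomizedSearchCV, which fits only a fixed number of sampled parameter combinations instead of all of them (and `n_jobs=-1` parallelizes across cores). The toy data below just makes the sketch runnable; on the real data you would pass `train_tree_data.loc[:, model_fea]` and the log-transformed target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # toy features
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # toy target

params = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 10, None]}
# n_iter=5 -> only 5 of the 12 combinations are tried
search = RandomizedSearchCV(RandomForestRegressor(random_state=0), params,
                            n_iter=5, cv=3, n_jobs=-1, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With `n_iter` fixed, the cost no longer grows with the size of the grid, which is why random search is often the practical choice before moving on to Bayesian optimization.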