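These notes cover material from the Alibaba Cloud Tianchi Longzhu Program data mining training camp. Link:
4. Modeling and Hyperparameter Tuning
Tip: This part is Task4 (modeling and tuning) of the beginner's introduction to data mining. It walks through several models, how to evaluate them, and strategies for tuning them. Feedback and discussion are welcome.
Competition: Introduction to Data Mining for Beginners - Used Car Transaction Price Prediction
Address: Introduction to Data Mining for Beginners - Used Car Transaction Price Prediction (learning competition, Alibaba Cloud Tianchi)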
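5.1 Learning objectives
- Get to know the commonly used machine learning models and master the workflow for building and tuning them
- Complete the corresponding study check-in task
5.2 Contents
- Linear regression model:
  - What linear regression requires of the features;
  - Handling long-tailed distributions;
  - Understanding the linear regression model;
- Validating model performance:
  - Evaluation functions and objective functions;
  - Cross-validation;
  - Leave-one-out validation;
  - Validation for time-series problems;
  - Plotting learning curves;
  - Plotting validation curves;
- Embedded feature selection:
  - Lasso regression;
  - Ridge regression;
  - Decision trees;
- Model comparison:
  - Common linear models;
  - Common nonlinear models;
- Hyperparameter tuning:
  - Greedy tuning;
  - Grid search;
  - Bayesian optimization;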
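5.3 Background theory and recommended reading
The theory behind these algorithms takes a lot of space to explain, so this article recommends some blog posts and textbooks for beginners instead.
5.3.1 Linear regression model
5.3.2 Decision tree model
5.3.3 GBDT model
5.3.4 XGBoost model
5.3.5 LightGBM model
5.3.6 Recommended textbooks:
- 《机器学习》 (Machine Learning), Douban listing
- 《统计学习方法》 (Statistical Learning Methods), Douban listing
- 《Python大战机器学习》 (Python vs. Machine Learning), Douban listing
- 《面向机器学习的特征工程》 (Feature Engineering for Machine Learning Models), Douban listing
- 《数据科学家访谈录》 (Interviews with Data Scientists), Douban listing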
5.4 Code examples
5.4.1 Reading the data
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# reduce_mem_usage shrinks the memory footprint of a DataFrame by downcasting column dtypes
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum()  # total memory used by the DataFrame, in bytes
    print('Memory usage of dataframe is {:.2f} bytes'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # If the column's min and max both fit inside an int* range, cast the column to that int* type
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floating-point columns
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')  # explicitly convert object columns to category
    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} bytes'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('data_for_t.csv'))
Memory usage of dataframe is 62099672.00 bytes
Memory usage after optimization is: 16520303.00 bytes
Decreased by 73.4%
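The downcasting rule inside reduce_mem_usage can be illustrated without pandas. The following is a minimal sketch, assuming signed integers only; the helper name smallest_int_bits and the sample ranges are hypothetical, and it uses inclusive bounds where the function above uses strict comparisons.

```python
# Signed integer widths considered by the downcasting logic.
INT_BITS = (8, 16, 32, 64)

def smallest_int_bits(lo, hi):
    """Return the narrowest signed width (in bits) that can hold [lo, hi]."""
    for bits in INT_BITS:
        min_val = -(2 ** (bits - 1))
        max_val = 2 ** (bits - 1) - 1
        # Inclusive bounds: a value equal to the limit still fits.
        if min_val <= lo and hi <= max_val:
            return bits
    raise OverflowError("range does not fit in 64 bits")

print(smallest_int_bits(0, 15))       # a small categorical code column
print(smallest_int_bits(0, 200000))   # an ID-like column
```

A column whose values stay within 0..15 needs one byte per row instead of eight, which is where most of the memory saving comes from.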
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
5.4.2 Linear regression & five-fold cross-validation & simulating the real business setting
# Drop samples with missing values, replace '-' feature values with 0, reset the index, and overwrite sample_feature
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
# notRepairedDamage was originally of type object and became category during memory reduction
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
# The continuous features plus the target make up the training set
train_X = sample_feature[continuous_feature_names]
train_y = sample_feature['price']
# train = sample_feature[continuous_feature_names + ['price']]
# train_X = train[continuous_feature_names]
# train_y = train['price']
5.4.2 - 1 Simple modeling
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Build a pipeline that standardizes the features first and then fits a linear regression
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe = pipe.fit(train_X, train_y)
Inspect the intercept and the weights (coef) of the fitted linear regression model:
print('intercept:' + str(pipe.named_steps['linearregression'].intercept_))
# Zip the feature names with their weights into a dict, take its items, and sort them by weight in descending order
sorted(dict(zip(continuous_feature_names, pipe.named_steps['linearregression'].coef_)).items(), key=lambda x: x[1], reverse=True)
[('v_6', 3367064.3416419234),
 ('v_8', 700675.5609399051),
 ('v_9', 170630.2772322213),
 ('v_7', 32322.661932046794),
 ('v_12', 20473.670796955677),
 ('v_3', 17868.079541493174),
 ('v_11', 11474.938996699882),
 ('v_13', 11261.764560014171),
 ('v_10', 2683.92009059701),
 ('gearbox', 881.822503924793),
 ('fuelType', 363.9042507217258),
 ('bodyType', 189.6027101207636),
 ('city', 44.94975120523033),
 ('power', 28.553901616760893),
 ('brand_price_median', 0.5103728134078656),
 ('brand_price_std', 0.4503634709262509),
 ('brand_amount', 0.14881120395067537),
 ('brand_price_max', 0.0031910186703124413),
 ('SaleID', 5.355989919859316e-05),
 ('offerType', 4.39654104411602e-06),
 ('train', 2.0489096641540527e-08),
 ('seller', -5.816807970404625e-06),
 ('brand_price_sum', -2.175006868187935e-05),
 ('name', -0.00029800127130798996),
 ('used_time', -0.0025158943328359956),
 ('brand_price_average', -0.4049048451010565),
 ('brand_price_min', -2.246775348689869),
 ('power_bin', -34.42064411732892),
 ('v_14', -274.78411807754395),
 ('kilometer', -372.8975266607174),
 ('notRepairedDamage', -495.1903844629),
 ('v_0', -2045.054957355766),
 ('v_5', -11022.986240388815),
 ('v_4', -15121.73110985524),
 ('v_2', -26098.299920520385),
 ('v_1', -45556.18929727161)]
import matplotlib.pyplot as plt

# Randomly draw 50 sample indices (with replacement)
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
Plot feature v_9 against the label as a scatter plot. The model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions are below 0, which shows that the model has problems.
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# The prediction uses all features of each sample, not just v_9
plt.scatter(train_X['v_9'][subsample_index], pipe.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obvious different from true price')
plt.show()
The predicted price is obvious different from true price
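Plotting the label (price) shows that it follows a long-tailed distribution, which hurts modeling. Many models assume that the error term (i.e., the residual) is normally distributed, and a long-tailed target tends to break that assumption: when the target is far from normal, the residuals usually are too. Reference blog post: The Five Basic Assumptions of Regression Analysis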
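How strongly a log(x+1) transform reduces a long tail can be checked numerically with sample skewness. The sketch below uses only the standard library; the price list is made up for illustration and is not taken from the competition data.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical long-tailed "prices": many cheap cars, a few expensive ones.
prices = [500, 800, 1200, 1500, 2000, 2500, 3000, 4500, 6000, 9000, 20000, 60000]
log_prices = [math.log1p(p) for p in prices]   # log(x + 1)

print("skewness of raw prices:", round(skewness(prices), 3))
print("skewness of log1p(prices):", round(skewness(log_prices), 3))
```

A skewness near 0 indicates a roughly symmetric distribution; the transformed values are much closer to that than the raw ones.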
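import seaborn as sns

print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
# distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the usual replacement
sns.distplot(train_y)
plt.subplot(1, 2, 2)
# Compute the 0.9 quantile of price (a value between the 122295th and 122296th sorted prices) and keep only the prices below it
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])
It is clear to see the price shows a typical exponential distribution
[Figure: distribution of price (left) and of price below its 0.9 quantile (right); x-axis: price]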
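Here we apply a log(x+1) transform to the label so that it is closer to a normal distribution.
train_y_ln = np.log(train_y + 1)  # a log(x+1) transform brings the label closer to a normal distribution
import seaborn as sns

print('The transformed price seems like normal distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.distplot(train_y_ln)
plt.subplot(1, 2, 2)
# After the transform, truncating the tail actually makes the data look less normal
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
The transformed price seems like normal distribution
[Figure: distribution of log(price + 1) (left) and of the same values below their 0.9 quantile (right); x-axis: price]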
pipe = pipe.fit(train_X, train_y_ln)
print('intercept:' + str(pipe.named_steps['linearregression'].intercept_))
sorted(dict(zip(continuous_feature_names, pipe.named_steps['linearregression'].coef_)).items(), key=lambda x: x[1], reverse=True)
intercept:18.75074572712829
[('v_9', 8.052411938253034),
 ('v_5', 5.76424821734175),
 ('v_12', 1.6182065931157121),
 ('v_1', 1.479830409604984),
 ('v_11', 1.166901301442136),
 ('v_13', 0.9404706174050993),
 ('v_7', 0.7137294548201817),
 ('v_3', 0.6837865131606176),
 ('v_0', 0.008500520542048579),
 ('power_bin', 0.008497966847069138),
 ('gearbox', 0.00792237740739693),
 ('fuelType', 0.0066847683602300175),
 ('bodyType', 0.004523520859490933),
 ('power', 0.0007161896574413934),
 ('brand_price_min', 3.334353379064395e-05),
 ('brand_amount', 2.897879910306009e-06),
 ('brand_price_median', 1.2571117947502923e-06),
 ('brand_price_std', 6.659133921755751e-07),
 ('brand_price_max', 6.194957144444865e-07),
 ('brand_price_average', 5.999430180910623e-07),
 ('SaleID', 2.1194164871793964e-08),
 ('offerType', 1.2352074918453582e-10),
 ('train', 7.958078640513122e-12),
 ('brand_price_sum', -1.5126508194887479e-10),
 ('seller', -4.064446557094925e-10),
 ('name', -7.015510612650541e-08),
 ('used_time', -4.1224772323856875e-06),
 ('city', -0.002218783507670577),
 ('v_14', -0.004234188272975861),
 ('kilometer', -0.013835866970419807),
 ('notRepairedDamage', -0.2702794201378993),
 ('v_4', -0.8315697033983619),
 ('v_2', -0.9470829600923825),
 ('v_10', -1.6261472858358768),
 ('v_8', -40.34300704975851),
 ('v_6', -238.79035779748037)]
Visualizing again, the predictions are now fairly close to the true values and no anomalies appear.
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# The model was trained on log(price + 1), so predictions must be mapped back with exp(y) - 1.
# Predict through the whole pipeline so that the features are standardized the same way as in training.
plt.scatter(train_X['v_9'][subsample_index], np.expm1(pipe.predict(train_X.loc[subsample_index])), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()
The predicted price seems normal after np.log transforming
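Because the label was transformed with log(x+1), the exact inverse is exp(y) - 1 rather than a plain exp. A small standard-library check, using a made-up price of 3500, shows the difference:

```python
import math

true_price = 3500.0
y_log = math.log1p(true_price)     # training target: log(price + 1)

# The exact inverse of log(x + 1) is exp(y) - 1, available as math.expm1.
recovered = math.expm1(y_log)
# A plain exp leaves the "+ 1" in place, so it is off by exactly one unit.
off_by_one = math.exp(y_log)

print(recovered, off_by_one)
```

For prices in the thousands the one-unit offset is small, but using the exact inverse keeps predictions and labels on the same scale.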
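5.4.2 - 2 Five-fold cross-validation
When fitting model parameters, a data set is commonly split into three parts (the MNIST handwritten-digit data set is a familiar example): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists to keep the evaluation honest. The test set is easy to understand: it takes no part in training at all and is used only to measure final performance. The training and validation sets are where the following idea comes in.
In practice a model usually fits its own training data quite well but often fits data outside the training set much less satisfactorily. So instead of training on all the data, we hold out a portion that takes no part in training and use it to test the parameters learned from the rest, which gives a relatively objective view of how well they fit unseen data. Rotating which portion is held out, so that every sample is used for validation exactly once, is called cross-validation; with five portions it is five-fold cross-validation.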
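The fold rotation described above can be sketched in a few lines. This is an illustrative, unshuffled splitter written for these notes (the name k_fold_indices is hypothetical); scikit-learn's own splitters offer more options.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

for train_idx, valid_idx in k_fold_indices(10, n_folds=5):
    print("train:", train_idx, "valid:", valid_idx)
```

Each sample lands in exactly one validation fold, and the model is fitted five times, once per fold, on the remaining four fifths.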
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer

# Wrap a metric so that it is computed on log-transformed values
def log_transfer(func):
    def wrapper(y, yhat):
        # The log of a negative prediction is NaN, which nan_to_num replaces with 0
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
Five-fold cross-validation of the linear regression model on the untransformed label (error 1.36):
# Log-transform y and yhat first, then evaluate with mean_absolute_error
scores = cross_val_score(pipe, X=train_X, y=train_y, verbose=1, cv=5, scoring=make_scorer(log_transfer(mean_absolute_error)))
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 3.0s finished
print('AVG:', np.mean(scores))
AVG: 1.3658024040276628
Five-fold cross-validation of the linear regression model on the log-transformed label (error 0.19):
scores = cross_val_score(pipe, X=train_X, y=train_y_ln, verbose=1, cv=5, scoring=make_scorer(mean_absolute_error))
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 3.1s finished
print('AVG:', np.mean(scores))
AVG: 0.1932530155796405
scores = pd.DataFrame(scores.reshape(1, -1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
Metric | cv1 | cv2 | cv3 | cv4 | cv5 |
---|---|---|---|---|---|
MAE | 0.190792 | 0.193758 | 0.194132 | 0.191825 | 0.195758 |
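5.4.2 - 3 Simulating the real business setting
In practice, though, we cannot see into the future, so on time-dependent data five-fold cross-validation can paint an unrealistic picture: using 2018 used-car prices to predict 2017 prices is clearly unreasonable. We can therefore also split the data set in time order. In this example the earliest 4/5 of the samples form the training set and the latest 1/5 form the validation set; the final result differs little from five-fold cross-validation.
sample_feature = sample_feature.reset_index(drop=True)  # reset the index and discard the old one
split_point = len(sample_feature) // 5 * 4
# .loc slicing includes both endpoints, so use .iloc to keep the boundary row out of the training set
train = sample_feature.iloc[:split_point].dropna()  # first 4/5 of the data as training samples
val = sample_feature.iloc[split_point:].dropna()  # last 1/5 of the data as validation samples
train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
model = pipe.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))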
0.19577667149549233
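The time-ordered split can be sketched on plain Python records. The dates and prices below are hypothetical; the point is only that every training record is no newer than any validation record.

```python
from datetime import date

# Hypothetical listing records: (creation date, price). Not the real data.
records = [
    (date(2016, 3, 9), 1850), (date(2016, 3, 12), 3600),
    (date(2016, 3, 15), 6222), (date(2016, 3, 20), 2400),
    (date(2016, 3, 24), 5200), (date(2016, 3, 28), 8000),
    (date(2016, 4, 1), 3500), (date(2016, 4, 2), 1000),
    (date(2016, 4, 4), 2850), (date(2016, 4, 6), 650),
]

records.sort(key=lambda r: r[0])         # oldest first
split_point = len(records) // 5 * 4      # the same 4/5 rule as above
train, valid = records[:split_point], records[split_point:]

print(len(train), len(valid))
print("latest training date:", train[-1][0])
print("earliest validation date:", valid[0][0])
```

Python slices exclude the end index, so the two parts share no record, which is the property a time-based validation split needs.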
5.4.2 - 4 Plotting learning curves and validation curves
from sklearn.model_selection import learning_curve, validation_curve
? learning_curve  # show the help text for learning_curve (IPython/Jupyter syntax)
Signature:
learning_curve(
    estimator,
    X,
    y,
    groups=None,
    train_sizes=array([0.1 , 0.325, 0.55 , 0.775, 1. ]),
    cv=None,
    scoring=None,
    exploit_incremental_learning=False,
    n_jobs=1,
    pre_dispatch='all',
    verbose=0,
    shuffle=False,
    random_state=None,
)
Docstring:
Learning curve.

Determines cross-validated training and test scores for different training
set sizes.

A cross-validation generator splits the whole dataset k times in training
and test data. Subsets of the training set with varying sizes will be used
to train the estimator and a score for each training subset size and the
test set will be computed. Afterwards, the scores will be averaged over
all k runs for each training subset size.

Read more in the :ref:`User Guide <learning_curve>`.

Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
    An object of that type which is cloned for each validation.

X : array-like, shape (n_samples, n_features)
    Training vector, where n_samples is the number of samples and
    n_features is the number of features.

y : array-like, shape (n_samples) or (n_samples, n_features), optional
    Target relative to X for classification or regression;
    None for unsupervised learning.

groups : array-like, with shape (n_samples,), optional
    Group labels for the samples used while splitting the dataset into
    train/test set.

train_sizes : array-like, shape (n_ticks,), dtype float or int
    Relative or absolute numbers of training examples that will be used to
    generate the learning curve. If the dtype is float, it is regarded as a
    fraction of the maximum size of the training set (that is determined
    by the selected validation method), i.e. it has to be within (0, 1].
    Otherwise it is interpreted as absolute sizes of the training sets.
    Note that for classification the number of samples usually have to
    be big enough to contain at least one sample from each class.
    (default: np.linspace(0.1, 1.0, 5))

cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the default 3-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - An object to be used as a cross-validation generator.
    - An iterable yielding train, test splits.

    For integer/None inputs, if the estimator is a classifier and ``y`` is
    either binary or multiclass, :class:`StratifiedKFold` is used. In all
    other cases, :class:`KFold` is used.

    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.

scoring : string, callable or None, optional, default: None
    A string (see model evaluation documentation) or
    a scorer callable object / function with signature
    ``scorer(estimator, X, y)``.

exploit_incremental_learning : boolean, optional, default: False
    If the estimator supports incremental learning, this will be
    used to speed up fitting for different training set sizes.

n_jobs : integer, optional
    Number of jobs to run in parallel (default 1).

pre_dispatch : integer or string, optional
    Number of predispatched jobs for parallel execution (default is
    all). The option can reduce the allocated memory. The string can
    be an expression like '2*n_jobs'.

verbose : integer, optional
    Controls the verbosity: the higher, the more messages.

shuffle : boolean, optional
    Whether to shuffle training data before taking prefixes of it
    based on``train_sizes``.

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by `np.random`. Used when ``shuffle`` is True.

Returns
-------
train_sizes_abs : array, shape = (n_unique_ticks,), dtype int
    Numbers of training examples that has been used to generate the
    learning curve. Note that the number of ticks might be less
    than n_ticks because duplicate entries will be removed.

train_scores : array, shape (n_ticks, n_cv_folds)
    Scores on training sets.

test_scores : array, shape (n_ticks, n_cv_folds)
    Scores on test set.

Notes
-----
See :ref:`examples/model_selection/plot_learning_curve.py
<sphx_glr_auto_examples_model_selection_plot_learning_curve.py>`
File: /opt/conda/lib/python3.6/site-packages/sklearn/model_selection/_validation.py
Type: function
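def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()  # create a new figure
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training example')
    plt.ylabel('score')
    # Run cv-fold cross-validation for each of the training-set sizes. Each size is the fraction in
    # train_size applied to the training part of a fold; the resulting absolute sample counts are
    # returned as train_sizes
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Red band: mean +/- one standard deviation of the training scores
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    # Green band: mean +/- one standard deviation of the validation scores
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    # x-axis: number of training samples; y-axis: mean training score across folds
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label="Training score")
    # x-axis: number of training samples; y-axis: mean validation score across folds
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
# The score plotted here is MAE, so lower is better. The training error sits below the validation error
# and the gap between them is small (low variance), so the model is not overfitting; but the error itself
# is fairly high (high bias), so the model underfits
[Figure: learning curve of the linear model; x-axis: Training example, y-axis: score; curves: Training score and Cross-validation score]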
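What a learning curve measures can be reproduced without scikit-learn. The sketch below is a simplification under stated assumptions: synthetic data, a one-feature least-squares line, and a single fixed hold-out set instead of cross-validation. The helper names fit_line and mae are introduced here for illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mae(xs, ys, a, b):
    """Mean absolute error of the line a * x + b on (xs, ys)."""
    return sum(abs(y - (a * x + b)) for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]   # synthetic data
train_x, train_y = xs[:150], ys[:150]
valid_x, valid_y = xs[150:], ys[150:]                   # fixed hold-out set

curve = []
for size in (5, 20, 50, 100, 150):
    # Fit on a growing prefix of the training data, score on both sets.
    a, b = fit_line(train_x[:size], train_y[:size])
    curve.append((size, mae(train_x[:size], train_y[:size], a, b),
                  mae(valid_x, valid_y, a, b)))

for size, train_err, valid_err in curve:
    print(f"{size:4d}  train MAE={train_err:.3f}  valid MAE={valid_err:.3f}")
```

Plotting the two error columns against the training size gives the same kind of picture as plot_learning_curve: two curves that approach each other as more data is used.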
5.4.3 Comparing several models
# Combine continuous_feature_names with price, then drop every sample that has a missing feature value
train = sample_feature[continuous_feature_names + ['price']].dropna()
train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)
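5.4.3 - 1 Linear models & embedded feature selection
This section assumes the reader already knows the concepts of overfitting, model complexity, and regularization. If not, look up background material or see the links below:
- Describing "overfitting" in plain language (_Data)
- Model Complexity and Generalization Ability - Zhihu
- https://www.cnblogs.com/zingp/p/10375691.html
In filter and wrapper feature selection, the selection step is clearly separate from training the learner. Embedded feature selection instead performs the selection automatically while the learner is being trained. The most common embedded approaches are L1 and L2 regularization. Adding them to a linear regression model yields Lasso regression and ridge regression, respectively.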
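For reference, the three objectives can be written side by side. This is the standard textbook form with design matrix $X$, label vector $y$, weights $w$, and penalty strength $\lambda \ge 0$; individual libraries may scale the squared-error term differently (for example by $1/(2n)$), which changes the meaning of a given $\lambda$ but not the structure.

```latex
\begin{aligned}
\hat{w}_{\text{OLS}}   &= \arg\min_{w}\; \lVert y - Xw \rVert_2^2 \\
\hat{w}_{\text{ridge}} &= \arg\min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2
  && \text{(L2 penalty)} \\
\hat{w}_{\text{lasso}} &= \arg\min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1
  && \text{(L1 penalty)}
\end{aligned}
```

With $\lambda = 0$ both penalized forms reduce to ordinary least squares.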
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge  # L2 regularization
from sklearn.linear_model import Lasso  # L1 regularization

models = [LinearRegression(), Ridge(), Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
Ridge is finished
Lasso is finished
Comparing the three methods, plain linear regression has the lowest error.
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
Fold | LinearRegression | Ridge | Lasso |
---|---|---|---|
cv1 | 0.190792 | 0.194832 | 0.383899 |
cv2 | 0.193758 | 0.197632 | 0.381894 |
cv3 | 0.194132 | 0.198123 | 0.384090 |
cv4 | 0.191825 | 0.195670 | 0.380526 |
cv5 | 0.195758 | 0.199676 | 0.383611 |
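model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
# Apart from v_5, v_6, v_8 and v_9, whose weights are large, the weights of the other features are close to or equal to 0
intercept:18.750751045631276
[Figure: bar plot of the absolute LinearRegression coefficient of each feature]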
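During fitting, L2 regularization tends to push the weights to be as small as possible, producing a model in which all parameters are fairly small. Models with small parameters are generally considered simpler and better able to adapt to different data sets, which also helps avoid overfitting to some extent. Consider a linear regression equation: if a parameter is very large, a small shift in the data changes the result a lot; if the parameters are small enough, even a larger shift has little effect on the result. The technical way to say this is that the model is robust to perturbations.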
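model = Ridge().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
# The weights returned under L2 regularization are all fairly small, but the fewest of them are exactly 0
intercept:4.671710811023084
[Figure: bar plot of the absolute Ridge coefficient of each feature]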
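L1 regularization helps produce a sparse weight matrix, which can in turn be used for feature selection. As the figure below shows, the power and used_time features turn out to be very important.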
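model = Lasso().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
# This model has the most weights equal to 0
intercept:8.67218477236799
[Figure: bar plot of the absolute Lasso coefficient of each feature]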
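The different behavior of the two penalties is easiest to see with a single feature and no intercept, where both problems have a closed-form solution. This is a sketch under those simplifying assumptions; the function names and data are made up, and the lasso form uses the 1/(2n) scaling of the squared error.

```python
def ridge_weight(xs, ys, alpha):
    """Minimiser of sum((y - w*x)^2) + alpha * w^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

def lasso_weight(xs, ys, alpha):
    """Minimiser of sum((y - w*x)^2) / (2n) + alpha * |w| (soft-thresholding)."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    z = sum(x * x for x in xs) / n
    shrunk = max(abs(rho) - alpha, 0.0)   # pull towards zero by alpha, stop at 0
    return (shrunk if rho > 0 else -shrunk) / z

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 1.0, 1.5, 2.0]                 # exact relation y = 0.5 * x

for alpha in (0.0, 1.0, 5.0):
    print(alpha, ridge_weight(xs, ys, alpha), lasso_weight(xs, ys, alpha))
```

Ridge shrinks the weight smoothly but never reaches zero for a finite penalty, while the lasso weight becomes exactly zero once the penalty exceeds the feature's correlation with the label. That exact zero is what makes L1 usable for feature selection.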
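Beyond this, when a decision tree chooses split nodes by information entropy or the Gini index, the features it chooses to split on first are also the more important ones, which is likewise a form of feature selection. The feature importance scores reported by XGBoost and LightGBM models (for example, through the feature_importances_ attribute of their scikit-learn wrappers) are computed on this basis.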
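The split criterion mentioned above can be sketched for the Gini index. The labels and candidate splits below are hypothetical classification examples; regression trees typically use a variance (squared-error) reduction in the same role.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["cheap", "cheap", "cheap", "pricey", "pricey", "pricey"]
# Hypothetical split A separates the classes perfectly; split B barely helps.
gain_a = gini_gain(parent, parent[:3], parent[3:])
gain_b = gini_gain(parent, ["cheap", "pricey", "cheap"], ["cheap", "pricey", "pricey"])
print(gain_a, gain_b)
```

A tree picks the split with the larger gain, so a feature that produces splits like A ends up near the root and accumulates a higher importance.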
5.4.3 - 2 Nonlinear models
Besides linear models there are many commonly used nonlinear models, listed in the code below; for reasons of space their theory is not covered here. We picked some of the common ones and compared them with the linear model.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators=100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators=100)]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
Fold | LinearRegression | DecisionTreeRegressor | RandomForestRegressor | GradientBoostingRegressor | MLPRegressor | XGBRegressor | LGBMRegressor |
---|---|---|---|---|---|---|---|
cv1 | 0.190792 | 0.200199 | 0.140837 | 0.168900 | 368.425973 | 0.140317 | 0.141542 |
cv2 | 0.193758 | 0.191904 | 0.142257 | 0.171831 | 324.132343 | 0.140923 | 0.145501 |
cv3 | 0.194132 | 0.189556 | 0.140941 | 0.170902 | 611.186880 | 0.139739 | 0.143887 |
cv4 | 0.191825 | 0.190797 | 0.140719 | 0.169056 | 765.200565 | 0.137492 | 0.142497 |
cv5 | 0.195758 | 0.202288 | 0.145952 | 0.174078 | 384.894216 | 0.143021 | 0.144852 |
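In this run XGBRegressor has the lowest error in every fold, with RandomForestRegressor and LGBMRegressor close behind; all three clearly beat the linear model. MLPRegressor with these settings produces errors in the hundreds.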
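5.4.4 Hyperparameter tuning
Three commonly used tuning methods are introduced here:
- Greedy tuning
- Grid search
- Bayesian optimization
# Candidate parameter values for LightGBM:
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']  # objective function
num_leaves = [3, 5, 10, 15, 20, 40, 55]  # number of leaves per tree
max_depth = [3, 5, 10, 15, 20, 40, 55]  # maximum depth of a tree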
5.4.4 - 1 Greedy tuning
best_obj = dict()
# First choose the best objective function
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score

# With the best objective fixed, choose the best number of leaves
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score

# With the best objective and number of leaves fixed, finally choose the best tree depth
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x: x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score

sns.lineplot(x=['0_initial', '1_tuning_obj', '2_tuning_leaves', '3_tuning_depth'], y=[0.143, min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
[Figure: line plot of the best MAE after each greedy tuning step (initial, objective, num_leaves, max_depth)]
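The greedy strategy tunes one parameter at a time and freezes each choice before moving on. The sketch below replaces the expensive cross-validation runs with a hypothetical lookup table of validation errors, so the numbers are made up; only the control flow mirrors the loops above.

```python
# Hypothetical validation errors (lower is better) for two parameters.
# In the real workflow each lookup would be a cross-validation run.
SCORES = {
    ("regression", 15): 0.20, ("regression", 55): 0.18,
    ("huber", 15): 0.21, ("huber", 55): 0.15,
}

def evaluate(objective, num_leaves):
    return SCORES[(objective, num_leaves)]

def greedy_search(objectives, leaves_options, default_leaves):
    # Step 1: pick the objective with the other parameter at its default.
    best_obj = min(objectives, key=lambda o: evaluate(o, default_leaves))
    # Step 2: freeze that choice and tune the next parameter.
    best_leaves = min(leaves_options, key=lambda n: evaluate(best_obj, n))
    return best_obj, best_leaves, evaluate(best_obj, best_leaves)

print(greedy_search(["regression", "huber"], [15, 55], default_leaves=15))
```

In this table the greedy pass ends at an error of 0.18 even though the combination ("huber", 55) scores 0.15: freezing the objective early hides combinations that only pay off together. That is the price paid for evaluating far fewer configurations.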
5.4.4 - 2 Grid Search tuning
from sklearn.model_selection import GridSearchCV

parameters = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
# Try every combination of the candidate values of the three parameters and keep the best one
clf = GridSearchCV(model, parameters, cv=5)
# Note: this search is fitted on the raw price (train_y) with the default scorer, which is R^2 for regressors
clf = clf.fit(train_X, train_y)
clf.best_params_
{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}
model = LGBMRegressor(objective='regression', num_leaves=55, max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))
0.1375498038741029
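Grid search simply evaluates every combination of the candidate values. A minimal standard-library sketch, again with a hypothetical table of validation errors standing in for cross-validation (the function grid_search is written for these notes and is not the scikit-learn API):

```python
from itertools import product

# Hypothetical validation errors (lower is better); not real results.
SCORES = {
    ("regression", 15): 0.20, ("regression", 55): 0.18,
    ("huber", 15): 0.21, ("huber", 55): 0.15,
}

def evaluate(objective, num_leaves):
    return SCORES[(objective, num_leaves)]

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"objective": ["regression", "huber"], "num_leaves": [15, 55]}
print(grid_search(grid, evaluate))
```

The number of evaluations is the product of the list lengths (5 x 7 x 7 = 245 configurations for the three lists above, each cross-validated five times), which is why grid search becomes slow as parameters are added.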
5.4.4 - 3 Bayesian optimization tuning
!pip install bayesian-optimization
from bayes_opt import BayesianOptimization

# How Bayesian optimization proceeds: initialize a surrogate function, derive an acquisition function
# from it, use the acquisition function to pick the next sample point, update the surrogate and the
# acquisition function with that point, and iterate
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective='regression_l1',
                      num_leaves=int(num_leaves),  # the optimizer proposes continuous values, so cast to int
                      max_depth=int(max_depth),
                      subsample=subsample,
                      min_child_samples=int(min_child_samples)),
        X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    # bayes_opt only maximizes, so return 1 minus the mean MAE (smaller MAE, larger target)
    return 1 - val

rf_bo = BayesianOptimization(
    rf_cv,
    {'num_leaves': (2, 100),
     'max_depth': (2, 100),
     'subsample': (0.1, 1),
     'min_child_samples': (2, 100)}
)
rf_bo.maximize()
| iter | target | max_depth | min_ch... | num_le... | subsample |
-------------------------------------------------------------------------
| 1 | 0.8651 | 11.65 | 80.05 | 59.04 | 0.377 |
| 2 | 0.8518 | 59.61 | 12.86 | 20.51 | 0.9305 |
| 3 | 0.8596 | 55.92 | 76.14 | 36.86 | 0.5089 |
| 4 | 0.8612 | 68.22 | 70.43 | 41.69 | 0.6016 |
| 5 | 0.8682 | 89.33 | 68.71 | 87.44 | 0.9421 |
| 6 | 0.8253 | 3.867 | 97.03 | 98.98 | 0.5146 |
| 7 | 0.8507 | 5.633 | 4.214 | 93.7 | 0.1725 |
| 8 | 0.8681 | 88.36 | 71.62 | 85.06 | 0.7962 |
| 9 | 0.8688 | 99.38 | 3.327 | 92.97 | 0.1082 |
| 10 | 0.802 | 2.581 | 98.27 | 3.02 | 0.4331 |
| 11 | 0.8683 | 47.76 | 36.64 | 84.46 | 0.2095 |
| 12 | 0.8257 | 3.131 | 2.886 | 27.51 | 0.5481 |
| 13 | 0.8613 | 99.55 | 2.518 | 42.05 | 0.1748 |
| 14 | 0.8655 | 61.37 | 2.13 | 62.49 | 0.5283 |
| 15 | 0.8685 | 98.73 | 98.92 | 89.19 | 0.9923 |
| 16 | 0.8638 | 23.86 | 43.4 | 53.99 | 0.9645 |
| 17 | 0.8344 | 98.93 | 98.41 | 8.308 | 0.9273 |
| 18 | 0.8665 | 51.16 | 99.72 | 69.13 | 0.9507 |
| 19 | 0.8658 | 99.45 | 98.95 | 65.87 | 0.5692 |
1 - rf_bo.max['target']  # convert the best target back into a mean MAE
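Summary
In this chapter we built and tuned models and validated them. We also used some basic techniques to improve prediction accuracy; the improvement is shown in the figure below.
plt.figure(figsize=(13, 5))
sns.lineplot(x=['0_origin', '1_log_transfer', '2_L1_&_L2', '3_change_model', '4_parameter_tuning'], y=[1.36, 0.19, 0.19, 0.14, 0.13])
Task5 Modeling and Tuning END.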