Datawhale Data Mining for Beginners - Task4: Modeling and Hyperparameter Tuning

4. Modeling and Hyperparameter Tuning

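Tip: This is the Task4 (modeling and tuning) part of Data Mining for Beginners. It introduces a range of models together with strategies for evaluating and tuning them. Feel free to join the discussion afterwards.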

Competition: Data Mining for Beginners - Used Car Transaction Price Prediction

Link: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

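5.1 Learning objectives

  • Get to know the commonly used machine learning models and master the workflow for building and tuning them
  • Complete the corresponding check-in tasks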

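5.2 Contents

  1. Linear regression model:
    • What linear regression requires of the features;
    • Handling long-tailed distributions;
    • Understanding the linear regression model;
  2. Validating model performance:
    • Evaluation functions and objective functions;
    • Cross validation;
    • Leave-one-out validation;
    • Validation for time series problems;
    • Plotting learning curves;
    • Plotting validation curves;
  3. Embedded feature selection:
    • Lasso regression;
    • Ridge regression;
    • Decision trees;
  4. Model comparison:
    • Common linear models;
    • Common nonlinear models;
  5. Hyperparameter tuning:
    • Greedy tuning;
    • Grid search;
    • Bayesian tuning;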

5.3 Background theory and recommended reading

The theory behind these algorithms is lengthy, so this article points beginners to some blog posts and textbooks instead.

5.3.1 Linear regression

https://zhuanlan.zhihu.com/p/49480391

5.3.2 Decision trees

https://zhuanlan.zhihu.com/p/65304798

5.3.3 GBDT

https://zhuanlan.zhihu.com/p/45145899

5.3.4 XGBoost

https://zhuanlan.zhihu.com/p/86816771

5.3.5 LightGBM

https://zhuanlan.zhihu.com/p/89360721

5.3.6 Recommended textbooks:

  • 《机器学习》 (Machine Learning) https://book.douban.com/subject/26708119/
  • 《统计学习方法》 (Statistical Learning Methods) https://book.douban.com/subject/10590856/
  • 《Python大战机器学习》 https://book.douban.com/subject/26987890/
  • 《面向机器学习的特征工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/
  • 《数据科学家访谈录》 https://book.douban.com/subject/30129410/

5.4 Code examples

5.4.1 Loading the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

The reduce_mem_usage function shrinks the memory a DataFrame occupies by downcasting each column to a smaller data type.

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # memory_usage() returns bytes, so convert to MB before reporting
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns are stored as pandas categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
Memory usage of dataframe is 59.22 MB
Memory usage after optimization is: 15.57 MB
Decreased by 73.7%
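To see the downcasting idea in isolation, here is a minimal sketch on a toy frame. It uses pandas' built-in `pd.to_numeric(..., downcast=...)` rather than the function above, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names; pandas defaults to 64-bit types.
df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),          # values fit in int8
    "big_int": np.arange(100, dtype=np.int64) * 100_000,  # values need int32
    "ratio": np.linspace(0.0, 1.0, 100),                  # float64
})

before = df.memory_usage(deep=True).sum()

# pd.to_numeric picks the smallest dtype that can still hold every value.
df["small_int"] = pd.to_numeric(df["small_int"], downcast="integer")
df["big_int"] = pd.to_numeric(df["big_int"], downcast="integer")
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print("bytes before: {}, after: {}".format(before, after))
```

Unlike the hand-written function, `downcast="float"` never goes below float32, which avoids the precision loss that float16 can introduce.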
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]
print(continuous_feature_names )
['SaleID', 'name', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city', 'brand_amount', 'brand_price_max', 'brand_price_median', 'brand_price_min', 'brand_price_sum', 'brand_price_std', 'brand_price_average', 'power_bin']

5.4.2 Linear regression & five-fold cross validation & simulating a real business setting

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
#sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)  # the data prepared for linear regression has no notRepairedDamage column
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']
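5.4.2 - 1 Simple modeling
from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2; dropping it should not change an ordinary least-squares fit beyond numerical precision
model = LinearRegression()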
model = model.fit(train_X, train_y)

Inspect the intercept and the weights (coef) of the trained linear regression model

'intercept:'+ str(model.intercept_)
'intercept:-126664.45269441482'
print(list(zip(continuous_feature_names, model.coef_)))
[('SaleID', -2.6242814662349187e-05), ('name', -5.9398048325448294e-05), ('bodyType', 188.1542135740946), ('fuelType', 278.62309474958533), ('gearbox', 819.9985992362838), ('power', 26.52351381616533), ('kilometer', -349.6102591453823), ('notRepairedDamage', -485.72677901237887), ('seller', -1.0499032214283943e-05), ('offerType', -2.9383227229118347e-07), ('v_0', -1288.9192458005584), ('v_1', -42660.76619028431), ('v_2', -23353.30098764864), ('v_3', 16260.567718944378), ('v_4', -16473.433195105255), ('v_5', -48335.98859823758), ('v_6', 3207484.3736582436), ('v_7', 27215.41447298493), ('v_8', 661342.2244891225), ('v_9', 204898.19106724436), ('v_10', 3146.2528408254107), ('v_11', 9878.120544089928), ('v_12', 17969.77067249327), ('v_13', 11220.209680387874), ('v_14', -269.0896142041866), ('train', 1.2852251529693604e-07), ('used_time', -0.07937278205828256), ('city', 45.70470384266182), ('brand_amount', 0.14711997772017754), ('brand_price_max', 0.004267229147085026), ('brand_price_median', 0.5109936826971243), ('brand_price_min', -2.3352033358894935), ('brand_price_sum', -2.1713184360129116e-05), ('brand_price_std', 0.4376364599420586), ('brand_price_average', -0.41117841369466007), ('power_bin', -22.121690304925096)]
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
[('v_6', 3207484.3736582436),
 ('v_8', 661342.2244891225),
 ('v_9', 204898.19106724436),
 ('v_7', 27215.41447298493),
 ('v_12', 17969.77067249327),
 ('v_3', 16260.567718944378),
 ('v_13', 11220.209680387874),
 ('v_11', 9878.120544089928),
 ('v_10', 3146.2528408254107),
 ('gearbox', 819.9985992362838),
 ('fuelType', 278.62309474958533),
 ('bodyType', 188.1542135740946),
 ('city', 45.70470384266182),
 ('power', 26.52351381616533),
 ('brand_price_median', 0.5109936826971243),
 ('brand_price_std', 0.4376364599420586),
 ('brand_amount', 0.14711997772017754),
 ('brand_price_max', 0.004267229147085026),
 ('train', 1.2852251529693604e-07),
 ('offerType', -2.9383227229118347e-07),
 ('seller', -1.0499032214283943e-05),
 ('brand_price_sum', -2.1713184360129116e-05),
 ('SaleID', -2.6242814662349187e-05),
 ('name', -5.9398048325448294e-05),
 ('used_time', -0.07937278205828256),
 ('brand_price_average', -0.41117841369466007),
 ('brand_price_min', -2.3352033358894935),
 ('power_bin', -22.121690304925096),
 ('v_14', -269.0896142041866),
 ('kilometer', -349.6102591453823),
 ('notRepairedDamage', -485.72677901237887),
 ('v_0', -1288.9192458005584),
 ('v_4', -16473.433195105255),
 ('v_2', -23353.30098764864),
 ('v_1', -42660.76619028431),
 ('v_5', -48335.98859823758)]
from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

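Draw a scatter plot of feature v_9 against the label. The plot shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predicted values fall below 0, which indicates that the model has problems.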

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obvious different from true price')
plt.show()
The predicted price is obvious different from true price

[Figure: scatter plot of true price (black) and predicted price (blue) against v_9 (output_25_1.png)]

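Plotting the data shows that the label (price) follows a long-tailed distribution, which hurts modeling and prediction. The reason is that many models assume the error term is normally distributed, and long-tailed data violates that assumption. Reference blog post: https://blog.csdn.net/Noob_daniel/article/details/76087829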
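Besides eyeballing a histogram, the length of the tail can be quantified with the sample skewness. The sketch below uses synthetic log-normal "prices" (not the competition data) and a hand-written `skewness` helper, both of which are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic long-tailed "prices" drawn from a log-normal distribution.
price = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.std(x) ** 3)

raw_skew = skewness(price)
log_skew = skewness(np.log1p(price))
print("skewness of raw price:    {:.2f}".format(raw_skew))
print("skewness of log1p(price): {:.2f}".format(log_skew))
```

A large positive skewness signals a long right tail; a value near zero after the log transform suggests the transformed label is roughly symmetric.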

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])
It is clear to see the price shows a typical exponential distribution





<matplotlib.axes._subplots.AxesSubplot at 0x1f283eed400>

[Figure: distribution of price, full range (left) and below the 0.9 quantile (right) (output_27_2.png)]

Here we apply a $\log(x+1)$ transform to the label so that it is closer to a normal distribution

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
The transformed price seems like normal distribution





<matplotlib.axes._subplots.AxesSubplot at 0x1f284810518>

[Figure: distribution of the log-transformed price, full range (left) and below the 0.9 quantile (right) (output_30_2.png)]

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
intercept:24.63700282276484





[('v_9', 6.1128413214392445),
 ('v_12', 1.8133027515087667),
 ('v_1', 1.3778867649964353),
 ('v_11', 1.0279851607581858),
 ('v_13', 0.9951368625047502),
 ('v_3', 0.8470788263824458),
 ('gearbox', 0.006560671796859838),
 ('power_bin', 0.0063348644662720045),
 ('fuelType', 0.004561793308582229),
 ('bodyType', 0.00396656188299689),
 ('power', 0.0008630393962768842),
 ('brand_price_min', 2.158178983844523e-05),
 ('brand_price_average', 4.826131662712308e-06),
 ('used_time', 3.6445278608513713e-06),
 ('brand_amount', 2.552504856856017e-06),
 ('brand_price_max', 6.39309059168058e-07),
 ('SaleID', 6.2107021175279625e-09),
 ('train', 1.1368683772161603e-13),
 ('offerType', -5.695710569852963e-11),
 ('brand_price_sum', -1.0932993948860288e-10),
 ('seller', -2.469846549502108e-10),
 ('name', -6.799352292953178e-08),
 ('brand_price_median', -1.3798205820287968e-06),
 ('brand_price_std', -2.5206316582625653e-06),
 ('city', -0.0014931465368282217),
 ('v_14', -0.002222951757088256),
 ('kilometer', -0.01294441510560885),
 ('v_5', -0.08606704218427765),
 ('v_0', -0.09585997245552068),
 ('notRepairedDamage', -0.25850271519244444),
 ('v_7', -0.43530643853309636),
 ('v_4', -0.8477977314489259),
 ('v_2', -0.9636353762334509),
 ('v_10', -1.6244464090540611),
 ('v_8', -42.73052915218399),
 ('v_6', -226.54022999537742)]

Visualizing again, the predictions are now fairly close to the true values and no anomalies appear

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 inverts log(x + 1)
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()
The predicted price seems normal after np.log transforming

[Figure: scatter plot of true price (black) and predicted price (blue) against v_9 after the log transform (output_33_1.png)]
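Because the label was transformed with $\log(x+1)$, predictions have to be mapped back to prices with $e^{y} - 1$ rather than $e^{y}$. A small sketch of the round trip, with made-up price values:

```python
import numpy as np

# Made-up prices, including 0 to show why the +1 is needed.
price = np.array([0.0, 1.0, 150.0, 9999.0])

y_ln = np.log1p(price)       # same as np.log(price + 1)
back_expm1 = np.expm1(y_ln)  # exact inverse: exp(y) - 1
back_exp = np.exp(y_ln)      # plain exp overshoots every value by 1

print(back_expm1)
print(back_exp)
```

The offset is small next to typical car prices, but using the exact inverse keeps the reported error honest for cheap items.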

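5.4.2 - 2 Five-fold cross validation

When fitting parameters on data, people usually split the whole dataset into three parts (the MNIST handwritten digit dataset is one example): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists to make the training result trustworthy. The test set is easy to understand: it is data that takes no part in training and is used only to observe the final performance. The training and validation sets involve the ideas below.

In practice, a trained model usually fits the training set quite well (and is sensitive to initial conditions), but its fit on data outside the training set is often less satisfying. So we normally do not train on all of the data. Instead we hold out a portion that takes no part in training and use it to test the parameters learned from the rest, which gives a relatively objective view of how well those parameters describe unseen data. This idea is called cross validation.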
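What `cross_val_score` does internally can be written out by hand. The following is a minimal sketch on synthetic regression data (standing in for the used-car features), using scikit-learn's `KFold`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data stands in for the used-car features.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, valid_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Score only on the held-out fold, which the model has not seen.
    fold_mae.append(mean_absolute_error(y[valid_idx], model.predict(X[valid_idx])))

print("MAE per fold:", np.round(fold_mae, 3))
print("mean MAE:", np.mean(fold_mae))
```

Each sample is used for validation exactly once, so the mean over folds uses all of the data without ever scoring a model on rows it was trained on.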

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.5s finished
print('AVG:', np.mean(scores))
AVG: 1.3727047560963908

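Five-fold cross validation of the linear regression model on the features with the untransformed label (MAE about 1.37)

Five-fold cross validation of the linear regression model on the features with the transformed label (MAE about 0.19)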

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.2s finished
print('AVG:', np.mean(scores))
AVG: 0.19144173929628336
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
          cv1      cv2       cv3       cv4     cv5
MAE  0.024483  0.02477  0.024931  0.024658  0.0249
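One way to keep the label transform and its inverse together, so that the error is always reported on the original price scale, is scikit-learn's `TransformedTargetRegressor`. This is not what the tutorial code above does; it is an alternative sketched on synthetic data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Synthetic positive, long-tailed target: exponential of a linear signal.
signal = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y = np.exp(1.0 + signal + rng.normal(scale=0.1, size=1000))

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to the regressor's predictions
)

# 'neg_mean_absolute_error' is negated so that larger is better; flip the sign back.
mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE on the original scale:", mae.mean())
```

With this wrapper the model is trained in log space while every score is computed in price units, so results from different transforms stay directly comparable.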
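5.4.2 - 3 Simulating a real business setting

In reality, however, we cannot foresee the future, so five-fold cross validation can paint an unrealistic picture on time-dependent datasets. Using 2018 used-car prices to predict 2017 prices is clearly unreasonable, so we can also split the dataset in time order. In this example we take the earliest 4/5 of the samples as the training set and the latest 1/5 as the validation set. The final result differs little from five-fold cross validation.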

import datetime
sample_feature = sample_feature.reset_index(drop=True)
split_point = len(sample_feature) // 5 * 4
train = sample_feature.iloc[:split_point].dropna()  # first 4/5 of the rows
val = sample_feature.iloc[split_point:].dropna()    # last 1/5; iloc keeps the boundary row out of the training set

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
0.1927560130392256
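The single chronological split above can be extended to several expanding-window splits with scikit-learn's `TimeSeriesSplit`. A minimal sketch, assuming the rows are already sorted from oldest to newest:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten samples assumed to be sorted by time (oldest first).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, valid_idx in splits:
    # Every validation index comes after every training index.
    print("train:", train_idx, "valid:", valid_idx)
```

The object can be passed as `cv=tscv` to `cross_val_score`, so the model is never validated on data that precedes its training window.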
5.4.2 - 4 Plotting learning curves and validation curves
from sklearn.model_selection import learning_curve, validation_curve
? learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # draw grid lines
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt  
plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
<module 'matplotlib.pyplot' from 'F:\\dev\\anaconda\\envs\\python35\\lib\\site-packages\\matplotlib\\pyplot.py'>

[Figure: learning curve of the linear model, training score and cross-validation score against the number of training examples (output_57_1.png)]
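A validation curve complements the learning curve: it holds the training set fixed and varies one hyperparameter instead. The sketch below is an illustration on synthetic data with `Ridge` and its `alpha` parameter, both chosen here as assumptions, and it prints the scores rather than plotting them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X, y = make_regression(n_samples=300, n_features=20, noise=20.0, random_state=0)

alphas = np.logspace(-3, 3, 7)
train_scores, valid_scores = validation_curve(
    Ridge(), X, y,
    param_name="alpha", param_range=alphas,
    cv=5, scoring="neg_mean_absolute_error",
)

# Scores are negated MAE; flip the sign and average over the 5 folds.
train_mae = -train_scores.mean(axis=1)
valid_mae = -valid_scores.mean(axis=1)
for a, tr, va in zip(alphas, train_mae, valid_mae):
    print("alpha={:<8g} train MAE={:.2f}  valid MAE={:.2f}".format(a, tr, va))
```

A gap that widens as the penalty shrinks points to overfitting, while both errors rising together at large `alpha` points to underfitting.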

5.4.3 Comparing multiple models
train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)
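5.4.3 - 1 Linear models & embedded feature selection

This section assumes the reader already understands overfitting, model complexity, and regularization. If not, please look up related material or refer to the following links:

  • Describing overfitting in plain language https://www.zhihu.com/question/32246256/answer/55320482
  • Model complexity and generalization ability http://yangyingming.com/article/434/
  • An intuitive view of regularization https://blog.csdn.net/jinping_shi/article/details/52433975

In filter and wrapper feature selection methods, the feature selection step is clearly separate from training the learner. Embedded feature selection, by contrast, selects features automatically while the learner is being trained. The most common embedded approaches are L1 and L2 regularization. Adding them to a linear regression model yields Lasso regression (L1) and Ridge regression (L2), respectively.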
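The difference between the two penalties is easiest to see on data where only a few features matter. A minimal sketch on synthetic data (the `alpha` values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("exact zeros -> Lasso: {}, Ridge: {}".format(n_zero_lasso, n_zero_ridge))
```

The L1 penalty sets the coefficients of uninformative features to exactly zero, which is what makes it usable as a feature selector, while the L2 penalty only shrinks them.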

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
for model in models:
    print (str(model).split('('))
['LinearRegression', 'copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)']
['Ridge', "alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,\n      normalize=False, random_state=None, solver='auto', tol=0.001)"]
['Lasso', "alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,\n      normalize=False, positive=False, precompute=False, random_state=None,\n      selection='cyclic', tol=0.0001, warm_start=False)"]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


LinearRegression is finished


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Ridge is finished
Lasso is finished


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    9.2s finished

Comparing the results of the three methods

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
     LinearRegression     Ridge     Lasso
cv1          0.189753  0.193554  0.408584
cv2          0.191633  0.195474  0.408400
cv3          0.192418  0.196309  0.410003
cv4          0.190645  0.194712  0.406354
cv5          0.192760  0.196596  0.409488
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions
intercept:24.63703112761765





<matplotlib.axes._subplots.AxesSubplot at 0x1f288ef7588>

[Figure: bar plot of the absolute LinearRegression coefficients per feature (output_69_2.png)]

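During fitting, L2 regularization tends to make the weights as small as possible, ending up with a model in which every parameter is fairly small. Models with small parameter values are generally considered simpler, adapt to different datasets, and avoid overfitting to some extent. Consider a linear regression equation: if the parameters are large, a tiny shift in the data has a large effect on the result, but if the parameters are small enough, even a larger shift barely changes it. The technical way to put this is that the model is robust to perturbations.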

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions
intercept:8.363622088450727





<matplotlib.axes._subplots.AxesSubplot at 0x1f286d2b4a8>

[Figure: bar plot of the absolute Ridge coefficients per feature (output_71_2.png)]

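L1 regularization helps produce a sparse weight matrix, which can then be used for feature selection. As the figure below shows, the power and used_time features turn out to be very important.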

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions
intercept:8.633835377056277





<matplotlib.axes._subplots.AxesSubplot at 0x1f2998c84e0>

[Figure: bar plot of the absolute Lasso coefficients per feature (output_73_2.png)]

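In addition, when a decision tree picks its split nodes by information entropy or the Gini index, the features it chooses to split on first are also the more important ones, which is likewise a form of feature selection. The feature importance metric in XGBoost and LightGBM (exposed as `feature_importances_` in their scikit-learn wrappers) is computed on this basis.

5.4.3 - 2 Nonlinear models

Besides linear models, there are many commonly used nonlinear models, listed below. Space is limited, so their theory is not covered one by one here. We pick some of the common ones and compare them against the linear model.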
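As an illustration of tree-based importance, the sketch below fits a scikit-learn `RandomForestRegressor` (used here as a stand-in for XGBoost or LightGBM) on synthetic data and reads its `feature_importances_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# The target depends strongly on feature 0, weakly on feature 1, and not on the rest.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances are normalized so that they sum to 1.
importances = forest.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print("feature {}: {:.3f}".format(idx, importances[idx]))
```

Features that reduce the split criterion the most across all trees rank highest, so the ordering can serve as a rough filter before training a final model.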

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result
     LinearRegression  DecisionTreeRegressor  RandomForestRegressor  GradientBoostingRegressor  MLPRegressor  XGBRegressor  LGBMRegressor
cv1          0.189753               0.194966               0.131831                   0.168502    330.965632           NaN       0.142259
cv2          0.191633               0.188277               0.132025                   0.170869    529.702511           NaN       0.143896
cv3          0.192418               0.186536               0.133057                   0.172026    252.100253           NaN       0.143949
cv4          0.190645               0.189973               0.132015                   0.169587   1660.597709           NaN       0.142199
cv5          0.192760               0.194598               0.134023                   0.170257   1016.973845           NaN       0.143291

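We can see that the random forest model achieved the lowest MAE in every fold.

5.4.4 Hyperparameter tuning

Here we introduce three commonly used tuning methods:

  • Greedy algorithm https://www.jianshu.com/p/ab89df9759c8
  • Grid search https://blog.csdn.net/weixin_43172660/article/details/83032029
  • Bayesian tuning https://blog.csdn.net/linxid/article/details/81189154
## LightGBM parameter sets: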

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []
5.4.4 - 1 Greedy tuning
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
print(min(best_obj.items(), key=lambda x:x[1])[0], min(best_leaves.items(), key=lambda x:x[1])[0],min(best_depth.items(), key=lambda x:x[1])[0])
regression_l1 55 20
sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
<matplotlib.axes._subplots.AxesSubplot at 0x1f28609e9b0>

[Figure: line plot of the best MAE after each greedy tuning step (output_88_1.png)]
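The greedy procedure can be written as a generic loop over an ordered search space. The sketch below swaps in a scikit-learn `DecisionTreeRegressor` and synthetic data so that it runs without LightGBM; `search_space` and `cv_mae` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Hypothetical search space; parameters are tuned in this order, one at a time.
search_space = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 10, 20],
}

def cv_mae(params):
    """Mean 5-fold MAE of a decision tree built with the given parameters."""
    model = DecisionTreeRegressor(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    return -scores.mean()

best_params = {}
for name, candidates in search_space.items():
    # Earlier choices stay frozen; only the current parameter varies.
    trial_scores = {value: cv_mae({**best_params, name: value}) for value in candidates}
    best_params[name] = min(trial_scores, key=trial_scores.get)

n_greedy = sum(len(v) for v in search_space.values())
n_grid = int(np.prod([len(v) for v in search_space.values()]))
print("greedy choice:", best_params)
print("settings evaluated: {} (a full grid would need {})".format(n_greedy, n_grid))
```

The number of evaluated settings grows with the sum of the list lengths rather than their product, at the cost of possibly missing combinations that only work well together.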

5.4.4 - 2 Grid search tuning
from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y)  # note: this fits on the raw price, while the other sections use train_y_ln
clf.best_params_
{'max_depth': 20, 'num_leaves': 55, 'objective': 'regression'}
model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
0.1367394710811723
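When no `scoring` argument is given, `GridSearchCV` ranks regressors by their default score (R²), not by MAE. The sketch below passes the metric explicitly, on synthetic data with a `DecisionTreeRegressor` standing in for LightGBM; the grid values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # without this, regressors are ranked by R^2
)
search.fit(X, y)

best_mae = -search.best_score_
print("best parameters:", search.best_params_)
print("best CV MAE:", best_mae)
```

Selecting with the same metric that is reported afterwards keeps the chosen parameters and the final score consistent.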
5.4.4 - 3 Bayesian tuning
from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
rf_bo.maximize()
|   iter    |  target   | max_depth | min_ch... | num_le... | subsample |
-------------------------------------------------------------------------
| 1         | 0.8081    | 2.139     | 95.27     | 13.28     | 0.7674    |
| 2         | 0.8542    | 79.84     | 95.88     | 24.63     | 0.7816    |
| 3         | 0.8656    | 64.07     | 21.62     | 63.41     | 0.3652    |
| 4         | 0.8684    | 67.02     | 73.01     | 89.31     | 0.9635    |
| 5         | 0.8617    | 96.55     | 95.6      | 44.35     | 0.3263    |
| 6         | 0.8694    | 99.38     | 3.649     | 99.61     | 0.9498    |
| 7         | 0.8682    | 10.57     | 2.683     | 99.59     | 0.2856    |
| 8         | 0.8137    | 98.58     | 2.266     | 4.117     | 0.4037    |
| 9         | 0.8691    | 97.07     | 4.548     | 96.72     | 0.2396    |
| 10        | 0.8693    | 99.28     | 59.42     | 99.63     | 0.3136    |
| 11        | 0.8694    | 49.26     | 2.312     | 99.34     | 0.2627    |
| 12        | 0.8501    | 5.686     | 54.37     | 99.81     | 0.8644    |
| 13        | 0.8689    | 93.38     | 99.77     | 98.88     | 0.1199    |
| 14        | 0.8508    | 5.093     | 3.532     | 64.22     | 0.9956    |
| 15        | 0.8692    | 75.14     | 26.5      | 97.88     | 0.9986    |
| 16        | 0.8672    | 94.55     | 47.66     | 75.79     | 0.9908    |
| 17        | 0.8675    | 71.72     | 98.67     | 80.06     | 0.1107    |
| 18        | 0.8693    | 73.31     | 32.14     | 99.59     | 0.1351    |
| 19        | 0.8694    | 70.88     | 4.322     | 99.78     | 0.6159    |
| 20        | 0.8677    | 98.94     | 99.55     | 83.43     | 0.8325    |
| 21        | 0.8669    | 82.89     | 66.63     | 76.34     | 0.1787    |
| 22        | 0.8692    | 42.33     | 2.094     | 97.08     | 0.8848    |
| 23        | 0.8691    | 99.86     | 25.37     | 99.77     | 0.8352    |
| 24        | 0.8682    | 74.09     | 2.025     | 87.13     | 0.418     |
| 25        | 0.8693    | 96.85     | 32.41     | 99.86     | 0.1375    |
| 26        | 0.8694    | 79.41     | 2.742     | 99.7      | 0.3403    |
| 27        | 0.8692    | 85.17     | 57.07     | 99.65     | 0.9642    |
| 28        | 0.8692    | 55.01     | 17.77     | 99.77     | 0.1056    |
| 29        | 0.8686    | 80.48     | 30.21     | 91.61     | 0.1095    |
| 30        | 0.8694    | 98.03     | 99.1      | 99.78     | 0.855     |
=========================================================================
1 - rf_bo.max['target']
0.13060740104370572

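Summary

In this chapter we completed the modeling and tuning work and validated our model. We also applied some basic methods to improve prediction accuracy; the improvement is shown in the figure below.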

plt.figure(figsize=(13,5))
sns.lineplot(x=['0_origin','1_log_transfer','2_L1_&_L2','3_change_model','4_parameter_tuning'], y=[1.36 ,0.19, 0.19, 0.14, 0.13])
<matplotlib.axes._subplots.AxesSubplot at 0x1f28c646ba8>

[Figure: line plot of MAE across the improvement stages (output_103_1.png)]

Task4 Modeling and Tuning END.

By: 小雨姑娘

Data mining enthusiast with multiple top finishes in competitions.
The author's machine learning notes: https://zhuanlan.zhihu.com/mlbasic

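About Datawhale:

Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its team members share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages people to present themselves honestly, to be open and inclusive, to trust and help each other, to dare to make mistakes, and to take responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future.

For this data mining learning path, the topic material will be shared on Tianchi. Follow Datawhale for details.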

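Study notes

  • The preceding feature engineering determines the upper limit of the model's prediction score; actually reaching that limit depends on modeling and tuning.

  • We start by fitting a fairly simple linear regression model to the data. One point to note is that price does not follow a normal distribution (it is long-tailed), which makes the prediction error large. We tried truncation and then a log transform, and found that the log transform works better: the transformed data is closer to a normal distribution, and the average error (AVG) dropped from 1.3727047560963908 to about 0.1914417. The log transform is a commonly used transformation.

  • Next came the ways of validating model performance. We first used k-fold cross validation. For time-ordered data we cannot use future data to predict the present, so we used the first 4/5 of the data as the training set and the last 1/5 as the validation set, and found that the error is about the same as with k-fold cross validation.

  • For linear models, embedded feature selection was introduced, that is, adding regularization to prevent overfitting. There are two kinds, L1 and L2 regularization (L1 may drive many feature weights to zero, while L2 tends to make every feature weight fairly small). Use L1 regularization when the features tend to be sparse and L2 regularization when they do not.

  • Next, run each candidate model once to get a score and roughly pick out the better-performing models.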

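  • With the model chosen, the next step is hyperparameter tuning:

    • Greedy algorithm (optimize only one parameter at a time; each later parameter is tuned with the earlier parameters fixed at their best values)

    • Grid search (try every combination of parameter values and pick the one with the smallest error)

    • Bayesian tuning
      - Given an objective function to optimize (a function in the broad sense: only its inputs and outputs need to be specified, with no knowledge of its internal structure or mathematical properties), keep adding sample points to update the posterior distribution of the objective function (a Gaussian process) until the posterior roughly matches the true distribution. Put simply, it takes the information from previous parameter evaluations into account in order to choose the current parameters better.
      It differs from ordinary grid search or random search as follows:
      Bayesian tuning uses a Gaussian process, takes earlier parameter information into account, and keeps updating the prior;
      grid search does not use earlier parameter information
      Bayesian tuning needs few iterations and is fast;
      grid search is slow, and with many parameters the number of combinations explodes
      Bayesian tuning remains robust on non-convex problems;
      grid search tends to end up in a local optimum on non-convex problems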
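The cost of grid search can be made concrete by counting combinations. The sketch below feeds the three parameter lists from the grid search section of this tutorial to scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# The three lists used in the grid search section of this tutorial.
grid = ParameterGrid({
    "objective": ["regression", "regression_l1", "mape", "huber", "fair"],
    "num_leaves": [3, 5, 10, 15, 20, 40, 55],
    "max_depth": [3, 5, 10, 15, 20, 40, 55],
})

n_combinations = len(grid)    # 5 * 7 * 7
n_fits = n_combinations * 5   # each combination is fitted once per fold
print(n_combinations, "combinations,", n_fits, "model fits")
```

The greedy pass over the same lists evaluates only 5 + 7 + 7 = 19 settings, which is why the grid grows impractical quickly as more parameters are added.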
