Tianchi Used-Car Auction Competition: Modeling and Hyperparameter Tuning

Latest recommended article published on 2020-04-04 21:57:30

起名字什么的好难


Category: Artificial Intelligence

Article link: https://blog.csdn.net/u012428169/article/details/105254344


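Tianchi used-car transaction price prediction: feature analysis, models, and tuning techniques
Original source:
Datawhale 零基础入门数据挖掘-Task4 建模调参 (Datawhale beginner data mining series, Task 4: modeling and tuning)
These are my personal reading notes. They only record the things that were new to me while reading.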

Models
Building a simple model

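# 1. Import the model class
from sklearn.linear_model import LinearRegression
#from sklearn.linear_model import Ridge
#from sklearn.linear_model import Lasso
#from sklearn.svm import SVR  # SVR is the regressor; SVC is a classifier
#from sklearn.tree import DecisionTreeRegressor
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.ensemble import GradientBoostingRegressor
#from sklearn.neural_network import MLPRegressor
#from xgboost.sklearn import XGBRegressor
#from lightgbm.sklearn import LGBMRegressor
# 2. Instantiate the model (other models work the same way).
# The `normalize` argument was removed in scikit-learn 1.2; scale the features
# beforehand (e.g. with StandardScaler) if normalization is needed.
model = LinearRegression()
# 3. Fit the model; train_X and train_y come from the earlier feature-engineering step
model = model.fit(train_X, train_y)
# 4. Predict on a subsample of the training rows
model.predict(train_X.loc[subsample_index])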

Comparing multiple models

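import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators=100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators=100)]
# Store each model's cross-validation scores in a dict
result = dict()
for model in models:
    model_name = type(model).__name__
    # train_y_ln is the log-transformed target; the score is MAE, so lower is better
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

# One column per model, one row per fold
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result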

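Two regularized variants of the linear model
In filter and wrapper feature selection, the selection step is clearly separate from training the learner. Embedded feature selection instead performs the selection automatically while the learner is being trained. The most common embedded approaches are L1 and L2 regularization. Adding an L2 penalty to linear regression gives ridge regression, and adding an L1 penalty gives Lasso regression. Of the two, only the L1 penalty drives coefficients exactly to zero, so Lasso is the one that actually discards features; ridge only shrinks them.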

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
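
To make the difference between the two penalties concrete, here is a minimal pure-Python sketch. It assumes an orthonormal design matrix (X^T X = I), a special case in which both penalized problems have a closed-form solution per coefficient. The function names, the coefficients and the alpha value are illustrative assumptions, not taken from the original tutorial.

```python
import math


def ridge_shrink(w_ols, alpha):
    """Ridge solution of (1/2)||y - Xw||^2 + (alpha/2)||w||^2 when X^T X = I."""
    return [w / (1.0 + alpha) for w in w_ols]


def lasso_shrink(w_ols, alpha):
    """Lasso solution of (1/2)||y - Xw||^2 + alpha*||w||_1 when X^T X = I.

    This is soft-thresholding: shrink each magnitude by alpha, clip at zero.
    """
    out = []
    for w in w_ols:
        magnitude = max(abs(w) - alpha, 0.0)
        out.append(0.0 if magnitude == 0.0 else math.copysign(magnitude, w))
    return out


# Hypothetical ordinary-least-squares coefficients: two large, two small
w_ols = [3.0, -0.2, 0.05, -1.5]
ridge_w = ridge_shrink(w_ols, alpha=0.5)
lasso_w = lasso_shrink(w_ols, alpha=0.5)
print("ridge:", ridge_w)
print("lasso:", lasso_w)
```

Under these assumptions ridge rescales every coefficient but leaves all of them non-zero, while Lasso sets the two small coefficients exactly to zero, which is the feature-selection effect described above.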

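K-fold cross-validation
Instead of training on the whole dataset, part of it is held out (it takes no part in training) and used to test the parameters fitted on the training portion. This gives a relatively objective estimate of how well those parameters fit data outside the training set. The idea is called cross-validation. In the K-fold version the data is split into K parts, and each part serves once as the held-out set while the other K-1 parts are used for training.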

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
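
As a rough illustration of what happens behind `cross_val_score(..., cv=5)`, the sketch below generates the train/validation index sets for K contiguous folds in plain Python. The helper name `k_fold_indices` is hypothetical. As far as I know, scikit-learn's unshuffled `KFold` splits in the same contiguous way, but this is a simplified stand-in, not its implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample lands in the validation set exactly once, so each of the K scores is computed on data the corresponding model never saw.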

5-fold cross-validation
5-fold validation on the data with unprocessed labels

5-fold validation on the data with processed labels

Question: why does the run on the unprocessed labels need a log_transfer function?
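
The notes leave this question open, and the code it refers to was in an image that did not survive. One plausible answer: a model trained on raw prices predicts raw prices, so its MAE is on the price scale and cannot be compared with the MAE of a model trained on log prices. Wrapping the metric so that it takes logs of both the labels and the predictions puts the two runs on the same scale. The sketch below is my own reconstruction under that assumption. Only the name `log_transfer` comes from the notes; the body, the plain-Python MAE and the sample numbers are illustrative, and it assumes all values are positive.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Plain-Python MAE, standing in for sklearn.metrics.mean_absolute_error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_transfer(metric):
    """Wrap a metric so that it is evaluated on log(y) and log(y_pred)."""
    def wrapper(y_true, y_pred):
        log_true = [math.log(t) for t in y_true]
        log_pred = [math.log(p) for p in y_pred]
        return metric(log_true, log_pred)
    return wrapper


# Hypothetical raw prices and raw-scale predictions
prices = [1000.0, 20000.0]
preds = [2000.0, 10000.0]
raw_mae = mean_absolute_error(prices, preds)                 # price scale
log_mae = log_transfer(mean_absolute_error)(prices, preds)   # log scale
print(raw_mae, log_mae)
```

On the raw scale the error on the expensive car dominates; on the log scale both predictions are off by the same factor of two and contribute equally.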


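Note: K-fold cross-validation assumes that the samples are independent of one another (this is my guess). If the data is time-dependent, it is better to train on the first n-k samples and validate on the last k.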
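
A minimal sketch of that time-ordered split, assuming the rows are already sorted by time. The helper name `time_ordered_split` is hypothetical.

```python
def time_ordered_split(n_samples, k):
    """Return (train_idx, val_idx): the first n-k samples train, the last k validate."""
    if not 0 < k < n_samples:
        raise ValueError("k must satisfy 0 < k < n_samples")
    split = n_samples - k
    return list(range(split)), list(range(split, n_samples))


train_idx, val_idx = time_ordered_split(n_samples=10, k=3)
print("train:", train_idx)
print("validate:", val_idx)
```

Every training index precedes every validation index, so the model is never evaluated on data older than what it was trained on. scikit-learn also provides `TimeSeriesSplit` in `sklearn.model_selection` for a repeated, expanding-window version of the same idea.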

Plotting learning curves
Plotting the learning curve and the validation curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training example')
    plt.ylabel('score')  # the score is MAE here, so lower is better
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring=make_scorer(mean_absolute_error))
    # Mean and standard deviation across the CV folds for each training size
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # draw grid lines
    # Shaded bands: mean plus/minus one standard deviation
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

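For plotting with seaborn and matplotlib, see:
教你使用Matplotlib和Seaborn演示Python可视化 (a guide to Python visualization with Matplotlib and Seaborn)
Model tuning
Three strategies: greedy tuning, grid search, and Bayesian optimization.
1) Greedy tuning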

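import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
from lightgbm.sklearn import LGBMRegressor

# Candidate values: illustrative assumptions, these notes do not list them
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
mae = make_scorer(mean_absolute_error)

# Step 1: find the objective with the lowest mean CV MAE
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=mae))
    best_obj[obj] = score
obj_star = min(best_obj, key=best_obj.get)

# Step 2: freeze that objective and tune num_leaves
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=obj_star, num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=mae))
    best_leaves[leaves] = score
leaves_star = min(best_leaves, key=best_leaves.get)

# Step 3: freeze both and tune max_depth
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=obj_star, num_leaves=leaves_star, max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=mae))
    best_depth[depth] = score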

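min(best_obj.values()), min(best_leaves.values()) and min(best_depth.values()) give the lowest MAE reached at each step. The parameter value that achieves it is the matching key, e.g. min(best_obj, key=best_obj.get).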
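
The greedy procedure itself is independent of LightGBM. The sketch below runs it on a toy score function so the behavior is easy to check by hand. `toy_score`, `greedy_search`, the parameter names `a` and `b` and the grid are all hypothetical stand-ins for a cross-validation run.

```python
def toy_score(params):
    """Hypothetical validation error (lower is better); stands in for a CV run.

    The (a - b)**2 term makes the two parameters interact.
    """
    a, b = params["a"], params["b"]
    return (a - b) ** 2 + 0.1 * (4 - a)


def greedy_search(score_fn, grid, defaults):
    """Tune one parameter at a time, freezing each at its best value before moving on."""
    best = dict(defaults)
    n_evals = 0
    for name, candidates in grid.items():
        scores = {}
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            scores[value] = score_fn(trial)
            n_evals += 1
        best[name] = min(scores, key=scores.get)
    return best, score_fn(best), n_evals


grid = {"a": [0, 1, 2, 3, 4], "b": [0, 1, 2, 3, 4]}
best, best_score, n_evals = greedy_search(toy_score, grid, defaults={"a": 0, "b": 0})
print(best, best_score, n_evals)
```

Greedy search needs only 5 + 5 = 10 evaluations here instead of the 25 a full grid would take. The price is that it can get stuck: on this toy function it stays at a = 0, b = 0, while the best point of the full grid is a = 4, b = 4 with score 0. That is the usual trade-off when parameters interact.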
2) Grid search

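from sklearn.model_selection import GridSearchCV
from lightgbm.sklearn import LGBMRegressor
# objective, num_leaves, max_depth: the candidate lists used in the greedy search above
parameters = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
# GridSearchCV maximizes its score, so use negated MAE to match the metric used elsewhere
clf = GridSearchCV(model, parameters, cv=5, scoring='neg_mean_absolute_error')
# Fit on the log-transformed target, consistent with the other tuning sections
clf = clf.fit(train_X, train_y_ln)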

The best parameter values are available in clf.best_params_.
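
What a grid search does can be sketched in a few lines of plain Python: evaluate every combination and keep the best. This is a simplified illustration, not how `GridSearchCV` is implemented. `toy_cv_error`, `grid_search` and the candidate values are hypothetical.

```python
from itertools import product


def toy_cv_error(params):
    """Hypothetical validation error (lower is better); stands in for a 5-fold CV run."""
    return abs(params["num_leaves"] - 20) / 20 + abs(params["max_depth"] - 10) / 10


def grid_search(score_fn, parameters):
    """Evaluate every combination in the grid and return the best one."""
    names = list(parameters)
    results = []
    for values in product(*(parameters[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    best_score, best_params = min(results, key=lambda item: item[0])
    return best_params, best_score, len(results)


# Illustrative candidate values (assumed, not from the notes)
parameters = {
    "num_leaves": [3, 5, 10, 15, 20, 40, 55],
    "max_depth": [3, 5, 10, 15, 20, 40, 55],
}
best_params, best_score, n_evals = grid_search(toy_cv_error, parameters)
print(best_params, best_score, n_evals)
```

The number of evaluations is the product of the list lengths (7 x 7 = 49 here), and with 5-fold CV each evaluation trains five models. That cost is why grid search is thorough but slow once more parameters are added.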
3) Bayesian optimization

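from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
from lightgbm.sklearn import LGBMRegressor

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    # The optimizer proposes floats, so integer-valued parameters are cast with int()
    val = cross_val_score(
        LGBMRegressor(objective='regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample=subsample,  # note: has no effect in LightGBM unless subsample_freq > 0
            min_child_samples=int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    # BayesianOptimization maximizes its target, so return 1 - MAE (lower error, higher target)
    return 1 - val

# Search bounds (lower, upper) for each parameter
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples': (2, 100)
    }
)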

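Call rf_bo.maximize() to run the search. In recent versions of bayes_opt the best target value and the parameters that produced it are then available as rf_bo.max.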
