零基础入门数据挖掘-Task4 建模调参

最新推荐文章于 2024-08-30 08:07:28 发布

GCZZZ

最新推荐文章于 2024-08-30 08:07:28 发布

阅读量182

点赞数

分类专栏：数据挖掘入门文章标签：数据挖掘

本文链接：https://blog.csdn.net/GCZZZ/article/details/105254367

版权

数据挖掘入门专栏收录该内容

4 篇文章 0 订阅

订阅专栏

学习内容

线性回归模型
线性回归对于特征的要求；
处理长尾分布；
理解线性回归模型；
模型性能验证
评价函数与目标函数；
交叉验证方法；
留一验证方法；
针对时间序列问题的验证；
绘制学习率曲线；
绘制验证曲线；
嵌入式特征选择

Lasso回归；
Ridge回归；
决策树；
模型对比

常用线性模型；
常用非线性模型；

模型调参
贪心调参方法；
网格调参方法；
贝叶斯调参方法。

部分代码示例

线性回归

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

查看训练的线性回归模型的截距（intercept）与权重(coef)

'intercept:'+ str(model.intercept_)

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

绘制特征值与标签的散点图，（以V9为例）

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obvious different from true price')
plt.show()

对标签进行了log(x+1) 变换，使标签贴近于正态分布

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

绘制学习率曲线与验证曲线

from sklearn.model_selection import learning_curve, validation_curve
? learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()#区域  
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt  
plot_learning_curve(LinearRegression(), 'Liner_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

在过滤式和包裹式特征选择方法中，特征选择过程与学习器训练过程有明显的分别。而嵌入式特征选择在学习器训练过程中自动地进行特征选择。嵌入式选择最常用的是L1正则化与L2正则化。在对线性回归模型加入两种正则化方法后，他们分别变成了岭回归与Lasso回归。

模型调参

介绍了三种常用的调参方法如下：

贪心算法 https://www.jianshu.com/p/ab89df9759c8
网格调参 https://blog.csdn.net/weixin_43172660/article/details/83032029
贝叶斯调参 https://blog.csdn.net/linxid/article/details/81189154

贝叶斯调参代码示例

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)

rf_bo.maximize()
1 - rf_bo.max['target']

总结

完成了建模与调参的工作，并对我们的模型进行了验证。
了解各种模型以及模型的评价和调参策略。

感谢小雨姑娘（https://zhuanlan.zhihu.com/mlbasic）的指导。

GCZZZ

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
零基础入门数据挖掘-Task4 建模调参

学习内容线性回归模型线性回归对于特征的要求；处理长尾分布；理解线性回归模型；模型性能验证评价函数与目标函数；交叉验证方法；留一验证方法；针对时间序列问题的验证；绘制学习率曲线；绘制验证曲线；嵌入式特征选择Lasso回归；Ridge回归；决策树；模型对比常用线性模型；常用非线性模型；模型调参贪心调参方法；网格调参方法；贝叶斯调参方法。部分代码示例线性回...
复制链接

扫一扫

专栏目录