Data Mining from Scratch, Task 4: Modeling and Hyperparameter Tuning

Zee_Chao

Article link: https://blog.csdn.net/Zee_Chao/article/details/105242699

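1. Learning Objectives

1. Understand common machine learning models and learn the workflow for building and tuning them

2. Linear regression models

3. Model performance validation

4. Embedded feature selection

5. Comparing different models

6. Hyperparameter tuning methods: greedy, grid, and Bayesian

This project follows https://github.com/datawhalechina/team-learning

2. Preparation

Before modeling, besides importing the relevant modules and data, we also downcast the dtype of each feature to reduce memory usage.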

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def reduce_mem_usage(df):
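    '''
    Downcast dtypes to reduce memory usage.
    Numeric features are converted to the smallest dtype that can hold their range;
    string (object) features are converted to the category dtype.
    '''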
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv(r'./data/data_for_tree.csv'))

The output is:

Memory usage of dataframe is 62099672.00 MB
Memory usage after optimization is: 16520303.00 MB
Decreased by 73.4%

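Memory usage dropped by 73.4%. Note that memory_usage().sum() returns bytes, so the values printed with the MB label are actually byte counts (about 62.1 million bytes before and 16.5 million bytes after).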
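The same downcasting idea can be tried on a tiny frame. This is a minimal sketch, assuming pandas and NumPy are installed; the column names and values are made up for illustration, and pd.to_numeric with downcast is used here as a shortcut instead of the hand-written range checks above. Memory is reported in real megabytes by dividing the byte count by 1024 ** 2.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the competition dataset
df = pd.DataFrame({
    "small_int": np.arange(1000, dtype=np.int64) % 100,  # values 0..99 fit in int8
    "ratio": np.linspace(0.0, 1.0, 1000),                 # float64 by default
    "label": ["a", "b"] * 500,                            # string column
})

# memory_usage returns bytes; deep=True also counts the string payloads
before_mb = df.memory_usage(deep=True).sum() / 1024 ** 2

df["small_int"] = pd.to_numeric(df["small_int"], downcast="integer")
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")
df["label"] = df["label"].astype("category")

after_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print("{:.3f} MB -> {:.3f} MB".format(before_mb, after_mb))
```

Downcasting floats trades precision for space, so it is worth checking that the narrower dtype is still accurate enough for the feature in question.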

3. Simple Modeling with Linear Regression

3.1 Fitting a simple linear regression

The code is as follows:

from sklearn.linear_model import LinearRegression

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop = True)
sample_feature['notRepairedDamage'] = \
    sample_feature['notRepairedDamage'].astype(np.float32)
continuous_feature_names = [x for x in sample_feature.columns \
                            if x not in ['price','brand','model']]
train = sample_feature[continuous_feature_names + ['price']]

X_train = train[continuous_feature_names]
y_train = train['price']

# Note: the normalize argument was removed in scikit-learn 1.2;
# on newer versions, drop it (or standardize the features in a Pipeline)
lr = LinearRegression(normalize = True)
model = lr.fit(X_train, y_train)

# Inspect the intercept and weights (coef) of the fitted linear regression model
print('intercept:'+ str(model.intercept_))

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), \
       key = lambda x: x[1], reverse = True)

The output is:

intercept:-110670.6827721497
[('v_6', 3367064.3416418717),
 ('v_8', 700675.5609399063),
 ('v_9', 170630.2772322219),
 ('v_7', 32322.66193203625),
 ('v_12', 20473.670796956616),
 ('v_3', 17868.079541493582),
 ('v_11', 11474.938996702811),
 ('v_13', 11261.764560014222),
 ('v_10', 2683.9200905932366),
 ('gearbox', 881.8225039247454),
 ('fuelType', 363.90425072159144),
 ('bodyType', 189.60271012069165),
 ('city', 44.949751205222555),
 ('power', 28.553901616746646),
 ('brand_price_median', 0.5103728134080039),
 ('brand_price_std', 0.450363470926374),
 ('brand_amount', 0.1488112039506524),
 ('brand_price_max', 0.003191018670311645),
 ('SaleID', 5.355989919856515e-05),
 ('train', -1.0244548320770264e-07),
 ('offerType', -2.930755726993084e-07),
 ('seller', -2.7147470973432064e-06),
 ('brand_price_sum', -2.175006868187502e-05),
 ('name', -0.00029800127130996705),
 ('used_time', -0.0025158943328600102),
 ('brand_price_average', -0.40490484510127067),
 ('brand_price_min', -2.246775348689046),
 ('power_bin', -34.42064411722464),
 ('v_14', -274.7841180775971),
 ('kilometer', -372.8975266606936),
 ('notRepairedDamage', -495.19038446280786),
 ('v_0', -2045.0549573554758),
 ('v_5', -11022.98624049396),
 ('v_4', -15121.731109856253),
 ('v_2', -26098.299920522953),
 ('v_1', -45556.18929727541)]
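On recent scikit-learn releases (1.2 and later) LinearRegression no longer accepts the normalize argument used above. One possible substitute, sketched here on synthetic data, is to standardize the features inside a Pipeline. The data, the feature scales, and the variable names below are illustrative assumptions, not part of the original workflow. For ordinary least squares, rescaling the features leaves the predictions unchanged and only changes the scale of the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.1, size=200)

plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Ordinary least squares is invariant to feature scaling in its predictions
same_predictions = np.allclose(plain.predict(X), scaled.predict(X))
print(same_predictions)

# Coefficients of the scaled model are per standard deviation of each feature
print(scaled.named_steps["linearregression"].coef_)
```

StandardScaler is not an exact replica of the old normalize option, which centered each column and divided by its L2 norm, but for an unregularized linear model either choice gives the same fitted predictions.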

3.2 Inspecting the fit and adjusting accordingly

from matplotlib import pyplot as plt

subsample_index = np.random.randint(low = 0, high = len(y_train), size = 50)

# Scatter plot of feature v_9 against the label
plt.scatter(X_train['v_9'][subsample_index], \
            y_train[subsample_index], color = 'black')
plt.scatter(X_train['v_9'][subsample_index], \
            model.predict(X_train.loc[subsample_index]), color = 'blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc = 'upper right')
plt.show()

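The plot shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions are below 0. This indicates a problem with the model. The plots below show that the label (price) has a long-tailed distribution, which hurts the fit. Many models assume that the error term is normally distributed, and long-tailed data violates that assumption, so the results are poor [1].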

import seaborn as sns

plt.figure(figsize = (15, 5))
plt.subplot(1, 2, 1)
sns.distplot(y_train)
plt.subplot(1, 2, 2)
sns.distplot(y_train[y_train < np.quantile(y_train, 0.9)])

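A common remedy for a long-tailed label is to take its logarithm, which brings the distribution closer to normal. We therefore do the following: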

import seaborn as sns

y_train_ln = np.log(y_train + 1)

plt.figure(figsize = (15, 5))
plt.subplot(1, 2, 1)
sns.distplot(y_train_ln)
plt.subplot(1, 2, 2)
sns.distplot(y_train_ln[y_train_ln < np.quantile(y_train_ln, 0.9)])
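The effect of the log transform can also be checked numerically rather than only by eye. The sketch below uses a synthetic log-normal sample as a stand-in for price (an assumption, not the real data) and a hand-written skewness helper. It also shows that np.log1p and np.expm1 are exact inverses of each other, which matters when predictions made on log(price + 1) are mapped back to prices.

```python
import numpy as np

def skewness(x):
    # Sample skewness: third central moment divided by std cubed
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.std(x) ** 3)

# Synthetic long-tailed "prices" (hypothetical stand-in for the label)
rng = np.random.default_rng(0)
price = np.exp(rng.normal(loc=8.0, scale=1.0, size=10000))

price_ln = np.log1p(price)      # same as np.log(price + 1)
skew_raw = skewness(price)      # strongly right-skewed
skew_log = skewness(price_ln)   # close to symmetric

# expm1 undoes log1p, so predictions in log space map back to prices
round_trip_ok = np.allclose(np.expm1(price_ln), price)
print(skew_raw, skew_log, round_trip_ok)
```

A skewness near 0 after the transform is a quick sign that the label is much closer to symmetric, although it does not by itself prove normality.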

Now fit the simple linear regression again:

model = lr.fit(X_train, y_train_ln)

# Inspect the intercept and weights (coef) of the fitted linear regression model
print('intercept:'+ str(model.intercept_))

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), \
       key = lambda x: x[1], reverse = True)

intercept:18.7507494655777
[('v_9', 8.05240990056729),
 ('v_5', 5.764236596650283),
 ('v_12', 1.6182081236785628),
 ('v_1', 1.479831058294811),
 ('v_11', 1.1669016563620904),
 ('v_13', 0.9404711296031402),
 ('v_7', 0.7137273083560264),
 ('v_3', 0.6837875771077782),
 ('v_0', 0.008500518010120259),
 ('power_bin', 0.008497969302890544),
 ('gearbox', 0.007922377278338628),
 ('fuelType', 0.006684769706828798),
 ('bodyType', 0.004523520092704174),
 ('power', 0.0007161894205360409),
 ('brand_price_min', 3.334351114746047e-05),
 ('brand_amount', 2.897879704277868e-06),
 ('brand_price_median', 1.2571172872993166e-06),
 ('brand_price_std', 6.659176363432616e-07),
 ('brand_price_max', 6.194956307517354e-07),
 ('brand_price_average', 5.999345965082222e-07),
 ('SaleID', 2.1194170039651024e-08),
 ('seller', 5.696421112588723e-11),
 ('offerType', 4.128253294766182e-11),
 ('train', -5.6274984672199935e-12),
 ('brand_price_sum', -1.5126504215930465e-10),
 ('name', -7.015512588874946e-08),
 ('used_time', -4.122479372351641e-06),
 ('city', -0.0022187824810422163),
 ('v_14', -0.004234223418102942),
 ('kilometer', -0.01383586622688452),
 ('notRepairedDamage', -0.2702794234984635),
 ('v_4', -0.8315701200993081),
 ('v_2', -0.9470842241623765),
 ('v_10', -1.6261466689794903),
 ('v_8', -40.34300748761742),
 ('v_6', -238.79036385506777)]

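Now check the fit again.

# Scatter plot of feature v_9 against the label
plt.scatter(X_train['v_9'][subsample_index], \
            y_train[subsample_index], color = 'black')
# The model was fit on log(price + 1), so invert with exp(pred) - 1
plt.scatter(X_train['v_9'][subsample_index], \
            np.expm1(model.predict(X_train.loc[subsample_index])), color = 'blue')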
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc = 'upper right')

plt.show()

This time there are almost no predictions below 0.

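3.3 K-fold cross-validation

When training a model, the available data is commonly split into three parts (as with the MNIST handwritten digit dataset): a training set, a validation set, and a test set. The split exists to keep the evaluation honest. The test set is the easiest to understand: it takes no part in training and is used only to measure final performance.

In practice, a trained model usually fits the training set well (and can be sensitive to initial conditions), but fits data outside the training set less satisfactorily. We therefore do not train on all of the data. Instead we hold out a portion that takes no part in training and use it to test the fitted parameters, which gives a relatively objective estimate of how well they match unseen data. In K-fold cross-validation the data is split into K folds and each fold serves once as the held-out portion. This is the idea behind cross-validation.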
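The hold-out idea described above can be written out fold by fold. This is a minimal sketch on synthetic data (the data and coefficients are assumptions for illustration), using KFold from sklearn.model_selection to produce the index splits and fitting a fresh model on each training part.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical, not the used-car dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

fold_mae = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X):
    # Fit only on the training folds, score only on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

mean_mae = float(np.mean(fold_mae))
print(fold_mae, mean_mae)
```

Each sample is used for validation exactly once, so the averaged score uses all of the data without ever scoring a model on rows it was trained on.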

Cross-validation can be done with cross_val_score from sklearn.model_selection; see [2] for usage. For the design of its scoring parameter, see [3] and [4].

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import make_scorer

# Custom scoring function
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

# Cross-validate on both the raw labels and the log-transformed labels
scores1 = cross_val_score(model, X = X_train, y = y_train, verbose = 1, cv = 5, \
                         scoring = make_scorer(log_transfer(mean_absolute_error)))
scores2 = cross_val_score(model, X = X_train, y = y_train_ln, verbose = 1, cv = 5, \
                         scoring = make_scorer(log_transfer(mean_absolute_error)))

print('AVG1: ', np.mean(scores1))
print('AVG2: ', np.mean(scores2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
AVG1:  1.365802392031425
AVG2:  0.024865475238762588
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished

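The average scores suggest that the MAE after the log transform is much smaller than before, so the fit is better. Note, however, that the log_transfer scorer is also applied to y_train_ln, which is already log-transformed, so AVG2 is computed on a doubly logged scale and the two numbers are not directly comparable. Scoring the second model with plain mean_absolute_error would give a like-for-like comparison.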
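A caveat about the comparison above: the log_transfer scorer takes the log of both arguments, and in the second call the target has already been log-transformed. The sketch below uses synthetic numbers (assumptions, not the competition data) to show how much a second log shrinks the MAE even when the underlying predictions are identical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for log(price + 1) and for predictions in log space
y_ln = rng.uniform(6.0, 10.0, size=1000)
yhat_ln = y_ln + rng.normal(scale=0.25, size=1000)

# MAE measured once, in log space
mae_log_space = float(np.mean(np.abs(y_ln - yhat_ln)))

# MAE after taking the log a second time, as a log-wrapping scorer would
mae_double_log = float(np.mean(np.abs(np.log(y_ln) - np.log(yhat_ln))))

print(mae_log_space, mae_double_log)
```

Because the derivative of log(x) is 1/x, the second log divides errors by roughly the size of the log-price, so a score computed this way looks several times smaller without the model being any better.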

The score of each fold can be viewed with:

# View the score of each fold
scores2 = pd.DataFrame(scores2.reshape(1,-1))
scores2.columns = ['cv' + str(x) for x in range(1, 6)]
scores2.index = ['MAE']
scores2

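3.4 Simulating a realistic business scenario

Since we cannot see the future, five-fold cross-validation can paint an unrealistic picture on time-dependent datasets. For example, using 2018 used-car prices to predict 2017 prices is clearly unreasonable. We can instead split the dataset in chronological order. In this example the rows are sorted by time from top to bottom, so we take the earliest 4/5 of the samples as the training set and the latest 1/5 as the validation set. The result differs little from five-fold cross-validation.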

import datetime

sample_feature = sample_feature.reset_index(drop = True)
split_point = len(sample_feature) // 5 * 4

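# Note: .loc slices include both endpoints, so row split_point ends up in both
# train and val; .iloc[:split_point] / .iloc[split_point:] would avoid the overlap
train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()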

X_train = train[continuous_feature_names]
y_train_ln = np.log(train['price'] + 1)
X_val = val[continuous_feature_names]
y_val_ln = np.log(val['price'] + 1)

model = lr.fit(X_train, y_train_ln)
mean_absolute_error(y_val_ln, model.predict(X_val))

The result is:

0.19577667270301014
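One detail of the chronological split is worth checking on a toy frame: label-based .loc slicing includes both endpoints, while position-based .iloc slicing excludes the stop position. The ten-row frame below is a made-up illustration, not the real data.

```python
import pandas as pd

# Hypothetical frame whose rows are already sorted by time
df = pd.DataFrame({"price": range(10)})
split_point = len(df) // 5 * 4  # first 4/5 of the rows

# Label-based slicing: the row labelled split_point lands in both parts
loc_train, loc_val = df.loc[:split_point], df.loc[split_point:]

# Position-based slicing: the two parts are disjoint
iloc_train, iloc_val = df.iloc[:split_point], df.iloc[split_point:]

loc_overlap = len(loc_train.index.intersection(loc_val.index))
iloc_overlap = len(iloc_train.index.intersection(iloc_val.index))
print(loc_overlap, iloc_overlap)
```

On a large dataset a single shared row barely moves the score, but a disjoint split keeps the validation set strictly later than the training set.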

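3.5 Plotting learning curves and validation curves

A learning curve plots the training and cross-validation scores for different training set sizes. It shows how the model performs on new data, which helps diagnose whether the model suffers from high variance or high bias and whether a larger training set would reduce overfitting [5].

A validation curve is a useful tool for improving model performance by locating problems such as overfitting and underfitting. It plots the score against the value of a model hyperparameter [5][6].

Here is one way to draw a learning curve; for validation curves, see [6].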

from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve

def plot_learning_curve(estimator, title, X, y, ylim = None, cv = None, \
                        n_jobs = 1, train_size = np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = \
    learning_curve(estimator, X, y, cv = cv, n_jobs = n_jobs, \
                   train_sizes = train_size, \
                   scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis = 1)  
    train_scores_std = np.std(train_scores, axis = 1)  
    test_scores_mean = np.mean(test_scores, axis = 1)  
    test_scores_std = np.std(test_scores, axis = 1)  
    plt.grid()  # the shaded bands drawn below show mean ± 1 std
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha = 0.1,  
                     color = "r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha = 0.1,  
                     color = "g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color = 'r',  
             label = "Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color = "g",  
             label = "Cross-validation score")  
    plt.legend(loc = "best")  
    return plt  

plot_learning_curve(LinearRegression(), 'Linear_model', \
                    X_train[:1000], y_train_ln[:1000], ylim = (0.0, 0.5), 
                    cv = 5, n_jobs = 1)
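For completeness, here is a small validation-curve sketch using validation_curve from sklearn.model_selection. It is an illustration on synthetic data with a Ridge model and its alpha hyperparameter; the data, the model choice, and the alpha range are all assumptions rather than part of the original analysis. Plotting is left out so the block runs without a display; the two MAE arrays are what one would plot against alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic data (hypothetical): three informative features, two irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=120)

alphas = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y,
    param_name="alpha", param_range=alphas,
    cv=5, scoring="neg_mean_absolute_error",
)

# Scores are negated MAE, one row per alpha and one column per fold
train_mae = -train_scores.mean(axis=1)
val_mae = -val_scores.mean(axis=1)
best_alpha = alphas[np.argmin(val_mae)]
print(best_alpha, val_mae)
```

A validation MAE that rises sharply at large alpha indicates underfitting from too much regularization, while a wide gap between training and validation MAE at small alpha would point to overfitting.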

4. Comparing Multiple Models

4.1 Preprocessing

train = sample_feature[continuous_feature_names + ['price']].dropna()

X_train = train[continuous_feature_names]
y_train = train['price']
y_train_ln = np.log(y_train + 1)

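4.2 Linear models and embedded feature selection

Plain linear models overfit easily, so a regularization term (also called a penalty term) is introduced to reduce this. The two main choices are L1 and L2 regularization. See [7][8][9] for background.

The following code compares plain linear regression, Lasso regression (L1 regularization), and Ridge regression (L2 regularization).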

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X = X_train, y = y_train_ln, verbose = 0, \
                             cv = 5, scoring = make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

# Compare the results
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

LinearRegression is finished
Ridge is finished
Lasso is finished

Visualize the coefficients of the three linear models.

fig, axes = plt.subplots(3, figsize = (20, 30))
model = LinearRegression().fit(X_train, y_train_ln)

print('intercept:'+ str(model.intercept_))

sns.barplot(x = abs(model.coef_), y = continuous_feature_names, ax = axes[0])

model = Ridge().fit(X_train, y_train_ln)

print('intercept:'+ str(model.intercept_))

sns.barplot(x = abs(model.coef_), y = continuous_feature_names, ax = axes[1])

model = Lasso().fit(X_train, y_train_ln)

print('intercept:'+ str(model.intercept_))

sns.barplot(x = abs(model.coef_), y = continuous_feature_names, ax = axes[2])

intercept:18.750749465547507
intercept:4.671709787675356
intercept:8.672182462666198

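From the plots we can see:

L2 regularization tends to push the weights toward small values during fitting, producing a model in which all parameters are fairly small. Models with small parameters are generally considered simpler and better able to adapt to different datasets, which also helps avoid overfitting to some extent. Consider a linear regression equation: if a parameter is very large, a slight shift in the data changes the result a lot. If the parameters are small enough, even a larger shift has little effect on the result. In more technical terms, the model is robust to perturbations.

L1 regularization tends to produce a sparse weight vector: after applying it, many features end up with a coefficient of exactly 0. From a feature-selection standpoint, the features with nonzero coefficients are usually the important ones and should be kept. In the third plot, for example, power and used_time stand out as very important. This is why regularization can serve as a feature-selection tool.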
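The contrast between L1 and L2 can be reproduced on synthetic data where the truly relevant features are known. This is a minimal sketch; the data, the true coefficients, and the alpha values are assumptions chosen for illustration. Only the first two of ten features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical): features 0 and 1 matter, the other 8 do not
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 sets irrelevant coefficients to exactly zero; L2 only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
selected = np.flatnonzero(lasso.coef_)
print(n_zero_lasso, n_zero_ridge, selected)
```

The indices left in selected are the features the Lasso keeps, which is the sense in which it acts as an embedded feature selector. How many coefficients are zeroed depends on alpha and on the feature scales, so features are usually standardized first.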

4.3 Nonlinear models

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver = 'lbfgs', max_iter = 100), 
          XGBRegressor(n_estimators = 100, objective = 'reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X = X_train, y = y_train_ln, \
                             verbose = 0, cv = 5, \
                             scoring = make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

Random forest and XGBoost both turn out to perform well.

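5. Hyperparameter Tuning (Using LightGBM as an Example)

The tuning methods covered here are greedy tuning [10], grid search [11], and Bayesian tuning [12].

5.1 Greedy tuning

The core idea of greedy tuning is to start with one hyperparameter, train models over its candidate values, and keep the best one. With that value fixed, the next hyperparameter is tuned in the same way. Each step keeps the best value found so far, until every hyperparameter has been tuned.

Because hyperparameters are not independent of one another, what greedy tuning ends up with is in general a local optimum.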

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

# Greedy tuning procedure
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective = obj)
    score = np.mean(cross_val_score(model, X = X_train, y = y_train_ln, \
                                    verbose = 0, cv = 5, \
                                    scoring = make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective = min(best_obj.items(), \
                                          key = lambda x:x[1])[0], \
                          num_leaves = leaves)
    score = np.mean(cross_val_score(model, X = X_train, y = y_train_ln, \
                                    verbose = 0, cv = 5, \
                                    scoring = make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective = min(best_obj.items(), \
                                          key = lambda x:x[1])[0],
                          num_leaves = min(best_leaves.items(), \
                                           key = lambda x:x[1])[0],
                          max_depth = depth)
    score = np.mean(cross_val_score(model, X = X_train, y = y_train_ln, \
                                    verbose = 0, cv = 5, \
                                    scoring = make_scorer(mean_absolute_error)))
    best_depth[depth] = score

sns.lineplot(x = ['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], \
             y = [0.143, min(best_obj.values()), \
                  min(best_leaves.values()), min(best_depth.values())])

Although the solution found by greedy tuning is only a local optimum, in most cases the resulting hyperparameters are still reasonably good.
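Why greedy tuning can stop at a local optimum is easiest to see on a tiny made-up example. The loss table below is entirely hypothetical: two hyperparameters a and b interact, so the best value of a when b is at its default is not the best value of a overall.

```python
from itertools import product

# Hypothetical validation losses for every (a, b) combination
loss = {
    ("a0", "b0"): 0.30, ("a1", "b0"): 0.25,
    ("a0", "b1"): 0.10, ("a1", "b1"): 0.20,
}
a_values, b_values = ["a0", "a1"], ["b0", "b1"]

# Greedy: tune a with b at its default b0, then tune b with a frozen
best_a = min(a_values, key=lambda a: loss[(a, "b0")])
best_b = min(b_values, key=lambda b: loss[(best_a, b)])
greedy_choice = (best_a, best_b)

# Exhaustive: evaluate every combination
exhaustive_choice = min(product(a_values, b_values), key=lambda ab: loss[ab])

print(greedy_choice, loss[greedy_choice])
print(exhaustive_choice, loss[exhaustive_choice])
```

Greedy tuning evaluated three combinations and settled on a loss of 0.20, while trying all four finds 0.10. The saving in evaluations grows quickly with the number of hyperparameters, which is the trade-off being made.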

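5.2 Grid search

If greedy tuning finds a local optimum, grid search finds the global optimum, at least over the grid being searched. It is a brute-force method: every combination of hyperparameter values is tried in turn, and the best combination is kept. The obvious cost is that the more combinations there are, the longer the program runs.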
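The cost of trying every combination can be estimated before running anything. The sketch below reuses the candidate lists from the greedy section and simply counts combinations with itertools.product; the figure of 5 folds is taken from the cv = 5 setting used throughout this article.

```python
from itertools import product

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]

# Every (objective, num_leaves, max_depth) combination in the grid
combos = list(product(objective, num_leaves, max_depth))

# Each combination is trained once per cross-validation fold
n_folds = 5
n_fits = len(combos) * n_folds
print(len(combos), n_fits)
```

Adding one more hyperparameter with seven candidate values would multiply both numbers by seven, which is why grid search is usually restricted to a small, carefully chosen grid.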

from sklearn.model_selection import GridSearchCV

parameters = {'objective': objective , 'num_leaves': num_leaves, \
              'max_depth': max_depth}
model = LGBMRegressor()
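# Note: GridSearchCV uses the estimator's default score here (R^2 for regressors),
# and the fit below targets the raw price y_train rather than y_train_ln
clf = GridSearchCV(model, parameters, cv = 5)
clf = clf.fit(X_train, y_train)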

clf.best_params_

The output is:

{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}

Build a model with these parameters and check the result.

model = LGBMRegressor(objective = 'regression',
                          num_leaves = 55,
                          max_depth = 15)

np.mean(cross_val_score(model, X = X_train, y = y_train_ln, verbose = 1, \
                        cv = 5, scoring = make_scorer(mean_absolute_error)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.6s finished
0.13754833106731224

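5.3 Bayesian tuning

Bayesian optimization builds a surrogate function (a probabilistic model) from past evaluations of the objective and uses it to find the value that optimizes the objective. Unlike random or grid search, it takes earlier evaluation results into account when choosing the next set of hyperparameters, which avoids a lot of wasted effort.

Evaluating a set of hyperparameters is expensive because it requires training the model with them, and many models take hours or even days to train. Bayesian tuning keeps updating its probabilistic model and uses past results to concentrate on promising hyperparameters.

Compared with grid search, Bayesian tuning saves a great deal of time.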
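The loop described above (fit a surrogate to past evaluations, then pick the next point to try) can be sketched in a few lines. This is an illustrative toy, not the implementation of any particular library: it assumes a one-dimensional search space, a made-up objective standing in for the cross-validated score, a Gaussian process from scikit-learn as the surrogate, and an upper-confidence-bound rule (one common choice) for selecting the next point.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Hypothetical stand-in for an expensive score to maximize; peak at x = 2
    return -(x - 2.0) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 201).reshape(-1, 1)

# Start from a few random evaluations
X_obs = rng.uniform(0.0, 5.0, size=(3, 1))
y_obs = objective(X_obs).ravel()
initial_best = y_obs.max()

for _ in range(10):
    # Surrogate: a probabilistic model fitted to all evaluations so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True)
    gp.fit(X_obs, y_obs)
    mean, std = gp.predict(candidates, return_std=True)

    # Upper confidence bound: favor a high predicted score or high uncertainty
    x_next = candidates[np.argmax(mean + 2.0 * std)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = float(X_obs[np.argmax(y_obs), 0])
best_y = float(y_obs.max())
print(best_x, best_y)
```

Each new evaluation is chosen using everything learned so far, which is what distinguishes this approach from a grid that is fixed in advance.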

from bayes_opt import BayesianOptimization

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X = X_train, y = y_train_ln, verbose = 0, cv = 5, \
        scoring = make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)

rf_bo.maximize()

Another benefit of Bayesian tuning is that it shows its progress as it runs. The log looks like this:

|   iter    |  target   | max_depth | min_ch... | num_le... | subsample |
-------------------------------------------------------------------------
|  1        |  0.8686   |  83.99    |  53.31    |  92.01    |  0.6151   |
|  2        |  0.8665   |  99.54    |  91.77    |  69.18    |  0.5601   |
|  3        |  0.8576   |  45.49    |  7.876    |  31.13    |  0.714    |
|  4        |  0.8664   |  93.0     |  24.48    |  67.35    |  0.418    |
|  5        |  0.8665   |  64.91    |  66.06    |  69.84    |  0.4307   |
|  6        |  0.8693   |  17.56    |  97.53    |  99.7     |  0.9283   |
|  7        |  0.869    |  89.01    |  3.675    |  99.55    |  0.9377   |
|  8        |  0.8694   |  99.41    |  97.79    |  98.36    |  0.3918   |
|  9        |  0.8694   |  96.45    |  88.98    |  99.87    |  0.2571   |
|  10       |  0.8692   |  87.19    |  99.33    |  99.8     |  0.856    |
|  11       |  0.8695   |  95.7     |  93.94    |  99.58    |  0.1425   |
|  12       |  0.8689   |  86.32    |  96.85    |  95.19    |  0.5916   |
|  13       |  0.8692   |  99.37    |  4.827    |  98.41    |  0.3677   |
|  14       |  0.8064   |  2.193    |  92.17    |  99.38    |  0.5856   |
|  15       |  0.7719   |  9.267    |  99.99    |  2.007    |  0.4641   |
|  16       |  0.7719   |  99.46    |  5.528    |  2.533    |  0.4402   |
|  17       |  0.8204   |  99.31    |  97.94    |  5.447    |  0.569    |
|  18       |  0.7719   |  4.896    |  5.204    |  2.488    |  0.1845   |
|  19       |  0.8688   |  18.72    |  2.612    |  92.06    |  0.273    |
|  20       |  0.8654   |  30.95    |  99.95    |  62.09    |  0.7186   |
|  21       |  0.867    |  47.56    |  2.204    |  74.38    |  0.3159   |
|  22       |  0.8694   |  39.7     |  75.17    |  99.66    |  0.274    |
|  23       |  0.802    |  53.63    |  57.14    |  3.16     |  0.1212   |
|  24       |  0.869    |  49.82    |  99.84    |  95.66    |  0.2684   |
|  25       |  0.8569   |  56.73    |  99.9     |  29.45    |  0.5984   |
|  26       |  0.8692   |  96.38    |  6.865    |  97.47    |  0.8463   |
|  27       |  0.8063   |  2.183    |  2.383    |  63.35    |  0.7351   |
|  28       |  0.8694   |  39.25    |  24.5     |  99.85    |  0.9749   |
|  29       |  0.869    |  32.1     |  2.07     |  99.53    |  0.8066   |
|  30       |  0.866    |  96.76    |  2.392    |  67.57    |  0.7835   |
=========================================================================

Now check the result after tuning.

1 - rf_bo.max['target']

0.13048469566813226

6. References

1. https://blog.csdn.net/Noob_daniel/article/details/76087829

2. https://blog.csdn.net/qq_36523839/article/details/80707678

3. https://blog.csdn.net/qq_32590631/article/details/82831613

4. https://www.cnblogs.com/harvey888/p/6964741.html

5. https://blog.csdn.net/qq_36523839/article/details/82556932

6. https://blog.csdn.net/qq_20412595/article/details/81771790

7. https://www.zhihu.com/question/32246256/answer/55320482

8. http://yangyingming.com/article/434

9. https://blog.csdn.net/jinping_shi/article/details/52433975

10. https://www.jianshu.com/p/ab89df9759c8

11. https://blog.csdn.net/weixin_43172660/article/details/83032029

12. https://blog.csdn.net/linxid/article/details/81189154
