机器学习-5.0模型选择

最新推荐文章于 2022-12-05 18:52:25 发布

架构菜芽

最新推荐文章于 2022-12-05 18:52:25 发布

阅读量423

点赞数

分类专栏：机器学习-模型流程文章标签：机器学习

本文链接：https://blog.csdn.net/weixin_41175904/article/details/115502872

版权

机器学习-模型流程专栏收录该内容

7 篇文章 1 订阅

订阅专栏

随机森林-RF模型（bagging）
随机森林就是通过集成学习的思想将多棵树集成的一种算法，它的基本单元是决策树
AdaBoost模型（基于权值）
AdaBoost，是英文"Adaptive Boosting"（自适应增强）的缩写。
（1）前一个基本分类器分错的样本会得到加权，加权后的全体样本再次被用来训练下一个新的基本分类器。
（2）在每一轮中加入一个新的弱分类器，直到达到某个预定的足够小的错误率或达到预先指定的最大迭代次数，
（3）各个弱分类器的训练过程结束后，误差率低的弱分类器在最终分类器中占的权重较大，否则较小
GBDT模型（基于残差）
与AdaBoost不同，GBDT每一次的计算是都为了减少上一次的残差，进而在残差减少（负梯度）的方向上建立一个新的模型。

GBDT的主要优点：
　　1）可以灵活的处理各种类型的数据
　　2）预测的准确率高
　　3）使用了一些健壮的损失函数，如huber，可以很好的处理异常值
GBDT的缺点：
1）由于基学习器之间的依赖关系，难以并行化处理，
XGBoost模型（boosting）
与GBDT不同
第一，GBDT将目标函数泰勒展开到一阶，而xgboost将目标函数泰勒展开到了二阶。保留了更多有关目标函数的信息，对提升效果有帮助。
第二，GBDT是给新的基模型寻找新的拟合标签（前面加法模型的负梯度），而xgboost是给新的基模型寻找新的目标函数（目标函数关于新的基模型的二阶泰勒展开）。
第三，xgboost加入了和叶子权重的L2正则化项，因而有利于模型获得更低的方差。
第四，xgboost增加了自动处理缺失值特征的策略。通过把带缺失值样本分别划分到左子树或者右子树，比较两种方案下目标函数的优劣，从而自动对有缺失值的样本进行划分，无需对缺失特征进行填充预处理
第五，引进了特征子采样，这种方法既能降低过拟合，还能减少计算。
第六，在寻找最佳分割点时，考虑到传统的贪心算法效率较低，实现了一种近似贪心算法
第七，用来加速和减小内存消耗，除此之外还考虑了稀疏数据集和缺失值的处理，对于特征的值有缺失的样本，XGBoost依然能自动找到其要分裂的方向。
第八，XGBoost支持并行处理，XGBoost的并行不是在模型上的并行，而是在特征上的并行，将特征列排序后以block的形式存储在内存中，在后面的迭代中重复使用这个结构。这个block也使得并行化成为了可能，其次在进行节点分裂时，计算每个特征的增益，最终选择增益最大的那个特征去做分割，那么各个特征的增益计算就可以开多线程进行
LightGBM模型（boosting）

第一，XGBoost使用基于预排序的决策树算法，每遍历一个特征就需要计算一次特征的增益，时间复杂度为O(datafeature)。而LightGBM使用基于直方图的决策树算法，直方图的优化算法只需要计算K次，时间复杂度为O(Kfeature)
第二，XGBoost使用按层生长(level-wise)的决策树生长策略，LightGBM则采用带有深度限制的按叶子节点(leaf-wise)算法。在分裂次数相同的情况下，leaf-wise可以降低更多的误差，得到更好的精度。leaf-wise的缺点在于会产生较深的决策树，产生过拟合。
第三，支持类别特征，不需要进行独热编码处理
第四，优化了特征并行和数据并行算法，除此之外还添加了投票并行方案
第五，采用基于梯度的单边采样来保持数据分布，减少模型因数据分布发生变化而造成的模型精度下降
第六，特征捆绑转化为图着色问题，减少特征数量

3、代码演示

## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8, \
                       colsample_bytree=0.9, max_depth=7)  # ,objective ='reg:squarederror'
scores_train = []
scores = []
# 5折交叉验证方式
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# 设置测试集，输出矩阵。每一组数据输出：[0,0,0,0]以概率值填入
test = np.zeros((X_test.shape[0], 1))
print("test", test)
for train_ind, val_ind in sk.split(X_data, Y_data):
    print('************************************ {} ************************************'.format(str(sk.n_splits)))
    train_x = X_data.iloc[train_ind].values
    train_y = Y_data.iloc[train_ind]

    val_x = X_data.iloc[val_ind].values
    val_y = Y_data.iloc[val_ind]

    xgr.fit(train_x, train_y)
    pred_train_xgb = xgr.predict(train_x)
    pred_xgb = xgr.predict(val_x)
    pred_xgb = pred_xgb.reshape(-1, 1)
    #
    score_train = mean_absolute_error(train_y, pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y, pred_xgb)
    print("预测训练集的平均绝对误差评分", score)
    scores.append(score)

    # 预测
    X_test_pred = X_test.values
    pred_test = xgr.predict(X_test_pred)
    pred_test2 = pred_test.reshape(-1, 1)
    test += pred_test2

print('Train mae:', np.mean(score_train))
print('Val mae', np.mean(scores))
#预测值
sub_Weighted = test / sk.n_splits

"""使用lightgbm 5折交叉验证进行建模预测"""
import lightgbm as lgb
#交叉验证参数
kf = KFold(n_splits=5, shuffle=True, random_state=2021)
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************ {} ********'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'learning_rate': 0.1,
                'metric': 'auc',
        
                'min_child_weight': 1e-3,
                'num_leaves': 31,
                'max_depth': -1,
                'reg_lambda': 0,
                'reg_alpha': 0,
                'feature_fraction': 1,
                'bagging_fraction': 1,
                'bagging_freq': 0,
                'seed': 2020,
                'nthread': 8,
                'silent': True,
                'verbose': -1,
    }
    
    model = lgb.train(params, train_set=train_matrix, num_boost_round=20000, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))

"""
贪心
先使用当前对模型影响最大的参数进行调优，达到当前参数下的模型最优化，再使用对模型影响次之的参数进行调优，如此下去，直到所有的参数调整完毕。
这个方法的缺点就是可能会调到局部最优而不是全局最优，但是只需要一步一步的进行参数最优化调试即可，容易理解。
"""
#这里列举一下日常调参过程中常用的参数和调参顺序：
#①：max_depth、num_leaves
#②：min_data_in_leaf、min_child_weight
#③：bagging_fraction、 feature_fraction、bagging_freq
#④：reg_lambda、reg_alpha
#⑤：min_split_gain

from sklearn.model_selection import cross_val_score

# 调objective
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    """预测并计算roc的相关指标"""
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    best_obj[obj] = score
    
# num_leaves
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    """预测并计算roc的相关指标"""
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    best_leaves[leaves] = score
    
# max_depth
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    """预测并计算roc的相关指标"""
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    best_depth[depth] = score

"""
可依次将模型的参数通过上面的方式进行调整优化，并且通过可视化观察在每一个最优参数下模型的得分情况
"""

"""sklearn 提供GridSearchCV用于进行网格搜索，只需要把模型的参数输进去，就能给出最优化的结果和参数。
相比起贪心调参，网格搜索的结果会更优，但是网格搜索只适合于小数据集，一旦数据的量级上去了，很难得出结果。
通过网格搜索确定最优参数"""
from sklearn.model_selection import GridSearchCV

def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, 
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # 设置5折交叉验证
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
    
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs= 8
                                  )
    grid_search = GridSearchCV(estimator=model_lgb, 
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                              )
    grid_search.fit(X_train, y_train)

    print('模型当前最优参数为:{}'.format(grid_search.best_params_))
    print('模型当前最优得分为:{}'.format(grid_search.best_score_))