Advanced Ensemble Learning: Study Notes

Latest recommended article published on 2024-04-07 13:17:24

一蓑烟雨紫洛

Category: Machine Learning. Tags: learning

Article link: https://blog.csdn.net/weixin_34280060/article/details/123575109

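Advanced Ensemble Learning

Understand how the XGBoost algorithm works

Understand the workflow of the Otto case study implemented with XGBoost

Understand how the LightGBM algorithm works

Understand the workflow of the PUBG case study implemented with LightGBM

Understand how the stacking algorithm works

Understand the workflow of the monthly housing rent prediction case implemented with stacking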

1. How XGBoost works

XGBoost (Extreme Gradient Boosting) is a gradient boosted tree method that performs very strongly on most regression and classification problems.

2. How the optimal model is built

[Figure not available]

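3. Application

Growing a decision tree corresponds to empirical risk minimization, and pruning it corresponds to structural risk minimization.

In XGBoost, tree growth itself is the result of structural risk minimization.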

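4. Deriving the XGBoost objective function

4.1 Objective function

The objective function is the loss function; the optimal model is built by minimizing it.

A regularization term representing model complexity should be added to the loss. Because an XGBoost model consists of multiple CART trees, the model's objective function is
$obj(\theta)=\sum_{i=1}^{n}L(y_i,\hat{y}_i)+\sum_{k=1}^{K}\Omega(f_k)$
which is the regularized loss function,

where $y_i$ is the true value and $\hat{y}_i$ is the model's output.

The first term on the right-hand side is the training error. The second is the regularization term, obtained by summing the regularization terms of the K trees.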

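4.2 Introduction to CART trees

[Figure not available]

The figure shows the k-th CART tree. Determining a CART tree requires determining two parts.

The first part is the tree structure, which maps an input sample to a specific leaf node; $q(x)$ denotes the index of the leaf that sample $x$ falls into.

The second part is the value of each leaf node; $w_{q(x)}$ denotes the value of the leaf with index $q(x)$.

The output of the tree is then defined as $f_k(x)=w_{q(x)}$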
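The two parts of a tree can be sketched in a few lines of Python. This is only an illustration: the features `age` and `is_male`, the thresholds, and the leaf values are made up and do not come from the original figure.

```python
# A hypothetical tree with three leaves on the made-up features "age" and "is_male".
def q(x):
    """Tree structure: map a sample to a leaf index (0, 1, or 2)."""
    if x["age"] < 15:
        return 0 if x["is_male"] else 1
    return 2

# Leaf values w_j, one per leaf index (illustrative numbers only).
w = [2.0, 0.1, -1.0]

def f_k(x):
    """Output of the tree: the value of the leaf the sample falls into."""
    return w[q(x)]

print(f_k({"age": 10, "is_male": True}))   # 2.0
print(f_k({"age": 40, "is_male": False}))  # -1.0
```

Keeping `q` and `w` separate mirrors the definition $f_k(x)=w_{q(x)}$: the structure decides which leaf a sample reaches, and the leaf values decide what is predicted there.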

4.3 Tree complexity

The complexity of each tree is defined as follows.

[Figure not available]
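The image with the definition is not available in this copy. As an assumption about what it showed, the form commonly given in the XGBoost paper and documentation is:

```latex
\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
```

Here $T$ is the number of leaves of the tree, $w_j$ is the value of leaf $j$, and $\gamma$ and $\lambda$ are penalty coefficients that correspond to the gamma and lambda parameters described in section 7.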

4.4 Derivation of the objective function

[Figure not available]
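The derivation image is missing, so the following is a sketch of the standard derivation from the XGBoost paper, not a reconstruction of the original figure. It assumes the usual complexity term $\Omega(f)=\gamma T+\frac{1}{2}\lambda\sum_j w_j^2$, writes $y_i$ for the true value and $\hat{y}_i^{(t-1)}$ for the prediction after $t-1$ trees, and introduces $g_i$, $h_i$, $G_j$, $H_j$ and $I_j$ as notation.

```latex
\begin{aligned}
obj^{(t)} &= \sum_{i=1}^{n} L\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t) + \text{const} \\
&\approx \sum_{i=1}^{n} \Bigl[ L\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Bigr] + \Omega(f_t) + \text{const},
\qquad g_i = \frac{\partial L}{\partial \hat{y}_i^{(t-1)}},\quad h_i = \frac{\partial^2 L}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^2} \\
&\approx \sum_{j=1}^{T} \Bigl[ G_j w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^2 \Bigr] + \gamma T + \text{const},
\qquad G_j = \sum_{i \in I_j} g_i,\quad H_j = \sum_{i \in I_j} h_i,\quad I_j = \{\, i : q(x_i) = j \,\} \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda},
\qquad obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
\end{aligned}
```

The second line is the second-order Taylor expansion of the loss around the previous prediction. The third line drops terms that do not depend on the new tree and groups samples by the leaf they fall into, using $f_t(x_i)=w_{q(x_i)}$. Each leaf then contributes a quadratic in $w_j$, which is minimized in closed form in the last line.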

5. How XGBoost builds a regression tree

5.1 Computing the split node

5.2 Checking the stopping conditions

[Figure not available]
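Since the figure is missing, here is a minimal sketch of how a candidate split is usually scored, assuming the gain formula from the XGBoost paper: half the sum of the two child scores minus the parent score, minus gamma. The helper names `leaf_score` and `split_gain` and the toy data are hypothetical.

```python
def leaf_score(G, H, lam):
    """Score G^2 / (H + lambda) of a leaf with gradient sum G and hessian sum H."""
    return G * G / (H + lam)

def split_gain(g, h, left_idx, lam=1.0, gamma=0.0):
    """Gain from splitting a node into the samples in left_idx and the rest."""
    left = set(left_idx)
    G_L = sum(g[i] for i in left)
    H_L = sum(h[i] for i in left)
    G_R = sum(g) - G_L
    H_R = sum(h) - H_L
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    return 0.5 * (children - parent) - gamma

# Squared-error loss 1/2 * (y - y_hat)^2 with all predictions at 0:
# g = y_hat - y and h = 1 for every sample.
y = [1.0, 1.0, -1.0, -1.0]
g = [0.0 - yi for yi in y]
h = [1.0] * len(y)

print(split_gain(g, h, left_idx=[0, 1]))             # positive: worth splitting
print(split_gain(g, h, left_idx=[0, 1], gamma=2.0))  # negative: gamma blocks the split
```

A common stopping rule is to stop splitting a node when the best gain is not positive, which is how gamma acts as a minimum required loss reduction. Limits such as max_depth and min_child_weight from section 7 are also typically used as stopping conditions.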

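6. Differences between XGBoost and GBDT

Difference 1:

XGBoost takes tree complexity into account when generating CART trees.

GBDT does not; it considers tree complexity only in the pruning step.

Difference 2:

XGBoost fits a second-order Taylor expansion of the previous round's loss function, while GBDT fits a first-order expansion. As a result XGBoost tends to be more accurate and needs fewer iterations to reach the same training performance.

Difference 3:

Both XGBoost and GBDT improve the model through successive iterations, but XGBoost can use multiple threads when searching for the best split point, which greatly speeds up training.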
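To make Difference 2 concrete, the sketch below computes the first and second derivatives of the log loss for one sample. It assumes binary labels in {0, 1} and a raw score that is passed through a sigmoid; the function name `logistic_grad_hess` is hypothetical.

```python
import math

def logistic_grad_hess(margin, y):
    """First and second derivatives of the log loss w.r.t. the raw score.

    margin: current raw prediction (before the sigmoid); y: label, 0 or 1.
    """
    p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability
    g = p - y                            # first derivative
    h = p * (1.0 - p)                    # second derivative
    return g, h

g, h = logistic_grad_hess(0.0, 1)
print(g, h)  # -0.5 0.25
```

A first-order method uses only `g` to decide how to correct the prediction, while a second-order method also uses `h`, the curvature of the loss, to scale each correction.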

7. XGBoost API and parameters

7.1 Installation

pip3 install xgboost

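7.2 XGBoost parameters

XGBoost exposes many parameters, which fall into three groups: general parameters, booster parameters, and learning task parameters.

General parameters: control the overall behavior.

Booster parameters: depend on the chosen booster and control the booster (tree or regression) at each step.

Learning task parameters: control the training objective and how it is evaluated.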

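7.2.1 General parameters

1. booster [default=gbtree]

Selects which booster to use: gbtree, gblinear, or dart.

gbtree and dart use tree-based models (dart adds dropout), while gblinear uses a linear function.

2. silent [default=0]

0 prints running messages; 1 enables silent mode and prints nothing. Recent XGBoost releases replace this parameter with verbosity.

3. nthread [default=maximum number of threads available]

Number of parallel threads used to run XGBoost. The value should be <= the number of CPU cores; if it is not set, XGBoost detects the core count and uses all cores.

The following two parameters do not need to be set; the defaults are fine.

1. num_pbuffer [set automatically by XGBoost, no need to set it manually]

Size of the prediction buffer, normally set to the number of training instances. The buffer stores the prediction results of the last boosting step.

2. num_feature [set automatically by XGBoost, no need to set it manually]

Feature dimension used in boosting, set to the maximum feature dimension.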

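7.2.2 Booster parameters

7.2.2.1 Parameters for Tree Booster

1. eta [default=0.3, alias: learning_rate]

Step size shrinkage applied in each update to prevent overfitting.

After each boosting step the weights of the new features can be obtained directly; eta shrinks these weights, which makes the boosting process more robust.

Range: [0,1]

2. gamma [default=0, alias: min_split_loss] (minimum loss reduction for a split)

A node is split only if the split reduces the value of the loss function.

gamma specifies the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm. The appropriate value depends closely on the loss function, so it needs to be tuned.

3. max_depth [default=6]

Maximum depth of a tree. Limiting it helps avoid overfitting: the larger max_depth is, the more specific and local the patterns the model learns. 0 means no limit. Range: [0,∞)

4. min_child_weight [default=1]

Minimum sum of sample weights required in a leaf node.

A larger value keeps the model from learning patterns specific to a few local samples, but a value that is too high leads to underfitting. Tune this parameter with cross-validation. Range: [0,∞)

5. subsample [default=1]

Fraction of the training samples randomly drawn for each tree.

Lowering this value makes the algorithm more conservative and helps avoid overfitting, but a value that is too small may lead to underfitting.

Typical values: 0.5-1. A value of 0.5 means half of the training samples are drawn at random for each tree.

Range: (0,1]

6. colsample_bytree [default=1]

Fraction of columns randomly sampled for each tree (each column is one feature). Typical values: 0.5-1. Range: (0,1]

7. colsample_bylevel [default=1]

Fraction of columns sampled at each level of the tree. Range: (0,1]

8. lambda [default=1, alias: reg_lambda]

L2 regularization term on the weights (similar to Ridge regression).

This parameter controls the regularization part of the objective.

9. alpha [default=0, alias: reg_alpha]

L1 regularization term on the weights (similar to Lasso regression). It can be useful with high-dimensional data, where it makes the algorithm faster.

10. scale_pos_weight [default=1]

When the classes are highly imbalanced, setting this parameter to a positive value helps the algorithm converge faster. A common choice is the ratio of the number of negative samples to the number of positive samples.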

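7.2.2.2 Parameters for Linear Booster

1. lambda [default=0, alias: reg_lambda]

L2 regularization penalty coefficient. Increasing it makes the model more conservative.

This parameter controls the regularization part of the objective.

2. alpha [default=0, alias: reg_alpha]

L1 regularization penalty coefficient. Increasing it makes the model more conservative.

3. lambda_bias [default=0, alias: reg_lambda_bias]

L2 regularization on the bias term.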

7.2.2.3 Learning task parameters

[Figure not available]
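To show how the three parameter groups fit together, the sketch below assembles one parameter dictionary for a binary classification task. The helper `build_params` and the specific values are illustrative assumptions, not recommendations; `objective` and `eval_metric` are two learning task parameters from the XGBoost documentation, chosen here because the original table is not available.

```python
def build_params(labels, n_threads=4):
    """Assemble an illustrative XGBoost parameter dict for binary labels (0/1).

    Assumes labels contains at least one positive sample.
    """
    n_pos = sum(1 for v in labels if v == 1)
    n_neg = sum(1 for v in labels if v == 0)
    return {
        # General parameters
        "booster": "gbtree",
        "nthread": n_threads,
        # Tree booster parameters
        "eta": 0.1,
        "gamma": 0,
        "max_depth": 6,
        "min_child_weight": 1,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "lambda": 1,
        "alpha": 0,
        # Ratio of negative to positive samples, as suggested for imbalanced data
        "scale_pos_weight": n_neg / n_pos,
        # Learning task parameters
        "objective": "binary:logistic",
        "eval_metric": "logloss",
    }

params = build_params([0, 0, 0, 1])
print(params["scale_pos_weight"])  # 3.0
```

A dictionary like this is what the native API expects as the first argument of `xgboost.train`, alongside a `DMatrix` holding the training data. With the scikit-learn wrapper used in the next section, the same settings are passed as keyword arguments instead.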

8. A simple XGBoost example

Train XGBoost on the Titanic data.

# XGBoost on the Titanic dataset

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# 1) Load the data
titanic = pd.read_csv('D:\\data\\titanic1.csv')
titanic.describe()
# .copy() gives an independent frame, so filling missing values below
# does not trigger pandas' SettingWithCopyWarning
x = titanic[["pclass", "age", "sex"]].copy()
y = titanic["survived"]
x.head()
y.head()

# 2) Data preprocessing: fill missing ages with the mean age
x["age"] = x["age"].fillna(titanic["age"].mean())
x.head()


# 4) Split the dataset

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# Dictionary feature extraction (one-hot encodes string-valued columns such as sex)
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))

# Train and evaluate an XGBoost model with default parameters
from xgboost import XGBClassifier
xg = XGBClassifier()

xg.fit(x_train, y_train)
xg.score(x_test, y_test)

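# Tune max_depth: train one model per depth and record the test accuracy
depth_range = range(1, 11)
score = []
for i in depth_range:
    # Use the loop variable as the depth and fit on the training labels
    xg = XGBClassifier(learning_rate=1, gamma=0, max_depth=i)
    xg.fit(x_train, y_train)
    s = xg.score(x_test, y_test)
    print(s)
    score.append(s)

# Plot accuracy against max_depth
import matplotlib.pyplot as plt
plt.plot(depth_range, score)
plt.show()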

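# Add grid search with cross-validation
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values
param_dict = {"n_estimators": [120,200,300,500,800,1200], "max_depth": [5,8,15,25,30]}
# Wrap a fresh XGBClassifier; 3-fold cross-validation over every combination
estimator = GridSearchCV(XGBClassifier(), param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)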

# 5) Model evaluation
# Method 1: compare true and predicted values directly
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
# .values.ravel() flattens the labels so the comparison is element-wise
print("True vs. predicted:\n", y_test.values.ravel() == y_predict)

# Method 2: compute the accuracy
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

# Best parameters: best_params_
print("Best parameters:\n", estimator.best_params_)
# Best cross-validation score: best_score_
print("Best score:\n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator:\n", estimator.best_estimator_)
# Cross-validation results: cv_results_
print("Cross-validation results:\n", estimator.cv_results_)

9. Otto case study with XGBoost

