CatBoost: Principles, Parameter Reference, and a Python Example

qq_24591139

Category: Machine Learning. Tags: CatBoost, parameters, examples

Article link: https://blog.csdn.net/qq_24591139/article/details/100104812

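Introduction to CatBoost

Advantages:

1) It handles categorical features automatically, in its own way. It first computes statistics on the categorical features, counting how often each category occurs, and then combines these counts with hyperparameters to generate new numerical features. This is my main motivation for introducing the algorithm here: with CatBoost you no longer need to encode categorical features by hand.
2) CatBoost also uses combinations of categorical features, which lets it exploit relationships between features and greatly enriches the feature space.
3) CatBoost's base learners are symmetric trees, and leaf values are computed differently from traditional boosting algorithms: traditional boosting uses the mean, whereas CatBoost uses a different, optimized scheme. These changes all help prevent overfitting.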

I. Principles

References:
https://blog.csdn.net/appleyuchi/article/details/85413352
http://www.idataskys.com/2019/08/04/Boosting%E7%AE%97%E6%B3%95%E4%B9%8BCatBoost%E7%AE%97%E6%B3%95%E8%AF%A6%E8%A7%A3/

1. Categorical features

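Given a dataset, the simplest approach is to replace each categorical value with the average label computed over the whole dataset: the value assigned to a category is the mean of $y_i$ over all samples that have that category. Because some categories occur only a few times, the estimate needs smoothing:
$$\hat{x}^i_k = \frac{\sum_{j=1}^{n} [x^i_j = x^i_k] \cdot y_j + a P}{\sum_{j=1}^{n} [x^i_j = x^i_k] + a}$$
Here $x^i_k$ is the value of categorical feature $i$ for sample $k$, $[\cdot]$ equals 1 when the condition holds and 0 otherwise, $a > 0$ is a parameter, and $P$ is the prior term. For regression, the prior is usually the mean label of the dataset. For binary classification, the prior is the prior probability of the positive class.
Problem: the estimate for an individual sample is biased. Suppose a categorical variable has no repeated values, so that every sample has its own category. Then
$$\hat{x}^i_k = \frac{y_k + a P}{1 + a}$$
but this value is a deterministic function of the sample's own label $y_k$, so the encoded feature leaks the target during training.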

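Solution:
Estimate the statistic on a subset of the data that excludes the sample itself, rather than on the whole dataset.

CatBoost uses a more effective scheme: the TS (target statistic) value of a sample is computed only from the samples observed before it. A random permutation of the training set provides this artificial time order, and CatBoost uses different permutations for different trees.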
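
The difference between the whole-dataset statistic and the permutation-based one can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CatBoost's internal implementation: the function names `greedy_target_statistic` and `ordered_target_statistic`, the default `a=1.0`, and the toy data are all made up for this example.

```python
import numpy as np

def greedy_target_statistic(categories, labels, prior, a=1.0):
    """Smoothed mean label per category, computed on the whole dataset."""
    sums, counts = {}, {}
    for c, y in zip(categories, labels):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return np.array([(sums[c] + a * prior) / (counts[c] + a) for c in categories])

def ordered_target_statistic(categories, labels, prior, a=1.0, seed=0):
    """Same statistic, but each sample only sees the samples that precede it
    in a random permutation, so its own label is never used."""
    rng = np.random.default_rng(seed)
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for idx in rng.permutation(len(categories)):
        c = categories[idx]
        # Encode from the history only, then add this sample to the history.
        encoded[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + labels[idx]
        counts[c] = counts.get(c, 0) + 1
    return encoded

cats = ["a", "b", "c", "d"]  # hypothetical data: every category is unique
ys = [1, 0, 1, 0]
print(greedy_target_statistic(cats, ys, prior=0.5))   # values follow each sample's own label
print(ordered_target_statistic(cats, ys, prior=0.5))  # every value equals the prior
```

With all-unique categories the whole-dataset version reproduces the label pattern, while the ordered version falls back to the prior because no sample has any history for its category.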

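2. Feature combinations

For the first split of a tree, no combinations are considered. For each subsequent split, CatBoost combines all combinations and categorical features already used in the current tree with all categorical features in the dataset. The combinations are converted to numbers on the fly.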
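
As a rough illustration of what "combining" two categorical features means, the sketch below treats each tuple of values as one new category. The helper `combine_features` and the `city`/`device` columns are hypothetical names for this example; CatBoost builds such combinations internally and greedily during tree construction, which this snippet does not attempt to reproduce.

```python
def combine_features(*columns):
    """Treat each tuple of category values as a single new category."""
    return list(zip(*columns))

city = ["NY", "NY", "LA", "LA"]        # hypothetical categorical column
device = ["ios", "web", "ios", "ios"]  # hypothetical categorical column

combo = combine_features(city, device)
print(combo)            # one combined category per row
print(len(set(combo)))  # number of distinct combined categories
```

The combined column can then be turned into a number in the same way as any single categorical feature.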

3. Symmetric trees

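In a symmetric (oblivious) tree, every node on the same level applies the same split, so a tree of depth d is fully described by d (feature, threshold) pairs and 2^d leaf values. The sketch below shows how prediction reduces to building a d-bit leaf index. The function name `oblivious_tree_predict`, the splits, and the leaf values are assumptions made for illustration, not CatBoost's API.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Predict with a symmetric tree.

    splits: one (feature_index, threshold) pair per tree level.
    leaf_values: list of 2 ** len(splits) values, indexed by the bit pattern
                 of the split outcomes (first level is the highest bit).
    """
    index = 0
    for feature, threshold in splits:
        # Every node on this level uses the same test; append its outcome as a bit.
        index = (index << 1) | int(x[feature] > threshold)
    return leaf_values[index]

splits = [(0, 0.5), (1, 2.0)]           # hypothetical depth-2 tree
leaf_values = [10.0, 20.0, 30.0, 40.0]  # hypothetical leaf values

print(oblivious_tree_predict([1.0, 1.0], splits, leaf_values))  # outcomes 1, 0 -> leaf index 2
print(oblivious_tree_predict([0.0, 3.0], splits, leaf_values))  # outcomes 0, 1 -> leaf index 1
```

Because the leaf index is just a bit pattern, evaluation needs no branching on tree structure, and the restricted shape acts as a form of regularization.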

Installation

pip install catboost

API

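Methods:

----fit----
X: input data; accepted types include list, pandas.DataFrame, and pandas.Series
y=None: target labels
cat_features=None: the columns to be handled as categorical features
sample_weight=None: per-sample weights for the input data
logging_level=None: controls whether log output is produced, and how much
plot=False: plot metric values, elapsed time, and so on during training
eval_set=None: validation data, given as a list of (X, y) tuples
baseline=None
use_best_model=None
verbose=None
----predict---- returns the predicted class of each sample, as an array
----predict_proba---- returns the class probabilities of each sample, as an array
----get_feature_importance----

Attributes:

feature_importances_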

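CatBoost parameter reference

Common parameters:

loss_function: the loss function. Supported values include RMSE, Logloss, MAE, CrossEntropy, Quantile, LogLinQuantile, MultiClass, MultiClassOneVsAll, MAPE, Poisson. Default RMSE.
custom_metric: metric values to output during training. These metrics are not optimized and are shown for informational purposes only. Default None.
eval_metric: the metric used for overfitting detection (if enabled) and for best-model selection (if enabled).
iterations: maximum number of trees. Default 1000.
learning_rate: learning rate. Default 0.03.
random_seed: random seed used for training.
l2_leaf_reg: L2 regularization coefficient. Default 3.
bootstrap_type: defines how sample weights are computed. Options: Poisson (supported on GPU only), Bayesian, Bernoulli, No. Default Bayesian.
bagging_temperature: controls the strength of Bayesian bagging; typically tuned within [0, 1]. Default 1.
subsample: sampling rate for bagging, used when bootstrap_type is Poisson or Bernoulli. Default 0.66.
sampling_frequency: how often samples are drawn when building trees. Options: PerTree, PerTreeLevel. Default PerTreeLevel.
random_strength: multiplier for the standard deviation of the random noise added to split scores. Default 1.
use_best_model: when set, validation data must be provided; the number of trees kept in the final model is chosen from the best metric value on that data. Default False.
best_model_min_trees: minimum number of trees the best model should have.
depth: tree depth. Maximum 16; values between 1 and 10 are recommended. Default 6.
ignored_features: features in the dataset to ignore. Default None.
one_hot_max_size: categorical features with at most this many distinct values are one-hot encoded; features with more distinct values are converted to numerical (float) features. The default list later in this article gives 2.
has_time: use the input data in its given order (instead of a random permutation) when converting categorical features to numerical features and when choosing the tree structure. Default False (random order).
rsm: random subspace method, the fraction of features considered at each split. Default 1.
nan_mode: how missing values in the input are handled. Options: Forbidden (missing values are not allowed), Min (treated as smaller than all other values), Max (treated as larger than all other values). Default Min.
fold_permutation_block_size: objects in the dataset are grouped into blocks before the random permutation; this parameter sets the block size. Smaller values make training slower; larger values may reduce quality.
leaf_estimation_method: method used to compute leaf values, Newton or Gradient. Default Gradient.
leaf_estimation_iterations: number of gradient steps when computing leaf values.
leaf_estimation_backtracking: type of backtracking to use during gradient descent.
fold_len_multiplier: coefficient for the length of folds. Must be greater than 1; the best results are obtained with values close to the lower end. Default 2.
approx_on_full_history: how approximated values are computed. False: use only a 1/fold_len_multiplier fraction of the preceding rows in the fold. True: use all preceding rows in the fold. Default False.
class_weights: class weights. Default None.
scale_pos_weight: weight of class 1 in binary classification. The value is used as a multiplier for the weights of objects in class 1.
boosting_type: boosting scheme (Ordered or Plain).
allow_const_label: allows training a model on a dataset in which all objects have the same label value. Default False.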

CatBoost default parameters:

'iterations': 1000,
'learning_rate': 0.03,
'l2_leaf_reg': 3,
'bagging_temperature': 1,
'subsample': 0.66,
'random_strength': 1,
'depth': 6,
'rsm': 1,
'one_hot_max_size': 2,
'leaf_estimation_method': 'Gradient',
'fold_len_multiplier': 2,
'border_count': 128,

CatBoost parameter search ranges:

'learning_rate': Log-uniform distribution [e^{-7}, 1]
'random_strength': Discrete uniform distribution over a set {1, 20}
'one_hot_max_size': Discrete uniform distribution over a set {0, 25}
'l2_leaf_reg': Log-uniform distribution [1, 10]
'bagging_temperature': Uniform [0, 1]
'gradient_iterations': Discrete uniform distribution over a set {1, 10}
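
For a random search over the ranges listed above, one candidate configuration could be drawn as in the sketch below. Two assumptions are made here: the notation {1, 20} is read as "the integers from 1 to 20 inclusive" (and likewise for the other discrete sets), and the helper name `sample_params` is invented for this example. The function only returns a dictionary of sampled values and does not call CatBoost.

```python
import numpy as np

def sample_params(rng):
    """Draw one hyperparameter configuration from the ranges listed above."""
    return {
        # Log-uniform on [e^-7, 1]: sample the exponent uniformly, then exponentiate.
        "learning_rate": float(np.exp(rng.uniform(-7.0, 0.0))),
        # Discrete sets are assumed to mean all integers in the closed range.
        "random_strength": int(rng.integers(1, 21)),
        "one_hot_max_size": int(rng.integers(0, 26)),
        # Log-uniform on [1, 10].
        "l2_leaf_reg": float(np.exp(rng.uniform(0.0, np.log(10.0)))),
        "bagging_temperature": float(rng.uniform(0.0, 1.0)),
        "gradient_iterations": int(rng.integers(1, 11)),
    }

rng = np.random.default_rng(42)
for _ in range(3):
    print(sample_params(rng))
```

Sampling the learning rate and the L2 coefficient on a log scale spreads the candidates evenly across orders of magnitude instead of concentrating them near the upper bound.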

Usage example

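import numpy as np
import catboost as cb

# Random integer data: 100 training rows and 50 test rows, 10 features each
train_data = np.random.randint(0, 100, size=(100, 10))
train_label = np.random.randint(0, 2, size=100)
test_data = np.random.randint(0, 100, size=(50, 10))

# A deliberately tiny model (2 trees of depth 2) so the example runs quickly
model = cb.CatBoostClassifier(iterations=2, depth=2, learning_rate=0.5, loss_function='Logloss',
                              logging_level='Verbose')
# Columns 0, 2 and 5 are treated as categorical features
model.fit(train_data, train_label, cat_features=[0, 2, 5])
preds_class = model.predict(test_data)        # predicted class labels
preds_probs = model.predict_proba(test_data)  # per-class probabilities
print('class = ', preds_class)
print('proba = ', preds_probs)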

Reference: https://www.jianshu.com/p/49ab87122562