Russian search giant Yandex announced yesterday that it has open-sourced CatBoost, a machine learning method based on gradient-boosted decision trees with built-in support for categorical features.
CatBoost was developed by Yandex researchers and engineers as the successor to the MatrixNet algorithm, which is widely used inside the company for ranking tasks, forecasting, and recommendations. Yandex describes it as universal, applicable across a broad range of domains and problems.
Related articles by the author:
R + industrial-grade GBDT: Microsoft's open-source LightGBM (the R package is now available)
R: XGBoost extreme gradient boosting, plus a forecastxgb (forecasting) + xgboost (regression) double case study
R: some H2O deep learning practice with the h2o package
Key advantages of CatBoost:
- Superior quality compared with other libraries
- Support for both numerical and categorical features
- Data visualization tools included
Official site: https://tech.yandex.com/CatBoost/
GitHub: https://github.com/catboost/catboost
Both R and Python versions are available. The official benchmarks claim CatBoost outperforms the three strongest existing ML libraries: XGBoost, LightGBM, and H2O.
The comparison metric is Logloss, where smaller is better.
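For reference, binary Logloss is -(1/N) * sum(y * log(p) + (1 - y) * log(1 - p)). A minimal R sketch of the metric (the logloss helper below is purely illustrative, not part of any of these libraries):

logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip probabilities to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(c(1, 0, 1), c(0.9, 0.2, 0.6))  # ~0.28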
The default parameters used in the comparison are documented on [github](https://github.com/catboost/benchmarks/blob/master/comparison_description.pdf).
Installation
On Windows, the author ran into this build failure:
* installing *source* package 'catboost' ...
** libs
running 'src/Makefile.win' ...
/cygdrive/c/Users/mzheng50/Desktop/R-package/src/../../../ya.bat make -r -o ../../..
make: /cygdrive/c/Users/mzheng50/Desktop/R-package/src/../../../ya.bat: Command not found
make: *** [all] Error 127
Warning: running command 'make --no-print-directory -f "Makefile.win"' had status 2
ERROR: compilation failed for package 'catboost'
* removing 'C:/Users/mzheng50/Documents/R/win-library/3.1/catboost'
Error: Command failed (1)
The build shells out to the repo's ya.bat helper, which is not found in this environment. On Linux, the following one-liner builds and installs the package in one go:
devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
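If the install succeeds, loading the package and printing its version is a quick sanity check:

library(catboost)
packageVersion("catboost")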
An official example (the Titanic dataset via caret):
library(caret)
library(titanic)
library(catboost)

set.seed(12345)

# Coerce the Titanic training set to a data frame of factors
data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors = TRUE)

# Drop identifier and free-text columns; split predictors from the target
drop_columns <- c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, "Survived"]

# 4-fold cross-validation, keeping class probabilities
fit_control <- trainControl(method = "cv",
                            number = 4,
                            classProbs = TRUE)

# Only depth is tuned; the other parameters are held constant
grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 1e-3,
                    rsm = 0.95,
                    border_count = 64)

# Train through the caret interface shipped with the catboost package
report <- train(x, as.factor(make.names(y)),
                method = catboost.caret,
                verbose = TRUE, preProc = NULL,
                tuneGrid = grid, trControl = fit_control)

print(report)
--------------------------
> Catboost
>
> 891 samples
>   7 predictors
>   2 classes: 'X0', 'X1'
>
> No pre-processing
> Resampling: Cross-Validated (4 fold)
> Summary of sample sizes: 669, 668, 668, 668
> Resampling results across tuning parameters:
>
>   depth  Accuracy   Kappa
>   4      0.8091544  0.5861049
>   6      0.8035642  0.5728401
>   8      0.7026674  0.2672683
>
> Tuning parameter 'learning_rate' was held constant at a value of 0.1
> Tuning parameter 'rsm' was held constant at a value of 0.95
> Tuning parameter 'border_count' was held constant at a value of 64
> Accuracy was used to select the optimal model using the largest value.
> The final values used for the model were depth = 4, learning_rate = 0.1,
> iterations = 100, l2_leaf_reg = 0.001, rsm = 0.95 and border_count = 64.
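Once trained, the model follows caret's standard prediction interface; for example (a quick sketch, reusing the x and report objects above):

# Class probabilities for the training predictors, via caret's predict()
head(predict(report, newdata = x, type = "prob"))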
# Variable importance from the fitted model (unscaled)
importance <- varImp(report, scale = FALSE)
print(importance)
--------------------------
custom variable importance

         Overall
Fare      25.918
Parch     19.419
Sex       17.999
Pclass    17.410
Age       10.372
Embarked   5.879
SibSp      3.004
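The caret wrapper hides CatBoost's headline feature, native categorical handling. Below is a minimal sketch of the package's own interface (an assumption on my part that the current catboost.load_pool / catboost.train / catboost.predict API is available; the parameter choices just mirror the grid above). Factor columns of a data frame are treated as categorical features automatically:

library(catboost)

# Build a Pool from the predictors; the 0/1 label comes from Survived.
# Note: the toy preprocessing above turned every column into a factor,
# so here all predictors end up treated as categorical.
pool <- catboost.load_pool(data = x, label = as.integer(as.character(y)))

params <- list(loss_function = 'Logloss',
               iterations = 100,
               depth = 4,
               learning_rate = 0.1)

model <- catboost.train(pool, params = params)

# Predicted survival probabilities on the training pool
prob <- catboost.predict(model, pool, prediction_type = 'Probability')
head(prob)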