CatBoost: A machine learning library to handle categorical (CAT) data automatically

赵大寳Note

Introduction

How many of you have seen this error while building your machine learning models using “sklearn”?

I bet most of us! At least in the initial days.

This error occurs when dealing with categorical (string) variables. In sklearn, you are required to convert these categories into a numerical format.

In order to do this conversion, we use several pre-processing methods like “label encoding”, “one hot encoding” and others.
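To make the two encodings concrete, here is a minimal sketch in plain Python. It is an illustration only, not sklearn's implementation: the `colors` column is made-up data, and in practice you would more likely reach for sklearn's `LabelEncoder` and `OneHotEncoder`. The first few lines also show the root cause of the error discussed above, which typically surfaces as `ValueError: could not convert string to float`.

```python
# Hypothetical categorical column, invented for illustration.
colors = ["red", "green", "blue", "green"]

# Numeric models try to treat every value as a number, which fails on strings.
try:
    float(colors[0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Label encoding: map each distinct category to an integer.
# Categories are sorted so the mapping is deterministic.
categories = sorted(set(colors))
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[color] for color in colors]

# One-hot encoding: one 0/1 column per category, in the same sorted order.
one_hot_encoded = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

print(error_message)
print(label_encoded)
print(one_hot_encoded)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one extra column per category.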

In this article, I will discuss “CatBoost”, a recently open-sourced library developed and contributed by Yandex. CatBoost can use categorical features directly, without manual encoding, and it is scalable.

“This is the first Russian machine learning technology that’s an open source,” said Mikhail Bilenko, Yandex’s head of machine intelligence and research.

P.S. You can also read my earlier article, “How to deal with categorical variables?”.

Table of Contents

1. What is CatBoost?
2. Advantages of CatBoost library
3. CatBoost in comparison to other boosting algorithms
4. Installing CatBoost
5. Solving ML challenge using CatBoost
6. End Notes

1. What is CatBoost?

CatBoost is a recently open-sourced machine learning algorithm from Yandex. It integrates easily with deep learning frameworks such as Google’s TensorFlow and Apple’s Core ML, and it works with diverse data types to help solve a wide range of problems that businesses face today. To top it off, Yandex claims best-in-class accuracy for it.

It is especially powerful in two ways:

It yields state-of-the-art results without the extensive data training typically required by other machine learning methods, and
It provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems.

The name “CatBoost” comes from two words: “**Cat**egory” and “**Boost**ing”.

As discussed, the library works well with multiple categories of data, such as audio, text, and images, as well as historical data.

“Boost” comes from the gradient boosting machine learning algorithm, on which this library is based. Gradient boosting is a powerful algorithm that is widely applied to business challenges such as fraud detection, item recommendation, and forecasting, and it performs well on them. It can also return very good results with relatively little data, unlike deep learning (DL) models, which need to learn from a massive amount of data.
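For readers new to the term, here is a rough sketch of the general gradient boosting idea with squared-error loss: start from a constant prediction, then repeatedly fit a small model to the current residuals and add a shrunken copy of it to the ensemble. This is a generic textbook illustration in plain Python on made-up one-dimensional data; it is not how CatBoost itself is implemented, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return (base, stumps, learning_rate) fitted with squared-error loss."""
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = gradient_boost(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each round corrects part of what the previous rounds got wrong, which is why the boosted training error ends up below that of the constant baseline.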

Here is a video message from Mikhail Bilenko, Yandex’s head of machine intelligence and research, and Anna Veronika Dorogush, head of Yandex machine learning systems.
