The CatBoost Algorithm in Machine Learning

python收藏家

Published 2024-04-28 17:29:33

Article link: https://blog.csdn.net/qq_42034590/article/details/132381166

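Datasets often contain categorical features. To fit a boosting model on such data, we usually have to apply an encoding first, such as one-hot encoding or label encoding. One-hot encoding, however, turns every category into its own column; for features with many distinct values this produces a wide, sparse matrix, which can encourage overfitting. CatBoost addresses this problem by handling categorical features natively.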
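To make the sparsity point concrete, here is a minimal sketch using pandas. The `city` column and its values are hypothetical and are not part of the article's dataset; the only point is that one-hot encoding creates one column per distinct value, and each row fills exactly one of them.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
df = pd.DataFrame({"city": ["Paris", "Tokyo", "Lima", "Oslo", "Paris", "Tokyo"]})

# One column per distinct category
one_hot = pd.get_dummies(df["city"])
print(one_hot.shape)

# Fraction of non-zero cells: each row has exactly one active column
density = one_hot.to_numpy().mean()
print(density)
```

With 4 distinct values only a quarter of the cells are non-zero; with thousands of distinct values the matrix becomes almost entirely zeros.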

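What is CatBoost

CatBoost, short for Categorical Boosting, is an open-source boosting library developed by Yandex. It is designed for regression and classification problems, including those with a large number of independent features.
CatBoost is a variant of gradient boosting that handles both categorical and numerical features. You do not need to apply one-hot or label encoding yourself: columns declared as categorical (for example through the cat_features argument) are converted to numbers internally, mainly with ordered target statistics. CatBoost also handles missing values in numerical features automatically; under the default nan_mode they are treated as smaller than every observed value. Its ordered boosting scheme and symmetric (oblivious) trees are intended to reduce overfitting and improve the model's overall performance.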
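To give some intuition for how a categorical value can become a number without one-hot encoding, below is a simplified sketch of ordered target statistics, the idea CatBoost's documentation describes for categorical features. It is not CatBoost's actual implementation: the function name, the `prior` and `a` parameters and the toy data are assumptions made for illustration, and the given row order stands in for the random permutations the library uses.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, a=1.0):
    """Encode each row's category using only the targets of earlier rows."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running number of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of the targets seen so far for this category
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        # Only now reveal this row's target to later rows
        sums[cat] += y
        counts[cat] += 1
    return encoded

cats = ["red", "blue", "red", "red", "blue"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_stats(cats, ys)
print(enc)
```

Because a row's own target never enters its own encoding, the encoded feature does not leak the label, which is the motivation for the "ordered" part.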

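CatBoost features

Built-in handling of categorical features - CatBoost can consume categorical features directly, without manual feature encoding
Built-in handling of missing values - CatBoost handles missing values in numerical features out of the box, with no separate imputation step
No feature scaling required - as a tree-based model, CatBoost splits on thresholds of quantized feature values, so columns do not need to be normalized or standardized first
Cross-validation and hyperparameter search utilities - CatBoost provides a cv function and grid_search / randomized_search methods; they have to be called explicitly and are not applied automatically during fit
Regularization - CatBoost supports L2 regularization of leaf values (the l2_leaf_reg parameter) to reduce overfitting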

Comparison of CatBoost with other boosting algorithms on different datasets

[Figure: comparison results; image not included]

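Installing CatBoost and a worked example

Installation

pip install catboost

We will use Python to apply CatBoost to a machine learning problem. The dataset contains three species of iris flowers together with four measurements per flower (sepal length, sepal width, petal length and petal width), and the task is to classify each flower into its species.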

Importing the libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings("ignore")

Loading the dataset
After importing the libraries, we load the dataset with the pandas read_csv method:

data = pd.read_csv("Iris.csv")

# Printing the shape of the dataset
print(data.shape)

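Output

(150, 6)

The dataset has 150 rows and 6 columns. Let's explore its contents with the head() method:

data.head()

[Image: output of data.head()]

Dropping the Id column and separating the target variable from the dataset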

data = data.drop('Id', axis=1)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print(f"Shape of X is {X.shape} and shape of y is {y.shape}")

Output

Shape of X is (150, 4) and shape of y is (150,)

Since this is a classification task, we want to know the number of unique classes in the dependent variable.

total_classes = y.nunique()
print("Number of unique species in dataset are:", total_classes)

Output

Number of unique species in dataset are: 3

The dependent variable has 3 unique classes. Next we count the samples in each class to check whether the dataset is balanced.

distribution = y.value_counts()
print(distribution)

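Output

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64

The 150 samples are spread evenly across the three species, 50 each, so there is no class imbalance.

Splitting the dataset

We now split the dataset into a training set and a validation set, with the validation set holding 25% of the data. For this we use the train_test_split function from sklearn.model_selection.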

X_train, X_val, Y_train, Y_val = train_test_split(
	X, y, test_size=0.25, random_state=28)
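A plain random split does not guarantee that the validation set keeps the 50/50/50 class balance seen earlier. As an optional variation, not used in the original article, `train_test_split` accepts a `stratify` argument. The sketch below uses stand-in arrays with the same class balance rather than the real Iris.csv file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class balance as the article's dataset (3 x 50)
X_demo = np.arange(150).reshape(-1, 1)
y_demo = np.repeat(["setosa", "versicolor", "virginica"], 50)

# stratify=y_demo keeps class proportions (almost) equal in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=28, stratify=y_demo)

labels, counts = np.unique(y_va, return_counts=True)
print(dict(zip(labels, counts)))
```

With 150 rows and test_size=0.25 the validation set has 38 rows, so the three classes cannot be exactly equal, but they differ by at most one sample.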

Applying CatBoost to the data

# Define the hyperparameters for the CatBoost algorithm
params = {'learning_rate': 0.1, 'depth': 6,
          'l2_leaf_reg': 3, 'iterations': 100}

# Initialize the CatBoostClassifier object
# with the defined hyperparameters and fit it on the training set
model = CatBoostClassifier(**params)
model.fit(X_train, Y_train)

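Accuracy of the CatBoost model

# Predict the target variable on the validation set.
# For multiclass models, predict() returns a column of shape (n_samples, 1),
# so flatten it before comparing with the 1-D array of true labels.
y_pred = model.predict(X_val).ravel()
accuracy = (y_pred == np.array(Y_val)).mean()
print("Validation Accuracy:", accuracy)

Output

The original code compared the unflattened predictions with Y_val and printed:

Validation Accuracy: 0.33518005540166207

That figure equals 484/1444, the mean of a 38 × 38 broadcast comparison matrix, not the share of correctly classified samples. With the flattened predictions, the printed value is the per-sample accuracy on the 38 validation samples.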
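A validation accuracy close to 1/3 on three balanced classes is a signal to check array shapes. For multiclass models, CatBoost's predict returns a column of shape (n_samples, 1), and comparing that with a 1-D label array broadcasts into a full matrix. The NumPy sketch below reproduces the effect with made-up labels (not the article's data), where every prediction is actually correct.

```python
import numpy as np

y_true = np.array(["a", "b", "c", "a"])               # shape (4,)
y_pred_col = np.array([["a"], ["b"], ["c"], ["a"]])   # shape (4, 1), all correct

# (4, 1) == (4,) broadcasts to a (4, 4) matrix of pairwise comparisons
wrong = (y_pred_col == y_true).mean()

# Flattening first gives the intended element-wise comparison
right = (y_pred_col.ravel() == y_true).mean()

print(wrong, right)
```

The unflattened comparison reports 0.375 even though all four predictions match, while the flattened one reports 1.0.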
