Machine Learning Case 9: Otto Product Prediction with XGBoost

自学小白菜

Published 2023-01-15 12:30:21

Article link: https://blog.csdn.net/weixin_46676835/article/details/128693401

Case 9: Otto Product Prediction with XGBoost

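Why I wrote this post

Those who come before plant the trees; those who come after enjoy the shade. I hope my study notes help anyone who needs them.

Prerequisites

Whether you understand the theory is not important. The goal of this series is to implement machine learning in Python.

What you must know: Python basics, numpy, pandas, matplotlib, and how to work with libraries.

Notes

The complete code is at the end. Methods that appeared in earlier cases are not explained again.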

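1. Overview:

In the previous case we used a random forest to predict Otto product categories. Here we make a few simple changes to that code to use the XGBoost algorithm instead.

The data can be obtained as described in the previous case.

You also need to install the xgboost package. If installing with pip fails (pip install xgboost), you can download it from the official site; see this blog post for reference: https://blog.csdn.net/xiaofeixia002X/article/details/104479101.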

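2. New methods used:

Model

# Import the module (it must be installed first)
from xgboost import XGBClassifier
# Create the model and fit it on the training data
model = XGBClassifier()
model.fit(x_train, y_train)
'''
Common parameters:
	booster : which booster to use: gbtree, gblinear, or dart
	nthread : number of threads (called n_jobs in the scikit-learn wrapper)
	max_depth : maximum tree depth
'''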

3. Code changes:

First, trim the code from the previous case by deleting what is no longer needed:

# Imports
import pandas as pd
import numpy as np
import seaborn
from matplotlib import pyplot as plt
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from xgboost import XGBClassifier

# Load the data
train_data = pd.read_csv('./data/otto/train.csv')
# print(train_data.shape) # (61878, 95)
y = train_data['target']
x = train_data.iloc[:,1:-1]  # drop the first column (id) and the last column (target)
# print(x.shape) # (61878, 93)
# Random undersampling
rus = RandomUnderSampler(random_state=20)
new_x,new_y = rus.fit_resample(x,y)
# print(new_x.shape) # (17361, 93)
# Labels -> integers
encoder = LabelEncoder()
new_y = encoder.fit_transform(new_y)
# Train/test split
x_train,x_test,y_train,y_test = train_test_split(new_x,new_y,test_size=0.2)
# Create and fit the model
model = XGBClassifier()
model.fit(x_train, y_train)
# Evaluation 1: accuracy
print('Accuracy:',model.score(x_test,y_test))
# Evaluation 2: log loss
y_pred = model.predict(x_test)
# One-hot encode. .toarray() returns a dense matrix on any scikit-learn version
# (the sparse=False keyword was renamed to sparse_output in scikit-learn 1.2).
encoder_oneHot = OneHotEncoder()
y_test = encoder_oneHot.fit_transform(y_test.reshape(-1,1)).toarray() # input must be 2-D
# Reuse the fitted encoder so both matrices have the same columns
y_pred = encoder_oneHot.transform(y_pred.reshape(-1,1)).toarray()
# Log loss, computed here on hard one-hot predictions rather than probabilities
print('log_loss:',log_loss(y_test,y_pred))

Next, we extend the code above.

There is not much to change. We only need to add two steps: standardization and PCA dimensionality reduction.

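Standardization:

# Standardization
from sklearn.preprocessing import StandardScaler
stand = StandardScaler()
stand.fit(x_train)  # learn the mean and std from the training set only
x_train_stand = stand.transform(x_train)
x_test_stand = stand.transform(x_test)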
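To make the fit/transform split above concrete, here is a minimal sketch of the same computation done by hand with numpy. The small random matrices are hypothetical stand-ins, not the Otto data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical train/test matrices (assumption, not the Otto data).
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# "fit": learn the per-column mean and standard deviation from the training set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets.
x_train_stand = (x_train - mean) / std
x_test_stand = (x_test - mean) / std

print(x_train_stand.mean(axis=0).round(6))  # close to 0 in every column
print(x_train_stand.std(axis=0).round(6))   # close to 1 in every column
```

The test set is scaled with the training statistics, so its columns will be near, but not exactly at, mean 0 and standard deviation 1. That is expected and avoids leaking test information into preprocessing.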

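PCA dimensionality reduction:

# PCA dimensionality reduction
from sklearn.decomposition import PCA
pca_model = PCA(n_components=0.9)  # keep enough components to explain 90% of the variance
x_train_pca = pca_model.fit_transform(x_train_stand)
x_test_pca = pca_model.transform(x_test_stand)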
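When n_components is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on hypothetical data built from three latent directions plus noise; the data and its shape are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 10 features whose variance comes mostly from 3 latent directions.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

kept = pca.n_components_                         # number of components actually kept
explained = pca.explained_variance_ratio_.sum()  # fraction of variance they explain
print('components kept:', kept)
print('variance explained:', explained)
```

Checking n_components_ after fitting is a quick way to see how many of the 93 Otto features survive the reduction; the commented-out print(x_train_pca.shape) in the complete code serves the same purpose.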

Result of the final run:

Accuracy: 0.7494961128707169
log_loss: 8.65209774361426
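A note on the size of that log loss: the code scores one-hot hard predictions rather than predicted probabilities, so every misclassified sample is treated as if the model gave the true class a probability of 0, which is then clipped to a tiny value. The sketch below reproduces the arithmetic in plain Python. The clipping value of 1e-15 is an assumption (it was the default in older scikit-learn releases and matches the reported number); the labels and probabilities are hypothetical.

```python
import math

def multiclass_log_loss(true_labels, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(true_labels, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_labels)

true_labels = [0, 1, 2, 1]

# Hard one-hot predictions: the last sample is misclassified as class 2.
hard = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
# Hypothetical probabilities with the same argmax as the hard predictions.
soft = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7], [0.1, 0.3, 0.6]]

hard_loss = multiclass_log_loss(true_labels, hard)
soft_loss = multiclass_log_loss(true_labels, soft)
print('hard:', hard_loss)
print('soft:', soft_loss)

# With hard labels the loss is roughly (1 - accuracy) * -ln(eps).
penalty = -math.log(1e-15)
estimate = (1 - 0.7494961128707169) * penalty
print('estimate from the reported accuracy:', estimate)
```

Under that assumption the estimate agrees with the reported log_loss to several decimal places, which suggests the figure restates the error rate rather than measuring how well calibrated the model's probabilities are.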

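4. Summary:

The code above changes little compared with the random forest version, and it only runs through the workflow once without digging into questions such as parameter selection. XGBoost is regarded as one of the most accomplished machine learning algorithms, yet the accuracy here stays unsatisfactory, which puzzles me. I will spend time investigating this later and publish what I find.

Complete code: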

# author: baiCai
# Imports
import pandas as pd
import numpy as np
import seaborn
from matplotlib import pyplot as plt
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit,train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import log_loss
from xgboost import XGBClassifier

# Load the data
train_data = pd.read_csv('./data/otto/train.csv')
# print(train_data.shape) # (61878, 95)
y = train_data['target']
x = train_data.iloc[:,1:-1]  # drop the first column (id) and the last column (target)
# print(x.shape) # (61878, 93)
# Random undersampling
rus = RandomUnderSampler(random_state=20)
new_x,new_y = rus.fit_resample(x,y)
# print(new_x.shape) # (17361, 93)
# Labels -> integers
encoder = LabelEncoder()
new_y = encoder.fit_transform(new_y)
# Split option 1: stratified shuffle split (left commented out)
'''
split_obj = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split_obj.split(new_x.values,new_y):
    x_train = new_x.values[train_index]
    x_test = new_x.values[test_index]
    y_train = new_y[train_index]
    y_test = new_y[test_index]
'''
# Split option 2: plain random split (no random_state, so results vary between runs)
x_train,x_test,y_train,y_test = train_test_split(new_x,new_y,test_size=0.2)
# print(x_train.shape) # (13888, 93)
# print(x_test.shape) # (3473, 93)
# Standardization (fit on the training set only)
stand = StandardScaler()
stand.fit(x_train)
x_train_stand = stand.transform(x_train)
x_test_stand = stand.transform(x_test)
# print(x_train_stand)
# PCA dimensionality reduction (keep 90% of the variance)
pca_model = PCA(n_components=0.9)
x_train_pca = pca_model.fit_transform(x_train_stand)
x_test_pca = pca_model.transform(x_test_stand)
# print(x_train_pca.shape)
# Create the model
model = XGBClassifier()
'''
Parameters left commented out:
learning_rate =0.1,
                    n_estimators=550,
                    max_depth=3,
                    min_child_weight=3,
                    subsample=0.7,
                    colsample_bytree=0.7,
                    nthread=4,
                    seed=42,
                    objective='multi:softprob'
'''
model.fit(x_train_pca, y_train)
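# Evaluation 1: accuracy
print('Accuracy:',model.score(x_test_pca,y_test))
# Evaluation 2: log loss
y_pred = model.predict(x_test_pca)
# One-hot encode. .toarray() returns a dense matrix on any scikit-learn version
# (the sparse=False keyword was renamed to sparse_output in scikit-learn 1.2).
encoder_oneHot = OneHotEncoder()
y_test = encoder_oneHot.fit_transform(y_test.reshape(-1,1)).toarray() # input must be 2-D
# Reuse the fitted encoder so both matrices have the same columns
y_pred = encoder_oneHot.transform(y_pred.reshape(-1,1)).toarray()
# Log loss, computed here on hard one-hot predictions rather than probabilities
print('log_loss:',log_loss(y_test,y_pred))
'''
Accuracy: 0.7506478548805068
log_loss: 8.612317983873504
-----
Accuracy: 0.7422977253095306
log_loss: 8.90072124199398
'''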