Case 9: Otto Product Prediction with XGBoost
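Why I write this blog
Those who came before planted the trees; those who come after enjoy the shade. I hope my study notes help whoever needs them.
Prerequisites
Understanding the theory is not essential here; the goal of this series is to implement machine learning in Python.
What you do need: Python basics, numpy, pandas, matplotlib, and general familiarity with using libraries.
Notes
The complete code is at the end. Methods that already appeared in earlier cases are not explained again.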
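1. Overview:
In the previous case we used a random forest for Otto product prediction. Here we make a few small changes to that code to use the XGBoost algorithm instead.
The data can be obtained the same way as described in the previous case.
You also need to install the xgboost package. If installing with pip (pip install xgboost) fails, you can download it from the official website; this blog post may help: https://blog.csdn.net/xiaofeixia002X/article/details/104479101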
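2. New methods used:
Model
# Import the module (install xgboost first)
from xgboost import XGBClassifier
# Create and train the model
model = XGBClassifier()
model.fit(x_train, y_train)
'''
Common parameters:
booster : which booster to use: gbtree, gblinear, or dart
nthread : number of threads (the scikit-learn wrapper also accepts n_jobs)
max_depth : maximum depth of each tree
'''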
3. Modifying the code:
First, trim the code from the previous case, removing the parts that are no longer needed:
# Imports
import pandas as pd
import numpy as np
import seaborn
from matplotlib import pyplot as plt
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from xgboost import XGBClassifier
# Load the data
train_data = pd.read_csv('./data/otto/train.csv')
# print(train_data.shape) # (61878, 95)
y = train_data['target']
x = train_data.iloc[:,1:-1] # drop the first (id) column and the last (target) column
# print(x.shape) # (61878, 93)
# Random undersampling to balance the classes
rus = RandomUnderSampler(random_state=20)
new_x,new_y = rus.fit_resample(x,y)
# print(new_x.shape) # (17361, 93)
# Encode the string labels as integers
encoder = LabelEncoder()
new_y = encoder.fit_transform(new_y)
# Train/test split
x_train,x_test,y_train,y_test = train_test_split(new_x,new_y,test_size=0.2)
# Create and train the model
model = XGBClassifier()
model.fit(x_train, y_train)
# Evaluation 1: accuracy
print('Accuracy:',model.score(x_test,y_test))
# Evaluation 2: log loss
y_pred = model.predict(x_test)
# One-hot encode the labels and the predictions
# (scikit-learn >= 1.2 renames the sparse argument to sparse_output)
encoder_oneHot = OneHotEncoder(sparse=False)
y_test = encoder_oneHot.fit_transform(y_test.reshape(-1,1)) # must be a 2-D array
# Reuse the fitted encoder so the columns line up with y_test
y_pred = encoder_oneHot.transform(y_pred.reshape(-1,1))
# log loss on hard 0/1 predictions: every wrong prediction is penalized heavily,
# so the value is typically much larger than one computed from model.predict_proba
print('log_loss:',log_loss(y_test,y_pred))
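Next, we modify the code above.
The changes are small: we only add two steps, standardization and PCA dimensionality reduction. Both need an extra import (StandardScaler from sklearn.preprocessing and PCA from sklearn.decomposition), and the model is then trained and evaluated on the PCA-transformed features, as in the complete code at the end.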
Standardization:
# Standardize the features: fit on the training set only, then apply to both sets
stand = StandardScaler()
stand.fit(x_train)
x_train_stand = stand.transform(x_train)
x_test_stand = stand.transform(x_test)
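The scaler is fitted on the training set only, so the test set is transformed with the training mean and standard deviation. A small sketch of what that means, using random data as a stand-in for the real features (an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_tr = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
x_te = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

# Fit on the training data, then transform the test data
scaler = StandardScaler().fit(x_tr)
x_te_scaled = scaler.transform(x_te)

# The same transformation written out by hand
train_mean = x_tr.mean(axis=0)
train_std = x_tr.std(axis=0)  # population standard deviation (ddof=0)
x_te_manual = (x_te - train_mean) / train_std

print(np.allclose(x_te_scaled, x_te_manual))  # True
```

Because the statistics come from the training set, the scaled test set will generally not have exactly zero mean and unit variance.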
PCA dimensionality reduction:
# PCA: keep enough components to explain 90% of the variance
pca_model = PCA(n_components=0.9)
x_train_pca = pca_model.fit_transform(x_train_stand)
x_test_pca = pca_model.transform(x_test_stand)
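With a float between 0 and 1, n_components is read as a variance threshold, not a component count. PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on synthetic correlated data. The 5 latent factors, 30 features, and noise level are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 30 observed features driven by 5 latent factors plus a little noise
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 30))
x_all = latent @ mixing + 0.3 * rng.normal(size=(500, 30))
x_tr, x_te = x_all[:400], x_all[400:]

scaler = StandardScaler().fit(x_tr)
pca = PCA(n_components=0.9)
x_tr_pca = pca.fit_transform(scaler.transform(x_tr))
x_te_pca = pca.transform(scaler.transform(x_te))

kept = pca.n_components_                         # number of components retained
explained = pca.explained_variance_ratio_.sum()  # at least 0.9 by construction
print(kept, x_tr_pca.shape, x_te_pca.shape)
print(explained)
```

Printing n_components_ after fitting on the Otto data would show how many of the 93 features' worth of dimensions survive the reduction.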
Final results:
Accuracy: 0.7494961128707169
log_loss: 8.65209774361426
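The reported log_loss can be reproduced from the accuracy alone. With one-hot hard predictions, each wrong sample contributes -log(eps) after clipping and each correct one contributes almost nothing, so log_loss is roughly (1 - accuracy) × -log(eps). The sketch below checks this under two assumptions: a clipping constant of eps = 1e-15 (it reproduces the number above, but the post does not state which value the installed scikit-learn used), and 870 wrong predictions out of the 3473 test samples, inferred from the reported accuracy. The labels themselves are made up.

```python
import numpy as np

def clipped_log_loss(y_true_onehot, y_prob, eps=1e-15):
    """Mean multiclass log loss with probabilities clipped to [eps, 1 - eps]."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-(y_true_onehot * np.log(p)).sum(axis=1).mean())

n_samples, n_classes, n_wrong = 3473, 9, 870

# Made-up labels; the first n_wrong predictions are forced to be wrong
y_true = np.arange(n_samples) % n_classes
y_pred = y_true.copy()
y_pred[:n_wrong] = (y_true[:n_wrong] + 1) % n_classes

accuracy = float((y_pred == y_true).mean())
true_onehot = np.eye(n_classes)[y_true]
hard_onehot = np.eye(n_classes)[y_pred]
hard_loss = clipped_log_loss(true_onehot, hard_onehot)

print(accuracy)                             # about 0.7495
print(hard_loss)                            # about 8.652
print((1 - accuracy) * -np.log(1e-15))      # same value up to rounding
```

Under these assumptions the figure is just the error rate rescaled, so it carries no information beyond the accuracy and is not comparable to a log loss computed from predicted probabilities.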
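4. Summary:
The code above changes little compared with the random forest version, and it only runs through the workflow once without looking into parameter selection. XGBoost is regarded as one of the strongest machine learning algorithms, yet the accuracy here remains poor, which puzzles me. I will spend time investigating this later and post what I find.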
Complete code:
# author: baiCai
# Imports
import pandas as pd
import numpy as np
import seaborn
from matplotlib import pyplot as plt
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit,train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import log_loss
from xgboost import XGBClassifier
# Load the data
train_data = pd.read_csv('./data/otto/train.csv')
# print(train_data.shape) # (61878, 95)
y = train_data['target']
x = train_data.iloc[:,1:-1] # drop the first (id) column and the last (target) column
# print(x.shape) # (61878, 93)
# Random undersampling to balance the classes
rus = RandomUnderSampler(random_state=20)
new_x,new_y = rus.fit_resample(x,y)
# print(new_x.shape) # (17361, 93)
# Encode the string labels as integers
encoder = LabelEncoder()
new_y = encoder.fit_transform(new_y)
# Train/test split, option 1 (stratified; disabled, kept for reference)
'''
split_obj = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split_obj.split(new_x.values,new_y):
    x_train = new_x.values[train_index]
    x_test = new_x.values[test_index]
    y_train = new_y[train_index]
    y_test = new_y[test_index]
'''
# Train/test split, option 2 (the one used here)
x_train,x_test,y_train,y_test = train_test_split(new_x,new_y,test_size=0.2)
# print(x_train.shape) # (13888, 93)
# print(x_test.shape) # (3473, 93)
# Standardize the features: fit on the training set only, then apply to both sets
stand = StandardScaler()
stand.fit(x_train)
x_train_stand = stand.transform(x_train)
x_test_stand = stand.transform(x_test)
# print(x_train_stand)
# PCA: keep enough components to explain 90% of the variance
pca_model = PCA(n_components=0.9)
x_train_pca = pca_model.fit_transform(x_train_stand)
x_test_pca = pca_model.transform(x_test_stand)
# print(x_train_pca.shape)
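# Create the model (default hyperparameters)
model = XGBClassifier()
# Candidate hyperparameters, kept for reference; they are not passed to the model above
'''
learning_rate =0.1,
n_estimators=550,
max_depth=3,
min_child_weight=3,
subsample=0.7,
colsample_bytree=0.7,
nthread=4,
seed=42,
objective='multi:softprob'
'''
# Train on the PCA-transformed features
model.fit(x_train_pca, y_train)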
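# Evaluation 1: accuracy
print('Accuracy:',model.score(x_test_pca,y_test))
# Evaluation 2: log loss
y_pred = model.predict(x_test_pca)
# One-hot encode the labels and the predictions
# (scikit-learn >= 1.2 renames the sparse argument to sparse_output)
encoder_oneHot = OneHotEncoder(sparse=False)
y_test = encoder_oneHot.fit_transform(y_test.reshape(-1,1)) # must be a 2-D array
# Reuse the fitted encoder so the columns line up with y_test
y_pred = encoder_oneHot.transform(y_pred.reshape(-1,1))
# log loss on hard 0/1 predictions: every wrong prediction is penalized heavily,
# so the value is typically much larger than one computed from model.predict_proba
print('log_loss:',log_loss(y_test,y_pred))
# Output recorded from two runs (the split is random, so results vary)
'''
Accuracy: 0.7506478548805068
log_loss: 8.612317983873504
-----
Accuracy: 0.7422977253095306
log_loss: 8.90072124199398
'''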