Kaggle: Simple Code for Otto Group Product Classification

LYROOO

Article link: https://blog.csdn.net/LYROOO/article/details/89158298

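Introduction and related links:

I followed an XGBoost course video on Bilibili. After covering the theory behind XGBoost, the course applies it to a classification task on Kaggle's Otto Group Product Classification dataset.

1. First, import the modules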

from xgboost import XGBClassifier
import xgboost as xgb

import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import log_loss  # negative log-likelihood loss

from matplotlib import pyplot
import seaborn as sns
%matplotlib inline
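
The log_loss metric imported above is the multi-class negative log-likelihood: the average, over all samples, of minus the log of the probability the model gave to the true class. Here is a minimal pure-Python sketch of that formula, assuming integer labels 0..K-1 and one row of probabilities per sample. The name multiclass_log_loss and the clipping constant eps are illustrative, not scikit-learn's API.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # clip so that a probability of exactly 0 does not produce log(0)
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Three samples, three classes (toy numbers, not taken from the Otto data)
y = [0, 2, 1]
p = [[0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8],
     [0.3, 0.4, 0.3]]
loss = multiclass_log_loss(y, p)
print(loss)

# A uniform guess over 9 classes gives ln(9), a handy baseline
baseline = multiclass_log_loss([0], [[1 / 9] * 9])
print(baseline)
```

A model that guesses uniformly over the 9 Otto classes scores ln 9, about 2.197, so any useful model should land well below that.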

2. Read the data with pandas' pd.read_csv(). (I have also had to read JSON data before; for that, use pd.read_json().)

path='D:\\data\\kaggle\\otto-group-product-classification-challenge'
train=pd.read_csv(path+'\\train.csv')
test=pd.read_csv(path+'\\test.csv')

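3. Before working on a problem, it is usually worth looking at the distribution of each feature. This dataset has a fixed set of variables, the features are all of one kind, and they have been anonymized (their real meaning cannot be read off by eye), so little feature engineering is done here and most of the effort goes into parameter tuning.

First look at the distribution of target to see whether the classes are balanced.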

sns.countplot(x=train['target'])  # pass the column by keyword; newer seaborn versions no longer take it positionally
pyplot.xlabel('target')
pyplot.ylabel('number of occurrences')

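4. The classes are not well balanced: the number of samples differs a lot from class to class, so the cross-validation folds should be drawn in proportion to each class. That is why StratifiedKFold is used here. Each fold samples every class proportionally, so the class ratios in the training and validation splits match those of the original dataset.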

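y_train=train['target']
# Labels look like 'Class_1' ... 'Class_9': strip the 'Class_' prefix, then shift to 0..8
y_train=y_train.map(lambda s:s[6:])
y_train=y_train.map(lambda s:int(s)-1)

# axis=1 drops columns: target is kept separately as the label, and id is not a feature
train=train.drop(['id','target'],axis=1)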
x_train=np.array(train)

kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=3)
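
To see what stratified splitting means in practice, here is a simplified pure-Python sketch. It is not scikit-learn's implementation; the function stratified_folds and the toy labels are made up for illustration. It groups sample indices by class, shuffles each group, and deals them round-robin into the folds.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5, seed=3):
    """Assign each sample index to a fold so every fold keeps the class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # deal this class's samples round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds

# Imbalanced toy labels: 50 of class 0, 30 of class 1, 10 of class 2
labels = [0] * 50 + [1] * 30 + [2] * 10
folds = stratified_folds(labels)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Each of the five folds ends up with 10, 6 and 2 samples of the three classes, the same 5:3:1 ratio as the full label list. A plain shuffled split gives no such guarantee for the rare class.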

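5. Start with default parameters. The learning rate here is 0.1, which is fairly large; use it to find the rough range for the number of weak learners. (Running with default parameters also shows whether the model overfits or underfits.)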

def modelfit(alg,x_train,y_train,useTrainCV=True,cv_folds=None,early_stopping_rounds=50):
    if useTrainCV:
        xgb_param=alg.get_xgb_params()
        xgb_param['num_class']=9
        
        xgtrain=xgb.DMatrix(x_train,label=y_train)
        cvresult=xgb.cv(xgb_param,xgtrain,num_boost_round=alg.get_params()['n_estimators'],folds=cv_folds,
                       metrics='mlogloss',early_stopping_rounds=early_stopping_rounds)
        
        n_estimators=cvresult.shape[0]
        alg.set_params(n_estimators=n_estimators)
        
        print(cvresult)
        
        cvresult.to_csv(path+'\\my_preds_4_1.csv',index_label='n_estimators')
        
        # plot
        test_means=cvresult['test-mlogloss-mean']  # mean
        test_stds=cvresult['test-mlogloss-std']  # standard deviation
        
        train_means=cvresult['train-mlogloss-mean']
        train_stds=cvresult['train-mlogloss-std']
        
        x_axis=range(0,n_estimators)
        pyplot.errorbar(x_axis,test_means,yerr=test_stds,label='test')
        pyplot.errorbar(x_axis,train_means,yerr=train_stds,label='train')
        pyplot.title('XGBoost n_estimators vs Log Loss')
        pyplot.xlabel('n_estimators')
        pyplot.ylabel('Log Loss')
        pyplot.savefig(path+'\\n_estimators.png')
        
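    # Train the model on the data
    # (eval_metric is set on the estimator; recent xgboost releases no longer accept it in fit())
    alg.set_params(eval_metric='mlogloss')
    alg.fit(x_train,y_train)
    # Predict on the training set
    train_predprob=alg.predict_proba(x_train)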
    logloss=log_loss(y_train,train_predprob)
    
    #print
    print("logloss of train:")
    print(logloss)


xgbl=XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',  # multi-class problem, outputs per-class probabilities
    seed=3)

modelfit(xgbl,x_train,y_train,cv_folds=kfold)

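This step takes a long time. Training stopped at 742 weak learners, so for the next round of tuning we fix the number of weak learners and tune the depth of each tree and the leaf-node weight (min_child_weight).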
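
The stopping point comes from early stopping in xgb.cv. As I understand it, training halts once the cross-validated mlogloss has not improved for early_stopping_rounds rounds, and the returned table is cut back to the best round, which is why modelfit reads n_estimators from cvresult.shape[0]. The sketch below imitates that rule on a plain list of losses. The function rounds_kept and the toy curve are illustrative, not XGBoost's code or real Otto results.

```python
def rounds_kept(test_losses, early_stopping_rounds=50):
    """Return how many boosting rounds survive early stopping on a loss curve."""
    best_loss = float('inf')
    best_round = -1
    for rnd, loss in enumerate(test_losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= early_stopping_rounds:
            break  # no improvement for `early_stopping_rounds` rounds
    # rounds are 0-indexed, so the number of rounds kept is best index + 1
    return best_round + 1

# Toy curve: improves for 10 rounds, then slowly gets worse
curve = [1.0 - 0.05 * i for i in range(10)] + [0.56 + 0.01 * i for i in range(20)]
n_estimators = rounds_kept(curve, early_stopping_rounds=5)
print(n_estimators)  # 10
```

With a patience of 5 the loop gives up five rounds after the minimum and keeps the first 10 rounds, even though the curve has 30 entries.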

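6. Tune the tree parameters max_depth and min_child_weight. The coarse search uses a step of 2; the next step searches around the best coarse values with the step reduced to 1 for fine tuning.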

max_depth=range(3,10,2)
min_child_weight=range(1,6,2)
param_test2_1=dict(max_depth=max_depth,min_child_weight=min_child_weight)
param_test2_1

xgb2_1=XGBClassifier(
    learning_rate=0.1,
    n_estimators=742,  # the best number of weak learners found above
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',  # multi-class problem, outputs per-class probabilities
    seed=3)

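# return_train_score=True is needed for the mean_train_score / std_train_score columns used below
qsearch2_1=GridSearchCV(xgb2_1,param_grid=param_test2_1,scoring='neg_log_loss',n_jobs=-1,cv=kfold,return_train_score=True)
qsearch2_1.fit(x_train,y_train)
# grid_scores_ was removed from scikit-learn; cv_results_ holds the per-candidate scores
qsearch2_1.cv_results_['mean_test_score'],qsearch2_1.best_params_,qsearch2_1.best_score_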
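
To get a feel for what this search costs, the coarse grid can be enumerated by hand. GridSearchCV does this internally; the snippet below only counts candidates and fits, using the same ranges and the 5 folds from the StratifiedKFold above. The variable names candidates and n_fits are illustrative.

```python
from itertools import product

max_depth = range(3, 10, 2)          # 3, 5, 7, 9
min_child_weight = range(1, 6, 2)    # 1, 3, 5
n_splits = 5                         # number of cross-validation folds

# every (max_depth, min_child_weight) pair is one candidate model
candidates = [
    {'max_depth': d, 'min_child_weight': w}
    for d, w in product(max_depth, min_child_weight)
]
n_fits = len(candidates) * n_splits
print(len(candidates), n_fits)  # 12 60
```

Twelve candidates times five folds means 60 model fits (plus one final refit on the full data), each with 742 trees, which explains why this step is slow.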

7. Visualize the relationship between each parameter and the loss

# Visualization
print("best:{} using {}".format(qsearch2_1.best_score_,qsearch2_1.best_params_))
test_means=qsearch2_1.cv_results_['mean_test_score']
test_stds=qsearch2_1.cv_results_['std_test_score']
train_means=qsearch2_1.cv_results_['mean_train_score']
train_stds=qsearch2_1.cv_results_['std_train_score']

pd.DataFrame(qsearch2_1.cv_results_).to_csv(path+'\\my_preds_maxdepth_min_child_weight')

test_scores=np.array(test_means).reshape(len(max_depth),len(min_child_weight))
train_scores=np.array(train_means).reshape(len(max_depth),len(min_child_weight))
for i,value in enumerate(max_depth):
    # one curve per max_depth value; the x axis is min_child_weight
    pyplot.plot(min_child_weight,-test_scores[i],label='test_max_depth:'+str(value))

pyplot.legend()
pyplot.xlabel('min_child_weight')
pyplot.ylabel('log_loss')
pyplot.savefig('max_depth_vs_min_child_weight_1.png')
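
The reshape above relies on the order in which the candidates are scored. As far as I know, GridSearchCV sorts the parameter names and lets the last one vary fastest, so the flat score list runs through every min_child_weight for the first max_depth, then for the next, and so on. The sketch below mimics that layout in pure Python. The scores are made up and encoded so that each cell shows which pair produced it.

```python
max_depth = [3, 5, 7, 9]
min_child_weight = [1, 3, 5]

# Stand-in for cv_results_['mean_test_score']: one fake score per candidate,
# encoded as -(depth + weight / 10) so each value identifies its parameters.
flat_scores = [-(d + w / 10) for d in max_depth for w in min_child_weight]

# Cut the flat list into rows: one row per max_depth, one column per min_child_weight
n_cols = len(min_child_weight)
grid = [flat_scores[r * n_cols:(r + 1) * n_cols] for r in range(len(max_depth))]

for depth, row in zip(max_depth, grid):
    print(depth, row)
```

Row i of the grid therefore holds the scores for max_depth[i], which is why each plotted curve corresponds to one max_depth and runs along min_child_weight.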

8. Fine-tune with a step of 1

max_depth=[6,7,8]
min_child_weight=[4,5,6]
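
Fine grids like these can be generated from the coarse optimum instead of being typed by hand. The helper below is a hypothetical sketch: fine_grid is not part of any library, and the coarse optima of 7 and 5 are assumed only because they reproduce the lists above. The post does not state which coarse values won.

```python
def fine_grid(best, step=1, lower=1):
    """Candidates one step either side of a coarse optimum, never below `lower`."""
    return [v for v in (best - step, best, best + step) if v >= lower]

# Assumed coarse optima, chosen to match the lists in this step
max_depth = fine_grid(7)
min_child_weight = fine_grid(5)
print(max_depth)         # [6, 7, 8]
print(min_child_weight)  # [4, 5, 6]
print(fine_grid(1))      # [1, 2]  (values below `lower` are dropped)
```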

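The steps above give good values for max_depth and min_child_weight.

After setting max_depth=6 and min_child_weight=4, tune n_estimators again.

Tune the gamma parameter

Tune the regularization parameters

Lower the learning rate and tune the number of trees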
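
The remaining stages follow the same pattern as step 6: hold everything tuned so far fixed and search one small group of parameters at a time. The outline below is a sketch of how those stages could be written down. The parameter names (gamma, reg_alpha, reg_lambda, learning_rate) are real XGBClassifier parameters, but every value is a placeholder chosen for illustration, not a result from this post or the course.

```python
# Placeholder grids: the values are illustrative only.
stages = [
    ('gamma', {'gamma': [0, 0.1, 0.2, 0.5]}),
    ('regularization', {'reg_alpha': [0.1, 1, 2], 'reg_lambda': [0.5, 1, 2]}),
]

def grid_size(grid):
    """Number of candidate combinations in a parameter grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

sizes = {name: grid_size(grid) for name, grid in stages}
for name, grid in stages:
    print(name, sorted(grid), sizes[name])

# Last stage: not a grid search. Lower the learning rate (placeholder value)
# and re-run modelfit so early stopping picks the new number of trees.
final_learning_rate = 0.01
```

Searching the groups one after another keeps each grid small; a joint search over all of them would multiply the candidate counts together.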

See the video linked at the top of the post; it explains all of this in detail!
