Basic Usage of XGBoost & Kaggle Convenience-Store Sales Prediction

Basic Usage of XGBoost

Import XGBoost and the related packages:

from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load the data and extract the features and labels:

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')

X = dataset[:, 0:8]
y = dataset[:, 8]
dataset
#array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
#       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
#       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
#       ...,
#       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
#       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
#       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

Split the data into training and test sets:

seed = 7 
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
#((514, 8), (254, 8), (514,), (254,))

Create and train the model:

model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
#XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
#       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
#       n_jobs=-1, nthread=None, objective='binary:logistic',
#       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
#       seed=None, silent=True, subsample=1)

Use the trained model to predict on the test set, and compute the accuracy between the predictions and the actual labels:

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#Accuracy: 77.95%

Use the trained model to predict on the test set: first the predicted classes, then the predicted probability for each class:

y_pred = model.predict(X_test)
y_pred
#array([0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
#       1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
#       0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0.,
#       . . . 
#       0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
#       0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
#       0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.])
y_pred_proba = model.predict_proba(X_test)
y_pred_proba
#array([[0.9545844 , 0.04541559],
#       [0.05245447, 0.9475455 ],
#       [0.41897488, 0.5810251 ],
#
#       [0.42821795, 0.57178205],
#       [0.2364142 , 0.7635858 ],
#       [0.05780089, 0.9421991 ]], dtype=float32)

Monitoring model performance: during training, XGBoost can evaluate the model on a held-out evaluation set and print the score after every boosting round.

model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
#training stops once the validation logloss has not improved for 10 rounds
#with verbose=True, the logloss is printed after each tree is added
#[0]	validation_0-logloss:0.60491
#[1]	validation_0-logloss:0.55934
#[2]	validation_0-logloss:0.53068
#[3]	validation_0-logloss:0.51795
#[4]	validation_0-logloss:0.51153
#[5]	validation_0-logloss:0.50935
#[6]	validation_0-logloss:0.50818
#[7]	validation_0-logloss:0.51097
#[8]	validation_0-logloss:0.51760
#[9]	validation_0-logloss:0.51912
#[10]	validation_0-logloss:0.52503
#[11]	validation_0-logloss:0.52697
#[12]	validation_0-logloss:0.53335
#[13]	validation_0-logloss:0.53905
#[14]	validation_0-logloss:0.54546
#[15]	validation_0-logloss:0.54613
#[16]	validation_0-logloss:0.54982

Plot the importance of each feature:

from xgboost import plot_importance
from matplotlib import pyplot
%matplotlib inline

plot_importance(model)
pyplot.show()

[Figure: feature importance bar chart produced by plot_importance]
XGBoost chooses split features according to the gain in the structure score, and a feature's importance is the total number of times it appears as a split point across all trees. In other words, the more often an attribute is used to build the decision trees in the model, the higher its relative importance.
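
By default plot_importance ranks by that split count ('weight'); the underlying booster exposes other importance types too. A quick sketch with the standard xgboost API:

booster = model.get_booster()
booster.get_score(importance_type='weight')   # times each feature is used in a split
booster.get_score(importance_type='gain')     # average gain of the splits using the feature

plot_importance(model, importance_type='gain')  # rank by gain instead of split count
pyplot.show()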

Parameter tuning
How should we tune the parameters? Below are practical starting values for three hyperparameters. You can first set them within these ranges, plot learning curves (see the sketch after this list), and then adjust the parameters to find the best model:

  • learning_rate = 0.1 or smaller; the smaller it is, the more weak learners you need to add
  • tree_depth = 2~8
  • subsample = 30%~80% of the training set
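
A minimal learning-curve sketch, reusing X_train/X_test from above (this assumes the pre-2.0 sklearn wrapper used in this post, where eval_metric is passed to fit; evals_result() returns the per-round scores):

model = XGBClassifier(learning_rate=0.1)
model.fit(X_train, y_train, eval_metric="logloss",
          eval_set=[(X_train, y_train), (X_test, y_test)], verbose=False)
results = model.evals_result()
# validation_0 = the training set, validation_1 = the test set
epochs = range(len(results['validation_0']['logloss']))
pyplot.plot(epochs, results['validation_0']['logloss'], label='train')
pyplot.plot(epochs, results['validation_1']['logloss'], label='test')
pyplot.xlabel('boosting round')
pyplot.ylabel('logloss')
pyplot.legend()
pyplot.show()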

Next, using GridSearchCV makes tuning more convenient.

Parameter combinations worth tuning include:

  • number and size of trees (n_estimators and max_depth)
  • learning rate and number of trees (learning_rate and n_estimators)
  • row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel)

Import the packages needed for tuning:

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

Create the model and the parameter search space:

model_GS = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
max_depth = [1, 2, 3, 4, 5]
param_grid = dict(learning_rate=learning_rate, max_depth=max_depth)

Set up stratified k-fold cross-validation and create the search object:

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
grid_search = GridSearchCV(model_GS, param_grid=param_grid, scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)

# note: grid_search was fit on all of (X, y), so X_test was seen during the search
# and this accuracy is optimistic; for a clean estimate, search on the training split only
y_pred = grid_result.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#Accuracy: 81.10%
grid_result.best_score_, grid_result.best_params_
#(-0.47171179660714796, {'learning_rate': 0.2, 'max_depth': 1})
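
To see how every candidate in the grid performed, you can read cv_results_ (a quick sketch using standard scikit-learn attributes):

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, std, param))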

Convenience-Store Sales Prediction

Description

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can vary widely.

In their first Kaggle competition, Rossmann challenges you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what is most important to them: their customers and their teams!

We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Evaluation

Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE). The RMSPE is calculated as
\operatorname{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}
where y_i denotes the sales of a single store on a single day and yhat_i denotes the corresponding prediction. Any day and store with 0 sales is ignored in scoring.

Writing a fully custom loss function is rather involved (a custom objective must also supply the first and second derivatives; the loss value alone is not enough), so instead we handle this prediction task with a custom evaluation metric.
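
For intuition, here is a minimal sketch of the two interfaces in the native xgboost API (squared error is used purely as an illustration; preds are the raw predictions and dtrain is an xgb.DMatrix):

import numpy as np

def custom_objective(preds, dtrain):
    # a custom objective must return the per-sample gradient and hessian
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)     # second derivative is constant 1
    return grad, hess

def custom_eval(preds, dtrain):
    # a custom eval metric only has to return a (name, value) pair
    labels = dtrain.get_label()
    return "mae", float(np.mean(np.abs(preds - labels)))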

Files

  • train.csv - historical data including Sales
  • test.csv - historical data excluding Sales
  • sample_submission.csv - a sample submission file in the correct format
  • store.csv - supplemental information about the stores

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

  • Id - an Id that represents a (Store, Date) duple within the test set
  • Store - a unique Id for each store
  • Sales - the turnover for any given day (this is what you are predicting)
  • Customers - the number of customers on a given day
  • Open - an indicator for whether the store was open: 0 = closed, 1 = open
  • StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
  • SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
  • StoreType - differentiates between 4 different store models: a, b, c, d
  • Assortment - describes an assortment level: a = basic, b = extra, c = extended
  • CompetitionDistance - distance in meters to the nearest competitor store
  • CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
  • Promo - indicates whether a store is running a promo on that day
  • Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
  • Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
  • PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

A note on the promotion fields: Promo marks whether the store is running a promotion that day; Promo2 marks whether the store participates in the long-running, recurring promotion; Promo2Since[Year/Week] gives the year and calendar week in which that long-running promotion began; PromoInterval describes when each round of the promotion starts.

Import the required libraries:

import pandas as pd
import datetime
import csv
import numpy as np
import os
import scipy as sp
import xgboost as xgb
import itertools
import operator
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.base import TransformerMixin
from matplotlib import pylab as plt
plot = True

goal = 'Sales'
myid = 'Id'

When your eval metric and your loss function are not the same

Early stopping

We keep optimizing the original loss function, growing and adding one tree at a time; but on the validation set we watch the eval metric, and when that metric stops improving on the validation set, we stop growing the ensemble.

labeled data (training set) + unlabeled data to be predicted (test set)

training set = the actual training set + a validation set (used for model selection and parameter tuning)
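
A toy sketch of that partition (the real split happens inside XGB_native below, with test_size=0.05):

import pandas as pd
from sklearn.model_selection import train_test_split

labeled = pd.DataFrame({'x': range(100), 'Sales': range(100)})  # stand-in for the labeled data
fit_part, valid_part = train_test_split(labeled, test_size=0.05, random_state=42)
fit_part.shape, valid_part.shape
#((95, 2), (5, 2))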

Define some transformations and evaluation criteria

Pay special attention to this part whenever you switch to a different evaluation function.

def ToWeight(y):
    # y is an np.array; days with 0 sales get weight 0
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1. / (y[ind] ** 2)
    return w

def rmspe(yhat, y):
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return rmspe

def rmspe_xg(yhat, y):
    # custom eval metric for xgb.train; y is a DMatrix
    y = y.get_label()
    # the model is trained on log(Sales + 1), so map back before scoring
    y = np.exp(y) - 1
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return "rmspe", rmspe
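
As a quick sanity check of rmspe, a single day that is over-predicted by 10% should score 0.1:

rmspe(np.array([110.0]), np.array([100.0]))
#0.1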
Take a quick look at the raw data:

store = pd.read_csv('store.csv')
store.head()

[Output: first rows of store.csv]

train_df = pd.read_csv('train.csv')
train_df.head()

[Output: first rows of train.csv]

test_df = pd.read_csv('test.csv')
test_df.head()

[Output: first rows of test.csv]

Load the data

def load_data():
    """
    Load the data and identify numeric vs. non-numeric columns.
    """
    store = pd.read_csv('store.csv')
    # keep StateHoliday as strings, since it mixes '0' with letter codes
    # (pd.np was removed in newer pandas, so use str directly)
    train_org = pd.read_csv('train.csv', dtype={'StateHoliday': str})
    test_org = pd.read_csv('test.csv', dtype={'StateHoliday': str})
    train = pd.merge(train_org, store, on='Store', how='left')
    test = pd.merge(test_org, store, on='Store', how='left')
    features = test.columns.tolist()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    features_numeric = test.select_dtypes(include=numerics).columns.tolist()
    features_non_numeric = [f for f in features if f not in features_numeric]
    return (train, test, features, features_non_numeric)

Data and feature processing

def process_data(train, test, features, features_non_numeric):
    """
    Feature engineering and selection.
    """
    # FEATURE ENGINEERING
    train = train[train['Sales'] > 0]

    for data in [train, test]:
        # year, month, day
        data['year'] = data.Date.apply(lambda x: x.split('-')[0])
        data['year'] = data['year'].astype(float)
        data['month'] = data.Date.apply(lambda x: x.split('-')[1])
        data['month'] = data['month'].astype(float)
        data['day'] = data.Date.apply(lambda x: x.split('-')[2])
        data['day'] = data['day'].astype(float)

        # promo interval, e.g. "Jan,Apr,Jul,Oct": one indicator column per month.
        # A missing PromoInterval is NaN (a float), and `"Jan" in nan` raises
        # TypeError: argument of type 'float' is not iterable -- hence the isinstance guard.
        for mon in ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']:
            data['promo' + mon.lower()] = data.PromoInterval.apply(
                lambda x, m=mon: 0 if isinstance(x, float) else (1 if m in x else 0))

    # Feature set.
    noisy_features = [myid, 'Date']
    features = [c for c in features if c not in noisy_features]
    features_non_numeric = [c for c in features_non_numeric if c not in noisy_features]
    features.extend(['year', 'month', 'day'])
    # Fill NA
    class DataFrameImputer(TransformerMixin):
        # http://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn
        def __init__(self):
            """Impute missing values.
            Columns of dtype object are imputed with the most frequent value
            in column.
            Columns of other types are imputed with mean of column.
            """
        def fit(self, X, y=None):
            self.fill = pd.Series([X[c].value_counts().index[0] # mode
                if X[c].dtype == np.dtype('O') else X[c].mean() for c in X], # mean
                index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.fill)
    train = DataFrameImputer().fit_transform(train)
    test = DataFrameImputer().fit_transform(test)
    # Pre-processing non-numeric values
    le = LabelEncoder()
    for col in features_non_numeric:
        le.fit(list(train[col]) + list(test[col]))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])
    # Models such as LR and neural networks are extremely sensitive to input scale,
    # so standardize first
    scaler = StandardScaler()
    for col in set(features) - set(features_non_numeric) - \
      set([]): # TODO: add what not to scale
        # scikit-learn scalers expect 2-D input, hence the reshape(-1, 1)
        scaler.fit(np.array(list(train[col]) + list(test[col])).reshape(-1, 1))
        train[col] = scaler.transform(train[col].values.reshape(-1, 1)).ravel()
        test[col] = scaler.transform(test[col].values.reshape(-1, 1)).ravel()
    return (train, test, features, features_non_numeric)

Training and analysis

The label is log-transformed before training, and predictions are mapped back afterwards:

predict_result = log(y + 1)
y = e^(predict_result) - 1
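
Equivalently in numpy (log1p/expm1 are the numerically stable forms of these two maps; a quick sketch):

import numpy as np

y = np.array([0.0, 9.0, 999.0])
z = np.log1p(y)              # same as np.log(y + 1)
np.allclose(np.expm1(z), y)  # True -- expm1 inverts log1p
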
def XGB_native(train, test, features, features_non_numeric):
    depth = 6
    eta = 0.01
    ntrees = 8000
    mcw = 3
    params = {"objective": "reg:linear",  # renamed "reg:squarederror" in newer xgboost
              "booster": "gbtree",
              "eta": eta,
              "max_depth": depth,
              "min_child_weight": mcw,
              "subsample": 0.7,
              "colsample_bytree": 0.7,
              "silent": 1
              }
    print("Running with params: " + str(params))
    print("Running with ntrees: " + str(ntrees))
    print("Running with features: " + str(features))

    # Train model with local split
    tsize = 0.05
    X_train, X_test = train_test_split(train, test_size=tsize)
    dtrain = xgb.DMatrix(X_train[features], np.log(X_train[goal] + 1))
    dvalid = xgb.DMatrix(X_test[features], np.log(X_test[goal] + 1))
    # note: xgb.train uses the LAST entry of evals for early stopping, so this
    # watchlist stops on train-rmspe (see the log below); put dvalid last to
    # stop on validation performance instead
    watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
    gbm = xgb.train(params, dtrain, ntrees, evals=watchlist, early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=True)
    train_probs = gbm.predict(xgb.DMatrix(X_test[features]))
    indices = train_probs < 0
    train_probs[indices] = 0
    error = rmspe(np.exp(train_probs) - 1, X_test[goal].values)
    print(error)

    # Predict and export
    test_probs = gbm.predict(xgb.DMatrix(test[features]))
    indices = test_probs < 0
    test_probs[indices] = 0
    submission = pd.DataFrame({myid: test[myid], goal: np.exp(test_probs) - 1})
    if not os.path.exists('result/'):
        os.makedirs('result/')
    submission.to_csv("./result/dat-xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.csv" % (str(depth), str(eta), str(ntrees), str(mcw), str(tsize)), index=False)
    # Feature importance
    if plot:
        # write a feature map so the booster's fscore can be keyed by feature name
        outfile = open('xgb.fmap', 'w')
        i = 0
        for feat in features:
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))
            i = i + 1
        outfile.close()
        importance = gbm.get_fscore(fmap='xgb.fmap')
        importance = sorted(importance.items(), key=operator.itemgetter(1))
        df = pd.DataFrame(importance, columns=['feature', 'fscore'])
        df['fscore'] = df['fscore'] / df['fscore'].sum()
        # Plot it up
        plt.figure()
        df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(25, 15))
        plt.title('XGBoost Feature Importance')
        plt.xlabel('relative importance')
        plt.gcf().savefig('Feature_Importance_xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.png' % (str(depth), str(eta), str(ntrees), str(mcw), str(tsize)))

print("=> Loading data...")
train, test, features, features_non_numeric = load_data()
print("=> Processing data and feature engineering...")
train, test, features, features_non_numeric = process_data(train, test, features, features_non_numeric)
print("=> Training with XGBoost...")
XGB_native(train, test, features, features_non_numeric)
train.head()
# => Loading data...
# => Processing data and feature engineering...
# => Training with XGBoost...
# Running with params: {'subsample': 0.7, 'eta': 0.01, 'colsample_bytree': 0.7, 'silent': 1, 'objective': 'reg:linear', 'max_depth': 6, 'min_child_weight': 3, 'booster': 'gbtree'}
# Running with ntrees: 8000
# Running with features: ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'year', 'month', 'day']
# [0]	eval-rmspe:0.999864	train-rmspe:0.999864
# Multiple eval metrics have been passed: 'train-rmspe' will be used for early stopping.

# Will train until train-rmspe hasn't improved in 100 rounds.
# [1]	eval-rmspe:0.999838	train-rmspe:0.999837
# [2]	eval-rmspe:0.99981	train-rmspe:0.999809
# [3]	eval-rmspe:0.999779	train-rmspe:0.999779
#       . . . 
# [503]	eval-rmspe:0.314933	train-rmspe:0.342737
# [504]	eval-rmspe:0.315016	train-rmspe:0.342834
# [505]	eval-rmspe:0.31512	train-rmspe:0.342928
# Stopping. Best iteration:
# [405]	eval-rmspe:0.312829	train-rmspe:0.33589

# 0.315119522982