Basic usage of XGBoost
Import XGBoost and related packages:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Load the data and split it into a feature matrix and labels:
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
dataset
#array([[ 6. , 148. , 72. , ..., 0.627, 50. , 1. ],
# [ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],
# [ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],
# ...,
# [ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],
# [ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],
# [ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])
Split the data into training and test sets:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
#((514, 8), (254, 8), (514,), (254,))
Create and train the model:
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
#XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
# colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
# max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
# n_jobs=-1, nthread=None, objective='binary:logistic',
# random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
# seed=None, silent=True, subsample=1)
Use the trained model to predict on the test set, and compute the accuracy of the predictions against the true labels:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#Accuracy: 77.95%
Predict on the test set again, this time also obtaining the predicted probability of each class (predict returns hard class labels; predict_proba returns per-class probabilities):
y_pred = model.predict(X_test)
y_pred
#array([0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
# 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
# 0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0.,
# . . .
# 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
# 0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
# 0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.])
y_pred_proba = model.predict_proba(X_test)
y_pred_proba
#array([[0.9545844 , 0.04541559],
# [0.05245447, 0.9475455 ],
# [0.41897488, 0.5810251 ],
#
# [0.42821795, 0.57178205],
# [0.2364142 , 0.7635858 ],
# [0.05780089, 0.9421991 ]], dtype=float32)
Monitoring model performance: during training, XGBoost can evaluate the model on a held-out set and report the score at every boosting round.
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# training stops once the validation logloss has not improved for 10 rounds
# with verbose=True, the logloss is printed after each tree is added
#[0] validation_0-logloss:0.60491
#[1] validation_0-logloss:0.55934
#[2] validation_0-logloss:0.53068
#[3] validation_0-logloss:0.51795
#[4] validation_0-logloss:0.51153
#[5] validation_0-logloss:0.50935
#[6] validation_0-logloss:0.50818
#[7] validation_0-logloss:0.51097
#[8] validation_0-logloss:0.51760
#[9] validation_0-logloss:0.51912
#[10] validation_0-logloss:0.52503
#[11] validation_0-logloss:0.52697
#[12] validation_0-logloss:0.53335
#[13] validation_0-logloss:0.53905
#[14] validation_0-logloss:0.54546
#[15] validation_0-logloss:0.54613
#[16] validation_0-logloss:0.54982
Plot the importance of each feature:
from xgboost import plot_importance
from matplotlib import pyplot
%matplotlib inline
plot_importance(model)
pyplot.show()
XGBoost chooses split features by the gain in the structure score; here, a feature's importance is the total number of times it appears as a split point across all trees. In other words, the more often a feature is used to build the decision trees, the higher its relative importance (this count, importance_type='weight', is what plot_importance uses by default).
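Split count is only one of several importance measures. A minimal sketch of comparing it with the alternatives, assuming the fitted model from above (get_booster/get_score are available in recent xgboost versions):
booster = model.get_booster()
print(booster.get_score(importance_type='weight'))  # number of times a feature is used to split
print(booster.get_score(importance_type='gain'))    # average gain of the splits that use it
print(booster.get_score(importance_type='cover'))   # average number of samples its splits affect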
Parameter tuning
How should we tune? Below are practical starting values for three key hyperparameters. Set them within these ranges first, draw learning curves (see the sketch after this list), and then adjust further to find the best model:
- learning_rate = 0.1 or smaller; the smaller it is, the more weak learners (n_estimators) are needed
- max_depth = 2~8
- subsample = 30%~80% of the training set
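A minimal sketch of drawing those learning curves, assuming the X_train/X_test split from above (n_estimators=200 is an arbitrary illustrative choice):
model_lc = XGBClassifier(learning_rate=0.1, n_estimators=200)
model_lc.fit(X_train, y_train, eval_metric="logloss",
             eval_set=[(X_train, y_train), (X_test, y_test)], verbose=False)
history = model_lc.evals_result()   # per-round metric history for each eval set
pyplot.plot(history['validation_0']['logloss'], label='train')
pyplot.plot(history['validation_1']['logloss'], label='test')
pyplot.xlabel('boosting round')
pyplot.ylabel('logloss')
pyplot.legend()
pyplot.show()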
Next, using GridSearchCV makes tuning more convenient.
Useful parameter combinations to search include:
- number and size of the trees (n_estimators and max_depth)
- learning rate and number of trees (learning_rate and n_estimators)
- row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel)
Import the tuning-related packages:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
Create the model and the parameter search space:
model_GS = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
max_depth = [1, 2, 3, 4, 5]
param_grid = dict(learning_rate=learning_rate, max_depth=max_depth)
Set up stratified k-fold validation and create the search object. Note that the search below is fit on the full dataset (X, y), so the accuracy measured on X_test is optimistic: X_test was part of the data the search saw.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
grid_search = GridSearchCV(model_GS, param_grid=param_grid, scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)
y_pred = grid_result.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
#Accuracy: 81.10%
grid_result.best_score_, grid_result.best_params_
#(-0.47171179660714796, {'learning_rate': 0.2, 'max_depth': 1})
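Since refit=True by default, GridSearchCV retrains the best configuration on all of (X, y) once the search finishes; a minimal sketch of reusing that fitted model:
best_model = grid_result.best_estimator_   # already refit on the full data with the best params
print(best_model.predict(X_test[:5]))      # same interface as a plain XGBClassifier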
Rossmann store sales forecasting
Description
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can vary widely.
In their first Kaggle competition, Rossmann challenges you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what matters most to them: their customers and their teams!
You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.
Evaluation
Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE). The RMSPE is calculated as
$$\operatorname{RMSPE}=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right)^{2}}$$
where $y_i$ denotes the sales of a single store on a single day and $\hat{y}_i$ denotes the corresponding prediction. Any day and store with 0 sales is ignored in scoring.
Defining a full custom loss function here would be overly complex (a custom loss must supply the first- and second-order gradients; the loss value alone is not enough), so we can instead accomplish this prediction with a custom evaluation metric.
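For contrast, a minimal sketch of the two interfaces in XGBoost's native API; the function names are illustrative, not part of the original code:
import numpy as np

def squared_error_obj(preds, dtrain):
    # a custom objective must return the first- and second-order gradients
    labels = dtrain.get_label()
    grad = preds - labels            # first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)       # second derivative
    return grad, hess

def rmspe_eval(preds, dtrain):
    # a custom evaluation metric only needs to return a (name, value) pair
    labels = dtrain.get_label()
    mask = labels != 0               # days with 0 sales are ignored in scoring
    err = np.sqrt(np.mean(((labels[mask] - preds[mask]) / labels[mask]) ** 2))
    return 'rmspe', err

# usage with the native API: xgb.train(params, dtrain, obj=squared_error_obj, feval=rmspe_eval)
# the rmspe_xg function defined later in this section plays exactly the feval role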
Files
- train.csv - historical data including Sales
- test.csv - historical data excluding Sales
- sample_submission.csv - a sample submission file in the correct format
- store.csv - supplemental information about the stores
Data fields
Most of the fields are self-explanatory. The following are descriptions for those that aren't.
- Id - an Id that represents a (Store, Date) duple within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
Notes on the promotion fields:
- Promo: whether the store is running a promotion on that day
- Promo2: whether the store participates in a continuing, longer-term promotion
- Promo2Since[Year/Week]: the year and calendar week when the long-term promotion started
- PromoInterval: the months in which each round of the promotion starts anew
Import the required libraries
import pandas as pd
import datetime
import csv
import numpy as np
import os
import scipy as sp
import xgboost as xgb
import itertools
import operator
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module
from matplotlib import pylab as plt
plot = True      # whether to plot feature importances at the end
goal = 'Sales'   # the target column
myid = 'Id'      # the id column used in the submission
When your eval metric and your loss function are not the same:
Early stopping
Optimize the original loss function as usual, growing and adding one tree at a time, but watch the eval metric on the validation set; once that metric stops improving there, stop growing the ensemble.
Labeled data (the training set) + unlabeled data to be predicted (the test set).
Training set = the actual training set + a validation set (used for model selection and parameter tuning).
Define some transformations and evaluation criteria
Pay special attention to these when using a different evaluation function.
def ToWeight(y):
    # y is an np.array; weight each sample by 1/y^2 so that the
    # weighted squared error equals the squared percentage error
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0              # days with 0 sales get weight 0 (ignored in scoring)
    w[ind] = 1. / (y[ind] ** 2)
    return w

def rmspe(yhat, y):
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return rmspe

def rmspe_xg(yhat, y):
    # custom eval metric for xgb.train: y is the DMatrix, and both labels
    # and predictions live in log(y + 1) space, so invert the transform first
    y = y.get_label()
    y = np.exp(y) - 1
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return "rmspe", rmspe
store = pd.read_csv('store.csv')
store.head()
train_df = pd.read_csv('train.csv')
train_df.head()
test_df = pd.read_csv('test.csv')
test_df.head()
Load the data
def load_data():
    """
    Load the data and separate numeric from non-numeric columns.
    """
    store = pd.read_csv('store.csv')
    # pd.np was removed in recent pandas; read StateHoliday as plain strings
    train_org = pd.read_csv('train.csv', dtype={'StateHoliday': str})
    test_org = pd.read_csv('test.csv', dtype={'StateHoliday': str})
    train = pd.merge(train_org, store, on='Store', how='left')
    test = pd.merge(test_org, store, on='Store', how='left')
    features = test.columns.tolist()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    features_numeric = test.select_dtypes(include=numerics).columns.tolist()
    features_non_numeric = [f for f in features if f not in features_numeric]
    return (train, test, features, features_non_numeric)
Data and feature processing
def process_data(train, test, features, features_non_numeric):
    """
    Feature engineering and selection.
    """
    # FEATURE ENGINEERING
    train = train[train['Sales'] > 0]  # zero-sales days are ignored in scoring
    for data in [train, test]:
        # split the 'YYYY-MM-DD' Date string into numeric year / month / day
        data['year'] = data.Date.apply(lambda x: x.split('-')[0]).astype(float)
        data['month'] = data.Date.apply(lambda x: x.split('-')[1]).astype(float)
        data['day'] = data.Date.apply(lambda x: x.split('-')[2]).astype(float)
        # PromoInterval looks like "Jan,Apr,Jul,Oct"; missing values are NaN,
        # which is a float, and "Jan" in NaN raises
        # "TypeError: argument of type 'float' is not iterable" --
        # hence the isinstance(x, float) guard below.
        for m in ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']:
            data['promo' + m.lower()] = data.PromoInterval.apply(
                lambda x, m=m: 0 if isinstance(x, float) else (1 if m in x else 0))
    # Features set.
    noisy_features = [myid, 'Date']
    features = [c for c in features if c not in noisy_features]
    features_non_numeric = [c for c in features_non_numeric if c not in noisy_features]
    features.extend(['year', 'month', 'day'])
    # Fill NA
    class DataFrameImputer(TransformerMixin):
        # http://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn
        def __init__(self):
            """Impute missing values.
            Columns of dtype object are imputed with the most frequent value
            in column.
            Columns of other types are imputed with mean of column.
            """
        def fit(self, X, y=None):
            self.fill = pd.Series([X[c].value_counts().index[0]  # mode
                                   if X[c].dtype == np.dtype('O') else X[c].mean()  # mean
                                   for c in X],
                                  index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.fill)
    train = DataFrameImputer().fit_transform(train)
    test = DataFrameImputer().fit_transform(test)
    # Pre-processing non-numeric values
    le = LabelEncoder()
    for col in features_non_numeric:
        le.fit(list(train[col]) + list(test[col]))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])
    # Models such as LR and neural networks are extremely sensitive to the
    # scale of their inputs, so standardize the numeric features first
    scaler = StandardScaler()
    for col in set(features) - set(features_non_numeric) - \
            set([]):  # TODO: add what not to scale
        # recent scikit-learn expects 2D input, hence the reshapes
        scaler.fit(np.array(list(train[col]) + list(test[col]), dtype=float).reshape(-1, 1))
        train[col] = scaler.transform(train[col].values.reshape(-1, 1)).ravel()
        test[col] = scaler.transform(test[col].values.reshape(-1, 1)).ravel()
    return (train, test, features, features_non_numeric)
Training and analysis
The model is trained on a log-transformed target: predict_result = log(y + 1), and predictions are mapped back to the sales scale via y = e^(predict_result) - 1.
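A minimal sketch of that round trip (np.log1p and np.expm1 are the numerically stable forms of log(y + 1) and e^p - 1); training in log space keeps the relative errors that RMSPE penalizes roughly comparable across small and large stores:
import numpy as np
y = np.array([0.0, 9.0, 99.0])
p = np.log1p(y)         # log(y + 1): the target the model is actually trained on
y_back = np.expm1(p)    # e^p - 1: recover the original sales scale
print(y_back)           # [ 0.  9. 99.]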
def XGB_native(train, test, features, features_non_numeric):
    depth = 6
    eta = 0.01
    ntrees = 8000
    mcw = 3
    params = {"objective": "reg:linear",  # renamed "reg:squarederror" in newer XGBoost
              "booster": "gbtree",
              "eta": eta,
              "max_depth": depth,
              "min_child_weight": mcw,
              "subsample": 0.7,
              "colsample_bytree": 0.7,
              "silent": 1
              }
    print("Running with params: " + str(params))
    print("Running with ntrees: " + str(ntrees))
    print("Running with features: " + str(features))
    # Train model with local split; the target is trained in log(y + 1) space
    tsize = 0.05
    X_train, X_test = train_test_split(train, test_size=tsize)
    dtrain = xgb.DMatrix(X_train[features], np.log(X_train[goal] + 1))
    dvalid = xgb.DMatrix(X_test[features], np.log(X_test[goal] + 1))
    # note: xgboost early-stops on the *last* entry of evals; as ordered here that
    # is the training set (the log below confirms train-rmspe is used) -- put
    # (dvalid, 'eval') last to stop on validation error instead
    watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
    gbm = xgb.train(params, dtrain, ntrees, evals=watchlist,
                    early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=True)
    train_probs = gbm.predict(xgb.DMatrix(X_test[features]))
    indices = train_probs < 0
    train_probs[indices] = 0  # clip negative log-space predictions
    error = rmspe(np.exp(train_probs) - 1, X_test[goal].values)
    print(error)
    # Predict and Export
    test_probs = gbm.predict(xgb.DMatrix(test[features]))
    indices = test_probs < 0
    test_probs[indices] = 0
    submission = pd.DataFrame({myid: test[myid], goal: np.exp(test_probs) - 1})
    if not os.path.exists('result/'):
        os.makedirs('result/')
    submission.to_csv("./result/dat-xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.csv" % (
        str(depth), str(eta), str(ntrees), str(mcw), str(tsize)), index=False)
    # Feature importance
    if plot:
        outfile = open('xgb.fmap', 'w')
        i = 0
        for feat in features:
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))
            i = i + 1
        outfile.close()
        importance = gbm.get_fscore(fmap='xgb.fmap')
        importance = sorted(importance.items(), key=operator.itemgetter(1))
        df = pd.DataFrame(importance, columns=['feature', 'fscore'])
        df['fscore'] = df['fscore'] / df['fscore'].sum()
        # Plot it up
        df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(25, 15))
        plt.title('XGBoost Feature Importance')
        plt.xlabel('relative importance')
        plt.gcf().savefig('Feature_Importance_xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.png' % (
            str(depth), str(eta), str(ntrees), str(mcw), str(tsize)))
print "=> 载入数据中..."
train,test,features,features_non_numeric = load_data()
print "=> 处理数据与特征工程..."
train,test,features,features_non_numeric = process_data(train,test,features,features_non_numeric)
print "=> 使用XGBoost建模..."
XGB_native(train,test,features,features_non_numeric)
train.head()
# => Loading data...
# => Processing data and feature engineering...
# => Modeling with XGBoost...
# Running with params: {'subsample': 0.7, 'eta': 0.01, 'colsample_bytree': 0.7, 'silent': 1, 'objective': 'reg:linear', 'max_depth': 6, 'min_child_weight': 3, 'booster': 'gbtree'}
# Running with ntrees: 8000
# Running with features: ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'year', 'month', 'day']
# [0] eval-rmspe:0.999864 train-rmspe:0.999864
# Multiple eval metrics have been passed: 'train-rmspe' will be used for early stopping.
# Will train until train-rmspe hasn't improved in 100 rounds.
# [1] eval-rmspe:0.999838 train-rmspe:0.999837
# [2] eval-rmspe:0.99981 train-rmspe:0.999809
# [3] eval-rmspe:0.999779 train-rmspe:0.999779
# . . .
# [503] eval-rmspe:0.314933 train-rmspe:0.342737
# [504] eval-rmspe:0.315016 train-rmspe:0.342834
# [505] eval-rmspe:0.31512 train-rmspe:0.342928
# Stopping. Best iteration:
# [405] eval-rmspe:0.312829 train-rmspe:0.33589
# 0.315119522982