XGBoost

XGBoost is an open-source machine learning project developed by Tianqi Chen and others. It implements the GBDT algorithm efficiently, adds many algorithmic and engineering improvements, and is widely used in Kaggle and other machine learning competitions, where it has achieved strong results.

Any discussion of XGBoost has to start with GBDT (Gradient Boosting Decision Tree), because XGBoost is essentially still a GBDT, one that pushes speed and efficiency to the extreme, hence the X for eXtreme in the name. As noted earlier, both are boosting methods.

GBDT itself is not covered again here; see my separate post introducing it.

Differences between XGBoost and GBDT

  • GBDT is a machine learning algorithm; XGBoost is an engineering implementation of that algorithm.
  • When CART is used as the base learner, XGBoost explicitly adds a regularization term to control model complexity, which helps prevent overfitting and improves generalization.
  • GBDT uses only first-order derivative information of the loss function during training; XGBoost applies a second-order Taylor expansion to the loss and can use both first- and second-order derivatives.
  • Traditional GBDT uses CART as the base learner; XGBoost supports multiple kinds of base learners, such as linear models.
  • Traditional GBDT uses all of the data in every iteration; XGBoost adopts a strategy similar to random forests and supports subsampling the data.
  • Traditional GBDT has no built-in handling of missing values; XGBoost learns how to route missing values automatically.

Parts of the above are adapted from the blog post 终于有人说清楚了–XGBoost算法.

XGBoost formula derivation

For the full derivation, see 机器学习 集成算法XGBoost原理及推导.
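A minimal sketch of the core result, in standard XGBoost notation (my summary, not quoted from the linked post): at boosting round $t$ the model minimizes a regularized objective

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

which is approximated by a second-order Taylor expansion with $g_i = \partial_{\hat{y}^{(t-1)}} l$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l$:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t).$$

For a fixed tree structure, the optimal weight of leaf $j$ (with $I_j$ the samples routed to that leaf) and the gain of a candidate split are

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \text{Gain} = \frac{1}{2}\Big[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} \Big] - \gamma.$$

Here $\gamma$ and $\lambda$ are exactly the gamma and reg_lambda parameters discussed below.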

import pandas as pd
import numpy as np
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

# pandas display settings for rows and columns
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 200)

# silence version-compatibility and deprecation warnings
import warnings
warnings.filterwarnings("ignore")
# model with explicitly listed parameters
model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=10, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

XGBClassifier parameters

  • base_score [default=0.5]

    The initial prediction score of all instances (global bias).

  • n_estimators

    Number of boosting rounds.

  • booster [default=gbtree]

    Two boosters are available: gbtree and gblinear. gbtree boosts with tree-based models, gblinear with linear models. The default is gbtree.

  • colsample_bytree [default=1]

    Fraction of features sampled when building each tree. Range: (0,1]. subsample, colsample_bytree = 0.8 is the most common starting point; typical values lie between 0.5 and 0.9.

  • subsample [default=1]

    Fraction of the training samples used to grow each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the training set for each tree, which helps prevent overfitting. Range: (0,1].

  • colsample_bylevel

    Fraction of features sampled for each split level of the tree. Rarely used in practice, since subsample and colsample_bytree already play the same role, but it may be worth exploring.

  • colsample_bynode

    Fraction of features sampled at each split node.

  • gamma [default=0]

    Minimum loss reduction required to make a further partition on a leaf node of the tree; it is the coefficient of the leaf-count term in the regularization penalty. The larger it is, the more conservative the algorithm. Range: [0,∞].

  • learning_rates: a list of learning rates, one per boosting round.

  • max_delta_step [default=0]

    Maximum delta step allowed for each tree's weight estimate. 0 means no constraint; a positive value makes the update step more conservative. Usually unnecessary, but it can help logistic regression when classes are extremely imbalanced; values of 1-10 may help control the update. Range: [0,∞].

  • max_depth [default=6]

    Maximum tree depth. Range: [1,∞].

  • min_child_weight [default=1]

    Minimum sum of instance weights in a child node; if a split would create a leaf whose weight sum falls below this, the split is abandoned. For the linear booster this is the minimum number of samples per model. Range: [0,∞].

  • importance_type [str, default "weight"]

    The criterion behind feature importances:

    weight: the number of times a feature appears in the trees;

    gain: the average gain of the splits that use the feature;

    cover: the average coverage of the splits that use the feature, where coverage is the number of samples affected by the split.
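Once the model has been fitted (see the training section below), the three criteria can be compared directly on the underlying booster; a minimal sketch using the standard Booster.get_score API:

booster = model.get_booster()
# keys are feature names, values the importance under each criterion
for imp_type in ('weight', 'gain', 'cover'):
    print(imp_type, booster.get_score(importance_type=imp_type))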

Parameter descriptions

Parameter                 Description
n_jobs                    number of parallel threads
missing                   value to be treated as missing
num_parallel_tree         number of trees built in parallel per round
random_state              random seed
reg_alpha                 L1 regularization term
reg_lambda                L2 regularization term
scale_pos_weight          weighting to handle class imbalance
tree_method               tree construction algorithm
validate_parameters       validate input parameters
verbosity                 verbosity of printed log messages
interaction_constraints   feature interaction constraints
monotone_constraints      monotonicity constraints
gpu_id                    GPU ID
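For the class-imbalance knob specifically, a common heuristic (my addition, not from the original post) is to set scale_pos_weight to the negative/positive ratio of the training labels:

# heuristic: weight the positive class by the negative/positive ratio;
# y_train is assumed to be a 0/1 label vector such as the one built below
neg, pos = np.bincount(np.asarray(y_train).astype(int))
model_weighted = xgb.XGBClassifier(scale_pos_weight=neg / pos)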


Titanic passenger survival analysis

# load the Kaggle Titanic data and merge train/test into one frame
traindata_path = u'D:/01_Project/99_test/ML/titanic/train.csv'
testdata_path = u'D:/01_Project/99_test/ML/titanic/test.csv'
testresult_path = u'D:/01_Project/99_test/ML/titanic/gender_submission.csv'
df_train = pd.read_csv(traindata_path)
df_test = pd.read_csv(testdata_path)
df_test['Survived'] = pd.read_csv(testresult_path)['Survived']
data_original = pd.concat([df_train, df_test], sort=False)
display(data_original.head(5))
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S

Field descriptions

  • PassengerId => passenger ID
  • Pclass => ticket class (1st/2nd/3rd)
  • Name => passenger name
  • Sex => sex
  • Age => age
  • SibSp => number of siblings/spouses aboard
  • Parch => number of parents/children aboard
  • Ticket => ticket number
  • Fare => fare
  • Cabin => cabin
  • Embarked => port of embarkation
# inspect the data
data_original.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 132.9+ KB
features = list(data_original.columns[data_original.dtypes != 'object'])
print (features)
['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
# distribution of the numeric features
data_original[features].describe()
       PassengerId  Survived     Pclass       Age          SibSp        Parch        Fare
count  1309.000000  1309.000000  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000
mean   655.000000   0.377387     2.294882     29.881138    0.498854     0.385027     33.295479
std    378.020061   0.484918     0.837836     14.413493    1.041658     0.865560     51.758668
min    1.000000     0.000000     1.000000     0.170000     0.000000     0.000000     0.000000
25%    328.000000   0.000000     2.000000     21.000000    0.000000     0.000000     7.895800
50%    655.000000   0.000000     3.000000     28.000000    0.000000     0.000000     14.454200
75%    982.000000   1.000000     3.000000     39.000000    1.000000     0.000000     31.275000
max    1309.000000  1.000000     3.000000     80.000000    8.000000     9.000000     512.329200
# 查看类别特征值分布
print (data_original['Sex'].value_counts())
print (data_original['Embarked'].value_counts())
male      843
female    466
Name: Sex, dtype: int64
S    914
C    270
Q    123
Name: Embarked, dtype: int64

XGBoost can handle missing values natively, but it does not accept categorical (object-typed) features; they must be encoded first, otherwise training fails with an error like:

DataFrame.dtypes for data must be int, float or bool.

Did not expect the data types in fields Sex, Embarked

# one-hot encode the categorical features
data_onehot = pd.get_dummies(data_original, columns=['Sex','Embarked'])
# data_original['Sex'].replace('male', 0, inplace=True)   # inplace=True replaces in place
data_onehot.head()
   PassengerId  Survived  Pclass  Name                                                Age   SibSp  Parch  Ticket            Fare     Cabin  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0  1            0         3       Braund, Mr. Owen Harris                             22.0  1      0      A/5 21171         7.2500   NaN    0           1         0           0           1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  1      0      PC 17599          71.2833  C85    1           0         1           0           0
2  3            1         3       Heikkinen, Miss. Laina                              26.0  0      0      STON/O2. 3101282  7.9250   NaN    1           0         0           0           1
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        35.0  1      0      113803            53.1000  C123   1           0         0           0           1
4  5            0         3       Allen, Mr. William Henry                            35.0  0      0      373450            8.0500   NaN    0           1         0           0           1
# drop features that will not be used for training
drop_features = ['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin']
features_filted = list(data_onehot.columns.values)
for feature in drop_features:
    features_filted.remove(feature)
# features_filted = list(set(features_filted) - set(drop_features))
print(features_filted)

# split into training and validation sets
x_train, x_test, y_train, y_test = train_test_split(data_onehot[features_filted], data_onehot['Survived'], random_state=1, train_size=0.7)
display(x_train.shape)
display(x_test.shape)
display(y_train.shape)
display(y_test.shape)
['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

(916, 10)

(393, 10)

(916,)

(393,)

Model training

model.fit(x_train, y_train,eval_set=[(x_train,y_train),(x_test,y_test)],
       eval_metric=['error'],early_stopping_rounds=5,verbose=True)
[0]	validation_0-error:0.11572	validation_1-error:0.15013
Multiple eval metrics have been passed: 'validation_1-error' will be used for early stopping.

Will train until validation_1-error hasn't improved in 5 rounds.
[1]	validation_0-error:0.11572	validation_1-error:0.15013
[2]	validation_0-error:0.11572	validation_1-error:0.14758
[3]	validation_0-error:0.11572	validation_1-error:0.14758
[4]	validation_0-error:0.11572	validation_1-error:0.14758
[5]	validation_0-error:0.11572	validation_1-error:0.14758
[6]	validation_0-error:0.10808	validation_1-error:0.14758
[7]	validation_0-error:0.10808	validation_1-error:0.14758
Stopping. Best iteration:
[2]	validation_0-error:0.11572	validation_1-error:0.14758

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=10, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

Parameter notes

  • early_stopping_rounds: stop adding trees once the evaluation loss has failed to improve for 5 consecutive rounds; this acts as a training monitor.

  • eval_set: the dataset(s) evaluated during training.

  • verbose=False suppresses the per-round training log.

  • objective: the training objective (a usage sketch follows this list)

    Regression:

      reg:linear (default)

      reg:logistic

    Binary classification:

      binary:logistic (outputs probabilities)

      binary:logitraw (outputs the raw score before the logistic transform)

    Multi-class classification:

      multi:softmax, with num_class=n (returns the predicted class)

      multi:softprob, with num_class=n (returns per-class probabilities)

    Ranking:

      rank:pairwise

  • eval_metric

    Regression (default rmse):

      rmse: root mean squared error

      mae: mean absolute error

    Classification (default error):

      auc: area under the ROC curve

      error: error rate (binary)

      merror: error rate (multi-class)

      logloss: negative log-likelihood (binary)

      mlogloss: negative log-likelihood (multi-class)
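A short sketch of how these are passed in practice, consistent with the xgboost 1.x sklearn API used in this post (parameter values are illustrative, not tuned):

# binary classification with log-loss monitored on a validation split
clf_bin = xgb.XGBClassifier(objective='binary:logistic', n_estimators=50,
                            learning_rate=0.1)
clf_bin.fit(x_train, y_train,
            eval_set=[(x_test, y_test)],
            eval_metric='logloss',
            early_stopping_rounds=5,
            verbose=False)
proba = clf_bin.predict_proba(x_test)[:, 1]  # class-1 probabilities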

Feature importance

importance_df = pd.DataFrame({
    'features':x_train.columns.values,
    'importance':model.feature_importances_.tolist()
})
importance_df = importance_df.sort_values('importance',ascending=False)
importance_df
   features    importance
5  Sex_female  0.893800
0  Pclass      0.052338
4  Fare        0.015934
2  SibSp       0.015133
1  Age         0.011970
8  Embarked_Q  0.010825
3  Parch       0.000000
6  Sex_male    0.000000
7  Embarked_C  0.000000
9  Embarked_S  0.000000
plt.figure(figsize=(10, 6))
sns.barplot(importance_df['importance'][:20], importance_df['features'][:20])
plt.show()

[Figure: feature importance bar plot (output_19_0.png)]

Confusion matrix

pred_y_test = model.predict(x_test)
# m = metrics.confusion_matrix(y_test, pred_y_test)
# display (m)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, pred_y_test).ravel()
print ('matrix    label1   label0')
print ('predict1  {:<6d}   {:<6d}'.format(int(tp), int(fp)))
print ('predict0  {:<6d}   {:<6d}'.format(int(fn), int(tn)))
matrix    label1   label0
predict1  123      17    
predict0  41       212   
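The usual metrics follow directly from these four counts; a quick sketch (sklearn's metrics functions give the same numbers):

# derived metrics from the confusion-matrix counts above
accuracy = (tp + tn) / (tp + tn + fp + fn)    # (123 + 212) / 393, about 0.852
precision = tp / (tp + fp)                    # 123 / 140, about 0.879
recall = tp / (tp + fn)                       # 123 / 164, about 0.750
f1 = 2 * precision * recall / (precision + recall)
print('accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}'.format(
    accuracy, precision, recall, f1))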

Cross-validation

Validate the model's scores with 5-fold cross-validation.

score_x = x_train
score_y = y_train
# accuracy
scores = cross_val_score(model, score_x, score_y, cv=5, scoring='accuracy')
print('cross-validated accuracy: ' + str(scores.mean()))
cross-validated accuracy: 0.8700700879068662
# precision
scores = cross_val_score(model, score_x, score_y, cv=5, scoring='precision')
print('cross-validated precision: ' + str(scores.mean()))
cross-validated precision: 0.834282815104733
# recall
scores = cross_val_score(model, score_x, score_y, cv=5, scoring='recall')
print('cross-validated recall: ' + str(scores.mean()))
cross-validated recall: 0.8030303030303031
# f1_score
scores = cross_val_score(model, score_x, score_y, cv=5, scoring='f1')
print('cross-validated f1_score: ' + str(scores.mean()))
cross-validated f1_score: 0.81649043555853

TopN

TopN evaluation is useful when the classes are imbalanced and recall is the main concern. Titanic survival prediction is not a great fit for judging a model by TopN, but it illustrates the mechanics.

ratio_list = [0.01,0.02,0.05,0.1,0.2]
test_label = pd.DataFrame(y_test)
index_of_label1 = model.classes_.tolist().index(1)
pred_y_test = model.predict(x_test)
proba_y_test = model.predict_proba(x_test)
test_label['predict'] = pred_y_test
test_label['label_1'] = proba_y_test[:,index_of_label1]
display (test_label.head())

label_1_nbr = len(test_label[test_label['Survived']==1])
print ('label_1_nbr:',label_1_nbr)
print ('sample number:',len(test_label))

for ratio in ratio_list:
    num = test_label.sort_values('label_1',ascending=False)[:int(ratio*test_label.shape[0])]['Survived'].sum()
    count = test_label.sort_values('label_1',ascending=False)[:int(ratio*test_label.shape[0])]['Survived'].count()
    print ('Top %.2f label_1_nbr:%d,sample_nbr:%d,recall:%f'%(ratio,num,count,1.0*num/label_1_nbr))
     Survived  predict  label_1
201  0         0        0.406292
115  0         0        0.381720
255  1         1        0.520915
212  0         0        0.406292
195  1         1        0.622862
label_1_nbr: 164
sample number: 393
Top 0.01 label_1_nbr:3,sample_nbr:3,recall:0.018293
Top 0.02 label_1_nbr:7,sample_nbr:7,recall:0.042683
Top 0.05 label_1_nbr:19,sample_nbr:19,recall:0.115854
Top 0.10 label_1_nbr:38,sample_nbr:39,recall:0.231707
Top 0.20 label_1_nbr:75,sample_nbr:78,recall:0.457317

Grid search for the best parameters

param_grid = [
{'n_estimators': [3, 10, 30], 'max_depth': [2, 4, 6, 8],'learning_rate': [0.01,0.05,0.1]}
]

clf = xgb.XGBClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=5,scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print (grid_search.best_params_)
print (grid_search.best_estimator_)
{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 10}
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=10, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
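The winning estimator can be reused directly rather than refit by hand; a minimal follow-up sketch (my addition):

# reuse the best estimator found by the grid search
best_model = grid_search.best_estimator_
print('best CV score:', grid_search.best_score_)
print('hold-out accuracy:', best_model.score(x_test, y_test))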

Feature distributions for positive and negative samples

def KdePlot(df, label, factor, flag=None, positive=1):
    # kernel-density plot of one feature, split by label value
    plt.figure(figsize=(20, 10))
    sns.set(style='white')
    if positive == 0:
        df[factor] = np.abs(df[factor])
    if flag == 'log':
        x0 = np.log(df[df[label] == 0][factor] + 1)
        x1 = np.log(df[df[label] == 1][factor] + 1)
    else:
        x0 = df[df[label] == 0][factor]
        x1 = df[df[label] == 1][factor]

    sns.distplot(x0,
                 color='blue',
                 kde=True,    # draw the density curve
                 hist=True,   # draw the histogram
                 # rug=True,  # rug plot
                 kde_kws={'shade': True, 'color': 'green', 'facecolor': 'green', 'label': 'label_0'},
                 rug_kws={'color': 'green', 'height': 0.1, 'alpha': 0.1})
    plt.xlabel('%s' % factor, fontsize=40)
    plt.ylabel('label_0', fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.legend(loc='upper left', fontsize=30)

    # second y-axis for the positive class
    plt.twinx()

    sns.distplot(x1,
                 color='orange',
                 kde=True,
                 hist=True,
                 kde_kws={'shade': True, 'color': 'red', 'facecolor': 'red', 'label': 'label_1'},
                 rug_kws={'color': 'red', 'height': 0.1, 'alpha': 0.2})
    plt.ylabel('label_1', fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.legend(loc='upper right', fontsize=30)
    plt.show()

for factor in importance_df['features'].values:
    KdePlot(data_onehot, 'Survived', factor)

[Figures: per-feature KDE distributions split by Survived (output_34_0.png, output_34_1.png)]

A complete XGBoost example

The code below generates a batch of random data; even on such data, the trained model reaches high accuracy.

The example is adapted from an online post.

import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
import matplotlib.pyplot as plt
import matplotlib as mpl


def markData():
    # four partially overlapping point clouds, 30 points each
    x1 = 5 + np.random.rand(30) * 5
    y1 = 8 + np.random.rand(30) * 5

    x2 = 9 + np.random.rand(30) * 3
    y2 = 1 + np.random.rand(30) * 5

    x3 = 4 + np.random.rand(30) * 5
    y3 = 3 + np.random.rand(30) * 5

    x4 = 8 + np.random.rand(30) * 7
    y4 = 5 + np.random.rand(30) * 5

    # stack the coordinates into a (120, 2) feature matrix
    x = np.hstack((x1, x2, x3, x4))
    y = np.hstack((y1, y2, y3, y4))
    x = np.stack((x, y), axis=0).transpose()

    # class labels 0-3, one per cloud
    y = np.zeros(120)
    y[0:30] = 0
    y[30:60] = 1
    y[60:90] = 2
    y[90:120] = 3
    return x, y


if __name__ == '__main__':
    x, y = markData()
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, random_state=1)
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)

    # define the xgb model parameters
    parms = {'max_depth': 3, 'eta': 0.5, 'objective': 'multi:softmax', 'num_class': 4}
    watchlist = [(data_train, 'train'), (data_test, 'eval')]
    bst = xgb.train(parms, data_train, num_boost_round=6, evals=watchlist)
    y_hat = bst.predict(data_test)

    # compute accuracy on the held-out set
    print(np.mean(y_hat == y_test))

    # plot the decision regions over a 200x200 grid
    N, M = 200, 200
    x_min, x_max = np.min(x[:, 0]), np.max(x[:, 0])
    y_min, y_max = np.min(x[:, 1]), np.max(x[:, 1])
    x1 = np.linspace(x_min, x_max, N)
    x2 = np.linspace(y_min, y_max, M)
    tx, ty = np.meshgrid(x1, x2)
    xx = np.stack((tx.flat, ty.flat), axis=1)
    data_xx = xgb.DMatrix(xx)
    yy = bst.predict(data_xx)
    yy = yy.reshape(tx.shape)

    cmp_light = mpl.colors.ListedColormap(['#33FF33', '#FFCC66', '#FFF500', '#22CFCC'])
    cmp_dark = mpl.colors.ListedColormap(['r', 'g', 'b', 'k'])

    plt.figure()
    plt.pcolormesh(tx, ty, yy, cmap=cmp_light)
    plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k', cmap=cmp_dark)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.grid(True)
    plt.show()
[0]	train-merror:0.03333	eval-merror:0.20000
[1]	train-merror:0.02222	eval-merror:0.20000
[2]	train-merror:0.02222	eval-merror:0.20000
[3]	train-merror:0.01111	eval-merror:0.16667
[4]	train-merror:0.01111	eval-merror:0.20000
[5]	train-merror:0.01111	eval-merror:0.20000
0.8

[Figure: decision regions of the trained model with the generated points overlaid (output_36_1.png)]
