python xgboost: From Gradient Boost to XGBoost, Theory and Practice

Latest recommended article published on 2024-05-07 00:31:29

weixin_39708636

Tags: python xgboost

I. Gradient Boost Theory

References:

https://www.youtube.com/watch?v=3CC4N4z3GJc
https://www.youtube.com/watch?v=2xudPOBz-vs
https://www.youtube.com/watch?v=jxuNLH5dXCs
https://www.youtube.com/watch?v=StWY5QWMXCw

II. XGBoost Theory

References:

https://www.youtube.com/watch?v=OtD8wVaFm6E
https://www.youtube.com/watch?v=8b1JEDvenQU
树模型(六)：XGBoost (Tree Models, Part 6: XGBoost), 雪伦的专栏, CSDN blog, blog.csdn.net
通俗理解kaggle比赛大杀器xgboost (A Plain-Language Guide to xgboost, the Kaggle Competition Workhorse), 结构之法算法之道, CSDN blog, blog.csdn.net

XGBoost is short for eXtreme Gradient Boosting. In other words, XGBoost is one particular implementation of gradient boosting.
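To make the idea of gradient boosting concrete, here is a minimal sketch in plain Python. It is not how XGBoost is implemented: it assumes squared-error loss, one numeric feature, and depth-1 trees (stumps), and every name and data value in it is invented for illustration. Each round fits a stump to the current residuals (the negative gradient of the loss) and adds a shrunken copy of it to the ensemble.

```python
# Minimal gradient boosting for squared-error regression with depth-1 trees.
# Illustrative only: data, function names and settings are assumptions.

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    # Skip the largest value so the right side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_estimators=30, learning_rate=0.1):
    base = sum(y) / len(y)  # initial prediction: the mean of y
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For the loss 1/2 * (y - p)^2 the negative gradient is y - p.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, predictions


def mse(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, predictions = gradient_boost(x, y)
mse_before = mse(y, [base] * len(y))
mse_after = mse(y, predictions)
print(mse_before, mse_after)
```

XGBoost builds on the same loop but, among other things, uses second-order gradient information and explicit regularization when it grows each tree.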

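III. Gradient Boost in Practice

IV. XGBoost in Practice

References:

章华燕: 史上最详细的XGBoost实战 (The Most Detailed XGBoost Walkthrough Ever), zhuanlan.zhihu.com
SegmentFault article, segmentfault.com
Installation Guide, xgboost 1.3.0-SNAPSHOT documentation, xgboost.readthedocs.io

xgboost is a separate library that has to be installed on its own; it does not ship with sklearn.

Also, at the time of writing I could not find a release that worked with Python 3.8...

If you use Anaconda, you can create a Python 3.6 virtual environment, activate it, and install xgboost there. When working in jupyter notebook, remember to switch to the Python 3.6 environment.

The detailed steps are in another article of mine:

Cara: Keras安装 (Installing Keras), zhuanlan.zhihu.com

That is enough preamble. Everything below is hands-on.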

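1. Predicting with XGBoost on numeric data

The demo dataset can be downloaded here:

Link: https://pan.baidu.com/s/1PC2Gw4t8gQ2vhSftBEbPDg

Extraction code: pquc

The demo data is telecom customer churn data. The target field is Churn?, i.e. whether the customer churned, and the goal is to predict churn from a customer's features. After the preprocessing below, every feature used is numeric rather than a string.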

(1) Load the data

import pandas as pd
filename = '.../churn.csv'
data = pd.read_csv(filename,engine='python')
#In my Python 3.6 environment engine='python' was needed; without it read_csv raised: OSError: Initializing from file failed

(2) Inspect and preprocess the data

col_name = list(data.columns)#all column names
#Build the feature set by dropping uninformative columns and the label column 'Churn?'
col_drop=['State','Account Length','Area Code','Phone', 'Churn?']
x_col = [c for c in col_name if c not in col_drop]#a new list, so col_name itself is left unchanged

#Check the values each variable takes
print(data['Intl Plan'].value_counts())
'''
Output:
no     3010
yes     323
'''
print(data['VMail Plan'].value_counts())
'''
Output:
no     2411
yes     922
'''
print(data['Churn?'].value_counts())
'''
Output:
False.    2850
True.      483
'''

#Map yes to 1 and no to 0 (and 'True.' to 1, 'False.' to 0 for the label)
data['Intl Plan']=data['Intl Plan'].map({'yes':1,'no':0})
data['VMail Plan']=data['VMail Plan'].map({'yes':1,'no':0})
data['Churn?']=data['Churn?'].map({'True.':1,'False.':0})

(3) Split into training and test sets

from sklearn.model_selection import train_test_split
X = data[x_col]
y = data['Churn?']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=33)

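(4) Train the model

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
import time

'''
Advantages of the sklearn interface:
① Fitting is convenient: the data needs no special handling or conversion, whereas the native XGBoost interface requires converting it to a DMatrix first. Fitting and predicting also work exactly like any other sklearn model: fit, predict, ...
② Evaluation is convenient: sklearn's existing evaluation tools all work, e.g. ROC/AUC, score, cross_val_score, ...
'''

starttime = time.perf_counter()#start of fitting (time.clock() was removed in Python 3.8)
clf = XGBClassifier(
    n_estimators=30,#number of trees
    learning_rate=0.1,#0.1 is a common starting value. Too large a rate tends to overshoot the optimum; too small a rate makes learning slow.
    max_depth=3,#maximum tree depth: no path from the root contains more than 3 splits, which keeps each tree very simple.
    min_child_weight=1,#minimum sum of instance weights (Hessians) required in a child node, default 1. Too small a value tends to overfit, too large a value tends to underfit. A split that would leave a child below this sum is not made.
    gamma=0.3,#minimum loss reduction required to make a split. The larger gamma is, the more conservative and simpler the trees.
    subsample=0.8,#row subsampling ratio, default 1 (all rows are used), typically between 0.5 and 1. Only the training set from train_test_split is passed to fit, so subsample=0.8 means each tree is grown on a random 80% of the training rows.
    colsample_bytree=0.8,#column subsampling ratio, default 1 (all columns are used). colsample_bytree=0.8 means each tree is built from a random 80% of the columns.
    objective='binary:logistic',#logistic objective for binary classification; multi-class problems use a softmax objective. XGBClassifier picks a suitable objective from the labels, so this can be omitted.
    reg_lambda=1,#L2 regularization on the leaf weights. Larger values shrink the leaf weights and make the model more conservative; it does not rescale the input data.
    random_state=27#fixed random seed so the results are reproducible (random_state is the sklearn-style name for seed)
)

print("Start training...")
clf = clf.fit(X_train, y_train)
endtime = time.perf_counter()#end of fitting
time_consumption = endtime - starttime
print("XGBoost fit took {} seconds".format(time_consumption))
'''
Output:
Start training...
XGBoost fit took 0.15816159999999999 seconds
'''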
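The roles of reg_lambda, gamma and min_child_weight are easier to see in the split-gain formula from the XGBoost paper. For a candidate split with gradient sums G and Hessian sums H on each side, the gain is 1/2 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma. The sketch below evaluates that formula in plain Python; the function names and the numbers are invented for illustration, and the real library does considerably more than this.

```python
# Split gain as defined in the XGBoost paper, plus the min_child_weight check.
# Function names and example numbers are assumptions made for this sketch.

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def children_heavy_enough(h_left, h_right, min_child_weight=1.0):
    # A split is rejected if either child has too small a Hessian sum.
    return h_left >= min_child_weight and h_right >= min_child_weight


gain_small_gamma = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=0.3)
gain_large_gamma = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=5.0)
print(gain_small_gamma, gain_large_gamma)
print(children_heavy_enough(2.0, 2.0), children_heavy_enough(2.0, 0.5))
```

A split is only worth making when the gain is positive, so a larger gamma prunes more splits, and a larger reg_lambda lowers every score term and therefore the gain.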

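(5) Predict

#y_predict = clf.predict_proba(X_test)  returns, for each sample, the probability of each class. For binary classification the first column is the probability of class 0 and the second column the probability of class 1.
#y_predict = clf.predict(X_test)  returns the predicted class of each test sample, either 0 or 1.
y_predict = clf.predict(X_test)
print(y_predict)
'''
Output: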
[0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0]
'''

(6) Evaluate the model

from sklearn.metrics import roc_auc_score
#Note: y_predict holds hard 0/1 labels. AUC is normally computed from predicted probabilities, e.g. clf.predict_proba(X_test)[:, 1]; the value below is for hard labels.
AUCscore = roc_auc_score(y_test, y_predict)
print("XGBoost sklearn interface AUC Score : {}".format(AUCscore))
#XGBoost sklearn interface AUC Score : 0.8299435028248587

trainset_score = clf.score(X_train,y_train)
print('Accuracy on the training set:',trainset_score)
#Accuracy on the training set: 0.9533177725908636

testset_score = clf.score(X_test, y_test)
print('Accuracy on the test set:',testset_score)
#Accuracy on the test set: 0.9550898203592815

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test, y_predict))
'''
Output:
[[293   2]
 [ 13  26]]

'''
print(classification_report(y_test,y_predict))
#[Output: classification report]
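The confusion matrix printed above is enough to recompute the main metrics by hand. The sketch below does so in plain Python, assuming sklearn's layout (rows are true labels, columns are predicted labels); the variable names are my own. It also shows why the AUC reported above is fairly low: with hard 0/1 predictions, AUC reduces to the average of recall and specificity.

```python
# Metrics derived from the confusion matrix [[293, 2], [13, 26]].
# Rows are true labels, columns are predicted labels (sklearn's layout).
tn, fp = 293, 2
fn, tp = 13, 26

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted churners, how many churned
recall = tp / (tp + fn)         # of actual churners, how many were found
specificity = tn / (tn + fp)    # of non-churners, how many were kept as such
f1 = 2 * precision * recall / (precision + recall)

# With a single hard threshold the ROC curve has one interior point,
# so the area under it equals (recall + specificity) / 2.
auc_from_hard_labels = (recall + specificity) / 2

print(total, accuracy, precision, recall, f1, auc_from_hard_labels)
```

The accuracy and the hard-label AUC agree with the values printed by the code above, while recall shows that about a third of the actual churners in the test set are missed.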

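(7) Inspect feature importance

from xgboost import plot_importance
import matplotlib.pyplot as plt
fig,(ax1,ax2) = plt.subplots(nrows=2,ncols=1,figsize=(10,10),sharex=True,dpi=800)
'''
subplots draws several subplots in one figure.
nrows=2, ncols=1 lays the plots out in two rows and one column, i.e. stacked vertically. nrows=1, ncols=2 would place them side by side.
figsize sets the figure size in inches.
dpi sets the number of pixels per inch; the larger the value, the sharper the image.
sharex=True makes both plots share the same x axis (same tick spacing, same minimum and maximum), which makes them easy to compare.
plt.subplots returns two values. The first is the Figure object, usually stored in fig. The second holds the Axes objects, one per subplot; two subplots are drawn here, so (ax1,ax2) receive the two Axes.

'''
plot_importance(clf,height=0.5,max_num_features=10,ax=ax1)

'''
height sets the bar thickness.
max_num_features=n plots only the n most important features.
ax=ax1 draws onto the Axes created by plt.subplots().
importance_type selects how importance is measured (default 'weight'):
'weight': the number of times a feature is used to split the data across all trees. The more often a feature is used for splitting, the more important it is.
'gain': the average gain across all splits the feature is used in, i.e. the average loss reduction obtained when splitting on the feature. The larger the average gain, the more important the feature.
'cover': the average coverage across all splits the feature is used in. The coverage of a split is the sum of the second-order gradients (Hessians) of the training samples that reach it, which roughly measures how many samples the split affects.

'''
#With the default importance_type='weight', the score is the number of splits that use the feature; the more splits, the more important the feature.
plot_importance(clf,height=0.5,max_num_features=5,ax=ax2)
plt.show()

[Figure: feature importance plots for the top 10 and top 5 features]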
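The three importance types can rank features differently. The sketch below computes all three from a small list of split records in plain Python. The records are invented (they do not come from the trained model), and the function is only an illustration of the definitions, not XGBoost's own code.

```python
from collections import defaultdict

# Hypothetical split records (feature, gain, cover); the numbers are invented.
splits = [
    ("Intl Plan", 40.0, 100.0),
    ("Intl Plan", 20.0, 60.0),
    ("VMail Plan", 50.0, 80.0),
    ("Intl Plan", 30.0, 20.0),
]


def importance(split_records):
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in split_records:
        count[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    return {
        "weight": dict(count),                                    # number of splits
        "gain": {f: gain_sum[f] / count[f] for f in count},       # mean gain per split
        "cover": {f: cover_sum[f] / count[f] for f in count},     # mean cover per split
    }


scores = importance(splits)
for kind, values in scores.items():
    print(kind, values)
```

In this toy data 'Intl Plan' wins on weight because it is used more often, while 'VMail Plan' wins on gain and cover because its single split is larger.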

(8) Tune XGBoost with GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np
import time
from sklearn.metrics import accuracy_score
learning_rate = np.arange(0.1,0.5,0.1)
max_depth = range(3,10,1)
n_estimators = range(10,100,10)
param_grid = {'learning_rate': learning_rate,
              'max_depth': max_depth,
              'n_estimators':n_estimators}

xgb_model = XGBClassifier()
starttime = time.perf_counter()#start of the grid search (time.clock() was removed in Python 3.8)
gridSearch = GridSearchCV(xgb_model,param_grid,scoring='roc_auc',n_jobs=-1,cv=10)
gridFit = gridSearch.fit(X_train,y_train)
endtime = time.perf_counter()#end of the grid search
time_consumption = endtime - starttime
print("Grid search took {} seconds".format(time_consumption))
#Grid search took 201.7074844 seconds
print('Best parameter values: {0}'.format(gridFit.best_params_))
#Best parameter values: {'learning_rate': 0.4, 'max_depth': 6, 'n_estimators': 10}
print('Best model score: {0}'.format(gridFit.best_score_))#mean cross-validated ROC AUC
#Best model score: 0.9249765702366805
Y_pred = gridFit.predict(X_test)
accScore = accuracy_score(y_test,Y_pred)
print('accuracy:',accScore)
#accuracy: 0.9550898203592815
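The running time of the grid search follows directly from the size of the grid. The sketch below counts the combinations in plain Python; the learning-rate list is written out by hand and is assumed to match what np.arange(0.1, 0.5, 0.1) produces, up to floating-point rounding.

```python
from itertools import product

# Same grid as above, written out with plain Python lists.
learning_rate = [0.1, 0.2, 0.3, 0.4]        # assumed equal to np.arange(0.1, 0.5, 0.1)
max_depth = list(range(3, 10, 1))            # 3..9
n_estimators = list(range(10, 100, 10))      # 10..90
cv_folds = 10

combinations = list(product(learning_rate, max_depth, n_estimators))
n_fits = len(combinations) * cv_folds

print(len(combinations), "parameter combinations")
print(n_fits, "model fits during cross-validation")
```

GridSearchCV also refits the best combination on the whole training set by default, so one more fit is added on top of this count. Adding a fourth parameter with five values would multiply the total by five, which is why grids are usually kept coarse.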

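(9) Tune a single parameter, using the number of trees as an example

from sklearn.model_selection import cross_val_score
n_estimators = range(10,500,10) #10 to 490 trees in steps of 10, i.e. 10, 20, 30, ..., 490
scores = []
for i in n_estimators:
    xgb_model = XGBClassifier(n_estimators=i,learning_rate=0.1,random_state=420)
    scores.append(cross_val_score(xgb_model,X_train,y_train,cv=5).mean())
best_n_estimators = n_estimators[scores.index(max(scores))]
print('Best number of trees: {0}, score: {1}'.format(best_n_estimators,max(scores)))
#Best number of trees: 220, score: 0.9513144129104063

#Plot model accuracy against the number of trees
import matplotlib.pyplot as plt
plt.figure(figsize=(20,5))
plt.title("find the best n_estimators",fontsize = 18)
plt.plot(n_estimators,scores,c="red")
plt.xlabel("number of trees")
plt.ylabel("accuracy")
plt.show()

[Figure: cross-validated accuracy versus number of trees]

The number of trees determines the learning capacity of an XGBoost model: the more trees, the greater the capacity, and with enough trees XGBoost overfits readily. With little data the model also becomes unstable. As the plot above shows, adding trees raises accuracy quickly at first, but beyond a certain point more trees barely improve accuracy and only waste computation. As a rule of thumb, keep n_estimators below about 300.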
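Since extra trees cost computation for little gain, one alternative to taking the single highest score is to take the smallest number of trees whose score is within a small tolerance of the best. The sketch below shows that selection rule in plain Python. The scores are made up for illustration and the tolerance of 0.001 is an arbitrary assumption, not a recommended value.

```python
def smallest_within_tolerance(candidates, scores, tolerance=0.001):
    """Return the first candidate whose score is within tolerance of the best.

    Assumes candidates are sorted from cheapest to most expensive.
    """
    best = max(scores)
    for candidate, score in zip(candidates, scores):
        if score >= best - tolerance:
            return candidate


# Made-up cross-validation scores for five candidate tree counts.
candidates = [10, 20, 30, 40, 50]
scores = [0.90, 0.94, 0.9505, 0.951, 0.9508]

by_max = candidates[scores.index(max(scores))]
by_tolerance = smallest_within_tolerance(candidates, scores)
print(by_max, by_tolerance)
```

Here the highest score belongs to 40 trees, but 30 trees are within the tolerance and would be chosen instead.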

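2. XGBoost on string (categorical) data

The demo data can be downloaded here:

Link: https://pan.baidu.com/s/1cC0Gtv0mgPRzG1Nis_9w_Q

Extraction code: 5xnm

In this demo data the target variable is income and all other attributes are features. Many features are categorical, for example work class, whose values include government, private company, self-employed, ...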

(1) Load the data

import pandas as pd
path = '……adult.data'
rawdata = pd.read_table(path,sep=',',header=None)
#Add column names
rawdata.columns=["age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
                   "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
                   "hours_per_week", "native_country","salary"]
rawdata.head(3)

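(2) Inspect the data

#Use unique() to see which values a single field takes
#Taking the workclass field as an example
rawdata.workclass.unique()
#? marks a missing value. Note that there is a space before the question mark; every value has a leading space.
'''
#Output: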
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
'''

#Check the distribution of values in a single field
rawdata.workclass.value_counts()
'''
#Output:
Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
'''

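(3) Preprocess the data

#Strip the spaces from string values
delete_blank = lambda x:x.strip(" ") if isinstance(x,str) else x
data = rawdata.applymap(delete_blank)#applymap works element-wise, applying the function to every single value in the table.

#Replace the missing values marked with a question mark by nan across the whole dataset.
import numpy as np
data = data.replace("?",np.nan)

#Check which columns have missing values
print(data.isnull().any())
#Only the workclass, occupation and native_country fields have missing values.
'''
Output: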
age               False
workclass          True
fnlwgt            False
education         False
education_num     False
marital_status    False
occupation         True
relationship      False
race              False
sex               False
capital_gain      False
capital_loss      False
hours_per_week    False
native_country     True
salary            False
dtype: bool
'''


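#Fill the missing values in the workclass column
#workclass is a categorical attribute, so the most frequent value can be used to fill the gaps
data['workclass'] = data['workclass'].fillna(value = 'Private')


#Fill the missing values in the occupation column
#occupation is a categorical attribute, so the most frequent value can be used to fill the gaps
data['occupation'] = data['occupation'].fillna(value = 'Prof-specialty')

#Fill the missing values in the native_country column
#native_country is a categorical attribute, so the most frequent value can be used to fill the gaps
data['native_country'] = data['native_country'].fillna(value = 'United-States')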
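The three cleaning steps above (strip the spaces, turn "?" into a missing value, fill with the most frequent value) can be illustrated on a plain Python list. The sample values below are a hand-made subset in the style of the workclass column, not the real data, and the sketch computes the most frequent value instead of hard-coding it.

```python
from collections import Counter

# Hand-made sample in the style of the raw workclass column (leading spaces, ' ?').
raw = [" State-gov", " Private", " ?", " Private", " Self-emp-inc", " ?", " Private"]

# Step 1: strip surrounding spaces.
cleaned = [value.strip() for value in raw]

# Step 2: represent the '?' marker as a missing value.
with_missing = [None if value == "?" else value for value in cleaned]

# Step 3: fill missing entries with the most frequent observed value.
most_frequent = Counter(v for v in with_missing if v is not None).most_common(1)[0][0]
filled = [most_frequent if value is None else value for value in with_missing]

print(most_frequent)
print(filled)
```

Computing the fill value from the data avoids having to look it up by hand for each column, which is what the hard-coded 'Private', 'Prof-specialty' and 'United-States' above rely on.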

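(4) Visualize the data to understand it better

Reference:

【手把手机器学习入门到放弃】SVM支持向量机 (Hands-On Machine Learning, From Beginner to Giving Up: SVM), yao09605的博客, CSDN blog, blog.csdn.net

① Plot the distribution of income

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8,5))
plt.rcParams['figure.dpi'] = 500 #resolution
sns.color_palette("Set3")
sns.set(style="whitegrid")
sns.countplot(x=data.salary)#pass the column as x; recent seaborn versions reject it as a positional argument
plt.title("distribution of salary")
plt.show()

The labels are clearly imbalanced: there are far more samples with income at or below 50K than above 50K. When sampling, consider giving the samples above 50K a larger weight.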

② Look at the relationship between income and age

plt.figure(figsize=(8,10))
plt.rcParams['figure.dpi'] = 600 #resolution
sns.kdeplot(data[data["salary"]=="<=50K"].age, shade=True)
sns.kdeplot(data[data["salary"]==">50K"].age, shade=True)
plt.legend(["<=50K",">50K"])
plt.xlabel('age')
plt.title("age distribution over salary")
plt.show()

The higher-income group is older on average, above 40, while the lower-income group is mostly in its twenties.

③ Explore the relationship between income and working hours

plt.figure(figsize=(10,10))
plt.rcParams['figure.dpi'] = 800 #resolution
sns.set()#set(style="whitegrid")
sns.kdeplot(data[data["salary"]=="<=50K"]["hours_per_week"],vertical=False)
sns.kdeplot(data[data["salary"]==">50K"]["hours_per_week"],vertical=False)
plt.legend(["<=50K",">50K"])
plt.xlabel('hours_per_week')
plt.title("hours_per_week distribution over salary")
plt.show()

The lower-income group works fewer hours per week, while the higher-income group does tend to work longer hours.

④ Explore the relationship between age and working hours

plt.figure(figsize=(15,8))
plt.rcParams['figure.dpi'] = 500 #resolution
#cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.lmplot(x="age", y="hours_per_week",hue="salary",data=data,fit_reg=False,markers=["o", "x"])
#fit_reg=False means no fitted regression line is drawn
#hue="salary" colors the points by income class
#markers=["o", "x"] sets the marker for each class
plt.show()

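(5) Prepare the data for modeling

① Convert y to 0 and 1. As coded below, 0 means income above 50K and 1 means income at or below 50K

X = data.drop(['salary'], axis = 1)#drop the income column. axis = 1 drops a column; the default axis = 0 drops rows
y = data.salary
y = y.map({">50K":0,"<=50K":1})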
y.value_counts()
'''
1    24720
0     7841
Name: salary, dtype: int64
'''
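The class counts above quantify the imbalance mentioned earlier. The sketch below computes the class shares in plain Python, together with the ratio of negative to positive samples, which the XGBoost parameter documentation suggests as a typical starting value for the scale_pos_weight parameter. Whether reweighting helps here is an assumption that would need to be checked by experiment; note that with this article's encoding the positive class (1, income at or below 50K) is the majority.

```python
# Class counts taken from y.value_counts() above.
count_label_1 = 24720   # income <= 50K (encoded as 1)
count_label_0 = 7841    # income > 50K  (encoded as 0)

total = count_label_1 + count_label_0
share_label_1 = count_label_1 / total
share_label_0 = count_label_0 / total

# Typical starting value for scale_pos_weight: negatives / positives.
# Here label 1 is "positive", so the ratio is below 1 and down-weights the majority.
suggested_scale_pos_weight = count_label_0 / count_label_1

print(total, round(share_label_1, 4), round(share_label_0, 4))
print(round(suggested_scale_pos_weight, 3))
```

A classifier that always predicts the majority class would already be right on roughly three quarters of the samples, which is the baseline any accuracy figure for this dataset should be compared against.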

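② Transform the feature set X: convert categorical values to numeric ones

from sklearn.feature_extraction import DictVectorizer #feature transformer
vec = DictVectorizer(sparse=False)
X_transformed = vec.fit_transform(X.to_dict(orient='records'))#fit_transform() combines fit() and transform(); recent pandas versions require the full name 'records'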
print(X_transformed)
'''
[[3.9000e+01 2.1740e+03 0.0000e+00 ... 0.0000e+00 1.0000e+00 0.0000e+00]
 [5.0000e+01 0.0000e+00 0.0000e+00 ... 1.0000e+00 0.0000e+00 0.0000e+00]
 [3.8000e+01 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 0.0000e+00]
 ...
 [5.8000e+01 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 0.0000e+00]
 [2.2000e+01 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 0.0000e+00]
 [5.2000e+01 1.5024e+04 0.0000e+00 ... 0.0000e+00 0.0000e+00 0.0000e+00]]
'''
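print(X_transformed.shape)
#(32561, 105)
#X originally has shape (32561, 14); after the transformation it is (32561, 105). Why are there so many more columns?
#Because the transformation turns every distinct value of each categorical feature into a new 0/1 feature.

print(vec.get_feature_names())#scikit-learn 1.2 removed this method; use vec.get_feature_names_out() there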
'''
['age', 'capital_gain', 'capital_loss', 'education=10th', 'education=11th', 'education=12th', 
'education=1st-4th', 'education=5th-6th', 'education=7th-8th', 'education=9th', 'education=Assoc-acdm',
 'education=Assoc-voc', 'education=Bachelors', 'education=Doctorate', 'education=HS-grad', 'education=Masters',
 'education=Preschool', 'education=Prof-school', 'education=Some-college', 'education_num', 'fnlwgt', 'hours_per_week',
 'marital_status=Divorced', 'marital_status=Married-AF-spouse', 'marital_status=Married-civ-spouse', 
'marital_status=Married-spouse-absent', 'marital_status=Never-married', 'marital_status=Separated', 'marital_status=Widowed',
 'native_country=Cambodia', 'native_country=Canada', 'native_country=China', 'native_country=Columbia', 'native_country=Cuba',
 'native_country=Dominican-Republic', 'native_country=Ecuador', 'native_country=El-Salvador', 'native_country=England',
 'native_country=France', 'native_country=Germany', 'native_country=Greece', 'native_country=Guatemala', 'native_country=Haiti',
 'native_country=Holand-Netherlands', 'native_country=Honduras', 'native_country=Hong', 'native_country=Hungary', 'native_country=India', 
'native_country=Iran', 'native_country=Ireland', 'native_country=Italy', 'native_country=Jamaica', 'native_country=Japan',
 'native_country=Laos', 'native_country=Mexico', 'native_country=Nicaragua', 'native_country=Outlying-US(Guam-USVI-etc)', 
'native_country=Peru', 'native_country=Philippines', 'native_country=Poland', 'native_country=Portugal', 'native_country=Puerto-Rico',
 'native_country=Scotland', 'native_country=South', 'native_country=Taiwan', 'native_country=Thailand', 'native_country=Trinadad&Tobago', 
'native_country=United-States', 'native_country=Vietnam', 'native_country=Yugoslavia', 'occupation=Adm-clerical', 'occupation=Armed-Forces',
 'occupation=Craft-repair', 'occupation=Exec-managerial', 'occupation=Farming-fishing', 'occupation=Handlers-cleaners', 'occupation=Machine-op-inspct',
 'occupation=Other-service', 'occupation=Priv-house-serv', 'occupation=Prof-specialty', 'occupation=Protective-serv', 'occupation=Sales', 'occupation=Tech-support',
 'occupation=Transport-moving', 'race=Amer-Indian-Eskimo', 'race=Asian-Pac-Islander', 'race=Black', 'race=Other', 'race=White', 'relationship=Husband',
 'relationship=Not-in-family', 'relationship=Other-relative', 'relationship=Own-child', 'relationship=Unmarried', 'relationship=Wife', 'sex=Female', 'sex=Male',
 'workclass=Federal-gov', 'workclass=Local-gov', 'workclass=Never-worked', 'workclass=Private', 'workclass=Self-emp-inc', 'workclass=Self-emp-not-inc',
 'workclass=State-gov', 'workclass=Without-pay']
'''
print(len(vec.get_feature_names()))#105

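A small example of what DictVectorizer does.

Suppose the education field can take the values primary school, secondary school, undergraduate and postgraduate, and the raw data looks like this:

Sample ID	Education
001	postgraduate
002	primary school
003	secondary school
004	primary school
005	secondary school

After DictVectorizer the data looks like this:

Sample ID	Education=primary school	Education=secondary school	Education=postgraduate
001	0	0	1
002	1	0	0
003	0	1	0
004	1	0	0
005	0	1	0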
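The transformation in the table can be reproduced with a few lines of plain Python. This is only a sketch of the one-hot idea for string-valued fields, not DictVectorizer's actual code (which also passes numeric fields through unchanged); the function name and the sample records are my own. Like DictVectorizer's default behavior, it sorts the generated feature names, so the column order is alphabetical rather than the order used in the table.

```python
def one_hot(records):
    """One-hot encode string-valued dict records into a 0/1 matrix."""
    names = sorted({f"{key}={value}" for record in records for key, value in record.items()})
    matrix = []
    for record in records:
        present = {f"{key}={value}" for key, value in record.items()}
        matrix.append([1 if name in present else 0 for name in names])
    return names, matrix


records = [
    {"education": "postgraduate"},
    {"education": "primary school"},
    {"education": "secondary school"},
    {"education": "primary school"},
    {"education": "secondary school"},
]

names, matrix = one_hot(records)
print(names)
for row in matrix:
    print(row)
```

No column is created for undergraduate because that value never occurs in the sample, which matches the three columns in the table.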

③ Split the data into training and test sets

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_transformed,y,test_size=0.2,random_state=33)  # split the data
print(X_train.shape)
#(26048, 105)
print(X_test.shape)
#(6513, 105)
print(y_train.shape)
#(26048,)
print(y_test.shape)
#(6513,)

(6) Fit the model

from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(
	    n_estimators=10,#number of trees
	    learning_rate =0.1,
	    max_depth=5,
	    min_child_weight=1,
	    gamma=0.3,
	    subsample=0.8,
	    colsample_bytree=0.8,
	    objective= 'binary:logistic',
	    reg_lambda=1,
	    random_state=27)#random_state is the sklearn-style name for seed
xgb_model = clf.fit(X_train,y_train)

(7) Predict and evaluate

#Predict
y_predict = xgb_model.predict(X_test)
print(y_predict)#inspect the predictions

#Evaluate
print('Training accuracy (score):',xgb_model.score(X_train,y_train))
#Training accuracy (score): 0.8581081081081081
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test, y_predict))
'''
Output:
[[ 888  719]
 [ 238 4668]]
'''
print(classification_report(y_test,y_predict,target_names=['0','1']))#arguments are (y_true, y_pred)

[Output: classification report]

The model's accuracy is not very high; it can be tuned with the methods shown above.
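To put "not very high" into numbers, the confusion matrix above can be compared with a majority-class baseline. The sketch below does this in plain Python, assuming sklearn's layout (rows are true labels, columns are predictions) and this article's encoding (0 is income above 50K, 1 is income at or below 50K). Variable names are my own.

```python
# Confusion matrix from above: rows = true label, columns = predicted label.
true0_pred0, true0_pred1 = 888, 719     # true label 0 (income > 50K)
true1_pred0, true1_pred1 = 238, 4668    # true label 1 (income <= 50K)

n_true0 = true0_pred0 + true0_pred1
n_true1 = true1_pred0 + true1_pred1
total = n_true0 + n_true1

test_accuracy = (true0_pred0 + true1_pred1) / total
majority_baseline = max(n_true0, n_true1) / total   # always predict the larger class
recall_label_0 = true0_pred0 / n_true0
recall_label_1 = true1_pred1 / n_true1

print(total, round(test_accuracy, 4), round(majority_baseline, 4))
print(round(recall_label_0, 4), round(recall_label_1, 4))
```

The test accuracy is about ten points above the baseline, and most of the errors fall on the minority class (income above 50K), which is consistent with the class imbalance seen in the data.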

The End~
