Machine Learning | XGBoost Code Framework

jdmike

Article link: https://blog.csdn.net/RichardsZ_/article/details/113770395

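Ensemble Learning | XGBoost

Contents

Ensemble Learning | XGBoost
I. Modeling with DMatrix
- Converting libsvm data
- DataFrame data format
II. Modeling with the scikit-learn API (recommended)
III. Parameter reference
Predicting with only the first n trees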

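I. Modeling with DMatrix

A DMatrix is typically built from one of the following data sources:

a libsvm file
a CSV file read into a DataFrame, which is then converted to a DMatrix

Converting libsvm data

Reading a libsvm file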

#!/usr/bin/python
import xgboost as xgb

# Basic example: read data from libsvm files and train a binary classifier.
# The data is in libsvm format:
#1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
#0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
#0 1:1 10:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 122:1
# Note: newer XGBoost releases expect the file format to be stated explicitly,
# e.g. './data/agaricus.txt.train?format=libsvm'.
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')

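# Hyperparameters ('verbosity': 0 replaces the deprecated 'silent': 1)
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}

# The watchlist makes xgb.train report metrics on these datasets every round
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, evals=watchlist)

# Predict with the trained model (binary:logistic returns probabilities)
preds = bst.predict(dtest)

# Compute the error rate at a 0.5 threshold
labels = dtest.get_label()
print('Error rate: %f' %
      (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))

# Save the model
bst.save_model('./model/0001.model')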
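
The sample rows quoted in the script's comments follow the libsvm text layout: a label first, then `index:value` pairs for the non-zero features only. As a rough illustration of that layout (this is not XGBoost's own loader, and `parse_libsvm_line` is a hypothetical helper name), a minimal pure-Python parser might look like this:

```python
def parse_libsvm_line(line):
    """Split one libsvm-formatted row into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# A shortened row in the same style as the agaricus rows quoted in the script.
label, features = parse_libsvm_line("1 3:1 10:1 11:1 21:1")
print(label, features)
```

Features that do not appear in a row are simply absent from the dictionary, which is why the format suits sparse data.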

DataFrame data format

Reading a CSV file

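import pandas as pd
import xgboost as xgb

Train = pd.read_csv("./feat/train_sku_feat.csv")  # training set
Valid = pd.read_csv("./feat/valid_sku_feat.csv")  # validation set

# Keep the label column out of the features, and give each set its own labels
dtrain = xgb.DMatrix(Train.drop(columns=['label']), label=Train['label'])
dvalid = xgb.DMatrix(Valid.drop(columns=['label']), label=Valid['label'])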

#!/usr/bin/python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a CSV file and train a binary classifier.

# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split into training and test sets
train, test = train_test_split(data)

# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
# Initialize the DMatrix objects from numpy arrays
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Hyperparameters ('verbosity': 0 replaces the deprecated 'silent': 1)
param = {'max_depth': 5, 'eta': 0.1, 'verbosity': 0, 'subsample': 0.7, 'colsample_bytree': 0.7, 'objective': 'binary:logistic'}

# The watchlist makes xgb.train report metrics on these datasets every round
watchlist = [(xgtest, 'eval'), (xgtrain, 'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, evals=watchlist)

# Predict with the trained model (binary:logistic returns probabilities)
preds = bst.predict(xgtest)

# Compute the error rate at a 0.5 threshold
labels = xgtest.get_label()
print('Error rate: %f' %
      (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))

# Save the model
bst.save_model('./model/0002.model')
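
Because `binary:logistic` returns probabilities rather than class labels, the error rate in the script is obtained by thresholding each prediction at 0.5 and counting mismatches. The same calculation as a small standalone function, with made-up numbers (`error_rate` and the sample values are illustrative only, not part of XGBoost):

```python
def error_rate(preds, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum(
        1 for p, y in zip(preds, labels)
        if int(p > threshold) != int(y)
    )
    return wrong / len(labels)


# Hypothetical probabilities and true labels, for illustration only.
preds = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
print("Error rate: %f" % error_rate(preds, labels))
```

Exposing the threshold as an argument makes it easy to see how the error rate shifts when 0.5 is not the right cut-off, for example on imbalanced data.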

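II. Modeling with the scikit-learn API (recommended)

If you are already familiar with the scikit-learn modeling workflow, this approach is worth trying; it is also the one the author knows best.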

#!/usr/bin/python
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split


# Basic example: read data from a CSV file and train a binary classifier.

# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split into training and test sets
train, test = train_test_split(data)

# Extract the features X and the target y
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
train_X = train[feature_columns].values
train_y = train[target_column].values
test_X = test[feature_columns].values
test_y = test[target_column].values

# Initialize the model ('silent' has been replaced by 'verbosity')
xgb_classifier = xgb.XGBClassifier(n_estimators=20,
                                   max_depth=4,
                                   learning_rate=0.1,
                                   subsample=0.7,
                                   colsample_bytree=0.7,
                                   verbosity=1)

# Fit the model
xgb_classifier.fit(train_X, train_y)

# Predict class labels with the fitted model
preds = xgb_classifier.predict(test_X)

# Compute the error rate
print('Error rate: %f' % ((preds != test_y).sum() / float(test_y.shape[0])))

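III. Parameter reference

Booster Parameters
1.eta [default=0.3]: the shrinkage (learning rate) parameter. Each new tree's leaf weights are multiplied by this factor so that a single step is not too large. The larger the value, the more likely training fails to converge; a smaller eta makes each later round refine the model more carefully, at the cost of needing more rounds.

2.min_child_weight [default=1]: the minimum sum of second-order gradients (hessians) required in a child node. For 0-1 classification with imbalanced classes, suppose the per-sample hessian in a leaf is around 0.01; min_child_weight = 1 then means the leaf must contain at least about 100 samples. This parameter strongly affects the result: the smaller it is, the easier it is to overfit.

3.max_depth [default=6]: the maximum depth of each tree. The deeper the tree, the easier it is to overfit.

4.max_leaf_nodes: the maximum number of leaf nodes, overlapping somewhat with max_depth. (XGBoost's own name for this parameter is max_leaves.)

5.gamma [default=0]: the minimum loss reduction a split must achieve to be kept, which is what controls pruning. The larger the value, the more conservative the model.

6.max_delta_step [default=0]: acts during the update step. 0 means no constraint; a positive value makes the update more conservative by capping how large each step can be, so that updates are smoother.

7.subsample [default=1]: random sampling of training rows. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small cause underfitting.

8.colsample_bytree [default=1]: column sampling, i.e. the fraction of features sampled when building each tree. Typically set to 0.5-1.

9.lambda [default=1]: the L2 regularization term on the weights, which controls model complexity. The larger the value, the less prone the model is to overfitting.

10.alpha [default=0]: the L1 regularization term on the weights, which controls model complexity. The larger the value, the less prone the model is to overfitting.

11.scale_pos_weight [default=1]: a parameter for imbalanced classes, set to the ratio of negative to positive samples. For example, with 60000 negative samples and 20000 positive samples, there are 3 times as many negatives as positives, so scale_pos_weight = 3.
As the source code below shows, this effectively increases the weight of the positive samples.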

if (label == 1.0f) {
            w *= scale_pos_weight;
          }
 
# See the source file src/objective/regression_obj.cu
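
Several of the booster parameters above meet in the split decision. The sketch below is a simplified illustration based on the leaf-weight and split-gain formulas from the XGBoost paper, not the library's actual code: it ignores alpha and max_delta_step, and all function names are hypothetical. It shows where eta, lambda, gamma, min_child_weight and scale_pos_weight enter the arithmetic.

```python
def logistic_grad_hess(pred_prob, label, scale_pos_weight=1.0):
    """Gradient and hessian of log loss for one sample.

    Positive samples are reweighted by scale_pos_weight, mirroring the
    `w *= scale_pos_weight` line in the quoted source.
    """
    w = scale_pos_weight if label == 1 else 1.0
    grad = (pred_prob - label) * w
    hess = pred_prob * (1.0 - pred_prob) * w
    return grad, hess


def leaf_weight(G, H, reg_lambda=1.0, eta=0.3):
    """Optimal leaf weight -G / (H + lambda), shrunk by eta."""
    return -eta * G / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, with the gamma penalty subtracted."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


def split_allowed(GL, HL, GR, HR, min_child_weight=1.0, **kwargs):
    """Both children need enough hessian mass, and the gain must be positive."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, **kwargs) > 0.0


# scale_pos_weight from the class counts used in the text.
scale_pos_weight = 60000 / 20000
print(scale_pos_weight)

# With a per-sample hessian near 0.01, min_child_weight=1 needs about 100 samples.
print(round(1.0 / 0.01))
```

Here G and H are the sums of per-sample gradients and hessians in a node, which is why min_child_weight acts as a threshold on H rather than on a raw sample count.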

Learning Task Parameters
1.objective [default=reg:squarederror, named reg:linear in older releases]: the loss function to minimize. Commonly used values:
binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities)
you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
multi:softprob – same as softmax, but returns predicted probability of each data point belonging to each class.

2.eval_metric [default according to objective]:
The metric to be used for validation data.
The default values are rmse for regression and error for classification (logloss in newer releases).
Typical values are:
    rmse – root mean square error
    mae – mean absolute error
    logloss – negative log-likelihood
    error – binary classification error rate (0.5 threshold)
    merror – multiclass classification error rate
    mlogloss – multiclass logloss
    auc – area under the curve

3.seed [default=0]:
The random number seed. Can be used for generating reproducible results and also for parameter tuning.
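
To make the eval_metric names above concrete, here is a plain-Python sketch of three of them using their textbook definitions. It is not XGBoost's implementation; the function names and the clipping constant `eps` are assumptions made for this example.

```python
import math


def rmse(preds, labels):
    """Root mean square error."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))


def mae(preds, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def logloss(probs, labels, eps=1e-15):
    """Negative log-likelihood for binary labels; probs are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical predictions, for illustration only.
print(rmse([1.0, 2.0], [1.0, 4.0]))
print(mae([1.0, 2.0], [1.0, 4.0]))
print(logloss([0.5, 0.5], [1, 0]))
```

Unlike error, logloss penalizes confident wrong probabilities heavily, so the two metrics can rank the same models differently.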

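Predicting with only the first n trees

# Older releases used ntree_limit=9, which has since been deprecated in favor of iteration_range
ypred2 = bst.predict(xgtest, iteration_range=(0, 9))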
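
Limiting prediction to the first n trees works because a boosted model is additive: the raw score is a base value plus the sum of each tree's output, so any prefix of the trees is itself a usable model. A toy sketch of that idea follows; the three "trees" and the `predict_margin` helper are hypothetical stand-ins, not XGBoost objects.

```python
def predict_margin(trees, x, base_score=0.0, n_trees=None):
    """Sum the outputs of the first n_trees trees (all trees if n_trees is None)."""
    used = trees if n_trees is None else trees[:n_trees]
    return base_score + sum(tree(x) for tree in used)


# Three hypothetical "trees", each a function from a feature dict to a leaf value.
trees = [
    lambda x: 0.4 if x["Glucose"] > 120 else -0.4,
    lambda x: 0.2 if x["BMI"] > 30 else -0.2,
    lambda x: 0.1 if x["Age"] > 40 else -0.1,
]
sample = {"Glucose": 150, "BMI": 25, "Age": 50}

print(predict_margin(trees, sample))             # all three trees
print(predict_margin(trees, sample, n_trees=2))  # only the first two
```

Comparing predictions from different prefixes is a cheap way to check whether the later rounds still change the output.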
