XGBoost Python Package Documentation

Hexo blog: Yanbin's blog

Original English documentation: XGBoost Python Package

Installing XGBoost

After installing XGBoost, import the package:

import xgboost as xgb

Data Interface

The XGBoost Python module can load data from the following formats:

  • LibSVM text format file
  • Comma-separated values (CSV) file
  • NumPy 2D array
  • SciPy 2D sparse array
  • XGBoost binary buffer file

See Text Input Format of DMatrix for details on the text input format.

The data is stored in a DMatrix object.

  • Load a LibSVM text file or an XGBoost binary file into a DMatrix:
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
  • Load a CSV file into a DMatrix:
# label_column specifies the index of the column containing the true label
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')

(Note that XGBoost does not support categorical features; if your data contains categorical features, load it as a NumPy array first and then apply one-hot encoding.)
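
One-hot encoding can be done with NumPy alone. A minimal sketch, assuming the categories are already integer-coded (the cat array below is made up for illustration):

```python
import numpy as np

# Hypothetical integer-coded categorical column with 3 distinct categories
cat = np.array([0, 2, 1, 2, 0])

# Indexing the identity matrix yields one indicator column per category
onehot = np.eye(3)[cat]  # shape: (5 samples, 3 categories)
```

The resulting dense array can then be passed to the DMatrix constructor like any other NumPy array.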

  • Load a NumPy array into a DMatrix:
import numpy as np
data = np.random.rand(5, 10)  # 5 entities, each with 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
  • Load a scipy.sparse array into a DMatrix:
import scipy.sparse
# dat, row, col hold the non-zero values and their row/column indices
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
  • Saving the data to an XGBoost binary file makes subsequent loading faster:
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
  • Missing values can be replaced by a default value given in the DMatrix constructor:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
  • Weights can be set when needed:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
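
The missing=-999.0 argument above assumes the data already marks missing entries with that sentinel. A small NumPy-only sketch of mapping NaN entries to the sentinel before constructing the DMatrix:

```python
import numpy as np

data = np.random.rand(5, 10)
data[0, 0] = np.nan  # pretend one entry is missing

# Replace NaNs with the sentinel that DMatrix(..., missing=-999.0) expects
filled = np.where(np.isnan(data), -999.0, data)
```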

Setting Parameters

XGBoost parameters can be set using either a list of pairs or a dictionary. For example:

  • Booster parameters
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'
  • You can also specify multiple evaluation metrics:
param['eval_metric'] = ['auc', 'ams@0']

# alternatively, as a list of pairs:
# plst = list(param.items())
# plst += [('eval_metric', 'ams@0')]
  • Specify a validation set to monitor performance:
evallist = [(dtest, 'eval'), (dtrain, 'train')]
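
The dict-based and pair-based parameter forms are interchangeable; the pair form also allows repeating the eval_metric key. A sketch (note that in Python 3, dict.items() returns a view, so it must be wrapped in list() before extending):

```python
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

plst = list(param.items())           # dict -> list of (name, value) pairs
plst += [('eval_metric', 'auc'),     # the same key may appear more than once
         ('eval_metric', 'ams@0')]
```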

Training

Training a model requires a parameter list and a dataset.

num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)

After training, the model can be saved:

bst.save_model('0001.model')

The model and its feature map can also be dumped to a text file.

# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')

The model saved above can be loaded with the following code:

bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('model.bin')  # load saved model

Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there is more than one, the last one will be used.

train(..., evals=evals, early_stopping_rounds=10)

The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds rounds for training to continue.

If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration, and bst.best_ntree_limit. Note that xgboost.train() returns the model from the last iteration, not the best one.

This works with both metrics to minimize (RMSE, log loss, etc.) and metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric in param['eval_metric'], the last one is used for early stopping.
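
The stopping rule can be illustrated with a small sketch (this is an illustration, not XGBoost internals; the function and score series are made up): the best round so far is remembered, and training stops once the metric has failed to improve for early_stopping_rounds consecutive rounds.

```python
def early_stop_round(scores, early_stopping_rounds):
    """Return the round that would be reported as best (lower is better)."""
    best, best_round = float('inf'), 0
    for i, s in enumerate(scores):
        if s < best:
            best, best_round = s, i
        elif i - best_round >= early_stopping_rounds:
            break  # no improvement for early_stopping_rounds rounds: stop
    return best_round

# Validation error improves until round 2, then stalls
val_error = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
best = early_stop_round(val_error, early_stopping_rounds=3)
```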

Prediction

A trained or loaded model can run predictions on a dataset.

# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest)

If early stopping was enabled during training, you can get predictions from the best iteration with bst.best_ntree_limit:

ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)

Plotting

You can use the plotting module to plot feature importance and the output tree.

To plot importance, use xgboost.plot_importance(). This function requires matplotlib to be installed.

xgb.plot_importance(bst)

To plot the output tree via matplotlib, use xgboost.plot_tree(), specifying the ordinal number of the target tree. This function requires graphviz and matplotlib.

xgb.plot_tree(bst, num_trees=2)

When you use IPython, you can use the xgboost.to_graphviz() function, which converts the target tree into a graphviz instance. The graphviz instance is automatically rendered in IPython.

xgb.to_graphviz(bst, num_trees=2)
