Core XGBoost Library
https://xgboost.readthedocs.io/en/latest/python/python_api.html
class xgboost.DMatrix(data, label=None, missing=None, weight=None, silent=False, feature_names=None, feature_types=None, nthread=None)
Bases: object
Data Matrix used in XGBoost.
DMatrix is an internal data structure used by XGBoost, optimized for both memory efficiency and training speed. You can construct a DMatrix from numpy arrays.
Parameters
-
data (os.PathLike/string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame/cudf.DataFrame) – Data source of DMatrix. When data is a string or os.PathLike, it is the path to a libsvm-format text file or to a binary file that XGBoost can read.
-
label (list or numpy 1-D array, optional) – Label of the training data.
-
missing (float, optional) – Value in the dense input data (e.g. numpy.ndarray) that should be treated as a missing value. If None, defaults to np.nan.
-
weight (list or numpy 1-D array, optional) – Weight for each instance.
Note
For ranking tasks, weights are per-group.
In a ranking task, one weight is assigned to each group (not each data point). This is because only the relative ordering of data points within each group matters, so it doesn’t make sense to assign weights to individual data points.
-
silent (boolean, optional) – Whether to print messages during construction.
-
feature_names (list, optional) – Set names for features.
-
feature_types (list, optional) – Set types for features.
-
nthread (integer, optional) – Number of threads to use for loading data from numpy array. If -1, uses maximum threads available on the system.
property feature_names
Get feature names (column labels).
Returns
feature_names
Return type
list
property feature_types
Get feature types (column types).
Returns
feature_types
Return type
list
get_base_margin
()
Get the base margin of the DMatrix.
Returns
base_margin
Return type
get_float_info
(field)
Get float property from the DMatrix.
Parameters
field (str) – The field name of the information
Returns
info – a numpy array of float information of the data
Return type
array
get_label
()
Get the label of the DMatrix.
Returns
label
Return type
array
get_uint_info
(field)
Get unsigned integer property from the DMatrix.
Parameters
field (str) – The field name of the information
Returns
info – a numpy array of unsigned integer information of the data
Return type
array
get_weight
()
Get the weight of the DMatrix.
Returns
weight
Return type
array
num_col
()
Get the number of columns (features) in the DMatrix.
Returns
number of columns
Return type
int
num_row
()
Get the number of rows in the DMatrix.
Returns
number of rows
Return type
int
save_binary
(fname, silent=True)
Save DMatrix to an XGBoost buffer. The saved binary can later be loaded by providing its path to xgboost.DMatrix() as input.
Parameters
-
fname (string or os.PathLike) – Name of the output buffer file.
-
silent (bool (optional; default: True)) – If set, the output is suppressed.
set_base_margin
(margin)
Set base margin of booster to start from.
This can be used to supply the predictions of an existing model as base_margin. Note that the margin is required, not the transformed prediction. For example, for logistic regression, pass the value before the logistic transformation. See also example/demo.py.
Parameters
margin (array like) – Prediction margin of each datapoint
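The distinction between a margin and a transformed prediction can be made concrete with a little arithmetic. The sketch below uses only the standard library; the probabilities are invented, and sigmoid and logit are helper names introduced here, not XGBoost functions. For a logistic objective the margin is the log-odds, so an existing model's probabilities would be converted back before being passed as margin.

```python
import math

def sigmoid(margin):
    """Logistic transformation: margin (log-odds) -> probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def logit(prob):
    """Inverse of the logistic transformation: probability -> margin."""
    return math.log(prob / (1.0 - prob))

# Hypothetical probabilities predicted by an existing model.
probs = [0.5, 0.9, 0.2]

# These log-odds values, not the probabilities themselves, are what a
# base margin for a logistic objective should contain.
margins = [logit(p) for p in probs]

# Round trip: applying the logistic transformation recovers the inputs.
recovered = [sigmoid(m) for m in margins]
print(margins)
print(recovered)
```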
set_float_info
(field, data)
Set float type property into the DMatrix.
Parameters
-
field (str) – The field name of the information
-
data (numpy array) – The array of data to be set
set_float_info_npy2d
(field, data)
Set float type property into the DMatrix for numpy 2D array input.
Parameters
-
field (str) – The field name of the information
-
data (numpy array) – The array of data to be set
set_group
(group)
Set group size of DMatrix (used for ranking).
Parameters
group (array like) – Group size of each group
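Group sizes are often derived from a per-row query identifier. The following sketch uses only the standard library and assumes the rows are already ordered so that each query's rows are contiguous; the qid values are invented for illustration. The resulting list has the form expected for group.

```python
from itertools import groupby

# Hypothetical query id for each row, with rows of the same query adjacent.
qid = [10, 10, 10, 42, 42, 7]

# One entry per group: the number of consecutive rows sharing a query id.
group_sizes = [len(list(rows)) for _, rows in groupby(qid)]
print(group_sizes)  # [3, 2, 1]

# The sizes must add up to the total number of rows.
assert sum(group_sizes) == len(qid)
```

The list could then be passed to set_group on a DMatrix holding those rows in the same order.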
set_interface_info
(field, data)
Set info type property into DMatrix.
set_label
(label)
Set the label of the DMatrix.
Parameters
label (array like) – The label information to be set into DMatrix
set_label_npy2d
(label)
Set the label of the DMatrix.
Parameters
label (array like) – The label information to be set into DMatrix from numpy 2D array
set_uint_info
(field, data)
Set uint type property into the DMatrix.
Parameters
-
field (str) – The field name of the information
-
data (numpy array) – The array of data to be set
set_weight
(weight)
Set weight of each instance.
Parameters
weight (array like) – Weight for each data point.
Note
For ranking tasks, weights are per-group.
In a ranking task, one weight is assigned to each group (not each data point). This is because only the relative ordering of data points within each group matters, so it doesn’t make sense to assign weights to individual data points.
set_weight_npy2d
(weight)
Set weight of each instance for numpy 2D array input.
Parameters
weight (array like) – Weight for each data point in a numpy 2D array.
Note
For ranking tasks, weights are per-group.
In a ranking task, one weight is assigned to each group (not each data point). This is because only the relative ordering of data points within each group matters, so it doesn’t make sense to assign weights to individual data points.
slice
(rindex, allow_groups=False)
Slice the DMatrix and return a new DMatrix that contains only the rows selected by rindex.
Parameters
-
rindex (list) – List of indices to be selected.
-
allow_groups (boolean) – Allow slicing of a matrix with a groups attribute
Returns
res – A new DMatrix containing only selected indices.
Return type
DMatrix
class xgboost.Booster(params=None, cache=(), model_file=None)
Bases: object
A Booster of XGBoost.
Booster is the model of XGBoost. It contains low-level routines for training, prediction, and evaluation.
Parameters
-
params (dict) – Parameters for boosters.
-
cache (list) – List of cache items.
-
model_file (string or os.PathLike) – Path to the model file.
attr
(key)
Get attribute string from the Booster.
Parameters
key (str) – The key to get attribute from.
Returns
value – The attribute value of the key. Returns None if the attribute does not exist.
Return type
attributes
()
Get attributes stored in the Booster as a dictionary.
Returns
result – Returns an empty dict if there are no attributes.
Return type
dictionary of attribute_name: attribute_value pairs of strings.
boost
(dtrain, grad, hess)
Boost the booster for one iteration, with customized gradient statistics. Like xgboost.core.Booster.update(), this function should not be called directly by users.
Parameters
-
dtrain (DMatrix) – The training DMatrix.
-
grad (list) – The first-order gradient.
-
hess (list) – The second-order gradient.
copy
()
Copy the booster object.
Returns
booster – a copied booster model
Return type
Booster
dump_model
(fout, fmap='', with_stats=False, dump_format='text')
Dump model into a text or JSON file.
Parameters
-
fout (string or os.PathLike) – Output file name.
-
fmap (string or os.PathLike, optional) – Name of the file containing feature map names.
-
with_stats (bool, optional) – Controls whether the split statistics are output.
-
dump_format (string, optional) – Format of model dump file. Can be ‘text’ or ‘json’.
eval
(data, name='eval', iteration=0)
Evaluate the model on data.
Parameters
-
data (DMatrix) – The dmatrix storing the input.
-
name (str, optional) – The name of the dataset.
-
iteration (int, optional) – The current iteration number.
Returns
result – Evaluation result string.
Return type
str
eval_set
(evals, iteration=0, feval=None)
Evaluate a set of data.
Parameters
-
evals (list of tuples (DMatrix, string)) – List of items to be evaluated.
-
iteration (int) – Current iteration.
-
feval (function) – Custom evaluation function.
Returns
result – Evaluation result string.
Return type
str
get_dump
(fmap='', with_stats=False, dump_format='text')
Returns the model dump as a list of strings.
Parameters
-
fmap (string or os.PathLike, optional) – Name of the file containing feature map names.
-
with_stats (bool, optional) – Controls whether the split statistics are output.
-
dump_format (string, optional) – Format of model dump. Can be ‘text’, ‘json’ or ‘dot’.
get_fscore
(fmap='')
Get feature importance of each feature.
Note
Feature importance is defined only for tree boosters
Feature importance is only defined when the decision tree model is chosen as base learner (booster=gbtree). It is not defined for other base learner types, such as linear learners (booster=gblinear).
Note
Zero-importance features will not be included
Keep in mind that this function does not include zero-importance features, i.e. features that have not been used in any split condition.
Parameters
fmap (str or os.PathLike (optional)) – The name of feature map file
get_score
(fmap='', importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
-
‘weight’: the number of times a feature is used to split the data across all trees.
-
‘gain’: the average gain across all splits the feature is used in.
-
‘cover’: the average coverage across all splits the feature is used in.
-
‘total_gain’: the total gain across all splits the feature is used in.
-
‘total_cover’: the total coverage across all splits the feature is used in.
Note
Feature importance is defined only for tree boosters
Feature importance is only defined when the decision tree model is chosen as base learner (booster=gbtree). It is not defined for other base learner types, such as linear learners (booster=gblinear).
Parameters
-
fmap (str or os.PathLike (optional)) – The name of feature map file.
-
importance_type (str, default 'weight') – One of the importance types defined above.
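The five importance types are all aggregations over the same per-split records. The sketch below recomputes them by hand from a made-up list of splits, using only the standard library. The feature names, gain values, and cover values are hypothetical, and the code mirrors the definitions above rather than XGBoost's internals.

```python
from collections import defaultdict

# Hypothetical (feature, gain, cover) records, one per split across all trees.
splits = [
    ("f0", 4.0, 100.0),
    ("f1", 1.0, 60.0),
    ("f0", 2.0, 40.0),
]

weight = defaultdict(int)         # number of splits using the feature
total_gain = defaultdict(float)   # summed gain over those splits
total_cover = defaultdict(float)  # summed cover over those splits

for feature, gain, cover in splits:
    weight[feature] += 1
    total_gain[feature] += gain
    total_cover[feature] += cover

# 'gain' and 'cover' are the per-split averages of the totals.
avg_gain = {f: total_gain[f] / weight[f] for f in weight}
avg_cover = {f: total_cover[f] / weight[f] for f in weight}

print(dict(weight))
print(avg_gain)
print(avg_cover)
```

A feature that appears in no split never enters these dictionaries, which matches the behaviour of omitting zero-importance features.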
get_split_value_histogram
(feature, fmap='', bins=None, as_pandas=True)
Get split value histogram of a feature
Parameters
-
feature (str) – The name of the feature.
-
fmap (str or os.PathLike (optional)) – The name of feature map file.
-
bins (int, default None) – The maximum number of bins. The number of bins equals the number of unique split values n_unique if bins == None or bins > n_unique.
-
as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.
Returns
A histogram of the split values used for the specified feature, either as a numpy array or a pandas DataFrame.
load_model
(fname)
Load the model from a file.
The model is loaded from an XGBoost internal binary format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded. To preserve all attributes, pickle the Booster object.
Parameters
fname (string, os.PathLike, or a memory buffer) – Input file name or memory buffer (see also save_raw).
load_rabit_checkpoint
()
Initialize the model by loading from a rabit checkpoint.
Returns
version – The version number of the model.
Return type
integer
predict
(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True)
Predict with data.
Note
This function is not thread safe.
For each booster object, predict can only be called from one thread. If you want to run prediction using multiple threads, call bst.copy() to make copies of the model object and then call predict().
Note
Using predict() with DART booster
If the booster object is DART type, predict() will perform dropouts, i.e. only some of the trees will be evaluated. This will produce incorrect results if data is not the training data. To obtain correct results on test sets, set ntree_limit to a nonzero value, e.g.
preds = bst.predict(dtest, ntree_limit=num_round)
Parameters
-
data (DMatrix) – The dmatrix storing the input.
-
output_margin (bool) – Whether to output the raw untransformed margin value.
-
ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all trees).
-
pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.
-
pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.
-
approx_contribs (bool) – Approximate the contributions of each feature
-
pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.
-
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
Returns
prediction
Return type
numpy array
save_model
(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved. To preserve all attributes, pickle the Booster object.
Parameters
fname (string or os.PathLike) – Output file name
save_rabit_checkpoint
()
Save the current booster to rabit checkpoint.
save_raw
()
Save the model to an in-memory buffer representation.
Returns
Return type
an in-memory buffer representation of the model
set_attr
(**kwargs)
Set the attribute of the Booster.
Parameters
**kwargs – The attributes to set. Setting a value to None deletes an attribute.
set_param
(params, value=None)
Set parameters into the Booster.
Parameters
-
params (dict/list/str) – A list of (key, value) pairs, a dict mapping keys to values, or a single str key.
-
value (optional) – Value of the specified parameter, used when params is a str key.
trees_to_dataframe
(fmap='')
Parse a boosted tree model text dump into a pandas DataFrame structure.
This feature is only defined when the decision tree model is chosen as base learner (booster in {gbtree, dart}). It is not defined for other base learner types, such as linear learners (booster=gblinear).
Parameters
fmap (str or os.PathLike (optional)) – The name of feature map file.
update
(dtrain, iteration, fobj=None)
Update for one iteration, with objective function calculated internally. This function should not be called directly by users.
Parameters
-
dtrain (DMatrix) – Training data.
-
iteration (int) – Current iteration number.
-
fobj (function) – Customized objective function.
Learning API
Training Library containing training routines.
xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)
Train a booster with given parameters.
Parameters