# Gradient Tree Boosting (GBM, GBRT, GBDT, MART): Algorithm Notes and an Implementation with XGBoost/Scikit-learn

1. Overview

Gradient Tree Boosting (also known as GBM, GBRT, GBDT, or MART) is a widely used family of ensemble learning algorithms that has repeatedly delivered the best performance on classification and regression tasks in data-mining competitions such as KDD Cup and those hosted on Kaggle. LambdaMART, the winning algorithm of the 2010 Yahoo! Learning to Rank Challenge, also belongs to this family. As a result, tree boosting algorithms are, like deep learning models (DNN/CNN/RNN), used very widely in both industry and academia.


2. Reading Notes (note: the figures excerpted in this section come from Tianqi's slides [2]; thanks to the original author for the excellent material. I have mainly added some interpretive notes of my own; see the original slides for full details.)

Tianqi's slides first introduce some basic concepts commonly used in supervised learning, and then give the definition of the objective function for a tree ensemble model.
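A compact transcription of that objective (my own summary; notation follows the slides and the XGBoost paper): the model predicts with $K$ additive trees, $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, and the regularized objective is

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,$$

where each tree is $f(x) = w_{q(x)}$ with $T$ leaves and leaf-weight vector $w \in \mathbb{R}^{T}$.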

The slides explain this clearly and give concrete examples. Note that $q$ here is a function that maps a training example to a leaf-node index.
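To make the boosting mechanics concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error regression with one-dimensional decision stumps as base learners. This is illustrative code of my own, not from the slides; with squared loss, the negative gradient is simply the residual.

```python
def fit_stump(x, residuals):
    """Find the split threshold that minimizes squared error on the residuals."""
    best = None
    xs = sorted(set(x))
    for i in range(len(xs) - 1):
        thr = (xs[i] + xs[i + 1]) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

def fit_gbm(x, y, n_rounds=50, lr=0.1):
    """Gradient boosting for squared loss: repeatedly fit stumps to residuals."""
    f0 = sum(y) / len(y)                      # initial constant prediction
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 w.r.t. f is the residual y - f.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]

    def predict(xi):
        return f0 + lr * sum(s(xi) for s in stumps)
    return predict
```

With a step-shaped target, a few dozen rounds of shrunken stumps recover the step almost exactly; each round removes a constant fraction (the learning rate) of the remaining residual.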

3. Implementation with XGBoost/Scikit-learn

```python
import numpy as np
import sys

# (code that reads the train/test data into X_train, y_train, X_test, y_test
#  from files is omitted here)
# Replace NaNs with finite values so scikit-learn can consume the data.
X_train = np.nan_to_num(X_train)
X_test = np.nan_to_num(X_test)

print('train data size:', len(X_train))
print('test data size:', len(X_test))

# Data normalization
# ===================================================
from sklearn import preprocessing
# Scale each feature to [0, 1] using statistics from the training set only.
scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('normalized_X:', X_train)
```

```python
# Feature selection
# ===================================================
# Optional: rank features with a tree ensemble before training.
# from sklearn.ensemble import ExtraTreesClassifier
# model = ExtraTreesClassifier()
# model.fit(X_train, y_train)
# # display the relative importance of each attribute
# print('feature_importance', model.feature_importances_)
```

```python
# Classification
# ===================================================

# Build model. model_name is assumed to be set earlier (e.g. parsed from the
# command line); choices: AdaBoost, LR, NeuralNet, SVM, RandomForest, GBRT,
# Bagging, ExtraTrees.
print('build model...')

if model_name == 'LR':
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
elif model_name == 'NeuralNet':
    from sklearn.neural_network import MLPClassifier
    model = MLPClassifier(hidden_layer_sizes=(100, 100), random_state=1)
elif model_name == 'SVM':
    from sklearn.svm import LinearSVC
    model = LinearSVC()
elif model_name == 'RandomForest':
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()
elif model_name == 'AdaBoost':
    from sklearn.ensemble import AdaBoostClassifier
    model = AdaBoostClassifier()
elif model_name == 'GBRT':
    # Gradient boosted trees; xgboost.XGBClassifier is a near drop-in alternative.
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier()
elif model_name == 'Bagging':
    from sklearn.ensemble import BaggingClassifier
    model = BaggingClassifier()
elif model_name == 'ExtraTrees':
    from sklearn.ensemble import ExtraTreesClassifier
    model = ExtraTreesClassifier()
else:
    raise NameError("wrong model name!")
```

```python
from sklearn import metrics

model.fit(X_train, y_train)
print(model)
# Make predictions.
expected = y_test
predicted = model.predict(X_test)
# Summarize the fit of the model.
print('classification_report\n', metrics.classification_report(expected, predicted, digits=6))
print('confusion_matrix\n', metrics.confusion_matrix(expected, predicted))
print('accuracy\t', metrics.accuracy_score(expected, predicted))

print('dump the predicted proba and predicted label to files in the folder', model_res_path)
# Note: LinearSVC has no predict_proba; for 'SVM' use decision_function instead.
predicted_score = model.predict_proba(X_test)
predicted_label = predicted
output_file_pred_score = model_res_path + data_name + '_' + model_name + '_' + feature_set + '.pred_score'
output_file_pred_label = model_res_path + data_name + '_' + model_name + '_' + feature_set + '.pred_label'
np.savetxt(output_file_pred_score, predicted_score, delimiter='\t')
np.savetxt(output_file_pred_label, predicted_label, delimiter='\t')

if model_name in ('RandomForest', 'AdaBoost', 'GBRT'):
    print('feature importance score\n')
    print(model.feature_importances_)

    feat_import_score_file = model_res_path + model_name + '_' + feature_set + '.featimportance'
    print('save feature importance file to the model_res_path:', feat_import_score_file)
    np.savetxt(feat_import_score_file, model.feature_importances_, delimiter='\t')
```
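Since the script above depends on external data files, here is a self-contained toy run of the GBRT path. The tiny dataset and hyperparameters are illustrative only; `xgboost.XGBClassifier` exposes the same `fit`/`predict`/`predict_proba` interface and can be swapped in.

```python
# Minimal sketch of the GBRT branch with scikit-learn's GradientBoostingClassifier.
from sklearn.ensemble import GradientBoostingClassifier

# Toy linearly separable data standing in for X_train / y_train.
X_train = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y_train = [0, 0, 0, 1, 1, 1]

# Shallow trees (depth-1 stumps) with shrinkage, as in standard GBM practice.
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=1)
model.fit(X_train, y_train)

proba = model.predict_proba([[0.1], [0.9]])   # per-class probabilities
labels = model.predict([[0.1], [0.9]])        # hard labels
```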



4. References

[1] Introduction to Boosted Trees. https://xgboost.readthedocs.io/en/latest/model.html

[2] Tianqi Chen. Introduction to Boosted Trees (slides). http://homes.cs.washington.edu/~tqchen/data/pdf/BoostedTree.pdf

[3] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In KDD '16. http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf
