Machine Learning with sklearn: Ensemble Learning (Part 2)

Overview of GBDT

GBDT is very popular in real-world applications. Unlike AdaBoost, which uses the error rate of the previous round's weak learner to update the training sample weights, GBDT fits each new round to the remaining error of the current model; it likewise follows the forward stagewise algorithm, but its weak learners are restricted to CART regression trees.

An intuitive way to understand GBDT: suppose a house is worth 1,000,000. We first fit it with 800,000 and find an error of 200,000; we then fit that residual with 150,000, leaving an error of 50,000; next we fit with 40,000, leaving 10,000; and we keep iterating like this until the remaining error shrinks below a preset threshold.
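
The same intuition in a few lines of Python (a toy illustration using the numbers from the example above, not an actual GBDT):

price = 100.0                    # true house price (in units of 10,000)
prediction = 0.0
for step in [80.0, 15.0, 4.0]:   # each round fits part of the remaining residual
    prediction += step
    print("prediction: %.0f, residual: %.0f" % (prediction, price - prediction))
# prediction: 80, residual: 20
# prediction: 95, residual: 5
# prediction: 99, residual: 1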

As for how to fit the residual at each round under a general loss function, Friedman proposed using the negative gradient of the loss function to approximate the current round's residual, and then fitting a CART regression tree to it. Since the formulas are time-consuming to typeset, I recommend a blog post where the detailed derivation can be found.
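
For reference, the standard formulation: at round $t$ the negative gradient of the loss for sample $i$,

$$r_{ti} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f(x)=f_{t-1}(x)},$$

is taken as the approximate residual, and the $t$-th CART regression tree is fit to the pairs $(x_i, r_{ti})$.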

GBDT has many advantages, summarized as follows:
1. Because CART trees are used as the weak learners, it can handle all kinds of data, both continuous and categorical.
2. Its predictive accuracy is relatively high.
3. With robust loss functions (such as the Huber loss), it is very resistant to outliers.

Below is a small demo of GBDT in practice.

Classification with a CART decision tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib inline

data = load_iris()
X = data.data
y = data.target

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

fig = plt.figure()
# plot the training samples class by class, using feature columns 1 and 2
# (sepal width and petal length)
plt.scatter(x_train[y_train==0][:, 1], x_train[y_train==0][:, 2])
plt.scatter(x_train[y_train==1][:, 1], x_train[y_train==1][:, 2])
plt.scatter(x_train[y_train==2][:, 1], x_train[y_train==2][:, 2])
plt.legend(data.target_names)
plt.show()

[Figure: scatter plot of the iris training samples (sepal width vs. petal length), colored by class]

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
import pydotplus
from sklearn import metrics

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, dtc.predict(x_test), target_names=data.target_names))
dot_data = export_graphviz(decision_tree=dtc, 
                           out_file=None, 
                           feature_names=data.feature_names, 
                           class_names=data.target_names, 
                           filled=True, 
                           rounded=True, 
                           special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Accuracy (train): 1
Accuracy (test): 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

[Figure: rendered decision tree from pydotplus]
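
If Graphviz/pydotplus is unavailable, scikit-learn (0.21+) can also draw the fitted tree directly with matplotlib — a minimal alternative sketch:

# requires scikit-learn >= 0.21; renders the tree without Graphviz
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(dtc, feature_names=data.feature_names,
          class_names=list(data.target_names), filled=True, rounded=True)
plt.show()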

Ensemble learning with GBDT

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)

print("Accuracy(train) : %.4g" % gbc.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gbc.score(x_test, y_test))
print("混淆矩阵:\n", metrics.classification_report(y_test, gbc.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

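Because boosting adds trees one at a time, staged_predict can show how test accuracy evolves as estimators accumulate — a quick sketch (using the gbc model above) to check whether the default 100 trees are more than needed:

import numpy as np
# staged_predict yields the prediction after each boosting stage
staged_acc = [metrics.accuracy_score(y_test, y_hat)
              for y_hat in gbc.staged_predict(x_test)]
print("best stage: %d, accuracy: %.4g" % (np.argmax(staged_acc) + 1, max(staged_acc)))
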
Parameter tuning: here we only cross-validate the tree depth; there are quite a few tunable parameters, and they are not all covered one by one.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

params = {"max_depth": list(range(1,11))}
gbc = GradientBoostingClassifier()
gs = GridSearchCV(gbc, param_grid=params, cv=10)
gs.fit(x_train, y_train)


print("Accuracy(train) : %.4g" % gs.score(x_train, y_train))
print("Accuracy(test) : %.4g" % gs.score(x_test, y_test))
print("混淆矩阵:\n", metrics.classification_report(y_test, gs.predict(x_test), target_names=data.target_names))
Accuracy(train) : 1
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

Cross-validation results

# note: grid_scores_ is deprecated since sklearn 0.18 and removed in 0.20
gs.best_estimator_, gs.best_score_, gs.best_params_, gs.grid_scores_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
(GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=4,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100,
               presort='auto', random_state=None, subsample=1.0, verbose=0,
               warm_start=False),
 0.9642857142857143,
 {'max_depth': 4},
 [mean: 0.94643, std: 0.06325, params: {'max_depth': 1},
  mean: 0.94643, std: 0.07307, params: {'max_depth': 2},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 3},
  mean: 0.96429, std: 0.04335, params: {'max_depth': 4},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 5},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 6},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 7},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 8},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 9},
  mean: 0.95536, std: 0.06385, params: {'max_depth': 10}])
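
As the warning says, grid_scores_ disappears in sklearn 0.20; the same information lives in cv_results_. An equivalent sketch:

# cv_results_ replaces grid_scores_ from sklearn 0.20 onward
for mean, std, p in zip(gs.cv_results_["mean_test_score"],
                        gs.cv_results_["std_test_score"],
                        gs.cv_results_["params"]):
    print("mean: %.5f, std: %.5f, params: %s" % (mean, std, p))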

Classification with the AdaBoost ensemble method

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(x_train, y_train)

print("Accuracy(train) : %.4g" % ada.score(x_train, y_train))
print("Accuracy(test) : %.4g" % ada.score(x_test, y_test))
print("混淆矩阵:\n", metrics.classification_report(y_test, ada.predict(x_test), target_names=data.target_names))
Accuracy(train) : 0.9821
Accuracy(test) : 0.9474
Classification report:
              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        11
 versicolor       0.80      1.00      0.89         8
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38
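
AdaBoost exposes far fewer knobs than GBDT; if desired, its two main parameters can be tuned with the same GridSearchCV pattern used above (a sketch; the candidate values are illustrative, not tuned recommendations):

from sklearn.model_selection import GridSearchCV
ada_params = {"n_estimators": [50, 100, 200], "learning_rate": [0.5, 1.0]}
ada_gs = GridSearchCV(AdaBoostClassifier(), param_grid=ada_params, cv=10)
ada_gs.fit(x_train, y_train)
print(ada_gs.best_params_, ada_gs.best_score_)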

Classification with XGBoost

import xgboost
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,                # number of classes, used together with multi:softmax
    'gamma': 0.1,                  # minimum loss reduction to allow a further split (controls post-pruning); larger is more conservative, typically 0.1 or 0.2
    'max_depth': 3,                # depth of each tree; deeper trees overfit more easily
    'lambda': 1,                   # L2 regularization on leaf weights; larger values make overfitting less likely
    'subsample': 0.8,              # row subsampling of the training instances
    'colsample_bytree': 0.7,       # column subsampling when building each tree
    'min_child_weight': 3,         # minimum sum of instance weights required in a child node
    'silent': 1,                   # 1 suppresses training messages; set to 0 to see them
    'eta': 0.01,                   # shrinkage, analogous to a learning rate
    'seed': 1000,                  # random seed
    'nthread': 4,                  # number of CPU threads
}

dtrain = xgboost.DMatrix(x_train, y_train)
num_rounds = 500
model = xgboost.train(params=params, dtrain=dtrain, num_boost_round=num_rounds)
dtest = xgboost.DMatrix(x_test)
ans = model.predict(dtest)
# count correct vs. incorrect predictions on the test set
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1
print("Accuracy(test): \n", cnt1/(cnt1 + cnt2))
Accuracy(test): 
 0.9473684210526315
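
xgboost also ships a scikit-learn-compatible wrapper, XGBClassifier, which works directly with GridSearchCV and sklearn metrics. A minimal sketch mirroring the native-API settings above (eta corresponds to learning_rate):

from xgboost import XGBClassifier

# sklearn-style interface; the multi-class objective is inferred from y
xgb_clf = XGBClassifier(max_depth=3, learning_rate=0.01, n_estimators=500)
xgb_clf.fit(x_train, y_train)
print("Accuracy(test): %.4g" % xgb_clf.score(x_test, y_test))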

A more complex example: the dataset is larger, which lets the strengths of ensemble learning show.

import pandas as pd
train = pd.read_csv("./train_modified.csv")
# train.head(10)
train.describe()
# train[train.isnull().values==True]
           Disbursed   Existing_EMI  Loan_Amount_Applied  ...      Source_1      Source_2
count   20000.000000    20000.00000         2.000000e+04  ...  20000.000000  20000.000000
mean        0.016000     3890.29622         2.424191e+05  ...      0.645450      0.239050
std         0.125478    10534.21647         3.582973e+05  ...      0.478389      0.426514
min         0.000000        0.00000         0.000000e+00  ...      0.000000      0.000000
25%         0.000000        0.00000         0.000000e+00  ...      0.000000      0.000000
50%         0.000000        0.00000         1.000000e+05  ...      1.000000      0.000000
75%         0.000000     4000.00000         3.000000e+05  ...      1.000000      0.000000
max         1.000000   420000.00000         9.000000e+06  ...      1.000000      1.000000

8 rows × 50 columns

target = "Disbursed"
IDcol = "ID"
train["Disbursed"].value_counts()
0    19680
1      320
Name: Disbursed, dtype: int64
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train["Disbursed"]

Classification with a CART decision tree

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, dtc.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, dtc.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, dtc.predict(x_test)))
Accuracy (train): 0.9997
Accuracy (test): 0.9682
Classification report:
              precision    recall  f1-score   support

          0       0.99      0.98      0.98      4931
          1       0.04      0.06      0.05        69

avg / total       0.97      0.97      0.97      5000

Classification with the GBDT ensemble

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(x_train, y_train)
y_pred = gbm0.predict(x_test)
y_predprob = gbm0.predict_proba(x_test)[:,1]
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm0.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm0.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, gbm0.predict(x_test)))
Accuracy (train): 0.9844
Accuracy (test): 0.9854
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4931
          1       0.00      0.00      0.00        69

avg / total       0.97      0.99      0.98      5000

Tuning GBDT's n_estimators parameter. The classes are highly imbalanced (only 320 positive samples out of 20,000), so plain accuracy is misleading here; the grid searches below therefore score with ROC AUC instead.

from sklearn.model_selection import GridSearchCV
params = {"n_estimators": range(20, 81, 10)}
gsearch1 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, 
                                                             min_samples_split=300,  # a node needs at least this many samples to be split further
                                                             min_samples_leaf=20,    # a leaf keeping fewer samples than this is pruned together with its sibling
                                                             max_depth=8,            # maximum depth of each tree
                                                             max_features="sqrt",    # "sqrt" (or "auto") considers at most sqrt(N) features per split
                                                             subsample=0.8,          # subsample 80% of rows without replacement, to reduce overfitting
                                                             random_state=10),
                        param_grid=params,
                        scoring="roc_auc",
                        iid=False,
                        cv=5)
gsearch1.fit(x_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
([mean: 0.82240, std: 0.03087, params: {'n_estimators': 20},
  mean: 0.82469, std: 0.03055, params: {'n_estimators': 30},
  mean: 0.82479, std: 0.03178, params: {'n_estimators': 40},
  mean: 0.82445, std: 0.02968, params: {'n_estimators': 50},
  mean: 0.82230, std: 0.02993, params: {'n_estimators': 60},
  mean: 0.82074, std: 0.02881, params: {'n_estimators': 70},
  mean: 0.81918, std: 0.02904, params: {'n_estimators': 80}],
 {'n_estimators': 40},
 0.8247923927911079)

Tuning GBDT's max_depth and min_samples_split parameters

params2 = {"max_depth": range(3, 14, 2), "min_samples_split": range(100, 801, 200)}
gsearch2 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.1, 
                                                             n_estimators=60, 
                                                             min_samples_leaf=20, 
                                                             max_features="sqrt", 
                                                             subsample=0.8, 
                                                             random_state=10),
                        param_grid=params2, 
                        scoring="roc_auc", 
                        iid=False, 
                        cv=5)
gsearch2.fit(x_train, y_train)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
([mean: 0.81718, std: 0.03177, params: {'max_depth': 3, 'min_samples_split': 100},
  mean: 0.81821, std: 0.02824, params: {'max_depth': 3, 'min_samples_split': 300},
  mean: 0.81938, std: 0.02993, params: {'max_depth': 3, 'min_samples_split': 500},
  mean: 0.81850, std: 0.02894, params: {'max_depth': 3, 'min_samples_split': 700},
  mean: 0.82919, std: 0.02452, params: {'max_depth': 5, 'min_samples_split': 100},
  mean: 0.82704, std: 0.02582, params: {'max_depth': 5, 'min_samples_split': 300},
  mean: 0.82595, std: 0.02603, params: {'max_depth': 5, 'min_samples_split': 500},
  mean: 0.82930, std: 0.02581, params: {'max_depth': 5, 'min_samples_split': 700},
  mean: 0.82742, std: 0.02200, params: {'max_depth': 7, 'min_samples_split': 100},
  mean: 0.81882, std: 0.02066, params: {'max_depth': 7, 'min_samples_split': 300},
  mean: 0.82529, std: 0.02404, params: {'max_depth': 7, 'min_samples_split': 500},
  mean: 0.82395, std: 0.02940, params: {'max_depth': 7, 'min_samples_split': 700},
  mean: 0.82908, std: 0.02157, params: {'max_depth': 9, 'min_samples_split': 100},
  mean: 0.81857, std: 0.03291, params: {'max_depth': 9, 'min_samples_split': 300},
  mean: 0.82545, std: 0.02825, params: {'max_depth': 9, 'min_samples_split': 500},
  mean: 0.82815, std: 0.02859, params: {'max_depth': 9, 'min_samples_split': 700},
  mean: 0.81604, std: 0.02591, params: {'max_depth': 11, 'min_samples_split': 100},
  mean: 0.82513, std: 0.02261, params: {'max_depth': 11, 'min_samples_split': 300},
  mean: 0.82908, std: 0.03235, params: {'max_depth': 11, 'min_samples_split': 500},
  mean: 0.82534, std: 0.02583, params: {'max_depth': 11, 'min_samples_split': 700},
  mean: 0.81899, std: 0.02132, params: {'max_depth': 13, 'min_samples_split': 100},
  mean: 0.82667, std: 0.02806, params: {'max_depth': 13, 'min_samples_split': 300},
  mean: 0.82685, std: 0.03581, params: {'max_depth': 13, 'min_samples_split': 500},
  mean: 0.82662, std: 0.02611, params: {'max_depth': 13, 'min_samples_split': 700}],
 {'max_depth': 5, 'min_samples_split': 700},
 0.8292976819017346)

Tuning GBDT's min_samples_leaf and min_samples_split parameters

param_test3 = {'min_samples_split':range(800,1900,200), 'min_samples_leaf':range(60,101,10)}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, 
                                                               n_estimators=60,
                                                               max_depth=9,
                                                               max_features='sqrt',
                                                               subsample=0.8, 
                                                               random_state=10), 
                        param_grid = param_test3, 
                        scoring='roc_auc',
                        iid=False, 
                        cv=5)
gsearch3.fit(x_train, y_train)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
D:\anaconda\setup\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
([mean: 0.82938, std: 0.02746, params: {'min_samples_leaf': 60, 'min_samples_split': 800},
  mean: 0.82748, std: 0.03127, params: {'min_samples_leaf': 60, 'min_samples_split': 1000},
  mean: 0.82002, std: 0.03099, params: {'min_samples_leaf': 60, 'min_samples_split': 1200},
  mean: 0.82265, std: 0.03321, params: {'min_samples_leaf': 60, 'min_samples_split': 1400},
  mean: 0.82615, std: 0.02846, params: {'min_samples_leaf': 60, 'min_samples_split': 1600},
  mean: 0.82273, std: 0.02671, params: {'min_samples_leaf': 60, 'min_samples_split': 1800},
  mean: 0.82471, std: 0.03209, params: {'min_samples_leaf': 70, 'min_samples_split': 800},
  mean: 0.82705, std: 0.03119, params: {'min_samples_leaf': 70, 'min_samples_split': 1000},
  mean: 0.82525, std: 0.02723, params: {'min_samples_leaf': 70, 'min_samples_split': 1200},
  mean: 0.82698, std: 0.02734, params: {'min_samples_leaf': 70, 'min_samples_split': 1400},
  mean: 0.82374, std: 0.02662, params: {'min_samples_leaf': 70, 'min_samples_split': 1600},
  mean: 0.82543, std: 0.02728, params: {'min_samples_leaf': 70, 'min_samples_split': 1800},
  mean: 0.82468, std: 0.02681, params: {'min_samples_leaf': 80, 'min_samples_split': 800},
  mean: 0.82688, std: 0.02378, params: {'min_samples_leaf': 80, 'min_samples_split': 1000},
  mean: 0.82400, std: 0.02718, params: {'min_samples_leaf': 80, 'min_samples_split': 1200},
  mean: 0.82635, std: 0.03008, params: {'min_samples_leaf': 80, 'min_samples_split': 1400},
  mean: 0.82478, std: 0.02849, params: {'min_samples_leaf': 80, 'min_samples_split': 1600},
  mean: 0.82215, std: 0.02679, params: {'min_samples_leaf': 80, 'min_samples_split': 1800},
  mean: 0.82416, std: 0.02264, params: {'min_samples_leaf': 90, 'min_samples_split': 800},
  mean: 0.82559, std: 0.02115, params: {'min_samples_leaf': 90, 'min_samples_split': 1000},
  mean: 0.82556, std: 0.02317, params: {'min_samples_leaf': 90, 'min_samples_split': 1200},
  mean: 0.82452, std: 0.02702, params: {'min_samples_leaf': 90, 'min_samples_split': 1400},
  mean: 0.82319, std: 0.02409, params: {'min_samples_leaf': 90, 'min_samples_split': 1600},
  mean: 0.82400, std: 0.02738, params: {'min_samples_leaf': 90, 'min_samples_split': 1800},
  mean: 0.83031, std: 0.02758, params: {'min_samples_leaf': 100, 'min_samples_split': 800},
  mean: 0.82296, std: 0.02450, params: {'min_samples_leaf': 100, 'min_samples_split': 1000},
  mean: 0.82464, std: 0.02562, params: {'min_samples_leaf': 100, 'min_samples_split': 1200},
  mean: 0.82332, std: 0.02972, params: {'min_samples_leaf': 100, 'min_samples_split': 1400},
  mean: 0.82227, std: 0.02910, params: {'min_samples_leaf': 100, 'min_samples_split': 1600},
  mean: 0.82231, std: 0.02642, params: {'min_samples_leaf': 100, 'min_samples_split': 1800}],
 {'min_samples_leaf': 100, 'min_samples_split': 800},
 0.8303082748093461)

Classification with the tuned parameters

gbm = GradientBoostingClassifier(learning_rate=0.1,
                                 n_estimators=40,
                                 min_samples_split=800,
                                 min_samples_leaf=100,
                                 max_depth=5,
                                 max_features="sqrt",
                                 subsample=0.8,
                                 random_state=10)
gbm.fit(x_train, y_train)
print("Accuracy (train): %.4g" % metrics.accuracy_score(y_train, gbm.predict(x_train)))
print("Accuracy (test): %.4g" % metrics.accuracy_score(y_test, gbm.predict(x_test)))
print("混淆矩阵:\n", metrics.classification_report(y_test, gbm.predict(x_test)))
Accuracy (train): 0.9833
Accuracy (test): 0.9862
Classification report:
              precision    recall  f1-score   support

          0       0.99      1.00      0.99      4931
          1       0.00      0.00      0.00        69

avg / total       0.97      0.99      0.98      5000

D:\anaconda\setup\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
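
The UndefinedMetricWarning appears because the tuned model never predicts the positive class, so its precision is undefined. On data this imbalanced, AUC computed from predicted probabilities is a more informative measure than accuracy — a short sketch:

from sklearn.metrics import roc_auc_score

# AUC is computed from the predicted probability of the positive class
y_predprob = gbm.predict_proba(x_test)[:, 1]
print("AUC (test): %.4g" % roc_auc_score(y_test, y_predprob))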