Python GBDT Analysis
Gradient Boosted Regression Trees (Gradient Boosting Machines)
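from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load the breast cancer dataset and split it into training and test sets
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)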
# Baseline: default parameters
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.965
# Stronger pre-pruning: limit each tree to depth 1 to reduce overfitting
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 0.991
Accuracy on test set: 0.972
# Alternative: keep the default depth but lower the learning rate
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 0.988
Accuracy on test set: 0.965
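The drop in training accuracy with learning_rate=0.01 follows from how boosting builds its model: each new tree corrects the errors left by the previous ones, and the learning rate scales how much of that correction is applied. The following is a minimal pure-Python sketch of this idea for 1-D regression with squared loss and depth-1 trees. It is an illustration under simplifying assumptions, not scikit-learn's implementation; the helper names fit_stump, boost, and mse and the toy sine data are made up for this example.

```python
import math


def fit_stump(xs, residuals):
    """Find the single split (depth-1 tree) that best fits the residuals."""
    best = None
    sorted_xs = sorted(set(xs))
    for lo, hi in zip(sorted_xs, sorted_xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_estimators, learning_rate):
    """Add stumps one at a time; each one fits the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Only a fraction (learning_rate) of each correction is applied.
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data (hypothetical): 40 points on a sine curve
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

mse_fast = mse(ys, boost(xs, ys, n_estimators=100, learning_rate=0.1))
mse_slow = mse(ys, boost(xs, ys, n_estimators=100, learning_rate=0.01))
print("Training MSE, learning_rate=0.1:  {:.5f}".format(mse_fast))
print("Training MSE, learning_rate=0.01: {:.5f}".format(mse_slow))
```

With the same number of trees, the smaller learning rate leaves a larger training error, which is the regularizing effect seen in the accuracy numbers above.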
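# Refit the max_depth=1 model and inspect its feature importances
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
# plot_feature_importances_cancer is a plotting helper assumed to be defined earlier
plot_feature_importances_cancer(gbrt)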
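As the plot shows, the feature importances of the gradient boosted trees are somewhat similar to those of the random forest, except that gradient boosting completely ignores some of the features.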
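Since plot_feature_importances_cancer is not defined in this section, the observation about ignored features can also be checked without a plot. The sketch below refits the same max_depth=1 model and prints the features whose importance is exactly zero, followed by the five most important ones. The variable names importances and ignored are chosen for this example, and the exact lists may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)

importances = gbrt.feature_importances_

# A feature that no tree ever split on has an importance of exactly zero.
ignored = [name for name, value in zip(cancer.feature_names, importances)
           if value == 0]
print("Features with zero importance: {} of {}".format(
    len(ignored), len(importances)))
for name in ignored:
    print("  ", name)

# The five features the model relies on most
print("Top five features:")
for idx in np.argsort(importances)[::-1][:5]:
    print("  {:<25s} {:.3f}".format(cancer.feature_names[idx], importances[idx]))
```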
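Because gradient boosting and random forests both perform well on similar kinds of data, a common approach is to try random forests first, since they are quite robust. If the random forest works well but prediction takes too long, or if even an improvement in the second decimal place of accuracy matters, switching to gradient boosting often helps.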
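One way to act on this advice is to fit both models on the same split and compare test accuracy and prediction time side by side. The sketch below does this on the breast cancer data. The choice of n_estimators=100 for the random forest and the use of time.perf_counter for timing are assumptions made for illustration, and timings on such a small dataset are noisy, so treat the numbers as a rough indication only.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0, max_depth=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Time only the prediction step, since that is what the advice is about.
    start = time.perf_counter()
    accuracy = model.score(X_test, y_test)
    elapsed = time.perf_counter() - start
    results[name] = (accuracy, elapsed)
    print("{:<18s} test accuracy: {:.3f}  prediction time: {:.4f} s".format(
        name, accuracy, elapsed))
```

On a larger dataset, repeating the timing several times and averaging would give a more reliable comparison.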