Machine Learning Notes: Ensembles of Decision Trees

YukAgame

Category: Machine Learning Study Notes

Article link: https://blog.csdn.net/weixin_38686737/article/details/108057844


Ensembles are methods that combine multiple machine learning models to create more powerful models.

Random Forests

Random forests are one way to address the overfitting problem of decision trees.
A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea is to build many trees, each of which may overfit in a different way, and average their results to reduce overfitting.
There are two ways to randomize the trees: by selecting the data points used to build each tree (a bootstrap sample), and by selecting the features considered in each split test.
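
The two sources of randomness can be sketched by hand. This is only an illustration, not how scikit-learn implements its forests: the variable names, the number of trees (5), and the choice `max_features=1` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=2)
rng = np.random.RandomState(0)

trees = []
for _ in range(5):
    # Randomization 1: bootstrap sample (draw n_samples rows with replacement)
    idx = rng.randint(0, len(X), size=len(X))
    # Randomization 2: each split only looks at a random subset of the features
    tree = DecisionTreeClassifier(max_features=1, random_state=rng.randint(1_000_000))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Average the predicted class probabilities of all trees, then pick the most likely class
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
y_pred = avg_proba.argmax(axis=1)
print("Training accuracy of the hand-built ensemble:", (y_pred == y).mean())
```

Each tree sees a different resampling of the data and a different feature at each split, so the trees disagree in places; averaging their probabilities smooths out those disagreements.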

In Python, this can be done with RandomForestClassifier from scikit-learn.

import mglearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons,load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

x, y = make_moons(n_samples = 100, noise = 0.25, random_state = 2)
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 42,stratify = y)
# n_estimators: the number of trees to build
# n_jobs: the number of CPU cores to use (-1 uses all cores)
# max_features: the number of features considered at each split
forest = RandomForestClassifier(n_estimators = 5, random_state = 2,n_jobs = -1)
forest.fit(x_train,y_train)

fig,axes = plt.subplots(2,3,figsize = (20,10))
for i,(ax,tree) in enumerate(zip(axes.ravel(),forest.estimators_)):
    ax.set_title('Tree %s'%i)
    mglearn.plots.plot_tree_partition(x_train,y_train,tree,ax)
mglearn.plots.plot_2d_separator(forest,x_train,fill = True,ax = axes[-1,-1])
axes[-1,-1].set_title('Random Forest')
# Draw the training points explicitly on the forest panel (bottom right)
mglearn.discrete_scatter(x_train[:,0],x_train[:,1],y_train,ax = axes[-1,-1])
plt.show()

Combining the results of the 5 trees gives an overall result (bottom right of the figure below).

[Figure: decision boundaries of the five individual trees (Tree 0 to Tree 4) and of the random forest]
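
To make the "combining" step concrete, the sketch below averages the class probabilities predicted by the individual trees and compares them with the forest's own output. It rebuilds the same small forest as above so that it runs on its own; `per_tree` and `manual_average` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = make_moons(n_samples=100, noise=0.25, random_state=2)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, stratify=y)
forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(x_train, y_train)

# Probabilities predicted by each tree, shape (n_trees, n_samples, n_classes)
per_tree = np.array([tree.predict_proba(x_test) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Check whether the forest's probabilities match the manual average
matches = np.allclose(manual_average, forest.predict_proba(x_test))
print("Forest probabilities equal the tree average:", matches)
print("Forest test accuracy:", forest.score(x_test, y_test))
```

This kind of averaging is sometimes called soft voting: each tree contributes a probability rather than a hard class label.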

Next, we again use the breast cancer dataset as an example.

cancer = load_breast_cancer()
x_train,x_test,y_train,y_test = train_test_split(cancer.data,cancer.target,random_state = 0)
forest = RandomForestClassifier(n_estimators= 100,random_state = 0).fit(x_train,y_train)

print('Training score :%s'%(forest.score(x_train,y_train)),'Testing score :%s'%(forest.score(x_test,y_test)))




n_features  = cancer.data.shape[1]
plt.barh(range(n_features),forest.feature_importances_)
plt.yticks(np.arange(n_features),cancer.feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.ylim(-1, n_features)
plt.show()

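The training fit is still very high, but the test-set performance is clearly better than that of a single decision tree (previously, with a training score of 1.0, the single decision tree scored 0.937 on the test set).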

Next, let's look at the importance of each feature.

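In this case, compared with the single decision tree, many more features receive a nonzero importance, but the model gives the highest importance to worst perimeter rather than worst radius.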

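[Figure: horizontal bar chart of the random forest's feature importances (x-axis: Feature Importance, y-axis: Feature)]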
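
The bar chart can be hard to read with 30 features, so a text ranking is sometimes handier. The sketch below sorts the importances and prints the top five; the names `order`, `top5`, and `n_nonzero` are introduced only for this example, and no particular ranking is claimed here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)

importances = forest.feature_importances_
# Indices of the features, from most to least important
order = np.argsort(importances)[::-1]
top5 = [(cancer.feature_names[i], importances[i]) for i in order[:5]]
for name, value in top5:
    print(f"{name:<25s} {value:.3f}")

# Count how many features received a nonzero importance
n_nonzero = int((importances > 0).sum())
print("Features with nonzero importance:", n_nonzero, "of", len(importances))
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.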

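By default, max_features = sqrt(n_features) for classification and max_features = n_features for regression. A smaller max_features makes the trees more different from one another, while a larger value makes them more similar.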
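
A quick way to see the effect of max_features is to train the same forest with a few settings and compare the scores. The values tried below ("sqrt", 5, and None, where None means all features) are arbitrary choices for illustration, and the `scores` dictionary is a helper introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

scores = {}
for max_features in ["sqrt", 5, None]:
    forest = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0)
    forest.fit(x_train, y_train)
    # Store (training accuracy, test accuracy) for this setting
    scores[str(max_features)] = (
        forest.score(x_train, y_train), forest.score(x_test, y_test))

for setting, (train_acc, test_acc) in scores.items():
    print(f"max_features={setting:<5s} train={train_acc:.3f} test={test_acc:.3f}")
```

On a dataset this small the differences may be slight; the comparison is meant to show the mechanics rather than to recommend a value.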

Strengths: random forests work well without heavy tuning of parameters and don't require scaling of the data. A single decision tree may still be the better choice when you need a compact representation of the decision-making process. (When dealing with large datasets, you can set n_jobs = -1 to use all CPU cores.)

Gradient Boosted Regression Trees

In Python, this can be done with GradientBoostingClassifier from scikit-learn.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import mglearn
import matplotlib.pyplot as plt
import numpy as np

cancer = load_breast_cancer()
x_train,x_test,y_train,y_test = train_test_split(cancer.data,cancer.target, random_state = 0)
# Default settings: 100 trees, max_depth = 3, learning_rate = 0.1
gbrt1 = GradientBoostingClassifier(random_state = 0).fit(x_train,y_train)
print('Training score: %s'%(gbrt1.score(x_train,y_train)),'Testing score :%s'%(gbrt1.score(x_test,y_test)))

# Limit each tree to depth 1 (stronger pre-pruning) to reduce overfitting
gbrt2 = GradientBoostingClassifier(random_state = 0,max_depth = 1).fit(x_train,y_train)
print('Training score: %s'%(gbrt2.score(x_train,y_train)),'Testing score :%s'%(gbrt2.score(x_test,y_test)))


# Lower the learning rate so each tree makes a smaller correction
gbrt3 = GradientBoostingClassifier(random_state = 0, learning_rate = 0.01)
gbrt3.fit(x_train,y_train)
print('Training score: %s'%(gbrt3.score(x_train,y_train)),'Testing score :%s'%(gbrt3.score(x_test,y_test)))
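
Because gradient boosting adds trees one after another, it can be useful to watch how the test accuracy changes as trees are added. The sketch below uses staged_predict for that; the depth-1 model and the checkpoints (1, 10, 50, 100 trees) are choices made only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1).fit(x_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees
staged_acc = [np.mean(pred == y_test) for pred in gbrt.staged_predict(x_test)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:3d} trees: test accuracy {staged_acc[n_trees - 1]:.3f}")
```

The last entry of `staged_acc` corresponds to the full model, so it should agree with `gbrt.score(x_test, y_test)`.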