Machine Learning Notes: Ensembles of Decision Trees

  • Ensembles are methods that combine multiple machine learning models to create more powerful models.

Random Forests

  • Random forests are one way to address the overfitting problem of Decision Trees.
  • A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. (Build many trees and average their results to reduce overfitting.)
  • There are two ways to make the trees differ: by selecting the data points used to build each tree, and by selecting the features considered in each split test.
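The two sources of randomness described above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements it internally; the helper names `bootstrap_indices` and `candidate_features` are made up for this note.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, max_features, rng):
    """Pick max_features distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples, n_features = 10, 6

rows = bootstrap_indices(n_samples, rng)
cols = candidate_features(n_features, max_features=2, rng=rng)

print('rows used by this tree:', rows)        # rows may repeat, and some may be left out
print('features tried at this split:', cols)
```

Each tree would get its own `rows`, and each split inside a tree would get its own `cols`, which is what makes the trees differ from one another.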

In Python, this can be implemented with RandomForestClassifier from scikit-learn:

import mglearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons,load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
x, y = make_moons(n_samples = 100, noise = 0.25, random_state = 2)
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 42,stratify = y)
# n_estimators: number of trees to build
# n_jobs: number of CPU cores to use (-1 means all cores)
# max_features: number of features considered at each split (left at its default here)
forest = RandomForestClassifier(n_estimators = 5, random_state = 2,n_jobs = -1)
forest.fit(x_train,y_train)
fig,axes = plt.subplots(2,3,figsize = (20,10))
for i,(ax,tree) in enumerate(zip(axes.ravel(),forest.estimators_)):
    ax.set_title('Tree %s'%i)
    mglearn.plots.plot_tree_partition(x_train,y_train,tree,ax)
mglearn.plots.plot_2d_separator(forest,x_train,fill = True,ax = axes[-1,-1])
axes[-1,-1].set_title('Random Forest')
mglearn.discrete_scatter(x_train[:,0],x_train[:,1],y_train)
plt.show()

Combining the results of the five trees gives the overall result (the bottom-right panel of the plot).
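To make "combining the results" concrete, the sketch below averages the class probabilities predicted by the individual trees and compares the outcome with the forest's own prediction. It refits the same five-tree forest on the same data so that it runs on its own; the variable names `tree_probas`, `mean_proba` and `manual_pred` are just for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = make_moons(n_samples=100, noise=0.25, random_state=2)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, stratify=y)

forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(x_train, y_train)

# Average the class probabilities predicted by the individual trees
tree_probas = np.array([tree.predict_proba(x_test) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# Pick the most probable class for each test point
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]

print('Matches forest.predict_proba:', np.allclose(mean_proba, forest.predict_proba(x_test)))
print('Matches forest.predict:', np.array_equal(manual_pred, forest.predict(x_test)))
```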

Next, we again use the breast cancer dataset as an example:

cancer = load_breast_cancer()
x_train,x_test,y_train,y_test = train_test_split(cancer.data,cancer.target,random_state = 0)
forest = RandomForestClassifier(n_estimators= 100,random_state = 0).fit(x_train,y_train)

print('Training score: %s' % forest.score(x_train, y_train), 'Testing score: %s' % forest.score(x_test, y_test))

# Plot the importance the forest assigns to each feature
n_features = cancer.data.shape[1]
plt.barh(range(n_features),forest.feature_importances_)
plt.yticks(np.arange(n_features),cancer.feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.ylim(-1, n_features)
plt.show()

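The forest still fits the training set very closely, but its test score is clearly better than that of a single decision tree (earlier, the single tree scored 0.937 on the test set with a training score of 1.0).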

Next, let's look at the importance of each feature.

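Here, compared with the single decision tree, many more features receive a nonzero importance, and the model gives the highest importance to worst perimeter rather than worst radius.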
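The bar chart can be hard to read with 30 features, so a short ranking may help. The sketch below refits the same forest and prints the five most important features together with the number of features whose importance is nonzero; `order`, `top5` and `n_nonzero` are names chosen for this note.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)

# Sort feature indices from most to least important
order = np.argsort(forest.feature_importances_)[::-1]
top5 = [(cancer.feature_names[i], forest.feature_importances_[i]) for i in order[:5]]
for name, importance in top5:
    print('%-25s %.3f' % (name, importance))

# Count how many features received a nonzero importance
n_nonzero = int(np.count_nonzero(forest.feature_importances_))
print('Features with nonzero importance:', n_nonzero)
```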

  • Default values: max_features = sqrt(n_features) for classification, max_features = n_features for regression.
  • Strengths: random forests work well without heavy parameter tuning and do not require scaling of the data. (When dealing with large datasets, you can set n_jobs = -1 to use all CPU cores.)
  • Weakness: they are not a good choice when you need a compact representation of the decision-making process; a single decision tree is easier to inspect.
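As a quick check of what the max_features defaults mean in practice, here is a tiny helper written for this note. `default_max_features` is not a scikit-learn function, and it assumes the square root is rounded down with at least one feature kept.

```python
import math

def default_max_features(n_features, task):
    """Number of candidate features per split under the defaults noted above.

    Assumption: the square root is rounded down and at least one feature is kept.
    """
    if task == 'classification':
        return max(1, int(math.sqrt(n_features)))
    if task == 'regression':
        return n_features
    raise ValueError('task must be "classification" or "regression"')

print(default_max_features(30, 'classification'))  # 5
print(default_max_features(30, 'regression'))      # 30
```

For the 30-feature breast cancer data, each split in a classification forest would therefore look at 5 randomly chosen features under this assumption.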

Gradient boosted regression trees

In Python, this can be implemented with GradientBoostingClassifier from scikit-learn:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import mglearn
import matplotlib.pyplot as plt
import numpy as np

cancer = load_breast_cancer()
x_train,x_test,y_train,y_test = train_test_split(cancer.data,cancer.target, random_state = 0)
gbrt1 = GradientBoostingClassifier(random_state = 0).fit(x_train,y_train)
print('Training score: %s'%(gbrt1.score(x_train,y_train)),'Testing score :%s'%(gbrt1.score(x_test,y_test)))

gbrt2 = GradientBoostingClassifier(random_state = 0,max_depth = 1).fit(x_train,y_train)
print('Training score: %s'%(gbrt2.score(x_train,y_train)),'Testing score :%s'%(gbrt2.score(x_test,y_test)))


gbrt3 = GradientBoostingClassifier(random_state = 0, learning_rate = 0.01)
gbrt3.fit(x_train,y_train)
print('Training score: %s'%(gbrt3.score(x_train,y_train)),'Testing score :%s'%(gbrt3.score(x_test,y_test)))

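Limiting the maximum depth reduces the complexity of the model: the training score drops somewhat, but the test score rises. Lowering learning_rate brings the training and test scores closer together.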
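To compare these settings side by side, the three models can be fitted in one loop. This is only a convenience sketch; the `settings` and `results` dictionaries and their labels are made up for this note, and the exact scores depend on the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# Each entry overrides one parameter of the default model
settings = {
    'default': {},
    'max_depth=1': {'max_depth': 1},
    'learning_rate=0.01': {'learning_rate': 0.01},
}

results = {}
for label, params in settings.items():
    gbrt = GradientBoostingClassifier(random_state=0, **params).fit(x_train, y_train)
    train_score = gbrt.score(x_train, y_train)
    test_score = gbrt.score(x_test, y_test)
    results[label] = (train_score, test_score)
    # A smaller gap between the two scores suggests less overfitting
    print('%-20s train: %.3f  test: %.3f  gap: %.3f'
          % (label, train_score, test_score, train_score - test_score))
```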