scikit-learn随机森林调参小结

最新推荐文章于 2022-09-13 10:54:19 发布

lusic01

最新推荐文章于 2022-09-13 10:54:19 发布

阅读量1.9k

点赞数 1

sklearn学习笔记之开始

简介

自2007年发布以来，scikit-learn已经成为Python重要的机器学习库了。scikit-learn简称sklearn，支持包括分类、回归、降维和聚类四大机器学习算法。还包含了特征提取、数据处理和模型评估三大模块。
sklearn是Scipy的扩展，建立在NumPy和matplotlib库的基础上。利用这几大模块的优势，可以大大提高机器学习的效率。
sklearn拥有着完善的文档，上手容易，具有着丰富的API，在学术界颇受欢迎。sklearn已经封装了大量的机器学习算法，包括LIBSVM和LIBINEAR。同时sklearn内置了大量数据集，节省了获取和整理数据集的时间。

机器学习基础

　　定义：针对经验E和一系列的任务T和一定表现的衡量P，如果随着经验E的积累，针对定义好的任务T可以提高表现P，就说明机器具有学习能力。

sklearn安装

sklearn目前的版本是0.17.1，可以使用pip安装。在安装时需要进行包依赖检查，具体有以下几个要求：

Python（>=2.6 or >=3.3）
NumPy(>=1.6.1)
SciPy(>=0.9)

如果满足上述条件，就能使用pip进行安装了：

1 pip install -U scikit-learn

当然，使用pip安装会比较麻烦，推荐使用Anaconda科学计算环境，里面已经内置了NumPy、SciPy、sklearn等模块，直接可用。或者使用conda进行包管理。conda安装与pip类似：

1  conda install scikit-learn

安装完sklearn以后，可以检查以下版本：

1  >>> import sklearn
2  >>> sklearn.__version__
3  '0.17.1'

==================================================================================================================================================================

sklearn中xgboost模块的XGBClassifier函数

==================================================================================================================================================================

LogisticRegression 逻辑回归

==================================================================================================================================================================

　　　　在Bagging与随机森林算法原理小结中，我们对随机森林(Random Forest, 以下简称RF）的原理做了总结。本文就从实践的角度对RF做一个总结。重点讲述scikit-learn中RF的调参注意事项，以及和GBDT调参的异同点。

1. scikit-learn随机森林类库概述

　在scikit-learn中，RF的分类类是RandomForestClassifier，回归类是RandomForestRegressor。

当然RF的变种Extra Trees也有，分类类ExtraTreesClassifier，回归类ExtraTreesRegressor。

由于RF和Extra Trees的区别较小，调参方法基本相同，本文只关注于RF的调参。

　和GBDT的调参类似，RF需要调参的参数也包括两部分，第一部分是Bagging框架的参数，第二部分是CART决策树的参数。下面我们就对这些参数做一个介绍。

2. RF框架参数

　　　　首先我们关注于RF的Bagging框架的参数。这里可以和GBDT对比来学习。在scikit-learn 梯度提升树(GBDT)调参小结中我们对GBDT的框架参数做了介绍。GBDT的框架参数比较多，重要的有最大迭代器个数，步长和子采样比例，调参起来比较费力。但是RF则比较简单，这是因为bagging框架里的各个弱学习器之间是没有依赖关系的，这减小的调参的难度。换句话说，达到同样的调参效果，RF调参时间要比GBDT少一些。

　　　　下面我来看看RF重要的Bagging框架的参数，由于RandomForestClassifier和RandomForestRegressor参数绝大部分相同，这里会将它们一起讲，不同点会指出。

　　　　1) n_estimators: 也就是弱学习器的最大迭代次数，或者说最大的弱学习器的个数。一般来说n_estimators太小，容易欠拟合，n_estimators太大，计算量会太大，并且n_estimators到一定的数量后，再增大n_estimators获得的模型提升会很小，所以一般选择一个适中的数值。默认是100。

　　　　2) oob_score :即是否采用袋外样本来评估模型的好坏。默认识False。个人推荐设置为True，因为袋外分数反应了一个模型拟合后的泛化能力。

　　　　3) criterion: 即CART树做划分时对特征的评价标准。分类模型和回归模型的损失函数是不一样的。分类RF对应的CART分类树默认是基尼系数gini,另一个可选择的标准是信息增益。回归RF对应的CART回归树默认是均方差mse，另一个可以选择的标准是绝对值差mae。一般来说选择默认的标准就已经很好的。

　　　　从上面可以看出， RF重要的框架参数比较少，主要需要关注的是 n_estimators，即RF最大的决策树个数。

3. RF决策树参数

　　　　下面我们再来看RF的决策树参数，它要调参的参数基本和GBDT相同，如下:

　　　　1) RF划分时考虑的最大特征数max_features: 可以使用很多种类型的值，默认是"auto",意味着划分时最多考虑N−−√

个特征；如果是"log2"意味着划分时最多考虑log2N个特征；如果是"sqrt"或者"auto"意味着划分时最多考虑N−−√

个特征。如果是整数，代表考虑的特征绝对数。如果是浮点数，代表考虑特征百分比，即考虑（百分比xN）取整后的特征数。其中N为样本总特征数。一般我们用默认的"auto"就可以了，如果特征数非常多，我们可以灵活使用刚才描述的其他取值来控制划分时考虑的最大特征数，以控制决策树的生成时间。

　　　　2) 决策树最大深度max_depth: 默认可以不输入，如果不输入的话，决策树在建立子树的时候不会限制子树的深度。一般来说，数据少或者特征少的时候可以不管这个值。如果模型样本量多，特征也多的情况下，推荐限制这个最大深度，具体的取值取决于数据的分布。常用的可以取值10-100之间。

　　　　3) 内部节点再划分所需最小样本数min_samples_split: 这个值限制了子树继续划分的条件，如果某节点的样本数少于min_samples_split，则不会继续再尝试选择最优特征来进行划分。默认是2.如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。

　　　　4) 叶子节点最少样本数min_samples_leaf: 这个值限制了叶子节点最少的样本数，如果某叶子节点数目小于样本数，则会和兄弟节点一起被剪枝。默认是1,可以输入最少的样本数的整数，或者最少样本数占样本总数的百分比。如果样本量不大，不需要管这个值。如果样本量数量级非常大，则推荐增大这个值。

　　　　5）叶子节点最小的样本权重和min_weight_fraction_leaf：这个值限制了叶子节点所有样本权重和的最小值，如果小于这个值，则会和兄弟节点一起被剪枝。默认是0，就是不考虑权重问题。一般来说，如果我们有较多样本有缺失值，或者分类树样本的分布类别偏差很大，就会引入样本权重，这时我们就要注意这个值了。

　　　　6) 最大叶子节点数max_leaf_nodes: 通过限制最大叶子节点数，可以防止过拟合，默认是"None”，即不限制最大的叶子节点数。如果加了限制，算法会建立在最大叶子节点数内最优的决策树。如果特征不多，可以不考虑这个值，但是如果特征分成多的话，可以加以限制，具体的值可以通过交叉验证得到。

　　　　7) 节点划分最小不纯度min_impurity_split: 这个值限制了决策树的增长，如果某节点的不纯度(基于基尼系数，均方差)小于这个阈值，则该节点不再生成子节点。即为叶子节点。一般不推荐改动默认值1e-7。

　　　　上面决策树参数中最重要的包括最大特征数max_features，最大深度max_depth，内部节点再划分所需最小样本数min_samples_split和叶子节点最少样本数min_samples_leaf。

4.RF调参实例

　　　　这里仍然使用GBDT调参时同样的数据集来做RF调参的实例，数据的下载地址在这。本例我们采用袋外分数来评估我们模型的好坏。

　　　　完整代码参见我的github:https://github.com/ljpzzz/machinelearning/blob/master/ensemble-learning/random_forest_classifier.ipynb

　　　　首先，我们载入需要的类库：

复制代码

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation, metrics

import matplotlib.pylab as plt
%matplotlib inline

复制代码

　　　　接着，我们把解压的数据用下面的代码载入，顺便看看数据的类别分布。

train = pd.read_csv('train_modified.csv')
target='Disbursed' # Disbursed的值就是二元分类的输出
IDcol = 'ID'
train['Disbursed'].value_counts()

　　　　可以看到类别输出如下，也就是类别0的占大多数。

0 19680
1 320
Name: Disbursed, dtype: int64

　　　　接着我们选择好样本特征和类别输出。

x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train['Disbursed']

　　　　不管任何参数，都用默认的，我们拟合下数据看看：

rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(X,y)
print rf0.oob_score_
y_predprob = rf0.predict_proba(X)[:,1]
print "AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob)

　　　　输出如下，可见袋外分数已经很高，而且AUC分数也很高。相对于GBDT的默认参数输出，RF的默认参数拟合效果对本例要好一些。

0.98005
AUC Score (Train): 0.999833

　　　　我们首先对n_estimators进行网格搜索：

复制代码

param_test1 = {'n_estimators':range(10,71,10)}
gsearch1 = GridSearchCV(estimator = RandomForestClassifier(min_samples_split=100,
                                  min_samples_leaf=20,max_depth=8,max_features='sqrt' ,random_state=10), 
                       param_grid = param_test1, scoring='roc_auc',cv=5)
gsearch1.fit(X,y)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

复制代码

　　　　输出结果如下：

([mean: 0.80681, std: 0.02236, params: {'n_estimators': 10},
mean: 0.81600, std: 0.03275, params: {'n_estimators': 20},
mean: 0.81818, std: 0.03136, params: {'n_estimators': 30},
mean: 0.81838, std: 0.03118, params: {'n_estimators': 40},
mean: 0.82034, std: 0.03001, params: {'n_estimators': 50},
mean: 0.82113, std: 0.02966, params: {'n_estimators': 60},
mean: 0.81992, std: 0.02836, params: {'n_estimators': 70}],
{'n_estimators': 60},
0.8211334476626017)

　　　　这样我们得到了最佳的弱学习器迭代次数，接着我们对决策树最大深度max_depth和内部节点再划分所需最小样本数min_samples_split进行网格搜索。

复制代码

param_test2 = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)}
gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, 
                                  min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10),
   param_grid = param_test2, scoring='roc_auc',iid=False, cv=5)
gsearch2.fit(X,y)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

复制代码

　　　　输出如下：

([mean: 0.79379, std: 0.02347, params: {'min_samples_split': 50, 'max_depth': 3},
mean: 0.79339, std: 0.02410, params: {'min_samples_split': 70, 'max_depth': 3},
mean: 0.79350, std: 0.02462, params: {'min_samples_split': 90, 'max_depth': 3},
mean: 0.79367, std: 0.02493, params: {'min_samples_split': 110, 'max_depth': 3},
mean: 0.79387, std: 0.02521, params: {'min_samples_split': 130, 'max_depth': 3},
mean: 0.79373, std: 0.02524, params: {'min_samples_split': 150, 'max_depth': 3},
mean: 0.79378, std: 0.02532, params: {'min_samples_split': 170, 'max_depth': 3},
mean: 0.79349, std: 0.02542, params: {'min_samples_split': 190, 'max_depth': 3},
mean: 0.80960, std: 0.02602, params: {'min_samples_split': 50, 'max_depth': 5},
mean: 0.80920, std: 0.02629, params: {'min_samples_split': 70, 'max_depth': 5},
mean: 0.80888, std: 0.02522, params: {'min_samples_split': 90, 'max_depth': 5},
mean: 0.80923, std: 0.02777, params: {'min_samples_split': 110, 'max_depth': 5},
mean: 0.80823, std: 0.02634, params: {'min_samples_split': 130, 'max_depth': 5},
mean: 0.80801, std: 0.02637, params: {'min_samples_split': 150, 'max_depth': 5},
mean: 0.80792, std: 0.02685, params: {'min_samples_split': 170, 'max_depth': 5},
mean: 0.80771, std: 0.02587, params: {'min_samples_split': 190, 'max_depth': 5},
mean: 0.81688, std: 0.02996, params: {'min_samples_split': 50, 'max_depth': 7},
mean: 0.81872, std: 0.02584, params: {'min_samples_split': 70, 'max_depth': 7},
mean: 0.81501, std: 0.02857, params: {'min_samples_split': 90, 'max_depth': 7},
mean: 0.81476, std: 0.02552, params: {'min_samples_split': 110, 'max_depth': 7},
mean: 0.81557, std: 0.02791, params: {'min_samples_split': 130, 'max_depth': 7},
mean: 0.81459, std: 0.02905, params: {'min_samples_split': 150, 'max_depth': 7},
mean: 0.81601, std: 0.02808, params: {'min_samples_split': 170, 'max_depth': 7},
mean: 0.81704, std: 0.02757, params: {'min_samples_split': 190, 'max_depth': 7},
mean: 0.82090, std: 0.02665, params: {'min_samples_split': 50, 'max_depth': 9},
mean: 0.81908, std: 0.02527, params: {'min_samples_split': 70, 'max_depth': 9},
mean: 0.82036, std: 0.02422, params: {'min_samples_split': 90, 'max_depth': 9},
mean: 0.81889, std: 0.02927, params: {'min_samples_split': 110, 'max_depth': 9},
mean: 0.81991, std: 0.02868, params: {'min_samples_split': 130, 'max_depth': 9},
mean: 0.81788, std: 0.02436, params: {'min_samples_split': 150, 'max_depth': 9},
mean: 0.81898, std: 0.02588, params: {'min_samples_split': 170, 'max_depth': 9},
mean: 0.81746, std: 0.02716, params: {'min_samples_split': 190, 'max_depth': 9},
mean: 0.82395, std: 0.02454, params: {'min_samples_split': 50, 'max_depth': 11},
mean: 0.82380, std: 0.02258, params: {'min_samples_split': 70, 'max_depth': 11},
mean: 0.81953, std: 0.02552, params: {'min_samples_split': 90, 'max_depth': 11},
mean: 0.82254, std: 0.02366, params: {'min_samples_split': 110, 'max_depth': 11},
mean: 0.81950, std: 0.02768, params: {'min_samples_split': 130, 'max_depth': 11},
mean: 0.81887, std: 0.02636, params: {'min_samples_split': 150, 'max_depth': 11},
mean: 0.81910, std: 0.02734, params: {'min_samples_split': 170, 'max_depth': 11},
mean: 0.81564, std: 0.02622, params: {'min_samples_split': 190, 'max_depth': 11},
mean: 0.82291, std: 0.02092, params: {'min_samples_split': 50, 'max_depth': 13},
mean: 0.82177, std: 0.02513, params: {'min_samples_split': 70, 'max_depth': 13},
mean: 0.82415, std: 0.02480, params: {'min_samples_split': 90, 'max_depth': 13},
mean: 0.82420, std: 0.02417, params: {'min_samples_split': 110, 'max_depth': 13},
mean: 0.82209, std: 0.02481, params: {'min_samples_split': 130, 'max_depth': 13},
mean: 0.81852, std: 0.02227, params: {'min_samples_split': 150, 'max_depth': 13},
mean: 0.81955, std: 0.02885, params: {'min_samples_split': 170, 'max_depth': 13},
mean: 0.82092, std: 0.02600, params: {'min_samples_split': 190, 'max_depth': 13}],
{'max_depth': 13, 'min_samples_split': 110},
0.8242016800050813)

　　　　我们看看我们现在模型的袋外分数：

rf1 = RandomForestClassifier(n_estimators= 60, max_depth=13, min_samples_split=110,
                                  min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10)
rf1.fit(X,y)
print rf1.oob_score_

　　　　输出结果为：

0.984

　　　　可见此时我们的袋外分数有一定的提高。也就是时候模型的泛化能力增强了。

　　　　对于内部节点再划分所需最小样本数min_samples_split，我们暂时不能一起定下来，因为这个还和决策树其他的参数存在关联。下面我们再对内部节点再划分所需最小样本数min_samples_split和叶子节点最少样本数min_samples_leaf一起调参。

复制代码

param_test3 = {'min_samples_split':range(80,150,20), 'min_samples_leaf':range(10,60,10)}
gsearch3 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, max_depth=13,
                                  max_features='sqrt' ,oob_score=True, random_state=10),
   param_grid = param_test3, scoring='roc_auc',iid=False, cv=5)
gsearch3.fit(X,y)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

复制代码

　　　　输出如下：

([mean: 0.82093, std: 0.02287, params: {'min_samples_split': 80, 'min_samples_leaf': 10},
mean: 0.81913, std: 0.02141, params: {'min_samples_split': 100, 'min_samples_leaf': 10},
mean: 0.82048, std: 0.02328, params: {'min_samples_split': 120, 'min_samples_leaf': 10},
mean: 0.81798, std: 0.02099, params: {'min_samples_split': 140, 'min_samples_leaf': 10},
mean: 0.82094, std: 0.02535, params: {'min_samples_split': 80, 'min_samples_leaf': 20},
mean: 0.82097, std: 0.02327, params: {'min_samples_split': 100, 'min_samples_leaf': 20},
mean: 0.82487, std: 0.02110, params: {'min_samples_split': 120, 'min_samples_leaf': 20},
mean: 0.82169, std: 0.02406, params: {'min_samples_split': 140, 'min_samples_leaf': 20},
mean: 0.82352, std: 0.02271, params: {'min_samples_split': 80, 'min_samples_leaf': 30},
mean: 0.82164, std: 0.02381, params: {'min_samples_split': 100, 'min_samples_leaf': 30},
mean: 0.82070, std: 0.02528, params: {'min_samples_split': 120, 'min_samples_leaf': 30},
mean: 0.82141, std: 0.02508, params: {'min_samples_split': 140, 'min_samples_leaf': 30},
mean: 0.82278, std: 0.02294, params: {'min_samples_split': 80, 'min_samples_leaf': 40},
mean: 0.82141, std: 0.02547, params: {'min_samples_split': 100, 'min_samples_leaf': 40},
mean: 0.82043, std: 0.02724, params: {'min_samples_split': 120, 'min_samples_leaf': 40},
mean: 0.82162, std: 0.02348, params: {'min_samples_split': 140, 'min_samples_leaf': 40},
mean: 0.82225, std: 0.02431, params: {'min_samples_split': 80, 'min_samples_leaf': 50},
mean: 0.82225, std: 0.02431, params: {'min_samples_split': 100, 'min_samples_leaf': 50},
mean: 0.81890, std: 0.02458, params: {'min_samples_split': 120, 'min_samples_leaf': 50},
mean: 0.81917, std: 0.02528, params: {'min_samples_split': 140, 'min_samples_leaf': 50}],
{'min_samples_leaf': 20, 'min_samples_split': 120},
0.8248650279471544)

　　　　最后我们再对最大特征数max_features做调参:

复制代码

param_test4 = {'max_features':range(3,11,2)}
gsearch4 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, max_depth=13, min_samples_split=120,
                                  min_samples_leaf=20 ,oob_score=True, random_state=10),
   param_grid = param_test4, scoring='roc_auc',iid=False, cv=5)
gsearch4.fit(X,y)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

复制代码

　　　　输出如下：

([mean: 0.81981, std: 0.02586, params: {'max_features': 3},
mean: 0.81639, std: 0.02533, params: {'max_features': 5},
mean: 0.82487, std: 0.02110, params: {'max_features': 7},
mean: 0.81704, std: 0.02209, params: {'max_features': 9}],
{'max_features': 7},
0.8248650279471544)

　　　　用我们搜索到的最佳参数，我们再看看最终的模型拟合：

rf2 = RandomForestClassifier(n_estimators= 60, max_depth=13, min_samples_split=120,
                                  min_samples_leaf=20,max_features=7 ,oob_score=True, random_state=10)
rf2.fit(X,y)
print rf2.oob_score_

　　　　此时的输出为：

0.984

　　　　可见此时模型的袋外分数基本没有提高，主要原因是0.984已经是一个很高的袋外分数了，如果想进一步需要提高模型的泛化能力，我们需要更多的数据。

以上就是RF调参的一个总结，希望可以帮到朋友们。

（欢迎转载，转载请注明出处。欢迎沟通交流： liujianping-ok@163.com）

===========================================================================================================================================================================================================================================================================

scikit-learn（工程中用的相对较多的模型介绍）：1.11. Ensemble methods

2015年08月04日 08:24:16 mmc2015 阅读数：10987 标签：机器学习 scikit-learn ensemble boosting 工程应用更多

个人分类： scikit-learn

所属专栏： scikit-learn

参考：http://scikit-learn.org/stable/modules/ensemble.html

在实际项目中，我们真的很少用到那些简单的模型，比如LR、kNN、NB等，虽然经典，但在工程中确实不实用。

今天我们关注在工程中用的相对较多的Ensemble methods。

Ensemble methods（集成方法）主要是综合多个estimators加权或不加权的投票结果来产生最终结果。主要有两大类：

In averaging methods（平均方法）, the driving principle is to build several estimatorsindependently and then toaverage their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, ...
By contrast, in boosting methods（提升方法）, base estimators are builtsequentially and one tries to reduce the bias of the combined estimator（the former estimator）. The motivation is to combine several weak models to produce a powerful ensemble.

Examples: AdaBoost, Gradient Tree Boosting, ...

接下来主要讲：

1、Bagging meta-estimator

注意bagging和boosting的区别：bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).

不同bagging方法的区别：产生random subsets的方式，有些是随机子样本集，有些是随机子features集，有些是随机子样本/features集，还有一些是有放回的抽样（samples、features可重复）。

scikit-learn提供了a unified BaggingClassifier meta-estimator (resp. BaggingRegressor)，同时由参数max_samples和max_features决定子集大小、由bootstrap和bootstrap_features决定子集产生过程是否有替换。小例子：

>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)

Single estimator versus bagging: bias-variance decomposition

2、Forests of ranomized trees

两种算法：RandomForest algorithm and the Extra-Trees method。最终结果是average prediction of the individual classifiers。给个简单例子：

>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)

Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).

RandomForest algorithm ：

有两个class，分别处理分类和回归，RandomForestClassifier and RandomForestRegressor classes。样本提取时允许replacement（a bootstrap sample），在随机选取的部分（而不是全部的）features上进行划分，与原论文的vote方法不同，scikit-learn通过平均每个分类器的预测概率（averaging their probabilistic prediction）来生成最终结果。

Extremely Randomized Trees ：

有两个class，分别处理分类和回归， ExtraTreesClassifier and ExtraTreesRegressor classes。默认使用所有样本，但划分时features随机选取部分。

给个比较例子：

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.tree import DecisionTreeClassifier

>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
...     random_state=0)

>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=1,
...     random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()                             
0.97...

>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
...     min_samples_split=1, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()                             
0.999...

>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
...     min_samples_split=1, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() > 0.999
True

几点说明：

1）参数：最主要的调节参数是 n_estimators and max_features ，经验最好数据是，回归问题设置 max_features=n_features ，分类问题设置max_features=sqrt(n_features)(n_features是数据集的features个数).；设置max_depth=None 并且结合min_samples_split=1 (i.e., when fully developing the trees)经常导致好的结果；但切记，最好的参数还是通过CV调出来的。

2）默认机制：random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False).

3）并行：设置n_jobs=k 保证使用机器的k个cores；设置n_jobs=-1 使用所有可用的cores。

4）特征重要性评估：一个决策树，节点在越高的分支，相应的特征对最终预测结果的contribute越大。这里的大，是指影响输入数据集的比例比较大（the fraction of the input samples is large）。所以，对于某一个randomized tree，可以通过 The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.，然后对于 n_estimators 个randomized tree，通过averaging those expected activity rates over several randomized trees，达到区分特征重要性、特征选择的目的。但上面的叙述没什么X用，属性 feature_importances_ 已经保留了该重要性记录。。。。

最后还是几个例子：

3、AdaBoost

重要的还是不翻译：The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights $w_1$ , $w_2$ , ..., $w_N$ to each of the training samples. Initially, those weights are all set to $w_i = 1/N$ , so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence [HTF].

给个小例子：

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import AdaBoostClassifier

>>> iris = load_iris()
>>> clf = AdaBoostClassifier(n_estimators=100)
>>> scores = cross_val_score(clf, iris.data, iris.target)
>>> scores.mean()                             
0.9...

AdaBoost同样是分类回归双可用：

Discrete versus Real AdaBoost compares the classification error of a decision stump, decision tree, and a boosted decision stump using AdaBoost-SAMME and AdaBoost-SAMME.R.
Multi-class AdaBoosted Decision Trees shows the performance of AdaBoost-SAMME and AdaBoost-SAMME.R on a multi-class problem.
Two-class AdaBoost shows the decision boundary and decision function values for a non-linearly separable two-class problem using AdaBoost-SAMME.
Decision Tree Regression with AdaBoost demonstrates regression with the AdaBoost.R2 algorithm.

4、Gradient Tree Boosting

Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.

优点：

Natural handling of data of mixed type (= heterogeneous features)
Predictive power
Robustness to outliers in output space (via robust loss functions)

缺点：并行困难（由于sequential nature of boosting造成的）

1）分类：

GradientBoostingClassifier supports both binary and multi-class classification.

>>>

>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier

>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]

>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)                 
0.913...

Note

Classification with more than 2 classes requires the induction of n_classes regression trees at each iteration, thus, the total number of induced trees equals n_classes * n_estimators. For datasets with a large number of classes we strongly recommend to use RandomForestClassifier as an alternative to GradientBoostingClassifier .

2）回归：

GradientBoostingRegressor supports a number of different loss functions for regression which can be specified via the argument loss; the default loss function for regression is least squares ('ls'，least squares loss).

>>>

>>> import numpy as np
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import GradientBoostingRegressor

>>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
>>> X_train, X_test = X[:200], X[200:]
>>> y_train, y_test = y[:200], y[200:]
>>> est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
...     max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
>>> mean_squared_error(y_test, est.predict(X_test))    
5.00...

3）在原模型的基础上再增加一些weak-learners：

Both GradientBoostingRegressor and GradientBoostingClassifier support warm_start=True which allows you to add more estimators to an already fitted model.

>>>

>>> _ = est.set_params(n_estimators=200, warm_start=True)  # set warm_start and new nr of trees
>>> _ = est.fit(X_train, y_train) # fit additional 100 trees to est
>>> mean_squared_error(y_test, est.predict(X_test))    
3.84...

4）解释模型效果：

由于有多棵树组成，不能像单独的决策树一样画树结构，但是还有有些办法来summarize and interpret gradient boosting models.

features importance，The feature importance scores of a fit gradient boosting model can be accessed via the feature_importances_ property:

>>>

>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier

>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> clf.feature_importances_  
array([ 0.11,  0.1 ,  0.11,  ...

Gradient Boosting regression

partial dependence，Partial dependence plots (PDP) show the dependence between the target response and a set of ‘target’ features。

下面重要的翻译：

The Figure below shows four one-way and one two-way partial dependence plots for the California housing dataset:

One-way PDPs tell us about the interaction between the target response and the target feature (e.g. linear, non-linear). The upper left plot in the above Figure shows the effect of the median income in a district on the median house price; we can clearly see a linear relationship among them.

PDPs with two target features show the interactions among the two features. For example, the two-variable PDP in the above Figure shows the dependence of median house price on joint values of house age and avg. occupants per household. We can clearly see an interaction between the two features: For an avg. occupancy greater than two, the house price is nearly independent of the house age, whereas for values less than two there is a strong dependence on age.

The module partial_dependence provides a convenience function plot_partial_dependence to create one-way and two-way partial dependence plots. In the below example we show how to create a grid of partial dependence plots: two one-way PDPs for the features 0 and 1 and a two-way PDP between the two features:

>>>

>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.ensemble.partial_dependence import plot_partial_dependence

>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> features = [0, 1, (0, 1)]
>>> fig, axs = plot_partial_dependence(clf, X, features)

For multi-class models, you need to set the class label for which the PDPs should be created via the label argument:

>>>

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> mc_clf = GradientBoostingClassifier(n_estimators=10,
...     max_depth=1).fit(iris.data, iris.target)
>>> features = [3, 2, (3, 2)]
>>> fig, axs = plot_partial_dependence(mc_clf, X, features, label=0)

If you need the raw values of the partial dependence function rather than the plots you can use the partial_dependence function:

>>>

>>> from sklearn.ensemble.partial_dependence import partial_dependence

>>> pdp, axes = partial_dependence(clf, [0], X=X)
>>> pdp  
array([[ 2.46643157,  2.46643157, ...
>>> axes  
[array([-1.62497054, -1.59201391, ...

Partial Dependence Plots

最后推荐大家看一下。。。。XgBoost：http://blog.csdn.net/mmc2015/article/details/47304779

xgboost简介：（摘自其他）xgboost的全称是eXtreme Gradient Boosting。正如其名，它是Gradient Boosting Machine的一个c++实现，作者为正在华盛顿大学研究机器学习的大牛陈天奇。他在研究中深感自己受制于现有库的计算速度和精度，因此在一年前开始着手搭建xgboost项目，并在去年夏天逐渐成型。xgboost最大的特点在于，它能够自动利用CPU的多线程进行并行，同时在算法上加以改进提高了精度。它的处女秀是Kaggle的希格斯子信号识别竞赛，因为出众的效率与较高的预测准确度在比赛论坛中引起了参赛选手的广泛关注，在1700多支队伍的激烈竞争中占有一席之地。随着它在Kaggle社区知名度的提高，最近也有队伍借助xgboost在比赛中夺得第一。