Ensemble Learning Concepts and Python Implementations

  • bootstrap
  1. From the phrase "to pull oneself up by one's bootstraps" — i.e., without outside help; hence also called 自助 (self-help) sampling
  2. A resampling technique used in statistical inference to estimate the sampling distribution
  3. Sampling with replacement; the resample has the same size as the original dataset
  4. About 1/3 of the data ends up out-of-bag (OOB): the probability that a given point is never drawn is (1 - 1/n)^n → 1/e ≈ 0.368 as n → ∞ (see the sketch below)
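
A minimal sketch (numpy only, values illustrative) checking the OOB proportion empirically; the 0.368 figure is the large-n limit of (1 - 1/n)^n:

import numpy as np

n = 10000
rng = np.random.default_rng(0)
# One bootstrap round: draw n indices with replacement
idx = rng.integers(0, n, size=n)
# Out-of-bag fraction: share of original points never drawn
oob_fraction = 1 - len(np.unique(idx)) / n
print(oob_fraction)  # close to 1/e ≈ 0.368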

 

  • jackknife
  1. Named after the jackknife, a folding pocket knife — a handy general-purpose tool
  2. No replacement is involved: each sample is the original data with one observation left out
  3. Given X = (x_1, x_2, ..., x_n), the i-th jackknife sample is X_(i) = (x_1, ..., x_(i-1), x_(i+1), ..., x_n), so there are n jackknife samples in total (see the sketch below)
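
A minimal leave-one-out sketch (numpy only; the data values are made up), using the n jackknife means to estimate the standard error of the sample mean:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
n = len(x)
# i-th jackknife sample: x with x_i removed
jack_means = np.array([np.delete(x, i).mean() for i in range(n)])
# Standard jackknife estimate of the standard error of the mean
se = np.sqrt((n - 1) / n * ((jack_means - jack_means.mean()) ** 2).sum())
print(se)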

 

  • bagging
  1. Sample with the bootstrap method, for m rounds
  2. The sampled datasets are drawn independently of one another
  3. Each round's sample is used to train one model
  4. Aggregate by voting: the majority class is the final result; if the models output scores, take their mean instead (see the sketch below)
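
A minimal bagging sketch using sklearn's BaggingClassifier (its default base learner is a decision tree; the iris data and 30% split mirror the examples below):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# 10 bootstrap rounds, one tree per round, majority vote at predict time
cls = BaggingClassifier(n_estimators=10, bootstrap=True)
cls.fit(x_train, y_train)
print(cls.score(x_test, y_test))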

 

  • boosting
  1. Representative algorithm: AdaBoost (example below)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

iris = load_iris()
x = iris.data
y = iris.target

# Split into training and test sets (30% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Build and fit the classifier
cls = AdaBoostClassifier(n_estimators=100, learning_rate=1)
cls.fit(x_train, y_train)
x_test_pre = cls.predict(x_test)
print(classification_report(y_test, x_test_pre, target_names=['class0', 'class1', 'class2']))

# In the report, 'support' is the number of occurrences of each label in y_test

'''
base_estimator : the base classifier; default DecisionTreeClassifier(max_depth=1)
n_estimators : number of boosting iterations
learning_rate : trades off against n_estimators — a larger learning rate means
    fewer iterations are needed; it is usually safer to start small
algorithm : default 'SAMME.R', which requires the base classifier to support
    probability predictions; it converges faster, needs fewer iterations, and
    typically reaches lower test error than 'SAMME'
random_state : random seed, for reproducibility
'''
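
To illustrate the learning_rate / n_estimators trade-off described above, a hedged variant of the classifier already fit (the specific values are illustrative, not tuned; reuses x_train, y_train, x_test, y_test from the code above):

# Smaller learning rate paired with more iterations
cls_slow = AdaBoostClassifier(n_estimators=500, learning_rate=0.1)
cls_slow.fit(x_train, y_train)
print(classification_report(y_test, cls_slow.predict(x_test)))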

 

 

  • random forest
  1. Base learner: CART decision trees
  2. Two sources of randomness: random selection of samples (bootstrap) and random selection of features at each split
  3. Variable importance: (a) permutation importance — run the OOB samples through the classifier to get error e1, permute the values of one feature among the OOB samples and run them through again to get e2; the more important the feature, the larger abs(e1 - e2); (b) each feature's ability to reduce tree impurity (mean decrease in impurity); see the sketch after the code below
  4. Random-forest variable importance does not measure a variable's intrinsic contribution to predicting the target — only its contribution to the predictions of this particular model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

iris = load_iris()
x = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
cls = RandomForestClassifier(max_features='sqrt', bootstrap=True)
cls.fit(x_train, y_train)
x_test_pred = cls.predict(x_test)
print(classification_report(y_test, x_test_pred))
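
A minimal sketch of the two importance measures from the list above, reusing the fitted cls; note that sklearn's permutation_importance permutes features on whatever set you pass it (here the test set), which stands in for the OOB-permutation idea:

from sklearn.inspection import permutation_importance

# (b) mean decrease in impurity, computed during training
print(cls.feature_importances_)

# (a) permutation idea: shuffle one feature, measure the drop in score
perm = permutation_importance(cls, x_test, y_test, n_repeats=10, random_state=0)
print(perm.importances_mean)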


'''
n_estimators : number of trees (default 100)

criterion : split quality measure; default 'gini', with 'entropy' also available

max_depth : maximum depth of each tree. If None, nodes are expanded until all
    leaves are pure or contain fewer than min_samples_split samples.

min_samples_split : minimum number of samples required to split an internal
    node (default 2); nodes with fewer samples are not split further

min_samples_leaf : int, float, optional (default=1)
    The minimum number of samples required to be at a leaf node. A split point
    at any depth will only be considered if it leaves at least min_samples_leaf
    training samples in each of the left and right branches. This may have the
    effect of smoothing the model, especially in regression.

min_weight_fraction_leaf : float, optional (default=0.)
    The minimum weighted fraction of the sum total of weights (of all the input
    samples) required to be at a leaf node. Samples have equal weight when
    sample_weight is not provided.

max_features : the number of features each tree randomly considers at a split.
    If int, consider max_features features at each split.
    If float, max_features is a fraction and int(max_features * n_features)
    features are considered at each split.
    If "auto", max_features=sqrt(n_features).
    If "sqrt", max_features=sqrt(n_features) (same as "auto").
    If "log2", max_features=log2(n_features).
    If None, max_features=n_features.

max_leaf_nodes : int or None, optional (default=None)
    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are
    defined as relative reduction in impurity. If None, the number of leaf
    nodes is unlimited.

min_impurity_split : threshold for early stopping (default=1e-7): a node is
    split if its impurity is above the threshold, otherwise it becomes a leaf
    (deprecated in later sklearn versions in favor of min_impurity_decrease)

bootstrap : default True; if False, the whole dataset (no resampling) is used
    to build each tree

oob_score : default False; whether to use out-of-bag samples to estimate the
    generalization accuracy

n_jobs : default None, equivalent to 1 (a single job); -1 uses all processors
'''
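
A usage sketch for oob_score (reusing the iris split above): with bootstrap=True, the out-of-bag samples give a generalization estimate without a separate validation set:

cls_oob = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True, n_jobs=-1)
cls_oob.fit(x_train, y_train)
print(cls_oob.oob_score_)  # accuracy estimated on out-of-bag samples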
  • gbdt (gradient boosting decision trees)
'''
loss : loss function; default 'deviance' (logistic loss), or 'exponential'
    (exponential loss, which recovers AdaBoost-style boosting)
learning_rate : learning rate, analogous to the one in AdaBoost
n_estimators : number of boosting iterations
subsample : fraction of samples each base learner is fit on, default 1; if
    below 1, each tree uses a random subsample rather than the full training
    set, which reduces variance (better generalization) at the cost of higher
    bias (a looser fit); see the sketch after the code below
criterion : split criterion, default 'friedman_mse'; 'mse' (mean squared
    error) and 'mae' (mean absolute error) are also available
min_samples_split : nodes with fewer samples than this are not split further
min_samples_leaf : a leaf node must contain at least this many samples
max_depth : maximum depth of each tree
min_impurity_decrease : a node is split only if the split decreases impurity
    by at least this value
min_impurity_split : a node keeps splitting while its impurity is above this
    threshold; otherwise it becomes a leaf
max_features : as in random forests, the features considered at each split
    are randomly sampled
max_leaf_nodes : maximum number of leaf nodes, grown best-first
'''

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
x = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

cls = GradientBoostingClassifier()
cls.fit(x_train, y_train)
x_test_pred = cls.predict(x_test)
result = classification_report(y_test, x_test_pred)
print(result)
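
To make the subsample point concrete, a hedged stochastic-gradient-boosting variant (values illustrative; reuses the imports and split above) — each tree sees a random 80% of the training data, trading a little bias for lower variance:

cls_sgb = GradientBoostingClassifier(subsample=0.8, learning_rate=0.1, n_estimators=200)
cls_sgb.fit(x_train, y_train)
print(classification_report(y_test, cls_sgb.predict(x_test)))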
  • xgboost
from xgboost.sklearn import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

cls = XGBClassifier()
cls.fit(x_train, y_train)
x_test_pred = cls.predict(x_test)
result = classification_report(y_test, x_test_pred)
print(result)
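
XGBClassifier exposes knobs analogous to GBDT's; a hedged sketch with a few common ones (values illustrative, not tuned; reuses the split above):

# learning_rate / n_estimators trade off as in GBDT; subsample and
# colsample_bytree subsample rows and columns; max_depth limits each tree
cls_tuned = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                          subsample=0.8, colsample_bytree=0.8)
cls_tuned.fit(x_train, y_train)
print(classification_report(y_test, cls_tuned.predict(x_test)))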

 
