Model Optimization: Ensemble Algorithms

  • Bagging: draw several bootstrap subsets from the training set, train one model on each subset, and combine their predictions.
  • Boosting: train a sequence of models in which each model corrects the errors of the one before it.
  • Voting: train several different models and combine their predictions by voting to improve accuracy.

1. Bagging

Bagging improves classification accuracy by combining the votes of several models into one final answer. For example, suppose you fall ill and visit n hospitals to see n doctors, and each doctor writes you a prescription: the prescription that appears most often is the one most likely to be the best choice (a minimal sketch of this majority-vote idea follows the list below). Three bagging models are introduced here:

  • 装袋决策树(Bagged Decision Trees)
  • 随机森林(Random Forest)
  • 极端随机树(Extra Trees)
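
As promised, a minimal sketch of the majority-vote idea behind the doctor analogy (the prescription labels here are made up purely for illustration):

from collections import Counter

# Hypothetical prescriptions from five doctors for the same patient
prescriptions = ['A', 'B', 'A', 'C', 'A']
# The most frequent answer wins the vote
best, votes = Counter(prescriptions).most_common(1)[0]
print(best, votes)  # A 3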

1.1 Bagged Decision Trees

Bagging is very effective when the data has high variance. The example below uses BaggingClassifier to bag classification and regression trees (CART).

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split the data into input features (X) and output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when random_state is set
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=seed)  # named base_estimator before scikit-learn 1.2
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())

This is a considerable improvement over the 0.69 obtained earlier when spot-checking individual classification algorithms.
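
As a side note, BaggingClassifier can also estimate accuracy from the out-of-bag rows (the samples each bootstrap draw leaves out), with no separate cross-validation loop. A small sketch, reusing the X, Y, cart, and seed defined above:

# oob_score=True scores each sample on the trees whose bootstrap draw missed it
oob_model = BaggingClassifier(estimator=cart, n_estimators=100, oob_score=True, random_state=seed)
oob_model.fit(X, Y)
print(oob_model.oob_score_)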

1.2 Random Forest

A random forest is made up of many decision trees, with no dependence between the individual trees. When a new sample arrives, every tree in the forest classifies it independently, and the class chosen most often becomes the prediction.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split the data into input features (X) and output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 100
max_features = 3
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
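
A fitted random forest also exposes feature_importances_, which shows how much each of the eight inputs contributes to the splits. A short sketch, reusing the objects defined above:

# Fit on the full data just to inspect the learned importances
model.fit(X, Y)
for name, score in zip(names[:8], model.feature_importances_):
    print('%s: %.3f' % (name, score))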

1.3 Extra Trees

Extra trees are very similar to random forests; the main differences are:

  • A random forest applies the bagging model (each tree sees a bootstrap sample), whereas extra trees by default build every decision tree from the same, complete training set (bootstrap=False in scikit-learn).
  • A random forest searches a random subset of features for the optimal split, whereas extra trees split completely at random rather than searching for the best split.

The implementing class is ExtraTreesClassifier.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split the data into input features (X) and output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 100
max_features = 3
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
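
Since extra trees differ from a random forest only in how splits are drawn, it is instructive to score both on the same folds. A minimal sketch, reusing the variables above (the random forest class must be imported here as well):

from sklearn.ensemble import RandomForestClassifier

# Same folds and tree count: only the split strategy differs
for label, clf in [('random forest', RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)),
                   ('extra trees', ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed))]:
    scores = cross_val_score(clf, X, Y, cv=kfold)
    print('%s: %.3f' % (label, scores.mean()))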

2. Boosting

2.1 AdaBoost

AdaBoost is an iterative algorithm. Its core idea is to train a series of different weak classifiers on the same training set (re-weighting the samples after each round) and then combine these weak classifiers into a single, stronger final classifier.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split the data into input features (X) and output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 30
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
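
By default, AdaBoostClassifier boosts decision stumps (trees of depth 1). Making that explicit also exposes learning_rate, the weight applied to each classifier in the sequence. A sketch reusing the variables above (in scikit-learn < 1.2 the parameter was named base_estimator):

from sklearn.tree import DecisionTreeClassifier

# 30 boosted decision stumps; learning_rate=1.0 is the default step size
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=num_trees, learning_rate=1.0, random_state=seed)
print(cross_val_score(ada, X, Y, cv=kfold).mean())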

2.2 Stochastic Gradient Boosting

Stochastic gradient boosting (GBM) builds on the idea that the best way to minimize a loss function is to step along its negative gradient: each new model is fitted to the errors (the negative gradient of the loss) left by the models trained so far. The "stochastic" part comes from subsampling the training rows at each stage; see the sketch after the example below.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split the data into input features (X) and output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 100
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
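
The code above keeps GradientBoostingClassifier's default subsample=1.0, which is plain (deterministic) gradient boosting. Setting subsample below 1.0 fits each stage on a random fraction of the rows, which is what puts the "stochastic" in stochastic gradient boosting. A sketch with an assumed 80% subsample:

# Each boosting stage sees a random 80% of the training rows
sgb = GradientBoostingClassifier(n_estimators=num_trees, subsample=0.8, random_state=seed)
print(cross_val_score(sgb, X, Y, cv=kfold).mean())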

3. Voting

Voting works by building two or more models, wrapping them in a voting ensemble, and combining the sub-models' predictions, by majority vote or by averaging predicted probabilities.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split the data into input features (X) and output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
models = []
model_logistic = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
model_cart = DecisionTreeClassifier()
model_svc = SVC()
models.append(('logistic', model_logistic))
models.append(('cart', model_cart))
models.append(('svc', model_svc))
ensemble_model = VotingClassifier(estimators=models)  # defaults to hard (majority) voting
result = cross_val_score(ensemble_model, X, Y, cv=kfold)
print(result.mean())
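
VotingClassifier defaults to hard (majority) voting. Soft voting averages the predicted class probabilities instead, which requires every sub-model to implement predict_proba; for SVC that means passing probability=True. A sketch reusing the folds above:

# Soft voting averages predict_proba outputs, so SVC needs probability=True
soft_models = [('logistic', LogisticRegression(max_iter=1000)),
               ('cart', DecisionTreeClassifier()),
               ('svc', SVC(probability=True))]
soft_ensemble = VotingClassifier(estimators=soft_models, voting='soft')
print(cross_val_score(soft_ensemble, X, Y, cv=kfold).mean())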
