- Bagging: draw multiple random subsets (bootstrap samples) from the training set and train a separate model on each.
- Boosting: train a sequence of models in which each model corrects the errors of the one before it.
- Voting: train multiple models and combine their predictions with simple statistics (such as a majority vote) to improve accuracy.
1. Bagging
Bagging improves classification accuracy by combining the votes of multiple models into a final answer. For example, suppose you fall ill and see n doctors at n hospitals, and each doctor writes you a prescription: the prescription that appears most often is the one most likely to be the best. Three bagging models are introduced below:
- Bagged Decision Trees
- Random Forest
- Extra Trees
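The doctor analogy above is just a majority vote. A minimal sketch (the model predictions are made up for illustration):

```python
from collections import Counter

# Hypothetical predictions from five independently trained models
# for one sample; the labels are illustrative only.
predictions = ["A", "B", "A", "A", "B"]

# Majority vote: the most frequent prediction wins.
winner, count = Counter(predictions).most_common(1)[0]
print(winner)  # most common prediction: "A"
```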
1.1 Bagged Decision Trees
Bagging is very effective when the data has high variance. The classification and regression tree (CART) algorithm can be bagged with BaggingClassifier.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split into input features (X) and the output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when random_state is set
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=seed)  # use base_estimator= on scikit-learn < 1.2
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
Compared with the score of about 0.69 from the earlier spot-check of classification algorithms, this is a clear improvement.
1.2 Random Forest
A random forest is made up of many decision trees, and there is no correlation between the trees. When a new input sample arrives, every tree in the forest classifies it independently, and the class that receives the most votes becomes the prediction.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split into input features (X) and the output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 100
max_features = 3
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
1.3 Extra Trees
Extra trees are very similar to random forests; the main differences are:
- A random forest uses the bagging model (bootstrap samples), while extra trees build every decision tree from the complete training set, i.e. every tree uses the same, full set of training samples.
- A random forest searches a random subset of features for the optimal split, whereas extra trees pick the split points completely at random to grow the tree.
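The split-selection difference can be sketched with toy code (the feature values and the scoring stub below are made up; a real tree would score splits with an impurity criterion such as Gini or entropy):

```python
import random

random.seed(0)
feature_values = [1.2, 3.4, 2.2, 5.0, 4.1]  # toy values of one feature at a tree node

def split_score(threshold):
    # Stand-in for an impurity-based score; this toy version just
    # favors thresholds that split the samples evenly.
    left = [v for v in feature_values if v <= threshold]
    right = [v for v in feature_values if v > threshold]
    return min(len(left), len(right))

# Random-forest style: evaluate every candidate threshold and keep the best.
candidates = sorted(feature_values)[:-1]
best_threshold = max(candidates, key=split_score)

# Extra-trees style: draw one threshold uniformly at random, with no search.
random_threshold = random.uniform(min(feature_values), max(feature_values))
```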
The implementing class is ExtraTreesClassifier.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split into input features (X) and the output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 100
max_features = 3
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
2. Boosting
2.1 AdaBoost
AdaBoost is an iterative algorithm. Its core idea is to train a series of different classifiers on the same training set and then combine these weak classifiers into a stronger final classifier.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split into input features (X) and the output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 30
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
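The "correct the previous model's errors" idea works through sample reweighting. A minimal sketch of one AdaBoost round, assuming labels in {-1, +1} and a weak learner's predictions already computed (both arrays are hypothetical):

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the weak learner misclassifies sample 1
w = np.full(5, 1 / 5)                  # start with uniform sample weights

err = np.sum(w[y_true != y_pred])        # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)    # the learner's vote weight in the ensemble
w = w * np.exp(-alpha * y_true * y_pred)  # up-weight the misclassified samples
w = w / w.sum()                          # renormalize to a probability distribution
```

After the update, the misclassified sample carries more weight, so the next weak learner focuses on it.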
2.2 Stochastic Gradient Boosting
Stochastic gradient boosting (GBM) is based on the idea that the best way to optimize a function is to explore along its gradient: to minimize a loss, take steps in the direction of the negative gradient.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split into input features (X) and the output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
num_trees = 100
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
result = cross_val_score(model, X, Y, cv=kfold)
print(result.mean())
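The gradient idea behind GBM can be sketched in isolation. A toy gradient descent on f(x) = (x - 3)^2, whose gradient is 2(x - 3) and whose minimum sits at x = 3 (the function and learning rate are illustrative):

```python
x = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (x - 3)            # gradient of the loss at the current point
    x = x - learning_rate * grad  # step against the gradient
print(round(x, 4))  # converges to 3.0
```

Gradient boosting applies the same principle in function space: each new tree fits the negative gradient of the loss with respect to the current ensemble's predictions.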
3. Voting
Voting works by creating two or more algorithm models and wrapping them with a voting classifier, which aggregates the sub-models' predictions (for classification, by majority vote) into a final prediction.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(filename, names=names)
# Split into input features (X) and the output label (Y)
array = dataset.values
X = array[:, 0:8]
Y = array[:, 8]
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = []
model_logistic = LogisticRegression(max_iter=1000)  # raise the iteration cap so the solver converges
model_cart = DecisionTreeClassifier()
model_svc = SVC()
model.append(('logistic', model_logistic))
model.append(('cart', model_cart))
model.append(('svc', model_svc))
ensemble_model = VotingClassifier(estimators=model)
result = cross_val_score(ensemble_model, X, Y, cv=kfold)
print(result.mean())
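Once cross-validation looks good, the ensemble is fit and used like any other classifier. A sketch using synthetic data in place of pima_data.csv (the data shapes mirror the example above but the values are generated):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the Pima dataset: 8 features, binary label.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

ensemble = VotingClassifier(estimators=[
    ('logistic', LogisticRegression(max_iter=1000)),
    ('cart', DecisionTreeClassifier(random_state=7)),
    ('svc', SVC(random_state=7)),
])  # voting='hard' by default: majority vote over predicted class labels
ensemble.fit(X, y)
preds = ensemble.predict(X[:5])  # class label for each of the first 5 samples
```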