DW Ensemble Learning Task 6: Boosting Assignment

momokofly

Article link: https://blog.csdn.net/momokofly/article/details/119025396

(I already finished the videos in Task 4, so this post mainly writes up answers to the key questions.)

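1. The basic idea of AdaBoost

Step 1: Give every sample a weight, initialised so that all samples have the same weight;
Step 2: Train a (simple) model using the current sample weights;
Step 3: Based on the model's predictions, decrease the weights of correctly classified samples and increase the weights of misclassified samples;
Step 4: Retrain a (simple) model with the new sample weights, and repeat for a number of rounds;
Step 5: Linearly combine the (simple) models from all rounds into a composite model, which is the final model.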
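
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scikit-learn implementation: it assumes labels in {-1, +1}, uses depth-1 decision trees as the simple model, and uses the standard discrete AdaBoost model weight alpha = 0.5 * ln((1 - err) / err), which the steps above do not spell out. The names `adaboost_fit`, `adaboost_predict`, `X_toy` and `y_toy` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=3):
    """Discrete AdaBoost sketch for labels in {-1, +1}; returns (stumps, alphas)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # Step 1: equal initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):                # Step 4: repeat for several rounds
        # Step 2: fit a simple model (a decision stump) with the current weights
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])           # weighted error (weights sum to 1)
        if err >= 0.5:                       # no better than chance: stop early
            break
        err = max(err, 1e-10)                # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        stumps.append(stump)
        alphas.append(alpha)
        # Step 3: y * pred is +1 for correct samples (weight shrinks)
        # and -1 for misclassified samples (weight grows)
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
    return stumps, alphas


def adaboost_predict(X, stumps, alphas):
    # Step 5: weighted linear combination of the simple models
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)


# Toy 1-D data that no single threshold can separate
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

single = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)
stump_acc = np.mean(single.predict(X_toy) == y_toy)

stumps, alphas = adaboost_fit(X_toy, y_toy, n_rounds=3)
ensemble_acc = np.mean(adaboost_predict(X_toy, stumps, alphas) == y_toy)
print("single stump accuracy:", stump_acc)
print("boosted accuracy:     ", ensemble_acc)
```

Each round sees the same ten points but with different weights, so later stumps concentrate on the points that earlier stumps got wrong.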

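2. How are AdaBoost and GBDT related, and how do they differ?

Connection: AdaBoost and GBDT are both methods within the Boosting framework.
Difference: AdaBoost locates the model's shortcomings by raising the weights of misclassified data points, whereas Gradient Boosting locates them by computing gradients. As a result, Gradient Boosting can use a wider range of objective functions than AdaBoost. When the objective is the squared error (with the conventional factor of 1/2), the negative gradient of the loss evaluated at the current model is exactly the residual.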
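
To make the last point concrete: for L = 0.5 * (y - F)^2, the derivative with respect to the prediction F is -(y - F), so the negative gradient is y - F, the residual. The sketch below fits each new tree to that residual. It is an illustrative toy, assuming regression stumps as base learners, a fixed learning rate, and synthetic data; the names `gradient_boost_fit`, `gradient_boost_predict`, `X_demo` and `y_demo` are invented for this example and it is not how any particular GBDT library is implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gradient_boost_fit(X, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting sketch for the squared-error loss."""
    f0 = float(np.mean(y))                   # start from a constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=1, random_state=0)
        tree.fit(X, residual)                # the new tree models the residual
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees


def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Final model: initial constant plus the scaled sum of all trees
    return f0 + learning_rate * sum(t.predict(X) for t in trees)


# Synthetic data (an assumption for the demo): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 6, 80).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=80)

f0, trees = gradient_boost_fit(X_demo, y_demo)
mse_start = np.mean((y_demo - f0) ** 2)
mse_end = np.mean((y_demo - gradient_boost_predict(X_demo, f0, trees)) ** 2)
print("training MSE, constant model:", mse_start)
print("training MSE, boosted model: ", mse_end)
```

Swapping in a different loss only changes the line that computes `residual`, which is why Gradient Boosting supports more objective functions than AdaBoost.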

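3. How do Boosting and Bagging differ, and how does each improve model accuracy?

What Bagging and Boosting have in common: both are methods for improving model performance, and both combine weak classifiers into a strong classifier that classifies better than any single weak classifier.
Differences:
(1) Training samples. Bagging improves performance by training multiple base classifiers on resampled data: each base classifier's training set is drawn from the full sample, the draws are independent of one another, and every sample has the same weight. Boosting improves performance by changing the sample weights: every base classifier sees the same training samples, only with different weights, where misclassified samples are weighted up and correctly classified samples are weighted down.
(2) Combining the results. In Bagging, every base classifier's output has the same weight. In Boosting, classifiers with a small classification error get a large weight, and classifiers with a large error get a small weight.
(3) Parallelism. In the Boosting family, the classifiers depend on one another, because the sample weights for the next base classifier are adjusted according to the error of the previous one, so training must be sequential. Examples are AdaBoost, GBDT (Gradient Boosting Decision Tree) and XGBoost. In the Bagging family, the classifiers do not depend on one another and can be trained in parallel. An example is Random Forest.
(4) Bias and variance. Bagging improves accuracy mainly by reducing variance. Boosting improves accuracy mainly by reducing bias.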
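
For contrast with Boosting, here is a minimal hand-written Bagging sketch: every tree is trained independently on its own bootstrap sample, and all votes count equally. It assumes binary labels 0/1, fully grown decision trees as base classifiers, and a synthetic two-moons dataset; the names `bagging_fit` and `bagging_predict` are invented for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bagging_fit(X, y, n_models=25, seed=0):
    """Train n_models trees, each on an independent bootstrap sample."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(y) indices with replacement, all samples equally likely
        idx = rng.integers(0, len(y), size=len(y))
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def bagging_predict(models, X):
    # Unweighted majority vote over the base classifiers (labels are 0/1)
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic data (an assumption for the demo)
X_m, y_m = make_moons(n_samples=200, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.25, random_state=0)

models = bagging_fit(X_tr, y_tr)
bag_pred = bagging_predict(models, X_te)
bag_acc = np.mean(bag_pred == y_te)
print("bagged trees test accuracy:", bag_acc)
```

No iteration of the loop in `bagging_fit` uses anything from a previous iteration, so the trees could be trained in parallel. In a boosting loop, each round needs the errors of the round before it.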

4. Fit a basic classification model and a Boosting model, and plot their decision boundaries.

A single decision tree and an AdaBoost model are used here.

# Fit a basic classifier and a boosted model, then plot their decision boundaries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Set the maximum number of rows and columns pandas displays
pd.set_option('display.max_columns',50)
pd.set_option('display.max_rows',300)

# Widen the display so each row prints on a single line
pd.set_option('display.width',1000)


# Load the data:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols','Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
# Show the wine classes
print("Class labels", np.unique(wine["Class label"]))
# Show the first five rows
print(wine.head())

# Keep only class 2 and class 3 wines; drop class 1
wine = wine[wine['Class label'] != 1]
y = wine['Class label'].values
X = wine[['Alcohol', 'OD280/OD315 of diluted wines']].values  # use two columns as the features X
# Encode the class labels as 0/1
le = LabelEncoder()
y = le.fit_transform(y)
# Split into training and test sets, 80:20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1,stratify=y)
# stratify=y keeps the class proportions of y the same in both splits

# Model with a single decision tree (a depth-1 stump)
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=1)
from sklearn.metrics import accuracy_score
tree = tree.fit(X_train,y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test,y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train,tree_test))
# Decision tree train/test accuracies 0.916/0.875

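'''
AdaBoostClassifier parameters:
base_estimator: the base classifier, DecisionTreeClassifier(max_depth=1) by default
                (named `estimator` in scikit-learn 1.2 and later)
n_estimators: maximum number of boosting rounds before training stops
learning_rate: the learning rate
algorithm: the boosting algorithm, {'SAMME', 'SAMME.R'}; the default was 'SAMME.R'
           in the scikit-learn version used when this was written
random_state: random seed
'''
from sklearn.ensemble import AdaBoostClassifier
# The base classifier is passed positionally so the call works whether the
# parameter is named base_estimator (older releases) or estimator (newer ones)
ada = AdaBoostClassifier(tree,n_estimators=500,learning_rate=0.1,random_state=1)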
ada = ada.fit(X_train,y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train,y_train_pred)
ada_test = accuracy_score(y_test,y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train,ada_test))
# Adaboost train/test accuracies 1.000/0.917


# Plot the decision boundaries of the decision stump and of AdaBoost:
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1
# Build coordinate matrices from the two coordinate vectors
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),np.arange(y_min, y_max, 0.1))
# nrows, ncols: layout of the subplot grid (1 row, 2 columns)
f, axarr = plt.subplots(nrows=1, ncols=2,sharex='col',sharey='row',figsize=(12, 6))
for idx, clf, tt in zip([0, 1],[tree, ada],['Decision tree', 'Adaboost']):
    # zip packs the corresponding elements of each list into tuples
    clf.fit(X_train, y_train)
    # Predict the class at every grid point and reshape back to the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==0, 0],X_train[y_train==0, 1],c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==1, 0],X_train[y_train==1, 1],c='red', marker='o')
    axarr[idx].set_title(tt)
# X[:, 0] is Alcohol (x-axis) and X[:, 1] is OD280/OD315 (y-axis)
axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)
plt.tight_layout()
# Shared x-axis label, centred between the two panels
plt.text(0, -0.2,s='Alcohol',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)
plt.show()

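The decision boundaries are shown in the figure:
[Figure: decision boundaries of the single decision tree (left) and AdaBoost (right)]
References: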
http://www.bubuko.com/infodetail-3336406.html
https://blog.csdn.net/chengfulukou/article/details/76906710
http://ixyzero.com/blog/archives/4242.html
https://blog.csdn.net/dabingsun/article/details/103145562
https://blog.csdn.net/u012867518/article/details/115919618