The Adaboost Algorithm

1. A Brief Introduction to the Basic Principle

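Adaboost refines the general boosting idea by giving concrete answers to the two questions every boosting method faces, namely how to reweight the training data in each round and how to combine the weak classifiers:
1. Raise the weights of the samples misclassified by the previous round's classifier and lower the weights of the correctly classified samples;
2. Give classifiers with a low classification error rate a larger weight and classifiers with a high error rate a smaller weight.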

2. Formula Derivation

Suppose we are given a binary-classification training set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where each sample consists of a feature vector and a class label: the feature $x_{i} \in \mathcal{X} \subseteq \mathbf{R}^{n}$ and the label $y_{i} \in \mathcal{Y}=\{-1,+1\}$, with $\mathcal{X}$ the feature space and $\mathcal{Y}$ the label set. The output is the final classifier $G(x)$. The Adaboost algorithm is as follows:
(1) Initialize the weight distribution over the training data (assumed to be uniform): $D_{1}=\left(w_{11}, \cdots, w_{1 i}, \cdots, w_{1 N}\right), \quad w_{1 i}=\frac{1}{N}, \quad i=1,2, \cdots, N$
(2) For $m=1,2,\cdots,M$:

  • Learn from the training data with weight distribution $D_m$ to obtain the base classifier $G_{m}(x): \mathcal{X} \rightarrow\{-1,+1\}$
  • Compute the classification error rate of $G_m(x)$ on the training set: $e_{m}=\sum_{i=1}^{N} P\left(G_{m}\left(x_{i}\right) \neq y_{i}\right)=\sum_{i=1}^{N} w_{m i} I\left(G_{m}\left(x_{i}\right) \neq y_{i}\right)$
  • Compute the coefficient of $G_m(x)$: $\alpha_{m}=\frac{1}{2} \log \frac{1-e_{m}}{e_{m}}$, where log is the natural logarithm ln. The lower the classification error rate, the larger the weight of the classifier $G_m(x)$.
  • Update the weight distribution of the training data. Since $y_{i} G_{m}(x_{i})$ equals 1 for a correctly classified sample and -1 for a misclassified one, the weight $w_{m+1,i}$ of sample $i$ is increased for round $m+1$ if the sample was misclassified in round $m$, and decreased otherwise:
    $$\begin{array}{c} D_{m+1}=\left(w_{m+1,1}, \cdots, w_{m+1, i}, \cdots, w_{m+1, N}\right) \\ w_{m+1, i}=\frac{w_{m i}}{Z_{m}} \exp \left(-\alpha_{m} y_{i} G_{m}\left(x_{i}\right)\right), \quad i=1,2, \cdots, N \end{array}$$
    Here $Z_m$ is the normalization factor that makes $D_{m+1}$ a probability distribution: $Z_{m}=\sum_{i=1}^{N} w_{m i} \exp \left(-\alpha_{m} y_{i} G_{m}\left(x_{i}\right)\right)$

(3) Build the linear combination of the base classifiers $f(x)=\sum_{m=1}^{M} \alpha_{m} G_{m}(x)$ to obtain the final classifier

$$\begin{aligned} G(x) &=\operatorname{sign}(f(x)) \\ &=\operatorname{sign}\left(\sum_{m=1}^{M} \alpha_{m} G_{m}(x)\right) \end{aligned}$$
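The weight update in step (2) and the weighted vote in step (3) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a full implementation: the helper names `boost_round` and `vote` are made up for this sketch, the toy data is invented, and the weak learner's predictions are assumed to be given.

```python
import math

def boost_round(w, y, pred):
    """One Adaboost round: weighted error e_m, coefficient alpha_m, new weights D_{m+1}.

    w    -- current weights w_{m,i} (non-negative, summing to 1)
    y    -- true labels in {-1, +1}
    pred -- predictions G_m(x_i) in {-1, +1}
    Assumes 0 < e_m < 1 so that the logarithm is defined.
    """
    e = sum(wi for wi, yi, pi in zip(w, y, pred) if pi != yi)
    alpha = 0.5 * math.log((1 - e) / e)
    unnorm = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
    z = sum(unnorm)  # normalization factor Z_m
    return e, alpha, [u / z for u in unnorm]

def vote(alphas, preds):
    """Final classifier G(x) = sign(sum_m alpha_m * G_m(x)) for a single sample."""
    f = sum(a * p for a, p in zip(alphas, preds))
    return 1 if f > 0 else -1  # a tie (f == 0) is arbitrarily mapped to -1 here

# Toy round (invented data): four equally weighted samples, the last one misclassified.
w = [0.25, 0.25, 0.25, 0.25]
y = [1, 1, -1, -1]
pred = [1, 1, -1, 1]
e, alpha, w_next = boost_round(w, y, pred)
print(round(e, 4), round(alpha, 4), [round(v, 4) for v in w_next])
```

In the toy round the single misclassified sample ends up carrying half of the total weight. This is a general property of the update: under $D_{m+1}$ the classifier $G_m$ has weighted error exactly 1/2, which forces the next base classifier to focus on different samples.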

3. A Worked Example of the Adaboost Computation

The training data are given in the table below. Assume each base classifier is a split of the form $x<v$ or $x>v$, where the threshold $v$ is chosen so that the base classifier has the lowest classification error rate $e_m$ on the training set.
$$\begin{array}{ccccccccccc} \hline \text{No.} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline x & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ y & 1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 \\ \hline \end{array}$$
Solution:
Initialize the sample weight distribution:
$$\begin{aligned} D_{1} &=\left(w_{11}, w_{12}, \cdots, w_{1,10}\right) \\ w_{1 i} &=1/N=0.1, \quad i=1,2, \cdots, 10 \end{aligned}$$

Note: the threshold $v$ can take the values $v=0.5,1.5,2.5,\cdots,8.5$.
These 9 values give 9 candidate split points; for demonstration, only some of them are evaluated here.
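The threshold search described above can be sketched as an exhaustive scan over the midpoints, trying both orientations of each split. This is a minimal sketch: the function name `best_stump` and the tie-breaking rule (keep the first minimum found) are assumptions made for illustration.

```python
def best_stump(xs, ys, w):
    """Return (error, v, left_label) of the threshold stump with the lowest weighted error.

    A stump predicts left_label for x < v and -left_label for x > v.
    Candidate thresholds are the midpoints between consecutive sorted x values.
    """
    pts = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for v in thresholds:
        for left_label in (1, -1):
            err = sum(
                wi
                for xi, yi, wi in zip(xs, ys, w)
                if (left_label if xi < v else -left_label) != yi
            )
            # Strict '<' keeps the first minimum, so v=2.5 wins its tie with v=8.5 under D_1.
            if best is None or err < best[0]:
                best = (err, v, left_label)
    return best

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
d1 = [0.1] * 10  # uniform initial weights D_1

err, v, left_label = best_stump(xs, ys, d1)
print(round(err, 4), v, left_label)
```

Under the uniform weights $D_1$ the scan returns an error of 0.3 at $v=2.5$ with the label +1 on the left side.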

For m=1, train the base classifier $G_1(x)$ on the training data with weight distribution $D_1$:

  • On the training data with weight distribution $D_1$, go through every candidate threshold and compute the classification error rate $e_m$. The error rate is lowest at the threshold v=2.5, so the base classifier is:
    $$G_{1}(x)=\left\{\begin{array}{ll} 1, & x<2.5 \\ -1, & x>2.5 \end{array}\right.$$
  • The error rate of $G_1(x)$ on the training set is $e_{1}=P\left(G_{1}\left(x_{i}\right) \neq y_{i}\right)=0.3$
  • Compute the coefficient of $G_1(x)$: $\alpha_{1}=\frac{1}{2} \log \frac{1-e_{1}}{e_{1}}=0.4236$
  • Update the weight distribution of the training data:
    $$\begin{aligned} D_{2} &=\left(w_{21}, \cdots, w_{2 i}, \cdots, w_{2,10}\right) \\ w_{2 i} &=\frac{w_{1 i}}{Z_{1}} \exp \left(-\alpha_{1} y_{i} G_{1}\left(x_{i}\right)\right), \quad i=1,2, \cdots, 10 \\ D_{2} &=(0.07143,0.07143,0.07143,0.07143,0.07143,0.07143,\\ &\quad\; 0.16667,0.16667,0.16667,0.07143) \\ f_{1}(x) &=0.4236 G_{1}(x) \end{aligned}$$
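The entries of $D_2$ follow from splitting $Z_1$ into the correctly classified samples (total weight $1-e_1$, factor $e^{-\alpha_1}$) and the misclassified ones (total weight $e_1$, factor $e^{\alpha_1}$). The block below works this out with the rounded value $\alpha_1=0.4236$, so the last digits are approximate.

```latex
\begin{aligned}
Z_1 &= (1-e_1)\,e^{-\alpha_1} + e_1\,e^{\alpha_1}
     \approx 0.7 \times 0.6547 + 0.3 \times 1.5275 \approx 0.9165 \\
w_{2i} &\approx \frac{0.1 \times 0.6547}{0.9165} \approx 0.0714
     \quad \text{(correctly classified: } x = 0,\dots,5 \text{ and } x = 9\text{)} \\
w_{2i} &\approx \frac{0.1 \times 1.5275}{0.9165} \approx 0.1667
     \quad \text{(misclassified: } x = 6, 7, 8\text{)}
\end{aligned}
```

Substituting $e^{\alpha_m}=\sqrt{(1-e_m)/e_m}$ gives the closed form $Z_m=2\sqrt{e_m(1-e_m)}$, which here is $2\sqrt{0.21}\approx 0.9165$.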

For m=2, train the base classifier $G_2(x)$ on the training data with weight distribution $D_2$ (the same data set after its weights were updated from $D_1$):

  • On the training data with weight distribution $D_2$, go through every candidate threshold and compute the classification error rate $e_m$. The error rate is lowest at the threshold v=8.5, so the base classifier is:
    $$G_{2}(x)=\left\{\begin{array}{ll} 1, & x<8.5 \\ -1, & x>8.5 \end{array}\right.$$
  • The error rate of $G_2(x)$ on the training set is $e_2 = 0.2143$
  • Compute the coefficient of $G_2(x)$: $\alpha_2 = 0.6496$
  • Update the weight distribution of the training data:
    $$\begin{aligned} D_{3} &=(0.0455,0.0455,0.0455,0.1667,0.1667,0.1667,\\ &\quad\; 0.1060,0.1060,0.1060,0.0455) \\ f_{2}(x) &=0.4236 G_{1}(x)+0.6496 G_{2}(x) \end{aligned}$$

For m=3, train the base classifier $G_3(x)$ on the training data with weight distribution $D_3$ (the same data set after its weights were updated from $D_2$):

  • On the training data with weight distribution $D_3$, go through every candidate threshold and compute the classification error rate $e_m$. The error rate is lowest at the threshold v=5.5, so the base classifier is:
    $$G_{3}(x)=\left\{\begin{array}{ll} 1, & x>5.5 \\ -1, & x<5.5 \end{array}\right.$$
  • The error rate of $G_3(x)$ on the training set is $e_3 = 0.1820$
  • Compute the coefficient of $G_3(x)$: $\alpha_3 = 0.7514$
  • Update the weight distribution of the training data:
    $$D_{4}=(0.125,0.125,0.125,0.102,0.102,0.102,0.065,0.065,0.065,0.125)$$
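The three rounds above can be reproduced end to end with a short script. This is a sketch under stated assumptions: candidate thresholds are the nine midpoints 0.5 to 8.5, both orientations of each split are tried, and ties are broken in favour of the first candidate; the variable names are made up for illustration.

```python
import math

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
thresholds = [i + 0.5 for i in range(9)]  # 0.5, 1.5, ..., 8.5

def stump_predict(v, left_label, x):
    """Predict left_label for x < v and the opposite label for x > v."""
    return left_label if x < v else -left_label

w = [1 / len(xs)] * len(xs)  # D_1: uniform weights
history = []                 # (v, left_label, e_m, alpha_m) for each round
for m in range(1, 4):
    # Pick the stump with the lowest weighted error under the current weights.
    candidates = []
    for v in thresholds:
        for left_label in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict(v, left_label, x) != y)
            candidates.append((err, v, left_label))
    e, v, left_label = min(candidates, key=lambda c: c[0])  # first minimum wins ties
    alpha = 0.5 * math.log((1 - e) / e)
    # Reweight and normalize to obtain D_{m+1}.
    w = [wi * math.exp(-alpha * y * stump_predict(v, left_label, x))
         for x, y, wi in zip(xs, ys, w)]
    z = sum(w)
    w = [wi / z for wi in w]
    history.append((v, left_label, e, alpha))
    print(f"m={m}: v={v}, left label={left_label:+d}, e={e:.4f}, alpha={alpha:.4f}")
    print("   next weights:", [round(wi, 4) for wi in w])
```

The script selects the same thresholds (2.5, 8.5, 5.5) and the same weight vectors up to rounding. With unrounded arithmetic the third round gives $e_3=2/11\approx 0.1818$ and $\alpha_3\approx 0.7520$; the values 0.1820 and 0.7514 in the text result from rounding the intermediate weights.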

This gives $f_{3}(x)=0.4236 G_{1}(x)+0.6496 G_{2}(x)+0.7514 G_{3}(x)$, and the classifier $\operatorname{sign}\left[f_{3}(x)\right]$ misclassifies 0 points on the training set.

As a check, predict the sample $(x=3,y=-1)$:

  1. The classifier $G_1(x)$ predicts -1 for $x=3$
  2. The classifier $G_2(x)$ predicts 1 for $x=3$
  3. The classifier $G_3(x)$ predicts -1 for $x=3$
  4. Hence $f_3(x)=0.4236\times(-1)+0.6496\times1+0.7514\times(-1)=-0.5254$
  5. Hence the classifier $G(x)=\operatorname{sign}\left[f_{3}(x)\right]$ predicts -1, which is correct
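The same check can be run for all ten training points. The sketch below hard-codes the three base classifiers and coefficients exactly as stated in the worked example; the names `ENSEMBLE`, `f3` and `G` are chosen only for this illustration.

```python
# (alpha_m, G_m) pairs copied from the worked example.
ENSEMBLE = [
    (0.4236, lambda x: 1 if x < 2.5 else -1),  # G_1
    (0.6496, lambda x: 1 if x < 8.5 else -1),  # G_2
    (0.7514, lambda x: 1 if x > 5.5 else -1),  # G_3
]

def f3(x):
    """Weighted sum f_3(x) of the three base classifiers."""
    return sum(alpha * g(x) for alpha, g in ENSEMBLE)

def G(x):
    """Final classifier sign(f_3(x))."""
    return 1 if f3(x) > 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
errors = [x for x, y in zip(xs, ys) if G(x) != y]
print("f3(3) =", round(f3(3), 4))
print("misclassified x values:", errors)
```

Running it reports an empty list of misclassified points, in line with the zero training error stated for $\operatorname{sign}[f_3(x)]$.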

The final classifier is therefore: $G(x)=\operatorname{sign}\left[f_{3}(x)\right]=\operatorname{sign}\left[0.4236 G_{1}(x)+0.6496 G_{2}(x)+0.7514 G_{3}(x)\right]$

4. Code Example

4.1 Loading the Wine Dataset and a Brief Description

# Import the data-science packages:
import numpy as np
import pandas as pd

# Load the training data:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)
wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
                'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

# Inspect the data:
print("Class labels", np.unique(wine["Class label"]))
print(wine.head())

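A brief explanation of the columns:

  • Class label: class label, three classes in total
  • Alcohol: alcohol content
  • Malic acid: malic acid
  • Ash: ash
  • Alcalinity of ash: alkalinity of the ash
  • Magnesium: magnesium
  • Total phenols: total phenols
  • Flavanoids: flavonoids
  • Nonflavanoid phenols: non-flavonoid phenols
  • Proanthocyanins: proanthocyanidins
  • Color intensity: color intensity
  • Hue: hue
  • OD280/OD315 of diluted wines: OD280/OD315 value of diluted wines
  • Proline: proline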

4.2 Data Preprocessing and Train/Test Split

# Data preprocessing
# Keep only class 2 and class 3 wines; drop class 1
wine = wine[wine['Class label'] != 1]
y = wine['Class label'].values
X = wine[['Alcohol', 'OD280/OD315 of diluted wines']].values

# Encode the two remaining class labels as 0/1:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

# Split into training and test sets at a ratio of 8:2
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)  # stratify=y samples each class in proportion to its frequency in y

4.3 Modeling with a Single Decision Tree and with Adaboost (using sklearn)

Fit a single decision tree and predict:

# Fit a single decision tree (depth 1, i.e. a decision stump)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
tree = DecisionTreeClassifier(criterion='entropy', random_state=1, max_depth=1)
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train, tree_test))

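Fit Adaboost (with a decision tree as the base classifier) and predict:

# Adaboost with sklearn (decision tree as the base classifier)
'''
Main AdaBoostClassifier parameters:
estimator: base classifier, default DecisionTreeClassifier(max_depth=1); named base_estimator before scikit-learn 1.2
n_estimators: maximum number of boosting iterations
learning_rate: learning rate
algorithm: boosting algorithm, {'SAMME','SAMME.R'}; default 'SAMME.R' in older scikit-learn releases
random_state: random seed
'''
from sklearn.ensemble import AdaBoostClassifier
# The base classifier is passed positionally so the call works whether the parameter is named estimator or base_estimator
ada = AdaBoostClassifier(tree, n_estimators=500, learning_rate=0.1, random_state=1)
ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train, ada_test))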
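To see how accuracy evolves as base classifiers are added, `AdaBoostClassifier.staged_score` yields the score after each boosting iteration. The sketch below is self-contained and rests on several assumptions: it uses scikit-learn's bundled `load_wine` copy of the dataset as a stand-in for the UCI download (its classes are labelled 0, 1, 2 instead of 1, 2, 3), and `n_estimators=200` is an arbitrary choice, so the numbers need not match the run above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the two-class, two-feature problem from the bundled wine data.
data = load_wine()
names = list(data.feature_names)
cols = [names.index("alcohol"), names.index("od280/od315_of_diluted_wines")]
mask = data.target != 0                  # drop the first class (class 1 in the UCI file)
X = data.data[mask][:, cols]
y = data.target[mask] - 1                # relabel {1, 2} as {0, 1}
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

stump = DecisionTreeClassifier(criterion="entropy", random_state=1, max_depth=1)
# Base estimator passed positionally to work across scikit-learn versions.
ada = AdaBoostClassifier(stump, n_estimators=200, learning_rate=0.1, random_state=1)
ada.fit(X_train, y_train)

# One accuracy value per boosting iteration.
train_scores = list(ada.staged_score(X_train, y_train))
test_scores = list(ada.staged_score(X_test, y_test))
for n in (1, 10, 50, len(train_scores)):
    n = min(n, len(train_scores))
    print(f"{n:4d} estimators: train={train_scores[n - 1]:.3f} test={test_scores[n - 1]:.3f}")
```

Comparing the training and test curves is a simple way to judge whether additional boosting rounds still help or mainly fit the training set more closely.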

Reference: DataWhale/ensemble-learning
