Implementing the AdaBoost Classification Algorithm

AdaBoost is generally regarded as one of the best-performing classifiers. The overall idea is to combine several weak classifiers into a single strong classifier. It works as follows: assign a weight to every training sample, with all weights equal at the start. Using the current weights, construct a weak classifier that minimizes the weighted error rate, use that error rate to compute the classifier's own weight, and predict each sample's class; the weights of correctly classified samples are then decreased while the weights of misclassified samples are increased. Repeating this process yields a sequence of weak classifiers together with their weights, and the final prediction is obtained by feeding a linear combination of the weak classifiers' outputs into the sign function.

The pseudocode of the algorithm is as follows:
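(Shown here in the standard binary AdaBoost form, essentially Algorithm 8.1 in 《统计学习方法》.)

Input: a training set {(x_1, y_1), ..., (x_N, y_N)} with y_i in {-1, +1}, and a number of rounds M.

1. Initialize the sample weights: w_1i = 1/N for i = 1, ..., N.
2. For m = 1, ..., M:
   a. Fit a weak classifier G_m(x) that minimizes the weighted error rate
      e_m = sum_i w_mi * I(G_m(x_i) != y_i).
   b. Compute the classifier's weight: alpha_m = (1/2) * ln((1 - e_m) / e_m).
   c. Update the sample weights: w_(m+1),i = w_mi * exp(-alpha_m * y_i * G_m(x_i)),
      then normalize them so that they sum to 1.
3. Output the final classifier: G(x) = sign(sum_m alpha_m * G_m(x)).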


This post implements the AdaBoost classification algorithm in Python.

The weak classifier used is a decision stump (a one-level decision tree), whose classification rule takes the form "feature <= threshold" or "feature > threshold".

All of the code lives in a single file, adaboost.py:

from __future__ import division, print_function
import numpy as np

class AdaBoostClassifier:
    def __init__(self, n_estimators=20):
        self.n_estimators = n_estimators
        # each fitted weak classifier is stored as [dim, threshold, inequality, alpha]
        self.list_weakClassifier = []
    
    def weakClassify(self, X, dimen, val, threshIneq):
        # Decision stump: label every sample +1, then flip to -1 the samples
        # on the chosen side of threshold `val` in feature `dimen`.
        results = np.ones((X.shape[0], 1))
        if threshIneq == 'lt':
            results[X[:, dimen] <= val] = -1.0
        else:
            results[X[:, dimen] > val] = -1.0
        return results
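    # A quick illustration of weakClassify (values made up for illustration):
    # with X[:, dimen] = [0, 1, 2, 3], val = 1.5 and threshIneq = 'lt',
    # the returned labels are [-1, -1, 1, 1].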
    
    def findBestClassifier(self, X, y, w):
        # Exhaustively search every feature and a grid of numSteps+1 threshold
        # values (trying both the '<=' and '>' rules) for the decision stump
        # with the smallest weighted error under the current sample weights w.
        bestClassifier = []
        m = X.shape[0]
        n = X.shape[1]
        labelEst = np.zeros((m, 1))
        numSteps = 10
        minErr = np.inf
        for i in range(n):
            rangeMin = X[:,i].min()
            rangeMax = X[:,i].max()
            stepSize = (rangeMax - rangeMin)/numSteps
            for j in range(numSteps+1):
                for inequal in ['lt', 'gt']:
                    val = rangeMin + j*stepSize
                    predictResults = self.weakClassify(X, i, val, inequal)
                    errArr = np.ones((m,1))
                    errArr[predictResults == y] = 0
                    weightedErr = (w * errArr).sum()
                    if weightedErr < minErr:
                        labelEst = predictResults
                        minErr = weightedErr
                        bestDim = i
                        bestVal = val
                        bestIneq = inequal
        bestClassifier.extend([bestDim, bestVal, bestIneq])
        return bestClassifier, minErr, labelEst
    
    def fit(self, X, y):
        m = X.shape[0]
        w = np.ones((m, 1))/m              # start from uniform sample weights
        weightedLabelEst = np.zeros((m, 1))
        for i in range(self.n_estimators):
            bestClassifier, minErr, labelEst = self.findBestClassifier(X, y, w)
            if minErr > 0.5: break         # weak classifier worse than random guessing
            print("weighted vector: ", w.T)
            if minErr == 0:
                # a perfect weak classifier would give an infinite alpha,
                # so cap it at a large constant instead
                alpha = 1000
            else:
                alpha = 0.5*np.log((1 - minErr)/minErr)
            bestClassifier.append(alpha)
            self.list_weakClassifier.append(bestClassifier)
            print("current label estimation: ", labelEst.T)
            weightedLabelEst = weightedLabelEst + alpha*labelEst
            print("weighted label estimation: ", weightedLabelEst.T)
            finalLabelEst = np.sign(weightedLabelEst)
            errorVector = np.zeros((m, 1))
            errorVector[finalLabelEst != y] = 1
            errorRate = errorVector.sum()/m
            print("errorRate: ", errorRate)
            # stop on a perfect weak classifier, a perfect ensemble, or a weak
            # classifier that is no better than random guessing (alpha == 0)
            if minErr == 0 or errorRate == 0 or alpha == 0: break
            # reweight: misclassified samples gain weight, correct ones lose it
            w = w*np.exp(-alpha*y*labelEst)
            w = w/w.sum()
            
    def predict(self, observations):
        # Weighted vote of all stored weak classifiers: sign(sum_i alpha_i * G_i(x))
        m = observations.shape[0]
        results = np.zeros((m, 1))
        for dim, val, ineq, alpha in self.list_weakClassifier:
            results = results + alpha*self.weakClassify(observations, dim, val, ineq)
        return np.sign(results)
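
A note on cost: findBestClassifier does a brute-force grid search, evaluating roughly n * (numSteps + 1) * 2 candidate stumps per boosting round, and each evaluation scans all m samples. That is perfectly fine for small data sets like the example below, but it would be the part to optimize for anything larger.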

To check that the code works, use the training data from Example 8.1 on page 140 of 《统计学习方法》 (Li Hang's Statistical Learning Methods) and run the following in a Python shell:

import numpy as np
import adaboost
classifier = adaboost.AdaBoostClassifier()
X = np.arange(10).reshape(10,1)
y = np.array([1,1,1,-1,-1,-1,1,1,1,-1]).reshape(10,1)
classifier.fit(X, y)
Running the code above shows how the algorithm behaves on the training set:

weighted vector:  [[ 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1]]
current label estimation:  [[ 1.  1.  1. -1. -1. -1. -1. -1. -1. -1.]]
weighted label estimation:  [[ 0.42364893  0.42364893  0.42364893 -0.42364893 -0.42364893 -0.42364893
  -0.42364893 -0.42364893 -0.42364893 -0.42364893]]
errorRate:  0.3
weighted vector:  [[ 0.07142857  0.07142857  0.07142857  0.07142857  0.07142857  0.07142857
   0.16666667  0.16666667  0.16666667  0.07142857]]
current label estimation:  [[ 1.  1.  1.  1.  1.  1.  1.  1.  1. -1.]]
weighted label estimation:  [[ 1.07329042  1.07329042  1.07329042  0.22599256  0.22599256  0.22599256
   0.22599256  0.22599256  0.22599256 -1.07329042]]
errorRate:  0.3
weighted vector:  [[ 0.04545455  0.04545455  0.04545455  0.16666667  0.16666667  0.16666667
   0.10606061  0.10606061  0.10606061  0.04545455]]
current label estimation:  [[-1. -1. -1. -1. -1. -1.  1.  1.  1.  1.]]
weighted label estimation:  [[ 0.32125172  0.32125172  0.32125172 -0.52604614 -0.52604614 -0.52604614
   0.97803126  0.97803126  0.97803126 -0.32125172]]
errorRate:  0.0
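
As a sanity check on these numbers: the first round's weighted error is 0.3, so its classifier weight is alpha_1 = 0.5*ln((1 - 0.3)/0.3) ≈ 0.4236, which is exactly the magnitude in the first "weighted label estimation" line above and agrees with the α1 computed by hand in Example 8.1.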

Note that although the default number of weak classifiers is 20, the loop only ran 3 times before driving the training error rate to 0, and the result essentially agrees with the hand calculation of Example 8.1 on page 140 of 《统计学习方法》.
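
Since the training error rate reached 0, calling predict on the training samples should reproduce y exactly; a quick check, continuing the same shell session:

print(classifier.predict(X).T)

This should print the original labels, i.e. [[ 1.  1.  1. -1. -1. -1.  1.  1.  1. -1.]].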



