Machine Learning in Action (Peter Harrington): AdaBoost

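My understanding:

AdaBoost is a technique for improving a learning algorithm: it combines many models trained with that algorithm into one stronger model.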

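For any data set, whether the task is binary or multi-class, finding a near-perfect classifier with a very small error rate (say 1%), called a strong model below, is hard and may take many iterations of computation. By contrast, finding a very simple model with a fairly high error rate (say 30%), called a weak model below, is easy. (The extreme case is predicting with random numbers. That would not actually work, since a weak model still has to do better than chance, but it illustrates the idea.) This raises a question: can a linear combination of several weak models match the performance of a strong model? Probabilistically, this is feasible, and AdaBoost is a weak-to-strong algorithm built on exactly this idea.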
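The claim that combining weak models can help is easy to check numerically in an idealized setting. The sketch below assumes that the n weak classifiers make independent errors with the same error rate (real classifiers trained on the same data rarely satisfy this) and computes the probability that a simple majority vote is wrong. The function name `majority_vote_error` is made up for this illustration.

```python
from math import factorial


def n_choose_k(n, k):
    # binomial coefficient, written out so the code also runs on Python 3.7
    return factorial(n) // (factorial(k) * factorial(n - k))


def majority_vote_error(n, eps):
    """Probability that more than half of n independent voters are wrong.

    Assumes n is odd and every voter errs independently with probability eps.
    """
    k_min = n // 2 + 1  # smallest number of wrong votes that flips the majority
    return sum(n_choose_k(n, k) * eps ** k * (1 - eps) ** (n - k)
               for k in range(k_min, n + 1))


# error of the vote for a growing number of weak classifiers (eps = 30%)
for n in (1, 5, 21, 101):
    print(n, majority_vote_error(n, 0.3))
```

Under these assumptions the error of the vote shrinks as n grows. AdaBoost does not rely on independence; it builds its classifiers one after another and weights their votes, as described below.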

 

Algorithm overview:

Before introducing Boosting, let's look at a similar method: Bagging.

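Bagging is based on resampling the data set. From the original data set it builds S new data sets (S_{1}, S_{2}, ..., S_{S}), each the same size as the original. Every sample in a new data set is drawn at random, with replacement, from the original, so a new data set may contain the same sample several times (in the extreme case, every entry of a new data set is a copy of one and the same original sample). The same learning algorithm is then trained on each of the S data sets, which gives S classifiers. To classify a new data point, each of the S classifiers makes a prediction and the class with the most votes is the final result (for example, with 7 classifiers, if 4 of them predict the positive class, the final result is positive).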
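To make the two ingredients of Bagging concrete, here is a minimal sketch of one bootstrap draw and one majority vote. The data set size `m` and the `votes` array are hypothetical values chosen for the illustration; the vote reuses the 7-classifier example from the paragraph above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
m = 1000                         # size of a hypothetical original data set

# one bootstrap sample: m row indices drawn with replacement
rows = rng.integers(0, m, size=m)
unique_fraction = np.unique(rows).size / m
print('fraction of distinct original rows:', unique_fraction)

# majority vote over the 0/1 predictions of 7 classifiers
votes = np.array([1, 0, 1, 1, 0, 1, 0])
prediction = int(votes.sum() > votes.size / 2)
print('votes for class 1:', votes.sum(), '-> prediction:', prediction)
```

Because the rows are drawn with replacement, a bootstrap sample of a large data set contains on average about 1 - 1/e (roughly 63%) of the distinct original rows; the rest are repeats.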

# Bagging algorithm
from numpy import *


def sigmoid(X):
    return 1.0 / (1.0 + exp(-X))


def createNewData(dataArr, labelArr):
    '''
    # draw a bootstrap sample (rows drawn with replacement) from the original dataSet
    # input:    dataArr(list)       original dataSet
    #           labelArr(list)      original labels (0/1)
    # output:   newDataMat(mat)     resampled data matrix
    #           newLabelMat(mat)    labels of the resampled rows (column vector)
    '''
    dataMat = mat(dataArr)
    labelMat = mat(labelArr).reshape(-1, 1)
    m = shape(dataMat)[0]
    # sample whole rows so that every feature vector stays paired with its label
    rows = random.randint(0, m, size=m)
    return dataMat[rows, :], labelMat[rows, :]


def logisticTrain(dataMat, labelMat, numIter):
    '''
    # logistic model train function (stochastic gradient ascent)
    # input:    dataMat(mat)    data matrix, one sample per row
    #           labelMat(mat)   label column vector (0/1)
    #           numIter(int)    max iteration number
    # output:   weights(mat)    model weight values (n x 1)
    '''
    dataMat = mat(dataMat)
    labelMat = mat(labelMat).reshape(-1, 1)
    m, n = shape(dataMat)
    weights = mat(ones((n, 1)))
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            # step size decays over the iterations but never reaches 0
            alpha = 4 / (1.0 + j + i) + 0.01
            # pick one of the samples not yet used in this pass
            randIndex = int(random.uniform(0, len(dataIndex)))
            sample = dataIndex[randIndex]
            h = sigmoid((dataMat[sample] * weights)[0, 0])
            error = labelMat[sample, 0] - h
            weights = weights + alpha * error * dataMat[sample].T     # w += alpha * (y - h) * x.T
            del(dataIndex[randIndex])
    return weights


def classifier(testPoint, weights):
    # majority vote over all trained classifiers (one weight vector per row of weights)
    positiveNum = 0; negativeNum = 0
    for i in range(shape(weights)[0]):
        hypothesis = sigmoid((mat(testPoint) * mat(weights[i,:]).T)[0, 0])
        if hypothesis > 0.5:
            positiveNum += 1
        else:
            negativeNum += 1
    if positiveNum > negativeNum:
        return 1
    else:
        return 0


def Bagging(dataArr, labelArr, numClas):
    '''
    # Bagging algorithm
    # input:    dataArr(list)   original data points
    #           labelArr(list)  original labels (0/1)
    #           numClas(int)    number of classifiers to train
    # output:   weightsMat(mat)     classifier models, one weight vector per row
    '''
    xMat = mat(dataArr); yMat = mat(labelArr).T
    # use an odd number of classifiers so that the vote can never tie
    if numClas % 2 == 0:  numClas += 1
    m, n = shape(xMat)
    weightsMat = zeros((numClas, n))
    # train the models here
    # other models such as logistic regression or naive Bayes could be used instead
    # (a lazy learner such as kNN needs no training step)
    for i in range(numClas):
        newDataMat, newLabelMat = createNewData(xMat, yMat)
        weights = logisticTrain(newDataMat, newLabelMat, 4000)
        weightsMat[i,:] = weights.transpose()
        print('\n classifier %d has been trained' % i)
    return weightsMat
# end

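A logistic classifier is used here, but the results are not very good: even with the number of iterations raised to 4000 and the number of classifiers raised to 40, the error rate is still 19% on the training set and 39% on the test set.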

 

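Boosting also builds S classifiers that all use the same kind of model. Unlike Bagging, the S classifiers are trained in sequence: each one is trained using the results of the one before it. A sample that classifier i-1 classified correctly gets a lower weight in classifier i, and a sample that classifier i-1 misclassified gets a higher weight, so classifier i pays more attention to the samples the previous classifier got wrong. There are many Boosting algorithms; here we use the most popular one, AdaBoost (Adaptive Boosting).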

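AdaBoost assigns a weight to every classifier (note that this weight belongs to a whole classifier, whereas the weights mentioned above belong to individual samples in the data set). The classifier weight alpha is computed from that classifier's error rate, i.e. the number of misclassified samples divided by the total number of samples. Once the sample weights are no longer uniform, the error rate is the weighted version of this ratio: the sum of the weights D of the misclassified samples. The formula is: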

\alpha = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right), where \varepsilon is the error rate.
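A quick way to get a feel for this formula is to evaluate it for a few error rates. The helper name `classifier_weight` is made up for this sketch; the `1e-16` guard against a zero error rate mirrors the one used in the full code further down.

```python
from math import log


def classifier_weight(eps):
    # alpha = 1/2 * ln((1 - eps) / eps); max() avoids dividing by zero when eps is 0
    return 0.5 * log((1.0 - eps) / max(eps, 1e-16))


for eps in (0.1, 0.3, 0.5, 0.7):
    print('error rate %.1f -> alpha %.4f' % (eps, classifier_weight(eps)))
```

The smaller the error rate, the larger alpha. At an error rate of exactly 0.5 (no better than a coin flip) alpha is 0, and above 0.5 it turns negative, which amounts to inverting that classifier's vote.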

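With alpha computed, we can update the weight vector so that correctly classified samples get lower weights and misclassified samples get higher weights. The update rule for the weight vector D is: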

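D^{(t+1)}_{i} = \frac{D^{(t)}_{i} e^{\lambda_{i}\alpha}}{\sum_{j} D^{(t)}_{j} e^{\lambda_{j}\alpha}}, where \lambda_{i} = -1 if sample i was classified correctly and \lambda_{i} = 1 if it was misclassified. The denominator sums over all samples so that the updated weights again add up to 1.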
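One round of this update can be worked through by hand or in a few lines of NumPy. The sketch below uses five samples with uniform initial weights and a hypothetical weak classifier that gets only the first sample wrong; the `predictions` array is invented for the illustration.

```python
import numpy as np

labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])        # true classes
predictions = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])  # hypothetical weak classifier output
D = np.full(5, 1.0 / 5)                               # uniform initial weights

eps = D[predictions != labels].sum()                  # weighted error rate
alpha = 0.5 * np.log((1.0 - eps) / eps)

# labels * predictions is +1 for a correct sample and -1 for a wrong one,
# so -labels * predictions plays the role of lambda in the formula
D_new = D * np.exp(-alpha * labels * predictions)
D_new /= D_new.sum()                                  # renormalize to sum to 1

print('eps =', eps, 'alpha =', alpha)
print('D_new =', D_new)
```

Here the error rate is 0.2, so alpha is 0.5 * ln(4). The weight of the misclassified sample rises from 0.2 to 0.5, while each of the four correctly classified samples drops from 0.2 to 0.125.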

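Based on the above, by building S classifiers and combining them, each weighted by its alpha, we obtain the final classifier f(x) = sign(\sum_{t=1}^{S} \alpha_{t} h_{t}(x)), where h_{t} is the t-th weak classifier and \alpha_{t} is its weight.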
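The combination step is a weighted vote: each weak classifier outputs +1 or -1, the outputs are multiplied by the classifier weights alpha and summed, and the sign of the sum is the final class. A minimal sketch with made-up predictions and alpha values:

```python
import numpy as np

# rows: weak classifiers, columns: samples (hypothetical +1/-1 predictions)
weak_predictions = np.array([[ 1.0, -1.0,  1.0],
                             [ 1.0,  1.0, -1.0],
                             [-1.0,  1.0,  1.0]])
alphas = np.array([0.9, 0.4, 0.3])   # hypothetical classifier weights

agg = alphas @ weak_predictions       # weighted sum of the votes for each sample
final = np.sign(agg)                  # final class of each sample

print('weighted sums:', agg)
print('final classes:', final)
```

For the second sample, two of the three classifiers vote +1, but the single classifier voting -1 carries more weight than the other two together, so the final class is -1. This is the difference from the plain majority vote used in Bagging.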

The full code is below (Python 3.7.3):

from numpy import *
import matplotlib.pyplot as plt


def loadSimpData():
    datMat = matrix([[1.0, 2.1],
                    [2.0, 1.1],
                    [1.3, 1.0],
                    [1.0, 1.0],
                    [2.0, 1.0]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels


def plotPoint(datMat, classLabels):
    # Two-class drawing
    m, n = shape(datMat)
    X00 = []; X01 = []; X10 = []; X11 = []
    maxVal = max(classLabels); minVal = min(classLabels)
    for i in range(len(classLabels)):
        if classLabels[i] == maxVal:
            X00.append(datMat[i][0,0])
            X01.append(datMat[i][0,1])  
        elif classLabels[i] == minVal:
            X10.append(datMat[i][0,0])
            X11.append(datMat[i][0,1])  
        else:
            raise NameError('Please check the Label data')
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(X00, X01, s=30, c='red', marker='v')
    ax.scatter(X10, X11, s=30, c='green')
    plt.show()


# classify the data with a single-level decision tree (decision stump)
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray


def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr)
    labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0
    bestStump = {}
    bestClasEst = mat(zeros((m, 1)))
    # start from infinity so that the first candidate always becomes the best so far
    minError = inf
    for i in range(n):
        rangeMin = dataMatrix[:, i].min()
        rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        # try thresholds from one step below the minimum up to the maximum
        for j in range(-1, int(numSteps) + 1):
            #               -1.0    x <= threshold              1.0     x <= threshold
            # lt: f(x) =                            gt: f(x) = 
            #               1.0     x >  threshold              -1.0    x >  threshold
            for inequal in ['lt', 'gt']:
                # candidate split point on feature i
                threshVal = (rangeMin + float(j) * stepSize)
                # classify every sample with this split point
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                # weighted error rate: sum of the weights D of the misclassified samples
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = (D.T * errArr)[0, 0]

                print ('split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error is %.3f' % (i, threshVal, inequal, weightedError))
                # keep this stump if its weighted error is the smallest seen so far
                if weightedError < minError:
                    minError = weightedError
                    # remember its predictions
                    bestClasEst = predictedVals.copy()
                    # remember the best feature (x1, x2, ..., xn), threshold and inequality
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst
# end


# full AdaBoost training function, using decision stumps as weak classifiers
def adaBoostTrainDS(dataArr, classLabels, numIt):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m)
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        print('D:',D.T)
        # compute the classifier weight alpha; max() avoids dividing by zero when error is 0
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        # store the weak classifier
        weakClassArr.append(bestStump)
        print('classEst:', classEst.T)
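        # exponent is -alpha for correctly classified samples and +alpha for misclassified ones
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        # update the sample weights D and renormalize them so that they sum to 1
        D = multiply(D, exp(expon))
        D = D/D.sum()
        # accumulate the weighted votes of all classifiers trained so far
        aggClassEst += alpha * classEst
        print('aggClassEst:', aggClassEst.T)
        # training error of the combined classifier (two-class problem, labels +1/-1)
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))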
        errorRate = aggErrors.sum() / m
        print('total error:', errorRate, '\n')
        if errorRate == 0.0:
            break
    return weakClassArr, aggClassEst


# classify with multiple weak classifiers
def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'], classifierArr[i]['thresh'], classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)


# horse colic prediction
# load a tab-separated data file; the last column is the class label
def processDat(filename):
    dataArr = []; labelArr = []
    # 'with' closes the file even if a line fails to parse
    with open(filename) as fr:
        for line in fr:
            lineArr = []
            currLine = line.strip().split('\t')
            for i in range(len(currLine) - 1):
                lineArr.append(float(currLine[i]))
            dataArr.append(lineArr)
            labelArr.append(float(currLine[-1]))
    return dataArr, labelArr


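def plotROC(predStrengths, classLabels):
    # start at the top-right corner (1, 1) and walk toward (0, 0)
    cur = (1.0, 1.0); ySum = 0.0
    numPosClas = sum(array(classLabels) == 1.0)
    # step sizes
    # Y axis is the true positive rate
    # X axis is the false positive rate
    yStep = 1 / float(numPosClas)
    xStep = 1 / float(len(classLabels) - numPosClas)
    # indices of the samples sorted from lowest to highest prediction strength
    sortedIndicies = predStrengths.argsort()
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            # raising the threshold past a positive sample lowers the true positive rate
            delX = 0; delY = yStep
        else:
            # raising the threshold past a negative sample lowers the false positive rate
            delX = xStep; delY = 0
            # add the height of this column to the AUC sum
            ySum += cur[1]
        # draw the segment from the current point to the next one
        ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
        cur = (cur[0] - delX, cur[1] - delY)
    ax.plot([0,1], [0,1], 'b--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    ax.axis([0, 1, 0, 1])
    plt.show()
    # every column has width xStep, so the area is the sum of heights times xStep
    print('the Area Under the Curve is:', ySum * xStep)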


if __name__ == "__main__":
    # datMat, classLabels = loadSimpData()
    # plotPoint(datMat, classLabels)
    # D = mat(ones((5, 1)) / 5)
    # buildStump(datMat, classLabels, D)
    # classifierArray, aggClassEst = adaBoostTrainDS(datMat, classLabels, 9)
    dataArr, labelArr = processDat('horseColicTest.txt')
    classifierArray, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 1)
    plotROC(aggClassEst.T, labelArr)
    

 

 

 

 

 

 
