Personal understanding:
AdaBoost is an ensemble method: a way of combining models, rather than a new base model.
For any dataset (whether the problem is binary or multi-class), finding a near-perfect classifier, i.e. one with a very low error rate such as 1% (call it a strong model), is very hard and may take many training iterations. By contrast, finding a simple model with a relatively high error rate such as 30% (call it a weak model) is easy; the extreme case is predicting with random numbers, which of course is useless, but it illustrates the point. This raises a question: can a linear combination of several weak models match the performance of a strong model? Probability theory suggests it can, and AdaBoost is exactly such a weak-to-strong algorithm.
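To see why this is plausible, here is a quick back-of-the-envelope calculation (a hypothetical illustration, not part of AdaBoost itself): if we could build n independent classifiers that are each wrong 30% of the time, the majority vote fails only when more than half of them err, a binomial tail probability that shrinks rapidly as n grows.

```python
from math import comb

def majority_error(n, p):
    """Probability that a majority vote of n independent classifiers,
    each wrong with probability p, gives the wrong answer."""
    # the vote fails when more than half of the voters are wrong
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_error(1, 0.3))    # a single weak model: 0.3
print(majority_error(7, 0.3))    # 7 voters: roughly 0.126
print(majority_error(41, 0.3))   # 41 voters: well below 0.01
```

Weak learners trained on the same data are correlated rather than independent, so the real gain is smaller, but the direction of the effect is what motivates ensemble methods.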
Algorithm introduction:
Before introducing Boosting, we first look at a similar algorithm --- Bagging.
Bagging is based on resampling the dataset. It rebuilds S new datasets from the original one, each the same size as the original; every sample in a new dataset is drawn at random, with replacement, from the original dataset, so a new dataset may contain duplicates (in the extreme case, all samples in a new dataset are copies of the same original sample). We then train the same model on each of the S datasets to obtain S classifiers. To classify a new point, we run all S classifiers and take the majority vote as the final result (for example, with 7 classifiers, if 4 predict positive, the final prediction is positive).
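To make the resampling step concrete before the full listing, here is a minimal, self-contained sketch of one bootstrap draw in pure Python (the helper name `bootstrap` and the toy data are made up for illustration). Note that each row is copied together with its label, so feature vectors and labels stay paired:

```python
import random

def bootstrap(data, labels, seed=None):
    """Draw a dataset of the same size by sampling rows with
    replacement; each row keeps its own label."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(data)) for _ in range(len(data))]
    return [data[i] for i in idx], [labels[i] for i in idx]

# toy data, made up for illustration
data = [[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]]
labels = [1, 1, 0, 0, 1]
newData, newLabels = bootstrap(data, labels, seed=42)
print(newData)       # same size as data, possibly with repeated rows
```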
# Bagging algorithm
from numpy import *

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def createNewData(dataMat, labelMat):
    '''
    # bootstrap: build a new dataset by sampling rows of the
    # original dataset with replacement (labels stay paired with rows)
    # input:  dataMat(mat)   original data matrix
    #         labelMat(mat)  original label column vector
    # output: newDataMat(mat), newLabelMat(mat)
    '''
    m, n = shape(dataMat)
    newDataMat = zeros((m, n))
    newLabelMat = zeros((m, 1))
    for i in range(m):
        x = random.randint(m)
        newDataMat[i, :] = dataMat[x, :]
        newLabelMat[i, 0] = labelMat[x, 0]
    return mat(newDataMat), mat(newLabelMat)

def logisticTrain(dataMat, labelMat, numIter):
    '''
    # logistic model training by stochastic gradient ascent
    # input:  dataMat(mat)   data matrix
    #         labelMat(mat)  label column vector
    #         numIter(int)   max iteration number
    # output: weights(mat)   model weight vector
    '''
    m, n = shape(dataMat)
    weights = ones((n, 1))
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01   # decaying step size
            randIndex = int(random.uniform(0, len(dataIndex)))
            sampleIdx = dataIndex[randIndex]
            h = sigmoid(sum(dataMat[sampleIdx] * weights))
            error = labelMat[sampleIdx] - h
            weights = weights + alpha * mat(dataMat[sampleIdx]).transpose() * error  # x.T * (y - h)
            del(dataIndex[randIndex])
    return weights

def classifier(testPoint, weights):
    # majority vote over the S logistic classifiers
    positiveNum = 0; negativeNum = 0
    for i in range(shape(weights)[0]):
        hypothesis = sigmoid(testPoint * mat(weights[i, :]).T)
        if hypothesis > 0.5:
            positiveNum += 1
        else:
            negativeNum += 1
    if positiveNum > negativeNum:
        return 1
    else:
        return 0

def Bagging(dataArr, labelArr, numClas):
    '''
    # Bagging algorithm
    # input:  dataArr(list)   original data points
    #         labelArr(list)  original labels
    #         numClas(int)    number of classifiers
    # output: weightsMat(mat) classifier weights, one row per model
    '''
    xMat = mat(dataArr); yMat = mat(labelArr).T
    if numClas % 2 == 0: numClas += 1   # keep it odd so the vote cannot tie
    m, n = shape(xMat)
    weightsMat = zeros((numClas, n))
    # any base model can be plugged in here (logistic, bayes, ...)
    for i in range(numClas):
        newDataMat, newLabelMat = createNewData(xMat, yMat)
        weights = logisticTrain(newDataMat, newLabelMat, 4000)
        weightsMat[i, :] = weights.transpose()
        print('\n %d th classifier has been done' % i)
    return weightsMat
# end
Here a logistic classifier is used as the base model, but the results are not great: even with 4000 training iterations and 40 classifiers, the error rate is still 19% on the training set and 39% on the test set.
Boosting also builds S classifiers, all using the same base model, but unlike Bagging the S classifiers are trained in sequence: classifier i+1 is trained based on the results of classifier i. Samples that classifier i classifies correctly have their weights decreased for classifier i+1; conversely, samples that classifier i misclassifies have their weights increased for classifier i+1. As a result, classifier i+1 pays more attention to the samples the previous classifier got wrong. There are many Boosting variants; here we use the most popular one, AdaBoost (Adaptive Boosting).
AdaBoost also assigns each classifier its own weight (note: this weight applies to the classifier as a whole, whereas the weights above apply to individual samples in the dataset). A classifier's weight alpha is computed from its error rate ε (number of misclassified samples / total number of samples):

α = (1/2) ln((1 − ε) / ε)

where ε denotes the error rate. Using this α, we update the sample-weight vector D so that correctly classified samples get smaller weights and misclassified samples get larger ones:

D_i^(t+1) = D_i^(t) · e^(−α) / Sum(D)   for correctly classified samples,
D_i^(t+1) = D_i^(t) · e^(+α) / Sum(D)   for misclassified samples,

where Sum(D) renormalizes the weights so they again sum to one.
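A small numeric sketch of these two updates (the helper names and toy numbers are made up for illustration):

```python
from math import log, exp

def classifier_alpha(error):
    # alpha = 0.5 * ln((1 - eps) / eps); clip eps to avoid division by zero
    return 0.5 * log((1.0 - error) / max(error, 1e-16))

def update_weights(D, alpha, correct):
    # correctly classified samples are scaled by e^{-alpha},
    # misclassified ones by e^{+alpha}, then D is renormalized
    newD = [d * exp(-alpha if ok else alpha) for d, ok in zip(D, correct)]
    s = sum(newD)
    return [d / s for d in newD]

alpha = classifier_alpha(0.3)     # about 0.4236 for a 30%-error classifier
D = [0.2] * 5                     # uniform initial sample weights
D = update_weights(D, alpha, [True, True, True, False, False])
print(alpha)
print(D)    # the two misclassified samples now carry more weight
```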
With the above in place, we build S such classifiers, combine them, and obtain the final classifier.
The complete code is below (Python version 3.7.3):
from numpy import *
import matplotlib.pyplot as plt

def loadSimpData():
    datMat = matrix([[1.0, 2.1],
                     [2.0, 1.1],
                     [1.3, 1.0],
                     [1.0, 1.0],
                     [2.0, 1.0]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels
def plotPoint(datMat, classLabels):
    # scatter plot of the two classes
    m, n = shape(datMat)
    X00 = []; X01 = []; X10 = []; X11 = []
    maxVal = max(classLabels); minVal = min(classLabels)
    for i in range(len(classLabels)):
        if classLabels[i] == maxVal:
            X00.append(datMat[i][0, 0])
            X01.append(datMat[i][0, 1])
        elif classLabels[i] == minVal:
            X10.append(datMat[i][0, 0])
            X11.append(datMat[i][0, 1])
        else:
            raise NameError('Please check the label data')
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(X00, X01, s=30, c='red', marker='v')
    ax.scatter(X10, X11, s=30, c='green')
    plt.show()
# single-level decision tree (decision stump)
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray
def buildStump(dataArr, classLabels, D):
    dataMatrix = mat(dataArr)
    labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0
    bestStump = {}
    bestClasEst = mat(zeros((m, 1)))
    # inf means positive infinity
    minError = inf
    for i in range(n):
        rangeMin = dataMatrix[:, i].min()
        rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        # slide the threshold across the feature's range
        for j in range(-1, int(numSteps) + 1):
            # 'lt': f(x) = -1.0 if x <= threshold else  1.0
            # 'gt': f(x) =  1.0 if x <= threshold else -1.0
            for inequal in ['lt', 'gt']:
                # candidate split point for the training data
                threshVal = (rangeMin + float(j) * stepSize)
                # divide the dataset at the split point
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                # compute the weighted classification error
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T * errArr
                print('split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error is %.3f' % (i, threshVal, inequal, weightedError))
                # keep this stump if its weighted error is the smallest so far
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    # best feature / threshold / inequality found so far
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst
# end
# complete AdaBoost training function
def adaBoostTrainDS(dataArr, classLabels, numIt):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m)
    aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        print('D:', D.T)
        # classifier weight: alpha = 0.5 * ln((1 - error) / error);
        # max(error, 1e-16) guards against division by zero
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        # store the weak classifier
        weakClassArr.append(bestStump)
        print('classEst:', classEst.T)
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        # update the sample weights (D is the vector of sample weights)
        D = multiply(D, exp(expon))
        D = D / D.sum()
        # accumulate the weighted votes of all classifiers so far
        aggClassEst += alpha * classEst
        print('aggClassEst:', aggClassEst.T)
        # training error of the combined (signed) classifier
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print('total error:', errorRate, '\n')
        if errorRate == 0.0:
            break
    return weakClassArr, aggClassEst
# classify with the ensemble of weak classifiers
def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'], classifierArr[i]['thresh'], classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)
# horse colic prediction
# load a tab-separated dataset; the last column is the label
def processDat(filename):
    dataArr = []; labelArr = []
    fr = open(filename)
    for line in fr.readlines():
        lineArr = []
        currLine = line.strip().split('\t')
        for i in range(len(currLine) - 1):
            lineArr.append(float(currLine[i]))
        dataArr.append(lineArr)
        labelArr.append(float(currLine[-1]))
    return dataArr, labelArr
def plotROC(predStrengths, classLabels):
    # start at the top-right corner of the ROC plot
    cur = (1.0, 1.0); ySum = 0.0
    numPosClas = sum(array(classLabels) == 1.0)
    # step sizes: the y axis is the true positive rate,
    # the x axis is the false positive rate
    yStep = 1 / float(numPosClas)
    xStep = 1 / float(len(classLabels) - numPosClas)
    # sort the prediction strengths
    sortedIndicies = predStrengths.argsort()
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep
        else:
            delX = xStep; delY = 0
            # accumulate the height of each horizontal step for the AUC
            ySum += cur[1]
        # draw the next ROC segment from cur to (cur - delta)
        ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
        cur = (cur[0] - delX, cur[1] - delY)
    ax.plot([0, 1], [0, 1], 'b--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    ax.axis([0, 1, 0, 1])
    print('the Area Under the Curve is:', ySum * xStep)
    plt.show()
if __name__ == "__main__":
    # datMat, classLabels = loadSimpData()
    # plotPoint(datMat, classLabels)
    # D = mat(ones((5, 1)) / 5)
    # buildStump(datMat, classLabels, D)
    # classifierArray = adaBoostTrainDS(datMat, classLabels, 9)
    dataArr, labelArr = processDat('horseColicTest.txt')
    classifierArray, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 1)
    plotROC(aggClassEst.T, labelArr)
    pass
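As a sanity check on the routines above, here is a condensed, pure-Python version of the same stump-search and weight-update loop, run on the five-point toy set from loadSimpData (mirroring the commented-out lines in __main__). With the same threshold scan as buildStump, it reaches zero training error after three rounds:

```python
from math import log, exp

# five-point toy set from loadSimpData
data = [[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]]
labels = [1.0, 1.0, -1.0, -1.0, 1.0]

def stump_predict(x, dim, thresh, ineq):
    # 'lt': predict -1 at or below the threshold, else +1 ('gt' is the mirror)
    if ineq == 'lt':
        return -1.0 if x[dim] <= thresh else 1.0
    return 1.0 if x[dim] <= thresh else -1.0

def best_stump(D):
    # same scan as buildStump: 10 steps per feature, both inequalities
    best_err, best_pred = float('inf'), None
    for dim in range(len(data[0])):
        lo = min(x[dim] for x in data)
        hi = max(x[dim] for x in data)
        step = (hi - lo) / 10.0
        for j in range(-1, 11):
            thresh = lo + j * step
            for ineq in ('lt', 'gt'):
                pred = [stump_predict(x, dim, thresh, ineq) for x in data]
                err = sum(d for d, p, y in zip(D, pred, labels) if p != y)
                if err < best_err:
                    best_err, best_pred = err, pred
    return best_err, best_pred

D = [1.0 / len(data)] * len(data)       # uniform initial sample weights
agg = [0.0] * len(data)                 # accumulated weighted votes
rounds = 0
for t in range(9):
    err, pred = best_stump(D)
    alpha = 0.5 * log((1.0 - err) / max(err, 1e-16))
    # e^{-alpha} for correct samples, e^{+alpha} for mistakes, then renormalize
    D = [d * exp(-alpha * y * p) for d, y, p in zip(D, labels, pred)]
    s = sum(D); D = [d / s for d in D]
    agg = [a + alpha * p for a, p in zip(agg, pred)]
    rounds = t + 1
    wrong = sum(1 for a, y in zip(agg, labels) if (a > 0) != (y > 0))
    if wrong == 0:
        break
print(rounds)   # 3 rounds on this toy set
```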