Every classifier has its own strengths and weaknesses. A meta-algorithm combines multiple classifiers and takes the ensemble's combined output as the classification result. Two families of meta-algorithms are in common use: bagging and boosting. The best-known bagging method is the random forest, while the best-known boosting method is AdaBoost.
bagging: the classifiers are trained independently (in parallel), each on a dataset resampled with replacement from the original training set, and they vote with equal weight;
boosting: the classifiers are trained serially; each new classifier focuses on the data misclassified by the classifiers built so far, and each classifier's vote is weighted by its accuracy in its training round.
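The two boosting weights just mentioned can be made concrete with AdaBoost's formulas: the classifier weight is alpha = 0.5 * ln((1 - eps) / eps), where eps is the weighted error rate, and each sample weight is multiplied by e^(-alpha * y * h(x)) and renormalized. A toy sketch (the error value, labels, and predictions below are made up for illustration):

```python
import numpy as np

eps = 0.2                                 # hypothetical weak-classifier error rate
alpha = 0.5 * np.log((1.0 - eps) / eps)   # larger alpha for a more accurate classifier

y = np.array([1, 1, -1, -1, 1])           # true labels
h = np.array([1, 1, -1, -1, -1])          # stump predictions (last sample is wrong)
D = np.ones(5) / 5                        # sample weights, initially uniform

D = D * np.exp(-alpha * y * h)            # misclassified sample's weight grows by e^alpha
D = D / D.sum()                           # renormalize so the weights form a distribution
```

After this update the one misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.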
Using a single-level decision tree (a decision stump) as the weak classifier, the AdaBoost training code is implemented as follows:
from numpy import *

def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    # Inputs: training data, class labels, maximum number of iterations
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m)              # sample weights, initially uniform
    aggClassEst = mat(zeros((m, 1)))       # aggregate class estimate
    for i in range(numIt):
        # build the weak classifier (a decision stump) under the current weights
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        print("D:", D.T)
        # classifier weight; max(error, 1e-16) guards against division by zero
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        print("classEst: ", classEst.T)
        # increase the weights of misclassified samples, decrease the rest
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        D = multiply(D, exp(expon))
        D = D / D.sum()
        aggClassEst += alpha * classEst
        print("aggClassEst: ", aggClassEst.T)
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m    # training error of the ensemble so far
        print("total error: ", errorRate)
        if errorRate == 0.0:
            break
    return weakClassArr, aggClassEst
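adaBoostTrainDS calls buildStump, which is assumed to be defined beforehand; it is the decision-stump weak classifier named above. A minimal sketch of buildStump and its helper stumpClassify, consistent with the dictionary keys ('dim', 'thresh', 'ineq') used in this post, follows; the toy five-point dataset at the end is made up for illustration:

```python
from numpy import *

def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    # Classify every sample by comparing one feature against a threshold.
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray

def buildStump(dataArr, classLabels, D):
    # Exhaustively search (feature, threshold, inequality direction) for the
    # stump with the lowest weighted error under sample weights D.
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m, 1)))
    minError = inf
    for i in range(n):                           # every feature
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):   # every candidate threshold
            for inequal in ('lt', 'gt'):         # both inequality directions
                threshVal = rangeMin + float(j) * stepSize
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = float(D.T * errArr)   # weighted error rate
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump = {'dim': i, 'thresh': threshVal, 'ineq': inequal}
    return bestStump, minError, bestClasEst

# Made-up toy dataset: three positive and two negative points
dataArr = [[1., 2.1], [2., 1.1], [1.3, 1.], [1., 1.], [2., 1.]]
labels = [1.0, 1.0, -1.0, -1.0, 1.0]
D = mat(ones((5, 1)) / 5)
stump, err, est = buildStump(dataArr, labels, D)
print(stump, err)   # stump with the lowest weighted error
```

On this toy set the data are not linearly separable by any single threshold, so the best stump still misclassifies one of the five points (weighted error 0.2 under uniform weights), which is exactly why boosting then reweights the samples.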
The classification function is implemented as follows:
def adaClassify(datToClass, classifierArr):
    # Inputs: data to classify; the list of weak classifiers from training
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):
        # each stump votes, weighted by its alpha
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)
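Putting it all together, here is a self-contained sketch that trains and then classifies, with the print statements dropped and condensed copies of the helpers repeated so the script runs on its own; the five-point dataset is made up for illustration:

```python
from numpy import *

def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    # Threshold one feature; 'lt' sends values <= threshold to class -1.
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray

def buildStump(dataArr, classLabels, D):
    # Pick the stump minimizing the weighted error under weights D.
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    bestStump = {}; bestClasEst = mat(zeros((m, 1))); minError = inf
    for i in range(n):
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / 10.0
        for j in range(-1, 11):
            for inequal in ('lt', 'gt'):
                threshVal = rangeMin + j * stepSize
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = mat(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = float(D.T * errArr)
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump = {'dim': i, 'thresh': threshVal, 'ineq': inequal}
    return bestStump, minError, bestClasEst

def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    weakClassArr = []; m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m); aggClassEst = mat(zeros((m, 1)))
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        D = multiply(D, exp(expon)); D = D / D.sum()
        aggClassEst += alpha * classEst
        errorRate = multiply(sign(aggClassEst) != mat(classLabels).T,
                             ones((m, 1))).sum() / m
        if errorRate == 0.0:
            break                      # stop once the training error hits zero
    return weakClassArr, aggClassEst

def adaClassify(datToClass, classifierArr):
    dataMatrix = mat(datToClass); m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m, 1)))
    for stump in classifierArr:        # weighted vote of all stumps
        classEst = stumpClassify(dataMatrix, stump['dim'],
                                 stump['thresh'], stump['ineq'])
        aggClassEst += stump['alpha'] * classEst
    return sign(aggClassEst)

# Made-up dataset: three positive and two negative points
dataArr = [[1., 2.1], [2., 1.1], [1.3, 1.], [1., 1.], [2., 1.]]
labels = [1.0, 1.0, -1.0, -1.0, 1.0]
classifiers, _ = adaBoostTrainDS(dataArr, labels, 9)
preds = adaClassify([[5., 5.], [0., 0.]], classifiers)
print(preds)   # classifies [5, 5] as +1 and [0, 0] as -1
```

Note that although no single stump separates this dataset, the weighted combination of a few stumps reaches zero training error, which is the point of the boosting meta-algorithm.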