
1. Ensemble Methods

To construct a strong learner, we first choose a weak learning algorithm and train it repeatedly on the same training set, improving its performance round by round. AdaBoost maintains two kinds of weights: each training sample carries a weight, collected in the vector $D$, and each weak classifier carries a weight, collected in the vector $\alpha$. Given a training set of $n$ samples $\left\{ \left(X_1,y_1\right),\left(X_2,y_2\right),\cdots,\left(X_n,y_n\right) \right\}$, every sample weight is initially set equal, i.e. $\frac{1}{n}$. The first weak learner $h_1$ is trained on this set, and after training its error rate $\varepsilon$ is computed:

$\varepsilon = \frac{\#\,\text{error}}{\#\,\text{all}}$

The weight of this first weak classifier is then

$\alpha_1 = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right)$
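As a quick sanity check of the formula above, here is a minimal sketch (the helper name `stump_alpha` is illustrative, not from the article's code): a weak learner that misclassifies 1 of 5 equally weighted samples has $\varepsilon = 0.2$ and $\alpha = \frac{1}{2}\ln 4 \approx 0.693$.

```python
import math

def stump_alpha(error_rate):
    """Classifier weight alpha = 0.5 * ln((1 - eps) / eps)."""
    eps = max(error_rate, 1e-16)  # guard against log(0) when the learner is perfect
    return 0.5 * math.log((1.0 - eps) / eps)

print(stump_alpha(0.2))  # 1 of 5 equally weighted samples wrong -> about 0.693
```

Note that $\alpha$ is positive only when $\varepsilon < 0.5$, i.e. when the weak learner beats random guessing; at $\varepsilon = 0.5$ its vote carries zero weight.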

After the first round, the sample weights are re-adjusted so that the weight of every sample misclassified in that round increases, letting subsequent rounds focus on those hard samples:

$\begin{aligned} D_{t+1}\left(i\right) &= \frac{D_t\left(i\right)}{Z_t}\times \begin{cases} e^{-\alpha_t} & \text{if } h_t\left(x_i\right)=y_i \\ e^{\alpha_t} & \text{if } h_t\left(x_i\right)\neq y_i \end{cases}\\ &= \frac{D_t\left(i\right)\exp\left(-\alpha_t y_i h_t\left(x_i\right)\right)}{Z_t} \end{aligned}$

where $Z_t$ is the normalization factor (the sum of the unnormalized weights) that makes the updated $D_{t+1}$ sum to one:

$Z_t = \sum_{i=1}^{n} D_t\left(i\right)\exp\left(-\alpha_t y_i h_t\left(x_i\right)\right)$
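One round of this weight update can be sketched as follows (the helper name `reweight` is illustrative): with $\varepsilon = 1/5$ we have $e^{-\alpha_t} = 1/2$ and $e^{\alpha_t} = 2$, so after normalizing by $Z_t$ the one misclassified sample's weight grows from 0.2 to 0.5 while each correctly classified sample shrinks to 0.125.

```python
import numpy as np

def reweight(D, alpha, y, h):
    """One round of the update above: scale each weight by exp(-alpha * y_i * h_t(x_i)),
    then divide by Z_t (the sum of the scaled weights) so they sum to 1."""
    D_new = D * np.exp(-alpha * y * h)
    return D_new / D_new.sum()

# five equally weighted samples; the stump misclassifies only the third one
y = np.array([1., 1., -1., -1., 1.])
h = np.array([1., 1., 1., -1., 1.])
alpha = 0.5 * np.log((1 - 0.2) / 0.2)  # eps = 1/5
D = reweight(np.full(5, 0.2), alpha, y, h)
print(D)  # misclassified sample -> 0.5, the other four -> 0.125 each
```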

After $t$ rounds, the final strong classifier is the sign of the weighted vote of the weak classifiers:

$H\left(X\right) = \operatorname{sign}\left(\sum_{i=1}^{t}\alpha_i h_i\left(X\right)\right)$

(Figure from reference 1)

(From reference 2)

Python code
# coding: UTF-8
'''
Created on 2015-06-15

@author: zhaozhiyong
'''

from numpy import *

def loadSimpleData():
    datMat = mat([[1., 2.1],
                  [2., 1.1],
                  [1.3, 1.],
                  [1., 1.],
                  [2., 1.]])
    classLabels = mat([1.0, 1.0, -1.0, -1.0, 1.0])
    return datMat, classLabels

def singleStumpClassipy(dataMat, dim, threshold, thresholdIneq):
    classMat = ones((shape(dataMat)[0], 1))
    # split into classes '-1' and '1' according to thresholdIneq
    if thresholdIneq == 'left':  # samples at or below the threshold get '-1'
        classMat[dataMat[:, dim] <= threshold] = -1.0
    else:
        classMat[dataMat[:, dim] > threshold] = -1.0
    return classMat

def singleStump(dataArr, classLabels, D):
    dataMat = mat(dataArr)
    labelMat = mat(classLabels).T
    m, n = shape(dataMat)
    numSteps = 10.0
    bestStump = {}
    bestClasEst = zeros((m, 1))
    minError = inf
    for i in range(n):  # for every feature
        # use the min and max of column i to set the step size
        rangeMin = dataMat[:, i].min()
        rangeMax = dataMat[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):
            # we do not know in advance which side is class '-1', so try both
            for inequal in ['left', 'right']:
                threshold = rangeMin + j * stepSize  # candidate split threshold
                predictionClass = singleStumpClassipy(dataMat, i, threshold, inequal)
                errorMat = ones((m, 1))
                errorMat[predictionClass == labelMat] = 0
                weightedError = D.T * errorMat  # D holds the sample weights
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictionClass.copy()
                    bestStump['dim'] = i
                    bestStump['threshold'] = threshold
                    bestStump['inequal'] = inequal
    return bestStump, minError, bestClasEst

def adaBoostTrain(dataArr, classLabels, G):
    weakClassArr = []
    m = shape(dataArr)[0]  # number of samples
    # initialize D, the weight of each sample, uniformly
    D = mat(ones((m, 1)) / m)
    aggClasEst = mat(zeros((m, 1)))

    for i in range(G):  # G is the number of boosting rounds
        bestStump, minError, bestClasEst = singleStump(dataArr, classLabels, D)
        print('D:', D.T)
        # compute the weight of this weak classifier
        alpha = float(0.5 * log((1.0 - minError) / max(minError, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        print('bestClasEst:', bestClasEst.T)

        # recompute the sample weights D
        expon = multiply(-1 * alpha * mat(classLabels).T, bestClasEst)
        D = multiply(D, exp(expon))
        D = D / D.sum()

        aggClasEst += alpha * bestClasEst
        print('aggClasEst:', aggClasEst)
        aggErrors = multiply(sign(aggClasEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print('total error:', errorRate)
        if errorRate == 0.0:
            break
    return weakClassArr

def adaBoostClassify(testData, weakClassify):
    dataMat = mat(testData)
    m = shape(dataMat)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(weakClassify)):  # weakClassify is a list of trained stumps
        classEst = singleStumpClassipy(dataMat, weakClassify[i]['dim'], weakClassify[i]['threshold'], weakClassify[i]['inequal'])
        aggClassEst += weakClassify[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)

if __name__ == '__main__':
    datMat, classLabels = loadSimpleData()
    weakClassArr = adaBoostTrain(datMat, classLabels, 30)
    print("weakClassArr:", weakClassArr)
    # test
    result = adaBoostClassify([1, 1], weakClassArr)
    print(result)


weakClassArr: [{'threshold': 1.3, 'dim': 0, 'inequal': 'left', 'alpha': 0.6931471805599453}, {'threshold': 1.0, 'dim': 1, 'inequal': 'left', 'alpha': 0.9729550745276565}, {'threshold': 0.90000000000000002, 'dim': 0, 'inequal': 'left', 'alpha': 0.8958797346140273}]
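This output can be cross-checked by hand (the helper `apply_stumps` below is illustrative, not part of the article's code): for the test point $[1, 1]$, the three stumps vote $-0.693 - 0.973 + 0.896 \approx -0.77$, so the sign of the weighted sum is $-1$, matching what `adaBoostClassify` returns.

```python
def apply_stumps(x, stumps):
    """Weighted vote of trained decision stumps on a single point x."""
    total = 0.0
    for s in stumps:
        # 'left' means values at or below the threshold are classed as -1
        pred = -1.0 if x[s['dim']] <= s['threshold'] else 1.0
        if s['inequal'] != 'left':
            pred = -pred
        total += s['alpha'] * pred
    return 1 if total > 0 else -1

stumps = [
    {'threshold': 1.3, 'dim': 0, 'inequal': 'left', 'alpha': 0.6931471805599453},
    {'threshold': 1.0, 'dim': 1, 'inequal': 'left', 'alpha': 0.9729550745276565},
    {'threshold': 0.9, 'dim': 0, 'inequal': 'left', 'alpha': 0.8958797346140273},
]
print(apply_stumps([1.0, 1.0], stumps))  # -> -1
```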

References:
1. Machine Learning in Action
2. A Short Introduction to Boosting