[Machine Learning Algorithms in Python] Implementing AdaBoost (1): the single-level decision tree (decision stump)

Latest recommended article published on 2023-09-18 10:27:07

vola9527

Reposted from: http://blog.csdn.net/buptgshengod/article/details/25049305

1. Background

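The previous section covered support vector machines, and I found the formulas hard to follow; they gave me a bit of a headache. AdaBoost, the topic of this chapter, is much easier by comparison. AdaBoost classifies using the idea of a meta-algorithm. What is that idea? Instead of trusting a single classifier, it builds a simple decision tree for each feature, gives each tree a weight that reflects how much it matters in deciding the result, and combines the weighted trees at the end. In one sentence: the final decision is a weighted vote in which the more important features count for more.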

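For example, suppose we decide a person's sex from two features: whether they have a beard, and their height. Having a beard is clearly likely to be a more accurate indicator than height, so we give that feature a larger weight in the decision, say 0.8 : 0.2. That should be more accurate than weighting the two equally at 0.5 : 0.5.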
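The weighting idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `weighted_vote`, the +1/-1 encoding of the two outcomes, and the situation where the two features disagree are all made up for the example and do not come from the original code.

```python
def weighted_vote(votes, weights):
    """Return the sign (+1, -1, or 0 for a tie) of the weighted sum of votes."""
    total = sum(v * w for v, w in zip(votes, weights))
    return (total > 0) - (total < 0)

# Hypothetical case: the beard feature votes +1, the height feature votes -1.
votes = [+1, -1]

beard_heavy = weighted_vote(votes, [0.8, 0.2])  # beard outweighs height
equal = weighted_vote(votes, [0.5, 0.5])        # equal weights cannot decide

print(beard_heavy, equal)  # prints: 1 0
```

With equal weights the two features cancel out, while the 0.8 : 0.2 weighting lets the more reliable feature settle the disagreement.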

2. Building the decision tree

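Next we build the decision tree. It needs to do two main things: find the feature that has the greatest influence on the result, and find the threshold for that feature. The threshold works like this: if the threshold is d, the result is 1 when the feature value is greater than d and 0 when it is less than d. (The code below uses -1 rather than 0 for the second class.)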

First, look at the dataset: a matrix of five samples with two features each.

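    from numpy import matrix

    def loadSimpData():
        # Five samples (rows), two features (columns)
        datMat = matrix([[1. , 2.1],
                         [2. , 1.1],
                         [1.3, 1. ],
                         [1. , 1. ],
                         [2. , 1. ]])
        # One class label (+1 or -1) per sample
        classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
        return datMat, classLabels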

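Next is the tree's classification function. It is called inside the loop further down, and its job is simple: compare the values in one feature column with the threshold and return the resulting predictions. The four parameters are (input matrix, column index, threshold, 'lt' or 'gt').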

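    from numpy import ones, shape

    def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
        # Start with every sample predicted as +1
        retArray = ones((shape(dataMatrix)[0], 1))
        if threshIneq == 'lt':
            # 'lt': values at or below the threshold are classified as -1
            retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
        else:
            # 'gt': values above the threshold are classified as -1
            retArray[dataMatrix[:, dimen] > threshVal] = -1.0
        return retArray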
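To see what the two inequality settings do, here is a minimal sketch that applies the same thresholding rule to the first feature of the sample data. It assumes a plain NumPy ndarray instead of the article's `matrix`; the name `stump_classify` and the threshold 1.5 are illustrative choices, not part of the original code.

```python
import numpy as np

# Same five samples as loadSimpData(), stored as a plain ndarray (assumption).
X = np.array([[1.0, 2.1],
              [2.0, 1.1],
              [1.3, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])

def stump_classify(X, dimen, thresh_val, thresh_ineq):
    """Predict +1 for every row, then flip one side of the threshold to -1."""
    pred = np.ones(X.shape[0])
    if thresh_ineq == 'lt':
        pred[X[:, dimen] <= thresh_val] = -1.0
    else:
        pred[X[:, dimen] > thresh_val] = -1.0
    return pred

# Threshold the first feature (column 0) at 1.5 in both directions.
lt_pred = stump_classify(X, 0, 1.5, 'lt')
gt_pred = stump_classify(X, 0, 1.5, 'gt')

print(lt_pred)
print(gt_pred)
```

For the same column and threshold, 'gt' gives exactly the opposite labels of 'lt', which is why the search loop tries both directions.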

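Finally, the function that builds the stump. It loops over every feature, every candidate threshold and both inequality directions, and keeps the combination with the lowest weighted error, which gives the best feature and its threshold. D is the vector of initial weights, one per sample in the data matrix.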

    from numpy import mat, zeros, ones, shape, inf

    def buildStump(dataArr, classLabels, D):
        # D is expected to be an m x 1 column matrix of sample weights
        dataMatrix = mat(dataArr)
        labelMat = mat(classLabels).T
        m, n = shape(dataMatrix)
        numSteps = 10.0  # number of threshold steps across each feature's range
        bestStump = {}
        bestClasEst = mat(zeros((m, 1)))
        minError = inf  # init error sum to +infinity
        for i in range(n):  # loop over all dimensions (features)
            rangeMin = dataMatrix[:, i].min()
            rangeMax = dataMatrix[:, i].max()
            stepSize = (rangeMax - rangeMin) / numSteps
            for j in range(-1, int(numSteps) + 1):  # loop over thresholds, starting one step below the minimum
                for inequal in ['lt', 'gt']:  # go over less than and greater than
                    threshVal = rangeMin + float(j) * stepSize
                    predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                    errArr = mat(ones((m, 1)))  # 1 marks a misclassified sample
                    errArr[predictedVals == labelMat] = 0
                    weightedError = D.T * errArr  # total error weighted by D (a 1x1 matrix)
                    # print("split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError[0, 0]))
                    if weightedError < minError:
                        minError = weightedError
                        bestClasEst = predictedVals.copy()
                        bestStump['dim'] = i
                        bestStump['thresh'] = threshVal
                        bestStump['ineq'] = inequal
        return bestStump, minError, bestClasEst
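As a usage illustration, the sketch below runs the same exhaustive search on the sample data, once with uniform weights and once with most of the weight on the first sample. It is a condensed restatement under assumptions: it uses plain ndarrays rather than `matrix`, the name `build_stump_sketch` is hypothetical, and the second weight vector is made up purely to show the effect of D.

```python
import numpy as np

X = np.array([[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

def build_stump_sketch(X, y, D, num_steps=10):
    """Search every (feature, threshold, direction) for the lowest weighted error."""
    m, n = X.shape
    best, min_err, best_pred = {}, np.inf, None
    for dim in range(n):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        step = (hi - lo) / num_steps
        for j in range(-1, num_steps + 1):
            thresh = lo + float(j) * step
            below = X[:, dim] <= thresh
            for ineq in ('lt', 'gt'):
                # 'lt' labels the low side -1; 'gt' labels the high side -1
                pred = np.where(below, -1.0, 1.0) if ineq == 'lt' else np.where(below, 1.0, -1.0)
                err = float(np.dot(D, pred != y))  # sum of weights of misclassified samples
                if err < min_err:
                    min_err, best_pred = err, pred.copy()
                    best = {'dim': dim, 'thresh': thresh, 'ineq': ineq}
    return best, min_err, best_pred

# Uniform weights: every sample counts equally.
D_uniform = np.ones(5) / 5
stump_u, err_u, pred_u = build_stump_sketch(X, y, D_uniform)

# Illustrative non-uniform weights: the first sample matters most.
D_skewed = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
stump_s, err_s, pred_s = build_stump_sketch(X, y, D_skewed)

print(stump_u, err_u)
print(stump_s, err_s)
```

With uniform weights the search settles on feature 0 with 'lt' and a threshold of about 1.3, misclassifying only the first sample. Once that sample carries most of the weight, the search switches to feature 1, which shows how D steers the choice of stump.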