Boosting
Adaboost
简介
Boosting和Bagging所使用的分类器的类型都是一致的。在前者中,不同的分类器是通过串行训练而获得的,每个新分类器都根据已训练出的分类器的性能来进行训练。boosting通过集中关注被已有分类器错分的那些数据来获得新的分类器。
Boosting分类的结果是基于所有分类器的加权求和结果。在bagging中分类器权重相等,而在boosting中,分类器的权重并不是相等,每个权重代表的是其对应分类器在上一轮迭代中的成功度。
AdaBoost(adaptive boosting)运行过程如下:数据中的每个样本并赋予其一个权重,这些权重构成向量D。开始这些权重都只初始化成相等值。首先在训练数据上训练出一个弱分类器并计算该分类器的错误率,然后在同一数据集上再次训练弱分类器。在分类器的二次训练中会重新调整每个样本的权重,其中第一次分对的样本权重将会降低,第一次分错的样本权重将会提高。为了从所有弱分类器中得到最终的分类结果。Adaboost为每个分类器都分配了一个权重值alpha,这些alpha值是基于每个弱分类器的错误率进行计算的。
错误率的定义为:
alpha计算公式为:
计算出alpha值后,对权重向量D进行更新,使正确分类的样本的权重降低而错分样本权重进行升高。
如果某个样本被正确分类:
如果某个样本被错误分类:
案例
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#读入数据
wine=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
#加列名
wine.columns=['class label','Acohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue','OD280/OD315 of diluted wines','Proline']
wine.head()
#仅考虑2,3类
wine=wine[wine['class label']!=1]
y=wine['class label'].values
X=wine[['Acohol','OD280/OD315 of diluted wines']].values
#将分类标签变成二进制编码
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y=le.fit_transform(y)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=1, stratify=y)
单一决策树建模
from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=1)
from sklearn.metrics import accuracy_score
tree=tree.fit(X_train,y_train)
y_train_pred=tree.predict(X_train)
y_test_pred=tree.predict(X_test)
tree_train=accuracy_score(y_train,y_train_pred)
tree_test=accuracy_score(y_test,y_test_pred)
print('Decision tree train/test accuracies is %.3f/%.3f'%(tree_train,tree_test))
Decision tree train/test accuracies is 0.916/0.875
#Adaboost
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(base_estimator=tree, n_estimators=500, learning_rate=0.1,random_state=1)
ada=ada.fit(X_train,y_train)
y_train_pred=ada.predict(X_train)
y_test_pred=ada.predict(X_test)
ada_train=accuracy_score(y_train,y_train_pred)
ada_test=accuracy_score(y_test,y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' %(ada_train,ada_test))
Adaboost train/test accuracies 1.000/0.917
#决策边界
x_min=X_train[:,0].min()-1
x_max=X_train[:,0].max()+1
y_min=X_train[:,1].min()-1
y_max=X_train[:,1].max()+1
xx,yy=np.meshgrid(np.arange(x_min,x_max,0.1),np.arange(y_min,y_max,0.1)) #网格点坐标矩阵
f, axarr=plt.subplots(nrows=1,ncols=2,sharex='col',sharey='row',figsize=(12,6))
for idx, clf, tt in zip([0,1],[tree,ada],['Decision tree','Adaboost']):
clf.fit(X_train,y_train)
Z=clf.predict(np.c_[xx.ravel(),yy.ravel()]) #扁平化操作
Z=Z.reshape(xx.shape)
axarr[idx].contourf(xx,yy,Z, alpha=0.3)
axarr[idx].scatter(X_train[y_train==0,0],X_train[y_train==0,1],c='blue',marker='*')
axarr[idx].scatter(X_train[y_train==1,0],X_train[y_train==1,1],c='red',marker='^')
axarr[idx].set_title(tt)
axarr[0].set_ylabel('Alcohol',fontsize=12)
plt.tight_layout()
plt.text(0, -0.2, s='OD280/OD315 of diluted wines',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)
plt.show()
算法实现
'''
Created on Nov 28, 2010
Adaboost is short for Adaptive Boosting
@author: Peter
'''
from numpy import *
def loadSimpData():
datMat = matrix([[ 1. , 2.1],
[ 2. , 1.1],
[ 1.3, 1. ],
[ 1. , 1. ],
[ 2. , 1. ]])
classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
return datMat,classLabels
def loadDataSet(fileName): #general function to parse tab -delimited floats
numFeat = len(open(fileName).readline().split('\t')) #get number of fields
dataMat = []; labelMat = []
fr = open(fileName)
for line in fr.readlines():
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat-1):
lineArr.append(float(curLine[i]))
dataMat.append(lineArr)
labelMat.append(float(curLine[-1]))
return dataMat,labelMat
def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data
retArray = ones((shape(dataMatrix)[0],1))
if threshIneq == 'lt':
retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
else:
retArray[dataMatrix[:,dimen] > threshVal] = -1.0
return retArray
def buildStump(dataArr,classLabels,D):
dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
m,n = shape(dataMatrix)
numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))
minError = inf #init error sum, to +infinity
for i in range(n):#loop over all dimensions
rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();
stepSize = (rangeMax-rangeMin)/numSteps
for j in range(-1,int(numSteps)+1):#loop over all range in current dimension
for inequal in ['lt', 'gt']: #go over less than and greater than
threshVal = (rangeMin + float(j) * stepSize)
predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#call stump classify with i, j, lessThan
errArr = mat(ones((m,1)))
errArr[predictedVals == labelMat] = 0
weightedError = D.T*errArr #calc total error multiplied by D
print("split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
if weightedError < minError:
minError = weightedError
bestClasEst = predictedVals.copy()
bestStump['dim'] = i
bestStump['thresh'] = threshVal
bestStump['ineq'] = inequal
return bestStump,minError,bestClasEst
def adaBoostTrainDS(dataArr,classLabels,numIt=40):
weakClassArr = []
m = shape(dataArr)[0]
D = mat(ones((m,1))/m) #init D to all equal
aggClassEst = mat(zeros((m,1)))
for i in range(numIt):
bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump
print("D:",D.T)
alpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0
bestStump['alpha'] = alpha
weakClassArr.append(bestStump) #store Stump Params in Array
print("classEst: ",classEst.T)
expon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messy
D = multiply(D,exp(expon)) #Calc New D for next iteration
D = D/D.sum()
#calc training error of all classifiers, if this is 0 quit for loop early (use break)
aggClassEst += alpha*classEst
print("aggClassEst: ",aggClassEst.T)
aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))
errorRate = aggErrors.sum()/m
print("total error: ",errorRate)
if errorRate == 0.0: break
return weakClassArr
def adaClassify(datToClass,classifierArr):
dataMatrix = mat(datToClass)#do stuff similar to last aggClassEst in adaBoostTrainDS
m = shape(dataMatrix)[0]
aggClassEst = mat(zeros((m,1)))
for i in range(len(classifierArr)):
classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],\
classifierArr[i]['thresh'],\
classifierArr[i]['ineq'])#call stump classify
aggClassEst += classifierArr[i]['alpha']*classEst
print(aggClassEst)
return sign(aggClassEst)
def plotROC(predStrengths, classLabels):
import matplotlib.pyplot as plt
cur = (1.0,1.0) #cursor
ySum = 0.0 #variable to calculate AUC
numPosClas = sum(array(classLabels)==1.0)
yStep = 1/float(numPosClas); xStep = 1/float(len(classLabels)-numPosClas)
sortedIndicies = predStrengths.argsort()#get sorted index, it's reverse
fig = plt.figure()
fig.clf()
ax = plt.subplot(111)
#loop through all the values, drawing a line segment at each point
for index in sortedIndicies.tolist()[0]:
if classLabels[index] == 1.0:
delX = 0; delY = yStep;
else:
delX = xStep; delY = 0;
ySum += cur[1]
#draw line from cur to (cur[0]-delX,cur[1]-delY)
ax.plot([cur[0],cur[0]-delX],[cur[1],cur[1]-delY], c='b')
cur = (cur[0]-delX,cur[1]-delY)
ax.plot([0,1],[0,1],'b--')
plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
plt.title('ROC curve for AdaBoost horse colic detection system')
ax.axis([0,1,0,1])
plt.show()
print("the Area Under the Curve is: ",ySum*xStep)
前向分布算法
简介
在给定训练数据以及损失函数
L
(
y
,
f
(
x
)
)
L(y, f(x))
L(y,f(x))的条件下,学习加法模型
f
(
x
)
f(x)
f(x)就是:
min
β
m
,
γ
m
∑
i
=
1
N
L
(
y
i
,
∑
m
=
1
M
β
m
b
(
x
i
;
γ
m
)
)
\min _{\beta_{m}, \gamma_{m}} \sum_{i=1}^{N} L\left(y_{i}, \sum_{m=1}^{M} \beta_{m} b\left(x_{i} ; \gamma_{m}\right)\right)
βm,γmmini=1∑NL(yi,m=1∑Mβmb(xi;γm))
通常这是一个复杂的优化问题,很难通过简单的凸优化的相关知识进行解决。前向分步算法可以用来求解这种方式的问题,它的基本思路是:因为学习的是加法模型,如果从前向后,每一步只优化一个基函数及其系数,逐步逼近目标函数,那么就可以降低优化的复杂度。具体而言,每一步只需要优化:
min
β
,
γ
∑
i
=
1
N
L
(
y
i
,
β
b
(
x
i
;
γ
)
)
\min _{\beta, \gamma} \sum_{i=1}^{N} L\left(y_{i}, \beta b\left(x_{i} ; \gamma\right)\right)
β,γmini=1∑NL(yi,βb(xi;γ))
(2) 前向分步算法:
给定数据集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , ⋯ , ( x N , y N ) } T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\} T={(x1,y1),(x2,y2),⋯,(xN,yN)}, x i ∈ X ⊆ R n x_{i} \in \mathcal{X} \subseteq \mathbf{R}^{n} xi∈X⊆Rn, y i ∈ Y = { + 1 , − 1 } y_{i} \in \mathcal{Y}=\{+1,-1\} yi∈Y={+1,−1}。损失函数 L ( y , f ( x ) ) L(y, f(x)) L(y,f(x)),基函数集合 { b ( x ; γ ) } \{b(x ; \gamma)\} {b(x;γ)},我们需要输出加法模型 f ( x ) f(x) f(x)。
- 初始化: f 0 ( x ) = 0 f_{0}(x)=0 f0(x)=0
- 对m = 1,2,…,M:
- (a) 极小化损失函数:
( β m , γ m ) = arg min β , γ ∑ i = 1 N L ( y i , f m − 1 ( x i ) + β b ( x i ; γ ) ) \left(\beta_{m}, \gamma_{m}\right)=\arg \min _{\beta, \gamma} \sum_{i=1}^{N} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+\beta b\left(x_{i} ; \gamma\right)\right) (βm,γm)=argβ,γmini=1∑NL(yi,fm−1(xi)+βb(xi;γ))
得到参数 β m \beta_{m} βm与 γ m \gamma_{m} γm - (b) 更新:
f m ( x ) = f m − 1 ( x ) + β m b ( x ; γ m ) f_{m}(x)=f_{m-1}(x)+\beta_{m} b\left(x ; \gamma_{m}\right) fm(x)=fm−1(x)+βmb(x;γm)
- (a) 极小化损失函数:
- 得到加法模型:
f ( x ) = f M ( x ) = ∑ m = 1 M β m b ( x ; γ m ) f(x)=f_{M}(x)=\sum_{m=1}^{M} \beta_{m} b\left(x ; \gamma_{m}\right) f(x)=fM(x)=m=1∑Mβmb(x;γm)
这样,前向分步算法将同时求解从m=1到M的所有参数 β m \beta_{m} βm, γ m \gamma_{m} γm的优化问题简化为逐次求解各个 β m \beta_{m} βm, γ m \gamma_{m} γm的问题。
梯度提升决策树(GBDT)
简介
全称梯度提升决策树。它通过采用加法模型及基函数的线性组合,以及不断减少训练过程产生的误差来达到将数据分类或者回归的算法。训练过程如下,
GBDT通过多轮迭代,每轮迭代会产生一个弱分类器。每个分类器在上一轮分类器的残差基础上进行训练。GBDT对弱分类器的要求一般是足够简单且低方差高偏差。因为训练的过程是通过降低偏差来不断提高对最终分类器的精度。而由于上述高偏差和简单的要求,每个分类回归树的深度不会很深。最终的总分类器是将每轮训练得到的弱分类器加权求和得到的,也就是加法模型。
基于残差学习的提升树算法
在回归树中的样本标签是连续数值,可划分点包含了所有特征的所有可取的值。因此用平方误差可以很好的评判拟合程度。而为了设定提升标准,模仿分类错误率,我们用每个样本的残差表示每次使用基函数预测时没有解决的那部分问题。
输入数据集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , ⋯ , ( x N , y N ) } , x i ∈ X ⊆ R n , y i ∈ Y ⊆ R T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}, x_{i} \in \mathcal{X} \subseteq \mathbf{R}^{n}, y_{i} \in \mathcal{Y} \subseteq \mathbf{R} T={(x1,y1),(x2,y2),⋯,(xN,yN)},xi∈X⊆Rn,yi∈Y⊆R,输出最终的提升树 f M ( x ) f_{M}(x) fM(x)
- 初始化 f 0 ( x ) = 0 f_0(x) = 0 f0(x)=0
- 对m = 1,2,…,M:
- 计算每个样本的残差: r m i = y i − f m − 1 ( x i ) , i = 1 , 2 , ⋯ , N r_{m i}=y_{i}-f_{m-1}\left(x_{i}\right), \quad i=1,2, \cdots, N rmi=yi−fm−1(xi),i=1,2,⋯,N
- 拟合残差 r m i r_{mi} rmi学习一棵回归树,得到 T ( x ; Θ m ) T\left(x ; \Theta_{m}\right) T(x;Θm)
- 更新 f m ( x ) = f m − 1 ( x ) + T ( x ; Θ m ) f_{m}(x)=f_{m-1}(x)+T\left(x ; \Theta_{m}\right) fm(x)=fm−1(x)+T(x;Θm)
- 得到最终的回归问题的提升树: f M ( x ) = ∑ m = 1 M T ( x ; Θ m ) f_{M}(x)=\sum_{m=1}^{M} T\left(x ; \Theta_{m}\right) fM(x)=∑m=1MT(x;Θm)
梯度提升算法
梯度提升算法(gradient boosting)是利用最速下降法的近似方法,利用损失函数的负梯度在当前模型的值 − [ ∂ L ( y , f ( x i ) ) ∂ f ( x i ) ] f ( x ) = f m − 1 ( x ) -\left[\frac{\partial L\left(y, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f(x)=f_{m-1}(x)} −[∂f(xi)∂L(y,f(xi))]f(x)=fm−1(x)作为回归问题提升树算法中的残差的近似值,拟合回归树。
输入训练数据集 T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , ⋯ , ( x N , y N ) } , x i ∈ X ⊆ R n , y i ∈ Y ⊆ R T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}, x_{i} \in \mathcal{X} \subseteq \mathbf{R}^{n}, y_{i} \in \mathcal{Y} \subseteq \mathbf{R} T={(x1,y1),(x2,y2),⋯,(xN,yN)},xi∈X⊆Rn,yi∈Y⊆R和损失函数 L ( y , f ( x ) ) L(y, f(x)) L(y,f(x)),输出回归树 f ^ ( x ) \hat{f}(x) f^(x)
- 初始化 f 0 ( x ) = arg min c ∑ i = 1 N L ( y i , c ) f_{0}(x)=\arg \min _{c} \sum_{i=1}^{N} L\left(y_{i}, c\right) f0(x)=argminc∑i=1NL(yi,c)
- 对于m=1,2,…,M:
- 对i = 1,2,…,N计算: r m i = − [ ∂ L ( y i , f ( x i ) ) ∂ f ( x i ) ] f ( x ) = f m − 1 ( x ) r_{m i}=-\left[\frac{\partial L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f(x)=f_{m-1}(x)} rmi=−[∂f(xi)∂L(yi,f(xi))]f(x)=fm−1(x)
- 对 r m i r_{mi} rmi拟合一个回归树,得到第m棵树的叶结点区域 R m j , j = 1 , 2 , ⋯ , J R_{m j}, j=1,2, \cdots, J Rmj,j=1,2,⋯,J
- 对j=1,2,…J,计算: c m j = arg min c ∑ x i ∈ R m j L ( y i , f m − 1 ( x i ) + c ) c_{m j}=\arg \min _{c} \sum_{x_{i} \in R_{m j}} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+c\right) cmj=argminc∑xi∈RmjL(yi,fm−1(xi)+c)
- 更新 f m ( x ) = f m − 1 ( x ) + ∑ j = 1 J c m j I ( x ∈ R m j ) f_{m}(x)=f_{m-1}(x)+\sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right) fm(x)=fm−1(x)+∑j=1JcmjI(x∈Rmj)
- 得到回归树: f ^ ( x ) = f M ( x ) = ∑ m = 1 M ∑ j = 1 J c m j I ( x ∈ R m j ) \hat{f}(x)=f_{M}(x)=\sum_{m=1}^{M} \sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right) f^(x)=fM(x)=∑m=1M∑j=1JcmjI(x∈Rmj)
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
mean_squared_error(y_test, est.predict(X_test))
5.009154859960321
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
X, y = make_regression(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
0.43848663277068134
GradientBoostingRegressor参数解释:
loss:{‘ls’, ‘lad’, ‘huber’, ‘quantile’}, default=’ls’:‘ls’ 指最小二乘回归. ‘lad’ (最小绝对偏差) 是仅基于输入变量的顺序信息的高度鲁棒的损失函数。. ‘huber’ 是两者的结合. ‘quantile’允许分位数回归(用于alpha指定分位数)
learning_rate:学习率缩小了每棵树的贡献learning_rate。在learning_rate和n_estimators之间需要权衡。
n_estimators:要执行的提升次数。
subsample:用于拟合各个基础学习者的样本比例。如果小于1.0,则将导致随机梯度增强。subsample与参数n_estimators。选择会导致方差减少和偏差增加。subsample < 1.0
criterion:{'friedman_mse','mse','mae'},默认='friedman_mse':“ mse”是均方误差,“ mae”是平均绝对误差。默认值“ friedman_mse”通常是最好的,因为在某些情况下它可以提供更好的近似值。
min_samples_split:拆分内部节点所需的最少样本数
min_samples_leaf:在叶节点处需要的最小样本数。
min_weight_fraction_leaf:在所有叶节点处(所有输入样本)的权重总和中的最小加权分数。如果未提供sample_weight,则样本的权重相等。
max_depth:各个回归模型的最大深度。最大深度限制了树中节点的数量。调整此参数以获得最佳性能;最佳值取决于输入变量的相互作用。
min_impurity_decrease:如果节点分裂会导致杂质的减少大于或等于该值,则该节点将被分裂。
min_impurity_split:提前停止树木生长的阈值。如果节点的杂质高于阈值,则该节点将分裂
max_features{‘auto’, ‘sqrt’, ‘log2’},int或float:寻找最佳分割时要考虑的功能数量:
如果为int,则max_features在每个分割处考虑特征。
如果为float,max_features则为小数,并 在每次拆分时考虑要素。int(max_features * n_features)
如果“auto”,则max_features=n_features。
如果是“ sqrt”,则max_features=sqrt(n_features)。
如果为“ log2”,则为max_features=log2(n_features)。
如果没有,则max_features=n_features。