Ensemble Regression in Machine Learning: Gradient Boosting (Part 3)


lamusique


Category: Applied. Tags: gradient boosting

Article link: https://blog.csdn.net/lamusique/article/details/86761467


Ensemble methods

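The idea behind ensemble methods is to generate a large number of approximately independent models and then combine them. An ensemble method is therefore a two-level hierarchy of algorithms. The lower-level algorithm is called the base learner, which is a single machine learning algorithm; the upper-level algorithm manipulates the base learner so that the models it produces are approximately independent of one another.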

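Commonly used upper-level algorithms are bagging (combining models by voting or averaging), boosting, and random forests. Commonly used base learners include binary decision trees and support vector machines.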

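This article focuses on gradient boosting with binary decision trees as the base learner, that is, gradient-boosted regression trees. (Gradient-boosted classification trees also exist.)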

Gradient boosting

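Gradient boosting is a tree-based ensemble method: it trains each decision tree on a different set of labels (targets) and then combines the trees. For a regression problem the goal is to minimize the mean squared error, and each successive tree is trained on the errors left over by the trees before it.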
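As a short worked derivation of why training on leftover errors amounts to a gradient step, consider the squared-error loss. The notation here is an assumption introduced for illustration and does not come from the original post: F is the ensemble prediction, h_m is the m-th tree, r_i is the residual of sample i, and ε plays the role of the step size called eps in the code. The factor 1/2 is only a convention that simplifies the derivative.

```latex
\begin{aligned}
L(F) &= \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - F(x_i)\bigr)^2, \\
-\frac{\partial L}{\partial F(x_i)} &= y_i - F(x_i) = r_i, \\
F_m(x) &= F_{m-1}(x) + \varepsilon\, h_m(x), \qquad F_0(x) = 0 .
\end{aligned}
```

Under these assumptions the negative gradient of the loss with respect to each prediction is exactly the residual, so fitting the next tree to the residuals and adding a fraction ε of its output is one gradient-descent step in prediction space.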

import random

import matplotlib.pyplot as plot
from sklearn.tree import DecisionTreeRegressor

#read the data
file='winequality-red.csv'

xList = []
labels = []
names = []
firstLine = True
with open(file,'r') as f:
    for line in f.readlines():
        if firstLine:
            names = line.strip().split(";")
##            print(names)
            firstLine = False
        else:
            #split on semi-colon
            row = line.strip().split(";")
##            print(row)
            #put labels in separate array
            labels.append(float(row[-1]))
            #remove label from row
            row.pop()
            #convert row to floats
            floatRow = [float(num) for num in row]
            xList.append(floatRow)

nrows = len(xList)
ncols = len(xList[0])

#split the data: hold out 30% for testing
nSample = int(nrows * 0.30)
idxTest = random.sample(range(nrows), nSample)
idxTest.sort()
idxTrain = [idx for idx in range(nrows) if not(idx in idxTest)]

#define the training and test sets
xTrain = [xList[r] for r in idxTrain]  #each list item is one sample's feature vector (a row)
xTest = [xList[r] for r in idxTest]
yTrain = [labels[r] for r in idxTrain]
yTest = [labels[r] for r in idxTest]

#train a series of trees, each one fit to the residuals left by the trees before it
#collect the models in a list and check error of composite as list grows

#30 decision trees of depth 5
#maximum number of models to generate
numTreesMax = 30
#tree depth - typically at the high end
treeDepth = 5

#initialize a list to hold models
modelList = []
predList = []
eps = 0.1

#initialize residuals to be the labels y
residuals = list(yTrain)

for iTrees in range(numTreesMax):

    modelList.append(DecisionTreeRegressor(max_depth=treeDepth))
    modelList[-1].fit(xTrain, residuals)  #fit each new tree to the current residuals

    #in-sample prediction from the newest tree
    latestInSamplePrediction = modelList[-1].predict(xTrain)

    #use new predictions to update residuals
    residuals = [residuals[i] - eps * latestInSamplePrediction[i] for i in range(len(residuals))]

    latestOutSamplePrediction = modelList[-1].predict(xTest)
    predList.append(list(latestOutSamplePrediction))


#build cumulative prediction from first "n" models
mse = []
allPredictions = []
for iModels in range(len(modelList)):  #accumulate predictions so the error curve can be plotted

    #add the first "iModels" of the predictions and multiply by eps
    prediction = []
    for iPred in range(len(xTest)):
        prediction.append(sum([predList[i][iPred] for i in range(iModels + 1)]) * eps)  #sum the first iModels + 1 trees; eps is the step size

    allPredictions.append(prediction)
    errors = [(yTest[i] - prediction[i]) for i in range(len(yTest))]
    mse.append(sum([e * e for e in errors]) / len(yTest))


nModels = [i + 1 for i in range(len(modelList))]

plot.plot(nModels,mse)
plot.axis('tight')
plot.xlabel('Number of Trees in Ensemble')
plot.ylabel('Mean Squared Error')
plot.ylim((0.0, max(mse)))
plot.show()

print('Minimum MSE')
print(min(mse))

#printed output (varies from run to run because the train/test split is random)
#Minimum MSE
#0.405031864814

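1. On the depth of the individual trees: gradient boosting can reach a mean squared error as low as that of deeper trees even with stumps (decision trees of depth 1). With gradient boosting, increasing the tree depth is worth considering only when there are strong interactions between attributes.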

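2. eps controls the step size. Gradient boosting proceeds by gradient descent: if the step is too large the optimization can diverge, and if it is too small too many iterations are needed.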

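3. On the definition of the residuals variable (residuals). The term residual usually means the prediction error (the observed value minus the predicted value). Gradient boosting refines the predictions of the labels through a series of steps, recomputing the residuals after each step along the gradient-descent direction. The prediction starts at 0, so the initial residual equals the observed label.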
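The effect of the step size in notes 2 and 3 can be seen in a minimal sketch that replaces the decision tree with an idealized learner that predicts the current residuals exactly. That learner and the function name residual_history are assumptions made for illustration only; a real tree fits the residuals only approximately. Under this assumption each round multiplies every residual by (1 - eps).

```python
def residual_history(labels, eps, n_rounds):
    """Track the largest absolute residual over boosting rounds.

    Assumes an idealized base learner that predicts the current
    residuals exactly (a stand-in for a decision tree).
    """
    # The prediction starts at 0, so the initial residual equals the label.
    residuals = list(labels)
    history = [max(abs(r) for r in residuals)]
    for _ in range(n_rounds):
        predictions = list(residuals)  # idealized learner: perfect fit
        residuals = [r - eps * p for r, p in zip(residuals, predictions)]
        history.append(max(abs(r) for r in residuals))
    return history


labels = [5.0, 6.0, 7.0]
small_step = residual_history(labels, eps=0.1, n_rounds=30)
large_step = residual_history(labels, eps=2.5, n_rounds=30)

print("eps=0.1:", small_step[0], "->", small_step[-1])
print("eps=2.5:", large_step[0], "->", large_step[-1])
```

With eps=0.1 the residuals shrink slowly and steadily, while with eps=2.5 the factor (1 - eps) has magnitude greater than 1, so the residuals grow on every round instead of shrinking.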

How gradient boosting builds the predictive model iteratively

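Each pass of the loop over iTrees starts by training a decision tree on the attribute values, but with the residuals in place of the labels as the target. Only the first pass trains on the original labels. Every later pass trains on the updated residuals (previous residual - eps * prediction). As noted earlier, the quantity subtracted from the residuals corresponds to a gradient-descent step, and multiplying by the step-size parameter eps keeps the iteration converging. The code measures performance on a fixed held-out set (the test data) and plots the mean squared error against the number of trees. Each tree's out-of-sample prediction comes from modelList[-1].predict(xTest); the ensemble's prediction is the sum of these per-tree predictions multiplied by eps.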
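The same loop can be sketched more compactly as two functions, one that fits the trees and one that forms the ensemble prediction. This is an illustrative sketch rather than the original listing: the function names fit_boosted_trees and predict_boosted are hypothetical, the data is synthetic (not winequality-red.csv), and NumPy arrays replace the Python lists.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=30, depth=5, eps=0.1):
    """Fit each tree to the residuals left by the trees before it."""
    residuals = np.asarray(y, dtype=float).copy()  # prediction starts at 0
    models = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X, residuals)
        residuals -= eps * tree.predict(X)  # take one step of size eps
        models.append(tree)
    return models


def predict_boosted(models, X, eps=0.1):
    """Ensemble prediction: eps times the sum of the per-tree predictions."""
    return eps * sum(tree.predict(X) for tree in models)


# Synthetic data, assumed here only so the sketch is self-contained.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = fit_boosted_trees(X, y)
mse_first = np.mean((y - predict_boosted(models[:1], X)) ** 2)
mse_all = np.mean((y - predict_boosted(models, X)) ** 2)
print("training MSE with 1 tree:", mse_first)
print("training MSE with 30 trees:", mse_all)
```

The value of eps passed to predict_boosted has to match the one used during fitting, because the residuals the later trees were trained on assume that step size.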

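Of course, scikit-learn's sklearn.ensemble module already provides estimators that implement this whole procedure, which is convenient. sklearn.ensemble.GradientBoostingRegressor() trains and builds gradient-boosted regression trees (sklearn.ensemble.GradientBoostingClassifier() does the same for classification trees). The main parameters are n_estimators (for example n_estimators=200), the number of trees, and max_depth, the depth of each individual tree in the ensemble. The corresponding methods are fit for training and predict for prediction. The next step is to build a gradient boosting model for this data set with GradientBoostingRegressor().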
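A minimal sketch of how these parameters and methods fit together is given here. It runs on synthetic data rather than on winequality-red.csv, and the values chosen for learning_rate, test_size and random_state are assumptions for illustration, not settings from the original post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, assumed only so the sketch is self-contained.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Hold out 30% of the rows for testing, as in the hand-written loop.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=200,   # number of trees
    max_depth=5,        # depth of each individual tree
    learning_rate=0.1,  # step size, the role played by eps above
    random_state=0,
)
model.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, model.predict(X_test))
# Baseline: always predict the mean of the training labels.
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))

# Test MSE after each added tree, comparable to the curve plotted earlier.
staged_mse = [mean_squared_error(y_test, p) for p in model.staged_predict(X_test)]

print("test MSE:", test_mse)
print("baseline MSE:", baseline_mse)
print("best number of trees:", int(np.argmin(staged_mse)) + 1)
```

staged_predict yields the ensemble prediction after each successive tree, so the error curve comes from a single fitted model instead of a manual running sum.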

To be continued...
