Ensemble Models (4): Core Principles of LightGBM and a Python Implementation


Preface: The overall workflow of LightGBM is quite similar to XGBoost; both are refinements of GBDT. Compared with XGBoost, LightGBM addresses the cost of training on large, high-dimensional data. The implementation code in this post is meant mainly to aid understanding of the core algorithm; corrections to any mistakes are welcome.

1 Main Principles

As mentioned before, LightGBM optimizes the objective function the same way XGBoost does, using second-order derivative information (see the previous post for the derivation). What differs is a set of optimizations for large sample sizes and high feature dimensionality, and those improvements are the focus of this post. The core idea: since there are too many samples and too many feature dimensions, subsample the instances and reduce the feature dimensionality. The LightGBM authors proposed GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for these two problems. Beyond that, LightGBM adds further optimizations such as a leaf-wise tree growth strategy and an optimal-split procedure for categorical features.
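
As a quick recap of that second-order machinery (the cal_best_w and cal_Gain functions in Section 3 implement exactly these expressions), let $G$ and $H$ be the sums of first and second derivatives over the samples in a node and $\lambda$ the L2 regularization term; the optimal leaf weight and the (unpenalized) split gain are then

$$w^{*}=-\frac{G}{H+\lambda},\qquad \mathrm{Gain}=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]$$

In the code below, cal_Gain additionally subtracts its reg_alpha parameter from this quantity, and a leaf is only split when the resulting gain exceeds the gamma threshold.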

1.1 GOSS: Gradient-based One-Side Sampling

The core idea of gradient-based one-side sampling is to sort the samples in descending order by the absolute value of their first derivative and keep the top a% of them. Why pick the samples with the largest gradient magnitude? Because the loss is optimized by driving the first derivative towards zero, so a large gradient magnitude means the current prediction for that sample is far from optimal and its error is large. Keeping the top a% therefore concentrates training on the samples that the previous round fit poorly, which is similar in spirit to AdaBoost increasing the weight of misclassified samples in every round, except that here no explicit per-sample weight is assigned. Besides the top a% of samples, b% of samples are also drawn at random from the remaining (1-a%), and their gradients are amplified by a factor of (1-a%)/b%, i.e. the part stands in for the whole: each sampled instance now represents (1-a%)/b% original instances, and when an internal node of the base learner splits, those represented instances move to the left or right child as a group. This keeps the original data distribution approximately unchanged while still speeding up training.

Algorithm:
Input: training data, number of iterations d (i.e. the number of base learners), sampling rate a% for large-gradient data, sampling rate b% for small-gradient data, loss function l, and the weak learner;
Output: the trained strong learner

  1. Using the predictions of the model trained in the previous d-1 rounds, compute each sample's gradient $\frac{\partial l(y_i,\hat{y}_i^{(d-1)})}{\partial \hat{y}_i^{(d-1)}}$ and sort the samples in descending order of its absolute value;
  2. Take the top a% of the sorted samples to form the large-gradient sample set;
  3. From the remaining (1-a%) of the samples, randomly select (1-a%)·b% samples to form the small-gradient sample set;
  4. Merge the large-gradient samples with the randomly sampled small-gradient samples;
  5. Multiply the gradients and second derivatives of the small-gradient samples by (1-a%)/b%;
  6. Learn a new base learner from the merged sample set and these gradients and second derivatives;
  7. Repeat steps 1-6 until the iterations finish (a minimal sketch of the sampling step is given below).
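
To make the sampling step concrete, here is a minimal sketch for the squared-error case; the function name goss_sample and the parameters top_rate/other_rate are my own naming, and the fuller version used for training is the GOSS function in Section 3.2.

import numpy as np

def goss_sample(gradient, hessian, top_rate=0.2, other_rate=0.1):
    #gradient, hessian: 1-D numpy arrays of per-sample first/second derivatives
    n = len(gradient)
    order = np.argsort(-np.abs(gradient))                    #descending by |gradient|
    top_idx = order[:int(n * top_rate)]                      #large-gradient samples
    rest = order[int(n * top_rate):]
    other_idx = np.random.choice(rest, size=int(n * (1 - top_rate) * other_rate),
                                 replace=False)              #randomly chosen small-gradient samples
    factor = (1 - top_rate) / other_rate
    g, h = gradient.copy(), hessian.copy()
    g[other_idx] *= factor                                   #amplify the sampled small-gradient
    h[other_idx] *= factor                                   #samples to keep the distribution
    idx = np.concatenate([top_idx, other_idx])
    return idx, g[idx], h[idx]
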
1.2 EFB: Exclusive Feature Bundling

This part covers two questions: what counts as exclusive features, and how are they bundled?

(1) First, define what makes features mutually exclusive: if no sample takes a nonzero value on feature a and feature b at the same time, a and b are called exclusive features. In the example below a and b are exclusive, while a and c are not.

    a  b   c
0   0  1   4
1   2  0   0
2   3  0   5
3   4  0   0
4   0  0   0
5   0  3   3
6   0  0   0

In practice, though, the algorithm usually allows a small amount of conflict.
The pseudocode is as follows:
[Figure: Greedy Bundling pseudocode]
First a weighted graph is built in which every node is a feature and every edge weight is the total conflict between two features; bundles is the set of all bundles, each with total conflict below K; needNew indicates whether the current feature has to start a new bundle or can join an existing one.
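
A minimal sketch of building that conflict graph with a single matrix product (the helper name conflict_graph is mine; the EFB function in Section 3.2 computes the same counts with an explicit pairwise loop):

import numpy as np

def conflict_graph(X):
    #X: n_samples x n_features matrix; entry (i, j) of the result counts the samples
    #where features i and j are both nonzero, i.e. their conflicts
    nz = np.asarray(X != 0, dtype=int)
    G = nz.T @ nz
    np.fill_diagonal(G, 0)   #ignore a feature's conflicts with itself
    return G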

(2) Merging exclusive features means binding them into a single feature, which achieves the dimensionality reduction. To keep the original features distinguishable after merging, an offset is usually added: in the example above feature a ranges over [0,4] and b over [0,3], so b can first be shifted by an offset of 5 to [5,8]; values coming from a and b can then still be told apart after merging, and the merged feature ranges over [0,8].
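
A minimal sketch of this merge for two mutually exclusive integer-binned columns (merge_exclusive is a name I use here for illustration; the general version is features_merge in Section 3.2):

import numpy as np

def merge_exclusive(a, b):
    #shift b's nonzero values by an offset so they stay distinguishable from a's values
    offset = a.max() + 1               #e.g. a in [0, 4] -> offset 5, shifted b in [5, 8]
    merged = a.copy()
    merged[b != 0] = b[b != 0] + offset
    return merged

a = np.array([0, 2, 3, 4, 0, 0, 0])
b = np.array([1, 0, 0, 0, 0, 3, 0])
print(merge_exclusive(a, b))           # [6 2 3 4 0 8 0]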

The pseudocode is as follows:
[Figure: Merge Exclusive Features pseudocode]

Moreover, before the exclusivity check and bundling above, continuous features are usually discretized. Discretization reduces the number of candidate split points, which speeds things up. The split points found on the discretized values may not be exact, but since the base learners are weak learners anyway, the loss of precision hardly matters; the slightly imprecise splits can even act as a form of regularization, and even if a single base learner turns out mediocre, the impact under the boosting framework is small.
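
A minimal sketch of the binning step using quantile bins (the helper name bin_continuous and the max_bins parameter are my own; Section 3.2 does the same job with get_bins/process so that the identical bin edges can be reused on new data):

import numpy as np
import pandas as pd

def bin_continuous(values, max_bins=256):
    #map continuous values to integer bin ids; duplicates='drop' merges equal quantile edges
    return np.asarray(pd.qcut(np.asarray(values), q=max_bins, labels=False, duplicates='drop'))

x = np.random.randn(1000)
print(np.bincount(bin_continuous(x, max_bins=8).astype(int)))   #roughly equal-sized bins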

1.3 Leaf-wise Tree Growth Strategy

Most decision trees are grown level-wise: leaves at the same depth are treated indiscriminately, and every leaf that satisfies the conditions (minimum samples per leaf, minimum split gain, and so on) gets split. In practice many of those splits have low gain and are simply not worth making, so they add unnecessary cost. (A global minimum split gain can be set, but it applies globally: set it too small and many unnecessary nodes still get split; set it too large and the model underfits.)
[Figure: level-wise tree growth]
LightGBM instead grows the tree leaf-wise: at each step it picks, among all current leaves (not only the leaves of the current level), the one whose split yields the largest gain and splits it. Compared with level-wise growth, leaf-wise growth reaches greater depth for the same number of splits and can reduce the error further, but with few samples it may also overfit; this can be controlled by capping the maximum tree depth. A minimal sketch of the selection loop is given after the figure.
[Figure: leaf-wise tree growth]
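
A minimal sketch of that selection loop, assuming a best_split(leaf) helper that returns the best gain and the two children of a leaf (both the helper and the dict-based leaf representation are assumptions of mine; the full version is build_treeRegressor in Section 3.1):

def grow_leaf_wise(root_leaf, num_leaves, max_depth, best_split):
    #repeatedly split whichever current leaf offers the largest gain
    leaves = [root_leaf]
    while len(leaves) < num_leaves:
        candidates = [(best_split(leaf), i) for i, leaf in enumerate(leaves)
                      if leaf['depth'] < max_depth]
        if not candidates:
            break
        (gain, left, right), i = max(candidates, key=lambda c: c[0][0])
        if gain <= 0:          #no remaining leaf is worth splitting
            break
        leaves.pop(i)
        leaves.extend([left, right])
    return leaves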

1.4 Optimal Split for Categorical Features

LightGBM handles categorical features much like a CART tree does: it splits the category values into two subsets rather than branching on every value the way a classic decision tree would. Concretely, LightGBM first sorts the category values by sum_gradient/sum_hessian (sum_gradient is the sum of the first derivatives of all samples taking that category value, and sum_hessian is likewise the sum of their second derivatives), and then scans the sorted values in order to find the best split point.
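
A minimal sketch of this sort-then-scan idea, assuming the per-category sums of first and second derivatives have already been computed (best_categorical_split and its arguments are names I introduce for illustration; the gain formula is the same one cal_Gain uses in Section 3.1):

import numpy as np

def split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0):
    return (G_L**2/(H_L+reg_lambda) + G_R**2/(H_R+reg_lambda)
            - (G_L+G_R)**2/(H_L+H_R+reg_lambda)) / 2

def best_categorical_split(categories, g_sum, h_sum, reg_lambda=1.0):
    #sort the category values by sum_gradient/sum_hessian, then scan prefix splits in that order
    categories, g_sum, h_sum = map(np.asarray, (categories, g_sum, h_sum))
    order = np.argsort(g_sum / h_sum)
    G_tot, H_tot = g_sum.sum(), h_sum.sum()
    best_gain, best_left, G_L, H_L = -np.inf, None, 0.0, 0.0
    for k in range(len(order) - 1):        #left subset = the first k+1 sorted categories
        G_L += g_sum[order[k]]
        H_L += h_sum[order[k]]
        gain = split_gain(G_L, H_L, G_tot - G_L, H_tot - H_L, reg_lambda)
        if gain > best_gain:
            best_gain, best_left = gain, set(categories[order[:k + 1]])
    return best_left, best_gain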

2 Summary

Those are the main features of LightGBM.

  1. Compared with XGBoost, the differences are the added instance sampling and feature dimensionality reduction, the switch of the tree growth strategy to leaf-wise, and the dedicated treatment of categorical features.
  2. What it shares with XGBoost is the overall learning procedure: both expand the loss function with a second-order Taylor expansion.
  3. LightGBM also has dedicated handling of missing values, which I have not examined in detail here.

A few questions remain open for me. First, when computing the total conflict of all the features inside a bundle, if several features conflict at the same sample position, is that counted once or several times? (In my implementation below it is counted only once.) Second, do categorical features take part in feature bundling? If they do, the optimal categorical split no longer applies to a feature once it has been merged away, or is there a check so that bundled categorical features simply skip the optimal-split treatment? (The implementation below does not treat categorical features separately.)

3 Python Implementation

3.1 Implementation of the Base Learner
import pandas as pd
import numpy as np
import pygraphviz as pgv

'''Build a regression tree; both the split criterion and the leaf output values are derived from the loss function'''

#Compute the gain of the current split
def cal_Gain(G_L,G_R,H_L,H_R,reg_alpha,reg_lambda):
    return (G_L**2/(H_L+reg_lambda)+G_R**2/(H_R+reg_lambda)-(G_L+G_R)**2/((H_L+H_R)+reg_lambda))/2-reg_alpha

#Select the best split feature and split point
def select_best_feature(data:pd.DataFrame,G_H,reg_alpha=0,reg_lambda=1):
    features = data.columns.tolist()
    best_feat = '' #best split feature
    best_split = -1 #best split point
    max_gain = -1 #gain of the best feature and split point
    G_sum = np.sum(G_H[0]['gradient_sum']) #sum of first derivatives of all samples before splitting
    H_sum = np.sum(G_H[0]['hessian_sum']) #sum of second derivatives of all samples before splitting
    for i,feat in enumerate(features):
        G_H_df = G_H[i]
        split_vals = np.array(G_H_df.iloc[:,0])[1:-1]
        for val in split_vals:
            #partition by this feature value
            index = G_H_df.iloc[:,0]<val
            G_l = np.sum(G_H_df.loc[index,'gradient_sum']) #sum of first derivatives in the left child for this split point
            H_l = np.sum(G_H_df.loc[index,'hessian_sum'])
            cur_gain = cal_Gain(G_l,G_sum-G_l,H_l,H_sum-H_l,reg_alpha,reg_lambda) #compute the gain
            if cur_gain>max_gain:
                max_gain = cur_gain
                best_feat = feat
                best_split = val
    return best_feat, best_split,max_gain

#Return the optimal output value of a leaf, i.e. the value minimizing the loss function
def cal_best_w(gradient,hessian,reg_lambda):
    return -np.sum(gradient)/(np.sum(hessian)+reg_lambda)

#Build a histogram for each feature: for every bin accumulate the sums of first and second derivatives, used to compute split gains
def histogram(data:pd.DataFrame, gradient, hessian):
    features = data.columns.tolist()
    tmp_df = data.copy()
    tmp_df['gradient'] = gradient
    tmp_df['hessian'] = hessian
    G_H = []
    for i,feat in enumerate(features):
        #for each discretized value of the feature, sum the first and second derivatives of the samples taking that value
        gp = tmp_df.groupby(feat).agg({'gradient':['sum'], 'hessian':['sum']})
        gp.columns = pd.Index([f[0]+'_'+f[1] for f in gp.columns.tolist()])
        gp = gp.reset_index()
        G_H.append(gp)
    return G_H

#Histogram subtraction: parent histogram minus the left child's gives the right child's
def histogram_speed(G_H, G_H_l):
    G_H_r =  []
    for i in np.arange(len(G_H)):
        G_H_df=  G_H[i]
        G_H_l_df = G_H_l[i]
        G_H_r_df = G_H_df.copy()
        for i,val in enumerate(G_H_l_df.iloc[:,0]):
            index = (G_H_r_df.iloc[:,0] == val)
            G_H_r_df.loc[index,'gradient_sum'] -= G_H_l_df.loc[i,'gradient_sum']
            G_H_r_df.loc[index,'hessian_sum'] -= G_H_l_df.loc[i,'hessian_sum']
        G_H_r.append(G_H_r_df)
    return G_H_r

#Build the regression tree leaf-wise
def build_treeRegressor(data:pd.DataFrame,G_H,gradient,hessian,num_leaves=8,max_depth=3,min_samples_leaf=1,
                        gamma=0,reg_alpha=0,reg_lambda=0):
    '''
    :param data: training set
    :param G_H: list of per-feature histogram statistics
    :param gradient: np.array, first derivatives of the samples
    :param hessian: np.array, second derivatives of the samples
    :param num_leaves: maximum number of leaves of the tree
    :param max_depth: maximum depth of the tree
    :param min_samples_leaf: minimum number of samples per leaf
    :param gamma: minimum gain required to split
    :param reg_alpha: L1 regularization parameter
    :param reg_lambda: L2 regularization parameter
    :return: the tree model
    '''
    tree_leaves = []
    tree_leaves.append({'data': data, 'G_H': G_H,'gradient': gradient,'hessian': hessian,'cal':[],
                        'depth':0,'isSplit':True,'val':cal_best_w(gradient,hessian,reg_lambda)})
    while(len(tree_leaves)<num_leaves):
        best_feat = ''
        best_split = -1
        max_gain = -1
        best_leaf_index = -1
        # print('while')
        for i,leaf in enumerate(tree_leaves):
            #skip leaves already marked as unsplittable
            if leaf['isSplit']==False:
                continue
            # print('for')
            data_leaf = leaf['data']
            G_H_leaf = leaf['G_H']
            leaf_feat, leaf_split, leaf_gain = select_best_feature(data_leaf, G_H_leaf, reg_alpha, reg_lambda)
            # if the best achievable gain is below the threshold, do not split this leaf
            if leaf_gain < gamma:
                tree_leaves[i]['isSplit'] = False
                continue
            L_tree_index = data_leaf[leaf_feat] < leaf_split
            R_tree_index = data_leaf[leaf_feat] >= leaf_split
            # if either child would contain fewer samples than min_samples_leaf, stop splitting this leaf
            if L_tree_index.sum() < min_samples_leaf or R_tree_index.sum() < min_samples_leaf:
                tree_leaves[i]['isSplit'] = False
                continue
            if leaf_gain>max_gain:
                best_leaf_index = i
                max_gain = leaf_gain
                best_feat = leaf_feat
                best_split = leaf_split
        #if no splittable leaf remains, stop
        if best_leaf_index == -1:
            break
        # print(best_leaf_index)
        best_leaf = tree_leaves[best_leaf_index]
        gradient_leaf = best_leaf['gradient']
        hessian_leaf = best_leaf['hessian']
        #cal_process stores the sequence of comparisons needed to reach this node
        cal_process_l = best_leaf['cal'].copy()
        cal_process_r = best_leaf['cal'].copy()
        cal_process_l.append((best_feat,best_split,'l'))
        cal_process_r.append((best_feat,best_split,'r'))
        # compute the first/second-derivative histograms of the left and right children
        L_tree_index = best_leaf['data'][best_feat] < best_split
        R_tree_index = best_leaf['data'][best_feat] >= best_split
        G_H_l = histogram(best_leaf['data'][L_tree_index], gradient_leaf[L_tree_index], hessian_leaf[L_tree_index])
        G_H_r = histogram_speed(best_leaf['G_H'], G_H_l)

        isSplit = True
        # stop splitting once the maximum tree depth is reached
        if best_leaf['depth']+1>=max_depth:
            isSplit = False
        tree_leaves.append({'data': best_leaf['data'][L_tree_index], 'G_H': G_H_l,
                            'gradient': gradient_leaf[L_tree_index],'hessian': hessian_leaf[L_tree_index],
                            'cal':cal_process_l, 'depth':best_leaf['depth']+1,'isSplit':isSplit,
                            'val':cal_best_w(gradient_leaf[L_tree_index],hessian_leaf[L_tree_index],reg_lambda)})
        tree_leaves.append({'data': best_leaf['data'][R_tree_index], 'G_H': G_H_r,
                            'gradient': gradient_leaf[R_tree_index],'hessian': hessian_leaf[R_tree_index],
                            'cal':cal_process_r, 'depth':best_leaf['depth']+1,'isSplit':isSplit,
                            'val':cal_best_w(gradient_leaf[R_tree_index],hessian_leaf[R_tree_index],reg_lambda)})
        tree_leaves.pop(best_leaf_index)

    return tree_leaves

#Predict by sending each sample down the comparison path stored in each leaf
def predict(tree_leaves: list, data: pd.DataFrame):
    y_pred = np.zeros(len(data))
    for i in np.arange(len(data)):
        isOk = True
        for leaf in tree_leaves:
            cal_process = leaf['cal']
            for (feat, split_val, flag) in cal_process:
                if (flag == 'l') & (data.loc[i,feat]>=split_val):
                    isOk = False
                    break
                if (flag == 'r') & (data.loc[i,feat]<split_val):
                    isOk = False
                    break
            if isOk:
                y_pred[i] = leaf['val']
                break
            else:
                continue

    return y_pred

#Plot a tree stored as a nested dict (carried over from the previous post; the leaf-wise
#tree above is stored as a list of leaves, so this helper is not used by it)
def plotTree(A,tree: dict, father_node,depth,label):
    #current node is the root
    if depth == 1:
        A.add_node(father_node)
        #the root is also a leaf, i.e. a stump
        if tree['isLeaf'] == True:
            A.add_edge(father_node,tree['val'],label=label)
            return
        else:
            plotTree(A,tree['l_tree'], father_node,depth+1,'<=')
            plotTree(A,tree['r_tree'], father_node,depth+1,'>')
            return
    if tree['isLeaf'] == True:
        A.add_edge(father_node, tree['val'], label=label)
        return
    A.add_edge(father_node, tree['best_feat']+':'+str(tree['best_split']), label=label)
    plotTree(A,tree['l_tree'], tree['best_feat']+':'+str(tree['best_split']), depth+1,'<=')
    plotTree(A,tree['r_tree'], tree['best_feat']+':'+str(tree['best_split']), depth+1,'>')

3.2 Implementation of the Regressor
import numpy as np
import pandas as pd
import treeRegressor
from sklearn import datasets
from sklearn.metrics import mean_absolute_error, mean_squared_error
pd.set_option('display.max_rows', None)
pd.set_option('max.colwidth', 10240)
pd.set_option('display.max_columns',None)
pd.set_option('display.width',10240)
'''Build a LightGBM regressor'''

#Count the conflicts between a feature and an existing bundle (positions where both are nonzero and not already marked as conflicting)
def conflict_count(bundle_nonzero_mark, bundle_conflict_mark, data_nonzero_mark):
    return np.sum(data_nonzero_mark & bundle_nonzero_mark & (~bundle_conflict_mark))

#Exclusive feature bundling. K=len(data)/10000: with more than 10000 samples some conflicts are tolerated,
#while with fewer samples K<1, so effectively only strictly exclusive features are bundled (small data needs no speed-up)
def EFB(data:pd.DataFrame, K=0):
    features = data.columns.tolist()
    features_num = len(features)
    #build a weighted graph: each node is a feature, each edge weight is the total conflict between two features
    G = np.zeros((features_num,features_num))
    for i in np.arange(features_num):
        for j in np.arange(i+1,features_num):
            data_feat_i = data.loc[:,features[i]]
            data_feat_j = data.loc[:,features[j]]
            conflict_num = np.sum((data_feat_i!=0) & (data_feat_j!=0))
            G[i][j] = G[j][i] = conflict_num
    #sort the features by their degree (total conflicts) in the graph
    features_conflict_sum = np.sum(G,axis=0)
    searchOrder = np.argsort(-features_conflict_sum)
    bundles = [] #set of all bundles; each bundle's total conflict stays below K
    bundles_conflict = [] #total conflict of each bundle
    bundles_nonzero_mark = [] #2-D array: per bundle, whether each position of the merged features is nonzero
    bundles_conflict_mark = [] #2-D array: per bundle, whether a conflict has already occurred at each position
    for i in searchOrder:
        needNew = True #whether this feature must start a new bundle or can join an existing one
        data_nonzero_mark = np.array(data.iloc[:,i]).astype(bool)
        for j in np.arange(len(bundles)): #try to join an existing bundle
            cnt = conflict_count(bundles_nonzero_mark[j], bundles_conflict_mark[j],
                                 data_nonzero_mark) #conflicts between feature i and bundle j
            if cnt+bundles_conflict[j] <= K:
                bundles[j].append(i)
                bundles_conflict[j] += cnt
                bundles_conflict_mark[j] = bundles_conflict_mark[j] | (bundles_nonzero_mark[j] & data_nonzero_mark)
                bundles_nonzero_mark[j] = data_nonzero_mark | bundles_nonzero_mark[j]
                needNew = False
                break

        #if it cannot join any existing bundle, start a new one
        if needNew:
            bundles.append([i])
            bundles_conflict.append(0)
            bundles_nonzero_mark.append(np.array(data.iloc[:,i]).astype(bool)) #True where nonzero, False where zero
            bundles_conflict_mark.append(np.zeros(len(data)).astype(bool))

    return bundles

#Merge the features inside each bundle into a single feature
def features_merge(data:pd.DataFrame, bundles):
    data = data.copy()
    features = data.columns.tolist()
    drop_features = []
    for bundle in bundles:
        if len(bundle) <= 1:
            continue
        total_bins = 0 #offset
        bundle_feat = np.zeros(len(data)) #merged feature values
        bundle_feat_name = ''
        for feat_index in bundle:
            bundle_feat_name += str(features[feat_index])+'_'
            feat_vals = np.array(data.iloc[:,feat_index])
            #only shift the nonzero entries by the offset, otherwise values of different features would collide
            bundle_feat[feat_vals != 0] = feat_vals[feat_vals != 0] + total_bins
            total_bins += data.iloc[:,feat_index].nunique()
        drop_features.extend([features[idx] for idx in bundle]) #bundle stores positional indices, drop by column label
        data[bundle_feat_name] = bundle_feat
    data.drop(drop_features, axis=1, inplace=True)
    return data

#Gradient-based one-side sampling
def GOSS(data:pd.DataFrame, y_true, y_pred, top_rate=0.2, other_rate=0.1, loss='squarederror'):
    if loss == 'squarederror':
        gradient = -2*(y_true-y_pred) #first derivative (gradient)
        hessian = np.ones(len(y_true)) * 2 #second derivative
    order_by_gradient = np.argsort(-np.abs(gradient)) #sort descending by |gradient|
    top_index = order_by_gradient[:int(len(data)*top_rate)]
    other_index = np.random.choice(order_by_gradient[int(len(data)*top_rate):],size=int(len(data)*(1-top_rate)*other_rate))

    gradient[other_index] *= (1-top_rate)/other_rate
    hessian[other_index] *= (1-top_rate)/other_rate

    sample_index = list(top_index)
    sample_index.extend(list(other_index))
    return data.loc[sample_index,:], gradient[sample_index], hessian[sample_index]

#Discretize continuous features and return the resulting bin edges
def get_bins(data:pd.DataFrame, category_features=[], max_bins=256):
    features = data.columns.tolist()
    continuous_features = [feat for feat in features if feat not in category_features]
    tmp_df = data.copy()
    bins_list = [] #record the binning so the same transform can be applied to new data
    for feat in continuous_features:
        # bins = list(pd.cut(np.array(tmp_df.loc[:,feat]), max_bins, retbins=True)[1])
        bins = list(pd.qcut(np.array(tmp_df.loc[:,feat]), max_bins,duplicates='drop', retbins=True)[1])
        bins_list.append(bins)
    return bins_list

#Apply the discretization
def process(data:pd.DataFrame, bins_list, category_features):
    data = data.copy()
    features = data.columns.tolist()
    continuous_features = [feat for feat in features if feat not in category_features]
    tmp_df = data.copy()
    for i,feat in enumerate(continuous_features):
        bins = bins_list[i]
        cur_bin = 0
        for j in np.arange(len(bins) - 1):
            #left-closed first interval so that samples equal to the minimum value also get binned
            if j == 0:
                index = ((tmp_df[feat] >= bins[j]) & (tmp_df[feat] <= bins[j + 1]))
            else:
                index = ((tmp_df[feat] > bins[j]) & (tmp_df[feat] <= bins[j + 1]))
            if np.sum(index) == 0:
                continue
            data.loc[index, feat] = cur_bin
            cur_bin += 1
    return data

#Build a histogram for each feature: for every bin accumulate the sums of first and second derivatives, used to compute split gains
def histogram(data:pd.DataFrame, gradient, hessian):
    features = data.columns.tolist()
    tmp_df = data.copy()
    tmp_df['gradient'] = gradient
    tmp_df['hessian'] = hessian
    G_H = []
    for i,feat in enumerate(features):
        #for each discretized value of the feature, sum the first and second derivatives of the samples taking that value
        gp = tmp_df.groupby(feat).agg({'gradient':['sum'], 'hessian':['sum']})
        gp.columns = pd.Index([f[0]+'_'+f[1] for f in gp.columns.tolist()])
        gp = gp.reset_index()
        G_H.append(gp)
    return G_H

def build_lightGBMRegressor(data:pd.DataFrame,y_true:np.array,n=3,category_features=[],num_leaves=8,max_depth=3,
                            min_samples_leaf=1,gamma=1e-7,reg_alpha=0,reg_lambda=1,top_rate=0.2,other_rate=0.1,
                            loss='squarederror',lr=0.1):
    f0 = 0 #initial constant prediction
    y_pred = np.ones(len(y_true))*f0

    bins_list = get_bins(data, category_features, max_bins=256) #compute bin edges
    data = process(data, bins_list, category_features) #discretize continuous features
    bundles = EFB(data, K=len(data)/10000) #decide which features can be bundled
    data = features_merge(data, bundles) #merge the bundled features
    lightGBMRegressor = []
    for i in np.arange(n):
        #gradient-based sampling
        if i<10: #no subsampling in the first 10 rounds
            subsample_data, gradient, hessian = GOSS(data, y_true, y_pred, 1, other_rate, loss)
        else:
            subsample_data, gradient, hessian = GOSS(data, y_true, y_pred, top_rate, other_rate, loss)
        G_H = histogram(subsample_data, gradient, hessian)
        #fit a base learner
        fn = treeRegressor.build_treeRegressor(subsample_data,G_H,gradient,hessian,
                                               num_leaves, max_depth, min_samples_leaf, gamma, reg_alpha,reg_lambda)

        lightGBMRegressor.append(fn)
        if i==0:
            y_pred += treeRegressor.predict(fn, data)
        else:
            y_pred += lr*treeRegressor.predict(fn, data)
    print('train mse:{} mae:{}'.format(mean_squared_error(y_true,y_pred), mean_absolute_error(y_true, y_pred)))

    return lightGBMRegressor, bins_list, bundles

def predict(lightGBMRegressor, data:pd.DataFrame, bundles, bins_list,category_features, lr=0.1):
    y_pred = np.zeros(len(data))
    data = process(data, bins_list, category_features)  # discretize continuous features
    data = features_merge(data, bundles)  # merge the bundled features
    for i,tree in enumerate(lightGBMRegressor):
        # y_pred += treeRegressor.predict(tree, data)
        if i==0:
            y_pred += treeRegressor.predict(tree, data)
        else:
            y_pred += lr*treeRegressor.predict(tree, data)
    return y_pred

if __name__ == "__main__":
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split

    X,y = datasets.load_boston(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2020)
    print('train: {} test: {}'.format(X_train.shape, X_test.shape))
    X_train_df = pd.DataFrame(X_train)
    X_test_df = pd.DataFrame(X_test)
    category_features = [3]

    lightGBMRegressor,bins_list, bundles = build_lightGBMRegressor(X_train_df, y_train, n=20, max_depth=5, top_rate=0.2,
                                                num_leaves=20,lr=1,min_samples_leaf=2, category_features=category_features)

    y_pred_train = predict(lightGBMRegressor, X_train_df,bundles,bins_list,category_features, lr=1)
    y_pred_test = predict(lightGBMRegressor, X_test_df,bundles,bins_list,category_features, lr=1)
    print('train mse:{} mae:{}'.format(mean_squared_error(y_train,y_pred_train), mean_absolute_error(y_train, y_pred_train)))
    print('test mse:{} mae:{}'.format(mean_squared_error(y_test,y_pred_test), mean_absolute_error(y_test, y_pred_test)))

Output (the error is fairly large, probably because the dataset is small while sampling and discretization of the continuous values are still applied):

    train mse: 14.848270000722334 mae: 2.821941319538105
    test mse: 174.51428843166528 mae: 8.901842111423283
3.3 Implementation of the Classifier
import numpy as np
import pandas as pd
import treeRegressor
from sklearn import datasets
pd.set_option('display.max_rows', None)
pd.set_option('max.colwidth', 10240)
pd.set_option('display.max_columns',None)
pd.set_option('display.width',10240)
'''Build a LightGBM classifier'''

#Count the conflicts between a feature and an existing bundle (positions where both are nonzero and not already marked as conflicting)
def conflict_count(bundle_nonzero_mark, bundle_conflict_mark, data_nonzero_mark):
    return np.sum(data_nonzero_mark & bundle_nonzero_mark & (~bundle_conflict_mark))

#Exclusive feature bundling. K=len(data)/10000: with more than 10000 samples some conflicts are tolerated,
#while with fewer samples K<1, so effectively only strictly exclusive features are bundled (small data needs no speed-up)
def EFB(data:pd.DataFrame, K=0):
    features = data.columns.tolist()
    features_num = len(features)
    #build a weighted graph: each node is a feature, each edge weight is the total conflict between two features
    G = np.zeros((features_num,features_num))
    for i in np.arange(features_num):
        for j in np.arange(i+1,features_num):
            data_feat_i = data.loc[:,features[i]]
            data_feat_j = data.loc[:,features[j]]
            conflict_num = np.sum((data_feat_i!=0) & (data_feat_j!=0))
            G[i][j] = G[j][i] = conflict_num
    #sort the features by their degree (total conflicts) in the graph
    features_conflict_sum = np.sum(G,axis=0)
    searchOrder = np.argsort(-features_conflict_sum)
    bundles = [] #set of all bundles; each bundle's total conflict stays below K
    bundles_conflict = [] #total conflict of each bundle
    bundles_nonzero_mark = [] #2-D array: per bundle, whether each position of the merged features is nonzero
    bundles_conflict_mark = [] #2-D array: per bundle, whether a conflict has already occurred at each position
    for i in searchOrder:
        needNew = True #whether this feature must start a new bundle or can join an existing one
        data_nonzero_mark = np.array(data.iloc[:,i]).astype(bool)
        for j in np.arange(len(bundles)): #try to join an existing bundle
            cnt = conflict_count(bundles_nonzero_mark[j], bundles_conflict_mark[j],
                                 data_nonzero_mark) #conflicts between feature i and bundle j
            if cnt+bundles_conflict[j] <= K:
                bundles[j].append(i)
                bundles_conflict[j] += cnt
                bundles_conflict_mark[j] = bundles_conflict_mark[j] | (bundles_nonzero_mark[j] & data_nonzero_mark)
                bundles_nonzero_mark[j] = data_nonzero_mark | bundles_nonzero_mark[j]
                needNew = False
                break

        #if it cannot join any existing bundle, start a new one
        if needNew:
            bundles.append([i])
            bundles_conflict.append(0)
            bundles_nonzero_mark.append(np.array(data.iloc[:,i]).astype(bool)) #True where nonzero, False where zero
            bundles_conflict_mark.append(np.zeros(len(data)).astype(bool))

    return bundles

#Merge the features inside each bundle into a single feature
def features_merge(data:pd.DataFrame, bundles):
    data = data.copy()
    features = data.columns.tolist()
    drop_features = []
    for bundle in bundles:
        if len(bundle) <= 1:
            continue
        total_bins = 0 #offset
        bundle_feat = np.zeros(len(data)) #merged feature values
        bundle_feat_name = ''
        for feat_index in bundle:
            bundle_feat_name += str(features[feat_index])+'_'
            feat_vals = np.array(data.iloc[:,feat_index])
            #only shift the nonzero entries by the offset, otherwise values of different features would collide
            bundle_feat[feat_vals != 0] = feat_vals[feat_vals != 0] + total_bins
            total_bins += data.iloc[:,feat_index].nunique()
        drop_features.extend([features[idx] for idx in bundle]) #bundle stores positional indices, drop by column label
        data[bundle_feat_name] = bundle_feat
    data.drop(drop_features, axis=1, inplace=True)
    return data

#Gradient-based one-side sampling
def GOSS(data:pd.DataFrame, y_true, y_pred, top_rate=0.2, other_rate=0.1, loss='logloss'):
    if loss == 'logloss':
        exp_y_pred = np.exp(y_pred)
        gradient = 1 - y_true - 1 / (1 + exp_y_pred) #first derivative (gradient)
        hessian = exp_y_pred / ((1 + exp_y_pred) ** 2) #second derivative
    order_by_gradient = np.argsort(-np.abs(gradient)) #sort descending by |gradient|
    top_index = order_by_gradient[:int(len(data)*top_rate)]
    other_index = np.random.choice(order_by_gradient[int(len(data)*top_rate):],size=int(len(data)*(1-top_rate)*other_rate))

    if len(other_index)!=0:
        gradient[other_index] *= (1-top_rate)/other_rate
        hessian[other_index] *= (1-top_rate)/other_rate
    sample_index = list(top_index)
    sample_index.extend(list(other_index))
    return data.loc[sample_index,:], gradient[sample_index], hessian[sample_index]

#Discretize continuous features and return the resulting bin edges
def get_bins(data:pd.DataFrame, category_features=[], max_bins=256):
    features = data.columns.tolist()
    continuous_features = [feat for feat in features if feat not in category_features]
    tmp_df = data.copy()
    bins_list = [] #record the binning so the same transform can be applied to new data
    for feat in continuous_features:
        # bins = list(pd.cut(np.array(tmp_df.loc[:,feat]), max_bins, retbins=True)[1])
        bins = list(pd.qcut(np.array(tmp_df.loc[:,feat]), max_bins,duplicates='drop', retbins=True)[1])
        bins_list.append(bins)
    return bins_list

#Apply the discretization
def process(data:pd.DataFrame, bins_list, category_features):
    data = data.copy()
    features = data.columns.tolist()
    continuous_features = [feat for feat in features if feat not in category_features]
    tmp_df = data.copy()
    for i,feat in enumerate(continuous_features):
        bins = bins_list[i]
        cur_bin = 0
        for j in np.arange(len(bins) - 1):
            #left-closed first interval so that samples equal to the minimum value also get binned
            if j == 0:
                index = ((tmp_df[feat] >= bins[j]) & (tmp_df[feat] <= bins[j + 1]))
            else:
                index = ((tmp_df[feat] > bins[j]) & (tmp_df[feat] <= bins[j + 1]))
            if np.sum(index) == 0:
                continue
            data.loc[index, feat] = cur_bin
            cur_bin += 1
    return data

#Build a histogram for each feature: for every bin accumulate the sums of first and second derivatives, used to compute split gains
def histogram(data:pd.DataFrame, gradient, hessian):
    features = data.columns.tolist()
    tmp_df = data.copy()
    tmp_df['gradient'] = gradient
    tmp_df['hessian'] = hessian
    G_H = []
    for i,feat in enumerate(features):
        #for each discretized value of the feature, sum the first and second derivatives of the samples taking that value
        gp = tmp_df.groupby(feat).agg({'gradient':['sum'], 'hessian':['sum']})
        gp.columns = pd.Index([f[0]+'_'+f[1] for f in gp.columns.tolist()])
        gp = gp.reset_index()
        G_H.append(gp)
    return G_H

def build_lightGBMClassifier(data:pd.DataFrame,y_true:np.array,n=3,category_features=[],num_leaves=8,max_depth=3,
                            min_samples_leaf=1,gamma=1e-7,reg_alpha=0,reg_lambda=1,top_rate=0.2,other_rate=0.1,
                            loss='logloss',lr=0.1):
    if loss == 'logloss':
        f0 = np.log(np.sum(y_true) / np.sum(1 - y_true))  # initialize with the constant that minimizes the loss (log-odds of the positive class)
    y_pred = np.ones(len(y_true))*f0

    bins_list = get_bins(data, category_features, max_bins=256) #compute bin edges
    data = process(data, bins_list, category_features) #discretize continuous features
    bundles = EFB(data, K=len(data)/10000) #decide which features can be bundled
    data = features_merge(data, bundles) #merge the bundled features
    lightGBMClassifier = []
    for i in np.arange(n):
        #gradient-based sampling
        if i<5: #no subsampling in the first 5 rounds
            subsample_data, gradient, hessian = GOSS(data, y_true, y_pred, 1, other_rate, loss)
        else:
            subsample_data, gradient, hessian = GOSS(data, y_true, y_pred, top_rate, other_rate, loss)
        G_H = histogram(subsample_data, gradient, hessian)
        #fit a base learner
        fn = treeRegressor.build_treeRegressor(subsample_data,G_H,gradient,hessian,
                                               num_leaves, max_depth, min_samples_leaf, gamma, reg_alpha,reg_lambda)

        lightGBMClassifier.append(fn)
        if i==0:
            y_pred += treeRegressor.predict(fn, data)
        else:
            y_pred += lr*treeRegressor.predict(fn, data)
    # print('train mse:{} mae:{}'.format(mean_squared_error(y_true,y_pred), mean_absolute_error(y_true, y_pred)))

    return lightGBMClassifier, bins_list, bundles

def predict(lightGBMClassifier, data:pd.DataFrame, bundles, bins_list,category_features, lr=0.1):
    y_pred = np.zeros(len(data))
    data = process(data, bins_list, category_features)  # discretize continuous features
    data = features_merge(data, bundles)  # merge the bundled features
    for i,tree in enumerate(lightGBMClassifier):
        # y_pred += treeRegressor.predict(tree, data)
        if i==0:
            y_pred += treeRegressor.predict(tree, data)
        else:
            y_pred += lr*treeRegressor.predict(tree, data)
    y_pred_prob = 1 / (1 + np.exp(-y_pred))
    y_pred_prob[y_pred_prob > 0.5] = 1
    y_pred_prob[y_pred_prob <= 0.5] = 0
    print(y_pred_prob)
    return y_pred_prob

if __name__ == '__main__':
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, accuracy_score, recall_score

    X, y = datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2020)
    print('train: {} test: {}'.format(X_train.shape, X_test.shape))
    print(np.sum(y_train))
    X_train_df = pd.DataFrame(X_train)
    X_test_df = pd.DataFrame(X_test)

    lightGBMClassifiers,bins_list, bundles = build_lightGBMClassifier(X_train_df, y_train, n=15, max_depth=5, top_rate=0.2,
                                                num_leaves=20,lr=1,min_samples_leaf=2, category_features=[])

    y_pred_train = predict(lightGBMClassifiers, X_train_df,bundles,bins_list,[], lr=1)
    y_pred_test = predict(lightGBMClassifiers, X_test_df,bundles,bins_list,[], lr=1)
    print('train acc:{} precision:{} recall:{}'.format(accuracy_score(y_train, y_pred_train),
                                       precision_score(y_train, y_pred_train),
                                       recall_score(y_train,y_pred_train)))
    print('test acc:{} precision:{} recall:{}'.format(accuracy_score(y_test, y_pred_test),
                                      precision_score(y_test, y_pred_test),
                                      recall_score(y_test,y_pred_test)))

References:
1. Lightgbm基本原理介绍 (an introduction to the basic principles of LightGBM)
2. LightGBM源码阅读+理论分析(处理特征类别,缺省值的实现细节) (reading the LightGBM source with theoretical analysis: how categorical features and missing values are handled)
