Project Training Log No. 8

Latest recommended article published on 2024-06-28 16:42:10

qq_44219737

Article tags: python, data analysis

Article link: https://blog.csdn.net/qq_44219737/article/details/118771625

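Using a decision tree to obtain the list of optimal binning boundary values

- 1. The concept of decision tree binning
- 2. Code for decision tree binning

1. The concept of decision tree binning

Decision tree binning: a decision tree partitions the data top-down and is a supervised binning method, meaning it requires the label variable. Because class information is used during binning, the interval boundaries are more likely to fall where they help improve classification accuracy.
In practice, you fit a tree model (for example, the CART tree in sklearn) to y using only the single continuous variable you want to discretize; the split thresholds of the fitted tree then serve as the bin boundaries.

2. Code for decision tree binning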
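To make the idea of "boundaries placed where they help classification" concrete, here is a minimal pure-Python sketch of how a single entropy-based split might be chosen on one variable. The helper names `entropy` and `best_split` and the toy data are made up for illustration; this is not sklearn's implementation, only the same principle applied to one cut, with midpoints between adjacent values assumed as the candidate thresholds.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for v in labels:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_split(x, y):
    """Return (threshold, weighted_entropy) for the best single cut on x."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    best_thr, best_score = None, float('inf')
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no cut is possible between equal values
        thr = (xs[i] + xs[i - 1]) / 2  # candidate: midpoint of neighbours
        left, right = ys[:i], ys[i:]
        # Weighted average of the child entropies; lower is better.
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score


# Toy data (hypothetical): the label flips between x = 25 and x = 31.
x = [18, 22, 25, 31, 40, 47, 52, 60]
y = [1, 1, 1, 0, 0, 0, 0, 0]
thr, score = best_split(x, y)
print(thr, score)
```

On this toy data the chosen cut falls between 25 and 31, exactly where the label changes. An unsupervised method such as equal-width binning would ignore the labels and could cut somewhere less useful.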

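import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def divide_boxes(x: pd.Series, y: pd.Series, nan: float = -999.) -> list:
    '''
        Use a decision tree to obtain the list of optimal binning boundary values.
    '''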
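    boundary = []  # list of bin boundary values to return

    x = x.fillna(nan).values  # fill missing values with the sentinel value
    y = y.values

    clf = DecisionTreeClassifier(criterion='entropy',    # split by minimizing information entropy
                                 max_leaf_nodes=6,       # maximum number of leaf nodes
                                 min_samples_leaf=0.05)  # minimum fraction of samples per leaf node

    clf.fit(x.reshape(-1, 1), y)  # train the decision tree on the single feature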

    n_nodes = clf.tree_.node_count
    children_left = clf.tree_.children_left
    children_right = clf.tree_.children_right
    threshold = clf.tree_.threshold

    for i in range(n_nodes):
        if children_left[i] != children_right[i]:  # internal node (a leaf has both children set to -1): take its split threshold
            boundary.append(threshold[i])

    boundary.sort()

    min_x = x.min()
    max_x = x.max() + 0.1  # +0.1 so that samples with the feature's maximum value are still included in the later groupby step
    boundary = [min_x] + boundary + [max_x]

    return boundary
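The `+0.1` padding only matters once the boundaries are applied to the data. As a rough sketch of that later groupby step, the example below bins a feature with `pd.cut` and summarizes the label per bin. The data and the `boundary` list are hypothetical, hard-coded in the shape that `divide_boxes` returns, and left-closed intervals (`right=False`) are an assumption about how the boundaries are meant to be used.

```python
import pandas as pd

# Hypothetical inputs: a small feature/label pair and a boundary list of the
# form [min, split thresholds..., max + 0.1].
x = pd.Series([18, 22, 25, 31, 40, 47, 52, 60], name='age')
y = pd.Series([1, 1, 1, 0, 0, 0, 0, 0], name='label')
boundary = [18, 28.0, 49.5, 60.1]

# right=False builds left-closed bins [a, b): the minimum value is included
# in the first bin, and the +0.1 padding keeps the maximum inside the last.
bins = pd.cut(x, bins=boundary, right=False)

# Per-bin sample count and positive rate.
summary = y.groupby(bins, observed=False).agg(['count', 'mean'])
print(summary)
```

With the default `right=True`, the minimum value would fall outside the first bin unless `include_lowest=True` is passed, so the interval convention should match how the boundary list was built.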