A Summary of Binning Methods

nikita_zj

Article link: https://blog.csdn.net/nikita_zj/article/details/122733883

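1. Chi-square binning: a supervised binning method

1.1 The chi-square test

The chi-square test is a statistical method for analyzing the frequencies of categorical data. It is used to assess the relationship (degree of association) between categorical variables. There are two kinds of chi-square test: the goodness-of-fit test and the test of independence. Both compare observed frequencies O with expected frequencies E through the statistic χ² = Σ (O - E)² / E.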

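1.1.1 Goodness-of-fit test

The goodness-of-fit test applies to a single categorical variable. Starting from the assumed population distribution, it computes the expected frequency of each category, compares it with the observed frequency, and judges whether the expected and observed frequencies differ significantly.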
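As a minimal sketch of the goodness-of-fit statistic described above, the following pure-Python snippet compares observed counts with the counts expected under an assumed distribution. The function name `chi_square_gof` and the die-roll data are hypothetical, chosen only for illustration.

```python
def chi_square_gof(observed, expected_probs):
    """Return the chi-square goodness-of-fit statistic and its degrees of freedom."""
    total = sum(observed)
    # Expected count of each category under the assumed distribution
    expected = [total * p for p in expected_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical data: 120 rolls of a die, tested against a fair die
observed = [25, 17, 15, 23, 24, 16]
stat, dof = chi_square_gof(observed, [1 / 6] * 6)
print(round(stat, 2), dof)  # 5.0 5
```

With 5 degrees of freedom, the critical value at the 0.05 level is about 11.07, so a statistic of 5.0 gives no evidence against the fair-die assumption.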

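1.1.2 Contingency analysis: test of independence

The test of independence applies to two categorical variables. The analysis is presented as a contingency table, so the question becomes whether the row variable and the column variable of that table are independent of each other (or associated).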
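The test of independence can be sketched in a few lines: the expected count of each cell is (row total × column total) / grand total, and the statistic sums (O - E)² / E over all cells. The function name `chi_square_independence` and the 2 × 2 table are hypothetical.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom of a contingency table."""
    rows = [sum(r) for r in table]           # row totals
    cols = [sum(c) for c in zip(*table)]     # column totals
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            e = rows[i] * cols[j] / total    # expected count under independence
            if e > 0:
                stat += (obs - e) ** 2 / e
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


# Hypothetical table: rows are two groups, columns are two outcomes
table = [[30, 10],
         [20, 40]]
stat, dof = chi_square_independence(table)
print(round(stat, 4), dof)  # 16.6667 1
```

This is the same computation that chi-square binning applies to each pair of adjacent bins, with the bins as rows and the target classes as columns.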

1.2 Chi-square binning

Paper (Kerber, "ChiMerge: Discretization of Numeric Attributes", AAAI-92):
https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf

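The key steps are:

1. Initialization: sort the continuous variable by value and perform an initial discretization.

2. Merging: the merge phase consists of two steps that are repeated continuously:
1) Compute the chi-square value of every pair of adjacent bins
2) Merge the adjacent bins with the lowest chi-square value

Merging stops when either condition holds:

1. The chi-square values of all adjacent pairs are greater than or equal to the chi-square threshold

2. The number of bins reaches the preset number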
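The steps and stop conditions above can be sketched in pure Python as follows. This is a simplified illustration, not toad's implementation: the names `pair_chi2`, `chi_merge`, `max_bins` and `threshold` are hypothetical, one initial bin is created per distinct value, and only one pair is merged per iteration.

```python
def pair_chi2(a, b):
    """Chi-square statistic of the 2 x k table formed by two adjacent bins."""
    rows = [sum(a), sum(b)]
    cols = [x + y for x, y in zip(a, b)]
    total = sum(rows)
    chi = 0.0
    for r, counts in zip(rows, (a, b)):
        for c, obs in zip(cols, counts):
            e = r * c / total                # expected count
            if e > 0:
                chi += (obs - e) ** 2 / e
    return chi


def chi_merge(values, labels, max_bins=None, threshold=None):
    """Return the lower edges of all bins except the first."""
    classes = sorted(set(labels))
    cuts = sorted(set(values))               # initialization: one bin per distinct value
    counts = [[0] * len(classes) for _ in cuts]
    for v, y in zip(values, labels):
        counts[cuts.index(v)][classes.index(y)] += 1

    while len(counts) > 1:
        # Stop condition 2: the preset number of bins is reached
        if max_bins is not None and len(counts) <= max_bins:
            break
        chis = [pair_chi2(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)
        # Stop condition 1: every adjacent pair is at or above the threshold
        if threshold is not None and chis[i] >= threshold:
            break
        # Merge the pair with the lowest chi-square value
        counts[i] = [x + y for x, y in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
        del cuts[i + 1]
    return cuts[1:]


# Hypothetical data: the target flips from 0 to 1 between the values 4 and 5
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(chi_merge(values, labels, max_bins=2))  # [5]
```

If neither `max_bins` nor `threshold` is given, this sketch merges everything into a single bin and returns an empty list, which is why a default stop condition is useful in practice.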

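The chi-square binning implementation in the toad package follows this procedure, repeating the merge until a stop condition is met.

Source code (helpers such as fillna, to_ndarray, c_min, c_sum, c_sum_axis_0, c_sum_axis_1 and the constant DEFAULT_BINS are defined elsewhere in toad and are not shown here):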

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef ChiMerge(feature, target, n_bins = None, min_samples = None,
            min_threshold = None, nan = -1, balance = True):
    """Chi-Merge
    Args:
        feature (array-like): feature to be merged
        target (array-like): an array of target classes
        n_bins (int): number of bins to merge into
        min_samples (number): min sample in each group, if float, it will be the percentage of samples
        min_threshold (number): min threshold of chi-square
    Returns:
        array: array of split points
    """

    # set default break condition
    if n_bins is None and min_samples is None and min_threshold is None:
        n_bins = DEFAULT_BINS

    if min_samples and min_samples < 1:
        min_samples = len(feature) * min_samples

    feature = fillna(feature, by = nan)
    target = to_ndarray(target)


    target_unique = np.unique(target)
    feature_unique = np.unique(feature)
    len_f = len(feature_unique) # number of distinct feature values
    len_t = len(target_unique) # number of distinct target classes

    cdef double [:,:] grouped = np.zeros((len_f, len_t), dtype=np.float64) # np.float was removed in NumPy 1.24

    for r in range(len_f): # loop over the distinct feature values
        tmp = target[feature == feature_unique[r]] # targets of the samples with this feature value
        for c in range(len_t):
            grouped[r, c] = (tmp == target_unique[c]).sum() # count of each target class for this feature value


    cdef double [:,:] couple
    cdef double [:] cols, rows, chi_list
    cdef double chi, chi_min, total, e
    cdef int l, retain_ix, ix
    cdef Py_ssize_t i, j, k, p

    while(True):
        # break loop when reach n_bins
        if n_bins and len(grouped) <= n_bins: # the number of bins is already at or below the maximum
            break

        # break loop if min samples of groups is greater than threshold
        if min_samples and c_min(c_sum_axis_1(grouped)) > min_samples:
            break

        # Calc chi square for each group
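        l = len(grouped) - 1 # number of adjacent pairs; len(grouped) is the current number of bins (initially the number of distinct feature values)
        chi_list = np.zeros(l, dtype=np.float64)
        chi_min = np.inf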

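        for i in range(l): # find the adjacent pair(s) of bins with the smallest chi-square value
            chi = 0
            couple = grouped[i:i+2,:] # take two adjacent rows
            total = c_sum(couple) # sum of all counts in these two rows
            cols = c_sum_axis_0(couple) # column sums: count of each target class in the pair
            rows = c_sum_axis_1(couple) # row sums: number of samples in each of the two bins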
            
            for j in range(couple.shape[0]):
                for k in range(couple.shape[1]):
                    e = rows[j] * cols[k] / total # expected count
                    if e != 0:
                        chi += (couple[j, k] - e) ** 2 / e

            # balance weight of chi
            if balance:
                chi *= total

            chi_list[i] = chi

            if chi == chi_min:
                chi_ix.append(i)
                continue

            if chi < chi_min:
                chi_min = chi
                chi_ix = [i]


        # break loop when the minimum chi is greater than the threshold
        if min_threshold and chi_min > min_threshold:
            break

        # get indexes of the groups which have the minimum chi
        min_ix = np.array(chi_ix)

        # get the indexes which need to be dropped
        drop_ix = min_ix + 1

        # combine groups by indexes
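        retain_ix = min_ix[0] # e.g. min_ix = [0, 1, 3, 4, 5]
        last_ix = retain_ix # e.g. 0
        for ix in min_ix: # ix is the index of a pair of adjacent bins (ix, ix + 1)
            # set a new group
            if ix - last_ix > 1: # a gap starts a new group; consecutive indexes mean three or more bins merge into one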
                retain_ix = ix

            # combine all contiguous indexes into one group
            for p in range(grouped.shape[1]):
                grouped[retain_ix, p] = grouped[retain_ix, p] + grouped[ix + 1, p]

            last_ix = ix # e.g. 1


        # drop binned groups
        grouped = np.delete(grouped, drop_ix, axis = 0)
        feature_unique = np.delete(feature_unique, drop_ix)


    return feature_unique[1:]
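The index handling in the merge step is the least obvious part of the listing, so here is a small pure-Python sketch of it using the `min_ix = [0, 1, 3, 4, 5]` example from the comments. The function name `merge_rows` and the count table are hypothetical, and lists stand in for the NumPy array.

```python
def merge_rows(grouped, min_ix):
    """Merge every adjacent pair (ix, ix + 1) listed in min_ix.

    Runs of consecutive indexes collapse into a single row.
    """
    grouped = [row[:] for row in grouped]    # work on a copy
    retain_ix = last_ix = min_ix[0]
    for ix in min_ix:
        if ix - last_ix > 1:                 # gap: a new run of bins starts here
            retain_ix = ix
        # Fold row ix + 1 into the row that heads the current run
        grouped[retain_ix] = [a + b for a, b in zip(grouped[retain_ix], grouped[ix + 1])]
        last_ix = ix
    drop = {ix + 1 for ix in min_ix}         # rows that were folded into a run head
    return [row for i, row in enumerate(grouped) if i not in drop]


# Hypothetical count table: 7 bins, 2 target classes
grouped = [[1, 0], [2, 0], [3, 0], [0, 4], [0, 5], [0, 6], [0, 7]]
merged = merge_rows(grouped, [0, 1, 3, 4, 5])
print(merged)  # [[6, 0], [0, 22]]
```

Pairs 0 and 1 are consecutive, so rows 0, 1 and 2 become one bin; pairs 3, 4 and 5 form a second run, so rows 3 to 6 become another. All pairs that tie for the minimum chi-square value are therefore merged in a single pass.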

References:

Chi-square binning and its code implementation, hutao_ljj's blog, CSDN: https://blog.csdn.net/hutao_ljj/article/details/105448887