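1. Chi-square binning: a supervised binning method
1.1 Chi-square test
The chi-square test is a statistical method for analyzing the frequencies of categorical data. It is used to assess whether, and how strongly, two categorical variables are related. There are two kinds of chi-square test: the goodness-of-fit test and the test of independence.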
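1.1.1 Goodness-of-fit test
The goodness-of-fit test applies to a single categorical variable. Given the assumed distribution of the population, compute the expected frequency of each category, compare it with the observed frequency, and judge whether the expected and observed frequencies differ significantly.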
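As a minimal sketch of the comparison described above, the statistic can be computed as the sum of (observed - expected)^2 / expected over all categories. The function name `chi_square_gof` and the sample counts are hypothetical, chosen only for illustration.

```python
def chi_square_gof(observed, expected_probs):
    """Chi-square goodness-of-fit statistic.

    observed: observed count per category
    expected_probs: assumed population probability per category (sums to 1)
    """
    total = sum(observed)
    # Expected frequency of each category under the assumed distribution
    expected = [total * p for p in expected_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative data: 100 samples in 3 categories, assumed shares 25% / 25% / 50%
observed = [30, 20, 50]
gof_stat = chi_square_gof(observed, [0.25, 0.25, 0.5])
print(gof_stat)  # 2.0
```

With 3 categories there are 2 degrees of freedom, and the 5% critical value is about 5.991, so a statistic of 2.0 gives no evidence that the observed frequencies deviate from the assumed distribution.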
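1.1.2 Contingency-table analysis: test of independence
The test of independence applies to two categorical variables. The analysis is laid out as a contingency table, so the question becomes whether the row variable and the column variable of the table are independent of each other (or associated).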
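A small sketch of the same idea on a contingency table: under independence, the expected count of each cell is (row total * column total) / grand total. The function name `chi_square_independence` and the example table are hypothetical.

```python
def chi_square_independence(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count of cell (i, j) if rows and columns were independent
            e = row_sums[i] * col_sums[j] / total
            if e != 0:
                chi += (obs - e) ** 2 / e
    return chi


# Illustrative 2x2 table: rows are two groups, columns are two outcomes
table = [[20, 30],
         [30, 20]]
indep_stat = chi_square_independence(table)
print(indep_stat)  # 4.0
```

A 2x2 table has 1 degree of freedom, and the 5% critical value is about 3.841, so 4.0 would lead to rejecting independence at that level.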
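1.2 Chi-square binning
The key steps are:
1. Initialization: sort the continuous variable by value and apply an initial discretization.
2. Merging: repeat the following two steps:
1) Compute the chi-square value of every pair of adjacent bins.
2) Merge the adjacent bins with the lowest chi-square value. A low value means the two bins have similar target distributions, so little information is lost by merging them.
Merging stops when either of these conditions holds:
1. The chi-square value of every pair of adjacent bins is greater than or equal to the chi-square threshold.
2. The number of bins has reached the preset number.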
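The steps above can be sketched in pure Python as follows. This is a simplified illustration, not the toad implementation: it starts with one bin per distinct value, merges only one pair per pass (the first pair with the lowest chi-square value), and supports only the two stop conditions listed above. The names `pair_chi2` and `chi_merge` are hypothetical.

```python
def pair_chi2(a, b):
    """Chi-square value of two adjacent bins, each a list of per-class counts."""
    row_sums = [sum(a), sum(b)]
    col_sums = [x + y for x, y in zip(a, b)]
    total = sum(row_sums)
    chi = 0.0
    for i, row in enumerate((a, b)):
        for j, obs in enumerate(row):
            e = row_sums[i] * col_sums[j] / total
            if e != 0:
                chi += (obs - e) ** 2 / e
    return chi


def chi_merge(values, labels, n_bins=None, min_threshold=None):
    """Return split points (lower edges of every bin except the first)."""
    if n_bins is None and min_threshold is None:
        raise ValueError("set n_bins or min_threshold")
    # Step 1: sort distinct values; each distinct value is an initial bin
    uniq = sorted(set(values))
    classes = sorted(set(labels))
    pos = {v: i for i, v in enumerate(uniq)}
    cls = {c: j for j, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in uniq]
    for v, y in zip(values, labels):
        counts[pos[v]][cls[y]] += 1
    lows = list(uniq)  # lower edge of each current bin

    # Step 2: repeatedly merge the adjacent pair with the lowest chi-square
    while len(counts) > 1:
        if n_bins is not None and len(counts) <= n_bins:
            break  # stop condition 2: preset number of bins reached
        chis = [pair_chi2(counts[i], counts[i + 1])
                for i in range(len(counts) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)
        if min_threshold is not None and chis[i] >= min_threshold:
            break  # stop condition 1: every pair is at or above the threshold
        counts[i] = [x + y for x, y in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
        del lows[i + 1]
    return lows[1:]


values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
splits = chi_merge(values, labels, n_bins=2)
print(splits)  # [5]
```

In this toy data the target flips between 4 and 5, so every other adjacent pair has a chi-square value of 0 and is merged first, leaving a single split point at 5.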
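Chi-square binning as implemented in the toad package: bins are merged as shown in the figure below until a stop condition is met.
The source code is as follows: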
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef ChiMerge(feature, target, n_bins = None, min_samples = None,
            min_threshold = None, nan = -1, balance = True):
    """Chi-Merge

    Args:
        feature (array-like): feature to be merged
        target (array-like): an array of target classes
        n_bins (int): n bins will be merged into
        min_samples (number): min sample in each group, if float, it will be the percentage of samples
        min_threshold (number): min threshold of chi-square

    Returns:
        array: array of split points
    """

    # set default break condition
    if n_bins is None and min_samples is None and min_threshold is None:
        n_bins = DEFAULT_BINS

    if min_samples and min_samples < 1:
        min_samples = len(feature) * min_samples

    feature = fillna(feature, by = nan)
    target = to_ndarray(target)

    target_unique = np.unique(target)
    feature_unique = np.unique(feature)
    len_f = len(feature_unique)  # number of distinct feature values
    len_t = len(target_unique)   # number of distinct target classes

    # np.float64 replaces np.float, an alias that was removed in NumPy 1.24
    cdef double [:,:] grouped = np.zeros((len_f, len_t), dtype=np.float64)

    for r in range(len_f):  # loop over the distinct feature values
        tmp = target[feature == feature_unique[r]]  # targets of the samples with this feature value
        for c in range(len_t):
            grouped[r, c] = (tmp == target_unique[c]).sum()  # count of each target class for this feature value

    cdef double [:,:] couple
    cdef double [:] cols, rows, chi_list
    cdef double chi, chi_min, total, e
    cdef int l, retain_ix, ix
    cdef Py_ssize_t i, j, k, p

    while(True):
        # break loop when reach n_bins
        if n_bins and len(grouped) <= n_bins:  # the current number of bins is already <= n_bins
            break

        # break loop if min samples of groups is greater than threshold
        if min_samples and c_min(c_sum_axis_1(grouped)) > min_samples:
            break

        # Calc chi square for each group
        l = len(grouped) - 1  # number of adjacent pairs; len(grouped) is the current number of bins
        chi_list = np.zeros(l, dtype=np.float64)
        chi_min = np.inf

        for i in range(l):  # find the adjacent pairs with the smallest chi-square value
            chi = 0
            couple = grouped[i:i+2,:]    # take two adjacent rows
            total = c_sum(couple)        # sum of all counts in the two rows
            cols = c_sum_axis_0(couple)  # column sums: count of each target class
            rows = c_sum_axis_1(couple)  # row sums: count of samples in each bin

            for j in range(couple.shape[0]):
                for k in range(couple.shape[1]):
                    e = rows[j] * cols[k] / total  # expected count
                    if e != 0:
                        chi += (couple[j, k] - e) ** 2 / e

            # balance weight of chi
            if balance:
                chi *= total

            chi_list[i] = chi

            if chi == chi_min:
                chi_ix.append(i)
                continue

            if chi < chi_min:
                chi_min = chi
                chi_ix = [i]

        # break loop when the minimum chi is greater than the threshold
        if min_threshold and chi_min > min_threshold:
            break

        # get indexes of the groups which have the minimum chi
        min_ix = np.array(chi_ix)

        # get the indexes which need to be dropped
        drop_ix = min_ix + 1

        # combine groups by indexes
        retain_ix = min_ix[0]  # e.g. min_ix = [0, 1, 3, 4, 5]
        last_ix = retain_ix    # e.g. 0
        for ix in min_ix:  # the ix-th adjacent pair
            # set a new group
            if ix - last_ix > 1:  # a gap starts a new run; runs of three or more contiguous bins are merged together
                retain_ix = ix

            # combine all contiguous indexes into one group
            for p in range(grouped.shape[1]):
                grouped[retain_ix, p] = grouped[retain_ix, p] + grouped[ix + 1, p]

            last_ix = ix

        # drop binned groups
        grouped = np.delete(grouped, drop_ix, axis = 0)
        feature_unique = np.delete(feature_unique, drop_ix)

    return feature_unique[1:]
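The "combine groups by indexes" step in the source above handles ties: when several adjacent pairs share the minimum chi-square value, each run of contiguous pair indexes is folded into the first bin of that run. The following pure-Python trace mirrors that loop on lists instead of a NumPy array. The function name `combine_min_pairs` and the counts are hypothetical; only the index logic follows the source.

```python
def combine_min_pairs(grouped, min_ix):
    """Fold every run of contiguous minimum-chi pairs into the run's first bin.

    grouped: list of per-class count rows, one row per bin
    min_ix: sorted indexes i of adjacent pairs (i, i + 1) that tie for the minimum
    Returns the merged rows and the original indexes of the bins that remain.
    """
    rows = [list(r) for r in grouped]  # copy so the input is not modified
    retain_ix = min_ix[0]
    last_ix = retain_ix
    for ix in min_ix:
        if ix - last_ix > 1:
            retain_ix = ix  # a gap in the indexes starts a new run
        for p in range(len(rows[0])):
            rows[retain_ix][p] += rows[ix + 1][p]
        last_ix = ix
    drop = {ix + 1 for ix in min_ix}  # the right-hand bin of every merged pair
    kept = [i for i in range(len(rows)) if i not in drop]
    return [rows[i] for i in kept], kept


grouped = [[1, 1], [2, 2], [3, 3], [10, 0], [20, 0], [30, 0], [40, 0]]
merged, kept = combine_min_pairs(grouped, [0, 1, 3, 4, 5])
print(merged)  # [[6, 6], [100, 0]]
print(kept)    # [0, 3]
```

With min_ix = [0, 1, 3, 4, 5], bins 0 to 2 collapse into bin 0 and bins 3 to 6 collapse into bin 3, so five merges happen in a single pass of the while loop.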
Reference:
卡方分箱及代码实现 (Chi-square binning and its code implementation), hutao_ljj's blog, CSDN. https://blog.csdn.net/hutao_ljj/article/details/105448887