Loan Default Prediction Project: Data Binning

ReidChenJX

Article link: https://blog.csdn.net/ReidChenJX/article/details/109575305

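When a model uses the sum of squared errors as its loss function, outliers inflate the error sharply. Simply deleting the outlier samples is not a good fix either: less training data lowers model accuracy, and in this dataset in particular the outliers are very likely to be positive (default) samples. To keep the outliers while still improving the model, we can use data binning.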
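To make the effect concrete, here is a minimal sketch with made-up numbers (the values and the bin edges are assumptions chosen only for illustration). It compares the sum of squared deviations before and after each value is replaced by its bin index.

```python
import numpy as np

# Hypothetical feature values; the last one is an outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# The sum of squared deviations from the mean is dominated by the outlier.
raw_sse = float(((values - values.mean()) ** 2).sum())

# Replace each value by the index of the bin it falls into.
# The edges are picked by hand for this illustration.
edges = np.array([2.5, 4.5])
bin_ids = np.digitize(values, edges)
binned_sse = float(((bin_ids - bin_ids.mean()) ** 2).sum())

print(raw_sse, binned_sse)
```

The raw values give a sum of squared deviations of 796010, while the bin indices give about 2.8: the outlier is still present, but it can no longer dominate the error.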

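Feature Filtering

Before binning, we first drop features that carry little information. As a rough intuition borrowed from information theory, a signal with large variance carries more information; likewise, a feature whose values are widely spread carries more information than one that is almost constant. So before processing the data we can drop features dominated by a single value, on the assumption that they carry little information.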

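# Feature filtering: drop features whose most frequent value covers at least 95% of the rows
def drop_single_variable(data):
    drop_list = []
    for col in data.columns:
        # Share of the most frequent value (value_counts ignores NaN by default)
        percent = data[col].value_counts().max() / float(len(data))
        if percent >= 0.95:
            drop_list.append(col)
    data.drop(drop_list, axis=1, inplace=True)
    return drop_list

# Decide on the training set, then drop the same columns from the test set
drop_list = drop_single_variable(data_train)
data_test.drop(drop_list, axis=1, inplace=True)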

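Chi-Square Binning

The first method that comes to mind is chi-square binning, which uses the chi-square statistic to merge groups of values until the desired bins are reached.
The chi-square statistic is:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
where fo is the observed frequency and fe is the expected frequency. In the procedure below, each value of the feature contributes one such term. The chi-square value is computed as follows: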

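1. Compute the ratio of positive samples to all samples and use it as the expected ratio fp.
2. Sort the distinct values of feature A (the code below sorts them in ascending order). For a value a, the expected frequency fe is the number of samples with A = a multiplied by fp, and the observed frequency fo is the number of positive samples with A = a. Plug fo and fe into the formula above to get the chi-square value for A = a.
3. Repeat for every possible value of A to get a chi-square value for each value.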
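The per-value computation can be sketched on a toy table. The ten rows below are made up for illustration; only the column name isDefault is taken from the article's code.

```python
import pandas as pd

# Hypothetical toy data: feature A and a binary label (1 = positive).
df = pd.DataFrame({
    "A":         [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "isDefault": [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# Overall positive ratio, used as the expected ratio fp (4 positives out of 10).
fp = df["isDefault"].mean()

# Per value of A: sample count n, observed positives fo, expected positives fe.
stats = df.groupby("A")["isDefault"].agg(n="count", fo="sum")
stats["fe"] = stats["n"] * fp
stats["chi_square"] = (stats["fo"] - stats["fe"]) ** 2 / stats["fe"]
print(stats)
```

For A = 1 there are 3 observed positives against 1.6 expected, giving 1.225; for A = 2 there is 1 observed against 2.4 expected, giving about 0.817.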

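What the chi-square value means: if the value computed for A = a is close to 0, then whether A equals a and whether label Y is positive are independent events, which also means A = a contributes nothing to predicting the label.
At this point we have a chi-square value for every possible value of A, which amounts to an initial binning with one bin per distinct value. Next we consider how to merge the bins.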

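4. Pick the bin with the smallest chi-square value and merge it with a neighbor: of the previous and next bins, choose the one with the smaller chi-square value. After merging, recompute the chi-square value of the new bin.
5. Repeat step 4 until the stopping condition is met, for example the number of bins has dropped to the required count, or every bin's chi-square value reaches a given threshold.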
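A single merge step can be sketched in plain Python. The helper names chi_square and merge_once and the three example bins are hypothetical; each bin is represented only by its observed and expected positive counts, and ties between neighbors go to the previous bin.

```python
def chi_square(fo, fe):
    """Chi-square contribution of one bin."""
    return (fo - fe) ** 2 / fe

def merge_once(bins):
    """Merge the lowest chi-square bin into its lower-scoring neighbor.

    `bins` is a list of (fo, fe) pairs ordered by feature value and
    must contain at least two bins.
    """
    chis = [chi_square(fo, fe) for fo, fe in bins]
    i = chis.index(min(chis))
    if i == 0:                        # first bin: only a next neighbor exists
        j = 1
    elif i == len(bins) - 1:          # last bin: only a previous neighbor exists
        j = i - 1
    else:                             # middle bin: pick the smaller neighbor
        j = i - 1 if chis[i - 1] <= chis[i + 1] else i + 1
    lo, hi = min(i, j), max(i, j)
    merged = (bins[lo][0] + bins[hi][0], bins[lo][1] + bins[hi][1])
    return bins[:lo] + [merged] + bins[hi + 1:]

bins = [(3.0, 1.6), (1.0, 2.4), (2.0, 2.0)]
merged_bins = merge_once(bins)
print(merged_bins)
```

Here the last bin has a chi-square value of 0, so it is merged into the middle bin; the observed and expected counts are summed, and the chi-square value of the merged bin is recomputed from those sums on the next pass.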

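Once the ordered values have been merged into bins, the samples can be assigned to bins by interval using the pandas cut function.

The chi-square binning code is as follows: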
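As a small illustration of how pd.cut treats the bin edges (the edges and values below are made up), note that intervals are right-closed by default, so a sample equal to the lowest edge is left unassigned unless include_lowest=True is passed. Values outside the outermost edges are unassigned either way.

```python
import pandas as pd

boundary = [0, 10, 50, 100]              # hypothetical bin edges
x = pd.Series([0, 5, 10, 30, 100, 250])  # hypothetical feature values

# Default: intervals are (0, 10], (10, 50], (50, 100]; 0 and 250 get NaN.
default_bins = pd.cut(x, bins=boundary, labels=False)

# include_lowest=True also closes the first interval on the left, so 0 lands in bin 0.
closed_bins = pd.cut(x, bins=boundary, labels=False, include_lowest=True)

print(pd.DataFrame({"x": x, "default": default_bins, "include_lowest": closed_bins}))
```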

import numpy as np
import pandas as pd
from scipy.stats import chi2

# Compute the chi-square value for each distinct value of a feature
def get_chi2(data, col):
    # Expected positive ratio over the whole sample
    pos_cnt = data['isDefault'].sum()
    all_cnt = data['isDefault'].count()
    expected_ratio = float(pos_cnt/all_cnt)
    
    # Distinct feature values in ascending order
    df = data[[col,'isDefault']]
    col_value = list(set(df[col]))    # set() removes duplicates
    col_value.sort()
    
    # Chi-square statistic for each value
    chi_list = []
    pos_list = []
    expected_pos_list = []
    
    for value in col_value:
        df_pos_cnt = df.loc[df[col]==value, 'isDefault'].sum()    # observed frequency
        df_all_cnt = df.loc[df[col]==value, 'isDefault'].count()
        
        expected_pos_cnt = df_all_cnt * expected_ratio    # expected frequency
        chi_square = (df_pos_cnt - expected_pos_cnt)**2 / expected_pos_cnt
        
        chi_list.append(chi_square)
        pos_list.append(df_pos_cnt)
        expected_pos_list.append(expected_pos_cnt)
    
    # Collect the results in a DataFrame
    chi_result = pd.DataFrame({col:col_value,'chi_square':chi_list,
                              'pos_cnt':pos_list,'expected_pos_cnt':expected_pos_list})
    return chi_result

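# Chi-square threshold for the given degrees of freedom and significance level
def cal_chisqure_threshold(dfree=4, cf=0.1):
    
    percents = [0.95, 0.90, 0.5, 0.1, 0.05, 0.025, 0.01, 0.005]
    
    # Threshold table: one row per degree of freedom (1 to 29), one column per significance level
    df = pd.DataFrame(np.array([chi2.isf(percents, df=i) for i in range(1, 30)]))
    df.columns = percents
    df.index = df.index+1
    
    # Use the full option name; the short key 'precision' can be ambiguous in newer pandas versions
    pd.set_option('display.precision', 3)
    return df.loc[dfree, cf]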

# Given a dataset and a feature name, use the maximum bin count and the chi-square threshold
# to produce the chi-square table and the bin boundaries
def chiMerge_chisqure(data, col, dfree=4, cf=0.1, maxInterval=7):

    chi_result = get_chi2(data, col)
    threshold = cal_chisqure_threshold(dfree, cf)
    min_chiSquare = chi_result['chi_square'].min()
    group_cnt = len(chi_result)
    
    # Keep merging while the smallest chi-square value is below the threshold or there are
    # more than maxInterval groups; stop at two rows so that at least one interval remains
    while group_cnt > 2 and (min_chiSquare < threshold or group_cnt > maxInterval):
        min_index = chi_result[chi_result['chi_square']==chi_result['chi_square'].min()].index.tolist()[0]
        
        # First group: merge with the next one
        if min_index == 0:
            chi_result = merge_chiSquare(chi_result, min_index+1, min_index)    # this argument order keeps the minimum value first, so it stays a boundary
        
        # Last group: merge with the previous one
        elif min_index == group_cnt-1:
            chi_result = merge_chiSquare(chi_result, min_index-1, min_index)    # this argument order keeps the maximum value last, so it stays a boundary
        
        # Middle group: merge with whichever neighbor has the smaller chi-square value
        else:
            if chi_result.loc[min_index-1, 'chi_square'] > chi_result.loc[min_index+1, 'chi_square']:
                chi_result = merge_chiSquare(chi_result, min_index, min_index+1)
            else:
                chi_result = merge_chiSquare(chi_result, min_index-1, min_index)
        
        min_chiSquare = chi_result['chi_square'].min()
        
        group_cnt = len(chi_result)

    # The remaining feature values (first column) are the bin boundaries
    boundary = list(chi_result.iloc[:,0])
    
    return chi_result, boundary


# Merge row `index` into row `mergeIndex` and recompute the chi-square value of the merged row;
# `mergeIndex` is the row that is kept
def merge_chiSquare(chi_result, index, mergeIndex, a = 'expected_pos_cnt',b = 'pos_cnt', c = 'chi_square'):

    chi_result.loc[mergeIndex, a] = chi_result.loc[mergeIndex, a] + chi_result.loc[index, a]
    chi_result.loc[mergeIndex, b] = chi_result.loc[mergeIndex, b] + chi_result.loc[index, b]
    # Chi-square value of the merged interval, from the summed observed and expected counts
    chi_result.loc[mergeIndex, c] = (chi_result.loc[mergeIndex, b] - chi_result.loc[mergeIndex, a])**2 /chi_result.loc[mergeIndex, a]
    
    chi_result = chi_result.drop([index])
    # Reset the index so that rows are numbered consecutively again
    chi_result = chi_result.reset_index(drop=True)
    
    return chi_result

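for fea in continuous_fea:
    chi_result, boundary = chiMerge_chisqure(data_train, fea)
    # include_lowest=True keeps samples equal to the smallest boundary in the first bin
    # (pd.cut intervals are right-closed by default, which would leave them as NaN)
    data_train[fea+'kf_bins'] = pd.cut(data_train[fea], bins= boundary, labels=False, include_lowest=True)
    data_test[fea+'kf_bins'] = pd.cut(data_test[fea], bins= boundary, labels=False, include_lowest=True)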

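Advantages of chi-square binning:

1. It reduces the influence of outliers and anomalous points on the model. Collapsing tens of thousands of distinct values into a single-digit number of bins effectively limits the impact of abnormal data. Missing values can also be given a bin of their own during binning, which replaces a separate missing-value imputation step.
2. It helps prevent overfitting. The task is binary classification, and coarsening the data in this way is an effective guard against overfitting.
3. For models trained with gradient descent, it speeds up convergence. Using the binned data in place of the raw data has an effect similar to standardization and keeps differences in feature scale from interfering with the model.

Disadvantage of chi-square binning: it is slow, because the process involves a large amount of repeated computation. It can be sped up by setting an initial number of bins, for example by first grouping the values into coarse initial bins on the order of 100 before merging.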
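One possible way to do such pre-binning is sketched below. It is an assumption, not the article's code: the feature is first cut into at most 100 equal-frequency groups with pd.qcut, and each sample is then represented by the upper edge of its group, so that the merging procedure starts from about 100 distinct values instead of thousands.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=10_000))   # hypothetical continuous feature

# Coarsen into at most 100 equal-frequency groups before chi-square merging.
codes, edges = pd.qcut(x, q=100, labels=False, retbins=True, duplicates="drop")

# Represent every sample by the upper edge of its initial group.
x_coarse = pd.Series(edges[1:][codes.to_numpy()], index=x.index)

print(x.nunique(), x_coarse.nunique())
```

The coarsened column could then be passed to a per-value chi-square routine in place of the raw feature, at the cost of restricting the final boundaries to the initial group edges.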

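Decision Tree Binning

In practice, binning means choosing suitable split points and cutting the dataset at them.
Among traditional machine learning algorithms, the decision tree happens to be the most intuitive model for splitting data.
We build a decision tree that uses information entropy as its splitting criterion; the number of leaf nodes is the number of bins we want. After training the tree, we read off the split points chosen during tree construction to get the bin intervals, as follows: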

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Use a decision tree to obtain the list of bin boundaries
def optimal_binning_boundary(x: pd.Series, y: pd.Series, nan: float = -999.) -> list:
    
    boundary = []  # list of bin boundaries to return
    
    x = x.fillna(nan).values  # fill missing values
    y = y.values
    
    clf = DecisionTreeClassifier(criterion='entropy',    # split by minimizing information entropy
                                 max_leaf_nodes=6,       # maximum number of leaf nodes
                                 min_samples_leaf=0.05)  # minimum fraction of samples in a leaf

    clf.fit(x.reshape(-1, 1), y)  # train the decision tree
    
    n_nodes = clf.tree_.node_count
    children_left = clf.tree_.children_left
    children_right = clf.tree_.children_right
    threshold = clf.tree_.threshold
    
    for i in range(n_nodes):
        if children_left[i] != children_right[i]:  # internal node: record its split threshold
            boundary.append(threshold[i])

    boundary.sort()

    min_x = x.min() - 0.1  
    max_x = x.max() + 0.1  # the -0.1 / +0.1 margins make sure samples at the feature's minimum and maximum fall inside the outer bins
    boundary = [min_x] + boundary + [max_x]
    return boundary

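for fea in continuous_fea:
    boundary = optimal_binning_boundary(x=data_train[fea],y=data_train['isDefault'])
    # Fill missing values with the same placeholder used inside optimal_binning_boundary (-999.)
    # so that they fall into a bin instead of staying NaN
    data_train[fea+'_tr_bins'] = pd.cut(data_train[fea].fillna(-999.), bins= boundary, labels=False)
    data_test[fea+'_tr_bins'] = pd.cut(data_test[fea].fillna(-999.), bins= boundary, labels=False)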

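In this project, decision tree binning was noticeably faster than chi-square binning, and it produced WOE values no worse than those from chi-square binning. In practice, decision tree binning is the recommended choice.

WOE and IV

After binning, the data can be put to further use by converting the bins into WOE and IV values.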
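One design variation worth considering, shown here only as a sketch on synthetic data, is to use open-ended outer edges so that test values outside the training range still land in a bin. The data-generating rule, max_leaf_nodes=4 and random_state=0 are assumptions for the illustration; the thresholds are read from the fitted tree's tree_ attribute, where leaf nodes carry a feature index of -2.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 100, size=2000)
# Hypothetical label: the positive class becomes more likely as x grows.
y_train = (rng.uniform(size=2000) < x_train / 200).astype(int)

clf = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=4,
                             min_samples_leaf=0.05, random_state=0)
clf.fit(x_train.reshape(-1, 1), y_train)

# Internal nodes have a feature index >= 0; leaves are marked with -2.
inner = clf.tree_.feature >= 0
cuts = np.sort(clf.tree_.threshold[inner])

# Open-ended outer edges instead of min - 0.1 and max + 0.1.
edges = np.concatenate(([-np.inf], cuts, [np.inf]))

x_test = pd.Series([-5.0, 50.0, 500.0])   # two values lie outside the training range
test_bins = pd.cut(x_test, bins=edges, labels=False)
print(edges, test_bins.tolist())
```

With finite outer edges the first and last test values would come back as NaN; with infinite edges they fall into the first and last bins.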
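Under the convention used in this article's code, the WOE of bin i is the log of the ratio between the share of all positive samples that fall into the bin and the share of all negative samples that fall into it, and the IV of a feature is the sum over bins of the difference between those two shares multiplied by the WOE. The toy table below is made up to show the arithmetic; it omits the small smoothing constant used in the article's function.

```python
import numpy as np
import pandas as pd

# Hypothetical binned feature and binary target.
df = pd.DataFrame({
    "bin":       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "isDefault": [1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
})

counts = pd.crosstab(df["bin"], df["isDefault"])   # rows: bins, columns: classes 0 and 1
shares = counts / counts.sum()                      # share of each class falling into each bin

woe = np.log(shares[1] / shares[0])                 # WOE per bin
iv = float(((shares[1] - shares[0]) * woe).sum())   # IV of the feature

print(woe)
print(iv)
```

Bin 0 holds 1 of the 4 positives and 3 of the 6 negatives, so its WOE is ln(0.25 / 0.5), which is negative; bin 1 holds 3 of 4 positives and 3 of 6 negatives, so its WOE is ln(0.75 / 0.5), which is positive. The IV works out to 0.25 × ln 3, about 0.275.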

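# Compute the WOE and IV values of a feature
def call_WOE_IV(data, var, target):
    eps = 0.0001    # small constant to avoid division by zero and log(0)
    gbi = pd.crosstab(data[var], data[target]) + eps    # per-bin count of each class
    gb = data[target].value_counts() + eps              # total count of each class
    gbri = gbi / gb                                     # share of each class falling into each bin

    gbri['WOE'] = np.log(gbri[1] / gbri[0])
    gbri['IV'] = (gbri[1] - gbri[0]) * gbri['WOE']
    
    # The class columns are the integers 0 and 1; rename them so they do not clash with the count columns
    gbri = gbri.rename(columns={0:'0_i', 1:'1_i'})
    congb = pd.concat([gbi,gbri],axis=1)
    return congb
# Compute the WOE value of each bin and create new features from it
for col in data_train.columns:
    if 'tr_bins' in col:
        WOE_table = call_WOE_IV(data_train,col,'isDefault')['WOE'].to_dict()
        data_train[col+'_woe'] = data_train[col].map(WOE_table)
        data_test[col+'_woe'] = data_test[col].map(WOE_table)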