Follow the WeChat official account: 小程在线
Follow the CSDN blog: 程志伟的博客
Based on this idea, the steps for binning a single feature are:
1) First cut the continuous variable into a fairly large number of categorical groups, e.g. split tens of thousands of samples into 100 or 50 groups.
2) Make sure every group contains samples of both classes, otherwise the IV value cannot be computed.
3) Run chi-square tests on adjacent groups and merge the pairs with large p-values, until the number of groups drops to the preset number of bins N.
4) Let the feature be split into [2, 3, 4, ..., 20] bins in turn, watch how the IV value changes with the number of bins, and pick the most suitable number.
5) After binning, compute each bin's WOE value and inspect the binning result.
Once these steps are done, we can bin every feature the same way and then compare the features' IV values to select features.
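For reference, writing good_i and bad_i for the per-bin counts, the WOE of bin i and the feature's IV used throughout this section are:

```latex
\mathrm{WOE}_i=\ln\frac{good_i/\sum_j good_j}{\,bad_i/\sum_j bad_j\,},\qquad
\mathrm{IV}=\sum_i\left(\frac{good_i}{\sum_j good_j}-\frac{bad_i}{\sum_j bad_j}\right)\mathrm{WOE}_i
```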
#equal-frequency binning of the column to be binned
model_data["qcut"], updown = pd.qcut(model_data["age"], retbins=True, q=20)
model_data["qcut"]
Out[46]:
0 (52.0, 54.0]
1 (61.0, 64.0]
2 (39.0, 41.0]
3 (68.0, 74.0]
4 (52.0, 54.0]
             ...      
195003 (31.018, 34.0]
195004 (48.498, 50.006]
195005 (45.0, 46.982]
195006 (61.0, 64.0]
195007 (52.0, 54.0]
Name: qcut, Length: 195008, dtype: category
Categories (20, interval[float64]): [(20.999, 28.0] < (28.0, 31.018] < (31.018, 34.0] < (34.0, 36.646] ... (61.0, 64.0] < (64.0, 68.0] < (68.0, 74.0] < (74.0, 107.0]]
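In miniature (a hypothetical 8-value series), pd.qcut with retbins=True returns both the equal-frequency bins and the q+1 bin edges:

```python
import pandas as pd

# Hypothetical 8-value series
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
# q=4 equal-frequency bins; retbins=True also returns the 5 bin edges
binned, edges = pd.qcut(s, q=4, retbins=True)
# every bin receives 8/4 = 2 samples
```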
#number of samples in each bin
model_data["qcut"].value_counts()
Out[47]:
(36.646, 39.0] 11019
(61.0, 64.0] 10740
(31.018, 34.0] 10704
(50.006, 52.0] 10680
(58.689, 61.0] 10572
(20.999, 28.0] 10313
(43.0, 45.0] 10116
(52.0, 54.0] 10115
(41.0, 43.0] 9778
(46.982, 48.498] 9750
(48.498, 50.006] 9750
(39.0, 41.0] 9476
(74.0, 107.0] 9296
(64.0, 68.0] 9285
(28.0, 31.018] 9188
(54.0, 56.0] 9119
(56.0, 58.689] 9088
(68.0, 74.0] 8859
(34.0, 36.646] 8797
(45.0, 46.982] 8363
Name: qcut, dtype: int64
#the lower and upper edges of all the bins
updown
Out[48]:
array([ 21. , 28. , 31.01848249, 34. ,
36.64611309, 39. , 41. , 43. ,
45. , 46.98189964, 48.4983757 , 50.00572276,
52. , 54. , 56. , 58.68881023,
61. , 64. , 68. , 74. ,
107. ])
# Count the number of 0s and 1s in each bin
# using groupby, which works like a pivot table here
coount_y0 = model_data[model_data["SeriousDlqin2yrs"] == 0].groupby(by="qcut").count()["SeriousDlqin2yrs"]
coount_y1 = model_data[model_data["SeriousDlqin2yrs"] == 1].groupby(by="qcut").count()["SeriousDlqin2yrs"]
#each element of num_bins holds a bin's lower edge, upper edge, count of 0s, and count of 1s
num_bins = [*zip(updown,updown[1:],coount_y0,coount_y1)]
#note that zip stops at the shortest sequence
num_bins
Out[50]:
[(21.0, 28.0, 4243, 6070),
(28.0, 31.018482492669673, 3571, 5617),
(31.018482492669673, 34.0, 4075, 6629),
(34.0, 36.646113086053596, 2908, 5889),
(36.646113086053596, 39.0, 5182, 5837),
(39.0, 41.0, 3956, 5520),
(41.0, 43.0, 4002, 5776),
(43.0, 45.0, 4389, 5727),
(45.0, 46.9818996394487, 2419, 5944),
(46.9818996394487, 48.498375695997765, 4813, 4937),
(48.498375695997765, 50.00572276244543, 4900, 4850),
(50.00572276244543, 52.0, 4728, 5952),
(52.0, 54.0, 4681, 5434),
(54.0, 56.0, 4677, 4442),
(56.0, 58.6888102319735, 4483, 4605),
(58.6888102319735, 61.0, 6583, 3989),
(61.0, 64.0, 6968, 3772),
(64.0, 68.0, 6623, 2662),
(68.0, 74.0, 6753, 2106),
(74.0, 107.0, 7737, 1559)]
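The zip construction can be seen on a toy example (hypothetical edges and counts): because zip stops at the shortest input, the n+1 edges pair off into n (lower, upper) intervals:

```python
# Hypothetical bin edges and per-bin class counts
edges = [21, 28, 34, 107]   # n+1 edges
count0 = [40, 30, 20]       # number of good samples per bin
count1 = [10, 20, 30]       # number of bad samples per bin
# zip truncates to the shortest input, so edges pairs off into (lower, upper)
num_bins = [*zip(edges, edges[1:], count0, count1)]
# num_bins == [(21, 28, 40, 10), (28, 34, 30, 20), (34, 107, 20, 30)]
```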
3.3 Define the WOE and IV functions
#Compute WOE and BAD RATE
#BAD RATE is not the same thing as bad%
#BAD RATE is the proportion of bad samples within a bin (bad/total)
#bad% is the share of the whole feature's bad samples that fall in the bin
columns = ["min","max","count_0","count_1"]
df = pd.DataFrame(num_bins,columns=columns)
df
Out[53]:
min max count_0 count_1
0 21.000000 28.000000 4243 6070
1 28.000000 31.018482 3571 5617
2 31.018482 34.000000 4075 6629
3 34.000000 36.646113 2908 5889
4 36.646113 39.000000 5182 5837
5 39.000000 41.000000 3956 5520
6 41.000000 43.000000 4002 5776
7 43.000000 45.000000 4389 5727
8 45.000000 46.981900 2419 5944
9 46.981900 48.498376 4813 4937
10 48.498376 50.005723 4900 4850
11 50.005723 52.000000 4728 5952
12 52.000000 54.000000 4681 5434
13 54.000000 56.000000 4677 4442
14 56.000000 58.688810 4483 4605
15 58.688810 61.000000 6583 3989
16 61.000000 64.000000 6968 3772
17 64.000000 68.000000 6623 2662
18 68.000000 74.000000 6753 2106
19 74.000000 107.000000 7737 1559
df["total"] = df.count_0 + df.count_1
#share of all samples that fall in this bin
df["percentage"] = df.total / df.total.sum()
#BAD RATE: the proportion of bad samples within the bin
df["bad_rate"] = df.count_1 / df.total
df["good%"] = df.count_0/df.count_0.sum()
#bad%: the share of the whole feature's bad samples that fall in this bin
df["bad%"] = df.count_1/df.count_1.sum()
df["woe"] = np.log(df["good%"] / df["bad%"])
df
Out[54]:
min max count_0 ... good% bad% woe
0 21.000000 28.000000 4243 ... 0.043433 0.062373 -0.361924
1 28.000000 31.018482 3571 ... 0.036554 0.057719 -0.456788
2 31.018482 34.000000 4075 ... 0.041713 0.068118 -0.490419
3 34.000000 36.646113 2908 ... 0.029767 0.060514 -0.709456
4 36.646113 39.000000 5182 ... 0.053045 0.059979 -0.122862
5 39.000000 41.000000 3956 ... 0.040495 0.056722 -0.336980
6 41.000000 43.000000 4002 ... 0.040966 0.059352 -0.370753
7 43.000000 45.000000 4389 ... 0.044927 0.058849 -0.269926
8 45.000000 46.981900 2419 ... 0.024762 0.061079 -0.902864
9 46.981900 48.498376 4813 ... 0.049268 0.050731 -0.029273
10 48.498376 50.005723 4900 ... 0.050158 0.049837 0.006421
11 50.005723 52.000000 4728 ... 0.048397 0.061161 -0.234061
12 52.000000 54.000000 4681 ... 0.047916 0.055838 -0.152999
13 54.000000 56.000000 4677 ... 0.047875 0.045645 0.047716
14 56.000000 58.688810 4483 ... 0.045890 0.047320 -0.030686
15 58.688810 61.000000 6583 ... 0.067386 0.040990 0.497114
16 61.000000 64.000000 6968 ... 0.071327 0.038760 0.609887
17 64.000000 68.000000 6623 ... 0.067795 0.027354 0.907635
18 68.000000 74.000000 6753 ... 0.069126 0.021641 1.161361
19 74.000000 107.000000 7737 ... 0.079199 0.016020 1.598134
[20 rows x 10 columns]
def get_woe(num_bins):
    # compute WOE from the num_bins data
    columns = ["min","max","count_0","count_1"]
    df = pd.DataFrame(num_bins,columns=columns)
    df["total"] = df.count_0 + df.count_1
    df["percentage"] = df.total / df.total.sum()
    df["bad_rate"] = df.count_1 / df.total
    df["good%"] = df.count_0/df.count_0.sum()
    df["bad%"] = df.count_1/df.count_1.sum()
    df["woe"] = np.log(df["good%"] / df["bad%"])
    return df
#compute the IV value
def get_iv(df):
    rate = df["good%"] - df["bad%"]
    iv = np.sum(rate * df.woe)
    return iv
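A self-contained sanity check of the WOE/IV logic (condensed versions of the two helpers, with hypothetical counts for a perfectly symmetric two-bin split):

```python
import numpy as np
import pandas as pd

def get_woe(num_bins):
    # condensed version of the helper above: per-bin good%, bad%, WOE
    df = pd.DataFrame(num_bins, columns=["min", "max", "count_0", "count_1"])
    df["good%"] = df.count_0 / df.count_0.sum()
    df["bad%"] = df.count_1 / df.count_1.sum()
    df["woe"] = np.log(df["good%"] / df["bad%"])
    return df

def get_iv(df):
    return np.sum((df["good%"] - df["bad%"]) * df.woe)

# hypothetical counts: (min, max, count_0, count_1)
toy = [(0, 10, 40, 10), (10, 20, 10, 40)]
iv = get_iv(get_woe(toy))   # 2 * 0.6 * ln(4) ≈ 1.66
```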
3.4 Chi-square test, merging bins, and plotting the IV curve
num_bins_ = num_bins.copy()
import matplotlib.pyplot as plt
import scipy.stats
x1 = num_bins_[0][2:]
x1
Out[57]: (4243, 6070)
x2 = num_bins_[1][2:]
x2
Out[58]: (3571, 5617)
#chi-square test
scipy.stats.chi2_contingency([x1,x2])
Out[59]:
(10.3901943590052,
0.0012668628767570208,
1,
array([[4132.39228757, 6180.60771243],
[3681.60771243, 5506.39228757]]))
scipy.stats.chi2_contingency([x1,x2])[1]
Out[60]: 0.0012668628767570208
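What the p-value means for merging can be seen on two hypothetical pairs of bins: bins with very different good/bad mixes give a small p-value (keep them separate), while identical mixes give p = 1 (that pair would be merged first):

```python
import scipy.stats

# Two hypothetical adjacent bins with very different good/bad mixes:
# a small p-value means they are distinguishable and should stay separate
x1, x2 = (40, 10), (10, 40)
p_diff = scipy.stats.chi2_contingency([x1, x2])[1]

# Identical mixes: the largest possible p-value, a prime merge candidate
p_same = scipy.stats.chi2_contingency([(25, 25), (25, 25)])[1]
```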
IV = []
axisx = []
while len(num_bins_) > 2:
    pvs = []
    # chi-square test p-values (or chi2 statistics) for each pair of adjacent bins
    for i in range(len(num_bins_)-1):
        x1 = num_bins_[i][2:]
        x2 = num_bins_[i+1][2:]
        # index 0 returns the chi2 statistic, index 1 the p-value
        pv = scipy.stats.chi2_contingency([x1,x2])[1]
        # chi2 = scipy.stats.chi2_contingency([x1,x2])[0]
        pvs.append(pv)
    # merge the pair of bins with the largest p-value
    i = pvs.index(max(pvs))
    num_bins_[i:i+2] = [(
        num_bins_[i][0],
        num_bins_[i+1][1],
        num_bins_[i][2]+num_bins_[i+1][2],
        num_bins_[i][3]+num_bins_[i+1][3])]
    bins_df = get_woe(num_bins_)
    axisx.append(len(num_bins_))
    IV.append(get_iv(bins_df))
plt.figure()
plt.plot(axisx,IV)
plt.xticks(axisx)
plt.xlabel("number of bins")
plt.ylabel("IV")
plt.show()

3.5 Bin with the best number of bins and validate the result
def get_bin(num_bins_,n):
    while len(num_bins_) > n:
        pvs = []
        for i in range(len(num_bins_)-1):
            x1 = num_bins_[i][2:]
            x2 = num_bins_[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1,x2])[1]
            # chi2 = scipy.stats.chi2_contingency([x1,x2])[0]
            pvs.append(pv)
        i = pvs.index(max(pvs))
        num_bins_[i:i+2] = [(
            num_bins_[i][0],
            num_bins_[i+1][1],
            num_bins_[i][2]+num_bins_[i+1][2],
            num_bins_[i][3]+num_bins_[i+1][3])]
    return num_bins_
afterbins = get_bin(num_bins,6)
afterbins
3.6 Wrap the search for the best number of bins into a function
def graphforbestbin(DF, X, Y, n=5,q=20,graph=True):
    """
    Automatic optimal binning function, based on chi-square merging.
    Parameters:
    DF: input data
    X: name of the column to bin
    Y: name of the label column for the binned data
    n: number of bins to keep
    q: initial number of bins
    graph: whether to plot the IV curve
    Intervals are open on the left and closed on the right: (]
    """
    DF = DF[[X,Y]].copy()
    DF["qcut"],bins = pd.qcut(DF[X], retbins=True, q=q,duplicates="drop")
    coount_y0 = DF.loc[DF[Y]==0].groupby(by="qcut").count()[Y]
    coount_y1 = DF.loc[DF[Y]==1].groupby(by="qcut").count()[Y]
    num_bins = [*zip(bins,bins[1:],coount_y0,coount_y1)]

    # make sure every bin contains both classes: merge any bin that
    # lacks a class into its neighbour
    for i in range(q):
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2]+num_bins[1][2],
                num_bins[0][3]+num_bins[1][3])]
            continue
        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i-1:i+1] = [(
                    num_bins[i-1][0],
                    num_bins[i][1],
                    num_bins[i-1][2]+num_bins[i][2],
                    num_bins[i-1][3]+num_bins[i][3])]
                break
        else:
            break

    def get_woe(num_bins):
        columns = ["min","max","count_0","count_1"]
        df = pd.DataFrame(num_bins,columns=columns)
        df["total"] = df.count_0 + df.count_1
        df["percentage"] = df.total / df.total.sum()
        df["bad_rate"] = df.count_1 / df.total
        df["good%"] = df.count_0/df.count_0.sum()
        df["bad%"] = df.count_1/df.count_1.sum()
        df["woe"] = np.log(df["good%"] / df["bad%"])
        return df

    def get_iv(df):
        rate = df["good%"] - df["bad%"]
        iv = np.sum(rate * df.woe)
        return iv

    IV = []
    axisx = []
    while len(num_bins) > n:
        pvs = []
        for i in range(len(num_bins)-1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)
        i = pvs.index(max(pvs))
        num_bins[i:i+2] = [(
            num_bins[i][0],
            num_bins[i+1][1],
            num_bins[i][2]+num_bins[i+1][2],
            num_bins[i][3]+num_bins[i+1][3])]
        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        IV.append(get_iv(bins_df))
    if graph:
        plt.figure()
        plt.plot(axisx,IV)
        plt.xticks(axisx)
        plt.show()
    return bins_df
3.7 Run binning selection on every feature
model_data.columns
Out[66]:
Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents', 'qcut'],
dtype='object')
for i in model_data.columns[1:-1]:
    print(i)
    graphforbestbin(model_data,i,"SeriousDlqin2yrs",n=2,q=20)
RevolvingUtilizationOfUnsecuredLines
age
...
(the loop prints each feature's name and shows its IV curve; the plots are not reproduced here)
Not every feature works with this automatic binning function.
auto_col_bins = {"RevolvingUtilizationOfUnsecuredLines":5,
"age":5,
"DebtRatio":4,
"MonthlyIncome":3,
"NumberOfOpenCreditLinesAndLoans":5}
#variables that cannot be binned automatically
hand_bins = {"NumberOfTime30-59DaysPastDueNotWorse":[0,1,2,13]
,"NumberOfTimes90DaysLate":[0,1,2,17]
,"NumberRealEstateLoansOrLines":[0,1,2,4,54]
,"NumberOfTime60-89DaysPastDueNotWorse":[0,1,2,8]
,"NumberOfDependents":[0,1,2,3]}
#to guarantee coverage, replace the largest edge with np.inf and the smallest with -np.inf
hand_bins = {k:[-np.inf,*v[:-1],np.inf] for k,v in hand_bins.items()}
hand_bins
Out[74]:
{'NumberOfTime30-59DaysPastDueNotWorse': [-inf, 0, 1, 2, inf],
'NumberOfTimes90DaysLate': [-inf, 0, 1, 2, inf],
'NumberRealEstateLoansOrLines': [-inf, 0, 1, 2, 4, inf],
'NumberOfTime60-89DaysPastDueNotWorse': [-inf, 0, 1, 2, inf],
'NumberOfDependents': [-inf, 0, 1, 2, inf]}
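The effect of the ±inf outer edges: pd.cut then assigns every possible value, however extreme, to some interval (toy values):

```python
import numpy as np
import pandas as pd

# with -inf/inf as the outer edges, every value lands in some interval
edges = [-np.inf, 0, 1, 2, np.inf]
cats = pd.cut(pd.Series([0, 1, 5, 100]), edges)
# 0 -> (-inf, 0], 1 -> (0, 1], 5 and 100 both -> (2, inf]
```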
Next, bin every feature using the chosen numbers of bins and the hand-written bin edges.
bins_of_col = {}
# generate the bin edges (and IV values) from automatic binning
for col in auto_col_bins:
    bins_df = graphforbestbin(model_data,col
,"SeriousDlqin2yrs"
,n=auto_col_bins[col]
#use the dict to look up each feature's chosen number of bins
,q=12
,graph=False)
    bins_list = sorted(set(bins_df["min"]).union(bins_df["max"]))
    #guarantee coverage: replace the largest edge with np.inf and the smallest with -np.inf
    bins_list[0],bins_list[-1] = -np.inf,np.inf
    bins_of_col[col] = bins_list
bins_of_col.update(hand_bins)
bins_of_col
Out[76]:
{'RevolvingUtilizationOfUnsecuredLines': [-inf,
0.09916399825000001,
0.3241848415833333,
0.6110537703073564,
0.9999998999999999,
inf],
'age': [-inf,
45.68896056223215,
54.96325362710011,
63.0,
69.40827351386359,
inf],
'DebtRatio': [-inf, 0.055944251601378225, 0.33361565816666666, 32.0, inf],
'MonthlyIncome': [-inf, 0.22, 5833.0, inf],
'NumberOfOpenCreditLinesAndLoans': [-inf,
3.5863179118140547,
9.66596928081409,
11.0,
12.950941095975505,
inf],
'NumberOfTime30-59DaysPastDueNotWorse': [-inf, 0, 1, 2, inf],
'NumberOfTimes90DaysLate': [-inf, 0, 1, 2, inf],
'NumberRealEstateLoansOrLines': [-inf, 0, 1, 2, 4, inf],
'NumberOfTime60-89DaysPastDueNotWorse': [-inf, 0, 1, 2, inf],
'NumberOfDependents': [-inf, 0, 1, 2, inf]}
3.8 Compute each bin's WOE and map it onto the data
def get_woe(df,col,y,bins):
    df = df[[col,y]].copy()
    df["cut"] = pd.cut(df[col],bins)
    bins_df = df.groupby("cut")[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
    return woe
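The groupby/value_counts/unstack idiom in this version of get_woe can be checked on a toy frame (hypothetical x and y): unstack turns the label values 0/1 into columns, after which the WOE formula is the same as before:

```python
import numpy as np
import pandas as pd

# hypothetical feature and binary label
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "y": [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]})
df["cut"] = pd.cut(df["x"], [-np.inf, 5, np.inf])
# rows = bins, columns = label values 0 and 1
tab = df.groupby("cut")["y"].value_counts().unstack()
woe = np.log((tab[0] / tab[0].sum()) / (tab[1] / tab[1].sum()))
# first bin holds 4 goods / 1 bad, second 1 good / 4 bads
```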
#store every feature's WOE in a dict
woeall = {}
for col in bins_of_col:
    woeall[col] = get_woe(model_data,col,"SeriousDlqin2yrs",bins_of_col[col])
woeall
Out[77]:
{'RevolvingUtilizationOfUnsecuredLines': cut
(-inf, 0.0992] 2.206724
(0.0992, 0.324] 0.602023
(0.324, 0.611] -0.389316
(0.611, 1.0] -1.027200
(1.0, inf] -2.044288
dtype: float64, 'age': cut
(-inf, 45.689] -0.418293
(45.689, 54.963] -0.237499
(54.963, 63.0] 0.394675
(63.0, 69.408] 0.858461
(69.408, inf] 1.444452
dtype: float64, 'DebtRatio': cut
(-inf, 0.0559] 0.901068
(0.0559, 0.334] 0.055626
(0.334, 32.0] -0.292042
(32.0, inf] 0.193145
dtype: float64, 'MonthlyIncome': cut
(-inf, 0.22] 0.844512
(0.22, 5833.0] -0.247486
(5833.0, inf] 0.234283
dtype: float64, 'NumberOfOpenCreditLinesAndLoans': cut
(-inf, 3.586] -0.359898
(3.586, 9.666] 0.040580
(9.666, 11.0] 0.372690
(11.0, 12.951] -0.561171
(12.951, inf] 0.216359
dtype: float64, 'NumberOfTime30-59DaysPastDueNotWorse': cut
(-inf, 0.0] 1.095369
(0.0, 1.0] -1.380808
(1.0, 2.0] -2.050939
(2.0, inf] -2.306884
dtype: float64, 'NumberOfTimes90DaysLate': cut
(-inf, 0.0] 0.690287
(0.0, 1.0] -2.570842
(1.0, 2.0] -2.962132
(2.0, inf] -3.147651
dtype: float64, 'NumberRealEstateLoansOrLines': cut
(-inf, 0.0] 0.213342
(0.0, 1.0] -0.007980
(1.0, 2.0] -0.116866
(2.0, 4.0] -0.413832
(4.0, inf] -0.827201
dtype: float64, 'NumberOfTime60-89DaysPastDueNotWorse': cut
(-inf, 0.0] 0.539335
(0.0, 1.0] -2.457792
(1.0, 2.0] -2.932266
(2.0, inf] -2.880735
dtype: float64, 'NumberOfDependents': cut
(-inf, 0.0] 0.629351
(0.0, 1.0] -0.584413
(1.0, 2.0] -0.529292
(2.0, inf] -0.474088
dtype: float64}
Map all the WOE values onto the original data:
#to avoid overwriting the original data, create a new DataFrame with exactly the same index as model_data
model_woe = pd.DataFrame(index=model_data.index)
#the mapping for all features can be written as
for col in bins_of_col:
    model_woe[col] = pd.cut(model_data[col],bins_of_col[col]).map(woeall[col])
model_woe.head()
Out[79]:
RevolvingUtilizationOfUnsecuredLines ... NumberOfDependents
0 2.206724 ... 0.629351
1 0.602023 ... 0.629351
2 -2.044288 ... -0.474088
3 2.206724 ... 0.629351
4 -1.027200 ... -0.584413
[5 rows x 10 columns]
#add the label back into the data
model_woe["SeriousDlqin2yrs"] = model_data["SeriousDlqin2yrs"]
#this is our modeling data
model_woe.head()
Out[80]:
RevolvingUtilizationOfUnsecuredLines ... SeriousDlqin2yrs
0 2.206724 ... 0
1 0.602023 ... 0
2 -2.044288 ... 1
3 2.206724 ... 0
4 -1.027200 ... 1
[5 rows x 11 columns]
3.9 Modeling and model validation
vali_woe = pd.DataFrame(index=vali_data.index)
for col in bins_of_col:
    vali_woe[col] = pd.cut(vali_data[col],bins_of_col[col]).map(woeall[col])
vali_woe["SeriousDlqin2yrs"] = vali_data["SeriousDlqin2yrs"]
vali_X = vali_woe.iloc[:,:-1]
vali_y = vali_woe.iloc[:,-1]
X = model_woe.iloc[:,:-1]
y = model_woe.iloc[:,-1]
#fit the model
from sklearn.linear_model import LogisticRegression as LR
lr = LR().fit(X,y)
lr.score(vali_X,vali_y)
H:\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
Out[83]: 0.8638963338757538
lr.n_iter_
Out[84]: array([6], dtype=int32)
#results on the ROC curve
import scikitplot as skplt
vali_proba_df = pd.DataFrame(lr.predict_proba(vali_X))
skplt.metrics.plot_roc(vali_y, vali_proba_df,
plot_micro=False,figsize=(6,6),
plot_macro=False)
Out[87]: <matplotlib.axes._subplots.AxesSubplot at 0x29581588908>

3.10 Building the scorecard
Suppose the score assigned at odds of 1/60 is set to 600, with PDO = 20; then the score at odds of 1/120 is 620. Substituting into the linear expression Score = A - B*ln(odds) above, we can solve for A and B:
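The two constants follow from the two anchor points (base score 600 at odds 1/60, and +20 points when the odds halve, i.e. PDO = 20):

```latex
\begin{aligned}
600 &= A - B\ln(1/60)\\
620 &= A - B\ln(1/120)\\
\Rightarrow\ B &= \frac{20}{\ln 2}\approx 28.854,\qquad
A = 600 + B\ln(1/60)\approx 481.862
\end{aligned}
```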
B = 20/np.log(2)
A = 600 + B*np.log(1/60)
B,A
Out[88]: (28.85390081777927, 481.8621880878296)
#compute the base score
base_score = A - B*lr.intercept_
base_score
Out[89]: array([481.37158938])
file = "H:/程志伟/python//ScoreData.csv"
with open(file,"w") as fdata:
    fdata.write("base_score,{}\n".format(base_score))
for i,col in enumerate(X.columns):
    score = woeall[col] * (-B*lr.coef_[0][i])
    score.name = "Score"
    score.index.name = col
    score.to_csv(file,header=True,mode="a")
Open the CSV in Excel to see every variable's bins and scores; the scorecard is now complete.
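As a check on the construction (with hypothetical intercept, coefficients, and WOE values): the base score plus a sample's per-feature scores from the CSV reproduces A - B*(logistic linear predictor), so the card ranks samples exactly as the model does:

```python
import numpy as np

B = 20 / np.log(2)
A = 600 + B * np.log(1 / 60)

# hypothetical fitted model and one sample's WOE-encoded features
intercept = -1.5
coef = np.array([0.6, -0.8, 0.3])
woe_values = np.array([2.2, -0.4, 1.1])

# what the CSV stores: a base score plus one score per feature bin
base_score = A - B * intercept
feature_scores = woe_values * (-B * coef)
total = base_score + feature_scores.sum()

# identical to scoring straight from the logistic linear predictor
direct = A - B * (intercept + coef @ woe_values)
```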