Financial Risk Control in Practice: Risk Data Mining Methods (Decision Tree Rule Mining)

import pandas as pd
import numpy as np
data = pd.read_excel("/Users/zhucan/Desktop/金融风控实战/第二课/oil_data_for_tree.xlsx")
data.head()

Result:

set(data.class_new)
#{'A', 'B', 'C', 'D', 'E', 'F'}
# org_lst: identifier/date/label columns — no transformation needed, just deduplicate
# agg_lst: numeric columns — aggregate per uid
# dstc_lst: categorical columns — count distinct values per uid
org_lst = ["uid","create_dt","oil_actv_dt","class_new","bad_ind"]
agg_lst = ["oil_amount","discount_amount","sale_amount","amount","pay_amount","coupon_amount","payment_coupon_amount"]
dstc_lst = ["channel_code","oil_code","scene","source_app","call_source"]
# Copy the columns and check for missing values
df = data[org_lst].copy()
df[agg_lst] = data[agg_lst].copy()
df[dstc_lst] = data[dstc_lst].copy()

df.isna().sum()

Result:

uid                         0
create_dt                4944
oil_actv_dt                 0
class_new                   0
bad_ind                     0
oil_amount               4944
discount_amount          4944
sale_amount              4944
amount                   4944
pay_amount               4944
coupon_amount            4944
payment_coupon_amount    4946
channel_code                0
oil_code                    0
scene                       0
source_app                  0
call_source                 0
dtype: int64
# Fill missing create_dt with oil_actv_dt, then keep only 6 months of data.
# When constructing features, do not aggregate over the entire history:
# otherwise the variable distributions drift heavily as time passes.
def time_isna(x, y):
    # Fall back to oil_actv_dt when create_dt is missing (NaT)
    if pd.isna(x):
        return y
    return x
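As an aside, the same fill can be done vectorized instead of with the row-wise apply used next; a minimal sketch (equivalent because fillna only touches the NaT entries):

df2 = df.sort_values(["uid","create_dt"], ascending=False)
df2["create_dt"] = df2["create_dt"].fillna(df2["oil_actv_dt"])  # NaT -> oil_actv_dt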

df2 = df.sort_values(["uid","create_dt"],ascending = False)
df2["create_dt"] = df2.apply(lambda x:time_isna(x.create_dt,x.oil_actv_dt),axis = 1)
df2["dtn"] = (df2.oil_actv_dt - df2.create_dt).apply(lambda x:x.days)
df = df2[df2["dtn"] < 180]  # keep only records within 180 days (~6 months) of oil_actv_dt
df.head()

Result:

# Take the org_lst columns as the base table
base = df[org_lst].copy()
base["dtn"] = df["dtn"]
base = base.sort_values(['uid','create_dt'],ascending = False)
base = base.drop_duplicates(['uid'],keep = 'first')    # deduplicate: keep the latest record per uid
base.shape

Result:

(11099, 6)
# Feature derivation: aggregate each numeric variable per uid
gn = pd.DataFrame()
for i in agg_lst:
    grp = df.groupby('uid')[i]
    tp = pd.DataFrame({
        i + '_cnt': grp.size(),                          # number of records
        i + '_num': grp.apply(lambda s: (s > 0).sum()),  # number of positive values
        i + '_tot': grp.apply(np.nansum),                # total
        i + '_avg': grp.apply(np.nanmean),               # mean
        i + '_max': grp.apply(np.nanmax),                # max
        i + '_min': grp.apply(np.nanmin),                # min
        i + '_var': grp.apply(np.nanvar),                # variance
    }).reset_index()
    # mean-to-variance ratio ("bianyi" = variation), with the variance floored at 1
    tp[i + '_bianyi'] = tp[i + '_avg'] / tp[i + '_var'].clip(lower=1)
    gn = tp if gn.empty else pd.merge(gn, tp, on = 'uid', how = 'left')
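Each variable in agg_lst thus expands into eight per-uid features. For example, the columns derived from oil_amount can be listed with:

# The eight features generated for oil_amount
[c for c in gn.columns if c.startswith('oil_amount')]
# ['oil_amount_cnt', 'oil_amount_num', 'oil_amount_tot', 'oil_amount_avg',
#  'oil_amount_max', 'oil_amount_min', 'oil_amount_var', 'oil_amount_bianyi']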
# Count the number of distinct values of each dstc_lst variable per uid
gc = pd.DataFrame()
for i in dstc_lst:
    tp = df.groupby('uid')[i].nunique().reset_index()
    tp.columns = ['uid', i + '_dstc']
    gc = tp if gc.empty else pd.merge(gc, tp, on = 'uid', how = 'left')
fn = pd.merge(base,gn,on= 'uid')
fn = pd.merge(fn,gc,on= 'uid') 
fn

Result:

fn = fn.fillna(0)  # the merge leaves many missing values; fill them with 0
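A quick sanity check (not in the original) that the fill left no missing values:

assert fn.isna().sum().sum() == 0  # every NaN should now be 0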

# Train the model
x = fn.drop(['uid','oil_actv_dt','create_dt','bad_ind','class_new'],axis = 1)
y = fn.bad_ind.copy()
from sklearn import tree
dtree = tree.DecisionTreeRegressor(
    max_depth = 2,             # limit the maximum depth of the tree
    min_samples_leaf = 500,    # each leaf must contain at least 500 samples
    min_samples_split = 5000,  # a node needs at least 5000 samples to be split
)
# Fit on x to predict the bad-customer label y; with a 0/1 target,
# each leaf's prediction is simply that leaf's bad rate.
dtree = dtree.fit(x,y)
feature_names = x.columns
import matplotlib.pyplot as plt
plt.figure(figsize=(12,9),dpi=80)
tree.plot_tree(dtree,filled = True,feature_names = feature_names)

Results (tree plot):
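Besides the plot, the fitted rules can be printed as text with sklearn's export_text, which shows each split feature, its threshold, and the bad rate in every leaf:

from sklearn.tree import export_text
# Text rendering of the depth-2 tree: split features, thresholds, leaf values
print(export_text(dtree, feature_names = list(feature_names)))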

sum(fn.bad_ind)/len(fn.bad_ind)
#0.04658077304261645
# Generate the strategy; the thresholds (48077.5 and 3.5) are the splits of the fitted tree
dff1 = fn.loc[(fn.amount_tot>48077.5)&(fn.coupon_amount_cnt>3.5)].copy()
dff1['level'] = 'oil_A'
dff2 = fn.loc[(fn.amount_tot>48077.5)&(fn.coupon_amount_cnt<=3.5)].copy()
dff2['level'] = 'oil_B'
dff3 = fn.loc[(fn.amount_tot<=48077.5)].copy()
dff3['level'] = 'oil_C'
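Equivalently, the three rules can be applied in one pass; a sketch using np.select instead of three separate slices:

# Label every row at once; conditions are evaluated in order
fn['level'] = np.select(
    [
        (fn.amount_tot > 48077.5) & (fn.coupon_amount_cnt > 3.5),
        (fn.amount_tot > 48077.5) & (fn.coupon_amount_cnt <= 3.5),
    ],
    ['oil_A', 'oil_B'],
    default = 'oil_C',
)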

dff1.head()

Result:

# DataFrame.append was removed in pandas 2.0; concatenate the segments instead
dff1 = pd.concat([dff1, dff2, dff3]).reset_index(drop = True)
dff1.head()

Result:
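As a sanity check (not part of the original walkthrough), each level's bad rate should match the corresponding leaf values of the tree:

# Bad rate and sample count per strategy level
dff1.groupby('level')['bad_ind'].agg(['mean', 'count'])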

last = dff1[['class_new','level','bad_ind','uid','oil_actv_dt']].copy()  # drop the duplicated bad_ind column
last['oil_actv_dt'] = last['oil_actv_dt'].apply(lambda x:str(x)[:7])  # keep year-month only
last.head(5)

Result:

Follow-up pivot-table analysis in Excel:

Pivot: bad rate (bad_ind) by level (rows) and class_new (columns)

level    A      B      C      D      E       F       Total
oil_A    0.9%   0.7%   1.6%   1.7%   2.9%    5.5%    1.2%
oil_B    1.8%   2.2%   2.7%   5.3%   6.2%   13.1%    3.0%
oil_C    5.1%   6.7%   6.3%   5.9%  15.2%   19.9%    7.4%
Total    2.9%   3.9%   4.2%   4.9%  10.6%   16.1%    4.7%

Previously, only the class A segment could be approved for loans (bad rate under 3%).

With the new oil_A/oil_B/oil_C segmentation, every cell of the table above with a bad rate under 3% can now be approved: oil_A for classes A–E and oil_B for classes A–C, not just class A alone.
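The pivot itself can also be reproduced in pandas rather than Excel; a sketch (margins=True adds the totals row and column, labeled 'All'):

# Mean bad_ind (bad rate) by level and class_new, with totals
pd.pivot_table(last, values = 'bad_ind', index = 'level', columns = 'class_new', aggfunc = 'mean', margins = True)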

