Risk-Control Modeling: A Big-Data Mining Competition Walkthrough

I. Dataset introduction

The dataset contains three files: LC.csv, LP.csv, and LCIS.csv.

The LC table holds the listing (loan) features, one record per listing. It has 21 fields: one primary key, 7 fields describing the listing itself, and 13 fields describing the borrower at the time the deal was made. The LP table holds each listing's repayment plan and repayment records, one record per listing per installment; it has 10 fields, including 2 primary keys, 2 repayment-plan fields, and 4 repayment-status fields. The LCIS table contains all listings funded by one particular investor since January 1, 2015, with 36 fields in total: 1 primary key, 7 listing fields, 13 borrower fields as of the deal date, and 15 fields related to the investor's investment and returns.
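For reference, a minimal sketch of loading and inspecting these three files with pandas is shown below; the working-directory paths and the gbk encoding are assumptions, so adjust them to your own environment.

import pandas as pd

# Assumed paths and encoding; change them to match where the files actually live.
lc = pd.read_csv('LC.csv', encoding='gbk')      # listing features, one row per listing
lp = pd.read_csv('LP.csv', encoding='gbk')      # repayment plan and records, one row per listing per installment
lcis = pd.read_csv('LCIS.csv', encoding='gbk')  # one investor's funded listings, with investment/return fields

for name, df in [('LC', lc), ('LP', lp), ('LCIS', lcis)]:
    print(name, df.shape)   # quick sanity check of row/column counts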

II. Loading the data and merging the training and test sets

1. Loading the datasets

The code is as follows:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
%matplotlib inline

# Load the training data
path = r'F:\教师培训\ppd7\PPD-First-Round-Data-Updated\PPD-First-Round-Data-Update\Training Set'

train_loginfo = pd.read_csv(path + r'\PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
train_master = pd.read_csv(path + r'\PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
train_userupdate = pd.read_csv(path + r'\PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')

# Load the test data
path = r'F:\教师培训\ppd7\PPD-First-Round-Data-Updated\PPD-First-Round-Data-Update\Test Set'
test_loginfo = pd.read_csv(path + r'\PPD_LogInfo_2_Test_Set.csv', encoding='gbk')
test_master = pd.read_csv(path + r'\PPD_Master_GBK_2_Test_Set.csv', encoding='gb18030')
test_userupdate = pd.read_csv(path + r'\PPD_Userupdate_Info_2_Test_Set.csv', encoding='gbk')

# Mark which rows come from the training set and which from the test set before merging
train_master['sample_status'] = 'train'
test_master['sample_status'] = 'test'

# Concatenate the training and test sets (axis=0, stacking rows)
df_Master = pd.concat([train_master,test_master], axis=0).reset_index(drop=True)
df_loginfo = pd.concat([train_loginfo,test_loginfo], axis=0).reset_index(drop=True)
df_userupdate = pd.concat([train_userupdate,test_userupdate], axis=0).reset_index(drop=True)

2. Missing-value handling

The code is as follows:

# Missing values are coded as -1; replace them with NaN
df_Master = df_Master.replace({-1:np.nan})
# Drop columns whose missing rate exceeds 0.8 (len(df_Master) is 49999 rows here)
p = (df_Master.isnull().sum().sort_values(ascending=False)/len(df_Master)).reset_index().rename(columns={'index':'feat_name', 0:'rate'})
df_Master = df_Master.drop(columns=p[p.rate>0.8]['feat_name'])
plt.figure(figsize=(20,5))
(df_Master.isnull().sum().sort_values(ascending=False)/len(df_Master))[:30].plot.bar(rot=45)

[Figure: bar chart of the missing-value rate for the 30 most-missing columns]

df_Master.isnull().sum(axis=1).sort_values(ascending=True).reset_index(drop=True).plot.line()

[Figure: line plot of per-row missing-value counts, sorted ascending]

# Drop rows with more than 100 missing values
a = df_Master.isnull().sum(axis=1).sort_values(ascending=False).reset_index().rename(columns={0:'count_lack_row'})
df_Master = df_Master.drop(index=a[a.count_lack_row>100]['index'])
df_Master.isnull().sum(axis=1).sort_values(ascending=True).reset_index(drop=True).plot.line()

[Figure: per-row missing-value counts after dropping the high-missing rows]

3. Splitting the data into numeric and categorical columns and processing each separately

# Process the numeric (float) columns
df_Master_Num = df_Master.select_dtypes(include=np.float64).drop(columns=['Idx','target'])
df_Master_Num.shape

# Inspecting the numeric columns shows they fall into five groups by name prefix: user info,
# weblog info, third-party info, social network, and education; each group gets its own treatment
df_Master_Num_UserInfo_col = []
df_Master_Num_WeblogInfo_col = []
df_Master_Num_ThirdParty_Info_Period_col = []
df_Master_Num_SocialNetwork_col = []
df_Master_Num_Education_Info_col = []

for c in df_Master_Num.columns:
    if c.find('UserInfo') != -1:
        df_Master_Num_UserInfo_col.append(c)
    elif c.find('WeblogInfo') != -1:
        df_Master_Num_WeblogInfo_col.append(c)
    elif c.find('ThirdParty_Info_Period') != -1:
        df_Master_Num_ThirdParty_Info_Period_col.append(c)
    elif c.find('SocialNetwork') != -1:
        df_Master_Num_SocialNetwork_col.append(c)
    elif c.find('Education_Info') != -1:
        df_Master_Num_Education_Info_col.append(c)
        
# First, handle the user-info related columns
df_Master_Num_UserInfo = df_Master_Num[df_Master_Num_UserInfo_col].copy()
df_Master_Num_UserInfo.isnull().sum()
# Based on the counts above, fill UserInfo_1 and UserInfo_3 with their modes (1.0 and 5.0)
df_Master_Num_UserInfo['UserInfo_1'] = df_Master_Num_UserInfo['UserInfo_1'].fillna(1.0)
df_Master_Num_UserInfo['UserInfo_1'].isnull().sum()
df_Master_Num_UserInfo['UserInfo_3'] = df_Master_Num_UserInfo['UserInfo_3'].fillna(5.0)
df_Master_Num_UserInfo['UserInfo_3'].isnull().sum()
# UserInfo_11    31384
# UserInfo_12    31384
# UserInfo_13    31384
# These three columns have heavy missingness; their values are 0/1, so fill the gaps with -1
# to mark a separate "missing" category. User-info missing values are now handled.
df_Master_Num_UserInfo['UserInfo_11'] = df_Master_Num_UserInfo['UserInfo_11'].fillna(-1)
df_Master_Num_UserInfo['UserInfo_12'] = df_Master_Num_UserInfo['UserInfo_12'].fillna(-1)
df_Master_Num_UserInfo['UserInfo_13'] = df_Master_Num_UserInfo['UserInfo_13'].fillna(-1)

# Next, handle missing values in the WeblogInfo columns
df_Master_Num_WeblogInfo = df_Master_Num[df_Master_Num_WeblogInfo_col].copy()
df_Master_Num_WeblogInfo.head()
# These are categorical codes, so fill missing values with the column mode
for c in df_Master_Num_WeblogInfo.columns:
    m = df_Master_Num_WeblogInfo[c].mode()[0]
    df_Master_Num_WeblogInfo[c] = df_Master_Num_WeblogInfo[c].fillna(m)

# Handle df_Master_Num_ThirdParty_Info_Period
df_Master_Num_ThirdParty_Info_Period = df_Master_Num[df_Master_Num_ThirdParty_Info_Period_col].copy()
# These are continuous values, so fill missing values with the column mean
for c in df_Master_Num_ThirdParty_Info_Period.columns:
    m = df_Master_Num_ThirdParty_Info_Period[c].mean()
    df_Master_Num_ThirdParty_Info_Period[c] = df_Master_Num_ThirdParty_Info_Period[c].fillna(m)

# Handle df_Master_Num_SocialNetwork
df_Master_Num_SocialNetwork = df_Master_Num[df_Master_Num_SocialNetwork_col].copy()
df_Master_Num_SocialNetwork.isnull().sum()
df_Master_Num_SocialNetwork['SocialNetwork_12'] = df_Master_Num_SocialNetwork['SocialNetwork_12'].fillna(0)
df_Master_Num_SocialNetwork['SocialNetwork_12'].value_counts()
# The remaining columns are numeric, so fill missing values with the column mean
for c in df_Master_Num_SocialNetwork.columns:
    m = df_Master_Num_SocialNetwork[c].mean()
    df_Master_Num_SocialNetwork[c] = df_Master_Num_SocialNetwork[c].fillna(m)

# Handle df_Master_Num_Education_Info
df_Master_Num_Education_Info = df_Master_Num[df_Master_Num_Education_Info_col]
df_Master_Num.isnull().sum()
# df_Master_Num_UserInfo
# df_Master_Num_WeblogInfo
# df_Master_Num_ThirdParty_Info_Period
# df_Master_Num_SocialNetwork
# df_Master_Num_Education_Info
# After filling missing values, re-combine the numeric groups
df_Master_Num_clean = pd.concat([df_Master_Num_UserInfo,df_Master_Num_WeblogInfo,\
                                 df_Master_Num_ThirdParty_Info_Period,df_Master_Num_SocialNetwork,\
                                 df_Master_Num_Education_Info], axis=1)



# Next, handle the string columns: extract the object-dtype columns of df_Master
df_Master_obj = df_Master.select_dtypes(include=object).drop(columns=['ListingInfo','sample_status'])
df_Master_obj.isnull().sum()
# Fill UserInfo_2/UserInfo_4 with a separate '缺失' (missing) category,
# and fill WeblogInfo_19/20/21 with their modes taken from the value_counts below
df_Master_obj['UserInfo_2'] = df_Master_obj['UserInfo_2'].fillna('缺失')
df_Master_obj['UserInfo_4'] = df_Master_obj['UserInfo_4'].fillna('缺失')
df_Master_obj['WeblogInfo_19'].value_counts()
df_Master_obj['WeblogInfo_19'] = df_Master_obj['WeblogInfo_19'].fillna('I')
df_Master_obj['WeblogInfo_20'].value_counts()
df_Master_obj['WeblogInfo_20'] = df_Master_obj['WeblogInfo_20'].fillna('I5')
df_Master_obj['WeblogInfo_21'].value_counts()
df_Master_obj['WeblogInfo_21'] = df_Master_obj['WeblogInfo_21'].fillna('D')
# df_Master_obj missing values are now filled

# Normalize value formats in df_Master_obj (trim stray spaces, strip 市/省 suffixes, shorten autonomous-region names)
df_Master_obj.replace({'UserInfo_9':{'中国移动 ':'中国移动', '中国电信 ':'中国电信','中国联通 ':'中国联通'}}, inplace=True)
df_Master_obj['UserInfo_8'] = df_Master_obj['UserInfo_8'].apply(lambda x: x if x.find('市') == -1 else x[:-1])
df_Master_obj['UserInfo_20'] = df_Master_obj['UserInfo_20'].apply(lambda x: x if x.find('市') == -1 else x[:-1])
df_Master_obj['UserInfo_19'] = df_Master_obj['UserInfo_19'].apply(lambda x: x if x.find('省') == -1 else x[:-1])
df_Master_obj.UserInfo_19.replace({ 
    '广西壮族自治区':'广西',
    '宁夏回族自治区':'宁夏',
    '新疆维吾尔自治区':'新疆', 
    '西藏自治区':'西藏',
    '内蒙古自治区':'内蒙古'
}, inplace=True)
df_Master_obj['UserInfo_19'] = df_Master_obj['UserInfo_19'].apply(lambda x: x if x.find('市') == -1 else x[:-1])
# df_Master_obj formats are now consistent; next, derive features from these columns

# First, group the object columns into three sets: UserInfo, Education_Info, and WeblogInfo
df_Master_obj_UserInfo_col = ['UserInfo_2', 'UserInfo_4', 'UserInfo_7', 'UserInfo_8',\
                              'UserInfo_9','UserInfo_19', 'UserInfo_20', 'UserInfo_22', 'UserInfo_23','UserInfo_24']
df_Master_obj_Education_Info_col = ['Education_Info2', 'Education_Info3', 'Education_Info4',\
                                    'Education_Info6', 'Education_Info7', 'Education_Info8']

df_Master_obj_WeblogInfo_col = ['WeblogInfo_19', 'WeblogInfo_20', 'WeblogInfo_21']


# Start with the df_Master_obj_UserInfo_col columns
df_Master_obj_UserInfo = df_Master_obj[df_Master_obj_UserInfo_col]




# LightGBM can handle category dtype directly, but simply casting these columns to
# category performed poorly here, so binarization is used instead.
# Dummy-encode the six provinces in UserInfo_7 and UserInfo_19 with the highest bad rates.
def get_badrate(df_Master_obj_UserInfo, df_Master, x):
    # Bad rate per province, computed on training rows only (target not null)
    n = pd.concat([df_Master_obj_UserInfo[x], df_Master['target']], axis=1)
    n = n[n['target'].notnull()]
    p = (n.groupby(x)['target'].sum() / n.groupby(x)['target'].count()).reset_index()
    p = p.rename(columns={x: 'province', 'target': 'bad_rate'})
    p = p.sort_values('bad_rate', ascending=False)[:6]
    # Keep only the dummy columns for the top provinces
    c = [x + '_is_' + prov for prov in p['province']]
    k = pd.get_dummies(df_Master_obj_UserInfo[x], prefix=x + '_is')[c]
    df_Master_obj_UserInfo = pd.concat([df_Master_obj_UserInfo, k], axis=1)
    return df_Master_obj_UserInfo

df_Master_obj_UserInfo = get_badrate(df_Master_obj_UserInfo,df_Master,'UserInfo_7')
df_Master_obj_UserInfo = get_badrate(df_Master_obj_UserInfo,df_Master,'UserInfo_19')


# Derive time_change_address: the number of distinct values across the four address fields
# (a proxy for how often the borrower's address changed)
df_Master_obj_UserInfo['time_change_address'] = df_Master_obj_UserInfo[['UserInfo_2','UserInfo_4','UserInfo_8','UserInfo_20']].apply(lambda x:x.nunique(), axis=1)
df_Master_obj_UserInfo.shape


# UserInfo_2, UserInfo_4, UserInfo_8 and UserInfo_20 contain too many distinct cities,
# so dummy-encode them first and then use LightGBM feature importance to select cities
import lightgbm as lgb
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # so Chinese labels render in plots

c = ['UserInfo_2', 'UserInfo_4', 'UserInfo_8', 'UserInfo_20']
def get_import_city(df_Master_obj_UserInfo, df_Master):
    # Dummy-encode the city columns, fit a LightGBM classifier on the training rows,
    # and keep the 20 dummy columns with the highest feature importance
    dummy_city = pd.get_dummies(df_Master_obj_UserInfo[c])
    data = pd.concat([dummy_city, df_Master['target']], axis=1)
    X = data[data['target'].notnull()].drop(columns='target')
    y = data[data['target'].notnull()]['target']
    lgb_model = lgb.LGBMClassifier().fit(X, y, eval_metric='auc')
    feature_importance = pd.DataFrame({'name': lgb_model.booster_.feature_name(),
                                       'importance': lgb_model.feature_importances_}).sort_values(by=['importance'], ascending=False)
    df_Master_obj_UserInfo = pd.concat([df_Master_obj_UserInfo, dummy_city[feature_importance[:20]['name']]], axis=1)
    return df_Master_obj_UserInfo


df_Master_obj_UserInfo = get_import_city(df_Master_obj_UserInfo, df_Master)


# Dummy-encode UserInfo_9 (the mobile-carrier field)
df_Master_obj_UserInfo = pd.concat([pd.get_dummies(df_Master_obj_UserInfo['UserInfo_9']), df_Master_obj_UserInfo], axis=1)
df_Master_obj_UserInfo.shape
c = ['UserInfo_22', 'UserInfo_23', 'UserInfo_24']
def get_import_city(df_Master_obj_UserInfo, df_Master):
    # Same importance-based selection as above, applied to UserInfo_22/23/24
    dummy_city = pd.get_dummies(df_Master_obj_UserInfo[c])
    data = pd.concat([dummy_city, df_Master['target']], axis=1)
    X = data[data['target'].notnull()].drop(columns='target')
    y = data[data['target'].notnull()]['target']
    lgb_model = lgb.LGBMClassifier().fit(X, y, eval_metric='auc')
    feature_importance = pd.DataFrame({'name': lgb_model.booster_.feature_name(),
                                       'importance': lgb_model.feature_importances_}).sort_values(by=['importance'], ascending=False)
    df_Master_obj_UserInfo = pd.concat([df_Master_obj_UserInfo, dummy_city[feature_importance[:20]['name']]], axis=1)
    return df_Master_obj_UserInfo
df_Master_obj_UserInfo = get_import_city(df_Master_obj_UserInfo, df_Master)
# User-info columns are now processed; drop the original raw columns
df_Master_obj_UserInfo_clean = df_Master_obj_UserInfo.drop(columns=df_Master_obj_UserInfo_col)
df_Master_obj_UserInfo_clean.shape

# Next, handle the df_Master_obj_Education_Info_col columns; they have few categories, so plain dummy encoding is enough
df_Master_obj_Education_Info = df_Master_obj[df_Master_obj_Education_Info_col]
df_Master_obj_Education_Info = pd.concat([df_Master_obj_Education_Info, pd.get_dummies(df_Master_obj_Education_Info)], axis=1)


# Education columns are now processed; drop the original raw columns
df_Master_obj_Education_Info_clean = df_Master_obj_Education_Info.drop(columns=df_Master_obj_Education_Info_col)
df_Master_obj_Education_Info_clean.shape

# Next, handle the WeblogInfo string columns
df_Master_obj_WeblogInfo = df_Master_obj[df_Master_obj_WeblogInfo_col]
df_Master_obj_WeblogInfo

# Dummy-encode WeblogInfo_19 and WeblogInfo_21 directly; WeblogInfo_20 has many levels,
# so dummy-encode it and then select features to avoid the curse of dimensionality
pd.get_dummies(df_Master_obj_WeblogInfo[['WeblogInfo_19', 'WeblogInfo_21']])
pd.get_dummies(df_Master_obj_WeblogInfo[['WeblogInfo_20']])

c = ['WeblogInfo_20']
def get_import_city(df_Master_obj_WeblogInfo, df_Master):
    # Importance-based selection again, this time for the WeblogInfo_20 dummies
    dummy_web = pd.get_dummies(df_Master_obj_WeblogInfo[c])
    data = pd.concat([dummy_web, df_Master['target']], axis=1)
    X = data[data['target'].notnull()].drop(columns='target')
    y = data[data['target'].notnull()]['target']
    lgb_model = lgb.LGBMClassifier().fit(X, y, eval_metric='auc')
    feature_importance = pd.DataFrame({'name': lgb_model.booster_.feature_name(),
                                       'importance': lgb_model.feature_importances_}).sort_values(by=['importance'], ascending=False)
    df_Master_obj_WeblogInfo = pd.concat([df_Master_obj_WeblogInfo, dummy_web[feature_importance[:20]['name']]], axis=1)
    return df_Master_obj_WeblogInfo, feature_importance
df_Master_obj_WeblogInfo, feature_importance = get_import_city(df_Master_obj_WeblogInfo, df_Master)

df_Master_obj_WeblogInfo = pd.concat([df_Master_obj_WeblogInfo, pd.get_dummies(df_Master_obj_WeblogInfo[['WeblogInfo_19','WeblogInfo_21']])], axis=1)
df_Master_obj_WeblogInfo.drop(columns=df_Master_obj_WeblogInfo_col,inplace=True)
df_Master_obj_WeblogInfo.shape
# All WeblogInfo string columns are now processed; merge everything back together
df_Master_obj_WeblogInfo_clean = df_Master_obj_WeblogInfo
df_Master_obj_clean = pd.concat([df_Master_obj_WeblogInfo_clean,
                                 df_Master_obj_UserInfo_clean,
                                 df_Master_obj_Education_Info_clean],axis=1)
df_Master_clean = pd.concat([df_Master_obj_clean,
                             df_Master_Num_clean,
                            df_Master[['Idx','target','ListingInfo','sample_status']]
                            ], axis=1)
# Write to disk
df_Master_clean.to_csv(r'F:\教师培训\ppd7\df_Master_clean.csv', encoding='gb18030', index=False)                
              

4. Modeling with LightGBM

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score,roc_curve,auc
import lightgbm as lgb
from multiprocessing import cpu_count

c=['Idx','target','ListingInfo','sample_status']
x = df_Master_clean[df_Master_clean['target'].notnull()].drop(columns=c)
y = df_Master_clean[df_Master_clean['target'].notnull()]['target']
x_train,x_test, y_train, y_test = train_test_split(x,y,random_state=2,test_size=0.2)

def roc_auc_plot(clf,x_train,y_train,x_test, y_test):
    train_auc = roc_auc_score(y_train,clf.predict_proba(x_train)[:,1])
    train_fpr, train_tpr, _ = roc_curve(y_train,clf.predict_proba(x_train)[:,1])
    train_ks = abs(train_fpr-train_tpr).max()
    print('train_ks = ', train_ks)
    print('train_auc = ', train_auc)
    
    test_auc = roc_auc_score(y_test,clf.predict_proba(x_test)[:,1])
    test_fpr, test_tpr, _ = roc_curve(y_test,clf.predict_proba(x_test)[:,1])
    test_ks = abs(test_fpr-test_tpr).max()
    print('test_ks = ', test_ks)
    print('test_auc = ', test_auc)
    
    from matplotlib import pyplot as plt
    plt.plot(train_fpr,train_tpr,label = 'train_roc')
    plt.plot(test_fpr,test_tpr,label = 'test_roc')
    plt.plot([0,1],[0,1],'k--', c='r')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve')
    plt.legend(loc = 'best')
    plt.show()
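The original section does not show the model-fitting call that produced the ROC figure below. A minimal sketch, assuming a mostly default LGBMClassifier baseline on the cleaned master table (the exact parameters behind the figure are not given in the original), would be:

# Baseline fit (assumed parameters) so roc_auc_plot can be called for the figure below
lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, objective='binary')
clf = lgb_model.fit(x_train, y_train,
                    eval_set=[(x_train, y_train), (x_test, y_test)],
                    eval_metric='auc')
roc_auc_plot(clf, x_train, y_train, x_test, y_test)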

[Figure: train/test ROC curves for the baseline LightGBM model]

6. Feature transformation on the df_userupdate table

# Lower-case the modified-field names so they compare consistently
df_userupdate['UserupdateInfo1'] = df_userupdate['UserupdateInfo1'].apply(lambda x:x.lower())
# Derive features info1-info4 from the user-update table:
# 1) days between the most recent modification and the listing date
# 2) total number of modifications
# 3) number of modifications per field
# 4) number of distinct modification dates
info1 = df_userupdate.groupby('Idx')['UserupdateInfo1'].count()   # total number of modifications
df_userupdate['ListingInfo1'] = pd.to_datetime(df_userupdate['ListingInfo1'])
df_userupdate['UserupdateInfo2'] = pd.to_datetime(df_userupdate['UserupdateInfo2'])

# days between the listing date and the most recent modification
info2 = df_userupdate.groupby('Idx')['ListingInfo1'].max()-df_userupdate.groupby('Idx')['UserupdateInfo2'].max()
info2 = info2.apply(lambda x : x.days)

# number of modifications per field
info3 = df_userupdate.pivot_table(index='Idx', columns='UserupdateInfo1', values='UserupdateInfo2', aggfunc={'UserupdateInfo2':'count'}).fillna(0)

# number of distinct modification dates
info4 = df_userupdate.groupby('Idx')['UserupdateInfo2'].nunique()
df_userupdate_info = pd.concat([info1, info2, info3, info4],axis=1)
df_userupdate_info.rename(columns={0:'df_use_update0'}, inplace=True)   # the unnamed day-gap column from info2

# Persist to disk
df_userupdate_info.to_csv(r'F:\教师培训\ppd6\df_userupdate_info.csv',encoding='gb18030', index=True)

7. Feature transformation on the df_loginfo table


# Derived features from the login table:
# 1) total number of logins
# 2) average interval (in days) between logins
# 3) days between the most recent login and the listing date
df_loginfo['Listinginfo1'] = pd.to_datetime(df_loginfo['Listinginfo1'])
df_loginfo['LogInfo3'] = pd.to_datetime(df_loginfo['LogInfo3'])
info1 = df_loginfo.groupby('Idx')['LogInfo3'].count()
def f(x):
    # Average gap in days between consecutive logins for one borrower
    x = x.sort_values(ascending=True)
    y = x - x.shift()
    p = y.apply(lambda z: z.days)
    res = p.sum() / len(x)
    return round(res, 3)

info2 = df_loginfo.groupby('Idx')['LogInfo3'].apply(f)
info3 = df_loginfo.groupby('Idx')['Listinginfo1'].max() - df_loginfo.groupby('Idx')['LogInfo3'].max()
info3 = info3.apply(lambda x: x.days)
df_loginfo_info = pd.concat([info1, info2, info3], axis=1)
# info1 and info2 are both named 'LogInfo3' after the concat, so a rename dict with a
# duplicate key would mislabel them; assign the column names positionally instead
df_loginfo_info.columns = ['login_info1', 'login_info2', 'login_info3']
df_loginfo_info.to_csv(r'F:\教师培训\ppd6\df_loginfo_info.csv',encoding='gb18030', index=True)

8. Merging df_userupdate_info, df_loginfo_info, and df_Master_clean, then re-training the model

df_userupdate_info = pd.read_csv(r'F:\教师培训\ppd6\df_userupdate_info.csv', encoding='gb18030')
df_loginfo_info = pd.read_csv(r'F:\教师培训\ppd6\df_loginfo_info.csv', encoding='gb18030')
df_Master_merge_clean = df_Master_clean.merge(df_userupdate_info, on='Idx')
df_Master_merge_clean = df_Master_merge_clean.merge(df_loginfo_info, on='Idx')

# Split into training and test sets
c=['Idx','target','ListingInfo','sample_status']
x = df_Master_merge_clean[df_Master_merge_clean['target'].notnull()].drop(columns=c)
y = df_Master_merge_clean[df_Master_merge_clean['target'].notnull()]['target']
x_train,x_test, y_train, y_test = train_test_split(x,y,random_state=2,test_size=0.2)

# The LGBMClassifier performed best on the df_Master2 dataset, so it is kept as the base model for further prediction and tuning

# Reuse the roc_auc_plot helper defined in section 4 above

lgb_model = lgb.LGBMClassifier(n_estimators=800,
                               boosting_type='gbdt',
                               learning_rate=0.04,
                               min_child_samples=68,
                               min_child_weight=0.01,
                               max_depth=4,
                               num_leaves=16,
                               colsample_bytree=0.8,
                               subsample=0.8,
                               reg_alpha=0.7777777777777778,
                               reg_lambda=0.3,
                               objective='binary')

clf = lgb_model.fit(x_train, y_train,
              eval_set=[(x_train, y_train),(x_test,y_test)],
              eval_metric='auc',early_stopping_rounds=100)
roc_auc_plot(clf,x_train,y_train,x_test, y_test)

[Figure: train/test ROC curves after adding the userupdate and loginfo features]
Adding the data from the other two tables actually made performance worse, which is probably a data issue. The next step is to work further on this roughly 420-dimensional dataset and to explore different models.
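One way to act on this, sketched below, is to rank the features by LightGBM importance, retrain on a reduced subset, and compare with cross-validation. The names feat_imp, top_cols and the top-200 cutoff are illustrative assumptions, not part of the original pipeline.

# Hedged sketch: prune low-importance features and re-check AUC with 5-fold cross-validation.
# Assumes clf, x and y from the cells above are still in scope.
feat_imp = pd.DataFrame({'name': clf.booster_.feature_name(),
                         'importance': clf.feature_importances_}).sort_values('importance', ascending=False)
top_cols = feat_imp['name'].head(200)        # keep the 200 most important features (arbitrary cutoff)
x_reduced = x[top_cols.tolist()]

auc_full = cross_val_score(lgb.LGBMClassifier(), x, y, scoring='roc_auc', cv=5).mean()
auc_reduced = cross_val_score(lgb.LGBMClassifier(), x_reduced, y, scoring='roc_auc', cv=5).mean()
print('5-fold AUC, all features:', round(auc_full, 4))
print('5-fold AUC, top-200 features:', round(auc_reduced, 4))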
