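Financial Risk Control for Loan Default: Feature Engineering
1 Learning objectives
(1) Learn feature preprocessing methods: missing-value handling, outlier handling, and data binning;
(2) Learn the corresponding methods for feature interaction, encoding, and selection;
2 Main content
(1) Data preprocessing
Filling missing values;
Handling date/time formats;
Converting object-type features to numeric;
(2) Outlier handling
Based on the 3-sigma rule;
Based on box plots;
(3) Data binning
Fixed-width binning;
Quantile binning;
Binning discrete numeric data;
Binning continuous numeric data;
Chi-square binning;
(4) Feature interaction
Combining features with each other;
Deriving features from other features;
Other feature-derivation experiments;
(5) Feature encoding
One-hot encoding;
Label encoding;
(6) Feature selection
Filter;
Wrapper;
Embedded;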
3 Code examples
3.1 Import packages and load the data
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None
pd.options.display.max_rows = None
%matplotlib inline
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
# Load the data files
data_train =pd.read_csv('./RawData/train.csv')
data_test_a = pd.read_csv('./RawData/testA.csv')
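3.2 Feature preprocessing
Data preprocessing usually deals with the problems uncovered during EDA. This section covers filling missing values, converting date/time features, and handling some object-type categorical features.
# First, separate the object-type features from the numeric features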
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
label = 'isDefault'
numerical_fea.remove(label)
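Filling missing values
Replace every missing value with a specified value, here 0:
data_train = data_train.fillna(0)
Fill each missing value column-wise with the value above it (forward fill):
data_train = data_train.fillna(axis=0, method='ffill')
Fill each missing value column-wise with the value below it (backward fill), filling at most two consecutive missing values:
data_train = data_train.fillna(axis=0, method='bfill', limit=2)
(From pandas 2.1 on, fillna(method=...) is deprecated in favor of data_train.ffill() and data_train.bfill(limit=2).)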
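The three fill strategies listed above can be compared on a small example. This is a minimal sketch on a made-up column `x` (not part of the competition data); it uses `ffill()` and `bfill()`, which behave like the `method='ffill'` and `method='bfill'` calls above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a run of three consecutive missing values
toy = pd.DataFrame({"x": [1.0, np.nan, np.nan, np.nan, 5.0]})

filled_zero = toy.fillna(0)        # constant fill
filled_ffill = toy.ffill()         # carry the previous valid value forward
filled_bfill = toy.bfill(limit=2)  # carry the next valid value back, at most 2 rows

print(filled_zero["x"].tolist())
print(filled_ffill["x"].tolist())
print(filled_bfill["x"].tolist())
```

With `limit=2`, only the two missing values closest to the next valid value are filled, so the first missing value in the run stays missing.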
# Check the missing values
data_train.isnull().sum()
id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n3 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64
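# Fill numeric features with the median (computed on the training set)
data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].median())
data_test_a[numerical_fea] = data_test_a[numerical_fea].fillna(data_train[numerical_fea].median())
# Fill categorical features with the mode
# Caution: mode() returns a DataFrame, so fillna aligns it on the row index and fills only row 0.
# Pass data_train[category_fea].mode().iloc[0] to fill every row. The output below comes from
# the code as written, which is why employmentLength still has missing values.
data_train[category_fea] = data_train[category_fea].fillna(data_train[category_fea].mode())
data_test_a[category_fea] = data_test_a[category_fea].fillna(data_train[category_fea].mode())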
data_train.isnull().sum()
id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 0
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 0
regionCode 0
dti 0
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 0
revolBal 0
revolUtil 0
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 0
policyCode 0
n0 0
n1 0
n2 0
n3 0
n4 0
n5 0
n6 0
n7 0
n8 0
n9 0
n10 0
n11 0
n12 0
n13 0
n14 0
dtype: int64
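In the output above, employmentLength still has 46799 missing values after the mode fill. A likely cause is that `mode()` returns a DataFrame, which `fillna` aligns by row index. The sketch below reproduces the effect on a made-up frame (`toy` and its values are hypothetical) and shows one way to fill every row.

```python
import pandas as pd

# Hypothetical toy frame with missing categorical values
toy = pd.DataFrame({
    "grade": ["A", "B", None, "B", None],
    "employmentLength": ["1 year", None, "1 year", None, "2 years"],
})
cat_cols = ["grade", "employmentLength"]

# mode() returns a DataFrame (one row per tied mode); fillna aligns it on the
# row index, so only the rows whose index appears in it (here row 0) can be filled.
partial = toy[cat_cols].fillna(toy[cat_cols].mode())

# Taking the first row gives a Series keyed by column name, which fillna
# applies to every row of the matching column.
modes = toy[cat_cols].mode().iloc[0]
full = toy[cat_cols].fillna(modes)

print(partial.isna().sum())
print(full.isna().sum())
```

For the test set, the same `modes` Series computed on the training set would be passed to `fillna`, in line with how the median is handled above.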
# Inspect the categorical features
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
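category_fea: the object-type categorical features need preprocessing; among them, ['issueDate'] is a date/time feature.
# Handle the date/time format
# Convert to datetime
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # Build a time feature: number of days elapsed since startdate
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x - startdate).dt.days
# Inspect and process employment length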
data_train['employmentLength'].value_counts(dropna=False).sort_index()
1 year 52489
10+ years 262753
2 years 72358
3 years 64152
4 years 47985
5 years 50102
6 years 37254
7 years 35407
8 years 36192
9 years 30272
< 1 year 64237
NaN 46799
Name: employmentLength, dtype: int64
# Convert employment length to a number
def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])
for data in [data_train, data_test_a]:
    # Assign the result back rather than calling replace(..., inplace=True) on a
    # column selection, which does not reliably modify the frame in recent pandas
    data['employmentLength'] = data['employmentLength'].replace({'10+ years': '10 years', '< 1 year': '0 years'})
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
# `data` still refers to data_test_a here, so the counts below are for the test set
data['employmentLength'].value_counts(dropna=False).sort_index()
0.0 15989
1.0 13182
2.0 18207
3.0 16011
4.0 11833
5.0 12543
6.0 9328
7.0 8823
8.0 8976
9.0 7594
10.0 65772
NaN 11742
Name: employmentLength, dtype: int64
# Preprocess earliesCreditLine
data_train['earliesCreditLine'].sample(5)
421796 Jul-1994
763344 Jul-1996
782690 Dec-2002
531619 May-2011
114247 Nov-2001
Name: earliesCreditLine, dtype: object
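for data in [data_train, data_test_a]:
    # Keep only the four-digit year
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
# Categorical feature handling
# A subset of the categorical features
cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode', \
                 'applicationType', 'initialListStatus', 'title', 'policyCode']
# `data` is the last frame from the loop above (data_test_a)
for f in cate_features:
    print(f, 'number of categories:', data[f].nunique())
grade number of categories: 7
subGrade number of categories: 35
employmentTitle number of categories: 79282
homeOwnership number of categories: 6
verificationStatus number of categories: 3
purpose number of categories: 14
postCode number of categories: 889
regionCode number of categories: 51
applicationType number of categories: 2
initialListStatus number of categories: 2
title number of categories: 12058
policyCode number of categories: 1
# Ordinal categories such as grade can be label-encoded or mapped by hand
for data in [data_train, data_test_a]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
# Purely nominal features with more than 2 categories that are not high-dimensional and sparse
# Caution: rebinding the loop variable `data` does not modify data_train or data_test_a,
# so both frames keep these columns unchanged (subGrade is label-encoded in section 3.6)
for data in [data_train, data_test_a]:
    data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)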
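Because the loop above assigns the result of `pd.get_dummies` to the loop variable, the one-hot columns never reach `data_train` or `data_test_a`. The following is a minimal sketch of one way to keep the encoded frames and make sure both sets end up with identical dummy columns. The toy frames and the names `train_encoded` and `test_encoded` are assumptions for illustration, not part of the original pipeline, and applying this to the real data would change the later steps that still expect a `subGrade` column.

```python
import pandas as pd

# Hypothetical toy frames standing in for data_train and data_test_a
train = pd.DataFrame({"purpose": [0, 1, 2, 1], "loanAmnt": [1000, 2000, 1500, 3000]})
test = pd.DataFrame({"purpose": [1, 0], "loanAmnt": [2500, 1200]})

one_hot_cols = ["purpose"]

# Encode both frames together so a category missing from one set still gets
# a column in both; the keys let us split the result apart again.
combined = pd.concat([train, test], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=one_hot_cols, drop_first=True)

train_encoded = combined.xs("train")
test_encoded = combined.xs("test")

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

Encoding the concatenated frame is a design choice that avoids mismatched columns when a category appears in only one of the two sets.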
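3.3 Outlier handling
When you find outliers, first work out what caused them, and only then decide how to handle them. If an outlier does not reflect any regular pattern and is a purely accidental occurrence, or if you simply are not interested in studying that occurrence, you can delete it. If the outlier reflects a phenomenon that really exists, it should not be deleted casually. In fraud scenarios, fraudulent records are often anomalous relative to normal records by their very nature; such points should be kept, the model refitted with them, and their patterns studied. Use a supervised model where one is available; where it is not, consider anomaly-detection algorithms.
Note that rows in the test data must not be deleted.
Outlier detection method 1: standard deviation (the 3-sigma rule)
In statistics, if a distribution is approximately normal, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three.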
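The 68/95/99.7 rule is easy to check numerically. The sketch below draws a synthetic normal sample (the sample size and seed are arbitrary choices) and measures the share of values within one, two, and three standard deviations of the mean.

```python
import numpy as np

# Synthetic standard-normal sample; size and seed are arbitrary
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200_000)

mean, std = sample.mean(), sample.std()

# Share of values within k standard deviations of the mean
shares = {k: float(np.mean(np.abs(sample - mean) <= k * std)) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} std: {share:.4f}")
```

The rule only holds for roughly normal data; for heavily skewed features such as income, a 3-sigma cutoff can flag far more or far fewer points than 0.3%.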
def find_outliers_by_3segama(data, fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    # Label values: '异常值' = outlier, '正常值' = normal
    data[fea+'_outliers'] = data[fea].apply(lambda x: '异常值' if x > upper_rule or x < lower_rule else '正常值')
    return data
# With outliers flagged for each feature, we can examine how they relate to the target variable
data_train = data_train.copy()
for fea in numerical_fea:
    data_train = find_outliers_by_3segama(data_train, fea)
    print(data_train[fea+'_outliers'].value_counts())
    print(data_train.groupby(fea+'_outliers')['isDefault'].sum())
    print('*'*20)
正常值 800000
Name: id_outliers, dtype: int64
id_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 800000
Name: loanAmnt_outliers, dtype: int64
loanAmnt_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 800000
Name: term_outliers, dtype: int64
term_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 794259
异常值 5741
Name: interestRate_outliers, dtype: int64
interestRate_outliers
异常值 2916
正常值 156694
Name: isDefault, dtype: int64
********************
正常值 792046
异常值 7954
Name: installment_outliers, dtype: int64
installment_outliers
异常值 2152
正常值 157458
Name: isDefault, dtype: int64
********************
正常值 800000
Name: employmentTitle_outliers, dtype: int64
employmentTitle_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 799701
异常值 299
Name: homeOwnership_outliers, dtype: int64
homeOwnership_outliers
异常值 62
正常值 159548
Name: isDefault, dtype: int64
********************
正常值 793973
异常值 6027
Name: annualIncome_outliers, dtype: int64
annualIncome_outliers
异常值 756
正常值 158854
Name: isDefault, dtype: int64
********************
正常值 800000
Name: verificationStatus_outliers, dtype: int64
verificationStatus_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 783003
异常值 16997
Name: purpose_outliers, dtype: int64
purpose_outliers
异常值 3635
正常值 155975
Name: isDefault, dtype: int64
********************
正常值 798931
异常值 1069
Name: postCode_outliers, dtype: int64
postCode_outliers
异常值 221
正常值 159389
Name: isDefault, dtype: int64
********************
正常值 799994
异常值 6
Name: regionCode_outliers, dtype: int64
regionCode_outliers
异常值 1
正常值 159609
Name: isDefault, dtype: int64
********************
正常值 798440
异常值 1560
Name: dti_outliers, dtype: int64
dti_outliers
异常值 466
正常值 159144
Name: isDefault, dtype: int64
********************
正常值 778245
异常值 21755
Name: delinquency_2years_outliers, dtype: int64
delinquency_2years_outliers
异常值 5089
正常值 154521
Name: isDefault, dtype: int64
********************
正常值 788261
异常值 11739
Name: ficoRangeLow_outliers, dtype: int64
ficoRangeLow_outliers
异常值 778
正常值 158832
Name: isDefault, dtype: int64
********************
正常值 788261
异常值 11739
Name: ficoRangeHigh_outliers, dtype: int64
ficoRangeHigh_outliers
异常值 778
正常值 158832
Name: isDefault, dtype: int64
********************
正常值 790889
异常值 9111
Name: openAcc_outliers, dtype: int64
openAcc_outliers
异常值 2195
正常值 157415
Name: isDefault, dtype: int64
********************
正常值 792471
异常值 7529
Name: pubRec_outliers, dtype: int64
pubRec_outliers
异常值 1701
正常值 157909
Name: isDefault, dtype: int64
********************
正常值 794120
异常值 5880
Name: pubRecBankruptcies_outliers, dtype: int64
pubRecBankruptcies_outliers
异常值 1423
正常值 158187
Name: isDefault, dtype: int64
********************
正常值 790001
异常值 9999
Name: revolBal_outliers, dtype: int64
revolBal_outliers
异常值 1359
正常值 158251
Name: isDefault, dtype: int64
********************
正常值 799948
异常值 52
Name: revolUtil_outliers, dtype: int64
revolUtil_outliers
异常值 23
正常值 159587
Name: isDefault, dtype: int64
********************
正常值 791663
异常值 8337
Name: totalAcc_outliers, dtype: int64
totalAcc_outliers
异常值 1668
正常值 157942
Name: isDefault, dtype: int64
********************
正常值 800000
Name: initialListStatus_outliers, dtype: int64
initialListStatus_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 784586
异常值 15414
Name: applicationType_outliers, dtype: int64
applicationType_outliers
异常值 3875
正常值 155735
Name: isDefault, dtype: int64
********************
正常值 775134
异常值 24866
Name: title_outliers, dtype: int64
title_outliers
异常值 3900
正常值 155710
Name: isDefault, dtype: int64
********************
正常值 800000
Name: policyCode_outliers, dtype: int64
policyCode_outliers
正常值 159610
Name: isDefault, dtype: int64
********************
正常值 782773
异常值 17227
Name: n0_outliers, dtype: int64
n0_outliers
异常值 3485
正常值 156125
Name: isDefault, dtype: int64
********************
正常值 790500
异常值 9500
Name: n1_outliers, dtype: int64
n1_outliers
异常值 2491
正常值 157119
Name: isDefault, dtype: int64
********************
正常值 789067
异常值 10933
Name: n2_outliers, dtype: int64
n2_outliers
异常值 3205
正常值 156405
Name: isDefault, dtype: int64
********************
正常值 789067
异常值 10933
Name: n3_outliers, dtype: int64
n3_outliers
异常值 3205
正常值 156405
Name: isDefault, dtype: int64
********************
正常值 788660
异常值 11340
Name: n4_outliers, dtype: int64
n4_outliers
异常值 2476
正常值 157134
Name: isDefault, dtype: int64
********************
正常值 790355
异常值 9645
Name: n5_outliers, dtype: int64
n5_outliers
异常值 1858
正常值 157752
Name: isDefault, dtype: int64
********************
正常值 786006
异常值 13994
Name: n6_outliers, dtype: int64
n6_outliers
异常值 3182
正常值 156428
Name: isDefault, dtype: int64
********************
正常值 788430
异常值 11570
Name: n7_outliers, dtype: int64
n7_outliers
异常值 2746
正常值 156864
Name: isDefault, dtype: int64
********************
正常值 789625
异常值 10375
Name: n8_outliers, dtype: int64
n8_outliers
异常值 2131
正常值 157479
Name: isDefault, dtype: int64
********************
正常值 786384
异常值 13616
Name: n9_outliers, dtype: int64
n9_outliers
异常值 3953
正常值 155657
Name: isDefault, dtype: int64
********************
正常值 788979
异常值 11021
Name: n10_outliers, dtype: int64
n10_outliers
异常值 2639
正常值 156971
Name: isDefault, dtype: int64
********************
正常值 799434
异常值 566
Name: n11_outliers, dtype: int64
n11_outliers
异常值 112
正常值 159498
Name: isDefault, dtype: int64
********************
正常值 797585
异常值 2415
Name: n12_outliers, dtype: int64
n12_outliers
异常值 545
正常值 159065
Name: isDefault, dtype: int64
********************
正常值 788907
异常值 11093
Name: n13_outliers, dtype: int64
n13_outliers
异常值 2482
正常值 157128
Name: isDefault, dtype: int64
********************
正常值 788884
异常值 11116
Name: n14_outliers, dtype: int64
n14_outliers
异常值 3364
正常值 156246
Name: isDefault, dtype: int64
********************
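# The distribution of the outliers with respect to the target is close to the overall distribution. What would it mean if every outlier belonged to the isDefault = 1 group?
# Remove the outliers (training set only)
for fea in numerical_fea:
    data_train = data_train[data_train[fea+'_outliers']=='正常值']
    data_train = data_train.reset_index(drop=True)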
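Outlier detection method 2: box plot
In one sentence: the quartiles are three cut points that split the data into four intervals; IQR = Q3 − Q1, lower whisker = Q1 − 1.5 × IQR, upper whisker = Q3 + 1.5 × IQR. Values outside the whiskers are treated as outliers.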
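The box-plot rule can be written as a small helper. This is a minimal sketch; the function name `find_outliers_by_iqr` and the toy values are made up for illustration and mirror the 3-sigma helper above only in spirit.

```python
import pandas as pd

def find_outliers_by_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower whisker
    upper = q3 + k * iqr  # upper whisker
    return (series < lower) | (series > upper)

# Hypothetical values with one obvious extreme point
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
mask = find_outliers_by_iqr(values)
print(values[mask].tolist())
```

Unlike the 3-sigma rule, the quartiles are barely affected by the extreme values themselves, so this check tends to be more stable on skewed features.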
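3.4 Data binning
Purpose of feature binning:
In terms of model performance, feature binning mainly reduces the complexity of a variable, limits the effect of noise in the variable on the model, and strengthens the relationship between the independent variables and the dependent variable, which makes the model more stable.
What gets binned:
Continuous variables, which are discretized
Discrete variables with many states, which are merged into fewer states
Why bin:
The values of a feature may span a very wide range. For supervised and unsupervised methods alike, for example k-means clustering, which uses Euclidean distance as its similarity function between data points, large-valued features then drown out small-valued ones. One remedy is to quantize the values into intervals, known as data bucketing or data binning, and to use the quantized result.
Advantages of binning:
Handling missing values: when the data source may contain missing values, null can be treated as a bin of its own.
Handling outliers: when the data contain outliers, binning discretizes them, which makes the variable more robust (resistant to interference). For example, an anomalous age of 200 falls into the "age > 60" bin, which removes its influence.
Business interpretability: we tend to reason about variables linearly (as x grows, y grows), but the relationship between x and y is often nonlinear; in that case a WOE transformation can be applied after binning.
Pay particular attention to the basic principles of binning:
(1) The smallest bin should hold no less than 5% of the samples
(2) A bin must not consist entirely of good customers
(3) Consecutive bins should be monotonic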
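The three principles, together with the WOE transformation mentioned above, can be checked with a per-bin summary table. The sketch below makes several assumptions that the text does not spell out: the target is binary with 1 meaning a bad customer, monotonicity is measured on the per-bin bad rate, WOE is defined as ln(good share / bad share), and a bin is also rejected if it contains only bad customers (otherwise WOE is undefined). The helper names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def bin_report(bins: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Per-bin sample share, bad rate, and WOE for a binary target (1 = bad)."""
    frame = pd.DataFrame({"bin": bins, "bad": target})
    stats = frame.groupby("bin")["bad"].agg(n="count", bad="sum")
    stats["good"] = stats["n"] - stats["bad"]
    stats["share"] = stats["n"] / stats["n"].sum()
    stats["bad_rate"] = stats["bad"] / stats["n"]
    # Assumed convention: WOE = ln(good share / bad share); some texts flip the sign
    good_share = stats["good"] / stats["good"].sum()
    bad_share = stats["bad"] / stats["bad"].sum()
    stats["woe"] = np.log(good_share / bad_share)
    return stats

def check_bins(stats: pd.DataFrame, min_share: float = 0.05) -> dict:
    """Check the three binning principles on a bin_report table."""
    diffs = np.diff(stats["bad_rate"].to_numpy())
    return {
        "min_share_ok": bool(stats["share"].min() >= min_share),
        "no_pure_bin": bool(((stats["bad"] > 0) & (stats["good"] > 0)).all()),
        "monotonic": bool((diffs >= 0).all() or (diffs <= 0).all()),
    }

# Hypothetical toy data: three bins of ten customers each
bins = pd.Series([0] * 10 + [1] * 10 + [2] * 10)
target = pd.Series([1] + [0] * 9 + [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4)

stats = bin_report(bins, target)
checks = check_bins(stats)
print(stats)
print(checks)
```

In practice the `bins` column would come from a binning step such as `pd.qcut`, and adjacent bins would be merged until all three checks pass.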
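# Fixed-width binning
# When the values span several orders of magnitude, it is better to group by powers of 10 (or of any constant): 0-9, 10-99, 100-999, 1000-9999, and so on.
# Fixed-width binning is very easy to compute, but large gaps in the values leave many empty bins with no data.
# Note: `data` is whatever frame the previous loop left bound; apply the same lines to
# data_train and data_test_a to use these bins as model features.
# Map to evenly spaced bins by integer division; each bin covers a loanAmnt range of width 1000
data['loanAmnt_bin1'] = np.floor_divide(data['loanAmnt'], 1000)
## Map to exponential-width bins with a logarithm
data['loanAmnt_bin2'] = np.floor(np.log10(data['loanAmnt']))
# Quantile binning
data['loanAmnt_bin3'] = pd.qcut(data['loanAmnt'], 10, labels=False)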
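3.5 Feature interaction
Interaction features are very simple to construct but expensive to use. If a linear model includes pairwise interaction features, its training and scoring time grow from O(n) to O(n^2), where n is the number of single features.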
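The quadratic growth comes from the number of feature pairs, n(n-1)/2. The sketch below builds all pairwise products for a made-up four-column frame; the helper `add_pairwise_products` and the `_x_` naming are illustrative assumptions.

```python
from itertools import combinations

import pandas as pd

def add_pairwise_products(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Append the product of every unordered pair of columns in cols."""
    out = df.copy()
    for a, b in combinations(cols, 2):
        out[f"{a}_x_{b}"] = out[a] * out[b]
    return out

# Hypothetical toy frame with n = 4 single features
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
expanded = add_pairwise_products(toy, list(toy.columns))

n = toy.shape[1]
n_pairs = expanded.shape[1] - n
print(n, "features ->", n_pairs, "pairwise interactions")
```

With the 40-plus numeric columns in this dataset, the same construction would add several hundred columns, which is why interaction pairs are usually chosen selectively.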
# Target-mean features: mean of isDefault per category, computed on the training set
for col in ['grade', 'subGrade']:
    temp_dict = data_train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(columns={'mean': col + '_target_mean'})
    temp_dict.index = temp_dict[col].values
    temp_dict = temp_dict[col + '_target_mean'].to_dict()
    data_train[col + '_target_mean'] = data_train[col].map(temp_dict)
    data_test_a[col + '_target_mean'] = data_test_a[col].map(temp_dict)
# Other derived variables: grade relative to the group mean and group std
for df in [data_train, data_test_a]:
    for item in ['n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10','n11','n12','n13','n14']:
        df['grade_to_mean_' + item] = df['grade'] / df.groupby([item])['grade'].transform('mean')
        df['grade_to_std_' + item] = df['grade'] / df.groupby([item])['grade'].transform('std')
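The target-mean features above are computed from the same training rows they are attached to, so each row's own label contributes to its feature value. A common alternative, which is a different technique from the code above, is out-of-fold target encoding. The sketch below is one possible version; the function name `oof_target_mean`, the fold count, the fallback to the global mean, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_mean(train: pd.DataFrame, col: str, target: str,
                    n_splits: int = 5, seed: int = 2020) -> pd.Series:
    """Encode col with the target mean learned on the other folds only."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    global_mean = train[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Category means from the rows outside the current fold
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean
    return encoded.fillna(global_mean)

# Hypothetical toy data
toy = pd.DataFrame({
    "grade": [1, 1, 2, 2, 1, 2, 1, 2, 1, 2],
    "isDefault": [0, 0, 1, 0, 0, 1, 0, 1, 1, 1],
})
toy["grade_target_mean_oof"] = oof_target_mean(toy, "grade", "isDefault")
print(toy)
```

The test set would still be encoded with means computed on the full training set, as in the loop above.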
3.6 Feature encoding
# Label-encoded features can be fed directly into tree models
# Label-encode: subGrade, postCode, title
# High-cardinality categorical features need to be converted
for col in tqdm(['employmentTitle', 'postCode', 'title','subGrade']):
    le = LabelEncoder()
    # Fit on train and test together so every category has a code
    le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
    data_train[col] = le.transform(list(data_train[col].astype(str).values))
    data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
print('Label Encoding done')
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.44s/it]
Label Encoding done
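Additional feature engineering needed for models such as logistic regression
Normalize the features and remove highly correlated features
Normalization helps training converge better and faster, and keeps large-valued features from drowning out small-valued ones
Removing correlated features improves the interpretability of the model and speeds up prediction.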
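The two steps described above can be sketched as follows. This is an illustration under assumptions rather than the tutorial's own code: it uses `MinMaxScaler` for normalization, drops the later column of any pair whose absolute Pearson correlation exceeds a threshold of 0.9 (an arbitrary choice), and runs on made-up values. The helper name `scale_and_drop_correlated` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_and_drop_correlated(train: pd.DataFrame, test: pd.DataFrame, threshold: float = 0.9):
    """Min-max scale using training statistics, then drop highly correlated columns."""
    scaler = MinMaxScaler()
    train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
    # Reuse the training min/max on the test set; never refit on test data
    test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

    corr = train_scaled.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return train_scaled.drop(columns=to_drop), test_scaled.drop(columns=to_drop), to_drop

# Hypothetical toy frames; ficoRangeHigh is ficoRangeLow + 4, so the two are perfectly correlated
train = pd.DataFrame({
    "ficoRangeLow": [670.0, 685.0, 710.0, 715.0, 700.0],
    "ficoRangeHigh": [674.0, 689.0, 714.0, 719.0, 704.0],
    "dti": [21.40, 13.95, 33.50, 10.56, 24.97],
})
test = pd.DataFrame({
    "ficoRangeLow": [690.0, 705.0],
    "ficoRangeHigh": [694.0, 709.0],
    "dti": [18.0, 27.5],
})

train_out, test_out, dropped = scale_and_drop_correlated(train, test)
print(dropped)
print(train_out)
```

Tree models such as LightGBM do not need either step, which is why the cross-validation code in this section uses the raw features.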
features = [f for f in data_train.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
x_train = data_train[features]
x_test = data_test_a[features]
y_train = data_train['isDefault']
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    # train holds out-of-fold predictions; test holds the fold-averaged test predictions
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # Index the labels by position so the split works for any row index
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs':24,
                'silent': True,
                'verbose': -1,
            }
            # verbose_eval and early_stopping_rounds are arguments of lightgbm < 4.0;
            # later releases expect callbacks (lgb.log_evaluation, lgb.early_stopping) instead
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x , label=trn_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            # ntree_limit / best_ntree_limit belong to older xgboost releases;
            # xgboost 2.x uses iteration_range=(0, model.best_iteration + 1)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            # Booster.predict needs a DMatrix, not a DataFrame
            test_pred = model.predict(clf.DMatrix(test_x), ntree_limit=model.best_ntree_limit)
        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        train[valid_index] = val_pred
        # Accumulate so the test prediction is the average over all folds
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test
def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test
def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.749225 valid_1's auc: 0.729679
[400] training's auc: 0.765075 valid_1's auc: 0.730496
[600] training's auc: 0.778745 valid_1's auc: 0.730435
Early stopping, best iteration is:
[455] training's auc: 0.769202 valid_1's auc: 0.730686
[0.7306859913754798]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.749221 valid_1's auc: 0.731315
[400] training's auc: 0.765117 valid_1's auc: 0.731658
[600] training's auc: 0.778542 valid_1's auc: 0.731333
Early stopping, best iteration is:
[407] training's auc: 0.765671 valid_1's auc: 0.73173
[0.7306859913754798, 0.7317304414673989]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.748436 valid_1's auc: 0.732775
[400] training's auc: 0.764216 valid_1's auc: 0.733173
Early stopping, best iteration is:
[386] training's auc: 0.763261 valid_1's auc: 0.733261
[0.7306859913754798, 0.7317304414673989, 0.7332610441015461]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.749631 valid_1's auc: 0.728327
[400] training's auc: 0.765139 valid_1's auc: 0.728845
Early stopping, best iteration is:
[286] training's auc: 0.756978 valid_1's auc: 0.728976
[0.7306859913754798, 0.7317304414673989, 0.7332610441015461, 0.7289759386807912]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.748414 valid_1's auc: 0.732727
[400] training's auc: 0.763727 valid_1's auc: 0.733531
[600] training's auc: 0.777489 valid_1's auc: 0.733566
Early stopping, best iteration is:
[524] training's auc: 0.772372 valid_1's auc: 0.733772
[0.7306859913754798, 0.7317304414673989, 0.7332610441015461, 0.7289759386807912, 0.7337723979789789]
lgb_score_list: [0.7306859913754798, 0.7317304414673989, 0.7332610441015461, 0.7289759386807912, 0.7337723979789789]
lgb_score_mean: 0.7316851627208389
lgb_score_std: 0.0017424259863954693
x_test.head()
loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | annualIncome | verificationStatus | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | earliesCreditLine | title | policyCode | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | issueDateDT | grade_target_mean | subGrade_target_mean | grade_to_mean_n0 | grade_to_std_n0 | grade_to_mean_n1 | grade_to_std_n1 | grade_to_mean_n2 | grade_to_std_n2 | grade_to_mean_n3 | grade_to_std_n3 | grade_to_mean_n4 | grade_to_std_n4 | grade_to_mean_n5 | grade_to_std_n5 | grade_to_mean_n6 | grade_to_std_n6 | grade_to_mean_n7 | grade_to_std_n7 | grade_to_mean_n8 | grade_to_std_n8 | grade_to_mean_n9 | grade_to_std_n9 | grade_to_mean_n10 | grade_to_std_n10 | grade_to_mean_n11 | grade_to_std_n11 | grade_to_mean_n12 | grade_to_std_n12 | grade_to_mean_n13 | grade_to_std_n13 | grade_to_mean_n14 | grade_to_std_n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14000.0 | 3 | 10.99 | 458.28 | 2 | 7 | 226216 | 10.0 | 0 | 80000.0 | 0 | 0 | 72 | 21 | 10.56 | 1.0 | 715.0 | 719.0 | 17.0 | 0.0 | 0.0 | 9846.0 | 30.7 | 29.0 | 0 | 0 | 1974 | 0 | 1.0 | 1.0 | 4.0 | 6.0 | 6.0 | 6.0 | 8.0 | 4.0 | 15.0 | 19.0 | 6.0 | 17.0 | 0.0 | 0.0 | 1.0 | 3.0 | 2587 | 0.131210 | 0.128111 | 0.689391 | 1.584493 | 0.725914 | 1.572393 | 0.718408 | 1.572653 | 0.718408 | 1.572653 | 0.739029 | 1.536987 | 0.739042 | 1.552761 | 0.733574 | 1.563972 | 0.721477 | 1.464778 | 0.739010 | 1.541761 | 0.717902 | 1.571221 | 0.739364 | 1.546269 | 0.728272 | 1.544297 | 0.728493 | 1.544314 | 0.676117 | 1.613241 | 0.682304 | 1.531164 |
1 | 20000.0 | 5 | 14.65 | 472.14 | 3 | 14 | 218168 | 10.0 | 0 | 50000.0 | 0 | 2 | 152 | 8 | 21.40 | 2.0 | 670.0 | 674.0 | 5.0 | 0.0 | 0.0 | 8946.0 | 56.6 | 14.0 | 0 | 0 | 2001 | 18780 | 1.0 | 2.0 | 1.0 | 3.0 | 3.0 | 1.0 | 1.0 | 3.0 | 3.0 | 9.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 2.0 | 2952 | 0.224522 | 0.262219 | 1.032531 | 2.393318 | 1.094619 | 2.310323 | 1.139782 | 2.355686 | 1.139782 | 2.355686 | 1.025995 | 2.346520 | 0.949093 | 2.310742 | 1.089937 | 2.341953 | 1.069680 | 2.345655 | 1.071861 | 2.318389 | 1.138036 | 2.356084 | 1.074717 | 2.419804 | 1.092408 | 2.316446 | 1.092740 | 2.316471 | 1.009677 | 2.472946 | 1.104785 | 2.336065 |
2 | 12000.0 | 3 | 19.99 | 445.91 | 4 | 18 | 102813 | 2.0 | 1 | 60000.0 | 2 | 0 | 475 | 20 | 33.50 | 0.0 | 710.0 | 714.0 | 12.0 | 0.0 | 0.0 | 970.0 | 17.6 | 43.0 | 1 | 0 | 2006 | 0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 36.0 | 5.0 | 6.0 | 4.0 | 12.0 | 0.0 | 0.0 | 0.0 | 7.0 | 3410 | 0.304227 | 0.325175 | 1.480586 | 3.071121 | 1.459492 | 3.080431 | 1.481468 | 3.145597 | 1.481468 | 3.145597 | 1.367993 | 3.128694 | 1.265457 | 3.080989 | 1.314815 | 2.787252 | 1.463280 | 3.145348 | 1.399917 | 3.152224 | 1.479744 | 3.143083 | 1.452872 | 3.071997 | 1.456544 | 3.088594 | 1.456986 | 3.088628 | 1.462954 | 3.082872 | 1.176775 | 2.884269 |
3 | 17500.0 | 5 | 14.31 | 410.02 | 3 | 13 | 220769 | 4.0 | 0 | 37000.0 | 1 | 4 | 166 | 11 | 13.95 | 0.0 | 685.0 | 689.0 | 10.0 | 1.0 | 1.0 | 10249.0 | 52.3 | 18.0 | 0 | 0 | 2002 | 16334 | 1.0 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 | 7.0 | 2.0 | 8.0 | 14.0 | 2.0 | 10.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2710 | 0.224522 | 0.251584 | 1.110439 | 2.303341 | 1.103936 | 2.345575 | 1.142302 | 2.306262 | 1.142302 | 2.306262 | 1.112193 | 2.310972 | 1.118677 | 2.296251 | 1.090854 | 2.399042 | 1.085389 | 2.328604 | 1.106185 | 2.346539 | 1.140189 | 2.308880 | 1.089865 | 2.331753 | 1.092408 | 2.316446 | 1.092740 | 2.316471 | 1.097216 | 2.312154 | 1.023456 | 2.296746 |
4 | 35000.0 | 3 | 17.09 | 1249.42 | 4 | 15 | 192707 | 0.0 | 1 | 80000.0 | 1 | 0 | 19 | 8 | 24.97 | 0.0 | 685.0 | 689.0 | 19.0 | 0.0 | 0.0 | 33199.0 | 35.6 | 22.0 | 0 | 0 | 2000 | 0 | 1.0 | 0.0 | 8.0 | 11.0 | 11.0 | 9.0 | 11.0 | 3.0 | 16.0 | 18.0 | 11.0 | 19.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3775 | 0.304227 | 0.279444 | 1.480586 | 3.071121 | 1.394121 | 3.027637 | 1.337120 | 2.960070 | 1.337120 | 2.960070 | 1.501807 | 3.015502 | 1.495346 | 3.093075 | 1.453249 | 3.122604 | 1.462273 | 3.003880 | 1.485046 | 3.060551 | 1.340248 | 2.947138 | 1.468331 | 2.987120 | 1.456544 | 3.088594 | 1.456986 | 3.088628 | 1.462954 | 3.082872 | 1.562159 | 3.262761 |
data_test_a.columns
Index(['loanAmnt', 'term', 'interestRate', 'installment', 'grade', 'subGrade',
'employmentTitle', 'employmentLength', 'homeOwnership', 'annualIncome',
'verificationStatus', 'purpose', 'postCode', 'regionCode', 'dti',
'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14', 'issueDateDT',
'grade_target_mean', 'subGrade_target_mean', 'grade_to_mean_n0',
'grade_to_std_n0', 'grade_to_mean_n1', 'grade_to_std_n1',
'grade_to_mean_n2', 'grade_to_std_n2', 'grade_to_mean_n3',
'grade_to_std_n3', 'grade_to_mean_n4', 'grade_to_std_n4',
'grade_to_mean_n5', 'grade_to_std_n5', 'grade_to_mean_n6',
'grade_to_std_n6', 'grade_to_mean_n7', 'grade_to_std_n7',
'grade_to_mean_n8', 'grade_to_std_n8', 'grade_to_mean_n9',
'grade_to_std_n9', 'grade_to_mean_n10', 'grade_to_std_n10',
'grade_to_mean_n11', 'grade_to_std_n11', 'grade_to_mean_n12',
'grade_to_std_n12', 'grade_to_mean_n13', 'grade_to_std_n13',
'grade_to_mean_n14', 'grade_to_std_n14'],
dtype='object')
test_result_data = pd.read_csv('./RawData/testA.csv')
result = pd.DataFrame({'id':list(range(800000,800000+len(test_result_data),1)),'isDefault': lgb_test})
result.to_csv('./result.csv',index=None)