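Tianchi Project Notes: Financial Risk Control, Loan Default Prediction, Task 3

Task03: Feature engineering

So far this covers only data preprocessing and first insights (basic preprocessing); further ideas will be added later. A scorecard model is also worth considering for analyzing this problem.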

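1. Time format processing

1.1 Converting the earliesCreditLine feature to a date type

Inspection of the raw data shows that 'earliesCreditLine' is stored as a string, so the date is held in an unstructured form. This step converts it into a structured date (a 'YYYY-MM-DD' string here, cast to datetime in section 1.2), which is easier for the model to use and for later feature engineering. The result is stored as 'earliesCreditLine_date'.

For example, 'Aug-2001' means August 2001. The code below converts each value to the '%Y-%m-%d' format, setting the day to the 1st of the month.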

# Convert earliesCreditLine (e.g. 'Aug-2001') to a 'YYYY-MM-01' date string
dic_month = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}

def get_month_year(s):
    # The first three characters are the month abbreviation, the last four the year
    month = dic_month[s[:3]]
    year = s[-4:]
    return year + '-' + month + '-01'


train_data['earliesCreditLine_date'] = train_data['earliesCreditLine'].apply(get_month_year)
test_data['earliesCreditLine_date'] = test_data['earliesCreditLine'].apply(get_month_year)
# The raw string column is no longer needed
train_data = train_data.drop(columns='earliesCreditLine')
test_data = test_data.drop(columns='earliesCreditLine')
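As an alternative sketch, pandas can parse 'Mon-YYYY' strings directly. This assumes every value follows the '%b-%Y' pattern with English month abbreviations (and an English locale); the small `demo` frame is made up for illustration and is not competition data.

```python
import pandas as pd

# Hypothetical sample in the same 'Mon-YYYY' layout as earliesCreditLine
demo = pd.DataFrame({'earliesCreditLine': ['Aug-2001', 'May-2002', 'Aug-1977']})

# '%b' matches the abbreviated month name, '%Y' the four-digit year;
# with no day field in the format, the day defaults to the 1st
demo['earliesCreditLine_date'] = pd.to_datetime(demo['earliesCreditLine'], format='%b-%Y')

parsed = demo['earliesCreditLine_date'].dt.strftime('%Y-%m-%d').tolist()
print(parsed)
```

Unlike the dictionary lookup, this yields a datetime column right away, so the later `pd.to_datetime` call would not be needed.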
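1.2 Building time features

The data has two time features: 'issueDate', the date the loan was issued, and 'earliesCreditLine', the month in which the borrower's earliest reported credit line was opened. A single point in time carries little meaning on its own, so we use intervals instead: after converting all time data to datetime, we subtract a fixed date well in the past (2007-06-01 for issueDate, 1950-01-01 for earliesCreditLine_date) and store the day counts as the new features 'issueDateDT' and 'earliesCreditLine_dateDT'. These dates can support further exploration later; for the baseline we stop here.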

import datetime
import pandas as pd
import matplotlib.pyplot as plt

# Convert issueDate to the number of days since a fixed start date
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')

train_data['issueDate'] = pd.to_datetime(train_data['issueDate'], format='%Y-%m-%d')
train_data['issueDateDT'] = (train_data['issueDate'] - startdate).dt.days

test_data['issueDate'] = pd.to_datetime(test_data['issueDate'], format='%Y-%m-%d')
test_data['issueDateDT'] = (test_data['issueDate'] - startdate).dt.days

# Compare the train and test distributions of the new feature
plt.hist(train_data['issueDateDT'], label='train');
plt.hist(test_data['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');

# Convert earliesCreditLine_date to the number of days since a fixed start date
startdate = datetime.datetime.strptime('1950-01-01', '%Y-%m-%d')

train_data['earliesCreditLine_date'] = pd.to_datetime(train_data['earliesCreditLine_date'], format='%Y-%m-%d')
train_data['earliesCreditLine_dateDT'] = (train_data['earliesCreditLine_date'] - startdate).dt.days

test_data['earliesCreditLine_date'] = pd.to_datetime(test_data['earliesCreditLine_date'], format='%Y-%m-%d')
test_data['earliesCreditLine_dateDT'] = (test_data['earliesCreditLine_date'] - startdate).dt.days

# Start a new figure so this histogram does not overlay the previous one
plt.figure();
plt.hist(train_data['earliesCreditLine_dateDT'], label='train');
plt.hist(test_data['earliesCreditLine_dateDT'], label='test');
plt.legend();
plt.title('Distribution of earliesCreditLine_dateDT dates');

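[Figure: histogram of issueDateDT in the training and test sets]
[Figure: histogram of earliesCreditLine_dateDT in the training and test sets]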
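The figures above plot the distributions of the two new time features in the training and test sets; the two sets are broadly consistent.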
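The day counts are easy to check by hand, and the same subtraction can produce other interval features. The sketch below uses two made-up rows in the same date format; `creditHistoryDays` is a hypothetical extra feature (days between the earliest credit line and the loan issue date) and is not part of the pipeline above.

```python
import pandas as pd

# Two illustrative rows in the same date format as the data
demo = pd.DataFrame({
    'issueDate': pd.to_datetime(['2014-07-01', '2016-03-01']),
    'earliesCreditLine_date': pd.to_datetime(['2001-08-01', '1977-08-01']),
})

# Days elapsed since each fixed reference date (vectorized, no apply needed)
demo['issueDateDT'] = (demo['issueDate'] - pd.Timestamp('2007-06-01')).dt.days
demo['earliesCreditLine_dateDT'] = (
    demo['earliesCreditLine_date'] - pd.Timestamp('1950-01-01')
).dt.days

# Hypothetical extra feature: length of the credit history when the loan was issued
demo['creditHistoryDays'] = (demo['issueDate'] - demo['earliesCreditLine_date']).dt.days

print(demo[['issueDateDT', 'earliesCreditLine_dateDT', 'creditHistoryDays']])
```

Whether such a difference helps the model is untested here; it is only one possible direction for the later exploration mentioned above.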

2. Feature classification

Building on the earlier work on data distributions, we now split the features more carefully into categorical and numerical features.

# employmentTitle: can be used as a numerical feature, or binned and used as a categorical feature; the former for now
# issueDate: date information
# earliesCreditLine: can be converted to date information
# policyCode, n11: take almost a single value, so they are dropped
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade',
                   'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
                   'annualIncome', 'verificationStatus',
                   'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
                   'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
                   'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
                   'initialListStatus', 'applicationType', 'title',
                   'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
                   'n9', 'n10', 'n12', 'n13', 'n14', 'issueDateDT', 'earliesCreditLine_dateDT']
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment',
                 'employmentTitle', 'annualIncome',
                 'postCode', 'regionCode', 'dti', 'delinquency_2years',
                 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
                 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
                 'initialListStatus', 'applicationType', 'title',
                 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
                 'n9', 'n10', 'n12', 'n13', 'n14', 'issueDateDT', 'earliesCreditLine_dateDT']
categorical_fea = ['grade', 'subGrade', 'employmentLength', 'homeOwnership', 'verificationStatus', 'purpose']
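One way to sanity-check such a split is to look at each column's unique-value count and the share of its most common value. The sketch below is only illustrative: `describe_columns`, the 0.99 threshold, and the `demo` frame are all assumptions made here, not part of the original notebook.

```python
import pandas as pd

def describe_columns(df, dominant_threshold=0.99):
    """Summarize each column: dtype, unique count, share of the most common value."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        top_share = counts.iloc[0] / len(df)
        rows.append({
            'column': col,
            'dtype': str(df[col].dtype),
            'n_unique': df[col].nunique(dropna=False),
            'top_share': top_share,
            # Columns dominated by one value carry little information
            'near_constant': top_share >= dominant_threshold,
        })
    return pd.DataFrame(rows).set_index('column')

# Made-up data: one constant column, one low-cardinality column, one continuous column
demo = pd.DataFrame({
    'policyCode': [1.0] * 200,
    'grade': ['A', 'B', 'C', 'D'] * 50,
    'loanAmnt': [1000.0 + 10 * i for i in range(200)],
})
summary = describe_columns(demo)
print(summary)
```

Columns flagged as `near_constant` are candidates for dropping, and a low `n_unique` hints that a numerically stored column may be better treated as categorical.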

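3. Categorical feature encoding (label encoding)

Since tree models will be tried first, we avoid making the data too sparse and preprocess the categorical features with label encoding for now. Which encoding works best, and which features should instead be treated as numerical, remains to be tested; this step can be adjusted when switching to a different model.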

import pandas as pd
from sklearn.preprocessing import LabelEncoder

for col in ['grade', 'subGrade', 'employmentLength']:
    # Cast to str so that missing employmentLength values become the label 'nan'
    train_data[col] = train_data[col].astype(str)
    test_data[col] = test_data[col].astype(str)
    # Fit a single encoder on the values of both sets, so that each category
    # maps to the same integer in train and test (separate fits can disagree
    # whenever the two sets do not contain exactly the same categories)
    le = LabelEncoder()
    le.fit(pd.concat([train_data[col], test_data[col]]))
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])
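LabelEncoder assigns codes in sorted string order, so a label such as '10+ years' lands between '1 year' and '2 years'. If the ordering matters, an explicit mapping is one option. The sketch below assumes the category strings are '< 1 year', '1 year', '2 years' through '9 years', and '10+ years'; these should be checked against the actual values of 'employmentLength', and `length_map` is a name introduced here for illustration.

```python
import pandas as pd

# Assumed category strings, listed in their natural order; verify against the data
order = ['< 1 year', '1 year'] + [f'{i} years' for i in range(2, 10)] + ['10+ years']
length_map = {label: code for code, label in enumerate(order)}

# Made-up sample including a missing value
demo = pd.Series(['2 years', '10+ years', None, '< 1 year'], name='employmentLength')

# Values not in the mapping (including missing ones) become -1 so they stay distinguishable
encoded = demo.map(length_map).fillna(-1).astype(int)
print(encoded.tolist())
```

For tree models the difference may be small, but an order-preserving code lets a single split separate short from long employment.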

Finally, here is the processed data that is fed to the baseline model:

train_data.head()
(output transposed for readability: one line per column, one column per sample row)
                          0           1           2           3           4
id                        0           1           2           3           4
loanAmnt                  35000.0     18000.0     12000.0     11000.0     3000.0
term                      5           5           5           3           3
interestRate              19.52       18.49       16.99       7.26        12.99
installment               917.97      461.90      298.17      340.96      101.07
grade                     4           3           3           0           2
subGrade                  21          16          17          3           11
employmentTitle           320.0       219843.0    31698.0     46854.0     54.0
employmentLength          2           5           8           1           1
homeOwnership             2           0           0           1           1
annualIncome              110000.0    46000.0     74000.0     118000.0    29000.0
verificationStatus        2           2           2           1           2
issueDate                 2014-07-01  2012-08-01  2015-10-01  2015-08-01  2016-03-01
isDefault                 1           0           0           0           0
purpose                   1           0           0           4           10
postCode                  137.0       156.0       337.0       148.0       301.0
regionCode                32          18          14          11          21
dti                       17.05       27.83       22.77       17.21       32.16
delinquency_2years        0.0         0.0         0.0         0.0         0.0
ficoRangeLow              730.0       700.0       675.0       685.0       690.0
ficoRangeHigh             734.0       704.0       679.0       689.0       694.0
openAcc                   7.0         13.0        11.0        9.0         12.0
pubRec                    0.0         0.0         0.0         0.0         0.0
pubRecBankruptcies        0.0         0.0         0.0         0.0         0.0
revolBal                  24178.0     15096.0     4606.0      9948.0      2942.0
revolUtil                 48.9        38.9        51.8        52.6        32.0
totalAcc                  27.0        18.0        27.0        28.0        27.0
initialListStatus         0           1           0           1           0
applicationType           0           0           0           0           0
title                     1.0         1723.0      0.0         4.0         11.0
policyCode                1.0         1.0         1.0         1.0         1.0
n0                        0.0         NaN         0.0         6.0         1.0
n1                        2.0         NaN         0.0         4.0         2.0
n2                        2.0         NaN         3.0         6.0         7.0
n2.1                      2.0         NaN         3.0         6.0         7.0
n4                        4.0         10.0        0.0         4.0         2.0
n5                        9.0         NaN         0.0         16.0        4.0
n6                        8.0         NaN         21.0        4.0         9.0
n7                        4.0         NaN         4.0         7.0         10.0
n8                        12.0        NaN         5.0         21.0        15.0
n9                        2.0         NaN         3.0         6.0         7.0
n10                       7.0         13.0        11.0        9.0         12.0
n11                       0.0         NaN         0.0         0.0         0.0
n12                       0.0         NaN         0.0         0.0         0.0
n13                       0.0         NaN         0.0         0.0         0.0
n14                       2.0         NaN         4.0         1.0         4.0
issueDateDT               2587        1888        3044        2983        3196
earliesCreditLine_date    2001-08-01  2002-05-01  2006-05-01  1999-05-01  1977-08-01
earliesCreditLine_dateDT  18840       19113       20574       18017       10074