Zero-Basics Introduction to Financial Risk Control: Loan Default Prediction, Task03: Feature Engineering

Code My Life

Article link: https://blog.csdn.net/upon120/article/details/108718770

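I was fortunate to take part in the Zero-Basics Introduction to Financial Risk Control: Loan Default Prediction training camp run by Alibaba Cloud, and I learned a great deal.

Each day I write down the gaps in my own knowledge that I ran into, so that I can review them regularly.

The third learning task is feature engineering. There is a well-known saying in data science: "Feature engineering determines the upper limit of a model." That shows how important it is.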

I. Feature preprocessing

1. Filling missing values

First, check the missing values in the raw data:

# Count missing values per column
train.isnull().sum()

# Output:
id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n4                    33239
n5                    40270
n6                    40270
n7                    40270
n8                    40271
n9                    40270
n10                   33239
n11                   69752
n12                   40270
n13                   40270
n14                   40270
dtype: int64
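
Raw counts are easier to judge as fractions of the data. The following is a minimal sketch on a made-up toy frame (the name `demo` and its values are illustrative, not taken from the competition data) that reports the missing rate per column, worst first:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for train
demo = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "dti": [10.0, np.nan, 12.0, 13.0],
    "n11": [np.nan, np.nan, 0.0, np.nan],
})

# isnull().mean() gives the fraction of missing rows in each column
missing_rate = demo.isnull().mean()
# Keep only the columns that actually have gaps, largest rate first
missing_rate = missing_rate[missing_rate > 0].sort_values(ascending=False)
print(missing_rate)
```

The same two lines should work on `train` directly, assuming it is an ordinary pandas DataFrame.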

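For numerical features, missing values are usually filled with the mean or the median (the code here uses the median); for categorical features, they are usually filled with the mode:

# Fill numerical features with the training-set median
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
testA[numerical_fea] = testA[numerical_fea].fillna(train[numerical_fea].median())
# Fill categorical features with the training-set mode.
# mode() returns a DataFrame, so .iloc[0] takes its first row as a per-column Series;
# passing the DataFrame itself would only fill the row at index 0.
# Assumption: employmentLength is not in category_fea, since it is handled separately in step 3.
train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])
testA[category_fea] = testA[category_fea].fillna(train[category_fea].mode().iloc[0])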

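Note: both the training set and the test set should be filled with the training set's statistic (median or mode), so that missing values are treated the same way in both. If the training set were filled with its own statistic and the test set with its own, the same missing value would be replaced by different numbers in the two sets, which shifts their feature distributions relative to each other.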
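
The rule in the note can be sketched on toy data. Everything in this example is made up for illustration: the frames `train_demo` and `test_demo`, and the column lists `num_cols` and `cat_cols`, are hypothetical stand-ins for `train`, `testA`, `numerical_fea`, and `category_fea`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/testA frames (values are made up)
train_demo = pd.DataFrame({
    "dti": [10.0, 20.0, np.nan, 30.0],
    "grade": ["A", "B", "B", None],
})
test_demo = pd.DataFrame({
    "dti": [np.nan, 100.0],
    "grade": [None, "C"],
})

num_cols = ["dti"]    # hypothetical numerical feature list
cat_cols = ["grade"]  # hypothetical categorical feature list

# Compute the fill values once, from the training set only
num_fill = train_demo[num_cols].median()
# mode() returns a DataFrame (ties produce several rows); take the first row as a Series
cat_fill = train_demo[cat_cols].mode().iloc[0]

# Apply the same training-set values to both frames
for df in (train_demo, test_demo):
    df[num_cols] = df[num_cols].fillna(num_fill)
    df[cat_cols] = df[cat_cols].fillna(cat_fill)

print(test_demo)
```

The missing `dti` in `test_demo` is filled with 20.0 (the training median), not with 100.0, which is what the test set's own median would have given.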

Now check the missing values again:

train.isnull().sum()

# Output:
id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     0
policyCode                0
n0                        0
n1                        0
n2                        0
n4                        0
n5                        0
n6                        0
n7                        0
n8                        0
n9                        0
n10                       0
n11                       0
n12                       0
n13                       0
n14                       0
dtype: int64

As the output shows, only the employmentLength column is still unprocessed. It is harder to handle because it mixes values such as "5 years", "10+ years", and "< 1 year" with missing values.

2. Handling the date format

Use the datetime tools built into pandas:

import datetime
import pandas as pd

# Convert to datetime
for data in [train, testA]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # Build a time feature: days elapsed since the start date
    data['issueDateDT'] = data['issueDate'].apply(lambda x : x-startdate).dt.days
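
To make the resulting feature concrete, here is a small sketch on three made-up dates. It uses a vectorized subtraction instead of `apply`, which should give the same day counts; the frame `demo` and the name `start_date` are illustrative only.

```python
import pandas as pd

# Toy issue dates (made up); the real column is train['issueDate']
demo = pd.DataFrame({"issueDate": ["2007-06-01", "2007-07-01", "2008-06-01"]})
demo["issueDate"] = pd.to_datetime(demo["issueDate"], format="%Y-%m-%d")

start_date = pd.Timestamp("2007-06-01")
# Subtracting a Timestamp from a datetime column yields timedeltas;
# .dt.days extracts the whole number of days
demo["issueDateDT"] = (demo["issueDate"] - start_date).dt.days

print(demo["issueDateDT"].tolist())  # [0, 30, 366]
```

The last value is 366 rather than 365 because the interval includes 29 February 2008.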

Before converting employmentLength in the next step, look at its raw distribution:

train['employmentLength'].value_counts(dropna=False).sort_index()

# Output:
1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
NaN           46799
Name: employmentLength, dtype: int64

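3. Converting object-type features to numeric

def employmentLength_to_int(s):
    # Keep missing values as they are; otherwise take the leading number, e.g. '5 years' -> 5
    if pd.isnull(s):
        return s
    return int(s.split()[0])

for data in [train, testA]:
    # Map the two irregular labels onto the plain 'N years' pattern first.
    # Assigning the result back avoids in-place replace on a column selection.
    data['employmentLength'] = data['employmentLength'].replace(
        {'10+ years': '10 years', '< 1 year': '0 years'})
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)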

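Now look at the result after conversion:

# After the loop, `data` still refers to testA, so these counts are for the test set
data['employmentLength'].value_counts(dropna=False).sort_index()

# Output: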
0.0     15989
1.0     13182
2.0     18207
3.0     16011
4.0     11833
5.0     12543
6.0      9328
7.0      8823
8.0      8976
9.0      7594
10.0    65772
NaN     11742
Name: employmentLength, dtype: int64

The column is much cleaner now. Nice!
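
For comparison, the same conversion can be written with pandas string methods instead of a row-by-row `apply`. This is only a sketch on a made-up series; the helper name `parse_employment_length` is hypothetical and not part of the original tutorial.

```python
import numpy as np
import pandas as pd

def parse_employment_length(series: pd.Series) -> pd.Series:
    """Convert labels such as '5 years' to numbers; missing values stay NaN."""
    # Handle the two labels that do not follow the plain 'N years' pattern
    cleaned = series.replace({"10+ years": "10 years", "< 1 year": "0 years"})
    # Pull out the leading digits and convert; NaN passes through both steps
    return cleaned.str.extract(r"(\d+)", expand=False).astype(float)

demo = pd.Series(["< 1 year", "1 year", "5 years", "10+ years", np.nan])
result = parse_employment_length(demo)
print(result.tolist())
```

The replacement has to happen before the digit extraction: otherwise "< 1 year" would be read as 1 and become indistinguishable from "1 year".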
