Loan Default Prediction: Task 3

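This part covers feature engineering on the sample data. It draws on some basic feature-engineering knowledge and walks through the standard workflow code.
1. Filling missing values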

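# train_data, test_data, numerical_fea and category_fea are assumed to be defined in the earlier tasks
import numpy as np
import pandas as pd

# Fill numeric features with the median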
train_data[numerical_fea] = train_data[numerical_fea].fillna(train_data[numerical_fea].median())
test_data[numerical_fea] = test_data[numerical_fea].fillna(test_data[numerical_fea].median())

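# Fill categorical features with the mode
# (mode() returns a DataFrame, so take its first row to get one value per column)
train_data[category_fea] = train_data[category_fea].fillna(train_data[category_fea].mode().iloc[0])
test_data[category_fea] = test_data[category_fea].fillna(test_data[category_fea].mode().iloc[0])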
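The code above computes the fill values separately for the training and test sets. A common alternative is to learn the median and mode from the training set only and reuse them on the test set, so both sets are filled consistently. The following is a minimal sketch of that variant on toy data; the frames `train` and `test`, the column lists, and all values are hypothetical stand-ins, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_data / test_data (hypothetical values).
train = pd.DataFrame({
    "loanAmnt": [1000.0, np.nan, 3000.0, 5000.0],
    "grade": ["A", "B", None, "B"],
})
test = pd.DataFrame({
    "loanAmnt": [np.nan, 2000.0],
    "grade": [None, "C"],
})

num_cols = ["loanAmnt"]
cat_cols = ["grade"]

# Learn the fill values from the training set only.
num_fill = train[num_cols].median()        # Series indexed by column name
cat_fill = train[cat_cols].mode().iloc[0]  # first row of mode(): one value per column

# Apply the same fill values to both sets.
for df in (train, test):
    df[num_cols] = df[num_cols].fillna(num_fill)
    df[cat_cols] = df[cat_cols].fillna(cat_fill)

print(train)
print(test)
```

Passing a Series indexed by column name to `fillna` fills each column with its own value, which is why the first row of `mode()` is taken rather than the whole DataFrame it returns.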

2. Converting object-type features to numeric values

for data in [train_data, test_data]:
    # Normalize the two irregular labels; assign the result back instead of
    # calling replace(..., inplace=True) on a selected column
    data['employmentLength'] = data['employmentLength'].replace({'10+ years': '10 years', '< 1 year': '0 years'})
    # Keep missing values as they are; otherwise take the leading number of years
    data['employmentLength'] = data['employmentLength'].apply(lambda x: x if pd.isnull(x) else int(x.split()[0]))

train_data['earliesCreditLine'] = train_data['earliesCreditLine'].apply(lambda x: int(x[-4:]))
test_data['earliesCreditLine'] = test_data['earliesCreditLine'].apply(lambda x: int(x[-4:]))
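To make the two conversions above easier to check, here is a small sketch on toy Series. The helper `employment_length_to_int` and the sample values are hypothetical, and the date strings assume a 'Mon-YYYY' layout whose last four characters are the year.

```python
import numpy as np
import pandas as pd

def employment_length_to_int(value):
    """Map strings such as '10+ years' or '< 1 year' to a number of years."""
    if pd.isnull(value):
        return np.nan            # leave missing values untouched
    if value == "10+ years":
        return 10                # cap at 10, as in the replacement above
    if value == "< 1 year":
        return 0
    return int(value.split()[0])  # e.g. '3 years' -> 3

sample = pd.Series(["10+ years", "< 1 year", "3 years", np.nan, "1 year"])
converted = sample.apply(employment_length_to_int)
print(converted)

# Assumed 'Mon-YYYY' format: keep only the four-digit year.
credit_line = pd.Series(["Aug-2001", "May-1992"])
credit_year = credit_line.str[-4:].astype(int)
print(credit_year)
```

Because the column still contains missing values after conversion, pandas stores the result as floating point rather than as an integer type.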

# Split numeric features into continuous ones and discrete ones (at most 10 distinct values)
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train_data,numerical_fea)

for data in [train_data,test_data]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})

3. Handling outliers

def find_outliers_by_3sigm(data,fea):
    # Flag values that lie more than three standard deviations from the mean
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    data[fea+'_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data
for fea in numerical_fea:
    data1 = find_outliers_by_3sigm(train_data,fea)
    print(data1[fea+'_outliers'].value_counts())
# Drop the rows flagged as outliers
for fea in numerical_fea:
    train_data = train_data[train_data[fea+'_outliers'] == 'normal']
    train_data = train_data.reset_index(drop=True)
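The three-sigma rule used above can also be written as a vectorized boolean mask, which avoids adding a label column per feature. This is only a sketch on toy data: the function name `three_sigma_mask`, its `n_sigma` parameter, and the sample values are hypothetical.

```python
import pandas as pd

def three_sigma_mask(series, n_sigma=3.0):
    """Return True for values farther than n_sigma standard deviations from the mean."""
    mean = series.mean()
    std = series.std(ddof=0)  # population standard deviation, the same default as np.std
    lower = mean - n_sigma * std
    upper = mean + n_sigma * std
    return (series < lower) | (series > upper)

# Twenty ordinary values and one extreme value (hypothetical data).
values = pd.Series([10.0] * 20 + [1000.0])
mask = three_sigma_mask(values)

# Keep only the rows that are not flagged, then rebuild the index.
cleaned = values[~mask].reset_index(drop=True)
print(mask.sum(), len(cleaned))
```

Comparisons against NaN evaluate to False, so missing values are never flagged by this mask and need to be filled or dropped separately.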