Financial Risk Control Task 3

阿德罗斯

Category: Datawhale. Tags: python, data analysis

Article link: https://blog.csdn.net/qq_37393071/article/details/108721920

Contents

Learning objectives
Learning process

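Learning objectives

Learn feature-processing methods: feature preprocessing, missing-value handling, outlier handling, and feature binning
Learn the corresponding methods for feature interaction, encoding, and selection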

Learning process

Reading the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')

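## I read the data using an absolute path here; path is the directory where the data is stored
## The training set is named train so that it matches the code that follows
train = pd.read_csv(path + '/train.csv')
data_test_a = pd.read_csv(path + '/testA.csv')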

Outlier handling

Feature preprocessing

numerical_fea = list(train.select_dtypes(exclude = ['object']).columns)
category_fea = list(train.select_dtypes(include = ['object']).columns)
label = 'isDefault'
numerical_fea.remove(label)
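
The split above can be tried on a small frame first. The following is a minimal sketch: the values in toy are made up for illustration, and only the column names are borrowed from the dataset. The text columns are created with dtype=object explicitly so that the object-based selection behaves the same way here as in the code above.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0],
    "grade": pd.Series(["E", "D", "D"], dtype=object),
    "employmentLength": pd.Series(["2 years", "5 years", None], dtype=object),
    "isDefault": [1, 0, 0],
})

label = "isDefault"

# Columns that are not of object dtype are treated as numerical features.
numerical_fea = list(toy.select_dtypes(exclude=["object"]).columns)
# Columns of object dtype are treated as categorical features.
category_fea = list(toy.select_dtypes(include=["object"]).columns)

# The label is numeric, so it must be removed from the feature list.
numerical_fea.remove(label)

print(numerical_fea)
print(category_fea)
```

Note that this split is based purely on dtype, so a category stored as an integer code would land in numerical_fea.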

Filling missing values

Use train.isnull().sum() to inspect the missing values; the result is as follows:

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n3                    40270
n4                    33239
n5                    40270
n6                    40270
n7                    40270
n8                    40271
n9                    40270
n10                   33239
n11                   69752
n12                   40270
n13                   40270
n14                   40270
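
With this many columns, it can help to list only the columns that actually have missing values, together with their missing ratio. The following is a minimal sketch on a hypothetical toy frame; the names toy and summary are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0, 11000.0],
    "dti": [17.05, np.nan, 22.77, 17.21],
    "n11": [np.nan, np.nan, 0.0, np.nan],
})

# Count of missing values per column, as with train.isnull().sum().
missing_count = toy.isnull().sum()
# Fraction of missing values per column (the mean of a boolean mask).
missing_ratio = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
# Keep only columns with at least one missing value, most affected first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(summary)
```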

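The result shows that n0-n14 and employmentLength have many missing values, while employmentTitle, postCode, dti, pubRecBankruptcies, revolUtil, and title have only a few. The approach used here is to fill missing values of numerical variables with the median and missing values of categorical variables with the mode.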

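# Fill numerical features with the per-column median
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
# mode() returns a DataFrame; take its first row so fillna fills each column with its mode
train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])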
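
The median and mode filling strategy can be checked on a small frame. The following is a minimal sketch with made-up values in toy. One detail matters: DataFrame.mode() returns a DataFrame, because a column can have several modes. Passing it directly to fillna aligns on the row index and would fill only row 0, so the sketch takes the first row with .iloc[0] to get one mode per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0, 20.0],
    "employmentLength": pd.Series(
        ["2 years", None, "2 years", "5 years"], dtype=object
    ),
})
numerical_fea = ["dti"]
category_fea = ["employmentLength"]

# Numerical features: fill with the per-column median (NaN is skipped).
toy[numerical_fea] = toy[numerical_fea].fillna(toy[numerical_fea].median())

# Categorical features: mode() gives a DataFrame, so take its first row
# to obtain a Series with one mode per column.
toy[category_fea] = toy[category_fea].fillna(toy[category_fea].mode().iloc[0])

print(toy)
```

When a competition test set is filled as well, a common choice is to reuse the statistics computed on the training set rather than recomputing them on the test set.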