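Learning objectives
- Learn feature-processing techniques: preprocessing, missing-value handling, outlier handling, and feature binning
- Learn the corresponding methods for feature interaction, encoding, and selection
Learning process
Loading the data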
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
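# An absolute path is used here; path is the directory where the data files are stored
data_train = pd.read_csv(path + '/train.csv')
data_test_a = pd.read_csv(path + '/testA.csv')
# The code below refers to the training set as train
train = data_train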
Outlier handling
Feature preprocessing
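# Columns whose dtype is not object are treated as numerical features
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
# Columns with object dtype (strings) are treated as categorical features
category_fea = list(train.select_dtypes(include=['object']).columns)
# isDefault is the label, so it is removed from the numerical feature list
label = 'isDefault'
numerical_fea.remove(label)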
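To make the split above concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration; only the column names are borrowed from the data set, and the `toy_` names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with the same kinds of columns as the training set
toy = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0],
    "grade": ["E", "D", "D"],
    "employmentLength": ["2 years", "5 years", None],
    "isDefault": [1, 0, 0],
})
# Force object dtype on the string columns so the example behaves the same
# regardless of how the installed pandas version infers string columns
toy = toy.astype({"grade": object, "employmentLength": object})

# Object-dtype columns are treated as categorical, everything else as numerical
toy_category_fea = list(toy.select_dtypes(include=["object"]).columns)
toy_numerical_fea = list(toy.select_dtypes(exclude=["object"]).columns)

# The label is numeric as well, so it has to be removed from the feature list
toy_numerical_fea.remove("isDefault")

print(toy_numerical_fea)
print(toy_category_fea)
```

Note that this split relies purely on dtype: a numerically coded category such as `grade` stored as integers would land in the numerical list, so the lists may still need manual adjustment.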
Filling missing values
Use train.isnull().sum() to check the number of missing values in each column. The result is as follows:
id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n3 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
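Raw counts like these are easier to compare as a share of all rows. The following is a minimal sketch on a toy DataFrame (hypothetical values, hypothetical variable names) that lists only the columns with missing values, together with their missing ratio.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); NaN marks a missing entry
toy = pd.DataFrame({
    "dti": [17.05, np.nan, 22.77, 17.21],
    "n0": [0.0, np.nan, np.nan, 6.0],
    "grade": ["E", "D", "D", "A"],
})

# isnull() gives a boolean table; sum() counts the True values per column,
# mean() gives the fraction of missing rows per column
missing_count = toy.isnull().sum()
missing_ratio = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
# Keep only columns that actually have missing values, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```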
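The result shows that n0-n14 and employmentLength have many missing values, while employmentTitle, postCode, dti, pubRecBankruptcies, revolUtil, and title have only a few. The approach taken here is to fill missing values in numerical variables with the median, and missing values in categorical variables with the mode.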
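# Numerical features: fill with the median of each column
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
# Categorical features: fill with the mode of each column.
# mode() returns a DataFrame, so take its first row to get one value per column;
# passing the DataFrame itself would only fill the rows whose index matches.
train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])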
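One detail is easy to miss here: `DataFrame.median()` returns a Series indexed by column name, whereas `DataFrame.mode()` returns a DataFrame (a column can have several modes), so the first row has to be selected before it is passed to `fillna`. The sketch below shows this on toy data. It also fills a toy test set with statistics computed on the training set only. That step is an assumption about how the test set would be treated, since the text above only fills the training set, and all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy train/test sets (hypothetical values)
toy_train = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0, 20.0],
    "employmentLength": pd.Series(["2 years", None, "2 years", "5 years"], dtype=object),
})
toy_test = pd.DataFrame({
    "dti": [np.nan, 15.0],
    "employmentLength": pd.Series([None, None], dtype=object),
})

num_cols = ["dti"]
cat_cols = ["employmentLength"]

# Fill values are computed on the training set only
medians = toy_train[num_cols].median()      # Series: one median per column
modes = toy_train[cat_cols].mode().iloc[0]  # first row of the mode table: one mode per column

# Apply the same fill values to both sets
for df in (toy_train, toy_test):
    df[num_cols] = df[num_cols].fillna(medians)
    df[cat_cols] = df[cat_cols].fillna(modes)

print(toy_train)
print(toy_test)
```

Reusing the training-set statistics for the test set keeps both sets on the same footing and avoids deriving anything from test data.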
To be continued…