Financial Risk Control Training Camp: Feature Engineering Study Notes

Latest recommended article published on 2021-05-05 20:46:08

ic_mathy

Tags: data analysis

Link to this article: https://blog.csdn.net/ic_mathy/article/details/116245810

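I. Overview of Topics Covered

Data preprocessing
- Filling missing values
- Handling the date format
- Converting object-type features to numeric values
Outlier handling
- Based on the 3-sigma rule (standard deviation)
Feature processing
- Data binning
- Feature interaction
- Feature encoding
Feature selection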

II. Study Content

1. Importing packages

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb

import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')

2. Loading the data

In [2]:

train=pd.read_csv('C://Users//Administrator//Desktop//train.csv')
testA=pd.read_csv('C://Users//Administrator//Desktop//testA.csv')

3. Data preprocessing

3.1 Separating numerical and categorical features

In [3]:

numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(train.columns)))
# isDefault is the label, not a feature
numerical_fea.remove('isDefault')
print("Numerical features:\n",numerical_fea)
print("Categorical features:\n",category_fea)

Numerical features:
 ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
Categorical features:
 ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

3.2 Filling missing values

3.2.1 Filling numerical features with the median

In [4]:

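# Fill the test set with statistics computed on the training set
# Fill numerical features with the median
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
testA[numerical_fea] = testA[numerical_fea].fillna(train[numerical_fea].median())
# Fill categorical features with the mode (left disabled here);
# mode() returns a DataFrame, so its first row is taken with .iloc[0]
#train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])
#testA[category_fea] = testA[category_fea].fillna(train[category_fea].mode().iloc[0])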

Note: employmentLength is an object-type column, so the numerical fill above does not touch it and it still has missing values.

In [5]:

train.isnull().sum()

Out[5]:

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     0
policyCode                0
n0                        0
n1                        0
n2                        0
n3                        0
n4                        0
n5                        0
n6                        0
n7                        0
n8                        0
n9                        0
n10                       0
n11                       0
n12                       0
n13                       0
n14                       0
dtype: int64

In [6]:

train['employmentLength'].value_counts(dropna=False).sort_index()

Out[6]:

1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
NaN           46799
Name: employmentLength, dtype: int64

3.2.2 Filling categorical features

Convert employmentLength to int8, then fill it with the median

In [7]:

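def employmentLength_to_int(s):
    # Keep missing values as they are so they can be filled later
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])       # the part before the first space, e.g. '10' from '10 years'
for data in [train, testA]:
    # Normalize the two special labels so every value reads '<number> years'
    data['employmentLength'] = data['employmentLength'].replace('10+ years', '10 years')
    data['employmentLength'] = data['employmentLength'].replace('< 1 year', '0 years')
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)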

In [8]:

train['employmentLength'] = train['employmentLength'].fillna(train['employmentLength'].median())
testA['employmentLength'] = testA['employmentLength'].fillna(train['employmentLength'].median())
# `data` still refers to testA after the loop above, so these counts describe the test set
data['employmentLength'].value_counts(dropna=False).sort_index()

Out[8]:

0.0     15989
1.0     13182
2.0     18207
3.0     16011
4.0     11833
5.0     12543
6.0     21070
7.0      8823
8.0      8976
9.0      7594
10.0    65772
Name: employmentLength, dtype: int64

3.3 Processing issueDate

In [9]:

# Convert to datetime
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
for data in [train, testA]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d')
    # Build a time feature: days elapsed since the reference date 2007-06-01
    data['issueDateDT'] = (data['issueDate'] - startdate).dt.days

3.4 Processing earliesCreditLine

In [10]:

## Sample 5 rows at random
train['earliesCreditLine'].sample(5)

Out[10]:

141601    Aug-2008
187415    Nov-1997
686069    Nov-1997
790315    Apr-2003
451221    Jun-1993
Name: earliesCreditLine, dtype: object

In [11]:

## Keep only the year (the last four characters)
for data in [train, testA]:
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))

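Note: check the number of distinct values of each numerical feature to decide whether to bin it later. loanAmnt, installment, employmentTitle, annualIncome, dti, revolBal, and title have many distinct values.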

In [12]:

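# Number of distinct values per numerical feature
# (`data` is testA here, the last value of the earlier loop variable)
for f in numerical_fea:
    print(f, 'unique values:', data[f].nunique())

id unique values: 200000
loanAmnt unique values: 1444
term unique values: 2
interestRate unique values: 597
installment unique values: 41575
employmentTitle unique values: 79282
homeOwnership unique values: 6
annualIncome unique values: 15530
verificationStatus unique values: 3
purpose unique values: 14
postCode unique values: 889
regionCode unique values: 51
dti unique values: 4816
delinquency_2years unique values: 23
ficoRangeLow unique values: 39
ficoRangeHigh unique values: 39
openAcc unique values: 66
pubRec unique values: 22
pubRecBankruptcies unique values: 10
revolBal unique values: 46395
revolUtil unique values: 1145
totalAcc unique values: 113
initialListStatus unique values: 2
applicationType unique values: 2
title unique values: 12058
policyCode unique values: 1
n0 unique values: 30
n1 unique values: 28
n2 unique values: 42
n3 unique values: 42
n4 unique values: 45
n5 unique values: 56
n6 unique values: 86
n7 unique values: 58
n8 unique values: 87
n9 unique values: 39
n10 unique values: 65
n11 unique values: 4
n12 unique values: 4
n13 unique values: 22
n14 unique values: 27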

3.5 Processing grade

In [13]:

## Categories such as grade have a natural order, so they can be label-encoded or mapped by hand
for data in [train, testA]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})

In [14]:

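### One-hot encoding
# Note: the result is only bound to the loop variable `data`, so train and testA
# keep their original columns; the later cells rely on those original columns.
for data in [train, testA]:
    data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)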

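4. Outlier handling

Outlier detection method 1: standard deviation. In statistics, if a data distribution is approximately normal, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.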
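The three percentages can be checked directly from the normal distribution. The short sketch below is an illustration added for this purpose and is not part of the original notebook; `normal_coverage` is a hypothetical helper name.

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_coverage(k):.2%}")
```

Values outside the three-standard-deviation band are rare under a normal distribution, which is the reasoning behind using it as an outlier cut-off.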

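Outlier detection method 2: box plot. The quartiles are three points that split the data into four intervals. IQR = Q3 − Q1, lower whisker = Q1 − 1.5 × IQR, upper whisker = Q3 + 1.5 × IQR.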
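The notebook only implements the first method. As a minimal sketch of the box-plot rule, assuming a hypothetical helper `find_outliers_by_iqr` and a small made-up series, the whisker formulas above could be applied like this:

```python
import pandas as pd

def find_outliers_by_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_whisker = q1 - k * iqr
    upper_whisker = q3 + k * iqr
    return (series < lower_whisker) | (series > upper_whisker)

# Made-up values: six ordinary observations and one extreme one
toy = pd.Series([10, 12, 11, 13, 12, 11, 100])
mask = find_outliers_by_iqr(toy)
print(toy[mask].tolist())
```

Unlike the 3-sigma rule, the quartiles are barely affected by the extreme value itself, so this rule tends to behave better on skewed features.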

In [15]:

def find_outliers_by_3segama(data,fea):
    # Flag values more than 3 standard deviations away from the mean
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    # The labels stay in Chinese to match the output below: '异常值' = outlier, '正常值' = normal
    data[fea+'_outliers'] = data[fea].apply(lambda x: '异常值' if x > upper_rule or x < lower_rule else '正常值')
    return data

In [16]:

for fea in numerical_fea:
    train = find_outliers_by_3segama(train,fea)
    print(train[fea+'_outliers'].value_counts())
    print(train.groupby(fea+'_outliers')['isDefault'].sum())
    print('*'*10)

正常值    800000
Name: id_outliers, dtype: int64
id_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    800000
Name: loanAmnt_outliers, dtype: int64
loanAmnt_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    800000
Name: term_outliers, dtype: int64
term_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    794259
异常值      5741
Name: interestRate_outliers, dtype: int64
interestRate_outliers
异常值      2916
正常值    156694
Name: isDefault, dtype: int64
**********
正常值    792046
异常值      7954
Name: installment_outliers, dtype: int64
installment_outliers
异常值      2152
正常值    157458
Name: isDefault, dtype: int64
**********
正常值    800000
Name: employmentTitle_outliers, dtype: int64
employmentTitle_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    799701
异常值       299
Name: homeOwnership_outliers, dtype: int64
homeOwnership_outliers
异常值        62
正常值    159548
Name: isDefault, dtype: int64
**********
正常值    793973
异常值      6027
Name: annualIncome_outliers, dtype: int64
annualIncome_outliers
异常值       756
正常值    158854
Name: isDefault, dtype: int64
**********
正常值    800000
Name: verificationStatus_outliers, dtype: int64
verificationStatus_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    783003
异常值     16997
Name: purpose_outliers, dtype: int64
purpose_outliers
异常值      3635
正常值    155975
Name: isDefault, dtype: int64
**********
正常值    798931
异常值      1069
Name: postCode_outliers, dtype: int64
postCode_outliers
异常值       221
正常值    159389
Name: isDefault, dtype: int64
**********
正常值    799994
异常值         6
Name: regionCode_outliers, dtype: int64
regionCode_outliers
异常值         1
正常值    159609
Name: isDefault, dtype: int64
**********
正常值    798440
异常值      1560
Name: dti_outliers, dtype: int64
dti_outliers
异常值       466
正常值    159144
Name: isDefault, dtype: int64
**********
正常值    778245
异常值     21755
Name: delinquency_2years_outliers, dtype: int64
delinquency_2years_outliers
异常值      5089
正常值    154521
Name: isDefault, dtype: int64
**********
正常值    788261
异常值     11739
Name: ficoRangeLow_outliers, dtype: int64
ficoRangeLow_outliers
异常值       778
正常值    158832
Name: isDefault, dtype: int64
**********
正常值    788261
异常值     11739
Name: ficoRangeHigh_outliers, dtype: int64
ficoRangeHigh_outliers
异常值       778
正常值    158832
Name: isDefault, dtype: int64
**********
正常值    790889
异常值      9111
Name: openAcc_outliers, dtype: int64
openAcc_outliers
异常值      2195
正常值    157415
Name: isDefault, dtype: int64
**********
正常值    792471
异常值      7529
Name: pubRec_outliers, dtype: int64
pubRec_outliers
异常值      1701
正常值    157909
Name: isDefault, dtype: int64
**********
正常值    794120
异常值      5880
Name: pubRecBankruptcies_outliers, dtype: int64
pubRecBankruptcies_outliers
异常值      1423
正常值    158187
Name: isDefault, dtype: int64
**********
正常值    790001
异常值      9999
Name: revolBal_outliers, dtype: int64
revolBal_outliers
异常值      1359
正常值    158251
Name: isDefault, dtype: int64
**********
正常值    799948
异常值        52
Name: revolUtil_outliers, dtype: int64
revolUtil_outliers
异常值        23
正常值    159587
Name: isDefault, dtype: int64
**********
正常值    791663
异常值      8337
Name: totalAcc_outliers, dtype: int64
totalAcc_outliers
异常值      1668
正常值    157942
Name: isDefault, dtype: int64
**********
正常值    800000
Name: initialListStatus_outliers, dtype: int64
initialListStatus_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    784586
异常值     15414
Name: applicationType_outliers, dtype: int64
applicationType_outliers
异常值      3875
正常值    155735
Name: isDefault, dtype: int64
**********
正常值    775134
异常值     24866
Name: title_outliers, dtype: int64
title_outliers
异常值      3900
正常值    155710
Name: isDefault, dtype: int64
**********
正常值    800000
Name: policyCode_outliers, dtype: int64
policyCode_outliers
正常值    159610
Name: isDefault, dtype: int64
**********
正常值    782773
异常值     17227
Name: n0_outliers, dtype: int64
n0_outliers
异常值      3485
正常值    156125
Name: isDefault, dtype: int64
**********
正常值    790500
异常值      9500
Name: n1_outliers, dtype: int64
n1_outliers
异常值      2491
正常值    157119
Name: isDefault, dtype: int64
**********
正常值    789067
异常值     10933
Name: n2_outliers, dtype: int64
n2_outliers
异常值      3205
正常值    156405
Name: isDefault, dtype: int64
**********
正常值    789067
异常值     10933
Name: n3_outliers, dtype: int64
n3_outliers
异常值      3205
正常值    156405
Name: isDefault, dtype: int64
**********
正常值    788660
异常值     11340
Name: n4_outliers, dtype: int64
n4_outliers
异常值      2476
正常值    157134
Name: isDefault, dtype: int64
**********
正常值    790355
异常值      9645
Name: n5_outliers, dtype: int64
n5_outliers
异常值      1858
正常值    157752
Name: isDefault, dtype: int64
**********
正常值    786006
异常值     13994
Name: n6_outliers, dtype: int64
n6_outliers
异常值      3182
正常值    156428
Name: isDefault, dtype: int64
**********
正常值    788430
异常值     11570
Name: n7_outliers, dtype: int64
n7_outliers
异常值      2746
正常值    156864
Name: isDefault, dtype: int64
**********
正常值    789625
异常值     10375
Name: n8_outliers, dtype: int64
n8_outliers
异常值      2131
正常值    157479
Name: isDefault, dtype: int64
**********
正常值    786384
异常值     13616
Name: n9_outliers, dtype: int64
n9_outliers
异常值      3953
正常值    155657
Name: isDefault, dtype: int64
**********
正常值    788979
异常值     11021
Name: n10_outliers, dtype: int64
n10_outliers
异常值      2639
正常值    156971
Name: isDefault, dtype: int64
**********
正常值    799434
异常值       566
Name: n11_outliers, dtype: int64
n11_outliers
异常值       112
正常值    159498
Name: isDefault, dtype: int64
**********
正常值    797585
异常值      2415
Name: n12_outliers, dtype: int64
n12_outliers
异常值       545
正常值    159065
Name: isDefault, dtype: int64
**********
正常值    788907
异常值     11093
Name: n13_outliers, dtype: int64
n13_outliers
异常值      2482
正常值    157128
Name: isDefault, dtype: int64
**********
正常值    788884
异常值     11116
Name: n14_outliers, dtype: int64
n14_outliers
异常值      3364
正常值    156246
Name: isDefault, dtype: int64
**********

In [17]:

## Remove outliers: keep only rows labeled '正常值' (normal) for every numerical feature
for fea in numerical_fea:
    train = train[train[fea+'_outliers']=='正常值']
    train = train.reset_index(drop=True)

5. Feature processing

5.1 Processing 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14'

Feature interaction

In [18]:

for col in ['grade', 'subGrade']: 
    temp_dict = train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(columns={'mean': col + '_target_mean'})
    temp_dict.index = temp_dict[col].values
    temp_dict = temp_dict[col + '_target_mean'].to_dict()

    train[col + '_target_mean'] = train[col].map(temp_dict)
    testA[col + '_target_mean'] = testA[col].map(temp_dict)
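The cell above computes each group's default rate on the same training rows it is then attached to, so a row's own label contributes to its feature. One common way to reduce this, shown here only as a hedged sketch and not as the notebook's method, is out-of-fold encoding: each row is encoded with group means computed on the other folds. The helper name `oof_target_mean` and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_mean(df, col, target, n_splits=5, seed=0):
    """Encode df[col] with target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Group means come from the rows outside the fold being encoded
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    # Categories unseen in the fitting folds fall back to the overall mean
    return encoded.fillna(global_mean)

# Made-up data, only to show the call
toy = pd.DataFrame({
    "grade":     [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "isDefault": [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
})
toy["grade_target_mean"] = oof_target_mean(toy, "grade", "isDefault")
print(toy)
```

The test set would still be encoded with means from the full training set, as in the cell above.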

In [19]:

# Other derived variables: ratio of grade to the group mean and to the group std
for df in [train, testA]:
    for item in ['n0','n1','n2','n4','n5','n6','n7','n8','n9','n10','n11','n12','n13','n14']:
        df['grade_to_mean_' + item] = df['grade'] / df.groupby([item])['grade'].transform('mean')
        df['grade_to_std_' + item] = df['grade'] / df.groupby([item])['grade'].transform('std')

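5.2 Processing loanAmnt, installment, employmentTitle, annualIncome, dti, revolBal, title

Data binning

Purpose of feature binning: in terms of model performance, binning mainly reduces the complexity of a variable, lessens the influence of its noise on the model, and strengthens the relationship between the independent and dependent variables, which makes the model more stable.

What gets binned: continuous variables are discretized, and discrete variables with many states are merged into fewer states.

Why bin: the values within a feature can span a very wide range. In both supervised and unsupervised methods, for example k-means clustering, which uses Euclidean distance as the similarity measure between data points, large values then overwhelm small ones. One remedy is to quantize the values into intervals, which is called data binning (or bucketing), and to use the quantized result instead.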

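Advantages of binning:
- Handling missing values: when the data source may contain missing values, null can be treated as a separate bin.
- Handling outliers: when the data contain outliers, binning discretizes them and makes the variable more robust (resistant to interference). For example, an abnormal age value such as 200 falls into the "age > 60" bin, which removes its influence.
- Business interpretability: we tend to judge a variable's effect linearly, expecting y to grow as x grows. In practice x and y are often related non-linearly, and a WOE transformation can be applied in that case.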
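The three points above can be made concrete with a small made-up example. The sketch below puts missing ages in their own bin, lets the abnormal value 200 fall into the ">60" bin, and computes a WOE value per bin. The sign convention used, WOE = ln(share of good customers / share of bad customers), is an assumption (some references use the inverse), and `woe_table` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
import pandas as pd

def woe_table(bins: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Per-bin good/bad counts and WOE = ln(good share / bad share); target 1 means bad."""
    df = pd.DataFrame({"bin": bins, "bad": target})
    grouped = df.groupby("bin")["bad"].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    good_share = grouped["good"] / grouped["good"].sum()
    bad_share = grouped["bad"] / grouped["bad"].sum()
    grouped["woe"] = np.log(good_share / bad_share)
    return grouped[["good", "bad", "woe"]]

# Made-up ages (with missing values and the abnormal value 200) and default labels
age = pd.Series([22, 25, 35, 45, 52, 65, 200, np.nan, np.nan, 70, 28, 40])
default = pd.Series([1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

age_bin = pd.cut(age, bins=[0, 30, 60, np.inf], labels=["<=30", "31-60", ">60"])
# Missing ages get a bin of their own instead of being dropped
age_bin = age_bin.astype(object).fillna("missing")

table = woe_table(age_bin, default)
print(table)
```

A bin that contains only good or only bad customers would give an infinite WOE here, which is one practical reason for the rule that no bin may consist entirely of good customers.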

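Pay special attention to the basic rules of binning: (1) the smallest bin should hold no less than 5% of the samples; (2) no bin may consist entirely of good customers; (3) consecutive bins should be monotonic.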

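Binning methods:
- Equal-width binning: every bin has a fixed width, i.e. a fixed value range, such as 0-99, 100-199, 200-299. This suits fairly uniformly distributed samples; otherwise some bins may end up with very few samples and others with too many.
- Equal-frequency binning: also called quantile binning. Every bin holds the same number of samples, but samples with very different values may land in the same bin.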
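The difference between the two methods shows up clearly on skewed data. The following sketch uses made-up values with one large observation and compares `pd.cut` (equal width) with `pd.qcut` (equal frequency); the variable names are illustrative only.

```python
import pandas as pd

# Made-up skewed data: nine small values and one large one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: 3 bins of the same width across the range 1..100
equal_width = pd.cut(values, bins=3, labels=False)

# Equal-frequency: 5 bins with the same number of samples each
equal_freq = pd.qcut(values, q=5, labels=False)

print(equal_width.value_counts().sort_index().to_dict())
print(equal_freq.value_counts().sort_index().to_dict())
```

With equal-width bins almost every sample lands in the first bin, while the equal-frequency bins stay balanced at the cost of putting 9 and 100 in the same bin.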

In [20]:

# Map values to evenly spaced bins by integer division; for loanAmnt each bin has a width of 1000
# Note: `data` is testA here (the last value of the earlier loop variable), so only testA gets these columns
data['loanAmnt_bin'] = np.floor_divide(data['loanAmnt'], 1000)
data['installment_bin'] = np.floor_divide(data['installment'],  100)
# Logarithmic bins: each bin covers one order of magnitude
data['employmentTitle_bin'] = np.floor(np.log10(data['employmentTitle']))
data['annualIncome_bin'] = np.floor_divide(data['annualIncome'], 10)
# Equal-frequency (quantile) bins
data['dti_bin']= pd.qcut(data['dti'],  10, labels=False)
data['revolBal_bin'] = np.floor_divide(data['revolBal'], 100)
data['revolUtil_bin']  = np.floor_divide(data['revolUtil'], 10)

5.3 Processing postCode, title, subGrade

Feature encoding

In [21]:

for col in tqdm([ 'postCode', 'title','subGrade']):
    le = LabelEncoder()
    # Fit on train and test together so both share one mapping
    le.fit(list(train[col].astype(str).values) + list(testA[col].astype(str).values))
    train[col] = le.transform(list(train[col].astype(str).values))
    testA[col] = le.transform(list(testA[col].astype(str).values))
print('Label Encoding done')

100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  1.53s/it]

Label Encoding done

All data processing is complete

In [22]:

train

Out[22]:

	id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	...	grade_to_mean_n10	grade_to_std_n10	grade_to_mean_n11	grade_to_std_n11	grade_to_mean_n12	grade_to_std_n12	grade_to_mean_n13	grade_to_std_n13	grade_to_mean_n14	grade_to_std_n14
0	0	35000.0	5	19.52	917.97	5	21	320.0	2.0	2	...	1.842210	4.108917	1.852810	4.009823	1.852810	4.009823	1.857394	4.005352	1.856379	3.991791
1	1	18000.0	5	18.49	461.90	4	16	219843.0	5.0	0	...	1.484104	3.173687	1.482248	3.207858	1.482248	3.207858	1.485915	3.204282	1.485103	3.193433
2	2	12000.0	5	16.99	298.17	4	17	31698.0	8.0	0	...	1.504230	3.089208	1.482248	3.207858	1.482248	3.207858	1.485915	3.204282	1.315111	3.146801
3	6	2050.0	3	7.69	63.95	1	3	180083.0	9.0	0	...	0.370128	0.799459	0.370562	0.801965	0.370562	0.801965	0.371479	0.801070	0.344287	0.793451
4	7	11500.0	3	14.98	398.54	3	12	214017.0	1.0	1	...	1.104961	2.446307	1.111686	2.405894	1.111686	2.405894	1.114436	2.403211	1.113827	2.395075
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
612737	799994	15000.0	5	19.52	393.42	5	21	29191.0	7.0	0	...	1.840014	4.045690	1.852810	4.009823	1.852810	4.009823	1.857394	4.005352	1.539050	3.936523
612738	799995	25000.0	3	14.49	860.41	3	13	2659.0	7.0	1	...	1.114550	2.373772	1.111686	2.405894	1.111686	2.405894	1.114436	2.403211	1.032860	2.380353
612739	799997	6000.0	3	13.33	203.12	3	12	2582.0	10.0	1	...	1.096272	2.498103	1.111686	2.405894	1.111686	2.405894	1.041645	2.512092	0.986333	2.360101
612740	799998	19200.0	3	6.92	592.14	1	3	151.0	10.0	0	...	0.374164	0.786672	0.370562	0.801965	0.370562	0.801965	0.371479	0.801070	0.318729	0.780495
612741	799999	9000.0	3	11.06	294.91	2	7	13.0	5.0	0	...	0.736884	1.643567	0.741124	1.603929	0.741124	1.603929	0.742958	1.602141	0.742552	1.596716

612742 rows × 119 columns

6. Feature selection

In [23]:

features = [f for f in train.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
x_train = train[features]
x_test = testA[features]
y_train = train['isDefault']

Remove features that are strongly correlated with each other

In [24]:

correlation = x_train.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Features',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

Out[24]:

<AxesSubplot:title={'center':'Correlation between Features'}>

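The heatmap clearly shows that the variables from grade_target_mean to grade_to_std_n13 are very strongly correlated with each other, so they are candidates for removal.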
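Reading strongly correlated groups off a heatmap gets harder as the number of features grows. As a hedged sketch, the same information can be listed programmatically; `high_corr_pairs`, the 0.8 threshold, and the toy DataFrame below are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up data: b is an exact linear function of a, c is unrelated
a = np.arange(10, dtype=float)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": [1, -1] * 5})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the notebook's data the call would be `high_corr_pairs(x_train)`; one feature from each listed pair is then a candidate for removal.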

In [25]:

# feature holds the features whose names contain 'grade_'
feature=[x for x in features if 'grade_' in x]
x_train=x_train.drop(columns=feature)

In [26]:

correlation = x_train.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Features',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

Out[26]:

<AxesSubplot:title={'center':'Correlation between Features'}>

In [27]:

x_train=x_train.drop(columns=['policyCode','n11'])

In [28]:

correlation = x_train.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Numeric Features',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

Out[28]:

<AxesSubplot:title={'center':'Correlation between Numeric Features'}>

In [29]:

x_train=x_train.drop(columns=['n12'])

In [30]:

correlation = x_train.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Numeric Features',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

Out[30]:

<AxesSubplot:title={'center':'Correlation between Numeric Features'}>

In [31]:

x_train=x_train.drop(columns=['applicationType'])
correlation = x_train.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Numeric Features',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

Out[31]:

<AxesSubplot:title={'center':'Correlation between Numeric Features'}>

Select the features most strongly correlated with the label; the Filter method based on correlation coefficients is used here.

In [32]:

# Compute the correlation of each feature with the label
data_corr = x_train.corrwith(train.isDefault)
result = pd.DataFrame(columns=['features', 'corr'])
result['features'] = data_corr.index
result['corr'] = data_corr.values
result

Out[32]:

	features	corr
0	loanAmnt	0.061056
1	term	0.174659
2	interestRate	0.254421
3	installment	0.043117
4	grade	0.256237
5	subGrade	0.262355
6	employmentTitle	-0.026137
7	employmentLength	-0.013302
8	homeOwnership	0.053502
9	annualIncome	-0.065541
10	verificationStatus	0.086956
11	purpose	-0.032990
12	postCode	0.004510
13	regionCode	0.001558
14	dti	0.105192
15	delinquency_2years	0.014012
16	ficoRangeLow	-0.128541
17	ficoRangeHigh	-0.128541
18	openAcc	0.017294
19	pubRec	0.028772
20	pubRecBankruptcies	0.023167
21	revolBal	-0.019310
22	revolUtil	0.060353
23	totalAcc	-0.024568
24	initialListStatus	-0.005529
25	earliesCreditLine	0.038076
26	title	-0.040678
27	n0	0.015002
28	n1	0.035943
29	n2	0.067048
30	n3	0.067048
31	n4	0.009364
32	n5	-0.021715
33	n6	-0.004452
34	n7	0.027581
35	n8	-0.011180
36	n9	0.064747
37	n10	0.015907
38	n13	0.010801
39	n14	0.078981
40	issueDateDT	0.043304
41	subGrade_target_mean	0.263363

In [33]:

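from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
# Select the K best features and return the data restricted to them.
# The first parameter, score_func, evaluates how good each feature is: it takes the feature matrix
# and the target vector and returns the scores and p-values, one entry per feature.
# No score_func is passed below, so the scikit-learn default f_classif (ANOVA F-value) is used;
# pearsonr is imported but not used.
# The parameter k is the number of features to select.

SelectKBest(k=5).fit_transform(x_train,y_train)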

Out[33]:

array([[ 5.        , 19.52      ,  5.        , 21.        ,  0.38044389],
       [ 5.        , 18.49      ,  4.        , 16.        ,  0.29818972],
       [ 5.        , 16.99      ,  4.        , 17.        ,  0.30254055],
       ...,
       [ 3.        , 13.33      ,  3.        , 12.        ,  0.22468573],
       [ 3.        ,  6.92      ,  1.        ,  3.        ,  0.0655316 ],
       [ 3.        , 11.06      ,  2.        ,  7.        ,  0.12811053]])
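The returned array no longer carries column names, and the call above relies on the default scoring function. As a hedged sketch of how a correlation-based score could be plugged in and the selected names recovered, the example below uses a hypothetical `pearson_score` function and made-up data; it is not the notebook's code.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

def pearson_score(X, y):
    """Score each column by |Pearson r| with y; return (scores, p-values) as SelectKBest expects."""
    X = np.asarray(X, dtype=float)
    scores, pvalues = [], []
    for j in range(X.shape[1]):
        r, p = pearsonr(X[:, j], y)
        scores.append(abs(r))
        pvalues.append(p)
    return np.array(scores), np.array(pvalues)

# Made-up data: 'strong' tracks the label closely, 'medium' loosely, 'weak' not at all
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_toy = pd.DataFrame({
    "strong": [0.1, 0.0, 0.2, 0.1, 0.9, 1.0, 0.8, 1.1],
    "weak":   [1, 2, 1, 2, 1, 2, 1, 2],
    "medium": [0, 1, 0, 1, 0, 1, 1, 1],
})

selector = SelectKBest(score_func=pearson_score, k=2).fit(X_toy, y_toy)
selected = list(X_toy.columns[selector.get_support()])
print(selected)
```

`get_support()` returns a boolean mask over the input columns, so the same indexing would recover the names of the five columns chosen from `x_train`.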

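III. Questions and Answers

I felt I did not fully understand the feature selection part of the study task. There was little written explanation and I could not quite follow the logic; maybe I am just too much of a beginner. After looking things up online and reading an article on feature selection, it became much easier to understand.

IV. Reflections and Summary

Feature engineering is indeed a fairly complex step. As I understand it, it falls into four major parts: data preprocessing (filling, date formats), outlier handling, feature processing (binning, feature interaction, feature encoding), and feature selection (among features, and between features and the label). Each stage can be handled in many ways, and which method to use depends on the specific case. I am not sure my understanding is right and I am still a little hazy. Feature engineering is one of the key difficulties of data analysis and modeling, and I need to keep studying it.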