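I. Overview of Key Topics
- Data preprocessing
  - Filling missing values
  - Handling date/time formats
  - Converting object-type features to numeric
- Outlier handling
  - Based on the 3-sigma rule (standard deviation)
- Feature processing
  - Data binning
  - Feature interaction
  - Feature encoding
- Feature selection
II. Study Notes
1. Importing packages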
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
2. Loading the data
In [2]:
train=pd.read_csv('C://Users//Administrator//Desktop//train.csv')
testA=pd.read_csv('C://Users//Administrator//Desktop//testA.csv')
3. Data preprocessing
3.1 Separating numerical and categorical features
In [3]:
# Columns whose dtype is not object are treated as numerical, the rest as categorical
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(train.columns)))
# isDefault is the label, not a feature
numerical_fea.remove('isDefault')
print("Numerical features:\n",numerical_fea)
print("Categorical features:\n",category_fea)
Numerical features:
 ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
Categorical features:
 ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
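3.2 Filling missing values
3.2.1 Filling numerical features with the median
In [4]:
# Fill the test set with statistics computed on the training set
# Fill numerical features with the median
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
testA[numerical_fea] = testA[numerical_fea].fillna(train[numerical_fea].median())
# Fill categorical features with the mode (left disabled here);
# mode() returns a DataFrame, so its first row is taken with .iloc[0]
#train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])
#testA[category_fea] = testA[category_fea].fillna(train[category_fea].mode().iloc[0])
Note: employmentLength is an object-type column, so the numerical fill above does not cover it and it still has missing values.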
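The disabled mode fill can be tried on a small example first. The sketch below is only an illustration under assumptions: `toy_train`, `toy_test` and their values are made up and merely stand in for `train` and `testA`.

```python
import pandas as pd

# Toy frames standing in for train / testA (hypothetical values)
toy_train = pd.DataFrame({
    "grade": ["A", "B", "B", None],
    "employmentLength": ["2 years", None, "2 years", "10+ years"],
})
toy_test = pd.DataFrame({
    "grade": [None, "C"],
    "employmentLength": [None, "1 year"],
})
cat_cols = ["grade", "employmentLength"]

# mode() returns a DataFrame (ties produce several rows); row 0 holds one mode per column
train_mode = toy_train[cat_cols].mode().iloc[0]

# A Series passed to fillna fills each column with the value stored under its own label
toy_train[cat_cols] = toy_train[cat_cols].fillna(train_mode)
toy_test[cat_cols] = toy_test[cat_cols].fillna(train_mode)

print(toy_test)
```

As with the median above, the mode is taken from the training frame only and then reused for the test frame.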
In [5]:
train.isnull().sum()
Out[5]:
id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     0
policyCode                0
n0                        0
n1                        0
n2                        0
n3                        0
n4                        0
n5                        0
n6                        0
n7                        0
n8                        0
n9                        0
n10                       0
n11                       0
n12                       0
n13                       0
n14                       0
dtype: int64
In [6]:
train['employmentLength'].value_counts(dropna=False).sort_index()
Out[6]:
1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
NaN           46799
Name: employmentLength, dtype: int64
3.2.2 Filling categorical features
Convert employmentLength to int8, then fill the missing values with the median.
In [7]:
def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])  # keep the part before the first space
for data in [train, testA]:
    # Assign the result back instead of calling replace(..., inplace=True) on a column
    data['employmentLength'] = data['employmentLength'].replace('10+ years', '10 years')
    data['employmentLength'] = data['employmentLength'].replace('< 1 year', '0 years')
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
In [8]:
train['employmentLength'] = train['employmentLength'].fillna(train['employmentLength'].median())
testA['employmentLength'] = testA['employmentLength'].fillna(train['employmentLength'].median())
# `data` still pointed at testA after the loop above, so the counts below are for testA
testA['employmentLength'].value_counts(dropna=False).sort_index()
Out[8]:
0.0     15989
1.0     13182
2.0     18207
3.0     16011
4.0     11833
5.0     12543
6.0     21070
7.0      8823
8.0      8976
9.0      7594
10.0    65772
Name: employmentLength, dtype: int64
3.3 Processing issueDate
In [9]:
# Convert to datetime
for data in [train, testA]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # Build a time feature: number of days elapsed since startdate
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
3.4 Processing earliesCreditLine
In [10]:
# Draw 5 random rows
train['earliesCreditLine'].sample(5)
Out[10]:
141601    Aug-2008
187415    Nov-1997
686069    Nov-1997
790315    Apr-2003
451221    Jun-1993
Name: earliesCreditLine, dtype: object
In [11]:
# Keep only the year (the last four characters)
for data in [train, testA]:
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
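Slicing the last four characters keeps the year and drops the month. If the month should be kept as well, one possible alternative is to parse the whole string. This is a sketch only: it assumes every value has the `Mon-YYYY` shape seen in the sample above, with English month abbreviations.

```python
import pandas as pd

# Sample values copied from the Out[10] listing
s = pd.Series(["Aug-2008", "Nov-1997", "Apr-2003"], name="earliesCreditLine")

# '%b' matches an abbreviated month name, '%Y' a four-digit year
parsed = pd.to_datetime(s, format="%b-%Y")

year = parsed.dt.year    # same information as int(s[-4:])
month = parsed.dt.month  # extra information that the slice discards
print(pd.DataFrame({"raw": s, "year": year, "month": month}))
```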
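Note: check the number of distinct values of each numerical feature to decide whether to bin it later. loanAmnt, installment, employmentTitle, annualIncome, dti, revolBal and title have many distinct values.
In [12]:
# Number of distinct values per numerical feature
# (the original cell used the leftover loop variable `data`, which was testA)
for f in numerical_fea:
    print(f, 'unique values:', testA[f].nunique())
id unique values: 200000
loanAmnt unique values: 1444
term unique values: 2
interestRate unique values: 597
installment unique values: 41575
employmentTitle unique values: 79282
homeOwnership unique values: 6
annualIncome unique values: 15530
verificationStatus unique values: 3
purpose unique values: 14
postCode unique values: 889
regionCode unique values: 51
dti unique values: 4816
delinquency_2years unique values: 23
ficoRangeLow unique values: 39
ficoRangeHigh unique values: 39
openAcc unique values: 66
pubRec unique values: 22
pubRecBankruptcies unique values: 10
revolBal unique values: 46395
revolUtil unique values: 1145
totalAcc unique values: 113
initialListStatus unique values: 2
applicationType unique values: 2
title unique values: 12058
policyCode unique values: 1
n0 unique values: 30
n1 unique values: 28
n2 unique values: 42
n3 unique values: 42
n4 unique values: 45
n5 unique values: 56
n6 unique values: 86
n7 unique values: 58
n8 unique values: 87
n9 unique values: 39
n10 unique values: 65
n11 unique values: 4
n12 unique values: 4
n13 unique values: 22
n14 unique values: 27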
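3.5 Processing grade
In [13]:
# grade is an ordered category, so it can be label-encoded or mapped by hand
for data in [train, testA]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
In [14]:
# One-hot encoding
# NOTE: get_dummies returns a new DataFrame and the assignment only rebinds the loop
# variable `data`, so train and testA themselves are left unchanged by this cell
for data in [train, testA]:
    data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)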
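Because `pd.get_dummies` returns a new frame, an assignment to the loop variable does not reach the original DataFrames. The sketch below illustrates this on made-up frames (`toy_train` and `toy_test` are hypothetical) and shows one way to keep the encoded result and align the test columns with the training columns. It is not applied in the rest of these notes, which still rely on the unencoded subGrade column.

```python
import pandas as pd

toy_train = pd.DataFrame({"purpose": [0, 1, 2], "loanAmnt": [1000.0, 2000.0, 3000.0]})
toy_test = pd.DataFrame({"purpose": [1, 1], "loanAmnt": [1500.0, 2500.0]})

# Rebinding the loop variable leaves the original frames untouched
for data in [toy_train, toy_test]:
    data = pd.get_dummies(data, columns=["purpose"])
print("purpose" in toy_train.columns)  # the column is still there, not encoded

# Binding the result to a name of its own keeps the encoding
train_enc = pd.get_dummies(toy_train, columns=["purpose"])
test_enc = pd.get_dummies(toy_test, columns=["purpose"])

# The test set may lack some categories, so align it to the training columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(train_enc.columns.tolist())
```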
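4. Outlier handling
Detection method 1: standard deviation. In statistics, if a distribution is approximately normal, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three.
Detection method 2: box plot. The quartiles are three points that split the data into four intervals. With IQR = Q3 − Q1, the lower whisker is Q1 − 1.5 × IQR and the upper whisker is Q3 + 1.5 × IQR.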
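Only the first method is implemented in these notes. The box-plot rule can be sketched along the same lines. The helper name `find_outliers_by_iqr` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def find_outliers_by_iqr(df, fea, k=1.5):
    """Return a boolean Series that is True where df[fea] lies outside the whiskers."""
    q1 = df[fea].quantile(0.25)
    q3 = df[fea].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower whisker
    upper = q3 + k * iqr  # upper whisker
    return (df[fea] < lower) | (df[fea] > upper)

# Hypothetical values with one obvious extreme
toy = pd.DataFrame({"annualIncome": [30, 35, 40, 42, 45, 50, 55, 60, 1000]})
mask = find_outliers_by_iqr(toy, "annualIncome")
print(toy[mask])
```

Unlike the 3-sigma rule, the quartiles are barely moved by the extreme value itself, so this rule is less sensitive to the very outliers it is trying to find.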
In [15]:
# 3-sigma rule: flag values more than three standard deviations away from the mean
def find_outliers_by_3segama(data,fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    # Label each row: '异常值' = outlier, '正常值' = normal
    data[fea+'_outliers'] = data[fea].apply(lambda x:str('异常值') if x > upper_rule or x < lower_rule else '正常值')
    return data
In [16]:
for fea in numerical_fea:
    train = find_outliers_by_3segama(train,fea)
    print(train[fea+'_outliers'].value_counts())
    print(train.groupby(fea+'_outliers')['isDefault'].sum())
    print('*'*10)
Printed output, condensed to one row per feature. 正常值 = normal, 异常值 = outlier; a dash means the label does not appear in the output for that feature.
feature | 正常值 rows | 异常值 rows | isDefault sum (正常值) | isDefault sum (异常值) |
---|---|---|---|---|
id | 800000 | - | 159610 | - |
loanAmnt | 800000 | - | 159610 | - |
term | 800000 | - | 159610 | - |
interestRate | 794259 | 5741 | 156694 | 2916 |
installment | 792046 | 7954 | 157458 | 2152 |
employmentTitle | 800000 | - | 159610 | - |
homeOwnership | 799701 | 299 | 159548 | 62 |
annualIncome | 793973 | 6027 | 158854 | 756 |
verificationStatus | 800000 | - | 159610 | - |
purpose | 783003 | 16997 | 155975 | 3635 |
postCode | 798931 | 1069 | 159389 | 221 |
regionCode | 799994 | 6 | 159609 | 1 |
dti | 798440 | 1560 | 159144 | 466 |
delinquency_2years | 778245 | 21755 | 154521 | 5089 |
ficoRangeLow | 788261 | 11739 | 158832 | 778 |
ficoRangeHigh | 788261 | 11739 | 158832 | 778 |
openAcc | 790889 | 9111 | 157415 | 2195 |
pubRec | 792471 | 7529 | 157909 | 1701 |
pubRecBankruptcies | 794120 | 5880 | 158187 | 1423 |
revolBal | 790001 | 9999 | 158251 | 1359 |
revolUtil | 799948 | 52 | 159587 | 23 |
totalAcc | 791663 | 8337 | 157942 | 1668 |
initialListStatus | 800000 | - | 159610 | - |
applicationType | 784586 | 15414 | 155735 | 3875 |
title | 775134 | 24866 | 155710 | 3900 |
policyCode | 800000 | - | 159610 | - |
n0 | 782773 | 17227 | 156125 | 3485 |
n1 | 790500 | 9500 | 157119 | 2491 |
n2 | 789067 | 10933 | 156405 | 3205 |
n3 | 789067 | 10933 | 156405 | 3205 |
n4 | 788660 | 11340 | 157134 | 2476 |
n5 | 790355 | 9645 | 157752 | 1858 |
n6 | 786006 | 13994 | 156428 | 3182 |
n7 | 788430 | 11570 | 156864 | 2746 |
n8 | 789625 | 10375 | 157479 | 2131 |
n9 | 786384 | 13616 | 155657 | 3953 |
n10 | 788979 | 11021 | 156971 | 2639 |
n11 | 799434 | 566 | 159498 | 112 |
n12 | 797585 | 2415 | 159065 | 545 |
n13 | 788907 | 11093 | 157128 | 2482 |
n14 | 788884 | 11116 | 156246 | 3364 |
In [17]:
# Drop the rows flagged as outliers ('正常值' = normal)
for fea in numerical_fea:
    train = train[train[fea+'_outliers']=='正常值']
    train = train.reset_index(drop=True)
5. Feature processing
5.1 Processing 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14'
Feature interaction
In [18]:
# Mean of the label per category, learned on train and mapped onto both sets
for col in ['grade', 'subGrade']:
    temp_dict = train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(columns={'mean': col + '_target_mean'})
    temp_dict.index = temp_dict[col].values
    temp_dict = temp_dict[col + '_target_mean'].to_dict()
    train[col + '_target_mean'] = train[col].map(temp_dict)
    testA[col + '_target_mean'] = testA[col].map(temp_dict)
In [19]:
# Other derived variables: grade relative to the mean and std of its group
for df in [train, testA]:
    for item in ['n0','n1','n2','n4','n5','n6','n7','n8','n9','n10','n11','n12','n13','n14']:
        df['grade_to_mean_' + item] = df['grade'] / df.groupby([item])['grade'].transform('mean')
        df['grade_to_std_' + item] = df['grade'] / df.groupby([item])['grade'].transform('std')
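To see what these ratio features contain, the same `groupby(...).transform(...)` pattern can be run on a few made-up rows. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"n0": [0, 0, 1, 1, 1], "grade": [1, 3, 2, 4, 6]})

# transform() returns one value per row: the statistic of the group that row belongs to
group_mean = toy.groupby("n0")["grade"].transform("mean")
group_std = toy.groupby("n0")["grade"].transform("std")

toy["grade_to_mean_n0"] = toy["grade"] / group_mean
toy["grade_to_std_n0"] = toy["grade"] / group_std
print(toy)
```

A group with a single row has an undefined sample standard deviation, so the corresponding `grade_to_std_*` value is NaN for such rows.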
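5.2 Processing loanAmnt, installment, employmentTitle, annualIncome, dti, revolBal, title
Data binning
Purpose of binning: in terms of model performance, binning mainly reduces the complexity of a variable, limits the influence of noise on the model, and strengthens the relationship between the independent and dependent variables, which makes the model more stable.
What gets binned:
- continuous variables, which are discretized;
- discrete variables with many states, which are merged into fewer states.
Why bin: the values of a feature may span a very wide range. In both supervised and unsupervised methods, for example k-means clustering, which uses Euclidean distance as its similarity measure between data points, large values then swamp small ones. One remedy is to quantize the values into intervals, i.e. data binning (also called bucketing), and to use the quantized result.
Advantages of binning:
- Missing values: when the data source may contain missing values, null can be given a bin of its own.
- Outliers: discretizing through binning absorbs outliers and makes the variable more robust. For example, an abnormal age of 200 falls into the "age > 60" bin, which removes its influence.
- Business interpretability: we tend to judge a variable's effect linearly (as x grows, y grows), but the relation between x and y is often non-linear; a WOE transformation can be applied in that case.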
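The WOE (weight of evidence) transformation mentioned in the last point is not implemented in these notes. A minimal sketch on made-up data follows. The sign convention WOE = ln(bad share / good share) is an assumption (the opposite convention is also in use), and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one binned feature and a binary label (1 = default)
toy = pd.DataFrame({
    "dti_bin":   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "isDefault": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Count defaults (bad) and non-defaults (good) in each bin
stats = toy.groupby("dti_bin")["isDefault"].agg(bad="sum", total="count")
stats["good"] = stats["total"] - stats["bad"]

# Share of all bad / all good samples that falls into each bin
bad_share = stats["bad"] / stats["bad"].sum()
good_share = stats["good"] / stats["good"].sum()

# Convention assumed here: WOE = ln(bad share / good share)
stats["woe"] = np.log(bad_share / good_share)

# Replace each bin id by the WOE value of its bin
toy["dti_woe"] = toy["dti_bin"].map(stats["woe"])
print(stats)
```

A bin that contains only good or only bad samples makes the ratio zero or infinite, so the logarithm is not finite for such a bin.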
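Basic principles to keep in mind when binning:
(1) the smallest bin holds at least 5% of the samples;
(2) no bin consists entirely of good customers;
(3) consecutive bins are monotonic.
Binning methods:
- Equal-width binning: every bin has the same fixed width, e.g. 0-99, 100-199, 200-299. This suits fairly evenly distributed samples; otherwise some bins end up with very few samples and others with too many.
- Equal-frequency binning, also called quantile binning: every bin holds the same number of samples, but samples with very different values may land in the same bin.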
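The difference between the two methods shows up clearly on skewed data. The sketch below uses `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins; the twelve values are made up and only stand in for a skewed feature.

```python
import pandas as pd

# Hypothetical skewed values: eleven small numbers and one large one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100], name="x")

# Equal-width: 4 bins that each span the same range of values
equal_width = pd.cut(values, bins=4, labels=False)

# Equal-frequency: 4 bins cut at the quartiles, so each holds about the same number of samples
equal_freq = pd.qcut(values, q=4, labels=False)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With equal-width bins almost every sample lands in the first bin, which conflicts with the 5% rule of thumb; with equal-frequency bins the counts are balanced, but 10, 11 and 100 share a bin.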
In [20]:
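# Map to evenly spaced bins by integer division; each loanAmnt bin spans 1000 units
# NOTE: as written, `data` is the variable left over from the last `for data in ...` loop
# (the one-hot copy of testA), so train and testA do not receive these columns;
# wrap the lines below in `for data in [train, testA]:` to bin both sets
data['loanAmnt_bin'] = np.floor_divide(data['loanAmnt'], 1000)
data['installment_bin'] = np.floor_divide(data['installment'], 100)
# Logarithmic bins: the bin index is the order of magnitude of the value
data['employmentTitle_bin'] = np.floor(np.log10(data['employmentTitle']))
data['annualIncome_bin'] = np.floor_divide(data['annualIncome'], 10)
# Equal-frequency binning into 10 quantile bins
data['dti_bin']= pd.qcut(data['dti'], 10, labels=False)
data['revolBal_bin'] = np.floor_divide(data['revolBal'], 100)
data['revolUtil_bin'] = np.floor_divide(data['revolUtil'], 10)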
5.3 Processing postCode, title, subGrade
Feature encoding
In [21]:
# Label-encode the high-cardinality features; the encoder is fitted on train and
# testA together so that both share one mapping
for col in tqdm([ 'postCode', 'title','subGrade']):
    le = LabelEncoder()
    le.fit(list(train[col].astype(str).values) + list(testA[col].astype(str).values))
    train[col] = le.transform(list(train[col].astype(str).values))
    testA[col] = le.transform(list(testA[col].astype(str).values))
print('Label Encoding done')
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.53s/it]
Label Encoding done
All data processing is complete.
In [22]:
train
Out[22]:
(index) | id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | ... | grade_to_mean_n10 | grade_to_std_n10 | grade_to_mean_n11 | grade_to_std_n11 | grade_to_mean_n12 | grade_to_std_n12 | grade_to_mean_n13 | grade_to_std_n13 | grade_to_mean_n14 | grade_to_std_n14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | 5 | 21 | 320.0 | 2.0 | 2 | ... | 1.842210 | 4.108917 | 1.852810 | 4.009823 | 1.852810 | 4.009823 | 1.857394 | 4.005352 | 1.856379 | 3.991791 |
1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | 4 | 16 | 219843.0 | 5.0 | 0 | ... | 1.484104 | 3.173687 | 1.482248 | 3.207858 | 1.482248 | 3.207858 | 1.485915 | 3.204282 | 1.485103 | 3.193433 |
2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | 4 | 17 | 31698.0 | 8.0 | 0 | ... | 1.504230 | 3.089208 | 1.482248 | 3.207858 | 1.482248 | 3.207858 | 1.485915 | 3.204282 | 1.315111 | 3.146801 |
3 | 6 | 2050.0 | 3 | 7.69 | 63.95 | 1 | 3 | 180083.0 | 9.0 | 0 | ... | 0.370128 | 0.799459 | 0.370562 | 0.801965 | 0.370562 | 0.801965 | 0.371479 | 0.801070 | 0.344287 | 0.793451 |
4 | 7 | 11500.0 | 3 | 14.98 | 398.54 | 3 | 12 | 214017.0 | 1.0 | 1 | ... | 1.104961 | 2.446307 | 1.111686 | 2.405894 | 1.111686 | 2.405894 | 1.114436 | 2.403211 | 1.113827 | 2.395075 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
612737 | 799994 | 15000.0 | 5 | 19.52 | 393.42 | 5 | 21 | 29191.0 | 7.0 | 0 | ... | 1.840014 | 4.045690 | 1.852810 | 4.009823 | 1.852810 | 4.009823 | 1.857394 | 4.005352 | 1.539050 | 3.936523 |
612738 | 799995 | 25000.0 | 3 | 14.49 | 860.41 | 3 | 13 | 2659.0 | 7.0 | 1 | ... | 1.114550 | 2.373772 | 1.111686 | 2.405894 | 1.111686 | 2.405894 | 1.114436 | 2.403211 | 1.032860 | 2.380353 |
612739 | 799997 | 6000.0 | 3 | 13.33 | 203.12 | 3 | 12 | 2582.0 | 10.0 | 1 | ... | 1.096272 | 2.498103 | 1.111686 | 2.405894 | 1.111686 | 2.405894 | 1.041645 | 2.512092 | 0.986333 | 2.360101 |
612740 | 799998 | 19200.0 | 3 | 6.92 | 592.14 | 1 | 3 | 151.0 | 10.0 | 0 | ... | 0.374164 | 0.786672 | 0.370562 | 0.801965 | 0.370562 | 0.801965 | 0.371479 | 0.801070 | 0.318729 | 0.780495 |
612741 | 799999 | 9000.0 | 3 | 11.06 | 294.91 | 2 | 7 | 13.0 | 5.0 | 0 | ... | 0.736884 | 1.643567 | 0.741124 | 1.603929 | 0.741124 | 1.603929 | 0.742958 | 1.602141 | 0.742552 | 1.596716 |
612742 rows × 119 columns
6. Feature selection
In [23]:
features = [f for f in train.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
x_train = train[features]
x_test = testA[features]
y_train = train['isDefault']
Removing variables that are strongly correlated with other features
In [24]:
correlation = x_train.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Features',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
Out[24]:
<AxesSubplot:title={'center':'Correlation between Features'}>
The heatmap shows that the variables from grade_target_mean through grade_to_std_n13 are very strongly correlated with one another, so they are candidates for removal.
In [25]:
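# feature holds the columns whose name contains 'grade_'
feature=[x for i,x in enumerate(features) if x.find('grade_') != -1]
# axis must be passed by keyword in recent pandas versions
x_train=x_train.drop(feature,axis=1)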
In [26]:
correlation = x_train.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Features',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
Out[26]:
<AxesSubplot:title={'center':'Correlation between Features'}>
In [27]:
x_train=x_train.drop(['policyCode','n11'],axis=1)
In [28]:
correlation = x_train.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Numeric Features',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
Out[28]:
<AxesSubplot:title={'center':'Correlation between Numeric Features'}>
In [29]:
x_train=x_train.drop(['n12'],axis=1)
In [30]:
correlation = x_train.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Numeric Features',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
Out[30]:
<AxesSubplot:title={'center':'Correlation between Numeric Features'}>
In [31]:
x_train=x_train.drop(['applicationType'],axis=1)
correlation = x_train.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation between Numeric Features',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
Out[31]:
<AxesSubplot:title={'center':'Correlation between Numeric Features'}>
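Reading strongly correlated pairs off a heatmap by eye gets hard with this many columns. As a complement, the pairs can be listed programmatically. In the sketch below the helper `high_corr_pairs`, the 0.8 threshold (chosen to match `vmax` in the plots) and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List (feature_a, feature_b, |corr|) for feature pairs whose |corr| exceeds threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked-out cells are NaN at most; NaN fails the comparison and is filtered out
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

# Hypothetical values: ficoRangeHigh is ficoRangeLow + 4, loanAmnt is unrelated
toy = pd.DataFrame({
    "ficoRangeLow":  [660, 700, 720, 680, 750, 690],
    "ficoRangeHigh": [664, 704, 724, 684, 754, 694],
    "loanAmnt":      [5000, 5000, 12000, 20000, 8000, 15000],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data the call would be `high_corr_pairs(x_train)`; one feature of each reported pair is then a candidate for `drop`.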
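Selecting variables that are strongly correlated with the label, here with the filter method based on the correlation coefficient
In [32]:
# Compute each feature's correlation coefficient with the label
data_corr = x_train.corrwith(train.isDefault)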
result = pd.DataFrame(columns=['features', 'corr'])
result['features'] = data_corr.index
result['corr'] = data_corr.values
result
Out[32]:
(index) | features | corr |
---|---|---|
0 | loanAmnt | 0.061056 |
1 | term | 0.174659 |
2 | interestRate | 0.254421 |
3 | installment | 0.043117 |
4 | grade | 0.256237 |
5 | subGrade | 0.262355 |
6 | employmentTitle | -0.026137 |
7 | employmentLength | -0.013302 |
8 | homeOwnership | 0.053502 |
9 | annualIncome | -0.065541 |
10 | verificationStatus | 0.086956 |
11 | purpose | -0.032990 |
12 | postCode | 0.004510 |
13 | regionCode | 0.001558 |
14 | dti | 0.105192 |
15 | delinquency_2years | 0.014012 |
16 | ficoRangeLow | -0.128541 |
17 | ficoRangeHigh | -0.128541 |
18 | openAcc | 0.017294 |
19 | pubRec | 0.028772 |
20 | pubRecBankruptcies | 0.023167 |
21 | revolBal | -0.019310 |
22 | revolUtil | 0.060353 |
23 | totalAcc | -0.024568 |
24 | initialListStatus | -0.005529 |
25 | earliesCreditLine | 0.038076 |
26 | title | -0.040678 |
27 | n0 | 0.015002 |
28 | n1 | 0.035943 |
29 | n2 | 0.067048 |
30 | n3 | 0.067048 |
31 | n4 | 0.009364 |
32 | n5 | -0.021715 |
33 | n6 | -0.004452 |
34 | n7 | 0.027581 |
35 | n8 | -0.011180 |
36 | n9 | 0.064747 |
37 | n10 | 0.015907 |
38 | n13 | 0.010801 |
39 | n14 | 0.078981 |
40 | issueDateDT | 0.043304 |
41 | subGrade_target_mean | 0.263363 |
In [33]:
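from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
# Select the K best features and return the data restricted to them.
# The first argument (score_func) is the function that scores the features: it takes the
# feature matrix and the target vector and returns a pair of arrays (scores, p-values),
# whose i-th entries belong to the i-th feature.
# No score_func is passed here, so the scikit-learn default f_classif (ANOVA F-value)
# is used; pearsonr is imported but not used.
# k is the number of features to select.
SelectKBest(k=5).fit_transform(x_train,y_train)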
Out[33]:
array([[ 5. , 19.52 , 5. , 21. , 0.38044389],
       [ 5. , 18.49 , 4. , 16. , 0.29818972],
       [ 5. , 16.99 , 4. , 17. , 0.30254055],
       ...,
       [ 3. , 13.33 , 3. , 12. , 0.22468573],
       [ 3. , 6.92 , 1. , 3. , 0.0655316 ],
       [ 3. , 11.06 , 2. , 7. , 0.12811053]])
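To score features by the correlation coefficient, as the comments in the cell describe, a scoring function has to be passed explicitly. The sketch below shows one way this might look. The helper `pearson_score` and the random toy data are assumptions; the absolute value is used so that strong negative correlations also rank highly.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

def pearson_score(X, y):
    """Score each column of X by |Pearson r| with y; return (scores, p-values)."""
    X = np.asarray(X, dtype=float)
    scores, pvalues = [], []
    for j in range(X.shape[1]):
        r, p = pearsonr(X[:, j], y)
        scores.append(abs(r))
        pvalues.append(p)
    return np.array(scores), np.array(pvalues)

# Hypothetical toy data: column 0 tracks the label closely, column 1 is pure noise,
# column 2 is only weakly related to the label
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([
    y + rng.normal(0, 0.1, size=200),
    rng.normal(0, 1, size=200),
    y + rng.normal(0, 2.0, size=200),
])

selector = SelectKBest(score_func=pearson_score, k=1).fit(X, y)
print(selector.get_support(indices=True))
```

On the real data the corresponding call would be `SelectKBest(score_func=pearson_score, k=5).fit_transform(x_train, y_train)`.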
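III. Questions and Answers
I did not fully follow the feature selection part of the study task: it has little written explanation and my reasoning got tangled, perhaps because I am still a beginner. After looking things up online and reading an article on feature selection, it became much easier to understand.
IV. Reflections and Summary
Feature engineering is indeed fairly involved. As I understand it, it breaks down into four major parts: data preprocessing (filling, date formats), outlier handling, feature processing (binning, feature interaction, feature encoding), and feature selection (among features, and between features and the label). Each step can be done in many ways, and which method to use depends on the case at hand. I am not sure my understanding is right and I am still somewhat hazy on it. Feature engineering is one of the major difficulties in data analysis and modelling, and I need to keep studying it.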