【天池】金融风控数据挖掘task2

Task2 探索性数据分析

此部分为零基础入门金融风控的 Task2 数据分析部分,带你来了解数据,熟悉数据,为后续的特征工程做准备,代码如下:

import pandas as pd
import numpy as np
train = pd.read_csv(r'/home/corn/桌面/tianchifengkong/train.csv')
testA = pd.read_csv(r'/home/corn/桌面/tianchifengkong/testA.csv')
print('Train data shape:',train.shape)
print('TestA data shape:',testA.shape)
Train data shape: (800000, 47)
TestA data shape: (200000, 48)
train.head()
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnership...n5n6n7n8n9n10n11n12n13n14
0035000.0519.52917.97EE2320.02 years2...9.08.04.012.02.07.00.00.00.02.0
1118000.0518.49461.90DD2219843.05 years0...NaNNaNNaNNaNNaN13.0NaNNaNNaNNaN
2212000.0516.99298.17DD331698.08 years0...0.021.04.05.03.011.00.00.00.04.0
3311000.037.26340.96AA446854.010+ years1...16.04.07.021.06.09.00.00.00.01.0
443000.0312.99101.07CC254.0NaN1...4.09.010.015.07.012.00.00.00.04.0

5 rows × 47 columns

testA.head()
idloanAmntterminterestRateinstallmentgradesubGradeemploymentTitleemploymentLengthhomeOwnership...n5n6n7n8n9n10n11n12n13n14
080000014000.0310.99458.28BB37027.010+ years0...8.04.015.019.06.017.00.00.01.03.0
180000120000.0514.65472.14CC560426.010+ years0...1.03.03.09.03.05.00.00.02.02.0
280000212000.0319.99445.91DD423547.02 years1...1.036.05.06.04.012.00.00.00.07.0
380000317500.0514.31410.02CC4636.04 years0...7.02.08.014.02.010.00.00.00.03.0
480000435000.0317.091249.42DD1368446.0< 1 year1...11.03.016.018.011.019.00.00.00.01.0

5 rows × 48 columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

#1.文件读取
# os.getcwd  获取当前工作目录
data_train = pd.read_csv(r'/home/corn/桌面/tianchifengkong/train.csv')
data_test_a = pd.read_csv(r'/home/corn/桌面/tianchifengkong/testA.csv')

#1.1.文件量大的时候通过nrows参数输出前几行样例、或者设置chunksize分块读取

# data_train_sample = pd.read_csv("./train.csv",nrows=5)

# chunker = pd.read_csv("./train.csv",chunksize=5)
# for item in chunker:
# print(type(item))
# #<class 'pandas.core.frame.DataFrame'>
# print(len(item))

#2.数据总体了解(数据缺失值、唯一值、数据类型等)
# data_train.info() 
# data_train.describe()
# data_train.head(3).append(data_train.tail(3))

# nan可视化  
# 主要的目的在于 nan存在的个数是否真的很大,如果很小一般选择填充,如果使用lgb等树模型可以直接空缺,
# 让树自己去优化,但如果nan存在的过多、可以考虑删掉查看训练集测试集中特征属性只有一值的特征
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          






 

<AxesSubplot:>

在这里插入图片描述

#3.特征类型分析及数值分析
#数值分析常用分箱:特征分箱主要是为了降低变量的复杂性,减少变量噪音对模型的影响,提高自变量和因变量的相关度。
#3.1.数值型特征分析

#分离出数值型特征变量
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)  #数值型
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns))) #类别型

#分离出连续性特征变量、离散型特征变量
def get_numerical_serial_fea(data,feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
numerical_serial_fea

['id',
 'loanAmnt',
 'interestRate',
 'installment',
 'employmentTitle',
 'annualIncome',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'title',
 'n0',
 'n1',
 'n2',
 'n2.1',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n13',
 'n14']
#3.2.连续性变量可视化分析
%matplotlib inline

f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

在这里插入图片描述

#3.3.特征标准化  可用log()     

#Ploting Transaction Amount Values Distribution
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
Text(0, 0.5, 'Probability')

在这里插入图片描述

#4.特征分布可视化

train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud') 
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh',ax=ax4, title='Count of employmentLength non-fraud')
plt.show()

在这里插入图片描述

#4.1.时间数据查看  时间戳、时间互转

#转化成时间格式
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x-startdate).dt.days

#转化成时间格式
data_test_a['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x-startdate).dt.days

plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
#train 和 test issueDateDT 日期有重叠 所以使用基于时间的分割进行验证是不明智的

在这里插入图片描述

#4.2 数据透视图方法查看数据  pivot_table

#透视图 索引可以有多个,“columns(列)”是可选的,聚合函数aggfunc最后是被应用到了变量“values”中你所列举的项目上。
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
pivot
loanAmnt
issueDateDT0306192122153183214245274...3926395739874018404840794110414041714201
grade
ANaN53650.042000.019500.034425.063950.043500.0168825.085600.0101825.0...13093850.011757325.011945975.09144000.07977650.06888900.05109800.03919275.02694025.02245625.0
BNaN13000.024000.032125.07025.095750.0164300.0303175.0434425.0538450.0...16863100.017275175.016217500.011431350.08967750.07572725.04884600.04329400.03922575.03257100.0
CNaN68750.08175.010000.061800.052550.0175375.0151100.0243725.0393150.0...17502375.017471500.016111225.011973675.010184450.07765000.05354450.04552600.02870050.02246250.0
DNaNNaN5500.02850.028625.0NaN167975.0171325.0192900.0269325.0...11403075.010964150.010747675.07082050.07189625.05195700.03455175.03038500.02452375.01771750.0
E7500.0NaN10000.0NaN17975.01500.094375.0116450.042000.0139775.0...3983050.03410125.03107150.02341825.02225675.01643675.01091025.01131625.0883950.0802425.0
FNaNNaN31250.02125.0NaNNaNNaN49000.027000.043000.0...1074175.0868925.0761675.0685325.0665750.0685200.0316700.0315075.072300.0NaN
GNaNNaNNaNNaNNaNNaNNaN24625.0NaNNaN...56100.0243275.0224825.064050.0198575.0245825.053125.023750.025100.01000.0

7 rows × 139 columns

#导出EDA报告

import pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./EDAreport.html")

---------------------------------------------------------------------------

ModuleNotFoundError                       Traceback (most recent call last)

<ipython-input-36-a7a73771a4c1> in <module>()
----> 1 import pandas_profiling
      2 pfr = pandas_profiling.ProfileReport(data_train)
      3 pfr.to_file("./EDAreport.html")


ModuleNotFoundError: No module named 'pandas_profiling'
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值