2021-06-02

vssyu

Article link: https://blog.csdn.net/vssyu/article/details/117464071

# Check for missing values

train_data.isnull().any()

A column shows True if it contains at least one missing value.

train_data.isnull().any().sum()

This counts how many columns contain missing values.
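To make the difference between the two calls concrete, here is a minimal sketch on a made-up frame. The name toy_df and its values are illustrative assumptions, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): "b" and "c" contain NaN, "a" does not.
toy_df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, "x"],
})

has_missing = toy_df.isnull().any()       # one boolean per column
n_missing_cols = int(has_missing.sum())   # each True counts as 1

print(has_missing)
print(n_missing_cols)
```

The first result answers "which columns have gaps", the second "how many columns have gaps".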

missing = train_data.isnull().sum()/len(train_data)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

The bar chart shows the fraction of missing values in each column that has any.
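The same ratio can also be obtained with isnull().mean(), because the mean of a boolean column is the share of True values. A small sketch on made-up data (toy_df is an assumption used only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" is 25% missing, "c" is 75% missing.
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

ratio_sum = toy_df.isnull().sum() / len(toy_df)   # form used in the post
ratio_mean = toy_df.isnull().mean()               # equivalent shorthand

# Keep only columns with gaps, smallest ratio first (non-inplace variant)
missing_ratio = ratio_mean[ratio_mean > 0].sort_values()
print(missing_ratio)
```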

one_value_fea = [col for col in train_data.columns if train_data[col].nunique()<=1]

This lists the columns that contain at most one distinct value.
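One detail worth knowing: nunique() ignores NaN by default, so a column holding a single value plus some NaN still counts as one-valued, and an all-NaN column counts as zero. The sketch below shows this on a made-up frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "constant": [7, 7, 7],
    "constant_or_nan": [1.0, np.nan, 1.0],
    "all_nan": [np.nan, np.nan, np.nan],
    "varied": [1, 2, 3],
})

# Rule from the post: NaN is not counted as a value
one_value_cols = [c for c in toy_df.columns if toy_df[c].nunique() <= 1]

# Stricter variant: treat NaN as its own value
strict_cols = [c for c in toy_df.columns if toy_df[c].nunique(dropna=False) <= 1]

print(one_value_cols)
print(strict_cols)
```

Which variant is appropriate depends on whether missingness itself is considered informative for the model.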

numerical_fea=list(train_data.select_dtypes(exclude=['object']).columns)
category_fea=list(filter(lambda x:x not in numerical_fea ,list(train_data.columns)))

select_dtypes(exclude=['object']).columns selects every non-object column, which serves here as the list of numerical variables.
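Note that excluding object keeps everything else, including boolean and datetime columns. If only true numbers are wanted, include=['number'] is a narrower filter. A minimal sketch with a made-up frame (column names and values are assumptions):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "loanAmnt": [1000.0, 2500.0],
    "grade": pd.Series(["A", "B"], dtype="object"),
    "flag": [True, False],
    "issueDate": pd.to_datetime(["2014-07-01", "2012-08-01"]),
})

# Approach from the post: drop object columns, keep the rest
non_object_cols = list(toy_df.select_dtypes(exclude=["object"]).columns)

# Narrower alternative: keep only numeric dtypes
number_cols = list(toy_df.select_dtypes(include=["number"]).columns)

print(non_object_cols)
print(number_cols)
```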

def get_numerical_serise_fea(data,cols):
  numer_serise=[]
  numer_not_serise=[]
  for col in cols:
    if data[col].nunique()>10:
      numer_serise.append(col)
      continue
    numer_not_serise.append(col)
  return numer_serise,numer_not_serise
numerical_serial_fea,numerical_noserial_fea =get_numerical_serise_fea(train_data,numerical_fea)

This heuristic (more than 10 distinct values) does not necessarily pick out the continuous variables correctly.
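A small sketch of how the rule can go wrong in both directions. The helper split_by_cardinality and the toy columns are hypothetical; the helper only restates the post's rule with the threshold made explicit.

```python
import pandas as pd

def split_by_cardinality(data, cols, threshold=10):
    """Same rule as in the post: more than `threshold` distinct values -> 'continuous'."""
    many, few = [], []
    for col in cols:
        if data[col].nunique() > threshold:
            many.append(col)
        else:
            few.append(col)
    return many, few

toy_df = pd.DataFrame({
    # Discrete code with 12 levels (a month number): not continuous
    "month": list(range(1, 13)),
    # Continuous quantity that happens to take only 3 values in this sample
    "rate": [0.05, 0.07, 0.09] * 4,
})

many, few = split_by_cardinality(toy_df, ["month", "rate"])
print(many, few)
```

Here the discrete month lands in the "continuous" list and the continuous rate does not, so the output is best treated as a first pass to review by hand.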

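import pandas as pd
import seaborn as sns
# sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the closest replacement
f = pd.melt(train_data, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

This plots the distribution of each continuous variable. I don't know how to use FacetGrid properly yet.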
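What FacetGrid needs is easier to see from the reshaping step. pd.melt turns the wide frame into a long one with a "variable" column (the original column name) and a "value" column. FacetGrid(f, col="variable") then draws one panel per distinct entry in "variable", col_wrap=2 starts a new row after two panels, and g.map calls the plotting function on the "value" column of each subset. A sketch of the reshaping alone, on made-up data (toy_df and interestRate are assumptions):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0],
    "interestRate": [5.5, 7.1, 9.9],
})

# Wide -> long: one row per (column name, value) pair
long_df = pd.melt(toy_df, value_vars=["loanAmnt", "interestRate"])
print(long_df)

# Each distinct entry here would become one FacetGrid panel
panels = list(long_df["variable"].unique())
print(panels)
```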

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Plot the distribution of the loan amount, raw and log-transformed
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(train_data['loanAmnt'],fit=norm)
sub_plot_1.set_title("loanAmnt Distribution", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)

plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(train_data['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribution", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)

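The second panel shows the distribution after the log transform. The black line is the fitted normal distribution. To use norm, import it first with from scipy.stats import norm.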
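The effect of the log transform can also be checked numerically through the skewness. The sketch below uses synthetic log-normal data as a stand-in for loanAmnt (the distribution parameters are assumptions chosen only for illustration). np.log needs strictly positive inputs; np.log1p is a common alternative when zeros can occur.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a loan amount column
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=5000), name="loanAmnt")

skew_raw = amounts.skew()            # strongly positive for right-skewed data
skew_log = np.log(amounts).skew()    # close to 0 if the log is roughly normal
skew_log1p = np.log1p(amounts).skew()

print(skew_raw, skew_log, skew_log1p)
```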

# Assumed definitions (not shown in the original post): split by the target, as in the later plots
train_loan_fr = train_data.loc[train_data['isDefault'] == 1]
train_loan_nofr = train_data.loc[train_data['isDefault'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='bar', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='bar', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()

This draws bar charts of category counts: kind='bar' gives vertical bars and kind='barh' gives horizontal bars.

train_data.groupby('employmentLength')['employmentLength'].count()

employmentLength
1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
Name: employmentLength, dtype: int64
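groupby sorts its keys as strings, which is why "10+ years" appears right after "1 year" and "< 1 year" comes last. If a natural order is preferred for plotting, one option is to reindex the counts with an explicit list. In the sketch below the counts are copied from the output above, while the order list is my own assumption about the intended ranking.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series(
    [52489, 262753, 72358, 64152, 47985, 50102, 37254, 35407, 36192, 30272, 64237],
    index=["1 year", "10+ years", "2 years", "3 years", "4 years", "5 years",
           "6 years", "7 years", "8 years", "9 years", "< 1 year"],
    name="employmentLength",
)

# Assumed natural ordering from shortest to longest employment
order = ["< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years",
         "6 years", "7 years", "8 years", "9 years", "10+ years"]

ordered_counts = counts.reindex(order)
print(ordered_counts)
```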

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))
train_data.loc[train_data['isDefault'] == 1]['loanAmnt'].apply(np.log).plot(
    kind='hist', bins=100, title='Log Loan Amt - Fraud', color='r', xlim=(-3, 10), ax=ax1)
train_data.loc[train_data['isDefault'] == 0]['loanAmnt'].apply(np.log).plot(
    kind='hist', bins=100, title='Log Loan Amt - Not Fraud', color='b', xlim=(-3, 10), ax=ax2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
train_data.loc[train_data['isDefault'] == 1]['pubRecBankruptcies'].value_counts().plot(kind='bar', title='fraud', color='r', ax=ax1)
# The original line is cut off after "color="; the 'b' and ax=ax2 below are assumed by analogy with the previous plot
train_data.loc[train_data['isDefault'] == 0]['pubRecBankruptcies'].value_counts().plot(kind='bar', title='notfraud', color='b', ax=ax2)

In classification tasks, plots like this are useful for analyzing how each feature is distributed within each class.
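Because the classes usually differ in size, raw counts are hard to compare across the two panels. One possible complement is to normalize within each class, for example with pd.crosstab(normalize="index"). A sketch on made-up data (the values are assumptions, only the column names come from the post):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "isDefault":          [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "pubRecBankruptcies": [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Rows: class; columns: feature value; each row sums to 1
share = pd.crosstab(
    toy_df["isDefault"], toy_df["pubRecBankruptcies"], normalize="index"
)
print(share)
```

Each row can then be read as "within this class, what share of loans has each feature value".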

total = len(train_data)
total_amt = train_data.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(12,5))
plt.subplot(121)  # 1 row, 2 columns, so two plots in total; the last 1 selects the first plot
plot_tr = sns.countplot(x='isDefault',data=train_data)  # number of rows in each class of 'isDefault'
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15)

percent_amt = (train_data.groupby(['isDefault'])['loanAmnt'].sum())
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt',  dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt  \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total_amt * 100),
            ha="center", fontsize=15)

for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total_amt * 100),
            ha="center", fontsize=15)

This loop adds the percentage labels above the bars.
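The numbers those labels display can be reproduced without any plotting, which is a handy way to check them. A minimal sketch on a made-up frame (the values are assumptions; the column names follow the post):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "isDefault": [0, 0, 0, 1],
    "loanAmnt":  [1000.0, 2000.0, 3000.0, 4000.0],
})

# Share of rows per class, as shown on the count plot
count_pct = toy_df["isDefault"].value_counts(normalize=True).sort_index() * 100

# Share of the total loan amount per class, as shown on the amount plot
amt_by_class = toy_df.groupby("isDefault")["loanAmnt"].sum()
amt_pct = amt_by_class / amt_by_class.sum() * 100

# Same format string as the plotting loop
labels = ["{:1.2f}%".format(v) for v in amt_pct]
print(count_pct)
print(labels)
```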
