Data Mining Study Group: Data Exploration and Analysis

1.1 Reading the Data

The dataset is in .csv format. We load it with pandas' read_csv() function, with the encoding set to gb18030:

import pandas as pd
users_data = pd.read_csv(data_path,encoding='gb18030')
users_data.head(5)  # view the first 5 rows of the loaded data

[Screenshot: the first five rows of the dataset]
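The loading step can be sketched with a tiny self-contained example (the file name and contents below are illustrative, not the actual dataset): write a small gb18030-encoded CSV, then read it back with the same encoding.

```python
import pandas as pd

# write a tiny gb18030-encoded CSV (illustrative columns only)
sample = 'custid,status\n1,0\n2,1\n'
with open('sample_gb18030.csv', 'w', encoding='gb18030') as f:
    f.write(sample)

# read it back; a wrong encoding here would raise a UnicodeDecodeError
df = pd.read_csv('sample_gb18030.csv', encoding='gb18030')
print(df.shape)  # (2, 2)
```

Matching the encoding on both sides is the point: gb18030 is a superset of GBK/GB2312, so it is a safe choice for Chinese-language CSV exports.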

1.2 Exploring the Data

users_data.info()
print(f'Dataset shape: {users_data.shape}')

The output is as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 89 columns):
custid                                        4754 non-null int64
trade_no                                      4754 non-null float64
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null datetime64[ns]
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null datetime64[ns]
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: datetime64[ns](2), float64(71), int64(12), object(4)
memory usage: 3.2+ MB
Dataset shape: (4754, 89)

From the information above, the dataset contains 4754 instances and 88 features (89 columns including the label), and most features have a fairly complete non-null count. The 'status' column is the label, indicating the user's overdue state: 0 means not overdue, 1 means overdue.
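The per-column completeness that info() reports can also be quantified directly with isnull().sum(). A minimal sketch on a toy frame (the column names are illustrative, mirroring two columns of the real dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'student_feature': [1.0, np.nan, np.nan, 0.0],  # partially missing
    'status':          [0, 1, 0, 1],                # fully populated label
})

# number of missing values per column, the complement of info()'s non-null count
null_counts = df.isnull().sum()
print(null_counts['student_feature'])  # 2
print(null_counts['status'])           # 0
```

Sorting this Series (e.g. `null_counts.sort_values(ascending=False)`) gives a quick ranking of which features need missing-value handling first.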

1.3 Data Cleaning

1.3.1 Removing Duplicate Records

print(f'Shape before deduplication: {users_data.shape}')
users_data.drop_duplicates('custid', inplace=True)
print(f'Shape after deduplication: {users_data.shape}')

Output:

Shape before deduplication: (4754, 89)
Shape after deduplication: (4754, 89)

Since the shape is unchanged, the dataset contains no duplicate records.
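For reference, drop_duplicates('custid') keeps only the first row for each custid value; a toy illustration (data invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'custid': [1, 1, 2], 'status': [0, 1, 0]})

# rows with a repeated custid are dropped; the first occurrence is kept
deduped = df.drop_duplicates('custid')
print(deduped.shape)  # (2, 2)
```

Here the second row (custid=1, status=1) is removed, so on a real dataset it is worth checking that duplicated custid rows are actually identical before deduplicating by key alone.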

1.3.2 Dropping Irrelevant Features

Based on the data-exploration step, the features selected as irrelevant are:

irrelevant_features = ['custid','trade_no','bank_card_no','first_transaction_time','id_name','source','loans_latest_time','latest_query_time']
users_data.drop(irrelevant_features,axis=1,inplace=True)
users_data.head(5)

Output:
[Screenshot: preview of the dataset after dropping the columns] As shown, 80 features remain (81 columns including the label).

1.3.3 Analyzing Data Types

From the info() output, among the 80 features, 'reg_preference_for_trad' has dtype object, 10 features are int, and 69 are float.
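These per-dtype counts can be verified directly with dtypes rather than reading them off the info() listing; a sketch on a toy frame (one illustrative column per dtype):

```python
import pandas as pd

df = pd.DataFrame({
    'reg_preference_for_trad': ['一线城市', '二线城市'],  # object (strings)
    'is_high_user':            [0, 1],                    # int64
    'loans_score':             [0.5, 0.7],                # float64
})

# dtype per column; value_counts() on this Series tallies columns per dtype
print(df.dtypes.astype(str).to_dict())
```

On the real users_data frame, `users_data.dtypes.value_counts()` would print the 1/10/69 split described above in one line.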

1.3.4 Converting Data Types

Here we convert the dtype of 'reg_preference_for_trad'. First, look at all of its categorical values:

df_plot = users_data[['reg_preference_for_trad','status']].groupby('reg_preference_for_trad').count()
df_plot

[Table: instance counts per reg_preference_for_trad category]
The result shows that 'reg_preference_for_trad' takes 5 categorical values. We convert it to int by mapping ['境外', '一线城市', '二线城市', '三线城市', '其他城市'] (overseas, first-tier city, second-tier city, third-tier city, other city) to [6, 4, 3, 2, 1]; a larger value indicates a higher-level consumption scene for the user.

From the info() output, two instances are missing this feature. Here we fill the missing values with the mode:

a = users_data['reg_preference_for_trad'].mode()
print(a[0])
users_data['reg_preference_for_trad'].fillna(a[0],inplace=True)
users_data.info()

Then perform the type conversion:

reg = ['境外', '一线城市', '二线城市', '三线城市', '其他城市']
to_int = [6, 4, 3, 2, 1]
reg_to_int = dict(zip(reg, to_int))
# Series.map looks each category string up in the dict and returns its code
users_data['reg_preference_for_trad'] = users_data['reg_preference_for_trad'].map(reg_to_int)
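The mapping can be sanity-checked on a toy Series first; map() returns NaN for any category not in the dict, which is a quick way to catch typos or unexpected values before converting the real column.

```python
import pandas as pd

reg_to_int = {'境外': 6, '一线城市': 4, '二线城市': 3, '三线城市': 2, '其他城市': 1}

s = pd.Series(['一线城市', '其他城市', '境外'])
mapped = s.map(reg_to_int)  # string categories -> integer codes
print(mapped.tolist())  # [4, 1, 6]
```

If `mapped.isnull().any()` is True after mapping the real column, some category string was missed by the dict and would silently become NaN.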

1.3.5 Handling Missing Values

# find which features contain missing values
users_data.isnull().any()

Output:
[Screenshot: boolean Series flagging which features contain missing values]

# fill missing student_feature values with 0
users_data['student_feature'].fillna(0,inplace=True)
users_data.info()

Here we fill every remaining missing value with the feature's mean. (Doing this without first analyzing each feature is not ideal; it is only to practice the data-processing operation.)

# get all feature names
col_names = users_data.columns.values.tolist()
for col in col_names:
    # fill each column's missing values with its mean, rounded to 2 decimals
    val = round(users_data[col].mean(), 2)
    users_data[col].fillna(val, inplace=True)
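The loop above can also be written as a single vectorized fillna call; a minimal sketch on a toy frame, assuming (as holds after the conversions above) that all remaining columns are numeric:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [2.0, 4.0, np.nan]})

# df.mean() gives one mean per column; passing that Series to fillna
# fills each column's NaNs with its own (rounded) column mean
filled = df.fillna(df.mean().round(2))
print(filled['a'].tolist())  # [1.0, 2.0, 3.0]
print(filled['b'].tolist())  # [2.0, 4.0, 3.0]
```

This does the same work as the loop in one pass and avoids mutating the frame column by column.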