算法进阶day1

任务

数据类型转换和缺失值处理(尝试不同的填充看效果)以及及其他你能借鉴的数据探索。

读取数据

import pandas as pd
row_data = pd.read_csv('I:\DatawhaleWeek02\Data\data.csv', encoding='GBK')

数据分析

# 数据分析
# 查看每列的缺失值情况
print(row_data.isnull().sum())
# 查看数据的大小
print(row_data.shape)
# 查看属性
print(row_data.columns)

结果

Unnamed: 0                                  0
custid                                      0
trade_no                                    0
bank_card_no                                0
low_volume_percent                          2
middle_volume_percent                       2
take_amount_in_later_12_month_highest       0
trans_amount_increase_rate_lately           3
trans_activity_month                        2
trans_activity_day                          2
transd_mcc                                  2
trans_days_interval_filter                  8
trans_days_interval                         2
regional_mobility                           2
student_feature                          2998
repayment_capability                        0
is_high_user                                0
number_of_trans_from_2011                   2
first_transaction_time                      2
historical_trans_amount                     0
historical_trans_day                        2
rank_trad_1_month                           2
trans_amount_3_month                        0
avg_consume_less_12_valid_month             2
abs                                         0
top_trans_count_last_1_month                2
avg_price_last_12_month                     0
avg_price_top_last_12_valid_month         104
reg_preference_for_trad                     2
trans_top_time_last_1_month                 8
                                         ... 
loans_credibility_behavior                297
loans_count                               297
loans_settle_count                        297
loans_overdue_count                       297
loans_org_count_behavior                  297
consfin_org_count_behavior                297
loans_cash_count                          297
latest_one_month_loan                     297
latest_three_month_loan                   297
latest_six_month_loan                     297
history_suc_fee                           297
history_fail_fee                          297
latest_one_month_suc                      297
latest_one_month_fail                     297
loans_long_time                           297
loans_latest_time                         297
loans_credit_limit                        297
loans_credibility_limit                   297
loans_org_count_current                   297
loans_product_count                       297
loans_max_limit                           297
loans_avg_limit                           297
consfin_credit_limit                      297
consfin_credibility                       297
consfin_org_count_current                 297
consfin_product_count                     297
consfin_max_limit                         297
consfin_avg_limit                         297
latest_query_day                          304
loans_latest_day                          297
Length: 90, dtype: int64
(4754, 90)
Index(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no',
       'low_volume_percent', 'middle_volume_percent',
       'take_amount_in_later_12_month_highest',
       'trans_amount_increase_rate_lately', 'trans_activity_month',
       'trans_activity_day', 'transd_mcc', 'trans_days_interval_filter',
       'trans_days_interval', 'regional_mobility', 'student_feature',
       'repayment_capability', 'is_high_user', 'number_of_trans_from_2011',
       'first_transaction_time', 'historical_trans_amount',
       'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month',
       'avg_consume_less_12_valid_month', 'abs',
       'top_trans_count_last_1_month', 'avg_price_last_12_month',
       'avg_price_top_last_12_valid_month', 'reg_preference_for_trad',
       'trans_top_time_last_1_month', 'trans_top_time_last_6_month',
       'consume_top_time_last_1_month', 'consume_top_time_last_6_month',
       'cross_consume_count_last_1_month',
       'trans_fail_top_count_enum_last_1_month',
       'trans_fail_top_count_enum_last_6_month',
       'trans_fail_top_count_enum_last_12_month',
       'consume_mini_time_last_1_month',
       'max_cumulative_consume_later_1_month',
       'max_consume_count_later_6_month',
       'railway_consume_count_last_12_month',
       'pawns_auctions_trusts_consume_last_1_month',
       'pawns_auctions_trusts_consume_last_6_month',
       'jewelry_consume_count_last_6_month', 'status', 'source',
       'first_transaction_day', 'trans_day_last_12_month', 'id_name',
       'apply_score', 'apply_credibility', 'query_org_count',
       'query_finance_count', 'query_cash_count', 'query_sum_count',
       'latest_query_time', 'latest_one_month_apply',
       'latest_three_month_apply', 'latest_six_month_apply', 'loans_score',
       'loans_credibility_behavior', 'loans_count', 'loans_settle_count',
       'loans_overdue_count', 'loans_org_count_behavior',
       'consfin_org_count_behavior', 'loans_cash_count',
       'latest_one_month_loan', 'latest_three_month_loan',
       'latest_six_month_loan', 'history_suc_fee', 'history_fail_fee',
       'latest_one_month_suc', 'latest_one_month_fail', 'loans_long_time',
       'loans_latest_time', 'loans_credit_limit', 'loans_credibility_limit',
       'loans_org_count_current', 'loans_product_count', 'loans_max_limit',
       'loans_avg_limit', 'consfin_credit_limit', 'consfin_credibility',
       'consfin_org_count_current', 'consfin_product_count',
       'consfin_max_limit', 'consfin_avg_limit', 'latest_query_day',
       'loans_latest_day'],
      dtype='object')

数据预处理

1.缺失值和异常值的处理

# 1.数据清洗:主要是删除原始数据集中的无关数据、重复数据,平滑噪声数据,
# 筛掉与挖掘主题无关的数据,处理缺失值、异常值等

# 1.1首先对无关数据source,trade_no,id,bank_card_no,id_name,以及缺失值较多的列student_feature进行删除
row_data = row_data.drop(['student_feature', 'custid', 'bank_card_no', 'id_name','trade_no', 'source'], axis=1)

# 1.2对缺失值进行填充:可分为三种方法:删除、插补和不处理。
# 插补方法有均值/中位数/众数,固定值(如0),
# 最近邻插补(kNN),函数模型插补法
# (拉格朗日插值法scipy.interpolate、牛顿插值法、回归法、随机森林法)
row_data.fillna(0)# 将缺失值填充为0
row_data.fillna(row_data.mean()) # 用均值填充缺失值

2.数据类型转换

# 将reg_preference_for_trad设为索引
row_data['reg_preference_for_trad'] = row_data['reg_preference_for_trad'].map(
    {'境外':0,'一线城市':1, '二线城市':2, '三线城市':3})
print(row_data['reg_preference_for_trad'])

划分数据集

from sklearn.model_selection import train_test_split
train_x = row_data.drop(columns=['status'])
train_y = row_data['status']
X_train,X_test,y_train,y_test = train_test_split(train_x,train_y,test_size=0.3,random_state=2018)

遇到的问题

1.读取csv文件的时候出现了不知道源文件是使用什么编码格式的问题,因为使用utf-8发生错误
2.对于异常值和缺失值数据只是使用了比较简单的方法,没有结合场景来分析,本人对于金融数据的处理比较生疏,希望向各位大神学习
3.整个的数据预处理部分肯定是需要改进的,希望能看见大神们是怎么做的

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值