Task
Data-type conversion and missing-value handling (try different imputation strategies and compare their effects), plus any other data-exploration techniques worth borrowing.
Reading the data
import pandas as pd
# raw string (r'...') keeps the Windows backslashes from being treated as escapes
row_data = pd.read_csv(r'I:\DatawhaleWeek02\Data\data.csv', encoding='GBK')
Data exploration
# Data exploration
# Count the missing values in each column
print(row_data.isnull().sum())
# Shape of the dataset
print(row_data.shape)
# Column names
print(row_data.columns)
Output
Unnamed: 0 0
custid 0
trade_no 0
bank_card_no 0
low_volume_percent 2
middle_volume_percent 2
take_amount_in_later_12_month_highest 0
trans_amount_increase_rate_lately 3
trans_activity_month 2
trans_activity_day 2
transd_mcc 2
trans_days_interval_filter 8
trans_days_interval 2
regional_mobility 2
student_feature 2998
repayment_capability 0
is_high_user 0
number_of_trans_from_2011 2
first_transaction_time 2
historical_trans_amount 0
historical_trans_day 2
rank_trad_1_month 2
trans_amount_3_month 0
avg_consume_less_12_valid_month 2
abs 0
top_trans_count_last_1_month 2
avg_price_last_12_month 0
avg_price_top_last_12_valid_month 104
reg_preference_for_trad 2
trans_top_time_last_1_month 8
...
loans_credibility_behavior 297
loans_count 297
loans_settle_count 297
loans_overdue_count 297
loans_org_count_behavior 297
consfin_org_count_behavior 297
loans_cash_count 297
latest_one_month_loan 297
latest_three_month_loan 297
latest_six_month_loan 297
history_suc_fee 297
history_fail_fee 297
latest_one_month_suc 297
latest_one_month_fail 297
loans_long_time 297
loans_latest_time 297
loans_credit_limit 297
loans_credibility_limit 297
loans_org_count_current 297
loans_product_count 297
loans_max_limit 297
loans_avg_limit 297
consfin_credit_limit 297
consfin_credibility 297
consfin_org_count_current 297
consfin_product_count 297
consfin_max_limit 297
consfin_avg_limit 297
latest_query_day 304
loans_latest_day 297
Length: 90, dtype: int64
(4754, 90)
Index(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no',
'low_volume_percent', 'middle_volume_percent',
'take_amount_in_later_12_month_highest',
'trans_amount_increase_rate_lately', 'trans_activity_month',
'trans_activity_day', 'transd_mcc', 'trans_days_interval_filter',
'trans_days_interval', 'regional_mobility', 'student_feature',
'repayment_capability', 'is_high_user', 'number_of_trans_from_2011',
'first_transaction_time', 'historical_trans_amount',
'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month',
'avg_consume_less_12_valid_month', 'abs',
'top_trans_count_last_1_month', 'avg_price_last_12_month',
'avg_price_top_last_12_valid_month', 'reg_preference_for_trad',
'trans_top_time_last_1_month', 'trans_top_time_last_6_month',
'consume_top_time_last_1_month', 'consume_top_time_last_6_month',
'cross_consume_count_last_1_month',
'trans_fail_top_count_enum_last_1_month',
'trans_fail_top_count_enum_last_6_month',
'trans_fail_top_count_enum_last_12_month',
'consume_mini_time_last_1_month',
'max_cumulative_consume_later_1_month',
'max_consume_count_later_6_month',
'railway_consume_count_last_12_month',
'pawns_auctions_trusts_consume_last_1_month',
'pawns_auctions_trusts_consume_last_6_month',
'jewelry_consume_count_last_6_month', 'status', 'source',
'first_transaction_day', 'trans_day_last_12_month', 'id_name',
'apply_score', 'apply_credibility', 'query_org_count',
'query_finance_count', 'query_cash_count', 'query_sum_count',
'latest_query_time', 'latest_one_month_apply',
'latest_three_month_apply', 'latest_six_month_apply', 'loans_score',
'loans_credibility_behavior', 'loans_count', 'loans_settle_count',
'loans_overdue_count', 'loans_org_count_behavior',
'consfin_org_count_behavior', 'loans_cash_count',
'latest_one_month_loan', 'latest_three_month_loan',
'latest_six_month_loan', 'history_suc_fee', 'history_fail_fee',
'latest_one_month_suc', 'latest_one_month_fail', 'loans_long_time',
'loans_latest_time', 'loans_credit_limit', 'loans_credibility_limit',
'loans_org_count_current', 'loans_product_count', 'loans_max_limit',
'loans_avg_limit', 'consfin_credit_limit', 'consfin_credibility',
'consfin_org_count_current', 'consfin_product_count',
'consfin_max_limit', 'consfin_avg_limit', 'latest_query_day',
'loans_latest_day'],
dtype='object')
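Since the task also covers data-type conversion, it helps to list which columns pandas read as `object` before deciding what to encode or convert. A minimal sketch on a toy frame (the column names and values here are made up, standing in for `row_data`):

```python
import pandas as pd

# Toy frame standing in for row_data (hypothetical columns and values)
df = pd.DataFrame({'num': [1.0, 2.0],
                   'cat': ['一线城市', '二线城市'],
                   'date': ['20190101', '20190201']})

# object-typed columns usually need encoding or conversion before modelling
obj_cols = df.select_dtypes(include='object').columns.tolist()
print(obj_cols)  # ['cat', 'date']

# a date stored as a string can be converted with pd.to_datetime
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
```

In this dataset, `first_transaction_time` and `latest_query_time` are likely candidates for such conversion.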
Data preprocessing
1. Handling missing values and outliers
# 1. Data cleaning: remove irrelevant and duplicate records, smooth noisy data,
#    filter out columns unrelated to the mining task, and handle missing values and outliers.
# 1.1 Drop the irrelevant columns custid, trade_no, bank_card_no, id_name and source,
#     as well as student_feature, which is missing in 2998 of the 4754 rows.
row_data = row_data.drop(['student_feature', 'custid', 'bank_card_no', 'id_name', 'trade_no', 'source'], axis=1)
# 1.2 Fill the remaining missing values. The options are deletion, imputation, or leaving them as-is.
#     Imputation methods include mean/median/mode, a fixed value (e.g. 0),
#     k-nearest-neighbour imputation, and model-based methods
#     (Lagrange interpolation via scipy.interpolate, Newton interpolation, regression, random forests).
# Note: fillna returns a new DataFrame, so the result must be assigned back
row_data = row_data.fillna(0)  # fill missing values with 0
# row_data = row_data.fillna(row_data.mean(numeric_only=True))  # alternative: fill with column means
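Since the task asks to try different fills and compare their effects, here is a minimal sketch contrasting the strategies listed above on a toy frame (the values are made up, standing in for `row_data`); the kNN variant uses scikit-learn's `KNNImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with missing entries (hypothetical values)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [10.0, 20.0, np.nan, 40.0]})

filled_zero = df.fillna(0)             # constant fill
filled_mean = df.fillna(df.mean())     # column-mean fill
filled_med = df.fillna(df.median())    # column-median fill

# kNN imputation: each missing entry is estimated from the
# k nearest rows, measured on the columns that are present
knn = KNNImputer(n_neighbors=2)
filled_knn = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

print(filled_mean.loc[1, 'a'])  # mean of 1, 3, 4 = 2.666...
```

A sensible way to compare the fills is to train the downstream model on each variant and compare validation scores, rather than judging the imputed values directly.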
2. Data-type conversion
# Map reg_preference_for_trad from city-tier strings to numeric codes
row_data['reg_preference_for_trad'] = row_data['reg_preference_for_trad'].map(
    {'境外': 0, '一线城市': 1, '二线城市': 2, '三线城市': 3})
print(row_data['reg_preference_for_trad'])
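The integer mapping above imposes an artificial order on the city tiers. An alternative worth trying is one-hot encoding with `pd.get_dummies`, sketched here on a toy column standing in for `reg_preference_for_trad`:

```python
import pandas as pd

# Toy column standing in for reg_preference_for_trad (hypothetical rows)
df = pd.DataFrame({'reg_preference_for_trad':
                   ['一线城市', '二线城市', '境外', '三线城市']})

# One-hot encoding: one 0/1 column per city tier, no implied ordering
dummies = pd.get_dummies(df['reg_preference_for_trad'], prefix='city')
print(dummies.shape)  # (4, 4)
```

Whether ordinal codes or one-hot columns work better depends on the model; tree-based models tolerate ordinal codes, while linear models usually prefer one-hot.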
Splitting the dataset
from sklearn.model_selection import train_test_split
train_x = row_data.drop(columns=['status'])
train_y = row_data['status']
X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, test_size=0.3, random_state=2018)
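For a credit-risk target like `status`, it may also be worth passing `stratify` so both splits keep the same class ratio. A minimal sketch on toy stand-ins for `train_x` / `train_y` (the values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for train_x / train_y (hypothetical values)
X = pd.DataFrame({'f': range(20)})
y = pd.Series([0] * 10 + [1] * 10, name='status')

# stratify=y keeps the 0/1 ratio of `status` identical in train and test,
# which matters when the positive class is rare
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y)
print(len(y_te), int(y_te.sum()))  # 6 3
```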
Problems encountered
1. When reading the CSV file I did not know what encoding the source file used, because reading it as utf-8 raised an error.
2. I only handled outliers and missing values with fairly simple methods, without analysing them in the context of the problem; I am inexperienced with financial data and hope to learn from more experienced practitioners.
3. The whole preprocessing stage certainly needs improvement; I would like to see how others approach it.
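For problem 1, one way to guess a file's encoding without third-party libraries is to try a few candidate codecs in order and keep the first that decodes cleanly. `sniff_encoding` below is a hypothetical helper written for this sketch (libraries like `chardet` do this more robustly):

```python
import os
import tempfile

def sniff_encoding(path, candidates=('utf-8', 'gbk', 'gb18030', 'latin-1')):
    """Return the first candidate codec that decodes the file without error."""
    raw = open(path, 'rb').read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# quick check on a GBK-encoded sample file
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as f:
    f.write('客户编号,状态\n123,1\n'.encode('gbk'))
    sample = f.name
print(sniff_encoding(sample))  # gbk
os.remove(sample)
```

Note that `latin-1` decodes any byte sequence, so it only makes sense as the last fallback in the candidate list.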