DataWhale Weekly Algorithm Advanced 1 --- Data Preprocessing (Data Exploration)

1. Task

Convert data types and handle missing values (try different fill strategies and see how they affect the results), plus any other data exploration worth borrowing.

Suggestion: keep a record of the data exploration process, for example:
1. Why a conversion was done this way, and whether it affected the results
2. Problems encountered during the conversion

2. Data Exploration

2.1 Variables in the data
Field list (with type or note where given; an X marks a column unrelated to the model):
custid — X
trade_no — X
bank_card_no — X
low_volume_percent — float
middle_volume_percent — float
take_amount_in_later_12_month_highest — int
trans_amount_increase_rate_lately
trans_activity_month
trans_activity_day
transd_mcc
trans_days_interval_filter
trans_days_interval
regional_mobility
student_feature
repayment_capability
is_high_user
number_of_trans_from_2011
first_transaction_time
historical_trans_amount
historical_trans_day
rank_trad_1_month
trans_amount_3_month
avg_consume_less_12_valid_month
abs
top_trans_count_last_1_month
avg_price_last_12_month
avg_price_top_last_12_valid_month
reg_preference_for_trad
trans_top_time_last_1_month
trans_top_time_last_6_month
consume_top_time_last_1_month
consume_top_time_last_6_month
cross_consume_count_last_1_month
trans_fail_top_count_enum_last_1_month
trans_fail_top_count_enum_last_6_month
trans_fail_top_count_enum_last_12_month
consume_mini_time_last_1_month
max_cumulative_consume_later_1_month
max_consume_count_later_6_month
railway_consume_count_last_12_month
pawns_auctions_trusts_consume_last_1_month
pawns_auctions_trusts_consume_last_6_month
jewelry_consume_count_last_6_month
status
source — X
first_transaction_day
trans_day_last_12_month
id_name — X
apply_score
apply_credibility
query_org_count
query_finance_count
query_cash_count
query_sum_count
latest_query_time
latest_one_month_apply
latest_three_month_apply
latest_six_month_apply
loans_score
loans_credibility_behavior
loans_count
loans_settle_count
loans_overdue_count
loans_org_count_behavior
consfin_org_count_behavior
loans_cash_count
latest_one_month_loan
latest_three_month_loan
latest_six_month_loan
history_suc_fee
history_fail_fee
latest_one_month_suc
latest_one_month_fail
loans_long_time
loans_latest_time
loans_credit_limit
loans_credibility_limit
loans_org_count_current
loans_product_count
loans_max_limit
loans_avg_limit
consfin_credit_limit
consfin_credibility
consfin_org_count_current
consfin_product_count
consfin_max_limit
consfin_avg_limit
latest_query_day
loans_latest_day
2.2 Data preprocessing

These are the cleaning and preprocessing steps you need to know when building a predictive model:
1. Variable identification
2. Univariate analysis
3. Bivariate analysis
4. Missing-value treatment
5. Outlier treatment
6. Variable transformation
7. Creating new variables
In practice, steps 4-7 are iterated over and over until a satisfactory model is obtained.

Import packages and load the data
import pandas as pd
data_all = pd.read_csv('./data.csv', encoding='gbk')  # the raw file is GBK-encoded
data_all.head()
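Before any cleaning, a quick overview of the table helps with step 1 (variable identification); a minimal sketch using only the loaded data_all:
# quick overview: size, column types, and the columns with the most missing values
print(data_all.shape)
print(data_all.dtypes.value_counts())
print(data_all.isnull().sum().sort_values(ascending=False).head(10))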
Remove model-irrelevant columns and status
# remove the columns that carry no information for the model, and status
# also remove columns whose value is the same in every row

data_all.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)
# earlier attempts, kept for reference (the .ix-based version below did not work; see the Problems section):
# data_all.drop(columns=['custid','trade_no','bank_card_no','id_name'])
# data_all.drop(columns='status')
# data_df = pd.DataFrame(data_all)
# data_df.ix[:, (data_df != data_df.ix[0]).any()]
# data_df.shape[1]

# data_df = pd.DataFrame(data_all)
# data_df.drop(['custid', 'trade_no', 'bank_card_no', 'id_name','status'], 1, inplace=True)
# # data_df.shape[1]
# data_df.ix[:,(data_df != data_df.ix[0]).any()]
# data_df.shape[1]

X = data_all.drop(labels='status', axis=1)  # separate the features from the target
# collect and drop columns that contain only a single unique value (they carry no information)
L = []
for col in X:
    if len(X[col].unique()) == 1:
        L.append(col)
for col in L:
    X.drop(col, axis=1, inplace=True)
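An equivalent and arguably more idiomatic way to drop the constant columns, shown here only as an optional sketch:
# keep only columns with more than one distinct value (dropna=False counts NaN as a value)
X = X.loc[:, X.nunique(dropna=False) > 1]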
Convert the city categories from Chinese text to numbers
# map the Chinese city categories in reg_preference_for_trad to numeric codes
data_all['reg_preference_for_trad'].unique()
# data_all.replace({'一线城市':1,'三线城市':3,'二线城市':2,'境外':4,'其他城市':5})
data_all['reg_preference_for_trad'].isnull().sum()
# only a few values are missing, so fill them rather than dropping rows
# method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
# pad/ffill: fill a missing value with the previous non-missing value
# backfill/bfill: fill a missing value with the next non-missing value
# None: replace missing values with a specified value
# note: fillna/replace return new objects, so assign the result back (or use inplace=True)
data_all['reg_preference_for_trad'] = data_all['reg_preference_for_trad'].fillna(method='ffill')
data_all = data_all.replace({'一线城市':1,'三线城市':3,'二线城市':2,'境外':4,'其他城市':5})
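A hedged alternative sketch: instead of mapping the city categories to ordinal codes (which imposes an ordering that may not exist), they could be one-hot encoded with pd.get_dummies. This would be applied in place of the replace above, while the column still holds the original city strings.
# hypothetical alternative to the ordinal mapping: one indicator column per city category
city_dummies = pd.get_dummies(data_all['reg_preference_for_trad'], prefix='city')
# data_all = pd.concat([data_all.drop(columns='reg_preference_for_trad'), city_dummies], axis=1)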
Handling student_feature, which has many missing values
# student_feature handling
data_all['student_feature'].unique()
data_all['student_feature'].isnull().sum()
# a large share of the values are missing, so drop the column for now (see the Problems section for an alternative)
data_all.drop(labels='student_feature', axis=1, inplace=True)
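The Problems section below suggests another option: instead of dropping student_feature, fill the missing entries with 0 on the assumption that a missing value means "not a student" while 1 or 2 marks a student. A minimal sketch, to be used in place of the drop above:
# keep the column and encode missing as 0 (assumption: missing means "not a student")
data_all['student_feature'] = data_all['student_feature'].fillna(0)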
Date feature handling
# split latest_query_time and loans_latest_time into separate year, month and day features
data_all['latest_query_time'] = pd.to_datetime(data_all['latest_query_time'])
data_all['latest_query_time_year'] = data_all['latest_query_time'].dt.year
data_all['latest_query_time_month'] = data_all['latest_query_time'].dt.month
data_all['latest_query_time_day'] = data_all['latest_query_time'].dt.day

data_all['loans_latest_time'] = pd.to_datetime(data_all['loans_latest_time'])
data_all['loans_latest_time_year'] = data_all['loans_latest_time'].dt.year
data_all['loans_latest_time_month'] = data_all['loans_latest_time'].dt.month
data_all['loans_latest_time_day'] = data_all['loans_latest_time'].dt.day
data_all.drop(labels=['latest_query_time', 'loans_latest_time'], axis=1, inplace=True)
# fill missing values in the date-derived features with the mode
# the common options for missing values are: drop, fill, or ignore
# note: Series.mode() returns a Series, so take its first element ([0]);
# otherwise fillna aligns on the index and fills almost nothing
data_all['latest_query_time_year'].fillna(data_all['latest_query_time_year'].mode()[0], inplace=True)
data_all['latest_query_time_month'].fillna(data_all['latest_query_time_month'].mode()[0], inplace=True)
data_all['latest_query_time_day'].fillna(data_all['latest_query_time_day'].mode()[0], inplace=True)

data_all['loans_latest_time_year'].fillna(data_all['loans_latest_time_year'].mode()[0], inplace=True)
data_all['loans_latest_time_month'].fillna(data_all['loans_latest_time_month'].mode()[0], inplace=True)
data_all['loans_latest_time_day'].fillna(data_all['loans_latest_time_day'].mode()[0], inplace=True)
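A quick sanity-check sketch that the date-derived columns contain no missing values after the mode fill:
date_cols = ['latest_query_time_year', 'latest_query_time_month', 'latest_query_time_day',
             'loans_latest_time_year', 'loans_latest_time_month', 'loans_latest_time_day']
print(data_all[date_cols].isnull().sum())  # should be all zeros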
Handling the remaining numeric features
# work on the numeric features: first set aside the categorical and date-derived features handled above
data_else = data_all.drop(labels=['reg_preference_for_trad','latest_query_time_year','latest_query_time_month','latest_query_time_day',
                             'loans_latest_time_year','loans_latest_time_month','loans_latest_time_day'], axis=1)
# DataFrame.mode() returns a DataFrame; take its first row so each column is filled with its own mode
data_else.fillna(data_else.mode().iloc[0], inplace=True)
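The task asks to try different fill strategies and compare their effect. A hedged sketch of building mode-, median- and mean-filled variants for comparison; note that `num` would have to be taken from data_else before the in-place fill above, and the comparison itself (e.g. via a downstream model's score) is left out:
# compare fill strategies on the numeric columns (build these before the in-place fill above)
num = data_else.select_dtypes(include='number')
filled_mode = num.fillna(num.mode().iloc[0])    # most frequent value per column
filled_median = num.fillna(num.median())        # median per column
filled_mean = num.fillna(num.mean())            # mean per column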

3. Problems

1. data_df.ix[:,(data_df != data_df.ix[0]).any()] failed to remove the columns whose values are all identical, so I borrowed someone else's code for dropping them (see the Reference below).

2. I actually think the missing values of student_feature should be labelled 0; it feels like a missing value means "not a student", while a 1 or 2 marks a student.

Reference

https://www.cnblogs.com/wxshi/p/7479578.html
