文章目录
一 任务
数据类型转换和缺失值处理(尝试不同的填充看效果)以及及其他你能借鉴的数据探索。
建议:将数据探索的过程记录下来,例如:
1.为什么要这样转换,转换后对结果有没有影响
2.记录下转换过程中遇到的问题
二 数据探索
一 数据中的变量
字段 | 类型 | 说明 |
---|---|---|
custid | X | |
trade_no | X | |
bank_card_no | X | |
low_volume_percent | 浮点 | |
middle_volume_percent | 浮点 | |
take_amount_in_later_12_month_highest | 整型 | |
trans_amount_increase_rate_lately | ||
trans_activity_month | ||
trans_activity_day | ||
transd_mcc | ||
trans_days_interval_filter | ||
trans_days_interval | ||
regional_mobility | ||
student_feature | ||
repayment_capability | ||
is_high_user | ||
number_of_trans_from_2011 | ||
first_transaction_time | ||
historical_trans_amount | ||
historical_trans_day | ||
rank_trad_1_month | ||
trans_amount_3_month | ||
avg_consume_less_12_valid_month | ||
abs | ||
top_trans_count_last_1_month | ||
avg_price_last_12_month | ||
avg_price_top_last_12_valid_month | ||
reg_preference_for_trad | ||
trans_top_time_last_1_month | ||
trans_top_time_last_6_month | ||
consume_top_time_last_1_month | ||
consume_top_time_last_6_month | ||
cross_consume_count_last_1_month | ||
trans_fail_top_count_enum_last_1_month | ||
trans_fail_top_count_enum_last_6_month | ||
trans_fail_top_count_enum_last_12_month | ||
consume_mini_time_last_1_month | ||
max_cumulative_consume_later_1_month | ||
max_consume_count_later_6_month | ||
railway_consume_count_last_12_month | ||
pawns_auctions_trusts_consume_last_1_month | ||
pawns_auctions_trusts_consume_last_6_month | ||
jewelry_consume_count_last_6_month | ||
status | ||
source | X | |
first_transaction_day | ||
trans_day_last_12_month | ||
id_name | X | |
apply_score | ||
apply_credibility | ||
query_org_count | ||
query_finance_count | ||
query_cash_count | ||
query_sum_count | ||
latest_query_time | ||
latest_one_month_apply | ||
latest_three_month_apply | ||
latest_six_month_apply | ||
loans_score | ||
loans_credibility_behavior | ||
loans_count | ||
loans_settle_count | ||
loans_overdue_count | ||
loans_org_count_behavior | ||
consfin_org_count_behavior | ||
loans_cash_count | ||
latest_one_month_loan | ||
latest_three_month_loan | ||
latest_six_month_loan | ||
history_suc_fee | ||
history_fail_fee | ||
latest_one_month_suc | ||
latest_one_month_fail | ||
loans_long_time | ||
loans_latest_time | ||
loans_credit_limit | ||
loans_credibility_limit | ||
loans_org_count_current | ||
loans_product_count | ||
loans_max_limit | ||
loans_avg_limit | ||
consfin_credit_limit | ||
consfin_credibility | ||
consfin_org_count_current | ||
consfin_product_count | ||
consfin_max_limit | ||
consfin_avg_limit | ||
latest_query_day | ||
loans_latest_day |
二 数据预处理
下面是在建立预测模型的时候,需要知道的关于清洗和预处理数据的步骤
1.变量识别
2.单变量分析
3.双变量分析
4.缺失值处理
5.异常值处理
6.变量转换
7.产生新变量
在实际工作中我们会反复迭代步骤4-7,直到我们得到一个满意的模型。
导包
import pandas as pd
data_all = pd.read_csv('./data.csv', encoding='gbk')
data_all.head()
移除模型无关信息项和status
# 移除模型无关信息项和status
# 移除每一列的信息都一样的
data_all.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)
# data_all.drop(columns=['custid','trade_no','bank_card_no','id_name'])
# data_all.drop(columns='status')
# data_df = pd.DataFrame(data_all)
# data_df.ix[:, (data_df != data_df.ix[0]).any()]
# data_df.shape[1]
# data_df = pd.DataFrame(data_all)
# data_df.drop(['custid', 'trade_no', 'bank_card_no', 'id_name','status'], 1, inplace=True)
# # data_df.shape[1]
# data_df.ix[:,(data_df != data_df.ix[0]).any()]
# data_df.shape[1]
X = data_all.drop(labels='status',axis=1)
L = []
for col in X:
if len(X[col].unique()) == 1:
L.append(col)
for col in L:
X.drop(col, axis=1, inplace=True)
将城市的汉字转为数字
# 将城市的汉字转为数字
data_all['reg_preference_for_trad'].unique()
# data_all.replace({'一线城市':1,'三线城市':3,'二线城市':2,'境外':4,'其他城市':5})
data_all['reg_preference_for_trad'].isnull().sum()
#缺失值较少,考虑将其填充
# method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
# pad/ffill:用前一个非缺失值去填充该缺失值
# backfill/bfill:用下一个非缺失值填充该缺失值
# None:指定一个值去替换缺失值
data_all['reg_preference_for_trad'].fillna(method='ffill')
data_all.replace({'一线城市':1,'三线城市':3,'二线城市':2,'境外':4,'其他城市':5})
缺失值较多的student_feature的处理
# 关于student_feature的处理
data_all['student_feature'].unique()
data_all['student_feature'].isnull().sum()
data_all.drop(labels='student_feature', axis=1, inplace=True)
日期型特征处理
#将latest_query_time和loans_latest_time离散化为年,月,日三个特征
data_all['latest_query_time'] = pd.to_datetime(data_all['latest_query_time'])
data_all['latest_query_time_year'] = data_all['latest_query_time'].dt.year
data_all['latest_query_time_month'] = data_all['latest_query_time'].dt.month
data_all['latest_query_time_day'] = data_all['latest_query_time'].dt.day
data_all['loans_latest_time'] = pd.to_datetime(data_all['loans_latest_time'])
data_all['loans_latest_time_year'] = data_all['loans_latest_time'].dt.year
data_all['loans_latest_time_month'] = data_all['loans_latest_time'].dt.month
data_all['loans_latest_time_day'] = data_all['loans_latest_time'].dt.day
data_all.drop(labels=['latest_query_time', 'loans_latest_time'], axis=1, inplace=True)
#对日期缺失值进行众数填充
#常用方法有三种:删除,补全和忽略
data_all['latest_query_time_year'].fillna(data_all['latest_query_time_year'].mode(), inplace=True)
data_all['latest_query_time_month'].fillna(data_all['latest_query_time_month'].mode(), inplace=True)
data_all['latest_query_time_day'].fillna(data_all['latest_query_time_day'].mode(), inplace=True)
data_all['loans_latest_time_year'].fillna(data_all['loans_latest_time_year'].mode(), inplace=True)
data_all['loans_latest_time_month'].fillna(data_all['loans_latest_time_month'].mode(), inplace=True)
data_all['loans_latest_time_day'].fillna(data_all['loans_latest_time_day'].mode(), inplace=True)
其他数值特征的处理
#对数值型特征操作,现将之前处理的字符串型特征去除
data_else = data_all.drop(labels=['reg_preference_for_trad','latest_query_time_year','latest_query_time_month','latest_query_time_day',
'loans_latest_time_year','loans_latest_time_month','loans_latest_time_day'], axis=1)
data_else.fillna(data_all.mode(), inplace=True)
三 问题
1.data_df.ix[:,(data_df != data_df.ix[0]).any()] 没能去除列所有项都一样的列,借鉴了别人的去重代码
- 其实我觉得student_feature这个特征应该把那些空的给标0 总觉的空的不是学生 1 or 2 的是学生一样