# Algorithm Practice, Advanced: Data Preprocessing
## Preparing the Data
- Note: this dataset is financial data (not raw data; it has already been processed). The task is to predict whether a loan user will become overdue. The "status" column in the table is the label: 0 means not overdue, 1 means overdue.
- Dataset download: https://pan.baidu.com/s/1wO9qJRjnrm8uhaSP67K0lw
Packages needed for this experiment:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
Import the data:
data = pd.read_csv('data.csv', encoding='gbk')  # Chinese text in Excel/CSV exports is often stored as GBK
data.head()
Unnamed: 0 | custid | trade_no | bank_card_no | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | ... | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 2791858 | 20180507115231274000000023057383 | 卡号1 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | ... | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
1 | 10 | 534047 | 20180507121002192000000023073000 | 卡号1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | ... | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
2 | 12 | 2849787 | 20180507125159718000000023114911 | 卡号1 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | ... | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
3 | 13 | 1809708 | 20180507121358683000000388283484 | 卡号1 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | ... | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
4 | 14 | 2499829 | 20180507115448545000000388205844 | 卡号1 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | ... | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
5 rows × 90 columns
- Splitting the data
- Split the data into a feature matrix and a label vector
label = data.status
data = data.drop(['status'], axis=1)  # drop the label column
data.head()
Unnamed: 0 | custid | trade_no | bank_card_no | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | ... | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 2791858 | 20180507115231274000000023057383 | 卡号1 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | ... | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
1 | 10 | 534047 | 20180507121002192000000023073000 | 卡号1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | ... | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
2 | 12 | 2849787 | 20180507125159718000000023114911 | 卡号1 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | ... | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
3 | 13 | 1809708 | 20180507121358683000000388283484 | 卡号1 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | ... | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
4 | 14 | 2499829 | 20180507115448545000000388205844 | 卡号1 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | ... | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
5 rows × 89 columns
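Before any modeling, it is also worth checking how imbalanced the label is, since overdue samples are usually the minority in credit data. A minimal sketch, using a hypothetical stand-in for the real `status` column:

```python
import pandas as pd

# Hypothetical stand-in for the real `status` label column
label = pd.Series([0, 0, 0, 1, 1])

counts = label.value_counts()           # frequency of each class
overdue_ratio = counts[1] / len(label)  # share of overdue (status == 1) samples
print(overdue_ratio)  # 0.4
```

If the ratio turns out to be very skewed, that may later motivate class weights or resampling.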
## Data Preprocessing
- Data type analysis
View the dataset's feature information (float, int, and non-numeric columns):
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 89 columns):
Unnamed: 0 4754 non-null int64
custid 4754 non-null int64
trade_no 4754 non-null object
bank_card_no 4754 non-null object
low_volume_percent 4752 non-null float64
middle_volume_percent 4752 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4751 non-null float64
trans_activity_month 4752 non-null float64
trans_activity_day 4752 non-null float64
transd_mcc 4752 non-null float64
...
loans_avg_limit 4457 non-null float64
consfin_credit_limit 4457 non-null float64
consfin_credibility 4457 non-null float64
consfin_org_count_current 4457 non-null float64
consfin_product_count 4457 non-null float64
consfin_max_limit 4457 non-null float64
consfin_avg_limit 4457 non-null float64
latest_query_day 4450 non-null float64
loans_latest_day 4457 non-null float64
dtypes: float64(70), int64(12), object(7)
memory usage: 3.2+ MB
(only part of the output is shown above): there are 70 float64 columns, 12 int64 columns, and 7 non-numeric (object) columns.
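The numeric/non-numeric split can also be derived from the dtypes themselves via `select_dtypes`, instead of listing the object columns by hand. A minimal sketch on a hypothetical toy frame with the same mixed dtypes:

```python
import pandas as pd

# Toy frame mimicking the mixed dtypes that data.info() reports (hypothetical values)
df = pd.DataFrame({
    'custid': [1, 2],
    'low_volume_percent': [0.01, 0.02],
    'bank_card_no': ['卡号1', '卡号1'],
})

obj_cols = df.select_dtypes(include='object').columns.tolist()
num_cols = df.select_dtypes(exclude='object').columns.tolist()
print(obj_cols)  # ['bank_card_no']
```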
- Removing irrelevant features
- Split the data into numeric and non-numeric columns and handle them separately
object_column = ['trade_no','bank_card_no','reg_preference_for_trad',
                 'source','id_name', 'latest_query_time', 'loans_latest_time']
data_obj = data[object_column]
data_num = data.drop(object_column, axis=1)
Inspect the non-numeric features:
data_obj.describe()
trade_no | bank_card_no | reg_preference_for_trad | source | id_name | latest_query_time | loans_latest_time | |
---|---|---|---|---|---|---|---|
count | 4476 | 4476 | 4474 | 4476 | 4476 | 4450 | 4457 |
unique | 4476 | 1 | 5 | 1 | 4307 | 207 | 232 |
top | 20180507114559071000000023047288 | 卡号1 | 一线城市 | xs | 张龙 | 2018-04-14 | 2018-05-03 |
freq | 1 | 4476 | 3196 | 4476 | 5 | 423 | 134 |
- trade_no is a unique transaction number for each trade, and id_name is the user's name; I consider both to be useless features. The unique count of bank_card_no and source is 1, meaning every sample shares the same value for these two features, so they contribute nothing to training. In summary, drop the four non-numeric features bank_card_no, source, trade_no, and id_name.
data_obj.drop(['trade_no', 'bank_card_no', 'source', 'id_name'], axis=1, inplace=True)
- For the numeric features, the dataset does not document what each feature means, so for now we only manually drop three columns: custid, Unnamed: 0, and student_feature (which has a large number of missing values). The remaining features will be filtered further in later processing steps.
data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1, inplace=True)
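Counting missing values per column is a quick way to spot features like student_feature that are mostly empty. A sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 1.0]})

na_counts = df.isnull().sum()   # number of missing values per column
na_ratio = df.isnull().mean()   # fraction of missing values per column
print(na_counts['b'])  # 2
```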
- Missing value handling
- Deletion: drop rows (samples) or drop columns (features)
- Filling: fill with the mean, median, or mode, or forward-fill with the previous value
- For numeric data, use mean filling:
imputer = SimpleImputer(strategy='mean')
num = imputer.fit_transform(data_num)
data_num = pd.DataFrame(num, columns=data_num.columns)
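The same mean fill can also be done with pandas alone, which keeps the original dtypes and index without a round-trip through a NumPy array. A sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
filled = toy.fillna(toy.mean())  # each NaN replaced by its column mean
print(filled['x'].tolist())  # [1.0, 2.0, 3.0]
```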
For non-numeric data, use forward fill:
data_obj.ffill(inplace=True)
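Forward fill copies the last observed value downward; note that a leading missing value has nothing before it and stays missing. A minimal sketch:

```python
import pandas as pd

# Hypothetical toy series with gaps
s = pd.Series(['a', None, 'b', None])
print(s.ffill().tolist())  # ['a', 'a', 'b', 'b']
```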
- Type conversion for non-numeric data
The reg_preference_for_trad feature must be converted to numeric form; here we use one-hot encoding:
encoder = LabelBinarizer()
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
data_obj.head()
latest_query_time | loans_latest_time | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | |
---|---|---|---|---|---|---|---|
0 | 2018-04-25 | 2018-04-19 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2018-05-03 | 2018-05-05 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2018-05-05 | 2018-05-01 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2018-05-05 | 2018-05-03 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
4 | 2018-04-15 | 2018-01-07 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
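As an alternative to LabelBinarizer, pd.get_dummies produces the same one-hot columns in a single step and names them after the categories automatically. A sketch on hypothetical city values:

```python
import pandas as pd

s = pd.Series(['一线城市', '二线城市', '一线城市'])
dummies = pd.get_dummies(s, dtype=float)  # one column per category, 1.0/0.0 entries
print(list(dummies.columns))     # ['一线城市', '二线城市']
print(dummies.iloc[0].tolist())  # [1.0, 0.0]
```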
- Converting time data
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
data_obj.head()
一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2.0 | 4.0 | 3.0 |
1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 | 5.0 | 5.0 |
2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 3.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 6.0 | 1.0 | 6.0 |
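A quick sanity check on the `.dt` accessors: `dt.weekday` counts from Monday = 0, so 2018-04-25 (a Wednesday) maps to 2, which matches row 0 of the table above.

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(['2018-04-25']))
print(int(ts.dt.month.iloc[0]))    # 4
print(int(ts.dt.weekday.iloc[0]))  # 2 (Monday = 0, so Wednesday = 2)
```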
data_processed = pd.concat([data_num, data_obj], axis=1)
data_processed.head()
low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | loans_latest_day | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.01 | 0.99 | 0.0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | ... | 18.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2.0 | 4.0 | 3.0 |
1 | 0.02 | 0.94 | 2000.0 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | ... | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 | 5.0 | 5.0 |
2 | 0.04 | 0.96 | 0.0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | ... | 6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 |
3 | 0.00 | 0.96 | 2000.0 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | ... | 4.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5.0 | 5.0 | 5.0 | 3.0 |
4 | 0.01 | 0.99 | 0.0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | ... | 120.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 6.0 | 1.0 | 6.0 |
5 rows × 88 columns
- Saving the data
data_saved = pd.concat([data_processed, label], axis=1)
data_saved.to_csv('data_processed.csv', index=False)
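Passing `index=False` keeps the row index out of the file, so no extra Unnamed: 0 column reappears on reload. A quick in-memory round-trip check on a hypothetical toy frame:

```python
import io
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'status': [0, 1]})
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False: no index column written
buf.seek(0)
back = pd.read_csv(buf)
print(back.equals(df))  # True
```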