Data Preprocessing

qq_44723638

Article link: https://blog.csdn.net/qq_44723638/article/details/88908879

Dataset Description

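The data used here is financial data obtained online (already processed, not the raw data). The task is to predict, from each user's historical behavior, whether a loan user will become overdue; the prediction has two classes, overdue and not overdue. The historical behavior data consists mainly of the following parts: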

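For each user, the count, timing, and total amount of all transactions at the bank (deposits, remittances, transfers, purchases, loans, and so on), together with derived features such as maximum, cumulative, and interval values. Some examples: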

number_of_trans_from_2011

historical_trans_amount

jewelry_consume_count_last_6_month

trans_days_interval

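For each user, the number of loans taken at the bank, the amounts involved, the overdue status and overdue amounts, as well as the limits on loan amount and loan duration that the bank grants based on the user's history. Some examples: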

loans_long_time

loans_product_count

loans_credit_limit

loans_credibility_limit

Data Cleaning

Getting Basic Information

import numpy as np
import pandas as pd
file = pd.read_csv("data.csv", encoding="gbk")  # read the dataset
file.info()  # print basic information (dtypes and non-null counts)

Output (partial):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
...
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)

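The data has 4754 rows × 90 columns in three dtypes: float (70 columns), integer (13 columns), and object (7 columns), with floats dominating. Some columns have fewer than 4754 non-null entries, which means those columns contain missing values.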
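
The same check can be done programmatically instead of by reading the info() listing. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are assumptions for illustration, not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real data.csv
toy = pd.DataFrame({
    "custid": [1, 2, 3, 4],
    "low_volume_percent": [0.01, np.nan, 0.02, 0.03],
    "latest_query_day": [12.0, np.nan, np.nan, 5.0],
})

# count() returns the number of non-null entries per column;
# a column whose count is below the row count has missing values.
non_null = toy.count()
cols_with_missing = non_null[non_null < len(toy)].index.tolist()
print(cols_with_missing)

# Number of columns per dtype, mirroring the last line of info()
print(toy.dtypes.value_counts())
```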

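Dropping Irrelevant Features
Remove attributes that act as unique identifiers or carry no variation, such as ID fields, auto-increment columns, and columns that contain only a single value; such attributes say nothing about the distribution of the samples.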

(1) Drop columns that contain only one distinct value

newfile = file.copy()  # work on a copy of the data
const_cols = [c for c in newfile.columns if newfile[c].nunique(dropna=False) == 1]
print(const_cols)  # columns that contain a single distinct value

Output:

['bank_card_no', 'source']

These two columns can therefore be dropped.

newfile = newfile.drop(columns=["bank_card_no", "source"])  # drop() returns a new frame, so assign the result back
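
The `dropna=False` argument in the check above matters: with it, a column that holds one value plus missing entries is not treated as constant. A minimal sketch on made-up data (the column values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "source": ["xs", "xs", "xs"],   # truly constant
    "flag": [1.0, np.nan, 1.0],     # one value plus missing entries
    "amount": [10.0, 20.0, 30.0],
})

# NaN counts as its own value, so "flag" has two distinct values here
const_strict = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
# NaN is ignored, so "flag" looks constant
const_loose = [c for c in toy.columns if toy[c].nunique() == 1]

print(const_strict, const_loose)
```

The strict variant keeps columns like `student_feature`, where the missing entries themselves may carry information.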

(2) Drop ID attributes

newfile = newfile.drop(columns=["custid", "trade_no", "id_name"])  # drop customer ID, transaction number, and customer name

Check the share of missing values in each column

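total = newfile.isnull().sum().sort_values(ascending=False)  # number of missing values per column
percent = (newfile.isnull().sum() / newfile.isnull().count()).sort_values(ascending=False)  # fraction missing per column
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data)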

Output:

                                           Total   Percent
student_feature                              2998  0.630627
cross_consume_count_last_1_month              426  0.089609
apply_credibility                             304  0.063946
...
Unnamed: 0                                      0  0.000000
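
A slightly shorter way to build the same kind of table is to take the mean of the boolean mask, since the mean of True/False values is the fraction that is True. This is a sketch on made-up data, not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "apply_credibility": [0.5, 0.7, np.nan, 0.9],
    "custid": [1, 2, 3, 4],
})

total = toy.isnull().sum()       # number of missing values per column
percent = toy.isnull().mean()    # fraction of missing values per column
missing_data = (
    pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    .sort_values("Percent", ascending=False)
)
print(missing_data)
```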

Handling Missing Values

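The results above show that more than 60% of "student_feature" is missing. Inspecting the column shows that the missing entries appear as "NA" while every present entry is "1", so the missing entries presumably mean that "student_feature" is "0". They are therefore all filled with 0.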

newfile["student_feature"]=newfile["student_feature"].fillna(0)
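
The inspection that motivates this fill can be reproduced with value_counts(dropna=False), which lists the missing entries alongside the observed values. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the student_feature column
s = pd.Series([1.0, np.nan, np.nan, 1.0, np.nan], name="student_feature")

print(s.value_counts(dropna=False))   # shows counts for 1.0 and for NaN

observed = set(s.dropna().unique())   # the only non-missing values present
n_missing = int(s.isna().sum())

# If 1 is the only observed value, treating NaN as 0 is a plausible reading
filled = s.fillna(0)
print(filled.tolist())
```

Whether missing really means 0 is an assumption about how the data was collected; the counts only show that it is consistent with the data.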

As noted above, the columns are either numeric or non-numeric, so the two groups are separated.

newfile_num = newfile.select_dtypes(include='number')  # numeric features
newfile_obj = newfile.select_dtypes(exclude='number')  # non-numeric features

Missing numeric values are filled with the column mean:

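# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# sklearn.impute.SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")
imputer.fit(newfile_num)  # learn the mean of each column
fit_num = imputer.transform(newfile_num)  # returns a NumPy array
fit_num = pd.DataFrame(fit_num, columns=newfile_num.columns)  # restore the column names
fit_num.info()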
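
To make the effect of mean imputation concrete, here is a small sketch in plain pandas on made-up numbers. `fillna(df.mean())` fills each column with its own mean, which is the same per-column rule that `strategy="mean"` applies:

```python
import numpy as np
import pandas as pd

# Hypothetical toy numeric frame
toy_num = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

col_means = toy_num.mean()               # NaN is skipped when averaging
toy_filled = toy_num.fillna(col_means)   # each column gets its own mean
print(toy_filled)
```

One visible side effect in the real pipeline is that integer columns come back as float64, because the imputer returns a floating-point array.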

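For non-numeric missing values:
Three non-numeric columns have missing values. Two of them, "latest_query_time" and "loans_latest_time", are dates stored as strings. They are first converted into year, month, and weekday components, and the missing entries of each component are then filled with that component's median.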

newfile_2_obj = pd.DataFrame()
for col in ["latest_query_time", "loans_latest_time"]:
    dates = pd.to_datetime(newfile_obj[col])  # parse the date strings; missing entries become NaT
    for part in ("year", "month", "weekday"):
        values = getattr(dates.dt, part)  # e.g. dates.dt.year; NaN where the date is missing
        # Fill only the missing entries with the median. Assigning values.median()
        # directly to the column would overwrite every row with the same number.
        newfile_2_obj[col + "_" + part] = values.fillna(values.median())
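
The date handling can be checked on a few made-up dates. This sketch parses the strings, extracts the weekday, and fills only the missing entry with the median via fillna; the dates themselves are hypothetical:

```python
import pandas as pd

# Hypothetical date strings with one missing entry
raw = pd.Series(["2018-04-25", None, "2018-05-03"], name="latest_query_time")

dates = pd.to_datetime(raw)     # None becomes NaT
year = dates.dt.year            # float column, NaN where the date is missing
weekday = dates.dt.weekday      # Monday=0 ... Sunday=6

# Only the missing row is replaced; observed rows keep their values
weekday_filled = weekday.fillna(weekday.median())
print(weekday_filled.tolist())
```

Note that the median of a component can be fractional (2.5 here), so it may be worth rounding if integer-valued features are needed downstream.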

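The last non-numeric column is "reg_preference_for_trad". As shown above, it has only two missing values, so they are filled with the most frequent value, "一线城市" (first-tier city):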

newfile_2_obj["reg_preference_for_trad"] = newfile_obj["reg_preference_for_trad"].fillna("一线城市")  # use ASCII quotes; curly quotes are a syntax error

Then, for convenience, the Chinese labels are mapped to numbers:

newfile_2_obj["reg_preference_for_trad"]=newfile_2_obj["reg_preference_for_trad"].map({"境外":1,"一线城市":2,"二线城市":3,"三线城市":4,"其他城市":5})
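
One thing to watch with map() is that any label missing from the dictionary silently becomes NaN. The sketch below, on a made-up series, computes the most frequent value with mode() rather than hard-coding it and then checks that the mapping left no gaps:

```python
import pandas as pd

city_codes = {"境外": 1, "一线城市": 2, "二线城市": 3, "三线城市": 4, "其他城市": 5}

# Hypothetical sample of the reg_preference_for_trad column
s = pd.Series(["一线城市", None, "三线城市", "一线城市"])

mode_value = s.mode()[0]        # most frequent non-missing label
filled = s.fillna(mode_value)
encoded = filled.map(city_codes)

# Any label absent from city_codes would show up as NaN here
unmapped = int(encoded.isnull().sum())
print(encoded.tolist(), unmapped)
```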

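At this point all missing values have been filled, and the data is held in two DataFrames.

Merging and Exporting the Data

The two DataFrames from the previous step are concatenated into a new dataset, which now has no missing values and contains only numeric columns.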

newfile_pro=pd.concat([fit_num,newfile_2_obj],axis=1,sort=False)

The output of newfile_pro.info() (partial) is as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 89 columns):
Unnamed: 0                                    4754 non-null float64
low_volume_percent                            4754 non-null float64
middle_volume_percent                         4754 non-null float64
...
loans_latest_time_year                        4754 non-null float64
loans_latest_time_month                       4754 non-null float64
loans_latest_time_weekday                     4754 non-null float64
reg_preference_for_trad                       4754 non-null int64
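
pd.concat with axis=1 aligns rows by index label, not by position. In the steps above both frames appear to share the default 0 to 4753 index because no rows are dropped, so they line up. The sketch below, on made-up frames, shows what a mismatched index would do:

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0, 3.0]})   # default index 0..2
right = pd.DataFrame({"y": [10, 20, 30]})     # default index 0..2

merged = pd.concat([left, right], axis=1)     # rows line up one to one

shifted = right.copy()
shifted.index = [1, 2, 3]                     # hypothetical mismatched index
misaligned = pd.concat([left, shifted], axis=1)

print(merged.shape, misaligned.shape)
print(int(misaligned.isnull().sum().sum()))   # NaN appears where labels do not match
```

A quick `isnull().sum().sum()` on the merged frame is a cheap way to confirm that the concatenation introduced no new gaps.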

Export the processed data.

newfile_pro.to_csv("newdata.csv",encoding="gbk")
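
By default to_csv also writes the row index as an unnamed first column, which read_csv then loads back as "Unnamed: 0". The sketch below uses an in-memory buffer and a made-up frame to show the difference that index=False makes; whether to keep the index is a design choice, not something the original export requires:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})   # hypothetical data

with_index = io.StringIO()
df.to_csv(with_index)                # index written as an unnamed first column
with_index.seek(0)
back_with = pd.read_csv(with_index)

no_index = io.StringIO()
df.to_csv(no_index, index=False)     # index omitted
no_index.seek(0)
back_without = pd.read_csv(no_index)

print(back_with.columns.tolist())
print(back_without.columns.tolist())
```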

The complete code is as follows:

import numpy as np
import pandas as pd
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

file = pd.read_csv("data.csv", encoding="gbk")
# drop ID attributes and columns with a single distinct value
newfile = file.drop(columns=["custid", "trade_no", "id_name", "bank_card_no", "source"])
# missing student_feature is interpreted as 0
newfile["student_feature"] = newfile["student_feature"].fillna(0)
newfile_num = newfile.select_dtypes(include="number")  # numeric features
newfile_obj = newfile.select_dtypes(exclude="number")  # non-numeric features

# numeric columns: fill missing values with the column mean
imputer = SimpleImputer(strategy="mean")
fit_num = pd.DataFrame(imputer.fit_transform(newfile_num), columns=newfile_num.columns)

# date columns: split into year/month/weekday, then fill missing entries with the median
newfile_2_obj = pd.DataFrame()
for col in ["latest_query_time", "loans_latest_time"]:
    dates = pd.to_datetime(newfile_obj[col])
    for part in ("year", "month", "weekday"):
        values = getattr(dates.dt, part)
        newfile_2_obj[col + "_" + part] = values.fillna(values.median())

# categorical column: fill with the most frequent value, then encode as integers
newfile_2_obj["reg_preference_for_trad"] = newfile_obj["reg_preference_for_trad"].fillna("一线城市")
newfile_2_obj["reg_preference_for_trad"] = newfile_2_obj["reg_preference_for_trad"].map({"境外": 1, "一线城市": 2, "二线城市": 3, "三线城市": 4, "其他城市": 5})

# merge and export
newfile_pro = pd.concat([fit_num, newfile_2_obj], axis=1, sort=False)
newfile_pro.to_csv("newdata.csv", encoding="gbk")