Data processing for LR and NN models
For the data processing aimed at tree-based models, see: https://editor.csdn.net/md?articleId=105156784
Importing the data
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter
%matplotlib inline
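# Load the training and test sets (the files are space-separated)
Train_data = pd.read_csv(r'D:\ershouche\used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv(r'D:\ershouche\used_car_testA_20200313.csv', sep=' ')
# Treat '-' in notRepairedDamage as missing; assign the result back rather than relying on a chained inplace call
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)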
Train_data['notRepairedDamage'].value_counts()
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
# Drop the heavily skewed columns
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 24324
regionCode 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
1. Handling missing values
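From the EDA we know which variables contain missing values: bodyType (4506 missing), fuelType (8680 missing), gearbox (5981 missing), and notRepairedDamage (24324 missing). All four are categorical variables. gearbox and notRepairedDamage are binary (0/1), so we can handle their missing values with the dummy-variable method: after encoding, a row with a missing value simply has 0 in both dummy columns. bodyType and fuelType are also categorical, but they have more categories, which makes the dummy-variable method less suitable. Their missing values account for only about 3.00% and 5.787% of the training set respectively, so for these two variables we can simply drop the rows with missing values.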
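As a quick check of the percentages quoted above, the missing ratio per column can be computed directly. The snippet below is a minimal sketch: `toy` is a hypothetical stand-in for `Train_data`, and the total of 150000 rows is inferred from the notRepairedDamage counts shown earlier (111361 + 14315 + 24324), not stated explicitly in the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Train_data (made-up values).
toy = pd.DataFrame({
    "bodyType": [0.0, 1.0, np.nan, 2.0],
    "fuelType": [0.0, np.nan, np.nan, 1.0],
    "power": [60, 75, 110, 90],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# The same arithmetic with the counts reported in the article.
# n_rows is an assumption inferred from 111361 + 14315 + 24324.
n_rows = 150000
reported = {"bodyType": 4506, "fuelType": 8680}
ratios = {col: n_missing / n_rows for col, n_missing in reported.items()}
for col, ratio in ratios.items():
    print(f"{col}: {ratio:.3%}")
```

On the real data, `Train_data.isnull().mean()` would give the same ratios without hard-coding the counts.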
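1.1 Handling missing values with dummy variables
Dummy variables
Also known as indicator variables, they usually take the value 0 or 1. Introducing dummy variables can make the description of a problem more concise.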
pd.get_dummies()
Parameter columns: the columns to convert into dummy variables.
Parameter prefix: the prefix used for the names of the new dummy columns.
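To make the treatment of missing values concrete, here is a small sketch on a hypothetical toy column (not the competition data). By default `pd.get_dummies` creates no column for NaN, so a missing row ends up as all zeros. The `dummy_na=True` option, which the original does not use, would add an explicit indicator column instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; NaN marks a missing gearbox value.
toy = pd.DataFrame({"gearbox": [0.0, 1.0, np.nan, 1.0]})

# Default behaviour: NaN gets no column of its own,
# so the row with the missing value is 0 in every dummy column.
dummies = pd.get_dummies(toy, columns=["gearbox"],
                         prefix=["gearbox1"], prefix_sep="_", dtype=int)
print(dummies)

# Alternative (not used in the article): dummy_na=True adds
# a separate indicator column for the missing values.
with_na = pd.get_dummies(toy, columns=["gearbox"],
                         prefix=["gearbox1"], prefix_sep="_",
                         dummy_na=True, dtype=int)
print(with_na.columns.tolist())
```

With only two categories, the all-zero pattern is unambiguous, which is why the approach suits gearbox and notRepairedDamage.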
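# Add dummy variables
# Convert gearbox into dummy variables; the new columns are appended at the end of the returned frame
train = pd.get_dummies(Train_data, columns=['gearbox'],
                       prefix=['gearbox1'], prefix_sep='_')
# Keep a temporary copy of the original gearbox column for comparison (deleted again later)
train['gearbox1'] = Train_data['gearbox']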
print(train.isnull().sum())
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
power 0
kilometer 0
notRepairedDamage 24324
regionCode 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
gearbox1_0.0 0
gearbox1_1.0 0
gearbox1 5981
dtype: int64
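# Convert notRepairedDamage into dummy variables in the same way
train1 = pd.get_dummies(train, columns=['notRepairedDamage'],
                        prefix=['notRepairedDamage'], prefix_sep='_')
# Keep a temporary copy of the original notRepairedDamage column for comparison
train1['notRepairedDamage'] = train['notRepairedDamage']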
print(train1.isnull().sum())
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
power 0
kilometer 0
regionCode 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
gearbox1_0.0 0
gearbox1_1.0 0
gearbox1 5981
notRepairedDamage_0.0 0
notRepairedDamage_1.0 0
notRepairedDamage 24324
dtype: int64
# Remove the temporary copies of the original columns, keeping only the dummy columns
del train1['gearbox1']
del train1['notRepairedDamage']
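1.2 Dropping missing values
# dropna must be called; without parentheses train2 would be the method itself, not a DataFrame
train2 = train1.dropna()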
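Note that `dropna()` with no arguments removes every row containing any NaN, which on this data would also drop the single row with a missing model. The sketch below, on a hypothetical toy frame, contrasts that with restricting the check to bodyType and fuelType via the `subset` parameter. Which behaviour is wanted is a judgment call that the original does not spell out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each of rows 1-3 is missing a different column.
toy = pd.DataFrame({
    "bodyType": [0.0, np.nan, 2.0, 3.0],
    "fuelType": [0.0, 1.0, np.nan, 1.0],
    "model": [10.0, 20.0, 30.0, np.nan],
})

# Drop every row that has a NaN in any column.
dropped_any = toy.dropna()

# Only drop rows where bodyType or fuelType is missing;
# the row that is missing only `model` is kept.
dropped_subset = toy.dropna(subset=["bodyType", "fuelType"])

print(len(toy), len(dropped_any), len(dropped_subset))
```

Both calls return a new DataFrame and leave the input unchanged, so the result has to be assigned to a variable.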