数据挖掘实战:EDA(Exploratory Data Analysis)数据探索分析

接上篇:数据挖掘实战:二手车交易价格预测
EDA的价值主要在于熟悉数据集,了解数据集,对数据集进行验证来确定所获得数据集可以用于接下来的机
器学习或者深度学习使用。

载入各种数据科学以及可视化库

#coding:utf-8
#导入warnings包,利用过滤器来实现忽略警告语句。
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

载入数据

## 1) 载入训练集和测试集;
Train_data = pd.read_csv('used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('used_car_testA_20200313.csv', sep=' ')
## 2) 简略观察数据(head()+shape)
Train_data.head().append(Train_data.tail())
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
007362004040230.061.00.00.06012.5...0.2356760.1019880.1295490.0228160.097462-2.8818032.804097-2.4208210.7952920.914762
1122622003030140.012.00.00.0015.0...0.2647770.1210040.1357310.0265970.020582-4.9004822.096338-1.030483-1.7226740.245522
221487420040403115.0151.00.00.016312.5...0.2514100.1149120.1651470.0621730.027075-4.8467491.8035591.565330-0.832687-0.229963
337186519960908109.0100.00.01.019315.0...0.2742930.1103000.1219640.0333950.000000-4.5095991.285940-0.501868-2.438353-0.478699
4411108020120103110.051.00.00.0685.0...0.2280360.0732050.0918800.0788190.121534-1.8962400.9107830.9311102.8345181.923482
14999514999516397820000607121.0104.00.01.016315.0...0.2802640.0003100.0484410.0711580.0191741.988114-2.9839730.589167-1.304370-0.302592
14999614999618453520091102116.0110.00.00.012510.0...0.2532170.0007770.0840790.0996810.0793711.839166-2.7746152.5539940.924196-0.272160
1499971499971475872010100360.0111.01.00.0906.0...0.2333530.0007050.1188720.1001180.0979142.439812-1.6306772.2901971.8919220.414931
149998149998459072006031234.0103.01.00.015615.0...0.2563690.0002520.0814790.0835580.0814982.075380-2.6337191.4149370.431981-1.659014
1499991499991776721999020419.0286.00.01.019312.5...0.2844750.0000000.0400720.0625430.0258191.978453-3.1799130.031724-1.483350-0.342674

10 rows × 31 columns

Train_data.shape
(150000, 31)
Test_data.head().append(Test_data.tail())
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
01500006693220111212222.045.01.01.031315.0...0.2644050.1218000.0708990.1065580.078867-7.050969-0.8546264.8001510.620011-3.664654
11500011749601999021119.0210.00.00.07512.5...0.2617450.0000000.0967330.0137050.0523833.679418-0.729039-3.796107-1.541230-0.757055
215000253562009030482.0210.00.00.01097.0...0.2602160.1120810.0780820.0620780.050540-4.9266901.0011060.8265620.1382260.754033
315000350688201004050.000.00.01.01607.0...0.2604660.1067270.0811460.0759710.048268-4.8646370.5054931.8703790.3660381.312775
41500041614281997070326.0142.00.00.07515.0...0.2509990.0000000.0778060.0286000.0817093.616475-0.673236-3.197685-0.025678-0.101290
4999519999520903199605034.044.00.00.011615.0...0.2846640.1300440.0498330.0288070.004616-5.9785111.303174-1.207191-1.981240-0.357695
49996199996708199910110.000.00.00.07515.0...0.2681010.1080950.0660390.0254680.025971-3.9138251.759524-2.075658-1.1548470.169073
4999719999766932004041249.010.01.01.022415.0...0.2694320.1057240.1176520.0574790.015669-4.6390650.6547131.137756-1.3905310.254420
49998199998969002002000827.010.00.01.033415.0...0.2611520.0004900.1373660.0862160.0513831.833504-2.8286872.465630-0.911682-2.057353
4999919999919338420041109166.061.0NaN1.0689.0...0.2287300.0003000.1035340.0806250.1242642.914571-1.1352700.5476282.094057-1.552150

10 rows × 30 columns

Test_data.shape
(50000, 30)

总览数据概况

  1. describe种有每列的统计量,个数count、平均值mean、方差std、最小值min、中位数25% 50% 75% 、以
    及最大值 看这个信息主要是瞬间掌握数据的大概的范围以及每个值的异常值的判断,比如有的时候会发现
    999 9999 -1 等值这些其实都是nan的另外一种表达方式,有的时候需要注意下
  2. info 通过info来了解数据每列的type,有助于了解是否存在除了nan以外的特殊符号异常
## 1) 通过describe()来熟悉数据的相关统计量
Train_data.describe()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
count150000.000000150000.0000001.500000e+05149999.000000150000.000000145494.000000141320.000000144019.000000150000.000000150000.000000...150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000
mean74999.50000068349.1728732.003417e+0747.1290218.0527331.7923690.3758420.224943119.31654712.597160...0.2482040.0449230.1246920.0581440.061996-0.0010000.0090350.0048130.000313-0.000688
std43301.41452761103.8750955.364988e+0449.5360407.8649561.7606400.5486770.417546177.1684193.919576...0.0458040.0517430.2014100.0291860.0356923.7723863.2860712.5174781.2889881.038685
min0.0000000.0000001.991000e+070.0000000.0000000.0000000.0000000.0000000.0000000.500000...0.0000000.0000000.0000000.0000000.000000-9.168192-5.558207-9.639552-4.153899-6.546556
25%37499.75000011156.0000001.999091e+0710.0000001.0000000.0000000.0000000.00000075.00000012.500000...0.2436150.0000380.0624740.0353340.033930-3.722303-1.951543-1.871846-1.057789-0.437034
50%74999.50000051638.0000002.003091e+0730.0000006.0000001.0000000.0000000.000000110.00000015.000000...0.2577980.0008120.0958660.0570140.0584841.624076-0.358053-0.130753-0.0362450.141246
75%112499.250000118841.2500002.007111e+0766.00000013.0000003.0000001.0000000.000000150.00000015.000000...0.2652970.1020090.1252430.0793820.0874912.8443571.2550221.7769330.9428130.680378
max149999.000000196812.0000002.015121e+07247.00000039.0000007.0000006.0000001.00000019312.00000015.000000...0.2918380.1514201.4049360.1607910.22278712.35701118.81904213.84779211.1476698.658418

8 rows × 30 columns

Test_data.describe()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
count50000.00000050000.0000005.000000e+0450000.00000050000.00000048587.00000047107.00000048090.00000050000.00000050000.000000...50000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.000000
mean174999.50000068542.2232802.003393e+0746.8445208.0562401.7821850.3734050.224350119.88362012.595580...0.2486690.0450210.1227440.0579970.062000-0.017855-0.013742-0.013554-0.0031470.001516
std14433.90106761052.8081335.368870e+0449.4695487.8194771.7607360.5464420.417158185.0973873.908979...0.0446010.0517660.1959720.0292110.0356533.7479853.2312582.5159621.2865971.027360
min150000.0000000.0000001.991000e+070.0000000.0000000.0000000.0000000.0000000.0000000.500000...0.0000000.0000000.0000000.0000000.000000-9.160049-5.411964-8.916949-4.123333-6.112667
25%162499.75000011203.5000001.999091e+0710.0000001.0000000.0000000.0000000.00000075.00000012.500000...0.2437620.0000440.0626440.0350840.033714-3.700121-1.971325-1.876703-1.060428-0.437920
50%174999.50000052248.5000002.003091e+0729.0000006.0000001.0000000.0000000.000000109.00000015.000000...0.2578770.0008150.0958280.0570840.0587641.613212-0.355843-0.142779-0.0359560.138799
75%187499.250000118856.5000002.007110e+0765.00000013.0000003.0000001.0000000.000000150.00000015.000000...0.2653280.1020250.1254380.0790770.0874892.8327081.2629141.7643350.9414690.681163
max199999.000000196805.0000002.015121e+07246.00000039.0000007.0000006.0000001.00000020000.00000015.000000...0.2916180.1532651.3588130.1563550.21477512.33887218.85621812.9504985.9132732.624622

8 rows × 29 columns

## 2) 通过info()来熟悉数据类型
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48587 non-null  float64
 6   fuelType           47107 non-null  float64
 7   gearbox            48090 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB

判断数据缺失和异常

## 1) 查看每列的存在nan情况
Train_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
Test_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
# nan可视化
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0a25c3c8>

在这里插入图片描述

通过以上两句可以很直观的了解哪些列存在 “nan”, 并可以把nan的个数打印,主要的目的在于 nan存在的个数是
否真的很大,如果很小一般选择填充,如果使用lgb等树模型可以直接空缺,让树自己去优化,但如果nan存在的
过多、可以考虑删掉

# 可视化看下缺省值
msno.matrix(Train_data.sample(250))
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0a71add8>

在这里插入图片描述

msno.bar(Train_data.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0ad32208>

在这里插入图片描述

# 可视化看下缺省值
msno.matrix(Test_data.sample(250))
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0a9e7c50>

在这里插入图片描述

msno.bar(Test_data.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0aace5f8>

在这里插入图片描述

测试集的缺省和训练集的差不多情况, 可视化有四列有缺省,notRepairedDamage缺省得最多

## 2) 查看异常值检测
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

可以发现除了notRepairedDamage 为object类型其他都为数字 这里我们把他的几个不同的值都进行显示就知道了

Train_data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

可以看出来‘ - ’也为空缺值,因为很多模型对nan有直接的处理,这里我们先不做处理,先替换成nan

Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
Train_data['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0    37249
-       8031
1.0     4720
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage'].replace('-', np.nan, inplace=True)

以下两个类别特征严重倾斜,一般不会对预测有什么帮助,故这边先删掉,当然你也可以继续挖掘,但是一般意
义不大

Train_data["seller"].value_counts()
0    149999
1         1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    150000
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]

了解预测值的分布

Train_data['price']
0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500      2337
1500     2158
1200     1922
1000     1850
2500     1821
         ... 
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64
## 1) 总体分布概况(无界约翰逊分布等)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0b58d780>

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

价格不服从正态分布,所以在进行回归之前,它必须进行转换。虽然对数变换做得很好,但最佳拟合是无界约翰
逊分布

## 2) 查看skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis: 18.995183

在这里插入图片描述

Train_data.skew(), Train_data.kurt()
(SaleID               6.017846e-17
 name                 5.576058e-01
 regDate              2.849508e-02
 model                1.484388e+00
 brand                1.150760e+00
 bodyType             9.915299e-01
 fuelType             1.595486e+00
 gearbox              1.317514e+00
 power                6.586318e+01
 kilometer           -1.525921e+00
 notRepairedDamage    2.430640e+00
 regionCode           6.888812e-01
 creatDate           -7.901331e+01
 price                3.346487e+00
 v_0                 -1.316712e+00
 v_1                  3.594543e-01
 v_2                  4.842556e+00
 v_3                  1.062920e-01
 v_4                  3.679890e-01
 v_5                 -4.737094e+00
 v_6                  3.680730e-01
 v_7                  5.130233e+00
 v_8                  2.046133e-01
 v_9                  4.195007e-01
 v_10                 2.522046e-02
 v_11                 3.029146e+00
 v_12                 3.653576e-01
 v_13                 2.679152e-01
 v_14                -1.186355e+00
 dtype: float64,
 SaleID                 -1.200000
 name                   -1.039945
 regDate                -0.697308
 model                   1.740483
 brand                   1.076201
 bodyType                0.206937
 fuelType                5.880049
 gearbox                -0.264161
 power                5733.451054
 kilometer               1.141934
 notRepairedDamage       3.908072
 regionCode             -0.340832
 creatDate            6881.080328
 price                  18.995183
 v_0                     3.993841
 v_1                    -1.753017
 v_2                    23.860591
 v_3                    -0.418006
 v_4                    -0.197295
 v_5                    22.934081
 v_6                    -1.742567
 v_7                    25.845489
 v_8                    -0.636225
 v_9                    -0.321491
 v_10                   -0.577935
 v_11                   12.568731
 v_12                    0.268937
 v_13                   -0.438274
 v_14                    2.393526
 dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0c47db70>

在这里插入图片描述

sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtness')
<matplotlib.axes._subplots.AxesSubplot at 0x1fd0c53efd0>

在这里插入图片描述

## 3) 查看预测值的具体频数
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

在这里插入图片描述

# log变换 z之后的分布较均匀,可以进行log变换进行预测,这也是预测问题常用的trick
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

在这里插入图片描述

特征分为类别特征和数字特征,并对类别特征查看unique分布

# 分离label即预测值
Y_train = Train_data['price']
# 这个区别方式适用于没有直接label coding的数据
# 这里不适用,需要人为根据实际含义来区分
# 数字特征
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # 类型特征
# categorical_features = Train_data.select_dtypes(include=[np.object])
# categorical_features.columns
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
# 特征nunique分布
for cat_fea in categorical_features:
    print(cat_fea + "的特征分布如下:")
    print("{}特征有个{}不同的值".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
name的特征分布如下:
name特征有个99662不同的值
708       282
387       282
55        280
1541      263
203       233
         ... 
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
model的特征分布如下:
model特征有个248不同的值
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
brand的特征分布如下:
brand特征有个40不同的值
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
bodyType的特征分布如下:
bodyType特征有个8不同的值
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
fuelType的特征分布如下:
fuelType特征有个7不同的值
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
gearbox的特征分布如下:
gearbox特征有个2不同的值
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
notRepairedDamage的特征分布如下:
notRepairedDamage特征有个2不同的值
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
regionCode的特征分布如下:
regionCode特征有个7905不同的值
419     369
764     258
125     137
176     136
462     134
       ... 
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64
# 特征nunique分布
for cat_fea in categorical_features:
    print(cat_fea + "的特征分布如下:")
    print("{}特征有个{}不同的值".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
name的特征分布如下:
name特征有个37453不同的值
55       97
708      96
387      95
1541     88
713      74
         ..
22270     1
89855     1
42752     1
48899     1
11808     1
Name: name, Length: 37453, dtype: int64
model的特征分布如下:
model特征有个247不同的值
0.0      3896
19.0     3245
4.0      3007
1.0      1981
29.0     1742
         ... 
242.0       1
240.0       1
244.0       1
243.0       1
246.0       1
Name: model, Length: 247, dtype: int64
brand的特征分布如下:
brand特征有个40不同的值
0     10348
4      5763
14     5314
10     4766
1      4532
6      3502
9      2423
5      1569
13     1245
11      919
7       795
3       773
16      771
8       704
25      695
27      650
21      544
15      511
20      450
19      450
12      389
22      363
30      324
17      317
26      303
24      268
28      225
32      193
29      117
31      115
18      106
2       104
37       92
34       77
33       76
36       67
23       62
35       53
38       23
39        2
Name: brand, dtype: int64
bodyType的特征分布如下:
bodyType特征有个8不同的值
0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
fuelType的特征分布如下:
fuelType特征有个7不同的值
0.0    30656
1.0    15544
2.0      774
3.0       72
4.0       37
6.0       14
5.0       10
Name: fuelType, dtype: int64
gearbox的特征分布如下:
gearbox特征有个2不同的值
0.0    37301
1.0    10789
Name: gearbox, dtype: int64
notRepairedDamage的特征分布如下:
notRepairedDamage特征有个2不同的值
0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
regionCode的特征分布如下:
regionCode特征有个6971不同的值
419     146
764      78
188      52
125      51
759      51
       ... 
7753      1
7463      1
7230      1
826       1
112       1
Name: regionCode, Length: 6971, dtype: int64

数字特征分析

numeric_features.append('price')
numeric_features
['power',
 'kilometer',
 'v_0',
 'v_1',
 'v_2',
 'v_3',
 'v_4',
 'v_5',
 'v_6',
 'v_7',
 'v_8',
 'v_9',
 'v_10',
 'v_11',
 'v_12',
 'v_13',
 'v_14',
 'price']
Train_data.head()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
007362004040230.061.00.00.06012.5...0.2356760.1019880.1295490.0228160.097462-2.8818032.804097-2.4208210.7952920.914762
1122622003030140.012.00.00.0015.0...0.2647770.1210040.1357310.0265970.020582-4.9004822.096338-1.030483-1.7226740.245522
221487420040403115.0151.00.00.016312.5...0.2514100.1149120.1651470.0621730.027075-4.8467491.8035591.565330-0.832687-0.229963
337186519960908109.0100.00.01.019315.0...0.2742930.1103000.1219640.0333950.000000-4.5095991.285940-0.501868-2.438353-0.478699
4411108020120103110.051.00.00.0685.0...0.2280360.0732050.0918800.0788190.121534-1.8962400.9107830.9311102.8345181.923482

5 rows × 29 columns

## 1) 相关性分析
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64 
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x1fd26134b70>

在这里插入图片描述

del price_numeric['price']
## 2) 查看几个特征得 偏度和峰值
for col in numeric_features:
    print('{:15}'.format(col),
    'Skewness: {:05.2f}'.format(Train_data[col].skew()) ,
    ' ' ,
    'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
    )
power           Skewness: 65.86   Kurtosis: 5733.45
kilometer       Skewness: -1.53   Kurtosis: 001.14
v_0             Skewness: -1.32   Kurtosis: 003.99
v_1             Skewness: 00.36   Kurtosis: -01.75
v_2             Skewness: 04.84   Kurtosis: 023.86
v_3             Skewness: 00.11   Kurtosis: -00.42
v_4             Skewness: 00.37   Kurtosis: -00.20
v_5             Skewness: -4.74   Kurtosis: 022.93
v_6             Skewness: 00.37   Kurtosis: -01.74
v_7             Skewness: 05.13   Kurtosis: 025.85
v_8             Skewness: 00.20   Kurtosis: -00.64
v_9             Skewness: 00.42   Kurtosis: -00.32
v_10            Skewness: 00.03   Kurtosis: -00.58
v_11            Skewness: 03.03   Kurtosis: 012.57
v_12            Skewness: 00.37   Kurtosis: 000.27
v_13            Skewness: 00.27   Kurtosis: -00.44
v_14            Skewness: -1.19   Kurtosis: 002.39
price           Skewness: 03.35   Kurtosis: 019.00
## 3) 每个数字特征得分布可视化
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

在这里插入图片描述
可以看出匿名特征相对分布均匀

## 4) 数字特征相互之间的关系可视化
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns],size = 2 ,kind ='scatter',diag_kind='kde')
plt.show()

在这里插入图片描述

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
Y_train
0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
## 5) 多变量互相回归关系可视化
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
<matplotlib.axes._subplots.AxesSubplot at 0x1fd1d2cb358>

在这里插入图片描述

类别特征分析

## 1) unique分布
for fea in categorical_features:
    print(Train_data[fea].nunique())
99662
248
40
8
7
2
2
7905
categorical_features
['name',
 'model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage',
 'regionCode']
## 2) 类别特征箱形图可视化

# 因为 name和 regionCode的类别太稀疏了,这里我们把不稀疏的几类画一下
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(boxplot, "value", "price")

在这里插入图片描述

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
## 3) 类别特征的小提琴图可视化
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述
在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
## 4) 类别特征的柱形图可视化
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(bar_plot, "value", "price")

在这里插入图片描述

##  5) 类别特征的每个类别频数可视化(count_plot)
def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data,  value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(count_plot, "value")

在这里插入图片描述

用pandas_profiling生成数据报告

用pandas_profiling生成一个较为全面的可视化和数据报告(较为简单、方便) 最终打开html文件即可

总结

  • EDA在实际比赛中是最基本的步骤。
  • 后续要做一些特征工程,多查资料理解问题基础上进一步处理,找到强相关特征
  • EDA有利于发现数据的一些特性,数据之间的关联性,对于后续的特征构建是很有帮助。
  • 初步分析:对数据有个大致理解
  • 缺失值/异常值分析:对不同缺失值/异常值做填充或丢弃处理,填充又有多种方式
  • 对于Label做专门的分析,分析标签的分布情况等。
  • 通过作图,更直观的分析特征与label之间的关联
  • 0
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值