Used-Car Price Prediction, Task 02: Exploratory Data Analysis

  • Task 02 covers analyzing and plotting the data


Below is the data analysis, following the tutorial step by step.

# Import packages
import warnings
warnings.filterwarnings('ignore') 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
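  • Load the data
Train_data = pd.read_csv('car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('car_testA_0110.csv', sep=' ')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same head + tail view
pd.concat([Train_data.head(), Train_data.tail()])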
[Output table: rows 0 to 4 and 249995 to 249999; columns SaleID, name, regDate, model, brand, bodyType, fuelType, gearbox, power, kilometer, ..., v_14 to v_23]

10 rows × 40 columns

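  • name - car code
  • regDate - car registration date – ***
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox (transmission)
  • power - engine power
  • kilometer - kilometers driven
  • notRepairedDamage - car has unrepaired damage – ***
  • regionCode - code of the region where the car is viewed
  • seller - seller
  • offerType - offer type
  • creatDate - date the ad was posted
  • price - car price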
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
       'v_22', 'v_23'],
      dtype='object')
# Names of the non-anonymous columns (a plain Python list, not a DataFrame)
Train_data_part = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price']
Train_data_part
['SaleID',
 'name',
 'regDate',
 'model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'power',
 'kilometer',
 'notRepairedDamage',
 'regionCode',
 'seller',
 'offerType',
 'creatDate',
 'price']
Train_data.describe()
              SaleID           name        regDate          model          brand
count  250000.000000  250000.000000   2.500000e+05  250000.000000  250000.000000
mean   185351.790768   83153.362172   2.003401e+07      44.911480       7.785236
std    107121.188763   72540.799964   7.770250e+04      50.640081       7.694010
min         1.000000       0.000000   1.910000e+07       0.000000       0.000000
25%     92501.750000   14500.000000   1.999061e+07       6.000000       1.000000
50%    185264.500000   65314.500000   2.003111e+07      27.000000       6.000000
75%    278128.500000  143761.250000   2.008081e+07      70.000000      11.000000
max    370946.000000  233044.000000   2.019121e+07     250.000000      39.000000

            bodyType       fuelType        gearbox          power      kilometer
count  224620.000000  227510.000000  236487.000000  250000.000000  250000.000000
mean        4.563271       1.665008       0.780783     115.528412      12.577418
std         1.912515       2.339646       0.413717     196.141828       3.990632
min         0.000000       0.000000       0.000000       0.000000       0.500000
25%         3.000000       0.000000       1.000000      70.000000      12.500000
50%         4.000000       0.000000       1.000000     105.000000      15.000000
75%         7.000000       5.000000       1.000000     150.000000      15.000000
max         7.000000       6.000000       1.000000   20000.000000      15.000000

...

                v_14           v_15           v_16           v_17           v_18
count  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000
mean        0.032489       0.030408       0.014725       0.000915       0.006273
std         0.038792       0.049333       8.779163       5.771081       4.880981
min         0.000000       0.000000     -10.412444     -15.538236     -21.009214
25%         0.000129       0.000000      -5.552269      -0.901181      -3.150385
50%         0.001961       0.002567      -3.821770       0.223181      -0.058502
75%         0.075672       0.056568       3.599747       1.263737       2.800475
max         0.130785       0.184340      36.756878      26.134561      23.055660

                v_19           v_20           v_21           v_22           v_23
count  250000.000000  250000.000000  250000.000000  250000.000000  250000.000000
mean        0.006604      -0.001374       0.000609      -0.004025       0.001834
std         4.124722       3.803626       3.555353       2.864713       2.323680
min       -13.989955      -9.599285     -11.181255      -7.671327      -2.350888
25%        -0.478173      -1.727237      -3.067073      -2.092178      -1.402804
50%         0.038427      -0.995044      -0.880587      -1.199807      -1.145588
75%         0.569198       1.563382       3.269987       2.737614       0.044865
max        16.576027      20.324572      14.039422       8.764597       8.574730

8 rows × 40 columns

Test_data.describe()

The max of power in the training set (20000) looks anomalous: it is far above the 75th percentile of 150.
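A quick way to follow up on a suspicious maximum is to count how many rows exceed a plausible cap. The sketch below uses a made-up Series in place of Train_data['power'], and the cap of 600 is purely an assumption for illustration, not a value taken from the tutorial.

```python
import pandas as pd

# Toy stand-in for Train_data['power']; the values are invented for illustration.
power = pd.Series([0, 60, 75, 105, 150, 140, 20000, 75, 1986, 110], name="power")

# Hypothetical cutoff: 600 is an assumption, pick a cap that suits the data.
POWER_CAP = 600

too_high = power[power > POWER_CAP]       # rows above the cap
n_zero = int((power == 0).sum())          # zeros may also be placeholders
clipped = power.clip(upper=POWER_CAP)     # one option: cap instead of dropping

print(too_high.tolist(), n_zero, clipped.max())
```

Whether to clip, drop, or keep such rows depends on the model; tree models are usually less sensitive to a few extreme values than linear ones.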

Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             250000 non-null  int64  
 1   name               250000 non-null  int64  
 2   regDate            250000 non-null  int64  
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64  
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64  
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64  
 12  seller             250000 non-null  int64  
 13  offerType          250000 non-null  int64  
 14  creatDate          250000 non-null  int64  
 15  price              250000 non-null  int64  
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
# Check each column for NaN values
Train_data.isnull()
[Output: Boolean table with the same shape as Train_data. In the rows shown (0 to 4 and 249995 to 249999), the only True entries among the displayed columns are bodyType in row 0, and bodyType and fuelType in row 249995]

250000 rows × 40 columns

Train_data.isnull().sum() # sum() adds up the True values in each column, i.e. the NaN count per column
SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType             25380
fuelType             22490
gearbox              13513
power                    0
kilometer                0
notRepairedDamage    48536
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
v_15                     0
v_16                     0
v_17                     0
v_18                     0
v_19                     0
v_20                     0
v_21                     0
v_22                     0
v_23                     0
dtype: int64
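Counts are easier to judge as fractions of the row total. For instance, 48536 missing values out of 250000 rows means about 19.4% of notRepairedDamage is missing. A minimal sketch on a toy frame (the data is invented) shows that isnull().mean() gives that fraction directly:

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset, values are made up.
df = pd.DataFrame({
    "bodyType": [np.nan, 7.0, 3.0, np.nan],
    "gearbox": [1.0, 0.0, np.nan, 1.0],
    "power": [0, 75, 60, 150],
})

missing_count = df.isnull().sum()    # number of NaN per column
missing_ratio = df.isnull().mean()   # fraction of NaN per column

print(pd.DataFrame({"count": missing_count, "ratio": missing_ratio}))
```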

Visualizing NaN values

missing = Train_data.isnull().sum() # number of NaN values per column
missing = missing[missing > 0] # keep only the columns that have missing values
type(missing)
pandas.core.series.Series
missing
bodyType             25380
fuelType             22490
gearbox              13513
notRepairedDamage    48536
dtype: int64
# inplace=True modifies the Series in place instead of returning a sorted copy
missing.sort_values(inplace=True)
missing # after sorting (ascending); the unsorted version is shown above
gearbox              13513
fuelType             22490
bodyType             25380
notRepairedDamage    48536
dtype: int64
# Bar chart: the x-axis is the feature name, the y-axis is the number of missing values
missing.plot.bar()

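The two statements above show at a glance which columns contain NaN and how many. The main point is to judge whether the NaN count is really large. If it is small, the usual choice is to fill the values in. With tree models such as LightGBM (lgb), the gaps can be left as they are so that the trees handle them. If a column has too many NaN values, consider dropping it.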
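The rule of thumb above can be sketched as follows. The frame is a toy example, the column mostly_missing is hypothetical, and the 50% drop threshold is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gearbox": [1.0, 0.0, np.nan, 1.0, 1.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 2.0],  # hypothetical column
    "power": [0, 75, 60, 150, 105],
})

DROP_THRESHOLD = 0.5  # assumption: drop columns with more than 50% NaN

ratio = df.isnull().mean()
to_drop = ratio[ratio > DROP_THRESHOLD].index.tolist()
cleaned = df.drop(columns=to_drop)

# Few gaps in a categorical-like column: fill with the most frequent value.
cleaned["gearbox"] = cleaned["gearbox"].fillna(cleaned["gearbox"].mode()[0])

print(to_drop, cleaned["gearbox"].tolist())
```

For a tree model that accepts NaN natively, the fillna step could simply be skipped.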

# Visualize the missing values
msno.matrix(Train_data.sample(250))


msno.bar(Train_data.sample(1000))
# Shows which columns have fewer than 1000 non-null values in the 1000-row sample; the count is labeled above each bar


# Visualize the missing values in the test set
msno.matrix(Test_data)


msno.bar(Test_data.sample(1000))


  • The training and test sets show a very similar pattern of missing values
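The similarity can also be checked numerically. From the info() outputs, bodyType is missing in 25380 of 250000 training rows and 5110 of 50000 test rows, roughly 10.2% in both. A small sketch with invented frames puts the two missing fractions side by side:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Train_data and Test_data; values are made up.
train = pd.DataFrame({"bodyType": [np.nan, 7.0, 3.0, 4.0],
                      "gearbox": [1.0, 0.0, 1.0, np.nan]})
test = pd.DataFrame({"bodyType": [7.0, np.nan, 3.0, 4.0],
                     "gearbox": [1.0, np.nan, np.nan, 0.0]})

comparison = pd.DataFrame({
    "train": train.isnull().mean(),
    "test": test.isnull().mean(),
})
comparison["abs_diff"] = (comparison["train"] - comparison["test"]).abs()

print(comparison)
```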

Outlier detection

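  • value_counts() lists the distinct values of a feature column and how often each occurs
# value_counts() lists the distinct values in the column with their counts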
Train_data['notRepairedDamage'].value_counts()
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
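# Train_data.value_counts()
# In the original used-car data this feature is categorical, and '-' also marks a missing value;
# the line below replaces '-' with NaN (left commented out: here the column is already float64)
# Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)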
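For a version of the data where the column does arrive as strings with '-' placeholders, the replacement could look like the sketch below. The Series is invented for illustration; only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: numeric strings plus '-' as a missing-value marker.
raw = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Replace the marker with NaN, then convert the remaining strings to floats.
cleaned = pd.to_numeric(raw.replace("-", np.nan))

print(cleaned.isnull().sum(), cleaned.dtype)
```

Assigning the result back to the column is generally clearer than inplace=True on a column selection, which may not modify the parent frame in recent pandas versions.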

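The following two categorical features are heavily skewed and generally do not help prediction, so we drop them here. You can dig into them further, but it is usually not worthwhile.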

Train_data["seller"].value_counts()
1    249999
0         1
Name: seller, dtype: int64
Test_data["seller"].value_counts()
1    50000
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    249991
1         9
Name: offerType, dtype: int64
Test_data['offerType'].value_counts()
0    49999
1        1
Name: offerType, dtype: int64
del Train_data["seller"] 
del Train_data["offerType"] 
del Test_data["seller"] 
del Test_data["offerType"]
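Instead of inspecting each column by hand, the share of the most frequent value can flag such near-constant features. This is a sketch on a toy frame; the 95% threshold is an assumption, not a value from the tutorial.

```python
import pandas as pd

# Toy frame mimicking two near-constant columns and one ordinary column.
df = pd.DataFrame({
    "seller": [1] * 99 + [0],
    "offerType": [0] * 98 + [1] * 2,
    "gearbox": [1] * 78 + [0] * 22,
})

SKEW_THRESHOLD = 0.95  # assumption: flag columns dominated by a single value

# Share of the most frequent value in each column.
top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
skewed = top_share[top_share > SKEW_THRESHOLD].index.tolist()
df = df.drop(columns=skewed)

print(skewed, list(df.columns))
```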

value_counts() for all features

for f in Train_data.columns:
    print(f)
    print(Train_data[f].value_counts())
SaleID
2049      1
265515    1
277805    1
271662    1
312626    1
         ..
107105    1
113250    1
111203    1
98917     1
2047      1
Name: SaleID, Length: 250000, dtype: int64
name
451       452
73        429
1791      428
821       391
243       346
         ... 
92419       1
88325       1
82182       1
84231       1
157427      1
Name: name, Length: 164312, dtype: int64
regDate
20000010    306
20000001    288
20000002    288
20000007    279
20000008    278
           ... 
19850904      1
19851010      1
19750511      1
19870912      1
19400705      1
Name: regDate, Length: 7537, dtype: int64
model
0.0      20344
6.0      17741
4.0      13837
1.0      13634
12.0      8841
         ...  
226.0        5
245.0        5
243.0        4
249.0        4
250.0        1
Name: model, Length: 251, dtype: int64
brand
0     53699
4     27109
11    26944
10    23762
1     22144
6     17202
9     12210
5      7343
15     6500
12     4704
7      3839
3      3831
17     3543
13     3502
8      3374
28     3161
19     2561
18     2451
16     2274
22     2264
23     2088
14     1892
24     1678
25     1611
20     1610
27     1392
29     1259
34      963
30      604
2       570
31      540
21      522
38      516
35      415
32      406
36      377
33      368
37      324
26      307
39      141
Name: brand, dtype: int64
bodyType
7.0    64571
3.0    53858
4.0    45646
5.0    20343
6.0    15290
2.0    12755
1.0     9882
0.0     2275
Name: bodyType, dtype: int64
fuelType
0.0    150664
5.0     72494
4.0      3577
3.0       385
2.0       183
1.0       147
6.0        60
Name: fuelType, dtype: int64
gearbox
1.0    184645
0.0     51842
Name: gearbox, dtype: int64
power
0        27280
75       16158
60       10765
150      10373
140       9145
         ...  
1986         1
1090         1
10311        1
960          1
3454         1
Name: power, Length: 703, dtype: int64
kilometer
15.0    162161
12.5     25743
10.0     10777
9.0       8424
8.0       7434
7.0       6642
6.0       5859
5.0       5100
0.5       4634
4.0       4204
3.0       4021
2.0       3749
1.0       1252
Name: kilometer, dtype: int64
notRepairedDamage
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
regionCode
487     550
868     424
149     236
539     227
32      216
       ... 
7959      1
8002      1
6715      1
7117      1
4144      1
Name: regionCode, Length: 8081, dtype: int64
creatDate
20160403    9758
20160404    9521
20160320    9176
20160312    8946
20160321    8895
            ... 
20150618       1
20160114       1
20160201       1
20150611       1
20140310       1
Name: creatDate, Length: 107, dtype: int64
price
0        7312
500      3815
1500     3587
1000     3149
1200     3071
         ... 
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
v_0
71.666307    2
72.346416    2
78.107692    2
71.715545    2
73.734706    2
            ..
71.161494    1
70.253614    1
70.797686    1
74.588185    1
77.825581    1
Name: v_0, Length: 249747, dtype: int64
v_1
-1.470958     2
-3.128523     2
-3.224945     2
-3.293795     2
-3.322763     2
             ..
-3.970355     1
 11.487790    1
-3.456756     1
-3.746283     1
-3.579301     1
Name: v_1, Length: 249747, dtype: int64
v_2
-0.527186    2
-0.998414    2
 0.652201    2
 0.356107    2
-9.312859    2
            ..
-0.815970    1
-6.729062    1
-1.683035    1
 0.171102    1
 0.852139    1
Name: v_2, Length: 249747, dtype: int64
v_3
 3.580573     2
-0.633228     2
-2.541859     2
 0.161395     2
 20.571558    2
             ..
 1.067454     1
-0.826230     1
-6.306510     1
 0.201140     1
 4.146853     1
Name: v_3, Length: 249747, dtype: int64
v_4
 2.038620     2
-0.591751     2
-12.603294    2
-0.321072     2
-0.429618     2
             ..
 0.742918     1
 2.722358     1
-0.317880     1
-0.356648     1
 1.327513     1
Name: v_4, Length: 249747, dtype: int64
v_5
 1.273623    2
-1.589295    2
-2.350140    2
 0.080770    2
-2.434300    2
            ..
-1.374641    1
-2.369201    1
-2.194464    1
 1.226827    1
-1.218480    1
Name: v_5, Length: 249747, dtype: int64
v_6
 3.854950    2
-2.337177    2
-2.840736    2
-2.988814    2
 0.912034    2
            ..
-8.718013    1
 3.185567    1
 3.443525    1
-2.653621    1
-3.138425    1
Name: v_6, Length: 249747, dtype: int64
v_7
-2.915058    2
-2.518469    2
-1.175198    2
-3.672233    2
-2.563102    2
            ..
-1.171334    1
-2.324847    1
 4.015706    1
-1.895407    1
-2.156468    1
Name: v_7, Length: 249747, dtype: int64
v_8
0.000000    48244
0.315924        2
0.315905        2
0.314498        2
0.315560        2
            ...  
0.315494        1
0.289243        1
0.316095        1
0.316209        1
0.315702        1
Name: v_8, Length: 201543, dtype: int64
v_9
1.101174    2
0.118624    2
0.164335    2
0.114609    2
0.112811    2
           ..
1.110851    1
1.101634    1
0.116084    1
0.090707    1
0.112558    1
Name: v_9, Length: 249747, dtype: int64
v_10
0.000000    25342
0.081665        2
0.086726        2
0.081616        2
0.081701        2
            ...  
0.089640        1
0.091852        1
0.082066        1
0.081448        1
0.087517        1
Name: v_10, Length: 224427, dtype: int64
v_11
0.000000    7421
0.121584       2
0.102037       2
0.166840       2
0.134519       2
            ... 
0.092895       1
0.108411       1
0.131894       1
0.075781       1
0.078286       1
Name: v_11, Length: 242335, dtype: int64
v_12
0.000000    22426
0.053098        2
0.053437        2
0.055474        2
0.053432        2
            ...  
0.053485        1
0.053471        1
0.055616        1
0.053447        1
0.053329        1
Name: v_12, Length: 227338, dtype: int64
v_13
0.000000    13495
0.130205        2
0.123467        2
0.123337        2
0.130232        2
            ...  
0.123242        1
0.123755        1
0.123252        1
0.123047        1
0.123567        1
Name: v_13, Length: 236266, dtype: int64
v_14
0.000000    53857
0.003751        2
0.000746        2
0.002838        2
0.002283        2
            ...  
0.094690        1
0.000690        1
0.086957        1
0.002928        1
0.083676        1
Name: v_14, Length: 195953, dtype: int64
v_15
0.000000    97223
0.010717        2
0.012704        2
0.143362        2
0.005417        2
            ...  
0.001720        1
0.003263        1
0.007882        1
0.005242        1
0.094839        1
Name: v_15, Length: 152622, dtype: int64
v_16
-3.254226     2
-2.855248     2
-4.373334     2
 7.744677     2
-2.847659     2
             ..
 10.816862    1
-6.670231     1
-6.291694     1
-4.147668     1
-6.389964     1
Name: v_16, Length: 249747, dtype: int64
v_17
 0.498971     2
 0.217974     2
-13.000712    2
-0.675390     2
-1.530593     2
             ..
 1.397213     1
 0.610112     1
 2.335480     1
-1.500048     1
 5.289472     1
Name: v_17, Length: 249747, dtype: int64
v_18
-3.753102    2
 7.731945    2
-0.058593    2
-1.171759    2
-2.045338    2
            ..
-0.860834    1
 0.643066    1
 5.023034    1
-2.016881    1
 6.565301    1
Name: v_18, Length: 249747, dtype: int64
v_19
 0.082562    2
-0.469708    2
-0.138257    2
-0.657417    2
 0.862429    2
            ..
-1.249647    1
-0.664831    1
-0.660867    1
 0.040847    1
 0.700206    1
Name: v_19, Length: 249747, dtype: int64
v_20
-1.214032    2
-2.031659    2
-2.426898    2
-1.542005    2
-0.657360    2
            ..
-2.098785    1
 0.725159    1
-4.682086    1
 0.342639    1
 1.612570    1
Name: v_20, Length: 249747, dtype: int64
v_21
-3.244933    2
-3.440059    2
-4.070917    2
-3.001142    2
-4.153741    2
            ..
-3.289808    1
 4.942931    1
 2.670356    1
-3.793230    1
 3.226273    1
Name: v_21, Length: 249747, dtype: int64
v_22
-2.957315    2
 4.311760    2
-2.101273    2
-0.936764    2
-2.562937    2
            ..
 6.155533    1
-2.379816    1
-1.419529    1
 4.872987    1
 5.062041    1
Name: v_22, Length: 249747, dtype: int64
v_23
-1.044909    2
-1.259228    2
-1.183081    2
-0.989739    2
-0.958179    2
            ..
-1.946367    1
-1.503159    1
-1.175261    1
-0.908192    1
-1.182790    1
Name: v_23, Length: 249747, dtype: int64
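  • bodyType: eight categories
  • fuelType: seven categories
  • gearbox: two categories
  • kilometer: 13 categories
  • notRepairedDamage: two categories
  • seller: two categories, but heavily skewed **
  • offerType: two categories, but heavily skewed **
  • v_8, v_10, v_11, v_12, v_13, v_14, v_15: each has one value (0) that occurs far more often than any other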
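The category counts listed above can be read off nunique() rather than from the long value_counts() printout. A minimal sketch on invented data follows; the limit of 4 distinct values is an assumption sized for this toy frame.

```python
import pandas as pd

# Toy frame; column names mirror the dataset, values are made up.
df = pd.DataFrame({
    "gearbox": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "kilometer": [15.0, 12.5, 15.0, 10.0, 0.5, 15.0],
    "v_0": [71.66, 72.34, 78.10, 71.71, 73.73, 70.25],
})

MAX_CATEGORIES = 4  # assumption: treat columns with few distinct values as categorical

n_unique = df.nunique()   # distinct non-null values per column
categorical_like = n_unique[n_unique <= MAX_CATEGORIES].index.tolist()

print(n_unique.to_dict(), categorical_like)
```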

Understanding the distribution of the target

type(Train_data['price'])
pandas.core.series.Series
Train_data['price'].value_counts()
0        7312
500      3815
1500     3587
1000     3149
1200     3071
         ... 
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
## 1) Distribution of price (Johnson SU, i.e. the unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price'>

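sns.distplot has been deprecated since seaborn 0.11, so on newer installations the three fits can be compared with scipy alone. The sketch below fits each candidate by maximum likelihood and compares log-likelihoods on a synthetic right-skewed sample, which stands in for Train_data['price']. Fixing the lognormal location at 0 is an assumption made to keep the fit stable; distplot itself calls fit() without that constraint.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed sample; a stand-in for the real price column.
y = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

candidates = {
    "johnsonsu": (st.johnsonsu, {}),
    "norm": (st.norm, {}),
    "lognorm": (st.lognorm, {"floc": 0}),  # assumption: location fixed at 0
}

log_likelihood = {}
for name, (dist, fit_kwargs) in candidates.items():
    params = dist.fit(y, **fit_kwargs)                       # maximum-likelihood fit
    log_likelihood[name] = float(np.sum(dist.logpdf(y, *params)))

# A higher log-likelihood means the fitted density explains the sample better.
for name, value in sorted(log_likelihood.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 1))
```

Johnson SU has two more free parameters than the normal distribution, so a higher likelihood alone does not prove it is the better model; it is only a rough numerical counterpart to the visual comparison.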

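figure syntax and usage

(1) figure syntax

figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)

  • num: figure number or name; an integer is a number, a string is a name
  • figsize: width and height of the figure, in inches
  • dpi: resolution of the figure in pixels per inch; the default comes from rcParams['figure.dpi']. For reference, 1 inch is about 2.54 cm, and A4 paper is 21 × 29.7 cm
  • facecolor: background color
  • edgecolor: border color
  • frameon: whether to draw the frame

(2) Example: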

fig=plt.figure(figsize=(4,3),facecolor='blue')
plt.plot([1,2,3,4],[3,5,7,9])
plt.show()
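Since figsize is in inches and dpi is pixels per inch, the pixel size of the figure is their product. The following sketch makes that explicit by setting dpi=100 by hand; the Agg backend line is only there so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 3), dpi=100, facecolor="lightblue")
plt.plot([1, 2, 3, 4], [3, 5, 7, 9])

# inches * (pixels per inch) = pixels
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)

plt.close(fig)
```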


plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=True,rug=True, fit=st.lognorm)
# kde=True
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price', ylabel='Density'>


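Plotting with seaborn

  • seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)

Setting kde to True

  • Kernel density estimation (KDE)

  • Kernel density estimation is used in probability theory to estimate an unknown density function and is a nonparametric method. It uses no prior knowledge about the data distribution and imposes no assumptions on it, so it studies the distribution starting from the sample itself. For this reason it is highly valued in both statistical theory and applications.

    • hist: bool, optional # whether to draw the histogram; default True
    • kde: bool, optional # whether to draw the kernel density estimate; default True
    • rug: bool, optional # whether to draw a rug plot (one small tick per observation); default False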
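To see what the kde curve represents, the estimate can be computed directly. This sketch uses scipy.stats.gaussian_kde with its default bandwidth on a synthetic sample; seaborn's own bandwidth choice may differ, so treat it as an illustration of the idea rather than a reproduction of the plot.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data

# Each observation contributes a small Gaussian bump; the KDE is their average.
kde = gaussian_kde(sample)

grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid)

# A density integrates to 1; check with a simple Riemann sum over the grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```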

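Handling the distribution of the target

Price is not normally distributed, so it should be transformed before regression. A log transform works well, although the best fit is the unbounded Johnson (Johnson SU) distribution.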
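A minimal sketch of the log transform, on synthetic right-skewed prices rather than the real column. It uses log1p, which is an assumption on my part: it keeps price == 0 valid, and the value_counts() output shows 7312 such rows. expm1 maps predictions back to the original scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for Train_data['price'].
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000), name="price")

log_price = np.log1p(price)      # log(1 + price), defined for price == 0
restored = np.expm1(log_price)   # inverse transform, for mapping predictions back

print("skew before:", round(price.skew(), 2))
print("skew after: ", round(log_price.skew(), 2))
```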

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678
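For reading these numbers: a normal distribution has skewness 0, and pandas reports excess kurtosis, which is also 0 for a normal distribution. A skewness of 3.54 and kurtosis of 21.23 therefore point to a strongly right-skewed, heavy-tailed price. The sketch below, on synthetic data, checks that Series.skew() and Series.kurt() agree with the bias-corrected scipy estimators.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=1.0, size=1000))  # synthetic right-skewed data

# pandas: bias-corrected sample skewness and excess (Fisher) kurtosis
pandas_skew, pandas_kurt = x.skew(), x.kurt()

# scipy equivalents: bias=False applies the same small-sample correction
scipy_skew = stats.skew(x.to_numpy(), bias=False)
scipy_kurt = stats.kurtosis(x.to_numpy(), fisher=True, bias=False)

print(pandas_skew, scipy_skew)
print(pandas_kurt, scipy_kurt)
```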


Train_data.skew(), Train_data.kurt()
(SaleID                0.001712
 name                  0.513079
 regDate              -1.540844
 model                 1.499765
 brand                 1.314846
 bodyType             -0.070459
 fuelType              0.701802
 gearbox              -1.357379
 power                58.590829
 kilometer            -1.557472
 notRepairedDamage    -2.312519
 regionCode            0.690405
 creatDate           -95.428563
 price                 3.535346
 v_0                  -1.504738
 v_1                   1.582428
 v_2                   1.198679
 v_3                   1.352193
 v_4                   0.217941
 v_5                   2.052749
 v_6                   0.090718
 v_7                   0.823610
 v_8                  -1.532964
 v_9                   1.529931
 v_10                 -2.584452
 v_11                 -0.906428
 v_12                 -2.842834
 v_13                 -3.869655
 v_14                  0.491706
 v_15                  1.308716
 v_16                  1.662893
 v_17                  0.233318
 v_18                  0.814453
 v_19                  0.100073
 v_20                  2.001253
 v_21                  0.180020
 v_22                  0.819133
 v_23                  1.357847
 dtype: float64,
 SaleID                  -1.201476
 name                    -1.084474
 regDate                 11.041006
 model                    1.741896
 brand                    1.814245
 bodyType                -1.070358
 fuelType                -1.495782
 gearbox                 -0.157525
 power                 4473.885260
 kilometer                1.250933
 notRepairedDamage        3.347777
 regionCode              -0.352973
 creatDate            11376.694263
 price                   21.230678
 v_0                      2.901641
 v_1                      1.098703
 v_2                      3.749872
 v_3                      4.294578
 v_4                      6.953348
 v_5                      6.489791
 v_6                     -0.564878
 v_7                     -0.729838
 v_8                      0.370812
 v_9                      0.377943
 v_10                     4.796855
 v_11                     1.547812
 v_12                     6.136342
 v_13                    13.199575
 v_14                    -1.597532
 v_15                    -0.029594
 v_16                     2.240928
 v_17                     2.569341
 v_18                     2.967738
 v_19                     6.923953
 v_20                     6.852809
 v_21                    -0.759948
 v_22                    -0.741708
 v_23                     0.143713
 dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
# axlabel and label set the axis label and the legend label
<AxesSubplot:xlabel='Skewness', ylabel='Density'>


sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
<AxesSubplot:xlabel='Kurtosis', ylabel='Density'>


