Used Car Task 2: Data Analysis

#coding:utf-8
# Import the warnings module and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
## 1) Load the training and test sets
path = './'
Train_data = pd.read_csv(path+'car_train_0110.csv', sep=' ')
Test_data = pd.read_csv(path+'car_testA_0110.csv', sep=' ')
Train_data.head()

[Output: Train_data.head(), 5 rows × 40 columns]

All features have been desensitized (for easier viewing):
name - vehicle code
regDate - vehicle registration date
model - model code
brand - brand code
bodyType - body type
fuelType - fuel type
gearbox - transmission type
power - engine power
kilometer - kilometers driven
notRepairedDamage - whether the car has damage that has not yet been repaired
regionCode - region code of where the car can be viewed
seller - seller type
offerType - offer type
creatDate - time the listing was posted
price - vehicle price
v_0, v_1, ..., v_14: anonymous features (the original description lists 15 anonymous features, v_0 to v_14, but the data loaded above actually contains v_0 through v_23, i.e. 24 anonymous columns)

Train_data.head().append(Test_data.tail())
[Output: Train_data.head() followed by Test_data.tail(), 10 rows × 40 columns]

Train_data.shape

(250000, 40)
Test_data.head().append(Test_data.tail())
[Output: Test_data.head() followed by Test_data.tail(), 10 rows × 39 columns]

Test_data.shape
(50000, 39)
Train_data.describe()

[Output: Train_data.describe(), 8 rows × 40 columns of summary statistics]

Test_data.describe()
[Output: Test_data.describe(), 8 rows × 39 columns of summary statistics]

## 2) Use info() to get familiar with the data types
Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             250000 non-null  int64  
 1   name               250000 non-null  int64  
 2   regDate            250000 non-null  int64  
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64  
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64  
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64  
 12  seller             250000 non-null  int64  
 13  offerType          250000 non-null  int64  
 14  creatDate          250000 non-null  int64  
 15  price              250000 non-null  int64  
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType             25380
fuelType             22490
gearbox              13513
power                    0
kilometer                0
notRepairedDamage    48536
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
v_15                     0
v_16                     0
v_17                     0
v_18                     0
v_19                     0
v_20                     0
v_21                     0
v_22                     0
v_23                     0
dtype: int64
Test_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             5110
fuelType             4402
gearbox              2713
power                   0
kilometer               0
notRepairedDamage    9628
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
v_15                    0
v_16                    0
v_17                    0
v_18                    0
v_19                    0
v_20                    0
v_21                    0
v_22                    0
v_23                    0
dtype: int64
# Visualize the missing (NaN) values
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

<AxesSubplot:>

[Figure output_14_1.png: bar chart of missing-value counts per column]

# Visualize the missing values with missingno
msno.matrix(Train_data.sample(250))
<AxesSubplot:>

[Figure output_15_1.png: missingno matrix of a 250-row sample]

msno.bar(Train_data.sample(1000))
<AxesSubplot:>

[Figure output_16_1.png: missingno bar chart of a 1000-row sample]


Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             250000 non-null  int64  
 1   name               250000 non-null  int64  
 2   regDate            250000 non-null  int64  
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64  
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64  
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64  
 12  seller             250000 non-null  int64  
 13  offerType          250000 non-null  int64  
 14  creatDate          250000 non-null  int64  
 15  price              250000 non-null  int64  
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data['notRepairedDamage'].value_counts()

1.0    35555
0.0     4817
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage'].replace('-', np.nan, inplace=True)  # kept from the original tutorial; in this dataset the column is already numeric with NaN for missing values, so this is effectively a no-op
Train_data["seller"].value_counts()
1    249999
0         1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    249991
1         9
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
# alternatively, pandas' DataFrame.drop(columns=[...]) could be used instead of del

(Re-running this cell a second time raises KeyError: 'seller', because the two columns were already dropped on the first run.)
Train_data['price']

0           520
1          5500
2          1100
3          1200
4          3300
          ...  
249995     1200
249996     1200
249997    16500
249998    31950
249999     1990
Name: price, Length: 250000, dtype: int64
Train_data['price'].value_counts()
0        7312
500      3815
1500     3587
1000     3149
1200     3071
         ... 
11320       1
7230        1
11448       1
9529        1
8188        1
Name: price, Length: 4585, dtype: int64
# %matplotlib inline is an IPython/Jupyter magic: figures are rendered inline right after each
# cell, so plt.show() is no longer required. (A trailing comment on the magic line itself causes
# "UsageError: unrecognized arguments", which is why the comment is moved onto its own lines here.)
%matplotlib inline
import numpy as np     # NumPy, used to generate random arrays
import seaborn as sns  # conventionally abbreviated as sns
sns.set()              # switch to seaborn's default plotting style
x=np.random.randn(100) 
sns.kdeplot(x,cut=0)
<AxesSubplot:ylabel='Density'>

[Figure output_27_1.png: univariate KDE plot of x]

y=np.random.randn(100)
sns.kdeplot(x,y,shade=True)
sns.kdeplot(x,y,shade=True,cbar=True)
<AxesSubplot:>

[Figure output_28_1.png: bivariate shaded KDE plot of x and y with a colorbar]

distplot() combines matplotlib's hist() with the kernel density estimate of kdeplot(), and adds a rugplot of the observations as well as the option of fitting a parametric distribution from scipy. Its signature is:

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)
First, a quick note on histograms:

A histogram (also called a frequency distribution chart) is one of the basic tools for describing how data vary. It reveals the regularity in the data and shows the shape of the distribution at a glance, which makes it easy to judge the overall distribution of a quality characteristic. A histogram is built by dividing the data range into bins and drawing a bar for the number of observations that fall into each bin.

Next, the usage of distplot through a concrete example:

sns.distplot(x,color="g")
<AxesSubplot:ylabel='Density'>

[Figure output_31_1.png: distplot of x (green)]

import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3)            # create a canvas with one row and three columns
sns.distplot(x, ax=axes[0])               # left: histogram + KDE
sns.distplot(x, hist=False, ax=axes[1])   # middle: KDE only
sns.distplot(x, kde=False, ax=axes[2])    # right: histogram only
<AxesSubplot:>

[Figure output_32_1.png: the three distplot variants side by side]

## 1) Overall distribution of the target (unbounded Johnson distribution, etc.)
# "Johnson distribution" for short: the distribution of a random variable that follows a normal distribution after a Johnson transformation.
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=True, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
'''Which overall distribution does the price follow?
The unbounded Johnson distribution (johnsonsu)?
The normal distribution (norm)?
The log-normal distribution (lognorm, slightly above the normal)?

'''
'Which overall distribution does the price follow?\nThe unbounded Johnson distribution (johnsonsu)?\nThe normal distribution (norm)?\nThe log-normal distribution (lognorm, slightly above the normal)?\n\n'

[Figures output_33_1.png, output_33_2.png, output_33_3.png: price distribution fitted with the Johnson SU, normal and log-normal distributions]

2) Check skewness and kurtosis

Skewness: a statistic that describes the shape of a distribution, specifically the symmetry of the values around the center. In short it measures how asymmetric the data are; the larger its absolute value, the more asymmetric (more strongly skewed) the distribution.

Kurtosis: a statistic that describes how steep or flat the distribution of a variable is, i.e. how sharp its peak is (> 0 means more peaked than a normal distribution, < 0 flatter, = 0 the same peakedness as a normal distribution).
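As a quick cross-check of the two definitions, a minimal sketch (assuming the Train_data frame loaded above and scipy.stats): pandas' skew()/kurt() are the bias-corrected sample statistics, with kurt() reported as excess kurtosis, which is 0 for a normal distribution. The pairs below should agree up to floating point.

import scipy.stats as st

price = Train_data['price']
# pandas' skew() matches scipy's bias-corrected skewness estimate.
print(price.skew(), st.skew(price, bias=False))
# pandas' kurt() is bias-corrected *excess* kurtosis (Fisher definition),
# so a normal distribution would give about 0; price gives about 21.2 (see below).
print(price.kurt(), st.kurtosis(price, fisher=True, bias=False))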

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678

[Figure output_35_1.png: distplot of price]

Train_data.skew(), Train_data.kurt()
(SaleID                0.001712
 name                  0.513079
 regDate              -1.540844
 model                 1.499765
 brand                 1.314846
 bodyType             -0.070459
 fuelType              0.701802
 gearbox              -1.357379
 power                58.590829
 kilometer            -1.557472
 notRepairedDamage    -2.312519
 regionCode            0.690405
 creatDate           -95.428563
 price                 3.535346
 v_0                  -1.504738
 v_1                   1.582428
 v_2                   1.198679
 v_3                   1.352193
 v_4                   0.217941
 v_5                   2.052749
 v_6                   0.090718
 v_7                   0.823610
 v_8                  -1.532964
 v_9                   1.529931
 v_10                 -2.584452
 v_11                 -0.906428
 v_12                 -2.842834
 v_13                 -3.869655
 v_14                  0.491706
 v_15                  1.308716
 v_16                  1.662893
 v_17                  0.233318
 v_18                  0.814453
 v_19                  0.100073
 v_20                  2.001253
 v_21                  0.180020
 v_22                  0.819133
 v_23                  1.357847
 dtype: float64,
 SaleID                  -1.201476
 name                    -1.084474
 regDate                 11.041006
 model                    1.741896
 brand                    1.814245
 bodyType                -1.070358
 fuelType                -1.495782
 gearbox                 -0.157525
 power                 4473.885260
 kilometer                1.250933
 notRepairedDamage        3.347777
 regionCode              -0.352973
 creatDate            11376.694263
 price                   21.230678
 v_0                      2.901641
 v_1                      1.098703
 v_2                      3.749872
 v_3                      4.294578
 v_4                      6.953348
 v_5                      6.489791
 v_6                     -0.564878
 v_7                     -0.729838
 v_8                      0.370812
 v_9                      0.377943
 v_10                     4.796855
 v_11                     1.547812
 v_12                     6.136342
 v_13                    13.199575
 v_14                    -1.597532
 v_15                    -0.029594
 v_16                     2.240928
 v_17                     2.569341
 v_18                     2.967738
 v_19                     6.923953
 v_20                     6.852809
 v_21                    -0.759948
 v_22                    -0.741708
 v_23                     0.143713
 dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
<AxesSubplot:xlabel='Skewness', ylabel='Density'>

[Figure output_37_1.png: distribution of the per-column skewness values]

sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')

<AxesSubplot:xlabel='Kurtosis', ylabel='Density'>

[Figure output_38_1.png: distribution of the per-column kurtosis values]

## 3) Check the frequency counts of the target values
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

[Figure output_39_0.png: histogram of price]

Parameters of plt.hist:
x: the data set; the histogram is computed over these values
bins: the binning of the data (number of bins or bin edges)
range: tuple, the interval to display; range only takes effect when bins is not given as explicit edges
density: bool, default False, which shows raw counts; if True the normalized frequency is shown, where frequency = bin count / (total count × bin width); this matches the old normed argument, and density is the officially recommended one
histtype: one of {'bar', 'barstacked', 'step', 'stepfilled'}, default 'bar'; the default is recommended; 'step' draws an unfilled step outline and 'stepfilled' fills it, which looks similar to 'bar'
align: one of {'left', 'mid', 'right'}, default 'mid'; controls the horizontal placement of the bars; 'left' or 'right' leaves some blank space, so the default is recommended
log: bool, default False; whether the y axis uses a logarithmic scale
stacked: bool, default False; whether multiple datasets are stacked

In short, the density parameter of plt.hist controls whether the histogram is normalized (True) or shows raw counts (False), and the orientation parameter decides whether frequency runs along the vertical or the horizontal axis; a small sketch follows.
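A small illustration of density and orientation (a sketch; the kilometer column and the bin count are arbitrary choices, not part of the original notebook):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Left: raw frequency counts per bin (density=False, the default).
ax1.hist(Train_data['kilometer'], bins=20, color='red')
ax1.set_title('density=False: counts')
# Right: normalized so the total bar area is 1, drawn horizontally.
ax2.hist(Train_data['kilometer'], bins=20, density=True,
         orientation='horizontal', color='red')
ax2.set_title('density=True, orientation=horizontal')
plt.show()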

# Log transform: after a log transform the distribution is much more even, so predicting the log
# of the price is a common trick for this kind of regression problem.
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()
(This cell raises "ValueError: supplied range of [-inf, 11.512925464970229] is not finite": 7,312 listings have a price of 0, np.log(0) is -inf, and plt.hist cannot determine a finite bin range. The log1p sketch below avoids the problem.)

[Figure output_41_1.png: histogram of the log-transformed price]
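Since price contains zeros, a hedged alternative is np.log1p (log(1 + x)), whose exact inverse np.expm1 maps predictions back to the original price scale; a minimal sketch using the Train_data frame from above:

# log1p is defined at 0, so the zero-price listings no longer break the histogram.
log_price = np.log1p(Train_data['price'])
plt.hist(log_price, bins=50, orientation='vertical', histtype='bar', color='red')
plt.show()

# After training a model on log_price, predictions are mapped back with the inverse:
# predicted_price = np.expm1(predicted_log_price)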

# Separate out the label, i.e. the prediction target
Y_train = Train_data['price']
# This automatic way of splitting feature types only works when the columns have not already been
# label-encoded; it does not apply here, so the features are separated by hand according to their
# actual meaning.
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=[np.object])
# categorical_features.columns
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]  # note: v_15 through v_23 are not included here, as in the original notebook

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]

unique() returns all the distinct values of a column as an array (numpy.ndarray), i.e. every unique value of the feature.

nunique() returns the number of unique elements in the object, i.e. the count of distinct values. A tiny example of the difference follows.
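A tiny illustration of the difference on the brand column:

print(Train_data['brand'].unique()[:5])  # ndarray of distinct brand codes (first five shown)
print(Train_data['brand'].nunique())     # 40, the number of distinct brands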

# nunique distribution of each categorical feature
for cat_fea in categorical_features:
    print("Distribution of " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
Distribution of name:
name has 164312 distinct values
451       452
73        429
1791      428
821       391
243       346
         ... 
92419       1
88325       1
82182       1
84231       1
157427      1
Name: name, Length: 164312, dtype: int64
Distribution of model:
model has 251 distinct values
0.0      20344
6.0      17741
4.0      13837
1.0      13634
12.0      8841
         ...  
226.0        5
245.0        5
243.0        4
249.0        4
250.0        1
Name: model, Length: 251, dtype: int64
Distribution of brand:
brand has 40 distinct values
0     53699
4     27109
11    26944
10    23762
1     22144
6     17202
9     12210
5      7343
15     6500
12     4704
7      3839
3      3831
17     3543
13     3502
8      3374
28     3161
19     2561
18     2451
16     2274
22     2264
23     2088
14     1892
24     1678
25     1611
20     1610
27     1392
29     1259
34      963
30      604
2       570
31      540
21      522
38      516
35      415
32      406
36      377
33      368
37      324
26      307
39      141
Name: brand, dtype: int64
Distribution of bodyType:
bodyType has 8 distinct values
7.0    64571
3.0    53858
4.0    45646
5.0    20343
6.0    15290
2.0    12755
1.0     9882
0.0     2275
Name: bodyType, dtype: int64
Distribution of fuelType:
fuelType has 7 distinct values
0.0    150664
5.0     72494
4.0      3577
3.0       385
2.0       183
1.0       147
6.0        60
Name: fuelType, dtype: int64
Distribution of gearbox:
gearbox has 2 distinct values
1.0    184645
0.0     51842
Name: gearbox, dtype: int64
Distribution of notRepairedDamage:
notRepairedDamage has 2 distinct values
1.0    176922
0.0     24542
Name: notRepairedDamage, dtype: int64
Distribution of regionCode:
regionCode has 8081 distinct values
487     550
868     424
149     236
539     227
32      216
       ... 
7959      1
8002      1
6715      1
7117      1
4144      1
Name: regionCode, Length: 8081, dtype: int64
# nunique distribution of each categorical feature (test set)
for cat_fea in categorical_features:
    print("Distribution of " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
Distribution of name:
name has 38668 distinct values
73        98
821       89
243       77
451       74
826       73
          ..
106879     1
108926     1
176509     1
178556     1
67583      1
Name: name, Length: 38668, dtype: int64
Distribution of model:
model has 249 distinct values
0.0      3916
6.0      3496
1.0      2806
4.0      2802
12.0     1745
         ... 
247.0       2
246.0       2
214.0       1
243.0       1
232.0       1
Name: model, Length: 249, dtype: int64
Distribution of brand:
brand has 40 distinct values
0     10697
4      5464
11     5374
10     4747
1      4390
6      3496
9      2408
5      1534
15     1325
12      929
7       782
3       736
17      732
13      679
8       666
28      645
19      534
18      487
16      458
22      430
14      416
23      397
24      390
25      297
20      293
27      265
29      236
34      206
30      133
21      121
2       101
38       92
31       87
35       76
36       73
26       72
32       70
37       61
33       61
39       40
Name: brand, dtype: int64
Distribution of bodyType:
bodyType has 8 distinct values
7.0    12748
3.0    10808
4.0     9143
5.0     4175
6.0     3079
2.0     2484
1.0     1980
0.0      473
Name: bodyType, dtype: int64
Distribution of fuelType:
fuelType has 7 distinct values
0.0    30045
5.0    14645
4.0      754
3.0       73
2.0       43
1.0       23
6.0       15
Name: fuelType, dtype: int64
Distribution of gearbox:
gearbox has 2 distinct values
1.0    36935
0.0    10352
Name: gearbox, dtype: int64
Distribution of notRepairedDamage:
notRepairedDamage has 2 distinct values
1.0    35555
0.0     4817
Name: notRepairedDamage, dtype: int64
Distribution of regionCode:
regionCode has 7078 distinct values
487     122
868      93
539      46
32       46
222      46
       ... 
3761      1
6232      1
7891      1
2106      1
2246      1
Name: regionCode, Length: 7078, dtype: int64
numeric_features.append('price')
numeric_features
['power',
 'kilometer',
 'v_0',
 'v_1',
 'v_2',
 'v_3',
 'v_4',
 'v_5',
 'v_6',
 'v_7',
 'v_8',
 'v_9',
 'v_10',
 'v_11',
 'v_12',
 'v_13',
 'v_14',
 'price']
## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price        1.000000
v_0          0.514477
v_11         0.481618
power        0.189456
v_8          0.183505
v_10         0.163891
v_12         0.129570
v_13         0.114883
v_7          0.090440
v_14         0.075673
v_4          0.004413
v_2         -0.018823
v_6         -0.036826
v_5         -0.039637
v_9         -0.165831
v_1         -0.207255
kilometer   -0.404961
v_3         -0.595468
Name: price, dtype: float64 
f , ax = plt.subplots(figsize = (7, 7))

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)
<AxesSubplot:title={'center':'Correlation of Numeric Features with Price'}>

[Figure output_50_1.png: correlation heatmap of the numeric features and price]

Parameters of sns.heatmap:
data: the matrix dataset; it can be a NumPy array, and if it is a pandas DataFrame, the index/column information is mapped to the rows and columns of the heatmap
linewidths: the width of the gaps between the cells of the heatmap
vmax, vmin: the maximum and minimum values anchored to the colorbar; if omitted they are inferred from the data
cmap: a matplotlib colormap name or colormap object; if not provided, the default is the cubehelix map (for continuous data) or RdBu_r (for diverging data)
center: the value placed at the center of the colormap, i.e. the value at the middle of the colorbar; setting center shifts the overall lightness of the generated colors; if the data overflow when center is set, manually specified vmax/vmin are adjusted automatically
annot: short for annotate, default False; when True, the data value is written into every cell of the heatmap
annot_kws: when annot is True, styling options for those annotations (size, color, bold, italic, etc.)
square: boolean, optional; if True the axes aspect is set to "equal" so that every cell is square

A sketch exercising several of these parameters follows.
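A sketch applying several of the parameters described above to the correlation matrix computed earlier (the colormap, figure size and annotation font are illustrative choices, not part of the original notebook):

plt.figure(figsize=(10, 8))
sns.heatmap(correlation,
            annot=True, fmt='.2f',        # write each correlation value into its cell
            annot_kws={'size': 7},        # smaller annotation font
            cmap='RdBu_r', center=0,      # diverging colormap centred on zero
            vmax=0.8, vmin=-0.8,          # clip the colorbar range
            linewidths=0.5, square=True)  # thin gaps between cells, square cells
plt.show()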

del price_numeric['price']
## 2) Check the skewness and kurtosis of a few features
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())  
         )
power           Skewness: 58.59     Kurtosis: 4473.89
kilometer       Skewness: -1.56     Kurtosis: 001.25
v_0             Skewness: -1.50     Kurtosis: 002.90
v_1             Skewness: 01.58     Kurtosis: 001.10
v_2             Skewness: 01.20     Kurtosis: 003.75
v_3             Skewness: 01.35     Kurtosis: 004.29
v_4             Skewness: 00.22     Kurtosis: 006.95
v_5             Skewness: 02.05     Kurtosis: 006.49
v_6             Skewness: 00.09     Kurtosis: -00.56
v_7             Skewness: 00.82     Kurtosis: -00.73
v_8             Skewness: -1.53     Kurtosis: 000.37
v_9             Skewness: 01.53     Kurtosis: 000.38
v_10            Skewness: -2.58     Kurtosis: 004.80
v_11            Skewness: -0.91     Kurtosis: 001.55
v_12            Skewness: -2.84     Kurtosis: 006.14
v_13            Skewness: -3.87     Kurtosis: 013.20
v_14            Skewness: 00.49     Kurtosis: -01.60
price           Skewness: 03.54     Kurtosis: 021.23

Reshaping the data

df.melt() is the inverse of df.pivot().

It turns column names into column values (column names → column values) and restructures the DataFrame.

If df.pivot() turns a long dataset into a wide one, df.melt() turns a wide dataset back into a long one.

melt() exists both as a top-level function and as a DataFrame method; when used as the top-level pd.melt(), the DataFrame must be passed explicitly.

Parameters of pd.melt():
- frame (DataFrame): the dataset to be melted; only needed in the top-level pd.melt() form
- id_vars (tuple, list or ndarray, optional): columns that are not melted; after the transformation they are kept as identifier columns (not index columns)
- value_vars (tuple, list or ndarray, optional): the existing columns to melt; if not specified, every column other than id_vars is melted
- var_name (string, default 'variable'): name for the new column that holds the melted column names (built from value_vars)
- value_name (string, default 'value'): name for the new column that holds the values of the value_vars columns
- col_level (int or string, optional): if the columns are a MultiIndex, melt this level

A small worked example follows.
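A small self-contained sketch of these parameters on a toy wide-format frame (the frame and column names here are made up purely for illustration):

wide = pd.DataFrame({'id': [1, 2],
                     'power': [75, 150],
                     'kilometer': [12.5, 15.0]})
long = pd.melt(wide,
               id_vars=['id'],                  # kept as identifier columns
               value_vars=['power', 'kilometer'],
               var_name='feature',              # name for the melted column names
               value_name='val')                # name for the melted values
print(long)
#    id    feature    val
# 0   1      power   75.0
# 1   2      power  150.0
# 2   1  kilometer   12.5
# 3   2  kilometer   15.0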

seaborn.FacetGrid

data: DataFrame. A processed ("long-format") DataFrame in which every column is a variable (feature) and every row is an observation.

row, col, hue: strings. Variables that define subsets of the data, which are drawn on different facets of the grid; see the *_order parameters to control the order of the levels of each variable. For example col="sex", hue="smoker" means one facet column per sex and a color encoding for smoker status, as the example below shows.

col_wrap: int, optional. Limits the number of facet columns; e.g. col_wrap=3 means at most 3 facets per row of the canvas, while the number of rows is not limited.

share{x,y}: bool, 'col', or 'row', optional. Whether the facets share the x or y axis; if True they share the same axis, otherwise each facet gets its own; both default to True.

g = sns.FacetGrid(tips, col="sex", hue="smoker", sharex=True, sharey=True)  # share both axes
g.map(plt.scatter, "total_bill", "tip", alpha=0.8)
g.add_legend()

Note that g.map above is FacetGrid.map, which calls the given plotting function once per facet on that facet's subset of the data. Python also has a built-in map, which applies a provided function over a sequence.

The format of the built-in map() is:

map(function, iterable, ...)
The first argument is a function; the following arguments are one or more iterable sequences; the result is an iterator over the mapped values.

The function is applied to every element of the list in turn, producing a new sequence of results. Note that map does not modify the original list; it returns a new one. A tiny example follows.
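A tiny example of the built-in map (not to be confused with FacetGrid.map used above):

# Built-in map applies the function element-wise and returns an iterator;
# wrap it in list() to materialize the results. The input list is unchanged.
nums = [1, 2, 3, 4]
squares = list(map(lambda v: v ** 2, nums))
print(squares)  # [1, 4, 9, 16]
print(nums)     # [1, 2, 3, 4]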

## 3) Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

[Figure output_57_0.png: distplot of each numeric feature in a FacetGrid]

## 4) Visualize the pairwise relationships between numeric features
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns],size = 2 ,kind ='scatter',diag_kind='kde')
plt.show()

[Figure output_58_0.png: pairplot of price and the selected numeric features]

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'v_15',
       'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21', 'v_22', 'v_23'],
      dtype='object')
Y_train
0           520
1          5500
2          1100
3          1200
4          3300
          ...  
249995     1200
249996     1200
249997    16500
249998    31950
249999     1990
Name: price, Length: 250000, dtype: int64

fig, ax = plt.subplots(1,3): the arguments 1 and 3 are the number of rows and columns of subplots, so there are 1 x 3 sub-axes in total; the function returns a Figure and an array of Axes objects.
To select a single subplot by index, use the singular plt.subplot(1,3,1), where the last argument 1 means the first subplot (plt.subplots itself does not take such a position argument).
To set the width and height of the subplots, pass figsize to the call:
fig, ax = plt.subplots(1,3,figsize=(15,7)) gives one row of three 15x7 subplots; a short sketch contrasting the two calls follows.
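A short sketch contrasting the two calls (the plotted columns and the clip threshold are arbitrary picks, just for illustration):

# plt.subplots creates the whole grid at once and returns (figure, array of Axes).
fig, axes = plt.subplots(1, 3, figsize=(15, 7))
axes[0].hist(Train_data['kilometer'])
axes[1].hist(Train_data['power'].clip(upper=600))  # clip extreme power values so the histogram stays readable
axes[2].hist(Train_data['v_0'])
plt.show()

# plt.subplot (singular) instead selects one cell of an implicit grid:
# ax1 = plt.subplot(1, 3, 1)   # the first cell of a 1x3 grid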

## 5) Visualize the regression relationships between several variables and price
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
'''
pd.concat parameters:
objs: a sequence (list) of Series, DataFrame or Panel objects
axis: the axis to concatenate along; 0 concatenates rows, 1 concatenates columns
join: the join method, 'inner' or 'outer'
'''
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
'''
Usage of sns.regplot()
Parameters:

x, y: the values plotted on the x and y axes

data: the DataFrame that x and y belong to

x_estimator: apply this function to each unique value of x and plot the resulting estimate; useful when x is a discrete variable. If x_ci is given, a bootstrap confidence interval is drawn around this estimate

x_bins: how many bins to split x into

The remaining parameters are documented at: https://www.cntofu.com/book/172/docs/28.md
sns.regplot() plots the data together with a linear-regression model fit.

'''
<AxesSubplot:xlabel='v_13', ylabel='price'>

[Figure output_62_1.png: regression plots of v_12, v_8, v_0, power, v_5, v_2, v_6, v_1, v_14 and v_13 against price]

pd.cut and pd.qcut

1. pd.cut has 7 parameters and performs equal-width binning of the data between its minimum and maximum:

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

Parameters:

x: the one-dimensional array to be cut

bins: the number of bins, usually an integer, but a sequence of bin edges is also accepted

right: boolean; whether the right edge of each interval is closed; True means right-closed

labels: array or boolean, default None; labels for the resulting bins; the length must match the number of bins; the return value is the integer bin index or the given labels

retbins: boolean, optional; whether to also return the bin edges; True returns them

precision: integer; the decimal precision of the bin edges, i.e. how many decimal places are shown

include_lowest: boolean; whether the first interval includes its left edge

2. pd.qcut bins the data by quantiles, i.e. by the percentage of observations: to split the data into four parts, the four segments are the 0-25%, 25-50%, 50-75% and 75-100% quantile ranges.

pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

A small example of both follows.
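A minimal sketch of both on the power column (the bin counts are illustrative choices; power is heavily right-skewed, which is exactly when the two functions behave very differently):

power = Train_data['power']
width_bins = pd.cut(power, bins=4)                   # 4 intervals of equal width, (max - min) / 4
freq_bins = pd.qcut(power, q=4, duplicates='drop')   # roughly 25% of the rows in each interval
print(width_bins.value_counts())   # almost everything lands in the first equal-width bin
print(freq_bins.value_counts())    # roughly balanced counts per quantile bin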

## 1) Number of unique values per categorical feature
for fea in categorical_features:
    print(Train_data[fea].nunique())
164312
251
40
8
7
2
2
8081
categorical_features
['name',
 'model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage',
 'regionCode']
## 2) Box plots of the categorical features

# name and regionCode have too many sparse categories, so only the less sparse categorical features are plotted here
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        '''np.array.any() and np.array.all():

        np.array.any() is an OR over the elements: it returns True if any element is True.

        np.array.all() is an AND over the elements: it returns True only if all elements are True.'''
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(boxplot, "value", "price")

[Figure output_66_0.png: box plots of price for each categorical feature]

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'v_15',
       'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21', 'v_22', 'v_23'],
      dtype='object')
## 3) Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

[Figures output_68_0.png through output_68_5.png: violin plots of price for model, brand, bodyType, fuelType, gearbox and notRepairedDamage]

categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
## 4) Bar plots of the categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(bar_plot, "value", "price")

[Figure output_70_0.png: bar plots of mean price for each categorical feature]

## 5) Count plots of each category of the categorical features (count_plot)
def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data,  value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(count_plot, "value")

[Figure output_71_0.png: count plots of each categorical feature]

import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
