二手车交易价格预测task02-数据分析

题目链接
https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

1 载入各种数据科学以及可视化库

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
data_train = pd.read_csv('used_car_train_20200313.csv',sep=' ')
data_testA = pd.read_csv('used_car_testA_20200313.csv',sep=' ')

2 观察数据

data_train.head()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
007362004040230.061.00.00.06012.5...0.2356760.1019880.1295490.0228160.097462-2.8818032.804097-2.4208210.7952920.914762
1122622003030140.012.00.00.0015.0...0.2647770.1210040.1357310.0265970.020582-4.9004822.096338-1.030483-1.7226740.245522
221487420040403115.0151.00.00.016312.5...0.2514100.1149120.1651470.0621730.027075-4.8467491.8035591.565330-0.832687-0.229963
337186519960908109.0100.00.01.019315.0...0.2742930.1103000.1219640.0333950.000000-4.5095991.285940-0.501868-2.438353-0.478699
4411108020120103110.051.00.00.0685.0...0.2280360.0732050.0918800.0788190.121534-1.8962400.9107830.9311102.8345181.923482

5 rows × 31 columns

特征含义
  • name - 汽车编码
  • regDate - 汽车注册时间
  • model - 车型编码
  • brand - 品牌
  • bodyType - 车身类型
  • fuelType - 燃油类型
  • gearbox - 变速箱
  • power - 汽车功率
  • kilometer - 汽车行驶公里
  • notRepairedDamage - 汽车有尚未修复的损坏
  • regionCode - 看车地区编码
  • seller - 销售方
  • offerType - 报价类型
  • creatDate - 广告发布时间
  • price - 汽车价格
  • v_0’, ‘v_1’, ‘v_2’, ‘v_3’, ‘v_4’, ‘v_5’, ‘v_6’, ‘v_7’, ‘v_8’, ‘v_9’, ‘v_10’, ‘v_11’, ‘v_12’, ‘v_13’,‘v_14’(根据汽车的评论、标签等大量信息得到的embedding向量)【人工构造 匿名特征】
详细信息
data_train.describe()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
count150000.000000150000.0000001.500000e+05149999.000000150000.000000145494.000000141320.000000144019.000000150000.000000150000.000000...150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000150000.000000
mean74999.50000068349.1728732.003417e+0747.1290218.0527331.7923690.3758420.224943119.31654712.597160...0.2482040.0449230.1246920.0581440.061996-0.0010000.0090350.0048130.000313-0.000688
std43301.41452761103.8750955.364988e+0449.5360407.8649561.7606400.5486770.417546177.1684193.919576...0.0458040.0517430.2014100.0291860.0356923.7723863.2860712.5174781.2889881.038685
min0.0000000.0000001.991000e+070.0000000.0000000.0000000.0000000.0000000.0000000.500000...0.0000000.0000000.0000000.0000000.000000-9.168192-5.558207-9.639552-4.153899-6.546556
25%37499.75000011156.0000001.999091e+0710.0000001.0000000.0000000.0000000.00000075.00000012.500000...0.2436150.0000380.0624740.0353340.033930-3.722303-1.951543-1.871846-1.057789-0.437034
50%74999.50000051638.0000002.003091e+0730.0000006.0000001.0000000.0000000.000000110.00000015.000000...0.2577980.0008120.0958660.0570140.0584841.624076-0.358053-0.130753-0.0362450.141246
75%112499.250000118841.2500002.007111e+0766.00000013.0000003.0000001.0000000.000000150.00000015.000000...0.2652970.1020090.1252430.0793820.0874912.8443571.2550221.7769330.9428130.680378
max149999.000000196812.0000002.015121e+07247.00000039.0000007.0000006.0000001.00000019312.00000015.000000...0.2918380.1514201.4049360.1607910.22278712.35701118.81904213.84779211.1476698.658418

8 rows × 30 columns

print(data_train.shape,data_testA.shape)
(150000, 31) (50000, 30)

3 查看缺失值

print(data_train.isnull().sum())
print(data_testA.isnull().sum())
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

从统计情况看,训练数据中model、bodyType、fuelType、gearbox存在缺失值

缺失值可视化
missing = data_train.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ee2a3f4e0>

在这里插入图片描述

# 可视化看下缺省值
msno.matrix(data_train.sample(250))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ee294d5f8>

在这里插入图片描述

msno.bar(data_train.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ee1f47f60>

在这里插入图片描述

数据中只有notRepairedDamage是Object类型

data_train['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

可以看出来‘ - ’也为空缺值,因为很多模型对nan有直接的处理,这里我们先不做处理,先替换成nan

data_train['notRepairedDamage'].replace('-', np.nan, inplace=True)
data_testA['notRepairedDamage'].replace('-', np.nan, inplace=True)
data_train['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

seller,offerType两个类别特征严重倾斜,一般不会对预测有什么帮助,故这边先删掉,当然你也可以继续挖掘,但是一般意义不大

print(data_train.seller.value_counts())
print(data_train.offerType.value_counts())
0    149999
1         1
Name: seller, dtype: int64
0    150000
Name: offerType, dtype: int64
del data_train['seller']
del data_train['offerType']
del data_testA['seller']
del data_testA['offerType']
data_train.shape
(150000, 29)

4 了解预测值的分布

sns.violinplot(data_train.price)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ee15a82e8>

在这里插入图片描述

可以看出,数据基本服从长尾分布

sns.boxplot(data_train.price)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ee159e390>

在这里插入图片描述

从箱线图中可以看出,大部分数据在0~20000之间

## 1) 总体分布概况(无界约翰逊分布等)
import scipy.stats as st
y = data_train['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "





<matplotlib.axes._subplots.AxesSubplot at 0x7f7edcc72908>

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

价格不服从正态分布,先对其进行转换

## 2) 查看skewness and kurtosis
sns.distplot(data_train['price']);
print("Skewness: %f" % data_train['price'].skew())
print("Kurtosis: %f" % data_train['price'].kurt())
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "


Skewness: 3.346487
Kurtosis: 18.995183

在这里插入图片描述

skewness:Definition:是描述数据分布形态的统计量,其描述的是某总体取值分布的对称性,简单来说就是数据的不对称程度
偏度是三阶中心距计算出来的。
(1)Skewness = 0 ,分布形态与正态分布偏度相同。
(2)Skewness > 0 ,正偏差数值较大,为正偏或右偏。长尾巴拖在右边,数据右端有较多的极端值。
(3)Skewness < 0 ,负偏差数值较大,为负偏或左偏。长尾巴拖在左边,数据左端有较多的极端值。
(4)数值的绝对值越大,表明数据分布越不对称,偏斜程度大。
计算公式:
S k e w n e s s = E [ ( ( x − E ( x ) ) / ( D ( x ) ) ) 3 ] Skewness=E[((x-E(x))/(\sqrt{D(x)}))^3] Skewness=E[((xE(x))/(D(x) ))3]
| Skewness| 越大,分布形态偏移程度越大。
kurtosis:Definition:偏度是描述某变量所有取值分布形态陡缓程度的统计量,简单来说就是数据分布顶的尖锐程度
峰度是四阶标准矩计算出来的。
(1)Kurtosis=0 与正态分布的陡缓程度相同。
(2)Kurtosis>0 比正态分布的高峰更加陡峭——尖顶峰
(3)Kurtosis<0 比正态分布的高峰来得平台——平顶峰
计算公式:
K u r t o s i s = E [ ( ( x − E ( x ) ) / ( ( D ( x ) ) ) ) 4 ] − 3 Kurtosis=E[ ( (x-E(x))/ (\sqrt(D(x))) )^4 ]-3 Kurtosis=E[((xE(x))/(( D(x))))4]3

#数据的偏度
sns.distplot(data_train.skew(),color='blue',axlabel ='Skewness')
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "





<matplotlib.axes._subplots.AxesSubplot at 0x7f7edc91c4e0>

在这里插入图片描述

sns.distplot(data_train.kurt(),color='blue',axlabel ='Skewness')
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "





<matplotlib.axes._subplots.AxesSubplot at 0x7f7ecdbf89b0>

在这里插入图片描述

## 3) 查看预测值的具体频数
plt.hist(data_train['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

在这里插入图片描述

查看频数, 大于20000得值极少,其实这里也可以把这些当作特殊得值(异常值)直接用填充或者删掉,再前面进行

# log变换 z之后的分布较均匀,可以进行log变换进行预测,这也是预测问题常用的trick
plt.hist(np.log(data_train['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()

在这里插入图片描述

5 特征分为类别特征和数字特征,并对类别特征查看unique分布

# 数字特征
numeric_features = data_train.select_dtypes(include=[np.number])
print(numeric_features.columns)
# # 类型特征
categorical_features = data_train.select_dtypes(include=[np.object])
print(categorical_features.columns)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'regionCode', 'creatDate', 'price',
       'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9',
       'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
Index(['notRepairedDamage'], dtype='object')

上述方法不适用,因为有些类型特征也是由数字表示的,需要人工区分

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
# 特征nunique分布
for cat_fea in categorical_features:
#     print(cat_fea + "的特征分布如下:")
    print("{}特征有个{}不同的值".format(cat_fea, data_train[cat_fea].nunique()))
#     print(data_train[cat_fea].value_counts())
name特征有个99662不同的值
model特征有个248不同的值
brand特征有个40不同的值
bodyType特征有个8不同的值
fuelType特征有个7不同的值
gearbox特征有个2不同的值
notRepairedDamage特征有个2不同的值
regionCode特征有个7905不同的值

数字特征分析

numeric_features.append('price')
## 1) 相关性分析
price_numeric = data_train[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64 
用热力图表示
f , ax = plt.subplots(figsize = (7, 7))

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7edc6e9a20>

在这里插入图片描述

## 2) 查看几个特征的偏度和峰值
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(data_train[col].skew()) , 
          '  ' ,
          'Kurtosis: {:06.2f}'.format(data_train[col].kurt())  
         )
power           Skewness: 65.86    Kurtosis: 5733.45
kilometer       Skewness: -1.53    Kurtosis: 001.14
v_0             Skewness: -1.32    Kurtosis: 003.99
v_1             Skewness: 00.36    Kurtosis: -01.75
v_2             Skewness: 04.84    Kurtosis: 023.86
v_3             Skewness: 00.11    Kurtosis: -00.42
v_4             Skewness: 00.37    Kurtosis: -00.20
v_5             Skewness: -4.74    Kurtosis: 022.93
v_6             Skewness: 00.37    Kurtosis: -01.74
v_7             Skewness: 05.13    Kurtosis: 025.85
v_8             Skewness: 00.20    Kurtosis: -00.64
v_9             Skewness: 00.42    Kurtosis: -00.32
v_10            Skewness: 00.03    Kurtosis: -00.58
v_11            Skewness: 03.03    Kurtosis: 012.57
v_12            Skewness: 00.37    Kurtosis: 000.27
v_13            Skewness: 00.27    Kurtosis: -00.44
v_14            Skewness: -1.19    Kurtosis: 002.39
price           Skewness: 03.35    Kurtosis: 019.00
## 3) 每个数字特征得分布可视化
f = pd.melt(data_train, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/myth/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

在这里插入图片描述

## 4) 数字特征相互之间的关系可视化
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(data_train[columns],size = 2 ,kind ='scatter',diag_kind='kde')
plt.show()

在这里插入图片描述

参考资料:pairplot相关用法
https://www.jianshu.com/p/6e18d21a4cad

## 5) 多变量互相回归关系可视化
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([data_train['price'],data_train['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([data_train['price'],data_train['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([data_train['price'],data_train['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([data_train['price'],data_train['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([data_train['price'],data_train['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([data_train['price'],data_train['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([data_train['price'],data_train['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([data_train['price'],data_train['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([data_train['price'],data_train['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([data_train['price'],data_train['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7ecfda8da0>

在这里插入图片描述

类别特征分析

## 1) unique分布
for fea in categorical_features:
    print(data_train[fea].nunique())
99662
248
40
8
7
2
2
7905
## 2) 类别特征箱形图可视化

# 因为 name和 regionCode的类别太稀疏了,这里我们把不稀疏的几类画一下
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    data_train[c] = data_train[c].astype('category')
    if data_train[c].isnull().any():
        data_train[c] = data_train[c].cat.add_categories(['MISSING'])
        data_train[c] = data_train[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(data_train, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(boxplot, "value", "price")

在这里插入图片描述

常用的画图函数:

箱线图
sns.boxplot(x=x, y=y,data=data)
小提琴图
sns.violinplot(x=x, y=y,data=data)
柱状图
sns.barplot(x=x, y=y,data=data)
频数柱状图
sns.countplot(x=x)

用pandas_profiling生成数据报告

import pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")
HBox(children=(FloatProgress(value=0.0, description='variables', max=29.0, style=ProgressStyle(description_wid…
HBox(children=(FloatProgress(value=0.0, description='variables', max=29.0, style=ProgressStyle(description_wid…






HBox(children=(FloatProgress(value=0.0, description='correlations', max=6.0, style=ProgressStyle(description_w…






HBox(children=(FloatProgress(value=0.0, description='interactions [continuous]', max=729.0, style=ProgressStyl…






HBox(children=(FloatProgress(value=0.0, description='table', max=1.0, style=ProgressStyle(description_width='i…






HBox(children=(FloatProgress(value=0.0, description='missing', max=4.0, style=ProgressStyle(description_width=…






HBox(children=(FloatProgress(value=0.0, description='warnings', max=3.0, style=ProgressStyle(description_width…






HBox(children=(FloatProgress(value=0.0, description='package', max=1.0, style=ProgressStyle(description_width=…






HBox(children=(FloatProgress(value=0.0, description='build report structure', max=1.0, style=ProgressStyle(des…

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值