Notes on SecondHandCarPriceForecast
Baseline part
task1
task2
2.1 Goals of EDA (Exploratory Data Analysis)
- The main value of EDA is becoming familiar with the dataset, understanding it, and validating it to confirm that it can be used for the machine learning or deep learning steps that follow.
- Once we understand the dataset, the next step is to explore the relationships among the variables and between the variables and the prediction target.
- EDA guides the data processing and feature engineering steps, so that the structure and feature set of the data make the downstream prediction problem more reliable.
2.2 Contents
2.2.1 Data overview:
- Use describe() to get familiar with the summary statistics
- Use info() to get familiar with the data types
2.2.2 Detecting missing values and outliers
- Check the NaN situation in each column
- Outlier detection
2.2.3 Understanding the distribution of the target
- Overall distribution (unbounded Johnson SU distribution, etc.)
- Check skewness and kurtosis
- Check the frequency counts of the target values
2.2.4 Split the features into categorical and numeric features, and check the unique-value distribution of the categorical ones
2.2.5 Numeric feature analysis
- Correlation analysis
- Check the skewness and kurtosis of a few features
- Visualize the distribution of each numeric feature
- Visualize the pairwise relationships between numeric features
- Visualize multivariate regression relationships
2.2.6 Categorical feature analysis
- unique-value distribution
- Box plots of the categorical features
- Violin plots of the categorical features
- Bar plots of the categorical features
- Frequency counts of each category of the categorical features (count_plot)
2.2.7 Generate a data report with pandas_profiling
2.3 Code walkthrough
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
missingno is a library for visualizing missing values; seaborn is a plotting library.
Train_data = pd.read_csv('train.csv', sep=' ')
Test_data = pd.read_csv('testA.csv', sep=' ')
Train_data.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
info() shows the data type of each column; for example, notRepairedDamage is stored as object here.
describe() should be used alongside it to inspect the summary statistics.
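As a minimal sketch of what describe() returns (a toy DataFrame stands in here for Train_data; the values are made up):

```python
import pandas as pd

# Toy stand-in for Train_data with two numeric columns
df = pd.DataFrame({"power": [75, 110, 0, 150],
                   "kilometer": [12.5, 15.0, 12.5, 0.5]})

stats = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(stats.loc["mean", "power"])       # 83.75
print(stats.loc["count", "kilometer"])  # 4.0
```

Rows like count and min/max are a quick way to spot suspicious columns (e.g. power == 0 or extreme maxima) before any modeling.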
## 1) Check the NaN situation in each column
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Train_data.isnull().sum() counts how many values are missing in each column.
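On a toy frame (hypothetical columns, not the competition data), the same call looks like:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing bodyType value
df = pd.DataFrame({"bodyType": [1.0, np.nan, 3.0],
                   "power": [75, 110, 0]})

na_counts = df.isnull().sum()  # number of missing values per column
print(na_counts["bodyType"])   # 1
print(na_counts["power"])      # 0
```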
Train_data.isnull()
# Output
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
149995 False False False False False False False False False False ... False False False False False False False False False False
149996 False False False False False False False False False False ... False False False False False False False False False False
149997 False False False False False False False False False False ... False False False False False False False False False False
149998 False False False False False False False False False False ... False False False False False False False False False False
149999 False False False False False False False False False False ... False False False False False False False False False False
150000 rows × 31 columns
In practice Train_data.isnull().sum() is the usual choice.
Visualizing the missing values:
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
missing = missing[missing > 0] keeps only the columns that actually contain missing values (the filter drops the fully complete columns), so the bar chart shows just the problematic ones.
msno.matrix(Train_data.sample(250))
msno.bar(Train_data.sample(1000))
How to read these plots: msno.matrix() draws one row per sample, with a white gap wherever a value is missing, so vertical white bands reveal which columns lose data together; msno.bar() simply plots each column's non-null count as a bar.
Train_data['price'].value_counts()
# Output
500 2337
1500 2158
1200 1922
1000 1850
2500 1821
...
25321 1
8886 1
8801 1
37920 1
8188 1
The left column is the price, the right column is how many rows have that price.
# Overall distribution (unbounded Johnson SU distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
scipy.stats fitting accepts a Series, and Train_data['price'] is a Series.
kde stands for kernel density estimation (KDE); an introduction: https://blog.csdn.net/unixtch/article/details/78556499
fit takes a probability distribution: st.norm is the normal distribution and st.johnsonsu is the unbounded Johnson SU distribution. The black curve is the fitted distribution.
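Under the hood, fit= estimates the distribution's parameters from the data and overlays its pdf. A rough sketch of the same idea with scipy directly (synthetic right-skewed data stands in for price here):

```python
import numpy as np
import scipy.stats as st

# Synthetic right-skewed sample standing in for Train_data['price']
rng = np.random.default_rng(0)
y = rng.lognormal(mean=8.0, sigma=1.0, size=2000)

# Estimate Johnson SU parameters by maximum likelihood, then evaluate
# the fitted pdf on a grid; this is what the black overlay curve shows
params = st.johnsonsu.fit(y)
xs = np.linspace(y.min(), y.max(), 200)
pdf = st.johnsonsu.pdf(xs, *params)
print(pdf.shape)  # (200,)
```

Note that sns.distplot is deprecated in seaborn 0.11+; sns.histplot(y) is the modern replacement, though it has no fit= argument, so the fitted pdf then has to be drawn manually as above.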
## Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis: 18.995183
Kurtosis (peakedness), also called the kurtosis coefficient, characterizes how high the probability density curve peaks at the mean; intuitively, it reflects how sharp the peak is. A sample's kurtosis is measured relative to the normal distribution: the raw kurtosis of a normal distribution is 3, so a sample with raw kurtosis above 3 has a sharper peak and heavier tails than a normal, and vice versa. Note that pandas' kurt() reports excess kurtosis (raw kurtosis minus 3), so the normal baseline there is 0.
Skewness, also called the skewness coefficient, measures the direction and degree of asymmetry of a distribution: it is zero for a symmetric distribution, positive when the tail stretches to the right, and negative when it stretches to the left.
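A small illustration of the sign conventions on made-up samples (recall that pandas' kurt() is excess kurtosis, so the normal baseline is 0, not 3):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])          # perfectly symmetric sample
right_tail = pd.Series([1, 1, 1, 2, 2, 3, 50])  # one large value: long right tail

print(symmetric.skew())        # 0.0: no asymmetry
print(right_tail.skew() > 0)   # True: tail stretches to the right
print(right_tail.kurt() > 0)   # True: heavier tail / sharper peak than a normal
```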
Train_data.skew(), Train_data.kurt()
(SaleID 6.017846e-17
name 5.576058e-01
regDate 2.849508e-02
model 1.484388e+00
brand 1.150760e+00
bodyType 9.915299e-01
fuelType 1.595486e+00
gearbox 1.317514e+00
power 6.586318e+01
kilometer -1.525921e+00
notRepairedDamage 2.430640e+00
regionCode 6.888812e-01
creatDate -7.901331e+01
price 3.346487e+00
v_0 -1.316712e+00
v_1 3.594543e-01
v_2 4.842556e+00
v_3 1.062920e-01
v_4 3.679890e-01
v_5 -4.737094e+00
v_6 3.680730e-01
v_7 5.130233e+00
v_8 2.046133e-01
v_9 4.195007e-01
v_10 2.522046e-02
v_11 3.029146e+00
v_12 3.653576e-01
v_13 2.679152e-01
v_14 -1.186355e+00
dtype: float64, SaleID -1.200000
name -1.039945
regDate -0.697308
model 1.740483
brand 1.076201
bodyType 0.206937
fuelType 5.880049
gearbox -0.264161
power 5733.451054
kilometer 1.141934
notRepairedDamage 3.908072
regionCode -0.340832
creatDate 6881.080328
price 18.995183
v_0 3.993841
v_1 -1.753017
v_2 23.860591
v_3 -0.418006
v_4 -0.197295
v_5 22.934081
v_6 -1.742567
v_7 25.845489
v_8 -0.636225
v_9 -0.321491
v_10 -0.577935
v_11 12.568731
v_12 0.268937
v_13 -0.438274
v_14 2.393526
dtype: float64)
This computes the skewness and kurtosis of every column at once.
## Check the frequency counts of the target values
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
Looking at the frequencies, values above 20000 are very rare; they could be treated as special (outlier) values and filled in or dropped during the earlier preprocessing.
# The distribution after a log transform is much more even, so we can predict on the log scale; this is a common trick for prediction problems
plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
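In practice, np.log1p is often safer than np.log here in case any price is 0, and np.expm1 inverts it when converting predictions back to the original scale. A minimal sketch (toy values, not the competition data):

```python
import numpy as np

prices = np.array([0.0, 500.0, 1500.0, 99999.0])

log_prices = np.log1p(prices)     # log(1 + x): finite even at price == 0
recovered = np.expm1(log_prices)  # inverse transform, back to the price scale

print(np.isfinite(log_prices).all())   # True (np.log(0) would be -inf)
print(np.allclose(recovered, prices))  # True: round trip is lossless
```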
2.2.3.1 Handling numeric data
Check the data description to see which type each field is.
# categorical_features is the list of categorical column names defined earlier
for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print("The {} feature has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
# Output
name feature distribution:
The name feature has 99662 distinct values
708 282
387 282
55 280
1541 263
203 233
...
5074 1
7123 1
11221 1
13270 1
174485 1
Name: name, Length: 99662, dtype: int64
model feature distribution:
The model feature has 248 distinct values
0.0 11762
19.0 9573
4.0 8445
nunique() counts the number of distinct values in each column of a DataFrame; it also works on a Series (though not on a plain list) and returns the count of distinct values.
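A toy example of both forms (illustrative column names):

```python
import pandas as pd

# Toy frame: brand has 3 distinct values, gearbox has 2
df = pd.DataFrame({"brand": [1, 1, 2, 3],
                   "gearbox": [0.0, 1.0, 1.0, 0.0]})

print(df.nunique())             # distinct count per column, as a Series
print(df["brand"].nunique())    # 3
print(df["gearbox"].nunique())  # 2
```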
numeric_features.append('price')
numeric_features
['power',
'kilometer',
'v_0',
'v_1',
'v_2',
'v_3',
'v_4',
'v_5',
'v_6',
'v_7',
'v_8',
'v_9',
'v_10',
'v_11',
'v_12',
'v_13',
'v_14',
'price']
## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation)
power kilometer v_0 v_1 v_2 v_3 \
power 1.000000 -0.019631 0.215028 0.023746 -0.031487 -0.185342
kilometer -0.019631 1.000000 -0.225034 -0.022228 -0.110375 0.402502
v_0 0.215028 -0.225034 1.000000 0.245049 -0.452591 -0.710480
v_1 0.023746 -0.022228 0.245049 1.000000 -0.001133 -0.001915
v_2 -0.031487 -0.110375 -0.452591 -0.001133 1.000000 0.001224
v_3 -0.185342 0.402502 -0.710480 -0.001915 0.001224 1.000000
v_4 -0.141013 -0.214861 -0.259714 -0.000468 -0.001021 -0.001694
v_5 0.119727 0.049502 0.726250 0.109303 -0.921857 -0.233412
v_6 0.025648 -0.024664 0.243783 0.999415 0.023877 -0.000747
v_7 -0.060397 -0.017835 -0.584363 -0.110806 0.973689 0.191278
v_8 0.155956 -0.407686 0.514149 -0.298966 0.180285 -0.933161
v_9 -0.140203 -0.149422 -0.186243 -0.007698 -0.236164 0.079292
v_10 -0.092717 0.083358 -0.582943 -0.921904 0.274341 0.247385
v_11 -0.122107 0.066542 -0.667809 0.370445 0.800915 0.429777
v_12 0.161990 -0.370153 0.415711 -0.087593 0.535270 -0.811301
v_13 -0.103430 -0.285158 -0.136938 0.017349 -0.055376 -0.246052
v_14 -0.023808 -0.120389 -0.039809 0.002143 -0.013785 -0.058561
price 0.219834 -0.440519 0.628397 0.060914 0.085322 -0.730946
v_4 v_5 v_6 v_7 v_8 v_9 \
power -0.141013 0.119727 0.025648 -0.060397 0.155956 -0.140203
kilometer -0.214861 0.049502 -0.024664 -0.017835 -0.407686 -0.149422
v_0 -0.259714 0.726250 0.243783 -0.584363 0.514149 -0.186243
v_1 -0.000468 0.109303 0.999415 -0.110806 -0.298966 -0.007698
v_2 -0.001021 -0.921857 0.023877 0.973689 0.180285 -0.236164
v_3 -0.001694 -0.233412 -0.000747 0.191278 -0.933161 0.079292
v_4 1.000000 -0.259739 -0.011275 -0.054241 0.051741 0.962928
v_5 -0.259739 1.000000 0.091229 -0.939385 0.010686 -0.050343
v_6 -0.011275 0.091229 1.000000 -0.085410 -0.294956 -0.023057
v_7 -0.054241 -0.939385 -0.085410 1.000000 0.028695 -0.264091
v_8 0.051741 0.010686 -0.294956 0.028695 1.000000 -0.063577
v_9 0.962928 -0.050343 -0.023057 -0.264091 -0.063577 1.000000
v_10 0.071116 -0.440588 -0.917056 0.410014 0.094497 0.026562
v_11 0.110660 -0.845954 0.386446 0.813175 -0.369353 -0.056200
v_12 -0.134611 -0.258521 -0.070238 0.385378 0.882121 -0.313634
v_13 0.934580 -0.162689 0.000758 -0.154535 0.250423 0.880545
v_14 -0.178518 0.037804 -0.003322 -0.020218 0.030416 -0.214151
price -0.147085 0.164317 0.068970 -0.053024 0.685798 -0.206205
v_10 v_11 v_12 v_13 v_14 price
power -0.092717 -0.122107 0.161990 -0.103430 -0.023808 0.219834
kilometer 0.083358 0.066542 -0.370153 -0.285158 -0.120389 -0.440519
v_0 -0.582943 -0.667809 0.415711 -0.136938 -0.039809 0.628397
v_1 -0.921904 0.370445 -0.087593 0.017349 0.002143 0.060914
v_2 0.274341 0.800915 0.535270 -0.055376 -0.013785 0.085322
v_3 0.247385 0.429777 -0.811301 -0.246052 -0.058561 -0.730946
v_4 0.071116 0.110660 -0.134611 0.934580 -0.178518 -0.147085
v_5 -0.440588 -0.845954 -0.258521 -0.162689 0.037804 0.164317
v_6 -0.917056 0.386446 -0.070238 0.000758 -0.003322 0.068970
v_7 0.410014 0.813175 0.385378 -0.154535 -0.020218 -0.053024
v_8 0.094497 -0.369353 0.882121 0.250423 0.030416 0.685798
v_9 0.026562 -0.056200 -0.313634 0.880545 -0.214151 -0.206205
v_10 1.000000 0.006306 0.001289 -0.000580 0.002244 -0.246175
v_11 0.006306 1.000000 0.006695 -0.001671 -0.001156 -0.275320
v_12 0.001289 0.006695 1.000000 0.001512 0.002045 0.692823
v_13 -0.000580 -0.001671 0.001512 1.000000 0.001419 -0.013993
v_14 0.002244 -0.001156 0.002045 0.001419 1.000000 0.035911
price -0.246175 -0.275320 0.692823 -0.013993 0.035911 1.000000
corr() computes the pairwise correlation coefficients (Pearson by default) between the columns.
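On toy columns the behaviour is easy to verify: corr() returns a symmetric DataFrame of pairwise coefficients (column names here are made up):

```python
import pandas as pd

# y rises linearly with x and falls linearly as z rises
df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "z": [4, 3, 2, 1],
                   "y": [2, 4, 6, 8]})

corr = df.corr()                      # pairwise Pearson correlation matrix
print(round(corr.loc["x", "y"], 6))   # 1.0: perfectly linear
print(round(corr.loc["z", "y"], 6))   # -1.0: perfectly anti-correlated
```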
print(correlation['price'].sort_values(ascending = False),'\n')
price 1.000000
v_12 0.692823
v_8 0.685798
v_0 0.628397
power 0.219834
v_5 0.164317
v_2 0.085322
v_6 0.068970
v_1 0.060914
v_14 0.035911
v_13 -0.013993
v_7 -0.053024
v_4 -0.147085
v_9 -0.206205
v_10 -0.246175
v_11 -0.275320
kilometer -0.440519
v_3 -0.730946
Name: price, dtype: float64
Here we take just the 'price' column of the correlation matrix and sort it.
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
This renders the correlation matrix as a heatmap.
## 2) Check the skewness and kurtosis of a few features
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          '   ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
          )
Output
power Skewness: 65.86 Kurtosis: 5733.45
kilometer Skewness: -1.53 Kurtosis: 001.14
v_0 Skewness: -1.32 Kurtosis: 003.99
v_1 Skewness: 00.36 Kurtosis: -01.75
v_2 Skewness: 04.84 Kurtosis: 023.86
v_3 Skewness: 00.11 Kurtosis: -00.42
v_4 Skewness: 00.37 Kurtosis: -00.20
v_5 Skewness: -4.74 Kurtosis: 022.93
v_6 Skewness: 00.37 Kurtosis: -01.74
v_7 Skewness: 05.13 Kurtosis: 025.85
v_8 Skewness: 00.20 Kurtosis: -00.64
v_9 Skewness: 00.42 Kurtosis: -00.32
v_10 Skewness: 00.03 Kurtosis: -00.58
v_11 Skewness: 03.03 Kurtosis: 012.57
v_12 Skewness: 00.37 Kurtosis: 000.27
v_13 Skewness: 00.27 Kurtosis: -00.44
v_14 Skewness: -1.19 Kurtosis: 002.39
price Skewness: 03.35 Kurtosis: 019.00
Visualizing the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
pd.melt() stacks the selected columns into long format, one (variable, value) pair per row, and FacetGrid then draws one distplot panel per variable.
Visualizing the pairwise relationships between numeric features.
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns],size = 2 ,kind ='scatter',diag_kind='kde')
plt.show()
An introduction to pairplot: https://www.cnblogs.com/cgmcoding/p/13274481.html
It draws the pairwise relationship between every two of the selected columns, with KDE plots on the diagonal here; note that in newer seaborn versions the size argument has been renamed to height.
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)
v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)
v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)
v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)
v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
This block draws a scatter plot with a fitted regression line for each of the ten features most correlated with price (Y_train is the price Series extracted from Train_data earlier).
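The ten nearly identical regplot calls above can be collapsed into a loop over (feature, axis) pairs. A self-contained sketch, with synthetic data standing in for Train_data and Y_train:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: three features, price driven mainly by the first
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["v_12", "v_8", "v_0"])
train["price"] = 2 * train["v_12"] + rng.normal(size=100)

features = ["v_12", "v_8", "v_0"]
fig, axes = plt.subplots(nrows=1, ncols=len(features), figsize=(12, 4))
for col, ax in zip(features, axes.ravel()):
    # one scatter plot + fitted regression line per feature, as in the block above
    sns.regplot(x=col, y="price", data=train, scatter=True, fit_reg=True, ax=ax)
plt.close(fig)
```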