second-hand

Data Mining from Scratch: Used-Car Transaction Price Prediction

Task 1: Understanding the Competition

1.1 Competition URL:

https://tianchi.aliyun.com/competition/entrance/231784/introductionspm=5176.12281957.1004.1.38b02448ausjSX

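1.2 Problem analysis

 Goal: predict the transaction price of used cars.
 Analysis: this is a typical regression problem. We use the features provided in the dataset, together with machine learning algorithms, to predict the price. The dataset has many features; judging from real life, my guess is that brand, kilometers driven and fuel type are among the ones that influence price the most. A first rough look at the raw dataset shows a dense wall of numbers that tells me very little, so the next step is to load the data and analyze it.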

Task 2: Data Analysis

(Figure: mind map of the data analysis workflow)

# Import third-party packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

2.1 Reading the data

Train_data=pd.read_csv('used_car_train_20200313.csv',sep=' ')
Test_data=pd.read_csv('used_car_testA_20200313.csv',sep=' ')

2.2 Inspecting the data

 Look at the first and last five rows of the training and test sets to get a rough idea of the variables and the amount of data.

# DataFrame.append was removed in pandas 2.0, so concatenate head and tail instead
pd.concat([Train_data.head(), Train_data.tail()])
        SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...        v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
     0       0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5  ...   0.235676   0.101988   0.129549   0.022816   0.097462  -2.881803   2.804097  -2.420821   0.795292   0.914762
     1       1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0  ...   0.264777   0.121004   0.135731   0.026597   0.020582  -4.900482   2.096338  -1.030483  -1.722674   0.245522
     2       2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5  ...   0.251410   0.114912   0.165147   0.062173   0.027075  -4.846749   1.803559   1.565330  -0.832687  -0.229963
     3       3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0  ...   0.274293   0.110300   0.121964   0.033395   0.000000  -4.509599   1.285940  -0.501868  -2.438353  -0.478699
     4       4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0  ...   0.228036   0.073205   0.091880   0.078819   0.121534  -1.896240   0.910783   0.931110   2.834518   1.923482
149995  149995  163978  20000607  121.0     10       4.0       0.0      1.0    163       15.0  ...   0.280264   0.000310   0.048441   0.071158   0.019174   1.988114  -2.983973   0.589167  -1.304370  -0.302592
149996  149996  184535  20091102  116.0     11       0.0       0.0      0.0    125       10.0  ...   0.253217   0.000777   0.084079   0.099681   0.079371   1.839166  -2.774615   2.553994   0.924196  -0.272160
149997  149997  147587  20101003   60.0     11       1.0       1.0      0.0     90        6.0  ...   0.233353   0.000705   0.118872   0.100118   0.097914   2.439812  -1.630677   2.290197   1.891922   0.414931
149998  149998   45907  20060312   34.0     10       3.0       1.0      0.0    156       15.0  ...   0.256369   0.000252   0.081479   0.083558   0.081498   2.075380  -2.633719   1.414937   0.431981  -1.659014
149999  149999  177672  19990204   19.0     28       6.0       0.0      1.0    193       12.5  ...   0.284475   0.000000   0.040072   0.062543   0.025819   1.978453  -3.179913   0.031724  -1.483350  -0.342674

10 rows × 31 columns

Train_data.shape
(150000, 31)
Test_data.shape
(50000, 30)

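2.3 Data overview

 Combining the output above with the competition description gives a clearer picture of the data. It comes from the used-car transaction records of a trading platform and contains more than 400,000 records with 31 columns, 15 of which are anonymous variables. 150,000 records are drawn as the training set, 50,000 as test set A and 50,000 as test set B. The name, model, brand and regionCode fields have been desensitized.


label                description
SaleID               Sale sample ID
name                 Car code
regDate              Car registration date
model                Model code
brand                Brand
bodyType             Body type
fuelType             Fuel type
gearbox              Gearbox
power                Engine power
kilometer            Kilometers driven
notRepairedDamage    Whether the car has unrepaired damage
regionCode           Code of the region where the car can be viewed
seller               Seller
offerType            Offer type
creatDate            Date the ad was posted
v_                   Anonymous features, 15 in total (v_0 to v_14)
price                Car price (target)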

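2.4 Descriptive statistics

What is descriptive statistics?
 Descriptive statistics describes data by means of charts or summary values. In the data analysis stage of a data mining project we can use it to depict or summarize the basic characteristics of the data: it helps organize our own thinking, and it makes it easier to present the results to others. During numerical analysis we often need the statistical properties of the data, and NumPy and SciPy, both built for scientific computing, cover that need. Matplotlib can be used for plotting, which covers the graphical side of the analysis.

Train_data.describe()  # by default only numeric columns are summarized
#Train_data.describe().T  # transpose for easier reading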
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  ...            v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
count  150000.000000  150000.000000   1.500000e+05  149999.000000  150000.000000  145494.000000  141320.000000  144019.000000  150000.000000  150000.000000  ...  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000
mean    74999.500000   68349.172873   2.003417e+07      47.129021       8.052733       1.792369       0.375842       0.224943     119.316547      12.597160  ...       0.248204       0.044923       0.124692       0.058144       0.061996      -0.001000       0.009035       0.004813       0.000313      -0.000688
std     43301.414527   61103.875095   5.364988e+04      49.536040       7.864956       1.760640       0.548677       0.417546     177.168419       3.919576  ...       0.045804       0.051743       0.201410       0.029186       0.035692       3.772386       3.286071       2.517478       1.288988       1.038685
min         0.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000  ...       0.000000       0.000000       0.000000       0.000000       0.000000      -9.168192      -5.558207      -9.639552      -4.153899      -6.546556
25%     37499.750000   11156.000000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000  ...       0.243615       0.000038       0.062474       0.035334       0.033930      -3.722303      -1.951543      -1.871846      -1.057789      -0.437034
50%     74999.500000   51638.000000   2.003091e+07      30.000000       6.000000       1.000000       0.000000       0.000000     110.000000      15.000000  ...       0.257798       0.000812       0.095866       0.057014       0.058484       1.624076      -0.358053      -0.130753      -0.036245       0.141246
75%    112499.250000  118841.250000   2.007111e+07      66.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000  ...       0.265297       0.102009       0.125243       0.079382       0.087491       2.844357       1.255022       1.776933       0.942813       0.680378
max    149999.000000  196812.000000   2.015121e+07     247.000000      39.000000       7.000000       6.000000       1.000000   19312.000000      15.000000  ...       0.291838       0.151420       1.404936       0.160791       0.222787      12.357011      18.819042      13.847792      11.147669       8.658418

8 rows × 30 columns

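Train_data.describe(include=['object'])  # discrete (object) columns can be summarized too
        notRepairedDamage
count              150000
unique                  3
top                   0.0
freq               111361

 The describe() output above shows, for each numeric column of the training set, the count, mean, standard deviation, minimum, maximum and the three quartiles (25%, 50%, 75%).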

Test_data.describe()
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  ...            v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
count   50000.000000   50000.000000   5.000000e+04   50000.000000   50000.000000   48587.000000   47107.000000   48090.000000   50000.000000   50000.000000  ...   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000
mean   174999.500000   68542.223280   2.003393e+07      46.844520       8.056240       1.782185       0.373405       0.224350     119.883620      12.595580  ...       0.248669       0.045021       0.122744       0.057997       0.062000      -0.017855      -0.013742      -0.013554      -0.003147       0.001516
std     14433.901067   61052.808133   5.368870e+04      49.469548       7.819477       1.760736       0.546442       0.417158     185.097387       3.908979  ...       0.044601       0.051766       0.195972       0.029211       0.035653       3.747985       3.231258       2.515962       1.286597       1.027360
min    150000.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000  ...       0.000000       0.000000       0.000000       0.000000       0.000000      -9.160049      -5.411964      -8.916949      -4.123333      -6.112667
25%    162499.750000   11203.500000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000  ...       0.243762       0.000044       0.062644       0.035084       0.033714      -3.700121      -1.971325      -1.876703      -1.060428      -0.437920
50%    174999.500000   52248.500000   2.003091e+07      29.000000       6.000000       1.000000       0.000000       0.000000     109.000000      15.000000  ...       0.257877       0.000815       0.095828       0.057084       0.058764       1.613212      -0.355843      -0.142779      -0.035956       0.138799
75%    187499.250000  118856.500000   2.007110e+07      65.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000  ...       0.265328       0.102025       0.125438       0.079077       0.087489       2.832708       1.262914       1.764335       0.941469       0.681163
max    199999.000000  196805.000000   2.015121e+07     246.000000      39.000000       7.000000       6.000000       1.000000   20000.000000      15.000000  ...       0.291618       0.153265       1.358813       0.156355       0.214775      12.338872      18.856218      12.950498       5.913273       2.624622

8 rows × 29 columns

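 The describe() output above shows, for each numeric column of the test set, the count, mean, standard deviation, minimum, maximum and the three quartiles (25%, 50%, 75%).

Test_data.describe(include=['object'])  # discrete (object) columns can be summarized too
        notRepairedDamage
count               50000
unique                  3
top                   0.0
freq                37249

The describe parameters in detail
 About the include parameter: at first I did not think about the difference between numeric and discrete data and simply ran describe() with its defaults, which only summarizes numeric columns. Later I saw that discrete data can be summarized as well and noticed the include parameter: passing include lets you run descriptive statistics on other data types. Remember to account for every data type in your dataset!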
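To make the effect of include concrete, here is a minimal sketch on a made-up four-row frame (the values are invented for illustration, not taken from the competition data). The string column is created with an explicit object dtype so that include=['object'] selects it.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric column and one object column
df = pd.DataFrame({
    "power": [60, 0, 163, 193],
    "notRepairedDamage": pd.Series(["0.0", "-", "0.0", "1.0"], dtype=object),
})

numeric_summary = df.describe()                   # default: numeric columns only
object_summary = df.describe(include=["object"])  # object columns: count, unique, top, freq
full_summary = df.describe(include="all")         # every column; cells that do not apply are NaN

print(numeric_summary)
print(object_summary)
print(full_summary)
```

With include="all" both kinds of statistics land in one table, which is handy for a first look but harder to read on wide datasets.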

# Use info() to get familiar with the data types
#Train_data.info()
#Test_data.info()

2.5 Handling missing values

2.5.1 Identifying samples or features with missing values:

#Train_data.isnull().sum()
#Test_data.isnull().sum()
# Visualize the missing values
sns.set(color_codes=True)
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x1a3e0e2a1c8>

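2.5.2 The missingno library
 Before any analysis we need to make sure the dataset is of good quality. The missingno library offers a flexible, easy-to-use set of visualizations for inspecting missing data. It is built on matplotlib and accepts pandas data. Its functions for visualizing missing data are:

Matrix: the most widely used; a matrix view that shows the completeness of the dataset at a glance
Bar Chart: a simple bar chart of the missing data per column
Heatmap: makes it easy to see the nullity correlation between pairs of variables, although its explanatory power drops as the dataset grows
Dendrogram: uses a hierarchical clustering algorithm provided by scipy to group variables by their nullity correlation (measured as binary distance). At each step of the tree, the variables are split according to which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance is to zero.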
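The idea of nullity correlation behind the heatmap can be sketched with plain pandas, without installing missingno. This is a minimal sketch on made-up data and assumes that nullity correlation means the ordinary correlation between the isnull() masks of two columns; it is not missingno's own code.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): bodyType and fuelType are missing in the same rows
df = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, np.nan, 0.0, 3.0],
    "fuelType": [0.0, np.nan, 1.0, np.nan, 0.0, 1.0],
    "gearbox":  [0.0, 1.0, np.nan, 0.0, 1.0, 0.0],
})

null_mask = df.isnull()                       # True where a value is missing
missing_counts = null_mask.sum()              # missing values per column (what the bar chart shows)
nullity_corr = null_mask.astype(int).corr()   # correlation between the missingness patterns

print(missing_counts)
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together; a negative value means that when one is missing the other tends to be present.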

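 Note: if installing missingno fails with "PackagesNotFoundError: The following packages are not available from current channels", see the reference link, which is what rescued me.

2.5.3 Default values vs. null
 A default value is the value a field takes when nothing is supplied. null is the empty value and is one kind of default value; other common default values are 0 and False.

# Visualize the missing values
#msno.matrix(Train_data.sample(250))
#msno.heatmap(Train_data)
#msno.bar(Train_data.sample(1000))
# Check for abnormal values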
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

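Every column except notRepairedDamage (object dtype) is numeric. Remembering the describe() output above, which reported three unique values for this column, let's list its distinct values.

# Show the distinct values of notRepairedDamage
Train_data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

The '-' entries are missing values too. Many models handle NaN natively, so we do not impute anything for now and simply replace '-' with NaN.

# First replace the missing-value placeholder with NaN
# (assign the result back; an inplace replace on a selected column is unreliable in recent pandas)
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)
# Show the distinct values of notRepairedDamage again after the replacement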
Train_data['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0    37249
-       8031
1.0     4720
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].replace('-', np.nan)
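The replacement step can be tried on a small Series first. The sketch below uses invented values; converting the column to a numeric dtype afterwards with pd.to_numeric is my own addition, not something the steps above do.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) that uses '-' as a missing-value placeholder
raw = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage", dtype=object)

# Step 1: turn the placeholder into NaN (the column still has object dtype)
cleaned = raw.replace("-", np.nan)

# Step 2 (assumption, not part of the original steps): convert to a numeric dtype
as_numeric = pd.to_numeric(cleaned)

# One-step alternative: anything that cannot be parsed as a number becomes NaN
coerced = pd.to_numeric(raw, errors="coerce")

print(as_numeric)
```

The one-step version is shorter, but it also silently turns any other unparsable string into NaN, so it is worth checking value_counts() first, as done above.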

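2.5.4 Ways to handle missing values:
(1) Deletion (use when the proportion of missing values is small)
(2) Replacement (fill with a constant)
 - Discrete feature: consider filling with the mode
 - Numeric feature: consider filling with the mean or the median
(3) Imputation (fill in the missing values with a model)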
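Methods (1) and (2) from the list above can be sketched in a few lines of pandas. The frame below is made up for illustration; model-based imputation (3) is left out because it depends on the model you choose.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one discrete and one numeric feature
df = pd.DataFrame({
    "fuelType": [0.0, 0.0, 1.0, np.nan, 0.0],     # discrete code
    "power":    [60.0, np.nan, 163.0, 193.0, 68.0],  # numeric
})

# (1) Deletion: drop every row that contains a missing value
dropped = df.dropna()

# (2) Replacement: mode for the discrete feature, median for the numeric one
filled = df.copy()
filled["fuelType"] = filled["fuelType"].fillna(filled["fuelType"].mode()[0])
filled["power"] = filled["power"].fillna(filled["power"].median())

print(dropped)
print(filled)
```

If you impute, compute the fill values on the training set and reuse them for the test set, so that both sets are treated the same way.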

Train_data["seller"].value_counts()
0    149999
1         1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    150000
Name: offerType, dtype: int64

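The two features "seller" and "offerType" are extremely imbalanced (nearly every row has the same value), so they will not help the prediction and we drop them.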

del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
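Spotting such near-constant columns by eye does not scale, so here is a minimal sketch that flags them automatically. The helper dominant_share and the 0.99 threshold are hypothetical choices of mine, and the data is invented.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of rows taken by the most frequent value (NaN counted as a value)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Toy frame (hypothetical values): two near-constant columns and one useful one
df = pd.DataFrame({
    "seller":    [0] * 999 + [1],
    "offerType": [0] * 1000,
    "gearbox":   [0] * 700 + [1] * 300,
})

threshold = 0.99  # hypothetical cut-off; tune it for your data
skewed = [col for col in df.columns if dominant_share(df[col]) >= threshold]
reduced = df.drop(columns=skewed)

print(skewed)
```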

2.6 Analyzing the target variable

# Look at the distribution of the target
Train_data['price']

0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500      2337
1500     2158
1200     1922
1000     1850
2500     1821
         ... 
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64
# Overall distribution (Johnson SU and others)
import scipy.stats as st
y=Train_data['price']
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y,kde=False,fit=st.johnsonsu)
plt.figure(2);plt.title('Normal')
sns.distplot(y,kde=False,fit=st.norm)
plt.figure(3);plt.title('Log Normal')
sns.distplot(y,kde=False,fit=st.lognorm)
<matplotlib.axes._subplots.AxesSubplot at 0x2995c111688>


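2.6.1 Johnson SU (unbounded Johnson) distribution:
   SKEW: returns the skewness of a distribution, that is, how asymmetric the distribution is around its mean. Positive skewness means the longer tail extends toward larger values; negative skewness means the longer tail extends toward smaller values.
   KURT: returns the kurtosis of a set of data. Kurtosis describes how peaked or flat a distribution is compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution; negative kurtosis indicates a relatively flat one.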
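To get a feel for these two numbers, the sketch below compares a symmetric sample with a right-skewed one. The samples are synthetic (drawn with a fixed seed), not competition data. Note that pandas reports excess kurtosis, so a normal sample gives a value near 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples: a symmetric one and one with a long right tail
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

print("normal    skew=%.3f  kurt=%.3f" % (symmetric.skew(), symmetric.kurt()))
print("lognormal skew=%.3f  kurt=%.3f" % (right_skewed.skew(), right_skewed.kurt()))
```

The price column behaves like the second sample: clearly positive skewness and a large kurtosis, which is what motivates the log transform later on.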

# The best fit is the Johnson SU distribution
# skew: skewness of the distribution; kurt: kurtosis of the data
sns.distplot(Train_data['price'])
print("Skewness:%f" % Train_data['price'].skew())
print("Kurtosis:%f"%Train_data['price'].kurt())
Skewness:3.346487
Kurtosis:18.995183

Train_data.skew(),Train_data.kurt()
(SaleID               6.017846e-17
 name                 5.576058e-01
 regDate              2.849508e-02
 model                1.484388e+00
 brand                1.150760e+00
 bodyType             9.915299e-01
 fuelType             1.595486e+00
 gearbox              1.317514e+00
 power                6.586318e+01
 kilometer           -1.525921e+00
 notRepairedDamage    2.430640e+00
 regionCode           6.888812e-01
 creatDate           -7.901331e+01
 price                3.346487e+00
 v_0                 -1.316712e+00
 v_1                  3.594543e-01
 v_2                  4.842556e+00
 v_3                  1.062920e-01
 v_4                  3.679890e-01
 v_5                 -4.737094e+00
 v_6                  3.680730e-01
 v_7                  5.130233e+00
 v_8                  2.046133e-01
 v_9                  4.195007e-01
 v_10                 2.522046e-02
 v_11                 3.029146e+00
 v_12                 3.653576e-01
 v_13                 2.679152e-01
 v_14                -1.186355e+00
 dtype: float64,
 SaleID                 -1.200000
 name                   -1.039945
 regDate                -0.697308
 model                   1.740483
 brand                   1.076201
 bodyType                0.206937
 fuelType                5.880049
 gearbox                -0.264161
 power                5733.451054
 kilometer               1.141934
 notRepairedDamage       3.908072
 regionCode             -0.340832
 creatDate            6881.080328
 price                  18.995183
 v_0                     3.993841
 v_1                    -1.753017
 v_2                    23.860591
 v_3                    -0.418006
 v_4                    -0.197295
 v_5                    22.934081
 v_6                    -1.742567
 v_7                    25.845489
 v_8                    -0.636225
 v_9                    -0.321491
 v_10                   -0.577935
 v_11                   12.568731
 v_12                    0.268937
 v_13                   -0.438274
 v_14                    2.393526
 dtype: float64)
sns.distplot(Train_data.skew(),color='blue',axlabel='Skewness')
<matplotlib.axes._subplots.AxesSubplot at 0x20213945d88>

sns.distplot(Train_data.kurt(),color='orange',axlabel='Kurtosis')
<matplotlib.axes._subplots.AxesSubplot at 0x20213470c08>

# Histogram of the target values
plt.hist(Train_data['price'],orientation='vertical',histtype='bar')
plt.show()

plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='pink')
plt.show()
# After the log transform the distribution is much more even, so we predict on the log of price; this is a common trick for prediction problems

Y_train=Train_data['price']
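When the model is trained on log prices, its predictions have to be transformed back before submission. A minimal sketch of that round trip follows, on invented prices. It uses np.log1p and np.expm1 rather than the np.log call above, which is my own assumption to stay safe if a price of 0 ever shows up.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical values) with one very large outlier
price = pd.Series([500, 1850, 3600, 6222, 99999], name="price")

log_price = np.log1p(price)      # train the model on this
restored = np.expm1(log_price)   # apply this to the model's predictions

print(price.skew(), log_price.skew())
print(restored.round(2).tolist())
```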

2.7、分析变量

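# Numeric features
numeric_features = Train_data.select_dtypes(include=[np.number])
print("Numeric features: {}".format(numeric_features.columns))
# Categorical features (np.object was removed from NumPy, so use the string 'object')
categorical_features = Train_data.select_dtypes(include=['object'])
print("Categorical features: {}".format(categorical_features.columns))
Numeric features: Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'regionCode', 'creatDate', 'price',
       'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9',
       'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
Categorical features: Index(['notRepairedDamage'], dtype='object')

This automatic split does not work here: the categorical features are stored as numeric codes, so almost everything is reported as numeric. The two lists have to be set by hand!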
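Why the dtype-based split fails is easy to reproduce on a toy frame. In this sketch (invented values) brand is a category stored as integer codes, so select_dtypes files it under the numeric columns.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): brand is categorical but stored as integer codes
df = pd.DataFrame({
    "brand": [0, 4, 14, 0],
    "power": [60, 0, 163, 193],
    "notRepairedDamage": pd.Series(["0.0", "-", "0.0", "1.0"], dtype=object),
})

auto_numeric = df.select_dtypes(include=[np.number]).columns.tolist()
auto_categorical = df.select_dtypes(include=["object"]).columns.tolist()
print(auto_numeric)       # brand ends up here although it is a category
print(auto_categorical)

# The split that matches the meaning of the columns has to be written by hand
numeric_features = ["power"]
categorical_features = ["brand", "notRepairedDamage"]
```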

2.7.1 Number of distinct values (nunique) per feature:

numeric_features=['power','kilometer','v_0','v_1','v_2','v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
categorical_features=['name','model','brand','bodyType','fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
# Distinct values of each categorical feature
for cat_fea in categorical_features:
    print("Distribution of feature " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea,Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
Distribution of feature name:
name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
         ... 
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
Distribution of feature model:
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of feature brand:
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of feature bodyType:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of feature fuelType:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of feature gearbox:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
       ... 
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64
for cat_fea in categorical_features:
    print("Distribution of feature " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea,Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
Distribution of feature name:
name has 37453 distinct values
55       97
708      96
387      95
1541     88
713      74
         ..
22270     1
89855     1
42752     1
48899     1
11808     1
Name: name, Length: 37453, dtype: int64
Distribution of feature model:
model has 247 distinct values
0.0      3896
19.0     3245
4.0      3007
1.0      1981
29.0     1742
         ... 
242.0       1
240.0       1
244.0       1
243.0       1
246.0       1
Name: model, Length: 247, dtype: int64
Distribution of feature brand:
brand has 40 distinct values
0     10348
4      5763
14     5314
10     4766
1      4532
6      3502
9      2423
5      1569
13     1245
11      919
7       795
3       773
16      771
8       704
25      695
27      650
21      544
15      511
20      450
19      450
12      389
22      363
30      324
17      317
26      303
24      268
28      225
32      193
29      117
31      115
18      106
2       104
37       92
34       77
33       76
36       67
23       62
35       53
38       23
39        2
Name: brand, dtype: int64
Distribution of feature bodyType:
bodyType has 8 distinct values
0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
Distribution of feature fuelType:
fuelType has 7 distinct values
0.0    30656
1.0    15544
2.0      774
3.0       72
4.0       37
6.0       14
5.0       10
Name: fuelType, dtype: int64
Distribution of feature gearbox:
gearbox has 2 distinct values
0.0    37301
1.0    10789
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
notRepairedDamage has 2 distinct values
0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
regionCode has 6971 distinct values
419     146
764      78
188      52
125      51
759      51
       ... 
7753      1
7463      1
7230      1
826       1
112       1
Name: regionCode, Length: 6971, dtype: int64

2.7.2 Numeric feature analysis:

# Numeric feature analysis
numeric_features.append('price')
numeric_features
['power',
 'kilometer',
 'v_0',
 'v_1',
 'v_2',
 'v_3',
 'v_4',
 'v_5',
 'v_6',
 'v_7',
 'v_8',
 'v_9',
 'v_10',
 'v_11',
 'v_12',
 'v_13',
 'v_14',
 'price']
Train_data.head()
        SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...        v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
     0       0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5  ...   0.235676   0.101988   0.129549   0.022816   0.097462  -2.881803   2.804097  -2.420821   0.795292   0.914762
     1       1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0  ...   0.264777   0.121004   0.135731   0.026597   0.020582  -4.900482   2.096338  -1.030483  -1.722674   0.245522
     2       2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5  ...   0.251410   0.114912   0.165147   0.062173   0.027075  -4.846749   1.803559   1.565330  -0.832687  -0.229963
     3       3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0  ...   0.274293   0.110300   0.121964   0.033395   0.000000  -4.509599   1.285940  -0.501868  -2.438353  -0.478699
     4       4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0  ...   0.228036   0.073205   0.091880   0.078819   0.121534  -1.896240   0.910783   0.931110   2.834518   1.923482

5 rows × 29 columns

2.7.3 Correlation analysis:

# Correlation analysis
price_numeric=Train_data[numeric_features]
correlation=price_numeric.corr()
print(correlation['price'].sort_values(ascending=False),'\n')
f,ax=plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square=True,vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x216b6630848>

del price_numeric['price']
# Skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col),
         'Skewness:{:05.2f}'.format(Train_data[col].skew()),
         '  ',
         'Kurtosis:{:06.2f}'.format(Train_data[col].kurt())
         )

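2.7.4 Using melt:
Signature: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

   frame: the DataFrame to process.

   id_vars: the columns to keep as identifiers, that is, the columns that are not unpivoted.

   value_vars: the columns to unpivot; if every remaining column should be unpivoted, it can be left out.

   var_name and value_name: custom names for the resulting variable and value columns.

   col_level: if the columns are a MultiIndex, melt at this level.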
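The parameters are easiest to understand on a tiny frame. The sketch below uses two rows of made-up data; the column names feature and reading are arbitrary names chosen for illustration.

```python
import pandas as pd

# Toy wide frame (hypothetical values): one row per car, one column per feature
wide = pd.DataFrame({
    "SaleID": [0, 1],
    "power": [60, 0],
    "kilometer": [12.5, 15.0],
})

# Unpivot power and kilometer into (feature, reading) pairs, keeping SaleID as identifier
long = pd.melt(
    wide,
    id_vars=["SaleID"],
    value_vars=["power", "kilometer"],
    var_name="feature",
    value_name="reading",
)
print(long)
```

Each value column of the wide frame becomes a block of rows in the long frame, which is the shape that FacetGrid expects when it draws one panel per feature.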

# Visualize the distribution of each numeric feature
f=pd.melt(Train_data,value_vars=numeric_features)
f
        variable   value
0          power    60.0
1          power     0.0
2          power   163.0
3          power   193.0
4          power    68.0
...          ...     ...
2699995    price  5900.0
2699996    price  9500.0
2699997    price  7500.0
2699998    price  4999.0
2699999    price  4700.0

2700000 rows × 2 columns

g=sns.FacetGrid(f,col='variable',col_wrap=2,sharex=False,sharey=False)
g=g.map(sns.distplot,"value")
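# Visualize the pairwise relationships between numeric features
sns.set()
columns=['price','v_12','v_8','v_0','power','v_5','v_2','v_6','v_1','v_14']
# seaborn renamed the size parameter of pairplot to height
sns.pairplot(Train_data[columns],height=2,kind='scatter',diag_kind='kde')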
plt.show()

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
Y_train
0         1850
1         3600
2         6222
3         2400
4         5200
          ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
# Regression plots of price against individual features
reg_features = ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14', 'v_13']
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# One panel per feature, filled row by row
for feature, ax in zip(reg_features, axes.flatten()):
    scatter_data = pd.concat([Y_train, Train_data[feature]], axis=1)
    sns.regplot(x=feature, y='price', data=scatter_data, scatter=True, fit_reg=True, ax=ax)

# Categorical feature analysis
# (1) Number of distinct values
for fea in categorical_features:
    print(Train_data[fea].nunique())
categorical_features
['name',
 'model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage',
 'regionCode']
Train_data.columns
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']

Generating a data analysis report with pandas_profiling

import pandas_profiling
pfr=pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")

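2.8 Conclusions drawn from the data report

Looking at the generated report, my guess is that brand, bodyType (body type), fuelType (fuel type), gearbox and kilometer (kilometers driven) have a relatively large influence on price. The bar plots below check this visually.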

sns.barplot(x="brand", y="price", data=Train_data,palette="Set2",errcolor='grey')
<matplotlib.axes._subplots.AxesSubplot at 0x1a3e1b75888>


One brand has a much higher average used-car price than the others.

sns.barplot(x="bodyType", y="price", data=Train_data,palette="Blues",errcolor='grey')
<matplotlib.axes._subplots.AxesSubplot at 0x22ca6075748>

Used cars with bodyType 4.0, 5.0 and 6.0 have higher prices.

sns.barplot(x="fuelType", y="price", data=Train_data,palette="BuGn_r",errcolor='grey')
<matplotlib.axes._subplots.AxesSubplot at 0x22ca8f59d48>

Used cars with fuelType 4.0 and 6.0 have higher prices.

sns.barplot(x="gearbox", y="price", data=Train_data,palette="plasma_r",errcolor='yellow')
<matplotlib.axes._subplots.AxesSubplot at 0x1a3e1563a08>

sns.barplot(x="kilometer", y="price", data=Train_data)
<matplotlib.axes._subplots.AxesSubplot at 0x22ca702c3c8>
