[Datawhale] [task2] 2.3 Code Examples

zyq_go

Original link: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12281978.0.0.68021b43TmIsNI&postId=95457

2.3 Code Examples

2.3.1 Load the data science and visualization libraries

# coding: utf-8
# Import the warnings module and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os 

# Output path for results
output_path='G:/newjourney/Datawhale/output'
if not os.path.exists(output_path):
    os.makedirs(output_path)

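2.3.2 Load the data

## 1) Load the training and test sets
path='G:/newjourney/Datawhale/'
Train_data=pd.read_csv(path+'used_car_train_20200313.csv',sep=' ')
Test_data=pd.read_csv(path+'used_car_testB_20200421.csv',sep=' ')

## 2) Take a quick look at the data (head() + shape)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead.
pd.concat([Train_data.head(),Train_data.tail()])
# Nice, learned something here: head and tail can be shown together in one table.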

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	1	2.0	0.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	15	1.0	0.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	10	0.0	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	5	1.0	0.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482
149995	149995	163978	20000607	121.0	10	4.0	0.0	1.0	163	15.0	...	0.280264	0.000310	0.048441	0.071158	0.019174	1.988114	-2.983973	0.589167	-1.304370	-0.302592
149996	149996	184535	20091102	116.0	11	0.0	0.0	0.0	125	10.0	...	0.253217	0.000777	0.084079	0.099681	0.079371	1.839166	-2.774615	2.553994	0.924196	-0.272160
149997	149997	147587	20101003	60.0	11	1.0	1.0	0.0	90	6.0	...	0.233353	0.000705	0.118872	0.100118	0.097914	2.439812	-1.630677	2.290197	1.891922	0.414931
149998	149998	45907	20060312	34.0	10	3.0	1.0	0.0	156	15.0	...	0.256369	0.000252	0.081479	0.083558	0.081498	2.075380	-2.633719	1.414937	0.431981	-1.659014
149999	149999	177672	19990204	19.0	28	6.0	0.0	1.0	193	12.5	...	0.284475	0.000000	0.040072	0.062543	0.025819	1.978453	-3.179913	0.031724	-1.483350	-0.342674

10 rows × 31 columns

Train_data.shape

(150000, 31)

pd.concat([Test_data.head(),Test_data.tail()])

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	200000	133777	20000501	67.0	0	1.0	0.0	0.0	101	15.0	...	0.236520	0.000241	0.105319	0.046233	0.094522	3.619512	-0.280607	-2.019761	0.978828	0.803322
1	200001	61206	19950211	19.0	6	2.0	0.0	0.0	73	6.0	...	0.261518	0.000000	0.120323	0.046784	0.035385	2.997376	-1.406705	-1.020884	-1.349990	-0.200542
2	200002	67829	20090606	5.0	5	4.0	0.0	0.0	120	5.0	...	0.261691	0.090836	0.000000	0.079655	0.073586	-3.951084	-0.433467	0.918964	1.634604	1.027173
3	200003	8892	20020601	22.0	9	1.0	0.0	0.0	58	15.0	...	0.236050	0.101777	0.098950	0.026830	0.096614	-2.846788	2.800267	-2.524610	1.076819	0.461610
4	200004	76998	20030301	46.0	6	0.0	NaN	0.0	116	15.0	...	0.257000	0.000000	0.066732	0.057771	0.068852	2.839010	-1.659801	-0.924142	0.199423	0.451014
49995	249995	111443	20041005	4.0	4	0.0	NaN	1.0	150	15.0	...	0.263668	0.000292	0.141804	0.076393	0.039272	2.072901	-2.531869	1.716978	-1.063437	0.326587
49996	249996	152834	20130409	65.0	1	0.0	0.0	0.0	179	4.0	...	0.255310	0.000991	0.155868	0.108425	0.067841	1.358504	-3.290295	4.269809	0.140524	0.556221
49997	249997	132531	20041211	4.0	4	0.0	0.0	1.0	147	12.5	...	0.262933	0.000318	0.141872	0.071968	0.042966	2.165658	-2.417885	1.370612	-1.073133	0.270602
49998	249998	143405	20020702	40.0	1	4.0	0.0	1.0	176	15.0	...	0.282106	0.000023	0.067483	0.067526	0.009006	2.030114	-2.939244	0.569078	-1.718245	0.316379
49999	249999	78202	20090708	32.0	8	1.0	0.0	0.0	0	3.0	...	0.231449	0.103947	0.096027	0.062328	0.110180	-3.689090	2.032376	0.109157	2.202828	0.847469

10 rows × 30 columns

Test_data.shape

(50000, 30)

Get into the habit of checking head() and shape of every dataset; it makes each later step more reassuring.
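The head-plus-shape habit can be wrapped in a small helper. This is only a sketch: `peek` is a hypothetical name that does not appear in the original notebook, and the toy frame stands in for `Train_data`.

```python
import pandas as pd

def peek(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print the shape of df and return its first and last n rows."""
    print("shape:", df.shape)
    # pd.concat stacks the two slices; it also works on pandas >= 2.0,
    # where DataFrame.append is no longer available.
    return pd.concat([df.head(n), df.tail(n)])

# Toy frame standing in for Train_data (values are made up).
toy = pd.DataFrame({"SaleID": range(20), "power": range(100, 120)})
preview = peek(toy, n=3)
print(preview)
```

On the real data the call would be `peek(Train_data)` and `peek(Test_data)`.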

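2.3.3 Overview of the data

1. describe() reports summary statistics for each column: count, mean, standard deviation (std), minimum (min), the 25%, 50% (median), and 75% quantiles, and maximum (max). These give a quick sense of the range of each column and help spot anomalous values. For example, values such as 999, 9999, or -1 sometimes turn up; these are often just another way of encoding NaN and deserve attention.

2. info() shows the type of each column, which helps reveal whether special symbols other than NaN are being used to mark anomalies.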
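To make the remark about 999, 9999, and -1 concrete, here is a minimal sketch that counts such placeholder values per numeric column. The sentinel list and the name `count_sentinels` are assumptions for illustration; which values really mean "missing" depends on the dataset.

```python
import numpy as np
import pandas as pd

# Assumed placeholder values; adjust to the dataset at hand.
SENTINELS = [-1, 999, 9999]

def count_sentinels(df: pd.DataFrame, sentinels=SENTINELS) -> pd.Series:
    """Count, per numeric column, how many entries equal a sentinel value."""
    numeric = df.select_dtypes(include=[np.number])
    counts = numeric.isin(sentinels).sum()
    return counts[counts > 0]

# Toy frame (values are made up).
toy = pd.DataFrame({
    "power": [60, 999, 163, -1],
    "kilometer": [12.5, 15.0, 9999, 5.0],
    "brand": [6, 1, 15, 10],
})
sentinel_counts = count_sentinels(toy)
print(sentinel_counts)
```

A hit is only a hint: 999 can also be a legitimate value, so each flagged column still needs a look.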

## 1) Use describe() to get familiar with the summary statistics
Train_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	150000.000000	150000.000000	1.500000e+05	149999.000000	150000.000000	145494.000000	141320.000000	144019.000000	150000.000000	150000.000000	...	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000
mean	74999.500000	68349.172873	2.003417e+07	47.129021	8.052733	1.792369	0.375842	0.224943	119.316547	12.597160	...	0.248204	0.044923	0.124692	0.058144	0.061996	-0.001000	0.009035	0.004813	0.000313	-0.000688
std	43301.414527	61103.875095	5.364988e+04	49.536040	7.864956	1.760640	0.548677	0.417546	177.168419	3.919576	...	0.045804	0.051743	0.201410	0.029186	0.035692	3.772386	3.286071	2.517478	1.288988	1.038685
min	0.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.168192	-5.558207	-9.639552	-4.153899	-6.546556
25%	37499.750000	11156.000000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243615	0.000038	0.062474	0.035334	0.033930	-3.722303	-1.951543	-1.871846	-1.057789	-0.437034
50%	74999.500000	51638.000000	2.003091e+07	30.000000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	...	0.257798	0.000812	0.095866	0.057014	0.058484	1.624076	-0.358053	-0.130753	-0.036245	0.141246
75%	112499.250000	118841.250000	2.007111e+07	66.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265297	0.102009	0.125243	0.079382	0.087491	2.844357	1.255022	1.776933	0.942813	0.680378
max	149999.000000	196812.000000	2.015121e+07	247.000000	39.000000	7.000000	6.000000	1.000000	19312.000000	15.000000	...	0.291838	0.151420	1.404936	0.160791	0.222787	12.357011	18.819042	13.847792	11.147669	8.658418

8 rows × 30 columns

Test_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	50000.000000	50000.000000	5.000000e+04	50000.00000	50000.000000	48496.000000	47076.000000	48032.000000	50000.000000	50000.000000	...	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
mean	224999.500000	68505.606100	2.003401e+07	47.64948	8.087140	1.793736	0.376498	0.226953	119.766960	12.598260	...	0.248147	0.044624	0.124693	0.058198	0.062113	0.019633	0.002759	0.004342	0.004570	-0.007209
std	14433.901067	61032.124271	5.351615e+04	49.90741	7.899648	1.764970	0.549281	0.418866	206.313348	3.912519	...	0.045836	0.051664	0.201440	0.029171	0.035723	3.764095	3.289523	2.515912	1.287194	1.044718
min	200000.000000	1.000000	1.991000e+07	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.119719	-5.662163	-8.291868	-4.157649	-6.098192
25%	212499.750000	11315.000000	1.999100e+07	11.00000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243436	0.000035	0.062519	0.035413	0.033880	-3.675196	-1.963928	-1.865406	-1.048722	-0.440706
50%	224999.500000	52215.000000	2.003091e+07	30.00000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	...	0.257818	0.000801	0.095880	0.056804	0.058749	1.632134	-0.375537	-0.138943	-0.036352	0.136849
75%	237499.250000	118710.750000	2.007110e+07	66.00000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265263	0.101654	0.125470	0.079387	0.087624	2.846205	1.263451	1.775632	0.945239	0.685555
max	249999.000000	196808.000000	2.015121e+07	246.00000	39.000000	7.000000	6.000000	1.000000	19211.000000	15.000000	...	0.291176	0.153403	1.411559	0.157458	0.211304	12.177864	18.789496	13.384828	5.635374	2.649768

8 rows × 29 columns

## 2) Use info() to get familiar with the data types
Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48496 non-null float64
fuelType             47076 non-null float64
gearbox              48032 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB

2.3.4 Check for missing and anomalous data

## 1) Check the NaN count of each column
Train_data.isnull().sum()

SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

Test_data.isnull().sum()

SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1504
fuelType             2924
gearbox              1968
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

# Visualize the NaN counts
missing=Train_data.isnull().sum()
missing=missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x263bf61cda0>

[Figure: output_17_1.png]

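The two steps above show at a glance which columns contain NaN and how many. The main goal is to judge whether the number of NaNs is really large. If it is small, the values are usually filled in. With tree models such as LightGBM (lgb), the gaps can be left as they are so that the trees handle them. If a column has too many NaNs, dropping it is worth considering.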
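The fill / leave / drop decision can be tabulated per column. The sketch below is one possible way to do it; `missing_report` is a hypothetical helper and the 0.5 drop threshold is an arbitrary assumption, not a value from the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """List columns that contain NaN, with counts, ratios, and a rough suggestion."""
    report = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "ratio": df.isnull().mean(),
    })
    report = report[report["n_missing"] > 0].sort_values("ratio")
    # The threshold is an assumption; pick it per project.
    report["suggestion"] = np.where(
        report["ratio"] > drop_threshold,
        "consider dropping",
        "fill, or leave to a tree model",
    )
    return report

# Toy frame (values are made up).
toy = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, 0.0],
    "fuelType": [np.nan, np.nan, np.nan, 0.0],
    "power": [60, 0, 163, 193],
})
report = missing_report(toy)
print(report)
```

In the training set the largest ratio is fuelType at 8680 / 150000, roughly 5.8%, so by this rule none of the columns would be a drop candidate.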

# Visualize the missing values with missingno (msno)
msno.matrix(Train_data.sample(250)) # a random sample of 250 rows

<matplotlib.axes._subplots.AxesSubplot at 0x263bf70dcc0>

[Figure: output_19_1.png]

msno.matrix(Train_data)

<matplotlib.axes._subplots.AxesSubplot at 0x263c17172e8>

[Figure: output_20_1.png]

msno.bar(Train_data.sample(1000))

<matplotlib.axes._subplots.AxesSubplot at 0x263bf973358>

[Figure: output_21_1.png]

# Visualize the missing values
msno.matrix(Test_data)

<matplotlib.axes._subplots.AxesSubplot at 0x263c19801d0>

[Figure: output_22_1.png]

msno.bar(Test_data)

<matplotlib.axes._subplots.AxesSubplot at 0x263bf9d3be0>

[Figure: output_23_1.png]

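The missing-value patterns of the test set and the training set are similar:

Training set: model 1, brand 0, bodyType 4506, fuelType 8680, gearbox 5981
Test set: bodyType 1504, fuelType 2924, gearbox 1968

In both, fuelType has the most missing values.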

## 2) Check for anomalous values
Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Every column is numeric except notRepairedDamage (150000 non-null object), which has type object, so let us look at its values in detail.

Train_data['notRepairedDamage'].value_counts()

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

Here "-" is also a missing value. Since many models handle NaN directly, we replace "-" with NaN for now.

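# Assign the result back; inplace=True on a selected column is unreliable in recent pandas.
Train_data['notRepairedDamage']=Train_data['notRepairedDamage'].replace('-',np.nan)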
Train_data['notRepairedDamage'].value_counts()

0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
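After the replacement the column still has dtype object, because the remaining entries are strings. A possible follow-up, shown here as a sketch on toy data and not part of the original notebook, is to convert the column to float with `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Toy column mimicking notRepairedDamage as read from the CSV (strings plus "-").
toy = pd.DataFrame({"notRepairedDamage": ["0.0", "-", "1.0", "0.0"]})

# Replace the placeholder first, then convert; any other unexpected string
# would raise here instead of silently turning into NaN.
cleaned = toy["notRepairedDamage"].replace("-", np.nan)
toy["notRepairedDamage"] = pd.to_numeric(cleaned)

n_missing = int(toy["notRepairedDamage"].isnull().sum())
print(toy["notRepairedDamage"].dtype, n_missing)
```

A shorter variant is `pd.to_numeric(col, errors="coerce")`, at the cost of hiding any other malformed entries.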

Train_data.isnull().sum()

SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64

Test_data['notRepairedDamage'].value_counts()

0.0    37224
-       8069
1.0     4707
Name: notRepairedDamage, dtype: int64

Test_data['notRepairedDamage']=Test_data['notRepairedDamage'].replace('-',np.nan)

Test_data.isnull().sum()

SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1504
fuelType             2924
gearbox              1968
power                   0
kilometer               0
notRepairedDamage    8069
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

Test_data['notRepairedDamage'].value_counts()

0.0    37224
1.0     4707
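Name: notRepairedDamage, dtype: int64

The two categories "seller" and "offerType" are heavily skewed and generally do not help prediction, so they are dropped here. They could be explored further, but that is usually not worthwhile.
# One open question: how were these two columns found to be heavily skewed in the first place? By checking them one at a time?

# Look at the value counts of these two columns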
Train_data['seller'].value_counts()

0    149999
1         1
Name: seller, dtype: int64

Train_data['offerType'].value_counts()

0    150000
Name: offerType, dtype: int64

# So delete them, from both the training set and the test set
del Train_data['seller']
del Test_data['seller']
del Train_data['offerType']
del Test_data['offerType']
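As for how heavily skewed columns such as seller and offerType might be found without checking each one by hand: one option is to compute, for every column, the share of rows taken by its most frequent value. This is a sketch; `dominant_share` and the 0.85 cutoff are assumptions chosen for the toy data (on the real data a cutoff close to 1 would make more sense).

```python
import pandas as pd

def dominant_share(df: pd.DataFrame) -> pd.Series:
    """Share of rows occupied by the most frequent value of each column."""
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy frame (values are made up); seller and offerType mimic the skew seen above.
toy = pd.DataFrame({
    "seller": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "offerType": [0] * 10,
    "brand": [0, 4, 14, 10, 1, 6, 9, 5, 13, 11],
})
share = dominant_share(toy)
skewed = share[share > 0.85].index.tolist()
print(share)
print("heavily skewed:", skewed)
```

In the training set, seller would score 149999 / 150000 and offerType exactly 1.0, so both would be flagged.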

2.3.5 Understand the distribution of the target

Train_data['price']

0          1850
1          3600
2          6222
3          2400
4          5200
5          8000
6          3500
7          1000
8          2850
9           650
10         3100
11         5450
12         1600
13         3100
14         6900
15         3200
16        10500
17         3700
18          790
19         1450
20          990
21         2800
22          350
23          599
24         9250
25         3650
26         2800
27         2399
28         4900
29         2999
          ...  
149970      900
149971     3400
149972      999
149973     3500
149974     4500
149975     3990
149976     1200
149977      330
149978     3350
149979     5000
149980     4350
149981     9000
149982     2000
149983    12000
149984     6700
149985     4200
149986     2800
149987     3000
149988     7500
149989     1150
149990      450
149991    24950
149992      950
149993     4399
149994    14780
149995     5900
149996     9500
149997     7500
149998     4999
149999     4700
Name: price, Length: 150000, dtype: int64

Train_data['price'].value_counts()

500      2337
1500     2158
1200     1922
1000     1850
2500     1821
600      1535
3500     1533
800      1513
2000     1378
999      1356
750      1279
4500     1271
650      1257
1800     1223
2200     1201
850      1198
700      1174
900      1107
1300     1105
950      1104
3000     1098
1100     1079
5500     1079
1600     1074
300      1071
550      1042
350      1005
1250     1003
6500      973
1999      929
         ... 
21560       1
7859        1
3120        1
2279        1
6066        1
6322        1
4275        1
10420       1
43300       1
305         1
1765        1
15970       1
44400       1
8885        1
2992        1
31850       1
15413       1
13495       1
9525        1
7270        1
13879       1
3760        1
24250       1
11360       1
10295       1
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64

## 1) Overall distribution (unbounded Johnson distribution (johnsonsu)? normal (norm)? log-normal (lognorm)?)

import scipy.stats as st
y=Train_data['price']
plt.figure(1);plt.title('Johnson SU')
sns.distplot(y,kde=False,fit=st.johnsonsu)
plt.figure(2);plt.title('Normal')
sns.distplot(y,kde=False,fit=st.norm)
plt.figure(3);plt.title('Log Normal')
sns.distplot(y,kde=False,fit=st.lognorm)
# Why use sns.distplot() for these plots?
# Answer: Seaborn is a Python visualization library built on matplotlib. It wraps matplotlib in a higher-level API for statistical graphics, so polished plots need far less manual tuning.
# Note: distplot has been deprecated since seaborn 0.11.

<matplotlib.axes._subplots.AxesSubplot at 0x263b6bbd080>

[Figure: output_42_1.png]

[Figure: output_42_2.png]

[Figure: output_42_3.png]

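Price does not follow a normal distribution, so it needs to be transformed before regression. The log transform does well, but the best fit is the unbounded Johnson distribution.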

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis:%f" % Train_data['price'].kurt())

Skewness: 3.346487
Kurtosis:18.995183

[Figure: output_44_1.png]
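To see what the two numbers respond to, the sketch below compares skewness and kurtosis before and after a log transform on ten prices taken from the listing above. Note that pandas `kurt()` returns excess kurtosis (a normal distribution gives 0). Using `np.log1p` here is my choice for the illustration; ten values are far too few to say anything about the full dataset.

```python
import numpy as np
import pandas as pd

# Ten prices from the training-set listing above; one of them (24950) is far out.
price = pd.Series([350, 599, 650, 790, 990, 1450, 1850, 2400, 3600, 24950])
log_price = np.log1p(price)

skew_raw, skew_log = price.skew(), log_price.skew()
kurt_raw, kurt_log = price.kurt(), log_price.kurt()

print(f"raw:   skew={skew_raw:.3f}, kurt={kurt_raw:.3f}")
print(f"log1p: skew={skew_log:.3f}, kurt={kurt_log:.3f}")
```

The long right tail dominates the raw skewness; after the transform the single large value pulls much less.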

# Skewness and kurtosis of every feature in the training set; learned something here
Train_data.skew(),Train_data.kurt()

(SaleID               6.017846e-17
 name                 5.576058e-01
 regDate              2.849508e-02
 model                1.484388e+00
 brand                1.150760e+00
 bodyType             9.915299e-01
 fuelType             1.595486e+00
 gearbox              1.317514e+00
 power                6.586318e+01
 kilometer           -1.525921e+00
 notRepairedDamage    2.430640e+00
 regionCode           6.888812e-01
 creatDate           -7.901331e+01
 price                3.346487e+00
 v_0                 -1.316712e+00
 v_1                  3.594543e-01
 v_2                  4.842556e+00
 v_3                  1.062920e-01
 v_4                  3.679890e-01
 v_5                 -4.737094e+00
 v_6                  3.680730e-01
 v_7                  5.130233e+00
 v_8                  2.046133e-01
 v_9                  4.195007e-01
 v_10                 2.522046e-02
 v_11                 3.029146e+00
 v_12                 3.653576e-01
 v_13                 2.679152e-01
 v_14                -1.186355e+00
 dtype: float64, SaleID                 -1.200000
 name                   -1.039945
 regDate                -0.697308
 model                   1.740483
 brand                   1.076201
 bodyType                0.206937
 fuelType                5.880049
 gearbox                -0.264161
 power                5733.451054
 kilometer               1.141934
 notRepairedDamage       3.908072
 regionCode             -0.340832
 creatDate            6881.080328
 price                  18.995183
 v_0                     3.993841
 v_1                    -1.753017
 v_2                    23.860591
 v_3                    -0.418006
 v_4                    -0.197295
 v_5                    22.934081
 v_6                    -1.742567
 v_7                    25.845489
 v_8                    -0.636225
 v_9                    -0.321491
 v_10                   -0.577935
 v_11                   12.568731
 v_12                    0.268937
 v_13                   -0.438274
 v_14                    2.393526
 dtype: float64)

sns.distplot(Train_data.skew(),color='blue',axlabel='Skewness')

<matplotlib.axes._subplots.AxesSubplot at 0x263c3402f28>

[Figure: output_46_1.png]

sns.distplot(Train_data.kurt(),color='orange',axlabel='Kurtosis')

<matplotlib.axes._subplots.AxesSubplot at 0x263db8ed240>

[Figure: output_47_1.png]

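## 3) Check the frequency of the target values
plt.hist(Train_data['price'],orientation='vertical',histtype='bar',color='red');
# Tip: in a notebook, a trailing ";" after a plt call suppresses the text output so that only the figure is shown; calling plt.show() afterwards works as well.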

[Figure: output_48_0.png]

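The frequencies show that values above 20000 are very rare. These could also be treated as special (anomalous) values and either replaced or removed.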
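Two simple ways to act on this, sketched on toy data: drop the rows above the cutoff, or cap their price at the cutoff. The cutoff of 20000 comes from the text; whether to drop or cap (or do neither) is a modelling decision the original does not settle.

```python
import pandas as pd

# Toy frame (prices picked from the listing above).
toy = pd.DataFrame({"price": [1850, 3600, 6222, 24950, 12000, 650]})
cap = 20000  # cutoff mentioned in the text

# Option 1: remove the rows with a price above the cutoff.
dropped = toy[toy["price"] <= cap]

# Option 2: keep every row but cap the price at the cutoff.
clipped = toy.assign(price=toy["price"].clip(upper=cap))

print(len(dropped), clipped["price"].max())
```

Either step would apply to the training set only, since the test set has no price column.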

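# After a log transform the distribution is more even, so the log-transformed price can be used as the prediction target; this is a common trick in prediction problems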
plt.hist(np.log(Train_data['price']),orientation='vertical',histtype='bar',color='red');

[Figure: output_50_0.png]
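If the model is trained on the log of the price, its predictions have to be mapped back to the original scale. A minimal sketch of that round trip follows. It uses `np.log1p` / `np.expm1` instead of the plain `np.log` above, which is my substitution: it stays finite if a price of 0 ever appears. The "model" is left out, so the prediction is simply the transformed target.

```python
import numpy as np
import pandas as pd

price = pd.Series([1850, 3600, 6222, 2400, 5200], name="price")

# Forward transform: this is what a regressor would be fitted on.
y_log = np.log1p(price)

# Stand-in for model output on the log scale (no model is trained here).
pred_log = y_log

# Inverse transform back to the price scale.
pred_price = np.expm1(pred_log)
print(pred_price.round(2).tolist())
```

Forgetting the inverse step is a common source of wildly wrong submissions, since log-scale predictions look like prices around 7 to 10.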

2.3.6 Split the features into categorical and numeric, and check the unique-value distribution of the categorical features

# Separate the label, i.e. the target value
Y_train=Train_data['price']

Train_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')

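'''
This way of splitting works for data that has not already been label-encoded.
It does not apply here; the split has to be made by hand, based on what each column means.
+ Numeric features
numeric_features=Train_data.select_dtypes(include=[np.number])
numeric_features.columns
+ Categorical features
categorical_features=Train_data.select_dtypes(include=['object'])
categorical_features.columns

'''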

numeric_features=['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',\
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']

categorical_features=[ 'name', 'model', 'brand', 'bodyType', 'fuelType',\
       'gearbox',  'notRepairedDamage', 'regionCode']
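Since the columns are already label-encoded, dtype alone cannot separate them. One heuristic that can support the manual split is to look at the number of distinct values per column. This is only a sketch: `cardinality` is a hypothetical helper, the cutoff of 4 fits the toy data only, and high-cardinality codes such as name or regionCode would still have to be classified by their meaning.

```python
import pandas as pd

def cardinality(df: pd.DataFrame) -> pd.DataFrame:
    """Dtype and number of distinct values per column, lowest cardinality first."""
    table = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })
    return table.sort_values("n_unique")

# Toy frame (values are made up).
toy = pd.DataFrame({
    "gearbox": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "bodyType": [1.0, 2.0, 1.0, 0.0, 1.0, 4.0],
    "v_0": [43.36, 45.31, 45.98, 45.69, 44.38, 46.32],
})
card = cardinality(toy)
candidates = card[card["n_unique"] <= 4].index.tolist()
print(card)
print("likely categorical:", candidates)
```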

# Unique-value distribution of each feature; training set
for cat_fea in categorical_features:
    print("Distribution of feature "+cat_fea+":")
    print("Feature {} has {} distinct values".format(cat_fea,Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

Distribution of feature name:
Feature name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
53        221
713       217
290       197
1186      184
911       182
2044      176
1513      160
1180      158
631       157
893       153
2765      147
473       141
1139      137
1108      132
444       129
306       127
2866      123
2402      116
533       114
1479      113
422       113
4635      110
725       110
964       109
1373      104
         ... 
89083       1
95230       1
164864      1
173060      1
179207      1
181256      1
185354      1
25564       1
19417       1
189324      1
162719      1
191373      1
193422      1
136082      1
140180      1
144278      1
146327      1
148376      1
158621      1
1404        1
15319       1
46022       1
64463       1
976         1
3025        1
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
Distribution of feature model:
Feature model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
48.0      5052
40.0      4502
26.0      4496
8.0       4391
31.0      3827
13.0      3762
17.0      3121
65.0      2730
49.0      2608
46.0      2454
30.0      2342
44.0      2195
5.0       2063
10.0      2004
21.0      1872
73.0      1789
11.0      1775
23.0      1696
22.0      1524
69.0      1522
63.0      1469
7.0       1460
16.0      1349
88.0      1309
66.0      1250
         ...  
141.0       37
133.0       35
216.0       30
202.0       28
151.0       26
226.0       26
231.0       23
234.0       23
233.0       20
198.0       18
224.0       18
227.0       17
237.0       17
220.0       16
230.0       16
239.0       14
223.0       13
236.0       11
241.0       10
232.0       10
229.0       10
235.0        7
246.0        7
243.0        4
244.0        3
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of feature brand:
Feature brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of feature bodyType:
Feature bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of feature fuelType:
Feature fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of feature gearbox:
Feature gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
Feature notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
Feature regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
428     132
24      130
1184    130
122     129
828     126
70      125
827     120
207     118
1222    117
2418    117
85      116
2615    115
2222    113
759     112
188     111
1757    110
1157    109
2401    107
1069    107
3545    107
424     107
272     107
451     106
450     105
129     105
       ... 
6324      1
7372      1
7500      1
8107      1
2453      1
7942      1
5135      1
6760      1
8070      1
7220      1
8041      1
8012      1
5965      1
823       1
7401      1
8106      1
5224      1
8117      1
7507      1
7989      1
6505      1
6377      1
8042      1
7763      1
7786      1
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64

# Unique-value distribution of each feature; test set
for cat_fea in categorical_features:
    print("Distribution of feature "+cat_fea+":")
    print("Feature {} has {} distinct values".format(cat_fea,Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())

Distribution of feature name:
Feature name has 37536 distinct values
387       94
55        93
1541      86
708       85
203       78
713       75
911       72
1180      71
53        68
290       68
631       67
1186      60
473       54
306       53
2866      52
2044      50
422       49
893       47
1513      46
2765      45
533       44
964       44
1139      41
1479      41
2825      38
444       37
4635      37
984       37
282       35
691       33
          ..
9747       1
7857       1
75120      1
144754     1
15731      1
66932      1
76360      1
66082      1
89231      1
93561      1
161146     1
21886      1
42368      1
101765     1
89653      1
38278      1
89645      1
60809      1
62858      1
195979     1
185951     1
81299      1
168479     1
28057      1
30106      1
97691      1
155039     1
44449      1
112034     1
105129     1
Name: name, Length: 37536, dtype: int64
Distribution of feature model:
Feature model has 245 distinct values
0.0      3772
19.0     3226
4.0      2790
1.0      1981
29.0     1778
48.0     1711
40.0     1524
26.0     1512
8.0      1464
31.0     1281
13.0     1214
17.0     1033
65.0      918
49.0      880
46.0      871
30.0      793
44.0      731
5.0       677
21.0      628
10.0      625
23.0      583
11.0      562
73.0      561
69.0      531
63.0      515
16.0      506
22.0      482
7.0       442
88.0      416
66.0      395
         ... 
157.0      12
151.0      12
141.0      12
193.0      12
89.0       12
68.0       11
233.0      11
226.0      11
133.0      11
227.0       8
198.0       8
18.0        8
224.0       7
237.0       7
239.0       6
231.0       6
235.0       6
220.0       6
246.0       4
234.0       4
230.0       4
223.0       3
236.0       3
232.0       3
245.0       3
229.0       2
209.0       2
242.0       1
241.0       1
244.0       1
Name: model, Length: 245, dtype: int64
brand的特征分布如下：
brand特征有40个不同的值
0     10473
4      5532
14     5345
10     4713
1      4627
6      3500
9      2360
5      1485
13     1386
11      942
3       820
16      770
25      728
7       727
8       708
27      623
21      543
15      476
19      473
20      411
12      399
22      358
26      328
30      321
17      312
24      248
28      216
32      183
29      139
37      117
2       115
31      113
18      107
33       84
35       75
34       75
36       72
23       60
38       31
39        5
Name: brand, dtype: int64
bodyType的特征分布如下：
bodyType特征有8个不同的值
0.0    13765
1.0    11960
2.0     9886
3.0     4491
4.0     3258
5.0     2494
6.0     2212
7.0      430
Name: bodyType, dtype: int64
fuelType的特征分布如下：
fuelType特征有7个不同的值
0.0    30489
1.0    15708
2.0      736
3.0       78
4.0       31
5.0       18
6.0       16
Name: fuelType, dtype: int64
gearbox的特征分布如下：
gearbox特征有2个不同的值
0.0    37131
1.0    10901
Name: gearbox, dtype: int64
notRepairedDamage的特征分布如下：
notRepairedDamage特征有2个不同的值
0.0    37224
1.0     4707
Name: notRepairedDamage, dtype: int64
regionCode的特征分布如下：
regionCode特征有6998个不同的值
419     120
764      98
176      48
3304     45
85       45
2222     45
3545     44
462      42
1000     42
2154     42
24       41
2775     41
70       41
309      40
1688     40
188      40
792      40
955      39
172      39
3573     39
122      39
759      38
60       38
2418     38
256      38
1483     38
2690     37
125      37
827      37
450      37
       ... 
1521      1
7602      1
5523      1
7538      1
5459      1
7410      1
6630      1
6374      1
6342      1
1010      1
6897      1
5104      1
7089      1
4069      1
6993      1
2052      1
4944      1
2867      1
4912      1
2771      1
6310      1
6865      1
6833      1
4656      1
6609      1
2451      1
4231      1
6513      1
6481      1
6061      1
Name: regionCode, Length: 6998, dtype: int64
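
The long `value_counts()` listings above can be condensed into one table per data set. A minimal sketch, assuming two small made-up frames stand in for `Train_data` and `Test_data`; the helper name `cardinality_table` is hypothetical and not part of the original notebook:

```python
import pandas as pd

def cardinality_table(train, test, features):
    """Return one row per feature with its number of distinct values in each set."""
    unseen = pd.Series(
        # Values that occur in the test set but never in the training set
        {f: len(set(test[f].dropna()) - set(train[f].dropna())) for f in features}
    )
    return pd.DataFrame({
        "train_nunique": train[features].nunique(),
        "test_nunique": test[features].nunique(),
        "unseen_in_train": unseen,
    })

# Toy stand-ins for Train_data / Test_data (values are made up for illustration)
train = pd.DataFrame({"brand": [0, 1, 1, 2], "gearbox": [0.0, 1.0, 0.0, None]})
test = pd.DataFrame({"brand": [1, 3, 3], "gearbox": [0.0, 0.0, 1.0]})

summary = cardinality_table(train, test, ["brand", "gearbox"])
print(summary)
```

A non-zero `unseen_in_train` count flags categories a model would meet for the first time at prediction time, which is easy to miss when scrolling through the full listings.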

2.3.7 Numeric feature analysis

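# Add the prediction target (price). Run this cell only once: every run appends 'price' again,
# which is most likely why it appears twice in the list below.
numeric_features.append('price')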

numeric_features

['power',
 'kilometer',
 'v_0',
 'v_1',
 'v_2',
 'v_3',
 'v_4',
 'v_5',
 'v_6',
 'v_7',
 'v_8',
 'v_9',
 'v_10',
 'v_11',
 'v_12',
 'v_13',
 'v_14',
 'price',
 'price']
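
The list above contains 'price' twice, which is what a plain `append` produces when the cell is executed more than once. A small sketch of a re-run-safe variant; the helper `append_once` and the shortened feature list are illustrative assumptions, not part of the original notebook:

```python
def append_once(items, value):
    """Append value to items only if it is not already present; return the list."""
    if value not in items:
        items.append(value)
    return items

# Hypothetical shortened feature list, for illustration only
features = ["power", "kilometer", "v_0"]
append_once(features, "price")
append_once(features, "price")  # simulates re-running the notebook cell
print(features)
```

Keeping the list free of duplicates matters later on, because selecting the same column label twice changes the shape of everything derived from it.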

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same head-plus-tail view
pd.concat([Train_data.head(),Train_data.tail()])

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	1	2.0	0.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	15	1.0	0.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	10	0.0	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	5	1.0	0.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482
149995	149995	163978	20000607	121.0	10	4.0	0.0	1.0	163	15.0	...	0.280264	0.000310	0.048441	0.071158	0.019174	1.988114	-2.983973	0.589167	-1.304370	-0.302592
149996	149996	184535	20091102	116.0	11	0.0	0.0	0.0	125	10.0	...	0.253217	0.000777	0.084079	0.099681	0.079371	1.839166	-2.774615	2.553994	0.924196	-0.272160
149997	149997	147587	20101003	60.0	11	1.0	1.0	0.0	90	6.0	...	0.233353	0.000705	0.118872	0.100118	0.097914	2.439812	-1.630677	2.290197	1.891922	0.414931
149998	149998	45907	20060312	34.0	10	3.0	1.0	0.0	156	15.0	...	0.256369	0.000252	0.081479	0.083558	0.081498	2.075380	-2.633719	1.414937	0.431981	-1.659014
149999	149999	177672	19990204	19.0	28	6.0	0.0	1.0	193	12.5	...	0.284475	0.000000	0.040072	0.062543	0.025819	1.978453	-3.179913	0.031724	-1.483350	-0.342674

10 rows × 29 columns

## 1) Correlation analysis
price_numeric=Train_data[numeric_features]
# 'price' occurs twice in numeric_features, so drop the repeated column; otherwise
# correlation['price'] is a two-column DataFrame rather than a Series
price_numeric=price_numeric.loc[:,~price_numeric.columns.duplicated()]
correlation=price_numeric.corr()
print(correlation['price'])

power        0.219834
kilometer   -0.440519
v_0          0.628397
v_1          0.060914
v_2          0.085322
v_3         -0.730946
v_4         -0.147085
v_5          0.164317
v_6          0.068970
v_7         -0.053024
v_8          0.685798
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
v_12         0.692823
v_13        -0.013993
v_14         0.035911
price        1.000000
Name: price, dtype: float64

# With a single 'price' column the selection is a Series, so sort_values() needs no 'by' argument.
# On the two-column DataFrame the same call raised
# "TypeError: sort_values() missing 1 required positional argument: 'by'",
# and passing by=correlation['price'] raised a KeyError, because 'by' expects column labels, not data.
print(correlation['price'].sort_values(ascending=False),'\n')

price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64
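
A duplicated column label changes what `correlation['price']` returns, and that in turn decides whether `sort_values()` can be called without `by`. A minimal sketch on synthetic data; the frame `df` and its random values are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"power": rng.normal(size=50), "price": rng.normal(size=50)})

# Selecting 'price' twice creates two columns with the same label
dup = df[["power", "price", "price"]]
corr_dup = dup.corr()
# Indexing a duplicated label yields a DataFrame, so sort_values() would need 'by'
print(type(corr_dup["price"]).__name__)

# Keep only the first occurrence of each column label
clean = dup.loc[:, ~dup.columns.duplicated()]
corr = clean.corr()
# Now the selection is a Series and can be sorted directly
price_corr = corr["price"].sort_values(ascending=False)
print(price_corr)
```

`Index.duplicated()` marks every repeat of a label after its first occurrence, so negating the mask keeps exactly one column per name.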

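# For reference: DataFrame.corr() returns the full correlation matrix, i.e. the coefficient for every pair of columns
print(price_numeric.corr())
print("Correlation of price with the other numeric features:")
print(price_numeric.corr()['price']) # selecting one column shows only the correlations with price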

f,ax=plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square=True,vmax=0.8)

<matplotlib.axes._subplots.AxesSubplot at 0x263c33cfa90>

[Figure output_67_1.png: correlation heatmap of the numeric features]

del price_numeric['price']

## 2) Skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col),
         'Skewness:{:05.2f}'.format(Train_data[col].skew()),
         ' ',
         'kurtosis:{:06.2f}'.format(Train_data[col].kurt()))

power           Skewness:65.86   kurtosis:5733.45
kilometer       Skewness:-1.53   kurtosis:001.14
v_0             Skewness:-1.32   kurtosis:003.99
v_1             Skewness:00.36   kurtosis:-01.75
v_2             Skewness:04.84   kurtosis:023.86
v_3             Skewness:00.11   kurtosis:-00.42
v_4             Skewness:00.37   kurtosis:-00.20
v_5             Skewness:-4.74   kurtosis:022.93
v_6             Skewness:00.37   kurtosis:-01.74
v_7             Skewness:05.13   kurtosis:025.85
v_8             Skewness:00.20   kurtosis:-00.64
v_9             Skewness:00.42   kurtosis:-00.32
v_10            Skewness:00.03   kurtosis:-00.58
v_11            Skewness:03.03   kurtosis:012.57
v_12            Skewness:00.37   kurtosis:000.27
v_13            Skewness:00.27   kurtosis:-00.44
v_14            Skewness:-1.19   kurtosis:002.39
price           Skewness:03.35   kurtosis:019.00
price           Skewness:03.35   kurtosis:019.00
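
The table above reports a skewness of 3.35 for price, i.e. a long right tail. One common response to such a target is a log transform. This is not a step taken in the notebook itself, only a sketch on synthetic log-normal data (the series `price` below is made up and merely mimics a right-skewed distribution):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "price" values (log-normal), made up for illustration
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000), name="price")

log_price = np.log1p(price)            # log(1 + x), also defined at x = 0
skew_before = price.skew()
skew_after = log_price.skew()
print("before: {:.2f}  after: {:.2f}".format(skew_before, skew_after))

# np.expm1 inverts the transform when predictions must be mapped back to prices
restored = np.expm1(log_price)
```

Whether the transform helps on the real data has to be checked with the same `skew()` and `kurt()` calls used above.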

## 3) Visualize the distribution of each numeric feature
f=pd.melt(Train_data,value_vars=numeric_features)
g=sns.FacetGrid(f,col="variable",col_wrap=2,sharex=False,sharey=False)
g=g.map(sns.distplot,"value")

[Figure output_70_0.png: distribution plot of each numeric feature]

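The plots suggest that the anonymous features (v_0 to v_14) are relatively evenly distributed, although v_2, v_5, v_7 and v_11 show noticeable skew in the skewness output above.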

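## 4) Visualize the pairwise relationships between numeric features
sns.set()
# Why these features? They happen to be exactly the features whose correlation with price is positive above.
columns=['price','power','v_0','v_1','v_2','v_5','v_6','v_8','v_12','v_14']
# height replaces the size argument in seaborn 0.9 and later
sns.pairplot(Train_data[columns],height=2,kind='scatter',diag_kind='kde');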

[Figure output_72_0.png: pair plot of price and the selected numeric features]

Train_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')

Y_train

0          1850
1          3600
2          6222
3          2400
4          5200
5          8000
6          3500
7          1000
8          2850
9           650
10         3100
11         5450
12         1600
13         3100
14         6900
15         3200
16        10500
17         3700
18          790
19         1450
20          990
21         2800
22          350
23          599
24         9250
25         3650
26         2800
27         2399
28         4900
29         2999
          ...  
149970      900
149971     3400
149972      999
149973     3500
149974     4500
149975     3990
149976     1200
149977      330
149978     3350
149979     5000
149980     4350
149981     9000
149982     2000
149983    12000
149984     6700
149985     4200
149986     2800
149987     3000
149988     7500
149989     1150
149990      450
149991    24950
149992      950
149993     4399
149994    14780
149995     5900
149996     9500
149997     7500
149998     4999
149999     4700
Name: price, Length: 150000, dtype: int64

## 5) Regression plots of price against individual features
features=['v_12','v_8','v_0','power','v_5','v_2','v_6','v_1','v_14','v_13']
fig,axes=plt.subplots(nrows=5,ncols=2,figsize=(24,20))
for feature,ax in zip(features,axes.flatten()):
    # Put price and the feature side by side, then draw the scatter plot with a fitted regression line
    scatter_data=pd.concat([Y_train,Train_data[feature]],axis=1)
    sns.regplot(x=feature,y='price',data=scatter_data,scatter=True,fit_reg=True,ax=ax)

2.3.8 Categorical feature analysis

## 1) Number of unique values per feature
for fea in categorical_features:
    print(Train_data[fea].nunique())

categorical_features

['name',
 'model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage',
 'regionCode']

## 2) Box plots of the categorical features

# name and regionCode have far too many sparse categories, so only the denser features are plotted here
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# height replaces the size argument in seaborn 0.9 and later
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")

[Figure output_79_0.png: box plots of price for each categorical feature]
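
The conversion loop in this cell registers a 'MISSING' category before filling, because a categorical column only accepts fill values that are already among its categories. A minimal sketch on a toy column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (values are made up for illustration)
s = pd.Series([0.0, 1.0, np.nan, 1.0], name="gearbox")

s = s.astype("category")
# Register the placeholder first; fillna() with an unknown category would fail
s = s.cat.add_categories(["MISSING"])
s = s.fillna("MISSING")

print(s.value_counts())
```

Treating missing values as their own category lets the plots show how prices behave when the field was left empty, instead of silently dropping those rows.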

Train_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')

## 3) Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

[Figures output_81_0.png to output_81_5.png: violin plots of price for each of the six categorical features]

categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']

## 4) Bar plots of the categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# height replaces the size argument in seaborn 0.9 and later
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")

[Figure output_83_0.png: bar plots of price for each categorical feature]
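
By default `sns.barplot` draws the mean of `y` within each category, so the same numbers can be read from a `groupby`. A minimal sketch, assuming a toy frame in place of `Train_data` (the values are made up for illustration):

```python
import pandas as pd

# Toy data standing in for Train_data (values are made up for illustration)
df = pd.DataFrame({
    "gearbox": [0, 0, 1, 1, 1],
    "price": [1000, 3000, 5000, 7000, 9000],
})

# Mean price and sample count per category
stats = df.groupby("gearbox")["price"].agg(["mean", "count"])
print(stats)
```

The count column is worth keeping next to the mean: a tall bar backed by only a handful of rows says much less than one backed by thousands.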

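## 5) Frequency of each category within the categorical features (count plot)
def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable

f = pd.melt(Train_data,  value_vars=categorical_features)
# height replaces the size argument in seaborn 0.9 and later
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")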

[Figure output_84_0.png: count plots for each categorical feature]

2.3.9 Generating a data report with pandas_profiling

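pandas_profiling generates a fairly comprehensive report with visualizations and summary statistics; once it has been written, open the HTML file to view it.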

# The import raises "ModuleNotFoundError: No module named 'pandas_profiling'" unless the package
# is installed first (pip install pandas-profiling). Newer releases are published as ydata-profiling.
import pandas_profiling

pfr=pandas_profiling.ProfileReport(Train_data)
pfr.to_file(os.path.join(output_path,'example.html'))
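
When the profiling package is not available, a small per-column overview can be produced with pandas alone. This is only a rough stand-in and not a replacement for the full report; the helper name `basic_summary` and the toy frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def basic_summary(df):
    """Per-column dtype, missing count and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "nunique": df.nunique(),
    })

# Toy frame (values are made up for illustration)
df = pd.DataFrame({"power": [60, 0, 163], "bodyType": [1.0, np.nan, 1.0]})
report = basic_summary(df)
print(report)
# report.to_html(path) can save the table as an HTML file
```

The three columns cover the checks done by hand earlier in this section: data types, missing values and the number of distinct values.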