Data Mining Competition_Used Car Transaction Price_Task01&02

Latest recommended article published on 2023-03-29 22:50:49

hello_JeremyWang

Tags: python, machine learning

Article link: https://blog.csdn.net/hello_jeremywang/article/details/120612171

Preface

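I have recently been taking part in the 'Coggle数据科学 30 Days of ML' study event and am recording my notes here. The event is completely free and feels like a good opportunity to improve. It can be found on the Coggle WeChat public account.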

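Introduction to EDA

This post mainly covers reading the data and an initial EDA. Exploring the data helps reveal its characteristics and the relationships between variables, which is very helpful for later feature construction.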

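An initial analysis of the data (viewing the data directly, or using summary functions such as .sum(), .mean(), and .describe()) can cover: the number of samples, the size of the training set, whether there are time features, whether it is a time-series problem, the meaning of each feature (for non-anonymized features), the feature types (string, int, float, time), the missing values (note how missing values appear in the data: some are empty, others are markers such as "NAN"), and the mean and variance of each feature.
Analyze and record how to handle features that are missing in 30% or more of the samples; this helps later model validation and tuning. Decide whether the feature should be filled (and how: mean, zero, mode, etc.), dropped, or whether the samples should first be split and predicted with different feature models.
Analyze outliers separately: check whether samples with abnormal feature values also have abnormal labels (values far from the mean, or special symbols), whether the outliers should be removed or replaced with normal values, and whether they come from recording errors or from the data source itself.
Analyze the label separately, for example its distribution.
Further analysis can plot the features, and the features jointly with the label (statistical plots, scatter plots), to get an intuitive view of the distributions. This step can also reveal outliers in the data. Box plots show how far feature values deviate, and joint plots of feature against feature and feature against label reveal relationships between them.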
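The checks above can be sketched on a toy frame. This is only an illustration: the values are made up, and the 30% threshold constant `MISSING_THRESHOLD` and the `summary` table layout are assumptions, not part of the competition material.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
df = pd.DataFrame({
    "power": [60, 0, 163, 193, 68],
    "bodyType": [1.0, np.nan, np.nan, 0.0, 1.0],
    "fuelType": [0.0, 0.0, np.nan, 0.0, 0.0],
})

# Per-column dtype, number of missing values, and share of missing values.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isnull().sum(),
    "missing_ratio": df.isnull().mean(),
})

# Columns above the (assumed) 30% threshold need an explicit decision:
# fill, drop, or model separately.
MISSING_THRESHOLD = 0.30
needs_decision = summary.index[summary["missing_ratio"] > MISSING_THRESHOLD].tolist()

print(summary)
print(needs_decision)
```

Here only `bodyType` (2 of 5 values missing) crosses the threshold, so it is the only column flagged for a fill-or-drop decision.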
OK, on to the main topic.

Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Task01 Reading the data

Method 1

Train_Data = pd.read_csv("D:/竞赛/天池/二手车交易价格预测/used_car_train_20200313/used_car_train_20200313.csv")
Test_Data = pd.read_csv("D:/竞赛/天池/二手车交易价格预测/used_car_testB_20200421/used_car_testB_20200421.csv")

Train_Data.head()

	SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0	0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 ...
1	1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 43...
2	2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5...
3	3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0...
4	4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0...

Test_Data.head()

	SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0	200000 133777 20000501 67.0 0 1.0 0.0 0.0 101 ...
1	200001 61206 19950211 19.0 6 2.0 0.0 0.0 73 6....
2	200002 67829 20090606 5.0 5 4.0 0.0 0.0 120 5....
3	200003 8892 20020601 22.0 9 1.0 0.0 0.0 58 15....
4	200004 76998 20030301 46.0 6 0.0 0.0 116 15.0...

Train_Data.shape

(150000, 1)

Test_Data.shape

(50000, 1)

# Clean up the data format so that each column of data lines up with its column name
columns_name = Train_Data.columns.str.split()[0]
Train_Data = Train_Data.iloc[:,0].str.split(' ',expand=True)
Train_Data.columns = columns_name

columns_name = Test_Data.columns.str.split()[0]
Test_Data = Test_Data.iloc[:,0].str.split(' ',expand=True)
Test_Data.columns = columns_name

Train_Data.head()

	SaleID	name	regDate	model	brand	bodyType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	60	12.5	...	0.23567590669911015	0.10198824077953883	0.129548661418789	0.02281636740006269	0.09746182870576199	-2.8818032385553165	2.8040967707208506	-2.4208207926122784	0.7952919433118377	0.9147624995703408
1	1	2262	20030301	40.0	1	2.0	0.0	0	15.0	...	0.2647772555037097	0.12100359404116512	0.1357307068829055	0.026597448118262774	0.020581662632484482	-4.9004818817666775	2.0963376444273414	-1.0304828371563102	-1.7226737753851349	0.2455224109670493
2	2	14874	20040403	115.0	15	1.0	0.0	163	12.5	...	0.25141014780875875	0.11491227654046415	0.16514749334496415	0.06217283730726245	0.02707482416830506	-4.846749260269903	1.803558941229932	1.5653296250457633	-0.8326873267265079	-0.22996285613259074
3	3	71865	19960908	109.0	10	0.0	1.0	193	15.0	...	0.2742931709082824	0.11030008468643802	0.12196374573186793	0.033394547122199615	0.0	-4.5095988235247955	1.2859397444845837	-0.5018679084368517	-2.4383527366881763	-0.4786993792688288
4	4	111080	20120103	110.0	5	1.0	0.0	68	5.0	...	0.2280356217997828	0.0732050535564685	0.09188047928262777	0.07881938473498606	0.12153424142524565	-1.8962402786050725	0.9107831337379366	0.9311095588151709	2.8345178203938377	1.9234819632780635

5 rows × 31 columns

Test_Data.head()

	SaleID	name	regDate	model	brand	bodyType	fuelType	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	200000	133777	20000501	67.0	0	1.0	0.0	101	15.0	...	0.23651953807463394	0.00024077965516603838	0.10531899027928028	0.04623333858358547	0.09452231271103513	3.619512125855613	-0.2806066537539741	-2.019761143295854	0.9788277260800712	0.8033215020878566
1	200001	61206	19950211	19.0	6	2.0	0.0	73	6.0	...	0.26151841976421497	0.0	0.12032345361861815	0.0467842378305852	0.035385262671275085	2.9973763596922285	-1.4067050523440334	-1.0208835817916766	-1.3499898633435856	-0.20054163936348302
2	200002	67829	20090606	5.0	5	4.0	0.0	120	5.0	...	0.26169061811955624	0.09083648656092408	0.0	0.07965532570737711	0.0735862207476284	-3.951083771010004	-0.4334673285213749	0.9189638428560336	1.6346039890078308	1.027172758680927
3	200003	8892	20020601	22.0	9	1.0	0.0	58	15.0	...	0.2360495075063573	0.10177689524069478	0.09894989511106032	0.026829627502826414	0.09661365556957097	-2.8467877718832733	2.8002670817288	-2.5246103235495783	1.0768192298469703	0.4616102367935517
4	200004	76998	20030301	46.0	6	0.0		116	15.0	...	0.2569995358233025	0.0	0.06673175734217886	0.05777117201467578	0.06885246978026283	2.839010006118193	-1.6598006754576482	-0.9241417494176124	0.19942261240833112	0.4510139980592859

5 rows × 30 columns

Method 2

Train_Data = pd.read_csv("D:/竞赛/天池/二手车交易价格预测/used_car_train_20200313/used_car_train_20200313.csv", sep=' ')
Test_Data = pd.read_csv("D:/竞赛/天池/二手车交易价格预测/used_car_testB_20200421/used_car_testB_20200421.csv",sep=' ')

Train_Data.head()

	SaleID	name	regDate	model	brand	bodyType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	1	2.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	15	1.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	10	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	5	1.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482

5 rows × 31 columns

Test_Data.head()

	SaleID	name	regDate	model	brand	bodyType	fuelType	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	200000	133777	20000501	67.0	0	1.0	0.0	101	15.0	...	0.236520	0.000241	0.105319	0.046233	0.094522	3.619512	-0.280607	-2.019761	0.978828	0.803322
1	200001	61206	19950211	19.0	6	2.0	0.0	73	6.0	...	0.261518	0.000000	0.120323	0.046784	0.035385	2.997376	-1.406705	-1.020884	-1.349990	-0.200542
2	200002	67829	20090606	5.0	5	4.0	0.0	120	5.0	...	0.261691	0.090836	0.000000	0.079655	0.073586	-3.951084	-0.433467	0.918964	1.634604	1.027173
3	200003	8892	20020601	22.0	9	1.0	0.0	58	15.0	...	0.236050	0.101777	0.098950	0.026830	0.096614	-2.846788	2.800267	-2.524610	1.076819	0.461610
4	200004	76998	20030301	46.0	6	0.0	NaN	116	15.0	...	0.257000	0.000000	0.066732	0.057771	0.068852	2.839010	-1.659801	-0.924142	0.199423	0.451014

5 rows × 30 columns
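The difference between the two methods shows up in the column types as well as in the display precision. The sketch below reproduces both on a tiny in-memory sample (the three-column `sample` string is made up for illustration): splitting by hand leaves every value as a string, while passing `sep=' '` lets pandas parse the numbers.

```python
import io
import pandas as pd

# A tiny space-delimited sample standing in for the competition file.
sample = "SaleID name power\n0 736 60\n1 2262 0\n"

# Method 1: the default sep=',' reads each line as one column, which is split by hand.
raw = pd.read_csv(io.StringIO(sample))
columns_name = raw.columns.str.split()[0]
m1 = raw.iloc[:, 0].str.split(' ', expand=True)
m1.columns = columns_name

# Method 2: tell read_csv that the delimiter is a space.
m2 = pd.read_csv(io.StringIO(sample), sep=' ')

print(raw.shape)   # a single column before splitting
print(m1.dtypes)   # string-typed columns
print(m2.dtypes)   # parsed numeric columns
```

With Method 1, the columns would still need an explicit numeric conversion before calling functions such as describe(), which is one reason to prefer Method 2.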

Task02 Data analysis

Analyzing the values, range, and type of each field

Train_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Test_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48496 non-null  float64
 6   fuelType           47076 non-null  float64
 7   gearbox            48032 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB

Train_Data.describe().iloc[:,:15]

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	regionCode	seller	offerType	creatDate	price
count	150000.000000	150000.000000	1.500000e+05	149999.000000	150000.000000	145494.000000	141320.000000	144019.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.0	1.500000e+05	150000.000000
mean	74999.500000	68349.172873	2.003417e+07	47.129021	8.052733	1.792369	0.375842	0.224943	119.316547	12.597160	2583.077267	0.000007	0.0	2.016033e+07	5923.327333
std	43301.414527	61103.875095	5.364988e+04	49.536040	7.864956	1.760640	0.548677	0.417546	177.168419	3.919576	1885.363218	0.002582	0.0	1.067328e+02	7501.998477
min	0.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	0.000000	0.000000	0.0	2.015062e+07	11.000000
25%	37499.750000	11156.000000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	1018.000000	0.000000	0.0	2.016031e+07	1300.000000
50%	74999.500000	51638.000000	2.003091e+07	30.000000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	2196.000000	0.000000	0.0	2.016032e+07	3250.000000
75%	112499.250000	118841.250000	2.007111e+07	66.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	3843.000000	0.000000	0.0	2.016033e+07	7700.000000
max	149999.000000	196812.000000	2.015121e+07	247.000000	39.000000	7.000000	6.000000	1.000000	19312.000000	15.000000	8120.000000	1.000000	0.0	2.016041e+07	99999.000000

Train_Data.describe().iloc[:,15:]

	v_0	v_1	v_2	v_3	v_4	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000
mean	44.406268	-0.044809	0.080765	0.078833	0.017875	0.248204	0.044923	0.124692	0.058144	0.061996	-0.001000	0.009035	0.004813	0.000313	-0.000688
std	2.457548	3.641893	2.929618	2.026514	1.193661	0.045804	0.051743	0.201410	0.029186	0.035692	3.772386	3.286071	2.517478	1.288988	1.038685
min	30.451976	-4.295589	-4.470671	-7.275037	-4.364565	0.000000	0.000000	0.000000	0.000000	0.000000	-9.168192	-5.558207	-9.639552	-4.153899	-6.546556
25%	43.135799	-3.192349	-0.970671	-1.462580	-0.921191	0.243615	0.000038	0.062474	0.035334	0.033930	-3.722303	-1.951543	-1.871846	-1.057789	-0.437034
50%	44.610266	-3.052671	-0.382947	0.099722	-0.075910	0.257798	0.000812	0.095866	0.057014	0.058484	1.624076	-0.358053	-0.130753	-0.036245	0.141246
75%	46.004721	4.000670	0.241335	1.565838	0.868758	0.265297	0.102009	0.125243	0.079382	0.087491	2.844357	1.255022	1.776933	0.942813	0.680378
max	52.304178	7.320308	19.035496	9.854702	6.829352	0.291838	0.151420	1.404936	0.160791	0.222787	12.357011	18.819042	13.847792	11.147669	8.658418

Test_Data.describe().iloc[:,:15]

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	regionCode	seller	offerType	creatDate	v_0
count	50000.000000	50000.000000	5.000000e+04	50000.00000	50000.000000	48496.000000	47076.000000	48032.000000	50000.000000	50000.000000	50000.000000	50000.0	50000.0	5.000000e+04	50000.000000
mean	224999.500000	68505.606100	2.003401e+07	47.64948	8.087140	1.793736	0.376498	0.226953	119.766960	12.598260	2581.080680	0.0	0.0	2.016033e+07	44.400023
std	14433.901067	61032.124271	5.351615e+04	49.90741	7.899648	1.764970	0.549281	0.418866	206.313348	3.912519	1889.248559	0.0	0.0	1.113395e+02	2.459920
min	200000.000000	1.000000	1.991000e+07	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	0.000000	0.0	0.0	2.014031e+07	31.122325
25%	212499.750000	11315.000000	1.999100e+07	11.00000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	1006.000000	0.0	0.0	2.016031e+07	43.120935
50%	224999.500000	52215.000000	2.003091e+07	30.00000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	2204.500000	0.0	0.0	2.016032e+07	44.601493
75%	237499.250000	118710.750000	2.007110e+07	66.00000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	3842.000000	0.0	0.0	2.016033e+07	45.987018
max	249999.000000	196808.000000	2.015121e+07	246.00000	39.000000	7.000000	6.000000	1.000000	19211.000000	15.000000	8120.000000	0.0	0.0	2.016041e+07	51.676686

Test_Data.describe().iloc[:,15:]

	v_1	v_2	v_3	v_4	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
mean	-0.065525	0.079706	0.078381	0.022361	0.248147	0.044624	0.124693	0.058198	0.062113	0.019633	0.002759	0.004342	0.004570	-0.007209
std	3.636631	2.930829	2.019136	1.194215	0.045836	0.051664	0.201440	0.029171	0.035723	3.764095	3.289523	2.515912	1.287194	1.044718
min	-4.231855	-4.032142	-5.801254	-4.233626	0.000000	0.000000	0.000000	0.000000	0.000000	-9.119719	-5.662163	-8.291868	-4.157649	-6.098192
25%	-3.193169	-0.967832	-1.456793	-0.922153	0.243436	0.000035	0.062519	0.035413	0.033880	-3.675196	-1.963928	-1.865406	-1.048722	-0.440706
50%	-3.053506	-0.384910	0.118448	-0.068187	0.257818	0.000801	0.095880	0.056804	0.058749	1.632134	-0.375537	-0.138943	-0.036352	0.136849
75%	3.978703	0.239689	1.563490	0.871565	0.265263	0.101654	0.125470	0.079387	0.087624	2.846205	1.263451	1.775632	0.945239	0.685555
max	7.190759	18.865988	9.386558	4.959106	0.291176	0.153403	1.411559	0.157458	0.211304	12.177864	18.789496	13.384828	5.635374	2.649768

Apart from notRepairedDamage, which has dtype object, every column is numeric. Listing its distinct values shows why.

Train_Data['notRepairedDamage'].value_counts()

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

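# Assign the result back instead of calling replace(..., inplace=True) on a column selection
Train_Data['notRepairedDamage'] = Train_Data['notRepairedDamage'].replace('-', np.nan)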

Test_Data['notRepairedDamage'].value_counts()

0.0    37224
-       8069
1.0     4707
Name: notRepairedDamage, dtype: int64

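# Assign the result back instead of calling replace(..., inplace=True) on a column selection
Test_Data['notRepairedDamage'] = Test_Data['notRepairedDamage'].replace('-', np.nan)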
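Replacing '-' with NaN removes the placeholder, but the remaining values are still the strings '0.0' and '1.0'. A minimal sketch of one way to finish the job, assuming a numeric column is wanted afterwards, is to pass the result through `pd.to_numeric` (the toy series below is made up):

```python
import numpy as np
import pandas as pd

# Toy column mimicking the raw values seen in value_counts().
s = pd.Series(["0.0", "-", "1.0", "0.0"], name="notRepairedDamage")

# Replace the '-' placeholder with NaN, then convert the strings to floats.
cleaned = pd.to_numeric(s.replace("-", np.nan))

print(cleaned)
print(cleaned.isna().sum())
```

After this step the column counts as missing in isnull() and can take part in numeric summaries.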

Checking missing values

missing = Train_Data.isnull().sum()
missing = missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()

<AxesSubplot:>

missing = Test_Data.isnull().sum()
missing = missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()

<AxesSubplot:>

Dropping columns with extremely imbalanced distributions

del Train_Data["seller"]
del Train_Data["offerType"]
del Test_Data["seller"]
del Test_Data["offerType"]
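The decision to drop seller and offerType follows from the describe() output, where both are almost always 0. A rough way to find such columns automatically is to measure the share of the most frequent value. In this sketch the helper `dominant_share`, the toy data, and the 0.85 cut-off are all assumptions for illustration.

```python
import pandas as pd

# Toy data: two near-constant columns and one ordinary column.
toy = pd.DataFrame({
    "seller": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "offerType": [0] * 10,
    "brand": [0, 4, 14, 10, 1, 6, 0, 4, 9, 5],
})

def dominant_share(col: pd.Series) -> float:
    """Share of rows taken by the most frequent value (NaN counted as a value)."""
    return col.value_counts(normalize=True, dropna=False).iloc[0]

shares = toy.apply(dominant_share)

# Assumed cut-off: columns dominated by a single value carry little signal.
DOMINANT_THRESHOLD = 0.85
near_constant = shares[shares >= DOMINANT_THRESHOLD].index.tolist()

print(shares)
print(near_constant)
```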

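Observing the skewness and kurtosis of the data

# numeric_only=True skips non-numeric columns such as notRepairedDamage
sns.distplot(Train_Data.skew(numeric_only=True),color='blue',axlabel ='Skewness')

sns.distplot(Test_Data.skew(numeric_only=True),color='blue',axlabel ='Skewness')

sns.distplot(Train_Data.kurt(numeric_only=True),color='orange',axlabel ='Kurtosis')

sns.distplot(Test_Data.kurt(numeric_only=True),color='orange',axlabel ='Kurtosis')

Some features turn out to have large skewness and kurtosis.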

Checking the frequency distribution of the target

plt.hist(Train_Data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

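Looking at the frequencies, prices above 20000 are very rare. These could be treated as special values (outliers) and either filled or removed, which should be done earlier in the pipeline.

After a log transform the distribution is much more even, so the prediction can be made on the log-transformed price. This is a common trick in prediction problems.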

plt.hist(np.log(Train_Data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()
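The effect of the log transform can also be checked numerically rather than by eye. The sketch below uses synthetic, log-normally distributed prices (not the competition data) to show that the skewness drops sharply after `np.log`, and that `np.exp` maps predictions back to the original price scale.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices; the real 'price' column is not used here.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000), name="price")

# Train on log(price) instead of price.
log_price = np.log(price)

raw_skew = price.skew()
log_skew = log_price.skew()
print(raw_skew, log_skew)

# Predictions made on the log scale are mapped back with exp.
restored = np.exp(log_price)
```

np.log requires strictly positive values. That holds for this dataset, where the minimum price in describe() is 11. For targets that can be zero, np.log1p with np.expm1 is a common alternative.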

Features are divided into categorical features and numeric features

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
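Since the two lists are written by hand, a quick sanity check can confirm that they do not overlap and show which columns are left out. In the sketch below, `all_columns` is the training column list from info() after seller and offerType are dropped. The variable names `overlap` and `unassigned` are illustrative.

```python
# Training columns after dropping seller and offerType (taken from the info() output).
all_columns = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
               'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
               'creatDate', 'price'] + ['v_{}'.format(i) for i in range(15)]

numeric_features = ['power', 'kilometer'] + ['v_{}'.format(i) for i in range(15)]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType',
                        'gearbox', 'notRepairedDamage', 'regionCode']

# No column should appear in both lists.
overlap = set(numeric_features) & set(categorical_features)

# Columns that belong to neither list: the ID, the two dates, and the target.
unassigned = [c for c in all_columns
              if c not in numeric_features and c not in categorical_features]

print(overlap)
print(unassigned)
```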

Categorical features

Number of unique values per feature

for cat_fea in categorical_features:
    print("Distribution of " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea, Train_Data[cat_fea].nunique()))
    print(Train_Data[cat_fea].value_counts())

Distribution of name:
name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
         ... 
119983      1
63443       1
104410      1
154956      1
177672      1
Name: name, Length: 99662, dtype: int64
Distribution of model:
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
240.0        2
209.0        2
245.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of brand:
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of bodyType:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of fuelType:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of gearbox:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of notRepairedDamage:
notRepairedDamage has 3 distinct values
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of regionCode:
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
       ... 
6377      1
7994      1
7973      1
7975      1
8117      1
Name: regionCode, Length: 7905, dtype: int64

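Visualization

def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable

f = pd.melt(Train_Data,  value_vars=categorical_features)
# FacetGrid's `size` argument was renamed to `height` in newer seaborn releases
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")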

for cat_fea in categorical_features:
    print("Distribution of " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea, Test_Data[cat_fea].nunique()))
    print(Test_Data[cat_fea].value_counts())

Distribution of name:
name has 37536 distinct values
387       94
55        93
1541      86
708       85
203       78
          ..
69206      1
125326     1
82297      1
168470     1
78202      1
Name: name, Length: 37536, dtype: int64
Distribution of model:
model has 245 distinct values
0.0      3772
19.0     3226
4.0      2790
1.0      1981
29.0     1778
         ... 
209.0       2
229.0       2
241.0       1
242.0       1
244.0       1
Name: model, Length: 245, dtype: int64
Distribution of brand:
brand has 40 distinct values
0     10473
4      5532
14     5345
10     4713
1      4627
6      3500
9      2360
5      1485
13     1386
11      942
3       820
16      770
25      728
7       727
8       708
27      623
21      543
15      476
19      473
20      411
12      399
22      358
26      328
30      321
17      312
24      248
28      216
32      183
29      139
37      117
2       115
31      113
18      107
33       84
35       75
34       75
36       72
23       60
38       31
39        5
Name: brand, dtype: int64
Distribution of bodyType:
bodyType has 8 distinct values
0.0    13765
1.0    11960
2.0     9886
3.0     4491
4.0     3258
5.0     2494
6.0     2212
7.0      430
Name: bodyType, dtype: int64
Distribution of fuelType:
fuelType has 7 distinct values
0.0    30489
1.0    15708
2.0      736
3.0       78
4.0       31
5.0       18
6.0       16
Name: fuelType, dtype: int64
Distribution of gearbox:
gearbox has 2 distinct values
0.0    37131
1.0    10901
Name: gearbox, dtype: int64
Distribution of notRepairedDamage:
notRepairedDamage has 2 distinct values
0.0    37224
1.0     4707
Name: notRepairedDamage, dtype: int64
Distribution of regionCode:
regionCode has 6998 distinct values
419     120
764      98
176      48
85       45
3304     45
       ... 
5365      1
6353      1
7077      1
1317      1
2214      1
Name: regionCode, Length: 6998, dtype: int64
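The counts above differ between train and test (for example, model has 248 distinct values in train and 245 in test), so it is worth checking whether the test set contains categories the training set never saw. The following is a minimal sketch on toy frames; the helper name `unseen_categories` and the toy values are assumptions.

```python
import pandas as pd

# Toy frames standing in for Train_Data and Test_Data.
train = pd.DataFrame({"model": [0.0, 19.0, 4.0, 247.0], "brand": [0, 4, 14, 10]})
test = pd.DataFrame({"model": [0.0, 19.0, 5.0], "brand": [0, 4, 39]})

def unseen_categories(train_col: pd.Series, test_col: pd.Series) -> set:
    """Values that occur in the test column but never in the training column."""
    return set(test_col.dropna().unique()) - set(train_col.dropna().unique())

report = {col: unseen_categories(train[col], test[col]) for col in ["model", "brand"]}
print(report)
```

Categories that appear only in the test set cannot be learned from the training data, so they usually need a fallback such as a shared "other" bucket.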

def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable

f = pd.melt(Test_Data,  value_vars=categorical_features)
# FacetGrid's `size` argument was renamed to `height` in newer seaborn releases
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")

Box-plot visualization of categorical features

pd.melt(Train_Data, id_vars=['price'], value_vars=categorical_features)

	price	variable	value
0	1850	name	736
1	3600	name	2262
2	6222	name	14874
3	2400	name	71865
4	5200	name	111080
...	...	...	...
1199995	5900	regionCode	4576
1199996	9500	regionCode	2826
1199997	7500	regionCode	3302
1199998	4999	regionCode	1877
1199999	4700	regionCode	235

1200000 rows × 3 columns
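The 1200000 rows come from stacking the 8 categorical columns of the 150000 training rows on top of each other (150000 × 8). A toy example makes the long format that FacetGrid consumes easier to see; the two-row frame below reuses the first two rows of a few columns shown earlier.

```python
import pandas as pd

# Two rows, two feature columns, plus the target kept as an identifier.
toy = pd.DataFrame({"price": [1850, 3600], "brand": [6, 1], "gearbox": [0.0, 0.0]})

# Each (row, feature) pair becomes one row with columns: price, variable, value.
long_format = pd.melt(toy, id_vars=["price"], value_vars=["brand", "gearbox"])
print(long_format)
```

The result has 2 × 2 = 4 rows, with `variable` holding the original column name and `value` holding its entry.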

# name and regionCode have too many sparse categories, so only the less sparse features are plotted here
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_Data[c] = Train_Data[c].astype('category')
    if Train_Data[c].isnull().any():
        Train_Data[c] = Train_Data[c].cat.add_categories(['MISSING'])
        Train_Data[c] = Train_Data[c].fillna('MISSING')

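def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    plt.xticks(rotation=90)  # rotate the category labels so they stay readable

f = pd.melt(Train_Data, id_vars=['price'], value_vars=categorical_features)
# FacetGrid's `size` argument was renamed to `height` in newer seaborn releases
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")

Violin-plot visualization of categorical features

## 3) Violin-plot visualization of categorical features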
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_Data)
    plt.show()

Bar-plot visualization of categorical features

Numeric features

Correlation (correlation of each variable with the target)

numeric_features.append("price")
price_numeric = Train_Data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')

price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64
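Sorting by the signed value puts the strongest negative correlations at the bottom, although v_3 (-0.73) is as informative as v_12 (0.69). One option is to rank by absolute value instead. The sketch below reuses a few of the coefficients printed above; the variable names are illustrative.

```python
import pandas as pd

# A subset of the correlations with price printed above.
corr_with_price = pd.Series({
    "v_12": 0.692823,
    "v_8": 0.685798,
    "v_0": 0.628397,
    "power": 0.219834,
    "kilometer": -0.440519,
    "v_3": -0.730946,
})

# Rank by strength of the linear relationship, keeping the sign for reference.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```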

(Correlation among the variables)

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)

del price_numeric['price']

Skewness and kurtosis of individual features

This identifies the variables with large skewness and kurtosis noticed earlier.

for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_Data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_Data[col].kurt())  
         )

power           Skewness: 65.86     Kurtosis: 5733.45
kilometer       Skewness: -1.53     Kurtosis: 001.14
v_0             Skewness: -1.32     Kurtosis: 003.99
v_1             Skewness: 00.36     Kurtosis: -01.75
v_2             Skewness: 04.84     Kurtosis: 023.86
v_3             Skewness: 00.11     Kurtosis: -00.42
v_4             Skewness: 00.37     Kurtosis: -00.20
v_5             Skewness: -4.74     Kurtosis: 022.93
v_6             Skewness: 00.37     Kurtosis: -01.74
v_7             Skewness: 05.13     Kurtosis: 025.85
v_8             Skewness: 00.20     Kurtosis: -00.64
v_9             Skewness: 00.42     Kurtosis: -00.32
v_10            Skewness: 00.03     Kurtosis: -00.58
v_11            Skewness: 03.03     Kurtosis: 012.57
v_12            Skewness: 00.37     Kurtosis: 000.27
v_13            Skewness: 00.27     Kurtosis: -00.44
v_14            Skewness: -1.19     Kurtosis: 002.39
price           Skewness: 03.35     Kurtosis: 019.00

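# Test_Data has no 'price' column, so only loop over the features it contains
for col in [c for c in numeric_features if c in Test_Data.columns]:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Test_Data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Test_Data[col].kurt())  
         )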

power           Skewness: 60.02     Kurtosis: 4533.77
kilometer       Skewness: -1.52     Kurtosis: 001.13
v_0             Skewness: -1.31     Kurtosis: 003.98
v_1             Skewness: 00.37     Kurtosis: -01.74
v_2             Skewness: 04.84     Kurtosis: 023.85
v_3             Skewness: 00.09     Kurtosis: -00.44
v_4             Skewness: 00.38     Kurtosis: -00.22
v_5             Skewness: -4.73     Kurtosis: 022.87
v_6             Skewness: 00.38     Kurtosis: -01.73
v_7             Skewness: 05.13     Kurtosis: 025.83
v_8             Skewness: 00.22     Kurtosis: -00.62
v_9             Skewness: 00.42     Kurtosis: -00.33
v_10            Skewness: 00.02     Kurtosis: -00.56
v_11            Skewness: 03.02     Kurtosis: 012.48
v_12            Skewness: 00.38     Kurtosis: 000.32
v_13            Skewness: 00.26     Kurtosis: -00.49
v_14            Skewness: -1.21     Kurtosis: 002.40

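Visualizing the distribution of each numeric feature
a) Check for abnormal distributions
b) Check for variables whose distributions differ between train and test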

f = pd.melt(Train_Data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

numeric_features.remove("price")
f = pd.melt(Test_Data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
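Besides comparing the plots, point b) can be checked with a side-by-side table of summary statistics. The following is a minimal sketch on synthetic data; the helper name `compare_summary` and the chosen statistics are assumptions, and large gaps in mean, std, or the extremes would only be a hint to look closer, not a formal test.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for one numeric feature in train and test.
rng = np.random.default_rng(0)
train = pd.DataFrame({"v_0": rng.normal(44.4, 2.5, 1000)})
test = pd.DataFrame({"v_0": rng.normal(44.4, 2.5, 500)})

def compare_summary(train_df: pd.DataFrame, test_df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Place selected describe() statistics of train and test next to each other."""
    stats = ["mean", "std", "min", "max"]
    train_stats = train_df[cols].describe().loc[stats]
    test_stats = test_df[cols].describe().loc[stats]
    return pd.concat({"train": train_stats, "test": test_stats}, axis=1)

comparison = compare_summary(train, test, ["v_0"])
print(comparison)
```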
