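- Task02 covered exploratory data analysis and plotting.
- I typed out the Task02 data analysis code once myself and added explanatory comments.
- I then dropped the two columns with abnormal categorical features and three columns picked by their correlation with price, and re-ran the prediction. The result is shown in the figure: there was no improvement, so further processing and feature engineering (task03) is needed.
The data analysis below follows the tutorial.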
# Import packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
- Load the data
Train_data = pd.read_csv('car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('car_testA_0110.csv', sep=' ')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same head + tail view
pd.concat([Train_data.head(), Train_data.tail()])
index | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 134890 | 734 | 20160002 | 13.0 | 9 | NaN | 0.0 | 1.0 | 0 | 15.0 | ... | 0.092139 | 0.000000 | 18.763832 | -1.512063 | -1.008718 | -12.100623 | -0.947052 | 9.077297 | 0.581214 | 3.945923 |
1 | 306648 | 196973 | 20080307 | 72.0 | 9 | 7.0 | 5.0 | 1.0 | 173 | 15.0 | ... | 0.001070 | 0.122335 | -5.685612 | -0.489963 | -2.223693 | -0.226865 | -0.658246 | -3.949621 | 4.593618 | -1.145653 |
2 | 340675 | 25347 | 20020312 | 18.0 | 12 | 3.0 | 0.0 | 1.0 | 50 | 12.5 | ... | 0.064410 | 0.003345 | -3.295700 | 1.816499 | 3.554439 | -0.683675 | 0.971495 | 2.625318 | -0.851922 | -1.246135 |
3 | 57332 | 5382 | 20000611 | 38.0 | 8 | 7.0 | 0.0 | 1.0 | 54 | 15.0 | ... | 0.069231 | 0.000000 | -3.405521 | 1.497826 | 4.782636 | 0.039101 | 1.227646 | 3.040629 | -0.801854 | -1.251894 |
4 | 265235 | 173174 | 20030109 | 87.0 | 0 | 5.0 | 5.0 | 1.0 | 131 | 3.0 | ... | 0.000099 | 0.001655 | -4.475429 | 0.124138 | 1.364567 | -0.319848 | -1.131568 | -3.303424 | -1.998466 | -1.279368 |
249995 | 10556 | 9332 | 20170003 | 13.0 | 9 | NaN | NaN | 1.0 | 58 | 15.0 | ... | 0.079119 | 0.001447 | 11.782508 | 20.402576 | -2.722772 | 0.462388 | -4.429385 | 7.883413 | 0.698405 | -1.082013 |
249996 | 146710 | 102110 | 20030511 | 29.0 | 17 | 3.0 | 0.0 | 0.0 | 61 | 15.0 | ... | 0.000000 | 0.002342 | -2.988272 | 1.500532 | 3.502201 | -0.761715 | -2.484556 | -2.532968 | -0.940266 | -1.106426 |
249997 | 116066 | 82802 | 20130312 | 124.0 | 16 | 6.0 | 0.0 | 1.0 | 122 | 3.0 | ... | 0.003358 | 0.100760 | -6.939560 | -1.144959 | -5.337949 | 0.896026 | -0.592565 | -3.872725 | 2.135984 | 3.807554 |
249998 | 90082 | 65971 | 20121212 | 111.0 | 4 | 7.0 | 5.0 | 0.0 | 184 | 9.0 | ... | 0.002974 | 0.008251 | -7.222167 | -1.383696 | -5.402794 | -0.409451 | -1.891556 | -3.104789 | -3.777374 | 3.186218 |
249999 | 76453 | 56954 | 20051111 | 13.0 | 9 | 3.0 | 0.0 | 1.0 | 58 | 12.5 | ... | 0.000000 | 0.009071 | 10.491312 | -11.270043 | -0.272595 | -0.026478 | -2.168249 | -0.980042 | -0.955164 | -1.169593 |
10 rows × 40 columns
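- name - car code
- regDate - car registration date ***
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox
- power - engine power
- kilometer - kilometers driven
- notRepairedDamage - the car has damage that has not been repaired ***
- regionCode - code of the region where the car is viewed
- seller - seller
- offerType - offer type
- creatDate - date the ad was posted
- price - car price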
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21',
'v_22', 'v_23'],
dtype='object')
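# Names of the non-anonymous columns (plain list; the original chained assignment to a misspelled `cloumns` attribute is removed)
Train_data_part = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price']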
Train_data_part
['SaleID',
'name',
'regDate',
'model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'power',
'kilometer',
'notRepairedDamage',
'regionCode',
'seller',
'offerType',
'creatDate',
'price']
Train_data.describe()
statistic | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 250000.000000 | 250000.000000 | 2.500000e+05 | 250000.000000 | 250000.000000 | 224620.000000 | 227510.000000 | 236487.000000 | 250000.000000 | 250000.000000 | ... | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 | 250000.000000 |
mean | 185351.790768 | 83153.362172 | 2.003401e+07 | 44.911480 | 7.785236 | 4.563271 | 1.665008 | 0.780783 | 115.528412 | 12.577418 | ... | 0.032489 | 0.030408 | 0.014725 | 0.000915 | 0.006273 | 0.006604 | -0.001374 | 0.000609 | -0.004025 | 0.001834 |
std | 107121.188763 | 72540.799964 | 7.770250e+04 | 50.640081 | 7.694010 | 1.912515 | 2.339646 | 0.413717 | 196.141828 | 3.990632 | ... | 0.038792 | 0.049333 | 8.779163 | 5.771081 | 4.880981 | 4.124722 | 3.803626 | 3.555353 | 2.864713 | 2.323680 |
min | 1.000000 | 0.000000 | 1.910000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | ... | 0.000000 | 0.000000 | -10.412444 | -15.538236 | -21.009214 | -13.989955 | -9.599285 | -11.181255 | -7.671327 | -2.350888 |
25% | 92501.750000 | 14500.000000 | 1.999061e+07 | 6.000000 | 1.000000 | 3.000000 | 0.000000 | 1.000000 | 70.000000 | 12.500000 | ... | 0.000129 | 0.000000 | -5.552269 | -0.901181 | -3.150385 | -0.478173 | -1.727237 | -3.067073 | -2.092178 | -1.402804 |
50% | 185264.500000 | 65314.500000 | 2.003111e+07 | 27.000000 | 6.000000 | 4.000000 | 0.000000 | 1.000000 | 105.000000 | 15.000000 | ... | 0.001961 | 0.002567 | -3.821770 | 0.223181 | -0.058502 | 0.038427 | -0.995044 | -0.880587 | -1.199807 | -1.145588 |
75% | 278128.500000 | 143761.250000 | 2.008081e+07 | 70.000000 | 11.000000 | 7.000000 | 5.000000 | 1.000000 | 150.000000 | 15.000000 | ... | 0.075672 | 0.056568 | 3.599747 | 1.263737 | 2.800475 | 0.569198 | 1.563382 | 3.269987 | 2.737614 | 0.044865 |
max | 370946.000000 | 233044.000000 | 2.019121e+07 | 250.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 20000.000000 | 15.000000 | ... | 0.130785 | 0.184340 | 36.756878 | 26.134561 | 23.055660 | 16.576027 | 20.324572 | 14.039422 | 8.764597 | 8.574730 |
8 rows × 40 columns
Test_data.describe()
The max of power in Train_data.describe() (20000) looks abnormal.
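One way to follow up on this observation is to count how many values lie far outside the bulk of the distribution. The sketch below applies the common 1.5 × IQR rule to a small made-up Series; the helper name `iqr_outlier_mask` and the toy values are assumptions for illustration and are not part of the tutorial.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy stand-in for Train_data['power']; the values are invented.
power = pd.Series([0, 60, 75, 105, 140, 150, 20000])
mask = iqr_outlier_mask(power)
print(power[mask].tolist())
```

On the real data the same helper could be applied as `iqr_outlier_mask(Train_data['power']).sum()` to count suspicious rows.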
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 250000 non-null int64
1 name 250000 non-null int64
2 regDate 250000 non-null int64
3 model 250000 non-null float64
4 brand 250000 non-null int64
5 bodyType 224620 non-null float64
6 fuelType 227510 non-null float64
7 gearbox 236487 non-null float64
8 power 250000 non-null int64
9 kilometer 250000 non-null float64
10 notRepairedDamage 201464 non-null float64
11 regionCode 250000 non-null int64
12 seller 250000 non-null int64
13 offerType 250000 non-null int64
14 creatDate 250000 non-null int64
15 price 250000 non-null int64
16 v_0 250000 non-null float64
17 v_1 250000 non-null float64
18 v_2 250000 non-null float64
19 v_3 250000 non-null float64
20 v_4 250000 non-null float64
21 v_5 250000 non-null float64
22 v_6 250000 non-null float64
23 v_7 250000 non-null float64
24 v_8 250000 non-null float64
25 v_9 250000 non-null float64
26 v_10 250000 non-null float64
27 v_11 250000 non-null float64
28 v_12 250000 non-null float64
29 v_13 250000 non-null float64
30 v_14 250000 non-null float64
31 v_15 250000 non-null float64
32 v_16 250000 non-null float64
33 v_17 250000 non-null float64
34 v_18 250000 non-null float64
35 v_19 250000 non-null float64
36 v_20 250000 non-null float64
37 v_21 250000 non-null float64
38 v_22 250000 non-null float64
39 v_23 250000 non-null float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 50000 non-null int64
1 name 50000 non-null int64
2 regDate 50000 non-null int64
3 model 50000 non-null float64
4 brand 50000 non-null int64
5 bodyType 44890 non-null float64
6 fuelType 45598 non-null float64
7 gearbox 47287 non-null float64
8 power 50000 non-null int64
9 kilometer 50000 non-null float64
10 notRepairedDamage 40372 non-null float64
11 regionCode 50000 non-null int64
12 seller 50000 non-null int64
13 offerType 50000 non-null int64
14 creatDate 50000 non-null int64
15 v_0 50000 non-null float64
16 v_1 50000 non-null float64
17 v_2 50000 non-null float64
18 v_3 50000 non-null float64
19 v_4 50000 non-null float64
20 v_5 50000 non-null float64
21 v_6 50000 non-null float64
22 v_7 50000 non-null float64
23 v_8 50000 non-null float64
24 v_9 50000 non-null float64
25 v_10 50000 non-null float64
26 v_11 50000 non-null float64
27 v_12 50000 non-null float64
28 v_13 50000 non-null float64
29 v_14 50000 non-null float64
30 v_15 50000 non-null float64
31 v_16 50000 non-null float64
32 v_17 50000 non-null float64
33 v_18 50000 non-null float64
34 v_19 50000 non-null float64
35 v_20 50000 non-null float64
36 v_21 50000 non-null float64
37 v_22 50000 non-null float64
38 v_23 50000 non-null float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
# Check which columns contain NaN
Train_data.isnull()
index | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_14 | v_15 | v_16 | v_17 | v_18 | v_19 | v_20 | v_21 | v_22 | v_23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
249995 | False | False | False | False | False | True | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249996 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249998 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
249999 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
250000 rows × 40 columns
Train_data.isnull().sum() # sum() adds up each column, giving the NaN count per column
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 25380
fuelType 22490
gearbox 13513
power 0
kilometer 0
notRepairedDamage 48536
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
v_15 0
v_16 0
v_17 0
v_18 0
v_19 0
v_20 0
v_21 0
v_22 0
v_23 0
dtype: int64
Visualizing NaN values
missing = Train_data.isnull().sum() # number of NaNs per column
missing = missing[missing > 0] # keep only the columns that have missing values
type(missing)
pandas.core.series.Series
missing
bodyType 25380
fuelType 22490
gearbox 13513
notRepairedDamage 48536
dtype: int64
# inplace=True modifies the original object in place
missing.sort_values(inplace=True)
missing # after sorting (ascending by NaN count)
gearbox 13513
fuelType 22490
bodyType 25380
notRepairedDamage 48536
dtype: int64
# Plot: the x-axis is the feature name, the y-axis is the NaN count
missing.plot.bar()
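The two statements above make it easy to see which columns contain NaN and to print how many there are. The main point is to judge whether the number of NaNs is really large. If it is small, the values are usually filled. With tree models such as LightGBM they can be left missing so that the tree handles them itself. If a column has too many NaNs, consider dropping it.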
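The rule of thumb above can be sketched as a small helper that sorts columns by their NaN ratio. The helper name `plan_missing` and the 0.05 / 0.5 thresholds are assumptions chosen for illustration; suitable cut-offs depend on the data and the model.

```python
import numpy as np
import pandas as pd

def plan_missing(df: pd.DataFrame, fill_below: float = 0.05, drop_above: float = 0.5) -> dict:
    """Suggest 'fill', 'keep_nan' (e.g. for tree models) or 'drop' per column with NaNs."""
    ratio = df.isnull().mean()          # fraction of NaN per column
    plan = {}
    for col, r in ratio[ratio > 0].items():
        if r > drop_above:
            plan[col] = "drop"
        elif r < fill_below:
            plan[col] = "fill"
        else:
            plan[col] = "keep_nan"
    return plan

# Toy frame with invented values: 'a' has 1/25 NaN, 'b' 5/25, 'c' 20/25, 'd' none.
n = 25
toy = pd.DataFrame({
    "a": [np.nan] + [1.0] * (n - 1),
    "b": [np.nan] * 5 + [1.0] * (n - 5),
    "c": [np.nan] * 20 + [1.0] * (n - 20),
    "d": [1.0] * n,
})
plan = plan_missing(toy)
print(plan)
```

Columns without any NaN are left out of the plan, so only the columns that need a decision are listed.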
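# Visualize the missing values
msno.matrix(Train_data.sample(250))
msno.bar(Train_data.sample(1000))
# Shows which columns have fewer than 1000 non-null values in the 1000-row sample; the count is labeled above each bar
# Visualize the missing values in the test set
msno.matrix(Test_data)
msno.bar(Test_data.sample(1000))
- The distribution of missing values in the training set and the test set is very similar.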
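The similarity can also be checked numerically. The sketch below hard-codes the non-null counts printed by `Train_data.info()` and `Test_data.info()` above and compares the missing ratios; the variable names are made up for this example.

```python
# Non-null counts copied from the info() outputs above.
train_rows, test_rows = 250000, 50000
train_non_null = {"bodyType": 224620, "fuelType": 227510,
                  "gearbox": 236487, "notRepairedDamage": 201464}
test_non_null = {"bodyType": 44890, "fuelType": 45598,
                 "gearbox": 47287, "notRepairedDamage": 40372}

ratios = {}
for col in train_non_null:
    train_ratio = 1 - train_non_null[col] / train_rows
    test_ratio = 1 - test_non_null[col] / test_rows
    ratios[col] = (train_ratio, test_ratio)
    print(f"{col:18s} train={train_ratio:.4f} test={test_ratio:.4f}")

# Largest absolute difference between the train and test missing ratios.
max_gap = max(abs(tr - te) for tr, te in ratios.values())
print(f"largest gap: {max_gap:.4f}")
```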
Outlier detection
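- .value_counts() returns the distinct values in a feature column together with their counts
# .value_counts() returns the distinct values in this column and how often each occurs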
Train_data['notRepairedDamage'].value_counts()
1.0 176922
0.0 24542
Name: notRepairedDamage, dtype: int64
# Train_data.value_counts()
# In the original used-car data this feature is categorical and '-' also denotes a missing value;
# the line below would replace '-' with NaN
# Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
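For reference, here is how that replacement would behave on a column that still contains the `-` placeholder. In the files loaded here `notRepairedDamage` is already `float64` according to the `info()` output, so this is only a sketch on an invented Series.

```python
import numpy as np
import pandas as pd

# Invented raw column where '-' marks a missing value.
raw = pd.Series(["0.0", "-", "1.0", "-", "0.0"], name="notRepairedDamage")

# Swap the placeholder for NaN, then convert the remaining strings to numbers.
cleaned = pd.to_numeric(raw.replace("-", np.nan))

print(cleaned.isnull().sum())               # number of placeholders found
print(cleaned.value_counts().to_dict())     # counts of the remaining values
```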
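The following two categorical features are severely skewed and generally do not help prediction, so they are dropped here. You can keep digging into them, but it is usually not worthwhile.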
Train_data["seller"].value_counts()
1 249999
0 1
Name: seller, dtype: int64
Test_data["seller"].value_counts()
1 50000
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0 249991
1 9
Name: offerType, dtype: int64
Test_data['offerType'].value_counts()
0 49999
1 1
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
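Such heavily skewed columns can also be found programmatically instead of by eye. The following sketch flags columns whose most frequent value covers almost all rows; the helper name `near_constant_columns` and the 0.999 threshold are assumptions for illustration.

```python
import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.999) -> list:
    """List columns whose most common value accounts for >= threshold of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy frame with invented values: 'seller' is almost constant, 'gearbox' is not.
toy = pd.DataFrame({
    "seller": [1] * 9999 + [0],
    "gearbox": [1] * 7000 + [0] * 3000,
})
flagged = near_constant_columns(toy)
print(flagged)
```

The flagged columns could then be removed from both the training and the test set with `drop(columns=flagged)`.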
value_counts() for all features
for f in Train_data.columns:
    print(f)
    print(Train_data[f].value_counts())
SaleID
2049 1
265515 1
277805 1
271662 1
312626 1
..
107105 1
113250 1
111203 1
98917 1
2047 1
Name: SaleID, Length: 250000, dtype: int64
name
451 452
73 429
1791 428
821 391
243 346
...
92419 1
88325 1
82182 1
84231 1
157427 1
Name: name, Length: 164312, dtype: int64
regDate
20000010 306
20000001 288
20000002 288
20000007 279
20000008 278
...
19850904 1
19851010 1
19750511 1
19870912 1
19400705 1
Name: regDate, Length: 7537, dtype: int64
model
0.0 20344
6.0 17741
4.0 13837
1.0 13634
12.0 8841
...
226.0 5
245.0 5
243.0 4
249.0 4
250.0 1
Name: model, Length: 251, dtype: int64
brand
0 53699
4 27109
11 26944
10 23762
1 22144
6 17202
9 12210
5 7343
15 6500
12 4704
7 3839
3 3831
17 3543
13 3502
8 3374
28 3161
19 2561
18 2451
16 2274
22 2264
23 2088
14 1892
24 1678
25 1611
20 1610
27 1392
29 1259
34 963
30 604
2 570
31 540
21 522
38 516
35 415
32 406
36 377
33 368
37 324
26 307
39 141
Name: brand, dtype: int64
bodyType
7.0 64571
3.0 53858
4.0 45646
5.0 20343
6.0 15290
2.0 12755
1.0 9882
0.0 2275
Name: bodyType, dtype: int64
fuelType
0.0 150664
5.0 72494
4.0 3577
3.0 385
2.0 183
1.0 147
6.0 60
Name: fuelType, dtype: int64
gearbox
1.0 184645
0.0 51842
Name: gearbox, dtype: int64
power
0 27280
75 16158
60 10765
150 10373
140 9145
...
1986 1
1090 1
10311 1
960 1
3454 1
Name: power, Length: 703, dtype: int64
kilometer
15.0 162161
12.5 25743
10.0 10777
9.0 8424
8.0 7434
7.0 6642
6.0 5859
5.0 5100
0.5 4634
4.0 4204
3.0 4021
2.0 3749
1.0 1252
Name: kilometer, dtype: int64
notRepairedDamage
1.0 176922
0.0 24542
Name: notRepairedDamage, dtype: int64
regionCode
487 550
868 424
149 236
539 227
32 216
...
7959 1
8002 1
6715 1
7117 1
4144 1
Name: regionCode, Length: 8081, dtype: int64
creatDate
20160403 9758
20160404 9521
20160320 9176
20160312 8946
20160321 8895
...
20150618 1
20160114 1
20160201 1
20150611 1
20140310 1
Name: creatDate, Length: 107, dtype: int64
price
0 7312
500 3815
1500 3587
1000 3149
1200 3071
...
11320 1
7230 1
11448 1
9529 1
8188 1
Name: price, Length: 4585, dtype: int64
v_0
71.666307 2
72.346416 2
78.107692 2
71.715545 2
73.734706 2
..
71.161494 1
70.253614 1
70.797686 1
74.588185 1
77.825581 1
Name: v_0, Length: 249747, dtype: int64
v_1
-1.470958 2
-3.128523 2
-3.224945 2
-3.293795 2
-3.322763 2
..
-3.970355 1
11.487790 1
-3.456756 1
-3.746283 1
-3.579301 1
Name: v_1, Length: 249747, dtype: int64
v_2
-0.527186 2
-0.998414 2
0.652201 2
0.356107 2
-9.312859 2
..
-0.815970 1
-6.729062 1
-1.683035 1
0.171102 1
0.852139 1
Name: v_2, Length: 249747, dtype: int64
v_3
3.580573 2
-0.633228 2
-2.541859 2
0.161395 2
20.571558 2
..
1.067454 1
-0.826230 1
-6.306510 1
0.201140 1
4.146853 1
Name: v_3, Length: 249747, dtype: int64
v_4
2.038620 2
-0.591751 2
-12.603294 2
-0.321072 2
-0.429618 2
..
0.742918 1
2.722358 1
-0.317880 1
-0.356648 1
1.327513 1
Name: v_4, Length: 249747, dtype: int64
v_5
1.273623 2
-1.589295 2
-2.350140 2
0.080770 2
-2.434300 2
..
-1.374641 1
-2.369201 1
-2.194464 1
1.226827 1
-1.218480 1
Name: v_5, Length: 249747, dtype: int64
v_6
3.854950 2
-2.337177 2
-2.840736 2
-2.988814 2
0.912034 2
..
-8.718013 1
3.185567 1
3.443525 1
-2.653621 1
-3.138425 1
Name: v_6, Length: 249747, dtype: int64
v_7
-2.915058 2
-2.518469 2
-1.175198 2
-3.672233 2
-2.563102 2
..
-1.171334 1
-2.324847 1
4.015706 1
-1.895407 1
-2.156468 1
Name: v_7, Length: 249747, dtype: int64
v_8
0.000000 48244
0.315924 2
0.315905 2
0.314498 2
0.315560 2
...
0.315494 1
0.289243 1
0.316095 1
0.316209 1
0.315702 1
Name: v_8, Length: 201543, dtype: int64
v_9
1.101174 2
0.118624 2
0.164335 2
0.114609 2
0.112811 2
..
1.110851 1
1.101634 1
0.116084 1
0.090707 1
0.112558 1
Name: v_9, Length: 249747, dtype: int64
v_10
0.000000 25342
0.081665 2
0.086726 2
0.081616 2
0.081701 2
...
0.089640 1
0.091852 1
0.082066 1
0.081448 1
0.087517 1
Name: v_10, Length: 224427, dtype: int64
v_11
0.000000 7421
0.121584 2
0.102037 2
0.166840 2
0.134519 2
...
0.092895 1
0.108411 1
0.131894 1
0.075781 1
0.078286 1
Name: v_11, Length: 242335, dtype: int64
v_12
0.000000 22426
0.053098 2
0.053437 2
0.055474 2
0.053432 2
...
0.053485 1
0.053471 1
0.055616 1
0.053447 1
0.053329 1
Name: v_12, Length: 227338, dtype: int64
v_13
0.000000 13495
0.130205 2
0.123467 2
0.123337 2
0.130232 2
...
0.123242 1
0.123755 1
0.123252 1
0.123047 1
0.123567 1
Name: v_13, Length: 236266, dtype: int64
v_14
0.000000 53857
0.003751 2
0.000746 2
0.002838 2
0.002283 2
...
0.094690 1
0.000690 1
0.086957 1
0.002928 1
0.083676 1
Name: v_14, Length: 195953, dtype: int64
v_15
0.000000 97223
0.010717 2
0.012704 2
0.143362 2
0.005417 2
...
0.001720 1
0.003263 1
0.007882 1
0.005242 1
0.094839 1
Name: v_15, Length: 152622, dtype: int64
v_16
-3.254226 2
-2.855248 2
-4.373334 2
7.744677 2
-2.847659 2
..
10.816862 1
-6.670231 1
-6.291694 1
-4.147668 1
-6.389964 1
Name: v_16, Length: 249747, dtype: int64
v_17
0.498971 2
0.217974 2
-13.000712 2
-0.675390 2
-1.530593 2
..
1.397213 1
0.610112 1
2.335480 1
-1.500048 1
5.289472 1
Name: v_17, Length: 249747, dtype: int64
v_18
-3.753102 2
7.731945 2
-0.058593 2
-1.171759 2
-2.045338 2
..
-0.860834 1
0.643066 1
5.023034 1
-2.016881 1
6.565301 1
Name: v_18, Length: 249747, dtype: int64
v_19
0.082562 2
-0.469708 2
-0.138257 2
-0.657417 2
0.862429 2
..
-1.249647 1
-0.664831 1
-0.660867 1
0.040847 1
0.700206 1
Name: v_19, Length: 249747, dtype: int64
v_20
-1.214032 2
-2.031659 2
-2.426898 2
-1.542005 2
-0.657360 2
..
-2.098785 1
0.725159 1
-4.682086 1
0.342639 1
1.612570 1
Name: v_20, Length: 249747, dtype: int64
v_21
-3.244933 2
-3.440059 2
-4.070917 2
-3.001142 2
-4.153741 2
..
-3.289808 1
4.942931 1
2.670356 1
-3.793230 1
3.226273 1
Name: v_21, Length: 249747, dtype: int64
v_22
-2.957315 2
4.311760 2
-2.101273 2
-0.936764 2
-2.562937 2
..
6.155533 1
-2.379816 1
-1.419529 1
4.872987 1
5.062041 1
Name: v_22, Length: 249747, dtype: int64
v_23
-1.044909 2
-1.259228 2
-1.183081 2
-0.989739 2
-0.958179 2
..
-1.946367 1
-1.503159 1
-1.175261 1
-0.908192 1
-1.182790 1
Name: v_23, Length: 249747, dtype: int64
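- bodyType: eight categories
- fuelType: seven categories
- gearbox: two categories
- kilometer: 13 categories
- notRepairedDamage: two categories
- seller: two categories, but severely skewed **
- offerType: two categories, but severely skewed **
- v_8, v_10, v_11, v_12, v_13, v_14, v_15: each has one value (0.0) whose count is far larger than any other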
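To put the last observation into numbers, the sketch below takes the counts of `0.0` reported by `value_counts()` above and divides them by the 250000 training rows. The variable names are made up for this example.

```python
# Counts of the value 0.0, copied from the value_counts() output above.
n_rows = 250000
zero_counts = {"v_8": 48244, "v_10": 25342, "v_11": 7421, "v_12": 22426,
               "v_13": 13495, "v_14": 53857, "v_15": 97223}

# Share of rows that hold exactly 0.0 in each column, largest first.
zero_share = {col: cnt / n_rows for col, cnt in zero_counts.items()}
for col, share in sorted(zero_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:5s} {share:.2%}")
```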
Understanding the distribution of the target
type(Train_data['price'])
pandas.core.series.Series
Train_data['price'].value_counts()
0 7312
500 3815
1500 3587
1000 3149
1200 3071
...
11320 1
7230 1
11448 1
9529 1
8188 1
Name: price, Length: 4585, dtype: int64
## 1) Overall distribution of price (Johnson SU, i.e. the unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price'>
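figure syntax and usage
(1) figure syntax
figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)
- num: figure number or name; a number is used as the id, a string as the name
- figsize: width and height of the figure, in inches
- dpi: resolution of the figure, i.e. pixels per inch; when omitted, the value of rcParams["figure.dpi"] is used. For scale, 1 inch is about 2.54 cm and an A4 sheet is about 21 × 30 cm
- facecolor: background color
- edgecolor: border color
- frameon: whether to draw the frame
(2) Example: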
fig=plt.figure(figsize=(4,3),facecolor='blue')
plt.plot([1,2,3,4],[3,5,7,9])
plt.show()
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=True,rug=True, fit=st.lognorm)
# kde=True
<AxesSubplot:title={'center':'Log Normal'}, xlabel='price', ylabel='Density'>
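Plotting with seaborn
- seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)
Setting kde to True
- Kernel density estimation (KDE)
- Kernel density estimation is used in probability theory to estimate an unknown density function and is a nonparametric estimation method. Because it uses no prior knowledge about the data distribution and attaches no assumptions to it, it studies the characteristics of the distribution starting from the data sample itself, which is why it is highly valued in both statistical theory and applications.
- hist: bool, optional # whether to draw the histogram, default True
- kde: bool, optional # whether to draw the kernel density estimate curve, default True
- rug: bool, optional # whether to draw a small tick for each observation (rug plot), default False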
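To make the idea concrete, a Gaussian kernel density estimate can be written in a few lines of NumPy: place a normal bump on every observation and average them. This is a sketch of the textbook formula with a bandwidth `h` chosen by hand, not seaborn's internal implementation; the function name `gaussian_kde_1d` and the sample are invented.

```python
import numpy as np

def gaussian_kde_1d(samples: np.ndarray, grid: np.ndarray, h: float) -> np.ndarray:
    """Evaluate f(x) = 1/(n*h) * sum_i K((x - x_i)/h) with a standard normal kernel K."""
    z = (grid[:, None] - samples[None, :]) / h          # shape (len(grid), n)
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)      # invented sample
grid = np.linspace(-6, 6, 1201)
density = gaussian_kde_1d(samples, grid, h=0.4)

# A density should integrate to about 1 over a wide enough grid (Riemann sum).
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller `h` follows the individual observations more closely, while a larger `h` gives a smoother curve.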
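Handling the distribution of the target
Price is not normally distributed, so it needs to be transformed before running a regression. The log transform does a good job, but the best fit is the unbounded Johnson (Johnson SU) distribution.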
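A common way to apply the log transform mentioned above is `np.log1p`, which also handles a price of 0 (the value counts above show 7312 such rows), with `np.expm1` as the inverse for turning predictions back into prices. This is a sketch on a few invented prices; the tutorial's own transformation step is not shown in this section.

```python
import numpy as np

price = np.array([0, 500, 1500, 11320, 99999], dtype=float)   # invented sample

log_price = np.log1p(price)          # log(1 + price), defined at price == 0
restored = np.expm1(log_price)       # inverse transform

print(log_price.round(3))
print(np.allclose(restored, price))
```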
## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678
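For reference, `Series.skew()` and `Series.kurt()` return bias-corrected sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below recomputes them from their formulas on an invented sample and compares the results with pandas; the function names are made up for this example.

```python
import numpy as np
import pandas as pd

def sample_skew(x: np.ndarray) -> float:
    """Bias-corrected sample skewness."""
    n = len(x)
    d = x - x.mean()
    s = np.sqrt((d ** 2).sum() / (n - 1))              # sample standard deviation
    return n / ((n - 1) * (n - 2)) * ((d / s) ** 3).sum()

def sample_excess_kurt(x: np.ndarray) -> float:
    """Bias-corrected sample excess kurtosis."""
    n = len(x)
    d = x - x.mean()
    s = np.sqrt((d ** 2).sum() / (n - 1))
    term = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * ((d / s) ** 4).sum()
    return term - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)      # right-skewed toy data
s = pd.Series(x)

print(np.isclose(sample_skew(x), s.skew()), np.isclose(sample_excess_kurt(x), s.kurt()))
```

A large positive skewness such as the 3.54 reported for price indicates a long right tail, which matches the histogram.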
Train_data.skew(), Train_data.kurt()
(SaleID 0.001712
name 0.513079
regDate -1.540844
model 1.499765
brand 1.314846
bodyType -0.070459
fuelType 0.701802
gearbox -1.357379
power 58.590829
kilometer -1.557472
notRepairedDamage -2.312519
regionCode 0.690405
creatDate -95.428563
price 3.535346
v_0 -1.504738
v_1 1.582428
v_2 1.198679
v_3 1.352193
v_4 0.217941
v_5 2.052749
v_6 0.090718
v_7 0.823610
v_8 -1.532964
v_9 1.529931
v_10 -2.584452
v_11 -0.906428
v_12 -2.842834
v_13 -3.869655
v_14 0.491706
v_15 1.308716
v_16 1.662893
v_17 0.233318
v_18 0.814453
v_19 0.100073
v_20 2.001253
v_21 0.180020
v_22 0.819133
v_23 1.357847
dtype: float64,
SaleID -1.201476
name -1.084474
regDate 11.041006
model 1.741896
brand 1.814245
bodyType -1.070358
fuelType -1.495782
gearbox -0.157525
power 4473.885260
kilometer 1.250933
notRepairedDamage 3.347777
regionCode -0.352973
creatDate 11376.694263
price 21.230678
v_0 2.901641
v_1 1.098703
v_2 3.749872
v_3 4.294578
v_4 6.953348
v_5 6.489791
v_6 -0.564878
v_7 -0.729838
v_8 0.370812
v_9 0.377943
v_10 4.796855
v_11 1.547812
v_12 6.136342
v_13 13.199575
v_14 -1.597532
v_15 -0.029594
v_16 2.240928
v_17 2.569341
v_18 2.967738
v_19 6.923953
v_20 6.852809
v_21 -0.759948
v_22 -0.741708
v_23 0.143713
dtype: float64)
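A few columns stand out in the output above. A quick way to surface them is to sort by absolute value; the sketch below does this for a handful of skewness values copied from that output. The selection of columns and the threshold of 3 are arbitrary choices for illustration.

```python
import pandas as pd

# Skewness values copied from the Train_data.skew() output above (subset).
skew = pd.Series({
    "power": 58.590829,
    "creatDate": -95.428563,
    "price": 3.535346,
    "v_13": -3.869655,
    "kilometer": -1.557472,
    "bodyType": -0.070459,
})

# Keep columns with |skew| > 3 and order them from most to least extreme.
extreme = skew[skew.abs() > 3].sort_values(key=lambda v: v.abs(), ascending=False)
print(extreme)
```

On the real data the same filter could be applied directly to `Train_data.skew()`.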
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')
# axlabel and label set the axis label and the legend label
<AxesSubplot:xlabel='Skewness', ylabel='Density'>
sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')
<AxesSubplot:xlabel='Kurtosis', ylabel='Density'>