DataWhale Activity: Used Car Price Prediction, Task 3

Kevin_young98

Article link: https://blog.csdn.net/Kevin_young98/article/details/105166472

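Task 3: Feature Engineering

Feature engineering goals

Analyze the features further and process the data accordingly.
Complete the feature engineering analysis and summarize the data with charts or text.

Code example

1. Import the data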

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

%matplotlib inline

train = pd.read_csv('./data/used_car_train_20200313.csv', sep=' ')
test = pd.read_csv('./data/used_car_testA_20200313.csv', sep=' ')
train = train.drop(['SaleID'],axis=1)
test = test.drop(['SaleID'], axis=1)
print(train.shape)
print(test.shape)

(150000, 30)
(50000, 29)

test.head()

	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	notRepairedDamage	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	66932	20111212	222.0	4	5.0	1.0	1.0	313	15.0	0.0	...	0.264405	0.121800	0.070899	0.106558	0.078867	-7.050969	-0.854626	4.800151	0.620011	-3.664654
1	174960	19990211	19.0	21	0.0	0.0	0.0	75	12.5	1.0	...	0.261745	0.000000	0.096733	0.013705	0.052383	3.679418	-0.729039	-3.796107	-1.541230	-0.757055
2	5356	20090304	82.0	21	0.0	0.0	0.0	109	7.0	0.0	...	0.260216	0.112081	0.078082	0.062078	0.050540	-4.926690	1.001106	0.826562	0.138226	0.754033
3	50688	20100405	0.0	0	0.0	0.0	1.0	160	7.0	0.0	...	0.260466	0.106727	0.081146	0.075971	0.048268	-4.864637	0.505493	1.870379	0.366038	1.312775
4	161428	19970703	26.0	14	2.0	0.0	0.0	75	15.0	0.0	...	0.250999	0.000000	0.077806	0.028600	0.081709	3.616475	-0.673236	-3.197685	-0.025678	-0.101290

5 rows × 29 columns

'regDate' in train.columns

True

2. Remove outliers

def outliers_proc(data, col_name, scale=3):
    # Clean outliers in one column; by default uses the box-plot rule with scale=3
    # https://www.cnblogs.com/zhaohuanhuan/p/9055944.html
    def box_plot_outliers(data_ser, box_scale):
        # Flag outliers with the box-plot rule
        # iqr here is the interquartile range (Q3 - Q1) multiplied by box_scale
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positional indices of the rows flagged as outliers
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    print("Before row number is: {}".format(data_n.shape[0]))
    # Map positions to index labels so the drop also works with a non-default index
    data_n = data_n.drop(data_n.index[index])
    data_n.reset_index(drop=True, inplace=True)
    print("Now row number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print('Description of data less than the lower bound is:')
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())

    # Left: the column before cleaning; right: the column after cleaning
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette='Set1', ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette='Set1', ax=ax[1])
    return data_n

train = outliers_proc(train, 'power', scale=3)

Delete number is: 963
Before row number is: 150000
Now row number is: 149037
Description of data less than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count      963.000000
mean       846.836968
std       1929.418081
min        376.000000
25%        400.000000
50%        436.000000
75%        514.000000
max      19312.000000
Name: power, dtype: float64

[Figure: box plots of power before (left) and after (right) outlier removal; image unavailable (output_8_1.png)]
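
To make the box-plot rule used in `outliers_proc` more concrete, here is a minimal sketch on a hand-made array. The values and the helper name `iqr_bounds` are made up for illustration and are not part of the competition data; only the rule itself (quartiles widened by `scale` times the interquartile range) follows the function above.

```python
import numpy as np

def iqr_bounds(values, scale=3.0):
    """Return (low, high) cut-offs: the quartiles widened by scale * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    width = scale * (q3 - q1)
    return q1 - width, q3 + width

# Hypothetical engine-power readings with one obviously extreme entry.
power = np.array([60, 75, 90, 105, 120, 150, 19312], dtype=float)

low, high = iqr_bounds(power, scale=3.0)
mask = (power < low) | (power > high)   # True where a value is an outlier
kept = power[~mask]

print("bounds:", low, high)
print("flagged:", power[mask])
print("kept:", kept)
```

With `scale=3` the rule is deliberately loose, so only values far from the bulk of the data are flagged; a smaller `scale` such as 1.5 would flag more points.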

2. Feature construction

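train['train'] = 1
test['train'] = 0
# ignore_index discards the original index
data = pd.concat([train, test], ignore_index=True, sort=False)

# Usage time: data['creatDate'] - data['regDate'] reflects how long the car has been in use;
# in general, the longer the usage time, the lower the price
# Note that some dates in the data are malformed, so we need errors='coerce'
# .dt.days converts the resulting Timedelta to a number of days so that models can use it directly
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
                        pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

# Check the missing values: about 15k samples have a problematic date. We could delete them or leave them alone.
# Deleting is not recommended here, because they make up too large a share of the samples (7.5%)
# We can leave them for now: tree-based models such as XGBoost handle missing values natively
data['used_time'].isnull().sum()

# Extract city information from the postal code. Since the data is from Germany,
# we refer to German postal codes, which amounts to adding prior knowledge
data['city'] = data['regionCode'].apply(lambda x : str(x)[:-3])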
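
The effect of `errors='coerce'` on the `used_time` computation can be checked on a few hand-made values. This is only a sketch: the three dates are invented, and they are written as strings here to keep the example independent of how a given pandas version parses integers.

```python
import pandas as pd

# Hypothetical registration dates in YYYYMMDD form; the second has month "00".
reg = pd.Series(['20160404', '20070009', '19990211'])
created = pd.Series(['20160404'] * 3)

# Invalid dates become NaT instead of raising an error.
reg_parsed = pd.to_datetime(reg, format='%Y%m%d', errors='coerce')
created_parsed = pd.to_datetime(created, format='%Y%m%d', errors='coerce')

# The difference is a Timedelta; .dt.days turns it into a plain number of days.
used_days = (created_parsed - reg_parsed).dt.days

print(reg_parsed.isnull().sum())   # number of dates that could not be parsed
print(used_days)
```

Rows whose date could not be parsed end up as NaN in `used_days`, which is the missing-value count that `isnull().sum()` reports on the real data.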

train_gb = train.groupby('brand')
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum()/(len(kind_data)+1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={'index':'brand'})
data = data.merge(brand_fe, how='left', on='brand')

pd.DataFrame(all_info).T

	brand_amount	brand_price_max	brand_price_median	brand_price_min	brand_price_sum	brand_price_std	brand_price_average
0	31429.0	68500.0	3199.0	13.0	173719698.0	6261.371627	5527.19
1	13656.0	84000.0	6399.0	15.0	124044603.0	8988.865406	9082.86
2	318.0	55800.0	7500.0	35.0	3766241.0	10576.224444	11806.40
3	2461.0	37500.0	4990.0	65.0	15954226.0	5396.327503	6480.19
4	16575.0	99999.0	5999.0	12.0	138279069.0	8089.863295	8342.13
5	4662.0	31500.0	2300.0	20.0	15414322.0	3344.689763	3305.67
6	10193.0	35990.0	1800.0	13.0	36457518.0	4562.233331	3576.37
7	2360.0	38900.0	2600.0	60.0	9905909.0	4752.584154	4195.64
8	2070.0	99999.0	2270.0	30.0	10017173.0	6053.233424	4836.88
9	7299.0	68530.0	1400.0	50.0	17805271.0	2975.342884	2439.08
10	13994.0	92900.0	5200.0	15.0	113034210.0	8244.695287	8076.76
11	2944.0	34500.0	2900.0	30.0	13398006.0	4722.160492	4549.41
12	1108.0	27490.0	2625.0	50.0	4494303.0	4066.959950	4052.57
13	3813.0	35000.0	1600.0	20.0	10675790.0	3073.915196	2799.11
14	16073.0	38990.0	1700.0	12.0	49076652.0	3605.595127	3053.17
15	1458.0	45000.0	8500.0	100.0	14373814.0	5425.058140	9851.83
16	2219.0	17900.0	2999.0	20.0	8078352.0	2450.906089	3638.90
17	913.0	55800.0	2200.0	15.0	3328679.0	3952.913330	3641.88
18	315.0	34599.0	1999.0	50.0	1519049.0	6358.409761	4807.12
19	1386.0	42350.0	2800.0	20.0	7228288.0	6186.538949	5211.45
20	1235.0	37800.0	1750.0	15.0	4292737.0	4400.529809	3473.09
21	1546.0	35999.0	4225.0	50.0	8856481.0	5257.235026	5724.94
22	1085.0	43900.0	3950.0	50.0	6543426.0	5877.140886	6025.25
23	183.0	64000.0	1200.0	99.0	597132.0	7333.695140	3245.28
24	630.0	99999.0	27450.0	15.0	20422776.0	19855.495201	32365.73
25	2059.0	22500.0	2500.0	25.0	7515546.0	3556.249839	3648.32
26	878.0	99999.0	5000.0	11.0	7242792.0	10282.987274	8239.81
27	2049.0	62900.0	4200.0	35.0	10862559.0	4853.289240	5298.81
28	633.0	39900.0	3790.0	80.0	3373957.0	4509.036301	5321.70
29	406.0	19990.0	5250.0	500.0	2459028.0	3639.737722	6041.84
30	940.0	23200.0	3295.0	50.0	3939145.0	3659.577291	4186.13
31	318.0	11000.0	1000.0	50.0	560155.0	1829.079211	1755.97
32	588.0	33500.0	2350.0	50.0	2360095.0	4394.596002	4006.95
33	201.0	65000.0	5600.0	980.0	1839801.0	9637.135323	9107.93
34	227.0	2900.0	999.0	60.0	231776.0	554.118445	1016.56
35	180.0	28900.0	950.0	50.0	297977.0	3325.933365	1646.28
36	228.0	20900.0	2250.0	150.0	816001.0	3922.715389	3563.32
37	331.0	86500.0	13250.0	550.0	5371844.0	13541.180315	16180.25
38	65.0	8999.0	2850.0	99.0	215620.0	2140.083145	3266.97
39	9.0	14500.0	1900.0	750.0	39480.0	5520.867233	3948.00

pd.DataFrame(all_info).T.reset_index()

	index	brand_amount	brand_price_max	brand_price_median	brand_price_min	brand_price_sum	brand_price_std	brand_price_average
0	0	31429.0	68500.0	3199.0	13.0	173719698.0	6261.371627	5527.19
1	1	13656.0	84000.0	6399.0	15.0	124044603.0	8988.865406	9082.86
2	2	318.0	55800.0	7500.0	35.0	3766241.0	10576.224444	11806.40
3	3	2461.0	37500.0	4990.0	65.0	15954226.0	5396.327503	6480.19
4	4	16575.0	99999.0	5999.0	12.0	138279069.0	8089.863295	8342.13
5	5	4662.0	31500.0	2300.0	20.0	15414322.0	3344.689763	3305.67
6	6	10193.0	35990.0	1800.0	13.0	36457518.0	4562.233331	3576.37
7	7	2360.0	38900.0	2600.0	60.0	9905909.0	4752.584154	4195.64
8	8	2070.0	99999.0	2270.0	30.0	10017173.0	6053.233424	4836.88
9	9	7299.0	68530.0	1400.0	50.0	17805271.0	2975.342884	2439.08
10	10	13994.0	92900.0	5200.0	15.0	113034210.0	8244.695287	8076.76
11	11	2944.0	34500.0	2900.0	30.0	13398006.0	4722.160492	4549.41
12	12	1108.0	27490.0	2625.0	50.0	4494303.0	4066.959950	4052.57
13	13	3813.0	35000.0	1600.0	20.0	10675790.0	3073.915196	2799.11
14	14	16073.0	38990.0	1700.0	12.0	49076652.0	3605.595127	3053.17
15	15	1458.0	45000.0	8500.0	100.0	14373814.0	5425.058140	9851.83
16	16	2219.0	17900.0	2999.0	20.0	8078352.0	2450.906089	3638.90
17	17	913.0	55800.0	2200.0	15.0	3328679.0	3952.913330	3641.88
18	18	315.0	34599.0	1999.0	50.0	1519049.0	6358.409761	4807.12
19	19	1386.0	42350.0	2800.0	20.0	7228288.0	6186.538949	5211.45
20	20	1235.0	37800.0	1750.0	15.0	4292737.0	4400.529809	3473.09
21	21	1546.0	35999.0	4225.0	50.0	8856481.0	5257.235026	5724.94
22	22	1085.0	43900.0	3950.0	50.0	6543426.0	5877.140886	6025.25
23	23	183.0	64000.0	1200.0	99.0	597132.0	7333.695140	3245.28
24	24	630.0	99999.0	27450.0	15.0	20422776.0	19855.495201	32365.73
25	25	2059.0	22500.0	2500.0	25.0	7515546.0	3556.249839	3648.32
26	26	878.0	99999.0	5000.0	11.0	7242792.0	10282.987274	8239.81
27	27	2049.0	62900.0	4200.0	35.0	10862559.0	4853.289240	5298.81
28	28	633.0	39900.0	3790.0	80.0	3373957.0	4509.036301	5321.70
29	29	406.0	19990.0	5250.0	500.0	2459028.0	3639.737722	6041.84
30	30	940.0	23200.0	3295.0	50.0	3939145.0	3659.577291	4186.13
31	31	318.0	11000.0	1000.0	50.0	560155.0	1829.079211	1755.97
32	32	588.0	33500.0	2350.0	50.0	2360095.0	4394.596002	4006.95
33	33	201.0	65000.0	5600.0	980.0	1839801.0	9637.135323	9107.93
34	34	227.0	2900.0	999.0	60.0	231776.0	554.118445	1016.56
35	35	180.0	28900.0	950.0	50.0	297977.0	3325.933365	1646.28
36	36	228.0	20900.0	2250.0	150.0	816001.0	3922.715389	3563.32
37	37	331.0	86500.0	13250.0	550.0	5371844.0	13541.180315	16180.25
38	38	65.0	8999.0	2850.0	99.0	215620.0	2140.083145	3266.97
39	39	9.0	14500.0	1900.0	750.0	39480.0	5520.867233	3948.00

pd.DataFrame(all_info).T.reset_index().rename(columns={'index':'brand'})

	brand	brand_amount	brand_price_max	brand_price_median	brand_price_min	brand_price_sum	brand_price_std	brand_price_average
0	0	31429.0	68500.0	3199.0	13.0	173719698.0	6261.371627	5527.19
1	1	13656.0	84000.0	6399.0	15.0	124044603.0	8988.865406	9082.86
2	2	318.0	55800.0	7500.0	35.0	3766241.0	10576.224444	11806.40
3	3	2461.0	37500.0	4990.0	65.0	15954226.0	5396.327503	6480.19
4	4	16575.0	99999.0	5999.0	12.0	138279069.0	8089.863295	8342.13
5	5	4662.0	31500.0	2300.0	20.0	15414322.0	3344.689763	3305.67
6	6	10193.0	35990.0	1800.0	13.0	36457518.0	4562.233331	3576.37
7	7	2360.0	38900.0	2600.0	60.0	9905909.0	4752.584154	4195.64
8	8	2070.0	99999.0	2270.0	30.0	10017173.0	6053.233424	4836.88
9	9	7299.0	68530.0	1400.0	50.0	17805271.0	2975.342884	2439.08
10	10	13994.0	92900.0	5200.0	15.0	113034210.0	8244.695287	8076.76
11	11	2944.0	34500.0	2900.0	30.0	13398006.0	4722.160492	4549.41
12	12	1108.0	27490.0	2625.0	50.0	4494303.0	4066.959950	4052.57
13	13	3813.0	35000.0	1600.0	20.0	10675790.0	3073.915196	2799.11
14	14	16073.0	38990.0	1700.0	12.0	49076652.0	3605.595127	3053.17
15	15	1458.0	45000.0	8500.0	100.0	14373814.0	5425.058140	9851.83
16	16	2219.0	17900.0	2999.0	20.0	8078352.0	2450.906089	3638.90
17	17	913.0	55800.0	2200.0	15.0	3328679.0	3952.913330	3641.88
18	18	315.0	34599.0	1999.0	50.0	1519049.0	6358.409761	4807.12
19	19	1386.0	42350.0	2800.0	20.0	7228288.0	6186.538949	5211.45
20	20	1235.0	37800.0	1750.0	15.0	4292737.0	4400.529809	3473.09
21	21	1546.0	35999.0	4225.0	50.0	8856481.0	5257.235026	5724.94
22	22	1085.0	43900.0	3950.0	50.0	6543426.0	5877.140886	6025.25
23	23	183.0	64000.0	1200.0	99.0	597132.0	7333.695140	3245.28
24	24	630.0	99999.0	27450.0	15.0	20422776.0	19855.495201	32365.73
25	25	2059.0	22500.0	2500.0	25.0	7515546.0	3556.249839	3648.32
26	26	878.0	99999.0	5000.0	11.0	7242792.0	10282.987274	8239.81
27	27	2049.0	62900.0	4200.0	35.0	10862559.0	4853.289240	5298.81
28	28	633.0	39900.0	3790.0	80.0	3373957.0	4509.036301	5321.70
29	29	406.0	19990.0	5250.0	500.0	2459028.0	3639.737722	6041.84
30	30	940.0	23200.0	3295.0	50.0	3939145.0	3659.577291	4186.13
31	31	318.0	11000.0	1000.0	50.0	560155.0	1829.079211	1755.97
32	32	588.0	33500.0	2350.0	50.0	2360095.0	4394.596002	4006.95
33	33	201.0	65000.0	5600.0	980.0	1839801.0	9637.135323	9107.93
34	34	227.0	2900.0	999.0	60.0	231776.0	554.118445	1016.56
35	35	180.0	28900.0	950.0	50.0	297977.0	3325.933365	1646.28
36	36	228.0	20900.0	2250.0	150.0	816001.0	3922.715389	3563.32
37	37	331.0	86500.0	13250.0	550.0	5371844.0	13541.180315	16180.25
38	38	65.0	8999.0	2850.0	99.0	215620.0	2140.083145	3266.97
39	39	9.0	14500.0	1900.0	750.0	39480.0	5520.867233	3948.00
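
The per-brand loop above can also be written with a single `groupby(...).agg(...)` call. The following is a sketch on a tiny invented table, not a replacement that has been run on the competition data; the column names follow the loop, and the average keeps the loop's `count + 1` denominator.

```python
import pandas as pd

# Hypothetical toy listings; the real training set has 40 brands.
toy = pd.DataFrame({
    'brand': [0, 0, 0, 1, 1],
    'price': [1000, 3000, 0, 500, 1500],
})

valid = toy[toy['price'] > 0]   # same filter as in the loop: keep positive prices only
brand_fe = (valid.groupby('brand')['price']
            .agg(brand_amount='count',
                 brand_price_max='max',
                 brand_price_median='median',
                 brand_price_min='min',
                 brand_price_sum='sum',
                 brand_price_std='std')
            .reset_index())

# As in the loop, the sum is divided by (count + 1) rather than count,
# which pulls the average of brands with few listings toward zero.
brand_fe['brand_price_average'] = (
    brand_fe['brand_price_sum'] / (brand_fe['brand_amount'] + 1)).round(2)

print(brand_fe)
```

The resulting frame already has `brand` as a regular column, so it can be merged onto `data` with `how='left', on='brand'` in the same way as `brand_fe` above.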

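# Data bucketing, using power as an example
# Note that missing values are carried along too: values outside the bin edges, such as power = 0, get NaN as their bin
# Why bucket the data? There are many reasons:
# 1. After discretization, inner products of sparse vectors are faster to compute, and the results are easy to store and extend;
# 2. Discretized features are more robust to outliers: with age > 30 encoded as 1 and otherwise 0, an age of 200 will not disturb the model much;
# 3. LR is a generalized linear model with limited expressive power; after discretization each bucket gets its own weight, which introduces non-linearity and lets the model fit better;
# 4. Discretized features can be crossed, turning M+N variables into M*N variables, which introduces further non-linearity and expressive power;
# 5. The model is more stable after discretization: with user age intervals, for example, a user does not change just because they got one year older

# There are of course other reasons as well; LightGBM, for instance, added data bucketing when improving on XGBoost, which helps the model generalize

bins = [i*10 for i in range(31)]
# labels=False returns only the integer index of the bin each value falls into
data['power_bin'] = pd.cut(data['power'], bins, labels=False)
data[['power_bin', 'power']].head()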

	power_bin	power
0	5.0	60
1	NaN	0
2	16.0	163
3	19.0	193
4	6.0	68
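
The NaN in the second row comes from how `pd.cut` treats the bin edges: by default each interval is open on the left and closed on the right, so a power of 0 matches no interval. A small sketch with invented values shows this, together with the `include_lowest=True` option, which is one possible way to keep zeros in the first bucket (whether that is desirable here is a modelling choice the original does not make).

```python
import pandas as pd

bins = [i * 10 for i in range(31)]            # edges 0, 10, ..., 300
power = pd.Series([0, 10, 11, 60, 300, 301])  # hypothetical values

# Default: intervals are (left, right], so 0 and 301 fall outside every bin.
default_bin = pd.cut(power, bins, labels=False)

# include_lowest=True closes the first interval on the left, so 0 lands in bin 0.
with_zero = pd.cut(power, bins, labels=False, include_lowest=True)

print(pd.DataFrame({'power': power,
                    'default': default_bin,
                    'include_lowest': with_zero}))
```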

# Once the information has been extracted, the original columns can be dropped
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)

print(data.shape)
data.columns

(199037, 38)

Index(['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power',
       'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'price', 'v_0',
       'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
       'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city',
       'brand_amount', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_sum', 'brand_price_std',
       'brand_price_average', 'power_bin'],
      dtype='object')

# The data as it stands can already be fed to tree models, so we export it
data.to_csv('data_for_tree.csv', index=False)

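# We can construct another set of features for models such as LR and NN
# They are built separately because different models place different requirements on the dataset
# Let us look at the data distribution:
data['power'].plot.hist()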

<matplotlib.axes._subplots.AxesSubplot at 0x12dd20050>

[Figure: histogram of power over the combined train and test data; image unavailable (output_22_1.png)]

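# We already removed outliers from train, yet the distribution above still looks odd because of the power outliers in test
# So it would have been better not to delete the power outliers in train, and to truncate the long tail instead
train['power'].plot.hist()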

<matplotlib.axes._subplots.AxesSubplot at 0x12e680950>

[Figure: histogram of power in train after outlier removal; image unavailable (output_23_1.png)]
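
Truncating the long tail, as suggested in the comment above, keeps every row and only caps the extreme values, so it can be applied to train and test alike. A minimal sketch follows; the values and the cut-off `CAP = 600` are assumptions chosen for illustration, not numbers taken from the original post.

```python
import pandas as pd

# Hypothetical power values; 19312 stands in for an extreme entry.
power = pd.Series([60, 75, 163, 193, 376, 19312])

CAP = 600  # assumed cut-off, for illustration only
truncated = power.clip(upper=CAP)   # values above CAP are replaced by CAP

print(truncated.tolist())
print(len(truncated) == len(power))  # no rows are dropped
```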

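# Take the log first, then normalize
# Why: for long-tailed distributions it is generally advisable to take the log before normalizing
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()  # created here but not used below; the scaling is done by hand
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()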

<matplotlib.axes._subplots.AxesSubplot at 0x12e080950>

[Figure: histogram of power after the log transform and min-max normalization; image unavailable (output_24_1.png)]
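
Why the log comes before the normalization can be seen on a handful of numbers. In the sketch below the values and the helper name `log_min_max` are invented for illustration; it compares plain min-max scaling with log-then-min-max scaling on a long-tailed sample.

```python
import numpy as np

def log_min_max(x):
    """Apply log(1 + x), then min-max scale the result to [0, 1]."""
    logged = np.log1p(np.asarray(x, dtype=float))
    span = logged.max() - logged.min()
    if span == 0:                      # constant input: avoid dividing by zero
        return np.zeros_like(logged)
    return (logged - logged.min()) / span

# Hypothetical long-tailed sample: most values are small, one is huge.
power = np.array([0, 60, 163, 600, 19312], dtype=float)

scaled_raw = (power - power.min()) / (power.max() - power.min())
scaled_log = log_min_max(power)

print(np.round(scaled_raw, 3))   # typical values are squeezed close to 0
print(np.round(scaled_log, 3))   # typical values are spread over the interval
```

Without the log, a single extreme value determines the scale and the ordinary values become almost indistinguishable; with it, they occupy a usable part of the [0, 1] range.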

# kilometer looks fairly normal; it has probably been bucketed already
data['kilometer'].plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x137f21250>

[Figure: histogram of kilometer; image unavailable (output_25_1.png)]

# So we can normalize it directly
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / 
                        (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x137f86210>

[Figure: histogram of kilometer after min-max normalization; image unavailable (output_26_1.png)]

# In addition, there are the statistical features we just constructed:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# We will not analyze them one by one here and simply apply the same transform
def max_min(x):
    # Min-max scaling to [0, 1]
    return (x - np.min(x)) / (np.max(x) - np.min(x))

for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])

# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])

# This dataset can be used for LR
data.to_csv('data_for_lr.csv', index=False)
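
What `pd.get_dummies` does to a categorical column, and to its missing values, is easier to see on a toy frame. The sketch below uses an invented three-row table; `power_scaled` is a hypothetical column name standing in for any feature that is left untouched.

```python
import pandas as pd

# Hypothetical toy frame: gearbox is categorical and has one missing value.
toy = pd.DataFrame({
    'gearbox': [0.0, 1.0, None],
    'power_scaled': [0.2, 0.5, 0.9],
})

# Each category becomes its own indicator column; other columns are kept as they are.
encoded = pd.get_dummies(toy, columns=['gearbox'])
print(encoded.columns.tolist())

# By default a missing value gets no indicator at all (all indicator columns are 0).
indicator_cols = [c for c in encoded.columns if c.startswith('gearbox_')]
print(encoded[indicator_cols].sum(axis=1).tolist())

# dummy_na=True adds a separate indicator column for missing values.
with_na = pd.get_dummies(toy, columns=['gearbox'], dummy_na=True)
print(with_na.columns.tolist())
```

Columns with many distinct values, such as `model`, produce one indicator per value, so the encoded frame can become considerably wider than the input.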

3. Feature selection

1) Filter methods

# Correlation analysis
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))

0.5728285196051496
-0.4082569701616764
0.058156610025581514
0.3834909576057687
0.259066833880992
0.38691042393409447

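# We can of course also look at a plot directly
# Note: 'price' is not in the column list below, so the heatmap shows the correlations
# among the features themselves (corr() uses Pearson by default)
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average', 
                     'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)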

<matplotlib.axes._subplots.AxesSubplot at 0x13807fc10>

[Figure: correlation heatmap of the selected numeric features; image unavailable (output_32_1.png)]

2) Wrapper methods
