Tianchi Used-Car Price Prediction: Feature Engineering

weixin_43520514

Article link: https://blog.csdn.net/weixin_43520514/article/details/105162336

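Feature Engineering

1. Goals and main tasks of feature engineering

The goal of feature engineering is to analyze and construct features further, turning the data into features that better represent the underlying problem and thereby improving model performance.
Common tasks include:

Outlier handling:
  - Remove outliers identified by box-plot (or 3-sigma) analysis;
  - Box-Cox transformation (for skewed distributions);
  - Long-tail truncation;
Feature normalization/standardization:
  - Standardization (rescale to zero mean and unit variance);
  - Normalization (rescale to the [0, 1] interval);
  - For power-law distributions, a log-based transformation formula can be used;
Data bucketing:
  - Equal-frequency bucketing;
  - Equal-width bucketing;
  - Best-KS bucketing (similar to binary splitting with the Gini index);
  - Chi-square bucketing;
Missing-value handling:
  - Leave as is (for tree models such as XGBoost);
  - Delete (when too much data is missing);
  - Imputation, including mean/median/mode, model-based prediction, multiple imputation, compressed-sensing completion, matrix completion, etc.;
  - Binning, with missing values in a bin of their own;
Feature construction:
  - Statistical features such as count, sum, ratio, and standard deviation;
  - Time features, including relative and absolute time, holidays, weekends, etc.;
  - Geographic information, including binning, distribution encoding, etc.;
  - Nonlinear transformations, including log, square, square root, etc.;
  - Feature combination and feature crossing;
  - Beyond that, it is a matter of judgment.
Feature selection:
  - Filter: select features first, then train the learner; common methods are Relief, variance threshold, correlation coefficient, chi-square test, and mutual information;
  - Wrapper: use the performance of the final learner directly as the criterion for evaluating a feature subset; a common method is LVW (Las Vegas Wrapper);
  - Embedded: combines filter and wrapper, with feature selection happening automatically while the learner is trained; a common example is lasso regression;
Dimensionality reduction:
  - PCA/LDA/ICA;
  - Feature selection is also a form of dimensionality reduction.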
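
For reference, the two scaling items in the list above can be written out as follows, with \mu and \sigma the mean and standard deviation of the feature. The third line is an assumption: the list does not give its formula for power-law data, and the median-anchored log transform shown here is only one commonly used choice, not necessarily the one the author had in mind.

```latex
\begin{aligned}
z   &= \frac{x - \mu}{\sigma} && \text{(standardization)} \\
x'  &= \frac{x - x_{\min}}{x_{\max} - x_{\min}} && \text{(normalization to } [0, 1]\text{)} \\
x'' &= \log\frac{1 + x}{1 + \operatorname{median}(x)} && \text{(assumed example for power-law data)}
\end{aligned}
```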

2. Code implementation

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter
%matplotlib inline

0. Load the data

train = pd.read_csv('used_car_train_20200313.csv', sep = ' ')
test = pd.read_csv('used_car_testA_20200313.csv', sep = ' ')
print(train.shape)
print(test.shape)

(150000, 31)
(50000, 30)

pd.concat([train.head(), train.tail()])

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	1	2.0	0.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	15	1.0	0.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	10	0.0	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	5	1.0	0.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482
149995	149995	163978	20000607	121.0	10	4.0	0.0	1.0	163	15.0	...	0.280264	0.000310	0.048441	0.071158	0.019174	1.988114	-2.983973	0.589167	-1.304370	-0.302592
149996	149996	184535	20091102	116.0	11	0.0	0.0	0.0	125	10.0	...	0.253217	0.000777	0.084079	0.099681	0.079371	1.839166	-2.774615	2.553994	0.924196	-0.272160
149997	149997	147587	20101003	60.0	11	1.0	1.0	0.0	90	6.0	...	0.233353	0.000705	0.118872	0.100118	0.097914	2.439812	-1.630677	2.290197	1.891922	0.414931
149998	149998	45907	20060312	34.0	10	3.0	1.0	0.0	156	15.0	...	0.256369	0.000252	0.081479	0.083558	0.081498	2.075380	-2.633719	1.414937	0.431981	-1.659014
149999	149999	177672	19990204	19.0	28	6.0	0.0	1.0	193	12.5	...	0.284475	0.000000	0.040072	0.062543	0.025819	1.978453	-3.179913	0.031724	-1.483350	-0.342674

10 rows × 31 columns

pd.concat([test.head(), test.tail()])

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	150000	66932	20111212	222.0	4	5.0	1.0	1.0	313	15.0	...	0.264405	0.121800	0.070899	0.106558	0.078867	-7.050969	-0.854626	4.800151	0.620011	-3.664654
1	150001	174960	19990211	19.0	21	0.0	0.0	0.0	75	12.5	...	0.261745	0.000000	0.096733	0.013705	0.052383	3.679418	-0.729039	-3.796107	-1.541230	-0.757055
2	150002	5356	20090304	82.0	21	0.0	0.0	0.0	109	7.0	...	0.260216	0.112081	0.078082	0.062078	0.050540	-4.926690	1.001106	0.826562	0.138226	0.754033
3	150003	50688	20100405	0.0	0	0.0	0.0	1.0	160	7.0	...	0.260466	0.106727	0.081146	0.075971	0.048268	-4.864637	0.505493	1.870379	0.366038	1.312775
4	150004	161428	19970703	26.0	14	2.0	0.0	0.0	75	15.0	...	0.250999	0.000000	0.077806	0.028600	0.081709	3.616475	-0.673236	-3.197685	-0.025678	-0.101290
49995	199995	20903	19960503	4.0	4	4.0	0.0	0.0	116	15.0	...	0.284664	0.130044	0.049833	0.028807	0.004616	-5.978511	1.303174	-1.207191	-1.981240	-0.357695
49996	199996	708	19991011	0.0	0	0.0	0.0	0.0	75	15.0	...	0.268101	0.108095	0.066039	0.025468	0.025971	-3.913825	1.759524	-2.075658	-1.154847	0.169073
49997	199997	6693	20040412	49.0	1	0.0	1.0	1.0	224	15.0	...	0.269432	0.105724	0.117652	0.057479	0.015669	-4.639065	0.654713	1.137756	-1.390531	0.254420
49998	199998	96900	20020008	27.0	1	0.0	0.0	1.0	334	15.0	...	0.261152	0.000490	0.137366	0.086216	0.051383	1.833504	-2.828687	2.465630	-0.911682	-2.057353
49999	199999	193384	20041109	166.0	6	1.0	NaN	1.0	68	9.0	...	0.228730	0.000300	0.103534	0.080625	0.124264	2.914571	-1.135270	0.547628	2.094057	-1.552150

10 rows × 30 columns

train.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')

test.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')

1. Remove outliers
Box plot: a statistical chart that shows how a set of data is spread out; see https://m.sohu.com/a/220236877_434937
Drawing box plots with seaborn: see https://blog.csdn.net/LuohenYJ/article/details/90677918

# Wrap outlier handling in a function so it can be reused
# Remove outliers identified by box-plot analysis
def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers; by default uses the box-plot rule with scale=3.
    :param data: pandas DataFrame (assumed to have the default 0..n-1 index)
    :param col_name: name of the column to clean
    :param scale: multiple of the IQR used for the bounds
    :return: a copy of data with the outlier rows removed
    """
    
    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule.
        :param data_ser: pandas.Series
        :param box_scale: multiple of the IQR
        :return: (lower mask, upper mask), (lower bound, upper bound)
        """
        # Scaled interquartile range, then the lower and upper bounds
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        # Boolean masks for values outside the bounds
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)
    
    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positions of the outlier rows; drop() treats them as labels,
    # which is only correct with a default RangeIndex
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is：{}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    # Note: despite the wording, this prints the number of remaining rows
    print("Now column number is：{}".format(data_n.shape[0]))
    # Summary of the values below the lower bound
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is：")
    print(pd.Series(outliers).describe())
    # Summary of the values above the upper bound
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data more than the upper bound is：")
    print(pd.Series(outliers).describe())
    
    # Box plots before (left) and after (right) removal
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n

# The function above removes the outlier rows for a given column
# Whether to delete is a judgment call, but rows in test must not be deleted

train = outliers_proc(train, 'power', scale=3)

Delete number is：963
Now column number is：149037
Description of data less than the lower bound is：
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data more than the upper bound is：
count      963.000000
mean       846.836968
std       1929.418081
min        376.000000
25%        400.000000
50%        436.000000
75%        514.000000
max      19312.000000
Name: power, dtype: float64
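
To make the box-plot rule concrete, here is a minimal sketch on made-up numbers. The helper name `iqr_bounds` and the toy values are assumptions for illustration and are not part of the original code; the sketch filters with a boolean mask, so it does not rely on the index being 0..n-1.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, scale: float = 3.0):
    """Return (lower, upper) box-plot bounds: Q1 - scale*IQR and Q3 + scale*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

# Toy data (made up): nine ordinary values and one extreme value.
toy = pd.DataFrame({"power": [60, 75, 90, 100, 110, 120, 130, 150, 163, 19312]})
low, up = iqr_bounds(toy["power"], scale=3)

# Keep only the rows whose value lies inside the bounds.
kept = toy[toy["power"].between(low, up)]
print(low, up, len(kept))
```

With these values Q1 is 92.5 and Q3 is 145, so the bounds are -65 and 302.5 and only the extreme value is dropped.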

2. Feature construction

# Put the training and test sets together so features can be built on both at once
train['train'] = 1
test['train'] = 0
data = pd.concat([train, test], ignore_index=True, sort=False)

data.shape

(199037, 32)

pd.concat([data.head(), data.tail()])

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14	train
0	0	736	20040402	30.0	6	1.0	0.0	0.0	60	12.5	...	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762	1
1	1	2262	20030301	40.0	1	2.0	0.0	0.0	0	15.0	...	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522	1
2	2	14874	20040403	115.0	15	1.0	0.0	0.0	163	12.5	...	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963	1
3	3	71865	19960908	109.0	10	0.0	0.0	1.0	193	15.0	...	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699	1
4	4	111080	20120103	110.0	5	1.0	0.0	0.0	68	5.0	...	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482	1
199032	199995	20903	19960503	4.0	4	4.0	0.0	0.0	116	15.0	...	0.130044	0.049833	0.028807	0.004616	-5.978511	1.303174	-1.207191	-1.981240	-0.357695	0
199033	199996	708	19991011	0.0	0	0.0	0.0	0.0	75	15.0	...	0.108095	0.066039	0.025468	0.025971	-3.913825	1.759524	-2.075658	-1.154847	0.169073	0
199034	199997	6693	20040412	49.0	1	0.0	1.0	1.0	224	15.0	...	0.105724	0.117652	0.057479	0.015669	-4.639065	0.654713	1.137756	-1.390531	0.254420	0
199035	199998	96900	20020008	27.0	1	0.0	0.0	1.0	334	15.0	...	0.000490	0.137366	0.086216	0.051383	1.833504	-2.828687	2.465630	-0.911682	-2.057353	0
199036	199999	193384	20041109	166.0	6	1.0	NaN	1.0	68	9.0	...	0.000300	0.103534	0.080625	0.124264	2.914571	-1.135270	0.547628	2.094057	-1.552150	0

10 rows × 32 columns

# Usage time: data['creatDate'] - data['regDate'] reflects how long the car has been in use; in general, the longer the usage time, the lower the price
# Note: some dates in the data are malformed, so errors='coerce' is needed
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') - 
                    pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

pd.concat([data.head(), data.tail()]).iloc[:, 8:18]

	power	kilometer	notRepairedDamage	regionCode	creatDate	price	v_0	v_1
0	60	12.5	0.0	1046	20160404	1850.0	43.357796	3.966344
1	0	15.0	-	4366	20160309	3600.0	45.305273	5.236112
2	163	12.5	0.0	2806	20160402	6222.0	45.978359	4.823792
3	193	15.0	0.0	434	20160312	2400.0	45.687478	4.492574
4	68	5.0	0.0	6977	20160313	5200.0	44.383511	2.031433
199032	116	15.0	0.0	3219	20160320	NaN	45.621391	5.958453
199033	75	15.0	0.0	1857	20160329	NaN	43.935162	4.476841
199034	224	15.0	0.0	3452	20160305	NaN	46.537137	4.170806
199035	334	15.0	0.0	1998	20160404	NaN	46.771359	-3.296814
199036	68	9.0	0.0	3276	20160322	NaN	43.731010	-3.121867

# Count the missing values in used_time
data['used_time'].isnull().sum()

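About 15k values are missing. These rows can be deleted, or simply left as they are.
Tree-based models such as XGBoost handle missing values natively, so nothing needs to be done if such a model is used.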
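
The missing values come from dates that cannot be parsed. A minimal sketch of the mechanism, using two values in the dataset's YYYYMMDD format (20020008, which appears in regDate above, has month "00"); the helper name `to_date` is an assumption for illustration:

```python
import pandas as pd

# Two rows in the dataset's integer YYYYMMDD format.
# 20020008 has month "00", which is not a valid calendar date.
toy = pd.DataFrame({
    "regDate":   [20040402, 20020008],
    "creatDate": [20160404, 20160404],
})

def to_date(col: pd.Series) -> pd.Series:
    # Cast to str first so the format string is matched against the digits.
    # errors="coerce" turns unparseable values into NaT instead of raising.
    return pd.to_datetime(col.astype(str), format="%Y%m%d", errors="coerce")

toy["used_time"] = (to_date(toy["creatDate"]) - to_date(toy["regDate"])).dt.days
print(toy)
```

The first row gives 4385 days, matching the used_time shown for SaleID 0 in this post; the second row becomes NaN because its regDate is coerced to NaT.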

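# Extract city information from the postal code. The data comes from Germany, so German postal codes are used as the reference, which amounts to adding prior knowledge
# Note: str(x)[:-3] drops the last three digits, so codes with three or fewer digits yield an empty string
data['city'] = data['regionCode'].apply(lambda x : str(x)[:-3])
pd.concat([data.head(), data.tail()])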

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_8	v_9	v_10	v_11	v_12	v_13	v_14	train	used_time	city
0	0	736	20040402	30.0	6	1.0	0.0	0.0	60	12.5	...	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762	1	4385.0	1
1	1	2262	20030301	40.0	1	2.0	0.0	0.0	0	15.0	...	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522	1	4757.0	4
2	2	14874	20040403	115.0	15	1.0	0.0	0.0	163	12.5	...	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963	1	4382.0	2
3	3	71865	19960908	109.0	10	0.0	0.0	1.0	193	15.0	...	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699	1	7125.0
4	4	111080	20120103	110.0	5	1.0	0.0	0.0	68	5.0	...	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482	1	1531.0	6
199032	199995	20903	19960503	4.0	4	4.0	0.0	0.0	116	15.0	...	0.028807	0.004616	-5.978511	1.303174	-1.207191	-1.981240	-0.357695	0	7261.0	3
199033	199996	708	19991011	0.0	0	0.0	0.0	0.0	75	15.0	...	0.025468	0.025971	-3.913825	1.759524	-2.075658	-1.154847	0.169073	0	6014.0	1
199034	199997	6693	20040412	49.0	1	0.0	1.0	1.0	224	15.0	...	0.057479	0.015669	-4.639065	0.654713	1.137756	-1.390531	0.254420	0	4345.0	3
199035	199998	96900	20020008	27.0	1	0.0	0.0	1.0	334	15.0	...	0.086216	0.051383	1.833504	-2.828687	2.465630	-0.911682	-2.057353	0	NaN	1
199036	199999	193384	20041109	166.0	6	1.0	NaN	1.0	68	9.0	...	0.080625	0.124264	2.914571	-1.135270	0.547628	2.094057	-1.552150	0	4151.0	3

10 rows × 34 columns

# Compute per-brand sales statistics
# The statistics must be computed from the training set only
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    # Ignore rows without a positive price
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amout'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    # Smoothed mean: the sum is divided by (count + 1) rather than count
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index":"brand"})
data = data.merge(brand_fe, how = 'left', on='brand')
data.shape
data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14', 'train', 'used_time', 'city', 'brand_amout',
       'brand_price_average', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_std', 'brand_price_sum'],
      dtype='object')
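
The same per-brand statistics can also be written with `groupby(...).agg(...)`. The following is a sketch on made-up rows, not the author's code; it keeps the column names used in the post (including the spelling `brand_amout`) and reproduces the smoothed mean sum / (count + 1).

```python
import pandas as pd

# Toy training rows (made up); the post uses the full train DataFrame.
toy_train = pd.DataFrame({
    "brand": [0, 0, 0, 1, 1],
    "price": [1000.0, 3000.0, 5000.0, 2000.0, 4000.0],
})

brand_stats = (
    toy_train[toy_train["price"] > 0]
    .groupby("brand")["price"]
    .agg(
        brand_amout="count",        # spelling kept to match the post's column name
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)
# Same smoothed mean as in the loop version: sum / (count + 1).
brand_stats["brand_price_average"] = (
    brand_stats["brand_price_sum"] / (brand_stats["brand_amout"] + 1)
).round(2)
print(brand_stats)
```

The resulting frame has one row per brand and can be merged onto the combined data with `how='left', on='brand'` exactly as above.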

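# Data bucketing, using power as an example
# Why bucket data:
# 1) After discretization, inner products of sparse vectors are faster to compute, and the results are easy to store and extend
# 2) Discretized features are more robust to outliers: with a feature such as age > 30 -> 1, else 0, an age of 200 no longer distorts the model much
# 3) LR is a generalized linear model with limited expressive power; after discretization each bucket gets its own weight, which introduces nonlinearity and can improve performance
# 4) Discretized features can be crossed with one another, which increases expressive power
# 5) The model is more stable after discretization: with age intervals, a user does not change bucket just by getting one year older

# Intervals are right-closed, (0, 10], (10, 20], ..., (290, 300], so power == 0 and power > 300 map to NaN
bins = [i * 10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bins, labels=False)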
data[['power_bin', 'power']].head()

	power_bin	power
0	5.0	60
1	NaN	0
2	16.0	163
3	19.0	193
4	6.0	68
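
The NaN for power 0 in the output above follows from how `pd.cut` treats the bin edges. A small sketch on a few hand-picked values shows the default behavior next to `include_lowest=True`, which is offered here only as an option to be aware of, not as what the post does:

```python
import pandas as pd

power = pd.Series([0, 10, 60, 163, 300, 376])
bins = [i * 10 for i in range(31)]   # edges 0, 10, ..., 300

# Default: right-closed intervals (0, 10], (10, 20], ...; 0 falls outside the first one.
default_bin = pd.cut(power, bins, labels=False)
# include_lowest=True makes the first interval include its left edge.
lowest_bin = pd.cut(power, bins, labels=False, include_lowest=True)

print(pd.DataFrame({"power": power, "default": default_bin, "include_lowest": lowest_bin}))
```

In both variants values above the last edge (376 here) still map to NaN, so the remaining outliers in test end up without a bucket.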

# Drop the columns that have already been used to build features
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)

data = data.drop('SaleID', axis=1)
print(data.shape)
data.columns

(199037, 38)

Index(['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power',
       'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'price', 'v_0',
       'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
       'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city',
       'brand_amout', 'brand_price_average', 'brand_price_max',
       'brand_price_median', 'brand_price_min', 'brand_price_std',
       'brand_price_sum', 'power_bin'],
      dtype='object')

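# The data can now be used by tree models, so export it
data.to_csv('data_for_tree.csv', index=0)

# Build a second dataset for models such as LR and NN
# Different models place different requirements on the data
# Look at the distribution
data['power'].plot.hist()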

<matplotlib.axes._subplots.AxesSubplot at 0x5b70128>

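# We just removed outliers from train; the distribution still looks this extreme because test still contains outliers
# So it is better not to delete the outliers in train; long-tail truncation can be used instead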
train['power'].plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x9e93208>
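
Long-tail truncation can be sketched with `Series.clip`: values above a cap are replaced by the cap, so no rows are removed and the same rule can be applied to test. The toy values and the cap of 375 are assumptions for illustration; 375 is chosen only because the smallest value removed by the box-plot rule above was 376.

```python
import pandas as pd

# Toy stand-ins for the 'power' column of train and test (values made up).
toy_train = pd.Series([0, 60, 75, 110, 163, 193, 19312], name="power")
toy_test = pd.Series([68, 334, 12000], name="power")

# Assumed cap, just below the smallest value that the box-plot rule removed (376).
cap = 375

train_clipped = toy_train.clip(upper=cap)
test_clipped = toy_test.clip(upper=cap)   # same cap, and no test rows are dropped

print(train_clipped.tolist())
print(test_clipped.tolist())
```

In practice the cap would be derived from the training data, for example from the box-plot upper bound or a high quantile.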

# Take the log, then apply min-max normalization
# (the scaling is written out by hand; sklearn's MinMaxScaler would be an alternative)
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power']-np.min(data['power'])) / (np.max(data['power'])-np.min(data['power'])))
data['power'].plot.hist()

# kilometer looks reasonable; it appears to have been bucketed already
data['kilometer'].plot.hist()

# Apply min-max normalization to kilometer directly
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
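
The scaling above takes the minimum and maximum over train and test combined. A common variant, shown here only as a sketch and not as the author's method, takes them from the training rows alone and applies them to all rows. The helper name `min_max_from_train` and the toy values are assumptions.

```python
import pandas as pd

# Toy combined frame (values made up); 'train' marks training rows as in the post.
toy = pd.DataFrame({
    "kilometer": [12.5, 15.0, 5.0, 0.5, 9.0],
    "train":     [1,    1,    1,   0,   0],
})

def min_max_from_train(frame: pd.DataFrame, col: str) -> pd.Series:
    """Scale `col` using the min and max of the training rows only."""
    train_values = frame.loc[frame["train"] == 1, col]
    lo, hi = train_values.min(), train_values.max()
    return (frame[col] - lo) / (hi - lo)

toy["kilometer_scaled"] = min_max_from_train(toy, "kilometer")
print(toy)
```

With this variant test values can fall outside [0, 1] (the 0.5 above becomes negative), which is the price of not looking at the test set when fitting the scaler.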

data.columns

Index(['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power',
       'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'price', 'v_0',
       'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
       'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city',
       'brand_amout', 'brand_price_average', 'brand_price_max',
       'brand_price_median', 'brand_price_min', 'brand_price_std',
       'brand_price_sum', 'power_bin'],
      dtype='object')

# Min-max scale the brand statistics constructed earlier
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Each column is scaled with its own min and max. The outputs printed later in this post
# were produced with brand_price_max derived from brand_price_average, which is why
# those two columns show the same correlation with price there.
for col in ['brand_amout', 'brand_price_average', 'brand_price_max', 'brand_price_median',
            'brand_price_min', 'brand_price_std', 'brand_price_sum']:
    data[col] = max_min(data[col])

# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'power_bin'])

print(data.shape)
data.columns

(199037, 369)

Index(['name', 'power', 'kilometer', 'seller', 'offerType', 'price', 'v_0',
       'v_1', 'v_2', 'v_3',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=369)
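
What `pd.get_dummies` does to a single column can be seen on a tiny made-up frame. The sketch also shows `dummy_na=True`, which adds an indicator column for missing values; the post does not use that option, so treat it as an aside.

```python
import pandas as pd

# Toy frame (made up): gearbox has one missing value.
toy = pd.DataFrame({
    "gearbox": [0.0, 1.0, None, 0.0],
    "power":   [60, 163, 68, 193],
})

# One indicator column per category; the original column is replaced.
encoded = pd.get_dummies(toy, columns=["gearbox"])
# With dummy_na=True, missing values get an indicator column of their own.
encoded_with_na = pd.get_dummies(toy, columns=["gearbox"], dummy_na=True)

print(encoded.columns.tolist())
print(encoded_with_na.columns.tolist())
```

Without `dummy_na`, the row with the missing gearbox simply has zeros in every gearbox indicator column.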

# This dataset can be used for LR
data.to_csv('data_for_lr.csv', index=0)

3. Feature selection
1) Filter methods

# Correlation analysis
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amout'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))

0.5728285196051496
-0.4082569701616764
0.058156610025581514
0.3834909576057687
0.3834909576057687
0.38691042393409447
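
Spearman correlation depends only on the ranks of the values, so strictly increasing transforms such as min-max scaling or a log leave it unchanged. A minimal sketch on made-up numbers, computing Spearman as the Pearson correlation of the ranks (the helper name `spearman` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (made up): price falls as mileage rises, but not linearly.
kilometer = pd.Series([0.5, 3.0, 5.0, 9.0, 12.5, 15.0])
price = pd.Series([30000.0, 16000.0, 9000.0, 5200.0, 1850.0, 2400.0])

def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return a.rank().corr(b.rank())

rho_raw = spearman(kilometer, price)

# Strictly increasing transforms keep the ranks, and therefore rho, unchanged.
scaled = (kilometer - kilometer.min()) / (kilometer.max() - kilometer.min())
rho_scaled = spearman(scaled, price)
rho_log = spearman(np.log(kilometer + 1), price)

print(rho_raw, rho_scaled, rho_log)
```

This is why the normalization steps earlier in the post do not change the Spearman values, and why a feature that is a shifted copy of another has exactly the same Spearman correlation with price.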

# Analyze correlations with a heatmap (price is included so the plot matches its title)
data_numeric = data[['price', 'power', 'kilometer', 'brand_amout', 'brand_price_average', 'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()
f, ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)

pip install mlxtend

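# Wrapper-style selection: sequential forward selection with the learner itself as the judge
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
# Select 10 features, scored by R^2; cv=0 disables cross-validation
sfs = SFS(LinearRegression(), k_features=10, forward=True, floating=False, scoring='r2', cv=0)
# Fit on the labelled training rows only: the test rows have no price
train_rows = data[data['train'] == 1]
x = train_rows.drop(['price', 'train'], axis=1)
# Coerce string columns such as city to numbers, then fill the gaps with 0
x = x.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
y = train_rows['price']
sfs.fit(x, y)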
sfs.k_feature_names_
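
To show what forward sequential selection does conceptually, here is a hand-rolled sketch in NumPy on synthetic data. It is an illustration of the greedy idea under simple assumptions (ordinary least squares, R^2 on the training data, no cross-validation), not a reimplementation of mlxtend; the function names `r2_of_fit` and `forward_select` are made up for this example.

```python
import numpy as np

def r2_of_fit(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit with intercept, measured on the same data."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)

def forward_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Greedy forward selection: k times, add the column that raises R^2 the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: r2_of_fit(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: y depends on columns 0 and 2 only; columns 1 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

chosen = forward_select(X, y, k=2)
print(chosen)
```

The selector picks column 0 first (largest effect) and column 2 second, ignoring the noise columns. Because every candidate requires refitting the model, the cost grows quickly with the number of features, which matters for a frame with several hundred one-hot columns.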

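3. Summary

Feature engineering is the most critical part of a competition. This is especially true in traditional competitions, where everyone's models tend to be similar and hyperparameter tuning brings only limited gains, while the quality of the feature engineering often decides the final ranking and score.
The main purpose of feature engineering is to turn the data into features that better represent the underlying problem, and thereby improve model performance. For example, outlier handling removes noise, and filling in missing values can bring in prior knowledge.
Feature construction is also part of feature engineering; its purpose is to strengthen the expressiveness of the data.
In some competitions the features are anonymous, so the relationships between them are unknown and they can only be processed on their own terms: binning, groupby, agg, and similar operations to compute feature statistics; further transformations such as log and exp; arithmetic on several features (like the usage time computed above); polynomial combinations, followed by selection. Anonymity limits what can be done with the features, although using an NN to extract features sometimes works surprisingly well.
For non-anonymous features whose meaning is known, especially in industrial-style competitions, more meaningful features are built from signal processing, frequency-domain extraction, kurtosis, skewness, and so on. This is feature construction informed by domain background. Recommender systems work the same way: click-through-rate statistics of various kinds, statistics per time period, statistics combined with user attributes, and so on. This kind of feature construction usually requires a deep analysis of the underlying business logic or physical principles in order to find the "magic" features.
Feature engineering is also tied to the model, which is why bucketing and feature normalization are done for LR and NN. The effect of a feature transformation, and feature importance in general, usually has to be validated through the model.
Overall, feature engineering is easy to get started with but very hard to master.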
