Used-Car Task 3: Feature Engineering

Common feature-engineering steps include:

Outlier handling:

Remove outliers identified with a box plot (or the 3-sigma rule);
Box-Cox transform (for skewed distributions);
Long-tail truncation;

Feature normalization / standardization:

Standardization (transform to a standard normal distribution);
Normalization (rescale to the [0, 1] interval);
For power-law distributions, apply the transform $\log\left(\frac{1+x}{1+\text{median}}\right)$ (sketched below);
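
A minimal sketch of the three transforms above on a toy series (the values are illustrative):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 10, 100], dtype=float)

standardized = (s - s.mean()) / s.std()          # zero mean, unit variance
min_max = (s - s.min()) / (s.max() - s.min())    # rescaled to [0, 1]
log_median = np.log((1 + s) / (1 + s.median()))  # the power-law transform above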

Data binning (the sketch after this list illustrates the first two):

Equal-frequency binning;
Equal-width binning;
Best-KS binning (similar in spirit to binary splitting with the Gini index);
Chi-square binning;
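
Equal-frequency and equal-width binning map directly onto pandas' qcut and cut; Best-KS and chi-square binning usually need custom or third-party implementations. A toy sketch:

import pandas as pd

s = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

equal_freq = pd.qcut(s, q=4, labels=False)     # roughly the same number of samples per bin
equal_width = pd.cut(s, bins=4, labels=False)  # the same value range per bin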

Missing-value handling (a few one-liners are sketched after this list):

Do nothing (fine for tree models such as XGBoost);
Delete (when too much of the data is missing);
Impute: mean / median / mode / model-based prediction / multiple imputation / compressed-sensing completion / matrix completion, etc.;
Bin, giving the missing values a bucket of their own;
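
A sketch of the most common imputations on a toy frame (the column names merely echo this dataset):

import numpy as np
import pandas as pd

df = pd.DataFrame({'bodyType': [1.0, np.nan, 3.0, 3.0],
                   'fuelType': [0.0, 5.0, np.nan, 0.0]})
df['bodyType'] = df['bodyType'].fillna(df['bodyType'].mode()[0])  # mode imputation
df['fuelType'] = df['fuelType'].fillna(df['fuelType'].median())   # median imputation
# Tree models such as XGBoost / LightGBM can instead be fed the NaNs directly.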

Feature construction:

Statistical aggregates such as counts, sums, ratios, and standard deviations;
Time features, both relative and absolute, plus holidays, weekends, etc. (sketched below);
Geographic information, e.g. binning and distribution encoding;
Non-linear transforms such as log, square, and square root;
Feature combinations and feature crosses;
Beyond that, it is a matter of taste and experience.
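
A sketch of simple calendar features pulled from a date column (regDate as in this dataset; the rows are made up):

import pandas as pd

df = pd.DataFrame({'regDate': ['20040318', '20101021']})
dt = pd.to_datetime(df['regDate'], format='%Y%m%d', errors='coerce')
df['reg_year'] = dt.dt.year
df['reg_month'] = dt.dt.month
df['reg_weekday'] = dt.dt.weekday                     # 0 = Monday
df['reg_is_weekend'] = (dt.dt.weekday >= 5).astype(int)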

Feature selection

Filter: select features before training the learner; common methods include Relief, variance thresholding, correlation coefficients, the chi-square test, and mutual information;
Wrapper: use the performance of the final learner itself as the criterion for evaluating feature subsets; a common method is LVW (Las Vegas Wrapper);
Embedded: a compromise between filter and wrapper in which feature selection happens automatically while the learner trains; Lasso regression is the classic example (a sketch follows this list);
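
A minimal sketch of embedded selection, Lasso wrapped in sklearn's SelectFromModel, on synthetic data:

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.randn(200)  # only features 0 and 2 matter

selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())  # boolean mask of the features Lasso kept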

Dimensionality reduction

PCA / LDA / ICA (a PCA sketch follows);
Feature selection is itself a form of dimensionality reduction.
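
For instance, PCA can be asked to keep just enough components to explain a target share of the variance (a sketch on random data):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(300, 10)
pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())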
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter
'''itemgetter fetches the item at a given position of an object;
its argument is the index of the position to fetch, e.g.:
a = [1, 2, 3]
b = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
get_1 = itemgetter(1)
get_1(a)  >>> 2
get_1(b)  >>> [4, 5, 6]'''
%matplotlib inline
train = pd.read_csv('car_train_0110.csv', sep=' ')
test = pd.read_csv('car_testA_0110.csv', sep=' ')
print(train.shape)
print(test.shape)
(250000, 40)
(50000, 39)
train.head()
   SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  \
0  134890     734  20160002   13.0      9       NaN       0.0      1.0
1  306648  196973  20080307   72.0      9       7.0       5.0      1.0
2  340675   25347  20020312   18.0     12       3.0       0.0      1.0
3   57332    5382  20000611   38.0      8       7.0       0.0      1.0
4  265235  173174  20030109   87.0      0       5.0       5.0      1.0

   power  kilometer  ...      v_14      v_15       v_16      v_17  \
0      0       15.0  ...  0.092139  0.000000  18.763832 -1.512063
1    173       15.0  ...  0.001070  0.122335  -5.685612 -0.489963
2     50       12.5  ...  0.064410  0.003345  -3.295700  1.816499
3     54       15.0  ...  0.069231  0.000000  -3.405521  1.497826
4    131        3.0  ...  0.000099  0.001655  -4.475429  0.124138

       v_18       v_19      v_20      v_21      v_22      v_23
0 -1.008718 -12.100623 -0.947052  9.077297  0.581214  3.945923
1 -2.223693  -0.226865 -0.658246 -3.949621  4.593618 -1.145653
2  3.554439  -0.683675  0.971495  2.625318 -0.851922 -1.246135
3  4.782636   0.039101  1.227646  3.040629 -0.801854 -1.251894
4  1.364567  -0.319848 -1.131568 -3.303424 -1.998466 -1.279368

[5 rows x 40 columns]
test.head()
   SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  \
0  720326     505  20060505   19.0     13       7.0       0.0      1.0
1  714316    1836  20010301    5.0      5       3.0       4.0      1.0
2  704693  212291  20170610    6.0     18       NaN       5.0      0.0
3  624972    1345  19820005  215.0     32       7.0       0.0      1.0
4  669753    1428  20060205   30.0      4       7.0       5.0      1.0

   power  kilometer  ...      v_14      v_15      v_16      v_17  \
0     90        8.0  ...  0.083340  0.105382 -5.998993  0.147048
1     75       15.0  ...  0.074478  0.000000 -3.287221  2.081317
2    150       15.0  ...  0.002032  0.000000  4.368218  8.252188
3      0        6.0  ...  0.098806  0.100883 -2.537486  0.513955
4    122       15.0  ...  0.088397  0.002509 -6.197633 -0.191814

       v_18       v_19      v_20      v_21      v_22      v_23
0 -1.902847   0.348990  2.324961  3.343910  4.048742 -1.431822
1  2.937052  -0.123018  1.202395  3.570743 -1.180587 -1.348598
2 -4.136109 -13.334970 -4.444620 -0.706978 -1.720218  3.569112
3  4.414962   0.357685  2.700732  5.323602  6.085956 -0.900585
4 -1.224360  -0.326985  2.254931  4.183037 -2.574004  0.014203

[5 rows x 39 columns]

1. The size of the box is determined by the interquartile range, IQR = Q3 - Q1 (Q3: the 75th percentile; Q1: the 25th percentile). 50% of the data falls inside the box: a large box means the data are dispersed and volatile, a small box means they are concentrated.

2. The top edge of the box is the upper quartile Q3, the bottom edge is the lower quartile Q1, and the horizontal line inside the box is the median Q2 (the 50th percentile).

3. The upper whisker marks the maximum and the lower whisker the minimum of the non-outlying points (the so-called upper and lower adjacent values).

4. Values > Q3 + 1.5 * IQR (the upper fence) or < Q1 - 1.5 * IQR (the lower fence) are treated as outliers; values > Q3 + 3 * IQR or < Q1 - 3 * IQR are treated as extreme values. In practice the boundary between outliers and extreme values is not shown, and both are usually just called outliers.

This also means the whiskers are not necessarily the data's maximum and minimum:
(1) If the maximum is smaller than the upper fence, the upper whisker ends at the observed maximum; if the maximum exceeds the upper fence, the whisker ends at the fence and the observed maximum is drawn as an outlying point.
(2) If the minimum is larger than the lower fence, the lower whisker ends at the observed minimum; if the minimum falls below the lower fence, the whisker ends at the fence and the observed minimum is drawn as an outlying point.
If that sounds complicated: anything outside the fences can simply be read as an outlier.

5. Skewness:

Symmetric distribution: the median line sits in the middle of the box, the two adjacent values are equally far from the box, and the outliers beyond the two fences are spread roughly evenly.
Right-skewed: the median is closer to the lower quartile, the upper adjacent value is farther from the box than the lower one, and most outliers lie beyond the upper fence.
Left-skewed: the median is closer to the upper quartile, the lower adjacent value is farther from the box than the upper one, and most outliers lie beyond the lower fence.

# A reusable helper for outlier removal; call it on any numeric column.
def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers from one column; by default uses the box-plot rule with scale=3.
    :param data: a pandas DataFrame
    :param col_name: the column name to clean
    :param scale: how many IQRs beyond the quartiles the fences sit
    :return: the cleaned DataFrame
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers via the box-plot rule.
        :param data_ser: a pandas Series
        :param box_scale: fence width in IQR multiples
        :return: (boolean rules, fence values)
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr  # lower fence
        val_up = data_ser.quantile(0.75) + iqr   # upper fence
        rule_low = (data_ser < val_low)  # mask of low outliers
        rule_up = (data_ser > val_up)    # mask of high outliers
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    # np.arange(n)[mask] keeps the positional indices where the mask is True.
    # Aside on shapes: for a 2-D array x of shape (10, 1024), x.shape[0] is the
    # number of rows (10), while x[0].shape is the shape of the first row, (1024,).
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    print("Now column number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
# Clean only the train data; never drop rows from test.
train = outliers_proc(train, 'power', scale=3)
Delete number is: 1290
Now column number is: 248710
Description of data less than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count     1290.000000
mean      1104.958915
std       2373.619469
min        392.000000
25%        420.000000
50%        476.000000
75%        579.000000
max      20000.000000
Name: power, dtype: float64

[Figure: box plots of power before and after outlier removal (output_9_1.png)]

3.3.2 Feature Construction

# Tag each row's origin, then concatenate train and test so the new features are built on both:
train['train']=1
test['train']=0
data = pd.concat([train, test], ignore_index=True, sort=False)
data
        SaleID    name   regDate  model  brand  ...      v_21      v_22      v_23  train
0       134890     734  20160002   13.0      9  ...  9.077297  0.581214  3.945923      1
1       306648  196973  20080307   72.0      9  ... -3.949621  4.593618 -1.145653      1
2       340675   25347  20020312   18.0     12  ...  2.625318 -0.851922 -1.246135      1
3        57332    5382  20000611   38.0      8  ...  3.040629 -0.801854 -1.251894      1
4       265235  173174  20030109   87.0      0  ... -3.303424 -1.998466 -1.279368      1
...        ...     ...       ...    ...    ...  ...       ...       ...       ...    ...
298705  375033    3803  20010407    6.0     29  ... -2.636560 -0.965214 -1.097192      0
298706  406556   28500  20071001  130.0     10  ... -3.495608  3.301887  3.947193      0
298707  511668   98383  19980102   23.0     10  ...  0.779931  1.822416  5.012697      0
298708  533139    1489  20031001   70.0      1  ... -2.513048 -3.310876 -1.589404      0
298709  592803     994  20070407   76.0      0  ... -5.802325  3.063008 -1.308131      0

[298710 rows x 41 columns]

(1) arg: int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame / dict-like. The object to convert to a datetime.

(2) errors: {'ignore', 'raise', 'coerce'}, default 'raise'

If 'raise', invalid parsing raises an exception.
If 'coerce', invalid parsing is set to NaT.
If 'ignore', invalid parsing returns the input.
(3) dayfirst: bool, default False

Specifies the parse order when arg is a str or list-like.
If True, the day is parsed first, e.g. 12/10/11 is parsed as 2011-10-12.
Warning: dayfirst=True is not strict; it merely prefers day-first parsing (a known bug, inherited from dateutil's behaviour).
(4) yearfirst: bool, default False

Specifies the parse order when arg is a str or list-like.
If True, the year is parsed first, so 10/11/12 is parsed as 2010-11-12.
If dayfirst and yearfirst are both True, yearfirst takes precedence (same as dateutil).
Warning: yearfirst=True is not strict; it merely prefers year-first parsing (a known bug, inherited from dateutil's behaviour).

data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') - 
                            pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
# The difference of two datetimes is a Timedelta, and .dt.days extracts the
# whole-day count; for a datetime column, .dt.day / .dt.year give calendar fields instead.
# Check the missing values: about 30k samples have an invalid date. We could drop
# them or leave them, but deleting is not recommended here because that is 7.5%
# of the total sample. Leave them for now: tree models such as XGBoost handle
# missing values natively, so nothing further is needed.
data['used_time'].isnull().sum()
30240
# Extract the city from the region code. The data is German, so this mirrors German postcodes, i.e. injecting prior knowledge.
data['city'] = data['regionCode'].apply(lambda x : str(x)[:-3])
# Compute per-brand sales statistics (you can build the same statistics for other features).
# The statistics must be computed on the train data only.
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
data = data.merge(brand_fe, how='left', on='brand')
# DataFrame.merge joins on shared columns; both frames must contain the 'on' column ('brand' here).
# A more concise equivalent of the loop above follows.
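
For reference, the same statistics can be built with pandas' named aggregation (a sketch; assumes pandas >= 0.25):

brand_fe2 = (train[train['price'] > 0]
             .groupby('brand')['price']
             .agg(brand_amount='count', brand_price_max='max',
                  brand_price_median='median', brand_price_min='min',
                  brand_price_sum='sum', brand_price_std='std')
             .reset_index())
# Reproduce the smoothed average used above (sum divided by count + 1):
brand_fe2['brand_price_average'] = round(
    brand_fe2['brand_price_sum'] / (brand_fe2['brand_amount'] + 1), 2)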
# Data binning, using power as the example.
# Note that the missing values fall into a bin of their own as well.
# Why bin at all? Many reasons:
# 1. Inner products on sparse, discretized vectors are faster, the results are easy to store, and the scheme scales well;
# 2. Discretized features are more robust to outliers: with "age > 30 -> 1 else 0", an age of 200 no longer distorts the model much;
# 3. LR is a generalized linear model with limited expressiveness; after discretization each bucket gets its own weight, which effectively introduces non-linearity and improves the fit;
# 4. Discretized features can be crossed: M + N variables become M * N, adding further non-linearity and expressive power;
# 5. The model is more stable after discretization: an age bucket, say, does not change just because a user grows one year older.

# There are more reasons besides; LightGBM's improvements over XGBoost include histogram binning, which also strengthens generalization.

bins = [i*10 for i in range(31)]  # 0, 10, ..., 300 (avoid shadowing the built-in `bin`)
data['power_bin'] = pd.cut(data['power'], bins, labels=False)
data[['power_bin', 'power']].head()
   power_bin  power
0        NaN      0
1       17.0    173
2        4.0     50
3        5.0     54
4       13.0    131
# Once the derived features are in place, the raw columns can be dropped.
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns
(298710, 48)





Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
       'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
       'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
       'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'v_15', 'v_16', 'v_17',
       'v_18', 'v_19', 'v_20', 'v_21', 'v_22', 'v_23', 'train', 'used_time',
       'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_sum', 'brand_price_std',
       'brand_price_average', 'power_bin'],
      dtype='object')
# This data is already usable by tree models, so export it:
data.to_csv('data_for_tree.csv', index=0)
# Next, build a second feature set for models such as LR and NNs.
# We construct it separately because different model families have different requirements on the data.
# First look at the distribution:
data['power'].plot.hist()
<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of power over the concatenated data (output_23_1.png)]

# We already removed the outliers from train, yet the distribution still looks odd
# because of the power outliers remaining in test. In hindsight it would have been
# better not to delete the train outliers and to truncate the long tail instead.
train['power'].plot.hist()
<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of power over train after outlier removal (output_24_1.png)]

# Take the log, then min-max normalize.
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()  # instantiated but unused; the scaling below is done by hand
data['power'] = np.log(data['power'] + 1) 
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of power after log transform and min-max normalization (output_25_1.png)]

# kilometer looks fine; it appears to have been bucketed already,
data['kilometer'].plot.hist()
<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of kilometer (output_26_1.png)]

# so we can normalize it directly:
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / 
                        (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of kilometer after min-max normalization (output_27_1.png)]

# Beyond that, there are the statistical features we just constructed:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# No need to analyze each one; apply the same min-max transform via a helper:
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])
# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])

get_dummies is the pandas way of doing one-hot encoding; see the official docs for full details.
Usage:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Parameters:

data: array-like, Series, or DataFrame. The input data.

prefix: string, list of strings, or dict of strings, default None. The prefix of the column names produced by get_dummies.

columns: list-like, default None. The column names to encode.

dummy_na: bool, default False. Add a column indicating NaNs; if False, NaNs are ignored.

drop_first: bool, default False. Keep k-1 out of the k category levels by dropping the first (a toy illustration follows).
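
A toy illustration of the two flags above:

import pandas as pd

s = pd.Series(['a', 'b', None])
print(pd.get_dummies(s, dummy_na=True))    # an extra column flags the NaN
print(pd.get_dummies(s, drop_first=True))  # k-1 columns: 'a' becomes the baseline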

print(data.shape)
data.columns
(298710, 381)





Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'price',
       'v_0', 'v_1', 'v_2',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=381)
# This version of the data can be fed to LR:
data.to_csv('data_for_lr.csv', index=0)

3.3.3 Feature Selection

# Correlation analysis (Spearman rank correlation with price)
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))
0.5494470244334951
-0.3691728008462487
0.042793827586892146
0.35129036835843275
0.008225052344744431
0.3584555831136035
# Or inspect it visually:
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average', 
                     'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)
# annot: default False; if True, the value is written in each cell.
# vmax, vmin: the maximum / minimum of the heatmap color scale; inferred from the data by default.

[Figure: correlation heatmap of the numeric features (output_35_1.png)]

# A large k_features is very slow to run without a server, so the cell was interrupted early.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
           k_features=10,
           forward=True,
           floating=False,
           scoring = 'r2',
           cv = 0)
x = data.drop(['price'], axis=1)
x = x.fillna(0)
y = data['price']
sfs.fit(x, y)
sfs.k_feature_names_ 
'''estimator : estimator instance
    An unfitted estimator.
n_features_to_select : int or float, default=None
    The number of features to select. If None, half of the features are
    selected. If integer, it is the absolute number of features to select.
    If a float between 0 and 1, it is the fraction of features to select.
direction : {'forward', 'backward'}, default='forward'
    Whether to perform forward selection or backward selection.
scoring : str, callable, list/tuple or dict, default=None
    A single str or a callable to evaluate the predictions on the test set.
    Note that when using custom scorers, each scorer should return a single
    value. If None, the estimator's score method is used.
(These parameters are quoted from sklearn's SequentialFeatureSelector, whose
interface is similar to mlxtend's SFS used here.)'''
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-38-a8efd37be6c0> in <module>
     11 x = x.fillna(0)
     12 y = data['price']
---> 13 sfs.fit(x, y)
     14 sfs.k_feature_names_
     15 '''estimatorestimator instance


C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\feature_selection\sequential_feature_selector.py in fit(self, X, y, custom_feature_names, groups, **fit_params)
    431 
    432                 if self.forward:
--> 433                     k_idx, k_score, cv_scores = self._inclusion(
    434                         orig_set=orig_set,
    435                         subset=prev_subset,


C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\feature_selection\sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y, ignore_feature, groups, **fit_params)
    602             parallel = Parallel(n_jobs=n_jobs, verbose=self.verbose,
    603                                 pre_dispatch=self.pre_dispatch)
--> 604             work = parallel(delayed(_calc_score)
    605                             (self, X[:, tuple(subset | {feature})], y,
    606                              tuple(subset | {feature}),


C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 


C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 


C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to


C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)


C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):


C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 


C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 


C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\feature_selection\sequential_feature_selector.py in _calc_score(selector, X, y, indices, groups, **fit_params)
     35                                  fit_params=fit_params)
     36     else:
---> 37         selector.est_.fit(X, y, **fit_params)
     38         scores = np.array([selector.scorer(selector.est_, X, y)])
     39     return indices, scores


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in fit(self, X, y, sample_weight)
    516         accept_sparse = False if self.positive else ['csr', 'csc', 'coo']
    517 
--> 518         X, y = self._validate_data(X, y, accept_sparse=accept_sparse,
    519                                    y_numeric=True, multi_output=True)
    520 


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    431                 y = check_array(y, **check_y_params)
    432             else:
--> 433                 X, y = check_X_y(X, y, **check_params)
    434             out = X, y
    435 


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    821                     estimator=estimator)
    822     if multi_output:
--> 823         y = check_array(y, accept_sparse='csr', force_all_finite=True,
    824                         ensure_2d=False, dtype=None)
    825     else:


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    661 
    662         if force_all_finite:
--> 663             _assert_all_finite(array,
    664                                allow_nan=force_all_finite == 'allow-nan')
    665 


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    101                 not allow_nan and not np.isfinite(X).all()):
    102             type_err = 'infinity' if allow_nan else 'NaN, infinity'
--> 103             raise ValueError(
    104                     msg_err.format
    105                     (type_err,


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
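
The ValueError is easy to trace: `data` still contains the test rows, whose price is NaN, so the target fails sklearn's finiteness check (and the non-numeric `city` column would trip up LinearRegression next). A sketch of one way to get a successful fit:

# Fit on the training rows only, and keep numeric columns for LinearRegression:
train_part = data[data['train'] == 1]
x = (train_part.drop(['price'], axis=1)
               .select_dtypes(include='number')
               .fillna(0))
y = train_part['price']
sfs = sfs.fit(x, y)
print(sfs.k_feature_names_)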

Common methods

Filter

Removing features with low variance
Univariate feature selection

Wrapper

Recursive Feature Elimination

Embedding

Feature selection using SelectFromModel
Feature selection as part of a pipeline

The wrapper approach keeps selecting feature subsets from the initial feature set and training a learner, scoring each subset by the learner's performance, until the best subset is found. Wrapper-style feature selection thus optimizes directly for the given learner.

Its usual implementation is sequential feature selection:

Sequential Forward Selection (SFS)
Sequential Backward Selection (SBS)


# Plot the selection path to see the marginal gain of each added feature.
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-39-77bdc20246c0> in <module>
      2 from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
      3 import matplotlib.pyplot as plt
----> 4 fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
      5 plt.grid()
      6 plt.show()


C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\feature_selection\sequential_feature_selector.py in get_metric_dict(self, confidence_interval)
    725 
    726         """
--> 727         self._check_fitted()
    728         fdict = deepcopy(self.subsets_)
    729         for k in fdict:


C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\feature_selection\sequential_feature_selector.py in _check_fitted(self)
    744     def _check_fitted(self):
    745         if not self.fitted:
--> 746             raise AttributeError('SequentialFeatureSelector has not been'
    747                                  ' fitted, yet.')


AttributeError: SequentialFeatureSelector has not been fitted, yet.
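
This AttributeError follows directly from the failed fit above: get_metric_dict() refuses to run on an unfitted selector, so the plotting cell only works once sfs.fit has completed successfully (for example with the corrected inputs sketched earlier).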
