Datawhale Beginner's Guide to Data Mining - Task 3: Feature Engineering


3. Feature Engineering Objectives

Tip: This is the Task 3 (Feature Engineering) part of the beginner's data mining course. It introduces various feature engineering and analysis methods; questions and discussion are welcome.

Competition: Beginner's Guide to Data Mining - Used Car Transaction Price Prediction

Link: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

3.1 Feature Engineering Objectives

  • Analyze the features further and process the data accordingly.

  • Complete the feature engineering analysis, summarize it with charts or text, and check in.

3.2 Content Overview

Common feature engineering includes:

  1. Outlier handling:
    • remove outliers via box-plot (or 3-Sigma) analysis;
    • Box-Cox transform (for skewed distributions);
    • long-tail truncation;
  2. Feature standardization/normalization:
    • standardization (transform to a standard normal distribution);
    • normalization (scale to the [0, 1] range);
    • for power-law distributions, the formula $\log\frac{1+x}{1+\text{median}}$ can be used;
  3. Data bucketing:
    • equal-frequency bucketing;
    • equal-width bucketing;
    • Best-KS bucketing (similar to binary splitting with the Gini index);
    • chi-square bucketing;
  4. Missing-value handling:
    • leave as-is (for tree models such as XGBoost);
    • delete (when too much of the data is missing);
    • imputation, including mean/median/mode, model-based prediction, multiple imputation, compressed-sensing completion, matrix completion, etc.;
    • binning, with missing values as a bin of their own;
  5. Feature construction:
    • statistical features, including counts, sums, ratios, standard deviations, etc.;
    • time features, including relative and absolute time, holidays, weekends, etc.;
    • geographic information, including binning, distribution encoding, and other methods;
    • nonlinear transforms, including log / square / square root, etc.;
    • feature combinations and feature crosses;
    • and whatever else your own insight suggests.
  6. Feature selection:
    • filter: select features first, then train the learner; common methods include Relief, variance thresholding, correlation coefficients, the chi-square test, and mutual information;
    • wrapper: use the performance of the final learner directly as the evaluation criterion for feature subsets; a common method is LVW (Las Vegas Wrapper);
    • embedded: combines aspects of filter and wrapper; feature selection happens automatically while the learner trains, as in Lasso regression;
  7. Dimensionality reduction:
    • PCA / LDA / ICA;
    • feature selection is itself a form of dimensionality reduction.
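The power-law formula in item 2 can be sketched in a few lines. This is a minimal illustration on toy data, not part of the competition pipeline:

```python
import numpy as np
import pandas as pd

def log_median_scale(x: pd.Series) -> pd.Series:
    """Scale a long-tailed feature by log((1 + x) / (1 + median))."""
    median = x.median()
    # log1p(x) - log1p(median) == log((1 + x) / (1 + median))
    return np.log1p(x) - np.log1p(median)

s = pd.Series([1, 2, 3, 10, 1000])
scaled = log_median_scale(s)  # the median value maps to 0
```

Values below the median become negative and the long tail is compressed, which is the intended effect for power-law distributions.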

3.3 Code Examples

3.3.0 Load the data

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

%matplotlib inline
path = './data/'
train = pd.read_csv(path+'train.csv', sep=' ')
test = pd.read_csv(path+'testA.csv', sep=' ')
print(train.shape)
print(test.shape)
(150000, 31)
(50000, 30)
train.head()
(train.head() output omitted: the first five rows of the 31 columns listed by train.columns)
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
test.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')

3.3.1 Remove outliers

# Here I wrap outlier handling in a function that can be called on any column.
def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers; by default uses the box-plot rule (scale=3).
    :param data: pandas DataFrame
    :param col_name: column name
    :param scale: scale factor
    :return: cleaned DataFrame
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers using the box-plot rule.
        :param data_ser: pandas.Series
        :param box_scale: box-plot scale factor
        :return: (masks, bounds)
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))  # box_scale times the IQR (box height)
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)  # boolean masks, usable for filtering
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()  # work on a copy of the data
    data_series = data_n[col_name]  # select the target column
    rule, value = box_plot_outliers(data_series, box_scale=scale)  # outlier masks and bounds
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]  # indices of the outliers
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)  # drop the outlier rows by index
    data_n.reset_index(drop=True, inplace=True)  # re-index after dropping rows
    print("Now column number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]  # indices below the lower bound
    outliers = data_series.iloc[index_low]  # values below the lower bound
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]  # indices above the upper bound
    outliers = data_series.iloc[index_up]  # values above the upper bound
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())

    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
pd.set_option('display.max_columns', 100 )
train.describe()
(train.describe() output omitted: count/mean/std/min/quartile/max summary for each numeric column; note the raw power column peaks at 19312)
# We can delete some abnormal data; take power as an example.
# Whether to delete is up to you,
# but note: the test data must NOT be deleted; that would just be fooling yourself.
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
train = outliers_proc(train, 'power', scale=3)


pd.set_option('display.max_columns', 100 )
train.describe()
Delete number is: 963
Now column number is: 149037
Description of data less than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count      963.000000
mean       846.836968
std       1929.418081
min        376.000000
25%        400.000000
50%        436.000000
75%        514.000000
max      19312.000000
Name: power, dtype: float64
(train.describe() output omitted: after removing 963 outlier rows, 149037 rows remain and power now ranges from 0 to 375)

(figures omitted: boxplots of power before and after outlier removal, output_13_2.png and output_13_3.png)
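The `outliers_proc` function above uses the box-plot rule; the 3-Sigma alternative mentioned in section 3.2 can be sketched like this. A minimal illustration on toy data; since the mean and standard deviation are themselves computed from the data, small samples need a pronounced spike for anything to be flagged:

```python
import pandas as pd

def sigma3_mask(s: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Return a boolean mask marking values outside mean +/- n_sigma * std."""
    mu, sigma = s.mean(), s.std()
    return (s < mu - n_sigma * sigma) | (s > mu + n_sigma * sigma)

s = pd.Series([10.0] * 29 + [1000.0])  # 29 normal readings and one spike
mask = sigma3_mask(s)
cleaned = s[~mask].reset_index(drop=True)  # keep only the inliers
```

The 3-Sigma rule assumes a roughly normal distribution, so for heavily skewed columns like power the box-plot rule is usually the safer default.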

3.3.2 Feature construction

# Put the training and test sets together to make feature construction easier.
train['train']=1
test['train']=0
data = pd.concat([train, test], ignore_index=True, sort=False)
data.head().append(data.tail())
(output omitted: head and tail of the concatenated data, 199037 rows, with the new train indicator column; test rows have price = NaN)
# Usage time: data['creatDate'] - data['regDate'] reflects how long the car has
# been in use; price is generally inversely related to usage time.
# Note that some dates in the data are malformed, so we need errors='coerce'.
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
                     pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data.head().append(data.tail())
(output omitted: same head/tail view with the new used_time column; rows with malformed dates show used_time = NaN)
# Fill the missing usage times with the column mean.
data['used_time'] = data['used_time'].fillna(data['used_time'].mean())
data.head().append(data.tail())
(output omitted: the NaN used_time values are now filled with the mean, about 4441 days)
# Check the missing values: about 15k samples (roughly 7.5% of the data) had
# malformed dates. That is too large a share to delete outright. We could also
# leave them as NaN, since tree models such as XGBoost handle missing values
# natively; here the count is 0 because we filled them with the mean above.
data['used_time'].isnull().sum()
0
# Extract city information from the region code. The data is modeled on German
# postal codes, so referring to them injects prior knowledge.
data['city'] = data['regionCode'].apply(lambda x: int(str(x)[:-3]) if len(str(x)) > 3 else 0)
data.info()
data.head().append(data.tail())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199037 entries, 0 to 199036
Data columns (total 34 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             199037 non-null  int64  
 1   name               199037 non-null  int64  
 2   regDate            199037 non-null  int64  
 3   model              199036 non-null  float64
 4   brand              199037 non-null  int64  
 5   bodyType           193130 non-null  float64
 6   fuelType           187512 non-null  float64
 7   gearbox            191173 non-null  float64
 8   power              199037 non-null  int64  
 9   kilometer          199037 non-null  float64
 10  notRepairedDamage  199037 non-null  object 
 11  regionCode         199037 non-null  int64  
 12  seller             199037 non-null  int64  
 13  offerType          199037 non-null  int64  
 14  creatDate          199037 non-null  int64  
 15  price              149037 non-null  float64
 16  v_0                199037 non-null  float64
 17  v_1                199037 non-null  float64
 18  v_2                199037 non-null  float64
 19  v_3                199037 non-null  float64
 20  v_4                199037 non-null  float64
 21  v_5                199037 non-null  float64
 22  v_6                199037 non-null  float64
 23  v_7                199037 non-null  float64
 24  v_8                199037 non-null  float64
 25  v_9                199037 non-null  float64
 26  v_10               199037 non-null  float64
 27  v_11               199037 non-null  float64
 28  v_12               199037 non-null  float64
 29  v_13               199037 non-null  float64
 30  v_14               199037 non-null  float64
 31  train              199037 non-null  int64  
 32  used_time          199037 non-null  float64
 33  city               199037 non-null  int64  
dtypes: float64(22), int64(11), object(1)
memory usage: 51.6+ MB
(output omitted: head/tail view of the data with the new city column)
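Data bucketing from section 3.2 has not appeared in code yet; it can be sketched on a toy power-like series (not the competition data). `pd.cut` gives equal-width bins and `pd.qcut` gives equal-frequency bins:

```python
import pandas as pd

power = pd.Series([0, 30, 60, 75, 110, 150, 163, 193, 224, 334])

# Equal-width bins: each bin spans the same value range.
equal_width = pd.cut(power, bins=4, labels=False)

# Equal-frequency bins: each bin holds roughly the same number of samples.
equal_freq = pd.qcut(power, q=4, labels=False)
```

Equal-width bins are easy to interpret but sensitive to outliers, which is why removing extreme power values first matters; equal-frequency bins adapt to the distribution.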
# Per-brand statistics; use train.groupby("brand")["price"].describe() for price alone.
train.groupby("brand").describe()
(output omitted: per-brand describe() table, with count/mean/std/min/quartile/max of every numeric column for each brand)
151458.075836.77777842919.1758882.039051.2574694.0113072.75149878.01458.054366.36145459170.040930496.05185.0025150.099853.00196730.01458.02.007449e+0738355.20431219911010.020050202.2520080107.520100909.0020151101.01458.091.20919152.5261461.020.0115.0115.0208.01430.01.9510491.5512190.01.01.04.07.01421.00.1048560.3087650.00.00.00.02.01434.00.068340...0.0814600.1589831458.0-1.6382393.886271-6.725212-5.291464-3.1855842.08663110.8662951458.0-0.2002082.910238-4.449206-1.998758-0.0437861.80586818.0264091458.02.2549941.428281-5.1345261.3524912.0640853.01824710.7379351458.0-0.0212080.878520-2.068686-0.602893-0.1688200.5398502.3688251458.0-0.5848351.093781-4.302939-1.218293-0.2443100.1299991.7874711458.01.00.01.01.01.01.01.0
162219.073605.73411443241.19132024.036326.5073669.0110490.50149967.02219.069268.56602162853.98595232.07904.0053724.0122929.00196810.02219.02.005215e+0742439.83964019950003.020020007.5020050309.020081210.0020151203.02219.035.25011343.9776681.021.021.021.0169.02179.02.0316661.5351430.01.01.04.07.02128.00.1846800.4466980.00.00.00.06.01828.00.833698...0.0791650.1757862219.00.2329583.355363-6.007778-3.3030331.9758132.85000412.0626352219.00.0239052.749959-3.747783-1.521359-0.5544271.77056517.9385312219.0-0.0068821.636191-4.818190-1.197193-0.1417421.0508759.4224962219.00.3569790.919330-1.609028-0.3384610.3236560.7991673.6875182219.00.2832700.975365-3.6413320.0626680.4560520.9223921.6736532219.01.00.01.01.01.01.01.0
17913.076391.49507143001.469264127.042395.0076124.0114306.00149902.0913.074632.95947459677.58222763.019057.0060976.0123652.00196611.0913.02.003280e+0740929.41122919910009.020001201.0020030312.020060701.0020151012.0913.053.18948544.6845451.019.035.055.0234.0876.01.4109591.7520780.00.01.02.07.0845.00.3408280.5033550.00.00.01.02.0879.00.067122...0.1126510.162548913.00.4160943.722685-6.531455-3.3341652.1403323.06116212.180503913.00.2660603.476105-4.446526-1.645925-0.4163631.44144318.198690913.0-0.7193562.331593-6.479402-2.384324-1.2615150.9503449.375095913.00.9105721.047373-1.6420760.1482591.1096501.7315152.921252913.0-0.0569880.883488-1.889225-0.860490-0.0301750.7655521.649960913.01.00.01.01.01.01.01.0
18315.077633.35238142433.451781189.041542.0077918.0116743.50149618.0315.082412.24444458198.40794367.030605.0070827.0137309.50193771.0315.02.001261e+0751348.74871619910411.019971208.0020000609.020050109.5020150808.0315.0100.77142970.7244031.037.072.0149.0211.0302.01.6456951.5733340.00.02.02.07.0291.00.1546390.4914480.00.00.00.03.0306.00.160131...0.1109530.157512315.00.8870183.563810-6.671605-2.3378642.2868293.27489211.002059315.00.1665563.711119-4.157351-1.944996-0.7852101.34959018.702045315.0-0.5955732.602414-5.592622-2.584811-1.1884271.28359310.152891315.00.8511470.982307-1.3312710.1442580.7526831.5955113.357300315.0-0.9774891.076947-3.405637-1.797545-0.772322-0.2967631.913713315.01.00.01.01.01.01.01.0
191386.075560.02092443706.641419108.036555.5076038.5113820.75149928.01386.078210.52164559147.648804141.025283.5067916.5126183.00196712.01386.02.002140e+0752415.13703419910002.019980705.0020010802.020060705.0020150905.01386.0111.70346374.0652121.038.059.0178.0233.01361.02.0720061.5873910.02.02.02.06.01322.00.4826020.6274690.00.00.01.05.01345.00.318216...0.1004610.1680841386.00.4583723.322205-7.524217-2.8809101.8358273.01408212.1690011386.0-0.4676562.853088-4.801876-2.113042-0.9815891.18569817.9113451386.0-0.3003602.549155-6.156476-2.391214-0.6951721.67544610.3687901386.00.3141711.097748-1.889330-0.4084160.1706110.9894503.0565401386.0-0.8298981.467238-4.522631-2.368066-0.2224890.2947142.1667401386.01.00.01.01.01.01.01.0
201235.072852.99352242770.279495270.036730.0071808.0109085.00149923.01235.075610.62105357519.55504669.025344.0062591.0124960.50196521.01235.02.002248e+0749976.25160719910005.019990006.0020010911.020060301.0020150912.01235.096.96518272.4030991.019.071.0148.0225.01175.01.9395742.2289540.00.01.03.07.01142.00.2373030.5184510.00.00.00.05.01190.00.164706...0.1113900.1771991235.00.7850343.588094-7.092171-2.4184972.1262363.25791411.3256531235.00.3073633.950357-4.688188-1.583725-0.5317961.17140518.6352901235.0-1.0709522.549985-9.223993-3.036032-1.4255030.5787878.5689871235.00.9206121.223262-1.451778-0.1687081.1655181.9486503.0896691235.0-0.2271021.388666-4.633527-0.6660890.0639630.8363131.8570251235.01.00.01.01.01.01.01.0
211546.074411.11513643042.50742817.037334.7575272.5110606.00149710.01546.068804.48641757792.72366883.019015.0054175.0113818.50196158.01546.02.007134e+0746009.51633319921211.020040403.0020071209.020110307.7520151206.01546.065.70504545.4269321.019.082.082.0191.01521.02.5641032.3583560.01.01.06.07.01486.00.3371470.5578270.00.00.01.05.01503.00.126414...0.0707340.1773061546.0-0.4756853.523654-6.666113-3.9584691.1877832.39782411.4871301546.0-0.4329082.722832-4.713837-2.167777-0.6681030.93867517.9910231546.00.5253702.047954-6.774670-1.0596140.7097482.01003911.6834601546.00.3116081.191928-1.841718-0.5940750.3007310.6774143.1132121546.00.0955641.064292-3.496441-0.4880670.2315810.9482962.0179021546.01.00.01.01.01.01.01.0
221085.074106.37235043456.470274286.036621.0073611.0111236.00149966.01085.075645.85345657361.844785354.024505.0064182.0123326.00196675.01085.02.006849e+0742384.45674619940404.020040502.0020060910.020100704.0020151110.01085.088.13456253.2478921.058.095.0118.0187.01069.02.8278772.3504580.01.02.06.07.01041.00.4803070.5699580.00.00.01.02.01064.00.219925...0.1192860.1802321085.0-0.0688473.459512-7.377229-3.4915101.5541272.61871211.3588951085.0-0.9009182.624135-4.631540-2.373848-1.2204730.30802417.8501231085.00.6522822.106158-5.786973-0.8491870.6605692.1710809.8042981085.00.9903171.206859-1.8286310.2356241.1229121.8959863.4887861085.0-1.3257441.051142-3.483484-2.099871-1.535887-0.5350321.6327161085.01.00.01.01.01.01.01.0
23183.071463.06557442681.617690981.037411.5072602.0106686.50149073.0183.091299.28961756983.542288200.042267.5084050.0142807.00194901.0183.02.001607e+0747984.95217019910601.019980557.0020010012.020041205.0020150712.0183.0141.87978176.0342371.0147.0147.0198.0246.0177.01.3615821.0080890.01.01.02.05.0170.00.2411760.5050730.00.00.00.02.0174.00.132184...0.1519330.195777183.01.8597042.923402-4.659997-1.0390742.9132943.68250310.600353183.00.4137183.339626-3.996334-1.139900-0.2307741.31607718.134975183.0-1.4588362.478543-6.134694-3.138181-2.029954-0.0446755.744553183.01.6856631.572607-1.3190200.6414981.9533082.8377934.207282183.0-0.5840890.954694-3.119766-0.832251-0.573239-0.0498451.813045183.01.00.01.01.01.01.01.0
24630.077544.57777843472.950598104.041483.0077434.0116516.75149770.0630.068077.84603263207.834220754.06010.0048344.0123193.25196698.0630.02.004602e+0755756.40113419910101.020010705.0020050706.020081009.7520151004.0630.0141.43492157.2032441.0135.0167.0167.0196.0609.04.7224960.9407630.04.05.05.07.0596.00.1157720.3644080.00.00.00.02.0605.00.454545...0.0104440.096948630.0-2.1815254.474586-8.563886-7.1146870.4311290.93602911.517281630.0-2.0682033.619911-5.403044-4.436895-3.733340-0.29947417.321064630.04.2324751.772247-2.5016473.0902583.9707515.31465413.847792630.0-2.6064161.006631-4.153899-3.483702-2.789147-1.705373-0.369503630.0-1.5474051.520931-4.990900-2.049624-1.381883-0.4819471.464521630.01.00.01.01.01.01.01.0
252059.074127.03448343501.33822037.037139.0074159.0112051.50149751.02059.078456.48664459066.256877270.022920.0068579.0126283.00196732.02059.02.004729e+0745101.62759619910301.020011109.0020050512.020080610.0020151205.02059.077.58766459.4401441.019.074.0107.0213.02003.01.9186221.5445010.01.01.03.07.01964.00.4027490.5502860.00.00.01.06.01990.00.137186...0.1296090.1797832059.00.6311463.364486-5.958284-3.0205902.1428612.99720811.3188702059.0-0.0119603.091690-4.229087-1.762297-0.7227041.15927418.6721012059.0-0.2919242.100629-7.142263-1.774015-0.3361651.0888739.3761542059.00.9701311.428631-1.543721-0.5442311.1099992.2242283.8410972059.0-0.7091300.910082-2.964917-1.563683-0.554168-0.0951761.5450842059.01.00.01.01.01.01.01.0
26878.077493.37813243487.023734347.040607.7576995.5115285.25149956.0878.088815.64236958856.666895319.037001.0084736.5138247.25196809.0878.02.003434e+0757256.43279919910012.020000002.0020030909.020077805.2520151211.0878.01.0000000.0000001.01.01.01.01.0736.03.5013592.4721580.01.04.06.07.0712.00.6671351.0857150.00.00.01.06.0714.00.523810...0.0319510.115086878.01.5857204.774101-7.936043-2.0713041.9994383.11246512.319303878.01.1228516.295803-4.734807-2.794448-1.1014240.87200018.563847878.01.5142692.664720-6.849828-0.2486311.4200062.91000012.964502878.0-1.4504740.622820-2.683358-2.035405-1.523719-0.929298-0.012991878.00.7942970.602347-1.4210360.3605520.8203681.2572692.228209878.01.00.01.01.01.01.01.0
272049.073845.70571043240.47888315.036399.0073615.0110825.00149922.02049.069949.44119159332.550713456.015265.0052631.0117151.00196757.02049.02.004516e+0752069.92916419910002.020010306.0020050901.020080906.0020150611.02049.0119.45485658.4268711.0111.0136.0160.0219.02013.01.9597621.9219020.01.01.03.07.01969.00.3763330.7976520.00.00.01.04.01996.00.132766...0.1170300.1712622049.0-0.3494113.401541-8.385159-3.7756981.3887472.56836611.4384722049.0-0.2205472.787389-4.900100-1.958291-0.4859051.29166117.9558112049.00.4332051.927001-5.487213-1.0922510.5020351.5708649.2389372049.00.9096571.208100-1.8198200.2675991.1551801.7894083.0362362049.0-1.1482411.261924-4.288604-1.986162-1.035812-0.2080641.7911762049.01.00.01.01.01.01.01.0
28633.074738.65718843356.425965267.037797.0076432.0112074.00149999.0633.075250.40916357674.072361328.021680.0063202.0118631.00196106.0633.02.006528e+0752284.87006019910003.020050609.0020070810.020100907.0020140312.0633.083.76461379.7897941.019.019.0177.0238.0625.02.4176002.2786660.01.01.05.07.0603.00.4063020.7261790.00.00.01.05.0619.00.268174...0.1456580.182138633.00.0832553.220259-6.585369-2.9035761.6806402.60035711.081233633.0-0.6736382.632410-4.493970-2.368460-1.0223170.95103517.505573633.00.6626961.697063-4.261720-0.6092200.5077341.8790618.397233633.00.9138491.738917-1.765759-0.6526580.2163462.6161053.267012633.0-0.7256591.529047-4.800336-0.651846-0.3160540.0565921.681307633.01.00.01.01.01.01.01.0
29406.074621.88423643236.633955228.037153.5073950.5111987.25149794.0406.070945.20689756446.897666364.019719.0055950.5115637.75191417.0406.02.010308e+0722641.54039019990307.020090311.0020101007.020120607.7520150801.0406.0145.52709453.2464851.097.0153.0203.0220.0402.02.7661692.2097160.01.02.06.07.0387.00.4108530.6431670.00.00.01.03.0396.00.000000...0.1549610.180703406.0-0.7903723.346689-6.638600-4.0044751.1893072.3378689.911740406.0-0.8159662.505750-4.278372-2.240155-1.3160431.00390218.215280406.01.2645021.273312-1.3661090.2381201.3137902.1251547.968338406.02.3487160.891323-1.2672621.3401442.7407502.9596093.670654406.0-1.5693581.180541-4.288898-2.280369-2.059071-0.1733541.671054406.01.00.01.01.01.01.01.0
30940.075172.06914944258.585908223.036734.0074946.0114503.50149916.0940.070067.44680957915.463701368.020215.0052366.0116240.00196292.0940.02.003896e+0755861.12082919910005.019990907.5020050009.520080806.0020151207.0940.071.61914969.9019831.019.019.0137.0194.0914.02.8402632.4674610.01.01.06.07.0885.00.1480230.3887690.00.00.00.02.0905.00.081768...0.1122320.177734940.0-0.1264123.654306-6.275393-3.7200921.4576612.74877711.517406940.00.0055543.280222-4.358551-1.716405-0.3501041.35713918.455192940.0-0.1733412.100211-6.730851-1.817245-0.0519231.1652289.049487940.00.5764101.292044-1.367847-0.6347410.5576111.8161353.403661940.0-0.8261641.106440-4.161693-1.277038-0.576715-0.1923271.843991940.01.00.01.01.01.01.01.0
31318.079211.82389944752.440924986.038857.2581918.5119649.75149436.0318.072795.92138459748.13984419.019852.0055504.5124235.25196533.0318.02.002378e+0741200.58892719920009.019991135.2520030156.520060481.0020120903.0318.0122.25157266.3709551.0100.0100.0150.0241.0299.01.5150501.5377600.01.01.01.07.0285.00.0210530.2044720.00.00.00.02.0300.00.110000...0.1540650.197007318.01.5449403.692167-5.214919-2.3566982.9896023.93232011.722887318.01.5967684.075336-3.331715-0.3643590.4532222.30362618.434987318.0-2.1455052.219577-7.476219-3.721772-2.628344-0.4379235.561065318.02.7905181.253698-0.6014782.6280973.1470843.6274274.574046318.0-0.0985711.047429-3.530033-0.3504570.1168560.6059511.781382318.01.00.01.01.01.01.01.0
32588.077254.42176943933.469591785.038649.5079253.0116300.50149906.0588.080697.83163360478.938739532.023408.0073592.0133244.25196673.0588.02.002166e+0737778.35965619910012.019991207.5020020457.520050602.2520150007.0588.0101.00000074.1838041.019.0120.0173.0185.0568.02.4577461.5572470.02.03.03.07.0551.00.4718690.6314690.00.00.01.02.0571.00.553415...0.1098760.148974588.00.4333813.612880-6.647469-3.3211142.0844402.93162411.528951588.0-0.2213593.340916-4.232155-2.162154-1.2013791.06155317.634468588.0-0.6263382.320674-7.328804-2.191988-1.0589430.7404089.804773588.00.5640831.066244-1.316508-0.4793040.7597011.3854502.695913588.0-1.0072971.020659-4.544762-1.584126-0.847772-0.3538231.404809588.01.00.01.01.01.01.01.0
33201.074372.56716446552.749727213.034590.0068461.0118484.00149600.0201.081124.04975157825.076347726.030423.0070408.0127691.00196467.0201.02.002732e+0749375.94977319910111.020000403.0020021211.020061008.0020150808.0201.0111.51741380.1629031.019.0179.0181.0181.0198.00.5555561.3646900.00.00.00.05.0195.00.4205130.5899410.00.00.01.02.0198.00.792929...0.1010850.122254201.0-0.3823103.667147-6.699000-4.3615811.7019812.3888929.788658201.0-0.9631642.475391-4.042492-2.448779-1.8134650.69568614.265191201.01.1626011.828595-1.835974-0.1053050.8253242.4613658.332826201.0-0.2123181.028219-2.181042-1.1227690.1953740.5484071.846387201.0-1.4301851.426232-3.699757-2.725109-1.730609-0.1535711.100867201.01.00.01.01.01.01.01.0
34227.067690.29515445058.232029821.026593.5063861.0108637.00149962.0227.086537.94273158063.238676733.039316.0075063.0135185.00196322.0227.02.001670e+0725854.78644119940305.020000306.5020021009.020040357.0020090806.0227.0103.90308463.7227081.092.092.0141.0216.0214.01.1168221.1302070.01.01.01.07.0206.00.1067960.4507470.00.00.00.02.0218.00.073394...0.1807230.213617227.02.3367483.069714-3.994545-0.6750703.4511183.84228211.482525227.01.2977203.818776-1.979440-0.6285230.0845702.24139118.004163227.0-2.7177302.046480-7.591009-3.716544-2.843971-2.1430895.869691227.03.5934841.6836940.1750893.6534624.2190274.8474375.249750227.0-0.4787320.482595-1.466562-0.833232-0.483200-0.1371730.978073227.01.00.01.01.01.01.01.0
35180.075761.28888942658.6620141239.042427.2576158.0109473.00149835.0180.092504.37777856658.8322391985.043453.5089256.0137603.50193029.0180.01.999566e+0726177.57176719920202.019980753.0019991104.020010412.0020081012.0180.020.79444431.5626001.019.019.019.0240.0174.01.1379311.7744000.00.00.01.07.0163.00.1717790.4387850.00.00.00.02.0171.00.128655...0.0727060.111065180.02.6983222.546303-5.3001562.6791393.2184413.60227411.216075180.0-0.1325263.185104-4.492133-1.452299-0.9338030.00465915.908381180.0-2.1181621.833393-7.024219-3.163407-2.343118-1.3638274.675358180.0-0.3101870.406609-0.710799-0.534089-0.451840-0.1957532.305511180.0-0.2699340.703368-4.304987-0.470896-0.2859570.0872441.152364180.01.00.01.01.01.01.01.0
36228.073029.51315841254.861198699.036801.2574890.0106929.50147599.0228.095786.00000058318.8485702310.045353.2593729.5145547.00196115.0228.02.000187e+0743634.81125219910001.019971006.5020000703.520030533.7520120201.0228.068.73684284.5067111.019.019.0205.0232.0226.02.0752212.0107720.00.02.04.07.0210.00.2809520.6047020.00.00.00.05.0219.00.210046...0.0882180.178423228.01.4287532.682799-5.3951281.5324962.5418142.99633210.525269228.0-1.1407522.139584-3.857012-2.153057-1.579656-0.68523817.097484228.0-0.9051361.914516-5.375736-2.362553-1.0338160.4150384.007988228.0-0.0263820.938113-1.156501-0.848372-0.3795451.0128912.450867228.0-0.6611500.820463-2.472309-1.188085-0.478278-0.0471810.935694228.01.00.01.01.01.01.01.0
37331.073778.83383744279.70850057.034365.5073581.0111356.00149816.0331.082216.23262858785.1689551747.030364.5069461.0136735.50192926.0331.02.005355e+0753794.57523019910106.020010907.5020060112.020091008.0020151210.0330.0190.27575828.1055281.0189.0200.0202.0206.0324.05.9783950.3556220.06.06.06.07.0326.00.8619630.3712280.01.01.01.02.0324.00.478395...0.1066930.222787331.0-1.2058353.866438-8.798810-5.5461080.5863531.29728910.207696331.0-2.4499683.002255-5.391154-4.375784-3.313433-0.70342014.724600331.02.6814812.293266-2.8542450.8741652.8210374.5368639.983565331.00.0913081.051330-2.255182-0.3748340.0602460.45553411.147669331.0-4.0895761.169474-6.546556-4.561176-4.118236-3.6827168.658418331.01.00.01.01.01.01.01.0
3865.079568.16923146074.928132570.035584.0098020.0115964.00147553.065.069180.09230853590.1686912632.021519.0056816.0111613.00189410.065.02.006727e+0742559.87683619950008.020050701.0020080205.020091003.0020150205.065.0171.60000083.5248471.0214.0214.0214.0242.060.04.6666672.2823320.02.06.06.06.060.00.3500000.7552080.00.00.00.02.060.00.000000...0.1568600.21040265.00.9251484.186859-5.543358-3.0305151.9559952.58862411.61338765.0-0.3643504.634212-3.655055-2.905654-1.6276780.21689916.38838665.0-0.0255102.116160-3.477392-1.090534-0.3393300.5255878.29731865.02.6236261.387277-0.8877932.5426583.2551393.4160944.59951465.0-1.8556741.140203-3.268925-2.505578-2.328672-1.7552000.87843165.01.00.01.01.01.01.01.0
399.070365.22222239637.4578085144.049803.0076258.099049.00127071.09.074224.66666759824.38465222825.038387.0050778.065608.00181810.09.02.000169e+0775637.08055219910707.019950009.0019981203.020040710.0020150402.09.086.000000118.7276291.01.019.0244.0244.07.02.5714292.4397500.00.52.04.56.06.00.1666670.4082480.00.00.00.01.07.00.000000...0.0723660.1846689.01.5158732.985496-6.2144281.7177042.1185263.1420733.3993859.02.9106558.745510-3.436782-1.548087-1.0594970.69700118.1976999.00.7602312.972780-2.862146-1.400021-0.2499551.9141295.4529859.00.4966501.477420-0.897785-0.6747030.0754031.5321343.5770099.0-0.8995151.813946-3.620942-2.3067800.0006580.2584031.0056719.01.00.01.01.01.01.01.0

40 rows × 240 columns

# Compute per-brand sales statistics; the same can be done for other features
# The statistics must be computed on the train set only, to avoid leaking test information
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    # print('kind\n', kind)
    # print('kind_data\n', kind_data)
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)  # the +1 mildly smooths brands with few listings
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
#brand_fe.head()
data = data.merge(brand_fe, how='left', on='brand')
brand_fe.head()
   brand  brand_amount  brand_price_max  brand_price_median  brand_price_min  brand_price_sum  brand_price_std  brand_price_average
0      0       31429.0          68500.0              3199.0             13.0      173719698.0      6261.371627              5527.19
1      1       13656.0          84000.0              6399.0             15.0      124044603.0      8988.865406              9082.86
2      2         318.0          55800.0              7500.0             35.0        3766241.0     10576.224444             11806.40
3      3        2461.0          37500.0              4990.0             65.0       15954226.0      5396.327503              6480.19
4      4       16575.0          99999.0              5999.0             12.0      138279069.0      8089.863295              8342.13
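The loop above can also be expressed with `groupby().agg()` named aggregation; a minimal sketch on toy data (the frame and its values are illustrative, and the smoothed average column can be added afterwards the same way):

```python
import pandas as pd

# Toy data standing in for the train set
train = pd.DataFrame({
    'brand': [0, 0, 0, 1, 1],
    'price': [100, 200, 0, 50, 150],
})

# Drop non-positive prices first, as in the loop version
valid = train[train['price'] > 0]

brand_fe = valid.groupby('brand')['price'].agg(
    brand_amount='count',
    brand_price_max='max',
    brand_price_median='median',
    brand_price_min='min',
    brand_price_sum='sum',
    brand_price_std='std',
).reset_index()
print(brand_fe)
```

Named aggregation keeps the column names explicit and avoids building the intermediate dict by hand.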
data.head()
(data.head() output: 40 columns — SaleID, name, regDate, model, brand, …, brand_price_average — omitted)
# Data binning, using power as an example
# Note that missing values fall into a bin as well
# Why bin the data? Several reasons:
# 1. After discretization, sparse-vector inner products are faster, the results are easy to store, and the scheme extends easily (a one-hot advantage);
# 2. Discretized features are more robust to outliers: with age > 30 mapped to 1 and otherwise 0, an age of 200 no longer disturbs the model much;
# 3. LR is a generalized linear model with limited expressiveness; after discretization each bucket gets its own weight, which introduces non-linearity and improves the fit (a one-hot advantage);
# 4. Discretized features can be crossed, turning M + N variables into M * N and introducing further non-linearity (a one-hot advantage);
# 5. The model is more stable after discretization: a user's age bucket does not change just because they turn one year older.

# There are more reasons besides; LightGBM added data binning as one of its improvements over XGBoost, which strengthens generalization.

bins = [i * 10 for i in range(31)]  # renamed from `bin` to avoid shadowing the built-in
data['power_bin'] = pd.cut(data['power'], bins, labels=False)
data[['power_bin', 'power']].head()

   power_bin  power
0        5.0     60
1        NaN      0
2       16.0    163
3       19.0    193
4        6.0     68
data['power_bin']  # with labels=False, power_bin is the index of the bucket each power value falls into
0          5.0
1          NaN
2         16.0
3         19.0
4          6.0
          ... 
199032    11.0
199033     7.0
199034    22.0
199035     NaN
199036     6.0
Name: power_bin, Length: 199037, dtype: float64
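For comparison, here is equal-width binning (as above, via `pd.cut`) next to equal-frequency binning (`pd.qcut`) on a small illustrative sample; note that values outside the edges (0 and 450 here) come back as NaN from `pd.cut`:

```python
import pandas as pd

power = pd.Series([0, 5, 60, 68, 163, 193, 300, 450])

# Equal-width bins of size 10 on (0, 300]; labels=False returns the bin index
edges = [i * 10 for i in range(31)]
width_bin = pd.cut(power, edges, labels=False)

# Equal-frequency bins: each of the 4 buckets gets roughly the same number of rows
freq_bin = pd.qcut(power, q=4, labels=False)
print(width_bin.tolist())
print(freq_bin.tolist())
```

Equal-width buckets can be nearly empty under a long-tailed distribution, while equal-frequency buckets adapt to the data but place their edges at data-dependent quantiles.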
# Once the new features are in place, the raw columns can be dropped
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns
(199037, 39)





Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
       'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
       'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
       'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time',
       'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_sum', 'brand_price_std',
       'brand_price_average', 'power_bin'],
      dtype='object')
# The data is already usable by tree models, so export it
data.to_csv('data_for_tree.csv', index=False)
data.head()
(data.head() output: the 39 remaining columns, SaleID through power_bin — omitted)
# Next, build a second feature set for models such as LR and NN
# The sets are built separately because different models have different requirements on the data
# First, look at the distribution:
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2870d506978>

(figure: histogram of data['power']; original file output_31_1.png)

# We already removed outliers from train, yet the distribution still looks this odd because of the power outliers in test.
# In hindsight it would have been better not to delete the power outliers from train, and to truncate the long tail instead.
train['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x287001c6f60>

(figure: histogram of train['power']; original file output_32_1.png)
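The long-tail truncation mentioned above can be done with `Series.clip`, capping extreme values instead of deleting rows; a minimal sketch on made-up values:

```python
import pandas as pd

power = pd.Series([0, 60, 68, 163, 193, 19312])  # one extreme value in the tail

# Long-tail truncation: cap at the 99th percentile rather than dropping the row
upper = power.quantile(0.99)
clipped = power.clip(upper=upper)
print(clipped.max())
```

Because the row is kept, the same transform can be applied identically to train and test.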

# Take the log, then min-max normalize
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()  # sklearn's scaler would also work; below the normalization is done by hand
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2870022bac8>

(figure: histogram of the log-transformed, normalized data['power']; original file output_33_1.png)
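For intuition on why the log helps here: applying log1p to a long-tailed sample pulls the tail in before min-max scaling, so the bulk of the values no longer gets crushed near zero. A sketch with synthetic data (the Pareto sample merely stands in for power):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(2.0, size=10_000)        # long-tailed sample standing in for 'power'

logged = np.log1p(x)                    # log(1 + x) compresses the tail
scaled = (logged - logged.min()) / (logged.max() - logged.min())
print(scaled.min(), scaled.max())
```

Without the log step, a single extreme value would dominate the min-max range and map almost everything else to a tiny interval near 0.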

# kilometer looks reasonable; it has probably been binned already
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x287002b1358>

(figure: histogram of data['kilometer']; original file output_34_1.png)

data['kilometer'].value_counts()  # the handful of distinct values below is why kilometer looks pre-binned
15.0    128682
12.5     20958
10.0      8506
9.0       6992
8.0       6043
7.0       5442
6.0       4886
5.0       4197
4.0       3576
3.0       3309
2.0       3034
0.5       2431
1.0        981
Name: kilometer, dtype: int64
data['power_bin'].plot.hist()  # check what power looks like after binning
<matplotlib.axes._subplots.AxesSubplot at 0x287004c1ac8>

(figure: histogram of data['power_bin']; original file output_36_1.png)

# So kilometer can be normalized directly
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / 
                        (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x28700291e10>

(figure: histogram of the normalized data['kilometer']; original file output_37_1.png)

# Beyond that, there are the statistics features constructed earlier:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# Rather than analyzing each one individually, apply the same min-max transform to all of them:
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])
data.head()

(data.head() output: same columns as before, with the brand statistics now scaled to [0, 1] — omitted)
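One caveat: `np.min`/`np.max` above are taken over `data`, which concatenates train and test. If the transform must not peek at test, learn the statistics on train only and reuse them; a numpy sketch with made-up values:

```python
import numpy as np

train_col = np.array([13.0, 3199.0, 84000.0])
test_col = np.array([99999.0, 500.0])      # test may fall outside the train range

# Learn min/max on train only, then apply the same transform to test
lo, hi = train_col.min(), train_col.max()
train_scaled = (train_col - lo) / (hi - lo)
test_scaled = (test_col - lo) / (hi - lo)  # values outside [0, 1] are possible
print(test_scaled)
```

Scaled test values outside [0, 1] are expected and usually harmless; what matters is that both sets pass through the identical transform.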
# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])
data.head()
(data.head() 输出:one-hot 编码之后共 370 列,表格过宽,此处从略)

5 rows × 370 columns
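上面的 pd.get_dummies 就是 one-hot 编码:每个类别取值变成一列 0/1 特征。下面是一个最小示例(数据为虚构,仅作示意):

```python
import pandas as pd

# 虚构的小表,演示 get_dummies 的 one-hot 效果
df = pd.DataFrame({'gearbox': [0.0, 1.0, 0.0], 'power': [60, 0, 163]})
encoded = pd.get_dummies(df, columns=['gearbox'])

# 原列被替换为 gearbox_0.0 / gearbox_1.0 两个 0/1 列
print(encoded.columns.tolist())  # ['power', 'gearbox_0.0', 'gearbox_1.0']
```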

print(data.shape)
data.columns
(199037, 370)





Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'price',
       'v_0', 'v_1', 'v_2',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=370)
# 这份数据可以给 LR 用
data.to_csv('data_for_lr.csv', index=False)

3.3.3 特征筛选

1) 过滤式

# 相关性分析
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))
0.5728285196051496
-0.4082569701616764
0.058156610025581514
0.3834909576057687
0.259066833880992
0.38691042393409447
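这里用 spearman 而不是默认的 pearson,因为 Spearman 相关系数基于秩,只要求单调关系,对 price 这类长尾分布更稳健。一个虚构的小例子可以说明两者的差别:

```python
import pandas as pd

# 虚构数据:y 是 x 的单调但非线性的函数
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

print(x.corr(y, method='pearson'))   # 小于 1:线性相关不完全
print(x.corr(y, method='spearman'))  # 等于 1:秩完全一致
```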
# 当然也可以直接看图
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average', 
                     'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x28700594b70>

(图:Correlation of Numeric Features with Price 相关性热力图,原图 output_47_1.png 缺失)

x.columns
Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'v_0',
       'v_1', 'v_2', 'v_3',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=369)
# from sklearn.model_selection import cross_val_score, ShuffleSplit
# from sklearn.ensemble import RandomForestRegressor
# import numpy as np

# X = x.values  # 转为 numpy 数组,才能用 X[:, i:i+1] 切片
# Y = y
# # names = x.columns

# rf = RandomForestRegressor(n_estimators=20, max_depth=4)
# scores = []
# # 单独采用每个特征进行建模,并进行交叉验证
# for i in range(X.shape[1]):
#     score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",  # 注意 X[:, i] 和 X[:, i:i+1] 的区别
#                             cv=ShuffleSplit(n_splits=3, test_size=0.3))
#     scores.append(format(np.mean(score), '.3f'))  # , names[i]
# print(sorted(scores, reverse=True))
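上面注释掉的单特征建模思路,换成新版 sklearn 的 ShuffleSplit 接口、在一组虚构数据上可以这样跑通(仅作示意,数据与参数均为假设):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 5 * X[:, 0] + 0.1 * rng.rand(200)  # 只有第 0 列是有效特征

rf = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

scores = []
for i in range(X.shape[1]):
    # X[:, i:i+1] 保持二维,sklearn 要求输入形状为 (n_samples, n_features)
    score = cross_val_score(rf, X[:, i:i+1], y, scoring='r2', cv=cv)
    scores.append(np.mean(score))
print(scores)  # 第 0 列的得分应明显高于其余两列
```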

2) 包裹式

# !pip install mlxtend
# 将 city 列中的空值置 0(注意:空字符串是假值,能被 not x 捕获;NaN 是真值,需另行用 fillna 处理)
def fill(x):
    if not x:
        x = 0
    return int(x)
data['city'] = data['city'].map(fill)
data.groupby("city").describe()
#data['city'].isnull().sum()
data['city'].value_counts()
0    48645
1    42188
2    35133
3    27325
4    19945
5    13462
6     8313
7     3887
8      139
Name: city, dtype: int64
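上面的 fill 依赖"空字符串是假值"这一点;如果缺失以 NaN 形式出现(NaN 是真值,`not x` 拦不住,int(NaN) 会直接报错),用 fillna 更稳妥。下面是一个虚构的小例子:

```python
import numpy as np
import pandas as pd

# 虚构示例:city 列里混有 NaN 和空字符串
s = pd.Series(['1', '2', np.nan, ''])
city = s.replace('', np.nan).fillna(0).astype(float).astype(int)
print(city.tolist())  # [1, 2, 0, 0]
```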
data['price'][0:train.shape[0]].isnull().sum()
0
print (train.shape[0])
print (test.shape[0])
149037
50000
# k_feature 太大会很难跑,没服务器,所以提前 interrupt 了
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
          k_features=20,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
x = data.drop(['price'], axis=1)
x = x.fillna(0)[0:train.shape[0]]
y = data['price'][0:train.shape[0]]
sfs.fit(x, y)
sfs.k_feature_names_ 
('kilometer',
 'v_3',
 'v_4',
 'v_6',
 'v_13',
 'v_14',
 'used_time',
 'brand_price_average',
 'model_44.0',
 'model_105.0',
 'model_113.0',
 'model_167.0',
 'brand_16',
 'bodyType_6.0',
 'gearbox_1.0',
 'power_bin_6.0',
 'power_bin_18.0',
 'power_bin_24.0',
 'power_bin_25.0',
 'power_bin_26.0')
# k_feature 太大会很难跑,没服务器,所以提前 interrupt 了
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
          k_features=10,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
x = data.drop(['price','city'],axis=1)
#x.head()
x = x.fillna(0)[0:train.shape[0]]
y = data['price'][0:train.shape[0]]
sfs.fit(x, y)
sfs.k_feature_names_ 
('kilometer',
 'v_3',
 'v_4',
 'v_13',
 'v_14',
 'used_time',
 'brand_price_average',
 'model_167.0',
 'gearbox_1.0',
 'power_bin_24.0')
# k_feature = sfs.get_metric_dict()
# for fea in k_feature:
#     fea = k_feature[fea]
#     print(f"Feature Name: {fea['feature_names']},")
#     print(f"Avg_Score: {fea['avg_score']}")
x.head()
#print(train.shape[0])
SaleIDnamepowerkilometersellerofferTypev_0v_1v_2v_3v_4v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14trainused_timecitybrand_amountbrand_price_maxbrand_price_medianbrand_price_minbrand_price_sumbrand_price_stdbrand_price_averagemodel_0.0model_1.0model_2.0model_3.0model_4.0model_5.0model_6.0model_7.0model_8.0model_9.0model_10.0model_11.0model_12.0model_13.0model_14.0model_15.0model_16.0model_17.0model_18.0...bodyType_0.0bodyType_1.0bodyType_2.0bodyType_3.0bodyType_4.0bodyType_5.0bodyType_6.0bodyType_7.0fuelType_0.0fuelType_1.0fuelType_2.0fuelType_3.0fuelType_4.0fuelType_5.0fuelType_6.0gearbox_0.0gearbox_1.0notRepairedDamage_-notRepairedDamage_0.0notRepairedDamage_1.0power_bin_0.0power_bin_1.0power_bin_2.0power_bin_3.0power_bin_4.0power_bin_5.0power_bin_6.0power_bin_7.0power_bin_8.0power_bin_9.0power_bin_10.0power_bin_11.0power_bin_12.0power_bin_13.0power_bin_14.0power_bin_15.0power_bin_16.0power_bin_17.0power_bin_18.0power_bin_19.0power_bin_20.0power_bin_21.0power_bin_22.0power_bin_23.0power_bin_24.0power_bin_25.0power_bin_26.0power_bin_27.0power_bin_28.0power_bin_29.0
007360.4150910.8275860043.3577963.9663440.0502572.1597441.1437860.2356760.1019880.1295490.0228160.097462-2.8818032.804097-2.4208210.7952920.91476214385.010.3241250.3407860.0320750.0020640.2096840.2076600.0816550000000000000000000...01000000100000010010000001000000000000000000000000
1122620.0000001.0000000045.3052735.2361120.1379251.380657-1.4221650.2647770.1210040.1357310.0265970.020582-4.9004822.096338-1.030483-1.7226740.24552214757.040.4343410.8352300.2056230.0041280.7139850.4370020.2573050000000000000000000...00100000100000010100000000000000000000000000000000
22148740.5149540.8275860045.9783594.8237921.319524-0.998467-0.9969110.2514100.1149120.1651470.0621730.027075-4.8467491.8035591.565330-0.832687-0.22996314382.020.0461170.4335780.2849060.0918470.0825330.2523620.2818340000000000000000000...01000000100000010010000000000000000010000000000000
33718650.5319171.0000000045.6874784.492574-0.0506160.883600-2.2280790.2742930.1103000.1219640.0333950.000000-4.5095991.285940-0.501868-2.438353-0.47869917125.000.4450990.9268890.1603770.0041280.6505910.3984470.2252120000000000000000000...10000000100000001010000000000000000000010000000000
441110800.4275350.3103450044.3835112.0314330.572169-1.5712392.2460880.2280360.0732050.0918800.0788190.121534-1.8962400.9107830.9311102.8345181.92348211531.060.1480900.2945450.0509430.0092880.0885240.1445790.0730200000000000000000000...01000000100000010010000000100000000000000000000000

5 rows × 369 columns

# 画出来,可以看到边际效益
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

(图:SFS 特征数-得分曲线,可以看到边际效益递减,原图 output_60_1.png 缺失)

pd.DataFrame.from_dict(sfs.get_metric_dict()).T
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
   | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err
1  | (9,) | [0.5580593794194673] | 0.558059 | (v_3,) | NaN | 0 | NaN
2  | (9, 30) | [0.6253563249806938] | 0.625356 | (v_3, brand_price_average) | NaN | 0 | NaN
3  | (3, 9, 30) | [0.6614119003709955] | 0.661412 | (kilometer, v_3, brand_price_average) | NaN | 0 | NaN
4  | (3, 9, 30, 335) | [0.6712706106724942] | 0.671271 | (kilometer, v_3, brand_price_average, gearbox_... | NaN | 0 | NaN
5  | (3, 9, 30, 198, 335) | [0.6801326459700268] | 0.680133 | (kilometer, v_3, brand_price_average, model_16... | NaN | 0 | NaN
6  | (3, 9, 22, 30, 198, 335) | [0.686927264547389] | 0.686927 | (kilometer, v_3, used_time, brand_price_averag... | NaN | 0 | NaN
7  | (3, 9, 19, 22, 30, 198, 335) | [0.6941981569972937] | 0.694198 | (kilometer, v_3, v_13, used_time, brand_price_... | NaN | 0 | NaN
8  | (3, 9, 10, 19, 22, 30, 198, 335) | [0.6990798224753535] | 0.69908 | (kilometer, v_3, v_4, v_13, used_time, brand_p... | NaN | 0 | NaN
9  | (3, 9, 10, 19, 22, 30, 198, 335, 363) | [0.7036045618841336] | 0.703605 | (kilometer, v_3, v_4, v_13, used_time, brand_p... | NaN | 0 | NaN
10 | (3, 9, 10, 19, 20, 22, 30, 198, 335, 363) | [0.7073329162983002] | 0.707333 | (kilometer, v_3, v_4, v_13, v_14, used_time, b... | NaN | 0 | NaN
11 | (3, 9, 10, 12, 19, 20, 22, 30, 198, 335, 363) | [0.7116225332630737] | 0.711623 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
12 | (3, 9, 10, 12, 19, 20, 22, 30, 198, 295, 335, ... | [0.7152839215589477] | 0.715284 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
13 | (3, 9, 10, 12, 19, 20, 22, 30, 198, 295, 335, ... | [0.7183533830790547] | 0.718353 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
14 | (3, 9, 10, 12, 19, 20, 22, 30, 144, 198, 295, ... | [0.7210323407042653] | 0.721032 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
15 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 144, 198, 2... | [0.7235732490774848] | 0.723573 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
16 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 144, 198, 2... | [0.726091372443646] | 0.726091 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
17 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7286164680329102] | 0.728616 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
18 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7309480347784469] | 0.730948 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
19 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7332378240942985] | 0.733238 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN
20 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7352137419490058] | 0.735214 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN

3) 嵌入式

# 下一章介绍,Lasso 回归和决策树可以完成嵌入式特征选择
# 大部分情况下都是用嵌入式做特征筛选
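以 Lasso 为例,嵌入式筛选可以这样示意(数据为虚构):L1 正则会把无用特征的系数压成精确的 0,保留非零系数的特征即完成选择。

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)  # 只有前两列有效

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# 系数非零的特征即被"选中"
selected = [i for i, c in enumerate(lasso.coef_) if abs(c) > 1e-6]
print(selected)  # [0, 1]
```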

3.4 经验总结

特征工程是比赛中至关重要的一块。特别是在传统比赛中,大家的模型可能都差不多,调参带来的效果增幅非常有限,特征工程的好坏往往决定了最终的排名和成绩。

特征工程的主要目的还是在于将数据转换为能更好地表示潜在问题的特征,从而提高机器学习的性能。比如,异常值处理是为了去除噪声,填补缺失值可以加入先验知识等。

特征构造也属于特征工程的一部分,其目的是为了增强数据的表达。

有些比赛的特征是匿名特征,这导致我们并不清楚特征相互之间的关联性,这时就只能单纯基于特征本身进行处理,比如分箱、groupby、agg 等特征统计操作,此外还可以对特征进行进一步的 log、exp 等变换,或者对多个特征进行四则运算(如上面我们算出的使用时长)、多项式组合等,然后再进行筛选。特征的匿名性其实限制了很多处理手段,当然有些时候用 NN 去提取特征也会达到意想不到的良好效果。

对于知道特征含义(非匿名)的特征工程,特别是在工业类型比赛中,会基于信号处理、频域提取、丰度、偏度等构建更有实际意义的特征,这就是结合背景的特征构建。在推荐系统中也是这样,各种类型的点击率统计、各时段统计、加上用户属性的统计等等。这样的特征构建往往要深入分析背后的业务逻辑或者说物理原理,才能更好地找到 magic。

当然特征工程其实是和模型结合在一起的,这就是为什么要为 LR、NN 做分桶和特征归一化的原因;而特征处理的效果和特征重要性等,往往要通过模型来验证。

总的来说,特征工程是一件入门简单、但想精通非常难的事。

Task 3-特征工程 END.

— By: 阿泽

PS:复旦大学计算机研究生
知乎:阿泽 https://www.zhihu.com/people/is-aze(主要面向初学者的知识整理)

关于Datawhale:

Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。

本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:

学习笔记

  1. 特征工程就是把数据转化成能够更好地表示潜在问题的特征,特征工程决定了预测效果的上限。

  2. 数据理解:通过定性数据和定量数据了解数据性质,方便进行后续的数据处理。

  3. 数据清洗(提高数据质量):我们进行了缺失值和异常值的处理。对于长尾分布的一些特征,可以用截断法来代替直接删除异常值;如果要用线性模型,还要对数据进行标准化/归一化。

  4. 特征构造(为了增强数据表达,添加先验知识):在这里我们构造了时间差特征,但发现其中存在一些 NaT 值,我的处理是用这列时间差数据的平均值来填补这些缺失值;同时对 power 进行了数据分桶,添加了 power_bin 这个新特征。至于为何推测 kilometer 已经做过数据分桶,观察原数据可以很明显地看到 kilometer 只有有限的几类。对于地理信息,因为有先验知识,我们去除后三位,留下城市信息;由于处理后可能存在空值,我们用 0 来替换,组成新的一类。通过 one-hot 编码进行了一些非线性变化,好处写在上面了。

  5. 特征选择:
    过滤式-Filter(通过特征与price之间的相关性筛选出一些特征)
    包裹式-Wrapper(用贪心法来找出比较优的一组特征)
    嵌入式-Embedding(学习器自动选择特征)
    个人感觉对于这个题来说特征没有特别多,没太有必要进行特征选择。

  6. 还要记得对一些不符合正态分布的数据进行取 log 处理,使其尽量接近正态分布。
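第 6 点的取 log 通常写作 np.log1p(即 log(1+x),避免 0 值取对数报错),配合 Series.skew 可以直接检验偏度的变化(下面的数据为虚构的长尾"价格"):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
price = pd.Series(rng.lognormal(mean=8, sigma=1, size=10000))  # 虚构的右偏价格

print(price.skew())            # 远大于 0,明显右偏
print(np.log1p(price).skew())  # 接近 0,更接近正态
```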
