Tianchi Used-Car Price Prediction Competition (Part 2): Feature Engineering Steps

Author: qq_40723803

Tags: data analysis

Article link: https://blog.csdn.net/qq_40723803/article/details/105068491

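Feature Engineering

1. Removing outliers from features

The function below wraps the outlier handling so it can be reused on any numeric column.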

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column using the box-plot rule (scale=3 by default).
    :param data: pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiplier applied to the interquartile range
    :return: a copy of data with the outlier rows dropped and the index reset
    """
    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule.
        :param data_ser: pandas Series
        :param box_scale: multiplier applied to the interquartile range
        :return: (rule_low, rule_up) boolean masks and (val_low, val_up) bounds
        """
        # Threshold: interquartile range * box_scale
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        # Lower bound: lower quartile - threshold
        val_low = data_ser.quantile(0.25) - iqr
        # Upper bound: upper quartile + threshold
        val_up = data_ser.quantile(0.75) + iqr
        # Boolean mask of values below the lower bound
        rule_low = (data_ser < val_low)
        # Boolean mask of values above the upper bound
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]

    # A row is removed if it violates either bound (union of the two masks)
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    mask = rule[0] | rule[1]
    print("Number of outliers removed: {}".format(mask.sum()))
    # Filter with the boolean mask rather than positional indices, so the
    # result is also correct when the index of data is not 0..n-1
    data_n = data_n[~mask]
    data_n.reset_index(drop=True, inplace=True)
    print("Rows remaining after removing outliers: {}".format(data_n.shape[0]))

    # data_series is unchanged, so it can still be used to describe the outliers
    outliers = data_series[rule[0]]
    print("Summary of the outliers below the lower bound:")
    print(pd.Series(outliers).describe())
    outliers = data_series[rule[1]]
    print("Summary of the outliers above the upper bound:")
    print(pd.Series(outliers).describe())

    # Box plots of the column before (left) and after (right) outlier removal
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
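
To make the box-plot rule concrete, here is a minimal sketch on a toy Series. The helper name iqr_bounds and the sample values are invented for illustration; only the rule itself (quartiles minus or plus scale times the interquartile range) comes from the function above.

```python
import pandas as pd


def iqr_bounds(ser, scale=3):
    """Return (lower, upper) bounds: Q1 - scale*IQR and Q3 + scale*IQR."""
    q1, q3 = ser.quantile(0.25), ser.quantile(0.75)
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr


# Hypothetical values; 1000 is the obvious outlier.
toy = pd.Series([10, 12, 14, 16, 18, 1000])
low, up = iqr_bounds(toy, scale=3)

# A value is an outlier if it falls outside either bound.
mask = (toy < low) | (toy > up)
print("bounds:", low, up)
print("outliers:", toy[mask].tolist())
print("kept:", toy[~mask].tolist())
```

With pandas' default linear interpolation the quartiles here are 12.5 and 17.5, so the bounds are -2.5 and 32.5 and only the value 1000 is flagged. A smaller scale (1.5 is the conventional box-plot whisker) tightens the bounds and removes more rows.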

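Some outlier rows can be removed this way; power is used as the example here, but whether to actually delete them is a judgment call.
Note that rows may only be removed from the training set. Test-set rows must never be deleted: every one of them still needs a prediction, so dropping them would only be fooling yourself!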

# Number of outliers removed: 963
Train_data = outliers_proc(Train_data, 'power', scale=3)

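2. Feature construction

a. Combine the training and test sets so features can be constructed on both at once

After concatenation, the price column is NaN for the test-set rows.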

Train_data['train'] = 1
Test_data['train'] = 0
# axis=0 by default, so the rows of the two frames are stacked
data = pd.concat([Train_data, Test_data], ignore_index=True)
# (199037, 32)
print(data.shape)
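
As a rough sketch of why the train flag is useful: it lets the combined frame be split back into the two sets once the features are built. The toy frames and the names train_df, test_df, train_part and test_part are hypothetical stand-ins, not the competition data.

```python
import pandas as pd

# Hypothetical toy frames standing in for Train_data / Test_data.
train_df = pd.DataFrame({"power": [60, 75, 110], "price": [1850, 3600, 6222]})
test_df = pd.DataFrame({"power": [90, 150]})  # the test set has no price column

train_df["train"] = 1
test_df["train"] = 0

# Columns missing from one frame (price here) are filled with NaN.
data = pd.concat([train_df, test_df], ignore_index=True)
print(data)

# ... construct new features on `data` here ...

# Split back into the two sets using the flag column.
train_part = data[data["train"] == 1].drop(columns="train").reset_index(drop=True)
test_part = (
    data[data["train"] == 0]
    .drop(columns=["train", "price"])
    .reset_index(drop=True)
)
print(train_part.shape, test_part.shape)
```

Building features on the combined frame keeps the transformations identical for both sets; the flag column is dropped again before modeling so it does not leak into the features.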

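b. Constructing a usage-time (in days) feature

This feature reflects how long the car has been in use; in general, the longer a car has been used, the lower its price. It is computed as data['creatDate'] - data['regDate'].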
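
A minimal sketch of that subtraction, assuming both columns store dates as YYYYMMDD integers (an assumption; the encoding is not stated in the text above). The rows are made up, and used_time is a hypothetical name for the new column.

```python
import pandas as pd

# Hypothetical rows; the YYYYMMDD integer encoding is an assumption.
data = pd.DataFrame({
    "regDate":   [20040402, 20030301, 20070009],  # last value has month "00"
    "creatDate": [20160404, 20160309, 20160402],
})

# errors="coerce" turns unparseable dates into NaT instead of raising.
reg = pd.to_datetime(data["regDate"].astype(str), format="%Y%m%d", errors="coerce")
created = pd.to_datetime(data["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Timedelta -> whole days; rows with an unparseable date end up as NaN.
data["used_time"] = (created - reg).dt.days
print(data)
print("rows with a missing used_time:", int(data["used_time"].isna().sum()))
```

Coercing bad dates to NaT keeps the rows in the frame, which matters for the test set; the resulting NaN values can then be counted and handled explicitly.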

b-1. Using pd.to_datetime

# First, how date strings map to Python format codes:
1/17/07 has the format "%m/%d/%y"
17-1-
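
The first format string listed above can be tried directly with pd.to_datetime. This is only a small illustration; the second date in the Series is an invented example.

```python
import pandas as pd

# "1/17/07" read as month/day/two-digit year, per the format string "%m/%d/%y".
ts = pd.to_datetime("1/17/07", format="%m/%d/%y")
print(ts)

# The same call works element-wise on a Series of strings.
ser = pd.to_datetime(pd.Series(["1/17/07", "12/31/99"]), format="%m/%d/%y")
print(ser.dt.year.tolist())
```

Note that %y expands two-digit years by the usual strptime convention (00 to 68 map to 2000 to 2068, 69 to 99 map to 1969 to 1999), so "07" is read as 2007 and "99" as 1999.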
