Used Car Price Prediction: Task 3, Feature Engineering

唐yi壹佰

Category: Used Car Price Prediction. Tags: python, big data, machine learning, data mining

Article link: https://blog.csdn.net/m0_46668150/article/details/115619164

Contents

Preface
I. Code Examples
Summary

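Preface

Feature engineering plays a crucial role in data mining. In data mining competitions in particular, it is usually the key to improving the score. The data for this used car price prediction competition contains many features, so we need to select among them and, depending on the model being used, construct additional ones.

Note: the body of the article follows; the examples below are for reference.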

I. Code Examples

1. Importing libraries

The code is as follows (example):

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

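2. Reading the data

The code is as follows (example):

# The competition files are space-separated, hence sep=' '.
# Every backslash in the Windows path is escaped, including the one before the Chinese directory name.
train = pd.read_csv('C:\\Users\\TINKPAD\\Desktop\\python_work\\kaggle\\二手车交易价格预测\\used_car_train_20200313.csv', sep=' ')
test = pd.read_csv('C:\\Users\\TINKPAD\\Desktop\\python_work\\kaggle\\二手车交易价格预测\\used_car_testB_20200421.csv', sep=' ')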
print(train.shape)
print(test.shape)

(150000, 31)
(50000, 30)
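
To see how `sep=' '` parses a space-separated file without needing the competition data, here is a minimal sketch. The three sample rows and the reduced column set are made up for illustration; only the column names `SaleID`, `name` and `price` come from the real data set.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same space-separated layout
# as the competition files (values are invented).
sample = io.StringIO(
    "SaleID name price\n"
    "0 736 1850\n"
    "1 2262 3600\n"
    "2 14874 6222\n"
)

# sep=" " splits each line on single spaces instead of commas.
demo = pd.read_csv(sample, sep=" ")
print(demo.shape)
print(list(demo.columns))
```

For a real file on Windows, a raw string such as `r'C:\path\to\file.csv'` is another way to avoid escaping each backslash.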

print(train.columns)
print(test.columns)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')
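
The two column listings differ only in `price`, the prediction target, which is why the training set has 31 columns and the test set has 30. A small sketch of checking this in code rather than by eye follows; the helper name `column_diff` and the tiny demo frames are assumptions made for illustration, not part of the original post.

```python
import pandas as pd


def column_diff(train_df, test_df):
    """Return the columns found only in train_df and only in test_df."""
    only_train = [c for c in train_df.columns if c not in test_df.columns]
    only_test = [c for c in test_df.columns if c not in train_df.columns]
    return only_train, only_test


# Hypothetical miniature frames standing in for train and test.
train_demo = pd.DataFrame({"SaleID": [0, 1], "power": [60, 110], "price": [1850, 3600]})
test_demo = pd.DataFrame({"SaleID": [2, 3], "power": [75, 90]})

only_train, only_test = column_diff(train_demo, test_demo)
print(only_train, only_test)
```

Called on the real `train` and `test` frames, the same helper should report `price` as the only training-only column, given the listings above.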

3. Removing outliers

Here I wrap the outlier handling in a function so that it can be called wherever it is needed.

def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers; by default uses the box-plot rule with scale=3.
    :param data: a pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiplier applied to the interquartile range
    :return:
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers using the box-plot rule.
        :param data_ser: a pandas.Series
        :param box_scale: multiplier applied to the interquartile range
        :return: (rule_low, rule_up), (val_low, val_up)
        """
        # Despite its name, iqr holds the interquartile range already multiplied by box_scale.
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positions (not labels) of the rows outside either bound.
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    # Map positions to labels so the drop also works when the index is not 0..n-1.
    data_n = data_n.drop(data_n.index[index])
    data_n.reset_index(drop=True, inplace=True)
    print("Now row number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np
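
The listing above is cut off in the source, so the rest of the author's function is not available. As a stand-in, here is a compact, self-contained sketch of the same box-plot rule: a row is dropped when its value lies more than `scale` interquartile ranges below the first quartile or above the third. The names `box_plot_bounds` and `drop_outliers`, the choice of the `power` column, and the demo values are all assumptions for illustration; this is not a reconstruction of the missing code.

```python
import pandas as pd


def box_plot_bounds(series, scale=3):
    """Return (lower, upper) bounds: Q1 - scale*IQR and Q3 + scale*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    margin = scale * (q3 - q1)
    return q1 - margin, q3 + margin


def drop_outliers(data, col_name, scale=3):
    """Return a copy of data without the rows whose col_name value is out of bounds."""
    low, up = box_plot_bounds(data[col_name], scale)
    below = data[col_name] < low
    above = data[col_name] > up
    print("Rows below the lower bound: {}".format(int(below.sum())))
    print("Rows above the upper bound: {}".format(int(above.sum())))
    # Boolean masks avoid mixing positional and label-based indexing.
    # Rows with NaN in col_name compare as False on both sides and are kept.
    return data.loc[~(below | above)].reset_index(drop=True)


# Hypothetical data: one implausibly large power value among ordinary ones.
demo = pd.DataFrame({"power": [60, 75, 90, 100, 110, 120, 150, 20000]})
cleaned = drop_outliers(demo, "power", scale=3)
print(cleaned["power"].tolist())
```

In a competition setting a function like this would normally be applied to the training set only, since dropping rows from the test set would leave those rows without predictions.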
