Feature Engineering for the Tianchi Used-Car Price Prediction Data

wxlang123

Tags: data mining, Python

Article link: https://blog.csdn.net/wxlang123/article/details/105166513

Competition: Data Mining for Beginners - Used-Car Transaction Price Prediction
URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

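In data mining, today's open-source algorithms are much alike in practice, and their accuracy does not differ by much. In my view, what ultimately decides the result is, first, how the data is processed and, second, which model is chosen. In technical terms, the data-processing part is called feature engineering.
As I see it, the common feature engineering steps are:
1: Outlier handling
Identify outliers with a box plot, then delete them
2: Missing-value handling
Delete, fill in, or leave as is
3: Normalization
Rescale continuous features or features with a very wide range, so that they disturb the model less
4: Feature selection
Choose suitable features for training the model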
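
As a rough illustration of steps 2 to 4 above, here is a minimal sketch on a made-up toy table. The column values are hypothetical and not taken from the competition data, and median filling, min-max scaling, and correlation-based ranking are only one possible choice for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the competition dataset.
toy = pd.DataFrame({
    "power": [60.0, 75.0, np.nan, 150.0, 110.0],
    "kilometer": [15.0, 12.5, 10.0, 3.0, np.nan],
    "price": [1500.0, 2300.0, 3100.0, 9800.0, 6200.0],
})

features = ["power", "kilometer"]

# Step 2 (missing values): fill each gap with the column median.
filled = toy.copy()
filled[features] = filled[features].fillna(filled[features].median())

# Step 3 (normalization): min-max scale every feature into [0, 1].
col_min = filled[features].min()
col_max = filled[features].max()
scaled = filled.copy()
scaled[features] = (filled[features] - col_min) / (col_max - col_min)

# Step 4 (feature selection): rank features by absolute correlation with the target.
ranking = scaled[features].corrwith(scaled["price"]).abs().sort_values(ascending=False)
print(ranking)
```

Whether to fill, drop, or keep missing values depends on the model; tree-based models, for example, can often work with gaps left in place.
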
I: A first look at the data

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

%matplotlib inline

path = './data/'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')
print(Train_data.shape)
print(Test_data.shape)

(150000, 31)
(50000, 30)

Train_data.head()

(Figure: output of Train_data.head(); image not preserved.)

Train_data.columns
Test_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')
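
The shapes printed earlier differ by one column (31 versus 30). A quick way to see which column exists only in the training set is a set difference of the two column indexes. The sketch below uses two tiny hypothetical frames so that it runs on its own; with the real data the same calls would be applied to Train_data and Test_data.

```python
import pandas as pd

# Hypothetical stand-ins for Train_data and Test_data.
train_demo = pd.DataFrame({"SaleID": [0, 1], "power": [60, 75], "target": [1.0, 2.0]})
test_demo = pd.DataFrame({"SaleID": [2, 3], "power": [90, 110]})

# Columns present in the training frame but missing from the test frame.
only_in_train = train_demo.columns.difference(test_demo.columns)
print(list(only_in_train))

# Missing-value count per column, a common next step in a first look.
print(train_demo.isnull().sum())
```
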
II: Handling outliers

def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column using the box-plot (IQR) rule.
    :param data: a pandas.DataFrame
    :param col_name: name of the column to check
    :param scale: box-plot scale, the multiple of the IQR used as the margin
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers using the box-plot rule.
        :param data_ser: a pandas.Series
        :param box_scale: box-plot scale (multiple of the IQR)
        :return: (rule_low, rule_up) boolean masks and (val_low, val_up) bounds
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positional indices of rows below the lower bound or above the upper bound
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    print("Now row number is: {}".format(data_n.shape[0]))
    # Summary statistics of the values below the lower bound
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
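
The listing above breaks off before the function returns anything. As a self-contained sketch of the same box-plot (IQR) rule, the hypothetical helper `drop_iqr_outliers` below removes rows whose value lies more than `scale` times the interquartile range outside the quartiles. It is an assumption about how such a procedure might be completed, not the author's remaining code.

```python
import pandas as pd


def drop_iqr_outliers(data, col_name, scale=3):
    """Return a copy of `data` without rows that are IQR outliers in `col_name`."""
    series = data[col_name]
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    margin = scale * (q3 - q1)
    low, up = q1 - margin, q3 + margin
    # Keep only rows inside [low, up]; rows with NaN in this column are kept.
    keep = series.between(low, up) | series.isna()
    return data[keep].reset_index(drop=True)


# Hypothetical toy data: one absurdly large power value.
demo = pd.DataFrame({"power": [60, 75, 90, 110, 120, 150, 20000]})
cleaned = drop_iqr_outliers(demo, "power", scale=3)
print(len(demo), "->", len(cleaned))
```

With `scale=3` only extreme values are dropped; the conventional box-plot whisker uses 1.5, which removes more rows. Outlier removal of this kind is normally applied to the training set only, since every test row still needs a prediction.
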
