二手车交易价格预测task1

赛题概况

赛题以预测二手车的交易价格为任务,数据集报名后可见并可下载,该数据来自某交易平台的二手车交易记录,总数据量超过40w,包含31列变量信息,其中15列为匿名变量。为了保证比赛的公平性,将会从中抽取15万条作为训练集,5万条作为测试集A,5万条作为测试集B,同时会对name、model、brand和regionCode等信息进行脱敏。

本赛题的评价标准为MAE(Mean Absolute Error)😗*

M A E = ∑ i = 1 n ∣ y i − y ^ i ∣ n MAE=\frac{\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|}{n} MAE=ni=1nyiy^i
其中 y i y_{i} yi代表第 i i i个样本的真实值,其中 y ^ i \hat{y}_{i} y^i代表第 i i i个样本的预测值。

代码示例
## 基础工具
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
import csv
warnings.filterwarnings('ignore')
%matplotlib inline

## 模型预测的
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## 数据降维处理的
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb

## 参数搜索和评价的
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
def read_file(path,count=None):
    '''
    定义读取文件的函数
        path: 文件路径
        count: 读取行数
    '''
    # 1
#     list = []
#     with open(path,"r",encoding='UTF-8') as f: #以只读的方式打开文件
#         read_scv=csv.reader(f) #调用csv的reader方法读取文件并赋值给read_scv变量
#         for i,line in enumerate(read_scv):
#             if i == count:
#                 break
#             list.append(line) #将读取到的数据追加到list列表里面
#     list = pd.DataFrame(list)
#     list.columns=['id','heartbeat_signals','label']
#     return list[1:] #返回列表数据
    # 2
    return pd.read_csv(path,sep=' ',nrows = count)
Train_data = read_file('used_car_train_20200313.csv')
TestA_data = read_file('used_car_testA_20200313.csv')

数据字段

  • SaleID 交易ID,唯一编码
  • name 汽车交易名称,已脱敏
  • regDate 汽车注册日期,例如20160101,2016年01月01日
  • model 车型编码,已脱敏
  • brand 汽车品牌,已脱敏
  • bodyType 车身类型:豪华轿车:0,微型车:1,厢型车:2,大巴车:3,敞篷车:4,双门汽车:5,商务车:6,搅拌车:7
  • fuelType 燃油类型:汽油:0,柴油:1,液化石油气:2,天然气:3,混合动力:4,其他:5,电动:6
  • gearbox 变速箱:手动:0,自动:1
  • power 发动机功率:范围 [ 0, 600 ]
  • kilometer 汽车已行驶公里,单位万km
  • notRepairedDamage 汽车有尚未修复的损坏:是:0,否:1
  • regionCode 地区编码,已脱敏
  • seller 销售方:个体:0,非个体:1
  • offerType 报价类型:提供:0,请求:1
  • creatDate 汽车上线时间,即开始售卖时间
  • price 二手车交易价格(预测目标)
  • v_0’, ‘v_1’, ‘v_2’, ‘v_3’, ‘v_4’, ‘v_5’, ‘v_6’, ‘v_7’, ‘v_8’, ‘v_9’, ‘v_10’, ‘v_11’, ‘v_12’, ‘v_13’,‘v_14’(根据汽车的评论、标签等大量信息得到的embedding向量)匿名特征,包含v0-14在内15个匿名特征

数据观察

# 数据纵览
Train_data.head().append(Train_data.tail())
# TestA_data.head().append(TestA_data.tail())
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
007362004040230.061.00.00.06012.5...0.2356760.1019880.1295490.0228160.097462-2.8818032.804097-2.4208210.7952920.914762
1122622003030140.012.00.00.0015.0...0.2647770.1210040.1357310.0265970.020582-4.9004822.096338-1.030483-1.7226740.245522
221487420040403115.0151.00.00.016312.5...0.2514100.1149120.1651470.0621730.027075-4.8467491.8035591.565330-0.832687-0.229963
337186519960908109.0100.00.01.019315.0...0.2742930.1103000.1219640.0333950.000000-4.5095991.285940-0.501868-2.438353-0.478699
4411108020120103110.051.00.00.0685.0...0.2280360.0732050.0918800.0788190.121534-1.8962400.9107830.9311102.8345181.923482
14999514999516397820000607121.0104.00.01.016315.0...0.2802640.0003100.0484410.0711580.0191741.988114-2.9839730.589167-1.304370-0.302592
14999614999618453520091102116.0110.00.00.012510.0...0.2532170.0007770.0840790.0996810.0793711.839166-2.7746152.5539940.924196-0.272160
1499971499971475872010100360.0111.01.00.0906.0...0.2333530.0007050.1188720.1001180.0979142.439812-1.6306772.2901971.8919220.414931
149998149998459072006031234.0103.01.00.015615.0...0.2563690.0002520.0814790.0835580.0814982.075380-2.6337191.4149370.431981-1.659014
1499991499991776721999020419.0286.00.01.019312.5...0.2844750.0000000.0400720.0625430.0258191.978453-3.1799130.031724-1.483350-0.342674

10 rows × 31 columns

#数据行列信息
print('Train data shape:',Train_data.shape)
print('Test data shape:',TestA_data.shape)

Train data shape: (150000, 31)
Test data shape: (50000, 30)
# 数据信息查看
# 通过info来了解数据每列的type
# Train_data.info()
TestA_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48587 non-null  float64
 6   fuelType           47107 non-null  float64
 7   gearbox            48090 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
# 通过 .columns 查看列名
# Train_data.columns
TestA_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')
# 通过 .describe() 可以查看数值特征列的一些统计信息
# 个数count、平均值mean、方差std、最小值min、中位数25% 50% 75% 、以及最大值
# Train_data.describe()
TestA_data.describe()
SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14
count50000.00000050000.0000005.000000e+0450000.00000050000.00000048587.00000047107.00000048090.00000050000.00000050000.000000...50000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.00000050000.000000
mean174999.50000068542.2232802.003393e+0746.8445208.0562401.7821850.3734050.224350119.88362012.595580...0.2486690.0450210.1227440.0579970.062000-0.017855-0.013742-0.013554-0.0031470.001516
std14433.90106761052.8081335.368870e+0449.4695487.8194771.7607360.5464420.417158185.0973873.908979...0.0446010.0517660.1959720.0292110.0356533.7479853.2312582.5159621.2865971.027360
min150000.0000000.0000001.991000e+070.0000000.0000000.0000000.0000000.0000000.0000000.500000...0.0000000.0000000.0000000.0000000.000000-9.160049-5.411964-8.916949-4.123333-6.112667
25%162499.75000011203.5000001.999091e+0710.0000001.0000000.0000000.0000000.00000075.00000012.500000...0.2437620.0000440.0626440.0350840.033714-3.700121-1.971325-1.876703-1.060428-0.437920
50%174999.50000052248.5000002.003091e+0729.0000006.0000001.0000000.0000000.000000109.00000015.000000...0.2578770.0008150.0958280.0570840.0587641.613212-0.355843-0.142779-0.0359560.138799
75%187499.250000118856.5000002.007110e+0765.00000013.0000003.0000001.0000000.000000150.00000015.000000...0.2653280.1020250.1254380.0790770.0874892.8327081.2629141.7643350.9414690.681163
max199999.000000196805.0000002.015121e+07246.00000039.0000007.0000006.0000001.00000020000.00000015.000000...0.2916180.1532651.3588130.1563550.21477512.33887218.85621812.9504985.9132732.624622

8 rows × 29 columns

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值