Used Car Transaction Price Prediction

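This post covers data processing and modeling. It starts with an initial inspection of the data, converting the dtype of the 'notRepairedDamage' column and dropping uninformative features. Next comes exploratory data analysis, done with summary functions and the competition's data description because the dataset is too large to plot comfortably. Feature engineering follows: out-of-range power values are corrected and missing values are filled. Three models (random forest, XGBoost, GBDT) are then compared with cross-validation, XGBoost is selected and tuned, and finally the predictions are submitted.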


# Import libraries and configure the environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV  # cross-validation, grid search
pd.options.display.max_columns = None  # show all columns when displaying DataFrames
warnings.filterwarnings('ignore')  # suppress warnings to keep the output clean
%matplotlib inline
# Read the data and do some initial processing
df_train = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
df_test = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')
train = df_train.drop(['SaleID'], axis=1)
test = df_test.drop(['SaleID'], axis=1)

1. A First Look at the Data

train.head()
     name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  seller  offerType  creatDate  price        v_0       v_1        v_2        v_3        v_4       v_5       v_6       v_7       v_8       v_9       v_10      v_11       v_12       v_13       v_14
0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5                0.0        1046       0          0   20160404   1850  43.357796  3.966344   0.050257   2.159744   1.143786  0.235676  0.101988  0.129549  0.022816  0.097462  -2.881803  2.804097  -2.420821   0.795292   0.914762
1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0                  -        4366       0          0   20160309   3600  45.305273  5.236112   0.137925   1.380657  -1.422165  0.264777  0.121004  0.135731  0.026597  0.020582  -4.900482  2.096338  -1.030483  -1.722674   0.245522
2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5                0.0        2806       0          0   20160402   6222  45.978359  4.823792   1.319524  -0.998467  -0.996911  0.251410  0.114912  0.165147  0.062173  0.027075  -4.846749  1.803559   1.565330  -0.832687  -0.229963
3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0                0.0         434       0          0   20160312   2400  45.687478  4.492574  -0.050616   0.883600  -2.228079  0.274293  0.110300  0.121964  0.033395  0.000000  -4.509599  1.285940  -0.501868  -2.438353  -0.478699
4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0                0.0        6977       0          0   20160313   5200  44.383511  2.031433   0.572169  -1.571239   2.246088  0.228036  0.073205  0.091880  0.078819  0.121534  -1.896240  0.910783   0.931110   2.834518   1.923482
test.head()
     name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  seller  offerType  creatDate        v_0        v_1        v_2        v_3        v_4       v_5       v_6       v_7       v_8       v_9       v_10       v_11       v_12       v_13       v_14
0   66932  20111212  222.0      4       5.0       1.0      1.0    313       15.0                0.0        1440       0          0   20160329  49.593127   5.246568   1.001130  -4.122264   0.737532  0.264405  0.121800  0.070899  0.106558  0.078867  -7.050969  -0.854626   4.800151   0.620011  -3.664654
1  174960  19990211   19.0     21       0.0       0.0      0.0     75       12.5                1.0        5419       0          0   20160404  42.395926  -3.253950  -1.753754   3.646605  -0.725597  0.261745  0.000000  0.096733  0.013705  0.052383   3.679418  -0.729039  -3.796107  -1.541230  -0.757055
2    5356  20090304   82.0     21       0.0       0.0      0.0    109        7.0                0.0        5045       0          0   20160308  45.841370   4.704178   0.155391  -1.118443  -0.229160  0.260216  0.112081  0.078082  0.062078  0.050540  -4.926690   1.001106   0.826562   0.138226   0.754033
3   50688  20100405    0.0      0       0.0       0.0      1.0    160        7.0                0.0        4023       0          0   20160325  46.440649   4.319155   0.428897  -2.037916  -0.234757  0.260466  0.106727  0.081146  0.075971  0.048268  -4.864637   0.505493   1.870379   0.366038   1.312775
4  161428  19970703   26.0     14       2.0       0.0      0.0     75       15.0                0.0        3103       0          0   20160309  42.184604  -3.166234  -1.572058   2.604143   0.387498  0.250999  0.000000  0.077806  0.028600  0.081709   3.616475  -0.673236  -3.197685  -0.025678  -0.101290
# Overview: training set
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 30 columns):
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 34.3+ MB
# Overview: test set
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 29 columns):
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48587 non-null float64
fuelType             47107 non-null float64
gearbox              48090 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(8), object(1)
memory usage: 11.1+ MB
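1.1 The 'notRepairedDamage' column is the only non-numeric feature. It contains only 0, 1 or '-', so convert it to a numeric dtype and turn '-' into a missing value
# Replace '-' with NaN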
train['notRepairedDamage'] = train['notRepairedDamage'].replace('-', np.nan) 
test['notRepairedDamage'] = test['notRepairedDamage'].replace('-', np.nan)

# Convert the dtype
train['notRepairedDamage'] = train['notRepairedDamage'].astype('float64')
test['notRepairedDamage'] = test['notRepairedDamage'].astype('float64')

# Check that the conversion worked
train['notRepairedDamage'].unique(), test['notRepairedDamage'].unique()
(array([  0.,  nan,   1.]), array([  0.,   1.,  nan]))
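The two steps above (replace the placeholder, then cast) can also be done in one call. Here is a minimal sketch on a made-up toy Series, assuming the raw column holds strings such as '0.0', '1.0' and '-'; it uses `pd.to_numeric` with `errors='coerce'`, which is an alternative to the approach in this post, not what the original notebook ran.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw 'notRepairedDamage' column (values are made up)
raw = pd.Series(['0.0', '-', '1.0', '0.0'], name='notRepairedDamage')

# errors='coerce' turns anything that cannot be parsed as a number into NaN,
# so the '-' placeholder becomes a missing value and the result is numeric
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)
print(converted.isna().sum())
```

One caveat with this shortcut: it would also silently turn any other unexpected string into NaN, so it is worth checking `unique()` on the raw column first, as done above.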
# Summary statistics: test set
test.describe()
                name       regDate         model         brand      bodyType      fuelType       gearbox         power     kilometer  notRepairedDamage    regionCode   seller  offerType     creatDate           v_0           v_1           v_2           v_3           v_4           v_5           v_6           v_7           v_8           v_9          v_10          v_11          v_12          v_13          v_14
count   50000.000000  5.000000e+04  50000.000000  50000.000000  48587.000000  47107.000000  48090.000000  50000.000000  50000.000000       41969.000000  50000.000000  50000.0    50000.0  5.000000e+04  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000
mean    68542.223280  2.003393e+07     46.844520      8.056240      1.782185      0.373405      0.224350    119.883620     12.595580           0.112464   2590.604820      0.0        0.0  2.016033e+07     44.418233     -0.037238      0.050534      0.084640      0.015001      0.248669      0.045021      0.122744      0.057997      0.062000     -0.017855     -0.013742     -0.013554     -0.003147      0.001516
std     61052.808133  5.368870e+04     49.469548      7.819477      1.760736      0.546442      0.417158    185.097387      3.908979           0.315940   1876.970263      0.0        0.0  7.951521e+01      2.429950      3.642562      2.856341      2.026510      1.193026      0.044601      0.051766      0.195972      0.029211      0.035653      3.747985      3.231258      2.515962      1.286597      1.027360
min         0.000000  1.991000e+07      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.500000           0.000000      0.000000      0.0        0.0  2.015061e+07     28.987024     -4.137733     -4.205728     -5.638184     -4.287718      0.000000      0.000000      0.000000      0.000000      0.000000     -9.160049     -5.411964     -8.916949     -4.123333     -6.112667
25%     11203.500000  1.999091e+07     10.000000      1.000000      0.000000      0.000000      0.000000     75.000000     12.500000           0.000000   1030.000000      0.0        0.0  2.016031e+07     43.139621     -3.191909     -0.971266     -1.453453     -0.928089      0.243762      0.000044      0.062644      0.035084      0.033714     -3.700121     -1.971325     -1.876703     -1.060428     -0.437920
50%     52248.500000  2.003091e+07     29.000000      6.000000      1.000000      0.000000      0.000000    109.000000     15.000000           0.000000   2219.000000      0.0        0.0  2.016032e+07     44.611084     -3.050756     -0.388117      0.097881     -0.070225      0.257877      0.000815      0.095828      0.057084      0.058764      1.613212     -0.355843     -0.142779     -0.035956      0.138799
75%    118856.500000  2.007110e+07     65.000000     13.000000      3.000000      1.000000      0.000000    150.000000     15.000000           0.000000   3857.000000      0.0        0.0  2.016033e+07     45.992639      3.997323      0.240548      1.562700      0.863731      0.265328      0.102025      0.125438      0.079077      0.087489      2.832708      1.262914      1.764335      0.941469      0.681163
max    196805.000000  2.015121e+07    246.000000     39.000000      7.000000      6.000000      1.000000  20000.000000     15.000000           1.000000   8121.000000      0.0        0.0  2.016041e+07     51.751684      7.553517     18.394570      9.381599      5.270150      0.291618      0.153265      1.358813      0.156355      0.214775     12.338872     18.856218     12.950498      5.913273      2.624622
# Summary statistics: training set
train.describe()
                name       regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  notRepairedDamage     regionCode         seller  offerType     creatDate          price            v_0            v_1            v_2            v_3            v_4            v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
count  150000.000000  1.500000e+05  149999.000000  150000.000000  145494.000000  141320.000000  144019.000000  150000.000000  150000.000000      125676.000000  150000.000000  150000.000000   150000.0  1.500000e+05  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000
mean    68349.172873  2.003417e+07      47.129021       8.052733       1.792369       0.375842       0.224943     119.316547      12.597160           0.113904    2583.077267       0.000007        0.0  2.016033e+07    5923.327333      44.406268      -0.044809       0.080765       0.078833       0.017875       0.248204       0.044923       0.124692       0.058144       0.061996      -0.001000       0.009035       0.004813       0.000313      -0.000688
std     61103.875095  5.364988e+04      49.536040       7.864956       1.760640       0.548677       0.417546     177.168419       3.919576           0.317696    1885.363218       0.002582        0.0  1.067328e+02    7501.998477       2.457548       3.641893       2.929618       2.026514       1.193661       0.045804       0.051743       0.201410       0.029186       0.035692       3.772386       3.286071       2.517478       1.288988       1.038685
min         0.000000  1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000           0.000000       0.000000       0.000000        0.0  2.015062e+07      11.000000      30.451976      -4.295589      -4.470671      -7.275037      -4.364565       0.000000       0.000000       0.000000       0.000000       0.000000      -9.168192      -5.558207      -9.639552      -4.153899      -6.546556
25%     11156.000000  1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000           0.000000    1018.000000       0.000000        0.0  2.016031e+07    1300.000000      43.135799      -3.192349      -0.970671      -1.462580      -0.921191       0.243615       0.000038       0.062474       0.035334       0.033930      -3.722303      -1.951543      -1.871846      -1.057789      -0.437034
50%     51638.000000  2.003091e+07      30.000000       6.000000       1.000000       0.000000       0.000000     110.000000      15.000000           0.000000    2196.000000       0.000000        0.0  2.016032e+07    3250.000000      44.610266      -3.052671      -0.382947       0.099722      -0.075910       0.257798       0.000812       0.095866       0.057014       0.058484       1.624076      -0.358053      -0.130753      -0.036245       0.141246
75%    118841.250000  2.007111e+07      66.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000           0.000000    3843.000000       0.000000        0.0  2.016033e+07    7700.000000      46.004721       4.000670       0.241335       1.565838       0.868758       0.265297       0.102009       0.125243       0.079382       0.087491       2.844357       1.255022       1.776933       0.942813       0.680378
max    196812.000000  2.015121e+07     247.000000      39.000000       7.000000       6.000000       1.000000   19312.000000      15.000000           1.000000    8120.000000       1.000000        0.0  2.016041e+07   99999.000000      52.304178       7.320308      19.035496       9.854702       6.829352       0.291838       0.151420       1.404936       0.160791       0.222787      12.357011      18.819042      13.847792      11.147669       8.658418
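1.2 The seller feature is extremely skewed in both the training and test sets (almost every row is 0), so it does not help prediction; drop it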
train.drop(['seller'], axis=1, inplace=True)
test.drop(['seller'], axis=1, inplace=True)
1.3 The offerType column turns out to be all zeros in both datasets; drop it as well.
train = train.drop(['offerType'], axis=1)
test = test.drop(['offerType'], axis=1)
train.shape, test.shape
((150000, 28), (50000, 27))
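Spotting columns like seller and offerType by reading `describe()` works, but the same check can be automated. The sketch below is an illustration on a made-up toy frame: `dominant_share` is a hypothetical helper, and the 0.85 cut-off is an arbitrary value chosen for the example, not something taken from this analysis.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the fraction of rows taken by the most frequent value."""
    # value_counts sorts by frequency in descending order, so iloc[0] is the top share
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

toy = pd.DataFrame({
    'seller':    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # heavily skewed
    'offerType': [0] * 10,                          # constant
    'gearbox':   [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],   # reasonably balanced
})

share = dominant_share(toy)
# Hypothetical threshold: flag columns where one value covers more than 85% of the rows
to_drop = share[share > 0.85].index.tolist()
print(to_drop)
```

A flagged column is only a candidate for removal; a rare value can still be informative, so the final decision should stay manual.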

2. Exploratory Data Analysis

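2.1 Plot each feature against the sale price (in practice, drawing this chart turned out to be very slow)
# fig = plt.figure(figsize=(10, 50))

# for i in range(len(train.columns)-1):  # exclude the price column
#     fig.add_subplot(14, 2, i+1)  # 27 features at this point need at least 14 rows of 2 plots
#     sns.regplot(x=train.drop(['price'], axis=1).iloc[:, i], y=train['price'])

# plt.tight_layout()
# plt.show()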
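2.2 The dataset is too large to visualize the feature distributions comfortably under these performance limits, so the exploratory analysis relies on summary functions and the competition's data description instead

The competition's data description states that power lies in [0, 600]. However:

# 143 values in the training set (and 70 in the test set) are out of range and need to be replaced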
train[train['power'] > 600]['power'].count()
143
test[test['power'] > 600]['power'].count()
70
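2.3 At this stage, feature engineering is limited to filling missing values and dropping some features. Before starting, look at the linear correlation coefficients
# Linear correlation coefficient between each feature and the sale price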
train.corr().unstack()['price'].sort_values(ascending=False)
price                1.000000
v_12                 0.692823
v_8                  0.685798
v_0                  0.628397
regDate              0.611959
gearbox              0.329075
bodyType             0.241303
power                0.219834
fuelType             0.200536
v_5                  0.164317
model                0.136983
v_2                  0.085322
v_6                  0.068970
v_1                  0.060914
v_14                 0.035911
regionCode           0.014036
creatDate            0.002955
name                 0.002030
v_13                -0.013993
brand               -0.043799
v_7                 -0.053024
v_4                 -0.147085
notRepairedDamage   -0.190623
v_9                 -0.206205
v_10                -0.246175
v_11                -0.275320
kilometer           -0.440519
v_3                 -0.730946
dtype: float64
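# When choosing features to drop, start from those with a low linear correlation. Step one: shortlist the features whose absolute coefficient is below 0.1. Step two: set the coefficient aside and consider, from a real-world point of view, how each feature could affect the price

# Features v_2, v_6, v_1, v_14, v_13, v_7: these are continuous variables, so their values are numerically meaningful. Since their linear correlation with price is very low, consider dropping them to reduce noise and the risk of overfitting;

# Features regionCode, brand: these are not continuous variables and their codes are not numerically comparable. A low linear correlation with price therefore does not show that their values have little influence on price; keep them.

# Feature name: the car's trade name, with 99662 distinct values in the training set; its value does not affect the price, so drop it.

# Feature creatDate: the date the (used) car was listed for sale, ranging over [20150618, 20160407]. The span is short and its linear correlation with regDate (the car's registration date) is only -0.001293, so its value presumably has very little effect on price; drop it.
2.4 Drop these features from the training set and the same features from the test set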
train.drop(['v_2', 'v_6', 'v_1', 'v_14', 'v_13', 'v_7', 'name', 'creatDate'], axis=1, inplace=True)
test.drop(['v_2', 'v_6', 'v_1', 'v_14', 'v_13', 'v_7', 'name', 'creatDate'], axis=1, inplace=True)
train.shape, test.shape
((150000, 20), (50000, 19))
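The first step of the selection rule, shortlisting features whose absolute correlation with price is below 0.1, can be written as a short filter. This is a sketch on a tiny hand-built frame (all values are made up so that the result is easy to verify); on the real data the shortlist would still need the manual second step, because categorical codes such as regionCode and brand should not be judged by a linear coefficient.

```python
import numpy as np
import pandas as pd

strong = np.arange(8, dtype=float)
toy = pd.DataFrame({
    'strong': strong,
    # Pattern built by hand to be exactly uncorrelated with `strong`
    'noise': [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
    'price': 2 * strong + 1,
})

# Correlation of every feature with the target, excluding the target itself
corr_with_price = toy.corr()['price'].drop('price')

# Shortlist features whose absolute linear correlation is below the 0.1 threshold
weak = corr_with_price[corr_with_price.abs() < 0.1].index.tolist()
print(weak)
```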
# Check the linear correlation with the sale price again
train.corr().unstack()['price'].sort_values(ascending=False)
price                1.000000
v_12                 0.692823
v_8                  0.685798
v_0                  0.628397
regDate              0.611959
gearbox              0.329075
bodyType             0.241303
power                0.219834
fuelType             0.200536
v_5                  0.164317
model                0.136983
regionCode           0.014036
brand               -0.043799
v_4                 -0.147085
notRepairedDamage   -0.190623
v_9                 -0.206205
v_10                -0.246175
v_11                -0.275320
kilometer           -0.440519
v_3                 -0.730946
dtype: float64

3. Feature Engineering

3.1 Correct power values greater than 600
# Use map to replace out-of-range power values with the median of the power column
train['power'] = train['power'].map(lambda x: train['power'].median() if x > 600 else x)
test['power'] = test['power'].map(lambda x: test['power'].median() if x > 600 else x)
# Check that the replacement worked
train['power'].plot.hist()

[Figure: histogram of train['power'] after the replacement]

test['power'].plot.hist()

[Figure: histogram of test['power'] after the replacement]
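The `map` call above recomputes the median every time it meets an out-of-range value. A vectorized variant that computes it once is sketched below on a made-up toy Series; `cap_with_median` is a hypothetical helper name, and like the original code it takes the median over the whole column, outliers included.

```python
import pandas as pd

def cap_with_median(power, upper=600):
    """Replace values above `upper` with the column median (computed once)."""
    median = power.median()
    # where() keeps values that satisfy the condition and substitutes the rest
    return power.where(power <= upper, median)

toy_power = pd.Series([75, 110, 150, 19312, 0, 90])
fixed = cap_with_median(toy_power)
print(fixed.tolist())
```

Applied to the real data this would be `train['power'] = cap_with_median(train['power'])`, and likewise for the test set.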

3.2 Fill missing values
# Missing values in the training set
train.isnull().sum()[train.isnull().sum() > 0]
model                    1
bodyType              4506
fuelType              8680
gearbox               5981
notRepairedDamage    24324
dtype: int64
# Missing values in the test set
test.isnull().sum()[test.isnull().sum() > 0]
bodyType             1413
fuelType             2893
gearbox              1910
notRepairedDamage    8031
dtype: int64
3.2.1 Handle the single missing model value in the training set
train[train['model'].isnull()]
        regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  notRepairedDamage  regionCode  price        v_0        v_3       v_4       v_5       v_8       v_9    v_10      v_11      v_12
38424  20150809    NaN     37       6.0       1.0      1.0  190.0        2.0                0.0        1425  47950  41.139365  -7.275037  6.829352  0.181562  0.148487  0.222787  1.6757  -3.25056  0.876001
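# model (the car model code) is generally related to brand, bodyType, gearbox and power. Take the cars that match this car on those four features and pick the model value that occurs most often among them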
train[(train['brand'] == 37) & 
      (train['bodyType'] == 6.0) & 
      (train['gearbox'] == 1.0) & 
      (train['power'] == 190)]['model'].value_counts()
157.0    17
199.0    16
202.0     8
200.0     1
Name: model, dtype: int64
# Fill the missing value with 157.0
train.loc[38424, 'model'] = 157.0
train.loc[38424, :]
regDate              2.015081e+07
model                1.570000e+02
brand                3.700000e+01
bodyType             6.000000e+00
fuelType             1.000000e+00
gearbox              1.000000e+00
power                1.900000e+02
kilometer            2.000000e+00
notRepairedDamage    0.000000e+00
regionCode           1.425000e+03
price                4.795000e+04
v_0                  4.113937e+01
v_3                 -7.275037e+00
v_4                  6.829352e+00
v_5                  1.815618e-01
v_8                  1.484868e-01
v_9                  2.227875e-01
v_10                 1.675700e+00
v_11                -3.250560e+00
v_12                 8.760013e-01
Name: 38424, dtype: float64
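The lookup above hard-codes the values of one row. The same idea, "take the most frequent model among cars that match on brand, bodyType, gearbox and power", can be wrapped in a small function. This is a sketch on made-up toy data; `mode_among_matches` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def mode_among_matches(df, row, keys, target):
    """Most frequent `target` among rows that equal `row` on every column in `keys`."""
    mask = np.ones(len(df), dtype=bool)
    for key in keys:
        mask &= (df[key] == row[key]).to_numpy()
    candidates = df.loc[mask, target].dropna()
    # Fall back to NaN when no matching row has a known target value
    return candidates.mode().iloc[0] if not candidates.empty else np.nan

toy = pd.DataFrame({
    'brand':    [37, 37, 37, 37, 5],
    'bodyType': [6.0, 6.0, 6.0, 6.0, 1.0],
    'gearbox':  [1.0, 1.0, 1.0, 1.0, 0.0],
    'power':    [190, 190, 190, 190, 68],
    'model':    [157.0, 157.0, 199.0, np.nan, 110.0],
})

keys = ['brand', 'bodyType', 'gearbox', 'power']
for idx, row in toy[toy['model'].isna()].iterrows():
    toy.loc[idx, 'model'] = mode_among_matches(toy, row, keys, 'model')

print(toy['model'].tolist())
```

With a single missing value the manual lookup is perfectly adequate; a helper like this only pays off when several rows need the same treatment.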
# Check the result
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
regDate              150000 non-null int64
model                150000 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null float64
kilometer            150000 non-null float64
notRepairedDamage    125676 non-null float64
regionCode           150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
dtypes: float64(16), int64(4)
memory usage: 22.9 MB
3.2.2 Handle missing bodyType values
# Count the missing values
print(train['bodyType'].isnull().value_counts())
print('\n')
print(test['bodyType'].isnull().value_counts())
False    145494
True       4506
Name: bodyType, dtype: int64


False    48587
True      1413
Name: bodyType, dtype: int64
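# bodyType has a fairly small share of missing values (about 3%). First look at how its values relate to price, then decide whether to drop it
# Plot the linear relationship between the feature and price (similar to a scatter plot)
sns.regplot(x=train['bodyType'], y=train['price'])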

[Figure: regression plot of price against bodyType]

# Prices clearly differ quite a lot across body types, so keep the feature and fill its missing values
# Look at the distribution of body types
print(train['bodyType'].value_counts())
print('\n')
print(test['bodyType'].value_counts())
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64


0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
# In both datasets, body type 0.0 (limousine) is the most common, so fill the missing values with 0.0
train.loc[:, 'bodyType'] = train['bodyType'].map(lambda x: 0.0 if pd.isnull(x) else x)
test.loc[:, 'bodyType'] = test['bodyType'].map(lambda x: 0.0 if pd.isnull(x) else x)
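Filling with the most frequent value can also be expressed with `mode()` and `fillna()`, which avoids hard-coding 0.0. A minimal sketch on made-up toy Series follows; taking the fill value from the training data and reusing it for the test data is an assumption of this sketch (here both datasets happen to share the same mode anyway).

```python
import numpy as np
import pandas as pd

train_toy = pd.Series([0.0, 1.0, 0.0, np.nan, 2.0, 0.0], name='bodyType')
test_toy = pd.Series([1.0, np.nan, 0.0], name='bodyType')

# mode() ignores NaN by default and returns a Series; take its first entry
fill_value = train_toy.mode().iloc[0]

train_filled = train_toy.fillna(fill_value)
test_filled = test_toy.fillna(fill_value)
print(fill_value, train_filled.isna().sum(), test_filled.isna().sum())
```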
3.2.3 Handle missing fuelType values
# Count the missing values
print(train['fuelType'].isnull().value_counts())
print('\n')
print(test['fuelType'].isnull().value_counts())
False    141320
True       8680
Name: fuelType, dtype: int64


False    47107
True      2893
Name: fuelType, dtype: int64
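# fuelType has a fairly small share of missing values (about 6%). First look at how its values relate to price, then decide whether to drop it
# Plot the linear relationship between the feature and price (similar to a scatter plot)
sns.regplot(x=train['fuelType'], y=train['price'])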

[Figure: regression plot of price against fuelType]

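# Hypothesis: fuel type is related to body type, e.g. limousines are more likely to run on petrol or electricity, while concrete mixer trucks are mostly diesel
# Build dictionaries holding the most frequent fuelType (the mode) for each bodyType, and use them to fill the missing fuelType values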
dict_enu_train, dict_enu_test = {}, {}
for i in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    dict_enu_train[i] = train[train['bodyType'] == i]['fuelType'].mode()[0]
    dict_enu_test[i] = test[test['bodyType'] == i]['fuelType'].mode()[0]
    
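# dict_enu_train and dict_enu_test turn out to have the same contents
# Start filling the missing fuelType values
# Among the rows with a missing fuelType, collect the index labels for each bodyType into a dictionary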
dict_index_train, dict_index_test = {}, {}

for bodytype in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    dict_index_train[bodytype] = train[(train['bodyType'] == bodytype) & (train['fuelType'].isnull())].index.tolist()
    dict_index_test[bodytype] = test[(test['bodyType'] == bodytype) & (test['fuelType'].isnull())].index.tolist()
# For each bodyType, fill the fuelType column at the corresponding index labels
for bt, ft in dict_enu_train.items():
#     train.loc[tuple(dict_index[bt]), :]['fuelType'] = ft  # Warning: chained indexing like this may silently fail to assign!
    train.loc[dict_index_train[bt], 'fuelType'] = ft  # pandas recommends a single .loc call for indexing and assignment
    test.loc[dict_index_test[bt], 'fuelType'] = ft
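The dictionary-and-loop approach above can be condensed with `groupby`. The sketch below shows group-wise mode imputation on a made-up toy frame, assuming bodyType itself has no missing values at this point (it was filled in 3.2.2); this is an equivalent formulation for illustration, not the code the original notebook ran.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bodyType': [0.0, 0.0, 0.0, 7.0, 7.0, 7.0],
    'fuelType': [0.0, 0.0, np.nan, 1.0, np.nan, 1.0],
})

# Most frequent fuelType within each bodyType (mode() ignores NaN by default).
# Note: a group whose fuelType is entirely missing would raise an IndexError here.
mode_by_body = toy.groupby('bodyType')['fuelType'].agg(lambda s: s.mode().iloc[0])

# Look up the group mode for every row and use it only where fuelType is missing
toy['fuelType'] = toy['fuelType'].fillna(toy['bodyType'].map(mode_by_body))
print(toy['fuelType'].tolist())
```

To mirror the post, `mode_by_body` would be computed on the training set and then mapped onto both the training and the test set.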
3.2.4 Fill missing gearbox values
# Count the missing values
print(train['gearbox'].isnull().value_counts())
print('\n')
print(test['gearbox'].isnull().value_counts())
False    144019
True       5981
Name: gearbox, dtype: int64


False    48090
True      1910
Name: gearbox, dtype: int64
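# gearbox has a fairly small share of missing values (about 4%). First look at how its values relate to price, then decide whether to drop it
# Plot the linear relationship between the feature and price (similar to a scatter plot)
sns.regplot(x=train['gearbox'], y=train['price'])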

[Figure: regression plot of price against gearbox]

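# The plot suggests that gearbox type does not change the price dramatically. Dropping the training rows with missing values might be workable, but to avoid the overfitting risk that comes with a smaller sample, keep the feature and fill its missing values
# Look at the distribution of gearbox types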
print(train['gearbox'].value_counts())
print('\n')
print(test['gearbox'].value_counts())
0.0    111623
1.0     32396
Name: gearbox, dtype: int64


0.0    37301
1.0    10789
Name: gearbox, dtype: int64
# Training set
train.loc[:, 'gearbox'] = train['gearbox'].map(lambda x: 0.0 if pd.isnull(x) else x)

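# For the test set, no rows can be dropped, because every row needs a prediction. The test set has only 1910 missing gearbox values; fill them with 0.0 (manual), which is by far the most common value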
test.loc[:, 'gearbox'] = test['gearbox'].map(lambda x: 0.0 if pd.isnull(x) else x)
# Check that the fill worked
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
regDate              150000 non-null int64
model                150000 non-null float64
brand                150000 non-null int64
bodyType             150000 non-null float64
fuelType             150000 non-null float64
gearbox              150000 non-null float64
power                150000 non-null float64
kilometer            150000 non-null float64
notRepairedDamage    125676 non-null float64
regionCode           150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
dtypes: float64(16), int64(4)
memory usage: 22.9 MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 19 columns):
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             50000 non-null float64
fuelType             50000 non-null float64
gearbox              50000 non-null float64
power                50000 non-null float64
kilometer            50000 non-null float64
notRepairedDamage    41969 non-null float64
regionCode           50000 non-null int64
v_0                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
dtypes: float64(16), int64(3)
memory usage: 7.2 MB
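3.2.5 Finally, handle missing notRepairedDamage values
# Count the missing values
# The share of missing values is not small in either dataset (about 16% in each)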
print(train['notRepairedDamage'].isnull().value_counts())
print('\n')
print(test['notRepairedDamage'].isnull().value_counts())
False    125676
True      24324
Name: notRepairedDamage, dtype: int64


False    41969
True      8031
Name: notRepairedDamage, dtype: int64
# Value distribution
print(train['notRepairedDamage'].value_counts())
print('\n')
print(test['notRepairedDamage'].value_counts())
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64


0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
# Linear correlation coefficient
train[['notRepairedDamage', 'price']].corr()['price']
notRepairedDamage   -0.190623
price                1.000000
Name: price, dtype: float64
# Plot the linear relationship between the feature and price (similar to a scatter plot)
sns.regplot(x=train['notRepairedDamage'], y=train['price'])

[Figure: regression plot of price against notRepairedDamage]

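# Strangely, across the whole training set, cars with unrepaired damage sell for more than cars whose damage has been repaired. With nearly 20 other features in play, this is most likely a coincidence
# To keep things simple, still fill all missing values with 0.0, the most common value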
train.loc[:, 'notRepairedDamage'] = train['notRepairedDamage'].map(lambda x: 0.0 if pd.isnull(x) else x)
test.loc[:, 'notRepairedDamage'] = test['notRepairedDamage'].map(lambda x: 0.0 if pd.isnull(x) else x)
# Finally, check the result
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 20 columns):
regDate              150000 non-null int64
model                150000 non-null float64
brand                150000 non-null int64
bodyType             150000 non-null float64
fuelType             150000 non-null float64
gearbox              150000 non-null float64
power                150000 non-null float64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null float64
regionCode           150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
dtypes: float64(16), int64(4)
memory usage: 22.9 MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 19 columns):
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             50000 non-null float64
fuelType             50000 non-null float64
gearbox              50000 non-null float64
power                50000 non-null float64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null float64
regionCode           50000 non-null int64
v_0                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
dtypes: float64(16), int64(3)
memory usage: 7.2 MB

4. Modeling and Hyperparameter Tuning

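4.1 Choose three ensemble models: random forest, XGBoost and gradient boosted decision trees (GBDT)
rf = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=1)
# Note: the original run misspelled this argument as `n_stimators`, which XGBoost did not treat as the number of trees,
# so the scores reported below were presumably obtained with the library's default number of trees rather than 150.
xgb = XGBRegressor(n_estimators=150, max_depth=8, learning_rate=0.1, random_state=1)
gbdt = GradientBoostingRegressor(subsample=0.8, random_state=1)  # subsample < 1 lowers variance but raises bias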

X = train.drop(['price'], axis=1)
y = train['price']
4.2 Cross-validate and compare the models
# Random forest
score_rf = -1 * cross_val_score(rf,
                           X,
                           y,
                           scoring='neg_mean_absolute_error',
                           cv=5).mean()  # mean score across the folds

print('Mean MAE of the random forest model:', score_rf)

# XGBoost
score_xgb = -1 * cross_val_score(xgb,
                                X,
                                y,
                                scoring='neg_mean_absolute_error',
                                cv=5).mean()  # mean score across the folds

print('Mean MAE of the XGBoost model:', score_xgb)

# Gradient boosted decision trees (GBDT)
score_gbdt = -1 * cross_val_score(gbdt,
                                X,
                                y,
                                scoring='neg_mean_absolute_error',
                                cv=5).mean()  # mean score across the folds

print('Mean MAE of the GBDT model:', score_gbdt)
Mean MAE of the random forest model: 924.43649869
Mean MAE of the XGBoost model: 616.449663619
Mean MAE of the GBDT model: 893.439059092
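The three near-identical calls above can be folded into one loop. The sketch below runs on synthetic data from `make_regression` and uses only scikit-learn estimators so that it stays self-contained; the dataset, the model settings and the dictionary labels are all made up for illustration and say nothing about the scores on the real data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic regression problem standing in for (X, y)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

models = {
    'random forest': RandomForestRegressor(n_estimators=20, max_depth=4, random_state=1),
    'GBDT': GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=1),
}

mae = {}
for label, estimator in models.items():
    # scikit-learn scorers follow "higher is better", so MAE is exposed as its negative;
    # flip the sign to get the usual positive MAE
    scores = cross_val_score(estimator, X_toy, y_toy,
                             scoring='neg_mean_absolute_error', cv=5)
    mae[label] = -scores.mean()
    print(f'{label}: mean MAE = {mae[label]:.2f}')
```

Any estimator with the scikit-learn interface, including `XGBRegressor`, could be added to the `models` dictionary in the same way.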
4.3 Select the XGBoost model and tune it (grid search)
params = {'n_estimators': [150, 200, 250],
          'learning_rate': [0.1],
          'subsample': [0.5, 0.8]}

model = GridSearchCV(estimator=xgb,
                    param_grid=params,
                    scoring='neg_mean_absolute_error',
                    cv=3)
model.fit(X, y)

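# Print the best parameters (best_score_ is the negated MAE, so values closer to 0 are better)
print('Best parameters:\n', model.best_params_)
print('Best score:\n', model.best_score_)
print('Best model:\n', model.best_estimator_)
Best parameters:
 {'learning_rate': 0.1, 'n_estimators': 250, 'subsample': 0.8}
Best score:
 -587.043780247
Best model: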
 XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=250,
       n_jobs=1, n_stimators=150, nthread=None, objective='reg:linear',
       random_state=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=0.8)
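Because the scoring is `neg_mean_absolute_error`, the reported best score of about -587 corresponds to an MAE of about 587. The sketch below shows the same grid-search pattern and the sign flip on synthetic data; it substitutes scikit-learn's `GradientBoostingRegressor` for XGBoost to stay self-contained, and the grid values are made up and much smaller than the ones used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Tiny illustrative grid; every combination is evaluated with 3-fold cross-validation
param_grid = {'n_estimators': [20, 40], 'subsample': [0.5, 0.8]}

search = GridSearchCV(estimator=GradientBoostingRegressor(learning_rate=0.1, random_state=1),
                      param_grid=param_grid,
                      scoring='neg_mean_absolute_error',
                      cv=3)
search.fit(X_toy, y_toy)

best_mae = -search.best_score_  # best_score_ is the negated MAE
print(search.best_params_, round(best_mae, 2))
```

By default `GridSearchCV` refits the best configuration on the full data, so the fitted `search` object can be used directly for prediction, as done in the next section.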

5. Submit the Results

predictions = model.predict(test)
result = pd.DataFrame({'SaleID': df_test['SaleID'], 'price': predictions})
result.to_csv('/home/myspace/My_submission.csv', index=False)
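Before uploading, it can be worth checking that the submission has one prediction per SaleID and no missing values. The sketch below is a hypothetical helper (`build_submission` is not part of the original notebook) run on made-up IDs and prices, writing to an in-memory buffer instead of a file.

```python
import io
import numpy as np
import pandas as pd

def build_submission(sale_ids, predictions):
    """Pair each SaleID with its prediction and run basic sanity checks."""
    # pandas raises a ValueError here if the two inputs differ in length
    result = pd.DataFrame({'SaleID': np.asarray(sale_ids),
                           'price': np.asarray(predictions, dtype=float)})
    assert result['SaleID'].is_unique, 'duplicate SaleID'
    assert result['price'].notna().all(), 'missing prediction'
    return result

# Made-up IDs and prices, for illustration only
toy_result = build_submission([150000, 150001, 150002], [1850.0, 3600.5, 6222.0])

buffer = io.StringIO()
toy_result.to_csv(buffer, index=False)
print(buffer.getvalue())
```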