【Python机器学习】回归模型：推土机售价预测

Sprite.Nym

已于 2022-09-09 19:53:59 修改

阅读量5.5k

点赞数 9

分类专栏：机器学习文章标签：机器学习 python 回归 XGBOOST 逐步回归

于 2022-09-09 18:15:58 首次发布

本文链接：https://blog.csdn.net/SpriteNym/article/details/126787559

版权

机器学习专栏收录该内容

12 篇文章 1 订阅

订阅专栏

文章目录

使用机器学习预测推土机的售价
零、导入模块
一、EDA
2. 数据清洗+数据预处理
3. 建模
- 3.1 RF
- 3.2 XGB
4. 特征筛选
5. 最终成果

使用机器学习预测推土机的售价

1. 定义问题

考虑到推土机的特性，利用过去的数据，我们能多大程度上预测它未来的价格？

2. 数据来源

kaggle：https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

The data for this competition is split into three parts:

Train.csv is the training set, which contains data through the end of 2011.
Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

3. 评价标准

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

更多信息：https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

注意：多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。

4. 使用的特征

特征过多，请自行进入kaggle项目主页查看。或点击如下谷歌表格链接：https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0

零、导入模块

# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")

# 数据预处理
from sklearn.preprocessing import LabelEncoder

# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score

一、EDA

bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
                           low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
                           low_memory = False)

1.1 查看基本信息

# 查看各字段类型
bulldozer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   object 
 14  fiModelDescriptor         74816 non-null   object 
 15  ProductSize               196093 non-null  object 
 16  fiProductClassDesc        412698 non-null  object 
 17  state                     412698 non-null  object 
 18  ProductGroup              412698 non-null  object 
 19  ProductGroupDesc          412698 non-null  object 
 20  Drive_System              107087 non-null  object 
 21  Enclosure                 412364 non-null  object 
 22  Forks                     197715 non-null  object 
 23  Pad_Type                  81096 non-null   object 
 24  Ride_Control              152728 non-null  object 
 25  Stick                     81096 non-null   object 
 26  Transmission              188007 non-null  object 
 27  Turbocharged              81096 non-null   object 
 28  Blade_Extension           25983 non-null   object 
 29  Blade_Width               25983 non-null   object 
 30  Enclosure_Type            25983 non-null   object 
 31  Engine_Horsepower         25983 non-null   object 
 32  Hydraulics                330133 non-null  object 
 33  Pushblock                 25983 non-null   object 
 34  Ripper                    106945 non-null  object 
 35  Scarifier                 25994 non-null   object 
 36  Tip_Control               25983 non-null   object 
 37  Tire_Size                 97638 non-null   object 
 38  Coupler                   220679 non-null  object 
 39  Coupler_System            44974 non-null   object 
 40  Grouser_Tracks            44875 non-null   object 
 41  Hydraulics_Flow           44875 non-null   object 
 42  Track_Type                102193 non-null  object 
 43  Undercarriage_Pad_Width   102916 non-null  object 
 44  Stick_Length              102261 non-null  object 
 45  Thumb                     102332 non-null  object 
 46  Pattern_Changer           102261 non-null  object 
 47  Grouser_Type              102193 non-null  object 
 48  Backhoe_Mounting          80712 non-null   object 
 49  Blade_Type                81875 non-null   object 
 50  Travel_Controls           81877 non-null   object 
 51  Differential_Type         71564 non-null   object 
 52  Steering_Controls         71522 non-null   object 
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB

# 查看缺失值
bulldozer_df.isna().sum()

SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
dtype: int64

# 查看标签分布
bulldozer_df['SalePrice'].hist()

<AxesSubplot:>

请添加图片描述

1.2 特征类型转换

# 改为帕斯卡命名
bulldozer_df.rename(columns={'saledate': 'SaleDate'}, inplace=True)

bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])

# 按时间查看售价，只看最近几年的，没有明显的上升趋势，只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()

请添加图片描述

2005年前后半年和2008年整个一年销量都比较差，除此之外看不出太多信息

# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)

205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
54653    1989-01-31
81383    1989-01-31
204924   1989-01-31
135376   1989-01-31
113390   1989-01-31
113394   1989-01-31
116419   1989-01-31
32138    1989-01-31
127610   1989-01-31
76171    1989-01-31
127000   1989-01-31
128130   1989-01-31
127626   1989-01-31
55455    1989-01-31
55454    1989-01-31
Name: SaleDate, dtype: datetime64[ns]

1.3 联表+特征初筛

这个数据集比较特殊，还有个appendix表，里面是一些推土机的配件信息，而且这个信息和train表有重复特征，重复特征里面还有匹配不上的情况，现在先联表上来看看

# 制作数据集副本，这是为了方便对数据集做了什么操作后，仍然可以获取原始数据，而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()

# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)

# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
    temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]

# 定义一个合并时互补的函数，冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
    temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    temp_df0.fillna('', inplace=True)
    temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
    temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
    df1[target_col] = temp_df0[target_col+'_x']
    df2.drop(columns=[target_col], inplace=True)
    return df1

1.3.1 删除包含重复信息的特征

ProductGroup是ProductGroupDesc的首字母缩写版，选择保留后者

fiManufacturerID和fiManufacturerDesc包含的信息一样，选择保留前者

bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)

1.3.2 fiBaseModel

# 查看枚举值，后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()

580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()

MachineID        0
fiBaseModel_x    0
fiBaseModel_y    0
dtype: int64

# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')

	MachineID	fiBaseModel_x	fiBaseModel_y
71	1523610	WA150	HD465
98	1059447	WA300	PC100
123	1303779	415	862
128	1340389	MS240	MS120
179	1208516	D31	PC100
...	...	...	...
412474	2308891	1845	75
412484	2287735	T135	T133
412509	2292146	450	465
412655	1846321	TB135	TB125
412695	1918416	337	530

12452 rows × 3 columns

# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)

1.3.3 fiSecondaryDesc

# 查看枚举值，后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()

C         44431
B         40165
G         37915
H         24729
E         21532
          ...  
BLGPPS        1
MSR           1
LC7A          1
CL            1
BH            1
Name: fiSecondaryDesc, Length: 177, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

	MachineID	fiSecondaryDesc_x	fiSecondaryDesc_y
0	1126363	NaN	NaN
1	1194089	NaN	NaN
3	1327630	NaN	NaN
6	1082797	NaN	NaN
7	1527216	NaN	NaN
...	...	...	...
412690	1823846	NaN	NaN
412691	1278794	NaN	NaN
412692	1792049	NaN	NaN
412694	1919104	NaN	NaN
412695	1918416	G	NaN

147009 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

1.3.4 fiModelSeries

# 查看枚举值，后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()

II      13770
LC       9175
III      5351
-1       4646
-2       4033
        ...  
LL          1
6F          1
-2LC        1
-5A         1
VII         1
Name: fiModelSeries, Length: 123, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')

	MachineID	fiModelSeries_x	fiModelSeries_y
0	1126363	NaN	NaN
1	1194089	NaN	NaN
2	1473654	NaN	NaN
3	1327630	NaN	NaN
4	1336053	NaN	NaN
...	...	...	...
412693	1915521	NaN	NaN
412694	1919104	NaN	NaN
412695	1918416	NaN	NaN
412696	509560	NaN	NaN
412697	1869284	NaN	NaN

362366 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')

1.3.5 fiModelDescriptor

# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()

L            16464
LGP          16143
LC           13295
XL            6700
6             2944
             ...  
K5               1
HighLift         1
High Lift        1
III              1
SL               1
Name: fiModelDescriptor, Length: 140, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

	MachineID	fiModelDescriptor_x	fiModelDescriptor_y
0	1126363	NaN	NaN
1	1194089	NaN	NaN
2	1473654	NaN	NaN
3	1327630	NaN	NaN
4	1336053	NaN	NaN
...	...	...	...
412693	1915521	NaN	NaN
412694	1919104	NaN	NaN
412695	1918416	NaN	NaN
412696	509560	NaN	NaN
412697	1869284	NaN	NaN

342069 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

1.3.6 ProductGroupDesc

# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()

Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64

# 枚举值太多了，而且bd_df里的数据并无空值，这里就选择不互补了，直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()

Track Excavators                 89094
Backhoe Loaders                  74074
Track Type Tractors              67362
Wheel Loader                     62426
Skid Steer Loaders               42121
Motor Graders                    22602
Track Loaders                      158
Articulated Trucks                 109
Ag Tractors                        103
Wheel Tractor Scraper              102
Off Highway Trucks                  81
Multi Terrain Loaders               66
Wheel Excavator                     66
Forklift                            53
Skidders                            46
Wheel Feller Buncher                31
Forestry Log Loaders                25
Pipelayers                          11
Telehandler                          9
Wheel Dozer                          7
Knuckleboom Loaders                  6
Track Feller Bunchers                6
Vibratory Double Drum Asphalt        4
Vibratory Single Drum Asphalt        4
Work Tool                            3
Pneumatic Tired Compactor            3
Compactors                           3
Vibratory Single Drum Pad            3
Tandem Roller Static                 2
Harvesters                           2
Forwarders                           2
Engine, Industrial OEM               2
Track Harvesters                     1
Delimber Forestry                    1
Asphalt/Concrete Pavers              1
Vibratory Single Drum Smooth         1
Crane/Dragline                       1
Forestry Excavators                  1
Cold Planers                         1
Name: ProductGroupDesc, dtype: int64

app_df.drop(columns=['ProductGroupDesc'], inplace=True)

1.3.7 MfgYear

# 两边同一个特征名字不一样，先改名
app_df.rename(columns={'MfgYear': 'YearMade'}, inplace=True)

# 查看枚举值，发现异常值1000
bd_df['YearMade'].value_counts()

1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64

# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')

	MachineID	YearMade_x	YearMade_y
52	1531656	1975	1987.0
58	1078132	1967	1966.0
71	1523610	1986	2007.0
98	1059447	1984	1996.0
122	1525180	1984	1985.0
...	...	...	...
412651	1631729	1000	1990.0
412654	1912106	1000	1996.0
412655	1846321	1000	2005.0
412667	1897535	1998	1999.0
412696	509560	1993	1990.0

36406 rows × 3 columns

这种不一致通常是因为bd_df里面有重复的MachineID，即同一种推土机被卖了多次，而app_df里只有唯一MachineID所导致，决定保留bd_df的数据，不动它

app_df.drop('YearMade', axis=1, inplace=True)

1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper

这三个特征原训练集上都没有，直接联过去

# total_df = bd_df.copy()

total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')

# 动力上限和下限留一个就足够区分了，取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)

total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalePrice                 412698 non-null  float64       
 1   MachineID                 412698 non-null  int64         
 2   ModelID                   412698 non-null  int64         
 3   datasource                412698 non-null  int64         
 4   auctioneerID              392562 non-null  float64       
 5   YearMade                  412698 non-null  int64         
 6   MachineHoursCurrentMeter  147504 non-null  float64       
 7   UsageBand                 73670 non-null   object        
 8   SaleDate                  412698 non-null  datetime64[ns]
 9   fiModelDesc               412698 non-null  object        
 10  fiBaseModel               412698 non-null  object        
 11  fiSecondaryDesc           412698 non-null  object        
 12  fiModelSeries             412698 non-null  object        
 13  fiModelDescriptor         412698 non-null  object        
 14  ProductSize               196093 non-null  object        
 15  state                     412698 non-null  object        
 16  ProductGroupDesc          412698 non-null  object        
 17  Drive_System              107087 non-null  object        
 18  Enclosure                 412364 non-null  object        
 19  Forks                     197715 non-null  object        
 20  Pad_Type                  81096 non-null   object        
 21  Ride_Control              152728 non-null  object        
 22  Stick                     81096 non-null   object        
 23  Transmission              188007 non-null  object        
 24  Turbocharged              81096 non-null   object        
 25  Blade_Extension           25983 non-null   object        
 26  Blade_Width               25983 non-null   object        
 27  Enclosure_Type            25983 non-null   object        
 28  Engine_Horsepower         25983 non-null   object        
 29  Hydraulics                330133 non-null  object        
 30  Pushblock                 25983 non-null   object        
 31  Ripper                    106945 non-null  object        
 32  Scarifier                 25994 non-null   object        
 33  Tip_Control               25983 non-null   object        
 34  Tire_Size                 97638 non-null   object        
 35  Coupler                   220679 non-null  object        
 36  Coupler_System            44974 non-null   object        
 37  Grouser_Tracks            44875 non-null   object        
 38  Hydraulics_Flow           44875 non-null   object        
 39  Track_Type                102193 non-null  object        
 40  Undercarriage_Pad_Width   102916 non-null  object        
 41  Stick_Length              102261 non-null  object        
 42  Thumb                     102332 non-null  object        
 43  Pattern_Changer           102261 non-null  object        
 44  Grouser_Type              102193 non-null  object        
 45  Backhoe_Mounting          80712 non-null   object        
 46  Blade_Type                81875 non-null   object        
 47  Travel_Controls           81877 non-null   object        
 48  Differential_Type         71564 non-null   object        
 49  Steering_Controls         71522 non-null   object        
 50  fiProductClassDesc        412698 non-null  object        
 51  fiManufacturerID          412698 non-null  int64         
 52  PrimarySizeBasis          407439 non-null  object        
 53  PrimaryLower              407439 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB

raise KeyError

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [903], in <cell line: 1>()
----> 1 raise KeyError


KeyError:

1.4 逐个查看特征

1.4.1 datasource

# 173怀疑是172写错成173
total_df['datasource'].value_counts()

132    260776
136     75491
149     33325
121     25191
172     17914
173         1
Name: datasource, dtype: int64

1.4.2 auctioneerID

# 后续要合并
total_df['auctioneerID'].value_counts()

1.0     192773
2.0      57441
3.0      30288
4.0      20877
99.0     12042
6.0      11950
7.0       7847
8.0       7419
5.0       7002
10.0      5876
9.0       4764
11.0      3823
12.0      3610
13.0      3068
18.0      2359
14.0      2277
20.0      2238
19.0      2074
16.0      1807
15.0      1742
21.0      1601
22.0      1429
24.0      1357
23.0      1322
17.0      1275
27.0      1150
25.0       959
28.0       860
26.0       796
0.0        536
Name: auctioneerID, dtype: int64

# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))

	count	mean	median
	SalePrice	SalePrice	SalePrice
auctioneerID
22.0	1429	18230.020994	14000.0
18.0	2359	18839.799491	14500.0
21.0	1601	19645.315428	15000.0
25.0	959	22478.519291	16000.0
9.0	4764	22515.003149	16775.0
14.0	2277	20804.797980	17000.0
17.0	1275	22974.941176	18000.0
12.0	3610	24762.936288	18250.0
28.0	860	28312.674419	18750.0
20.0	2238	24737.811439	19250.0
99.0	12042	26958.806112	19500.0
13.0	3068	27017.441330	20500.0
16.0	1807	26261.427781	20500.0
27.0	1150	27606.739130	22500.0
4.0	20877	29825.726876	23000.0
23.0	1322	30613.729198	23000.0
2.0	57441	29023.051670	23000.0
10.0	5876	29561.546971	23000.0
0.0	536	29979.850746	23000.0
15.0	1742	29986.337543	23500.0
5.0	7002	29150.217795	23500.0
24.0	1357	33041.156964	24500.0
1.0	192773	32684.870075	25000.0
8.0	7419	32477.978838	25000.0
11.0	3823	32707.088674	25500.0
3.0	30288	33596.962592	26000.0
6.0	11950	34708.070628	27000.0
26.0	796	36157.412060	28000.0
7.0	7847	36288.450236	28500.0
19.0	2074	42715.766635	34000.0

# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()

请添加图片描述

1.4.3 YearMade

total_df['YearMade'] = total_df['YearMade'].astype(int)

total_df['YearMade'].value_counts()

1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64

plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()

请添加图片描述

第一台拖拉机1904年才发明出来，1904年前的属于异常数据。同时该数据集截至年份是2012年，大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。

生产时间大于销售时间的，暂时认为是提前订货，不处理。

1.4.4 MachineHoursCurrentMeter

# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()

0.0        73834
2000.0       124
1000.0       117
24.0         115
1500.0       101
           ...  
10834.0        1
3499.0         1
26270.0        1
26901.0        1
17920.0        1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64

1.4.5 UsageBand

# 查看枚举值
total_df['UsageBand'].value_counts()

Medium    35832
Low       25311
High      12527
Name: UsageBand, dtype: int64

1.4.6 fiBaseModel

total_df['fiBaseModel'].value_counts()

580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

1.4.7 fiSecondaryDesc

total_df['fiSecondaryDesc'].value_counts().head(10)

     136420
C     44658
B     40446
G     38139
H     24759
E     21944
D     20132
F      9454
K      8089
A      5968
Name: fiSecondaryDesc, dtype: int64

temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
fiSecondaryDesc
XT	2177	29736.816720	23000.0
LE	2510	31586.065737	23000.0
R	3577	31164.435840	24000.0
N	3844	31028.674558	23500.0
SUPER L	3920	31237.168367	24000.0
P	4475	32841.134078	26000.0
J	4655	31069.470892	24000.0
LC	4862	31265.127931	24000.0
M	5398	32225.976288	24500.0
L	5669	31634.053625	24500.0
A	5968	31266.457775	24000.0
K	8089	30815.391396	23000.0
F	9454	31958.080389	25000.0
D	20132	30766.167644	24000.0
E	21944	31089.640357	24000.0
H	24759	31279.811059	24000.0
G	38139	30902.040200	23500.0
B	40446	31238.508530	24000.0
C	44658	31024.103699	24000.0
	136420	31426.038213	24000.0

1.4.8 fiModelSeries

total_df['fiModelSeries'].value_counts().head(10)

       350453
II      14039
LC       9609
III      5392
-1       5142
-2       4350
-6       3538
-3       2783
-5       2664
-12      1447
Name: fiModelSeries, dtype: int64

1.4.9 fiModelDescriptor

total_df['fiModelDescriptor'].value_counts().head(10)

       333040
L       16676
LGP     16541
LC      15666
XL       6704
6        3238
LT       2681
5        2455
3        2149
CR       1798
Name: fiModelDescriptor, dtype: int64

temp = pd.pivot_table(total_df, index='fiModelDescriptor', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
fiModelDescriptor
K	349	29030.372493	22000.0
8	351	33369.800570	27000.0
E	359	34427.576602	25500.0
SR	383	28176.827676	20000.0
ZTS	442	28579.751131	21250.0
Z	525	27814.190476	21000.0
2	627	32085.047847	26000.0
SSR	684	33503.508772	27000.0
H	1092	30992.971612	24000.0
7	1115	30983.373991	24000.0
CR	1798	28777.157953	22000.0
3	2149	31477.175430	24000.0
5	2455	30897.668024	24000.0
LT	2681	28216.729206	21000.0
6	3238	30573.305127	24000.0
XL	6704	30564.901700	23000.0
LC	15666	30387.232989	23000.0
LGP	16541	31327.267033	24000.0
L	16676	30739.471036	24000.0
	333040	31340.137891	24000.0

1.4.10 ProductSize

total_df['ProductSize'].fillna('').value_counts()

                  216605
Medium             64342
Large / Medium     51297
Small              27057
Mini               25721
Large              21396
Compact             6280
Name: ProductSize, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='ProductSize', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
ProductSize
Compact	6280	17498.578822	15000.0
Large	21396	42023.433315	34000.0
Mini	25721	15194.647720	13500.0
Small	27057	32511.164061	29000.0
Large / Medium	51297	47828.923231	44000.0
Medium	64342	45703.873473	41000.0
	216605	24047.383408	19000.0

1.4.11 state

# 查看数据分布
total_df['state'].fillna('').value_counts()

Florida           67320
Texas             53110
California        29761
Washington        16222
Georgia           14633
Maryland          13322
Mississippi       13240
Ohio              12369
Illinois          11540
Colorado          11529
New Jersey        11156
North Carolina    10636
Tennessee         10298
Alabama           10292
Pennsylvania      10234
South Carolina     9951
Arizona            9364
New York           8639
Connecticut        8276
Minnesota          7885
Missouri           7178
Nevada             6932
Louisiana          6627
Kentucky           5351
Maine              5096
Indiana            4124
Arkansas           3933
New Mexico         3631
Utah               3046
Unspecified        2801
Wisconsin          2745
New Hampshire      2738
Virginia           2353
Idaho              2025
Oregon             1911
Michigan           1831
Wyoming            1672
Montana            1336
Iowa               1336
Oklahoma           1326
Nebraska            866
West Virginia       840
Kansas              667
Delaware            510
North Dakota        480
Alaska              430
Massachusetts       347
Vermont             300
South Dakota        244
Hawaii              118
Rhode Island         83
Puerto Rico          42
Washington DC         2
Name: state, dtype: int64

# 查看州和价格的关系，有关系
temp = pd.pivot_table(total_df, index='state', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
state
Minnesota	7885	27199.055168	19500.0
Connecticut	8276	29008.808603	23500.0
New York	8639	25582.237527	20000.0
Arizona	9364	31562.857753	23000.0
South Carolina	9951	29848.894583	22000.0
Pennsylvania	10234	25463.002736	19000.0
Alabama	10292	35438.541489	28000.0
Tennessee	10298	31845.357351	24000.0
North Carolina	10636	32161.773693	25000.0
New Jersey	11156	30982.188867	24500.0
Colorado	11529	31777.748287	25000.0
Illinois	11540	29091.581456	22500.0
Ohio	12369	28228.100493	22000.0
Mississippi	13240	32574.592145	25000.0
Maryland	13322	28621.682180	22000.0
Georgia	14633	32265.550468	24000.0
Washington	16222	27690.731722	21500.0
California	29761	29815.202715	22500.0
Texas	53110	32977.190341	25000.0
Florida	67320	34387.512775	27000.0

# 进一步按价格排序查看
temp.sort_values(by=[('median', 'SalePrice'), ('mean', 'SalePrice')])

	count	mean	median
	SalePrice	SalePrice	SalePrice
state
Indiana	4124	24400.357662	19000.0
Pennsylvania	10234	25463.002736	19000.0
Maine	5096	26176.942700	19500.0
Minnesota	7885	27199.055168	19500.0
New York	8639	25582.237527	20000.0
Kansas	667	28093.403298	20000.0
Puerto Rico	42	26011.904762	20250.0
Wisconsin	2745	27656.247723	21000.0
Idaho	2025	29263.234568	21000.0
Washington	16222	27690.731722	21500.0
Virginia	2353	28798.645984	21500.0
Ohio	12369	28228.100493	22000.0
Maryland	13322	28621.682180	22000.0
Michigan	1831	29003.923648	22000.0
Missouri	7178	29046.000697	22000.0
Kentucky	5351	29815.380303	22000.0
South Carolina	9951	29848.894583	22000.0
Vermont	300	27285.333333	22250.0
Arkansas	3933	28906.893974	22500.0
Illinois	11540	29091.581456	22500.0
California	29761	29815.202715	22500.0
Washington DC	2	22750.000000	22750.0
Massachusetts	347	28382.420749	23000.0
New Hampshire	2738	28928.743608	23000.0
Oregon	1911	30277.590267	23000.0
Delaware	510	31160.098039	23000.0
Arizona	9364	31562.857753	23000.0
Connecticut	8276	29008.808603	23500.0
Louisiana	6627	30201.893768	23500.0
Tennessee	10298	31845.357351	24000.0
Iowa	1336	31927.544910	24000.0
Oklahoma	1326	32258.710407	24000.0
Georgia	14633	32265.550468	24000.0
Montana	1336	32616.654192	24000.0
Nebraska	866	32612.066975	24250.0
New Jersey	11156	30982.188867	24500.0
Colorado	11529	31777.748287	25000.0
North Carolina	10636	32161.773693	25000.0
Mississippi	13240	32574.592145	25000.0
Wyoming	1672	32604.425837	25000.0
Texas	53110	32977.190341	25000.0
Hawaii	118	28879.237288	26000.0
Alaska	430	33281.976744	26000.0
New Mexico	3631	33632.649408	26000.0
Utah	3046	34190.547932	27000.0
Florida	67320	34387.512775	27000.0
Nevada	6932	36332.097519	27000.0
Unspecified	2801	34857.711532	28000.0
Alabama	10292	35438.541489	28000.0
North Dakota	480	39083.750000	29750.0
West Virginia	840	40258.750000	33000.0
Rhode Island	83	37622.289157	34000.0
South Dakota	244	43907.377049	35000.0

1.4.12 ProductGroupDesc

# 查看数据分布
total_df['ProductGroupDesc'].fillna('').value_counts()

Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64

# 查看和价格的关系
temp = pd.pivot_table(total_df, index='ProductGroupDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
ProductGroupDesc
Motor Graders	26258	47561.957422	40000.0
Skid Steer Loaders	45011	10583.944014	10000.0
Wheel Loader	73216	37259.292422	32000.0
Backhoe Loaders	81401	20951.582327	20500.0
Track Type Tractors	82582	36280.132780	29500.0
Track Excavators	104230	35763.457018	29000.0

1.4.13 Drive_System

total_df['Drive_System'].fillna('').value_counts()

                    305611
Two Wheel Drive      47546
Four Wheel Drive     33551
No                   25166
All Wheel Drive        824
Name: Drive_System, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Drive_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Drive_System
All Wheel Drive	824	60455.279126	59000.0
No	25166	47342.721450	40000.0
Four Wheel Drive	33551	24534.149414	24000.0
Two Wheel Drive	47546	18418.026879	17500.0
	305611	32532.703693	25000.0

1.4.14 Enclosure

total_df['Enclosure'].fillna('').value_counts()

OROPS                  177971
EROPS                  141769
EROPS w AC              92601
                          334
EROPS AC                   18
NO ROPS                     3
None or Unspecified         2
Name: Enclosure, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Enclosure', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Enclosure
None or Unspecified	2	16500.000000	16500.0
NO ROPS	3	44333.333333	42500.0
EROPS AC	18	23500.000000	20500.0
	334	27689.446108	25000.0
EROPS w AC	92601	51671.169728	47000.0
EROPS	141769	28687.993278	23000.0
OROPS	177971	22592.082739	18000.0

1.4.15 Forks

total_df['Forks'].fillna('').value_counts()

                       214983
None or Unspecified    183061
Yes                     14654
Name: Forks, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Forks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Forks
Yes	14654	36761.354715	31500.0
None or Unspecified	183061	23593.414184	18500.0
	214983	37327.174954	30000.0

1.4.16 Pad_Type

total_df['Pad_Type'].fillna('').value_counts()

                       331602
None or Unspecified     72395
Reversible               5950
Street                   2725
Grouser                    26
Name: Pad_Type, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Pad_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Pad_Type
Grouser	26	30289.423077	30500.0
Street	2725	24995.064220	25000.0
Reversible	5950	27344.033613	27500.0
None or Unspecified	72395	20267.120354	20000.0
	331602	33725.998897	26000.0

1.4.17 Ride_Control

total_df['Ride_Control'].fillna('').value_counts()

                       259970
No                      79389
None or Unspecified     64693
Yes                      8646
Name: Ride_Control, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Ride_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Ride_Control
Yes	8646	55278.269142	53000.0
None or Unspecified	64693	34834.650024	29000.0
No	79389	20765.720100	20000.0
	259970	32705.232362	25000.0

1.4.18 Stick

total_df['Stick'].fillna('').value_counts()

            331602
Standard     49854
Extended     31242
Name: Stick, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Stick', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Stick
Extended	31242	23257.320914	23000.0
Standard	49854	19501.525113	19000.0
	331602	33725.998897	26000.0

1.4.19 Transmission

total_df['Transmission'].fillna('').value_counts()

                       224691
Standard               143915
None or Unspecified     23889
Powershift              11991
Powershuttle             4286
Hydrostatic              3342
Direct Drive              422
Autoshift                 118
AutoShift                  44
Name: Transmission, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Transmission', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Transmission
AutoShift	44	102261.363636	95000.0
Autoshift	118	33141.949153	33250.0
Direct Drive	422	18675.829384	11550.0
Hydrostatic	3342	34705.814183	32000.0
Powershuttle	4286	19264.930938	18000.0
Powershift	11991	38574.352931	29500.0
None or Unspecified	23889	47354.812968	40000.0
Standard	143915	28364.132218	23000.0
	224691	31117.253842	23000.0

1.4.20 Turbocharged

total_df['Turbocharged'].fillna('').value_counts()

                       331602
None or Unspecified     77111
Yes                      3985
Name: Turbocharged, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Turbocharged', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Turbocharged
Yes	3985	24144.300376	24000.0
None or Unspecified	77111	20783.276264	20000.0
	331602	33725.998897	26000.0

1.4.21 Blade_Extension

total_df['Blade_Extension'].fillna('').value_counts()

                       386715
None or Unspecified     25406
Yes                       577
Name: Blade_Extension, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Blade_Extension', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Blade_Extension
Yes	577	66587.694974	65000.0
None or Unspecified	25406	47337.075022	40000.0
	386715	30103.244279	23500.0

1.4.22 Blade_Width

total_df['Blade_Width'].fillna('').value_counts()

                       386715
14'                      9867
None or Unspecified      9521
12'                      5201
16'                       960
13'                       335
<12'                       99
Name: Blade_Width, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Blade_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Blade_Width
<12'	99	27764.646465	21000.0
13'	335	27125.522388	18500.0
16'	960	59882.119792	52750.0
12'	5201	36000.852144	29000.0
None or Unspecified	9521	41899.460141	32000.0
14'	9867	59347.223168	55000.0
	386715	30103.244279	23500.0

1.4.23 Enclosure_Type

total_df['Enclosure_Type'].fillna('').value_counts()

                       386715
None or Unspecified     22469
Low Profile              2675
High Profile              839
Name: Enclosure_Type, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Enclosure_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Enclosure_Type
High Profile	839	81897.437426	80000.0
Low Profile	2675	81444.658318	80000.0
None or Unspecified	22469	42480.324759	35000.0
	386715	30103.244279	23500.0

1.4.24 Engine_Horsepower

total_df['Engine_Horsepower'].fillna('').value_counts()

            386715
No           24642
Variable      1341
Name: Engine_Horsepower, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Engine_Horsepower', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Engine_Horsepower
Variable	1341	90245.152871	88000.0
No	24642	45452.807321	38000.0
	386715	30103.244279	23500.0

1.4.24 Hydraulics

total_df['Hydraulics'].fillna('').value_counts()

2 Valve                145317
Standard               106515
                        82565
Auxiliary               43224
Base + 1 Function       25511
3 Valve                  5807
4 Valve                  3077
Base + 3 Function         311
Base + 2 Function         132
Base + 5 Function          94
Base + 4 Function          81
Base + 6 Function          54
None or Unspecified        10
Name: Hydraulics, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Hydraulics
None or Unspecified	10	27400.000000	26250.0
Base + 6 Function	54	68333.333333	69500.0
Base + 4 Function	81	95253.086420	92500.0
Base + 5 Function	94	75601.063830	71000.0
Base + 2 Function	132	89424.242424	89500.0
Base + 3 Function	311	84141.800643	79000.0
4 Valve	3077	56652.551186	51000.0
3 Valve	5807	44479.309454	40000.0
Base + 1 Function	25511	46637.433970	39000.0
Auxiliary	43224	25076.746784	16000.0
	82565	21028.900297	20500.0
Standard	106515	29391.410139	20500.0
2 Valve	145317	36145.196393	30000.0

1.4.25 Pushblock

total_df['Pushblock'].fillna('').value_counts()

                       386715
None or Unspecified     20017
Yes                      5966
Name: Pushblock, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Pushblock', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Pushblock
Yes	5966	68305.127724	65000.0
None or Unspecified	20017	41642.525653	33000.0
	386715	30103.244279	23500.0

1.4.26 Ripper

total_df['Ripper'].fillna('').value_counts()

                       305753
None or Unspecified     85405
Yes                      8185
Multi Shank              8071
Single Shank             5284
Name: Ripper, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Ripper', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Ripper
Single Shank	5284	50139.366389	42000.0
Multi Shank	8071	48952.397472	41000.0
Yes	8185	63677.683079	60000.0
None or Unspecified	85405	35253.466486	28500.0
	305753	28422.902101	22000.0

1.4.27 Scarifier

total_df['Scarifier'].fillna('').value_counts()

                       386704
None or Unspecified     13033
Yes                     12961
Name: Scarifier, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Scarifier', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Scarifier
Yes	12961	49523.266415	40000.0
None or Unspecified	13033	45996.798281	40000.0
	386704	30103.375220	23500.0

1.4.28 Tip_Control

total_df['Tip_Control'].fillna('').value_counts()

                       386715
None or Unspecified     16832
Sideshift & Tip          7164
Tip                      1987
Name: Tip_Control, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Tip_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Tip_Control
Tip	1987	62269.265727	57000.0
Sideshift & Tip	7164	46739.039084	41000.0
None or Unspecified	16832	46488.790459	37500.0
	386715	30103.244279	23500.0

1.4.29 Tire_Size

total_df['Tire_Size'].fillna('').value_counts()

                       315060
None or Unspecified     47823
20.5                    15773
14"                      9111
23.5                     8760
26.5                     4635
17.5                     3971
29.5                     2767
17.5"                    1815
13"                       776
20.5"                     737
15.5                      610
15.5"                     463
23.5"                     309
7.0"                       56
23.1"                      20
10"                         9
10 inch                     3
Name: Tire_Size, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Tire_Size', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Tire_Size
10 inch	3	16333.333333	15500.0
10"	9	31166.666667	25000.0
23.1"	20	14837.500000	12125.0
7.0"	56	27625.000000	25500.0
23.5"	309	56487.944984	51000.0
15.5"	463	47401.727862	40000.0
15.5	610	17153.442623	15000.0
20.5"	737	74012.632293	68000.0
13"	776	21085.373711	16000.0
17.5"	1815	67478.650138	65000.0
29.5	2767	46650.813155	43000.0
17.5	3971	25344.856459	22500.0
26.5	4635	50903.011866	50000.0
23.5	8760	47236.347032	44000.0
14"	9111	50506.226869	44000.0
20.5	15773	41099.197997	38000.0
None or Unspecified	47823	35357.008218	27000.0
	315060	28433.535937	22000.0

1.4.30 Coupler

total_df['Coupler'].fillna('').value_counts()

                       192019
None or Unspecified    190449
Manual                  23918
Hydraulic                6312
Name: Coupler, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Coupler', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Coupler
Hydraulic	6312	46839.729087	43000.0
Manual	23918	33493.829877	27000.0
None or Unspecified	190449	30392.948847	22500.0
	192019	31233.255205	24000.0

1.4.31 Coupler_System

total_df['Coupler_System'].fillna('').value_counts()

                       367724
None or Unspecified     41727
Yes                      3247
Name: Coupler_System, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Coupler_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Coupler_System
Yes	3247	11593.701879	11500.0
None or Unspecified	41727	10504.809931	10000.0
	367724	33738.521242	26000.0

1.4.32 Grouser_Tracks

total_df['Grouser_Tracks'].fillna('').value_counts()

                       367823
None or Unspecified     41820
Yes                      3055
Name: Grouser_Tracks, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Tracks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Grouser_Tracks
Yes	3055	12871.268412	13000.0
None or Unspecified	41820	10410.951196	9750.0
	367823	33732.896625	26000.0

1.4.33 Hydraulics_Flow

total_df['Hydraulics_Flow'].fillna('').value_counts()

                       367823
Standard                44251
High Flow                 597
None or Unspecified        27
Name: Hydraulics_Flow, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics_Flow', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Hydraulics_Flow
None or Unspecified	27	15053.703704	15000.0
High Flow	597	13183.165829	13000.0
Standard	44251	10540.573185	10000.0
	367823	33732.896625	26000.0

1.4.34 Track_Type

total_df['Track_Type'].fillna('').value_counts()

          310505
Steel      87463
Rubber     14730
Name: Track_Type, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Track_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Track_Type
Rubber	14730	15823.905906	13500.0
Steel	87463	39245.894733	34000.0
	310505	29683.235742	22500.0

1.4.35 Undercarriage_Pad_Width

total_df['Undercarriage_Pad_Width'].fillna('').value_counts()

                       309782
None or Unspecified     82444
32 inch                  5287
28 inch                  3152
24 inch                  2998
20 inch                  2664
30 inch                  1602
36 inch                  1544
18 inch                  1439
34 inch                   540
16 inch                   481
31 inch                   191
27 inch                   144
22 inch                   135
26 inch                    98
33 inch                    94
14 inch                    51
15 inch                    33
25 inch                    17
31.5 inch                   2
Name: Undercarriage_Pad_Width, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Undercarriage_Pad_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Undercarriage_Pad_Width
31.5 inch	2	108000.000000	108000.0
25 inch	17	22591.176471	20000.0
15 inch	33	16803.030303	15000.0
14 inch	51	16869.607843	14500.0
33 inch	94	51848.936170	47500.0
26 inch	98	30831.632653	28750.0
22 inch	135	33398.518519	26500.0
27 inch	144	33826.041667	28625.0
31 inch	191	48378.795812	44500.0
16 inch	481	20531.702703	17500.0
34 inch	540	61289.722222	60000.0
18 inch	1439	20302.710215	18000.0
36 inch	1544	46214.297280	41000.0
30 inch	1602	36943.008739	30000.0
20 inch	2664	29552.346096	27500.0
24 inch	2998	37054.419613	31500.0
28 inch	3152	37732.447652	33500.0
32 inch	5287	48901.324002	45500.0
None or Unspecified	82444	35017.391514	28000.0
	309782	29688.370739	22500.0

1.4.36 Stick_Length

total_df['Stick_Length'].fillna('').value_counts()

                       310437
None or Unspecified     81539
9' 6"                    5832
10' 6"                   3519
11' 0"                   1601
9' 10"                   1463
9' 8"                    1462
9' 7"                    1423
12' 10"                  1087
10' 2"                   1004
8' 6"                     908
8' 2"                     614
10' 10"                   414
12' 8"                    322
11' 10"                   307
8' 4"                     274
8' 10"                    104
12' 4"                    103
9' 5"                     101
15' 9"                     87
6' 3"                      51
13' 7"                     11
14' 1"                      7
13' 10"                     7
13' 9"                      7
19' 8"                      5
7' 10"                      3
15' 4"                      3
24' 3"                      2
9' 2"                       1
Name: Stick_Length, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Stick_Length', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Stick_Length
15' 9"	87	57068.965517	55000.0
9' 5"	101	46074.257426	46000.0
12' 4"	103	47505.339806	40000.0
8' 10"	104	43937.500000	42750.0
8' 4"	274	33362.226277	28000.0
11' 10"	307	48237.785016	42500.0
12' 8"	322	58167.701863	56000.0
10' 10"	414	60149.275362	57000.0
8' 2"	614	33933.306189	31000.0
8' 6"	908	36311.949339	31000.0
10' 2"	1004	45855.229084	44000.0
12' 10"	1087	66730.174793	66000.0
9' 7"	1423	56526.528461	55000.0
9' 8"	1462	49643.296854	47000.0
9' 10"	1463	39891.148325	36500.0
11' 0"	1601	48882.151780	45000.0
10' 6"	3519	53677.642796	50000.0
9' 6"	5832	46655.847051	42500.0
None or Unspecified	81539	32556.659697	26000.0
	310437	29683.170383	22500.0

1.4.37 Thumb

total_df['Thumb'].fillna('').value_counts()

                       310366
None or Unspecified     85074
Manual                   9678
Hydraulic                7580
Name: Thumb, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Thumb', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Thumb
Hydraulic	7580	38316.014644	33000.0
Manual	9678	39451.081318	34000.0
None or Unspecified	85074	35234.459047	28000.0
	310366	29683.224368	22500.0

1.4.38 Pattern_Changer

total_df['Pattern_Changer'].fillna('').value_counts()

                       310437
None or Unspecified     92924
Yes                      9269
No                         68
Name: Pattern_Changer, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Pattern_Changer', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Pattern_Changer
No	68	29981.617647	27500.0
Yes	9269	52301.261085	52000.0
None or Unspecified	92924	34230.870776	28000.0
	310437	29683.170383	22500.0

1.4.39 Grouser_Type

total_df['Grouser_Type'].fillna('').value_counts()

          310505
Double     86998
Triple     15193
Single         2
Name: Grouser_Type, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Grouser_Type
Single	2	52000.000000	52000.0
Triple	15193	42755.055947	37500.0
Double	86998	34667.098784	28000.0
	310505	29683.235742	22500.0

1.4.40 Backhoe_Mounting

total_df['Backhoe_Mounting'].fillna('').value_counts()

                       331986
None or Unspecified     80692
Yes                        20
Name: Backhoe_Mounting, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Backhoe_Mounting', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Backhoe_Mounting
Yes	20	16462.500000	15500.0
None or Unspecified	80692	36486.290155	30000.0
	331986	29934.882688	22500.0

1.4.41 Blade_Type

total_df['Blade_Type'].fillna('').value_counts()

                       330823
PAT                     39633
Straight                13461
None or Unspecified     11841
Semi U                   8907
VPAT                     3681
U                        1888
Angle                    1684
No                        743
Landfill                   26
Coal                       11
Name: Blade_Type, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Blade_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Blade_Type
Coal	11	50227.272727	48000.0
Landfill	26	69942.307692	63750.0
No	743	25569.078062	21000.0
Angle	1684	32742.923990	25500.0
U	1888	44536.493644	35000.0
VPAT	3681	62566.143711	59000.0
Semi U	8907	58495.607387	55000.0
None or Unspecified	11841	31077.752149	25000.0
Straight	13461	35548.990120	28000.0
PAT	39633	30679.544117	27000.0
	330823	29949.806359	22500.0

1.4.42 Travel_Controls

total_df['Travel_Controls'].fillna('').value_counts()

                       330821
None or Unspecified     71447
Differential Steer       5257
Finger Tip               2693
2 Pedal                  1144
Lever                     902
Pedal                     423
1 Speed                    11
Name: Travel_Controls, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Travel_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Travel_Controls
1 Speed	11	15672.727273	11500.0
Pedal	423	24667.966903	22500.0
Lever	902	33187.900222	26000.0
2 Pedal	1144	25588.002622	21000.0
Finger Tip	2693	57974.053101	55000.0
Differential Steer	5257	68752.648849	67500.0
None or Unspecified	71447	33407.344104	27500.0
	330821	29950.385598	22500.0

1.4.43 Differential_Type

total_df['Differential_Type'].fillna('').value_counts()

                341134
Standard         70169
Limited Slip      1181
No Spin            212
Locking              2
Name: Differential_Type, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Differential_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Differential_Type
Locking	2	66000.000000	66000.0
No Spin	212	52889.386792	51475.0
Limited Slip	1181	57566.553768	57000.0
Standard	70169	37061.729952	32000.0
	341134	29907.683667	22500.0

1.4.44 Steering_Controls

total_df['Steering_Controls'].fillna('').value_counts()

                       341176
Conventional            70774
Command Control           594
Four Wheel Standard       139
Wheel                      14
No                          1
Name: Steering_Controls, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='Steering_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
Steering_Controls
No	1	17500.000000	17500.0
Wheel	14	21517.857143	17500.0
Four Wheel Standard	139	24658.273381	22000.0
Command Control	594	74774.410774	75000.0
Conventional	70774	37168.659098	32000.0
	341176	29907.455419	22500.0

1.4.45 fiManufacturerDesc

total_df['fiManufacturerID'].fillna('').value_counts()

26      169003
43       74527
25       42142
103      38928
121      25033
         ...  
5            2
923          1
1518         1
112          1
525          1
Name: fiManufacturerID, Length: 104, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='fiManufacturerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
fiManufacturerID
405	1070	10707.691589	9500.0
129	1117	14373.209490	10000.0
86	1178	34583.191851	30000.0
46	1344	17102.771577	15000.0
158	1378	30334.216255	26000.0
95	1628	34679.652948	31000.0
166	1906	23662.644281	20000.0
135	2553	18356.462593	14500.0
54	2585	18324.776402	15000.0
750	3305	12966.822995	11500.0
176	4224	43841.872633	40000.0
99	5087	33323.884411	28000.0
92	5260	21678.645437	18500.0
55	5902	13283.747035	11750.0
74	11291	33328.462935	27000.0
121	25033	10637.495067	10000.0
103	38928	33710.587988	27500.0
25	42142	21210.952589	18000.0
43	74527	27908.815087	23000.0
26	169003	40321.710425	33000.0

1.4.46 PrimarySizeBasis

total_df['PrimarySizeBasis'].fillna('').value_counts()

Horsepower                     180262
Weight - Metric Tons           105019
Standard Digging Depth - Ft     77848
Operating Capacity - Lbs        44226
                                 5259
Model                              78
Weight - Metric                     4
Weight - Lbs                        1
Cutting Width - Inches              1
Name: PrimarySizeBasis, dtype: int64

temp = pd.pivot_table(total_df.fillna(''), index='PrimarySizeBasis', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)

	count	mean	median
	SalePrice	SalePrice	SalePrice
PrimarySizeBasis
Cutting Width - Inches	1	23000.000000	23000.0
Weight - Lbs	1	27500.000000	27500.0
Weight - Metric	4	22562.500000	22750.0
Model	78	25535.256410	17750.0
	5259	15941.918616	14000.0
Operating Capacity - Lbs	44226	10612.493872	10000.0
Standard Digging Depth - Ft	77848	21202.527078	21000.0
Weight - Metric Tons	105019	35654.973148	29000.0
Horsepower	180262	38455.691062	32000.0

1.4.47 PrimaryLower

total_df['PrimaryLower'].fillna('').value_counts()

14.0       62037
20.0       17943
130.0      17113
150.0      15322
85.0       15054
           ...  
1.8            3
4.5            2
25000.0        1
2.7            1
300.0          1
Name: PrimaryLower, Length: 76, dtype: int64

**总结：**发现的规律如下：

从EXCEL表中可以明显看出部分特征是要么一起出现，要么一起空，这些可以在创建衍生特征时做成交叉特征。
部分特征虽然是object字段，但是能排出顺序，比如“Low、Medium、High”等，和‘xx inch’等，后面处理成定序变量。
空值都单独用一列标记。
没有特殊规律的类别特征统一用Label Encoder编码。
与价格的透视表可以为分箱提供依据。

EDA到此堂堂完结！！！！！！！！！！！！！！！！！！！！！！！

2. 数据清洗+数据预处理

2.1 创建衍生变量

2.1.1 SaleDate

def to_sin(n, i):
    return round(np.sin((2*np.pi/n)*(i-1)+(2*np.pi/7)) + 1, 2)

total_df['SaleYear'] = total_df['SaleDate'].dt.year
total_df['SaleMonth'] = total_df['SaleDate'].dt.month
total_df['SaleDay'] = total_df['SaleDate'].dt.day
total_df['SaleDayOfWeek'] = total_df['SaleDate'].dt.dayofweek
total_df['SaleDayOfYear'] = total_df['SaleDate'].dt.dayofyear
# 删除原来的SaleDate特征
total_df.drop('SaleDate', axis=1, inplace=True)

# 尝试余弦化拉近1月和12月的距离
total_df['SaleDayOfWeek_sin'] = total_df['SaleDayOfWeek'].map(lambda x: to_sin(7, x)).value_counts()
total_df['SaleMonth_sin'] = total_df['SaleMonth'].map(lambda x: to_sin(12, x)).value_counts()

2.1.2 Stick、Turbocharged

def combine_features(df, col_list):
    temp = df[col_list[0]].astype(str)
    for col in col_list[1:]:
        temp += df[col].astype(str)
    return temp

total_df['Stick__Turbocharged'] = combine_features(total_df, ['Stick', 'Turbocharged'])
total_df['Stick__Turbocharged'].value_counts()

nannan                         331602
StandardNone or Unspecified     47981
ExtendedNone or Unspecified     29130
ExtendedYes                      2112
StandardYes                      1873
Name: Stick__Turbocharged, dtype: int64

2.1.3 Blade_Extension、Blade_Width、Enclosure_Type、Engine_Horsepower

total_df['Blade__Blade__Enclosure__Engine'] = combine_features(total_df, ['Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower'])
total_df['Blade__Blade__Enclosure__Engine'].value_counts()

nannannannan                                                         386715
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedNo            8572
None or Unspecified14'None or UnspecifiedNo                            7307
None or Unspecified12'None or UnspecifiedNo                            4466
None or Unspecified14'Low ProfileNo                                    1088
None or Unspecified16'None or UnspecifiedNo                             818
None or UnspecifiedNone or UnspecifiedLow ProfileNo                     435
None or Unspecified12'Low ProfileNo                                     416
None or Unspecified14'Low ProfileVariable                               377
None or Unspecified14'High ProfileNo                                    348
None or Unspecified14'None or UnspecifiedVariable                       343
None or Unspecified13'None or UnspecifiedNo                             313
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedVariable       190
Yes14'None or UnspecifiedNo                                             126
None or Unspecified14'High ProfileVariable                              122
None or UnspecifiedNone or UnspecifiedHigh ProfileNo                    112
None or Unspecified<12'None or UnspecifiedNo                             99
None or Unspecified12'High ProfileNo                                     96
YesNone or UnspecifiedNone or UnspecifiedNo                              86
Yes12'None or UnspecifiedNo                                              84
None or UnspecifiedNone or UnspecifiedLow ProfileVariable                74
None or Unspecified16'Low ProfileNo                                      58
Yes14'Low ProfileNo                                                      55
Yes14'Low ProfileVariable                                                50
Yes12'Low ProfileNo                                                      47
None or Unspecified12'High ProfileVariable                               43
None or Unspecified16'High ProfileNo                                     34
None or UnspecifiedNone or UnspecifiedHigh ProfileVariable               27
None or Unspecified12'Low ProfileVariable                                21
Yes14'High ProfileVariable                                               21
Yes14'None or UnspecifiedVariable                                        18
None or Unspecified12'None or UnspecifiedVariable                        17
None or Unspecified13'Low ProfileNo                                      16
Yes16'Low ProfileNo                                                      15
Yes14'High ProfileNo                                                     12
Yes16'None or UnspecifiedNo                                              10
YesNone or UnspecifiedLow ProfileNo                                       9
YesNone or UnspecifiedNone or UnspecifiedVariable                         9
Yes16'High ProfileNo                                                      6
Yes12'High ProfileNo                                                      6
YesNone or UnspecifiedLow ProfileVariable                                 4
None or Unspecified16'High ProfileVariable                                4
None or Unspecified16'None or UnspecifiedVariable                         4
None or Unspecified16'Low ProfileVariable                                 4
Yes13'None or UnspecifiedNo                                               4
Yes16'Low ProfileVariable                                                 3
Yes12'Low ProfileVariable                                                 3
YesNone or UnspecifiedHigh ProfileNo                                      3
Yes12'High ProfileVariable                                                2
Yes16'High ProfileVariable                                                2
Yes16'None or UnspecifiedVariable                                         2
None or Unspecified13'None or UnspecifiedVariable                         1
None or Unspecified13'High ProfileNo                                      1
Name: Blade__Blade__Enclosure__Engine, dtype: int64

2.1.4 Pushblock、Scarifier、Tip_Control

total_df['Pushblock__Scarifier__TipControl'] = combine_features(total_df, ['Pushblock', 'Scarifier', 'Tip_Control'])
total_df['Pushblock__Scarifier__TipControl'].value_counts()

nannannan                                                    386704
None or UnspecifiedYesNone or Unspecified                      6742
None or UnspecifiedNone or UnspecifiedNone or Unspecified      6228
None or UnspecifiedYesSideshift & Tip                          3088
YesNone or UnspecifiedNone or Unspecified                      2560
None or UnspecifiedNone or UnspecifiedSideshift & Tip          2401
YesYesNone or Unspecified                                      1302
YesNone or UnspecifiedSideshift & Tip                          1182
None or UnspecifiedYesTip                                      1103
YesYesSideshift & Tip                                           493
None or UnspecifiedNone or UnspecifiedTip                       455
YesYesTip                                                       226
YesNone or UnspecifiedTip                                       203
nanYesnan                                                         7
nanNone or Unspecifiednan                                         4
Name: Pushblock__Scarifier__TipControl, dtype: int64

2.1.5 Coupler_System、Grouser_Tracks、Hydraulics_Flow

total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'] = combine_features(total_df, ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])
total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'].value_counts()

nannannan                                                    367724
None or UnspecifiedNone or UnspecifiedStandard                39040
YesNone or UnspecifiedStandard                                 2289
None or UnspecifiedYesStandard                                 2149
YesYesStandard                                                  773
None or UnspecifiedNone or UnspecifiedHigh Flow                 364
YesNone or UnspecifiedHigh Flow                                 122
None or Unspecifiednannan                                        91
None or UnspecifiedYesHigh Flow                                  57
YesYesHigh Flow                                                  54
None or UnspecifiedYesNone or Unspecified                        22
Yesnannan                                                         8
None or UnspecifiedNone or UnspecifiedNone or Unspecified         4
YesNone or UnspecifiedNone or Unspecified                         1
Name: CouplerSystem__GrouserTracks__HydraulicsFlow, dtype: int64

2.1.6 Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type

col_list = ('Track_Type	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type').split('\t')
total_df['TUSTPG'] = combine_features(total_df, col_list)
total_df['TUSTPG'].value_counts()

nannannannannannan                                                                          309643
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble      38336
RubberNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble     10031
SteelNone or UnspecifiedNone or UnspecifiedManualNone or UnspecifiedDouble                    3900
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedTriple       3214
                                                                                             ...  
Steel32 inch8' 2"ManualNone or UnspecifiedDouble                                                 1
Steel30 inch12' 10"ManualNone or UnspecifiedTriple                                               1
Steel20 inch8' 4"ManualNone or UnspecifiedTriple                                                 1
Steel32 inch8' 4"None or UnspecifiedNone or UnspecifiedTriple                                    1
Rubber30 inchNone or UnspecifiedHydraulicNone or UnspecifiedDouble                               1
Name: TUSTPG, Length: 1065, dtype: int64

2.1.7 Backhoe_Mounting Blade_Type Travel_Controls

col_list = ('Backhoe_Mounting	Blade_Type	Travel_Controls').split('\t')
total_df['BBT'] = combine_features(total_df, col_list)
total_df['BBT'].value_counts()

nannannan                                                    330802
None or UnspecifiedPATNone or Unspecified                     37249
None or UnspecifiedStraightNone or Unspecified                11990
None or UnspecifiedNone or UnspecifiedNone or Unspecified     11106
None or UnspecifiedSemi UNone or Unspecified                   6582
None or UnspecifiedSemi UDifferential Steer                    2122
None or UnspecifiedVPATFinger Tip                              1857
None or UnspecifiedUNone or Unspecified                        1816
None or UnspecifiedAngleNone or Unspecified                    1560
None or UnspecifiedVPATNone or Unspecified                     1092
None or UnspecifiedPATDifferential Steer                       1011
None or UnspecifiedStraightDifferential Steer                   913
nanNo2 Pedal                                                    728
None or UnspecifiedVPATDifferential Steer                       708
None or UnspecifiedPATFinger Tip                                584
nanStraight2 Pedal                                              416
None or UnspecifiedPATLever                                     412
None or UnspecifiedPATPedal                                     370
None or UnspecifiedNone or UnspecifiedDifferential Steer        327
None or UnspecifiedNone or UnspecifiedLever                     285
None or UnspecifiedAngleDifferential Steer                      109
None or UnspecifiedSemi UFinger Tip                              97
None or UnspecifiedSemi ULever                                   96
None or UnspecifiedNone or UnspecifiedFinger Tip                 77
None or UnspecifiedStraightFinger Tip                            76
None or UnspecifiedStraightLever                                 56
None or UnspecifiedUDifferential Steer                           54
None or UnspecifiedNone or UnspecifiedPedal                      33
None or UnspecifiedVPATLever                                     24
nanNonan                                                         15
None or UnspecifiedLandfillNone or Unspecified                   14
None or UnspecifiedULever                                        14
YesNone or UnspecifiedNone or Unspecified                        13
None or UnspecifiedAngleLever                                    13
None or UnspecifiedLandfillDifferential Steer                    12
nannan1 Speed                                                    11
nannanNone or Unspecified                                        10
None or UnspecifiedSemi UPedal                                   10
None or UnspecifiedCoalNone or Unspecified                        8
YesPATNone or Unspecified                                         7
None or UnspecifiedStraightPedal                                  7
nanStraightnan                                                    3
None or UnspecifiedUPedal                                         2
None or UnspecifiedCoalLever                                      2
None or UnspecifiedAnglePedal                                     1
nanUnan                                                           1
None or UnspecifiedUFinger Tip                                    1
None or UnspecifiedAngleFinger Tip                                1
None or UnspecifiedCoalDifferential Steer                         1
Name: BBT, dtype: int64

2.1.8 Differential_Type Steering_Controls

col_list = ('Differential_Type	Steering_Controls').split('\t')
total_df['DS'] = combine_features(total_df, col_list)
total_df['DS'].value_counts()

nannan                         341120
StandardConventional            69411
Limited SlipConventional         1154
StandardCommand Control           564
No SpinConventional               209
StandardFour Wheel Standard       139
Standardnan                        54
Limited SlipCommand Control        27
nanWheel                           14
No SpinCommand Control              3
Lockingnan                          2
StandardNo                          1
Name: DS, dtype: int64

2.1.9 PrimarySizeBasis PrimaryLower

col_list = ('PrimarySizeBasis	PrimaryLower').split('\t')
total_df['PP'] = combine_features(total_df, col_list)
total_df['PP'].value_counts()

Standard Digging Depth - Ft14.0    57395
Horsepower20.0                     17941
Horsepower130.0                    17113
Horsepower150.0                    15299
Horsepower85.0                     15054
                                   ...  
Weight - Metric Tons20.0               2
Weight - Lbs25000.0                    1
Cutting Width - Inches0.0              1
Weight - Metric Tons2.7                1
Weight - Metric Tons300.0              1
Name: PP, Length: 90, dtype: int64

2.1.10 fiBaseModel

def is_hybrid(x):
    try:
        return type(eval(x)) == str
    except:
        return True

total_df['fiBaseModel_is_hybrid'] = total_df['fiBaseModel'].map(is_hybrid)

2.1.11 ProductGroupDesc

result = total_df['ProductGroupDesc'].str.split(' ')

total_df['ProductGroupDesc_father'] = [i[-1] for i in result]

total_df['ProductGroupDesc_father'].value_counts()

Loaders       126412
Excavators    104230
Tractors       82582
Loader         73216
Graders        26258
Name: ProductGroupDesc_father, dtype: int64

2.2 独热编码

def to_dummies(df, col_name):
    name_dict = {value: col_name + '_is_' + str(value) for value in df[col_name].unique()}
    return pd.get_dummies(df[col_name]).rename(columns=name_dict)

# col_names = ['CouplerSystem__GrouserTracks__HydraulicsFlow']

# for col in col_names:
#     total_df = pd.concat([total_df, to_dummies(total_df, col)], axis=1)

# total_df.drop(columns=col_names, inplace=True)

感觉没有明显提升，暂时不做了。

2.3 标签编码

2.3.1 UsageBand

usageband_dict = {
    'Low': 1,
    'Medium': 2,
    'High': 3
}
total_df['UsageBand'] = total_df['UsageBand'].map(usageband_dict).fillna(0)

total_df['UsageBand'].value_counts()

0.0    339028
2.0     35832
1.0     25311
3.0     12527
Name: UsageBand, dtype: int64

2.3.2 ProductSize

total_df['ProductSize_is_missing'] = total_df['ProductSize'].isna()

total_df['ProductSize'].value_counts()

Medium            64342
Large / Medium    51297
Small             27057
Mini              25721
Large             21396
Compact            6280
Name: ProductSize, dtype: int64

ProductSize_dict = {
    'Compact': 1,
    'Mini': 2,
    'Small': 3,
    'Medium': 4,
    'Large / Medium': 5,
    'Large': 6,
}
total_df['ProductSize'] = total_df['ProductSize'].map(ProductSize_dict).fillna(0)

total_df['ProductSize'].value_counts()

0.0    216605
4.0     64342
5.0     51297
3.0     27057
2.0     25721
6.0     21396
1.0      6280
Name: ProductSize, dtype: int64

2.3.3 Undercarriage_Pad_Width

total_df['Undercarriage_Pad_Width_is_missing'] = total_df['Undercarriage_Pad_Width'].isna()

total_df['Undercarriage_Pad_Width'].value_counts()

None or Unspecified    82444
32 inch                 5287
28 inch                 3152
24 inch                 2998
20 inch                 2664
30 inch                 1602
36 inch                 1544
18 inch                 1439
34 inch                  540
16 inch                  481
31 inch                  191
27 inch                  144
22 inch                  135
26 inch                   98
33 inch                   94
14 inch                   51
15 inch                   33
25 inch                   17
31.5 inch                  2
Name: Undercarriage_Pad_Width, dtype: int64

total_df['Undercarriage_Pad_Width'] = total_df['Undercarriage_Pad_Width'].fillna('0 inch').map(lambda x: 1 if x == 'None or Unspecified' else eval(x[:-4]))

total_df['Undercarriage_Pad_Width'].value_counts()

0.0     309782
1.0      82444
32.0      5287
28.0      3152
24.0      2998
20.0      2664
30.0      1602
36.0      1544
18.0      1439
34.0       540
16.0       481
31.0       191
27.0       144
22.0       135
26.0        98
33.0        94
14.0        51
15.0        33
25.0        17
31.5         2
Name: Undercarriage_Pad_Width, dtype: int64

2.3.4 Stick_Length

total_df['Stick_Length_is_missing'] = total_df['Stick_Length'].isna()

total_df['Stick_Length'].value_counts()

None or Unspecified    81539
9' 6"                   5832
10' 6"                  3519
11' 0"                  1601
9' 10"                  1463
9' 8"                   1462
9' 7"                   1423
12' 10"                 1087
10' 2"                  1004
8' 6"                    908
8' 2"                    614
10' 10"                  414
12' 8"                   322
11' 10"                  307
8' 4"                    274
8' 10"                   104
12' 4"                   103
9' 5"                    101
15' 9"                    87
6' 3"                     51
13' 7"                    11
14' 1"                     7
13' 10"                    7
13' 9"                     7
19' 8"                     5
7' 10"                     3
15' 4"                     3
24' 3"                     2
9' 2"                      1
Name: Stick_Length, dtype: int64

result = []

for i in total_df['Stick_Length'].fillna(0):
    if  i == 0:
        pass
    elif i == 'None or Unspecified':
        i = 1
    else:
        a = i.find("'")
        b = i.find('"')
        i = round(int(i[:a]) + int(i[a+2:b])/12, 2)
    result.append(i)

total_df['Stick_Length'] = result

total_df['Stick_Length'].value_counts()

0.00     310437
1.00      81539
9.50       5832
10.50      3519
11.00      1601
9.83       1463
9.67       1462
9.58       1423
12.83      1087
10.17      1004
8.50        908
8.17        614
10.83       414
12.67       322
11.83       307
8.33        274
8.83        104
12.33       103
9.42        101
15.75        87
6.25         51
13.58        11
14.08         7
13.83         7
13.75         7
19.67         5
7.83          3
15.33         3
24.25         2
9.17          1
Name: Stick_Length, dtype: int64

2.3.2 其余随机标签编码

le = LabelEncoder()

for col in total_df.columns:
    # 标记空值（事实证明用处不大。。。，只有一列标记派上了用场）
    if total_df[col].isna().sum() > 1000:
        total_df[col + '_is_missing'] = total_df[col].isna()
    # 编码
    if total_df[col].dtype == 'object':
        total_df[col] = le.fit_transform(total_df[col])

2.4 空值和异常值

# 异常值173
total_df['datasource'].map(lambda x: 172 if x == 173 else x)

0         132
1         132
2         132
3         132
4         132
         ... 
412693    149
412694    149
412695    149
412696    149
412697    149
Name: datasource, Length: 412698, dtype: int64

# 这一列的空值用众数填充
total_df['auctioneerID'].fillna(total_df['auctioneerID'].mode()[0], inplace=True)

total_df['MachineHoursCurrentMeter'].fillna(0, inplace=True)
total_df['PrimaryLower'].fillna(0, inplace=True)

total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Columns: 109 entries, SalePrice to SaleMonth_sin_is_missing
dtypes: bool(40), float64(10), int32(50), int64(9)
memory usage: 157.4 MB

2.5 删除重复列

def drop_dup_cols(df):
    cols = df.columns[60:]
    n = len(cols)
    for i in range(n-1):
        if len(df[df[cols[i]] != df[cols[i+1]]]) < 1000:
            df.drop(columns=[cols[i]], inplace=True)

drop_dup_cols(total_df)

3. 建模

# 以DataFrame的形式展现特征重要性
def show_feature_imp(model, df):
    return pd.DataFrame({df.columns[i]: model.feature_importances_[i] for i in range(len(df.columns))}, index=['imp']).T.sort_values(by='imp', ascending=False)

# 展示两个特征重要性表的共同倒数n%的特征
def show_low_features(df1, df2, ratio):
    n = int(len(df1) * ratio)
    df1_features = df1.tail(n).index
    df2_features = df2.tail(n).index
    return n, {feature: [df1.loc[feature].values[0], df2.loc[feature].values[0]] for feature in df1_features if feature in df2_features}

# 以字典的形式显示分数，可显示的分数有：R平方、MAE、MSE、RMSLE
def show_scores(y_test, y_pred):
    scores_dict = {}
    scores_dict['R2'] = r2_score(y_test, y_pred)
    scores_dict['MAE'] = mean_absolute_error(y_test, y_pred)
    scores_dict['MSE'] = mean_squared_error(y_test, y_pred)
    scores_dict['RMSLE'] = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return scores_dict

# 被后续的逐步回归筛出来的特征
drop_list = ['SalePrice', 'MachineID', 'PrimaryLower_is_missing',  
             'Drive_System_is_missing', 'Forks_is_missing', 'ProductSize_is_missing',
             'Ride_Control_is_missing', 'Stick_is_missing', 'Stick_Length_is_missing', 'Transmission_is_missing', 
             'Turbocharged_is_missing', 'Engine_Horsepower_is_missing', 'Hydraulics_is_missing', 
             'Pad_Type_is_missing', 'Ripper_is_missing', 'Tip_Control_is_missing', 'Tire_Size_is_missing', 
             'Coupler_is_missing', 'Hydraulics_Flow_is_missing', 'Grouser_Type_is_missing', 
             'Backhoe_Mounting_is_missing', 'Travel_Controls_is_missing', 'Steering_Controls_is_missing', 
             'Pushblock_is_missing', 'SaleDayOfWeek', 'SaleMonth', 'Differential_Type', 
             'Stick__Turbocharged', 'SaleDayOfYear', 'MachineHoursCurrentMeter_is_missing', 
             'Scarifier', 'Backhoe_Mounting', 'Forks', 'Track_Type', 'Engine_Horsepower', 
             'Blade_Extension', 'Coupler_System', 'SaleMonth_sin', 'SaleDayOfWeek_sin', 'SaleMonth_sin_is_missing',
             'fiBaseModel_is_hybrid', 'Hydraulics_Flow', 'fiModelDescriptor', 'SaleDay', 'Hydraulics', 'Pushblock', 'Transmission',
             'PrimarySizeBasis', 'Stick','DS', 'Tip_Control', 'Grouser_Tracks', 'fiSecondaryDesc', 'fiModelSeries'
            ]

# 划分训练集、验证集
X = total_df.drop(drop_list, axis=1)
y = total_df['SalePrice']

X_train = X[X['SaleYear']<2012]
X_test = X[X['SaleYear']>=2012]
y_train = y[X['SaleYear']<2012]
y_test = y[X['SaleYear']>=2012]

raise KeyError

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [1044], in <cell line: 1>()
----> 1 raise KeyError


KeyError:

3.1 RF

# rfr = RandomForestRegressor(
#     n_estimators = 100,
#     max_depth = 23,
#     min_samples_leaf = 2,
#     min_samples_split = 4,
#     n_jobs = -1,
#     random_state = 10
# )
# rfr.fit(X_train, y_train)
# show_scores(y_test, rfr.predict(X_test))

# rfr_imp = show_feature_imp(rfr, X_train)

# rfr_imp.head(20)

# rfr_grid = {
#     'n_estimators': [100, 150, 200],
#     'max_depth': [15, 20, 25],
#     'max_features': [0.5, 0.6, 0.7],
#     'min_samples_leaf': [2],
#     'min_samples_split': [4],
#     'max_samples': [2000],
#     'n_jobs': [-1]  
# }

# gscv = GridSearchCV(RandomForestRegressor(), param_grid=rfr_grid, cv=5, verbose=True)

# gscv.fit(X_train, y_train)

3.2 XGB

xgb_grid = {
    'n_estimators': [200],
    'learning_rate': [0.0835],
    'reg_lambda': [0.48, 0.5, 0.52]
}

gscv = GridSearchCV(XGBRegressor(), param_grid=xgb_grid)

gscv.fit(X_train, y_train)

gscv.best_params_

xgb = XGBRegressor(
    n_estimators = 250, # 这个很重要
    max_depth = 11, # 这个很重要
    learning_rate = 0.11, # 这个很重要
    reg_lambda = 0.2, # 这个比较重要
    n_jobs=-1,
    random_state = 13
)
xgb.fit(X_train, y_train)
show_scores(y_test, xgb.predict(X_test))

{'R2': 0.884087856079633,
 'MAE': 5745.138324664967,
 'MSE': 79616854.10269143,
 'RMSLE': 0.23743323282056006}

总结：

调参的时候自己要先清楚哪些参数对模型影响大，最好用gscv来两次，第一次先摸清大概范围，第二次再精确到小数点后1-3位。

4. 特征筛选

逐步回归筛选特征（注意，后续提到的逐步回归都以本文自定义的“青春版”逐步回归函数为准）

后向逐步回归：按照给定的特征列表（按照之前fit一遍的特征重要性排升序）遍历。每丢一个特征计算一次RMSLE，如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。
前向逐步回归：按照给定的特征列表（按照之前fit一遍的特征重要性排序）遍历。每丢一个特征计算一次RMSLE，如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。
如果希望更稳重，不要如此轻易就丢弃一个特征，可选的函数改进方式是改为二重循环：每一轮都将每个特征遍历一遍，计算前后的得分（或者误差）差值gain，剔除gain最小/大的特征，然后再进入下一轮，直到达到最大迭代次数或gain达到阈值。

def step_wise_reg(model, X_train, X_test, y_train, y_test, suspicion_list, method='backward', tol=0, show_info=True, dilution_ratio=1):
    X_train = X_train[::dilution_ratio]
    y_train = y_train[::dilution_ratio]
    
    if method == 'backward':
        drop_list = [] 
    elif method == 'forward':
        drop_list = list(suspicion_list)
    else:
        raise KeyError
    
    
    model.fit(X_train.drop(columns=drop_list), y_train)
    old_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))

    for last_feature in suspicion_list:
        
        if method == 'backward':
            drop_list.append(last_feature)
        else:
            drop_list.remove(last_feature)

        model.fit(X_train.drop(columns=drop_list), y_train)
        new_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))

        if show_info:
            print(old_score, new_score, drop_list)
        
        
        if old_score - new_score < tol:
            if method == 'backward':
                drop_list.remove(last_feature)   
            else:
                drop_list.append(last_feature)
        else:
            old_score = new_score

    return old_score, new_score, drop_list

old_score, new_score, add_drop_list1 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='backward', suspicion_list=show_feature_imp(xgb, X_train).index[::-1])

add_drop_list1

old_score, new_score, add_drop_list2 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='forward', tol=0.0001,suspicion_list=show_feature_imp(xgb, X_train).index[1:])

add_drop_list2

总结：

逐步回归十分好用，过一遍特征，误差就降下去了，推测特征被筛出去的原因主要是：

和其他特征的信息有一定相关性，即存在冗余信息。简单计算一下，就算全部是二分类特征，2的20次方也能表示出100万个不同的样本了，因此对于此次只有40万条数据的数据集来说，50+的特征绝对是存在冗余的，会干扰模型判断。
本身是个好特征，但没有经过恰当的分箱/编码处理，导致蕴含的信息不能传达给模型，反而起到了负提升。这就只有根据业务逻辑反复尝试了，一边尝试分箱，一边要回来调参。
本身是好特征，预处理也到位了，但模型进行逐步回归时没有同步调参，导致误差反而上升。尤其是前向逐步回归，从1个特征增加到50个特征，跨度如此之大，模型需要的参数必然是不同的。

以上2、3两点就造成了误筛。所以使用逐步回归时要注意：被筛出的特征也不一定就是没用的特征，还得自己判断。

5. 最终成果

show_scores(y_test, xgb.predict(X_test))

{'R2': 0.884087856079633,
 'MAE': 5745.138324664967,
 'MSE': 79616854.10269143,
 'RMSLE': 0.23743323282056006}

xgb_imp = show_feature_imp(xgb, X_train)
xgb_imp

	imp
CouplerSystem__GrouserTracks__HydraulicsFlow	0.397497
ProductSize	0.249447
PP	0.079305
Ride_Control	0.042499
fiManufacturerID	0.037125
YearMade	0.031770
PrimaryLower	0.027179
ProductGroupDesc_father	0.015217
SaleYear	0.012649
fiModelDesc	0.012630
Blade_Width	0.010005
Pushblock__Scarifier__TipControl	0.009107
Drive_System	0.006723
Ripper	0.006685
Enclosure	0.005481
Blade__Blade__Enclosure__Engine	0.004808
Pad_Type	0.004427
fiProductClassDesc	0.004194
ProductGroupDesc	0.003965
Travel_Controls	0.003888
Tire_Size	0.003268
Pattern_Changer	0.003167
ModelID	0.002938
UsageBand	0.002606
BBT	0.002432
Steering_Controls	0.002027
Turbocharged	0.001827
fiBaseModel	0.001778
Blade_Type	0.001742
Undercarriage_Pad_Width	0.001496
Stick_Length	0.001339
TUSTPG	0.001334
Thumb	0.001193
Coupler	0.001181
Grouser_Type	0.001138
MachineHoursCurrentMeter	0.001100
state	0.001093
auctioneerID_is_missing	0.001045
Enclosure_Type	0.001007
auctioneerID	0.000875
datasource	0.000812