【Python机器学习】回归模型:推土机售价预测

文章目录

使用机器学习预测推土机的售价

1. 定义问题

考虑到推土机的特性,利用过去的数据,我们能多大程度上预测它未来的价格?

2. 数据来源

kaggle:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

The data for this competition is split into three parts:

  • Train.csv is the training set, which contains data through the end of 2011.
  • Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
  • Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

3. 评价标准

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

更多信息:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

注意:多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。

4. 使用的特征

特征过多,请自行进入kaggle项目主页查看。或点击如下谷歌表格链接:https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0

零、导入模块

# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")

# 数据预处理
from sklearn.preprocessing import LabelEncoder

# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score

一、EDA

bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
                           low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
                           low_memory = False)

1.1 查看基本信息

# 查看各字段类型
bulldozer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   object 
 14  fiModelDescriptor         74816 non-null   object 
 15  ProductSize               196093 non-null  object 
 16  fiProductClassDesc        412698 non-null  object 
 17  state                     412698 non-null  object 
 18  ProductGroup              412698 non-null  object 
 19  ProductGroupDesc          412698 non-null  object 
 20  Drive_System              107087 non-null  object 
 21  Enclosure                 412364 non-null  object 
 22  Forks                     197715 non-null  object 
 23  Pad_Type                  81096 non-null   object 
 24  Ride_Control              152728 non-null  object 
 25  Stick                     81096 non-null   object 
 26  Transmission              188007 non-null  object 
 27  Turbocharged              81096 non-null   object 
 28  Blade_Extension           25983 non-null   object 
 29  Blade_Width               25983 non-null   object 
 30  Enclosure_Type            25983 non-null   object 
 31  Engine_Horsepower         25983 non-null   object 
 32  Hydraulics                330133 non-null  object 
 33  Pushblock                 25983 non-null   object 
 34  Ripper                    106945 non-null  object 
 35  Scarifier                 25994 non-null   object 
 36  Tip_Control               25983 non-null   object 
 37  Tire_Size                 97638 non-null   object 
 38  Coupler                   220679 non-null  object 
 39  Coupler_System            44974 non-null   object 
 40  Grouser_Tracks            44875 non-null   object 
 41  Hydraulics_Flow           44875 non-null   object 
 42  Track_Type                102193 non-null  object 
 43  Undercarriage_Pad_Width   102916 non-null  object 
 44  Stick_Length              102261 non-null  object 
 45  Thumb                     102332 non-null  object 
 46  Pattern_Changer           102261 non-null  object 
 47  Grouser_Type              102193 non-null  object 
 48  Backhoe_Mounting          80712 non-null   object 
 49  Blade_Type                81875 non-null   object 
 50  Travel_Controls           81877 non-null   object 
 51  Differential_Type         71564 non-null   object 
 52  Steering_Controls         71522 non-null   object 
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
# 查看缺失值
bulldozer_df.isna().sum()
SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
dtype: int64
# 查看标签分布
bulldozer_df['SalePrice'].hist()
<AxesSubplot:>​    

请添加图片描述

1.2 特征类型转换

# 改为帕斯卡命名
bulldozer_df.rename(columns={'saledate': 'SaleDate'}, inplace=True)
bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])
# 按时间查看售价,只看最近几年的,没有明显的上升趋势,只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()

请添加图片描述

2005年前后半年和2008年整个一年销量都比较差,除此之外看不出太多信息

# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)
205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
54653    1989-01-31
81383    1989-01-31
204924   1989-01-31
135376   1989-01-31
113390   1989-01-31
113394   1989-01-31
116419   1989-01-31
32138    1989-01-31
127610   1989-01-31
76171    1989-01-31
127000   1989-01-31
128130   1989-01-31
127626   1989-01-31
55455    1989-01-31
55454    1989-01-31
Name: SaleDate, dtype: datetime64[ns]

1.3 联表+特征初筛

这个数据集比较特殊,还有个appendix表,里面是一些推土机的配件信息,而且这个信息和train表有重复特征,重复特征里面还有匹配不上的情况,现在先联表上来看看

# 制作数据集副本,这是为了方便对数据集做了什么操作后,仍然可以获取原始数据,而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()
# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)
# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
    temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]
# 定义一个合并时互补的函数,冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
    temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    temp_df0.fillna('', inplace=True)
    temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
    temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
    df1[target_col] = temp_df0[target_col+'_x']
    df2.drop(columns=[target_col], inplace=True)
    return df1

1.3.1 删除包含重复信息的特征

ProductGroup是ProductGroupDesc的首字母缩写版,选择保留后者

fiManufacturerID和fiManufacturerDesc包含的信息一样,选择保留前者

bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)

1.3.2 fiBaseModel

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()
580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64
# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()
MachineID        0
fiBaseModel_x    0
fiBaseModel_y    0
dtype: int64
# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')
MachineIDfiBaseModel_xfiBaseModel_y
711523610WA150HD465
981059447WA300PC100
1231303779415862
1281340389MS240MS120
1791208516D31PC100
............
4124742308891184575
4124842287735T135T133
4125092292146450465
4126551846321TB135TB125
4126951918416337530

12452 rows × 3 columns

# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)

1.3.3 fiSecondaryDesc

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()
C         44431
B         40165
G         37915
H         24729
E         21532
          ...  
BLGPPS        1
MSR           1
LC7A          1
CL            1
BH            1
Name: fiSecondaryDesc, Length: 177, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
MachineIDfiSecondaryDesc_xfiSecondaryDesc_y
01126363NaNNaN
11194089NaNNaN
31327630NaNNaN
61082797NaNNaN
71527216NaNNaN
............
4126901823846NaNNaN
4126911278794NaNNaN
4126921792049NaNNaN
4126941919104NaNNaN
4126951918416GNaN

147009 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

1.3.4 fiModelSeries

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()
II      13770
LC       9175
III      5351
-1       4646
-2       4033
        ...  
LL          1
6F          1
-2LC        1
-5A         1
VII         1
Name: fiModelSeries, Length: 123, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')
MachineIDfiModelSeries_xfiModelSeries_y
01126363NaNNaN
11194089NaNNaN
21473654NaNNaN
31327630NaNNaN
41336053NaNNaN
............
4126931915521NaNNaN
4126941919104NaNNaN
4126951918416NaNNaN
412696509560NaNNaN
4126971869284NaNNaN

362366 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')

1.3.5 fiModelDescriptor

# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()
L            16464
LGP          16143
LC           13295
XL            6700
6             2944
             ...  
K5               1
HighLift         1
High Lift        1
III              1
SL               1
Name: fiModelDescriptor, Length: 140, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
MachineIDfiModelDescriptor_xfiModelDescriptor_y
01126363NaNNaN
11194089NaNNaN
21473654NaNNaN
31327630NaNNaN
41336053NaNNaN
............
4126931915521NaNNaN
4126941919104NaNNaN
4126951918416NaNNaN
412696509560NaNNaN
4126971869284NaNNaN

342069 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

1.3.6 ProductGroupDesc

# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()
Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64
# 枚举值太多了,而且bd_df里的数据并无空值,这里就选择不互补了,直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()
Track Excavators                 89094
Backhoe Loaders                  74074
Track Type Tractors              67362
Wheel Loader                     62426
Skid Steer Loaders               42121
Motor Graders                    22602
Track Loaders                      158
Articulated Trucks                 109
Ag Tractors                        103
Wheel Tractor Scraper              102
Off Highway Trucks                  81
Multi Terrain Loaders               66
Wheel Excavator                     66
Forklift                            53
Skidders                            46
Wheel Feller Buncher                31
Forestry Log Loaders                25
Pipelayers                          11
Telehandler                          9
Wheel Dozer                          7
Knuckleboom Loaders                  6
Track Feller Bunchers                6
Vibratory Double Drum Asphalt        4
Vibratory Single Drum Asphalt        4
Work Tool                            3
Pneumatic Tired Compactor            3
Compactors                           3
Vibratory Single Drum Pad            3
Tandem Roller Static                 2
Harvesters                           2
Forwarders                           2
Engine, Industrial OEM               2
Track Harvesters                     1
Delimber Forestry                    1
Asphalt/Concrete Pavers              1
Vibratory Single Drum Smooth         1
Crane/Dragline                       1
Forestry Excavators                  1
Cold Planers                         1
Name: ProductGroupDesc, dtype: int64
app_df.drop(columns=['ProductGroupDesc'], inplace=True)

1.3.7 MfgYear

# 两边同一个特征名字不一样,先改名
app_df.rename(columns={'MfgYear': 'YearMade'}, inplace=True)
# 查看枚举值,发现异常值1000
bd_df['YearMade'].value_counts()
1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')
MachineIDYearMade_xYearMade_y
52153165619751987.0
58107813219671966.0
71152361019862007.0
98105944719841996.0
122152518019841985.0
............
412651163172910001990.0
412654191210610001996.0
412655184632110002005.0
412667189753519981999.0
41269650956019931990.0

36406 rows × 3 columns

这种不一致通常是因为bd_df里面有重复的MachineID,即同一种推土机被卖了多次,而app_df里只有唯一MachineID所导致,决定保留bd_df的数据,不动它

app_df.drop('YearMade', axis=1, inplace=True)

1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper

这三个特征原训练集上都没有,直接联过去

# total_df = bd_df.copy()
total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')
# 动力上限和下限留一个就足够区分了,取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)
total_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalePrice                 412698 non-null  float64       
 1   MachineID                 412698 non-null  int64         
 2   ModelID                   412698 non-null  int64         
 3   datasource                412698 non-null  int64         
 4   auctioneerID              392562 non-null  float64       
 5   YearMade                  412698 non-null  int64         
 6   MachineHoursCurrentMeter  147504 non-null  float64       
 7   UsageBand                 73670 non-null   object        
 8   SaleDate                  412698 non-null  datetime64[ns]
 9   fiModelDesc               412698 non-null  object        
 10  fiBaseModel               412698 non-null  object        
 11  fiSecondaryDesc           412698 non-null  object        
 12  fiModelSeries             412698 non-null  object        
 13  fiModelDescriptor         412698 non-null  object        
 14  ProductSize               196093 non-null  object        
 15  state                     412698 non-null  object        
 16  ProductGroupDesc          412698 non-null  object        
 17  Drive_System              107087 non-null  object        
 18  Enclosure                 412364 non-null  object        
 19  Forks                     197715 non-null  object        
 20  Pad_Type                  81096 non-null   object        
 21  Ride_Control              152728 non-null  object        
 22  Stick                     81096 non-null   object        
 23  Transmission              188007 non-null  object        
 24  Turbocharged              81096 non-null   object        
 25  Blade_Extension           25983 non-null   object        
 26  Blade_Width               25983 non-null   object        
 27  Enclosure_Type            25983 non-null   object        
 28  Engine_Horsepower         25983 non-null   object        
 29  Hydraulics                330133 non-null  object        
 30  Pushblock                 25983 non-null   object        
 31  Ripper                    106945 non-null  object        
 32  Scarifier                 25994 non-null   object        
 33  Tip_Control               25983 non-null   object        
 34  Tire_Size                 97638 non-null   object        
 35  Coupler                   220679 non-null  object        
 36  Coupler_System            44974 non-null   object        
 37  Grouser_Tracks            44875 non-null   object        
 38  Hydraulics_Flow           44875 non-null   object        
 39  Track_Type                102193 non-null  object        
 40  Undercarriage_Pad_Width   102916 non-null  object        
 41  Stick_Length              102261 non-null  object        
 42  Thumb                     102332 non-null  object        
 43  Pattern_Changer           102261 non-null  object        
 44  Grouser_Type              102193 non-null  object        
 45  Backhoe_Mounting          80712 non-null   object        
 46  Blade_Type                81875 non-null   object        
 47  Travel_Controls           81877 non-null   object        
 48  Differential_Type         71564 non-null   object        
 49  Steering_Controls         71522 non-null   object        
 50  fiProductClassDesc        412698 non-null  object        
 51  fiManufacturerID          412698 non-null  int64         
 52  PrimarySizeBasis          407439 non-null  object        
 53  PrimaryLower              407439 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB
raise KeyError
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [903], in <cell line: 1>()
----> 1 raise KeyError


KeyError: 

1.4 逐个查看特征

1.4.1 datasource

# 173怀疑是172写错成173
total_df['datasource'].value_counts()
132    260776
136     75491
149     33325
121     25191
172     17914
173         1
Name: datasource, dtype: int64

1.4.2 auctioneerID

# 后续要合并
total_df['auctioneerID'].value_counts()
1.0     192773
2.0      57441
3.0      30288
4.0      20877
99.0     12042
6.0      11950
7.0       7847
8.0       7419
5.0       7002
10.0      5876
9.0       4764
11.0      3823
12.0      3610
13.0      3068
18.0      2359
14.0      2277
20.0      2238
19.0      2074
16.0      1807
15.0      1742
21.0      1601
22.0      1429
24.0      1357
23.0      1322
17.0      1275
27.0      1150
25.0       959
28.0       860
26.0       796
0.0        536
Name: auctioneerID, dtype: int64
# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))
countmeanmedian
SalePriceSalePriceSalePrice
auctioneerID
22.0142918230.02099414000.0
18.0235918839.79949114500.0
21.0160119645.31542815000.0
25.095922478.51929116000.0
9.0476422515.00314916775.0
14.0227720804.79798017000.0
17.0127522974.94117618000.0
12.0361024762.93628818250.0
28.086028312.67441918750.0
20.0223824737.81143919250.0
99.01204226958.80611219500.0
13.0306827017.44133020500.0
16.0180726261.42778120500.0
27.0115027606.73913022500.0
4.02087729825.72687623000.0
23.0132230613.72919823000.0
2.05744129023.05167023000.0
10.0587629561.54697123000.0
0.053629979.85074623000.0
15.0174229986.33754323500.0
5.0700229150.21779523500.0
24.0135733041.15696424500.0
1.019277332684.87007525000.0
8.0741932477.97883825000.0
11.0382332707.08867425500.0
3.03028833596.96259226000.0
6.01195034708.07062827000.0
26.079636157.41206028000.0
7.0784736288.45023628500.0
19.0207442715.76663534000.0
# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()

请添加图片描述

1.4.3 YearMade

total_df['YearMade'] = total_df['YearMade'].astype(int)
total_df['YearMade'].value_counts()
1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64
plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()

请添加图片描述

第一台拖拉机1904年才发明出来,1904年前的属于异常数据。同时该数据集截至年份是2012年,大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。

生产时间大于销售时间的,暂时认为是提前订货,不处理。

1.4.4 MachineHoursCurrentMeter

# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()
0.0        73834
2000.0       124
1000.0       117
24.0         115
1500.0       101
           ...  
10834.0        1
3499.0         1
26270.0        1
26901.0        1
17920.0        1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64

1.4.5 UsageBand

# 查看枚举值
total_df['UsageBand'].value_counts()
Medium    35832
Low       25311
High      12527
Name: UsageBand, dtype: int64

1.4.6 fiBaseModel

total_df['fiBaseModel'].value_counts()
580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

1.4.7 fiSecondaryDesc

total_df['fiSecondaryDesc'].value_counts().head(10)
     136420
C     44658
B     40446
G     38139
H     24759
E     21944
D     20132
F      9454
K      8089
A      5968
Name: fiSecondaryDesc, dtype: int64
temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
fiSecondaryDesc
XT217729736.81672023000.0
LE251031586.06573723000.0
R357731164.43584024000.0
N384431028.67455823500.0
SUPER L392031237.16836724000.0
P447532841.13407826000.0
J465531069.47089224000.0
LC486231265.12793124000.0
M539832225.97628824500.0
L566931634.05362524500.0
A596831266.45777524000.0
K808930815.39139623000.0
F945431958.08038925000.0
D2013230766.16764424000.0
E2194431089.64035724000.0
H2475931279.81105924000.0
G3813930902.04020023500.0
B4044631238.50853024000.0
C4465831024.10369924000.0
13642031426.03821324000.0

1.4.8 fiModelSeries

total_df['fiModelSeries'].value_counts().head(10)
       350453
II      14039
LC       9609
III      5392
-1       5142
-2       4350
-6       3538
-3       2783
-5       2664
-12      1447
Name: fiModelSeries, dtype: int64

1.4.9 fiModelDescriptor

total_df['fiModelDescriptor'].value_counts().head(10)
       333040
L       16676
LGP     16541
LC      15666
XL       6704
6        3238
LT       2681
5        2455
3        2149
CR       1798
Name: fiModelDescriptor, dtype: int64
temp = pd.pivot_table(total_df, index='fiModelDescriptor', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
fiModelDescriptor
K34929030.37249322000.0
835133369.80057027000.0
E35934427.57660225500.0
SR38328176.82767620000.0
ZTS44228579.75113121250.0
Z52527814.19047621000.0
262732085.04784726000.0
SSR68433503.50877227000.0
H109230992.97161224000.0
7111530983.37399124000.0
CR179828777.15795322000.0
3214931477.17543024000.0
5245530897.66802424000.0
LT268128216.72920621000.0
6323830573.30512724000.0
XL670430564.90170023000.0
LC1566630387.23298923000.0
LGP1654131327.26703324000.0
L1667630739.47103624000.0
33304031340.13789124000.0

1.4.10 ProductSize

total_df['ProductSize'].fillna('').value_counts()
                  216605
Medium             64342
Large / Medium     51297
Small              27057
Mini               25721
Large              21396
Compact             6280
Name: ProductSize, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='ProductSize', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
ProductSize
Compact628017498.57882215000.0
Large2139642023.43331534000.0
Mini2572115194.64772013500.0
Small2705732511.16406129000.0
Large / Medium5129747828.92323144000.0
Medium6434245703.87347341000.0
21660524047.38340819000.0

1.4.11 state

# 查看数据分布
total_df['state'].fillna('').value_counts()
Florida           67320
Texas             53110
California        29761
Washington        16222
Georgia           14633
Maryland          13322
Mississippi       13240
Ohio              12369
Illinois          11540
Colorado          11529
New Jersey        11156
North Carolina    10636
Tennessee         10298
Alabama           10292
Pennsylvania      10234
South Carolina     9951
Arizona            9364
New York           8639
Connecticut        8276
Minnesota          7885
Missouri           7178
Nevada             6932
Louisiana          6627
Kentucky           5351
Maine              5096
Indiana            4124
Arkansas           3933
New Mexico         3631
Utah               3046
Unspecified        2801
Wisconsin          2745
New Hampshire      2738
Virginia           2353
Idaho              2025
Oregon             1911
Michigan           1831
Wyoming            1672
Montana            1336
Iowa               1336
Oklahoma           1326
Nebraska            866
West Virginia       840
Kansas              667
Delaware            510
North Dakota        480
Alaska              430
Massachusetts       347
Vermont             300
South Dakota        244
Hawaii              118
Rhode Island         83
Puerto Rico          42
Washington DC         2
Name: state, dtype: int64
# 查看州和价格的关系,有关系
temp = pd.pivot_table(total_df, index='state', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
state
Minnesota788527199.05516819500.0
Connecticut827629008.80860323500.0
New York863925582.23752720000.0
Arizona936431562.85775323000.0
South Carolina995129848.89458322000.0
Pennsylvania1023425463.00273619000.0
Alabama1029235438.54148928000.0
Tennessee1029831845.35735124000.0
North Carolina1063632161.77369325000.0
New Jersey1115630982.18886724500.0
Colorado1152931777.74828725000.0
Illinois1154029091.58145622500.0
Ohio1236928228.10049322000.0
Mississippi1324032574.59214525000.0
Maryland1332228621.68218022000.0
Georgia1463332265.55046824000.0
Washington1622227690.73172221500.0
California2976129815.20271522500.0
Texas5311032977.19034125000.0
Florida6732034387.51277527000.0
# 进一步按价格排序查看
temp.sort_values(by=[('median', 'SalePrice'), ('mean', 'SalePrice')])
countmeanmedian
SalePriceSalePriceSalePrice
state
Indiana412424400.35766219000.0
Pennsylvania1023425463.00273619000.0
Maine509626176.94270019500.0
Minnesota788527199.05516819500.0
New York863925582.23752720000.0
Kansas66728093.40329820000.0
Puerto Rico4226011.90476220250.0
Wisconsin274527656.24772321000.0
Idaho202529263.23456821000.0
Washington1622227690.73172221500.0
Virginia235328798.64598421500.0
Ohio1236928228.10049322000.0
Maryland1332228621.68218022000.0
Michigan183129003.92364822000.0
Missouri717829046.00069722000.0
Kentucky535129815.38030322000.0
South Carolina995129848.89458322000.0
Vermont30027285.33333322250.0
Arkansas393328906.89397422500.0
Illinois1154029091.58145622500.0
California2976129815.20271522500.0
Washington DC222750.00000022750.0
Massachusetts34728382.42074923000.0
New Hampshire273828928.74360823000.0
Oregon191130277.59026723000.0
Delaware51031160.09803923000.0
Arizona936431562.85775323000.0
Connecticut827629008.80860323500.0
Louisiana662730201.89376823500.0
Tennessee1029831845.35735124000.0
Iowa133631927.54491024000.0
Oklahoma132632258.71040724000.0
Georgia1463332265.55046824000.0
Montana133632616.65419224000.0
Nebraska86632612.06697524250.0
New Jersey1115630982.18886724500.0
Colorado1152931777.74828725000.0
North Carolina1063632161.77369325000.0
Mississippi1324032574.59214525000.0
Wyoming167232604.42583725000.0
Texas5311032977.19034125000.0
Hawaii11828879.23728826000.0
Alaska43033281.97674426000.0
New Mexico363133632.64940826000.0
Utah304634190.54793227000.0
Florida6732034387.51277527000.0
Nevada693236332.09751927000.0
Unspecified280134857.71153228000.0
Alabama1029235438.54148928000.0
North Dakota48039083.75000029750.0
West Virginia84040258.75000033000.0
Rhode Island8337622.28915734000.0
South Dakota24443907.37704935000.0

1.4.12 ProductGroupDesc

# 查看数据分布
total_df['ProductGroupDesc'].fillna('').value_counts()
Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64
# 查看和价格的关系
temp = pd.pivot_table(total_df, index='ProductGroupDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
ProductGroupDesc
Motor Graders2625847561.95742240000.0
Skid Steer Loaders4501110583.94401410000.0
Wheel Loader7321637259.29242232000.0
Backhoe Loaders8140120951.58232720500.0
Track Type Tractors8258236280.13278029500.0
Track Excavators10423035763.45701829000.0

1.4.13 Drive_System

total_df['Drive_System'].fillna('').value_counts()
                    305611
Two Wheel Drive      47546
Four Wheel Drive     33551
No                   25166
All Wheel Drive        824
Name: Drive_System, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Drive_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Drive_System
All Wheel Drive82460455.27912659000.0
No2516647342.72145040000.0
Four Wheel Drive3355124534.14941424000.0
Two Wheel Drive4754618418.02687917500.0
30561132532.70369325000.0

1.4.14 Enclosure

total_df['Enclosure'].fillna('').value_counts()
OROPS                  177971
EROPS                  141769
EROPS w AC              92601
                          334
EROPS AC                   18
NO ROPS                     3
None or Unspecified         2
Name: Enclosure, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Enclosure', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Enclosure
None or Unspecified216500.00000016500.0
NO ROPS344333.33333342500.0
EROPS AC1823500.00000020500.0
33427689.44610825000.0
EROPS w AC9260151671.16972847000.0
EROPS14176928687.99327823000.0
OROPS17797122592.08273918000.0

1.4.15 Forks

total_df['Forks'].fillna('').value_counts()
                       214983
None or Unspecified    183061
Yes                     14654
Name: Forks, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Forks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Forks
Yes1465436761.35471531500.0
None or Unspecified18306123593.41418418500.0
21498337327.17495430000.0

1.4.16 Pad_Type

total_df['Pad_Type'].fillna('').value_counts()
                       331602
None or Unspecified     72395
Reversible               5950
Street                   2725
Grouser                    26
Name: Pad_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pad_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Pad_Type
Grouser2630289.42307730500.0
Street272524995.06422025000.0
Reversible595027344.03361327500.0
None or Unspecified7239520267.12035420000.0
33160233725.99889726000.0

1.4.17 Ride_Control

total_df['Ride_Control'].fillna('').value_counts()
                       259970
No                      79389
None or Unspecified     64693
Yes                      8646
Name: Ride_Control, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Ride_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Ride_Control
Yes864655278.26914253000.0
None or Unspecified6469334834.65002429000.0
No7938920765.72010020000.0
25997032705.23236225000.0

1.4.18 Stick

total_df['Stick'].fillna('').value_counts()
            331602
Standard     49854
Extended     31242
Name: Stick, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Stick', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Stick
Extended3124223257.32091423000.0
Standard4985419501.52511319000.0
33160233725.99889726000.0

1.4.19 Transmission

total_df['Transmission'].fillna('').value_counts()
                       224691
Standard               143915
None or Unspecified     23889
Powershift              11991
Powershuttle             4286
Hydrostatic              3342
Direct Drive              422
Autoshift                 118
AutoShift                  44
Name: Transmission, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Transmission', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Transmission
AutoShift44102261.36363695000.0
Autoshift11833141.94915333250.0
Direct Drive42218675.82938411550.0
Hydrostatic334234705.81418332000.0
Powershuttle428619264.93093818000.0
Powershift1199138574.35293129500.0
None or Unspecified2388947354.81296840000.0
Standard14391528364.13221823000.0
22469131117.25384223000.0

1.4.20 Turbocharged

total_df['Turbocharged'].fillna('').value_counts()
                       331602
None or Unspecified     77111
Yes                      3985
Name: Turbocharged, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Turbocharged', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Turbocharged
Yes398524144.30037624000.0
None or Unspecified7711120783.27626420000.0
33160233725.99889726000.0

1.4.21 Blade_Extension

total_df['Blade_Extension'].fillna('').value_counts()
                       386715
None or Unspecified     25406
Yes                       577
Name: Blade_Extension, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Extension', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Blade_Extension
Yes57766587.69497465000.0
None or Unspecified2540647337.07502240000.0
38671530103.24427923500.0

1.4.22 Blade_Width

total_df['Blade_Width'].fillna('').value_counts()
                       386715
14'                      9867
None or Unspecified      9521
12'                      5201
16'                       960
13'                       335
<12'                       99
Name: Blade_Width, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Blade_Width
<12'9927764.64646521000.0
13'33527125.52238818500.0
16'96059882.11979252750.0
12'520136000.85214429000.0
None or Unspecified952141899.46014132000.0
14'986759347.22316855000.0
38671530103.24427923500.0

1.4.23 Enclosure_Type

total_df['Enclosure_Type'].fillna('').value_counts()
                       386715
None or Unspecified     22469
Low Profile              2675
High Profile              839
Name: Enclosure_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Enclosure_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Enclosure_Type
High Profile83981897.43742680000.0
Low Profile267581444.65831880000.0
None or Unspecified2246942480.32475935000.0
38671530103.24427923500.0

1.4.24 Engine_Horsepower

total_df['Engine_Horsepower'].fillna('').value_counts()
            386715
No           24642
Variable      1341
Name: Engine_Horsepower, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Engine_Horsepower', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Engine_Horsepower
Variable134190245.15287188000.0
No2464245452.80732138000.0
38671530103.24427923500.0

1.4.24 Hydraulics

total_df['Hydraulics'].fillna('').value_counts()
2 Valve                145317
Standard               106515
                        82565
Auxiliary               43224
Base + 1 Function       25511
3 Valve                  5807
4 Valve                  3077
Base + 3 Function         311
Base + 2 Function         132
Base + 5 Function          94
Base + 4 Function          81
Base + 6 Function          54
None or Unspecified        10
Name: Hydraulics, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Hydraulics
None or Unspecified1027400.00000026250.0
Base + 6 Function5468333.33333369500.0
Base + 4 Function8195253.08642092500.0
Base + 5 Function9475601.06383071000.0
Base + 2 Function13289424.24242489500.0
Base + 3 Function31184141.80064379000.0
4 Valve307756652.55118651000.0
3 Valve580744479.30945440000.0
Base + 1 Function2551146637.43397039000.0
Auxiliary4322425076.74678416000.0
8256521028.90029720500.0
Standard10651529391.41013920500.0
2 Valve14531736145.19639330000.0

1.4.25 Pushblock

total_df['Pushblock'].fillna('').value_counts()
                       386715
None or Unspecified     20017
Yes                      5966
Name: Pushblock, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pushblock', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Pushblock
Yes596668305.12772465000.0
None or Unspecified2001741642.52565333000.0
38671530103.24427923500.0

1.4.26 Ripper

total_df['Ripper'].fillna('').value_counts()
                       305753
None or Unspecified     85405
Yes                      8185
Multi Shank              8071
Single Shank             5284
Name: Ripper, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Ripper', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Ripper
Single Shank528450139.36638942000.0
Multi Shank807148952.39747241000.0
Yes818563677.68307960000.0
None or Unspecified8540535253.46648628500.0
30575328422.90210122000.0

1.4.27 Scarifier

total_df['Scarifier'].fillna('').value_counts()
                       386704
None or Unspecified     13033
Yes                     12961
Name: Scarifier, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Scarifier', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Scarifier
Yes1296149523.26641540000.0
None or Unspecified1303345996.79828140000.0
38670430103.37522023500.0

1.4.28 Tip_Control

total_df['Tip_Control'].fillna('').value_counts()
                       386715
None or Unspecified     16832
Sideshift & Tip          7164
Tip                      1987
Name: Tip_Control, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Tip_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Tip_Control
Tip198762269.26572757000.0
Sideshift & Tip716446739.03908441000.0
None or Unspecified1683246488.79045937500.0
38671530103.24427923500.0

1.4.29 Tire_Size

total_df['Tire_Size'].fillna('').value_counts()
                       315060
None or Unspecified     47823
20.5                    15773
14"                      9111
23.5                     8760
26.5                     4635
17.5                     3971
29.5                     2767
17.5"                    1815
13"                       776
20.5"                     737
15.5                      610
15.5"                     463
23.5"                     309
7.0"                       56
23.1"                      20
10"                         9
10 inch                     3
Name: Tire_Size, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Tire_Size', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Tire_Size
10 inch316333.33333315500.0
10"931166.66666725000.0
23.1"2014837.50000012125.0
7.0"5627625.00000025500.0
23.5"30956487.94498451000.0
15.5"46347401.72786240000.0
15.561017153.44262315000.0
20.5"73774012.63229368000.0
13"77621085.37371116000.0
17.5"181567478.65013865000.0
29.5276746650.81315543000.0
17.5397125344.85645922500.0
26.5463550903.01186650000.0
23.5876047236.34703244000.0
14"911150506.22686944000.0
20.51577341099.19799738000.0
None or Unspecified4782335357.00821827000.0
31506028433.53593722000.0

1.4.30 Coupler

total_df['Coupler'].fillna('').value_counts()
                       192019
None or Unspecified    190449
Manual                  23918
Hydraulic                6312
Name: Coupler, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Coupler', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Coupler
Hydraulic631246839.72908743000.0
Manual2391833493.82987727000.0
None or Unspecified19044930392.94884722500.0
19201931233.25520524000.0

1.4.31 Coupler_System

total_df['Coupler_System'].fillna('').value_counts()
                       367724
None or Unspecified     41727
Yes                      3247
Name: Coupler_System, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Coupler_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Coupler_System
Yes324711593.70187911500.0
None or Unspecified4172710504.80993110000.0
36772433738.52124226000.0

1.4.32 Grouser_Tracks

total_df['Grouser_Tracks'].fillna('').value_counts()
                       367823
None or Unspecified     41820
Yes                      3055
Name: Grouser_Tracks, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Tracks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Grouser_Tracks
Yes305512871.26841213000.0
None or Unspecified4182010410.9511969750.0
36782333732.89662526000.0

1.4.33 Hydraulics_Flow

total_df['Hydraulics_Flow'].fillna('').value_counts()
                       367823
Standard                44251
High Flow                 597
None or Unspecified        27
Name: Hydraulics_Flow, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics_Flow', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Hydraulics_Flow
None or Unspecified2715053.70370415000.0
High Flow59713183.16582913000.0
Standard4425110540.57318510000.0
36782333732.89662526000.0

1.4.34 Track_Type

total_df['Track_Type'].fillna('').value_counts()
          310505
Steel      87463
Rubber     14730
Name: Track_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Track_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Track_Type
Rubber1473015823.90590613500.0
Steel8746339245.89473334000.0
31050529683.23574222500.0

1.4.35 Undercarriage_Pad_Width

total_df['Undercarriage_Pad_Width'].fillna('').value_counts()
                       309782
None or Unspecified     82444
32 inch                  5287
28 inch                  3152
24 inch                  2998
20 inch                  2664
30 inch                  1602
36 inch                  1544
18 inch                  1439
34 inch                   540
16 inch                   481
31 inch                   191
27 inch                   144
22 inch                   135
26 inch                    98
33 inch                    94
14 inch                    51
15 inch                    33
25 inch                    17
31.5 inch                   2
Name: Undercarriage_Pad_Width, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Undercarriage_Pad_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Undercarriage_Pad_Width
31.5 inch2108000.000000108000.0
25 inch1722591.17647120000.0
15 inch3316803.03030315000.0
14 inch5116869.60784314500.0
33 inch9451848.93617047500.0
26 inch9830831.63265328750.0
22 inch13533398.51851926500.0
27 inch14433826.04166728625.0
31 inch19148378.79581244500.0
16 inch48120531.70270317500.0
34 inch54061289.72222260000.0
18 inch143920302.71021518000.0
36 inch154446214.29728041000.0
30 inch160236943.00873930000.0
20 inch266429552.34609627500.0
24 inch299837054.41961331500.0
28 inch315237732.44765233500.0
32 inch528748901.32400245500.0
None or Unspecified8244435017.39151428000.0
30978229688.37073922500.0

1.4.36 Stick_Length

total_df['Stick_Length'].fillna('').value_counts()
                       310437
None or Unspecified     81539
9' 6"                    5832
10' 6"                   3519
11' 0"                   1601
9' 10"                   1463
9' 8"                    1462
9' 7"                    1423
12' 10"                  1087
10' 2"                   1004
8' 6"                     908
8' 2"                     614
10' 10"                   414
12' 8"                    322
11' 10"                   307
8' 4"                     274
8' 10"                    104
12' 4"                    103
9' 5"                     101
15' 9"                     87
6' 3"                      51
13' 7"                     11
14' 1"                      7
13' 10"                     7
13' 9"                      7
19' 8"                      5
7' 10"                      3
15' 4"                      3
24' 3"                      2
9' 2"                       1
Name: Stick_Length, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Stick_Length', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Stick_Length
15' 9"8757068.96551755000.0
9' 5"10146074.25742646000.0
12' 4"10347505.33980640000.0
8' 10"10443937.50000042750.0
8' 4"27433362.22627728000.0
11' 10"30748237.78501642500.0
12' 8"32258167.70186356000.0
10' 10"41460149.27536257000.0
8' 2"61433933.30618931000.0
8' 6"90836311.94933931000.0
10' 2"100445855.22908444000.0
12' 10"108766730.17479366000.0
9' 7"142356526.52846155000.0
9' 8"146249643.29685447000.0
9' 10"146339891.14832536500.0
11' 0"160148882.15178045000.0
10' 6"351953677.64279650000.0
9' 6"583246655.84705142500.0
None or Unspecified8153932556.65969726000.0
31043729683.17038322500.0

1.4.37 Thumb

total_df['Thumb'].fillna('').value_counts()
                       310366
None or Unspecified     85074
Manual                   9678
Hydraulic                7580
Name: Thumb, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Thumb', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Thumb
Hydraulic758038316.01464433000.0
Manual967839451.08131834000.0
None or Unspecified8507435234.45904728000.0
31036629683.22436822500.0

1.4.38 Pattern_Changer

total_df['Pattern_Changer'].fillna('').value_counts()
                       310437
None or Unspecified     92924
Yes                      9269
No                         68
Name: Pattern_Changer, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pattern_Changer', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Pattern_Changer
No6829981.61764727500.0
Yes926952301.26108552000.0
None or Unspecified9292434230.87077628000.0
31043729683.17038322500.0

1.4.39 Grouser_Type

total_df['Grouser_Type'].fillna('').value_counts()
          310505
Double     86998
Triple     15193
Single         2
Name: Grouser_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Grouser_Type
Single252000.00000052000.0
Triple1519342755.05594737500.0
Double8699834667.09878428000.0
31050529683.23574222500.0

1.4.40 Backhoe_Mounting

total_df['Backhoe_Mounting'].fillna('').value_counts()
                       331986
None or Unspecified     80692
Yes                        20
Name: Backhoe_Mounting, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Backhoe_Mounting', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Backhoe_Mounting
Yes2016462.50000015500.0
None or Unspecified8069236486.29015530000.0
33198629934.88268822500.0

1.4.41 Blade_Type

total_df['Blade_Type'].fillna('').value_counts()
                       330823
PAT                     39633
Straight                13461
None or Unspecified     11841
Semi U                   8907
VPAT                     3681
U                        1888
Angle                    1684
No                        743
Landfill                   26
Coal                       11
Name: Blade_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Blade_Type
Coal1150227.27272748000.0
Landfill2669942.30769263750.0
No74325569.07806221000.0
Angle168432742.92399025500.0
U188844536.49364435000.0
VPAT368162566.14371159000.0
Semi U890758495.60738755000.0
None or Unspecified1184131077.75214925000.0
Straight1346135548.99012028000.0
PAT3963330679.54411727000.0
33082329949.80635922500.0

1.4.42 Travel_Controls

total_df['Travel_Controls'].fillna('').value_counts()
                       330821
None or Unspecified     71447
Differential Steer       5257
Finger Tip               2693
2 Pedal                  1144
Lever                     902
Pedal                     423
1 Speed                    11
Name: Travel_Controls, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Travel_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Travel_Controls
1 Speed1115672.72727311500.0
Pedal42324667.96690322500.0
Lever90233187.90022226000.0
2 Pedal114425588.00262221000.0
Finger Tip269357974.05310155000.0
Differential Steer525768752.64884967500.0
None or Unspecified7144733407.34410427500.0
33082129950.38559822500.0

1.4.43 Differential_Type

total_df['Differential_Type'].fillna('').value_counts()
                341134
Standard         70169
Limited Slip      1181
No Spin            212
Locking              2
Name: Differential_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Differential_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Differential_Type
Locking266000.00000066000.0
No Spin21252889.38679251475.0
Limited Slip118157566.55376857000.0
Standard7016937061.72995232000.0
34113429907.68366722500.0

1.4.44 Steering_Controls

total_df['Steering_Controls'].fillna('').value_counts()
                       341176
Conventional            70774
Command Control           594
Four Wheel Standard       139
Wheel                      14
No                          1
Name: Steering_Controls, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Steering_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
Steering_Controls
No117500.00000017500.0
Wheel1421517.85714317500.0
Four Wheel Standard13924658.27338122000.0
Command Control59474774.41077475000.0
Conventional7077437168.65909832000.0
34117629907.45541922500.0

1.4.45 fiManufacturerDesc

total_df['fiManufacturerID'].fillna('').value_counts()
26      169003
43       74527
25       42142
103      38928
121      25033
         ...  
5            2
923          1
1518         1
112          1
525          1
Name: fiManufacturerID, Length: 104, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='fiManufacturerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
fiManufacturerID
405107010707.6915899500.0
129111714373.20949010000.0
86117834583.19185130000.0
46134417102.77157715000.0
158137830334.21625526000.0
95162834679.65294831000.0
166190623662.64428120000.0
135255318356.46259314500.0
54258518324.77640215000.0
750330512966.82299511500.0
176422443841.87263340000.0
99508733323.88441128000.0
92526021678.64543718500.0
55590213283.74703511750.0
741129133328.46293527000.0
1212503310637.49506710000.0
1033892833710.58798827500.0
254214221210.95258918000.0
437452727908.81508723000.0
2616900340321.71042533000.0

1.4.46 PrimarySizeBasis

total_df['PrimarySizeBasis'].fillna('').value_counts()
Horsepower                     180262
Weight - Metric Tons           105019
Standard Digging Depth - Ft     77848
Operating Capacity - Lbs        44226
                                 5259
Model                              78
Weight - Metric                     4
Weight - Lbs                        1
Cutting Width - Inches              1
Name: PrimarySizeBasis, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='PrimarySizeBasis', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
countmeanmedian
SalePriceSalePriceSalePrice
PrimarySizeBasis
Cutting Width - Inches123000.00000023000.0
Weight - Lbs127500.00000027500.0
Weight - Metric422562.50000022750.0
Model7825535.25641017750.0
525915941.91861614000.0
Operating Capacity - Lbs4422610612.49387210000.0
Standard Digging Depth - Ft7784821202.52707821000.0
Weight - Metric Tons10501935654.97314829000.0
Horsepower18026238455.69106232000.0

1.4.47 PrimaryLower

total_df['PrimaryLower'].fillna('').value_counts()
14.0       62037
20.0       17943
130.0      17113
150.0      15322
85.0       15054
           ...  
1.8            3
4.5            2
25000.0        1
2.7            1
300.0          1
Name: PrimaryLower, Length: 76, dtype: int64

**总结:**发现的规律如下:

  1. 从EXCEL表中可以明显看出部分特征是要么一起出现,要么一起空,这些可以在创建衍生特征时做成交叉特征。

  2. 部分特征虽然是object字段,但是能排出顺序,比如“Low、Medium、High”等,和‘xx inch’等,后面处理成定序变量。

  3. 空值都单独用一列标记。

  4. 没有特殊规律的类别特征统一用Label Encoder编码。

  5. 与价格的透视表可以为分箱提供依据。

EDA到此堂堂完结!!!!!!!!!!!!!!!!!!!!!!!

2. 数据清洗+数据预处理

2.1 创建衍生变量

2.1.1 SaleDate

def to_sin(n, i):
    return round(np.sin((2*np.pi/n)*(i-1)+(2*np.pi/7)) + 1, 2)
total_df['SaleYear'] = total_df['SaleDate'].dt.year
total_df['SaleMonth'] = total_df['SaleDate'].dt.month
total_df['SaleDay'] = total_df['SaleDate'].dt.day
total_df['SaleDayOfWeek'] = total_df['SaleDate'].dt.dayofweek
total_df['SaleDayOfYear'] = total_df['SaleDate'].dt.dayofyear
# 删除原来的SaleDate特征
total_df.drop('SaleDate', axis=1, inplace=True)
# 尝试余弦化拉近1月和12月的距离
total_df['SaleDayOfWeek_sin'] = total_df['SaleDayOfWeek'].map(lambda x: to_sin(7, x)).value_counts()
total_df['SaleMonth_sin'] = total_df['SaleMonth'].map(lambda x: to_sin(12, x)).value_counts()

2.1.2 Stick、Turbocharged

def combine_features(df, col_list):
    temp = df[col_list[0]].astype(str)
    for col in col_list[1:]:
        temp += df[col].astype(str)
    return temp
total_df['Stick__Turbocharged'] = combine_features(total_df, ['Stick', 'Turbocharged'])
total_df['Stick__Turbocharged'].value_counts()
nannan                         331602
StandardNone or Unspecified     47981
ExtendedNone or Unspecified     29130
ExtendedYes                      2112
StandardYes                      1873
Name: Stick__Turbocharged, dtype: int64

2.1.3 Blade_Extension、Blade_Width、Enclosure_Type、Engine_Horsepower

total_df['Blade__Blade__Enclosure__Engine'] = combine_features(total_df, ['Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower'])
total_df['Blade__Blade__Enclosure__Engine'].value_counts()
nannannannan                                                         386715
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedNo            8572
None or Unspecified14'None or UnspecifiedNo                            7307
None or Unspecified12'None or UnspecifiedNo                            4466
None or Unspecified14'Low ProfileNo                                    1088
None or Unspecified16'None or UnspecifiedNo                             818
None or UnspecifiedNone or UnspecifiedLow ProfileNo                     435
None or Unspecified12'Low ProfileNo                                     416
None or Unspecified14'Low ProfileVariable                               377
None or Unspecified14'High ProfileNo                                    348
None or Unspecified14'None or UnspecifiedVariable                       343
None or Unspecified13'None or UnspecifiedNo                             313
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedVariable       190
Yes14'None or UnspecifiedNo                                             126
None or Unspecified14'High ProfileVariable                              122
None or UnspecifiedNone or UnspecifiedHigh ProfileNo                    112
None or Unspecified<12'None or UnspecifiedNo                             99
None or Unspecified12'High ProfileNo                                     96
YesNone or UnspecifiedNone or UnspecifiedNo                              86
Yes12'None or UnspecifiedNo                                              84
None or UnspecifiedNone or UnspecifiedLow ProfileVariable                74
None or Unspecified16'Low ProfileNo                                      58
Yes14'Low ProfileNo                                                      55
Yes14'Low ProfileVariable                                                50
Yes12'Low ProfileNo                                                      47
None or Unspecified12'High ProfileVariable                               43
None or Unspecified16'High ProfileNo                                     34
None or UnspecifiedNone or UnspecifiedHigh ProfileVariable               27
None or Unspecified12'Low ProfileVariable                                21
Yes14'High ProfileVariable                                               21
Yes14'None or UnspecifiedVariable                                        18
None or Unspecified12'None or UnspecifiedVariable                        17
None or Unspecified13'Low ProfileNo                                      16
Yes16'Low ProfileNo                                                      15
Yes14'High ProfileNo                                                     12
Yes16'None or UnspecifiedNo                                              10
YesNone or UnspecifiedLow ProfileNo                                       9
YesNone or UnspecifiedNone or UnspecifiedVariable                         9
Yes16'High ProfileNo                                                      6
Yes12'High ProfileNo                                                      6
YesNone or UnspecifiedLow ProfileVariable                                 4
None or Unspecified16'High ProfileVariable                                4
None or Unspecified16'None or UnspecifiedVariable                         4
None or Unspecified16'Low ProfileVariable                                 4
Yes13'None or UnspecifiedNo                                               4
Yes16'Low ProfileVariable                                                 3
Yes12'Low ProfileVariable                                                 3
YesNone or UnspecifiedHigh ProfileNo                                      3
Yes12'High ProfileVariable                                                2
Yes16'High ProfileVariable                                                2
Yes16'None or UnspecifiedVariable                                         2
None or Unspecified13'None or UnspecifiedVariable                         1
None or Unspecified13'High ProfileNo                                      1
Name: Blade__Blade__Enclosure__Engine, dtype: int64

2.1.4 Pushblock、Scarifier、Tip_Control

total_df['Pushblock__Scarifier__TipControl'] = combine_features(total_df, ['Pushblock', 'Scarifier', 'Tip_Control'])
total_df['Pushblock__Scarifier__TipControl'].value_counts()
nannannan                                                    386704
None or UnspecifiedYesNone or Unspecified                      6742
None or UnspecifiedNone or UnspecifiedNone or Unspecified      6228
None or UnspecifiedYesSideshift & Tip                          3088
YesNone or UnspecifiedNone or Unspecified                      2560
None or UnspecifiedNone or UnspecifiedSideshift & Tip          2401
YesYesNone or Unspecified                                      1302
YesNone or UnspecifiedSideshift & Tip                          1182
None or UnspecifiedYesTip                                      1103
YesYesSideshift & Tip                                           493
None or UnspecifiedNone or UnspecifiedTip                       455
YesYesTip                                                       226
YesNone or UnspecifiedTip                                       203
nanYesnan                                                         7
nanNone or Unspecifiednan                                         4
Name: Pushblock__Scarifier__TipControl, dtype: int64

2.1.5 Coupler_System、Grouser_Tracks、Hydraulics_Flow

total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'] = combine_features(total_df, ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])
total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'].value_counts()
nannannan                                                    367724
None or UnspecifiedNone or UnspecifiedStandard                39040
YesNone or UnspecifiedStandard                                 2289
None or UnspecifiedYesStandard                                 2149
YesYesStandard                                                  773
None or UnspecifiedNone or UnspecifiedHigh Flow                 364
YesNone or UnspecifiedHigh Flow                                 122
None or Unspecifiednannan                                        91
None or UnspecifiedYesHigh Flow                                  57
YesYesHigh Flow                                                  54
None or UnspecifiedYesNone or Unspecified                        22
Yesnannan                                                         8
None or UnspecifiedNone or UnspecifiedNone or Unspecified         4
YesNone or UnspecifiedNone or Unspecified                         1
Name: CouplerSystem__GrouserTracks__HydraulicsFlow, dtype: int64

2.1.6 Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type

col_list = ('Track_Type	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type').split('\t')
total_df['TUSTPG'] = combine_features(total_df, col_list)
total_df['TUSTPG'].value_counts()
nannannannannannan                                                                          309643
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble      38336
RubberNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble     10031
SteelNone or UnspecifiedNone or UnspecifiedManualNone or UnspecifiedDouble                    3900
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedTriple       3214
                                                                                             ...  
Steel32 inch8' 2"ManualNone or UnspecifiedDouble                                                 1
Steel30 inch12' 10"ManualNone or UnspecifiedTriple                                               1
Steel20 inch8' 4"ManualNone or UnspecifiedTriple                                                 1
Steel32 inch8' 4"None or UnspecifiedNone or UnspecifiedTriple                                    1
Rubber30 inchNone or UnspecifiedHydraulicNone or UnspecifiedDouble                               1
Name: TUSTPG, Length: 1065, dtype: int64

2.1.7 Backhoe_Mounting Blade_Type Travel_Controls

col_list = ('Backhoe_Mounting	Blade_Type	Travel_Controls').split('\t')
total_df['BBT'] = combine_features(total_df, col_list)
total_df['BBT'].value_counts()
nannannan                                                    330802
None or UnspecifiedPATNone or Unspecified                     37249
None or UnspecifiedStraightNone or Unspecified                11990
None or UnspecifiedNone or UnspecifiedNone or Unspecified     11106
None or UnspecifiedSemi UNone or Unspecified                   6582
None or UnspecifiedSemi UDifferential Steer                    2122
None or UnspecifiedVPATFinger Tip                              1857
None or UnspecifiedUNone or Unspecified                        1816
None or UnspecifiedAngleNone or Unspecified                    1560
None or UnspecifiedVPATNone or Unspecified                     1092
None or UnspecifiedPATDifferential Steer                       1011
None or UnspecifiedStraightDifferential Steer                   913
nanNo2 Pedal                                                    728
None or UnspecifiedVPATDifferential Steer                       708
None or UnspecifiedPATFinger Tip                                584
nanStraight2 Pedal                                              416
None or UnspecifiedPATLever                                     412
None or UnspecifiedPATPedal                                     370
None or UnspecifiedNone or UnspecifiedDifferential Steer        327
None or UnspecifiedNone or UnspecifiedLever                     285
None or UnspecifiedAngleDifferential Steer                      109
None or UnspecifiedSemi UFinger Tip                              97
None or UnspecifiedSemi ULever                                   96
None or UnspecifiedNone or UnspecifiedFinger Tip                 77
None or UnspecifiedStraightFinger Tip                            76
None or UnspecifiedStraightLever                                 56
None or UnspecifiedUDifferential Steer                           54
None or UnspecifiedNone or UnspecifiedPedal                      33
None or UnspecifiedVPATLever                                     24
nanNonan                                                         15
None or UnspecifiedLandfillNone or Unspecified                   14
None or UnspecifiedULever                                        14
YesNone or UnspecifiedNone or Unspecified                        13
None or UnspecifiedAngleLever                                    13
None or UnspecifiedLandfillDifferential Steer                    12
nannan1 Speed                                                    11
nannanNone or Unspecified                                        10
None or UnspecifiedSemi UPedal                                   10
None or UnspecifiedCoalNone or Unspecified                        8
YesPATNone or Unspecified                                         7
None or UnspecifiedStraightPedal                                  7
nanStraightnan                                                    3
None or UnspecifiedUPedal                                         2
None or UnspecifiedCoalLever                                      2
None or UnspecifiedAnglePedal                                     1
nanUnan                                                           1
None or UnspecifiedUFinger Tip                                    1
None or UnspecifiedAngleFinger Tip                                1
None or UnspecifiedCoalDifferential Steer                         1
Name: BBT, dtype: int64

2.1.8 Differential_Type Steering_Controls

col_list = ('Differential_Type	Steering_Controls').split('\t')
total_df['DS'] = combine_features(total_df, col_list)
total_df['DS'].value_counts()
nannan                         341120
StandardConventional            69411
Limited SlipConventional         1154
StandardCommand Control           564
No SpinConventional               209
StandardFour Wheel Standard       139
Standardnan                        54
Limited SlipCommand Control        27
nanWheel                           14
No SpinCommand Control              3
Lockingnan                          2
StandardNo                          1
Name: DS, dtype: int64

2.1.9 PrimarySizeBasis PrimaryLower

col_list = ('PrimarySizeBasis	PrimaryLower').split('\t')
total_df['PP'] = combine_features(total_df, col_list)
total_df['PP'].value_counts()
Standard Digging Depth - Ft14.0    57395
Horsepower20.0                     17941
Horsepower130.0                    17113
Horsepower150.0                    15299
Horsepower85.0                     15054
                                   ...  
Weight - Metric Tons20.0               2
Weight - Lbs25000.0                    1
Cutting Width - Inches0.0              1
Weight - Metric Tons2.7                1
Weight - Metric Tons300.0              1
Name: PP, Length: 90, dtype: int64

2.1.10 fiBaseModel

def is_hybrid(x):
    try:
        return type(eval(x)) == str
    except:
        return True
total_df['fiBaseModel_is_hybrid'] = total_df['fiBaseModel'].map(is_hybrid)

2.1.11 ProductGroupDesc

result = total_df['ProductGroupDesc'].str.split(' ')
total_df['ProductGroupDesc_father'] = [i[-1] for i in result]    
total_df['ProductGroupDesc_father'].value_counts()
Loaders       126412
Excavators    104230
Tractors       82582
Loader         73216
Graders        26258
Name: ProductGroupDesc_father, dtype: int64

2.2 独热编码

def to_dummies(df, col_name):
    name_dict = {value: col_name + '_is_' + str(value) for value in df[col_name].unique()}
    return pd.get_dummies(df[col_name]).rename(columns=name_dict)
# col_names = ['CouplerSystem__GrouserTracks__HydraulicsFlow']

# for col in col_names:
#     total_df = pd.concat([total_df, to_dummies(total_df, col)], axis=1)

# total_df.drop(columns=col_names, inplace=True)

感觉没有明显提升,暂时不做了。

2.3 标签编码

2.3.1 UsageBand

usageband_dict = {
    'Low': 1,
    'Medium': 2,
    'High': 3
}
total_df['UsageBand'] = total_df['UsageBand'].map(usageband_dict).fillna(0)
total_df['UsageBand'].value_counts()
0.0    339028
2.0     35832
1.0     25311
3.0     12527
Name: UsageBand, dtype: int64

2.3.2 ProductSize

total_df['ProductSize_is_missing'] = total_df['ProductSize'].isna()
total_df['ProductSize'].value_counts()
Medium            64342
Large / Medium    51297
Small             27057
Mini              25721
Large             21396
Compact            6280
Name: ProductSize, dtype: int64
ProductSize_dict = {
    'Compact': 1,
    'Mini': 2,
    'Small': 3,
    'Medium': 4,
    'Large / Medium': 5,
    'Large': 6,
}
total_df['ProductSize'] = total_df['ProductSize'].map(ProductSize_dict).fillna(0)
total_df['ProductSize'].value_counts()
0.0    216605
4.0     64342
5.0     51297
3.0     27057
2.0     25721
6.0     21396
1.0      6280
Name: ProductSize, dtype: int64

2.3.3 Undercarriage_Pad_Width

total_df['Undercarriage_Pad_Width_is_missing'] = total_df['Undercarriage_Pad_Width'].isna()
total_df['Undercarriage_Pad_Width'].value_counts()
None or Unspecified    82444
32 inch                 5287
28 inch                 3152
24 inch                 2998
20 inch                 2664
30 inch                 1602
36 inch                 1544
18 inch                 1439
34 inch                  540
16 inch                  481
31 inch                  191
27 inch                  144
22 inch                  135
26 inch                   98
33 inch                   94
14 inch                   51
15 inch                   33
25 inch                   17
31.5 inch                  2
Name: Undercarriage_Pad_Width, dtype: int64
total_df['Undercarriage_Pad_Width'] = total_df['Undercarriage_Pad_Width'].fillna('0 inch').map(lambda x: 1 if x == 'None or Unspecified' else eval(x[:-4]))
total_df['Undercarriage_Pad_Width'].value_counts()
0.0     309782
1.0      82444
32.0      5287
28.0      3152
24.0      2998
20.0      2664
30.0      1602
36.0      1544
18.0      1439
34.0       540
16.0       481
31.0       191
27.0       144
22.0       135
26.0        98
33.0        94
14.0        51
15.0        33
25.0        17
31.5         2
Name: Undercarriage_Pad_Width, dtype: int64

2.3.4 Stick_Length

total_df['Stick_Length_is_missing'] = total_df['Stick_Length'].isna()
total_df['Stick_Length'].value_counts()
None or Unspecified    81539
9' 6"                   5832
10' 6"                  3519
11' 0"                  1601
9' 10"                  1463
9' 8"                   1462
9' 7"                   1423
12' 10"                 1087
10' 2"                  1004
8' 6"                    908
8' 2"                    614
10' 10"                  414
12' 8"                   322
11' 10"                  307
8' 4"                    274
8' 10"                   104
12' 4"                   103
9' 5"                    101
15' 9"                    87
6' 3"                     51
13' 7"                    11
14' 1"                     7
13' 10"                    7
13' 9"                     7
19' 8"                     5
7' 10"                     3
15' 4"                     3
24' 3"                     2
9' 2"                      1
Name: Stick_Length, dtype: int64
result = []

for i in total_df['Stick_Length'].fillna(0):
    if  i == 0:
        pass
    elif i == 'None or Unspecified':
        i = 1
    else:
        a = i.find("'")
        b = i.find('"')
        i = round(int(i[:a]) + int(i[a+2:b])/12, 2)
    result.append(i)
total_df['Stick_Length'] = result
total_df['Stick_Length'].value_counts()
0.00     310437
1.00      81539
9.50       5832
10.50      3519
11.00      1601
9.83       1463
9.67       1462
9.58       1423
12.83      1087
10.17      1004
8.50        908
8.17        614
10.83       414
12.67       322
11.83       307
8.33        274
8.83        104
12.33       103
9.42        101
15.75        87
6.25         51
13.58        11
14.08         7
13.83         7
13.75         7
19.67         5
7.83          3
15.33         3
24.25         2
9.17          1
Name: Stick_Length, dtype: int64

2.3.2 其余随机标签编码

le = LabelEncoder()
for col in total_df.columns:
    # 标记空值(事实证明用处不大。。。,只有一列标记派上了用场)
    if total_df[col].isna().sum() > 1000:
        total_df[col + '_is_missing'] = total_df[col].isna()
    # 编码
    if total_df[col].dtype == 'object':
        total_df[col] = le.fit_transform(total_df[col])

2.4 空值和异常值

# 异常值173
total_df['datasource'].map(lambda x: 172 if x == 173 else x)
0         132
1         132
2         132
3         132
4         132
         ... 
412693    149
412694    149
412695    149
412696    149
412697    149
Name: datasource, Length: 412698, dtype: int64
# 这一列的空值用众数填充
total_df['auctioneerID'].fillna(total_df['auctioneerID'].mode()[0], inplace=True)
total_df['MachineHoursCurrentMeter'].fillna(0, inplace=True)
total_df['PrimaryLower'].fillna(0, inplace=True)
total_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 412698 entries, 0 to 412697
Columns: 109 entries, SalePrice to SaleMonth_sin_is_missing
dtypes: bool(40), float64(10), int32(50), int64(9)
memory usage: 157.4 MB

2.5 删除重复列

def drop_dup_cols(df):
    cols = df.columns[60:]
    n = len(cols)
    for i in range(n-1):
        if len(df[df[cols[i]] != df[cols[i+1]]]) < 1000:
            df.drop(columns=[cols[i]], inplace=True)
drop_dup_cols(total_df)

3. 建模

# 以DataFrame的形式展现特征重要性
def show_feature_imp(model, df):
    return pd.DataFrame({df.columns[i]: model.feature_importances_[i] for i in range(len(df.columns))}, index=['imp']).T.sort_values(by='imp', ascending=False)
# 展示两个特征重要性表的共同倒数n%的特征
def show_low_features(df1, df2, ratio):
    n = int(len(df1) * ratio)
    df1_features = df1.tail(n).index
    df2_features = df2.tail(n).index
    return n, {feature: [df1.loc[feature].values[0], df2.loc[feature].values[0]] for feature in df1_features if feature in df2_features}
# 以字典的形式显示分数,可显示的分数有:R平方、MAE、MSE、RMSLE
def show_scores(y_test, y_pred):
    scores_dict = {}
    scores_dict['R2'] = r2_score(y_test, y_pred)
    scores_dict['MAE'] = mean_absolute_error(y_test, y_pred)
    scores_dict['MSE'] = mean_squared_error(y_test, y_pred)
    scores_dict['RMSLE'] = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return scores_dict
# 被后续的逐步回归筛出来的特征
drop_list = ['SalePrice', 'MachineID', 'PrimaryLower_is_missing',  
             'Drive_System_is_missing', 'Forks_is_missing', 'ProductSize_is_missing',
             'Ride_Control_is_missing', 'Stick_is_missing', 'Stick_Length_is_missing', 'Transmission_is_missing', 
             'Turbocharged_is_missing', 'Engine_Horsepower_is_missing', 'Hydraulics_is_missing', 
             'Pad_Type_is_missing', 'Ripper_is_missing', 'Tip_Control_is_missing', 'Tire_Size_is_missing', 
             'Coupler_is_missing', 'Hydraulics_Flow_is_missing', 'Grouser_Type_is_missing', 
             'Backhoe_Mounting_is_missing', 'Travel_Controls_is_missing', 'Steering_Controls_is_missing', 
             'Pushblock_is_missing', 'SaleDayOfWeek', 'SaleMonth', 'Differential_Type', 
             'Stick__Turbocharged', 'SaleDayOfYear', 'MachineHoursCurrentMeter_is_missing', 
             'Scarifier', 'Backhoe_Mounting', 'Forks', 'Track_Type', 'Engine_Horsepower', 
             'Blade_Extension', 'Coupler_System', 'SaleMonth_sin', 'SaleDayOfWeek_sin', 'SaleMonth_sin_is_missing',
             'fiBaseModel_is_hybrid', 'Hydraulics_Flow', 'fiModelDescriptor', 'SaleDay', 'Hydraulics', 'Pushblock', 'Transmission',
             'PrimarySizeBasis', 'Stick','DS', 'Tip_Control', 'Grouser_Tracks', 'fiSecondaryDesc', 'fiModelSeries'
            ]
# 划分训练集、验证集
X = total_df.drop(drop_list, axis=1)
y = total_df['SalePrice']

X_train = X[X['SaleYear']<2012]
X_test = X[X['SaleYear']>=2012]
y_train = y[X['SaleYear']<2012]
y_test = y[X['SaleYear']>=2012]
raise KeyError
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [1044], in <cell line: 1>()
----> 1 raise KeyError


KeyError: 

3.1 RF

# rfr = RandomForestRegressor(
#     n_estimators = 100,
#     max_depth = 23,
#     min_samples_leaf = 2,
#     min_samples_split = 4,
#     n_jobs = -1,
#     random_state = 10
# )
# rfr.fit(X_train, y_train)
# show_scores(y_test, rfr.predict(X_test))
# rfr_imp = show_feature_imp(rfr, X_train)
# rfr_imp.head(20)
# rfr_grid = {
#     'n_estimators': [100, 150, 200],
#     'max_depth': [15, 20, 25],
#     'max_features': [0.5, 0.6, 0.7],
#     'min_samples_leaf': [2],
#     'min_samples_split': [4],
#     'max_samples': [2000],
#     'n_jobs': [-1]  
# }
# gscv = GridSearchCV(RandomForestRegressor(), param_grid=rfr_grid, cv=5, verbose=True)
# gscv.fit(X_train, y_train)

3.2 XGB

xgb_grid = {
    'n_estimators': [200],
    'learning_rate': [0.0835],
    'reg_lambda': [0.48, 0.5, 0.52]
}
gscv = GridSearchCV(XGBRegressor(), param_grid=xgb_grid)
gscv.fit(X_train, y_train)
gscv.best_params_
xgb = XGBRegressor(
    n_estimators = 250, # 这个很重要
    max_depth = 11, # 这个很重要
    learning_rate = 0.11, # 这个很重要
    reg_lambda = 0.2, # 这个比较重要
    n_jobs=-1,
    random_state = 13
)
xgb.fit(X_train, y_train)
show_scores(y_test, xgb.predict(X_test))
{'R2': 0.884087856079633,
 'MAE': 5745.138324664967,
 'MSE': 79616854.10269143,
 'RMSLE': 0.23743323282056006}

总结:

调参的时候自己要先清楚哪些参数对模型影响大,最好用gscv来两次,第一次先摸清大概范围,第二次再精确到小数点后1-3位。

4. 特征筛选

逐步回归筛选特征(注意,后续提到的逐步回归都以本文自定义的“青春版”逐步回归函数为准)

  1. 后向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排升序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。

  2. 前向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。

  3. 如果希望更稳重,不要如此轻易就丢弃一个特征,可选的函数改进方式是改为二重循环:每一轮都将每个特征遍历一遍,计算前后的得分(或者误差)差值gain,剔除gain最小/大的特征,然后再进入下一轮,直到达到最大迭代次数或gain达到阈值。

def step_wise_reg(model, X_train, X_test, y_train, y_test, suspicion_list, method='backward', tol=0, show_info=True, dilution_ratio=1):
    X_train = X_train[::dilution_ratio]
    y_train = y_train[::dilution_ratio]
    
    if method == 'backward':
        drop_list = [] 
    elif method == 'forward':
        drop_list = list(suspicion_list)
    else:
        raise KeyError
    
    
    model.fit(X_train.drop(columns=drop_list), y_train)
    old_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))

    for last_feature in suspicion_list:
        
        if method == 'backward':
            drop_list.append(last_feature)
        else:
            drop_list.remove(last_feature)

        model.fit(X_train.drop(columns=drop_list), y_train)
        new_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))

        if show_info:
            print(old_score, new_score, drop_list)
        
        
        if old_score - new_score < tol:
            if method == 'backward':
                drop_list.remove(last_feature)   
            else:
                drop_list.append(last_feature)
        else:
            old_score = new_score

    return old_score, new_score, drop_list
old_score, new_score, add_drop_list1 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='backward', suspicion_list=show_feature_imp(xgb, X_train).index[::-1])
add_drop_list1
old_score, new_score, add_drop_list2 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='forward', tol=0.0001,suspicion_list=show_feature_imp(xgb, X_train).index[1:])
add_drop_list2

总结:

逐步回归十分好用,过一遍特征,误差就降下去了,推测特征被筛出去的原因主要是:

  1. 和其他特征的信息有一定相关性,即存在冗余信息。简单计算一下,就算全部是二分类特征,2的20次方也能表示出100万个不同的样本了,因此对于此次只有40万条数据的数据集来说,50+的特征绝对是存在冗余的,会干扰模型判断。

  2. 本身是个好特征,但没有经过恰当的分箱/编码处理,导致蕴含的信息不能传达给模型,反而起到了负提升。这就只有根据业务逻辑反复尝试了,一边尝试分箱,一边要回来调参。

  3. 本身是好特征,预处理也到位了,但模型进行逐步回归时没有同步调参,导致误差反而上升。尤其是前向逐步回归,从1个特征增加到50个特征,跨度如此之大,模型需要的参数必然是不同的。

以上2、3两点就造成了误筛。所以使用逐步回归时要注意:被筛出的特征也不一定就是没用的特征,还得自己判断。

5. 最终成果

show_scores(y_test, xgb.predict(X_test))
{'R2': 0.884087856079633,
 'MAE': 5745.138324664967,
 'MSE': 79616854.10269143,
 'RMSLE': 0.23743323282056006}
xgb_imp = show_feature_imp(xgb, X_train)
xgb_imp
imp
CouplerSystem__GrouserTracks__HydraulicsFlow0.397497
ProductSize0.249447
PP0.079305
Ride_Control0.042499
fiManufacturerID0.037125
YearMade0.031770
PrimaryLower0.027179
ProductGroupDesc_father0.015217
SaleYear0.012649
fiModelDesc0.012630
Blade_Width0.010005
Pushblock__Scarifier__TipControl0.009107
Drive_System0.006723
Ripper0.006685
Enclosure0.005481
Blade__Blade__Enclosure__Engine0.004808
Pad_Type0.004427
fiProductClassDesc0.004194
ProductGroupDesc0.003965
Travel_Controls0.003888
Tire_Size0.003268
Pattern_Changer0.003167
ModelID0.002938
UsageBand0.002606
BBT0.002432
Steering_Controls0.002027
Turbocharged0.001827
fiBaseModel0.001778
Blade_Type0.001742
Undercarriage_Pad_Width0.001496
Stick_Length0.001339
TUSTPG0.001334
Thumb0.001193
Coupler0.001181
Grouser_Type0.001138
MachineHoursCurrentMeter0.001100
state0.001093
auctioneerID_is_missing0.001045
Enclosure_Type0.001007
auctioneerID0.000875
datasource0.000812

在这里插入图片描述

  • 9
    点赞
  • 11
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Sprite.Nym

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值