背景
kaggle地址:https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
赛题给我们79个描述房屋的特征,要求我们据此预测房屋的最终售价,即对于测试集中每个房屋的ID给出对于的SalePrice字段的预测值,主要考察我们数据清洗、特征工程、模型搭建及调优等方面的技巧。本赛题是典型的回归类问题,评估指标选用的是均方根误差(RMSE),为了使得价格的高低对结果的评估有均等的影响,赛题均方根误差基于预测值和实际值分别取对数对数来计算。
特征初步分析:
1. SalePrice 房屋售价,我们要预测的label,类型:数值型,单位:美元
2. MSSubClass: 建筑的等级,类型:类别型
MSZoning: 区域分类,类型:类别型
LotFrontage: 距离街道的直线距离,类型:数值型,单位:英尺
LotArea: 地皮面积,类型:数值型,单位:平方英尺
Street: 街道类型,类型:类别型
Alley: 巷子类型,类型:类别型
LotShape: 房子整体形状,类型:类别型
LandContour: 平整度级别,类型:类别型
Utilities: 公共设施类型,类型:类别型
LotConfig: 房屋配置,类型:类别型
LandSlope: 倾斜度,类型:类别型
Neighborhood: 市区物理位置,类型:类别型
Condition1: 主干道或者铁路便利程度,类型:类别型
Condition2: 主干道或者铁路便利程度,类型:类别型
BldgType: 住宅类型,类型:类别型
HouseStyle: 住宅风格,类型:类别型
OverallQual: 整体材料和饰面质量,类型:数值型
OverallCond: 总体状况评价,类型:数值型
YearBuilt: 建筑年份,类型:数值型
YearRemodAdd: 改建年份,类型:数值型
RoofStyle: 屋顶类型,类型:类别型
RoofMatl: 屋顶材料,类型:类别型
Exterior1st: 住宅外墙,类型:类别型
Exterior2nd: 住宅外墙,类型:类别型
MasVnrType: 砌体饰面类型,类型:类别型
MasVnrArea: 砌体饰面面积,类型:数值型,单位:平方英尺
ExterQual: 外部材料质量,类型:类别型
ExterCond: 外部材料的现状,类型:类别型
Foundation: 地基类型,类型:类别型
BsmtQual: 地下室高度,类型:类别型
BsmtCond: 地下室概况,类型:类别型
BsmtExposure: 花园地下室墙,类型:类别型
BsmtFinType1: 地下室装饰质量,类型:类别型
BsmtFinSF1: 地下室装饰面积,类型:类别型
BsmtFinType2: 地下室装饰质量,类型:类别型
BsmtFinSF2: 地下室装饰面积,类型:类别型
BsmtUnfSF: 未装饰的地下室面积,类型:数值型,单位:平方英尺
TotalBsmtSF: 地下室总面积,类型:数值型,单位:平方英尺
Heating: 供暖类型,类型:类别型
HeatingQC: 供暖质量和条件,类型:类别型
CentralAir: 中央空调状况,类型:类别型
Electrical: 电力系统,类型:类别型
1stFlrSF: 首层面积,类型:数值型,单位:平方英尺
2ndFlrSF: 二层面积,类型:数值型,单位:平方英尺
LowQualFinSF: 低质装饰面积,类型:数值型,单位:平方英尺
GrLivArea: 地面以上居住面积,类型:数值型,单位:平方英尺
BsmtFullBath: 地下室全浴室,类型:数值
BsmtHalfBath: 地下室半浴室,类型:数值
FullBath: 高档全浴室,类型:数值
HalfBath: 高档半浴室,类型:数值
BedroomAbvGr: 地下室以上的卧室数量,类型:数值
KitchenAbvGr: 厨房数量,类型:数值
KitchenQual: 厨房质量,类型:类别型
TotRmsAbvGrd: 地上除卧室以外的房间数,类型:数值
Functional: 房屋功用性评级,类型:类别型
Fireplaces: 壁炉数量,类型:数值
FireplaceQu: 壁炉质量,类型:类别型
GarageType: 车库位置,类型:类别型
GarageYrBlt: 车库建造年份,类别:数值型
GarageFinish: 车库内饰,类型:类别型
GarageCars: 车库车容量大小,类别:数值型
GarageArea: 车库面积,类别:数值型,单位:平方英尺
GarageQual: 车库质量,类型:类别型
GarageCond: 车库条件,类型:类别型
PavedDrive: 铺的车道情况,类型:类别型
WoodDeckSF: 木地板面积,类型:数值型,单位:平方英尺
OpenPorchSF: 开放式门廊区面积,类型:数值型,单位:平方英尺
EnclosedPorch: 封闭式门廊区面积,类型:数值型,单位:平方英尺
3SsnPorch: 三个季节门廊面积,类型:数值型,单位:平方英尺
ScreenPorch: 纱门门廊面积,类型:数值型,单位:平方英尺
PoolArea: 泳池面积,类型:数值型,单位:平方英尺
PoolQC:泳池质量,类型:类别型
Fence: 围墙质量,类型:类别型
MiscFeature: 其他特征,类型:类别型
MiscVal: 其他杂项特征值,类型:类别型
MoSold: 卖出月份,类别:数值型
YrSold: 卖出年份,类别:数值型
SaleType: 交易类型,类型:类别型
SaleCondition: 交易条件,类型:类别型
数据说明
- train.csv - 训练集
- test.csv - 测试集
- data_description.txt - 每个栏目的完整描述,最初由Dean De Cock编写,但稍作修改以匹配此处使用的列名称
- sample_submission.csv - 来自销售年度和月份的线性回归的基准提交,批量平方英尺和卧室数量
提纲
- 1.数据载入
- 2.数据处理
- 2.1对label数据分析并进行平滑处理
- 2.2变量转化
- 2.3数据清洗
- 2.3.1相关性分析
- 2.3.2缺失值分析
- 2.3.3删除不必要的特征
- 2.3.4删除异常样本点
- 2.3.5缺失值填充
- 3.特征处理
- 3.1数值型特征处理
- 3.2类别型特征处理
- 4.建立模型
- 4.1准备数据
- 4.2随机森林
- 4.3XGBoost
- 4.4Lasso
- 5.模型融合并预测
- 6.输出预测结果
- 7.总结
- 8.优化方向
- 9.项目参考
1.数据载入
In [1]:
# 导入需要的模块
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
In [2]:
train = pd.read_csv('input/train.csv')
df_train = train.copy()
test = pd.read_csv('input/test.csv')
df_test = test.copy()
In [3]:
(df_train.shape, df_test.shape)
Out[3]:
((1460, 81), (1459, 80))
In [4]:
df_train.head()
Out[4]:
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
2.数据处理
2.1 对label数据分析并进行平滑处理
In [5]:
target = df_train['SalePrice']
In [6]:
target.isnull().unique()
Out[6]:
array([False])
In [7]:
target.describe()
Out[7]:
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
In [8]:
sns.boxplot(y = target)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcac8e5b6d8>
In [9]:
sns.distplot(target , fit=norm) #拟合正态分布来进行绘图
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcac6defa20>
In [10]:
target.skew() #target数据的分布偏差较大
Out[10]:
1.8828757597682129
分析:通过以上对label数据的观察,发现target数据的分布偏差较大,以下对target进行平滑处理。
In [11]:
#对target进行平滑处理
target = np.log1p(target)
In [12]:
sns.distplot(target , fit=norm);
In [13]:
target.skew() #偏差变小
Out[13]:
0.12134661989685333
In [14]:
sns.boxplot(y = target) #从下图也可以看出,target的数据分布两端比较均衡
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcac6bdc710>
2.2 变量转化
In [15]:
df_train = df_train.drop('SalePrice', axis=1)
df_train['training_set'] = True
df_test['training_set'] = False
df_full = pd.concat([df_train, df_test])
In [16]:
[df_full.shape,df_train.shape,df_test.shape]
Out[16]:
[(2919, 81), (1460, 81), (1459, 81)]
In [17]:
df_full.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 81 columns):
Id 2919 non-null int64
MSSubClass 2919 non-null int64
MSZoning 2915 non-null object
LotFrontage 2433 non-null float64
LotArea 2919 non-null int64
Street 2919 non-null object
Alley 198 non-null object
LotShape 2919 non-null object
LandContour 2919 non-null object
Utilities 2917 non-null object
LotConfig 2919 non-null object
LandSlope 2919 non-null object
Neighborhood 2919 non-null object
Condition1 2919 non-null object
Condition2 2919 non-null object
BldgType 2919 non-null object
HouseStyle 2919 non-null object
OverallQual 2919 non-null int64
OverallCond 2919 non-null int64
YearBuilt 2919 non-null int64
YearRemodAdd 2919 non-null int64
RoofStyle 2919 non-null object
RoofMatl 2919 non-null object
Exterior1st 2918 non-null object
Exterior2nd 2918 non-null object
MasVnrType 2895 non-null object
MasVnrArea 2896 non-null float64
ExterQual 2919 non-null object
ExterCond 2919 non-null object
Foundation 2919 non-null object
BsmtQual 2838 non-null object
BsmtCond 2837 non-null object
BsmtExposure 2837 non-null object
BsmtFinType1 2840 non-null object
BsmtFinSF1 2918 non-null float64
BsmtFinType2 2839 non-null object
BsmtFinSF2 2918 non-null float64
BsmtUnfSF 2918 non-null float64
TotalBsmtSF 2918 non-null float64
Heating 2919 non-null object
HeatingQC 2919 non-null object
CentralAir 2919 non-null object
Electrical 2918 non-null object
1stFlrSF 2919 non-null int64
2ndFlrSF 2919 non-null int64
LowQualFinSF 2919 non-null int64
GrLivArea 2919 non-null int64
BsmtFullBath 2917 non-null float64
BsmtHalfBath 2917 non-null float64
FullBath 2919 non-null int64
HalfBath 2919 non-null int64
BedroomAbvGr 2919 non-null int64
KitchenAbvGr 2919 non-null int64
KitchenQual 2918 non-null object
TotRmsAbvGrd 2919 non-null int64
Functional 2917 non-null object
Fireplaces 2919 non-null int64
FireplaceQu 1499 non-null object
GarageType 2762 non-null object
GarageYrBlt 2760 non-null float64
GarageFinish 2760 non-null object
GarageCars 2918 non-null float64
GarageArea 2918 non-null float64
GarageQual 2760 non-null object
GarageCond 2760 non-null object
PavedDrive 2919 non-null object
WoodDeckSF 2919 non-null int64
OpenPorchSF 2919 non-null int64
EnclosedPorch 2919 non-null int64
3SsnPorch 2919 non-null int64
ScreenPorch 2919 non-null int64
PoolArea 2919 non-null int64
PoolQC 10 non-null object
Fence 571 non-null object
MiscFeature 105 non-null object
MiscVal 2919 non-null int64
MoSold 2919 non-null int64
YrSold 2919 non-null int64
SaleType 2918 non-null object
SaleCondition 2919 non-null object
training_set 2919 non-null bool
dtypes: bool(1), float64(11), int64(26), object(43)
memory usage: 1.8+ MB
In [18]:
df_full['MSSubClass'].unique()
Out[18]:
array([ 60, 20, 70, 50, 190, 45, 90, 120, 30, 85, 80, 160, 75,
180, 40, 150])
MSSubClass: The building class (建筑类别),但其数据类型是int64,由于类别之间无大小关系,因此需要将该字段转化成类别型数据。
In [19]:
df_full['MSSubClass'] = df_full['MSSubClass'].astype(str)
In [20]:
df_full['MSSubClass'].dtypes
Out[20]:
dtype('O')
2.3数据清洗
2.3.1相关性分析
In [21]:
corrmat = train.corr()
plt.subplots(figsize=(12,12))
sns.heatmap(corrmat,square=True, cmap="YlGnBu");
In [22]:
cols = corrmat.nlargest(10, 'SalePrice')['SalePrice'].index
cols
Out[22]:
Index(['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt'],
dtype='object')
分析:通过观察上图最后一行,可以发现SalePrice和特征之间的相关性,发现 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea','TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt' 对SalePrice的影响很大,且'OverallQual', 'GrLivArea'的影响最大,后续需要重点分析。
下面可分析这几个重要特征之间的相关性
In [23]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
In [24]:
corrmatk = train[cols].corr()
In [25]:
plt.subplots(figsize=(7,7))
sns.heatmap(corrmatk,square=True, cbar=True, annot=True, cmap="YlGnBu");
分析:
- 1.'GarageCars'和'GarageArea'都是描述车库大小的字段,且与SalePrice的相关性相近,因此可以只保留相关性更高的'GarageCars',删除'GarageArea'。
- 2.'GrLivArea'和'TotRmsAbvGrd'都是描述2层及以上的房屋大小,前者更具有描述性,可删除特征'TotRmsAbvGrd'。
- 3.'TotalBsmtSF'和'1stFlrSF'与SalePrice的相关性大小相同,且根据字段含义描述,前者描述的是地下室总面积,后者描述的是一楼的面积,我们可删除一个特征,实际买房时的经验是一楼的面积似乎更重要,因此我们保留'1stFlrSF'字段,删除'TotalBsmtSF。
In [26]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', '1stFlrSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols], size = 2.5)
Out[26]:
<seaborn.axisgrid.PairGrid at 0x7fcac6bbddd8>
In [27]:
train['FullBath'].unique()
Out[27]:
array([2, 1, 3, 0])
In [28]:
train['GarageCars'].unique()
Out[28]:
array([2, 3, 1, 0, 4])
In [29]:
train['OverallQual'].unique()
Out[29]:
array([ 7, 6, 8, 5, 9, 4, 10, 3, 1, 2])
分析:
- YearBuilt和SalePrice几乎是正相关的,可进一步分析。
- GrLivArea、TotalBsmtSF和SalePrice几乎都是正相关的关系,GrLivArea喝TotalBsmtSF有异常点,可进一步分析。
分析YearBuilt和SalePrice
In [30]:
tmp = train[['YearBuilt','SalePrice']].groupby('YearBuilt').mean()
tmp.reset_index(inplace = True)
sns.set(style = "ticks")
sns.factorplot(x = "YearBuilt", y = "SalePrice", data = tmp, size = 7, aspect = 2 )
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x7fcac699ea20>
分析:
- 上图的几乎前3分之2的部分,看不出YearBuilt和SalePrice的关系
- 上图的尾部可看出SalePrice随着时间离现在越近,价格成上涨趋势,但整体上也是波动性上涨。
- 总之,没有发现出YearBuilt和SalePrice的必然关系,无需再进一步对YearBuilt特征进一步分析。
分析GrLivArea和SalePrice
In [31]:
data = pd.concat([train['SalePrice'], train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0,800000)) #利用pandas的scatter画图
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcabefe4940>
In [32]:
sns.boxplot(y = test['GrLivArea'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcac40e9828>
分析:
- 散点图的右下角两个点明显是异常点,GrLivArea表示的是“Above grade (ground) living area square feet”,一般不会存在面积很大,但价格很低的房子,因此这两个样本点可删除。
- 对于散点图右上角的两个点,由于实际的面积可能达到4000-5000,且test数据也有高于5000的数据,因此这两个样本点不属于异常点。
分析TotalBsmtSF和SalePrice
In [33]:
data = pd.concat([train['SalePrice'], train['1stFlrSF']], axis=1)
data.plot.scatter(x='1stFlrSF', y='SalePrice', ylim=(0,800000)) #利用pandas的scatter画图
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcabef58160>
分析:TotalBsmtSF描述的是地下室的面积,右下角的样本点面积大且价格低,显然不符合实际情况,可删除。
结论:通过以上分析,可删除不必要的特征和异常的样本数据。
- 删除'GarageArea'、'TotRmsAbvGrd'、'TotalBsmtSF三个特征。
- 删除特征GrLivArea大于4000 且 SalePrice小于300000的样本数据。
- 删除特征1stFlrSF大于4000 且 SalePrice小于200000的样本数据。
2.3.2 缺失值分析
In [34]:
#观察train数据的缺失值情况,以下表格从左到有缺失值依次减少
pd.DataFrame(train.isnull().sum().sort_values(ascending=False).head(15)).T
Out[34]:
PoolQC | MiscFeature | Alley | Fence | FireplaceQu | LotFrontage | GarageCond | GarageType | GarageYrBlt | GarageFinish | GarageQual | BsmtExposure | BsmtFinType2 | BsmtFinType1 | BsmtCond | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1453 | 1406 | 1369 | 1179 | 690 | 259 | 81 | 81 | 81 | 81 | 81 | 38 | 38 | 37 | 37 |
In [35]:
#观察test数据的缺失值情况,以下表格从左到有缺失值依次减少
pd.DataFrame(test.isnull().sum().sort_values(ascending=False).head(15)).T
Out[35]:
PoolQC | MiscFeature | Alley | Fence | FireplaceQu | LotFrontage | GarageCond | GarageQual | GarageYrBlt | GarageFinish | GarageType | BsmtCond | BsmtQual | BsmtExposure | BsmtFinType1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1456 | 1408 | 1352 | 1169 | 730 | 227 | 78 | 78 | 78 | 78 | 76 | 45 | 44 | 44 | 42 |
分析:train的样本数据和test的样本数据,缺失值在各特征上的分布情况大致相同,因此可直接对df_full进行统一处理。
In [36]:
#观察数据的缺失值情况
total = df_full.isnull().sum().sort_values(ascending=False).head(40) #head(30)
percent = ((df_full.isnull().sum()/len(df_full))*100).sort_values(ascending=False).head(40)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
Out[36]:
Total | Percent | |
---|---|---|
PoolQC | 2909 | 99.657417 |
MiscFeature | 2814 | 96.402878 |
Alley | 2721 | 93.216855 |
Fence | 2348 | 80.438506 |
FireplaceQu | 1420 | 48.646797 |
LotFrontage | 486 | 16.649538 |
GarageCond | 159 | 5.447071 |
GarageYrBlt | 159 | 5.447071 |
GarageFinish | 159 | 5.447071 |
GarageQual | 159 | 5.447071 |
GarageType | 157 | 5.378554 |
BsmtCond | 82 | 2.809181 |
BsmtExposure | 82 | 2.809181 |
BsmtQual | 81 | 2.774923 |
BsmtFinType2 | 80 | 2.740665 |
BsmtFinType1 | 79 | 2.706406 |
MasVnrType | 24 | 0.822199 |
MasVnrArea | 23 | 0.787941 |
MSZoning | 4 | 0.137033 |
Utilities | 2 | 0.068517 |
BsmtHalfBath | 2 | 0.068517 |
BsmtFullBath | 2 | 0.068517 |
Functional | 2 | 0.068517 |
Exterior1st | 1 | 0.034258 |
TotalBsmtSF | 1 | 0.034258 |
BsmtUnfSF | 1 | 0.034258 |
BsmtFinSF2 | 1 | 0.034258 |
GarageArea | 1 | 0.034258 |
KitchenQual | 1 | 0.034258 |
GarageCars | 1 | 0.034258 |
BsmtFinSF1 | 1 | 0.034258 |
Exterior2nd | 1 | 0.034258 |
SaleType | 1 | 0.034258 |
Electrical | 1 | 0.034258 |
YearRemodAdd | 0 | 0.000000 |
RoofMatl | 0 | 0.000000 |
RoofStyle | 0 | 0.000000 |
training_set | 0 | 0.000000 |
YearBuilt | 0 | 0.000000 |
OverallCond | 0 | 0.000000 |
分析:
通过对缺失值的观察和特征文档的解释,得到以下分析。
- 数据中的None值是有具体含义的,表示无。
- PoolQC: Pool quality,文档中说明了数据NA表示No Pool,因此,该字段无需删除,但需对缺失值进行填充“None”(最好是填充“No Pool”,但由于填充的值只是代表一类数据,所以填充的内容是什么无所谓,方便起见,填充“None”,下同)。
- MiscFeature:Miscellaneous feature not covered in other categories,NA表示None。需对缺失值进行填充“None”。
- Alley: Type of alley access to property,NA表示No alley access,因此,该字段无需删除,但需对缺失值进行填充“None”。
- Fence: Fence quality,NA表示No Fence,因此该字段无需删除,但需对缺失值进行填充“None”。
- FireplaceQu:Fireplace quality,NA表示No Fireplace,因此该字段无需删除,但需对缺失值进行填充“None”。
- LotFrontage: Linear feet of street connected to property,该字段的缺失值可通过邻居的该字段的中位数进行填充。
进一步分析 BsmtCond, BsmtExposure, BsmtQual, BsmtFinType2, BsmtFinType1,缺失值表示No Basement。
In [37]:
#train同时都为空的有37个样本数据
df_train[(df_train['BsmtCond'].isnull())&(df_train['BsmtExposure'].isnull())&(df_train['BsmtQual'].isnull())&(df_train['BsmtFinType2'].isnull())]['BsmtFinType1'].isnull().sum()
Out[37]:
37
In [38]:
#找到各特征的样本缺失数
df_train['BsmtCond'].isnull().sum() ,df_train['BsmtExposure'].isnull().sum(), df_train['BsmtQual'].isnull().sum(), df_train['BsmtFinType2'].isnull().sum(), df_train['BsmtFinType1'].isnull().sum()
Out[38]:
(37, 38, 37, 38, 37)
In [39]:
#对于BsmtExposure,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtExposure的众数
df_train[(~df_train['BsmtCond'].isnull())&(df_train['BsmtExposure'].isnull())&(~df_train['BsmtQual'].isnull())&(~df_train['BsmtFinType2'].isnull())&(~df_train['BsmtFinType1'].isnull())][['BsmtExposure','BsmtFinType2']]
Out[39]:
BsmtExposure | BsmtFinType2 | |
---|---|---|
948 | NaN | Unf |
In [40]:
#对于BsmtFinType2,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtFinType2的众数
df_train[(~df_train['BsmtCond'].isnull())&(~df_train['BsmtExposure'].isnull())&(~df_train['BsmtQual'].isnull())&(df_train['BsmtFinType2'].isnull())&(~df_train['BsmtFinType1'].isnull())][['BsmtExposure','BsmtFinType2']]
Out[40]:
BsmtExposure | BsmtFinType2 | |
---|---|---|
332 | No | NaN |
In [41]:
#test中同时都为空的42个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
df_test[(df_test['BsmtCond'].isnull())&(df_test['BsmtExposure'].isnull())&(df_test['BsmtQual'].isnull())&(df_test['BsmtFinType2'].isnull())]['BsmtFinType1'].isnull().sum()
Out[41]:
42
In [42]:
#找到各特征的样本缺失数
df_test['BsmtCond'].isnull().sum() ,df_test['BsmtExposure'].isnull().sum(), df_test['BsmtQual'].isnull().sum(), df_test['BsmtFinType2'].isnull().sum(), df_test['BsmtFinType1'].isnull().sum()
Out[42]:
(45, 44, 44, 42, 42)
In [43]:
#对于BsmtCond,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtCond的众数
df_test[(df_test['BsmtCond'].isnull())&(~df_test['BsmtExposure'].isnull())&(~df_test['BsmtQual'].isnull())&(~df_test['BsmtFinType2'].isnull())&(~df_test['BsmtFinType1'].isnull())][['BsmtCond','BsmtExposure','BsmtQual']]
Out[43]:
BsmtCond | BsmtExposure | BsmtQual | |
---|---|---|---|
580 | NaN | Mn | Gd |
725 | NaN | No | TA |
1064 | NaN | Av | TA |
In [44]:
#对于BsmtExposure,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtExposure的众数
df_test[(~df_test['BsmtCond'].isnull())&(df_test['BsmtExposure'].isnull())&(~df_test['BsmtQual'].isnull())&(~df_test['BsmtFinType2'].isnull())&(~df_test['BsmtFinType1'].isnull())][['BsmtCond','BsmtExposure','BsmtQual']]
Out[44]:
BsmtCond | BsmtExposure | BsmtQual | |
---|---|---|---|
27 | TA | NaN | Gd |
888 | TA | NaN | Gd |
In [45]:
#对于BsmtQual,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtQual的众数
df_test[(~df_test['BsmtCond'].isnull())&(~df_test['BsmtExposure'].isnull())&(df_test['BsmtQual'].isnull())&(~df_test['BsmtFinType2'].isnull())&(~df_test['BsmtFinType1'].isnull())][['BsmtCond','BsmtExposure','BsmtQual']]
Out[45]:
BsmtCond | BsmtExposure | BsmtQual | |
---|---|---|---|
757 | Fa | No | NaN |
758 | TA | No | NaN |
In [46]:
df_full[['BsmtCond','BsmtExposure' , 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']].info() #均为类别型特征
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 5 columns):
BsmtCond 2837 non-null object
BsmtExposure 2837 non-null object
BsmtQual 2838 non-null object
BsmtFinType2 2839 non-null object
BsmtFinType1 2840 non-null object
dtypes: object(5)
memory usage: 136.8+ KB
分析结论
train数据中:
- 同时都为空的37个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
- 对于BsmtExposure和BsmtFinType2,找到其余特征都不缺失,而该特征缺失的样本,对该样本填充该特征的众数
test数据中:
- 同时都为空的42个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
- 对于BsmtCond, BsmtExposure和BsmtQual,找到其余特征都不缺失,而该特征缺失的样本,对该样本填充该特征的众数
进一步分析GarageCond,GarageFinish、GarageQual、GarageType、GarageYrBlt
In [47]:
#train数据中 GarageCond,GarageFinish、GarageQual、GarageType四个个特征的缺失值都为81
df_train['GarageCond'].isnull().sum() ,df_train['GarageFinish'].isnull().sum(), df_train['GarageQual'].isnull().sum(), df_train['GarageType'].isnull().sum(), df_train['GarageYrBlt'].isnull().sum()
Out[47]:
(81, 81, 81, 81, 81)
In [48]:
#test中找到各特征的样本缺失数
df_test['GarageCond'].isnull().sum() ,df_test['GarageFinish'].isnull().sum(), df_test['GarageQual'].isnull().sum(), df_test['GarageType'].isnull().sum() , df_test['GarageYrBlt'].isnull().sum()
Out[48]:
(78, 78, 78, 76, 78)
In [49]:
#train中GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt同时缺失的样本条数也是81
df_train[(df_train['GarageCond'].isnull())&(df_train['GarageFinish'].isnull())&(df_train['GarageQual'].isnull())&(df_train['GarageType'].isnull())]['GarageYrBlt'].isnull().sum()
Out[49]:
81
In [50]:
#test中GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt同时缺失的样本条数也是76
df_test[(df_test['GarageCond'].isnull())&(df_test['GarageFinish'].isnull())&(df_test['GarageQual'].isnull())&(df_test['GarageType'].isnull())]['GarageYrBlt'].isnull().sum()
Out[50]:
76
In [51]:
#发现GarageType有两条数据需要再分析一下,找到其余四个特征都为NaN,而GarageType不为NaN的这两条数据
df_test[(df_test['GarageCond'].isnull())&(df_test['GarageFinish'].isnull())&(df_test['GarageQual'].isnull())&\
(df_test['GarageYrBlt'].isnull())&(~df_test['GarageType'].isnull())][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']]
Out[51]:
GarageCond | GarageFinish | GarageQual | GarageType | GarageYrBlt | |
---|---|---|---|---|---|
666 | NaN | NaN | NaN | Detchd | NaN |
1116 | NaN | NaN | NaN | Detchd | NaN |
In [52]:
df_full[df_full['GarageType'] == 'Detchd'][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']].head()
Out[52]:
GarageCond | GarageFinish | GarageQual | GarageType | GarageYrBlt | |
---|---|---|---|---|---|
3 | TA | Unf | TA | Detchd | 1998.0 |
8 | TA | Unf | Fa | Detchd | 1931.0 |
10 | TA | Unf | TA | Detchd | 1965.0 |
12 | TA | Unf | TA | Detchd | 1962.0 |
15 | TA | Unf | TA | Detchd | 1991.0 |
In [53]:
df_full[df_full['GarageType'] == 'Detchd']['GarageCond'].value_counts()
Out[53]:
TA 692
Fa 67
Po 12
Gd 3
Ex 3
Name: GarageCond, dtype: int64
In [54]:
df_full[df_full['GarageType'] == 'Detchd']['GarageCond'].value_counts().index[0]
Out[54]:
'TA'
In [55]:
df_full[df_full['GarageType'] == 'Detchd']['GarageFinish'].value_counts().index[0]
Out[55]:
'Unf'
In [56]:
df_full[df_full['GarageType'] == 'Detchd']['GarageQual'].value_counts().index[0]
Out[56]:
'TA'
In [57]:
df_test[df_test['GarageType'] == 'Detchd']['GarageYrBlt'].value_counts().index[0]
Out[57]:
1950.0
分析结论(GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt)
- GarageCond,GarageFinish、GarageQual、GarageType的确实值表示No Garage。
Train数据中:
- train数据中,GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt同时缺失的样本条数也是81。各自缺失值的样本数也是81。说明没有车库就没有车库的建立年份。合理。
- 前四个字段为类别型变量,缺失值可填充'None',GarageYrBlt是年份,float64的数据类型,缺失值可填充0。
test数据中:
- test数据中,GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt同时缺失的样本条数是76,这些特征中前四个特征填充None,GarageYrBlt填充0。只有GarageType的缺失值是76,其余特征的缺失值均为78。
- 其余四个特征都为NaN,而GarageType不为NaN的这两条样本,一种思路是:可对这两条样本的GarageCond,GarageFinish、GarageQual和GarageYrBlt的缺失值进行填充,策略是在GarageType为Detchd的前提下,各个特征填充该特征的众数,因此GarageCond,GarageFinish、GarageQual和GarageYrBlt分别填充的是'TA'、'Unf'、'TA'、1950.0。另一种思路:我们认为由于只有GarageType不为NaN,我们可推测这两条样本的GarageType特征为异常值,将数据设置为“None”。这里我们采取第二种思路。
In [58]:
#MSZoning缺失值样本的邻居都是IDOTRR
df_full[df_full['MSZoning'].isnull()]['Neighborhood']
Out[58]:
455 IDOTRR
756 IDOTRR
790 IDOTRR
1444 Mitchel
Name: Neighborhood, dtype: object
In [59]:
df_full[df_full['Neighborhood'] == 'IDOTRR']['MSZoning'].value_counts().index[0] #可填充邻居的MSZoning的众数
Out[59]:
'RM'
分析:
- MasVnrType, MasVnrArea:NA可能是这些房屋没有砖石饰面。因此MasVnrArea填充0,MasVnrType填充None。
- MSZoning: 描述地区分类,缺失值可填充该条样本的邻居的MSZoning的众数,即填充'RM'。
In [60]:
df_full['Utilities'].unique(), df_train['Utilities'].unique(), df_test['Utilities'].unique()
Out[60]:
(array(['AllPub', 'NoSeWa', nan], dtype=object),
array(['AllPub', 'NoSeWa'], dtype=object),
array(['AllPub', nan], dtype=object))
分析:
- Utilities:该字段在train里只有两种类别,在test里只有一种类别,对预测作用不大,该特征可删除。
进一步分析BsmtHalfBath和BsmtFullBath
In [61]:
df_full['BsmtHalfBath'].unique()
Out[61]:
array([ 0., 1., 2., nan])
In [62]:
df_full['BsmtFullBath'].unique()
Out[62]:
array([ 1., 0., 2., 3., nan])
和地下室有关的特征:BsmtFullBath、BsmtHalfBath、BsmtQual、BsmtCond、BsmtExposure、BsmtFinType1、BsmtUnfSF、TotalBsmtSF,上述分析过BsmtCond, BsmtExposure, BsmtQual, BsmtFinType2, BsmtFinType1的缺失值表示No Basement,可观察BsmtHalfBath和BsmtFullBath的缺失值样本对应的BsmtCond, BsmtExposure, BsmtQual, BsmtFinType2, BsmtFinType1是否为空。
In [63]:
df_full[df_full['BsmtHalfBath'].isnull()][['BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']]
Out[63]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
660 | NaN | NaN | NaN | NaN | NaN |
728 | NaN | NaN | NaN | NaN | NaN |
In [64]:
df_full[df_full['BsmtFullBath'].isnull()][['BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']]
Out[64]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
660 | NaN | NaN | NaN | NaN | NaN |
728 | NaN | NaN | NaN | NaN | NaN |
In [65]:
df_full[['BsmtHalfBath','BsmtFullBath']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 2 columns):
BsmtHalfBath 2917 non-null float64
BsmtFullBath 2917 non-null float64
dtypes: float64(2)
memory usage: 68.4 KB
分析结论:BsmtHalfBath和BsmtFullBath的缺失值全部填充0,没有地下室也就没有地下室浴室。
In [66]:
df_full['Functional'].unique()
Out[66]:
array(['Typ', 'Min1', 'Maj1', 'Min2', 'Mod', 'Maj2', 'Sev', nan],
dtype=object)
In [67]:
df_train[df_train['Functional'].isnull()]
Out[67]:
0 rows × 81 columns
In [68]:
df_test[df_test['Functional'].isnull()]
Out[68]:
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | training_set | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
756 | 2217 | 20 | NaN | 80.0 | 14584 | Pave | NaN | Reg | Low | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Abnorml | False |
1013 | 2474 | 50 | RM | 60.0 | 10320 | Pave | Grvl | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2007 | COD | Abnorml | False |
2 rows × 81 columns
In [69]:
df_full['Functional'].value_counts()
Out[69]:
Typ 2717
Min2 70
Min1 65
Mod 35
Maj1 19
Maj2 9
Sev 2
Name: Functional, dtype: int64
分析结论:Functional特征在train数据中没有缺失值,缺失值都存在test数据中,可对缺失值数据填充众数,即填充'Typ'。
对其余有缺失值的特征进行分析
In [70]:
df_train['Exterior1st'].isnull().sum(), df_train['TotalBsmtSF'].isnull().sum(),df_train['BsmtUnfSF'].isnull().sum(),\
df_train['BsmtFinSF2'].isnull().sum(), df_train['GarageArea'].isnull().sum(),df_train['KitchenQual'].isnull().sum(),\
df_train['GarageCars'].isnull().sum(),df_train['BsmtFinSF1'].isnull().sum(),df_train['Exterior2nd'].isnull().sum(),\
df_train['SaleType'].isnull().sum(),df_train['Electrical'].isnull().sum()
Out[70]:
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
In [71]:
df_test['Exterior1st'].isnull().sum(), df_train['TotalBsmtSF'].isnull().sum(), df_test['BsmtUnfSF'].isnull().sum(),\
df_test['BsmtFinSF2'].isnull().sum(),df_test['GarageArea'].isnull().sum(),df_test['KitchenQual'].isnull().sum(),\
df_test['GarageCars'].isnull().sum(),df_test['BsmtFinSF1'].isnull().sum(),df_test['Exterior2nd'].isnull().sum(),\
df_test['SaleType'].isnull().sum(), df_test['Electrical'].isnull().sum()
Out[71]:
(1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0)
分析结论:
- Exterior1st、KitchenQual、Exterior2nd均为类别型,且只有test中有一条缺失值样本,可对该样本填充该特征的众数。
- Electrical类别型,且只有train中有一条缺失值样本,可对该样本填充该特征的众数。
- SaleType、Electrical均为类别型特征,填充众数即可。
In [72]:
df_test[df_test['TotalBsmtSF'].isnull()][['BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']]
Out[72]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
660 | NaN | NaN | NaN | NaN | NaN |
In [73]:
df_test[df_test['BsmtUnfSF'].isnull()][['BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']]
Out[73]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
660 | NaN | NaN | NaN | NaN | NaN |
In [74]:
df_test[df_test['BsmtFinSF2'].isnull()][['BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']]
Out[74]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
660 | NaN | NaN | NaN | NaN | NaN |
In [75]:
df_test[df_test['BsmtFinSF1'].isnull()][['BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']]
Out[75]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
660 | NaN | NaN | NaN | NaN | NaN |
In [76]:
df_test[['TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 4 columns):
TotalBsmtSF 1458 non-null float64
BsmtUnfSF 1458 non-null float64
BsmtFinSF2 1458 non-null float64
BsmtFinSF1 1458 non-null float64
dtypes: float64(4)
memory usage: 45.7 KB
分析结论
- 'TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1'都是跟地下室有关,三两者都是数值型特征,缺失值表示没有地下室,因此都填充0。
In [77]:
df_test[df_test['GarageArea'].isnull()][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']]
Out[77]:
GarageCond | GarageFinish | GarageQual | GarageType | GarageYrBlt | |
---|---|---|---|---|---|
1116 | NaN | NaN | NaN | Detchd | NaN |
In [78]:
df_test[df_test['GarageCars'].isnull()][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']]
Out[78]:
GarageCond | GarageFinish | GarageQual | GarageType | GarageYrBlt | |
---|---|---|---|---|---|
1116 | NaN | NaN | NaN | Detchd | NaN |
In [79]:
df_test[['GarageArea','GarageCars']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
GarageArea 1458 non-null float64
GarageCars 1458 non-null float64
dtypes: float64(2)
memory usage: 22.9 KB
分析结论
- 发现test中的index是1116的样本中,'GarageCond','GarageFinish','GarageQual','GarageYrBlt','GarageArea','GarageCars'几个字段都为NaN,只有GarageType字段不为NaN,我们推测该条样本的这个特征是异常值,可删除。
- 'GarageArea','GarageCars'两个字段为数值型,填充0。
2.3.3删除不必要的特征
通过相关性分析和缺失值分析,得出结论:可删除以下三个不必要的特征。
In [80]:
df_full.drop('GarageArea', axis=1, inplace=True)
df_full.drop('TotRmsAbvGrd', axis=1, inplace=True)
df_full.drop('TotalBsmtSF', axis=1, inplace=True)
观察类别特征的类别个数,若某类别型特征在test数据中只有一个类别,则该特征对价格的预测无意义,可删除。在缺失值分析中分析了Utilities,可删除该特征。
In [81]:
[test['Utilities'].unique(),train['Utilities'].unique()]
Out[81]:
[array(['AllPub', nan], dtype=object),
array(['AllPub', 'NoSeWa'], dtype=object)]
In [82]:
df_full.drop('Utilities', axis=1, inplace=True)
2.3.4删除异常样本点
通过上述的相关性分析,得出结论,可删除以下样本数据:
- 删除特征GrLivArea大于4000 且 SalePrice小于300000的样本数据。
- 删除特征1stFlrSF大于4000 且 SalePrice小于200000的样本数据。
In [83]:
index1 = train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index
index2 = train[(train['1stFlrSF']>4000) & (train['SalePrice']<200000)].index
In [84]:
index1,index2
Out[84]:
(Int64Index([523, 1298], dtype='int64'), Int64Index([1298], dtype='int64'))
综上:删除train对应的index是[523, 1298]的样本即可。
In [85]:
df_train = df_full[df_full['training_set']==True]
df_test = df_full[df_full['training_set']==False]
In [86]:
df_train = df_train.drop(index1)
In [87]:
df_train.shape, df_test.shape
Out[87]:
((1458, 77), (1459, 77))
In [88]:
df_full = pd.concat([df_train, df_test])
In [89]:
#将target对应的index也删除
target = target.drop(index1)
In [90]:
target.shape
Out[90]:
(1458,)
2.3.5缺失值填充
通过上述对缺失值的分析,分别对各个特种实施不同的缺失值填充策略。
类别型特征:
- PoolQC、MiscFeature、Alley、Fence、FireplaceQu、MasVnrType的缺失值填充“None”。
数值型特征:
- LotFrontage 该特征的缺失值可通过邻居的该字段的中位数进行填充。
- MasVnrArea 填充0
- BsmtHalfBath和BsmtFullBath的缺失值全部填充0,没有地下室也就没有地下室浴室。
- TotalBsmtSF(该特征已删除,忽略)、BsmtUnfSF、BsmtFinSF2和BsmtFinSF1都是跟地下室有关,且两者都是数值型特征,缺失值表示没有地下室,因此都填充0。
- 'GarageArea'(该特征已删除,忽略),'GarageCars'两个字段为数值型,填充0。
In [91]:
#df_full_test = df_full
In [92]:
df_full[['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','MasVnrType']] = df_full[['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','MasVnrType']].fillna('None')
In [93]:
df_full.isnull().sum().sort_values(ascending=False).head(10)
Out[93]:
LotFrontage 486
GarageCond 159
GarageYrBlt 159
GarageFinish 159
GarageQual 159
GarageType 157
BsmtCond 82
BsmtExposure 82
BsmtQual 81
BsmtFinType2 80
dtype: int64
In [94]:
#LotFrontage 该特征的缺失值可通过邻居的该字段的中位数进行填充。
df_full['LotFrontage'] = df_full.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
df_full['LotFrontage'].isnull().unique()
Out[94]:
array([False])
In [95]:
df_full.isnull().sum().sort_values(ascending=False).head(10)
Out[95]:
GarageCond 159
GarageYrBlt 159
GarageFinish 159
GarageQual 159
GarageType 157
BsmtCond 82
BsmtExposure 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
dtype: int64
In [96]:
df_full[['MasVnrArea','BsmtHalfBath','BsmtFullBath','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1','GarageCars']] = df_full[['MasVnrArea','BsmtHalfBath','BsmtFullBath','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1','GarageCars']].fillna(0)
In [97]:
df_full['MasVnrArea'].isnull().sum()
Out[97]:
0
BsmtCond, BsmtExposure, BsmtQual, BsmtFinType2, BsmtFinType1
train数据中:
- 同时都为空的37个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
- 对于BsmtExposure和BsmtFinType2,找到其余特征都不缺失,而该特征缺失的样本,对该样本填充该特征的众数
test数据中:
- 同时都为空的42个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
- 对于BsmtCond, BsmtExposure和BsmtQual,找到其余特征都不缺失,而该特征缺失的样本,对该样本填充该特征的众数
In [98]:
df_train = df_full[df_full['training_set']==True]
df_test = df_full[df_full['training_set']==False]
In [99]:
#train数据中,同时都为空的37个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
index_row = df_train[(df_train['BsmtCond'].isnull())&(df_train['BsmtExposure'].isnull())&(df_train['BsmtQual'].isnull())&(df_train['BsmtFinType2'].isnull())\
&(df_train['BsmtFinType1'].isnull())][['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']].index.tolist()
In [100]:
df_train.loc[index_row,['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']] = df_train.loc[index_row,['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']].fillna('None')
In [101]:
df_train.loc[index_row,['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']].head()
Out[101]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
17 | None | None | None | None | None |
39 | None | None | None | None | None |
90 | None | None | None | None | None |
102 | None | None | None | None | None |
156 | None | None | None | None | None |
In [102]:
#对于BsmtExposure,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtExposure的众数
index_row = df_train[(~df_train['BsmtCond'].isnull())&(df_train['BsmtExposure'].isnull())&(~df_train['BsmtQual'].isnull())&\
(~df_train['BsmtFinType2'].isnull())&(~df_train['BsmtFinType1'].isnull())].index
index_row
Out[102]:
Int64Index([948], dtype='int64')
In [103]:
freq = df_full['BsmtExposure'].value_counts().sort_values(ascending=False).index[0]
In [104]:
df_train.loc[index_row,'BsmtExposure'] = freq
In [105]:
df_train.loc[index_row,'BsmtExposure']
Out[105]:
948 No
Name: BsmtExposure, dtype: object
In [106]:
#对于BsmtFinType2,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtExposure的众数
index_row = df_train[(~df_train['BsmtCond'].isnull())&(~df_train['BsmtExposure'].isnull())&(~df_train['BsmtQual'].isnull())&\
(df_train['BsmtFinType2'].isnull())&(~df_train['BsmtFinType1'].isnull())].index
index_row
Out[106]:
Int64Index([332], dtype='int64')
In [107]:
freq = df_full['BsmtFinType2'].value_counts().sort_values(ascending=False).index[0]
In [108]:
df_train.loc[index_row,'BsmtFinType2'] = freq
In [109]:
df_train.loc[index_row,'BsmtFinType2']
Out[109]:
332 Unf
Name: BsmtFinType2, dtype: object
In [110]:
#test数据中,同时都为空的42个样本数据的五个特征的缺失值全部填充None,表示没有地下室。
index_row = df_test[(df_test['BsmtCond'].isnull())&(df_test['BsmtExposure'].isnull())&(df_test['BsmtQual'].isnull())&(df_test['BsmtFinType2'].isnull())\
&(df_test['BsmtFinType1'].isnull())][['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']].index.tolist()
In [111]:
df_test.loc[index_row,['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']] = df_test.loc[index_row,['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']].fillna('None')
In [112]:
df_test.loc[index_row,['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType2','BsmtFinType1']].head()
Out[112]:
BsmtCond | BsmtExposure | BsmtQual | BsmtFinType2 | BsmtFinType1 | |
---|---|---|---|---|---|
125 | None | None | None | None | None |
133 | None | None | None | None | None |
269 | None | None | None | None | None |
318 | None | None | None | None | None |
354 | None | None | None | None | None |
In [113]:
#对于BsmtCond,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtCond的众数
index_row = df_test[(df_test['BsmtCond'].isnull())&(~df_test['BsmtExposure'].isnull())&(~df_test['BsmtQual'].isnull())&(~df_test['BsmtFinType2'].isnull())&\
(~df_test['BsmtFinType1'].isnull())].index.tolist()
index_row
Out[113]:
[580, 725, 1064]
In [114]:
freq = df_full['BsmtCond'].value_counts().sort_values(ascending=False).index[0]
freq
Out[114]:
'TA'
In [115]:
df_test.loc[index_row,'BsmtCond'] = df_test.loc[index_row,'BsmtCond'].fillna(freq)
In [116]:
df_test.loc[index_row,'BsmtCond']
Out[116]:
580 TA
725 TA
1064 TA
Name: BsmtCond, dtype: object
In [117]:
#对于BsmtExposure,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtCond的众数
index_row = df_test[(~df_test['BsmtCond'].isnull())&(df_test['BsmtExposure'].isnull())&(~df_test['BsmtQual'].isnull())&(~df_test['BsmtFinType2'].isnull())&\
(~df_test['BsmtFinType1'].isnull())].index.tolist()
index_row
Out[117]:
[27, 888]
In [118]:
freq = df_full['BsmtExposure'].value_counts().sort_values(ascending=False).index[0]
freq
Out[118]:
'No'
In [119]:
df_test.loc[index_row,'BsmtExposure'] = df_test.loc[index_row,'BsmtExposure'].fillna(freq)
In [120]:
df_test.loc[index_row,'BsmtExposure']
Out[120]:
27 No
888 No
Name: BsmtExposure, dtype: object
In [121]:
#对于BsmtQual,找到其余特征都不缺失,而该特征缺失的样本 ,对该样本填充BsmtCond的众数
index_row = df_test[(~df_test['BsmtCond'].isnull())&(~df_test['BsmtExposure'].isnull())&(df_test['BsmtQual'].isnull())&(~df_test['BsmtFinType2'].isnull())&\
(~df_test['BsmtFinType1'].isnull())].index.tolist()
index_row
Out[121]:
[757, 758]
In [122]:
freq = df_full['BsmtQual'].value_counts().sort_values(ascending=False).index[0]
freq
Out[122]:
'TA'
In [123]:
df_test.loc[index_row,'BsmtQual'] = df_test.loc[index_row,'BsmtQual'].fillna(freq)
df_test.loc[index_row,'BsmtQual']
Out[123]:
757 TA
758 TA
Name: BsmtQual, dtype: object
- GarageCond,GarageFinish、GarageQual、GarageType的确实值表示No Garage。
Train数据中:
- train数据中,GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt同时缺失的样本条数也是81。各自缺失值的样本数也是81。说明没有车库就没有车库的建立年份。合理。
- 前四个字段为类别型变量,缺失值可填充'None',GarageYrBlt是年份,float64的数据类型,缺失值可填充0。
test数据中:
- test数据中,GarageCond,GarageFinish、GarageQual、GarageType和GarageYrBlt同时缺失的样本条数是76,这些特征中前四个特征填充None,GarageYrBlt填充0。只有GarageType的缺失值是76,其余特征的缺失值均为78。
- 其余四个特征都为NaN,而GarageType不为NaN的这两条样本,一种思路是:可对这两条样本的GarageCond,GarageFinish、GarageQual和GarageYrBlt的缺失值进行填充,策略是在GarageType为Detchd的前提下,各个特征填充该特征的众数,因此GarageCond,GarageFinish、GarageQual和GarageYrBlt分别填充的是'TA'、'Unf'、'TA'、1950.0。另一种思路:我们认为由于只有GarageType不为NaN,我们可推测这两条样本的GarageType特征为异常值,将数据设置为“None”。这里我们采取第二种思路。
In [124]:
index_row = df_train[(df_train['GarageCond'].isnull())&(df_train['GarageFinish'].isnull())&(df_train['GarageQual'].isnull())&(df_train['GarageType'].isnull())\
&(df_train['GarageYrBlt'].isnull())][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']].index.tolist()
len(index_row)
Out[124]:
81
In [125]:
df_train.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType']] = df_train.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType']].fillna('None')
In [126]:
df_train.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType']].head()
Out[126]:
GarageCond | GarageFinish | GarageQual | GarageType | |
---|---|---|---|---|
39 | None | None | None | None |
48 | None | None | None | None |
78 | None | None | None | None |
88 | None | None | None | None |
89 | None | None | None | None |
In [127]:
df_train.loc[index_row,'GarageYrBlt'] = df_train.loc[index_row,'GarageYrBlt'].fillna(0)
In [128]:
df_train.loc[index_row,'GarageYrBlt'].head()
Out[128]:
39 0.0
48 0.0
78 0.0
88 0.0
89 0.0
Name: GarageYrBlt, dtype: float64
In [129]:
#test数据
index_row = df_test[(df_test['GarageCond'].isnull())&(df_test['GarageFinish'].isnull())&(df_test['GarageQual'].isnull())&(df_test['GarageType'].isnull())\
&(df_test['GarageYrBlt'].isnull())][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']].index.tolist()
In [130]:
len(index_row)
Out[130]:
76
In [131]:
df_test.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType']] = df_test.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType']].fillna('None')
In [132]:
df_test.loc[index_row,'GarageYrBlt'] = df_test.loc[index_row,'GarageYrBlt'].fillna(0)
In [133]:
df_test.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']].head()
Out[133]:
GarageCond | GarageFinish | GarageQual | GarageType | GarageYrBlt | |
---|---|---|---|---|---|
53 | None | None | None | None | 0.0 |
71 | None | None | None | None | 0.0 |
79 | None | None | None | None | 0.0 |
92 | None | None | None | None | 0.0 |
96 | None | None | None | None | 0.0 |
In [134]:
#GarageType有两条数据需要再分析一下,找到其余四个特征都为NaN,而GarageType不为NaN的这两条数据
index_row = df_test[(df_test['GarageCond'].isnull())&(df_test['GarageFinish'].isnull())&(df_test['GarageQual'].isnull())&\
(df_test['GarageYrBlt'].isnull())&(~df_test['GarageType'].isnull())][['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']].index.tolist()
In [135]:
index_row
Out[135]:
[666, 1116]
In [136]:
#确实的样本填充“None”,且GarageType特征也设置为“None”
df_test.loc[index_row,['GarageCond','GarageFinish','GarageQual']] = df_test.loc[index_row,['GarageCond','GarageFinish','GarageQual']].fillna('None')
In [137]:
df_test.loc[index_row,['GarageType']] = 'None'
In [138]:
df_test.loc[index_row,'GarageYrBlt'] = df_test.loc[index_row,'GarageYrBlt'].fillna(0)
In [139]:
df_test.loc[index_row,['GarageCond','GarageFinish','GarageQual','GarageType','GarageYrBlt']]
Out[139]:
GarageCond | GarageFinish | GarageQual | GarageType | GarageYrBlt | |
---|---|---|---|---|---|
666 | None | None | None | None | 0.0 |
1116 | None | None | None | None | 0.0 |
MSZoning: 描述地区分类,缺失值可填充该条样本的邻居的MSZoning的众数,即填充'RM'。
Functional特征在train数据中没有缺失值,缺失值都存在test数据中,可对缺失值数据填充众数,即填充'Typ'。
Exterior1st、KitchenQual、Exterior2nd均为类别型,且只有test中有一条缺失值样本,可对该样本填充该特征的众数。 Electrical类别型,且只有train中有一条缺失值样本,可对该样本填充该特征的众数。
SaleType、Electrical均为类别型特征,填充众数即可。
In [140]:
df_full = pd.concat([df_train, df_test])
In [141]:
df_full.shape
Out[141]:
(2917, 77)
In [142]:
df_full[df_full['MSZoning'].isnull()]['Neighborhood']
Out[142]:
455 IDOTRR
756 IDOTRR
790 IDOTRR
1444 Mitchel
Name: Neighborhood, dtype: object
In [143]:
freq = df_full[df_full['Neighborhood'] == 'IDOTRR']['MSZoning'].value_counts().sort_values(ascending=False).index[0]
freq
Out[143]:
'RM'
In [144]:
df_full['MSZoning'] = df_full['MSZoning'].fillna(freq)
In [145]:
df_full['Functional'] = df_full['Functional'].fillna('Typ')
In [146]:
freq1 = df_full['Exterior1st'].value_counts().sort_values(ascending=False).index[0]
freq2 = df_full['KitchenQual'].value_counts().sort_values(ascending=False).index[0]
freq3 = df_full['Exterior2nd'].value_counts().sort_values(ascending=False).index[0]
freq4 = df_full['SaleType'].value_counts().sort_values(ascending=False).index[0]
freq5 = df_full['Electrical'].value_counts().sort_values(ascending=False).index[0]
df_full['Exterior1st'] = df_full['Exterior1st'].fillna(freq1)
df_full['KitchenQual'] = df_full['KitchenQual'].fillna(freq2)
df_full['Exterior2nd'] = df_full['Exterior2nd'].fillna(freq3)
df_full['SaleType'] = df_full['SaleType'].fillna(freq4)
df_full['Electrical'] = df_full['Electrical'].fillna(freq5)
In [147]:
#观察数据的缺失值情况
total = df_full.isnull().sum().sort_values(ascending=False).head(5) #head(30)
percent = ((df_full.isnull().sum()/len(df_full))*100).sort_values(ascending=False).head(5)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
Out[147]:
Total | Percent | |
---|---|---|
training_set | 0 | 0.0 |
ExterCond | 0 | 0.0 |
RoofStyle | 0 | 0.0 |
RoofMatl | 0 | 0.0 |
Exterior1st | 0 | 0.0 |
备份数据
In [148]:
df_full.drop('Id', axis=1, inplace=True)
In [149]:
df_full[df_full['training_set'] == True].shape, target.shape
Out[149]:
((1458, 76), (1458,))
In [150]:
df_full.to_csv('cleaned_data/data_full_01.csv', index=False) #备份数据
In [151]:
target.to_csv('cleaned_data/data_target_01.csv', index=False) #备份数据target
In [152]:
pd.read_csv('cleaned_data/data_target_01.csv').head()
Out[152]:
12.24769911637256 | |
---|---|
0 | 12.109016 |
1 | 12.317171 |
2 | 11.849405 |
3 | 12.429220 |
4 | 11.870607 |
In [153]:
pd.read_csv('data_full_01.csv').head()
Out[153]:
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | LotConfig | LandSlope | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | training_set | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60 | RL | 65.0 | 8450 | Pave | None | Reg | Lvl | Inside | Gtl | ... | 0 | None | None | None | 0 | 2 | 2008 | WD | Normal | True |
1 | 20 | RL | 80.0 | 9600 | Pave | None | Reg | Lvl | FR2 | Gtl | ... | 0 | None | None | None | 0 | 5 | 2007 | WD | Normal | True |
2 | 60 | RL | 68.0 | 11250 | Pave | None | IR1 | Lvl | Inside | Gtl | ... | 0 | None | None | None | 0 | 9 | 2008 | WD | Normal | True |
3 | 70 | RL | 60.0 | 9550 | Pave | None | IR1 | Lvl | Corner | Gtl | ... | 0 | None | None | None | 0 | 2 | 2006 | WD | Abnorml | True |
4 | 60 | RL | 84.0 | 14260 | Pave | None | IR1 | Lvl | FR2 | Gtl | ... | 0 | None | None | None | 0 | 12 | 2008 | WD | Normal | True |
5 rows × 76 columns
以上完成了所有缺失值的填充
3.特征处理
3.1 数值型数据处理
In [154]:
#归一化处理
dtypes_df = pd.DataFrame(df_full.dtypes.sort_values())
dtypes_df = dtypes_df.reset_index()
dtypes_df.columns = ['col','type']
dtypes_df.groupby('type').size()
Out[154]:
type
bool 1
int64 23
float64 9
object 43
dtype: int64
In [155]:
numeric_cols = list(df_full.select_dtypes(include=['int64', 'float64']).columns.values) #不包括bool类型
numeric_col_means = df_full.loc[:, numeric_cols].mean()
numeric_col_std = df_full.loc[:, numeric_cols].std()
df_full.loc[:, numeric_cols] = (df_full.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
3.2 类别型数据处理
In [156]:
all_dummy_df = pd.get_dummies(df_full)
all_dummy_df.head()
Out[156]:
LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.209167 | -0.216400 | 0.649468 | -0.507416 | 1.047051 | 0.897548 | 0.533996 | 0.601411 | -0.293084 | -0.933602 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0.498105 | -0.069097 | -0.061413 | 2.186999 | 0.155579 | -0.394797 | -0.566927 | 1.213775 | -0.293084 | -0.628848 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | -0.067713 | 0.142251 | 0.649468 | -0.507416 | 0.981016 | 0.849683 | 0.343019 | 0.106117 | -0.293084 | -0.287705 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | -0.444924 | -0.075501 | 0.649468 | -0.507416 | -1.858487 | -0.681985 | -0.566927 | -0.501744 | -0.293084 | -0.046631 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 0.686711 | 0.527801 | 1.360350 | -0.507416 | 0.947999 | 0.753954 | 1.399006 | 0.486593 | -0.293084 | -0.160345 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 312 columns
4.建立模型
4.1 准备数据
In [157]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, mean_squared_error
In [158]:
df_train = all_dummy_df[all_dummy_df['training_set']==True]
df_train = df_train.drop('training_set', axis=1)
df_test = all_dummy_df[all_dummy_df['training_set']==False]
df_test = df_test.drop('training_set', axis=1)
In [159]:
(df_train.shape, df_test.shape)
Out[159]:
((1458, 311), (1459, 311))
In [160]:
X_train, X_test, y_train, y_test = train_test_split(df_train, target, random_state=1, train_size = 0.7)
4.2随机森林
In [161]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
In [245]:
param_grid = {'n_estimators':[90, 100, 120],
'min_samples_leaf':[1, 2, 3],
'max_depth':[13,15,17]}
scorer = make_scorer(mean_squared_error)
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
param_grid,
scoring = scorer,
cv = 10)
grid_search.fit(X_train, y_train)
rf_opt = grid_search.best_estimator_
In [246]:
rf_opt
Out[246]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=17,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=1,
oob_score=False, random_state=42, verbose=0, warm_start=False)
In [247]:
rf_y_pred = rf_opt.predict(X_test)
4.3XGBoost
由于XGBoost用网格搜索交叉验证的方式在云平台上和GPU上都跑不出来,因此根据经验设置超参数。
In [167]:
import xgboost as xgb
xgb_opt = xgb.XGBRegressor(
learning_rate = 0.05,
max_depth = 5,
min_child_weight = 1.5,
n_estimators = 7500,
seed = 42,
silent = 1)
In [168]:
xgb_opt.fit(X_train, y_train)
xgb_y_pred = xgb_opt.predict(X_test) #XGBoost predictions
In [ ]:
# parameters_xgb = {'n_estimators':[7000,7500,8000],
# 'learning_rate':[0.04,0.05,0.07],
# 'max_depth':[3,5,7],
# 'min_child_weight':[1,1.5,2]}
# #scorer_xgb = make_scorer(mean_squared_error)
# grid_search_xgb = GridSearchCV(xgb_regressor,
# parameters_xgb,
# scoring = scorer,
# cv = 10)
# grid_search_xgb.fit(X_train, y_train)
# xgb_opt = grid_search_xgb.best_estimator_
In [ ]:
# xgb_y_pred = xgb_opt.predict(X_test)
4.4Lasso
In [169]:
from sklearn.linear_model import Lasso
lasso_regr = Lasso(random_state=42)
In [172]:
parameters_lasso = {'alpha':[0.0005, 0.001, 0.002],
'max_iter':[40000,50000,60000]}
grid_search_lasso = GridSearchCV(lasso_regr,
parameters_lasso,
scoring = scorer,
cv = 5)
grid_search_lasso.fit(X_train, y_train)
lasso_opt = grid_search_lasso.best_estimator_
In [173]:
lasso_opt
Out[173]:
Lasso(alpha=0.002, copy_X=True, fit_intercept=True, max_iter=40000,
normalize=False, positive=False, precompute=False, random_state=42,
selection='cyclic', tol=0.0001, warm_start=False)
In [174]:
lasso_y_pred = lasso_opt.predict(X_test)
5.模型融合并预测
In [175]:
optrf_mse = mean_squared_error(y_test, rf_y_pred)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
lasso_mse = mean_squared_error(y_test, lasso_y_pred)
In [176]:
score_df = pd.DataFrame([optrf_mse,xgb_mse,lasso_mse])
score_df.index = ['rf','xgb','lasso']
score_df.T
Out[176]:
rf | xgb | lasso | |
---|---|---|---|
0 | 0.020909 | 0.016095 | 0.015779 |
In [177]:
y_pred_rf = np.expm1(rf_opt.predict(df_test))
y_pred_xgb = np.expm1(xgb_opt.predict(df_test))
y_pred_lasso = np.expm1(lasso_opt.predict(df_test))
y_predict = 0.3*y_pred_rf + 0.35*y_pred_xgb + 0.35*y_pred_lasso
6.输出预测结果
In [178]:
submission_df = pd.DataFrame(data= {'Id' : test.Id, 'SalePrice': y_predict})
submission_df.to_csv('output/submission02.csv', index=False)
7.总结
- 第一次对数据的处理思路非常暴力,即,首先将缺失值大于50%的特征全部删除,其次对数值型特征的缺失值统一填充中位数,对类别型特征的缺失值统一填充“None”。这种处理方式没有从特征的本身含义出发,不合理。
- 第二次对数据的处理思路:
- 从字段本身的含义出发,仔细阅读特征的解释说明文档;
- 通过相关性分析得出特征之间的相关程度以及重要特征与预测值之间的相关程度;
- 对所有有缺失值的特征一一进行分析,对描述内容相似的有缺失值特征统一分析,详见2.3.2缺失值分析。
8.优化方向
- 1.由于赛题中特征较多,且有若干特征共同描述同一事物的情况,例如好几个特种同时描述车库,也有好几个字段同时描述地下室,则可以尝试通过分析这些特征之间以及与价格之间的关系,找出对价格预测影响较小的特征,并将这些特征删除,达到降维的目的。
- 2.特征工程目前只考虑了数值型特征归一化和类别型特征get_dummies,可以继续尝试其他的特征工程处理。
- 3.本赛题我用到了随机森林、XGBoost和Lasso三个模型进行融合,并根据训练集的评估结果给各个模型赋值不同的权重,可以再在模型的选择和融合上优化,或者转换思路,考虑使用stacking的方式。
9.项目参考
Tuned Random Forest, Lasso and XGBoost Regressors
Stacked Regressions : Top 4% on LeaderBoard
Comprehensive data exploration with Python
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
https://blog.csdn.net/Irving_zhang/article/details/78561105
https://blog.csdn.net/qq_27668313/article/details/79034484
https://blog.csdn.net/masbbx123/article/details/79315107
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
https://blog.csdn.net/aliceyangxi1987/article/details/71079448
http://www.sohu.com/a/115799487_500658