Kaggle Competition Tutorial for Beginners (House Price Prediction Case Study: Part 1)
A Full Walkthrough of Kaggle House Price Prediction
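For newcomers to machine learning, a key question is how to build coding skills and an understanding of the end-to-end workflow quickly through hands-on practice. Kaggle provides the kind of community data scientists need, and it hosts competitions of many types for AI enthusiasts (for example data science, image classification, image recognition, natural language processing, and vulnerability detection).
The Kaggle community is global and brings together a large number of excellent AI researchers and data analysts who share practical experience and code. It also offers introductory tutorials, which makes it very friendly to beginners.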
Competition Links and Background
- Kaggle official site: https://www.kaggle.com
- House price prediction competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
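House prices are a familiar topic in everyday life, and buying a home in a big city has become perhaps the biggest worry for working people (one I will soon face myself...). In Ames, Iowa, in the United States, many factors influence a house's final price, such as living area, basement, bathrooms, and garage.
Kaggle provides about 80 feature variables that may affect the price and asks data scientists to predict it using machine learning and related tools. In other words, this case is a straightforward regression problem.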
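Because this is a regression task, the quality of a model comes down to one error number over the predicted prices. As far as I know, the competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price. The sketch below computes that quantity on made-up numbers; the helper name `rmse_log` and the sample prices are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) prices (hypothetical helper)."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    # Mean of squared log differences, then the square root.
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))

actual = [200000, 150000, 320000]      # made-up sale prices
predicted = [210000, 140000, 300000]   # made-up predictions
print(rmse_log(actual, predicted))
```

Working on the log scale means a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.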
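The official feature descriptions are listed below for reference. The original description file, like the dataset itself, can be downloaded with the download button on the Kaggle competition page.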
- SalePrice: The property's sale price in dollars. This is the target variable to predict
- MSSubClass: Identifies the type of dwelling involved in the sale
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month sold
- YrSold: Year sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
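The feature list mixes numeric measurements (areas, counts, years) with categorical labels (zoning, materials, quality grades), and the two groups usually need different preprocessing. Here is a minimal sketch of separating them by dtype, using a tiny hand-made frame with a few of the listed columns. The values are invented for illustration; on the real data the same calls would be applied to the loaded training set.

```python
import pandas as pd

# Toy frame with a handful of the listed columns; all values are invented.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "MSZoning": ["RL", "RL", "RM"],
    "LotArea": [8000, 9500, 12000],
    "Street": ["Pave", "Pave", "Grvl"],
    "OverallQual": [7, 6, 8],
    "SalePrice": [180000, 165000, 240000],
})

# Split columns into numeric and non-numeric groups, keeping column order.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that `MSSubClass` lands in the numeric group even though it encodes a dwelling type, so a dtype split is only a first pass and should be checked against the descriptions.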
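The next step is to explore these features and build models to predict the sale price. The code walkthrough below follows the overall workflow.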
Competition Code Walkthrough
Importing Libraries
import numpy as np  # core array and matrix computation
import pandas as pd  # tabular data loading and analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting built on matplotlib
from datetime import datetime  # timing
from scipy.stats import skew  # skewness
from scipy.special import boxcox1p  # Box-Cox transform of 1 + x
from scipy.stats import boxcox_normmax  # estimates the optimal Box-Cox lambda
from sklearn.linear_model import LinearRegression, ElasticNetCV, LassoCV, RidgeCV  # linear models
from sklearn.ensemble import GradientBoostingRegressor  # GBDT model
from sklearn.svm import SVR  # SVR model
from sklearn.pipeline import make_pipeline  # builds a Pipeline
from sklearn.preprocessing import RobustScaler  # robust scaling for data with many outliers
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score, GridSearchCV  # K-fold splitting and cross-validation
from sklearn.metrics import mean_squared_error  # mean squared error (take the square root for RMSE)
from mlxtend.regressor import StackingCVRegressor  # stacking regressor with cross-validation
from xgboost import XGBRegressor  # XGBoost model
from lightgbm import LGBMRegressor  # LightGBM model
import warnings  # warning control
import os  # file-system utilities
warnings.filterwarnings('ignore')  # suppress warnings
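The `RobustScaler` import deserves a quick illustration. With default settings it centres each feature on its median and divides by the interquartile range, so a single extreme value barely changes how the remaining values are scaled. This is a minimal sketch on made-up numbers; the sample array is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an obvious outlier (made-up values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Same arithmetic by hand: (x - median) / (75th percentile - 25th percentile)
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
manual = (x - median) / (q75 - q25)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The four ordinary values end up between -1 and 0.5, while the outlier stays far away instead of squashing everything else toward zero.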
Loading the Data
# Root directory: set this to the local folder containing the downloaded files
DATA_ROOT = 'D:/Kaggle比赛/房价回归预测/'
print(os.listdir(DATA_ROOT))
['data_description.txt', 'House_price_submission.csv', 'sample_submission.csv', 'test.csv', 'test_results.csv', 'train.csv', '数据描述中文介绍.txt']
# Load the training set, the test set, and the sample submission
train = pd.read_csv(os.path.join(DATA_ROOT, 'train.csv'))
test = pd.read_csv(os.path.join(DATA_ROOT, 'test.csv'))
sub = pd.read_csv(os.path.join(DATA_ROOT, 'sample_submission.csv'))
# Print the dimensions of each dataset
print("Train set size:", train.shape)
print("Test set size:", test.shape)
Output:
Train set size: (1460, 81)
Test set size: (1459, 80)
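The training set has one more column than the test set because the test set does not contain the target, `SalePrice`. A quick way to confirm which columns differ is sketched below on two tiny stand-in frames. The helper `extra_columns` is hypothetical and the values are made up; with the real files the loaded training and test frames would be passed instead.

```python
import pandas as pd

def extra_columns(left, right):
    """Columns present in `left` but absent from `right` (hypothetical helper)."""
    return [col for col in left.columns if col not in right.columns]

# Stand-in frames with invented values.
train_toy = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8000, 9500],
    "SalePrice": [180000, 210000],
})
test_toy = pd.DataFrame({
    "Id": [3, 4],
    "LotArea": [7000, 12000],
})

print(extra_columns(train_toy, test_toy))  # prints ['SalePrice']
```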
# Summary of the training set
train.info()  # info() prints directly and returns None, so print() is not needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Id           1460 non-null   int64
 1   MSSubClass   1460 non-null   int64
 2   MSZoning     1460 non-null   object
 3   LotFrontage  1201 non-null   float64
 4   LotArea      1460 non-null   int64
 5   Street       1460 non-null   object
 6   Alley        91 non-null     object
 7   LotShape     1460 non-null   object
 8   LandContour  1460 non-null   object
 9   Utilities    1460 non-null   object
 10  LotConfig    1460 non-null   object
......
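The summary already hints at missing data: `LotFrontage` has 1201 non-null values out of 1460, and `Alley` has only 91. A small sketch of ranking columns by their share of missing values follows. `missing_summary` is a hypothetical helper and the toy frame is invented, so the real training frame would be passed in its place.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and share of missing values per column, worst first (hypothetical helper)."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only columns that actually have gaps, sorted by how many.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Invented values; only the pattern of gaps matters here.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8000, 9500, 12000, 7000],
})
print(missing_summary(toy))
```

Columns with no gaps, such as `LotArea` here, are left out of the result, which keeps the table short on a frame with 81 columns.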
# Summary of the test set
test.info()  # info() prints directly and returns None, so print() is not needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -----------