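Kaggle Project: House Price Prediction
1. Problem Description
Using the historical sales data for houses in Ames, Iowa provided by the project, predict the sale prices of new houses.
This is a regression problem.
The project is scored by root mean squared error (RMSE), computed between the logarithm of the predicted price and the logarithm of the actual price.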
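To make the scoring rule concrete, here is a minimal sketch of the metric as described above: take the natural log of both prices, then compute the RMSE of the differences. The function name `rmse_of_logs` and the sample prices are illustrative assumptions, not part of the competition's tooling.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted price) and log(actual price).

    Hypothetical helper for illustration; assumes all prices are positive.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log(y_pred) - np.log(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up prices: a perfect prediction scores 0, larger values are worse.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]
print(rmse_of_logs(actual, predicted))
```

Because the error is measured on logs, mispredicting a cheap house by 10% costs about as much as mispredicting an expensive house by 10%.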
# Import libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge
from sklearn.kernel_ridge import KernelRidge
from xgboost import XGBRegressor
# Enable Chinese characters in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
2. Data Understanding
2.1 Data Overview
# Load the data
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')
# View the first few rows
train_df.head()
print('Training set shape: %s, test set shape: %s' % (train_df.shape, test_df.shape))
Training set shape: (1460, 81), test set shape: (1459, 80)
# View basic information about the data
train_df.info()
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
# View summary statistics
train_df.describe()
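The non-null counts in the `info()` output show that several columns are mostly empty (for example, PoolQC has only 7 non-null values out of 1460). As a rough sketch, the same information can be summarized as missing counts and ratios. The small `toy_df` below is a made-up stand-in so the snippet runs on its own; with the real data you would presumably apply the same steps to `train_df`.

```python
import pandas as pd

# Made-up stand-in for train_df (values are illustrative only).
toy_df = pd.DataFrame({
    'LotFrontage': [65.0, None, 80.0, None],
    'Alley': pd.Series(['Grvl', None, None, None], dtype=object),
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count missing values per column, keep only columns that have any,
# and sort so the emptiest columns come first.
missing = toy_df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

summary = pd.DataFrame({
    'missing': missing,
    'ratio': missing / len(toy_df),
})
print(summary)
```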
Basic data information
Training set shape: (1460, 81), test set shape: (1459, 80)
79 feature variables (excluding 'Id'); the target variable is 'SalePrice'
Feature variable types: float64(3), int64(33), object(43)
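The feature-type counts differ from the `info()` output (33 integer columns instead of 35) because 'Id' and 'SalePrice' are both int64 and are not counted as features. A minimal sketch of that bookkeeping follows. It uses a made-up `toy_df` so it runs standalone, and the variable names `features` and `dtype_counts` are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for train_df with one column of each kind
# (values are illustrative only).
toy_df = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotFrontage': [65.0, 80.0, 68.0],
    'LotArea': [8450, 9600, 11250],
    'MSZoning': pd.Series(['RL', 'RL', 'RM'], dtype=object),
    'SalePrice': [208500, 181500, 223500],
})

# Drop the identifier and the target; what remains are the features.
features = toy_df.drop(columns=['Id', 'SalePrice'])

# Count how many feature columns have each dtype.
dtype_counts = features.dtypes.astype(str).value_counts().to_dict()
print(len(features.columns), dtype_counts)
```

Applied to the real training set, the same steps should reproduce the 79 features and the float64(3), int64(33), object(43) breakdown listed above.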
Explanation of dataset variables
– SalePrice: the property's sale price in dollars. This is the target variable to predict.