Python House Price Prediction: Kaggle Getting-Started Competition on House Prices (Data Analysis Part)

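The project shared here comes from a classic Kaggle competition: house price prediction. It is presented in two parts, data analysis and data mining. This article is the data analysis part.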

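Understanding the competition

Competition overview

Many factors influence house prices. The dataset for this competition has 79 variables that describe almost every aspect of residential homes in Ames, Iowa, and the task is to predict each home's final sale price.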

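Skills practiced

Feature engineering (creative feature engineering)

Regression models (advanced regression techniques such as random forest and gradient boosting)

Goal

Predict the sale price of each house: for every Id in the test set, give the corresponding value of the SalePrice variable.

Submission format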

Id,SalePrice

1461,169000.1

1462,187724.1233

1463,175221

etc.
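
As a minimal sketch of how a file in this format could be produced with pandas: the Ids and prices below are simply the three sample rows above, and the variable names (`test_ids`, `preds`, `submission`) are illustrative assumptions rather than anything prescribed by the competition.

```python
import pandas as pd

# The three sample rows from the submission format above, used as stand-in predictions.
test_ids = [1461, 1462, 1463]
preds = [169000.1, 187724.1233, 175221.0]

# One row per test-set Id, with exactly the two required columns.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": preds})

# to_csv() without a path returns the CSV text; pass a filename to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps pandas from writing its own row index as an extra first column.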

Data analysis

Data description

First, we load the data and take a look:

import numpy as np
import pandas as pd

# Use the Id column (the first column) as the index
train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()

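We can see that there are 80 columns: 79 features plus the target, SalePrice (Id is used as the index).

Next, we merge the training and test sets. This makes preprocessing more convenient, because the features of both sets are transformed into the same format. Once preprocessing is done, we split them apart again.

SalePrice is our training target, so it appears only in the training set and not in the test set. We therefore need to take this column out before merging. Before taking it out, let's look at its distribution.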

# Compare the raw price with its log1p transform side by side
prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()

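The label itself is skewed rather than roughly normal, so to help the regression model learn more accurately we first transform it to be closer to a normal distribution. Here I use log1p, that is, log(x+1). Note that because we transformed the label in this step, the predictions must be transformed back when computing the final result. The inverse of log1p() is expm1(), which is covered in detail when it is used later.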
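
The round trip between log1p() and expm1() can be checked on a small example. This is a sketch on synthetic data: the right-skewed `prices` array is drawn from a log-normal distribution chosen purely for illustration and is not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (assumption: log-normal, illustration only).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)    # log(x + 1)
restored = np.expm1(log_prices)  # exp(y) - 1, the inverse of log1p

# Skewness before and after the transform (0 means symmetric).
skew_raw = pd.Series(prices).skew()
skew_log = pd.Series(log_prices).skew()
print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")
print("round trip recovers the prices:", np.allclose(restored, prices))
```

The same expm1() call is what turns log-scale model predictions back into dollar amounts.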

Then we take this column out:

y_train = np.log1p(train_df.pop('SalePrice'))

y_train.head()

Id

1 12.247699

2 12.109016

3 12.317171

4 11.849405

5 12.429220

Name: SalePrice, dtype: float64

At this point, y_train holds the log1p-transformed SalePrice column.

Then we merge the two datasets:

df = pd.concat((train_df, test_df), axis=0)

Check the shape:

df.shape

(2919, 79)

df is the merged DataFrame.
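
The merge-then-split pattern can be sketched on toy data. The two small frames below are made up, and `pd.get_dummies` stands in for "some shared preprocessing step"; it is an assumption for illustration, not necessarily the preprocessing used later in this series. The split relies on Id values being unique across the two sets.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up).
train_df = pd.DataFrame(
    {"LotArea": [8450, 9600], "Street": ["Pave", "Pave"]},
    index=pd.Index([1, 2], name="Id"),
)
test_df = pd.DataFrame(
    {"LotArea": [11622], "Street": ["Grvl"]},
    index=pd.Index([1461], name="Id"),
)

# Merge, apply one shared transformation, then split by the original Ids.
df = pd.concat((train_df, test_df), axis=0)
df = pd.get_dummies(df)  # illustrative shared step: one-hot encode categoricals

train_processed = df.loc[train_df.index]
test_processed = df.loc[test_df.index]
print(train_processed.columns.tolist())
```

Because the encoding sees both sets at once, the category "Grvl" gets a column in the training part even though it only occurs in the test part, so both parts end up with identical columns.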

Data preprocessing

According to the description provided by Kaggle, the features and their meanings are as follows:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

MSSubClass: The building class

MSZoning: The general zoning classification

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access

Alley: Type of alley access

LotShape: General shape of property

LandContour: Flatness of the property

Utilities: Type of utilities available

LotConfig: Lot configuration

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling

OverallQual: Overall material and finish quality

OverallCond: Overall condition rating

YearBuilt: Original construction date

YearRemodAdd: Remodel date

RoofStyle: Type of roof

RoofMatl: Roof material

Exterior1st: Exterior covering on house

Exterior2nd: Exterior covering on house (if more than one material)

MasVnrType: Masonry veneer type

MasVnrArea: Masonry veneer area in square feet

ExterQual: Exterior material quality

ExterCond: Present condition of the material on the exterior

Foundation: Type of foundation

BsmtQual: Height of the basement

BsmtCond: General condition of the basement

BsmtExposure: Walkout or garden level basement walls

BsmtFinType1: Quality of basement finished area

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Quality of second finished area (if present)

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

HeatingQC: Heating quality and condition

CentralAir: Central air conditioning

Electrical: Electrical system

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

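BedroomAbvGr (listed as Bedroom in the Kaggle description): Number of bedrooms above basement level

KitchenAbvGr (listed as Kitchen in the Kaggle description): Number of kitchens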

KitchenQual: Kitchen quality

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality rating

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

GarageType: Garage location

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

GarageCond: Garage condition

PavedDrive: Paved driveway

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Fence: Fence quality

MiscFeature: Miscellaneous feature not covered in other categories

MiscVal: Dollar value of miscellaneous feature

MoSold: Month Sold

YrSold: Year Sold

SaleType: Type of sale

SaleCondition: Condition of sale

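Next, we analyze the features. The list above contains one target variable, SalePrice, and 79 features. That is a lot, so this analysis step prepares for the feature engineering that follows.

Let's check which features have missing values: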

print(pd.isnull(df).sum())

This output is not convenient to read, so we first look at the 10 features with the most missing values:

df.isnull().sum().sort_values(ascending=False).head(10)

To show this more clearly, we use the missing ratio to examine the missing data:

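# Percentage of missing values per feature
df_na = (df.isnull().sum() / len(df)) * 100
# Drop features with no missing values, then sort from most to least missing
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing ratio (%)': df_na})
missing_data.head(10)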

Now we visualize it:

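import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of the missing ratio for each feature that has missing values
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # pass a number; newer matplotlib versions may reject the string '90'
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)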

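We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have a large share of missing values, LotFrontage has a 16.7% missing ratio, and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing ratios. Some of these features are categorical and some are numerical. How to handle their missing values will be covered in the feature engineering part.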
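
Since the handling will differ for categorical and numerical features, it can help to split the features with missing values by type. The following is a rough sketch on a made-up four-row frame; the column names are borrowed from the feature list, but the values and the helper variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: one categorical and one numerical column with gaps.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Missing ratio in percent, keeping only features that have missing values.
missing_pct = df.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

# Split the affected features into numerical and categorical groups.
numerical_missing = [c for c in missing_pct.index
                     if pd.api.types.is_numeric_dtype(df[c])]
categorical_missing = [c for c in missing_pct.index
                       if c not in numerical_missing]

print(missing_pct)
print("categorical:", categorical_missing)
print("numerical:", numerical_missing)
```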

Finally, we analyze the correlations between the features and look at the heatmap:

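# Pairwise correlations between the numeric features (numeric_only requires pandas 1.5+)
corrmat = train_df.corr(numeric_only=True)
plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=0.9, square=True)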

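We can see that some features are highly correlated with each other. Such redundancy can contribute to overfitting, so some of these features should be removed. In the next article, the data mining part, we process these features and train the model.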
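
Reading strongly correlated pairs off a heatmap by eye is error-prone, so one option is to list them directly from the correlation matrix. This is a sketch on synthetic data: the toy frame reuses two real feature names, but its values are generated, and the 0.8 threshold is an arbitrary assumption rather than a recommended cutoff.

```python
import numpy as np
import pandas as pd

# Synthetic data: GarageArea is built to track GarageCars, LotArea is independent.
rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 30, size=200),
    "LotArea": rng.normal(10000, 2000, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle (above the diagonal) so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

threshold = 0.8  # assumed cutoff, for illustration only
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Which feature of a correlated pair to drop, if any, is a modeling decision left to the feature engineering step.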

If you spot any shortcomings, corrections are welcome.

Source: segmentfault; author: 秋刀鱼.
