Kaggle House Prices: Advanced Regression Techniques (house price prediction)

Hogumunn

Category: kaggle. Tags: kaggle, machine learning, data mining, python, algorithms

Article link: https://blog.csdn.net/qq_43483539/article/details/103916470

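House price prediction was the first Kaggle competition I entered as a beginner. I studied and followed this excellent tutorial by another author: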
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

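Dataset overview

Import the required Python packages.
If a package is missing, install it with pip from the command line, or install it through PyCharm.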

import matplotlib
import numpy as np
import pandas as pd
# The %matplotlib magic is only needed in Jupyter Notebook or Jupyter QtConsole.
# It makes figures created with matplotlib.pyplot render directly in the notebook or console.
# %matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn     # silence warnings (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew   # for some statistics
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))      # limit float output to 3 decimal places
from subprocess import check_output
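
Replacing warnings.warn globally, as above, silences every warning for the rest of the session. As a hedged alternative, the standard library can suppress warnings only around a specific call. The sketch below assumes a hypothetical helper noisy() that stands in for a library call emitting a warning; it is not part of the original tutorial.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this is only a demo warning", UserWarning)
    return 42


# Suppress warnings only inside this block.
# The previous warning filters are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

print(result)
```

Scoping the suppression this way keeps unrelated warnings visible elsewhere in the notebook.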

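Read the CSV files:

I downloaded the data files from the official Kaggle site. When I tried to register from mainland China the CAPTCHA would not load, so I needed a VPN to create the account.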

train = pd.read_csv('datasets/train.csv')
test = pd.read_csv('datasets/test.csv')

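Check the size of the training and test sets:

print("Size of the original training set: {}".format(train.shape))
print("Size of the original test set: {}".format(test.shape))

# Save the Id column, then drop it from both sets
train_ID = train['Id']
test_ID = test['Id']

train.drop("Id", axis=1, inplace=True)
test.drop("Id", axis=1, inplace=True)

print("Size of the training set after dropping Id: {}".format(train.shape))
print("Size of the test set after dropping Id: {}".format(test.shape))

Size of the original training set: (1460, 81)
Size of the original test set: (1459, 80)
Size of the training set after dropping Id: (1460, 80)
Size of the test set after dropping Id: (1459, 79)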

Feature engineering

Outlier handling

A scatter plot shows at a glance whether a feature has outliers. Here we take GrLivArea as an example.

fig,ax = plt.subplots()
ax.scatter(x=train['GrLivArea'],y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

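[Figure: scatter plot of SalePrice against GrLivArea]
The two points at the bottom right of the plot have a very large GrLivArea but an unusually low SalePrice. We have good reason to treat them as outliers and remove them.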

# Remove the outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

# Check the plot again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

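[Figure: scatter plot of SalePrice against GrLivArea after removing the outliers]
Removing outliers is not always safe.
We cannot, and need not, remove every outlier, because the test set will still contain some.
A model trained on data with some noise tends to be more robust, and so may perform better on the test set.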
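
The filtering rule used above can be tried on a small example. This is a minimal sketch on made-up numbers (not the Kaggle data); the thresholds 4000 and 300000 are the ones from the code above, while the toy rows and the names is_outlier and cleaned are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration; not the Kaggle training set.
df = pd.DataFrame({
    "GrLivArea": [1500, 2000, 4700, 5600, 4500],
    "SalePrice": [180000, 250000, 184000, 160000, 745000],
})

# Flag rows that are both very large and unusually cheap.
is_outlier = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)

# Keep everything else. The large but expensive house (last row) is not flagged.
cleaned = df.loc[~is_outlier].reset_index(drop=True)

print(int(is_outlier.sum()), len(cleaned))
```

Combining both conditions keeps large, expensive houses in the data, which fits the point that not every extreme value should be dropped.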

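Target variable analysis

SalePrice is the target we want to predict, so it is worth analyzing and transforming.

We plot the distribution of SalePrice and its QQ plot (quantile-quantile plot).
A QQ plot is a scatter plot with the quantiles of the standard normal distribution on the x-axis and the sample values on the y-axis.
If the points lie close to a straight line, the data are approximately normally distributed; the slope of that line is the standard deviation and its intercept is the mean. For a detailed introduction to QQ plots, see: https://blog.csdn.net/hzwwpgmwy/article/details/79178485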

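# Target variable analysis

# Fit a scipy.stats distribution and draw the estimated PDF (probability density function) over the data.
# Note: distplot has been deprecated since seaborn 0.11; it is kept here to match the tutorial.
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
# Raw string, so that \m and \s are not read as escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')  # legend position
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# QQ plot: standard normal quantiles on the x-axis, sample values on the y-axis.
# Points close to a straight line indicate approximately normal data.
fig = plt.figure()
# plt exposes plot() and text() methods, so it can be passed as the plot argument
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()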

[Figure: SalePrice distribution with fitted normal curve, and QQ plot]

mu = 180932.92 and sigma = 79467.79
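
The claim that the slope and intercept of the QQ line correspond to the standard deviation and the mean can be checked numerically. The sketch below is an illustration on synthetic data: the sample is drawn from a normal distribution with an assumed mean of 5.0 and standard deviation of 2.0, and stats.probplot is called without a plot argument so that it only returns the fitted line.

```python
import numpy as np
from scipy import stats

# Synthetic sample with a known mean and standard deviation (assumed values).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=5000)

# probplot returns the ordered data and a least-squares fit of the QQ line:
# (theoretical quantiles, ordered sample values), (slope, intercept, r)
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print("slope     (close to the standard deviation):", slope)
print("intercept (close to the mean):              ", intercept)
print("r         (close to 1 for normal data):     ", r)
```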

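The distribution of SalePrice is positively skewed (right-skewed), whereas linear regression models generally work best when the target, and strictly speaking the residuals, are close to normally distributed.
We apply a log transform to bring the data closer to a normal distribution.
There are two main types of variable transformation: simple function transforms and standardization.
The simple function transforms most often used to make data more Gaussian are the square root, the logarithm, and the reciprocal.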

# Apply a log transform to bring the data closer to a normal distribution.
# We use the numpy function log1p, which applies log(1 + x) to every element of the column.
train["SalePrice"] = np.log1p(train["SalePrice"])

# Check the new distribution (distplot has been deprecated since seaborn 0.11)
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution (raw string, so that \m and \s are not read as escape sequences)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Also get the QQ plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

[Figure: distribution and QQ plot of log-transformed SalePrice]

mu = 12.02 and sigma = 0.40
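
To see what the log transform does to skewness, here is a small sketch on synthetic data. The prices are drawn from a log-normal distribution whose parameters (12.0 and 0.4) are assumptions chosen to resemble the fitted values above; they are not the Kaggle data. The sketch also shows np.expm1, the inverse of np.log1p, which would be needed to turn predictions back into prices.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (assumed log-normal parameters).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

# log1p computes log(1 + x) element-wise.
log_prices = np.log1p(prices)

print("skewness before the transform:", skew(prices))
print("skewness after the transform: ", skew(log_prices))

# expm1 computes exp(x) - 1 and undoes log1p.
restored = np.expm1(log_prices)
print("round trip recovers the prices:", np.allclose(restored, prices))
```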

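Normally distributed data has many convenient properties that help the subsequent model training.
In addition, the competition scores submissions by the error between the logarithm of the predicted price and the logarithm of the actual price, so we should use the same criterion when evaluating locally.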
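
A local version of that criterion can be sketched as a root mean squared error computed on log-transformed prices. The function name rmse and the example prices below are assumptions for illustration. Because SalePrice was transformed with log1p above, the sketch applies the same transform to both arrays; for prices of this magnitude, log1p and log give almost identical scores.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up prices, for illustration only.
actual_price = np.array([200000.0, 150000.0, 320000.0])
predicted_price = np.array([210000.0, 140000.0, 300000.0])

# Score on the log scale, matching the transformed training target.
score = rmse(np.log1p(actual_price), np.log1p(predicted_price))
print("RMSE on log prices:", score)
```

If a model is trained on the log1p-transformed target, its raw predictions are already on the log scale, so rmse can be applied to them directly.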

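Feature correlation

A correlation matrix heatmap shows how strongly each feature correlates with the target and with every other feature, which helps guide feature processing.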

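# Correlation heatmap: inspect the relationship between the features and the sale price
# Only numeric columns are used; recent pandas versions raise an error on string columns in corr()
corrmat = train.select_dtypes(include=[np.number]).corr()      # correlation matrix of the training set
plt.subplots(figsize=(12,9))
# Cap the color scale at 0.9 and draw every cell as a square
sns.heatmap(corrmat, vmax=0.9, square=True)
plt.show()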

[Figure: correlation matrix heatmap of the training set]
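
Besides the heatmap, one way to read the correlation matrix is to rank features by their absolute correlation with the target. The sketch below uses a small synthetic frame (the column values and the names toy and ranking are assumptions, not the Kaggle data) and also shows why non-numeric columns are filtered out first.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: price depends on area, "Noise" is unrelated.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "SalePrice": 100 * area + rng.normal(0, 20000, n),
    "Noise": rng.normal(0, 1, n),
    "Street": ["Pave"] * n,  # non-numeric column, excluded from the correlation
})

# Correlation matrix over the numeric columns only.
corrmat = toy.select_dtypes(include=[np.number]).corr()

# Rank the features by absolute correlation with the target.
ranking = (
    corrmat["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```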

Missing value handling

# First, combine the training and test sets
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)      # concatenate the two sets and reset the index
all_data.drop(['SalePrice'], axis=1, inplace=True)
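
The combine step can be illustrated on toy frames. The sketch below is an assumption about how a missing-value overview might look, not the author's continuation: it concatenates two made-up frames, drops the target, computes the percentage of missing values per column, and then splits the rows back using ntrain.

```python
import numpy as np
import pandas as pd

# Toy frames, made up for illustration; the column names mimic the dataset.
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, "Grvl", np.nan],
    "SalePrice": [12.2, 12.1, 12.3],
})
test = pd.DataFrame({
    "LotArea": [11622, np.nan],
    "Alley": [np.nan, np.nan],
})

ntrain = train.shape[0]
y_train = train["SalePrice"].values

# Stack both sets so that any later imputation treats them identically.
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data = all_data.drop(columns=["SalePrice"])

# Percentage of missing values per column, largest first.
missing_ratio = (all_data.isnull().mean() * 100).sort_values(ascending=False)
missing_ratio = missing_ratio[missing_ratio > 0]
print(missing_ratio)

# The saved row count lets us split the combined frame back apart.
train_part = all_data[:ntrain]
test_part = all_data[ntrain:]
```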
